<verbatim>#PBS -j oe -m e -M a.gemmell@physics.gla.ac.uk</verbatim>
Replace a.gemmell@physics.gla.ac.uk with your e-mail address – this enables the batch system to send you an e-mail informing you of the completion (successful or otherwise) of your job.
check out the latest version of the code-running framework from Subversion (check what the latest tag is on trac) using the command
svn co https://ppesvn.physics.gla.ac.uk/svn/atlas/NNFitter/tags/NNFitter-00-00-0X %BR%
check out a version of the GlaNtp code into your home directory (or set up genemflat_batch_Complete2_SL5.sh to point at someone else's installation of the code). The procedure for doing this is described in the next section.
ensure you know the ntuple_area variable to be passed in at run-time to genemflat_batch_Complete2_SL5.sh. This is the directory where the input ntuples are stored (currently /data/atlas07/ahgemmell/NTuple-v15-30Aug). In these ntuples, sensible states, events passing preselection and events failing preselection have my_failEvent == 3, 1 and 0 respectively – note that, by their very nature, sensible states have also passed preselection.
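The my_failEvent convention above can be sketched as follows (a minimal illustration of the 3/1/0 values described in the text; event_category is a hypothetical helper, not part of the actual code):

```python
# Sketch of the my_failEvent convention described above (assumed):
# 3 = sensible state (which, by construction, also passed preselection),
# 1 = passed preselection only, 0 = failed preselection.

def event_category(my_failEvent):
    """Classify an event by its my_failEvent word (hypothetical helper)."""
    categories = {
        3: "sensible state (also passed preselection)",
        1: "passed preselection",
        0: "failed preselection",
    }
    return categories.get(my_failEvent, "other")
```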
The BASEBATCHDIR is now set automatically to the working directory when the script is executed.
Set yourself up for access into SVN (using a proxy to access SVN, as described here)
source /data/ppe01/sl5x/x86_64/grid/glite-ui/latest/external/etc/profile.d/grid-env.sh %BR% svn-grid-proxy-init
Create the directory where you want to set up your copy, and copy in the setup script:
mkdir /home/ahgemmell/GlaNtp %BR% cd /home/ahgemmell/GlaNtp %BR% cp /home/stdenis/GlaNtpScript.sh .
You then need to set up your environment ready for the validation. This is done with the setup_glantp.sh script, which is available within the NNFitter package. (Yes, there is a chicken-and-egg element to needing a script from the package in order to get the package, but that's the way it is – just download this one file and go from there; you can delete it later once you have the whole thing.) You run the script (which is also used for DebuggingTheCode) with
source setup_glantp.sh
Make a directory to hold the code itself:
mkdir GlaNtpPackage
GlaNtpScript.sh not only checks out and compiles the code, it also then goes and validates it. setup_glantp.sh sets up the environment variables so the validation data can be found.
You now run the script, specifying whether you want a specific tag (e.g. 00-00-10), or just from the head of the trunk (h) so you're more free to play around with it. It's always a good idea to check out a specific tag, so that whatever you do to the head, you can still run over a valid release.
./GlaNtpScript.sh SVN 00-00-10 %BR% This will check out everything, and run a few simple validations - the final output should look like this (i.e. don't be worried that not everything seems to have passed validation!):
<verbatim>
HwwFlatFitATLAS Validation succeeded
Done with core tests
Result of UtilBase validation: NOT DONE: NEED
Result of Steer validation: OK
Result of StringStringSet validation: OK
Result of StringIntMap validation: OK
Result of ItemCategoryMap validation: OK
Result of FlatSystematic validation: OK
Result of LJMetValues validation: OK
Result of PhysicsProc validation: OK
Result of FlatNonTriggerableFakeScale validation: OK
Result of FlatProcessInfo validation: OK
Result of PaletteList validation: OK
Result of CutInterface validation: NOT DONE: NEED
Result of NNWeight validation: NOT DONE: NEED
Result of FlatFileMetadata validation: OK
Result of FlatFileMetadataContainer validation: OK
Result of Masks validation: NOT DONE: NEED
Result of FFMetadata validation: OK
Result of RUtil validation: NOT DONE: NEED
Result of HistHolder validation: NOT DONE: NEED
Result of GlaFlatFitCDF validation: OK
Result of GlaFlatFitBigSysTableCDF validation: OK
Result of GlaFlatFitBigSysTableNoScalingCDF validation: OK
Result of GlaFlatFitATLAS validation: OK
Result of FlatTuple validation: OK
Result of FlatReWeight validation: OK
Result of FlatReWeight_global validation: OK
Result of FlatReWeightMVA validation: OK
Result of FlatReWeightMVA_global validation: OK
Result of TreeSpecGenerator validation: OK
Result of FlatAscii validation: OK
Result of FlatAscii_global validation: OK
Result of FlatTRntp validation: OK
</verbatim>
<verbatim>GeneralParameter string 1 FlatTupleVar/<variable_name>=<tree>/<variable_name_in_tree></verbatim>
Also specified are the names of the leaf for the cutmask and the invert word -- these are global values for a file.
<verbatim>GeneralParameter string 1 CutMaskString=cutMask
GeneralParameter string 1 InvertWordString=invertWord</verbatim>
The structure of Computentp's output is specified by
<verbatim>ListParameter EvInfoTree:1 1 NN_BJetWeight_Jet1:NN_BJetWeight_Jet1/NN_BJetWeight_Jet1</verbatim>
If you want a parameter to be found in the output, it is best to list it here. Currently all information is in the EvInfoTree, which provides event-level information. However, future work will involve trying to establish a GlobalInfoTree, which contains information about the entire sample, such as the cross-section - this would only need to be loaded once, and would save writing the same information into the tree repeatedly and subsequently reading it back repeatedly.
<verbatim>ListParameter <tag> <onoff> <colon-separated-parameter-list></verbatim>
<onoff> specifies whether this parameter will be taken into consideration (1) or ignored (0) - generally this should be set to 1.
<verbatim>ColumnParameter <tag> <sequence> <keyword=doubleValue:keyword=doubleValue...></verbatim>
The expression <tag>:<sequence> must be unique, e.g.
<verbatim>ColumnParameter File 0 OnOff=0:SorB=0:Process=Data
ColumnParameter File 1 OnOff=1:SorB=0:Process=Fake</verbatim>
where <tag> is the same, but <sequence> is different. The fact that the <sequence> carries meaning is specific to the implementation. Note that all of the values passed from ColumnParameter are eventually evaluated as doubles - where you pass a string (as for 'Process' above), the value is not actually used by the code. These string entries are there to make the files more easily readable by puny humans, who comprehend the meaning of strings more readily than doubles.
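As a rough illustration of how such a keyword=doubleValue list behaves (parse_column_parameter is a hypothetical helper for this page, not the actual GlaNtp parser):

```python
def parse_column_parameter(value_list):
    """Parse a colon-separated keyword=value list such as
    'OnOff=1:SorB=0:Process=Data' (hypothetical sketch).
    Every value is interpreted as a double; non-numeric values
    (like Process=Data) are kept as None here, mimicking the fact
    that string values are never actually used by the code."""
    result = {}
    for item in value_list.split(":"):
        key, _, value = item.partition("=")
        try:
            result[key] = float(value)
        except ValueError:
            result[key] = None  # effectively a human-readable comment
    return result
```

For example, parse_column_parameter("OnOff=1:SorB=0:Process=Data") yields OnOff and SorB as doubles, while Process survives only as a label.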
<verbatim>0 116102-filter.root FlatPlotter/NNScoreAny_0_0_0 0
Process_0_0 TTjj:Semileptonic
Process_1_0 ttH:Semileptonic
Process_2_0 EWK:Semileptonic
Process_3_0 QCD:Semileptonic</verbatim>
<verbatim>GeneralParameter string 1 FileString=my_Eventtype</verbatim>
This indicates the leaf in the input file which shows which process the event belongs to - this is the same number as specified in atlastth_histlist_flat-v15.txt and AtlasttHRealTitles.txt.
<verbatim>ColumnParameter File 1 OnOff=1:SorB=1:Process=tth</verbatim>
The number before the switches (OnOff, SorB, etc. - in this case it is 1) corresponds to the number given in AtlasttHRealTitles.txt. The other values are self-explanatory - they establish whether that file is to be used, whether it is signal or background (1=signal, 0=background) and the name of the process. In this instance, the Process name is just a comment for your own elucidation - it is not itself used in the code, so it does not necessarily have to correspond to the process names provided in AtlasttHRealTitles.txt (though of course it is useful for them to be similar). The other file produced by genemflat that specifies the input files for Computentp is steerComputentp.txt
<verbatim># Specify the known metadata
ListParameter SignalProcessList 1 Alistair_tth
ListParameter Process:Alistair_tth 1 Filename:${ntuple_area}/ttH-v15.root:File:${mh}:IntLumi:1.0</verbatim>
This is just a list of the various input files, together with the integrated luminosity. The 'File' parameter is only used for book-keeping by Computentp, and does not have to correspond to the file numbers used in the ANN steering files (or to my_Eventtype), but for sanity's sake it is probably best to keep things consistent. We make an exception for the signal - we assign it the number ${mh} - so that we can keep track of things if we have different-mass Higgs in our signals.
<verbatim># Map of input file name to output file name: The ComputentpOutput will have a sed used to get the right mapping.
ListParameter InputOutputMapName:1 1 ${ntuple_area}/ttH-v15.root:${Computentpoutput}/tth_NNinput.root</verbatim>
The InputOutputMapName entries are indexed by integers - these don't have to bear any relevance to any numbers that have gone before - just give each mapping a unique number. Each entry maps an input file name to the output name that Computentp will produce for it.
<verbatim>ColumnParameter BackgroundList 0 tt0j=0
ColumnParameter SignalList 1 ttH=1</verbatim>
Here you specify once again the numbers assigned to the processes by my_Eventtype (for tt0j it equals zero), and list them under BackgroundList or SignalList. The number after 'BackgroundList' or 'SignalList' is unique for each process (to preserve the uniqueness of <tag>:<sequence>), but does not need to correspond to my_Eventtype. However, for completeness' sake within this file I have set it as such. The number at the end of the declaration (tt0j=0 in this case) needs to be sequential - it instructs the net of the order in which to process the samples, so it must go from 0 to n-1 (when you have n samples).
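The 0-to-n-1 requirement on those final numbers can be expressed as a one-line sanity check (order_is_sequential is a hypothetical helper, not part of the actual code):

```python
def order_is_sequential(order_values):
    """Check that the process-order numbers (e.g. the 0 in tt0j=0,
    the 1 in ttH=1) form the sequence 0..n-1 with no gaps, as
    required above. Hypothetical sketch for illustration."""
    return sorted(order_values) == list(range(len(order_values)))
```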
<verbatim>ColumnParameter PseudoDataList 0 tt0j=0</verbatim>
This is simply a restatement of the BackgroundList (as we're looking for exclusion, the pseudodata is background only) - the same numbers in the same places. This list specifies the processes included in the pseudoexperiments, and therefore the signal process is not included in this list.
<verbatim>ListParameter ProcessLabels:1 1 tt0j:t#bar{t}0j</verbatim>
The number after ProcessLabels again doesn't correspond to my_Eventtype - I have made it the same as the number after BackgroundList/SignalList and PseudoDataList. The important feature is that this tells the ANN what to label each of the various processes as in the results plots. Again, the numbers must run from 0 to n-1.
<verbatim>ColumnParameter UCSDPalette 0 tt0j=19
ColumnParameter PrimaryColorPalette 0 tt0j=0</verbatim>
These two parameters specify the colours used in the plotting for each of the processes (the numbers correspond to those in the Color Wheel of TColor). The numbers after UCSDPalette and PrimaryColorPalette are the same ones as have been used previously in this file. Whether the plotting uses the colours stated in UCSDPalette or PrimaryColorPalette is determined in the file flatsteerStackNNAtlas.txt by setting the parameter:
<verbatim>GeneralParameter string 1 Palette=UCSDPalette</verbatim>
The final parameter to be set in FlatAtlastthPhysicsProc1.txt is:
<verbatim>ColumnParameter ProcessOrder 0 tt0j=0</verbatim>
Once again, the number on its own (in this case the first 0) is the same as in the other such instances in this file. The final number (also zero here) is the order in which this process should be plotted - i.e. in this case, the tt0j sample will be plotted first in the output, with the other samples piled on top of it. This number does not need to correspond to my_Eventtype.
<verbatim>ColumnParameter Combine:Lumi 0 OnOff=1:Low=-0.11:High=0.11:Channel=1:Process=TTjj</verbatim>
The <sequence> parameter (in this case '0') is there so that you can specify the parameters of a given error for multiple channels without falling foul of the uniqueness requirement for <tag>:<sequence>. We have chosen it so that it equals my_Eventtype for that process. 'Channel' is present just in case you're considering multiple channels; we're only considering one channel in this case (SemiLeptonic). The final parameter (Process) is not actually used - the second parameter tells the ANN which errors are which, but this isn't very easily read by you, so feel free to add it in to help you keep track of the various errors! These final few parameters can be placed in any order, so long as they are separated by colons.
<verbatim>GeneralParameter int 1 NEvent=20000000
GeneralParameter int 0 FirstEvent=1
GeneralParameter int 0 LastEvent=10</verbatim>
FirstEvent and LastEvent allow you to specify a range of events to run over - this is liable only to be useful during debugging. (Note that these parameters are currently turned off.) NEvent gives the maximum number of events processed for any given sample - take care with this if you are running a particularly large sample through the code.
<verbatim>sysfile=FlatSysSetAtlastth.txt
steerfile=FlatFitSteer.txt
mkdir -p templates/fit
rm -f templates/fit/out_${mh}.log
Fit ${basehistlistname} ${template_area}/ \$sysfile \$steerfile $mh > templates/fit/out_${mh}.log</verbatim>
The final call is rendered in the actual job file (e.g. run114) as
<verbatim>Fit /home/ahgemmell/NNFitter-00-00-09-Edited/NNTraining/atlastth_histlist_flat-v15.txt templates/tth120/ $sysfile $steerfile 120 > templates/fit/out_120.log</verbatim>
If you want to save time (by not having to run templating for every error you wish to consider), you can instead consider only the rate uncertainties, providing these as fractional changes to the rate, specified in FlatSysSetAtlastth1.txt. Whether or not you consider shape uncertainties is controlled by a couple of parameters in the steering file FlatFitSteer.txt (which is created by the action of genemflat_batch_Complete2_SL5.sh):
<verbatim>GeneralParameter bool 1 UseShape=0
GeneralParameter bool 1 UseShapeMean=0</verbatim>
Setting UseShape=1 means shape uncertainties will be taken into account for all the uncertainties for which you provide the extra steering files and ANN scores. UseShapeMean=1 means that the ANN results for your various uncertainties will be used to produce the rate uncertainties based on their integrals, rather than on the numbers provided in FlatSysSetAtlastth1.txt. Using the relative sizes of the integrals of the ANN output as an estimator of the rate uncertainty can be useful if you don't want to be subject to statistical variations in the computation of your systematic uncertainties. (If UseShapeMean=0, the systematic rate uncertainty is calculated as a fractional change on the nominal rate.) Considering shape uncertainties requires more steering files, and this is detailed later.
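In spirit, the UseShapeMean=1 estimate amounts to comparing template integrals (a hypothetical sketch; the actual fitting code works on the ANN output histograms themselves):

```python
def rate_uncertainty_from_integrals(nominal_integral, shifted_integral):
    """Fractional rate uncertainty estimated from the integrals of the
    nominal and systematically shifted ANN output templates - the idea
    behind UseShapeMean=1 (hypothetical sketch, not the actual code)."""
    return (shifted_integral - nominal_integral) / nominal_integral
```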
<verbatim>ColumnParameter Combine:Lumi 0 OnOff=1:Low=-0.11:High=0.11:Channel=1:Process=TTjj</verbatim>
The first parameter consists of two parts in this example: 'Combine' and 'Lumi'. The second part is the name of the uncertainty being considered. The first part, 'Combine' (and the associated colon between them), is optional: it tells the ANN that the uncertainties thus labelled are independent of each other, and can be added in quadrature. 'OnOff' tells the ANN to consider that uncertainty (1) or not (0). 'Low' and 'High' establish the relevant bounds of the uncertainty as fractions of the total (however, for the ANN these uncertainties are symmetrised, so to save time they are here assumed to be symmetric unless stated otherwise) - note that these are not the uncertainties on the quantity itself, but rather the effect of that uncertainty on the rate of your process. Process is not actually read by the ANN, but is there to make the whole thing more human-friendly to read. The current errors and their bounds are given below. Where no source for these error bounds is given, they were the defaults found in the files from time immemorial (where necessary I assumed that all tt + X errors were the same, as were all ttbb (QCD) errors, as in the original files the only samples considered were ttjj, ttbb(EWK), ttbb(QCD) and ttH - these errors probably originate from the CSC note). If you are only considering rate uncertainties, this is where the fitting code will find the relevant numbers.
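The quadrature addition of independent 'Combine'-labelled uncertainties amounts to (a minimal sketch; combine_in_quadrature is a hypothetical helper):

```python
import math

def combine_in_quadrature(fractional_uncertainties):
    """Add independent fractional rate uncertainties in quadrature,
    as described for errors sharing the 'Combine' label (sketch)."""
    return math.sqrt(sum(u * u for u in fractional_uncertainties))
```

For example, two independent 3% and 4% rate uncertainties combine to 5%.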
<verbatim>ListParameter SysInfoToSysMap:1 1 Combine:LumiTrigLepID</verbatim>
The number in the <tag> after SysInfoToSysMap is unique for each error (in this case it runs from one to eight). There is one entry per error considered, apart from the cases where the errors are combined in quadrature (as specified in FlatSysSetAtlastth1.txt), which are given one entry to share between them. The <colon-separated-parameter-list> provides a map between the names of the errors as considered by FlatSysSetAtlastth1.txt (the errors combined in quadrature are lumped together under the name 'Combine') and something more human-readable. The human-readable names are what will be written out by the fitting code (which identifies each error by number, rather than by the names in FlatSysSetAtlastth1.txt) when it is producing its logfile. Obviously there is often not much change between the two names, apart from in the case of Combined errors.
<verbatim>template_area=templates/${process}${mh}
0 116102-filter.root FlatPlotter/NNScoreAny_0_0_0 0</verbatim>
could become:
<verbatim>template_area=${MAINDIR}
0 run1/templates/tth120/116102-filter.root FlatPlotter/NNScoreAny_0_0_0 0</verbatim>
This ensures that atlastth_histlist_flat-v15.txt will still point toward the ANN templates from the nominal run. You must now create additional steering files to point toward the high and low error ANN templates - their names are of the format:
"ShapePos_"+errorname+"_"+HistOutput "ShapeNeg_"+errorname+"_"+HistOutputwhere HistOutput is atlastth_histlist_flat-v15.txt and errorname is the human-readable error name, as defined in SysNamesAtlastth1.txt. You also need to change the ${basehistlistname} in the call to the fitting code so that it points directly at atlastth_histlist_flat-v15.txt, with no preceding directory structure - the code bases the names of the two extra shape steering files on this argument, and will not take into account any directories in the argument. (So that if ${basehistlistname} was directory/file.txt, the fitting code would look for the extra steering files with the name ShapePos _ISR_directory/file.txt in the case of ISR being our error).
<verbatim>GeneralParameter string 1 FlatTupleVar/cutWord=my_GoodJets_N/my_GoodJets_N</verbatim>
This sets the variable we wish to use in our filter - it interfaces with the cutMask and invertWord as specified in TreeSpecATLAStth.txt. Note that depending on the number of jets you wish to run your analysis on (set as a command-line argument during the running of the script), this is edited by genemflat.
<verbatim>ListParameter SpecifyVariable:Higgs:cutMask 1 Type:int:Default:3
ListParameter SpecifyVariable:Higgs:invertWord 1 Type:int:Default:0</verbatim>
invertWord is used to invert the relevant bits (in this case no bits are inverted) before the cut from cutMask is applied. The cutMask tells the filter which bits we care about (we use a binary filter). So, for example, if cutMask is set to 6 (110 in binary), we are telling the filter that we wish the second and third bits of cutWord to be equal to one - we don't care about the first bit.
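The cutMask/invertWord logic described above can be sketched as follows (passes_cut is a hypothetical helper; the bit convention is as described in the text):

```python
def passes_cut(cut_word, cut_mask, invert_word):
    """First flip the bits of cut_word selected by invert_word, then
    require that every bit selected by cut_mask is set in the result
    (hypothetical sketch of the binary filter described above)."""
    return ((cut_word ^ invert_word) & cut_mask) == cut_mask
```

With cut_mask=6 (binary 110) and invert_word=0, an event with cut_word=0b110 or 0b111 passes (the first bit is ignored), while cut_word=0b010 fails.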
<verbatim>GeneralParameter string 1 Constraint=(my_failEvent&3)==3</verbatim>
This controls the events used in the training, using a bitwise comparison. If the constraint is true (i.e. both of the first two bits are set), then the event is used for training. This filter is not currently used, as training of the net takes place on the Computentp output - this output only contains sensible states (as specified by the filter in the TreeSpecATLAStth.txt file). If further filtering is required, then care must be taken to ensure that my_failEvent (or whatever you wish to base your filter on) is specified in the VariableTreeToNTP file, so that Computentp will copy it into its output. **If USEHILOSB is set to 1 then && must be appended to the cut criteria, e.g. GeneralParameter string 1 Constraint=(my_failEvent&65536)==0&&. This is because USEHILOSB adds more constraints.**
./genemflat_batch_Complete2_SL5.sh 12 480 1.0 tth 120 120 6 00-00-17 /data/atlas09/ahgemmell/NNInputFiles_v16/mergedfilesProcessed
These options denote:
   * <p>12 is the run number</p>
   * <p>480 is the jobstart - this is a potentially redundant parameter to do with the PBS queue.</p>
   * <p>1.0 is the luminosity the results will be normalised to (in fb^-1).</p>
   * <p>tth is the process type - the aim is to develop this to incorporate other processes, e.g. lbb</p>
   * <p>120 is the min. Higgs mass</p>
   * <p>120 is the max. Higgs mass</p>
   * <p>6 is the number of jets in the events you want to run over (i.e. this is an exclusive 6-jet analysis - events with 7 jets are excluded)</p>
   * <p>/data/atlas09/ahgemmell/NNInputFiles_v16/mergedfilesProcessed is the directory where the input ntuples are located (having my_failEvent bits set: 65536 if there are no sensible states, 131072 if there are fewer than 4 tight b-tagged jets)</p>
Once the job has been completed you will receive an e-mail summarising the outcome. Running:
   * <p>Creates a run12 subdirectory in the working directory and makes it the working directory</p>
   * <p>Creates TMVAsteer.txt and writes fitting parameters to it</p>
   * <p>The NN structure is set (H6AONN5MEMLP MLP 1 !H:!V:NCycles=1000:HiddenLayers=N+1,N:RandomSeed=9876543). This line sets up two hidden layers with N+1 and N neurons respectively (where N is the number of input variables).</p>
   * <p>Training cycles (1000) and hidden layers - N+1?</p>
   * <p>4 text steer files are copied into the run directory for templating</p>
   * <p>2 text steer files are copied into the run directory for stacking plots.</p>
   * <p>2 lines of text are appended to a temporary copy of flatsteerStackNNAtlas.txt: GeneralParameter string 1 HWW=tth-TMVA %BR% GeneralParameter double 1 <nop>IntLumi=${lumi}</p>
Contents of training file:
   * <p>jetmin/jetmax - These seem to be redundant.
Commented out, effective as of v.3</p> * <p>zmin/zmax (1/2) - what is their function?</p> * <p>weighting = <nop>TrainWeight - is this redundant?</p> * <p>TMVAvarset.txt - input variable set</p> ---+++ Other switches to influence the running In genemflat_batch_Complete2_SL5.sh, at the start of the file there are a number of switches established: <verbatim> # Flags to limit the scope of the run if desired Computentps=1 DoTraining=1 ComputeTMVA=1 DoTemplates=1 DoStackedPlots=1 DoFit=1 </verbatim> These control whether or not various parts of the code are run - the names of the flags are pretty self-explanatory about what parts of the code they control. For example, it is possible to omit the training in subsequent (templating) runs, if it has previously been done. This shortens the run time significantly. <strong>***NOTE***</strong> The flags DoTraining and DoTemplates had previously (until release 00-00-21) been set on the command line. They were moved from the command line when the other flags were introduced. ---++ %MAKETEXT{"Where the output is stored"}% <ol> <li> trees/NNInputs_120.root The output from Computentp - it is a copy of all of the input datasets, with the addition of the variables TrainWeight and weight. </li> <li> computentp_output/ The same as for trees/NNInputs_120.root, but now each individual input dataset has its own appropriately named file. </li> <li> weights/TMVTEST2_120_3.root_H6AONN5MEMLP.weights.txt Contains the weights of the Neural Net - results of the training. Starts with info about the run: Date, time, etc. followed by various parameters you've set. Then the variables you're considering in the net, with their range. Finally comes the Neural Net structure itself - weights etc. </li> <li> TMVA2_120_3.root <ol type="i"> <li> InputVariable _NoTransform/my_----_B_NoTransform The distribution of this variable for the Background (or signal for S). N.B. Has N_sig / 2 entries (other half used for testing, not training). 
Check this to make sure there are no obvious problems with the input - missing data points, random spikes, etc. </li> <li> Method_MLP/H6AONN5MEMLP/estimatorHistTrain A measure, after each iteration, of how much tweaking is required to get S=1, B=0. Should settle after a while to a stable number. </li> <li> Method_MLP/H6AONN5MEMLP/estimatorHistTest Same, but for the test sample - should look broadly similar. Otherwise we're over/undertraining. </li> <li> Method_MLP/H6AONN5MEMLP/MVA_H6AONN5MEMLP_B The final Neural Net score for the Background (or signal for _S_) (test result - not the final result). </li> <li> Method_MLP/H6AONN5MEMLP/MVA_H6AONN5MEMLP_B_high Rick's not sure. </li> <li> Method_MLP/H6AONN5MEMLP/MVA_H6AONN5MEMLP_rejBvsS Comparing background rejection to signal efficiency. Ideally we want rej=1 with eff=1 (i.e. reject all background, accept all signal). </li> </ol></li> <li> train2.log A text summary of the training process. Mentions the files used (how many files, numbers of events, whether they are signal/background, etc). It will refer to file 0, 1... - refer to AtlasttHRealTitles.txt to work out which file is which. Use this to make sure the right numbers of events are getting through your filters. </li> <li> templates/out/FlatPlotter${prefix}.out The screen output from the templating stage, where the Neural Net scores are actually produced, detailing errors etc. </li> <li> templates/ttH120/ Files showing the final results of the ANN. NNScoreAny_0_0_0 is of particular interest - this is the ANN output for that file, weighted to represent real data; if you simply add all of these together, you should get a realistic 'Signal and Background' result, which is shown in: </li> <li> stacked/Plots_120_TMVA_Lum_1.0/FlatStack_1.eps stacked/Plots_120_TMVA_Lum_1.0/FlatStack_2.eps The final ANN scores of the signal and background, scaled as for real data </li> <li> stacked/out1_120.log stacked/out2_120.log Output from making the stacked (combined) plots.
</li> <li> templates/fit/out_120.log This is the paydirt - the output to screen from the fitting stage. Look at the end of this file, and you shall see the exclusions generated by all the pseudoexperiments performed on the ANN output, in a little table, also giving +/- 1/2 sigma results. </li> <li> drivetestFlatFitAtlastth.rootUnscaledTemplates.root. The distribution of the exclusions. This is plotted only if <verbatim>GeneralParameter bool 1 PlotLikelihood=1</verbatim> in teststeerFlatReaderATLAStthSemileptonic.txt. The range and number of bins in this plot can be controlled by editing the following switches: <verbatim>GeneralParameter int 1 LikeliPseudoExpNBin=400
GeneralParameter double 1 LikeliPseudoExpMin=0.
GeneralParameter double 1 LikeliPseudoExpMax=10.</verbatim> </li> <li> tr[run number].o[PBS job number] This output file is written as soon as the job stops, and contains a summary of the full run - useful for looking for error messages. </li> <li> TMVAPerf_120_Run[run number]Job2.html A useful little html page that one of Rick's scripts creates, showing a number of useful plots - the signal and background Net scores, distributions of input variables and their correlations, and so on. </li> </ol> ---+++ %MAKETEXT{"Limitations"}% * <p>It must be run on a PBS machine because of the structure of the genemflat_batch_Complete2_SL5.sh file (i.e. PBS commands).</p> * <p>The file <nop>teststeerFlatPlotterATLAStthSemileptonic.txt appears to contain an invalid range for the pseudorapidity (max. value = pi)</p> * <p>If USEHILOSB is set to 1 then && must be appended to the cut criteria, e.g. GeneralParameter string 1 Constraint=(my_failEvent&65536)==0&&</p> * <p>It would be desirable to adapt the code to be able to process different signals, e.g. lbb.</p> ---+++ %MAKETEXT{"Diagnostic Run"}% A diagnostic run may be carried out by setting Debug=1 and NEvent=99 in teststeerFlatReaderATLAStthSemileptonic.txt.
It is also advisable to cut the run time down by setting the number of training cycles to a low number (e.g. 20) in genemflat_batch_Complete2_SL5.sh - this appears as NCycles in the TMVAsteer.txt part of the file. ---++ %MAKETEXT{"TMVA Training Plots"}% There is a macro in [[https://ppes8.physics.gla.ac.uk/trac/atlas/browser/NNFitter/trunk][the latest NNFitter version]] which will plot the contents of run12/TMVA2_120_3.root showing the responsiveness of the neural net as a function of the number of training cycles, in order to gauge the optimal number of cycles to use (i.e. avoid the dangers of under- or over-training). The macro must be run in a directory where a neural net run has already been carried out. * <p>Type source exportscript.sh to export the relevant parameters.</p> * <p>Enter ROOT and type .x runTrainTest.C</p> This will create two .eps output files, one showing the success of signal/background fitting, and the other displaying the sensitivity of the neural net to the number of training cycles used, allowing the speed of convergence to be gauged. ---++ %MAKETEXT{"Running analysis & making ntuples"}% This is the procedure used to remake the ATLAS ntuples, using code from CERN Subversion repositories. These ntuples were then used as input for the neural net. cd ~ %BR% mkdir tth_analysis_making_ntuples_v.13 %BR% cd tth_analysis_making_ntuples_v.13 %BR% export SVNROOT=svn+ssh://kirby@svn.cern.ch/reps/atlasoff %BR% svn co $SVNROOT/PhysicsAnalysis/HiggsPhys/HiggsAssocTop/HiggsAssocWithTopToBBbar/tags/HiggsAssocWithTopToBBbar-00-00-00-13 <nop>PhysicsAnalysis/HiggsPhys/HiggsAssocTop/HiggsAssocWithTopToBBbar %BR% cd <nop>PhysicsAnalysis/HiggsPhys/HiggsAssocTop/HiggsAssocWithTopToBBbar/NtupleAnalysis/ %BR% make In a new terminal window: cd /data/atlas07 %BR% mkdir gkirby %BR% cd gkirby %BR% mkdir ntuples_sensstatecutword %BR% The script files in the <nop>NtupleAnalysis directory were then altered to output to this new directory. 
This is the output line for the signal (tthhbbOptions86581430018061-nn.txt): OUTPUT /data/atlas07/gkirby/ntuples_sensstatecutword/86581430018061-nn.root Code changes: The tthhbbClass.cxx file was edited to include a new error code: the following lines were added to allow us to exclude events that we do not wish the NN to train/test with. if (<nop>SensibleStates.size()==0) { %BR% m_failEvent+=65536; %BR% } %BR% Then the make command was used again in this directory. The tthhbb executable was run with each of the input ("Options") text files to prepare the ntuples. Another change was also required; the tthhbbClass.cxx file was also edited to include a fail code to allow for events not having '4 tight b-tagged jets', since this was one of the criteria used in the cut-based preselection. The following code was added to tthhbbClass.cxx: if (<nop>BJets.size()<4) { %BR% m_failEvent+=131072; %BR% } %BR% This was done so that the Signal to Background ratio could be increased in order for the fit to finish and provide sensible results. Requiring 4 tight b-jets removes proportionally much more of the ttjj background than of the other samples, because it has fewer b-jets. The reason the S/B ratio was so low was that a problem was found which meant that the original ratio used in the 'fix_weight' variable in genemflat_batch_Complete2_SL5.sh (up to version -06) was in fact incorrect, and when the correct weights were calculated the S/B ratio was very low indeed. The Neural Net was then configured to exclude events where: (m_failEvent & 196608)!=0 %BR% with 196608=131072+65536. ---+ %MAKETEXT{"Creating plots to review the data"}% There is a simple shell script included in the running Neural Net code package that can produce a nice html document you can use to review a few plots of interest - plotTMVA.sh. To run it, move it into the run directory you want to review, then it's a simple one-line command: <verbatim> ./plotTMVA.sh 120 <run> <job> </verbatim> N.B.
This is done automatically by genemflat currently. ---+ Debugging the code Before trying to debug, you should set up the environment in your terminal (when running the code normally, this is done automatically by tr${run}.job). This can be done by sourcing setup.sh, which automates the following lines of code: <verbatim>
# Set where your GlaNtp installation is
GLANTPDIR=/home/ahgemmell/GlaNtp/GlaNtpPackage/GlaNtpSVN00-00-10
source ~/GlaNtp/cleanpath3.sh
export PATH=${PATH}:${GLANTPDIR}/bin/Linux2.6-GCC_4_1
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${GLANTPDIR}/shlib/Linux2.6-GCC_4_1
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${GLANTPDIR}/lib/Linux2.6-GCC_4_1
</verbatim> To debug the code, two things need to be done - first, all the debug switches need to be turned on, and then you need to restrict the number of events to ~10 (for a Computentp run this will still manage to generate a 2 GB log file!). All of these switches are found in teststeerFlatReaderATLAStthSemileptonic.txt (the progenitor for all FlatReader files) and steerComputentp.txt (created by genemflat). The debug switches are: <pre>GeneralParameter bool 1 Debug=0
GeneralParameter bool 1 DebugGlobalInfo=0
GeneralParameter bool 1 DebugEvInfo=0
GeneralParameter int 1 ReportInterval=100</pre> In steerComputentp.txt there is also one additional debug option: <pre>GeneralParameter bool 1 DebugFlatTRntp=1</pre> All the debug switches can be set to one (I'm not sure of the exact effect of each individual switch) - the report interval can be adapted depending on how many events are present in your input files and on how large you want your log files to be.
To restrict the events, use:
<pre>#
# Loop Control
#
GeneralParameter int 1 NEvent=999999
GeneralParameter int 0 FirstEvent=1
GeneralParameter int 0 LastEvent=10</pre>
The easiest approach is to set NEvent=10. However, if desired you can run over a specified range instead, by switching off the NEvent switch (changing it to int 0 NEvent) and switching on the other two switches, using them to specify the events you wish to run over. You can also run a subset of a complete run by altering the flags found in genemflat:
<verbatim>
# Flags to limit the scope of the run if desired
Computentps=1
DoTraining=0
ComputeTMVA=0
DoTemplates=0
DoStackedPlots=0
DoFit=0
</verbatim>
However, sometimes even this cannot produce enough information, so there are a few other options for checking your code. The first option is:
<verbatim>runFlatReader FlatReaderATLAStthNoNN.txt /data/atlas09/ahgemmell/NNInputFiles_v16/mergedfilesProcessed/ttH-v16.root</verbatim>
This produces a *lot* of printout, so be sure to restrict the number of events as described above! If you want more debugging from Computentp, run it with another argument (it doesn't matter what the argument is - in the example below it's simply 1):
<verbatim>Computentp steerComputentp.txt 1</verbatim>

---++ Some error messages and how to fix them

<verbatim>Double Variable: my_NN_BJet12_M not valid and hence saved : 1</verbatim>
Look at VariableTreeToNTPATLASttHSemiLeptonic-v16.txt - are the names of the variables really consistent?

---+ Various other switches of interest

In FlatReader:
<pre>GeneralParameter int 1 LikeliPseudoExpMin=0.
GeneralParameter int 1 LikeliPseudoExpMax=10.
GeneralParameter int 1 LikeliPseudoExpNBin=400</pre>
These enable you to specify the range and number of bins of the histogram showing the distribution of the pseudoexperiment exclusions.
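Flipping the debug and loop-control switches by hand in several steering files is error-prone. As a sketch, using a toy stand-in file (the real files are teststeerFlatReaderATLAStthSemileptonic.txt and steerComputentp.txt), sed can turn every Debug* switch on and cap the event count in one step:

```shell
# Toy steering file carrying the switches shown above (stand-in only)
cat > steer-demo.txt <<'EOF'
GeneralParameter bool 1 Debug=0
GeneralParameter bool 1 DebugGlobalInfo=0
GeneralParameter bool 1 DebugEvInfo=0
GeneralParameter int 1 NEvent=999999
EOF

# Turn every Debug* switch on and restrict the run to 10 events
# (GNU sed assumed for in-place editing with -i)
sed -i -e 's/\(Debug[A-Za-z]*\)=0/\1=1/' \
       -e 's/NEvent=[0-9]*/NEvent=10/' steer-demo.txt

cat steer-demo.txt
```

Run the same two expressions over the real steering files once you are happy with the result on the toy file.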
(Found in drivetestFlatFitAtlastth.rootUnscaledTemplates.root)

In both FlatReader and FlatPlotter:
<pre>GeneralParameter bool 1 LoadGlobalOnEachEvent=1</pre>
This needs to be set to one if you wish to load the global variables anew for each event. Otherwise the global variables are loaded once only - from the Global tree if you have specified one, or from the first event if you haven't. Therefore, if your input datasets have non-sensible states and no Global tree, this must be set to one: if the first entry is not sensible, or for some other reason holds an unreasonable value for a global variable, a problem will develop. With this switch on, the values are loaded each and every time - obviously this slows the code down - so if the global values are safely stored in every entry, it might be best to set it to false.

---++ TMVAsteer.txt (genemflat_batch_Complete2_SL5.sh)

<verbatim>H6AONN5MEMLP MLP 1 !H:!V:NCycles=1000:HiddenLayers=N+1,N:RandomSeed=9876543</verbatim>
If the phrase 'H6AONN5MEMLP' is changed, then this change must also be propagated to the webpage plotter (e-mail from Rick, 1 Mar 2011).
</noautolink>
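One way to confirm that a renamed method string has been propagated everywhere is to grep for the old name across the relevant files. A toy sketch with stand-in files (the real targets would be TMVAsteer.txt and the webpage plotter sources):

```shell
# Stand-in files: one still using the old method name, one already updated
printf 'H6AONN5MEMLP MLP 1 !H:!V:NCycles=1000\n' > TMVAsteer-demo.txt
printf 'method = "MyNewMLP"\n' > plotter-demo.txt

# List the files that still mention the old name (here only the steer file)
grep -l 'H6AONN5MEMLP' TMVAsteer-demo.txt plotter-demo.txt
```

Any file the grep lists still needs the rename applied before the plots will pick up the new method.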
| I | Attachment | History | Action | Size | Date | Who | Comment |
|---|---|---|---|---|---|---|---|
| | Est_12_120.eps | r1 | manage | 16.0 K | 2009-07-24 - 12:06 | GavinKirby | |
| | FlatStack_1.eps | r1 | manage | 90.1 K | 2009-07-24 - 11:12 | GavinKirby | |
| | FlatStack_2.eps | r1 | manage | 109.0 K | 2009-07-24 - 12:06 | GavinKirby | |
| | Report-FINAL.pdf | r1 | manage | 202.3 K | 2009-09-29 - 10:47 | ChrisCollins | |
| | drivetestFlatFitAtlastth.rootSemiLeptonic_lnsb1.eps | r1 | manage | 17.0 K | 2009-07-24 - 12:06 | GavinKirby | |
| | drivetestFlatFitAtlastth.rootSemiLeptonic_lnsb2.eps | r1 | manage | 16.1 K | 2009-07-24 - 12:06 | GavinKirby | |
| | score_12_120.eps | r1 | manage | 28.5 K | 2009-07-24 - 12:06 | GavinKirby | |