Computentp, Neural Nets and MCLIMITS

This page has been substantially rewritten (and remains a work in progress) to focus just on the information required for a successful run of the Computentp and Neural Net package, to deliver exclusions. For information on results obtained using inputs created in v12 of athena, please refer to the archive. This page also describes how to run on GlaNtp - the version of the code set up for use in Glasgow, with no CDF dependencies. To use the previous version of the code (there are some important differences) refer to r93 and earlier.


Project Aims

This project aims to document the use of an Artificial Neural Network (ANN) system and fitting software for the analysis of data from inclusive Higgs searches at ATLAS involving a lepton trigger and Higgs decay to b+bbar. This will use input from the Computentp software, designed to automate the weighting of the input files as required.

Tools

ANN :- This is a kind of algorithm with a structure consisting of "neurons" organised in a sequence of layers. The most common type, which is used here, is the Multi-Layer Perceptron (MLP), which comprises three kinds of layer. The input neurons are activated by a set trigger, and once activated they pass data on to a further set of "hidden" neurons (which can in principle be organised into any number of layers, but most frequently one or two - in our case one), and finally the processed data is forwarded to the output neurons.

The key feature of a neural network is its ability to be "trained" to recognise patterns in data, allowing high efficiency algorithms to be developed with relative ease. This training is typically done with sample data which has been generated artificially, resulting in an algorithm that is very effective at recognising certain patterns in data sets. The only shortcoming is the danger of "over-training" an ANN, meaning that it becomes overly discriminating and searches across a narrower range of patterns than is desired (one countermeasure is to add extra noise to training data).

Computentp :- Simply running the Neural Net on the raw input files results in less than optimal training. The training procedure requires equal numbers of events from signal and from background (in this case half of the signal events are used in training and half for testing). However, left to itself the code takes events from the background samples in proportion to the file sizes, which gives proportions not quite in accordance with the physical ratios. As the Neural Net weights its results according to information stored in the tree (the cross-section of the process and so on), the outputs are weighted in a physical fashion but the Net is not trained to the same ratios, and so is not optimally trained. To solve this problem, Computentp is used to mix together all background and signal samples and assign TrainWeights to them, so that the events are weighted correctly for the Net's training.

Preparing samples for the Neural Net

Previous work went into producing samples for the Neural Net from AODs - results have previously been obtained for MC samples derived from v12 and v15 of athena, with work directed toward upgrading this to v16. The inputs are created from AODs using the TtHHbbDPDBasedAnalysis package (currently 00-04-18 and its branches are for v15, 00-04-19 is for v16), which can be found here. However, the work on athena v16 ran into problems with application of trigger matching, and so was curtailed for the moment.

Now inputs are produced in a two-step process from D3PDs. First, the desired D3PDs have cleaning cuts applied to them, and our desired event-by-event information is stored in a flat ntuple. This is performed by the code found here. Then we add in the global event variables (ones that are constant throughout the sample, such as luminosity and cross-section).

Current samples in use

Input data and cross-sections

These cross-sections are for the overall process, at √s = 7 TeV.

The ttH sample cross-sections are provided for the overall process - the MC is divided into two samples with W+ and W- independent of one another. These two samples are merged before being put through the ANN.

The tt samples were initially generated to produce the equivalent of 75fb-1 of data, based on the LO cross-sections. Taking into account the k-factor of 1.84, all samples now simulate 40.8fb-1 of data (75/1.84). These samples have also had a generator-level filter applied - most events (especially for tt+0j) are of no interest to us and we don't want to fill up disk-space with them, so we apply filters based on the numbers of jets etc. The Filter Efficiency is the fraction of events that pass from the general sample into the final simulated sample. To clarify how all the numbers hang together, consider the case of tt+0j. We have simulated 66,911 events - as said above, this corresponds to 40.8fb-1 of data. We have a Filter Efficiency of 0.06774, so the full semi-leptonic sample before filtering would contain 66,911/0.06774 = 987,762 events in 40.8fb-1. Divide this by the luminosity to get the number of events in 1fb-1 (i.e. the cross-section in fb): rounding the luminosity to 40fb-1 gives 24,694 events per fb-1, and using 40.8fb-1 gives 24,210. Our starting point for the cross-section is 13.18pb, with a k-factor of 1.84, which gives 24.25pb (24,250fb) - so all the numbers compare with each other pretty favourably. This of course makes getting from the number of sensible state events to the number expected per fb-1 rather easy - simply divide by 40.8. You'll notice that the cross-section includes all the branching ratios already, so we don't need to worry about those.
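The tt+0j cross-check above can be reproduced in a few lines. This is only a sketch of the arithmetic, using the numbers quoted in the text; the variable names are made up for the illustration and do not correspond to anything in GlaNtp.

```python
# Cross-check of the tt+0j numbers quoted above (all inputs are from the text).
n_simulated = 66911      # filtered events in sample 116102
filter_eff = 0.06774     # generator-level filter efficiency
lo_xsec_pb = 13.18       # LO cross-section in pb
k_factor = 1.84
lo_lumi_fb = 75.0        # luminosity the samples were generated for, at LO

# Luminosity once the k-factor is applied (75 / 1.84, about 40.8 fb^-1)
lumi_fb = lo_lumi_fb / k_factor

# Number of events the sample would contain without the generator filter
n_unfiltered = n_simulated / filter_eff

# Events per fb^-1 implied by the sample, and the quoted cross-section in fb
events_per_fb = n_unfiltered / lumi_fb
xsec_fb = lo_xsec_pb * k_factor * 1000.0

print(f"luminosity            : {lumi_fb:.2f} fb^-1")
print(f"unfiltered events     : {n_unfiltered:.0f}")
print(f"events per fb^-1      : {events_per_fb:.0f}")
print(f"quoted cross-section  : {xsec_fb:.0f} fb")
```

The last two numbers should agree to better than a percent if the filter efficiency, k-factor and event count are consistent with each other.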

**IMPORTANT** The Filter Efficiency for these samples was calculated based on a no-pileup sample. The filter is applied at generator level, and one of the reasons it will cut an event is too few jets. However, pileup adds jets, and these are added well after the filter. The net result is that a number of events that failed the filter would have passed, had the pileup been added earlier in the process. This means the filter efficiency (and thus the cross-sections) is incorrect, by a yet to be determined amount.

For the other samples, however, we do need to worry about branching ratios - the quoted initial cross-section includes all final states, so we need to apply branching ratios to the cross-section to reduce it down, so that it reflects the sample we've generated. We then subsequently need to reduce the cross-section further so that it reflects the number of sensible states.

| Sample | Dataset numbers | Cross-section (pb) | Branching ratios / k-factor | Filter Efficiency | What the multiplicative factors are | Effective cross-section (pb) | Sources |
| ttH | 109840, 109841 | 0.09756 | 0.676*0.216*2*0.675 | 0.8355 | Overall | 0.01607 | Initial cross-section: https://twiki.cern.ch/twiki/bin/view/LHCPhysics/CERNYellowReportPageAt7TeV |
| | | | 0.676 | | W → hadrons | | Branching ratios: 2008 PDG Booklet |
| | | | 0.216 | | W → leptons (electron/muon) | | |
| | | | 2 | | Account for the 2 W decay routes | | |
| | | | 0.675 | | H → bb | | |
| | | | | 0.8355 | Lepton filter efficiency | | Filter eff: https://twiki.cern.ch/twiki/bin/view/AtlasProtected/HiggsWGHSG5Dataset7TeV |
| tt + 0j | 105894, 116102 | 13.18 | 1.84 | | For sample 105894 | 24.25120 | Initial cross-section and filter efficiency: https://twiki.cern.ch/twiki/bin/view/AtlasProtected/HiggsWGBGDataset7TeV#ttbar |
| | | | | 0.06774 | Filter efficiency for sample 116102 | | k-factor: https://twiki.cern.ch/twiki/bin/view/AtlasProtected/TopMC2009#ttbar_7_TeV |
| | | | 1.84 | | k-factor | | |
| tt + 1j | 105895, 116103 | 13.17 | 1.84 | | For sample 105895 | 24.23280 | Initial cross-section and filter efficiency: https://twiki.cern.ch/twiki/bin/view/AtlasProtected/HiggsWGBGDataset7TeV#ttbar |
| | | | | 0.2142 | Filter efficiency for sample 116103 | | k-factor: https://twiki.cern.ch/twiki/bin/view/AtlasProtected/TopMC2009#ttbar_7_TeV |
| | | | 1.84 | | k-factor | | |
| tt + 2j | 105896, 116104 | 7.87 | 1.84 | | For sample 105896 | 14.48080 | Initial cross-section and filter efficiency: https://twiki.cern.ch/twiki/bin/view/AtlasProtected/HiggsWGBGDataset7TeV#ttbar |
| | | | | 0.4502 | Filter efficiency for sample 116104 | | k-factor: https://twiki.cern.ch/twiki/bin/view/AtlasProtected/TopMC2009#ttbar_7_TeV |
| | | | 1.84 | | k-factor | | |
| tt + 3j | 105897, 116105 | 5.49 | 1.84 | | For sample 105897 | 10.10160 | Initial cross-section and filter efficiency: https://twiki.cern.ch/twiki/bin/view/AtlasProtected/HiggsWGBGDataset7TeV#ttbar |
| | | | | 0.5860 | Filter efficiency for sample 116105 | | k-factor: https://twiki.cern.ch/twiki/bin/view/AtlasProtected/TopMC2009#ttbar_7_TeV |
| | | | 1.84 | | k-factor | | |
| gg → ttbb (QCD) | 116101 | 0.8986 | 0.676*0.216*2*1.84 | | Overall | 0.48285 | Initial cross-section: https://twiki.cern.ch/twiki/bin/view/AtlasProtected/HiggsWGBGDataset7TeV#ttbarbbbar |
| | | | 0.676 | | W → hadrons | | Branching ratios: 2008 PDG Booklet |
| | | | 0.216 | | W → leptons (electron/muon) | | k-factor: https://twiki.cern.ch/twiki/bin/view/AtlasProtected/TopMC2009#ttbar_7_TeV - need to verify there's nothing more suitable than applying tt+X value! |
| | | | 2 | | Account for the 2 W decay routes | | |
| | | | 1.84 | | k-factor | | |
| qq → ttbb (QCD) | 116106 | 0.1416 | 0.676*0.216*2*1.84 | | Overall | 0.07609 | Initial cross-section: https://twiki.cern.ch/twiki/bin/view/AtlasProtected/HiggsWGBGDataset7TeV#ttbarbbbar |
| | | | 0.676 | | W → hadrons | | Branching ratios: 2008 PDG Booklet |
| | | | 0.216 | | W → leptons (electron/muon) | | k-factor: https://twiki.cern.ch/twiki/bin/view/AtlasProtected/TopMC2009#ttbar_7_TeV - need to verify there's nothing more suitable than applying tt+X value! |
| | | | 2 | | Account for the 2 W decay routes | | |
| | | | 1.84 | | k-factor | | |
| gg → ttbb (EWK) | 116100 | 0.0875 | 0.676*0.216*2*1.84 | | Overall | 0.04702 | Initial cross-section: https://twiki.cern.ch/twiki/bin/view/AtlasProtected/HiggsWGBGDataset7TeV#ttbarbbbar |
| | | | 0.676 | | W → hadrons | | Branching ratios: 2008 PDG Booklet |
| | | | 0.216 | | W → leptons (electron/muon) | | k-factor: https://twiki.cern.ch/twiki/bin/view/AtlasProtected/TopMC2009#ttbar_7_TeV - need to verify there's nothing more suitable than applying tt+X value! |
| | | | 2 | | Account for the 2 W decay routes | | |
| | | | 1.84 | | k-factor | | |
| qq → ttbb (EWK) | 116107 | 0.0101 | 0.676*0.216*2*1.84 | | Overall | 0.00543 | Initial cross-section: https://twiki.cern.ch/twiki/bin/view/AtlasProtected/HiggsWGBGDataset7TeV#ttbarbbbar |
| | | | 0.676 | | W → hadrons | | Branching ratios: 2008 PDG Booklet |
| | | | 0.216 | | W → leptons (electron/muon) | | k-factor: https://twiki.cern.ch/twiki/bin/view/AtlasProtected/TopMC2009#ttbar_7_TeV - need to verify there's nothing more suitable than applying tt+X value! |
| | | | 2 | | Account for the 2 W decay routes | | |
| | | | 1.84 | | k-factor | | |

These cross-sections and branching ratios are correct as of 8 Feb 2011. qq→ttbb (EWK) is currently not being used, owing to a bug in the production of the MC.
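The effective cross-sections in the table are just the quoted cross-section multiplied by each of the listed factors. A minimal sketch of that multiplication, using numbers from the table (the function and constant names are invented for this example):

```python
from math import prod

def effective_xsec(xsec_pb, factors):
    """Multiply a quoted cross-section (pb) by every correction factor."""
    return xsec_pb * prod(factors)

BR_W_HAD = 0.676   # W -> hadrons
BR_W_LEP = 0.216   # W -> leptons (electron/muon)
BR_H_BB = 0.675    # H -> bb
K_FACTOR = 1.84

# ttH: branching ratios, factor 2 for the two W decay routes, lepton filter
tth = effective_xsec(0.09756, [BR_W_HAD, BR_W_LEP, 2, BR_H_BB, 0.8355])

# gg -> ttbb (QCD): branching ratios, factor 2, k-factor
gg_ttbb_qcd = effective_xsec(0.8986, [BR_W_HAD, BR_W_LEP, 2, K_FACTOR])

# tt + 0j: the quoted cross-section already includes the branching ratios
tt0j = effective_xsec(13.18, [K_FACTOR])

for name, value in [("ttH", tth), ("gg->ttbb (QCD)", gg_ttbb_qcd), ("tt+0j", tt0j)]:
    print(f"{name:15s} {value:.5f} pb")
```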

Number of events surviving preselection, weights and TrainWeights

(See later in the TWiki for an explanation of weights and TrainWeights.) This table will be completed with all the relevant weights and TrainWeights at a later date - these values are to be compared to the output from Computentp to ensure everything is working as intended, and are calculated for the sensible cross-sections/events. (A quick check of the TrainWeight is to multiply the number of events of each background by its TrainWeight and sum the results - by design, this should equal the number of entries in the ttH sample.)

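| Sample | Dataset Number | Pileup? | Events: Total | Events: Passing Preselection | Events: Sensible States | Cross-section (fb): Total | Cross-section (fb): Passing Preselection | Cross-section (fb): Sensible States |
| ttH (W+ sample) | 109840 | Yes | 29968 | 2685 | 1936 | | | |
| | | No | | 2497 | 1761 | | | |
| ttH (W- sample) | 109841 | Yes | 29980 | 2764 | 2020 | | | |
| | | No | | 2600 | 1879 | | | |
| ttH (total) | | Yes | 59948 | 5449 | 3956 | 16.07 | 1.460 | 1.060 |
| | | No | | 5097 | 3640 | | 1.366 | 0.976 |
| tt + 0j | 105894 | No | 25487 | 6 | 5 | 24251 | 5.709 | 4.758 |
| | 116102 | Yes | 66911 | 149 | 123 | | | |
| | | No | | 78 | 66 | | | |
| tt + 1j | 105895 | No | 26980 | 21 | 18 | 24233 | 18.862 | 16.167 |
| | 116103 | Yes | 211254 | 960 | 787 | | | |
| | | No | | 638 | 517 | | | |
| tt + 2j | 105896 | No | 17487 | 69 | 53 | 14481 | 57.138 | 43.889 |
| | 116104 | Yes | 265166 | 2478 | 1957 | | | |
| | | No | | 2026 | 1548 | | | |
| tt + 3j | 105897 | No | 10990 | 96 | 77 | 10102 | 88.240 | 70.776 |
| | 116105 | Yes | 241235 | 3946 | 3022 | | | |
| | | No | | 3469 | 2619 | | | |
| gg → ttbb (QCD) | 116101 | Yes | 89887 | 3550 | 2560 | 483 | 19.070 | 13.752 |
| qq → ttbb (QCD) | 116106 | Yes | 19985 | 496 | 366 | 76.09 | 1.888 | 1.393 |
| gg → ttbb (EWK) | 116100 | Yes | 19987 | 981 | 706 | 47.02 | 2.308 | 1.661 |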
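The "Passing Preselection" and "Sensible States" cross-sections in the table appear to be the total cross-section scaled by the fraction of events surviving each stage. The sketch below reproduces two rows on that assumption; the function name is invented for the example.

```python
def xsec_after_selection(total_xsec_fb, n_total, n_selected):
    """Scale a total cross-section (fb) by the fraction of events selected."""
    return total_xsec_fb * n_selected / n_total

# ttH (total, with pileup): 59948 events, 5449 pass preselection, 3956 sensible
tth_presel = xsec_after_selection(16.07, 59948, 5449)
tth_sensible = xsec_after_selection(16.07, 59948, 3956)

# tt + 0j (sample 105894): 25487 events, 6 pass preselection, 5 sensible
tt0j_presel = xsec_after_selection(24251, 25487, 6)
tt0j_sensible = xsec_after_selection(24251, 25487, 5)

print(f"ttH   : {tth_presel:.3f} fb passing preselection, {tth_sensible:.3f} fb sensible")
print(f"tt+0j : {tt0j_presel:.3f} fb passing preselection, {tt0j_sensible:.3f} fb sensible")
```

Small differences in the last digit with respect to the table come from the rounding of the total cross-section.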

Running the Neural Net

Things to do

  • In the script used to make the webpage showing the results, the reference to H6AONN5MEMLP is hardwired. It should become an argument. It is the name of the method you give TMVA in the training, so if it changes in one place you should be able to change it in the other.

Overview of the process

  1. Computentp takes in the original root files from ${ntuple_area}, and is steered based on steerComputentp.txt (created by genemflat_batch) . Computentp calculates the TrainWeight etc based on nGenForType etc, stored within the root files - no external hard-coding (I think).

  2. There are two main outputs from Computentp:

    1. computentp_output/ contains a file for each root file, containing all the variables for the tree - including weight and trainweight.

    2. trees/ contains all the same information, in one large file.

  3. The training runs on the Computentp output (the single large file), and produces the weights file (weights/) . It uses TrainWeight (branch inside Computentp output).

  4. The templating is what actually produces the NN score plots. It uses one file per signal/background - uses the original root files, and the ANN weight files as produced by the training. Will calculate the scale-factor to apply to each event (also often referred to as the weight) based upon NGenForType etc, which are within the root files.

  5. Using the MCLIMIT program, these ANN outputs are used to generate 1,000 pseudoexperiments (this number is set in the code, and can be adjusted if desired – the variable NPE in genemflat_batch).

    1. For each pseudoexperiment we simulate a Background-only sample of ANN output, which is subjected to both systematic and statistical (Poisson) variations. The Poisson variations are applied twice – once to each individual background, and then again to the sum of the backgrounds. (A possible bug?)

    2. For comparison to these null hypothesis datasets, for both signal and background an array of 10,000 templates is created, each of these templates being subjected to systematic and Poisson variation.

    3. The pseudodata and the array are passed to the function cslimit, which uses them to calculate an exclusion for each pseudoexperiment.

    4. Based on these 1,000 separate exclusions, we produce a final exclusion.
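To make the idea of a background-only pseudoexperiment concrete, here is a toy sketch. It is NOT the MCLIMIT code: it applies only a Poisson (statistical) variation, once per background and per bin, with no systematics, and the template yields are made up. The Poisson sampler uses Knuth's method, which is only suitable for small expected counts.

```python
import math
import random

def poisson(rng, mean):
    """Draw one Poisson-distributed integer (Knuth's method, small means only)."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count

def pseudoexperiment(rng, background_templates):
    """Build one background-only pseudo-data histogram of the ANN output."""
    n_bins = len(next(iter(background_templates.values())))
    data = [0] * n_bins
    for bins in background_templates.values():
        for i, expected in enumerate(bins):
            data[i] += poisson(rng, expected)   # fluctuate each background
    return data

# Hypothetical expected yields per ANN-output bin (illustration only)
templates = {"tt0j": [4.0, 2.5, 1.0], "ttbb": [1.5, 2.0, 3.0]}

NPE = 1000                      # number of pseudoexperiments, as in the text
rng = random.Random(1234)       # fixed seed so the toy is reproducible
pseudo = [pseudoexperiment(rng, templates) for _ in range(NPE)]

mean_bin0 = sum(p[0] for p in pseudo) / NPE
print(f"mean of first bin over {NPE} pseudoexperiments: {mean_bin0:.2f}")
```

The mean of each bin over many pseudoexperiments should approach the summed expectation for that bin (5.5 for the first bin here).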

Setting up

Initial Setup

In the file genemflat_batch_Complete2.sh, you should correct the line:

#PBS -j oe -m e -M a.gemmell@physics.gla.ac.uk

with your e-mail address – this enables the batch system to send you an e-mail informing you of the completion (successful or otherwise) of your job.

Trainweights and Weights

Initial versions of the code had the weights for the ANN hard-coded into the runcards (General Parameter FixWeight in the FlatReader file). Later versions of the code do not need this hard-coding - weights are calculated from values found in the input files themselves, and so FixWeight has been set to 1 (it can still be used to multiply certain samples' cross-sections by a given number, if desired). The formulae to obtain the weights are nevertheless quoted below, so that you can check Computentp's work and make sure that it makes sense.

N.B. These weights are wrong for the ttjj (5212) sample. The input that was produced in v12 of athena was initially produced using MC@NLO. This produces both positive (+1) and negative (-1) weighted events (an easy way to consider this is to consider the negatively weighted events as destructive events, that interfere with the positively weighted events, with the net result of decreasing the cross-section of the process). We considered all events equally for our calculation of the weights, simply considering the total number of events in the files. This problem will disappear when we switch to v15 inputs, where the ttjj samples have been produced using Alpgen.

The first weight to be considered is TrainWeight – the scale factor we multiply each of the background events by so that they are in physically realistic ratios in relation to one another, while enforcing the requirement of the ANN training that we have equal numbers of signal and background events – used for the training of the ANN.

The calculation for the Trainweight is:

(Number of generated signal events / Number of generated events for that background) * (Cross-section for that background / Cross-section for all backgrounds combined)

The next weight is simply called Weight – this is the scale factor used to produce a physically realistic input for the ANN – now with the signal weighted as well. The formula used to find this is simply

(Number of events expected for your desired luminosity) / (Number of events present in input dataset)

Both of these numbers are calculated by Computentp, and can be checked in the file trees/NNInputs_120.root, where they are in branches labelled 'TrainWeight' and 'weight'. However, it should be noted that while TrainWeight as produced by Computentp is used by the ANN in the training sequence, the final results are produced independently of Computentp - the ANN calculates the weight on its own.
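The two formulae can be checked by hand with a short script. This is a sketch only, not Computentp's implementation: it uses the "sensible states" event counts and cross-sections for the with-pileup tt+Nj samples from the table earlier on this page, and which counts Computentp really uses internally is an assumption here. The function names are invented.

```python
def train_weights(n_signal, backgrounds):
    """TrainWeight per background; backgrounds maps name -> (n_events, xsec)."""
    total_xsec = sum(xsec for _, xsec in backgrounds.values())
    return {
        name: (n_signal / n_events) * (xsec / total_xsec)
        for name, (n_events, xsec) in backgrounds.items()
    }

def weight(xsec_fb, lumi_fb, n_events):
    """Weight = events expected for the luminosity / events in the dataset."""
    return xsec_fb * lumi_fb / n_events

n_signal = 3956   # ttH sensible states (with pileup)
backgrounds = {   # name -> (sensible-state events, sensible cross-section in fb)
    "tt0j": (123, 4.758),
    "tt1j": (787, 16.167),
    "tt2j": (1957, 43.889),
    "tt3j": (3022, 70.776),
}

tw = train_weights(n_signal, backgrounds)
for name, value in tw.items():
    print(f"{name}: TrainWeight = {value:.4f}")

# Quick check from the text: sum over backgrounds of N * TrainWeight = N_signal
weighted_sum = sum(n * tw[name] for name, (n, _) in backgrounds.items())
print(f"sum of N * TrainWeight = {weighted_sum:.1f} (signal entries: {n_signal})")

# Weight for tt0j at 1 fb^-1
print(f"tt0j weight at 1 fb^-1 = {weight(4.758, 1.0, 123):.5f}")
```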

FlatReader and FlatPlotter

Then, genemflat_batch_Complete2_SL5.sh creates a file called FlatPlotterATLAStth${prefix}.txt . This file is used in the 'templating' phase of genemflat_batch_Complete2_SL5.sh and is based on the templates provided in teststeerFlatPlotterATLAStthSemileptonic-v15.txt.

And so via the FlatPlotter file the FlatReader files are included in the call:
runFlatPlotter \$steerPlotter ...
which produces a template for each of the signal and individual background samples.

User Setup

To set up the neural net,

  • check out the latest version of the code running framework from subversion (check what this is on trac) using the command

    svn co https://ppesvn.physics.gla.ac.uk/svn/atlas/NNFitter/tags/NNFitter-00-00-0X

  • check out a version of the GlaNtp code into your home directory (or set up genemflat_batch_Complete2_SL5.sh to point at someone else's installation of the code). The procedure for how to do this is described in the next section.

  • ensure you know the ntuple_area variable to be passed in at run-time to genemflat_batch_Complete2_SL5.sh. This will be the directory where the input ntuples are stored (currently /data/atlas07/ahgemmell/NTuple-v15-30Aug for ntuples which have sensible states, events passing preselection and events failing preselection having my_failEvent == 3, 1 and 0 respectively – note that by their very nature, sensible states have also passed preselection).

  • The BASEBATCHDIR is now set automatically to the working directory when the script is executed.

Getting a copy of GlaNtp

  • Set yourself up for access to SVN, e.g.:

    svn proxy init

  • Create the directory where you want to set up your copy, and copy in the setup script:

mkdir /home/ahgemmell/GlaNtp
cd /home/ahgemmell/GlaNtp
cp /home/stdenis/GlaNtpScript.sh .

  • To avoid any potential problems with previously set up aliases etc, clear them all (you will need to do this during the running of the code, so you might as well copy it into your directory):

cp /home/stdenis/atlas/testGlaNtp/cleanpath3.sh .
source cleanpath3.sh

  • Make a directory to hold the code itself:

    mkdir GlaNtpPackage

  • GlaNtpScript.sh not only checks out and compiles the code, it also then goes and validates it. You need to define the input for this:

export GLANTP_DATA=/data/cdf01/stdenis/GlaNtpData

  • You now run the script, specifying whether you want a specific tag (e.g. 00-00-10), or just from the head of the trunk (h) so you're more free to play around with it. It's always a good idea to check out a specific tag, so that whatever you do to the head, you can still run over a valid release.

./GlaNtpScript.sh SVN 00-00-10

  • This will check out everything, and run a few simple validations - the final output should look like this (i.e. don't be worried that not everything seems to have passed validation!):

HwwFlatFitATLAS Validation succeeded
Done with core tests
Result of UtilBase                          validation:  NOT DONE: NEED
Result of Steer                             validation:  OK
Result of StringStringSet                   validation:  OK
Result of StringIntMap                      validation:  OK
Result of ItemCategoryMap                   validation:  OK
Result of FlatSystematic                    validation:  OK
Result of LJMetValues                       validation:  OK
Result of PhysicsProc                       validation:  OK
Result of FlatNonTriggerableFakeScale       validation:  OK
Result of FlatProcessInfo                   validation:  OK
Result of PaletteList                       validation:  OK
Result of CutInterface                      validation:  NOT DONE: NEED
Result of NNWeight                          validation:  NOT DONE: NEED
Result of FlatFileMetadata                  validation:  OK
Result of FlatFileMetadataContainer         validation:  OK
Result of Masks                             validation:  NOT DONE: NEED
Result of FFMetadata                        validation:  OK
Result of RUtil                             validation:  NOT DONE: NEED
Result of HistHolder                        validation:  NOT DONE: NEED
Result of GlaFlatFitCDF                     validation:  OK
Result of GlaFlatFitBigSysTableCDF          validation:  OK
Result of GlaFlatFitBigSysTableNoScalingCDF validation:  OK
Result of GlaFlatFitATLAS                   validation:  OK
Result of FlatTuple                         validation:  OK
Result of FlatReWeight                      validation:  OK
Result of FlatReWeight_global               validation:  OK
Result of FlatReWeightMVA                   validation:  OK
Result of FlatReWeightMVA_global            validation:  OK
Result of TreeSpecGenerator                 validation:  OK
Result of FlatAscii                         validation:  OK
Result of FlatAscii_global                  validation:  OK
Result of FlatTRntp                         validation:  OK

Variables used by the GlaNtp package

The variables used by the package can be divided into two sets. The first are those variables that are constant throughout the sample - the 'global' variables (e.g. the cross-section of the sample). These can be specified in their own tree, where they will be recorded (and read by GlaNtp) once only. If desired, these variables can instead be defined within the main tree of the input file - however, they will then be recorded once per event, and read in once per event. This is obviously a bit wasteful, but for historical reasons it can be done. To choose between these behaviours, set LoadGlobalOnEachEvent in FlatPlotter and FlatReader to 1 for the global variables to be read in on an event-by-event basis, or to 0 for them to be read in once from the global tree (or from the first event only). For more information on this switch, refer to this.

The other variables are those that change on an event-by-event basis. These include both the variables we are going to train the Neural Net on (more information on those is given in the relevant section of this TWiki), and other useful variables, such as filter flags (that tell GlaNtp whether an event is sensible or not). All of these variables are listed in the file VariableTreeToNTPATLASttHSemiLeptonic-v15.txt

The file maps logical values to their branch/leaf. The tree can be the global tree or the event tree.

GeneralParameter string 1 FlatTupleVar/<variable_name>=<tree>/<variable_name_in_tree>

Also specified are the name of the leaf for the cutmask and invert word -- these are global values for a file.

GeneralParameter string 1 CutMaskString=cutMask
GeneralParameter string 1 InvertWordString=invertWord

There's also something about

ListParameter   EvInfoTree:1  1 NN_BJetWeight_Jet1:NN_BJetWeight_Jet1/NN_BJetWeight_Jet1
that I need to ask Rick about...

Variables used for training the Neural Net

The list of variables on which the neural net is to train is set in the shell script, under TMVAvarset.txt (this file is created when the script runs). At present, these variables are:

The b-weights for the six 'leading' jets - currently the jets are ranked according to their b-weights, but it is possible to rank them according to pT and energy. The decision about how to rank them is made in the AOD -> NTuple stage:
NN_BJetWeight_Jet1
NN_BJetWeight_Jet2
NN_BJetWeight_Jet3
NN_BJetWeight_Jet4
NN_BJetWeight_Jet5
NN_BJetWeight_Jet6

The masses and pT of the various jet combinations (only considering the four 'top' jets - i.e. if ranked by b-weights, the jets that we expect to really be b-jets in our signal):
NN_BJet12_M
NN_BJet13_M
NN_BJet14_M
NN_BJet23_M
NN_BJet24_M
NN_BJet34_M
NN_BJet12_Pt
NN_BJet13_Pt
NN_BJet14_Pt
NN_BJet23_Pt
NN_BJet24_Pt
NN_BJet34_Pt

The sums of the eT of the two reconstructed tops, for each of the top three states:
NN_State1_SumTopEt
NN_State2_SumTopEt
NN_State3_SumTopEt

And the differences between the eta and phi of the two reconstructed tops, again from the top three states:
NN_State1_DiffTopEta
NN_State2_DiffTopEta
NN_State3_DiffTopEta
NN_State1_DiffTopPhi
NN_State2_DiffTopPhi
NN_State3_DiffTopPhi
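The variable names above follow a regular pattern, so the full list can be generated rather than typed out. The sketch below simply rebuilds the names listed in this section; how genemflat actually writes TMVAvarset.txt is not shown here, and the script is only an illustration.

```python
from itertools import combinations

# b-weights of the six leading jets
bweights = [f"NN_BJetWeight_Jet{i}" for i in range(1, 7)]

# invariant mass and pT of every pair drawn from the four 'top' jets
pairs = list(combinations(range(1, 5), 2))   # (1,2), (1,3), ..., (3,4)
masses = [f"NN_BJet{a}{b}_M" for a, b in pairs]
pts = [f"NN_BJet{a}{b}_Pt" for a, b in pairs]

# per-state quantities for the top three states
sum_et = [f"NN_State{s}_SumTopEt" for s in range(1, 4)]
diff_eta = [f"NN_State{s}_DiffTopEta" for s in range(1, 4)]
diff_phi = [f"NN_State{s}_DiffTopPhi" for s in range(1, 4)]

variables = bweights + masses + pts + sum_et + diff_eta + diff_phi

print(len(variables), "training variables")
for name in variables:
    print(name)
```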

You also need to provide addresses to the Neural Net so that it can find the variables in the input trees. This is done inside VariableTreeToNTPATLASttHSemiLeptonic-v15.txt

ListParameter   EvInfoTree:1  1 NN_BJetWeight_Jet1:NN_BJetWeight_Jet1/NN_BJetWeight_Jet1

Currently all information is in the EvInfoTree, which provides event level information. However, future work will involve trying to establish a GlobalInfoTree, which contains information about the entire sample, such as cross-section - this will only need to be loaded once, and saves having to write the same information into the tree repeatedly, and subsequently reading it repeatedly.

Variable Weights in the Neural Net

To set up a neural net for the analysis of a particular kind of data it is necessary to train it with sample data; this process will adjust the "weights" on each variable that the neural net analyses in the ntuple, in order to optimise performance. These weights can then be viewed as a scatter plot in ROOT.

Specifying files as Signal/Background

The input datasets need to be specified in a number of peripheral files, so that the ANN can distinguish between signal and background files. Errors for each process also need to be specified - how this is done is detailed in that section. The relevant files for adding processes are atlastth_histlist_flat-v15.txt, AtlasttHRealTitles.txt, FlatAtlastthPhysicsProc1.txt and FlatSysSetAtlastth1.txt. There are also some files that are produced through the action of genemflat_batch_Complete2_SL5.sh. At several points in these files, there are common structures for inputting data, relating to ListParameter and ColumnParameter:

ListParameter <tag> <onoff> <colon-separated-parameter-list>

<onoff> - specifies whether this parameter will be taken into consideration (1) or ignored (0) - generally this should be set to 1.
<tag> and <colon-separated-parameter-list> - vary from process to process, and will be explained for individual cases below. There can only be one instance of a <tag> active at any one time (i.e. you can write more than one version, but only one can be taken into consideration).

ColumnParameter <tag> <sequence> <keyword=doubleValue:keyword=doubleValue...>

The expression <tag>:<sequence> must be unique, e.g.

ColumnParameter   File         0 OnOff=0:SorB=0:Process=Data
ColumnParameter   File         1 OnOff=1:SorB=0:Process=Fake

where <tag> is the same, but <sequence> is different. The fact that the <sequence> carries meaning is specific to the implementation. Note that all of the values passed from ColumnParameter will eventually be evaluated as Doubles. Where you pass a string (as for 'Process' above), it is not actually passed to the code - such entries are there to make the steering files more easily readable by puny humans, who comprehend the meaning of strings more readily than Doubles.
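When editing these files by hand it can help to check them with a small script. The following is a sketch of a stand-alone checker, not GlaNtp's own parser (whose behaviour may differ): it splits a ColumnParameter line into its pieces and verifies that each <tag>:<sequence> combination appears only once. The function names are made up for the example.

```python
def parse_column_parameter(line):
    """Split 'ColumnParameter <tag> <sequence> <k=v:k=v...>' into its parts."""
    keyword, tag, sequence, payload = line.strip().split(None, 3)
    if keyword != "ColumnParameter":
        raise ValueError(f"not a ColumnParameter line: {line!r}")
    values = dict(item.split("=", 1) for item in payload.split(":"))
    return tag, int(sequence), values

def check_unique(lines):
    """Raise if any <tag>:<sequence> combination is repeated."""
    seen = set()
    for line in lines:
        tag, sequence, _ = parse_column_parameter(line)
        if (tag, sequence) in seen:
            raise ValueError(f"duplicate {tag}:{sequence}")
        seen.add((tag, sequence))
    return True

lines = [
    "ColumnParameter   File         0 OnOff=0:SorB=0:Process=Data",
    "ColumnParameter   File         1 OnOff=1:SorB=0:Process=Fake",
]

for line in lines:
    print(parse_column_parameter(line))
print("unique:", check_unique(lines))
```

Values are kept as strings here so that human-readable entries such as Process=Fake survive; converting the numeric ones to floats is left to the caller.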

atlastth_hislist_flat-v15.txt

This file provides a map for the ANN, giving it the output file names (and the directory in which they are to be stored, relative to ${template_area} - set in genemflat) and the tree structure where the final result of the ANN will be stored in the output (in the example below, the output file is ${template_area}/116102-filter.root, and the result graph will be FlatPlotter/NNScoreAny_0_0_0 0). The number to the left of the file name indicates which process it is - this is established using a file called TMVAsteer.txt (which is created through the running of genemflat_batch_Complete2_SL5.sh), and corresponds to the variable my_Eventtype in the input files (this can also be influenced in genemflat).

0 116102-filter.root    FlatPlotter/NNScoreAny_0_0_0 0

AtlasttHRealTitles.txt

The list of signal/background processes can be found in AtlasttHRealTitles.txt (where the names are specified and associated with numbers). At present these are:

Process_0_0 TTjj:Semileptonic
Process_1_0 ttH:Semileptonic
Process_2_0 EWK:Semileptonic
Process_3_0 QCD:Semileptonic

genemflat_batch_Complete2_SL5.sh

genemflat creates the file TMVAsteer.txt, which sets a number of parameters for the running of the ANN - the constraints on the events, the precise Neural Net structure and so on - for establishing the input files, we are interested in only a couple of these parameters:

GeneralParameter  string      1 FileString=my_Eventtype

Indicates the leaf in the input file which shows which process the event belongs to - this is the same number as we've specified in atlastth_hislist_flat-v15.txt and AtlasttHRealTitles.txt.

ColumnParameter   File         1 OnOff=1:SorB=1:Process=tth

The number before the switches (OnOff, SorB, etc - in this case it is 1) corresponds to the number given in AtlasttHRealTitles.txt. The other numbers are self-explanatory - they establish if that file is to be used, if it is signal or background (1=signal, 0=background) and the name of the process. In this instance, the Process name is just a comment for your own elucidation - it is not used itself in the code, so does not necessarily have to correspond to the process names as provided in AtlasttHRealTitles.txt (though of course it is useful for them to be similar).

The other file that is produced by genemflat that specifies the input files for Computentp is steerComputentp.txt

# Specify the known metadata
ListParameter SignalProcessList 1 Alistair_tth
ListParameter  Process:Alistair_tth       1 Filename:${ntuple_area}/ttH-v15.root:File:${mh}:IntLumi:1.0

This is just a list of the various input files, and we specify the integrated luminosity. The 'File' parameter is only used for book-keeping by Computentp, and does not have to correspond to the file numbers used in the ANN steering files (or to my_Eventtype), but for sanity's sake it is probably best to keep things consistent. We make an exception for the signal - we assign it the number ${mh} - so that we can keep track of things if we have different mass Higgs in our signals.

#  Map of input file name to output file name: The ComputentpOutput will have a sed used to get the right mapping.
ListParameter  InputOutputMapName:1  1 ${ntuple_area}/ttH-v15.root:${Computentpoutput}/tth_NNinput.root

The InputOutputMapName is a list of integers - this doesn't have to bear any relevance to any numbers that have gone before - just give each output a unique number. This is followed by the mapping of input file names provided, to the output names that Computentp will produce.

FlatAtlastthPhysicsProc1.txt

This file contains various parameters:

ColumnParameter BackgroundList 0 tt0j=0
ColumnParameter SignalList     1 ttH=1

Here you specify once again the numbers assigned to the processes by my_Eventtype (for tt0j it equals zero), and list things as BackgroundList or SignalList. The number after 'BackgroundList' or 'SignalList' is unique for each process (to preserve the uniqueness of <tag>:<sequence>), but does not need to correspond to my_Eventtype. However, for completeness' sake within this file I have set it as such. The number at the end of this declaration (tt0j=0 in this case) needs to be sequential - it instructs the net of the order in which to process the samples, so it must go from 0 to n-1 (when you have n samples).
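The two constraints described here (unique <tag>:<sequence> pairs, and final numbers running from 0 to n-1) are easy to get wrong when adding samples. A minimal sketch of a sanity check is given below. The function check_process_order and its tuple layout are hypothetical, and the sketch assumes the 0 to n-1 requirement applies across the background and signal lists together, as the two-sample example suggests.

```python
def check_process_order(entries):
    """entries: list of (list_name, sequence, process, order) tuples.

    Hypothetical check, not part of the package.
    """
    # Each <tag>:<sequence> pair may appear only once.
    keys = [(name, seq) for name, seq, _, _ in entries]
    if len(keys) != len(set(keys)):
        raise ValueError("duplicate <tag>:<sequence> pair")
    # The trailing numbers must be exactly 0, 1, ..., n-1.
    orders = sorted(order for _, _, _, order in entries)
    if orders != list(range(len(entries))):
        raise ValueError("process numbers must run from 0 to n-1, got %s" % orders)
    return True


entries = [
    ("BackgroundList", 0, "tt0j", 0),
    ("SignalList", 1, "ttH", 1),
]
print(check_process_order(entries))
```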

ColumnParameter PseudoDataList 0 tt0j=0

This is simply a restatement of the BackgroundList (as we're looking for exclusion, the pseudodata is background only) - the same numbers in the same place. This list specifies the processes included in the pseudoexperiments, and therefore the signal process is not included in this list.

ListParameter     ProcessLabels:1 1 tt0j:t#bar{t}0j

The number after ProcessLabels again doesn't correspond to my_Eventtype - I have made it the same as the number after BackgroundList/SignalList and PseudoDataList. The important feature from this is that it tells the ANN what to label each of the various processes as in the results plots. Again, the numbers must run from 0 to n-1.

ColumnParameter UCSDPalette 0    tt0j=19
ColumnParameter PrimaryColorPalette 0    tt0j=0

These two parameters specify the colours used in the plotting for each of the processes (the numbers correspond to those in the Color Wheel of TColor). The numbers after UCSDPalette and PrimaryColorPalette are the same ones as have been used previously in this file. Whether the plotting uses the colours stated in UCSDPalette or PrimaryColorPalette is determined in the file flatsteerStackNNAtlas.txt by setting the parameter:

GeneralParameter  string       1 Palette=UCSDPalette

The final parameter to be set in FlatAtlastthPhysicsProc1.txt is:

ColumnParameter ProcessOrder 0 tt0j=0

Once again, the number on its own (in this case 0) is the same as the other such instances in this file. The final number (zero in this case) is the order in which this process should be plotted - i.e. in this case, the tt0j sample will be plotted first in the output, with the other samples piled on top of it. This number obviously does not need to correspond to my_Eventtype.

FlatSysSetAtlastth1.txt

This file contains all the information on the errors that you pass to the ANN so that it can work out how the errors propagate to the final plots and answers, and so most of the details of this file will be covered in that section. The basic format of the file is:

ColumnParameter   Combine:Lumi   0  OnOff=1:Low=-0.11:High=0.11:Channel=1:Process=TTjj

The <sequence> parameter (in this case '0') is there so that you can specify the parameters for a given error for multiple channels, without falling foul of the uniqueness requirement for <tag>:<sequence>. We have chosen it so that it equals my_Eventtype for that process. 'Channel' is present just in case you're considering multiple channels. We're only considering the one channel in this case (SemiLeptonic). The final parameter (Process) is not actually used - the second parameter tells the ANN which errors are which, but this isn't very easily read by you, so feel free to add it in to help you keep track of the various errors! These final few parameters can be placed in any order, so long as they are separated by colons.

Passing preselection and sensible states

Many of the generated events in our samples will not pass the preselection cuts we would use in our final analysis. Sometimes passing preselection requires some mistakes on the part of the reconstruction (e.g. tt + 0j); at other times failing preselection requires either the final state particles to be inherently unsuitable for our reconstruction, or to be mis-reconstructed. However, even if an event passes preselection it is possible that the event as reconstructed gives a nonsensical final state - for example, it might not be possible to combine the light jets in such a way as to give a reasonable value of the W mass. Based on a few simple mass cuts, an event passing preselection can be determined to have a sensible state or not.

Currently, the type of event you are looking at is determined by looking at my_failEvent. States failing preselection have this equal to 0, passing preselection but not having a sensible final state equal 1 and passing preselection and having a sensible final state equal 3. These numbers are the basis of a number of bitwise tests - thus when setting your own my_failEvents, consider which bits in a binary string you want to represent various things, and then convert those to decimal.
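The values 0, 1 and 3 can be read as a two-bit word. The sketch below shows one way to test them bitwise. The bit names are my own labels (an assumption inferred from the three values quoted above, not identifiers from the code), and the function describe is hypothetical.

```python
# Assumed bit meanings, inferred from the values 0, 1 and 3 quoted above.
PASSES_PRESELECTION = 0b01  # bit 0: event passed preselection
SENSIBLE_STATE = 0b10       # bit 1: event also has a sensible final state


def describe(fail_event):
    """Classify an event from its my_failEvent value using bitwise tests."""
    if not fail_event & PASSES_PRESELECTION:
        return "fails preselection"
    if fail_event & SENSIBLE_STATE:
        return "passes preselection, sensible state"
    return "passes preselection, no sensible state"


for value in (0, 1, 3):
    print(value, describe(value))

# When defining your own flag, pick the bits first and then convert to decimal.
print(int("101", 2))
```

The last line illustrates the conversion step: a word with bits 0 and 2 set is written as the decimal number that int("101", 2) returns.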

Setting Systematic Uncertainties

The fitting code can take into account two different types of systematic uncertainty - rate and shape. The basic method to obtain both is to make your input samples for the nominal case and for the two bounds of a given error (e.g. Initial State Radiation, ISR). Repeat this for all of the errors you wish to consider. The rate systematic uncertainty is simply how the number of events passing your preselection cuts etc. changes (you can consider only this, if you like). To obtain the shape uncertainty, you should pass each of the resulting datasets through the ANN (up to and including the templating, so that you have ANN results both for the nominal sample and for the results of varying each error). These ANN outputs can then be used to produce the rate uncertainties based on their integrals, before being normalised to the nominal cross-section so as to find the shape uncertainty - a measure of the percentage change in the bin-by-bin distribution for each error.
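As an illustration of the difference between the two quantities, the sketch below computes a rate change from the integrals of two binned ANN outputs, then normalises the varied histogram to the nominal integral to get a bin-by-bin shape change. The numbers are invented toy values and the function rate_and_shape is hypothetical; this is not how the fitting code itself is implemented.

```python
def rate_and_shape(nominal, varied):
    """Return (fractional rate change, per-bin fractional shape change).

    nominal, varied: lists of bin contents with the same binning.
    Toy illustration only.
    """
    n_nom = sum(nominal)
    n_var = sum(varied)
    # Rate: fractional change in the integral.
    rate = (n_var - n_nom) / n_nom
    # Shape: scale the varied histogram to the nominal integral, then
    # compare bin by bin (empty nominal bins are assigned zero change).
    scale = n_nom / n_var
    shape = [
        (v * scale - n) / n if n else 0.0
        for n, v in zip(nominal, varied)
    ]
    return rate, shape


nominal = [10.0, 20.0, 30.0, 40.0]  # toy nominal ANN output
varied = [12.0, 22.0, 33.0, 43.0]   # toy output for one error bound
rate, shape = rate_and_shape(nominal, varied)
print(rate, shape)
```

By construction the shape changes carry no net rate change: reweighting the nominal bins by (1 + shape) gives back the nominal integral.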

The fitting code is passed the relevant information about errors through the use of a number of files, but in the simplest case (when shape uncertainties are not being considered), there are only two: FlatSysSetAtlastth1.txt and SysNamesAtlastth1.txt. The basic call to the fitting code is in genemflat_batch_Complete2_SL5.sh:

sysfile=FlatSysSetAtlastth.txt
steerfile=FlatFitSteer.txt
mkdir -p templates/fit
rm -f templates/fit/out_${mh}.log
Fit ${basehistlistname} ${template_area}/ \$sysfile \$steerfile $mh > templates/fit/out_${mh}.log

The final call is rendered in the actual job file (e.g. run114) as

Fit /home/ahgemmell/NNFitter-00-00-09-Edited/NNTraining/atlastth_histlist_flat-v15.txt templates/tth120/ $sysfile $steerfile 120 > templates/fit/out_120.log

If you want to save time (by not having to run templating for every error you wish to consider), you can instead only consider the rate uncertainties, and provide these as fractional changes to the rate, specified in FlatSysSetAtlastth1.txt. Whether or not you consider shape uncertainties is controlled by a couple of parameters in the steering file FlatFitSteer.txt (which is created by the action of genemflat_batch_Complete2_SL5.sh):

GeneralParameter  bool        1 UseShape=0
GeneralParameter  bool        1 UseShapeMean=0

Setting UseShape=1 means shape uncertainties will be taken into account for all the uncertainties for which you provide the extra steering files and ANN scores. UseShapeMean=1 means that the ANN results for your various uncertainties will be used to produce the rate uncertainties based on their integrals, rather than on the numbers provided in FlatSysSetAtlastth1.txt. Using the relative sizes of the integrals of the ANN output as an estimator of the rate uncertainty can be useful if you don't want to be subject to statistical variations in the computation of your systematic uncertainties (if UseShapeMean=0, the systematic rate uncertainty is calculated as a fractional change on the nominal rate). Considering shape uncertainties requires more steering files, and this will be detailed later.

FlatSysSetAtlastth1.txt

ColumnParameter   Combine:Lumi   0  OnOff=1:Low=-0.11:High=0.11:Channel=1:Process=TTjj

The first parameter consists of two parts in this example: 'Combine' and 'Lumi'. The second part is the name of the uncertainty being considered. The first part 'Combine' (and the associated colon between them) is optional. It tells the ANN that the uncertainties thus labelled are independent of each other, and can be added in quadrature. 'OnOff' obviously tells the ANN to consider that uncertainty (1) or not (0). 'Low' and 'High' establish the relevant bounds of the uncertainty as fractions of the total (however, for the ANN these uncertainties are symmetrised, so to save time they are here assumed to be symmetric unless stated elsewhere) - note that these are not the uncertainties on the quantity, but rather the effect of that uncertainty on the rate of your process. Process is not actually read by the ANN, but is there to make the whole thing more human-friendly to read. The current errors and their bounds are below. If no source for these error bounds is given, then they were the defaults found in the files from time immemorial (where necessary I assumed that all tt + X errors were the same, as were all ttbb (QCD) errors, since in the original files the only samples considered were ttjj, ttbb(EWK), ttbb(QCD) and ttH - these errors probably originate from the CSC note). If you are only considering rate uncertainties, this is where the fitting code will find the relevant numbers.

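| Error | Combined? | Process | Upper/Lower Bound | Source |
| Luminosity | Yes | All | 11% | https://twiki.cern.ch/twiki/bin/view/AtlasProtected/TopSystematicUncertainties15 |
| Trigger | Yes | tt + X | 1.5% | |
| | | ttbb (EWK) | 1.4% | |
| | | ttbb (QCD) | 1.3% | |
| | | ttH | 1.5% | |
| Lepton ID | Yes | Backgrounds | 0.3% | |
| | | Signal | 0.6% | |
| MET | No | All | 1.0% | |
| NLO Acceptance | No | tt + X | 5.5% | |
| | | Others | 10% | |
| X-Section | No | All | 10% | |
| PDF | No | tt + X | 1.9% | |
| | | ttbb | 2.7% | |
| | | ttH | 2.2% | |
| b-tagging | No | Backgrounds | 20% | |
| | | Signal | 16% | |
| JES | No | Backgrounds | 5.0% | |
| | | Signal | 9.0% | |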
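To illustrate what "added in quadrature" means for the errors marked as combined, the sketch below combines the luminosity, trigger and lepton ID bounds quoted for the tt + X background. The helper names are hypothetical, and symmetrising by averaging the magnitudes of Low and High is my assumption for the example; the fitting code may symmetrise differently.

```python
import math


def symmetrise(low, high):
    """Assumed symmetrisation: the mean of the two magnitudes."""
    return 0.5 * (abs(low) + abs(high))


def combine_in_quadrature(fractions):
    """Add independent fractional uncertainties in quadrature."""
    return math.sqrt(sum(f * f for f in fractions))


# tt + X entries flagged 'Combine': Lumi (Low=-0.11, High=0.11),
# Trigger (1.5%) and Lepton ID for backgrounds (0.3%).
combined_ttx = combine_in_quadrature([symmetrise(-0.11, 0.11), 0.015, 0.003])
print(combined_ttx)
```

The combined value is dominated by the luminosity term, which is why the three errors can reasonably share a single entry in the fit.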

SysNamesAtlastth1.txt

ListParameter   SysInfoToSysMap:1 1  Combine:LumiTrigLepID

The number in the <tag> after SysInfoToSysMap is unique for each error (in this case it goes from one to eight). There is one entry per error considered, apart from the cases where the errors are combined in quadrature (as specified in FlatSysSetAtlastth1.txt), where they are given one entry to share between them. The <colon-separated-parameter-list> provides a map between the name of the errors as considered by FlatSysSetAtlastth1.txt (the errors combined in quadrature are lumped together under the name 'Combine'), and something more human-readable. The human-readable names are what will be written out by the fitting code (which identifies each error based on numbers, rather than the names in FlatSysSetAtlastth1.txt) when it is producing its logfile. Obviously there is often not much change between the two names, apart from in the case of Combined errors.

Including shape uncertainties

For the sake of argument, we shall pretend to only be considering the one error overall. It is possible to consider rate errors independently of shape uncertainties (by setting UseShapeMean=0 in FlatFitSteer.txt) - this might be useful and quicker to run if a given error produces a large rate error, but the change to the shape of the ANN distribution is minimal (you can have a look at the ANN results yourself and make your own judgements). If you are not considering a given shape uncertainty but you are considering the rate uncertainty, all that needs to be done is to not produce the relevant steering files.

In this example, we will already have run three ANN templating steps - run1 (the nominal run), run2 (the results of taking the lower bound of the error) and run3 (the results of taking the higher bound of the error). You can then move into a new directory (e.g. run1_2) in which you want to perform the fitting, and at the very least set UseShape=1 in FlatFitSteer.txt (also perhaps setting UseShapeMean=1). This requires some changes to the call to the fitting code and atlastth_histlist_flat-v15.txt so that the combination of ${template_area} and the filenames given in atlastth_histlist_flat-v15.txt still point toward the ANN template files you wish to consider - as shown in the two lines below (the first from genemflat establishing ${template_area}, the second from atlastth_histlist_flat-v15.txt establishing the filename for the ANN template):

template_area=templates/${process}${mh}
0 116102-filter.root    FlatPlotter/NNScoreAny_0_0_0 0

could become:

template_area=${MAINDIR}
0 run1/templates/tth120/116102-filter.root    FlatPlotter/NNScoreAny_0_0_0 0

This ensures that atlastth_histlist_flat-v15.txt will still point toward the ANN templates from the nominal run. You must now create additional steering files to point toward the high and low error ANN templates - their names are of the format:

"ShapePos_"+errorname+"_"+HistOutput
"ShapeNeg_"+errorname+"_"+HistOutput

where HistOutput is atlastth_histlist_flat-v15.txt and errorname is the human-readable error name, as defined in SysNamesAtlastth1.txt. You also need to change the ${basehistlistname} in the call to the fitting code so that it points directly at atlastth_histlist_flat-v15.txt, with no preceding directory structure - the code bases the names of the two extra shape steering files on this argument, and will not take into account any directories in the argument. (So if ${basehistlistname} were directory/file.txt, the fitting code would look for the extra steering files with the name ShapePos_ISR_directory/file.txt in the case of ISR being our error.)
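The naming rule can be sketched in a couple of lines. The function shape_steering_names is hypothetical and simply performs the string concatenation quoted above; it shows why a directory prefix in the argument gives an unhelpful file name.

```python
def shape_steering_names(error_name, hist_output):
    """Build the two shape steering file names by plain concatenation."""
    return (
        "ShapePos_" + error_name + "_" + hist_output,
        "ShapeNeg_" + error_name + "_" + hist_output,
    )


# Bare file name: the names land next to the nominal histlist file.
print(shape_steering_names("ISR", "atlastth_histlist_flat-v15.txt"))

# With a directory prefix the '/' ends up in the middle of the name.
print(shape_steering_names("ISR", "directory/file.txt"))
```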

Filters

It is possible for the inputs to the ANN to contain more events than those you want to pass on for processing. We only want to train the ANN on those events that would pass our preselection cuts - general cleaning cuts and the like. (There was a previous version of our inputs where we also required 'sensible states' - for each candidate event we required it to reconstruct tops and Ws with vaguely realistic masses. However, this is a Neural Net analysis, so it has been decided to remove these cuts - they will after all in effect be reintroduced by the net itself if they would have been useful, and by not applying them ourselves, we are passing more information to the net.) We therefore have filters so that Computentp and the ANN only look at events of our choosing. These filters take the form of various bitwise tests in TMVAsteer.txt (created in genemflat_batch_Complete2.sh) (not currently used, as explained below) and TreeSpecATLAStth.txt.

TreeSpecATLAStth.txt

In TreeSpecATLAStth.txt, we establish the filters which control what is used for the templating, and Computentp:

ListParameter SpecifyVariable:Higgs:cutMask 1 Type:int:Default:3
ListParameter SpecifyVariable:Higgs:invertWord 1 Type:int:Default:0

InvertWord is used to invert the relevant bits (in this case no bits are inverted) before the cut from cutMask is applied. The cutMask will exclude from templating those events where the matching bits are equal to zero AFTER the inversion. So here, with no inversion applied, those events with my_failEvent == 3 will be used for templating.
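One possible reading of this filter is sketched below. It is an assumption on my part, not taken from the code: the inversion is modelled as an exclusive-or with invertWord, and an event is kept only if every bit selected by cutMask is set afterwards. The function passes_filter is hypothetical. With the defaults above this keeps only events with both low bits set, consistent with the statement that my_failEvent == 3 is used for templating.

```python
def passes_filter(fail_event, cut_mask=3, invert_word=0):
    """Assumed filter logic: invert selected bits, then require all masked bits.

    Illustrative sketch only; the real implementation may differ.
    """
    flipped = fail_event ^ invert_word       # invert the bits named in invert_word
    return (flipped & cut_mask) == cut_mask  # any masked bit equal to zero -> excluded


for value in (0, 1, 3):
    print(value, passes_filter(value))

# Inverting both bits would instead select events where neither bit was set.
print(passes_filter(0, cut_mask=3, invert_word=3))
```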

TMVAsteer.txt (genemflat_batch_Complete2.sh)

GeneralParameter string 1 Constraint=(my_failEvent&3)==3

This controls the events used in the training, using a bitwise comparison. If the constraint is true (i.e. the first two bits are set, and not equal to zero), then the event is used for training. This filter is not used currently, as training of the net takes place based on the Computentp output - this Computentp output only contains sensible states (as specified in the TreeSpecATLAStth.txt file's filter). If further filtering is required, then care must be taken to ensure that my_failEvent (or whatever you wish to base your filter on) is specified in the VariableTreeToNTP file, so that Computentp will copy it into its output.

**If USEHILOSB is set to 1 then && must be appended to cut criteria, e.g. GeneralParameter string 1 Constraint=(my_failEvent&65536)==0&&. This is because USEHILOSB adds more constraints.**

Running

To run the script, first log into the batch system (ppepbs).

The genemflat_batch_Complete2_SL5.sh script can be executed with the command:

./genemflat_batch_Complete2_SL5.sh 12 480 1.0 tth 120 120 6 /data/atlas07/ahgemmell/NTuple-v15-30Aug 

These options denote:

  • 12 is the run number

  • 480 is the jobstart - this is a potentially redundant parameter to do with the PBS queue.

  • 1.0 is the integrated luminosity the samples will be normalised to (in fb^-1).

  • tth is the process type - aim to develop this to incorporate other processes, e.g. lbb

  • 120 is the min. Higgs mass

  • 120 is the max. Higgs mass

  • 6 is the number of jets in the events you want to run over (i.e. this is an exclusive 6 jet analysis - events with 7 jets are excluded)

  • /data/atlas07/ahgemmell/NTuple-v15-30Aug is the directory where the input ntuples are located (with my_failEvent bits set for sensible states (65536, for >0 sensible states) and for b-tagging (131072, for 4 tight b-tagged jets))
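As an aide-memoire for the argument order, the sketch below pairs each positional argument of the example command with a label. The label names are my own (they are not necessarily the variable names used inside the script), and label_arguments is a hypothetical helper that only splits the command string; it does not run anything.

```python
# Hypothetical labels for the eight positional arguments, in order.
ARGUMENT_NAMES = [
    "run", "jobstart", "lumi", "process",
    "mh_min", "mh_max", "njets", "ntuple_area",
]


def label_arguments(command):
    """Pair each positional argument of the command line with a label."""
    parts = command.split()
    script, values = parts[0], parts[1:]
    if len(values) != len(ARGUMENT_NAMES):
        raise ValueError("expected %d arguments, got %d"
                         % (len(ARGUMENT_NAMES), len(values)))
    return script, dict(zip(ARGUMENT_NAMES, values))


script, args = label_arguments(
    "./genemflat_batch_Complete2_SL5.sh 12 480 1.0 tth 120 120 6 "
    "/data/atlas07/ahgemmell/NTuple-v15-30Aug"
)
for name in ARGUMENT_NAMES:
    print(name, "=", args[name])
```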

Once the job has been completed you will receive an email summarising the outcome.

Running:

  • Creates a run12 subdirectory in working directory and makes it the working directory

  • Creates TMVAsteer.txt - writes fitting parameters to it

  • NN structure is set ( H6AONN5MEMLP MLP 1 H:!V:NCycles=1000:HiddenLayers=N+1,N:RandomSeed=9876543). This line sets up two hidden layers with N+1 and N neurons respectively (where N is the number of input variables).

  • Training cycles (1000) and hidden layers - N+1?

  • 4 text steer files are copied into the run directory for templating

  • 2 text steer files are copied into the run directory for stacking plots.

  • 2 lines of text are appended to a temporary copy of flatsteerStackNNAtlas.txt:
    GeneralParameter string 1 HWW=tth-TMVA
    GeneralParameter double 1 IntLumi=${lumi}

Contents of training file:

  • jetmin/jetmax - These seem to be redundant. Commented out, effective as of v.3

  • zmin/zmax (1/2) - what is their function?

  • weighting = TrainWeight - is this redundant?

  • TMVAvarset.txt - input variable set

Other switches to influence the running

In genemflat_batch_Complete2_SL5.sh, at the start of the file there are a number of switches established:

# Flags to limit the scope of the run if desired
Computentps=1
DoTraining=1
ComputeTMVA=1
DoTemplates=1
DoStackedPlots=1
DoFit=1

These control whether or not various parts of the code are run - the names of the flags are pretty self-explanatory about what parts of the code they control. For example, it is possible to omit the training in subsequent (templating) runs, if it has previously been done. This shortens the run time significantly.

***NOTE*** The flags DoTraining and DoTemplates had previously (until release 00-00-21) been set on the command line. They were moved from the command line when the other flags were introduced.

Where the output is stored

  1. trees/NNInputs_120.root

    The output from Computentp - it is a copy of all of the input datasets, with the addition of the variables TrainWeight and weight.

  2. computentp_output/

    The same as for trees/NNInputs_120.root, but now each individual input dataset has its own appropriately named file.

  3. weights/TMVTEST2_120_3.root_H6AONN5MEMLP.weights.txt

    Contains the weights of the Neural Net - results of the training. Starts with info about the run: Date, time, etc. followed by various parameters you've set. Then the variables you're considering in the net, with their range. Finally comes the Neural Net structure itself - weights etc.

  4. TMVA2_120_3.root

    1. InputVariable_NoTransform/my_----_B_NoTransform

      The distribution of this variable for the Background (or signal for S). N.B. Has N_sig / 2 entries (other half used for testing, not training). Check this to make sure there are no obvious problems with the input - missing data points, random spikes, etc.

    2. Method_MLP/H6AONN5MEMLP/estimatorHistTrain

      A measure after each iteration of how much tweaking is required to get S=1, B=0. Should settle after a while to a stable number

    3. Method_MLP/H6AONN5MEMLP/estimatorHistTest

      Same, but for the test sample - should look broadly similar. Otherwise we're over/undertraining.

    4. Method_MLP/H6AONN5MEMLP/MVA_H6AONN5MEMLP_B

      The final Neural Net Score for the Background (or signal for S) (Test result - not the final result).

    5. Method_MLP/H6AONN5MEMLP/MVA_H6AONN5MEMLP_B_high

      Rick's not sure.

    6. Method_MLP/H6AONN5MEMLP/MVA_H6AONN5MEMLP_rejBvsS

      Comparing Bkg rejection to Signal Efficiency. Ideally want rej=1 with eff=1 (i.e. reject all bkg, accept all signal).

  5. train2.log

    A text summary of the training process. Mentions files used (how many files, numbers of events, whether they are signal or background, etc). Will refer to file 0, 1... - refer to AtlasttHRealTitles.txt to work out which file is which. Use this to make sure the right numbers of events are getting through your filters.

  6. templates/out/FlatPlotter${prefix}.out

    The output from the screen from the templating stage, where the Neural Net scores are actually produced, detailing errors etc.

  7. templates/ttH120/

    Files showing the final results of the ANN. NNScoreAny_0_0_0 is of particular interest - this is the ANN output for that file, weighted to represent real data (i.e. if you simply add all of these together, you should get a realistic 'Signal and Background' result), which is shown in:

  8. stacked/Plots_120_TMVA_Lum_1.0/FlatStack_1.eps

    stacked/Plots_120_TMVA_Lum_1.0/FlatStack_2.eps

    The final ANN scores of the signal and background, scaled as to real data

  9. stacked/out1_120.log

    stacked/out2_120.log

    Output from making the stacked (combined) plots.

  10. templates/fit/out_120.log

    This is the paydirt - the output to screen from the fitting stage. Look at the end of this file, and you will see the exclusions generated by all the pseudoexperiments performed on the ANN output, in a little table, also giving the +/- 1 and 2 sigma results.

  11. drivetestFlatFitAtlastth.rootUnscaledTemplates.root.

    The distribution of the exclusions. This is plotted only if

    GeneralParameter bool 1 PlotLikelihood=1

    is set in teststeerFlatReaderATLAStthSemileptonic.txt. The range and number of bins in this plot can be controlled by editing the following switches:

    GeneralParameter int    1 LikeliPseudoExpNBin=400
    GeneralParameter double 1 LikeliPseudoExpMin=0.
    GeneralParameter double 1 LikeliPseudoExpMax=10.

  12. tr[run number].o[PBS job number]

    This output file is written as soon as the job stops, and contains a summary of the full run - useful for looking for error messages.

  13. TMVAPerf_120_Run[run number]Job2.html

    A useful little html page that one of Rick's scripts creates, showing a number of useful plots - the signal and background Net scores, distributions of input variables and their correlations, and so on.

Limitations

  • The code must be run on a PBS machine because of the structure of the genemflat_batch_Complete2_SL5.sh file (i.e. it uses PBS commands).

  • The file teststeerFlatPlotterATLAStthSemileptonic.txt appears to contain an invalid range for the pseudorapidity (max. value = pi)

  • If USEHILOSB is set to 1 then && must be appended to cut criteria, e.g. GeneralParameter string 1 Constraint=(my_failEvent&65536)==0&&

  • It would be desirable to adapt the code to be able to process different signals, e.g. lbb.

Diagnostic Run

A diagnostic run may be carried out by setting Debug=1 and NEvent=99 in teststeerFlatReaderATLAStthSemileptonic.txt. It is also advisable to cut the run time down by setting the number of training cycles to a low number (e.g. 20) in genemflat_batch_Complete2_SL5.sh - this appears as NCycles in the TMVAsteer.txt part of the file.

TMVA Training Plots

There is a macro in the latest NNFitter version which will plot the contents of run12/TMVA2_120_3.root showing the responsiveness of the neural net as a function of the number of training cycles, in order to gauge the optimal number of cycles to use (i.e. avoid the dangers of under- or over-training). The macro must be run in a directory where a neural net run has already been carried out.

  • Type source exportscript.sh to export the relevant parameters.

  • Enter ROOT and type .x runTrainTest.C

This will create two .eps output files, one showing the success of signal/background fitting, and the other displaying the sensitivity of the neural net to the number of training cycles used, allowing the speed of convergence to be gauged.

Running analysis & making ntuples

This is the procedure used to remake the ATLAS ntuples, using code from CERN Subversion repositories. These ntuples were then used as input for the neural net.

cd ~
mkdir tth_analysis_making_ntuples_v.13
cd tth_analysis_making_ntuples_v.13
export SVNROOT=svn+ssh://kirby@svn.cern.ch/reps/atlasoff
svn co $SVNROOT/PhysicsAnalysis/HiggsPhys/HiggsAssocTop/HiggsAssocWithTopToBBbar/tags/HiggsAssocWithTopToBBbar-00-00-00-13 PhysicsAnalysis/HiggsPhys/HiggsAssocTop/HiggsAssocWithTopToBBbar
cd PhysicsAnalysis/HiggsPhys/HiggsAssocTop/HiggsAssocWithTopToBBbar/NtupleAnalysis/
make

In a new terminal window:

cd /data/atlas07
mkdir gkirby
cd gkirby
mkdir ntuples_sensstatecutword

The script files in the NtupleAnalysis directory were then altered to output to this new directory. This is the output line for the signal (tthhbbOptions86581430018061-nn.txt):

OUTPUT /data/atlas07/gkirby/ntuples_sensstatecutword/86581430018061-nn.root

Code changes: The tthhbbClass.cxx file was edited to include a new error code: the following lines were added to it to allow us to exclude events that we do not wish the NN to train/test with.

if (SensibleStates.size()==0) {
  m_failEvent+=65536;
}

Then the make command was used again in this directory. The tthhbb executable was run with each of the input ("Options") text files to prepare the ntuples.

Another change was also required; the tthhbbClass.cxx file was also edited to include a fail code to allow for events not having '4 tight b-tagged jets' since this was one of the criteria used in the cut-based preselection. The following code was added to tthhbbClass.cxx:

if (BJets.size()<4) {
  m_failEvent+=131072;
}

This was done so that the Signal to Background ratio could be increased in order for the fit to finish and provide sensible results. Requiring 4 tight b-Jets removes proportionally much more of the ttjj background than the other samples because there are fewer b-jets. The reason the S/B ratio was so low was that a problem was found which meant that the original ratio used in the 'fix_weight' variable in genemflat_batch_Complete2_SL5.sh (up to version -06) was in fact incorrect, and when the correct weights were calculated the S/B ratio was very low indeed.

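The Neural Net was then configured to exclude events where either of these bits is set, i.e. where (m_failEvent & 196608) != 0, with 196608 = 131072 + 65536. (A comparison of the masked value with 1 could never be true, since the mask only selects bits 16 and 17.)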
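The arithmetic behind this mask can be checked with a short sketch. The constant and function names below are my own labels for illustration. Since m_failEvent & 196608 can only take the values 0, 65536, 131072 or 196608, the sketch tests for "either bit set" by comparing the masked value against zero.

```python
NO_SENSIBLE_STATE = 65536    # 2**16, added when SensibleStates is empty
FEWER_THAN_4_BJETS = 131072  # 2**17, added when there are fewer than 4 tight b-jets
EXCLUDE_MASK = NO_SENSIBLE_STATE | FEWER_THAN_4_BJETS


def is_excluded(fail_event):
    """True if either of the two fail bits is set in fail_event."""
    return (fail_event & EXCLUDE_MASK) != 0


print(EXCLUDE_MASK)
for value in (3, 65536 + 3, 131072 + 3, 196608 + 3):
    print(value, is_excluded(value))
```

Lower bits of the word (such as the preselection bits) do not affect the result, because the mask removes them before the comparison.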

Creating plots to review the data

There is a simple shell script included in the running Neural Net code package that can produce a nice html document you can use to review a few plots of interest - plotTMVA.sh. To run it, move it into the run directory you want to review, then it's a simple one-line command:

./plotTMVA.sh 120 <run> <job> 

N.B. This is done automatically by genemflat currently.

Debugging the code

To debug the code, two things need to be done - first, all the debug switches need to be turned on, and then you need to restrict the number of events to ~10 (for a Computentp run this will still manage to generate a 2 GB log file!). All of these switches are found in teststeerFlatReaderATLAStthSemileptonic.txt (the progenitor for all FlatReader files). The debug switches are:

GeneralParameter bool 1 Debug=0
GeneralParameter bool 1 DebugGlobalInfo=0
GeneralParameter bool 1 DebugEvInfo=0
GeneralParameter int  1 ReportInterval=100

All the debug switches can be set to one (I'm not sure of the exact effect of each individual switch) - the report interval is probably best left at 100. To restrict the events you use, edit the loop control switches:

#
# Loop Control
#
GeneralParameter int  1 NEvent=999999
GeneralParameter int  0 FirstEvent=1
GeneralParameter int  0 LastEvent=10

The easiest option is to set NEvent=10 - however, if desired you can run over a specified range, by switching off the NEvent switch (changing it to int 0 NEvent) and switching on the other two switches, using them to specify the events you wish to run over.

Various other switches of interest

In FlatReader:

GeneralParameter double 1 LikeliPseudoExpMin=0.
GeneralParameter double 1 LikeliPseudoExpMax=10.

GeneralParameter int 1 LikeliPseudoExpNBin=400

These enable you to specify the range and number of bins in the histogram showing the distribution of the pseudoexperiment exclusions (found in drivetestFlatFitAtlastth.rootUnscaledTemplates.root).

In both FlatReader and FlatPlotter:

GeneralParameter bool 1 LoadGlobalOnEachEvent=1

This needs to be set to one if you wish to load the global variables anew for each event. Otherwise the global variables will be loaded once only - from the Global tree if you have specified it, or from the first event if you haven't. Therefore, if your input datasets have non-sensible states and no global tree, this must be set to one. Otherwise, if the first entry is not sensible, or for some other reason has an unreasonable answer for this global value, a problem will develop. With this switch on, the values will be loaded each and every time – obviously this slows the code down – if the global values are safely stored in every entry, it might be best to set this to false.

TMVAsteer.txt (genemflat_batch_Complete2_SL5.sh)

H6AONN5MEMLP      MLP         1 !H:!V:NCycles=1000:HiddenLayers=N+1,N:RandomSeed=9876543

If the phrase 'H6AONN5MEMLP' is changed, then this change must also be propagated to the webpage plotter (e-mail from Rick, 1 Mar 2011).

Topic revision: r105 - 2011-09-02 - AlistairGemmell
 