Difference: JetFlavourTagging (29 vs. 30)

Revision 30 - 2017-03-22 - DanProtopopescu

Line: 1 to 1

Jet Flavour Tagging Howto with LCFIVertex

This is a detailed record of how the Marlin framework and the included LCFIVertex package are used for jet flavour tagging. b-jet flavour tagging is part of our analysis of the feasibility of the ZZ fusion channel with CLIC ILD at 1.4 TeV.

Note: this wiki refers to the LCFIVertex package, which is now available on GitHub.


Jet Finder and Truth Tagging

The LCFI package consists of a topological vertex finder ZVTOP, which reconstructs secondary interactions, and a multivariate classifier which combines several jet-related variables to tag bottom, charm, and light quark jets (see diagram).
  Our steering file will contain the jet finder, flavour tagging and LCFI processors, and we will write new slcio files containing the added collections:
  <group name="JetFinders"/>
  <group name="MyTrueAngularJetFlavourProcessorCollection"/>
  <processor name="IPRPCutProcessor"/>
  <processor name="MyPerEventIPFitterProcessor"/>
  <processor name="ZVRESRPCutProcessor"/>
  <processor name="MyZVTOP_ZVRES"/>
  <processor name="FTRPCutProcessor"/>
  <processor name="MyFlavourTagInputsProcessor"/>
  <processor name="MyLCIOOutputProcessor"/>  

The JetFinder processor reconstructs 2-jet and 4-jet events from the input collection (LooseSelectedPandoraPFANewPFOs was used). For the four reconstructed jets, MyTrueAngularJetFlavourProcessor determines the MC jet flavour by angular matching of heavy quarks to jets, and also determines the hadronic and partonic charge of each jet. The LCIOOutput processor creates new slcio files containing the collections added by the above processors.
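To make the idea of angular matching concrete, here is a minimal standalone sketch. It is not the code of TrueAngularJetFlavourProcessor: the types `Vec3` and `HeavyQuark` and the function `truthFlavours` are hypothetical, and we assume that each heavy quark is assigned to the jet closest in opening angle and that a jet takes the flavour of the heaviest quark assigned to it (5 for b, 4 for c, 1 for light).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Minimal 3-vector; only the direction matters for angular matching.
struct Vec3 { double x, y, z; };

// Hypothetical MC quark record: momentum direction and PDG code.
struct HeavyQuark { Vec3 p; int pdg; };

// Opening angle between two non-zero vectors, in radians.
double openingAngle(const Vec3& a, const Vec3& b) {
    const double dot = a.x * b.x + a.y * b.y + a.z * b.z;
    const double na = std::sqrt(a.x * a.x + a.y * a.y + a.z * a.z);
    const double nb = std::sqrt(b.x * b.x + b.y * b.y + b.z * b.z);
    assert(na > 0.0 && nb > 0.0);
    double c = dot / (na * nb);
    if (c > 1.0) c = 1.0;    // guard against rounding just outside [-1, 1]
    if (c < -1.0) c = -1.0;
    return std::acos(c);
}

// Assign every b or c quark to its nearest jet; each jet keeps the
// heaviest flavour assigned to it. Jets with no heavy quark stay light (1).
std::vector<int> truthFlavours(const std::vector<Vec3>& jets,
                               const std::vector<HeavyQuark>& quarks) {
    std::vector<int> flavour(jets.size(), 1);
    if (jets.empty()) return flavour;
    for (const HeavyQuark& q : quarks) {
        const int id = std::abs(q.pdg);
        if (id != 4 && id != 5) continue;  // only charm and bottom quarks
        std::size_t best = 0;
        double bestAngle = openingAngle(jets[0], q.p);
        for (std::size_t j = 1; j < jets.size(); ++j) {
            const double a = openingAngle(jets[j], q.p);
            if (a < bestAngle) { bestAngle = a; best = j; }
        }
        if (id > flavour[best]) flavour[best] = id;
    }
    return flavour;
}
```

With four jets along ±x and ±y, a b quark and a c quark both close to +x, and a c antiquark close to +y, this sketch tags the +x jet as b, the +y jet as c and the other two as light.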

The LCFI processors have the following functions:

  • IPRPCut - selects Reconstructed Particles based on track parameters, number of hits etc.
  • MyPerEventIPFitter - determines IP position and error from the tracks in an event by simple fitting
  • ZVRESRPCut - applies cuts on the d0 and z0 values of the track
  • MyZVTOP_ZVRES - topological vertex finder
  • FTRPCut - flavour tagging reconstructed particle cuts (on d0, z0 and PT)
  • MyFlavourTagInputs - calculates the discriminating variables for the neural net from the vertices and tracks
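The RPCut-type processors in this list are, in essence, track selections. The sketch below shows the kind of cut on d0, z0 and PT that FTRPCut is described as applying. It is an illustration only: the `Track` and `TrackCuts` types are hypothetical, and any numerical thresholds passed to it are assumptions, not the values from the LCFIVertex steering files.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical track summary: impact parameters in mm, transverse momentum in GeV.
struct Track { double d0; double z0; double pt; };

// Hypothetical cut set; the real processors read theirs from the steering file.
struct TrackCuts { double maxAbsD0; double maxAbsZ0; double minPt; };

// A track passes if both impact parameters are small enough and PT is large enough.
bool passesCuts(const Track& t, const TrackCuts& c) {
    return std::fabs(t.d0) < c.maxAbsD0 &&
           std::fabs(t.z0) < c.maxAbsZ0 &&
           t.pt > c.minPt;
}

// Return only the tracks that pass all cuts, preserving their order.
std::vector<Track> selectTracks(const std::vector<Track>& tracks, const TrackCuts& c) {
    assert(c.maxAbsD0 >= 0.0 && c.maxAbsZ0 >= 0.0 && c.minPt >= 0.0);
    std::vector<Track> selected;
    for (const Track& t : tracks) {
        if (passesCuts(t, c)) selected.push_back(t);
    }
    return selected;
}
```

Each jet would then be rebuilt from the surviving tracks only, which is why the cut processors write a new jet collection rather than modifying the input one.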

Table of input and output collections for our setup (one can choose other names, of course):

| # | Processor | Type | Input Collection name | Output Collection name |
| 1 | JetFinder | SatoruJetFinder | LooseSelectedPandoraPFANewPFOs | Durham_4Jets |
| 2 | MyTrueAngularJetFlavour | TrueAngularJetFlavour | MCParticlesSkimmed, Durham_4Jets | TrueJetFlavour_4Jets |
| 3 | IPRPCut | RPCut | LooseSelectedPandoraPFANewPFOs | IPFitSelectedParticles |
| 4 | MyPerEventIPFitter | PerEventIPFitter | IPFitSelectedParticles | IPVertex |
| 5 | ZVRESRPCut | RPCut | RecoMCTruthLink, Durham_4Jets | ZVRESSelectedJets |
| 6 | MyZVTOP_ZVRES | ZVTOP_ZVRES | IPVertex, ZVRESSelectedJets | ZVRESDecayChains, ZVRESDecayChainRPTracks, ZVRESSelectedJets |
| 7 | FTRPCut | RPCut | RecoMCTruthLink, ZVRESDecayChains | FTSelectedJets |
| 8 | MyFlavourTagInputs | FlavourTagInputs | ZVRESDecayChains, FTSelectedJets | FlavourTagInputs |

Our input slcio files contain the collections: LooseSelectedPandoraPFANewPFOs, MCParticlesSkimmed, PandoraPFANewClusters, PandoraPFANewPFOs, PandoraPFANewReclusterMonitoring, ProngVertices, RecoMCTruthLink, SelectedLDCTracks, SelectedPandoraPFANewPFOs, TightSelectedPandoraPFANewPFOs and V0Vertices.

The processors listed above can be run in sequence, or split into several steps, invoking an LCIOOutput processor to write intermediate slcio outputs at every step. Here's a script for that, where the intermediate xml files are slight modifications of the files provided in LCFIVertex/steering examples. We found that the most time-consuming processor is ZVTOP_ZVRES, at more than 10 s/event.

We found it easier to run processors 1 to 6 on batches of 10 input files, and save the outputs as /afs/phas.gla.ac.uk/data/ilc/datasets01/1.4tev/ZVRES_out/runX_out.slcio. Then we ran processors 7 and 8 on these files to produce a new set ftiX_out.slcio. We used half of these files to train the neural net, but again it was easier to run flavour tagging (see below) on the individual ftiX_out.slcio files. We used the other half of the files for the purity vs. efficiency plots (see below). These slcio output files were stored in /afs/phas.gla.ac.uk/data/ilc/datasets01/1.4tev/ZVRES_out/ as well.
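The bookkeeping described above (batches of 10 input files, then one half of the outputs for training and the other half for testing) can be sketched as follows. This is only an illustration: the helper names are hypothetical, and we assume a simple first-half / second-half split, which may not be how the halves were actually chosen.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Split a list of file names into consecutive batches of at most batchSize files.
std::vector<std::vector<std::string>> makeBatches(const std::vector<std::string>& files,
                                                  std::size_t batchSize) {
    assert(batchSize > 0);
    std::vector<std::vector<std::string>> batches;
    for (std::size_t i = 0; i < files.size(); i += batchSize) {
        const std::size_t end = std::min(i + batchSize, files.size());
        batches.emplace_back(files.begin() + i, files.begin() + end);
    }
    return batches;
}

// Hypothetical container for the two halves of the data set.
struct TrainTestSplit {
    std::vector<std::string> train;  // used to train the neural nets
    std::vector<std::string> test;   // used for purity vs. efficiency plots
};

// First half for training, second half (one file larger if odd) for testing.
TrainTestSplit splitInHalf(const std::vector<std::string>& files) {
    const std::size_t half = files.size() / 2;
    TrainTestSplit s;
    s.train.assign(files.begin(), files.begin() + half);
    s.test.assign(files.begin() + half, files.end());
    return s;
}
```

Keeping the training and testing files disjoint matters here, because purity and efficiency measured on the training sample would be optimistic.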

Troubleshooting: The b3_D0CutValue parameter of the IPRPCutProcessor had been set to 5O (with a letter O) instead of 50, which caused a crash. For the ZVRESRPCut processor, h1_MCPIDEnable had to be set to false. See also this post.

Neural Network Training

In the previous step, we have extracted the discriminating parameters and truth-tagged the jets. The slcio files created are stored in:

These files contain the collections Durham_4Jets, FlavourTagInputs and TrueJetFlavour_4Jets, which we will use now to train our neural nets. We use a customised version of the NeuralNetTrainer code included in the LCFI package. Separate nets were trained for 1, 2, or 3+ vertices to identify b-jets, c-jets, and c-jets with b background. Our steering file contains only:
  <processor name="MyNeuralNetTrainer" type="GlasgowNeuralNetTrainer"/>
This processor is a slightly modified version of the NeuralNetTrainer included in the LCFI package, where the polar angle cut was introduced as a steering parameter:
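// Theta cut parameters: _jetThetaAngleCut < theta < (180 - _jetThetaAngleCut)
        registerProcessorParameter( "JetThetaAngleCut" ,
                                    "Cut on the jets theta angle"  ,
                                    _jetThetaAngleCut ,   // inferred: member named in the comment above
                                    float(30.) ) ;        // inferred: default of 30 degrees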
so that we can pass a cut angle different from the default of 30:
<!-- Jet theta angle cut (only the Glasgow NN trainer has this option) -->
                <parameter name="JetThetaAngleCut" type="float"> 24. </parameter>
The neural nets are saved as XML files in gnets/ and will be used for flavour tagging (next step). No slcio output is written at this time. These neural networks can be downloaded from here: gnets.tgz
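For reference, the polar angle cut described above keeps jets with cut < theta < (180 - cut), in degrees. A minimal standalone sketch of that condition follows; the function names are hypothetical and this is not the code of the Glasgow trainer itself.

```cpp
#include <cassert>
#include <cmath>

// Polar angle of a momentum vector with respect to the z (beam) axis, in degrees.
double thetaDeg(double px, double py, double pz) {
    const double kPi = std::acos(-1.0);
    return std::atan2(std::hypot(px, py), pz) * 180.0 / kPi;
}

// True if cutDeg < theta < 180 - cutDeg, i.e. the jet is away from both beam directions.
bool passesThetaCut(double theta, double cutDeg) {
    assert(cutDeg >= 0.0 && cutDeg <= 90.0);
    return theta > cutDeg && theta < 180.0 - cutDeg;
}
```

With the value of 24 used in our steering file, a jet at 25 degrees is accepted, whereas it would be rejected with the default of 30.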

Flavour Tagging

Now we are ready to employ the FlavourTag processor, which will do flavour tagging using the neural nets trained in the previous step. The input slcio files contain the FlavourTagInputs and FTSelectedJets (or Durham_4Jets, not sure if there's a difference at this level) collections.

  <processor name="MyFlavourTag"/>
  <processor name="MyLCIOOutputProcessor"/>  
The output slcio will contain a new collection FlavourTag (or FlavourTagGla in our customised configuration).
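Because separate nets were trained for 1, 2, or 3+ vertices, tagging a jet involves picking the net that matches its vertex count. The sketch below illustrates that lookup only. The key format returned by `netKey` is hypothetical and does not correspond to the file names in gnets/, and we assume the vertex count passed in is at least 1.

```cpp
#include <cassert>
#include <string>

// The three vertex-multiplicity categories for which separate nets exist.
enum class VertexCategory { One = 1, Two = 2, ThreeOrMore = 3 };

// Map a vertex count (assumed >= 1) onto its category.
VertexCategory vertexCategory(int nVertices) {
    assert(nVertices >= 1);
    if (nVertices == 1) return VertexCategory::One;
    if (nVertices == 2) return VertexCategory::Two;
    return VertexCategory::ThreeOrMore;
}

// Build a hypothetical lookup key such as "b-2vtx" from a tag name and a vertex count.
std::string netKey(const std::string& tag, int nVertices) {
    const int category = static_cast<int>(vertexCategory(nVertices));
    return tag + "-" + std::to_string(category) + "vtx";
}
```

With three tags (b, c, and c with b background) and three categories, this gives nine distinct nets.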

Purity and Efficiency Studies

To determine the optimal cut for our b-tagging, a purity vs. efficiency study needs to be done. We use the MakePurityVsEfficiencyRootPlot.C macro provided by the LCFIVertex package. First, we have to run:

  <processor name="MyAIDAProcessor"/>
  <processor name="MyPlot"/>
  <processor name="MyLCFIAIDAPlotProcessor"/>
We had to provide MyPlot with the actual names of our collections:
<parameter name="TrueJetFlavourCollection" type="string">TrueJetFlavour_4Jets </parameter>
<!--In fti-steer.xml this parameter is called "FlavourTagCollection", without the 's' -->  
<parameter name="FlavourTagCollections" type="string">FlavourTagGla </parameter>
Note that LCFI must be compiled with ROOT if one wants .root output from PlotProcessor (instead of .txt). For this, we added
 IF( ${pkg}_FOUND )
to the LCFIVertex/CMakeLists.txt file, sourced the root environment, then ran cmake and make install.

Once the AIDA plot processors have been run via Marlin, a text file PurityEfficiencyOutput.txt and a RAIDA root file are produced. We customised the MakePurityVsEfficiencyRootPlot.C macro and ran it with the RAIDA file as input to produce the purity vs. efficiency plots:

root -l MakePurityVsEfficiencyRootPlotGla.C
Here's a plot of our flavour tagging purity vs. efficiency (using approx. 25k events):

The cut values and corresponding purity and efficiencies are tabulated in the file PurityEfficiencyOutput.txt. We used another Root macro to plot these values and determine the optimum b-cut.
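For clarity, this is what efficiency and purity mean for a given cut on the b-tag output: efficiency is the fraction of true b-jets that pass the cut, and purity is the fraction of passing jets that are true b-jets. The standalone sketch below computes both and scans for a cut value. It is not the LCFI plot processor or our Root macro. The types are hypothetical, and maximising efficiency × purity is only one possible figure of merit, chosen here as an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-jet record: neural-net b-tag output in [0, 1] and MC truth.
struct TaggedJet { double bTag; bool isTrueB; };

struct PurityEfficiency { double cut; double efficiency; double purity; };

// Count jets above the cut and derive efficiency and purity; both are 0 if undefined.
PurityEfficiency evaluateCut(const std::vector<TaggedJet>& jets, double cut) {
    std::size_t nTrueB = 0, nSelected = 0, nSelectedTrueB = 0;
    for (const TaggedJet& j : jets) {
        if (j.isTrueB) ++nTrueB;
        if (j.bTag > cut) {
            ++nSelected;
            if (j.isTrueB) ++nSelectedTrueB;
        }
    }
    PurityEfficiency r;
    r.cut = cut;
    r.efficiency = nTrueB ? static_cast<double>(nSelectedTrueB) / nTrueB : 0.0;
    r.purity = nSelected ? static_cast<double>(nSelectedTrueB) / nSelected : 0.0;
    return r;
}

// Scan nSteps equally spaced cuts in [0, 1) and return the one that
// maximises efficiency * purity (an assumed figure of merit).
double bestCut(const std::vector<TaggedJet>& jets, int nSteps) {
    assert(nSteps > 0);
    double best = 0.0;
    double bestScore = -1.0;
    for (int i = 0; i < nSteps; ++i) {
        const double cut = static_cast<double>(i) / nSteps;
        const PurityEfficiency r = evaluateCut(jets, cut);
        const double score = r.efficiency * r.purity;
        if (score > bestScore) { bestScore = score; best = cut; }
    }
    return best;
}
```

A different figure of merit (for example, one that also weights the signal significance of the final analysis) would generally give a different optimum cut.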

For b-tagging, we've compared the plot produced by MakePurityVsEfficiencyRootPlotGla.C with graphs drawn from the data tabulated in PurityEfficiencyOutput.txt, for 1, 2, 3 (corresponding to distinct neural networks) or any number of vertices (which we don't yet know how to interpret):

Here's the script used to extract the numbers from PurityEfficiencyOutput.txt

Adding background

Up to now, the input slcio files contained only signal, so the 'background' in the plots above is not really background. To add background, we've downloaded 1.4 TeV eeqq files to:

We had to run processors 1-7 on these files to identify jets and calculate the discriminating parameters. For this, we used slightly modified versions of the scripts, which can be found here. The resulting outputs, containing approximately 25k events, are stored in /afs/phas.gla.ac.uk/data/ilc/datasets01/1.4tev/ZVRES_out/ as well.


Training the neural nets with eeqq background added

We have trained another set of neural nets, this time adding eeqq background to the input. The resulting neural nets (25k signal + 25k background) are here: bnets.tgz.


Using an alternate Jet Finder

The whole sequence can be redone with an alternate jet finder. We used FastJet and replaced the Durham algorithm with KT. For this, we changed Durham_4Jets to Kt_4Jets and used

<processor name="MyFastJetProcessor4" type="FastJetProcessor">
       <parameter name="algorithm" type="StringVec"> kt_algorithm 1.0 </parameter>
       <parameter name="clusteringMode" type="StringVec"> ExclusiveNJets 4 </parameter>
       <parameter name="jetOut" type="string" lcioOutType="ReconstructedParticle"> Kt_4Jets </parameter>
       <parameter name="recParticleIn" type="string" lcioInType="ReconstructedParticle"> LooseSelectedPandoraPFANewPFOs </parameter>
       <parameter name="recombinationScheme" type="string">E_scheme </parameter>
       <parameter name="recParticleOut" type="string" lcioOutType="ReconstructedParticle"> LooseSelectedPandoraPFOsInKt_4Jets </parameter>
</processor>
The relevant scripts and XML templates are here. The outputs are stored in /afs/phas.gla.ac.uk/data/ilc/datasets01/1.4tev/ZVRES_out/ and have the prefix kt added to the file name.
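As a reminder of what the kt_algorithm setting with R = 1.0 means, the longitudinally invariant kt algorithm uses the pair distance d_ij = min(pT_i², pT_j²) ΔR_ij² / R² and the beam distance d_iB = pT_i², with ΔR_ij² = (y_i - y_j)² + (φ_i - φ_j)². The sketch below evaluates these two measures only. It is not FastJet code, which does the actual clustering for us, and the `Particle` type is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical minimal particle: transverse momentum, rapidity, azimuth (radians).
struct Particle { double pt; double rapidity; double phi; };

// Azimuthal separation folded into [0, pi].
double deltaPhi(double a, double b) {
    const double kTwoPi = 2.0 * std::acos(-1.0);
    double d = std::fmod(std::fabs(a - b), kTwoPi);
    if (d > kTwoPi / 2.0) d = kTwoPi - d;
    return d;
}

// kt pair distance: min(pT_i^2, pT_j^2) * DeltaR_ij^2 / R^2.
double ktDij(const Particle& a, const Particle& b, double R) {
    assert(R > 0.0);
    const double dy = a.rapidity - b.rapidity;
    const double dphi = deltaPhi(a.phi, b.phi);
    const double dR2 = dy * dy + dphi * dphi;
    return std::min(a.pt * a.pt, b.pt * b.pt) * dR2 / (R * R);
}

// kt beam distance: pT_i^2.
double ktDiB(const Particle& a) {
    return a.pt * a.pt;
}
```

At each clustering step the smallest of all d_ij and d_iB is taken: a smallest d_ij merges the pair, a smallest d_iB assigns the particle to the beam. In the ExclusiveNJets 4 mode used above, clustering stops once exactly four jets remain.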

Using 4-Jet background

This is done following the same procedure, but using qqll 1.4 TeV CLIC_ILD DSTs (2645) and the script subqqll.sh with its dependencies, zvresQQllPBS.sh and train_qnets.xml. The relevant files are:

  • qqll_scripts.tgz: Scripts for signal + qqll background, KT algorithm
  • qnets.tgz: Neural nets for signal + qqll background, KT jet finder algorithm

The outputs are stored in the usual directory: /afs/phas.gla.ac.uk/data/ilc/datasets01/1.4tev/ZVRES_out/*qnets*. Purity vs. efficiency plot using qnets (1 vertex NN is gibberish):

Useful Links

Attachments to this topic:

  • BRTotalUncertBands_lm.png: Higgs branching ratios (from A. Denner et al., EPJ C71, p.1753)
  • Timing-ScreenShot.png: Screen shot: time used by Marlin processors
  • LCFI_Flow_Diagram.pdf: LCFI processors - flow diagram
  • Vertexing_Howto.pdf: Vertexing HowTo talk by B. Jeffery
  • jet_truth_tag-steer.xml: Steering file 1: jet finder and truth flavour tagging
  • runLCFI.txt: Script to run LCFI sequence of processors
  • Screen_Shot_2013-07-16_at_13.40.40.png: Flavour tagging purity vs. efficiency plots (using cca. 22k events)
  • zvresRun.sh: Script to run processors 1-7, see text
  • MakePurityVsEfficiencyRootPlotGla.C: Customised macro to draw purity vs. efficiency plots
  • LCFI_Hillert.pdf: LCFI Vertex package talk by S. Hillert
  • LCFI_ecfa_krakow.pdf: Flavour tagging at the linear collider - talk by M. Wing
  • PEcomparison.png: B-tagging Purity vs. Efficiency comparison
  • PEcomparisonBkg.png: B-tagging Purity vs. Efficiency comparison (25k eeqq background)
  • xtractPEv.txt: Script used to extract the numbers from PurityEfficiencyOutput.txt
  • gnets.tgz: Neural nets (Satoru Jet Finder, trained with signal only)
  • bnets.tgz: Neural nets (Satoru Jet Finder, trained with signal + eeqq background)
  • bkg_scripts.tgz: Scripts for running processors 1-7 on eeqq background files
  • PEcomparisonBNN.png: B-tagging Purity vs. Efficiency comparison (signal + background, see text)
  • KTscripts.tgz: Scripts and templates modified to use FastJet/KT algorithm
  • qqll_scripts.tgz: Scripts for signal + qqll background, KT algorithm
  • qnets.tgz: Neural nets for signal + qqll background, KT jet finder algorithm
  • PEcomparisonQQll.png: Purity vs. efficiency plot using qnets (1 vertex NN is gibberish)