ATLAS Web>TWikiUsers>NickEdwards>RunningOnData (2010-11-10, NickEdwards)

Running on Real Data

Here are some notes on the various elements needed to conduct an analysis on real ATLAS data. It's a work in progress!

Running on the grid with prun

The prun utility lets you submit non-Athena jobs to the grid via the Panda backend. It's part of the pathena / panda release; to set it up, run:

export PATHENA_GRID_SETUP_SH=/afs/cern.ch/project/gd/LCG-share/current/etc/profile.d/grid_env.sh
source /afs/cern.ch/atlas/offline/external/GRID/DA/panda-client/latest/etc/panda/panda_setup.sh

The basic usage is:

prun --exec "echo %IN | sed -e 's/,/\n/g' > inputFiles.txt; source runZll.sh" \
--inDS data10_7TeV.00160899.physics_Egamma.merge.NTUP_WZ.f282_p198_tid158344_00/ \
--outputs output.root,Zlloutput.log --outDS user.nedwards.grltest --athenaTag 15.6.9 --nFiles 1

When you execute the command, prun creates an archive of all the files in the working directory and below, excluding object files, ROOT files, files over 1048576 bytes (1 MB) and possibly binaries. You can override this with --extFile. The archive is sent to the worker nodes. The --exec flag tells prun which command to execute (or a ;-separated list of commands). %IN is replaced by a comma-separated list of the input files (e.g. "file1.root,file2.root"). --inDS specifies the input dataset. --outputs should be a comma-separated list of the files that you want returned to you in the output dataset. --outDS specifies the output dataset name and should be in the form user...
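To see what the command string does with %IN, here is a minimal local sketch. The function name split_input_list and the example file names are made up for illustration; on the worker node prun itself substitutes the real comma-separated list for %IN before the command runs. The sketch uses tr instead of sed so that it does not rely on how a particular sed treats \n in the replacement text.

```shell
#!/bin/sh
# Hypothetical helper (not part of prun): turn a comma-separated file list,
# i.e. the form that %IN takes, into one file name per line.
split_input_list() {
    printf '%s\n' "$1" | tr ',' '\n'
}

# Stand-in for the string that prun would substitute for %IN.
IN_EXAMPLE="file1.root,file2.root"

# Print one input file per line; in a real job this would be redirected
# into inputFiles.txt for the analysis script to read.
split_input_list "$IN_EXAMPLE"
```

The analysis script (runZll.sh in the example above) can then read the resulting list one line at a time.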

The last flags are optional. --athenaTag 15.6.9 sets up an Athena release on the worker nodes; this is the simplest way to get ROOT set up correctly there. (I originally thought this was needed for ROOT to see the input files, but it is not.) --useAthenaPackages lets the worker nodes access Athena packages, and --nFiles specifies how many files from the dataset to use.

This is all well and good, but if you have a GoodRunList (GRL) and want to run over all the events in it, you would have to find the dataset names for all of the runs in the GRL and then submit a separate prun job for each of them. But wait... prun is cleverer than it looks:

prun --exec "echo %IN | sed -e 's/,/\n/g' > inputFiles.txt; source runZll.sh" \
--goodRunListXML=data10_7TeV.periodE1.160387-160479_LBSUMM_DetStatus-v03-pass1-analysis-2010E_data_eg_standard_7TeV.xml \
--goodRunListDS="physics_Egamma" --goodRunListDataType=NTUP_WZ \
--outputs output.root,Zlloutput.log --outDS user.nedwards.00160387.ZeeAna.physics_Egamma.merge.NTUP_WZ.f280_p198 --athenaTag 15.6.9

This takes the GRL specified in --goodRunListXML and uses AMI to find all of the relevant datasets. You need to specify the format of the input datasets, e.g. --goodRunListDataType=NTUP_WZ or --goodRunListDataType=AOD. You can apply a filter to select, for example, a specific trigger stream or a specific tag with --goodRunListDS="physics_Egamma". Only datasets with names containing the specified string will be used (* can be used as a wildcard).
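The effect of the --goodRunListDS filter can be pictured with a small local sketch. This only illustrates substring and wildcard matching with shell patterns; it is not how prun implements the filter. The function name matches_filter and the second dataset name are invented for the example.

```shell
#!/bin/sh
# Hypothetical illustration (not prun code): succeed if the dataset name $1
# contains the filter string $2. The filter is left unquoted in the pattern
# so that a '*' inside it acts as a wildcard.
matches_filter() {
    case "$1" in
        *$2*) return 0 ;;
        *)    return 1 ;;
    esac
}

# The first name is taken from the example above; the second is made up.
for ds in \
    data10_7TeV.00160899.physics_Egamma.merge.NTUP_WZ.f282_p198_tid158344_00 \
    data10_7TeV.00160899.physics_Muons.merge.NTUP_WZ.f282_p198
do
    if matches_filter "$ds" "physics_Egamma"; then
        echo "keep: $ds"
    else
        echo "skip: $ds"
    fi
done
```

With the filter "physics_Egamma", only the first dataset is kept; a filter such as "physics_*NTUP_WZ" would match both.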

For more info on prun see https://twiki.cern.ch/twiki/bin/view/Atlas/PandaRun

Common Problems

ROOT is unable to find the file on the worker node

Some grid nodes use exotic protocols like dcap or root:// to give you access to the files. Make sure that you open the file with TFile::Open rather than the normal TFile constructor, as TFile::Open supports more protocols. For example:

TFile *f = TFile::Open("myfile.root");

-- NickEdwards - 2010-09-23

Topic revision: r3 - 2010-11-10 - NickEdwards
