(r13) ATLASADCoSShiftDiary < ATLAS

---+ <nop>ADCoS shift summary

Report from Alistair. 27/10/2009. Modified by <nop>ChrisCT on 11/11/2009 and on 17/6/2010:

See the bottom of the page for general conclusions.

To book your shifts, you need to use the [[https://pptevm.cern.ch/mao/client/cern.ppt.mao.app.gwt.MaoClient/MaoClient.html][Operation Task Planner (OTP)]]. If you don't have the option to book <nop>ADCoS shifts, contact [[mailto:espinal@pic.es][Xavier]], and he should be able to sort you out. You'll all be glad to know that, as Graeme said, I ran my shift from 9-5 rather than 8-4, and nobody complained. In fact, the expert did his shift 9-4:30, so even if you want to run by CERN time, that still gives you an extra hour in bed. But Graeme said that it's fine to run it 9-5 by your own time zone.

You can see who is supposed to be on shift: 
   * [[https://twiki.cern.ch/twiki/bin/view/Atlas/ADCoS#Communication_and_Organization][https://twiki.cern.ch/twiki/bin/view/Atlas/ADCoS#Communication_and_Organization]] shows you how to find who is on senior/expert shift.
   * Task IDs useful to you:  529222=ADCoS Senior shifts,  529221=ADCoS Expert shifts, 529223=ADCoS Trainee shifts
   * You should then know who your senior shifter and expert shifter are for the same date and time as your shift.
   * ADCPoint1Shift  page: [[https://twiki.cern.ch/twiki/bin/view/Atlas/ADCPoint1Shift#Shift_schedule][https://twiki.cern.ch/twiki/bin/view/Atlas/ADCPoint1Shift#Shift_schedule]].
   * You may also find the <nop>ADCPoint1Shift page useful to see how <nop>ADC@P1 shifts fit in with <nop>ADCoS (your shift!)

You will need:
   * a Jabber account (I visited www.jabber.org.uk) 
   * a Jabber client (<nop>ADCoS recommends Adium for OSX) 
   * Configure Adium to use the Jabber account at jabber.org.uk, then follow [[https://twiki.cern.ch/twiki/bin/view/Atlas/ADCoS#ADC_Virtual_Control_Room][https://twiki.cern.ch/twiki/bin/view/Atlas/ADCoS#ADC_Virtual_Control_Room]]: in Adium, choose FILE, then JOIN GROUP CHAT, and enter the information needed.
   * NB: you will need the adcvcr password, so get this before your shift!

You will also need:
   * a GGUS account (to open TEAM tickets), which you need to log in to using your grid certificate. The certificate HAS to be imported into your web browser!
   * an ATLAS eLog account - try to log in at: [[https://prod-grid-logger.cern.ch/elog/ATLAS+Computer+Operations+Logbook/][https://prod-grid-logger.cern.ch/elog/ATLAS+Computer+Operations+Logbook/]]

So, starting your shift: 
Bear in mind, if you are "senior shifter" you will have to submit a shift report [[https://twiki.cern.ch/twiki/bin/view/Atlas/ADCoS#Daily_SHIFT_report][https://twiki.cern.ch/twiki/bin/view/Atlas/ADCoS#Daily_SHIFT_report]].

First, open the [[https://twiki.cern.ch/twiki/bin/view/Atlas/ADCoS][ADCoS TWiki]], and go to the [[https://twiki.cern.ch/twiki/bin/view/Atlas/ADCoS#CHECKLIST][Checklist]] for things to do. Out of the list of links they ask you to open in your browser, I only really looked at the [[http://dashb-atlas-prodsys.cern.ch/dashboard/request.py/overview][Production Dashboard]], [[http://panda.cern.ch:25980/server/pandamon/query?dash=prod&reload=yes][Production (PANDA)]], the [[http://dashb-atlas-data.cern.ch/dashboard/request.py/site][DDM Dashboard]] and the [[https://prod-grid-logger.cern.ch/elog/ATLAS+Computer+Operations+Logbook/][ADC eLog]]. As the checklist says, switch to the cloud view on the Dashboards. Note that of those four, all except the ADC eLog refresh automatically (with LARGE delays and possible problems; see later). However, every time a new eLog entry is created, you'll get an e-mail, so there's not much to worry about there (TURN OFF your email filter if you normally have it switched on!). 

Once you have all the windows open, it is probably best to check the eLog and the previous shift summary (there will be a list of hot issues!).
You may also find it useful to have a window open with your 'shifts' emails right by the browser.

<nop>ChrisCT got a reply from Graeme: apparently the dashboards do not update reliably, so be careful! If you go into the cloud view and get a URL like <br>
=http://dashb-atlas-prodsys.cern.ch/dashboard/request.py/overview?grouping=cloud&start-date=2009-11-11%2000:00:00&end-date=2009-11-11%2012:59:59&grouping=cloud= <br>
as opposed to:<br>
=http://dashb-atlas-prodsys.cern.ch/dashboard/request.py/overview?grouping=cloud=  <br>
then you have a 'frozen' url which will NEVER update!
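
A minimal sketch (modern Python 3, not part of the shift tools) of how you could check whether a dashboard URL is 'frozen' and strip it back to the live form. It assumes that the =start-date= and =end-date= parameters seen in the example above are what pin the page to a fixed time window; the function names are made up for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: these two parameters (from the example URL) freeze the view.
FROZEN_PARAMS = {"start-date", "end-date"}


def is_frozen(url):
    """Return True if the URL carries a fixed start or end date."""
    query = parse_qsl(urlsplit(url).query, keep_blank_values=True)
    return any(name in FROZEN_PARAMS for name, _ in query)


def unfreeze(url):
    """Drop the date parameters and any exact duplicate parameters."""
    parts = urlsplit(url)
    kept = []
    seen = set()
    for name, value in parse_qsl(parts.query, keep_blank_values=True):
        if name in FROZEN_PARAMS or (name, value) in seen:
            continue
        seen.add((name, value))
        kept.append((name, value))
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

Calling =unfreeze= on the first URL above should give back the second one, which is the form to bookmark.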

On the Panda monitor, look out for 'Waiting' jobs. These might be jobs that cannot get their input (e.g. AOD). You can check this as follows:
   * click on the waiting jobs
   * click on a panda job id
   * look at the parent input dataset: your aim is to find failed jobs or problems
   * to find failed jobs, click on the TaskID of the parent input dataset, where there is a link to failed jobs.

In the ADC monitoring window [[https://sls.cern.ch/sls/service.php?id=ADC_CS][https://sls.cern.ch/sls/service.php?id=ADC_CS]], if not everything is green, the expert (Graeme) should be notified.

Then sign into the ADC Control Room (if at CERN, see [[http://security.web.cern.ch/security/skype/][here]] for ways to set up your Skype).
NB: you might want to check the new EVO room: 
=http://evo.caltech.edu/evoGate/koala.jnlp?meeting=MtM8Ma2B2DD8Dv9D92Ds9t=

   * Everyone initially appears offline because they're not on your contacts list. Don't worry: just type in the control room conversation (it's not done using the headset), and everyone will see it. Some comments didn't appear as soon as they were typed, though, so I missed a couple: they show up in the correct chronological slot, but by the time they do, you've moved on past it.
   * Chris CT: I had problems. My instant messages were not being delivered, so I emailed my senior shifter. After I added him as a contact, everything worked fine. Weird.
   * If you see a rotating circle next to your instant messages, they are NOT BEING DELIVERED! Quite often you will need to add your senior shifter as a contact before this starts working; and you may have to get them to add you as well! 

First thing you should probably do is read the previous eLog and shift summary. 

Also have a look at the scheduled downtime for sites here: [[https://atlas-install.roma1.infn.it/atlas_install/list.php?sitename=AGLT2][https://atlas-install.roma1.infn.it/atlas_install/list.php?sitename=AGLT2]] or here: [[https://twiki.cern.ch/twiki///bin/view/Atlas/AtlasGridDowntime][https://twiki.cern.ch/twiki///bin/view/Atlas/AtlasGridDowntime]]. (NB: at the last link, you can import the downtime into your "Google Calendar". Go on, it's easy to remove when you're not on shift!) Before you raise a GGUS ticket, check whether the site is in downtime. GGUS tickets are mainly for SITE problems. Other things (like datasets not being found because they weren't made properly) don't need GGUS tickets. 
   * Raise GGUS tickets as "TEAM". This is important and easy to forget. You have to click the link at the top of the page to get TEAM tickets (as opposed to GGUS tickets!).
   * In your GGUS ticket, cc the list: "atlas-project-adc-operations-shiftsNOSPAMcern.ch"

Also, you will need to be able to edit eLogs; there is a simple (instantaneous!) registration process when you try to edit one.

Everyone seems friendly, and will try to answer any questions - I spotted a couple of things and mentioned them during my time on shift, and invariably the expert had already seen them, but was happy to explain that he'd already dealt with it, or why it wasn't important.

Looking at PANDA, if you see a problem with any of the clouds, click on it to get more detailed information. If it's only one site failing, click on the number of failures that site is experiencing, and then once that URL has popped up, change the 'mode=archive' bit to 'job=*' (the capitalisation matters here) in order to get the exact error message sent out by that site. In my experience, this was where I got most of the detailed error messages.
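
The URL edit described above can be sketched as a small Python 3 helper. This is only an illustration: the function name and the example URL are made up, and it uses a plain, case-sensitive string replacement so that the =*= is left exactly as typed.

```python
def job_detail_url(url):
    """Swap 'mode=archive' for 'job=*' in a PANDA monitor URL.

    The match is case-sensitive, as the capitalisation matters.
    """
    if "mode=archive" not in url:
        raise ValueError("URL has no 'mode=archive' parameter")
    # Replace only the first occurrence and leave the rest untouched.
    return url.replace("mode=archive", "job=*", 1)


# Hypothetical example URL; the 'site' parameter is invented here.
example = "http://panda.cern.ch:25980/server/pandamon/query?mode=archive&site=EXAMPLE"
print(job_detail_url(example))
```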

The Dashboards rather disconcertingly have different names for their clouds - where PANDA labels it UK, the Dashboard labels it RAL. I managed to find a list comparing the two though:
| Cloud Name in PANDA | Cloud Name in ProdSys / Dashboard |
| CA | TRIUMF |
| CERN | CERN |
| DE | FZK |
| ES | PIC |
| FR | LYON |
| IT | CNAF |
| ND | NDGF |
| NL | SARA |
| TW | ASGC |
| UK | RAL |
| US | BNL |
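
If you want the table above to hand while hopping between monitors, it can be written down as a small Python 3 lookup. This is just a convenience sketch; the variable and function names are made up, and the entries are copied from the table.

```python
# PANDA cloud name -> ProdSys / Dashboard cloud name (from the table above).
PANDA_TO_DASHBOARD = {
    "CA": "TRIUMF",
    "CERN": "CERN",
    "DE": "FZK",
    "ES": "PIC",
    "FR": "LYON",
    "IT": "CNAF",
    "ND": "NDGF",
    "NL": "SARA",
    "TW": "ASGC",
    "UK": "RAL",
    "US": "BNL",
}

# Reverse lookup: Dashboard cloud name -> PANDA cloud name.
DASHBOARD_TO_PANDA = {dash: panda for panda, dash in PANDA_TO_DASHBOARD.items()}


def to_dashboard(panda_cloud):
    """Return the Dashboard name for a PANDA cloud, e.g. 'UK' -> 'RAL'."""
    return PANDA_TO_DASHBOARD[panda_cloud.upper()]


def to_panda(dashboard_cloud):
    """Return the PANDA name for a Dashboard cloud, e.g. 'RAL' -> 'UK'."""
    return DASHBOARD_TO_PANDA[dashboard_cloud.upper()]
```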

The DDM Dashboard shows the success of transfers between sites, and for some reason seems to respond quicker to queries than the Production Dashboard - so if you're having a quick hunt through the clouds to try and find where a specific site is located, I'd recommend using this (or PANDA, which is also fairly responsive). If you find somewhere that's failing a lot of transfers, once you've narrowed it down to the site, click on the number of failures to get the error message, and then click on the '+' next to the site's name to see where the sources for these transfers are - it can sometimes be the case that the transfers are failing because of the source (through downtime, or some other reason), not the site that's showing the errors.

For both of the dashboards, keep an eye on the graphs at the top of the page, not only the numbers down at the bottom - the numbers may suggest you have a problem, but the graphs can tell you if it's still an ongoing concern - it might have been a temporary blip (for example, when one site (BNL) was going into scheduled downtime, a lot of failed transfers involved it).

In the Production Dashboard, there was a time when there was a 'None' cloud listed - this seemed to contain only sites from other clouds that were known to be having problems - I asked, and apparently 'it happens when an input dataset/file replica is not found in any cloud or not in a cloud where the task is assigned'.

<strong>=====================================================================================================</strong>

General Comments: The people in the Skype control room seem friendly enough, and will happily answer any questions you have - however, the process is still inherently a remote one. I don't know how they'll be able to tell when you are no longer a trainee and can be considered an expert. I felt a bit lost, and just wandered around the various websites looking for things that looked red (or not green) - with time it'll be easier to tell what's worth looking at and what's not, but it's made a bit harder by the fact you only see the end result of the expert's efforts - you don't get to see the processes and tricks he/she uses to quickly find and diagnose a problem. Once the eLog came out, I could follow the information contained in that and generally find the fault myself. I found a couple of things myself, but when I mentioned them in the control room the expert seemed to be about 5 pages ahead of me, though he didn't seem to mind telling me whether he was leaving it, and why. It's also a bit easy to lose track of what's been dealt with and what hasn't. That's partly because it's the expert, not me, who was dealing with the problems, but also because it's all very interconnected, and a fault at one place could be caused by a fault at another.

But I'm pretty sure it'll get easier with a few shifts - and once I manage to digest the entire TWiki properly!

<strong>=====================================================================================================</strong>
Specific issues:<br>
Chris CT 24/11/09:<br>
I was asked to submit some test jobs to cloud 'CA' site 'TRIUMF'. 

I followed the procedures but got a nasty python error!
On lxplus, I did:<br>
=source /afs/cern.ch/project/gd/LCG-share/current/etc/profile.d/grid_env.sh= <br>
=source /afs/cern.ch/atlas/offline/external/GRID/ddm/DQ2Clients/setup.zsh= <br>
=voms-proxy-init --voms atlas= <br>
=mkdir panda= <br>
=cd panda= <br>
=svn co http://www.usatlas.bnl.gov/svn/panda/panda-server/current/pandaserver/test= <br>
=svn co http://www.usatlas.bnl.gov/svn/panda/panda-server/current/pandaserver/taskbuffer= <br>
=svn co http://www.usatlas.bnl.gov/svn/panda/panda-server/current/pandaserver/userinterface= <br>
=cd test= <br>
=export PYTHONPATH=..:$PYTHONPATH= <br>
I then edited the file =test/testG4sim15.py= to make sure =site='TRIUMF'= and =cloud='CA'=,
then ran:<br>
=python testG4sim15.py= <br>
got error: <br>
=File "/afs/cern.ch/user/c/ccollins/panda/taskbuffer/FileSpec.py", line 94, in __getstate__= <br>
=if isinstance(val,cx_Oracle.Timestamp):= <br>
=NameError: global name 'cx_Oracle' is not defined= <br>
I was using Python 2.5 (you can check the version by typing =python=, then =CTRL-D= to exit).<br>
The fix: edit the file<br>
=../taskbuffer/FileSpec.py= <br>
and comment out the 3 lines below the comment line that is already there: <br>
            =# convert cx_Oracle.Timestamp to datetime. this is not needed since python 2.4= <br>
            =#if isinstance(val,cx_Oracle.Timestamp):= <br>
            =#    val = datetime.datetime(val.year,val.month,val.day,= <br>
            =#                            val.hour,val.minute,val.second)= <br>
The job was then submitted with
=python testG4sim15.py= (with the site hardcoded inside as =site='TRIUMF'= and =cloud='CA'=)
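
For what it's worth, the =NameError= above happens because the code refers to =cx_Oracle= without it having been imported where the script runs. Commenting the lines out is what I did; a guarded import is another way to get the same effect. The sketch below is NOT the real =FileSpec.py=: it is a stand-alone Python 3 illustration with a made-up function name, assuming the only purpose of the check is to turn Oracle timestamps into =datetime= objects.

```python
import datetime

try:
    import cx_Oracle  # only available where the Oracle client is installed
except ImportError:
    cx_Oracle = None  # fall back to "no Oracle types to convert"


def to_datetime(val):
    """Convert an Oracle timestamp to datetime; pass anything else through."""
    if cx_Oracle is not None and isinstance(val, cx_Oracle.Timestamp):
        return datetime.datetime(val.year, val.month, val.day,
                                 val.hour, val.minute, val.second)
    return val
```

Without =cx_Oracle= installed, every value is simply returned unchanged, which matches what commenting out the three lines does.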

My output looked like: <br>
=---------------------= <br>
=0= <br>
=PandaID=1031455144= <br>
and the test jobs appeared here: [[http://panda.cern.ch:25980/server/pandamon/query?job=*&type=test&hours=3][http://panda.cern.ch:25980/server/pandamon/query?job=*&type=test&hours=3]]
Topic revision: r13 - 2010-06-17 - ChrisCollins