
ADCoS shift summary

Report from Alistair. 27/10/2009. Modified by ChrisCT on 11/11/2009:

See at the bottom of the page for general conclusions.

To book your shifts, you need to use the Operation Task Planner (OTP). If you don't have the option to book ADCoS shifts, contact Xavier, and he should be able to sort you out. You'll all be glad to know that, as Graeme said, I ran my shift from 9-5 rather than 8-4, and nobody complained (in fact, the expert did his shift 9-4:30, so even if you want to run by CERN time, that still gives you an extra hour in bed). But Graeme said that it's fine to run it 9-5 by your own time zone.

You can see who is supposed to be on shift at the ADCPoint1Shift page: https://twiki.cern.ch/twiki/bin/view/Atlas/ADCPoint1Shift#Shift_schedule.

You may also find the ADCPoint1Shift page useful to see how ADC@P1 shifts fit in with ADCoS (your shift!).
Look at the bottom for "Shift schedule" and, inside there, look for the inset window.
There is a link to "all"; click it (it links to http://iueda.web.cern.ch/iueda/adc/shifts/index.html).
"Task 88" is the "Distributed Computing Shifts" task and it has 3 time zones. You will probably be in the "Requirement: Shifter(Trainee)" bit if you booked in successfully.
You should then know who your senior shifter and expert shifter are for the same date and time as your shift.

So, starting your shift: first, open the ADCoS TWiki and go to the Checklist for things to do. Out of the list of links they ask you to open in your browser, I only really looked at the Production Dashboard, Production (PANDA), the DDM Dashboard and the ADC eLog. As the checklist says, switch to the cloud view on the Dashboards. Note that three of those four refresh automatically (with LARGE delays and possible problems, see later); the ADC eLog does not. However, every time a new eLog entry is created, you'll get an e-mail, so there's not much to worry about there (TURN OFF your email filter if you normally have it switched on!).

Once you have all the windows open, probably best to check the eLog and previous shift summary (there will be a list of hot issues!) You may also find it useful to have a window open with your 'shifts' emails right by the browser.

ChrisCT got a reply from Graeme: apparently the dashboards do not update very well at all, so be careful! If you go in to the cloud view and get a url like
http://dashb-atlas-prodsys.cern.ch/dashboard/request.py/overview?grouping=cloud&start-date=2009-11-11%2000:00:00&end-date=2009-11-11%2012:59:59&grouping=cloud
as opposed to:
http://dashb-atlas-prodsys.cern.ch/dashboard/request.py/overview?grouping=cloud
then you have a 'frozen' url which will NEVER update!
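
A quick way to check a bookmarked dashboard link is sketched below. It assumes that the start-date and end-date parameters seen in the example above are what freeze the page; the function names is_frozen and unfreeze are made up for illustration and are not part of any dashboard tool.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Parameters that pin the dashboard to a fixed time window
# (assumption: taken from the example 'frozen' url above).
FROZEN_PARAMS = {"start-date", "end-date"}


def is_frozen(url):
    """Return True if the url carries an explicit start or end date."""
    query = parse_qsl(urlsplit(url).query, keep_blank_values=True)
    return any(name in FROZEN_PARAMS for name, _ in query)


def unfreeze(url):
    """Drop the date parameters and repeated parameters, keep the rest."""
    parts = urlsplit(url)
    kept = []
    seen = set()
    for name, value in parse_qsl(parts.query, keep_blank_values=True):
        if name in FROZEN_PARAMS or (name, value) in seen:
            continue
        seen.add((name, value))
        kept.append((name, value))
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

Running unfreeze on the long url above should give back the short one, which is the one to bookmark.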

On the Panda monitor, look out for 'Waiting' jobs. These might come from jobs not able to get their input (e.g. AOD). You can check this as follows:

click on the waiting jobs
click on a panda job id
look at the parent input dataset: your aim is to find failed jobs or problems
To find failed jobs, you can click on the TaskID of the input parent Dataset, and then there is a link to failed jobs.

In the ADC monitoring window https://sls.cern.ch/sls/service.php?id=ADC_CS if all is not green, the expert (Graeme) should be notified.

Then sign into the ADC Control Room (if at CERN, refer here for ways to set up your Skype)

Everyone initially appears offline - this is because they're not on your contacts list. Don't worry, just type in the control room conversation (it's not done using the headset), and everyone will see it. Some comments didn't appear for me as soon as they were typed, though, so I missed a couple: they show up in their correct chronological slot, but by the time they do, you've moved on from that point in the conversation.
ChrisCT had problems: my instant messages were not being delivered, so I emailed my senior shifter. After I added him as a contact, everything worked fine. Weird.
If you see a rotating circle next to your instant messages, they are NOT BEING DELIVERED! Quite often you will need to add your senior shifter as a contact before this starts working, and you may have to get them to add you as well!

First thing you should probably do is read the previous eLog and shift summary.

Also have a look at the scheduled downtime for sites here: https://atlas-install.roma1.infn.it/atlas_install/list.php?sitename=AGLT2 or here: https://twiki.cern.ch/twiki///bin/view/Atlas/AtlasGridDowntime. (NB: at the last link, you can import the downtime in to your "Google Calendar" - go on, it's easy to remove when you're not on shift!) Before you raise a GGUS ticket, check whether the site is in downtime. GGUS tickets are mainly for SITE problems. Other things (like datasets not being found because they weren't made properly) don't need GGUS tickets.

Raise GGUS tickets as "TEAM". This is important and easy to forget. You have to click the link at the top of the page to get TEAM tickets (as opposed to GGUS tickets!).
In your GGUS ticket, cc the list: "atlas-project-adc-operations-shiftsNOSPAMcern.ch"

Also, you will need to be able to edit eLogs; there is a simple (instantaneous!) registration process when you try to edit one.

Everyone seems friendly, and will try to answer any questions - I spotted a couple of things and mentioned them during my time on shift, and invariably the expert had already seen them, but was happy to explain that he'd already dealt with it, or why it wasn't important.

Looking at PANDA, if you see a problem with any of the clouds, click on it to get more detailed information. If it's only one site failing, click on the number of failures that site is experiencing, and then once that URL has popped up, change the 'mode=archive' bit to 'job=*' (the capitalisation matters here) in order to get the exact error message sent out by that site. In my experience, this was where I got most of the detailed error messages.
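
If you'd rather not edit the URL by hand, the swap can be sketched in a couple of lines. This is only an illustration: the function name to_job_query is made up, and the example query string in the usage note is hypothetical, not a real failure link.

```python
def to_job_query(url):
    """Swap the 'mode=archive' bit of a Panda monitor URL for 'job=*'.

    The match is case-sensitive on purpose, since the capitalisation matters.
    """
    if "mode=archive" not in url:
        raise ValueError("URL has no 'mode=archive' bit to replace")
    # Replace only the first occurrence and leave the rest of the URL alone.
    return url.replace("mode=archive", "job=*", 1)
```

For example, a URL ending in query?mode=archive&site=EXAMPLE (hypothetical) would come back ending in query?job=*&site=EXAMPLE.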

The Dashboards rather disconcertingly have different names for their clouds - where PANDA labels it UK, the Dashboard labels it RAL. I managed to find a list comparing the two though:

Cloud Name in PANDA	Cloud Name in ProdSys / Dashboard
CA	TRIUMF
CERN	CERN
DE	FZK
ES	PIC
FR	LYON
IT	CNAF
ND	NDGF
NL	SARA
TW	ASGC
UK	RAL
US	BNL
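
For anyone scripting their own checks, the same table can be kept as a small lookup. This is just a sketch of the list above; the variable names are made up.

```python
# Cloud name in PANDA -> cloud name in ProdSys / Dashboard (from the table above).
PANDA_TO_DASHBOARD = {
    "CA": "TRIUMF",
    "CERN": "CERN",
    "DE": "FZK",
    "ES": "PIC",
    "FR": "LYON",
    "IT": "CNAF",
    "ND": "NDGF",
    "NL": "SARA",
    "TW": "ASGC",
    "UK": "RAL",
    "US": "BNL",
}

# Reverse lookup: Dashboard name -> PANDA name.
DASHBOARD_TO_PANDA = {dash: panda for panda, dash in PANDA_TO_DASHBOARD.items()}
```

So PANDA_TO_DASHBOARD["UK"] gives "RAL", and DASHBOARD_TO_PANDA["BNL"] gives "US".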

The DDM Dashboard shows the success of transfers between sites, and for some reason seems to respond quicker to queries than the Production Dashboard - so if you're having a quick hunt through the clouds to try and find where a specific site is located, I'd recommend using this (or PANDA, which is also fairly responsive). If you find somewhere that's failing a lot of transfers, once you've narrowed it down to the site, click on the number of failures to get the error message, and then click on the '+' next to the site's name to see where the sources for these transfers are - sometimes the transfers are failing because of the source (through downtime, or for some other reason), not the site that's showing the errors.

For both of the dashboards, keep an eye on the graphs at the top of the page, not only the numbers down at the bottom - the numbers may suggest you have a problem, but the graphs can tell you if it's still an ongoing concern. It might have been a temporary blip: for example, when one site (BNL) was heading into scheduled downtime, a lot of failed transfers involving it appeared.

In the Production Dashboard, there was a time when there was a 'None' cloud listed - this seemed to contain only sites from other clouds that were known to be having problems - I asked, and apparently 'it happens when an input dataset/file replica is not found in any cloud or not in a cloud where the task is assigned'.

=====================================================================================================

General Comments: The people in the Skype control room seem friendly enough, and will happily answer any questions you have - however, the process is still inherently a remote one. I don't know how they'll be able to tell when you are no longer a trainee and can be considered an expert. I felt a bit lost, and just wandered around the various websites looking for things that looked red (or not green). With time it'll be easier to tell what's worth looking at and what's not, but it's made a bit harder by the fact that you only see the end result of the expert's efforts - you don't get to see the processes and tricks he/she uses to quickly find and diagnose a problem. Once an eLog entry came out, I could follow the information in it and generally find the fault myself. I found a couple of things on my own, but when I mentioned them in the control room the expert seemed to be about 5 pages ahead of me, though he didn't seem to mind telling me whether he was leaving it, and why. It's also a bit easy to lose track of what's been dealt with and what hasn't, partly because it's the expert, not me, who was dealing with the problems, but also because it's all very interconnected, and a fault at one place could be caused by a fault at another.

But I'm pretty sure it'll get easier with a few shifts - and once I manage to digest the entire TWiki properly!

=====================================================================================================

Specific issues:
Chris CT 24/11/09:
I was asked to submit some test jobs to cloud 'CA' site 'TRIUMF'.

I followed the procedures but got a nasty python error! On lxplus, I did:
source /afs/cern.ch/project/gd/LCG-share/current/etc/profile.d/grid_env.sh
source /afs/cern.ch/atlas/offline/external/GRID/ddm/DQ2Clients/setup.zsh
voms-proxy-init --voms atlas
mkdir panda
cd panda
svn co http://www.usatlas.bnl.gov/svn/panda/panda-server/current/pandaserver/test
svn co http://www.usatlas.bnl.gov/svn/panda/panda-server/current/pandaserver/taskbuffer
svn co http://www.usatlas.bnl.gov/svn/panda/panda-server/current/pandaserver/userinterface
cd test
export PYTHONPATH=..:$PYTHONPATH
I then edited the file test/testG4sim15.py to make sure site='TRIUMF' and cloud='CA', and ran:
python testG4sim15.py
This gave the error:
File "/afs/cern.ch/user/c/ccollins/panda/taskbuffer/FileSpec.py", line 94, in __getstate__
if isinstance(val,cx_Oracle.Timestamp):
NameError: global name 'cx_Oracle' is not defined
I was using python 2.5 (you can check this by typing python, then CTRL-D to exit), so the fix is to edit the file
../taskbuffer/FileSpec.py
and comment out the 3 lines below the comment line which is already there, so that it looks like this:
# convert cx_Oracle.Timestamp to datetime. this is not needed since python 2.4
#if isinstance(val,cx_Oracle.Timestamp):
# val = datetime.datetime(val.year,val.month,val.day,
# val.hour,val.minute,val.second)
Job was submitted. My output looked like:
---------------------
0
PandaID=1031455144
and the test jobs appeared here: http://panda.cern.ch:25980/server/pandamon/query?job=*&type=test&hours=3
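
A note on the NameError above: it suggests that the cx_Oracle name was not defined in my checkout of FileSpec.py. Commenting the lines out worked for me. As an alternative, a check like that could be guarded so it is skipped when the Oracle bindings cannot be imported. The sketch below is NOT the actual Panda code; the function name to_datetime is made up for illustration.

```python
import datetime

# Guarded import: fall back to None when the Oracle bindings are missing.
try:
    import cx_Oracle
except ImportError:
    cx_Oracle = None


def to_datetime(val):
    """Convert a cx_Oracle.Timestamp to datetime; pass anything else through."""
    if cx_Oracle is not None and isinstance(val, cx_Oracle.Timestamp):
        return datetime.datetime(val.year, val.month, val.day,
                                 val.hour, val.minute, val.second)
    return val
```

With this shape, a machine without cx_Oracle just gets its values back unchanged instead of hitting the NameError.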

Topic revision: r10 - 2010-01-18 - ChrisCollins
