ADCoS shift summary
Report from Alistair. 27/10/2009. Modified by ChrisCT on 11/11/2009 and on 17/6/2010:
See the bottom of the page for general conclusions.
Before Your Shift
To book your shifts, you need to use the Operation Task Planner (OTP). If you don't have the option to book ADCoS shifts, contact Xavier, and he should be able to sort you out.
REMEMBER: if you book a shift and the box turns YELLOW, it's already booked by someone else! Unbook and choose a RED slot.
You can also see there who is supposed to be on shift.
You will need:
- a Jabber account (I visited www.jabber.org.uk - make sure to select jabber.org.uk in the dropdown box!)
- a jabber client (ADCoS recommends Adium for OSX; Stan has used Pandium for Windows)
- Configure Adium to use the jabber account at jabber.org.uk (see https://twiki.cern.ch/twiki/bin/view/Atlas/ADCoS#ADC_Virtual_Control_Room). In Adium, use File, Join Group Chat and enter the info needed.
- you are likely to need a jabber username with no 'dots' in it (Pandium does not like them)
- Adium/Pandium will need your full jabber ID, i.e. your username followed by @jabber.org.uk
- Adium/Pandium will also need the 'Conference server' and 'room' (currently 'conference.uio.no' and 'adcvcr', but see the official TWiki)
- NB: you will need the adcvcr password, so get this before your shift!
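Not needed for the shift itself, but if you ever want to script the connection rather than click through a GUI client, the same pieces of information - jabber ID, account password, conference server, room name and room password - are all an XMPP library needs. Below is a minimal, untested sketch using the SleekXMPP Python library (my own choice of library, not something mentioned on the TWiki):
import sleekxmpp

class VCRClient(sleekxmpp.ClientXMPP):
    # Sketch only: joins the ADC virtual control room chat.
    def __init__(self, jid, password, room, nick, room_password):
        super(VCRClient, self).__init__(jid, password)
        self.room, self.nick, self.room_password = room, nick, room_password
        self.register_plugin('xep_0045')                     # multi-user chat
        self.add_event_handler('session_start', self.start)

    def start(self, event):
        self.send_presence()
        self.get_roster()
        # room 'adcvcr' at conference server 'conference.uio.no', as above
        self.plugin['xep_0045'].joinMUC(self.room, self.nick,
                                        password=self.room_password, wait=True)

client = VCRClient('yourusername@jabber.org.uk', 'account-password',
                   'adcvcr@conference.uio.no', 'YourName', 'adcvcr-password')
if client.connect():
    client.process(block=True)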
You will also need:
- a GGUS account (to open TEAM tickets) - you need to log in to it using your grid certificate (the grid certificate something.p12 HAS to be imported into your web browser!)
- you might need the "Master Password" for Firefox, and also your Grid Cert Password!
- an ATLAS eLog account - try to log in at: https://prod-grid-logger.cern.ch/elog/ATLAS+Computer+Operations+Logbook/
Finally, Savannah:
- Check you can see Savannah - you may have to raise a bug at some point or search it for other bugs! https://savannah.cern.ch
- The TWO main Savannah groups most used on ADCoS:
- Group: "Atlas Distributed Computing Operations Support"
- Group: "ATLAS validation"
- See if you can see all the Savannah "ADCoS task" & "ATLAS validation" bugs etc.
Starting Your Shift
So, to start your shift: bear in mind that if you are the "senior shifter" you will have to submit a shift report
https://twiki.cern.ch/twiki/bin/view/Atlas/ADCoS#Daily_SHIFT_report.
Even if you are doing a 'Trainee' shift, you should still read the previous shift report - it will tell you what's outstanding and urgent! Go to this page and sort it by "LAST MODIFIED":
http://lhcweb01.pic.es/atlas/Shift_Summaries/reports/.
First, open the ADCoS TWiki and go to the Checklist for things to do. Of the list of links it asks you to open in your browser, the most important are the Production Dashboard, Production (PANDA), the DDM Dashboard and the ADC eLog. (To save time, I have uploaded a bookmarks.html file and a Firefox bookmarks "JSON" file (bookmarks-2010-08-10.json) which you may be able to import into Firefox - beware, it might overwrite your own bookmarks! Once you have all the pages bookmarked, you can use "OPEN ALL" in Firefox from your bookmarks.) As the checklist says, switch to the cloud view on the Dashboards. Note that of those four, I don't think any of them refresh automatically (see later). However, every time a new eLog entry is created you'll get an e-mail, so there's not much to worry about there (TURN OFF your email filter if you normally have it switched on!).
- NB: On the eLog, I find it most useful to use the 'THREADED' view, as otherwise you get completely lost in the discussions - each update to an existing discussion appears as a new eLog entry unless you use the THREADED view!
- Also in the eLog, you can raise a new log entry with the 'NEW' hyperlink, then just go back to 'LIST'.
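If you'd rather not rely on the Firefox bookmarks trick, a few lines of Python will open the monitoring pages in one go. This is only a convenience sketch: the URLs below are the ones quoted on this page, so add the Production (PANDA) and DDM Dashboard links from the TWiki checklist yourself.
import webbrowser

# Shift monitoring pages (only URLs quoted elsewhere on this page; extend as needed)
pages = [
    "http://dashb-atlas-prodsys.cern.ch/dashboard/request.py/overview?grouping=cloud",  # Production Dashboard, cloud view
    "https://prod-grid-logger.cern.ch/elog/ATLAS+Computer+Operations+Logbook/",         # ADC eLog
    "http://lhcweb01.pic.es/atlas/Shift_Summaries/reports/",                            # previous shift reports
    "https://sls.cern.ch/sls/service.php?id=ADC_CS",                                    # ADC critical services (SLS)
]

for url in pages:
    webbrowser.open_new_tab(url)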
Once you have all the windows open, it's probably best to check the eLog and the previous shift summary (there will be a list of hot issues!).
You may also find it useful to have a window with your 'shifts' emails open right next to the browser.
It is also useful to have a terminal window open in which you have obtained a grid proxy.
Frozen URLs - CAREFUL
I always RIGHT-CLICK to open a new tab/window when I am drilling down from one of the main screens, because if you hit 'back' you will mostly end up with a FROZEN URL which will never update - please read on!
ChrisCT got a reply from Graeme: apparently the dashboards do not really update very well at all - be careful! If you go into the cloud view and get a URL like
http://dashb-atlas-prodsys.cern.ch/dashboard/request.py/overview?grouping=cloud&start-date=2009-11-11%2000:00:00&end-date=2009-11-11%2012:59:59&grouping=cloud
as opposed to:
http://dashb-atlas-prodsys.cern.ch/dashboard/request.py/overview?grouping=cloud
then you have a 'frozen' URL which will NEVER update!
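A quick way to recognise a frozen URL is that it has explicit date bounds baked into the query string, as in the first example above. A small, hypothetical helper:
from urllib.parse import urlparse, parse_qs   # Python 2.6: from urlparse import urlparse, parse_qs

def is_frozen(url):
    # a dashboard URL is 'frozen' if explicit start/end dates are baked into it,
    # so it will never pick up new data
    query = parse_qs(urlparse(url).query)
    return 'start-date' in query or 'end-date' in query

# is_frozen(".../overview?grouping=cloud&start-date=2009-11-11%2000:00:00&...") -> True
# is_frozen(".../overview?grouping=cloud")                                      -> False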
On the Panda monitor, look out for 'Waiting' jobs. These might come from jobs that are unable to get their input (e.g. an AOD). You can tell this by:
- clicking on the waiting jobs
- clicking on a PanDA job ID
- looking at the parent input dataset: your aim is to find failed jobs or problems
- To find failed jobs, you can click on the TaskID of the parent input dataset, and there is then a link to failed jobs.
In the ADC monitoring window (https://sls.cern.ch/sls/service.php?id=ADC_CS), if all is not green, the expert (Graeme) should be notified.
Also have a look at the scheduled downtime for sites here:
https://goc.gridops.org/downtime?scope=ALL
or here:
https://twiki.cern.ch/twiki///bin/view/Atlas/AtlasGridDowntime. (NB: at the bottom right of the image showing the calendar, you can import the downtimes into your "Google Calendar" - go on, it's easy to remove when you're not on shift!). If you do raise GGUS tickets, check the downtime first. GGUS tickets are mainly for SITE problems; other things (like datasets not being found because they weren't made properly) don't need GGUS tickets.
- Raise GGUS tickets as "TEAM" tickets. This is important and easy to forget: you have to click the link at the top of the page to get a TEAM ticket (as opposed to an ordinary GGUS ticket!).
- In your GGUS ticket, cc the list: atlas-project-adc-operations-shiftsNOSPAMcern.ch (where NOSPAM stands for the @ sign).
Also, you will need to be able to edit eLogs; there is a simple (instantaneous!) registration process when you try to edit one.
Everyone seems friendly, and will try to answer any questions - I spotted a couple of things and mentioned them during my time on shift, and invariably the expert had already seen them, but was happy to explain that he'd already dealt with it, or why it wasn't important.
Looking at PANDA, if you see a problem with any of the clouds, click on it to get more detailed information. If only one site is failing, click on the number of failures that site is experiencing, and once that URL has come up, change the 'mode=archive' part to 'job=*' (the capitalisation matters here) in order to get the exact error message sent out by that site. In my experience, this was where I got most of the detailed error messages.
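The same URL tweak can be scripted if you find yourself doing it a lot (a hypothetical helper, matching the recipe above):
def archive_to_joblist(url):
    # swap the 'mode=archive' part for 'job=*' (lower case!) so the Panda monitor
    # lists the individual jobs and their exact error messages
    return url.replace('mode=archive', 'job=*')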
The Dashboards, rather disconcertingly, have different names for their clouds - where PANDA labels it UK, the Dashboard labels it RAL. I managed to find a list comparing the two, though:
Cloud name in PANDA | Cloud name in ProdSys / Dashboard
CA   | TRIUMF
CERN | CERN
DE   | FZK
ES   | PIC
FR   | LYON
IT   | CNAF
ND   | NDGF
NL   | SARA
TW   | ASGC
UK   | RAL
US   | BNL
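If you want the same mapping to hand in a script (for instance next to the URL helpers above), it is just a dictionary; the entries are copied from the table:
# PANDA cloud name -> ProdSys/Dashboard cloud name (copied from the table above)
PANDA_TO_DASHBOARD = {
    'CA': 'TRIUMF', 'CERN': 'CERN', 'DE': 'FZK',  'ES': 'PIC',
    'FR': 'LYON',   'IT': 'CNAF',   'ND': 'NDGF', 'NL': 'SARA',
    'TW': 'ASGC',   'UK': 'RAL',    'US': 'BNL',
}
DASHBOARD_TO_PANDA = dict((v, k) for k, v in PANDA_TO_DASHBOARD.items())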
NB: Cloud support often has to be cc'd and their email addresses are hard to find - the cloud support emails are here.
The DDM Dashboard tells you the success rate of transfers between sites, and for some reason seems to respond to queries more quickly than the Production Dashboard - so if you're having a quick hunt through the clouds to find where a specific site is located, I'd recommend using this (or PANDA, which is also fairly responsive). If you find somewhere that's failing a lot of transfers, once you've narrowed it down to the site, click on the number of failures to get the error message, and then click on the '+' next to the site's name to see where the sources for these transfers are - it can sometimes be the case that the transfers are failing because of the source, not the site that's showing the errors (either through downtime, or some other reason).
For both of the dashboards, keep an eye on the graphs at the top of the page, not only the numbers down at the bottom - the numbers may suggest you have a problem, but the graphs can tell you whether it's still an ongoing concern or was just a temporary blip (once, when a site (BNL) was about to go into scheduled downtime, a lot of failed transfers appeared involving it).
In the Production Dashboard, there was a time when there was a 'None' cloud listed - this seemed to contain only sites from other clouds that were known to be having problems - I asked, and apparently 'it happens when an input dataset/file replica is not found in any cloud or not in a cloud where the task is assigned'.
Useful Links not on the ADCoS official pages:
SAM - Site Availability Monitor
AGIS - Downtime calendar - this then appears in your Google Calendar if you click on the icon at the bottom right!
Functional Test Progress Monitors - click around!
Free Space - I have not got this to load anything yet!
=====================================================================================================
General Comments: The people in the Skype control room seem friendly enough, and will happily answer any questions you have - however, the process is still inherently a remote one. I don't know how they'll be able to tell when you are no longer a trainee and can be considered an expert. I felt a bit lost, and just wandered around the various websites looking for things that looked red (or not green) - with time it'll be easier to tell what's worth looking at and what's not, but it's made a bit harder by the fact that you only see the end result of the expert's efforts - you don't get to see the processes and tricks he/she uses to quickly find and diagnose a problem. Once the eLog came out, I could follow the information contained in it and generally find the fault myself. I found a couple of things myself, but when I mentioned them in the control room the expert seemed to be about 5 pages ahead of me, though he didn't seem to mind telling me why he was leaving something, or not. It's also easy to lose track of what's been dealt with and what hasn't - partly because it was the expert, not me, who was dealing with the problems, but also because it's all very interconnected, and a fault in one place could be caused by a fault in another.
But I'm pretty sure it'll get easier with a few shifts - and once I manage to digest the entire TWiki properly!
=====================================================================================================
Specific issues:
Chris CT 24/11/09:
I was asked to submit some test jobs to cloud 'CA' site 'TRIUMF'.
I followed the procedures but got a nasty python error!
On lxplus, I did:
# set up the grid and DQ2 client environment
source /afs/cern.ch/project/gd/LCG-share/current/etc/profile.d/grid_env.sh
source /afs/cern.ch/atlas/offline/external/GRID/ddm/DQ2Clients/setup.zsh
# get a grid proxy with the ATLAS VO
voms-proxy-init --voms atlas
# check out the PanDA server test-submission code
mkdir panda
cd panda
svn co http://www.usatlas.bnl.gov/svn/panda/panda-server/current/pandaserver/test
svn co http://www.usatlas.bnl.gov/svn/panda/panda-server/current/pandaserver/taskbuffer
svn co http://www.usatlas.bnl.gov/svn/panda/panda-server/current/pandaserver/userinterface
# run from the test directory, with the parent checkout on PYTHONPATH
cd test
export PYTHONPATH=..:$PYTHONPATH
I then edited the file test/testG4sim15.py to make sure that
site = 'TRIUMF'
and
cloud = 'CA'
and then ran:
python testG4sim15.py
got error:
File "/afs/cern.ch/user/c/ccollins/panda/taskbuffer/FileSpec.py", line 94, in __getstate__
if isinstance(val,cx_Oracle.Timestamp):
NameError: global name 'cx_Oracle' is not defined
I was using python 2.5 (you can check this by typing python, which prints the version in its startup banner, then CTRL-D to exit).
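You can also check the version non-interactively (just a convenience, not from the TWiki):
python -c "import sys; print(sys.version)"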
So, the fix: edit the file
../taskbuffer/FileSpec.py
and comment out the 3 lines below the comment line which is already there, so that it reads:
# convert cx_Oracle.Timestamp to datetime. this is not needed since python 2.4
#if isinstance(val,cx_Oracle.Timestamp):
#    val = datetime.datetime(val.year,val.month,val.day,
#                            val.hour,val.minute,val.second)
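For what it's worth, the NameError appears because cx_Oracle isn't importable (or isn't imported) in that environment, so the name is never defined. An alternative would be to guard the check instead of deleting it - an untested sketch, not the actual PanDA code:
try:
    import cx_Oracle          # typically only available in the PanDA server environment
except ImportError:
    cx_Oracle = None

import datetime

def to_datetime(val):
    # convert cx_Oracle.Timestamp to datetime; a no-op when cx_Oracle is absent
    if cx_Oracle is not None and isinstance(val, cx_Oracle.Timestamp):
        val = datetime.datetime(val.year, val.month, val.day,
                                val.hour, val.minute, val.second)
    return val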
The job was then submitted:
python testG4sim15.py
(with the site hardcoded inside as site = 'TRIUMF' and cloud = 'NL')
My output looked like:
---------------------
0
PandaID=1031455144
and the test jobs appeared here:
http://panda.cern.ch:25980/server/pandamon/query?job=*&type=test&hours=3