ADCoS shift summary

Report from Alistair, 27/10/2009:

See the bottom of the page for general conclusions.

To book your shifts, you need to use the Operation Task Planner (OTP) - if you don't have the option to book ADCoS shifts, contact Xavier and he should be able to sort you out. You'll all be glad to know that, as Graeme said, I ran my shift from 9-5 rather than 8-4, and nobody complained (in fact, the expert did his shift 9-4:30 - so even if you want to run by CERN time, that still gives you an extra hour in bed). And Graeme said that it's fine to run it 9-5 by your own time zone.

You can see who is supposed to be on shift at the ADCPoint1Shift page: https://twiki.cern.ch/twiki/bin/view/Atlas/ADCPoint1Shift#Shift_schedule.

  • You may also find the ADCPoint1Shift page useful to see how ADC@P1 shifts fit in with ADCoS (your shift!)
  • look at the bottom for "Shift Schedule" and, inside there, look for the inset window.
  • There is a link to "all" - click it (it links here: http://iueda.web.cern.ch/iueda/adc/shifts/index.html).
  • "Task 88" is the "Distributed Computing Shifts" and it has 3 time zones - you will probably be in the "Requirement: Shifter(Trainee)" bit if you booked in successfully.
  • You should then know who your senior shifter and expert shifter are for the same date and time as your shift.

So, starting your shift: First, open the ADCoS TWiki, and go to the Checklist for things to do. Out of the list of links they ask you to open in your browser, I only really looked at the Production Dashboard, Production (PANDA), the DDM Dashboard and the ADC eLog. As the checklist says, switch to the cloud view on the Dashboards. Note that of those four, all refresh automatically apart from the ADC eLog. However, every time a new eLog entry is created, you'll get an e-mail, so there's not much to worry about there.

Then sign into the ADC Control Room (if at CERN, refer here for ways to set up your Skype)

* Everyone initially appears offline - this is because they're not on your contacts list. Don't worry, just type in the control room conversation (it's text chat, not done with the headset) and everyone will see it. Some comments didn't appear for me as soon as they were typed, so I missed a couple - they turn up later in the correct chronological slot, but by the time they do, you've moved on.

Everyone seems friendly, and will try to answer any questions - I spotted a couple of things and mentioned them during my time on shift, and invariably the expert had already seen them, but was happy to explain that he'd already dealt with it, or why it wasn't important.

Looking at PANDA, if you see a problem with any of the clouds, click on it to get more detailed information. If only one site is failing, click on the number of failures that site is experiencing, and once that URL has popped up, change the 'mode=archive' bit to 'job=*' (the capitalisation matters here) in order to get the exact error message sent out by that site. In my experience, this was where I got most of the detailed error messages.
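The 'mode=archive' to 'job=*' swap above is just a literal substitution in the URL's query string. A minimal sketch of doing it programmatically, assuming a made-up PANDA monitor hostname and parameters (only the two query values come from the text):

```python
# Sketch: rewrite a PANDA monitor URL so it returns the raw error
# message view instead of the archive view, as described above.
# The hostname and the 'site' parameter here are illustrative only.
from urllib.parse import urlsplit, urlunsplit

def archive_to_job(url):
    """Replace the 'mode=archive' query parameter with 'job=*'.
    Capitalisation matters, so match the parameter literally."""
    scheme, netloc, path, query, frag = urlsplit(url)
    parts = [("job=*" if p == "mode=archive" else p)
             for p in query.split("&")]
    return urlunsplit((scheme, netloc, path, "&".join(parts), frag))

url = "https://panda.example.cern.ch/monitor?site=SOME_SITE&mode=archive"
print(archive_to_job(url))
# -> https://panda.example.cern.ch/monitor?site=SOME_SITE&job=*
```

In practice you'd just edit the address bar by hand, of course; the point is only that the rest of the URL stays untouched.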

The Dashboards rather disconcertingly have different names for their clouds - where PANDA labels it UK, the Dashboard labels it RAL. I managed to find a list comparing the two though:

Cloud name in PANDA    Cloud name in ProdSys / Dashboard
CA                     TRIUMF
CERN                   CERN
DE                     FZK
ES                     PIC
FR                     LYON
IT                     CNAF
ND                     NDGF
NL                     SARA
TW                     ASGC
UK                     RAL
US                     BNL

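Since you end up cross-referencing the two naming schemes constantly, the table above is handy to keep as a small lookup. The names below are copied straight from the table; the reverse mapping and the helper usage are just an illustration:

```python
# PANDA cloud name -> ProdSys/Dashboard name, transcribed from the
# table above.
PANDA_TO_DASHBOARD = {
    "CA": "TRIUMF", "CERN": "CERN", "DE": "FZK", "ES": "PIC",
    "FR": "LYON", "IT": "CNAF", "ND": "NDGF", "NL": "SARA",
    "TW": "ASGC", "UK": "RAL", "US": "BNL",
}

# Reverse lookup: Dashboard/ProdSys name -> PANDA name.
# (Safe to invert: every Dashboard name in the table is unique.)
DASHBOARD_TO_PANDA = {v: k for k, v in PANDA_TO_DASHBOARD.items()}

print(PANDA_TO_DASHBOARD["UK"])   # -> RAL
print(DASHBOARD_TO_PANDA["RAL"])  # -> UK
```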
The DDM Dashboard shows the success rate of transfers between sites, and for some reason seems to respond more quickly to queries than the Production Dashboard - so if you're having a quick hunt through the clouds to find where a specific site is located, I'd recommend using this (or PANDA, which is also fairly responsive). If you find somewhere that's failing a lot of transfers, narrow it down to the site, click on the number of failures to get the error message, and then click on the '+' next to the site's name to see where the sources for these transfers are - it can sometimes be the case that the transfers are failing because of the source, not the site showing the errors (either through downtime, or some other reason).

For both of the dashboards, keep an eye on the graphs at the top of the page, not only the numbers down at the bottom - the numbers may suggest you have a problem, but the graphs tell you whether it's still an ongoing concern or was just a temporary blip (for example, when BNL once went into scheduled downtime, a burst of failed transfers involving it appeared).

In the Production Dashboard, there was a time when there was a 'None' cloud listed - this seemed to contain only sites from other clouds that were known to be having problems - I asked, and apparently 'it happens when an input dataset/file replica is not found in any cloud or not in a cloud where the task is assigned'.

=====================================================================================================

General Comments: The people in the Skype control room seem friendly enough, and will happily answer any questions you have - however, the process is still inherently a remote one, and I don't know how they'll be able to tell when you are no longer a trainee and can be considered an expert. I felt a bit lost, and just wandered around the various websites looking for things that looked red (or not green). With time it will be easier to tell what's worth looking at and what's not, but it's made harder by the fact that you only see the end result of the expert's efforts - you don't get to see the processes and tricks he/she uses to quickly find and diagnose a problem. Once an eLog entry came out, I could follow the information in it and generally find the fault myself. I did spot a couple of things myself, but when I mentioned them in the control room the expert seemed to be about five pages ahead of me, though he didn't mind explaining why he was leaving something, or not. It's also easy to lose track of what's been dealt with and what hasn't - partly because it was the expert, not me, dealing with the problems, but also because it's all very interconnected, and a fault in one place could be caused by a fault in another.

But I'm pretty sure it'll get easier with a few shifts - and once I manage to digest the entire TWiki properly!

Topic revision: r2 - 2009-11-10 - ChrisCollins