Difference: BatchSystem (13 vs. 14)

Revision 14 (2016-04-25) - GordonStewart


Batch System


Overview

The PPE group maintains a PBS cluster for running small numbers of jobs. If you need to run large numbers of jobs, you should investigate the possibility of running on ScotGrid. The current composition of the batch system is as follows:

Nodes                       Operating System    Total CPU Cores
tempnode001 to tempnode006  Scientific Linux 5  24
tempnode007 to tempnode015  Scientific Linux 6  36
The PBS headnode is offler.ppe.gla.ac.uk, and you will see this name in the output of various PBS commands.

Queues

 
Name     Operating System    Maximum runtime
short5   Scientific Linux 5  1 hour
  Jobs running in the vlong* queues can be pre-empted by jobs in the short* and medium* queues. A pre-empted job is placed in the suspended state; it remains in memory on the compute node, but is no longer being executed. Once the pre-empting job has finished, the pre-empted job will be allowed to continue.
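When a job has been pre-empted, its state column in qstat output changes from R (running) to S (suspended). A hypothetical transcript follows; the job ID and other details are illustrative, not taken from a real listing:

```shell
$ qstat 1000152
...
1000152.offler.p bbunny   long6    test_job    29369   1   1   --  24:00 S 01:24
```

Once the pre-empting job completes, the state returns to R and execution resumes from where it was suspended.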

Job Prioritisation

The cluster is configured with a fair-share scheduler, which aims to distribute compute time fairly among users. When multiple users are competing for resources, preference will be shown to users whose recent usage has been lower. Short jobs are also generally given priority over longer jobs.
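Purely as an illustration of how these factors combine — the actual weights used by the scheduler on this cluster are not documented here, and every number below is made up — the effective priority of a queued job can be pictured as a constant per-queue weight, minus a penalty that grows with the user's recent usage, plus a bonus that grows with time spent waiting:

```shell
# Toy model of fair-share prioritisation. All values are hypothetical
# and do NOT come from the real scheduler configuration.
queue_weight=1000   # short queues get the largest constant weight (assumption)
usage_penalty=200   # grows with the user's recent CPU usage (assumption)
wait_bonus=50       # slowly increases while the job waits in the queue (assumption)

# The job's effective priority under this toy model:
priority=$((queue_weight - usage_penalty + wait_bonus))
echo "$priority"
```

Under this picture, a user who has recently consumed a lot of CPU time sees a larger penalty, so a competing user's jobs run first.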

 

Using PBS

#PBS -N TestJob
#PBS -o test.log
#PBS -j oe
#PBS -l mem=1024Mb

echo "This is a test..."
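Assuming the script above is saved as, say, test.sh (the filename and the returned job ID below are illustrative), it would be submitted with qsub, which prints the ID assigned to the new job:

```shell
$ qsub test.sh
1000150.offler.ppe.gla.ac.uk
```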
                                                          Req'd  Req'd   Elap
Job ID           Username Queue    Jobname          SessID NDS TSK Memory Time  S Time
---------------- -------- -------- ---------------- ------ --- --- ------ ----- - -----
1000151.offler.p rrabbit  medium6  test_job_123      56299   1   1    --  05:59 R 03:21
   node034
1000152.offler.p bbunny   long6    test_job          29369   1   1    --  24:00 R 01:24
   node007
 
You can also provide a job ID to limit the output to a particular job:

$ qstat 1000151

offler.ppe.gla.ac.uk:
                                                          Req'd  Req'd   Elap
Job ID           Username Queue    Jobname          SessID NDS TSK Memory Time  S Time
---------------- -------- -------- ---------------- ------ --- --- ------ ----- - -----
1000151.offler.p rrabbit  medium6  test_job_123      56299   1   1    --  05:59 R 03:21
   node034
 

Delete a job
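With standard PBS, a queued or running job is removed with qdel, giving the job ID reported by qstat. The ID below matches the earlier examples and is illustrative:

```shell
$ qdel 1000151
```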

 