Difference: HTCondor (1 vs. 9)

Revision 9 - 2020-06-01 - GordonStewart

Line: 1 to 1
 
META TOPICPARENT name="WebHome"

Batch System (HTCondor)

Line: 60 to 60
  Further information can be found in the Condor documentation:
Changed:
<
<
>
>
 

Submit a job

Line: 156 to 156
  Further information can be found in the Commands for Matchmaking section of the Condor documentation:
Changed:
<
<
>
>
 

Line: 176 to 176
  Further information can be found in the Commands for Matchmaking section of the Condor documentation:
Deleted:
<
<
 \ No newline at end of file
Added:
>
>

Revision 8 - 2019-03-02 - GordonStewart

Line: 1 to 1
 
META TOPICPARENT name="WebHome"

Batch System (HTCondor)

Line: 61 to 61
 Further information can be found in the Condor documentation:

Changed:
<
<
>
>
 

Submit a job

Line: 156 to 156
  Further information can be found in the Commands for Matchmaking section of the Condor documentation:
Changed:
<
<
>
>
 

Line: 176 to 176
  Further information can be found in the Commands for Matchmaking section of the Condor documentation:
Deleted:
<
<
 \ No newline at end of file
Added:
>
>
 \ No newline at end of file

Revision 7 - 2018-09-07 - GordonStewart

Line: 1 to 1
 
META TOPICPARENT name="WebHome"

Batch System (HTCondor)

Line: 80 to 80
 

Show status information

You can view details of submitted jobs using the condor_q command:

Changed:
<
<
$ condor_q

>
>
$ condor_q -all

 

-- Schedd: hex.ppe.gla.ac.uk : <172.20.203.50:9618?... @ 05/30/17 11:18:00

Revision 6 - 2018-07-30 - GordonStewart

Line: 1 to 1
 
META TOPICPARENT name="WebHome"

Batch System (HTCondor)

Line: 32 to 32
 

Using HTCondor

Changed:
<
<
Unlike PBS, which has a central server and multiple client machines, HTCondor features a distributed architecture. Jobs can be submitted from the central manager or from any machine running the scheduler daemon, which includes most Linux desktops. The job history which is reported by condor_history provides information for jobs submitted via the scheduler on the local machine (rather than across the whole pool), so it is a good idea to use a single machine for job submission. Running jobs must also communicate periodically with the submission machine.

You may find it easiest to submit jobs by first logging into hex.ppe.gla.ac.uk.

>
>
Unlike PBS, which has a central server and multiple client machines, HTCondor features a distributed architecture. The job history which is reported by condor_history provides information for jobs submitted via the scheduler on the local machine (rather than across the whole pool), so it is a good idea to use a single machine for job submission. Running jobs must also communicate periodically with the submission machine. For these reasons, it is recommended that you first log into hex.ppe.gla.ac.uk in order to submit your jobs.
 

Create a submit description file

Line: 148 to 146
 

Specify CPU and memory requirements

Changed:
<
<
Unlike the old PBS nodes, on which jobs were free to grab whatever resources they liked (to the detriment of both themselves and other jobs on the node), the Condor compute nodes are configured to use cgroups which will restrict a job's resource usage to those resources requested. By default, all Condor jobs are allocated a single CPU and 1 GiB memory. You can adjust these values by adding request_cpus and request_memory statements to your job submit description file:
>
>
Unlike the old PBS nodes, on which jobs were free to grab whatever resources they liked (to the detriment of both themselves and other jobs on the node), the Condor compute nodes are configured to use cgroups which will restrict a job's resource usage to those resources requested. By default, all Condor jobs are allocated a single CPU and 1 GiB memory. You can adjust these values by adding request_cpus and request_memory statements to your job submit description file:
 
request_cpus = 2
request_memory = 4 GB

Line: 164 to 162
 

Submit a job with additional requirements

Changed:
<
<
You can exert more control over where a job runs by including a requirements specification in your job submit description file. This allows you to specify values for various Condor ClassAds, combined with C-style boolean operators. For example, to specify that your job should run on a Scientific Linux 6 machine:
>
>
You can exert more control over where a job runs by including a requirements specification in your job submit description file. This allows you to specify values for various Condor ClassAds, combined with C-style boolean operators. For example, to specify that your job should run on a Scientific Linux 6 machine:
 
requirements = OpSysAndVer == "SL6"

Revision 5 - 2017-12-11 - GordonStewart

Line: 1 to 1
 
META TOPICPARENT name="WebHome"

Batch System (HTCondor)

Line: 13 to 13
 
Nodes Operating System Total CPU Cores
node004 CentOS 7 32
node005 Scientific Linux 6 32
Added:
>
>
node006 CentOS 7 40
node008 CentOS 7 40
 
node034 CentOS 7 56
The Condor central manager (the closest thing it has to a headnode) is hex.ppe.gla.ac.uk.

Revision 4 - 2017-07-18 - GordonStewart

Line: 1 to 1
 
META TOPICPARENT name="WebHome"

Batch System (HTCondor)

Line: 45 to 45
 output = test.out
 error = test.error
 log = test.log
Added:
>
>
requirements = OpSysAndVer == "CentOS7"
  queue
Line: 55 to 56
  The log file (test.log in this example) will contain logging information provided by Condor.
Changed:
<
<
Condor jobs are allocated a single CPU and 1 GiB memory by default, and will run on a machine with the same architecture and operating system as the submission host (i.e. jobs submitted from hex.ppe.gla.ac.uk will run on CentOS 7 nodes by default). To request a different resource allocation, or to specify that a job should run under a different operating system, see Specify CPU and memory requirements and Submit a job with additional requirements.
>
>
Condor jobs are allocated a single CPU and 1 GiB memory by default. The requirements specification above restricts the job to running on CentOS 7 nodes. To run on Scientific Linux 6 nodes, you would replace CentOS7 with SL6. If the operating system of the node is not important, the requirements line may be omitted. To request a different resource allocation, see Specify CPU and memory requirements and Submit a job with additional requirements.
  Further information can be found in the Condor documentation:

Revision 3 - 2017-06-29 - GordonStewart

Line: 1 to 1
 
META TOPICPARENT name="WebHome"

Batch System (HTCondor)

Line: 6 to 6
 

Overview

Changed:
<
<
The PPE group maintains an HTCondor cluster for running batch jobs. This system is currently in its development phase, but it is expected it will replace the much older PBS batch system in the future. You are welcome to submit jobs to the Condor cluster, but please be aware that machines may be reconfigured and rebooted without warning while the system is being commissioned.
>
>
The PPE group maintains an HTCondor cluster for running batch jobs. This system is currently in its development phase, but it is expected it will replace the much older Batch System in the future. You are welcome to submit jobs to the Condor cluster, but please be aware that machines may be reconfigured and rebooted without warning while the system is being commissioned.
  The current composition of the batch system is as follows:

Nodes Operating System Total CPU Cores
node004 CentOS 7 32
Changed:
<
<
>
>
node005 Scientific Linux 6 32
node034 CentOS 7 56
 The Condor central manager (the closest thing it has to a headnode) is hex.ppe.gla.ac.uk.

HTCondor was known as Condor prior to 2012, when threatened legal action forced a change of name. It is still commonly referred to as simply "Condor", and you will find both names used interchangeably in this document.

Line: 36 to 34
  You may find it easiest to submit jobs by first logging into hex.ppe.gla.ac.uk.
Deleted:
<
<
 

Create a submit description file

Jobs are defined using a submit description file, which contains commands which tell HTCondor how to queue the job. These commands are analogous to the lines in a PBS submission script which began with the #PBS prefix and contained directives used by PBS when queuing the job.

A simple submit description file might look like the following:

Changed:
<
<
universe       = vanilla

>
>
universe       = vanilla

 executable = test.sh
 input = test.data
 output = test.out
Line: 55 to 50
 

This will run the executable test.sh in a manner similar to the following:

Changed:
<
<
./test.sh < test.data > test.out 2> test.error

>
>
./test.sh < test.data > test.out 2> test.error

 

The log file (test.log in this example) will contain logging information provided by Condor.

Added:
>
>
Condor jobs are allocated a single CPU and 1 GiB memory by default, and will run on a machine with the same architecture and operating system as the submission host (i.e. jobs submitted from hex.ppe.gla.ac.uk will run on CentOS 7 nodes by default). To request a different resource allocation, or to specify that a job should run under a different operating system, see Specify CPU and memory requirements and Submit a job with additional requirements.
 Further information can be found in the Condor documentation:

Line: 75 to 69
 $ condor_submit <FILENAME>

After running this command, the ID of the newly-submitted job will be output. For example, to submit a job defined by the submit description file test.job:

Changed:
<
<
$ condor_submit test.job

>
>
$ condor_submit test.job

 Submitting job(s).
 1 job(s) submitted to cluster 38.
Line: 84 to 76
  This cluster ID (38 in this example) can be used to manage the job in the future.
Deleted:
<
<
 

Show status information

You can view details of submitted jobs using the condor_q command:

Changed:
<
<
$ condor_q

>
>
$ condor_q

 
Changed:
<
<
-- Schedd: hex.ppe.gla.ac.uk : <172.20.203.50:9618?... @ 05/30/17 11:18:00
>
>
-- Schedd: hex.ppe.gla.ac.uk : <172.20.203.50:9618?... @ 05/30/17 11:18:00
 OWNER BATCH_NAME SUBMITTED DONE RUN IDLE TOTAL JOB_IDS
 gpstewart CMD: sleep.sh 5/30 11:17 _ 1 _ 1 42.0
Line: 101 to 90
 

You can view information about the state of the Condor system as a whole using the condor_status command:

Changed:
<
<
$ condor_status

>
>
$ condor_status

 Name OpSys Arch State Activity LoadAv Mem ActvtyTime

slot1@node004.ppe.gla.ac.uk LINUX X86_64 Unclaimed Idle 0.000 64010 6+00:49:19

Line: 126 to 112
 $ condor_rm <CLUSTER_ID>

To remove the job with cluster ID 43:

Changed:
<
<
$ condor_rm 43

>
>
$ condor_rm 43

 All jobs in cluster 43 have been marked for removal
Line: 132 to 116
 All jobs in cluster 43 have been marked for removal
Deleted:
<
<
 

View history

You can view information about historical job submission using the condor_history command:

Changed:
<
<
$ condor_history

>
>
$ condor_history

  ID OWNER SUBMITTED RUN_TIME ST COMPLETED CMD
  43.0 gpstewart 5/30 11:28 X ??? /home/grid/gpstewart/condor/sleep/sleep.sh
  42.0 gpstewart 5/30 11:17 0+00:00:31 C 5/30 11:18 /home/grid/gpstewart/condor/sleep/sleep.sh
Line: 147 to 128
  39.0 gpstewart 5/11 14:00 0+00:00:06 C 5/11 14:00 /home/grid/gpstewart/condor/mail/mail.sh
Added:
>
>
You can view detailed information about a job by including the -long argument:
$ condor_history -long 3805
ResidentSetSize = 0
ResidentSetSize_RAW = 0
RemoteUserCpu = 0.0
RecentBlockWrites = 0
RecentBlockReadKbytes = 36
JobCurrentStartExecutingDate = 1498649218
...
 As noted previously, the history which is reported by Condor provides information for jobs submitted via the scheduler on the local machine only, and not across the whole pool.
Added:
>
>

Specify CPU and memory requirements

Unlike the old PBS nodes, on which jobs were free to grab whatever resources they liked (to the detriment of both themselves and other jobs on the node), the Condor compute nodes are configured to use cgroups which will restrict a job's resource usage to those resources requested. By default, all Condor jobs are allocated a single CPU and 1 GiB memory. You can adjust these values by adding request_cpus and request_memory statements to your job submit description file:

request_cpus = 2
request_memory = 4 GB

Requesting significantly more CPUs or memory than usual may mean that your job has to wait longer before sufficient resources can be allocated to run it. On the other hand, specifying a lower memory requirement may allow jobs to squeeze into otherwise heavily-loaded nodes.

Further information can be found in the Commands for Matchmaking section of the Condor documentation:

Submit a job with additional requirements

You can exert more control over where a job runs by including a requirements specification in your job submit description file. This allows you to specify values for various Condor ClassAds, combined with C-style boolean operators. For example, to specify that your job should run on a Scientific Linux 6 machine:

requirements = OpSysAndVer == "SL6"
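
If a job can run on more than one operating system, the individual comparisons can be joined with the C-style || operator. The following Python snippet is a minimal sketch of how such a line could be generated; the helper name any_of_os is hypothetical and not part of Condor, and the version strings are the two used elsewhere on this page.

```python
def any_of_os(versions):
    """Build a requirements line that accepts any of the given OpSysAndVer values."""
    # Each comparison is parenthesised so the || operators combine whole clauses.
    clauses = [f'(OpSysAndVer == "{version}")' for version in versions]
    return "requirements = " + " || ".join(clauses)


line = any_of_os(["SL6", "CentOS7"])
print(line)
```

The resulting line can be pasted into a submit description file in place of a single-OS requirements line.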

You can obtain a list of ClassAds and their values on a given node by running the following command:

condor_status -startd HOSTNAME

For example, to obtain the list of ClassAds from node005:

condor_status -startd node005.ppe.gla.ac.uk

Further information can be found in the Commands for Matchmaking section of the Condor documentation:

Revision 2 - 2017-05-30 - GordonStewart

Line: 1 to 1
 
META TOPICPARENT name="WebHome"

Batch System (HTCondor)

Line: 66 to 66
 
Added:
>
>

Submit a job

Jobs are submitted using the condor_submit command:

$ condor_submit <FILENAME>

After running this command, the ID of the newly-submitted job will be output. For example, to submit a job defined by the submit description file test.job:

$ condor_submit test.job
Submitting job(s).
1 job(s) submitted to cluster 38.

This cluster ID (38 in this example) can be used to manage the job in the future.
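
If you submit jobs from a script, you may want to capture the cluster ID rather than read it by eye. The following Python snippet is a minimal sketch which assumes that the output has the same wording as the example above; the function name parse_cluster_id is hypothetical.

```python
import re


def parse_cluster_id(submit_output):
    """Extract the cluster ID from text printed by condor_submit."""
    # Assumes the "N job(s) submitted to cluster ID." wording shown above.
    match = re.search(r"submitted to cluster (\d+)\.", submit_output)
    if match is None:
        raise ValueError("no cluster ID found in condor_submit output")
    return int(match.group(1))


sample_output = "Submitting job(s).\n1 job(s) submitted to cluster 38.\n"
cluster_id = parse_cluster_id(sample_output)
print(cluster_id)
```

The returned number is the value you would pass to commands which take a cluster ID.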

Show status information

You can view details of submitted jobs using the condor_q command:

$ condor_q


-- Schedd: hex.ppe.gla.ac.uk : <172.20.203.50:9618?... @ 05/30/17 11:18:00
OWNER     BATCH_NAME       SUBMITTED   DONE   RUN    IDLE  TOTAL JOB_IDS
gpstewart CMD: sleep.sh   5/30 11:17      _      1      _      1 42.0

1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended

You can view information about the state of the Condor system as a whole using the condor_status command:

$ condor_status
Name                          OpSys      Arch   State     Activity LoadAv Mem    ActvtyTime

slot1@node004.ppe.gla.ac.uk   LINUX      X86_64 Unclaimed Idle      0.000 64010  6+00:49:19
slot1_1@node004.ppe.gla.ac.uk LINUX      X86_64 Claimed   Busy      0.000   128  0+00:00:03

                     Machines Owner Claimed Unclaimed Matched Preempting  Drain

        X86_64/LINUX        2     0       1         1       0          0      0

               Total        2     0       1         1       0          0      0

Within Condor, each compute node is configured with a single 'slot' which contains all the resources held by that node; this slot is then partitioned such that running jobs receive the resources they request and no more. In the above example, there are two slots associated with node004 (which has 32 CPU cores and 64 GB of memory); slot1_1 represents a running, single-core job with 128 MB of memory, while slot1 contains the remaining unallocated resources on the compute node.
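
The slot accounting described above can be illustrated with a short Python sketch which reads the two slot rows from the example and separates unallocated memory from memory held by running jobs. It assumes the eight-column layout shown above and that a partitioned-off slot can be recognised by an underscore in its name (as with slot1_1); the function name summarise_slots is hypothetical.

```python
SAMPLE_ROWS = """\
slot1@node004.ppe.gla.ac.uk   LINUX      X86_64 Unclaimed Idle      0.000 64010  6+00:49:19
slot1_1@node004.ppe.gla.ac.uk LINUX      X86_64 Claimed   Busy      0.000   128  0+00:00:03
"""


def summarise_slots(text):
    """Summarise memory (in MB) per node from condor_status slot rows."""
    summary = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip headers, blank lines and totals: slot rows have 8 columns.
        if len(fields) != 8 or "@" not in fields[0]:
            continue
        slot, host = fields[0].split("@", 1)
        memory_mb = int(fields[6])
        node = summary.setdefault(
            host, {"unallocated_mb": 0, "allocated_mb": 0, "running_jobs": 0}
        )
        if "_" in slot:
            # Assumption: names like slot1_1 are slices carved out for a job.
            node["allocated_mb"] += memory_mb
            node["running_jobs"] += 1
        else:
            # The parent slot holds whatever has not been handed out.
            node["unallocated_mb"] += memory_mb
    return summary


slots = summarise_slots(SAMPLE_ROWS)
print(slots)
```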

Remove a job

Jobs are removed using the condor_rm command:

$ condor_rm <CLUSTER_ID>

To remove the job with cluster ID 43:

$ condor_rm 43
All jobs in cluster 43 have been marked for removal

View history

You can view information about historical job submission using the condor_history command:

$ condor_history
 ID     OWNER          SUBMITTED   RUN_TIME     ST COMPLETED   CMD            
  43.0   gpstewart       5/30 11:28              X         ???  /home/grid/gpstewart/condor/sleep/sleep.sh 
  42.0   gpstewart       5/30 11:17   0+00:00:31 C   5/30 11:18 /home/grid/gpstewart/condor/sleep/sleep.sh 
  41.0   gpstewart       5/11 14:10   0+00:00:06 C   5/11 14:10 /home/grid/gpstewart/condor/mail/mail.sh 
  40.0   gpstewart       5/11 14:09   0+00:00:07 C   5/11 14:09 /home/grid/gpstewart/condor/mail/mail.sh 
  39.0   gpstewart       5/11 14:00   0+00:00:06 C   5/11 14:00 /home/grid/gpstewart/condor/mail/mail.sh

As noted previously, the history which is reported by Condor provides information for jobs submitted via the scheduler on the local machine only, and not across the whole pool.

Revision 1 - 2017-05-11 - GordonStewart

Line: 1 to 1
Added:
>
>
META TOPICPARENT name="WebHome"

Batch System (HTCondor)

Overview

The PPE group maintains an HTCondor cluster for running batch jobs. This system is currently in its development phase, but it is expected it will replace the much older PBS batch system in the future. You are welcome to submit jobs to the Condor cluster, but please be aware that machines may be reconfigured and rebooted without warning while the system is being commissioned.

The current composition of the batch system is as follows:

Nodes Operating System Total CPU Cores
node004 CentOS 7 32

The Condor central manager (the closest thing it has to a headnode) is hex.ppe.gla.ac.uk.

HTCondor was known as Condor prior to 2012, when threatened legal action forced a change of name. It is still commonly referred to as simply "Condor", and you will find both names used interchangeably in this document.

Queues

HTCondor does not have queues in the way that PBS does. Instead, jobs are submitted to the Condor pool and then matched to appropriate resources based on their individual requirements.

Job Prioritisation

The cluster is configured with a fair-share scheduler, which aims to distribute compute time fairly among users. When multiple users are competing for resources, preference will be shown to users whose recent usage has been lower.

Running jobs can be pre-empted by newly-submitted jobs with a higher priority. Pre-empted jobs will either be suspended or evicted. A suspended job remains on the node on which it was running, but is no longer executed; once the pre-empting job has finished, the pre-empted job will be allowed to continue. An evicted job is terminated and re-queued for execution at a later time.

Using HTCondor

Unlike PBS, which has a central server and multiple client machines, HTCondor features a distributed architecture. Jobs can be submitted from the central manager or from any machine running the scheduler daemon, which includes most Linux desktops. The job history which is reported by condor_history provides information for jobs submitted via the scheduler on the local machine (rather than across the whole pool), so it is a good idea to use a single machine for job submission. Running jobs must also communicate periodically with the submission machine.

You may find it easiest to submit jobs by first logging into hex.ppe.gla.ac.uk.

Create a submit description file

Jobs are defined using a submit description file, which contains commands which tell HTCondor how to queue the job. These commands are analogous to the lines in a PBS submission script which began with the #PBS prefix and contained directives used by PBS when queuing the job.

A simple submit description file might look like the following:

universe       = vanilla
executable     = test.sh
input          = test.data
output         = test.out
error          = test.error             
log            = test.log

queue

This will run the executable test.sh in a manner similar to the following:

./test.sh < test.data > test.out 2> test.error

The log file (test.log in this example) will contain logging information provided by Condor.
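
If you need to create many similar submit description files, they can be generated from a script. The following Python snippet is a minimal sketch which renders the example above from a dictionary; the function name render_submit_description and its queue_count parameter are hypothetical, and the file names are those used in the example.

```python
def render_submit_description(commands, queue_count=1):
    """Render submit commands as the text of a submit description file."""
    # Align the "=" signs in the same way as the hand-written example.
    width = max(len(name) for name in commands)
    lines = [f"{name.ljust(width)} = {value}" for name, value in commands.items()]
    # The queue statement comes last, after all commands which describe the job.
    lines.append("")
    lines.append("queue" if queue_count == 1 else f"queue {queue_count}")
    return "\n".join(lines) + "\n"


example = render_submit_description({
    "universe": "vanilla",
    "executable": "test.sh",
    "input": "test.data",
    "output": "test.out",
    "error": "test.error",
    "log": "test.log",
})
print(example)
```

The text which is printed can be saved to a file such as test.job.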

Further information can be found in the Condor documentation:

 