When you log in, you will be directed to one of several login nodes. These provide Linux command-line access to the system, which is needed for editing, compiling and running code. The login nodes are shared amongst all logged-in users, so they should not be used for running your code, other than for development and very short test runs.
Access to the job/batch submission system is through the login nodes. When a submitted job executes, processors on the compute nodes are exclusively made available for the purposes of running the job.
Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. As a cluster workload manager, Slurm has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.
Users submit their jobs to the resource manager and scheduler, Slurm. The job requests resources, which are used by the job when they are allocated by Slurm.
As shown below, Slurm consists of a slurmd daemon running on each compute node and a central slurmctld daemon running on a management node (with optional fail-over twin). The slurmd daemons provide fault-tolerant hierarchical communications. The user commands include: sacct, salloc, sattach, sbatch, sbcast, scancel, scontrol, sinfo, smap, squeue, srun, strigger and sview. All of the commands can run anywhere in the cluster.
The entities managed by these Slurm daemons, shown below, include:
- Nodes: the compute resource in Slurm
- Partitions: which group nodes into logical (possibly overlapping) sets
- Jobs: or allocations of resources assigned to a user for a specified amount of time
- Job steps: which are sets of (possibly parallel) tasks within a job.
The partitions can be considered job queues, each of which has an assortment of constraints such as job size limit, job time limit, users permitted to use it, etc. Priority-ordered jobs are allocated nodes within a partition until the resources (nodes, processors, memory, etc.) within that partition are exhausted.
Once a job is assigned a set of nodes, the user is able to initiate parallel work in the form of job steps in any configuration within the allocation. For instance, a single job step may be started that utilises all nodes allocated to the job, or several job steps may independently use a portion of the allocation.
- Jobs: resource allocation requests
- Job steps: set of (typically parallel) tasks
- Partitions: job queues with limits and access controls
- Consumable resources:
- NUMA boards
- Generic resources, such as GPUs
Once a job has been submitted to a queue, it is assigned a state which indicates where it is in the processing hierarchy.
A user submits a job "Job 3" to a partition "jobqueue".
When the job executes, it is allocated the requested resources.
Jobs spawn steps, which are allocated resources within the job's allocation.
Users have access to two filesystems:
Home Filesystem
This is the directory you start off in after logging on to one of the login nodes. It is NOT suitable for large data-sets or frequently accessed files, and jobs should not be run from here.
Lustre Fast Filesystem
This is the scratch directory in your home directory, and potentially any group data stores that you have access to. It is available to the compute nodes, and is the recommended place for your jobs to read and store data.
Running jobs on the cluster
There are two ways to run your programs on the cluster:
Batch jobs
These are non-interactive sessions where a number of tasks are batched together into a job script, which is then scheduled and executed by Slurm when resources become available.
- you write a job script containing the commands you want to execute on the cluster
- you request an allocation of resources (nodes, cpus, memory)
- the system grants you one, or more, compute nodes to execute your commands
- your job script is automatically run
- your script terminates and the system releases the resources
Interactive sessions
These are similar to a normal remote login session, and are ideal for debugging, development, and running interactive programs. However, the length of these sessions is limited compared to batch jobs, so once your development is done, you should pack the work up into a batch job and run it detached.
- you request an allocation of resources (cpus, memory)
- the system grants you a whole, or part, node to execute your commands
- you are logged into the node
- you run your commands interactively
- you exit and the system automatically releases the resources
In order to interact with the job/batch system (Slurm), the user must first give some indication of the resources they require. At a minimum these include:
- how long the job needs to run for
- how many processors the job requires
The default resource allocation for jobs can be found here.
Armed with this information, the scheduler is able to dispatch the jobs at some point in the future when the resources become available. A fair-share policy is in operation to guide the scheduler towards allocating resources fairly between users.
Slurm Command Summary
|sacct||report job accounting information about active or completed jobs|
|salloc||allocate resources for a job in real time (typically used to allocate resources and spawn a shell, in which the srun command is used to launch parallel tasks)|
|sattach||attach standard input, output, and error to a currently running job or job step|
|sbatch||submit a job script for later execution (the script typically contains one or more srun commands to launch parallel tasks)|
|scancel||cancel a pending or running job|
|sinfo||reports the state of partitions and nodes managed by Slurm (it has a variety of filtering, sorting, and formatting options)|
|squeue||reports the state of jobs (it has a variety of filtering, sorting, and formatting options), by default, reports the running jobs in priority order followed by the pending jobs in priority order|
|srun||used to submit a job for execution in real time|
Man pages are available for all commands, and detail all of the options that can be passed to the command.
To run a job interactively (the command waits until the job executes) use the srun command:
srun [options] executable [args]
The first example creates a single task (one core) interactive session in the default partition "nodes", with a 30 minute time limit. The second example creates a four task (four cores) session.
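A plausible form of these two examples is sketched below; the exact invocations may differ by site, and the partition name "nodes" is taken from the text above:

```shell
# Example 1: single task (one core) interactive shell in the default
# partition "nodes", with a 30 minute time limit:
srun --partition=nodes --ntasks=1 --time=00:30:00 --pty bash

# Example 2: four task (four core) interactive session:
srun --partition=nodes --ntasks=4 --time=00:30:00 --pty bash
```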
To terminate the session, exit the shell.
|--cores-per-socket||Restrict node selection to nodes with at least the specified number of cores per socket|
|--cpu-freq||Request that the job be run at some requested frequency if possible|
|--cpus-per-task||Request that the number of cpus be allocated per process|
|--error||Specify how stderr is to be redirected|
|--exclusive||The job allocation cannot share nodes with other running jobs|
|--immediate||Exit if resources are not available within the time period specified|
|--job-name||Specify a name for the job|
|--mem||Specify the real memory required per node|
|--mem-per-cpu||Minimum memory required per allocated CPU|
|--nodelist||Request a specific list of hosts|
|--ntasks||Specify the number of tasks to run|
|--ntasks-per-core||Request the maximum ntasks be invoked on each core|
|--ntasks-per-node||Request that ntasks be invoked on each node|
|--ntasks-per-socket||Request the maximum ntasks be invoked on each socket|
|--partition||Request a specific partition for the resource allocation|
|--pty||Execute task zero in pseudo terminal mode|
|--sockets-per-node||Restrict node selection to nodes with at least the specified number of sockets|
|--time||Set a limit on the total run time of the job allocation|
|--threads-per-core||Restrict node selection to nodes with at least the specified number of threads per core|
|--verbose||Increase the verbosity of srun's informational messages|
The salloc command is used to obtain a resource allocation and use it interactively from your terminal. It is typically used to execute a single command, usually a shell.
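A minimal sketch of typical salloc usage (the option values here are illustrative):

```shell
# Allocate four tasks for 30 minutes; salloc spawns a shell once
# the allocation is granted.
salloc --ntasks=4 --time=00:30:00

# Inside that shell, srun launches parallel tasks on the allocated nodes:
srun hostname

# Exiting the shell releases the allocation.
exit
```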
Batch Job submission
sbatch is used to submit a script for later execution. It is used for jobs that execute for an extended period of time and is convenient for managing a sequence of dependent jobs. The script can contain job options requesting resources. Job arrays are managed by the sbatch command.
All batch jobs are submitted to a partition. Partitions are configured with specific rules that specify such things as the nodes that can be used, the maximum time a job can run, the number of cores and memory that can be requested.
The command to submit a job has the form:
$ sbatch [options] script_file_name
where script_file_name is a file containing the commands to be executed by the batch request. For example:
$ sbatch example.job
For a more detailed description of job submission scripts and examples, please see these Example Job Script Files.
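For illustration, a minimal job script might look like the sketch below; the program name and resource values are hypothetical:

```shell
#!/bin/bash
#SBATCH --job-name=example      # name shown by squeue
#SBATCH --ntasks=1              # number of tasks (cores)
#SBATCH --time=00:10:00         # wall-clock time limit
#SBATCH --mem=1G                # memory required per node

# Commands executed on the allocated compute node:
echo "Running on $(hostname)"
srun ./my_program               # hypothetical executable
```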
If invoked on its own, srun creates a job allocation (similar to salloc) and then launches the application on the compute node.
By default a job step started with srun uses the entire job allocation, but srun can also use a subset of the job's resources. Many job steps can run sequentially or in parallel within the job's resource allocation.
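As a sketch, a job script can split its allocation across two concurrent job steps (the program names are hypothetical):

```shell
#!/bin/bash
#SBATCH --ntasks=8
#SBATCH --time=01:00:00

# Each srun launches a job step using a subset (4 tasks) of the
# job's 8-task allocation; the two steps run in parallel.
srun --ntasks=4 ./step_one &
srun --ntasks=4 ./step_two &
wait    # wait for both job steps to finish
```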
The sinfo [options] command displays node and partition (queue) information and state.
|PARTITION||Asterisk after partition name indicates the default partition|
|AVAIL||Partition is able to accept jobs|
|TIMELIMIT||Maximum time a job can run for|
|NODES||Number of available nodes in the partition|
|STATE||down - not available, alloc - jobs being run, idle - waiting for jobs|
|NODELIST||Nodes available in the partition|
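An illustrative example of sinfo output; the partition names, node names, counts and time limits below are entirely hypothetical:

```shell
$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
nodes*       up 2-00:00:00      4  alloc node[001-004]
nodes*       up 2-00:00:00     12   idle node[005-016]
gpu          up 2-00:00:00      2   idle gpu[001-002]
```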
The squeue command may be used to display information on the current status of SLURM jobs.
|JOBID||A number used to uniquely identify your job within SLURM|
|PARTITION||The partition the job has been submitted to|
|NAME||The job's name, as specified by the job submission script's -J directive|
|USER||The username of the job owner|
|ST||Current job status: R (running), PD (pending - queued and waiting)|
|TIME||The time the job has been running|
|NODES||The number of nodes used by the job|
|NODELIST||The nodes used by the job|
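An illustrative example of squeue output; the job ids, names, user and nodes are hypothetical:

```shell
$ squeue
 JOBID PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
   102     nodes analysis   abc123 PD   0:00      1 (Priority)
   101     nodes simulate   abc123  R  12:34      2 node[001-002]
```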
Important switches to squeue are:
|-a||display all jobs|
|-l||display more information|
|-u||only display a user's jobs|
|-p||only display jobs in a particular partition|
Cancelling a queued or running job
To delete a job from the queue use the scancel [options] <jobid> command, where jobid is a number referring to the specified job (available from squeue).
A user can delete all their jobs from the batch queues with the -u option:
$ scancel -u <userid>
To display a list of recently completed jobs use the sacct command.
Important switches to sacct are:
|-a||display all users' jobs|
|-b||display a brief listing|
|-E||select the job's end date/time|
|-j||display a specific job|
|-l||display long format|
|--name||display jobs with a given name|
|-S||select the job's start date/time|
|-u||display only this user's jobs|
The scontrol show job <jobid> command will show information about a running job.
Estimated time for job to start
The --start option to squeue will show the expected start time.
Non-running jobs
Use the --start option to squeue.
If your job has a start time of N/A and/or the REASON column states PartitionConfig, then you have probably requested resources that are not available on the cluster, for example more CPUs than are available for jobs in the cluster.
Queue/Partition, Job (Step) State Information
Partition information consists of:
- Associated with specific sets of nodes
- nodes can be in more than one partition
- Job size and time limits
- Access control list
- Preemption rules
- State information
- Over-subscription scheduling rules
Job information consists of:
- ID (number)
- Time limit (minimum and/or maximum)
- Size (min/max; nodes, CPUs, sockets, cores, threads)
- Allocated node names
- Node features
- Account name
- Quality of Service
Step state information:
- ID (number): <jobid>.<stepid>
- Time limit (maximum)
- Size (min/max; nodes, CPUs, sockets, cores, threads)
- Allocated node names
- Node features
Job State Codes
When submitting a job, the job will be given a "state code" (ST) based on a number of factors, such as priority and resource availability. This information is shown in the squeue commands. Common states are:
|R ( Running )||The job is currently running.|
|PD ( Pending )||The job is awaiting resource allocation.|
|CG ( Completing )||Job is in the process of completing. Some processes on some nodes may still be active.|
|F ( Failed )||Job terminated on non-zero exit code or other failure condition.|
Job Reason Codes
The REASON column from the squeue command provides useful information on why a job is in the current state. Some of these reasons may be one or more of the following:
|AssociationJobLimit||The job's association has reached its maximum job count.|
|AssociationResourceLimit||The job's association has reached some resource limit.|
|AssociationTimeLimit||The job's association has reached its time limit.|
|BeginTime||The job's earliest start time has not yet been reached.|
|Cleaning||The job is being requeued and is still cleaning up from its previous execution.|
|Dependency||The job is waiting for a dependent job to complete.|
|JobHeldAdmin||The job has been held by the admin.|
|JobHeldUser||The job has been held by the user.|
|JobLaunchFailure||The job could not be launched. This may be due to a file system problem, invalid program name, etc.|
|NodeDown||A node required by the job is not available at the moment.|
|PartitionDown||The partition required by the job is DOWN.|
|PartitionInactive||The partition required by the job is in an Inactive state and unable to start jobs.|
|PartitionNodeLimit||The number of nodes required by this job is outside of the partition's node limit. Can also indicate that required nodes are DOWN or DRAINED.|
|PartitionTimeLimit||The job exceeds the partition's time limit.|
|Priority||One or more higher priority jobs exist for this partition or advanced reservation.|
|ReqNodeNotAvail||Some node specifically required for this job is not available. The node may currently be in use, reserved for another job, in an advanced reservation, DOWN, DRAINED, or not responding. Nodes which are DOWN, DRAINED, or not responding will be identified as part of the job's "reason" field as "UnavailableNodes". Such nodes will typically require the intervention of a system administrator to make available.|
|Reservation||The job is awaiting its advanced reservation to become available.|
Selecting the right queue
Most jobs will run by default in the "nodes" queue.
To access the high memory nodes, select the "himem" queue:
$ sbatch --partition=himem myscript.job
To access the nodes with GPUs, select the "gpu" queue:
$ sbatch --partition=gpu myscript.job
To access the longer running queues:
The "test" queue
The test queue aims to provide a faster turn-around on job execution. This allows you to test your jobs for short periods of time before you submit them to the main queue.
The jobs submitted to this queue have the following restrictions:
- A maximum of 4 jobs can be submitted
- A maximum of 2 jobs will run concurrently
- The jobs will run for a maximum of 30 minutes
- The jobs can use a maximum of 8 cores/cpus
- The default memory per cpu is 1GB, maximum 16GB
To use the queue define the queue when submitting the script:
$ sbatch --partition=test myscript.job
The scontrol show partition command will show information about available partitions.
The scontrol show nodes command will show information about available nodes. Warning: there are many.
What resources did my job use?
The output from all batch jobs will include the following statistics block at the end:
Here we can see the job used four hours of CPU time and 7 GB of memory.
More information can be found by using the sacct command:
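For example, to show the elapsed time, total CPU time, and peak memory of a completed job (the job id below is hypothetical):

```shell
sacct -j 101 --format=JobID,JobName,Elapsed,TotalCPU,MaxRSS,State
```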