When you log in, you will be directed to one of several login nodes. These provide Linux command-line access to the system, which is necessary for editing programs and for compiling and running code. The login nodes are shared amongst all users who are logged in, so they should not be used for running your code, other than for development and very short test runs.

Access to the job/batch submission system is through the login nodes. When a submitted job executes, processors on the compute nodes are exclusively made available for the purposes of running the job.   

Cluster architecture

Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. As a cluster workload manager, Slurm has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.

Users submit their jobs to the resource manager and scheduler, Slurm. The job requests resources, which are used by the job when they are allocated by Slurm.

As shown below, Slurm consists of a slurmd daemon running on each compute node and a central slurmctld daemon running on a management node (with optional fail-over twin). The slurmd daemons provide fault-tolerant hierarchical communications. The user commands include: sacct, salloc, sattach, sbatch, sbcast, scancel, scontrol, sinfo, smap, squeue, srun, strigger and sview. All of the commands can run anywhere in the cluster.

Cluster entities

The entities managed by these Slurm daemons, shown below, include:

  • Nodes: the compute resources in Slurm
  • Partitions: logical (possibly overlapping) groups of nodes
  • Jobs: allocations of resources assigned to a user for a specified amount of time
  • Job steps: sets of (possibly parallel) tasks within a job

The partitions can be considered job queues, each of which has an assortment of constraints such as job size limit, job time limit, users permitted to use it, etc. Priority-ordered jobs are allocated nodes within a partition until the resources (nodes, processors, memory, etc.) within that partition are exhausted.

Once a job is assigned a set of nodes, the user is able to initiate parallel work in the form of job steps in any configuration within the allocation. For instance, a single job step may be started that utilises all nodes allocated to the job, or several job steps may independently use a portion of the allocation.

  • Jobs: resource allocation requests
  • Job steps: set of (typically parallel) tasks
  • Partitions: job queues with limits and access controls
  • Consumable resources:
    • Nodes
      • NUMA boards
        • Sockets
          • Cores
        • Memory
    • Generic resources, such as GPUs


Job states

Once a job has been submitted to a queue, it is assigned a state that indicates which stage of processing it has reached.


A user submits a job "Job 3" to a partition "jobqueue".

When the job executes, it is allocated the requested resources.

Jobs spawn steps, which are allocated resources within the job's allocation.

Cluster Filesystem

Users have access to two filesystems:

Home Directory

This is the directory you start off in after logging on to one of the login nodes. It is NOT suitable for large datasets or frequently accessed files, and no jobs should be run from here.

Lustre Fast Filesystem

This is the scratch directory in your home directory, and potentially any group data stores that you have access to. It is available to the compute nodes, and is the recommended place for your jobs to read and store data.

Running jobs on the cluster

There are two ways to run your programs on the cluster:

Batch Jobs

These are non-interactive sessions where a number of tasks are batched together into a job script, which is then scheduled and executed by Slurm when resources are available.

  • you write a job script containing the commands you want to execute on the cluster
  • you request an allocation of resources (nodes, cpus, memory)
  • the system grants you one, or more, compute nodes to execute your commands
  • your job script is automatically run
  • your script terminates and the system releases the resources

Interactive Sessions

These are similar to a normal remote login session, and are ideal for debugging and development, or for running interactive programs. The length of these sessions is limited compared to batch jobs, however, so once your development is done you should package your work into a batch job and run it detached.

  • you request an allocation of resources (cpus, memory)
  • the system grants you a whole, or part, node to execute your commands
  • you are logged into the node
  • you run your commands interactively
  • you exit and the system automatically releases the resources

Slurm usage

Resource allocation

In order to interact with the job/batch system (SLURM), the user must first give some indication of the resources they require. At a minimum these include:   

  • how long the job needs to run
  • how many processors the job needs
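
As a minimal sketch of how these two requirements are expressed, the snippet below writes a job script that asks for a run time with --time and a number of tasks with --ntasks, then submits it only if sbatch is available. The file name resources.job and the values requested are illustrative assumptions, not site defaults.

```shell
# Write a hypothetical job script, resources.job, requesting the two minimum resources.
cat > resources.job <<'EOF'
#!/bin/bash
#SBATCH --time=00:10:00   # how long the job needs to run (HH:MM:SS)
#SBATCH --ntasks=4        # how many tasks (processors) the job needs

echo "Job $SLURM_JOB_ID running on $(hostname)"
EOF

# Submit the script only where the Slurm client tools are installed.
if command -v sbatch >/dev/null 2>&1; then
    sbatch resources.job
else
    echo "sbatch not found; resources.job written but not submitted"
fi
```

The same options can also be given on the command line, for example sbatch --time=00:10:00 --ntasks=4 resources.job, in which case they take precedence over the #SBATCH lines in the script.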

The default resource allocation for jobs can be found here.

Armed with this information, the scheduler is able to dispatch the jobs at some point in the future when the resources become available. A fair-share policy is in operation to guide the scheduler towards allocating resources fairly between users. 

Slurm Command Summary

sacct     report job accounting information about active or completed jobs
salloc    allocate resources for a job in real time (typically used to allocate resources and spawn a shell, in which the srun command is used to launch parallel tasks)
sattach   attach standard input, output, and error to a currently running job or job step
sbatch    submit a job script for later execution (the script typically contains one or more srun commands to launch parallel tasks)
scancel   cancel a pending or running job
sinfo     report the state of partitions and nodes managed by Slurm (it has a variety of filtering, sorting, and formatting options)
squeue    report the state of jobs (it has a variety of filtering, sorting, and formatting options); by default, reports the running jobs in priority order followed by the pending jobs in priority order
srun      submit a job for execution in real time

Man pages are available for all commands, and detail all of the options that can be passed to the command.

[usr1@login1(viking) ~]$ man sacct

Interactive Session

To run a job interactively (waits for execution) use the srun command:

srun [options] executable [args]

[usr1@login1(viking) ~]$ srun --ntasks=1 --time=00:30:00 --pty /bin/bash
srun: job 6485884 queued and waiting for resources
srun: job 6485884 has been allocated resources
[usr1@node069 [viking] ~]$ pwd
[usr1@node069 [viking] ~]$ exit
[usr1@login1(viking) ~]$
[usr1@login1(viking) ~]$ srun --ntasks=4 --time=00:30:00 --pty /bin/bash
srun: job 6485885 queued and waiting for resources
srun: job 6485885 has been allocated resources
[usr1@node071 [viking] ~]$

The first example creates a single task (one core) interactive session in the default partition "nodes", with a 30 minute time limit. The second example creates a four task (four cores) session.

To terminate the session, exit the shell.

srun parameters
--cores-per-socket    Restrict node selection to nodes with at least the specified number of cores per socket
--cpu-freq            Request that the job be run at some requested frequency if possible
--cpus-per-task       Request that the specified number of CPUs be allocated per process
--error               Specify how stderr is to be redirected
--exclusive           The job allocation cannot share nodes with other running jobs
--immediate           Exit if resources are not available within the time period specified
--job-name            Specify a name for the job
--mem                 Specify the real memory required per node
--mem-per-cpu         Minimum memory required per allocated CPU
--nodelist            Request a specific list of hosts
--ntasks              Specify the number of tasks to run
--ntasks-per-core     Request that at most ntasks be invoked on each core
--ntasks-per-node     Request that ntasks be invoked on each node
--ntasks-per-socket   Request that at most ntasks be invoked on each socket
--overcommit          Overcommit resources
--partition           Request a specific partition for the resource allocation
--pty                 Execute task zero in pseudo terminal mode
--sockets-per-node    Restrict node selection to nodes with at least the specified number of sockets
--time                Set a limit on the total run time of the job allocation
--threads-per-core    Restrict node selection to nodes with at least the specified number of threads per core
--verbose             Increase the verbosity of srun's informational messages

The salloc command is used to get a resource allocation and use it interactively from your terminal. It is typically used to execute a single command, usually a shell.

Batch Job submission

sbatch is used to submit a script for later execution. It is used for jobs that execute for an extended period of time and is convenient for managing a sequence of dependent jobs. The script can contain job options requesting resources. Job arrays are managed by the sbatch command.
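
A job array can be sketched as follows. This is an illustration rather than a site-specific recipe: the script name array.job, the index range 1-4 and the input file naming are assumptions. The --array option, the SLURM_ARRAY_TASK_ID environment variable and the %A/%a output filename patterns are standard Slurm features.

```shell
# Write a hypothetical array job script, array.job.
cat > array.job <<'EOF'
#!/bin/bash
#SBATCH --time=00:05:00
#SBATCH --ntasks=1
#SBATCH --array=1-4                 # run four array tasks, indices 1 to 4
#SBATCH --output=array-%A_%a.out    # %A = array job ID, %a = array index

# Each array task receives its own index and can use it to pick its input.
echo "Array task $SLURM_ARRAY_TASK_ID would process input_${SLURM_ARRAY_TASK_ID}.dat"
EOF

# Submit the script only where the Slurm client tools are installed.
if command -v sbatch >/dev/null 2>&1; then
    sbatch array.job
else
    echo "sbatch not found; array.job written but not submitted"
fi
```

A single sbatch call submits all four tasks; each one is scheduled independently and writes to its own output file.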

All batch jobs are submitted to a partition. Partitions are configured with specific rules that specify such things as the nodes that can be used, the maximum time a job can run, the number of cores and memory that can be requested.

The command to submit a job has the form:

sbatch [options] script_file_name [args]

where script_file_name is a file containing commands to be executed by the batch request. For example:

$ sbatch example.job

The job script simple.job used in the following session contains:

#!/bin/bash
#SBATCH --time=00:12:00		# Maximum time (HH:MM:SS)

echo simple.job running on $(hostname)
sleep 600

[usr1@login1(viking) scratch]$ sbatch simple.job
Submitted batch job 147874
[usr1@login1(viking) scratch]$ squeue -u usr1
147874 nodes     simple.j usr1 R  0:06 1     node170
[usr1@login1(viking) scratch]$ ls
simple.job slurm-147874.out
[usr1@login1(viking) scratch]$ cat slurm-147874.out
simple.job running on

For a more detailed description of job submission scripts and examples, please see these Example Job Script Files.

srun Command

If invoked within a salloc shell or sbatch script, srun launches an application on the allocated compute nodes.

If invoked on its own, srun creates a job allocation (similar to salloc) and then launches the application on the compute node.

By default srun uses the entire job allocation based upon the job specification, but it can also use a subset of the job's resources. Thousands of job steps can run sequentially or in parallel within the job's resource allocation.
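
To illustrate how srun creates job steps inside a batch allocation, here is a small sketch. The script name steps.job and the commands it runs are assumptions chosen for illustration; the point is only that each srun call inside the script becomes a separate step of the same job.

```shell
# Write a hypothetical job script, steps.job, containing two job steps.
cat > steps.job <<'EOF'
#!/bin/bash
#SBATCH --time=00:15:00
#SBATCH --ntasks=1

# Each srun call below becomes its own job step (<jobid>.0, <jobid>.1, ...).
srun hostname     # step 0
srun sleep 60     # step 1 starts after step 0 has finished
EOF

# Submit the script only where the Slurm client tools are installed.
if command -v sbatch >/dev/null 2>&1; then
    sbatch steps.job
else
    echo "sbatch not found; steps.job written but not submitted"
fi
```

Once such a job has run, sacct -j <jobid> lists one line per step in addition to the job itself and its .batch step.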

Querying queues

The sinfo [options] command displays node and partition (queue) information and state.

PARTITION   Asterisk after partition name indicates the default partition
AVAIL       Partition is able to accept jobs
TIMELIMIT   Maximum time a job can run for
NODES       Number of available nodes in the partition
STATE       down - not available, alloc - jobs being run, idle - waiting for jobs
NODELIST    Nodes available in the partition

The squeue command may be used to display information on the current status of SLURM jobs.

[usr1@login1(viking) scratch]$ squeue -u usr1
147875 nodes     simple.j usr1 R  0:04 1     node170

JOBID       A number used to uniquely identify your job within SLURM
PARTITION   The partition the job has been submitted to
NAME        The job's name, as specified by the job submission script's -J directive
USER        The username of the job owner
ST          Current job status: R (running), PD (pending - queued and waiting)
TIME        The time the job has been running
NODES       The number of nodes used by the job
NODELIST    The nodes used by the job

Important switches to squeue are:

-a        display all jobs
-l        display more information
-u        only display the specified user's jobs
-p        only display jobs in a particular partition
--help    print help
-v        verbose listing
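
When many jobs are queued, it can help to summarise squeue output rather than read it line by line. The helper below is a sketch, not part of Slurm: count_states is a hypothetical function name, and it assumes the default squeue column order, in which the fifth column is the state code (ST).

```shell
# Count jobs per state code. Reads squeue-style lines on stdin and assumes the
# default column order, where the fifth column is ST.
count_states() {
    awk 'NF >= 5 { n[$5]++ } END { for (s in n) print s, n[s] }' | sort
}

# Example with two lines in the default squeue layout (job IDs are made up).
printf '%s\n' \
    '147875 nodes simple.j usr1 R  0:04 1 node170' \
    '147880 nodes simple.j usr1 PD 0:00 1 (Priority)' | count_states
```

On the cluster the function would be fed from squeue directly, for example squeue -h -u "$USER" | count_states, where -h suppresses the header line.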

Cancelling a queued or running job

To delete a job from the queue use the scancel [options] <jobid> command, where jobid is a number referring to the specified job (available from squeue).

[usr1@login1(viking) scratch]$ sbatch simple.job
Submitted batch job 147876
[usr1@login1(viking) scratch]$ squeue -u usr1
147876 nodes     simple.j usr1 R  0:05 1     node170
[usr1@login1(viking) scratch]$ scancel 147876
[usr1@login1(viking) scratch]$ sacct -j 147876
JobID        JobName    Partition  Account    AllocCPUS  State      ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
147876       simple.job nodes      dept-proj+ 1          CANCELLED+ 0:0
147876.batch batch                 dept-proj+ 1          CANCELLED  0:15

A user can delete all their jobs from the batch queues with the -u option:

$ scancel -u <userid>

Job Information

Completed jobs

To display a list of recently completed jobs use the sacct command.

[usr1@login1(viking) scratch]$ sacct -j 147874
JobID        JobName    Partition  Account    AllocCPUS  State      ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
147874       simple.job nodes      dept-proj+ 1          COMPLETED  0:0
147874.batch batch                 dept-proj+ 1          COMPLETED  0:0

Important switches to sacct are:

-a        display all users' jobs
-b        display a brief listing
-E        select the jobs' end date/time
-h        print help
-j        display a specific job
-l        display long format
--name    display jobs with name
-S        select the jobs' start date/time
-u        display only this user
-v        verbose listing

Running jobs

The scontrol show job <jobid> command will show information about a running job.

Information on running job
[usr1@login1(viking) ~]$ scontrol show job 147877
JobId=147877 JobName=simple.job
 UserId=usr1(10506) GroupId=clusterusers(1447400001) MCS_label=N/A
 Priority=4294770448 Nice=0 Account=dept-proj-2018 QOS=normal
 JobState=RUNNING Reason=None Dependency=(null)
 Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
 RunTime=00:05:28 TimeLimit=UNLIMITED TimeMin=N/A
 SubmitTime=2019-01-14T10:31:26 EligibleTime=2019-01-14T10:31:26
 StartTime=2019-01-14T10:31:27 EndTime=Unknown Deadline=N/A
 PreemptTime=None SuspendTime=None SecsPreSuspend=0
 Partition=nodes AllocNode:Sid=login2:356158
 ReqNodeList=(null) ExcNodeList=(null)
 NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
 Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
 MinCPUsNode=1 MinMemoryCPU=4800M MinTmpDiskNode=0
 Features=(null) DelayBoot=00:00:00
 OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)

Queued jobs

Estimated time for job to start

The --start option to squeue will show the expected start time.

$ squeue --start
36    nodes     test usr1 PD N/A                 2     (null)     (PartitionConfig)
34    nodes     test usr1 PD 2018-09-26T10:58:53 1     node170    (Resources)
35    nodes     test usr1 PD 2018-09-26T11:08:00 1     node170    (Priority)

Non-running jobs

Use the --start option to squeue.

If your job has a start time of N/A and/or the REASON column shows PartitionConfig, then you have probably requested resources that are not available on the cluster. In the above example, job 36 has requested more CPUs than are available for jobs in the cluster.
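
The check described above can be sketched as a small filter. The function name no_start_time is hypothetical, and the filter assumes the column layout of the squeue --start output shown earlier: the sixth column is the start time and the last column is the reason.

```shell
# Print the job ID and reason for jobs with no estimated start time.
# Assumes the sixth column is the start time and the last column is the reason.
no_start_time() {
    awk '$6 == "N/A" { print $1, $NF }'
}

# Example using two lines from the squeue --start output in this section.
printf '%s\n' \
    '36    nodes     test usr1 PD N/A                 2     (null)     (PartitionConfig)' \
    '34    nodes     test usr1 PD 2018-09-26T10:58:53 1     node170    (Resources)' | no_start_time
# prints: 36 (PartitionConfig)
```

On the cluster this would typically be run as squeue --start -h -u "$USER" | no_start_time.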

Queue/Partition, Job (Step) State Information

Queue/Partition information:

  • Associated with specific sets of nodes
    • nodes can be in more than one partition
  • Job size and time limits
  • Access control list
  • Preemption rules
  • State information
  • Over-subscription scheduling rules

Job information consists of:

  • ID (number)
  • Name
  • Time limit (minimum and/or maximum)
  • Size (min/max; nodes, CPUs, sockets, cores, threads)
  • Allocated node names
  • Node features
  • Dependency
  • Account name
  • Quality of Service
  • State

Step state information:

  • ID (number): <jobid>.<stepid>
  • Name
  • Time limit (maximum)
  • Size (min/max; nodes, CPUs, sockets, cores, threads)
  • Allocated node names
  • Node features

Job State Codes

When a job is submitted, it is given a "state code" (ST) based on a number of factors, such as priority and resource availability. This information is shown in the output of the squeue command. Common states are:

R  (Running)      The job is currently running.
PD (Pending)      The job is awaiting resource allocation.
CG (Completing)   The job is in the process of completing. Some processes on some nodes may still be active.
F  (Failed)       The job terminated with a non-zero exit code or other failure condition.

Job Reason Codes

The REASON column from the squeue command provides useful information on why a job is in the current state. Some of these reasons may be one or more of the following:

AssociationJobLimit        The job's association has reached its maximum job count.
AssociationResourceLimit   The job's association has reached some resource limit.
AssociationTimeLimit       The job's association has reached its time limit.
BeginTime                  The job's earliest start time has not yet been reached.
Cleaning                   The job is being requeued and is still cleaning up from its previous execution.
Dependency                 The job is waiting for a dependent job to complete.
JobHeldAdmin               The job has been held by the admin.
JobHeldUser                The job has been held by the user.
JobLaunchFailure           The job could not be launched. This may be due to a file system problem, invalid program name, etc.
NodeDown                   A node required by the job is not available at the moment.
PartitionDown              The partition required by the job is DOWN.
PartitionInactive          The partition required by the job is in an Inactive state and unable to start jobs.
PartitionNodeLimit         The number of nodes required by this job is outside of the partition's node limit. Can also indicate that required nodes are DOWN or DRAINED.
PartitionTimeLimit         The job exceeds the partition's time limit.
Priority                   One or more higher priority jobs exist for this partition or advanced reservation.
ReqNodeNotAvail            Some node specifically required for this job is not available. The node may currently be in use, reserved for another job, in an advanced reservation, DOWN, DRAINED, or not responding. Nodes which are DOWN, DRAINED, or not responding will be identified as part of the job's "reason" field as "UnavailableNodes". Such nodes will typically require the intervention of a system administrator to make available.
Reservation                The job is awaiting its advanced reservation to become available.

Selecting the right queue

Most jobs will run by default in the "nodes" queue.

To access the high memory nodes, select the "himem" queue:

#SBATCH --partition=himem

To access the nodes with GPUs, select the "gpu" queue:

#SBATCH --partition=gpu

To access the longer running queues:

#SBATCH --partition=week

#SBATCH --partition=month

The "test" queue

The test queue aims to provide a faster turnaround on job execution. This allows you to test your jobs for short periods of time before you submit them to the main queue.

The jobs submitted to this queue have the following restrictions:

  1. A maximum of 4 jobs can be submitted
  2. A maximum of 2 jobs will run concurrently
  3. The jobs will run for a maximum of 30 minutes
  4. The jobs can use a maximum of 8 cores/cpus
  5. The default memory per cpu is 1GB, maximum 16GB

To use the queue, specify it when submitting the script:

$ sbatch --partition=test myscript.job

System Information


The scontrol show partition command will show information about available partitions.

Information on partitions
[usr1@login1(viking) ~]$ scontrol show partition
 AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
 AllocNodes=ALL Default=YES QoS=maxuserlimits
 DefaultTime=08:00:00 DisableRootJobs=YES ExclusiveUser=NO GraceTime=0 Hidden=NO
 MaxNodes=UNLIMITED MaxTime=2-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
 PriorityJobFactor=1 PriorityTier=1000 RootOnly=NO ReqResv=NO OverSubscribe=NO
 OverTimeLimit=NONE PreemptMode=OFF
 State=UP TotalCPUs=6800 TotalNodes=170 SelectTypeParameters=NONE
 DefMemPerCPU=4750 MaxMemPerNode=UNLIMITED


The scontrol show nodes command will show information about available nodes. Warning: there are many.

Information on nodes
[usr1@login2(viking) simple]$ scontrol show nodes
NodeName=node170 Arch=x86_64 CoresPerSocket=20 
 CPUAlloc=40 CPUTot=40 CPULoad=23.91
 NodeAddr=node170 NodeHostName=node170 Version=18.08
 OS=Linux 3.10.0-957.27.2.el7.x86_64 #1 SMP Mon Jul 29 17:46:05 UTC 2019 
 RealMemory=191668 AllocMem=187904 FreeMem=117134 Sockets=2 Boards=1
 State=ALLOCATED ThreadsPerCore=1 TmpDisk=0 Weight=10 Owner=N/A MCS_label=N/A
 BootTime=2019-08-28T18:06:45 SlurmdStartTime=2020-01-29T11:39:15
 CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
 ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

What resources did my job use?

The output from all batch jobs will include the following statistics block at the end:

Job ID: 5816203
Cluster: viking
User/Group: usr1/clusterusers
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 20
CPU Utilized: 03:53:53
CPU Efficiency: 92.08% of 04:14:00 core-walltime
Job Wall-clock time: 00:12:42
Memory Utilized: 7.21 GB
Memory Efficiency: 36.06% of 20.00 GB

Here we can see the job used just under four hours of CPU time and about 7 GB of memory.
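
The CPU Efficiency figure is the CPU time used divided by the core-walltime, which is the wall-clock time multiplied by the number of cores. The sketch below reproduces the calculation for the block above. The function names to_seconds and cpu_efficiency are hypothetical helpers, and times are assumed to be in HH:MM:SS form with no days field.

```shell
# Convert HH:MM:SS to seconds (a leading days field such as 1-02:00:00 is not handled).
to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# cpu_efficiency <cpu_utilized> <wall_clock> <cores>
# Prints CPU time used as a percentage of wall-clock time multiplied by cores.
cpu_efficiency() {
    used=$(to_seconds "$1")
    wall=$(to_seconds "$2")
    awk -v u="$used" -v w="$wall" -v c="$3" 'BEGIN { printf "%.2f\n", 100 * u / (w * c) }'
}

# Values from the statistics block: 03:53:53 CPU time, 00:12:42 wall-clock, 20 cores.
cpu_efficiency 03:53:53 00:12:42 20
# prints: 92.08
```

Here 00:12:42 on 20 cores gives 15240 core-seconds (04:14:00), of which 14033 seconds were used. A low figure usually suggests that fewer cores could have been requested.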

More information can be found by using the sacct command:

[usr1@login1(viking) scratch]$ sacct -j 44 --format=User,JobID,Jobname,partition,state,time,start,end,elapsed,AveCPU
     User        JobID    JobName  Partition      State  Timelimit               Start                 End    Elapsed     AveCPU 
--------- ------------ ---------- ---------- ---------- ---------- ------------------- ------------------- ---------- ---------- 
     usr1 44                1hour      nodes  COMPLETED   01:10:00 2018-09-26T11:48:28 2018-09-26T12:48:28   01:00:00            
          44.batch          batch             COMPLETED            2018-09-26T11:48:28 2018-09-26T12:48:28   01:00:00            

[usr1@login1(viking) scratch]$ sacct --format=reqmem,maxrss,averss,elapsed -j 44
    ReqMem     MaxRSS     AveRSS    Elapsed 
---------- ---------- ---------- ---------- 
     100Mc                         01:00:00 
     100Mc                         01:00:00 
[usr1@login1(viking) scratch]$ sacct --starttime 2018-09-26 --format=User,JobID,Jobname,partition,state,time,start,end,elapsed,MaxRss,MaxVMSize,nnodes,ncpus,nodelist
     User        JobID    JobName  Partition      State  Timelimit               Start                 End     Elapsed     MaxRSS  MaxVMSize   NNodes      NCPUS        NodeList 
--------- ------------ ---------- ---------- ---------- ---------- ------------------- ------------------- ----------- ---------- ---------- -------- ---------- --------------- 
     usr1 14             hostname       test  CANCELLED Partition+ 2018-08-16T14:44:34             Unknown 41-19:03:28                              0          0   None assigned 
     usr1 32                 test      nodes  COMPLETED   00:10:00 2018-09-26T10:14:13 2018-09-26T10:15:13    00:01:00                              1          1         node170 
          32.batch          batch             COMPLETED            2018-09-26T10:14:13 2018-09-26T10:15:13    00:01:00                              1          1         node170 
     usr1 33                 test      nodes  COMPLETED   00:10:00 2018-09-26T10:48:53 2018-09-26T10:58:53    00:10:00                              1          1         node170 
          33.batch          batch             COMPLETED            2018-09-26T10:48:53 2018-09-26T10:58:53    00:10:00                              1          1         node170 
          33.0           hostname             COMPLETED            2018-09-26T10:48:53 2018-09-26T10:48:53    00:00:00                              1          1         node170 
          33.1              sleep             COMPLETED            2018-09-26T10:48:53 2018-09-26T10:58:53    00:10:00                              1          1         node170 
     usr1 34                 test      nodes  COMPLETED   00:10:00 2018-09-26T10:58:53 2018-09-26T11:08:53    00:10:00                              1          1         node170 
          34.batch          batch             COMPLETED            2018-09-26T10:58:53 2018-09-26T11:08:53    00:10:00                              1          1         node170 
          34.0           hostname             COMPLETED            2018-09-26T10:58:53 2018-09-26T10:58:53    00:00:00                              1          1         node170 
          34.1              sleep             COMPLETED            2018-09-26T10:58:53 2018-09-26T11:08:53    00:10:00                              1          1         node170 
     usr1 35                 test      nodes  COMPLETED   00:10:00 2018-09-26T11:08:53 2018-09-26T11:09:53    00:01:00                              1          1         node170 
          35.batch          batch             COMPLETED            2018-09-26T11:08:53 2018-09-26T11:09:53    00:01:00                              1          1         node170 
     usr1 36                 test      nodes    PENDING   00:10:00             Unknown             Unknown    00:00:00                              1          2   None assigned 
     usr1 37                 test      nodes CANCELLED+   01:10:00 2018-09-26T11:09:53 2018-09-26T11:16:34    00:06:41                              1          1         node170 
          37.batch          batch             CANCELLED            2018-09-26T11:09:53 2018-09-26T11:16:35    00:06:42                              1          1         node170 
     usr1 38                 test      nodes CANCELLED+   00:10:00 2018-09-26T11:16:36 2018-09-26T11:17:25    00:00:49                              1          1         node170 
          38.batch          batch             CANCELLED            2018-09-26T11:16:36 2018-09-26T11:17:26    00:00:50                              1          1         node170 