Access
When you log in, you will be directed to one of several login nodes. These provide Linux command-line access to the system, which is necessary for editing programs and for compiling and running code. Usage of the login nodes is shared amongst all who are logged in. These systems should not be used for running your code, other than for development and very short test runs.
Access to the job/batch submission system is through the login nodes. When a submitted job executes, processors on the compute nodes are exclusively made available for the purposes of running the job.
Cluster architecture
Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. As a cluster workload manager, Slurm has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.
Users submit their jobs to the resource manager and scheduler, Slurm. The job requests resources, which are used by the job when they are allocated by Slurm.
As shown below, Slurm consists of a slurmd daemon running on each compute node and a central slurmctld daemon running on a management node (with optional fail-over twin). The slurmd daemons provide fault-tolerant hierarchical communications. The user commands include: sacct, salloc, sattach, sbatch, sbcast, scancel, scontrol, sinfo, smap, squeue, srun, strigger and sview. All of the commands can run anywhere in the cluster.
Cluster entities
The entities managed by these Slurm daemons, shown below, include:
- Nodes: the compute resource in Slurm
- Partitions: which group nodes into logical (possibly overlapping) sets
- Jobs: allocations of resources assigned to a user for a specified amount of time
- Job steps: which are sets of (possibly parallel) tasks within a job.
The partitions can be considered job queues, each of which has an assortment of constraints such as job size limit, job time limit, users permitted to use it, etc. Priority-ordered jobs are allocated nodes within a partition until the resources (nodes, processors, memory, etc.) within that partition are exhausted.
Once a job is assigned a set of nodes, the user is able to initiate parallel work in the form of job steps in any configuration within the allocation. For instance, a single job step may be started that utilises all nodes allocated to the job, or several job steps may independently use a portion of the allocation.
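As a sketch of what this looks like in practice (my_program is a hypothetical executable and the resource figures are arbitrary), a job script can launch one step across its whole two-node allocation and then two single-node steps side by side:

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --time=00:10:00

# one job step using every node in the allocation
srun --nodes=2 --ntasks=8 ./my_program stage1

# two job steps, each using one node of the allocation, started concurrently
srun --nodes=1 --ntasks=4 ./my_program stage2a &
srun --nodes=1 --ntasks=4 ./my_program stage2b &
wait    # do not let the job end until both steps have finished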
- Jobs: resource allocation requests
- Job steps: set of (typically parallel) tasks
- Partitions: job queues with limits and access controls
- Consumable resources:
  - Nodes
  - NUMA boards
  - Sockets
  - Cores
  - Memory
  - Generic resources, such as GPUs
Job states
Once a job has been submitted to a queue, it is assigned a state which indicates where it is in the processing hierarchy.
Example
A user submits a job "Job 3" to a partition "jobqueue".
When the job executes, it is allocated the requested resources.
Jobs spawn steps, which are allocated resources from within the job's allocation.
Cluster Filesystem
Users have access to two filesystems:
Home Directory
This is the directory you start off in after logging on to one of the login nodes. It should NOT be used for large data sets or frequently accessed files, and jobs should not be run from here.
Lustre Fast Filesystem
This is the scratch directory in your home directory, and potentially any group data stores that you have access to. It is available to the compute nodes, and is the recommended place for your jobs to read and store data.
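For example (a minimal sketch; the project directory and job script names are hypothetical), create a working directory on the Lustre filesystem and submit jobs from there rather than from your home directory:

cd ~/scratch                  # the Lustre fast filesystem
mkdir -p my_project
cd my_project
cp ~/simple.job .             # keep job scripts, input and output data here
sbatch simple.job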
Running jobs on the cluster
There are two ways to run your programs on the cluster:
Batch Jobs
These are non-interactive sessions where a number of tasks are batched together into a job script, which is then scheduled and executed by Slurm when resources are available.
- you write a job script containing the commands you want to execute on the cluster
- you request an allocation of resources (nodes, cpus, memory)
- the system grants you one, or more, compute nodes to execute your commands
- your job script is automatically run
- your script terminates and the system releases the resources
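A minimal sketch of this workflow (the script and program names are hypothetical; full examples are given in the Batch Job submission section below):

$ cat myjob.job
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --mem=1G
#SBATCH --time=00:05:00
echo "running on $(hostname)"

$ sbatch myjob.job            # Slurm queues the job and runs it when resources are free
$ cat slurm-<jobid>.out       # the job's output appears here once it has run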
Interactive Sessions
These are similar to a normal remote login session, and are ideal for debugging and development, or for running interactive programs. However, the length of these sessions is limited compared to batch jobs, so once your development is done, you should package the work up into a batch job and run it detached.
- you request an allocation of resources (cpus, memory)
- the system grants you a whole, or part, node to execute your commands
- you are logged into the node
- you run your commands interactively
- you exit and the system automatically releases the resources
Slurm usage
Resource allocation
In order to interact with the job/batch system (SLURM), the user must first give some indication of the resources they require. At a minimum these include:
- how long the job needs to run for
- how many processors the job requires
The default resource allocation for jobs can be found here.
Armed with this information, the scheduler is able to dispatch the jobs at some point in the future when the resources become available. A fair-share policy is in operation to guide the scheduler towards allocating resources fairly between users.
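These minimum requirements can be supplied either on the sbatch command line or as directives inside the job script; a sketch with arbitrary example values:

# on the command line...
sbatch --time=01:00:00 --ntasks=4 myscript.job

# ...or, equivalently, inside the job script itself
#SBATCH --time=01:00:00       # run for at most one hour
#SBATCH --ntasks=4            # run on four processors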
Slurm Command Summary
Command | Description |
---|---|
sacct | report job accounting information about active or completed jobs |
salloc | allocate resources for a job in real time (typically used to allocate resources and spawn a shell, in which the srun command is used to launch parallel tasks) |
sattach | attach standard input, output, and error to a currently running job or job step |
sbatch | submit a job script for later execution (the script typically contains one or more srun commands to launch parallel tasks) |
scancel | cancel a pending or running job |
sinfo | reports the state of partitions and nodes managed by Slurm (it has a variety of filtering, sorting, and formatting options) |
squeue | reports the state of jobs (it has a variety of filtering, sorting, and formatting options); by default, it reports the running jobs in priority order followed by the pending jobs in priority order |
srun | used to submit a job for execution in real time |
Man pages are available for all commands, and detail all of the options that can be passed to the command.
[usr1@login1(viking) ~]$ man sacct
Interactive Session
To run a job interactively (srun waits until the job executes), use the srun command:
srun [options] executable [args]
[usr1@login1(viking) ~]$ srun --ntasks=1 --time=00:30:00 --pty /bin/bash
srun: job 6485884 queued and waiting for resources
srun: job 6485884 has been allocated resources
[usr1@node069 [viking] ~]$ pwd
/users/usr1
[usr1@node069 [viking] ~]$ exit
exit
[usr1@login1(viking) ~]$
[usr1@login1(viking) ~]$ srun --ntasks=4 --time=00:30:00 --pty /bin/bash
srun: job 6485885 queued and waiting for resources
srun: job 6485885 has been allocated resources
[usr1@node071 [viking] ~]$
The first example creates a single task (one core) interactive session in the default partition "nodes", with a 30 minute time limit. The second example creates a four task (four cores) session.
To terminate the session, exit the shell.
Commonly used srun parameters:
Parameter | Description |
---|---|
--cores-per-socket | Restrict node selection to nodes with at least the specified number of cores per socket |
--cpu-freq | Request that the job be run at some requested frequency if possible |
--cpus-per-task | Request that the number of cpus be allocated per process |
--error | Specify how stderr is to be redirected |
--exclusive | The job allocation cannot share nodes with other running jobs |
--immediate | Exit if resources are not available within the time period specified |
--job-name | Specify a name for the job |
--mem | Specify the real memory required per node |
--mem-per-cpu | Minimum memory required per allocated CPU |
--nodelist | Request a specific list of hosts |
--ntasks | Specify the number of tasks to run |
--ntasks-per-core | Request the maximum ntasks be invoked on each core |
--ntasks-per-node | Request that ntasks be invoked on each node |
--ntasks-per-socket | Request the maximum ntasks be invoked on each socket |
--overcommit | Overcommit resources |
--partition | Request a specific partition for the resource allocation |
--pty | Execute task zero in pseudo terminal mode |
--sockets-per-node | Restrict node selection to nodes with at least the specified number of sockets |
--time | Set a limit on the total run time of the job allocation |
--threads-per-core | Restrict node selection to nodes with at least the specified number of threads |
--verbose | Increase the verbosity of srun's informational messages |
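Several of these parameters can be combined in a single interactive request; for example (a sketch, with values you would adjust to your own needs):

srun --job-name=debug-session \
     --partition=nodes \
     --ntasks=2 \
     --cpus-per-task=2 \
     --mem-per-cpu=2G \
     --time=00:20:00 \
     --pty /bin/bash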
The salloc command is used to get a resource allocation and use it interactively from your terminal. It is typically used to execute a single command, most often a shell.
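A sketch of a typical salloc session (my_program is a hypothetical executable); the allocation is held until you exit the shell:

salloc --ntasks=4 --time=00:30:00     # request an allocation and start a shell within it
srun hostname                         # launch a parallel task across the allocation
srun --ntasks=2 ./my_program          # a step using a subset of the allocation
exit                                  # leave the shell and release the allocation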
Batch Job submission
sbatch is used to submit a script for later execution. It is used for jobs that execute for an extended period of time and is convenient for managing a sequence of dependent jobs. The script can contain job options requesting resources. Job arrays are managed by the sbatch command.
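For example (a sketch; the script names are hypothetical and <jobid> is a placeholder), a job array of ten tasks and a job that only starts once another has completed successfully can be submitted as follows:

# a job array of 10 tasks; each task sees its own index in $SLURM_ARRAY_TASK_ID
sbatch --array=1-10 array_task.job

# a job that starts only after job <jobid> finishes with exit code 0
sbatch --dependency=afterok:<jobid> followup.job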
All batch jobs are submitted to a partition. Partitions are configured with specific rules that specify such things as the nodes that can be used, the maximum time a job can run, the number of cores and memory that can be requested.
The command to submit a job has the form:
sbatch [options] script_file_name [args]
where script_file_name is a file containing commands to be executed by the batch request, for example:

$ sbatch example.job

A simple job script, simple.job, might contain:

#!/bin/bash
#SBATCH --time=00:12:00        # Maximum time (HH:MM:SS)

echo simple.job running on `hostname`
sleep 600

Submitting the script and inspecting its output:
[usr1@login1(viking) scratch]$ sbatch simple.job
Submitted batch job 147874
[usr1@login1(viking) scratch]$ squeue -u usr1
  JOBID PARTITION     NAME     USER ST  TIME  NODES NODELIST(REASON)
 147874     nodes simple.j     usr1  R  0:06      1 node170
[usr1@login1(viking) scratch]$ ls
simple.job  slurm-147874.out
[usr1@login1(viking) scratch]$ cat slurm-147874.out
simple.job running on node170.pri.viking.alces.network
For a more detailed description of job submission scripts and examples, please see these Example Job Script Files.
srun Command
If invoked within a salloc shell or sbatch script, srun launches an application on the allocated compute nodes.
If invoked on its own, srun creates a job allocation (similar to salloc) and then launches the application on the compute node.
By default srun uses the entire job allocation, based upon the job specification; it can also use a subset of the job's resources. Thousands of job steps can run sequentially or in parallel within the job's resource allocation.
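For instance, a batch script can work through many short job steps one after another inside its single allocation (a sketch; my_program and the input files are hypothetical):

#!/bin/bash
#SBATCH --ntasks=4
#SBATCH --time=01:00:00

# each iteration launches a separate job step, which sacct records individually
for i in $(seq 1 100); do
    srun --ntasks=4 ./my_program input_${i}.dat
done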
Querying queues
The sinfo [options] command displays node and partition (queue) information and state.
Column | Description |
---|---|
PARTITION | Asterisk after partition name indicates the default partition |
AVAIL | Partition is able to accept jobs |
TIMELIMIT | Maximum time a job can run for |
NODES | Number of available nodes in the partition |
STATE | down - not available, alloc - jobs being run, idle - waiting for jobs |
NODELIST | Nodes available in the partition |
The squeue command may be used to display information on the current status of SLURM jobs.
[usr1@login1(viking) scratch]$ squeue -u usr1
  JOBID PARTITION     NAME     USER ST  TIME  NODES NODELIST(REASON)
 147875     nodes simple.j     usr1  R  0:04      1 node170
Column | Description |
---|---|
JOBID | A number used to uniquely identify your job within SLURM |
PARTITION | The partition the job has been submitted to |
NAME | The job's name, as specified by the job submission script's -J directive |
USER | The username of the job owner |
ST | Current job status: R (running), PD (pending - queued and waiting) |
TIME | The time the job has been running |
NODES | The number of nodes used by the job |
NODELIST | The nodes used by the job |
Important switches to squeue are:
Switch | Action |
---|---|
-a | display all jobs |
-l | display more information |
-u | only display the specified user's jobs |
-p | only display jobs in a particular partition |
--usage | print help |
-v | verbose listing |
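These switches can be combined; for example, to show a long listing of one user's jobs in a particular partition (substitute your own username and partition):

squeue -u usr1 -p nodes -l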
Cancelling a queued or running job
To delete a job from the queue use the scancel [options] <jobid> command, where jobid is a number referring to the specified job (available from squeue).
[usr1@login1(viking) scratch]$ sbatch simple.job
Submitted batch job 147876
[usr1@login1(viking) scratch]$ squeue -u usr1
  JOBID PARTITION     NAME     USER ST  TIME  NODES NODELIST(REASON)
 147876     nodes simple.j     usr1  R  0:05      1 node170
[usr1@login1(viking) scratch]$ scancel 147876
[usr1@login1(viking) scratch]$ sacct -j 147876
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
147876       simple.job      nodes dept-proj+          1 CANCELLED+      0:0
147876.batch      batch            dept-proj+          1  CANCELLED     0:15
A user can delete all their jobs from the batch queues with the -u option:
$ scancel -u <userid>
Job Information
Completed jobs
To display a list of recently completed jobs use the sacct command.
[usr1@login1(viking) scratch]$ sacct -j 147874
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
147874       simple.job      nodes dept-proj+          1  COMPLETED      0:0
147874.batch      batch            dept-proj+          1  COMPLETED      0:0
Important switches to sacct are:
Switch | Action |
---|---|
-a | display all users' jobs |
-b | display a brief listing |
-E | select the jobs' end date/time |
-h | print help |
-j | display a specific job |
-l | display long format |
--name | display jobs with the given name |
-S | select the jobs' start date/time |
-u | display only this user's jobs |
-v | verbose listing |
Running jobs
The scontrol show job <jobid> command will show information about a running job.
[usr1@login1(viking) ~]$ scontrol show job 147877
JobId=147877 JobName=simple.job
   UserId=usr1(10506) GroupId=clusterusers(1447400001) MCS_label=N/A
   Priority=4294770448 Nice=0 Account=dept-proj-2018 QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:05:28 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2019-01-14T10:31:26 EligibleTime=2019-01-14T10:31:26
   AccrueTime=Unknown
   StartTime=2019-01-14T10:31:27 EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2019-01-14T10:31:27
   Partition=nodes AllocNode:Sid=login2:356158
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=node170
   BatchHost=node170
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=4800M,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=4800M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/mnt/lustre/users/usr1/slurm/simple/simple.job
   WorkDir=/mnt/lustre/users/usr1/slurm/simple
   StdErr=/mnt/lustre/users/usr1/slurm/simple/slurm-147877.out
   StdIn=/dev/null
   StdOut=/mnt/lustre/users/usr1/slurm/simple/slurm-147877.out
   Power=
Queued jobs
Estimated time for job to start
The --start option to squeue will show the expected start time.
$ squeue --start
  JOBID PARTITION     NAME     USER ST          START_TIME  NODES SCHEDNODES  NODELIST(REASON)
     36     nodes     test     usr1 PD                 N/A      2 (null)      (PartitionConfig)
     34     nodes     test     usr1 PD 2018-09-26T10:58:53      1 node170     (Resources)
     35     nodes     test     usr1 PD 2018-09-26T11:08:00      1 node170     (Priority)
Non-running jobs
Use the --start option to squeue.
If your job has a start time of N/A and/or the REASON column states PartitionConfig, then you have probably requested resources that are not available on the cluster. In the above example, job 36 has requested more CPUs than are available for jobs in the cluster.
Queue/Partition, Job (Step) State Information
Queue/Partition information:
- Associated with specific sets of nodes
- nodes can be in more than one partition
- Job size and time limits
- Access control list
- Preemption rules
- State information
- Over-subscription scheduling rules
Job information consists of:
- ID (number)
- Name
- Time limit (minimum and/or maximum)
- Size (min/max; nodes, CPUs, sockets, cores, threads)
- Allocated node names
- Node features
- Dependency
- Account name
- Quality of Service
- State
Step state information:
- ID (number): <jobid>.<stepid>
- Name
- Time limit (maximum)
- Size (min/max; nodes, CPUs, sockets, cores, threads)
- Allocated node names
- Node features
Job State Codes
When a job is submitted, it is given a "state code" (ST) based on a number of factors, such as priority and resource availability. This information is shown by the squeue command. Common states are:
State | Explanation |
---|---|
R ( Running ) | The job is currently running. |
PD ( Pending ) | The job is awaiting resource allocation. |
CG ( Completing ) | Job is in the process of completing. Some processes on some nodes may still be active. |
F ( Failed ) | Job terminated on non-zero exit code or other failure condition. |
Job Reason Codes
The REASON column from the squeue command provides useful information on why a job is in the current state. Some of these reasons may be one or more of the following:
REASON | Explanation |
---|---|
AssociationJobLimit | The job's association has reached its maximum job count. |
AssociationResourceLimit | The job's association has reached some resource limit. |
AssociationTimeLimit | The job's association has reached its time limit. |
BeginTime | The job's earliest start time has not yet been reached. |
Cleaning | The job is being requeued and is still cleaning up from its previous execution. |
Dependency | The job is waiting for a dependent job to complete. |
JobHeldAdmin | The job has been held by the admin. |
JobHeldUser | The job has been held by the user. |
JobLaunchFailure | The job could not be launched. This may be due to a file system problem, invalid program name, etc. |
NodeDown | A node required by the job is not available at the moment. |
PartitionDown | The partition required by the job is DOWN. |
PartitionInactive | The partition required by the job is in an Inactive state and unable to start jobs. |
PartitionNodeLimit | The number of nodes required by this job is outside of the partition's node limit. Can also indicate that required nodes are DOWN or DRAINED. |
PartitionTimeLimit | The job exceeds the partition's time limit. |
Priority | One or more higher priority jobs exist for this partition or advanced reservation. |
ReqNodeNotAvail | Some node specifically required for this job is not available. The node may currently be in use, reserved for another job, in an advanced reservation, DOWN, DRAINED, or not responding. Nodes which are DOWN, DRAINED, or not responding will be identified as part of the job's "reason" field as "UnavailableNodes". Such nodes will typically require the intervention of a system administrator to make available. |
Reservation | The job is awaiting its advanced reservation to become available. |
Selecting the right queue
Most jobs will run by default in the "nodes" queue.
To access the high memory nodes, select the "himem" queue:
#SBATCH --partition=himem
To access the nodes with GPUs, select the "gpu" queue:
#SBATCH --partition=gpu
To access the longer running queues:
#SBATCH --partition=week
#SBATCH --partition=month
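For example, a job script header targeting the "week" queue might look like the following sketch (the job name and program are hypothetical, and the time limit shown is only an assumption; check the partition's actual MaxTime with scontrol show partition before relying on it):

#!/bin/bash
#SBATCH --partition=week
#SBATCH --time=5-00:00:00     # assumed limit; confirm the partition's real MaxTime
#SBATCH --ntasks=1
#SBATCH --job-name=long_run
./my_long_program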
The "test" queue
The test queue aims to provide a faster turnaround on job execution. This allows you to test your jobs for short periods of time before you submit them to the main queue.
The jobs submitted to this queue have the following restrictions:
- A maximum of 4 jobs can be submitted
- A maximum of 2 jobs will run concurrently
- The jobs will run for a maximum of 30 minutes
- The jobs can use a maximum of 8 cores/cpus
- The default memory per cpu is 1GB, maximum 16GB
To use the queue define the queue when submitting the script:
$ sbatch --partition=test myscript.job
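Alternatively, the partition and the test queue's limits can be set from within the script itself (a sketch; my_test_program is hypothetical):

#!/bin/bash
#SBATCH --partition=test
#SBATCH --time=00:30:00       # the test queue's maximum run time
#SBATCH --ntasks=8            # the test queue's maximum number of cores
#SBATCH --mem-per-cpu=1G      # the default; up to 16GB per cpu may be requested
./my_test_program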
System Information
Partitions
The scontrol show partition command will show information about available partitions.
[usr1@login1(viking) ~]$ scontrol show partition
PartitionName=nodes
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=YES QoS=maxuserlimits
   DefaultTime=08:00:00 DisableRootJobs=YES ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=2-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=node[001-170]
   PriorityJobFactor=1 PriorityTier=1000 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=6800 TotalNodes=170 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=4750 MaxMemPerNode=UNLIMITED
   TRESBillingWeights=CPU=1.0,Mem=0.25G
Nodes
The scontrol show nodes command will show information about available nodes. Warning: there are many.
[usr1@login2(viking) simple]$ scontrol show nodes
NodeName=node170 Arch=x86_64 CoresPerSocket=20
   CPUAlloc=40 CPUTot=40 CPULoad=23.91
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=node170 NodeHostName=node170 Version=18.08
   OS=Linux 3.10.0-957.27.2.el7.x86_64 #1 SMP Mon Jul 29 17:46:05 UTC 2019
   RealMemory=191668 AllocMem=187904 FreeMem=117134 Sockets=2 Boards=1
   State=ALLOCATED ThreadsPerCore=1 TmpDisk=0 Weight=10 Owner=N/A MCS_label=N/A
   Partitions=nodes,test,month
   BootTime=2019-08-28T18:06:45 SlurmdStartTime=2020-01-29T11:39:15
   CfgTRES=cpu=40,mem=191668M,billing=86
   AllocTRES=cpu=40,mem=183.50G
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
What resources did my job use?
The output from all batch jobs will include the following statistics block at the end:
Job ID: 5816203
Cluster: viking
User/Group: usr1/clusterusers
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 20
CPU Utilized: 03:53:53
CPU Efficiency: 92.08% of 04:14:00 core-walltime
Job Wall-clock time: 00:12:42
Memory Utilized: 7.21 GB
Memory Efficiency: 36.06% of 20.00 GB
Here we can see the job used nearly four hours of CPU time and just over 7 GB of memory.
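The efficiency figures follow directly from the allocation: the job held 20 cores for a wall-clock time of 00:12:42, giving the core-walltime shown:

core-walltime     = 20 cores x 00:12:42 = 04:14:00 (15240 core-seconds)
CPU Efficiency    = 03:53:53 / 04:14:00 = 14033 s / 15240 s ≈ 92.08%
Memory Efficiency = 7.21 GB used / 20.00 GB allocated ≈ 36%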
More information can be found by using the sacct command:
[usr1@login1(viking) scratch]$ sacct -j 44 --format=User,JobID,Jobname,partition,state,time,start,end,elapsed,AveCPU
     User        JobID    JobName  Partition      State  Timelimit               Start                 End    Elapsed     AveCPU
--------- ------------ ---------- ---------- ---------- ---------- ------------------- ------------------- ---------- ----------
     usr1 44                1hour      nodes  COMPLETED   01:10:00 2018-09-26T11:48:28 2018-09-26T12:48:28   01:00:00
          44.batch         batch             COMPLETED            2018-09-26T11:48:28 2018-09-26T12:48:28   01:00:00
[usr1@login1(viking) scratch]$ sacct --format=reqmem,maxrss,averss,elapsed -j 44
    ReqMem     MaxRSS     AveRSS    Elapsed
---------- ---------- ---------- ----------
     100Mc                        01:00:00
     100Mc                        01:00:00
[usr1@login1(viking) scratch]$ sacct --starttime 2018-09-26 --format=User,JobID,Jobname,partition,state,time,start,end,elapsed,MaxRss,MaxVMSize,nnodes,ncpus,nodelist
     User        JobID    JobName  Partition      State  Timelimit               Start                 End     Elapsed     MaxRSS  MaxVMSize   NNodes      NCPUS        NodeList
--------- ------------ ---------- ---------- ---------- ---------- ------------------- ------------------- ----------- ---------- ---------- -------- ---------- ---------------
     usr1 14             hostname       test  CANCELLED Partition+ 2018-08-16T14:44:34             Unknown 41-19:03:28                              0          0   None assigned
     usr1 32                 test      nodes  COMPLETED   00:10:00 2018-09-26T10:14:13 2018-09-26T10:15:13    00:01:00                              1          1         node170
          32.batch         batch             COMPLETED            2018-09-26T10:14:13 2018-09-26T10:15:13    00:01:00                              1          1         node170
     usr1 33                 test      nodes  COMPLETED   00:10:00 2018-09-26T10:48:53 2018-09-26T10:58:53    00:10:00                              1          1         node170
          33.batch         batch             COMPLETED            2018-09-26T10:48:53 2018-09-26T10:58:53    00:10:00                              1          1         node170
          33.0          hostname             COMPLETED            2018-09-26T10:48:53 2018-09-26T10:48:53    00:00:00                              1          1         node170
          33.1             sleep             COMPLETED            2018-09-26T10:48:53 2018-09-26T10:58:53    00:10:00                              1          1         node170
     usr1 34                 test      nodes  COMPLETED   00:10:00 2018-09-26T10:58:53 2018-09-26T11:08:53    00:10:00                              1          1         node170
          34.batch         batch             COMPLETED            2018-09-26T10:58:53 2018-09-26T11:08:53    00:10:00                              1          1         node170
          34.0          hostname             COMPLETED            2018-09-26T10:58:53 2018-09-26T10:58:53    00:00:00                              1          1         node170
          34.1             sleep             COMPLETED            2018-09-26T10:58:53 2018-09-26T11:08:53    00:10:00                              1          1         node170
     usr1 35                 test      nodes  COMPLETED   00:10:00 2018-09-26T11:08:53 2018-09-26T11:09:53    00:01:00                              1          1         node170
          35.batch         batch             COMPLETED            2018-09-26T11:08:53 2018-09-26T11:09:53    00:01:00                              1          1         node170
     usr1 36                 test      nodes    PENDING   00:10:00             Unknown             Unknown    00:00:00                              1          2   None assigned
     usr1 37                 test      nodes CANCELLED+   01:10:00 2018-09-26T11:09:53 2018-09-26T11:16:34    00:06:41                              1          1         node170
          37.batch         batch             CANCELLED            2018-09-26T11:09:53 2018-09-26T11:16:35    00:06:42                              1          1         node170
     usr1 38                 test      nodes CANCELLED+   00:10:00 2018-09-26T11:16:36 2018-09-26T11:17:25    00:00:49                              1          1         node170
          38.batch         batch             CANCELLED            2018-09-26T11:16:36 2018-09-26T11:17:26    00:00:50                              1          1         node170