Access

When you log in, you will be directed to one of several login nodes. These provide Linux command-line access to the system, which you use to edit files, compile programs and submit jobs. The login nodes are shared amongst everyone who is logged in, so they should not be used for running your code, other than for development and very short test runs.

Access to the job/batch submission system is through the login nodes. When a submitted job executes, processors on the compute nodes are made available exclusively for running that job.

Cluster and Slurm architecture

Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. As a cluster workload manager, Slurm has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.

Slurm consists of a slurmd daemon running on each compute node and a central slurmctld daemon running on a management node (with an optional fail-over twin). The slurmd daemons provide fault-tolerant hierarchical communications. The user commands include: sacct, salloc, sattach, sbatch, sbcast, scancel, scontrol, sinfo, smap, squeue, srun, strigger and sview. All of the commands can run anywhere in the cluster.

Cluster entities

The entities managed by these Slurm daemons, shown below, include nodes, the compute resource in Slurm, partitions, which group nodes into logical (possibly overlapping) sets, jobs, or allocations of resources assigned to a user for a specified amount of time, and job steps, which are sets of (possibly parallel) tasks within a job. The partitions can be considered job queues, each of which has an assortment of constraints such as job size limit, job time limit, users permitted to use it, etc. Priority-ordered jobs are allocated nodes within a partition until the resources (nodes, processors, memory, etc.) within that partition are exhausted.

Once a job is assigned a set of nodes, the user is able to initiate parallel work in the form of job steps in any configuration within the allocation. For instance, a single job step may be started that utilises all nodes allocated to the job, or several job steps may independently use a portion of the allocation.
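
For illustration, here is a minimal sketch of a job script that starts job steps in both ways (sbatch and srun are described in detail below; step_all, task_a and task_b are placeholders for your own programs):

#!/bin/bash
#SBATCH --ntasks=4            # the job's allocation: four tasks

# one job step using the whole allocation
srun --ntasks=4 ./step_all

# two job steps, each independently using a portion of the allocation
srun --ntasks=2 ./task_a &
srun --ntasks=2 ./task_b &
wait                          # wait for both background steps to finish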


  • Jobs: resource allocation requests
  • Job steps: set of (typically parallel) tasks
  • Partitions: job queues with limits and access controls
  • Consumable resources
    • Nodes
      • NUMA boards
        • Sockets
          • Cores
            • Hyperthreads
        • Memory
    • Generic resources, such as GPUs



Resource manager and scheduler

A user submits their jobs to the resource manager and scheduler (Slurm). The job requests resources which, when granted, are used by the job.

  • Jobs: resource allocation requests
  • Job steps: set of tasks (sequential and/or parallel)
    • uses resources from the job's allocation
    • a job can contain multiple steps
  • Partitions: job queues with limits and access controls

Job states

Once a job has been submitted to a queue it is assigned a state which indicates where it is in the processing hierarchy.

Example

A user submits a job "Job 3" to a partition "jobqueue".

When the job executes, it is allocated the requested resources.

Jobs spawn steps, which are allocated resources from within the job's allocation.

Cluster Filesystem

Users have access to four filesystems:

  • Home Directory (not suitable for large data sets or frequently accessed files)
  • Lustre Fast Filesystem (recommended storage)
  • Fast Temporary Storage (recommended for temporary or cache files)
  • Local Temporary Storage (local to the compute node, not recommended)

More on location and access of storage.

Running jobs on the cluster

There are two ways to run your programs on the cluster:

Batch Jobs

  • you write a job script containing the commands you want to execute on the cluster
  • you request an allocation of resources (nodes, CPUs, memory)
  • the system grants you one or more compute nodes to execute your commands
  • your job script is automatically run
  • when your script terminates, the system releases the resources

Interactive Sessions

  • you request an allocation of resources (CPUs, memory)
  • the system grants you a whole node, or part of one, to execute your commands
  • you are logged into the node
  • you run your commands interactively
  • when you exit, the system automatically releases the resources

Interactive vs Batch

Interactive

  • similar to a remote session
  • requires an active connection
  • used for development, debugging, or interactive applications

Batch

  • non-interactive
  • can run many jobs simultaneously
  • able to run jobs for longer periods of time

Resource allocation

In order to interact with the job/batch system (SLURM), the user must first give some indication of the resources they require. At a minimum these include:

  • how long the job needs to run for
  • how many processors the job needs to run on

The default resource allocations for jobs are:

Default Resource Allocation

              Batch                                         Interactive
Resource      Default    Maximum                            Default    Maximum
Time limit    8 hours    48 hours (7 and 30 days on the     1 hour     4 hours
                         week and month queues)
Memory        4.8GB      1.5TB                              8GB        1.5TB
Cores         1          1024                               4          16

Armed with this information, the scheduler is able to dispatch the jobs at some point in the future when the resources become available. A fair-share policy is in operation to guide the scheduler towards allocating resources fairly between users. 
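
As a sketch of how these requests are expressed (the options are standard Slurm options; the values and the script name myscript.job are illustrative):

# batch: four processors for one hour
sbatch --ntasks=4 --time=01:00:00 myscript.job

# interactive: the same resources, with a shell on a compute node
srun --ntasks=4 --time=01:00:00 --pty /bin/bash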

Slurm Command Summary

Command    Description
sacct      report job accounting information about active or completed jobs
salloc     allocate resources for a job in real time (typically used to allocate resources and spawn a shell, in which the srun command is used to launch parallel tasks)
sattach    attach standard input, output, and error to a currently running job or job step
sbatch     submit a job script for later execution (the script typically contains one or more srun commands to launch parallel tasks)
scancel    cancel a pending or running job
sinfo      report the state of partitions and nodes managed by Slurm (it has a variety of filtering, sorting, and formatting options)
squeue     report the state of jobs (it has a variety of filtering, sorting, and formatting options); by default, it reports running jobs in priority order followed by pending jobs in priority order
srun       submit a job for execution in real time

Notes:

  • Man pages are available for all commands
  • Common options
    • options have two formats (see the example below)
      • single-letter form (e.g. "-p test" for partition "test")
      • verbose form (e.g. "--partition=test")
    • --help prints a brief description of all options
    • --usage prints a list of options
    • -v gives verbose logging; more 'v's give more detail, e.g. "-vvvv"
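
For example, the following two commands are equivalent (the partition name "test" is illustrative):

sinfo -p test
sinfo --partition=test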

A useful command summary can be found at Slurm Command Summary.

Interactive Session

To run a job interactively (the command waits until the job executes), use the srun command:

srun [options] executable [args]

Example interactive job
$ srun --ntasks=1 --pty /bin/bash
$ pwd
/home/andrew
$ hostname
node170
$ srun --ntasks=4 --pty /bin/bash
srun: Requested partition configuration not available now
srun: job 28 queued and waiting for resources

The first example creates a single-task (one processor) interactive session in the default partition. The second example requests a four-task (four processor) session; because the resources are not immediately available, the request is queued.

To terminate the session, exit the shell.

srun parameters
Parameter             Description
--cores-per-socket    Restrict node selection to nodes with at least the specified number of cores per socket
--cpu-freq            Request that the job be run at the requested CPU frequency, if possible
--cpus-per-task       Request that this number of CPUs be allocated per task
--error               Specify how stderr is to be redirected
--exclusive           The job allocation cannot share nodes with other running jobs
--immediate           Exit if resources are not available within the specified time period
--job-name            Specify a name for the job
--mem                 Specify the real memory required per node
--mem-per-cpu         Minimum memory required per allocated CPU
--nodelist            Request a specific list of hosts
--ntasks              Specify the number of tasks to run
--ntasks-per-core     Request the maximum ntasks be invoked on each core
--ntasks-per-node     Request that ntasks be invoked on each node
--ntasks-per-socket   Request the maximum ntasks be invoked on each socket
--overcommit          Overcommit resources
--partition           Request a specific partition for the resource allocation
--pty                 Execute task zero in pseudo terminal mode
--sockets-per-node    Restrict node selection to nodes with at least the specified number of sockets
--time                Set a limit on the total run time of the job allocation
--threads-per-core    Restrict node selection to nodes with at least the specified number of threads per core
--verbose             Increase the verbosity of srun's informational messages
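
For example, a sketch of an interactive session combining several of these options (the values are illustrative):

# one task with four CPUs, 8GB of memory, for two hours, in a pseudo terminal
srun --ntasks=1 --cpus-per-task=4 --mem=8G --time=02:00:00 --pty /bin/bash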

The salloc command is used to obtain a resource allocation and use it interactively from your terminal. It is typically used to execute a single command, usually a shell.
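
A minimal sketch of this workflow (the resource values are illustrative):

# allocate four tasks for thirty minutes and start a shell
salloc --ntasks=4 --time=00:30:00

# within that shell, launch a parallel task on the allocated nodes
srun hostname

# leave the shell to release the allocation
exit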

Batch Job submission

sbatch is used to submit a script for later execution. It is used for jobs that execute for an extended period of time and is convenient for managing a sequence of dependent jobs. The script can contain job options requesting resources. Job arrays are managed by the sbatch command.

All batch jobs are submitted to a queue. Queues are configured with specific rules that specify such things as the nodes that can be used, the maximum time a job can run, the number of cores and memory that can be requested.

Currently the facility is configured with a single general access queue, allowing submission to all of the standard compute resources. Thus, for most jobs there is no need to specify a queue name in the submission.

The command to submit a job has the form:

sbatch [options] script_file_name [args]

where script_file_name is a file containing the commands to be executed by the batch request. For example:

$ sbatch example.job


Simple batch script - simple.job
#!/bin/bash

echo simple.job running on `hostname`
sleep 600

Submitting a job
[abs4@login2(viking) simple]$ sbatch simple.job 
Submitted batch job 147874
[abs4@login2(viking) simple]$ squeue -u abs4
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            147874     nodes simple.j     abs4  R       0:06      1 node170
[abs4@login2(viking) simple]$ ls
simple.job  slurm-147874.out
[abs4@login2(viking) simple]$ more slurm-147874.out 
simple.job running on node170.pri.viking.alces.network
[abs4@login2(viking) simple]$ 

For a more detailed description of job submission scripts and examples, please look at these Job Script Files.

srun Command

If invoked within a salloc shell or sbatch script, srun launches an application on the allocated compute nodes.

If invoked on its own, srun creates a job allocation (similar to salloc) and then launches the application on the compute node.

By default, srun uses the entire job allocation, as defined by the job specification; it can also use a subset of the job's resources. Many job steps (thousands, if needed) can run serially or in parallel within the job's resource allocation.
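
To illustrate the two modes (my_program is a placeholder for your own executable):

# inside an sbatch script or salloc shell: runs within the existing allocation
srun ./my_program

# on its own from a login node: creates an allocation, runs the program, then releases it
srun --ntasks=2 --time=00:10:00 ./my_program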

Querying queues

The sinfo [options] command displays node and partition (queue) information and state.

List partition and node information
[abs4@login2(viking) simple]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
nodes*       up   infinite      1  drain node059
nodes*       up   infinite    118    mix node[008-009,012-013,020,022-024,027,029,034-035,037-040,042-051,053-058,060-064,066-067,069-074,076-078,081-089,091-092,095-117,120-121,124-129,131-132,134-135,139-141,143-144,146-161,168-170]
nodes*       up   infinite     14  alloc node[065,079,090,118-119,122-123,130,133,136-138,142,145]
nodes*       up   infinite     37   idle node[001-007,010-011,014-019,021,025-026,028,030-033,036,041,052,068,075,080,093-094,162-167]
himem        up   infinite      3   idle himem[01-03]
gpu          up   infinite      1    mix gpu01
gpu          up   infinite      1   idle gpu02
test         up   infinite      1   idle admin02
Column      Description
PARTITION   Partition name; an asterisk after the name indicates the default partition
AVAIL       Whether the partition is able to accept jobs
TIMELIMIT   Maximum time a job can run for
NODES       Number of nodes in the partition in that state
STATE       down - not available, drain - not accepting new jobs, alloc - fully allocated to jobs, mix - partially allocated, idle - waiting for jobs
NODELIST    Nodes in the partition in that state


The squeue command may be used to display information on the current status of SLURM jobs.

List your jobs in the queue
[abs4@login2(viking) simple]$ squeue -u abs4
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            147875     nodes simple.j     abs4  R       0:04      1 node170


Column      Description
JOBID       A number used to uniquely identify your job within SLURM
PARTITION   The partition the job has been submitted to
NAME        The job's name, as specified by the job submission script's -J directive
USER        The username of the job owner
ST          Current job status: R (running), PD (pending - queued and waiting)
TIME        The time the job has been running for
NODES       The number of nodes used by the job
NODELIST    The nodes used by the job


Important switches to squeue are:  


Switch    Action
-a        display all jobs
-l        display more information
-u        display only the specified user's jobs
--usage   print help
-v        verbose listing

Cancelling a queued or running job

To delete a job from the queue use the scancel [options] <jobid> command, where jobid is a number referring to the specified job (available from squeue).

Cancel a job
[abs4@login2(viking) simple]$ sbatch simple.job 
Submitted batch job 147876
[abs4@login2(viking) simple]$ squeue -u abs4
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            147876     nodes simple.j     abs4  R       0:05      1 node170
[abs4@login2(viking) simple]$ scancel 147876
[abs4@login2(viking) simple]$ sacct -j 147876
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
147876       simple.job      nodes its-syste+          1 CANCELLED+      0:0 
147876.batch      batch            its-syste+          1  CANCELLED     0:15 
[abs4@login2(viking) simple]$

A user can delete all their jobs from the batch queues with the -u option:

scancel -u <userid>

Job Information

Completed jobs

To display a list of recently completed jobs use the sacct command.

List recently completed jobs
[abs4@login2(viking) simple]$ sacct -j 147874
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
147874       simple.job      nodes                     1  COMPLETED      0:0 
147874.batch      batch                                1  COMPLETED      0:0 
[abs4@login2(viking) simple]$ 

Important switches to sacct are:  

Switch    Action
-a        display all users' jobs
-b        display a brief listing
-E        select the jobs' end date/time
-h        print help
-j        display a specific job
-l        display long format
--name    display jobs with the given name
-S        select the jobs' start date/time
-u        display only the specified user's jobs
-v        verbose listing

Running jobs

The scontrol show job <jobid> command will show information about a running job.

Information on running job
[abs4@login2(viking) simple]$ scontrol show job 147877
JobId=147877 JobName=simple.job
   UserId=abs4(10506) GroupId=clusterusers(1447400001) MCS_label=N/A
   Priority=4294770448 Nice=0 Account=its-system-2018 QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:05:28 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2019-01-14T10:31:26 EligibleTime=2019-01-14T10:31:26
   AccrueTime=Unknown
   StartTime=2019-01-14T10:31:27 EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2019-01-14T10:31:27
   Partition=nodes AllocNode:Sid=login2:356158
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=node170
   BatchHost=node170
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=4800M,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=4800M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/mnt/lustre/users/abs4/slurm/simple/simple.job
   WorkDir=/mnt/lustre/users/abs4/slurm/simple
   StdErr=/mnt/lustre/users/abs4/slurm/simple/slurm-147877.out
   StdIn=/dev/null
   StdOut=/mnt/lustre/users/abs4/slurm/simple/slurm-147877.out
   Power=



Queued jobs

Estimated time for job to start

The --start option to squeue will show the expected start time.


Display expected start time of job
$ squeue --start
             JOBID PARTITION     NAME     USER ST          START_TIME  NODES SCHEDNODES           NODELIST(REASON)
                36       myq     test   andrew PD                 N/A      2 (null)               (PartitionConfig)
                34       myq     test   andrew PD 2018-09-26T10:58:53      1 snode0               (Resources)
                35       myq     test   andrew PD 2018-09-26T11:08:00      1 snode0               (Priority)

Non-running jobs

Use the --start option to squeue.

Investigate a non-running job
$ squeue --start
             JOBID PARTITION     NAME     USER ST          START_TIME  NODES SCHEDNODES           NODELIST(REASON)
                36       myq     test   andrew PD                 N/A      2 (null)               (PartitionConfig)
                34       myq     test   andrew PD 2018-09-26T10:58:53      1 snode0               (Resources)
                35       myq     test   andrew PD 2018-09-26T11:08:00      1 snode0               (Priority)

If your job has a start time of N/A and/or the REASON column states PartitionConfig, then you have probably requested resources that are not available on the cluster. In the example above, job 36 has requested more CPUs than are available for jobs in the cluster.
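
To compare your request against what the partitions can provide, the sinfo format option can list the relevant limits (a sketch using standard sinfo format fields):

# partition, time limit, node count, CPUs per node and memory (MB) per node
sinfo -o "%P %l %D %c %m"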

Queue/Partition, Job (Step) State Information

Queue/Partition information:

  • Associated with specific sets of nodes
    • nodes can be in more than one partition
  • Job size and time limits
  • Access control list
  • Preemption rules
  • State information
  • Over-subscription scheduling rules

Job information consists of:

  • ID (number)
  • Name
  • Time limit (minimum and/or maximum)
  • Size (min/max; nodes, CPUs, sockets, cores, threads)
  • Allocated node names
  • Node features
  • Dependency
  • Account name
  • Quality of Service
  • State

Step state information:

  • ID (number): <jobid>.<stepid>
  • Name
  • Time limit (maximum)
  • Size (min/max; nodes, CPUs, sockets, cores, threads)
  • Allocated node names
  • Node features

Job State Codes


When submitting a job, the job will be given a "state code" (ST) based on a number of factors, such as priority and resource availability. This information is shown by the squeue command. Common states are:

State             Explanation
R (Running)       The job is currently running.
PD (Pending)      The job is awaiting resource allocation.
CG (Completing)   The job is in the process of completing. Some processes on some nodes may still be active.
F (Failed)        The job terminated with a non-zero exit code or other failure condition.

Job Reason Codes

The REASON column from the squeue command provides useful information on why a job is in its current state. The reason may be one or more of the following:

REASON                    Explanation
AssociationJobLimit       The job's association has reached its maximum job count.
AssociationResourceLimit  The job's association has reached some resource limit.
AssociationTimeLimit      The job's association has reached its time limit.
BeginTime                 The job's earliest start time has not yet been reached.
Cleaning                  The job is being requeued and is still cleaning up from its previous execution.
Dependency                The job is waiting for a dependent job to complete.
JobHeldAdmin              The job has been held by an administrator.
JobHeldUser               The job has been held by the user.
JobLaunchFailure          The job could not be launched. This may be due to a file system problem, an invalid program name, etc.
NodeDown                  A node required by the job is not available at the moment.
PartitionDown             The partition required by the job is DOWN.
PartitionInactive         The partition required by the job is in an Inactive state and unable to start jobs.
PartitionNodeLimit        The number of nodes required by this job is outside of the partition's node limit. Can also indicate that required nodes are DOWN or DRAINED.
PartitionTimeLimit        The job's time limit exceeds the partition's time limit.
Priority                  One or more higher-priority jobs exist for this partition or advanced reservation.
ReqNodeNotAvail           Some node specifically required for this job is not available. The node may currently be in use, reserved for another job, in an advanced reservation, DOWN, DRAINED, or not responding. Nodes which are DOWN, DRAINED, or not responding will be identified in the job's "reason" field as "UnavailableNodes". Such nodes will typically require the intervention of a system administrator to make them available.
Reservation               The job is waiting for its advanced reservation to become available.

Selecting the right queue

Most jobs will run in the default queue.

To access the high memory nodes, select the "himem" queue:

#SBATCH --partition=himem

To access the longer running queues:

#SBATCH --partition=week

#SBATCH --partition=month
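
Putting this together, a sketch of a job script header that selects one of these queues and requests resources (the values are illustrative, and my_program is a placeholder for your own program):

#!/bin/bash
#SBATCH --partition=himem     # or week / month, as appropriate
#SBATCH --mem=200G
#SBATCH --time=24:00:00

./my_program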

The "test" queue

The test queue aims to provide faster turn-around on job execution. This allows you to test your jobs for short periods of time before you submit them to the main queue.

The jobs submitted to this queue have the following restrictions:

  1. A maximum of 4 jobs can be submitted
  2. A maximum of 2 jobs will run concurrently
  3. The jobs will run for a maximum of 30 minutes
  4. The jobs can use a maximum of 8 cores/cpus
  5. The default memory per cpu is 1GB, maximum 16GB

To use the queue, specify it when submitting the script:

sbatch --partition=test myscript.job

System Information

Partitions

The scontrol show partition command will show information about available partitions.

Information on partitions
[abs4@login2(viking) simple]$ scontrol show partition
PartitionName=nodes
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=YES QoS=N/A
   DefaultTime=NONE DisableRootJobs=YES ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=node[001-170]
   PriorityJobFactor=1 PriorityTier=1000 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=6800 TotalNodes=170 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=4800 MaxMemPerNode=UNLIMITED

PartitionName=himem
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=NONE DisableRootJobs=YES ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=himem[01-03]
   PriorityJobFactor=1 PriorityTier=3000 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=192 TotalNodes=3 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=11875 MaxMemPerNode=UNLIMITED

PartitionName=gpu
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=NONE DisableRootJobs=YES ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=gpu[01-02]
   PriorityJobFactor=1 PriorityTier=3000 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=80 TotalNodes=2 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=9600 MaxMemPerNode=UNLIMITED


Nodes

The scontrol show nodes command will show information about available nodes. Warning: there are many.
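
To limit the output to a single node, give its name (node170 here is taken from the earlier examples):

scontrol show node node170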

Information on nodes
[abs4@login2(viking) simple]$ scontrol show nodes 
NodeName=admin02 Arch=x86_64 CoresPerSocket=1 
   CPUAlloc=0 CPUTot=1 CPULoad=0.01
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=admin02 NodeHostName=admin02 Version=18.08
   OS=Linux 3.10.0-862.3.3.el7.x86_64 #1 SMP Fri Jun 15 04:15:27 UTC 2018 
   RealMemory=1000 AllocMem=0 FreeMem=1197 Sockets=1 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=50 Owner=N/A MCS_label=N/A
   Partitions=test 
   BootTime=2018-12-22T04:17:07 SlurmdStartTime=2018-12-22T04:17:55
   CfgTRES=cpu=1,mem=1000M,billing=1
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   

NodeName=gpu01 Arch=x86_64 CoresPerSocket=20 
   CPUAlloc=3 CPUTot=40 CPULoad=2.56
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:4

What resources did my job use

This needs updating when the accounting is working.

In order to tune your job submission parameters, use the '-me' directive to receive a report of the resources used:

Job end email
Job 754967 (mpi-16_job) Complete
 User             = abs4
 Queue            = its-4hour@rnode2.york.ac.uk
 Host             = rnode2.york.ac.uk
 Start Time       = 02/18/2015 10:41:21
 End Time         = 02/18/2015 10:41:27
 User Time        = 00:00:01
 System Time      = 00:00:01
 Wallclock Time   = 00:00:06
 CPU              = 00:00:03
 Max vmem         = 69.367M
 Exit Status      = 0

Here we can see the job used 3 seconds of CPU time and 69 MB of memory.
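
Under Slurm, an email notification when a job ends can be requested with the standard --mail-type and --mail-user sbatch options (a sketch; the address is a placeholder and assumes email delivery is configured on the cluster):

#SBATCH --mail-type=END
#SBATCH --mail-user=your.name@york.ac.uk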

More information can be found by using the sacct command:

Resources used by a terminated job
$ sacct -j 44 --format=User,JobID,Jobname,partition,state,time,start,end,elapsed,AveCPU
     User        JobID    JobName  Partition      State  Timelimit               Start                 End    Elapsed     AveCPU 
--------- ------------ ---------- ---------- ---------- ---------- ------------------- ------------------- ---------- ---------- 
   andrew 44                1hour        myq  COMPLETED   01:10:00 2018-09-26T11:48:28 2018-09-26T12:48:28   01:00:00            
          44.batch          batch             COMPLETED            2018-09-26T11:48:28 2018-09-26T12:48:28   01:00:00            

$ sacct --format=reqmem,maxrss,averss,elapsed -j 44
    ReqMem     MaxRSS     AveRSS    Elapsed 
---------- ---------- ---------- ---------- 
     100Mc                         01:00:00 
     100Mc                         01:00:00 



Resources used from a specific date
$ sacct --starttime 2018-09-26 --format=User,JobID,Jobname,partition,state,time,start,end,elapsed,MaxRss,MaxVMSize,nnodes,ncpus,nodelist
     User        JobID    JobName  Partition      State  Timelimit               Start                 End    Elapsed     MaxRSS  MaxVMSize   NNodes      NCPUS        NodeList 
--------- ------------ ---------- ---------- ---------- ---------- ------------------- ------------------- ---------- ---------- ---------- -------- ---------- --------------- 
   andrew 14             hostname      debug  CANCELLED Partition+ 2018-08-16T14:44:34             Unknown 41-19:03:28                              0          0   None assigned 
   andrew 32                 test        myq  COMPLETED   00:10:00 2018-09-26T10:14:13 2018-09-26T10:15:13   00:01:00                              1          1          snode0 
          32.batch          batch             COMPLETED            2018-09-26T10:14:13 2018-09-26T10:15:13   00:01:00                              1          1          snode0 
   andrew 33                 test        myq  COMPLETED   00:10:00 2018-09-26T10:48:53 2018-09-26T10:58:53   00:10:00                              1          1          snode0 
          33.batch          batch             COMPLETED            2018-09-26T10:48:53 2018-09-26T10:58:53   00:10:00                              1          1          snode0 
          33.0           hostname             COMPLETED            2018-09-26T10:48:53 2018-09-26T10:48:53   00:00:00                              1          1          snode0 
          33.1              sleep             COMPLETED            2018-09-26T10:48:53 2018-09-26T10:58:53   00:10:00                              1          1          snode0 
   andrew 34                 test        myq  COMPLETED   00:10:00 2018-09-26T10:58:53 2018-09-26T11:08:53   00:10:00                              1          1          snode0 
          34.batch          batch             COMPLETED            2018-09-26T10:58:53 2018-09-26T11:08:53   00:10:00                              1          1          snode0 
          34.0           hostname             COMPLETED            2018-09-26T10:58:53 2018-09-26T10:58:53   00:00:00                              1          1          snode0 
          34.1              sleep             COMPLETED            2018-09-26T10:58:53 2018-09-26T11:08:53   00:10:00                              1          1          snode0 
   andrew 35                 test        myq  COMPLETED   00:10:00 2018-09-26T11:08:53 2018-09-26T11:09:53   00:01:00                              1          1          snode0 
          35.batch          batch             COMPLETED            2018-09-26T11:08:53 2018-09-26T11:09:53   00:01:00                              1          1          snode0 
   andrew 36                 test        myq    PENDING   00:10:00             Unknown             Unknown   00:00:00                              1          2   None assigned 
   andrew 37                 test        myq CANCELLED+   01:10:00 2018-09-26T11:09:53 2018-09-26T11:16:34   00:06:41                              1          1          snode0 
          37.batch          batch             CANCELLED            2018-09-26T11:09:53 2018-09-26T11:16:35   00:06:42                              1          1          snode0 
   andrew 38                 test        myq CANCELLED+   00:10:00 2018-09-26T11:16:36 2018-09-26T11:17:25   00:00:49                              1          1          snode0 
          38.batch          batch             CANCELLED            2018-09-26T11:16:36 2018-09-26T11:17:26   00:00:50                              1          1          snode0 


