Overview of checkpointing

In computing, "checkpointing" refers to saving a snapshot of a program's current state so that the program can continue from that snapshot at a later time. This has several advantages, especially for jobs that take a long time to run.

  • job continuation: if a job fails unexpectedly (e.g. because it exceeded its time limit or memory allocation), it can be continued from the saved state instead of restarting from scratch.
  • extended runtime: for jobs that require more walltime than is allowed by the queues, the job can be re-submitted multiple times, each continuing from the previous checkpoint.
  • more resources for long jobs: by requesting the maximum of 48 hrs of walltime at a time in the `nodes` queue and continuing from previous checkpoints, jobs that would usually be constrained to the `week` or `month` queues can instead be submitted to `nodes`.


DMTCP

DMTCP ("Distributed MultiThreaded Checkpointing") is a checkpointing tool capable of checkpointing both serial and parallel applications (MPI and multi-threaded). It requires no special privileges and no modification of the application binaries or source code. It supports a range of applications and languages, including:

  • various MPI implementations (NOT CURRENTLY SUPPORTED ON VIKING)
  • OpenMP
  • MATLAB
  • Python
  • Perl
  • R
  • GNU screen
  • TightVNC

Please note that MPI checkpointing is currently not functional on Viking, due to issues with building DMTCP's InfiniBand interfaces. We hope to fix this soon, so that MPI programs can be checkpointed as expected.

By default, DMTCP uses gzip to compress the checkpoint images; this can be disabled to improve speed, at the cost of using more disk space.

There are five fundamental steps involved in using DMTCP:

  1. Loading a DMTCP module (and the DMTCP-wrapper module)
  2. Starting the `dmtcp_coordinator`
  3. Launching the program with `dmtcp_launch`
  4. Creating checkpoints with `dmtcp_command --checkpoint` 
  5. Restarting from a checkpoint with `dmtcp_restart`
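
The five steps above can be sketched with the DMTCP commands directly, without the wrapper functions used in the job scripts on this page. This is an illustration under stated assumptions rather than a tested job script: `run` and `checkpoint_workflow` are hypothetical helper names, `./my_program` is a placeholder, and port 7779 is only an example value. With `DRY_RUN=1` (the default here) the commands are printed instead of executed.

```shell
#!/usr/bin/env bash
# Sketch of the five DMTCP steps. With DRY_RUN=1 (the default here) each
# command is only printed, so the sequence can be read without DMTCP installed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    # Hypothetical helper: print a command, and execute it unless dry-running.
    echo "+ $*"
    if [ "$DRY_RUN" != "1" ]; then
        "$@"
    fi
}

checkpoint_workflow() {
    local port="$1" ckpt_dir="$2"
    shift 2    # the remaining arguments are the program and its arguments

    # 1. Load a DMTCP module (name taken from the examples on this page)
    run module load tools/DMTCP/2.6.0-GCCcore-9.3.0
    # 2. Start the coordinator in the background on the chosen port
    run dmtcp_coordinator --daemon --exit-on-last -p "$port"
    # 3. Launch the program under DMTCP. This blocks until the program exits,
    #    so in a real session step 4 is issued from a second shell.
    run dmtcp_launch -p "$port" --ckptdir "$ckpt_dir" "$@"
    # 4. Ask the coordinator to write a checkpoint
    run dmtcp_command -p "$port" --checkpoint
    # 5. Later, restart from the checkpoint images
    run dmtcp_restart -p "$port" "$ckpt_dir"/ckpt_*.dmtcp
}

checkpoint_workflow 7779 "$(pwd)/checkpoints" ./my_program
```

In the job scripts on this page, steps 2 and 4 are handled by the `start_coordinator` function and the coordinator's checkpoint interval, so these commands rarely need to be typed by hand.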


Example job scripts

Downloading example scripts

Each of the following sections contains a link to download the corresponding script individually. Alternatively, you can download an archive containing all of the example scripts:

Serial (no multi-threading or MPI)

The following serial example job scripts compile and run an example C++ program: dmtcp_serial_example.cpp.

The job scripts "serial_submit.sbatch" and "serial_restart.sbatch" (below) show a simple example of how to use the DMTCP coordinator to checkpoint a program regularly, and how to restart from a previous checkpoint in another job. To adapt these scripts to your own jobs, make sure to update:

  • the resources requested in the "Slurm directives" section (--time and --mem in particular)
  • the software modules loaded (make sure that the DMTCP module matches the compiler/toolchain of any other modules loaded)
  • the pre-checkpointing step, which may not be needed at all
  • the checkpointing frequency, ckpt_every, in seconds. For jobs with a runtime of ~1hr, this should be about every 10 min (ckpt_every=600); for longer jobs, every hour should be sufficient (ckpt_every=3600)
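
The rule of thumb for ckpt_every could be encoded as a small helper, for instance when several jobs of different lengths are generated from one template. `suggest_ckpt_every` is a hypothetical name, and the 90 minute cut-off between "about an hour" and "longer" is an assumption made for this sketch, not a site recommendation.

```shell
#!/usr/bin/env bash
# Hypothetical helper: suggest a checkpoint interval (seconds) from the
# expected runtime (seconds), following the rule of thumb above.
suggest_ckpt_every() {
    local runtime_s="$1"
    local cutoff_s="${2:-5400}"   # assumption: up to 90 min counts as "~1hr"
    if [ "$runtime_s" -le "$cutoff_s" ]; then
        echo 600     # short jobs: checkpoint about every 10 min
    else
        echo 3600    # longer jobs: checkpoint every hour
    fi
}

# Example: a job expected to need 48 hours in total
ckpt_every="$(suggest_ckpt_every $((48 * 3600)))"
echo "ckpt_every=${ckpt_every}"
```
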
serial_submit.sbatch
#!/usr/bin/env bash

#--------------------------- Slurm directives --------------------------------#
###############################################################################
# General Slurm directives for the job
###############################################################################
#SBATCH --job-name=dmtcp_serial_example
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --time=00:03:00
#SBATCH --mem=1G


#----------------------------- Load modules ----------------------------------# 
###############################################################################
# Load all necessary modules, a matching DMTCP module built with the same
# compiler/toolchain, and the DMTCP wrapper module.
###############################################################################

module load tools/DMTCP/2.6.0-GCCcore-9.3.0
module load tools/DMTCP-wrapper
module load compiler/GCC/9.3.0


#--------------------------- Pre-checkpointing -------------------------------#
###############################################################################
# Commands to run before checkpointing. In this example, we compile the example
# program.
###############################################################################

g++ -o serial_example dmtcp_serial_example.cpp


#-------------------------- Start Checkpointing ------------------------------#
###############################################################################
# Start the checkpointing coordinator
###############################################################################

ckpt_every=60   # auto-checkpoint every X seconds
start_coordinator -i ${ckpt_every}

###############################################################################
# Launch application.
###############################################################################

run_cmd="./serial_example 5"    # command to run, with any arguments
ckpt_dir="$(pwd)/checkpoints"   # directory in which to store checkpoint images  
dmtcp_launch --rm -j --no-gzip --ckptdir ${ckpt_dir} ${run_cmd}
serial_restart.sbatch
#!/usr/bin/env bash

#--------------------------- Slurm directives --------------------------------#
###############################################################################
# General Slurm directives for the job
###############################################################################
#SBATCH --job-name=dmtcp_serial_example_restart
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --time=00:05:00
#SBATCH --mem=1G


#----------------------------- Load modules ----------------------------------#
###############################################################################
# Load all necessary modules, a matching DMTCP module built with the same
# compiler/toolchain, and the DMTCP wrapper module.
############################################################################### 

module load tools/DMTCP/2.6.0-GCCcore-9.3.0
module load tools/DMTCP-wrapper
module load compiler/GCC/9.3.0

#-------------------------- Start Checkpointing ------------------------------#
###############################################################################
# Start the checkpointing coordinator
###############################################################################

ckpt_every=60   # auto-checkpoint every X seconds
start_coordinator -i ${ckpt_every}

###############################################################################
# Run restart script
###############################################################################
./dmtcp_restart_script.sh -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT
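
Before submitting a restart job it can be worth confirming that there is something to restart from. The following is a minimal sketch, assuming the checkpoint images use DMTCP's usual `ckpt_*.dmtcp` naming and live in the `checkpoints` directory used above; `latest_checkpoint` is a hypothetical helper, not part of DMTCP or the wrapper module.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight helper: print the newest checkpoint image in a
# directory, or return non-zero if the directory contains none.
latest_checkpoint() {
    local ckpt_dir="$1" newest="" f
    for f in "$ckpt_dir"/ckpt_*.dmtcp; do
        [ -e "$f" ] || continue    # skip the unexpanded glob when nothing matches
        if [ -z "$newest" ] || [ "$f" -nt "$newest" ]; then
            newest="$f"
        fi
    done
    [ -n "$newest" ] || return 1
    echo "$newest"
}

# Possible use near the top of a restart job script:
#   if ! latest_checkpoint "$(pwd)/checkpoints" >/dev/null; then
#       echo "No checkpoint images found, nothing to restart from" >&2
#       exit 1
#   fi
```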

Alternatively, the job script can resubmit itself automatically and continue from the last checkpoint, as demonstrated in "serial_auto-resubmit.sbatch". This is particularly useful for long jobs, where the expected total runtime is significantly longer than the maximum walltime allowed in the queue. As before, please make sure to update:

  • the resources requested in the "Slurm directives" section (--time and --mem in particular)
  • the software modules loaded (make sure that the DMTCP module matches the compiler/toolchain of any other modules loaded)
  • the pre-checkpointing step, which may not be needed at all
  • the checkpointing frequency, ckpt_every, in seconds. For long jobs, checkpoints should probably be no more frequent than every hour (ckpt_every=3600)
  • the lead time in the Slurm directive #SBATCH --signal=B:USR1@60, which may need to be increased (for example to #SBATCH --signal=B:USR1@300) depending on how long it takes to create a checkpoint image
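
One way to choose the lead time for `--signal` is to time a checkpoint once and add a safety margin. The sketch below only does the arithmetic: `signal_lead_seconds` is a hypothetical helper, and the 30 second default margin and the rounding to whole minutes are assumptions, not measured values.

```shell
#!/usr/bin/env bash
# Hypothetical helper: turn a measured checkpoint duration (seconds) into a
# lead time for "#SBATCH --signal=B:USR1@<lead>".
signal_lead_seconds() {
    local ckpt_s="$1"            # time taken to write one checkpoint image
    local margin_s="${2:-30}"    # assumed allowance for stopping and requeueing
    local step_s=60
    local total=$(( ckpt_s + margin_s ))
    # round up to a whole number of minutes
    echo $(( (total + step_s - 1) / step_s * step_s ))
}

# Example: a checkpoint that takes 200 s to write
echo "#SBATCH --signal=B:USR1@$(signal_lead_seconds 200)"
```
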
serial_auto-resubmit.sbatch
#!/usr/bin/env bash

#--------------------------- Slurm directives --------------------------------#
###############################################################################
# General Slurm directives for the job
###############################################################################
#SBATCH --job-name=dmtcp_serial_example_autorestart
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --time=00:04:00
#SBATCH --mem=1G

###############################################################################
# Additional directives needed for auto-restart
###############################################################################
#SBATCH --signal=B:USR1@60  # send USR1 signal 60s before job max walltime
#SBATCH --requeue
#SBATCH --open-mode=append  # append to existing log file


#----------------------------- Load modules ----------------------------------# 
###############################################################################
# Load all necessary modules, a matching DMTCP module built with the same
# compiler/toolchain, and the DMTCP wrapper module.
###############################################################################

module load tools/DMTCP/2.6.0-GCCcore-9.3.0
module load tools/DMTCP-wrapper
module load compiler/GCC/9.3.0


#--------------------------- Pre-checkpointing -------------------------------#
###############################################################################
# Commands to run before checkpointing. In this example, we compile the example
# program.
###############################################################################

if [ ! -f serial_example ]; then
    g++ -o serial_example dmtcp_serial_example.cpp
fi


#-------------------------- Start Checkpointing ------------------------------#
###############################################################################
# Start the checkpointing coordinator
###############################################################################

ckpt_every=60   # auto-checkpoint every X seconds
start_coordinator -i ${ckpt_every}

###############################################################################
# Launch/restart application.
###############################################################################

run_cmd="./serial_example 5"    # command to run, with any arguments
ckpt_dir="$(pwd)/checkpoints"   # directory in which to store checkpoint images  

# Check if job has been restarted
nrest=${SLURM_RESTART_COUNT}
if [[ "${nrest}" -eq "0" ]]; then
    # First time, use dmtcp_launch to start job
    dmtcp_launch --rm -j --no-gzip --ckptdir ${ckpt_dir} ${run_cmd} &

elif [[ "${nrest}" -gt "0" ]] && [[ -e dmtcp_restart_script.sh ]]; then
    # Restart job in background
    ./dmtcp_restart_script.sh &

else
    echo "Failed to restart job. Exiting..."
    exit 1
fi


#-------------------------- Auto-requeue Job ---------------------------------#
###############################################################################
# If the job hasn't completed near the end of the allocated time, trigger a
# manual checkpoint, kill the coordinator and resubmit the job script
###############################################################################

trap requeue_job USR1

# ensure job does not exit whilst background processes are still running
wait
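
The application is started in the background and the script ends with `wait` because bash runs a trap handler only once the current foreground command has finished, whereas the `wait` builtin is interrupted as soon as a trapped signal arrives. The self-contained sketch below illustrates this behaviour without Slurm or DMTCP: `sleep 30` stands in for the application, a background subshell stands in for Slurm's `--signal`, and `on_usr1` is a hypothetical stand-in for `requeue_job`.

```shell
#!/usr/bin/env bash
# Demonstrates that a trapped USR1 interrupts `wait` while a background
# process is still running.
handled=0
on_usr1() {
    handled=1
    echo "caught USR1: a real job would call requeue_job here"
}
trap on_usr1 USR1

self_pid="${BASHPID:-$$}"       # PID of the shell that installed the trap

sleep 30 &                      # stands in for the checkpointed application
app_pid=$!

( sleep 1; kill -USR1 "$self_pid" ) &   # stands in for Slurm's --signal
sig_pid=$!

status=0
wait "$app_pid" || status=$?    # returns early (status > 128) when USR1 arrives
echo "wait returned ${status}, handled=${handled}"

# Tidy up the stand-in processes
kill "$app_pid" 2>/dev/null
wait "$app_pid" 2>/dev/null || true
wait "$sig_pid" 2>/dev/null || true
```

Had the stand-in application been run in the foreground instead, the handler would not have run until it finished, which in a real job would be after the walltime limit had already been reached.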


Threaded (single-node)

The following threaded example job scripts compile and run an example C++ program: dmtcp_omp_example.cpp.

The job scripts "omp_submit.sbatch" and "omp_restart.sbatch" (below) show a simple example of how to use the DMTCP coordinator to checkpoint an OpenMP threaded program regularly, and how to restart from a previous checkpoint in another job. To adapt these scripts to your own jobs, make sure to update:

  • the resources requested in the "Slurm directives" section (--time, --mem and --cpus-per-task in particular)
  • the software modules loaded (make sure that the DMTCP module matches the compiler/toolchain of any other modules loaded)
  • the pre-checkpointing step, which may not be needed at all
  • the checkpointing frequency, ckpt_every, in seconds. For jobs with a runtime of ~1hr, this should be about every 10 min (ckpt_every=600); for longer jobs, every hour should be sufficient (ckpt_every=3600)
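
When changing --cpus-per-task, it may also help to set the OpenMP thread count explicitly so that it matches the allocation. The example scripts on this page do not do this; the sketch below is a common pattern, assuming the program honours `OMP_NUM_THREADS`. `omp_threads_from_slurm` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: use the CPUs Slurm allocated to the task as the OpenMP
# thread count, falling back to 1 outside a Slurm job.
omp_threads_from_slurm() {
    echo "${SLURM_CPUS_PER_TASK:-1}"
}

export OMP_NUM_THREADS="$(omp_threads_from_slurm)"
echo "Running with ${OMP_NUM_THREADS} OpenMP thread(s)"
```
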
omp_submit.sbatch
#!/usr/bin/env bash

#--------------------------- Slurm directives --------------------------------#
###############################################################################
# General Slurm directives for the job
###############################################################################
#SBATCH --job-name=dmtcp_omp_example
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=10
#SBATCH --time=00:03:00
#SBATCH --mem=1G


#----------------------------- Load modules ----------------------------------# 
###############################################################################
# Load all necessary modules, a matching DMTCP module built with the same
# compiler/toolchain, and the DMTCP wrapper module.
###############################################################################

module load tools/DMTCP/2.6.0-GCCcore-9.3.0
module load tools/DMTCP-wrapper
module load compiler/GCC/9.3.0


#--------------------------- Pre-checkpointing -------------------------------#
###############################################################################
# Commands to run before checkpointing. In this example, we compile the example
# program.
###############################################################################

g++ -fopenmp -o omp_example dmtcp_omp_example.cpp -lpthread


#-------------------------- Start Checkpointing ------------------------------#
###############################################################################
# Start the checkpointing coordinator
###############################################################################

ckpt_every=60   # auto-checkpoint every X seconds
start_coordinator -i ${ckpt_every}

###############################################################################
# Launch application.
###############################################################################

run_cmd="./omp_example"    # command to run, with any arguments
ckpt_dir="$(pwd)/checkpoints"   # directory in which to store checkpoint images  
dmtcp_launch --rm -j --no-gzip --ckptdir ${ckpt_dir} ${run_cmd}

omp_restart.sbatch
#!/usr/bin/env bash

#--------------------------- Slurm directives --------------------------------#
###############################################################################
# General Slurm directives for the job
###############################################################################
#SBATCH --job-name=dmtcp_omp_example_restart
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=10
#SBATCH --time=00:05:00
#SBATCH --mem=1G


#----------------------------- Load modules ----------------------------------#
###############################################################################
# Load all necessary modules, a matching DMTCP module built with the same
# compiler/toolchain, and the DMTCP wrapper module.
############################################################################### 

module load tools/DMTCP/2.6.0-GCCcore-9.3.0
module load tools/DMTCP-wrapper
module load compiler/GCC/9.3.0

#-------------------------- Start Checkpointing ------------------------------#
###############################################################################
# Start the checkpointing coordinator
###############################################################################

ckpt_every=60   # auto-checkpoint every X seconds
start_coordinator -i ${ckpt_every}

###############################################################################
# Run restart script
###############################################################################
./dmtcp_restart_script.sh -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT

As in the serial example, it is again possible to automatically resubmit the job script, as demonstrated in "omp_auto-resubmit.sbatch". This is particularly useful for long jobs, where the expected total runtime is significantly longer than the maximum walltime allowed in the queue. As before, ensure that the following have been adapted to suit your job:

  • the resources requested in the "Slurm directives" section (--time, --mem and --cpus-per-task in particular)
  • the software modules loaded (make sure that the DMTCP module matches the compiler/toolchain of any other modules loaded)
  • the pre-checkpointing step, which may not be needed at all
  • the checkpointing frequency, ckpt_every, in seconds. For long jobs, checkpoints should probably be no more frequent than every hour (ckpt_every=3600)
  • the lead time in the Slurm directive #SBATCH --signal=B:USR1@60, which may need to be increased (for example to #SBATCH --signal=B:USR1@300) depending on how long it takes to create a checkpoint image
omp_auto-resubmit.sbatch
#!/usr/bin/env bash

#--------------------------- Slurm directives --------------------------------#
###############################################################################
# General Slurm directives for the job
###############################################################################
#SBATCH --job-name=dmtcp_omp_example_autorestart
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=10
#SBATCH --time=00:04:00
#SBATCH --mem=1G

###############################################################################
# Additional directives needed for auto-restart
###############################################################################
#SBATCH --signal=B:USR1@60  # send USR1 signal 60s before job max walltime
#SBATCH --requeue
#SBATCH --open-mode=append  # append to existing log file


#----------------------------- Load modules ----------------------------------# 
###############################################################################
# Load all necessary modules, a matching DMTCP module built with the same
# compiler/toolchain, and the DMTCP wrapper module.
###############################################################################

module load tools/DMTCP/2.6.0-GCCcore-9.3.0
module load tools/DMTCP-wrapper
module load compiler/GCC/9.3.0


#--------------------------- Pre-checkpointing -------------------------------#
###############################################################################
# Commands to run before checkpointing. In this example, we compile the example
# program.
###############################################################################

if [ ! -f omp_example ]; then
    g++ -fopenmp -o omp_example dmtcp_omp_example.cpp -lpthread
fi


#-------------------------- Start Checkpointing ------------------------------#
###############################################################################
# Start the checkpointing coordinator
###############################################################################

ckpt_every=60   # auto-checkpoint every X seconds
start_coordinator -i ${ckpt_every}

###############################################################################
# Launch/restart application.
###############################################################################

run_cmd="./omp_example"    # command to run, with any arguments
ckpt_dir="$(pwd)/checkpoints"   # directory in which to store checkpoint images  

# Check if job has been restarted
nrest=${SLURM_RESTART_COUNT}
if [[ "${nrest}" -eq "0" ]]; then
    # First time, use dmtcp_launch to start job
    dmtcp_launch --rm -j --no-gzip --ckptdir ${ckpt_dir} ${run_cmd} &

elif [[ "${nrest}" -gt "0" ]] && [[ -e dmtcp_restart_script.sh ]]; then
    # Restart job in background
    ./dmtcp_restart_script.sh &

else
    echo "Failed to restart job. Exiting..."
    exit 1
fi


#-------------------------- Auto-requeue Job ---------------------------------#
###############################################################################
# If the job hasn't completed near the end of the allocated time, trigger a
# manual checkpoint, kill the coordinator and resubmit the job script
###############################################################################

trap requeue_job USR1

# ensure job does not exit whilst background processes are still running
wait


DMTCP wrapper

In the examples documented above, we make use of some bash functions that are automatically sourced when the `tools/DMTCP-wrapper` module is loaded. Currently, the wrapper contains two functions:

  • `requeue_job()` 
  • `start_coordinator()` 

requeue_job()

The `requeue_job` function is designed to be called in the self-resubmitting job scripts, shortly before the job reaches its maximum allowed walltime. It triggers a blocking checkpoint, which includes open files, and then stops the coordinator. Finally, it calls `scontrol requeue` with the job's own ID, which causes the job to be terminated and immediately requeued.

requeue_job
requeue_job() {
    # Function to trigger a checkpoint, stop the coordinator and resubmit the
    # job script
    
    # checkpoint now
    dmtcp_command -h ${DMTCP_COORD_HOST} -p ${DMTCP_COORD_PORT} \
        --ckpt-open-files -bc

    # stop the coordinator
    dmtcp_command -h ${DMTCP_COORD_HOST} -p ${DMTCP_COORD_PORT} --quit

    # resubmit job script
    scontrol requeue ${SLURM_JOB_ID}
}

start_coordinator()

The `start_coordinator` function automates the dmtcp_coordinator setup, and provides an easy way to communicate with the coordinator through the `dmtcp_command` wrapper. It also sets the `$DMTCP_COORD_HOST` and `$DMTCP_COORD_PORT` environment variables, which are needed by `requeue_job`.

start_coordinator
# start_coordinator function from plugin/batch-queue/job_examples/slurm_launch.job
start_coordinator() {
    ############################################################
    # For debugging when launching a custom coordinator, uncomment 
    # the following lines and provide the proper host and port for 
    # the coordinator.
    ############################################################
    # export DMTCP_COORD_HOST=$h
    # export DMTCP_COORD_PORT=$p
    # return

    fname=dmtcp_command.$SLURM_JOBID
    h=`hostname`

    check_coordinator=`which dmtcp_coordinator`
    if [ -z "$check_coordinator" ]; then
        echo "No dmtcp_coordinator found. Check your DMTCP installation and PATH settings."
        exit 0
    fi

    dmtcp_coordinator --daemon --exit-on-last -p 0 --port-file $fname $@ 1>/dev/null 2>&1
    
    while true; do 
        if [ -f "$fname" ]; then
            p=`cat $fname`
            if [ -n "$p" ]; then
                # try to communicate ? dmtcp_command -p $p l
                break
            fi
        fi
    done
    
    # Create dmtcp_command wrapper for easy communication with coordinator
    p=`cat $fname`
    chmod +x $fname
    echo "#!/bin/bash" > $fname
    echo >> $fname
    echo "export PATH=$PATH" >> $fname
    echo "export DMTCP_COORD_HOST=$h" >> $fname
    echo "export DMTCP_COORD_PORT=$p" >> $fname
    echo "dmtcp_command \$@" >> $fname

    # Set up local environment for DMTCP
    export DMTCP_COORD_HOST=$h
    export DMTCP_COORD_PORT=$p
}
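
The loop in `start_coordinator` polls for the port file continuously and never gives up. A possible variation, shown here only as a sketch and not part of the installed wrapper, sleeps between polls and fails after a timeout. `wait_for_port_file` is a hypothetical name and the 30 second default is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical variation on the polling loop: wait for the coordinator's port
# file to appear and be non-empty, then print the port it contains.
wait_for_port_file() {
    local fname="$1" timeout_s="${2:-30}" waited=0
    while [ "$waited" -lt "$timeout_s" ]; do
        if [ -s "$fname" ]; then    # file exists and is not empty
            cat "$fname"
            return 0
        fi
        sleep 1
        waited=$(( waited + 1 ))
    done
    echo "Timed out waiting for coordinator port file: $fname" >&2
    return 1
}

# Possible use after starting dmtcp_coordinator with --port-file "$fname":
#   p="$(wait_for_port_file "$fname" 60)" || exit 1
```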