Overview


AlphaFold is an AI system developed by DeepMind that predicts a protein's 3D structure from its amino acid sequence. The source code for the inference pipeline can be found on DeepMind's GitHub page.

Both CPU and GPU compute versions have been installed on Viking, and are made available through the module system. Since a few tweaks have been made to the installation, it is important to read through the following documentation before running any jobs with AlphaFold.

Video Introduction to AlphaFold and Viking

If you are new to Viking and AlphaFold, you can also watch this video (password: 4.#L5jdx), which takes you through the basics of using AlphaFold on Viking and gives an introduction to AlphaFold itself.

A Viking's take on AlphaFold: Protein Structure Prediction at York
Host:  Prof Tony Wilkinson
Speakers:  Jon Agirre (YSBL) and Emma Barnes (IT Services)

The artificial intelligence system AlphaFold 2 has recently produced a much-heralded solution to what is known as the Protein Folding Problem, one of Molecular Biology's Grand Challenges of more than 50 years' standing. In this seminar directed at researchers without specialist knowledge of computing, Jon Agirre (YSBL) will give a practical overview and assessment of AlphaFold 2. Together with Emma Barnes (IT Services), he will explain how AlphaFold can be accessed by Biological Scientists wishing to generate and interpret their own structural models.

Loading the AlphaFold software module

For each release of AlphaFold installed on Viking, two different modules may be provided - one optimised for CPU-based computing, and one for GPU computing.

CPU-only: use one of the following:

module load bio/AlphaFold/2.0.0-foss-2020b

GPU: use one of the following:

module load bio/AlphaFold/2.0.0-fosscuda-2020b
module load bio/AlphaFold/2.1.1-fosscuda-2020b
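
A job script can pick the matching build from a single setting. The sketch below wraps the 2.0.0 `module load` commands listed above in a small helper; the function name `load_alphafold` and its `cpu`/`gpu` argument are illustrative and are not provided by the Viking installation.

```shell
# Hypothetical helper (not part of the Viking installation): load the
# AlphaFold 2.0.0 build that matches the kind of job being run.
load_alphafold() {
    case "$1" in
        cpu) module load bio/AlphaFold/2.0.0-foss-2020b ;;
        gpu) module load bio/AlphaFold/2.0.0-fosscuda-2020b ;;
        *)
            echo "usage: load_alphafold cpu|gpu" >&2
            return 2
            ;;
    esac
}
```

Calling `load_alphafold gpu` near the top of a job script would then stand in for the explicit `module load` line.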

AlphaFold databases

AlphaFold currently requires various genetic databases to be available: UniRef90, MGnify, BFD, Uniclust30, PDB70, PDB.

To avoid needless duplication of large databases across the cluster, these have been made available in a central directory:

/mnt/bb/striped/alphafold_db/20210908 

The name of the subdirectory (20210908) here indicates the date when the databases were downloaded. The files are hosted on the burst buffer (`/mnt/bb`), a shared filesystem powered by fast SSDs, which is recommended for AlphaFold because of its random I/O access patterns. In test jobs, we have observed up to a 2x slowdown when using the disk-based Lustre filesystem `/mnt/lustre` instead of `/mnt/bb`.
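
Because the snapshot directory is named by download date in YYYYMMDD form, sorting such names as text also sorts them by age. The sketch below uses that to find the most recent snapshot under a parent directory. It assumes that any newer snapshots would be placed alongside 20210908 in `/mnt/bb/striped/alphafold_db`, which this page does not state, and the function name is illustrative.

```shell
# Hypothetical helper: print the newest YYYYMMDD-named snapshot directory
# under a parent directory (default: the central AlphaFold database path).
latest_alphafold_db() {
    local parent="${1:-/mnt/bb/striped/alphafold_db}"
    local newest="" dir name
    for dir in "$parent"/*/; do
        [ -d "$dir" ] || continue        # the glob matched nothing
        name=$(basename "$dir")
        case "$name" in
            [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]) ;;   # exactly eight digits
            *) continue ;;
        esac
        # Equal-length digit strings compare chronologically as text.
        if [[ -z "$newest" || "$name" > "$newest" ]]; then
            newest="$name"
        fi
    done
    [ -n "$newest" ] || { echo "no snapshot found in $parent" >&2; return 1; }
    echo "$parent/$newest"
}
```

A job script could then use `export ALPHAFOLD_DATA_DIR="$(latest_alphafold_db)"`, although pinning an explicit snapshot, as the example job scripts do, keeps results reproducible.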

Modifications to running AlphaFold

It is important to note that we have made a few enhancements to the installation to facilitate easier usage:

CPU versus GPU performance

Using the T1050.fasta example mentioned in the AlphaFold README, we have seen the following runtimes (using `--preset=full_dbs`):

CPU cores  GPUs  Runtime (HH:MM:SS) on /mnt/bb  Runtime (HH:MM:SS) on /mnt/lustre
8          -     > 24:00:00                     22:16:07
16         -     15:37:54                       21:35:56
20         -     17:11:14                       17:40:30
40         -     17:59:13                       21:20:14
10         1     02:28:37                       04:58:51
20         2     02:21:49                       03:22:28
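
The runtimes above are easier to compare once they are converted to seconds. The sketch below does that with `awk` and prints the ratio between two runs; the function names are illustrative, and the input values are taken directly from the table.

```shell
# Convert an HH:MM:SS runtime to seconds (awk reads the fields as decimal
# numbers, so leading zeros such as "02" or "08" are handled).
hms_to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Print how many times longer the first runtime is than the second.
runtime_ratio() {
    awk -v slow="$(hms_to_seconds "$1")" -v fast="$(hms_to_seconds "$2")" \
        'BEGIN { printf "%.1f\n", slow / fast }'
}

# Fastest CPU-only run (16 cores) versus 10 cores + 1 GPU, both on /mnt/bb
runtime_ratio 15:37:54 02:28:37    # prints 6.3
# The 10-core, 1-GPU job on /mnt/lustre versus the same job on /mnt/bb
runtime_ratio 04:58:51 02:28:37    # prints 2.0
```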

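This highlights the importance of the resources requested when running AlphaFold. These tests suggest that running on a GPU is much faster than running on CPUs alone (about 2.5 hours versus more than 15 hours on /mnt/bb), that doubling the request to 2 GPUs and 20 cores brings little further gain (02:21:49 versus 02:28:37 on /mnt/bb), that requesting more than 16 CPU cores did not shorten the CPU-only runs on /mnt/bb, and that /mnt/bb was faster than /mnt/lustre in every run except the 8-core one.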


Example job scripts

The following example scripts were used to run the CPU and GPU benchmarks above. To modify these for your own needs, ensure that you update:

The walltime and memory requests are based on the observed job efficiency data when using the T1050.fasta file with `--preset=full_dbs`. These will likely need to be increased for other FASTA input files, or when running with a different preset.
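
One way to gather the same kind of efficiency data for your own runs is Slurm's accounting command `sacct`, which reports the elapsed time and peak memory of a finished job. This is a minimal sketch that assumes job accounting is enabled on Viking; the function name `job_usage` is illustrative.

```shell
# Hypothetical helper: show requested versus used time and memory for a
# finished job, given its Slurm job ID.
job_usage() {
    sacct -j "$1" --format=JobID,JobName,State,Elapsed,Timelimit,ReqMem,MaxRSS
}
```

For example, `job_usage 1234567` (a made-up job ID) lists one row per job step. If Elapsed is close to Timelimit, or MaxRSS is close to ReqMem, the corresponding `#SBATCH` request probably needs to be raised.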

Important note.

For later versions of AlphaFold, you may need to update the flags.

API changes between v2.0.0 and v2.1.0

We tried to keep the API as much backwards compatible as possible, but we had to change the following:
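
According to the upstream AlphaFold README for v2.1.0, the `--preset` flag was split into `--db_preset` and `--model_preset`, and the models to run are selected through `--model_preset` rather than `--model_names`. The sketch below applies that change to the benchmark command used on this page. The helper name is illustrative, the choice of `monomer` is an assumption for a single-chain input such as T1050.fasta, and the resulting command has not been checked against the 2.1.1 module on Viking.

```shell
# Hypothetical helper: print the preset-related flags for a given AlphaFold
# version, following the v2.1.0 API change described in the upstream README.
alphafold_preset_flags() {
    case "$1" in
        2.0.*)
            echo "--preset=full_dbs --model_names=model_1,model_2,model_3,model_4,model_5"
            ;;
        *)
            # Assumption: v2.1.0 or later, single-chain (monomer) input.
            echo "--db_preset=full_dbs --model_preset=monomer"
            ;;
    esac
}

# Print (rather than run) the command that a 2.1.1 job might use.
echo alphafold --fasta_paths=T1050.fasta --max_template_date=2020-05-14 \
    $(alphafold_preset_flags 2.1.1) --output_dir="$PWD"
```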

CPU job for AlphaFold/2.0.0-foss-2020b

#!/usr/bin/env bash

#SBATCH --job-name=alphafold-cpu-test
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=80G
#SBATCH --time=24:00:00
#SBATCH --output=%x-%j.log


# Load AlphaFold module
module load bio/AlphaFold/2.0.0-foss-2020b

# Path to genetic databases
export ALPHAFOLD_DATA_DIR=/mnt/bb/striped/alphafold_db/20210908/

# Optional: uncomment to change number of CPU cores to use for hhblits/jackhmmer
# export ALPHAFOLD_HHBLITS_N_CPU=8
# export ALPHAFOLD_JACKHMMER_N_CPU=8

# Run AlphaFold
alphafold --fasta_paths=T1050.fasta --max_template_date=2020-05-14 --preset=full_dbs --output_dir="$PWD" --model_names=model_1,model_2,model_3,model_4,model_5
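
The two optional variables in the script above can also follow the job's CPU allocation instead of a fixed number. This is a minimal sketch that relies on Slurm exporting `SLURM_CPUS_PER_TASK`, which it does when `--cpus-per-task` is requested; the function name is illustrative, and the fallback of 8 simply mirrors the commented-out lines.

```shell
# Hypothetical helper: give hhblits and jackhmmer as many cores as the
# job was allocated, falling back to 8 outside a Slurm job.
set_alphafold_cpus() {
    local n_cpu="${SLURM_CPUS_PER_TASK:-8}"
    export ALPHAFOLD_HHBLITS_N_CPU="$n_cpu"
    export ALPHAFOLD_JACKHMMER_N_CPU="$n_cpu"
}
```

Call `set_alphafold_cpus` before the `alphafold` line. The CPU-only benchmarks on this page do not show a consistent gain from extra cores, so it is worth timing a run before relying on this.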


GPU job for AlphaFold/2.0.0-fosscuda-2020b

#!/usr/bin/env bash

#SBATCH --job-name=alphafold-gpu-test
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=10
#SBATCH --gres=gpu:1
#SBATCH --partition=gpu
#SBATCH --time=4:00:00
#SBATCH --output=%x-%j.log

# Load AlphaFold module
module load bio/AlphaFold/2.0.0-fosscuda-2020b

# Path to genetic databases
export ALPHAFOLD_DATA_DIR=/mnt/bb/striped/alphafold_db/20210908/

# Optional: uncomment to change number of CPU cores to use for hhblits/jackhmmer
# export ALPHAFOLD_HHBLITS_N_CPU=8
# export ALPHAFOLD_JACKHMMER_N_CPU=8

# Run AlphaFold
alphafold --fasta_paths=T1050.fasta --max_template_date=2020-05-14 --preset=full_dbs --output_dir="$PWD" --model_names=model_1,model_2,model_3,model_4,model_5
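
Either script is submitted to the scheduler with `sbatch`. The sketch below wraps that call in a helper that reports the job ID; the function name and the file name `alphafold_gpu.sh` are hypothetical.

```shell
# Hypothetical helper: submit a job script and report the assigned job ID.
submit_job() {
    local script="$1" job_id
    # --parsable makes sbatch print only "jobid" (or "jobid;cluster").
    job_id=$(sbatch --parsable "$script") || return 1
    echo "Submitted $script as job ${job_id%%;*}"
}
```

After `submit_job alphafold_gpu.sh`, the command `squeue -u "$USER"` shows the job's state in the queue. With `--output=%x-%j.log` as set in the scripts above, the log is written to a file named after the job name and job ID, for example `alphafold-gpu-test-<jobid>.log`.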