AlphaFold

AlphaFold is an AI system developed by DeepMind that predicts a protein’s 3D structure from its amino acid sequence. The source code for the inference pipeline can be found on the AlphaFold GitHub page.

Attention

Since a few tweaks have been made to the installation, it is important to read through the following documentation before running any jobs with AlphaFold.

The CPU-only version of AlphaFold can be loaded using the following command:

$ module load AlphaFold/2.3.1-foss-2022a

The GPU version of AlphaFold can be loaded using the following command:

$ module load AlphaFold/2.3.1-foss-2022a-CUDA-11.7.0
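
To check which AlphaFold versions are currently installed, you can list them with the module system (assuming the usual module tooling; on Lmod-based systems module spider AlphaFold gives the same information):

$ module avail AlphaFold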

Example job scripts

Using 16 CPUs, 80 GB of memory, for up to 24 hours
#!/usr/bin/env bash
#SBATCH --job-name=AlphaFold_cpu_example        # Job name
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=80G
#SBATCH --time=24:00:00
#SBATCH --output=%x-%j.log
#SBATCH --mail-type=BEGIN,END,FAIL              # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=abc123@york.ac.uk           # Where to send mail
#SBATCH --account=dept-proj-year                # Project account to use

# Abort if any command fails
set -e

module purge                                    # purge any loaded modules
# Load AlphaFold module
module load AlphaFold/2.3.1-foss-2022a

# Path to genetic databases
export ALPHAFOLD_DATA_DIR=/mnt/scratch/projects/alphafold-db/latest

# Optional: uncomment to change number of CPU cores to use for hhblits/jackhmmer
# export ALPHAFOLD_HHBLITS_N_CPU=8
# export ALPHAFOLD_JACKHMMER_N_CPU=8

# Run AlphaFold
alphafold --fasta_paths=T1050.fasta \
    --max_template_date=2020-05-14 \
    --db_preset=full_dbs \
    --output_dir=$PWD
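
To submit the script, save it to a file (for example alphafold_cpu.job; the name is only illustrative) and pass it to sbatch, making sure T1050.fasta is present in the submission directory:

$ sbatch alphafold_cpu.job
$ squeue -u $USER    # check the job's place in the queue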

Important

The CUDA-enabled GPU module will only work on the gpu or gpu_week partitions with the NVIDIA A40 GPUs; it won't work on the gpu_plus partition with the H100 GPUs, as those require CUDA 11.8 or later.
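
If you are unsure which GPUs a partition provides, you can ask Slurm to list each partition's generic resources (a quick check; the exact output format varies between sites):

$ sinfo -p gpu,gpu_week,gpu_plus -o "%P %G"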

Using one GPU and 10 CPUs, for up to 4 hours
#!/usr/bin/env bash
#SBATCH --job-name=AlphaFold_GPU_example    # Job name
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=10
#SBATCH --gres=gpu:1
#SBATCH --partition=gpu
#SBATCH --time=4:00:00
#SBATCH --output=%x-%j.log
#SBATCH --mail-type=BEGIN,END,FAIL          # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=abc123@york.ac.uk       # Where to send mail
#SBATCH --account=dept-proj-year            # Project account to use

# Abort if any command fails
set -e

module purge                                # purge any loaded modules
# Load AlphaFold module
module load AlphaFold/2.3.1-foss-2022a-CUDA-11.7.0

# Path to genetic databases
export ALPHAFOLD_DATA_DIR=/mnt/scratch/projects/alphafold-db/latest

# Optional: uncomment to change number of CPU cores to use for hhblits/jackhmmer
# export ALPHAFOLD_HHBLITS_N_CPU=8
# export ALPHAFOLD_JACKHMMER_N_CPU=8

# Run AlphaFold
alphafold --fasta_paths=T1050.fasta \
    --max_template_date=2020-05-14 \
    --db_preset=full_dbs \
    --output_dir=$PWD
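
The GPU script is submitted in the same way. Assuming it is saved as alphafold_gpu.job (again, an illustrative name), the log file name follows from the --output=%x-%j.log directive above:

$ sbatch alphafold_gpu.job
$ tail -f AlphaFold_GPU_example-<jobid>.log    # replace <jobid> with the ID printed by sbatch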

Notes for using AlphaFold on Viking

AlphaFold currently requires access to various genetic databases such as UniRef90, MGnify, BFD, Uniclust30, PDB70 and PDB.

To avoid needless duplication of large databases across the cluster, these have been made available in a central directory:

/mnt/scratch/projects/alphafold-db/latest

The subdirectory latest is a symlink which points to the most recently downloaded set of databases; you can see all of the available sets within the /mnt/scratch/projects/alphafold-db/ directory. The files are hosted on fast SSDs, which is recommended for AlphaFold due to its random I/O access patterns. As the benchmarks below show, this can make jobs run up to 2x faster than if the databases were stored on the disk-based Lustre filesystem.
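
For example, to list the available database sets and see which one latest currently resolves to (the snapshot names will naturally change over time):

$ ls /mnt/scratch/projects/alphafold-db/
$ readlink -f /mnt/scratch/projects/alphafold-db/latest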

It is important to note that we have made a few enhancements to the installation to facilitate easier usage:

  • The location of the AlphaFold data can be specified via the $ALPHAFOLD_DATA_DIR environment variable, so you should define this variable in your AlphaFold job script: export ALPHAFOLD_DATA_DIR=/mnt/scratch/projects/alphafold-db/latest

  • A symbolic link named alphafold, which points to the run_alphafold.py script, is included. This means you can just use alphafold instead of run_alphafold.py or python run_alphafold.py.

  • The run_alphafold.py script has been slightly modified such that defining $ALPHAFOLD_DATA_DIR is sufficient to pick up all the data provided in that location, meaning that you don’t need to use options like --data_dir to specify the location of the data.

  • Similarly, the run_alphafold.py script was tweaked such that the locations of commands like hhblits, hhsearch, jackhmmer and kalign are already correctly set, and thus options like --hhblits_binary_path are not required.

  • The Python scripts that are used to run hhblits and jackhmmer have been tweaked so you can control how many cores these tools use (rather than hard-coding this to 4 and 8 cores respectively).

    • If set, the $ALPHAFOLD_HHBLITS_N_CPU environment variable specifies how many cores should be used for running hhblits; the default of 4 cores is used if $ALPHAFOLD_HHBLITS_N_CPU is not defined. The same applies for jackhmmer and $ALPHAFOLD_JACKHMMER_N_CPU (an example is shown after this list).

    • Tweaking either of these may not be worth it, however: test jobs indicated that using more than 4/8 cores actually resulted in worse performance (although this may be workload dependent).
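
If you do want to experiment, one option is to tie the thread counts to your Slurm allocation instead of hard-coding a number. Below is a minimal sketch, assuming it is placed in the job script after the module load and before the alphafold command; it keeps the tested defaults as upper bounds:

# Use the cores granted by Slurm, but never more than the tested defaults
# (4 cores for hhblits, 8 cores for jackhmmer)
export ALPHAFOLD_HHBLITS_N_CPU=$(( SLURM_CPUS_PER_TASK < 4 ? SLURM_CPUS_PER_TASK : 4 ))
export ALPHAFOLD_JACKHMMER_N_CPU=$(( SLURM_CPUS_PER_TASK < 8 ? SLURM_CPUS_PER_TASK : 8 ))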

CPU vs GPU performance

Shown below are the results of using the T1050.fasta example mentioned in the AlphaFold README with different resource allocations.

AlphaFold performance

CPU cores | GPUs | Runtime (HH:MM:SS)  | Runtime (HH:MM:SS)
          |      | on /mnt/bb (SSD)    | on /mnt/lustre (disk)
----------|------|---------------------|----------------------
8         | 0    | >24:00:00           | 22:16:07
16        | 0    | 15:37:54            | 21:35:56
20        | 0    | 17:11:14            | 17:40:30
40        | 0    | 17:59:13            | 21:20:14
10        | 1    | 02:28:37            | 04:58:51
20        | 2    | 02:21:49            | 03:22:28

This highlights the importance of requesting appropriate resources when using AlphaFold. These results suggest:

  • It is faster for almost all jobs to run AlphaFold with the databases stored on the SSDs

  • Using a GPU can considerably reduce a job's runtime (up to 6x faster in these tests)

  • Using a second GPU does not significantly reduce the runtime for a job

  • Counter-intuitively, using more cores can lower performance