Snacker News

How to run Rosetta with SLURM on a high-performance compute cluster

Feb 21, 2016

Rosetta’s sampling methods let you search protein sequence-function space, which is extremely rugged. To use Rosetta effectively, you need high-performance (cluster) computing to run many sampling trajectories in parallel.

This tutorial assumes that you have access to a high-performance compute cluster such that you can log in from a Bash prompt. On our cluster at Davis (called Cabernet), this is the welcome message:

 _______________________________________
/ HPC-14 currently consists of 66 nodes \
| with 2248 CPUs and 11.03TB RAM. Type  |
\ sinfo for more info.                  /
 ---------------------------------------
                     \..................................
  _____________________________________________          ......
 /\    ___       _                             \           ....
/  :  / (_)     | |                             \            .
|  : |      __, | |   _   ,_    _  _    _ _| _   \________ ___
|  : |     /  | |/ \_|/  /  |  / |/ |  |/  |              |   |
|  :  \___/\_/|_/\_/ |__/   |_/  |  |_/|__/|_/    ________|___|
|  :               .genomecenter.ucdavis.edu     /
\  ;                                            /
 \/____________________________________________/

We also assume that the Rosetta binaries are in your PATH. See [this post on configuring and storing software] for how to do this.

I want to run many -nstruct in parallel

First, write a script that runs your Rosetta protocol. In this case, we want to relax an input crystal structure into Rosetta’s energy function with the relax app. We want 1,000 independent trajectories terminating in 1,000 relaxed output structures.

Put the following in a file called sub.sh:

#!/bin/bash
#SBATCH --array=1-1000
relax.linuxgccrelease @flags -suffix $SLURM_ARRAY_TASK_ID

The argument -suffix $SLURM_ARRAY_TASK_ID has two parts. The Rosetta flag -suffix appends a suffix to the name of each output structure, giving each of the 1,000 output files a distinct name. The value appended is the environment variable SLURM_ARRAY_TASK_ID, which SLURM sets to the index of the current task in the job array. It will be one of the values in the range you pass to the --array flag: in our case, 1 to 1000.
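To see how the task ID ends up on the command line, the following dry run prints the command that each array task would execute, without needing SLURM or Rosetta. It is a minimal sketch: the helper name build_relax_cmd is hypothetical, and the loop merely stands in for SLURM setting SLURM_ARRAY_TASK_ID once per task.

```shell
#!/bin/bash
# Dry run: print the command each array task would execute.
# build_relax_cmd is a hypothetical helper for illustration only.
build_relax_cmd() {
    local task_id="$1"
    echo "relax.linuxgccrelease @flags -suffix ${task_id}"
}

# On the cluster, SLURM sets SLURM_ARRAY_TASK_ID for each task;
# here we loop over a few values by hand.
for SLURM_ARRAY_TASK_ID in 1 2 1000; do
    build_relax_cmd "${SLURM_ARRAY_TASK_ID}"
done
```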

To run relax on an apo protein structure, the flags file contains the following Rosetta flags:

# ./flags
-s input.pdb
-renumber_pdb 1

The -renumber_pdb flag renumbers all the residues in the structure, starting from 1. This is good practice when you are relaxing crystal structures, which are often numbered according to their biological context. Residues in Rosetta models, by convention, are numbered sequentially starting from 1.

Change to the same directory as your sub.sh and input files and submit your job with:

sbatch sub.sh
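Once the array has finished, it is worth checking that every task produced a structure. The sketch below lists the task IDs that have no output file. It assumes that each output is named by joining the input name, the suffix, and a structure counter (for example input7_0001.pdb for task 7). That pattern is an assumption here, so look at the files your own run wrote and adjust it. The function name missing_tasks is hypothetical.

```shell
#!/bin/bash
# List array task IDs whose output structure is absent.
# ASSUMPTION: outputs are named <prefix><task_id>_0001.pdb; adjust the
# pattern to match what your run actually wrote.
missing_tasks() {
    local dir="$1" prefix="$2" first="$3" last="$4"
    local id
    for (( id = first; id <= last; id++ )); do
        if [ ! -e "${dir}/${prefix}${id}_0001.pdb" ]; then
            echo "${id}"
        fi
    done
}

# Example: count how many of tasks 1 to 1000 have no output in this directory.
echo "missing: $(( $(missing_tasks . input 1 1000 | wc -l) ))"
```

Any IDs it reports can be resubmitted on their own, since --array also accepts a comma-separated list, for example sbatch --array=3,17 sub.sh.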

I want to automate the running of similar jobs with variable flags

When you want to run the same protocol on multiple input structures, you can take an embarrassingly parallel approach and run all of the jobs concurrently rather than consecutively. This slightly more complex setup adds a list of inputs in a file called list, and uses the SLURM_ARRAY_TASK_ID environment variable to select which line of the list a given task should use. When you submit a batch run, each task gets an integer ID from the range you pass to --array (here 1, 2, 3). We add a short Bash command to print a particular line of the list, which contains the Rosetta flags for a particular run. The list, in this case specifying three different input PDB structures, looks like this:

-s input1.pdb
-s input2.pdb
-s input3.pdb

For a parallel run, make a sub.sh like this:

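#!/bin/bash
#SBATCH --array=1-3
# Pick line number SLURM_ARRAY_TASK_ID out of the file called list
S=$( head -n "${SLURM_ARRAY_TASK_ID}" list | tail -n 1 )
module load rosetta
# $S is deliberately unquoted so it splits into separate arguments
relax.linuxgccrelease @flags $S

The code head -n "${SLURM_ARRAY_TASK_ID}" list | tail -n 1 returns the nth line in the file list, where n is equal to SLURM_ARRAY_TASK_ID: head keeps the first n lines, and tail keeps the last of those.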
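The head and tail pipeline has one quiet failure mode: if the task ID is larger than the number of lines in list, it returns the last line again instead of failing, so extra tasks silently repeat the final run. A minimal sketch of a stricter lookup follows. The helper name nth_line is hypothetical, and it assumes the list ends with a trailing newline so that wc -l counts every line.

```shell
#!/bin/bash
# Print line N of a file, or fail if the file has fewer than N lines.
# nth_line is a hypothetical helper, not part of SLURM or Rosetta.
nth_line() {
    local n="$1" file="$2"
    local total
    total=$(( $(wc -l < "${file}") ))   # arithmetic strips any padding from wc
    if [ "${n}" -lt 1 ] || [ "${n}" -gt "${total}" ]; then
        echo "task ${n} is out of range: ${file} has ${total} lines" >&2
        return 1
    fi
    sed -n "${n}p" "${file}"
}

# Demonstration with a throwaway copy of the three-line list.
demo=$(mktemp)
printf '%s\n' '-s input1.pdb' '-s input2.pdb' '-s input3.pdb' > "${demo}"
nth_line 2 "${demo}"
nth_line 4 "${demo}" || echo "no flags for task 4"
rm -f "${demo}"
```

In a submission script, the lookup could become S=$( nth_line "${SLURM_ARRAY_TASK_ID}" list ) || exit 1, so that an out-of-range task stops before Rosetta starts.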

Submit the run with

sbatch sub.sh

I want to scan over a parameter in RosettaScripts XML

This approach also works well when you need to vary parameters in your XML. For example, if we want to make a series of point mutations to the same protein, we can add special %%variable_name%% placeholders to our RosettaScripts XML protocol, which get replaced at runtime.

<ROSETTASCRIPTS>
<SCOREFXNS>
  <ScoreFunction name="my_score"/>
</SCOREFXNS>
<TASKOPERATIONS>
</TASKOPERATIONS>
<FILTERS>
</FILTERS>
<MOVERS>
  <MutateResidue name="mutate"
      target="%%target%%"
      new_res="%%new_res%%"/>
</MOVERS>
<APPLY_TO_POSE>
</APPLY_TO_POSE>
<PROTOCOLS>
  <Add mover="mutate" />
</PROTOCOLS>
</ROSETTASCRIPTS>

Note the special %%variable_name%% placeholders (here %%target%% and %%new_res%%) in the MutateResidue mover declaration. When we run Rosetta, we can use the script_vars option in the parser group to substitute in the values for each run.
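To preview what a protocol looks like after substitution, you can imitate the replacement with sed. This is only an illustration of the idea, not how Rosetta performs the substitution internally. The function name preview_script_vars is hypothetical, and the sketch handles only simple name=value pairs whose values contain no characters that are special to sed.

```shell
#!/bin/bash
# Imitate %%name%% substitution so the XML can be inspected by eye.
# Illustration only: Rosetta does its own substitution at runtime.
preview_script_vars() {
    local xml="$1"; shift
    # With no name=value pairs there is nothing to replace.
    [ "$#" -gt 0 ] || { cat "${xml}"; return; }
    local pair name value
    local -a sed_args=()
    for pair in "$@"; do
        name="${pair%%=*}"     # text before the first '='
        value="${pair#*=}"     # text after the first '='
        sed_args+=( -e "s/%%${name}%%/${value}/g" )
    done
    sed "${sed_args[@]}" "${xml}"
}

# Demonstration on a one-line stand-in for the mover declaration.
demo=$(mktemp)
echo '<MutateResidue name="mutate" target="%%target%%" new_res="%%new_res%%"/>' > "${demo}"
preview_script_vars "${demo}" target=325 new_res=GLU
rm -f "${demo}"
```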

Let’s make a text file, again called list, where each line contains the Rosetta flags for a single run: in this case, five point mutations.

-parser:script_vars target=325 new_res=GLU
-parser:script_vars target=220 new_res=GLU
-parser:script_vars target=298 new_res=GLU
-parser:script_vars target=294 new_res=LEU
-parser:script_vars target=407 new_res=TYR
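For longer scans, typing these lines by hand gets error-prone, so it can help to generate them. The following is a minimal sketch that assumes each mutation is given as a "position residue" pair. The function name make_mutation_list is hypothetical.

```shell
#!/bin/bash
# Emit one line of Rosetta flags per "position residue" pair.
# make_mutation_list is a hypothetical helper for illustration.
make_mutation_list() {
    local entry target new_res
    for entry in "$@"; do
        read -r target new_res <<< "${entry}"
        # "--" stops printf from parsing the leading "-" as an option
        printf -- '-parser:script_vars target=%s new_res=%s\n' "${target}" "${new_res}"
    done
}

# Prints the five lines shown above; redirect the output to a file
# called list to use it with the submission script.
make_mutation_list "325 GLU" "220 GLU" "298 GLU" "294 LEU" "407 TYR"
```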

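The sub.sh follows the same pattern, except for the binary we call and the array range, which has to match the five lines in the list. The flags file also needs to point Rosetta at the XML protocol with -parser:protocol. You can submit it the same way.

#!/bin/bash
#SBATCH --array=1-5
# Pick line number SLURM_ARRAY_TASK_ID out of the file called list
S=$( head -n "${SLURM_ARRAY_TASK_ID}" list | tail -n 1 )
module load rosetta
# $S is deliberately unquoted so it splits into separate arguments
rosetta_scripts.linuxgccrelease @flags $S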
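Because the array range has to agree with the number of lines in list, one option is to compute the range at submission time rather than hard-coding it. Options given on the sbatch command line take precedence over #SBATCH lines in the script. The sketch below only prints the submission command. The helper name array_range is hypothetical, and it assumes the list ends with a trailing newline.

```shell
#!/bin/bash
# Build an --array range that covers every line of a flags list.
# array_range is a hypothetical helper for illustration.
array_range() {
    local file="$1"
    local n
    n=$(( $(wc -l < "${file}") ))   # arithmetic strips any padding from wc
    echo "1-${n}"
}

# Demonstration on a throwaway three-line file.
demo=$(mktemp)
printf '%s\n' 'line one' 'line two' 'line three' > "${demo}"
# Print the submission command rather than running it.
echo "sbatch --array=$(array_range "${demo}") sub.sh"
rm -f "${demo}"
```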

You can find a lot more information on using SLURM at the SchedMD site.