Feb 21, 2016
Rosetta’s sampling methodology lets you search protein sequence functional space, which is extremely rugged. To use Rosetta effectively, you need high-performance computing (cluster computing) to carry out many sampling trajectories in parallel.
This tutorial assumes that you have access to a high-performance compute cluster such that you can log in from a Bash prompt. On our cluster at Davis (called Cabernet), this is the welcome message:
_______________________________________
/ HPC-14 currently consists of 66 nodes \
| with 2248 CPUs and 11.03TB RAM. Type |
\ sinfo for more info. /
---------------------------------------
\..................................
_____________________________________________ ......
/\ ___ _ \ ....
/ : / (_) | | \ .
| : | __, | | _ ,_ _ _ _ _| _ \________ ___
| : | / | |/ \_|/ / | / |/ | |/ | | |
| : \___/\_/|_/\_/ |__/ |_/ | |_/|__/|_/ ________|___|
| : .genomecenter.ucdavis.edu /
\ ; /
\/____________________________________________/
We also assume that the Rosetta binaries are in your PATH. See [this post on configuring and storing software] for how to do this.
-nstruct in parallel
First, write a script that runs your Rosetta protocol. In this case, we want to relax an input crystal structure into Rosetta’s energy function with the relax app. We want 1,000 independent trajectories terminating in 1,000 relaxed output structures.
Put the following in a file called sub.sh:
#!/bin/bash
#SBATCH --array=1-1000
relax.linuxgccrelease @flags -suffix $SLURM_ARRAY_TASK_ID
The -suffix $SLURM_ARRAY_TASK_ID argument contains two parts. The Rosetta flag -suffix appends a suffix to the name of each of the 1,000 output structures, so that each trajectory gets a distinct output file name. The suffix that is appended is the value of the environment variable SLURM_ARRAY_TASK_ID, which SLURM sets to the index of the current task in the task array. It will be one of the values in the range you pass to the --array flag: in our case, 1 to 1000.
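When you test sub.sh outside of SLURM, SLURM_ARRAY_TASK_ID is not set, so the suffix would be empty. Below is a minimal sketch of one way to handle that. It assumes a hypothetical helper named task_suffix (not part of Rosetta or SLURM) that falls back to 1 when the variable is missing, and it only prints the command rather than running it.

```shell
#!/bin/bash
# task_suffix is a hypothetical helper for illustration only.
# It prints the SLURM array index, or 1 when run outside of SLURM.
task_suffix() {
  echo "${SLURM_ARRAY_TASK_ID:-1}"
}

# Dry run: show the command this task would execute.
echo "relax.linuxgccrelease @flags -suffix $(task_suffix)"
```

Run on a login node, this prints the command with suffix 1. Inside an array job, the suffix is the task's own index.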
To run relax on an apo protein structure, the flags file contains the following Rosetta flags:
# ./flags
-s input.pdb
-renumber_pdb 1
The renumber_pdb flag will renumber all the residues in the structure starting with 1. This is good practice if you are relaxing a crystal structure, because crystal structures are often numbered according to the biological context. Residues in Rosetta models, by convention, are numbered sequentially starting from 1.
Change to the same directory as your sub.sh and input files and submit your job with:
sbatch sub.sh
When you want to run the same protocol on multiple input structures, you can take an embarrassingly parallel approach and run all of the jobs concurrently rather than consecutively. This more complex situation adds a list of inputs in a file called list, and uses the SLURM_ARRAY_TASK_ID environment variable to select which line of the list should be used as input. When you submit a batch run, each task gets an integer ID from the range you pass to --array (1, 2, … n). We add a short Bash command to print a particular line of the list, which contains the Rosetta flags for a particular run. The list, in this case specifying three different input PDB structures, looks like
-s input1.pdb
-s input2.pdb
-s input3.pdb
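With more than a handful of inputs, you may prefer to generate the list rather than type it. Here is a minimal sketch, assuming your input structures are the .pdb files in a single directory. The function name make_list is hypothetical.

```shell
#!/bin/bash
# make_list is a hypothetical helper for illustration only.
# It prints one "-s <file>" line for every .pdb file in the given directory.
make_list() {
  local dir="$1" f
  for f in "$dir"/*.pdb; do
    [ -e "$f" ] || continue        # skip the unexpanded glob when no .pdb files exist
    printf -- '-s %s\n' "$f"
  done
}

# Example usage (writes the file read by sub.sh):
#   make_list . > list
```

The number of lines in list is then the upper bound of the array range. One option is to pass it on the command line, for example `sbatch --array=1-$(wc -l < list) sub.sh`, since options given to sbatch on the command line should take precedence over the #SBATCH lines in the script.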
For a parallel run, make a sub.sh like this:
#!/bin/bash
#SBATCH --array=1-3
S=$( head -${SLURM_ARRAY_TASK_ID} list | tail -1 )
module load rosetta
relax.linuxgccrelease @flags $S
The code head -${SLURM_ARRAY_TASK_ID} list | tail -1 returns the nth line in the file list, where n is equal to SLURM_ARRAY_TASK_ID.
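To see the line selection in isolation, here is a small self-contained sketch that needs neither SLURM nor Rosetta. The helper names nth_line and nth_line_sed are hypothetical, and the sed variant is an alternative, not what the script above uses.

```shell
#!/bin/bash
# Hypothetical helpers for illustration only.

# Same technique as in sub.sh: keep the first n lines, then take the last of those.
nth_line() {
  head -n "$1" "$2" | tail -n 1
}

# Alternative: ask sed to print only line n.
nth_line_sed() {
  sed -n "${1}p" "$2"
}

# Demo on a throwaway copy of the three-line list.
demo_list="$(mktemp)"
printf '%s\n' '-s input1.pdb' '-s input2.pdb' '-s input3.pdb' > "$demo_list"
nth_line 2 "$demo_list"
nth_line_sed 2 "$demo_list"
rm -f "$demo_list"
```

The two differ when n is larger than the number of lines in the file: head piped to tail returns the last line, while sed prints nothing. If the array range is longer than the list, the head and tail version therefore silently reruns the last entry.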
Submit the run with
sbatch sub.sh
This option also works well when you need to permute variables in your XML. For example, if we want to make a series of point mutations to the same protein, we can add a special %%variable_name%% variable to our RosettaScripts XML protocol that will get replaced at runtime.
<ROSETTASCRIPTS>
  <SCOREFXNS>
    <ScoreFunction name="my_score"/>
  </SCOREFXNS>
  <TASKOPERATIONS>
  </TASKOPERATIONS>
  <FILTERS>
  </FILTERS>
  <MOVERS>
    <MutateResidue name="mutate"
                   target="%%target%%"
                   new_res="%%new_res%%"/>
  </MOVERS>
  <APPLY_TO_POSE>
  </APPLY_TO_POSE>
  <PROTOCOLS>
    <Add mover="mutate" />
  </PROTOCOLS>
</ROSETTASCRIPTS>
Note the special %%variable_name%% variables in the MutateResidue mover declaration. When we run Rosetta, we can use the script_vars option in the parser group to substitute in the values for each run.
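To preview what the XML looks like after substitution, you can imitate the replacement with sed. This is only an illustration of the textual effect, not how Rosetta implements script_vars. The function name preview_vars is hypothetical.

```shell
#!/bin/bash
# preview_vars is a hypothetical helper for illustration only.
# It reads text on stdin and replaces each %%key%% with the value from key=value arguments.
preview_vars() {
  local args=() pair
  for pair in "$@"; do
    # ${pair%%=*} is the key (text before the first "="), ${pair#*=} is the value.
    args+=(-e "s/%%${pair%%=*}%%/${pair#*=}/g")
  done
  sed "${args[@]}"
}

# Demo on the two attributes from the MutateResidue mover.
echo 'target="%%target%%" new_res="%%new_res%%"' | preview_vars target=325 new_res=GLU
```

The sketch assumes simple values such as residue numbers and residue names, with no slashes or other characters that are special to sed.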
Let’s make a text file, again called list, where each line contains the Rosetta flags for a single run: in this case, five point mutations.
-parser:script_vars target=325 new_res=GLU
-parser:script_vars target=220 new_res=GLU
-parser:script_vars target=298 new_res=GLU
-parser:script_vars target=294 new_res=LEU
-parser:script_vars target=407 new_res=TYR
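For a longer series of mutations, the lines can be generated from a compact table of positions and residue names. This is a minimal sketch using the five mutations above. The function name mutation_flags is hypothetical.

```shell
#!/bin/bash
# mutation_flags is a hypothetical helper for illustration only.
# $1 = residue position, $2 = three-letter residue name.
mutation_flags() {
  printf -- '-parser:script_vars target=%s new_res=%s\n' "$1" "$2"
}

# Print one line of Rosetta flags per mutation; redirect the output to "list" to use it.
while read -r position residue; do
  mutation_flags "$position" "$residue"
done <<'EOF'
325 GLU
220 GLU
298 GLU
294 LEU
407 TYR
EOF
```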
The sub.sh is the same except for the array range and the binary we’re calling, and you can submit it the same way.
#!/bin/bash
#SBATCH --array=1-5
S=$( head -${SLURM_ARRAY_TASK_ID} list | tail -1 )
module load rosetta
rosetta_scripts.linuxgccrelease @flags $S
You can find a lot more information on using SLURM at the SchedMD site.