This notebook contains material from PyRosetta; content is available on Github.

Example of Using PyRosetta with GNU Parallel

Note: In this tutorial, we will write a simple PyRosetta script to disk, and demonstrate how to run it in parallel using GNU parallel. This Jupyter notebook uses parallelization and is not meant to be executed within a Google Colab environment.

Please see setup instructions in Chapter 16.00

In [1]:
import logging
logging.basicConfig(level=logging.INFO)
import os
import sys

if 'google.colab' in sys.modules:
    print("This Jupyter notebook uses parallelization and is therefore not set up for the Google Colab environment.")
    sys.exit(0)
In [2]:
%%file outputs/mutate_residue_from_command_line.py
import argparse
import os
import pyrosetta
import uuid


__author__ = "My Name"
__email__ = "[email protected]"


def main(target=None, new_res=None):
    """Example function to run my custom PyRosetta script."""
    
    # Initialize PyRosetta with custom flags
    pyrosetta.init("-ignore_unrecognized_res 1 -renumber_pdb 1 -mute all")

    # Setup pose
    pose = pyrosetta.pose_from_file("inputs/5JG9.clean.pdb")

    # Setup directory structure
    main_dir = os.getcwd()
    unique_dir = os.path.join("outputs", "testing_" + uuid.uuid4().hex)
    if not os.path.isdir(unique_dir):
        os.mkdir(unique_dir)
    os.chdir(unique_dir)

    # Create scorefunction
    scorefxn = pyrosetta.create_score_function("ref2015_cart")

    # PyRosetta design protocol
    keep_chA = pyrosetta.rosetta.protocols.grafting.simple_movers.KeepRegionMover(
        res_start=str(pose.chain_begin(1)), res_end=str(pose.chain_end(1))
    )
    keep_chA.apply(pose)
    
    mutate = pyrosetta.rosetta.protocols.simple_moves.MutateResidue(
        target=target, new_res=new_res
    )
    mutate.apply(pose)

    mm = pyrosetta.rosetta.core.kinematics.MoveMap()
    mm.set_bb(True)
    mm.set_chi(True)
    min_mover = pyrosetta.rosetta.protocols.minimization_packing.MinMover()
    min_mover.set_movemap(mm)
    min_mover.score_function(scorefxn)
    min_mover.min_type("lbfgs_armijo_nonmonotone")
    min_mover.cartesian(True)
    min_mover.tolerance(0.01)
    min_mover.max_iter(200)
    min_mover.apply(pose)

    total_score = scorefxn(pose)
    
    # Setup outputs
    pdb_output_filename = "_".join(["5JG9.clean", str(target), str(new_res)]) + ".pdb"
    
    pose.dump_pdb(pdb_output_filename)
    
    # Append scores to scorefile
    pyrosetta.toolbox.py_jobdistributor.output_scorefile(
        pose=pose, pdb_name="5JG9.clean", current_name=pdb_output_filename,
        scorefilepath="score.fasc", scorefxn=scorefxn,
        nstruct=1, native_pose=None, additional_decoy_info=None, json_format=True
    )
    os.chdir(main_dir)

    
if __name__ == "__main__":
    
    # Declare parser object for managing input options
    parser = argparse.ArgumentParser()
    parser.add_argument("-t", "--target", type=int,
                    help="Target residue to mutate as integer.")
    parser.add_argument("-r", "--new_res", type=str,
                        help="Three letter amino acid code to which to mutate target.")
    args = parser.parse_args()
    
    # Run protocol
    main(target=args.target, new_res=args.new_res)
Writing outputs/mutate_residue_from_command_line.py
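
Note: the script locates inputs/5JG9.clean.pdb and the outputs/ directory relative to the current working directory, so run all of the following commands from the directory containing this notebook.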

Now we will run this script in parallel in several different ways to demonstrate different job submission styles.

1. Parallelize script in an interactive session:

On your laptop, you can use as many cores as your machine has. On a high-performance computing cluster with SLURM scheduling, you first have to request the cores you want to use in an interactive login session:

qlogin -c 8 --mem=16g

will reserve 8 CPU cores and 16 GB of RAM for you and start your session on a node that has available resources.
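
Note: qlogin is not available on every SLURM cluster. An equivalent request using standard SLURM tools (assuming your cluster permits interactive srun sessions) is:

srun -c 8 --mem=16g --pty bash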

Then, we need to write a run file to disk specifying our input parameters:

In [5]:
with open("outputs/run_file_parallel_interactive.txt", "w") as f:
    for target in [2, 4, 6, 8, 10]:
        for new_res in ["ALA", "TRP"]:
            f.write("{0} outputs/mutate_residue_from_command_line.py -t {1} -r {2}\n".format(sys.executable, target, new_res))

Note: it's always a good idea to run just one command first to make sure there aren't any errors in your script!

Note: sys.executable points to the Python interpreter running this notebook; if that is not the environment with PyRosetta installed, activate the correct conda environment in your interactive session first.
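
For example, to test a single run from the command line (assuming the active conda environment provides the correct python):

python outputs/mutate_residue_from_command_line.py -t 2 -r ALA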

Now submit outputs/run_file_parallel_interactive.txt to GNU parallel from the command line in your interactive session:

cat outputs/run_file_parallel_interactive.txt | parallel -j 8 --no-notice &
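
Tip: adding GNU parallel's --joblog option (e.g. parallel -j 8 --no-notice --joblog outputs/joblog.txt) records each command's exit status and runtime, which makes failed jobs easy to spot. The trailing & runs the whole pipeline in the background of your session.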

Note: The parallel executable is usually located at /usr/bin/parallel, but the full path may differ on your computer. For installation info, visit: https://www.gnu.org/software/parallel/

2. Parallelize script on a high-performance computing cluster with Slurm scheduling (non-interactive submission):

We use GNU parallel again, but this time there is no need to pre-request server allocations. We can submit jobs to the Slurm scheduler directly from within this Jupyter Notebook or from the command line!

Useful background information:

  • "Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. ... As a cluster workload manager, Slurm has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work." Read further: https://slurm.schedmd.com/overview.html

  • With the Slurm scheduler we will use the sbatch command, therefore it may be useful to review the available options: https://slurm.schedmd.com/sbatch.html

First, write a SLURM submission script to disk specifying the job requirements. In this example, our conda environment is called pyrosetta-bootcamp:

In [7]:
with open("outputs/sbatch_parallel.sh", "w") as f:
    f.write("#!/bin/bash \n") # Bash script
    f.write("#SBATCH -p short \n") # Specify "short" partition/queue
    f.write("#SBATCH -n 8 \n") # Specify eight cores
    f.write("#SBATCH -N 1 \n") # Specify one node
    f.write("#SBATCH --mem=16g \n") # Specify 16GB RAM over eight cores
    f.write("#SBATCH -o sbatch_parallel.log \n") # Specify output log filename
    f.write("conda activate pyrosetta-bootcamp \n") # Activate conda environment
    f.write("cat outputs/run_file_parallel_interactive.txt | /usr/bin/parallel -j 8 --no-notice \n") # Submit jobs to GNU parallel

Then, submit outputs/sbatch_parallel.sh to the SLURM scheduler with the sbatch command:

In [17]:
if not os.getenv("DEBUG"):
    !sbatch outputs/sbatch_parallel.sh
Submitted batch job 12942806

We can then periodically check on the status of our jobs:

In [22]:
!squeue -u $USER
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          12942806     short sbatch.s   klimaj PD       0:00      1 (Priority)
          12931193 interacti      zsh   klimaj  R    2:17:02      1 dig1
          12931183 interacti jupyter-   klimaj  R    2:17:37      1 dig5
          12935447 interacti      zsh   klimaj  R      58:46      1 dig1
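
In the ST column, PD means the job is pending in the queue and R means it is running; NODELIST(REASON) shows either the node a job is running on or the reason it is still waiting.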

3. Submit jobs individually to the SLURM scheduler:

This time, we submit each job individually using the sbatch command directly on our PyRosetta script.

Warning: do not submit more than ~1000 jobs with this method or risk clogging the SLURM scheduler!

First, copy your Python executable path and paste it on the first line of the PyRosetta script after #!, followed by #SBATCH directives specifying the requirements for each job:

In [8]:
sys.executable
Out[8]:
'/home/klimaj/anaconda3/envs/pyrosetta-dev/bin/python'
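
Note: use the exact path printed by sys.executable on your machine; the shebang in the following cell points to the pyrosetta-bootcamp example environment named earlier.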
In [9]:
%%file outputs/mutate_residue_from_sbatch.py
#!/home/klimaj/anaconda3/envs/pyrosetta-bootcamp/bin/python
#SBATCH -p short
#SBATCH -n 1
#SBATCH --mem=2g
#SBATCH -o sbatch.log

import argparse
import os
import pyrosetta
import uuid


__author__ = "My Name"
__email__ = "[email protected]"


def main(target=None, new_res=None):
    """Example function to run my custom PyRosetta script.
    """
    
    # Initialize PyRosetta with custom flags
    pyrosetta.init("-ignore_unrecognized_res 1 -renumber_pdb 1 -mute all")

    # Setup pose
    pose = pyrosetta.pose_from_file("inputs/5JG9.clean.pdb")
    
    # Setup directory structure
    main_dir = os.getcwd()
    unique_dir = os.path.join("outputs", "testing_" + uuid.uuid4().hex)
    if not os.path.isdir(unique_dir):
        os.mkdir(unique_dir)
    os.chdir(unique_dir)

    # Create scorefunction
    scorefxn = pyrosetta.create_score_function("ref2015_cart")

    # PyRosetta design protocol
    keep_chA = pyrosetta.rosetta.protocols.grafting.simple_movers.KeepRegionMover(
        res_start=str(pose.chain_begin(1)), res_end=str(pose.chain_end(1))
    )
    keep_chA.apply(pose)
    
    mutate = pyrosetta.rosetta.protocols.simple_moves.MutateResidue(
        target=target, new_res=new_res
    )
    mutate.apply(pose)

    mm = pyrosetta.rosetta.core.kinematics.MoveMap()
    mm.set_bb(True)
    mm.set_chi(True)
    min_mover = pyrosetta.rosetta.protocols.minimization_packing.MinMover()
    min_mover.set_movemap(mm)
    min_mover.score_function(scorefxn)
    min_mover.min_type("lbfgs_armijo_nonmonotone")
    min_mover.cartesian(True)
    min_mover.tolerance(0.01)
    min_mover.max_iter(200)
    min_mover.apply(pose)

    total_score = scorefxn(pose)
    
    # Setup outputs
    pdb_output_filename = "_".join(["5JG9.clean", str(target), str(new_res)]) + ".pdb"
    
    pose.dump_pdb(pdb_output_filename)
    
    # Append scores to scorefile
    pyrosetta.toolbox.py_jobdistributor.output_scorefile(
        pose=pose, pdb_name="5JG9.clean", current_name=pdb_output_filename,
        scorefilepath="score.fasc", scorefxn=scorefxn,
        nstruct=1, native_pose=None, additional_decoy_info=None, json_format=True
    )
    os.chdir(main_dir)

    
if __name__ == "__main__":
    
    # Declare parser object for managing input options
    parser = argparse.ArgumentParser()
    parser.add_argument("-t", "--target", type=int,
                    help="Target residue to mutate as integer.")
    parser.add_argument("-r", "--new_res", type=str,
                        help="Three letter amino acid code to which to mutate target.")
    args = parser.parse_args()
    
    # Run protocol
    main(target=args.target, new_res=args.new_res)
Writing outputs/mutate_residue_from_sbatch.py

Then, loop over your input parameters, submitting the PyRosetta script to the scheduler with the sbatch command:

In [25]:
if not os.getenv("DEBUG"):
    for target in [2, 4, 6, 8, 10]:
        for new_res in ["ALA", "TRP"]:
            !sbatch ./outputs/mutate_residue_from_sbatch.py -t $target -r $new_res
Submitted batch job 12946283
Submitted batch job 12946284
Submitted batch job 12946285
Submitted batch job 12946286
Submitted batch job 12946287
Submitted batch job 12946288
Submitted batch job 12946289
Submitted batch job 12946290
Submitted batch job 12946291
Submitted batch job 12946292

We can then periodically check on the status of our jobs:

In [26]:
!squeue -u $USER
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          12946286     short mutate_r   klimaj PD       0:00      1 (Resources)
          12946287     short mutate_r   klimaj PD       0:00      1 (Priority)
          12946288     short mutate_r   klimaj PD       0:00      1 (Priority)
          12946289     short mutate_r   klimaj PD       0:00      1 (Priority)
          12946290     short mutate_r   klimaj PD       0:00      1 (Priority)
          12946291     short mutate_r   klimaj PD       0:00      1 (Priority)
          12946292     short mutate_r   klimaj PD       0:00      1 (Priority)
          12931193 interacti      zsh   klimaj  R    2:19:28      1 dig1
          12931183 interacti jupyter-   klimaj  R    2:20:03      1 dig5
          12935447 interacti      zsh   klimaj  R    1:01:12      1 dig1
          12946283     short mutate_r   klimaj  R       0:00      1 dig73
          12946284     short mutate_r   klimaj  R       0:00      1 dig87
          12946285     short mutate_r   klimaj  R       0:00      1 dig100
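
Because each run writes its results into its own outputs/testing_* directory, you may want to aggregate the score files once all jobs have finished. Below is a minimal sketch, assuming each line of score.fasc is a JSON record (as written by output_scorefile with json_format=True) and that pandas is installed:

import glob
import json

import pandas as pd

# Collect every JSON-formatted score line from the per-job directories
records = []
for scorefile in glob.glob("outputs/testing_*/score.fasc"):
    with open(scorefile) as f:
        records.extend(json.loads(line) for line in f if line.strip())

# One row per decoy; the columns are whatever score terms were written
scores_df = pd.DataFrame(records)
print(scores_df.head())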