This notebook contains material from PyRosetta; content is available on Github.

< Distributed analysis example: exhaustive ddG PSSM | Contents | Index | Example of Using PyRosetta with GNU Parallel >

Distributed computation example: miniprotein design¶

Notes¶

This tutorial will walk you through how to design miniproteins in PyRosetta using the PyData stack for analysis and distributed computing.

This Jupyter notebook uses parallelization and is not meant to be executed within a Google Colab environment.

Setup¶

Please see setup instructions in Chapter 16.00

Citation¶

Integration of the Rosetta Suite with the Python Software Stack via reproducible packaging and core programming interfaces for distributed simulation

Alexander S. Ford, Brian D. Weitzner, Christopher D. Bahl

Manual¶

Documentation for the pyrosetta.distributed namespace can be found here: https://nbviewer.jupyter.org/github/proteininnovation/Rosetta-PyData_Integration/blob/master/distributed_overview.ipynb

In [1]:

!pip install pyrosettacolabsetup
import pyrosettacolabsetup; pyrosettacolabsetup.install_pyrosetta()
import pyrosetta; pyrosetta.init()

In [2]:

import pyrosetta
import dask.bag
import dask.distributed

In [3]:

import os
import pandas
import seaborn
from matplotlib import pylab

In [4]:

from pyrosetta.distributed.tasks.score import ScorePoseTask
from pyrosetta.distributed.io import pose_from_pdbstring
from pyrosetta.distributed.packed_pose import to_dict

In [5]:

import zipfile

Init¶

Setup LocalCluster to full utilize the current machine.

In [6]:

if not os.getenv("DEBUG"):
    cluster = dask.distributed.LocalCluster(n_workers=1, threads_per_worker=1)
    client = dask.distributed.Client(cluster)
else:
    client = None
client

Out[6]:

Client

Scheduler: tcp://127.0.0.1:60557

Cluster

Workers: 8
Cores: 8
Memory: 17.18 GB

Load¶

Load decoys from batch compute run, adding annotations for designed sequence, cystine count and cystine location.

In [7]:

def load_source(decoy):
    src_pdb = zipfile.ZipFile(library).open(decoy).read()
    p = pose_from_pdbstring(src_pdb)
    
    cys_locations=[i for i, c in enumerate(p.pose.sequence()) if c == "C"]
    p = p.update_scores(
        library=library,
        decoy=decoy,
        sequence=p.pose.sequence(),
        num_res = len(p.pose.sequence()),
        num_cys=len(cys_locations),
        cys_locations=",".join(map(str, cys_locations))
    )

    return p

In [9]:

if not os.getenv("DEBUG"):
    library = "inputs/EHEE_library.zip"
    decoy_names = [f.filename for f in zipfile.ZipFile(library).filelist if f.filename.endswith(".pdb")]

In [10]:

if not os.getenv("DEBUG"):
    decoys = dask.bag.from_sequence(decoy_names).map(load_source).map(ScorePoseTask()).persist()

In [12]:

if not os.getenv("DEBUG"):
    result_frame = pandas.DataFrame.from_records(decoys.map(to_dict).compute()).sort_values("total_score")

Analysis¶

We anticipate a distribution of score results, with higher scores with more disulfide insertions.

In [11]:

if not os.getenv("DEBUG"):
    seaborn.boxplot(x="num_cys", y="total_score", data=result_frame)

Out[11]:

<matplotlib.axes._subplots.AxesSubplot at 0x7fc3404f9860>

In [12]:

if not os.getenv("DEBUG"):
    seaborn.boxplot(x="cys_locations", y="total_score", data=result_frame)

Out[12]:

<matplotlib.axes._subplots.AxesSubplot at 0x7fc3404a5240>

Select¶

Select the best model by total_Score for each inserted disulfide location, allowing us to test a variety of disulfide architectures.

In [13]:

if not os.getenv("DEBUG"):
    best_by_location = to_dict(result_frame.groupby("cys_locations").head(1).reset_index(drop=True))
    print(len(best_by_location))

In [14]:

if not os.getenv("DEBUG"):
    with open("EHEE.best_by_location.fasta", "w") as out:
        for entry in best_by_location:
            print(f">{entry['decoy']}", file=out)
            print(entry['sequence'], file=out)

In [13]:

if not os.getenv("DEBUG"):
    !head expected_outputs/EHEE.best_by_location.fasta

>30652025_EHEE_254_0001.pdb
GDYQLHTCNTSEEELKKLTETLRRRLQTECKLERHGDCYTITCHV
>30887334_EHEE_738_0001.pdb
GCKTLTFCGYDDEQAKKIQKDISKTVQRPVEVHKHGSCWEFHVCV
>30856927_EHEE_395_0001.pdb
GCTTWEFHNVDPNEVKKALRELSEKTGAECHLEQHGNTFHITCCV
>30884953_EHEE_269_0001_0004.pdb
GQKCVTFCGQDPREVKKIAEEIARRLQVPYEIRRHGSCITVCFKV
>30696637_EHEE_467_0001_0002.pdb
GTKTFTYDGVDPTEVDKSRERLEKELKTRCELECRGNQCHIHCHV

Visualize¶

Visualize using Py3Dmol or PyMolMover as you have learned before.

In [1]:

'''
import py3Dmol
view = py3Dmol.view(viewergrid=(3, 3), linked=False, width=900, height=900)
for i in range(9):
    view.addModel( pyrosetta.distributed.io.to_pdbstring(best_by_location[i]), "pdb", viewer=(i/3, i%3),)
    
view.setStyle({'cartoon':{'color':'spectrum'}})
view.setStyle({"resn": "CYS"}, {'stick': {}, 'cartoon':{'color':'spectrum'}} )
view.zoomTo()
'''

Out[1]:

'\nimport py3Dmol\nview = py3Dmol.view(viewergrid=(3, 3), linked=False, width=900, height=900)\nfor i in range(9):\n    view.addModel( pyrosetta.distributed.io.to_pdbstring(best_by_location[i]), "pdb", viewer=(i/3, i%3),)\n    \nview.setStyle({\'cartoon\':{\'color\':\'spectrum\'}})\nview.setStyle({"resn": "CYS"}, {\'stick\': {}, \'cartoon\':{\'color\':\'spectrum\'}} )\nview.zoomTo()\n'

In [ ]:

< Distributed analysis example: exhaustive ddG PSSM | Contents | Index | Example of Using PyRosetta with GNU Parallel >