Make a Prediction¶

In this notebook you can enter a protein sequence and predict its fold class.

Rule 9: Design Your Notebooks to Be Read, Run, and Explored. We use ipywidgets to provide a text box in which users can enter a protein sequence of their choice and run a prediction. A default sequence is provided so that the result is reproducible.

In [1]:

import numpy as np
import joblib
import protvectors
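from ipywidgets import widgets
from IPython.display import display  # explicit import; also works outside an interactive IPython session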

Enter a Protein Sequence in Text Box¶

We have populated the text box with a default sequence from PDB chain 5YU2.A (expected result: alpha+beta).

In [2]:

text_box = widgets.Textarea(description='Sequence:', value='GAASMKIINTTRLPEALGPYSHATVVNGMVYTSGQIPLNVDGKIVSADVQAQTKQVLENLKVVLEEAGSDLNSVAKATIFIKDMNDFQKINEVYGQYFNEHKPARSCVEVARLPKDVKVEIELVSKIKEL')

In [3]:

display(text_box)

Textarea(value='GAASMKIINTTRLPEALGPYSHATVVNGMVYTSGQIPLNVDGKIVSADVQAQTKQVLENLKVVLEEAGSDLNSVAKATIFIKDMNDFQKINEVY…

In [4]:

sequence = text_box.value
print("Make prediction for:", sequence)

Make prediction for: GAASMKIINTTRLPEALGPYSHATVVNGMVYTSGQIPLNVDGKIVSADVQAQTKQVLENLKVVLEEAGSDLNSVAKATIFIKDMNDFQKINEVYGQYFNEHKPARSCVEVARLPKDVKVEIELVSKIKEL
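Text pasted into a text box often carries stray whitespace, line breaks, or lower-case letters. The following is a minimal sketch of how the input could be normalized before it is split into 3-grams. The helper `clean_sequence` and the constant `STANDARD_AMINO_ACIDS` are hypothetical names introduced for illustration; they are not part of this notebook or of the protvectors module, and restricting the input to the 20 standard one-letter codes is an assumption.

```python
# Hypothetical helper, not part of the original notebook or the protvectors module.
# Assumption: only the 20 standard one-letter amino acid codes are accepted.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def clean_sequence(raw):
    """Remove whitespace, upper-case the input, and reject non-standard residues."""
    # str.split() without arguments splits on any whitespace, including newlines.
    seq = "".join(raw.split()).upper()
    invalid = sorted(set(seq) - STANDARD_AMINO_ACIDS)
    if invalid:
        raise ValueError("Unsupported residue codes: " + ", ".join(invalid))
    if len(seq) < 3:
        raise ValueError("Need at least 3 residues to form a 3-gram")
    return seq


print(clean_sequence(" gaas\nMKI "))
# GAASMKI
```

In the notebook this could be applied as `sequence = clean_sequence(text_box.value)`, so that a malformed entry fails early with a readable message.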

Load Classifier Model¶

In [5]:

classifier = joblib.load("./intermediate_data/classifier")

In [6]:

classifier

Out[6]:

SVC(class_weight='balanced', gamma='auto', probability=True, random_state=13)


Calculate 3-grams¶

In [7]:

ngrams = protvectors.ngrammer(sequence, n=3)
print(ngrams)

['GAA', 'AAS', 'ASM', 'SMK', 'MKI', 'KII', 'IIN', 'INT', 'NTT', 'TTR', 'TRL', 'RLP', 'LPE', 'PEA', 'EAL', 'ALG', 'LGP', 'GPY', 'PYS', 'YSH', 'SHA', 'HAT', 'ATV', 'TVV', 'VVN', 'VNG', 'NGM', 'GMV', 'MVY', 'VYT', 'YTS', 'TSG', 'SGQ', 'GQI', 'QIP', 'IPL', 'PLN', 'LNV', 'NVD', 'VDG', 'DGK', 'GKI', 'KIV', 'IVS', 'VSA', 'SAD', 'ADV', 'DVQ', 'VQA', 'QAQ', 'AQT', 'QTK', 'TKQ', 'KQV', 'QVL', 'VLE', 'LEN', 'ENL', 'NLK', 'LKV', 'KVV', 'VVL', 'VLE', 'LEE', 'EEA', 'EAG', 'AGS', 'GSD', 'SDL', 'DLN', 'LNS', 'NSV', 'SVA', 'VAK', 'AKA', 'KAT', 'ATI', 'TIF', 'IFI', 'FIK', 'IKD', 'KDM', 'DMN', 'MND', 'NDF', 'DFQ', 'FQK', 'QKI', 'KIN', 'INE', 'NEV', 'EVY', 'VYG', 'YGQ', 'GQY', 'QYF', 'YFN', 'FNE', 'NEH', 'EHK', 'HKP', 'KPA', 'PAR', 'ARS', 'RSC', 'SCV', 'CVE', 'VEV', 'EVA', 'VAR', 'ARL', 'RLP', 'LPK', 'PKD', 'KDV', 'DVK', 'VKV', 'KVE', 'VEI', 'EIE', 'IEL', 'ELV', 'LVS', 'VSK', 'SKI', 'KIK', 'IKE', 'KEL']
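The output above is consistent with a sliding window of width 3 that advances one residue at a time, so consecutive 3-grams overlap by two residues. As a minimal sketch of that idea (the function name `overlapping_ngrams` is hypothetical, and the actual implementation of `protvectors.ngrammer` may differ):

```python
def overlapping_ngrams(sequence, n=3):
    """Return every length-n substring of sequence, advancing one residue at a time."""
    # A sequence of length L yields L - n + 1 overlapping n-grams.
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


print(overlapping_ngrams("GAASMKI"))
# ['GAA', 'AAS', 'ASM', 'SMK', 'MKI']
```

The first five 3-grams of the default sequence are reproduced by this sketch. A sequence shorter than `n` yields an empty list.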

Read ProtVec Model¶

In [8]:

protvec = protvectors.read_protvectors("./data/protVec_100d_3grams.csv")

Calculate Feature Vector using ProtVec Model¶

In [9]:

featureVector = protvectors.apply_protvectors(ngrams, protvec)
print(featureVector)

[-2.3101338  -0.32331702  0.78323473 -2.17371844 -0.05491938 -0.04513578
  1.22565468 -0.4836296   0.29705189  2.28707543  0.22305553 -0.41752745
 -0.39105919  0.70440273  0.13918634  0.19040849  0.84646709 -1.56859546
  0.02784964 -1.49127177 -0.01911258 -1.30944049 -1.42153082  0.01448804
  0.36608489  0.39708845 -0.16598204 -0.15441528 -0.33733611 -1.21403695
 -0.5650013  -1.30023446  0.56057566 -0.1993203   0.07892812  1.4882538
  0.08757783 -0.22068366  0.53207356 -0.09555411  0.20772675  0.67549063
  0.52102914 -0.12743451 -0.47274557  0.02531047 -0.91127284 -0.41035579
  0.00657577  1.81890208  0.12983772 -0.76028579  1.77282759 -1.40223342
  1.04664272 -1.91512564  0.0619787   0.3228361  -0.30558078 -2.91303999
 -0.25224169  1.90002018  0.20970349  0.03197095  1.54426113  0.86039855
  0.22558089  2.08030942  0.80783672 -1.10335774  0.15124303  0.28051646
  0.7273708   1.120928    0.87064261 -0.78159259  0.86080856 -0.02225375
  0.7175734  -0.98596892 -1.03833133 -0.07469874  0.04270149 -0.09935889
  1.53536898 -0.9162694  -1.40590277 -0.72739099 -1.47466438  0.48371854
  0.35492522 -1.17893066  0.41448665  0.640024    1.64565464  0.09323905
 -0.32028101 -1.52077313  0.63261307  2.31153588]
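The printed vector has 100 entries, matching the 100-dimensional embeddings in protVec_100d_3grams.csv, so the per-3-gram vectors are combined into a single fixed-length vector regardless of sequence length. How `protvectors.apply_protvectors` combines them is not shown in this notebook. A common choice for ProtVec-style embeddings is to sum the vectors of all 3-grams, and the sketch below illustrates only that assumption. The toy 4-dimensional table, the function name `sum_embeddings`, and the decision to skip unknown 3-grams are all hypothetical.

```python
import numpy as np

# Toy 4-dimensional embeddings for illustration; the real table has 100 dimensions.
toy_vectors = {
    "GAA": np.array([0.1, -0.2, 0.0, 0.5]),
    "AAS": np.array([0.3, 0.1, -0.4, 0.0]),
    "ASM": np.array([-0.2, 0.2, 0.1, 0.1]),
}


def sum_embeddings(ngrams, vectors):
    """Sum the embedding of every n-gram found in the table (assumed aggregation)."""
    dim = len(next(iter(vectors.values())))
    total = np.zeros(dim)
    for gram in ngrams:
        # Assumption: n-grams without an entry in the table are skipped.
        if gram in vectors:
            total += vectors[gram]
    return total


feature = sum_embeddings(["GAA", "AAS", "ASM", "XXX"], toy_vectors)
print(feature.shape)
# (4,)
```

Because the result has the same length for any input, it can be passed directly to a classifier that expects a fixed number of features.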

Predict Fold Class¶

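We use the classification model to predict the fold class with predict and to estimate the probability of each class with predict_proba. The probabilities are listed in the same order as classifier.classes_. For the default sequence, the predicted class is also the class with the highest probability; for an SVC the two can occasionally disagree, because the probabilities come from a separate calibration step.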

In [10]:

predictions = classifier.predict([featureVector])
probabilities = classifier.predict_proba([featureVector])

print("Sequence:")
print(sequence)
print("\nProbabilities:")
print(classifier.classes_)
print(probabilities[0])
print("\nPrediction:", predictions[0])

Sequence:
GAASMKIINTTRLPEALGPYSHATVVNGMVYTSGQIPLNVDGKIVSADVQAQTKQVLENLKVVLEEAGSDLNSVAKATIFIKDMNDFQKINEVYGQYFNEHKPARSCVEVARLPKDVKVEIELVSKIKEL

Probabilities:
['alpha' 'alpha+beta' 'beta']
[0.14370899 0.74048448 0.11580653]

Prediction: alpha+beta
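The probabilities and class labels are printed as two parallel arrays, so the i-th probability belongs to the i-th label. A short sketch of pairing them and picking the most probable class, using the values printed above (the variable names are illustrative):

```python
import numpy as np

# Values copied from the output above.
classes = np.array(["alpha", "alpha+beta", "beta"])
probabilities = np.array([0.14370899, 0.74048448, 0.11580653])

# Index of the largest probability selects the matching class label.
best = int(np.argmax(probabilities))
print(f"{classes[best]} (p = {probabilities[best]:.2f})")
# alpha+beta (p = 0.74)

# Pair each label with its probability for easier reading.
by_class = dict(zip(classes.tolist(), probabilities.tolist()))
```

In the notebook itself, `classes` would be `classifier.classes_` and `probabilities` would be `classifier.predict_proba([featureVector])[0]`.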

Note the limitations of the model (see 3-FitModel.ipynb). This is not a state-of-the-art model for predicting protein fold classes; rather, it serves as an example of how to create a reproducible and interactive workflow with Jupyter Notebooks.

Authors: Peter W. Rose, Shih-Cheng Huang, UC San Diego, October 1, 2018