In this notebook you can enter a protein sequence and predict its fold class.
Rule 9: Design Your Notebooks to Be Read, Run, and Explored. We use ipywidgets to present a text box in which users can enter a protein sequence of their choice for prediction. A default sequence is provided so that the result is reproducible.
import numpy as np
import joblib
import protvectors
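from ipywidgets import widgets
from IPython.display import display  # explicit import so the cell also runs outside a notebook kernel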
text_box = widgets.Textarea(description='Sequence:', value='GAASMKIINTTRLPEALGPYSHATVVNGMVYTSGQIPLNVDGKIVSADVQAQTKQVLENLKVVLEEAGSDLNSVAKATIFIKDMNDFQKINEVYGQYFNEHKPARSCVEVARLPKDVKVEIELVSKIKEL')
display(text_box)
Textarea(value='GAASMKIINTTRLPEALGPYSHATVVNGMVYTSGQIPLNVDGKIVSADVQAQTKQVLENLKVVLEEAGSDLNSVAKATIFIKDMNDFQKINEVY…
sequence = text_box.value
print("Make prediction for:", sequence)
Make prediction for: GAASMKIINTTRLPEALGPYSHATVVNGMVYTSGQIPLNVDGKIVSADVQAQTKQVLENLKVVLEEAGSDLNSVAKATIFIKDMNDFQKINEVYGQYFNEHKPARSCVEVARLPKDVKVEIELVSKIKEL
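Because the text box accepts free-form input, it may help to normalize the sequence before it is split into n-grams. The following is a minimal sketch, not part of the original workflow: the helper name `clean_sequence` is hypothetical, and it assumes that only the 20 standard one-letter amino acid codes should be accepted.

```python
# Hypothetical helper (not part of this notebook or of protvectors).
# Assumption: only the 20 standard one-letter amino acid codes are valid.
VALID_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def clean_sequence(raw):
    """Remove whitespace, convert to upper case, and reject unexpected characters."""
    seq = "".join(raw.split()).upper()
    invalid = sorted(set(seq) - VALID_AMINO_ACIDS)
    if invalid:
        raise ValueError(f"Unexpected characters in sequence: {invalid}")
    return seq


# Pasted text with spaces, line breaks, and lower case is normalized.
print(clean_sequence("gaas mkii\nnttr"))  # prints GAASMKIINTTR
```

Under these assumptions, `sequence = clean_sequence(text_box.value)` could replace the direct assignment above.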
classifier = joblib.load("./intermediate_data/classifier")
classifier
SVC(class_weight='balanced', gamma='auto', probability=True, random_state=13)
ngrams = protvectors.ngrammer(sequence, n=3)
print(ngrams)
['GAA', 'AAS', 'ASM', 'SMK', 'MKI', 'KII', 'IIN', 'INT', 'NTT', 'TTR', 'TRL', 'RLP', 'LPE', 'PEA', 'EAL', 'ALG', 'LGP', 'GPY', 'PYS', 'YSH', 'SHA', 'HAT', 'ATV', 'TVV', 'VVN', 'VNG', 'NGM', 'GMV', 'MVY', 'VYT', 'YTS', 'TSG', 'SGQ', 'GQI', 'QIP', 'IPL', 'PLN', 'LNV', 'NVD', 'VDG', 'DGK', 'GKI', 'KIV', 'IVS', 'VSA', 'SAD', 'ADV', 'DVQ', 'VQA', 'QAQ', 'AQT', 'QTK', 'TKQ', 'KQV', 'QVL', 'VLE', 'LEN', 'ENL', 'NLK', 'LKV', 'KVV', 'VVL', 'VLE', 'LEE', 'EEA', 'EAG', 'AGS', 'GSD', 'SDL', 'DLN', 'LNS', 'NSV', 'SVA', 'VAK', 'AKA', 'KAT', 'ATI', 'TIF', 'IFI', 'FIK', 'IKD', 'KDM', 'DMN', 'MND', 'NDF', 'DFQ', 'FQK', 'QKI', 'KIN', 'INE', 'NEV', 'EVY', 'VYG', 'YGQ', 'GQY', 'QYF', 'YFN', 'FNE', 'NEH', 'EHK', 'HKP', 'KPA', 'PAR', 'ARS', 'RSC', 'SCV', 'CVE', 'VEV', 'EVA', 'VAR', 'ARL', 'RLP', 'LPK', 'PKD', 'KDV', 'DVK', 'VKV', 'KVE', 'VEI', 'EIE', 'IEL', 'ELV', 'LVS', 'VSK', 'SKI', 'KIK', 'IKE', 'KEL']
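The output above shows overlapping 3-grams produced by a sliding window that advances one residue at a time. As an illustration only, the same splitting can be sketched in a few lines; the function `overlapping_ngrams` is a hypothetical stand-in and not necessarily how `protvectors.ngrammer` is implemented.

```python
def overlapping_ngrams(sequence, n=3):
    """Return all overlapping n-grams of a sequence (window step of 1).

    Hypothetical stand-in for illustration; sequences shorter than n
    yield an empty list.
    """
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


# First seven residues of the default sequence.
print(overlapping_ngrams("GAASMKI"))
# ['GAA', 'AAS', 'ASM', 'SMK', 'MKI']
```

A sequence of length L yields L - n + 1 overlapping n-grams.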
protvec = protvectors.read_protvectors("./data/protVec_100d_3grams.csv")
featureVector = protvectors.apply_protvectors(ngrams, protvec)
print(featureVector)
[-2.3101338 -0.32331702 0.78323473 -2.17371844 -0.05491938 -0.04513578 1.22565468 -0.4836296 0.29705189 2.28707543 0.22305553 -0.41752745 -0.39105919 0.70440273 0.13918634 0.19040849 0.84646709 -1.56859546 0.02784964 -1.49127177 -0.01911258 -1.30944049 -1.42153082 0.01448804 0.36608489 0.39708845 -0.16598204 -0.15441528 -0.33733611 -1.21403695 -0.5650013 -1.30023446 0.56057566 -0.1993203 0.07892812 1.4882538 0.08757783 -0.22068366 0.53207356 -0.09555411 0.20772675 0.67549063 0.52102914 -0.12743451 -0.47274557 0.02531047 -0.91127284 -0.41035579 0.00657577 1.81890208 0.12983772 -0.76028579 1.77282759 -1.40223342 1.04664272 -1.91512564 0.0619787 0.3228361 -0.30558078 -2.91303999 -0.25224169 1.90002018 0.20970349 0.03197095 1.54426113 0.86039855 0.22558089 2.08030942 0.80783672 -1.10335774 0.15124303 0.28051646 0.7273708 1.120928 0.87064261 -0.78159259 0.86080856 -0.02225375 0.7175734 -0.98596892 -1.03833133 -0.07469874 0.04270149 -0.09935889 1.53536898 -0.9162694 -1.40590277 -0.72739099 -1.47466438 0.48371854 0.35492522 -1.17893066 0.41448665 0.640024 1.64565464 0.09323905 -0.32028101 -1.52077313 0.63261307 2.31153588]
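The feature vector has 100 components, matching the 100-dimensional embeddings named in the ProtVec file. One common way to turn a list of 3-grams into a single fixed-length vector is to sum the embedding of each 3-gram. The sketch below assumes that approach; we have not confirmed that `protvectors.apply_protvectors` works this way, and the names `toy_embeddings`, `sum_embeddings`, and the `<unk>` key are hypothetical. The 4-dimensional values are toy data, not real ProtVec embeddings.

```python
import numpy as np

# Toy 4-dimensional embeddings for illustration only (not real ProtVec values).
toy_embeddings = {
    "GAA": np.array([0.1, -0.2, 0.0, 0.3]),
    "AAS": np.array([0.0, 0.1, 0.2, -0.1]),
    "<unk>": np.zeros(4),  # assumed fallback for 3-grams missing from the table
}


def sum_embeddings(ngrams, embeddings, unknown_key="<unk>"):
    """Sum the embedding vectors of all n-grams into one feature vector."""
    vectors = [embeddings.get(gram, embeddings[unknown_key]) for gram in ngrams]
    return np.sum(vectors, axis=0)


feature = sum_embeddings(["GAA", "AAS", "XXX"], toy_embeddings)
print(feature.shape)  # (4,)
```

Whatever the sequence length, the result has the dimensionality of a single embedding, which is what lets a classifier such as the SVC accept sequences of different lengths.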
We use our classification model to predict the fold class. The class with the highest probability is reported as the final result.
predictions = classifier.predict([featureVector])
probabilities = classifier.predict_proba([featureVector])
print("Sequence:")
print(sequence)
print("\nProbabilities:")
print(classifier.classes_)
print(probabilities[0])
print("\nPrediction:", predictions[0])
Sequence:
GAASMKIINTTRLPEALGPYSHATVVNGMVYTSGQIPLNVDGKIVSADVQAQTKQVLENLKVVLEEAGSDLNSVAKATIFIKDMNDFQKINEVYGQYFNEHKPARSCVEVARLPKDVKVEIELVSKIKEL

Probabilities:
['alpha' 'alpha+beta' 'beta']
[0.14370899 0.74048448 0.11580653]

Prediction: alpha+beta
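To make the link between the probabilities and the reported class explicit, the most probable class can be looked up by pairing the class labels with the index of the largest probability. This is a minimal sketch using the values printed above; the helper `most_probable_class` is hypothetical. Note that for an `SVC` with `probability=True`, scikit-learn documents that the probability estimates may in some cases be inconsistent with the output of `predict`, so the two should not be assumed to agree for every input.

```python
import numpy as np

# Class labels and probabilities copied from the output above.
classes = np.array(["alpha", "alpha+beta", "beta"])
probabilities = np.array([0.14370899, 0.74048448, 0.11580653])


def most_probable_class(classes, probabilities):
    """Return the label with the highest probability and that probability."""
    best = int(np.argmax(probabilities))
    return classes[best], float(probabilities[best])


label, p = most_probable_class(classes, probabilities)
print(f"{label} ({p:.2f})")  # prints alpha+beta (0.74)
```

In the notebook itself, `classifier.classes_` and `probabilities[0]` would take the place of the two hard-coded arrays.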
Note the limitations of the model (see 3-FitModel.ipynb). This is not a state-of-the-art model for predicting protein fold classes; rather, it serves as an example of how to create a reproducible and interactive workflow with Jupyter Notebooks.