This notebook reads a dataset of protein sequences with their fold classifications and calculates a feature vector for each protein sequence.
Rule 3: Use Divisions to Make Steps Clear. We use one cell for each distinct task.
Rule 4: Modularize Code. To avoid duplicating code, we have collected several functions in protvectors.py. These functions are also used in 4-Predict.
Rule 8: Share and Explain Your Data. To enable reproducibility we provide a local copy of a Word2vec model in the /data directory and a file that describes the datasets with download locations and dates.
import pandas as pd
import protvectors
# column names
feature_col = "features" # feature vector
value_col = "foldClass" # fold class to be predicted
df = pd.read_json("./intermediate_data/foldClassification.json")
We use the ProtVec model (Asgari et al.) to calculate a 100-dimensional feature vector for each protein sequence. ProtVec uses a Word2vec model (Mikolov et al.) trained on 546,790 Swiss-Prot sequences, each represented as three sequences of 3-grams (546,790 × 3 = 1,640,370 in total). The 3-grams represent "biological words" in a protein sequence, e.g., sequence: SRMPSPP -> 3-grams: SRM RMP MPS PSP SPP. The ProtVec model is available for download at: https://github.com/ehsanasgari/Deep-Proteomics.
Asgari E, Mofrad MR (2015) Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics. PLoS One 10(11):e0141287. doi: 10.1371/journal.pone.0141287.
Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013) Distributed Representations of Words and Phrases and their Compositionality. In: Advances in Neural Information Processing Systems, p. 3111–3119.
Next, we read a local copy of the ProtVec model. The model is represented as a dictionary, with the 3-gram as the key and its 100-dimensional feature vector as the value.
protvec = protvectors.read_protvectors("./data/protVec_100d_3grams.csv")
print("Example ProtVec for 3-gram SRM:\n", protvec['SRM'])
Example ProtVec for 3-gram SRM: [-0.349053 -0.034172 -0.14602 -0.112906 0.318846 0.100117 -0.104718 -0.194695 -0.08249 0.016351 -0.181182 0.109543 0.067238 -0.027135 0.222703 0.073312 -0.074177 -0.087137 -0.27853 0.003309 -0.065516 -0.035587 0.042179 0.169955 0.155156 -0.07882 0.203758 0.129488 -0.009507 -0.033186 -0.007172 -0.039388 0.243934 0.009303 0.043914 -0.018962 -0.23077 -0.136273 0.027782 0.232346 -0.2341 0.102889 -0.054253 -0.111376 0.106518 -0.027139 -0.139712 -0.049569 0.057983 -0.157097 0.090227 0.0228 0.114038 0.017181 -0.015422 -0.035576 -0.014446 0.000584 -0.292332 0.003074 0.097327 0.072325 0.138753 0.028772 -0.023035 0.024519 0.123589 0.021453 0.286168 0.094651 -0.145597 0.132008 -0.104951 0.121934 -0.042467 -0.075287 0.306096 0.096278 -0.121827 0.167771 0.059359 -0.169576 0.018486 -0.143597 0.211764 0.171916 0.200995 0.190091 -0.142053 0.022641 0.204606 -0.083642 0.016121 -0.147855 0.001436 -0.124035 0.00538 -0.177881 0.116058 0.195754]
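The dictionary layout described above can be illustrated with a minimal sketch. The implementation of `protvectors.read_protvectors` and the exact layout of the model file are not shown in this notebook, so the function name `read_vectors`, the tab-delimited layout, and the toy 4-dimensional values below are all assumptions made for illustration only.

```python
import csv
import io

def read_vectors(handle, delimiter="\t"):
    """Parse rows of the form <3-gram>, v1, v2, ... into a dict of float lists.

    Hypothetical helper: the file layout is an assumption, not the
    documented format of the ProtVec download.
    """
    vectors = {}
    for row in csv.reader(handle, delimiter=delimiter):
        if not row:
            continue  # skip blank lines
        vectors[row[0]] = [float(x) for x in row[1:]]
    return vectors

# Made-up table with 4-dimensional vectors (the real model has 100 dimensions).
sample = io.StringIO("SRM\t-0.35\t-0.03\t-0.15\t-0.11\nRMP\t0.10\t0.20\t-0.05\t0.00\n")
toy_vectors = read_vectors(sample)
print(toy_vectors["SRM"])
```

A plain dictionary keeps each 3-gram lookup at constant cost, which matters because every protein sequence triggers one lookup per residue position.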
Next, we create 3-grams for the protein sequences in our dataset.
# add column ngram to dataframe
df['ngram'] = df.sequence.apply(protvectors.ngrammer, n=3)
df.head(3)
| | Exptl. | FreeRvalue | R-factor | alpha | beta | coil | foldClass | length | pdbChainId | resolution | secondary_structure | sequence | ngram |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | XRAY | 0.26 | 0.19 | 0.469945 | 0.046448 | 0.483607 | alpha | 366 | 16VP.A | 2.100 | CCSCCCCCCCCHHHHHHHHHHHHTCTTHHHHHHHHHHCCCCCSTTS... | SRMPSPPMPVPPAALFNRLLDDLGFSAGPALCTMLDTWNEDLFSAL... | [SRM, RMP, MPS, PSP, SPP, PPM, PMP, MPV, PVP, ... |
1000 | XRAY | 0.23 | 0.18 | 0.504630 | 0.004630 | 0.490741 | alpha | 216 | 1PBW.B | 2.000 | CCCCCCCCCCCCCCHHHHCCTTSCSCHHHHHHHHHHHHHHTTCTTT... | MEADVEQQALTLPDLAEQFAPPDIAPPLLIKLVEAIEKKGLECSTL... | [MEA, EAD, ADV, DVE, VEQ, EQQ, QQA, QAL, ALT, ... |
10002 | XRAY | 0.26 | 0.22 | 0.716172 | 0.006601 | 0.277228 | alpha | 303 | 4TQ3.A | 2.408 | CCCCCCCCCCCCCCCHHHHHHCGGGGHHHHHHHHHHHHHHCCTTSC... | MDSSLANINQIDVPSKYLRLLRPVAWLCFLLPYAVGFGFGITPNAS... | [MDS, DSS, SSL, SLA, LAN, ANI, NIN, INQ, NQI, ... |
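The `ngram` column holds overlapping 3-grams, obtained by sliding a window of length 3 along the sequence one residue at a time. The source of `protvectors.ngrammer` is not shown here, so the following is only a sketch of that idea under the hypothetical name `overlapping_ngrams`.

```python
def overlapping_ngrams(sequence, n=3):
    """Return all overlapping n-grams, sliding one residue at a time."""
    # A sequence shorter than n yields an empty list.
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]

print(overlapping_ngrams("SRMPSPP"))
# ['SRM', 'RMP', 'MPS', 'PSP', 'SPP']
```

A sequence of length L produces L - n + 1 n-grams, so the 366-residue first entry above would yield 364 3-grams.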
Here we create a 100-dimensional feature vector for each protein sequence by summing the ProtVec vectors of all its 3-grams. We then standardize each feature vector to zero mean and unit variance.
df[feature_col] = df.ngram.apply(protvectors.apply_protvectors, protvec=protvec)
df.head(3)
| | Exptl. | FreeRvalue | R-factor | alpha | beta | coil | foldClass | length | pdbChainId | resolution | secondary_structure | sequence | ngram | features |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | XRAY | 0.26 | 0.19 | 0.469945 | 0.046448 | 0.483607 | alpha | 366 | 16VP.A | 2.100 | CCSCCCCCCCCHHHHHHHHHHHHTCTTHHHHHHHHHHCCCCCSTTS... | SRMPSPPMPVPPAALFNRLLDDLGFSAGPALCTMLDTWNEDLFSAL... | [SRM, RMP, MPS, PSP, SPP, PPM, PMP, MPV, PVP, ... | [-2.618341208445193, -0.37215537192569575, 0.1... |
1000 | XRAY | 0.23 | 0.18 | 0.504630 | 0.004630 | 0.490741 | alpha | 216 | 1PBW.B | 2.000 | CCCCCCCCCCCCCCHHHHCCTTSCSCHHHHHHHHHHHHHHTTCTTT... | MEADVEQQALTLPDLAEQFAPPDIAPPLLIKLVEAIEKKGLECSTL... | [MEA, EAD, ADV, DVE, VEQ, EQQ, QQA, QAL, ALT, ... | [-2.4130836608297224, -0.5122827315971855, 0.1... |
10002 | XRAY | 0.26 | 0.22 | 0.716172 | 0.006601 | 0.277228 | alpha | 303 | 4TQ3.A | 2.408 | CCCCCCCCCCCCCCCHHHHHHCGGGGHHHHHHHHHHHHHHCCTTSC... | MDSSLANINQIDVPSKYLRLLRPVAWLCFLLPYAVGFGFGITPNAS... | [MDS, DSS, SSL, SLA, LAN, ANI, NIN, INQ, NQI, ... | [-2.6375752438981404, 0.18385725798670652, 0.2... |
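The summing and standardization step can be sketched as follows. This is not the code of `protvectors.apply_protvectors`, which is not shown in this notebook. The function name `sum_and_standardize`, the choice to skip 3-grams missing from the model, and the use of the population standard deviation are assumptions, and the 3-dimensional toy vectors stand in for the real 100-dimensional ones.

```python
import statistics

def sum_and_standardize(ngrams, vectors):
    """Sum the vectors of all known n-grams, then scale to zero mean and unit variance."""
    # Assumption: n-grams absent from the model are skipped.
    known = [vectors[g] for g in ngrams if g in vectors]
    if not known:
        return None  # no usable n-grams for this sequence
    # Component-wise sum across all n-gram vectors.
    summed = [sum(component) for component in zip(*known)]
    mean = statistics.fmean(summed)
    std = statistics.pstdev(summed)  # population standard deviation
    if std == 0:
        return [0.0] * len(summed)
    return [(x - mean) / std for x in summed]

# Made-up 3-dimensional vectors for illustration.
toy_model = {"SRM": [1.0, 2.0, 3.0], "RMP": [0.0, 1.0, 5.0]}
features = sum_and_standardize(["SRM", "RMP", "XXX"], toy_model)
print(features)
```

Standardizing each summed vector removes the dependence on sequence length, since longer proteins contribute more 3-grams and would otherwise produce vectors of larger magnitude.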
We save the dataframe with protein sequences, fold classifications, and feature vectors to a JSON file for further analysis.
df.to_json("./intermediate_data/features.json")
After you have saved the dataset, run the next step in the workflow, 3-FitModel.ipynb, or go back to 0-Workflow.ipynb.