This notebook extracts from the Protein Data Bank information about the secondary structure of proteins. The ultimate goal is to assign a fold classification from a protein sequence.
Rule 2: Document the Process, Not Just the Results. Here we describe the steps how to produce the dataset.
Rule 7: Build a Pipeline. Besides documenting all steps, the entire process of dataset creation from the original data files in the /data directory is automated. There are no manual steps.
Rule 8: Share and Explain Your Data. To enable reproducibility we provide a /data directory with data files and a file that describes the datasets with download locations and dates.
# column names
value_col = "foldClass" # fold class to be predicted
import pandas as pd
import numpy as np
import pdbutils
Protein sequences in the Protein Data Bank (PDB) are redundant. For example, there are more than 300 structures of hemoglobin in the PDB. To reduce biases in our analysis, we use a representative set of protein chains, i.e., a set of protein chains with minimal sequence identity among its members. For this analysis we downloaded a non-redundant set from the PISCES website with a maximum of 25% sequence identity and an X-ray resolution of <= 3.0 Å.
Wang G, Dunbrack RL Jr (2005) PISCES: recent improvements to a PDB sequence culling server, Nucleic Acids Res. 33, W94-8. doi: 10.1093/nar/gki402
representatives = pdbutils.read_pisces_representatives('./data/cullpdb_pc25_res3.0_R1.0_d180920_chains14051.gz')
print("Number of representative PDB chains:", representatives.shape[0], "\n")
representatives.head()
Number of representative PDB chains: 14051
length | Exptl. | resolution | R-factor | FreeRvalue | pdbChainId | |
---|---|---|---|---|---|---|
0 | 330 | XRAY | 2.2 | 0.16 | 0.29 | 12AS.A |
1 | 366 | XRAY | 2.1 | 0.19 | 0.26 | 16VP.A |
2 | 348 | XRAY | 2.6 | 0.22 | 0.34 | 1A0I.A |
3 | 413 | XRAY | 1.7 | 0.19 | 0.22 | 1A12.C |
4 | 108 | XRAY | 2.0 | 0.21 | 0.25 | 1A1X.A |
Secondary structure of proteins is most commonly assigned with the DSSP method.
Kabsch W, Sander C (1983) Dictionary of protein secondary structure: Pattern recognition of hydrogen‐bonded and geometrical features. Biopolymers 22, 2577–637. doi 0.1002/bip.360221211
DSSP defines 8 classes of secondary structure:
DSSP Class | Code |
---|---|
4-turn helix (α helix), min. length 4 residues | H |
Isolated β-bridge (single pair β-sheet hydrogen bond formation) | B |
Extended strand in parallel or anti-parallel β-sheet conformation, min length 2 residues | E |
3-turn helix (310 helix), min. length 3 residues | G |
5-turn helix (π helix), min. length 5 residues | I |
Hydrogen bonded turn (3, 4 or 5 turn) | T |
Bend (the only non-hydrogen-bond based assignment) | S |
Coil (residues which are not in any of the above conformations) | C |
The RCSB Protein Data Bank provides protein sequences and DSSP secondary structure assignments. The method below reads a local copy of this file and returns the data as a Pandas dataframe.
The secondary_structure string below is the secondary structure assignment for each amino acid residue in the sequence.
sec_struct = pdbutils.read_secondary_structure('./data/ss_dis.txt.gz')
print("Number of protein chains in PDB:", sec_struct.shape[0])
sec_struct.head()
Number of protein chains in PDB: 402007
pdbChainId | sequence | secondary_structure | |
---|---|---|---|
0 | 101M.A | MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDR... | CCCCHHHHHHHHHHHHHHGGGHHHHHHHHHHHHHHHCGGGGGGCTT... |
1 | 102L.A | MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAAKSE... | CCHHHHHHHHHCCEEEEEECTTSCEEEETTEEEESSSCTTTHHHHH... |
2 | 102M.A | MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDR... | CCCCHHHHHHHHHHHHHHGGGHHHHHHHHHHHHHHHCGGGGGGCTT... |
3 | 103L.A | MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNSLDAAK... | CCHHHHHHHHHCCEEEEEECTTSCEEEETTEECCCCCCCCCHHHHH... |
4 | 103M.A | MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDR... | CCCCHHHHHHHHHHHHHHGGGHHHHHHHHHHHHHHHCGGGGGGCTT... |
Note the high level of redundancy in the PDB: the representative set contains about 14,000 protein chains, whereas the entire PDB contains more than 400,000 protein chains.
By merging the representative set of protein chains with the secondary structure dataset we obtain the intersection between the two sets. Both datasets contain a unique identifier column pdbChainId that is used to join the datasets.
df = sec_struct.merge(representatives, left_on='pdbChainId', right_on='pdbChainId', how='inner')
print("Number of representative chains with secondary structure data:", df.shape[0])
df.head()
Number of representative chains with secondary structure data: 13791
pdbChainId | sequence | secondary_structure | length | Exptl. | resolution | R-factor | FreeRvalue | |
---|---|---|---|---|---|---|---|---|
0 | 12AS.A | MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQD... | CCCCHHHHHHHHHHHHHHHHHHHHHHHCEEECCCCSEEETTSSCSC... | 330 | XRAY | 2.2 | 0.16 | 0.29 |
1 | 16VP.A | SRMPSPPMPVPPAALFNRLLDDLGFSAGPALCTMLDTWNEDLFSAL... | CCSCCCCCCCCHHHHHHHHHHHHTCTTHHHHHHHHHHCCCCCSTTS... | 366 | XRAY | 2.1 | 0.19 | 0.26 |
2 | 1A0I.A | VNIKTNPFKAVSFVESAIKKALDNAGYLIAEIKYDGVRGNICVDNT... | CTTCCCCEEEEECCHHHHHHHHHHHSSEEEEECCCSEEEEEEEETT... | 348 | XRAY | 2.6 | 0.22 | 0.34 |
3 | 1A12.C | RRSPPADAIPKSKKVKVSHRSHSTEPGLVLTLGQGDVGQLGLGENV... | CCCCCCCCCCCCCCCCCCCTTCCCCCBEEEEEEECTTSTTCSCTTC... | 413 | XRAY | 1.7 | 0.19 | 0.22 |
4 | 1A1X.A | GSAGEDVGAPPDHLWVHQEGIYRDEYQRTWVAVVEEETSFLRARVQ... | CCCCCCCCCCCSEEEEEETTEEEETTSCEEEEEEEECSSCEEEEEE... | 108 | XRAY | 2.0 | 0.21 | 0.25 |
To reduce the DSSP 8-state classification to a 3-state classification we group related secondary structure elements: alpha (I, H, G), beta (E, B), and coil (S, T, C). Then we calculate the fraction of the amino acid residues in each of the three classes.
def alpha_fraction(s):
return (s.count('I') + s.count('H') + s.count('G')) / len(s)
def beta_fraction(s):
return (s.count('E') + s.count('B')) / len(s)
def coil_fraction(s):
return (s.count('S') + s.count('T') + s.count('C')) / len(s)
df['alpha'] = df.secondary_structure.apply(alpha_fraction)
df['beta'] = df.secondary_structure.apply(beta_fraction)
df['coil'] = df.secondary_structure.apply(coil_fraction)
df.head()
pdbChainId | sequence | secondary_structure | length | Exptl. | resolution | R-factor | FreeRvalue | alpha | beta | coil | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 12AS.A | MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQD... | CCCCHHHHHHHHHHHHHHHHHHHHHHHCEEECCCCSEEETTSSCSC... | 330 | XRAY | 2.2 | 0.16 | 0.29 | 0.345455 | 0.206061 | 0.448485 |
1 | 16VP.A | SRMPSPPMPVPPAALFNRLLDDLGFSAGPALCTMLDTWNEDLFSAL... | CCSCCCCCCCCHHHHHHHHHHHHTCTTHHHHHHHHHHCCCCCSTTS... | 366 | XRAY | 2.1 | 0.19 | 0.26 | 0.469945 | 0.046448 | 0.483607 |
2 | 1A0I.A | VNIKTNPFKAVSFVESAIKKALDNAGYLIAEIKYDGVRGNICVDNT... | CTTCCCCEEEEECCHHHHHHHHHHHSSEEEEECCCSEEEEEEEETT... | 348 | XRAY | 2.6 | 0.22 | 0.34 | 0.232759 | 0.318966 | 0.448276 |
3 | 1A12.C | RRSPPADAIPKSKKVKVSHRSHSTEPGLVLTLGQGDVGQLGLGENV... | CCCCCCCCCCCCCCCCCCCTTCCCCCBEEEEEEECTTSTTCSCTTC... | 413 | XRAY | 1.7 | 0.19 | 0.22 | 0.038741 | 0.418886 | 0.542373 |
4 | 1A1X.A | GSAGEDVGAPPDHLWVHQEGIYRDEYQRTWVAVVEEETSFLRARVQ... | CCCCCCCCCCCSEEEEEETTEEEETTSCEEEEEEEECSSCEEEEEE... | 108 | XRAY | 2.0 | 0.21 | 0.25 | 0.037037 | 0.472222 | 0.490741 |
Next we classify each protein chain into one of four classes. We use a threshold of 25% to define a predominant class.
Protein chains in the other class will be ignored in the subsequent analysis.
def protein_fold_class(data, min_threshold, max_threshold):
"""
Returns fold classification:
"alpha", "beta", "alpha+beta", and "other" based upon the
fraction of alpha and beta.
"""
if data.alpha > max_threshold and data.beta < min_threshold:
return "alpha"
elif data.beta > max_threshold and data.alpha < min_threshold:
return "beta"
elif data.alpha > max_threshold and data.beta > max_threshold:
return "alpha+beta"
else:
return "other"
# assign protein fold class
df[value_col] = df.apply(protein_fold_class, min_threshold=0.05, max_threshold=0.25, axis=1)
# exclude protein chains without a dominant classification from further analysis.
df = df[df[value_col] != 'other']
print("Dataset size", df.shape[0])
df.head()
Dataset size 5370
pdbChainId | sequence | secondary_structure | length | Exptl. | resolution | R-factor | FreeRvalue | alpha | beta | coil | foldClass | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 16VP.A | SRMPSPPMPVPPAALFNRLLDDLGFSAGPALCTMLDTWNEDLFSAL... | CCSCCCCCCCCHHHHHHHHHHHHTCTTHHHHHHHHHHCCCCCSTTS... | 366 | XRAY | 2.10 | 0.19 | 0.26 | 0.469945 | 0.046448 | 0.483607 | alpha |
3 | 1A12.C | RRSPPADAIPKSKKVKVSHRSHSTEPGLVLTLGQGDVGQLGLGENV... | CCCCCCCCCCCCCCCCCCCTTCCCCCBEEEEEEECTTSTTCSCTTC... | 413 | XRAY | 1.70 | 0.19 | 0.22 | 0.038741 | 0.418886 | 0.542373 | beta |
4 | 1A1X.A | GSAGEDVGAPPDHLWVHQEGIYRDEYQRTWVAVVEEETSFLRARVQ... | CCCCCCCCCCCSEEEEEETTEEEETTSCEEEEEEEECSSCEEEEEE... | 108 | XRAY | 2.00 | 0.21 | 0.25 | 0.037037 | 0.472222 | 0.490741 | beta |
5 | 1A2X.B | GDEEKRNRAITARRQHLKSVMLQIAATELEKEEGRREAEKQNYLAEH | CCCCHHHHHHHHHHHHHHHHHHHHHHHHHHHTCCCCCCCCCCCCCCC | 47 | XRAY | 2.30 | 0.22 | 0.33 | 0.574468 | 0.000000 | 0.425532 | alpha |
8 | 1A62.A | MNLTELKNTPVSELITLGENMGLENLARMRKQDIIFAILKQHAKSG... | CBHHHHHTSCHHHHHHHHHTTTCCCCTTSCHHHHHHHHHHHHHHTT... | 130 | XRAY | 1.55 | 0.22 | 0.25 | 0.284615 | 0.261538 | 0.453846 | alpha+beta |
We save the representative dataset with protein sequence and fold classification as a Pandas dataframe for further analysis.
df.to_json("./intermediate_data/foldClassification.json")
After you saved the dataset here, run the next step in the workflow 2-CalculateFeatures.ipynb or go back to 0-Workflow.ipynb.