In this problem we use the ColumnarStructure and boolean indexing to create a distance map of the HIV protease dimer. We will use C-beta atoms instead of C-alpha atoms.
from pyspark.sql import SparkSession
from mmtfPyspark.io import mmtfReader
from mmtfPyspark.utils import traverseStructureHierarchy, ColumnarStructure
from mmtfPyspark import structureViewer
import numpy as np
from scipy.spatial.distance import pdist, squareform
import matplotlib.pyplot as plt
spark = SparkSession.builder.appName("Problem-1").getOrCreate()
Here we download an HIV protease structure with a bound ligand (Nelfinavir).
pdb = mmtfReader.download_full_mmtf_files(["1OHR"])
Structures are represented as key-value pairs (tuples).
In this case, we have only one structure, so we can use the first() method to extract the data.
structure = pdb.values().first()
Here we convert an MMTF structure to a columnar structure. By specifying the firstModel flag, we retrieve data for the first model only (this structure has only one model anyway).
arrays = ... your code here ...
x = ... your code here ...
y = ... your code here ...
z = ... your code here ...
Entity types can be used to distinguish polymer from non-polymer groups and select specific components, e.g., all protein groups. The entity type of each atom is returned as an array:
entity_types = arrays.get_entity_types()
entity_types
array(['PRO', 'PRO', 'PRO', ..., 'WAT', 'WAT', 'WAT'], dtype=object)
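As a minimal sketch of how such a per-atom label array can be inspected, the toy example below counts the distinct entity types and builds a protein-only mask. The array is made-up data that reuses only the 'PRO' and 'WAT' labels visible in the output above; it is not taken from 1OHR.

```python
import numpy as np

# Toy per-atom entity types (hypothetical data, same labels as in the output above)
entity_types = np.array(['PRO', 'PRO', 'PRO', 'WAT', 'WAT'], dtype=object)

# Which entity types occur, and how many atoms carry each label
types, counts = np.unique(entity_types, return_counts=True)

# Boolean mask that is True for protein atoms only
is_protein = entity_types == 'PRO'
```

A mask like is_protein can later be combined with masks on atom names or group names to narrow a selection.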
atom_names = arrays.get_atom_names()
atom_names
array(['N', 'CA', 'C', ..., 'O', 'H1', 'H2'], dtype=object)
group_names = arrays.get_group_names()
group_names
array(['PRO', 'PRO', 'PRO', ..., 'HOH', 'HOH', 'HOH'], dtype=object)
Boolean indexing is an efficient way to access selected elements from numpy arrays.
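The idea can be sketched on a few toy arrays. The data below are hypothetical, and the selection rule (C-beta atoms, with C-alpha as a stand-in for glycine, which has no C-beta) is an assumption inferred from the mixed 'CB'/'CA' output shown further down, not a rule stated by the exercise.

```python
import numpy as np

# Toy per-atom arrays (hypothetical data, not taken from 1OHR)
atom_names = np.array(['N', 'CA', 'C', 'CB', 'N', 'CA', 'C', 'O'], dtype=object)
group_names = np.array(['ALA', 'ALA', 'ALA', 'ALA', 'GLY', 'GLY', 'GLY', 'GLY'], dtype=object)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])

# Element-wise comparisons yield boolean arrays; combine them with & and |.
# Parentheses are required because & and | bind tighter than ==.
is_cb = atom_names == 'CB'
is_gly_ca = (atom_names == 'CA') & (group_names == 'GLY')
mask = is_cb | is_gly_ca

# Indexing with the mask keeps only the elements where the mask is True
selected_names = atom_names[mask]
selected_x = x[mask]
```

Because every per-atom array has the same length and order, one mask can be reused to index names, groups, and coordinates consistently.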
This time, do the selection for the entire structure.
cb_idx = ... your code here ...
... your code here ...
array(['CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CA', 'CA', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CA', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CA', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CA', 'CA', 'CB', 'CA', 'CA', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CA', 'CB', 'CB', 'CB', 'CB', 'CA', 'CB', 'CB', 'CB', 'CB', 'CA', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CA', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CA', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CA', 'CA', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CA', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CA', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CA', 'CA', 'CB', 'CA', 'CA', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CA', 'CB', 'CB', 'CB', 'CB', 'CA', 'CB', 'CB', 'CB', 'CB', 'CA', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CA', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CB', 'CA', 'CB', 'CB', 'CB', 'CB', 'CB'], dtype=object)
Then, we apply this index to get the coordinates of the selected atoms.
xc = x[cb_idx]
yc = y[cb_idx]
zc = z[cb_idx]
Next, we combine the separate x, y, and z arrays and swap the axes to convert
[x0, x1, ..., xn],[y0, y1,...,yn],[z0, z1, ...,zn]
to
[x0, y0, z0],[x1, y1, z1], ..., [xn, yn, zn]
coords = np.swapaxes(np.array([xc,yc,zc]), 0, 1)
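The effect of the axis swap can be checked on a small example. The coordinate values below are made up for illustration; np.column_stack is shown only as an equivalent alternative, not as the notebook's method.

```python
import numpy as np

# Hypothetical coordinates for four atoms
xc = np.array([0.0, 1.0, 2.0, 3.0])
yc = np.array([10.0, 11.0, 12.0, 13.0])
zc = np.array([20.0, 21.0, 22.0, 23.0])

stacked = np.array([xc, yc, zc])       # shape (3, n): one row per axis
coords = np.swapaxes(stacked, 0, 1)    # shape (n, 3): one row per atom

# Equivalent result built directly from the three 1-D arrays
alt = np.column_stack((xc, yc, zc))
```

The (n, 3) layout matters because pairwise-distance routines such as pdist treat each row as one observation.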
dist_matrix = squareform(pdist(coords, 'euclidean'))
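To make the two steps concrete, here is a small sketch with three made-up points. In SciPy the distance metric is an argument of pdist, which returns a condensed vector of the n(n-1)/2 pairwise distances; squareform then expands that vector into a symmetric n x n matrix with zeros on the diagonal.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Three hypothetical points chosen so that the distances are easy to verify
pts = np.array([[0.0, 0.0, 0.0],
                [3.0, 4.0, 0.0],
                [0.0, 0.0, 12.0]])

# Condensed distances in the order (0,1), (0,2), (1,2)
condensed = pdist(pts, metric='euclidean')

# Symmetric square matrix with a zero diagonal
square = squareform(condensed)
```

For these points the pairwise distances are 5, 12, and 13, so square[0, 1] is 5 and square[1, 2] is 13.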
plt.pcolor(dist_matrix, cmap='RdBu')
plt.title('C-beta distance map')
plt.gca().set_aspect('equal')
plt.colorbar();
Only consider distances <= 9 Å. We use boolean indexing to set all distances > 9 Å to zero.
dist_matrix[dist_matrix > 9] = 0
plt.pcolor(dist_matrix, cmap='Greys')
plt.title('C-beta distance map')
plt.gca().set_aspect('equal')
plt.colorbar();
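The thresholding step can be sketched on a small hypothetical distance matrix. Note that after zeroing, a value of 0 no longer distinguishes "farther than the cutoff" from "distance to itself"; the boolean contact matrix at the end is one possible alternative that keeps the two cases apart. The names cutoff, thresholded, and contacts are illustrative and do not appear in the notebook.

```python
import numpy as np

# Hypothetical symmetric distance matrix for three atoms (in Å)
dist = np.array([[0.0, 5.0, 12.0],
                 [5.0, 0.0, 13.0],
                 [12.0, 13.0, 0.0]])
cutoff = 9.0

# Work on a copy so the original distances stay available
thresholded = dist.copy()
thresholded[thresholded > cutoff] = 0  # assignment through a boolean mask

# Alternative: explicit contact matrix that excludes the diagonal
contacts = (dist <= cutoff) & ~np.eye(len(dist), dtype=bool)
```

Copying first is a design choice for the sketch; the notebook modifies dist_matrix in place, which is fine when the unthresholded values are no longer needed.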
spark.stop()