This tutorial demonstrates how to split PDB structures into subcomponents or create biological assemblies. In Spark, a flatMap transformation splits each data record into zero or more records.
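To make the "zero or more records" behavior concrete, here is a minimal pure-Python sketch of flatMap semantics that runs without Spark. The helpers `flat_map` and `split_chains` and the placeholder entry `EMPTY` are hypothetical names introduced for illustration; they are not part of mmtfPyspark or PySpark.

```python
from itertools import chain


def flat_map(func, records):
    """Apply func to every record and concatenate the returned lists."""
    return list(chain.from_iterable(func(record) for record in records))


def split_chains(record):
    """Hypothetical splitter: turn (entry_id, chain_ids) into one key per chain."""
    entry_id, chain_ids = record
    return [f"{entry_id}.{chain_id}" for chain_id in chain_ids]


# One entry with four chains and one placeholder entry with no chains.
records = [("4HHB", ["A", "B", "C", "D"]), ("EMPTY", [])]

# The first record yields four output records; the second yields none.
keys = flat_map(split_chains, records)
print(keys)
```

Unlike a map, which always returns exactly one output per input, the record without chains simply disappears from the result.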
from pyspark.sql import SparkSession
from mmtfPyspark.io import mmtfReader
from mmtfPyspark.filters import ContainsDnaChain
from mmtfPyspark.mappers import StructureToBioassembly, StructureToPolymerChains, StructureToPolymerSequences
from mmtfPyspark.structureViewer import view_structure
from mmtfPyspark.utils import traverseStructureHierarchy
import py3Dmol
spark = SparkSession.builder.appName("4-Flatmapping").getOrCreate()
In this example we download the hemoglobin structure 4HHB, consisting of two alpha subunits and two beta subunits.
quaternary = mmtfReader.download_reduced_mmtf_files(["4HHB"])
view_structure(quaternary.keys().collect());
Here we extract the polymer sequences using a flatMap transformation. Chains A and C (alpha subunits) share one sequence, and chains B and D (beta subunits) share another.
sequences = quaternary.flatMap(StructureToPolymerSequences())
sequences.take(4)
[('4HHB.A', 'VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR'), ('4HHB.B', 'VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH'), ('4HHB.C', 'VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR'), ('4HHB.D', 'VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH')]
A flatMap operation splits data records into zero or more records. Here, we use the StructureToPolymerChains class to flatMap a PDB entry (quaternary structure) to its polymer chains (tertiary structure). Note that the chain Id is appended to the PDB Id. The two alpha subunits are 4HHB.A and 4HHB.C, and the two beta subunits are 4HHB.B and 4HHB.D.
tertiary = quaternary.flatMap(StructureToPolymerChains())
tertiary.keys().collect()
['4HHB.A', '4HHB.B', '4HHB.C', '4HHB.D']
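Because each key has the form PDB Id, a dot, and the chain Id, it can be split back into its two parts when needed. The helper `split_chain_key` below is a hypothetical name for this sketch, not an mmtfPyspark function, and it assumes the chain Id is everything after the last dot.

```python
def split_chain_key(key):
    """Split a key such as '4HHB.A' into (pdb_id, chain_id).

    Assumes the chain Id follows the last dot in the key.
    """
    pdb_id, separator, chain_id = key.rpartition(".")
    if not separator:
        raise ValueError(f"no chain Id found in key: {key!r}")
    return pdb_id, chain_id


keys = ["4HHB.A", "4HHB.B", "4HHB.C", "4HHB.D"]
pairs = [split_chain_key(key) for key in keys]
print(pairs)
```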
view_structure(tertiary.keys().collect());
For some analyses we may only need one copy of each unique subunit (identical polymer sequence). This can be done by setting excludeDuplicates = True.
tertiary = quaternary.flatMap(StructureToPolymerChains(excludeDuplicates=True))
tertiary.keys().collect()
['4HHB.A', '4HHB.B']
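The effect of dropping duplicate chains can be approximated in plain Python by keeping only the first chain seen for each distinct sequence. This is a sketch under that assumption: the function `exclude_duplicate_sequences` is a hypothetical name, the sequences are truncated to their first eight residues for readability, and the actual selection logic inside StructureToPolymerChains may differ.

```python
def exclude_duplicate_sequences(chains):
    """Keep the first (key, sequence) pair for each distinct sequence."""
    seen = set()
    unique = []
    for key, sequence in chains:
        if sequence not in seen:
            seen.add(sequence)
            unique.append((key, sequence))
    return unique


# Sequences truncated to eight residues for illustration only.
chains = [
    ("4HHB.A", "VLSPADKT"),
    ("4HHB.B", "VHLTPEEK"),
    ("4HHB.C", "VLSPADKT"),
    ("4HHB.D", "VHLTPEEK"),
]

unique_keys = [key for key, _ in exclude_duplicate_sequences(chains)]
print(unique_keys)
```

With this first-occurrence rule, chains C and D are dropped because their sequences match A and B, which agrees with the keys shown above.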
The filter operations we used previously for whole structures can also be applied to single polymer chains. Here we flatMap PDB structures into polymer chains and then select the DNA chains.
path = "../resources/mmtf_reduced_sample"
dna_chains = mmtfReader \
    .read_sequence_file(path) \
    .flatMap(StructureToPolymerChains(excludeDuplicates=True)) \
    .filter(ContainsDnaChain())
view_structure(dna_chains.keys().collect());
In this example we read the asymmetric unit of 1STP (complex of biotin with streptavidin).
asymmetric_unit = mmtfReader.download_full_mmtf_files(["1STP"])
Print some summary data about this structure.
traverseStructureHierarchy.print_structure_data(asymmetric_unit.first())
*** STRUCTURE DATA ***
Number of models : 1
Number of chains : 3
Number of groups : 206
Number of atoms : 1001
Number of bonds : 940
Now, we use a flatMap operation to map an asymmetric unit to one or more biological assemblies. In the case of 1STP, there is only one biological assembly, which represents a tetramer.
bio_assembly = asymmetric_unit.flatMap(StructureToBioassembly())
bio_assembly.first()[0]
'1STP-BioAssembly1'
As you can see, the biological assembly contains four copies of the asymmetric unit.
traverseStructureHierarchy.print_structure_data(bio_assembly.first())
*** STRUCTURE DATA ***
Number of models : 1
Number of chains : 12
Number of groups : 824
Number of atoms : 4004
Number of bonds : 3280
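The four-fold relationship can be checked directly from the two summaries printed in this tutorial. The sketch below copies those numbers into dictionaries (the variable names are illustrative) and computes the ratio for each count.

```python
# Counts copied from the two print_structure_data outputs in this tutorial.
asymmetric_unit_counts = {
    "models": 1, "chains": 3, "groups": 206, "atoms": 1001, "bonds": 940,
}
bio_assembly_counts = {
    "models": 1, "chains": 12, "groups": 824, "atoms": 4004, "bonds": 3280,
}

# Ratio of biological assembly to asymmetric unit for each count.
ratios = {
    name: bio_assembly_counts[name] / asymmetric_unit_counts[name]
    for name in asymmetric_unit_counts
}

for name, ratio in ratios.items():
    print(f"{name:>6}: {ratio:.2f}")
```

Chains, groups, and atoms each scale by exactly four, while the model count stays at one. The reported bond count grows by less than a factor of four; the tutorial output does not explain why, so treat that number with care.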
view_structure(["1STP"], bioAssembly=True);
spark.stop()