mmtf-pyspark operates on 3D structures in the compressed binary MMTF file format.
For more information about the MMTF format, see the MMTF website (https://mmtf.rcsb.org).
Protein Data Bank structures are available in two MMTF data representations:
- full: all atoms, full coordinate precision
- reduced: C-alpha atoms for polypeptides, P atoms for polynucleotides, all atoms for ligands, and reduced coordinate precision
First, we create a SparkSession and check how many cores are available:
from pyspark.sql import SparkSession
from mmtfPyspark.io import mmtfReader
spark = SparkSession.builder.appName("1-Input").getOrCreate()
sc = spark.sparkContext
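# defaultParallelism reports the default number of partitions (on a local machine, typically the number of cores)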
sc.defaultParallelism
4
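The level of parallelism can also be set explicitly when building the SparkSession. As an example using the standard builder API (shown commented out, like the optional cells later in this notebook):

# example: request four local cores explicitly
# spark = SparkSession.builder.master("local[4]").appName("1-Input").getOrCreate()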
For a small list of PDB entries (tens to a few hundred), the download methods are the quickest way to import structures. Here we download a list of four structures in the full representation.
pdbids = ['1LQ9','1LXJ','4XPX','1P1J']
structures = mmtfReader.download_full_mmtf_files(pdbids)
Structures are represented as key-value pairs (tuples):
We can print the keys and values using the collect() method. Note that the structures are loaded in an arbitrary order; you cannot rely on the order of structures.
structures.keys().collect()
['1P1J', '1LXJ', '4XPX', '1LQ9']
structures.values().collect()
[<mmtfPyspark.utils.mmtfStructure.MmtfStructure at 0x1162845f8>, <mmtfPyspark.utils.mmtfStructure.MmtfStructure at 0x11628d550>, <mmtfPyspark.utils.mmtfStructure.MmtfStructure at 0x116298908>, <mmtfPyspark.utils.mmtfStructure.MmtfStructure at 0x1162cdcc0>]
Spark represents these key-value pairs as Resilient Distributed Datasets (RDDs): fault-tolerant collections of elements that can be operated on in parallel. To see how the dataset was distributed, we can print the number of partitions.
structures.getNumPartitions()
4
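Because structures is a pair RDD, we can process all values in parallel with standard RDD transformations. As a minimal sketch, assuming MmtfStructure exposes a num_atoms attribute (as the underlying MMTF decoder does), we count the atoms in each structure:

# map each structure to its atom count; collect() gathers the results to the driver
# assumes MmtfStructure provides num_atoms
structures.mapValues(lambda s: s.num_atoms).collect()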
Next, we read PDB structures from a local copy of an MMTF Hadoop Sequence file. For the following examples to work, the MMTF_FULL and MMTF_REDUCED environment variables need to be set. See the installation instructions for details.
If you have a long list (thousands) of PDB IDs, reading the structures from a local copy of the MMTF Hadoop Sequence file is the better choice; however, it is very inefficient for just a few structures, as in the example below.
path = "../resources/mmtf_reduced_sample/"
structures = mmtfReader.read_sequence_file(path, pdbids)
Let's print the keys again and see how long this takes. Note that Spark evaluates lazily: the data are loaded only when and if they are required.
structures.keys().collect()
['1LQ9', '1LXJ', '4XPX', '1P1J']
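Transformations such as filter are lazy as well; no data are read until an action such as count() or collect() triggers the computation. A minimal sketch using only the standard RDD API:

# filter by key: a lazy transformation, nothing is read yet
filtered = structures.filter(lambda t: t[0] != '4XPX')
# count() is an action: it triggers reading and filtering
filtered.count()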
Now, let's read a sample of the PDB archive from the MMTF Hadoop Sequence file and cache it:
structures = mmtfReader.read_sequence_file(path).cache()
%%time
structures.count()
CPU times: user 9.6 ms, sys: 3.85 ms, total: 13.4 ms
Wall time: 5.18 s
9756
Now, let's count the number of structures again. Should this be faster, since we already loaded the data?
Not necessarily: the data from the Hadoop Sequence file are streamed through parallel threads, so if you need them again, they must be reloaded from scratch unless they are cached. Note the .cache() method call after reading the MMTF Hadoop Sequence file above.
Remove the .cache() method call, run this notebook again and compare the time it takes to count the number of structures.
%%time
structures.count()
CPU times: user 7.45 ms, sys: 2.9 ms, total: 10.3 ms
Wall time: 900 ms
9756
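You can check whether an RDD has been marked for caching with the is_cached flag (standard PySpark RDD API); after the .cache() call above it returns True. When the cached data are no longer needed, unpersist() releases the memory:

structures.is_cached
# release the cached partitions when no longer needed
# structures.unpersist()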
In this workshop we use a sample set of the PDB with about 10,000 structures. We commented out the lines below, since the smaller sample is sufficient for these tutorials.
To use the entire PDB, the MMTF_FULL and MMTF_REDUCED environment variables need to be set to the locations of the full and reduced MMTF Hadoop Sequence files. See the mmtf-pyspark installation instructions for details.
# %%time
# pdb_full = mmtfReader.read_full_sequence_file();
# pdb_full.count()
# %%time
# pdb_reduced = mmtfReader.read_reduced_sequence_file();
# pdb_reduced.count()
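If the full archive is too large for your machine, you can also work with a random subset. As far as we recall, read_sequence_file accepts optional fraction and seed arguments for sampling; please verify against the mmtf-pyspark API documentation:

# example (assumed parameters; check the mmtfReader docs): read a random 10% sample
# structures_sample = mmtfReader.read_sequence_file(path, fraction=0.1, seed=123)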
It is very important to run the notebook all the way to the spark.stop() statement to terminate Spark. Otherwise you may end up running multiple instances of Spark that will interfere with each other.
spark.stop()