mmtf-pyspark operates on 3D structures in the compressed binary MMTF file format.
For more information about the MMTF format, see the MMTF website (https://mmtf.rcsb.org).
Protein Data Bank structures are available in two MMTF data representations:
- full: all atoms, full coordinate precision
- reduced: C-alpha atoms for polypeptides, P atoms for polynucleotides, all atoms for ligands, and reduced coordinate precision
First, we create a SparkSession and check how many cores are available:
from pyspark.sql import SparkSession
from mmtfPyspark.io import mmtfReader
spark = SparkSession.builder.appName("1-Input").getOrCreate()
sc = spark.sparkContext
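# defaultParallelism reports the default number of partitions (on a local machine, typically the number of cores)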
sc.defaultParallelism
4
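The level of parallelism can also be set explicitly when building the SparkSession. As an example using the standard builder API (shown commented out, like the optional cells later in this notebook):

# example: request four local cores explicitly
# spark = SparkSession.builder.master("local[4]").appName("1-Input").getOrCreate()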
For a small list of PDB entries (tens to a few hundred), the download methods are the quickest way to import structures. Here we download a list of four structures in the full representation.
pdbids = ['1LQ9','1LXJ','4XPX','1P1J']
structures = mmtfReader.download_full_mmtf_files(pdbids)
Structures are represented as key-value pairs (tuples):
We can print the keys and values using the collect() method. Note that the structures are loaded in an arbitrary order; you cannot rely on the order of structures.
structures.keys().collect()
['1P1J', '1LXJ', '4XPX', '1LQ9']
structures.values().collect()
[<mmtfPyspark.utils.mmtfStructure.MmtfStructure at 0x1162845f8>, <mmtfPyspark.utils.mmtfStructure.MmtfStructure at 0x11628d550>, <mmtfPyspark.utils.mmtfStructure.MmtfStructure at 0x116298908>, <mmtfPyspark.utils.mmtfStructure.MmtfStructure at 0x1162cdcc0>]
Spark represents these key-value pairs as Resilient Distributed Datasets (RDDs): fault-tolerant collections of elements that can be operated on in parallel. To see how the dataset was distributed, we can print the number of partitions.
structures.getNumPartitions()
4
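Because structures is a pair RDD, we can process all values in parallel with standard RDD transformations. As a minimal sketch, assuming MmtfStructure exposes a num_atoms attribute (as the underlying MMTF decoder does), we count the atoms in each structure:

# map each structure to its atom count; collect() gathers the results to the driver
# assumes MmtfStructure provides num_atoms
structures.mapValues(lambda s: s.num_atoms).collect()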
Next, we read PDB structures from a local copy of an MMTF Hadoop Sequence file. For the following examples to work, the MMTF_FULL and MMTF_REDUCED environment variables need to be set. See the installation instructions for details.
If you have a long list (thousands) of PDB IDs, reading the structures from a local copy of the MMTF Hadoop Sequence file is the better choice; however, it is very inefficient for just a few structures, as in the example below.
path = "../resources/mmtf_reduced_sample/"
structures = mmtfReader.read_sequence_file(path, pdbids)
Let's print the keys again and see how long this takes. Note that Spark evaluates lazily: the data are loaded only when and if they are required.
structures.keys().collect()
['1LQ9', '1LXJ', '4XPX', '1P1J']
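Transformations such as filter are lazy as well; no data are read until an action such as count() or collect() triggers the computation. A minimal sketch using only the standard RDD API:

# filter by key: a lazy transformation, nothing is read yet
filtered = structures.filter(lambda t: t[0] != '4XPX')
# count() is an action: it triggers reading and filtering
filtered.count()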
Now, let's read a sample of the PDB archive from the MMTF Hadoop Sequence file and cache it:
structures = mmtfReader.read_sequence_file(path).cache()
%%time
structures.count()
CPU times: user 9.6 ms, sys: 3.85 ms, total: 13.4 ms
Wall time: 5.18 s
9756
Now, let's count the number of structures again. Should this be faster, since we already loaded the data?
Not necessarily: the data from the Hadoop Sequence file are streamed through parallel threads, so if you need them again, they must be reloaded from scratch unless they are cached. Note the .cache() method call after reading the MMTF Hadoop Sequence file above.
Remove the .cache() method call, run this notebook again and compare the time it takes to count the number of structures.
%%time
structures.count()
CPU times: user 7.45 ms, sys: 2.9 ms, total: 10.3 ms
Wall time: 900 ms
9756
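You can check whether an RDD has been marked for caching with the is_cached flag (standard PySpark RDD API); after the .cache() call above it returns True. When the cached data are no longer needed, unpersist() releases the memory:

structures.is_cached
# release the cached partitions when no longer needed
# structures.unpersist()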
In this workshop we use a sample set of the PDB with about 10,000 structures. We commented out the lines below, since the smaller sample is sufficient for these tutorials.
To use the entire PDB, the MMTF_FULL and MMTF_REDUCED environment variables need to be set to the locations of the full and reduced MMTF Hadoop Sequence files. See the mmtf-pyspark installation instructions for details.
# %%time
# pdb_full = mmtfReader.read_full_sequence_file();
# pdb_full.count()
# %%time
# pdb_reduced = mmtfReader.read_reduced_sequence_file();
# pdb_reduced.count()
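If the full archive is too large for your machine, you can also work with a random subset. As far as we recall, read_sequence_file accepts optional fraction and seed arguments for sampling; please verify against the mmtf-pyspark API documentation:

# example (assumed parameters; check the mmtfReader docs): read a random 10% sample
# structures_sample = mmtfReader.read_sequence_file(path, fraction=0.1, seed=123)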
It is very important to run the notebook all the way to the spark.stop() statement to terminate Spark. Otherwise you may end up running multiple instances of Spark that will interfere with each other.
spark.stop()