Structures can be saved as single files in MMTF format, or multiple structures can be saved in an MMTF Hadoop Sequence file.
from pyspark.sql import SparkSession
from mmtfPyspark.io import mmtfReader, mmtfWriter
from mmtfPyspark.webfilters import Pisces
from mmtfPyspark.mappers import StructureToPolymerChains
spark = SparkSession.builder.appName("6-Output").getOrCreate()
path = "../resources/mmtf_reduced_sample"
pdb = mmtfReader.read_sequence_file(path)
Many analyses of the PDB use the PISCES CulledPDB sets maintained by the Dunbrack lab. A CulledPDB set is selected by specifying sequenceIdentity and resolution cutoff values.
In the example below, we create a high-resolution, non-redundant set of protein chains with a 20% sequence identity threshold and a 1.6 Å resolution cutoff.
nr_chains = (
    pdb
    # filter at the structure level: keep entries in the PISCES culled set
    .filter(Pisces(sequenceIdentity=20, resolution=1.6))
    # split each structure into its individual polymer chains
    .flatMap(StructureToPolymerChains())
    # filter again at the chain level, since PISCES culls individual chains
    .filter(Pisces(sequenceIdentity=20, resolution=1.6))
)
nr_chains.count()
2004
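The flatMap step above expands each structure into multiple chain entries and flattens the result into a single dataset. The sketch below mimics this with plain Python (no Spark); the PDB IDs, chain IDs, and the `structure_to_polymer_chains` helper are made up for illustration and are not the mmtfPyspark API.

```python
# Conceptual sketch of flatMap with StructureToPolymerChains:
# each (pdbId, structure) entry yields one (pdbId.chainId, chain) pair
# per polymer chain, and the lists are flattened into one collection.

structures = [
    ("1ABC", ["A", "B"]),   # hypothetical entry with two polymer chains
    ("2XYZ", ["A"]),        # hypothetical entry with one polymer chain
]

def structure_to_polymer_chains(entry):
    """Mimic StructureToPolymerChains: one (pdbId.chainId, chain) per chain."""
    pdb_id, chains = entry
    return [(f"{pdb_id}.{c}", c) for c in chains]

# flatMap = map each entry to a list, then flatten the lists
chains = [pair for entry in structures for pair in structure_to_polymer_chains(entry)]
print(chains)
# [('1ABC.A', 'A'), ('1ABC.B', 'B'), ('2XYZ.A', 'A')]
```

This is why the Pisces filter is applied twice: once on whole structures to cut the dataset down early, and once on the flattened chains, since the PISCES set culls individual chains rather than entire structures.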
If we need to use a set of structures multiple times, or want to create a snapshot in time (e.g., for reproducible analysis), we can save structures to an MMTF Hadoop Sequence File.
Here, we save the PISCES subset we've just created. Note: if the output file already exists, you must delete it first.
# Writing is temporarily disabled due to ongoing work on encoder and decoder libraries
# mmtfWriter.write_sequence_file(path + "_pices20_1.6", nr_chains)
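The delete-first requirement can be handled with the standard library before writing. This is a sketch, not part of the mmtfPyspark API; the output path here is a temporary stand-in for the real sequence-file directory.

```python
import os
import shutil
import tempfile

# Hadoop Sequence Files are written as directories, and Spark refuses to
# overwrite an existing output path, so remove any old copy before writing.
out_path = os.path.join(tempfile.mkdtemp(), "example_output")
os.makedirs(out_path)                # simulate a leftover output directory

if os.path.exists(out_path):
    shutil.rmtree(out_path)          # delete the old output first

print(os.path.exists(out_path))
# False
```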
Read the file back and check that it contains the correct number of chains.
# nr_chains = mmtfReader.read_sequence_file(path + "_pices20_1.6")
# nr_chains.count()
spark.stop()