from pyspark.sql import SparkSession
from mmtfPyspark.io import mmtfReader
from mmtfPyspark.filters import ContainsGroup, ContainsLProteinChain, PolymerComposition, Resolution
from mmtfPyspark.structureViewer import view_group_interaction
spark = SparkSession.builder.appName("2-Filtering").getOrCreate()
path = "../resources/mmtf_reduced_sample"
pdb = mmtfReader.read_sequence_file(path).cache()
pdb.count()
9756
Structures can be filtered by resolution and R-free. Each filter takes a minimum and a maximum value. The example below returns structures with a resolution in the inclusive range [0.0, 1.5].
pdb = pdb.filter(Resolution(0.0, 1.5))
pdb.count()
2941
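Conceptually, each of these filters is a callable predicate applied to every (PDB ID, structure) tuple in the RDD. A minimal pure-Python sketch of an inclusive resolution-range filter (no Spark required; the resolution attribute name on the stand-in structure object is illustrative):

```python
from types import SimpleNamespace

class ResolutionRange:
    """Keep entries whose resolution lies in the inclusive range [min_res, max_res]."""
    def __init__(self, min_res, max_res):
        self.min_res = min_res
        self.max_res = max_res

    def __call__(self, t):
        # t is a (pdb_id, structure) tuple; structure.resolution is assumed here
        return self.min_res <= t[1].resolution <= self.max_res

# usage with a plain Python list standing in for the RDD
entries = [("1ABC", SimpleNamespace(resolution=1.2)),
           ("2XYZ", SimpleNamespace(resolution=2.4))]
high_res = list(filter(ResolutionRange(0.0, 1.5), entries))
# high_res keeps only the 1.2 Å entry
```

The real filters plug into Spark the same way: `pdb.filter(...)` hands each tuple to the predicate and keeps the tuples for which it returns True.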
A number of filters are available to select entries by the type of polymer chain. ContainsLProteinChain keeps entries that contain at least one L-protein chain; with exclusive=True, only entries in which every chain is an L-protein chain are kept.
pdb = pdb.filter(ContainsLProteinChain())
pdb.count()
2912
pdb = pdb.filter(ContainsLProteinChain(exclusive=True))
pdb.count()
2788
The PolymerComposition filter restricts entries by their chemical composition; here, exclusively to chains built from the 20 standard amino acids.
pdb = pdb.filter(PolymerComposition(PolymerComposition.AMINO_ACIDS_20, exclusive=True))
pdb.count()
2171
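The difference between the default and exclusive=True modes can be sketched in plain Python (the chain sequences below are made up for illustration):

```python
# one-letter codes of the 20 standard amino acids
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")

def chain_is_standard(chain):
    """True if every residue in the chain is a standard amino acid."""
    return set(chain) <= STANDARD_AA

def contains_standard_chain(chains, exclusive=False):
    # non-exclusive: at least one chain passes; exclusive: every chain must pass
    if exclusive:
        return all(chain_is_standard(c) for c in chains)
    return any(chain_is_standard(c) for c in chains)

mixed = ["MKTAYIAK", "AUGGCU"]      # one protein chain, one RNA-like chain
protein_only = ["MKTAYIAK", "GSHMK"]

# mixed passes the non-exclusive test but fails the exclusive one;
# protein_only passes both
```

This mirrors why the counts shrink with each exclusive filter above: each added constraint removes entries containing any chain that fails it.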
The ContainsGroup filter returns entries that contain the specified chemical component, here ATP. The view_group_interaction method then visualizes the interactions of this group within each structure.
pdb = pdb.filter(ContainsGroup("ATP"))
view_group_interaction(pdb.keys().collect(), "ATP");
Rather than using a pre-made filter, we can create simple filters with lambda expressions. The expression must evaluate to a boolean.
In the lambda expression below, the variable t represents a tuple, and t[1] is the second element of the tuple, which holds the mmtfStructure.
Here, we filter by the number of atoms in an entry. You will learn more about extracting structural information from an mmtfStructure in future tutorials.
pdb = pdb.filter(lambda t: t[1].num_atoms < 500)
pdb.count()
7
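The same tuple-based lambda idiom can be tried without Spark. Each entry is a (PDB ID, structure) pair; the structure object below is a stand-in with only a num_atoms field:

```python
from types import SimpleNamespace

# stand-in entries; atom counts are invented for illustration
entries = [("4AFF", SimpleNamespace(num_atoms=350)),
           ("4CBU", SimpleNamespace(num_atoms=8200)),
           ("1STP", SimpleNamespace(num_atoms=420))]

# keep only small entries, exactly like pdb.filter(lambda t: t[1].num_atoms < 500)
small = [t for t in entries if t[1].num_atoms < 500]
keys = [t[0] for t in small]
# keys == ["4AFF", "1STP"]
```

Spark's RDD.filter applies the same predicate, just distributed across partitions instead of over a local list.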
Or, we can filter by the key, the first element of the tuple: t[0].
Keys are case sensitive; always use upper-case PDB IDs in mmtf-pyspark!
pdb = pdb.filter(lambda t: t[0] in ["4AFF", "4CBU"])
pdb.count()
1
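Key-based filtering works the same way on plain tuples; note the upper-case IDs (the lower-case entry below is deliberately never matched):

```python
# stand-in entries keyed by PDB ID; structure objects omitted for brevity
entries = [("4AFF", None), ("1abc", None), ("4CBU", None)]

wanted = {"4AFF", "4CBU"}
selected = [t for t in entries if t[0] in wanted]
# selected keys: ["4AFF", "4CBU"] — "1abc" fails the case-sensitive comparison
```

In the notebook above only one entry survives because the key filter is applied after all the earlier filters, which had already reduced the RDD to 7 small structures.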
spark.stop()