This tutorial demonstrates how to use map and reduce to get the number of atoms in each structure and then sum those counts over a whole dataset.
from pyspark.sql import SparkSession
from mmtfPyspark.io import mmtfReader
spark = SparkSession.builder.appName("5-MapReduce").getOrCreate()
path = "../resources/mmtf_full_sample"
pdb = mmtfReader.read_sequence_file(path)
Use a lambda expression to get the number of atoms for each entry. The variable t represents a tuple (PDB ID, mmtfStructure), so t[1] is the structure.
num_atoms = pdb.map(lambda t: t[1].num_atoms)
Print the number of atoms for the first 10 entries.
num_atoms.take(10)
[1793, 2731, 3056, 2275, 4238, 5436, 3596, 4310, 1386, 1139]
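The map step can be tried locally without Spark. The following is a minimal sketch, not part of the mmtfPyspark API: `FakeStructure` is a hypothetical stand-in that models only the `num_atoms` attribute, and the IDs `ID1` to `ID3` are placeholders rather than real PDB IDs. The atom counts are the first three values printed above.

```python
from collections import namedtuple

# Hypothetical stand-in for an mmtfStructure; only num_atoms is modeled.
FakeStructure = namedtuple("FakeStructure", ["num_atoms"])

# Placeholder IDs paired with the first three atom counts printed above.
entries = [
    ("ID1", FakeStructure(1793)),
    ("ID2", FakeStructure(2731)),
    ("ID3", FakeStructure(3056)),
]

# Same lambda as in pdb.map: t[0] is the ID, t[1] is the structure.
local_num_atoms = list(map(lambda t: t[1].num_atoms, entries))
print(local_num_atoms)
```

Unlike the built-in `map`, the RDD `map` is lazy and distributed: nothing is computed until an action such as `take` or `reduce` is called.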
Use the reduce method with a summation function, defined as a lambda expression, to total the atom counts over all entries.
num_atoms.reduce(lambda a, b: a+b)
34248081
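To illustrate what reduce does, here is a small local sketch using `functools.reduce` on the 10 sample counts printed earlier. The split into two "partitions" is an assumption made for illustration only; it mimics how Spark reduces each partition separately and then combines the partial results, which is why the function passed to reduce should be associative and commutative.

```python
from functools import reduce
from operator import add

# The 10 atom counts printed by num_atoms.take(10).
sample = [1793, 2731, 3056, 2275, 4238, 5436, 3596, 4310, 1386, 1139]

# Same summation lambda as in the Spark call, applied locally.
total = reduce(lambda a, b: a + b, sample)

# Illustrative partitioning (assumed split): reduce each part, then combine.
partitions = [sample[:5], sample[5:]]
partials = [reduce(add, part) for part in partitions]
combined = reduce(add, partials)

print(total, partials, combined)
```

Because addition is associative and commutative, `total` and `combined` agree no matter how the data is split. For a plain sum, the RDD method `num_atoms.sum()` would be an alternative to the explicit lambda.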
spark.stop()