This tutorial shows how to use Spark Datasets to retrieve metadata about PDB structures. mmtfPyspark provides a number of modules to fetch data from external resources.
Here we download and analyze PDB metadata from the SIFTS project as Spark Datasets.
from pyspark.sql import SparkSession
from pyspark.sql.functions import substring_index
from mmtfPyspark.datasets import pdbjMineDataset
import matplotlib.pyplot as plt
spark = SparkSession.builder.appName("1-Metadata").getOrCreate()
The SIFTS project maintains up-to-date mappings of protein chains in the PDB to Enzyme Classification (EC) numbers. We use pdbjMineDataset to retrieve these mappings. An extensive demo shows how to query SIFTS data with pdbjMineDataset.
query = "SELECT * FROM sifts.pdb_chain_enzyme"
enzymes = pdbjMineDataset.get_dataset(query).cache()
enzymes.show()
+-----+-----+---------+---------+----------------+
|pdbid|chain|accession|ec_number|structureChainId|
+-----+-----+---------+---------+----------------+
| 102L|    A|   P00720| 3.2.1.17|          102L.A|
| 103L|    A|   P00720| 3.2.1.17|          103L.A|
| 104L|    A|   P00720| 3.2.1.17|          104L.A|
| 104L|    B|   P00720| 3.2.1.17|          104L.B|
| 107L|    A|   P00720| 3.2.1.17|          107L.A|
| 108L|    A|   P00720| 3.2.1.17|          108L.A|
| 109L|    A|   P00720| 3.2.1.17|          109L.A|
| 10GS|    A|   P09211| 2.5.1.18|          10GS.A|
| 10GS|    B|   P09211| 2.5.1.18|          10GS.B|
| 10MH|    A|   P05102| 2.1.1.37|          10MH.A|
| 110L|    A|   P00720| 3.2.1.17|          110L.A|
| 111L|    A|   P00720| 3.2.1.17|          111L.A|
| 112L|    A|   P00720| 3.2.1.17|          112L.A|
| 113L|    A|   P00720| 3.2.1.17|          113L.A|
| 114L|    A|   P00720| 3.2.1.17|          114L.A|
| 115L|    A|   P00720| 3.2.1.17|          115L.A|
| 117E|    A|   P00817|  3.6.1.1|          117E.A|
| 117E|    B|   P00817|  3.6.1.1|          117E.B|
| 118L|    A|   P00720| 3.2.1.17|          118L.A|
| 119L|    A|   P00720| 3.2.1.17|          119L.A|
+-----+-----+---------+---------+----------------+
only showing top 20 rows
enzymes.toPandas().head(20)
| | pdbid | chain | accession | ec_number | structureChainId |
---|---|---|---|---|---|
0 | 102L | A | P00720 | 3.2.1.17 | 102L.A |
1 | 103L | A | P00720 | 3.2.1.17 | 103L.A |
2 | 104L | A | P00720 | 3.2.1.17 | 104L.A |
3 | 104L | B | P00720 | 3.2.1.17 | 104L.B |
4 | 107L | A | P00720 | 3.2.1.17 | 107L.A |
5 | 108L | A | P00720 | 3.2.1.17 | 108L.A |
6 | 109L | A | P00720 | 3.2.1.17 | 109L.A |
7 | 10GS | A | P09211 | 2.5.1.18 | 10GS.A |
8 | 10GS | B | P09211 | 2.5.1.18 | 10GS.B |
9 | 10MH | A | P05102 | 2.1.1.37 | 10MH.A |
10 | 110L | A | P00720 | 3.2.1.17 | 110L.A |
11 | 111L | A | P00720 | 3.2.1.17 | 111L.A |
12 | 112L | A | P00720 | 3.2.1.17 | 112L.A |
13 | 113L | A | P00720 | 3.2.1.17 | 113L.A |
14 | 114L | A | P00720 | 3.2.1.17 | 114L.A |
15 | 115L | A | P00720 | 3.2.1.17 | 115L.A |
16 | 117E | A | P00817 | 3.6.1.1 | 117E.A |
17 | 117E | B | P00817 | 3.6.1.1 | 117E.B |
18 | 118L | A | P00720 | 3.2.1.17 | 118L.A |
19 | 119L | A | P00720 | 3.2.1.17 | 119L.A |
Here we select a single protein chain for each unique UniProt accession number.
enzymes = enzymes.dropDuplicates(["accession"])
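As a rough plain-Python sketch of what this step does (not how Spark implements it), the helper below keeps one row per accession. The function name `first_per_accession` and the dictionary layout are illustrative assumptions; the sample values are copied from the table above. This sketch keeps the first row it encounters, whereas with Spark's `dropDuplicates` you should not rely on which of the duplicate rows is retained.

```python
def first_per_accession(rows):
    """Keep the first row seen for each accession (hypothetical helper)."""
    kept = {}
    for row in rows:
        # setdefault stores the row only if this accession has not been seen yet
        kept.setdefault(row["accession"], row)
    return list(kept.values())


# A few rows taken from the SIFTS output shown earlier
sample_rows = [
    {"pdbid": "102L", "chain": "A", "accession": "P00720"},
    {"pdbid": "103L", "chain": "A", "accession": "P00720"},
    {"pdbid": "10GS", "chain": "A", "accession": "P09211"},
    {"pdbid": "10GS", "chain": "B", "accession": "P09211"},
    {"pdbid": "10MH", "chain": "A", "accession": "P05102"},
]

unique_rows = first_per_accession(sample_rows)
for row in unique_rows:
    print(row["pdbid"], row["chain"], row["accession"])
```

Five chains collapse to three rows here, one per UniProt accession.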
We use the withColumn method to add new columns and the substring_index function to extract the first level (enzymeType) and the first two levels (enzymeSubtype) of the EC number hierarchy.
enzymes = enzymes.withColumn("enzymeType", substring_index(enzymes.ec_number, '.', 1))
enzymes = enzymes.withColumn("enzymeSubtype", substring_index(enzymes.ec_number, '.', 2))
enzymes.toPandas().head(20)
| | pdbid | chain | accession | ec_number | structureChainId | enzymeType | enzymeSubtype |
---|---|---|---|---|---|---|---|
0 | 3P96 | A | A0QJI1 | 3.1.3.3 | 3P96.A | 3 | 3.1 |
1 | 2IXA | A | A4Q8F7 | 3.2.1.49 | 2IXA.A | 3 | 3.2 |
2 | 6IDO | A | A6TGP0 | 2.7.7.6 | 6IDO.A | 2 | 2.7 |
3 | 3FIE | A | A7GBG3 | 3.4.24.69 | 3FIE.A | 3 | 3.4 |
4 | 4Z38 | A | A7Z470 | 2.3.1.39 | 4Z38.A | 2 | 2.3 |
5 | 6DRE | B | A8GG78 | 2.4.2.31 | 6DRE.B | 2 | 2.4 |
6 | 3ZH4 | A | B1IBM3 | 2.5.1.7 | 3ZH4.A | 2 | 2.5 |
7 | 6CYZ | A | B1MLU6 | 2.7.1.39 | 6CYZ.A | 2 | 2.7 |
8 | 2FUQ | A | C6XZB6 | 4.2.2.7 | 2FUQ.A | 4 | 4.2 |
9 | 3C61 | A | D0VWT2 | 1.3.98.1 | 3C61.A | 1 | 1.3 |
10 | 5MZ6 | 1 | G5ED39 | 3.4.22.49 | 5MZ6.1 | 3 | 3.4 |
11 | 5A2A | A | I1VWH9 | 3.2.1.1 | 5A2A.A | 3 | 3.2 |
12 | 5J1S | A | O14656 | 3.6.4.- | 5J1S.A | 3 | 3.6 |
13 | 2J63 | A | O15922 | 4.2.99.18 | 2J63.A | 4 | 4.2 |
14 | 4Q5R | A | O18598 | 2.5.1.18 | 4Q5R.A | 2 | 2.5 |
15 | 5BSM | A | O24146 | 6.2.1.12 | 5BSM.A | 6 | 6.2 |
16 | 1KUF | A | O57413 | 3.4.24.- | 1KUF.A | 3 | 3.4 |
17 | 3ABD | X | O60673 | 2.7.7.7 | 3ABD.X | 2 | 2.7 |
18 | 3H0L | A | O66610 | 6.3.5.7 | 3H0L.A | 6 | 6.3 |
19 | 2PCJ | A | O66646 | 7.6.2.- | 2PCJ.A | 7 | 7.6 |
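To make the extraction concrete, here is a minimal plain-Python sketch of what `substring_index(col, '.', n)` returns for a positive `n`: everything before the n-th dot. The helper name `ec_prefix` is hypothetical and is not part of pyspark or mmtfPyspark; it only handles positive level counts.

```python
def ec_prefix(ec_number, levels):
    """Return the first `levels` dot-separated fields of an EC number.

    Hypothetical helper that mimics substring_index(col, '.', levels)
    for levels > 0 only.
    """
    if levels <= 0:
        raise ValueError("this sketch only handles positive level counts")
    return ".".join(ec_number.split(".")[:levels])


# EC numbers taken from the table above, including partially assigned ones
for ec in ["3.1.3.3", "3.4.24.-", "7.6.2.-"]:
    print(ec, ec_prefix(ec, 1), ec_prefix(ec, 2))
```

Partially assigned EC numbers such as `3.4.24.-` still yield a valid type and subtype, because the unassigned fields sit at the lower levels of the hierarchy.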
# Count the chains per top-level EC class, most frequent first
counts = enzymes.groupBy("enzymeType")\
    .count()\
    .sort("count", ascending=False)\
    .toPandas()
counts
| | enzymeType | count |
---|---|---|
0 | 2 | 4911 |
1 | 3 | 4635 |
2 | 1 | 2799 |
3 | 4 | 1203 |
4 | 5 | 805 |
5 | 6 | 620 |
6 | 7 | 261 |
counts.plot(x='enzymeType', y='count', kind='bar');
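The digits on the x-axis are the top-level EC classes. The sketch below attaches the standard EC class names to the counts from the table above and prints each class's share of the total. The dictionary `EC_CLASS_NAMES` and the other variable names are illustrative and not part of mmtfPyspark, and the counts are copied from this run, so they will differ as SIFTS is updated.

```python
# Names of the seven top-level EC classes (standard EC nomenclature)
EC_CLASS_NAMES = {
    "1": "Oxidoreductases",
    "2": "Transferases",
    "3": "Hydrolases",
    "4": "Lyases",
    "5": "Isomerases",
    "6": "Ligases",
    "7": "Translocases",
}

# Counts copied from the table above (one chain per UniProt accession)
counts_by_type = {
    "2": 4911,
    "3": 4635,
    "1": 2799,
    "4": 1203,
    "5": 805,
    "6": 620,
    "7": 261,
}

total = sum(counts_by_type.values())

# Print classes from most to least frequent with their share of the total
for ec_class, n in sorted(counts_by_type.items(), key=lambda kv: kv[1], reverse=True):
    share = 100.0 * n / total
    print(f"EC {ec_class}  {EC_CLASS_NAMES[ec_class]:<16} {n:>5}  {share:5.1f}%")
```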
spark.stop()