This tutorial shows how to identify drug molecules in the PDB by joining two datasets: drug information from DrugBank and ligand information from the RCSB PDB.
from pyspark.sql import SparkSession
from mmtfPyspark.datasets import customReportService, drugBankDataset
from mmtfPyspark.structureViewer import view_binding_site
spark = SparkSession.builder.appName("2-JoiningDatasets").getOrCreate()
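Download a dataset of drugs from DrugBank and filter out any drugs that do not have an InChIKey. An InChIKey is a fixed-length hashed identifier derived from a small molecule's structure, so it can serve as a key for matching molecules across databases.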
DrugBank provides more detailed datasets, e.g., a subset of approved drugs, but a DrugBank username and password are required. For this tutorial we use the open DrugBank dataset.
drugs = drugBankDataset.get_open_drug_links()
drugs = drugs.filter("StandardInChIKey IS NOT NULL").cache()
drugs.toPandas().head(5)
| | DrugBankID | AccessionNumbers | Commonname | CAS | UNII | Synonyms | StandardInChIKey |
|---|---|---|---|---|---|---|---|
| 0 | DB00006 | BIOD00076 \| BTD00076 \| DB02351 \| EXPT03302 | Bivalirudin | 128270-60-0 | TN9BEX005G | Bivalirudin \| Bivalirudina \| Bivalirudinum | OIRCOABEOLEUMC-GEJPAHFPSA-N |
| 1 | DB00007 | BIOD00009 \| BTD00009 | Leuprolide | 53714-56-0 | EFY6W0M8TG | Leuprorelin \| Leuprorelina \| Leuproreline \| Le... | GFIJNRVAKGFPGQ-LIJARHBVSA-N |
| 2 | DB00014 | BIOD00113 \| BTD00113 | Goserelin | 65807-02-5 | 0F65R8P09N | Goserelin \| Goserelina | BLCLNMBMMGCOAS-URPVMXJPSA-N |
| 3 | DB00027 | BIOD00036 \| BTD00036 | Gramicidin D | 1405-97-6 | 5IE62321P4 | Bacillus brevis gramicidin D \| Gramicidin \| Gr... | NDAYQJDHGXTBJL-MWWSRJDJSA-N |
| 4 | DB00035 | BIOD00061 \| BIOD00112 \| BTD00061 \| BTD00112 | Desmopressin | 16679-58-6 | ENR1LLB0FP | 1-(3-mercaptopropionic acid)-8-D-arginine-vaso... | NFLWUMRGJYTJIN-PNIOQBSNSA-N |
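Since the StandardInChIKey column is used as the matching key later on, it may help to see what such a key looks like. The sketch below assumes the standard InChIKey layout: a 14-letter block hashed from the connectivity, a 10-letter block, and a single protonation letter, separated by hyphens. The helpers `is_inchikey` and `connectivity_block` are hypothetical names for illustration and are not part of mmtfPyspark.

```python
import re

# Standard InChIKey layout: 14 letters, hyphen, 10 letters, hyphen, 1 letter.
INCHIKEY_PATTERN = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def is_inchikey(value):
    """Return True if value is a string with the InChIKey layout (hypothetical helper)."""
    return isinstance(value, str) and INCHIKEY_PATTERN.fullmatch(value) is not None


def connectivity_block(inchikey):
    """Return the first 14-letter block, which is hashed from the connectivity."""
    return inchikey.split("-")[0]


# Keys copied from the DrugBank rows shown above.
bivalirudin = "OIRCOABEOLEUMC-GEJPAHFPSA-N"
goserelin = "BLCLNMBMMGCOAS-URPVMXJPSA-N"

print(is_inchikey(bivalirudin))       # well-formed key
print(is_inchikey(None))              # missing value, as removed by the NULL filter
print(connectivity_block(goserelin))  # first block only
```

A check like this only validates the layout of a key. It does not confirm that the key corresponds to a real molecule.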
Note: new functionality needs to be developed to enable the rest of this example, so the ligand steps below are shown as commented-out code.
Here we use RCSB PDB web services to download the InChIKey and molecular weight of each ligand in the PDB (this step can be slow!).
Using SQL syntax, we then filter out entries without an InChIKey and ligands with a molecular weight of 300 or less.
# ligands = customReportService.get_dataset(["ligandId","InChIKey","ligandMolecularWeight"])
# ligands = ligands.filter("InChIKey IS NOT NULL AND ligandMolecularWeight > 300").cache()
# ligands.toPandas().head(10)
# ligands = ligands.join(drugs, ligands.InChIKey == drugs.StandardInChIKey)
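Because the ligand download is currently disabled, the following plain-Python sketch illustrates what the filter and the inner join on InChIKey are meant to do. It is not the Spark implementation. The ligand rows (IDs, weights, and the `AAAA...` key) are made-up placeholders, only the two drug keys are copied from the DrugBank table, and `filter_ligands` and `join_on_inchikey` are hypothetical helpers.

```python
# Toy stand-ins for the two Spark DataFrames. Ligand IDs and weights are
# made up for illustration; only the drug InChIKeys come from the table above.
drugs = [
    {"Commonname": "Bivalirudin", "StandardInChIKey": "OIRCOABEOLEUMC-GEJPAHFPSA-N"},
    {"Commonname": "Goserelin", "StandardInChIKey": "BLCLNMBMMGCOAS-URPVMXJPSA-N"},
]
ligands = [
    {"ligandId": "LIG1", "InChIKey": "OIRCOABEOLEUMC-GEJPAHFPSA-N", "ligandMolecularWeight": 500.0},
    {"ligandId": "LIG2", "InChIKey": None, "ligandMolecularWeight": 450.0},
    {"ligandId": "LIG3", "InChIKey": "BLCLNMBMMGCOAS-URPVMXJPSA-N", "ligandMolecularWeight": 150.0},
    {"ligandId": "LIG4", "InChIKey": "AAAAAAAAAAAAAA-BBBBBBBBBB-N", "ligandMolecularWeight": 800.0},
]


def filter_ligands(rows, min_weight=300):
    """Mirror of the SQL filter: InChIKey IS NOT NULL AND ligandMolecularWeight > 300."""
    return [
        row for row in rows
        if row["InChIKey"] is not None and row["ligandMolecularWeight"] > min_weight
    ]


def join_on_inchikey(ligand_rows, drug_rows):
    """Inner join: keep only ligands whose InChIKey matches a drug's StandardInChIKey."""
    drugs_by_key = {}
    for drug in drug_rows:
        drugs_by_key.setdefault(drug["StandardInChIKey"], []).append(drug)
    joined = []
    for ligand in ligand_rows:
        for drug in drugs_by_key.get(ligand["InChIKey"], []):
            joined.append({**ligand, **drug})
    return joined


matches = join_on_inchikey(filter_ligands(ligands), drugs)
for row in matches:
    print(row["ligandId"], row["Commonname"])
```

In this toy data only LIG1 survives: LIG2 has no InChIKey, LIG3 is below the weight cutoff, and LIG4 has no matching drug.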
Here we drop duplicate rows that share the same structureId and ligandId, then keep only the structureId, ligandId, chainId, and Commonname columns.
# ligands = ligands.dropDuplicates(["structureId","ligandId"]).cache()
# ligands = ligands.select("structureId","ligandId","chainId","Commonname")
# ligands.toPandas().head(10)
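The effect of dropping duplicates on a subset of columns can be sketched in plain Python as follows. The rows are placeholders rather than real PDB entries, and `drop_duplicates` is a hypothetical helper that always keeps the first row it sees. Spark's `dropDuplicates` makes no such ordering promise, so the chainId that survives there may differ.

```python
def drop_duplicates(rows, keys):
    """Keep the first row seen for each combination of the given key columns."""
    seen = set()
    unique = []
    for row in rows:
        marker = tuple(row[key] for key in keys)
        if marker not in seen:
            seen.add(marker)
            unique.append(row)
    return unique


# Placeholder rows: the same ligand bound to two chains of structure S1.
rows = [
    {"structureId": "S1", "ligandId": "LIG1", "chainId": "A", "Commonname": "DrugX"},
    {"structureId": "S1", "ligandId": "LIG1", "chainId": "B", "Commonname": "DrugX"},
    {"structureId": "S2", "ligandId": "LIG1", "chainId": "A", "Commonname": "DrugX"},
]

unique = drop_duplicates(rows, ["structureId", "ligandId"])
print([(row["structureId"], row["chainId"]) for row in unique])
```

The result keeps one row per (structureId, ligandId) pair, so each drug-bound structure is visualized once even when the ligand occurs in several chains.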
# pdb_ids = ligands.select("structureId").rdd.flatMap(lambda x: x).collect()
# ligand_ids = ligands.select("ligandId").rdd.flatMap(lambda x: x).collect()
# chain_ids = ligands.select("chainId").rdd.flatMap(lambda x: x).collect()
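The three lists above are consumed as parallel arrays, so entry i of each list must describe the same row. One way to make that alignment explicit is to collect the three columns once and unzip the result. The sketch below shows the unzip step on placeholder tuples, with `unzip_columns` as a hypothetical helper. With Spark, the tuples would presumably come from a single `collect()` on the three-column selection.

```python
def unzip_columns(rows, n_columns=3):
    """Split a list of equal-length row tuples into one list per column."""
    if not rows:
        return tuple([] for _ in range(n_columns))
    return tuple(list(column) for column in zip(*rows))


# Placeholder rows in (structureId, ligandId, chainId) order.
rows = [("S1", "LIG1", "A"), ("S2", "LIG1", "A"), ("S3", "LIG2", "B")]

pdb_ids, ligand_ids, chain_ids = unzip_columns(rows)
print(pdb_ids)
print(ligand_ids)
print(chain_ids)
```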
Disable the scrollbar for the visualization below.
#%%javascript
#IPython.OutputArea.prototype._should_scroll = function(lines) {return false;}
# view_binding_site(pdb_ids, ligand_ids, chain_ids, distance=4.5);
spark.stop()