2-JoiningDatasets¶

This tutorial shows how to identify drug molecules in the PDB by joining two datasets:

Drug information from DrugBank
Ligand information from RCSB PDB

In [1]:

from pyspark.sql import SparkSession
from mmtfPyspark.datasets import customReportService, drugBankDataset
from mmtfPyspark.structureViewer import view_binding_site

Configure Spark¶

In [2]:

spark = SparkSession.builder.appName("2-JoiningDatasets").getOrCreate()

Download open DrugBank dataset¶

Download a dataset of drugs from DrugBank and filter out any drugs that do not have an InChIKey. InChIKeys are unique identifiers for small molecules.

DrugBank provides more detailed datasets, e.g., subset of approved drugs, but a DrugBank username and password is required. For this tutorial we use the open DrugBank dataset.

In [3]:

drugs = drugBankDataset.get_open_drug_links()
drugs = drugs.filter("StandardInChIKey IS NOT NULL").cache()
drugs.toPandas().head(5)

Out[3]:

	DrugBankID	AccessionNumbers	Commonname	CAS	UNII	Synonyms	StandardInChIKey
0	DB00006	BIOD00076 \| BTD00076 \| DB02351 \| EXPT03302	Bivalirudin	128270-60-0	TN9BEX005G	Bivalirudin \| Bivalirudina \| Bivalirudinum	OIRCOABEOLEUMC-GEJPAHFPSA-N
1	DB00007	BIOD00009 \| BTD00009	Leuprolide	53714-56-0	EFY6W0M8TG	Leuprorelin \| Leuprorelina \| Leuproreline \| Le...	GFIJNRVAKGFPGQ-LIJARHBVSA-N
2	DB00014	BIOD00113 \| BTD00113	Goserelin	65807-02-5	0F65R8P09N	Goserelin \| Goserelina	BLCLNMBMMGCOAS-URPVMXJPSA-N
3	DB00027	BIOD00036 \| BTD00036	Gramicidin D	1405-97-6	5IE62321P4	Bacillus brevis gramicidin D \| Gramicidin \| Gr...	NDAYQJDHGXTBJL-MWWSRJDJSA-N
4	DB00035	BIOD00061 \| BIOD00112 \| BTD00061 \| BTD00112	Desmopressin	16679-58-6	ENR1LLB0FP	1-(3-mercaptopropionic acid)-8-D-arginine-vaso...	NFLWUMRGJYTJIN-PNIOQBSNSA-N

NOTE: RCSB PDB web services have been depreciated!¶

New functionality needs to be developed to enable this example.

Download ligand annotations from RCSB PDB¶

Here we use RCSB PDB web services to download InChIKeys and molecular weight for ligands in the PDB (this step can be slow!).

We filter out entries without an InChIKey and low molecular weight ligands using SQL syntax.

In [4]:

# ligands = customReportService.get_dataset(["ligandId","InChIKey","ligandMolecularWeight"])

# ligands = ligands.filter("InChIKey IS NOT NULL AND ligandMolecularWeight > 300").cache()

# ligands.toPandas().head(10)

Find drugs in PDB¶

By joining the two datasets on the InChIKey, we get the intersection between the two datasets.

In [5]:

# ligands = ligands.join(drugs, ligands.InChIKey == drugs.StandardInChIKey)

Keep only unique ligands per structure¶

Here we drop rows with the same structureId and ligandId.

In [6]:

# ligands = ligands.dropDuplicates(["structureId","ligandId"]).cache()

Keep only essential columns¶

In [7]:

# ligands = ligands.select("structureId","ligandId","chainId","Commonname")
# ligands.toPandas().head(10)

Visualize drug binding sites¶

Extract id columns as lists (required for visualization)¶

In [8]:

# pdb_ids = ligands.select("structureId").rdd.flatMap(lambda x: x).collect()
# ligand_ids = ligands.select("ligandId").rdd.flatMap(lambda x: x).collect()
# chain_ids = ligands.select("chainId").rdd.flatMap(lambda x: x).collect()

Disable scrollbar for the visualization below

In [9]:

#%%javascript 
#IPython.OutputArea.prototype._should_scroll = function(lines) {return false;}

Show binding site residues within 4.5 A from the drug molecule¶

In [10]:

# view_binding_site(pdb_ids, ligand_ids, chain_ids, distance=4.5);

In [11]:

spark.stop()

In [ ]: