SureChEMBL iPython Notebook Tutorial

An introduction to patent chemoinformatics using SureChEMBL data and the RDKit toolkit

George Papadatos, ChEMBL group, EMBL-EBI

In this tutorial:

  1. Read a file that contains all chemistry extracted from the Levitra US patent (US6566360) along with all the other members of the same patent family.
  2. Filter by different text-mining and chemoinformatics properties to remove noise and enrich the genuinely novel structures claimed in the patent documents.
  3. Visualise the chemical space using MDS and dimensionality reduction.
    i. Identify outliers in the chemistry space.
    ii. Fix wrongly extracted structures.
  4. Find the Murcko and MCS scaffolds of the claimed compounds in the patent family.
    i. Compare the derived core with the actual Markush structure as reported in the original patent document.
  5. Identify key compounds using structural information only. This is based on the publications by Hattori et al. and Tyrchan et al..
    i. Count the number of NNs per compound.
    ii. Perform R-Group decomposition.
In [80]:
import matplotlib.pyplot as plt
import pandas as pd

from IPython.display import display
from IPython.display import display_pretty, display_html, HTML, Javascript, Image

from rdkit.Chem import PandasTools
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import AllChem as Chem
from rdkit.Chem import DataStructs
from rdkit.Chem import MCS
from rdkit.Chem import Draw
Draw.DrawingOptions.elemDict[0]=(0.,0.,0.)  # draw dummy atoms in black

from itertools import cycle
from sklearn import manifold
from scipy.spatial.distance import *

import mpld3

import warnings
In [81]:
from rdkit import rdBase

This is the base URL of the publicly accessible ChEMBL Beaker server.

In [82]:
base_url = ''
In [83]:
pd.options.display.mpl_style = 'default'
pd.options.display.float_format = '{:.2f}'.format
In [84]:
rcParams['figure.figsize'] = 16,10

1. Read and filter the input file

The file was generated by extracting all chemistry from a list of patent documents in SureChEMBL. The Lucene query used to retrieve the list of relevant patents was: pn:"US6566360".

In [85]:
df = pd.read_csv('document chemistry_20141011_114421_271.csv',sep=',')

Let's check the contents:

In [86]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8497 entries, 0 to 8496
Data columns (total 28 columns):
Patent ID                    8497 non-null object
Annotation Reference         8497 non-null object
Chemical ID                  8497 non-null int64
SMILES                       8497 non-null object
Type                         8497 non-null object
Chemical Document Count      8497 non-null int64
Annotation Document Count    3372 non-null float64
Title Count                  3372 non-null float64
Abstract Count               3372 non-null float64
Claims Count                 3372 non-null float64
Description Count            3372 non-null float64
Chemical Corpus Count        8497 non-null int64
Annotation Corpus Count      3372 non-null float64
Molecular Weight             8497 non-null float64
Med Chem Alert               8497 non-null int64
Log P                        8497 non-null float64
Donor Count                  8497 non-null int64
Acceptor Count               8497 non-null int64
Ring Count                   8497 non-null int64
Rotatable Bound Count        8497 non-null int64
Radical                      8497 non-null int64
Fragment                     8497 non-null int64
Connected                    8497 non-null int64
Singleton                    8497 non-null int64
Simple                       8497 non-null int64
Lipinski                     8497 non-null int64
Lead Likeness                8497 non-null int64
Bio Availability             8497 non-null int64
dtypes: float64(8), int64(16), object(4)
In [87]:
(8497, 28)

Add a SCHEMBL ID column and sort:

In [88]:
df['SCHEMBL ID'] = df['Chemical ID'].map(lambda x: 'SCHEMBL{0}'.format(x))
df = df.sort('Chemical ID')
In [89]:

First round of filtering: Novel compounds appear in the description or claims section of the document. Alternatively, they are extracted from images or mol files

In [90]:
dff = df[(df['Claims Count'] > 0) | (df['Description Count'] > 0) | (df['Type'] != "TEXT")]
In [91]:
(8484, 29)

Second round of filtering: Simple physicochemical properties and counts

In [92]:
dff = dff[(dff['Rotatable Bound Count'] < 15) & (dff['Ring Count'] > 0) & (df['Radical'] == 0) & (df['Singleton'] == 0) & (df['Simple'] == 0)]
In [93]:
(8218, 29)
In [94]:
dff = dff[(dff['Molecular Weight'] >= 300) & (dff['Molecular Weight'] <= 800) & (dff['Log P'] > 0) & (dff['Log P'] < 7)]
In [95]:
(6711, 29)

Third round of filtering: Using Global Corpus Count

In [96]:
dff = dff[(dff['Chemical Corpus Count'] < 400) & ((dff['Annotation Corpus Count'] < 400) | (dff['Annotation Corpus Count'].isnull()))]
In [97]:
(6459, 29)

Keep largest fragment and convert SMILES to RDKit molecules

In [98]:
dff['SMILES'] = dff['SMILES'].map(lambda x: max(x.split('.'), key=len))
In [99]:
PandasTools.AddMoleculeColumnToFrame(dff, smilesCol = 'SMILES')
In [100]:

Fourth round of filtering: Remove duplicates based on Chemical IDs and InChI keys

In [101]:
dff['InChI'] = dff['ROMol'].map(Chem.MolToInchi)
dff['InChIKey'] = dff['InChI'].map(Chem.InchiToInchiKey)
In [102]:
In [103]:
dff = dff.drop_duplicates('Chemical ID')
In [104]:
(401, 32)
In [105]:
dff = dff.drop_duplicates('InChIKey')
In [106]:
(394, 32)

Wow, that was a lot of duplicates. This is because the same compounds come from different patents in the same patent family. In addition, in US patents a compound may come from 3 different sources: text, image and mol file.

Fifth round of filtering: Remove Boron-containing compounds as they are likely to be reactants.

In [107]:
dff = dff.ix[~(dff['ROMol'] >= Chem.MolFromSmarts('[#5]'))]
In [108]:
(394, 32)
In [109]:
dff = dff.set_index('Chemical ID', drop=False)

That was the end of the filters - let's see a summary of the remaining compounds and their patent document source:

In [110]:
dff_counts = dff[['Patent ID','ROMol']].groupby('Patent ID').count()
In [111]:
dff_counts['Link'] = x: '<a href="{0}/" target="_blank">{0}</a>'.format(x))
In [112]:
dff_counts = dff_counts.rename(columns={'ROMol':'# Compounds'})
In [113]:
# Compounds Link
Patent ID
EP-1049695-A1 2 EP-1049695-A1
EP-1049695-B1 7 EP-1049695-B1
EP-1174431-A2 1 EP-1174431-A2
EP-2295436-A1 68 EP-2295436-A1
US-20040067945-A1 4 US-20040067945-A1
US-20050070541-A1 2 US-20050070541-A1
US-20060189615-A1 2 US-20060189615-A1
US-20080113972-A1 34 US-20080113972-A1
US-20100016323-A1 29 US-20100016323-A1
US-20110009367-A1 43 US-20110009367-A1
US-20130059844-A1 46 US-20130059844-A1
US-6362178-B1 6 US-6362178-B1
US-6566360-B1 4 US-6566360-B1
US-6890922-B2 2 US-6890922-B2
US-7122540-B2 7 US-7122540-B2
US-7314871-B2 43 US-7314871-B2
US-7696206-B2 32 US-7696206-B2
US-7704999-B2 59 US-7704999-B2
WO-1999024433-A1 3 WO-1999024433-A1

Right, is Levitra included in this set? ChEMBL tells me that its InChiKey is 'SECKRCOLJRRGGV-UHFFFAOYSA-N'.

In [114]:
Chemical ID
5441 SCHEMBL5441 Mol

Good - it is...

2. Visualise the patent chemical space - MDS

Preparation of the pairwise distance matrix.

In [115]:
fps = [Chem.GetMorganFingerprintAsBitVect(m,2,nBits=2048) for m in dff['ROMol']]
In [116]:
dist_mat = squareform(pdist(fps,'jaccard'))
In [117]:
mds = manifold.MDS(n_components=2, dissimilarity="precomputed", random_state=3, n_jobs = 2)
In [118]:
results =
In [119]:
coords = results.embedding_
In [120]:
dff['X'] = [c[0] for c in coords]
In [121]:
dff['Y'] = [c[1] for c in coords]
In [122]:
csss = """
  border-collapse: collapse;
  color: #ffffff;
  background-color: #848482;
  background-color: #f2f3f4;
table, th, td
  font-family:Arial, Helvetica, sans-serif;
  border: 1px solid black;
  text-align: right;
In [123]:
import base64
In [124]:
dff['base64smi'] = dff['SMILES'].map(base64.b64encode)

Let's produce a scatter plot of the chemical space - points represent compounds, color-coded by the patent document they were found in. Thanks to mpld3, the scatter plot is interactive with live structure rendering calls to the local myChEMBL Beaker server.

In [125]:
fig, ax = plt.subplots()
fig.set_size_inches(14.0, 12.0)
#colors = cycle('bgrcmykwbgrcmykbgrcmykwbgrcmykw')
colors = cycle(cm.Dark2(np.linspace(0,1,len(dff_counts.index))))
for name, group in dff.groupby('Patent ID'):
    labels = []
    for i in group.index:
        zz = group.ix[[i],['Patent ID','SCHEMBL ID','base64smi']]
        zz['mol'] = zz['base64smi'].map(lambda x: '<img src="https://{0}/utils/smiles2image/{1}/300">'.format(base_url,x))
        del zz['base64smi']
        label = zz.T
        del zz
        label.columns = ['Row: {}'.format(i)]
    points = ax.scatter(group['X'], group['Y'],, s=80, alpha=0.8)
    tooltip = mpld3.plugins.PointHTMLTooltip(points, labels, voffset=10, hoffset=10, css=csss)