Notebook

In [ ]:

#helper functions

from rdkit import Chem
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
from rdkit.Chem import AllChem

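def depict(text):
    # Reaction SMARTS/SMIRKS contain ">>"; anything else is treated as SMILES.
    if ">>" in text:
        rxn = AllChem.ReactionFromSmarts(text)
        return Draw.ReactionToImage(rxn)
    # Returning the Mol lets IPythonConsole render it inline.
    return Chem.MolFromSmiles(text)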

def showMMPs(in_string):
    f = in_string.split(",")
    
    rxn =f[-2].split(">>")      
        
    mols=[]
    ids=[]
    
    mols.append( Chem.MolFromSmiles(f[-6]) )
    mols.append( Chem.MolFromSmiles(f[-5]) )
    mols.append( Chem.MolFromSmiles(rxn[0]) )
    mols.append( Chem.MolFromSmiles(rxn[1]) )
    mols.append( Chem.MolFromSmiles(f[-1]) )
    # Legends must follow the molecule order: MMP1_ID is f[-4], MMP2_ID is f[-3].
    ids.append(f[-4])
    ids.append(f[-3])
    ids.append("LHS")  
    ids.append("RHS")  
    ids.append("CONTEXT")  
    
    return Draw.MolsToGridImage(mols,molsPerRow=6,legends=ids)

def showLine(in_string):
    f = in_string.split(",")
    
    mols=[]
    ids=[]
    
    mols.append( Chem.MolFromSmiles(f[0]) )
    mols.append( Chem.MolFromSmiles(f[1]) )
    mols.append( Chem.MolFromSmiles(f[4]) )
    mols.append( Chem.MolFromSmiles(f[5]) )
    ids.append("Query:%s" % f[2])
    ids.append(f[3])
    ids.append("CHANGE")
    ids.append("CONTEXT")   
    
    return Draw.MolsToGridImage(mols,molsPerRow=4,legends=ids)

Generating and searching an MMP database

The pair index used in the MMP identification algorithm can be written to a relational database. The indexing.py program described earlier holds the index in memory and identifies all the MMPs in the dataset. However, if you only want to ask one or more specific questions of a dataset, a relational database containing the pair index (an MMP db) can be used instead.
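
To make the idea concrete, here is a minimal sketch of a pair index held in a relational table, using Python's built-in sqlite3 module. The table name, column names and the find_pairs helper are assumptions made for illustration; they are not the schema that create_mmp_db.py writes. Two compounds form a pair when they share the same context, which becomes a self-join on the context column.

```python
import sqlite3

# Hypothetical toy index: one row per (context, changed fragment, compound id).
rows = [
    ("[*:1]CNc1ncnc2sccc21", "[*:1]c1ccccc1", "2139597"),
    ("[*:1]CNc1ncnc2sccc21", "[*:1]c1cccnc1", "2531831"),
    ("[*:1]NC(=O)C1COc2ccccc2O1", "[*:1]c1ccc(C(N)=O)cc1", "2787356"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pair_index (context TEXT, core TEXT, cmpd_id TEXT)")
conn.execute("CREATE INDEX idx_context ON pair_index (context)")
conn.executemany("INSERT INTO pair_index VALUES (?, ?, ?)", rows)


def find_pairs(conn, query_id):
    """Return (partner_id, query_core, partner_core, context) rows for one compound."""
    sql = """
        SELECT b.cmpd_id, a.core, b.core, a.context
        FROM pair_index AS a
        JOIN pair_index AS b
          ON a.context = b.context AND a.cmpd_id != b.cmpd_id
        WHERE a.cmpd_id = ?
    """
    return conn.execute(sql, (query_id,)).fetchall()


pairs = find_pairs(conn, "2531831")
print(pairs)
```

Only the rows that share the query compound's context are touched, which is why a stored index suits one-off questions better than re-indexing the whole dataset.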

The program create_mmp_db.py builds an MMP db for a given dataset, and the program search_mmp_db.py searches it. The following types of search can be performed on the db:

Find all MMPs of an input/query compound to the compounds in the db
Find all MMPs in the db where the LHS of the transform matches an input substructure
Find all MMPs that match the input transform/SMIRKS
Find all MMPs in the db where the LHS of the transform matches an input SMARTS
Find all MMPs that match the LHS and RHS SMARTS of the input transform

The SMARTS searching utilises the DbCLI tools (http://code.google.com/p/rdkit/wiki/UsingTheDbCLI) that are part of the RDKit distribution.

Generating the db

To generate an MMP db use the following command:

python $RDBASE/Contrib/mmpa/create_mmp_db.py <FRAGMENT_OUTPUT

The program takes a FRAGMENT_OUTPUT generated by the rfrag.py command (described above) as input.

This program has several options (see help from program below):

Usage: create_mmp_db.py [options]

Program to create an MMP db.

Options:
  -h, --help        show this help message and exit
  -p PREFIX, --prefix=PREFIX
                    Prefix to use for the db file (and directory for
                    SMARTS index). DEFAULT=mmp
  -m MAXSIZE, --maxsize=MAXSIZE
                    Maximum size of change (in heavy atoms) that is stored
                    in the database. DEFAULT=15.
                    Note: Any MMPs that involve a change greater than this
                    value will not be stored in the database and hence not
                    be identified in the searching.
  -s, --smarts      Build SMARTS db so can perform SMARTS searching
                    against db. Note: Will make the build process somewhat
                    slower.

Let's build an MMP db:

In [ ]:

cd t2_files/

In [ ]:

ls

In [ ]:

!python $RDBASE/Contrib/mmpa/create_mmp_db.py <sample_fragmented.txt

In [ ]:

ls

An SQLite3 db file called mmp.db has been created.
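
Because the result is an ordinary SQLite file, it can be inspected with Python's built-in sqlite3 module. The sketch below lists the tables in a database file without assuming anything about the schema that create_mmp_db.py writes. The helper name list_tables is introduced here for illustration only.

```python
import sqlite3


def list_tables(db_path):
    """Return the names of all tables in an SQLite database file."""
    # Note: sqlite3.connect creates an empty file if db_path does not exist.
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


# In this notebook the call would be: print(list_tables("mmp.db"))
```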

Other sample commands:

Generate a db with the prefix "my_MMP_db" and SMARTS searching capability:

In [ ]:

!python $RDBASE/Contrib/mmpa/create_mmp_db.py -p my_MMP_db -s <sample_fragmented.txt

In [ ]:

ls

Notice the file my_MMP_db.db and the directory my_MMP_db_smarts/, which is needed for the substructure searching.

In [ ]:

ls my_MMP_db_smarts/

Generate a db with SMARTS searching capability and where only changes up to (and including) 10 heavy atoms are stored:

In [ ]:

!python $RDBASE/Contrib/mmpa/create_mmp_db.py -m 10 -s <sample_fragmented.txt
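
If the builds need to be scripted outside the notebook, the same commands can be assembled in Python. This is a sketch only: create_cmd_args and build_db are hypothetical helper names, and the only flags used are the -p, -m and -s options listed in the help text. The fragment file is passed on standard input, mirroring the `<` redirection used in the cells above.

```python
import os
import subprocess


def create_cmd_args(prefix=None, maxsize=None, smarts=False, rdbase=None):
    """Build the argument list for create_mmp_db.py (hypothetical helper)."""
    if rdbase is None:
        rdbase = os.environ.get("RDBASE", "")
    args = ["python", rdbase + "/Contrib/mmpa/create_mmp_db.py"]
    if prefix is not None:
        args += ["-p", prefix]
    if maxsize is not None:
        args += ["-m", str(int(maxsize))]
    if smarts:
        args.append("-s")
    return args


def build_db(fragment_file, **options):
    """Run the build, feeding the fragment file to the program's stdin."""
    with open(fragment_file) as handle:
        return subprocess.run(create_cmd_args(**options), stdin=handle, check=True)


# Example (not executed here): build_db("sample_fragmented.txt", prefix="my_MMP_db", smarts=True)
```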

Searching the db

To search the MMP db use the following command:

python $RDBASE/Contrib/mmpa/search_mmp_db.py [options] <INPUT_FILE

This program has several options (see help from program below):

Options:
  -h, --help            show this help message and exit
  -t TYPE, --type=TYPE  Type of search required. Options are: mmp, subs,
                        trans, subs_smarts, trans_smarts
  -m MAXSIZE, --maxsize=MAXSIZE
                        Maximum size of change (in heavy atoms) allowed in
                        matched molecular pairs identified. DEFAULT=10.
                        Note: This option overrides the ratio option if both
                        are specified.
  -r RATIO, --ratio=RATIO
                        Only applicable with the mmp search type. Maximum
                        ratio of change allowed in matched molecular pairs
                        identified. The ratio is: size of change /
                        size of cmpd (in terms of heavy atoms) for the QUERY
                        MOLECULE. DEFAULT=0.3. Note: If this option is used
                        with the maxsize option, the maxsize option will be
                        used.
  -p PREFIX, --prefix=PREFIX
                        Prefix for the db file. DEFAULT=mmp
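
The interplay between the maxsize and ratio options can be illustrated with a small worked example. This is a sketch of the logic as the help text describes it, not code taken from search_mmp_db.py: the function name is hypothetical, treating both limits as inclusive is an assumption, and how the program combines the two defaults when neither option is given is not covered here.

```python
def change_is_allowed(change_size, query_size, maxsize=None, ratio=0.3):
    """Apply the size filter to one candidate pair (assumed, inclusive limits).

    change_size and query_size are heavy-atom counts of the changed
    fragment and of the whole query molecule.
    """
    if maxsize is not None:
        # An explicitly supplied maxsize takes precedence over the ratio.
        return change_size <= maxsize
    return change_size / query_size <= ratio


# A 5-atom change in a 20-atom query has ratio 0.25, within the 0.3 limit.
print(change_is_allowed(5, 20))
# An 8-atom change has ratio 0.4, so the ratio filter rejects it ...
print(change_is_allowed(8, 20))
# ... but it passes once maxsize=10 is supplied, because maxsize overrides the ratio.
print(change_is_allowed(8, 20, maxsize=10))
```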

The different search types are described below:

a) mmp: Find all MMPs of an input/query compound to the compounds in the db.

b) subs: Find all MMPs in the db where the LHS of the transform matches an input substructure.

c) trans: Find all MMPs that match the input transform/SMIRKS.

d) subs_smarts: Find all MMPs in the db where the LHS of the transform matches an input SMARTS. The attachment points in the SMARTS can be denoted by [#0] (eg.[#0]c1ccccc1).

e) trans_smarts: Find all MMPs that match the LHS and RHS SMARTS of the input transform. The transform SMARTS are input as LHS_SMARTS>>RHS_SMARTS (eg. [#0]c1ccccc1>>[#0]c1ccncc1). Note: This search can take a long time to run if a very general SMARTS expression is used.
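
As an illustration of the [#0] notation used in d) and e), the sketch below checks a SMARTS pattern against two fragment SMILES with RDKit. It relies on RDKit giving the attachment point (dummy atom) atomic number 0, which is what [#0] queries. It only shows generic RDKit substructure matching and is not a description of how search_mmp_db.py runs the query internally.

```python
from rdkit import Chem

# SMARTS for a phenyl ring with one attachment point.
pattern = Chem.MolFromSmarts("[#0]c1ccccc1")

# Fragments written the way they appear in the transforms of this notebook.
phenyl = Chem.MolFromSmiles("[*:1]c1ccccc1")
pyridyl = Chem.MolFromSmiles("[*:1]c1cccnc1")

# The phenyl fragment should match; the pyridyl fragment should not,
# because the pattern requires six aromatic carbons in the ring.
phenyl_matches = phenyl.HasSubstructMatch(pattern)
pyridyl_matches = pyridyl.HasSubstructMatch(pattern)
print(phenyl_matches, pyridyl_matches)
```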

a) To carry out an mmp search:

Find all MMPs of an input/query compound to the compounds in the db. You could imagine using this search to identify analogues with single point changes.

Use search type: mmp

Format of input file (space or comma separated. The ID field is optional): SMILES ID

Format of output: SMILES_QUERY,SMILES_OF_MMP,QUERY_ID,RETRIEVED_ID,CHANGED_SMILES,CONTEXT_SMILES
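
The output is easier to work with once each line is split into named fields. The sketch below assumes that all six fields listed above are present on every line; MmpHit and parse_mmp_line are names introduced here for illustration.

```python
from collections import namedtuple

# Field names follow the output format described above.
MmpHit = namedtuple(
    "MmpHit",
    ["query_smiles", "mmp_smiles", "query_id", "retrieved_id",
     "changed_smiles", "context_smiles"],
)


def parse_mmp_line(line):
    """Split one line of mmp search output into named fields."""
    fields = line.strip().split(",")
    if len(fields) != 6:
        raise ValueError("expected 6 comma-separated fields, got %d" % len(fields))
    return MmpHit(*fields)


hit = parse_mmp_line(
    "c1cc2c(ncnc2NCc2cccnc2)s1,c1cc2c(ncnc2NCc2ccccc2)s1,"
    "2531831,2139597,[*:1]c1ccccc1,[*:1]CNc1ncnc2sccc21"
)
print(hit.retrieved_id, hit.changed_smiles)
```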

In [ ]:

!head sample_db_input_smi.txt

In [ ]:

depict("c1cc2c(ncnc2NCc2cccnc2)s1")

In [ ]:

!python $RDBASE/Contrib/mmpa/search_mmp_db.py -t mmp <sample_db_input_smi.txt

In [ ]:

showLine("c1cc2c(ncnc2NCc2cccnc2)s1,c1cc2c(ncnc2NCc2ccccc2)s1,2531831,2139597,[*:1]c1ccccc1,[*:1]CNc1ncnc2sccc21")

b) To carry out an LHS transform substructure search:

Find all MMPs in the db where the LHS of the transform matches an input substructure. Make sure the attachment points are denoted by an asterisk and the input substructure has been canonicalised (eg. [*]c1ccccc1). Note: Up to 3 attachment points are allowed.

Use search type: subs

Format of input file (space or comma separated. The ID field is optional): Substructure_SMILES ID

Format of output: Input_substructure[,input_id],SMILES_MMP1,SMILES_MMP2,MMP1_ID,MMP2_ID,Transform,Context
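
Because the leading fields vary between search types (the input ID is optional, and an echoed SMARTS may itself contain commas), one robust way to read these lines is to take the last six fields, which is also how the showMMPs helper indexes them. The function below is a sketch under that assumption; parse_pair_line and its dictionary keys are hypothetical names.

```python
def parse_pair_line(line):
    """Read the trailing six fields of a subs/trans search output line."""
    fields = line.strip().split(",")
    if len(fields) < 6:
        raise ValueError("expected at least 6 comma-separated fields")
    smiles1, smiles2, id1, id2, transform, context = fields[-6:]
    # The transform is written as LHS>>RHS.
    lhs, rhs = transform.split(">>")
    return {
        "prefix": fields[:-6],  # echoed input and/or input ID, if any
        "smiles_mmp1": smiles1,
        "smiles_mmp2": smiles2,
        "mmp1_id": id1,
        "mmp2_id": id2,
        "lhs": lhs,
        "rhs": rhs,
        "context": context,
    }


record = parse_pair_line(
    "[*]c1ccccc1,c1cc2c(ncnc2NCc2ccccc2)s1,c1cc2c(ncnc2NCc2cccnc2)s1,"
    "2139597,2531831,[*:1]c1ccccc1>>[*:1]c1cccnc1,[*:1]CNc1ncnc2sccc21"
)
print(record["mmp1_id"], record["lhs"], record["rhs"])
```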

In [ ]:

!head sample_db_input_subs.txt

In [ ]:

depict("[*]c1ccccc1")

In [ ]:

!python $RDBASE/Contrib/mmpa/search_mmp_db.py -t subs <sample_db_input_subs.txt

In [ ]:

showMMPs("[*]c1ccccc1,c1cc2c(ncnc2NCc2ccccc2)s1,c1cc2c(ncnc2NCc2cccnc2)s1,2139597,2531831,[*:1]c1ccccc1>>[*:1]c1cccnc1,[*:1]CNc1ncnc2sccc21")

c) To carry out a transform search:

Find all MMPs that match the input transform/SMIRKS. Make sure the input SMIRKS has been canonicalised using the cansmirk.py program.

Use search type: trans

Format of input file (space or comma separated. The ID field is optional): SMIRKS ID

Format of output: [input_id,]SMILES_MMP1,SMILES_MMP2,MMP1_ID,MMP2_ID,Transform,Context

In [ ]:

!head sample_db_input_trans.txt

In [ ]:

depict("[*:1]c1ccccc1>>[*:1]c1cccnc1")

In [ ]:

!python $RDBASE/Contrib/mmpa/search_mmp_db.py -t trans <sample_db_input_trans.txt

In [ ]:

showMMPs("t1,c1cc2c(ncnc2NCc2ccccc2)s1,c1cc2c(ncnc2NCc2cccnc2)s1,2139597,2531831,[*:1]c1ccccc1>>[*:1]c1cccnc1,[*:1]CNc1ncnc2sccc21")

d) To carry out an LHS transform substructure SMARTS search:

Find all MMPs in the db where the LHS of the transform matches an input SMARTS. The attachment points in the SMARTS can be denoted by [#0] (eg. [#0]c1ccccc1).

Use search type: subs_smarts

Format of input file (space or comma separated. The ID field is optional): SMARTS ID

Format of output: [input_id,]SMILES_MMP1,SMILES_MMP2,MMP1_ID,MMP2_ID,Transform,Context

In [ ]:

!head sample_db_input_subs_smarts.txt

In [ ]:

!python $RDBASE/Contrib/mmpa/search_mmp_db.py -t subs_smarts <sample_db_input_subs_smarts.txt

In [ ]:

showMMPs("a,NC(=O)c1ccc(NC(=O)C2COc3ccccc3O2)cc1,O=C(O)c1ccc(NC(=O)C2COc3ccccc3O2)cc1,2787356,2881039,[*:1]c1ccc(C(N)=O)cc1>>[*:1]c1ccc(C(=O)O)cc1,[*:1]NC(=O)C1COc2ccccc2O1")

In [ ]:

showMMPs("a,c1cc2c(ncnc2NCc2ccccc2)s1,c1cc2c(ncnc2NCc2cccnc2)s1,2139597,2531831,[*:1]c1ccccc1>>[*:1]c1cccnc1,[*:1]CNc1ncnc2sccc21")

e) To carry out a transform SMARTS search:

Find all MMPs that match the LHS and RHS SMARTS of the input transform. The transform SMARTS are input as LHS_SMARTS>>RHS_SMARTS (eg. [#0]c1ccccc1>>[#0]c1ccncc1). Note: This search can take a long time to run if a very general SMARTS expression is used.

Use search type: trans_smarts

Format of input file (space or comma separated. The ID field is optional): SMARTS ID

Format of output: input_transform_SMARTS,[input_id,]SMILES_MMP1,SMILES_MMP2,MMP1_ID,MMP2_ID,Transform,Context

In [ ]:

!head sample_db_input_trans_smarts.txt

In [ ]:

!python $RDBASE/Contrib/mmpa/search_mmp_db.py -t trans_smarts <sample_db_input_trans_smarts.txt

In [ ]:

showMMPs("c>>n,ts,c1cc2c(ncnc2NCc2ccccc2)s1,c1cc2c(ncnc2NCc2cccnc2)s1,2139597,2531831,[*:1]NCc1ccccc1>>[*:1]NCc1cccnc1,[*:1]c1ncnc2sccc21")