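# Helper functions
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole  # enables inline rendering of molecules in the notebook

def depict(text):
    """Draw a reaction if the input contains ">>", otherwise return the molecule."""
    if ">>" in text:
        rxn = AllChem.ReactionFromSmarts(text)
        return Draw.ReactionToImage(rxn)
    # A Mol object returned from a cell is rendered inline by IPythonConsole
    return Chem.MolFromSmiles(text)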
def showMMPs(in_string):
    """Draw one comma-separated MMP result line: both molecules, LHS, RHS and context."""
    # Index from the end so that optional leading fields (input, input id) are ignored
    f = in_string.split(",")
    rxn = f[-2].split(">>")
    mols = [
        Chem.MolFromSmiles(f[-6]),   # SMILES_MMP1
        Chem.MolFromSmiles(f[-5]),   # SMILES_MMP2
        Chem.MolFromSmiles(rxn[0]),  # LHS of the transform
        Chem.MolFromSmiles(rxn[1]),  # RHS of the transform
        Chem.MolFromSmiles(f[-1]),   # context
    ]
    # Legends follow the order of mols: MMP1_ID is f[-4], MMP2_ID is f[-3]
    ids = [f[-4], f[-3], "LHS", "RHS", "CONTEXT"]
    return Draw.MolsToGridImage(mols, molsPerRow=6, legends=ids)
def showLine(in_string):
    """Draw one line of mmp search output: query, retrieved molecule, change and context."""
    f = in_string.split(",")
    mols = [
        Chem.MolFromSmiles(f[0]),  # SMILES_QUERY
        Chem.MolFromSmiles(f[1]),  # SMILES_OF_MMP
        Chem.MolFromSmiles(f[4]),  # CHANGED_SMILES
        Chem.MolFromSmiles(f[5]),  # CONTEXT_SMILES
    ]
    ids = ["Query:%s" % f[2], f[3], "CHANGE", "CONTEXT"]
    return Draw.MolsToGridImage(mols, molsPerRow=4, legends=ids)
The pair index used in the MMP identification algorithm can also be written to a relational database. The indexing.py program described earlier holds the index in memory and identifies all the MMPs in the dataset. If you only want to ask one or more specific questions of a dataset, a relational database containing the pair index (an MMP db) can answer them instead.
The program create_mmp_db.py builds an MMP db for a given dataset, and the program search_mmp_db.py searches it. The supported search types (mmp, subs, trans, subs_smarts and trans_smarts) are described in the section on searching further down.
The SMARTS searching utilises the DbCLI tools (http://code.google.com/p/rdkit/wiki/UsingTheDbCLI) that are part of the RDKit distribution.
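To make the idea of a pair index in a relational database concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name `pair_index`, its columns and the compound `9999999` are hypothetical and chosen only for illustration; the schema that create_mmp_db.py actually writes is not shown in this tutorial and may well differ. Two compounds form a pair when they share a context but differ in the variable part, which is a self-join on the context column.

```python
import sqlite3

# Hypothetical toy schema: one row per fragmentation of a compound.
# The real mmp.db schema may differ.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pair_index (cmpd_id TEXT, core_smi TEXT, context_smi TEXT)"
)
rows = [
    ("2139597", "[*:1]c1ccccc1", "[*:1]CNc1ncnc2sccc21"),
    ("2531831", "[*:1]c1cccnc1", "[*:1]CNc1ncnc2sccc21"),
    ("9999999", "[*:1]c1ccccc1", "[*:1]CC"),  # made-up compound with a different context
]
conn.executemany("INSERT INTO pair_index VALUES (?, ?, ?)", rows)
# An index on the context makes the self-join below cheap on large tables
conn.execute("CREATE INDEX context_idx ON pair_index (context_smi)")


def find_pairs(conn, query_id):
    """Return (query_id, partner_id, transform, context) for every pair of the query."""
    sql = """
        SELECT a.cmpd_id, b.cmpd_id, a.core_smi || '>>' || b.core_smi, a.context_smi
        FROM pair_index AS a
        JOIN pair_index AS b
          ON a.context_smi = b.context_smi
         AND a.cmpd_id != b.cmpd_id
         AND a.core_smi != b.core_smi
        WHERE a.cmpd_id = ?
        ORDER BY b.cmpd_id
    """
    return conn.execute(sql, (query_id,)).fetchall()


pairs = find_pairs(conn, "2139597")
for pair in pairs:
    print(pair)
```

Because only the rows that share a context with the query are touched, a single question can be answered without enumerating every pair in the dataset.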
To generate an MMP db use the following command:
python $RDBASE/Contrib/mmpa/create_mmp_db.py <FRAGMENT_OUTPUT
The program takes a FRAGMENT_OUTPUT generated by the rfrag.py command (described above) as input.
This program has several options (see help from program below):
Usage: create_mmp_db.py [options]

Program to create an MMP db.

Options:
  -h, --help            show this help message and exit
  -p PREFIX, --prefix=PREFIX
                        Prefix to use for the db file (and directory for
                        SMARTS index). DEFAULT=mmp
  -m MAXSIZE, --maxsize=MAXSIZE
                        Maximum size of change (in heavy atoms) that is stored
                        in the database. DEFAULT=15.
                        Note: Any MMPs that involve a change greater than this
                        value will not be stored in the database and hence not
                        be identified in the searching.
  -s, --smarts          Build SMARTS db so can perform SMARTS searching
                        against db. Note: Will make the build process somewhat
                        slower.
Let's build an MMP db:
cd t2_files/
ls
!python $RDBASE/Contrib/mmpa/create_mmp_db.py <sample_fragmented.txt
ls
An sqlite3 db file called mmp.db has been created.
Other sample commands:
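Since the result is an ordinary SQLite file, it can be inspected with Python's built-in sqlite3 module. The sketch below lists the tables in a database file without assuming anything about their names; the helper `list_tables` is introduced here for illustration and is not part of the mmpa scripts.

```python
import sqlite3


def list_tables(db_path):
    """Return the names of all tables in an SQLite file, sorted alphabetically."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


# Usage, from the directory in which the db was built:
# print(list_tables("mmp.db"))
```

Note that sqlite3.connect creates an empty file if the path does not exist, so an empty list may simply mean the path is wrong.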
Generate a db with the prefix "my_MMP_db" and SMARTS searching capability:
!python $RDBASE/Contrib/mmpa/create_mmp_db.py -p my_MMP_db -s <sample_fragmented.txt
ls
Notice the file my_MMP_db.db and the directory my_MMP_db_smarts/, which is needed for the SMARTS (substructure) searching.
ls my_MMP_db_smarts/
Generate a db with SMARTS searching capability and where only changes up to (and including) 10 heavy atoms are stored:
!python $RDBASE/Contrib/mmpa/create_mmp_db.py -m 10 -s <sample_fragmented.txt
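Outside a notebook, the same commands can be driven from Python. The following is a sketch only: the helper names `build_create_db_command` and `create_mmp_db` are made up for this example, and the only flags used are the -p, -m and -s options listed in the help text above. The fragment file is passed on standard input, as in the shell commands.

```python
import os
import subprocess
import sys


def build_create_db_command(prefix=None, maxsize=None, smarts=False, rdbase=None):
    """Assemble the argument list for create_mmp_db.py from the documented options."""
    if rdbase is None:
        rdbase = os.environ.get("RDBASE", "")
    script = os.path.join(rdbase, "Contrib", "mmpa", "create_mmp_db.py")
    cmd = [sys.executable, script]
    if prefix is not None:
        cmd += ["-p", prefix]
    if maxsize is not None:
        cmd += ["-m", str(maxsize)]
    if smarts:
        cmd.append("-s")
    return cmd


def create_mmp_db(fragment_file, **options):
    """Run create_mmp_db.py with the fragment file on stdin; raise if it fails."""
    with open(fragment_file) as handle:
        return subprocess.run(
            build_create_db_command(**options), stdin=handle, check=True
        )


# Usage (requires RDKit's Contrib directory under $RDBASE):
# create_mmp_db("sample_fragmented.txt", maxsize=10, smarts=True)
```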
To search the MMP db use the following command:
python $RDBASE/Contrib/mmpa/search_mmp_db.py [options] <INPUT_FILE
This program has several options (see help from program below):
Options:
  -h, --help            show this help message and exit
  -t TYPE, --type=TYPE  Type of search required. Options are: mmp, subs,
                        trans, subs_smarts, trans_smarts
  -m MAXSIZE, --maxsize=MAXSIZE
                        Maximum size of change (in heavy atoms) allowed in
                        matched molecular pairs identified. DEFAULT=10.
                        Note: This option overrides the ratio option if both
                        are specified.
  -r RATIO, --ratio=RATIO
                        Only applicable with the mmp search type. Maximum
                        ratio of change allowed in matched molecular pairs
                        identified. The ratio is: size of change /
                        size of cmpd (in terms of heavy atoms) for the QUERY
                        MOLECULE. DEFAULT=0.3. Note: If this option is used
                        with the maxsize option, the maxsize option will be
                        used.
  -p PREFIX, --prefix=PREFIX
                        Prefix for the db file. DEFAULT=mmp
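The interaction between the maxsize and ratio options can be sketched as a small filter. This is one reading of the help text, not the code used by search_mmp_db.py: the function name `change_allowed` is hypothetical, the comparisons are assumed to be inclusive, and the case where neither option is given is left out because the help text lists a default for both without saying which one applies.

```python
def change_allowed(change_size, cmpd_size, maxsize=None, ratio=None):
    """Decide whether a change passes the size filter (sizes in heavy atoms).

    Assumptions: an explicit maxsize overrides ratio, and both limits are inclusive.
    """
    if maxsize is not None:
        # maxsize takes precedence whenever it is specified
        return change_size <= maxsize
    if ratio is not None:
        # ratio = size of change / size of the query compound
        return change_size / cmpd_size <= ratio
    raise ValueError("specify maxsize or ratio")


# A 6-atom change in a 17-atom query: 6/17 is about 0.35
print(change_allowed(6, 17, maxsize=10))             # maxsize only
print(change_allowed(6, 17, ratio=0.3))              # ratio only
print(change_allowed(6, 17, maxsize=10, ratio=0.3))  # maxsize wins
```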
The different search types are described below:
a) mmp: Find all MMPs of an input/query compound among the compounds in the db.
b) subs: Find all MMPs in the db where the LHS of the transform matches an input substructure.
c) trans: Find all MMPs that match the input transform/SMIRKS.
d) subs_smarts: Find all MMPs in the db where the LHS of the transform matches an input SMARTS. The attachment points in the SMARTS can be denoted by [#0] (eg. [#0]c1ccccc1).
e) trans_smarts: Find all MMPs that match the LHS and RHS SMARTS of the input transform. The transform SMARTS are input as LHS_SMARTS>>RHS_SMARTS (eg. [#0]c1ccccc1>>[#0]c1ccncc1). Note: This search can take a long time to run if a very general SMARTS expression is used.
Find all MMPs of an input/query compound among the compounds in the db. You could use this search to identify analogues of a compound that differ by a single-point change.
Use search type: mmp
Format of input file (space or comma separated. The ID field is optional): SMILES ID
Format of output: SMILES_QUERY,SMILES_OF_MMP,QUERY_ID,RETRIEVED_ID,CHANGED_SMILES,CONTEXT_SMILES
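For downstream processing, each output line can be split into named fields. The sketch below follows the six-field output format given above; the names `MMPHit` and `parse_mmp_hit` are introduced here for illustration. Splitting on commas is safe for these fields because SMILES strings and the IDs shown in this tutorial contain no commas.

```python
from collections import namedtuple

# Field order follows the documented output format of the mmp search
MMPHit = namedtuple(
    "MMPHit",
    [
        "query_smiles",
        "mmp_smiles",
        "query_id",
        "retrieved_id",
        "changed_smiles",
        "context_smiles",
    ],
)


def parse_mmp_hit(line):
    """Parse one line of mmp search output into an MMPHit."""
    fields = line.strip().split(",")
    if len(fields) != 6:
        raise ValueError("expected 6 comma-separated fields, got %d" % len(fields))
    return MMPHit(*fields)


hit = parse_mmp_hit(
    "c1cc2c(ncnc2NCc2cccnc2)s1,c1cc2c(ncnc2NCc2ccccc2)s1,"
    "2531831,2139597,[*:1]c1ccccc1,[*:1]CNc1ncnc2sccc21"
)
print(hit.retrieved_id, hit.changed_smiles)
```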
!head sample_db_input_smi.txt
depict("c1cc2c(ncnc2NCc2cccnc2)s1")
!python $RDBASE/Contrib/mmpa/search_mmp_db.py -t mmp <sample_db_input_smi.txt
showLine("c1cc2c(ncnc2NCc2cccnc2)s1,c1cc2c(ncnc2NCc2ccccc2)s1,2531831,2139597,[*:1]c1ccccc1,[*:1]CNc1ncnc2sccc21")
Find all MMPs in the db where the LHS of the transform matches an input substructure. Make sure the attachment points are denoted by an asterisk and the input substructure has been canonicalised (eg. [*]c1ccccc1). Note: Up to 3 attachment points are allowed.
Use search type: subs
Format of input file (space or comma separated. The ID field is optional): Substructure_SMILES ID
Format of output: Input_substructure[,input_id],SMILES_MMP1,SMILES_MMP2,MMP1_ID,MMP2_ID,Transform,Context
!head sample_db_input_subs.txt
depict("[*]c1ccccc1")
!python $RDBASE/Contrib/mmpa/search_mmp_db.py -t subs <sample_db_input_subs.txt
showMMPs("[*]c1ccccc1,c1cc2c(ncnc2NCc2ccccc2)s1,c1cc2c(ncnc2NCc2cccnc2)s1,2139597,2531831,[*:1]c1ccccc1>>[*:1]c1cccnc1,[*:1]CNc1ncnc2sccc21")
Find all MMPs that match the input transform/SMIRKS. Make sure the input SMIRKS has been canonicalised using the cansmirk.py program.
Use search type: trans
Format of input file (space or comma separated. The ID field is optional): SMIRKS ID
Format of output: [input_id,]SMILES_MMP1,SMILES_MMP2,MMP1_ID,MMP2_ID,Transform,Context
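The transform-based searches share the same last six output fields, preceded by optional fields such as the input ID. A parser can therefore index from the end of the line, as the showMMPs helper does. The function name `parse_transform_hit` and the dictionary keys below are illustrative choices, not part of the mmpa scripts.

```python
def parse_transform_hit(line):
    """Parse a result line ending in SMILES_MMP1,SMILES_MMP2,MMP1_ID,MMP2_ID,Transform,Context."""
    fields = line.strip().split(",")
    if len(fields) < 6:
        raise ValueError("expected at least 6 comma-separated fields")
    # The last six fields are fixed; anything before them is optional input echo
    smiles1, smiles2, id1, id2, transform, context = fields[-6:]
    lhs, rhs = transform.split(">>")
    return {
        "leading": fields[:-6],
        "mmp1": (id1, smiles1),
        "mmp2": (id2, smiles2),
        "lhs": lhs,
        "rhs": rhs,
        "context": context,
    }


result = parse_transform_hit(
    "t1,c1cc2c(ncnc2NCc2ccccc2)s1,c1cc2c(ncnc2NCc2cccnc2)s1,"
    "2139597,2531831,[*:1]c1ccccc1>>[*:1]c1cccnc1,[*:1]CNc1ncnc2sccc21"
)
print(result["mmp1"], result["mmp2"], result["lhs"], result["rhs"])
```

Indexing from the end also keeps the fixed fields intact when an echoed input SMARTS itself contains commas (for example an atom list such as [C,N]); in that case only the "leading" entries are split unexpectedly.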
!head sample_db_input_trans.txt
depict("[*:1]c1ccccc1>>[*:1]c1cccnc1")
!python $RDBASE/Contrib/mmpa/search_mmp_db.py -t trans <sample_db_input_trans.txt
showMMPs("t1,c1cc2c(ncnc2NCc2ccccc2)s1,c1cc2c(ncnc2NCc2cccnc2)s1,2139597,2531831,[*:1]c1ccccc1>>[*:1]c1cccnc1,[*:1]CNc1ncnc2sccc21")
Find all MMPs in the db where the LHS of the transform matches an input SMARTS. The attachment points in the SMARTS can be denoted by [#0] (eg. [#0]c1ccccc1).
Use search type: subs_smarts
Format of input file (space or comma separated. The ID field is optional): SMARTS ID
Format of output: [input_id,]SMILES_MMP1,SMILES_MMP2,MMP1_ID,MMP2_ID,Transform,Context
!head sample_db_input_subs_smarts.txt
!python $RDBASE/Contrib/mmpa/search_mmp_db.py -t subs_smarts <sample_db_input_subs_smarts.txt
showMMPs("a,NC(=O)c1ccc(NC(=O)C2COc3ccccc3O2)cc1,O=C(O)c1ccc(NC(=O)C2COc3ccccc3O2)cc1,2787356,2881039,[*:1]c1ccc(C(N)=O)cc1>>[*:1]c1ccc(C(=O)O)cc1,[*:1]NC(=O)C1COc2ccccc2O1")
showMMPs("a,c1cc2c(ncnc2NCc2ccccc2)s1,c1cc2c(ncnc2NCc2cccnc2)s1,2139597,2531831,[*:1]c1ccccc1>>[*:1]c1cccnc1,[*:1]CNc1ncnc2sccc21")
Find all MMPs that match the LHS and RHS SMARTS of the input transform. The transform SMARTS are input as LHS_SMARTS>>RHS_SMARTS (eg. [#0]c1ccccc1>>[#0]c1ccncc1). Note: This search can take a long time to run if a very general SMARTS expression is used.
Use search type: trans_smarts
Format of input file (space or comma separated. The ID field is optional): SMARTS ID
Format of output: input_transform_SMARTS,[input_id,]SMILES_MMP1,SMILES_MMP2,MMP1_ID,MMP2_ID,Transform,Context
!head sample_db_input_trans_smarts.txt
!python $RDBASE/Contrib/mmpa/search_mmp_db.py -t trans_smarts <sample_db_input_trans_smarts.txt
showMMPs("c>>n,ts,c1cc2c(ncnc2NCc2ccccc2)s1,c1cc2c(ncnc2NCc2cccnc2)s1,2139597,2531831,[*:1]NCc1ccccc1>>[*:1]NCc1cccnc1,[*:1]c1ncnc2sccc21")