This tutorial provides step-by-step instructions for building a new classifier based on a custom set of features.
import os
if not os.path.isdir('local'):
    os.mkdir('local')
if not os.path.isdir('local/custom_classifier'):
    os.mkdir('local/custom_classifier')
os.chdir('local/custom_classifier')
First, we need to build a custom training dataset, for instance by extending the Integrated Dataset that comes with Rhapsody.
New feature arrays must have the same length as those already in the Integrated Dataset. Information about the size and composition of the default training dataset is stored in the installation settings:
import rhapsody as rd
settings = rd.getSettings()
@> rhapsody_local_folder : /home/luca/rhapsody
@> EVmutation_local_folder : /home/luca/rhapsody/EVmutation_mutation_effects
@> full classifier : /home/luca/rhapsody/default_classifiers-sklearn_v0.21.3/full/trained_classifier.pkl
@> reduced classifier : /home/luca/rhapsody/default_classifiers-sklearn_v0.21.3/reduced/trained_classifier.pkl
@> EVmut classifier : /home/luca/rhapsody/default_classifiers-sklearn_v0.21.3/EVmut/trained_classifier.pkl
@> training dataset size : 20361
@> EVmutation_metrics : <computed>
settings['rhapsody_training_dataset']['size']
20361
settings['rhapsody_training_dataset']['fields']
('SAV_coords', 'Uniprot2PDB', 'PDB_length', 'true_label', 'ANM_MSF-chain', 'ANM_MSF-reduced', 'ANM_MSF-sliced', 'ANM_effectiveness-chain', 'ANM_effectiveness-reduced', 'ANM_effectiveness-sliced', 'ANM_sensitivity-chain', 'ANM_sensitivity-reduced', 'ANM_sensitivity-sliced', 'BLOSUM', 'Delta_PSIC', 'Delta_SASA', 'EVmut-DeltaE_epist', 'EVmut-DeltaE_indep', 'EVmut-mut_aa_freq', 'EVmut-wt_aa_cons', 'GNM_MSF-chain', 'GNM_MSF-reduced', 'GNM_MSF-sliced', 'GNM_effectiveness-chain', 'GNM_effectiveness-reduced', 'GNM_effectiveness-sliced', 'GNM_sensitivity-chain', 'GNM_sensitivity-reduced', 'GNM_sensitivity-sliced', 'SASA', 'SASA_in_complex', 'entropy', 'ranked_MI', 'stiffness-chain', 'stiffness-reduced', 'stiffness-sliced', 'wt_PSIC')
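Before extending the dataset, it is worth verifying that a candidate feature array matches that size. A minimal NumPy sketch, where the array contents are placeholder values rather than a real feature:

```python
import numpy as np

# Size of the default training dataset, as reported by rd.getSettings()
DATASET_SIZE = 20361

# Hypothetical new feature array filled with placeholder values;
# a real feature would be computed per-SAV from external data
new_array = np.full(DATASET_SIZE, 0.5, dtype='<f4')

# The length must match before the array can extend the dataset
assert len(new_array) == DATASET_SIZE
```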
As an example, let's add a new field, PDB_length, extracted from the Integrated Dataset, to the "full" Rhapsody training dataset:
default_ds = rd.getDefaultTrainingDataset()
array = default_ds['PDB_length']
This array already has the correct length:
len(array)
20361
Let's call this extra field 'new_feature':
new_training_dataset = rd.extendDefaultTrainingDataset('new_feature', array)
new_training_dataset
rec.array([('A0FGR8 210 C S', 0, -2.053, 0.305, 13., 1.6369812 , 0.47203878, 0.48780644, 10.934773 , 1.9736528, 0.7888889, -1., 455.), ('A0FGR8 638 S G', 0, -2.261, 1.017, 70., 1.1408623 , 0.6298156 , 0.44004798, 11.919116 , nan, nan, 0., 455.), ('A1Z1Q3 58 T I', 0, -1.815, 1.578, 85., 4.432677 , 0.0067604 , 0.6147862 , 10.329021 , nan, nan, -1., 221.), ..., ('Q9Y6X9 283 R H', 0, -1.281, 2.726, 30., 0.474798 , 0.48984432, 0.19532807, 13.609769 , nan, nan, 0., 536.), ('Q9Y6X9 466 D H', 0, -1.261, 3.143, 71., 0.46281278, 0.45384642, 0.16229136, 13.8006115, nan, nan, -1., 536.), ('Q9Y6X9 87 S L', 1, -1.239, 1.896, 17., 0.4945855 , 0.57216614, 0.23604834, 13.548991 , 1.1867005, 0.5555556, -2., 536.)], dtype=[('SAV_coords', '<U50'), ('true_label', '<i2'), ('wt_PSIC', '<f4'), ('Delta_PSIC', '<f4'), ('SASA', '<f4'), ('ANM_MSF-chain', '<f4'), ('ANM_effectiveness-chain', '<f4'), ('ANM_sensitivity-chain', '<f4'), ('stiffness-chain', '<f4'), ('entropy', '<f4'), ('ranked_MI', '<f4'), ('BLOSUM', '<f4'), ('new_feature', '<f4')])
Multiple new features can be added at once by providing a list of arrays:
rd.extendDefaultTrainingDataset(['new_feature_1', 'new_feature_2', ...], [array1, array2, ...])
A completely different training dataset can also be built from scratch by creating a NumPy structured array containing the two mandatory fields SAV_coords and true_label.
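A minimal sketch of such a from-scratch dataset, with hypothetical SAV entries and a made-up extra feature field ('my_feature'):

```python
import numpy as np

# Two hypothetical SAVs: 'SAV_coords' uses Uniprot coordinates
# ("<accession> <position> <wt_aa> <mut_aa>") and 'true_label' is
# 1 (deleterious) or 0 (neutral); extra feature fields are optional
custom_ds = np.array(
    [('P01112 12 G V', 1, 0.37),
     ('P01112 30 D E', 0, 0.81)],
    dtype=[('SAV_coords', '<U50'), ('true_label', '<i2'),
           ('my_feature', '<f4')]
)
```

Any structured array with at least the two mandatory fields can then be passed to rd.trainRFclassifier() in place of new_training_dataset.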
Finally, let's train the new custom classifier:
new_clsf_info = rd.trainRFclassifier(new_training_dataset, pickle_name='custom_classifier.pkl')
@> 3918 out of 20361 cases ignored with missing features.
@> CV iteration # 1:  AUROC = 0.871  AUPRC = 0.931  OOB score = 0.820
@> CV iteration # 2:  AUROC = 0.875  AUPRC = 0.934  OOB score = 0.820
@> CV iteration # 3:  AUROC = 0.867  AUPRC = 0.926  OOB score = 0.820
@> CV iteration # 4:  AUROC = 0.878  AUPRC = 0.934  OOB score = 0.818
@> CV iteration # 5:  AUROC = 0.872  AUPRC = 0.935  OOB score = 0.821
@> CV iteration # 6:  AUROC = 0.875  AUPRC = 0.935  OOB score = 0.820
@> CV iteration # 7:  AUROC = 0.871  AUPRC = 0.931  OOB score = 0.819
@> CV iteration # 8:  AUROC = 0.873  AUPRC = 0.934  OOB score = 0.821
@> CV iteration # 9:  AUROC = 0.861  AUPRC = 0.922  OOB score = 0.820
@> CV iteration #10:  AUROC = 0.879  AUPRC = 0.935  OOB score = 0.818
@> ------------------------------------------------------------
@> Cross-validation summary:
@> training dataset size: 16443
@> fraction of positives: 0.690
@> mean AUROC: 0.872 +/- 0.005
@> mean AUPRC: 0.932 +/- 0.004
@> mean OOB score: 0.820 +/- 0.001
@> optimal cutoff*: 0.709 +/- 0.036
@> (* argmax of Youden's index)
@> feature importances:
@> wt_PSIC: 0.143
@> Delta_PSIC: 0.188
@> SASA: 0.068
@> ANM_MSF-chain: 0.074
@> ANM_effectiveness-chain: 0.078
@> ANM_sensitivity-chain: 0.071
@> stiffness-chain: 0.080
@> entropy: 0.099
@> ranked_MI: 0.068
@> BLOSUM: 0.047
@> new_feature: 0.084
@> ------------------------------------------------------------
@> Predictions distribution saved to predictions_distribution.png
@> Pathogenicity plot saved to pathogenicity_prob.png
@> ROC plot saved to ROC.png
@> ------------------------------------------------------------
@> Classifier training summary:
@> mean OOB score: 0.820
@> feature importances:
@> wt_PSIC: 0.144
@> Delta_PSIC: 0.188
@> SASA: 0.069
@> ANM_MSF-chain: 0.074
@> ANM_effectiveness-chain: 0.078
@> ANM_sensitivity-chain: 0.071
@> stiffness-chain: 0.080
@> entropy: 0.098
@> ranked_MI: 0.068
@> BLOSUM: 0.047
@> new_feature: 0.084
@> ------------------------------------------------------------
@> Feat. importance plot saved to feat_importances.png
NB: Although adding it increases accuracy, PDB size is not a good feature: it probably reflects an intrinsic bias of the training dataset (more deleterious SAVs have been reported for proteins of larger size).
# new custom classifier accuracy
new_clsf_info['CV summary']['mean AUROC']
del new_clsf_info
import pickle
full_clsf_fname = rd.getDefaultClassifiers()['full']
with open(full_clsf_fname, 'rb') as f:
    full_clsf = pickle.load(f)
# "full" Rhapsody classifier accuracy
full_clsf['CV summary']['mean AUROC']
del full_clsf
Here we show how to obtain pathogenicity predictions for a small set of five Single Amino acid Variants (SAVs), using our new custom classifier:
test_SAVs = [
'O00294 496 A T', # known neutral SAV used for training
'O00238 31 R H', # SAV with no PDB structure
'P01112 58 T R', # unknown SAV
'P01112 30 D E', # unknown SAV
'P01112 170 K I', # unknown SAV with no Pfam domain
]
We cannot simply use the main interface rhapsody(), because the new feature cannot be computed automatically by Rhapsody:
# This would cause an error
rh = rd.rhapsody(test_SAVs, main_classifier='custom_classifier.pkl')
The Rhapsody class provides all the functionality needed to obtain custom predictions.
# initialize a new instance by importing SAVs and querying PolyPhen-2 directly
rh = rd.Rhapsody(query=test_SAVs)
@> Submitting query to PolyPhen-2...
@> Query to PolyPhen-2 started in 1.7s.
@> PolyPhen-2 is running...
@> Query to PolyPhen-2 completed in 20.0s.
@> PolyPhen-2's output parsed.
# import new precomputed features
features_dict = {
'new_feature': [224, 0, 168, 168, 171]
}
rh.importPrecomputedExtraFeatures(features_dict)
# import the classifier that will be used for predictions
rh.importClassifiers('custom_classifier.pkl')
@> Imported feature set:
@> 'wt_PSIC'
@> 'Delta_PSIC'
@> 'SASA'
@> 'ANM_MSF-chain'
@> 'ANM_effectiveness-chain'
@> 'ANM_sensitivity-chain'
@> 'stiffness-chain'
@> 'entropy'
@> 'ranked_MI'
@> 'BLOSUM'
@> 'new_feature'
rh.getPredictions()
@> Sequence-conservation features have been retrieved from PolyPhen-2's output.
@> Mapping SAVs to PDB structures...
@> 4 out of 5 SAVs have been mapped to PDB in 2.3s.
@> Computing structural and dynamical features from PDB structures...
@> PDB features have been computed in 0.0s.
@> Computing sequence properties from Pfam domains...
@> WARNING Unable to compute Pfam features: No Pfam domain for resid 170.
@> SAVs have been mapped on Pfam domains and sequence properties have been computed in 0.4s.
@> Random Forest classifier imported in 22.6s.
@> 3 predictions computed in 0.5s.
@> Recovering EVmutation data...
@> EVmutation scores recovered in 0.5s.
array([('O00294 496 A T', 'known_neu', 0.086 , 0.03065213, 'neutral', 0.351, 'neutral', -3.1479, 'neutral'), ('O00238 31 R H', 'new', nan, nan, '?', 0.219, 'neutral', -2.4718, 'neutral'), ('P01112 58 T R', 'new', 0.956 , 0.913615 , 'deleterious', 1. , 'deleterious', -9.7604, 'deleterious'), ('P01112 30 D E', 'new', 0.12666667, 0.044859 , 'neutral', 0.001, 'neutral', 0.2196, 'neutral'), ('P01112 170 K I', 'new', nan, nan, '?', 0. , 'neutral', nan, '?')], dtype=[('SAV coords', '<U50'), ('training info', '<U12'), ('score', '<f4'), ('path. prob.', '<f4'), ('path. class', '<U12'), ('PolyPhen-2 score', '<f4'), ('PolyPhen-2 path. class', '<U12'), ('EVmutation score', '<f4'), ('EVmutation path. class', '<U12')])