This tutorial provides step-by-step instructions for building a new classifier based on a custom set of features.
import os
if not os.path.isdir('local'):
    os.mkdir('local')
if not os.path.isdir('local/custom_classifier'):
    os.mkdir('local/custom_classifier')
os.chdir('local/custom_classifier')
First, we need to build a custom training dataset, for instance by extending the Integrated Dataset that comes with Rhapsody.
New feature arrays must have the same length as those already in the Integrated Dataset. Information about the size and composition of the default training dataset is stored in the installation settings:
import rhapsody as rd
settings = rd.getSettings()
@> rhapsody_local_folder : /home/luca/rhapsody
@> EVmutation_local_folder : /home/luca/rhapsody/EVmutation_mutation_effects
@> full classifier : /home/luca/rhapsody/default_classifiers-sklearn_v0.21.3/full/trained_classifier.pkl
@> reduced classifier : /home/luca/rhapsody/default_classifiers-sklearn_v0.21.3/reduced/trained_classifier.pkl
@> EVmut classifier : /home/luca/rhapsody/default_classifiers-sklearn_v0.21.3/EVmut/trained_classifier.pkl
@> training dataset size : 20361
@> EVmutation_metrics : <computed>
settings['rhapsody_training_dataset']['size']
20361
settings['rhapsody_training_dataset']['fields']
('SAV_coords', 'Uniprot2PDB', 'PDB_length', 'true_label', 'ANM_MSF-chain', 'ANM_MSF-reduced', 'ANM_MSF-sliced', 'ANM_effectiveness-chain', 'ANM_effectiveness-reduced', 'ANM_effectiveness-sliced', 'ANM_sensitivity-chain', 'ANM_sensitivity-reduced', 'ANM_sensitivity-sliced', 'BLOSUM', 'Delta_PSIC', 'Delta_SASA', 'EVmut-DeltaE_epist', 'EVmut-DeltaE_indep', 'EVmut-mut_aa_freq', 'EVmut-wt_aa_cons', 'GNM_MSF-chain', 'GNM_MSF-reduced', 'GNM_MSF-sliced', 'GNM_effectiveness-chain', 'GNM_effectiveness-reduced', 'GNM_effectiveness-sliced', 'GNM_sensitivity-chain', 'GNM_sensitivity-reduced', 'GNM_sensitivity-sliced', 'SASA', 'SASA_in_complex', 'entropy', 'ranked_MI', 'stiffness-chain', 'stiffness-reduced', 'stiffness-sliced', 'wt_PSIC')
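Before extending the dataset, it is worth verifying that a candidate feature array matches that size. A minimal NumPy sketch, where the array contents are placeholder values rather than a real feature:

```python
import numpy as np

# Size of the default training dataset, as reported by rd.getSettings()
DATASET_SIZE = 20361

# Hypothetical new feature array filled with placeholder values;
# a real feature would be computed per-SAV from external data
new_array = np.full(DATASET_SIZE, 0.5, dtype='<f4')

# The length must match before the array can extend the dataset
assert len(new_array) == DATASET_SIZE
```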
As an example, let's add a new field, PDB_length, extracted from the Integrated Dataset, to the "full" Rhapsody training dataset:
default_ds = rd.getDefaultTrainingDataset()
array = default_ds['PDB_length']
This array already has the correct length:
len(array)
20361
Let's call this extra field 'new_feature':
new_training_dataset = rd.extendDefaultTrainingDataset('new_feature', array)
new_training_dataset
rec.array([('A0FGR8 210 C S', 0, -2.053, 0.305, 13., 1.6369812 , 0.47203878, 0.48780644, 10.934773 , 1.9736528, 0.7888889, -1., 455.), ('A0FGR8 638 S G', 0, -2.261, 1.017, 70., 1.1408623 , 0.6298156 , 0.44004798, 11.919116 , nan, nan, 0., 455.), ('A1Z1Q3 58 T I', 0, -1.815, 1.578, 85., 4.432677 , 0.0067604 , 0.6147862 , 10.329021 , nan, nan, -1., 221.), ..., ('Q9Y6X9 283 R H', 0, -1.281, 2.726, 30., 0.474798 , 0.48984432, 0.19532807, 13.609769 , nan, nan, 0., 536.), ('Q9Y6X9 466 D H', 0, -1.261, 3.143, 71., 0.46281278, 0.45384642, 0.16229136, 13.8006115, nan, nan, -1., 536.), ('Q9Y6X9 87 S L', 1, -1.239, 1.896, 17., 0.4945855 , 0.57216614, 0.23604834, 13.548991 , 1.1867005, 0.5555556, -2., 536.)], dtype=[('SAV_coords', '<U50'), ('true_label', '<i2'), ('wt_PSIC', '<f4'), ('Delta_PSIC', '<f4'), ('SASA', '<f4'), ('ANM_MSF-chain', '<f4'), ('ANM_effectiveness-chain', '<f4'), ('ANM_sensitivity-chain', '<f4'), ('stiffness-chain', '<f4'), ('entropy', '<f4'), ('ranked_MI', '<f4'), ('BLOSUM', '<f4'), ('new_feature', '<f4')])
Multiple new features can be added at once by providing a list of arrays:
rd.extendDefaultTrainingDataset(['new_feature_1', 'new_feature_2', ...], [array1, array2, ...])
A completely different training dataset can also be built from scratch by creating a NumPy structured array containing the two mandatory fields SAV_coords and true_label.
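A minimal sketch of such a from-scratch dataset, with hypothetical SAV entries and a made-up extra feature field ('my_feature'):

```python
import numpy as np

# Two hypothetical SAVs: 'SAV_coords' uses Uniprot coordinates
# ("<accession> <position> <wt_aa> <mut_aa>") and 'true_label' is
# 1 (deleterious) or 0 (neutral); extra feature fields are optional
custom_ds = np.array(
    [('P01112 12 G V', 1, 0.37),
     ('P01112 30 D E', 0, 0.81)],
    dtype=[('SAV_coords', '<U50'), ('true_label', '<i2'),
           ('my_feature', '<f4')]
)
```

Any structured array with at least the two mandatory fields can then be passed to rd.trainRFclassifier() in place of new_training_dataset.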
Finally, let's train the new custom classifier:
new_clsf_info = rd.trainRFclassifier(new_training_dataset, pickle_name='custom_classifier.pkl')
@> 3918 out of 20361 cases ignored with missing features.
@> CV iteration # 1:  AUROC = 0.871  AUPRC = 0.931  OOB score = 0.820
@> CV iteration # 2:  AUROC = 0.875  AUPRC = 0.934  OOB score = 0.820
@> CV iteration # 3:  AUROC = 0.867  AUPRC = 0.926  OOB score = 0.820
@> CV iteration # 4:  AUROC = 0.878  AUPRC = 0.934  OOB score = 0.818
@> CV iteration # 5:  AUROC = 0.872  AUPRC = 0.935  OOB score = 0.821
@> CV iteration # 6:  AUROC = 0.875  AUPRC = 0.935  OOB score = 0.820
@> CV iteration # 7:  AUROC = 0.871  AUPRC = 0.931  OOB score = 0.819
@> CV iteration # 8:  AUROC = 0.873  AUPRC = 0.934  OOB score = 0.821
@> CV iteration # 9:  AUROC = 0.861  AUPRC = 0.922  OOB score = 0.820
@> CV iteration #10:  AUROC = 0.879  AUPRC = 0.935  OOB score = 0.818
@> ------------------------------------------------------------
@> Cross-validation summary:
@> training dataset size: 16443
@> fraction of positives: 0.690
@> mean AUROC: 0.872 +/- 0.005
@> mean AUPRC: 0.932 +/- 0.004
@> mean OOB score: 0.820 +/- 0.001
@> optimal cutoff*: 0.709 +/- 0.036
@> (* argmax of Youden's index)
@> feature importances:
@> wt_PSIC: 0.143
@> Delta_PSIC: 0.188
@> SASA: 0.068
@> ANM_MSF-chain: 0.074
@> ANM_effectiveness-chain: 0.078
@> ANM_sensitivity-chain: 0.071
@> stiffness-chain: 0.080
@> entropy: 0.099
@> ranked_MI: 0.068
@> BLOSUM: 0.047
@> new_feature: 0.084
@> ------------------------------------------------------------
@> Predictions distribution saved to predictions_distribution.png
@> Pathogenicity plot saved to pathogenicity_prob.png
@> ROC plot saved to ROC.png
@> ------------------------------------------------------------
@> Classifier training summary:
@> mean OOB score: 0.820
@> feature importances:
@> wt_PSIC: 0.144
@> Delta_PSIC: 0.188
@> SASA: 0.069
@> ANM_MSF-chain: 0.074
@> ANM_effectiveness-chain: 0.078
@> ANM_sensitivity-chain: 0.071
@> stiffness-chain: 0.080
@> entropy: 0.098
@> ranked_MI: 0.068
@> BLOSUM: 0.047
@> new_feature: 0.084
@> ------------------------------------------------------------
@> Feat. importance plot saved to feat_importances.png
NB: Although adding it increases accuracy, PDB size is not a good feature: it probably reflects an intrinsic bias of the training dataset (more deleterious SAVs have been reported for proteins of larger size).
# new custom classifier accuracy
new_clsf_info['CV summary']['mean AUROC']
del new_clsf_info
import pickle
full_clsf_fname = rd.getDefaultClassifiers()['full']
with open(full_clsf_fname, 'rb') as f:
    full_clsf = pickle.load(f)
# "full" Rhapsody classifier accuracy
full_clsf['CV summary']['mean AUROC']
del full_clsf
Here we show how to obtain pathogenicity predictions for a small set of five Single Amino acid Variants (SAVs), using our new custom classifier:
test_SAVs = [
'O00294 496 A T', # known neutral SAV used for training
'O00238 31 R H', # SAV with no PDB structure
'P01112 58 T R', # unknown SAV
'P01112 30 D E', # unknown SAV
'P01112 170 K I', # unknown SAV with no Pfam domain
]
We cannot simply use the main interface rhapsody(), because the new feature cannot be computed automatically by Rhapsody:
# This would cause an error
rh = rd.rhapsody(test_SAVs, main_classifier='custom_classifier.pkl')
The Rhapsody class provides all the functionality needed to obtain custom predictions.
# initialize a new instance by importing SAVs and querying PolyPhen-2 directly
rh = rd.Rhapsody(query=test_SAVs)
@> Submitting query to PolyPhen-2...
@> Query to PolyPhen-2 started in 1.7s.
@> PolyPhen-2 is running...
@> Query to PolyPhen-2 completed in 20.0s.
@> PolyPhen-2's output parsed.
# import new precomputed features
features_dict = {
'new_feature': [224, 0, 168, 168, 171]
}
rh.importPrecomputedExtraFeatures(features_dict)
# import the classifier that will be used for predictions
rh.importClassifiers('custom_classifier.pkl')
@> Imported feature set:
@> 'wt_PSIC'
@> 'Delta_PSIC'
@> 'SASA'
@> 'ANM_MSF-chain'
@> 'ANM_effectiveness-chain'
@> 'ANM_sensitivity-chain'
@> 'stiffness-chain'
@> 'entropy'
@> 'ranked_MI'
@> 'BLOSUM'
@> 'new_feature'
rh.getPredictions()
@> Sequence-conservation features have been retrieved from PolyPhen-2's output.
@> Mapping SAVs to PDB structures...
@> 4 out of 5 SAVs have been mapped to PDB in 2.3s.
@> Computing structural and dynamical features from PDB structures...
@> PDB features have been computed in 0.0s.
@> Computing sequence properties from Pfam domains...
@> WARNING Unable to compute Pfam features: No Pfam domain for resid 170.
@> SAVs have been mapped on Pfam domains and sequence properties have been computed in 0.4s.
@> Random Forest classifier imported in 22.6s.
@> 3 predictions computed in 0.5s.
@> Recovering EVmutation data...
@> EVmutation scores recovered in 0.5s.
array([('O00294 496 A T', 'known_neu', 0.086 , 0.03065213, 'neutral', 0.351, 'neutral', -3.1479, 'neutral'), ('O00238 31 R H', 'new', nan, nan, '?', 0.219, 'neutral', -2.4718, 'neutral'), ('P01112 58 T R', 'new', 0.956 , 0.913615 , 'deleterious', 1. , 'deleterious', -9.7604, 'deleterious'), ('P01112 30 D E', 'new', 0.12666667, 0.044859 , 'neutral', 0.001, 'neutral', 0.2196, 'neutral'), ('P01112 170 K I', 'new', nan, nan, '?', 0. , 'neutral', nan, '?')], dtype=[('SAV coords', '<U50'), ('training info', '<U12'), ('score', '<f4'), ('path. prob.', '<f4'), ('path. class', '<U12'), ('PolyPhen-2 score', '<f4'), ('PolyPhen-2 path. class', '<U12'), ('EVmutation score', '<f4'), ('EVmutation path. class', '<U12')])