%pylab inline
Populating the interactive namespace from numpy and matplotlib
!cd toy_datasets; wget -O MiniBooNE_PID.txt -nc --no-check-certificate https://archive.ics.uci.edu/ml/machine-learning-databases/00199/MiniBooNE_PID.txt
File `MiniBooNE_PID.txt' already there; not retrieving.
import numpy, pandas
from rep.utils import train_test_split
import numpy, pandas
from rep.utils import train_test_split
from sklearn.metrics import roc_auc_score
data = pandas.read_csv('toy_datasets/MiniBooNE_PID.txt', sep='\s*', skiprows=[0], header=None, engine='python')
labels = pandas.read_csv('toy_datasets/MiniBooNE_PID.txt', sep=' ', nrows=1, header=None)
labels = [1] * labels[1].values[0] + [0] * labels[2].values[0]
data.columns = ['feature_{}'.format(key) for key in data.columns]
train_data, test_data, train_labels, test_labels = train_test_split(data, labels, train_size=0.5)
train_variables = ["feature_new01: feature_0/feature_1", "feature_2", "feature_26", "feature_12", "feature_24",
"feature_25", "feature_16",]
plot_variables = train_variables + ['feature_3']
This class is OrderedDict, with additional interface, main methods are:
factory.add_classifier(name, classifier)
factory.fit(X, y, sample_weight=None, ipc_profile=None, features=None)
train all classifiers in factory
if features
is not None, then all classifiers will be trained on these features
you can pass the name of ipython cluster via ipc_profile
for parallel training
factory.test_on_lds(lds)
- test all models on lds(rep.data.storage.LabeledDataStorage
)
returns report (rep.report.classification.ClassificationReport
)
from rep.metaml import ClassifiersFactory
from rep.estimators import TMVAClassifier, SklearnClassifier, XGBoostClassifier
from sklearn.ensemble import AdaBoostClassifier
Please pay attention that we set very small number of trees, just to make this notebook work fast. Don't forget to tune classifier!
factory = ClassifiersFactory()
# There are different ways to add classifiers to Factory:
factory.add_classifier('tmva', TMVAClassifier(NTrees=50, features=train_variables, Shrinkage=0.05))
factory.add_classifier('ada', AdaBoostClassifier(n_estimators=10))
factory['xgb'] = XGBoostClassifier(features=train_variables)
from copy import deepcopy
factory_copy = deepcopy(factory)
pay attention:
for the first factory we pointed features that will be used in training and all classifiers will use them,
for the second factory all the classifiers will use those features we set in their constuctors
%time factory.fit(train_data, train_labels, features=train_variables)
pass
Overwriting features of estimator tmva Overwriting features of estimator xgb model tmva was trained in 12.17 seconds model ada was trained in 1.09 seconds model xgb was trained in 29.71 seconds Totally spent 42.97 seconds on training CPU times: user 31.4 s, sys: 235 ms, total: 31.6 s Wall time: 43 s
factory.predict_proba(train_data)
data was predicted by tmva in 5.45 seconds data was predicted by ada in 0.13 seconds data was predicted by xgb in 0.84 seconds Totally spent 6.42 seconds on prediction
OrderedDict([('tmva', array([[ 0.52329173, 0.47670827], [ 0.42505833, 0.57494167], [ 0.31753175, 0.68246825], ..., [ 0.44861638, 0.55138362], [ 0.74466065, 0.25533935], [ 0.58525529, 0.41474471]])), ('ada', array([[ 0.54559202, 0.45440798], [ 0.49533497, 0.50466503], [ 0.46967085, 0.53032915], ..., [ 0.45320584, 0.54679416], [ 0.58490372, 0.41509628], [ 0.56773603, 0.43226397]])), ('xgb', array([[ 0.86888862, 0.1311114 ], [ 0.80507904, 0.194921 ], [ 0.09248171, 0.90751827], ..., [ 0.05847125, 0.9415288 ], [ 0.99372029, 0.00627972], [ 0.81142735, 0.18857266]], dtype=float32))])
%time factory_copy.fit(train_data, train_labels)
pass
model tmva was trained in 11.78 seconds model ada was trained in 6.48 seconds model xgb was trained in 33.99 seconds Totally spent 52.24 seconds on training CPU times: user 40.7 s, sys: 286 ms, total: 41 s Wall time: 52.2 s
ClassificationReport
class provides the posibility to get classification description to compare different models.
Below you can find available functions which can help you to analyze result on arbitrary dataset.
There are different plotting backends supported:
report has many useful methods
report = factory.test_on(test_data, test_labels)
Only the features used in training are compared
features_importances = report.feature_importance()
features_importances.plot()
# not only in matplotlib, but in other libraries too. For instance, with plotly
# features_importances.plot_plotly('importances', figsize=(15, 6))
Estimator tmva doesn't support feature importances
Learning curves are powerful and simple tool to analyze the behaviour of your model.
from rep.report.metrics import RocAuc
learning_curve = report.learning_curve(RocAuc(), metric_label='ROC AUC', steps=1)
learning_curve.plot(new_plot=True, figsize=(8, 4))
# with plotly
# learning_curve.plot_plotly(plotly_filename='learning curves', figsize=(18, 8))
Estimator tmva doesn't support stage predictions
correlation_pairs = []
correlation_pairs.append((plot_variables[0], plot_variables[1]))
correlation_pairs.append((plot_variables[0], plot_variables[2]))
report.scatter(correlation_pairs, alpha=0.01).plot()
# plot correlations between variables for signal-like and bck-like events
report.features_correlation_matrix(features=plot_variables).plot(new_plot=True, show_legend=False, figsize=(7, 5))
report.features_correlation_matrix_by_class(features=plot_variables).plot(new_plot=True,
show_legend=False, figsize=(15, 5))
# compute correclation matrix only for zero class data
# report.features_correlation_matrix_by_class(features=plot_variables[:4], labels_dict={0: 'background'}, grid_columns=1).plot()
# use just common features for all classifiers
report.features_pdf().plot()
report.prediction_pdf().plot(new_plot=True, figsize = (9, 4))
# Plot the same only for zero class data
# report.prediction_pdf(labels_dict={0: 'background'}, size=5).plot()
Plot roc curve for train, test data (it's the same as BackgroundRejection vs Signal Efficiency plot)
figure(figsize(14, 5))
subplot(1, 2, 1)
report.roc().plot(xlim=(0.5, 1))
subplot(1, 2, 2)
report.roc(physics_notion=True).plot(ylim=(0.5, 1))
(this is dependence of efficiency on variables of dataset)
efficiencies = report.efficiencies(['feature_3'], ignored_sideband=0.01)
efficiencies_with_errors = report.efficiencies(['feature_3'], errors=True, bins=15, ignored_sideband=0.01)
efficiencies.plot(figsize=(18, 15), fontsize=12, show_legend=False)
# plot efficiencies with error bars
# efficiencies_with_errors.plot(figsize=(18, 25), fontsize=12, show_legend=False)
look how you can estimate the quality with your custom binary metrics and look for optimal threshold
from rep.report.metrics import significance
metrics = report.metrics_vs_cut(significance, metric_label='significance')
metrics.plot(new_plot=True, figsize=(10, 4))
# You can define your own metric and use it
# def AMS(s, b):
# b_reg = 0.01
# radicand = 2 *( (s+b+b_reg) * numpy.log (1.0 + s/(b+b_reg)) - s)
# return numpy.sqrt(radicand)
# metrics = report.metrics_vs_cut(AMS, metric_label='ams')
Exercise 1. Create weight column for test and train datasets. Use weights in fit
ting and test
ing of factory.
Exercise 2. Train another classifiers, play with parameters and feature sets.
Exercise 3. Set up you IPython cluster for distributed training of estimators.