This notebook demonstrates the tools for optimizing classification models provided by the Reproducible Experiment Platform (REP) package:
grid search for the best classifier hyperparameters
different optimization algorithms
different scoring models (optimization of an arbitrary figure of merit)
%pylab inline
Populating the interactive namespace from numpy and matplotlib
Dataset 'magic' from UCI
!cd toy_datasets; wget -O magic04.data -nc --no-check-certificate https://archive.ics.uci.edu/ml/machine-learning-databases/magic/magic04.data
File `magic04.data' already there; not retrieving.
import numpy, pandas
from rep.utils import train_test_split
from sklearn.metrics import roc_auc_score
columns = ['fLength', 'fWidth', 'fSize', 'fConc', 'fConc1', 'fAsym', 'fM3Long', 'fM3Trans', 'fAlpha', 'fDist', 'g']
data = pandas.read_csv('toy_datasets/magic04.data', names=columns)
labels = numpy.array(data['g'] == 'g', dtype=int)
data = data.drop('g', axis=1)
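Before tuning anything it is worth fixing a baseline. A minimal sketch (illustrative, not part of the tuning flow below): train the same kind of model with untuned settings on a holdout split and measure ROC AUC, reusing the train_test_split and roc_auc_score imported above.
from sklearn.ensemble import GradientBoostingClassifier
# illustrative baseline: untuned 30-tree model, holdout ROC AUC for reference
trainX, testX, trainY, testY = train_test_split(data, labels)
baseline = GradientBoostingClassifier(n_estimators=30).fit(trainX, trainY)
print('baseline ROC AUC:', roc_auc_score(testY, baseline.predict_proba(testX)[:, 1]))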
In this example we optimize the learning_rate and max_depth hyperparameters of a GradientBoostingClassifier, using a random search over the grid and scoring each candidate by ROC AUC computed with folding.
import numpy
import pandas
from rep import utils
from sklearn.ensemble import GradientBoostingClassifier
from rep.report.metrics import RocAuc
from rep.metaml import GridOptimalSearchCV, FoldingScorer, RandomParameterOptimizer
from rep.estimators import SklearnClassifier, TMVAClassifier, XGBoostRegressor
# define grid parameters
grid_param = {}
grid_param['learning_rate'] = [0.2, 0.1, 0.05, 0.02, 0.01]
grid_param['max_depth'] = [2, 3, 4, 5]
# use random hyperparameter optimization algorithm
generator = RandomParameterOptimizer(grid_param)
# define folding scorer
scorer = FoldingScorer(RocAuc(), folds=3, fold_checks=3)
%%time
estimator = SklearnClassifier(GradientBoostingClassifier(n_estimators=30))
grid_finder = GridOptimalSearchCV(estimator, generator, scorer, parallel_profile='threads-4')
grid_finder.fit(data, labels)
Performing grid search in 4 threads
4 evaluations done
8 evaluations done
10 evaluations done
CPU times: user 38.5 s, sys: 609 ms, total: 39.1 s
Wall time: 14.8 s
grid_finder.params_generator.print_results()
0.917: learning_rate=0.2, max_depth=3
0.914: learning_rate=0.1, max_depth=4
0.903: learning_rate=0.1, max_depth=3
0.888: learning_rate=0.01, max_depth=5
0.885: learning_rate=0.05, max_depth=3
0.874: learning_rate=0.01, max_depth=4
0.870: learning_rate=0.05, max_depth=2
0.854: learning_rate=0.01, max_depth=3
0.850: learning_rate=0.02, max_depth=2
0.834: learning_rate=0.01, max_depth=2
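The top row is the winning candidate. To actually use it, refit an estimator with those values on the full dataset (a sketch; the parameters are copied by hand from the printout above):
# refit the best configuration found above (learning_rate=0.2, max_depth=3);
# n_estimators=30 matches the estimator that was passed to the search
best_clf = GradientBoostingClassifier(n_estimators=30, learning_rate=0.2, max_depth=3)
best_clf.fit(data, labels)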
In many applications we need to optimize some binary classification metric (F1, BER, misclassification error). In that case, each time a classifier is trained we need to find the optimal threshold on the predicted probabilities (the default one is usually poor).
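To make the threshold search concrete, here is a hand-rolled scan for the misclassification error (illustration only; the OptimalMetric wrapper used below performs this kind of scan automatically for whatever metric you wrap):
# illustrative threshold scan: try every distinct predicted probability as a
# cut and keep the one that minimizes the misclassification error
def best_threshold(probabilities, labels):
    thresholds = numpy.unique(probabilities)
    errors = [numpy.mean((probabilities > t) != labels) for t in thresholds]
    return thresholds[numpy.argmin(errors)]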
In this example we optimize the AMS metric (wrapped in OptimalMetric, so the threshold is chosen optimally each time) for a TMVA gradient-boosted BDT, using a regression-based optimizer (Gaussian processes) to propose new parameter points.
from rep.metaml import RegressionParameterOptimizer
from sklearn.gaussian_process import GaussianProcess
from rep.report.metrics import OptimalMetric, ams
%%time
# OptimalMetric is a wrapper that checks all possible thresholds
# the expected numbers of signal and background events are taken as arbitrary numbers
optimal_ams = OptimalMetric(ams, expected_s=100, expected_b=1000)
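# (assumption, for orientation: ams follows the HiggsML definition,
#  AMS = sqrt(2 * ((s + b + b_r) * ln(1 + s / (b + b_r)) - s)),
#  with s, b the expected signal/background passing the chosen threshold)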
# define grid parameters
grid_param = {'Shrinkage': [0.4, 0.2, 0.1, 0.05, 0.02, 0.01],
              'NTrees': [5, 10, 15, 20, 25],
              # you can pass different sets of features to be compared
              'features': [columns[:2], columns[:3], columns[:4]],
              }
# using GaussianProcesses
generator = RegressionParameterOptimizer(grid_param, n_evaluations=10, regressor=GaussianProcess(), n_attempts=10)
# define folding scorer
scorer = FoldingScorer(optimal_ams, folds=2, fold_checks=2)
grid_finder = GridOptimalSearchCV(TMVAClassifier(method='kBDT', BoostType='Grad',), generator, scorer, parallel_profile='threads-3')
grid_finder.fit(data, labels)
Performing grid search in 3 threads
3 evaluations done
6 evaluations done
9 evaluations done
12 evaluations done
CPU times: user 8.39 s, sys: 1.75 s, total: 10.1 s
Wall time: 1min 17s
grid_finder.generator.print_results()
4.348: Shrinkage=0.4, NTrees=20, features=['fLength', 'fWidth', 'fSize', 'fConc']
4.253: Shrinkage=0.4, NTrees=25, features=['fLength', 'fWidth', 'fSize']
4.222: Shrinkage=0.4, NTrees=20, features=['fLength', 'fWidth', 'fSize']
4.201: Shrinkage=0.4, NTrees=10, features=['fLength', 'fWidth', 'fSize', 'fConc']
4.188: Shrinkage=0.4, NTrees=15, features=['fLength', 'fWidth', 'fSize']
4.152: Shrinkage=0.2, NTrees=20, features=['fLength', 'fWidth', 'fSize']
4.130: Shrinkage=0.2, NTrees=15, features=['fLength', 'fWidth', 'fSize']
4.064: Shrinkage=0.1, NTrees=15, features=['fLength', 'fWidth', 'fSize', 'fConc']
4.060: Shrinkage=0.1, NTrees=15, features=['fLength', 'fWidth', 'fSize']
3.983: Shrinkage=0.05, NTrees=10, features=['fLength', 'fWidth', 'fSize', 'fConc']
3.845: Shrinkage=0.01, NTrees=10, features=['fLength', 'fWidth', 'fSize', 'fConc']
3.696: Shrinkage=0.1, NTrees=15, features=['fLength', 'fWidth']
plot(grid_finder.generator.grid_scores_.values())
[<matplotlib.lines.Line2D at 0x1101333d0>]
REP supports the sklearn way of combining classifiers and getting/setting their parameters, so you can tune complex models using the same approach.
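Nested parameters are addressed with sklearn's double-underscore convention, <component>__<parameter>; a pure-sklearn illustration:
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
# base_estimator__<param> reaches into the estimator wrapped by the bagger
bagging = BaggingRegressor(DecisionTreeRegressor())
bagging.set_params(n_estimators=5, base_estimator__max_depth=3)
print(bagging.get_params()['base_estimator__max_depth'])  # -> 3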
Let's optimize a BaggingRegressor built on top of XGBoostRegressor, with a custom scorer based on mean absolute error.
from sklearn.ensemble import BaggingRegressor
from rep.estimators import XGBoostRegressor
from rep.utils import train_test_split
# splitting into train and test
train_data, test_data, train_labels, test_labels = train_test_split(data, labels)
from sklearn.metrics import mean_absolute_error
from sklearn.base import clone
class MyMAEScorer(object):
    def __init__(self, test_data, test_labels):
        self.test_data = test_data
        self.test_labels = test_labels

    def __call__(self, base_estimator, params, X, y, sample_weight=None):
        # clone the estimator, apply the candidate parameters, train
        cl = clone(base_estimator)
        cl.set_params(**params)
        cl.fit(X, y)
        # return with a minus sign, because the optimizer maximizes the score
        return -mean_absolute_error(self.test_labels, cl.predict(self.test_data))
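Since the scorer is a plain callable with the signature (base_estimator, params, X, y, sample_weight), it can be sanity-checked directly before handing it to the search (an illustrative check with a simple sklearn regressor):
from sklearn.tree import DecisionTreeRegressor
# illustrative direct call, outside of any grid search
check_scorer = MyMAEScorer(test_data, test_labels)
print(check_scorer(DecisionTreeRegressor(), {'max_depth': 4}, train_data, train_labels))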
%%time
# define grid parameters
grid_param = {
    # parameters of the sklearn Bagging wrapper
    'n_estimators': [1, 3, 5, 7],
    'max_samples': [0.2, 0.4, 0.6, 0.8],
    # parameters of the base estimator (XGBoost)
    'base_estimator__n_estimators': [10, 20, 40],
    'base_estimator__eta': [0.1, 0.2, 0.4, 0.6, 0.8]
}
# using Gaussian Processes
generator = RegressionParameterOptimizer(grid_param, n_evaluations=10, regressor=GaussianProcess(), n_attempts=10)
estimator = BaggingRegressor(XGBoostRegressor(), n_estimators=10)
scorer = MyMAEScorer(test_data, test_labels)
grid_finder = GridOptimalSearchCV(estimator, generator, scorer, parallel_profile=None)
grid_finder.fit(data, labels)
CPU times: user 27.4 s, sys: 300 ms, total: 27.7 s
Wall time: 28 s
grid_finder.generator.print_results()
-0.158: n_estimators=3, max_samples=0.6, base_estimator__n_estimators=40, base_estimator__eta=0.8
-0.161: n_estimators=3, max_samples=0.6, base_estimator__n_estimators=40, base_estimator__eta=0.6
-0.168: n_estimators=3, max_samples=0.4, base_estimator__n_estimators=40, base_estimator__eta=0.6
-0.169: n_estimators=3, max_samples=0.4, base_estimator__n_estimators=40, base_estimator__eta=0.4
-0.179: n_estimators=1, max_samples=0.8, base_estimator__n_estimators=40, base_estimator__eta=0.2
-0.182: n_estimators=1, max_samples=0.4, base_estimator__n_estimators=40, base_estimator__eta=0.2
-0.184: n_estimators=1, max_samples=0.6, base_estimator__n_estimators=40, base_estimator__eta=0.2
-0.184: n_estimators=1, max_samples=0.6, base_estimator__n_estimators=20, base_estimator__eta=0.6
-0.190: n_estimators=1, max_samples=0.8, base_estimator__n_estimators=10, base_estimator__eta=0.8
-0.321: n_estimators=1, max_samples=0.2, base_estimator__n_estimators=10, base_estimator__eta=0.1
Grid search in REP extends sklearn's grid search, using optimization techniques to avoid an exhaustive scan over the estimator parameters.
REP provides predefined scorers, metric functions and optimization techniques. Each component is replaceable, so you can optimize complex models and pipelines (folders, bagging, boosting and so on).
ParameterOptimizer is responsible for generating new sets of parameters to be checked
Scorer is responsible for training the estimator and evaluating the metric
GridOptimalSearchCV ties all of this together and sends tasks to a cluster or to separate threads.
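Schematically, wiring the three components together always looks the same (placeholder values; any generator/scorer pair from this notebook can be dropped in):
# the generator proposes candidate parameters, the scorer evaluates each
# candidate, GridOptimalSearchCV orchestrates and parallelizes the evaluations
generator = RandomParameterOptimizer({'learning_rate': [0.1, 0.2], 'max_depth': [3, 4]})
scorer = FoldingScorer(RocAuc(), folds=3, fold_checks=3)
grid_finder = GridOptimalSearchCV(SklearnClassifier(GradientBoostingClassifier()),
                                  generator, scorer, parallel_profile='threads-4')
grid_finder.fit(data, labels)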