The goal of this notebook is to use GPyOpt to tune the parameters of Machine Learning algorithms. In particular, in this section we will show how to tune the hyper-parameters for the Support Vector Regression (SVR) implemented in Scikit-learn. Given the standard interface of Scikit-learn, other models can be tuned in a similar fashion.
We start by loading the required modules.
%pylab inline
import GPy
import GPyOpt
import numpy as np
from sklearn import svm
from numpy.random import seed
seed(12345)
Populating the interactive namespace from numpy and matplotlib
For this example we will use the Olympic marathon dataset available in GPy. We split the original dataset into training data (first 20 data points) and testing data (last 7 data points). The performance of SVR is evaluated in terms of Root Mean Squared Error (RMSE) on the testing data.
# Let's load the dataset
GPy.util.datasets.authorize_download = lambda x: True # prevents requesting authorization for download.
data = GPy.util.datasets.olympic_marathon_men()
X = data['X']
Y = data['Y']
X_train = X[:20]
Y_train = Y[:20,0]
X_test = X[20:]
Y_test = Y[20:,0]
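As a quick standalone reference, the RMSE used throughout for evaluation is the square root of the mean squared prediction error; a minimal sketch with made-up values:

```python
import numpy as np

# RMSE on a toy example: only the last prediction is off, by 2.
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.0, 2.0, 5.0])
rmse = np.sqrt(np.square(y_pred - y_true).mean())
print(rmse)  # sqrt(4/3) ~ 1.1547
```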
Let's first see the results with the default kernel parameters.
from sklearn import svm
svr = svm.SVR()
svr.fit(X_train,Y_train)
Y_train_pred = svr.predict(X_train)
Y_test_pred = svr.predict(X_test)
print("The default parameters obtained: C="+str(svr.C)+", epsilon="+str(svr.epsilon)+", gamma="+str(svr.gamma))
We compute the RMSE on the testing data and plot the prediction. With the default parameters, SVR gives only a rough fit to the training data and misses the testing data completely.
plot(X_train,Y_train_pred,'b',label='pred-train')
plot(X_test,Y_test_pred,'g',label='pred-test')
plot(X_train,Y_train,'rx',label='ground truth')
plot(X_test,Y_test,'rx')
legend(loc='best')
print("RMSE = "+str(np.sqrt(np.square(Y_test_pred-Y_test).mean())))
RMSE = 0.56330740612
Now let's try Bayesian Optimization. We first write a wrapper function for fitting SVR. The objective is the RMSE from cross-validation. We optimize the parameters in log space.
nfold = 3
def fit_svr_val(x):
    x = np.atleast_2d(np.exp(x))
    fs = np.zeros((x.shape[0],1))
    for i in range(x.shape[0]):
        fs[i] = 0
        for n in range(nfold):
            idx = np.array(range(X_train.shape[0]))
            idx_valid = np.logical_and(idx>=X_train.shape[0]/nfold*n, idx<X_train.shape[0]/nfold*(n+1))
            idx_train = np.logical_not(idx_valid)
            svr = svm.SVR(C=x[i,0], epsilon=x[i,1], gamma=x[i,2])
            svr.fit(X_train[idx_train],Y_train[idx_train])
            fs[i] += np.sqrt(np.square(svr.predict(X_train[idx_valid])-Y_train[idx_valid]).mean())
        fs[i] *= 1./nfold
    return fs
## -- Note that similar wrapper functions can be used to tune other Scikit-learn methods
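As a standalone sketch of how the manual fold indexing above partitions the 20 training points (the float division still assigns every point to exactly one fold, though the fold sizes differ slightly):

```python
import numpy as np

# Reproduces the idx_valid logic from fit_svr_val on 20 points, 3 folds.
n, nfold = 20, 3
idx = np.arange(n)
fold_sizes = []
for fold in range(nfold):
    # Fold boundaries fall at 0, 6.67, 13.33, 20 due to float division.
    valid = np.logical_and(idx >= n / nfold * fold, idx < n / nfold * (fold + 1))
    fold_sizes.append(int(valid.sum()))
print(fold_sizes)  # [7, 7, 6] -- all 20 points covered exactly once
```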
Since we optimize in log space, we set the search interval of $C$ to be roughly $[1, 1000]$ and the search intervals of $\epsilon$ and $\gamma$ to be roughly $[1\times 10^{-5}, 0.1]$.
domain =[{'name': 'C', 'type': 'continuous', 'domain': (0.,7.)},
{'name': 'epsilon','type': 'continuous', 'domain': (-12.,-2.)},
{'name': 'gamma', 'type': 'continuous', 'domain': (-12.,-2.)}]
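Since the optimizer works in log space, exponentiating the domain bounds gives the effective search intervals; a quick standalone check:

```python
import numpy as np

# Effective intervals are exp() of the log-space bounds above.
print(np.exp(0.0), np.exp(7.0))      # C: roughly 1 to 1100
print(np.exp(-12.0), np.exp(-2.0))   # epsilon, gamma: roughly 6e-6 to 0.14
```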
We then create the GPyOpt object and run the optimization procedure. It might take a while.
opt = GPyOpt.methods.BayesianOptimization(f = fit_svr_val, # function to optimize
domain = domain, # box-constraints of the problem
acquisition_type ='LCB', # LCB acquisition
acquisition_weight = 0.1) # Exploration exploitation
# it may take a few seconds
opt.run_optimization(max_iter=50)
opt.plot_convergence()
Let's show the best parameters found. They differ significantly from the default parameters.
x_best = np.exp(opt.X[np.argmin(opt.Y)])
print("The best parameters obtained: C="+str(x_best[0])+", epsilon="+str(x_best[1])+", gamma="+str(x_best[2]))
svr = svm.SVR(C=x_best[0], epsilon=x_best[1],gamma=x_best[2])
svr.fit(X_train,Y_train)
Y_train_pred = svr.predict(X_train)
Y_test_pred = svr.predict(X_test)
The best parameters obtained: C=430.280588436, epsilon=0.00104925901026, gamma=6.14421235333e-06
We can see that SVR now gives a reasonable fit to the data. The result could be further improved by increasing max_iter.
plot(X_train,Y_train_pred,'b',label='pred-train')
plot(X_test,Y_test_pred,'g',label='pred-test')
plot(X_train,Y_train,'rx',label='ground truth')
plot(X_test,Y_test,'rx')
legend(loc='best')
print("RMSE = "+str(np.sqrt(np.square(Y_test_pred-Y_test).mean())))
RMSE = 0.0709714822355
In the same way as we did for the regression task, we can tune the parameters for a classification problem. The function to optimise is treated as a black box, so the procedure is very similar.
We can start by loading the standard Iris dataset from scikit-learn.
from sklearn.datasets import load_iris, make_blobs
# iris = load_iris()
# # Take the first two features. We could avoid this by using a two-dim dataset
# X = iris.data[:, :2]
# y = iris.target
X, y = make_blobs(centers=3, cluster_std=4, random_state=1234)
for i in np.unique(y):
    idx = y == i
    plt.plot(X[idx,0], X[idx,1], 'o')
# plt.title("First two dimensions of Iris data")
# plt.xlabel('Sepal length')
# plt.ylabel('Sepal width');
Then, let's divide the dataset into a "learning" and a "test" set. The test set will be used later to compare the performance of the parameters selected by GPyOpt against those from GridSearchCV and an SVC with default parameters.
For each case, we consider an SVM with RBF kernel.
Let's plot both the learning (with colors related to their classes) and the test data (red "x").
from sklearn.model_selection import StratifiedShuffleSplit
train_index, test_index = next(StratifiedShuffleSplit(test_size=.4, random_state=123).split(X,y))
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
for i in np.unique(y_train):
    idx = y_train == i
    plt.plot(X_train[idx,0], X_train[idx,1], 'o')
plt.plot(X_test[:,0], X_test[:,1], 'rx');
plt.title("Learning and test sets")
plt.xlabel('Sepal length')
plt.ylabel('Sepal width');
The first attempt is to use the SVC with its default parameters.
from sklearn.svm import SVC
svc = SVC(kernel='rbf').fit(X_train, y_train)
y_train_pred = svc.predict(X_train)
y_test_pred = svc.predict(X_test)
print("SVC with default parameters\nTest score: %.3f" % svc.score(X_test, y_test))
SVC with default parameters
Test score: 0.700
# utility functions taken from
# http://scikit-learn.org/stable/auto_examples/svm/plot_iris.html
def make_meshgrid(x, y, h=.02):
    """Create a mesh of points to plot in

    Parameters
    ----------
    x: data to base x-axis meshgrid on
    y: data to base y-axis meshgrid on
    h: stepsize for meshgrid, optional

    Returns
    -------
    xx, yy : ndarray
    """
    x_min, x_max = x.min() - 1, x.max() + 1
    y_min, y_max = y.min() - 1, y.max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    return xx, yy
def plot_contours(ax, clf, xx, yy, **params):
    """Plot the decision boundaries for a classifier.

    Parameters
    ----------
    ax: matplotlib axes object
    clf: a classifier
    xx: meshgrid ndarray
    yy: meshgrid ndarray
    params: dictionary of params to pass to contourf, optional
    """
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    out = ax.contourf(xx, yy, Z, **params)
    return out
X0, X1 = X[:, 0], X[:, 1]
xx, yy = make_meshgrid(X0, X1)
We can visualise the surfaces the SVM uses to separate the classes. Each point is colored according to its class.
f, ax = plt.subplots(1,1, squeeze=False)
plot_contours(ax[0,0], svc, xx, yy, alpha=0.8)
for i in np.unique(y_test):
    idx = y_test == i
    ax[0,0].plot(X_test[idx,0], X_test[idx,1], 'o')
ax[0,0].set_title("Difference between SVM prediction (background color) \nand real classes (point color) in test set")
# ax[0,0].set_xlim([4,8])
# ax[0,0].set_ylim([2,4.5])
# plt.xlabel('Sepal length')
# plt.ylabel('Sepal width');
plt.tight_layout()
For comparison, we can use the standard GridSearchCV for parameter selection. This builds a grid with all combinations of the parameters, and then selects the combination that achieves the maximum validation score (or minimum error).
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(SVC(kernel='rbf'), dict(C=np.logspace(-2,4,10), gamma=np.logspace(-12,-2,10))).fit(X_train,y_train)
print("GridSearchCV\nTest score: %.3f" % grid.score(X_test, y_test))
print("Best parameters extracted: %s" % grid.best_params_)
GridSearchCV
Test score: 0.850
Best parameters extracted: {'C': 0.21544346900318834, 'gamma': 0.01}
# y_test_pred = grid.predict(X_test)
f, ax = plt.subplots(1,1, squeeze=False)
plot_contours(ax[0,0], grid, xx, yy, alpha=0.8)
for i in np.unique(y_test):
    idx = y_test == i
    ax[0,0].plot(X_test[idx,0], X_test[idx,1], 'o')
ax[0,0].set_title("Difference between SVM prediction (background color) \n"
                  "and real classes (point color) in test set, using GridSearchCV")
# ax[0,0].set_xlim([4,8])
# ax[0,0].set_ylim([2,4.5])
# plt.xlabel('Sepal length')
# plt.ylabel('Sepal width');
plt.tight_layout()
domain =[{'name': 'C', 'type': 'continuous', 'domain': (-2.,4.)},
# {'name': 'kernel', 'type': 'categorical', 'domain': (0, 1)},
{'name': 'gamma', 'type': 'continuous', 'domain': (-12.,-2.)}
]
from sklearn.model_selection import cross_val_score
def fit_svc_val(x, mdl=None, cv=None):
    x = np.atleast_2d(np.exp(x))
    fs = np.zeros((x.shape[0], 1))
    for i, params in enumerate(x):
        dict_params = dict(zip([el['name'] for el in domain], params))
        if 'kernel' in dict_params:
            dict_params['kernel'] = 'rbf' if dict_params['kernel'] == 0 else 'poly'
        mdl.set_params(**dict_params)
        fs[i] = -np.mean(cross_val_score(mdl, X_train, y_train, cv=cv))
    return fs
from functools import partial
opt = GPyOpt.methods.BayesianOptimization(f = partial(fit_svc_val, mdl=SVC(kernel='rbf')), # function to optimize
domain = domain, # box-constraints of the problem
acquisition_type ='LCB', # LCB acquisition
acquisition_weight = 0.2) # Exploration exploitation
opt.run_optimization(max_iter=50)
opt.plot_convergence()
The set cost function is ignored! LBC acquisition does not make sense with cost.
opt.plot_acquisition()
x_best = np.exp(opt.X[np.argmin(opt.Y)])
best_params = dict(zip([el['name'] for el in domain], x_best))
svc_opt = SVC(**best_params)
svc_opt.fit(X_train, y_train)
# y_train_pred = svc.predict(X_train)
# y_test_pred = svc.predict(X_test)
print("GPyOpt\nTest score: %.3f" % svc_opt.score(X_test, y_test))
print("Best parameters extracted: %s" % best_params)
GPyOpt
Test score: 0.850
Best parameters extracted: {'C': 2.03757065667337, 'gamma': 0.0033852993544374603}
# y_test_pred = grid.predict(X_test)
f, ax = plt.subplots(1,1, squeeze=False)
plot_contours(ax[0,0], svc_opt, xx, yy, alpha=0.8)
for i in np.unique(y_test):
    idx = y_test == i
    ax[0,0].plot(X_test[idx,0], X_test[idx,1], 'o')
ax[0,0].set_title("Difference between SVM prediction (background color) \n"
                  "and real classes (point color) in test set, using GPyOpt")
# ax[0,0].set_xlim([4,8])
# ax[0,0].set_ylim([2,4.5])
# plt.xlabel('Sepal length')
# plt.ylabel('Sepal width');
plt.tight_layout()
The parameters selected by GridSearchCV and GPyOpt are very similar. However, it is clear from the convergence plots that GPyOpt converges to the minimum very quickly, while GridSearchCV has to evaluate every combination of parameters in the grid.
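As a rough, hypothetical back-of-envelope comparison of the number of model fits (assuming scikit-learn's default 5-fold cross-validation and ignoring GPyOpt's small initial design):

```python
# Hypothetical fit counts; assumes 5-fold CV and ignores BO's initial design.
grid_fits = 10 * 10 * 5  # 10 values of C x 10 values of gamma x 5 folds
bo_fits = 50 * 5         # 50 BO iterations, each evaluated with 5-fold CV
print(grid_fits, bo_fits)  # 500 250
```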