Up to now, we have seen many ways to use models from the sklearn
library to help solve your machine learning problems. Once you know how to use these models, an important question is that of model selection.
In this section of the tutorial, we will deal with two important questions related to model selection:

- how to assess whether a fitted model underfits or overfits the data, using a validation set or cross-validation;
- how to pick adequate hyperparameter values for an sklearn model.

Let us consider the problem of explaining a variable y using a covariate X in the following case:
%matplotlib inline
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
import matplotlib.pyplot as plt
import numpy as np
np.random.seed(0)
X = np.random.randn(100, 1)
y = 2 * X ** 2 - X + .3 * np.random.randn(100, 1)
plt.scatter(X, y, s=40)
Of course, if we want to fit a linear regression model to this data, we do not get great results:
def plot_regression(regressor, data, y, tf=None):
    # Scatter the data, then draw the regressor's predictions on a fine grid
    plt.scatter(data[:, 0], y, s=40)
    X = np.linspace(data[:, 0].min() - .5, data[:, 0].max() + .5, 100).reshape((100, 1))
    if tf is None:
        y_pred = regressor.predict(X)
    else:
        # tf has already been fitted, so we only transform here
        y_pred = regressor.predict(tf.transform(X))
    plt.plot(X, y_pred, color="r")
    plt.xlim(data[:, 0].min() - .5, data[:, 0].max() + .5)
    plt.ylim(y.min() - .5, y.max() + .5)
regressor = LinearRegression()
regressor.fit(X, y)
plot_regression(regressor, X, y)
We are in a typical case of underfitting: our model is not complex enough to explain our data.
Let us see what happens if we now feed our linear regressor with polynomial features:
tf = PolynomialFeatures(degree=2)
X_deg2 = tf.fit_transform(X)
print(X_deg2.shape) # Col. 0: constant, Col. 1: X, Col. 2: X^2
regressor_deg2 = LinearRegression()
regressor_deg2.fit(X_deg2, y)
plot_regression(regressor_deg2, X, y, tf=tf)
(100, 3)
Here, the model seems to fit our data pretty well. What would happen if we used a higher-degree polynomial representation?
tf = PolynomialFeatures(degree=10)
X_deg10 = tf.fit_transform(X)
regressor_deg10 = LinearRegression()
regressor_deg10.fit(X_deg10, y)
plot_regression(regressor_deg10, X, y, tf=tf)
Here, it is quite clear that we are overfitting the data.
Can the $R^2$ score confirm this impression? Let us have a look:
print(regressor_deg2.score(X_deg2, y))
print(regressor_deg10.score(X_deg10, y))
0.98683319423
0.988025974709
In this case, the $R^2$ is not able to show that the degree-10 polynomial regression is overfitting: both scores are computed on the very data the models were fitted on.
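The reason is structural: the degree-$d$ features are a subset of the degree-$(d+1)$ features, so the training $R^2$ of a least-squares fit can only go up (or stay flat) as the degree grows. A quick check, re-creating the same toy data as above for self-containedness:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Same toy data as above
np.random.seed(0)
X = np.random.randn(100, 1)
y = 2 * X ** 2 - X + .3 * np.random.randn(100, 1)

train_scores = []
for degree in range(1, 11):
    X_poly = PolynomialFeatures(degree=degree).fit_transform(X)
    # R^2 evaluated on the same data the model was fitted on
    train_scores.append(LinearRegression().fit(X_poly, y).score(X_poly, y))
print(train_scores)  # non-decreasing: training R^2 cannot reveal overfitting
```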
To better assess whether a fitted model tends to underfit or overfit the data, it is necessary to split our data into a train set and a validation set.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)
for degree in range(1, 11):
    tf = PolynomialFeatures(degree=degree)
    X_train_poly = tf.fit_transform(X_train)
    X_test_poly = tf.transform(X_test)  # re-use the transformer fitted on the training set
    regressor = LinearRegression()
    regressor.fit(X_train_poly, y_train)
    print(degree, regressor.score(X_test_poly, y_test))
1 -0.150174605536
2 0.989327576712
3 0.98875729718
4 0.989462491585
5 0.980318724099
6 0.913716903047
7 0.799959199313
8 0.255051844284
9 0.444334215095
10 -6.52795449077
To get more reliable results, it is recommended to use $k$-fold cross validation:
from sklearn.model_selection import cross_val_score, KFold
for degree in range(1, 11):
    tf = PolynomialFeatures(degree=degree)
    X_poly = tf.fit_transform(X)
    scores = cross_val_score(LinearRegression(), X_poly, y, cv=KFold(n_splits=3))
    print(degree, scores.mean())
1 -0.0473348727
2 0.97998648803
3 0.979919207038
4 0.980795367286
5 0.966331415499
6 0.951232294194
7 0.640549302997
8 -2.36809778361
9 -0.837808309871
10 -46.5247428935
This indicates that we should pick a degree between 2 and 6.
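The selection step can also be automated by wrapping the polynomial expansion and the regressor in a Pipeline, so that cross_val_score re-fits the features inside each fold. A minimal sketch, re-creating the toy data from above:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, KFold

# Same toy data as above
np.random.seed(0)
X = np.random.randn(100, 1)
y = 2 * X ** 2 - X + .3 * np.random.randn(100, 1)

degrees = list(range(1, 11))
mean_scores = []
for degree in degrees:
    # The pipeline applies the polynomial expansion fold by fold
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    mean_scores.append(cross_val_score(model, X, y, cv=KFold(n_splits=3)).mean())
best_degree = degrees[int(np.argmax(mean_scores))]
print(best_degree)
```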
We now turn our focus to setting adequate values for hyperparameters in diverse machine learning problems. In the following, we will consider the case of Support Vector Machines for classification, but what we will see can be straightforwardly extended to other sklearn models.
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.datasets import make_blobs, make_circles
X, y = make_blobs(n_samples=50, n_features=2, centers=2, cluster_std=1.3, random_state=0)
dict_params = {"C": [.01, .1, 1., 10.]}
clf = GridSearchCV(estimator=SVC(kernel="linear"), param_grid=dict_params, cv=KFold(n_splits=3))
clf.fit(X, y)
print(clf.best_estimator_)
SVC(C=0.01, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape=None, degree=3, gamma='auto', kernel='linear', max_iter=-1, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False)
The value $C=0.01$ seems to be the best when using cross-validation on the training set. A similar scheme can also be used to pick an adequate kernel for the SVM:
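Beyond best_estimator_, the cv_results_ attribute of a fitted GridSearchCV stores the mean cross-validated score of every candidate, which helps judge how decisive the choice of C actually is. A minimal sketch reproducing the search above:

```python
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

# Same blobs data and grid as above
X, y = make_blobs(n_samples=50, n_features=2, centers=2, cluster_std=1.3, random_state=0)
dict_params = {"C": [.01, .1, 1., 10.]}
clf = GridSearchCV(estimator=SVC(kernel="linear"), param_grid=dict_params, cv=KFold(n_splits=3))
clf.fit(X, y)

# One mean cross-validated accuracy per candidate value of C
for params, score in zip(clf.cv_results_["params"], clf.cv_results_["mean_test_score"]):
    print(params, score)
print(clf.best_params_)
```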
X, y = make_circles(n_samples=100, random_state=0, noise=.1, factor=.5)
dict_params = {"C": [.01, .1, 1., 10.], "kernel": ["linear", "rbf"]}
clf = GridSearchCV(estimator=SVC(), param_grid=dict_params, cv=KFold(n_splits=3))  # the kernel is set by the grid
clf.fit(X, y)
print(clf.best_estimator_)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape=None, degree=3, gamma='auto', kernel='rbf', max_iter=-1, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False)
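Note that the cross-validated score of the winning candidate is an optimistic estimate of its performance, since that same score was used to pick it. A common scheme, sketched below, is to hold out a test set, run the grid search on the training part only, and report the score on the held-out part:

```python
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.svm import SVC
from sklearn.datasets import make_circles

# Same circles data and grid as above
X, y = make_circles(n_samples=100, random_state=0, noise=.1, factor=.5)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

dict_params = {"C": [.01, .1, 1., 10.], "kernel": ["linear", "rbf"]}
clf = GridSearchCV(estimator=SVC(), param_grid=dict_params, cv=KFold(n_splits=3))
clf.fit(X_train, y_train)          # grid search only ever sees the training set
print(clf.best_params_)
print(clf.score(X_test, y_test))   # unbiased estimate on held-out data
```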