This notebook contains code and comments from Section 1.3 of the book Ensemble Methods for Machine Learning. Please see the book for additional details on this topic. This notebook and code are released under the MIT license.
We will explore fit vs. complexity (a simpler view of the bias-variance dilemma) through a regression task on a synthetic data set, Friedman-1. This data set poses a highly nonlinear regression problem, where the labels are related to the features through the relationship:
$$y = 10 \sin(\pi x_1 x_2) + 20 (x_3 - 0.5)^2 + 10 x_4 + 5 x_5 + \mathrm{GaussianNoise}(0, \sigma)$$
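Only the first five features appear in this relationship; any remaining features are irrelevant to the label. As a quick check, the noise-free part of the formula can be evaluated by hand and compared against scikit-learn's generator. This is a minimal sketch: `friedman1_noise_free`, `X_demo` and `y_demo` are illustrative names, and the sample size and seed are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_friedman1


def friedman1_noise_free(X):
    """Evaluate the Friedman-1 target without noise; only columns 0-4 are used."""
    return (10 * np.sin(np.pi * X[:, 0] * X[:, 1])
            + 20 * (X[:, 2] - 0.5) ** 2
            + 10 * X[:, 3]
            + 5 * X[:, 4])


# With noise=0.0, the generated labels should agree with the hand-written formula
X_demo, y_demo = make_friedman1(n_samples=100, n_features=15,
                                noise=0.0, random_state=0)
max_abs_diff = np.max(np.abs(y_demo - friedman1_noise_free(X_demo)))
print(max_abs_diff)
```

The difference should be zero up to floating-point rounding, which confirms that columns 5 through 14 play no role in the labels.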
from sklearn.datasets import make_friedman1
X, y = make_friedman1(n_samples=500,
                      n_features=15,
                      noise=0.3,
                      random_state=23)
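Perform 10 runs of the following: randomly split the data into a training set (75%) and a test set (25%), train decision trees of maximum depth 1 through 10 on the training set, and compute the $R^2$ score of each tree on both sets. The scores are then averaged across the runs.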
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import validation_curve
import numpy as np
subsets = ShuffleSplit(n_splits=10, test_size=0.25, random_state=23)
model = DecisionTreeRegressor()
trn_scores, tst_scores = validation_curve(model, X, y,
                                          param_name='max_depth', param_range=range(1, 11),
                                          cv=subsets, scoring='r2')
mean_train_score = np.mean(trn_scores, axis=1)
mean_test_score = np.mean(tst_scores, axis=1)
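To make the averaging over `axis=1` concrete, the sketch below inspects the splits produced by `ShuffleSplit` and the shape of the arrays returned by `validation_curve`. It regenerates the data so that it runs on its own; the names `X_demo`, `splitter` and `depths`, and the shortened depth range, are choices made for this illustration only.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.model_selection import ShuffleSplit, validation_curve
from sklearn.tree import DecisionTreeRegressor

X_demo, y_demo = make_friedman1(n_samples=500, n_features=15,
                                noise=0.3, random_state=23)
splitter = ShuffleSplit(n_splits=10, test_size=0.25, random_state=23)

# Each split is an independent random 75%/25% partition of the 500 rows
split_sizes = [(len(trn_idx), len(tst_idx))
               for trn_idx, tst_idx in splitter.split(X_demo)]
print(split_sizes[0])   # (375, 125)

# A shortened depth range keeps this illustration quick
depths = [1, 2, 3]
trn_demo, tst_demo = validation_curve(DecisionTreeRegressor(random_state=0),
                                      X_demo, y_demo,
                                      param_name='max_depth', param_range=depths,
                                      cv=splitter, scoring='r2')

# One row per parameter value, one column per split
print(trn_demo.shape)   # (3, 10)
```

Since rows index parameter values and columns index splits, taking the mean along `axis=1` yields one averaged score per tree depth.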
Plot the training and test score curves.
import matplotlib.pyplot as plt
fig = plt.figure()
plt.plot(range(1, 11), mean_train_score, linewidth=1.5, marker='o', markersize=9, mfc='w');
plt.plot(range(1, 11), mean_test_score, linewidth=1.5, marker='s', markersize=9, mfc='w');
plt.legend(['training score', 'test score'], loc='lower center', ncol=2, fontsize=12)
plt.xlabel('Decision Tree Complexity (maximum tree depth)', fontsize=12);
plt.ylabel('$R^2$ coefficient', fontsize=12);
plt.xticks(range(1, 11));
plt.title('Decision Tree Regression')
fig.tight_layout()
# plt.savefig('./figures/CH01_F04_Kunapuli.png', format='png', dpi=300, bbox_inches='tight');
# plt.savefig('./figures/CH01_F04_Kunapuli.pdf', format='pdf', dpi=300, bbox_inches='tight');
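An $R^2$ score close to 1 means that the model explains nearly all of the variance in the labels: its squared error is close to zero relative to the error of always predicting the mean label.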
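For reference, the score can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses a small made-up set of labels and predictions (`y_obs` and `y_hat` are illustrative values, not taken from the experiment) and checks the result against `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up labels and predictions, purely for illustration
y_obs = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

ss_res = np.sum((y_obs - y_hat) ** 2)          # squared error of the model
ss_tot = np.sum((y_obs - y_obs.mean()) ** 2)   # squared error of the mean predictor
r2_manual = 1.0 - ss_res / ss_tot

# A model that always predicts the mean label scores exactly 0
r2_mean_baseline = r2_score(y_obs, np.full_like(y_obs, y_obs.mean()))

print(r2_manual, r2_score(y_obs, y_hat), r2_mean_baseline)
```

A score of 0 therefore corresponds to a model no better than the mean predictor, and the score can become negative for models that do worse than that.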
As decision trees become deeper (more complex) the training scores increase and the resulting models fit the data increasingly better. However, the test scores do not correspondingly increase and the resulting models do not generalize better. Thus, the most complex model with the best fit on the training set is not necessarily the best model for future predictions.
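One way to act on this observation is to pick the tree depth with the highest mean test score rather than the highest training score. The following is a minimal, self-contained sketch of that selection; it regenerates the data and scores, and the names `best_depth` and `gap` are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.model_selection import ShuffleSplit, validation_curve
from sklearn.tree import DecisionTreeRegressor

X_demo, y_demo = make_friedman1(n_samples=500, n_features=15,
                                noise=0.3, random_state=23)
splitter = ShuffleSplit(n_splits=10, test_size=0.25, random_state=23)
depths = np.arange(1, 11)

trn, tst = validation_curve(DecisionTreeRegressor(random_state=0),
                            X_demo, y_demo,
                            param_name='max_depth', param_range=depths,
                            cv=splitter, scoring='r2')
mean_trn = trn.mean(axis=1)
mean_tst = tst.mean(axis=1)

# Depth that generalizes best according to the averaged test scores
best_depth = int(depths[np.argmax(mean_tst)])

# Difference between training and test scores at each depth
gap = mean_trn - mean_tst

print('best depth by test score:', best_depth)
print('train/test gap per depth:', np.round(gap, 2))
```

The gap between the two curves gives a rough indication of how much of the training fit fails to carry over to unseen data at each depth.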
SVMs aim to minimize an objective function of the form
$$\text{objective} = \text{complexity}(\text{model}) + C \cdot \text{loss}(\text{model}, \text{data}).$$
As C increases, the loss term becomes more dominant, forcing the SVM to reduce the loss and improve the fit. As it does so, however, the complexity term is increasingly ignored and the model becomes more complex.
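The trade-off can be made concrete with a toy calculation. The sketch below assumes a linear model, measures complexity as half the squared norm of the weights, and uses the epsilon-insensitive loss common in support vector regression. `svr_objective` is a hypothetical helper written for this illustration, not a scikit-learn function, and the two candidate models are hand-picked rather than learned.

```python
import numpy as np


def svr_objective(w, b, X, y, C, epsilon=0.1):
    """Return (complexity term, C * loss term) for a linear model w.x + b."""
    residuals = np.abs(y - (X @ w + b))
    # Epsilon-insensitive loss: errors smaller than epsilon cost nothing
    loss = np.sum(np.maximum(0.0, residuals - epsilon))
    complexity = 0.5 * np.dot(w, w)
    return complexity, C * loss


X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0.0, 1.0, 2.0, 3.0])

# Two hand-picked candidates: a flat (simple) model and a perfectly fitting one
candidates = {
    'flat': (np.array([0.0]), 1.5),
    'fit': (np.array([1.0]), 0.0),
}

totals = {}
for C in (0.01, 100.0):
    for name, (w, b) in candidates.items():
        complexity, weighted_loss = svr_objective(w, b, X_toy, y_toy, C)
        totals[(C, name)] = complexity + weighted_loss
        print(f'C={C:g}, {name}: complexity={complexity:.3f}, '
              f'C*loss={weighted_loss:.3f}')
```

With the small value of C, the flat model has the lower total objective (about 0.036 versus 0.5), so simplicity wins; with the large value, its weighted loss grows to about 360 and the better-fitting model is preferred.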
This behavior is visualized below for a simple 1d regression problem where the (synthetic) data is generated from the true function $y = \frac{\sin x}{x}$.
from sklearn.svm import SVR
from sklearn.metrics import r2_score
n_syn = 100
X_syn = np.linspace(-10.0, 10.0, n_syn).reshape(-1, 1)
y_true = np.sin(X_syn) / X_syn
y_true = y_true.ravel()
y_syn = y_true + 0.125 * np.random.normal(0.0, 1.0, y_true.shape)
y_syn[-1] = -0.5 # Add one very noisy point to illustrate (exaggeratedly), the impact of overfitting
fig, ax = plt.subplots(nrows=2, ncols=3, figsize=(12, 9))
for k, C in enumerate(10.0**np.arange(-3, 3)):
    # Find the correct axis row and column and plot the noisy data and the true function
    i, j = np.divmod(k, 3)
    ax[i, j].scatter(X_syn[:, 0], y_syn, edgecolors=None, alpha=0.5);
    ax[i, j].plot(X_syn[:, 0], y_true, linewidth=1, linestyle='-.', label='true');
    # Learn an SVM model for this value of C
    model = SVR(C=C, kernel='rbf', gamma=0.75)
    model.fit(X_syn, y_syn)
    y_pred = model.predict(X_syn)
    # Plot the learned SVM model for this value of C
    ax[i, j].plot(X_syn[:, 0], y_pred, linewidth=3, linestyle='-', label='learned');
    # Finish up the plots
    trn_score = r2_score(y_syn, y_pred)
    ax[i, j].set_title('C=$10^{{ {0} }}$, trn score = {1:3.2f}'.format(int(np.log10(C)), trn_score))
    # Put legend on one plot
    if k == 0:
        handles, labels = ax[i, j].get_legend_handles_labels()
        ax[i, j].legend(handles, labels, loc='upper left', fontsize=12);
    # Hide the x-axis ticks on the top row
    if i == 0:
        ax[i, j].set_xticks([])
fig.tight_layout()
# plt.savefig('./figures/CH01_F05_Kunapuli.png', format='png', dpi=300, bbox_inches='tight');
# plt.savefig('./figures/CH01_F05_Kunapuli.pdf', format='pdf', dpi=300, bbox_inches='tight');
As C increases, the model moves from underfit to "good" fit. However, as C keeps increasing, the fit ultimately plateaus, though the model continues to become more nonlinear and complex. This increasing complexity makes it start deviating from the true underlying function and leads to overfitting, which ultimately hurts generalization on future data points.
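Because the data here is synthetic, the true function is known, so the deviation can also be measured numerically. The sketch below scores each learned model twice: against the noisy labels it was trained on, and against the noise-free function values. This is an illustration under stated assumptions: the seed is an arbitrary choice, scoring against the true function is only possible for synthetic data, and the exact numbers will vary with the noise draw.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.svm import SVR

rng = np.random.default_rng(0)   # arbitrary seed, for repeatability only

x = np.linspace(-10.0, 10.0, 100).reshape(-1, 1)
f_true = (np.sin(x) / x).ravel()
y_noisy = f_true + 0.125 * rng.normal(0.0, 1.0, f_true.shape)
y_noisy[-1] = -0.5               # same exaggerated outlier as in the experiment

results = {}
for C in 10.0 ** np.arange(-3, 3):
    svr = SVR(C=C, kernel='rbf', gamma=0.75).fit(x, y_noisy)
    pred = svr.predict(x)
    # (score on the noisy training labels, score against the true function)
    results[float(C)] = (r2_score(y_noisy, pred), r2_score(f_true, pred))

for C, (noisy_score, true_score) in results.items():
    print(f'C={C:g}: R2 vs noisy labels={noisy_score:.2f}, '
          f'R2 vs true function={true_score:.2f}')
```

Comparing the two columns shows where the training score keeps rising or levels off while the agreement with the underlying function no longer improves.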
Now we return to the Friedman data set and repeat the same experiment as we did with decision trees.
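Perform 10 runs of the following, reusing the same random train/test splits as before: train SVMs with the regularization parameter C ranging from $10^{-2}$ to $10^{4}$, and compute the training and test $R^2$ scores.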
from sklearn.svm import SVR
model = SVR(kernel='rbf', gamma=0.1)
trn_scores, tst_scores = validation_curve(model, X, y.ravel(),
                                          param_name='C',
                                          param_range=np.logspace(-2, 4, 7),
                                          cv=subsets, scoring='r2')
mean_train_score = np.mean(trn_scores, axis=1)
mean_test_score = np.mean(tst_scores, axis=1)
Plot the training and test score curves.
fig = plt.figure()  # new figure, so that fig.tight_layout() below applies to this plot
plt.semilogx(np.logspace(-2, 4, 7), mean_train_score, linewidth=1.5, marker='o', markersize=9, mfc='w');
plt.semilogx(np.logspace(-2, 4, 7), mean_test_score, linewidth=1.5, marker='s', markersize=9, mfc='w');
plt.legend(['training score', 'test score'], loc='lower center', ncol=2, fontsize=12);
plt.xlabel('SVM Complexity (regularization parameter, C)', fontsize=12);
plt.ylabel('$R^2$ coefficient', fontsize=12);
plt.title('Support Vector Regression', fontsize=12)
fig.tight_layout()
# plt.savefig('./figures/CH01_F06_Kunapuli.png', format='png', dpi=300, bbox_inches='tight');
# plt.savefig('./figures/CH01_F06_Kunapuli.pdf', format='pdf', dpi=300, bbox_inches='tight');
As with decision trees, a more complex model does fit the training data better, but without a corresponding improvement in generalization performance, as indicated by the test score.