Notebook

This notebook contains code and comments from Section 1.3 of the book Ensemble Methods for Machine Learning. Please see the book for additional details on this topic. This notebook and code are released under the MIT license.

1.3 Fit vs. Complexity in Machine-Learning Models

We will explore fit vs. complexity (a simpler view of the bias-variance dilemma) through a regression task on a synthetic data set, Friedman-1. This data set poses a highly nonlinear regression problem in which the labels are related to the features through the relationship:

$y = 10 \sin(\pi x_1 x_2) + 20 (x_3 - 0.5)^2 + 10 x_4 + 5 x_5 + \text{GaussianNoise}(0, \sigma)$

In [1]:

from sklearn.datasets import make_friedman1
X, y = make_friedman1(n_samples=500, 
                      n_features=15, 
                      noise=0.3, 
                      random_state=23)
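To make the formula concrete, here is a minimal NumPy sketch of the Friedman-1 relationship. The helper name `friedman1_labels` is hypothetical (it is not part of scikit-learn), and the sketch assumes features drawn uniformly from [0, 1]. Only the first five of the 15 features enter the formula, so the remaining ten act as distractors.

```python
import numpy as np

def friedman1_labels(X, noise=0.0, rng=None):
    """Hypothetical helper: apply the Friedman-1 formula to a feature matrix X."""
    rng = np.random.default_rng() if rng is None else rng
    y = (10 * np.sin(np.pi * X[:, 0] * X[:, 1])
         + 20 * (X[:, 2] - 0.5) ** 2
         + 10 * X[:, 3]
         + 5 * X[:, 4])
    # Additive Gaussian noise with standard deviation `noise`
    return y + noise * rng.standard_normal(X.shape[0])

rng = np.random.default_rng(23)
X_demo = rng.uniform(size=(500, 15))
y_clean = friedman1_labels(X_demo)

# Replacing the ten unused features leaves the noise-free labels unchanged
X_shuffled = X_demo.copy()
X_shuffled[:, 5:] = rng.uniform(size=(500, 10))
print(np.allclose(y_clean, friedman1_labels(X_shuffled)))  # True
```

Because a model cannot know in advance which features matter, the extra columns give a flexible model more opportunities to fit noise.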

1.3.1 Regression with Decision Trees

Perform 10 runs of the following:

Split the data into training (75%) and test (25%) sets randomly
Fit (train) decision trees of different depths in the range 1 to 10 on the training set
Evaluate each of the trees on both the training set (to get the training score) and test set (to get the test score) using the $R^2$ coefficient as the scoring metric

In [2]:

from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import validation_curve
import numpy as np

subsets = ShuffleSplit(n_splits=10, test_size=0.25, random_state=23)

model = DecisionTreeRegressor()
trn_scores, tst_scores = validation_curve(model, X, y,
                                          param_name='max_depth',
                                          param_range=range(1, 11),
                                          cv=subsets, scoring='r2')
mean_train_score = np.mean(trn_scores, axis=1) 
mean_test_score = np.mean(tst_scores, axis=1)  
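For readers who want to see what `validation_curve` does with these arguments, the following sketch spells out the equivalent loop over splits and depths. It is an illustration under assumptions, not a description of scikit-learn's internals: it uses a smaller data set, fewer splits and depths, and a fixed `random_state=0` on the tree (none of which appear in the cell above) so that both computations are deterministic and quick to compare.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.model_selection import ShuffleSplit, validation_curve
from sklearn.tree import DecisionTreeRegressor

X_demo, y_demo = make_friedman1(n_samples=200, n_features=15,
                                noise=0.3, random_state=23)
splits = ShuffleSplit(n_splits=3, test_size=0.25, random_state=23)
depths = range(1, 6)

# Manual loop: one row per depth, one column per train/test split
manual_trn = np.zeros((len(depths), splits.get_n_splits()))
manual_tst = np.zeros_like(manual_trn)
for j, (trn_idx, tst_idx) in enumerate(splits.split(X_demo)):
    for i, depth in enumerate(depths):
        tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
        tree.fit(X_demo[trn_idx], y_demo[trn_idx])
        # score() returns the R^2 coefficient for regressors
        manual_trn[i, j] = tree.score(X_demo[trn_idx], y_demo[trn_idx])
        manual_tst[i, j] = tree.score(X_demo[tst_idx], y_demo[tst_idx])

# The same computation through validation_curve
vc_trn, vc_tst = validation_curve(DecisionTreeRegressor(random_state=0),
                                  X_demo, y_demo,
                                  param_name='max_depth', param_range=depths,
                                  cv=splits, scoring='r2')

print(vc_trn.shape)  # (number of depths, number of splits)
print(np.allclose(manual_trn, vc_trn), np.allclose(manual_tst, vc_tst))
```

Averaging along `axis=1`, as in the cell above, therefore collapses the splits and leaves one mean score per depth.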

Plot the training and test score curves.

In [3]:

import matplotlib.pyplot as plt

fig = plt.figure()
plt.plot(range(1, 11), mean_train_score, linewidth=1.5, marker='o', markersize=9, mfc='w');
plt.plot(range(1, 11), mean_test_score, linewidth=1.5, marker='s', markersize=9, mfc='w');
plt.legend(['training score', 'test score'], loc='lower center', ncol=2, fontsize=12)
plt.xlabel('Decision Tree Complexity (maximum tree depth)', fontsize=12);
plt.ylabel('$R^2$ coefficient', fontsize=12);
plt.xticks(range(1, 11));
plt.title('Decision Tree Regression')
fig.tight_layout()

# plt.savefig('./figures/CH01_F04_Kunapuli.png', format='png', dpi=300, bbox_inches='tight');
# plt.savefig('./figures/CH01_F04_Kunapuli.pdf', format='pdf', dpi=300, bbox_inches='tight');

An $R^2$ score close to 1 means that the model's predictions explain nearly all of the variance in the labels, that is, its squared error is close to zero.
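As a quick reminder of what the score measures, the $R^2$ coefficient can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The helper `r2_by_hand` and the toy labels below are hypothetical and serve only to illustrate the definition.

```python
import numpy as np

def r2_by_hand(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_demo = np.array([1.0, 2.0, 3.0, 4.0])
print(r2_by_hand(y_demo, y_demo))                     # perfect predictions: 1.0
print(r2_by_hand(y_demo, np.full(4, y_demo.mean())))  # always predicting the mean: 0.0
print(r2_by_hand(y_demo, y_demo[::-1]))               # worse than the mean: -3.0
```

A score of 0 corresponds to always predicting the mean label, and the score can be negative for models that do worse than that.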

As decision trees become deeper (more complex) the training scores increase and the resulting models fit the data increasingly better. However, the test scores do not correspondingly increase and the resulting models do not generalize better. Thus, the most complex model with the best fit on the training set is not necessarily the best model for future predictions.
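One way to make this observation concrete is to compare the depth that maximizes the mean training score with the depth that maximizes the mean test score. The sketch below reruns the experiment in a self-contained form; the variable names and the fixed `random_state=0` on the tree are assumptions added for repeatability, and the exact numbers will depend on the library version.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.model_selection import ShuffleSplit, validation_curve
from sklearn.tree import DecisionTreeRegressor

X_demo, y_demo = make_friedman1(n_samples=500, n_features=15,
                                noise=0.3, random_state=23)
splits = ShuffleSplit(n_splits=10, test_size=0.25, random_state=23)
depths = np.arange(1, 11)

trn, tst = validation_curve(DecisionTreeRegressor(random_state=0),
                            X_demo, y_demo,
                            param_name='max_depth', param_range=depths,
                            cv=splits, scoring='r2')
mean_trn, mean_tst = trn.mean(axis=1), tst.mean(axis=1)

# Depth preferred by each criterion, and the train-test gap at every depth
best_by_train = depths[np.argmax(mean_trn)]
best_by_test = depths[np.argmax(mean_tst)]
gap = mean_trn - mean_tst

print('best depth by training score:', best_by_train)
print('best depth by test score:    ', best_by_test)
print('train-test gap per depth:    ', np.round(gap, 2))
```

A widening gap between the two scores as depth grows is the usual numerical signature of overfitting.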

1.3.2 Regression with Support Vector Machines

SVMs aim to minimize an objective function of the form

$\text{objective} = \text{complexity}(\text{model}) + C \cdot \text{loss}(\text{model}, \text{data}).$

As C increases, the loss term becomes more dominant, forcing the SVM to minimize the loss and improve the fit. As it does so, however, the complexity term is increasingly ignored, and the model becomes more complex.
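The trade-off can be illustrated numerically. The sketch below assumes a linear model with $\frac{1}{2}\|w\|^2$ as the complexity term and the epsilon-insensitive loss as the loss term, which is the usual form for support vector regression; the text above only gives the generic form, so these specific choices, the helper name `svr_objective_terms`, and the toy data are all assumptions.

```python
import numpy as np

def svr_objective_terms(w, b, X, y, C, epsilon=0.1):
    """Hypothetical helper: return (complexity, C * loss) for a linear model."""
    complexity = 0.5 * np.dot(w, w)
    residuals = np.abs(y - (X @ w + b))
    # Errors smaller than epsilon are ignored; larger ones are penalized linearly
    loss = np.sum(np.maximum(0.0, residuals - epsilon))
    return complexity, C * loss

X_demo = np.array([[1.0], [2.0], [3.0]])
y_demo = np.array([1.0, 2.0, 4.0])
w, b = np.array([1.0]), 0.0

for C in [0.01, 1.0, 100.0]:
    complexity, weighted_loss = svr_objective_terms(w, b, X_demo, y_demo, C)
    share = weighted_loss / (complexity + weighted_loss)
    print(f'C={C:g}: complexity={complexity:.2f}, '
          f'C*loss={weighted_loss:.3f}, loss share={share:.1%}')
```

For the same fixed model, the share of the objective contributed by the loss grows with C, so an optimizer has less and less incentive to keep the model simple.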

This behavior is visualized below for a simple 1D regression problem where the (synthetic) data is generated from the true function $y = \frac{\sin x}{x}$.

In [4]:

from sklearn.svm import SVR
from sklearn.metrics import r2_score

n_syn = 100
X_syn = np.linspace(-10.0, 10.0, n_syn).reshape(-1, 1)
y_true = np.sin(X_syn) / X_syn
y_true = y_true.ravel()
y_syn = y_true + 0.125 * np.random.normal(0.0, 1.0, y_true.shape)
y_syn[-1] = -0.5  # Add one very noisy point to illustrate (with exaggeration) the impact of overfitting

fig, ax = plt.subplots(nrows=2, ncols=3, figsize=(12, 9))
for k, C in enumerate(10.0**np.arange(-3, 3)):
    # Find the correct axis row and column and plot the noisy data and the true function
    i, j = np.divmod(k, 3)
    ax[i, j].scatter(X_syn[:, 0], y_syn, edgecolors=None, alpha=0.5);
    ax[i, j].plot(X_syn[:, 0], y_true, linewidth=1, linestyle='-.', label='true');
    
    # Learn an SVM model for this value of C
    model = SVR(C=C, kernel='rbf', gamma=0.75)
    model.fit(X_syn, y_syn)
    y_pred = model.predict(X_syn)
    
    # Plot the learned SVM model for this value of C
    ax[i, j].plot(X_syn[:, 0], y_pred, linewidth=3, linestyle='-', label='learned');
    
    # Finish up the plots
    trn_score = r2_score(y_syn, y_pred)
    ax[i, j].set_title('C=$10^{{ {0} }}$, trn score = {1:3.2f}'.format(int(np.log10(C)), trn_score))
    
    # Put legend on one plot
    if k == 0:
        handles, labels = ax[i, j].get_legend_handles_labels()
        ax[i, j].legend(handles, labels, loc='upper left', fontsize=12);
        
    if i == 0:
        ax[i, j].set_xticks([])
    
fig.tight_layout()

# plt.savefig('./figures/CH01_F05_Kunapuli.png', format='png', dpi=300, bbox_inches='tight');
# plt.savefig('./figures/CH01_F05_Kunapuli.pdf', format='pdf', dpi=300, bbox_inches='tight');

As C increases, the model moves from underfit to "good" fit. However, as C keeps increasing, the fit ultimately plateaus, though the model continues to become more nonlinear and complex. This increasing complexity makes it start deviating from the true underlying function and leads to overfitting, which ultimately hurts generalization on future data points.
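The plots can be complemented with numbers by scoring each fitted model twice: once against the noisy training labels and once against the noise-free true function. The sketch below is a self-contained variant of the experiment above; the fixed seed and the variable names are assumptions, and the exact values depend on the noise draw.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.svm import SVR

rng = np.random.default_rng(0)  # fixed seed (an assumption) for repeatability
X_demo = np.linspace(-10.0, 10.0, 100).reshape(-1, 1)
y_clean = (np.sin(X_demo) / X_demo).ravel()
y_noisy = y_clean + 0.125 * rng.standard_normal(y_clean.shape)
y_noisy[-1] = -0.5  # the same exaggerated noisy point as in the experiment above

results = []
for C in 10.0 ** np.arange(-3, 3):
    svr = SVR(C=C, kernel='rbf', gamma=0.75).fit(X_demo, y_noisy)
    y_fit = svr.predict(X_demo)
    fit_to_data = r2_score(y_noisy, y_fit)   # what the training score measures
    fit_to_truth = r2_score(y_clean, y_fit)  # closeness to the underlying function
    results.append((C, fit_to_data, fit_to_truth))
    print(f'C={C:g}: R2 vs noisy data={fit_to_data:.2f}, '
          f'R2 vs true function={fit_to_truth:.2f}')
```

Comparing the two columns shows whether improvements in the training score at large C still bring the model closer to the true function or mainly reflect fitting the noise.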

Now we return to the Friedman data set and repeat the same experiment as we did with decision trees.

Perform 10 runs of the following:

Use the same subsets from the previous experiment with decision trees
Fit (train) SVRs with different C values ($10^{-2}, 10^{-1}, \ldots, 10^{3}, 10^{4}$) on the training set
Evaluate each of the SVRs on both the training set (to get the training score) and test set (to get the test score) using the $R^2$ coefficient as the scoring metric

In [7]:

from sklearn.svm import SVR

model = SVR(kernel='rbf', gamma=0.1)
trn_scores, tst_scores = validation_curve(model, X, y.ravel(),    
                                          param_name='C',  
                                          param_range=np.logspace(-2, 4, 7), 
                                          cv=subsets, scoring='r2')

mean_train_score = np.mean(trn_scores, axis=1) 
mean_test_score = np.mean(tst_scores, axis=1)  
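Reusing the `subsets` object gives both experiments the same train/test partitions because a `ShuffleSplit` constructed with an integer `random_state` produces the same splits every time it is iterated. The sketch below checks this on a placeholder array; the names `splits` and `X_placeholder` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

splits = ShuffleSplit(n_splits=10, test_size=0.25, random_state=23)
X_placeholder = np.zeros((500, 15))  # only the number of rows matters here

# Iterate over the splitter twice and compare the index arrays
first_pass = [(trn.copy(), tst.copy()) for trn, tst in splits.split(X_placeholder)]
second_pass = list(splits.split(X_placeholder))

same = all(np.array_equal(a_trn, b_trn) and np.array_equal(a_tst, b_tst)
           for (a_trn, a_tst), (b_trn, b_tst) in zip(first_pass, second_pass))
print(same)                                          # True
print(len(first_pass[0][0]), len(first_pass[0][1]))  # 375 125
```

This makes the decision tree and SVR curves directly comparable, since differences between them cannot be attributed to different random partitions.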

Plot the training and test score curves.

In [8]:

fig = plt.figure()  # start a new figure so that tight_layout() below applies to this plot
plt.semilogx(np.logspace(-2, 4, 7), mean_train_score, linewidth=1.5, marker='o', markersize=9, mfc='w');
plt.semilogx(np.logspace(-2, 4, 7), mean_test_score, linewidth=1.5, marker='s', markersize=9, mfc='w');
plt.legend(['training score', 'test score'], loc='lower center', ncol=2, fontsize=12);
plt.xlabel('SVM Complexity (regularization parameter, C)', fontsize=12);
plt.ylabel('$R^2$ coefficient', fontsize=12);
plt.title('Support Vector Regression', fontsize=12)
fig.tight_layout()

# plt.savefig('./figures/CH01_F06_Kunapuli.png', format='png', dpi=300, bbox_inches='tight');
# plt.savefig('./figures/CH01_F06_Kunapuli.pdf', format='pdf', dpi=300, bbox_inches='tight');

As with decision trees, a more complex model does fit the training data better, but without a corresponding improvement in generalization performance, as indicated by the test score.