This notebook contains code and comments from Section 3.1 of the book Ensemble Methods for Machine Learning. Please see the book for additional details on this topic. This notebook and code are released under the MIT license.
# Ignore FutureWarnings
# Currently generated by sklearn.neighbors which uses scipy.mode
from warnings import simplefilter
simplefilter(action='ignore', category=FutureWarning)
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
X, y = make_moons(600, noise=0.25, random_state=13)
X, Xval, y, yval = train_test_split(X, y, test_size=0.25) # Set aside 25% of data for validation
Xtrn, Xtst, ytrn, ytst = train_test_split(X, y, test_size=0.25) # Set aside a further 25% of data for hold-out test
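A quick illustrative check (not part of the book's listings): with the 600 samples generated above, the first split holds out 150 validation examples, and the second splits the remaining 450 into 337 training and 113 test examples.
# Illustrative sanity check on the split sizes
print(Xtrn.shape, Xtst.shape, Xval.shape)   # expect (337, 2) (113, 2) (150, 2)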
from plot_utils import plot_2d_data, plot_2d_classifier
import matplotlib.pyplot as plt
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(4, 4))
plot_2d_data(ax, X, y, alpha=0.2, s=80, xlabel='$x_1$', ylabel='$x_2$', title='Two moons data', colormap='Blues');
fig.tight_layout()
# plt.savefig('./figures/CH03_F02_Kunapuli.png', format='png', dpi=300, bbox_inches='tight');
# plt.savefig('./figures/CH03_F02_Kunapuli.pdf', format='pdf', dpi=300, bbox_inches='tight');
For this scenario, we use six popular machine-learning algorithms, all of which are available in scikit-learn: DecisionTreeClassifier, SVC, GaussianProcessClassifier, KNeighborsClassifier, RandomForestClassifier, and GaussianNB.
Listing 3.1: Fitting different base estimators
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
estimators = [('dt', DecisionTreeClassifier(max_depth=5)),
              ('svm', SVC(gamma=1.0, C=1.0, probability=True)),
              ('gp', GaussianProcessClassifier(RBF(1.0))),
              ('3nn', KNeighborsClassifier(n_neighbors=3)),
              ('rf', RandomForestClassifier(max_depth=3, n_estimators=25)),
              ('gnb', GaussianNB())]
def fit(estimators, X, y):
    for model, estimator in estimators:
        estimator.fit(X, y)

    return estimators
estimators = fit(estimators, Xtrn, ytrn)
We visualize how each base estimator behaves on our data set. Using different base learning algorithms allows the ensemble to train naturally diverse base estimators.
from sklearn.metrics import accuracy_score
n_estimators = len(estimators)
nrows, ncols = n_estimators // 3, 3
fig, ax = plt.subplots(nrows=nrows, ncols=ncols, figsize=(9, 6))
for i, (model, estimator) in enumerate(estimators):
    r, c = divmod(i, 3)

    # Compute the test error
    tst_err = 1 - accuracy_score(ytst, estimator.predict(Xtst))
    title = '{0} (test err = {1:4.2f}%)'.format(model, tst_err*100)

    plot_2d_classifier(ax[r, c], X, y, colormap='Blues', alpha=0.3, s=80,
                       predict_function=estimator.predict_proba, predict_proba=True,
                       title=title)
fig.tight_layout()
# plt.savefig('./figures/CH03_F04_Kunapuli.png', format='png', dpi=300, bbox_inches='tight');
# plt.savefig('./figures/CH03_F04_Kunapuli.pdf', format='pdf', dpi=300, bbox_inches='tight');
Given that we have six base estimators, each test example will have six predictions, one corresponding to each base estimator.
Listing 3.2: Individual predictions of base estimators
The function predict_individual has a flag proba. When we set proba=False, predict_individual returns the predicted labels according to each estimator; the predicted labels take the values y_pred=0 or y_pred=1, and tell us that the estimator has predicted that example to belong to Class 0 or Class 1, respectively.
When we set proba=True, however, each estimator returns the class prediction probabilities instead, via each base estimator's predict_proba() function:
y[:, i] = estimator.predict_proba(X)[:, 1]
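For instance, here is an illustrative check (not part of Listing 3.2): predict_proba() returns one column of probabilities per class, so slicing [:, 1] keeps only the probability of Class 1.
# Illustrative: predict_proba() returns an (n_samples, 2) array of class probabilities
probs = estimators[0][1].predict_proba(Xtst[:5])   # probabilities from the fitted decision tree
print(probs.shape)    # (5, 2): columns are P(Class 0) and P(Class 1)
print(probs[:, 1])    # probability of Class 1 for the first five test examples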
import numpy as np
def predict_individual(X, estimators, proba=False):
    n_estimators = len(estimators)
    n_samples = X.shape[0]

    y = np.zeros((n_samples, n_estimators))
    for i, (model, estimator) in enumerate(estimators):
        if proba:
            y[:, i] = estimator.predict_proba(X)[:, 1]
        else:
            y[:, i] = estimator.predict(X)

    return y
First, test this function with proba=False, to get label predictions directly.
y_individual = predict_individual(Xtst, estimators, proba=False)
np.set_printoptions(threshold=5, precision=2)
print(y_individual)
[[0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 1. 1.]
 [0. 0. 0. 0. 0. 0.]
 ...
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [1. 1. 1. 1. 1. 1.]]
Each row contains six predictions, each one corresponding to the prediction of each base estimator. We sanity check our predictions: Xtst has 113 test examples, and y_individual has six predictions for each of them, which gives us a 113 x 6 array of predictions.
print(Xtst.shape)
print(y_individual.shape)
(113, 2)
(113, 6)
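These label predictions also let us gauge how diverse the base estimators are. A rough check (illustrative, not from the book's listings) counts the fraction of test examples on which the six estimators do not all agree.
# Fraction of test examples where the six base estimators disagree (illustrative)
disagree = np.mean(~np.all(y_individual == y_individual[:, [0]], axis=1))
print('Estimators disagree on {0:4.2f}% of test examples'.format(disagree * 100))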
Next, test this function with proba=True, to get prediction probabilities instead.
y_individual = predict_individual(Xtst, estimators, proba=True)
y_individual
array([[0.29, 0.01, 0.07, 0.  , 0.19, 0.31],
       [0.  , 0.57, 0.45, 0.33, 0.61, 0.9 ],
       [0.  , 0.01, 0.06, 0.  , 0.1 , 0.04],
       ...,
       [0.  , 0.01, 0.09, 0.  , 0.08, 0.08],
       [0.  , 0.02, 0.09, 0.  , 0.09, 0.01],
       [1.  , 0.99, 0.9 , 1.  , 0.95, 0.99]])
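As a final illustrative check (not from the book), thresholding these probabilities at 0.5 recovers the label predictions from proba=False for most estimators; note that SVC's predict() relies on its decision function rather than the Platt-scaled probabilities, so a small fraction of entries may differ.
# Compare thresholded probabilities against the label predictions (illustrative)
y_labels = predict_individual(Xtst, estimators, proba=False)
print(np.mean((y_individual >= 0.5) == y_labels))   # close to, but not necessarily exactly, 1.0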