This notebook contains code and comments from Sections 2.2 and 2.3 of the book Ensemble Methods for Machine Learning. Please see the book for additional details on this topic. This notebook and code are released under the MIT license.
We will implement our own version of bagging to understand its internals, and then look at how to use scikit-learn's bagging implementation.
Listing 2.1: Bagging with Decision Trees: Training
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(seed=4190)

def bagging_fit(X, y, n_estimators, max_depth=5, max_samples=200):
    n_examples = len(y)

    # Initialize n_estimators untrained decision trees
    estimators = [DecisionTreeClassifier(max_depth=max_depth)
                  for _ in range(n_estimators)]

    # Train each tree on its own bootstrap sample, drawn with replacement
    for tree in estimators:
        bag = np.random.choice(n_examples, max_samples, replace=True)
        tree.fit(X[bag, :], y[bag])

    return estimators
This function returns a list of trained DecisionTreeClassifier objects. We can use this ensemble for prediction by first obtaining the individual predictions and then aggregating them through majority voting.
Listing 2.2: Bagging with Decision Trees: Prediction
from scipy.stats import mode

def bagging_predict(X, estimators):
    # Collect the predictions of every tree into an array of shape (n_estimators, n_examples)
    all_predictions = np.array([tree.predict(X) for tree in estimators])

    # Majority vote: the most common prediction across trees for each example
    ypred, _ = mode(all_predictions, axis=0, keepdims=False)
    return np.squeeze(ypred)
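To see what the aggregation step does, here is a tiny, self-contained illustration of how scipy.stats.mode performs majority voting across the stacked predictions; the vote matrix below is made up purely for this example.

from scipy.stats import mode
import numpy as np

# Hypothetical hard predictions from 3 trees on 4 test examples (one row per tree)
all_predictions = np.array([[0, 1, 1, 0],
                            [0, 1, 0, 0],
                            [1, 1, 1, 0]])

# Vote down each column (axis=0), i.e., across trees, for each test example
ypred, counts = mode(all_predictions, axis=0, keepdims=False)
print(ypred)   # [0 1 1 0]  -- the majority class for each example
print(counts)  # [2 3 2 3]  -- how many trees agreed with the majority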
Let's test this on a 2D synthetic data set. We train a bagging ensemble of 500 decision trees, each of maximum depth 12, on bootstrap samples of size 300.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
X, y = make_moons(n_samples=300, noise=.25, random_state=rng)
Xtrn, Xtst, ytrn, ytst = train_test_split(X, y, test_size=0.33, random_state=rng)
bag_ens = bagging_fit(Xtrn, ytrn, n_estimators=500,
                      max_depth=12, max_samples=300)
ypred = bagging_predict(Xtst, bag_ens)
print(accuracy_score(ytst, ypred))
0.898989898989899
ensembleAcc = accuracy_score(ytst, ypred)
print('Bagging: Holdout accuracy = {0:4.2f}%.'.format(ensembleAcc * 100))
tree = DecisionTreeClassifier(max_depth=12)
ypred_single = tree.fit(Xtrn, ytrn).predict(Xtst)
treeAcc = accuracy_score(ytst, ypred_single)
print('Single Decision Tree: Holdout test accuracy = {0:4.2f}%.'.format(treeAcc * 100))
Bagging: Holdout accuracy = 89.90%.
Single Decision Tree: Holdout test accuracy = 89.90%.
We can visualize the difference between the bagging classifier and a single decision tree.
from plot_utils import plot_2d_data, plot_2d_classifier
import matplotlib.pyplot as plt
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(8, 4))
title = 'Single Decision Tree (acc = {0:4.2f}%)'.format(treeAcc*100)
plot_2d_classifier(ax[0], X, y, colormap='Blues', alpha=0.15, s=80,
                   predict_function=tree.predict,
                   xlabel='$x_1$', ylabel='$x_2$', title=title)
title = 'Bagging Ensemble (acc = {0:4.2f}%)'.format(ensembleAcc*100)
plot_2d_classifier(ax[1], X, y, colormap='Blues', alpha=0.15, s=80,
                   predict_function=bagging_predict, predict_args=(bag_ens, ),
                   xlabel='$x_1$', ylabel='$x_2$', title=title)
fig.tight_layout()
# plt.savefig('./figures/CH02_F04_Kunapuli.png', format='png', dpi=300, bbox_inches='tight');
# plt.savefig('./figures/CH02_F04_Kunapuli.pdf', format='pdf', dpi=300, bbox_inches='tight');
scikit-learn
scikit-learn's BaggingClassifier can be used to train a bagging ensemble for classification. It supports many different kinds of base estimators, though in the example below, we use DecisionTreeClassifier as the base estimator.
Listing 2.3: Bagging with scikit-learn
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
base_estimator = DecisionTreeClassifier(max_depth=10)
bag_ens = BaggingClassifier(base_estimator=base_estimator, n_estimators=500,
                            max_samples=100, oob_score=True, random_state=rng)
bag_ens.fit(Xtrn, ytrn)
ypred = bag_ens.predict(Xtst)
BaggingClassifier supports out-of-bag evaluation and will return the OOB accuracy if we set oob_score=True, as we have done above. We have also held out a test set ourselves, with which we can compute another estimate of this model's generalization. The two estimates are pretty close together, as we expect!
bag_ens.oob_score_
0.9402985074626866
accuracy_score(ytst, ypred)
0.9292929292929293
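To make the out-of-bag idea concrete, here is a rough sketch of how such an estimate can be computed by hand: each training example is scored only by the trees whose bootstrap samples did not include it. This is an illustration, not scikit-learn's exact procedure (which averages class probabilities rather than counting hard votes), so the number will be close to, but not necessarily identical to, oob_score_.

import numpy as np

n_trn = len(ytrn)
votes = np.zeros((n_trn, 2))  # two classes in this data set

for tree, sampled in zip(bag_ens.estimators_, bag_ens.estimators_samples_):
    # Training examples this tree never saw (its out-of-bag examples)
    oob = np.setdiff1d(np.arange(n_trn), sampled)
    votes[oob, tree.predict(Xtrn[oob, :]).astype(int)] += 1

oob_pred = np.argmax(votes, axis=1)
covered = votes.sum(axis=1) > 0  # examples that were out-of-bag for at least one tree
print(np.mean(oob_pred[covered] == ytrn[covered]))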
We can visualize the smoothing behavior of the BaggingClassifier by comparing its decision boundary to those of its component DecisionTreeClassifiers.
%matplotlib inline
fig, ax = plt.subplots(nrows=2, ncols=3, figsize=(12, 8))
trees_to_plot = np.random.choice(500, 5, replace=False)  # pick 5 distinct trees to plot

title = 'Bagging Ensemble (acc = {0:4.2f}%)'.format(accuracy_score(ytst, ypred)*100)
plot_2d_classifier(ax[0, 0], X, y, colormap='Blues', alpha=0.15, s=80,
                   predict_function=bag_ens.predict,
                   xlabel='$x_1$', ylabel='$x_2$', title=title)

for i in range(5):
    r, c = np.divmod(i + 1, 3)  # Get the row and column index of the subplot
    j = trees_to_plot[i]

    tst_acc_clf = accuracy_score(ytst, bag_ens[j].predict(Xtst))

    # Bootstrap sample used to train this tree (indices refer to the training set)
    bag = bag_ens.estimators_samples_[j]
    X_bag = Xtrn[bag, :]
    y_bag = ytrn[bag]

    title = 'Decision Tree {1} (acc = {0:4.2f}%)'.format(tst_acc_clf*100, j+1)
    plot_2d_classifier(ax[r, c], X, y, colormap='Blues', alpha=0.15, s=80,
                       predict_function=bag_ens[j].predict,
                       xlabel='$x_1$', ylabel='$x_2$', title=title)
fig.tight_layout()
# plt.savefig('./figures/CH02_F05_Kunapuli.png', format='png', dpi=300, bbox_inches='tight');
# plt.savefig('./figures/CH02_F05_Kunapuli.pdf', format='pdf', dpi=300, bbox_inches='tight');
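As noted above, BaggingClassifier is not limited to decision trees as base estimators. Here is a minimal sketch with k-nearest neighbors as the base estimator; the choice of k and ensemble size is arbitrary, and note that recent scikit-learn versions rename the base_estimator parameter to estimator.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import BaggingClassifier

knn_bag_ens = BaggingClassifier(base_estimator=KNeighborsClassifier(n_neighbors=5),
                                n_estimators=100, max_samples=100, random_state=rng)
knn_bag_ens.fit(Xtrn, ytrn)
print('Bagged k-NN: Holdout test accuracy = {0:4.2f}%.'.format(
    accuracy_score(ytst, knn_bag_ens.predict(Xtst)) * 100))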
BaggingClassifier supports speeding up both training and prediction through the n_jobs parameter. By default, this parameter is set to 1 and bagging will run sequentially. Alternatively, you can specify the number of concurrent processes BaggingClassifier should use by setting n_jobs.
The experiment below compares the training efficiency of sequential bagging (n_jobs=1) with parallelized bagging (n_jobs=-1) on a machine with 6 cores. Bagging can be parallelized effectively, and training times can be significantly improved as a result.
CAUTION: The experiment below runs slowly! Pickle files from a previous run are included for quick plotting.
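For a quicker, rough sense of the speedup without rerunning the full experiment, you can time a single fit with each setting (a minimal sketch; absolute timings will vary by machine):

import time

for n_jobs in [1, -1]:
    start = time.time()
    BaggingClassifier(base_estimator=DecisionTreeClassifier(max_depth=5),
                      n_estimators=500, max_samples=100,
                      n_jobs=n_jobs, random_state=rng).fit(Xtrn, ytrn)
    print('n_jobs = {0}: {1:5.2f} sec.'.format(n_jobs, time.time() - start))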
import time
import os
import pickle

# See if the result file for this experiment already exists, and if not, rerun and save a new set of results
if not os.path.exists('./data/SeqentialVsParallelBagging.pickle'):
    n_estimator_range = np.arange(50, 525, 50, dtype=int)
    n_range = len(n_estimator_range)
    n_runs = 10

    run_time_seq = np.zeros((n_runs, n_range))
    run_time_par = np.zeros((n_runs, n_range))

    base_estimator = DecisionTreeClassifier(max_depth=5)

    for r in range(n_runs):
        # Split the data randomly into training and test for this run
        X_trn, X_tst, y_trn, y_tst = train_test_split(X, y, test_size=100)

        # Learn and evaluate this train/test split for this run with sequential bagging
        for i, n_estimators in enumerate(n_estimator_range):
            start = time.time()
            bag_ens = BaggingClassifier(base_estimator=base_estimator, n_estimators=n_estimators,
                                        max_samples=100, oob_score=True, n_jobs=1)
            bag_ens.fit(X_trn, y_trn)
            run_time_seq[r, i] = time.time() - start

        # Learn and evaluate this train/test split for this run with parallel bagging
        for i, n_estimators in enumerate(n_estimator_range):
            start = time.time()
            bag_ens = BaggingClassifier(base_estimator=base_estimator, n_estimators=n_estimators,
                                        max_samples=100, oob_score=True, n_jobs=-1)
            bag_ens.fit(X_trn, y_trn)
            run_time_par[r, i] = time.time() - start

    results = (run_time_seq, run_time_par)
    with open('./data/SeqentialVsParallelBagging.pickle', 'wb') as result_file:
        pickle.dump(results, result_file)

else:
    with open('./data/SeqentialVsParallelBagging.pickle', 'rb') as result_file:
        (run_time_seq, run_time_par) = pickle.load(result_file)
Once the sequential vs. parallel results have been loaded/run, plot them.
%matplotlib inline
n_estimator_range = np.arange(50, 525, 50, dtype=int)
# Discard outlier timings (treat unusually slow runs as missing) and average over the runs
run_time_seq_adj = np.copy(run_time_seq)
run_time_seq_adj[run_time_seq > 0.5] = np.nan
run_time_seq_mean = np.nanmean(run_time_seq_adj, axis=0)

run_time_par_adj = np.copy(run_time_par)
run_time_par_adj[run_time_par > 0.3] = np.nan
run_time_par_mean = np.nanmean(run_time_par_adj, axis=0)
fig = plt.figure(figsize=(4, 4))
plt.plot(n_estimator_range, run_time_seq_mean, linewidth=1.5, marker='o', markersize=9, mfc='w');
plt.plot(n_estimator_range[1:], run_time_par_mean[1:], linewidth=1.5, marker='s', markersize=9, mfc='w');
plt.ylabel('Run Time (sec.)', fontsize=12)  # time.time() differences are in seconds
plt.xlabel('Number of estimators', fontsize=12)
plt.legend(['Single training job', 'Parallel training jobs'], fontsize=11, loc='upper left');
fig.tight_layout()
# plt.savefig('./figures/CH02_F06_Kunapuli.png', format='png', dpi=300, bbox_inches='tight');
# plt.savefig('./figures/CH02_F06_Kunapuli.pdf', format='pdf', dpi=300, bbox_inches='tight');
Using scikit-learn's RandomForestClassifier.
Listing 2.4: Random Forest with scikit-learn
from sklearn.ensemble import RandomForestClassifier
rf_ens = RandomForestClassifier(n_estimators=500, max_depth=10,
                                oob_score=True, n_jobs=-1, random_state=rng)
rf_ens.fit(Xtrn, ytrn)
ypred = rf_ens.predict(Xtst)
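Since the random forest was also trained with oob_score=True, we can again compare its out-of-bag estimate with the holdout accuracy (a short check, analogous to the one for BaggingClassifier above):

print('Random Forest: OOB accuracy = {0:4.2f}%.'.format(rf_ens.oob_score_ * 100))
print('Random Forest: Holdout test accuracy = {0:4.2f}%.'.format(accuracy_score(ytst, ypred) * 100))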
%matplotlib inline
fig, ax = plt.subplots(nrows=2, ncols=3, figsize=(12, 8))
trees_to_plot = np.random.choice(500, 5, replace=False)  # pick 5 distinct trees to plot

title = 'Random Forest (acc = {0:4.2f}%)'.format(accuracy_score(ytst, ypred)*100)
plot_2d_classifier(ax[0, 0], X, y, colormap='Blues', alpha=0.15, s=80,
                   predict_function=rf_ens.predict,
                   xlabel='$x_1$', ylabel='$x_2$', title=title)

for i in range(5):
    r, c = np.divmod(i + 1, 3)  # Get the row and column index of the subplot
    j = trees_to_plot[i]

    tst_acc_clf = accuracy_score(ytst, rf_ens[j].predict(Xtst))

    # bag = bag_ens.estimators_samples_[j]
    # X_bag = X[bag, :]
    # y_bag = y[bag]

    title = 'Randomized Tree {1} (acc = {0:4.2f}%)'.format(tst_acc_clf*100, j+1)
    plot_2d_classifier(ax[r, c], X, y, colormap='Blues', alpha=0.15, s=80,
                       predict_function=rf_ens[j].predict,
                       xlabel='$x_1$', ylabel='$x_2$', title=title)
fig.tight_layout()
# plt.savefig('./figures/CH02_F08_Kunapuli.png', format='png', dpi=300, bbox_inches='tight');
# plt.savefig('./figures/CH02_F08_Kunapuli.pdf', format='pdf', dpi=300, bbox_inches='tight');
scikit-learn's RandomForestClassifier can also rank features by their importance. Feature importances can be extracted from the learned RandomForestClassifier's feature_importances_ attribute. This is computed by adding up how much each feature decreases the overall Gini impurity criterion during training; features that decrease the impurity more have higher importances.
for i, score in enumerate(rf_ens.feature_importances_):
    print('Feature x{0}: {1:6.5f}'.format(i, score))
Feature x0: 0.50072
Feature x1: 0.49928
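As a quick sanity check (this reflects how scikit-learn aggregates impurity-based importances across trees, as I understand it, rather than anything stated in the book), the forest-level importances are essentially the average of the per-tree importances:

import numpy as np

# Average the impurity-based importances of the individual trees;
# this should closely match rf_ens.feature_importances_
tree_importances = np.array([tree.feature_importances_ for tree in rf_ens.estimators_])
print(tree_importances.mean(axis=0))
print(rf_ens.feature_importances_)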