This notebook contains code and comments from Sections 2.2 and 2.3 of the book Ensemble Methods for Machine Learning. Please see the book for additional details on this topic. This notebook and code are released under the MIT license.
We will implement our own version of bagging to understand its internals, and then look at how to use scikit-learn's bagging implementation.
Listing 2.1: Bagging with Decision Trees: Training
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(seed=4190)

def bagging_fit(X, y, n_estimators, max_depth=5, max_samples=200):
    n_examples = len(y)

    # Initialize n_estimators untrained decision trees
    estimators = [DecisionTreeClassifier(max_depth=max_depth)
                  for _ in range(n_estimators)]

    # Train each tree on its own bootstrap sample, drawn with replacement
    for tree in estimators:
        bag = np.random.choice(n_examples, max_samples, replace=True)
        tree.fit(X[bag, :], y[bag])

    return estimators
This function returns a list of trained DecisionTreeClassifier objects. We can use this ensemble for prediction by first obtaining the individual predictions and then aggregating them through majority voting.
Listing 2.2: Bagging with Decision Trees: Prediction
from scipy.stats import mode

def bagging_predict(X, estimators):
    # Collect the predictions of every tree into an array of shape (n_estimators, n_examples)
    all_predictions = np.array([tree.predict(X) for tree in estimators])

    # Majority vote: the most common prediction across trees for each example
    ypred, _ = mode(all_predictions, axis=0, keepdims=False)
    return np.squeeze(ypred)
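To see what the aggregation step does, here is a tiny, self-contained illustration of how scipy.stats.mode performs majority voting across the stacked predictions; the vote matrix below is made up purely for this example.

from scipy.stats import mode
import numpy as np

# Hypothetical hard predictions from 3 trees on 4 test examples (one row per tree)
all_predictions = np.array([[0, 1, 1, 0],
                            [0, 1, 0, 0],
                            [1, 1, 1, 0]])

# Vote down each column (axis=0), i.e., across trees, for each test example
ypred, counts = mode(all_predictions, axis=0, keepdims=False)
print(ypred)   # [0 1 1 0]  -- the majority class for each example
print(counts)  # [2 3 2 3]  -- how many trees agreed with the majority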
Let's test this on a 2D synthetic data set. We train a bagging ensemble of 500 decision trees, each of maximum depth 12, on bootstrap samples of size 300.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
X, y = make_moons(n_samples=300, noise=.25, random_state=rng)
Xtrn, Xtst, ytrn, ytst = train_test_split(X, y, test_size=0.33, random_state=rng)
bag_ens = bagging_fit(Xtrn, ytrn, n_estimators=500,
                      max_depth=12, max_samples=300)
ypred = bagging_predict(Xtst, bag_ens)
print(accuracy_score(ytst, ypred))
0.898989898989899
ensembleAcc = accuracy_score(ytst, ypred)
print('Bagging: Holdout accuracy = {0:4.2f}%.'.format(ensembleAcc * 100))
tree = DecisionTreeClassifier(max_depth=12)
ypred_single = tree.fit(Xtrn, ytrn).predict(Xtst)
treeAcc = accuracy_score(ytst, ypred_single)
print('Single Decision Tree: Holdout test accuracy = {0:4.2f}%.'.format(treeAcc * 100))
Bagging: Holdout accuracy = 89.90%.
Single Decision Tree: Holdout test accuracy = 89.90%.
We can visualize the difference between the bagging classifier and a single decision tree.
from plot_utils import plot_2d_data, plot_2d_classifier
import matplotlib.pyplot as plt
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(8, 4))
title = 'Single Decision Tree (acc = {0:4.2f}%)'.format(treeAcc*100)
plot_2d_classifier(ax[0], X, y, colormap='Blues', alpha=0.15, s=80,
                   predict_function=tree.predict,
                   xlabel='$x_1$', ylabel='$x_2$', title=title)
title = 'Bagging Ensemble (acc = {0:4.2f}%)'.format(ensembleAcc*100)
plot_2d_classifier(ax[1], X, y, colormap='Blues', alpha=0.15, s=80,
                   predict_function=bagging_predict, predict_args=(bag_ens, ),
                   xlabel='$x_1$', ylabel='$x_2$', title=title)
fig.tight_layout()
# plt.savefig('./figures/CH02_F04_Kunapuli.png', format='png', dpi=300, bbox_inches='tight');
# plt.savefig('./figures/CH02_F04_Kunapuli.pdf', format='pdf', dpi=300, bbox_inches='tight');
scikit-learn
scikit-learn's BaggingClassifier can be used to train a bagging ensemble for classification. It supports many different kinds of base estimators, though in the example below, we use DecisionTreeClassifier as the base estimator.
Listing 2.3: Bagging with scikit-learn
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
base_estimator = DecisionTreeClassifier(max_depth=10)
bag_ens = BaggingClassifier(base_estimator=base_estimator, n_estimators=500,
                            max_samples=100, oob_score=True, random_state=rng)
bag_ens.fit(Xtrn, ytrn)
ypred = bag_ens.predict(Xtst)
BaggingClassifier supports out-of-bag evaluation and will return the OOB accuracy if we set oob_score=True, as we have done above. We have also held out a test set ourselves, with which we can compute another estimate of this model's generalization. The two estimates are pretty close together, as we expect!
bag_ens.oob_score_
0.9402985074626866
accuracy_score(ytst, ypred)
0.9292929292929293
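To make the out-of-bag idea concrete, here is a rough sketch of how such an estimate can be computed by hand: each training example is scored only by the trees whose bootstrap samples did not include it. This is an illustration, not scikit-learn's exact procedure (which averages class probabilities rather than counting hard votes), so the number will be close to, but not necessarily identical to, oob_score_.

import numpy as np

n_trn = len(ytrn)
votes = np.zeros((n_trn, 2))  # two classes in this data set

for tree, sampled in zip(bag_ens.estimators_, bag_ens.estimators_samples_):
    # Training examples this tree never saw (its out-of-bag examples)
    oob = np.setdiff1d(np.arange(n_trn), sampled)
    votes[oob, tree.predict(Xtrn[oob, :]).astype(int)] += 1

oob_pred = np.argmax(votes, axis=1)
covered = votes.sum(axis=1) > 0  # examples that were out-of-bag for at least one tree
print(np.mean(oob_pred[covered] == ytrn[covered]))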
We can visualize the smoothing behavior of the BaggingClassifier by comparing its decision boundary to those of its component DecisionTreeClassifiers.
%matplotlib inline
fig, ax = plt.subplots(nrows=2, ncols=3, figsize=(12, 8))
trees_to_plot = np.random.choice(500, 5, replace=False)  # pick 5 distinct trees to plot

title = 'Bagging Ensemble (acc = {0:4.2f}%)'.format(accuracy_score(ytst, ypred)*100)
plot_2d_classifier(ax[0, 0], X, y, colormap='Blues', alpha=0.15, s=80,
                   predict_function=bag_ens.predict,
                   xlabel='$x_1$', ylabel='$x_2$', title=title)

for i in range(5):
    r, c = np.divmod(i + 1, 3)  # Get the row and column index of the subplot
    j = trees_to_plot[i]

    tst_acc_clf = accuracy_score(ytst, bag_ens[j].predict(Xtst))

    # Bootstrap sample used to train this tree (indices refer to the training set)
    bag = bag_ens.estimators_samples_[j]
    X_bag = Xtrn[bag, :]
    y_bag = ytrn[bag]

    title = 'Decision Tree {1} (acc = {0:4.2f}%)'.format(tst_acc_clf*100, j+1)
    plot_2d_classifier(ax[r, c], X, y, colormap='Blues', alpha=0.15, s=80,
                       predict_function=bag_ens[j].predict,
                       xlabel='$x_1$', ylabel='$x_2$', title=title)
fig.tight_layout()
# plt.savefig('./figures/CH02_F05_Kunapuli.png', format='png', dpi=300, bbox_inches='tight');
# plt.savefig('./figures/CH02_F05_Kunapuli.pdf', format='pdf', dpi=300, bbox_inches='tight');
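As noted above, BaggingClassifier is not limited to decision trees as base estimators. Here is a minimal sketch with k-nearest neighbors as the base estimator; the choice of k and ensemble size is arbitrary, and note that recent scikit-learn versions rename the base_estimator parameter to estimator.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import BaggingClassifier

knn_bag_ens = BaggingClassifier(base_estimator=KNeighborsClassifier(n_neighbors=5),
                                n_estimators=100, max_samples=100, random_state=rng)
knn_bag_ens.fit(Xtrn, ytrn)
print('Bagged k-NN: Holdout test accuracy = {0:4.2f}%.'.format(
    accuracy_score(ytst, knn_bag_ens.predict(Xtst)) * 100))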
BaggingClassifier supports speeding up both training and prediction through the n_jobs parameter. By default, this parameter is set to 1 and bagging will run sequentially. Alternatively, you can specify the number of concurrent processes BaggingClassifier should use by setting n_jobs.
The experiment below compares the training efficiency of sequential bagging (n_jobs=1) with parallelized bagging (n_jobs=-1) on a machine with 6 cores. Bagging can be parallelized effectively, and training times can be significantly improved as a result.
CAUTION: The experiment below runs slowly! Pickle files from a previous run are included for quick plotting.
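For a quicker, rough sense of the speedup without rerunning the full experiment, you can time a single fit with each setting (a minimal sketch; absolute timings will vary by machine):

import time

for n_jobs in [1, -1]:
    start = time.time()
    BaggingClassifier(base_estimator=DecisionTreeClassifier(max_depth=5),
                      n_estimators=500, max_samples=100,
                      n_jobs=n_jobs, random_state=rng).fit(Xtrn, ytrn)
    print('n_jobs = {0}: {1:5.2f} sec.'.format(n_jobs, time.time() - start))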
import time
import os
import pickle

# See if the result file for this experiment already exists, and if not, rerun and save a new set of results
if not os.path.exists('./data/SeqentialVsParallelBagging.pickle'):
    n_estimator_range = np.arange(50, 525, 50, dtype=int)
    n_range = len(n_estimator_range)
    n_runs = 10

    run_time_seq = np.zeros((n_runs, n_range))
    run_time_par = np.zeros((n_runs, n_range))

    base_estimator = DecisionTreeClassifier(max_depth=5)

    for r in range(n_runs):
        # Split the data randomly into training and test for this run
        X_trn, X_tst, y_trn, y_tst = train_test_split(X, y, test_size=100)

        # Learn and evaluate this train/test split for this run with sequential bagging
        for i, n_estimators in enumerate(n_estimator_range):
            start = time.time()
            bag_ens = BaggingClassifier(base_estimator=base_estimator, n_estimators=n_estimators,
                                        max_samples=100, oob_score=True, n_jobs=1)
            bag_ens.fit(X_trn, y_trn)
            run_time_seq[r, i] = time.time() - start

        # Learn and evaluate this train/test split for this run with parallel bagging
        for i, n_estimators in enumerate(n_estimator_range):
            start = time.time()
            bag_ens = BaggingClassifier(base_estimator=base_estimator, n_estimators=n_estimators,
                                        max_samples=100, oob_score=True, n_jobs=-1)
            bag_ens.fit(X_trn, y_trn)
            run_time_par[r, i] = time.time() - start

    results = (run_time_seq, run_time_par)
    with open('./data/SeqentialVsParallelBagging.pickle', 'wb') as result_file:
        pickle.dump(results, result_file)

else:
    with open('./data/SeqentialVsParallelBagging.pickle', 'rb') as result_file:
        (run_time_seq, run_time_par) = pickle.load(result_file)
Once the sequential vs. parallel results have been loaded/run, plot them.
%matplotlib inline
n_estimator_range = np.arange(50, 525, 50, dtype=int)
# Discard outlier timings (treat unusually slow runs as missing) and average over the runs
run_time_seq_adj = np.copy(run_time_seq)
run_time_seq_adj[run_time_seq > 0.5] = np.nan
run_time_seq_mean = np.nanmean(run_time_seq_adj, axis=0)

run_time_par_adj = np.copy(run_time_par)
run_time_par_adj[run_time_par > 0.3] = np.nan
run_time_par_mean = np.nanmean(run_time_par_adj, axis=0)
fig = plt.figure(figsize=(4, 4))
plt.plot(n_estimator_range, run_time_seq_mean, linewidth=1.5, marker='o', markersize=9, mfc='w');
plt.plot(n_estimator_range[1:], run_time_par_mean[1:], linewidth=1.5, marker='s', markersize=9, mfc='w');
plt.ylabel('Run Time (sec.)', fontsize=12)  # time.time() differences are in seconds
plt.xlabel('Number of estimators', fontsize=12)
plt.legend(['Single training job', 'Parallel training jobs'], fontsize=11, loc='upper left');
fig.tight_layout()
# plt.savefig('./figures/CH02_F06_Kunapuli.png', format='png', dpi=300, bbox_inches='tight');
# plt.savefig('./figures/CH02_F06_Kunapuli.pdf', format='pdf', dpi=300, bbox_inches='tight');
Using scikit-learn's RandomForestClassifier.
Listing 2.4: Random Forest with scikit-learn
from sklearn.ensemble import RandomForestClassifier
rf_ens = RandomForestClassifier(n_estimators=500, max_depth=10,
                                oob_score=True, n_jobs=-1, random_state=rng)
rf_ens.fit(Xtrn, ytrn)
ypred = rf_ens.predict(Xtst)
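Since the random forest was also trained with oob_score=True, we can again compare its out-of-bag estimate with the holdout accuracy (a short check, analogous to the one for BaggingClassifier above):

print('Random Forest: OOB accuracy = {0:4.2f}%.'.format(rf_ens.oob_score_ * 100))
print('Random Forest: Holdout test accuracy = {0:4.2f}%.'.format(accuracy_score(ytst, ypred) * 100))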
%matplotlib inline
fig, ax = plt.subplots(nrows=2, ncols=3, figsize=(12, 8))
trees_to_plot = np.random.choice(500, 5, replace=False)  # pick 5 distinct trees to plot

title = 'Random Forest (acc = {0:4.2f}%)'.format(accuracy_score(ytst, ypred)*100)
plot_2d_classifier(ax[0, 0], X, y, colormap='Blues', alpha=0.15, s=80,
                   predict_function=rf_ens.predict,
                   xlabel='$x_1$', ylabel='$x_2$', title=title)

for i in range(5):
    r, c = np.divmod(i + 1, 3)  # Get the row and column index of the subplot
    j = trees_to_plot[i]

    tst_acc_clf = accuracy_score(ytst, rf_ens[j].predict(Xtst))

    # bag = bag_ens.estimators_samples_[j]
    # X_bag = X[bag, :]
    # y_bag = y[bag]

    title = 'Randomized Tree {1} (acc = {0:4.2f}%)'.format(tst_acc_clf*100, j+1)
    plot_2d_classifier(ax[r, c], X, y, colormap='Blues', alpha=0.15, s=80,
                       predict_function=rf_ens[j].predict,
                       xlabel='$x_1$', ylabel='$x_2$', title=title)
fig.tight_layout()
# plt.savefig('./figures/CH02_F08_Kunapuli.png', format='png', dpi=300, bbox_inches='tight');
# plt.savefig('./figures/CH02_F08_Kunapuli.pdf', format='pdf', dpi=300, bbox_inches='tight');
scikit-learn's RandomForestClassifier can also rank features by their importance. Feature importances can be extracted from the learned RandomForestClassifier's feature_importances_ attribute. This is computed by adding up how much each feature decreases the overall Gini impurity criterion during training; features that decrease the impurity more have higher importances.
for i, score in enumerate(rf_ens.feature_importances_):
    print('Feature x{0}: {1:6.5f}'.format(i, score))
Feature x0: 0.50072
Feature x1: 0.49928
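As a quick sanity check (this reflects how scikit-learn aggregates impurity-based importances across trees, as I understand it, rather than anything stated in the book), the forest-level importances are essentially the average of the per-tree importances:

import numpy as np

# Average the impurity-based importances of the individual trees;
# this should closely match rf_ens.feature_importances_
tree_importances = np.array([tree.feature_importances_ for tree in rf_ens.estimators_])
print(tree_importances.mean(axis=0))
print(rf_ens.feature_importances_)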