This notebook contains code and comments from Section 2.2 of the book Ensemble Methods for Machine Learning. Please see the book for additional details on this topic. This notebook and code are released under the MIT license.
Our first case study explores a medical decision-making task: breast cancer diagnosis. We will see how to use scikit-learn’s homogeneous parallel ensemble modules in practice. Specifically, we will train and evaluate the performance of three homogeneous parallel algorithms, each characterized by increasing randomness: bagging with decision trees, random forests and ExtraTrees.
Breast-cancer diagnosis is the task of classifying patient tumor data into one of two classes: malignant vs. benign. This task (and similar medical tasks) are at the core of medical diagnostic decision-support systems that are beginning to come into wider use in the healthcare industry. Such machine-learning algorithms can greatly minimize the false negative rate (where patients who do have cancer are misdiagnosed until too late) and improve chances for treatment and survival.
This task is challenging owing to subtle variations in patient data, which makes it very difficult for a single model to achieve high performance. Bagging and random forests are able to learn effective models. This case study will be tackled using scikit-learn’s implementations of bagging and random forests. The performance of these learned ensemble models will be compared with individual (non-ensemble / traditional) machine-learning models. Results will be used to show that ensemble methods can significantly outperform individual models, thus giving the reader a concrete, real-world example of how ensembles can be successful.
from sklearn.datasets import load_breast_cancer
dataset = load_breast_cancer()
# Convert to a Pandas DataFrame so we can visualize nicely
import pandas as pd
import numpy as np
df = pd.DataFrame(data=dataset['data'], columns=dataset['feature_names'])
i = np.random.permutation(len(dataset['target']))
df = df.iloc[i, :7]
df['diagnosis'] = dataset['target'][i]
df = df.reset_index()
df.head()
index | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | diagnosis | |
---|---|---|---|---|---|---|---|---|---|
0 | 89 | 14.64 | 15.24 | 95.77 | 651.9 | 0.11320 | 0.13390 | 0.09966 | 1 |
1 | 181 | 21.09 | 26.57 | 142.70 | 1311.0 | 0.11410 | 0.28320 | 0.24870 | 0 |
2 | 535 | 20.55 | 20.86 | 137.80 | 1308.0 | 0.10460 | 0.17390 | 0.20850 | 0 |
3 | 37 | 13.03 | 18.42 | 82.61 | 523.8 | 0.08983 | 0.03766 | 0.02562 | 1 |
4 | 94 | 15.06 | 19.83 | 100.30 | 705.6 | 0.10390 | 0.15530 | 0.17000 | 0 |
Pre-process the data by standardizing the features to be zero mean and unit standard deviation.
X, y = dataset['data'], dataset['target']
# X = StandardScaler().fit_transform(X) # Never pre-processing as it leads to data leakage
rng=np.random.RandomState(seed=4190) # Initialize a random number generator
Once we have pre-processed our data set, we will train and evaluate bagging with decision trees, random forests and ExtraTrees in order to answer the following questions:
We compare the behavior of the three algorithms as the parameter n_estimators
increases.
Listing 2.5: Training and test errors with increasing ensemble size.
CAUTION: This experiment below runs slowly! Pickle files from a previous run are included for quick plotting.
import pickle
import os
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, ExtraTreesClassifier
from sklearn.metrics import accuracy_score
if not os.path.exists('./data/ErrorVsNumEstimators.pickle'):
max_leaf_nodes = 8
n_runs = 20
n_estimator_range = range(2, 20, 1)
bag_trn_error = np.zeros((n_runs, len(n_estimator_range))) # Train error for the bagging classifier
rf_trn_error = np.zeros((n_runs, len(n_estimator_range))) # Train error for the random forest classifier
xt_trn_error = np.zeros((n_runs, len(n_estimator_range))) # Train error for the extra trees classifier
bag_tst_error = np.zeros((n_runs, len(n_estimator_range))) # Test error for the bagging classifier
rf_tst_error = np.zeros((n_runs, len(n_estimator_range))) # Test error for the random forest classifier
xt_tst_error = np.zeros((n_runs, len(n_estimator_range))) # Test error for the extra trees classifier
for run in range(0, n_runs):
print('Run {0}'.format(run))
# Split into train and test sets
X_trn, X_tst, y_trn, y_tst = train_test_split(X, y, test_size=0.25, random_state=rng)
for j, n_estimators in enumerate(n_estimator_range):
# Train using bagging
bag_clf = BaggingClassifier(base_estimator=DecisionTreeClassifier(max_leaf_nodes=max_leaf_nodes),
n_estimators=n_estimators, max_samples=0.5, n_jobs=-1, random_state=rng)
bag_clf.fit(X_trn, y_trn)
bag_trn_error[run, j] = 1 - accuracy_score(y_trn, bag_clf.predict(X_trn))
bag_tst_error[run, j] = 1 - accuracy_score(y_tst, bag_clf.predict(X_tst))
# Train using random forests
rf_clf = RandomForestClassifier(max_leaf_nodes=max_leaf_nodes,
n_estimators=n_estimators, n_jobs=-1)
rf_clf.fit(X_trn, y_trn)
rf_trn_error[run, j] = 1 - accuracy_score(y_trn, rf_clf.predict(X_trn))
rf_tst_error[run, j] = 1 - accuracy_score(y_tst, rf_clf.predict(X_tst))
# Train using extra trees
xt_clf = ExtraTreesClassifier(max_leaf_nodes=max_leaf_nodes, bootstrap=True,
n_estimators=n_estimators, n_jobs=-1, random_state=rng)
xt_clf.fit(X_trn, y_trn)
xt_trn_error[run, j] = 1 - accuracy_score(y_trn, xt_clf.predict(X_trn))
xt_tst_error[run, j] = 1 - accuracy_score(y_tst, xt_clf.predict(X_tst))
results = (bag_trn_error, bag_tst_error, \
rf_trn_error, rf_tst_error, \
xt_trn_error, xt_tst_error)
with open('./data/ErrorVsNumEstimators.pickle', 'wb') as result_file:
pickle.dump(results, result_file)
else:
with open('./data/ErrorVsNumEstimators.pickle', 'rb') as result_file:
(bag_trn_error, bag_tst_error, \
rf_trn_error, rf_tst_error, \
xt_trn_error, xt_tst_error) = pickle.load(result_file)
%matplotlib inline
import matplotlib.pyplot as plt
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))
n_estimator_range = range(2, 20, 1)
# Plot the training error
m = np.mean(bag_trn_error*100, axis=0)
ax[0].plot(n_estimator_range, m, linewidth=1.5, marker='o', markersize=9, mfc='w');
m = np.mean(rf_trn_error*100, axis=0)
ax[0].plot(n_estimator_range, m, linewidth=1.5, marker='s', markersize=9, mfc='w');
m = np.mean(xt_trn_error*100, axis=0)
ax[0].plot(n_estimator_range, m, linewidth=1.5, marker='D', markersize=9, mfc='w');
ax[0].legend(['Bagging', 'Random Forest', 'Extra Trees'], fontsize=12)
ax[0].set_xlabel('Number of Estimators', fontsize=12)
ax[0].set_ylabel('Training Error (%)', fontsize=12)
ax[0].set_xticks(range(2, 20, 2))
ax[0].axis([0, 21, 1, 9])
# ax[0].grid()
# Plot the test error
m = np.mean(bag_tst_error*100, axis=0)
ax[1].plot(n_estimator_range, m, linewidth=1.5, marker='o', markersize=9, mfc='w');
m = np.mean(rf_tst_error*100, axis=0)
ax[1].plot(n_estimator_range, m, linewidth=1.5, marker='s', markersize=9, mfc='w');
m = np.mean(xt_tst_error*100, axis=0)
ax[1].plot(n_estimator_range, m, linewidth=1.5, marker='D', markersize=9, mfc='w');
ax[1].legend(['Bagging', 'Random Forest', 'Extra Trees'], fontsize=12)
ax[1].set_xlabel('Number of Estimators', fontsize=12)
ax[1].set_ylabel('Hold-out Test Error (%)', fontsize=12)
# ax[1].grid()
ax[1].set_xticks(range(2, 20, 2))
ax[1].axis([0, 21, 1, 9]);
plt.tight_layout()
# plt.savefig('./figures/CH02_F11_Kunapuli.png', format='png', dpi=300, bbox_inches='tight');
# plt.savefig('./figures/CH02_F11_Kunapuli.pdf', format='pdf', dpi=300, bbox_inches='tight');
Next, we compare the behavior of the three algorithms as the complexity of the base learners increases. There are several ways to control the complexity of the base decision trees: maximum depth, maximum number of leaf nodes, impurity criteria etc. Here, we compare the performance of the three ensemble methods as with complexity as determined by max_leaf_nodes
.
Unlisted: Compare the performance of bagging, random forests and extra trees on the breast cancer data set with increasing bag size. The maximum number of estimators is fixed to 10, which ensures that all three ensemble methods have the same number of estimators in the ensemble.
CAUTION: This experiment below runs slowly! Pickle files from a previous run are included for quick plotting.
# See if the result file for this experiment already exists, and if not, rerun and save
# a new set of results
if not os.path.exists('./data/ErrorVsNumLeaves.pickle'):
n_estimators = 10
max_leaf_nodes = 8
n_runs = 20
n_leaf_range = [2, 4, 8, 16, 24, 32]
bag_trn_error = np.zeros((n_runs, len(n_leaf_range))) # Train error for the bagging classifier
rf_trn_error = np.zeros((n_runs, len(n_leaf_range))) # Train error for the random forest classifier
xt_trn_error = np.zeros((n_runs, len(n_leaf_range))) # Train error for the extra trees classifier
bag_tst_error = np.zeros((n_runs, len(n_leaf_range))) # Test error for the bagging classifier
rf_tst_error = np.zeros((n_runs, len(n_leaf_range))) # Test error for the random forest classifier
xt_tst_error = np.zeros((n_runs, len(n_leaf_range))) # Test error for the extra trees classifier
for run in range(0, n_runs):
print('Run {0}'.format(run))
# Split into train and test sets
X_trn, X_tst, y_trn, y_tst = train_test_split(X, y, test_size=0.25, random_state=rng)
for j, max_leaf_nodes in enumerate(n_leaf_range):
# Train using bagging
bag_clf = BaggingClassifier(base_estimator=DecisionTreeClassifier(max_leaf_nodes=max_leaf_nodes),
n_estimators=n_estimators, max_samples=0.5, n_jobs=-1, random_state=rng)
bag_clf.fit(X_trn, y_trn)
bag_trn_error[run, j] = 1 - accuracy_score(y_trn, bag_clf.predict(X_trn))
bag_tst_error[run, j] = 1 - accuracy_score(y_tst, bag_clf.predict(X_tst))
# Train using random forests
rf_clf = RandomForestClassifier(max_leaf_nodes=max_leaf_nodes,
n_estimators=n_estimators, n_jobs=-1, random_state=rng)
rf_clf.fit(X_trn, y_trn)
rf_trn_error[run, j] = 1 - accuracy_score(y_trn, rf_clf.predict(X_trn))
rf_tst_error[run, j] = 1 - accuracy_score(y_tst, rf_clf.predict(X_tst))
# Train using extra trees
xt_clf = ExtraTreesClassifier(max_leaf_nodes=max_leaf_nodes, bootstrap=True,
n_estimators=n_estimators, n_jobs=-1, random_state=rng)
xt_clf.fit(X_trn, y_trn)
xt_trn_error[run, j] = 1 - accuracy_score(y_trn, xt_clf.predict(X_trn))
xt_tst_error[run, j] = 1 - accuracy_score(y_tst, xt_clf.predict(X_tst))
results = (bag_trn_error, bag_tst_error, \
rf_trn_error, rf_tst_error, \
xt_trn_error, xt_tst_error)
with open('./data/ErrorVsNumLeaves.pickle', 'wb') as result_file:
pickle.dump(results, result_file)
else:
with open('./data/ErrorVsNumLeaves.pickle', 'rb') as result_file:
(bag_trn_error, bag_tst_error, \
rf_trn_error, rf_tst_error, \
xt_trn_error, xt_tst_error) = pickle.load(result_file)
%matplotlib inline
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))
n_leaf_range = [2, 4, 8, 16, 24, 32]
# Plot the training error
m = np.mean(bag_trn_error*100, axis=0)
ax[0].plot(n_leaf_range, m, linewidth=1.5, marker='o', markersize=9, mfc='w');
m = np.mean(rf_trn_error*100, axis=0)
ax[0].plot(n_leaf_range, m, linewidth=1.5, marker='s', markersize=9, mfc='w');
m = np.mean(xt_trn_error*100, axis=0)
ax[0].plot(n_leaf_range, m, linewidth=1.5, marker='D', markersize=9, mfc='w');
ax[0].legend(['Bagging', 'Random Forest', 'Extra Trees'], fontsize=12)
ax[0].set_xlabel('Max. Leaf Nodes in Each Base Estimator', fontsize=12)
ax[0].set_ylabel('Training Error (%)', fontsize=12)
ax[0].set_xticks(n_leaf_range)
# ax[0].grid()
# Plot the test error
m = np.mean(bag_tst_error*100, axis=0)
ax[1].plot(n_leaf_range, m, linewidth=1.5, marker='o', markersize=9, mfc='w');
m = np.mean(rf_tst_error*100, axis=0)
ax[1].plot(n_leaf_range, m, linewidth=1.5, marker='s', markersize=9, mfc='w');
m = np.mean(xt_tst_error*100, axis=0)
ax[1].plot(n_leaf_range, m, linewidth=1.5, marker='D', markersize=9, mfc='w');
ax[1].legend(['Bagging', 'Random Forest', 'Extra Trees'], fontsize=12)
ax[1].set_xlabel('Max. Leaf Nodes in Each Base Estimator', fontsize=12)
ax[1].set_ylabel('Hold-out Test Error (%)', fontsize=12)
ax[1].set_xticks(n_leaf_range)
# ax[1].grid();
plt.tight_layout()
# plt.savefig('./figures/CH02_F12_Kunapuli.png', format='png', dpi=300, bbox_inches='tight');
# plt.savefig('./figures/CH02_F12_Kunapuli.pdf', format='pdf', dpi=300, bbox_inches='tight');
Finally, we see how we can use feature importances to identify the most predictive features for breast cancer diagnosis using the random forest ensemble. Such analysis adds interpretability to the model and can be very helpful in communicating and explaining such models to domain experts such as doctors.
First, let’s peek into the data set to see if we can discover some interesting relationships among the features and the diagnosis. This type of analysis is typical when we get a new data set, as we try to learn more about it. Here, our analysis will try to identify which features are most correlated with each other and with the diagnosis (label), so that we can check if random forests can do something similar.
import pandas as pd
import seaborn as sea
df = pd.DataFrame(data=dataset['data'], columns=dataset['feature_names'])
df['diagnosis'] = dataset['target']
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(8, 8))
cor = np.abs(df.corr())
sea.heatmap(cor, annot=False, cbar=False, cmap=plt.cm.Greys, ax=ax)
fig.tight_layout()
# plt.savefig('./figures/CH02_F13_Kunapuli.png', format='png', dpi=300, bbox_inches='tight');
# plt.savefig('./figures/CH02_F13_Kunapuli.pdf', format='pdf', dpi=300, bbox_inches='tight');
There are several features that are also highly correlated with the label, that is, the diagnosis as benign or malignant. Let’s identify the 10 features most correlated with the diagnosis label.
label_corr = cor.iloc[:, -1]
label_corr.sort_values(ascending=False)[1:11]
worst concave points 0.793566 worst perimeter 0.782914 mean concave points 0.776614 worst radius 0.776454 mean perimeter 0.742636 worst area 0.733825 mean radius 0.730029 mean area 0.708984 mean concavity 0.696360 worst concavity 0.659610 Name: diagnosis, dtype: float64
Random forests can also provide feature importances. The listing below illustrates this.
Listing 2.7: Feature importances in the WDBC data set using random forests
X_trn, X_tst, y_trn, y_tst = train_test_split(X, y, test_size=0.15)
n_features = X_trn.shape[1]
rf = RandomForestClassifier(max_leaf_nodes=24, n_estimators=50, n_jobs=-1)
rf.fit(X_trn, y_trn)
err = 1 - accuracy_score(y_tst, rf.predict(X_tst))
print('Prediction Error = {0:4.2f}%'.format(err*100))
importance_threshold = 0.02
for i, (feature, importance) in enumerate(zip(dataset['feature_names'],
rf.feature_importances_)):
if importance > importance_threshold:
print('[{0}] {1} (score={2:4.3f})'.format(i, feature, importance))
Prediction Error = 1.16% [2] mean perimeter (score=0.055) [3] mean area (score=0.065) [6] mean concavity (score=0.071) [7] mean concave points (score=0.138) [13] area error (score=0.065) [20] worst radius (score=0.080) [21] worst texture (score=0.023) [22] worst perimeter (score=0.067) [23] worst area (score=0.131) [26] worst concavity (score=0.029) [27] worst concave points (score=0.149)
We can plot the feature importances as identified by the random forest ensemble
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(8, 4))
# Identify the important features
importance_threshold = 0.02
idx = np.array(range(n_features))
imp = np.where(rf.feature_importances_ >= importance_threshold) # important features
rest = np.setdiff1d(idx, imp) # remaining features
# Plot the important features and the rest on a bar chart
plt.bar(idx[imp], rf.feature_importances_[imp], alpha=0.65)
plt.bar(idx[rest], rf.feature_importances_[rest], alpha=0.45)
# Print feature names on the bar chart
for i, (feature, importance) in enumerate(zip(dataset['feature_names'], rf.feature_importances_)):
if importance > importance_threshold:
plt.text(i, 0.015, feature, ha='center', va='bottom', rotation='vertical', fontsize=14)
print('[{0}] {1} (score={2:4.3f})'.format(i, feature, importance))
else:
plt.text(i, 0.01, feature, ha='center', va='bottom', rotation='vertical', fontsize=14, color='gray')
# Finish the plot
fig.axes[0].get_xaxis().set_visible(False)
plt.ylabel('Feature Importance Score', fontsize=12)
plt.xlabel('Features for Breast Cancer Diagnosis', fontsize=12);
plt.tight_layout()
# plt.savefig('./figures/CH02_F14_Kunapuli.png', format='png', dpi=300, bbox_inches='tight');
# plt.savefig('./figures/CH02_F14_Kunapuli.pdf', format='pdf', dpi=300, bbox_inches='tight');
[2] mean perimeter (score=0.055) [3] mean area (score=0.065) [6] mean concavity (score=0.071) [7] mean concave points (score=0.138) [13] area error (score=0.065) [20] worst radius (score=0.080) [21] worst texture (score=0.023) [22] worst perimeter (score=0.067) [23] worst area (score=0.131) [26] worst concavity (score=0.029) [27] worst concave points (score=0.149)