In this notebook, I will implement algorithms described in Sections 14.2 and 14.3 of the book PRML, i.e., bagging and adaboost for binary classifications.
import numpy as np
import matplotlib as mpl
from matplotlib import pyplot as plt
%matplotlib inline
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12
In this notebook, we construct strong classifier from a given simple classifier, which we call a base classifier.
First, let us consider an extremely simple classifier called decision stump.
A decision stump is characterized by parameter $\theta = (d, a, s) \in \{ 0,1, \dots, D-1 \} \times \mathbb{R} \times \{ 1, -1 \} $, whose prediction is given by $$ \begin{align} y(x, (d,a,s)) = \begin{cases} s & (x^{(d)} \geq a ) \\ -s & (x^{(d)} < a) \end{cases} \end{align} $$ where $x \in \mathbb{R}^D$ and $x^{(d)}$ stands for its $d$th component. In essence, the decision stump $y(\cdot, (d,a,s))$ makes prediction according to whether the $d$ th coordinate of the input is smaller than $a$ or not.
In training a decision stump, we choose a parameter $\theta$ that minimizes $$ \begin{align} \sum_{n=0}^{N-1} w_n I\left[ y(x_n, \theta) \neq t_n \right], \end{align} $$ where $w_n$ stands for sample weight, and will be specified later depending on the algorithm we consider.
Given training data $x_0, \dots, x_{N-1}$, we only consider $2D(N-1)$ possible decision stumps because
As to the final point: The limitation does not affect the values of the error function, although it affects prediction error. Also, we assume that the training data is not single class (i.e., there are $n$ and $m$ such that $t_n \neq t_m$ ).
In the later implementation, we perform this exhaustive search.
The idea of bagging, or bootstrap aggregation, is to train $M$ base classifiers using different "training data" for each classifier, where we obtain the "training data" by resampling from the original training data.
For $m = 0,1, \dots, M-1$,
Output the prediction for input $x$ by $$ \begin{align} y_{bag}(x) = \mathrm{sign} \left[ \sum_{m=0}^{M-1} y(x, \theta_m) \right] \end{align} $$
NOTE : Eq. (14.7) in the book is for regression, and for classification, we have to take sign to obtain meaningful result. We ommitted $1/M$ factor, because here we only care about sign.
NOTE : Formally, we can use other classifiers than decision stump for AdaBoost. However, in that case, the error function to be minimized by each base classifier is not necessarily $\sum_{n=0}^{N-1} w^{(m)}_{n} I \left[ y(x_n, \theta_m) \neq t_n \right]$, and I am not sure whether it is theoretically sound or not.
In AdaBoost, we train the classifier following the algorithm shown below:
NOTE :
Having obtained the classifier, the predicted label for input $x$ is given by $$ \begin{align} \textrm{sign} \left[ \sum_{m=0}^{M-1} \alpha_m y(x, \theta_m) \right] \end{align} $$
Here we define three classes, namely, DecisionStump
, Bagging
, and AdaBoost
.
We let Bagging
and AdaBoost
classes have fit
and precit
methods, and the following properties
BaseClassifierClass
: the class from which base classifiers are generated (class, rather than instance should be given), which is assumed to implement fit
and predict
methods. For AdaBoost
class, it is also assumed that the fit
method of BaseClassifierClass
can handle sample_weight
.num_clfs
: $M$, i.e., the number of base classifiers used.clfs
: list for storing base classifiersWe let our DecisionStump
class to have the following properties
axis
: $d$, i.e., the axis used for predictionsign
: $s$threshold
: $a$In fitting a decision stump, we
fit_onedim
method (where the axis $d$ is given), andIn fit_onedim
, we use the folowoing arrays, assuming that $X, T$ and $w$ are sorted according the value of the $d$th component of the input data $X$:
pred
: $(N-1, N)$ array, where pred[m,n]
= $y\left(x_n, \left(d, \frac{x_m + x_{m+1}}{2}, 1 \right) \right)$. Note that it can be generated easily by using np.tri (Although it is inefficient in terms of memory. ).mistakes
: $(N-1, N)$ array, where mistakes[m,n]
= $I \left[ y\left(x_n, \left(d, \frac{x_m + x_{m+1}}{2}, 1 \right) \right) \neq t_n\right] $errs
: $(2, N-1)$ array, whereerrs[0,m]
= $\sum_{n=0}^{N-1} w_n I \left[ y\left(x_n, \left(d, \frac{x_m + x_{m+1}}{2}, 1 \right) \right) \neq t_n\right] $errs[1,m]
= $\sum_{n=0}^{N-1} w_n I \left[ y\left(x_n, \left(d, \frac{x_m + x_{m+1}}{2}, -1 \right) \right) \neq t_n\right] $class DecisionStump:
def __init__(self, axis=None, sign=None, threshold=None):
self.axis = axis
self.sign = sign
self.threshold = threshold
def fit_onedim(self, X, y, sample_weight, axis):
'''
Performing exhaustive search on threshold and sign, where the axis to consider is given.
Parameters
----------
X : 2D numpy array
2D numpy array representing input data, where X[n, i] represents the i-th element of n-th point in X.
y : 1D numpy array
(len(X),) numpy array representing labels, where y[n] represents the label corresponding to n-th point in X.
Each element should be either 1 or -1
sample_weight : 1D numpy array
(len(X), ) numpy array representing the sample weights.
The elements should be non-negative.
axis : integer
A non-negative integer the axis to be considered.
Must be between 0 and X.shape(1)-1
Returns
----------
sign : int, 1 or -1
Integer representing the sign s for the (candidate) decision stump
threshold : float
Threshold a for the (candidate) decision stump
err : float
Training error for the (candidate) decision stump
'''
N = len(X)
# Here we sort everything according the axis-th coordinate of X
sort_ind = np.argsort(X[:, axis])
sorted_label = y[sort_ind]
sorted_input = X[sort_ind]
sorted_sample_weight = sample_weight[sort_ind]
pred = -2*np.tri(N-1, N, k=0, dtype='int') + 1
mistakes = (pred != sorted_label ).astype('int')
# The (weighted) error is calculated for each classifier
errs = np.zeros((2, N-1))
errs[0] = mistakes @ sorted_sample_weight
errs[1] = (1 - mistakes) @ sorted_sample_weight
# Here, we select the best threshold and sign
ind = np.unravel_index(np.argmin(errs, axis=None), errs.shape)
sign = -2*ind[0] + 1
threshold = ( sorted_input[ind[1], axis] + sorted_input[ind[1] + 1, axis] ) / 2
err = errs[ind]
return sign, threshold, err
def fit(self, X, y, sample_weight=None):
'''
Performing fitting by exhaustive search on threshold, sign, and axis.
Parameters
----------
X : 2D numpy array
2D numpy array representing input data, where X[n, i] represents the i-th element of n-th point in X.
y : 1D numpy array
(len(X),) numpy array representing labels, where y[n] represents the label corresponding to n-th point in X.
Each element should be either 1 or -1
sample_weight : 1D numpy array
(len(X), ) numpy array representing the sample weights.
The elements should be non-negative.
If None, the sample_weight is assumed to be uniform.
'''
N, D = X.shape
if sample_weight is None:
sample_weight = np.ones(N)/N
signs = np.zeros(D)
threshs = np.zeros(D)
errs = np.zeros(D)
for axis in range(D):
signs[axis], threshs[axis], errs[axis] = self.fit_onedim(X, y, sample_weight, axis)
self.axis = np.argmin(errs)
self.sign = signs[self.axis]
self.threshold = threshs[self.axis]
def predict(self, X):
'''
The method predicts the labels for the given input data X.
Parameters
----------
X : 2D numpy array
2D numpy array representing input data, where X[n, i] represents the i-th element of n-th point in X.
Returns
----------
y : 1D numpy array
(len(X),) numpy array representing the predicted labels, where y[n] represents the label corresponding to n-th point in X.
'''
return self.sign*( 2*(X[:, self.axis] >= self.threshold) - 1 )
The implementation is rather straightforward.
The code is written in a way that it can be used for other base classifiers, assuming that the base classifier class has fit
and predict
method.
class Bagging:
def __init__(self, BaseClassifierClass, num_clfs):
self.BaseClassifierClass = BaseClassifierClass # base classifier class
self.num_clfs = num_clfs
self.clfs = [self.BaseClassifierClass() for _ in range(self.num_clfs)]
def fit(self, X, y):
'''
The method performs fitting by bagging using the given base classifier.
Parameters
----------
X : 2D numpy array
2D numpy array representing input data, where X[n, i] represents the i-th element of n-th point in X.
y : 1D numpy array
(len(X),) numpy array representing labels, where y[n] represents the label corresponding to n-th point in X.
Each element should be either 1 or -1
'''
N = len(X)
for m in range(self.num_clfs):
# resampling is performed here
sample_ind = np.random.randint(0, N, N)
X_resampled = X[sample_ind]
y_resampled = y[sample_ind]
# traing the classifier using the resampled training data
self.clfs[m].fit(X_resampled, y_resampled)
def predict(self, X):
'''
The method predicts the labels for the given input data X.
Parameters
----------
X : 2D numpy array
2D numpy array representing input data, where X[n, i] represents the i-th element of n-th point in X.
y : 1D numpy array
(len(X),) numpy array representing the predicted labels, where y[n] represents the label corresponding to n-th point in X.
'''
base_pred = np.zeros((self.num_clfs, len(X)))
for m in range(self.num_clfs):
base_pred[m] = self.clfs[m].predict(X)
return np.sign( np.sum(base_pred, axis = 0 ) )
Just like the Bagging
class, the code is written in a way that it can be used for other base classifiers, asuming that the base classifier class has
fit
method which takes input data, label, and sample weight, andpredict
method.class AdaBoost:
def __init__(self, BaseClassifierClass, num_clfs):
self.BaseClassifierClass = BaseClassifierClass # base classifier class
self.num_clfs = num_clfs
self.alpha = np.zeros(self.num_clfs)
self.clfs = [self.BaseClassifierClass() for _ in range(self.num_clfs)]
def fit(self, X, y):
'''
The method performs fitting by AdaBoost using the given base classifier.
Parameters
----------
X : 2D numpy array
2D numpy array representing input data, where X[n, i] represents the i-th element of n-th point in X.
y : 1D numpy array
(len(X),) numpy array representing labels, where y[n] represents the label corresponding to n-th point in X.
Each element should be either 1 or -1
'''
N = len(X)
w = np.ones(N)/N # initialize the weight
for m in range(self.num_clfs):
self.clfs[m].fit(X, y, sample_weight = w)
mistakes = (self.clfs[m].predict(X) != y)
# calculate the epsilon and alpha
ep = np.sum ( w * mistakes)
self.alpha[m] = np.log(1.0/ep - 1)
# update the weight
w = w * np.exp( self.alpha[m] * mistakes )
w = w/np.sum(w)
def predict(self, X):
'''
The method predicts the labels for the given input data X.
Parameters
----------
X : 2D numpy array
2D numpy array representing input data, where X[n, i] represents the i-th element of n-th point in X.
y : 1D numpy array
(len(X),) numpy array representing the predicted labels, where y[n] represents the label corresponding to n-th point in X.
'''
base_pred = np.zeros((self.num_clfs, len(X)))
for m in range(self.num_clfs):
base_pred[m] = self.clfs[m].predict(X)
return np.sign( base_pred.T @ self.alpha )
def get_meshgrid(x, y, nx, ny, margin=0.1):
x_min, x_max = (1 + margin) * x.min() - margin * x.max(), (1 + margin) * x.max() - margin * x.min()
y_min, y_max = (1 + margin) * y.min() - margin * y.max(), (1 + margin) * y.max() - margin * y.min()
xx, yy = np.meshgrid(np.linspace(x_min, x_max, nx),
np.linspace(y_min, y_max, ny))
return xx, yy
def plot_result(ax, clf, xx, yy, X, y):
pred = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
ax.contourf(xx, yy, pred, alpha=0.7)
ax.scatter(X[:,0], X[:,1], c=y, edgecolor='k')
from sklearn import datasets
X, y = datasets.make_moons(n_samples = 50, noise = 0.1, random_state=1)
y = 2*y-1
plt.scatter(X[:,0], X[:,1], c=y, edgecolor='k')
xx, yy = get_meshgrid(X[:, 0], X[:, 1], nx=100, ny=100, margin=0.1)
clfs = [DecisionStump(), Bagging(DecisionStump, 100), AdaBoost(DecisionStump, 100)]
clf_names = ['Decision Stump', 'Bagging', 'AdaBoost']
fig, axes = plt.subplots(1, 3, figsize=(15,5))
for i in range(len(clfs)):
clfs[i].fit(X, y)
plot_result(ax=axes[i], clf=clfs[i], xx=xx, yy=yy, X=X, y=y)
axes[i].set_title(clf_names[i])
Let us take a look on how AdaBoost behaves when we increase the number of base classifiers.
num_clf_list = [2, 4, 8, 16, 32, 64]
clfs = [AdaBoost(DecisionStump, M) for M in num_clf_list]
fig, axes = plt.subplots(2, 3, figsize=(15,10))
for i in range(len(clfs)):
clfs[i].fit(X, y)
plot_result(ax=np.ravel(axes)[i], clf=clfs[i], xx=xx, yy=yy, X=X, y=y)
np.ravel(axes)[i].set_title(f'M = {num_clf_list[i]}')
Roughly speaking, the larger the number of base classifiers is, the more complex the decision boundary becames. It seems to be intuitively natural, because in AdaBoost, a base classifier "tries" to correct the mistakes that its previous base classifier made.