Lecture 9: Meta-Learning

Ensembles: $n$ Heads are Better than One

In a previous assignment, we had you analyze the Titanic dataset on Kaggle. You may have noticed that Kaggle competitions boast thousands of participants, many of whom are professional data scientists. Out of all of these participants, many of whom are using the models we learned in class, how does a winner emerge?

Is it in the way that they clean and process their data? Perhaps, but cleaning data can only go so far to improve predictive power. Ultimately, success is in the selection and optimization of models. We've seen some techniques for this in lecture 7 (like PCA), but many Kagglers take things even further. Let's see what the first-place Kaggle user, Gilberto Titericz Jr, used for the Otto Product Classification Challenge [1]:

Kaggle winning image

Titericz's solution contained several layers, each containing a blend of learning models. We'll now learn why this is the case.

Review: Bias-Variance Tradeoff

Why do several models perform better than a single model? Recall our discussion of goodness of fit - a model is often prone to either underfitting or overfitting. Another way to look at this dichotomy is as a tradeoff between bias and variance.

Bias is another word for "systematic error." This is error that can occur because a model is underdeveloped or undertrained - in other words, it is inadequate for the task it is attempting to perform. Imagine, for example, using a linear model to fit a relationship that is clearly quadratic or cubic. Some of the relationship would be captured, and occasionally the model would get things right. But as a whole, this linear model would have all-around poor performance. We call this underfitting.

Variance can be thought of as "random error." This, in contrast with bias, occurs because a model is overly sophisticated or complex for the task at hand. The model would perform very well on the training data, but as soon as the data is varied slightly - as soon as we move to a testing set, for example - this model would perform poorly. An example of this is fitting a degree-9 polynomial to a relationship that is quadratic or cubic. As we saw last lecture, this is an example of overfitting.

The bias-variance tradeoff states that for a single machine learning model, bias and variance cannot be reduced simultaneously: [2]

Bias/variance graph

We have already seen methods for reducing variance (regularization), but there are other recently-developed methods for reducing both bias and variance. These methods, also called ensemble methods, bypass the bias-variance tradeoff by employing a mix of many models. The usage of many models increases predictive power, and the blending of these models together into a single output reduces variance.

We'll examine a few types of ensembles: boosting, bagging, and stacking.

Boosting

The Idea

Boosting is a sequential ensemble method: we start from an ineffective machine learning model and gradually "boost" it up, increasing its predictive abilities in each time step. The method by which we improve this model is underspecified, and each specific boosting implementation performs it differently.

Generally, the improvement method is as follows: take the parts of the data where the model has done poorly and retrain a new version of the model on this "failed" data. We then combine this new version with the old version, hoping that this combination performs better than the old version alone [3].

Below is a diagram of a boosted model. Note how each model is trained on a different set of data based on the strengths of the others [4]:

Boosted model

When boosting, our starting point (the very first model used) is often called a weak learner. The power of this starting point doesn't matter much, since the output of a boosting procedure (if done correctly) should be stronger than any individual model.

Example: AdaBoost

AdaBoost is a widely used boosting framework which uses the procedure described above on decision trees. Each iteration of the decision tree created by AdaBoost weights incorrectly classified data more heavily than the previous tree. We continue the process a given number of times until we're satisfied, and then return the weighted sum of all of the decision trees [5].

In this competition, we are given a list of collision events and their properties. We will then predict whether a τ → 3μ decay happened in this collision. This τ → 3μ is currently assumed by scientists not to happen, and the goal of this competition was to discover τ → 3μ happening more frequently than scientists currently can understand.

The challenge here was to design a machine learning problem for something no one has ever observed before. Scientists at CERN developed the following designs to achieve the goal.

https://www.kaggle.com/c/flavours-of-physics/data

In [2]:
#Data Cleaning
import pandas as pd
data_test = pd.read_csv("test.csv")
data_train = pd.read_csv("training.csv")
data_train = data_train.drop('min_ANNmuon',1)
data_train = data_train.drop('production',1)
data_train = data_train.drop('mass',1)

#Cleaned data
Y = data_train['signal']
X = data_train.drop('signal',1)
In [3]:
#adaboost
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
seed = 9001 #this ones over 9000!!!
boosted_tree = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), algorithm="SAMME", 
                                  n_estimators=50, random_state = seed)
model = boosted_tree.fit(X, Y)
In [4]:
predictions = model.predict(data_test)
predictions
#Note we can't really validate this data since we don't have an array of "right answers"
Out[4]:
array([0, 0, 1, ..., 0, 0, 0])

Example: GradientBoost

GradientBoost uses another paradigm: the idea of gradient descent. We take some cost function that tells us how poorly the ensemble is performing. We then take the gradient of this cost function. This is the direction of maximum decrease; this is the direction we'd like to go in since we want to minimize the cost function. We then adjust the next tree such that the cost function moves in the direction of this gradient [6].

In [5]:
#stochastic gradient boosting
from sklearn.ensemble import GradientBoostingClassifier
gradient_boosted_tree = GradientBoostingClassifier(n_estimators=50, random_state=seed)
model2 = gradient_boosted_tree.fit(X,Y)
In [6]:
predictions2 = model2.predict(data_test)
predictions2
Out[6]:
array([0, 0, 1, ..., 0, 0, 0])

Bagging

The Idea

Bagging, also known as Bootstrap aggregating, is a parallel ensemble method. Before talking more about Bootstrap Aggregating ensemble, we might want to understand what Bootstrap is; bootstrap is choosing a random sample from the dataset with replacement (in our context, it will be getting a random subset() from the original data.frame with replacement). Therefore, bootstrap agggregating is to choose multiple bootstrap samples, train them separately and independently (thus, a parallel method as mentioned above), and combine (or aggregate) each trained model's result as a whole.

More formal definition is: Bagging is a parallel ensemble method: we run a host of models in parallel on a dataset, averaging or aggregating their results in some way. Each model runs on a randomly selected subset of the data, and all models are trained independently of the others (contrast this with boosting).

The primary goal of bagging is to reduce variance. Bagging does not necessarily improve predictive power; instead it is mainly used to prevent overfitting [7]. Below is a diagram of a typical bagging process [8]:

Bagging

Example: Random Forest

An example of bagging applied to decision trees is the popular random forest model. Random forests create decision trees which are randomly trained on the data. The final output of a given test point is the average of the outputs of all of the randomly trained trees on that point.

In [7]:
from sklearn.datasets import load_digits
digits = load_digits()
X = digits['data']
y = digits['target']

from sklearn import model_selection
kfold = model_selection.KFold(n_splits=50, random_state=7)
In [8]:
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()

results = model_selection.cross_val_score(model, X, y, cv=kfold)
print(results.mean())
print(results.var())
0.84473015873
0.00479692013102
In [10]:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=50, max_features=5)

results = model_selection.cross_val_score(model, X, y, cv=kfold)
print(results.mean())
print(results.var())
0.96826984127
0.000867636432351

Stacking

The Idea

Stacking is perhaps the epitome of meta-learning: stacked models perform machine learning on other models. We take various models (often with different features or different learning algorithms) and train them on random subsets (folds) of our data. We then perform another machine learning model - for instance, linear or logistic regression - on the outputs of these models to obtain a final value [9]. Using a regression model gives us a weighted sum of all of the models - much like what we see in the Kaggle winner's diagram.

Here is a diagram of stacking [10]:

Stacking

It is important to train each model on a random subset of the data. Otherwise, the stacked model will simply lean towards the most accurate model in the set of models and ignore others. Random subsetting is important for achieving the right "blend" of model weights.

Stacking Example

We will be using the titanic dataset found here and stacked models to predict who survived the sinking of the titanic. This example stacks four different models (kNN, SVM, decision trees, and logistic regression) and uses logistic regression as the stacked model. Note the use of test folds and how the predictions are averaged in the final stacked model. This ensures all models are taken into account rather than the stacked model trying to determine which model is the most accurate.

In [12]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.preprocessing import LabelEncoder

titanic = pd.read_csv('train.csv')
titanic = titanic.dropna(axis=0, how='any')

Y = titanic['Survived']
titanic = titanic[['Pclass', 'Sex', 'Age']]
# changes sex to numeric
enc = LabelEncoder()
titanic['Sex'] = enc.fit_transform(titanic['Sex'])

# splits into training and testing
X_train, X_test, Y_train, Y_test = train_test_split(titanic, Y, test_size=.1)
# splits training into 4 folds
kf = KFold(n_splits = 4)
split1, split2, split3, split4 = kf.split(X_train, Y_train)
In [13]:
# kNN - uses first fold
X_train_kNN = X_train.iloc[split1[0]]
X_test_kNN = X_train.iloc[split1[1]]
Y_train_kNN = Y_train.iloc[split1[0]]
Y_test_kNN = Y_train.iloc[split1[1]]

from sklearn.neighbors import KNeighborsClassifier
kNN = KNeighborsClassifier()
kNN.fit(X_train_kNN, Y_train_kNN)
kNN_pred = kNN.predict(X_test_kNN)

kNN_score = (Y_test_kNN == kNN_pred).sum() / len(kNN_pred)
print("kNN: " + str(kNN_score))
kNN: 0.658536585366
In [14]:
# SVM - uses second fold
X_train_SVM = X_train.iloc[split2[0]]
X_test_SVM = X_train.iloc[split2[1]]
Y_train_SVM = Y_train.iloc[split2[0]]
Y_test_SVM = Y_train.iloc[split2[1]]

from sklearn import svm
SVM = svm.SVC()
SVM.fit(X_train_SVM, Y_train_SVM)
SVM_pred = SVM.predict(X_test_SVM)

SVM_score = (Y_test_SVM == SVM_pred).sum() / len(SVM_pred)
print("SVM: " + str(SVM_score))
SVM: 0.609756097561
In [15]:
# decision tree - uses third fold
X_train_tree = X_train.iloc[split3[0]]
X_test_tree = X_train.iloc[split3[1]]
Y_train_tree = Y_train.iloc[split3[0]]
Y_test_tree = Y_train.iloc[split3[1]]

from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier()
tree.fit(X_train_tree, Y_train_tree)
tree_pred = tree.predict(X_test_tree)

tree_score = (Y_test_tree == tree_pred).sum() / len(tree_pred)
print("Decision Tree: " + str(tree_score))
Decision Tree: 0.780487804878
In [16]:
# logistic - uses fourth fold
X_train_logit = X_train.iloc[split4[0]]
X_test_logit = X_train.iloc[split4[1]]
Y_train_logit = Y_train.iloc[split4[0]]
Y_test_logit = Y_train.iloc[split4[1]]

from sklearn.linear_model import LogisticRegression
logit = LogisticRegression()
logit.fit(X_train_logit, Y_train_logit)
logit_pred = logit.predict(X_test_logit)

logit_score = (Y_test_logit == logit_pred).sum() / len(logit_pred)
print("Logistic Regression: " + str(logit_score))
Logistic Regression: 0.780487804878
In [17]:
# stacking
# combines four folds
import numpy as np
kNN_pred = kNN_pred.reshape(-1,1)
SVM_pred = SVM_pred.reshape(-1,1)
tree_pred = tree_pred.reshape(-1,1)
logit_pred = logit_pred.reshape(-1,1)
pred_input = np.vstack([kNN_pred, SVM_pred, tree_pred, logit_pred])

from sklearn.linear_model import LogisticRegression
stacked = LogisticRegression()
stacked.fit(pred_input, Y_train)

# the first layer predicts
pred_test = pd.DataFrame({'kNN': kNN.predict(X_test), 'SVM': SVM.predict(X_test), 'tree': tree.predict(X_test), 'logit': logit.predict(X_test)})
# the average of the predictions is calculated
pred_test = pred_test.assign(avg=pred_test.mean(axis=1))
stacked_pred = stacked.predict(pred_test[['avg']])

stacked_score = (Y_test == stacked_pred).sum() / len(stacked_pred)
print("Stacked: " + str(stacked_score))
Stacked: 0.894736842105

Terms to Review

  • Ensemble
  • Bias
  • Variance
  • Boosting
  • Weak learner
  • AdaBoost
  • GradientBoost
  • Gradient descent
  • Bagging (bootstrap aggregating)
  • Random forest
  • Stacking