Introduction

In this document I'll show a python implementation of stacked generalization (or stacking), an ensemble technique introduced in [Wolpert, David H., 1992. Stacked generalization, Neural Networks, Volume 5, Issue 2, Pages 241-259]. Stacking uses cross validation to combine the results of several predictive models to improve their accuracy.

A particular case of stacked generalization (blending) was used by the winners of the Netflix Prize (http://www.netflixprize.com/assets/GrandPrize2009_BPC_BigChaos.pdf). Ensemble techniques are also extremely popular in several other competitions like the ones hosted on Kaggle. More important, these methods usually perform very well also on "real world" predictive modeling tasks. Stacked generalization is particularly effective when we have datasets describing different aspects of the "thing" we are trying to predict (eg. a dataset of patients' signals). Olivetti et. at. 2014. MEG Decoding Across Subjects - and a related Kaggle competition - is an example of using stacking to build a robust predictor across subjects.

Stacked Generalization

In its original formulation, the method works as follows:

  1. Split a dataset set into two disjoint sets (train/test).
  2. Train and test $k$ models, with cross validation, on the first part. These are called level-0 models
  3. Build train and test level-1 datasets, using the predictions from 2) as inputs
  4. Train a higher level model (level-1 model) on the level 1 data from 3) and use it to predict unseen instances from the test set in 1.

A complement to Wolpert's work is [Ting, Witten 1998. Issues in Stacked Generalization http://arxiv.org/pdf/1105.5466.pdf]; this paper presents empirical results that fill in on what Wolpert described as "black art". These can be considered a sort of "best practices" for stacking. In particular:

  1. Logistic Regression performs well as the level-1 model
  2. For classification tasks, build level-1 data using class probabilities rather than the predicted class labels.
  3. Like any ensemble method, stacking is ideal for parallel computation
  4. Stacking can work well with just two or three level-0 models

In terms of potential pitfalls, the common issue with loss of interpretability in model ensembles comes to mind. Perlich, Swirszcz 2010 suggest tha cross validation and stacking should be handled with care (eg. use stratified k-fold to improve robustness) when the dataset is skewed (eg. very small number of positive examples).

Data

I'm not really interested in the method performance, so I'll create an artificial dataset to experiment with a classification task.

In [69]:
from sklearn.cross_validation import train_test_split, StratifiedKFold
from sklearn.metrics import accuracy_score
from sklearn.datasets import make_classification
import numpy as np

n_features = 20
n_samples = 10000

X, y = make_classification(n_features=n_features, n_samples=n_samples)

Divide the dataset into a 75% - 25% training/test split to satisfy Step 1.

In [70]:
X_train, X_test, y_train, y_test = train_test_split(X, y)

Cross validation will be carried out by means of stratified k-fold.

In [71]:
skf = StratifiedKFold(y_test, n_folds=3)

Models

I'm using decision trees (cart) as level-0 classifiers

In [72]:
from sklearn.tree import DecisionTreeClassifier

n_models = 3
clfs = [DecisionTreeClassifier()] * n_models

and logistic regression as the level 1 model

In [74]:
from sklearn.linear_model import LogisticRegression 
lr = LogisticRegression()

Iterative version

We start the process by training and testing a classification tree $M_k^{-j}$ on a training set $L^{-j}$, for each fold $j$. We use the predictions of these models to build the level-1 dataset $L_{cv}$ that is the train set for the level-1 classifier $\tilde{M}$. In this loop we also take care of building a level-1 test set for $\tilde{M}$, by collecting the predictions of each models $M_k^{-j}$ on unseen instances (X_test).

The code comments follow [Ting, Witten 1998] naming conventions.

In [75]:
level_1_train = np.zeros((X_train.shape[0], len(clfs)))
level_1_test = np.zeros((X_test.shape[0], len(clfs), len(skf)))
In [105]:
for k, clf in enumerate(clfs):
    for j, (train_index, test_index) in enumerate(skf):
        # L^(-j), L_j 
        X_train_cv, X_test_cv = X_train[train_index], X_test[test_index]
        y_train_cv, y_test_cv = y_train[train_index], y_test[test_index]
        
        # M_k^(-j) - level 0 model (M_k) on the training set L^{-j}
        clf.fit(X_train_cv, y_train_cv)
        
        # L_cv = z_kj 
        # we use this dataset to train the level-1 model 
        # this is a 2-class problems, so we consider only the probability
        # p of class 0. 
        level_1_train[test_index, k] = clf.predict_proba(X_test_cv)[:, 0]
        
        # We build a level-1 test set to be used with the level 1 classifier.
        # This is the output of model M_k^(-j) on the held out test set
        level_1_test[:, k, j] = clf.predict_proba(X_test)[:, 0]

We conclude the training process by fitting a logistic regression on level-1 data.

In [81]:
lr.fit(level_1_train, y_train)
Out[81]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, penalty='l2', random_state=None, tol=0.0001)

Finally we predict labels on the level-1 test set. The per-fold classifiers predictions of each model $M_k^{-j}$ are blended using their mean as a combiner. This leads to what [Ting, Witten 1998] refer to as final level-0 models $M_k$.

In [82]:
pred = lr.predict(level_1_test.mean(2))

Parallel version

Cross validation does not require any form of comunication between the models being trained. This makes stacked generalization a good candidate for parallelization.

In this section I'll be using joblib, a frontend to the multiprocessing framework, to paralleize the training/testing of level-0 models as well as the generation of level 1 data. The results of parallel computations are written to shared, mem-mapped, ndarrays. In general this is not a good idea; numpy does not provide atomic operations and writes to shared segments can lead to data corruption. However, in this specific case we can rely on the fact that each classifier $k$ and fold $j$ are allocated exclusive segments of the shared ndarrays.

In [107]:
from joblib import Parallel, delayed
from joblib import load, dump, cpu_count

import tempfile
import shutil
import os
import numpy as np
In [13]:
mmap_dir = tempfile.mkdtemp()

X_train, X_test, y_train, y_test have been defined above.

For each input dataset, I'm releaseing the reference on the original in memory array (dump) and replacing it with a reference to the mem mapped ndarray. gc.collect() is called in Parallel just before forking. joblib.dump crashes Ipython Notebook, so for the sake of this example I will not mmap input dataset. I leave to code here as a template for future reuse.

{python}
X_train = np.memmap(os.path.join(mmap_dir, "X_train"), 
                          shape=X_train.shape,
                          mode='w+')
dump(X_train, os.path.join(mmap_dir, "X_train"))
X_train = load(os.path.join(mmap_dir, "X_train"), mmap_mode='r')

X_test = np.memmap(os.path.join(mmap_dir, "X_test"), 
                          shape=X_test.shape,
                          mode='w+')
dump(X_test, os.path.join(mmap_dir, "X_test"))
X_test = load(os.path.join(mmap_dir, "X_test"), mmap_mode='r')

y_train = np.memmap(os.path.join(mmap_dir, "y_train"), 
                          shape=y_train.shape,
                          mode='w+')
dump(y_train, os.path.join(mmap_dir, "y_train"))
y_train = load(os.path.join(mmap_dir, "y_train"), mmap_mode='r')

y_test = np.memmap(os.path.join(mmap_dir, "y_test"), 
                          shape=y_train.shape,
                          mode='w+')
dump(y_test, os.path.join(mmap_dir, "y_test"))
y_test = load(os.path.join(mmap_dir, "y_test"), mmap_mode='r')

Output data.

In [92]:
level_1_train = np.memmap(os.path.join(mmap_dir, "level_1_train"), 
                          shape=(X_train.shape[0], len(clfs)),
                          mode='w+')
level_1_test = np.memmap(os.path.join(mmap_dir, "level_1_test"),
                         shape=(X_test.shape[0], len(clfs), len(skf)), 
                         mode='w+')

cross_validate implements the training of level-0 models and generation of mem-mapped level-1 data.

In [106]:
def cross_validate(params):
    (level_1_train,
        level_1_test,
        X_train,
        X_test,
        y_train,
        y_test,
        train_index,
        test_index,
        k,
        j,
        clf
    ) = params
    
    X_train_cv, X_test_cv = X_train[train_index], X_test[test_index]
    y_train_cv, y_test_cv = y_train[train_index], y_test[test_index]

    clf.fit(X_train_cv, y_train_cv)
    
    level_1_train[test_index, k] = clf.predict_proba(X_test_cv)[:, 0]
    level_1_test[:,k,j] = clf.predict_proba(X_test)[:, 0]

We can use list comprehension to generate a list of parameters to pass to cross_validate() via delayed(). Each element of the list, is itself a list containing the $k$ model and $j$ fold data.

Note that we could be passing the $j$-th fold as eg. X_train[train_index] rather that the whole X_train. However, the function is supposed to use a mem-mapped version of the input data, hence we would pass a reference rather than a copy of the object.

In [100]:
params = [[level_1_train,
            level_1_test,
            X_train,
            X_test,
            y_train,
            y_test,
            train_index,
            test_index,
            k,
            j,
            clf] 
           for k, clf in enumerate(clfs) 
           for j, (train_index, test_index) in enumerate(skf)]
In [ ]:
#n_jobs = max(1, min(cpu_count()-1, len(clfs)*len(skf)))
n_jobs = 4
In [ ]:
results = Parallel(n_jobs=n_jobs)(delayed(cross_validate)(param) for param in params)

Like in the iterative case, we use logistic regression as the level-1 model and predict unseen instances on the blended level-1 test set.

In [102]:
from sklearn.linear_model import LogisticRegression 
lr = LogisticRegression()
lr.fit(level_1_train, y_train)
pred = lr.predict(level_1_test.mean(2))

At last we clean up mmap data

In [103]:
shutil.rmtree(mmap_dir)

Conclusion

Is stacking a silver bullet that improves classification accuracy in every possible task? No, as it is the case with predictive modeling, performance depends on a number of factors. YMMV.