In [1]:
%load_ext watermark
In [2]:
%watermark -d -u -v -p scikit-learn,pandas,scipy,matplotlib
Last updated: 15/08/2014 

CPython 3.4.1
IPython 2.0.0

scikit-learn 0.15.1
pandas 0.14.0
scipy 0.14.0
matplotlib 1.3.1

[More information](https://github.com/rasbt/watermark) about the `watermark` magic command extension.


I would be happy to hear your comments and suggestions. Please feel free to drop me a note via twitter, email, or google+.


Streamline your cross-validation and classification workflow

- scikit-learn's Pipeline in action



Sections



Introduction

The Pipeline class in scikit-learn's pipeline module offers a convenient way to execute a chain of normalization and pre-processing steps, as well as predictors and classifiers on a dataset.

Such a pipeline is especially useful in the context of cross-validation, where we are interested in testing and comparing different feature selection techniques, dimensionality reduction approaches, and classifiers. All in all, Pipelines are not only a great time-saver, but they also allow us to write clutter-free code and stay organized in the attempt to find the ideal combination of techniques for solving a pattern classification task.

In the current implementation of Pipeline the different steps in the chain have to be classes that have fit and transform methods (or, alternatively, a fit_transform method instead of transform).

But before we get bogged down into details, let us prepare the Iris dataset in order to have a look at a Pipeline in action.



An example of a simple pipeline that consists of three steps: standardizing the dataset, dimensionality reduction via PCA, and training a naive Bayes classifier:

In [ ]:
clf = Pipeline(steps=[
    ('scaler', StandardScaler()),    
    ('reduce_dim', PCA(n_components=2)),
    ('classifier', GaussianNB())   
    ])

clf.fit(X_train, y_train)      # fitting on the training dataset
pred = clf_lda.predict(X_test) # classifying the test dataset



An example for using Pipelines in Cross-validation and classification

Since I don't want to distract from the main topic of this article, the Pipelines, I want to keep the descriptions about the other steps and concepts as concise as possible. Below is a list of the particular topics with links to other articles for further reading.




The Iris dataset

A dataset of 150 flowers from three iris classes with 4 different features.

Cross-validation

Cross-validation is a statistical technique to estimate the prediction error rate by splitting the data into training, cross-validation, and test datasets. A prediction model is obtained using the training set, and model parameters are optimized by the cross-validation set, while the test set is held primarily for empirical error estimation.

  • Separate article in preparation

Feature Scaling and Standardization

A data pre-processing step for re-scaling features from different measurements to match proportions of a standard normal distribution (unit variance centered at mean=0).

Principal Component Analysis (PCA)

A linear transformation technique that is commonly used to project a dataset (without utilizing class labels) onto a new feature space or feature subspace (for dimensionality reduction) where the new component axes are the directions that maximize the variance/spread of the data.

Linear Discriminant Analysis (LDA)

A linear transformation technique (related to Principal Component Analysis) that is commonly used to project a dataset onto a new feature space or feature subspace, where the new component axes maximize the spread between multiple classes, or for classification of data.

Naive Bayes classifier

A classifier based on a statistical model (i.e., Bayes theorem: calculating posterior probabilities based on the prior probability and the so-called likelihood) in the field of pattern classification. Naive Bayes assumes that all attributes are conditionally independent, thereby, computing the likelihood is simplified to the product of the conditional probabilities of observing individual attributes given a particular class label.

  • Separate article in preparation



Reading in the Iris dataset

In [3]:
import pandas as pd

df = pd.io.parsers.read_csv(
    filepath_or_buffer='https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', 
    header=None, 
    sep=',', 
    )
df.dropna(how="all", inplace=True) # to drop the empty line at file-end

feature_dict = {i:label for i,label in zip(
            range(4),
              ('sepal length in cm', 
              'sepal width in cm', 
              'petal length in cm', 
              'petal width in cm', ))}
/Users/sebastian/miniconda3/envs/py34/lib/python3.4/site-packages/pandas/io/excel.py:626: UserWarning: Installed openpyxl is not supported at this time. Use >=1.6.1 and <2.0.0.
  .format(openpyxl_compat.start_ver, openpyxl_compat.stop_ver))



Since it is more convenient to work with numerical values, we will use the LabelEncoder from the scikit-learn library to convert the class labels into numbers: 1, 2, and 3.

In [4]:
from sklearn.preprocessing import LabelEncoder

X = df[[0,1,2,3]].values 
y = df[4].values

enc = LabelEncoder()
label_encoder = enc.fit(y)
y = label_encoder.transform(y) + 1

label_dict = {1: 'Setosa', 2: 'Versicolor', 3:'Virginica'}



Exploratory visualization

In [5]:
%matplotlib inline
In [9]:
from matplotlib import pyplot as plt
import numpy as np
import math

fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(12,6))

for ax,cnt in zip(axes.ravel(), range(4)):

    # set bin sizes
    min_b = math.floor(np.min(X[:,cnt]))
    max_b = math.ceil(np.max(X[:,cnt]))
    bins = np.linspace(min_b, max_b, 25)

    # plottling the histograms
    for lab,col in zip(range(1,4), ('blue', 'red', 'green')):
        ax.hist(X[y==lab, cnt],
                   color=col, 
                   label='class %s' %label_dict[lab], 
                   bins=bins,
                   alpha=0.5,)
    ylims = ax.get_ylim()

    # plot annotation
    leg = ax.legend(loc='upper right', fancybox=True, fontsize=8)
    leg.get_frame().set_alpha(0.5)
    ax.set_ylim([0, max(ylims)+2])
    ax.set_xlabel(feature_dict[cnt])
    ax.set_title('Iris histogram #%s' %str(cnt+1))

    # hide axis ticks
    ax.tick_params(axis="both", which="both", bottom="off", top="off",  
            labelbottom="on", left="off", right="off", labelleft="on")

    # remove axis spines
    ax.spines["top"].set_visible(False)  
    ax.spines["right"].set_visible(False) 
    ax.spines["bottom"].set_visible(False) 
    ax.spines["left"].set_visible(False)

axes[0][0].set_ylabel('count')
axes[1][0].set_ylabel('count')

fig.tight_layout()

plt.show()

Above, we used simple 1D-histograms as a quick exploratory tool to get a rough idea how the classes are separated for each particular feature.
We can see that the petal dimensions (petal width and petal width) are much more useful features than the sepal dimensions, which have a large overlap between the 3 different Iris species.



Preparing a Test and Training dataset

In this step, we split our dataset into two subsets: A training dataset (60% of the samples) and a test dataset (40% of the samples from the original dataset).

In [10]:
from sklearn.cross_validation import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y,
    test_size=0.40, random_state=12345)



Cross-Validation and Pipelines

It is important to note that the test set should only be used once for evaluating the prediction error of a classifier after training it on the training dataset; repetitive evaluations of different models using the same test dataset would be prone to overfitting.
A common approach to compare and evaluate different pre-processing techniques and models is cross-validation. Here, we will use a variant called "k-fold cross-validation" in order to compare different preprocessing steps and demonstrate the usefulness of a Pipeline.

In k-fold cross-validation, the original training dataset is split into k different subsets (the so-called "folds") where 1 fold is retained for testing the classifier and the other k-1 folds are used for training.

E.g., if we'd choose k=4 (4 folds) the classifier will be trained on the remaining using 3 different training set subsets and evaluated on the 4th fold (the test fold). This procedure is then repeated 4 times until every fold has been used as the test set once so that we can eventually calculate the average error rate of our model from the error rate of every iteration, which gives us an idea of how well our model generalizes (a more detailed article about different cross validation methods is in preparation).



Custom Pipeline objects

Now, after we successfully prepared our example dataset, let us come back to the pipelines. Pipelines, as we have heard before, are extremely useful in a cross-validation task in order to compare different models.

And we can also write our own classes that can be used in such a pipeline. If we think back to the histograms of our Iris dataset, one interesting exploration could be to select the features "petal length" and "petal width" and compare a classifier trained on this subset against classifiers that were trained on the whole dataset, or classifiers that were trained on datasets after dimensionality reduction (e.g., Linear Discriminant Analysis or Principal Component Analysis).

Below, we will write a simple ColumnSelector that will return a subset of certain features (columns) from a NumPy array that represents our dataset.

The requirement for a valid Pipeline object is that it responds to the transform method with exception of the last object (the final estimator), which has to respond to the fit function.

In [11]:
class ColumnSelector(object):
    """ A feature selector for scikit-learn's Pipeline class that returns
        specified columns from a numpy array.
    
    """
    
    def __init__(self, cols):
        self.cols = cols
        
    def transform(self, X, y=None):
        return X[:, self.cols]

    def fit(self, X, y=None):
        return self

I thought that it was worthwhile to create a central module for such helper functions and classes (https://github.com/rasbt/mlxtend) from which they could not only be easily imported but also collected, documented, and optimized. I am happy about any suggestions, ideas, and contributions!



Running the cross-validation

For a simple example, let us do cross-validation to compare different preprocessing steps for training a naive Bayes classifier.

  • on the whole feature set
  • on a 2D feature subset consisting of the features "petal width" and "petal length"
  • on a 2D subset after dimensionality reduction via Principal Component Analysis
  • on a 2D subset after dimensionality reduction via Linear Discriminant Analysis
In [12]:
from sklearn.cross_validation import cross_val_score, KFold
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler
from sklearn.lda import LDA
from sklearn.decomposition import PCA

clf_all = Pipeline(steps=[
    ('scaler', StandardScaler()),           
    ('classifier', GaussianNB())   
    ])

clf_petal = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('reduce_dim', ColumnSelector(cols=(2,3))),    
    ('classifier', GaussianNB())   
    ]) 

clf_pca = Pipeline(steps=[
    ('scaler', StandardScaler()),    
    ('reduce_dim', PCA(n_components=2)),
    ('classifier', GaussianNB())   
    ])

clf_lda = Pipeline(steps=[
    ('scaler', StandardScaler()), 
    ('reduce_dim', LDA(n_components=2)),
    ('classifier', GaussianNB())   
    ])

# Constructing the k-fold cross validation iterator (k=5)  

cv = KFold(n=X_train.shape[0],  # total number of samples
           n_folds=5,           # number of folds the dataset is divided into
           random_state=12345)

scores = [
    cross_val_score(clf, X_train, y_train, cv=cv, scoring='accuracy')
            for clf in [clf_all, clf_petal, clf_pca, clf_lda]
    ]
In [13]:
for score,label in zip(scores, 
                       ['all attributes', 
                        'Petal dimensions (column 3 & 4)',
                        'PCA dim. red. (n=2)', 
                        'LDA dim. red. (n=2)', 
                        ]
                       ):
    print("Accuracy: {:.2%} (+/- {:.2%}), {:}".format(score.mean(), score.std(), label))
Accuracy: 94.44% (+/- 3.51%), all attributes
Accuracy: 94.44% (+/- 6.09%), Petal dimensions (column 3 & 4)
Accuracy: 85.56% (+/- 9.69%), PCA dim. red. (n=2)
Accuracy: 96.67% (+/- 4.44%), LDA dim. red. (n=2)



In general, the prediction accuracy is calculated as the number of correct classification divided by the number of total classification. In this case, the accuracies displayed above represent the averages from all 5 iterations in a cross-validation procedure.

Here, the accuracies differ only slightly. Interestingly, the prediction accuracies are the same for the whole dataset and our feature subset that consists of only the petal widths and lengths. This indicates that we might lose minimum information by removing the sepal measurements (but note that the standard error increased). LDA tends to outperform PCA in this case, which is not unexpected: In contrast to PCA, LDA tries to maximize class separability for a supervised dataset.



Training the Naive Bayes classifier

Conviniently, we can reuse the pipeline objects from our cross-validation above to train the classifier on the whole training dataset and evaluate its performance on the separate test dataset.
Since the LDA-approach for preprocessing seemed to have performed best in cross validation, let us have a look how it performs on the "real" test dataset.

In [14]:
from sklearn import metrics

clf_lda.fit(X_train, y_train)
pred_test = clf_lda.predict(X_test)

print('Prediction accuracy for the test dataset')
print('{:.2%}'.format(metrics.accuracy_score(y_test, pred_test)))

print('\nConfusion Matrix of the Naive Bayes classifier')
print(metrics.confusion_matrix(y_test, clf_lda.predict(X_test)))
Prediction accuracy for the test dataset
100.00%

Confusion Matrix of the Naive Bayes classifier
[[21  0  0]
 [ 0 22  0]
 [ 0  0 17]]