Statistical Programming DC
10/23/14

<h1>Modeling with scikit-learn: common workflows<br>
for reproducible results</h1>

Repo: https://github.com/isms/spdc-lightning-talk-oct-2014

View notebook online (non-interactive): http://bit.ly/spdc23oct

Overview¶

Describe an example problem
Build a couple basic models
Common sklearn workflows:
- Evaluate performance
- Tools to improve the model (data transformations, parameter search, etc)
- Solidifying the experiments into a repeatable data pipeline
Using outside tools

Motivation¶

What we want our tools to be¶

What we have right now¶

The way ahead for R?¶

The `caret` package (short for Classification And REgression Training) is a set of functions that attempt to streamline the process for creating predictive models. [...] There are many different modeling functions in R. Some have different syntax for model training and/or prediction. The package started off as a way to provide a uniform interface the functions themselves, as well as a way to standardize common tasks (such parameter tuning and variable importance).

`scikit-learn`'s approach¶

Consistent API with a cohesive "grammar" of modeling actions:

Nouns
- Model classes
- Metrics classes
- Preprocessing classes
- Data (as numpy arrays)
- ...
Verbs
- fit
- transform
- fit_transform
- score
- predict
- ...

Example problem description:¶

We'll be using data on blood donations from the UC Irvine Machine Learning Repository. Given some data on some blood donors' previous donations, we'll be trying to predict a binary outcome of whether the donor gave blood in a certain time period.

Here's the official description:

To demonstrate the RFMTC marketing model (a modified version of RFM), this study adopted the donor database of Blood Transfusion Service Center in Hsin-Chu City in Taiwan. The center passes their blood transfusion service bus to one university in Hsin-Chu City to gather blood donated about every three months. To build a FRMTC model, we selected 748 donors at random from the donor database. These 748 donor data, each one included R (Recency - months since last donation), F (Frequency - total number of donation), M (Monetary - total blood donated in c.c.), T (Time - months since first donation), and a binary variable representing whether he/she donated blood in March 2007 (1 stand for donating blood; 0 stands for not donating blood).

Source: https://archive.ics.uci.edu/ml/datasets/Blood+Transfusion+Service+Center

Get the data¶

From beginning to end, we'll treat the raw data as immutable and base all of our analyses on views and transformations of this "ground truth" data set.

In [1]:

!wget -nc --directory-prefix data \
    https://archive.ics.uci.edu/ml/machine-learning-databases/blood-transfusion/transfusion.data

File ‘data/transfusion.data’ already there; not retrieving.

In [2]:

!head data/transfusion.data

Import the data as a `pandas` DataFrame¶

In [3]:

import numpy as np
import pandas as pd

df = pd.read_csv('data/transfusion.data')
df.head()

Out[3]:

	Recency (months)	Frequency (times)	Monetary (c.c. blood)	Time (months)	whether he/she donated blood in March 2007
0	2	50	12500	98	1
1	0	13	3250	28	1
2	1	16	4000	35	1
3	2	20	5000	45	1
4	1	24	6000	77	0

In [4]:

df.shape

Out[4]:

(748, 5)

In [5]:

df.dtypes

Out[5]:

Recency (months)                              int64
Frequency (times)                             int64
Monetary (c.c. blood)                         int64
Time (months)                                 int64
whether he/she donated blood in March 2007    int64
dtype: object

In [6]:

# save the current names in case we need them later
original_column_names = df.columns

# make the names less ugly
names = ['recency', 'frequency', 'cc', 'time', 'donated']
df.columns = names

df.head()

Out[6]:

	recency	frequency	cc	time	donated
0	2	50	12500	98	1
1	0	13	3250	28	1
2	1	16	4000	35	1
3	2	20	5000	45	1
4	1	24	6000	77	0

Some exploratory visualization¶

In [7]:

# import our graphics tools
%matplotlib inline
import matplotlib as mpl
from matplotlib import pyplot as plt
import seaborn as sns  # nice defaults for matplotlib styles
set2 = sns.color_palette('Set2')

# add on some settings from 'Bayesian Methods for Hackers'
plt.style.use('bmh')

# set larger default fonts for presentation-friendliness
mpl.rc('figure', figsize=(10, 8))
mpl.rc('axes', labelsize=16, titlesize=20)

In [8]:

from pandas.tools.plotting import scatter_matrix

axeslist = scatter_matrix(df, alpha=0.8, figsize=(10, 10))
for ax in axeslist.flatten():
    ax.grid(False)

In [9]:

import numpy as np

# create a figure with 4 subplots
fig, axs = plt.subplots(nrows=2, ncols=2)

feature_column_names = df.columns[:-1]
label_column_name = df.columns[-1]

for i, col in enumerate(feature_column_names):
    
    # get the current subplot to work on
    ax = axs.ravel()[i]
    
    # create some random y jitter to add
    jitter = np.random.uniform(low=-0.05, high=0.05, size=len(df))
    
    # plot the data
    ax.scatter(x=df[col], y=df[label_column_name] + jitter,
               c=df.donated, cmap='coolwarm', alpha=0.5)
    
    # label the axes
    ax.set_xlabel(col)
    ax.set_ylabel(label_column_name)
    
plt.tight_layout()
plt.show()

Common tasks with `scikit-learn`:¶

Split the data into train and test¶

In [10]:

from sklearn.cross_validation import train_test_split

# using conventional sklearn variable names
X = df[feature_column_names].astype(float)
y = df.donated.ravel()

# break up the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)

In [11]:

X_train.shape

Out[11]:

(374, 4)

In [12]:

y_train.shape

Out[12]:

(374,)

In [13]:

X_test.shape

Out[13]:

(374, 4)

In [14]:

y_test.shape

Out[14]:

(374,)

Fit a simple model using the training data¶

The basic modeling workflow in sklearn involves instantiating a model object with the desired settings, and then fitting it to given data.

Decision tree¶

In [15]:

from sklearn.tree import DecisionTreeClassifier

clf_tree = DecisionTreeClassifier(max_depth=3)
clf_tree.fit(X_train, y_train)

Out[15]:

DecisionTreeClassifier(compute_importances=None, criterion='gini',
            max_depth=3, max_features=None, max_leaf_nodes=None,
            min_density=None, min_samples_leaf=1, min_samples_split=2,
            random_state=None, splitter='best')

In [16]:

print 'Score:', clf_tree.score(X_test, y_test)

Score: 0.794117647059

In [17]:

# viz adapted from http://scikit-learn.org/stable/modules/tree.html
import pydot
from sklearn.externals.six import StringIO
from sklearn.tree import export_graphviz

dot_data = StringIO() 
export_graphviz(clf_tree, out_file=dot_data) 
graph = pydot.graph_from_dot_data(dot_data.getvalue()) 
graph.write_png('output/decision_tree.png')

Out[17]:

True

Logistic regression¶

In [18]:

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(penalty='l2', fit_intercept=True)
clf.fit(X_train, y_train)

Out[18]:

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, penalty='l2', random_state=None, tol=0.0001)

In [19]:

print 'Score:', clf.score(X_test, y_test)

Score: 0.772727272727

Trying a bunch of different models¶

sklearn tries to use the same syntax for most modeling tasks, so it's pretty easy to plug and play.

Note: in this example we're not putting much thought into which model to use and we're sticking with the default model parameters. The point is not that we would *want* to do this. The point is that all of these very different models have such similar structure and conventions that we can iterate through a bunch of them, fit, and evaluate them while changing none of the syntax.

In [20]:

from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

other_clfs = {
    'Random forest': RandomForestClassifier(),
    'AdaBoost': AdaBoostClassifier(),
    'Naive Bayes': MultinomialNB(),
    'Linear SVC': LinearSVC(),
    'KNN': KNeighborsClassifier(5),
}

# iterating through all of these models we want to fit ...
for name, other_clf in other_clfs.iteritems():
    
    # fit the model with the training data
    print other_clf.fit(X_train, y_train)
    
    # cross validation score
    print '---\nScore:', other_clf.score(X_test, y_test)
    print '\n'

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_neighbors=5, p=2, weights='uniform')
---
Score: 0.756684491979


LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='l2', multi_class='ovr', penalty='l2',
     random_state=None, tol=0.0001, verbose=0)
---
Score: 0.299465240642


MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
---
Score: 0.729946524064


RandomForestClassifier(bootstrap=True, compute_importances=None,
            criterion='gini', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_density=None, min_samples_leaf=1,
            min_samples_split=2, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0)
---
Score: 0.756684491979


AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=50, random_state=None)
---
Score: 0.775401069519

Anatomy of a `scikit-learn` model¶

The object itself:

In [21]:

clf

Out[21]:

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, penalty='l2', random_state=None, tol=0.0001)

Coefficients (if fit):

In [22]:

clf.coef_

Out[22]:

array([[ -9.74764830e-02,   1.49194232e-06,   3.72985587e-04,
         -1.17108498e-02]])

Intercept (if fit):

In [23]:

clf.intercept_

Out[23]:

array([-0.53074433])

Model parameters:

In [24]:

clf.get_params()

Out[24]:

{'C': 1.0,
 'class_weight': None,
 'dual': False,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'penalty': 'l2',
 'random_state': None,
 'tol': 0.0001}

Making predictions for $\hat y$:

In [25]:

clf.predict(X_test)

Out[25]:

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0])

Predicting probabilities, $p(y_i = 1)$

In [26]:

pd.DataFrame(clf.predict_proba(X_test))\
    .head(10)

Out[26]:

	0	1
0	0.691294	0.308706
1	0.952584	0.047416
2	0.726055	0.273945
3	0.770998	0.229002
4	0.680134	0.319866
5	0.705628	0.294372
6	0.637364	0.362636
7	0.888941	0.111059
8	0.559386	0.440614
9	0.936608	0.063392

Challenge: evaluating model performance¶

The default metric for logistic regression is mean accuracy:

$$\mathrm{accuracy}(y, \hat y) = \frac{1}{n} \sum_{i=1}^n \mathbb{I}(\hat y_i = y_i)$$

In [27]:

clf.score(X_test, y_test)

Out[27]:

0.77272727272727271

But what if we want a more holistic view of how well this model works?

Common solution: use cross validation tools and other metrics¶

Cross validation¶

Our data set here is fairly small $(N=748)$, so what if we happened to randomly pick a non-representative chunk for testing?

Instead of summarizing the performance just on the test data we want to use $k$-fold cross-validation, splitting the data randomly into chunks — "folds" — and evaluating on each to get an idea of the variation in performance.

In [28]:

from sklearn import cross_validation

# come up with random folds of the data
kf = cross_validation.KFold(len(X), n_folds=5, shuffle=True)

def plot_scores(scores):
    N = len(scores)
    plt.bar(np.arange(1, N + 1) - 0.4, scores, color=set2[2])
    plt.title('{}-fold cross-validation scores'.format(N), fontsize=18)
    plt.xlabel('fold', fontsize=14)
    plt.ylabel('score', fontsize=14)
    plt.xlim(0.5, N + 0.5)
    plt.ylim(0, 1)
    plt.show()

# evaluate the fitted model on each fold in turn, returns a score for each fold
scores = cross_validation.cross_val_score(clf, X, y, cv=kf, n_jobs=1)

print 'scores:', scores
print 'average score:', np.mean(scores)
plot_scores(scores)

scores: [ 0.78        0.76666667  0.81333333  0.75167785  0.73154362]
average score: 0.768644295302

Other metrics¶

We can try all sorts of other metrics, though, such as log loss:

$$\textrm{LogLoss}(y, \hat y) = - \frac{1}{n} \sum_{i=1}^n \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i)\right]$$

In [29]:

from sklearn.metrics import log_loss

log_loss(y_test, clf.predict_proba(X_test))

Out[29]:

0.45710527540996404

Or the F1 score:

$$\textrm{F1} = 2 \frac{\textrm{precision} * \textrm{recall}}{\textrm{precision} + \textrm{recall}}$$

In [30]:

from sklearn.metrics import f1_score

f1_score(y_test, clf.predict(X_test))

Out[30]:

0.12371134020618556

And if we want more than a single quantity summarizing score, sklearn comes with all sorts of helpful tools like confusion_matrix:

In [31]:

from itertools import permutations
from sklearn.metrics import confusion_matrix

# get the raw confusion matrix
cm = confusion_matrix(y_test, clf.predict(X_test))

# create a dataframe
cmdf = pd.DataFrame(cm)
cmdf.columns = map(lambda x: 'pred {}'.format(x), cmdf.columns)
cmdf.index = map(lambda x: 'actual {}'.format(x), cmdf.index)
cmdf

Out[31]:

	pred 0	pred 1
actual 0	283	5
actual 1	80	6

there are many available ...

metrics.accuracy_score, metrics.auc, metrics.average_precision_score, metrics.classification_report, metrics.confusion_matrix, metrics.f1_score, metrics.fbeta_score, metrics.hamming_loss, metrics.hinge_loss, metrics.jaccard_similarity_score, metrics.log_loss, metrics.matthews_corrcoef, metrics.precision_recall_curve, metrics.precision_recall_fscore_support, metrics.precision_score, metrics.recall_score, metrics.roc_auc_score, metrics.roc_curve, metrics.zero_one_loss, metrics.explained_variance_score, metrics.mean_absolute_error, metrics.mean_squared_error, metrics.r2_score, metrics.adjusted_mutual_info_score, metrics.adjusted_rand_score, metrics.completeness_score, metrics.homogeneity_completeness_v_measure, metrics.homogeneity_score, metrics.mutual_info_score, metrics.normalized_mutual_info_score, metrics.silhouette_score, metrics.silhouette_samples, metrics.v_measure_score, metrics.consensus_score, metrics.pairwise.additive_chi2_kernel, metrics.pairwise.chi2_kernel, metrics.pairwise.distance_metrics, metrics.pairwise.euclidean_distances, metrics.pairwise.kernel_metrics, metrics.pairwise.linear_kernel, metrics.pairwise.manhattan_distances, metrics.pairwise.pairwise_distances, metrics.pairwise.pairwise_kernels, metrics.pairwise.polynomial_kernel, metrics.pairwise.rbf_kernel, metrics.pairwise_distances, metrics.pairwise_distances_argmin, metrics.pairwise_distances_argmin_min

... and they all use the same basic syntax.

Refining the model and preparing the data¶

Challenge: finding reasonable model parameters¶

The regularization constant, $C$, is unknown. Let's say we want to try some different orders of magnitude as well experimenting with L1 regularization (default for sklearn logistic regression is L2):

$$C \in \{0.01, 0.1, 1, 10, 100\}$$$$\textrm{penalty} \in \{\textrm{L1}, \textrm{L2}\}$$

We know that the outcome can be different depending on what settings we choose.

In [32]:

from IPython.html.widgets import interact

from sklearn.metrics import roc_curve, auc

def plot_roc_curve(y_test, probas):
    # Compute ROC curve and area the curve
    fpr, tpr, thresholds = roc_curve(y_test, probas[:, 1])
    roc_auc = auc(fpr, tpr)
    
    # Plot ROC curve
    plt.clf()
    plt.plot(fpr, tpr, label='ROC curve (AUC = %0.2f)' % roc_auc)
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.0])
    plt.xlabel('False Positive Rate', fontsize=14)
    plt.ylabel('True Positive Rate', fontsize=14)
    plt.legend(loc="lower right", fontsize=20)
    plt.show()

def fit_model(penalty, C):
    clf = LogisticRegression(penalty=penalty, C=C)
    clf.fit(X_train, y_train)
    plot_roc_curve(y_test, clf._predict_proba_lr(X_test))
    
interact(fit_model, penalty=('l2','l1'), C=(0.01, 100, 1))

Out[32]:

<function __main__.fit_model>

Common solution: grid search¶

From sklearn, we get built in grid search for exploring the space. Using grid search cross validation, we can try all combinations, and automatically keep the most performant.

In [33]:

from sklearn.grid_search import GridSearchCV

params_to_try = {
    'C': [0.01, 0.1, 1, 10, 100, 100],
    'penalty': ['l1', 'l2']
}

gs = GridSearchCV(clf, param_grid=params_to_try, cv=5)
gs.fit(X, y)

Out[33]:

GridSearchCV(cv=5,
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, penalty='l2', random_state=None, tol=0.0001),
       fit_params={}, iid=True, loss_func=None, n_jobs=1,
       param_grid={'penalty': ['l1', 'l2'], 'C': [0.01, 0.1, 1, 10, 100, 100]},
       pre_dispatch='2*n_jobs', refit=True, score_func=None, scoring=None,
       verbose=0)

In [34]:

print "Best parameters:", gs.best_params_
print "Best score:", gs.best_score_

Best parameters: {'penalty': 'l1', 'C': 0.01}
Best score: 0.791443850267

Challenge: data transformations often needed¶

One thing we might want to do is scale all of the data before fitting any models. As the sklearn docs point out:

Standardization of datasets is a common requirement for many machine learning estimators implemented in the scikit: they might behave badly if the individual feature do not more or less look like standard normally distributed data: Gaussian with zero mean and unit variance.

Common solution: use plug-and-play data prep tools¶

It's painful and error prone to wing it every time we want to do common transformations. The sklearn toolbox comes with many useful tools for doing this in a consistent way.

Scaling¶

In [35]:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_train_standardized = scaler.fit_transform(X_train.astype(np.float))
X_train_standardized

Out[35]:

array([[ 1.15376999,  0.35655147,  0.35655147,  2.38280942],
       [-0.56644764,  6.41792654,  6.41792654,  2.50245107],
       [ 0.22749896, -0.60050775, -0.60050775, -0.9671566 ],
       ..., 
       [-0.03714991, -0.12197814, -0.12197814, -0.64811222],
       [ 0.62447226, -0.44099788, -0.44099788, -0.01002345],
       [-0.83109651,  0.03753173,  0.03753173, -0.56835112]])

In [36]:

print 'column means:', np.round(X_train_standardized.mean(axis=0))
print 'column variances:', np.round(X_train_standardized.var(axis=0))

column means: [-0.  0.  0.  0.]
column variances: [ 1.  1.  1.  1.]

Now that the scaler is fit on the training data, we can use it to transform new data:

In [37]:

X_new = np.array([25., 35., 9200., 90.])

scaler.transform(X_new)

Out[37]:

array([ 2.08004102,  4.66331797,  4.95043573,  2.18340669])

Dimensionality reduction¶

We could also reduce the dimensionality with PCA and whiten the data:

In [38]:

from sklearn.decomposition import PCA

# instantiate the PCA transformation object
pca = PCA(n_components=2, whiten=True)

# fit the PCA object on and transform the training data
X_train_pca = pca.fit_transform(X_train_standardized)

# create a 3d figure
fig = plt.figure()
ax = fig.add_subplot(111)

# scatterplot the PCA points
ax.scatter(*np.hsplit(X_train_pca, 2), c=y_train, s=40,
           cmap='coolwarm')

# annotate and show the figure
ax.set_xlabel('component 1')
ax.set_ylabel('component 2')

plt.show()

Challenge: all the different steps add up into a big jumbled mess¶

We have all seen the data science version of this:

But there's no reason for data_aug_13_v2_scaled_pca_3_dims_WHITENED_plus_some_tweaks_final_V4_FINAL.csv, right?

We shouldn't have to do that.

Common solution: putting all the steps together with a `Pipeline`¶

One big takeaway: it's all about documented, repeatable data-to-model pipelines.

At some point, especially for tasks we will be doing more than once, we might want to codify all of the exploratory work we've done. The standard way to express a beginning-to-end data transformation and model fitting workflow in sklearn is the Pipeline.

In [39]:

from sklearn.pipeline import Pipeline

# define a pipeline with some transforms and a simple classifier
pipeline = Pipeline([
    ('scale', StandardScaler()),
    ('reduce_dim', PCA()),
    ('clf', LogisticRegression()),
])

# enumerate all of the different settings we wish to try out
parameters = {
    'reduce_dim__n_components': (1, 2, 3),
    'reduce_dim__whiten': (True, False),
    'clf__penalty': ('l1', 'l2'),
    'clf__C': (1e-3, 1e-2, 1, 1e1, 1e2, 1e3),
}

# grid search the parameter space
grid_search = GridSearchCV(pipeline, parameters, n_jobs=1, verbose=1)

print("Performing grid search...\n")
print("pipeline:", [name for name, _ in pipeline.steps])
print("parameters:")
print(parameters)
grid_search.fit(X, y)

print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters set:")
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))

[Parallel(n_jobs=1)]: Done   1 jobs       | elapsed:    0.0s
[Parallel(n_jobs=1)]: Done  50 jobs       | elapsed:    0.2s
[Parallel(n_jobs=1)]: Done 200 jobs       | elapsed:    0.7s
[Parallel(n_jobs=1)]: Done 216 out of 216 | elapsed:    0.8s finished

Performing grid search...

('pipeline:', ['scale', 'reduce_dim', 'clf'])
parameters:
{'clf__penalty': ('l1', 'l2'), 'clf__C': (0.001, 0.01, 1, 10.0, 100.0, 1000.0), 'reduce_dim__whiten': (True, False), 'reduce_dim__n_components': (1, 2, 3)}
Fitting 3 folds for each of 72 candidates, totalling 216 fits
Best score: 0.777
Best parameters set:
	clf__C: 0.01
	clf__penalty: 'l2'
	reduce_dim__n_components: 2
	reduce_dim__whiten: True

Takeaways for using `sklearn` in the real world¶

API consistency is the "killer feature" of sklearn
Treat the original data as immutable, everything later is a view into or a transformation of the original and all the steps are obvious
To that end, building transparent and repeatable dataflows using Pipelines cuts down on black magic

Public service announcement: using other tools¶

Why choose? Use whichever one is best for the task at hand.

In [40]:

# load the %R cell magic extension
%load_ext rmagic

# send the dataframe over to the R instance
%Rpush df

In [41]:

%%R
library(ggplot2)
qplot(log(time), log(cc), data=df, color=donated)

In [42]:

%%R
blood.glm <- glm(donated ~ log(cc) + log(time), data=df, family="binomial")
print(summary(blood.glm))

par(mfrow=c(2, 2))
plot(blood.glm)

Call:
glm(formula = donated ~ log(cc) + log(time), family = "binomial", 
    data = df)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.5214  -0.7665  -0.5352  -0.2782   2.5447  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -7.9523     0.8749  -9.089  < 2e-16 ***
log(cc)       1.4448     0.1693   8.535  < 2e-16 ***
log(time)    -1.0278     0.1454  -7.069 1.56e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 820.89  on 747  degrees of freedom
Residual deviance: 728.63  on 745  degrees of freedom
AIC: 734.63

Number of Fisher Scoring iterations: 5

... and we can pull data back from R to Python:

In [43]:

r_coeffs = %R coef(blood.glm)
r_coeffs

Out[43]:

array([-7.95228363,  1.44481591, -1.02784654])

Even better for those in the Julia community ... you can replicate this whole experience when working in an IJulia notebook:

In fact - IPython notebooks are not just for Python anymore. The core functionality has been generalized as Project Jupyter, and language kernels are available or under active development for Julia, Haskell, and R.

![](img/drivendata-logo.png)

[www.drivendata.org](http://www.drivendata.org)

Currently in beta — launching our first for-prize competitions in the next few weeks.

Twitter: @drivendataorg
E-mail: isaac@drivendata.org

Repo: https://github.com/isms/spdc-lightning-talk-oct-2014 View notebook online (non-interactive): http://bit.ly/spdc23oct