This is one of the 100 recipes of the IPython Cookbook, the definitive guide to high-performance scientific computing and data science in Python.

8.2. Predicting who will survive on the Titanic with logistic regression

This recipe is based on a Kaggle competition where the goal is to predict survival on the Titanic, based on real data. Kaggle hosts machine learning competitions where anyone can download a dataset, train a model, and test the predictions on the website. The author of the best model wins a prize. It is a fun way to get started with machine learning.

Here, we use this example to introduce logistic regression, a basic classifier. We also show how to perform a grid search with cross-validation.
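As a refresher, logistic regression models the probability of a binary outcome (here, survival) as the logistic function of a linear combination of the features. The following minimal sketch (an illustrative addition, not part of the original recipe) shows the logistic function itself:

In [ ]:
import numpy as np

def logistic(x):
    # The logistic (sigmoid) function maps any real number to (0, 1),
    # which can be interpreted as a probability.
    return 1. / (1. + np.exp(-x))

# The model predicts P(survived) as logistic(w0 + w1*Sex + w2*Age + w3*Pclass),
# where the weights w are learned from the training data.
logistic(np.array([-2., 0., 2.]))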

You need to download the Titanic dataset from the book's website (https://ipython-books.github.io).

  1. We import the standard libraries.
In [ ]:
import numpy as np
import pandas as pd
import sklearn
import sklearn.linear_model as lm
import sklearn.cross_validation as cv
import sklearn.grid_search as gs
import matplotlib.pyplot as plt
%matplotlib inline
  2. We load the train and test datasets with Pandas.
In [ ]:
train = pd.read_csv('data/titanic_train.csv')
test = pd.read_csv('data/titanic_test.csv')
In [ ]:
train[train.columns[[2,4,5,1]]].head()
  3. Let's keep only a few fields for this example. We also convert the sex field to a binary variable, so that it can be handled correctly by NumPy and scikit-learn. Finally, we remove the rows containing NaN values.
In [ ]:
data = train[['Sex', 'Age', 'Pclass', 'Survived']].copy()
data['Sex'] = data['Sex'] == 'female'
data = data.dropna()
  4. Now, we convert this DataFrame to a NumPy array, so that we can pass it to scikit-learn.
In [ ]:
data_np = data.astype(np.int32).values
X = data_np[:,:-1]
y = data_np[:,-1]
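A quick sanity check (an addition, not in the original recipe): X should contain one row per passenger and one column per feature (Sex, Age, Pclass), and y should contain one 0/1 survival label per passenger.

In [ ]:
# Shapes of the feature matrix and of the target vector.
X.shape, y.shape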
  5. Let's have a look at the survival of male and female passengers, as a function of their age.
In [ ]:
# We define a few boolean vectors.
female = X[:,0] == 1
survived = y == 1
# This vector contains the age of the passengers.
age = X[:,1]
# We compute a few histograms.
bins_ = np.arange(0, 81, 5)
S = {'male': np.histogram(age[survived & ~female], 
                          bins=bins_)[0],
     'female': np.histogram(age[survived & female], 
                            bins=bins_)[0]}
D = {'male': np.histogram(age[~survived & ~female], 
                          bins=bins_)[0],
     'female': np.histogram(age[~survived & female], 
                            bins=bins_)[0]}
In [ ]:
# We now plot the data.
bins = bins_[:-1]
plt.figure(figsize=(10,3));
for i, sex, color in zip((0, 1),
                         ('male', 'female'),
                         ('#3345d0', '#cc3dc0')):
    plt.subplot(121 + i);
    plt.bar(bins, S[sex], bottom=D[sex], color=color,
            width=5, label='survived');
    plt.bar(bins, D[sex], color='k', width=5, label='died');
    plt.xlim(0, 80);
    plt.grid(None);
    plt.title(sex + " survival");
    plt.xlabel("Age (years)");
    plt.legend();
  6. Let's try to train a LogisticRegression classifier. We first need to create a train and a test dataset.
In [ ]:
# We split X and y into train and test datasets.
(X_train, X_test, 
 y_train, y_test) = cv.train_test_split(X, y, test_size=.05)
In [ ]:
# We instantiate the classifier.
logreg = lm.LogisticRegression();
  7. Let's train the model and get the predicted values on the test set.
In [ ]:
logreg.fit(X_train, y_train)
y_predicted = logreg.predict(X_test)

The following figure shows the actual and predicted results.

In [ ]:
plt.figure(figsize=(8, 3));
plt.imshow(np.vstack((y_test, y_predicted)),
           interpolation='none', cmap='bone');
plt.xticks([]); plt.yticks([]);
plt.title(("Actual and predicted survival outcomes"
          " on the test set"));
  8. To get an estimate of the model's performance, we can use the cross_val_score() function, which computes the cross-validation score. By default, this function uses a 3-fold stratified cross-validation procedure, but this can be changed with the cv keyword argument.
In [ ]:
cv.cross_val_score(logreg, X, y)

This function returns a prediction score for each pair of train and test sets.
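For example (an illustrative addition, not part of the original code), we can request more folds with the cv keyword argument and summarize the scores by their mean and standard deviation:

In [ ]:
scores = cv.cross_val_score(logreg, X, y, cv=8)
scores.mean(), scores.std()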

  9. The LogisticRegression class accepts a C hyperparameter as argument. This parameter quantifies the regularization strength. To find a good value, we can perform a grid search with the GridSearchCV class. It takes as input an estimator and a dictionary of parameter values. This new estimator uses cross-validation to select the best parameter.
In [ ]:
grid = gs.GridSearchCV(logreg, {'C': np.logspace(-5, 5, 200)}, n_jobs=4)
grid.fit(X_train, y_train);
grid.best_params_

Here is the performance of the best estimator.

In [ ]:
cv.cross_val_score(grid.best_estimator_, X, y)
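To go further and produce a Kaggle submission, you would apply the same preprocessing to the test dataset and predict with the best estimator found by the grid search. Here is a rough sketch; it assumes the test file has the same Sex, Age, and Pclass columns, and for simplicity it simply drops passengers with missing ages (a real submission would need to fill them in instead):

In [ ]:
# Prepare the Kaggle test set with the same transformations
# as the training set.
test_data = test[['Sex', 'Age', 'Pclass']].copy()
test_data['Sex'] = test_data['Sex'] == 'female'
test_data = test_data.dropna()
X_kaggle = test_data.astype(np.int32).values
# Predicted survival outcomes for the remaining passengers.
grid.best_estimator_.predict(X_kaggle)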

You'll find all the explanations, figures, references, and much more in the book (to be released later this summer).

IPython Cookbook, by Cyrille Rossant, Packt Publishing, 2014 (500 pages).