This is one of the 100 recipes of the IPython Cookbook, the definitive guide to high-performance scientific computing and data science in Python.
This recipe is based on a Kaggle competition where the goal is to predict survival on the Titanic, based on real data. Kaggle hosts machine learning competitions where anyone can download a dataset, train a model, and test the predictions on the website. The author of the best model wins a price. It is a fun way to get started with machine learning.
Here, we use this example to introduce logistic regression, a basic classifier. We also show how to perform a grid search with cross-validation.
You need to download the Titanic dataset on the book's website (https://ipython-books.github.io).
import numpy as np
import pandas as pd
import sklearn
import sklearn.linear_model as lm
import sklearn.cross_validation as cv
import sklearn.grid_search as gs
import matplotlib.pyplot as plt
%matplotlib inline
train = pd.read_csv('data/titanic_train.csv')
test = pd.read_csv('data/titanic_test.csv')
train[train.columns[[2,4,5,1]]].head()
sex
field to a binary variable, so that it can be handled correctly by NumPy and scikit-learn. Finally, we remove the rows containing NaN
values.data = train[['Sex', 'Age', 'Pclass', 'Survived']].copy()
data['Sex'] = data['Sex'] == 'female'
data = data.dropna()
DataFrame
to a NumPy array, so that we can pass it to scikit-learn.data_np = data.astype(np.int32).values
X = data_np[:,:-1]
y = data_np[:,-1]
# We define a few boolean vectors.
female = X[:,0] == 1
survived = y == 1
# This vector contains the age of the passengers.
age = X[:,1]
# We compute a few histograms.
bins_ = np.arange(0, 81, 5)
S = {'male': np.histogram(age[survived & ~female],
bins=bins_)[0],
'female': np.histogram(age[survived & female],
bins=bins_)[0]}
D = {'male': np.histogram(age[~survived & ~female],
bins=bins_)[0],
'female': np.histogram(age[~survived & female],
bins=bins_)[0]}
# We now plot the data.
bins = bins_[:-1]
plt.figure(figsize=(10,3));
for i, sex, color in zip((0, 1),
('male', 'female'),
('#3345d0', '#cc3dc0')):
plt.subplot(121 + i);
plt.bar(bins, S[sex], bottom=D[sex], color=color,
width=5, label='survived');
plt.bar(bins, D[sex], color='k', width=5, label='died');
plt.xlim(0, 80);
plt.grid(None);
plt.title(sex + " survival");
plt.xlabel("Age (years)");
plt.legend();
LogisticRegression
classifier. We first need to create a train and a test dataset.# We split X and y into train and test datasets.
(X_train, X_test,
y_train, y_test) = cv.train_test_split(X, y, test_size=.05)
# We instanciate the classifier.
logreg = lm.LogisticRegression();
logreg.fit(X_train, y_train)
y_predicted = logreg.predict(X_test)
The following figure shows the actual and predicted results.
plt.figure(figsize=(8, 3));
plt.imshow(np.vstack((y_test, y_predicted)),
interpolation='none', cmap='bone');
plt.xticks([]); plt.yticks([]);
plt.title(("Actual and predicted survival outcomes"
" on the test set"));
cross_val_score
that computes the cross-validation score. This function uses by default a 3-fold stratified cross-validation procedure, but this can be changed with the cv
keyword argument.cv.cross_val_score(logreg, X, y)
This function returns, for each pair of train and test set, a prediction score.
LogisticRegression
class accepts a C
hyperparameter as argument. This parameter quantifies the regularization strength. To find a good value, we can perform a grid search with the GridSearchCV
class. It takes as input an estimator, and a dictionary of parameter values. This new estimator uses cross-validation to select the best parameter.grid = gs.GridSearchCV(logreg, {'C': np.logspace(-5, 5, 200)}, n_jobs=4)
grid.fit(X_train, y_train);
grid.best_params_
Here is the performance of the best estimator.
cv.cross_val_score(grid.best_estimator_, X, y)
You'll find all the explanations, figures, references, and much more in the book (to be released later this summer).
IPython Cookbook, by Cyrille Rossant, Packt Publishing, 2014 (500 pages).