This is one of the 100 recipes of the IPython Cookbook, the definitive guide to high-performance scientific computing and data science in Python.

This recipe is based on a Kaggle competition where the goal is to predict survival on the Titanic, based on real data. Kaggle hosts machine learning competitions where anyone can download a dataset, train a model, and test the predictions on the website. The author of the best model wins a prize. It is a fun way to get started with machine learning.

Here, we use this example to introduce logistic regression, a basic classifier. We also show how to perform a grid search with cross-validation.
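As a rough illustration of the idea behind logistic regression, the classifier turns a weighted sum of the features into a probability with the logistic (sigmoid) function. The sketch below uses only the standard library; the weights, bias, and feature ordering are hypothetical values chosen for illustration, not coefficients fitted to the Titanic data.

```python
import math

def sigmoid(z):
    """Map a real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of the positive class for one sample."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)

# Hypothetical weights for (sex, age, class); NOT fitted to any data.
weights = [2.5, -0.03, -1.0]
bias = 1.5

# Features: sex (1 = female), age in years, passenger class.
p_woman = predict_proba([1, 30, 1], weights, bias)
p_man = predict_proba([0, 30, 3], weights, bias)
print(p_woman, p_man)
```

Training a logistic regression amounts to finding the weights and bias that make these probabilities match the observed labels as well as possible.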

You need to download the Titanic dataset from the book's website (https://ipython-books.github.io).

- We import the standard libraries.

In [ ]:

```
import numpy as np
import pandas as pd
import sklearn
import sklearn.linear_model as lm
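# sklearn.cross_validation and sklearn.grid_search were removed in
# scikit-learn 0.20; their contents now live in sklearn.model_selection.
import sklearn.model_selection as cv
import sklearn.model_selection as gs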
import matplotlib.pyplot as plt
%matplotlib inline
```

- We load the train and test datasets with Pandas.

In [ ]:

```
train = pd.read_csv('data/titanic_train.csv')
test = pd.read_csv('data/titanic_test.csv')
```

In [ ]:

```
train[train.columns[[2,4,5,1]]].head()
```

- Let's keep only a few fields for this example. We also convert the `Sex` field to a binary variable, so that it can be handled correctly by NumPy and scikit-learn. Finally, we remove the rows containing `NaN` values.

In [ ]:

```
data = train[['Sex', 'Age', 'Pclass', 'Survived']].copy()
data['Sex'] = data['Sex'] == 'female'
data = data.dropna()
```
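To see what these three steps do, here is a small sketch on a hand-made table. The rows below are hypothetical and only mimic the column names used in this recipe.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; only the column names match the recipe.
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male'],
    'Age': [22.0, 38.0, np.nan, 35.0],
    'Pclass': [3, 1, 3, 1],
    'Survived': [0, 1, 1, 0],
})

toy_clean = toy.copy()
# The comparison yields a boolean column: True for female passengers.
toy_clean['Sex'] = toy_clean['Sex'] == 'female'
# dropna() discards every row that has at least one missing value.
toy_clean = toy_clean.dropna()

n_dropped = len(toy) - len(toy_clean)
print(toy_clean)
print("rows dropped:", n_dropped)
```

Dropping rows is the simplest way to handle missing ages, at the cost of discarding the other fields of those passengers.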

- Now, we convert this `DataFrame` to a NumPy array, so that we can pass it to scikit-learn.

In [ ]:

```
data_np = data.astype(np.int32).values
X = data_np[:,:-1]
y = data_np[:,-1]
```
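One detail worth keeping in mind: casting to `np.int32` truncates any fractional value toward zero and turns booleans into 0 and 1. The short sketch below shows the effect on a few made-up ages; casting to a floating-point type instead would be one way to keep the fractional part, if that matters for the analysis.

```python
import numpy as np

# Made-up ages, including fractional ones.
ages = np.array([0.42, 14.5, 28.0, 70.5])
flags = np.array([True, False, True])

# Integer casting drops the fractional part.
truncated = ages.astype(np.int32)
# Booleans become 1 (True) and 0 (False).
as_int = flags.astype(np.int32)

print(truncated)
print(as_int)
```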

- Let's have a look at the survival of male and female passengers, as a function of their age.

In [ ]:

```
# We define a few boolean vectors.
female = X[:,0] == 1
survived = y == 1
# This vector contains the age of the passengers.
age = X[:,1]
# We compute a few histograms.
bins_ = np.arange(0, 81, 5)
S = {'male': np.histogram(age[survived & ~female],
                          bins=bins_)[0],
     'female': np.histogram(age[survived & female],
                            bins=bins_)[0]}
D = {'male': np.histogram(age[~survived & ~female],
                          bins=bins_)[0],
     'female': np.histogram(age[~survived & female],
                            bins=bins_)[0]}
```
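As a reminder of what `np.histogram` returns, here is a minimal sketch on a handful of made-up ages with the same bin edges as above. The function returns a pair (counts, edges); each bin is half-open on the right, except the last one, which also includes its right edge.

```python
import numpy as np

# Made-up ages for illustration.
ages = np.array([2, 4, 5, 17, 22, 22, 38, 79, 80])
edges = np.arange(0, 81, 5)  # 0, 5, ..., 80: 17 edges, 16 bins

counts, returned_edges = np.histogram(ages, bins=edges)

print(counts)
# The value 5 falls in the second bin [5, 10), while 80 falls in
# the last bin [75, 80] because that bin is closed on the right.
```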

In [ ]:

```
# We now plot the data.
bins = bins_[:-1]
plt.figure(figsize=(10,3));
for i, sex, color in zip((0, 1),
                         ('male', 'female'),
                         ('#3345d0', '#cc3dc0')):
    plt.subplot(121 + i);
    # align='edge' makes each bar start at its bin's left edge.
    plt.bar(bins, S[sex], bottom=D[sex], color=color,
            width=5, align='edge', label='survived');
    plt.bar(bins, D[sex], color='k', width=5, align='edge',
            label='died');
    plt.xlim(0, 80);
    plt.grid(None);
    plt.title(sex + " survival");
    plt.xlabel("Age (years)");
    plt.legend();
```

- Let's try to train a `LogisticRegression` classifier. We first need to create a train and a test dataset.

In [ ]:

```
# We split X and y into train and test datasets.
(X_train, X_test,
y_train, y_test) = cv.train_test_split(X, y, test_size=.05)
```
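The split above is random, so it changes on every run. As a sketch of how one might make it reproducible, the example below passes `random_state` and also uses `stratify` so that both subsets keep the same class proportions. The data is synthetic and the 20% test size is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 3 integer features, 40% positives.
rng = np.random.RandomState(0)
X_demo = rng.randint(0, 80, size=(100, 3))
y_demo = np.array([0] * 60 + [1] * 40)

# random_state fixes the shuffle; stratify preserves class ratios.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo)

print(X_tr.shape, X_te.shape)
print("positives in test set:", y_te.sum())
```

A larger test set gives a less noisy estimate of the performance, at the cost of fewer samples for training.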

In [ ]:

```
# We instantiate the classifier.
logreg = lm.LogisticRegression();
```

- Let's train the model and get the predicted values on the test set.

In [ ]:

```
logreg.fit(X_train, y_train)
y_predicted = logreg.predict(X_test)
```

The following figure shows the actual and predicted results.

In [ ]:

```
plt.figure(figsize=(8, 3));
plt.imshow(np.vstack((y_test, y_predicted)),
           interpolation='none', cmap='bone');
plt.xticks([]); plt.yticks([]);
plt.title(("Actual and predicted survival outcomes"
           " on the test set"));
```
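A figure like this gives a visual impression, but it can help to summarize the comparison with a few numbers. The sketch below computes the accuracy and the four confusion counts with plain NumPy. The two label vectors are made up for illustration; with the recipe's variables, one would use `y_test` and `y_predicted` instead.

```python
import numpy as np

# Made-up actual and predicted labels (1 = survived).
y_true_demo = np.array([0, 1, 1, 0, 1, 0, 0, 1])
y_pred_demo = np.array([0, 1, 0, 0, 1, 1, 0, 1])

# Fraction of correct predictions.
accuracy = np.mean(y_true_demo == y_pred_demo)

# Confusion counts.
true_pos = np.sum((y_true_demo == 1) & (y_pred_demo == 1))
true_neg = np.sum((y_true_demo == 0) & (y_pred_demo == 0))
false_pos = np.sum((y_true_demo == 0) & (y_pred_demo == 1))
false_neg = np.sum((y_true_demo == 1) & (y_pred_demo == 0))

print(accuracy, true_pos, true_neg, false_pos, false_neg)
```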

- To get an estimate of the model's performance, we can use the `cross_val_score` function, which computes cross-validation scores. For a classifier, it uses stratified k-fold cross-validation by default (3 folds in older scikit-learn releases, 5 folds since version 0.22); this can be changed with the `cv` keyword argument.

In [ ]:

```
cv.cross_val_score(logreg, X, y)
```

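This function returns one score per train/test split: the score (by default, the mean accuracy for a classifier) of the model trained on the training folds and evaluated on the held-out fold.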
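To make the procedure more concrete, here is a sketch of roughly what happens behind the scenes, written as an explicit loop over stratified folds and compared with `cross_val_score`. The data is synthetic, and the 3-fold setting is chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic data: the label depends mostly on the first feature.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(120, 3))
noise = 0.5 * rng.normal(size=120)
y_demo = (X_demo[:, 0] + noise > 0).astype(int)

folds = StratifiedKFold(n_splits=3)

# Manual loop: fit on the training folds, score on the held-out fold.
manual_scores = []
for train_idx, test_idx in folds.split(X_demo, y_demo):
    model = LogisticRegression()
    model.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(model.score(X_demo[test_idx], y_demo[test_idx]))

# The library function with the same folds.
library_scores = cross_val_score(LogisticRegression(), X_demo, y_demo,
                                 cv=folds)

print(manual_scores)
print(library_scores)
```

Averaging the scores gives a single summary number, and their spread gives a rough idea of how much the estimate depends on the particular split.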

- The `LogisticRegression` class accepts a `C` hyperparameter as argument. This parameter is the inverse of the regularization strength: smaller values mean stronger regularization. To find a good value, we can perform a grid search with the `GridSearchCV` class. It takes as input an estimator and a dictionary of parameter values. This new estimator uses cross-validation to select the best parameter.

In [ ]:

```
grid = gs.GridSearchCV(logreg, {'C': np.logspace(-5, 5, 200)}, n_jobs=4)
grid.fit(X_train, y_train);
grid.best_params_
```

Here is the performance of the best estimator.

In [ ]:

```
cv.cross_val_score(grid.best_estimator_, X, y)
```
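Beyond the best parameter, a fitted `GridSearchCV` object also keeps the cross-validated score of every candidate, which helps judge how sensitive the model is to `C`. The sketch below runs a small search on synthetic data; the grid of 7 values and the 3-fold setting are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data: the label depends mostly on the first feature.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(150, 3))
noise = 0.5 * rng.normal(size=150)
y_demo = (X_demo[:, 0] + noise > 0).astype(int)

param_grid = {'C': np.logspace(-3, 3, 7)}
search = GridSearchCV(LogisticRegression(), param_grid, cv=3)
search.fit(X_demo, y_demo)

best_C = search.best_params_['C']
# Mean cross-validated score of each candidate value of C.
mean_scores = search.cv_results_['mean_test_score']

for C, score in zip(param_grid['C'], mean_scores):
    print("C = %g: mean score = %.3f" % (C, score))
print("best C:", best_C, "best score:", search.best_score_)
```

Note that a score computed on data that was also used to select `C` tends to be optimistic; scoring on samples held out from the search gives a more honest estimate.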

You'll find all the explanations, figures, references, and much more in the book (to be released later this summer).

IPython Cookbook, by Cyrille Rossant, Packt Publishing, 2014 (500 pages).