*Key ideas:* Logistic regression, cross validation, sensitivity, specificity, receiver operating characteristics curve

This noteboook shows how to use cross-validation and logistic regression in Statsmodels to assess how well a group of predictor variables can be used to predict a binary outcome. A reciever operating characteristics (ROC) curve is used to describe the strength of the predictive relationship.

We start with the usual import statements:

In [1]:

```
import numpy as np
import pandas as pd
from statsmodels.discrete.discrete_model import Logit
```

In [2]:

```
data_url = "http://vincentarelbundock.github.io/Rdatasets/csv/MASS/Pima.tr.csv"
data_tr = pd.read_csv(data_url)
data_url = "http://vincentarelbundock.github.io/Rdatasets/csv/MASS/Pima.te.csv"
data_te = pd.read_csv(data_url)
data = pd.concat((data_tr, data_te), axis=0)
```

In [3]:

```
endog = np.asarray([int(x.lower() == "yes") for x in data["type"]])
xnames = ["npreg", "glu", "bp", "skin", "bmi", "ped", "age"]
exog = np.asarray(data[xnames])
exog = np.concatenate((np.ones((exog.shape[0],1)), exog), axis=1)
```

In [4]:

```
x = [np.corrcoef(endog, x)[0,1] for x in exog[:,1:].T]
x = pd.Series(x, index=xnames)
print x
```

In [5]:

```
n = data.shape[0]
scores = np.zeros(n, dtype=np.float64)
for k in range(n):
ii = range(n)
ii.pop(k)
mod = Logit(endog[ii], exog[ii,:])
rslt = mod.fit(disp=False)
scores[k] = rslt.predict(exog[k,:])
```

In [6]:

```
uscores = np.unique(scores)
n1 = np.sum(endog)
Sens = np.zeros_like(uscores)
Spec = np.zeros_like(uscores)
for j,u in enumerate(uscores):
Sens[j] = np.sum((scores >= u) * endog) / float(n1)
Spec[j] = np.sum((scores <= u) * (1-endog)) / float(n - n1)
```

Now we make the ROC plot.

In [7]:

```
plt.plot(1 - Spec, Sens, '-')
plt.plot([0,1], [0,1], '-', color='grey')
plt.xlabel("1 - Specificity", size=17)
plt.ylabel("Sensitivity", size=17)
```

Out[7]:

We can calculate the area under the curve (AUC) using the trapezoidal method:

In [8]:

```
auc = 0.
for i in range(len(Spec)-1):
auc += (Spec[i+1] - Spec[i]) * (Sens[i+1] + Sens[i]) / 2
print auc
```