Generalized Linear Models

Key ideas: Logistic regression, cross-validation, sensitivity, specificity, receiver operating characteristic (ROC) curve

This notebook shows how to use cross-validation and logistic regression in Statsmodels to assess how well a group of predictor variables can be used to predict a binary outcome. A receiver operating characteristic (ROC) curve is used to describe the strength of the predictive relationship.

We start with the usual import statements:

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.discrete.discrete_model import Logit

The data are from a study of Pima Indians in the US. Each individual is assessed as either having or not having type 2 diabetes. Several predictors such as age, BMI, and blood pressure can be used to predict diabetes status. The data have been split into training and testing sets, but since we are using cross-validation here, we merge the two data sets.

In [2]:
data_url = "http://vincentarelbundock.github.io/Rdatasets/csv/MASS/Pima.tr.csv"
data_tr = pd.read_csv(data_url)

data_url = "http://vincentarelbundock.github.io/Rdatasets/csv/MASS/Pima.te.csv"
data_te = pd.read_csv(data_url)

data = pd.concat((data_tr, data_te), axis=0)

Next we split the data into a vector of outcomes (endog) and a matrix of predictors (exog). We also add an intercept (a column of 1's) to exog. Note that an alternative approach would be to use formulas to fit the logistic regression model; in that case these two steps would not be necessary (a sketch of the formula interface follows the next cell).

In [3]:
endog = np.asarray([int(x.lower() == "yes") for x in data["type"]])
xnames = ["npreg", "glu", "bp", "skin", "bmi", "ped", "age"]
exog = np.asarray(data[xnames])
exog = np.concatenate((np.ones((exog.shape[0],1)), exog), axis=1)
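
For reference, here is a minimal sketch of the formula-based alternative mentioned above. It is not needed for the rest of the notebook, and the helper column name "type01" is introduced here only for illustration; the formula interface adds the intercept automatically.

import statsmodels.formula.api as smf

# Hypothetical helper column recoding the outcome as 0/1.
data["type01"] = (data["type"].str.lower() == "yes").astype(int)
fml = "type01 ~ npreg + glu + bp + skin + bmi + ped + age"
rslt_fml = smf.logit(fml, data=data).fit(disp=False)
print(rslt_fml.params)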

Many of the predictors show moderately strong marginal associations with diabetes status, so there is hope that we can predict the outcome fairly well by combining the information in all the predictors.

In [4]:
marg_corr = [np.corrcoef(endog, v)[0, 1] for v in exog[:, 1:].T]
marg_corr = pd.Series(marg_corr, index=xnames)
print(marg_corr)
npreg    0.252586
glu      0.503614
bp       0.183432
skin     0.254874
bmi      0.300901
ped      0.233074
age      0.315097
dtype: float64

Next we do the cross-validation. Each subject is held out in turn while the model is fit to the remaining data (leave-one-out cross-validation). We construct a predicted outcome (on the probability scale) for the held-out observation and place it into the vector named "scores".

In [5]:
n = data.shape[0]
scores = np.zeros(n, dtype=np.float64)
for k in range(n):

    # Indices of all observations except the held-out observation k.
    ii = list(range(n))
    ii.pop(k)

    # Fit the logistic regression model without observation k.
    mod = Logit(endog[ii], exog[ii, :])
    rslt = mod.fit(disp=False)

    # Predicted probability of diabetes for the held-out observation.
    scores[k] = rslt.predict(exog[k, :])

The ROC curve is a plot of sensitivity against 1 - specificity, traced out as the classification threshold varies over the observed scores. We calculate the sensitivity and specificity at each threshold here (this is not a very efficient algorithm, but the calculation is fast, so it doesn't matter much here).

In [6]:
uscores = np.unique(scores)
n1 = np.sum(endog)

Sens = np.zeros_like(uscores)
Spec = np.zeros_like(uscores)
for j, u in enumerate(uscores):
    # Sensitivity: proportion of diabetic subjects with score >= u.
    Sens[j] = np.sum((scores >= u) * endog) / float(n1)
    # Specificity: proportion of non-diabetic subjects with score <= u.
    Spec[j] = np.sum((scores <= u) * (1 - endog)) / float(n - n1)

Now we make the ROC plot.

In [7]:
plt.plot(1 - Spec, Sens, '-')
plt.plot([0,1], [0,1], '-', color='grey')
plt.xlabel("1 - Specificity", size=17)
plt.ylabel("Sensitivity", size=17)
Out[7]:
<matplotlib.text.Text at 0x483be50>
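
As a cross-check on the hand-rolled calculation, the same curve can be obtained from scikit-learn. This is only a sketch, assuming scikit-learn is installed, and is not needed elsewhere in the notebook.

from sklearn.metrics import roc_curve

# fpr corresponds to 1 - Spec and tpr to Sens computed above (up to the
# direction of the threshold inequalities and the ordering of the points).
fpr, tpr, thresholds = roc_curve(endog, scores)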

We can calculate the area under the curve (AUC) using the trapezoidal method:

In [8]:
auc = 0.
for i in range(len(Spec) - 1):
    # Trapezoid rule: width times average height.
    auc += (Spec[i+1] - Spec[i]) * (Sens[i+1] + Sens[i]) / 2
print(auc)
0.846749423092
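
The same number can be obtained with numpy's trapezoidal integrator applied to the sensitivity and specificity vectors; the optional scikit-learn check (if it is installed) should agree closely, up to how ties are handled.

print(np.trapz(Sens, Spec))

# Optional further check, assuming scikit-learn is installed:
# from sklearn.metrics import roc_auc_score
# print(roc_auc_score(endog, scores))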