Key ideas: Logistic regression, cross-validation, sensitivity, specificity, receiver operating characteristic (ROC) curve
This notebook shows how to use cross-validation and logistic regression in Statsmodels to assess how well a group of predictor variables can be used to predict a binary outcome. A receiver operating characteristic (ROC) curve is used to describe the strength of the predictive relationship.
We start with the usual import statements:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.discrete.discrete_model import Logit
The data are from a study of Pima Indians in the US. Each individual is assessed as either having or not having type 2 diabetes. Several predictors, such as age, BMI, and blood pressure, can be used to predict diabetes status. The data have been split into training and testing sets, but since we are using cross-validation here, we merge the two data sets.
data_url = "http://vincentarelbundock.github.io/Rdatasets/csv/MASS/Pima.tr.csv"
data_tr = pd.read_csv(data_url)
data_url = "http://vincentarelbundock.github.io/Rdatasets/csv/MASS/Pima.te.csv"
data_te = pd.read_csv(data_url)
data = pd.concat((data_tr, data_te), axis=0)
Next we split the data into a vector of outcomes (endog), and a matrix of predictors (exog). We also add an intercept (a column of 1's) to exog. Note that an alternative approach would be to use formulas to fit the logistic regression model. In this case, these two steps would not be necessary.
endog = np.asarray([int(x.lower() == "yes") for x in data["type"]])
xnames = ["npreg", "glu", "bp", "skin", "bmi", "ped", "age"]
exog = np.asarray(data[xnames])
exog = np.concatenate((np.ones((exog.shape[0], 1)), exog), axis=1)
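The formula-based alternative mentioned above can be sketched as follows. This is a minimal illustration on a small simulated data set (the column names `glu` and `bmi` mirror the Pima data, but the values and coefficients here are made up), since the formula interface builds the outcome vector, design matrix, and intercept for us.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated stand-in for the Pima data (illustrative values only).
rng = np.random.RandomState(0)
df = pd.DataFrame({
    "glu": rng.normal(120, 30, 200),
    "bmi": rng.normal(32, 6, 200),
})
# Simulate a binary outcome that depends on the predictors.
linpred = -10 + 0.05 * df["glu"] + 0.1 * df["bmi"]
df["has_diabetes"] = (rng.uniform(size=200) < 1 / (1 + np.exp(-linpred))).astype(int)

# The formula interface handles endog/exog construction, including the intercept.
rslt = smf.logit("has_diabetes ~ glu + bmi", data=df).fit(disp=False)
print(rslt.params)
```

With a formula, the intercept column and the numeric recoding of the outcome are handled automatically, so the two manual steps above are not needed.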
Many of the predictors show moderately strong marginal associations with diabetes status, so there is hope that we can predict the outcome fairly well by combining the information in all the predictors.
cor = [np.corrcoef(endog, x)[0, 1] for x in exog[:, 1:].T]
cor = pd.Series(cor, index=xnames)
print(cor)
npreg    0.252586
glu      0.503614
bp       0.183432
skin     0.254874
bmi      0.300901
ped      0.233074
age      0.315097
dtype: float64
Next we do the cross validation. Each subject is held out in turn during the model building. We construct a predicted outcome (on a probability scale) for the held-out observation and place it into the vector named "scores".
n = data.shape[0]
scores = np.zeros(n, dtype=np.float64)
for k in range(n):
    ii = list(range(n))
    ii.pop(k)
    mod = Logit(endog[ii], exog[ii, :])
    rslt = mod.fit(disp=False)
    scores[k] = rslt.predict(exog[k, :])
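The hold-one-out indexing can also be written with a boolean mask instead of popping an element from a list of indices. This sketch uses small random stand-ins for `endog` and `exog` (the real values come from the Pima data) and only demonstrates the indexing, not the model fit:

```python
import numpy as np

# Toy stand-ins for endog/exog (the real values come from the Pima data).
rng = np.random.RandomState(1)
n = 8
endog_toy = rng.randint(0, 2, n)
exog_toy = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])

for k in range(n):
    mask = np.ones(n, dtype=bool)
    mask[k] = False                # hold out subject k
    train_y = endog_toy[mask]      # the other n-1 outcomes
    train_x = exog_toy[mask, :]    # the other n-1 rows of predictors
    # fit Logit(train_y, train_x) and predict exog_toy[k, :] as above
```

The mask approach avoids rebuilding a Python list on every iteration and stays entirely in NumPy indexing.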
The ROC curve is a plot of sensitivity against 1 - specificity. We calculate sensitivity and specificity at each unique score threshold here (note that this is not a very efficient algorithm, but the calculation is fast so it doesn't matter much here).
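As a small numerical check of these definitions, sensitivity and specificity at a single threshold can be computed directly. The labels and scores below are hypothetical values, not from the Pima data:

```python
import numpy as np

# Toy example: true labels and predicted scores for six subjects.
y = np.array([1, 1, 1, 0, 0, 0])
s = np.array([0.9, 0.8, 0.3, 0.6, 0.2, 0.1])

u = 0.5  # one classification threshold
sens = np.sum((s >= u) * y) / float(np.sum(y))            # fraction of cases flagged
spec = np.sum((s <= u) * (1 - y)) / float(np.sum(1 - y))  # fraction of controls cleared
print(sens, spec)  # 2/3 and 2/3 for these values
```

Sweeping `u` over all unique scores, as in the loop below, traces out the full ROC curve.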
uscores = np.unique(scores)
n1 = np.sum(endog)
Sens = np.zeros_like(uscores)
Spec = np.zeros_like(uscores)
for j, u in enumerate(uscores):
    Sens[j] = np.sum((scores >= u) * endog) / float(n1)
    Spec[j] = np.sum((scores <= u) * (1 - endog)) / float(n - n1)
Now we make the ROC plot.
plt.plot(1 - Spec, Sens, '-')
plt.plot([0, 1], [0, 1], '-', color='grey')
plt.xlabel("1 - Specificity", size=17)
plt.ylabel("Sensitivity", size=17)
We can calculate the area under the curve (AUC) using the trapezoidal method:
auc = 0.
for i in range(len(Spec) - 1):
    auc += (Spec[i+1] - Spec[i]) * (Sens[i+1] + Sens[i]) / 2
print(auc)
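The trapezoidal loop can also be written in vectorized form with `np.diff`. As a check that the two agree, here both versions are run on a small set of hypothetical ROC points (not the Pima results), ordered by increasing threshold as in the loop above:

```python
import numpy as np

# Hypothetical ROC points, ordered by increasing threshold.
Sens = np.array([1.0, 0.9, 0.6, 0.2, 0.0])
Spec = np.array([0.0, 0.5, 0.8, 0.95, 1.0])

# Trapezoid rule, written as in the loop above:
auc = 0.0
for i in range(len(Spec) - 1):
    auc += (Spec[i + 1] - Spec[i]) * (Sens[i + 1] + Sens[i]) / 2

# Equivalent vectorized form:
auc_vec = np.sum(np.diff(Spec) * (Sens[:-1] + Sens[1:]) / 2)
print(auc, auc_vec)  # both 0.765 for these points
```

Summing `Sens` against increments of `Spec` gives the same area as integrating sensitivity against 1 - specificity, since the two differ only by the direction of traversal along the x-axis.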