Logistic Regression with scikit-learn

This is an example of logistic regression in Python with the scikit-learn module, performed for an assignment with my General Assembly Data Science class.

Dataset

The dataset I chose is the affairs dataset that comes with Statsmodels. It was derived from a survey of women in 1974 by Redbook magazine, in which married women were asked about their participation in extramarital affairs. More information about the study is available in a 1978 paper from the Journal of Political Economy.

Description of Variables

The dataset contains 6366 observations of 9 variables:

  • rate_marriage: woman's rating of her marriage (1 = very poor, 5 = very good)
  • age: woman's age
  • yrs_married: number of years married
  • children: number of children
  • religious: woman's rating of how religious she is (1 = not religious, 4 = strongly religious)
  • educ: level of education (9 = grade school, 12 = high school, 14 = some college, 16 = college graduate, 17 = some graduate school, 20 = advanced degree)
  • occupation: woman's occupation (1 = student, 2 = farming/semi-skilled/unskilled, 3 = "white collar", 4 = teacher/nurse/writer/technician/skilled, 5 = managerial/business, 6 = professional with advanced degree)
  • occupation_husb: husband's occupation (same coding as above)
  • affairs: time spent in extra-marital affairs

Problem Statement

I decided to treat this as a classification problem by creating a new binary variable affair (did the woman have at least one affair?) and trying to predict the classification for each woman.

Skipper Seabold, one of the primary contributors to Statsmodels, did a similar classification in his Statsmodels demo at a Statistical Programming DC Meetup. However, he used Statsmodels for the classification (whereas I'm using scikit-learn), and he treated the occupation variables as continuous (whereas I'm treating them as categorical).

Import modules

In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
from patsy import dmatrices
from sklearn.linear_model import LogisticRegression
# sklearn.cross_validation was removed in later scikit-learn releases;
# the same helpers live in sklearn.model_selection
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn import metrics

Data Pre-Processing

First, let's load the dataset and add a binary affair column.

In [2]:
# load dataset
dta = sm.datasets.fair.load_pandas().data

# add "affair" column: 1 represents having affairs, 0 represents not
dta['affair'] = (dta.affairs > 0).astype(int)

Data Exploration

In [3]:
dta.groupby('affair').mean()
Out[3]:
        rate_marriage        age  yrs_married  children  religious       educ  occupation  occupation_husb   affairs
affair
0            4.329701  28.390679     7.989335  1.238813   2.504521  14.322977    3.405286         3.833758  0.000000
1            3.647345  30.537019    11.152460  1.728933   2.261568  13.972236    3.463712         3.884559  2.187243

We can see that on average, women who have affairs rate their marriages lower, which is to be expected. Let's take another look at the rate_marriage variable.

In [4]:
dta.groupby('rate_marriage').mean()
Out[4]:
                     age  yrs_married  children  religious       educ  occupation  occupation_husb   affairs    affair
rate_marriage
1              33.823232    13.914141  2.308081   2.343434  13.848485    3.232323         3.838384  1.201671  0.747475
2              30.471264    10.727011  1.735632   2.330460  13.864943    3.327586         3.764368  1.615745  0.635057
3              30.008056    10.239174  1.638469   2.308157  14.001007    3.402820         3.798590  1.371281  0.550856
4              28.856601     8.816905  1.369536   2.400981  14.144514    3.420161         3.835861  0.674837  0.322926
5              28.574702     8.311662  1.252794   2.506334  14.399776    3.454918         3.892697  0.348174  0.181446

An increase in age, yrs_married, and children appears to correlate with a declining marriage rating.

Data Visualization

In [5]:
# show plots in the notebook
%matplotlib inline

Let's start with histograms of education and marriage rating.

In [6]:
# histogram of education
dta.educ.hist()
plt.title('Histogram of Education')
plt.xlabel('Education Level')
plt.ylabel('Frequency')
Out[6]:
[Figure: Histogram of Education (x-axis: Education Level, y-axis: Frequency)]
In [7]:
# histogram of marriage rating
dta.rate_marriage.hist()
plt.title('Histogram of Marriage Rating')
plt.xlabel('Marriage Rating')
plt.ylabel('Frequency')
Out[7]:
[Figure: Histogram of Marriage Rating (x-axis: Marriage Rating, y-axis: Frequency)]

Let's take a look at the distribution of marriage ratings for those having affairs versus those not having affairs.

In [8]:
# barplot of marriage rating grouped by affair (True or False)
pd.crosstab(dta.rate_marriage, dta.affair.astype(bool)).plot(kind='bar')
plt.title('Marriage Rating Distribution by Affair Status')
plt.xlabel('Marriage Rating')
plt.ylabel('Frequency')
Out[8]:
[Figure: Marriage Rating Distribution by Affair Status (x-axis: Marriage Rating, y-axis: Frequency)]

Let's use a stacked barplot to look at the percentage of women having affairs by number of years of marriage.

In [9]:
affair_yrs_married = pd.crosstab(dta.yrs_married, dta.affair.astype(bool))
affair_yrs_married.div(affair_yrs_married.sum(1).astype(float), axis=0).plot(kind='bar', stacked=True)
plt.title('Affair Percentage by Years Married')
plt.xlabel('Years Married')
plt.ylabel('Percentage')
Out[9]:
[Figure: Affair Percentage by Years Married, stacked bars (x-axis: Years Married, y-axis: Percentage)]

Prepare Data for Logistic Regression

To prepare the data, I want to add an intercept column as well as dummy variables for occupation and occupation_husb, since I'm treating them as categorical variables. The dmatrices function from the patsy module can do that using its formula language.

In [10]:
# create dataframes with an intercept column and dummy variables for
# occupation and occupation_husb
y, X = dmatrices('affair ~ rate_marriage + age + yrs_married + children + \
                  religious + educ + C(occupation) + C(occupation_husb)',
                  dta, return_type="dataframe")
print(X.columns)
Index([u'Intercept', u'C(occupation)[T.2.0]', u'C(occupation)[T.3.0]', u'C(occupation)[T.4.0]', u'C(occupation)[T.5.0]', u'C(occupation)[T.6.0]', u'C(occupation_husb)[T.2.0]', u'C(occupation_husb)[T.3.0]', u'C(occupation_husb)[T.4.0]', u'C(occupation_husb)[T.5.0]', u'C(occupation_husb)[T.6.0]', u'rate_marriage', u'age', u'yrs_married', u'children', u'religious', u'educ'], dtype='object')
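
To make the dummy coding concrete, here is a minimal pure-Python sketch of what C(occupation) produces for a single value: one indicator per non-baseline level, with level 1 (student) omitted as the baseline. The helper name dummy_code is hypothetical and is not part of patsy; it only mirrors the coding visible in the column names above.

```python
def dummy_code(value, levels=(1, 2, 3, 4, 5, 6)):
    """Return one 0/1 indicator per non-baseline level (the first level is omitted)."""
    if value not in levels:
        raise ValueError("unknown level: %r" % (value,))
    return [1 if value == level else 0 for level in levels[1:]]

teacher = dummy_code(4)   # occupation 4 = teacher/nurse/writer/technician/skilled
student = dummy_code(1)   # the baseline level maps to all zeros
print(teacher)
print(student)
```

Because the baseline gets all zeros, every dummy coefficient in the fitted model is read relative to students.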

The column names for the dummy variables are ugly, so let's rename those.

In [11]:
# fix column names of X
X = X.rename(columns = {'C(occupation)[T.2.0]':'occ_2',
                        'C(occupation)[T.3.0]':'occ_3',
                        'C(occupation)[T.4.0]':'occ_4',
                        'C(occupation)[T.5.0]':'occ_5',
                        'C(occupation)[T.6.0]':'occ_6',
                        'C(occupation_husb)[T.2.0]':'occ_husb_2',
                        'C(occupation_husb)[T.3.0]':'occ_husb_3',
                        'C(occupation_husb)[T.4.0]':'occ_husb_4',
                        'C(occupation_husb)[T.5.0]':'occ_husb_5',
                        'C(occupation_husb)[T.6.0]':'occ_husb_6'})

We also need to flatten y into a 1-D array, so that scikit-learn will properly understand it as the response variable.

In [12]:
# flatten y into a 1-D array
y = np.ravel(y)

Logistic Regression

Let's go ahead and run logistic regression on the entire dataset and see how accurate it is. Keep in mind that this is accuracy on the same data the model was trained on.

In [13]:
# instantiate a logistic regression model, and fit with X and y
model = LogisticRegression()
model = model.fit(X, y)

# check the accuracy on the training set
model.score(X, y)
Out[13]:
0.72588752748978946

73% accuracy seems good, but what's the null error rate?

In [14]:
# what percentage had affairs?
y.mean()
Out[14]:
0.32249450204209867

Only 32% of the women had affairs, which means that you could obtain 68% accuracy by always predicting "no" (a null error rate of 32%). So at 73% accuracy we're doing better than that baseline, but not by much.
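
The baseline arithmetic can be written out in a few lines. This is just a sketch that reuses the two numbers printed above; the variable names are my own.

```python
positive_rate = 0.32249450204209867   # y.mean() from Out[14]
model_accuracy = 0.72588752748978946  # model.score(X, y) from Out[13]

# Always predicting the majority class is right whenever that class occurs.
null_accuracy = max(positive_rate, 1 - positive_rate)
null_error_rate = 1 - null_accuracy
improvement = model_accuracy - null_accuracy

print("null accuracy:   %.3f" % null_accuracy)
print("null error rate: %.3f" % null_error_rate)
print("improvement:     %.3f" % improvement)
```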

Let's examine the coefficients to see what we learn.

In [15]:
# examine the coefficients
# wrap zip in list() so this also works where zip returns an iterator
pd.DataFrame(list(zip(X.columns, np.transpose(model.coef_))))
Out[15]:
    0              1
0   Intercept      [1.48988372957]
1   occ_2          [0.188045598942]
2   occ_3          [0.498926222393]
3   occ_4          [0.25064662018]
4   occ_5          [0.838982982602]
5   occ_6          [0.833921262629]
6   occ_husb_2     [0.190546828287]
7   occ_husb_3     [0.297744578502]
8   occ_husb_4     [0.161319424129]
9   occ_husb_5     [0.187683007035]
10  occ_husb_6     [0.193916860892]
11  rate_marriage  [-0.70312052711]
12  age            [-0.0584177439191]
13  yrs_married    [0.10567679013]
14  children       [0.0169195866351]
15  religious      [-0.371135218074]
16  educ           [0.0040161519299]

Increases in marriage rating and religiousness correspond to a decrease in the likelihood of having an affair. For both the wife's occupation and the husband's occupation, the lowest likelihood of having an affair corresponds to the baseline occupation (student), since all of the dummy coefficients are positive.
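
The coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are easier to read. Below is a small sketch using three values copied from the table above. It assumes the other features are held fixed, and since scikit-learn regularizes by default, I'd treat these ratios as rough rather than as exact effect sizes.

```python
import math

# Coefficients copied from Out[15] (log-odds scale)
coefficients = {
    "rate_marriage": -0.70312052711,
    "religious": -0.371135218074,
    "yrs_married": 0.10567679013,
}

# exp(coefficient) is the multiplicative change in the odds per one-unit increase
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name in sorted(odds_ratios):
    print("%-14s odds ratio = %.3f" % (name, odds_ratios[name]))
```

A ratio below 1 means the odds of an affair shrink as the feature increases; a ratio above 1 means they grow.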

Model Evaluation Using a Validation Set

So far, we have trained and tested on the same set. Let's instead split the data into a training set and a testing set.

In [16]:
# evaluate the model by splitting into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model2 = LogisticRegression()
model2.fit(X_train, y_train)
Out[16]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, penalty='l2', random_state=None, tol=0.0001)
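
For intuition, a random split amounts to shuffling the row indices and cutting the list in two. The sketch below is a pure-Python illustration, not scikit-learn's implementation: split_indices is a hypothetical helper, its shuffle will not reproduce the rows chosen by random_state=0, and rounding the test size up is an assumption I picked because it yields the 1910 test rows that show up in the evaluation further down.

```python
import math
import random

def split_indices(n_samples, test_size, seed):
    """Shuffle row indices and cut them into (train, test) index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(math.ceil(test_size * n_samples))  # assumption: round up
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(6366, test_size=0.3, seed=0)
print("train rows: %d, test rows: %d" % (len(train_idx), len(test_idx)))
```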

We now need to predict class labels for the test set. We will also generate the class probabilities, just to take a look.

In [17]:
# predict class labels for the test set
predicted = model2.predict(X_test)
print(predicted)
[ 1.  0.  0. ...,  0.  0.  0.]
In [18]:
# generate class probabilities
probs = model2.predict_proba(X_test)
print(probs)
[[ 0.3514255   0.6485745 ]
 [ 0.90952541  0.09047459]
 [ 0.72576645  0.27423355]
 ..., 
 [ 0.55736908  0.44263092]
 [ 0.81213879  0.18786121]
 [ 0.74729574  0.25270426]]

As you can see, the classifier is predicting a 1 (having an affair) any time the probability in the second column is greater than 0.5.
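
The thresholding rule is simple enough to spell out. Here is a sketch that applies it to the first three probability rows printed above; classify is a hypothetical helper of mine, and the 0.25 cutoff is only there to show how a different threshold would change the labels.

```python
# First three rows of the predict_proba output from In [18]: [P(class 0), P(class 1)]
probs = [
    [0.3514255, 0.6485745],
    [0.90952541, 0.09047459],
    [0.72576645, 0.27423355],
]

def classify(prob_rows, threshold=0.5):
    """Label a row 1 when its class-1 probability exceeds the threshold."""
    return [1 if row[1] > threshold else 0 for row in prob_rows]

default_labels = classify(probs)                  # the 0.5 cutoff described above
lenient_labels = classify(probs, threshold=0.25)  # a lower, hypothetical cutoff
print(default_labels)
print(lenient_labels)
```

With the default cutoff, the labels agree with the first three predictions printed earlier.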

Now let's generate some evaluation metrics.

In [19]:
# generate evaluation metrics
print(metrics.accuracy_score(y_test, predicted))
print(metrics.roc_auc_score(y_test, probs[:, 1]))
0.729842931937
0.74596198609

The accuracy is 73%, which matches what we saw when training and predicting on the same data.
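
The second number printed above is the area under the ROC curve. One way to read it: the chance that a randomly chosen woman who had an affair gets a higher predicted probability than a randomly chosen woman who did not. The sketch below computes that pairwise definition directly. The labels and scores are made up for illustration, and pairwise_auc is a hypothetical helper, not the scikit-learn function.

```python
def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical labels and scores, for illustration only
toy_labels = [1, 0, 0, 0, 1]
toy_scores = [0.65, 0.09, 0.27, 0.44, 0.19]
toy_auc = pairwise_auc(toy_labels, toy_scores)
print("AUC = %.3f" % toy_auc)
```

A value of 0.5 would mean the ranking is no better than chance, and 1.0 would mean every positive outranks every negative.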

We can also see the confusion matrix and a classification report with other metrics.

In [20]:
print(metrics.confusion_matrix(y_test, predicted))
print(metrics.classification_report(y_test, predicted))
[[1169  134]
 [ 382  225]]
             precision    recall  f1-score   support

        0.0       0.75      0.90      0.82      1303
        1.0       0.63      0.37      0.47       607

avg / total       0.71      0.73      0.71      1910
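
The class-1 row of the report can be recomputed by hand from the confusion matrix, which helps in seeing where the numbers come from. This sketch assumes the usual layout (rows are true classes, columns are predicted classes), which is consistent with the support counts in the report.

```python
# Confusion matrix from the output above: rows are true classes, columns are predictions
(tn, fp), (fn, tp) = (1169, 134), (382, 225)

total = tn + fp + fn + tp
accuracy = (tp + tn) / float(total)
precision = tp / float(tp + fp)   # of the predicted affairs, how many were real
recall = tp / float(tp + fn)      # of the real affairs, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print("accuracy=%.2f precision=%.2f recall=%.2f f1=%.2f"
      % (accuracy, precision, recall, f1))
```

The low recall is the main weakness here: the model misses 382 of the 607 women in the test set who had affairs.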

Model Evaluation Using Cross-Validation

Now let's try 10-fold cross-validation, to see if the accuracy holds up more rigorously.

In [21]:
# evaluate the model using 10-fold cross-validation
scores = cross_val_score(LogisticRegression(), X, y, scoring='accuracy', cv=10)
print(scores)
print(scores.mean())
[ 0.72100313  0.70219436  0.73824451  0.70597484  0.70597484  0.72955975
  0.7327044   0.70440252  0.75157233  0.75      ]
0.724163068551

Looks good. The mean accuracy across the 10 folds is about 72%, in line with the 73% we saw on the single train/test split.
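
The mean hides how much the folds vary, so it can be worth looking at the spread as well. A quick sketch using the ten fold scores printed above and only the standard library:

```python
import statistics

# Fold accuracies copied from the output of In [21]
scores = [0.72100313, 0.70219436, 0.73824451, 0.70597484, 0.70597484,
          0.72955975, 0.7327044, 0.70440252, 0.75157233, 0.75]

mean_score = statistics.mean(scores)
spread = statistics.stdev(scores)   # sample standard deviation across folds

print("mean=%.3f stdev=%.3f min=%.3f max=%.3f"
      % (mean_score, spread, min(scores), max(scores)))
```

The folds range from roughly 0.70 to 0.75, so a difference of a point or so between two evaluation runs is within the noise.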

Predicting the Probability of an Affair

Just for fun, let's predict the probability of an affair for a random woman not present in the dataset. She's a 25-year-old teacher who graduated college, has been married for 3 years, has 1 child, rates herself as strongly religious, rates her marriage as fair, and her husband is a farmer.

In [22]:
# predict_proba expects a 2-D array: one row per sample, one column per feature
model.predict_proba(np.array([[1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 3, 25, 3, 1, 4,
                               16]]))
Out[22]:
array([[ 0.77472334,  0.22527666]])

The predicted probability of an affair is 23%.
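
Typing seventeen numbers in the right order is error-prone, so here is a sketch of building the same row by column name instead. The column order is taken from the renamed X above; the dict and variable names are my own.

```python
# Column order of X after renaming (from In [10] and In [11])
columns = ["Intercept", "occ_2", "occ_3", "occ_4", "occ_5", "occ_6",
           "occ_husb_2", "occ_husb_3", "occ_husb_4", "occ_husb_5", "occ_husb_6",
           "rate_marriage", "age", "yrs_married", "children", "religious", "educ"]

# Nonzero values for the woman described above; omitted columns default to 0
woman = {"Intercept": 1, "occ_4": 1, "occ_husb_2": 1, "rate_marriage": 3,
         "age": 25, "yrs_married": 3, "children": 1, "religious": 4, "educ": 16}

# Guard against typos in the column names
unknown = set(woman) - set(columns)
if unknown:
    raise KeyError("unexpected column names: %s" % sorted(unknown))

row = [woman.get(name, 0) for name in columns]
print(row)
```

The resulting list holds the same values that were passed to predict_proba above.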

Next Steps

Several steps could be tried to improve the model:

  • including interaction terms
  • removing features
  • regularization techniques
  • using a non-linear model
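
As a starting point for the first item, an interaction can be expressed in the patsy formula with the a:b syntax. The sketch below only assembles the formula string, which could then be passed to dmatrices in place of the one used earlier. The choice of rate_marriage:yrs_married is hypothetical, picked for illustration rather than because I know it helps.

```python
main_effects = ["rate_marriage", "age", "yrs_married", "children", "religious",
                "educ", "C(occupation)", "C(occupation_husb)"]

# Hypothetical choice of interaction, picked only for illustration
interactions = [("rate_marriage", "yrs_married")]

terms = main_effects + ["%s:%s" % pair for pair in interactions]
formula = "affair ~ " + " + ".join(terms)
print(formula)
```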