Machine Learning -- Model Training and Evaluation


Introduction

In this tutorial, we'll discuss how to formulate a policy problem or a social science question in the machine learning framework; how to transform raw data into something that can be fed into a model; how to build, evaluate, compare, and select models; and how to reasonably and accurately interpret model results. You'll also get hands-on experience using the scikit-learn package in Python.

This tutorial is based on the "Machine Learning" chapter of Big Data and Social Science.

Setup

In [ ]:
import pandas as pd
import numpy as np
import sqlite3
import sklearn
import seaborn as sns
import matplotlib.pyplot as plt
from dateutil.parser import parse
from sklearn.metrics import precision_recall_curve, roc_curve, auc, confusion_matrix, accuracy_score, precision_score, recall_score
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.tree import DecisionTreeClassifier
sns.set_style("white")
In [ ]:
DB = 'ncdoc.db'
conn = sqlite3.connect(DB)
cur = conn.cursor()

Problem Formulation


Our Machine Learning Problem

Of all prisoners released, we would like to predict who is likely to reenter jail within 5 years of the day we make our prediction. For instance, say it is Jan 1, 2009 and we want to identify which prisoners are likely to re-enter jail between now and the end of 2013. We can run our predictive model and identify who is most at risk. This is an example of a binary classification problem.

Note that the outcome window of 5 years is completely arbitrary. You could use a window of 5, 3, or 1 years, or even 1 day.

In order to predict recidivism, we will be using data from the inmate and sentences tables to create labels (the outcome, or dependent variable, $Y$) and features (predictors, or independent variables, $X$).

We need to munge our data into labels (1_Machine_Learning_Labels.ipynb) and features (2_Machine_Learning_Features.ipynb) before we can train and evaluate machine learning models (3_Machine_Learning_Models.ipynb).

This notebook assumes that you have already worked through the 1_Machine_Learning_Labels and 2_Machine_Learning_Features notebooks. If that is not the case, the following imports bring in the functions developed in those notebooks from .py scripts.

In [ ]:
# We are bringing in the create_labels and create_features functions covered in previous notebooks
# Note that these are brought in from scripts rather than from external packages
from create_labels import create_labels
from create_features import create_features
In [ ]:
# These functions make sure that the tables have been created in the database.
create_labels('2008-12-31', '2009-01-01', '2013-12-31', conn)
create_labels('2013-12-31', '2014-01-01', '2018-12-31', conn)

create_features('2008-12-31', '2009-01-01', '2013-12-31', conn)
create_features('2013-12-31', '2014-01-01', '2018-12-31', conn)

Create Training and Test Sets


Our Training Set

We create a training set that takes people at the beginning of 2009 and defines the outcome based on data from 2009-2013 (recidivism_labels_2009_2013). The features for each person are based on data up to the end of 2008 (features_2000_2008).

Note: It is important to segregate your data based on time when creating features. Otherwise there can be "leakage", where you accidentally use information that you would not have known at the time.

In [ ]:
sql_string = "drop table if exists train_matrix;"
cur.execute(sql_string)

sql_string = "create table train_matrix as "
sql_string += "select l.inmate_doc_number, l.recidivism, f.num_admits, f.length_longest_sentence, f.age_first_admit, f.age "
sql_string += "from recidivism_labels_2009_2013 l "
sql_string += "left join features_2000_2008 f on f.inmate_doc_number = l.inmate_doc_number "
sql_string += ";"
cur.execute(sql_string)

We then load the training data into df_training.

In [ ]:
sql_string = "SELECT *"
sql_string += "FROM train_matrix "
sql_string += ";"

df_training = pd.read_sql(sql_string, con = conn)
df_training.head(5)

Our Test (Validation) Set

In the machine learning process, we want to build models on the training set and evaluate them on the test set. Our test set will use labels from 2014-2018 (recidivism_labels_2014_2018), and our features will be based on data up to the end of 2013 (features_2000_2013).

In [ ]:
sql_string = "drop table if exists test_matrix;"
cur.execute(sql_string)

sql_string = "create table test_matrix as "
sql_string += "select l.inmate_doc_number, l.recidivism, f.num_admits, f.length_longest_sentence, f.age_first_admit, f.age "
sql_string += "from recidivism_labels_2014_2018 l "
sql_string += "left join features_2000_2013 f on f.inmate_doc_number = l.inmate_doc_number "
sql_string += ";"
cur.execute(sql_string)

We load the test data into df_test.

In [ ]:
sql_string = "SELECT *"
sql_string += "FROM test_matrix "
sql_string += ";"

df_test = pd.read_sql(sql_string, con = conn)
df_test.head()

Data Cleaning

Before we proceed to model training, we need to clean our training data. First, we check the fraction of rows with missing values.

In [ ]:
isnan_training_rows = df_training.isnull().any(axis=1)
nrows_training = df_training.shape[0]
nrows_training_isnan = df_training[isnan_training_rows].shape[0]
print('Fraction of rows with NaNs: {:.4f}'.format(nrows_training_isnan / nrows_training))

We see that about 1% of the rows in our data have missing values. In the following, we drop rows with missing values. Note, however, that better ways of dealing with missing values exist (e.g., imputation).
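As a sketch of one such alternative, scikit-learn's SimpleImputer can fill missing numeric values with the column median instead of dropping the rows (the toy column names here mimic two of this notebook's features; in practice you would apply this to df_training):

```python
# Fill missing numeric values with each column's median using SimpleImputer,
# instead of dropping the affected rows.
import pandas as pd
from sklearn.impute import SimpleImputer

df_demo = pd.DataFrame({'num_admits': [1, 2, None, 4],
                        'age': [30, None, 45, 50]})

imputer = SimpleImputer(strategy='median')
df_filled = pd.DataFrame(imputer.fit_transform(df_demo),
                         columns=df_demo.columns)
print(df_filled)
```

Median imputation keeps all rows, at the cost of attenuating the variance of the imputed columns; more elaborate approaches (e.g., model-based imputation) are also possible.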

In [ ]:
df_training = df_training[~isnan_training_rows]

Let's check if the values of the ages at first admit are reasonable.

In [ ]:
np.unique( df_training['age_first_admit'] )

Looks like this needs some cleaning. We will drop any rows with age_first_admit below 14 or above 99.

In [ ]:
keep = (df_training['age_first_admit'] >= 14) & (df_training['age_first_admit'] <= 99)
df_training = df_training[keep]

Let's check how much data we still have and how many examples of recidivism are in our training dataset. When it comes to model evaluation, it is good to know what the "baseline" is in our dataset.

In [ ]:
print('Number of rows: {}'.format(df_training.shape[0]))
df_training['recidivism'].value_counts(normalize=True)

We have about 155,000 examples, and about 25% of those are positive examples (recidivist), which is what we're trying to identify. About 75% of the examples are negative examples (non-recidivist).

Next, let's take a look at the test set.

In [ ]:
isnan_test_rows = df_test.isnull().any(axis=1)
nrows_test = df_test.shape[0]
nrows_test_isnan = df_test[isnan_test_rows].shape[0]
print('Fraction of rows with NaNs: {:.4f}'.format(nrows_test_isnan / nrows_test))

We see that about 1% of the rows in our test set have missing values. This matches what we'd expect based on what we saw in the training set.

In [ ]:
df_test = df_test[~isnan_test_rows]

As before, we drop cases with age_first_admit below 14 or above 99.

In [ ]:
keep = (df_test['age_first_admit'] >= 14) & (df_test['age_first_admit'] <= 99)
df_test = df_test[keep]

We also check the number of observations and the outcome distribution for our test data.

In [ ]:
print('Number of rows: {}'.format(df_test.shape[0]))
df_test['recidivism'].value_counts(normalize=True)

Split into features and labels

Here we select our features and outcome variable.

In [ ]:
sel_features = ['num_admits', 'length_longest_sentence', 'age_first_admit', 'age']
sel_label = 'recidivism'

We can now create X and y objects for the training and test sets, which we will use to train and evaluate prediction models with scikit-learn.

In [ ]:
X_train = df_training[sel_features].values
y_train = df_training[sel_label].values
X_test = df_test[sel_features].values
y_test = df_test[sel_label].values

Model Training


On this basis, we can now build a prediction model that learns the relationship between our predictors (X_train) and recidivism (y_train) in the training data. We start with using logistic regression as our first model.

In [ ]:
model = LogisticRegression(penalty=None)  # on scikit-learn < 1.2, use penalty='none'
model.fit( X_train, y_train )
print(model)

When we print the model object, we see different model settings that can be adjusted. To adjust these parameters, alter the call that creates the LogisticRegression() model instance, passing it one or more of these parameters with a value other than the default. For example, to re-fit the model with an elastic-net penalty, a C of 0.01, and an intercept_scaling of 2, you'd create your model as follows (note that penalty='elasticnet' also requires the 'saga' solver and an l1_ratio):

model = LogisticRegression(penalty='elasticnet', solver='saga', l1_ratio=0.5, C=0.01, intercept_scaling=2)

The basic way to choose values for, or "tune," these parameters is the same as the way you choose a model: fit the model to your training data with a variety of parameters, and see which perform the best on the test set. However, an obvious drawback is that you can also overfit to your test set. In this case, you can (and should) alter the validation method (e.g., split the data into a training, validation and test set or run cross-validation in the training set).
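A minimal sketch of tuning by cross-validation within the training set, using scikit-learn's GridSearchCV (illustrated with synthetic data from make_classification; in this notebook you would pass X_train and y_train instead):

```python
# Tune the regularization strength C with 5-fold cross-validation,
# scoring each candidate by precision, using only the training data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X_demo, y_demo = make_classification(n_samples=500, n_features=4, random_state=0)

param_grid = {'C': [0.01, 0.1, 1, 10]}
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                      cv=5, scoring='precision')
search.fit(X_demo, y_demo)
print(search.best_params_)
```

Because the search only ever sees the training data, the test set remains untouched for the final evaluation.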

Let's look at what the model learned, i.e. what the coefficients are.

In [ ]:
model.coef_[0]
In [ ]:
model.intercept_

Model Evaluation


Machine learning models usually do not produce a prediction (0 or 1) directly. Rather, models produce a score (that can sometimes be interpreted as a probability) between 0 and 1, which lets you more finely rank all of the examples from most likely to least likely to have label 1 (positive). This score is then turned into a 0 or 1 based on a user-specified threshold. For example, you might label all examples that have a score greater than 0.5 as positive (1), but there's no reason that has to be the cutoff.

In [ ]:
y_scores = model.predict_proba(X_test)[:,1]

Let's take a look at the distribution of scores and see if it makes sense to us.

In [ ]:
sns.histplot(y_scores)  # distplot is deprecated in recent seaborn versions

Our distribution of scores is skewed, with the majority of scores on the lower end of the scale. We expect this because 75% of the training data is made up of non-recidivists, so we'd guess that a higher proportion of the examples in the test set will be negative (meaning they should have lower scores).

In [ ]:
df_test['y_score'] = y_scores

Tools like scikit-learn often have a default threshold of 0.5, but a good threshold is selected based on the data, model and the specific problem you are solving. As a trial run, let's set a threshold of 0.5.

In [ ]:
calc_threshold = lambda x,y: 0 if x < y else 1 
predicted = np.array( [calc_threshold(score,0.5) for score in y_scores] )
expected = y_test

Confusion Matrix

Once we have turned our scores into 0/1 predictions, we can create a confusion matrix, which has four cells: true negatives, true positives, false negatives, and false positives. If an example was predicted to be negative and is negative, it's a true negative. If an example was predicted to be positive and is positive, it's a true positive. If an example was predicted to be negative and is positive, it's a false negative. If an example was predicted to be positive and is negative, it's a false positive.

In [ ]:
conf_matrix = confusion_matrix(expected,predicted)
print(conf_matrix)

The count of true negatives is conf_matrix[0,0], false negatives conf_matrix[1,0], true positives conf_matrix[1,1], and false_positives conf_matrix[0,1].
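Rather than indexing the matrix manually, the four counts can also be unpacked directly with ravel(), which for binary labels returns them in the order tn, fp, fn, tp (illustrated here with toy vectors):

```python
# Unpack the four confusion-matrix cells for a binary problem.
import numpy as np
from sklearn.metrics import confusion_matrix

expected_demo  = np.array([0, 0, 1, 1, 1, 0])
predicted_demo = np.array([0, 1, 1, 0, 1, 0])

tn, fp, fn, tp = confusion_matrix(expected_demo, predicted_demo).ravel()
print(tn, fp, fn, tp)  # 2 1 1 2
```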

In [ ]:
accuracy = accuracy_score(expected, predicted)
print( "Accuracy = " + str( accuracy ) )

We get an accuracy score of 84%. Recall that our test set had 84.5% non-recidivists and 15.5% recidivists. If we had just labeled all the examples as negative and guessed non-recidivist every time, we would have had an accuracy of 84.5%, so our basic model is not doing better than a "dumb classifier". We therefore want to explore other prediction methods in later sections. For now, let's look at the precision and recall scores of our model, still using the default classification threshold.

In [ ]:
precision = precision_score(expected, predicted)
recall = recall_score(expected, predicted)
print( "Precision = " + str( precision ) )
print( "Recall= " + str(recall))

AUC-PR and AUC-ROC

If we care about the whole precision-recall space, we can consider a metric known as the area under the precision-recall curve (AUC-PR). The maximum AUC-PR is 1.

In [ ]:
precision_curve, recall_curve, pr_thresholds = precision_recall_curve(expected, y_scores)
auc_val = auc(recall_curve,precision_curve)

Here we plot the PR curve and print the corresponding AUC-PR score.

In [ ]:
plt.plot(recall_curve, precision_curve)
plt.xlabel('Recall')
plt.ylabel('Precision')
print('AUC-PR: {0:.2f}'.format(auc_val))
plt.show()

A related performance metric is the area under the receiver operating characteristic curve (AUC-ROC). It also has a maximum of 1, with 0.5 representing a non-informative model.

In [ ]:
fpr, tpr, thresholds = roc_curve(expected, y_scores)
roc_auc = auc(fpr, tpr)

Here we plot the ROC curve and print the corresponding AUC-ROC score.

In [ ]:
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

Precision and Recall at k%

If we only care about a specific part of the precision-recall curve, we can focus on more fine-grained metrics. For instance, say there is a special program for people likely to be recidivists, but only 5% can be admitted. In that case, we would want to prioritize the 5% who are most likely to end up back in jail, and it wouldn't matter much how accurate we were on the remaining 95% who aren't selected for the program.

Let's say that, out of the approximately 200,000 prisoners, we can intervene on 5% of them, or the "top" 10,000 prisoners (where "top" means highest predicted risk of recidivism). We can then focus on optimizing our precision at 5%. For this, we first define a function (plot_precision_recall_n) that computes and plots precision and recall for any percentage cutoff (k).

In [ ]:
def plot_precision_recall_n(y_true, y_prob, model_name):
    """
    Plot precision and recall against the percent of the population
    classified as positive, across all score thresholds.

    y_true: array-like of true binary labels
    y_prob: array-like of predicted scores
    model_name: str, used as the plot title
    """
    y_score = y_prob
    precision_curve, recall_curve, pr_thresholds = precision_recall_curve(y_true, y_score)
    precision_curve = precision_curve[:-1]
    recall_curve = recall_curve[:-1]
    pct_above_per_thresh = []
    number_scored = len(y_score)
    for value in pr_thresholds:
        num_above_thresh = len(y_score[y_score>=value])
        pct_above_thresh = num_above_thresh / float(number_scored)
        pct_above_per_thresh.append(pct_above_thresh)
    pct_above_per_thresh = np.array(pct_above_per_thresh)
    plt.clf()
    fig, ax1 = plt.subplots()
    ax1.plot(pct_above_per_thresh, precision_curve, 'b')
    ax1.set_xlabel('percent of population')
    ax1.set_ylabel('precision', color='b')
    ax1.set_ylim(0,1.05)
    ax2 = ax1.twinx()
    ax2.plot(pct_above_per_thresh, recall_curve, 'r')
    ax2.set_ylabel('recall', color='r')
    ax2.set_ylim(0,1.05)
    
    name = model_name
    plt.title(name)
    plt.show()
    plt.clf()

We can now plot the precision and recall scores for the full range of k values (this might take some time to run).

In [ ]:
plot_precision_recall_n(expected,y_scores, 'LR')

Here we define another function, precision_at_k, which returns the precision score for specific values of k.

In [ ]:
def precision_at_k(y_true, y_scores, k):
    # Find the score threshold that selects the top k (fraction) of examples,
    # label those examples positive, and compute precision among them.
    threshold = np.sort(y_scores)[::-1][int(k*len(y_scores))]
    y_pred = np.asarray([1 if i >= threshold else 0 for i in y_scores ])
    return precision_score(y_true, y_pred)

We can now compute, e.g., precision at top 1% and precision at top 5%.

In [ ]:
p_at_1 = precision_at_k(expected,y_scores, 0.01)
print('Precision at 1%: {:.2f}'.format(p_at_1))
In [ ]:
p_at_5 = precision_at_k(expected,y_scores, 0.05)
print('Precision at 5%: {:.2f}'.format(p_at_5))

Baseline

Finally, it is important to check our model against a reasonable baseline to know how well our model is doing. Without any context, 84% accuracy can sound really great... but it's not so great when you remember that you could do better by declaring everyone a non-recidivist, which would be a useless model. This baseline would be called the no information rate.
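The no information rate is simply the share of the majority class, i.e. the accuracy of always predicting that class. A minimal sketch with toy labels (in this notebook you would use the `expected` array instead):

```python
# Accuracy of always predicting the majority class (the "no information rate").
import numpy as np

labels = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])  # 80% negative examples

no_info_rate = max(labels.mean(), 1 - labels.mean())
print(no_info_rate)  # 0.8
```

Any model whose accuracy does not clear this bar adds no predictive value over the majority-class guess.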

In addition to the no information rate, we can check against a random baseline by assigning every example a label (positive or negative) completely at random. We can then compute the precision at top 5% for the random model.

In [ ]:
random_score = np.random.uniform(0, 1, len(y_test))
random_p_at_5 = precision_at_k(expected, random_score, 0.05)
random_p_at_5

More models


We have only scratched the surface of what we can do with scikit-learn. We've only tried one method (logistic regression), and there are plenty more classification algorithms. In the following, we consider decision trees (DT), random forests (RF), extremely randomized trees (ET) and gradient boosting (GB) as additional prediction methods.

In [ ]:
clfs = {'DT': DecisionTreeClassifier(max_depth=3),
        'RF': RandomForestClassifier(n_estimators=500, n_jobs=-1),
        'ET': ExtraTreesClassifier(n_estimators=250, n_jobs=-1, criterion='entropy'),
        'GB': GradientBoostingClassifier(learning_rate=0.05, subsample=0.7, max_depth=3, n_estimators=250)}
In [ ]:
sel_clfs = ['DT', 'RF', 'ET', 'GB']

We will use these methods in a loop that trains one model for each method with the training data and plots the corresponding precision and recall at top k figures with the test data.

In [ ]:
max_p_at_k = 0
for clfNM in sel_clfs:
    clf = clfs[clfNM]
    clf.fit( X_train, y_train )
    print(clf)
    y_score = clf.predict_proba(X_test)[:,1]
    expected = np.array(y_test)
    plot_precision_recall_n(expected, y_score, clfNM)
    p_at_5 = precision_at_k(expected,y_score, 0.05)
    if max_p_at_k < p_at_5:
        max_p_at_k = p_at_5
    print('Precision at 5%: {:.2f}'.format(p_at_5))

Let's explore the models we just built. We can, e.g., extract the decision tree result from the list of fitted models.

In [ ]:
clf = clfs[sel_clfs[0]]
print(clf)

We can then print and plot the feature importances for this model, which are stored as the attribute feature_importances_. Note that you can explore other models (e.g. the random forest) by extracting the corresponding result from clfs as above.

In [ ]:
importances = clf.feature_importances_
indices = np.argsort(importances)[::-1]

print("Feature ranking:")
for f in range(X_test.shape[1]):
    print("%d. %s (%f)" % (f + 1, sel_features[indices[f]], importances[indices[f]]))

plt.figure()
plt.title("Feature Importances")
plt.bar(range(X_test.shape[1]), importances[indices], color='r', align="center")
plt.xticks(range(X_test.shape[1]), [sel_features[i] for i in indices], rotation=90)
plt.xlim([-1, X_test.shape[1]])
plt.show()

Our ML modeling pipeline can be extended in various ways. Further steps may include:

  • Creating more features
  • Trying more models
  • Trying different parameters for our models
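As a sketch of the last point, a simple parameter sweep refits one model family at several settings and compares a held-out metric (synthetic data for illustration; in this notebook you would substitute X_train/y_train and X_test/y_test, and compare precision at top 5% instead):

```python
# Refit a decision tree at several depths and compare held-out precision.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import precision_score

X, y = make_classification(n_samples=1000, n_features=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for depth in [1, 3, 5, 10]:
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    clf.fit(X_tr, y_tr)
    p = precision_score(y_te, clf.predict(X_te))
    print('max_depth={}: precision={:.2f}'.format(depth, p))
```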
In [ ]:
cur.close()
conn.close()