Machine Learning -- Creating Labels


Introduction

In this tutorial, we'll discuss how to formulate a policy problem or a social science question in the machine learning framework; how to transform raw data into something that can be fed into a model; how to build, evaluate, compare, and select models; and how to reasonably and accurately interpret model results. You'll also get hands-on experience using the scikit-learn package in Python.

This tutorial is based on chapter "Machine Learning" of Big Data and Social Science.

Setup

In [ ]:
import pandas as pd
import sqlite3
from dateutil.parser import parse
In [ ]:
DB = 'ncdoc.db'
conn = sqlite3.connect(DB)
cur = conn.cursor()

Problem Formulation


Our Machine Learning Problem

Of all prisoners released, we would like to predict who is likely to reenter jail within 5 years of the day we make our prediction. For instance, say it is Jan 1, 2009 and we want to identify which prisoners are likely to re-enter jail between now and end of 2013. We can run our predictive model and identify who is most likely at risk. The is an example of a binary classification problem.

Note the outcome window of 5 years is completely arbitrary. You could use a window of 5, 3, 1 years or 1 day.

In order to predict recidivism, we will be using data from the inmate and sentences table to create labels (predictors, or independent variables, or $X$ variables) and features (dependent variables, or $Y$ variables).

We need to munge our data into labels (1_Machine_Learning_Labels.ipynb) and features (2_Machine_Learning_Features.ipynb) before we can train and evaluate machine learning models (3_Machine_Learning_Models.ipynb).

Creating Labels (Outcomes)


First, we create a new sentence table sentences_prep that includes the sentence begin and end dates in date format.

In [ ]:
sql_string = "drop table if exists sentences_prep;"
cur.execute(sql_string)

sql_string ="""
create table sentences_prep as
select inmate_doc_number, 
cast(inmate_sentence_component as integer) as sentence_component,
date([sentence_begin_date_(for_max)]) as sentence_begin_date,
date(actual_sentence_end_date) as sentence_end_date
from sentences;
"""
cur.execute(sql_string)

We then create a table release_dates_2000_2008, which is based on the sentence_prep table. We take all of the records for inmate_doc_number and sentence_end_date between 2000 and 2008.

In [ ]:
sql_string = "drop table if exists release_dates_2000_2008;"
cur.execute(sql_string)

sql_string ="""
create temp table release_dates_2000_2008 as
select inmate_doc_number, sentence_end_date
from sentences_prep
where sentence_end_date >= '2000-01-01' and sentence_end_date <= '2008-12-31';
"""
cur.execute(sql_string)

Next we create a table last_exit_2000_2008, which takes the maximum (most recent) sentence_end_date for every inmate_doc_number. This table will only have one entry per inmate_doc_number, so for any given inmate_doc_number, or individual, we know their most recent release year.

In [ ]:
sql_string = "drop table if exists last_exit_2000_2008;"
cur.execute(sql_string)

sql_string ="""
create temp table last_exit_2000_2008 as
select inmate_doc_number, max(sentence_end_date) sentence_end_date
from release_dates_2000_2008
group by inmate_doc_number;
"""
cur.execute(sql_string)

We then find everyone admitted into prison between 2009 and 2013 and create table admit_2009_2013.

In [ ]:
sql_string = "drop table if exists admit_2009_2013;"
cur.execute(sql_string)

sql_string ="""
create temp table admit_2009_2013 as
select inmate_doc_number, sentence_component, sentence_begin_date
from sentences_prep
where sentence_begin_date >= '2009-01-01' and sentence_begin_date <= '2013-12-31' and sentence_component = 1;
"""
cur.execute(sql_string)

Next, we do a left join on the last_exit_2000_2008 (left) table and the admit_2009_2013 (right) table on the inmate_doc_number field. The resulting table will keep all the entries from the left table (most recent releases between 2000 and 2008) and add their admits between 2009 and 2013. Now we can create a label: 0 indicates no recidivism, 1 indicates recidivism, i.e. that person did return to jail between 2009 and 2013.

In [ ]:
sql_string = "drop table if exists recidivism_2009_2013;"
cur.execute(sql_string)

sql_string ="""
create temp table recidivism_2009_2013 as
select r.inmate_doc_number, r.sentence_end_date, a.sentence_begin_date,
case when a.sentence_begin_date is null then 0 else 1 end recidivism
from last_exit_2000_2008 r
left join admit_2009_2013 a on r.inmate_doc_number = a.inmate_doc_number;
"""
cur.execute(sql_string)

We then remove any potential duplicates and create the final label table recidivism_labels_2009_2013.

In [ ]:
sql_string = "drop table if exists recidivism_labels_2009_2013;"
cur.execute(sql_string)

sql_string ="""
create table recidivism_labels_2009_2013 as
select distinct inmate_doc_number, recidivism
from recidivism_2009_2013;
"""
cur.execute(sql_string)

Finally, we load the label table into label_2009_2013 and inspect the first observations.

In [ ]:
sql_string = "SELECT *"
sql_string += "FROM recidivism_labels_2009_2013 "
sql_string += ";"

label_2009_2013 = pd.read_sql(sql_string, con = conn)
label_2009_2013.head(5)

Following the machine learning pipeline, we will need a second label table for creating a test set later on. To facilitate the label generation process, we define a function called create_labels that automates all steps that are needed to create the label table. In essence, it runs all previous steps for a given prediction start date and prediction end date.

In [ ]:
def create_labels(features_end, prediction_start, prediction_end, conn):
    """
    Generate a list of labels and return the table as a dataframe.
    
    Parameters
    ----------
    features_end
    prediction_start
    prediction_end
    conn: obj
        
    Returns
    -------
    df_labels: DataFrame
    """
    end_x_year = parse(features_end, fuzzy = True).year
    start_y_year = parse(prediction_start, fuzzy = True).year
    end_y_year = parse(prediction_end, fuzzy = True).year
    
    sql_script="""

drop table if exists sentences_prep;
create table sentences_prep as
select inmate_doc_number, 
cast(inmate_sentence_component as integer) as sentence_component,
date([sentence_begin_date_(for_max)]) as sentence_begin_date,
date(actual_sentence_end_date) as sentence_end_date
from sentences;

drop table if exists release_dates_2000_{end_x_year};
create temp table release_dates_2000_{end_x_year} as
select inmate_doc_number, sentence_end_date
from sentences_prep
where sentence_end_date >= '2000-01-01' and sentence_end_date <= '{features_end}';

drop table if exists last_exit_2000_{end_x_year};
create temp table last_exit_2000_{end_x_year} as
select inmate_doc_number, max(sentence_end_date) sentence_end_date
from release_dates_2000_{end_x_year}
group by inmate_doc_number;

drop table if exists admit_{start_y_year}_{end_y_year};
create temp table admit_{start_y_year}_{end_y_year} as
select inmate_doc_number, sentence_component, sentence_begin_date
from sentences_prep
where sentence_begin_date >= '{prediction_start}' and sentence_begin_date <= '{prediction_end}' and sentence_component = 1;

drop table if exists recidivism_{start_y_year}_{end_y_year};
create temp table recidivism_{start_y_year}_{end_y_year} as
select r.inmate_doc_number, r.sentence_end_date, a.sentence_begin_date,
case when a.sentence_begin_date is null then 0 else 1 end recidivism
from last_exit_2000_{end_x_year} r
left join admit_{start_y_year}_{end_y_year} a on r.inmate_doc_number = a.inmate_doc_number;

drop table if exists recidivism_labels_{start_y_year}_{end_y_year};
create table recidivism_labels_{start_y_year}_{end_y_year} as
select distinct inmate_doc_number, recidivism
from recidivism_{start_y_year}_{end_y_year};

    """.format(features_end = features_end,
               prediction_start = prediction_start,
               prediction_end = prediction_end,
               end_x_year = end_x_year,
               start_y_year = start_y_year,
               end_y_year = end_y_year)
    
    cur.executescript(sql_script)
    df_label = pd.read_sql('select * from recidivism_labels_{start_y_year}_{end_y_year}'.format(
                                                                                    start_y_year = start_y_year,
                                                                                    end_y_year = end_y_year), conn)    
    return df_label

The create_labels function takes a features_end date, a prediction_start date and a prediction_end date as arguments. Our second label table covers recidivism between 2014 and 2018, based on all releases between 2000 and 2013.

In [ ]:
label_2014_2018 = create_labels('2013-12-31', '2014-01-01', '2018-12-31', conn)
label_2014_2018.head(5)
In [ ]:
cur.close()
conn.close()