In this tutorial, we'll discuss how to formulate a policy problem or a social science question in the machine learning framework; how to transform raw data into something that can be fed into a model; how to build, evaluate, compare, and select models; and how to reasonably and accurately interpret model results. You'll also get hands-on experience using the scikit-learn
package in Python.
This tutorial is based on chapter "Machine Learning" of Big Data and Social Science.
import pandas as pd
import sqlite3
from dateutil.parser import parse
DB = 'ncdoc.db'
conn = sqlite3.connect(DB)
cur = conn.cursor()
Our Machine Learning Problem
Of all prisoners released, we would like to predict who is likely to reenter jail within 5 years of the day we make our prediction. For instance, say it is Jan 1, 2009 and we want to identify which prisoners are likely to re-enter jail between now and end of 2013. We can run our predictive model and identify who is most likely at risk. The is an example of a binary classification problem.
Note the outcome window of 5 years is completely arbitrary. You could use a window of 5, 3, 1 years or 1 day.
In order to predict recidivism, we will be using data from the inmate
and sentences
table to create labels (predictors, or independent variables, or $X$ variables) and features (dependent variables, or $Y$ variables).
We need to munge our data into labels (1_Machine_Learning_Labels.ipynb) and features (2_Machine_Learning_Features.ipynb) before we can train and evaluate machine learning models (3_Machine_Learning_Models.ipynb).
First, we create a new sentence table sentences_prep
that includes the sentence begin and end dates in date format.
sql_string = "drop table if exists sentences_prep;"
cur.execute(sql_string)
sql_string ="""
create table sentences_prep as
select inmate_doc_number,
cast(inmate_sentence_component as integer) as sentence_component,
date([sentence_begin_date_(for_max)]) as sentence_begin_date,
date(actual_sentence_end_date) as sentence_end_date
from sentences;
"""
cur.execute(sql_string)
We then create a table release_dates_2000_2008
, which is based on the sentence_prep
table. We take all of the records for inmate_doc_number
and sentence_end_date
between 2000 and 2008.
sql_string = "drop table if exists release_dates_2000_2008;"
cur.execute(sql_string)
sql_string ="""
create temp table release_dates_2000_2008 as
select inmate_doc_number, sentence_end_date
from sentences_prep
where sentence_end_date >= '2000-01-01' and sentence_end_date <= '2008-12-31';
"""
cur.execute(sql_string)
Next we create a table last_exit_2000_2008
, which takes the maximum (most recent) sentence_end_date
for every inmate_doc_number
. This table will only have one entry per inmate_doc_number
, so for any given inmate_doc_number
, or individual, we know their most recent release year.
sql_string = "drop table if exists last_exit_2000_2008;"
cur.execute(sql_string)
sql_string ="""
create temp table last_exit_2000_2008 as
select inmate_doc_number, max(sentence_end_date) sentence_end_date
from release_dates_2000_2008
group by inmate_doc_number;
"""
cur.execute(sql_string)
We then find everyone admitted into prison between 2009 and 2013 and create table admit_2009_2013
.
sql_string = "drop table if exists admit_2009_2013;"
cur.execute(sql_string)
sql_string ="""
create temp table admit_2009_2013 as
select inmate_doc_number, sentence_component, sentence_begin_date
from sentences_prep
where sentence_begin_date >= '2009-01-01' and sentence_begin_date <= '2013-12-31' and sentence_component = 1;
"""
cur.execute(sql_string)
Next, we do a left join
on the last_exit_2000_2008
(left) table and the admit_2009_2013
(right) table on the inmate_doc_number
field. The resulting table will keep all the entries from the left table (most recent releases between 2000 and 2008) and add their admits between 2009 and 2013. Now we can create a label: 0 indicates no recidivism, 1 indicates recidivism, i.e. that person did return to jail between 2009 and 2013.
sql_string = "drop table if exists recidivism_2009_2013;"
cur.execute(sql_string)
sql_string ="""
create temp table recidivism_2009_2013 as
select r.inmate_doc_number, r.sentence_end_date, a.sentence_begin_date,
case when a.sentence_begin_date is null then 0 else 1 end recidivism
from last_exit_2000_2008 r
left join admit_2009_2013 a on r.inmate_doc_number = a.inmate_doc_number;
"""
cur.execute(sql_string)
We then remove any potential duplicates and create the final label table recidivism_labels_2009_2013
.
sql_string = "drop table if exists recidivism_labels_2009_2013;"
cur.execute(sql_string)
sql_string ="""
create table recidivism_labels_2009_2013 as
select distinct inmate_doc_number, recidivism
from recidivism_2009_2013;
"""
cur.execute(sql_string)
Finally, we load the label table into label_2009_2013
and inspect the first observations.
sql_string = "SELECT *"
sql_string += "FROM recidivism_labels_2009_2013 "
sql_string += ";"
label_2009_2013 = pd.read_sql(sql_string, con = conn)
label_2009_2013.head(5)
Following the machine learning pipeline, we will need a second label table for creating a test set later on. To facilitate the label generation process, we define a function called create_labels
that automates all steps that are needed to create the label table. In essence, it runs all previous steps for a given prediction start date and prediction end date.
def create_labels(features_end, prediction_start, prediction_end, conn):
"""
Generate a list of labels and return the table as a dataframe.
Parameters
----------
features_end
prediction_start
prediction_end
conn: obj
Returns
-------
df_labels: DataFrame
"""
end_x_year = parse(features_end, fuzzy = True).year
start_y_year = parse(prediction_start, fuzzy = True).year
end_y_year = parse(prediction_end, fuzzy = True).year
sql_script="""
drop table if exists sentences_prep;
create table sentences_prep as
select inmate_doc_number,
cast(inmate_sentence_component as integer) as sentence_component,
date([sentence_begin_date_(for_max)]) as sentence_begin_date,
date(actual_sentence_end_date) as sentence_end_date
from sentences;
drop table if exists release_dates_2000_{end_x_year};
create temp table release_dates_2000_{end_x_year} as
select inmate_doc_number, sentence_end_date
from sentences_prep
where sentence_end_date >= '2000-01-01' and sentence_end_date <= '{features_end}';
drop table if exists last_exit_2000_{end_x_year};
create temp table last_exit_2000_{end_x_year} as
select inmate_doc_number, max(sentence_end_date) sentence_end_date
from release_dates_2000_{end_x_year}
group by inmate_doc_number;
drop table if exists admit_{start_y_year}_{end_y_year};
create temp table admit_{start_y_year}_{end_y_year} as
select inmate_doc_number, sentence_component, sentence_begin_date
from sentences_prep
where sentence_begin_date >= '{prediction_start}' and sentence_begin_date <= '{prediction_end}' and sentence_component = 1;
drop table if exists recidivism_{start_y_year}_{end_y_year};
create temp table recidivism_{start_y_year}_{end_y_year} as
select r.inmate_doc_number, r.sentence_end_date, a.sentence_begin_date,
case when a.sentence_begin_date is null then 0 else 1 end recidivism
from last_exit_2000_{end_x_year} r
left join admit_{start_y_year}_{end_y_year} a on r.inmate_doc_number = a.inmate_doc_number;
drop table if exists recidivism_labels_{start_y_year}_{end_y_year};
create table recidivism_labels_{start_y_year}_{end_y_year} as
select distinct inmate_doc_number, recidivism
from recidivism_{start_y_year}_{end_y_year};
""".format(features_end = features_end,
prediction_start = prediction_start,
prediction_end = prediction_end,
end_x_year = end_x_year,
start_y_year = start_y_year,
end_y_year = end_y_year)
cur.executescript(sql_script)
df_label = pd.read_sql('select * from recidivism_labels_{start_y_year}_{end_y_year}'.format(
start_y_year = start_y_year,
end_y_year = end_y_year), conn)
return df_label
The create_labels
function takes a features_end
date, a prediction_start
date and a prediction_end
date as arguments. Our second label table covers recidivism between 2014 and 2018, based on all releases between 2000 and 2013.
label_2014_2018 = create_labels('2013-12-31', '2014-01-01', '2018-12-31', conn)
label_2014_2018.head(5)
cur.close()
conn.close()