#!/usr/bin/env python # coding: utf-8 # ## Exploration in Criminal Justice and Corrections Data # # I've repeatedly read the statistic that the US makes up around 4% of the global population, but makes up more than 20% of the prison population. Well defined questions are crucial tools for making sense of raw data, and this massive asymmetry is provokes some important questions, such as those listed below. # * Why isn't our prison population proportional to our regular population? # * Do other countries have fewer criminals? # * Do we have a drastically different approach to criminal justice than the rest of the world? # * Are our prison sentances just much longer? # * What are the most common crimes? # * How many people are in for those crimes? # * How many people enter the prison system each month? # * How many exit? # * Recidivism # # * How many prisoners have been to prison at least once before? # * What is the recidivism rate for 1st time prisoners who served in public prisons? # * And for private prisons? # * Which prisons have the lowest recidivism rates? # ## Table of Contents # # 1. [Exploration in Criminal Justice and Corrections Data](#0)
# 2. [Datasets](#1)
# 3. [Total Imprisonment Rates](#2)
# 3a. [Observations](#3)
# 4. [Imprisonment by Race](#4)
# 4a. [Observations](#5)
# 5. [Imprisonment by Gender](#6)
# 5a. [Observations](#7)
# 6. [To Be Continued](#8)
# ## Datasets # With some questions in hand, I can start trying to gather data that could shed light on the situation. I may not be able to get data with the resolution needed to answer some of these questions (eg I may not be able to find data broken down by month). # # The government makes some data publicly available. While government data is rarely current, often involves a bit of cleanup, and always involves hunting through old government sites, it's the only source for much of this extremely valuable information. # From the Bureau of Justice Statistics, I found a government project called the [Prisoner Series](https://www.bjs.gov/index.cfm?ty=pbse&sid=40) and I downloaded the [most recent data](http://www.bjs.gov/index.cfm?ty=pbdetail&iid=6187). # In[1]: import pandas as pd import numpy as np import seaborn as sns import matplotlib.pyplot as plt import matplotlib.ticker as ticker from IPython.core.display import display, HTML import os get_ipython().run_line_magic('matplotlib', 'inline') # In[2]: # Notebook Styling # pd.options.display.max_columns = None # display(HTML("")) pd.set_option('display.float_format',lambda x: '%.5f' % x) plt.rcParams['figure.figsize'] = 10,10 # ## Total Imprisonment Rates (Table p16f01) # The values in the by-race dataset are [per 100k population]. # In[3]: CSV_PATH = os.path.join('data', 'prison', 'p16f01.csv') total_df = pd.read_csv(CSV_PATH, encoding='latin1', header=12, index_col='Year', parse_dates=['Year']) total_df.dropna(inplace=True) total_df.index = total_df.index.values.astype(int) # In[4]: total_df.head() # In[5]: with sns.axes_style("whitegrid"): fig, ax = plt.subplots(figsize=(10,7)) ax.plot(total_df['All ages']) ax.plot(total_df['Age 18 or older']) ax.set_title('Total Imprisonment Rates (table: p16f01)') ax.set_xlabel('Year') ax.set_ylabel('People imprisoned (per relevant 100k US population)') ax.legend() ax.set_ylim([0, 1.1*max([total_df['All ages'].max(), total_df['Age 18 or older'].max()])]) # ## Observations # The imprisonment rate normalized to the entire population of US residents is lower than the imprisonment rate normalized to the population of US residents that are 18 or older. This indicates that the imprisonment rate for people under age 18 is much lower than for people 18 or older. That fits with my intuition. # # We also see that the imprisonment rate climbs steadily from 1980 up through 1999, dips, and peaks around 2007-2008, at which point it starts trending down. # In 1978, 183 people were in prison per every 100k US residents 18 or older. In 2007, 670 people were in prison per every 100k US residents 18 or older. That's a 266% increase in the imprisonment rate over that 29 year span. That's huge. # # ### New Questions: # * What was responsible for the increase in the rate of imprisonment? What was responsible for the decrease? # * Was it proportional to the actual crime rates? # * Was it a product of different enforcement policies? # # To answer these new questions, we will probably have to look at other sets of data. # ## Imprisonment by Race (Table p16f02) # Imprisonment rate of sentenced prisoners under the jurisdiction of state or federal correctional authorities, per 100,000 U.S. residents age 18 or older, by race and Hispanic origin, December 31, 2006–2016 # In[6]: CSV_PATH = os.path.join('data', 'prison', 'p16f02.csv') race_df = pd.read_csv(CSV_PATH, encoding='latin1', header=12) race_df.dropna(inplace=True) race_df.rename(columns={'Unnamed: 0': 'Year'}, inplace=True) race_df.set_index('Year', inplace=True) race_df # In[7]: with sns.axes_style("whitegrid"): fig, ax = plt.subplots(figsize=(10,7)) ax.plot(race_df['White/*']) ax.plot(race_df['Black/*']) ax.plot(race_df['Hispanic']) ax.set_title('Imprisonment Rates by Race (table: p16f02)') ax.set_xlabel('Year') ax.set_ylabel('People imprisoned (per 100k US population)') ax.legend() ax.set_ylim([0, 1.1*max([race_df['White/*'].max(), race_df['Black/*'].max(), race_df['Hispanic'].max()])]) # In[8]: print('{:>8s} imprisonment per 100k US pop: max: {}, min: {}' .format('White/*', race_df['White/*'].max(), race_df['White/*'].min())) print('{:>8s} imprisonment per 100k US pop: max: {}, min: {}' .format('Black/*', race_df['Black/*'].max(), race_df['Black/*'].min())) print('{:>8s} imprisonment per 100k US pop: max: {}, min: {}' .format('Hispanic', race_df['Hispanic'].max(), race_df['Hispanic'].min())) print('*: non-Hispanic') # ## Observations # This is very striking. We see that there is a very significant difference in the rates of white (non-Hispanic), black (non-Hispanic), and Hispanic imprisonment. We also see that rates for all three have dropped over this time period. # # ### New Questions: # * What is responsible for this difference in imprisonment rates for different demographic groups? # # Based on prior research, I suspect that this is the result of many systemic factors, but let's continue exploring the data. # ## Breakdown by Gender (Table p16t01) # In[9]: CSV_PATH = os.path.join('data', 'prison', 'p16t01.csv') sex_df = pd.read_csv(CSV_PATH, encoding='latin1', header=11, thousands=r',') sex_df.dropna(inplace=True, thresh=3) sex_df.dropna(inplace=True, axis=1, thresh=3) sex_df.dropna(inplace=True) fix = lambda x: x.split('/')[0] sex_df['Year'] = sex_df['Year'].apply(fix) sex_df['Year'] = sex_df['Year'].astype(int) sex_df.set_index('Year', inplace=True) sex_df # In[10]: @ticker.FuncFormatter def y_formatter(x, pos): return '{:4.0f}'.format(x/1000) # In[11]: with sns.axes_style("whitegrid"): fig, ax = plt.subplots(figsize=(10,7)) ax.plot(sex_df['Male']) ax.plot(sex_df['Female']) ax.set_title('Imprisonment Counts by Gender (table: p16t01)') ax.set_xlabel('Year') ax.yaxis.set_major_formatter(y_formatter) ax.set_ylabel('People Imprisoned [in thousands of people]') ax.legend() ax.set_ylim([0, 1.1*max(sex_df['Male'].max(), sex_df['Female'].max())]) # ## Observations # # The first thing that I notice is that the number of men in prison is much higher than the number of females imprisoned. Per the chart below, over the entire span of the data set (2006 to 2016), there are at least 12 men in prison for each woman in prison. This is a massive asymmetry. It doesn't feel very controversial, but should it? According to the [2010 US Census](https://www.census.gov/prod/cen2010/briefs/c2010br-03.pdf), the US population is 50.8% female and 49.2% male. # # ### New Questions # # * Why are men so much more likely to be in prison? # * What are the relevant differences between men and women? # * What is the gender breakdown of crimes? # In[12]: sex_df['m_f_ratio'] = sex_df['Male'] / sex_df['Female'] with sns.axes_style("whitegrid"): fig, ax = plt.subplots(figsize=(7,5)) ax.plot(sex_df['m_f_ratio']) ax.set_title('Male to Female Imprisonment Ratio (table: p16t01)') ax.set_xlabel('Year') ax.set_ylabel('Average Number of Males imprisoned per Female') ax.legend() ax.set_ylim([0, 1.1*sex_df['m_f_ratio'].max()]) # In[13]: with sns.axes_style("whitegrid"): fig, ax = plt.subplots(figsize=(10,7)) ax.plot(sex_df['Federal/a']) ax.plot(sex_df['State']) ax.set_title('Imprisonment Counts in State and Federal Prisons (table: p16t01)') ax.set_xlabel('Year') ax.yaxis.set_major_formatter(y_formatter) ax.set_ylabel('People Imprisoned [in thousands of people]') ax.legend() ax.set_ylim([0, 1.1*max(sex_df['Federal/a'].max(), sex_df['State'].max())]) # ### Observations # # Far more people are in State prisons than are in Federal prisons. That isn't very controversial and at this level, I won't dig much deeper. It may be interesting to dig further into imprisonment counts broken down by state. # # To Be Continued # # There are still 20 more tables that I haven't looked at yet, but so far, we've seen # * The imprisonment rates increased by just over 266% between 1978 and 2007. # * Black people are imprisoned at a far higher rate than either Hispanic people or non-Hispanic white people. # * Hispanic people are imprisoned at a far higher rate than non-Hispanic white people. # * Far more men are imprisoned than women. # * Far more people are in state prisons than in federal prisons. # # Continued in [the next notebook](https://nbviewer.jupyter.org/github/MattTriano/personal_site_public/blob/master/Crime_and_Prisons_part2.ipynb).