I've repeatedly read the statistic that the US makes up around 4% of the global population, but makes up more than 20% of the prison population. Well defined questions are crucial tools for making sense of raw data, and this massive asymmetry is provokes some important questions, such as those listed below.
Why isn't our prison population proportional to our regular population?
What are the most common crimes?
How many people enter the prison system each month?
Recidivism
With some questions in hand, I can start trying to gather data that could shed light on the situation. I may not be able to get data with the resolution needed to answer some of these questions (eg I may not be able to find data broken down by month).
The government makes some data publicly available. While government data is rarely current, often involves a bit of cleanup, and always involves hunting through old government sites, it's the only source for much of this extremely valuable information. From the Bureau of Justice Statistics, I found a government project called the Prisoner Series and I downloaded the most recent data.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from IPython.core.display import display, HTML
import os
%matplotlib inline
# Notebook Styling
# pd.options.display.max_columns = None
# display(HTML("<style>.container { width:100% !important; }</style>"))
pd.set_option('display.float_format',lambda x: '%.5f' % x)
plt.rcParams['figure.figsize'] = 10,10
The values in the by-race dataset are [per 100k population].
CSV_PATH = os.path.join('data', 'prison', 'p16f01.csv')
total_df = pd.read_csv(CSV_PATH, encoding='latin1', header=12, index_col='Year', parse_dates=['Year'])
total_df.dropna(inplace=True)
total_df.index = total_df.index.values.astype(int)
total_df.head()
All ages | Age 18 or older | |
---|---|---|
1978 | 131.00000 | 183.00000 |
1979 | 133.00000 | 185.00000 |
1980 | 138.00000 | 191.00000 |
1981 | 153.00000 | 211.00000 |
1982 | 170.00000 | 232.00000 |
with sns.axes_style("whitegrid"):
fig, ax = plt.subplots(figsize=(10,7))
ax.plot(total_df['All ages'])
ax.plot(total_df['Age 18 or older'])
ax.set_title('Total Imprisonment Rates (table: p16f01)')
ax.set_xlabel('Year')
ax.set_ylabel('People imprisoned (per relevant 100k US population)')
ax.legend()
ax.set_ylim([0, 1.1*max([total_df['All ages'].max(),
total_df['Age 18 or older'].max()])])
The imprisonment rate normalized to the entire population of US residents is lower than the imprisonment rate normalized to the population of US residents that are 18 or older. This indicates that the imprisonment rate for people under age 18 is much lower than for people 18 or older. That fits with my intuition.
We also see that the imprisonment rate climbs steadily from 1980 up through 1999, dips, and peaks around 2007-2008, at which point it starts trending down. In 1978, 183 people were in prison per every 100k US residents 18 or older. In 2007, 670 people were in prison per every 100k US residents 18 or older. That's a 266% increase in the imprisonment rate over that 29 year span. That's huge.
To answer these new questions, we will probably have to look at other sets of data.
Imprisonment rate of sentenced prisoners under the jurisdiction of state or federal correctional authorities, per 100,000 U.S. residents age 18 or older, by race and Hispanic origin, December 31, 2006–2016
CSV_PATH = os.path.join('data', 'prison', 'p16f02.csv')
race_df = pd.read_csv(CSV_PATH, encoding='latin1', header=12)
race_df.dropna(inplace=True)
race_df.rename(columns={'Unnamed: 0': 'Year'}, inplace=True)
race_df.set_index('Year', inplace=True)
race_df
White/* | Black/* | Hispanic | |
---|---|---|---|
Year | |||
2006 | 324.00000 | 2261.00000 | 1073.00000 |
2007 | 317.00000 | 2233.00000 | 1094.00000 |
2008 | 316.00000 | 2196.00000 | 1057.00000 |
2009 | 308.00000 | 2134.00000 | 1060.00000 |
2010 | 307.00000 | 2059.00000 | 1014.00000 |
2011 | 299.00000 | 1973.00000 | 990.00000 |
2012 | 293.00000 | 1873.00000 | 949.00000 |
2013 | 291.00000 | 1817.00000 | 922.00000 |
2014 | 289.00000 | 1754.00000 | 893.00000 |
2015 | 281.00000 | 1670.00000 | 862.00000 |
2016 | 274.00000 | 1608.00000 | 856.00000 |
with sns.axes_style("whitegrid"):
fig, ax = plt.subplots(figsize=(10,7))
ax.plot(race_df['White/*'])
ax.plot(race_df['Black/*'])
ax.plot(race_df['Hispanic'])
ax.set_title('Imprisonment Rates by Race (table: p16f02)')
ax.set_xlabel('Year')
ax.set_ylabel('People imprisoned (per 100k US population)')
ax.legend()
ax.set_ylim([0, 1.1*max([race_df['White/*'].max(),
race_df['Black/*'].max(),
race_df['Hispanic'].max()])])
print('{:>8s} imprisonment per 100k US pop: max: {}, min: {}'
.format('White/*', race_df['White/*'].max(), race_df['White/*'].min()))
print('{:>8s} imprisonment per 100k US pop: max: {}, min: {}'
.format('Black/*', race_df['Black/*'].max(), race_df['Black/*'].min()))
print('{:>8s} imprisonment per 100k US pop: max: {}, min: {}'
.format('Hispanic', race_df['Hispanic'].max(), race_df['Hispanic'].min()))
print('*: non-Hispanic')
White/* imprisonment per 100k US pop: max: 324.0, min: 274.0 Black/* imprisonment per 100k US pop: max: 2261.0, min: 1608.0 Hispanic imprisonment per 100k US pop: max: 1094.0, min: 856.0 *: non-Hispanic
This is very striking. We see that there is a very significant difference in the rates of white (non-Hispanic), black (non-Hispanic), and Hispanic imprisonment. We also see that rates for all three have dropped over this time period.
Based on prior research, I suspect that this is the result of many systemic factors, but let's continue exploring the data.
CSV_PATH = os.path.join('data', 'prison', 'p16t01.csv')
sex_df = pd.read_csv(CSV_PATH, encoding='latin1', header=11, thousands=r',')
sex_df.dropna(inplace=True, thresh=3)
sex_df.dropna(inplace=True, axis=1, thresh=3)
sex_df.dropna(inplace=True)
fix = lambda x: x.split('/')[0]
sex_df['Year'] = sex_df['Year'].apply(fix)
sex_df['Year'] = sex_df['Year'].astype(int)
sex_df.set_index('Year', inplace=True)
sex_df
Total | Federal/a | State | Male | Female | |
---|---|---|---|---|---|
Year | |||||
2006 | 1568674.00000 | 193046.00000 | 1375628.00000 | 1456366.00000 | 112308.00000 |
2007 | 1596835.00000 | 199618.00000 | 1397217.00000 | 1482524.00000 | 114311.00000 |
2008 | 1608282.00000 | 201280.00000 | 1407002.00000 | 1493670.00000 | 114612.00000 |
2009 | 1615487.00000 | 208118.00000 | 1407369.00000 | 1502002.00000 | 113485.00000 |
2010 | 1613803.00000 | 209771.00000 | 1404032.00000 | 1500936.00000 | 112867.00000 |
2011 | 1598968.00000 | 216362.00000 | 1382606.00000 | 1487561.00000 | 111407.00000 |
2012 | 1570397.00000 | 217815.00000 | 1352582.00000 | 1461625.00000 | 108772.00000 |
2013 | 1576950.00000 | 215866.00000 | 1361084.00000 | 1465592.00000 | 111358.00000 |
2014 | 1562319.00000 | 210567.00000 | 1351752.00000 | 1449291.00000 | 113028.00000 |
2015 | 1526603.00000 | 196455.00000 | 1330148.00000 | 1415112.00000 | 111491.00000 |
2016 | 1505397.00000 | 189192.00000 | 1316205.00000 | 1393975.00000 | 111422.00000 |
@ticker.FuncFormatter
def y_formatter(x, pos):
return '{:4.0f}'.format(x/1000)
with sns.axes_style("whitegrid"):
fig, ax = plt.subplots(figsize=(10,7))
ax.plot(sex_df['Male'])
ax.plot(sex_df['Female'])
ax.set_title('Imprisonment Counts by Gender (table: p16t01)')
ax.set_xlabel('Year')
ax.yaxis.set_major_formatter(y_formatter)
ax.set_ylabel('People Imprisoned [in thousands of people]')
ax.legend()
ax.set_ylim([0, 1.1*max(sex_df['Male'].max(),
sex_df['Female'].max())])
The first thing that I notice is that the number of men in prison is much higher than the number of females imprisoned. Per the chart below, over the entire span of the data set (2006 to 2016), there are at least 12 men in prison for each woman in prison. This is a massive asymmetry. It doesn't feel very controversial, but should it? According to the 2010 US Census, the US population is 50.8% female and 49.2% male.
sex_df['m_f_ratio'] = sex_df['Male'] / sex_df['Female']
with sns.axes_style("whitegrid"):
fig, ax = plt.subplots(figsize=(7,5))
ax.plot(sex_df['m_f_ratio'])
ax.set_title('Male to Female Imprisonment Ratio (table: p16t01)')
ax.set_xlabel('Year')
ax.set_ylabel('Average Number of Males imprisoned per Female')
ax.legend()
ax.set_ylim([0, 1.1*sex_df['m_f_ratio'].max()])
with sns.axes_style("whitegrid"):
fig, ax = plt.subplots(figsize=(10,7))
ax.plot(sex_df['Federal/a'])
ax.plot(sex_df['State'])
ax.set_title('Imprisonment Counts in State and Federal Prisons (table: p16t01)')
ax.set_xlabel('Year')
ax.yaxis.set_major_formatter(y_formatter)
ax.set_ylabel('People Imprisoned [in thousands of people]')
ax.legend()
ax.set_ylim([0, 1.1*max(sex_df['Federal/a'].max(),
sex_df['State'].max())])
Far more people are in State prisons than are in Federal prisons. That isn't very controversial and at this level, I won't dig much deeper. It may be interesting to dig further into imprisonment counts broken down by state.
There are still 20 more tables that I haven't looked at yet, but so far, we've seen
Continued in the next notebook.