Datasets

As in part 1, I'm exploring the most recent data from the Bureau of Justice Statistics' Prisoner Series datasets.

Imprisonment Counts and Rates by State (Table p16t02)

In the previous post, Exploratory Data Analysis Part 1, we saw that the vast majority of people in prisons are in state prisons, but we didn't discover anything about imprisonment rates in individual states. One of my favorite tools for examining geographical data is the choropleth map (which you may recognize from any recent election night). In this post, I use the Bokeh python package to generate some state-level choropleths to show rates of imprisonment and how they vary from state to state.

Scroll to the bottom if you want to jump straight to the maps.

In [24]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from IPython.core.display import display, HTML
import os
%matplotlib inline
In [25]:
# Notebook Styling 
# pd.options.display.max_columns = None
# display(HTML("<style>.container { width:100% !important; }</style>"))
# pd.set_option('display.float_format',lambda x: '%.5f' % x)
# plt.rcParams['figure.figsize'] = 10,10
In [26]:
# Importing modules from a visualization package.
from bokeh.sampledata.us_states import data as states
from bokeh.plotting import figure, show, output_notebook
from bokeh.models import HoverTool, ColumnDataSource
from bokeh.models import LinearColorMapper, ColorBar, BasicTicker
try:
    del states["HI"]
    del states["AK"]
except KeyError:
    pass
In [27]:
# This dictionary maps State names to State Codes
def get_state_code():
    state_code = {
        'United States': 'US',
        'Alabama': 'AL',
        'Alaska': 'AK',
        'American Samoa': 'AS',
        'Arizona': 'AZ',
        'Arkansas': 'AR',
        'California': 'CA',
        'Colorado': 'CO',
        'Connecticut': 'CT',
        'Delaware': 'DE',
        'District of Columbia': 'DC',
        'Florida': 'FL',
        'Georgia': 'GA',
        'Guam': 'GU',
        'Hawaii': 'HI',
        'Idaho': 'ID',
        'Illinois': 'IL',
        'Indiana': 'IN',
        'Iowa': 'IA',
        'Kansas': 'KS',
        'Kentucky': 'KY',
        'Louisiana': 'LA',
        'Maine': 'ME',
        'Maryland': 'MD',
        'Massachusetts': 'MA',
        'Michigan': 'MI',
        'Minnesota': 'MN',
        'Mississippi': 'MS',
        'Missouri': 'MO',
        'Montana': 'MT',
        'National': 'NA',
        'Nebraska': 'NE',
        'Nevada': 'NV',
        'New Hampshire': 'NH',
        'New Jersey': 'NJ',
        'New Mexico': 'NM',
        'New York': 'NY',
        'North Carolina': 'NC',
        'North Dakota': 'ND',
        'Northern Mariana Islands': 'MP',
        'Ohio': 'OH',
        'Oklahoma': 'OK',
        'Oregon': 'OR',
        'Pennsylvania': 'PA',
        'Puerto Rico': 'PR',
        'Rhode Island': 'RI',
        'South Carolina': 'SC',
        'South Dakota': 'SD',
        'Tennessee': 'TN',
        'Texas': 'TX',
        'Utah': 'UT',
        'Vermont': 'VT',
        'Virgin Islands': 'VI',
        'Virginia': 'VA',
        'Washington': 'WA',
        'West Virginia': 'WV',
        'Wisconsin': 'WI',
        'Wyoming': 'WY'
    }
    return state_code
In [28]:
# Breakdown by State and Gender (Table p16t02)
CSV_PATH = os.path.join('data', 'prison', 'p16t02.csv')
state_sex_df = pd.read_csv(CSV_PATH, encoding='latin1',
                           header=[11,12], na_values=':',
                           thousands=r',')
# display(state_sex_df)
state_sex_df.dropna(axis=0, thresh=5, inplace=True)
state_sex_df.dropna(axis=1, thresh=5, inplace=True)
# state_sex_df.dropna(axis=0, inplace=True)
state_sex_df.drop(['Percent change, 2015–2016', 'Unnamed: 12_level_0',
                   'Unnamed: 14_level_0'],
                  axis=1, inplace=True)
state_sex_df.drop([0,1,2], axis=0, inplace=True)  
state_sex_df.head()
C:\Users\mattt\Anaconda3\envs\py36\lib\site-packages\pandas\core\generic.py:3111: PerformanceWarning: dropping on a non-lexsorted multi-index without a level parameter may impact performance.
  obj = obj._drop_axis(labels, axis, level=level, errors=errors)
Out[28]:
Unnamed: 1_level_0 2015 Unnamed: 3_level_0 Unnamed: 4_level_0 2016 Unnamed: 7_level_0 Unnamed: 8_level_0
Unnamed: 1_level_1 Total Male Female Total Male Female
3 Alabama 30810.0 28220.0 2590.0 28883.0 26506.0 2377.0
4 Alaska/c 5338.0 4761.0 577.0 4434.0 4024.0 410.0
5 Arizona 42719.0 38738.0 3981.0 42320.0 38323.0 3997.0
6 Arkansas 17707.0 16305.0 1402.0 17537.0 16161.0 1376.0
7 California 129593.0 123808.0 5785.0 130390.0 124487.0 5903.0
In [29]:
# Fixing the headers for the dataframe
tmp1 = ['Area'] 
tmp1.extend(3*[state_sex_df.columns.values[1][0]])
tmp1.extend(3*[state_sex_df.columns.values[4][0]])
tmp2 = [el[1] for el in state_sex_df.columns.values]
tmp2[0] = ''
state_sex_df.columns = pd.MultiIndex.from_arrays([tmp1,tmp2], names=['', ''])
state_sex_df.reset_index(drop=True, inplace=True)

# To clean up the '/_' footnote indicators
fix = lambda x: x.split('/')[0]
state_sex_df['Area'] = state_sex_df['Area'].apply(fix)

# This creates a new column in the DataFrame with the state code for each state
state_code = get_state_code()
map_state_codes = lambda x: state_code[x]
state_sex_df['Code'] = state_sex_df['Area'].apply(map_state_codes)
state_sex_df.drop(['Area'], axis=1, inplace=True)  # drops the row with aggregated values
# This sets the state code as the index
state_sex_df.set_index('Code', inplace=True)

# This fixes the names of the columns (to facilitate merging later)
state_sex_df.columns.set_names([None,'SEX'], inplace=True)
state_sex_df.index.name = 'Code'
C:\Users\mattt\Anaconda3\envs\py36\lib\site-packages\pandas\core\generic.py:3111: PerformanceWarning: dropping on a non-lexsorted multi-index without a level parameter may impact performance.
  obj = obj._drop_axis(labels, axis, level=level, errors=errors)

In the above cell, I've cleaned up the DataFrame so that the column header now consists of:

  • column labels: ('2015', '2016', 'Total', 'Male', 'Female')
  • column label names: ('', 'SEX')
  • index name: ('Code)
In [30]:
state_sex_df.head()
Out[30]:
2015 2016
SEX Total Male Female Total Male Female
Code
AL 30810.0 28220.0 2590.0 28883.0 26506.0 2377.0
AK 5338.0 4761.0 577.0 4434.0 4024.0 410.0
AZ 42719.0 38738.0 3981.0 42320.0 38323.0 3997.0
AR 17707.0 16305.0 1402.0 17537.0 16161.0 1376.0
CA 129593.0 123808.0 5785.0 130390.0 124487.0 5903.0
In [31]:
# The color levels to use in the map
COLORS = ['#feedde', '#fdd0a2', '#fdae6b', '#fd8d3c',
          '#f16913', '#d94801', '#8c2d04']
state_xs = [states[code]["lons"] for code in states]
state_ys = [states[code]["lats"] for code in states]

One of the great things about coding in Jupyter Notebooks is that I can prototype things as I go, and then encapsulate that prototype into a function. I built the prototype for my visualization in the cell below, and after I tune the visualization, I can make it into a function and easily generate more visualizations.

In [32]:
state_names = []
prisoner_counts = []
yr = '2015'
sex = 'Male'
df = state_sex_df
min_val = df.loc[:,yr][sex].min()
max_val = df.loc[:,yr][sex].max()
bins = len(COLORS)
TOOLS = "pan,wheel_zoom,reset,hover,save"
for state_code in states:
    try:
        num_imprisoned = df.loc[state_code,yr][sex]
        state_names.append(state_code)
        prisoner_counts.append(num_imprisoned)
    except KeyError:
        state_names.append(' ')
        prisoner_counts.append(0)

color_mapper = LinearColorMapper(palette=COLORS,
                                 low=min(prisoner_counts),
                                 high=max(prisoner_counts))
        
source = ColumnDataSource(data=dict(
    x=state_xs,
    y=state_ys,
    name=state_names,
    imp_count = prisoner_counts
))
        
p = figure(title="(Weak Visualization) Number of Imprisoned Males by State (2015)", 
           toolbar_location="left",
           plot_width=800, plot_height=450, tools=TOOLS)
p.patches('x', 'y', source=source, 
          fill_color={'field': 'imp_count', 'transform': color_mapper}, fill_alpha=1.0,
          line_color="#884444", line_width=2, line_alpha=0.3)
color_bar = ColorBar(color_mapper=color_mapper, ticker=BasicTicker(),
                     label_standoff=12, border_line_color=None, location=(0,0))
p.add_layout(color_bar, 'right')
hover = p.select_one(HoverTool)
hover.point_policy = "follow_mouse"
hover.tooltips = [
        ("State", "@name"),
        ("No. Imprisoned in this State", "@imp_count")
    ]
output_notebook()
show(p)
Loading BokehJS ...

Visualization Pitfall!

From this map, we see that the most populated states have the most prisoners (you can see the actual number of people in prison by hovering the mouse over a state). That's approximately what we would expect, but it doesn't really help us answer questions about the differences in imprisonment rates between States (with different policies on criminal justice). This information is useful for investigating some questions (e.g. Where could prison reform impact the most lives?), but it's not as useful if we're investigating the relationship between policies, crime rates, and imprisonment rates. If we want to actually compare imprisonment rates between states, we need to normalize (or scale) the prisoner counts to the population of the state. This requires getting a different data set.

Population Dataset

The US Census tracks and publishes population data, including a data set that tracks estimates of annual state populations (by gender) from 2010 to 2016. I'll use this data to normalize the raw prisoner counts by state.

In [33]:
CSV_PATH = os.path.join('data', 'pop', 'sc-est2016-agesex-civ.csv')
state_pop_sex_df = pd.read_csv(CSV_PATH, encoding='latin1', header=0)
print('\nRaw Population Data Set Format')
display(state_pop_sex_df.head())
state_pop_sex_df = state_pop_sex_df[state_pop_sex_df['SEX'].isin([1,2])]
state_code = get_state_code()
map_state_codes = lambda x: state_code[x]
fix_sex = lambda x: 'Male' if x == 1 else 'Female'
state_pop_sex_df['SEX'] = state_pop_sex_df['SEX'].apply(fix_sex)
state_pop_sex_df.drop(['SUMLEV','REGION','DIVISION','STATE','AGE', 
                       'ESTBASE2010_CIV', 'POPEST2010_CIV', 'POPEST2011_CIV',
                       'POPEST2012_CIV', 'POPEST2013_CIV', 'POPEST2014_CIV'], 
                      axis=1, inplace=True)
state_pop_sex_df.rename({'POPEST2015_CIV':'2015_pop',
                         'POPEST2016_CIV':'2016_pop'},
                        axis=1, inplace=True)
state_pop_sex_df['Code'] = state_pop_sex_df['NAME'].apply(map_state_codes)
state_pop_sex_df = state_pop_sex_df.groupby(['SEX', 'Code']).sum()
state_pop_sex_df = state_pop_sex_df.unstack(0)
print('\nPopulation Data Set Format After Reformatting')
state_pop_sex_df.head()
Raw Population Data Set Format
SUMLEV REGION DIVISION STATE NAME SEX AGE ESTBASE2010_CIV POPEST2010_CIV POPEST2011_CIV POPEST2012_CIV POPEST2013_CIV POPEST2014_CIV POPEST2015_CIV POPEST2016_CIV
0 10 0 0 0 United States 0 0 3944160 3951400 3963239 3926677 3931346 3955374 3975414 3970145
1 10 0 0 0 United States 0 1 3978090 3957847 3966617 3978101 3943114 3950083 3974980 3995008
2 10 0 0 0 United States 0 2 4096939 4090856 3971363 3980016 3992752 3959663 3967361 3992154
3 10 0 0 0 United States 0 3 4119051 4111929 4102483 3982920 3992660 4006960 3974468 3982074
4 10 0 0 0 United States 0 4 4063186 4077557 4122286 4112795 3994261 4005464 4020276 3987656
Population Data Set Format After Reformatting
Out[33]:
2015_pop 2016_pop
SEX Female Male Female Male
Code
AK 697258 734122 701214 739176
AL 4998694 4681748 5011344 4687956
AR 3028428 2916226 3039182 2926812
AZ 6854628 6744392 6970496 6853282
CA 39205974 38470174 39466330 38716576

Now that I've reformatted the population data to match the format of the state prisoner counts, I want to merge these two DataFrames together so that I can calculate the normalized rates of imprisonment.

In [34]:
state_sex_join_df = state_sex_df.join(state_pop_sex_df)
state_sex_join_df.head()
Out[34]:
2015 2016 2015_pop 2016_pop
SEX Total Male Female Total Male Female Female Male Female Male
Code
AL 30810.0 28220.0 2590.0 28883.0 26506.0 2377.0 4998694 4681748 5011344 4687956
AK 5338.0 4761.0 577.0 4434.0 4024.0 410.0 697258 734122 701214 739176
AZ 42719.0 38738.0 3981.0 42320.0 38323.0 3997.0 6854628 6744392 6970496 6853282
AR 17707.0 16305.0 1402.0 17537.0 16161.0 1376.0 3028428 2916226 3039182 2926812
CA 129593.0 123808.0 5785.0 130390.0 124487.0 5903.0 39205974 38470174 39466330 38716576

Now that I've merged the state prisoner counts with population data, I have to divide male prisoner counts by state population (of that gender) and store that information for use later.

In [35]:
# This Cell normalizes prisoner counts to be per 100k 
#   state population for that year.
yrs = ['2015','2016']
sexes = ['Male','Female']

for yr in yrs:
    for sex in sexes:
        state_sex_join_df.loc[:,(yr, sex + '_rate')] = \
            state_sex_join_df.loc[:,(yr, sex)]\
            .divide(state_sex_join_df.loc[:,(yr + '_pop', sex)]/100000)
In [36]:
state_sex_join_df.head()
Out[36]:
2015 2016 2015_pop 2016_pop 2015 2016
SEX Total Male Female Total Male Female Female Male Female Male Male_rate Female_rate Male_rate Female_rate
Code
AL 30810.0 28220.0 2590.0 28883.0 26506.0 2377.0 4998694 4681748 5011344 4687956 602.766317 51.813534 565.406331 47.432385
AK 5338.0 4761.0 577.0 4434.0 4024.0 410.0 697258 734122 701214 739176 648.529808 82.752726 544.389969 58.470025
AZ 42719.0 38738.0 3981.0 42320.0 38323.0 3997.0 6854628 6744392 6970496 6853282 574.373494 58.077550 559.191932 57.341687
AR 17707.0 16305.0 1402.0 17537.0 16161.0 1376.0 3028428 2916226 3039182 2926812 559.113045 46.294645 552.170758 45.275341
CA 129593.0 123808.0 5785.0 130390.0 124487.0 5903.0 39205974 38470174 39466330 38716576 321.828542 14.755404 321.534115 14.957053
In [37]:
# I'm going to generate more choropleths, so I'll encapsulate the code
#   I prototyped above into this function.
def make_choropleth(df, level1, level2, title, colors=COLORS):
    """Generates a choropleth with tooltips
    
    Args:
        df:        The DataFrame containing the relevant data. df should be 
                 formatted such that it has a 2 level column-header 
        level1:    The top column-header name (eg '2015' in examples above)
        level2:    The next-lower column header name (eg 'Male)
        title:     A title for the Choropleth
        colors:    An ordered list of color codes to use in mapping data.
    """
    state_names = []
    data = []
    min_val = df.loc[:,level1][level2].min()
    max_val = df.loc[:,level1][level2].max()
    bins = len(colors)
    TOOLS = "pan,wheel_zoom,reset,hover,save"
    for state_code in states:
        try:
            num_imprisoned = df.loc[state_code,level1][level2]
            state_names.append(state_code)
            data.append(num_imprisoned)
        except KeyError:
            state_names.append(' ')
            data.append(0)
    
    # Loading up Bokeh's special data structure
    source = ColumnDataSource(data=dict(
        x=state_xs,
        y=state_ys,
        name=state_names,
        imp_count = data
    ))
    
    # Maps the data value to a color
    color_mapper = LinearColorMapper(palette=colors,
                                     low=min(data),
                                     high=max(data))
    
    # Instantiating the figure
    p = figure(title=title, 
           toolbar_location="left",
           plot_width=800, plot_height=450, tools=TOOLS)
    
    # Plotting the state-polygons on the figure
    p.patches('x', 'y', source=source, 
              fill_color={'field': 'imp_count',
                          'transform': color_mapper},
              fill_alpha=1.0,
              line_color="#884444",
              line_width=2, 
              line_alpha=0.3)
    
    # Setting the legend for colors
    color_bar = ColorBar(color_mapper=color_mapper,
                         ticker=BasicTicker(),
                         label_standoff=12,
                         border_line_color=None,
                         location=(0,0))
    
    p.add_layout(color_bar, 'right')
    hover = p.select_one(HoverTool)
    hover.point_policy = "follow_mouse"
    hover.tooltips = [
        ("State", "@name"),
        ("Imprisonment rate (per 100k state pop)", "@imp_count")
    ]
    output_notebook()
    show(p)

Visualization Breakthrough! }

Now that we've normalized the number of imprisoned people to the state populations, we can compare rates between states. From looking at these choropleths, it's clear that there are massive differences between states.

Observations

  • For both men and women over the two included years, Massachusetts has the lowest imprisonment rates (below 145 men per 100k state male pop. and below 10 women per 100k state female pop).
  • For men, Louisiana has the highest imprisonment rate (at least 740 men per 100k state male pop.).
  • Delaware stands out as having a much higher imprisonment rate that its neighbors. Delaware imprisons at least 660 men per 100k state male pop., which is 2 to 3 times more than any neighboring state.
  • Oklahoma has the highest imprisonment rate for women at over 75 womek per 100k state female pop.
  • In general, southern states have higher male imprisonment rates than the north.
  • Except for Delaware, New England and the Midwest have lower than average female imprisonment rates.
  • Bokeh's built-in state geometries are convenient, but they're not great (just look at Michigan and the Great Lakes).
In [38]:
titl = "Number of Imprisoned Females per 100k Female State Population (2015)"
make_choropleth(state_sex_join_df, '2015', 'Female_rate', titl)
Loading BokehJS ...
In [39]:
titl = "Number of Imprisoned Females per 100k Female State Population (2016)"
make_choropleth(state_sex_join_df, '2016', 'Female_rate', titl)
Loading BokehJS ...
In [40]:
titl = "Number of Imprisoned Males per 100k State Male Population (2015)"
make_choropleth(state_sex_join_df, '2015', 'Male_rate', titl)
Loading BokehJS ...
In [41]:
titl = "Number of Imprisoned Males per 100k State Male Population (2016)"
make_choropleth(state_sex_join_df, '2016', 'Male_rate', titl)
Loading BokehJS ...