Like many people, I'm still recovering from the results of the election. A big part of the process for me is coming up with an accurate description of what actually happened, so this post attempts to do so by building a few visualizations.
Building a graphic that clearly communicates the election results is difficult because you need to do a few things simultaneously:
Use a high enough data resolution to show sub-state trends. County level data does a good job here.
Communicate the size of the vote in each county so we know how consequential it was for the election. If the data is at the national level, weight the vote totals by their contribution to the electoral college.
Show both the voter turnout and margins for each party. Either can swing an election.
Compare the outcomes with those of past elections so we can get an idea of what changed this time. Ideally, compare with the average of the past few elections so you're not comparing against a single candidate or a single point in time.
The best graphics I have seen come from a Washington Post article looking at the urban/rural divide, which gets pretty close to meeting all of the above requirements. Unfortunately, their graphics communicate changes in voter turnout using line thickness, which makes it hard to discern differences over time.
Below I attempt to accomplish most of those objectives using county level data from Wisconsin.
The county level data for the 2000-2016 elections is from David Leip's Atlas of Presidential Elections. There's a script for getting the state level data below.
The voting age population (VAP) data is from the American Community Survey. I downloaded county level data averaged over 2005-2009, and 2010-2014 to try to account for population changes. I calculated turnout by dividing the sum of the votes by the voting age population for each county. All of the input data is available here.
%matplotlib inline
import pandas as pd
import os
import glob
import matplotlib.pyplot as plt
import matplotlib
import numpy as np
matplotlib.style.use('ggplot') #'ggplot' 'fivethirtyeight' 'seaborn-paper'
#pd.set_option('max_colwidth', 400)
#Optional script for getting state level data:
'''
import requests
from bs4 import BeautifulSoup
import csv
years = [2000, 2004, 2008, 2012, 2016]
#Iterate through years 2000-2016, by 4
base_url = 'http://uselectionatlas.org/RESULTS/datagraph.php?year={year}&fips=55&f=1&off=0&elect=0'
row_list = []
for year in years:
url = base_url.format(year=year)
r = requests.get(url) #Encoding is UTF-8
if r.status_code == 200:
soup = BeautifulSoup(r.text, 'html.parser')
tables = soup.findAll('table')[:-1] #Ignore extra table at end
county = None
for table in tables:
rows = table.findAll('tr')
for row in rows:
data = row.findAll('td')
for element in data:
if element.get('style') == 'width:100px':
county = element.get_text(strip=True)
data.remove(element)
candidate = data[0].get_text(strip=True)
pct = data[1].get_text(strip=True)
num = data[2].get_text(strip=True)
row = {'year': year, 'county': county, 'candidate': candidate, 'pct': pct, 'num':num}
row_list.append(row)
print 'Completed {0}.'.format(year)
else:
print "Failed to access url: " + url
with open('county_data_wi_2000-2016.csv', 'wb') as csvfile:
fieldnames = ['year', 'county', 'candidate', 'pct', 'num']
writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(row_list)
'''
# Voting behavior by county in WI
# Source: http://uselectionatlas.org/RESULTS/datagraph.php?year=2016&fips=55&f=1&off=0&elect=0
voting_df = pd.read_csv('./data/voting/county_data_wi_2000-2016.csv')
#print len(voting_df['county'].drop_duplicates())
# 72 Counties
rs = ['Bush', 'McCain', 'Romney', 'Trump']
ds = ['Gore', 'Kerry', 'Obama', 'Clinton']
def party(element):
if element in rs:
return 'R'
elif element in ds:
return 'D'
else:
return 'O'
voting_df['party'] = voting_df['candidate'].apply(party)
voting_df.replace({'%': '', ',': ''}, inplace=True, regex=True)
voting_df[['pct','num']] = voting_df[['pct','num']].astype(float) #.apply(pd.to_numeric)
#voting_df.dtypes
voting_df.head(10)
year | county | candidate | pct | num | party | |
---|---|---|---|---|---|---|
0 | 2000 | Adams | Gore | 52.9 | 4826.0 | D |
1 | 2000 | Adams | Bush | 43.0 | 3920.0 | R |
2 | 2000 | Adams | Nader | 2.4 | 217.0 | O |
3 | 2000 | Adams | Other | 1.7 | 153.0 | O |
4 | 2000 | Ashland | Gore | 55.2 | 4356.0 | D |
5 | 2000 | Ashland | Bush | 38.5 | 3038.0 | R |
6 | 2000 | Ashland | Nader | 5.6 | 440.0 | O |
7 | 2000 | Ashland | Other | 0.7 | 56.0 | O |
8 | 2000 | Barron | Gore | 44.9 | 8928.0 | D |
9 | 2000 | Barron | Bush | 49.5 | 9848.0 | R |
# Voting age population dataframe
# Source: http://www.census.gov/rdo/data/voting_age_population_by_citizenship_and_race_cvap.html
# 2010-2014 Averaged data
population2014_df = pd.read_csv('./data/CVAP_CSV_Format_2010-2014/County.csv')
population2014_df = population2014_df[(population2014_df['LNTITLE'] == 'Total') &
(population2014_df['GEONAME'].str.contains('Wisconsin'))]
population2014_df = population2014_df[['GEONAME', 'CVAP_EST', 'CVAP_MOE']]
population2014_df.replace({' County, Wisconsin': ''}, inplace=True, regex=True)
population2014_df.rename(columns={'GEONAME': 'county'}, inplace=True)
#print len(population_df['GEONAME'].drop_duplicates())
# 72 counties
# 2005 to 2009 average
population2009_df = pd.read_csv('./data/CVAP_CSV_Format_2005-2009/County.csv')
population2009_df = population2009_df[(population2009_df['LNTITLE'] == 'Total') &
(population2009_df['GEONAME'].str.contains('Wisconsin'))]
population2009_df = population2009_df[['GEONAME', 'CVAP_EST', 'CVAP_MOE']]
population2009_df.replace({' County, Wisconsin': ''}, inplace=True, regex=True)
population2009_df.rename(columns={'GEONAME': 'county'}, inplace=True)
# print len(population2008_df['GEONAME'].drop_duplicates())
# 72
population2009_df.head(10)
county | CVAP_EST | CVAP_MOE | |
---|---|---|---|
39624 | Adams | 16640 | 81 |
39637 | Ashland | 12420 | 45 |
39650 | Barron | 35260 | 94 |
39663 | Bayfield | 11945 | 51 |
39676 | Brown | 176420 | 491 |
39689 | Buffalo | 10470 | 32 |
39702 | Burnett | 12925 | 26 |
39715 | Calumet | 31990 | 123 |
39728 | Chippewa | 45520 | 82 |
39741 | Clark | 23525 | 40 |
# Calculate voter turnouts by county
turnout_df = pd.pivot_table(voting_df, values='num', index=['year', 'county'], aggfunc=np.sum)
turnout_df = turnout_df.reset_index()
turnout_df = pd.merge(turnout_df, population2009_df, on='county', suffixes=('', '_2009'))
turnout_df = pd.merge(turnout_df, population2014_df, on='county', suffixes=('', '_2014'))
turnout_df.rename(columns={'CVAP_EST': 'CVAP_EST_2009',
'CVAP_MOE': 'CVAP_MOE_2009'}, inplace=True)
def turnout_frac(row):
if row['year'] < 2012:
return row['num']/row['CVAP_EST_2009']
else:
return row['num']/row['CVAP_EST_2014']
turnout_df['turnout'] = turnout_df.apply(turnout_frac, axis=1)
# Mean turnout, 2000-2012
turnout2012_df = turnout_df[turnout_df['year'] < 2016].groupby(['county'], as_index=False
).agg({'turnout':'mean'})
# Turnout, 2016
turnout2016_df = turnout_df[turnout_df['year'] == 2016]
turnout_df = pd.merge(turnout2012_df, turnout2016_df, on='county', suffixes=('_2000_2012', '_2016'))
turnout_df.rename(columns={'num': 'num_2016'}, inplace=True)
turnout_df = turnout_df[['county', 'num_2016', 'turnout_2016', 'turnout_2000_2012']]
turnout_df['turnout_change'] = turnout_df['turnout_2016'] - turnout_df['turnout_2000_2012']
turnout_df.sort_values(by='num_2016', ascending=False).head(10)
county | num_2016 | turnout_2016 | turnout_2000_2012 | turnout_change | |
---|---|---|---|---|---|
40 | Milwaukee | 440698.0 | 0.654155 | 0.703174 | -0.049019 |
12 | Dane | 309096.0 | 0.830200 | 0.757048 | 0.073152 |
67 | Waukesha | 237588.0 | 0.808631 | 0.801177 | 0.007454 |
4 | Brown | 128965.0 | 0.706793 | 0.680866 | 0.025927 |
44 | Outagamie | 95162.0 | 0.724051 | 0.686042 | 0.038009 |
51 | Racine | 94921.0 | 0.666838 | 0.686983 | -0.020146 |
70 | Winnebago | 87140.0 | 0.667714 | 0.678759 | -0.011046 |
66 | Washington | 77551.0 | 0.776092 | 0.741324 | 0.034768 |
29 | Kenosha | 76894.0 | 0.636830 | 0.639971 | -0.003141 |
53 | Rock | 76056.0 | 0.649524 | 0.670892 | -0.021368 |
# Calculate 2016 Democrate-Republican Difference, by county
clinton2016_df = voting_df[(voting_df['year'] == 2016) &
(voting_df['candidate'] == 'Clinton')]
trump2016_df = voting_df[(voting_df['year'] == 2016) &
(voting_df['candidate'] == 'Trump')]
spread2016_df = pd.merge(trump2016_df, clinton2016_df, on='county', suffixes=('_trump', '_clinton'))
spread2016_df['dem_lead'] = spread2016_df['pct_clinton'] - spread2016_df['pct_trump']
spread2016_df = spread2016_df[['county', 'pct_clinton', 'pct_trump',
'dem_lead', 'num_clinton', 'num_trump']
].sort_values(by='dem_lead', ascending=False)
# Avg 2000-2012 Democrat-Republican Difference by county
dem_df = voting_df[(voting_df['year'] < 2016) & (voting_df['party'] == 'D')]
dem_df = dem_df.groupby(['county'], as_index=False
).agg({'pct':'mean'})
repub_df = voting_df[(voting_df['year'] < 2016) & (voting_df['party'] == 'R')]
repub_df = repub_df.groupby(['county'], as_index=False
).agg({'pct':'mean'})
spread2000_df = pd.merge(dem_df, repub_df[['pct','county']],
on='county', suffixes=('_dem', '_repub'))
spread2000_df['dem_lead'] = spread2000_df['pct_dem'] - spread2000_df['pct_repub']
#Merge both together
spreadall_df = pd.merge(spread2016_df, spread2000_df, on='county', suffixes=('_2016', '_2000_2012'))
spreadall_df.head()
county | pct_clinton | pct_trump | dem_lead_2016 | num_clinton | num_trump | pct_dem | pct_repub | dem_lead_2000_2012 | |
---|---|---|---|---|---|---|---|---|---|
0 | Menominee | 78.4 | 21.0 | 57.4 | 1003.0 | 269.0 | 83.225 | 15.200 | 68.025 |
1 | Dane | 70.4 | 23.1 | 47.3 | 217506.0 | 71270.0 | 67.725 | 29.725 | 38.000 |
2 | Milwaukee | 65.6 | 28.6 | 37.0 | 288986.0 | 126091.0 | 63.675 | 34.500 | 29.175 |
3 | Iowa | 54.5 | 39.3 | 15.2 | 6669.0 | 4809.0 | 60.900 | 37.200 | 23.700 |
4 | Rock | 51.7 | 41.4 | 10.3 | 39336.0 | 31483.0 | 60.050 | 38.150 | 21.900 |
After all the calculations above, this is the final dataframe. The relevant columns for the graphics are the turnout
columns which show the fraction of voting age people that voted in the relevant years, and the dem_lead
columns which show the margins between parties for various elections. The demlead_change
column shows the difference between the Democratic party lead over the 2000-2012 time period and the lead in 2016.
#Join with turnout_df
turnout_df = pd.merge(turnout_df, spreadall_df, on='county')
#Calculate change in Democratic margin, 2016 - average 2000 to 2012
turnout_df['demlead_change'] = turnout_df['dem_lead_2016'] - turnout_df['dem_lead_2000_2012']
turnout_df.sort_values(by='num_2016', ascending=False, inplace=True)
turnout_df.head(10)
county | num_2016 | turnout_2016 | turnout_2000_2012 | turnout_change | pct_clinton | pct_trump | dem_lead_2016 | num_clinton | num_trump | pct_dem | pct_repub | dem_lead_2000_2012 | demlead_change | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
40 | Milwaukee | 440698.0 | 0.654155 | 0.703174 | -0.049019 | 65.6 | 28.6 | 37.0 | 288986.0 | 126091.0 | 63.675 | 34.500 | 29.175 | 7.825 |
12 | Dane | 309096.0 | 0.830200 | 0.757048 | 0.073152 | 70.4 | 23.1 | 47.3 | 217506.0 | 71270.0 | 67.725 | 29.725 | 38.000 | 9.300 |
67 | Waukesha | 237588.0 | 0.808631 | 0.801177 | 0.007454 | 33.3 | 60.0 | -26.7 | 79199.0 | 142519.0 | 33.125 | 65.425 | -32.300 | 5.600 |
4 | Brown | 128965.0 | 0.706793 | 0.680866 | 0.025927 | 41.4 | 52.1 | -10.7 | 53358.0 | 67192.0 | 48.150 | 49.975 | -1.825 | -8.875 |
44 | Outagamie | 95162.0 | 0.724051 | 0.686042 | 0.038009 | 40.1 | 54.2 | -14.1 | 38117.0 | 51579.0 | 47.750 | 49.950 | -2.200 | -11.900 |
51 | Racine | 94921.0 | 0.666838 | 0.686983 | -0.020146 | 44.8 | 49.1 | -4.3 | 42506.0 | 46620.0 | 49.675 | 48.625 | 1.050 | -5.350 |
70 | Winnebago | 87140.0 | 0.667714 | 0.678759 | -0.011046 | 42.5 | 49.9 | -7.4 | 37054.0 | 43447.0 | 49.200 | 48.350 | 0.850 | -8.250 |
66 | Washington | 77551.0 | 0.776092 | 0.741324 | 0.034768 | 26.9 | 66.7 | -39.8 | 20854.0 | 51729.0 | 30.700 | 67.625 | -36.925 | -2.875 |
29 | Kenosha | 76894.0 | 0.636830 | 0.639971 | -0.003141 | 46.5 | 46.9 | -0.4 | 35770.0 | 36025.0 | 54.275 | 43.800 | 10.475 | -10.875 |
53 | Rock | 76056.0 | 0.649524 | 0.670892 | -0.021368 | 51.7 | 41.4 | 10.3 | 39336.0 | 31483.0 | 60.050 | 38.150 | 21.900 | -11.600 |
This was my first attempt at a comprehensive plot that meets all the requirements. Average 2000-2012
turnout is on the x-axis, and 2016
turnout is on the y-axis. Any counties above/below the 45 degree
line had better/worse turnout in 2016
relative to the average. The area of each of the circles is proportional to the number of votes cast, and the color corresponds with the vote margin.
It's clear that Milwaukee underperformed in this election. If Milwaukee voted at it's recent historical average, Wisconsin would probably be a blue state.
Unfortunately, this plot doesn't do good job of communicating the shift in the vote margin relative to the past. For that, I created the next plot.
#http://stackoverflow.com/questions/37401872/custom-continuous-color-map-in-matplotlib
import matplotlib.colors as clr
#Find midpoint of data on 0-1 scale:
vmin = turnout_df['dem_lead_2016'].min()
vmax = turnout_df['dem_lead_2016'].max()
mid = 1 - vmax/(vmax + abs(vmin))
#Construct colormap using midpoint:
cmap = clr.LinearSegmentedColormap.from_list('red_blue',
[(0, '#EF3B2C'), (mid, '#FFFFFF'), (1,'#08519C')], N=256)
fig, ax = plt.subplots(figsize=(13,9))
#label='_nolegend_'
plt.scatter(x=turnout_df['turnout_2000_2012'], y=turnout_df['turnout_2016'],
s=turnout_df['num_2016']/100, marker='o', alpha=1.0, label='Votes',
c=turnout_df['dem_lead_2016'], cmap=cmap, edgecolors='gray')
ax.set_xlim(0.55, 0.85)
ax.set_ylim(0.55, 0.85)
#s=20
# Create X points
x = pd.DataFrame({'line': np.linspace(0, 1, 100)})
plt.plot(x, x, 'k--', alpha=0.9, label='Equal', color='gray')
A = turnout_df['turnout_2000_2012']
B = turnout_df['turnout_2016']
C = turnout_df['county']
D = turnout_df['num_2016']
for a,b,c,d in zip(A, B, C, D):
if d > 70000: #Annotate large counties
ax.annotate('%s' % c, xy=(a,b), textcoords='data')
plt.xlabel('Average Voter Turnout, 2000-2012 (% VAP)')
plt.ylabel('Voter Turnout, 2016 (% VAP)')
legend = plt.legend(loc='upper left')
legend.legendHandles[1]._sizes = [40]
plt.colorbar(shrink=0.5, pad=0.03, label='Democratic Margin', format='%.0f%%')
plt.show()
Ok, here is a different approach that matches all of my initial criteria for a visualization.
It's easy to build a graphic that shows a difference relative to Obama, but that's not very informative. Wisconsin has voted for the Democratic nominee in every election since 1984, so Trump's win should represent a deviation from deeper historical averages, not just a deviation from Obama's results. That's what the axes of this graphic try to show, with the x and y axes showing changes from the 2000-2012 averages.
fig, ax = plt.subplots(figsize=(13.5,9)) #figsize=(12,10)
plt.scatter(x=turnout_df['demlead_change'], y=turnout_df['turnout_change']*100,
s=turnout_df['num_2016']/100, marker='o', alpha=1.0, label='Votes',
c=turnout_df['dem_lead_2016'], cmap=cmap, edgecolors='gray')
#http://stackoverflow.com/questions/31556446
# Move left y-axis and bottim x-axis to zero:
ax.spines['bottom'].set_position('zero')
ax.spines['left'].set_position('zero')
# Eliminate upper and right axes
ax.spines['right'].set_color('none')
ax.spines['top'].set_color('none')
# Show ticks in the left and lower axes only
ax.xaxis.set_ticks_position('bottom')
ax.yaxis.set_ticks_position('left')
ax.set_ylim(-10, 10)
A = turnout_df['demlead_change']
B = turnout_df['turnout_change']*100
C = turnout_df['county']
D = turnout_df['num_2016']
for a,b,c,d in zip(A, B, C, D):
if d > 70000: #Annotate large counties >90000
ax.annotate('%s' % c, xy=(a,b), textcoords='data')
plt.xlabel('Change from Average 2000-2012 Democratic Margin (%)', labelpad=250)
plt.ylabel('Change from Average 2000-2012 Turnout (%)', labelpad=400)
legend = plt.legend(loc='upper left')
legend.legendHandles[0]._sizes = [40]
#http://stackoverflow.com/questions/5306756
plt.colorbar(shrink=0.5, pad=0.03, label='Democratic Margin', format='%.0f%%') #orientation='horizontal'
plt.show()
The size and colors of the circles do a fairly good job of communicating what happened this election. The area of the circle is proportional to the votes cast, and the color communicates the margin for the Democrats so it's pretty easy to get an idea of the crucial counties and the role they played for each party.
The axes in this plot show what changed relative to the past. On the x-axis, I have the change from the average democratic margin, which does a good job of communicating the shift in support within a county. This axis makes it clear that there was a massive shift to the Republican party in small rural counties. There was even a shift towards Trump in many larger suburban counties, even if they were still won by Clinton.
Interestingly, Waukesha and Ozaukee counties, which are fairly rich, well educated suburban counties of Milwaukee, actually saw 5-10%
shifts towards the Democrats. This fits with Nate Silver's analysis that suggested level of education was the most important predictor of voting behavior in this election. Waukesha county is one of the most conservative counties in the country, but it's dominated by establishment conservatives that may have clashed with Trump's views.
The y-axis shows change in voter turnout
relative to the 2000-2012
average. It's clear that Milwaukee really underperformed here, with a 5%
drop in turnout relative to the average. Part of this is probably because Obama wasn't on the ballot, but the historical average includes the Gore and Kerry elections as well, which should help mitigate the Obama effect.
One thing I haven't heard anyone discuss is the fact that the Democrat's margin in Milwaukee was up by 7.8%
compared with the 2000-2012 average. A margin increase of 7.8%
would lead to a larger net gain in votes than an increase in turnout of the same percent, so this margin increase probably canceled out the effects of lower turnout (although it would be ideal to have increases in both).
Dane County had an amazing 7.3%
increase in turnout along with a large 10%
increase in Democratic support. This was almost enough to cancel out the poorer performance elsewhere in the state. But in the end, the large shift to Trump in rural counties overwhelmed the Dane County effect.
Dane County performed much better than Milwaukee County when it comes to voter turnout and margins. It might make sense to try to apply lessons learned from Dane County to Milwaukee, although they are two very different places. There are much higher levels of poverty in Milwaukee, which might magnify the effects of the recent voter ID laws. Anecdotally, there seemed to be less enthusiasm for Clinton as well.
Madison is a younger, more progressive city, and probably had more resources to put into get out the vote efforts. It also has a large student population and one of the top public universities in the world. For more thoughts on the causes of the voting disparities, see this article.
Whatever the cause, I think it makes sense to study the differences between the two counties to look for ways to improve in the future. Things are only going to get more difficult over the coming years with more voter suppression efforts likely at the state and now national level.
#Data output script for interactive visualization.
'''
# Calculate voter turnouts by county
out_df = pd.pivot_table(voting_df, values='num', index=['year', 'county'], aggfunc=np.sum)
out_df = out_df.reset_index()
out_df = pd.merge(out_df, population2009_df, on='county', suffixes=('', '_2009'))
out_df = pd.merge(out_df, population2014_df, on='county', suffixes=('', '_2014'))
out_df.rename(columns={'CVAP_EST': 'CVAP_EST_2009',
'num': 'county_num',
'CVAP_MOE': 'CVAP_MOE_2009'}, inplace=True)
def turnout_frac(row):
if row['year'] < 2012:
return row['county_num']/row['CVAP_EST_2009']
else:
return row['county_num']/row['CVAP_EST_2014']
out_df['turnout'] = out_df.apply(turnout_frac, axis=1)
out_df = out_df[['year', 'county', 'county_num', 'turnout']]
#out_df.head(30)
#out_df.head(30)
#Calculate 2016 Democrate-Republican Difference, by county
demout_df = voting_df[(voting_df['party'] == 'D')]
repout_df = voting_df[(voting_df['party'] == 'R')]
spreadout_df = pd.merge(repout_df, demout_df, on=['year', 'county'], suffixes=('_rep', '_dem'))
#spreadout_df.head(20)
spreadout_df.head()
spreadout_df['dem_lead'] = spreadout_df['pct_dem'] - spreadout_df['pct_rep']
spreadout_df = spreadout_df[['year', 'county', 'pct_dem', 'pct_rep',
'dem_lead', 'num_dem', 'num_rep']
] #.sort_values(by='dem_lead', ascending=False)
out_df = pd.merge(out_df, spreadout_df, on=['year', 'county'])
#out_df.round(4) #{'turnout':4}
#np.round(out_df, decimals = 4)
#Round isn't working. . .
out_df['turnout'] = out_df['turnout'].apply(lambda x: round(x, 4) )
#out_df.head() #.dtypes #.head()
#out_df = out_df.groupby(['year', 'county'])
#out_df.head(30)
#out_df.tail(30)
#Output to current directory:
out_df.to_csv('./county_results_20002016.csv', index=False)
'''