Visualizing the Presidential Election Results¶

Like many people, I'm still recovering from the results of the election. A big part of the process for me is coming up with an accurate description of what actually happened, so this post attempts to do so by building a few visualizations.

Building a graphic that clearly communicates the election results is difficult because you need to do a few things simultaneously:

Use a high enough data resolution to show sub-state trends. County level data does a good job here.
Communicate the size of the vote in each county so we know how consequential it was for the election. If the data is at the national level, weight the vote totals by their contribution to the electoral college.
Show both the voter turnout and margins for each party. Either can swing an election.
Compare the outcomes with those of past elections so we can get an idea of what changed this time. Ideally, compare with the average of the past few elections so you're not comparing against a single candidate or a single point in time.

The best graphics I have seen come from a Washington Post article looking at the urban/rural divide, which gets pretty close to meeting all of the above requirements. Unfortunately, their graphics communicate changes in voter turnout using line thickness, which makes it hard to discern differences over time.

Below I attempt to accomplish most of those objectives using county level data from Wisconsin.

Getting the data¶

The county level data for the 2000-2016 elections is from David Leip's Atlas of Presidential Elections. There's a script for getting the state level data below.

The voting age population (VAP) data is from the American Community Survey. I downloaded county level data averaged over 2005-2009, and 2010-2014 to try to account for population changes. I calculated turnout by dividing the sum of the votes by the voting age population for each county. All of the input data is available here.

In [2]:

%matplotlib inline

import pandas as pd
import os 
import glob
import matplotlib.pyplot as plt
import matplotlib
import numpy as np

matplotlib.style.use('ggplot')  #'ggplot' 'fivethirtyeight'  'seaborn-paper'

#pd.set_option('max_colwidth', 400)

In [ ]:

#Optional script for getting state level data:
'''
import requests
from bs4 import BeautifulSoup
import csv

years = [2000, 2004, 2008, 2012, 2016]

#Iterate through years 2000-2016, by 4
base_url = 'http://uselectionatlas.org/RESULTS/datagraph.php?year={year}&fips=55&f=1&off=0&elect=0'

row_list = []

for year in years:

    url = base_url.format(year=year)
    r = requests.get(url) #Encoding is UTF-8

    if r.status_code == 200:

        soup = BeautifulSoup(r.text, 'html.parser')
        tables = soup.findAll('table')[:-1]  #Ignore extra table at end
        county = None

        for table in tables:
            rows = table.findAll('tr')

            for row in rows:
                data = row.findAll('td')

                for element in data:
                    if element.get('style') == 'width:100px':
                        county = element.get_text(strip=True)
                        data.remove(element)

                candidate = data[0].get_text(strip=True)
                pct = data[1].get_text(strip=True)
                num = data[2].get_text(strip=True)

                row = {'year': year, 'county': county, 'candidate': candidate, 'pct': pct, 'num':num}
                row_list.append(row)
                
        print 'Completed {0}.'.format(year)

    else:
        print "Failed to access url: " + url

    
with open('county_data_wi_2000-2016.csv', 'wb') as csvfile:
    fieldnames = ['year', 'county', 'candidate', 'pct', 'num']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(row_list)
''' 

In [3]:

# Voting behavior by county in WI
# Source: http://uselectionatlas.org/RESULTS/datagraph.php?year=2016&fips=55&f=1&off=0&elect=0

voting_df = pd.read_csv('./data/voting/county_data_wi_2000-2016.csv')

#print len(voting_df['county'].drop_duplicates())
# 72 Counties

rs = ['Bush', 'McCain', 'Romney', 'Trump']
ds = ['Gore', 'Kerry', 'Obama', 'Clinton']

def party(element):
    if element in rs:
        return 'R'
    elif element in ds:
        return 'D'
    else:
        return 'O'

voting_df['party'] = voting_df['candidate'].apply(party)

voting_df.replace({'%': '', ',': ''}, inplace=True, regex=True)
voting_df[['pct','num']] = voting_df[['pct','num']].astype(float)  #.apply(pd.to_numeric)
#voting_df.dtypes
voting_df.head(10)

Out[3]:

	year	county	candidate	pct	num	party
0	2000	Adams	Gore	52.9	4826.0	D
1	2000	Adams	Bush	43.0	3920.0	R
2	2000	Adams	Nader	2.4	217.0	O
3	2000	Adams	Other	1.7	153.0	O
4	2000	Ashland	Gore	55.2	4356.0	D
5	2000	Ashland	Bush	38.5	3038.0	R
6	2000	Ashland	Nader	5.6	440.0	O
7	2000	Ashland	Other	0.7	56.0	O
8	2000	Barron	Gore	44.9	8928.0	D
9	2000	Barron	Bush	49.5	9848.0	R

In [4]:

# Voting age population dataframe
# Source: http://www.census.gov/rdo/data/voting_age_population_by_citizenship_and_race_cvap.html

# 2010-2014 Averaged data
population2014_df = pd.read_csv('./data/CVAP_CSV_Format_2010-2014/County.csv')

population2014_df = population2014_df[(population2014_df['LNTITLE'] == 'Total') & 
                              (population2014_df['GEONAME'].str.contains('Wisconsin'))]
population2014_df = population2014_df[['GEONAME', 'CVAP_EST', 'CVAP_MOE']]
population2014_df.replace({' County, Wisconsin': ''}, inplace=True, regex=True)
population2014_df.rename(columns={'GEONAME': 'county'}, inplace=True)
                        
#print len(population_df['GEONAME'].drop_duplicates())
# 72 counties

# 2005 to 2009 average
population2009_df = pd.read_csv('./data/CVAP_CSV_Format_2005-2009/County.csv')
population2009_df = population2009_df[(population2009_df['LNTITLE'] == 'Total') & 
                              (population2009_df['GEONAME'].str.contains('Wisconsin'))]
population2009_df = population2009_df[['GEONAME', 'CVAP_EST', 'CVAP_MOE']]
population2009_df.replace({' County, Wisconsin': ''}, inplace=True, regex=True)
population2009_df.rename(columns={'GEONAME': 'county'}, inplace=True)

# print len(population2008_df['GEONAME'].drop_duplicates())
# 72

population2009_df.head(10)

Out[4]:

	county	CVAP_EST	CVAP_MOE
39624	Adams	16640	81
39637	Ashland	12420	45
39650	Barron	35260	94
39663	Bayfield	11945	51
39676	Brown	176420	491
39689	Buffalo	10470	32
39702	Burnett	12925	26
39715	Calumet	31990	123
39728	Chippewa	45520	82
39741	Clark	23525	40

In [5]:

# Calculate voter turnouts by county
turnout_df = pd.pivot_table(voting_df, values='num', index=['year', 'county'], aggfunc=np.sum)
turnout_df = turnout_df.reset_index()

turnout_df = pd.merge(turnout_df, population2009_df, on='county', suffixes=('', '_2009'))
turnout_df = pd.merge(turnout_df, population2014_df, on='county', suffixes=('', '_2014'))
turnout_df.rename(columns={'CVAP_EST': 'CVAP_EST_2009',
                           'CVAP_MOE': 'CVAP_MOE_2009'}, inplace=True)

def turnout_frac(row):
    if row['year'] < 2012:
        return row['num']/row['CVAP_EST_2009']
    else:
        return row['num']/row['CVAP_EST_2014']

turnout_df['turnout'] = turnout_df.apply(turnout_frac, axis=1)

# Mean turnout, 2000-2012
turnout2012_df = turnout_df[turnout_df['year'] < 2016].groupby(['county'], as_index=False
                        ).agg({'turnout':'mean'})

# Turnout, 2016
turnout2016_df = turnout_df[turnout_df['year'] == 2016]

turnout_df = pd.merge(turnout2012_df, turnout2016_df, on='county', suffixes=('_2000_2012', '_2016'))
turnout_df.rename(columns={'num': 'num_2016'}, inplace=True)
turnout_df = turnout_df[['county', 'num_2016', 'turnout_2016', 'turnout_2000_2012']]
turnout_df['turnout_change'] = turnout_df['turnout_2016'] - turnout_df['turnout_2000_2012']
    
turnout_df.sort_values(by='num_2016', ascending=False).head(10)

Out[5]:

	county	num_2016	turnout_2016	turnout_2000_2012	turnout_change
40	Milwaukee	440698.0	0.654155	0.703174	-0.049019
12	Dane	309096.0	0.830200	0.757048	0.073152
67	Waukesha	237588.0	0.808631	0.801177	0.007454
4	Brown	128965.0	0.706793	0.680866	0.025927
44	Outagamie	95162.0	0.724051	0.686042	0.038009
51	Racine	94921.0	0.666838	0.686983	-0.020146
70	Winnebago	87140.0	0.667714	0.678759	-0.011046
66	Washington	77551.0	0.776092	0.741324	0.034768
29	Kenosha	76894.0	0.636830	0.639971	-0.003141
53	Rock	76056.0	0.649524	0.670892	-0.021368

In [6]:

# Calculate 2016 Democrate-Republican Difference, by county
clinton2016_df = voting_df[(voting_df['year'] == 2016) & 
                           (voting_df['candidate'] == 'Clinton')]

trump2016_df = voting_df[(voting_df['year'] == 2016) & 
                         (voting_df['candidate'] == 'Trump')]

spread2016_df = pd.merge(trump2016_df, clinton2016_df, on='county', suffixes=('_trump', '_clinton'))

spread2016_df['dem_lead'] = spread2016_df['pct_clinton'] - spread2016_df['pct_trump']

spread2016_df = spread2016_df[['county', 'pct_clinton', 'pct_trump', 
                               'dem_lead', 'num_clinton', 'num_trump']
                             ].sort_values(by='dem_lead', ascending=False)


# Avg 2000-2012 Democrat-Republican Difference by county 
dem_df = voting_df[(voting_df['year'] < 2016) & (voting_df['party'] == 'D')]
dem_df = dem_df.groupby(['county'], as_index=False
                        ).agg({'pct':'mean'})

repub_df = voting_df[(voting_df['year'] < 2016) & (voting_df['party'] == 'R')]
repub_df = repub_df.groupby(['county'], as_index=False
                            ).agg({'pct':'mean'})

spread2000_df = pd.merge(dem_df, repub_df[['pct','county']],
                         on='county', suffixes=('_dem', '_repub'))
spread2000_df['dem_lead'] = spread2000_df['pct_dem'] - spread2000_df['pct_repub']


#Merge both together
spreadall_df = pd.merge(spread2016_df, spread2000_df, on='county', suffixes=('_2016', '_2000_2012'))
spreadall_df.head()

Out[6]:

	county	pct_clinton	pct_trump	dem_lead_2016	num_clinton	num_trump	pct_dem	pct_repub	dem_lead_2000_2012
0	Menominee	78.4	21.0	57.4	1003.0	269.0	83.225	15.200	68.025
1	Dane	70.4	23.1	47.3	217506.0	71270.0	67.725	29.725	38.000
2	Milwaukee	65.6	28.6	37.0	288986.0	126091.0	63.675	34.500	29.175
3	Iowa	54.5	39.3	15.2	6669.0	4809.0	60.900	37.200	23.700
4	Rock	51.7	41.4	10.3	39336.0	31483.0	60.050	38.150	21.900

The Final Dataframe¶

After all the calculations above, this is the final dataframe. The relevant columns for the graphics are the turnout columns which show the fraction of voting age people that voted in the relevant years, and the dem_lead columns which show the margins between parties for various elections. The demlead_change column shows the difference between the Democratic party lead over the 2000-2012 time period and the lead in 2016.

In [7]:

#Join with turnout_df
turnout_df = pd.merge(turnout_df, spreadall_df, on='county')

#Calculate change in Democratic margin, 2016 - average 2000 to 2012
turnout_df['demlead_change'] = turnout_df['dem_lead_2016'] - turnout_df['dem_lead_2000_2012']

turnout_df.sort_values(by='num_2016', ascending=False, inplace=True)
turnout_df.head(10)

Out[7]:

	county	num_2016	turnout_2016	turnout_2000_2012	turnout_change	pct_clinton	pct_trump	dem_lead_2016	num_clinton	num_trump	pct_dem	pct_repub	dem_lead_2000_2012	demlead_change
40	Milwaukee	440698.0	0.654155	0.703174	-0.049019	65.6	28.6	37.0	288986.0	126091.0	63.675	34.500	29.175	7.825
12	Dane	309096.0	0.830200	0.757048	0.073152	70.4	23.1	47.3	217506.0	71270.0	67.725	29.725	38.000	9.300
67	Waukesha	237588.0	0.808631	0.801177	0.007454	33.3	60.0	-26.7	79199.0	142519.0	33.125	65.425	-32.300	5.600
4	Brown	128965.0	0.706793	0.680866	0.025927	41.4	52.1	-10.7	53358.0	67192.0	48.150	49.975	-1.825	-8.875
44	Outagamie	95162.0	0.724051	0.686042	0.038009	40.1	54.2	-14.1	38117.0	51579.0	47.750	49.950	-2.200	-11.900
51	Racine	94921.0	0.666838	0.686983	-0.020146	44.8	49.1	-4.3	42506.0	46620.0	49.675	48.625	1.050	-5.350
70	Winnebago	87140.0	0.667714	0.678759	-0.011046	42.5	49.9	-7.4	37054.0	43447.0	49.200	48.350	0.850	-8.250
66	Washington	77551.0	0.776092	0.741324	0.034768	26.9	66.7	-39.8	20854.0	51729.0	30.700	67.625	-36.925	-2.875
29	Kenosha	76894.0	0.636830	0.639971	-0.003141	46.5	46.9	-0.4	35770.0	36025.0	54.275	43.800	10.475	-10.875
53	Rock	76056.0	0.649524	0.670892	-0.021368	51.7	41.4	10.3	39336.0	31483.0	60.050	38.150	21.900	-11.600

An Initial Attempt¶

This was my first attempt at a comprehensive plot that meets all the requirements. Average 2000-2012 turnout is on the x-axis, and 2016 turnout is on the y-axis. Any counties above/below the 45 degree line had better/worse turnout in 2016 relative to the average. The area of each of the circles is proportional to the number of votes cast, and the color corresponds with the vote margin.

It's clear that Milwaukee underperformed in this election. If Milwaukee voted at it's recent historical average, Wisconsin would probably be a blue state.

Unfortunately, this plot doesn't do good job of communicating the shift in the vote margin relative to the past. For that, I created the next plot.

In [9]:

#http://stackoverflow.com/questions/37401872/custom-continuous-color-map-in-matplotlib
import matplotlib.colors as clr

#Find midpoint of data on 0-1 scale:
vmin = turnout_df['dem_lead_2016'].min()
vmax = turnout_df['dem_lead_2016'].max()
mid = 1 - vmax/(vmax + abs(vmin))

#Construct colormap using midpoint:
cmap = clr.LinearSegmentedColormap.from_list('red_blue', 
                        [(0, '#EF3B2C'), (mid, '#FFFFFF'), (1,'#08519C')], N=256) 

fig, ax = plt.subplots(figsize=(13,9))  

#label='_nolegend_'
plt.scatter(x=turnout_df['turnout_2000_2012'], y=turnout_df['turnout_2016'], 
            s=turnout_df['num_2016']/100, marker='o', alpha=1.0, label='Votes', 
            c=turnout_df['dem_lead_2016'], cmap=cmap, edgecolors='gray') 

ax.set_xlim(0.55, 0.85)
ax.set_ylim(0.55, 0.85)
#s=20

# Create X points
x = pd.DataFrame({'line': np.linspace(0, 1, 100)})
plt.plot(x, x, 'k--', alpha=0.9, label='Equal', color='gray')

A = turnout_df['turnout_2000_2012']
B = turnout_df['turnout_2016']
C = turnout_df['county']
D = turnout_df['num_2016']

for a,b,c,d in zip(A, B, C, D):
    if d > 70000: #Annotate large counties
        ax.annotate('%s' % c, xy=(a,b), textcoords='data') 
        
plt.xlabel('Average Voter Turnout, 2000-2012 (% VAP)')
plt.ylabel('Voter Turnout, 2016 (% VAP)')

legend = plt.legend(loc='upper left')
legend.legendHandles[1]._sizes = [40]

plt.colorbar(shrink=0.5, pad=0.03, label='Democratic Margin', format='%.0f%%')

plt.show()

A Better Approach¶

Ok, here is a different approach that matches all of my initial criteria for a visualization.

It's easy to build a graphic that shows a difference relative to Obama, but that's not very informative. Wisconsin has voted for the Democratic nominee in every election since 1984, so Trump's win should represent a deviation from deeper historical averages, not just a deviation from Obama's results. That's what the axes of this graphic try to show, with the x and y axes showing changes from the 2000-2012 averages.

In [10]:

fig, ax = plt.subplots(figsize=(13.5,9))  #figsize=(12,10)

plt.scatter(x=turnout_df['demlead_change'], y=turnout_df['turnout_change']*100, 
            s=turnout_df['num_2016']/100, marker='o', alpha=1.0, label='Votes', 
            c=turnout_df['dem_lead_2016'], cmap=cmap, edgecolors='gray')

#http://stackoverflow.com/questions/31556446
# Move left y-axis and bottim x-axis to zero:
ax.spines['bottom'].set_position('zero')
ax.spines['left'].set_position('zero')

# Eliminate upper and right axes
ax.spines['right'].set_color('none')
ax.spines['top'].set_color('none')

# Show ticks in the left and lower axes only
ax.xaxis.set_ticks_position('bottom')
ax.yaxis.set_ticks_position('left')

ax.set_ylim(-10, 10)


A = turnout_df['demlead_change']
B = turnout_df['turnout_change']*100
C = turnout_df['county']
D = turnout_df['num_2016']

for a,b,c,d in zip(A, B, C, D):
    if d > 70000: #Annotate large counties  >90000
        ax.annotate('%s' % c, xy=(a,b), textcoords='data') 
        
plt.xlabel('Change from Average 2000-2012 Democratic Margin (%)', labelpad=250)
plt.ylabel('Change from Average 2000-2012 Turnout (%)', labelpad=400)

legend = plt.legend(loc='upper left')
legend.legendHandles[0]._sizes = [40]

#http://stackoverflow.com/questions/5306756
plt.colorbar(shrink=0.5, pad=0.03, label='Democratic Margin', format='%.0f%%') #orientation='horizontal'


plt.show()

So What Happened?¶

The size and colors of the circles do a fairly good job of communicating what happened this election. The area of the circle is proportional to the votes cast, and the color communicates the margin for the Democrats so it's pretty easy to get an idea of the crucial counties and the role they played for each party.

The axes in this plot show what changed relative to the past. On the x-axis, I have the change from the average democratic margin, which does a good job of communicating the shift in support within a county. This axis makes it clear that there was a massive shift to the Republican party in small rural counties. There was even a shift towards Trump in many larger suburban counties, even if they were still won by Clinton.

Interestingly, Waukesha and Ozaukee counties, which are fairly rich, well educated suburban counties of Milwaukee, actually saw 5-10% shifts towards the Democrats. This fits with Nate Silver's analysis that suggested level of education was the most important predictor of voting behavior in this election. Waukesha county is one of the most conservative counties in the country, but it's dominated by establishment conservatives that may have clashed with Trump's views.

The y-axis shows change in voter turnout relative to the 2000-2012 average. It's clear that Milwaukee really underperformed here, with a 5% drop in turnout relative to the average. Part of this is probably because Obama wasn't on the ballot, but the historical average includes the Gore and Kerry elections as well, which should help mitigate the Obama effect.

One thing I haven't heard anyone discuss is the fact that the Democrat's margin in Milwaukee was up by 7.8% compared with the 2000-2012 average. A margin increase of 7.8% would lead to a larger net gain in votes than an increase in turnout of the same percent, so this margin increase probably canceled out the effects of lower turnout (although it would be ideal to have increases in both).

Dane County had an amazing 7.3% increase in turnout along with a large 10% increase in Democratic support. This was almost enough to cancel out the poorer performance elsewhere in the state. But in the end, the large shift to Trump in rural counties overwhelmed the Dane County effect.

Why did Dane County do so much better?¶

Dane County performed much better than Milwaukee County when it comes to voter turnout and margins. It might make sense to try to apply lessons learned from Dane County to Milwaukee, although they are two very different places. There are much higher levels of poverty in Milwaukee, which might magnify the effects of the recent voter ID laws. Anecdotally, there seemed to be less enthusiasm for Clinton as well.

Madison is a younger, more progressive city, and probably had more resources to put into get out the vote efforts. It also has a large student population and one of the top public universities in the world. For more thoughts on the causes of the voting disparities, see this article.

Whatever the cause, I think it makes sense to study the differences between the two counties to look for ways to improve in the future. Things are only going to get more difficult over the coming years with more voter suppression efforts likely at the state and now national level.

In [122]:

#Data output script for interactive visualization.  

'''
# Calculate voter turnouts by county
out_df = pd.pivot_table(voting_df, values='num', index=['year', 'county'], aggfunc=np.sum)
out_df = out_df.reset_index()

out_df = pd.merge(out_df, population2009_df, on='county', suffixes=('', '_2009'))
out_df = pd.merge(out_df, population2014_df, on='county', suffixes=('', '_2014'))
out_df.rename(columns={'CVAP_EST': 'CVAP_EST_2009',
                       'num': 'county_num',
                       'CVAP_MOE': 'CVAP_MOE_2009'}, inplace=True)

def turnout_frac(row):
    if row['year'] < 2012:
        return row['county_num']/row['CVAP_EST_2009']
    else:
        return row['county_num']/row['CVAP_EST_2014']

out_df['turnout'] = out_df.apply(turnout_frac, axis=1)

out_df = out_df[['year', 'county', 'county_num', 'turnout']]

#out_df.head(30)

#out_df.head(30)

#Calculate 2016 Democrate-Republican Difference, by county
demout_df = voting_df[(voting_df['party'] == 'D')]

repout_df = voting_df[(voting_df['party'] == 'R')]

spreadout_df = pd.merge(repout_df, demout_df, on=['year', 'county'], suffixes=('_rep', '_dem'))


#spreadout_df.head(20)

spreadout_df.head()

spreadout_df['dem_lead'] = spreadout_df['pct_dem'] - spreadout_df['pct_rep']

spreadout_df = spreadout_df[['year', 'county', 'pct_dem', 'pct_rep', 
                               'dem_lead', 'num_dem', 'num_rep']
                             ]  #.sort_values(by='dem_lead', ascending=False)

out_df = pd.merge(out_df, spreadout_df, on=['year', 'county'])
#out_df.round(4) #{'turnout':4}
#np.round(out_df, decimals = 4)
#Round isn't working. . . 
out_df['turnout'] = out_df['turnout'].apply(lambda x: round(x, 4) )

#out_df.head()  #.dtypes #.head()

#out_df = out_df.groupby(['year', 'county'])
#out_df.head(30)
#out_df.tail(30)

#Output to current directory:
out_df.to_csv('./county_results_20002016.csv', index=False)
'''