Visualizing the Presidential Election Results

Like many people, I'm still recovering from the results of the election. A big part of the process for me is coming up with an accurate description of what actually happened, so this post attempts to do so by building a few visualizations.

Building a graphic that clearly communicates the election results is difficult because you need to do a few things simultaneously:

  1. Use a high enough data resolution to show sub-state trends. County level data does a good job here.

  2. Communicate the size of the vote in each county so we know how consequential it was for the election. If the data is at the national level, weight the vote totals by their contribution to the electoral college.

  3. Show both the voter turnout and margins for each party. Either can swing an election.

  4. Compare the outcomes with those of past elections so we can get an idea of what changed this time. Ideally, compare with the average of the past few elections so you're not comparing against a single candidate or a single point in time.

The best graphics I have seen come from a Washington Post article looking at the urban/rural divide, which gets pretty close to meeting all of the above requirements. Unfortunately, their graphics communicate changes in voter turnout using line thickness, which makes it hard to discern differences over time.

Below I attempt to accomplish most of those objectives using county level data from Wisconsin.

Getting the data

The county level data for the 2000-2016 elections is from David Leip's Atlas of Presidential Elections. There's a script for getting the state level data below.

The voting age population (VAP) data is from the American Community Survey. I downloaded county level data averaged over 2005-2009, and 2010-2014 to try to account for population changes. I calculated turnout by dividing the sum of the votes by the voting age population for each county. All of the input data is available here.

In [2]:
%matplotlib inline

import pandas as pd
import os 
import glob
import matplotlib.pyplot as plt
import matplotlib
import numpy as np

matplotlib.style.use('ggplot')  #'ggplot' 'fivethirtyeight'  'seaborn-paper'

#pd.set_option('max_colwidth', 400)
In [ ]:
#Optional script for getting state level data:
'''
import requests
from bs4 import BeautifulSoup
import csv

years = [2000, 2004, 2008, 2012, 2016]

#Iterate through years 2000-2016, by 4
base_url = 'http://uselectionatlas.org/RESULTS/datagraph.php?year={year}&fips=55&f=1&off=0&elect=0'

row_list = []

for year in years:

    url = base_url.format(year=year)
    r = requests.get(url) #Encoding is UTF-8

    if r.status_code == 200:

        soup = BeautifulSoup(r.text, 'html.parser')
        tables = soup.findAll('table')[:-1]  #Ignore extra table at end
        county = None

        for table in tables:
            rows = table.findAll('tr')

            for row in rows:
                data = row.findAll('td')

                for element in data:
                    if element.get('style') == 'width:100px':
                        county = element.get_text(strip=True)
                        data.remove(element)

                candidate = data[0].get_text(strip=True)
                pct = data[1].get_text(strip=True)
                num = data[2].get_text(strip=True)

                row = {'year': year, 'county': county, 'candidate': candidate, 'pct': pct, 'num':num}
                row_list.append(row)
                
        print 'Completed {0}.'.format(year)

    else:
        print "Failed to access url: " + url

    
with open('county_data_wi_2000-2016.csv', 'wb') as csvfile:
    fieldnames = ['year', 'county', 'candidate', 'pct', 'num']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(row_list)
''' 
In [3]:
# Voting behavior by county in WI
# Source: http://uselectionatlas.org/RESULTS/datagraph.php?year=2016&fips=55&f=1&off=0&elect=0

voting_df = pd.read_csv('./data/voting/county_data_wi_2000-2016.csv')

#print len(voting_df['county'].drop_duplicates())
# 72 Counties

rs = ['Bush', 'McCain', 'Romney', 'Trump']
ds = ['Gore', 'Kerry', 'Obama', 'Clinton']

def party(element):
    if element in rs:
        return 'R'
    elif element in ds:
        return 'D'
    else:
        return 'O'

voting_df['party'] = voting_df['candidate'].apply(party)

voting_df.replace({'%': '', ',': ''}, inplace=True, regex=True)
voting_df[['pct','num']] = voting_df[['pct','num']].astype(float)  #.apply(pd.to_numeric)
#voting_df.dtypes
voting_df.head(10)
Out[3]:
year county candidate pct num party
0 2000 Adams Gore 52.9 4826.0 D
1 2000 Adams Bush 43.0 3920.0 R
2 2000 Adams Nader 2.4 217.0 O
3 2000 Adams Other 1.7 153.0 O
4 2000 Ashland Gore 55.2 4356.0 D
5 2000 Ashland Bush 38.5 3038.0 R
6 2000 Ashland Nader 5.6 440.0 O
7 2000 Ashland Other 0.7 56.0 O
8 2000 Barron Gore 44.9 8928.0 D
9 2000 Barron Bush 49.5 9848.0 R
In [4]:
# Voting age population dataframe
# Source: http://www.census.gov/rdo/data/voting_age_population_by_citizenship_and_race_cvap.html

# 2010-2014 Averaged data
population2014_df = pd.read_csv('./data/CVAP_CSV_Format_2010-2014/County.csv')

population2014_df = population2014_df[(population2014_df['LNTITLE'] == 'Total') & 
                              (population2014_df['GEONAME'].str.contains('Wisconsin'))]
population2014_df = population2014_df[['GEONAME', 'CVAP_EST', 'CVAP_MOE']]
population2014_df.replace({' County, Wisconsin': ''}, inplace=True, regex=True)
population2014_df.rename(columns={'GEONAME': 'county'}, inplace=True)
                        
#print len(population_df['GEONAME'].drop_duplicates())
# 72 counties

# 2005 to 2009 average
population2009_df = pd.read_csv('./data/CVAP_CSV_Format_2005-2009/County.csv')
population2009_df = population2009_df[(population2009_df['LNTITLE'] == 'Total') & 
                              (population2009_df['GEONAME'].str.contains('Wisconsin'))]
population2009_df = population2009_df[['GEONAME', 'CVAP_EST', 'CVAP_MOE']]
population2009_df.replace({' County, Wisconsin': ''}, inplace=True, regex=True)
population2009_df.rename(columns={'GEONAME': 'county'}, inplace=True)

# print len(population2008_df['GEONAME'].drop_duplicates())
# 72

population2009_df.head(10)
Out[4]:
county CVAP_EST CVAP_MOE
39624 Adams 16640 81
39637 Ashland 12420 45
39650 Barron 35260 94
39663 Bayfield 11945 51
39676 Brown 176420 491
39689 Buffalo 10470 32
39702 Burnett 12925 26
39715 Calumet 31990 123
39728 Chippewa 45520 82
39741 Clark 23525 40
In [5]:
# Calculate voter turnouts by county
turnout_df = pd.pivot_table(voting_df, values='num', index=['year', 'county'], aggfunc=np.sum)
turnout_df = turnout_df.reset_index()

turnout_df = pd.merge(turnout_df, population2009_df, on='county', suffixes=('', '_2009'))
turnout_df = pd.merge(turnout_df, population2014_df, on='county', suffixes=('', '_2014'))
turnout_df.rename(columns={'CVAP_EST': 'CVAP_EST_2009',
                           'CVAP_MOE': 'CVAP_MOE_2009'}, inplace=True)

def turnout_frac(row):
    if row['year'] < 2012:
        return row['num']/row['CVAP_EST_2009']
    else:
        return row['num']/row['CVAP_EST_2014']

turnout_df['turnout'] = turnout_df.apply(turnout_frac, axis=1)

# Mean turnout, 2000-2012
turnout2012_df = turnout_df[turnout_df['year'] < 2016].groupby(['county'], as_index=False
                        ).agg({'turnout':'mean'})

# Turnout, 2016
turnout2016_df = turnout_df[turnout_df['year'] == 2016]

turnout_df = pd.merge(turnout2012_df, turnout2016_df, on='county', suffixes=('_2000_2012', '_2016'))
turnout_df.rename(columns={'num': 'num_2016'}, inplace=True)
turnout_df = turnout_df[['county', 'num_2016', 'turnout_2016', 'turnout_2000_2012']]
turnout_df['turnout_change'] = turnout_df['turnout_2016'] - turnout_df['turnout_2000_2012']
    
turnout_df.sort_values(by='num_2016', ascending=False).head(10)
Out[5]:
county num_2016 turnout_2016 turnout_2000_2012 turnout_change
40 Milwaukee 440698.0 0.654155 0.703174 -0.049019
12 Dane 309096.0 0.830200 0.757048 0.073152
67 Waukesha 237588.0 0.808631 0.801177 0.007454
4 Brown 128965.0 0.706793 0.680866 0.025927
44 Outagamie 95162.0 0.724051 0.686042 0.038009
51 Racine 94921.0 0.666838 0.686983 -0.020146
70 Winnebago 87140.0 0.667714 0.678759 -0.011046
66 Washington 77551.0 0.776092 0.741324 0.034768
29 Kenosha 76894.0 0.636830 0.639971 -0.003141
53 Rock 76056.0 0.649524 0.670892 -0.021368
In [6]:
# Calculate 2016 Democrate-Republican Difference, by county
clinton2016_df = voting_df[(voting_df['year'] == 2016) & 
                           (voting_df['candidate'] == 'Clinton')]

trump2016_df = voting_df[(voting_df['year'] == 2016) & 
                         (voting_df['candidate'] == 'Trump')]

spread2016_df = pd.merge(trump2016_df, clinton2016_df, on='county', suffixes=('_trump', '_clinton'))

spread2016_df['dem_lead'] = spread2016_df['pct_clinton'] - spread2016_df['pct_trump']

spread2016_df = spread2016_df[['county', 'pct_clinton', 'pct_trump', 
                               'dem_lead', 'num_clinton', 'num_trump']
                             ].sort_values(by='dem_lead', ascending=False)


# Avg 2000-2012 Democrat-Republican Difference by county 
dem_df = voting_df[(voting_df['year'] < 2016) & (voting_df['party'] == 'D')]
dem_df = dem_df.groupby(['county'], as_index=False
                        ).agg({'pct':'mean'})

repub_df = voting_df[(voting_df['year'] < 2016) & (voting_df['party'] == 'R')]
repub_df = repub_df.groupby(['county'], as_index=False
                            ).agg({'pct':'mean'})

spread2000_df = pd.merge(dem_df, repub_df[['pct','county']],
                         on='county', suffixes=('_dem', '_repub'))
spread2000_df['dem_lead'] = spread2000_df['pct_dem'] - spread2000_df['pct_repub']


#Merge both together
spreadall_df = pd.merge(spread2016_df, spread2000_df, on='county', suffixes=('_2016', '_2000_2012'))
spreadall_df.head()
Out[6]:
county pct_clinton pct_trump dem_lead_2016 num_clinton num_trump pct_dem pct_repub dem_lead_2000_2012
0 Menominee 78.4 21.0 57.4 1003.0 269.0 83.225 15.200 68.025
1 Dane 70.4 23.1 47.3 217506.0 71270.0 67.725 29.725 38.000
2 Milwaukee 65.6 28.6 37.0 288986.0 126091.0 63.675 34.500 29.175
3 Iowa 54.5 39.3 15.2 6669.0 4809.0 60.900 37.200 23.700
4 Rock 51.7 41.4 10.3 39336.0 31483.0 60.050 38.150 21.900

The Final Dataframe

After all the calculations above, this is the final dataframe. The relevant columns for the graphics are the turnout columns which show the fraction of voting age people that voted in the relevant years, and the dem_lead columns which show the margins between parties for various elections. The demlead_change column shows the difference between the Democratic party lead over the 2000-2012 time period and the lead in 2016.

In [7]:
#Join with turnout_df
turnout_df = pd.merge(turnout_df, spreadall_df, on='county')

#Calculate change in Democratic margin, 2016 - average 2000 to 2012
turnout_df['demlead_change'] = turnout_df['dem_lead_2016'] - turnout_df['dem_lead_2000_2012']

turnout_df.sort_values(by='num_2016', ascending=False, inplace=True)
turnout_df.head(10)
Out[7]:
county num_2016 turnout_2016 turnout_2000_2012 turnout_change pct_clinton pct_trump dem_lead_2016 num_clinton num_trump pct_dem pct_repub dem_lead_2000_2012 demlead_change
40 Milwaukee 440698.0 0.654155 0.703174 -0.049019 65.6 28.6 37.0 288986.0 126091.0 63.675 34.500 29.175 7.825
12 Dane 309096.0 0.830200 0.757048 0.073152 70.4 23.1 47.3 217506.0 71270.0 67.725 29.725 38.000 9.300
67 Waukesha 237588.0 0.808631 0.801177 0.007454 33.3 60.0 -26.7 79199.0 142519.0 33.125 65.425 -32.300 5.600
4 Brown 128965.0 0.706793 0.680866 0.025927 41.4 52.1 -10.7 53358.0 67192.0 48.150 49.975 -1.825 -8.875
44 Outagamie 95162.0 0.724051 0.686042 0.038009 40.1 54.2 -14.1 38117.0 51579.0 47.750 49.950 -2.200 -11.900
51 Racine 94921.0 0.666838 0.686983 -0.020146 44.8 49.1 -4.3 42506.0 46620.0 49.675 48.625 1.050 -5.350
70 Winnebago 87140.0 0.667714 0.678759 -0.011046 42.5 49.9 -7.4 37054.0 43447.0 49.200 48.350 0.850 -8.250
66 Washington 77551.0 0.776092 0.741324 0.034768 26.9 66.7 -39.8 20854.0 51729.0 30.700 67.625 -36.925 -2.875
29 Kenosha 76894.0 0.636830 0.639971 -0.003141 46.5 46.9 -0.4 35770.0 36025.0 54.275 43.800 10.475 -10.875
53 Rock 76056.0 0.649524 0.670892 -0.021368 51.7 41.4 10.3 39336.0 31483.0 60.050 38.150 21.900 -11.600

An Initial Attempt

This was my first attempt at a comprehensive plot that meets all the requirements. Average 2000-2012 turnout is on the x-axis, and 2016 turnout is on the y-axis. Any counties above/below the 45 degree line had better/worse turnout in 2016 relative to the average. The area of each of the circles is proportional to the number of votes cast, and the color corresponds with the vote margin.

It's clear that Milwaukee underperformed in this election. If Milwaukee voted at it's recent historical average, Wisconsin would probably be a blue state.

Unfortunately, this plot doesn't do good job of communicating the shift in the vote margin relative to the past. For that, I created the next plot.

In [9]:
#http://stackoverflow.com/questions/37401872/custom-continuous-color-map-in-matplotlib
import matplotlib.colors as clr

#Find midpoint of data on 0-1 scale:
vmin = turnout_df['dem_lead_2016'].min()
vmax = turnout_df['dem_lead_2016'].max()
mid = 1 - vmax/(vmax + abs(vmin))

#Construct colormap using midpoint:
cmap = clr.LinearSegmentedColormap.from_list('red_blue', 
                        [(0, '#EF3B2C'), (mid, '#FFFFFF'), (1,'#08519C')], N=256) 

fig, ax = plt.subplots(figsize=(13,9))  

#label='_nolegend_'
plt.scatter(x=turnout_df['turnout_2000_2012'], y=turnout_df['turnout_2016'], 
            s=turnout_df['num_2016']/100, marker='o', alpha=1.0, label='Votes', 
            c=turnout_df['dem_lead_2016'], cmap=cmap, edgecolors='gray') 

ax.set_xlim(0.55, 0.85)
ax.set_ylim(0.55, 0.85)
#s=20

# Create X points
x = pd.DataFrame({'line': np.linspace(0, 1, 100)})
plt.plot(x, x, 'k--', alpha=0.9, label='Equal', color='gray')

A = turnout_df['turnout_2000_2012']
B = turnout_df['turnout_2016']
C = turnout_df['county']
D = turnout_df['num_2016']

for a,b,c,d in zip(A, B, C, D):
    if d > 70000: #Annotate large counties
        ax.annotate('%s' % c, xy=(a,b), textcoords='data') 
        
plt.xlabel('Average Voter Turnout, 2000-2012 (% VAP)')
plt.ylabel('Voter Turnout, 2016 (% VAP)')

legend = plt.legend(loc='upper left')
legend.legendHandles[1]._sizes = [40]

plt.colorbar(shrink=0.5, pad=0.03, label='Democratic Margin', format='%.0f%%')

plt.show()

A Better Approach

Ok, here is a different approach that matches all of my initial criteria for a visualization.

It's easy to build a graphic that shows a difference relative to Obama, but that's not very informative. Wisconsin has voted for the Democratic nominee in every election since 1984, so Trump's win should represent a deviation from deeper historical averages, not just a deviation from Obama's results. That's what the axes of this graphic try to show, with the x and y axes showing changes from the 2000-2012 averages.

In [10]:
fig, ax = plt.subplots(figsize=(13.5,9))  #figsize=(12,10)

plt.scatter(x=turnout_df['demlead_change'], y=turnout_df['turnout_change']*100, 
            s=turnout_df['num_2016']/100, marker='o', alpha=1.0, label='Votes', 
            c=turnout_df['dem_lead_2016'], cmap=cmap, edgecolors='gray')

#http://stackoverflow.com/questions/31556446
# Move left y-axis and bottim x-axis to zero:
ax.spines['bottom'].set_position('zero')
ax.spines['left'].set_position('zero')

# Eliminate upper and right axes
ax.spines['right'].set_color('none')
ax.spines['top'].set_color('none')

# Show ticks in the left and lower axes only
ax.xaxis.set_ticks_position('bottom')
ax.yaxis.set_ticks_position('left')

ax.set_ylim(-10, 10)


A = turnout_df['demlead_change']
B = turnout_df['turnout_change']*100
C = turnout_df['county']
D = turnout_df['num_2016']

for a,b,c,d in zip(A, B, C, D):
    if d > 70000: #Annotate large counties  >90000
        ax.annotate('%s' % c, xy=(a,b), textcoords='data') 
        
plt.xlabel('Change from Average 2000-2012 Democratic Margin (%)', labelpad=250)
plt.ylabel('Change from Average 2000-2012 Turnout (%)', labelpad=400)

legend = plt.legend(loc='upper left')
legend.legendHandles[0]._sizes = [40]

#http://stackoverflow.com/questions/5306756
plt.colorbar(shrink=0.5, pad=0.03, label='Democratic Margin', format='%.0f%%') #orientation='horizontal'


plt.show()

So What Happened?

The size and colors of the circles do a fairly good job of communicating what happened this election. The area of the circle is proportional to the votes cast, and the color communicates the margin for the Democrats so it's pretty easy to get an idea of the crucial counties and the role they played for each party.

The axes in this plot show what changed relative to the past. On the x-axis, I have the change from the average democratic margin, which does a good job of communicating the shift in support within a county. This axis makes it clear that there was a massive shift to the Republican party in small rural counties. There was even a shift towards Trump in many larger suburban counties, even if they were still won by Clinton.

Interestingly, Waukesha and Ozaukee counties, which are fairly rich, well educated suburban counties of Milwaukee, actually saw 5-10% shifts towards the Democrats. This fits with Nate Silver's analysis that suggested level of education was the most important predictor of voting behavior in this election. Waukesha county is one of the most conservative counties in the country, but it's dominated by establishment conservatives that may have clashed with Trump's views.

The y-axis shows change in voter turnout relative to the 2000-2012 average. It's clear that Milwaukee really underperformed here, with a 5% drop in turnout relative to the average. Part of this is probably because Obama wasn't on the ballot, but the historical average includes the Gore and Kerry elections as well, which should help mitigate the Obama effect.

One thing I haven't heard anyone discuss is the fact that the Democrat's margin in Milwaukee was up by 7.8% compared with the 2000-2012 average. A margin increase of 7.8% would lead to a larger net gain in votes than an increase in turnout of the same percent, so this margin increase probably canceled out the effects of lower turnout (although it would be ideal to have increases in both).

Dane County had an amazing 7.3% increase in turnout along with a large 10% increase in Democratic support. This was almost enough to cancel out the poorer performance elsewhere in the state. But in the end, the large shift to Trump in rural counties overwhelmed the Dane County effect.

Why did Dane County do so much better?

Dane County performed much better than Milwaukee County when it comes to voter turnout and margins. It might make sense to try to apply lessons learned from Dane County to Milwaukee, although they are two very different places. There are much higher levels of poverty in Milwaukee, which might magnify the effects of the recent voter ID laws. Anecdotally, there seemed to be less enthusiasm for Clinton as well.

Madison is a younger, more progressive city, and probably had more resources to put into get out the vote efforts. It also has a large student population and one of the top public universities in the world. For more thoughts on the causes of the voting disparities, see this article.

Whatever the cause, I think it makes sense to study the differences between the two counties to look for ways to improve in the future. Things are only going to get more difficult over the coming years with more voter suppression efforts likely at the state and now national level.

In [122]:
#Data output script for interactive visualization.  

'''
# Calculate voter turnouts by county
out_df = pd.pivot_table(voting_df, values='num', index=['year', 'county'], aggfunc=np.sum)
out_df = out_df.reset_index()

out_df = pd.merge(out_df, population2009_df, on='county', suffixes=('', '_2009'))
out_df = pd.merge(out_df, population2014_df, on='county', suffixes=('', '_2014'))
out_df.rename(columns={'CVAP_EST': 'CVAP_EST_2009',
                       'num': 'county_num',
                       'CVAP_MOE': 'CVAP_MOE_2009'}, inplace=True)

def turnout_frac(row):
    if row['year'] < 2012:
        return row['county_num']/row['CVAP_EST_2009']
    else:
        return row['county_num']/row['CVAP_EST_2014']

out_df['turnout'] = out_df.apply(turnout_frac, axis=1)

out_df = out_df[['year', 'county', 'county_num', 'turnout']]

#out_df.head(30)

#out_df.head(30)

#Calculate 2016 Democrate-Republican Difference, by county
demout_df = voting_df[(voting_df['party'] == 'D')]

repout_df = voting_df[(voting_df['party'] == 'R')]

spreadout_df = pd.merge(repout_df, demout_df, on=['year', 'county'], suffixes=('_rep', '_dem'))


#spreadout_df.head(20)

spreadout_df.head()

spreadout_df['dem_lead'] = spreadout_df['pct_dem'] - spreadout_df['pct_rep']

spreadout_df = spreadout_df[['year', 'county', 'pct_dem', 'pct_rep', 
                               'dem_lead', 'num_dem', 'num_rep']
                             ]  #.sort_values(by='dem_lead', ascending=False)

out_df = pd.merge(out_df, spreadout_df, on=['year', 'county'])
#out_df.round(4) #{'turnout':4}
#np.round(out_df, decimals = 4)
#Round isn't working. . . 
out_df['turnout'] = out_df['turnout'].apply(lambda x: round(x, 4) )

#out_df.head()  #.dtypes #.head()

#out_df = out_df.groupby(['year', 'county'])
#out_df.head(30)
#out_df.tail(30)

#Output to current directory:
out_df.to_csv('./county_results_20002016.csv', index=False)
'''