Election Simulation

I've seen a claim that the early results from IEBC were too "smooth" — the argument being that with so many random variables in play, the gap between the candidates should have fluctuated more than it did. The point of this exercise is to see whether that claim holds.

First, a bit of a flashback. We'll use the previous election's results as a baseline to predict vote share and turnout. The numbers might not be completely accurate, but that's not important for the exercise: I could only find the IEBC report as a PDF, and it's tedious to verify each line.

In [25]:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt

# Load the 2013 county-level results (all columns are percentages) and
# normalize the county names so they join cleanly with the later tables.
elections_2013 = pd.read_csv('election_2013.csv', index_col=False)
elections_2013['COUNTY'] = elections_2013['COUNTY'].str.strip().str.lower()
elections_2013 = elections_2013.set_index('COUNTY')
elections_2013
Out[25]:
UHURU ODINGA MUDAVADI REJECT TURNOUT (all values in %)
COUNTY
mombasa 23.79 69.77 1.65 1.10 66.62
kwale 14.04 80.74 1.19 0.78 72.00
kilifi 10.72 83.74 1.10 1.09 65.00
tana river 34.71 61.41 0.71 0.79 81.00
lamu 40.02 51.98 1.56 1.44 84.00
taita taveta 13.18 81.56 1.12 1.11 81.00
garissa 45.34 48.67 0.42 0.56 80.00
wajir 38.83 49.59 0.39 0.54 85.00
mandera 92.93 4.30 0.06 0.34 84.00
marsabit 47.18 48.78 0.33 0.36 86.00
isiolo 55.41 29.61 0.31 0.60 87.00
meru 89.41 7.55 0.32 1.03 88.00
tharaka nithi 92.38 5.12 0.25 0.75 89.00
embu 89.00 7.97 0.34 0.87 88.00
kitui 14.76 79.53 1.53 0.95 85.00
machakos 9.58 85.89 0.88 1.30 84.00
makueni 5.02 90.73 0.97 0.86 85.00
nyandarua 97.11 1.21 0.21 0.74 94.00
nyeri 96.33 1.70 0.19 0.74 93.00
kirinyaga 95.99 1.44 0.15 0.69 91.00
muranga 95.92 2.43 0.14 0.56 94.00
kiambu 90.21 7.89 0.28 0.65 91.00
turkana 29.85 67.53 0.53 0.40 76.00
west pokot 73.33 22.95 1.27 0.66 90.00
samburu 40.94 57.62 0.23 0.33 88.00
trans nzoia 37.24 46.03 12.38 2.33 82.00
uasin gishu 74.26 21.09 2.53 1.01 86.00
elgeyo marakwet 92.07 4.85 0.53 0.83 92.00
nandi 81.52 8.70 7.41 0.95 90.00
baringo 87.93 9.41 0.76 0.73 91.00
laikipia 85.49 12.56 0.31 0.51 90.00
nakuru 80.19 17.14 0.80 0.89 89.00
narok 46.38 50.28 0.41 0.70 90.00
kajiado 52.36 44.44 0.62 0.77 87.00
kericho 90.74 6.59 0.70 0.73 91.00
bomet 92.68 4.61 0.48 0.62 90.00
kakamega 2.63 63.84 30.53 1.47 84.00
vihiga 1.52 46.44 49.19 1.24 83.00
bungoma 12.25 52.83 30.73 1.51 86.00
busia 3.71 85.62 8.42 1.03 88.00
siaya 0.31 98.47 0.25 0.60 92.00
kisumu 1.33 96.64 1.10 0.53 90.00
homabay 0.24 98.93 0.18 0.34 94.00
migori 9.97 86.38 2.37 0.51 92.00
kisii 27.42 67.93 0.75 1.32 84.00
nyamira 29.47 66.26 0.72 1.24 84.00
nairobi 46.75 49.00 1.56 0.86 82.00
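As a quick sanity check on the transcribed numbers, each county's candidate shares plus rejected ballots should sum to a bit under 100%, with the remainder belonging to the minor candidates. A minimal sketch, using a hypothetical two-county slice in the same shape as `elections_2013`:

```python
import pandas as pd

# Hypothetical two-county slice in the same shape as elections_2013
# (values copied from the table above; all columns are percentages).
df = pd.DataFrame(
    {'UHURU': [23.79, 14.04], 'ODINGA': [69.77, 80.74],
     'MUDAVADI': [1.65, 1.19], 'REJECT': [1.10, 0.78]},
    index=['mombasa', 'kwale'])

# Shares plus rejected ballots should come to a bit under 100%;
# the residual is the minor candidates' combined share.
residual = 100 - df.sum(axis=1)
print(residual.round(2))  # mombasa 3.69, kwale 3.25
```

If a county's residual came out negative or implausibly large, that row of the transcription would be worth re-checking against the PDF.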
In [115]:
elections_2013.plot(kind='bar', figsize=(20,10))
Out[115]: [bar chart: 2013 vote shares and turnout (%) by county]
In [6]:
# 2017 registered voters per county, with county names normalized to match above.
registration_2017 = pd.read_csv('registration.csv')
registration_2017['COUNTY'] = registration_2017['COUNTY'].str.strip().str.lower()
registration_2017 = registration_2017.set_index('COUNTY')
registration_2017
Out[6]:
NUM VOTERS
COUNTY
kiambu 1173593
nakuru 948668
meru 712378
muranga 590775
nyeri 460806
uasin gishu 451485
kericho 379815
kirinyaga 351162
nandi 349340
nyandarua 336322
bomet 325606
embu 315668
laikipia 239497
baringo 227918
tharaka nithi 216522
west pokot 178989
elgeyo marakwet 178975
kakamega 746877
machakos 627168
mombasa 596485
bungoma 559897
kisumu 548868
kilifi 510484
kitui 477655
kisii 544753
nyamira 279685
homabay 476150
siaya 447745
makueni 421180
migori 388967
busia 347911
trans nzoia 339832
kwale 282436
vihiga 267481
taita taveta 155904
tana river 118189
lamu 70224
nairobi 2304386
kajiado 409266
narok 347427
turkana 188617
mandera 168478
garissa 132486
wajir 155916
marsabit 143541
samburu 79477
isiolo 72548
In [7]:
registration_2017.plot(kind='bar', figsize=(20,10))
Out[7]: [bar chart: 2017 registered voters by county]
In [35]:
# Estimate valid votes per county in 2017: apply each county's 2013 turnout
# rate, then knock off an assumed 3% for spoilt/rejected ballots.
spoil = 0.03
turnout_rate = 0.01 * elections_2013['TURNOUT'].reindex(registration_2017.index)
adj_voters = registration_2017.copy()
adj_voters['NUM VOTERS'] = np.floor(registration_2017['NUM VOTERS'] * (turnout_rate - spoil))
adj_voters
Out[35]:
NUM VOTERS
COUNTY
kiambu 1032761.0
nakuru 815854.0
meru 605521.0
muranga 537605.0
nyeri 414725.0
uasin gishu 374732.0
kericho 334237.0
kirinyaga 309022.0
nandi 303925.0
nyandarua 306053.0
bomet 283277.0
embu 268317.0
laikipia 208362.0
baringo 200567.0
tharaka nithi 186208.0
west pokot 155720.0
elgeyo marakwet 159287.0
kakamega 604970.0
machakos 508006.0
mombasa 379483.0
bungoma 464714.0
kisumu 477515.0
kilifi 316500.0
kitui 391677.0
kisii 441249.0
nyamira 226544.0
homabay 433296.0
siaya 398493.0
makueni 345367.0
migori 346180.0
busia 295724.0
trans nzoia 268467.0
kwale 194880.0
vihiga 213984.0
taita taveta 121605.0
tana river 92187.0
lamu 56881.0
nairobi 1820464.0
kajiado 343783.0
narok 302261.0
turkana 137690.0
mandera 136467.0
garissa 102014.0
wajir 127851.0
marsabit 119139.0
samburu 67555.0
isiolo 60940.0
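The adjustment above is just floor(registered × (turnout − spoil)). A spot check on two rows taken from the tables above (registered voters and 2013 turnout, with the assumed 3% spoilt-ballot rate):

```python
import numpy as np
import pandas as pd

# Two counties from the tables above: 2017 registered voters and 2013 turnout.
registered = pd.Series({'kiambu': 1173593, 'lamu': 70224})
turnout = pd.Series({'kiambu': 0.91, 'lamu': 0.84})
spoil = 0.03  # assumed spoilt/rejected-ballot rate

# floor(registered * (turnout - spoil)) — same formula as the cell above
valid = np.floor(registered * (turnout - spoil))
print(valid)  # kiambu 1032761, lamu 56881 — matching the adj_voters table
```

Matching the table gives some confidence the vectorized version does what the original loop did.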

The Simulation

The idea here is very simple and could probably be accomplished without real data. We assume that IEBC releases results from N polling stations at a time, sampled at random across counties. For each polling station, we assume each candidate's share of the vote equals his county's 2013 share, with Mudavadi's 2013 share split evenly between the two. What I'm trying to show is that the difference between the candidates need not be volatile.
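To see why the gap can look smooth even when stations report in a random order, here's a toy version of the same idea with purely synthetic station shares (nothing from the real data): the cumulative ratio between the two candidates settles down quickly once enough stations are in.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1000 synthetic stations, each with its own per-station "Uhuru" share
# of a fixed 600 votes (both numbers are arbitrary for illustration).
shares = rng.uniform(0.2, 0.8, size=1000)
votes = 600

order = rng.permutation(1000)  # stations report in a random order
u = np.cumsum(shares[order] * votes)
o = np.cumsum((1 - shares[order]) * votes)
ratio = u / o

# Early on the ratio jumps around; after a few hundred stations it's
# nearly flat, because each new station barely moves the running totals.
print(round(float(ratio[-1]), 3))
```

The final ratio is the same regardless of reporting order; randomness only affects how bumpy the path to it is.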

In [54]:
def time_step(adj_voters_cpy, num_stations=100):
    station_size = 600
    mdvd_share = 0.5  # fraction of Mudavadi's 2013 votes assigned to Uhuru (0.5 = even split)
    u = np.zeros(num_stations)
    o = np.zeros(num_stations)
    # Only sample counties that still have uncounted voters.
    weights = [1 if x > 0 else 0 for x in adj_voters_cpy['NUM VOTERS']]
    num_samples = np.min([num_stations, np.array(weights).sum()])
    sample = adj_voters_cpy['NUM VOTERS'].sample(num_samples, weights=weights)
    for i in range(num_samples):
        rand_station = sample.index[i]  # stations report in random order
        voters = np.min([station_size, adj_voters_cpy.loc[rand_station, 'NUM VOTERS']])
        pred_u = elections_2013['UHURU'][rand_station] + mdvd_share * elections_2013['MUDAVADI'][rand_station]
        pred_o = elections_2013['ODINGA'][rand_station] + (1 - mdvd_share) * elections_2013['MUDAVADI'][rand_station]
        u[i] = pred_u * voters * 0.01  # shares are percentages
        o[i] = pred_o * voters * 0.01
        adj_voters_cpy.loc[rand_station, 'NUM VOTERS'] -= voters
    return (u, o)

def exp(batch_size=100):
    adj_voters_cpy = adj_voters.copy()
    uhuru_data = [0]
    raila_data = [0]
    diff = [0]
    # Keep releasing batches until every county's voters have been counted.
    while (adj_voters_cpy['NUM VOTERS'] > 0).any():
        u, o = time_step(adj_voters_cpy, batch_size)
        uhuru_data.append(uhuru_data[-1] + u.sum())
        raila_data.append(raila_data[-1] + o.sum())
        diff.append(np.abs((uhuru_data[-1] - raila_data[-1]) / float(uhuru_data[-1])))
    return uhuru_data, raila_data, diff

uhuru_data,raila_data,diff = exp()
plt.plot(uhuru_data,label = 'uhuru')
plt.plot(raila_data, label = 'raila' )
plt.ylabel('Total Votes')
plt.xlabel('Time(ticks)')
plt.legend(loc="upper left")
Out[54]: [line chart: cumulative vote totals per candidate vs. time]
In [39]:
plt.plot(diff, label='pct diff')
plt.legend()
Out[39]: [line chart: relative difference between the candidates over time]
In [62]:
num_stations = 1
while num_stations < 200:
    num_stations += 30
    uhuru_data,raila_data,diff = exp(num_stations)
    var = [np.std(diff[:i]) for i in range(1, len(diff))]  # start at 1 to skip the empty slice
    plt.plot(var,label = "num stations:" + str(num_stations))
plt.xlabel("Time")
plt.ylabel("std-dev")
plt.legend(loc='upper right')
Out[62]: [line chart: running std-dev of the difference, one line per batch size]
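The shrinking std-dev is just the law of large numbers at work: under the (strong) simplifying assumption that station results behave like iid draws, the spread of a running average falls off like 1/√n, so a smooth gap between cumulative totals is exactly what you'd expect. A quick sketch of that scaling with synthetic draws:

```python
import numpy as np

rng = np.random.default_rng(1)

# The std of the mean of n iid draws shrinks like 1/sqrt(n):
# quadrupling n should roughly halve the spread.
for n in [100, 400, 1600]:
    means = rng.uniform(0, 1, size=(2000, n)).mean(axis=1)
    print(n, round(float(means.std()), 4))
```

Each quadrupling of n cuts the spread roughly in half, mirroring how the std-dev curves above flatten faster for larger batch sizes.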

Obviously a LOT of assumptions were made, so feel free to correct me if anything is clearly wrong. I can't verify the authenticity of a screenshot floating around claiming that Uhuru's votes were a constant multiple (1.12) of Odinga's. However, the idea that the difference should be volatile, and that the lack of volatility implies rigging, doesn't make much sense to me. It also seems to me that there are much easier ways to prove rigging (presumably ones that don't require us to understand the gambler's fallacy or to be expert DB admins).