Copyright 2019 Allen B. Downey
MIT License: https://opensource.org/licenses/MIT
%matplotlib inline
import numpy as np
import pandas as pd
import random
import matplotlib.pyplot as plt
This article suggests that a smooth curve is a better way to show noisy polling data over time.
Here's their before and after:
And here's their data:
df = pd.read_csv('Economist_brexit.csv', header=3, parse_dates=[0])
df.index = df['Date']
df.head()
Date (index) | Date | % responding right | % responding wrong |
---|---|---|---|
2016-02-08 | 2016-02-08 | 46 | 42 |
2016-09-08 | 2016-09-08 | 45 | 44 |
2016-08-17 | 2016-08-17 | 46 | 43 |
2016-08-23 | 2016-08-23 | 45 | 43 |
2016-08-31 | 2016-08-31 | 47 | 44 |
df.tail()
Date (index) | Date | % responding right | % responding wrong |
---|---|---|---|
2018-08-13 | 2018-08-13 | 43 | 47 |
2018-08-14 | 2018-08-14 | 43 | 45 |
2018-08-21 | 2018-08-21 | 41 | 47 |
2018-08-29 | 2018-08-29 | 42 | 47 |
2018-04-09 | 2018-04-09 | 42 | 48 |
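The rows shown by `df.head()` and `df.tail()` are not in chronological order: 2016-02-08 is followed by 2016-09-08 and then 2016-08-17, and the last row is 2018-04-09 after 2018-08-29. One possible explanation is that the file stores dates day-first and they were parsed month-first. The following is a minimal sketch of that idea, not a confirmed fix: the date strings are hypothetical samples, and the day-first format is an assumption about the CSV.

```python
import pandas as pd

# Hypothetical date strings, written day-first (an assumption about the CSV).
raw = ['02/08/2016', '09/08/2016', '17/08/2016', '23/08/2016', '31/08/2016']

# dayfirst=True tells pandas to read the first number as the day.
dates = pd.to_datetime(raw, dayfirst=True)

# Check whether the parsed dates are in chronological order.
print(dates.is_monotonic_increasing)
```

If the file really is day-first, `pd.read_csv` accepts the same `dayfirst=True` argument alongside `parse_dates=[0]`.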
The following function uses StatsModels to fit a smooth curve (LOWESS) through a time series and returns the result as a Pandas Series.
from statsmodels.nonparametric.smoothers_lowess import lowess
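def make_lowess(series):
    """Fit a LOWESS curve to a time series and return it as a pandas Series."""
    endog = series.values
    exog = series.index.values
    # lowess returns an (n, 2) array of (exog, fitted value) pairs, sorted by exog
    smooth = lowess(endog, exog)
    index, data = np.transpose(smooth)
    # the x-values come back as numbers, so convert them to timestamps
    return pd.Series(data, index=pd.to_datetime(index))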
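The last line of `make_lowess` calls `pd.to_datetime` because the smoothed x-values come back as plain numbers rather than timestamps. The sketch below illustrates that round trip on its own, without calling `lowess`. The dates are hypothetical, and the explicit conversion to nanoseconds is an assumption made here so the example behaves the same across pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical dates standing in for the index of a poll series.
dates = pd.to_datetime(['2016-08-02', '2016-08-09', '2016-08-17'])

# Express each timestamp as nanoseconds since 1970-01-01, stored as a float.
nanos = dates.values.astype('datetime64[ns]').astype('int64')
as_float = nanos.astype(float)

# By default, pd.to_datetime reads plain numbers as nanoseconds since the epoch.
restored = pd.to_datetime(as_float)

print(list(restored) == list(dates))
```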
Here's what the graph looks like.
options = dict(marker='o', linewidth=0, alpha=0.3, label='')
df['% responding right'].plot(color='C0', **options)
df['% responding wrong'].plot(color='C1', **options)
right = make_lowess(df['% responding right'])
right.plot(label='Right')
wrong = make_lowess(df['% responding wrong'])
wrong.plot(label='Wrong')
plt.legend();
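How smooth the curves look depends on the `frac` argument of `lowess`, the fraction of the data used for each local fit; `make_lowess` leaves it at the StatsModels default of 2/3. The following sketch compares two values of `frac`. It uses synthetic data with a numeric x-axis, not the polling data, and the names `wiggly` and `smooth` are chosen here for illustration.

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

# Synthetic noisy series (hypothetical data, not the polling data).
rng = np.random.default_rng(0)
x = np.arange(100, dtype=float)
y = 45 + 0.02 * x + rng.normal(0, 1.5, size=100)

# A small frac follows the noise more closely; a large frac smooths more.
wiggly = lowess(y, x, frac=0.2)
smooth = lowess(y, x, frac=2/3)

# Each result is an (n, 2) array: column 0 is x, column 1 is the fitted value.
print(wiggly.shape, smooth.shape)
```

To try this on the polling data, `make_lowess` could take a `frac` parameter and pass it through to `lowess`.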