Homework 2: More Exploratory Data Analysis¶

Gene Expression Data and Election Polls¶

Due: Thursday, October 2, 2014 11:59 PM

Submission Instructions¶

To submit your homework, create a folder named lastname_firstinitial_hw# and place your IPython notebooks, data files, and any other files in this folder. Your IPython Notebooks should be completely executed with the results visible in the notebook. We should not have to run any code. Compress the folder (please use .zip compression) and submit to the CS109 dropbox in the appropriate folder. If we cannot access your work because these directions are not followed correctly, we will not grade your work.

Introduction¶

John Tukey wrote in Exploratory Data Analysis, 1977: "The greatest value of a picture is when it forces us to notice what we never expected to see." In this assignment we will continue using our exploratory data analysis tools, but apply it to new sets of data: gene expression and polls from the 2012 Presidental Election and from the 2014 Senate Midterm Elections.

First: You will use exploratory data analysis and apply the singular value decomposition (SVD) to a gene expression data matrix to determine if the the date that the gene expression samples are processed has large effect on the variability seen in the data.

Second: You will use the polls from the 2012 Presidential Elections to determine (1) Is there a pollster bias in presidential election polls? and (2) Is the average of polls better than just one poll?

Finally: You will use the HuffPost Pollster API to extract the polls for the current 2014 Senate Midterm Elections and provide a preliminary prediction of the result of each state.

Data¶

We will use the following data sets:

A gene expression data set called exprs_GSE5859.csv and sample annotation table called sampleinfo_GSE5859.csv which are both available on Github in the 2014_data repository: expression data set and sample annotation table.
Polls from the 2012 Presidential Election: Barack Obama vs Mitt Romney. The polls we will use are from the Huffington Post Pollster.
Polls from the 2014 Senate Midterm Elections from the HuffPost Pollster API.

Load Python modules¶

In [2]:

# special IPython command to prepare the notebook for matplotlib
%matplotlib inline 

import requests 
# fix stringio for python 3 version
#from StringIO import StringIO
from io import StringIO

import numpy as np
import pandas as pd # pandas
import matplotlib.pyplot as plt # module for plotting 
import seaborn as sns
import datetime as dt # module for manipulating dates and times
import numpy.linalg as lin # module for performing linear algebra operations
import os

Problem 1¶

In this problem we will be using a gene expression data set obtained from a microarray experiement Read more about the specific experiment here. There are two data sets we will use:

The gene expression intensities where the rows represent the features on the microarray (e.g. genes) and the columsns represent the different microarray samples.
A table that contains the information about each of the samples (columns in the gene expression data set) such as the sex, the age, the treatment status, the date the samples were processed. Each row represents one sample.

Problem 1(a)¶

Read in the two files from Github: exprs_GSE5859.csv and sampleinfo_GSE5859.csv as pandas DataFrames called exprs and sampleinfo. Use the gene names as the index of the exprs DataFrame.

In [20]:

#your code here
exprs__url = "https://github.com/cs109/2014_data/raw/master/exprs_GSE5859.csv"
sampleinfo_url = "https://github.com/cs109/2014_data/raw/master/sampleinfo_GSE5859.csv"

urls = [exprs__url, sampleinfo_url]
filenames = ["exprs_GSE5859.csv", "sampleinfo_GSE5859.csv"]

for url, filename in zip(urls, filenames):
    r = requests.get(url, stream=True)
    with open("data/"+filename, 'wb') as fd:
        for chunk in r.iter_content(chunk_size=256):
            fd.write(chunk)
    print(f"file {filename} downloaded succesfully")

file exprs_GSE5859.csv downloaded succesfully
file sampleinfo_GSE5859.csv downloaded succesfully

Double checking that our files are in the data dir:

In [21]:

[file for file in os.listdir("data") if file in filenames]

Out[21]:

['exprs_GSE5859.csv', 'sampleinfo_GSE5859.csv']

In [37]:

exprs = pd.read_csv("data/"+filenames[0], index_col=0)
exprs.head()

Out[37]:

	GSM25581.CEL.gz	GSM25681.CEL.gz	GSM136524.CEL.gz	GSM136707.CEL.gz	GSM25553.CEL.gz	GSM136676.CEL.gz	GSM136711.CEL.gz	GSM136542.CEL.gz	GSM136535.CEL.gz	GSM25399.CEL.gz	...	GSM48650.CEL.gz	GSM25687.CEL.gz	GSM25685.CEL.gz	GSM136549.CEL.gz	GSM25427.CEL.gz	GSM25525.CEL.gz	GSM25349.CEL.gz	GSM136727.CEL.gz	GSM25626.CEL.gz	GSM136725.CEL.gz
1007_s_at	6.333951	5.736190	6.156754	6.513010	6.061013	6.506493	6.863426	6.369645	6.646321	5.810930	...	7.082636	6.315866	7.010165	6.460107	6.122355	6.330314	6.627014	6.133068	6.419444	6.488579
1053_at	7.255622	7.399993	7.491967	7.006123	7.250995	7.082581	6.842236	7.048487	7.042361	6.666625	...	6.599718	7.412731	7.274429	6.928642	7.168782	7.235648	6.939184	7.280781	6.351776	7.517410
117_at	5.760106	4.825169	5.039387	5.414160	5.205697	5.300078	5.099337	5.156459	5.736109	5.807496	...	6.231457	5.240717	5.275062	5.759827	5.409720	5.491938	5.113570	5.401876	5.537605	5.247190
121_at	6.935451	7.025210	7.543667	7.959781	7.223323	8.006816	8.102504	7.434617	7.532321	7.711461	...	8.215096	7.677593	7.923624	7.219401	7.432338	6.785174	7.833862	7.607461	7.302935	7.331864
1255_g_at	3.597872	3.108801	3.128269	3.364947	3.125909	3.186834	3.311022	3.092042	3.167737	2.969579	...	3.331424	3.043856	3.151900	3.132253	3.250600	2.984125	3.152269	3.225123	3.222731	3.175846

5 rows × 208 columns

In [24]:

sampleinfo = pd.read_csv("data/"+filenames[1])
sampleinfo.head()

Out[24]:

	ethnicity	date	filename	sex
0	CEU	2003-02-04	GSM25349.CEL.gz	M
1	CEU	2003-02-04	GSM25350.CEL.gz	M
2	CEU	2002-12-17	GSM25356.CEL.gz	M
3	CEU	2003-01-30	GSM25357.CEL.gz	M
4	CEU	2003-01-03	GSM25358.CEL.gz	M

Make sure the order of the columns in the gene expression DataFrame match the order of file names in the sample annotation DataFrame. If the order of the columns the exprs DataFrame do not match the order of the file names in the sampleinfo DataFrame, reorder the columns in the exprs DataFrame.

Note: The column names of the gene expression DataFrame are the filenames of the orignal files from which these data were obtained.

Hint: The method list.index(x) [read here] can be used to return the index in the list of the first item whose value is x. It is an error if there is no such item. To check if the order of the columns in exprs matches the order of the rows in sampleinfo, you can check using the method .all() on a Boolean or list of Booleans:

Example code: (exprs.columns == sampleinfo.filename).all()

In [40]:

#your code here
# change the ordering of columns to be the same as the sample filenames
exprs = exprs[sampleinfo.filename]
#check order
(exprs.columns == sampleinfo.filename).all()

Out[40]:

True

Show the head of the two tables: exprs and sampleinfo.

In [44]:

#your code here
exprs.head()

Out[44]:

	GSM25349.CEL.gz	GSM25350.CEL.gz	GSM25356.CEL.gz	GSM25357.CEL.gz	GSM25358.CEL.gz	GSM25359.CEL.gz	GSM25360.CEL.gz	GSM25361.CEL.gz	GSM25377.CEL.gz	GSM25378.CEL.gz	...	GSM136719.CEL.gz	GSM136720.CEL.gz	GSM136721.CEL.gz	GSM136722.CEL.gz	GSM136723.CEL.gz	GSM136724.CEL.gz	GSM136725.CEL.gz	GSM136726.CEL.gz	GSM136727.CEL.gz	GSM136729.CEL.gz
1007_s_at	6.627014	6.249807	5.934128	6.145268	6.091270	6.041186	6.050375	6.063847	6.226106	6.358282	...	6.466445	6.533592	6.631492	6.513362	6.440706	6.704324	6.488579	6.809481	6.133068	6.155473
1053_at	6.939184	6.818038	7.047962	7.422477	7.272361	7.128216	6.750719	6.836287	6.738022	7.367895	...	7.032885	7.156344	7.018025	6.973322	6.884738	7.120898	7.517410	7.200596	7.280781	7.111583
117_at	5.113570	5.074096	5.371201	5.266550	5.342047	5.063876	5.315898	5.483652	6.689444	6.482782	...	5.661112	5.127260	5.151840	5.505602	5.687689	4.942651	5.247190	5.237239	5.401876	5.302628
121_at	7.833862	7.780682	7.458197	7.655948	7.546555	7.072670	7.092984	6.954225	7.489785	7.388539	...	7.769734	7.815864	7.683279	7.883231	7.913621	7.834196	7.331864	7.357102	7.607461	7.456453
1255_g_at	3.152269	3.111747	3.018932	3.154545	3.107954	3.224284	3.114241	3.044975	3.304038	2.887919	...	3.257484	3.339234	3.298384	3.150654	3.344501	3.230285	3.175846	3.105092	3.225123	3.090149

5 rows × 208 columns

In [45]:

sampleinfo.head()

Out[45]:

	ethnicity	date	filename	sex
0	CEU	2003-02-04	GSM25349.CEL.gz	M
1	CEU	2003-02-04	GSM25350.CEL.gz	M
2	CEU	2002-12-17	GSM25356.CEL.gz	M
3	CEU	2003-01-30	GSM25357.CEL.gz	M
4	CEU	2003-01-03	GSM25358.CEL.gz	M

Problem 1(b)¶

Extract the year and month as integers from the sampleinfo table.

Hint: To convert a Series or a column of a pandas DataFrame that contains a date-like object, you can use the to_datetime function [read here]. This will create a DatetimeIndex which can be used to extract the month and year for each row in the DataFrame.

In [52]:

#your code here
dates = pd.to_datetime(sampleinfo["date"])
years = [t.year for t in dates]
months = [t.month for t in dates]

Problem 1(c)¶

Convert the dates in the date column from the sampleinfo table into days since October 31, 2002. Add a column to the sampleinfo DataFrame titled elapsedInDays containing the days since October 31, 2002. Show the head of the sampleinfo DataFrame which includes the new column.

Hint: Use the datetime module to create a new datetime object for the specific date October 31, 2002. Then, subtract the October 31, 2002 date from each date from the date column in the sampleinfo DataFrame.

In [180]:

#your code here
basedate = pd.to_datetime("October 31, 2002")
basedate

sampleinfo["elapsedInDays"] = pd.to_datetime(sampleinfo.date) - basedate
sampleinfo.head()

Out[180]:

	ethnicity	date	filename	sex	elapsedInDays
0	CEU	2003-02-04	GSM25349.CEL.gz	M	96 days
1	CEU	2003-02-04	GSM25350.CEL.gz	M	96 days
2	CEU	2002-12-17	GSM25356.CEL.gz	M	47 days
3	CEU	2003-01-30	GSM25357.CEL.gz	M	91 days
4	CEU	2003-01-03	GSM25358.CEL.gz	M	64 days

Problem 1(d)¶

Use exploratory analysis and the singular value decomposition (SVD) of the gene expression data matrix to determine if the date the samples were processed has large effect on the variability seen in the data or if it is just ethnicity (which is confounded with date).

Hint: See the end of the lecture from 9/23/2014 for help with SVD.

First subset the the sampleinfo DataFrame to include only the CEU ethnicity. Call this new subsetted DataFrame sampleinfoCEU. Show the head of sampleinfoCEU DataFrame.

In [62]:

#your code here
sampleinfoCEU = sampleinfo[sampleinfo["ethnicity"]=="CEU"]
print(sampleinfoCEU.shape, sampleinfo.shape)
sampleinfoCEU.head()

(102, 5) (208, 5)

Out[62]:

	ethnicity	date	filename	sex	elapsedInDays
0	CEU	2003-02-04	GSM25349.CEL.gz	M	96 days
1	CEU	2003-02-04	GSM25350.CEL.gz	M	96 days
2	CEU	2002-12-17	GSM25356.CEL.gz	M	47 days
3	CEU	2003-01-30	GSM25357.CEL.gz	M	91 days
4	CEU	2003-01-03	GSM25358.CEL.gz	M	64 days

Next, subset the exprs DataFrame to only include the samples with the CEU ethnicity. Name this new subsetted DataFrame exprsCEU. Show the head of the exprsCEU DataFrame.

In [61]:

#your code here
exprsCEU = exprs[sampleinfoCEU.filename]
print(exprsCEU.shape, exprs.shape)
exprsCEU.head()

(8793, 102) (8793, 208)

Out[61]:

	GSM25349.CEL.gz	GSM25350.CEL.gz	GSM25356.CEL.gz	GSM25357.CEL.gz	GSM25358.CEL.gz	GSM25359.CEL.gz	GSM25360.CEL.gz	GSM25361.CEL.gz	GSM25377.CEL.gz	GSM25378.CEL.gz	...	GSM48658.CEL.gz	GSM48660.CEL.gz	GSM48661.CEL.gz	GSM48662.CEL.gz	GSM48663.CEL.gz	GSM48664.CEL.gz	GSM48665.CEL.gz	GSM136725.CEL.gz	GSM136726.CEL.gz	GSM136727.CEL.gz
1007_s_at	6.627014	6.249807	5.934128	6.145268	6.091270	6.041186	6.050375	6.063847	6.226106	6.358282	...	6.501510	6.558100	6.618286	6.869995	5.981000	6.403285	6.248702	6.488579	6.809481	6.133068
1053_at	6.939184	6.818038	7.047962	7.422477	7.272361	7.128216	6.750719	6.836287	6.738022	7.367895	...	7.066686	7.386702	6.407958	7.089180	7.120923	7.018998	7.155419	7.517410	7.200596	7.280781
117_at	5.113570	5.074096	5.371201	5.266550	5.342047	5.063876	5.315898	5.483652	6.689444	6.482782	...	5.600514	5.232676	5.630720	4.944748	5.275577	5.770358	5.616842	5.247190	5.237239	5.401876
121_at	7.833862	7.780682	7.458197	7.655948	7.546555	7.072670	7.092984	6.954225	7.489785	7.388539	...	7.437535	7.714650	7.416252	7.746448	8.001434	7.626723	7.452299	7.331864	7.357102	7.607461
1255_g_at	3.152269	3.111747	3.018932	3.154545	3.107954	3.224284	3.114241	3.044975	3.304038	2.887919	...	3.009983	3.151203	3.199709	3.159496	3.149710	3.242780	3.433125	3.175846	3.105092	3.225123

5 rows × 102 columns

Check to make sure the order of the columns in the exprsCEU DataFrame matches the rows in the sampleinfoCEU DataFrame.

In [63]:

#your code here
(exprsCEU.columns == sampleinfoCEU.filename).any()

Out[63]:

True

Compute the average gene expression intensity in the exprsCEU DataFrame across all the samples. For each sample in the exprsCEU DataFrame, subtract the average gene expression intensity from each of the samples. Show the head of the mean normalized gene expression data.

In [75]:

#your code here
avg = exprsCEU.mean(axis=1)
avg[:4]

Out[75]:

1007_s_at    6.261955
1053_at      7.075217
117_at       5.604127
121_at       7.415835
dtype: float64

In [76]:

df = exprsCEU.sub(avg, axis="index")
df.head()

Out[76]:

	GSM25349.CEL.gz	GSM25350.CEL.gz	GSM25356.CEL.gz	GSM25357.CEL.gz	GSM25358.CEL.gz	GSM25359.CEL.gz	GSM25360.CEL.gz	GSM25361.CEL.gz	GSM25377.CEL.gz	GSM25378.CEL.gz	...	GSM48658.CEL.gz	GSM48660.CEL.gz	GSM48661.CEL.gz	GSM48662.CEL.gz	GSM48663.CEL.gz	GSM48664.CEL.gz	GSM48665.CEL.gz	GSM136725.CEL.gz	GSM136726.CEL.gz	GSM136727.CEL.gz
1007_s_at	0.365059	-0.012149	-0.327827	-0.116687	-0.170685	-0.220769	-0.211580	-0.198109	-0.035849	0.096327	...	0.239554	0.296144	0.356331	0.608040	-0.280956	0.141330	-0.013254	0.226624	0.547526	-0.128887
1053_at	-0.136032	-0.257178	-0.027254	0.347260	0.197144	0.053000	-0.324497	-0.238930	-0.337195	0.292678	...	-0.008531	0.311485	-0.667259	0.013964	0.045706	-0.056219	0.080203	0.442193	0.125379	0.205564
117_at	-0.490556	-0.530031	-0.232926	-0.337577	-0.262080	-0.540250	-0.288228	-0.120475	1.085317	0.878655	...	-0.003613	-0.371451	0.026594	-0.659379	-0.328549	0.166231	0.012716	-0.356936	-0.366887	-0.202251
121_at	0.418026	0.364847	0.042362	0.240113	0.130720	-0.343165	-0.322852	-0.461611	0.073949	-0.027296	...	0.021700	0.298814	0.000417	0.330612	0.585598	0.210887	0.036463	-0.083972	-0.058733	0.191626
1255_g_at	0.018335	-0.022187	-0.115002	0.020611	-0.025980	0.090351	-0.019693	-0.088959	0.170104	-0.246015	...	-0.123951	0.017269	0.065775	0.025562	0.015776	0.108846	0.299192	0.041912	-0.028842	0.091189

5 rows × 102 columns

Using this mean normalized gene expression data, compute the projection to the first Principal Component (PC1).

Hint: Use the numpy.linalg.svd() function in the numpy.linalg module (or the scipy.linalg.svd() function in the scipy.linalg module) to apply an singular value decomposition to a matrix.

In [97]:

#your code here
U, s, Vh = np.linalg.svd(df.values, full_matrices=False)

In [102]:

df.shape, u.shape, s.shape, Vh.shape, Vh.T.shape

Out[102]:

((8793, 102), (8793, 102), (102,), (102, 102), (102, 102))

Create a histogram using the values from PC1. Use a bin size of 25.

In [108]:

#your code here
plt.title("Vh.T[:0,] histogram")
plt.hist(Vh.T[:,0], bins=25);

Create a scatter plot with the days since October 31, 2002 on the x-axis and PC1 on the y-axis.

In [236]:

#your code here
x = sampleinfoCEU.elapsedInDays.dt.days
y = Vh.T[:,0]
plt.scatter(x,y)
plt.xlabel("Days since Oct 31, 2002")
plt.ylabel("PC1")

Out[236]:

<matplotlib.text.Text at 0x126945b70>

Around what day do you notice a difference in the way the samples were processed?

In [200]:

#your code here
for X,Y in zip(x,y):
    if X <=100:
        plt.scatter(X,Y, color="b")
    else:
        plt.scatter(X,Y, color="r")

A more efficent version of the above, using a mask:

In [250]:

mask = x<=100
plt.figure(figsize=(10,6))
plt.scatter(x[mask], y[mask], color="b", label="100 or less days", s=20)
plt.scatter(x[~mask], y[~mask], color="r", label="over a 100 days", s=20)
plt.axvline(100, label="100 days", linestyle="dashed")
plt.xlabel("Days since Oct 31, 2002")
plt.ylabel("PC1")
plt.legend();

Answer: At about 100 days the PC1 of the samples is markedly different. Something seems to be going on there.

Discussion for Problem 1¶

Write a brief discussion of your conclusions to the questions and tasks above in 100 words or less.

First, I need to better understand the PC1 here, but given that numpy must have done the right SVD and projected the high D gene data into a smaller D principal componenets, by plotting PC1 against days we can see that there is a clear difference in the samples after day 100.

The original gene data was very hard to visualize, doing a PCA analysis makes it possible to do so.

Problem 2: Is there a pollster bias in presidential election polls?¶

Problem 2(a)¶

The HuffPost Pollster contains many political polls. You can access these polls from individual races as a CSV but you can also access polls through the HuffPost Pollster API to access the data.

Read in the polls from the 2012 Presidential Election: Barack Obama vs Mitt Romney into a pandas DataFrame called election. For this problem, you may read in the polls for this race directly using the CSV file available from the HuffPost Pollster page.

In [3]:

#your code here
election = pd.read_csv("http://elections.huffingtonpost.com/pollster/2012-general-election-romney-vs-obama.csv")

Show the head of the election DataFrame.

In [4]:

print("df shape:", election.shape)
election.head()

df shape: (586, 17)

Out[4]:

	Pollster	Start Date	End Date	Entry Date/Time (ET)	Number of Observations	Population	Mode	Obama	Romney	Undecided	Other	Pollster URL	Source URL	Partisan	Affiliation	Question Text	Question Iteration
0	Politico/GWU/Battleground	2012-11-04	2012-11-05	2012-11-06T08:40:26Z	1000.0	Likely Voters	Live Phone	47.0	47.0	6.0	NaN	http://elections.huffingtonpost.com/pollster/p...	http://www.politico.com/news/stories/1112/8338...	Nonpartisan	None	NaN	1
1	YouGov/Economist	2012-11-03	2012-11-05	2012-11-26T15:31:23Z	740.0	Likely Voters	Internet	49.0	47.0	3.0	NaN	http://elections.huffingtonpost.com/pollster/p...	http://cdn.yougov.com/cumulus_uploads/document...	Nonpartisan	None	NaN	1
2	Gravis Marketing	2012-11-03	2012-11-05	2012-11-06T09:22:02Z	872.0	Likely Voters	Automated Phone	48.0	48.0	4.0	NaN	http://elections.huffingtonpost.com/pollster/p...	http://www.gravispolls.com/2012/11/gravis-mark...	Nonpartisan	None	NaN	1
3	IBD/TIPP	2012-11-03	2012-11-05	2012-11-06T08:51:48Z	712.0	Likely Voters	Live Phone	50.0	49.0	NaN	1.0	http://elections.huffingtonpost.com/pollster/p...	http://news.investors.com/special-report/50841...	Nonpartisan	None	NaN	1
4	Rasmussen	2012-11-03	2012-11-05	2012-11-06T08:47:50Z	1500.0	Likely Voters	Automated Phone	48.0	49.0	NaN	NaN	http://elections.huffingtonpost.com/pollster/p...	http://www.rasmussenreports.com/public_content...	Nonpartisan	None	NaN	1

How many polls were conducted in November? Define this number as M.

Hint: Subset the election DataFrame for only dates in the Start Date column that are in November 2012.

In [5]:

# converting string date columns into datetime
election["Start Date"] = pd.to_datetime(election["Start Date"])
election["End Date"] = pd.to_datetime(election["End Date"])
election["Entry Date/Time (ET)"] = pd.to_datetime(election["Entry Date/Time (ET)"])

pollsNov12 = election[(election["Start Date"] >= pd.datetime(2012,11,1)) &
                      (election["Start Date"] < pd.datetime(2012,12,1))]
M = pollsNov12.shape[0]
pollsNov12.describe()

Out[5]:

	Number of Observations	Obama	Romney	Undecided	Other	Question Text	Question Iteration
count	19.000000	19.000000	19.000000	12.000000	8.000000	0.0	19.0
mean	1523.368421	48.789474	47.736842	3.083333	1.375000	NaN	1.0
std	999.457754	1.084176	1.097578	1.676486	0.517549	NaN	0.0
min	693.000000	47.000000	45.000000	1.000000	1.000000	NaN	1.0
25%	1009.500000	48.000000	47.000000	2.000000	1.000000	NaN	1.0
50%	1200.000000	49.000000	48.000000	3.000000	1.000000	NaN	1.0
75%	1487.500000	49.500000	48.500000	4.000000	2.000000	NaN	1.0
max	4725.000000	51.000000	49.000000	6.000000	2.000000	NaN	1.0

Answer: 19 polls were conducted in Nov

What was the median of the number of observations in the November polls? Define this quantity as N.

In [6]:

#your code here
N = np.median(pollsNov12["Number of Observations"])
N

Out[6]:

1200.0

Answer: Median for polls in Nov is 1200

Problem 2(b)¶

Using the median sample size $N$ from Problem 1(a), simulate the results from a single poll: simulate the number of votes for Obama out of a sample size $N$ where $p$ = 0.53 is the percent of voters who are voting for Obama.

Hint: Use the binomial distribution with parameters $N$ and $p$ = 0.53.

In [7]:

#your code here
poll = np.mean(np.random.binomial(1, p=0.53, size=int(N)))
poll2 = np.random.binomial(N, 0.53) / N
poll, poll2

Out[7]:

(0.52666666666666662, 0.52500000000000002)

Now, perform a Monte Carlo simulation to obtain the estimated percentage of Obama votes with a sample size $N$ where $N$ is the median sample size calculated in Problem 2(a). Let $p$=0.53 be the percent of voters are voting for Obama.

Hint: You will repeat the simulation above 1,000 times and plot the distribution of the estimated percent of Obama votes from a single poll. The results from the single poll you simulate is random variable and will be different every time you sample.

Two days, first to draw a number from N, or to generate N votes and take the mean:

In [8]:

#your code here
sim = np.random.binomial(N, 0.53, 1000)/N
sim2 = [np.mean(np.random.binomial(1, p=0.53, size=int(N))) for _ in range(1000)]
sim[:5], sim2[:5]

Out[8]:

(array([ 0.54416667,  0.50916667,  0.55083333,  0.5375    ,  0.5175    ]),
 [0.54000000000000004,
  0.52749999999999997,
  0.51916666666666667,
  0.53333333333333333,
  0.54249999999999998])

Plot the distribution of the estimated percentage of Obama votes from your single poll. What is the distribution of the estimated percentage of Obama votes?

In [9]:

#your code here
#plt.hist(sim, bins=20);
simP = sim2 # to easily choose b/w sim and sim2
plt.figure(figsize=(10,6))
sns.distplot(simP, label="Voting distribution")
plt.axvline(np.mean(simP), color="g", label=f"Mean: {np.mean(simP):.2f}", linestyle="dashed")
plt.axvline(0.5, label="50 percent line", color="y", alpha=0.8)
plt.xlabel("percentage for votes for Obama", fontsize=12)
plt.ylabel("Probablity of it happening", fontsize=12)
plt.annotate(f"The mean is {np.mean(simP):.2%}", (np.mean(simP),0.0), (np.mean(simP)+0.01,3), 
             arrowprops={"arrowstyle":'-|>'}, bbox = {"boxstyle":"round", "fc":"0.8"})
plt.text(0.48,20, "So there is a small\nchance of Obama\nloosing", bbox=dict(facecolor='red', alpha=0.3))
plt.legend();

Answer:

What is the standard error (SE) of the estimated percentage from the poll.

Hint: Remember the SE is the standard deviation (SD) of the distribution of a random variable.

In [10]:

#your code here
np.std(simP)

Out[10]:

0.014020573152613193

Problem 2(c)¶

Now suppose we run M polls where M is the number of polls that happened in November (calculated in Problem 2(a)). Run 1,000 simulations and compute the mean of the M polls for each simulation.

In [11]:

np.random.binomial(N, 0.53, M)

Out[11]:

array([623, 667, 612, 630, 631, 630, 619, 610, 603, 640, 610, 633, 664,
       668, 638, 646, 640, 632, 651])

In [12]:

#your code here
polls1k = [np.mean(np.random.binomial(N, 0.53, M)) for _ in range(1000)]
polls1k[:10]

Out[12]:

[634.84210526315792,
 639.42105263157896,
 638.42105263157896,
 643.47368421052636,
 631.9473684210526,
 631.68421052631584,
 638.0,
 634.89473684210532,
 643.10526315789468,
 642.26315789473688]

What is the distribution of the average of polls?

Hint: Show a plot.

In [13]:

#your code here
plt.figure(figsize=(10,6))
plt.xlim(500,700)
plt.ylim(0,0.11)
sns.distplot(polls1k, label="Voting distribution")
sns.distplot(np.array(simP)*1200, label="non averaged poll")
plt.axvline(np.mean(polls1k), color="g", label=f"Mean: {np.mean(polls1k):.2f}", linestyle="dashed")
plt.axvline(N*0.5, label="50 percent line", color="y", alpha=0.8)
plt.xlabel("votes for Obama", fontsize=12)
plt.ylabel("Probablity of it happening", fontsize=12)
plt.legend();

Answer: The average poll has a much tighter outcome, as is clear on the chart above. The non-average poll has a much wider spread.

What is the standard error (SE) of the average of polls?

In [14]:

#your code here
np.std(polls1k), np.std(np.array(simP)*1200)

Out[14]:

(4.0203287292723635, 16.824687783135836)

Answer:

Is the SE of the average of polls larger, the same, or smaller than that the SD of a single poll (calculated in Problem 2(b))? By how much?

Hint: Compute a ratio of the two quantities.

In [15]:

#your code here
np.std(polls1k) / np.std(np.array(simP)*1200)

Out[15]:

0.23895413579692845

In [16]:

np.std(np.array(simP)*1200) / np.std(polls1k)

Out[16]:

4.184903503197094

Answer:

Problem 2(d)¶

Repeat Problem 2(c) but now record the across poll standard deviation in each simulation.

In [17]:

#your code here
polls1kstd = [np.std(np.random.binomial(N, 0.53, M)) for _ in range(1000)]
polls1kstd[:5]

Out[17]:

[14.847516743974527,
 21.288936952443095,
 17.348952405966404,
 16.956192537653056,
 20.507548775675353]

What is the distribution of the across M polls standard deviation?

Hint: Show a plot.

In [18]:

#your code here
sns.distplot(polls1kstd, label="std across M polls")
plt.xlabel("Standard Dev");

Answer:

Problem 2(e)¶

What is the standard deviation of M polls in our real (not simulated) 2012 presidential election data ?

In [19]:

#your code here
np.std(election["Obama"]/100)

Out[19]:

0.024199901720378662

Is this larger, the same, or smaller than what we expeced if polls were not biased.

In [20]:

#your code here
np.std(election["Obama"]/100), np.std(np.array(simP)*1200/100), np.std(election["Obama"]/100)/np.std(np.array(simP)*1200/100)

Out[20]:

(0.024199901720378662, 0.16824687783135833, 0.14383566597078459)

Answer:

Problem 2(f)¶

For AC209 Students: Learn about the normal approximation for the binomial distribution and derive the results of Problem 2(b) and 2(c) analytically (using this approximation). Compare the results obtained analytically to those obtained from simulations.

In [30]:

#your code here

Answer:

Discussion for Problem 2¶

Write a brief discussion of your conclusions to the questions and tasks above in 100 words or less.

Problem 3: Is the average of polls better than just one poll?¶

Problem 3(a)¶

Most undecided voters vote for one of the two candidates at the election. Therefore, the reported percentages underestimate the final value of both candidates. However, if we assume the undecided will split evenly, then the observed difference should be an unbiased estimate of the final difference.

Add a new column to the election DataFrame containg the difference between Obama and Romeny called Diff.

In [25]:

#your code here
election["Diff"] = 100*(election["Obama"] - election["Romney"]) / (election["Obama"] + election["Romney"])
election.head()

Out[25]:

	Pollster	Start Date	End Date	Entry Date/Time (ET)	Number of Observations	Population	Mode	Obama	Romney	Undecided	Other	Pollster URL	Source URL	Partisan	Affiliation	Question Text	Question Iteration	Diff
585	PPP (D)	2009-03-13	2009-03-15	2010-11-08 10:36:31	1403.0	Registered Voters	Automated Phone	44.0	44.0	12.0	NaN	http://elections.huffingtonpost.com/pollster/p...	http://www.publicpolicypolling.com/pdf/PPP_Rel...	Pollster	Dem	NaN	1	0.000000
584	PPP (D)	2009-04-17	2009-04-19	2010-11-08 10:45:18	686.0	Registered Voters	Automated Phone	50.0	39.0	11.0	NaN	http://elections.huffingtonpost.com/pollster/p...	http://www.publicpolicypolling.com/pdf/surveys...	Pollster	Dem	NaN	1	12.359551
583	PPP (D)	2009-05-14	2009-05-18	2010-11-08 10:49:08	1000.0	Registered Voters	Automated Phone	53.0	35.0	12.0	NaN	http://elections.huffingtonpost.com/pollster/p...	http://www.publicpolicypolling.com/pdf/surveys...	Pollster	Dem	NaN	1	20.454545
582	PPP (D)	2009-06-12	2009-06-16	2010-11-08 10:58:00	638.0	Registered Voters	Automated Phone	48.0	40.0	12.0	NaN	http://elections.huffingtonpost.com/pollster/p...	http://www.publicpolicypolling.com/pdf/surveys...	Pollster	Dem	NaN	1	9.090909
581	PPP (D)	2009-07-15	2009-07-16	2010-11-08 11:02:37	577.0	Registered Voters	Automated Phone	49.0	40.0	11.0	NaN	http://elections.huffingtonpost.com/pollster/p...	http://www.publicpolicypolling.com/pdf/surveys...	Pollster	Dem	NaN	1	10.112360

In [22]:

election["Diff"].describe()

Out[22]:

count    586.000000
mean       2.433892
std        4.211205
min      -17.777778
25%        0.000000
50%        2.127660
75%        4.651163
max       20.454545
Name: Diff, dtype: float64

Problem 3(b)¶

Make a plot of the differences for the week before the election (e.g. 5 days) where the days are on the x-axis and the differences are on the y-axis. Add a horizontal line showing 3.9%: the difference between Obama and Romney on election day.

In [23]:

election.sort_values("End Date", ascending=True, inplace=True)

Adding a column to easily see the days from end of elections:

In [28]:

election["Days from End"] = -(election["End Date"] - election["End Date"].max()).dt.days.values
election.sort_values("Days from End", ascending=False, inplace=True)
election.head()

Out[28]:

	Pollster	Start Date	End Date	Entry Date/Time (ET)	Number of Observations	Population	Mode	Obama	Romney	Undecided	Other	Pollster URL	Source URL	Partisan	Affiliation	Question Text	Question Iteration	Diff	Days from End
585	PPP (D)	2009-03-13	2009-03-15	2010-11-08 10:36:31	1403.0	Registered Voters	Automated Phone	44.0	44.0	12.0	NaN	http://elections.huffingtonpost.com/pollster/p...	http://www.publicpolicypolling.com/pdf/PPP_Rel...	Pollster	Dem	NaN	1	0.000000	1331
584	PPP (D)	2009-04-17	2009-04-19	2010-11-08 10:45:18	686.0	Registered Voters	Automated Phone	50.0	39.0	11.0	NaN	http://elections.huffingtonpost.com/pollster/p...	http://www.publicpolicypolling.com/pdf/surveys...	Pollster	Dem	NaN	1	12.359551	1296
583	PPP (D)	2009-05-14	2009-05-18	2010-11-08 10:49:08	1000.0	Registered Voters	Automated Phone	53.0	35.0	12.0	NaN	http://elections.huffingtonpost.com/pollster/p...	http://www.publicpolicypolling.com/pdf/surveys...	Pollster	Dem	NaN	1	20.454545	1267
582	PPP (D)	2009-06-12	2009-06-16	2010-11-08 10:58:00	638.0	Registered Voters	Automated Phone	48.0	40.0	12.0	NaN	http://elections.huffingtonpost.com/pollster/p...	http://www.publicpolicypolling.com/pdf/surveys...	Pollster	Dem	NaN	1	9.090909	1238
581	PPP (D)	2009-07-15	2009-07-16	2010-11-08 11:02:37	577.0	Registered Voters	Automated Phone	49.0	40.0	11.0	NaN	http://elections.huffingtonpost.com/pollster/p...	http://www.publicpolicypolling.com/pdf/surveys...	Pollster	Dem	NaN	1	10.112360	1208

First, since we're looking only at the last 7 days, I make a mask to access only that data:

In [29]:

ma = election["Days from End"] <= 7
f"there are {np.sum(ma)} polls in the 7 days before election"
election[ma].describe()

Out[29]:

	Number of Observations	Obama	Romney	Undecided	Other	Question Text	Question Iteration	Diff	Days from End
count	34.000000	34.000000	34.000000	22.000000	13.000000	0.0	34.0	34.000000	34.000000
mean	2518.617647	48.294118	47.382353	4.045455	1.538462	NaN	1.0	0.958681	2.411765
std	6070.819644	1.219275	1.392941	2.419706	0.776250	NaN	0.0	1.453875	2.175868
min	693.000000	46.000000	43.000000	1.000000	1.000000	NaN	1.0	-2.083333	0.000000
25%	1004.750000	48.000000	47.000000	2.250000	1.000000	NaN	1.0	0.000000	1.000000
50%	1164.000000	48.000000	48.000000	3.000000	1.000000	NaN	1.0	1.030928	2.000000
75%	1500.000000	49.000000	48.000000	6.000000	2.000000	NaN	1.0	2.072704	4.000000
max	36472.000000	51.000000	49.000000	11.000000	3.000000	NaN	1.0	4.255319	7.000000

In [30]:

sns.lmplot("Days from End", "Diff", data=election[ma], hue="Pollster", fit_reg=False, legend=True, size=6)
plt.axhline(3.9, label="Actual Diff at Election")
plt.xlim(7, -0)
ymin, ymax = plt.ylim()
plt.fill_between("Days from End", ymax, data=election[ma], alpha=0.051, color="b")
plt.fill_between("Days from End", ymin, data=election[ma], alpha=0.051, color="r")
plt.ylim(ymin, ymax)
plt.title("Polls 7 Days from elections");

Problem 3(c)¶

Make a plot showing the differences by pollster where the pollsters are on the x-axis and the differences on the y-axis.

In [42]:

dfpoll = election[ma].reset_index()
dfpoll.head(2)

Out[42]:

	index	Pollster	Start Date	End Date	Entry Date/Time (ET)	Number of Observations	Population	Mode	Obama	Romney	Undecided	Other	Pollster URL	Source URL	Partisan	Affiliation	Question Text	Question Iteration	Diff	Days from End
0	33	JZ Analytics/Newsmax	2012-10-27	2012-10-29	2012-10-30 17:16:32	1073.0	Likely Voters	Internet	46.0	47.0	7.0	NaN	http://elections.huffingtonpost.com/pollster/p...	http://www.jzanalytics.com/	Sponsor	Rep	NaN	1	-1.075269	7
1	32	High Point University	2012-10-22	2012-10-30	2012-11-01 13:19:44	805.0	Registered Voters	Live Phone	46.0	43.0	11.0	1.0	http://elections.huffingtonpost.com/pollster/p...	http://acme.highpoint.edu/~mkifer/src/19releas...	Nonpartisan	None	NaN	1	3.370787	6

In [43]:

sns.swarmplot(x="Pollster", y="Diff", hue="Pollster", data=dfpoll, size=10)
#sns.factorplot(x="Pollster", y="Diff", hue="Pollster", data=dfpoll, size=10)
plt.axhline(0, alpha=0.2, color="black")
plt.text(0,0.6,"Obama Wins", bbox={'facecolor':'blue', 'alpha':0.1})
plt.text(0,-0.8,"Romney Wins", bbox={'facecolor':'red', 'alpha':0.1})
plt.xticks(rotation=90, fontsize=11)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.);

Is the across poll difference larger than the between pollster difference?

Answer: There is a lot of variation across polls b/w the same pollster.

Problem 3(d)¶

Take the average for each pollster and then compute the average of that. Given this difference how confident would you have been of an Obama victory?

Hint: Compute an estimate of the SE of this average based exclusively on the observed data.

In [44]:

#your code here
# first getting the avg of each pollsters poll
dfpollavg = election[ma].groupby("Pollster").mean().reset_index()
win_perctange = dfpollavg['Diff'].mean() 
print(f"avg of polls predict an Obama win by {win_perctange:.2f}%")
print(f"The avg error is {np.std(dfpollavg.Diff):.2f}")

avg of polls predict an Obama win by 1.14%
The avg error is 1.38

Answer: The win % is small, and the standard error is large, so the polls make the Obama victory pretty uncertain. It's well within the margin of error.

Problem 3(e)¶

For AC209 Students: Show the difference against time and see if you can detect a trend towards the end. Use this trend to see if it improves the final estimate.

In [45]:

election.head(2)

Out[45]:

	Pollster	Start Date	End Date	Entry Date/Time (ET)	Number of Observations	Population	Mode	Obama	Romney	Undecided	Other	Pollster URL	Source URL	Partisan	Affiliation	Question Text	Question Iteration	Diff	Days from End
585	PPP (D)	2009-03-13	2009-03-15	2010-11-08 10:36:31	1403.0	Registered Voters	Automated Phone	44.0	44.0	12.0	NaN	http://elections.huffingtonpost.com/pollster/p...	http://www.publicpolicypolling.com/pdf/PPP_Rel...	Pollster	Dem	NaN	1	0.000000	1331
584	PPP (D)	2009-04-17	2009-04-19	2010-11-08 10:45:18	686.0	Registered Voters	Automated Phone	50.0	39.0	11.0	NaN	http://elections.huffingtonpost.com/pollster/p...	http://www.publicpolicypolling.com/pdf/surveys...	Pollster	Dem	NaN	1	12.359551	1296

In [106]:

#your code here
#sns.lmplot("Days from End", "Diff", data=election, fit_reg=True, legend=False, size=6)
sns.lmplot("Days from End", "Diff", data=election[election["Days from End"] <= 30], scatter=False)
plt.plot("Days from End", "Diff", data=election[election["Days from End"] <= 100], color="g", alpha=0.2,
        label="All polls")
plt.axhline(3.9, label="Actual Diff at Election", linestyle="dashed", linewidth=1)
plt.axhline(0, label="0 line", linestyle="-", linewidth=1)
plt.xlim(100, 0)
plt.title("All polls before election");

Answer: It's interesting how the polls markedly swing from Obama down to Romney, and then there is a marked upward trend in the last 30 days, though it doesn't reach close to the 3.9 mark.

Discussion for Problem 3¶

Write a brief discussion of your conclusions to the questions and tasks above in 100 words or less.

I would look closely at the methods Nate Silver used to deal with polls, since there is so much variability both b/w the same poll and different polls. Also I'd compute confidence intervals etc.

Problem 4¶

In this last problem, we will use the polls from the 2014 Senate Midterm Elections from the HuffPost Pollster API to create a preliminary prediction of the result of each state.

The HuffPost Pollster API allows you to access the data as a CSV or a JSON response by tacking ".csv" or ".json" at the end of the URLs. For example the 2012 Presidential Election could be accessed as a .json instead of a .csv

Problem 4(a)¶

Read in the polls for all of the 2014 Senate Elections using the HuffPost API. For example, we can consider the 2014 Senate race in Kentucky between Mitch McConnell and Alison Grimes.

To search for the 2014 Senate races, use the topics parameter in the API [read more about topics here].

In [107]:

url_str = "http://elections.huffingtonpost.com/pollster/api/charts/?topic=2014-senate"

To list all the URLs related to the 2014 Senate races using the pollster API, we can use a list comprehension:

In [108]:

election_urls = [election['url'] + '.csv' for election in requests.get(url_str).json()]
election_urls

Out[108]:

['http://elections.huffingtonpost.com/pollster/2014-delaware-senate-wade-vs-coons.csv',
 'http://elections.huffingtonpost.com/pollster/2014-massachusetts-senate-herr-vs-markey.csv',
 'http://elections.huffingtonpost.com/pollster/2014-rhode-island-senate-zaccaria-vs-reed.csv',
 'http://elections.huffingtonpost.com/pollster/2014-montana-senate-daines-vs-curtis.csv',
 'http://elections.huffingtonpost.com/pollster/2014-oklahoma-senate-lankford-vs-johnson.csv',
 'http://elections.huffingtonpost.com/pollster/2014-hawaii-senate-cavasso-vs-schatz.csv',
 'http://elections.huffingtonpost.com/pollster/2014-kansas-senate-roberts-vs-orman-vs-taylor.csv',
 'http://elections.huffingtonpost.com/pollster/2014-wyoming-senate.csv',
 'http://elections.huffingtonpost.com/pollster/2014-tennessee-senate-alexander-vs-ball.csv',
 'http://elections.huffingtonpost.com/pollster/2014-idaho-senate-risch-vs-mitchell.csv',
 'http://elections.huffingtonpost.com/pollster/2014-new-jersey-senate-bell-vs-booker.csv',
 'http://elections.huffingtonpost.com/pollster/2014-new-mexico-senate-weh-vs-udall.csv',
 'http://elections.huffingtonpost.com/pollster/2014-oklahoma-senate-inhofe-vs-silverstein.csv',
 'http://elections.huffingtonpost.com/pollster/2014-south-carolina-senate-scott-vs-dickerson.csv',
 'http://elections.huffingtonpost.com/pollster/2014-south-carolina-senate-graham-vs-hutto.csv',
 'http://elections.huffingtonpost.com/pollster/2014-texas-senate-cornyn-vs-alameel.csv',
 'http://elections.huffingtonpost.com/pollster/2014-minnesota-senate-mcfadden-vs-franken.csv',
 'http://elections.huffingtonpost.com/pollster/2014-maine-senate-collins-vs-bellows.csv',
 'http://elections.huffingtonpost.com/pollster/2014-south-dakota-senate-rounds-vs-weiland.csv',
 'http://elections.huffingtonpost.com/pollster/2014-nebraska-senate-sasse-vs-domina.csv',
 'http://elections.huffingtonpost.com/pollster/2014-georgia-senate-perdue-vs-nunn.csv',
 'http://elections.huffingtonpost.com/pollster/2014-oregon-senate-wehby-vs-merkley.csv',
 'http://elections.huffingtonpost.com/pollster/2014-mississippi-senate-cochran-vs-childers.csv',
 'http://elections.huffingtonpost.com/pollster/2014-iowa-senate-ernst-vs-braley.csv',
 'http://elections.huffingtonpost.com/pollster/2014-alaska-senate-sullivan-vs-begich.csv',
 'http://elections.huffingtonpost.com/pollster/2014-illinois-senate-oberweis-vs-durbin.csv',
 'http://elections.huffingtonpost.com/pollster/2014-colorado-senate-gardner-vs-udall.csv',
 'http://elections.huffingtonpost.com/pollster/2014-virginia-senate-gillespie-vs-warner.csv',
 'http://elections.huffingtonpost.com/pollster/2014-north-carolina-senate-tillis-vs-hagan.csv',
 'http://elections.huffingtonpost.com/pollster/2014-new-hampshire-senate-bass-vs-shaheen.csv',
 'http://elections.huffingtonpost.com/pollster/2014-west-virginia-senate-capito-vs-tennant.csv',
 'http://elections.huffingtonpost.com/pollster/2014-new-hampshire-senate-brown-vs-shaheen.csv',
 'http://elections.huffingtonpost.com/pollster/2014-louisiana-senate-cassidy-vs-landrieu.csv',
 'http://elections.huffingtonpost.com/pollster/2014-michigan-senate-land-vs-peters.csv',
 'http://elections.huffingtonpost.com/pollster/2014-arkansas-senate-cotton-vs-pryor.csv',
 'http://elections.huffingtonpost.com/pollster/2014-kentucky-senate-mcconnell-vs-grimes.csv']

Because there so many Senate races, we can create a dictionary of pandas DataFrames that will be keyed by the name of the election (a string).

In [109]:

def build_frame(url):
    """
    Returns a pandas DataFrame object containing
    the data returned from the given url
    """
    source = requests.get(url).text
    
    # Use StringIO because pd.DataFrame.from_csv requires .read() method
    s = StringIO(source)
    
    return pd.DataFrame.from_csv(s, index_col=None).convert_objects(
            convert_dates="coerce", convert_numeric=True)

In [110]:

# Makes a dictionary of pandas DataFrames keyed on election string.
dfs = dict((election.split("/")[-1][:-4], build_frame(election)) for election in election_urls)

/Users/ko/anaconda/lib/python3.6/site-packages/ipykernel_launcher.py:12: FutureWarning: convert_objects is deprecated.  Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.
  if sys.path[0] == '':

Show the head of the DataFrame containing the polls for the 2014 Senate race in Kentucky between McConnell and Grimes.

In [154]:

#your code here
dfs['2014-kentucky-senate-mcconnell-vs-grimes'].head(2)

Out[154]:

	Pollster	Start Date	End Date	Entry Date/Time (ET)	Number of Observations	Population	Mode	Grimes	McConnell	Undecided	Pollster URL	Source URL	Partisan	Affiliation	Question Text	Question Iteration
0	PPP (D)	2014-10-30	2014-11-01	2014-11-02 01:44:58	1503	Likely Voters	IVR/Online	42.0	50.0	NaN	http://elections.huffingtonpost.com/pollster/p...	http://www.publicpolicypolling.com/pdf/2014/PP...	Pollster	Dem	NaN	1
1	NBC/Marist	2014-10-27	2014-10-30	2014-11-02 14:37:22	556	Likely Voters	Live Phone	41.0	50.0	3.0	http://elections.huffingtonpost.com/pollster/p...	http://newscms.nbcnews.com/sites/newscms/files...	Nonpartisan	None	NaN	1

Problem 4(b)¶

For each 2014 Senate race, create a preliminary prediction of the result for that state.

First, I get the names of the contenders in each polls:

In [167]:

# all the columns besides the contenders
drop_cols = ['Pollster','Start Date','End Date','Entry Date/Time (ET)','Number of Observations',
             'Population','Mode','Undecided','Pollster URL', 'Source URL',
             'Partisan','Affiliation','Question Text','Question Iteration']

# lets see if that works to get the names
[col for col in dfs['2014-kentucky-senate-mcconnell-vs-grimes'].columns if col not in drop_cols]

Out[167]:

['Grimes', 'McConnell']

In [268]:

#your code here
predictions = []
for election, df in dfs.items():
    names = [col for col in df.columns if col not in drop_cols]

    win_name = None
    high = False

    for name in names:
        m = df[name].mean()
        if m > high:
            high = m
            win_name = name
    
    predictions.append({"Winner": win_name, "Poll": high, "Race": election})
    #print(predictions)

winners = pd.DataFrame(predictions, cols=["Race","Winner","Poll"])
#winners.set_index("Race")

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-268-fa479debe2b4> in <module>()
     16     #print(predictions)
     17 
---> 18 winners = pd.DataFrame(predictions, cols=["Race","Winner","Poll"])
     19 #winners.set_index("Race")

TypeError: __init__() got an unexpected keyword argument 'cols'

an attempt to show the winners visually, though this could be done better:

In [265]:

sns.swarmplot(x="Race", y="Poll", data=winners, size=10, hue="Winner")
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.);
plt.xticks(rotation=90, fontsize=11);

Submission Instructions¶

To submit your homework, create a folder named lastname_firstinitial_hw# and place your IPython notebooks, data files, and any other files in this folder. Your IPython Notebooks should be completely executed with the results visible in the notebook. We should not have to run any code. Compress the folder (please use .zip compression) and submit to the CS109 dropbox in the appropriate folder. If we cannot access your work because these directions are not followed correctly, we will not grade your work.