Due: Thursday, October 2, 2014 11:59 PM
To submit your homework, create a folder named lastname_firstinitial_hw# and place your IPython notebooks, data files, and any other files in this folder. Your IPython Notebooks should be completely executed with the results visible in the notebook. We should not have to run any code. Compress the folder (please use .zip compression) and submit to the CS109 dropbox in the appropriate folder. If we cannot access your work because these directions are not followed correctly, we will not grade your work.
John Tukey wrote in Exploratory Data Analysis, 1977: "The greatest value of a picture is when it forces us to notice what we never expected to see." In this assignment we will continue using our exploratory data analysis tools, but apply it to new sets of data: gene expression and polls from the 2012 Presidental Election and from the 2014 Senate Midterm Elections.
First: You will use exploratory data analysis and apply the singular value decomposition (SVD) to a gene expression data matrix to determine if the the date that the gene expression samples are processed has large effect on the variability seen in the data.
Second: You will use the polls from the 2012 Presidential Elections to determine (1) Is there a pollster bias in presidential election polls? and (2) Is the average of polls better than just one poll?
Finally: You will use the HuffPost Pollster API to extract the polls for the current 2014 Senate Midterm Elections and provide a preliminary prediction of the result of each state.
We will use the following data sets:
A gene expression data set called exprs_GSE5859.csv
and sample annotation table called sampleinfo_GSE5859.csv
which are both available on Github in the 2014_data repository: expression data set and sample annotation table.
Polls from the 2012 Presidential Election: Barack Obama vs Mitt Romney. The polls we will use are from the Huffington Post Pollster.
Polls from the 2014 Senate Midterm Elections from the HuffPost Pollster API.
# special IPython command to prepare the notebook for matplotlib
%matplotlib inline
import requests
# fix stringio for python 3 version
#from StringIO import StringIO
from io import StringIO
import numpy as np
import pandas as pd # pandas
import matplotlib.pyplot as plt # module for plotting
import seaborn as sns
import datetime as dt # module for manipulating dates and times
import numpy.linalg as lin # module for performing linear algebra operations
import os
In this problem we will be using a gene expression data set obtained from a microarray experiement Read more about the specific experiment here. There are two data sets we will use:
The gene expression intensities where the rows represent the features on the microarray (e.g. genes) and the columsns represent the different microarray samples.
A table that contains the information about each of the samples (columns in the gene expression data set) such as the sex, the age, the treatment status, the date the samples were processed. Each row represents one sample.
Read in the two files from Github: exprs_GSE5859.csv and sampleinfo_GSE5859.csv as pandas DataFrames called exprs
and sampleinfo
. Use the gene names as the index of the exprs
DataFrame.
#your code here
exprs__url = "https://github.com/cs109/2014_data/raw/master/exprs_GSE5859.csv"
sampleinfo_url = "https://github.com/cs109/2014_data/raw/master/sampleinfo_GSE5859.csv"
urls = [exprs__url, sampleinfo_url]
filenames = ["exprs_GSE5859.csv", "sampleinfo_GSE5859.csv"]
for url, filename in zip(urls, filenames):
r = requests.get(url, stream=True)
with open("data/"+filename, 'wb') as fd:
for chunk in r.iter_content(chunk_size=256):
fd.write(chunk)
print(f"file {filename} downloaded succesfully")
file exprs_GSE5859.csv downloaded succesfully file sampleinfo_GSE5859.csv downloaded succesfully
Double checking that our files are in the data dir:
[file for file in os.listdir("data") if file in filenames]
['exprs_GSE5859.csv', 'sampleinfo_GSE5859.csv']
exprs = pd.read_csv("data/"+filenames[0], index_col=0)
exprs.head()
GSM25581.CEL.gz | GSM25681.CEL.gz | GSM136524.CEL.gz | GSM136707.CEL.gz | GSM25553.CEL.gz | GSM136676.CEL.gz | GSM136711.CEL.gz | GSM136542.CEL.gz | GSM136535.CEL.gz | GSM25399.CEL.gz | ... | GSM48650.CEL.gz | GSM25687.CEL.gz | GSM25685.CEL.gz | GSM136549.CEL.gz | GSM25427.CEL.gz | GSM25525.CEL.gz | GSM25349.CEL.gz | GSM136727.CEL.gz | GSM25626.CEL.gz | GSM136725.CEL.gz | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1007_s_at | 6.333951 | 5.736190 | 6.156754 | 6.513010 | 6.061013 | 6.506493 | 6.863426 | 6.369645 | 6.646321 | 5.810930 | ... | 7.082636 | 6.315866 | 7.010165 | 6.460107 | 6.122355 | 6.330314 | 6.627014 | 6.133068 | 6.419444 | 6.488579 |
1053_at | 7.255622 | 7.399993 | 7.491967 | 7.006123 | 7.250995 | 7.082581 | 6.842236 | 7.048487 | 7.042361 | 6.666625 | ... | 6.599718 | 7.412731 | 7.274429 | 6.928642 | 7.168782 | 7.235648 | 6.939184 | 7.280781 | 6.351776 | 7.517410 |
117_at | 5.760106 | 4.825169 | 5.039387 | 5.414160 | 5.205697 | 5.300078 | 5.099337 | 5.156459 | 5.736109 | 5.807496 | ... | 6.231457 | 5.240717 | 5.275062 | 5.759827 | 5.409720 | 5.491938 | 5.113570 | 5.401876 | 5.537605 | 5.247190 |
121_at | 6.935451 | 7.025210 | 7.543667 | 7.959781 | 7.223323 | 8.006816 | 8.102504 | 7.434617 | 7.532321 | 7.711461 | ... | 8.215096 | 7.677593 | 7.923624 | 7.219401 | 7.432338 | 6.785174 | 7.833862 | 7.607461 | 7.302935 | 7.331864 |
1255_g_at | 3.597872 | 3.108801 | 3.128269 | 3.364947 | 3.125909 | 3.186834 | 3.311022 | 3.092042 | 3.167737 | 2.969579 | ... | 3.331424 | 3.043856 | 3.151900 | 3.132253 | 3.250600 | 2.984125 | 3.152269 | 3.225123 | 3.222731 | 3.175846 |
5 rows × 208 columns
sampleinfo = pd.read_csv("data/"+filenames[1])
sampleinfo.head()
ethnicity | date | filename | sex | |
---|---|---|---|---|
0 | CEU | 2003-02-04 | GSM25349.CEL.gz | M |
1 | CEU | 2003-02-04 | GSM25350.CEL.gz | M |
2 | CEU | 2002-12-17 | GSM25356.CEL.gz | M |
3 | CEU | 2003-01-30 | GSM25357.CEL.gz | M |
4 | CEU | 2003-01-03 | GSM25358.CEL.gz | M |
Make sure the order of the columns in the gene expression DataFrame match the order of file names in the sample annotation DataFrame. If the order of the columns the exprs
DataFrame do not match the order of the file names in the sampleinfo
DataFrame, reorder the columns in the exprs
DataFrame.
Note: The column names of the gene expression DataFrame are the filenames of the orignal files from which these data were obtained.
Hint: The method list.index(x)
[read here] can be used to return the index in the list of the first item whose value is x. It is an error if there is no such item. To check if the order of the columns in exprs
matches the order of the rows in sampleinfo
, you can check using the method .all()
on a Boolean or list of Booleans:
Example code: (exprs.columns == sampleinfo.filename).all()
#your code here
# change the ordering of columns to be the same as the sample filenames
exprs = exprs[sampleinfo.filename]
#check order
(exprs.columns == sampleinfo.filename).all()
True
Show the head of the two tables: exprs
and sampleinfo
.
#your code here
exprs.head()
GSM25349.CEL.gz | GSM25350.CEL.gz | GSM25356.CEL.gz | GSM25357.CEL.gz | GSM25358.CEL.gz | GSM25359.CEL.gz | GSM25360.CEL.gz | GSM25361.CEL.gz | GSM25377.CEL.gz | GSM25378.CEL.gz | ... | GSM136719.CEL.gz | GSM136720.CEL.gz | GSM136721.CEL.gz | GSM136722.CEL.gz | GSM136723.CEL.gz | GSM136724.CEL.gz | GSM136725.CEL.gz | GSM136726.CEL.gz | GSM136727.CEL.gz | GSM136729.CEL.gz | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1007_s_at | 6.627014 | 6.249807 | 5.934128 | 6.145268 | 6.091270 | 6.041186 | 6.050375 | 6.063847 | 6.226106 | 6.358282 | ... | 6.466445 | 6.533592 | 6.631492 | 6.513362 | 6.440706 | 6.704324 | 6.488579 | 6.809481 | 6.133068 | 6.155473 |
1053_at | 6.939184 | 6.818038 | 7.047962 | 7.422477 | 7.272361 | 7.128216 | 6.750719 | 6.836287 | 6.738022 | 7.367895 | ... | 7.032885 | 7.156344 | 7.018025 | 6.973322 | 6.884738 | 7.120898 | 7.517410 | 7.200596 | 7.280781 | 7.111583 |
117_at | 5.113570 | 5.074096 | 5.371201 | 5.266550 | 5.342047 | 5.063876 | 5.315898 | 5.483652 | 6.689444 | 6.482782 | ... | 5.661112 | 5.127260 | 5.151840 | 5.505602 | 5.687689 | 4.942651 | 5.247190 | 5.237239 | 5.401876 | 5.302628 |
121_at | 7.833862 | 7.780682 | 7.458197 | 7.655948 | 7.546555 | 7.072670 | 7.092984 | 6.954225 | 7.489785 | 7.388539 | ... | 7.769734 | 7.815864 | 7.683279 | 7.883231 | 7.913621 | 7.834196 | 7.331864 | 7.357102 | 7.607461 | 7.456453 |
1255_g_at | 3.152269 | 3.111747 | 3.018932 | 3.154545 | 3.107954 | 3.224284 | 3.114241 | 3.044975 | 3.304038 | 2.887919 | ... | 3.257484 | 3.339234 | 3.298384 | 3.150654 | 3.344501 | 3.230285 | 3.175846 | 3.105092 | 3.225123 | 3.090149 |
5 rows × 208 columns
sampleinfo.head()
ethnicity | date | filename | sex | |
---|---|---|---|---|
0 | CEU | 2003-02-04 | GSM25349.CEL.gz | M |
1 | CEU | 2003-02-04 | GSM25350.CEL.gz | M |
2 | CEU | 2002-12-17 | GSM25356.CEL.gz | M |
3 | CEU | 2003-01-30 | GSM25357.CEL.gz | M |
4 | CEU | 2003-01-03 | GSM25358.CEL.gz | M |
Extract the year and month as integers from the sampleinfo
table.
Hint: To convert a Series or a column of a pandas DataFrame that contains a date-like object, you can use the to_datetime
function [read here]. This will create a DatetimeIndex
which can be used to extract the month and year for each row in the DataFrame.
#your code here
dates = pd.to_datetime(sampleinfo["date"])
years = [t.year for t in dates]
months = [t.month for t in dates]
Convert the dates in the date
column from the sampleinfo
table into days since October 31, 2002. Add a column to the sampleinfo
DataFrame titled elapsedInDays
containing the days since October 31, 2002. Show the head of the sampleinfo
DataFrame which includes the new column.
Hint: Use the datetime
module to create a new datetime
object for the specific date October 31, 2002. Then, subtract the October 31, 2002 date from each date from the date
column in the sampleinfo
DataFrame.
#your code here
basedate = pd.to_datetime("October 31, 2002")
basedate
sampleinfo["elapsedInDays"] = pd.to_datetime(sampleinfo.date) - basedate
sampleinfo.head()
ethnicity | date | filename | sex | elapsedInDays | |
---|---|---|---|---|---|
0 | CEU | 2003-02-04 | GSM25349.CEL.gz | M | 96 days |
1 | CEU | 2003-02-04 | GSM25350.CEL.gz | M | 96 days |
2 | CEU | 2002-12-17 | GSM25356.CEL.gz | M | 47 days |
3 | CEU | 2003-01-30 | GSM25357.CEL.gz | M | 91 days |
4 | CEU | 2003-01-03 | GSM25358.CEL.gz | M | 64 days |
Use exploratory analysis and the singular value decomposition (SVD) of the gene expression data matrix to determine if the date the samples were processed has large effect on the variability seen in the data or if it is just ethnicity (which is confounded with date).
Hint: See the end of the lecture from 9/23/2014 for help with SVD.
First subset the the sampleinfo
DataFrame to include only the CEU ethnicity. Call this new subsetted DataFrame sampleinfoCEU
. Show the head of sampleinfoCEU
DataFrame.
#your code here
sampleinfoCEU = sampleinfo[sampleinfo["ethnicity"]=="CEU"]
print(sampleinfoCEU.shape, sampleinfo.shape)
sampleinfoCEU.head()
(102, 5) (208, 5)
ethnicity | date | filename | sex | elapsedInDays | |
---|---|---|---|---|---|
0 | CEU | 2003-02-04 | GSM25349.CEL.gz | M | 96 days |
1 | CEU | 2003-02-04 | GSM25350.CEL.gz | M | 96 days |
2 | CEU | 2002-12-17 | GSM25356.CEL.gz | M | 47 days |
3 | CEU | 2003-01-30 | GSM25357.CEL.gz | M | 91 days |
4 | CEU | 2003-01-03 | GSM25358.CEL.gz | M | 64 days |
Next, subset the exprs
DataFrame to only include the samples with the CEU ethnicity. Name this new subsetted DataFrame exprsCEU
. Show the head of the exprsCEU
DataFrame.
#your code here
exprsCEU = exprs[sampleinfoCEU.filename]
print(exprsCEU.shape, exprs.shape)
exprsCEU.head()
(8793, 102) (8793, 208)
GSM25349.CEL.gz | GSM25350.CEL.gz | GSM25356.CEL.gz | GSM25357.CEL.gz | GSM25358.CEL.gz | GSM25359.CEL.gz | GSM25360.CEL.gz | GSM25361.CEL.gz | GSM25377.CEL.gz | GSM25378.CEL.gz | ... | GSM48658.CEL.gz | GSM48660.CEL.gz | GSM48661.CEL.gz | GSM48662.CEL.gz | GSM48663.CEL.gz | GSM48664.CEL.gz | GSM48665.CEL.gz | GSM136725.CEL.gz | GSM136726.CEL.gz | GSM136727.CEL.gz | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1007_s_at | 6.627014 | 6.249807 | 5.934128 | 6.145268 | 6.091270 | 6.041186 | 6.050375 | 6.063847 | 6.226106 | 6.358282 | ... | 6.501510 | 6.558100 | 6.618286 | 6.869995 | 5.981000 | 6.403285 | 6.248702 | 6.488579 | 6.809481 | 6.133068 |
1053_at | 6.939184 | 6.818038 | 7.047962 | 7.422477 | 7.272361 | 7.128216 | 6.750719 | 6.836287 | 6.738022 | 7.367895 | ... | 7.066686 | 7.386702 | 6.407958 | 7.089180 | 7.120923 | 7.018998 | 7.155419 | 7.517410 | 7.200596 | 7.280781 |
117_at | 5.113570 | 5.074096 | 5.371201 | 5.266550 | 5.342047 | 5.063876 | 5.315898 | 5.483652 | 6.689444 | 6.482782 | ... | 5.600514 | 5.232676 | 5.630720 | 4.944748 | 5.275577 | 5.770358 | 5.616842 | 5.247190 | 5.237239 | 5.401876 |
121_at | 7.833862 | 7.780682 | 7.458197 | 7.655948 | 7.546555 | 7.072670 | 7.092984 | 6.954225 | 7.489785 | 7.388539 | ... | 7.437535 | 7.714650 | 7.416252 | 7.746448 | 8.001434 | 7.626723 | 7.452299 | 7.331864 | 7.357102 | 7.607461 |
1255_g_at | 3.152269 | 3.111747 | 3.018932 | 3.154545 | 3.107954 | 3.224284 | 3.114241 | 3.044975 | 3.304038 | 2.887919 | ... | 3.009983 | 3.151203 | 3.199709 | 3.159496 | 3.149710 | 3.242780 | 3.433125 | 3.175846 | 3.105092 | 3.225123 |
5 rows × 102 columns
Check to make sure the order of the columns in the exprsCEU
DataFrame matches the rows in the sampleinfoCEU
DataFrame.
#your code here
(exprsCEU.columns == sampleinfoCEU.filename).any()
True
Compute the average gene expression intensity in the exprsCEU
DataFrame across all the samples. For each sample in the exprsCEU
DataFrame, subtract the average gene expression intensity from each of the samples. Show the head of the mean normalized gene expression data.
#your code here
avg = exprsCEU.mean(axis=1)
avg[:4]
1007_s_at 6.261955 1053_at 7.075217 117_at 5.604127 121_at 7.415835 dtype: float64
df = exprsCEU.sub(avg, axis="index")
df.head()
GSM25349.CEL.gz | GSM25350.CEL.gz | GSM25356.CEL.gz | GSM25357.CEL.gz | GSM25358.CEL.gz | GSM25359.CEL.gz | GSM25360.CEL.gz | GSM25361.CEL.gz | GSM25377.CEL.gz | GSM25378.CEL.gz | ... | GSM48658.CEL.gz | GSM48660.CEL.gz | GSM48661.CEL.gz | GSM48662.CEL.gz | GSM48663.CEL.gz | GSM48664.CEL.gz | GSM48665.CEL.gz | GSM136725.CEL.gz | GSM136726.CEL.gz | GSM136727.CEL.gz | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1007_s_at | 0.365059 | -0.012149 | -0.327827 | -0.116687 | -0.170685 | -0.220769 | -0.211580 | -0.198109 | -0.035849 | 0.096327 | ... | 0.239554 | 0.296144 | 0.356331 | 0.608040 | -0.280956 | 0.141330 | -0.013254 | 0.226624 | 0.547526 | -0.128887 |
1053_at | -0.136032 | -0.257178 | -0.027254 | 0.347260 | 0.197144 | 0.053000 | -0.324497 | -0.238930 | -0.337195 | 0.292678 | ... | -0.008531 | 0.311485 | -0.667259 | 0.013964 | 0.045706 | -0.056219 | 0.080203 | 0.442193 | 0.125379 | 0.205564 |
117_at | -0.490556 | -0.530031 | -0.232926 | -0.337577 | -0.262080 | -0.540250 | -0.288228 | -0.120475 | 1.085317 | 0.878655 | ... | -0.003613 | -0.371451 | 0.026594 | -0.659379 | -0.328549 | 0.166231 | 0.012716 | -0.356936 | -0.366887 | -0.202251 |
121_at | 0.418026 | 0.364847 | 0.042362 | 0.240113 | 0.130720 | -0.343165 | -0.322852 | -0.461611 | 0.073949 | -0.027296 | ... | 0.021700 | 0.298814 | 0.000417 | 0.330612 | 0.585598 | 0.210887 | 0.036463 | -0.083972 | -0.058733 | 0.191626 |
1255_g_at | 0.018335 | -0.022187 | -0.115002 | 0.020611 | -0.025980 | 0.090351 | -0.019693 | -0.088959 | 0.170104 | -0.246015 | ... | -0.123951 | 0.017269 | 0.065775 | 0.025562 | 0.015776 | 0.108846 | 0.299192 | 0.041912 | -0.028842 | 0.091189 |
5 rows × 102 columns
Using this mean normalized gene expression data, compute the projection to the first Principal Component (PC1).
Hint: Use the numpy.linalg.svd()
function in the numpy.linalg
module (or the scipy.linalg.svd()
function in the scipy.linalg
module) to apply an singular value decomposition to a matrix.
#your code here
U, s, Vh = np.linalg.svd(df.values, full_matrices=False)
df.shape, u.shape, s.shape, Vh.shape, Vh.T.shape
((8793, 102), (8793, 102), (102,), (102, 102), (102, 102))
Create a histogram using the values from PC1. Use a bin size of 25.
#your code here
plt.title("Vh.T[:0,] histogram")
plt.hist(Vh.T[:,0], bins=25);
Create a scatter plot with the days since October 31, 2002 on the x-axis and PC1 on the y-axis.
#your code here
x = sampleinfoCEU.elapsedInDays.dt.days
y = Vh.T[:,0]
plt.scatter(x,y)
plt.xlabel("Days since Oct 31, 2002")
plt.ylabel("PC1")
<matplotlib.text.Text at 0x126945b70>
Around what day do you notice a difference in the way the samples were processed?
#your code here
for X,Y in zip(x,y):
if X <=100:
plt.scatter(X,Y, color="b")
else:
plt.scatter(X,Y, color="r")
A more efficent version of the above, using a mask:
mask = x<=100
plt.figure(figsize=(10,6))
plt.scatter(x[mask], y[mask], color="b", label="100 or less days", s=20)
plt.scatter(x[~mask], y[~mask], color="r", label="over a 100 days", s=20)
plt.axvline(100, label="100 days", linestyle="dashed")
plt.xlabel("Days since Oct 31, 2002")
plt.ylabel("PC1")
plt.legend();
Answer: At about 100 days the PC1 of the samples is markedly different. Something seems to be going on there.
Write a brief discussion of your conclusions to the questions and tasks above in 100 words or less.
First, I need to better understand the PC1 here, but given that numpy must have done the right SVD and projected the high D gene data into a smaller D principal componenets, by plotting PC1 against days we can see that there is a clear difference in the samples after day 100.
The original gene data was very hard to visualize, doing a PCA analysis makes it possible to do so.
The HuffPost Pollster contains many political polls. You can access these polls from individual races as a CSV but you can also access polls through the HuffPost Pollster API to access the data.
Read in the polls from the 2012 Presidential Election: Barack Obama vs Mitt Romney into a pandas DataFrame called election
. For this problem, you may read in the polls for this race directly using the CSV file available from the HuffPost Pollster page.
#your code here
election = pd.read_csv("http://elections.huffingtonpost.com/pollster/2012-general-election-romney-vs-obama.csv")
Show the head of the election
DataFrame.
print("df shape:", election.shape)
election.head()
df shape: (586, 17)
Pollster | Start Date | End Date | Entry Date/Time (ET) | Number of Observations | Population | Mode | Obama | Romney | Undecided | Other | Pollster URL | Source URL | Partisan | Affiliation | Question Text | Question Iteration | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Politico/GWU/Battleground | 2012-11-04 | 2012-11-05 | 2012-11-06T08:40:26Z | 1000.0 | Likely Voters | Live Phone | 47.0 | 47.0 | 6.0 | NaN | http://elections.huffingtonpost.com/pollster/p... | http://www.politico.com/news/stories/1112/8338... | Nonpartisan | None | NaN | 1 |
1 | YouGov/Economist | 2012-11-03 | 2012-11-05 | 2012-11-26T15:31:23Z | 740.0 | Likely Voters | Internet | 49.0 | 47.0 | 3.0 | NaN | http://elections.huffingtonpost.com/pollster/p... | http://cdn.yougov.com/cumulus_uploads/document... | Nonpartisan | None | NaN | 1 |
2 | Gravis Marketing | 2012-11-03 | 2012-11-05 | 2012-11-06T09:22:02Z | 872.0 | Likely Voters | Automated Phone | 48.0 | 48.0 | 4.0 | NaN | http://elections.huffingtonpost.com/pollster/p... | http://www.gravispolls.com/2012/11/gravis-mark... | Nonpartisan | None | NaN | 1 |
3 | IBD/TIPP | 2012-11-03 | 2012-11-05 | 2012-11-06T08:51:48Z | 712.0 | Likely Voters | Live Phone | 50.0 | 49.0 | NaN | 1.0 | http://elections.huffingtonpost.com/pollster/p... | http://news.investors.com/special-report/50841... | Nonpartisan | None | NaN | 1 |
4 | Rasmussen | 2012-11-03 | 2012-11-05 | 2012-11-06T08:47:50Z | 1500.0 | Likely Voters | Automated Phone | 48.0 | 49.0 | NaN | NaN | http://elections.huffingtonpost.com/pollster/p... | http://www.rasmussenreports.com/public_content... | Nonpartisan | None | NaN | 1 |
How many polls were conducted in November? Define this number as M.
Hint: Subset the election
DataFrame for only dates in the Start Date
column that are in November 2012.
# converting string date columns into datetime
election["Start Date"] = pd.to_datetime(election["Start Date"])
election["End Date"] = pd.to_datetime(election["End Date"])
election["Entry Date/Time (ET)"] = pd.to_datetime(election["Entry Date/Time (ET)"])
pollsNov12 = election[(election["Start Date"] >= pd.datetime(2012,11,1)) &
(election["Start Date"] < pd.datetime(2012,12,1))]
M = pollsNov12.shape[0]
pollsNov12.describe()
Number of Observations | Obama | Romney | Undecided | Other | Question Text | Question Iteration | |
---|---|---|---|---|---|---|---|
count | 19.000000 | 19.000000 | 19.000000 | 12.000000 | 8.000000 | 0.0 | 19.0 |
mean | 1523.368421 | 48.789474 | 47.736842 | 3.083333 | 1.375000 | NaN | 1.0 |
std | 999.457754 | 1.084176 | 1.097578 | 1.676486 | 0.517549 | NaN | 0.0 |
min | 693.000000 | 47.000000 | 45.000000 | 1.000000 | 1.000000 | NaN | 1.0 |
25% | 1009.500000 | 48.000000 | 47.000000 | 2.000000 | 1.000000 | NaN | 1.0 |
50% | 1200.000000 | 49.000000 | 48.000000 | 3.000000 | 1.000000 | NaN | 1.0 |
75% | 1487.500000 | 49.500000 | 48.500000 | 4.000000 | 2.000000 | NaN | 1.0 |
max | 4725.000000 | 51.000000 | 49.000000 | 6.000000 | 2.000000 | NaN | 1.0 |
Answer: 19 polls were conducted in Nov
What was the median of the number of observations in the November polls? Define this quantity as N.
#your code here
N = np.median(pollsNov12["Number of Observations"])
N
1200.0
Answer: Median for polls in Nov is 1200
Using the median sample size $N$ from Problem 1(a), simulate the results from a single poll: simulate the number of votes for Obama out of a sample size $N$ where $p$ = 0.53 is the percent of voters who are voting for Obama.
Hint: Use the binomial distribution with parameters $N$ and $p$ = 0.53.
#your code here
poll = np.mean(np.random.binomial(1, p=0.53, size=int(N)))
poll2 = np.random.binomial(N, 0.53) / N
poll, poll2
(0.52666666666666662, 0.52500000000000002)
Now, perform a Monte Carlo simulation to obtain the estimated percentage of Obama votes with a sample size $N$ where $N$ is the median sample size calculated in Problem 2(a). Let $p$=0.53 be the percent of voters are voting for Obama.
Hint: You will repeat the simulation above 1,000 times and plot the distribution of the estimated percent of Obama votes from a single poll. The results from the single poll you simulate is random variable and will be different every time you sample.
Two days, first to draw a number from N, or to generate N votes and take the mean:
#your code here
sim = np.random.binomial(N, 0.53, 1000)/N
sim2 = [np.mean(np.random.binomial(1, p=0.53, size=int(N))) for _ in range(1000)]
sim[:5], sim2[:5]
(array([ 0.54416667, 0.50916667, 0.55083333, 0.5375 , 0.5175 ]), [0.54000000000000004, 0.52749999999999997, 0.51916666666666667, 0.53333333333333333, 0.54249999999999998])
Plot the distribution of the estimated percentage of Obama votes from your single poll. What is the distribution of the estimated percentage of Obama votes?
#your code here
#plt.hist(sim, bins=20);
simP = sim2 # to easily choose b/w sim and sim2
plt.figure(figsize=(10,6))
sns.distplot(simP, label="Voting distribution")
plt.axvline(np.mean(simP), color="g", label=f"Mean: {np.mean(simP):.2f}", linestyle="dashed")
plt.axvline(0.5, label="50 percent line", color="y", alpha=0.8)
plt.xlabel("percentage for votes for Obama", fontsize=12)
plt.ylabel("Probablity of it happening", fontsize=12)
plt.annotate(f"The mean is {np.mean(simP):.2%}", (np.mean(simP),0.0), (np.mean(simP)+0.01,3),
arrowprops={"arrowstyle":'-|>'}, bbox = {"boxstyle":"round", "fc":"0.8"})
plt.text(0.48,20, "So there is a small\nchance of Obama\nloosing", bbox=dict(facecolor='red', alpha=0.3))
plt.legend();
Answer:
What is the standard error (SE) of the estimated percentage from the poll.
Hint: Remember the SE is the standard deviation (SD) of the distribution of a random variable.
#your code here
np.std(simP)
0.014020573152613193
Now suppose we run M polls where M is the number of polls that happened in November (calculated in Problem 2(a)). Run 1,000 simulations and compute the mean of the M polls for each simulation.
np.random.binomial(N, 0.53, M)
array([623, 667, 612, 630, 631, 630, 619, 610, 603, 640, 610, 633, 664, 668, 638, 646, 640, 632, 651])
#your code here
polls1k = [np.mean(np.random.binomial(N, 0.53, M)) for _ in range(1000)]
polls1k[:10]
[634.84210526315792, 639.42105263157896, 638.42105263157896, 643.47368421052636, 631.9473684210526, 631.68421052631584, 638.0, 634.89473684210532, 643.10526315789468, 642.26315789473688]
What is the distribution of the average of polls?
Hint: Show a plot.
#your code here
plt.figure(figsize=(10,6))
plt.xlim(500,700)
plt.ylim(0,0.11)
sns.distplot(polls1k, label="Voting distribution")
sns.distplot(np.array(simP)*1200, label="non averaged poll")
plt.axvline(np.mean(polls1k), color="g", label=f"Mean: {np.mean(polls1k):.2f}", linestyle="dashed")
plt.axvline(N*0.5, label="50 percent line", color="y", alpha=0.8)
plt.xlabel("votes for Obama", fontsize=12)
plt.ylabel("Probablity of it happening", fontsize=12)
plt.legend();
Answer: The average poll has a much tighter outcome, as is clear on the chart above. The non-average poll has a much wider spread.
What is the standard error (SE) of the average of polls?
#your code here
np.std(polls1k), np.std(np.array(simP)*1200)
(4.0203287292723635, 16.824687783135836)
Answer:
Is the SE of the average of polls larger, the same, or smaller than that the SD of a single poll (calculated in Problem 2(b))? By how much?
Hint: Compute a ratio of the two quantities.
#your code here
np.std(polls1k) / np.std(np.array(simP)*1200)
0.23895413579692845
np.std(np.array(simP)*1200) / np.std(polls1k)
4.184903503197094
Answer:
Repeat Problem 2(c) but now record the across poll standard deviation in each simulation.
#your code here
polls1kstd = [np.std(np.random.binomial(N, 0.53, M)) for _ in range(1000)]
polls1kstd[:5]
[14.847516743974527, 21.288936952443095, 17.348952405966404, 16.956192537653056, 20.507548775675353]
What is the distribution of the across M polls standard deviation?
Hint: Show a plot.
#your code here
sns.distplot(polls1kstd, label="std across M polls")
plt.xlabel("Standard Dev");
Answer:
What is the standard deviation of M polls in our real (not simulated) 2012 presidential election data ?
#your code here
np.std(election["Obama"]/100)
0.024199901720378662
Is this larger, the same, or smaller than what we expeced if polls were not biased.
#your code here
np.std(election["Obama"]/100), np.std(np.array(simP)*1200/100), np.std(election["Obama"]/100)/np.std(np.array(simP)*1200/100)
(0.024199901720378662, 0.16824687783135833, 0.14383566597078459)
Answer:
For AC209 Students: Learn about the normal approximation for the binomial distribution and derive the results of Problem 2(b) and 2(c) analytically (using this approximation). Compare the results obtained analytically to those obtained from simulations.
#your code here
Answer:
Write a brief discussion of your conclusions to the questions and tasks above in 100 words or less.
Most undecided voters vote for one of the two candidates at the election. Therefore, the reported percentages underestimate the final value of both candidates. However, if we assume the undecided will split evenly, then the observed difference should be an unbiased estimate of the final difference.
Add a new column to the election
DataFrame containg the difference between Obama and Romeny called Diff
.
#your code here
election["Diff"] = 100*(election["Obama"] - election["Romney"]) / (election["Obama"] + election["Romney"])
election.head()
Pollster | Start Date | End Date | Entry Date/Time (ET) | Number of Observations | Population | Mode | Obama | Romney | Undecided | Other | Pollster URL | Source URL | Partisan | Affiliation | Question Text | Question Iteration | Diff | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
585 | PPP (D) | 2009-03-13 | 2009-03-15 | 2010-11-08 10:36:31 | 1403.0 | Registered Voters | Automated Phone | 44.0 | 44.0 | 12.0 | NaN | http://elections.huffingtonpost.com/pollster/p... | http://www.publicpolicypolling.com/pdf/PPP_Rel... | Pollster | Dem | NaN | 1 | 0.000000 |
584 | PPP (D) | 2009-04-17 | 2009-04-19 | 2010-11-08 10:45:18 | 686.0 | Registered Voters | Automated Phone | 50.0 | 39.0 | 11.0 | NaN | http://elections.huffingtonpost.com/pollster/p... | http://www.publicpolicypolling.com/pdf/surveys... | Pollster | Dem | NaN | 1 | 12.359551 |
583 | PPP (D) | 2009-05-14 | 2009-05-18 | 2010-11-08 10:49:08 | 1000.0 | Registered Voters | Automated Phone | 53.0 | 35.0 | 12.0 | NaN | http://elections.huffingtonpost.com/pollster/p... | http://www.publicpolicypolling.com/pdf/surveys... | Pollster | Dem | NaN | 1 | 20.454545 |
582 | PPP (D) | 2009-06-12 | 2009-06-16 | 2010-11-08 10:58:00 | 638.0 | Registered Voters | Automated Phone | 48.0 | 40.0 | 12.0 | NaN | http://elections.huffingtonpost.com/pollster/p... | http://www.publicpolicypolling.com/pdf/surveys... | Pollster | Dem | NaN | 1 | 9.090909 |
581 | PPP (D) | 2009-07-15 | 2009-07-16 | 2010-11-08 11:02:37 | 577.0 | Registered Voters | Automated Phone | 49.0 | 40.0 | 11.0 | NaN | http://elections.huffingtonpost.com/pollster/p... | http://www.publicpolicypolling.com/pdf/surveys... | Pollster | Dem | NaN | 1 | 10.112360 |
election["Diff"].describe()
count 586.000000 mean 2.433892 std 4.211205 min -17.777778 25% 0.000000 50% 2.127660 75% 4.651163 max 20.454545 Name: Diff, dtype: float64
Make a plot of the differences for the week before the election (e.g. 5 days) where the days are on the x-axis and the differences are on the y-axis. Add a horizontal line showing 3.9%: the difference between Obama and Romney on election day.
election.sort_values("End Date", ascending=True, inplace=True)
Adding a column to easily see the days from end of elections:
election["Days from End"] = -(election["End Date"] - election["End Date"].max()).dt.days.values
election.sort_values("Days from End", ascending=False, inplace=True)
election.head()
Pollster | Start Date | End Date | Entry Date/Time (ET) | Number of Observations | Population | Mode | Obama | Romney | Undecided | Other | Pollster URL | Source URL | Partisan | Affiliation | Question Text | Question Iteration | Diff | Days from End | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
585 | PPP (D) | 2009-03-13 | 2009-03-15 | 2010-11-08 10:36:31 | 1403.0 | Registered Voters | Automated Phone | 44.0 | 44.0 | 12.0 | NaN | http://elections.huffingtonpost.com/pollster/p... | http://www.publicpolicypolling.com/pdf/PPP_Rel... | Pollster | Dem | NaN | 1 | 0.000000 | 1331 |
584 | PPP (D) | 2009-04-17 | 2009-04-19 | 2010-11-08 10:45:18 | 686.0 | Registered Voters | Automated Phone | 50.0 | 39.0 | 11.0 | NaN | http://elections.huffingtonpost.com/pollster/p... | http://www.publicpolicypolling.com/pdf/surveys... | Pollster | Dem | NaN | 1 | 12.359551 | 1296 |
583 | PPP (D) | 2009-05-14 | 2009-05-18 | 2010-11-08 10:49:08 | 1000.0 | Registered Voters | Automated Phone | 53.0 | 35.0 | 12.0 | NaN | http://elections.huffingtonpost.com/pollster/p... | http://www.publicpolicypolling.com/pdf/surveys... | Pollster | Dem | NaN | 1 | 20.454545 | 1267 |
582 | PPP (D) | 2009-06-12 | 2009-06-16 | 2010-11-08 10:58:00 | 638.0 | Registered Voters | Automated Phone | 48.0 | 40.0 | 12.0 | NaN | http://elections.huffingtonpost.com/pollster/p... | http://www.publicpolicypolling.com/pdf/surveys... | Pollster | Dem | NaN | 1 | 9.090909 | 1238 |
581 | PPP (D) | 2009-07-15 | 2009-07-16 | 2010-11-08 11:02:37 | 577.0 | Registered Voters | Automated Phone | 49.0 | 40.0 | 11.0 | NaN | http://elections.huffingtonpost.com/pollster/p... | http://www.publicpolicypolling.com/pdf/surveys... | Pollster | Dem | NaN | 1 | 10.112360 | 1208 |
First, since we're looking only at the last 7 days, I make a mask to access only that data:
ma = election["Days from End"] <= 7
f"there are {np.sum(ma)} polls in the 7 days before election"
election[ma].describe()
Number of Observations | Obama | Romney | Undecided | Other | Question Text | Question Iteration | Diff | Days from End | |
---|---|---|---|---|---|---|---|---|---|
count | 34.000000 | 34.000000 | 34.000000 | 22.000000 | 13.000000 | 0.0 | 34.0 | 34.000000 | 34.000000 |
mean | 2518.617647 | 48.294118 | 47.382353 | 4.045455 | 1.538462 | NaN | 1.0 | 0.958681 | 2.411765 |
std | 6070.819644 | 1.219275 | 1.392941 | 2.419706 | 0.776250 | NaN | 0.0 | 1.453875 | 2.175868 |
min | 693.000000 | 46.000000 | 43.000000 | 1.000000 | 1.000000 | NaN | 1.0 | -2.083333 | 0.000000 |
25% | 1004.750000 | 48.000000 | 47.000000 | 2.250000 | 1.000000 | NaN | 1.0 | 0.000000 | 1.000000 |
50% | 1164.000000 | 48.000000 | 48.000000 | 3.000000 | 1.000000 | NaN | 1.0 | 1.030928 | 2.000000 |
75% | 1500.000000 | 49.000000 | 48.000000 | 6.000000 | 2.000000 | NaN | 1.0 | 2.072704 | 4.000000 |
max | 36472.000000 | 51.000000 | 49.000000 | 11.000000 | 3.000000 | NaN | 1.0 | 4.255319 | 7.000000 |
sns.lmplot("Days from End", "Diff", data=election[ma], hue="Pollster", fit_reg=False, legend=True, size=6)
plt.axhline(3.9, label="Actual Diff at Election")
plt.xlim(7, -0)
ymin, ymax = plt.ylim()
plt.fill_between("Days from End", ymax, data=election[ma], alpha=0.051, color="b")
plt.fill_between("Days from End", ymin, data=election[ma], alpha=0.051, color="r")
plt.ylim(ymin, ymax)
plt.title("Polls 7 Days from elections");
Make a plot showing the differences by pollster where the pollsters are on the x-axis and the differences on the y-axis.
dfpoll = election[ma].reset_index()
dfpoll.head(2)
index | Pollster | Start Date | End Date | Entry Date/Time (ET) | Number of Observations | Population | Mode | Obama | Romney | Undecided | Other | Pollster URL | Source URL | Partisan | Affiliation | Question Text | Question Iteration | Diff | Days from End | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 33 | JZ Analytics/Newsmax | 2012-10-27 | 2012-10-29 | 2012-10-30 17:16:32 | 1073.0 | Likely Voters | Internet | 46.0 | 47.0 | 7.0 | NaN | http://elections.huffingtonpost.com/pollster/p... | http://www.jzanalytics.com/ | Sponsor | Rep | NaN | 1 | -1.075269 | 7 |
1 | 32 | High Point University | 2012-10-22 | 2012-10-30 | 2012-11-01 13:19:44 | 805.0 | Registered Voters | Live Phone | 46.0 | 43.0 | 11.0 | 1.0 | http://elections.huffingtonpost.com/pollster/p... | http://acme.highpoint.edu/~mkifer/src/19releas... | Nonpartisan | None | NaN | 1 | 3.370787 | 6 |
sns.swarmplot(x="Pollster", y="Diff", hue="Pollster", data=dfpoll, size=10)
#sns.factorplot(x="Pollster", y="Diff", hue="Pollster", data=dfpoll, size=10)
plt.axhline(0, alpha=0.2, color="black")
plt.text(0,0.6,"Obama Wins", bbox={'facecolor':'blue', 'alpha':0.1})
plt.text(0,-0.8,"Romney Wins", bbox={'facecolor':'red', 'alpha':0.1})
plt.xticks(rotation=90, fontsize=11)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.);
Is the across poll difference larger than the between pollster difference?
Answer: There is a lot of variation across polls b/w the same pollster.
Take the average for each pollster and then compute the average of that. Given this difference how confident would you have been of an Obama victory?
Hint: Compute an estimate of the SE of this average based exclusively on the observed data.
#your code here
# first getting the avg of each pollsters poll
dfpollavg = election[ma].groupby("Pollster").mean().reset_index()
win_perctange = dfpollavg['Diff'].mean()
print(f"avg of polls predict an Obama win by {win_perctange:.2f}%")
print(f"The avg error is {np.std(dfpollavg.Diff):.2f}")
avg of polls predict an Obama win by 1.14% The avg error is 1.38
Answer: The win % is small, and the standard error is large, so the polls make the Obama victory pretty uncertain. It's well within the margin of error.
For AC209 Students: Show the difference against time and see if you can detect a trend towards the end. Use this trend to see if it improves the final estimate.
election.head(2)
Pollster | Start Date | End Date | Entry Date/Time (ET) | Number of Observations | Population | Mode | Obama | Romney | Undecided | Other | Pollster URL | Source URL | Partisan | Affiliation | Question Text | Question Iteration | Diff | Days from End | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
585 | PPP (D) | 2009-03-13 | 2009-03-15 | 2010-11-08 10:36:31 | 1403.0 | Registered Voters | Automated Phone | 44.0 | 44.0 | 12.0 | NaN | http://elections.huffingtonpost.com/pollster/p... | http://www.publicpolicypolling.com/pdf/PPP_Rel... | Pollster | Dem | NaN | 1 | 0.000000 | 1331 |
584 | PPP (D) | 2009-04-17 | 2009-04-19 | 2010-11-08 10:45:18 | 686.0 | Registered Voters | Automated Phone | 50.0 | 39.0 | 11.0 | NaN | http://elections.huffingtonpost.com/pollster/p... | http://www.publicpolicypolling.com/pdf/surveys... | Pollster | Dem | NaN | 1 | 12.359551 | 1296 |
#your code here
#sns.lmplot("Days from End", "Diff", data=election, fit_reg=True, legend=False, size=6)
sns.lmplot("Days from End", "Diff", data=election[election["Days from End"] <= 30], scatter=False)
plt.plot("Days from End", "Diff", data=election[election["Days from End"] <= 100], color="g", alpha=0.2,
label="All polls")
plt.axhline(3.9, label="Actual Diff at Election", linestyle="dashed", linewidth=1)
plt.axhline(0, label="0 line", linestyle="-", linewidth=1)
plt.xlim(100, 0)
plt.title("All polls before election");
Answer: It's interesting how the polls markedly swing from Obama down to Romney, and then there is a marked upward trend in the last 30 days, though it doesn't reach close to the 3.9 mark.
Write a brief discussion of your conclusions to the questions and tasks above in 100 words or less.
I would look closely at the methods Nate Silver used to deal with polls, since there is so much variability both b/w the same poll and different polls. Also I'd compute confidence intervals etc.
In this last problem, we will use the polls from the 2014 Senate Midterm Elections from the HuffPost Pollster API to create a preliminary prediction of the result of each state.
The HuffPost Pollster API allows you to access the data as a CSV or a JSON response by tacking ".csv" or ".json" at the end of the URLs. For example the 2012 Presidential Election could be accessed as a .json instead of a .csv
Read in the polls for all of the 2014 Senate Elections using the HuffPost API. For example, we can consider the 2014 Senate race in Kentucky between Mitch McConnell and Alison Grimes.
To search for the 2014 Senate races, use the topics
parameter in the API [read more about topics here].
url_str = "http://elections.huffingtonpost.com/pollster/api/charts/?topic=2014-senate"
To list all the URLs related to the 2014 Senate races using the pollster API, we can use a list comprehension:
election_urls = [election['url'] + '.csv' for election in requests.get(url_str).json()]
election_urls
['http://elections.huffingtonpost.com/pollster/2014-delaware-senate-wade-vs-coons.csv', 'http://elections.huffingtonpost.com/pollster/2014-massachusetts-senate-herr-vs-markey.csv', 'http://elections.huffingtonpost.com/pollster/2014-rhode-island-senate-zaccaria-vs-reed.csv', 'http://elections.huffingtonpost.com/pollster/2014-montana-senate-daines-vs-curtis.csv', 'http://elections.huffingtonpost.com/pollster/2014-oklahoma-senate-lankford-vs-johnson.csv', 'http://elections.huffingtonpost.com/pollster/2014-hawaii-senate-cavasso-vs-schatz.csv', 'http://elections.huffingtonpost.com/pollster/2014-kansas-senate-roberts-vs-orman-vs-taylor.csv', 'http://elections.huffingtonpost.com/pollster/2014-wyoming-senate.csv', 'http://elections.huffingtonpost.com/pollster/2014-tennessee-senate-alexander-vs-ball.csv', 'http://elections.huffingtonpost.com/pollster/2014-idaho-senate-risch-vs-mitchell.csv', 'http://elections.huffingtonpost.com/pollster/2014-new-jersey-senate-bell-vs-booker.csv', 'http://elections.huffingtonpost.com/pollster/2014-new-mexico-senate-weh-vs-udall.csv', 'http://elections.huffingtonpost.com/pollster/2014-oklahoma-senate-inhofe-vs-silverstein.csv', 'http://elections.huffingtonpost.com/pollster/2014-south-carolina-senate-scott-vs-dickerson.csv', 'http://elections.huffingtonpost.com/pollster/2014-south-carolina-senate-graham-vs-hutto.csv', 'http://elections.huffingtonpost.com/pollster/2014-texas-senate-cornyn-vs-alameel.csv', 'http://elections.huffingtonpost.com/pollster/2014-minnesota-senate-mcfadden-vs-franken.csv', 'http://elections.huffingtonpost.com/pollster/2014-maine-senate-collins-vs-bellows.csv', 'http://elections.huffingtonpost.com/pollster/2014-south-dakota-senate-rounds-vs-weiland.csv', 'http://elections.huffingtonpost.com/pollster/2014-nebraska-senate-sasse-vs-domina.csv', 'http://elections.huffingtonpost.com/pollster/2014-georgia-senate-perdue-vs-nunn.csv', 'http://elections.huffingtonpost.com/pollster/2014-oregon-senate-wehby-vs-merkley.csv', 'http://elections.huffingtonpost.com/pollster/2014-mississippi-senate-cochran-vs-childers.csv', 'http://elections.huffingtonpost.com/pollster/2014-iowa-senate-ernst-vs-braley.csv', 'http://elections.huffingtonpost.com/pollster/2014-alaska-senate-sullivan-vs-begich.csv', 'http://elections.huffingtonpost.com/pollster/2014-illinois-senate-oberweis-vs-durbin.csv', 'http://elections.huffingtonpost.com/pollster/2014-colorado-senate-gardner-vs-udall.csv', 'http://elections.huffingtonpost.com/pollster/2014-virginia-senate-gillespie-vs-warner.csv', 'http://elections.huffingtonpost.com/pollster/2014-north-carolina-senate-tillis-vs-hagan.csv', 'http://elections.huffingtonpost.com/pollster/2014-new-hampshire-senate-bass-vs-shaheen.csv', 'http://elections.huffingtonpost.com/pollster/2014-west-virginia-senate-capito-vs-tennant.csv', 'http://elections.huffingtonpost.com/pollster/2014-new-hampshire-senate-brown-vs-shaheen.csv', 'http://elections.huffingtonpost.com/pollster/2014-louisiana-senate-cassidy-vs-landrieu.csv', 'http://elections.huffingtonpost.com/pollster/2014-michigan-senate-land-vs-peters.csv', 'http://elections.huffingtonpost.com/pollster/2014-arkansas-senate-cotton-vs-pryor.csv', 'http://elections.huffingtonpost.com/pollster/2014-kentucky-senate-mcconnell-vs-grimes.csv']
Because there so many Senate races, we can create a dictionary of pandas DataFrames that will be keyed by the name of the election (a string).
def build_frame(url):
"""
Returns a pandas DataFrame object containing
the data returned from the given url
"""
source = requests.get(url).text
# Use StringIO because pd.DataFrame.from_csv requires .read() method
s = StringIO(source)
return pd.DataFrame.from_csv(s, index_col=None).convert_objects(
convert_dates="coerce", convert_numeric=True)
# Makes a dictionary of pandas DataFrames keyed on election string.
dfs = dict((election.split("/")[-1][:-4], build_frame(election)) for election in election_urls)
/Users/ko/anaconda/lib/python3.6/site-packages/ipykernel_launcher.py:12: FutureWarning: convert_objects is deprecated. Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric. if sys.path[0] == '':
Show the head of the DataFrame containing the polls for the 2014 Senate race in Kentucky between McConnell and Grimes.
#your code here
dfs['2014-kentucky-senate-mcconnell-vs-grimes'].head(2)
Pollster | Start Date | End Date | Entry Date/Time (ET) | Number of Observations | Population | Mode | Grimes | McConnell | Undecided | Pollster URL | Source URL | Partisan | Affiliation | Question Text | Question Iteration | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | PPP (D) | 2014-10-30 | 2014-11-01 | 2014-11-02 01:44:58 | 1503 | Likely Voters | IVR/Online | 42.0 | 50.0 | NaN | http://elections.huffingtonpost.com/pollster/p... | http://www.publicpolicypolling.com/pdf/2014/PP... | Pollster | Dem | NaN | 1 |
1 | NBC/Marist | 2014-10-27 | 2014-10-30 | 2014-11-02 14:37:22 | 556 | Likely Voters | Live Phone | 41.0 | 50.0 | 3.0 | http://elections.huffingtonpost.com/pollster/p... | http://newscms.nbcnews.com/sites/newscms/files... | Nonpartisan | None | NaN | 1 |
For each 2014 Senate race, create a preliminary prediction of the result for that state.
First, I get the names of the contenders in each polls:
# all the columns besides the contenders
drop_cols = ['Pollster','Start Date','End Date','Entry Date/Time (ET)','Number of Observations',
'Population','Mode','Undecided','Pollster URL', 'Source URL',
'Partisan','Affiliation','Question Text','Question Iteration']
# lets see if that works to get the names
[col for col in dfs['2014-kentucky-senate-mcconnell-vs-grimes'].columns if col not in drop_cols]
['Grimes', 'McConnell']
#your code here
predictions = []
for election, df in dfs.items():
names = [col for col in df.columns if col not in drop_cols]
win_name = None
high = False
for name in names:
m = df[name].mean()
if m > high:
high = m
win_name = name
predictions.append({"Winner": win_name, "Poll": high, "Race": election})
#print(predictions)
winners = pd.DataFrame(predictions, cols=["Race","Winner","Poll"])
#winners.set_index("Race")
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) <ipython-input-268-fa479debe2b4> in <module>() 16 #print(predictions) 17 ---> 18 winners = pd.DataFrame(predictions, cols=["Race","Winner","Poll"]) 19 #winners.set_index("Race") TypeError: __init__() got an unexpected keyword argument 'cols'
an attempt to show the winners visually, though this could be done better:
sns.swarmplot(x="Race", y="Poll", data=winners, size=10, hue="Winner")
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.);
plt.xticks(rotation=90, fontsize=11);
To submit your homework, create a folder named lastname_firstinitial_hw# and place your IPython notebooks, data files, and any other files in this folder. Your IPython Notebooks should be completely executed with the results visible in the notebook. We should not have to run any code. Compress the folder (please use .zip compression) and submit to the CS109 dropbox in the appropriate folder. If we cannot access your work because these directions are not followed correctly, we will not grade your work.