For class today, you read a chapter on Exploratory Data Analysis from the book "Doing Data Science." Today, we'll be walking through an analysis of the same data presented in that chapter and extending it with Multiple Hypothesis testing.
Don't forget to fill out the response form.
Also - you may want to install scipy while we're talking about hypothesis testing: sudo apt-get install python-scipy
Hypothesis testing is a powerful tool used in statistical analysis to support claims like "Drug X is better than Drug Y", or "Campers who attend a safety course led by a talking bear are effective at preventing forest fires." You'll see questions about whether or not an experiment was effective or results from two processes are different all over the place in your data science career.
Hypothesis testing allows us to answer these questions with statistical rigor. Generally, we establish a "null hypothesis", and then conduct a test which will tell us, given the data, whether or not we can reject this null hypothesis. Usually, the null hypothesis is something like "These two drugs are the same." or "This measure's mean is no different than zero."
We're going to work with one such tool, namely the Student's Two-Sample t-Test.
This was not a test designed for students. Instead, it was designed by the statistician William Gosset, who published under the pseudonym "Student" while working for the Guinness brewing company.
Student's t-Test is used when comparing samples of normally distributed variables. This assumption of normality is important, but not strict. If the data is very far from normally distributed, a non-parametric (that is, distribution-agnostic) method like the Wilcoxon rank-sum test (also known as the Mann-Whitney U test) can be used; the Wilcoxon signed-rank test plays the same role for paired samples. We are looking at Welch's t-Test, a variant of the classic Student's test that handles two samples of different size and does not assume they have equal variance.
To perform a t-Test, we compute a "t statistic" or "t score" with the two samples as input. The formula is:
$t = \frac{mean(X_1) - mean(X_2)}{s_{X_1-X_2}}$
where $s_{X_1-X_2} = \sqrt{\frac{s^2_1}{n_1} + \frac{s^2_2}{n_2}}$, and $s_i$ is the sample standard deviation of sample $i$ and $n_i$ is the size of sample $i$.
Under the null hypothesis, this score follows a t-distribution, which looks like a normal distribution but with fatter tails. If the two samples have similar means, the t-statistic will be close to 0. If they don't, the t-statistic will be large in absolute value.
Take a moment and think about the math. Under what conditions is the statistic largest in absolute value? When the means are very different, their respective sample standard deviations are very small, and the samples are large.
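To make the formula above concrete, here is a minimal sketch that computes the t statistic by hand in plain Python. The helper names (`sample_mean`, `sample_variance`, `welch_t`) and the sample values are made up for illustration; they are not part of the class data or of any library.

```python
import math

def sample_mean(xs):
    """Arithmetic mean of a list of numbers."""
    return sum(xs) / float(len(xs))

def sample_variance(xs):
    """Sample variance s^2 (divides by n - 1)."""
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1.0)

def welch_t(sample1, sample2):
    """t statistic from the formula above for two samples."""
    n1, n2 = len(sample1), len(sample2)
    # Denominator s_{X_1 - X_2}: standard error of the difference in means.
    std_err = math.sqrt(sample_variance(sample1) / n1 +
                        sample_variance(sample2) / n2)
    return (sample_mean(sample1) - sample_mean(sample2)) / std_err

# Made-up samples: one pair with similar means, one pair with distant means.
close = welch_t([1.0, 2.0, 3.0, 4.0], [1.5, 2.5, 3.5, 4.5, 2.0])
far = welch_t([1.0, 2.0, 3.0, 4.0], [11.0, 12.0, 13.0, 14.0])
print("similar means: t = {:.3f}".format(close))
print("distant means: t = {:.3f}".format(far))
```

The pair with similar means should give a score near 0, while the pair with distant means should give a score that is large in absolute value.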
We'll use Welch's t-Test, an adaptation of the two-sample independent Student's t-test that accounts for samples that may have unequal sizes and variances (which, as we can see from our distribution above, may be true in our data set!).
Since the t-statistic follows a known statistical distribution under the null hypothesis, we can map its value to the probability of seeing a value at least that extreme. The probability of realizing a t-statistic that is large in absolute value when two samples come from the same distribution is very small, thus its p-value is also small.
Courtesy of Wikipedia, the probability density function for the Student's t-Distribution. $\nu$ indicates the degrees of freedom, which grow with the number of samples.
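So, the output of a t-Test is usually a pair of statistics: the "t score" and a "p value". The p value has a natural interpretation as a probability: it is the chance, assuming the null hypothesis is true, of seeing a t score at least this extreme. Informally, we can say: "At the $100 \times (1-\alpha)$ percent confidence level, I reject the null hypothesis that these two samples have the same mean" whenever $p < \alpha$ (for example, $p < 0.05$ for 95% confidence). That is, when p is very small, the two samples are statistically likely to be different.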
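As a rough sketch of how a t score maps to a p value, the snippet below computes both by hand and compares them with scipy's `ttest_ind` called with `equal_var=False` (its Welch variant). The sample values are made up, and the degrees of freedom use the Welch-Satterthwaite approximation, which this notebook does not derive; treat it as an assumption of the sketch.

```python
import numpy as np
from scipy import stats

# Made-up samples, for illustration only.
x1 = np.array([2.1, 2.5, 1.9, 2.8, 2.4, 2.2])
x2 = np.array([3.0, 3.4, 2.9, 3.8, 3.1, 3.6, 3.3])

# Variance of each sample mean, s_i^2 / n_i.
v1 = x1.var(ddof=1) / len(x1)
v2 = x2.var(ddof=1) / len(x2)
t_score = (x1.mean() - x2.mean()) / np.sqrt(v1 + v2)

# Welch-Satterthwaite approximation to the degrees of freedom.
dof = (v1 + v2) ** 2 / (v1 ** 2 / (len(x1) - 1) + v2 ** 2 / (len(x2) - 1))

# Two-sided p value: probability of a t score at least this extreme
# in either tail, assuming the null hypothesis is true.
p_value = 2 * stats.t.sf(abs(t_score), dof)

# scipy's Welch test, for comparison with the manual computation.
t_scipy, p_scipy = stats.ttest_ind(x1, x2, equal_var=False)
print("manual: t = {:.4f}, p = {:.6f}".format(t_score, p_value))
print("scipy:  t = {:.4f}, p = {:.6f}".format(t_scipy, p_scipy))
```

The two printed lines should agree, which is a useful sanity check that the test being run is the one described in the text.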
You'll need to install a python package to your VM using the following command at the shell: sudo apt-get install python-scipy
We'll be working with a single day's worth of session log data from the New York Times website. The data is available here as well as in the class github repo. The data has been nicely cleaned and aggregated for us (no janitor work!). Load up a single day's data into a DataFrame, and summarize it.
import pandas as pd
data = pd.read_csv("nyt1.csv")
data.describe()
print(data.shape)
data.head()
We can see that the data contains roughly 450,000 rows, and 5 columns. These columns are Age, Gender, Impressions (the number of pages the user viewed), Clicks (the number of ads a user clicked on), and whether or not the user was signed in.
Let's plot the data to get a sense of the distributions.
%pylab inline
data.hist(figsize=(10,8))
What can you say about the distributions of these fields?
Some observations.
##YOUR ANSWER HERE
Let's create two new columns of interest. The first is Click Thru Rate (CTR), a measure commonly used in online advertising to gauge the effectiveness of an ad campaign. We will use it to measure differences in behavior between groups of users.
The second thing we'll do is use pandas' cut method to turn a continuous variable (Age) into a discrete one, AgeGroup. This makes things like plotting and measuring differences between groups easier.
Before doing any of this, though, we'll copy our data to a new data frame where we remove the cases where there are 0 impressions, because this will cause a divide by zero and possibly bias our analysis. The number of rows affected by this filter is small but non-trivial - these rows may warrant further investigation later!
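# .copy() gives an independent DataFrame, so adding columns below does not
# trigger pandas' SettingWithCopyWarning.
data1 = data[data.Impressions > 0].copy()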
##YOUR ANSWERS HERE
data1['CTR'] = #DEFINE CTR
data1['AgeGroup'] = #DEFINE AgeGroup
data1.head()
print(data1.shape)
data1.describe()
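If you get stuck on the two new columns, here is a small sketch on a toy DataFrame rather than the real data. It assumes CTR means clicks divided by impressions, and the bin edges and labels passed to `pd.cut` are hypothetical; choose whatever age groups make sense for your analysis.

```python
import pandas as pd

# Toy stand-in for the session data (values are made up).
toy = pd.DataFrame({
    'Age': [0, 17, 25, 40, 70],
    'Impressions': [5, 0, 4, 10, 2],
    'Clicks': [0, 0, 1, 2, 0],
})

# Drop zero-impression rows first to avoid dividing by zero.
toy1 = toy[toy.Impressions > 0].copy()

# Assumed definition: CTR = clicks per impression.
toy1['CTR'] = toy1.Clicks / toy1.Impressions.astype(float)

# Hypothetical bins; each interval includes its right edge,
# so an age of 17 would fall in '<18' and 18 in '18-34'.
bins = [-1, 17, 34, 54, 120]
labels = ['<18', '18-34', '35-54', '55+']
toy1['AgeGroup'] = pd.cut(toy1.Age, bins=bins, labels=labels)
print(toy1)
```

Passing explicit `labels` gives readable group names in later plots and tables; without them, `pd.cut` labels each group with its interval.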
Now, let's plot total impressions and clicks by Age Group and whether or not the user is signed in.
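# Sum both Impressions and Clicks per (AgeGroup, Signed_In) group.
impressionsByAgeSignIn = data1.groupby(['AgeGroup','Signed_In'])[['Impressions','Clicks']].sum()
# One bar chart per column, since the two totals are on very different scales.
impressionsByAgeSignIn.plot(kind='bar', subplots=True)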
By now, we understand that we need to treat groups differently. Let's take our data and divide it into CTRs by age group for those users that have clicked on something (CTR > 0) and are signed in (Signed_In > 0).
loggedInCTRsByAgeGroup = data1[(data1.CTR > 0) & (data1.Signed_In > 0)].groupby('AgeGroup').CTR
loggedInCTRsByAgeGroup.describe()
Now we have several samples of users' click-through behavior. What's a question that we can ask ourselves about these samples?
One question we might ask is "Which groups of users are most different?" Or, more precisely: Are any groups of users different? If any groups are different, which ones are?
First, we'll collect our groups as separate lists. Then, we'll run a t-Test between each pair of groups. Finally, we'll collect the p-values for each pair of groups into a DataFrame.
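from itertools import combinations
from scipy.stats import ttest_ind

# Iterating over a groupby yields (group label, Series of CTRs) pairs.
groups = list(loggedInCTRsByAgeGroup)

def run_pairwise_tests(groups):
    # Visit each unordered pair of groups exactly once.
    for (name1, ctr1), (name2, ctr2) in combinations(groups, 2):
        # equal_var=False selects Welch's t-Test; element [1] is the p-value.
        yield name1, name2, ttest_ind(ctr1, ctr2, equal_var=False)[1]

testResults = pd.DataFrame(run_pairwise_tests(groups),
                           columns=['group1', 'group2', 'p_value'])
testResults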
testResults contains the p-value for the t-Test between CTR samples of all pairs of age groups in our data set. Using these results, which pairs of groups are different at the 95% confidence level? Which groups are most likely to be different, according to these p-values?
## YOUR ANSWERS HERE