For class today, you read a chapter on Exploratory Data Analysis from the book "Doing Data Science." Today, we'll be walking through an analysis of the same data presented in that chapter and extending it with Multiple Hypothesis testing.
Hypothesis testing is a powerful tool in statistical analysis, used to support claims like "Drug X is better than Drug Y" or "A safety course led by a talking bear makes campers more effective at preventing forest fires." Questions about whether an experiment was effective, or whether the results of two processes differ, will come up all over the place in your data science career.
Hypothesis testing allows us to answer these questions with statistical rigor. Generally, we establish a "null hypothesis" and then conduct a test that tells us, given the data, whether or not we can reject it. Usually the null hypothesis is something like "these two drugs are the same" or "this measure's mean is no different from zero."
We're going to work with one such tool, namely the Student's Two-Sample t-Test.
Despite the name, this was not a test designed for students. It was developed by William Gosset, a statistician who published under the pseudonym "Student" while working for the Guinness brewing company.
Student's t-Test is used when comparing samples of normally distributed variables. This assumption of normality is important, but not strict; if the data is very far from normally distributed, a non-parametric (that is, distribution-agnostic) method like the Wilcoxon rank-sum test can be used instead. Specifically, we'll be looking at Welch's t-Test, a variant of the classic Student's test that allows the two samples to have different sizes and, possibly, different variances.
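For two independent samples, the non-parametric analogue is the Wilcoxon rank-sum test, exposed in scipy as `scipy.stats.ranksums`. A quick sketch with made-up, decidedly non-normal samples (exponentially distributed, one shifted relative to the other):

```python
import numpy as np
from scipy.stats import ranksums

rng = np.random.default_rng(7)
# Skewed, non-normal samples; y is shifted up by 1 relative to x.
x = rng.exponential(scale=1.0, size=200)
y = rng.exponential(scale=1.0, size=200) + 1.0

# ranksums compares ranks rather than means, so it makes no normality assumption.
stat, p = ranksums(x, y)
print("rank-sum statistic = %.2f, p = %.3g" % (stat, p))
```

With this large a shift the test easily rejects the null; for data that really is close to normal, the t-test will usually have more power.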
To perform a t-Test, we compute a "t statistic" or "t score" from the two samples. The formula is:
$t = \frac{mean(X_1) - mean(X_2)}{s_{X_1-X_2}}$
where $s_{X_1-X_2} = \sqrt{\frac{s^2_1}{n_1} + \frac{s^2_2}{n_2}}$, $s_i$ is the sample standard deviation of sample $i$, and $n_i$ is the size of sample $i$.
This score follows a t-distribution, which looks like a normal distribution but with fatter tails. If the two samples are similar, the t-statistic will be close to 0; if they're not, it will be large in absolute value.
Take a moment and think about the math. Under what conditions is the statistic largest in magnitude? When the means are far apart and the respective sample standard deviations are small.
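As a sanity check on the formula, here's a minimal sketch that computes the statistic directly with NumPy on two made-up samples (the function name `welch_t` is ours, not from any library):

```python
import numpy as np

def welch_t(x1, x2):
    """Welch's t statistic: difference of means over the combined standard error."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    # ddof=1 gives the sample variance s^2, matching the formula above.
    se = np.sqrt(x1.var(ddof=1) / len(x1) + x2.var(ddof=1) / len(x2))
    return (x1.mean() - x2.mean()) / se

# Toy samples: nearly equal means give a t near 0,
# well-separated means give a large |t|.
rng = np.random.default_rng(0)
a = rng.normal(0.20, 0.10, size=100)
b = rng.normal(0.21, 0.10, size=150)
c = rng.normal(0.60, 0.10, size=150)
print(welch_t(a, b), welch_t(a, c))
```

The same value is what `scipy.stats.ttest_ind(..., equal_var=False)` reports as its statistic.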
We'll use Welch's t-Test, the adaptation of the two-sample independent Student's t-test that accounts for samples with unequal sizes and variances (which, as we'll see when we split our data into groups, is true of our data set!)
Since the t-statistic comes from a known statistical distribution, we can map its value to the probability of sampling a value at least that extreme under the null hypothesis. The probability of realizing a large t-statistic when the two samples come from the same distribution is very small, so its p-value is also small.
Courtesy of Wikipedia, the probability density function of Student's t-distribution. The parameter ν indicates the degrees of freedom, which grows with the sizes of our samples.
So, the output of a t-test is usually a pair of statistics: the "t score" and the "p value". The p-value has a natural interpretation as a probability. We can say: "With $100 \times (1-p)$ percent confidence, I reject the null hypothesis that these two samples are the same." Meaning that when p is very small, the two samples are very likely to be statistically different.
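Here is what that pair looks like in practice, on synthetic data (made-up samples, not the NYT data yet):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)
same = rng.normal(0.2, 0.1, size=500)       # two samples from one distribution
also_same = rng.normal(0.2, 0.1, size=300)
shifted = rng.normal(0.3, 0.1, size=300)    # a third, from a shifted distribution

# equal_var=False selects Welch's variant of the test.
t1, p1 = ttest_ind(same, also_same, equal_var=False)
t2, p2 = ttest_ind(same, shifted, equal_var=False)
print("same vs. same:    t=%6.2f  p=%.3f" % (t1, p1))
print("same vs. shifted: t=%6.2f  p=%.2g" % (t2, p2))
```

The first comparison yields a small |t| and an unremarkable p-value; the second yields a large |t| and a tiny p-value, so we would reject the null hypothesis there.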
You'll need to install a Python package on your VM using the following command at the shell: sudo apt-get install python-scipy (installing via pip, with pip install scipy, also works).
We'll be working with a single day's worth of session log data from the New York Times website. The data is available here as well as in the class GitHub repo. The data has been nicely cleaned and aggregated for us (no janitor work!). Load up a single day's data into a DataFrame, and summarize it.
import pandas as pd
data = pd.read_csv("nyt1.csv")
data.describe()
 | Age | Gender | Impressions | Clicks | Signed_In |
---|---|---|---|---|---|
count | 458441.000000 | 458441.000000 | 458441.000000 | 458441.000000 | 458441.000000 |
mean | 29.482551 | 0.367037 | 5.007316 | 0.092594 | 0.700930 |
std | 23.607034 | 0.481997 | 2.239349 | 0.309973 | 0.457851 |
min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 0.000000 | 0.000000 | 3.000000 | 0.000000 | 0.000000 |
50% | 31.000000 | 0.000000 | 5.000000 | 0.000000 | 1.000000 |
75% | 48.000000 | 1.000000 | 6.000000 | 0.000000 | 1.000000 |
max | 108.000000 | 1.000000 | 20.000000 | 4.000000 | 1.000000 |
8 rows × 5 columns
print(data.shape)
data.head()
(458441, 5)
 | Age | Gender | Impressions | Clicks | Signed_In |
---|---|---|---|---|---|
0 | 36 | 0 | 3 | 0 | 1 |
1 | 73 | 1 | 3 | 0 | 1 |
2 | 30 | 0 | 3 | 0 | 1 |
3 | 49 | 1 | 3 | 0 | 1 |
4 | 47 | 1 | 11 | 0 | 1 |
5 rows × 5 columns
We can see that the data contains roughly 450,000 rows, and 5 columns. These columns are Age, Gender, Impressions (the number of pages the user viewed), Clicks (the number of ads a user clicked on), and whether or not the user was signed in.
Let's plot the data to get a sense of the distributions.
%pylab inline
data.hist(figsize=(10,8))
Populating the interactive namespace from numpy and matplotlib
What can you say about the distributions of these fields?
Some observations: one worth checking is that Gender is only meaningful for signed-in users. If we group Gender by Signed_In, we see it is identically zero for the signed-out group.
data.groupby('Signed_In').Gender.describe()
Signed_In | count | mean | std | min | 25% | 50% | 75% | max
---|---|---|---|---|---|---|---|---
0 | 136177.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000
1 | 319198.000000 | 0.523644 | 0.499441 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000
Let's create two new columns of interest. The first is click-through rate (CTR), a measure commonly used in online advertising to gauge the effectiveness of an ad campaign; we will use it to measure differences in behavior between groups of users. The second is AgeGroup: we'll use pandas' cut method to turn a continuous variable (Age) into a discrete one, which makes things like plotting and measuring differences between groups easier.
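To see what cut does on its own, here's a tiny standalone example (toy ages, but the same bin edges we'll use below):

```python
import pandas as pd

# Each age is assigned to the half-open interval (left, right] that contains it.
ages = pd.Series([0, 15, 30, 45, 70])
bins = [-1, 0, 18, 24, 34, 44, 54, 64, 1000]
print(pd.cut(ages, bins))
```

The edge at -1 exists so that age 0 (which this data set uses for signed-out users) lands in its own (-1, 0] bucket rather than being dropped.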
Before doing any of this, though, we'll copy our data to a new data frame where we remove the cases where there are 0 impressions, because this will cause a divide by zero and possibly bias our analysis. The number of rows affected by this filter is small but non-trivial - these rows may warrant further investigation later!
data1 = data[data.Impressions > 0].copy()
data1['CTR'] = data1['Clicks']/data1['Impressions']
data1['AgeGroup'] = pd.cut(data1['Age'], [-1,0,18,24,34,44,54,64,1000])
data1.head()
 | Age | Gender | Impressions | Clicks | Signed_In | CTR | AgeGroup |
---|---|---|---|---|---|---|---|
0 | 36 | 0 | 3 | 0 | 1 | 0 | (34, 44] |
1 | 73 | 1 | 3 | 0 | 1 | 0 | (64, 1000] |
2 | 30 | 0 | 3 | 0 | 1 | 0 | (24, 34] |
3 | 49 | 1 | 3 | 0 | 1 | 0 | (44, 54] |
4 | 47 | 1 | 11 | 0 | 1 | 0 | (44, 54] |
5 rows × 7 columns
print(data1.shape)
data1.describe()
(455375, 7)
 | Age | Gender | Impressions | Clicks | Signed_In | CTR |
---|---|---|---|---|---|---|
count | 455375.000000 | 455375.000000 | 455375.000000 | 455375.000000 | 455375.000000 | 455375.000000 |
mean | 29.484010 | 0.367051 | 5.041030 | 0.093218 | 0.700956 | 0.018471 |
std | 23.606697 | 0.482001 | 2.208731 | 0.310922 | 0.457839 | 0.069034 |
min | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 0.000000 | 0.000000 | 3.000000 | 0.000000 | 0.000000 | 0.000000 |
50% | 31.000000 | 0.000000 | 5.000000 | 0.000000 | 1.000000 | 0.000000 |
75% | 48.000000 | 1.000000 | 6.000000 | 0.000000 | 1.000000 | 0.000000 |
max | 108.000000 | 1.000000 | 20.000000 | 4.000000 | 1.000000 | 1.000000 |
8 rows × 6 columns
Now, let's plot total clicks by age group and whether or not the user is signed in.
clicksByAgeSignIn = data1.groupby(['AgeGroup','Signed_In'])['Clicks'].sum()
clicksByAgeSignIn.plot(kind='bar')
By now, we understand that we need to treat groups differently. Let's take our data and divide it into CTRs by age group, for those users who have clicked on something (CTR > 0) and are signed in (Signed_In > 0).
loggedInCTRsByAgeGroup = data1[(data1.CTR > 0) & (data1.Signed_In > 0)].groupby('AgeGroup').CTR
loggedInCTRsByAgeGroup.describe()
AgeGroup | count | mean | std | min | 25% | 50% | 75% | max
---|---|---|---|---|---|---|---|---
(0, 18] | 2371.000000 | 0.214738 | 0.122203 | 0.058824 | 0.142857 | 0.200000 | 0.250000 | 1.000000
(18, 24] | 1669.000000 | 0.203926 | 0.116896 | 0.066667 | 0.125000 | 0.166667 | 0.250000 | 1.000000
(24, 34] | 2870.000000 | 0.204344 | 0.111438 | 0.066667 | 0.142857 | 0.166667 | 0.250000 | 1.000000
(34, 44] | 3592.000000 | 0.201586 | 0.112516 | 0.066667 | 0.142857 | 0.166667 | 0.250000 | 1.000000
(44, 54] | 3139.000000 | 0.202531 | 0.108735 | 0.062500 | 0.142857 | 0.166667 | 0.250000 | 1.000000
(54, 64] | 4337.000000 | 0.208181 | 0.117601 | 0.062500 | 0.142857 | 0.166667 | 0.250000 | 1.000000
(64, 1000] | 4084.000000 | 0.208385 | 0.110372 | 0.062500 | 0.142857 | 0.166667 | 0.250000 | 1.000000
Now that we have several samples of users' click-through behavior, what questions can we ask about these samples?
One question we might ask is "Which groups of users are most different?" Or, more precisely: Are any groups of users different? If any groups are different, which ones are?
First, we'll collect our groups as separate lists. Then, we'll run a t-test between each pair of groups. Finally, we'll collect the p-values for each pair of groups into a DataFrame.
from scipy.stats import ttest_ind
groups = [s for s in loggedInCTRsByAgeGroup]
def run_pairwise_tests(groups):
    # Each element of groups is a (label, sample) pair from the groupby.
    # Compare every unordered pair of groups exactly once.
    for g in groups:
        for g2 in groups:
            if g[0] < g2[0]:
                # ttest_ind returns (t, p); keep the p-value.
                yield g[0], g2[0], ttest_ind(g[1], g2[1], equal_var=False)[1]
testResults = pd.DataFrame(run_pairwise_tests(groups))
testResults contains the p-value of the t-test between the CTR samples of every pair of age groups in our data set. Using this sample, which pairs of groups are different at the 95% confidence level? Which groups are most likely to be different, according to these p-values?
testResults[testResults[2] < 0.05].sort_values(by=2)
 | 0 | 1 | 2 |
---|---|---|---|
2 | (0, 18] | (34, 44] | 0.000028 |
3 | (0, 18] | (44, 54] | 0.000121 |
1 | (0, 18] | (24, 34] | 0.001439 |
0 | (0, 18] | (18, 24] | 0.004526 |
17 | (34, 44] | (64, 1000] | 0.007703 |
16 | (34, 44] | (54, 64] | 0.010931 |
19 | (44, 54] | (64, 1000] | 0.024254 |
18 | (44, 54] | (54, 64] | 0.032186 |
4 | (0, 18] | (54, 64] | 0.033328 |
5 | (0, 18] | (64, 1000] | 0.037107 |
10 rows × 3 columns
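One caveat, and the reason this lesson's title mentions multiple hypothesis testing: we ran 21 pairwise tests (7 groups, choose 2), so even if no groups truly differed we'd expect about one p-value below 0.05 by chance alone. A simple, conservative adjustment is the Bonferroni correction, which rejects only when $p < \alpha / m$ for $m$ tests. A minimal sketch, applied to the ten significant p-values from the table above:

```python
# The ten p-values below 0.05 from the pairwise tests above, out of m = 21 tests.
p_values = [0.000028, 0.000121, 0.001439, 0.004526, 0.007703,
            0.010931, 0.024254, 0.032186, 0.033328, 0.037107]
m = 21          # number of pairwise comparisons: 7 choose 2
alpha = 0.05

# Bonferroni: compare each p-value against alpha / m instead of alpha.
survivors = [p for p in p_values if p < alpha / m]
print("threshold = %.5f, %d of %d pairs survive" % (alpha / m, len(survivors), len(p_values)))
```

Only the three smallest p-values, all comparisons involving the (0, 18] group, survive the correction; the borderline pairs do not. Less conservative procedures (Holm, or false-discovery-rate methods) exist, but Bonferroni is the simplest place to start.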