Exploring Click Logs and Testing Statistical Hypotheses

For class today, you read a chapter on Exploratory Data Analysis from the book "Doing Data Science." Today, we'll be walking through an analysis of the same data presented in that chapter and extending it with Multiple Hypothesis testing.

Don't forget to fil out the response form.

Also - you may want to install scipy while we're talking about hypothesis testing: sudo apt-get install python-scipy

Statistical Hypothesis Testing

Hypothesis testing is a powerful tool used in statistical analysis to support claims like "Drug X is better than Drug Y", or "Campers who attend a safety course led by a talking bear are effective at preventing forest fires." You'll see questions about whether or not an experiment was effective or results from two processes are different all over the place in your data science career.

Hypothesis testing allows us to answer these questions with statistical rigor. Generally, we establish a "null hypothesis", and then conduct a test which will tell us, given the data, whether or not we can reject this null hypothesis. Usually, the null hypothesis is something like "These two drugs are the same." or "This measure's mean is no different than zero."

We're going to work with one such tool, namely the Student's Two-Sample t-Test.

Historical Anecdote

This was not a test designed for students, instead, it was designed by a statistician William Gosset who published under the pseudonym "Student," while working for the Guinness brewing company.

Back to Statistics

Student's t-Test is used when comparing samples of normally distributed variables. This assumption of normality is important, but not strict. If the data is very much not normally distributed, a non-parameteric method (that is, distribution agnostic) like the Wilcoxon signed-rank test can be used. We are looking at Welch's t-Test, a variant of the classic Student's test which allows for two samples of different size, and possibly different variance.

To perform a t-Test, we compute a "t statistic" or "t score" with the two samples as input, the formula is:

$t = \frac{mean(X_1) - mean(X_2)}{s_{X_1-X_2}}$

where $s_{X_1-X_2} = \sqrt{\frac{s^2_1}{n_1} + \frac{s^2_2}{n_2}}$, and $s_i$ is the sample standard deviation of sample $i$ and $n_i$ is the size of sample $i$.

This score comes from a T-distribution, which looks like a normal distribution but with fatter tails. If the two samples are similar, the t-statistic will be close to 0. If they're not, the t-statistic will be high in absolute value.

Take a moment and think about the math. Under what conditions is the statistic highest? When the means are very different and their respective sample standard deviations are very small.

We'll use Welch's t-Test, an adaptation of the two-sample independent Student's t-test that takes into account samples that may have unequal sizes and variances (which, we can see from our distribution above, may be true in our data set!)

Since the t-statistic comes from a statistical distribution, we can map this value to a probability of sampling that value under the null hypothesis. The probability of realizing a high t-statistic when two samples come from the same distribution is very small, thus it's p-value is also small.

T-stat Courtesy of Wikipedia, the probability density function for the Student's t-Distribution. V indicates the number of samples in our distribution.

So, the output of a T-test is usually a pair of statistics, the "t score", and a "p value". The p value has a natural interpretation as a probability. We can say: "With $100 \times (1-p)$ percent confidence, I reject the null hypothesis that these two samples are the same." Meaning, that when p is very small, the two samples are statistically likely to be different.

Lab Work

Setup - Install Scipy

You'll need to install a python package to your VM using the following command at the shell: sudo apt-get install python-scipy

Setup - Data Loading

We'll be working with a single day's worth of session log data from the New York Times website. The data is available here as well as in the class github repo. The data has been nicely cleaned and aggregated for us (no janitor work!) Load up a single day's data into a DataFrame, and summarize it.

In [ ]:
import pandas as pd

data = pd.read_csv("nyt1.csv")
data.describe()
In [ ]:
print data.shape
data.head()

We can see that the data contains roughly 450,000 rows, and 5 columns. These columns are Age, Gender, Impressions (the number of pages the user viewed), Clicks (the number of ads a user clicked on), and whether or not the user was signed in.

Let's plot the data to get a sense of the distributions.

In [ ]:
%pylab inline
data.hist(figsize=(10,8))

What can you say about the distributions of these fields?

Some observations.

  1. Age - looks like a bell curve, but with a bunch of 0 values. How many new york times visitors do you think are 0? Do you think that 0 represents some kind of default value? Maybe some finer granularity plotting will help us here.
  2. Gender - Ok, it's binary, but would we expect this much of a skew (64/36) in the distribution of genders of readers on sites like the New York Times? Do you think something similar to age is happening here?
  3. Clicks - looks like it's discrete valued, with the vast majority of the values with 0 clicks.
  4. Impressions - kind of looks like a bell curve, but not quite - maybe plotting at finer granularity will help.
  5. Signed In - Looks like another binary variable with more people signed in than not. Can you think of why this might be?

DIY

  1. Plot these histograms at finer granularity by setting the number of histogram bins to 50. What kind of distribution does it look like Impressions follows?
  2. Does the behavior for age look more clear?
  3. Plot a histogram (or summarize the distribution otherwise - maybe a groupby and a describe?) of gender for sessions where the user is signed in (Signed_In == 1) vs not. Do you notice any differences?
In [ ]:
##YOUR ANSWER HERE

Defining New Variables for Analysis

Let's create two new columns of interest. The first is Click Thru Rate (CTR), this is a measure commonly used in online advertising to measure the effectiveness of an Ad Campaign. We will use it to measure differences in behavior between groups of users.

The second thing we'll do is use pandas' cut method to turn a continuous variable (Age), into a discrete one - AgeGroup. This makes things like plotting and measuring differences between groups easier.

Before doing any of this, though, we'll copy our data to a new data frame where we remove the cases where there are 0 impressions, because this will cause a divide by zero and possibly bias our analysis. The number of rows affected by this filter is small but non-trivial - these rows may warrant further investigation later!

DIY

  1. Calculate a new column called "CTR", which is the proportion of "Clicks" to "Impressions".
  2. Calculate a new column called "AgeGroup", which bins 'Age' into the following buckets - (-1,0], (0, 18], (18, 24], (24, 34], (34, 44], (44, 54], (54, 64], (64, 1000]. Use pandas cut function.
In [ ]:
data1 = data[data.Impressions > 0]

##YOUR ANSWERS HERE
data1['CTR'] = #DEFINE CTR
data1['AgeGroup'] = #DEFINE AgeGroup
data1.head()

Exploring New Metrics

In [ ]:
print data1.shape
data1.describe()

Now, let's plot total impressions and clicks by Age Group and whether or not the user is signed in.

In [ ]:
impressionsByAgeSignIn = data1.groupby(['AgeGroup','Signed_In'])['Clicks'].sum()
impressionsByAgeSignIn.plot(kind='bar')

By Now, we understand that we need to treat groups differently. Let's take our data and divide it into CTRs by age group for those users that have clicked on something (CTR > 0) and are signed in (Signed_In > 0).

In [ ]:
loggedInCTRsByAgeGroup = data1[(data1.CTR > 0) & (data1.Signed_In > 0)].groupby('AgeGroup').CTR

loggedInCTRsByAgeGroup.describe()

Hypothesis Testing

Now that we have several samples of user's click-through behavior. What's a question that we can ask ourselves about these samples?

One question we might ask is "Which groups of users is most different?" Or, more precisely: Are any groups of users different? If any groups are different, which ones are?

First, we'll collect our groups as separate lists, then, we'll run a t-Test between each pair of groups. Finally, we'll collect the p-values for each pair of groups into a DataFrame.

In [ ]:
from scipy.stats import ttest_ind

groups = [s for s in loggedInCTRsByAgeGroup]

def run_pairwise_tests(groups):
    for g in groups:
        for g2 in groups:
            if g[0] < g2[0]:
                yield g[0], g2[0], ttest_ind(g[1], g2[1])[1]
            
testResults = pd.DataFrame(run_pairwise_tests(groups))

DIY

  1. The object testResults contains the p-value for the T-test between CTR samples of all pairs of age groups in our data set. Using this sample, what pairs of groups are different at the 95% confidence level? Which groups are most likely to be different, according to these p-values?
  2. What about the other end of the spectrum? Which are least likely to be different? Do these results make intuitive sense to you?
In [ ]:
## YOUR ANSWERS HERE