P-Hacking¶

Author Jean-François Puget ¶

This notebook is the code used in my blog post on Green dice are loaded (welcome to p-hacking)

It demonstrates how to mislead people about a scientific experiment by hacking p-values following the methodlogy from this xkcd comic:

First, some useful imports.

In [1]:

import numpy as np
import pandas as pd
from numpy.random import random_integers, seed

The experiment¶

We are given 20 dice colors, and we roll dice for each color 1,000 times. We report the number of six. We set the random seed to make sure results are reproducible.

In [2]:

dice = ['Purple', 'Brown', 'Pink', 'Blue', 'Teal', 
        'Salmon', 'Red', 'Turquoise', 'Magenta', 'Yellow', 
        'Grey', 'Tan', 'Cyan', 'Green', 'Mauve',
        'Beige', 'Lilac', 'Black', 'Peach', 'Orange']

def dice_experiments(dice, n):
    df = pd.DataFrame(index=dice)
    for die in dice:
        result = random_integers(1,6,n)
        df.ix[die,'Number of Six'] = np.sum(result[result==6])//6
    return df

np.random.seed(75000)
df = dice_experiments(dice, 1000)
df

Out[2]:

	Number of Six
Purple	151
Brown	167
Pink	158
Blue	167
Teal	181
Salmon	162
Red	170
Turquoise	161
Magenta	165
Yellow	180
Grey	172
Tan	164
Cyan	181
Green	188
Mauve	165
Beige	172
Lilac	178
Black	176
Peach	173
Orange	157

The green color has 188 times a six, which is rather high. Indeed, average value is 1,000/6 = 167.

The probability that green dice have at least 188 six is about 4% as we will see below. This is the p-value for the green color result. If we decided to watch the green color before the experiement, then this low p-value would warrant us a scientific publication. Read my blog to see why.

Let's see how unlikely this result is by running our experiment many times.

In [3]:

def get_fraction_one_color(color, dice, k, n, repeat):
    success = 0.0
    for experiment in range(repeat):
        df = dice_experiments(dice, n)
        if df.loc['Green','Number of Six'] >= k:
            success += 1
    return success/repeat

get_fraction_one_color('Green', dice, 188, 1000, 1000)

Out[3]:

0.041

This is about 4.1%, i.e. rather unlikely. There must be a reason why green dice favor six that much!

The Catch¶

We misinterpreted what we did and our conclusion is misleading. What we did was to run the experiment for 20 colors, then select the color with the highest numpber of six. The p-value should be the probability that at least one color gets at least 188 six.

Let's evaluate this probability experimentally.

In [4]:

def get_fraction_any_color(dice, k, n, repeat):
    success = 0.0
    for experiment in range(repeat):
        df = dice_experiments(dice, n)
        if df['Number of Six'].max() >= k:
            success += 1
    return success/repeat

get_fraction_any_color(dice, 188, 1000, 1000)

Out[4]:

0.556

This probability is about 56%, i.e. very likely. There is nothing special about our dice afterall.

This little code may look contrived, but it is not. scientists got publications using a similar methodology, see my blog for details.

Exact probabilities¶

Let's compute the above p-values using probability theory.

The probability that k out of n are six is given by the function below. Indeed, in order to get exactly $k$ times a six among $n$ rolls, we must first decide where those six appear in the sequence of 1,000 rolls. There are exactly $n$ choose $k = n!/(k!.(n-k)!$ ways to do it. Then for each of these sequence, the probability that all of the $k$ occurrences are a six is $1/6^k$ and the probability that the remaining $n-k$ occurrences are not a six is $(5/6)^{n-k}$. The probability to get exactly $k$ times a six out of $n$ dice rolls is therefore equal to $1/6^k(5/6)^{n-k}n!/(k!(n-k)!)$.

In [5]:

from scipy.misc import comb

def proba_k_among_n(k, n):
    p = 1/6
    q = 1-p
    result = p**k * q**(n-k) * comb(n, k)
    return result

print("%0.4f" % proba_k_among_n(188,1000))

0.0066

The probability that at least k out of n are six is the sum of the probability for each possible value greater than or equal to k.

In [6]:

def proba_at_least_k_among_n(k, n):
    proba = 0.0
    for i in range(k, n+1):
        proba += proba_k_among_n(i, n)
    return proba

print("%0.4f" % proba_at_least_k_among_n(188,1000))

0.0401

Probability that one specific color gets at least k six out of n is what we just computed.

In [7]:

def proba_one_color(k, n):
    return proba_at_least_k_among_n(k, n)

print("%0.3f" % proba_one_color(188, 1000))

0.040

Probability that at least one color among ncolors gets at least k six out of n

In [8]:

def proba_at_least_one_color(ncolors, k, n):
    p = proba_one_color(k, n)
    proba_no_color = (1-p)**ncolors
    return 1 - proba_no_color

print("%0.3f" % proba_at_least_one_color(20, 188, 1000))

0.559

This is close to what we found experimentally.

Using built-in binomial distribution¶

A more direct code, using binomial distribution.

In [9]:

from scipy import stats

One experiment gets more than 188 times a six.

In [10]:

stats.binom(1000, 1/6.0).sf(187)

Out[10]:

0.040142990286396493

At least one in 20 experiments gets more than 188 times a six.

In [11]:

1 - (stats.binom(1000, 1/6.0).cdf(187)**20)

Out[11]:

0.55931241410111465

Ultimate p-hacking¶

We wanted to mimick the xkcd comic, hence we looked for a random seed that yields the result we want. This is really bad p-hacking, as we define first the p-value, then find experiment settings that produce it.

In [12]:

def find_seed(dice, k, n, repeat, rounding):
    nseed = 0.0
    for seed in range(0,repeat,rounding):
        np.random.seed(seed)
        df = dice_experiments(dice, n)
        m = df['Number of Six'].max()
        if m == k and m == df.loc['Green','Number of Six']:
            return seed
    
res = find_seed(dice, 188, 1000, 1000000, 1000)
res

Out[12]: