This notebook computes an estimate of A/B test duration in a manner similar to Dan McKinley's online tool http://www.experimentcalculator.com. The calculations are adapted from Casagrande, Pike, and Smith (1978), "An improved approximate formula for calculating sample sizes for comparing two binomial distributions".

The adjustable parameters are as follows:

  • The number of test subjects expected per day (to be split into 50% control, 50% test)
  • Baseline rate of conversion
  • Percentage change anticipated in the baseline rate
  • Desired false alarm rate
  • Desired probability of correct detection

The false alarm rate (denoted $\alpha$) is the statistical significance level for rejecting the null hypothesis: if there is actually no difference between the conversion rates of the test and control groups, $\alpha$ is the probability that we conclude there is one anyway.

The correct detection rate (denoted $\beta$) is the probability that, if there really is a difference between test and control, we correctly detect it. This is usually called the "power" of the test; note that much of the literature writes power as $1-\beta$, reserving $\beta$ for the missed detection rate.

In [1]:
import numpy as np
from scipy import stats

n_per_day = 10000  # Number of subjects per day

control_rate = 0.05  # Baseline conversion rate
rate_diff = 0.03  # Anticipated relative change to the baseline rate (0.03 = +3%)

alpha = 0.05  # False alarm rate
beta = 0.95  # Correct detection rate

Let $p_1$ and $p_2$ be the conversion rates of the two groups (test and control), with the test and control rates assigned such that $p_1 > p_2$.

In [2]:
test_rate = control_rate*(1.0 + rate_diff)
p1 = max(control_rate, test_rate)
p2 = min(control_rate, test_rate)

We want to test whether $p_1 = p_2$ (the null hypothesis) or $p_1 > p_2$ (the alternate hypothesis). The number of samples we will need depends on how many successes we expect to see in each of the groups, which in turn depends on $p_1$, $p_2$, our desired false positive rate $\alpha$, and our desired correct detection rate $\beta$.

We compute required sample size using the following approximate formula from Casagrande et al (1978): $$n = A \left[\frac{1 + \sqrt{1 + \frac{4(p_1 - p_2)}{A}}}{2(p_1 - p_2)}\right]^2$$ where $A$ is a $\chi^2$ "correction factor" given by
$$A = \left[z_{1-\alpha} \sqrt{2\bar{p}(1 - \bar{p})} + z_{\beta} \sqrt{p_1 (1-p_1) + p_2 (1-p_2)} \right]^2,$$ with $\bar{p} = (p_1+p_2)/2$, and where $z_p$ denotes the standard normal quantile function, i.e. $z_p = \Phi^{-1}(p)$ is the location of the $p$-th quantile of $N(0, 1)$. For a two-sided test, $z_{1-\alpha}$ is replaced by $z_{1-\alpha/2}$, as in the code below.

In [6]:
p_bar = (p1 + p2)/2.0

# PPF is the "percent point function", a.k.a. quantile function.
# Note that we divide alpha by 2 if we desire a two-sided test.
za = stats.norm.ppf(1 - alpha/2)  
zb = stats.norm.ppf(beta)

A = (za*np.sqrt(2*p_bar*(1-p_bar)) + zb*np.sqrt(p1*(1-p1) + p2*(1-p2)))**2
n = A*(((1 + np.sqrt(1 + 4*(p1-p2)/A))) / (2*(p1-p2)))**2
n_act = int(np.ceil(n))

# Print results
print('Number of users needed each for control and test groups = {:,d} ({:,d} total)'.format(n_act, 2*n_act))
print('Estimated duration at {:,d} subjects per day: {:,d} days'.format(int(n_per_day), int(np.ceil(2.0*n_act/n_per_day))))
Number of users needed each for control and test groups = 557,786 (1,115,572 total)
Estimated duration at 10,000 subjects per day: 112 days
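
To see how much the bracketed correction factor actually matters here, we can compare against the plain normal-approximation sample size $n \approx A/(p_1 - p_2)^2$, i.e. the same formula with the square-root adjustment dropped. A minimal sketch, recomputing the quantities above from scratch so the cell stands alone:

```python
import numpy as np
from scipy import stats

control_rate, rate_diff = 0.05, 0.03  # parameter values from above
alpha, beta = 0.05, 0.95

p1 = control_rate * (1.0 + rate_diff)
p2 = control_rate
d = p1 - p2
p_bar = (p1 + p2) / 2.0

za = stats.norm.ppf(1 - alpha/2)
zb = stats.norm.ppf(beta)
A = (za*np.sqrt(2*p_bar*(1 - p_bar)) + zb*np.sqrt(p1*(1 - p1) + p2*(1 - p2)))**2

n_plain = A / d**2                                   # normal approximation, no correction
n_corr = A * ((1 + np.sqrt(1 + 4*d/A)) / (2*d))**2   # with the Casagrande et al. adjustment

print(int(np.ceil(n_plain)), int(np.ceil(n_corr)))
```

At these parameter values the correction adds only roughly 0.2% to the sample size; it matters more when $p_1 - p_2$ is large relative to $A$.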

To validate these results, we can simulate $n_s$ repetitions of the experiment and estimate empirical false alarm and correct detection rates. For each repetition we draw a pair of binomial variates, form the corresponding 2x2 contingency table, and apply Fisher's exact test. Making the accept/reject decision separately for the repetitions where the null is true and those where the alternate is true gives us the empirical frequencies of false alarms and correct detections.

In [4]:
ns = 1000  # Number of simulated experiments

# The tail we test depends on whether the anticipated change is positive or negative
astr = 'greater' if (rate_diff < 0) else 'less'

# Experimental results when null is true
control0 = stats.binom.rvs(n_act, control_rate, size=ns)
test0 = stats.binom.rvs(n_act, control_rate, size=ns)  # Test and control share the same rate
tables0 = [[[a, n_act-a], [b, n_act-b]] for a, b in zip(control0, test0)]  # Contingency tables
results0 = [stats.fisher_exact(T, alternative=astr) for T in tables0]
decisions0 = [x[1] <= alpha for x in results0]
         
# Experimental results when alternate is true
control1 = stats.binom.rvs(n_act, control_rate, size=ns)
test1 = stats.binom.rvs(n_act, test_rate, size=ns)  # Test and control rates differ
tables1 = [[[a, n_act-a], [b, n_act-b]] for a, b in zip(control1, test1)]  # Contingency tables
results1 = [stats.fisher_exact(T, alternative=astr) for T in tables1]
decisions1 = [x[1] <= alpha for x in results1]

# Compute false alarm and correct detection rates
alpha_est = sum(decisions0)/float(ns)
beta_est = sum(decisions1)/float(ns)

print('True false alarm rate = %0.2f, estimated false alarm rate = %0.2f' % (alpha, alpha_est))
print('True correct detection rate = %0.2f, estimated correct detection rate = %0.2f' % (beta, beta_est))
True false alarm rate = 0.05, estimated false alarm rate = 0.05
True correct detection rate = 0.95, estimated correct detection rate = 0.97
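
Fisher's exact test is expensive on tables this large; for a quicker validation loop, the same check can be run with a pooled two-proportion z-test, whose normal approximation is essentially exact at these sample sizes. A minimal sketch; the seed, the trial count `ns = 500`, and the `one_sided_pvals` helper are choices made here for illustration (the one-sided alternative matches the positive `rate_diff` case above):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed, chosen here for reproducibility
n = 557786                      # per-group sample size from the calculation above
p_control, p_test = 0.05, 0.0515
alpha, ns = 0.05, 500

def one_sided_pvals(successes_a, successes_b, n):
    # Pooled two-proportion z-test of H1: p_b > p_a
    pa, pb = successes_a / n, successes_b / n
    pooled = (successes_a + successes_b) / (2.0 * n)
    se = np.sqrt(2.0 * pooled * (1.0 - pooled) / n)
    z = (pb - pa) / se
    return stats.norm.sf(z)  # upper-tail p-value

# Null true: both groups share the control rate
c0 = rng.binomial(n, p_control, size=ns)
t0 = rng.binomial(n, p_control, size=ns)
alpha_est = np.mean(one_sided_pvals(c0, t0, n) <= alpha)

# Alternate true: test group gets the lifted rate
c1 = rng.binomial(n, p_control, size=ns)
t1 = rng.binomial(n, p_test, size=ns)
beta_est = np.mean(one_sided_pvals(c1, t1, n) <= alpha)

print(alpha_est, beta_est)
```

The estimates should again come out near the design values, with the detection rate a little above 0.95 because the sample size was set using the two-sided $z_{1-\alpha/2}$ while the decision here is one-sided.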