This is an update to my previous take on Bayesian AB testing. Specifically, I looked at taking advantage of prior knowledge that the change in conversion rate shouldn't be large -- maybe because you've run dozens of similar tests and you're making only a small modification to the website. Previously I described a sort of multi-level beta-binomial model to account for that prior knowledge, but it turns out that regularized logistic regression is simpler and gives pretty much the same answer.
We start with a baseline log-odds of conversion $\mu$ (the logit of the baseline conversion rate), from an uninformative prior*. Then, based on your prior knowledge of how much the two conversion rates are likely to differ, select a spread term $\sigma$. Next the two conversion rates $p_A$ and $p_B$ are drawn by passing normal draws through the logistic function: $p_i = \text{logistic}(x_i)$ with $x_i \sim \mathcal{N}(\mu, \sigma)$. We then observe visits to the website, which convert (Bernoulli trials) with rates $p_A$ and $p_B$.
* It's simple enough to use prior information for this too, but it turns out to have very little effect on the answer.
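To help choose $\sigma$, here's a quick prior-predictive sketch (not part of the analysis below; it assumes a baseline around 10% conversion, i.e. $\mu \approx -2.2$, and the variable names are just illustrative): simulate rate pairs from the prior and look at the relative differences they imply.
from numpy import exp, percentile
from numpy.random import normal
# draw (p_A, p_B) pairs from the prior and summarize the implied lifts
mu, sigma = -2.2, .09
xA, xB = normal(mu, sigma, 10000), normal(mu, sigma, 10000)
pA_sim, pB_sim = 1/(1+exp(-xA)), 1/(1+exp(-xB))
percentile((pB_sim - pA_sim)/pA_sim, [2.5, 97.5])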
Why is this model just regularized logistic regression? Treat the variant as a categorical predictor, so each variant gets a log-odds coefficient $x_A$ or $x_B$. The negative log posterior is then the usual logistic-regression log-loss plus a quadratic penalty $\frac{1}{2\sigma^2}\sum_i (x_i - \mu)^2$. In other words, finding the MAP estimate is exactly fitting an L2-regularized (ridge) logistic regression whose penalty shrinks the per-variant coefficients toward the shared baseline, with regularization strength $1/\sigma^2$.
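To make that equivalence concrete, here's a minimal sketch of the penalized fit using scipy (the function name and starting point are just illustrative; the counts are the ones from the example below):
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # numerically stable logistic

def neg_log_posterior(params, successes, trials, sigma):
    # params = [mu, coef_A, coef_B]; the quadratic term is an L2 (ridge)
    # penalty shrinking the per-variant log-odds toward the baseline mu
    mu, coef = params[0], params[1:]
    p = expit(coef)
    nll = -np.sum(successes*np.log(p) + (trials - successes)*np.log(1 - p))
    return nll + np.sum((coef - mu)**2)/(2*sigma**2)

map_fit = minimize(neg_log_posterior, x0=np.zeros(3),
                   args=(np.array([300, 350]), np.array([3300, 3350]), .09))
expit(map_fit.x[1:])  # MAP estimates of p_A and p_B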
You can solve for the maximum a posteriori (MAP) estimates of $p_A$ and $p_B$ directly, as in the sketch above, but it's nice to have a full posterior distribution to work with. For example, you might want to report various credible intervals of interest. To get the distributions, I'll use Stan.
import pandas as pd
import matplotlib.pyplot as plt
from numpy import exp
from pystan import StanModel
# pystan gives some annoying warnings when sampling
import warnings
warnings.filterwarnings('ignore')
logistic = lambda x: 1/(1+exp(-x))
The Stan code for the 'naive', usual beta-binomial model, with an independent beta prior on each conversion rate:
naive_model_code = '''
data {
  int<lower=0> successes[2];
  int<lower=0> trials[2];
  real<lower=0> alpha0;
  real<lower=0> beta0;
}
parameters {
  vector<lower=0, upper=1>[2] p;  // conversion rates must be constrained to [0, 1]
}
model {
  p ~ beta(alpha0, beta0);
  successes ~ binomial(trials, p);
}
'''
naive_model = StanModel(model_code=naive_model_code)
The Stan code for the improved beta-binomial model from the previous notebook:
bb_model_code = '''
data {
  int<lower=0> successes[2];
  int<lower=0> trials[2];
  real<lower=0> alpha0;
  real<lower=0> beta0;
  real<lower=0> tau;
}
parameters {
  real<lower=0, upper=1> mu;
  vector<lower=0, upper=1>[2] p;
}
transformed parameters {
  real<lower=0> alpha1;
  real<lower=0> beta1;
  alpha1 <- tau*mu;
  beta1 <- tau*(1-mu);
}
model {
  mu ~ beta(alpha0, beta0);
  p ~ beta(alpha1, beta1);
  successes ~ binomial(trials, p);
}
'''
bb_model = StanModel(model_code=bb_model_code)
The Stan code for the new logistic regression model. (Stan's binomial_logit(n, x) is just a numerically stable binomial(n, logistic(x)).)
lr_model_code = '''
data {
  int<lower=0> successes[2];
  int<lower=0> trials[2];
  real<lower=0> sigma;
}
parameters {
  real mu;
  vector[2] coef;
}
model {
  coef ~ normal(mu, sigma);
  successes ~ binomial_logit(trials, coef);
}
'''
lr_model = StanModel(model_code=lr_model_code)
Let's look at the same example as the previous notebook. Check the fit summary for each model to make sure things converged and behaved as expected.
successesA = 300
successesB = 350
failuresA = 3000
failuresB = 3000
successes = [successesA, successesB]
trials = [successesA+failuresA, successesB+failuresB]
pA, pB = successesA/(successesA+failuresA), successesB/(successesB+failuresB)
pA, pB, (pB-pA)/pA
(0.09090909090909091, 0.1044776119402985, 0.1492537313432835)
stan_data = dict(successes=successes, trials=trials, alpha0=1, beta0=1)
naive_fit1 = naive_model.sampling(data=stan_data)
naive_fit1
Inference for Stan model: anon_model_6d413c36495b1f1de92fa8ddc3024cf0.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.

      mean se_mean     sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
p[0]  0.09  4.7e-4 4.8e-3  0.08  0.09  0.09  0.09   0.1   105 1.02
p[1]   0.1  3.6e-4 5.2e-3  0.09   0.1   0.1  0.11  0.11   211 1.01
lp__ -2127    0.05   0.93 -2130 -2128 -2127 -2127 -2126   381 1.01

Samples were drawn using NUTS(diag_e) at Sun Feb 1 17:13:34 2015.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
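As a quick sanity check (not part of the original flow): with the flat Beta(1, 1) prior the posterior is available in closed form, $\text{Beta}(1 + \text{successes},\ 1 + \text{failures})$, so the quantiles above can be compared against exact draws (variable names here are just illustrative):
from numpy import percentile
from numpy.random import beta
exact_pA = beta(1 + successesA, 1 + failuresA, 4000)
exact_pB = beta(1 + successesB, 1 + failuresB, 4000)
percentile(exact_pA, [2.5, 50, 97.5]), percentile(exact_pB, [2.5, 50, 97.5])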
stan_data = dict(successes=successes, trials=trials, alpha0=30, beta0=270)
naive_fit2 = naive_model.sampling(data=stan_data)
naive_fit2
Inference for Stan model: anon_model_6d413c36495b1f1de92fa8ddc3024cf0.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.

      mean se_mean     sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
p[0]  0.09  3.8e-4 5.0e-3  0.08  0.09  0.09  0.09   0.1   171 1.02
p[1]   0.1  3.3e-4 5.0e-3  0.09   0.1   0.1  0.11  0.11   230 1.01
lp__ -2318    0.06   1.05 -2320 -2318 -2317 -2317 -2317   286 1.01

Samples were drawn using NUTS(diag_e) at Sun Feb 1 17:14:00 2015.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
stan_data = dict(successes=successes, trials=trials, alpha0=3, beta0=27, tau=1500)
bb_fit = bb_model.sampling(data=stan_data)
bb_fit
Inference for Stan model: anon_model_472b220940aa6b1a70571ed176fc238a.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.

         mean se_mean     sd   2.5%    25%    50%    75%  97.5% n_eff Rhat
mu        0.1  2.2e-4 6.7e-3   0.09   0.09    0.1    0.1   0.11   904  1.0
p[0]     0.09  1.5e-4 4.7e-3   0.08   0.09   0.09    0.1    0.1   928  1.0
p[1]      0.1  1.6e-4 4.8e-3   0.09    0.1    0.1   0.11   0.11   947  1.0
alpha1 146.84    0.33  10.06 127.52 139.91 146.74 153.69 166.38   904  1.0
beta1  1353.1    0.33  10.06 1333.6 1346.3 1353.2 1360.0 1372.4   904  1.0
lp__    -2121    0.05   1.29  -2124  -2121  -2120  -2120  -2119   800  1.0

Samples were drawn using NUTS(diag_e) at Sun Feb 1 17:03:37 2015.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
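How does the beta-binomial's $\tau$ line up with the logistic model's $\sigma$? A rough delta-method translation (my shorthand, not from the original analysis): a beta prior with mean $\mu$ and concentration $\tau$ has $\text{sd}(p) = \sqrt{\mu(1-\mu)/(\tau+1)}$, and multiplying by the slope of $\text{logit}(p)$ at $\mu$, namely $1/(\mu(1-\mu))$, gives the matching spread on the log-odds scale:
from numpy import sqrt
mu, tau = .1, 1500
sd_p = sqrt(mu*(1-mu)/(tau+1))  # prior sd of p under Beta(tau*mu, tau*(1-mu))
sd_p/(mu*(1-mu))                # ~0.086 on the log-odds scale, hence sigma=.09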
stan_data = dict(successes=successes, trials=trials, sigma=.09)
lr_fit = lr_model.sampling(data=stan_data)
lr_fit
Inference for Stan model: anon_model_f81458e5eafcf826b0b5d753bc432a22.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.

         mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
mu      -2.22  2.4e-3 0.07 -2.37 -2.27 -2.22 -2.18 -2.08   962  1.0
coef[0] -2.28  1.8e-3 0.05 -2.39 -2.32 -2.28 -2.24 -2.17   930  1.0
coef[1] -2.17  1.6e-3 0.05 -2.27  -2.2 -2.17 -2.14 -2.07  1006  1.0
lp__    -2128    0.04 1.24 -2132 -2129 -2128 -2127 -2127   876  1.0

Samples were drawn using NUTS(diag_e) at Sun Feb 1 17:02:17 2015.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
Combine all the estimates into one dataframe for plotting:
p = pd.DataFrame(naive_fit1.extract()['p'], columns=['A', 'B'])
naive_data1 = pd.DataFrame(dict(value=(p.B-p.A)/p.A))
naive_data1['variable'] = 'naive flat prior'
p = pd.DataFrame(naive_fit2.extract()['p'], columns=['A', 'B'])
naive_data2 = pd.DataFrame(dict(value=(p.B-p.A)/p.A))
naive_data2['variable'] = 'naive informative prior'
p = pd.DataFrame(bb_fit.extract()['p'], columns=['A', 'B'])
bb_data = pd.DataFrame(dict(value=(p.B-p.A)/p.A))
bb_data['variable'] = 'improved beta-binomial'
p = pd.DataFrame(logistic(lr_fit.extract()['coef']), columns=['A', 'B'])
lr_data = pd.DataFrame(dict(value=(p.B-p.A)/p.A))
lr_data['variable'] = 'logistic regression'
data = pd.concat([naive_data1, naive_data2, bb_data, lr_data])
The new, simpler logistic regression model gives basically the same answer as the previous beta-binomial model.
data.groupby('variable').value.plot(kind='kde')
plt.legend()
plt.xlim(-.1, .4);
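And since we have full posteriors for the relative lift, the summaries mentioned earlier are one-liners -- for example, a 95% credible interval and the posterior probability that B beats A under the logistic regression model:
from numpy import percentile
lift = lr_data.value  # posterior samples of (p_B - p_A)/p_A
percentile(lift, [2.5, 97.5]), (lift > 0).mean()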