The posterior distribution is equal to the joint distribution divided by the marginal distribution of the evidence.
$$ \color{red}{P(\theta\ |\ \mathcal{D})} = \frac{\color{blue}{P(\mathcal{D}\ |\ \theta)\ P(\theta)}}{\color{green}{P(\mathcal{D})}} = \frac{\color{blue}{P(\mathcal{D}, \theta)}}{\color{green}{\int P(\mathcal{D}\ |\ \theta)\ P(\theta)\ d\theta}} $$

For many useful models the marginal distribution of the evidence is hard or impossible to calculate analytically.
We observe three successes in ten trials, and want to infer the true success probability.
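For this simple model the evidence is actually tractable: the uniform prior is a $\textrm{Beta}(1, 1)$ distribution, which is conjugate to the Bernoulli likelihood, so we have an exact posterior to check the approximations against
$$ P(p\ |\ \mathcal{D}) \propto p^3 (1 - p)^7 \implies p\ |\ \mathcal{D} \sim \textrm{Beta}(1 + 3,\ 1 + 7) = \textrm{Beta}(4, 8) $$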
import numpy as np

x_beta_binomial = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
import pymc3 as pm
with pm.Model() as beta_binomial_model:
    p_beta_binomial = pm.Uniform('p', 0., 1.)
Applied interval-transform to p and added transformed p_interval_ to model.
with beta_binomial_model:
    x_obs = pm.Bernoulli('y', p_beta_binomial,
                         observed=x_beta_binomial)
with beta_binomial_model:
    beta_binomial_trace_ = pm.sample(BETA_BINOMIAL_SAMPLES, random_seed=SEED)

beta_binomial_trace = beta_binomial_trace_[BETA_BINOMIAL_BURN::BETA_BINOMIAL_THIN]
Assigned NUTS to p_interval_ [-------100%-------] 50000 of 50000 in 13.1 sec. | SPS: 3829.8 | ETA: 0.0
Variational inference minimizes the Kullback-Leibler divergence
$$\mathbb{KL}(\color{purple}{q(\theta)} \parallel \color{red}{p(\theta\ |\ \mathcal{D})}) = \mathbb{E}_q\left(\log\left(\frac{\color{purple}{q(\theta)}}{\color{red}{p(\theta\ |\ \mathcal{D})}}\right)\right)$$
from an approximating distribution $q$ to the true posterior, but we can't calculate the true posterior, so we can't evaluate this divergence directly.
Minimizing the Kullback-Leibler divergence
$$ \mathbb{KL}(\color{purple}{q(\theta)} \parallel \color{red}{p(\theta\ |\ \mathcal{D})}) = -(\underbrace{\mathbb{E}_q(\log \color{blue}{p(\mathcal{D}, \theta)}) - \mathbb{E}_q(\log \color{purple}{q(\theta)})}_{\color{orange}{\textrm{ELBO}}}) + \log \color{green}{p(\mathcal{D})} $$
is equivalent to maximizing the Evidence Lower BOund (ELBO), which only requires calculating the joint distribution.
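Rearranging this identity shows why: $\log \color{green}{p(\mathcal{D})}$ does not depend on $q$, so maximizing the ELBO minimizes the divergence, and because the divergence is nonnegative the ELBO really is a lower bound on the log evidence
$$ \log \color{green}{p(\mathcal{D})} = \color{orange}{\textrm{ELBO}} + \mathbb{KL}(\color{purple}{q(\theta)} \parallel \color{red}{p(\theta\ |\ \mathcal{D})}) \geq \color{orange}{\textrm{ELBO}} $$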
Approximate the true distribution using a diagonal-covariance Gaussian from the class
$$\mathcal{Q} = \left\{\left. N\left(\begin{pmatrix} \mu_x \\ \mu_y \end{pmatrix}, \begin{pmatrix} \sigma_x^2 & 0 \\ 0 & \sigma_y^2\end{pmatrix}\right)\ \right|\ \mu_x, \mu_y \in \mathbb{R},\ \sigma_x, \sigma_y > 0\right\}$$
Assume the variational distribution factors independently as $q(\theta_1, \ldots, \theta_n) = q(\theta_1) \cdots q(\theta_n)$
The variational approximation can be found by coordinate ascent
$$ \begin{align*} q(\theta_i) & \propto \exp\left(\mathbb{E}_{q_{-i}}(\log p(\mathcal{D}, \boldsymbol{\theta}))\right) \\ q_{-i}(\boldsymbol{\theta}) & = q(\theta_1) \cdots q(\theta_{i - 1})\ q(\theta_{i + 1}) \cdots q(\theta_n) \end{align*} $$
BBVI and ADVI arise from different ways of calculating the gradient of the ELBO.
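Roughly (following Ranganath et al.), BBVI writes the ELBO gradient as an expectation under $q$ that can be estimated with Monte Carlo samples from $q$,
$$ \nabla_{\lambda}\ \color{orange}{\textrm{ELBO}} = \mathbb{E}_q\left(\nabla_{\lambda} \log \color{purple}{q(\theta\ ;\ \lambda)} \left(\log \color{blue}{p(\mathcal{D}, \theta)} - \log \color{purple}{q(\theta\ ;\ \lambda)}\right)\right), $$
where $\lambda$ denotes the variational parameters, while ADVI transforms the parameters to an unconstrained space and differentiates through samples from a Gaussian approximation (the reparameterization trick).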
import edward as ed
from edward.models import Bernoulli, Beta, Uniform
ed.set_seed(SEED)
# probability model
p = Uniform(a=0., b=1.)
x_edward_beta_binomial = Bernoulli(p=p)
data = {x_edward_beta_binomial: x_beta_binomial}
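The `tf_positive_variable` helper used below isn't defined in these slides; a minimal sketch, assuming it simply passes an unconstrained TensorFlow variable through a softplus so the variational Beta parameters stay positive:

import tensorflow as tf

def tf_positive_variable():
    # unconstrained variable mapped to (0, inf) via softplus (assumed implementation, not from the slides)
    return tf.nn.softplus(tf.Variable(tf.random_normal([])))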
# variational approximation
q_p = Beta(a=tf_positive_variable(),
           b=tf_positive_variable())
q = {p: q_p}
%%time
beta_binomial_inference = ed.MFVI(q, data)
beta_binomial_inference.run(n_iter=10000, n_print=None)
CPU times: user 5.83 s, sys: 880 ms, total: 6.71 s Wall time: 4.63 s
from edward.models import PyMC3Model
from theano import shared

# probability model
x_beta_binomial_obs = shared(np.zeros(1))

with pm.Model() as beta_binomial_model_untransformed:
    p = pm.Uniform('p', 0., 1., transform=None)
    x_beta_binomial_ = pm.Bernoulli('x', p, observed=x_beta_binomial_obs)
pymc3_data = {x_beta_binomial_obs: x_beta_binomial}
pymc3_beta_binomial_model = PyMC3Model(beta_binomial_model_untransformed)
%%time
# variational distribution
pymc3_q_p = Beta(a=tf_positive_variable(),
                 b=tf_positive_variable())
pymc3_q = {'p': pymc3_q_p}
pymc3_beta_binomial_inference = ed.MFVI(pymc3_q, pymc3_data,
                                        pymc3_beta_binomial_model)
pymc3_beta_binomial_inference.run(n_iter=30000, n_print=None)
CPU times: user 29.3 s, sys: 5.54 s, total: 34.8 s Wall time: 22.4 s
Direct density calculations
vote_df.head()
| party | rep | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| rep | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | NaN | 1 | 1 | 1 | 0 | 1 |
| rep | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | NaN |
| rep | 2 | 1 | NaN | 1 | 1 | NaN | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 |
| rep | 3 | 1 | 0 | 1 | 1 | 0 | NaN | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| rep | 4 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | NaN | 1 | 1 | 1 | 1 |
Idea: each representative has an ideal point $\alpha_i$, each bill has a position $\beta_j$ and a discrimination $\gamma_j$, and votes are Bernoulli with a probability determined by how far the representative's ideal point lies from the bill's position.
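In equations (with the priors read off from the `normal_log_prior` and `hierarchical_log_prior` functions below):
$$ \begin{align*} \alpha_i & \sim N(0, 1) \\ \mu_\beta \sim N(0, 1),\ \sigma_\beta \sim U(0, 1),\ \beta_j & \sim N(\mu_\beta, \sigma_\beta^2) \\ \mu_\gamma \sim N(0, 1),\ \sigma_\gamma \sim U(0, 1),\ \gamma_j & \sim N(\mu_\gamma, \sigma_\gamma^2) \\ P(\textrm{vote}_{i, j} = 1\ |\ \alpha_i, \beta_j, \gamma_j) & = \operatorname{sigmoid}\left(\gamma_j (\alpha_i - \beta_j)\right) \end{align*} $$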
def normal_log_prior(value, loc=0., scale=1.):
    return tf.reduce_sum(normal.log_pdf(value, loc, scale))

def hierarchical_log_prior(value, mu, sigma):
    log_hyperprior = normal_log_prior(mu) + uniform.log_pdf(sigma, 0., 1.)

    return log_hyperprior + normal_log_prior(value, loc=mu, scale=sigma)
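Two helpers used below are not shown in these slides. A minimal sketch, assuming `long_from_indices` is a simple gather and `log_prior` combines a standard normal prior on the ideal points with the hierarchical priors on the bill parameters:

def long_from_indices(values, indices):
    # look up one value per observation from the per-representative/per-bill arrays (assumed helper)
    return tf.gather(values, indices)

def log_prior(alpha, beta, mu_beta, sigma_beta, gamma, mu_gamma, sigma_gamma):
    # assumed composition of the priors defined above
    return (normal_log_prior(alpha)
            + hierarchical_log_prior(beta, mu_beta, sigma_beta)
            + hierarchical_log_prior(gamma, mu_gamma, sigma_gamma))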
def log_like(vote, rep, bill, alpha, beta, gamma):
    # alpha has one entry per representative and beta/gamma one entry per bill;
    # index them so there is one entry per observed representative/bill combination
    alpha_long = long_from_indices(alpha, rep)
    beta_long = long_from_indices(beta, bill)
    gamma_long = long_from_indices(gamma, bill)

    p = tf.sigmoid(gamma_long * (alpha_long - beta_long))

    return tf.reduce_sum(bernoulli.logpmf(vote, p))
class IdealPoint:
    def log_prob(self, xs, zs):
        """Calculate the model's log joint distribution.

        xs is a dictionary of observed data;
        zs is a dictionary of parameters.
        """
        # parameters
        alpha, beta, gamma = zs['alpha'], zs['beta'], zs['gamma']
        mu_beta, mu_gamma = zs['mu_beta'], zs['mu_gamma']
        sigma_beta, sigma_gamma = zs['sigma_beta'], zs['sigma_gamma']

        # observed data
        vote, bill, rep = xs['vote'], xs['bill'], xs['rep']

        # log joint distribution
        log_prior_ = log_prior(alpha,
                               beta, mu_beta, sigma_beta,
                               gamma, mu_gamma, sigma_gamma)
        log_like_ = log_like(vote, rep, bill,
                             alpha, beta, gamma)

        return log_prior_ + log_like_
ideal_point_model = IdealPoint()
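Like `tf_positive_variable`, the `tf_normal` and `tf_uniform` factories used below are defined elsewhere in the notebook; a plausible sketch, assuming `tf_normal` builds an Edward Normal with trainable parameters of the given shape and `tf_uniform` builds a factor with support on the unit interval (a Beta is one reasonable choice):

from edward.models import Normal

def tf_normal(shape=()):
    # Normal variational factor with trainable location and positive scale (assumed helper)
    return Normal(mu=tf.Variable(tf.random_normal(shape)),
                  sigma=tf.nn.softplus(tf.Variable(tf.random_normal(shape))))

def tf_uniform():
    # factor with support on (0, 1) for the Uniform(0, 1)-prior scales (assumed helper)
    return Beta(a=tf_positive_variable(), b=tf_positive_variable())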
# variational distribution
q_alpha = tf_normal((n_reps,))
q_mu_beta = tf_normal()
q_sigma_beta = tf_uniform()
q_beta = tf_normal((N_BILLS,))
q_mu_gamma = tf_normal()
q_sigma_gamma = tf_uniform()
q_gamma = tf_normal((N_BILLS,))
q = {
    'alpha': q_alpha,
    'beta': q_beta, 'mu_beta': q_mu_beta, 'sigma_beta': q_sigma_beta,
    'gamma': q_gamma, 'mu_gamma': q_mu_gamma, 'sigma_gamma': q_sigma_gamma,
}
%%time
inference = ed.MFVI(q, ideal_point_data, ideal_point_model)
inference.run(n_iter=20000, n_print=None)
CPU times: user 1min 33s, sys: 30.2 s, total: 2min 4s Wall time: 52.8 s
Fun project: Recreate the Martin-Quinn dynamic ideal point model for ideology of U.S. Supreme Court justices with Edward
%%time
with beta_binomial_model:
    advi_fit = pm.advi(n=20000, random_seed=SEED)
Iteration 0 [0%]: ELBO = -10.91
Iteration 2000 [10%]: Average ELBO = -7.76
Iteration 4000 [20%]: Average ELBO = -7.37
Iteration 6000 [30%]: Average ELBO = -7.24
Iteration 8000 [40%]: Average ELBO = -7.22
Iteration 10000 [50%]: Average ELBO = -7.17
Iteration 12000 [60%]: Average ELBO = -7.15
Iteration 14000 [70%]: Average ELBO = -7.19
Iteration 16000 [80%]: Average ELBO = -7.18
Iteration 18000 [90%]: Average ELBO = -7.19
Finished [100%]: Average ELBO = -7.2
CPU times: user 3.12 s, sys: 0 ns, total: 3.12 s
Wall time: 3.1 s
Idea: a mixture of linear models, where the mixture weights depend on $x$ and the number of components is not known in advance
The model has infinitely many mixture components; for computation we truncate it to $K$ components
$$ \begin{align*} \alpha_1, \ldots, \alpha_K & \sim N(0, 1) \\ \beta_1, \ldots, \beta_K & \sim N(0, 1) \\ \boldsymbol{\pi}\ |\ \boldsymbol{\alpha}, \boldsymbol{\beta}, x & \sim \textrm{Logit-Stick-Breaking}(\boldsymbol{\alpha} + \boldsymbol{\beta} x) \end{align*} $$
K = 20
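The model below builds the weights with a `stick_breaking` helper that isn't shown in these slides; a minimal sketch of the logit stick-breaking construction it implements (each weight is its sigmoid value times the length of stick left over by the previous components):

import theano.tensor as tt

def stick_breaking(v):
    # pi_k = v_k * prod_{j < k} (1 - v_j), computed row-wise over the K components (assumed helper)
    return v * tt.concatenate([tt.ones_like(v[:, :1]),
                               tt.extra_ops.cumprod(1. - v, axis=1)[:, :-1]],
                              axis=1)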
x_lidar = shared(std_range, broadcastable=(False, True))

with pm.Model() as lidar_model:
    alpha = pm.Normal('alpha', 0., 1., shape=K)
    beta = pm.Normal('beta', 0., 1., shape=K)

    v = tt.nnet.sigmoid(alpha + beta * x_lidar)
    pi = pm.Deterministic('pi', stick_breaking(v))
with lidar_model:
    gamma = pm.Normal('gamma', 0., 100., shape=K)
    delta = pm.Normal('delta', 0., 100., shape=K)
    tau = pm.Gamma('tau', 1., 1., shape=K)

    ys = pm.Deterministic('ys', gamma + delta * x_lidar)
Applied log-transform to tau and added transformed tau_log_ to model.
with lidar_model:
    lidar_obs = NormalMixture('lidar_obs', pi, ys, tau,
                              observed=std_logratio)
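In equations, the observation model assembled above is (reading $\tau_k$ as a precision, following PyMC3's convention)
$$ y\ |\ x \sim \sum_{k=1}^{K} \pi_k(x)\ N\left(\gamma_k + \delta_k x,\ \tau_k^{-1}\right) $$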
%%time
N_ADVI_ITER = 50000
with lidar_model:
    advi_fit = pm.advi(n=N_ADVI_ITER, random_seed=SEED)
Iteration 0 [0%]: ELBO = -1173858.35
Iteration 5000 [10%]: Average ELBO = -1472209.86
Iteration 10000 [20%]: Average ELBO = -893902.97
Iteration 15000 [30%]: Average ELBO = -665586.63
Iteration 20000 [40%]: Average ELBO = -369517.75
Iteration 25000 [50%]: Average ELBO = 12058.54
Iteration 30000 [60%]: Average ELBO = 130100.63
Iteration 35000 [70%]: Average ELBO = 186668.28
Iteration 40000 [80%]: Average ELBO = 222983.67
Iteration 45000 [90%]: Average ELBO = 253507.05
Finished [100%]: Average ELBO = 266985.1
CPU times: user 43.7 s, sys: 0 ns, total: 43.7 s
Wall time: 43.7 s
PPC_SAMPLES = 5000
lidar_ppc_x = np.linspace(std_range.min() - 0.05,
                          std_range.max() + 0.05,
                          100)

with lidar_model:
    # sample from the variational posterior distribution
    advi_trace = pm.sample_vp(advi_fit, PPC_SAMPLES, random_seed=SEED)

    # sample from the posterior predictive distribution
    x_lidar.set_value(lidar_ppc_x[:, np.newaxis])
    advi_ppc_trace = pm.sample_ppc(advi_trace, PPC_SAMPLES, random_seed=SEED)
http://pymc-devs.github.io/pymc3/
PyMC3 port of Probabilistic Programming and Bayesian Methods for Hackers
Angelino, Elaine, Matthew James Johnson, and Ryan P. Adams. "Patterns of Scalable Bayesian Inference." arXiv preprint arXiv:1602.05221 (2016).
Blei, David M., Alp Kucukelbir, and Jon D. McAuliffe. "Variational inference: A review for statisticians." arXiv preprint arXiv:1601.00670 (2016).
Kucukelbir, Alp, et al. "Automatic Differentiation Variational Inference." arXiv preprint arXiv:1603.00788 (2016).
Ranganath, Rajesh, Sean Gerrish, and David M. Blei. "[Black Box Variational Inference](http://www.cs.columbia.edu/~blei/papers/RanganathGerrishBlei2014.pdf)." AISTATS. 2014.
Bafumi, Joseph, et al. "Practical issues in implementing and understanding Bayesian ideal point estimation." Political Analysis 13.2 (2005): 171-187.
Fox, Jean-Paul. Bayesian item response modeling: Theory and applications. Springer Science & Business Media, 2010.
Gelman, Andrew, et al. Bayesian data analysis. Vol. 2. Boca Raton, FL, USA: Chapman & Hall/CRC, 2014.
Gelman, Andrew, and Jennifer Hill. Data analysis using regression and multilevel/hierarchical models. Cambridge University Press, 2006.
Ren, Lu, et al. "Logistic stick-breaking process." Journal of Machine Learning Research 12.Jan (2011): 203-239.
The Jupyter notebook these slides were generated from is available here