Variational Inference in Python¶

PyData DC 2016¶

October 8, 2016¶

@AustinRochford — Monetate Labs ¶

arochford@monetate.com ¶

Bayesian Inference¶

The posterior distribution is equal to the joint distribution divided by the marginal distribution of the evidence.

$$ \color{red}{P(\theta\ |\ \mathcal{D})} = \frac{\color{blue}{P(\mathcal{D}\ |\ \theta)\ P(\theta)}}{\color{green}{P(\mathcal{D})}} = \frac{\color{blue}{P(\mathcal{D}, \theta)}}{\color{green}{\int P(\mathcal{D}\ |\ \theta)\ P(\theta)\ d\theta}} $$

For many useful models the marginal distribution of the evidence is hard or impossible to calculate analytically.

Modes of Bayesian Inference¶

Conjugate models with closed-form posteriors
Markov chain Monte Carlo algorithms
Approximate Bayesian computation
Distributional approximations
- Laplace approximations, INLA
- Variational inference

Markov Chain Monte Carlo Algorithms¶

Construct a Markov chain whose stationary distribution is the posterior distribution
Sample from the Markov chain for a long time
Approximate posterior quantities using the empirical distribution of the samples

In [12]:

HTML(mcmc_video)

Out[12]:

Beta-Binomial Model¶

We observe three successes in ten trials, and want to infer the true success probability.

In [13]:

x_beta_binomial = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])

$$p \sim U(0, 1)$$

In [14]:

import pymc3 as pm

with pm.Model() as beta_binomial_model:
    p_beta_binomial = pm.Uniform('p', 0., 1.)

Applied interval-transform to p and added transformed p_interval_ to model.

$$ \begin{align*} P(X_i = 1\ |\ p) & = p \end{align*} $$

In [15]:

with beta_binomial_model:
    x_obs = pm.Bernoulli('y', p_beta_binomial,
                         observed=x_beta_binomial)

$$p\ \left|\ \sum_{i = 1}^{10} X_i \right. = 3 \sim \textrm{Beta}(4, 8)$$

In [17]:

fig

Out[17]:

In [19]:

with beta_binomial_model:
    beta_binomial_trace_ = pm.sample(BETA_BINOMIAL_SAMPLES, random_seed=SEED)

beta_binomial_trace = beta_binomial_trace_[BETA_BINOMIAL_BURN::BETA_BINOMIAL_THIN]

Assigned NUTS to p_interval_
 [-------100%-------] 50000 of 50000 in 13.1 sec. | SPS: 3829.8 | ETA: 0.0

In [21]:

fig

Out[21]:

Pros¶

Asymptotically unbiased: converges to the true posterior afer many samples
Model-agnostic algorithms
Well-studied for more than 60 years

Cons¶

Can take a long time to converge
- Can be difficult to assess convergence
Difficult to scale

Variational Inference¶

Choose a class of approximating distributions
Find the best approximation to the true posterior

Variational inference minimizes the Kullback-Leibler divergence

$$\mathbb{KL}(\color{purple}{q(\theta)} \parallel \color{red}{p(\theta\ |\ \mathcal{D})}) = \mathbb{E}_q\left(\log\left(\frac{\color{purple}{q(\theta)}}{\color{red}{p(\theta\ |\ \mathcal{D})}}\right)\right)$$

from approximate distributons, but we can't calculate the true posterior distribution.

Minimizing the Kullback-Leibler divergence

$$ \mathbb{KL}(\color{purple}{q(\theta)} \parallel \color{red}{p(\theta\ |\ \mathcal{D})}) = -(\underbrace{\mathbb{E}_q(\log \color{blue}{p(\mathcal{D}, \theta))} - \mathbb{E}_q(\color{purple}{\log q(\theta)})}_{\color{orange}{\textrm{ELBO}}}) + \log \color{green}{p(\mathcal{D})} $$

is equivalent to maximizing the Evidence Lower BOund (ELBO), which only requires calculating the joint distribution.

Variational Inference Example¶

In [24]:

fig

Out[24]:

Approximate the true distribution using a diagonal covariance Gaussian from the class

$$\mathcal{Q} = \left\{\left.N\left(\begin{pmatrix} \mu_x \\ \mu_y \end{pmatrix}, \begin{pmatrix} \sigma_x^2 & 0 \\ 0 & \sigma_y^2\end{pmatrix}\ \right|\ \mu_x, \mu_y \in \mathbb{R}^2, \sigma_x, \sigma_y > 0\right)\right\}$$

In [26]:

fig

Out[26]:

Pros¶

A principled method to trade complexity for bias
Optimization theory is applicable
- Assesment of convergence
- Scalability

Cons¶

Biased estimate of the true posterior
- Better for prediction than interpretation
Model-specific algorithms

Mean field variational inference¶

Assume the variational distribution factors independently as $q(\theta_1, \ldots, \theta_n) = q(\theta_1) \cdots q(\theta_n)$

The variational approximation can be found by coordinate ascent

$$ \begin{align*} q(\theta_i) & \propto \exp\left(\mathbb{E}_{q_{-i}}(\log(\mathcal{D}, \boldsymbol{\theta}))\right) \\ q_{-i}(\boldsymbol{\theta}) & = q(\theta_1) \cdots q(\theta_{i - 1})\ q(\theta_{i + 1}) \cdots q(\theta_n) \end{align*} $$

Coordinate Ascent Cons¶

Calculations are tedious, even when possible
Convergence is slow when the number of parameters is large

Automating Variational Inference in Python¶

Maximize ELBO using gradient ascent instead of coordinate ascent
Tensor libraries calculate ELBO gradients automatically

Python Package	Tensor Library	Variational Inference Algorithm(s)
Edward	TensorFlow	Black Box Variational Inference (BBVI)
PyMC3	Theano	Automatic Differentiation Variational Inference (ADVI)

Common themes¶

Monte Carlo estimate of the ELBO gradient
Minibatch estimates of the joint distribution

BBVI and ADVI arise from different ways of calculating the ELBO gradient

Variational Inference with Edward¶

Black Box Variational Inference (BBVI)¶

Model-agnostic
- Requires the ability to compute the joint distribution
- Required the ability to differentiate the variational distribution

Beta-binomial model¶

In [27]:

import edward as ed
from edward.models import Bernoulli, Beta, Uniform

ed.set_seed(SEED)

# probability model
p = Uniform(a=0., b=1.)
x_edward_beta_binomial = Bernoulli(p=p)

data = {x_edward_beta_binomial: x_beta_binomial}

In [29]:

# variational approximation
q_p = Beta(a=tf_positive_variable(),
           b=tf_positive_variable())
q = {p: q_p}

In [30]:

%%time
beta_binomial_inference = ed.MFVI(q, data)
beta_binomial_inference.run(n_iter=10000, n_print=None)

CPU times: user 5.83 s, sys: 880 ms, total: 6.71 s
Wall time: 4.63 s

In [32]:

fig

Out[32]:

In [33]:

from edward.models import PyMC3Model

# probability model
x_beta_binomial_obs = shared(np.zeros(1))

with pm.Model() as beta_binomial_model_untransformed:
    p = pm.Uniform('p', 0., 1., transform=None)
    x_beta_binomial_ = pm.Bernoulli('x', p, observed=x_beta_binomial_obs)
    
pymc3_data = {x_beta_binomial_obs: x_beta_binomial}
pymc3_beta_binomial_model = PyMC3Model(beta_binomial_model_untransformed)

In [34]:

%%time
# variational distribution
pymc3_q_p = Beta(a=tf_positive_variable(),
                 b=tf_positive_variable())
pymc3_q = {'p': pymc3_q_p}

pymc3_beta_binomial_inference = ed.MFVI(pymc3_q, pymc3_data,
                                        pymc3_beta_binomial_model)
pymc3_beta_binomial_inference.run(n_iter=30000, n_print=None)

CPU times: user 29.3 s, sys: 5.54 s, total: 34.8 s
Wall time: 22.4 s

In [36]:

fig

Out[36]:

Edward-supported modeling "languages"¶

Joint Distribution	Variational Distribution
TensorFlow Distributions	TensorFlow
PyMC3
Stan
Direction density calculations TensorFlow Numpy Pure Python

Congressional Ideal Points¶

In [40]:

vote_df.head()

Out[40]:

	party	0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15
rep
0	0	0	1	0	1	1	1	0	0	0	1	NaN	1	1	1	0	1
1	0	0	1	0	1	1	1	0	0	0	0	0	1	1	1	0	NaN
2	1	NaN	1	1	NaN	1	1	0	0	0	0	1	0	1	1	0	0
3	1	0	1	1	0	NaN	1	0	0	0	0	1	0	1	0	0	1
4	1	1	1	1	0	1	1	0	0	0	0	1	NaN	1	1	1	1

In [43]:

grid.fig

Out[43]:

Idea:

Representatives ($\color{blue}{\alpha_i}$) and bills ($\color{green}{\beta_j}$) each have an ideal point on a spectrum of conservativity
- If a representative's and bill's ideal points are the same, the representative has a 50% chance of voting for the bill

Bills also have an ability to discriminate ($\color{red}{\gamma_j}$) between conservative and liberal representatives
- Some bills have broad bipartisan support, while some provoke votes along party lines

Representative ideal points¶

$$ \begin{align*} \color{blue}{\alpha_1}, \ldots, \color{blue}{\alpha_N} & \sim N(0, 1) \end{align*} $$

In [44]:

def normal_log_prior(value, loc=0., scale=1.):
    return tf.reduce_sum(normal.log_pdf(value, loc, scale))

Bill ideal points and discriminative abilities¶

$$ \begin{align*} \mu_{\beta} & \sim N(0, 1) \\ \sigma_{\beta} & \sim U(0, 1) \\ \color{green}{\beta_1}, \ldots, \color{green}{\beta_K} & \sim N(\mu_{\beta}, \sigma_{\beta}^2) \end{align*} $$
$$ \begin{align*} \mu_{\gamma} & \sim N(0, 1) \\ \sigma_{\gamma} & \sim U(0, 1) \\ \color{red}{\gamma_1}, \ldots, \color{red}{\gamma_K} & \sim N(\mu_{\gamma}, \sigma_{\gamma}^2) \end{align*} $$

In [45]:

def hierarchical_log_prior(value, mu, sigma):
    log_hyperprior = normal_log_prior(mu) + uniform.log_pdf(sigma, 0., 1.)

    return log_hyperprior + normal_log_prior(value, loc=mu, scale=sigma)

Observation model¶

$$ \begin{align*} \mathbb{P}(\textrm{Representative }i \textrm{ votes for bill }j\ |\ \color{blue}{\alpha_i}, \color{green}{\beta_j}, \color{red}{\gamma_j}) & = \frac{1}{1 + \exp(-\color{red}{\gamma_j} \left(\color{blue}{\alpha_i} - \color{green}{\beta_j}\right))} \end{align*} $$

In [48]:

def log_like(vote, rep, bill, alpha, beta, gamma):
    # alpha, beta, and gamma have one entry for representative/bill,
    # index them to have one entry per representative/bill combination
    alpha_long = long_from_indices(alpha, rep)
    beta_long = long_from_indices(beta, bill)
    gamma_long = long_from_indices(gamma, bill)

    p = tf.sigmoid(gamma_long * (alpha_long - beta_long))

    return tf.reduce_sum(bernoulli.logpmf(vote, p))

Edward model¶

In [49]:

class IdealPoint:
    def log_prob(self, xs, zs):
        """
        xs is a dictionary of observed data
        zs is a dictionary of parameters
        
        Calculates the model's log joint distribution
        """
        # parameters
        alpha, beta, gamma = zs['alpha'], zs['beta'], zs['gamma']
        mu_beta, mu_gamma = zs['mu_beta'], zs['mu_gamma']
        sigma_beta, sigma_gamma = zs['sigma_beta'], zs['sigma_gamma']
        
        # observed data
        vote, bill, rep = xs['vote'], xs['bill'], xs['rep']

        # log joint distribution
        log_prior_ = log_prior(alpha,
                               beta, mu_beta, sigma_beta,
                               gamma, mu_gamma, sigma_gamma)
        log_like_ = log_like(vote, rep, bill,
                             alpha, beta, gamma)
        
        return log_prior_ + log_like_

ideal_point_model = IdealPoint()

BBVI inference¶

In [52]:

# variational distribution
q_alpha = tf_normal((n_reps,))

q_mu_beta = tf_normal()
q_sigma_beta = tf_uniform()
q_beta = tf_normal((N_BILLS,))

q_mu_gamma = tf_normal()
q_sigma_gamma = tf_uniform()
q_gamma = tf_normal((N_BILLS,))

q = {
    'alpha': q_alpha,
    'beta': q_beta, 'mu_beta': q_mu_beta, 'sigma_beta': q_sigma_beta,
    'gamma': q_gamma, 'mu_gamma': q_mu_gamma, 'sigma_gamma': q_sigma_gamma,
}

In [53]:

%%time
inference = ed.MFVI(q, ideal_point_data, ideal_point_model)
inference.run(n_iter=20000, n_print=None)

CPU times: user 1min 33s, sys: 30.2 s, total: 2min 4s
Wall time: 52.8 s

In [56]:

fig

Out[56]:

Fun project: Recreate the Martin-Quinn dynamic ideal point model for ideology of U.S. Supreme Court justices with Edward

Variational Inference with PyMC3¶

Automatic Differentiation Variational Inference (ADVI)¶

Only applicable to differentiable probability models
Transform constrained parameters to be unconstrained
Approximate the posterior for unconstrained parameters with mean field Gaussian

Beta-binomial model¶

Transformed distributions¶

In [58]:

fig

Out[58]:

In [59]:

%%time
with beta_binomial_model:
    advi_fit = pm.advi(n=20000, random_seed=SEED)

Iteration 0 [0%]: ELBO = -10.91
Iteration 2000 [10%]: Average ELBO = -7.76
Iteration 4000 [20%]: Average ELBO = -7.37
Iteration 6000 [30%]: Average ELBO = -7.24
Iteration 8000 [40%]: Average ELBO = -7.22
Iteration 10000 [50%]: Average ELBO = -7.17
Iteration 12000 [60%]: Average ELBO = -7.15
Iteration 14000 [70%]: Average ELBO = -7.19
Iteration 16000 [80%]: Average ELBO = -7.18
Iteration 18000 [90%]: Average ELBO = -7.19
Finished [100%]: Average ELBO = -7.2
CPU times: user 3.12 s, sys: 0 ns, total: 3.12 s
Wall time: 3.1 s

ADVI approxiation to transformed posterior¶

In [62]:

fig

Out[62]:

ADVI approximation to posterior¶

In [64]:

fig

Out[64]:

Dependent Density Regression¶

In [68]:

fig

Out[68]:

Idea: A mixture of linear models, where the unknown number of mixture weights depend on $x$

In [70]:

fig

Out[70]:

Mixture weights¶

The model has infinitely many mixture components, we truncate to $K$

$$ \begin{align*} \alpha_1, \ldots, \alpha_K & \sim N(0, 1) \\ \beta_1, \ldots, \beta_K & \sim N(0, 1) \\ \boldsymbol{\pi} \ |\ \boldsymbol{\alpha}, \boldsymbol{\beta}, x & \sim \textrm{Logit-Stick-Breaking}(\boldsymbol{\alpha} + \boldsymbol{\beta} x) \\ \end{align*} $$

In [73]:

K = 20

x_lidar = shared(std_range, broadcastable=(False, True))

with pm.Model() as lidar_model:
    alpha = pm.Normal('alpha', 0., 1., shape=K)
    beta = pm.Normal('beta', 0., 1., shape=K)
    
    v = tt.nnet.sigmoid(alpha + beta * x_lidar)
    pi = pm.Deterministic('pi', stick_breaking(v))

Component linear models¶

$$ \begin{align*} \gamma_1, \ldots, \gamma_K & \sim N(0, 100) \\ \delta_1, \ldots, \delta_K & \sim N(0, 100) \\ \tau_1, \ldots, \tau_K & \sim \textrm{Gamma}(1, 1) \\ Y_i\ |\ \gamma_i, \delta_i, \tau_i, x & \sim N(\gamma_i + \delta_i x, \tau_i^{-1}) \end{align*} $$

In [74]:

with lidar_model:
    gamma = pm.Normal('gamma', 0., 100., shape=K)
    delta = pm.Normal('delta', 0., 100., shape=K)
    tau = pm.Gamma('tau', 1., 1., shape=K)

    ys = pm.Deterministic('ys', gamma + delta * x_lidar)

Applied log-transform to tau and added transformed tau_log_ to model.

Observation model¶

$$ \begin{align*} Y\ |\ \boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\gamma}, \boldsymbol{\delta}, \boldsymbol{\tau}, x & = \sum_{i = 1}^K \pi_i Y_i \\ \end{align*} $$

In [76]:

with lidar_model:
    lidar_obs = NormalMixture('lidar_obs', pi, ys, tau,
                              observed=std_logratio)

ADVI inference¶

In [77]:

%%time
N_ADVI_ITER = 50000

with lidar_model:
    advi_fit = pm.advi(n=N_ADVI_ITER, random_seed=SEED)

Iteration 0 [0%]: ELBO = -1173858.35
Iteration 5000 [10%]: Average ELBO = -1472209.86
Iteration 10000 [20%]: Average ELBO = -893902.97
Iteration 15000 [30%]: Average ELBO = -665586.63
Iteration 20000 [40%]: Average ELBO = -369517.75
Iteration 25000 [50%]: Average ELBO = 12058.54
Iteration 30000 [60%]: Average ELBO = 130100.63
Iteration 35000 [70%]: Average ELBO = 186668.28
Iteration 40000 [80%]: Average ELBO = 222983.67
Iteration 45000 [90%]: Average ELBO = 253507.05
Finished [100%]: Average ELBO = 266985.1
CPU times: user 43.7 s, sys: 0 ns, total: 43.7 s
Wall time: 43.7 s

In [79]:

fig

Out[79]:

Posterior predictions¶

In [80]:

PPC_SAMPLES = 5000

lidar_ppc_x = np.linspace(std_range.min() - 0.05,
                          std_range.max() + 0.05,
                          100)

with lidar_model:
    # sample from the variational posterior distribution
    advi_trace = pm.sample_vp(advi_fit, PPC_SAMPLES, random_seed=SEED)

    # sample from the posterior predictive distribution
    x_lidar.set_value(lidar_ppc_x[:, np.newaxis])
    advi_ppc_trace = pm.sample_ppc(advi_trace, PPC_SAMPLES, random_seed=SEED)

Impact of truncation¶

In [82]:

fig

Out[82]:

In [84]:

fig

Out[84]:

References¶

Edward¶

edwardlib.org

Examples

PyMC3¶

http://pymc-devs.github.io/pymc3/

Examples

PyMC3 port of Probabilistic Programming and Bayesian Methods for Hackers

Variational Inference¶

Angelino, Elaine, Matthew James Johnson, and Ryan P. Adams. "Patterns of Scalable Bayesian Inference." arXiv preprint arXiv:1602.05221 (2016).

Blei, David M., Alp Kucukelbir, and Jon D. McAuliffe. "Variational inference: A review for statisticians." arXiv preprint arXiv:1601.00670 (2016).

Kucukelbir, Alp, et al. "Automatic Differentiation Variational Inference." arXiv preprint arXiv:1603.00788 (2016).

Ranganath, Rajesh, Sean Gerrish, and David M. Blei. "[Black Box Variational Inference]((http://www.cs.columbia.edu/~blei/papers/RanganathGerrishBlei2014.pdf)." AISTATS. 2014.

Models¶

Bafumi, Joseph, et al. "Practical issues in implementing and understanding Bayesian ideal point estimation." Political Analysis 13.2 (2005): 171-187.

Fox, Jean-Paul. Bayesian item response modeling: Theory and applications. Springer Science & Business Media, 2010.

Gelman, Andrew, et al. Bayesian data analysis. Vol. 2. Boca Raton, FL, USA: Chapman & Hall/CRC, 2014.

Gelman, Andrew, and Jennifer Hill. Data analysis using regression and multilevel/hierarchical models. Cambridge University Press, 2006.

Ren, Lu, et al. "Logistic stick-breaking process." Journal of Machine Learning Research 12.Jan (2011): 203-239.

Thank You!¶

@AustinRochford — Monetate Labs ¶

arochford@monetate.com ¶

The Jupyter notebook these slides were generated from is available here

	party	0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15
rep
0	0	0	1	0	1	1	1	0	0	0	1	NaN	1	1	1	0	1
1	0	0	1	0	1	1	1	0	0	0	0	0	1	1	1	0	NaN
2	1	NaN	1	1	NaN	1	1	0	0	0	0	1	0	1	1	0	0
3	1	0	1	1	0	NaN	1	0	0	0	0	1	0	1	0	0	1
4	1	1	1	1	0	1	1	0	0	0	0	1	NaN	1	1	1	1

	party	0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15
rep
0	0	0	1	0	1	1	1	0	0	0	1	NaN	1	1	1	0	1
1	0	0	1	0	1	1	1	0	0	0	0	0	1	1	1	0	NaN
2	1	NaN	1	1	NaN	1	1	0	0	0	0	1	0	1	1	0	0
3	1	0	1	1	0	NaN	1	0	0	0	0	1	0	1	0	0	1
4	1	1	1	1	0	1	1	0	0	0	0	1	NaN	1	1	1	1

Variational Inference in Python¶

PyData DC 2016¶

October 8, 2016¶

@AustinRochford — Monetate Labs¶

arochford@monetate.com¶

Bayesian Inference¶

Modes of Bayesian Inference¶

Markov Chain Monte Carlo Algorithms¶

Beta-Binomial Model¶

Pros¶

Cons¶

Variational Inference¶

Variational Inference Example¶

Pros¶

Cons¶

Mean field variational inference¶

Coordinate Ascent Cons¶

Automating Variational Inference in Python¶

Common themes¶

Variational Inference with Edward¶

Black Box Variational Inference (BBVI)¶

Beta-binomial model¶

Edward-supported modeling "languages"¶

Congressional Ideal Points¶

Representative ideal points¶

Bill ideal points and discriminative abilities¶

Observation model¶

Edward model¶

BBVI inference¶

Variational Inference with PyMC3¶

Automatic Differentiation Variational Inference (ADVI)¶

Beta-binomial model¶

Transformed distributions¶

ADVI approxiation to transformed posterior¶

ADVI approximation to posterior¶

Dependent Density Regression¶

Mixture weights¶

Component linear models¶

Observation model¶

ADVI inference¶

Posterior predictions¶

Impact of truncation¶

References¶

Edward¶

PyMC3¶

Variational Inference¶

Models¶

Thank You!¶

@AustinRochford — Monetate Labs¶

arochford@monetate.com¶

@AustinRochford — Monetate Labs ¶

arochford@monetate.com ¶

@AustinRochford — Monetate Labs ¶

arochford@monetate.com ¶

	party	0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15
rep
0	0	0	1	0	1	1	1	0	0	0	1	NaN	1	1	1	0	1
1	0	0	1	0	1	1	1	0	0	0	0	0	1	1	1	0	NaN
2	1	NaN	1	1	NaN	1	1	0	0	0	0	1	0	1	1	0	0
3	1	0	1	1	0	NaN	1	0	0	0	0	1	0	1	0	0	1
4	1	1	1	1	0	1	1	0	0	0	0	1	NaN	1	1	1	1