Latent Variable Models and Variational Bayes

Preliminaries

Illustrative Example

  • You're now asked to build a density model for a data set (Old Faithful, Bishop pg. 681) that clearly is not distributed as a single Gaussian:

Unobserved Classes

Consider again a set of observed data $D=\{x_1,\dotsc,x_N\}$

  • This time we suspect that there are unobserved class labels that would help explain (or predict) the data, e.g.,
    • the observed data are the color of living things; the unobserved classes are animals and plants.
    • observed are wheel sizes; unobserved categories are trucks and personal cars.
    • observed is an audio signal; unobserved classes include speech, music, traffic noise, etc.
  • Classification problems with unobserved classes are called Clustering problems. The learning algorithm needs to discover the classes from the observed data.

The Gaussian Mixture Model

  • The spread of the data in the illustrative example looks like it could be modeled by two Gaussians. Let's develop a model for this data set.
  • Let $D=\{x_n\}$ be a set of observations. We associate a one-hot coded hidden class label $z_n$ with each observation:
$$\begin{equation*} z_{nk} = \begin{cases} 1 & \text{if } x_n \in \mathcal{C}_k \text{ (the k-th class)}\\ 0 & \text{otherwise} \end{cases} \end{equation*}$$
  • We consider the same model as we did in the generative classification lesson: $$\begin{align*} p(x_n | z_{nk}=1) &= \mathcal{N}\left( x_n | \mu_k, \Sigma_k\right)\\ p(z_{nk}=1) &= \pi_k \end{align*}$$ which can be summarized with the selection variables $z_{nk}$ as $$\begin{align*} p(x_n,z_n) &= p(x_n | z_n) p(z_n) = \prod_{k=1}^K (\underbrace{\pi_k \cdot \mathcal{N}\left( x_n | \mu_k, \Sigma_k\right) }_{p(x_n,z_{nk}=1)})^{z_{nk}} \end{align*}$$

  • Again, this is the same model as we defined in the generative classification lesson: a Gaussian Discriminant Analysis model, but now with unobserved classes.

  • This model (with unobserved class labels) is known as a Gaussian Mixture Model (GMM).

  • The marginal distribution for an observed data point $x_n$ is now given by
$$\begin{align*}{} p(x_n) &= \sum_{z_n} p(x_n,z_n) \\ &= \sum_{k=1}^K \pi_k \cdot \mathcal{N}\left( x_n | \mu_k, \Sigma_k \right) \tag{B-9.12} \end{align*}$$
  • (Eq. B-9.12 reveals the link to the name Gaussian mixture model).
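As a quick numerical illustration of Eq. B-9.12, the sketch below (in Julia, using the Distributions.jl package and arbitrary illustrative parameters rather than fitted values) builds a one-dimensional two-component mixture and checks that its pdf equals the weighted sum of the component Gaussians:

using Distributions

# Illustrative (not fitted) parameters for a 1-D mixture with K = 2 components
π_k = [0.3, 0.7]                 # mixing weights
μ_k = [1.0, 4.0]                 # component means
σ_k = [0.5, 1.0]                 # component standard deviations

# Marginal density p(x) = Σ_k π_k ⋅ N(x | μ_k, σ_k²), cf. Eq. B-9.12
p_manual(x) = sum(π_k[k] * pdf(Normal(μ_k[k], σ_k[k]), x) for k in 1:2)

# The same model built with Distributions.jl's MixtureModel
gmm = MixtureModel(Normal.(μ_k, σ_k), π_k)
p_manual(2.0) ≈ pdf(gmm, 2.0)    # true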

GMM is a very flexible model

  • GMMs are very popular models. They have decent computational properties and are universal approximators of densities (as long as there are enough Gaussians of course)

  • (In the above figure, the Gaussian components are shown in red and the pdf of the mixture model in blue).

Latent Variable Models

  • The GMM contains observed variables $\{x_n\}$, (unobserved) parameters $\theta= \{\pi_k,\mu_k, \Sigma_k\}$, and unobserved (synonyms: latent, hidden) variables $\{z_n\}$.
  • From a Bayesian viewpoint, both latent variables $\{z_{nk}\}$ and parameters $\theta$ are just unobserved variables for which we can set a prior and compute a posterior by Bayes rule.
  • Note that $z_{nk}$ has a subscript $n$, hence its value depends on the $n$-th observation (in contrast to the parameters). These observation-dependent variables can be useful to encode structure in the model. Here (in the GMM), the latent variables $\{z_{nk}\}$ encode (unobserved) class membership.
  • By adding model structure through (equations among) latent variables, we can build very complex models. Unfortunately, inference in latent variable models can also be much more complex than for fully observed models.

Inference for GMM is Difficult

  • Recall the log-likelihood for the generative classification model, in which the class labels $y_{nk}$ were observed:
$$ \log\, p(D|\theta) = \sum_{n,k} y_{nk} \underbrace{ \log\mathcal{N}(x_n|\mu_k,\Sigma) }_{ \text{Gaussian} } + \underbrace{ \sum_{n,k} y_{nk} \log \pi_k }_{ \text{multinomial} } \,. $$
  • Since the class labels $y_{nk}$ were given, this expression decomposed into a set of simple updates for the Gaussian and multinomial distributions. For both distributions, we have conjugate priors, so inference is easily solved.
  • However, for the Gaussian mixture model (same log-likelihood function with $z_{nk}$ replacing $y_{nk}$), the class labels $\{z_{nk}\}$ are unobserved and need to be estimated alongside the parameters.
  • There is no conjugate prior for the GMM likelihood function $p(\underbrace{D}_{\{x_n\}}\,|\,\underbrace{\{z_{nk}\},\{\mu_k,\Sigma_k,\pi_k\}}_{\text{latent variables}})$.
  • In this lesson, we introduce an approximate Bayesian inference method known as Variational Bayes (VB) (also known as Variational Inference) that can be used for Bayesian inference in models with latent variables. Later in this lesson, we will use VB to do inference in the GMM.
  • As a motivation for variational inference, we first discuss inference by the Method of Maximum Relative Entropy, (Caticha, 2010).

Inference When Information is in the Form of Constraints

  • We will now generalize the notion of Bayesian updating and consider learning as a process that updates a prior into a posterior distribution whenever new information becomes available.
  • In this context, new information is not necessarily a new observation, but could (for instance) also relate to constraints on the posterior distribution.
  • For example, consider a model $p(x_n,\theta) = p(x_n|\theta)p(\theta)$. At this point, our beliefs about $\theta$ are represented by $p(\theta)$. We might be interested in obtaining rational posterior beliefs $q(\theta)$ in consideration of the following constraints:
    1. Observations: two new data points $x_1=5$ and $x_2=3$.
    2. Constraint: we only consider the Gamma distribution for $q(\theta) = \mathrm{Gam}(\theta|\alpha,\beta)$.
    3. Constraint: the first moment of the posterior is given by $\int \theta q(\theta) \mathrm{d}\theta = 3$.
  • Note that this is not "just" applying Bayes rule, which would not be able to handle the constraints.
  • Note also that observations can be rephrased as constraints on the posterior, e.g., observation $x_1=5$ can be phrased as a constraint $q(x_1)=\delta(x_1-5)$.
    • $\Rightarrow$ Updating a prior to a posterior on the basis of constraints on the posterior is more general than updating based on observations alone.
  • Caticha (2010) developed the method of maximum (relative) entropy for rational updating of priors to posteriors when faced with new information in the form of constraints.

The Method of Maximum Relative Entropy

  • Consider prior beliefs $p(z)$ about $z$. New information in the form of constraints is obtained and we are interested in the "best update" to a posterior $q(z)$.
  • In order to define what "best update" means, we assume a ranking function $S[q,p]$ that generates a preference score for each candidate posterior $q$ for a given $p$. The best update from $p$ to $q$ is identified as $q^* = \arg\max_q S[q,p]$.
  • Similar to Cox's approach to deriving probability theory, Caticha introduced the following mild criteria, based on a rational principle (the principle of minimal updating, see Caticha 2010), that the ranking function needs to adhere to:
    1. Locality: local information has local effects.
    2. Coordinate invariance: the system of coordinates carries no information.
    3. Independence: When systems are known to be independent, it should not matter whether they are treated separately or jointly.
  • These three criteria uniquely identify the Relative Entropy as the proper ranking function: $$\begin{align*} S[q,p] = - \int q(z) \log \frac{q(z)}{p(z)} \mathrm{d}z \end{align*}$$
  • $\Rightarrow$ When information is supplied in the form of constraints on the posterior, we should select the posterior that maximizes the Relative Entropy. This is the Method of Maximum (Relative) Entropy (MRE).

Some notes on the MRE method

  • Note that the Relative Entropy is technically a functional, i.e., a function of function(s).
  • Note the relation of the Maximum Relative Entropy method to Probability Theory, which describes how to represent beliefs about events and how to calculate beliefs about joint and conditional events. In contrast, the MRE method describes how to update beliefs when new information becomes available. PT and the MRE method are both needed to describe the theory of optimal information processing.
  • In principle, entropies can always be considered as a relative score against a reference distribution. For instance, the score $-\sum_{z_i} q(z_i) \log q(z_i)$ can be interpreted as a score against the uniform distribution, i.e., $-\sum_{z_i} q(z_i) \log q(z_i) \propto -\sum_{z_i} q(z_i) \log \frac{q(z_i)}{\mathrm{Uniform}(z_i)}$. Therefore, the "method of maximum relative entropy" is often simply known as the "method of maximum entropy".
  • The negative relative entropy is known as the Kullback-Leibler divergence: $$ \mathrm{KL}[q,p] \triangleq \sum_z q(z) \log \frac{q(z)}{p(z)} \tag{B-1.113} $$
  • The Gibbs inequality (proof) is a famous theorem in information theory that states that $$\mathrm{KL}[q,p] \geq 0 $$ with equality if and only if $q=p$.
  • The KL divergence can be interpreted as a "distance" between two probability distributions.
    • Note however that the KL divergence is an asymmetric "distance measure", i.e. in general $\mathrm{KL}[q,p] \neq \mathrm{KL}[p,q]$.
  • In this class, we prefer to discuss inference in terms of minimizing KL divergence rather than maximizing relative entropy, but note that these two concepts are equivalent.
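As a small numerical illustration (with two arbitrary discrete distributions), the sketch below verifies the Gibbs inequality and the asymmetry of the KL divergence:

# KL divergence between two discrete distributions (arbitrary illustrative numbers)
kl(q, p) = sum(q .* log.(q ./ p))

q = [0.1, 0.4, 0.5]
p = [0.3, 0.3, 0.4]

kl(q, p)    # ≥ 0  (Gibbs inequality)
kl(p, q)    # ≥ 0, but in general ≠ kl(q, p)  (asymmetry)
kl(q, q)    # = 0  (equality iff q = p)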

Example KL divergence for Gaussians

(Figure source: Mundhenk at English Wikipedia, CC BY-SA 3.0.)

The Free Energy Functional and Variational Bayes

  • Let's get back to the issue of doing inference for models with latent variables (such as the GMM).
  • Consider a generative model specified by $p(x,z)$ where $x$ and $z$ represent the observed and unobserved variables, respectively.
  • Assume further that $x$ has been observed as $x=\hat{x}$ and we are interested in performing Bayesian inference, i.e., we want to compute the posterior $p(z|x=\hat{x})$ for the latent variables and the evidence $p(x=\hat{x})$.
  • According to the Method of Maximum Relative Entropy, we want to find the posterior $q(x,z)$ that minimizes
$$ \mathrm{KL}[q,p] = \sum_{x,z} q(x,z) \log \frac{q(x,z)}{p(x,z)} \quad \text{subject to }x=\hat{x} $$
  • The constraint $x=\hat{x}$ fixes $p(x) = \delta(x-\hat{x})$ and $q(x) = \delta(x-\hat{x})$, so the KL-divergence evaluates to
    $$\begin{align*} \mathrm{KL}[q,p] &= \sum_{x,z} q(z|x)q(x) \log \frac{q(z|x) q(x)}{p(z|x) p(x)} \\ &= \sum_{x,z} q(z|x)\delta(x-\hat{x}) \log \frac{q(z|x)\delta(x-\hat{x})}{p(z|x) \delta(x-\hat{x})} \\ &= \sum_{z} q(z|x=\hat{x}) \log \frac{q(z|x=\hat{x})}{p(z|x=\hat{x}) } \end{align*}$$
  • Generally, we constrain the posterior $q(z|x=\hat{x})$ to be a parametric function $q(z;\mu(\hat{x}))$, where the parameters $\mu(\hat{x})$ are functions of the observed data. As a result, in the context of a generative model $p(x,z)$, with $x$ observed and $z$ unobserved, the method of MRE selects a posterior $q(z;\mu(\hat{x}))$ that minimizes $$ \mathrm{KL}[q,p] = \sum_z q(z;\mu(\hat{x})) \log \frac{q(z;\mu(\hat{x}))}{p(z|x=\hat{x})} $$
  • (In the following, we write $q(z)$ short for $q(z;\mu(\hat{x}))$ and $p(z|x)$ short for $p(z|x=\hat{x})$.) Adding a term that is constant w.r.t. $z$ to the KL divergence does not affect the ranking of candidate posteriors $q(z)$. Hence, the posterior that we want also minimizes the so-called (variational) Free Energy (FE) functional $$ \mathrm{F}[q] = \underbrace{\sum_z q(z) \log \frac{q(z)}{p(z|x)}}_{\text{KL divergence}\geq 0} - \underbrace{\log p(x)}_{\text{(Bayesian) log-evidence}} \tag{B-10.2} $$
    • For brevity, we write $F[q]$ rather than $F[q,p]$ since in the current context we are only interested in minimizing $F$ with respect to $q$.
    • (As an aside, note that Bishop introduces in Eq. B-10.2 an Evidence Lower BOund (in modern literature abbreviated as ELBO) $\mathcal{L}(q)$ that equals the negative FE. We prefer to discuss variational inference in terms of a free energy, but it is the same story as Bishop tells with the ELBO.)
  • The log-evidence term for the FE does not depend on $q$. Hence, the global minimum $$q^* \triangleq \arg\min_q F[q]$$ is obtained for $q^*(z) = p(z|x)$, which is the Bayesian posterior.
  • Furthermore, the minimal free energy $F^* \triangleq F[q^*] = -\log p(x)$ equals the negative log-evidence.
  • $\Rightarrow$ It follows that Bayesian inference (computation of posterior and evidence) can be executed by FE minimization.
  • Executing inference by minimizing the FE functional is called Variational Bayes (VB) or variational inference. Note that VB transforms an inference problem (that involves integration) to an optimization problem! Generally optimization problems are easier to solve than integration problems.

Approximate Inference by Free Energy Minimization

  • The FE as specified above depends on the posterior $p(z|x)$ and evidence $p(x)$, neither of which is accessible, since they are the very objectives of Bayesian inference. Fortunately, we can rewrite the FE in computable terms.
  • Making use of $p(x,z) = p(z|x)p(x) = p(x|z)p(z)$, the FE functional can be rewritten as
$$\begin{align*} \mathrm{F}[q] &= \underbrace{\sum_z q(z) \log \frac{q(z)}{p(z|x)}}_{\text{KL divergence}\geq 0} - \underbrace{\log p(x)}_{\text{log-evidence}} \tag{DE}\\ &= \underbrace{-\sum_z q(z) \log p(x,z)}_{\text{energy}} - \underbrace{\sum_z q(z) \log \frac{1}{q(z)}}_{\text{entropy}} \tag{EE} \\ &= \underbrace{\sum_z q(z)\log\frac{q(z)}{p(z)}}_{\text{complexity}} - \underbrace{\sum_z q(z) \log p(x|z)}_{\text{accuracy}} \tag{CA} \end{align*}$$
  • These decompositions are very insightful (we will revisit them later), and we will label them respectively as the divergence-evidence (DE), energy-entropy (EE), and complexity-accuracy (CA) decompositions.
  • The CA decomposition makes use of $q(z)$, the prior $p(z)$ and likelihood $p(x|z)$, all of which are selected by the engineer, so the FE can be evaluated with this decomposition!
  • Often, we are unable to fully minimize the FE (to the global minimum), but rather get stuck in a local minimum.
  • In that case, (by the DE decomposition and the Gibbs inequality) the FE is an upper bound on $-\log p(x)$, the negative log-evidence.
  • Thus, FE minimization (variational Bayes) often leads to approximate Bayesian inference, with approximate posterior $$ \hat{q}(z) \approx \arg\min_q F[q] $$ and model performance is assessed by the free energy $F[\hat{q}] \ge -\log p(x)$.
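As a sanity check of these relations, the following toy sketch (made-up numbers for a small discrete model) evaluates the FE through the EE decomposition and verifies that the exact posterior attains the minimum, which equals the negative log-evidence:

# Toy check of the EE decomposition (made-up numbers): F[q] = energy − entropy is minimized
# by the exact posterior, and the minimum equals the negative log-evidence.
p_joint   = [0.10, 0.05, 0.20, 0.15]        # p(x = x̂, z) for four values of z
evidence  = sum(p_joint)                    # p(x = x̂)
posterior = p_joint ./ evidence             # p(z | x = x̂)

F(q) = -sum(q .* log.(p_joint)) + sum(q .* log.(q))   # energy − entropy

F(posterior) ≈ -log(evidence)               # true
F([0.25, 0.25, 0.25, 0.25]) > F(posterior)  # any other q yields a larger FE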

How to Solve the FE Minimization task?

  • Note that the FE is a functional, i.e., the FE is a function ($F$) of a function ($q$). We are looking for a function $q^*(z)$ that minimizes the FE.
  • The mathematics of minimizing functionals is described by variational calculus; see Bishop (2006), App. D, and Lanczos (1961). (Optional reading.)
  • Generally speaking, it is not possible to solve the variational FE minimization problem without some additional constraints on $q(z)$ (i.e., in addition to the data constraints).
  • Note that adding constraints to the functional form of $q(z)$ is fully compliant with the MRE method (and therefore also with the FE minimization framework).
  • There are three important cases of adding functional constraints on $q(z)$ that sometimes make inference possible:
1. mean-field factorization
  • We constrain the posterior to factorize into a set of independent factors, i.e., $$ q(z) = \prod_{j=1}^m q_j(z_j)\,, \tag{B-10.5} $$
2. fixed-form parameterization
  • We constrain the posterior to be part of a parameterized probability distribution, e.g., $$q(z) = \mathcal{N}\left( z | \mu, \Sigma \right)\,.$$
    • In this case, the functional minimization problem for $F[q]$ reduces to the minimization of a function $F(\mu,\Sigma)$ w.r.t. its parameters (a small sketch of this case follows after this list).
3. the Expectation-Maximization (EM) algorithm
  • We place some constraints both on the prior and posterior for $z$ (to be discussed below) that simplify FE minimization to maximum-likelihood estimation.
  • Next, we work out examples for the mean-field and EM constraints that make FE minimization solvable.
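Before those worked examples, here is a minimal sketch of case 2 (fixed-form parameterization). The unnormalized target distribution and all numbers below are made up for illustration; the point is only that constraining $q$ to a Gaussian turns the functional $F[q]$ into an ordinary function $F(\mu,\sigma)$, which can be minimized numerically (here by Monte Carlo estimation and a crude grid search):

using Distributions, Statistics

# Fixed-form sketch: q(z) = N(z | μ, σ²), so F[q] becomes a function F(μ, σ).
# Up to the unknown log-normalizer of the target, F(μ, σ) = E_q[log q(z) − log p̃(z)].
log_ptilde(z) = log(0.3 * pdf(Normal(-2.0, 1.0), z) + 0.7 * pdf(Normal(3.0, 1.5), z))

function F(μ, σ; S=10_000)
    q = Normal(μ, σ)
    z = rand(q, S)                                  # Monte Carlo samples from q
    return mean(logpdf.(q, z) .- log_ptilde.(z))    # ≈ KL[q, p] up to a constant
end

grid = [(μ, σ) for μ in -4.0:0.5:5.0, σ in 0.5:0.25:3.0]
μ_best, σ_best = grid[argmin([F(μ, σ) for (μ, σ) in grid])]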

FE Minimization by Mean-field Factorization

  • Given the mean-field constraint $q(z) = \prod_{j=1}^m q_j(z_j)$, it is possible to derive an expression for the optimal solutions $q_j^*(z_j)$, for $j=1,\ldots,m$, which is given by $$\begin{align} \log q_j^*(z_j) &\propto \mathrm{E}_{q_{-j}^*}\left[ \log p(x,z) \right] \tag{B-10.9}\\ &=\underbrace{\sum_{z_{-j}} q_{-j}^*(z_{-j}) \underbrace{\log p(x,z)}_{\text{"field"}}}_{\text{"mean field"}} \notag \end{align}$$ where we defined $q_{-j}^*(z_{-j}) \triangleq q_1^*(z_1)q_2^*(z_2)\cdots q_{j-1}^*(z_{j-1})q_{j+1}^*(z_{j+1})\cdots q_m^*(z_m)$.
  • Proof (from Blei, 2017): We first rewrite the energy-entropy decomposition of the FE as a function of $q_j(z_j)$ only: $$ F[q_j] = \mathbb{E}_{q_{j}}\left[ \mathbb{E}_{q_{-j}}\left[ \log p(x,z_j,z_{-j})\right]\right] - \mathbb{E}_{q_j}\left[ \log q_j(z_j)\right] + \mathtt{const.}\,,$$ where the constant holds all terms that do not depend on $z_j$. This expression can be written as $$ F[q_j] = \sum_{z_j} q_j(z_j) \log \frac{q_j(z_j)}{\exp\left( \mathbb{E}_{q_{-j}}\left[ \log p(x,z_j,z_{-j})\right]\right)}$$ which is a KL-divergence that is minimized by Eq. B-10.9.
  • This is not yet a full solution to the FE minimization task since the solution $q_j^*(z_j)$ depends on expectations that involve $q_{i\neq j}^*(z_{i \neq j})$, and each of the solutions $q_{i\neq j}^*(z_{i \neq j})$ depends on an expectation that involves $q_j^*(z_j)$.
  • In practice, we solve this chicken-and-egg problem by an iterative approach: we first initialize all $q_j(z_j)$ (for $j=1,\ldots,m$) to an appropriate initial distribution and then cycle through the factors in turn, solving Eq. B-10.9 for each factor and updating $q_{-j}^*(z_{-j})$ with the latest estimates. (See Blei, 2017, Algorithm 1, p864).
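As a minimal numerical illustration of this iterative scheme (made-up numbers; cf. the factorized approximation of a correlated bivariate Gaussian in Bishop, Section 10.1.2), Eq. B-10.9 yields Gaussian factors whose means satisfy coupled fixed-point equations that can be solved by cycling through the factors:

# Mean-field toy example: approximate a correlated bivariate Gaussian p(z) = N(z | μ, Λ⁻¹)
# by a factorized q(z1)q(z2). The factor means satisfy coupled fixed-point equations.
function mean_field_gaussian(μ, Λ; iters=20)
    m1, m2 = 0.0, 0.0                                  # initialize the factor means
    for _ in 1:iters
        m1 = μ[1] - (Λ[1,2] / Λ[1,1]) * (m2 - μ[2])    # update q1 given current q2
        m2 = μ[2] - (Λ[2,1] / Λ[2,2]) * (m1 - μ[1])    # update q2 given current q1
    end
    return m1, m2
end

mean_field_gaussian([1.0, 2.0], [2.0 0.8; 0.8 1.5])    # converges to (1.0, 2.0), the true mean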

Example: FEM for the Gaussian Mixture Model

model specification
  • We consider a Gaussian Mixture Model, specified by $$\begin{align*} p(x,z|\theta) &= p(x|z,\mu,\Lambda)p(z|\pi) \\ &= \prod_{n=1}^N \prod_{k=1}^K \pi_k^{z_{nk}}\cdot \mathcal{N}\left( x_n | \mu_k, \Lambda_k^{-1}\right)^{z_{nk}} \tag{B-10.37,38} \end{align*}$$ with tuning parameters $\theta=\{\pi_k, \mu_k,\Lambda_k\}$.
  • Let us introduce some priors for the parameters. We factorize the prior and choose conjugate distributions by $$ p(\theta) = p(\pi,\mu,\Lambda) = p(\pi) p(\mu|\Lambda) p(\Lambda) $$ with $$\begin{align*} p(\pi) &= \mathrm{Dir}(\pi|\alpha_0) = C(\alpha_0) \prod_k \pi_k^{\alpha_0-1} \tag{B-10.39}\\ p(\mu|\Lambda) &= \prod_k \mathcal{N}\left(\mu_k | m_0, \left( \beta_0 \Lambda_k\right)^{-1} \right) \tag{B-10.40}\\ p(\Lambda) &= \prod_k \mathcal{W}\left( \Lambda_k | W_0, \nu_0 \right) \tag{B-10.40} \end{align*}$$ where $\mathcal{W}\left( \cdot \right)$ is a Wishart distribution (i.e., a multi-dimensional Gamma distribution).
  • The full generative model is now specified by $$ p(x,z,\pi,\mu,\Lambda) = p(x|z,\mu,\Lambda) p(z|\pi) p(\pi) p(\mu|\Lambda) p(\Lambda) \tag{B-10.41} $$ with hyperparameters $\{ \alpha_0, m_0, \beta_0, W_0, \nu_0\}$.
inference
  • Assume that we have observed $D = \left\{x_1, x_2, \ldots, x_N\right\}$ and are interested to infer the posterior distribution for the tuning parameters: $$ p(\theta|D) = p(\pi,\mu,\Lambda|D) $$
  • The (perfect) Bayesian solution is $$ p(\theta|D) = \frac{p(x=D,\theta)}{p(x=D)} = \frac{\sum_z p(x=D,z,\pi,\mu,\Lambda)}{\sum_z \sum_{\pi} \iint p(x=D,z,\pi,\mu,\Lambda) \,\mathrm{d}\mu\mathrm{d}\Lambda} $$ but this is intractable (See Blei (2017), p861, eqs. 8 and 9).
  • Alternatively, we can use FE minimization with constraint $$\begin{equation} q(z,\theta) = q(z) \cdot q(\pi,\mu,\Lambda) \tag{B-10.42} \end{equation}$$ on the posterior. For the specified model, the optimal factors turn out to have the same functional forms as the priors, so FE minimization reduces to finding the optimal variational parameters $\{\alpha_k, \beta_k, m_k, W_k, \nu_k\}$ together with the responsibilities $r_{nk}$ (see below).
  • Bishop shows that the equations for the optimal solutions (Eq. B-10.9) are analytically solvable for the GMM as specified above, leading to (for $k=1,\ldots,K$): $$ \begin{align*} \alpha_k &= \alpha_0 + N_k \tag{B-10.58} \\ \beta_k &= \beta_0 + N_k \tag{B-10.60} \\ m_k &= \frac{1}{\beta_k} \left( \beta_0 m_0 + N_k \bar{x}_k \right) \tag{B-10.61} \\ W_k^{-1} &= W_0^{-1} + N_k S_k + \frac{\beta_0 N_k}{\beta_0 + N_k}\left( \bar{x}_k - m_0\right) \left( \bar{x}_k - m_0\right)^T \tag{B-10.62} \\ \nu_k &= \nu_0 + N_k \tag{B-10.63} \end{align*} $$ where we used $$ \begin{align*} \log \rho_{nk} &= \mathbb{E}\left[ \log \pi_k\right] + \frac{1}{2}\mathbb{E}\left[ \log | \Lambda_k | \right] - \frac{D}{2} \log(2\pi) \\ & \qquad - \frac{1}{2}\mathbb{E}\left[(x_n - \mu_k)^T \Lambda_k(x_n - \mu_k) \right] \tag{B-10.46} \\ r_{nk} &= \frac{\rho_{nk}}{\sum_{j=1}^K \rho_{nj}} \tag{B-10.49} \\ N_k &= \sum_{n=1}^N r_{nk} \tag{B-10.51} \\ \bar{x}_k &= \frac{1}{N_k} \sum_{n=1}^N r_{nk} x_n \tag{B-10.52} \\ S_k &= \frac{1}{N_k} \sum_{n=1}^N r_{nk} \left( x_n - \bar{x}_k\right) \left( x_n - \bar{x}_k\right)^T \tag{B-10.53} \end{align*} $$

Code Example FEM for GMM

  • Below we exemplify training of a Gaussian Mixture Model on the Old Faithful data set by Free Energy Minimization, using the constraints as specified above, e.g., (B-10.42).
In [1]:
using Pkg;Pkg.activate("probprog/workspace/");Pkg.instantiate()
Pkg.add("PDMats")
Pkg.add("SpecialFunctions")
IJulia.clear_output();
In [3]:
using DataFrames, CSV, LinearAlgebra, PDMats, SpecialFunctions
using Distributions # provides MvNormal and the Distribution type used below
include("scripts/gmm_plot.jl") # Holds plotting function 
old_faithful = CSV.read("datasets/old_faithful.csv");
X = convert(Matrix{Float64}, [old_faithful[!,1] old_faithful[!,2]]');#data matrix
N = size(X, 2) #number of observations
K = 6

function sufficientStatistics(X,r,k::Int) #function to compute sufficient statistics
    N_k = sum(r[k,:])
    hat_x_k = sum([r[k,n]*X[:,n] for n in 1:N]) ./ N_k
    S_k = sum([r[k,n]*(X[:,n]-hat_x_k)*(X[:,n]-hat_x_k)' for n in 1:N]) ./ N_k
    return N_k, hat_x_k, S_k
end

function updateMeanPrecisionPi(m_0,β_0,W_0,ν_0,α_0,r) #variational maximisation function
    m = Array{Float64}(undef,2,K) #mean of the clusters 
    β = Array{Float64}(undef,K) #precision scaling for Gausian distribution
    W = Array{Float64}(undef,2,2,K) #precision prior for Wishart distributions
    ν = Array{Float64}(undef,K) #degrees of freedom parameter for Wishart distribution
    α = Array{Float64}(undef,K) #Dirichlet distribution parameter 
    for k=1:K
        sst = sufficientStatistics(X,r,k)
        α[k] = α_0[k] + sst[1]; β[k] = β_0[k] + sst[1]; ν[k] = ν_0[k] .+ sst[1]
        m[:,k] = (1/β[k])*(β_0[k].*m_0[:,k] + sst[1].*sst[2])
        W[:,:,k] = inv(inv(W_0[:,:,k])+sst[3]*sst[1] + ((β_0[k]*sst[1])/(β_0[k]+sst[1])).*(sst[2]-m_0[:,k])*(sst[2]-m_0[:,k])')
    end
    return m,β,W,ν,α
end

function updateR(Λ,m,α,ν,β) #variational expectation function
    r = Array{Float64}(undef,K,N) #responsibilities 
    hat_π = Array{Float64}(undef,K) 
    hat_Λ = Array{Float64}(undef,K)
    for k=1:K
        hat_Λ[k] = 1/2*(2*log(2)+logdet(Λ[:,:,k])+digamma(ν[k]/2)+digamma((ν[k]-1)/2))
        hat_π[k] = exp(digamma(α[k])-digamma(sum(α)))
        for n=1:N
           r[k,n] = hat_π[k]*exp(-hat_Λ[k]-1/β[k] - (ν[k]/2)*(X[:,n]-m[:,k])'*Λ[:,:,k]*(X[:,n]-m[:,k]))
        end
    end
    for n=1:N
        r[:,n] = r[:,n]./ sum(r[:,n]) #normalize to ensure r represents probabilities 
    end
    return r
end

max_iter = 15
#store the inference results in these vectors
ν = fill!(Array{Float64}(undef,K,max_iter),3)
β = fill!(Array{Float64}(undef,K,max_iter),1.0)
α = fill!(Array{Float64}(undef,K,max_iter),0.01)
R = Array{Float64}(undef,K,N,max_iter)
M = Array{Float64}(undef,2,K,max_iter)
Λ = Array{Float64}(undef,2,2,K,max_iter)
clusters_vb = Array{Distribution}(undef,K,max_iter) #clusters to be plotted
#initialize prior distribution parameters
M[:,:,1] = [[3.0; 60.0];[4.0; 70.0];[2.0; 50.0];[2.0; 60.0];[3.0; 100.0];[4.0; 80.0]]
for k=1:K
    Λ[:,:,k,1] = [1.0 0;0 0.01]
    R[k,:,1] = 1/(K)*ones(N)
    clusters_vb[k,1] = MvNormal(M[:,k,1],PDMats.PDMat(convert(Matrix,Hermitian(inv(ν[1,1].*Λ[:,:,k,1])))))
end
#variational inference
for i=1:max_iter-1
    #variational expectation 
    R[:,:,i+1] = updateR(Λ[:,:,:,i],M[:,:,i],α[:,i],ν[:,i],β[:,i]) 
    #variational minimisation
    M[:,:,i+1],β[:,i+1],Λ[:,:,:,i+1],ν[:,i+1],α[:,i+1] = updateMeanPrecisionPi(M[:,:,i],β[:,i],Λ[:,:,:,i],ν[:,i],α[:,i],R[:,:,i+1])
    for k=1:K
        clusters_vb[k,i+1] = MvNormal(M[:,k,i+1],PDMats.PDMat(convert(Matrix,Hermitian(inv(ν[k,i+1].*Λ[:,:,k,i+1])))))
    end
end


subplot(2,3,1); plotGMM(X, clusters_vb[:,1], R[:,:,1]); title("Initial situation")
for i=2:6
    subplot(2,3,i)
    plotGMM(X, clusters_vb[:,i*2], R[:,:,i*2]); title("After $(i*2) iterations")
end
PyPlot.tight_layout();

The generated figure looks much like Figure 10.6 in Bishop. The plots show results for a variational Bayesian mixture of K = 6 Gaussians applied to the Old Faithful data set, in which the ellipses denote the one-standard-deviation density contours for each of the components, and the color coding of the data points reflects the "soft" class label assignments. Components whose expected mixing coefficients are numerically indistinguishable from zero are not plotted. Note that this procedure learns the number of classes (two), learns the class label for each observation, and learns the mean and covariance for the Gaussian data distributions.
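As a small follow-up check (assuming the variables α and max_iter from the code cell above are still in scope, and using the standard Dirichlet mean $\mathbb{E}[\pi_k] = \alpha_k / \sum_j \alpha_j$), you can inspect which components have effectively been switched off:

# Expected mixing coefficients under the Dirichlet posterior, E[π_k] = α_k / Σ_j α_j.
# Components whose α_k stayed (numerically) at its prior value have been switched off.
E_π = α[:, max_iter] ./ sum(α[:, max_iter])
round.(E_π, digits=3)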

FE Minimization by the Expectation-Maximization (EM) Algorithm

  • The EM algorithm is a special case of FE minimization that focuses on Maximum-Likelihood estimation for models with latent variables.
  • Consider a model $$p(x,z,\theta)$$ with observations $x = \{x_n\}$, latent variables $z=\{z_n\}$ and parameters $\theta$.
  • We can write the following FE functional for this model: $$\begin{align*} F[q] = \sum_z \sum_\theta q(z) q(\theta) \log \frac{q(z) q(\theta)}{p(x,z,\theta)} \end{align*}$$

  • The EM algorithm makes the following simplifying assumptions:

    1. The prior for the parameters is uninformative (uniform). This implies that $$p(x,z,\theta) = p(x,z|\theta) p(\theta) \propto p(x,z|\theta)$$
    2. The posterior for the parameters is a delta function: $$q(\theta) = \delta(\theta - \hat{\theta})$$
  • Basically, these two assumptions turn FE minimization into maximum likelihood estimation for the parameters $\theta$ and the FE simplifies to $$\begin{align*} F[q,\theta] = \sum_z q(z) \log \frac{q(z)}{p(x,z|\theta)} \end{align*}$$
  • The EM algorithm minimizes this FE by iterating (iteration counter: $i$) over
\begin{equation*} \boxed{ \begin{aligned} \mathcal{L}^{(i)}(\theta) &= \sum_z \overbrace{p(z|x,\theta^{(i-1)})}^{q^{(i)}(z)} \log p(x,z|\theta) \quad &&\text{the E-step} \\ \theta^{(i)} &= \arg\max_\theta \mathcal{L}^{(i)}(\theta) &&\text{the M-step} \end{aligned}} \end{equation*}
  • These choices are optimal for the given FE functional. In order to see this, consider the two decompositions $$\begin{align} F[q,\theta] &= \underbrace{-\sum_z q(z) \log p(x,z|\theta)}_{\text{energy}} - \underbrace{\sum_z q(z) \log \frac{1}{q(z)}}_{\text{entropy}} \tag{EE}\\ &= \underbrace{\sum_z q(z) \log \frac{q(z)}{p(z|x,\theta)}}_{\text{divergence}} - \underbrace{\log p(x|\theta)}_{\text{log-likelihood}} \tag{DE} \end{align}$$
  • The DE decomposition shows that the FE is minimized for the choice $q(z) := p(z|x,\theta)$. Also, for this choice, the FE equals the negative log-evidence (which in this case simplifies to the negative log-likelihood).
  • The EE decomposition shows that the FE is minimized wrt $\theta$ by minimizing the energy term. The energy term is computed in the E-step and optimized in the M-step.
    • (Note that in the EM literature, the energy term, up to a sign, is often called the expected complete-data log-likelihood.)
  • In order to execute the EM algorithm, it is assumed that we can analytically execute the E- and M-steps. For a large set of models (including models whose distributions belong to the exponential family of distributions), this is indeed the case and hence the large popularity of the EM algorithm.
  • The EM algorithm imposes rather severe assumptions on the FE (basically approximating Bayesian inference by maximum likelihood estimation). Over the past few years, the rise of Probabilistic Programming languages has dramatically increased the range of models for which the parameters can be estimated automatically by (approximate) Bayesian inference, so the popularity of EM is slowly waning. (More on this in the Probabilistic Programming lessons).
  • Bishop (2006) works out EM for the GMM in section 9.2.

CODE EXAMPLE: EM-algorithm for the GMM on the Old-Faithful data set

We'll perform clustering on the data set from the illustrative example by fitting a GMM consisting of two Gaussians using the EM algorithm.

In [1]:
using Pkg; Pkg.activate("probprog/workspace");Pkg.instantiate();
IJulia.clear_output();
In [2]:
using DataFrames, CSV, LinearAlgebra
using Distributions # provides MvNormal and pdf used below
include("scripts/gmm_plot.jl") # Holds plotting function 
old_faithful = CSV.read("datasets/old_faithful.csv");

X =  Array(Matrix{Float64}(old_faithful)')
N = size(X, 2)

# Initialize the GMM. We assume 2 clusters.
clusters = [MvNormal([4.;60.], [.5 0;0 10^2]); 
            MvNormal([2.;80.], [.5 0;0 10^2])];
π_hat = [0.5; 0.5]                    # Mixing weights
γ = fill!(Matrix{Float64}(undef,2,N), NaN)  # Responsibilities (row per cluster)

# Define functions for updating the parameters and responsibilities
function updateResponsibilities!(X, clusters, π_hat, γ)
    # Expectation step: update γ
    norm = [pdf(clusters[1], X) pdf(clusters[2], X)] * π_hat
    γ[1,:] = (π_hat[1] * pdf(clusters[1],X) ./ norm)'
    γ[2,:] = 1 .- γ[1,:]
end
function updateParameters!(X, clusters, π_hat, γ)
    # Maximization step: update π_hat and clusters using ML estimation
    m = sum(γ, dims=2)
    π_hat .= vec(m) ./ N    # in-place update so the caller sees the new mixing weights
    μ_hat = (X * γ') ./ m'
    for k=1:2
        Z = (X .- μ_hat[:,k])
        Σ_k = Symmetric(((Z .* (γ[k,:])') * Z') / m[k])
        clusters[k] = MvNormal(μ_hat[:,k], convert(Matrix, Σ_k))
    end
end

# Execute the algorithm: iteratively update parameters and responsibilities
subplot(2,3,1); plotGMM(X, clusters, γ); title("Initial situation")
updateResponsibilities!(X, clusters, π_hat, γ)
subplot(2,3,2); plotGMM(X, clusters, γ); title("After first E-step")
updateParameters!(X, clusters, π_hat, γ)
subplot(2,3,3); plotGMM(X, clusters, γ); title("After first M-step")
iter_counter = 1
for i=1:3
    for j=1:i+1
        updateResponsibilities!(X, clusters, π_hat, γ)
        updateParameters!(X, clusters, π_hat, γ)
        iter_counter += 1
    end
    subplot(2,3,3+i); 
    plotGMM(X, clusters, γ); 
    title("After $(iter_counter) iterations")
end
PyPlot.tight_layout()

Note that you can step through the interactive demo yourself by running this script in julia. You can run a script in julia by
julia> include("path/to/script-name.jl")

Summary

  • Latent variable models (LVM) contain a set of unobserved variables whose size grows with the number of observations.
  • LVMs can model more complex phenomena than fully observed models, but inference in LVMs is usually not analytically solvable.
  • The Free Energy (FE) functional transforms Bayesian inference computations (very large summations or integrals) to an optimization problem.
  • Inference by minimizing FE, also known as variational inference, is fully consistent with the "Method of Maximum Relative Entropy", which is by design the rational way to update beliefs from priors to posteriors when new information becomes available. Thus, FE minimization is a "correct" inference procedure that generalizes Bayes rule.
  • In general, global FE minimization is an unsolved problem and finding good local minima is at the heart of current Bayesian technology research. Three simplifying constraints on the posterior $q(z)$ in the FE functional are currently popular in practical settings:
    • mean-field assumptions
    • assuming a parameterized PDF for $q$
    • EM algorithm
  • These constraints often make FE minimization implementable, at the price of obtaining only approximate Bayesian inference results.
  • Also, while FE minimization with appropriate constraints may be implementable, it's definitely very complicated to derive the update equations manually. Hence there is currently a large research effort devoted to automating variational inference (see also the Probabilistic Programming lessons later in this course).