The General EM Algorithm


The Kullback-Leibler Divergence

  • In order to prove that the EM algorithm works, we will need Gibbs inequality, which is a famous theorem in information theory.

  • Definition: the Kullback-Leibler divergence (a.k.a. relative entropy) is a distance measure between two distributions $q$ and $p$ and is defined as

$$ D_{\text{KL}}(q \parallel p) \triangleq \sum_z q(z) \log \frac{q(z)}{p(z)} $$

  • Theorem: Gibbs Inequality (proof uses Jensen inquality):
    $$\boxed{ D_{\text{KL}}(q \parallel p) \geq 0 }$$ with equality only iff $p=q$.
  • Note that the KL divergence is an asymmetric distance measure, i.e. in general

$$D_{\text{KL}}(q \parallel p) \neq D_{\text{KL}}(p \parallel q)$$

EM by maximizing a Lower Bound on the Log-Likelihood

  • Consider a model for observations $x$, hidden variables $z$ and tuning parameters $\theta$. Note that, for any distribution $q(z)$, we can expand the log-likelihood as follows:

$$\begin{align*} \mathrm{L}(\theta) &\triangleq \log p(x|\theta) \\ &= \sum_z q(z) \log p(x|\theta) \\ &= \sum_z q(z) \left( \log p(x|\theta) - \log \frac{q(z)}{p(z|x,\theta)}\right) + \sum_z q(z) \log \frac{q(z)}{p(z|x,\theta)} \\ &= \sum_z q(z) \log \frac{p(x,z|\theta)}{q(z)} + \underbrace{D_{\text{KL}}\left( q(z) \parallel p(z|x,\theta) \right)}_{\text{Kullback-Leibler div.}} \tag{1} \\ &\geq \sum_z q(z) \log \frac{p(x,z|\theta)}{q(z)} \quad \text{(use Gibbs inequality)} \\ &= \underbrace{\sum_z q(z) \log p(x,z|\theta)}_{\text{expected complete-data log-likelihood}} + \underbrace{\mathcal{H}\left[ q\right]}_{\text{entropy of }q} \\ &\triangleq \mathrm{LB}(q,\theta) \end{align*}$$

  • Hence, $\mathrm{LB}(q,\theta)$ is a lower bound on the log-likelihood $\log p(x|\theta)$.
  • Technically, the Expectation-Maximization (EM) algorithm is defined by coordinate ascent on $\mathrm{LB}(q,\theta)$:

$$\begin{align*} &\textrm{ } \\ &\textrm{Initialize }: \theta^{(0)}\\ &\textrm{for }m = 1,2,\ldots \textrm{until convergence}\\ &\quad q^{(m+1)} = \arg\max_q \mathrm{LB}(q,\theta^{(m)}) &\quad \text{% update responsibilities} \\ &\quad \theta^{(m+1)} = \arg\max_\theta \mathrm{LB}(q^{(m+1)},\theta) &\quad \text{% re-estimate parameters} \end{align*}$$

  • Since $\mathrm{LB}(q,\theta) \leq \mathrm{L}(\theta)$ (for all choices of $q(z)$), maximizing the lower-bound $\mathrm{LB}$ will also maximize the log-likelihood wrt $\theta$. The reason to maximize $\mathrm{LB}$ rather than log-likelihood $\mathrm{L}$ directly is that $\arg\max \mathrm{LB}$ often leads to easier expressions. E.g., see this illustrative figure (Bishop p.453):

EM Details

Maximizing $\mathrm{LB}$ w.r.t. $q$
  • Note that $$ \mathrm{LB}(q,\theta) = \mathrm{L}(\theta) - D_{\text{KL}}\left( q(z) \parallel p(z|x,\theta) \right) $$ and consequenty, maximizing $\mathrm{LB}$ over $q$ leads to minimization of the KL-divergence. Specifically, it follows from Gibbs inequality that $$ q^{(m+1)}(z) := p(z|x,\theta^{(m)}) $$
Maximizing $\mathrm{LB}$ w.r.t. $\theta$
  • It also follows from (the last line of) the multi-line derivation above that maximizing $\mathrm{LB}$ w.r.t. $\theta$ amounts to maximization of the expected complete-data log-likelihood (where the complete data set is defined as $\{(x_i,z_i)\}_{i=1}^N$). Hence, the EM algorithm comprises iterations of $$ \boxed{\textbf{EM}:\, \theta^{(m+1)} := \underbrace{\arg\max_\theta}_{\text{M-step}} \underbrace{\sum_z \overbrace{p(z|x,\theta^{(m)})}^{=q^{(m+1)}(z)} \log p(x,z|\theta)}_{\text{E-step}} } $$

Hoemwork exercise: EM for GMM Revisited

  • Write down the GMM generative model
  • The complete-data set is $D_c=\{x_1,z_1,x_2,z_2,\ldots,x_n,z_n\}$. Write down the complete-data likelihood $p(D_c|\theta)$
  • Write down the complete-data log-likelihood $\log p(D_c|\theta)$
  • Write down the expected complete-data log-likelihood $\mathrm{E}_Z\left[ \log p(D_c|\theta) \right]$
  • Maximize $\mathrm{E}_Z\left[ \log p(D_c|\theta) \right]$ w.r.t. $\theta=\{\pi,\mu,\Sigma\}$

  • Verify that your solution is the same as the 'intuitive' solution of the previous lesson.

Homework Exercise: EM for Three Coins problem

  • You have three coins in your pocket. For each coin, outcomes $\in \{\mathrm{H},\mathrm{T}\}$. $$ p(\mathrm{H}) = \begin{cases} \lambda & \text{for coin }0 \\ \rho & \text{for coin }1 \\ \theta & \text{for coin }2 \end{cases} $$
  • Scenario. Toss coin $0$. If Head comes up, toss three times with coin $1$; otherwise, toss three times with coin $2$.

  • The observed sequences after each toss with coin $0$ were $\langle \mathrm{HHH}\rangle$, $\langle \mathrm{HTH}\rangle$, $\langle \mathrm{HHT}\rangle$, and $\langle\mathrm{HTT}\rangle$

  • Task. Use EM to estimate most the likely values for $\lambda$, $\rho$ and $\theta$

Report Card on EM

  • EM is a general procedure for learning in the presence of unobserved variables.
  • In a sense, it is a family of algorithms. The update rules you will derive depend on the probability model assumed.
  • (Good!) No tuning parameters such a learning rate, unlike gradient descent-type algorithms
  • (Bad). EM is an iterative procedure that is very sensitive to initial conditions! EM converges to a local optimum.
  • Start from trash $\rightarrow$ end up with trash. Hence, we need a good and fast initialization procedure (often used: K-Means, see Bishop p.424)
  • Also used to train HMMs, etc.

(OPTIONAL SLIDE) The Free-Energy Principle

  • The negative lower-bound $\mathrm{F}(q,\theta) \triangleq -\mathrm{LB}(q,\theta)$ also appears in various scientific disciplines. In statistical physics and variational calculus, $F$ is known as the free energy functional. Hence, the EM algorithm is a special case of free energy minimization.
  • It follows from Eq.1 above that

$$\begin{align*} \mathrm{F}(q,\theta) &\triangleq \sum_z q(z) \log \frac{q(z)}{p(x,z|\theta)} \\ &= \underbrace{- \mathrm{L}(\theta)}_{-\text{log-likelihood}} + \underbrace{D_{\text{KL}}\left( q(z) \parallel p(z|x,\theta) \right)}_{\text{divergence}} \end{align*}$$

  • The Free-Energy Principle (FEP) is an influential neuroscientific theory that claims that information processing in the brain is also an example of free-energy minimization, see Friston, 2009.
  • According to FEP, the brain contains a generative model $p(x,z,a,\theta)$ for its environment. Here, $x$ are the sensory signals (observations); $z$ corresponds to (hidden) environmental causes of the sensorium; $a$ represents our actions (control signals for movement) and $\theta$ describes the fixed 'physical rules' of the world.

(OPTIONAL SLIDE) The Free-Energy Principle, cont'd

  • Solely through free-energy minimization, the brain infers an approximate posterior $q(z,a,\theta)$, thus inferring our perception, actions, and learning equations.
  • Free-energy can be interpreted as (a generalized notion of sum of) prediction errors. Free-energy minimization aims to minimize prediction errors through perception, actions, and learning. (The next picture is from a 2012 tutorial presentation by Karl Friston)

(OPTIONAL SLIDE) The Free-Energy Principle, cont'd

  • $\Rightarrow$ The brain is "nothing but" an approximate Bayesian agent that tries to build a model for its world. Actions (behavior) are selected to fulfill prior expectations about the world.
  • In our BIASlab research team in the Signal Processing Systems group (FLUX floor 7), we work on developing (approximately Bayesian) artificial intelligent agents that learn purposeful behavior through interactions with their environments, inspired by the free-energy principle. Applications include
    • robotics
    • self-learning of new signal processing algorithms, e.g., for hearing aids
    • self-learning how to drive
  • Let me know (email [email protected]) if you're interested in developing machine learning applications that are directly inspired by how the brain computes. We often have open traineeships or MSc thesis projects available.

The cell below loads the style file

In [1]:
open("../../styles/aipstyle.html") do f
    display("text/html", readstring(f))