proof that the marginal distribution for observations $x_n$ evaluates to
$$ p(x) = \sum_{j=1}^K \pi_k \cdot \mathcal{N}\left( x | \mu_j, \Sigma_j \right) $$$$\begin{align*}
p(x) &= \sum_{z} p(x,z) \ &= \sum_{z} \prod_{k=1}^K \left(\pi_k \cdot \mathcal{N}\left( x | \mu_k, \Sigma_k\right) \right)^{z_{k}} \end{align}$$ Exploiting the one-hot coding scheme for $z$, we can re-write the RHS as $$\begin{equation} \sum_{j=1}^K \prod_{k=1}^K \left(\pi_k \cdot \mathcal{N}\left( x | \mu_k, \Sigma_k\right) \right)^{I_{kj}} = \sum_{j=1}^K \pi_j \cdot \mathcal{N}\left( x | \mu_j, \Sigma_j\right) \end{equation*}$$ where $I_{kj} = 1$ if $k=j$ and $0$ otherwise.
The Energy-Entropy decomposition follows simply from $\log \frac{a}{b} = \log(a) - \log(b)$. The Divergence-Evidence decomposition follows from $p(x,z) = p(z|x)p(x)$ and the Complexity-Accuracy decomposition follows from substituting $p(x,z)=p(x|z)p(z)$. Altogether leading to$$\begin{align*}
\mathrm{F}[q] &= \underbrace{-\sum_z q(z) \log p(x,z)}{\text{energy}} - \underbrace{\sum_z q(z) \log \frac{1}{q(z)}}{\text{entropy}} \tag{EE} \ &= \underbrace{\sum_z q(z) \log \frac{q(z)}{p(z|x)}}{\text{KL divergence}\geq 0} - \underbrace{\log p(x)}{\text{log-evidence}} \tag{DE}\ &= \underbrace{\sum_z q(z)\log\frac{q(z)}{p(z)}}{\text{complexity}} - \underbrace{\sum_z q(z) \log p(x|z)}{\text{accuracy}} \tag{CA} \end{align*}$$
[3] (#) The Free energy functional $\mathrm{F}[q] = -\sum_z q(z) \log p(x,z) - \sum_z q(z) \log \frac{1}{q(z)}$ decomposes into "Energy minus Entropy". So apparently the entropy of the posterior $q(z)$ is maximized. This entropy maximization may seem puzzling at first because inference should intuitively lead to more informed posteriors, i.e., posterior distributions whose entropy is smaller than the entropy of the prior. Explain why entropy maximization is still a reasonable objective.
Note that Free Energy minimization is a balancing act: FE minimization implies entropy maximization and at the same time energy minimization. Minimizing the energy term leads to aligning $q(z)$ with $\log p(x,z)$, ie, it tries to move the bulk of the function $q(z)$ to areas in $z$-space where $p(x,z)$ is large ($p(x,z)$ is here just a function of $z$, since x is observed). However, aside from aligning with $p(x,z)$, we want $q(z)$ to be as uninformative as possible. Everything that can be inferred should be represented in $p(x,z)$ (which is prior times likelihood). We don't want to learn anything that is not in either the prior or the likelihood. The entropy term balances the energy term by favoring distributions that are as uninformative as possible.
[4] (#) Explain the following update rule for the mean of the Gaussian cluster-conditional data distribution (from the example about mean-field updating of a Gaussian mixture model):
We see here an example of "precision-weighted means add" when two sources of information are fused, just like precision-weighted means add when two Gaussians are multiplied, eg a prior and likelihood. In this case, the prior is $m_0$ and the likelihood estimate is $\bar{x}$. $\beta_0$ can be interpreted as the number of pseudo-observations in the prior.
Proof that this algorithm minimizes the Free Energy functional $$\begin{align*} F[q,\theta] = \sum_z q(z) \log \frac{q(z)}{p(D,z|\theta)} \end{align*}$$
Let's start with a prior estimate $\theta^{(i-1)}$ and we want to minimize the free energy functional wrt $q$. This leads to
$$\begin{align*} q^{(i)}(z) &= \arg\min_q F[q,\theta^{(i-1)}] \\ &= \arg\min_q \sum_z q(z) \log \frac{q(z)}{p(D,z|\theta^{(i-1)})} \\ &= \arg\min_q \sum_z q(z) \log \frac{q(z)}{p(z|D,\theta^{(i-1)}) \cdot p(D|\theta^{(i-1)})} \\ &= p(z|D,\theta^{(i-1)}) \end{align*}$$ Next, we use $q^{(i)}(z)=p(z|D,\theta^{(i-1)})$ and minimize the free energy w.r.t. $\theta$, leading to $$\begin{align*} \theta^{(i)} &= \arg\min_\theta F[q^{(i)}(z),\theta] \\ &= \arg\min_\theta \sum_z p(z|D,\theta^{(i-1)}) \log \frac{p(z|D,\theta^{(i-1)})}{p(D,z|\theta)} \\ &= \arg\max_\theta \sum_z \underbrace{p(z|D,\theta^{(i-1)})}_{q^{(i)}(z)} \log p(D,z|\theta) \end{align*}$$
Overfitting relates to learning a posterior that "listens" too much to the data (and not enough to the prior). Underfitting does the opposite. The CA decompostion
exposes this dilemma nicely. The complrxity term tries to keep the posterior $q(z)$ near the prior $p(z)$ whereas the accuracy term tries to align the posterior $q(z)$ with the likelihood $p(x|z)$. Thus, minimizing Free Energy keeps an eye on avoiding both under- and over-fitting.
(a) Apparently, in order to execute EM, we need to work out an expression for the 'responsibility' $p(z|x=D,\hat{\theta}^{(m)})$. Use Bayes rule to show how we can compute the responsibility that allows us to execute an EM step.
$$p(z|x=D,\hat{\theta}^{(m)}) = \frac{p(x=D|z,\hat{\theta}^{(m)}) \,p(z|\hat{\theta}^{(m)})}{\int p(x=D|z,\hat{\theta}^{(m)}) \,p(z|\hat{\theta}^{(m)}) \,\mathrm{d}z}$$Use Bayes rule:
Note that the RHS is an expression in $z$ since $D$ and $\hat{\theta}$ are given. If you want to evaluate the RHS, you need to make a specific choice for your model $$p(x,z|\theta) = \underbrace{p(x|z,\theta)}_{\text{likelihood}} \underbrace{p(z|\theta)}_{\text{prior}}$$
(b) Why do we need multiple iterations in the EM algorithm?
We must have a parameter estimate in order to compute the responsibilities, and vice versa, we need responsibilities to update the parameter estimate. Thus, in the EM algorithm, we iterate between updating responsibilities (beliefs about $z$) and parameter estimates (beliefs about $\theta$).
(c) Why can't we just use simple maximum log-likelihood to estimate parameters, as described by $$ \hat{\theta} := \arg \max_\theta \log p(x=D,z|\theta) \,? $$
Because $z$ is not observed.
Do you prefer a gradient descent or EM algorithm to estimate maximum likelihood values for the parameters? Explain your answer. (No need to work out the equations.)
Since this expression does not degenerate into simple MVGs, the EM approach is in practice preferred.