with probabilities $p(x_{nk} = 1)=\theta_k$.
(a) Write down the probability for the $n$th observation $p(x_n|\theta)$ and derive the log-likelihood $\log p(D|\theta)$.
(b) Derive the maximum likelihood estimate for $\theta$.
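A numerical sanity check for part (b), offered as a minimal sketch rather than a derivation. It assumes the log-likelihood takes the form $\sum_k m_k \log\theta_k$, where $m_k$ counts the observations in class $k$, and that the estimate is the relative frequency $m_k/N$. The function names and the toy data set are hypothetical.

```python
from math import log

def class_counts(one_hot_data):
    """Count how often each class occurs in a list of one-hot vectors."""
    return [sum(column) for column in zip(*one_hot_data)]

def log_likelihood(counts, theta):
    """Assumed form of the log-likelihood: sum_k m_k * log(theta_k)."""
    return sum(m * log(t) for m, t in zip(counts, theta) if m > 0)

def relative_frequencies(one_hot_data):
    """Candidate maximum likelihood estimate: theta_k = m_k / N."""
    n_total = len(one_hot_data)
    return [m / n_total for m in class_counts(one_hot_data)]

# Toy data set (hypothetical): five one-hot observations over three classes.
data = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1], [1, 0, 0]]
counts = class_counts(data)
theta_hat = relative_frequencies(data)

# Any other probability vector should give a lower log-likelihood.
alternatives = [[0.5, 0.3, 0.2], [1 / 3, 1 / 3, 1 / 3], [0.7, 0.2, 0.1]]
best = log_likelihood(counts, theta_hat)
worse = [log_likelihood(counts, theta) for theta in alternatives]
```

Comparing `best` with the entries of `worse` does not prove optimality, but it catches algebra slips before the derivation is written up.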
Provide an interpretation of the variables $m_k,N,\alpha_k,\sum_k\alpha_k$.
[3] (##) Show that Laplace's generalized rule of succession can be rewritten as a prediction composed of a prior prediction and a data-based correction term.
[4] (#) Verify that
(a) the categorical distribution is a special case of the multinomial for $N=1$.
(b) the Bernoulli is a special case of the categorical distribution for $K=2$.
(c) the binomial is a special case of the multinomial for $K=2$.
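The three special cases can be spot-checked numerically before proving them. The sketch below assumes the standard probability mass functions of the four distributions. The function names are hypothetical, and agreement at a few points is not a proof.

```python
from math import comb, factorial

def multinomial_pmf(counts, probs):
    """Multinomial pmf for class counts that sum to N."""
    coeff = factorial(sum(counts))
    for m in counts:
        coeff //= factorial(m)  # each partial quotient is an integer
    p = float(coeff)
    for m, mu in zip(counts, probs):
        p *= mu ** m
    return p

def categorical_pmf(one_hot, probs):
    """Categorical pmf for a single one-hot coded observation."""
    p = 1.0
    for x, mu in zip(one_hot, probs):
        p *= mu ** x
    return p

def binomial_pmf(m, n_total, mu):
    """Binomial pmf for m successes in n_total trials."""
    return comb(n_total, m) * mu ** m * (1 - mu) ** (n_total - m)

def bernoulli_pmf(x, mu):
    """Bernoulli pmf for a single binary outcome x in {0, 1}."""
    return mu ** x * (1 - mu) ** (1 - x)

probs = [0.2, 0.5, 0.3]
mu = 0.35

# (a) multinomial with N = 1 versus categorical
case_a = (multinomial_pmf([0, 1, 0], probs), categorical_pmf([0, 1, 0], probs))
# (b) categorical with K = 2 versus Bernoulli
case_b = (categorical_pmf([1, 0], [mu, 1 - mu]), bernoulli_pmf(1, mu))
# (c) multinomial with K = 2 versus binomial
case_c = (multinomial_pmf([4, 6], [mu, 1 - mu]), binomial_pmf(4, 10, mu))
```

Each tuple holds the two values that should coincide.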
[5] (###) Determine the mean, variance and mode of a Beta distribution.
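A derived result for this exercise can be compared against a numerical integration. The sketch below assumes the parameterization $\mathrm{Beta}(x|a,b) \propto x^{a-1}(1-x)^{b-1}$ and uses the well-known closed forms as the reference. The symbols $a$, $b$ and all function names are assumptions, and the mode formula holds only for $a>1$ and $b>1$.

```python
from math import gamma

def beta_pdf(x, a, b):
    """Density of Beta(a, b) at x, for 0 < x < 1."""
    norm = gamma(a + b) / (gamma(a) * gamma(b))
    return norm * x ** (a - 1) * (1 - x) ** (b - 1)

def beta_moments_closed_form(a, b):
    """Reference closed forms for mean, variance, and mode."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    mode = (a - 1) / (a + b - 2)  # valid only for a > 1 and b > 1
    return mean, var, mode

def beta_moments_numeric(a, b, steps=20_000):
    """Midpoint-rule estimates of mean and variance, grid search for the mode."""
    h = 1.0 / steps
    xs = [(i + 0.5) * h for i in range(steps)]
    dens = [beta_pdf(x, a, b) for x in xs]
    mean = sum(x * d for x, d in zip(xs, dens)) * h
    var = sum((x - mean) ** 2 * d for x, d in zip(xs, dens)) * h
    mode = xs[max(range(steps), key=lambda i: dens[i])]
    return mean, var, mode

closed = beta_moments_closed_form(3.0, 5.0)
numeric = beta_moments_numeric(3.0, 5.0)
```

The grid search locates the mode only to within the grid spacing, so the comparison for the mode should use a looser tolerance than the one for the mean and variance.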
[6] (###) Consider a data set of binary variables $D=\{x_1,x_2,\ldots,x_N\}$ with a Bernoulli distribution $\mathrm{Ber}(x_k|\mu)$ as the data-generating distribution and a Beta prior for $\mu$. Assume that you make $n$ observations with $x=1$ and $N-n$ observations with $x=0$. Now consider a new draw $x_\bullet$. We are interested in computing $p(x_\bullet|D)$. Show that the mean value for $p(x_\bullet|D)$ lies between the prior mean and the maximum likelihood estimate.
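The claim can be illustrated numerically before it is proved. The sketch below assumes a $\mathrm{Beta}(\alpha,\beta)$ prior, so that the predictive probability of $x_\bullet=1$ is $(n+\alpha)/(N+\alpha+\beta)$. The hyperparameter names, the function names, and the example numbers are assumptions.

```python
def prior_mean(alpha, beta):
    """Mean of the assumed Beta(alpha, beta) prior."""
    return alpha / (alpha + beta)

def ml_estimate(n_ones, n_total):
    """Maximum likelihood estimate of mu: the fraction of ones."""
    return n_ones / n_total

def predictive_mean(n_ones, n_total, alpha, beta):
    """Posterior predictive probability that the next draw equals 1."""
    return (n_ones + alpha) / (n_total + alpha + beta)

def data_weight(n_total, alpha, beta):
    """Weight on the ML estimate when the prediction is written as a mixture."""
    return n_total / (n_total + alpha + beta)

# Hypothetical numbers: 7 ones out of 10 draws, Beta(2, 2) prior.
alpha, beta, n_ones, n_total = 2.0, 2.0, 7, 10
p_prior = prior_mean(alpha, beta)
p_ml = ml_estimate(n_ones, n_total)
p_pred = predictive_mean(n_ones, n_total, alpha, beta)
w = data_weight(n_total, alpha, beta)
mixture = (1 - w) * p_prior + w * p_ml
```

Because `w` lies strictly between 0 and 1 for positive hyperparameters and $N>0$, writing the prediction as this mixture is one route to the requested proof.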
[7] Consider a data set $D = \{(x_1,y_1), (x_2,y_2),\dots,(x_N,y_N)\}$ with one-hot encoding for the $K$ discrete classes, i.e., $y_{nk} = 1$ if and only if $y_n \in \mathcal{C}_k$, else $y_{nk} = 0$. Also given are the class-conditional distribution $p(x_n| y_{nk}=1,\theta) = \mathcal{N}(x_n|\mu_k,\Sigma)$ and the multinomial prior $p(y_{nk}=1) = \pi_k$.
(a) Prove that the joint log-likelihood is given by $$\begin{equation*} \log p(D|\theta) = \sum_{n,k} y_{nk} \log \mathcal{N}(x_n|\mu_k,\Sigma) + \sum_{n,k} y_{nk} \log \pi_k \end{equation*} $$
(b) Show now that the MLE of the class-conditional mean is given by $$\begin{equation*} \hat \mu_k = \frac{\sum_n y_{nk} x_n}{\sum_n y_{nk}} \end{equation*} $$
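The estimator in (b) is the average of the inputs assigned to class $k$. A minimal sketch in plain Python follows. The function name and the toy data are hypothetical, and the code assumes every class has at least one observation, since otherwise the denominator is zero.

```python
def class_means(xs, ys):
    """Compute mu_k = sum_n y_nk * x_n / sum_n y_nk for every class k.

    xs: list of feature vectors (lists of floats).
    ys: list of one-hot label vectors of length K.
    Assumes each class occurs at least once in ys.
    """
    n_classes = len(ys[0])
    dim = len(xs[0])
    sums = [[0.0] * dim for _ in range(n_classes)]
    counts = [0.0] * n_classes
    for x, y in zip(xs, ys):
        for k in range(n_classes):
            counts[k] += y[k]
            for d in range(dim):
                sums[k][d] += y[k] * x[d]
    return [[s / counts[k] for s in sums[k]] for k in range(n_classes)]

# Toy data (hypothetical): two samples in class 0, one sample in class 1.
xs = [[1.0, 2.0], [3.0, 4.0], [10.0, 0.0]]
ys = [[1, 0], [1, 0], [0, 1]]
means = class_means(xs, ys)
```

The one-hot weights $y_{nk}$ act as a selector, so each sample contributes only to the mean of its own class.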