(a) Machine learning is inference over models (hypotheses, parameters, etc.) from a given data set. Bayes rule makes this statement precise. Let $\theta \in \Theta$ and $D$ represent a model parameter vector and the given data set, respectively. Then, Bayes rule,
$$ p(\theta|D) = \frac{p(D|\theta)}{p(D)} p(\theta) $$
relates the information that we have about $\theta$ before we saw the data (i.e., the distribution $p(\theta)$) to what we know after having seen the data, $p(\theta|D)$.
(b) The Maximum a Posteriori (MAP) estimate picks a value $\hat\theta$ for which the posterior distribution $p(\theta|D)$ is maximal, i.e.,
$$ \hat\theta_{\text{MAP}} = \arg\max_\theta p(\theta|D)$$
In a sense, MAP estimation approximates Bayesian learning, since it approximates the full posterior $p(\theta|D)$ by the point mass $\delta(\theta-\hat\theta_{\text{MAP}})$. Note that, by Bayes rule, $$\arg\max_\theta p(\theta|D) = \arg\max_\theta p(D|\theta)p(\theta)$$ If we further assume that prior to seeing the data all values for $\theta$ are equally likely (i.e., $p(\theta)=\text{const.}$), then the MAP estimate reduces to the Maximum Likelihood (ML) estimate, $$ \hat\theta_{\text{ML}} = \arg\max_\theta p(D|\theta)$$
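For intuition, here is a minimal numerical sketch (not part of the original solution; the data set of $n=7$ heads in $N=10$ tosses and the $\mathcal{B}(2,5)$ prior are assumed purely for illustration) showing that a grid-based MAP estimate under a flat prior coincides with the ML estimate, while an informative prior pulls the MAP estimate away from it:

```python
import numpy as np

# Hypothetical data: n heads out of N coin tosses (values assumed for illustration only).
N, n = 10, 7
theta = np.linspace(1e-6, 1 - 1e-6, 10001)                  # grid over the parameter

log_lik = n * np.log(theta) + (N - n) * np.log(1 - theta)   # log p(D|theta)
log_flat_prior = np.zeros_like(theta)                        # p(theta) = const.
log_beta_prior = (2 - 1) * np.log(theta) + (5 - 1) * np.log(1 - theta)  # Beta(2,5) prior, up to a constant

theta_ml = theta[np.argmax(log_lik)]
theta_map_flat = theta[np.argmax(log_lik + log_flat_prior)]
theta_map_beta = theta[np.argmax(log_lik + log_beta_prior)]

print(theta_ml, theta_map_flat)  # both close to n/N = 0.7
print(theta_map_beta)            # close to (n+2-1)/(N+2+5-2) = 8/15
```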
(1) Model specification, (2) parameter estimation, (3) model evaluation and (4) application of the model to tasks.
Prove that the Bayes estimate minimizes the mean-squared error, i.e., show that $$ \hat \theta_{\text{Bayes}} = \arg\min_{\hat \theta} \int_\theta (\hat \theta -\theta)^2 p \left( \theta |D \right) \,\mathrm{d}{\theta} $$
To minimize the expected mean-squared error, we look for the $\hat{\theta}$ that makes the gradient of the integral with respect to $\hat{\theta}$ vanish:
$$\begin{align*} \nabla_{\hat{\theta}} \int_\theta (\hat \theta -\theta)^2 p \left( \theta |D \right) \,\mathrm{d}{\theta} &= 0 \\ \int_\theta \nabla_{\hat{\theta}} (\hat \theta -\theta)^2 p \left( \theta |D \right) \,\mathrm{d}{\theta} &= 0 \\ \int_\theta 2(\hat \theta -\theta) p \left( \theta |D \right) \,\mathrm{d}{\theta} &= 0 \\ \int_\theta \hat \theta p \left( \theta |D \right) \,\mathrm{d}{\theta} &= \int_\theta \theta p \left( \theta |D \right) \,\mathrm{d}{\theta} \\ \hat \theta \underbrace{\int_\theta p \left( \theta |D \right) \,\mathrm{d}{\theta}}_{1} &= \int_\theta \theta p \left( \theta |D \right) \,\mathrm{d}{\theta} \\ \Rightarrow \hat \theta &= \int_\theta \theta p \left( \theta |D \right) \,\mathrm{d}{\theta} \end{align*}$$
Since the objective is a convex (quadratic) function of $\hat{\theta}$ (its second derivative equals $2$), this stationary point is indeed the global minimum: the Bayes estimate is the posterior mean of $\theta$.
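To make the result concrete, the following sketch (assuming, for illustration only, a $\mathcal{B}(3,5)$ posterior) evaluates the expected squared error on a grid of candidate estimates and confirms numerically that the minimizer is the posterior mean:

```python
import numpy as np
from scipy import integrate, stats

posterior = stats.beta(3, 5)                # example posterior p(theta|D), assumed for illustration
candidates = np.linspace(0.0, 1.0, 1001)    # grid of candidate estimates theta_hat

def expected_squared_error(theta_hat):
    # E[(theta_hat - theta)^2] under the posterior
    return integrate.quad(lambda t: (theta_hat - t) ** 2 * posterior.pdf(t), 0.0, 1.0)[0]

mse = np.array([expected_squared_error(th) for th in candidates])
print(candidates[np.argmin(mse)])  # 0.375
print(posterior.mean())            # 3 / (3 + 5) = 0.375
```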
(a) The likelihood is given by $p(D|\mu) = \mu^n (1-\mu)^{N-n}$. Setting the derivative of the log-likelihood to zero gives
$$\begin{align*} \nabla_\mu \log p(D|\mu) &= 0 \\ \nabla_\mu \left( n\log \mu + (N-n)\log(1-\mu)\right) &= 0\\ \frac{n}{\mu} - \frac{N-n}{1-\mu} &= 0 \\ \Rightarrow \hat{\mu}_{\text{ML}} &= \frac{n}{N} \end{align*}$$
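The same stationarity condition can be checked symbolically (a sketch using SymPy; not part of the original solution):

```python
import sympy as sp

mu, n, N = sp.symbols('mu n N', positive=True)

log_lik = n * sp.log(mu) + (N - n) * sp.log(1 - mu)   # log-likelihood of n heads in N tosses
stationary_points = sp.solve(sp.Eq(sp.diff(log_lik, mu), 0), mu)

print(stationary_points)  # [n/N]
```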
(b) Assuming a beta prior $\mathcal{B}(\mu|\alpha,\beta)$, we can write the posterior as
$$\begin{align*} p(\mu|D) &\propto p(D|\mu)\, p(\mu) \\ &\propto \mu^n (1-\mu)^{N-n} \cdot \mu^{\alpha-1} (1-\mu)^{\beta-1} \\ &\propto \mathcal{B}(\mu|n+\alpha,\, N-n+\beta) \end{align*}$$
The mode of a beta distribution $\mathcal{B}(a,b)$ (for $a,b>1$) is located at $\frac{a - 1}{a+b-2}$, see Wikipedia. Hence, the MAP estimate is
$$\begin{align*} \hat{\mu}_{\text{MAP}} &= \frac{(n+\alpha)-1}{(n+\alpha) + (N-n+\beta) -2} \\ &= \frac{n+\alpha-1}{N + \alpha +\beta -2} \end{align*}$$
(c) As $N$ gets larger, the influence of the prior parameters $\alpha$ and $\beta$ vanishes, and the MAP estimate $\frac{n+\alpha-1}{N+\alpha+\beta-2}$ approaches the ML estimate $\frac{n}{N}$; in the limit $N \to \infty$ the two solutions coincide.
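A small simulation (a sketch; the true parameter $\mu=0.7$ and the $\mathcal{B}(2,5)$ prior are assumed for illustration) shows the gap between the two estimates shrinking as $N$ grows:

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true, alpha, beta = 0.7, 2.0, 5.0    # assumed true parameter and Beta prior parameters

for N in (10, 100, 1000, 10000):
    n = rng.binomial(N, mu_true)                          # number of heads in N tosses
    mu_ml = n / N                                         # maximum likelihood estimate
    mu_map = (n + alpha - 1) / (N + alpha + beta - 2)     # MAP estimate under the Beta(alpha, beta) prior
    print(f"N={N:6d}  ML={mu_ml:.4f}  MAP={mu_map:.4f}  gap={abs(mu_ml - mu_map):.4f}")
```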
(a) Work out the probability $p(x=1|m_1)$.
$$\begin{align*}
p(x=1|m_1) &= \int_0^1 p(x=1|\theta,m_1) p(\theta|m_1) \mathrm{d}\theta \\
&= \int \theta \cdot 6\theta (1-\theta) \mathrm{d}\theta \\
&= 6 \cdot \left(\frac{1}{3}\theta^3 - \frac{1}{4}\theta^4\right) \bigg|_0^1 \\
&= 6 \cdot (\frac{1}{3} - \frac{1}{4}) = \frac{1}{2}
\end{align*}$$
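The evidence integral can also be cross-checked numerically (a sketch using the ingredients above, $p(x=1|\theta,m_1)=\theta$ and $p(\theta|m_1)=6\theta(1-\theta)$):

```python
from scipy import integrate

# p(x=1|m1) = integral over [0,1] of p(x=1|theta, m1) * p(theta|m1)
evidence_m1, _ = integrate.quad(lambda t: t * 6 * t * (1 - t), 0, 1)
print(evidence_m1)  # 0.5
```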
(b) Determine the posterior $p(\theta|x=1,m_1)$.
$$\begin{align*}
p(\theta|x=1,m_1) &= \frac{p(x=1|\theta,m_1) p(\theta|m_1)}{p(x=1|m_1)} \\
&= \frac{\theta \cdot 6\theta (1-\theta)}{1/2} \\
&= \begin{cases} 12 \theta^2 (1-\theta) & \text{if }0 \leq \theta \leq 1 \\
0 & \text{otherwise} \end{cases}
\end{align*}$$
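As a sanity check (a sketch), the resulting posterior should integrate to one over $[0,1]$:

```python
from scipy import integrate

def posterior_m1(t):
    # p(theta | x=1, m1) = 12 * theta^2 * (1 - theta)
    return 12 * t**2 * (1 - t)

print(integrate.quad(posterior_m1, 0, 1)[0])  # 1.0, i.e., a properly normalized density
```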
Now consider a second model $m_2$ with the following sampling distribution and prior on $0 \leq \theta \leq 1$:
$$\begin{align*}
p(x|\theta,m_2) &= (1-\theta)^x \theta^{(1-x)} \\
p(\theta|m_2) &= 2\theta
\end{align*}$$
(c) Determine the probability $p(x=1|m_2)$.
$$\begin{align*}
p(x=1|m_2) &= \int_0^1 p(x=1|\theta,m_2) p(\theta|m_2) \mathrm{d}\theta \\
&= \int (1-\theta) \cdot 2\theta \mathrm{d}\theta \\
&= 2 \cdot \left( \frac{1}{2}\theta^2 - \frac{1}{3}\theta^3 \right) \bigg|_0^1 \\
&= 2 \cdot (\frac{1}{2} - \frac{1}{3}) = \frac{1}{3}
\end{align*}$$
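The same numerical cross-check for the second model (a sketch, using $p(x=1|\theta,m_2)=1-\theta$ and $p(\theta|m_2)=2\theta$):

```python
from scipy import integrate

# p(x=1|m2) = integral over [0,1] of p(x=1|theta, m2) * p(theta|m2)
evidence_m2, _ = integrate.quad(lambda t: (1 - t) * 2 * t, 0, 1)
print(evidence_m2)  # ~0.3333
```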
Now assume that the model priors are given by
$$\begin{align*}
p(m_1) &= 1/3 \\
p(m_2) &= 2/3
\end{align*}$$
(d) Compute the probability $p(x=1)$ by "Bayesian model averaging", i.e., by weighing the predictions of both models appropriately.
$$\begin{align*}
p(x=1) &= \sum_{k=1}^2 p(x=1|m_k) p(m_k) \\
&= \frac{1}{2} \cdot \frac{1}{3} + \frac{1}{3} \cdot \frac{2}{3} \\
&= \frac{1}{6} + \frac{2}{9} = \frac{7}{18}
\end{align*}$$
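The same computation in exact arithmetic (a minimal sketch using Python's fractions module):

```python
from fractions import Fraction

p_x1_m1, p_x1_m2 = Fraction(1, 2), Fraction(1, 3)  # model evidences from parts (a) and (c)
p_m1, p_m2 = Fraction(1, 3), Fraction(2, 3)        # model priors

p_x1 = p_x1_m1 * p_m1 + p_x1_m2 * p_m2             # Bayesian model average
print(p_x1)  # 7/18
```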
(e) Compute the fraction of posterior model probabilities $\frac{p(m_1|x=1)}{p(m_2|x=1)}$.
$$\frac{p(m_1|x=1)}{p(m_2|x=1)} = \frac{p(x=1|m_1) p(m_1)}{p(x=1|m_2) p(m_2)} = \frac{\frac{1}{2} \cdot \frac{1}{3}}{\frac{1}{3} \cdot \frac{2}{3}} = \frac{1/6}{2/9} = \frac{3}{4}$$
(f) Which model do you prefer after observation $x=1$?
After observing $x=1$, model $m_2$ remains the more probable model, since $p(m_2|x=1) = \frac{4}{3}\, p(m_1|x=1)$. Note, however, that the observation itself actually shifted the odds slightly towards $m_1$ (the Bayes factor is $p(x=1|m_1)/p(x=1|m_2) = 3/2$), and that $\log_{10} \frac{3}{4} \approx -0.125$, so the difference in posterior model probabilities is very small. After a single observation we have no clear preference for either model yet.
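A short sketch backing up these numbers: the posterior odds, the implied posterior model probabilities, and the base-10 log-odds follow directly from the quantities computed above.

```python
import math
from fractions import Fraction

p_x1_m1, p_x1_m2 = Fraction(1, 2), Fraction(1, 3)   # model evidences
p_m1, p_m2 = Fraction(1, 3), Fraction(2, 3)         # model priors

odds = (p_x1_m1 * p_m1) / (p_x1_m2 * p_m2)          # p(m1|x=1) / p(m2|x=1)
post_m1 = odds / (1 + odds)                          # p(m1|x=1)
post_m2 = 1 - post_m1                                # p(m2|x=1)

print(odds, post_m1, post_m2)   # 3/4, 3/7, 4/7
print(math.log10(float(odds)))  # ~ -0.125
```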