- Goals
- Introduction to Bayesian (i.e., probabilistic) modeling

- Materials
- Mandatory
- These lecture notes

- Optional
- Bishop pp. 21-24

**Question**. We observe the following sequence of heads (h) and tails (t) when tossing the same coin repeatedly $$D=\{hthhtth\}\,.$$What is the probability that heads (h) comes up next?

**Answer**. Later in this lecture.

- Suppose that your task is to predict a 'new' datum $x$, based on $N$ observations $D=\{x_1,\dotsc,x_N\}$.

- The Bayesian approach for this task involves three stages:
- Model specification
- Parameter estimation (inference, learning)
- Prediction (apply the model)

- Next, we discuss these three stages in a bit more detail.

Your first task is to propose a model with tuning parameters $\theta$ for generating the data $D$.

- This involves specification of $p(D|\theta)$ and a prior for the parameters $p(\theta)$.

*You* choose the distribution $p(D|\theta)$ based on your physical understanding of the data generating process.

- Note that, for independent observations $x_n$, $$ p(D|\theta) = \prod_{n=1}^N p(x_n|\theta)$$ so usually you select a model for generating one observation $x_n$ and then use (in-)dependence assumptions to combine these models into a model for $D$.
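The iid factorization can be made concrete with a tiny Python sketch (not part of the lecture's Julia code; the Bernoulli coin model is an illustrative assumption, anticipating the example later in this lecture):

```python
def bernoulli_likelihood(x, mu):
    """p(x|mu) for a single coin toss; x=1 codes heads, x=0 tails."""
    return mu if x == 1 else 1 - mu

def data_likelihood(D, mu):
    """p(D|mu) as a product of per-observation likelihoods (iid assumption)."""
    p = 1.0
    for x in D:
        p *= bernoulli_likelihood(x, mu)
    return p

D = [1, 0, 1, 1, 0, 0, 1]          # the sequence hthhtth
print(data_likelihood(D, 0.5))     # equals mu^4 * (1-mu)^3 at mu = 0.5
```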

*You* choose the prior $p(\theta)$ to reflect what you know about the parameter values before you see the data $D$.

- After model specification, you need to measure/collect a data set $D$. Then, use Bayes rule to find the posterior distribution for the parameters, $$ p(\theta|D) = \frac{p(D|\theta) p(\theta)}{p(D)} \propto p(D|\theta) p(\theta) $$
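As a minimal illustration that the Bayes-rule update is purely computational, the posterior can be evaluated numerically on a grid of parameter values. A Python sketch (Bernoulli likelihood for 4 heads out of 7 tosses and a uniform prior are illustrative assumptions):

```python
import numpy as np

theta = np.linspace(0.0, 1.0, 1001)   # grid over the coin bias parameter
dtheta = theta[1] - theta[0]

prior = np.ones_like(theta)           # uniform prior p(theta) on [0,1]

# Likelihood of 4 heads and 3 tails under a Bernoulli model
likelihood = theta**4 * (1 - theta)**3

# Posterior \propto likelihood * prior; normalize on the grid
posterior = likelihood * prior
posterior /= posterior.sum() * dtheta

print(theta[np.argmax(posterior)])    # posterior mode, close to the MLE 4/7
```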

- Note that there's **no need for you to design a *smart* parameter estimation algorithm**. The only complexity lies in the computational issues.

- This "recipe" works only if the RHS factors can be evaluated; this is what machine learning is about

$\Rightarrow$ **Machine learning is easy, apart from computational details :)**

- Given the data $D$, our knowledge about the yet unobserved datum $x$ is captured by $$\begin{align*} p(x|D) &= \int p(x,\theta|D) \,\mathrm{d}\theta\\ &= \int p(x|\theta,D) p(\theta|D) \,\mathrm{d}\theta\\ &= \int p(x|\theta) p(\theta|D) \,\mathrm{d}\theta\\ \end{align*}$$

- Again, **no need to invent a special prediction algorithm**. Probability theory takes care of all that. The complexity of prediction is just computational: how to carry out the marginalization over $\theta$.

- In order to execute prediction, you need to have access to the factors $p(x|\theta)$ and $p(\theta|D)$. Where do these factors come from? Are they available?

- What did we learn from $D$? Without access to $D$, we would predict new observations through $$ p(x) = \int p(x,\theta) \,\mathrm{d}\theta = \int p(x|\theta) p(\theta) \,\mathrm{d}\theta $$

- NB: The application of the learned posterior $p(\theta|D)$ does not necessarily have to be prediction. We use prediction here as an example, but other applications are of course also possible.

- There appears to be a remaining problem: How good really were our model assumptions $p(x|\theta)$ and $p(\theta)$?

- Technically, this is a **model comparison** problem.

- [**Q**.] What if I have multiple candidate models, say $\mathcal{M} = \{m_1,\ldots,m_K\}$, where each model relates to a specific prior $p(\theta|m_k)$ and likelihood $p(D|\theta,m_k)$? Can we evaluate the relative performance of one model against another from the set?

- [**A**.] Start again with **model specification**. Specify a prior $p(m_k)$ for each of the models and then solve the desired inference problem:

$$\begin{align*} p(m_k|D) &= \frac{p(D|m_k) p(m_k)}{p(D)} \\ &\propto p(m_k) \cdot p(D|m_k) \\ &= p(m_k)\cdot \int_\theta p(D,\theta|m_k) \,\mathrm{d}\theta\\ &= \underbrace{p(m_k)}_{\substack{\text{model}\\\text{prior}}}\cdot \int_\theta \underbrace{p(D|\theta,m_k)}_{\text{likelihood}} \,\underbrace{p(\theta|m_k)}_{\text{prior}}\, \mathrm{d}\theta\\ \end{align*}$$
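For the beta-binomial coin model used later in this lecture, the evidence integral has a closed form, $p(D|m_k) = B(n+\alpha, N-n+\beta)/B(\alpha,\beta)$, with $B$ the Beta function. A Python sketch (the two candidate priors are illustrative choices):

```python
from math import lgamma, exp

def log_beta(a, b):
    """log B(a,b) = log Gamma(a) + log Gamma(b) - log Gamma(a+b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_evidence(n, N, a, b):
    """log p(D|m) = log [B(n+a, N-n+b) / B(a,b)] for a Bernoulli likelihood
    with a Beta(a,b) prior on the coin bias."""
    return log_beta(n + a, N - n + b) - log_beta(a, b)

n, N = 4, 7                       # D = hthhtth: 4 heads out of 7 tosses
for a, b in [(1, 1), (5, 5)]:     # two candidate priors (illustrative)
    print((a, b), exp(log_evidence(n, N, a, b)))
```

Here the $\mathcal{B}(5,5)$ model attains the higher evidence, since the observed frequencies are close to its prior mean of $0.5$.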

- You, the engineer, have to choose the factors $p(D|\theta,m_k)$, $p(\theta|m_k)$ and $p(m_k)$. After that, for a given data set $D$, the model posterior $p(m_k|D)$ can be computed.

- If you need to work with one model, select the model with the largest posterior $p(m_k|D)$.

- Alternatively, if you don't want to choose a model, you can do prediction by **Bayesian model averaging**, which utilizes the predictive power of all models: $$\begin{align*} p(x|D) &= \sum_k \int p(x,\theta,m_k|D)\,\mathrm{d}\theta \\ &= \sum_k \underbrace{p(m_k|D)}_{\substack{\text{model}\\\text{posterior}}} \cdot \int \underbrace{p(\theta|D,m_k)}_{\substack{\text{parameter}\\\text{posterior}}} \, \underbrace{p(x|\theta,m_k)}_{\text{likelihood}} \,\mathrm{d}\theta \end{align*}$$
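For the coin model, the averaging can be carried out in closed form. A Python sketch of Bayesian model averaging over two illustrative Beta priors with equal model priors (all numerical choices are assumptions for illustration):

```python
from math import lgamma, exp

def log_beta(a, b):
    """log B(a,b) via log-gamma."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

n, N = 4, 7                      # heads and total tosses for D = hthhtth
models = [(1, 1), (5, 5)]        # Beta(a,b) prior per model (illustrative)
prior_m = [0.5, 0.5]             # equal model priors p(m_k)

# Evidence p(D|m_k) = B(n+a, N-n+b) / B(a,b)
evidence = [exp(log_beta(n + a, N - n + b) - log_beta(a, b)) for a, b in models]

# Model posterior p(m_k|D) \propto p(m_k) p(D|m_k)
post_m = [pm * ev for pm, ev in zip(prior_m, evidence)]
Z = sum(post_m)
post_m = [p / Z for p in post_m]

# Per-model predictive (n+a)/(N+a+b), averaged under the model posterior
pred = [(n + a) / (N + a + b) for a, b in models]
p_heads = sum(pk * pr for pk, pr in zip(post_m, pred))
print(post_m, p_heads)
```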

- $\Rightarrow$ In a Bayesian framework, **model comparison** follows the same recipe as parameter estimation; it just works at one higher hierarchical level.

- More on this in part 2 (Tjalkens).

- Bayesian probability theory provides a unified framework for information processing (and even the Scientific Method).

We observe the following sequence of heads (h) and tails (t) when tossing the same coin repeatedly $$D=\{hthhtth\}\,.$$

What is the probability that heads (h) comes up next? We solve this in the next slides ...

We observe a sequence of $N$ coin tosses $D=\{x_1,\ldots,x_N\}$ with $n$ heads.

- Assume a Bernoulli distributed variable $p(x_k=h|\mu)=\mu$, which leads to a **binomial** distribution for the likelihood (assume $n$ times heads were thrown): $$ p(D|\mu) = \prod_{k=1}^N p(x_k|\mu) = \mu^n (1-\mu)^{N-n} $$

- Assume the prior belief is governed by a **beta distribution** $$ p(\mu) = \mathcal{B}(\mu|\alpha,\beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} \mu^{\alpha-1}(1-\mu)^{\beta-1} $$

- The Beta distribution is a **conjugate prior** for the Binomial distribution, which means that $$ \text{beta} \propto \text{binomial} \times \text{beta} $$

- $\alpha$ and $\beta$ are called **hyperparameters**, since they parameterize the distribution for another parameter ($\mu$). E.g., $\alpha=\beta=1$ (uniform).

- Infer posterior PDF over $\mu$ through Bayes rule

$$\begin{align*} p(\mu|D) &\propto p(D|\mu)\,p(\mu|\alpha,\beta) \\ &= \mu^n (1-\mu)^{N-n} \times \mu^{\alpha-1} (1-\mu)^{\beta-1} \\ &= \mu^{n+\alpha-1} (1-\mu)^{N-n+\beta-1} \end{align*}$$

hence the posterior is also beta distributed as

$$ p(\mu|D) = \mathcal{B}(\mu|\,n+\alpha, N-n+\beta) $$

- Essentially, **here ends the machine learning activity**.

- Now, we want to **use** the trained model. Let's use it to predict future observations.

- Marginalize over the parameter posterior to get the predictive PDF for a new coin toss $x_\bullet$, given the data $D$,

$$\begin{align*} p(x_\bullet=h|D) &= \int_0^1 p(x_\bullet=h|\mu)\,p(\mu|D) \,\mathrm{d}\mu \\ &= \int_0^1 \mu \times \mathcal{B}(\mu|\,n+\alpha, N-n+\beta) \,\mathrm{d}\mu \\ &= \frac{n+\alpha}{N+\alpha+\beta} \qquad \mbox{(a.k.a. Laplace rule)}\hfill \end{align*}$$

- Finally, we're ready to solve our example problem: for $D=\{hthhtth\}$ and uniform prior ($\alpha=\beta=1$), we get

$$ p(x_\bullet=h|D)=\frac{n+1}{N+2} = \frac{4+1}{7+2} = \frac{5}{9}$$
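This computation can be checked with a short Python sketch of the Laplace rule (the function name is just for illustration):

```python
def laplace_rule(D, alpha=1, beta=1):
    """Predictive p(next = h | D) = (n + alpha) / (N + alpha + beta)
    for a Beta(alpha, beta) prior, with n heads in N tosses."""
    n = D.count('h')
    N = len(D)
    return (n + alpha) / (N + alpha + beta)

print(laplace_rule("hthhtth"))  # 5/9 with the uniform prior alpha = beta = 1
```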

- What did we learn from the data? Before seeing any data, we think that $$p(x_\bullet=h)=\left. p(x_\bullet=h|D) \right|_{n=N=0} = \frac{\alpha}{\alpha + \beta}\,.$$

- After the $N$ coin tosses, we think that $p(x_\bullet=h|D) = \frac{n+\alpha}{N+\alpha+\beta}$.

- Note the following decomposition

$$\begin{align*} p(x_\bullet=h|\,D) &= \frac{n+\alpha}{N+\alpha+\beta} = \frac{n}{N+\alpha+\beta} + \frac{\alpha}{N+\alpha+\beta} \\ &= \frac{N}{N+\alpha+\beta}\cdot \frac{n}{N} + \frac{\alpha+\beta}{N+\alpha+\beta} \cdot \frac{\alpha}{\alpha+\beta} \\ &= \underbrace{\frac{\alpha}{\alpha+\beta}}_{prior} + \underbrace{\frac{N}{N+\alpha+\beta}}_{gain}\cdot \big( \underbrace{\frac{n}{N}}_{MLE} - \underbrace{\frac{\alpha}{\alpha+\beta}}_{prior} \big) \end{align*}$$

- Note that, since $0\leq\text{gain}\lt 1$, the Bayesian estimate lies between prior and maximum likelihood estimate.

- For large $N$, the gain goes to $1$ and $p(x_\bullet=h|D)$ goes to the maximum likelihood estimate (the relative frequency) $n/N$.
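The prior/gain/MLE decomposition is easy to verify numerically; a Python sketch (the test cases are arbitrary illustrative values):

```python
def predictive(n, N, a, b):
    """Laplace rule: p(next = h | D) = (n + a) / (N + a + b)."""
    return (n + a) / (N + a + b)

def decomposed(n, N, a, b):
    """Same quantity written as prior + gain * (MLE - prior)."""
    prior = a / (a + b)
    gain = N / (N + a + b)
    mle = n / N if N > 0 else prior   # MLE undefined for N = 0
    return prior + gain * (mle - prior)

# Check the identity for a few (n, N, alpha, beta) combinations
for (n, N, a, b) in [(4, 7, 1, 1), (4, 7, 5, 5), (60, 100, 2, 3)]:
    assert abs(predictive(n, N, a, b) - decomposed(n, N, a, b)) < 1e-12
print("decomposition verified")
```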

**Bayesian evolution of $p(\mu|D)$ for the coin toss**

Let's see how $p(\mu|D)$ evolves as we increase the number of coin tosses $N$. We'll use two different priors to demonstrate the effect of the prior on the posterior (set $N=0$ to inspect the prior).

In [1]:

```
using Reactive, Interact, PyPlot, Distributions
f = figure()
μ_range = linspace(0, 1, 100)    # grid over μ for plotting the posterior
μ = 0.4                          # true coin bias
samples = rand(192) .<= μ        # flip 192 coins
@manipulate for N=0:1:192; withfig(f) do
        n = sum(samples[1:N])            # count number of heads in first N flips
        posterior1 = Beta(1+n, 1+(N-n))  # posterior under uniform Beta(1,1) prior
        posterior2 = Beta(5+n, 5+(N-n))  # posterior under Beta(5,5) prior
        plot(μ_range, pdf.(posterior1, μ_range), "k-")
        plot(μ_range, pdf.(posterior2, μ_range), "k--")
        xlabel(L"\mu"); ylabel(L"p(\mu|\mathcal{D})"); grid()
        title(L"p(\mu|\mathcal{D})"*" for N=$(N), n=$(n) (real \$\\mu\$=$(μ))")
        legend(["Based on uniform prior "*L"B(1,1)", "Based on prior "*L"B(5,5)"], loc=4)
    end
end
```

Out[1]:

$\Rightarrow$ With more data, the relevance of the prior diminishes!

- Sometimes we want just one 'best' parameter (vector), rather than a posterior distribution over parameters. Why?

- Recall Bayesian prediction

$$ p(x|D) = \int p(x|\theta)p(\theta|D)\,\mathrm{d}{\theta} $$

- If we approximate posterior $p(\theta|D)$ by a delta function for one 'best' value $\hat\theta$, then the predictive distribution collapses to

$$ p(x|D)= \int p(x|\theta)\,\delta(\theta-\hat\theta)\,\mathrm{d}{\theta} = p(x|\hat\theta) $$

- This is the model $p(x|\theta)$ evaluated at $\theta=\hat\theta$.

- Note that $p(x|\hat\theta)$ is much easier to evaluate than the integral for full Bayesian prediction.

**Bayes estimate**

$$ \hat \theta_{bayes} = \int \theta \, p\left( \theta |D \right) \,\mathrm{d}{\theta} $$

- (homework). Prove that the Bayes estimate minimizes the expected mean-square error, i.e., prove that

$$ \hat \theta_{bayes} = \arg\min_{\hat \theta} \int_\theta (\hat \theta -\theta)^2 p \left( \theta |D \right) \,\mathrm{d}{\theta} $$

**Maximum A Posteriori** (MAP) estimate $$ \hat \theta_{\text{map}}= \arg\max _{\theta} p\left( \theta |D \right) = \arg \max_{\theta} p\left(D |\theta \right) \, p\left(\theta \right) $$

**Maximum Likelihood** (ML) estimate $$ \hat \theta_{ml} = \arg \max_{\theta} p\left(D |\theta\right) $$

- Note that Maximum Likelihood is MAP with a uniform prior.
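For the coin model both point estimates have closed forms: the ML estimate is $n/N$, and the MAP estimate is the mode $(n+\alpha-1)/(N+\alpha+\beta-2)$ of the Beta posterior (valid when both posterior parameters exceed 1). A Python sketch:

```python
def ml_estimate(n, N):
    """Maximum likelihood: arg max_mu mu^n (1-mu)^(N-n) = n/N."""
    return n / N

def map_estimate(n, N, a, b):
    """MAP: mode of the Beta(n+a, N-n+b) posterior
    (requires n+a > 1 and N-n+b > 1)."""
    return (n + a - 1) / (N + a + b - 2)

n, N = 4, 7                       # D = hthhtth
print(ml_estimate(n, N))          # 4/7
print(map_estimate(n, N, 5, 5))   # pulled toward the Beta(5,5) prior mean 0.5
```

With a uniform prior ($\alpha=\beta=1$) the two estimates coincide, consistent with ML being MAP with a uniform prior.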

Consider the task: predict a datum $x$ from an observed data set $D$.

| | Bayesian | Maximum Likelihood |
|---|---|---|
| 1. Model Specification | Choose a model $m$ with data generating distribution $p(x \mid \theta,m)$ and parameter prior $p(\theta \mid m)$ | Choose a model $m$ with the same data generating distribution $p(x \mid \theta,m)$. No need for priors. |
| 2. Learning | Use Bayes rule to find the parameter posterior, $p(\theta \mid D) \propto p(D \mid \theta)\, p(\theta)$ | By Maximum Likelihood (ML) optimization, $\hat \theta = \arg \max_{\theta} p(D \mid \theta)$ |
| 3. Prediction | $p(x \mid D) = \int p(x \mid \theta)\, p(\theta \mid D) \,\mathrm{d}\theta$ | $p(x \mid D) = p(x \mid \hat\theta)$ |

- Maximum Likelihood (ML) is MAP with uniform prior, or MAP is 'penalized' ML $$ \hat \theta_{map} = \arg \max _\theta \{ \overbrace{\log p\left( D|\theta \right)}^{\mbox{log-likelihood}} + \overbrace{\log p\left( \theta \right)}^{\mbox{penalty}} \} $$

- (good!). Works rather well if we have a lot of data because the influence of the prior diminishes with more data.

- (bad). Cannot be used for model comparison. E.g., the best model generally does not correspond to the largest likelihood (see part 2, Tjalkens).

- (good). Computationally often do-able. Useful fact (since $\log$ is monotonically increasing): $$\arg\max_\theta \log p(D|\theta) = \arg\max_\theta p(D|\theta)$$

$\Rightarrow$ **ML estimation is an approximation to Bayesian learning**, but for good reason a very popular learning method when faced with lots of available data.

The cell below loads the style file

In [2]:

```
open("../../styles/aipstyle.html") do f display("text/html", readstring(f)) end
```