Bayesian Machine Learning


  • Goals
    • Introduction to Bayesian (i.e., probabilistic) modeling
  • Materials
    • Mandatory
      • These lecture notes
    • Optional
      • Bishop pp. 21-24

Example Problem: Predicting a Coin Toss

  • Question. We observe the following sequence of heads (h) and tails (t) when tossing the same coin repeatedly $$D=\{hthhtth\}\,.$$

  • What is the probability that heads (h) comes up next?

  • Answer later in this lecture.

The Bayesian Machine Learning Framework

  • Suppose that your task is to predict a 'new' datum $x$, based on $N$ observations $D=\{x_1,\dotsc,x_N\}$.
  • The Bayesian approach for this task involves three stages:
    1. Model specification
    2. Parameter estimation (inference, learning)
    3. Prediction (apply the model)
  • Next, we discuss these three stages in a bit more detail.

(1) Model specification

Your first task is to propose a model with tuning parameters $\theta$ for generating the data $D$.

  • This involves specification of $p(D|\theta)$ and a prior for the parameters $p(\theta)$.
  • You choose the distribution $p(D|\theta)$ based on your physical understanding of the data generating process.
    • Note that, for independent observations $x_n$, $$ p(D|\theta) = \prod_{n=1}^N p(x_n|\theta)$$ so usually you select a model for generating one observation $x_n$ and then use (in-)dependence assumptions to combine these models into a model for $D$.
  • You choose the prior $p(\theta)$ to reflect what you know about the parameter values before you see the data $D$.
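As a minimal sketch of this stage, assume binary observations with a Bernoulli model (an illustrative choice; the names `p_obs`, `likelihood` and `prior` are hypothetical). The likelihood of a data set of independent observations is then the product of the per-observation terms:

```julia
# Illustrative assumption: binary observations with p(x = true | θ) = θ.
p_obs(x::Bool, θ) = x ? θ : 1 - θ

# For independent observations, p(D|θ) is the product of the terms p(x_n|θ).
function likelihood(D, θ)
    L = 1.0
    for x in D
        L *= p_obs(x, θ)
    end
    return L
end

# Prior density over θ in [0, 1]; uniform here, i.e. no preference before seeing data.
prior(θ) = 1.0

D = [true, false, true, true]   # a made-up data set for illustration
println(likelihood(D, 0.5))     # p(D|θ = 0.5)
println(likelihood(D, 0.75))    # p(D|θ = 0.75)
```

Together, `likelihood` and `prior` are everything the model specification stage has to deliver; the later stages only combine them.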

(2) Parameter estimation

  • After model specification, you need to measure/collect a data set $D$. Then, use Bayes rule to find the posterior distribution for the parameters, $$ p(\theta|D) = \frac{p(D|\theta) p(\theta)}{p(D)} \propto p(D|\theta) p(\theta) $$
  • Note that there's no need for you to design a smart parameter estimation algorithm. The only complexity lies in the computational issues.
  • This "recipe" works only if the RHS factors can be evaluated; this is what machine learning is about
    $\Rightarrow$ Machine learning is easy, apart from computational details:)
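To make the "computational details" concrete, here is a rough sketch that applies Bayes rule on a grid for a one-dimensional parameter. The Bernoulli likelihood, the uniform prior and the function name `grid_posterior` are assumptions for illustration, and the midpoint rule stands in for the integral that defines $p(D)$:

```julia
# Bayes rule on a grid: posterior ∝ likelihood × prior, normalized numerically.
# Illustrative assumptions: n successes in N Bernoulli trials, uniform prior.
likelihood(θ, n, N) = θ^n * (1 - θ)^(N - n)
prior(θ) = 1.0

function grid_posterior(n, N; bins = 1000)
    Δ = 1 / bins
    θs = [(i - 0.5) * Δ for i in 1:bins]                    # bin midpoints on [0, 1]
    unnorm = [likelihood(θ, n, N) * prior(θ) for θ in θs]   # numerator of Bayes rule
    evidence = sum(unnorm) * Δ                              # midpoint-rule approximation of p(D)
    return θs, unnorm ./ evidence, evidence
end

θs, post, evidence = grid_posterior(3, 4)
println(evidence)             # approximate p(D)
println(θs[argmax(post)])     # grid point with the highest posterior density
```

A grid like this is only practical for very low-dimensional $\theta$; in higher dimensions the normalization $p(D)$ is exactly where the computational difficulty sits.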

(3) Prediction

  • Given the data $D$, our knowledge about the yet unobserved datum $x$ is captured by $$\begin{align*} p(x|D) &= \int p(x,\theta|D) \,\mathrm{d}\theta\\ &= \int p(x|\theta,D) p(\theta|D) \,\mathrm{d}\theta\\ &= \int p(x|\theta) p(\theta|D) \,\mathrm{d}\theta\\ \end{align*}$$
  • Again, no need to invent a special prediction algorithm. Probability theory takes care of all that. The complexity of prediction is just computational: how to carry out the marginalization over $\theta$.
  • In order to execute prediction, you need to have access to the factors $p(x|\theta)$ and $p(\theta|D)$. Where do these factors come from? Are they available?
  • What did we learn from $D$? Without access to $D$, we would predict new observations through $$ p(x) = \int p(x,\theta) \,\mathrm{d}\theta = \int p(x|\theta) p(\theta) \,\mathrm{d}\theta $$
  • NB: The learned posterior $p(\theta|D)$ does not have to be used for prediction. We use prediction here as an example, but other applications are of course also possible.
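The marginalization over $\theta$ can be sketched numerically. The example below assumes a Bernoulli model with $p(x=\text{true}|\theta)=\theta$, a uniform prior and made-up data (3 successes in 4 trials); it contrasts the prediction without data, based on $p(\theta)$, with the prediction after learning, based on $p(\theta|D)$:

```julia
# Predictive probability p(x = true | D) = ∫ p(x = true | θ) p(θ|D) dθ on a grid.
# Illustrative assumptions: Bernoulli model, uniform prior, midpoint-rule integration.
bins = 1000
Δ = 1 / bins
θs = [(i - 0.5) * Δ for i in 1:bins]

# Rescale a vector of density values so that it integrates to 1 on the grid.
normalize_density(w) = w ./ (sum(w) * Δ)

prior_density = normalize_density(ones(bins))
n, N = 3, 4                                   # made-up data: 3 successes in 4 trials
posterior_density = normalize_density([θ^n * (1 - θ)^(N - n) for θ in θs] .* prior_density)

# Since p(x = true | θ) = θ, marginalizing gives the mean of θ under each density.
predict(density) = sum(θs .* density) * Δ

println(predict(prior_density))       # prediction without access to D
println(predict(posterior_density))   # prediction after learning from D
```

The difference between the two printed numbers is one way to read off what was learned from $D$.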

Bayesian Model Comparison

  • There appears to be a remaining problem: How good really were our model assumptions $p(x|\theta)$ and $p(\theta)$?
  • Technically, this is a model comparison problem
  • [Q.] What if I have more candidate models, say $\mathcal{M} = \{m_1,\ldots,m_K\}$, where each model relates to a specific prior $p(\theta|m_k)$ and likelihood $p(D|\theta,m_k)$? Can we evaluate the relative performance of a model against another model from the set?
  • [A.]: Start again with model specification. Specify a prior $p(m_k)$ for each of the models and then solve the desired inference problem:
    $$\begin{align*} p(m_k|D) &= \frac{p(D|m_k) p(m_k)}{p(D)} \\ &\propto p(m_k) \cdot p(D|m_k) \\ &= p(m_k)\cdot \int_\theta p(D,\theta|m_k) \,\mathrm{d}\theta\\ &= \underbrace{p(m_k)}_{\substack{\text{model}\\\text{prior}}}\cdot \int_\theta \underbrace{p(D|\theta,m_k)}_{\text{likelihood}} \,\underbrace{p(\theta|m_k)}_{\text{prior}}\, \mathrm{d}\theta\\ \end{align*}$$
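As a small worked sketch of this recipe, assume two candidate coin models that share a Bernoulli likelihood but differ in their prior: Beta(1,1) for $m_1$ and Beta(5,5) for $m_2$. For such models the integral over $\theta$ has the standard closed form $p(D|m_k) = B(n+\alpha_k, N-n+\beta_k)/B(\alpha_k,\beta_k)$, where $B$ is the Beta function; that closed form, the data and the helper names are assumptions that are not derived in these notes.

```julia
# Exact model comparison for two Beta-Bernoulli coin models (illustrative assumption).

# Beta function for positive integer arguments, as an exact rational number.
betafn(a::Int, b::Int) = (factorial(a - 1) * factorial(b - 1)) // factorial(a + b - 1)

# Evidence p(D|m_k) for n heads in N tosses under a Beta(α, β) prior.
evidence(n, N, α, β) = betafn(n + α, N - n + β) / betafn(α, β)

n, N = 4, 7                        # data: 4 heads in 7 tosses
models = [(1, 1), (5, 5)]          # hyperparameters (α_k, β_k) of m_1 and m_2
model_prior = [1 // 2, 1 // 2]     # p(m_k): no preference between the models

unnorm = [evidence(n, N, α, β) * p for ((α, β), p) in zip(models, model_prior)]
model_posterior = unnorm ./ sum(unnorm)    # p(m_k|D)

println(model_posterior)
```

Rational arithmetic keeps the result exact here; for larger counts the factorials overflow `Int`, so a practical implementation would work with log-evidences instead.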

Bayesian Model Comparison (continued)

  • You, the engineer, have to choose the factors $p(D|\theta,m_k)$, $p(\theta|m_k)$ and $p(m_k)$. After that, for a given data set $D$, the model posterior $p(m_k|D)$ can be computed.
  • If you need to work with one model, select the model with the largest posterior $p(m_k|D)$.
  • Alternatively, if you don't want to choose a model, you can do prediction by Bayesian model averaging to utilize the predictive power of all models: $$\begin{align*} p(x|D) &= \sum_k \int p(x,\theta,m_k|D)\,\mathrm{d}\theta \\ &= \sum_k \underbrace{p(m_k|D)}_{\substack{\text{model}\\\text{posterior}}} \cdot \int \underbrace{p(\theta|D,m_k)}_{\substack{\text{parameter}\\\text{posterior}}} \, \underbrace{p(x|\theta,m_k)}_{\text{likelihood}} \,\mathrm{d}\theta \end{align*}$$
  • $\Rightarrow$ In a Bayesian framework, model comparison follows the same recipe as parameter estimation; it just works at one higher hierarchical level.
  • More on this in part 2 (Tjalkens).
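A sketch of Bayesian model averaging for the coin problem, under the same kind of illustrative assumptions: two Beta-Bernoulli models with priors Beta(1,1) and Beta(5,5), and model posteriors that are taken as given here (the values used are the ones that result from equal model priors and 4 heads in 7 tosses). The per-model prediction uses the closed form $(n+\alpha_k)/(N+\alpha_k+\beta_k)$ for this model family.

```julia
# Bayesian model averaging for the probability of heads on the next toss.
n, N = 4, 7                                   # 4 heads in 7 tosses
models = [(1, 1), (5, 5)]                     # (α_k, β_k), illustrative choices
model_posterior = [143 // 388, 245 // 388]    # p(m_k|D), assumed given; sums to 1

# Per-model prediction ∫ p(x=h|μ) p(μ|D, m_k) dμ = (n + α_k) / (N + α_k + β_k).
per_model = [(n + α) // (N + α + β) for (α, β) in models]

# Average the per-model predictions, weighted by the model posteriors.
bma = sum(model_posterior .* per_model)

println(per_model)
println(bma)
```

Because the weights are non-negative and sum to one, the averaged prediction always lies between the smallest and the largest per-model prediction.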

Machine Learning and the Scientific Method Revisited

  • Bayesian probability theory provides a unified framework for information processing (and even the Scientific Method).

Now Solve the Example Problem: Predicting a Coin Toss

  • We observe the following sequence of heads (h) and tails (t) when tossing the same coin repeatedly $$D=\{hthhtth\}\,.$$

  • What is the probability that heads (h) comes up next? We solve this in the next slides ...

Coin toss example (1): Model Specification

We observe a sequence of $N$ coin tosses $D=\{x_1,\ldots,x_N\}$ with $n$ heads.

  • Assume a Bernoulli distributed variable $p(x_k=h|\mu)=\mu$, which leads to a likelihood of binomial form (assume that heads came up $n$ times; the binomial coefficient is absent because $D$ is one specific ordered sequence): $$ p(D|\mu) = \prod_{k=1}^N p(x_k|\mu) = \mu^n (1-\mu)^{N-n} $$
  • Assume the prior belief is governed by a beta distribution $$ p(\mu) = \mathcal{B}(\mu|\alpha,\beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} \mu^{\alpha-1}(1-\mu)^{\beta-1} $$
  • The Beta distribution is a conjugate prior for the Binomial distribution, which means that $$ \text{beta} \propto \text{binomial} \times \text{beta} $$
  • $\alpha$ and $\beta$ are called hyperparameters, since they parameterize the distribution for another parameter ($\mu$). E.g., $\alpha=\beta=1$ (uniform).

Coin toss example (2): Parameter estimation

  • Infer posterior PDF over $\mu$ through Bayes rule

$$\begin{align*} p(\mu|D) &\propto p(D|\mu)\,p(\mu|\alpha,\beta) \\ &= \mu^n (1-\mu)^{N-n} \times \mu^{\alpha-1} (1-\mu)^{\beta-1} \\ &= \mu^{n+\alpha-1} (1-\mu)^{N-n+\beta-1} \end{align*}$$

hence the posterior is also beta distributed as

$$ p(\mu|D) = \mathcal{B}(\mu|\,n+\alpha, N-n+\beta) $$

  • Essentially, here ends the machine learning activity
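Because the prior is conjugate, learning reduces to updating two numbers. The sketch below encodes the tosses as a string of `'h'` and `'t'` characters (an illustrative choice, as is the function name `update`) and also checks that processing the tosses one at a time yields the same posterior as processing them all at once:

```julia
# Conjugate update: Beta(α, β) prior + data → Beta(α + n, β + N - n) posterior.
function update(α, β, tosses::AbstractString)
    n = count(c -> c == 'h', tosses)   # number of heads
    N = length(tosses)                 # number of tosses
    return (α + n, β + N - n)
end

batch = update(1, 1, "hthhtth")        # all data at once, uniform prior

# Toss by toss: each posterior serves as the prior for the next toss.
sequential = foldl((ab, c) -> update(ab[1], ab[2], string(c)), "hthhtth"; init = (1, 1))

println(batch)         # hyperparameters of the posterior Beta distribution
println(sequential)
```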

Coin Toss Example (3): Prediction

  • Now, we want to use the trained model. Let's use it to predict future observations.
  • Marginalize over the parameter posterior to get the predictive PDF for a new coin toss $x_\bullet$, given the data $D$,

$$\begin{align*} p(x_\bullet=h|D) &= \int_0^1 p(x_\bullet=h|\mu)\,p(\mu|D) \,\mathrm{d}\mu \\ &= \int_0^1 \mu \times \mathcal{B}(\mu|\,n+\alpha, N-n+\beta) \,\mathrm{d}\mu \\ &= \frac{n+\alpha}{N+\alpha+\beta} \qquad \mbox{(a.k.a. Laplace rule)}\hfill \end{align*}$$

  • Finally, we're ready to solve our example problem: for $D=\{hthhtth\}$ and uniform prior ($\alpha=\beta=1$), we get

$$ p(x_\bullet=h|D)=\frac{n+1}{N+2} = \frac{4+1}{7+2} = \frac{5}{9}$$
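The same arithmetic as a short sketch, using exact rational numbers. The string encoding of the tosses and the name `predict_heads` are illustrative; the Beta(5,5) prior is included only for comparison.

```julia
# Predictive probability of heads, (n + α) / (N + α + β), computed from raw tosses.
function predict_heads(tosses::AbstractString, α::Int, β::Int)
    n = count(c -> c == 'h', tosses)   # number of heads
    N = length(tosses)                 # number of tosses
    return (n + α) // (N + α + β)
end

println(predict_heads("hthhtth", 1, 1))   # uniform prior: 5//9
println(predict_heads("hthhtth", 5, 5))   # prior concentrated around μ = 0.5: 9//17
println(predict_heads("", 1, 1))          # no data: the prior mean, 1//2
```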

Coin Toss Example: What did we learn?

  • What did we learn from the data? Before seeing any data, we think that $$p(x_\bullet=h)=\left. p(x_\bullet=h|D) \right|_{n=N=0} = \frac{\alpha}{\alpha + \beta}\,.$$
  • After the $N$ coin tosses, we think that $p(x_\bullet=h|D) = \frac{n+\alpha}{N+\alpha+\beta}$.
  • Note the following decomposition

$$\begin{align*} p(x_\bullet=h|\,D) &= \frac{n+\alpha}{N+\alpha+\beta} = \frac{n}{N+\alpha+\beta} + \frac{\alpha}{N+\alpha+\beta} \\ &= \frac{N}{N+\alpha+\beta}\cdot \frac{n}{N} + \frac{\alpha+\beta}{N+\alpha+\beta} \cdot \frac{\alpha}{\alpha+\beta} \\ &= \underbrace{\frac{\alpha}{\alpha+\beta}}_{prior} + \underbrace{\frac{N}{N+\alpha+\beta}}_{gain}\cdot \big( \underbrace{\frac{n}{N}}_{MLE} - \underbrace{\frac{\alpha}{\alpha+\beta}}_{prior} \big) \end{align*}$$

  • Note that, since $0\leq\text{gain}\lt 1$, the Bayesian estimate lies between prior and maximum likelihood estimate.
  • For large $N$, the gain goes to $1$ and $p(x_\bullet=h|D)$ goes to the maximum likelihood estimate (the relative frequency) $n/N$.
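The decomposition can be checked numerically. This sketch keeps the fraction of heads fixed at $4/7$ while the data set grows, so that the effect of the gain is visible in isolation; the function name `decompose` and the chosen data sizes are illustrative.

```julia
# Prediction written as prior + gain × (MLE - prior).
function decompose(n, N, α, β)
    prior_mean = α / (α + β)
    gain = N / (N + α + β)
    mle = n / N                        # relative frequency; requires N > 0
    return prior_mean, gain, mle, prior_mean + gain * (mle - prior_mean)
end

# Keep the fraction of heads at 4/7 and grow the data set (uniform prior).
for scale in (1, 10, 100, 1000)
    n, N = 4 * scale, 7 * scale
    prior_mean, gain, mle, prediction = decompose(n, N, 1, 1)
    println("N = $N: gain = $(round(gain, digits = 4)), ",
            "prediction = $(round(prediction, digits = 4)), MLE = $(round(mle, digits = 4))")
end
```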


Bayesian evolution of $p(\mu|D)$ for the coin toss

Let's see how $p(\mu|D)$ evolves as we increase the number of coin tosses $N$. We'll use two different priors to demonstrate the effect of the prior on the posterior (set $N=0$ to inspect the prior).

In [1]:
using Reactive, Interact, PyPlot, Distributions
f = figure()
μ_grid = range(0, stop=1, length=100) # Grid of candidate values for μ
μ = 0.4 # True probability of heads (unknown to the model)
samples = rand(192) .<= μ # Flip 192 coins
@manipulate for N=0:1:192; withfig(f) do
        n = sum(samples[1:N]) # Count number of heads in first N flips
        posterior1 = Beta(1+n, 1+(N-n)) # Posterior under the uniform prior Beta(1,1)
        posterior2 = Beta(5+n, 5+(N-n)) # Posterior under the prior Beta(5,5)
        plot(μ_grid, pdf.(posterior1,μ_grid), "k-")
        plot(μ_grid, pdf.(posterior2,μ_grid), "k--")
        xlabel(L"\mu"); ylabel(L"p(\mu|\mathcal{D})"); grid()
        title(L"p(\mu|\mathcal{D})"*" for N=$(N), n=$(n) (real \$\\mu\$=$(μ))")
        legend(["Based on uniform prior "*L"B(1,1)","Based on prior "*L"B(5,5)"], loc=4)
    end
end

$\Rightarrow$ With more data, the relevance of the prior diminishes!

From Posterior to Point-Estimate

  • Sometimes we want just one 'best' parameter (vector), rather than a posterior distribution over parameters. Why?
  • Recall Bayesian prediction

$$ p(x|D) = \int p(x|\theta)p(\theta|D)\,\mathrm{d}{\theta} $$

  • If we approximate posterior $p(\theta|D)$ by a delta function for one 'best' value $\hat\theta$, then the predictive distribution collapses to

$$ p(x|D) \approx \int p(x|\theta)\,\delta(\theta-\hat\theta)\,\mathrm{d}{\theta} = p(x|\hat\theta) $$

  • This is the model $p(x|\theta)$ evaluated at $\theta=\hat\theta$.
  • Note that $p(x|\hat\theta)$ is much easier to evaluate than the integral for full Bayesian prediction.
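What the delta approximation gives up can be seen in the coin model. The sketch below assumes that the point estimate is the relative frequency $\hat\mu = n/N$ and that the prior is uniform; it compares the plug-in prediction $p(x=h|\hat\mu)$ with the full Bayesian prediction $(n+\alpha)/(N+\alpha+\beta)$ for a few made-up data sets. Function names are hypothetical.

```julia
# Plug-in prediction versus full Bayesian prediction for the coin model.
plugin_prediction(n, N) = n // N                                # p(x=h|μ̂) with μ̂ = n/N
bayes_prediction(n, N; α = 1, β = 1) = (n + α) // (N + α + β)    # marginalizes over μ

for (n, N) in [(1, 1), (4, 7), (400, 700)]
    println("n = $n, N = $N: plug-in = $(plugin_prediction(n, N)), ",
            "Bayes = $(bayes_prediction(n, N))")
end
```

After a single toss that came up heads, the plug-in prediction is already certain of heads, whereas the Bayesian prediction is $2/3$; with 700 tosses the two nearly coincide.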

Some Well-known Point-Estimates

  • Bayes estimate

$$ \hat \theta_{bayes} = \int \theta \, p\left( \theta |D \right) \,\mathrm{d}{\theta} $$

  • (homework). Prove that the Bayes estimate minimizes the expected mean-square error, i.e., prove that

$$ \hat \theta_{bayes} = \arg\min_{\hat \theta} \int_\theta (\hat \theta -\theta)^2 p \left( \theta |D \right) \,\mathrm{d}{\theta} $$

  • Maximum A Posteriori (MAP) estimate $$ \hat \theta_{\text{map}}= \arg\max _{\theta} p\left( \theta |D \right) = \arg \max_{\theta} p\left(D |\theta \right) \, p\left(\theta \right) $$
  • Maximum Likelihood (ML) estimate $$ \hat \theta_{ml} = \arg \max_{\theta} p\left(D |\theta\right) $$
    • Note that Maximum Likelihood is MAP with uniform prior
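For the coin example, all three estimates are available in closed form. The sketch below uses the standard expressions for the mean and the mode of a Beta$(a,b)$ distribution, which are not derived in these notes; the mode formula requires $a>1$ and $b>1$.

```julia
# Point estimates for the coin example with posterior Beta(a, b),
# where a = n + α and b = N - n + β.
n, N, α, β = 4, 7, 1, 1
a, b = n + α, N - n + β

θ_bayes = a // (a + b)              # posterior mean
θ_map   = (a - 1) // (a + b - 2)    # posterior mode (needs a > 1 and b > 1)
θ_ml    = n // N                    # maximizer of μ^n (1-μ)^(N-n)

println((θ_bayes, θ_map, θ_ml))
println(θ_map == θ_ml)              # with a uniform prior, MAP and ML coincide
```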

Bayesian vs Maximum Likelihood Learning

Consider the task: predict a datum $x$ from an observed data set $D$.

  1. Model specification
    • Bayesian: Choose a model $m$ with data generating distribution $p(x|\theta,m)$ and parameter prior $p(\theta|m)$.
    • Maximum Likelihood: Choose a model $m$ with the same data generating distribution $p(x|\theta,m)$. No need for priors.
  2. Learning
    • Bayesian: Use Bayes rule to find the parameter posterior, $$ p(\theta|D) \propto p(D|\theta) p(\theta) $$
    • Maximum Likelihood: By Maximum Likelihood (ML) optimization, $$ \hat \theta = \arg \max_{\theta} p(D |\theta) $$
  3. Prediction
    • Bayesian: $$ p(x|D) = \int p(x|\theta) p(\theta|D) \,\mathrm{d}\theta $$
    • Maximum Likelihood: $$ p(x|D) = p(x|\hat\theta) $$

Report Card on Maximum Likelihood Estimation

  • Maximum Likelihood (ML) is MAP with uniform prior, or MAP is 'penalized' ML $$ \hat \theta_{map} = \arg \max _\theta \{ \overbrace{\log p\left( D|\theta \right)}^{\mbox{log-likelihood}} + \overbrace{\log p\left( \theta \right)}^{\mbox{penalty}} \} $$
  • (good!). Works rather well if we have a lot of data because the influence of the prior diminishes with more data.
  • (bad). Cannot be used for model comparison. E.g., the best model generally does not correspond to the largest likelihood (see part 2, Tjalkens).
  • (good). Computationally often doable. Useful fact (since $\log$ is monotonically increasing): $$\arg\max_\theta \log p(D|\theta) = \arg\max_\theta p(D|\theta)$$

$\Rightarrow$ ML estimation is an approximation to Bayesian learning, but it is, for good reason, a very popular learning method when a lot of data is available.
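The "useful fact" about the logarithm can be illustrated with a grid search for the coin likelihood. The data sizes and the grid are illustrative choices; the remark about floating-point underflow is an added practical observation, not something claimed in the notes above.

```julia
# The log-likelihood has the same maximizer as the likelihood.
# Illustrative setting: n heads in N tosses, grid search over μ.
lik(μ, n, N) = μ^n * (1 - μ)^(N - n)
loglik(μ, n, N) = n * log(μ) + (N - n) * log(1 - μ)

μs = 0.001:0.001:0.999

# Small data set: both criteria select the same grid point, close to n/N.
n, N = 4, 7
println(μs[argmax([lik(μ, n, N) for μ in μs])])
println(μs[argmax([loglik(μ, n, N) for μ in μs])])

# Large data set: in Float64 the likelihood underflows to 0.0 on the whole grid,
# while the log-likelihood stays finite and still identifies the maximizer.
n, N = 4000, 7000
println(maximum(lik(μ, n, N) for μ in μs))
println(μs[argmax([loglik(μ, n, N) for μ in μs])])
```

This is one reason why ML implementations typically optimize the log-likelihood rather than the likelihood itself.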
