Discrete Data and the Multinomial Distribution

Preliminaries

  • Goal
    • Simple Bayesian and maximum likelihood-based density estimation for discretely valued data sets
  • Materials
    • Mandatory
      • These lecture notes
    • Optional
      • Bishop pp. 67-70, 74-76, 93-94

Discrete Data: the 1-of-K Coding Scheme

  • Consider a coin-tossing experiment with outcomes $x \in\{0,1\}$ ($0$ for tails, $1$ for heads) and let $0\leq \mu \leq 1$ represent the probability of heads. This model can be written as a Bernoulli distribution: $$ p(x|\mu) = \mu^{x}(1-\mu)^{1-x} $$
    • Note that the variable $x$ acts as a (binary) selector for the tail or head probabilities. Think of this as an 'if'-statement in programming.
  • Now consider a $K$-sided coin (e.g., a six-faced die (pl.: dice)). How should we encode outcomes?
  • Option 1: $x \in \{1,2,\ldots,K\}$.

    • E.g., for $K=6$, if the die lands on the 3rd face $\,\Rightarrow x=3$.
  • Option 2: $x=(x_1,\ldots,x_K)^T$ with binary selection variables $$ x_k = \begin{cases} 1 & \text{if die landed on $k$th face}\\ 0 & \text{otherwise} \end{cases} $$

    • E.g., for $K=6$, if the die lands on the 3rd face $\,\Rightarrow x=(0,0,1,0,0,0)^T$.
    • This coding scheme is called a 1-of-K or one-hot coding scheme.
  • It turns out that the one-hot coding scheme is mathematically more convenient!
  • Consider a $K$-sided die. We use a one-hot coding scheme. Assume the probabilities $p(x_k=1) = \mu_k$ with $\sum_k \mu_k = 1$. The data generating distribution is then (note the similarity to the Bernoulli distribution)
$$ p(x|\mu) = \mu_1^{x_1} \mu_2^{x_2} \cdots \mu_K^{x_K}=\prod_{k=1}^K \mu_k^{x_k} \tag{B-2.26} $$
  • This generalized Bernoulli distribution is called the categorical distribution (or sometimes the 'multi-noulli' distribution).
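
A minimal numerical check of (B-2.26) for a (hypothetical) loaded six-sided die, with made-up face probabilities and a one-hot coded outcome:

μ = [0.1, 0.1, 0.4, 0.1, 0.2, 0.1]     # face probabilities (sum to 1); illustrative values
x = [0, 0, 1, 0, 0, 0]                 # one-hot code: the die landed on the 3rd face

p = prod(μ .^ x)                       # ∏_k μ_k^{x_k}
p == μ[3]                              # true: x simply selects the probability of the 3rd face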

Bayesian Density Estimation for a Loaded Die

  • Now let's proceed with Bayesian density estimation $p(x|\theta)$ for an observed data set $D=\{x_1,\ldots,x_N\}$ of $N$ IID rolls of a $K$-sided die, with
$$ x_{nk} = \begin{cases} 1 & \text{if the $n$th throw landed on $k$th face}\\ 0 & \text{otherwise} \end{cases} $$

Model specification
  • The data-generating distribution is $$ p(D|\mu) = \prod_n \prod_k \mu_k^{x_{nk}} = \prod_k \mu_k^{\sum_n x_{nk}} = \prod_k \mu_k^{m_k} \tag{B-2.29} $$ where $m_k= \sum_n x_{nk}$ is the total number of throws that landed on the $k$th face. Note that $\sum_k m_k = N$.
    • This distribution depends on the observations only through the quantities $\{m_k\}$.
  • We need a prior for the parameters $\mu = (\mu_1,\mu_2,\ldots,\mu_K)$. In the binary coin-toss example, we used a beta distribution, which was conjugate to the binomial likelihood and required us to choose prior pseudo-counts.

  • The generalization of the beta prior to the $K$ parameters $\{\mu_k\}$ is the Dirichlet distribution: $$ p(\mu|\alpha) = \mathrm{Dir}(\mu|\alpha) = \frac{\Gamma\left(\sum_k \alpha_k\right)}{\Gamma(\alpha_1)\cdots \Gamma(\alpha_K)} \prod_{k=1}^K \mu_k^{\alpha_k-1} $$

    • Again, you can interpret $\alpha_k$ as the prior number of (pseudo-)observations that the die landed on the $k$th face.
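
As a small sketch (assuming the Distributions.jl package and an arbitrary choice of pseudo-counts $\alpha_k=2$ for every face), a Dirichlet prior for a six-sided die can be constructed and inspected as follows:

using Distributions, Statistics

α = fill(2.0, 6)                 # prior pseudo-counts (illustrative choice)
prior = Dirichlet(α)             # p(μ|α)

mean(prior)                      # prior mean: α_k / Σ_k α_k = 1/6 for every face
rand(prior)                      # one draw of μ: non-negative and sums to 1
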
Inference for $\{\mu_k\}$
  • The posterior for $\{\mu_k\}$ can be obtained through Bayes rule:
$$\begin{align*} p(\mu|D,\alpha) &\propto p(D|\mu) \cdot p(\mu|\alpha) \\ &\propto \prod_k \mu_k^{m_k} \cdot \prod_k \mu_k^{\alpha_k-1} \\ &= \prod_k \mu_k^{\alpha_k + m_k -1}\\ &\propto \mathrm{Dir}\left(\mu\,|\,\alpha + m \right) \tag{B-2.41} \\ &= \frac{\Gamma\left(\sum_k (\alpha_k + m_k) \right)}{\Gamma(\alpha_1+m_1) \Gamma(\alpha_2+m_2) \cdots \Gamma(\alpha_K + m_K)} \prod_{k=1}^K \mu_k^{\alpha_k + m_k -1} \end{align*}$$

where $m = (m_1,m_2,\ldots,m_K)^T$.

  • Again, we recognize the $\alpha_k$'s as prior pseudo-counts, and the Dirichlet distribution turns out to be a conjugate prior to the categorical/multinomial likelihood (a short numerical sketch follows after these relations):
$$\begin{align*} \underbrace{\text{Dirichlet}}_{\text{posterior}} &\propto \underbrace{\text{categorical}}_{\text{likelihood}} \cdot \underbrace{\text{Dirichlet}}_{\text{prior}} \end{align*}$$
  • This is actually a generalization of the conjugate relation that we found for the binary coin toss:
$$\begin{align*} \underbrace{\text{Beta}}_{\text{posterior}} &\propto \underbrace{\text{binomial}}_{\text{likelihood}} \cdot \underbrace{\text{Beta}}_{\text{prior}} \end{align*}$$
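
A short sketch of this conjugate update (again assuming Distributions.jl; the counts below are purely illustrative):

using Distributions, Statistics

α = fill(2.0, 6)                     # prior pseudo-counts
m = [3, 7, 2, 5, 4, 9]               # observed face counts, N = sum(m) = 30 tosses

posterior = Dirichlet(α .+ m)        # p(μ|D,α) = Dir(μ|α + m), cf. B-2.41
mean(posterior)                      # posterior mean: (α_k + m_k) / (N + Σ_k α_k)
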
Prediction of next toss for the loaded die
  • Let's apply what we have learned about the loaded die to compute the probability that we throw the $k$th face at the next toss (a numerical sketch follows after this list).
$$\begin{align*} p(x_{\bullet,k}=1|D) &= \int p(x_{\bullet,k}=1|\mu)\,p(\mu|D) \,\mathrm{d}\mu \\ &= \int \mu_k \cdot \mathrm{Dir}(\mu\,|\,\alpha+m) \,\mathrm{d}\mu \\ &= \mathrm{E}\left[ \mu_k \right] \\ &= \frac{m_k + \alpha_k }{ N+ \sum_k \alpha_k} \end{align*}$$
  • In the above derivation, we noticed that the data generating distribution for $N$ die tosses $D=\{x_1,\ldots,x_N\}$ only depends on the data frequencies $m_k$: $$ p(D|\mu) = \prod_n \underbrace{\prod_k \mu_k^{x_{nk}}}_{\text{categorical dist.}} = \prod_k \mu_k^{\sum_n x_{nk}} = \prod_k \mu_k^{m_k} \tag{B-2.29} $$
  • A related distribution is the distribution over data frequency observations $D_m=\{m_1,\ldots,m_K\}$, which is called the multinomial distribution, $$ p(D_m|\mu) =\frac{N!}{m_1! m_2!\ldots m_K!} \,\prod_k \mu_k^{m_k}\,. $$
  • Verify for yourself that (Exercise):
    • the categorical distribution is a special case of the multinomial for $N=1$.
    • the Bernoulli is a special case of the categorical distribution for $K=2$.
    • the binomial is a special case of the multinomial for $K=2$.
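
A quick numerical sketch of the predictive rule derived above, with illustrative counts and pseudo-counts:

α = fill(1.0, 6)                     # prior pseudo-counts
m = [3, 7, 2, 5, 4, 9]               # observed face counts
N = sum(m)

p_next = (m .+ α) ./ (N + sum(α))    # p(x_{•,k}=1|D) for every face k; sums to 1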

Maximum Likelihood Estimation for the Multinomial

Maximum likelihood as a special case of Bayesian estimation
  • We can get the maximum likelihood estimate $\hat{\mu}_k$ for $\mu_k$, based on $N$ throws of a $K$-sided die, from the Bayesian framework by using a uniform prior for $\mu$ and taking the mode of the posterior for $\mu$: $$\begin{align*} \hat{\mu}_k &= \arg\max_{\mu_k} p(D|\mu) \\ &= \arg\max_{\mu_k} p(D|\mu)\cdot \mathrm{Uniform}(\mu) \\ &= \arg\max_{\mu_k} p(D|\mu) \cdot \left.\mathrm{Dir}(\mu|\alpha)\right|_{\alpha=(1,1,\ldots,1)} \\ &= \arg\max_{\mu_k} \left.p(\mu|D,\alpha)\right|_{\alpha=(1,1,\ldots,1)} \\ &= \arg\max_{\mu_k} \left.\mathrm{Dir}\left( \mu | m + \alpha \right)\right|_{\alpha=(1,1,\ldots,1)} \\ &= \frac{m_k}{\sum_k m_k} = \frac{m_k}{N} \end{align*}$$ where we used the fact that the mode of the Dirichlet distribution $\mathrm{Dir}(\mu|\alpha)$ has components $(\alpha_k-1)/\left(\sum_{k'}\alpha_{k'} - K\right)$.
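
As a sanity check (assuming Distributions.jl; the counts are illustrative), the mode of the Dirichlet posterior under a uniform prior indeed reproduces the sample proportions:

using Distributions

m = [3, 7, 2, 5, 4, 9]               # observed face counts
α = ones(6)                          # uniform prior: α_k = 1 for all k

mode(Dirichlet(α .+ m))              # posterior mode
m ./ sum(m)                          # ML estimate m_k / N: the same vector
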
Maximum likelihood estimation by optimizing a constrained log-likelihood
  • Of course, we shouldn't have to go through the full Bayesian framework to get the maximum likelihood estimate. Alternatively, we can find the maximum of the likelihood directly.

  • The log-likelihood for the multinomial distribution is given by

$$\begin{align*} \mathrm{L}(\mu) &\triangleq \log p(D_m|\mu) \propto \log \prod_k \mu_k^{m_k} = \sum_k m_k \log \mu_k \end{align*}$$

  • When doing ML estimation, we must obey the constraint $\sum_k \mu_k = 1$, which can be accomplished by a Lagrange multiplier (see Bishop App.E). The augmented log-likelihood with Lagrange multiplier is then
$$ \mathrm{L}^\prime(\mu) = \sum_k m_k \log \mu_k + \lambda \cdot \left(1 - \sum_k \mu_k \right) $$
  • Setting the derivative to zero yields the sample proportion for $\mu_k$: $$\begin{equation*} \nabla_{\mu_k} \mathrm{L}^\prime = \frac{m_k }{\hat\mu_k } - \lambda \overset{!}{=} 0 \; \Rightarrow \; \hat\mu_k = \frac{m_k}{\lambda} = \frac{m_k }{N} \end{equation*}$$ where we obtain $\lambda = N$ from the constraint $$\begin{equation*} \sum_k \hat \mu_k = \sum_k \frac{m_k} {\lambda} = \frac{N}{\lambda} \overset{!}{=} 1 \end{equation*}$$
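
A quick (and of course non-exhaustive) numerical check that the sample proportion attains a higher log-likelihood than an alternative point on the simplex (illustrative counts):

m = [3, 7, 2, 5, 4, 9]               # observed face counts
L(μ) = sum(m .* log.(μ))             # log-likelihood Σ_k m_k log μ_k

μ_ml  = m ./ sum(m)                  # sample proportions
μ_alt = fill(1/6, 6)                 # an alternative point on the simplex

L(μ_ml) > L(μ_alt)                   # true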

Recap Maximum Likelihood Estimation for Gaussian and Multinomial Distributions

Given $N$ IID observations $D=\{x_1,\dotsc,x_N\}$.

  • For a multivariate Gaussian model $p(x_n|\theta) = \mathcal{N}(x_n|\mu,\Sigma)$, we obtain ML estimates
$$\begin{align} \hat \mu &= \frac{1}{N} \sum_n x_n \tag{sample mean} \\ \hat \Sigma &= \frac{1}{N} \sum_n (x_n-\hat\mu)(x_n - \hat \mu)^T \tag{sample covariance} \end{align}$$
  • For discrete outcomes modeled by a 1-of-K categorical distribution we find
$$\begin{align} \hat\mu_k = \frac{1}{N} \sum_n x_{nk} \quad \left(= \frac{m_k}{N} \right) \tag{sample proportion} \end{align}$$
  • Note the similarity between the ML estimates of the mean for continuous and discrete data: both are averages over the observations. A short numerical sketch follows below.
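
A short sketch with simulated data (all numbers are illustrative) that computes both types of ML estimates:

# Continuous data: 1000 samples from a 2-D standard normal
X = randn(2, 1000)                           # one observation per column
μ_ml = vec(sum(X, dims=2)) ./ 1000           # sample mean
Σ_ml = (X .- μ_ml) * (X .- μ_ml)' ./ 1000    # sample covariance (note 1/N, not 1/(N-1))

# Discrete data: face counts of a die
m = [3, 7, 2, 5, 4, 9]
μk_ml = m ./ sum(m)                          # sample proportions m_k / N, with N = sum(m)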

OPTIONAL SLIDES

Some Useful Matrix Calculus

When doing derivatives with matrices, e.g. for maximum likelihood estimation, it is helpful to be familiar with some matrix calculus. We briefly recapitulate the formulas that are used here.

  • We define the gradient of a scalar function $f(A)$ w.r.t. an $n \times k$ matrix $A$ as $$ \nabla_A f \triangleq \begin{bmatrix} \frac{\partial{f}}{\partial a_{11}} & \frac{\partial{f}}{\partial a_{12}} & \cdots & \frac{\partial{f}}{\partial a_{1k}}\\ \frac{\partial{f}}{\partial a_{21}} & \frac{\partial{f}}{\partial a_{22}} & \cdots & \frac{\partial{f}}{\partial a_{2k}}\\ \vdots & \vdots & \cdots & \vdots\\ \frac{\partial{f}}{\partial a_{n1}} & \frac{\partial{f}}{\partial a_{n2}} & \cdots & \frac{\partial{f}}{\partial a_{nk}} \end{bmatrix} $$
  • The following formulas are useful (see Bishop App.-C) $$\begin{align*} |A^{-1}|&=|A|^{-1} \tag{B-C.4} \\ \nabla_A \log |A| &= (A^{T})^{-1} = (A^{-1})^T \tag{B-C.28} \\ \mathrm{Tr}[ABC]&= \mathrm{Tr}[CAB] = \mathrm{Tr}[BCA] \tag{B-C.9} \\ \nabla_A \mathrm{Tr}[AB] &=\nabla_A \mathrm{Tr}[BA]= B^T \tag{B-C.25} \\ \nabla_A \mathrm{Tr}[ABA^T] &= A(B+B^T) \tag{B-C.27}\\ \nabla_x x^TAx &= (A+A^T)x \tag{from B-C.27}\\ \nabla_X a^TXb &= \nabla_X \mathrm{Tr}[ba^TX] = ab^T \notag \end{align*}$$
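
As a sketch, identity (B-C.28) can be checked numerically with a forward finite-difference approximation (the matrix and step size below are arbitrary):

using LinearAlgebra

A = [2.0 0.5; 0.3 1.5]                       # a small invertible matrix
ϵ = 1e-6                                     # finite-difference step size
G = zeros(2, 2)                              # numerical gradient of log|A|
for i in 1:2, j in 1:2
    E = zeros(2, 2); E[i, j] = ϵ
    G[i, j] = (logdet(A + E) - logdet(A)) / ϵ
end

maximum(abs.(G .- transpose(inv(A))))        # small (on the order of ϵ), confirming B-C.28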