Material for a University of Illinois course offered by the Physics Department. This content is maintained on GitHub and is distributed under a BSD3 license.
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
import scipy.stats
We construct a probability space by assigning a numerical probability in the range $[0,1]$ to sets of outcomes (events) in some space.
When outcomes are the result of an uncertain but repeatable process, probabilities can always be measured to arbitrary accuracy by simply observing many repetitions of the process and calculating the frequency at which each event occurs. These frequentist probabilities have an appealing objective reality to them.
DISCUSS: How might you assign a frequentist probability to statements like:
You cannot (if we assume that these are universal constants), since that would require a measurable process whose outcomes had different values for a universal constant.
The inevitable conclusion is that the statements we are most interested in cannot be assigned frequentist probabilities.
However, if we allow probabilities to also measure your subjective "degree of belief" in a statement, then we can use the full machinery of probability theory to discuss more interesting statements. These are called Bayesian probabilities.
Roughly speaking, the choice is between:
Bayesian statistics starts from a joint probability distribution
$$ \Large P(D, \Theta_M, M) $$
over the data $D$, the model parameters $\Theta_M$, and the hyperparameters $M$.
The subscript on $\Theta_M$ is to remind us that, in general, the set of parameters being used depends on the hyperparameters (e.g., increasing n_components
adds parameters for the new components). We will sometimes refer to the pair $(\Theta_M, M)$ as the model.
This joint probability implies that model parameters and hyperparameters are random variables, which in turn means that they label possible outcomes in our underlying probability space.
For a concrete example, consider the possible outcomes necessary to discuss the statement "the electron spin is 1/2", which must be labeled by the following random variables:
A table of random-variable values for possible outcomes would then look like:
$M$ | $\Theta_M$ | $D$ |
---|---|---|
boson | 0 | 0 |
fermion | 1/2 | -1/2 |
fermion | 1/2 | +1/2 |
boson | 1 | -1 |
boson | 1 | 0 |
boson | 1 | +1 |
... | ... | ... |
Only two of these outcomes occur in our universe, but a Bayesian approach requires us to broaden the sample space from "all possible outcomes" to "all possible outcomes in all possible universes".
The likelihood ${\cal L}_M(\Theta_M, D)$ is a function of model parameters $\Theta_M$ (given hyperparameters $M$) and data features $D$, and measures the probability (density) of observing the data given the model. For example, a Gaussian mixture model has the likelihood function:
$$ \Large {\cal L}_M\left(\mathbf{\Theta}_M, \vec{x} \right) = \sum_{k=1}^{K}\, \omega_k G(\vec{x} ; \vec{\mu}_k, C_k) \; , $$
with parameters
$$ \Large \begin{aligned} \mathbf{\Theta}_M = \big\{ &\omega_1, \omega_2, \ldots, \omega_K, \\ &\vec{\mu}_1, \vec{\mu}_2, \ldots, \vec{\mu}_K, \\ &C_1, C_2, \ldots, C_K \big\} \end{aligned} $$
and hyperparameter $K$. Note that the likelihood must be normalized over the data for any values of the (fixed) parameters and hyperparameters. However, it is not normalized over the parameters or hyperparameters.
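As a quick numerical check, this likelihood can be evaluated with scipy.stats; the weights, means, and standard deviations below are made up for illustration (a two-component mixture in one dimension):

```python
import numpy as np
import scipy.stats

# Hypothetical two-component 1D mixture (illustrative values only).
omega = np.array([0.3, 0.7])   # weights, must sum to one
mu = np.array([-1.0, 2.0])     # component means
sigma = np.array([0.5, 1.0])   # component standard deviations

def gmm_likelihood(x):
    # sum_k omega_k G(x; mu_k, C_k) for a scalar observation x.
    return np.sum(omega * scipy.stats.norm.pdf(x, loc=mu, scale=sigma))

# The likelihood is normalized over the data for fixed parameters:
# integrating over x gives (approximately) one.
grid = np.linspace(-10, 10, 10001)
norm = np.trapz([gmm_likelihood(x) for x in grid], grid)
```

Integrating the same expression over the parameters instead would not give one, which is why the likelihood is not a probability distribution over $\Theta_M$.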
The likelihood function plays a central role in both frequentist and Bayesian statistics, but is used and interpreted differently. We will focus on the Bayesian perspective, where $\Theta_M$ and $M$ are considered random variables and the likelihood function is associated with the conditional probability
$$ \Large {\cal L}_M\left(\Theta_M, D \right) = P(D\mid \Theta_M, M) $$
of observing features $D$ given the model $(\Theta_M, M)$.
Once we associate the likelihood with a conditional probability, we can apply the earlier rules (2 & 3) of probability calculus to derive the generalized Bayes' rule:
$$ \Large P(\Theta_M\mid D, M) = \frac{P(D\mid \Theta_M, M)\,P(\Theta_M\mid M)}{P(D\mid M)} $$
Each term above has a name and measures a different probability:
In typical inference problems, the posterior (1) is what we really care about and the likelihood (2) is what we know how to calculate. The prior (3) is where we must quantify our subjective "degree of belief" in different possible universes.
What about the evidence (4)? Using the earlier rule (5) of probability calculus, we discover that (4) can be calculated from (2) and (3):
$$ \Large P(D\mid M) = \int d\Theta_M' P(D\mid \Theta_M', M)\, P(\Theta_M'\mid M) \; . $$
Note that this result is not surprising since the denominator must normalize the ratio to yield a probability (density). When the set of possible parameter values is discrete, $\Theta_M \in \{ \Theta_{M,1}, \Theta_{M,2}, \ldots\}$, the normalization integral reduces to a sum:
$$ \Large P(D\mid M) \rightarrow \sum_k\, P(D\mid \Theta_{M,k}, M)\, P(\Theta_{M,k}\mid M) \; . $$
The generalized Bayes' rule above assumes fixed values of any hyperparameters (since $M$ is on the RHS of all 4 terms), but a complete inference also requires us to consider different hyperparameter settings. We will defer this (harder) model selection problem until later.
EXERCISE: Suppose that you meet someone for the first time at your next conference and they are wearing an "England" t-shirt. Estimate the probability that they are English by:
Solution:
Note that the likelihood probabilities do not sum to one since the likelihood is normalized over the data, not the model, unlike the prior probabilities which must sum to one.
You would have probably assigned different probabilities, since these are subjective assessments where reasonable people can disagree. However, by allowing some subjectivity we are able to make a precise statement under some (subjective) assumptions.
A simple example like this can be represented graphically in the 2D space of joint probability $P(D, \Theta)$:
The generalized Bayes' rule can be viewed as a learning rule that updates our knowledge as new information becomes available:
The implied timeline motivates the posterior and prior terminology, although there is no requirement that the prior be based on data collected before the "new" data.
Bayesian inference problems can be tricky to get right, even when they sound straightforward, so it is important to clearly spell out what you know or assume, and what you wish to learn:
For problems with a finite number of possible models and observations, the calculations required are simple arithmetic but quickly get cumbersome. A helper function lets you hide the arithmetic and focus on the logic:
def learn(prior, likelihood, D):
# Calculate the Bayes' rule numerator for each model.
prob = {M: prior[M] * likelihood(D, M) for M in prior}
# Calculate the Bayes' rule denominator.
norm = sum(prob.values())
# Return the posterior probabilities for each model.
return {M: prob[M] / norm for M in prob}
For example, the problem above becomes:
prior = {'English': 0.2, '!English': 0.8}
def likelihood(D, M):
if M == 'English':
return 0.4 if D == 't-shirt' else 0.6
else:
return 0.1 if D == 't-shirt' else 0.9
learn(prior, likelihood, D='t-shirt')
{'English': 0.5, '!English': 0.5}
Note that the (posterior) output from one learning update can be the (prior) input to the next update. For example, how should we update our knowledge if the person wears an "England" t-shirt the next day also?
post1 = learn(prior, likelihood, 't-shirt')
post2 = learn(post1, likelihood, 't-shirt')
print(post2)
{'English': 0.8, '!English': 0.2}
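Since the two observations are treated as independent, the two sequential updates above should agree with a single update that uses the product of the individual likelihoods. A quick check, repeating the definitions from above so the cell is self-contained:

```python
def learn(prior, likelihood, D):
    # Bayes' rule numerator for each model, then normalize.
    prob = {M: prior[M] * likelihood(D, M) for M in prior}
    norm = sum(prob.values())
    return {M: prob[M] / norm for M in prob}

prior = {'English': 0.2, '!English': 0.8}

def likelihood(D, M):
    if M == 'English':
        return 0.4 if D == 't-shirt' else 0.6
    else:
        return 0.1 if D == 't-shirt' else 0.9

# Two sequential updates: yesterday's posterior is today's prior.
post2 = learn(learn(prior, likelihood, 't-shirt'), likelihood, 't-shirt')

# One combined update with the product of the two likelihoods.
def likelihood2(D, M):
    return likelihood(D[0], M) * likelihood(D[1], M)

combined = learn(prior, likelihood2, ('t-shirt', 't-shirt'))
```

Both routes give the same posterior, {'English': 0.8, '!English': 0.2}, matching the output above.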
The mls package includes a function Learn for these calculations that allows multiple updates with one call and displays the learning process as a pandas table:
from mls import Learn
Learn(prior, likelihood, 't-shirt', 't-shirt')
 | !English | English |
---|---|---|
PRIOR | 0.8 | 0.2 |
D=t-shirt | 0.5 | 0.5 |
D=t-shirt | 0.2 | 0.8 |
EXERCISE: Suppose someone rolls 6, 4, 5 on a dice without telling you whether it has 4, 6, 8, 12, or 20 sides. Use the Learn function to estimate the posterior probability for the number of sides after each roll.

Solution:

We can be sure the dice is not 4-sided (because of the rolls > 4) and guess that it is unlikely to be 12- or 20-sided (since the largest roll is a 6).
The models in this problem correspond to the number of sides on the dice: 4, 6, 8, 12, 20.
The data in this problem are the dice rolls: 6, 4, 5.
Define the prior assuming that each model is equally likely:
prior = {4: 0.2, 6: 0.2, 8: 0.2, 12: 0.2, 20: 0.2}
Define the likelihood assuming that each dice is fair:
def likelihood(D, M):
if D <= M:
return 1.0 / M
else:
return 0.0
Finally, put the pieces together to estimate the posterior probability of each model after each roll:
Learn(prior, likelihood, 6, 4, 5)
 | 4 | 6 | 8 | 12 | 20 |
---|---|---|---|---|---|
PRIOR | 0.2 | 0.200 | 0.200 | 0.200 | 0.200 |
D=6 | 0.0 | 0.392 | 0.294 | 0.196 | 0.118 |
D=4 | 0.0 | 0.526 | 0.296 | 0.131 | 0.047 |
D=5 | 0.0 | 0.635 | 0.268 | 0.079 | 0.017 |
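The same posterior table can be reproduced by chaining the simple learn helper defined earlier, feeding each posterior back in as the next prior:

```python
def learn(prior, likelihood, D):
    prob = {M: prior[M] * likelihood(D, M) for M in prior}
    norm = sum(prob.values())
    return {M: prob[M] / norm for M in prob}

prior = {4: 0.2, 6: 0.2, 8: 0.2, 12: 0.2, 20: 0.2}

def likelihood(D, M):
    # A fair M-sided dice rolls each value 1..M with probability 1/M.
    return 1.0 / M if D <= M else 0.0

post = prior
for roll in (6, 4, 5):
    post = learn(post, likelihood, roll)
```

After the final roll, post matches the last row of the table above, with a posterior probability of about 0.635 for the 6-sided dice.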
Somewhat surprisingly, this toy problem has a practical application with historical significance!
Imagine a factory that has made $N$ items, each with a serial number 1--$N$. If you randomly select items and read their serial numbers, the problem of estimating $N$ is analogous to our dice problem, but with many more models to consider. This approach was successfully used in World War II by the Allied Forces to estimate the production rate of German tanks at a time when most academic statisticians rejected Bayesian methods.
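A minimal sketch of the serial-number version, using made-up serial numbers and the same update logic as the dice problem (the cap N_max on the candidate models is an arbitrary choice):

```python
# Hypothetical observed serial numbers (made up for illustration).
serials = (68, 35, 59)

# Models: the unknown total production N, with a uniform prior over
# an (arbitrary) range of candidate values.
N_max = 500
post = {N: 1.0 / N_max for N in range(1, N_max + 1)}

def likelihood(D, M):
    # Reading serial D is equally likely among M produced items.
    return 1.0 / M if D <= M else 0.0

# Update the posterior after each observed serial number.
for D in serials:
    prob = {M: post[M] * likelihood(D, M) for M in post}
    norm = sum(prob.values())
    post = {M: prob[M] / norm for M in prob}

# The posterior peaks at the largest observed serial number,
# since P(N) falls off as 1/N^3 above it.
best = max(post, key=post.get)
```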
For more historical perspective on the development of Bayesian methods (and many obstacles along the way), read the entertaining book The Theory That Would Not Die.
The discrete examples above can be solved exactly, but this is not true in general. The challenge is to calculate the evidence, $P(D\mid M)$, in the Bayes' rule denominator, as the marginalization integral:
$$ \Large P(D\mid M) = \int d\Theta_M' P(D\mid \Theta_M', M)\, P(\Theta_M'\mid M) \; . $$
With careful choices of the prior and likelihood function, this integral can be performed analytically. However, for most practical work, an approximate numerical approach is required. Popular methods include Markov-Chain Monte Carlo and Variational Inference, which we will meet soon.
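For the beta prior and binomial likelihood used below, this integral also has a closed form, which makes a useful cross-check of a numerical marginalization (the values of $a$, $b$, $n$, and $k$ here are illustrative):

```python
import numpy as np
import scipy.stats
import scipy.special

a, b, n_obs, n_pass = 1.0, 1.0, 10, 7   # illustrative values

# Numerical marginalization of likelihood x prior over theta.
theta = np.linspace(0.0, 1.0, 100001)
integrand = (scipy.stats.binom.pmf(n_pass, n_obs, theta) *
             scipy.stats.beta.pdf(theta, a, b))
evidence_numeric = np.trapz(integrand, theta)

# Analytic beta-binomial evidence: C(n, k) B(a+k, b+n-k) / B(a, b).
evidence_exact = (scipy.special.comb(n_obs, n_pass) *
                  scipy.special.beta(a + n_pass, b + n_obs - n_pass) /
                  scipy.special.beta(a, b))
```

With a flat prior ($a=b=1$) the evidence reduces to $1/(n+1)$, independent of $k$.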
The choice of priors is necessarily subjective and sometimes contentious, but keep the following general guidelines in mind:
For a visual demonstration of these guidelines, the following function performs exact inference for a common task: you make a number of observations and count how many pass some predefined test, and want to infer the fraction $0\le \theta\le 1$ that pass. This applies to questions like:
For our prior, $P(\theta)$, we use the beta distribution, which is specified by hyperparameters $a$ and $b$:
$$ \Large P(\theta\mid a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\, \theta^{a-1} \left(1 - \theta\right)^{b-1} \; , $$
where $\Gamma$ is the gamma function, related to the factorial by $\Gamma(n) = (n-1)!$.
def binomial_learn(prior_a, prior_b, n_obs, n_pass):
theta = np.linspace(0, 1, 100)
# Calculate and plot the prior on theta.
prior = scipy.stats.beta(prior_a, prior_b)
plt.fill_between(theta, prior.pdf(theta), alpha=0.25)
plt.plot(theta, prior.pdf(theta), label='Prior')
# Calculate and plot the likelihood of the fixed data given any theta.
likelihood = scipy.stats.binom.pmf(n_pass, n_obs, theta)
plt.plot(theta, likelihood, 'k:', label='Likelihood')
# Calculate and plot the posterior on theta given the observed data.
posterior = scipy.stats.beta(prior_a + n_pass, prior_b + n_obs - n_pass)
plt.fill_between(theta, posterior.pdf(theta), alpha=0.25)
plt.plot(theta, posterior.pdf(theta), label='Posterior')
# Plot cosmetics.
plt.legend(bbox_to_anchor=(0., 1.02, 1., .102), loc=3,
ncol=3, mode="expand", borderaxespad=0., fontsize='large')
plt.ylim(0, None)
plt.xlim(theta[0], theta[-1])
plt.xlabel('Pass fraction $\\theta$')
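The posterior line in binomial_learn relies on the fact that the beta prior is conjugate to the binomial likelihood: multiplying the prior by the likelihood and normalizing yields another beta distribution, with hyperparameters $a + k$ and $b + n - k$. A numerical check of this, with illustrative values:

```python
import numpy as np
import scipy.stats

a, b, n_obs, n_pass = 3.0, 5.0, 20, 8   # illustrative values

theta = np.linspace(0.001, 0.999, 999)

# Unnormalized posterior: prior times likelihood, normalized numerically.
unnorm = (scipy.stats.beta.pdf(theta, a, b) *
          scipy.stats.binom.pmf(n_pass, n_obs, theta))
numeric = unnorm / np.trapz(unnorm, theta)

# The conjugate update used by binomial_learn above.
conjugate = scipy.stats.beta.pdf(theta, a + n_pass, b + n_obs - n_pass)
```

The two curves agree everywhere on the grid, which is why the function never needs to compute the evidence explicitly.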
EXERCISE:
Q1: Think of a question in your research area where this inference problem applies.
Q2: Infer $\theta$ from 2 observations with 1 passing, using hyperparameters $(a=1,b=1)$.
Q3: Infer $\theta$ from the same 2 observations with 1 passing, using instead $(a=5,b=10)$.
Q4: Use each of the priors above with different data: 100 trials with 60 passing.
binomial_learn(1, 1, 2, 1)
binomial_learn(5, 10, 2, 1)
binomial_learn(1, 1, 100, 60)
binomial_learn(5, 10, 100, 60)
You are hopefully convinced now that your choice of priors is mostly a non-issue, since inference with good data is relatively insensitive to your choice. However, you still need to make a choice, so here are some practical guidelines:
The answer is more complicated than this, but two primary reasons are often cited:
With a different choice of prior you will obtain different results, and this "lack of objectivity" makes some people uncomfortable. However, we showed how some problems with sufficient data can be relatively insensitive to the choice of prior.
We started above with the Bayesian joint probability:
$$ \Large P(D, \Theta_M, M) $$
When the individual data features, parameters and hyperparameters are all written out, this often ends up being a very high-dimensional function.
In the most general case, the joint probability requires a huge volume of data to estimate (recall our earlier discussion of dimensionality reduction). However, many problems can be (approximately) described by a joint probability that is simplified by assuming that some random variables are mutually independent.
Graphical models are a convenient visualization of the assumed direct dependencies between random variables. For example, suppose we have two parameters $(\alpha, \beta)$ and no hyperparameters; then the joint probability $P(D, \alpha, \beta)$ can be expanded into a product of conditionals in different ways using the rules of probability calculus, e.g.
$$ \Large P(D, \alpha, \beta) = P(D,\beta\mid \alpha)\, P(\alpha) = P(D\mid \alpha,\beta)\, P(\beta\mid \alpha)\, P(\alpha) \; , $$
or, equally well,
$$ \Large P(D, \alpha, \beta) = P(D,\alpha\mid \beta)\, P(\beta) = P(D\mid \alpha,\beta)\, P(\alpha\mid \beta)\, P(\beta) \; . $$
The corresponding diagrams are:
The way to read these diagrams is that a node labeled with $X$ represents a (multiplicative) factor $P(X\mid\ldots)$ in the joint probability, where $\ldots$ lists other nodes whose arrows feed into this node (in any order, thanks to probability calculus Rule-1). A shaded node indicates a random variable that is directly observed (i.e., data) while non-shaded nodes represent (unobserved) latent random variables.
These diagrams both describe a fully general joint probability with two parameters. The rules for building a fully general joint probability with any number of parameters are:
With $n$ parameters, there are then $n!$ possible diagrams and the number of potential dependencies grows rapidly with $n$.
To mitigate this factorial growth, we seek pairs of random variables that should not depend on each other. For example, in the two parameter case:
Notice how each diagram tells a different story. For example, the first diagram tells us that the data can be predicted knowing only $\beta$, but that our prior knowledge of $\beta$ depends on $\alpha$. In effect, then, simplifying a joint probability involves drawing a diagram that tells a suitable story for your data and models.
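The two factorizations above can be checked numerically on a small made-up discrete joint distribution (the shapes and random values below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)

# An arbitrary discrete joint P(D, alpha, beta): axes are (D, alpha, beta).
P = rng.random((2, 3, 4))
P /= P.sum()

# Marginals and conditionals from the rules of probability calculus.
P_ab = P.sum(axis=0)                             # P(alpha, beta)
P_alpha = P.sum(axis=(0, 2))                     # P(alpha)
P_beta = P.sum(axis=(0, 1))                      # P(beta)
P_D_given_ab = P / P_ab                          # P(D | alpha, beta)
P_beta_given_alpha = P_ab / P_alpha[:, None]     # P(beta | alpha)
P_alpha_given_beta = P_ab / P_beta[None, :]      # P(alpha | beta)

# First factorization: P(D|alpha,beta) P(beta|alpha) P(alpha).
first = P_D_given_ab * P_beta_given_alpha * P_alpha[:, None]

# Second factorization: P(D|alpha,beta) P(alpha|beta) P(beta).
second = P_D_given_ab * P_alpha_given_beta * P_beta
```

Both products rebuild the original joint exactly, as the probability calculus guarantees; the simplifying assumptions discussed above amount to replacing some of these conditionals with unconditional factors.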
EXERCISE: Consider observing someone throwing a ball and measuring how far away it lands to infer the strength of gravity:
Draw one example of a fully general diagram of this inference's joint probability $P(r, v, \phi, g, d, w)$.
Suppose the thrower always throws as hard as they can, then adjusts the angle according to the wind. Draw a diagram to represent the direct dependencies in this simpler joint probability.
Write down the posterior we are interested in for this inference problem.
The posterior we are most likely interested in for this inference is $$ P(g\mid r) \; , $$ but a more explicit posterior would be: $$ P(g\mid r, v, \phi, d, w) \; . $$ The difference between these is that we marginalized over the "nuisance" parameters $v, \phi, d, w$ in the first case.
The arrows in these diagrams define the direction of conditional dependencies. They often mirror a causal influence in the underlying physical system, but this is not necessary. Probabilistic diagrams with directed edges are known as Bayesian networks or belief networks.
It is also possible to draw diagrams where nodes are connected symmetrically, without a specified direction. These are known as Markov random fields or Markov networks and are appropriate when dependencies flow in both directions or in an unknown direction. You can read more about these here.