Problem: We observe the following sequence of heads ($h$) and tails ($t$) when tossing the same coin repeatedly $$D=\{hthhtth\}\,.$$
What is the probability that heads comes up next?
REPEAT
1- Model specification
2- Parameter estimation (i.e., learning from an observed data set using Bayesian inference)
3- Model evaluation (how good is this (trained) model?)
UNTIL model performance is satisfactory
4- Apply model, e.g. for prediction or classification of new data
Usually you select a model for generating one observation $x_n$ and then use (in-)dependence assumptions to combine these single-observation models into a likelihood function for the model parameters.
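For example, under the common assumption that the observations in $D=\{x_1,\dots,x_N\}$ are independent and identically distributed (i.i.d.) given the parameters, the single-observation models combine into the likelihood $$ p(D|\theta) = \prod_{n=1}^{N} p(x_n|\theta)\,. $$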
This "recipe" works only if the right-hand side (RHS) factors can be evaluated; the computational details can be quite challenging and this is what machine learning is about.
$\Rightarrow$ Machine learning is EASY, apart from computational details :)
We observe the following sequence of heads ($h$) and tails ($t$) when tossing the same coin repeatedly $$D=\{hthhtth\}\,.$$
What is the probability that heads comes up next? We solve this in the next slides ...
We choose a beta prior for the heads probability $\mu$, $$p(\mu) = \mathrm{Beta}(\mu|\alpha,\beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, \mu^{\alpha-1}(1-\mu)^{\beta-1}\,,$$ where the Gamma function is a generalization of the factorial function. If $\alpha,\beta$ are integers, then $$\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)} = \frac{(\alpha+\beta-1)!}{(\alpha-1)!\,(\beta-1)!}\,.$$
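As a quick numeric sanity check of this identity, here is a minimal sketch assuming SpecialFunctions.jl (which provides `gamma`) is available in the workspace:
using SpecialFunctions                                         # provides gamma(x); assumed available
α, β = 3, 5
gamma(α + β) / (gamma(α) * gamma(β))                           # -> 105.0
factorial(α + β - 1) / (factorial(α - 1) * factorial(β - 1))   # -> 105.0, same value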
Because the beta prior is conjugate to the Bernoulli likelihood, we get a closed-form posterior.
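Concretely, for $n$ heads in $N$ tosses the Bernoulli model gives the likelihood $p(D|\mu) = \mu^{n}(1-\mu)^{N-n}$, so multiplying likelihood and prior yields $$p(\mu|D) \propto p(D|\mu)\, p(\mu) \propto \mu^{n+\alpha-1}(1-\mu)^{N-n+\beta-1}\,,$$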
hence the posterior is also beta-distributed as
$$ p(\mu|D) = \mathrm{Beta}(\mu|\,n+\alpha,\, N-n+\beta)\,. $$
using Pkg; Pkg.activate("probprog/workspace"); Pkg.instantiate();
IJulia.clear_output();
using PyPlot, Distributions
range_grid = range(0.0, stop=1.0, length=100)  # grid over μ for plotting the pdfs
μ = 0.4                                        # true probability of heads
samples = rand(192) .<= μ                      # flip 192 coins
posterior1 = Array{Distribution}(undef, 193)   # posteriors under the Beta(1,1) prior
posterior2 = Array{Distribution}(undef, 193)   # posteriors under the Beta(5,5) prior
for N = 0:192
    n = sum(samples[1:N])                      # number of heads in the first N flips
    posterior1[N+1] = Beta(1 + n, 1 + (N - n))
    posterior2[N+1] = Beta(5 + n, 5 + (N - n))
end
fig = figure("Posterior distributions", figsize=(10,8));
ax1 = fig.add_subplot(2,2,1);
plot(range_grid,pdf.(posterior1[3],range_grid), "k-");
plot(range_grid,pdf.(posterior2[3],range_grid), "k--");
xlabel(L"\mu"); ylabel(L"p(\mu|\mathcal{D})"); grid()
title(L"p(\mu|\mathcal{D})"*" for N=$(3), n=$(sum(samples[1:3])) (real \$\\mu\$=$(μ))")
legend(["Based on uniform prior "*L"B(1,1)","Based on prior "*L"B(5,5)"], loc=4)
ax2 = fig.add_subplot(2,2,2);
plot(range_grid,pdf.(posterior1[10],range_grid), "k-");
plot(range_grid,pdf.(posterior2[10],range_grid), "k--");
xlabel(L"\mu"); ylabel(L"p(\mu|\mathcal{D})"); grid()
title(L"p(\mu|\mathcal{D})"*" for N=$(10), n=$(sum(samples[1:10])) (real \$\\mu\$=$(μ))")
legend(["Based on uniform prior "*L"B(1,1)","Based on prior "*L"B(5,5)"], loc=4)
ax3 = fig.add_subplot(2,2,3);
plot(range_grid,pdf.(posterior1[50],range_grid), "k-");
plot(range_grid,pdf.(posterior2[50],range_grid), "k--");
xlabel(L"\mu"); ylabel(L"p(\mu|\mathcal{D})"); grid()
title(L"p(\mu|\mathcal{D})"*" for N=$(50), n=$(sum(samples[1:50])) (real \$\\mu\$=$(μ))")
legend(["Based on uniform prior "*L"B(1,1)","Based on prior "*L"B(5,5)"], loc=4)
ax4 = fig.add_subplot(2,2,4);
plot(range_grid,pdf.(posterior1[150],range_grid), "k-");
plot(range_grid,pdf.(posterior2[150],range_grid), "k--");
xlabel(L"\mu"); ylabel(L"p(\mu|\mathcal{D})"); grid()
title(L"p(\mu|\mathcal{D})"*" for N=$(150), n=$(sum(samples[1:150])) (real \$\\mu\$=$(μ))")
legend(["Based on uniform prior "*L"B(1,1)","Based on prior "*L"B(5,5)"], loc=4);
$\Rightarrow$ With more data, the relevance of the prior diminishes!
Consider the task: predict a datum $x$ from an observed data set $D$.
| | Bayesian | Maximum Likelihood |
|:---|:---|:---|
| 1. Model Specification | Choose a model $m$ with data-generating distribution $p(x\vert\theta,m)$ and parameter prior $p(\theta\vert m)$ | Choose a model $m$ with the same data-generating distribution $p(x\vert\theta,m)$; no need for priors |
| 2. Learning | Use Bayes rule to find the parameter posterior: $p(\theta\vert D) \propto p(D\vert\theta)\, p(\theta)$ | Maximum Likelihood (ML) optimization: $\hat\theta = \arg\max_{\theta} p(D\vert\theta)$ |
| 3. Prediction | $p(x\vert D) = \int p(x\vert\theta)\, p(\theta\vert D)\,\mathrm{d}\theta$ | $p(x\vert D) = p(x\vert\hat\theta)$ |
$\Rightarrow$ ML estimation is an approximation to Bayesian learning, but for good reason it is a very popular learning method when lots of data are available.
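To close the loop on the opening problem, here is a minimal sketch in plain Julia (no packages needed) that evaluates both prediction rules for the observed sequence $D=\{hthhtth\}$, assuming the uniform $\mathrm{Beta}(1,1)$ prior used in the plots above:
# D = {hthhtth}: N = 7 tosses, of which n = 4 heads
N, n = 7, 4
α, β = 1, 1                           # uniform Beta(1,1) prior (an assumption)
bayes_pred = (n + α) / (N + α + β)    # posterior predictive E[μ|D] = 5/9 ≈ 0.556
ml_pred = n / N                       # plug-in ML prediction μ̂ = 4/7 ≈ 0.571
println("Bayesian: $(bayes_pred), ML: $(ml_pred)")
With only seven tosses, the Bayesian prediction is pulled slightly toward the prior mean of $1/2$, while ML simply uses the empirical fraction of heads.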
open("../../styles/aipstyle.html") do f display("text/html", read(f, String)) end