Problem: We observe the following sequence of heads ($h$) and tails ($t$) when tossing the same coin repeatedly $$D=\{hthhtth\}\,.$$
What is the probability that heads comes up next?
So, usually you select a model for generating one observation $x_n$, and then use (in-)dependence assumptions to combine these single-observation models into a likelihood function for the model parameters.
This "recipe" works only if the right-hand side (RHS) factors can be evaluated; the computational details can be quite challenging and this is what machine learning is about.
$\Rightarrow$ Machine learning is EASY, apart from computational details :)
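As a concrete instance of this recipe for the coin-toss problem above, here is a minimal Julia sketch (the function names `bernoulli` and `likelihood` are our own, for illustration only): a Bernoulli model for one observation, combined with an i.i.d. assumption, yields the likelihood function for the coin bias $\mu$.

# single-observation model: p(x|μ) = μ^x (1-μ)^(1-x), with x = 1 for heads
bernoulli(x, μ) = μ^x * (1 - μ)^(1 - x)
# i.i.d. assumption: the likelihood of the full data set is a product of factors
likelihood(D, μ) = prod(bernoulli(x, μ) for x in D)
D = [1, 0, 1, 1, 0, 0, 1]    # the sequence hthhtth, encoded as 1/0
likelihood(D, 0.5)           # p(D|μ=0.5) = 0.5^7 ≈ 0.0078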
Hence, for equal model priors ($p(m_1)=p(m_2)=0.5$), the Bayes Factor reports the posterior probability ratio for the two models.
In principle, any decision about which is the better model involves some ad hocery, but Jeffreys (1961) advises the following interpretation of the log-Bayes factor $\log_{10} B_{12} = \log_{10}\frac{p(D|m_1)}{p(D|m_2)}$:
| $\log_{10} B_{12}$ | Evidence for $m_1$ |
|:---|:---|
| 0 to 0.5 | not worth mentioning |
| 0.5 to 1 | substantial |
| 1 to 2 | strong |
| > 2 | decisive |
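As a quick illustration, a hypothetical helper (the name `jeffreys_label` is ours, not from Jeffreys) that maps a log-Bayes factor onto this verbal scale:

# map log10 B12 (assumed ≥ 0, i.e. evidence for m1) onto Jeffreys' verbal scale
function jeffreys_label(logB12)
    logB12 < 0.5 && return "not worth mentioning"
    logB12 < 1.0 && return "substantial"
    logB12 < 2.0 && return "strong"
    return "decisive"
end
jeffreys_label(1.3)   # returns "strong"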
In principle, you now have the recipe in your hands to solve all your prediction/classification/regression etc. problems by the same method:
We observe the following sequence of heads ($h$) and tails ($t$) when tossing the same coin repeatedly $$D=\{hthhtth\}\,.$$
What is the probability that heads comes up next? We solve this in the next slides ...
where the Gamma function is sort of a generalized factorial function. In particular, if $\alpha,\beta$ are integers, then $$\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)} = \frac{(\alpha+\beta-1)!}{(\alpha-1)!\,(\beta-1)!}$$
so we get a closed-form posterior.
Hence, the posterior is also beta-distributed as
$$ p(\mu|D) = \mathrm{Beta}(\mu\,|\,n+\alpha,\, N-n+\beta)\,. $$

So, this coin is biased!
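Indeed, for the data above ($N=7$ tosses, $n=4$ heads) and a uniform $\mathrm{Beta}(\mu|1,1)$ prior, a quick sketch (our own check) gives a posterior mean slightly above $0.5$:

using Distributions
N, n = 7, 4                            # D = {h,t,h,h,t,t,h}
α, β = 1, 1                            # uniform Beta prior
posterior = Beta(α + n, β + (N - n))   # Beta(5, 4)
mean(posterior)                        # (n+α)/(N+α+β) = 5/9 ≈ 0.56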
In order to predict the outcomes of future coin tosses, we'll use two models $m_1$ and $m_2$.
Both models have the same data generating distribution (also Bernoulli)
but they have different priors: $$\begin{aligned} p(\mu|m_1) &= \mathrm{Beta}(\mu|\alpha=1,\beta=1) \\ p(\mu|m_2) &= \mathrm{Beta}(\mu|\alpha=15,\beta=1) \end{aligned}$$
Which model is better?
For both models, we will report, as a function of the total number of coin tosses, the posteriors
and the log-Bayes factor $$ \log_{10} B_{12} = \log_{10}\left( \frac{p(D|m_1)}{p(D|m_2)} \right) \,. $$
# Computes log10 of the Gamma function for positive integer arguments,
# using Γ(n) = (n-1)!, so log10 Γ(n) = log10(2) + log10(3) + ... + log10(n-1)
function log10gamma(n)
    if n == 1 || n == 2
        return 0.0   # Γ(1) = Γ(2) = 1, and log10(1) = 0
    end
    return sum(log10(k) for k in 2:n-1)
end
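A quick sanity check (our own): since $\Gamma(5) = 4! = 24$,

log10gamma(5) ≈ log10(24)   # true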
# Compute the parameter posterior and the log10 model evidence
function execute_bayes(N, n, α, β)
    # N = total number of tosses, n = number of observed heads
    # α, β are the parameters of the Beta prior
    posterior = Beta(α + n, β + (N - n))
    # log10 p(D|m) = log10 [ B(n+α, N-n+β) / B(α, β) ], expanded in log-Gamma terms
    logevidence = log10gamma(α+β) - log10gamma(α) - log10gamma(β) +
                  log10gamma(n+α) + log10gamma(N-n+β) - log10gamma(N+α+β)
    return posterior, logevidence
end;
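A quick usage check (our own) for the coin data above, $N=7$ tosses with $n=4$ heads and a uniform prior:

using Distributions
post, logev = execute_bayes(7, 4, 1, 1)
# post is Beta(5, 4); logev = log10 B(5,4) - log10 B(1,1) ≈ -2.45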
using Distributions
# specify model parameters
μ = 0.4;                       # true bias of the coin (unknown to the models)
α_m1 = 1;  β_m1 = 1;           # Beta prior parameters of model m1 (uniform)
α_m2 = 15; β_m2 = 1;           # Beta prior parameters of model m2 (biased towards heads)
# do experiment and update models
max_ntosses = 192
samples = rand(max_ntosses) .<= μ   # simulate 192 coin tosses; true means heads
# preallocate result arrays (one entry per number of tosses)
posterior_m1 = Array{Distribution}(undef,max_ntosses)
posterior_m2 = Array{Distribution}(undef,max_ntosses)
logevidence_m1 = Array{Float64}(undef,max_ntosses)
logevidence_m2 = Array{Float64}(undef,max_ntosses)
logBF12 = Array{Float64}(undef,max_ntosses)
for ntosses = 1:max_ntosses
nheads = sum(samples[1:ntosses]) # Count number of heads in first N flips
posterior_m1[ntosses], logevidence_m1[ntosses] = execute_bayes( ntosses, nheads, α_m1, β_m1 )
posterior_m2[ntosses], logevidence_m2[ntosses] = execute_bayes( ntosses, nheads, α_m2, β_m2 )
logBF12[ntosses] = logevidence_m1[ntosses] - logevidence_m2[ntosses]
end
using PyPlot
fig = figure("Posterior distributions", figsize=(11,11));
range_grid = range(0.0, stop=1.0, length=100)
ax1 = fig.add_subplot(2,2,1);
ax2 = fig.add_subplot(2,2,2);
ax3 = fig.add_subplot(2,2,3);
ax4 = fig.add_subplot(2,2,4);
function plot_distribution(position, ntosses, posterior_m1, posterior_m2, logBF12)
    plt.subplot(position)
    # plot both posterior pdfs over a grid of μ values
    plot(range_grid, pdf.(posterior_m1[ntosses], range_grid), "b-")
    plot(range_grid, pdf.(posterior_m2[ntosses], range_grid), "k--")
    xlabel(L"\mu"); ylabel(L"p(\mu|\mathcal{D})"); grid()
    title(L"p(\mu|\mathcal{D})"*" for N=$(ntosses), n=$(sum(samples[1:ntosses])) (real \$\\mu\$=$(μ))")
    legend(["model 1 "*L"B("*string(α_m1)*","*string(β_m1)*")", "model 2 "*L"B("*string(α_m2)*","*string(β_m2)*")"], loc=4)
    # annotate with the priors and the current log-Bayes factor
    ymax = max(maximum(pdf.(posterior_m1[ntosses], range_grid)), maximum(pdf.(posterior_m2[ntosses], range_grid)))
    text(0, ymax-0.5, "model 1 with prior "*L"B("*string(α_m1)*","*string(β_m1)*")\n")
    text(0, ymax-1.0, "model 2 with prior "*L"B("*string(α_m2)*","*string(β_m2)*")\n")
    text(0, ymax-1.5, "log B12 = $(round(logBF12[ntosses], digits=2))\n")
end
function plot_all_distributions(posterior_m1, posterior_m2, logBF12)
    # show the posteriors after 3, 10, 50 and 190 tosses in a 2x2 grid
    for (position, ntosses) in zip([221, 222, 223, 224], [3, 10, 50, 190])
        plot_distribution(position, ntosses, posterior_m1, posterior_m2, logBF12)
    end
end
plot_all_distributions( posterior_m1, posterior_m2, logBF12 );
$\Rightarrow$ With more data, the relevance of the prior diminishes!
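To see why, write out the posterior mean of $\mathrm{Beta}(\mu\,|\,n+\alpha,\, N-n+\beta)$: $$ \mathbb{E}[\mu|D] = \frac{n+\alpha}{N+\alpha+\beta}\,, $$ so for $N \gg \alpha+\beta$ this approaches the sample frequency $n/N$, whatever the prior parameters $\alpha,\beta$ were.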
Consider the task: predict a datum $x$ from an observed data set $D$.
| | Bayesian | Maximum Likelihood |
|:---|:---|:---|
| 1. Model Specification | Choose a model $m$ with data generating distribution $p(x\mid\theta,m)$ and parameter prior $p(\theta\mid m)$ | Choose a model $m$ with the same data generating distribution $p(x\mid\theta,m)$; no need for priors |
| 2. Learning | Use Bayes rule to find the parameter posterior: $p(\theta\mid D) \propto p(D\mid\theta)\, p(\theta)$ | By Maximum Likelihood (ML) optimization: $\hat\theta = \arg\max_{\theta} p(D\mid\theta)$ |
| 3. Prediction | $p(x\mid D) = \int p(x\mid\theta)\, p(\theta\mid D)\, \mathrm{d}\theta$ | $p(x\mid D) = p(x\mid\hat\theta)$ |
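As a worked example under the Beta-Bernoulli coin model from before (a sketch, reusing the data $D=\{hthhtth\}$), both prediction rules have closed forms; this also answers the opening problem of predicting the next toss:

N, n = 7, 4          # total tosses and observed heads in D
α, β = 1, 1          # uniform Beta prior (Bayesian column only)
# Bayesian: p(x=h|D) = ∫ μ p(μ|D) dμ = (n+α)/(N+α+β)
p_bayes = (n + α) / (N + α + β)   # = 5/9 ≈ 0.56
# Maximum likelihood: μ̂ = arg max_μ p(D|μ) = n/N, then plug in
p_ml = n / N                      # = 4/7 ≈ 0.57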
$\Rightarrow$ ML estimation is an approximation to Bayesian learning, but, for good reason, a very popular learning method when lots of data are available: the posterior then concentrates around $\hat\theta$, so the plug-in prediction becomes accurate.
open("../../styles/aipstyle.html") do f display("text/html", read(f, String)) end