#!/usr/bin/env python
# coding: utf-8

# # Frequentism and Bayesianism V: Model Selection

# *This notebook originally appeared as a [post](https://jakevdp.github.io/blog/2015/08/07/frequentism-and-bayesianism-5-model-selection/) on the blog [Pythonic Perambulations](http://jakevdp.github.io). The content is BSD licensed.*

# *This post is part of a 5-part series: [Part I](http://jakevdp.github.io/blog/2014/03/11/frequentism-and-bayesianism-a-practical-intro/)  [Part II](http://jakevdp.github.io/blog/2014/06/06/frequentism-and-bayesianism-2-when-results-differ/)  [Part III](http://jakevdp.github.io/blog/2014/06/12/frequentism-and-bayesianism-3-confidence-credibility/)  [Part IV](http://jakevdp.github.io/blog/2014/06/14/frequentism-and-bayesianism-4-bayesian-in-python/) [Part V](http://jakevdp.github.io/blog/2015/08/07/frequentism-and-bayesianism-5-model-selection/)*
# 
# *See also [Frequentism and Bayesianism: A Python-driven Primer](http://arxiv.org/abs/1411.5018), a peer-reviewed article partially based on this content.*

# <!-- PELICAN_BEGIN_SUMMARY -->
# 
# Last year I wrote a series of posts comparing frequentist and Bayesian approaches to various problems:
# 
# - In [Frequentism and Bayesianism I: a Practical Introduction](http://jakevdp.github.io/blog/2014/03/11/frequentism-and-bayesianism-a-practical-intro/) I gave an introduction to the main philosophical differences between frequentism and Bayesianism, and showed that for many common problems the two methods give basically the same point estimates.
# - In [Frequentism and Bayesianism II: When Results Differ](http://jakevdp.github.io/blog/2014/06/06/frequentism-and-bayesianism-2-when-results-differ/) I went into a bit more depth on when frequentism and Bayesianism start to diverge, particularly when it comes to the handling of nuisance parameters.
# - In [Frequentism and Bayesianism III: Confidence, Credibility, and why Frequentism and Science Don't Mix](http://jakevdp.github.io/blog/2014/06/12/frequentism-and-bayesianism-3-confidence-credibility/) I talked about the subtle difference between frequentist confidence intervals and Bayesian credible intervals, and argued that in most scientific settings frequentism answers the wrong question.
# - In [Frequentism and Bayesianism IV: How to be a Bayesian in Python](http://jakevdp.github.io/blog/2014/06/14/frequentism-and-bayesianism-4-bayesian-in-python/) I compared three Python packages for doing Bayesian analysis via MCMC: [emcee](http://dan.iel.fm/emcee), [pymc](http://pymc-devs.github.io/pymc/), and [pystan](https://pystan.readthedocs.org/en/latest/).
# 
# Here I am going to dive into an important topic that I've not yet covered: *model selection*.
# We will take a look at this from both a frequentist and Bayesian standpoint, and along the way gain some more insight into the fundamental philosophical divide between frequentist and Bayesian methods, and the practical consequences of this divide.
# 
# My quick, TL;DR summary is this: for model selection, frequentist methods tend to be **conceptually difficult but computationally straightforward**, while Bayesian methods tend to be **conceptually straightforward but computationally difficult**.
# 
# <!-- PELICAN_END_SUMMARY -->

# ## Model Fitting vs Model Selection
# 
# The difference between *model fitting* and *model selection* is often a cause of confusion.
# **Model fitting** proceeds by assuming a particular model is true, and tuning the model so it provides the best possible fit to the data. **Model selection**, on the other hand, asks the larger question of whether the assumptions of the model are compatible with the data.
# 
# Let's make this more concrete.
# By *model* here I essentially mean a formula, usually with tunable parameters, which quantifies the likelihood of observing your data.
# For example, your model might consist of the statement, "the $(x, y)$ observations come from a straight line, with known normal measurement errors $\sigma_y$".
# Labeling this model $M_1$, we can write:
# 
# $$
# y_{M_1}(x;\theta) = \theta_0 + \theta_1 x\\
# y \sim \mathcal{N}(y_{M_1}, \sigma_y^2)
# $$
# 
# where the second line indicates that the observed $y$ is normally distributed about the model value, with variance $\sigma_y^2$.
# There are two tunable parameters to this model, represented by the vector $\theta = [\theta_0, \theta_1]$ (i.e. the slope and intercept).
# 
# Another model might consist of the statement "the observations $(x, y)$ come from a quadratic curve, with known normal measurement errors $\sigma_y$".
# Labeling this model $M_2$, we can write:
# 
# $$
# y_{M_2}(x;\theta) = \theta_0 + \theta_1 x + \theta_2 x^2\\
# y \sim \mathcal{N}(y_{M_2}, \sigma_y^2)
# $$
# 
# There are three tunable parameters here, again represented by the vector $\theta$.
# 
# Model fitting, in this case, is the process of finding constraints on the values of the parameters $\theta$ within each model.
# That is, it allows you to make statements such as, "assuming $M_1$ is true, this particular $\theta$ gives the best-fit line" or "assuming $M_2$ is true, this particular vector $\theta$ gives the best-fit curve."
# Model fitting proceeds without respect to whether the model is capable of describing the data well; it just arrives at the best-fit model *under the assumption that the model is accurate*.
# 
# Model selection, on the other hand, is not concerned with the parameters themselves, but with the question of whether the model is capable of describing the data well.
# That is, it allows you to say, "for my data, a line ($M_1$) provides a better fit than a quadratic curve ($M_2$)".
# 
# Let's make this more concrete by introducing some data.

# ## Model Selection: Linear or Quadratic?
# 
# Consider the following data:

# In[1]:


import numpy as np
data = np.array([[ 0.42,  0.72,  0.  ,  0.3 ,  0.15,
                   0.09,  0.19,  0.35,  0.4 ,  0.54,
                   0.42,  0.69,  0.2 ,  0.88,  0.03,
                   0.67,  0.42,  0.56,  0.14,  0.2  ],
                 [ 0.33,  0.41, -0.22,  0.01, -0.05,
                  -0.05, -0.12,  0.26,  0.29,  0.39, 
                   0.31,  0.42, -0.01,  0.58, -0.2 ,
                   0.52,  0.15,  0.32, -0.13, -0.09 ],
                 [ 0.1 ,  0.1 ,  0.1 ,  0.1 ,  0.1 ,
                   0.1 ,  0.1 ,  0.1 ,  0.1 ,  0.1 ,
                   0.1 ,  0.1 ,  0.1 ,  0.1 ,  0.1 ,
                   0.1 ,  0.1 ,  0.1 ,  0.1 ,  0.1  ]])
x, y, sigma_y = data


# To get an idea of what we're looking at, let's quickly visualize these points:

# In[2]:


get_ipython().run_line_magic('matplotlib', 'inline')
import matplotlib.pyplot as plt
import seaborn as sns; sns.set() # set default plot styles

x, y, sigma_y = data
fig, ax = plt.subplots()
ax.errorbar(x, y, sigma_y, fmt='ok', ecolor='gray')
ax.set(xlabel='x', ylabel='y', title='input data');


# Our central model selection question will be this: **what better fits this data: a linear ($M_1$) or quadratic ($M_2$) curve**?
# 
# Let's create a function to compute these models given some data and parameters; for convenience, we'll make a very general polynomial model function:

# In[3]:


def polynomial_fit(theta, x):
    """Polynomial model of degree (len(theta) - 1)"""
    return sum(t * x ** n for (n, t) in enumerate(theta))


# If the ``theta`` variable is of length 2, this corresponds to the linear model ($M_1$). If the ``theta`` variable is length 3, this corresponds to the quadratic model ($M_2$).
# 
# As detailed in my previous posts, both the frequentist and Bayesian approaches to model fitting often revolve around the *likelihood*, which, for independent errors, is the product of the probabilities for each individual point.
# Here is a function which computes the log-likelihood for the two models:

# In[4]:


from scipy import stats

def logL(theta, model=polynomial_fit, data=data):
    """Gaussian log-likelihood of the model at theta"""
    x, y, sigma_y = data
    y_fit = model(theta, x)
    return sum(stats.norm.logpdf(*args)
               for args in zip(y, y_fit, sigma_y))


# Both the Bayesian and frequentist approaches are based on the likelihood, and the standard frequentist approach is to find the model which maximizes this expression.
# Though there are efficient closed-form ways of maximizing this, we'll use a direct optimization approach here for clarity:

# In[5]:


from scipy import optimize

def best_theta(degree, model=polynomial_fit, data=data):
    theta_0 = (degree + 1) * [0]
    neg_logL = lambda theta: -logL(theta, model, data)
    return optimize.fmin_bfgs(neg_logL, theta_0, disp=False)

theta1 = best_theta(1)
theta2 = best_theta(2)


# Let's now we can visually compare the maximum-likelihood degree-1 and degree-2 models:

# In[6]:


xfit = np.linspace(0, 1, 1000)
fig, ax = plt.subplots()
ax.errorbar(x, y, sigma_y, fmt='ok', ecolor='gray')
ax.plot(xfit, polynomial_fit(theta1, xfit), label='best linear model')
ax.plot(xfit, polynomial_fit(theta2, xfit), label='best quadratic model')
ax.legend(loc='best', fontsize=14)
ax.set(xlabel='x', ylabel='y', title='data');


# The crux of the model selection question is this: how we can quantify the difference between these models and decide which model better describes our data?

# ### Naive Approach: Comparing Maximum Likelihoods
# 
# One common mistake is to assume that we can select between models via *the value of the maximum likelihood*.
# While this works in some special cases, it is not generally applicable.
# Let's take a look at the maximum log-likelihood value for each of our fits:

# In[7]:


print("linear model:    logL =", logL(best_theta(1)))
print("quadratic model: logL =", logL(best_theta(2)))


# The quadratic model yields a higher log-likelihood, but this **does not** necessarily mean it is the better model!
# 
# The problem is that the quadratic model has more degrees of freedom than the linear model, and thus will **always** give an equal or larger maximum likelihood, regardless of the data!
# This trend holds generally: as you increase model complexity, the maximum likelihood value will (almost) always increase!
# 
# Let's take a look at the best maximum likelihood for a series of polynomial fits (linear, quadratic, cubic, quartic, etc.):

# In[8]:


degrees = np.arange(1, 10)
thetas = [best_theta(d) for d in degrees]
logL_max = [logL(theta) for theta in thetas]

fig, ax = plt.subplots(1, 2, figsize=(14, 5))
ax[0].plot(degrees, logL_max)
ax[0].set(xlabel='degree', ylabel='log(Lmax)')
ax[1].errorbar(x, y, sigma_y, fmt='ok', ecolor='gray')
ylim = ax[1].get_ylim()
for (degree, theta) in zip(degrees, thetas):
    if degree not in [1, 2, 9]: continue
    ax[1].plot(xfit, polynomial_fit(theta, xfit),
               label='degree={0}'.format(degree))
ax[1].set(ylim=ylim, xlabel='x', ylabel='y')
ax[1].legend(fontsize=14, loc='best');


# We see in the left panel that the maximum likelihood *value* always increases as we increase the degree of the polynomial.
# Looking at the right panel, we see how this metric has led us astray: while the ninth order polynomial certainly leads to a larger likelihood, it achieves this by **over-fitting** the data.
# 
# Thus, in some ways, you can view the model selection question as fundamentally about comparing models while correcting for over-fitting of more complicated models.
# Let's see how this is done within the frequentist and Bayesian approaches.

# ## Fundamentals of Frequentist & Bayesian Model Selection
# 
# Recall that the fundamental difference between the frequentist and Bayesian approaches is the [**definition of probability**](http://jakevdp.github.io/blog/2014/03/11/frequentism-and-bayesianism-a-practical-intro/).
# 
# Frequentists consider **probabilities as frequencies**: that is, a probability is only meaningful in the context of repeated experiments (even if those repetitions are merely hypothetical).
# This means, for example, that in the frequentist approach:
# 
# - *observed data* (and any quantities derived from them) are considered to be random variables: if you make the observations again under similar circumstances, the data may be different, and the details depend on the generating distribution.
# - *model parameters* (those things that help define the generating distribution) are considered fixed: they aren't subject to a probability distribution; they just *are*.
# 
# On the other hand, Bayesians consider **probabilities as degrees-of-belief**: that is, a probability is a way of quantifying our certainty about a particular statement.
# This means, for example, that in the Bayesian approach:
# 
# - *observed data* are not directly considered as random variables; they just *are*.
# - *model parameters* are uncertain quantities and thus subject to probabilistic description.
# 
# This difference in philosophy has real, practical implications, as we will see below.

# ### Some Notation
# 
# Before we continue, a quick note on notation.
# For the below discussion, it is important to be able to denote probabilities concisely.
# We'll generally be writing conditional probabilities of the form $P(A~|~B)$, which can be read "the probability of A given B".
# Additionally, I'll be using the following shorthands:
# 
# - $D$ represents observed data
# - $M$, $M_1$, $M_2$, etc. represent a model
# - $\theta$ represents a set of model parameters
# 
# With this in mind, we'll be writing statements like $P(D~|~\theta,M)$, which should be read "the probability of seeing the data, given the parameters $\theta$ with model $M$.
# I'm playing a bit fast-and-loose with discrete vs continuous variables, but the meaning should be clear from the context.

# ### Model Fitting
# 
# In the model fitting context, the difference in philosopy of probability leads to frequentist and Bayesians dealing with different quantities:
# 
# - frequentists look at the *likelihood*: $P(D~|~\theta, M)$
# - Bayesians look at the *posterior*: $P(\theta~|~D, M)$
# 
# Note the main distinction: frequentists operate on a **probability of the data**, while Bayesians operate on a **probability of the model parameters**, in line with their respective considerations about the applicability of probability.
# 
# **Frequentists**, here, have a clear advantage: the likelihood is something we can compute directly from the model – after all, a model is nothing more than a specification of the likelihood given model parameters.
# By optimizing this likelihood expression directly, as we saw above, you can arrive at an estimated best-fit model.
# 
# **Bayesians**, on the other hand, have a slightly more difficult task.
# To find an expression for the posterior, we can use Bayes' theroem:
# 
# $$
# P(\theta~|~D,M) = P(D~|~\theta,M) \frac{P(\theta~|~M)}{P(D~|~M)}
# $$
# 
# We see that the posterior is proportional to the likelihood used by frequentists, and the constant of proportionality involves ratio of $P(\theta~|~M)$ and $P(D~|~M)$.
# $P(\theta~|~M)$ here is the *prior* and quantifies our belief/uncertainty about the parameters $\theta$ without reference to the data.
# $P(D~|~M)$ is the *model evidence*, and in this context amounts to no more than a normalization term.
# 
# For a more detailed discussion of model fitting in the frequentist and Bayesian contexts, see the previous posts linked above.

# ### Model Selection
# 
# Similarly, when comparing two models $M_1$ and $M_2$, the frequentist and Bayesian approaches look at different quantities:
# 
# - frequentists compare the *model likelihood*, $P(D~|~M_1)$ and $P(D~|~M_2)$
# - Bayesians compare the *model posterior*, $P(M_1~|~D)$ and $P(M_2~|~D)$
# 
# Notice here that the parameter values $\theta$ no longer appear: we're not out to compare how well *particular fits* of the two models describe data; we're out to compare how well *the models themselves* describe the data.
# Unlike the parameter likelihood $P(D~|~\theta, M)$ above, neither quantity here is directly related to the likelihood expression, and so we must figure out how to re-express the desired quantity in terms we can compute.

# #### Model Selection: Bayesian Approach
# 
# For model selection, in many ways Bayesians have the advantage, at least in theory.
# Through a combination of Bayes Theorem and probability axioms, we can re-express the model posterior $P(M~|~D)$ in terms of computable quantities and priors:
# 
# First, using Bayes' Theorem:
# 
# $$
# P(M~|~D) = P(D~|~M)\frac{P(M)}{P(D)}
# $$
# 
# Using the definition of conditional probability, the first term can be expressed as an integral over the parameter space of the likelihood:
# 
# $$
# P(D~|~M) = \int_\Omega P(D~|~\theta, M) P(\theta~|~M) d\theta
# $$
# 
# Notice that this integral is over the exact quantity optimized in the case of Bayesian model *fitting*.
# 
# The remaining terms are priors, the most problematic of which is $P(D)$ – the prior probability of seeing your data *without reference to any model*.
# I'm not sure that $P(D)$ could ever be actually computed in the real world, but fortunately it can be canceled by computing the *odds ratio* between two alternative models:
# 
# $$
# O_{21} \equiv \frac{P(M_2~|~D)}{P(M_1~|~D)} = \frac{P(D~|~M_2)}{P(D~|~M_1)}\frac{P(M_2)}{P(M_1)}
# $$
# 
# We now have a means of comparing two models via computable quantities: an integral over the likelihood, and a prior odds for each model.
# Often the ratio of prior odds is assumed to be near unity, leaving only the well-defined (but often computationally intensive) integral over likelihood for each model.
# More on this below.

# #### Model Selection: Frequentist Approach
# 
# For model selection, frequentists are working with the quantity $P(D~|~M)$.
# Notice that unlike Bayesians, frequentists *cannot* express this as an integral over parameter space, because the notion of a probability distribution over model parameters does not make sense in the frequentist context.
# But recall that frequentists can make probabilistic statements about data or quantities derived from them: with this in mind, we can make progress by **computing some *statistic* from the data for which the distribution is known**.
# The difficulty is that which statistic is most useful depends highly on the precise model and data being used, and so practicing frequentist statistics requires a breadth of background knowledge about the assumptions made by various approaches.
# 
# For example, one commonly-seen distribution for data-derived statistis is the [$\chi^2$ (chi-squared) distribution](https://en.wikipedia.org/wiki/Chi-squared_distribution).
# The $\chi^2$ distribution with $K$ degrees of freedom describes the distribution of a sum of squares of $K$ independent normally-distributed variables.
# We can use Python tools to quickly visualize this:

# In[9]:


from scipy import stats
v = np.linspace(0, 8, 1000)
for k in range(1, 7):
    plt.plot(v, stats.chi2.pdf(v, k),
             label="k = {0}".format(k))
plt.legend(ncol=2)
plt.gca().set(title='$\chi^2$ distribution',
              xlabel='x', ylabel='p(x)', ylim=(0, 0.5));


# The key observation is this: if we can somehow *transform our data* into a single statistic $S(D;\theta,M)$ which we expect behaves like the sum of squares of normally-distributed values, then we can analyze the likelihood of this statistic in terms of this $\chi^2$ distribution, and use this as a proxy for the model likelihood $P(D~|~M)$.
# 
# In the case of our linear and quadratic model likelihoods $M_1$ and $M_2$ above, the models are built on the explicit expectation that for the correct model, the observed data $y$ will be normally distributed about the model value $y_M$ with a standard deviation $\sigma_y$.
# Thus the sum of squares of normalized residuals about the best-fit model should follow the $\chi^2$ distribution, and so we construct our $\chi^2$ statistic:
# 
# $$
# \chi^2(D;\theta,M) = \sum_n\left[\frac{y_n - y_M(x_n;\theta)}{\sigma_{y,n}}\right]^2
# $$
# 
# Coincidentally (or is it?), this statistic is proportional to the negative log likelihood, up to a constant offset, and so for model *fitting*, minimizing the $\chi^2$ is equivalent to maximizing the likelihood.
# 
# With this statistic, we've replaced an *uncomputable* quantity $P(D~|~M)$ with a *computable* quantity $P(S~|~M_S)$, where $S = \chi^2$ is a particular transformation of the data for which (under our model assumptions) we can compute the probability distribution.
# So rather than asking "how likely are we to see the data $D$ under the model $M$", we can ask "how likely are we to see the statistic $S$ under the model $M_S$", and the two likelihoods will be equivalent *as long as our assumptions hold*.

# ## Frequentist and Bayesian Model Selection by Example
# 
# Let's use the ideas developed above to address the above model selection problem in both a frequentist and Bayesian context.

# ### Frequentist Model Selection: $\chi^2$
# 
# We introduced the $\chi^2$ statistic above, and now let's take a look at how this is actually computed.
# From our proposed models, we explicitly expect that the residuals about the model fit will be independent and normally distributed, with variance $\sigma_y^2$.
# Consequently, we expect that for the correct model, the sum of normalized residuals will be drawn from a $\chi^2$ distribuion with a degree of freedom related to the number of data points.
# For $N$ data points fit by a $K$-parameter linear model, the degrees of freedom are usually given by $dof = N - K$, though there are some caveats to this (See [The Dos and Don'ts of Reduced Chi-Squared](http://arxiv.org/abs/1012.3754) for an enlightening discussion).
# 
# Let's define some functions to compute the $\chi^2$ and the number of degrees of freedom, and evaluate the likelihood of each model with the $\chi^2$ distribution:

# In[10]:


def compute_chi2(degree, data=data):
    x, y, sigma_y = data
    theta = best_theta(degree, data=data)
    resid = (y - polynomial_fit(theta, x)) / sigma_y
    return np.sum(resid ** 2)

def compute_dof(degree, data=data):
    return data.shape[1] - (degree + 1)

def chi2_likelihood(degree, data=data):
    chi2 = compute_chi2(degree, data)
    dof = compute_dof(degree, data)
    return stats.chi2(dof).pdf(chi2)

print("chi2 likelihood")
print("- linear model:    ", chi2_likelihood(1))
print("- quadratic model: ", chi2_likelihood(2))


# We have found that the likelihood of the observed residuals under the linear model ($M_1$) is slightly larger than the likelihood of the observed residuals under the quadratic model ($M_2$).
# 
# To understand what this is saying, it is helpful to visualize these distributions:

# In[11]:


fig, ax = plt.subplots()
for degree, color in zip([1, 2], ['blue', 'green']):
    v = np.linspace(0, 40, 1000)
    chi2_dist = stats.chi2(compute_dof(degree)).pdf(v)
    chi2_val = compute_chi2(degree)
    chi2_like = chi2_likelihood(degree)
    ax.fill(v, chi2_dist, alpha=0.3, color=color,
            label='Model {0} (degree = {0})'.format(degree))
    ax.vlines(chi2_val, 0, chi2_like, color=color, alpha=0.6)
    ax.hlines(chi2_like, 0, chi2_val, color=color, alpha=0.6)
    ax.set(ylabel='L(chi-square)')
ax.set_xlabel('chi-square')
ax.legend(fontsize=14);


# We can see visually here how this procedure corrects for model complexity: even though the $\chi^2$ *value* for the quadratic model is lower (shown by the vertical lines), the characteristics of the $\chi^2$ distribution mean the *likelihood* of seeing this value is lower (shown by the horizontal lines), meaning that the degree=1 linear model is favored.

# #### Significance of the Comparison
# 
# But how much should we trust this conclusion in favor of the linear model?
# In other words, how do we quantify the **significance** of this difference in $\chi^2$ likelihoods?
# 
# We can make progress by realizing that in the frequentist context *all data-derived quantities* can be consistered probabilistically, and this includes the difference in $\chi^2$ values from two models!
# For this particular case, the difference of $\chi^2$ statistics here also follows a $\chi^2$ distribution, with 1 degree of freedom. This is due to the fact that the models are *nested* – that is, the linear model is a specialization of the quadratic model (for some background, look up the [Likelihood Ratio Test](https://en.wikipedia.org/wiki/Likelihood-ratio_test)).
# 
# We might proceed by treating the linear model as the *null hypothesis*, and asking if there is sufficient evidence to justify the more complicated quadratic model.
# Just as above, we can plot the $\chi^2$ difference along with its expected distribution:

# In[12]:


chi2_diff = compute_chi2(1) - compute_chi2(2)

v = np.linspace(0, 5, 1000)
chi2_dist = stats.chi2(1).pdf(v)
p_value = 1 - stats.chi2(1).cdf(chi2_diff)

fig, ax = plt.subplots()
ax.fill_between(v, 0, chi2_dist, alpha=0.3)
ax.fill_between(v, 0, chi2_dist * (v > chi2_diff), alpha=0.5)
ax.axvline(chi2_diff)
ax.set(ylim=(0, 1), xlabel="$\chi^2$ difference", ylabel="probability");
ax.text(4.9, 0.95, "p = {0:.2f}".format(p_value),
        ha='right', va='top', size=14);


# Here we see where this observed $\chi^2$ difference lies on its expected distribution, under the null hypothesis that the linear model is the true model.
# The area of the distribution to the right of the observed value is known as the [*p value*](https://en.wikipedia.org/wiki/P-value): for our data, the $p$-value is 0.17, meaning that, assuming the linear model is true, there is a 17% probability that simply by chance we would see data that favors the quadratic model more strongly than ours.
# The standard interpretation of this is to say that our data are *not inconsistent with the linear model*: that is, our data does not support the quadratic model enough to conclusively reject the simpler linear model.

# #### Other Frequentist Approaches
# 
# Recall that a primary purpose of using the $\chi^2$ statistic, above, was to prevent mistaken selection of very flexible models which *over-fit* the data.
# Other qualitatively different approaches exist to limit this overfitting.
# Some important ones are:
# 
# - [The Akaike Information Criterion](https://en.wikipedia.org/wiki/Akaike_information_criterion) is an information-theoretic approach which penalizes the maximum likelihood to account for added model complexity. It makes rather stringent assumptions about the form of the likelihood, and so cannot be universally applied.
# - [Cross-Validation](https://en.wikipedia.org/wiki/Cross-validation_%28statistics%29) is a sampling method in which the model fit and evaluation take place on disjoint randomized subsets of the data, which acts to penalize over-fitting. Cross-validation is more computationally intensive than other frequentist approaches, but has the advantage that it relies on very few assumptions about the data and model and so is applicable to a broad range of models.
# - **Other sampling-based methods**: the classical frequentist approach relies on computing specially-crafted statistics that fit the assumptions of your model. Some of this specialization can be side-stepped through randomized methods like bootstrap and jackknife resampling.
# For an interesting introduction to this subject, I'd recommend the short and free book, [Statistics is Easy](http://www.morganclaypool.com/doi/abs/10.2200/S00295ED1V01Y201009MAS008) by Shasha and Wilson, or the extremely approachable talk, [Statistics Without the Agonizing Pain](https://www.youtube.com/watch?v=5Dnw46eC-0o) by John Rauser.
# 
# We will not demonstrate these approaches here, but they are relatively straightforward and interesting to learn!

# ### Bayesian Model Selection: the Odds Ratio
# 
# The Bayesian approach proceeds very differently.
# Recall that the Bayesian model involves computing the *odds ratio* between two models:
# 
# $$
# O_{21} = \frac{P(M_2~|~D)}{P(M_1~|~D)} = \frac{P(D~|~M_2)}{P(D~|~M_1)}\frac{P(M_2)}{P(M_1)}
# $$
# 
# Here the ratio $P(M_2) / P(M_1)$ is the *prior odds ratio*, and is often assumed to be equal to 1 if no compelling prior evidence favors one model over another.
# The ratio $P(D~|~M_2) / P(D~|~M_1)$ is the *Bayes factor*, and is the key to Bayesian model selection.
# 
# The Bayes factor can be computed by evaluating the integral over the parameter likelihood:
# 
# $$
# P(D~|~M) = \int_\Omega P(D~|~\theta, M) P(\theta~|~M) d\theta
# $$
# 
# This integral is over the entire parameter space of the model, and thus can be extremely computationally intensive, especially as the dimension of the model grows beyond a few.
# For the 2-dimensional and 3-dimensional models we are considering here, however, this integral can be computed directly via numerical integration.
# 
# We'll start, though, by using an MCMC to draw samples from the posterior in order to solve the *model fitting* problem.
# We will use the [emcee](http://dan.iel.fm/emcee) package, which requires us to first define functions which compute the prior, likelihood, and posterior under each model:

# In[13]:


def log_prior(theta):
    # size of theta determines the model.
    # flat prior over a large range
    if np.any(abs(theta) > 100):
        return -np.inf  # log(0)
    else:
        return 200 ** -len(theta)

def log_likelihood(theta, data=data):
    x, y, sigma_y = data
    yM = polynomial_fit(theta, x)
    return -0.5 * np.sum(np.log(2 * np.pi * sigma_y ** 2)
                         + (y - yM) ** 2 / sigma_y ** 2)

def log_posterior(theta, data=data):
    theta = np.asarray(theta)
    return log_prior(theta) + log_likelihood(theta, data)


# Next we draw samples from the posterior using MCMC:

# In[14]:


import emcee

def compute_mcmc(degree, data=data,
                   log_posterior=log_posterior,
                   nwalkers=50, nburn=1000, nsteps=2000):
    ndim = degree + 1  # this determines the model
    rng = np.random.RandomState(0)
    starting_guesses = rng.randn(nwalkers, ndim)
    sampler = emcee.EnsembleSampler(nwalkers, ndim, log_posterior, args=[data])
    sampler.run_mcmc(starting_guesses, nsteps)
    trace = sampler.chain[:, nburn:, :].reshape(-1, ndim)
    return trace

trace_2D = compute_mcmc(1)
trace_3D = compute_mcmc(2)


# The output is a trace, or a series of samples which by design should reflect the posterior distribution.
# To visualize the posterior samples, I like to use seaborn's ``jointplot`` (for 2D samples) or ``PairGrid`` (for N-D samples):

# In[15]:


import pandas as pd
columns = [r'$\theta_{0}$'.format(i) for i in range(3)]
df_2D = pd.DataFrame(trace_2D, columns=columns[:2])

with sns.axes_style('ticks'):
    jointplot = sns.jointplot(r'$\theta_0$', r'$\theta_1$',
                              data=df_2D, kind="hex");


# In[16]:


df_3D = pd.DataFrame(trace_3D, columns=columns[:3])

# get the colormap from the joint plot above
cmap = jointplot.ax_joint.collections[0].get_cmap()

with sns.axes_style('ticks'):
    grid = sns.PairGrid(df_3D)
    grid.map_diag(plt.hist, bins=30, alpha=0.5)
    grid.map_offdiag(plt.hexbin, gridsize=50, linewidths=0, cmap=cmap)


# These samples give us a good idea of what the posterior for each model looks like, but we still must integrate this posterior to find the Bayes factor.
# 
# For these lower-dimensional models, we'll do direct numerical integration using tools from the ``scipy.integrate`` package to integrate the posterior and compute the odds ratio. 
# The call signature of the multiple integration routines is a bit confusing – I suggest referring to the [scipy.integrate documentation](http://docs.scipy.org/doc/scipy/reference/tutorial/integrate.html) to read about the inputs.

# In[17]:


from scipy import integrate

def integrate_posterior_2D(log_posterior, xlim, ylim, data=data):
    func = lambda theta1, theta0: np.exp(log_posterior([theta0, theta1], data))
    return integrate.dblquad(func, xlim[0], xlim[1],
                             lambda x: ylim[0], lambda x: ylim[1])

def integrate_posterior_3D(log_posterior, xlim, ylim, zlim, data=data):
    func = lambda theta2, theta1, theta0: np.exp(log_posterior([theta0, theta1, theta2], data))
    return integrate.tplquad(func, xlim[0], xlim[1],
                             lambda x: ylim[0], lambda x: ylim[1],
                             lambda x, y: zlim[0], lambda x, y: zlim[1])


# The tricky part of the integration is choosing the integration limits correctly; fortunately we can use the MCMC traces to find appropriate values.
# We'll use IPython's ``%time`` magic to record how long each takes:

# In[18]:


xlim, ylim = zip(trace_2D.min(0), trace_2D.max(0))
get_ipython().run_line_magic('time', 'Z1, err_Z1 = integrate_posterior_2D(log_posterior, xlim, ylim)')
print("Z1 =", Z1, "+/-", err_Z1)


# In[19]:


xlim, ylim, zlim = zip(trace_3D.min(0), trace_3D.max(0))
get_ipython().run_line_magic('time', 'Z2, err_Z2 = integrate_posterior_3D(log_posterior, xlim, ylim, zlim)')
print("Z2 =", Z2, "+/-", err_Z2)


# The Bayes factor is simply the quotient of the two integrals:

# In[20]:


print("Bayes factor:", Z2 / Z1)


# The Bayes factor favors the quadratic model, but only slightly (an odds of about 7 to 3).
# In fact, this value for the Bayes factor ranks as "not worth a mere mention" according to the scale proposed by [Kass & Raferty (1995)](https://www.andrew.cmu.edu/user/kk3n/simplicity/KassRaftery1995.pdf) an influential paper on the subject.
# 
# Notice that this interpretation is very similar to what we found with the frequentist approach above, which favors the quadratic model but has too large a $p$-value to support discarding the simpler linear model. Indeed, at the risk of causing die-hard Bayesians to cringe, you can argue roughly that the equivalent "Bayesian $p$-value" assosiated with this odds ratio is

# In[21]:


print('Bayesian "p-value":', Z1 / (Z1 + Z2))


# That is, the posterior probability in favor of the linear model is about 30%, which is not low enough to support rejecting the simpler model.
# I put "$p$-value" here in quotes, because while a classical (frequentist) p-value reflects probability conditioned on the models, this Bayesian "$p$-value" reflects probability conditioned on the data, and so the detailed interpretation is very different.

# #### Other Bayesian approaches
# 
# While direct numerical integration of the posterior works for low-dimensional models, computing Bayes factors for higher-dimensional models requires either a more sophisticated method, or an approximation of the integral.
# For very high dimensional models, this is actually a very hard problem, and an area of active research.
# Here is an incomplete list of approaches you might turn to in this case:
# 
# - [Nested Sampling](https://en.wikipedia.org/wiki/Nested_sampling_algorithm) is a sampling-based algorithm like MCMC, which is specially designed to compute Bayes factors.
# - [Reversible Jump MCMC](https://en.wikipedia.org/wiki/Reversible-jump_Markov_chain_Monte_Carlo) (also see *bridge sampling*) can be used to sample multiple models in the same MCMC chain, so that the Bayes factor can be estimated directly from the joint trace (see, e.g. [this paper](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.399.8889&rep=rep1&type=pdf))
# - **Posterior Predictive Checks** are an interesting means of assessing the fitness of sampled posteriors without having to explicitly compute the integral, by empirically comparing the [Posterior Predictive Distribution](https://en.wikipedia.org/wiki/Posterior_predictive_distribution) to the data. For a technical but relatively approachable introduction, I'd suggest this [1996 paper by Gelman](http://www.cs.princeton.edu/courses/archive/fall09/cos597A/papers/GelmanMengStern1996.pdf).
# - [the Bayesian Information Criterion (BIC)](https://en.wikipedia.org/wiki/Bayesian_information_criterion) quickly approximates the Bayes factor using rather strong assumptions about the form of the posterior. It is very similar in form to the Akaike Information Criterion (AIC), mentioned above.
# 
# Again, I will not demonstrate any of these techniques here, but my hope is that this post has given you the background to understand them from other available references.

# ## A Note On the Example Data
# 
# In case you were curious, both the frequentist and Bayesian approaches landed on the correct answer, in the sense that our data actually was drawn from a straight line.
# Here is the function I used to generate the data used in this post:

# In[22]:


import numpy as np

def generate_data(N=20, rseed=1):
    rng = np.random.RandomState(rseed)
    x = rng.rand(N)
    sigma_y = 0.1 * np.ones(N)
    
    # linear model with noise
    y = x - 0.2 + sigma_y * rng.randn(N)
    return np.vstack([x, y, sigma_y]).round(2)

data = generate_data()


# If you *really* want to get a feel for how these methods work and respond to different inputs, you might wish to [download the notebook](http://jakevdp.github.io/downloads/notebooks/FreqBayes5.ipynb), and re-run the analysis with different functional forms or random seeds.
# In particular, you might think about how the above plots might change if the data were actually drawn from a quadratic function.
# Do the actual results match your intuition?

# ## Some Final Thoughts
# 
# My hope is that the above examples give you a flavor of the essential aspects of model selection within the frequentist and Bayesian paradigms.
# To summarize, I want to offer a few observations:
# 
# **Model comparison is mainly an exercise in preventing over-fitting.**
# Though the value of the maximum likelihood may seem, at first glance, to be a useful metric for comparing models, when the models differ in degrees of freedom this intuition is misleading.
# The frequentist approach addresses this by devising and computing some statistic which implicitly or explicitly accounts for model complexity.
# The Bayesian approach addresses this by integrating over the model parameter space, which in effect acts to automatically penalize overly-complex models.
# 
# Frequentist and Bayesian model selection approaches have complementary strengths and weaknesses:
# 
# **Frequentist model selection** generally relies on the selection of specifically-constructed statistics which apply to the particular data and models being used.
# Because any particular statistic is only applicable in a narrow set of cases, an effective frequentist statistician must have a depth and breadth of knowledge about the properties of common statistical distributions, as well as a well-honed intuition about when to choose one statistic over another.
# This breadth of required knowledge is, in my mind, one of the weaknesses of the frequentist approach.
# In particular, it all but invites misuse: because so many people who deal with data on a daily basis do not have an advanced degree in classical statistics, it is common to see statistics like the $\chi^2$ applied without due consideration of the assumptions required by the statistic.
# On the other hand, the frequentist approach does have the advantage that once the correct frequentist statistic is found, the results can often be computed very efficiently.
# 
# **Bayesian model selection** takes a much more uniform approach: regardless of the data or model being used, the same posterior odds ratio approach is applicable.
# Thus, in some senses, the Bayesian approach is conceptually much easier than the frequentist approach, which is perhaps why it appeals to so many scientists.
# The disadvantage, of course, is computational complexity: integrating the posterior, especially for very high-dimensional models, can be very computationally expensive, and this computational expense makes it tempting to take shortcuts which might bias the results.
# 
# I hope you found this discussion helpful, and thanks for reading!

# *This post was written entirely in the IPython notebook.  You can [download](http://jakevdp.github.io/downloads/notebooks/FreqBayes5.ipynb) this notebook, or see a static view on [nbviewer](http://nbviewer.ipython.org/url/jakevdp.github.io/downloads/notebooks/FreqBayes5.ipynb).*