Model comparison

To demonstrate the use of model comparison criteria, we implement the radon contamination example from the previous lecture.

Below, we fit a pooled model, which assumes a single fixed effect across all counties, and a hierarchical model that allows for a random effect that partially pools the data.

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore", module="mkl_fft")

The data include the observed radon levels and associated covariates for 85 counties in Minnesota.

In [2]:
radon_data = pd.read_csv('../data/radon.csv', index_col=0)
idnum state state2 stfips zip region typebldg floor room basement ... pcterr adjwt dupflag zipflag cntyfips county fips Uppm county_code log_radon
0 5081.0 MN MN 27.0 55735 5.0 1.0 1.0 3.0 N ... 9.7 1146.499190 1.0 0.0 1.0 AITKIN 27001.0 0.502054 0 0.832909
1 5082.0 MN MN 27.0 55748 5.0 1.0 0.0 4.0 Y ... 14.5 471.366223 0.0 0.0 1.0 AITKIN 27001.0 0.502054 0 0.832909
2 5083.0 MN MN 27.0 55748 5.0 1.0 0.0 4.0 Y ... 9.6 433.316718 0.0 0.0 1.0 AITKIN 27001.0 0.502054 0 1.098612
3 5084.0 MN MN 27.0 56469 5.0 1.0 0.0 4.0 Y ... 24.3 461.623670 0.0 0.0 1.0 AITKIN 27001.0 0.502054 0 0.095310
4 5085.0 MN MN 27.0 55011 3.0 1.0 0.0 4.0 Y ... 13.8 433.316718 0.0 0.0 3.0 ANOKA 27003.0 0.428565 1 1.163151

5 rows × 29 columns

Pooled model

In [4]:
from pymc3 import Model, sample, Normal, HalfCauchy, Uniform

floor = radon_data.floor.values
log_radon = radon_data.log_radon.values

with Model() as pooled_model:
    β = Normal('β', 0, sd=1e5, shape=2)
    σ = HalfCauchy('σ', 5)
    θ = β[0] + β[1]*floor
    y = Normal('y', θ, sd=σ, observed=log_radon)
    trace_p = sample(1000, tune=1000, cores=2)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [σ, β]
Sampling 2 chains: 100%|██████████| 4000/4000 [00:06<00:00, 298.69draws/s]
In [5]:
from pymc3 import traceplot

traceplot(trace_p, varnames=['β']);

Hierarchical model

In [6]:
mn_counties = radon_data.county.unique()
counties = mn_counties.shape[0]
county = radon_data.county_code.values
floor_measure = radon_data.floor.values
with Model() as hierarchical_model:
    # Priors
    μ_a = Normal('μ_a', mu=0., tau=0.0001)
    σ_a = HalfCauchy('σ_a', 5)
    # Random intercepts
    a = Normal('a', mu=μ_a, sd=σ_a, shape=counties)
    # Common slope
    b = Normal('b', mu=0., sd=1e5)
    # Model error
    sd_y = HalfCauchy('sd_y', 5)
    # Expected value
    y_hat = a[county] + b * floor_measure
    # Data likelihood
    y_like = Normal('y_like', mu=y_hat, sd=sd_y, observed=log_radon)
    trace_h = sample(1000, tune=1000, cores=2)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [sd_y, b, a, σ_a, μ_a]
Sampling 2 chains: 100%|██████████| 4000/4000 [00:15<00:00, 260.87draws/s]
In [7]:
traceplot(trace_h, varnames=['μ_a', 'b']);
In [8]:
from pymc3 import forestplot

forestplot(trace_h, varnames=['a'], ylabels=['']);

Predictive Information Criteria

Measures of predictive accuracy are called information criteria, and are comprised of the log-predictive density of the data given a point estimate of the fitted model multiplied by −2 (i.e. the deviance):

$$−2 \log[p(y | \hat{\theta})]$$

Clearly, the expected accuracy of a fitted model’s predictions of future data will generally be lower than the accuracy of the model’s predictions for observed data, even though the parameters in the model happen to be sampled from the specified prior distribution.

Why are we interested in prediction accuracy?

  1. to quantify the performance of a model
  2. to perform model selection

By model selection, we may not necessarily want to choose one model over another, but we might want to put different models on the same scale. The advantage if information-theoretic measures is that candidate models do not need to be nested; even models with completely different parameterizations can be used to predict the same measurements.

Note that when candidate models have the same number of parameters, one can compare their best-fit log predictive densities directly, but when model dimensions differ, one has to make an adjustment for the tendency of a larger model to fit data better.

One advantage of using predictive information criteria for model comparison is that they allow us to estimate out-of-sample predictive accuracy using the data in our sample. All such methods are approximations of predictive accuracy, so they are not perfect, but they perform reasonably well.

One can naively use the log predictive density for the sample data (within-sample predictive accuracy) as an approximation of the out-of-sample predictive accuracy, but this will almost always result in an overestimate of performance.

As is popular in machine learning, cross-validation can be used to evaluate predictive accuracy, whereby the dataset is partitioned and each partition is allowed to be used to fit the model and evaluate the fit. However, this method is computationally expensive because it reqiures the same model to be fit to multiple subsets of the data.

We will focus here on adjusted within-sample predictive accuracy, using a variety of information criteria. The goal here is to get an approximately unbiased estimate of predictive accuracy which are correct in expectation.


One approach to model selection is to use an information-theoretic criterion to identify the most appropriate model. Akaike (1973) found a formal relationship between Kullback-Leibler information (a dominant paradigm in information and coding theory) and likelihood theory. Akaike's Information Criterion (AIC) is an estimator of expected relative K-L information based on the maximized log-likelihood function, corrected for asymptotic bias.

$$\text{AIC} = −2 \log(L(\theta|data)) + 2K$$

AIC balances the fit of the model (in terms of the likelihood) with the number of parameters required to achieve that fit. We can easily calculate AIC from the residual sums of squares as:

$$\text{AIC} = n \log(\text{RSS}/n) + 2k$$

where $k$ is the number of parameters in the model. Notice that as the number of parameters increase, the residual sum of squares goes down, but the second term (a penalty) increases.

A limitation of AIC for Bayesian models is that it cannot be applied to hierarchical models (or any models with random effects), as counting the number of parameters in such models is problematic. A more Bayesian version of AIC is called the deviance information criterion (DIC), and replaces the fixed parameter penalty with an estimate of the effective number of parameters.

$$p_{DIC} = 2\left(\log p(y | E[\theta | y]) - E_{post}[\log p(y|\theta)] \right)$$

where the second term is an average of $\theta$ over the posterior distribution:

$$\hat{p}_{DIC} = 2\left(\log p(y | E[\theta | y]) - \frac{1}{M} \sum_{j=1}^{M}\log p(y|\theta^{(j)}) \right)$$

DIC is computed as:

$$\text{DIC} = -2 \log p(y | E[\theta | y]) + 2p_{DIC}$$

Though this is an improvement over AIC, DIC is still not fully Bayesian, as it relies on a point estimate of the model rather than using the full posterior. As a result, it can be unstable for hierarchical models, sometimes producing estimates of effective number of parameters that is negative.

Widely-applicable Information Criterion (WAIC)

WAIC (Watanabe 2010) is a fully Bayesian criterion for estimating out-of-sample expectation, using the log pointwise posterior predictive density (LPPD) and correcting for the effective number of parameters to adjust for overfitting.

The computed log pointwise predictive density is:

$$lppd_{comp} = \sum_{i=1}^N \log \left(\frac{1}{M} \sum_{j=1}^M p(y_i | \theta^{(j)}) \right)$$

The complexity adjustment here is as follows:

$$p_{WAIC} = 2\sum_{i=1}^N \left[ \log \left(\frac{1}{M} \sum_{j=1}^M p(y_i | \theta^{(j)})\right) - \frac{1}{M} \sum_{j=1}^M \log p(y_i | \theta^{(j)}) \right]$$

so WAIC is then:

$$\text{WAIC} = -2(lppd) + 2p_{WAIC}$$

The adjustment is an approximation to the number of unconstrained parameters in the model (0=fully constrained, 1=no constraints). In this sense, WAIC treats the effective number of paramters as a random variable.

WAIC averages over the posterior distribution, and therefore is more reliable for a wider range of models.

In [9]:
from pymc3 import waic

pooled_waic = waic(trace_p, pooled_model)
In [10]:
hierarchical_waic = waic(trace_h, hierarchical_model)
/home/fonnesbeck/anaconda3/envs/dev/lib/python3.6/site-packages/pymc3/ UserWarning: For one or more samples the posterior variance of the
        log predictive densities exceeds 0.4. This could be indication of
        WAIC starting to fail see for details

PyMC3 includes two convenience functions to help compare WAIC for different models. The first of this functions is compare, this one computes WAIC (or LOO) from a set of traces and models and returns a DataFrame.

In [11]:
from pymc3 import compare

df_comp_WAIC = compare({hierarchical_model:trace_h, pooled_model:trace_p})
/home/fonnesbeck/anaconda3/envs/dev/lib/python3.6/site-packages/pymc3/ UserWarning: For one or more samples the posterior variance of the
        log predictive densities exceeds 0.4. This could be indication of
        WAIC starting to fail see for details
WAIC pWAIC dWAIC weight SE dSE var_warn
0 2072.02 48.43 0 0.93 55.87 0 1
1 2179.66 3.68 107.64 0.07 50 21.44 0

We have many columns so let check one by one the meaning of them:

  1. The first column contains the values of WAIC. The DataFrame is always sorted from lowest to highest WAIC. The index reflects the order in which the models are passed to this function.

  2. The second column is the estimated effective number of parameters. In general, models with more parameters will be more flexible to fit data and at the same time could also lead to overfitting. Thus we can interpret pWAIC as a penalization term, intuitively we can also interpret it as measure of how flexible each model is in fitting the data.

  3. The third column is the relative difference between the value of WAIC for the top-ranked model and the value of WAIC for each model. For this reason we will always get a value of 0 for the first model.

  4. Sometimes when comparing models, we do not want to select the "best" model, instead we want to perform predictions by averaging along all the models (or at least several models). Ideally we would like to perform a weighted average, giving more weight to the model that seems to explain/predict the data better. There are many approaches to perform this task, one of them is to use Akaike weights based on the values of WAIC for each model. These weights can be loosely interpreted as the probability of each model (among the compared models) given the data. One caveat of this approach is that the weights are based on point estimates of WAIC (i.e. the uncertainty is ignored).

  5. The fifth column records the standard error for the WAIC computations. The standard error can be useful to assess the uncertainty of the WAIC estimates. Nevertheless, caution need to be taken because the estimation of the standard error assumes normality and hence could be problematic when the sample size is low.

  6. In the same way that we can compute the standard error for each value of WAIC, we can compute the standard error of the differences between two values of WAIC. Notice that both quantities are not necessarily the same, the reason is that the uncertainty about WAIC is correlated between models. This quantity is always 0 for the top-ranked model.

  7. Finally we have the last column named "warning". A value of 1 indicates that the computation of WAIC may not be reliable, this warning is based on an empirical determined cutoff value and need to be interpreted with caution. For more details you can read this paper.

The second convenience function takes the output of compare and produces a summary plot in the style of the one used in the book Statistical Rethinking by Richard McElreath (check also this port of the examples in the book to PyMC3).

In [12]:
from pymc3 import compareplot

  • The empty circle represents the values of WAIC and the black error bars associated with them are the values of the standard deviation of WAIC.
  • The value of the lowest WAIC is also indicated with a vertical dashed grey line to ease comparison with other WAIC values.
  • The filled black dots are the in-sample deviance of each model, which for WAIC is 2 pWAIC from the corresponding WAIC value.

For all models except the top-ranked one we also get a triangle indicating the value of the difference of WAIC between that model and the top model and a grey errobar indicating the standard error of the differences between the top-ranked WAIC and WAIC for each model.

Leave-one-out Cross-validation (LOO)

LOO cross-validation is an estimate of the out-of-sample predictive fit. In cross-validation, the data are repeatedly partitioned into training and holdout sets, iteratively fitting the model with the former and evaluating the fit with the holdout data.

The estimate of out-of-sample predictive fit from applying LOO cross-validation to a Bayesian model is:

$$lppd_{loo} = \sum_{i=1}^N \log p_{post(-i)}(y_i) = \sum_{i=1}^N \log \left(\frac{1}{S} \sum_{s=1}^S p(y_i| \theta^{(is)})\right)$$

so, each prediction is conditioned on $N-1$ data points, which induces an underestimation of the predictive fit for smaller $N$. The resulting estimate of effective samples size is:

$$p_{loo} = lppd - lppd_{loo}$$

As mentioned, using cross-validation for a Bayesian model, fitting $N$ copies of the model under different subsets of the data is computationally expensive. However, Vehtari et al. (2016) introduced an efficient computation of LOO from MCMC samples, which are corrected using Pareto-smoothed importance sampling (PSIS) to provide an estimate of point-wise out-of-sample prediction accuracy.

This involves estimating the importance sampling LOO predictive distribution

$$p(\tilde{y}_i | y_{-i}) \approx \frac{\sum_{s=1}^S w_i(\theta^{(s)}) p(\tilde{y}_i|\theta^{(s)})}{\sum_{s=1}^S w_i(\theta^{(s)})}$$

where the importance weights are:

$$w_i(\theta^{(s)}) = \frac{1}{p(y_i | \theta^{(s)})} \propto \frac{p(\theta^{(s)}|y_{-i})}{p(\theta^{(s)}|y)}$$

The predictive distribution evaluated at the held-out point is then:

$$p(y_i | y_{-i}) \approx \frac{1}{\frac{1}{S} \sum_{s=1}^S \frac{1}{p(y_i | \theta^{(s)})}}$$

However, the posterior is likely to have a smaller variance and thinner tails than the LOO posteriors, so this approximation induces instability due to the fact that the importance ratios can have high or infinite variance. To deal with this instability, a generalized Pareto distribution fit to the upper tail of the distribution of the importance ratios can be used to construct a test for a finite importance ratio variance. If the test suggests the variance is infinite then importance sampling is halted.

LOO using using Pareto-smoothed importance sampling is implemented in PyMC3 in the loo function.

In [13]:
from pymc3 import loo

pooled_loo = loo(trace_p, pooled_model)
In [14]:
hierarchical_loo  = loo(trace_h, hierarchical_model)

We can also use compare with LOO.

In [15]:
df_comp_LOO = compare({hierarchical_model:trace_h, pooled_model:trace_p}, ic='LOO')
LOO pLOO dLOO weight SE dSE shape_warn
0 2072.93 48.88 0 0.92 55.96 0 0
1 2179.66 3.68 106.73 0.08 50 21.5 0

We can also plot the results

In [16]:


Though we might expect the hierarchical model to outperform a complete pooling model, there is little to choose between the models in this case, giving that both models gives very similar values of the information criteria. This is more clearly appreciated when we take into account the uncertainty (in terms of standard errors) of WAIC and LOO.


Gelman, A., Hwang, J., & Vehtari, A. (2014). Understanding predictive information criteria for Bayesian models. Statistics and Computing, 24(6), 997–1016.

Vehtari, A, Gelman, A, Gabry, J. (2016). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing