After running an MCMC simulation, `sample` returns a `MultiTrace` object containing the samples for all the stochastic and deterministic random variables. The final step in Bayesian computation is model checking, in order to ensure that inferences derived from your sample are valid. There are two components to model checking:

- Convergence diagnostics
- Goodness of fit

Convergence diagnostics are intended to detect lack of convergence in the Markov chain Monte Carlo sample; they are used to ensure that you have not halted your sampling too early. However, a converged model is not guaranteed to be a good model. The second component of model checking, goodness of fit, is used to check the internal validity of the model, by comparing predictions from the model to the data used to fit the model.

Valid inferences from sequences of MCMC samples are based on the
assumption that the samples are derived from the true posterior
distribution of interest. Theory guarantees this condition as the number
of iterations approaches infinity. It is important, therefore, to
determine the **minimum number of samples** required to ensure a reasonable
approximation to the target posterior density. Unfortunately, no
universal threshold exists across all problems, so convergence must be
assessed independently each time MCMC estimation is performed. The
procedures for verifying convergence are collectively known as
*convergence diagnostics*.

One approach to analyzing convergence is **analytical**, whereby the
variance of the sample at different sections of the chain is compared
to that of the limiting distribution. These methods use distance metrics
to analyze convergence, or place theoretical bounds on the sample
variance, and though they are promising, they are generally difficult to
use and are not prominent in the MCMC literature. More common is a
**statistical** approach to assessing convergence. With this approach,
rather than considering the properties of the theoretical target
distribution, only the statistical properties of the observed chain are
analyzed. Reliance on the sample alone restricts such convergence
criteria to **heuristics**. As a result, convergence cannot be guaranteed.
Although evidence for lack of convergence using statistical convergence
diagnostics will correctly imply lack of convergence in the chain, the
absence of such evidence will not *guarantee* convergence in the chain.
Nevertheless, negative results for one or more criteria may provide some
measure of assurance to users that their sample will provide valid
inferences.

For most simple models, convergence will occur quickly, sometimes within
the first several hundred iterations, after which all remaining
samples of the chain may be used to calculate posterior quantities. For
more complex models, convergence requires a significantly longer burn-in
period; sometimes orders of magnitude more samples are needed.
Frequently, lack of convergence will be caused by **poor mixing**.
Recall that *mixing* refers to the degree to which the Markov
chain explores the support of the posterior distribution. Poor mixing
may stem from inappropriate proposals (if one is using the
Metropolis-Hastings sampler) or from attempting to estimate models with
highly correlated variables.
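To make the link between proposal scale and mixing concrete, here is a minimal random-walk Metropolis sketch in plain NumPy that targets a standard normal. The function name `metropolis_normal` and the three proposal scales are illustrative assumptions; this is not how PyMC3 implements its `Metropolis` step.

```python
import numpy as np

def metropolis_normal(n_iter, proposal_sd, start=0.0, seed=0):
    """Random-walk Metropolis targeting a standard normal (illustrative only)."""
    rng = np.random.default_rng(seed)
    chain = np.empty(n_iter)
    current = start
    accepted = 0
    for t in range(n_iter):
        proposal = current + rng.normal(0.0, proposal_sd)
        # Log acceptance ratio for a N(0, 1) target with a symmetric proposal
        log_ratio = 0.5 * (current**2 - proposal**2)
        if np.log(rng.uniform()) < log_ratio:
            current = proposal
            accepted += 1
        chain[t] = current
    return chain, accepted / n_iter

for sd in (0.05, 2.5, 50.0):
    chain, rate = metropolis_normal(5000, sd)
    print(f"proposal_sd={sd}: acceptance={rate:.2f}, sample sd={chain.std():.2f}")
```

A very small proposal scale accepts almost every move but explores the support slowly, while a very large one rejects most proposals; both are forms of poor mixing.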

In [1]:

```
%matplotlib inline
import numpy as np
import seaborn as sns; sns.set_context('notebook')
import warnings
warnings.filterwarnings("ignore", module="mkl_fft")
warnings.filterwarnings("ignore", module="matplotlib")
```

In [2]:

```
from pymc3 import Normal, Binomial, sample, Model
from pymc3.math import invlogit

# Samples for each dose level
n = 5 * np.ones(4, dtype=int)
# Log-dose
dose = np.array([-.86, -.3, -.05, .73])
deaths = np.array([0, 1, 3, 5])

with Model() as bioassay_model:
    # Logit-linear model parameters
    alpha = Normal('alpha', 0, sd=100)
    beta = Normal('beta', 0, sd=100)
    # Calculate probabilities of death
    theta = invlogit(alpha + beta * dose)
    # Data likelihood
    obs_deaths = Binomial('obs_deaths', n=n, p=theta, observed=deaths)
```

In [3]:

```
with bioassay_model:
    bioassay_trace = sample(1000, cores=2)
```

In [4]:

```
from pymc3 import traceplot
traceplot(bioassay_trace, varnames=['alpha']);
```

The most straightforward approach for assessing convergence is based on
simply **plotting and inspecting traces and histograms** of the observed
MCMC sample, as was done in the cell above. If the trace of values for each of the stochastics exhibits
asymptotic behavior over the last $m$ iterations, this may be
satisfactory evidence for convergence.

A similar approach involves plotting a histogram for every set of $k$ iterations (perhaps 50-100) beyond some burn-in threshold $n$; if the histograms are not visibly different among the sample intervals, this may be considered some evidence for convergence. Note that such diagnostics should be carried out for each stochastic estimated by the MCMC algorithm, because convergent behavior by one variable does not imply convergence for other variables in the analysis.

In [5]:

```
import matplotlib.pyplot as plt
beta_trace = bioassay_trace['beta']
fig, axes = plt.subplots(2, 5, figsize=(14,6))
axes = axes.ravel()
for i in range(10):
    axes[i].hist(beta_trace[100*i:100*(i+1)])
plt.tight_layout()
```

An extension of this approach can be taken when multiple parallel chains are run, rather than just a single, long chain. In this case, the final values of $c$ chains run for $n$ iterations are plotted in a histogram; just as above, this is repeated every $k$ iterations thereafter, and the histograms of the endpoints are plotted again and compared to the previous histogram. This is repeated until consecutive histograms are indistinguishable.

Another *ad hoc* method for detecting lack of convergence is to examine
the traces of several MCMC chains initialized with different starting
values. Overlaying these traces on the same set of axes should (if
convergence has occurred) show each chain tending toward the same
equilibrium value, with approximately the same variance. Recall that the
tendency for some Markov chains to converge to the true (unknown) value
from diverse initial values is called *ergodicity*. This property is
guaranteed by the reversible chains constructed using MCMC, and should
be observable using this technique. Again, however, this approach is
only a heuristic method, and cannot always detect lack of convergence,
even though chains may appear ergodic.

In [6]:

```
from pymc3 import Metropolis
with bioassay_model:
    tr = sample(200, cores=2, start=[{'alpha':-5}, {'alpha':5}], step=Metropolis(),
                discard_tuned_samples=False, random_seed=1)
```

In [7]:

```
plt.plot(tr.get_values('alpha', chains=0)[:100], 'r--')
plt.plot(tr.get_values('alpha', chains=1)[:100], 'k--')
```

A principal reason that evidence from informal techniques cannot
guarantee convergence is a phenomenon called **metastability**. Chains may
appear to have converged to the true equilibrium value, displaying
excellent qualities by any of the methods described above. However,
after some period of stability around this value, the chain may suddenly
move to another region of the parameter space. This period
of metastability can sometimes be very long, and therefore escape
detection by these convergence diagnostics. Unfortunately, there is no
statistical technique available for detecting metastability.

Along with the *ad hoc* techniques described above, a number of more
formal methods exist which are prevalent in the literature. These are
considered more formal because they are based on existing statistical
methods, such as time series analysis.

PyMC currently includes three formal convergence diagnostic methods. The first, proposed by Geweke (1992), is a time-series approach that compares the mean and variance of segments from the beginning and end of a single chain.

$$z = \frac{\bar{\theta}_a - \bar{\theta}_b}{\sqrt{S_a(0) + S_b(0)}}$$

where $a$ is the early interval and $b$ the late interval, and $S_i(0)$ is the spectral density estimate at zero frequency for chain segment $i$. If the z-scores (theoretically distributed as standard normal variates) of these two segments are similar, it can provide evidence for convergence. PyMC calculates z-scores of the difference between various initial segments along the chain, and the last 50% of the remaining chain. If the chain has converged, the majority of points should fall within 2 standard deviations of zero.
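The z-score above can be sketched in a few lines of NumPy. This is a simplified illustration, not PyMC's implementation: the function name `naive_geweke_z` is hypothetical, and the spectral density at zero frequency is replaced by the sample variance divided by the segment length, which assumes the draws within each segment are roughly independent.

```python
import numpy as np

def naive_geweke_z(x, first=0.1, last=0.5):
    """Z-score comparing the first and last fractions of a single chain.

    Assumption: S(0) is approximated by var / n for each segment, which
    ignores autocorrelation. A teaching sketch only.
    """
    x = np.asarray(x, dtype=float)
    a = x[: int(first * len(x))]            # early segment
    b = x[int((1 - last) * len(x)):]        # late segment
    se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    return (a.mean() - b.mean()) / se

rng = np.random.default_rng(42)
stationary = rng.normal(size=2000)
# Early draws are shifted upward, mimicking a chain that has not yet settled
drifting = stationary + np.linspace(3.0, 0.0, 2000)

print(naive_geweke_z(stationary))
print(naive_geweke_z(drifting))
```

The stationary series should give a score of modest magnitude, while the drifting series gives a score far outside the range expected of a standard normal variate.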

In PyMC, diagnostic z-scores can be obtained by calling the `geweke` function. It
accepts either (1) a single trace, (2) a Node or Stochastic object, or
(3) an entire Model object:

In [8]:

```
from pymc3 import geweke
with bioassay_model:
    tr = sample(2000, tune=1000, cores=2)

z = geweke(tr, intervals=15)
```

In [9]:

```
plt.scatter(*z[0]['alpha'].T)
plt.hlines([-1,1], 0, 1000, linestyles='dotted')
plt.xlim(0, 1000);
```

The arguments expected are the following:

- `x`: The trace of a variable.
- `first`: The fraction of series at the beginning of the trace.
- `last`: The fraction of series at the end to be compared with the section at the beginning.
- `intervals`: The number of segments.

Plotting the output displays the scores in series, making it easy to see departures from the standard normal assumption.

A second convergence diagnostic provided by PyMC is the Gelman-Rubin statistic (Gelman and Rubin, 1992). This diagnostic uses multiple chains to check for lack of convergence, and is based on the notion that if multiple chains have converged, by definition they should appear very similar to one another; if not, one or more of the chains has failed to converge.

The Gelman-Rubin diagnostic uses an analysis of variance approach to assessing convergence. That is, it calculates both the between-chain variance (B) and within-chain variance (W), and assesses whether they are different enough to worry about convergence. Assuming $m$ chains, each of length $n$, quantities are calculated by:

$$\begin{align}B &= \frac{n}{m-1} \sum_{j=1}^m (\bar{\theta}_{.j} - \bar{\theta}_{..})^2 \\ W &= \frac{1}{m} \sum_{j=1}^m \left[ \frac{1}{n-1} \sum_{i=1}^n (\theta_{ij} - \bar{\theta}_{.j})^2 \right] \end{align}$$

for each scalar estimand $\theta$. Using these values, an estimate of the marginal posterior variance of $\theta$ can be calculated:

$$\hat{\text{Var}}(\theta | y) = \frac{n-1}{n} W + \frac{1}{n} B$$

Assuming $\theta$ was initialized to arbitrary starting points in each chain, this quantity will overestimate the true marginal posterior variance. At the same time, $W$ will tend to underestimate the within-chain variance early in the sampling run. However, in the limit as $n \rightarrow \infty$, both quantities will converge to the true variance of $\theta$. In light of this, the Gelman-Rubin statistic monitors convergence using the ratio:

$$\hat{R} = \sqrt{\frac{\hat{\text{Var}}(\theta | y)}{W}}$$

This is called the potential scale reduction, since it is an estimate of
the potential reduction in the scale of $\theta$ as the number of
simulations tends to infinity. In practice, we look for values of
$\hat{R}$ close to one (say, less than 1.1) to be confident that a
particular estimand has converged. In PyMC, the function
`gelman_rubin` will calculate $\hat{R}$ for each stochastic node in
the passed model:

In [10]:

```
from pymc3 import gelman_rubin
gelman_rubin(bioassay_trace)
```

For the best results, each chain should be initialized to highly dispersed starting values for each stochastic node.
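The formulas for $B$, $W$ and $\hat{R}$ translate directly into NumPy. The sketch below is a transcription of those equations for an array with one row per chain; the function name `gelman_rubin_rhat` and the simulated chains are illustrative assumptions, and the result is not guaranteed to match the output of `gelman_rubin` digit for digit.

```python
import numpy as np

def gelman_rubin_rhat(chains):
    """Potential scale reduction for an (m, n) array: m chains of length n."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    B = n * chains.mean(axis=1).var(ddof=1)    # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()      # mean within-chain variance
    var_hat = (n - 1) / n * W + B / n          # estimated posterior variance
    return np.sqrt(var_hat / W)

# Tiny worked case: B = 1.5, W = 1, so R-hat = sqrt(7/6)
print(gelman_rubin_rhat([[1, 2, 3], [2, 3, 4]]))

rng = np.random.default_rng(0)
mixed = rng.normal(size=(4, 1000))                      # four chains, same target
stuck = mixed + np.array([0.0, 2.0, 4.0, 6.0])[:, None]  # chains centred apart
print(gelman_rubin_rhat(mixed), gelman_rubin_rhat(stuck))
```

Chains that sample the same distribution give a value very close to one, whereas chains centred on different values inflate $B$ and push $\hat{R}$ well above the 1.1 guideline.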

By default, when calling the `forestplot` function using nodes with
multiple chains, the $\hat{R}$ values will be plotted alongside the
posterior intervals.

In [11]:

```
from pymc3 import forestplot
forestplot(bioassay_trace)
```

In general, samples drawn from MCMC algorithms will be autocorrelated. This is not a big deal, other than the fact that autocorrelated chains may require longer sampling in order to adequately characterize posterior quantities of interest. The calculation of autocorrelation is performed for each lag $i=1,2,\ldots,k$ (the correlation at lag 0 is, of course, 1) by:

$$\hat{\rho}_i = 1 - \frac{V_i}{2\hat{\text{Var}}(\theta | y)}$$

where $\hat{\text{Var}}(\theta | y)$ is the same estimated variance as calculated for the Gelman-Rubin statistic, and $V_i$ is the variogram at lag $i$ for $\theta$:

$$V_i = \frac{1}{m(n-i)}\sum_{j=1}^m \sum_{k=i+1}^n (\theta_{jk} - \theta_{j(k-i)})^2$$

This autocorrelation can be visualized using the `autocorrplot` function in PyMC3:

In [12]:

```
from pymc3 import autocorrplot
autocorrplot(tr);
```

The effective sample size is estimated using the partial sum:

$$\hat{n}_{eff} = \frac{mn}{1 + 2\sum_{i=1}^T \hat{\rho}_i}$$

where $T$ is the first odd integer such that $\hat{\rho}_{T+1} + \hat{\rho}_{T+2}$ is negative.

Note that the effective sample size is itself **estimated** from the fit output, so the estimate can be unreliable. Values of $n_{eff} / n_{iter} < 0.001$ indicate a biased estimator, resulting in an overestimate of the true effective sample size.
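As a rough illustration of the variogram-based formulas for $\hat{\rho}_i$ and $\hat{n}_{eff}$, the following NumPy sketch computes both for an array with one row per chain. The function name `effective_sample_size` and the simulated AR(1) chains are assumptions made for the example; PyMC3's `effective_n` may differ in implementation details.

```python
import numpy as np

def effective_sample_size(chains):
    """n_eff for an (m, n) array, using the variogram-based autocorrelations."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    # Pooled variance estimate, as in the Gelman-Rubin calculation
    W = chains.var(axis=1, ddof=1).mean()
    B = n * chains.mean(axis=1).var(ddof=1)
    var_hat = (n - 1) / n * W + B / n

    def rho(lag):
        # Variogram at this lag, averaged over all chains and time points
        V = np.mean((chains[:, lag:] - chains[:, :-lag]) ** 2)
        return 1.0 - V / (2.0 * var_hat)

    # T is the first odd integer with rho_{T+1} + rho_{T+2} < 0
    T = 1
    while T + 2 < n and rho(T + 1) + rho(T + 2) >= 0:
        T += 2
    return m * n / (1.0 + 2.0 * sum(rho(i) for i in range(1, T + 1)))

rng = np.random.default_rng(0)
m, n, phi = 4, 1000, 0.9
iid = rng.normal(size=(m, n))            # independent draws
ar1 = np.empty((m, n))                   # strongly autocorrelated draws
ar1[:, 0] = rng.normal(size=m)
for t in range(1, n):
    ar1[:, t] = phi * ar1[:, t - 1] + rng.normal(size=m)

print(effective_sample_size(iid), effective_sample_size(ar1))
```

For independent draws the estimate should be near the nominal $mn = 4000$; for the autocorrelated chains it should be far smaller, reflecting how much less information each draw carries.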

In [13]:

```
from pymc3 import effective_n
effective_n(bioassay_trace)
```

Both low $n_{eff}$ and high $\hat{R}$ indicate **poor mixing**.

It is tempting to want to **thin** the chain to eliminate the autocorrelation (*e.g.* taking every 20th sample from the traces above), but this is a waste of time. Since thinning deliberately throws out the majority of the samples, no efficiency is gained; you ultimately require more samples to achieve a particular desired sample size.

Hamiltonian Monte Carlo is a powerful and efficient MCMC sampler when set up appropriately. However, this typically requires careful tuning of the sampler parameters, such as tree depth, leapfrog step size and target acceptance rate. Fortunately, the NUTS algorithm takes care of some of this for us. Nevertheless, tuning must be carefully monitored for failures that frequently arise. This is particularly the case when fitting challenging models, such as those with high curvature or heavy tails.

NUTS uses a recursive algorithm to build a set of likely candidate points that spans a wide swath of the target distribution. True to its name, it stops automatically when it starts to double back and retrace its steps.

The algorithm employs **binary doubling**, which takes leapfrog steps either forward or backward in time, with the direction chosen at random at each doubling. That is, one step is taken in one direction, then two more in a randomly chosen direction, then four, eight, etc. The result is a balanced, binary tree with nodes comprised of Hamiltonian states.

The doubling process builds a balanced binary tree whose leaf nodes correspond to position-momentum states. Doubling is halted when the subtrajectory from the leftmost to the rightmost nodes of any balanced subtree of the overall binary tree starts to double back on itself.

To ensure detailed balance, a slice variable is sampled from:

$$ u \sim \text{Uniform}(0, \exp[L(\theta) - 0.5 r \cdot r])$$

where $L(\theta)$ is the log density of the target at position $\theta$ and $r$ is the initial momentum vector. The next sample is then chosen uniformly from the points in the remaining balanced tree.

Fortunately, however, gradient-based sampling provides the ability to diagnose these pathologies. PyMC makes several diagnostic statistics available as attributes of the `MultiTrace` object returned by the `sample` function.

In [14]:

```
bioassay_trace.stat_names
```

- `mean_tree_accept`: The mean acceptance probability for the tree that generated this sample. The mean of these values across all samples but the burn-in should be approximately `target_accept` (the default for this is 0.8).
- `diverging`: Whether the trajectory for this sample diverged. If there are many diverging samples, this usually indicates that a region of the posterior has high curvature. Reparametrization can often help, but you can also try to increase `target_accept` to something like 0.9 or 0.95.
- `energy`: The energy at the point in phase-space where the sample was accepted. This can be used to identify posteriors with problematically long tails. See below for an example.
- `energy_error`: The difference in energy between the start and the end of the trajectory. For a perfect integrator this would always be zero.
- `max_energy_error`: The maximum difference in energy along the whole trajectory.
- `depth`: The depth of the tree that was used to generate this sample.
- `tree_size`: The number of leaves of the sampling tree when the sample was accepted. This is usually a bit less than $2 ^ \text{depth}$. If the tree size is large, the sampler is using a lot of leapfrog steps to find the next sample. This can happen, for example, if there are strong correlations in the posterior, if the posterior has long tails, if there are regions of high curvature ("funnels"), or if the variance estimates in the mass matrix are inaccurate. Reparametrization of the model or estimating the posterior variances from past samples might help.
- `tune`: This is `True` if step size adaptation was turned on when this sample was generated.
- `step_size`: The step size used for this sample.
- `step_size_bar`: The current best known step size. After the tuning samples, the step size is set to this value. This should converge during tuning.

If the name of the statistic does not clash with the name of one of the variables, we can use indexing to get the values. The values for the chains will be concatenated.

We can see that the step sizes converged after the 2000 tuning samples for both chains to about the same value. The first 3000 values are from chain 1, the second 3000 from chain 2.

In [15]:

```
with bioassay_model:
    trace = sample(1000, tune=2000, init=None, cores=2, discard_tuned_samples=False)
```

In [16]:

```
plt.plot(trace['step_size_bar']);
```

The `get_sampler_stats` method provides more control over which values should be returned, and it also works if the name of the statistic is the same as the name of one of the variables. We can use the `chains` option to control which chain's values are returned, or we can set `combine=False` to get the values for the individual chains.

The failure of HMC samplers to be geometrically ergodic with respect to any target distribution manifests itself in distinct behaviors. One of these behaviors is the appearance of divergences that indicate the Hamiltonian Markov chain has encountered regions of high curvature in the target distribution which it cannot adequately explore.

The `NUTS` step method has a maximum tree depth parameter so that infinite loops (which can occur for non-identified models) are avoided. When the maximum tree depth is reached (the default value is 10), the trajectory is stopped. However, complex (but identifiable) models can saturate this threshold, which reduces sampling efficiency.

The `MultiTrace` stores the tree depth for each iteration, so inspecting these traces can reveal saturation if it is occurring.

In [17]:

```
sizes1, sizes2 = trace.get_sampler_stats('depth', combine=False)
fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True, sharey=True)
ax1.plot(sizes1)
ax2.plot(sizes2);
```
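Saturation can also be summarized as a single number rather than read off a plot. The helper below is a hypothetical convenience function, not part of PyMC3; it takes any array of per-iteration tree depths and assumes the default maximum depth of 10 mentioned above. The example depths are made up for illustration.

```python
import numpy as np

def depth_saturation(depths, max_treedepth=10):
    """Fraction of iterations whose tree depth reached max_treedepth."""
    depths = np.asarray(depths)
    return np.mean(depths >= max_treedepth)

# Hypothetical depths for eight iterations; three of them hit the limit
example_depths = np.array([3, 4, 10, 5, 10, 4, 3, 10])
print(depth_saturation(example_depths))  # 0.375
```

With a real trace, the depths returned by `trace.get_sampler_stats('depth')` could be passed in directly; a fraction well above zero suggests the trajectories are being cut short.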

We can also check the acceptance for the trees that generated this sample. The mean of these values across all samples (except the tuning stage) is expected to be the same as `target_accept`, which is 0.8 by default.

In [18]:

```
# Drop the 2000 tuning iterations kept in the trace (tune=2000 above)
accept = trace.get_sampler_stats('mean_tree_accept', burn=2000)
sns.distplot(accept, kde=False);
```

Recall that simulating Hamiltonian dynamics via a symplectic integrator uses a discrete approximation of a continuous function. This is only a reasonable approximation when the step sizes of the integrator are suitably small. A divergent transition may indicate that the approximation is poor.
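The dependence on step size can be seen with a bare leapfrog integrator applied to a one-dimensional standard normal target, where the potential is $U(q) = q^2/2$. This is a minimal sketch with unit mass; the function names and the chosen step sizes are illustrative assumptions and do not reflect the internals of PyMC3's NUTS implementation.

```python
import numpy as np

def leapfrog(q, p, step_size, n_steps, grad_U):
    """Leapfrog integration of Hamiltonian dynamics with unit mass."""
    q, p = float(q), float(p)
    p -= 0.5 * step_size * grad_U(q)      # initial half step for momentum
    for _ in range(n_steps - 1):
        q += step_size * p                # full step for position
        p -= step_size * grad_U(q)        # full step for momentum
    q += step_size * p                    # last full step for position
    p -= 0.5 * step_size * grad_U(q)      # final half step for momentum
    return q, p

def grad_U(q):
    # Gradient of U(q) = q^2 / 2 (standard normal target)
    return q

def hamiltonian(q, p):
    # Potential plus kinetic energy
    return 0.5 * q**2 + 0.5 * p**2

e0 = hamiltonian(1.0, 1.0)
for eps in (0.1, 1.0, 2.5):
    q1, p1 = leapfrog(1.0, 1.0, eps, 20, grad_U)
    print(f"step_size={eps}: energy error = {hamiltonian(q1, p1) - e0:.3g}")
```

With a small step the energy is almost conserved, while a step size beyond the stability limit for this target makes the energy error explode, which is the behavior that is flagged as a divergence.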

If there are too many divergent transitions, then samples are not being drawn from the full posterior, and inferences based on the resulting sample will be biased.

If there are diverging transitions, PyMC3 will issue warnings indicating how many were discovered. We can obtain the indices of them from the trace.

In [19]:

```
diverging_ind = trace['diverging'].nonzero()[0]
diverging_ind
```

If the locations of the divergences are distributed differently than the samples as a whole, this is an indication that the posterior is not being well explored.

In [20]:

```
import pandas as pd
trace_values = pd.DataFrame({v: trace[v] for v in trace.varnames})
```

In [21]:

```
sns.violinplot(x='variable', y='value', data=pd.melt(trace_values), orient='v', inner='quartile')
plt.plot(np.zeros(diverging_ind.shape[0]), trace['alpha'][diverging_ind], 'r.')
plt.plot(np.ones(diverging_ind.shape[0]), trace['beta'][diverging_ind], 'r.')
```

The Bayesian fraction of missing information (BFMI) is a measure of how hard it is to
sample level sets of the posterior at each iteration. Specifically, it quantifies **how well momentum resampling matches the marginal energy distribution**.

A small value indicates that the adaptation phase of the sampler was unsuccessful, and invoking the central limit theorem may not be valid. It indicates whether the sampler is able to *efficiently* explore the posterior distribution.

Though there is not an established rule of thumb for an adequate threshold, values close to one are optimal. Reparameterizing the model is sometimes helpful for improving this statistic.
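For intuition, the empirical estimator described in Betancourt (2017) compares the squared changes in energy between successive iterations with the overall spread of the energies. The sketch below implements that ratio in NumPy; the function name `bfmi_estimate` and the toy energy series are assumptions, and whether PyMC3's `bfmi` uses exactly the same normalization is not something this example establishes.

```python
import numpy as np

def bfmi_estimate(energy):
    """Ratio of squared successive energy changes to total energy spread."""
    energy = np.asarray(energy, dtype=float)
    numerator = np.sum(np.diff(energy) ** 2)
    denominator = np.sum((energy - energy.mean()) ** 2)
    return numerator / denominator

# Toy energies that drift slowly up and then down: small steps, wide range
slow_energy = np.array([0, 1, 2, 3, 4, 5, 4, 3, 2, 1, 0], dtype=float)
print(bfmi_estimate(slow_energy))
```

When the energy changes only a little at each iteration relative to its overall range, as in the toy series, the ratio is small, signalling that momentum resampling explores the energy distribution slowly.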

In [22]:

```
from pymc3 import bfmi
bfmi(trace)
```

Another way of diagnosing this phenomenon is by comparing the overall distribution of
energy levels with the *change* of energy between successive samples. Ideally, they should be very similar.

If the distribution of energy transitions is narrow relative to the marginal energy distribution, this is a sign of inefficient sampling, as many transitions are required to completely explore the posterior. On the other hand, if the energy transition distribution is similar to that of the marginal energy, this is evidence of efficient sampling, resulting in near-independent samples from the posterior.

In [23]:

```
energy = trace['energy']
energy_diff = np.diff(energy)
sns.distplot(energy - energy.mean(), label='energy')
sns.distplot(energy_diff, label='energy diff')
plt.legend();
```

If the overall distribution of energy levels has longer tails, the efficiency of the sampler will deteriorate quickly.

Checking for model convergence is only the first step in the evaluation of MCMC model outputs. It is possible for an entirely unsuitable model to converge, so additional steps are needed to ensure that the estimated model adequately fits the data. One intuitive way of evaluating model fit is to compare model predictions with the observations used to fit the model. In other words, the fitted model can be used to simulate data, and the distribution of the simulated data should resemble the distribution of the actual data.

Fortunately, simulating data from the model is a natural component of the Bayesian modelling framework. Recall, from the discussion on imputation of missing data, the posterior predictive distribution:

$$p(\tilde{y}|y) = \int p(\tilde{y}|\theta) f(\theta|y) d\theta$$

Here, $\tilde{y}$ represents some hypothetical new data that would be expected, taking into account the posterior uncertainty in the model parameters.
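In practice the integral is approximated by simulation: for each posterior draw of the parameters, one replicate data set is drawn from the likelihood. The NumPy sketch below does this for the bioassay setting. The posterior draws here are synthetic stand-ins (drawn from arbitrary normal distributions purely for illustration), not output from the fitted model, and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
dose = np.array([-.86, -.3, -.05, .73])
n = 5 * np.ones(4, dtype=int)

# Synthetic "posterior" draws standing in for an MCMC trace (assumption)
alpha_draws = rng.normal(1.0, 1.0, size=500)
beta_draws = rng.normal(10.0, 5.0, size=500)

# Probability of death at each dose, one row per posterior draw
logit = alpha_draws[:, None] + beta_draws[:, None] * dose
theta = 1.0 / (1.0 + np.exp(-logit))

# One replicate data set per draw: y_rep ~ Binomial(n, theta)
deaths_rep = rng.binomial(n, theta)

print(deaths_rep.shape)         # (500, 4)
print(deaths_rep.mean(axis=0))  # average simulated deaths at each dose
```

Because each replicate uses a different parameter draw, the spread of `deaths_rep` reflects both sampling variability and posterior uncertainty in the parameters.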

Sampling from the posterior predictive distribution is easy
in PyMC. The `sample_ppc` function draws posterior predictive samples from all of the data likelihoods. Consider the `gelman_bioassay` example,
where deaths are modeled as a binomial random variable for which
the probability of death is a logit-linear function of the dose of a
particular drug.

The posterior predictive distribution of deaths uses the same functional form as the data likelihood, in this case a binomial stochastic. Here is the corresponding sample from the posterior predictive distribution (we typically need very few samples relative to the MCMC sample):

In [24]:

```
from pymc3 import sample_ppc
with bioassay_model:
    deaths_sim = sample_ppc(bioassay_trace, samples=500)
```

The degree to which simulated data correspond to observations can be evaluated in at least two ways. First, these quantities can simply be compared visually. This allows for a qualitative comparison of model-based replicates and observations. If there is poor fit, the true value of the data may appear in the tails of the histogram of replicated data, while a good fit will tend to show the true data in high-probability regions of the posterior predictive distribution. Earlier versions of PyMC provided a `gof_plot` function in the Matplot package for producing such plots; the cell below builds the equivalent histograms directly with Matplotlib.

In [25]:

```
fig, axes = plt.subplots(1, 4, figsize=(14, 4))
for obs, sim, ax in zip(deaths, deaths_sim['obs_deaths'].T, axes):
    ax.hist(sim, bins=range(7))
    ax.plot(obs+0.5, 1, 'ro')
```

This example was taken from Gelman *et al.* (2013):

A study was performed for the Educational Testing Service to analyze the effects of special coaching programs on test scores. Separate randomized experiments were performed to estimate the effects of coaching programs for the SAT-V (Scholastic Aptitude Test- Verbal) in each of eight high schools. The outcome variable in each study was the score on a special administration of the SAT-V, a standardized multiple choice test administered by the Educational Testing Service and used to help colleges make admissions decisions; the scores can vary between 200 and 800, with mean about 500 and standard deviation about 100. The SAT examinations are designed to be resistant to short-term efforts directed specifically toward improving performance on the test; instead they are designed to reflect knowledge acquired and abilities developed over many years of education. Nevertheless, each of the eight schools in this study considered its short-term coaching program to be successful at increasing SAT scores. Also, there was no prior reason to believe that any of the eight programs was more effective than any other or that some were more similar in effect to each other than to any other.

Construct an appropriate model for estimating whether coaching effects are positive. You are given the estimated coaching effects (`d`) and their standard errors (`s`). The estimates were obtained by independent experiments, with relatively large sample sizes (over thirty students in each school), so you can assume that they have approximately normal sampling distributions with known variances.

Here are the data:

In [26]:

```
J = 8
d = np.array([28., 8., -3., 7., -1., 1., 18., 12.])
s = np.array([15., 10., 16., 11., 9., 11., 10., 18.])
```

In [27]:

```
# Write your answer here
```

Gelman, A., & Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences. Statistical Science. A Review Journal of the Institute of Mathematical Statistics, 7(4), 457–472.

Geweke, J., Berger, J. O., & Dawid, A. P. (1992). Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments. In Bayesian Statistics 4.

Brooks, S. P., Catchpole, E. A., & Morgan, B. J. T. (2000). Bayesian Animal Survival Estimation. Statistical Science. A Review Journal of the Institute of Mathematical Statistics, 15(4), 357–376. doi:10.1214/ss/1177010123

Gelman, A., Meng, X., & Stern, H. (1996). Posterior predictive assessment of model fitness via realized discrepancies with discussion. Statistica Sinica, 6, 733–807.

Betancourt, M. (2017). A Conceptual Introduction to Hamiltonian Monte Carlo. arXiv.org.

Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian Data Analysis, Third Edition. CRC Press.