Bayesian Zig Zag

Developing probabilistic models using grid methods and MCMC.

Thanks to Chris Fonnesback for his help with this example.

Copyright 2018 Allen Downey

MIT License: https://opensource.org/licenses/MIT

In [1]:
from __future__ import print_function, division

%matplotlib inline

import numpy as np
import pymc3 as pm

import matplotlib.pyplot as plt
COLORS = plt.rcParams['axes.prop_cycle'].by_key()['color']
/home/downey/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters

Simulating hockey

I'll model hockey as a Poisson process, where each team has some long-term average scoring rate, lambda, in goals per game.

For the first example, we'll assume that lambda is known (somehow) to be 2.7. Since regulation play (as opposed to overtime) is 60 minutes, we can compute the goal scoring rate per minute.

In [2]:
lam_per_game = 2.7
min_per_game = 60
lam_per_min = lam_per_game / min_per_game
lam_per_min, lam_per_min**2
Out[2]:
(0.045000000000000005, 0.0020250000000000003)

If we assume that a goal is equally likely during any minute of the game, and we ignore the possibility of scoring more than one goal in the same minute, we can simulate a game by generating one random value each minute.

In [3]:
np.random.random(min_per_game)
Out[3]:
array([0.95474566, 0.05850483, 0.66980911, 0.73077852, 0.24336245,
       0.03809128, 0.37539112, 0.00969678, 0.08390627, 0.78005902,
       0.24185794, 0.40905508, 0.42617997, 0.60179578, 0.50304705,
       0.78429149, 0.76775916, 0.25556789, 0.80724967, 0.24811858,
       0.4044248 , 0.30056317, 0.44030775, 0.48182985, 0.33141048,
       0.59718011, 0.82299698, 0.3309434 , 0.23468649, 0.99653667,
       0.3005755 , 0.13623051, 0.83786721, 0.59821771, 0.14075628,
       0.88657572, 0.66519948, 0.52116217, 0.57931806, 0.46990466,
       0.30706128, 0.90740745, 0.03504269, 0.08754995, 0.71951051,
       0.17081812, 0.29243255, 0.77867813, 0.922552  , 0.38258157,
       0.36188948, 0.62941913, 0.58924827, 0.79719118, 0.61448107,
       0.69003979, 0.88986919, 0.43171462, 0.28118933, 0.96749458])

If the random value is less than lam_per_min, that means we score a goal during that minute.

In [4]:
np.random.random(min_per_game) < lam_per_min
Out[4]:
array([False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False,  True, False, False, False, False])

So we can get the number of goals scored by one team like this:

In [5]:
np.sum(np.random.random(min_per_game) < lam_per_min)
Out[5]:
3

I'll wrap that in a function.

In [6]:
def half_game(lam_per_min, min_per_game=60):
    return np.sum(np.random.random(min_per_game) < lam_per_min)

And simulate 10 games.

In [7]:
size = 10
sample = [half_game(lam_per_min) for i in range(size)]
sample
Out[7]:
[4, 0, 1, 1, 4, 1, 6, 2, 1, 2]

If we simulate 1000 games, we can see what the distribution looks like. The average of this sample should be close to lam_per_game.

In [8]:
size = 1000
sample_sim = [half_game(lam_per_min) for i in range(size)]
np.mean(sample_sim), lam_per_game
Out[8]:
(2.766, 2.7)

PMFs

To visualize distributions, I'll start with a probability mass function (PMF), which I'll implement using a Counter.

In [9]:
from collections import Counter

class Pmf(Counter):
    
    def normalize(self):
        """Normalizes the PMF so the probabilities add to 1."""
        total = sum(self.values())
        for key in self:
            self[key] /= total
            
    def sorted_items(self):
        """Returns the outcomes and their probabilities."""
        return zip(*sorted(self.items()))

Here are some functions for plotting PMFs.

In [10]:
plot_options = dict(linewidth=3, alpha=0.6)

def underride(options):
    """Add key-value pairs to d only if key is not in d.

    options: dictionary
    """

    for key, val in plot_options.items():
        options.setdefault(key, val)
    return options

def plot(xs, ys, **options):
    """Line plot with plot_options."""
    plt.plot(xs, ys, **underride(options))

def bar(xs, ys, **options):
    """Bar plot with plot_options."""
    plt.bar(xs, ys, **underride(options))

def plot_pmf(sample, **options):
    """Compute and plot a PMF."""
    pmf = Pmf(sample)
    pmf.normalize()
    xs, ps = pmf.sorted_items()
    bar(xs, ps, **options)
    
def pmf_goals():
    """Decorate the axes."""
    plt.xlabel('Number of goals')
    plt.ylabel('PMF')
    plt.title('Distribution of goals scored')
    legend()
    
def legend(**options):
    """Draw a legend only if there are labeled items.
    """
    ax = plt.gca()
    handles, labels = ax.get_legend_handles_labels()
    if len(labels):
        plt.legend(**options)

Here's what the results from the simulation look like.

In [11]:
plot_pmf(sample_sim, label='simulation')
pmf_goals()

Analytic distributions

For the simulation we just did, we can figure out the distribution analytically: it's a binomial distribution with parameters n and p, where n is the number of minutes and p is the probability of scoring a goal during any minute.

We can use NumPy to generate a sample from a binomial distribution.

In [12]:
n = min_per_game
p = lam_per_min
sample_bin = np.random.binomial(n, p, size)
np.mean(sample_bin)
Out[12]:
2.677

And confirm that the results are similar to what we got from the model.

In [13]:
plot_pmf(sample_sim, label='simulation')
plot_pmf(sample_bin, label='binomial')
pmf_goals()

But plotting PMFs is a bad way to compare distributions. It's better to use the cumulative distribution function (CDF).

In [14]:
def plot_cdf(sample, **options):
    """Compute and plot the CDF of a sample."""
    pmf = Pmf(sample)
    xs, freqs = pmf.sorted_items()
    ps = np.cumsum(freqs, dtype=np.float)
    ps /= ps[-1]
    plot(xs, ps, **options)
    
def cdf_rates():
    """Decorate the axes."""
    plt.xlabel('Goal scoring rate (mu)')
    plt.ylabel('CDF')
    plt.title('Distribution of goal scoring rate')
    legend()

def cdf_goals():
    """Decorate the axes."""
    plt.xlabel('Number of goals')
    plt.ylabel('CDF')
    plt.title('Distribution of goals scored')
    legend()

def plot_cdfs(*sample_seq, **options):
    """Plot multiple CDFs."""
    for sample in sample_seq:
        plot_cdf(sample, **options)
    cdf_goals()

Now we can compare the results from the simulation and the sample from the biomial distribution.

In [15]:
plot_cdf(sample_sim, label='simulation')
plot_cdf(sample_bin, label='binomial')
cdf_goals()

Poisson process

For large values of n, the binomial distribution converges to the Poisson distribution with parameter mu = n * p, which is also mu = lam_per_game.

In [16]:
mu = lam_per_game
sample_poisson = np.random.poisson(mu, size)
np.mean(sample_poisson)
Out[16]:
2.728

And we can confirm that the results are consistent with the simulation and the binomial distribution.

In [17]:
plot_cdfs(sample_sim, sample_bin)
plot_cdf(sample_poisson, label='poisson', linestyle='dashed')
legend()

Warming up PyMC

Soon we will want to use pymc3 to do inference, which is really what it's for. But just to get warmed up, I will use it to generate a sample from a Poisson distribution.

In [18]:
model = pm.Model()

with model:
    goals = pm.Poisson('goals', mu)
    trace = pm.sample(1000)
Multiprocess sampling (4 chains in 4 jobs)
Metropolis: [goals]
100%|██████████| 1500/1500 [00:00<00:00, 3214.00it/s]
INFO (theano.gof.compilelock): Waiting for existing lock by process '26625' (I am process '26626')
INFO (theano.gof.compilelock): To manually release the lock, delete /home/downey/.theano/compiledir_Linux-4.4--generic-x86_64-with-debian-stretch-sid-x86_64-3.6.5-64/lock_dir
INFO (theano.gof.compilelock): Waiting for existing lock by process '26625' (I am process '26627')
INFO (theano.gof.compilelock): To manually release the lock, delete /home/downey/.theano/compiledir_Linux-4.4--generic-x86_64-with-debian-stretch-sid-x86_64-3.6.5-64/lock_dir
INFO (theano.gof.compilelock): Waiting for existing lock by process '26626' (I am process '26627')
INFO (theano.gof.compilelock): To manually release the lock, delete /home/downey/.theano/compiledir_Linux-4.4--generic-x86_64-with-debian-stretch-sid-x86_64-3.6.5-64/lock_dir
The number of effective samples is smaller than 25% for some parameters.
In [19]:
len(trace)
Out[19]:
1000
In [20]:
sample_pm = trace['goals']
np.mean(sample_pm)
Out[20]:
2.699

This example is like using a cannon to kill a fly. But it help us learn to use the cannon.

In [21]:
plot_cdfs(sample_sim, sample_bin, sample_poisson)
plot_cdf(sample_pm, label='poisson pymc', linestyle='dashed')
legend()

Evaluating the Poisson distribution

One of the nice things about the Poisson distribution is that we can compute its CDF and PMF analytically. We can use the CDF to check, one more time, the previous results.

In [22]:
import scipy.stats as st

xs = np.arange(11)
ps = st.poisson.cdf(xs, mu)

plot_cdfs(sample_sim, sample_bin, sample_poisson, sample_pm)
plt.plot(xs, ps, label='analytic', linestyle='dashed')
legend()

And we can use the PMF to compute the probability of any given outcome. Here's what the analytic PMF looks like:

In [23]:
xs = np.arange(11)
ps = st.poisson.pmf(xs, mu)
bar(xs, ps, label='analytic PMF')
pmf_goals()

And here's a function that compute the probability of scoring a given number of goals in a game, for a known value of mu.

In [24]:
def poisson_likelihood(goals, mu):
    """Probability of goals given scoring rate.
    
    goals: observed number of goals (scalar or sequence)
    mu: hypothetical goals per game
    
    returns: probability
    """
    return np.prod(st.poisson.pmf(goals, mu))

Here's the probability of scoring 6 goals in a game if the long-term rate is 2.7 goals per game.

In [25]:
poisson_likelihood(goals=6, mu=2.7)
Out[25]:
0.036162211957124435

Here's the probability of scoring 3 goals.

In [26]:
poisson_likelihood(goals=3, mu=2.7)
Out[26]:
0.22046768454274915

This function also works with a sequence of goals, so we can compute the probability of scoring 6 goals in the first game and 3 in the second.

In [27]:
poisson_likelihood(goals=[6, 2], mu=2.7)
Out[27]:
0.008858443486812598

Bayesian inference with grid approximation

Ok, it's finally time to do some inference! The function we just wrote computes the likelihood of the data, given a hypothetical value of mu:

$\mathrm{Prob}~(x ~|~ \mu)$

But what we really want is the distribution of mu, given the data:

$\mathrm{Prob}~(\mu ~|~ x)$

If only there were some theorem that relates these probabilities!

The following class implements Bayes's theorem.

In [28]:
class Suite(Pmf):
    """Represents a set of hypotheses and their probabilities."""
    
    def bayes_update(self, data, like_func):
        """Perform a Bayesian update.
        
        data:      some representation of observed data
        like_func: likelihood function that takes (data, hypo), where
                   hypo is the hypothetical value of some parameter,
                   and returns P(data | hypo)
        """
        for hypo in self:
            self[hypo] *= like_func(data, hypo)
        self.normalize()
        
    def plot(self, **options):
        """Plot the hypotheses and their probabilities."""
        xs, ps = self.sorted_items()
        plot(xs, ps, **options)
        

def pdf_rate():
    """Decorate the axes."""
    plt.xlabel('Goals per game (mu)')
    plt.ylabel('PDF')
    plt.title('Distribution of goal scoring rate')
    legend()

I'll start with a uniform prior just to keep things simple. We'll choose a better prior later.

In [29]:
hypo_mu = np.linspace(0, 20, num=51)
hypo_mu
Out[29]:
array([ 0. ,  0.4,  0.8,  1.2,  1.6,  2. ,  2.4,  2.8,  3.2,  3.6,  4. ,
        4.4,  4.8,  5.2,  5.6,  6. ,  6.4,  6.8,  7.2,  7.6,  8. ,  8.4,
        8.8,  9.2,  9.6, 10. , 10.4, 10.8, 11.2, 11.6, 12. , 12.4, 12.8,
       13.2, 13.6, 14. , 14.4, 14.8, 15.2, 15.6, 16. , 16.4, 16.8, 17.2,
       17.6, 18. , 18.4, 18.8, 19.2, 19.6, 20. ])

Initially suite represents the prior distribution of mu.

In [30]:
suite = Suite(hypo_mu)
suite.normalize()
suite.plot(label='prior')
pdf_rate()

Now we can update it with the data and plot the posterior.

In [31]:
suite.bayes_update(data=6, like_func=poisson_likelihood)
suite.plot(label='posterior')
pdf_rate()

With a uniform prior, the posterior is the likelihood function, and the MAP is the value of mu that maximizes likelihood, which is the observed number of goals, 6.

This result is probably not reasonable, because the prior was not reasonable.

A better prior

To construct a better prior, I'll use scores from previous Stanley Cup finals to estimate the parameters of a gamma distribution.

Why gamma? You'll see.

Here are (total goals)/(number of games) for both teams from 2013 to 2017, not including games that went into overtime.

In [32]:
xs = [13/6, 19/6, 8/4, 4/4, 10/6, 13/6, 2/2, 4/2, 5/3, 6/3]

If those values were sampled from a gamma distribution, we can estimate its parameters, k and theta.

In [33]:
def estimate_gamma_params(xs):
    """Estimate the parameters of a gamma distribution.
    
    See https://en.wikipedia.org/wiki/Gamma_distribution#Parameter_estimation
    """
    s = np.log(np.mean(xs)) - np.mean(np.log(xs))
    k = (3 - s + np.sqrt((s-3)**2 + 24*s)) / 12 / s
    theta = np.mean(xs) / k
    alpha = k
    beta = 1 / theta
    return alpha, beta

Here are the estimates.

In [34]:
alpha, beta = estimate_gamma_params(xs)
print(alpha, beta)
9.590040427964036 5.092056864405683

The following function takes alpha and beta and returns a "frozen" distribution from SciPy's stats module:

In [35]:
def make_gamma_dist(alpha, beta):
    """Returns a frozen distribution with given parameters.
    """
    return st.gamma(a=alpha, scale=1/beta)

The frozen distribution knows how to compute its mean and standard deviation:

In [36]:
dist = make_gamma_dist(alpha, beta)
print(dist.mean(), dist.std())
1.883333333333333 0.6081587702831356

And it can compute its PDF.

In [37]:
hypo_mu = np.linspace(0, 10, num=101)
ps = dist.pdf(hypo_mu)
In [38]:
plot(hypo_mu, ps, label='gamma(9.6, 5.1)')
pdf_rate()

We can use make_gamma_dist to construct a prior suite with the given parameters.

In [39]:
def make_gamma_suite(xs, alpha, beta):
    """Makes a suite based on a gamma distribution.
    
    xs: places to evaluate the PDF
    alpha, beta: parameters of the distribution
    
    returns: Suite
    """
    dist = make_gamma_dist(alpha, beta)
    ps = dist.pdf(xs)
    prior = Suite(dict(zip(xs, ps)))
    prior.normalize()
    return prior

Here's what it looks like.

In [40]:
prior = make_gamma_suite(hypo_mu, alpha, beta)

prior.plot(label='gamma prior')
pdf_rate()

And we can update this prior using the observed data.

In [41]:
posterior = prior.copy()
posterior.bayes_update(data=6, like_func=poisson_likelihood)

prior.plot(label='prior')
posterior.plot(label='posterior')
pdf_rate()

The results are substantially different from what we got with the uniform prior.

In [42]:
suite.plot(label='posterior with uniform prior', color='gray')
posterior.plot(label='posterior with gamma prior', color=COLORS[1])
pdf_rate()

Suppose the same team plays again and scores 2 goals in the second game. We can perform a second update using the posterior from the first update as the prior for the second.

In [43]:
posterior2 = posterior.copy()
posterior2.bayes_update(data=2, like_func=poisson_likelihood)

prior.plot(label='prior')
posterior.plot(label='posterior')
posterior2.plot(label='posterior2')
pdf_rate()

Or, starting with the original prior, we can update with both pieces of data at the same time.

In [44]:
posterior3 = prior.copy()
posterior3.bayes_update(data=[6, 2], like_func=poisson_likelihood)

prior.plot(label='prior')
posterior.plot(label='posterior')
posterior2.plot(label='posterior2')
posterior3.plot(label='posterior3', linestyle='dashed')
pdf_rate()

Update using conjugate priors

I'm using a gamma distribution as a prior in part because it has a shape that seems credible based on what I know about hockey.

But it is also useful because it happens to be the conjugate prior of the Poisson distribution, which means that if the prior is gamma and we update with a Poisson likelihood function, the posterior is also gamma.

See https://en.wikipedia.org/wiki/Conjugate_prior#Discrete_distributions

And often we can compute the parameters of the posterior with very little computation. If we observe x goals in 1 game, the new parameters are alpha+x and beta+1.

In [45]:
class GammaSuite:
    """Represents a gamma conjugate prior/posterior."""
    
    def __init__(self, alpha, beta):
        """Initialize.
        
        alpha, beta: parameters
        dist: frozen distribution from scipy.stats
        """
        self.alpha = alpha
        self.beta = beta
        self.dist = make_gamma_dist(alpha, beta)
    
    def plot(self, xs, **options):
        """Plot the suite.
        
        xs: locations where we should evaluate the PDF.
        """
        ps = self.dist.pdf(xs)
        ps /= np.sum(ps)
        plot(xs, ps, **options)
        
    def bayes_update(self, data):
        return GammaSuite(self.alpha+data, self.beta+1)

Here's what the prior looks like using a GammaSuite:

In [46]:
gamma_prior = GammaSuite(alpha, beta)
gamma_prior.plot(hypo_mu, label='prior')
pdf_rate()
gamma_prior.dist.mean()
Out[46]:
1.883333333333333

And here's the posterior after one update.

In [47]:
gamma_posterior = gamma_prior.bayes_update(6)

gamma_prior.plot(hypo_mu, label='prior')
gamma_posterior.plot(hypo_mu, label='posterior')
pdf_rate()
gamma_posterior.dist.mean()
Out[47]:
2.559076642743212

And we can confirm that the posterior we get using the conjugate prior is the same as the one we got using a grid approximation.

In [48]:
gamma_prior.plot(hypo_mu, label='prior')
gamma_posterior.plot(hypo_mu, label='posterior conjugate')
posterior.plot(label='posterior grid', linestyle='dashed')
pdf_rate()

Posterior predictive distribution

Ok, let's get to what is usually the point of this whole exercise, making predictions.

The prior represents what we believe about the distribution of mu based on the data (and our prior beliefs).

Each value of mu is a possible goal scoring rate.

For a given value of mu, we can generate a distribution of goals scored in a particular game, which is Poisson.

But we don't have a given value of mu, we have a whole bunch of values for mu, with different probabilities.

So the posterior predictive distribution is a mixture of Poissons with different weights.

The simplest way to generate the posterior predictive distribution is to

  1. Draw a random mu from the posterior distribution.

  2. Draw a random number of goals from Poisson(mu).

  3. Repeat.

Here's a function that draws a sample from a posterior Suite (the grid approximation, not GammaSuite).

In [49]:
def sample_suite(suite, size):
    """Draw a random sample from a Suite
    
    suite: Suite object
    size: sample size
    """
    xs, ps = zip(*suite.items())
    return np.random.choice(xs, size, replace=True, p=ps)

Here's a sample of mu drawn from the posterior distribution (after one game).

In [50]:
size = 10000
sample_post = sample_suite(posterior, size)
np.mean(sample_post)
Out[50]:
2.56113

Here's what the posterior distribution looks like.

In [51]:
plot_cdf(sample_post, label='posterior sample')
cdf_rates()

Now for each value of mu in the posterior sample we draw one sample from Poisson(mu)

In [52]:
sample_post_pred = np.random.poisson(sample_post)
np.mean(sample_post_pred)
Out[52]:
2.5784

Here's what the posterior predictive distribution looks like.

In [53]:
plot_pmf(sample_post_pred, label='posterior predictive sample')
pmf_goals()

Posterior prediction done wrong

The posterior predictive distribution represents uncertainty from two sources:

  1. We don't know mu

  2. Even if we knew mu, we would not know the score of the next game.

It is tempting, but wrong, to generate a posterior prediction by taking the mean of the posterior distribution and drawing samples from Poisson(mu) with just a single value of mu.

That's wrong because it eliminates one of our sources of uncertainty.

Here's an example:

In [54]:
mu_mean = np.mean(sample_post)
sample_post_pred_wrong = np.random.poisson(mu_mean, size)
np.mean(sample_post_pred_wrong)
Out[54]:
2.5835

Here's what the samples looks like:

In [55]:
plot_cdf(sample_post_pred, label='posterior predictive sample')
plot_cdf(sample_post_pred_wrong, label='incorrect posterior predictive')
cdf_goals()

In the incorrect predictive sample, low values and high values are slightly less likely.

The means are about the same:

In [56]:
print(np.mean(sample_post_pred), np.mean(sample_post_pred_wrong))
2.5784 2.5835

But the standard deviation of the incorrect distribution is lower.

In [57]:
print(np.std(sample_post_pred), np.std(sample_post_pred_wrong))
1.7209455075626308 1.5950635567274427

Abusing PyMC

Ok, we are almost ready to use PyMC for its intended purpose, but first we are going to abuse it a little more.

Previously we used PyMC to draw a sample from a Poisson distribution with known mu.

Now we'll use it to draw a sample from the prior distribution of mu, with known alpha and beta.

We still have the values I estimated based on previous playoff finals:

In [58]:
print(alpha, beta)
9.590040427964036 5.092056864405683

Now we can draw a sample from the prior:

In [59]:
model = pm.Model()

with model:
    mu = pm.Gamma('mu', alpha, beta)
    trace = pm.sample(2000, tune=1000)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [mu_log__]
100%|██████████| 3000/3000 [00:01<00:00, 2185.18it/s]

Again, this is not a sensible way to use PyMC. If we just want to sample from the prior, PyMC can do that in a much simpler way, or we could use SciPy just as well. We're doing this to develop and test the model incrementally.

So let's see if the sample looks right.

In [60]:
sample_prior_pm = trace['mu']
np.mean(sample_prior_pm)
Out[60]:
1.9047798167195624
In [61]:
sample_prior = sample_suite(prior, 2000)
np.mean(sample_prior)
Out[61]:
1.8839000000000001
In [62]:
plot_cdf(sample_prior, label='prior')
plot_cdf(sample_prior_pm, label='prior pymc')
cdf_rates()

It looks pretty good (although not actually as close as I expected).

Now let's extend the model to sample from the prior predictive distribution. This is still a silly way to do it, but it is one more step toward inference.

In [63]:
model = pm.Model()

with model:
    mu = pm.Gamma('mu', alpha, beta)
    goals = pm.Poisson('goals', mu)
    trace = pm.sample(2000, tune=1000)
Multiprocess sampling (4 chains in 4 jobs)
CompoundStep
>NUTS: [mu_log__]
>Metropolis: [goals]
100%|██████████| 3000/3000 [00:01<00:00, 1616.07it/s]
The number of effective samples is smaller than 25% for some parameters.

Let's see how the results compare with a sample from the prior predictive distribution, generated by plain old NumPy.

In [64]:
sample_prior_pred_pm = trace['goals']
np.mean(sample_prior_pred_pm)
Out[64]:
1.926625
In [65]:
sample_prior_pred = np.random.poisson(sample_prior)
np.mean(sample_prior_pred)
Out[65]:
1.8395

Looks ok.

In [66]:
plot_cdf(sample_prior_pred, label='prior pred')
plot_cdf(sample_prior_pred_pm, label='prior pred pymc')
cdf_goals()

Using PyMC

Finally, we are ready to use PyMC for actual inference. We just have to make one small change.

Instead of generating goals, we'll mark goals as observed and provide the observed data, 6:

In [67]:
model = pm.Model()

with model:
    mu = pm.Gamma('mu', alpha, beta)
    goals = pm.Poisson('goals', mu, observed=[6])
    trace = pm.sample(2000, tune=1000)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [mu_log__]
100%|██████████| 3000/3000 [00:01<00:00, 2198.29it/s]

With goals fixed, the only unknown is mu, so trace contains a sample drawn from the posterior distribution of mu. We can plot the posterior using a function provided by PyMC:

In [68]:
pm.plot_posterior(trace)
pdf_rate()

And we can extract a sample from the posterior of mu

In [69]:
sample_post_pm = trace['mu']
np.mean(sample_post_pm)
Out[69]:
2.5596750814173586

And compare it to the sample we drew from the grid approximation:

In [70]:
plot_cdf(sample_post, label='posterior grid')
plot_cdf(sample_post_pm, label='posterior pymc')
cdf_rates()

Again, it looks pretty good.

To generate a posterior predictive distribution, we can use sample_ppc

In [71]:
with model:
    post_pred = pm.sample_ppc(trace, samples=2000)
100%|██████████| 2000/2000 [00:00<00:00, 7192.53it/s]

Here's what it looks like:

In [72]:
sample_post_pred_pm = post_pred['goals']
np.mean(sample_post_pred_pm)
Out[72]:
2.514
In [73]:
plot_cdf(sample_post_pred, label='posterior pred grid')
plot_cdf(sample_post_pred_pm, label='posterior pred pm')
cdf_goals()

Look's pretty good!

Going hierarchical

So far, all of this is based on a gamma prior. To choose the parameters of the prior, I used data from previous Stanley Cup finals and computed a maximum likelihood estimate (MLE). But that's not correct, because

  1. It assumes that the observed goal counts are the long-term goal-scoring rates.
  2. It treats alpha and beta as known values rather than parameters to estimate.

In other words, I have ignored two important sources of uncertainty. As a result, my predictions are almost certainly too confident.

The solution is a hierarchical model, where alpha and beta are the parameters that control mu and mu is the parameter that controls goals. Then we can use observed goals to update the distributions of all three unknown parameters.

Of course, now we need a prior distribution for alpha and beta. A common choice is the half Cauchy distribution (see Gelman), but on advice of counsel, I'm going with exponential.

In [74]:
sample = pm.Exponential.dist(lam=1).random(size=1000)
plot_cdf(sample)
plt.xscale('log')
plt.xlabel('Parameter of a gamma distribution')
plt.ylabel('CDF')
np.mean(sample)
Out[74]:
1.0272878581293894

This distribution represents radical uncertainty about the value of this distribution: it's probably between 0.1 and 10, but it could be really big or really small.

Here's a PyMC model that generates alpha and beta from an exponential distribution.

In [75]:
model = pm.Model()

with model:
    alpha = pm.Exponential('alpha', lam=1)
    beta = pm.Exponential('beta', lam=1)
    trace = pm.sample(1000, tune=2000, nuts_kwargs=dict(target_accept=0.95))
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [beta_log__, alpha_log__]
100%|██████████| 3000/3000 [00:03<00:00, 913.70it/s]

Here's what the distributions of alpha and beta look like.

In [76]:
sample_prior_alpha = trace['alpha']
plot_cdf(sample_prior_alpha, label='alpha prior')
sample_prior_beta = trace['beta']
plot_cdf(sample_prior_beta, label='beta prior')

plt.xscale('log')
plt.xlabel('Parameter of a gamma distribution')
plt.ylabel('CDF')
np.mean(sample_prior_alpha)
Out[76]:
1.0158645379491282

Now that we have alpha and beta, we can generate mu.

In [77]:
model = pm.Model()

with model:
    alpha = pm.Exponential('alpha', lam=1)
    beta = pm.Exponential('beta', lam=1)
    mu = pm.Gamma('mu', alpha, beta)
    trace = pm.sample(1000, tune=4000, nuts_kwargs=dict(target_accept=0.9))
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [mu_log__, beta_log__, alpha_log__]
100%|██████████| 5000/5000 [00:15<00:00, 327.26it/s]
There were 1 divergences after tuning. Increase `target_accept` or reparameterize.
The acceptance probability does not match the target. It is 0.8318103572008174, but should be close to 0.9. Try to increase the number of tuning steps.
The acceptance probability does not match the target. It is 0.965809840031, but should be close to 0.9. Try to increase the number of tuning steps.
The number of effective samples is smaller than 25% for some parameters.

Here's what the prior distribution of mu looks like.

In [78]:
sample_prior_mu = trace['mu']
plot_cdf(sample_prior_mu, label='mu prior hierarchical')
cdf_rates()
np.mean(sample_prior_mu)
Out[78]:
297.952144608823

In effect, the model is saying "I have never seen a hockey game before. As far as I know, it could be soccer, could be basketball, could be pinball."

If we zoom in on the range 0 to 10, we can compare the prior implied by the hierarchical model with the gamma prior I hand picked.

In [79]:
plot_cdf(sample_prior_mu, label='mu prior hierarchical')
plot_cdf(sample_prior, label='mu prior', color='gray')
plt.xlim(0, 10)
cdf_rates()

Obviously, they are very different. They agree that the most likely values are less than 10, but the hierarchical model admits the possibility that mu could be orders of magnitude bigger.

Crazy as it sounds, that's probably what we want in a non-committal prior.

Ok, last step of the forward process, let's generate some goals.

In [80]:
model = pm.Model()

with model:
    alpha = pm.Exponential('alpha', lam=1)
    beta = pm.Exponential('beta', lam=1)
    mu = pm.Gamma('mu', alpha, beta)
    goals = pm.Poisson('goals', mu)
    trace = pm.sample(1000, tune=2000, nuts_kwargs=dict(target_accept=0.99))
Multiprocess sampling (4 chains in 4 jobs)
CompoundStep
>NUTS: [mu_log__, beta_log__, alpha_log__]
>Metropolis: [goals]
100%|██████████| 3000/3000 [00:30<00:00, 97.86it/s] 
There were 2 divergences after tuning. Increase `target_accept` or reparameterize.
The acceptance probability does not match the target. It is 0.9409566670308568, but should be close to 0.99. Try to increase the number of tuning steps.
The gelman-rubin statistic is larger than 1.2 for some parameters.
The estimated number of effective samples is smaller than 200 for some parameters.

PyMC is not very happy running forward problems. Nevertheless, here's the prior predictive distribution of goals.

In [81]:
sample_prior_goals = trace['goals']
plot_cdf(sample_prior_goals, label='goals prior')
cdf_goals()
np.mean(sample_prior_goals)
Out[81]:
9.97275

To see whether that distribution is right, I ran samples using SciPy.

In [82]:
def forward_hierarchical(size=1):
    alpha = st.expon().rvs(size=size)
    beta = st.expon().rvs(size=size)
    mu = st.gamma(a=alpha, scale=1/beta).rvs(size=size)
    goals = st.poisson(mu).rvs(size=size)
    return goals[0]

sample_prior_goals_st = [forward_hierarchical() for i in range(1000)]
In [83]:
plot_cdf(sample_prior_goals, label='goals prior')
plot_cdf(sample_prior_goals_st, label='goals prior scipy')
plt.xlim(0, 50)
cdf_goals()
np.mean(sample_prior_goals_st)
Out[83]:
7.869

Hierarchical inference

Once we have the forward process working, we only need a small change to run the reverse process.

In [84]:
model = pm.Model()

with model:
    alpha = pm.Exponential('alpha', lam=1)
    beta = pm.Exponential('beta', lam=1)
    mu = pm.Gamma('mu', alpha, beta)
    goals = pm.Poisson('goals', mu, observed=[6])
    trace = pm.sample(1000, tune=2000, nuts_kwargs=dict(target_accept=0.99))
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [mu_log__, beta_log__, alpha_log__]
100%|██████████| 3000/3000 [00:07<00:00, 419.88it/s]

PyMC is much happier running the reverse process.

Here's the posterior distribution of mu. The posterior mean is close to the observed value, which is what we expect from a weakly informative prior.

In [85]:
sample_post_mu = trace['mu']
plot_cdf(sample_post_mu, label='mu posterior')
cdf_rates()
np.mean(sample_post_mu)
Out[85]:
5.466653858016652

Two teams

We can extend the model to estimate different values of mu for the two teams.

In [86]:
model = pm.Model()

with model:
    alpha = pm.Exponential('alpha', lam=1)
    beta = pm.Exponential('beta', lam=1)
    mu_VGK = pm.Gamma('mu_VGK', alpha, beta)
    mu_WSH = pm.Gamma('mu_WSH', alpha, beta)
    goals_VGK = pm.Poisson('goals_VGK', mu_VGK, observed=[6])
    goals_WSH = pm.Poisson('goals_WSH', mu_WSH, observed=[4])
    trace = pm.sample(1000, tune=2000, nuts_kwargs=dict(target_accept=0.95))
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [mu_WSH_log__, mu_VGK_log__, beta_log__, alpha_log__]
100%|██████████| 3000/3000 [00:06<00:00, 469.59it/s]

We can use traceplot to review the results and do some visual diagnostics.

In [87]:
pm.traceplot(trace);

Here are the posterior distribitions for mu_WSH and mu_VGK.

In [88]:
sample_post_mu_WSH = trace['mu_WSH']
plot_cdf(sample_post_mu_WSH, label='mu_WSH posterior')

sample_post_mu_VGK = trace['mu_VGK']
plot_cdf(sample_post_mu_VGK, label='mu_VGK posterior')

cdf_rates()
np.mean(sample_post_mu_WSH), np.mean(sample_post_mu_VGK)
Out[88]:
(4.029021468399878, 5.4945431876391035)

On the basis of one game (and never having seen a previous game), here's the probability that Vegas is the better team.

In [89]:
np.mean(sample_post_mu_VGK > sample_post_mu_WSH)
Out[89]:
0.71925

More background

But let's take advantage of more information. Here are the results from the five most recent Stanley Cup finals, ignoring games that went into overtime.

In [90]:
data = dict(BOS13 = [2, 1, 2],
            CHI13 = [0, 3, 3],
            NYR14 = [0, 2],
            LAK14 = [3, 1],
            TBL15 = [1, 4, 3, 1, 1, 0],
            CHI15 = [2, 3, 2, 2, 2, 2],
            SJS16 = [2, 1, 4, 1],
            PIT16 = [3, 3, 2, 3],
            NSH17 = [3, 1, 5, 4, 0, 0],
            PIT17 = [5, 4, 1, 1, 6, 2],
            VGK18 = [6,2,1],
            WSH18 = [4,3,3],
           )

Here's how we can get the data into the model.

In [91]:
model = pm.Model()

with model:
    alpha = pm.Exponential('alpha', lam=1)
    beta = pm.Exponential('beta', lam=1)
    
    mu = dict()
    goals = dict()
    for name, observed in data.items():
        mu[name] = pm.Gamma('mu_'+name, alpha, beta)
        goals[name] = pm.Poisson(name, mu[name], observed=observed)
        
    trace = pm.sample(1000, tune=2000, nuts_kwargs=dict(target_accept=0.95))
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [mu_WSH18_log__, mu_VGK18_log__, mu_PIT17_log__, mu_NSH17_log__, mu_PIT16_log__, mu_SJS16_log__, mu_CHI15_log__, mu_TBL15_log__, mu_LAK14_log__, mu_NYR14_log__, mu_CHI13_log__, mu_BOS13_log__, beta_log__, alpha_log__]
100%|██████████| 3000/3000 [00:21<00:00, 137.45it/s]

And here are the results.

In [92]:
pm.traceplot(trace);