This is one of the 100 recipes of the IPython Cookbook, the definitive guide to high-performance scientific computing and data science in Python.

7.7. Fitting a Bayesian model by sampling from a posterior distribution with a Markov Chain Monte Carlo method¶

You can find the instructions to install PyMC on the package's website. (http://pymc-devs.github.io/pymc/)

You also need the Storms dataset from the book's website. (http://ipython-books.github.io)

Let's import the standard packages and PyMC.

In [ ]:

import numpy as np
import pandas as pd
import pymc
import matplotlib.pyplot as plt
%matplotlib inline

Let's import the data with Pandas.

In [ ]:

# http://www.ncdc.noaa.gov/ibtracs/index.php?name=wmo-data
df = pd.read_csv("data/Allstorms.ibtracs_wmo.v03r05.csv",
                 delim_whitespace=False)

With Pandas, it only takes a single line of code to get the annual number of storms in the North Atlantic Ocean. We first select the storms in that basin (NA), then we group the rows by year (Season), and we take the number of unique storm (Serial_Num) as each storm can span several days (nunique method).

In [ ]:

cnt = df[df['Basin'] == ' NA'].groupby('Season') \
      ['Serial_Num'].nunique()
years = cnt.index
y0, y1 = years[0], years[-1]
arr = cnt.values
plt.figure(figsize=(8,4));
plt.plot(years, arr, '-ok');
plt.xlim(y0, y1);
plt.xlabel("Year");
plt.ylabel("Number of storms");

Now, we define our probabilistic model. We assume that storms arise following a time-dependent Poisson process with a deterministic rate. We assume this rate is a piecewise-constant function that takes a first value early_mean before a certain switch point, and a second value late_mean after that point. These three unknown parameters are treated as random variables (we will describe them more in the How it works... section).

A Poisson process is a particular point process, that is, a stochastic process describing the random occurence of instantaneous events. The Poisson process is fully random: the events occur independently at a given rate.

In [ ]:

switchpoint = pymc.DiscreteUniform('switchpoint',
                                   lower=0, upper=len(arr))
early_mean = pymc.Exponential('early_mean', beta=1)
late_mean = pymc.Exponential('late_mean', beta=1)

We define the piecewise-constant rate as a Python function.

In [ ]:

@pymc.deterministic(plot=False)
def rate(s=switchpoint, e=early_mean, l=late_mean):
    out = np.empty(len(arr))
    out[:s] = e
    out[s:] = l
    return out

Finally, the observed variable is the annual number of storms. It follows a Poisson variable with a random mean (the rate of the underlying Poisson process). This fact is a known mathematical property of Poisson processes.

In [ ]:

storms = pymc.Poisson('storms', mu=rate, value=arr, observed=True)

Now, we use the MCMC method to sample from the posterior distribution, given the observed data. The sample method launches the fitting iterative procedure.

In [ ]:

model = pymc.Model([switchpoint, early_mean, late_mean, rate, storms])

In [ ]:

mcmc = pymc.MCMC(model)
mcmc.sample(iter=10000, burn=1000, thin=10)

Let's plot the sampled Markov chains. Their stationary distribution corresponds to the posterior distribution we want to characterize.

In [ ]:

plt.figure(figsize=(8,8))
plt.subplot(311);
plt.plot(mcmc.trace('switchpoint')[:]);
plt.ylabel("Switch point"); 
plt.subplot(312);
plt.plot(mcmc.trace('early_mean')[:]);
plt.ylabel("Early mean");
plt.subplot(313);
plt.plot(mcmc.trace('late_mean')[:]);
plt.xlabel("Iteration");
plt.ylabel("Late mean");

We also plot the distribution of the samples: they correspond to the posterior distributions of our parameters, after the data points have been taken into account.

In [ ]:

plt.figure(figsize=(14,3))
plt.subplot(131);
plt.hist(mcmc.trace('switchpoint')[:] + y0, 15);
plt.xlabel("Switch point")
plt.ylabel("Distribution")
plt.subplot(132);
plt.hist(mcmc.trace('early_mean')[:], 15);
plt.xlabel("Early mean");
plt.subplot(133);
plt.hist(mcmc.trace('late_mean')[:], 15);
plt.xlabel("Late mean");

Taking the sample mean of these distributions, we get posterior estimates for the three unknown parameters, including the year where the frequency of storms suddenly increased.

In [ ]:

yp = y0 + mcmc.trace('switchpoint')[:].mean()
em = mcmc.trace('early_mean')[:].mean()
lm = mcmc.trace('late_mean')[:].mean()
print((yp, em, lm))

Now we can plot the estimated rate on top of the observations.

In [ ]:

plt.figure(figsize=(8,4));
plt.plot(years, arr, '-ok');
plt.axvline(yp, color='k', ls='--');
plt.plot([y0, yp], [em, em], '-b', lw=3);
plt.plot([yp, y1], [lm, lm], '-r', lw=3);
plt.xlim(y0, y1);
plt.xlabel("Year");
plt.ylabel("Number of storms");

For a possible scientific interpretation of the data considered here, see http://www.gfdl.noaa.gov/global-warming-and-hurricanes.

You'll find all the explanations, figures, references, and much more in the book (to be released later this summer).

IPython Cookbook, by Cyrille Rossant, Packt Publishing, 2014 (500 pages).