Christopher Fonnesbeck Vanderbilt University
John Salvatier Amazon.com
Thomas Wiecki Brown University, Quantopian
Bayesian models are usually specified and fit using a domain-specific language, typically accessed through an interface to R. This approach keeps users at a distance from their model: specification, data handling, and post-processing happen in different languages and environments.
An alternative approach is to build models in the native language used to perform data analysis.
A handful of packages exist that use MCMC to fit models in R.
MCMCpack
mcmc
MCMCglmm
Each of these operates on particular classes of models.
Most Bayesians use interfaces to BUGS/JAGS instead of native code.
Python is a dynamic, interpreted programming language that is flexible enough to be used for a variety of tasks, from scripts to mission-critical applications. It currently has a very strong presence in scientific computing. Despite being an easy-to-use, high-level language, it is also capable of achieving very high performance.
Runnable pseudocode: Python's syntax is readable and clear.
Gibbs sampler for the function:
$$f(x,y) = x^2 \exp(-xy^2 - y^2 + 2y - 4x)$$
using the full conditional distributions:
$$x|y \sim \text{Gamma}(3, y^2 + 4)$$
$$y|x \sim \text{Normal}\left(\frac{1}{1+x}, \frac{1}{2(1+x)}\right)$$
from numpy import zeros, random, sqrt
gamma = random.gamma
normal = random.normal
def gibbs(N=20000, thin=200):
    mat = zeros((N, 2))
    x, y = mat[0]
    for i in range(N):
        for j in range(thin):
            # numpy's gamma takes a scale parameter, so invert the rate
            x = gamma(3, 1. / (y**2 + 4))
            y = normal(1. / (x + 1), 1. / sqrt(2 * (x + 1)))
        mat[i] = x, y
    return mat
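A quick way to exercise the sampler is to run a short chain and check that the draws respect the supports of the conditionals. This is a self-contained sketch (the chain sizes and seed are only illustrative):

```python
import numpy as np

def gibbs(N=500, thin=10, seed=42):
    """Gibbs sampler for f(x, y) ∝ x^2 exp(-x y^2 - y^2 + 2y - 4x)."""
    rng = np.random.RandomState(seed)
    mat = np.zeros((N, 2))
    x, y = 0.0, 0.0
    for i in range(N):
        for _ in range(thin):
            # x | y ~ Gamma(shape=3, rate=y^2 + 4); numpy uses a scale parameter
            x = rng.gamma(3, 1.0 / (y**2 + 4))
            # y | x ~ Normal(mean 1/(1+x), variance 1/(2(1+x)))
            y = rng.normal(1.0 / (1 + x), 1.0 / np.sqrt(2 * (1 + x)))
        mat[i] = x, y
    return mat

samples = gibbs()
print(samples[:, 0].min() > 0)   # x draws are gamma-distributed, hence positive
```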
Python objects are dynamically referenced
y = 5
y = 'foo'
x = y = [1, 2, 3]
y[0] = -9
x
[-9, 2, 3]
... but strongly typed!
'5' + 6
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-6-cfde8aeadc63> in <module>()
----> 1 '5' + 6

TypeError: cannot concatenate 'str' and 'int' objects
NumPy and SciPy are fundamental scientific modules for Python.
In particular, NumPy provides:
the ndarray, a fast N-dimensional array type
vectorized operations and broadcasting over arrays
linear algebra, Fourier transforms, and random number generation
SciPy provides a large suite of standard scientific computing routines: optimization, integration, interpolation, signal processing, and statistics.
# Calculate second difference matrix
import numpy as np
I2 = -2*np.eye(8)
E = np.diag(np.ones(7), k=-1)
I2 + E + E.T
array([[-2.,  1.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 1., -2.,  1.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  1., -2.,  1.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  1., -2.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1., -2.,  1.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  1., -2.,  1.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  1., -2.,  1.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  1., -2.]])
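As a small taste of SciPy (a sketch using `scipy.linalg`; the right-hand side here is arbitrary), the second difference matrix above is exactly the kind of object its linear algebra routines consume:

```python
import numpy as np
from scipy import linalg

# Rebuild the 8x8 second difference matrix from above
D2 = -2 * np.eye(8) + np.diag(np.ones(7), k=-1) + np.diag(np.ones(7), k=1)

# Solve the linear system D2 @ x = b (a discrete boundary-value problem)
b = np.ones(8)
x = linalg.solve(D2, b)

print(np.allclose(D2 @ x, b))   # the solution reproduces b
```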
IPython is an enhanced Python shell which provides a more robust and productive development environment for users. It includes the HTML notebook featured here, as well as support for interactive data visualization and easy high-performance parallel computing.
IPython has a set of predefined ‘magic functions’ that you can call with a command line style syntax. These include:
%run
%edit
%debug
%timeit
%paste
%load_ext
For example, we can use %load_ext to load an extension to run R code within IPython:
%load_ext rpy2.ipython
%%R
x <- seq(1, 10)
y <- rnorm(10)
plot(x, y)
The IPython Notebook is a web-based interactive interface to IPython that combines code execution, text, mathematics, plots and rich media into a single document.
It provides an easy way of reporting, from simple data-cleaning scripts to publishable books.
from IPython.display import HTML
HTML('<iframe src=http://nbviewer.ipython.org/github/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/blob/master/Chapter3_MCMC/IntroMCMC.ipynb width=800 height=400></iframe>')
Pandas provides high-performance data structures for data manipulation and analysis. In particular, it allows for intelligent data alignment and integrated handling of missing data. Datasets can be easily reshaped, sliced, subsetted, and indexed hierarchically.
Here's a trivial example of a pandas DataFrame, populated with columns of various types:
%matplotlib inline
from pandas import *
df = DataFrame({'A': ['one', 'one', 'two', 'three'] * 6,
                'B': ['A', 'B', 'C'] * 8,
                'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 4,
                'D': np.random.randn(24),
                'E': np.random.randn(24),
                'F': date_range(start='4/1/2012', periods=24)})
df.head()
|   | A     | B | C   | D         | E         | F          |
|---|-------|---|-----|-----------|-----------|------------|
| 0 | one   | A | foo | -0.079579 | -0.769077 | 2012-04-01 |
| 1 | one   | B | foo |  0.083715 | -0.043521 | 2012-04-02 |
| 2 | two   | C | foo |  0.631201 | -0.101193 | 2012-04-03 |
| 3 | three | A | bar |  1.819274 |  0.930788 | 2012-04-04 |
| 4 | one   | B | bar | -0.124589 |  0.629453 | 2012-04-05 |
Consider some real data, from Statistical Methods for the Analysis of Repeated Measurements by Charles S. Davis. These data are from a multicenter, randomized controlled trial of botulinum toxin type B (BotB) in patients with cervical dystonia from nine U.S. sites.
cdystonia = read_csv("cdystonia.csv", index_col=None)
cdystonia.head()
|   | patient | obs | week | site | id | treat | age | sex | twstrs |
|---|---------|-----|------|------|----|-------|-----|-----|--------|
| 0 | 1       | 1   | 0    | 1    | 1  | 5000U | 65  | F   | 32     |
| 1 | 1       | 2   | 2    | 1    | 1  | 5000U | 65  | F   | 30     |
| 2 | 1       | 3   | 4    | 1    | 1  | 5000U | 65  | F   | 24     |
| 3 | 1       | 4   | 8    | 1    | 1  | 5000U | 65  | F   | 37     |
| 4 | 1       | 5   | 12   | 1    | 1  | 5000U | 65  | F   | 39     |
With Pandas, we can easily manipulate these data. For example, we might want a hierarchical index.
cdystonia.set_index(['patient','obs']).head(10)
| patient | obs | week | site | id | treat  | age | sex | twstrs |
|---------|-----|------|------|----|--------|-----|-----|--------|
| 1       | 1   | 0    | 1    | 1  | 5000U  | 65  | F   | 32     |
|         | 2   | 2    | 1    | 1  | 5000U  | 65  | F   | 30     |
|         | 3   | 4    | 1    | 1  | 5000U  | 65  | F   | 24     |
|         | 4   | 8    | 1    | 1  | 5000U  | 65  | F   | 37     |
|         | 5   | 12   | 1    | 1  | 5000U  | 65  | F   | 39     |
|         | 6   | 16   | 1    | 1  | 5000U  | 65  | F   | 36     |
| 2       | 1   | 0    | 1    | 2  | 10000U | 70  | F   | 60     |
|         | 2   | 2    | 1    | 2  | 10000U | 70  | F   | 26     |
|         | 3   | 4    | 1    | 2  | 10000U | 70  | F   | 27     |
|         | 4   | 8    | 1    | 2  | 10000U | 70  | F   | 41     |
Or we may want to summarize data that are grouped according to one or more variables.
cdystonia.groupby(['treat', 'sex'])['twstrs'].mean()
treat    sex
10000U   F      42.830303
         M      37.208333
5000U    F      40.216981
         M      42.523810
Placebo  F      43.258065
         M      38.891566
Name: twstrs, dtype: float64
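The same split-apply-combine pattern works on any DataFrame. Here is a self-contained sketch (synthetic data, not the cdystonia file) that also reshapes the result into a treat-by-sex table with unstack:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'treat': ['5000U', '10000U', 'Placebo'] * 4,
    'sex': ['F', 'M'] * 6,
    'twstrs': np.arange(12, dtype=float),
})

# Group means, reshaped so rows are treatments and columns are sexes
means = df.groupby(['treat', 'sex'])['twstrs'].mean().unstack()
print(means)
```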
Being an interpreted language, Python can sometimes be slow relative to compiled languages. Yet, we don't want to trade the productivity benefits of Python for the speed of a language like C or Fortran. Often, it is just a few lines of code that slow down an analysis, so if that code can be sped up somehow, the entire analysis can run more efficiently.
Cython is a language that allows Python programmers to write fast code without having to write C/C++/Fortran directly. It looks much like Python code, but with type declarations. Cython takes this code, translates it to C, then compiles the generated C code to create a Python extension.
from numpy import zeros, random, sqrt
gamma = random.gamma
normal = random.normal
def pygibbs(N=20000, thin=200):
    mat = zeros((N, 2))
    x, y = mat[0]
    for i in range(N):
        for j in range(thin):
            # numpy's gamma takes a scale parameter, so invert the rate
            x = gamma(3, 1. / (y**2 + 4))
            y = normal(1. / (x + 1), 1. / sqrt(2 * (x + 1)))
        mat[i] = x, y
    return mat
%timeit pygibbs(1000, 10)
10 loops, best of 3: 27.2 ms per loop
"Cythonizing" your Python code can result in a significant increase in speed.
%load_ext cythonmagic
%%cython -lm -lgsl -lgslcblas

cimport cython
import numpy as np
from numpy cimport *

cdef extern from "math.h":
    double sqrt(double)

cdef extern from "gsl/gsl_rng.h":
    ctypedef struct gsl_rng_type
    ctypedef struct gsl_rng
    gsl_rng_type *gsl_rng_mt19937
    gsl_rng *gsl_rng_alloc(gsl_rng_type * T) nogil

cdef extern from "gsl/gsl_randist.h":
    double gamma "gsl_ran_gamma"(gsl_rng * r, double, double)
    double gaussian "gsl_ran_gaussian"(gsl_rng * r, double)

cdef gsl_rng *r = gsl_rng_alloc(gsl_rng_mt19937)

@cython.wraparound(False)
@cython.boundscheck(False)
def gibbs(int N=20000, int thin=500):
    cdef:
        double x=0
        double y=0
        int i, j
        ndarray[float64_t, ndim=2] samples
    samples = np.empty((N, 2))
    for i from 0 <= i < N:
        for j from 0 <= j < thin:
            # gsl_ran_gamma takes a scale parameter, so invert the rate
            x = gamma(r, 3, 1.0/(y*y + 4))
            # gsl_ran_gaussian is zero-mean; add the conditional mean
            y = 1.0/(x + 1) + gaussian(r, 1.0/sqrt(2*(x + 1)))
        samples[i, 0] = x
        samples[i, 1] = y
    return samples
%timeit gibbs(1000, 10)
1000 loops, best of 3: 953 µs per loop
Python provides an ideal platform for Bayesian computing, due to its unique combination of flexibility, performance and ease-of-use.
JAGS developer Martyn Plummer has described his software as:
A platform for experimenting with ideas in Bayesian modeling
Because it is implemented natively in Python, PyMC has the potential to be such an experimental platform for statistical methodologists to develop and implement new methods.
PyMC is a python module that implements Bayesian statistical models and fitting algorithms, including Markov chain Monte Carlo. Its flexibility and extensibility make it applicable to a large suite of problems. Along with core sampling functionality, PyMC includes methods for summarizing output, plotting, goodness-of-fit and convergence diagnostics.
*PyMC provides functionalities to make Bayesian analysis as painless as possible.*
Let's consider a simple attempt to model failure rates. The data are a vector of failure counts and a vector of observation times, which together can be used to calculate rates:
y = 5,1,5,14,3,19,1,1,4,22
t = 94.320, 15.720, 62.880, 125.760, 5.240, 31.440, 1.048, 1.048, 2.096, 10.480
np.array(y)/t
array([ 0.05301103,  0.06361323,  0.07951654,  0.11132316,  0.57251908,
        0.6043257 ,  0.95419847,  0.95419847,  1.90839695,  2.09923664])
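Before fitting any model, the pooled rate (total failures over total observation time) provides a useful baseline:

```python
y = [5, 1, 5, 14, 3, 19, 1, 1, 4, 22]
t = [94.320, 15.720, 62.880, 125.760, 5.240,
     31.440, 1.048, 1.048, 2.096, 10.480]

# Total failures divided by total exposure
pooled_rate = sum(y) / sum(t)
print(pooled_rate)  # ≈ 0.214
```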
The simplest (naive) model is to specify a Poisson model for the data, and estimate the corresponding rate. We might place an exponential prior on this rate.
from pymc import *
rate = Exponential('rate', 0.1)
Calling Exponential with appropriate arguments returns a PyMC Stochastic object.
rate
<pymc.distributions.Exponential 'rate' at 0x111950f10>
isinstance(rate, Stochastic)
True
PyMC objects have attributes that are relevant to their roles in a Bayesian model. Thus, a Stochastic has a value and a log-probability:
rate.value
array(2.0112404848444427)
rate.logp
-2.5037091414784896
We did not specify a value for rate (but we could have!), so one was sampled automatically when the object was created.
We can sample from rate, which generates a new value for the object and an associated log-probability.
rate.random()
array(17.889479833802298)
rate.value
array(17.889479833802298)
The next step is to multiply the rate by the time offsets, to calculate expected failure counts. This generates a Deterministic object that we will call theta.
theta = rate * t
isinstance(theta, Deterministic)
True
theta.value
array([ 1687.33573792,   281.22262299,  1124.89049195,  2249.7809839 ,
          93.74087433,   562.44524597,    18.74817487,    18.74817487,
          37.49634973,   187.48174866])
Notice that even though theta is vector-valued, we did not need to construct a loop to specify it. This is due to the vectorization capabilities of NumPy.
We can now specify a likelihood, using theta as the parameter(s). What type of object is the likelihood?
failures = Poisson('failures', theta, value=y, observed=True)
isinstance(failures, Stochastic)
True
It's a Stochastic. The difference is that the observed=True flag fixes the value of the stochastic to the observed data, so it does not get updated during sampling.
This collection of PyMC objects forms a directed acyclic graph by virtue of the parent-child relationships defined by one object (parent) being the argument for another (child).
We can pass the nodes of our graph to an MCMC object, which creates a sampler that can be used to fit our model.
simple_model = MCMC([failures, theta, rate])
simple_model.sample(10000, 5000)
[-----------------100%-----------------] 10000 of 10000 complete in 0.4 sec
By default, PyMC chooses an MCMC algorithm ("step method") for each stochastic node in the graph. We can see here that the Metropolis step method was chosen for the rate parameter.
simple_model.step_method_dict
{<pymc.distributions.Exponential 'rate' at 0x111950f10>: [<pymc.StepMethods.Metropolis at 0x1166e0ed0>]}
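What a Metropolis step method does at each iteration can be sketched in plain NumPy (a toy random-walk sampler for the rate parameter, not PyMC's implementation; the proposal scale, seed, and chain length are arbitrary choices):

```python
import numpy as np

rng = np.random.RandomState(1)

y = np.array([5, 1, 5, 14, 3, 19, 1, 1, 4, 22])
t = np.array([94.320, 15.720, 62.880, 125.760, 5.240,
              31.440, 1.048, 1.048, 2.096, 10.480])

def logp(rate):
    """Unnormalized log-posterior: Exponential(0.1) prior + Poisson likelihood."""
    if rate <= 0:
        return -np.inf
    theta = rate * t
    return -0.1 * rate + np.sum(y * np.log(theta) - theta)

rate, trace = 1.0, []
for _ in range(5000):
    prop = rate + rng.normal(0, 0.05)          # random-walk proposal
    if np.log(rng.uniform()) < logp(prop) - logp(rate):
        rate = prop                             # accept; otherwise keep current value
    trace.append(rate)

print(np.mean(trace[2500:]))  # ≈ 0.21, close to PyMC's estimate above
```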
Matplot.plot(rate)
Plotting rate
rate.summary()
rate:

	Mean             SD               MC Error        95% HPD interval
	------------------------------------------------------------------
	0.214            0.024            0.001            [ 0.168  0.266]

	Posterior quantiles:

	2.5             25              50              75             97.5
	 |---------------|===============|===============|---------------|
	0.169            0.199           0.214           0.229           0.273
We did not have to use MCMC. PyMC also has approximation methods, such as maximum a posteriori (MAP) estimation.
map_model = MAP([failures, theta, rate])
map_model.fit()
rate.value
array(0.2142176857025202)
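In this simple model the MAP estimate can be checked in closed form: the exponential prior (a Gamma(1, 0.1)) is conjugate to the Poisson likelihood, so the posterior is Gamma(∑y + 1, rate = ∑t + 0.1), whose mode is ∑y / (∑t + 0.1). A quick hand check using the data above:

```python
y = [5, 1, 5, 14, 3, 19, 1, 1, 4, 22]
t = [94.320, 15.720, 62.880, 125.760, 5.240,
     31.440, 1.048, 1.048, 2.096, 10.480]

# Posterior is Gamma(sum(y) + 1, rate = sum(t) + 0.1); its mode:
posterior_mode = sum(y) / (sum(t) + 0.1)
print(posterior_mode)  # ≈ 0.2142, agreeing with the MAP fit
```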
This appears to be a lousy model for these data: they vary more than would be expected under a Poisson model.
Let's update the model to make it more realistic.
The simplest extension of the Poisson model is to allow the mean to vary. A natural choice for the distribution of the mean is a gamma distribution.
alpha = Exponential('alpha', 0.1, value=1)
beta = Exponential('beta', 0.1, value=1)
rate = Gamma('rate', alpha, beta, size=len(y))
rate.value
array([ 2.88503664,  0.32513389,  1.85656985,  0.24414514,  1.89594116,
        2.06834612,  2.58120821,  0.22028342,  0.51869994,  0.89119894])
theta = rate * t
failures = Poisson('failures', theta, value=y, observed=True)
Running this model yields more reasonable estimates of rates.
hierarchical_model = MCMC([failures, theta, rate, alpha, beta])
hierarchical_model.sample(10000, 5000)
[-----------------100%-----------------] 10000 of 10000 complete in 1.2 sec
Matplot.summary_plot(rate, rhat=False)
Samplers in PyMC 2 are limited to "first generation" MCMC algorithms (i.e. Metropolis and related).
PyMC 3, currently in beta, is a complete re-write of the PyMC code base, in part to allow for the implementation of gradient-based sampling algorithms, such as Hamiltonian Monte Carlo and NUTS.
PyMC 3 is built on Theano, a library that compiles array expressions to fast code and computes gradients automatically, which is exactly what gradient-based samplers require. For example, the gradient of a sum of logistic functions:
from theano import function, tensor as T
x = T.dmatrix('x')
s = T.sum(1 / (1 + T.exp(-x)))
gs = T.grad(s, x)
dlogistic = function([x], gs)
dlogistic([[3, -1],[0, 2]])
array([[ 0.04517666,  0.19661193],
       [ 0.25      ,  0.10499359]])
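We can confirm the result by hand, since the derivative of the logistic function $s(x)$ is $s(x)(1 - s(x))$ (plain NumPy, no Theano needed):

```python
import numpy as np

x = np.array([[3., -1.], [0., 2.]])
s = 1 / (1 + np.exp(-x))        # the logistic function
grad = s * (1 - s)              # its derivative, s(x) * (1 - s(x))
print(grad)
```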
import imp
pymc3 = imp.load_module("pymc3", *imp.find_module("pymc", ["/Users/fonnescj/GitHub/pymc"]))
Radon is a radioactive gas that enters homes through contact points with the ground. It is a carcinogen that is the primary cause of lung cancer in non-smokers. Radon levels vary greatly from household to household.
The EPA did a study of radon levels in 80,000 houses. Two important predictors: the floor on which the measurement was taken (radon is higher in basements than on the first floor), and the county-level uranium reading.
The hierarchy in this example is households within county.
Here is the data, which requires some cleanup, courtesy of Pandas.
import pandas as pd
# Import radon data
srrs2 = pd.read_csv('srrs2.dat')
srrs2.columns = srrs2.columns.map(str.strip)
srrs_mn = srrs2[srrs2.state=='MN']
srrs_mn['fips'] = srrs_mn.stfips*1000 + srrs_mn.cntyfips
cty = pd.read_csv('cty.dat')
cty_mn = cty[cty.st=='MN']
cty_mn['fips'] = 1000*cty_mn.stfips + cty_mn.ctfips
# Use the `merge` method to combine home- and county-level information in a single DataFrame.
srrs_mn = srrs_mn.merge(cty_mn[['fips', 'Uppm']], on='fips')
srrs_mn = srrs_mn.drop_duplicates(subset='idnum')
u = np.log(srrs_mn.Uppm)
n = len(srrs_mn)
# We also need a lookup table (`dict`) for each unique county, for indexing.
srrs_mn.county = srrs_mn.county.map(str.strip)
mn_counties = srrs_mn.county.unique()
county_lookup = dict(zip(mn_counties, range(len(mn_counties))))
# Finally, create local copies of variables.
county = srrs_mn['county_code'] = srrs_mn.county.replace(county_lookup).values
radon = srrs_mn.activity
srrs_mn['log_radon'] = log_radon = np.log(radon + 0.1).values
floor_measure = srrs_mn.floor.values
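The lookup-table step deserves a note: it maps county names to consecutive integers so they can later be used as an index array into the county-level parameters. A standalone sketch of the same trick (with a few hypothetical county names):

```python
import numpy as np
import pandas as pd

counties = pd.Series(['AITKIN', 'ANOKA', 'ANOKA', 'BECKER', 'AITKIN'])

# Assign each unique label a consecutive integer code, in order of appearance
uniques = counties.unique()
lookup = dict(zip(uniques, range(len(uniques))))
codes = counties.replace(lookup).values

print(codes)  # [0 1 1 2 0]
```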
Normal, Uniform = pymc3.Normal, pymc3.Uniform
NUTS, sample = pymc3.NUTS, pymc3.sample
The most general model allows both the intercept and slope to vary by county:
$$y_i = \alpha_{j[i]} + \beta_{j[i]} x_{i} + \epsilon_i$$

where

$$\epsilon_i \sim N(0, \sigma_y^2)$$

and the intercept and slope random effects:

$$\alpha_{j[i]} \sim N(\mu_{\alpha}, \sigma_{\alpha}^2)$$

$$\beta_{j[i]} \sim N(\mu_{\beta}, \sigma_{\beta}^2)$$

As with the "no-pooling" model, we set a separate intercept for each county, but rather than fitting separate least squares regression models for each county, multilevel modeling shares strength among counties, allowing for more reasonable inference in counties with little data.
A primary strength of multilevel models is the ability to handle predictors on multiple levels simultaneously. If we consider the varying-intercepts model above:
$$y_i = \alpha_{j[i]} + \beta x_{i} + \epsilon_i$$

we may, instead of a simple random effect to describe variation in the expected radon value, specify another regression model with a county-level covariate. Here, we use the county uranium reading $u_j$, which is thought to be related to radon levels:

$$\alpha_j = \gamma_0 + \gamma_1 u_j + \zeta_j$$

$$\zeta_j \sim N(0, \sigma_{\alpha}^2)$$

Thus, we are now incorporating a house-level predictor (floor or basement) as well as a county-level predictor (uranium).
Note that the model has both indicator variables for each county, plus a county-level covariate. In classical regression, this would result in collinearity. In a multilevel model, the partial pooling of the intercepts towards the expected value of the group-level linear model avoids this.
Group-level predictors also serve to reduce group-level variation $\sigma_{\alpha}$. An important implication of this is that the group-level estimate induces stronger pooling.
hierarchical_model = pymc3.Model()

with hierarchical_model:

    # Priors
    sigma_a = Uniform('sigma_a', lower=0, upper=100)
    tau_a = sigma_a**-2

    # County uranium model for intercept
    gamma_0 = Normal('gamma_0', mu=0., tau=0.0001)
    gamma_1 = Normal('gamma_1', mu=0., tau=0.0001)

    # Uranium model for intercept
    mu_a = gamma_0 + gamma_1*u
    # County variation not explained by uranium
    eps_a = Normal('eps_a', mu=0, tau=tau_a, shape=len(set(county)))
    a = pymc3.Deterministic('a', mu_a + eps_a[county])

    # Common slope
    b = Normal('b', mu=0, tau=0.001)

    # Model error
    sigma_y = Uniform('sigma_y', lower=0, upper=100)
    tau_y = sigma_y**-2

    # Expected value
    y_hat = a + b * floor_measure

    # Data likelihood
    y_like = Normal('y_like', mu=y_hat, tau=tau_y, observed=log_radon)
with hierarchical_model:
    step = NUTS()
    samples = sample(2000, step, njobs=2)
[-----------------100%-----------------] 2000 of 2000 complete in 54.1 sec
import matplotlib.pyplot as plt
trace = samples[-1000:]
u = np.log(srrs_mn.Uppm)
a_means = trace['a'][0].mean(axis=0)
plt.scatter(u, a_means)
g0 = trace['gamma_0'][0].mean()
g1 = trace['gamma_1'][0].mean()
xvals = np.linspace(-1, 0.8)
plt.plot(xvals, g0+g1*xvals, 'k--')
plt.xlim(-1, 0.8)
a_se = trace['a'][0].std(axis=0)
for ui, m, se in zip(u, a_means, a_se):
    plt.plot([ui, ui], [m-se, m+se], 'b-')
plt.xlabel('County-level uranium'); plt.ylabel('Intercept estimate');
For easy specification of generalized linear models, the glm module provides a convenience function. Models can be fit using only a line or two of Python code.
age = np.array([13, 14, 14, 12, 9, 15, 10, 14, 9, 14, 13, 12, 9, 10, 15, 11, 15, 11, 7, 13,
                13, 10, 9, 6, 11, 15, 13, 10, 9, 9, 15, 14, 14, 10, 14, 11, 13, 14, 10])
price = np.array([2950, 2300, 3900, 2800, 5000, 2999, 3950, 2995, 4500, 2800, 1990, 3500,
                  5100, 3900, 2900, 4950, 2000, 3400, 8999, 4000, 2950, 3250, 3950, 4600,
                  4500, 1600, 3900, 4200, 6500, 3500, 2999, 2600, 3250, 2500, 2400, 3990,
                  4600, 450, 4700])/1000.
age_price = pd.DataFrame({'x': age, 'y': price})
glm = pymc3.glm
with pymc3.Model() as model:
    glm.glm('y ~ x', age_price, family=glm.families.T())
    trace = pymc3.sample(2000, pymc3.HamiltonianMC())
[-----------------100%-----------------] 2000 of 2000 complete in 2.5 sec
import matplotlib.pyplot as plt
import prettyplotlib as ppl
fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111, xlabel='Age', ylabel='Price')
ppl.scatter(ax, age, price)
pymc3.glm.plot_posterior_predictive(trace, eval=np.linspace(5, 16, 100), samples=100)
Python and its "scientific stack" offers a compelling computational environment for modern Bayesian computing.
In comparison to other Bayesian computing solutions, it requires the fewest tradeoffs and provides the most interactive platform for developing models.
PyMC provides very general implementations of modern Markov chain Monte Carlo and related procedures, and aspires to be a platform for probabilistic programming.