%matplotlib inline
import pymc3 as pm
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels
import patsy
import theano.tensor as tt
plt.style.use('seaborn-darkgrid')
print('Running on PyMC3 v{}'.format(pm.__version__))
Running on PyMC3 v3.6
The previous example notebook on Bayesian parametric survival analysis introduced two different accelerated failure time (AFT) models: Weibull and log-linear. In this notebook, we present three different parameterizations of the Weibull AFT model.
The data set we'll use is the flchain
R data set, which comes from a medical study investigating the effect of serum free light chain (FLC) on lifespan. Read the full documentation of the data by running:
print(statsmodels.datasets.get_rdataset(package='survival', dataname='flchain').__doc__)
.
# Fetch and clean data
data = (statsmodels.datasets
.get_rdataset(package='survival', dataname='flchain')
.data
.sample(500) # Limit ourselves to 500 observations
.reset_index(drop=True))
y = data.futime.values
censored = ~data['death'].values.astype(bool)
y[:5]
array([4953, 4170, 2169, 4522, 4558])
censored[:5]
array([ True, False, True, False, True])
pm.Potential
¶We have an unique problem when modelling censored data. Strictly speaking, we don't have any data for censored values: we only know the number of values that were censored. How can we include this information in our model?
One way do this is by making use of pm.Potential
. The PyMC2 docs explain its usage very well. Essentially, declaring pm.Potential('x', logp)
will add logp
to the log-likelihood of the model.
This parameterization is an intuitive, straightforward parameterization of the Weibull survival function. This is probably the first parameterization to come to one's mind.
def weibull_lccdf(x, alpha, beta):
''' Log complementary cdf of Weibull distribution. '''
return -(x / beta)**alpha
with pm.Model() as model_1:
alpha_sd = 10.0
mu = pm.Normal('mu', mu=0, sigma=100)
alpha_raw = pm.Normal('a0', mu=0, sigma=0.1)
alpha = pm.Deterministic('alpha', tt.exp(alpha_sd * alpha_raw))
beta = pm.Deterministic('beta', tt.exp(mu / alpha))
y_obs = pm.Weibull('y_obs', alpha=alpha, beta=beta, observed=y[~censored])
y_cens = pm.Potential('y_cens', weibull_lccdf(y[censored], alpha, beta))
with model_1:
# Increase tune and change init to avoid divergences
trace_1 = pm.sample(draws=1000, tune=1000,
target_accept=0.9,
init='adapt_diag')
Auto-assigning NUTS sampler... Initializing NUTS using adapt_diag... Multiprocess sampling (2 chains in 2 jobs) NUTS: [a0, mu] Sampling 2 chains: 100%|██████████| 4000/4000 [00:08<00:00, 497.02draws/s] /Users/twiecki/anaconda3/lib/python3.6/site-packages/mkl_fft/_numpy_fft.py:1044: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result. output = mkl_fft.rfftn_numpy(a, s, axes) The number of effective samples is smaller than 25% for some parameters.
pm.traceplot(trace_1, var_names=['alpha', 'beta']);
pm.summary(trace_1, varnames=['alpha', 'beta']).round(2)
/Users/twiecki/anaconda3/lib/python3.6/site-packages/mkl_fft/_numpy_fft.py:1044: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result. output = mkl_fft.rfftn_numpy(a, s, axes)
mean | sd | mc_error | hpd_2.5 | hpd_97.5 | n_eff | Rhat | |
---|---|---|---|---|---|---|---|
alpha | 1.15 | 0.10 | 0.01 | 0.97 | 1.36 | 362.67 | 1.0 |
beta | 11225.11 | 1322.31 | 53.46 | 8899.77 | 13872.52 | 595.54 | 1.0 |
Note that, confusingly, alpha
is now called r
, and alpha
denotes a prior; we maintain this notation to stay faithful to the original implementation in Stan. In this parameterization, we still model the same parameters alpha
(now r
) and beta
.
For more information, see this Stan example model and the corresponding documentation.
with pm.Model() as model_2:
alpha = pm.Normal('alpha', mu=0, sigma=10)
r = pm.Gamma('r', alpha=1, beta=0.001, testval=0.25)
beta = pm.Deterministic('beta', tt.exp(-alpha / r))
y_obs = pm.Weibull('y_obs', alpha=r, beta=beta, observed=y[~censored])
y_cens = pm.Potential('y_cens', weibull_lccdf(y[censored], r, beta))
with model_2:
# Increase tune and target_accept to avoid divergences
trace_2 = pm.sample(draws=1000, tune=1000,
target_accept=0.9)
Auto-assigning NUTS sampler... Initializing NUTS using jitter+adapt_diag... Multiprocess sampling (2 chains in 2 jobs) NUTS: [r, alpha] Sampling 2 chains: 100%|██████████| 4000/4000 [00:09<00:00, 434.81draws/s] /Users/twiecki/anaconda3/lib/python3.6/site-packages/mkl_fft/_numpy_fft.py:1044: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result. output = mkl_fft.rfftn_numpy(a, s, axes) The number of effective samples is smaller than 25% for some parameters.
pm.traceplot(trace_2, var_names=['r', 'beta']);
pm.summary(trace_2, varnames=['r', 'beta']).round(2)
/Users/twiecki/anaconda3/lib/python3.6/site-packages/mkl_fft/_numpy_fft.py:1044: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result. output = mkl_fft.rfftn_numpy(a, s, axes)
mean | sd | mc_error | hpd_2.5 | hpd_97.5 | n_eff | Rhat | |
---|---|---|---|---|---|---|---|
r | 1.15 | 0.09 | 0.00 | 0.98 | 1.34 | 385.92 | 1.0 |
beta | 11275.85 | 1299.18 | 48.06 | 8804.12 | 13719.80 | 584.16 | 1.0 |
In this parameterization, we model the log-linear error distribution with a Gumbel distribution instead of modelling the survival function directly. For more information, see this blog post.
logtime = np.log(y)
def gumbel_sf(y, mu, sigma):
''' Gumbel survival function. '''
return 1.0 - tt.exp(-tt.exp(-(y - mu) / sigma))
with pm.Model() as model_3:
s = pm.HalfNormal('s', tau=5.0)
gamma = pm.Normal('gamma', mu=0, sigma=5)
y_obs = pm.Gumbel('y_obs', mu=gamma, beta=s, observed=logtime[~censored])
y_cens = pm.Potential('y_cens', gumbel_sf(y=logtime[censored], mu=gamma, sigma=s))
with model_3:
# Increase tune and change init to avoid divergences
trace_3 = pm.sample(draws=1000, tune=1000,
init='adapt_diag')
Auto-assigning NUTS sampler... Initializing NUTS using adapt_diag... Multiprocess sampling (2 chains in 2 jobs) NUTS: [gamma, s] Sampling 2 chains: 100%|██████████| 4000/4000 [00:02<00:00, 1538.34draws/s] /Users/twiecki/anaconda3/lib/python3.6/site-packages/mkl_fft/_numpy_fft.py:1044: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result. output = mkl_fft.rfftn_numpy(a, s, axes)
pm.traceplot(trace_3);
pm.summary(trace_3).round(2)
/Users/twiecki/anaconda3/lib/python3.6/site-packages/mkl_fft/_numpy_fft.py:1044: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result. output = mkl_fft.rfftn_numpy(a, s, axes)
mean | sd | mc_error | hpd_2.5 | hpd_97.5 | n_eff | Rhat | |
---|---|---|---|---|---|---|---|
gamma | 8.47 | 0.16 | 0.01 | 8.16 | 8.78 | 940.27 | 1.0 |
s | 2.17 | 0.11 | 0.00 | 1.98 | 2.40 | 900.70 | 1.0 |
%load_ext watermark
%watermark -n -u -v -iv -w
pymc3 3.8 arviz 0.7.0 pandas 0.25.3 seaborn 0.9.0 numpy 1.17.5 last updated: Wed Apr 22 2020 CPython 3.8.0 IPython 7.11.0 watermark 2.0.2