Reparameterizing the Weibull Accelerated Failure Time Model¶

In [1]:

%matplotlib inline
import pymc3 as pm
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels
import patsy
import theano.tensor as tt

plt.style.use('seaborn-darkgrid')
print('Running on PyMC3 v{}'.format(pm.__version__))

Running on PyMC3 v3.6

Dataset¶

The previous example notebook on Bayesian parametric survival analysis introduced two different accelerated failure time (AFT) models: Weibull and log-linear. In this notebook, we present three different parameterizations of the Weibull AFT model.

The data set we'll use is the flchain R data set, which comes from a medical study investigating the effect of serum free light chain (FLC) on lifespan. Read the full documentation of the data by running:

print(statsmodels.datasets.get_rdataset(package='survival', dataname='flchain').__doc__).

In [2]:

# Fetch and clean data
data = (statsmodels.datasets
                   .get_rdataset(package='survival', dataname='flchain')
                   .data
                   .sample(500)  # Limit ourselves to 500 observations
                   .reset_index(drop=True))

In [3]:

y = data.futime.values
censored = ~data['death'].values.astype(bool)

In [4]:

y[:5]

Out[4]:

array([4953, 4170, 2169, 4522, 4558])

In [5]:

censored[:5]

Out[5]:

array([ True, False,  True, False,  True])

Using `pm.Potential`¶

We have an unique problem when modelling censored data. Strictly speaking, we don't have any data for censored values: we only know the number of values that were censored. How can we include this information in our model?

One way do this is by making use of pm.Potential. The PyMC2 docs explain its usage very well. Essentially, declaring pm.Potential('x', logp) will add logp to the log-likelihood of the model.

Parameterization 1¶

This parameterization is an intuitive, straightforward parameterization of the Weibull survival function. This is probably the first parameterization to come to one's mind.

In [6]:

def weibull_lccdf(x, alpha, beta):
    ''' Log complementary cdf of Weibull distribution. '''
    return -(x / beta)**alpha

In [7]:

with pm.Model() as model_1:
    alpha_sd = 10.0

    mu = pm.Normal('mu', mu=0, sigma=100)
    alpha_raw = pm.Normal('a0', mu=0, sigma=0.1)
    alpha = pm.Deterministic('alpha', tt.exp(alpha_sd * alpha_raw))
    beta = pm.Deterministic('beta', tt.exp(mu / alpha))
    
    y_obs = pm.Weibull('y_obs', alpha=alpha, beta=beta, observed=y[~censored])
    y_cens = pm.Potential('y_cens', weibull_lccdf(y[censored], alpha, beta))

In [8]:

with model_1:
    # Increase tune and change init to avoid divergences
    trace_1 = pm.sample(draws=1000, tune=1000,
                        target_accept=0.9,
                        init='adapt_diag')

Auto-assigning NUTS sampler...
Initializing NUTS using adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [a0, mu]
Sampling 2 chains: 100%|██████████| 4000/4000 [00:08<00:00, 497.02draws/s]
/Users/twiecki/anaconda3/lib/python3.6/site-packages/mkl_fft/_numpy_fft.py:1044: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  output = mkl_fft.rfftn_numpy(a, s, axes)
The number of effective samples is smaller than 25% for some parameters.

In [9]:

pm.traceplot(trace_1, var_names=['alpha', 'beta']);

In [11]:

pm.summary(trace_1, varnames=['alpha', 'beta']).round(2)

/Users/twiecki/anaconda3/lib/python3.6/site-packages/mkl_fft/_numpy_fft.py:1044: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  output = mkl_fft.rfftn_numpy(a, s, axes)

Out[11]:

	mean	sd	mc_error	hpd_2.5	hpd_97.5	n_eff	Rhat
alpha	1.15	0.10	0.01	0.97	1.36	362.67	1.0
beta	11225.11	1322.31	53.46	8899.77	13872.52	595.54	1.0

Parameterization 2¶

Note that, confusingly, alpha is now called r, and alpha denotes a prior; we maintain this notation to stay faithful to the original implementation in Stan. In this parameterization, we still model the same parameters alpha (now r) and beta.

For more information, see this Stan example model and the corresponding documentation.

In [12]:

with pm.Model() as model_2:
    alpha = pm.Normal('alpha', mu=0, sigma=10)
    r = pm.Gamma('r', alpha=1, beta=0.001, testval=0.25)
    beta = pm.Deterministic('beta', tt.exp(-alpha / r))

    y_obs = pm.Weibull('y_obs', alpha=r, beta=beta, observed=y[~censored])
    y_cens = pm.Potential('y_cens', weibull_lccdf(y[censored], r, beta))

In [13]:

with model_2:
    # Increase tune and target_accept to avoid divergences
    trace_2 = pm.sample(draws=1000, tune=1000,
                        target_accept=0.9)

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [r, alpha]
Sampling 2 chains: 100%|██████████| 4000/4000 [00:09<00:00, 434.81draws/s]
/Users/twiecki/anaconda3/lib/python3.6/site-packages/mkl_fft/_numpy_fft.py:1044: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  output = mkl_fft.rfftn_numpy(a, s, axes)
The number of effective samples is smaller than 25% for some parameters.

In [14]:

pm.traceplot(trace_2, var_names=['r', 'beta']);

In [15]:

pm.summary(trace_2, varnames=['r', 'beta']).round(2)

/Users/twiecki/anaconda3/lib/python3.6/site-packages/mkl_fft/_numpy_fft.py:1044: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  output = mkl_fft.rfftn_numpy(a, s, axes)

Out[15]:

	mean	sd	mc_error	hpd_2.5	hpd_97.5	n_eff	Rhat
r	1.15	0.09	0.00	0.98	1.34	385.92	1.0
beta	11275.85	1299.18	48.06	8804.12	13719.80	584.16	1.0

Parameterization 3¶

In this parameterization, we model the log-linear error distribution with a Gumbel distribution instead of modelling the survival function directly. For more information, see this blog post.

In [16]:

logtime = np.log(y)

def gumbel_sf(y, mu, sigma):
    ''' Gumbel survival function. '''
    return 1.0 - tt.exp(-tt.exp(-(y - mu) / sigma))

In [17]:

with pm.Model() as model_3:
    s = pm.HalfNormal('s', tau=5.0)
    gamma = pm.Normal('gamma', mu=0, sigma=5)

    y_obs = pm.Gumbel('y_obs', mu=gamma, beta=s, observed=logtime[~censored])
    y_cens = pm.Potential('y_cens', gumbel_sf(y=logtime[censored], mu=gamma, sigma=s))

In [18]:

with model_3:
    # Increase tune and change init to avoid divergences
    trace_3 = pm.sample(draws=1000, tune=1000,
                        init='adapt_diag')

Auto-assigning NUTS sampler...
Initializing NUTS using adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [gamma, s]
Sampling 2 chains: 100%|██████████| 4000/4000 [00:02<00:00, 1538.34draws/s]
/Users/twiecki/anaconda3/lib/python3.6/site-packages/mkl_fft/_numpy_fft.py:1044: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  output = mkl_fft.rfftn_numpy(a, s, axes)

In [19]:

pm.traceplot(trace_3);

In [20]:

pm.summary(trace_3).round(2)

/Users/twiecki/anaconda3/lib/python3.6/site-packages/mkl_fft/_numpy_fft.py:1044: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  output = mkl_fft.rfftn_numpy(a, s, axes)

Out[20]:

	mean	sd	mc_error	hpd_2.5	hpd_97.5	n_eff	Rhat
gamma	8.47	0.16	0.01	8.16	8.78	940.27	1.0
s	2.17	0.11	0.00	1.98	2.40	900.70	1.0

Authors¶

Originally collated by Junpeng Lao on Apr 21, 2018. See original code here.
Authored and ported to Jupyter notebook by George Ho on Jul 15, 2018.

In [47]:

%load_ext watermark
%watermark -n -u -v -iv -w

pymc3   3.8
arviz   0.7.0
pandas  0.25.3
seaborn 0.9.0
numpy   1.17.5
last updated: Wed Apr 22 2020 

CPython 3.8.0
IPython 7.11.0
watermark 2.0.2