Think Bayes

Copyright 2018 Allen B. Downey

MIT License: https://opensource.org/licenses/MIT

In [1]:
# Configure Jupyter so figures appear in the notebook
%matplotlib inline

# Configure Jupyter to display the assigned value after an assignment
%config InteractiveShell.ast_node_interactivity='last_expr_or_assign'

import numpy as np
import pandas as pd

# import classes from thinkbayes2
from thinkbayes2 import Pmf, Cdf, Suite, Joint

import thinkplot

The Space Shuttle problem

Here's a problem from Bayesian Methods for Hackers

On January 28, 1986, the twenty-fifth flight of the U.S. space shuttle program ended in disaster when one of the rocket boosters of the Shuttle Challenger exploded shortly after lift-off, killing all seven crew members. The presidential commission on the accident concluded that it was caused by the failure of an O-ring in a field joint on the rocket booster, and that this failure was due to a faulty design that made the O-ring unacceptably sensitive to a number of factors including outside temperature. Of the previous 24 flights, data were available on failures of O-rings on 23, (one was lost at sea), and these data were discussed on the evening preceding the Challenger launch, but unfortunately only the data corresponding to the 7 flights on which there was a damage incident were considered important and these were thought to show no obvious trend. The data are shown below (see 1):

In [2]:
# !wget https://raw.githubusercontent.com/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/master/Chapter2_MorePyMC/data/challenger_data.csv
In [3]:
columns = ['Date', 'Temperature', 'Incident']
df = pd.read_csv('challenger_data.csv', parse_dates=[0])
df.drop(labels=[3, 24], inplace=True)
df
Out[3]:
Date Temperature Damage Incident
0 1981-04-12 66 0
1 1981-11-12 70 1
2 1982-03-22 69 0
4 1982-01-11 68 0
5 1983-04-04 67 0
6 1983-06-18 72 0
7 1983-08-30 73 0
8 1983-11-28 70 0
9 1984-02-03 57 1
10 1984-04-06 63 1
11 1984-08-30 70 1
12 1984-10-05 78 0
13 1984-11-08 67 0
14 1985-01-24 53 1
15 1985-04-12 67 0
16 1985-04-29 75 0
17 1985-06-17 70 0
18 1985-07-29 81 0
19 1985-08-27 76 0
20 1985-10-03 79 0
21 1985-10-30 75 1
22 1985-11-26 76 0
23 1986-01-12 58 1
In [4]:
df['Incident'] = df['Damage Incident'].astype(float)
df
Out[4]:
Date Temperature Damage Incident Incident
0 1981-04-12 66 0 0.0
1 1981-11-12 70 1 1.0
2 1982-03-22 69 0 0.0
4 1982-01-11 68 0 0.0
5 1983-04-04 67 0 0.0
6 1983-06-18 72 0 0.0
7 1983-08-30 73 0 0.0
8 1983-11-28 70 0 0.0
9 1984-02-03 57 1 1.0
10 1984-04-06 63 1 1.0
11 1984-08-30 70 1 1.0
12 1984-10-05 78 0 0.0
13 1984-11-08 67 0 0.0
14 1985-01-24 53 1 1.0
15 1985-04-12 67 0 0.0
16 1985-04-29 75 0 0.0
17 1985-06-17 70 0 0.0
18 1985-07-29 81 0 0.0
19 1985-08-27 76 0 0.0
20 1985-10-03 79 0 0.0
21 1985-10-30 75 1 1.0
22 1985-11-26 76 0 0.0
23 1986-01-12 58 1 1.0
In [5]:
import matplotlib.pyplot as plt

plt.scatter(df.Temperature, df.Incident, s=75, color="k",
            alpha=0.5)
plt.yticks([0, 1])
plt.ylabel("Damage Incident?")
plt.xlabel("Outside temperature (Fahrenheit)")
plt.title("Defects of the Space Shuttle O-Rings vs temperature");

Grid algorithm

We can solve the problem first using a grid algorithm, with parameters b0 and b1, and

$\mathrm{logit}(p) = b0 + b1 * T$

and each datum being a temperature T and a boolean outcome fail, which is true is there was damage and false otherwise.

Hint: the expit function from scipy.special computes the inverse of the logit function.

In [6]:
from scipy.special import expit

class Logistic(Suite, Joint):
    
    def Likelihood(self, data, hypo):
        """
        
        data: T, fail
        hypo: b0, b1
        """
        return 1
In [7]:
# Solution

from scipy.special import expit

class Logistic(Suite, Joint):
    
    def Likelihood(self, data, hypo):
        """
        
        data: T, fail
        hypo: b0, b1
        """
        temp, fail = data
        b0, b1 = hypo
        
        log_odds = b0 + b1 * temp
        p_fail = expit(log_odds)
        if fail == 1:
            return p_fail
        elif fail == 0:
            return 1-p_fail
        else:
            # NaN
            return 1
In [8]:
b0 = np.linspace(0, 50, 101);
In [9]:
b1 = np.linspace(-1, 1, 101);
In [10]:
from itertools import product
hypos = product(b0, b1)
Out[10]:
<itertools.product at 0x7f25c84d75e8>
In [11]:
suite = Logistic(hypos);
In [12]:
for data in zip(df.Temperature, df.Incident):
    print(data)
    suite.Update(data)
(66, 0.0)
(70, 1.0)
(69, 0.0)
(68, 0.0)
(67, 0.0)
(72, 0.0)
(73, 0.0)
(70, 0.0)
(57, 1.0)
(63, 1.0)
(70, 1.0)
(78, 0.0)
(67, 0.0)
(53, 1.0)
(67, 0.0)
(75, 0.0)
(70, 0.0)
(81, 0.0)
(76, 0.0)
(79, 0.0)
(75, 1.0)
(76, 0.0)
(58, 1.0)
In [13]:
thinkplot.Pdf(suite.Marginal(0))
thinkplot.decorate(xlabel='Intercept',
                   ylabel='PMF',
                   title='Posterior marginal distribution')
In [14]:
thinkplot.Pdf(suite.Marginal(1))
thinkplot.decorate(xlabel='Log odds ratio',
                   ylabel='PMF',
                   title='Posterior marginal distribution')

According to the posterior distribution, what was the probability of damage when the shuttle launched at 31 degF?

In [15]:
# Solution

T = 31
total = 0

for hypo, p in suite.Items():
    b0, b1 = hypo
    log_odds = b0 + b1 * T
    p_fail = expit(log_odds)
    total += p * p_fail
    
total
Out[15]:
0.9909003512180481
In [16]:
# Solution

pred = suite.Copy()
pred.Update((31, True))
Out[16]:
0.9909003512180481

MCMC

Implement this model using MCMC. As a starting place, you can use this example from the PyMC3 docs.

As a challege, try writing the model more explicitly, rather than using the GLM module.

In [17]:
from warnings import simplefilter
simplefilter('ignore', FutureWarning)

import pymc3 as pm
In [18]:
# Solution

with pm.Model() as model:
    pm.glm.GLM.from_formula('Incident ~ Temperature', df, 
                            family=pm.glm.families.Binomial())
    
    start = pm.find_MAP()
    trace = pm.sample(1000, start=start, tune=1000)
logp = -18, ||grad|| = 11.915: 100%|██████████| 27/27 [00:00<00:00, 1502.12it/s]      
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [Temperature, Intercept]
Sampling 2 chains: 100%|██████████| 4000/4000 [00:06<00:00, 594.28draws/s]
The estimated number of effective samples is smaller than 200 for some parameters.
In [19]:
pm.traceplot(trace);
In [20]:
# Solution

with pm.Model() as model:
    pm.glm.GLM.from_formula('Incident ~ Temperature', df, 
                            family=pm.glm.families.Binomial())
    
    trace = pm.sample(1000, tune=1000)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [Temperature, Intercept]
Sampling 2 chains:   0%|          | 0/4000 [00:00<?, ?draws/s]
---------------------------------------------------------------------------
RemoteTraceback                           Traceback (most recent call last)
RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/downey/anaconda3/lib/python3.6/site-packages/pymc3/parallel_sampling.py", line 73, in run
    self._start_loop()
  File "/home/downey/anaconda3/lib/python3.6/site-packages/pymc3/parallel_sampling.py", line 113, in _start_loop
    point, stats = self._compute_point()
  File "/home/downey/anaconda3/lib/python3.6/site-packages/pymc3/parallel_sampling.py", line 139, in _compute_point
    point, stats = self._step_method.step(self._point)
  File "/home/downey/anaconda3/lib/python3.6/site-packages/pymc3/step_methods/arraystep.py", line 247, in step
    apoint, stats = self.astep(array)
  File "/home/downey/anaconda3/lib/python3.6/site-packages/pymc3/step_methods/hmc/base_hmc.py", line 117, in astep
    'might be misspecified.' % start.energy)
ValueError: Bad initial energy: inf. The model might be misspecified.
"""

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
ValueError: Bad initial energy: inf. The model might be misspecified.

The above exception was the direct cause of the following exception:

RuntimeError                              Traceback (most recent call last)
<ipython-input-20-18c7e811f775> in <module>()
      5                             family=pm.glm.families.Binomial())
      6 
----> 7     trace = pm.sample(1000, tune=1000)

~/anaconda3/lib/python3.6/site-packages/pymc3/sampling.py in sample(draws, step, init, n_init, start, trace, chain_idx, chains, cores, tune, nuts_kwargs, step_kwargs, progressbar, model, random_seed, live_plot, discard_tuned_samples, live_plot_kwargs, compute_convergence_checks, use_mmap, **kwargs)
    447             _print_step_hierarchy(step)
    448             try:
--> 449                 trace = _mp_sample(**sample_args)
    450             except pickle.PickleError:
    451                 _log.warning("Could not pickle model, sampling singlethreaded.")

~/anaconda3/lib/python3.6/site-packages/pymc3/sampling.py in _mp_sample(draws, tune, step, chains, cores, chain, random_seed, start, progressbar, trace, model, use_mmap, **kwargs)
    997         try:
    998             with sampler:
--> 999                 for draw in sampler:
   1000                     trace = traces[draw.chain - chain]
   1001                     if trace.supports_sampler_stats and draw.stats is not None:

~/anaconda3/lib/python3.6/site-packages/pymc3/parallel_sampling.py in __iter__(self)
    303 
    304         while self._active:
--> 305             draw = ProcessAdapter.recv_draw(self._active)
    306             proc, is_last, draw, tuning, stats, warns = draw
    307             if self._progress is not None:

~/anaconda3/lib/python3.6/site-packages/pymc3/parallel_sampling.py in recv_draw(processes, timeout)
    221         if msg[0] == 'error':
    222             old = msg[1]
--> 223             six.raise_from(RuntimeError('Chain %s failed.' % proc.chain), old)
    224         elif msg[0] == 'writing_done':
    225             proc._readable = True

~/anaconda3/lib/python3.6/site-packages/six.py in raise_from(value, from_value)

RuntimeError: Chain 0 failed.

The posterior distributions for these parameters should be similar to what we got with the grid algorithm.

In [ ]: