Bayes Factor for Log Regression/

Is there really a relationship between failure and temperature?

An critism of our above analysis is that assumed that the relationship followed a logistic model, this we implictly assumed that the probabilities change over temperature. Let's look at the data again. (Top figure)

Could it be that infact this particular sequence of defects occured by chance? After all, I can produce similar plots. (Bottom figure) This might explain the large overlap in temperatures.

In [1]:
figsize( 12.5, 6)
subplot(211)
plt.scatter( challenger_data[:,0], challenger_data[:,1], s = 75, \
    color="k", alpha = 0.5) 
plt.yticks([0,1])
plt.ylabel("Damage Incident?")
plt.title("(Real) Defects of the Space Shuttle O-Rings vs temperature")

subplot(212)
n = challenger_data.shape[0]
plt.scatter( challenger_data[:,0], stats.bernoulli.rvs(0.6, size = n) ,
     s = 75, color="k", alpha = 0.5) 
plt.yticks([0,1])
plt.ylabel("Damage Incident?")
plt.xlabel("Outside temperature (Farhenhit)" )
plt.title("(Artificial) Defects of the Space Shuttle O-Rings vs \
temperature")
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-1-a4aeded293db> in <module>()
----> 1 figsize( 12.5, 6)
      2 subplot(211)
      3 plt.scatter( challenger_data[:,0], challenger_data[:,1], s = 75, \
      4     color="k", alpha = 0.5) 
      5 plt.yticks([0,1])

NameError: name 'figsize' is not defined

Introducing the Bayes factor

The Bayes factor is a measure comparing two models. In our example, on the one hand we believe that temperature plays an important role in the probability of O-ring failures. On the other hand, we believe that there is no significant connection, and the pattern appears by coincidence. We can compare these model using the ratio of the probabilities of observing the data, given the model:

$$ \frac{ P( \text{observations} | \text{Model}_1 ) }{ P( \text{observations} | \text{Model}_2 ) }$$

Calculating this is not at all obvious. For simplicity, let's select a set of parameters from the posterior distribution (which is tantamount to selecting a particular model).

In [2]:
alpha_hat = alpha_samples[0,0] #select the first alpha sample
beta_hat = beta_samples[0,0] #select the first beta sample

p_hat = logistic( temperature, beta_hat, alpha_hat)
print "estimates of probability at observed temperature, defects: "
print np.array(zip(p_hat, temperature, D) )[:3, :]
print "..."
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-2-bf332447957b> in <module>()
----> 1 alpha_hat = alpha_samples[0,0] #select the first alpha sample
      2 beta_hat = beta_samples[0,0] #select the first beta sample
      3 
      4 p_hat = logistic( temperature, beta_hat, alpha_hat)
      5 print "estimates of probability at observed temperature, defects: "

NameError: name 'alpha_samples' is not defined

To calculate the numerator in the Bayes factor, we start by assuming a model, in our case assume alpha_hat, beta_hat are true, and calculate the probability of the defects we observe, equal to the product of:

$$ P(\; \text{Ber}(\; p(t_i \; | \; \hat{\beta}, \hat{\alpha} )\; ) = D_i \; ),\;\; i=1..N $$

For example, using the output above, the first computation, $i=0$, would look like

\begin{align} & p = p(t_1 \; | \; \hat{\beta}, \hat{\alpha} ) = 0.667 \\\\ & d = D_0 = 0 \\\\ & \Rightarrow \; P( \; \text{Ber}(p) = d ) = ?? \end{align}

The probability of this ocurring is $(1-0.667) = 0.333\; $ (Recall the definition of a Bernoulli random variable $\text{Ber}(p)$: it is equal to $1$ with probability $p$ and equal to 0 with probability $1-p$). As each observation is independent, we multiply all the observations together. A clever way to do this for a specific $i$ is, recalling the $D_i$ can only be 1 or 0:

$$\left( p(t_1 \; | \; \hat{\beta}, \hat{\alpha} )\right)^{D_i} \left( 1 - p(t_1 \; | \; \hat{\beta}, \hat{\alpha} ) \right)^{(1-D_i)}$$

(it is possible that the quantity can overflow, so it is recommended to take the $\log$ of the above::

$$ D_i\log( p(t_1 \; | \; \hat{\beta}, \hat{\alpha} ) ) + (1-D_i)\log( 1- p(t_1 \; | \; \hat{\beta}, \hat{\alpha} ) ) $$

and sum the terms instead of multiplying. Be sure to use this transform for both models you are comparing.)

In [3]:
print p_hat
_product = p_hat**( D )*(1-p_hat)**(1-D)
prob_given_model_1 = _product.prod()
print "probability of observations, model 1: %.10f"%prob_given_model_1
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-3-be149a40cb07> in <module>()
----> 1 print p_hat
      2 _product = p_hat**( D )*(1-p_hat)**(1-D)
      3 prob_given_model_1 = _product.prod()
      4 print "probability of observations, model 1: %.10f"%prob_given_model_1

NameError: name 'p_hat' is not defined

The second model we are comparing against is that $\beta=0$, i.e. there is no relationship between probabilites and temperature. We perform the same calculations as above. Notice that beta=0 here in the below PyMC code:

In [4]:
beta = 0
alpha = mc.Normal( "alpha", 0, 0.001, value = 0 )

@mc.deterministic
def p( temp = temperature, alpha = alpha, beta = beta):
    return 1.0/( 1. + np.exp( beta*temperature + alpha) ) 


observed = mc.Bernoulli( "bernoulli_obs", p, \
                        value = D, observed=True)

model = mc.Model( {"observed":observed, "beta":beta, "alpha":alpha} )

#mysterious code to be explained in Chapter 3
map_ = mc.MAP(model)
map_.fit()
mcmc = mc.MCMC( model )
mcmc.sample( 260000, 220000, 3 )
######

alpha_samples_model_2 = mcmc.trace("alpha")[:, None]
alpha_hat = alpha_samples_model_2[0] #use the first 'model'
beta_hat = 0

p_hat = logistic( temperature, beta_hat, alpha_hat )
print "estimates of probability at observed temperature, defects: "
print np.array(zip(p_hat, temperature, D ) )[:3, :]
print
print "Notice the probability is constant for all temperatures?"
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-4-2fd082360270> in <module>()
      1 beta = 0
----> 2 alpha = mc.Normal( "alpha", 0, 0.001, value = 0 )
      3 
      4 @mc.deterministic
      5 def p( temp = temperature, alpha = alpha, beta = beta):

NameError: name 'mc' is not defined
In [5]:
#compute the probability of observations given the model_2

_product = p_hat**( D )*(1-p_hat)**(1-D)
prob_given_model_2 = _product.prod()
print "probability of observations, model 2: %.10f"%prob_given_model_2
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-5-de7da8dccc70> in <module>()
      1 #compute the probability of observations given the model_2
      2 
----> 3 _product = p_hat**( D )*(1-p_hat)**(1-D)
      4 prob_given_model_2 = _product.prod()
      5 print "probability of observations, model 2: %.10f"%prob_given_model_2

NameError: name 'p_hat' is not defined
In [7]:
print "Bayes factor = %.3f"%(prob_given_model_1/prob_given_model_2)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-7-c40e6307b245> in <module>()
----> 1 print "Bayes factor = %.3f"%(prob_given_model_1/prob_given_model_2)

NameError: name 'prob_given_model_1' is not defined

Is this good? Below is a chart that can be used to compare the computed Bayes factors to a degree of confidence in M1.

Bayes factorSupporting M1
<1:1 Negative (supports M2)
1:1 to 3:1 Barely worth mentioning
3:1 to 10:1 Substantial
10:1 to 30:1 Strong
30:1 to 100:1 Very strong
> 100:1 Decisive

We are not done yet. If you recall, we selected only one model, but recall we actually have a distribution of possible models (sequential pairs of ($\beta, \alpha$) from the posterior distributions). So to be correct we need to average over samples from the posterior:

In [9]:
p = logistic( temperature[None,:], beta_samples, alpha_samples )
_product = p**( D )*(1-p)**(1-D)
prob_given_model_1 = _product.prod(axis=1).mean()
print "expected prob. of obs., given model 1: %.10f"%prob_given_model_1

    
p_model_2 = logistic( temperature[:,None], 
                np.zeros_like(temperature), alpha_samples_model_2)

_product = p_model_2**( D )*(1-p_model_2)**(1-D)
prob_given_model_2 = _product.prod(axis=1).mean()
print "expected prob. of obs., given model 2: %.10f"%prob_given_model_2
print 
print "Bayes factor: %.3f"%(prob_given_model_1/prob_given_model_2)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-9-5e4bf3a1b066> in <module>()
----> 1 p = logistic( temperature[None,:], beta_samples, alpha_samples )
      2 _product = p**( D )*(1-p)**(1-D)
      3 prob_given_model_1 = _product.prod(axis=1).mean()
      4 print "expected prob. of obs., given model 1: %.10f"%prob_given_model_1
      5 

NameError: name 'logistic' is not defined

It looks we have a pretty good model, and temperature is significant. Look at you, you're a rocket scientist now.