Bayesian Methods for Hackers

The material for this notebook is provided via this free online textbook :

I am writing this notebook to document my learning, and hopefully to help others learn machine learning. You can think of it as my personal lecture notes. I would love suggestions / corrections / feedback for these notebooks.

Visit my webpage for more.

Email me: [email protected]

In [ ]:
Out[ ]:
submit to reddit

This notebook covers or includes:

  • Introduction to Bayesian Methods
  • Uses Python and PyMC
  • Many Examples and Figures

Chapter 1

The Phiosophy of Bayesian Inference

Bayesian inference differs from more traditional statistical inference by preserving uncertainty. Let's start thinking Bayesian. The Bayesian view interprets probability as a measure of believability in an event, that is, how confident are we in an event occurring. This is actually the natural interpretation of probability itself.

Consider an alternative interpretation of probability: Frequentist, known as the more classical version of statistics, assumes that probability is the long-run frequency of events. For example, the probability of plane accidents under a frequentist theory is interpreted as the long-term frequency of plane accidents. This makes sense, but is difficult to understand when events have no long-term frequency of occurrences, for example presidential elections, which only happen once. Frequentists get around this by creating alternate realities and saying that across all these realities, the frequency of occurrences defines the probability.

Bayesians, have a more intuitive approach. They interpret a probability as a measure of belief, or confidence, of an event occurring. Simply, a probability is a summary of an opinion. An individual who assigns a belief of 0 to an event has no confidence that it will occur; conversely, assigning a belief of 1 implies that they are absolutely certain of the event occurring. Beliefs between 0 and 1 allow of weightings of other outcomes. This definition agrees with the plane accident example. For having observed the frequency of plane accidents, an individuals belief should be equal to that frequency, excluding any outside information. Under this framework it is not meaningful to speak about probabilities (beliefs) of examples like the presidential outcome: how confident are you that candidate A will win?

Notice that assigning belief (probability) to an individuals implies that we can have different individuals that have different beliefs of events occurring, because they possess different information about the world. This is very similar to what naturally is occurring. The existence of different beliefs (probabilities) does not mean anyone is wrong. Consider these examples:

  • Flip a coin and two people guess the result. We would both agree that the probability of Heads is 1/2. Assume then, that I peek at the coin. Now I know for certain what the result is: I assign a probability of 1.0 to either Heads or Tails (whichever it is). Now what is the other persons belief that the coin is Heads? Unless we know for certain the result, you would still assume a 1/2 probability. Thus we assign different probabilities to the result.

  • Your code either has a bug in it or not, but we don't know for certain, though we have a belief about the presence or absence of a bug.

  • A medical patient is presenting symptoms $x$, $y$, and $z$. There are a number of diseases that could be causing all of them, but only one disease is present. A doctor has beliefs about which disease, but a second doctor may have slightly different beliefs.

Treating beliefs as probabilities is natural to humans. We use it constantly as we interact with the world and only see partial truths, but we gather evidence to form beliefs. Alternatively, we have been trained to think of probability like a Frequentist.

In traditional probability, we denote our belief about an event A as $P(A)$. This quantity is called the prior probability.

In Bayesian theory, similar to human nature, we update out beliefs after seeing evidence, even if the evidence is counter to what was initially believed. We denote this updated belief as $P(A|X)$, interpreted as the probability of $A$ given the evidence $X$. We called the updated belief the posterior probability so as to contrast it with the prior probability. For example, consider the posterior probabilities (posterior beliefs) of the above examples, after observing some evidence $X$.

  1. $P(A)$ : The coin has a 50% chance of being Heads. $P(A|X)$: You look at the coin, observe heads has landed, denote this information $X$, and assign a probability 1.0 to Heads and 0.0 to Tails.

  2. $P(A)$ : This big, but complex code likely has a big in it. $P(A|X)$ The code passed all $X$ tests; there still might be a big, but its presence is less likely now.

  3. $P(A)$ : The patient could have any number of diseases. $(PA|X)$ : Performing a blood test generated evidence $X$, ruling out some of the possible diseases from consideration.

In each example we did not completely discard the prior belief (probability) after seeing new evidence $X$, but we re-weighted the prior to incorporate the new evidence.

By introducing prior uncertainty about events, we are already admitting that any guess we make is potentially very wrong. After observing data, or other evidence, we update out beliefs, and our guess becomes less wrong. This is the flip side of the prediction coint, where we typically try to be more right than less wrong.

Bayesian Inference In Practice

If a Frequentist and Bayesian inference were programming functions, with inputs being statistical problems, then the two would be different in what they return. The frequentist function would return a number representing an estimate (typically a summary statistical like te sample average), whereas the Bayesian function would return probabilities.

For example, in our debugging problem above: calling the frequentist function with the argument "My code passed all $X$ tests; is my code bug-free?" would return a YES. On the other hand, the Bayesian function would test: "My code often has bugs. My code passed all $X$ tests; is my code bug-free?" would return probabilities of YES and NO. For example:

YES, with probability 0.8; NO, with probability 0.2

Note the Bayesian function accepted an additonal argument: "Often my code has bugs". This parameter is the prior. By including the prior parameter, we are telling the funcion to include out belief about the situation. Technically this parameter in the Bayesian function is optional, but we will see excluding it has its own consequences.

Incorperating Evidence

As we acquire more and more evidence, our prior belief is slowly replaced by the new evidence. For example, if your prior belief is something rediculous, like "I expect the sun to explode today", and each day you are proved wrong, you would hope that any inference would correct you. Bayesian inference will correct this belief.

Denote $N$ as the number of instances of evidence we possess. As we gather an infinite amount of evidence, say as $N$ --> inifinity, out Bayesian results (0ften) align with frequentist results. Hence for large $N$, statistical inference is more or less objective. On the other hand, for small $N$, inference is much more unstable frequentist estimates have more variance and larger confidence intervals. This is where Bayesian analysy excels. By introducing uncertainty that reflects the instability of statistical inference of a small $N$ dataset.

One may believe that for large $N$, one can be indifferent between the two techniques since they offere similar inference, and might lean towards the simpler frequentist methods. In this position, the authors offer a quote from Andrew Gelman (2005) <a href=''[1]></a>.

Sample sizes are never large. If $N$ is too small the get a sufficiently-precise estimate, you need to get more data (or make more assumptions). But once $N$ is 'large enough', you can start subdividing the data to learn more. For example: in a public opinion poll, once you have a good estimate for an entire country, you can estimate among men and women, and different age groups. $N$ is never enough because if it were "enough" you'd already be onto the next problem for which you need more data.

Are frequentist methods incorrect then? No.

Frequentist methods are still useful or state-of-the-art in many areas. Tools such as least squares linear regression, LASSO regression, and expectation-maximization algorithms are all powerful and fast. Bayesian methods compliment these techniques by solving problems that these approaches cannot, or supplimenting existing methods with more flexible models.

A more formal definition of the Bayesian framework

We are interested in beliefs, which can be interpreted as probabilities if we are thinking Bayesian. We have a prior belief in event $A$, beliefs formed by previous information.

Next, we observe our evidence. To continue the buggy-code example: if out code passes $X$ tests, we want to update our belief to incorporate this. We call the new updated belief the posterior probability. Updating the belief is done via the following equation, known as Bayes Theorem, discovered by Thomas Bayes.

$$ P(A|X) = \frac{P(X|A) P(A)}{ P(X)} $$

$$\propto P(X|A)(P(A)$$

This formula is not unique to Bayesian inference, but it is used in this context to connect prior probabilities $P(A)$ with an update posterior probability $P(A|X)$.

Example: Coin Flip

Suppose, naively, that you are unsure about the probability of $heads$ in a coin flip. You believe that there is some true underlying ration, call it $p$, but have no prior opinion on what $p$ might be.

We begin to flip a coin, and record the observations: either $H$ or $T$. This is our observed data. How does our inference change as we observe more and more data? More specifically, what do our posterior probabilities look like when we have little data, versus when we have lots of data.

Below we plot this sequence of updating posterior probabilities as we observe increasing amount of data (coin flips).

In [170]:
%matplotlib inline
import numpy as np
from matplotlib import pyplot as plt

import seaborn as sns
# Some initial styling
rc={'axes.labelsize': 16.0, 'font.size': 16, 
    'legend.fontsize': 14.0, 'axes.titlesize': 18,
    'xtick.labelsize': 12.0, 'ytick.labelsize': 12.0}

sns.set(context = 'poster', style = 'whitegrid', rc=rc)

import scipy.stats as stats
In [171]:
# Create some data
dist = stats.beta
n_trials = [0,1,2,3,4,5,8,15,50,500]
data = stats.bernoulli.rvs(0.5, size = n_trials[-1])
x = np.linspace(0,1,100)

# Plot the data

for k, N in enumerate(n_trials):
    sx = plt.subplot(len(n_trials) / 2, 2, k+1)
    plt.xlabel("$p$, probability of heads") \
        if k in [0, len(n_trials) - 1] else None
    heads = data[:N].sum()
    y = dist.pdf(x, 1 + heads, 1 + N - heads)
    plt.plot(x, y, label='observed %d tosses, \n %d heads' % (N, heads))
    plt.fill_between(x, 0, y, color='blue', alpha=0.4)
    plt.vlines(0.5, 0, 4, color='k', linestyles='--', lw=1)
    leg = plt.legend()
    plt.subplots_adjust(wspace=None, hspace=0.4)
plt.suptitle('Bayesian updating of posterior probabilities',
             y = 1.02)
<matplotlib.text.Text at 0x14bee1a10>

The posterior probabilities are represented by the curves, and our uncertainty is proportional to the width of the curve. Eventually as we observe more data, our probabilities will tighten closer and closer to the true value of $p = 0.5$ (marked by the dashed line).

Note that at the start of these iterations the plots are not peaked at 0.5. There is no reason it should be, since we did not have a prior opinion of what $p$ was.

Example: Bug, or just an unintended feature?

Let $A$ denote the event that out code has no bugs in it. Let $X$ denote the event that the code passes all debugging tests. For now, we leave the prior probability of no bugs as a variable, $P(A) = p$.

We are interested in $P(A|X)$, i.e. the probability of no bugs, given our debugging tests $X$. To do this we need to compute some quantities.

what is $P(X|A)$ : The probability that the code passes $X$ tests given there are no bugs? For a code with no bugs, this is equal to 1 since it will pass all tests.

$P(X)$ is a bit trickier: The event $X$ can be divided into two possibilities, event $X$ occurring even though are code indeed has bugs (denoted ~ $A$, spoken not $A$), or even $X$ without bugs ($A$). $P(X)$ can be represented as:

$$P(X) = P(X and A) + P(X and ~ A)$$

$$=P(X|A)(P(A) + P(X | ~A)P(~A)$$ $$=P(X|A)p + P(X | ~A)(1 - p)$$

We have computed $P(X|A) above, yet not $P(X|~A), which is subjective: out code can pass tests but still have a bug in it, though the probability there is a bug present is reduced. This is dependent on the number of tests performed, the degree of complication in the tests, ect. We can be conservative and assign $P(X|~A) = 0.5. Then

$$P(A|X) = \frac{1p}{1p+0.5(1-p)}$$

$$= \frac{2p}{1+p}$$

This is the posterior probability, What does it look like as a function of our prior?

In [172]:
p = np.linspace(0, 1, 50)
plt.plot(p, 2 * p / (1 + p))
plt.scatter(0.2, (2*0.2) / 1.2, s=140)
plt.xlim(0, 1)
plt.ylim(0, 1)
plt.xlabel('Prior, $P(A) = p$')
plt.ylabel('Posterior, $P(A|X)$, with $P(A) = p$')
plt.title('Are there bugs in my code?')
<matplotlib.text.Text at 0x14c5a6f50>

Here we see the gains in posterior probability when our prior probability, $p$, is low. If we set a prior probability at 0.2, that means there is a 20% chance that the written code is bug-free. Though, it would be more realistic to scale this value by the size of the code. Therefore, the updated belief (posterior probability after running the debugging script) is that my code is bug-free 0.33 or 33% of the time.

We can chart this another way, since our prior $p$, is the probability that there are no bugs, $ 1 - p$ is the probability that there are bugs. Similarly, our posterior probability $P(A|X)$ is the probability that there is no bug given we saw all the tests pass, thus $1 - P(A|X)$ is the probability that there is a bug given all tests passed.

In [183]:
color = ["#9b59b6", "#3498db"]
prior = [0.2, 0.8]
posterior = [1. /3, 2./3][0, 0.7], prior, alpha=0.7, width=0.25,
        label='prior distribution', color=color[0])[0 + 0.25, 0.7 + 0.25], posterior, alpha=0.7,
        width=0.25, label='posterior distribution',

plt.xticks([0.2, 0.95], ['Bugs Absent', 'Bugs Present'])
plt.legend(loc='upper left')
<matplotlib.legend.Legend at 0x14d7bd5d0>

Note that after we observe $X$ occur, the probability of bugs being absernt increased. By increasing the number of tests, we can approach confidence (probability 1) that there are no bugs present.

In [191]:
from IPython.core.display import HTML

def css_styling():
    styles = open("/users/ryankelly/desktop/custom_notebook2.css", "r").read()

    return HTML(styles)
In [3]:
def social():
    code = """
    <a style='float:left; margin-right:5px;' href="" class="twitter-share-button" data-text="Check this out" data-via="Ryanmdk">Tweet</a>
<script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);;js.src=p+'://';fjs.parentNode.insertBefore(js,fjs);}}(document, 'script', 'twitter-wjs');</script>
    <a style='float:left; margin-right:5px;' href="" class="twitter-follow-button" data-show-count="false">Follow @Ryanmdk</a>
<script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);;js.src=p+'://';fjs.parentNode.insertBefore(js,fjs);}}(document, 'script', 'twitter-wjs');</script>
    <a style='float:left; margin-right:5px;'target='_parent' href="" onclick="window.location = '' + encodeURIComponent(window.location); return false"> <img src="" alt="submit to reddit" border="0" /> </a>
<script src="//" type="text/javascript">
  lang: en_US
<script type="IN/Share"></script>
    return HTML(code)