Bayesian Analysis with Dummy Data

The material for this notebook is provided via this free online textbook :

I am writing this notebook to document my learning, and hopefully to help others learn machine learning. You can think of it as my personal lecture notes. I would love suggestions / corrections / feedback for these notebooks.

Visit my webpage for more.

Email me: [email protected]

In [7]:
submit to reddit


This notebook expands on the PyMC syntax and design patterns, as well as ways to think about a model system from a Bayesian perspective.

Parent and Child relationships

To assist with describing Bayesian relationships, and to be consistent with PyMC's documentation, we introduce parent and child variables.

  • parent variables are variables that influence another variable.
  • child variables are variables that are affected by other variables, i.e. are the subject of parent variable.

A variable can be both a parent and child. For example, consider the PyMC code below.

In [9]:
import pymc as pm

param = pm.Exponential('poisson_param', 1)
data_generator = pm.Poisson('data_generator', param)
data_plus_one = data_generator + 1

The parameter variable controls the parameter of data_generator, hence influences its values. The former is a parent of the latter. By symmetry, data_generator is a child of parameter.

Likewise, data_generator is a parent to the variable data_plus_one, which makes data_generator both a child and a parent variable. Although it does not look like it, data_plus_one should be treated as a PyMC variable, since it is a function of another PyMC variable, hence is a child to data_generator.

This nomenclature is designed to help describe relationships in PyMC modeling. You can access the children and parent attributes by calling .children or parents. A variable can also have more than one child or parent.

In [16]:
print param.children, '\n'
print data_generator.parents, '\n'
print data_generator.children, '\n'
set([<pymc.distributions.Poisson 'data_generator' at 0x10c2be210>]) 

{'mu': <pymc.distributions.Exponential 'poisson_param' at 0x105b1da50>} 

set([<pymc.PyMCObjects.Deterministic '(data_generator_add_1)' at 0x10c2be290>]) 

PyMC Variables

All PyMC variables also have a value attribute. This method produces the current (possibly random) interval value of the variable. If the variable is a child, its value changes given the variables parents' value. Using the same variables as above:

In [18]:
print 'parameter.value = ', param.value
print 'data_generator.value = ', data_generator.value
print 'data_plus_one = ', data_plus_one.value
parameter.value =  0.204992133473
data_generator.value =  0
data_plus_one =  1

PyMC is concerned with two types of programming variables stochastic and deterministic.

  • Stochastic variables are variables that are not deterministic, even if you know the values of the variables' parents, they are random within a specified domain. Examples of this category are Poisson, DiscreteUniform, and Exponential.

  • Deterministic variables are variables that are not random if the variables parents were known. If you know all variable $Z$'s parent variables, you could determine $Z$'s value.

Explained further below...

Initializing Stochastic Variables

Initializing a stochastic variable requires a name argument, plus additional parameters that are class specific. For example:

some_variable = pm.DiscreteUniform('discrete_uni_var', 0, 4)

where 0, and 4 are the DiscreteUniform-specific lower and upper bound on the random variable. The PyMC docs contain the specific parameters for stochastic variables, or simply use object??, for example pm.DiscreteUniform?? if you are using IPython.

The name attribute is used to retrieve the posterior distribution later in the analysis, so it is best to use a descriptive name.

For multivariate problems, rather than creating a Python array of stochastic variables, addressing the size argument in the call to a Stochastic variable creates a multivariate array of independent stochastic variables. The array behavs like a Numpy array when used like one, and references to its value attribute will return Numpy arrays.

The size argument also solves the annoying case where you may have many variables $\beta _i \ , i = 1, ..., N$ you wish to model. Instead of creating arbitrary names and variables for each one.

$$beta2 = pm.Uniform('beta1', 0, 1)$$ $$beta2 = pm.Uniform('beta2', 0, 1)$$

We can instead wrap them into a single variable:

$$ betas = pm.Uniform('betas', 0, 1, size=N)

Calling random()

We can also call on a stochastic variabl's random() model, which (given the parent values) will generate a new, random value. Below we demonstrate this.

In [22]:
lambda_1 = pm.Exponential('lambda1', 1) # prior on first behaviour
lambda_2 = pm.Exponential('lambda2', 1) # prior on second behaviour
tau = pm.DiscreteUniform('tau', 0, 10) # prior on behaviour change)

print "lambda_1.value = %.3f" % lambda_1.value
print "lambda_2.value = %.3f" % lambda_2.value
print "tau.value = %.3f" % tau.value

# After calling random() the new value is stored in the variables
# 'value' attribute
lambda_1.random(), lambda_2.random(), tau.random()

print "lambda_1.value = %.3f" % lambda_1.value
print "lambda_2.value = %.3f" % lambda_2.value
print "tau.value = %.3f" % tau.value
lambda_1.value = 0.371
lambda_2.value = 1.822
tau.value = 9.000
lambda_1.value = 1.013
lambda_2.value = 0.594
tau.value = 5.000

Warning: Don't update stochastic variables' values in-place.

From the PyMC docs:

Stochastic objects' values should not be updated in place. This confuses PyMC's caching scheme. The only way a stochastic variables value should be updated is using statements of the form:

    A.value = new_value

The following are in-place updates and should never be used:

    A.value += 3
    A.value[2, 1] = 5
    A.value.attributes = new_attribute_value

Deterministic variables

Since most variables you will be modeling are stichastic, we can simply distinguish deterministic variables with a pymc.deterministic wrapper. If you dont know what a wrapper or decorator is in python, it doesn't matter, just add the decorator before the variable declaration.

def some_deterministic_var(v1=v2,):
    # meat and potatoes here

For all purposes, we can treat the object some_deterministic_var as a variable and not a Python function.

Prepending with the wrapper is the easiest, but not the only way to create deterministic variables: simple options, like, addition, expoentials etc. implicitly create deterministic variables. For example, the following returns a deterministic variable:

In [23]:
type(lambda_1 + lambda_2)

The use of the deterministic wrapper was seen in the previous chapter's text-messages example. Recall the model for $\lambda$ looked like:

$$ \lambda=\begin{cases} \ \lambda _1 \ \ if \ t < \tau \\ \ \lambda _2 \ \ if \ t < \tau \\ \end{cases} $$

And in PyMC code:

In [24]:
import numpy as np

n_data = 5

def lambda_(tau=tau, lambda_1=lambda_1, lambda_2=lambda_2):
    out = np.zeros(n_data)
    out[:tau] = lambda_1 # lambda before tau
    out[tau:] = lambda_2 # lambda after tau
    return out

If we know $\tau$, $\lambda _1$, $\lambda _2$, then $\lambda$is known completely, hence it is deterministic.

Inside the deterministic decorator, the Stochastic variables passed in behave like scalars or numpy arrays (if multivariate), and not like Stochastic variables. For example: running the following:

def seom_deterministic(stoch=some_stochastic_var):
    return stoch.value**2

will return an AttributeError detailing that stoch does not have a value attribute. It simply needs to be stoch**2. During the learning phase, it's the variable's value that is repeatedly passed in, not the actual variable.

Note that in the creation of the deterministic function we added defaults to each variable used in the function. This is a necessary step, and all variables must have default values.

Including observations in the model

At this point, it may not look like it, but we have fully specified out priors. For example, we can ask and answer questions like 'What does my prior distribution of $\lambda _1$ look like?"

In [27]:
%matplotlib inline
from matplotlib import pyplot as plt
import seaborn as sns

sns.set(context = 'poster', style = 'whitegrid')

samples = [lambda_1.random() for i in range(20000)]
plt.hist(samples, bins=70, normed=True, histtype='stepfilled', alpha=0.8)
plt.title('Prior distribution for $\lambda _1$')
plt.xlim(0, 8)
(0, 8)

To frame this in the notation of the first chapter , though this is a slight abuse of notation, we have specified $P(A)$. Our next goal is to include/evidence/observations $X$ into our model.

PyMC stochastic variables have a keyword argument observed which accepts a boolean (False by default). The keyword observed has a simple role: fix the variable's current value (make the value immutable). We have to specify an initial value in the variable's creation, equal to the observations we wish to include, typically an array (should be Numpy array for speed). For example:

In [28]:
data = np.array([10, 5])
fixed_variable = pm.Poisson('fxd', 1, value=data, observed=True)
print 'value: ', fixed_variable.value
print 'calling .random()'
print 'value: ', fixed_variable.value
value:  [10  5]
calling .random()
value:  [10  5]

This is how we include data into the models: initializing a stochastic variable to have a fixed balue.

To complete the text message example from the previous lesson, we fix the PyMC variable observations to the observed dataset.

In [29]:
# some fake data
data = np.array([10, 25, 15, 20, 35])
obs = pm.Poisson('obs', lambda_, value=data, observed=True)
array([10, 25, 15, 20, 35])


We wrap all the created variables into a pm.Model class. With this class, we can analyze the variables as a single unit. This is an options step, as the fitting algorithms can be sent and array of the variables, rather than a Model class. I may or may not use this class in future examples :P.

In [31]:
model = pm.Model([obs, lambda_, lambda_1, lambda_2, tau])

Modeling Approaches

A good starting though to Bayesian modeling is to think about how your data might have been generated. Position yourself in an omniscient position, and try to imagine how you would recreate the dataset.

In the last three notebooks we investigated text message data. We begin by asking how our observations may have been generated.

  1. We started by thinking 'what is the best random variable to describe this count data?' A Poisson random variable is a good candidate because it can represent count data. SO we model the number of messages received as sampled from a Poisson distribution.

  2. So, assuming the texts are Poisson-distributed, what do we need for the Poisson distribution? We need to satisfy the parameter $\lambda$.

  3. Do we know $\lambda$? Nope, but in this example we suspect there are two $\lambda$ values, one for the earlier behaviour and one for the latter behaviour. We don't quite know when the behaviour switches though, but we call the switchpoint $\tau$.

  4. What is a good distribution for the two $\lambda$s? The exponential is good, as it assigns probabilities to positive real numbers. The exponential distribution has a parameter too, call it $\alpha$.

  5. Do we know what $\alpha$ might be? Nope. At this point we could continue and assign a distribution to $\alpha$, but it is better to stop once we reach a set level of ignorance: whereas we have a prior belief about $\lambda$; like that it 'probably changes over time', 'it's likely between 10 and 30', ect. We dont really have any strong beliefs about $\alpha$, so its best to stop here. What is a good value for $\alpha$ then? We think that the $\lambda$s are between 10 - 30, so if we set $\alpha$ really low (which corresponds to a larger probability on high values) we are not reflecting out prior well. Similar, a too-high apha misses out prior belief as well. A good idea for $\alpha$ is to reflect our belief is to set the value so that the mean of $\lambda$, given $\alpha$, is equal to our observed mean. This was shown in the last chapter.

  6. We have no expert opinion of when $\tau$ might have occurred. So we will suppose $\tau$ is from a discrete uniform distribution over the entire timespan.

Same story; different ending.

Interestingly, we can create new datasets by retelling the story. For example, if we reverse the above steps, we can simulate a possible realization of the dataset.

  1. Specify when the users behaviour switches by sample from DiscreteUniform(0, 80):
In [36]:
tau = pm.rdiscrete_uniform(0, 80)
  • Draw $\lambda _1$ and $\lambda _2$ from an $Exp(\alpha)$ distribution:
In [37]:
alpha = 1. / 20.
lambda_1, lambda_2 = pm.rexponential(alpha, 2)
print lambda_1, lambda_2
26.8466829876 5.18750926671
  • For days before $\tau$, represent the users received texts count by sampling from $Poi(\lambda _1)$, and sample from $Poi(\lambda _2)$ for days after $\tau$. For example:
In [39]:
datau = np.r_[pm.rpoisson(lambda_1, tau), pm.rpoisson(lambda_2, 80 - tau)]
In [43]:, data, color='#3498db') - 1, data[tau - 1], color='r', label='user behaviour changed')
plt.xlabel('Time (days)')
plt.ylabel('count of texts received')
plt.title('Artificial dataset')
plt.xlim(0, 80)

It is okay that our fictional dataset does not look like our observed dataset: the probability is incredibly small that it would. PyMC's engine is designed to find good parameters, $\lambda _i$, $\tau$, that maximize this probability.

The ability to generate artificial datasets is an interesting side effect our modeling, and we will see that this ability is a very important method of Bayesian inference. We produce a few more datasets below.

In [45]:
def plot_artificial_sms_dataset():
    tau = pm.rdiscrete_uniform(0, 80)
    alpha = 1. / 20.
    lambda_1, lambda_2 = pm.rexponential(alpha, 2)
    data = np.r_[pm.rpoisson(lambda_1, tau), pm.rpoisson(lambda_2, 80 - tau)], data, color="#348ABD") - 1, data[tau - 1], color="r", label="user behaviour changed")
    plt.xlim(0, 80)

plt.title("More example of artificial datasets")
for i in range(4):
    plt.subplot(4, 1, i)

Later we will see how we use this to make predictions and test the appropriateness of out models.

Example: Bayesian A/B testing

A/B testing is a statistical design pattern for determining the difference of effectiveness between two different treatments. For example, a pharmaceutical company is interested in the effectiveness of drug A vs drug B. The company will test drug A on some fraction of their trials, and drug B on the other fraction (the fraction is often 1.2, but we will relax this assumption). After performing enough trials, the in-house statisticians sift through the data to determine which drug yielded better results.

Similarly, front-end web developers are interested in which design of their website yields more sales or some other metric of interest. They will route some fraction of visitors to site A, and the other fraction to site B, and record if the visit yielded a sale or not.

Often, the post-experiment analysis is done using something called a hypothesis test, like difference of means or difference of proportions tests. These are techniques taught in any basic statistics course. These work just fine, but the Bayesian approach is much more natural and flexible.

A Simple Case

We continue the web-dev example. For the moment we will focus on the analysis of site A only. Assume that there is some true 0 < $p _A$ < 1 probability that users who, upon shown site A, eventually purchase from the site. This is the true effectiveness of site A. Currently, this quantity is unknown to us.

Suppose site A was shown to $N$ people, and $n$ people purchased from the site. One might conclude that $P _A = \frac{n}{N}$. UNfortunately, the observed frequency $\frac{n}{N}$ does not necessarily equal $p _A$, there is a difference between the observed frequency and the true frequency of an event. The true frequency can be interpreted as the probability of an event occurring. For example, the true frequency of rolling a 1 on a 6-sided die is $\frac{1}{6}$. Knowing the frequency of events like:

  • fraction of users who make purchases.
  • frequency of social attributes.
  • percent of internet users with cats, etc.

are common requests we ask of reality. Unfortunately, the true frequency is hidden from us, and we have to infer it from observed data.

The observed frequency is then the frequency we observe: rolling a die 100 times you may observe 20 rolls of 1. The observed frequency then differs from the true frequency of $\frac{1}{6}$. We can use Bayesian statistics to infer probable values of the true frequency using an appropriate prior and observed data.

With respect to the A/B testing example, we are interested in using what we know, $N$ (the total trials administered) and $n$ (the number of conversions), to estimate what $p _A$, the true frequency of buyers might be.

To set up a Bayesian model, we need to assign prior distributions to our unknown quantities; a priori. What do we think $p _A$ might be? Since we have no clue what this quantity is right now, we can assume $p _A$ is uniform over [0,1]:

In [47]:
import pymc as pm

# The parameters are the bounds of the Uniform.
p = pm.Uniform('p', lower=0, upper=1)

Had we had stronger beliefs, we could have expressed them in the prior above.

For this example, consider $p _A$ = 0.05, and $N$ = 1500 users shown site A, and we will simulate whether the user made a purchase or not. To simulate this from $N$ trials, we will use a Bernoulli distribution: if $X \sim Ber(p)$, then $X$ is 1 with probability $p$ and 0 with probability 1 - $p$. Though in practice we do not know $p _A$, but we will use it here to simulate the data.

In [55]:
# set constants
p_true = 0.05 # remember this is normally unknown
N = 2500

# sample N Bernoulli random variables from Ber(0.05).
# each random variable has a 0.05 chance of being a 1.
# this is the data-generation step
occurrences = pm.rbernoulli(p_true, N)

print occurrences, occurrences.sum()
[False False False ..., False False False] 132

The observed frequency is:

In [56]:
# Occurrences.mean is equal to n/N
print 'What is the observed frequency in Group A? {}'.format(
print 'Does this equal the true frequency? {}'.format(
        occurrences.mean() == p_true)
What is the observed frequency in Group A? 0.0528
Does this equal the true frequency? False

Now we combine the observations into the PyMC observed variable, and run our inference algorithm:

In [57]:
# include the observations, which are Bernoulli
obs = pm.Bernoulli('obs', p, value=occurrences, observed=True)

# To be explained in later lessons
mcmc = pm.MCMC([p, obs])
mcmc.sample(18000, 1000)
 [-----------------100%-----------------] 18000 of 18000 complete in 4.8 sec

Now plot the posterior distribution of the unknown $p _A$ below:

In [58]:
plt.title('Posterior distribution of $p _A$, the true effectiveness of site A')
plt.vlines(p_true, 0, 90, linestyle='--', label='true $p_A$ (unknown)')
plt.hist(mcmc.trace('p')[:], bins=25, histtype='stepfilled', normed=True)
<matplotlib.legend.Legend at 0x110acbe50>

Our posteriod distribution is very close to the true value of $p_A$, and is a fairly tight distribution, which is a measure of how certain we are given our observations. By increasing $N$ we will find that we can be more and more certain.

A and B Together

A similar analysis can be done for site B's response data to determine the analogous $p_B$. But what we are really interested in is the difference between $p_A$ and $p_B$. Let's infer $p_A$, $p_B$, and delta, all at once. We can do this using PyMC's deterministic variables. We will assume for now that $p_B$ = 0.04, so delta = 0.01, and $N_B$ = 750 (much less than $N_A$) and we also simulate site B's data in the same way as site A.

In [74]:
# These two quantities are unknown to use
# We just use them to simulate test data

true_p_A = 0.05
true_p_B = 0.04

# Unequal sample sizes are no problem with Bayesian analysis
N_A = 1700
N_B = 950

# Generate observations
obs_A = pm.rbernoulli(true_p_A, N_A)
obs_B = pm.rbernoulli(true_p_B, N_B)
print 'Obs from Site A: ', obs_A[:30].astype(int), '...'
print 'Obs from Site B: ', obs_B[:30].astype(int), '...'
Obs from Site A:  [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0] ...
Obs from Site B:  [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0] ...
In [75]:
print obs_A.mean()
print obs_B.mean()
In [76]:
# Set up pymc model assuming uniform priors for p_A and p_B.
p_A = pm.Uniform('p_A', 0, 1)
p_B = pm.Uniform('p_B', 0, 1)

# Define the deterministic delta function.
# This is our unknown of interest.
def delta(p_A=p_A, p_B=p_B):
    return p_A - p_B

# Set of observations
obs_A = pm.Bernoulli('obs_A', p_A, value=obs_A, observed=True)
obs_b = pm.Bernoulli('obs_B', p_B, value=obs_B, observed=True)

# Explained later
mcmc = pm.MCMC([p_A, p_B, delta, obs_A, obs_B])
mcmc.sample(20000, 1000)
 [-----------------100%-----------------] 20000 of 20000 complete in 5.5 sec

Plot the posterior distributions of the three unknowns

In [77]:
p_A_samples = mcmc.trace('p_A')[:]
p_B_samples = mcmc.trace('p_B')[:]
delta_samples = mcmc.trace('delta')[:]
In [79]:
ax = plt.subplot(311)

plt.xlim(0, 0.1)
plt.hist(p_A_samples, histtype='stepfilled', bins=25, alpha=0.80,
         label='posterior of $p_A$', color='red', normed=True)
plt.vlines(true_p_A, 0, 80, linestyle='--', label='true $p_A$ (unknown)')
plt.legend(loc='upper right')
plt.title('Posterior distributions of $p_A$, $p_B$, and delta unknowns')

ax = plt.subplot(312)

plt.xlim(0, 0.1)
plt.hist(p_B_samples, histtype='stepfilled', bins=25, alpha=0.80,
         label='posterior of $p_B$', color='#467821', normed=True)
plt.vlines(true_p_B, 0, 80, linestyle='--', label='true $p_B$ (unknown)')
plt.legend(loc='upper right')

ax = plt.subplot(313)

plt.hist(delta_samples, histtype='stepfilled', bins=30, alpha=0.80,
         label='posterior of delta', color = '#7A68A6', normed=True)
plt.vlines(true_p_A - true_p_B, 0, 60, linestyle='--', 
         label='true delta (unknown)')
plt.vlines(0, 0, 60, color='black', alpha=0.2)
plt.legend(loc='upper right');

Notice that as a result of N_B < N_A, out posterior distribution of $p_B$ is fatter, implying we are less certain about the true value than we are of $p_A$.

With respect to the posterior distribution of delta, we can see that the majority of the distribution is on almost exactly delta = 0.01. If for example deltas distribution was above 0.01, we could imply that Site A's response is likely better than site B's response. The probability that this inference in incorrect can be computed:

In [82]:
# Count the number of samples less than 0, i.e. the area under the curve
# before 0, represent the probability that site A is worse than site B.

print 'Probability site A is WORSE than site B: {}'.format(
        (delta_samples < 0).mean())
print 'Probability site A is BETTER than site B: {}'.format(
        (delta_samples > 0).mean())
Probability site A is WORSE than site B: 0.100263157895
Probability site A is BETTER than site B: 0.899736842105

If this probability is too high for comfortable decision-making, we can perform more trials on site B (since site B has less samples to begin with, each additional data point for site B contributes more inferential power than each additional data point for site A).

Try playing with the parameters true_p_A, true_p_B, N_A, and N_B to see what the posterior of delta looks like. Notice in all this, the difference in samples sizes of site A and site B was never mentioned: it naturally fits into Bayesian analysis.

This approach is generally more natural than hypothesis testing, in later notebooks we will expand this method to dynamically adjust for bad sites, and improve the speed of the analysis.

In [8]:
from IPython.core.display import HTML

def css_styling():
    styles = open("/users/ryankelly/desktop/custom_notebook2.css", "r").read()

    return HTML(styles)
In [6]:
def social():
    code = """
    <a style='float:left; margin-right:5px;' href="" class="twitter-share-button" data-text="Check this out" data-via="Ryanmdk">Tweet</a>
<script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);;js.src=p+'://';fjs.parentNode.insertBefore(js,fjs);}}(document, 'script', 'twitter-wjs');</script>
    <a style='float:left; margin-right:5px;' href="" class="twitter-follow-button" data-show-count="false">Follow @Ryanmdk</a>
<script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);;js.src=p+'://';fjs.parentNode.insertBefore(js,fjs);}}(document, 'script', 'twitter-wjs');</script>
    <a style='float:left; margin-right:5px;'target='_parent' href="" onclick="window.location = '' + encodeURIComponent(window.location); return false"> <img src="" alt="submit to reddit" border="0" /> </a>
<script src="//" type="text/javascript">
  lang: en_US
<script type="IN/Share"></script>
    return HTML(code)