Review of Probability Distributions: Bayesian Context

The material for this notebook is provided via this free online textbook :

I am writing this notebook to document my learning, and hopefully to help others learn machine learning. You can think of it as my personal lecture notes. I would love suggestions / corrections / feedback for these notebooks.

Visit my webpage for more.

Email me: [email protected]

I'd love for you to share if you liked this post.

In [4]:
submit to reddit

This notebook covers or includes:

  • Probability Distribution Review

Probability Distributions

This notebook is meant as a review to prepare a reader, and myself for more advanced Bayesian modeling techniques.

Let's recall what a probability distribution is: Let $Z$ be some random variable. Then associated with $Z$ is a probability distribution function that assigns probabilities to the different outcomes $Z$ can take. Graphically, a probability distribution is a curve where the probability of an outcome is proportional to the height of the curve.

Random variables can be divided into three classifications:

  • $Z$ is discrete: Discrete random variables may only assume values on a specified list. Things like populations, movie ratings, and number of votes are discrete random variables.

  • $Z$ is continuous: Continuous random variable can take on arbitrarily exact values. For example, temperature, speed, time, and color are all modeled as continuous variables because you can progressively make the values more and more precise.

  • $Z$ is mixed: Mixed random variables assign probabilities to both discrete and continuous random variables as a combination of the two above categories.

Discrete Case

If $Z$ is discrete, then is distribution is called a probability mass function, which measures the probability $Z$ on the value of $K$, denoted $P(Z=k)$. Note that the probability mass function completely describes the random variable $Z$, that is, if we know the mass function, we know how $Z$ should behave. There are popular probability mass functions that consistently appear: we will introduce them as needed. Let's introduce the first one: we say $Z$ is Poisson-distributed if

$$P(Z=k)=\frac{\lambda^k e^{-k}}{k!}, \ k=0,1,2,...$$

$\lambda$ is called a parameter of the distribution, and it controls the distribution's shape. For the Poisson distribution, $\lambda$ can be any positive number. By increasing $\lambda$, we add more probability to larger values, and conversely we can apply the opposite effect. Formally, $\lambda$ describes the intensity of the Poisson distribution.

Unlike $\lambda$, which can be any positive number, the value $k$ in the formula must be a non-negative integer. This is important, since if you want to model a population you could not make sense of populations with 4.99 or 9.11 individuals.

If a random variable $Z$ has a Poisson mass distribution, we denote it as: $Z$ ~ $Poi(\lambda)$

Once useful property of the Poisson distribution is that its expected value is equal to its parameter


This will come up often, so it is necessary to remember. Below is a plot of the probability mass distribution for different $\lambda$ values. The first thing to notice is that by increasing $\lambda$, we add more probability of lager values occurring. Secondly, though the graph ends at 15, the distributions do not, they assign positive probability to every non-negative integer.

In [31]:
import scipy.stats as stats
import numpy as np

%matplotlib inline
from matplotlib import pyplot as plt

import seaborn as sns
# Some initial styling
rc={'axes.labelsize': 16.0, 'font.size': 16, 
    'legend.fontsize': 16.0, 'axes.titlesize': 18,
    'xtick.labelsize': 15.0, 'ytick.labelsize': 15.0}

sns.set(context = 'poster', style = 'whitegrid', rc=rc)
In [32]:
a = np.arange(16)
poi = stats.poisson
lambda_ = [1.5, 4.25]
colors = ["#9b59b6", "#3498db"], poi.pmf(a, lambda_[0]), color=colors[0],
        label='$\lambda = %.1f$' % lambda_[0], alpha=0.6), poi.pmf(a, lambda_[1]), color=colors[1],
         label='$\lambda = %.1f$' % lambda_[1], alpha=0.6)

plt.title('Probability mass funtion of a Poisson random variable;\n \
differing $\lambda$ values')
<matplotlib.text.Text at 0x10f2425d0>

Continuous Case

Instead of a probability mass function, a continuous random variable has a probability density function. Don't be fooled, the density function and the mass function are very different creatures. An example of a continuous random variable is a random variable with exponetial density. The density function for an exponential random variable looks like this:

$$f_Z(z|\lambda) = \lambda \exp^{-\lambda z},\ \ z >= 0$$

Like a Poisson random variable, an exponential random variable can take on only non-negative values. But unlike a Poisson variable, the exponential can take on any non-negative values, including non-negative integral values such as 4.103 or 5.12401. This property makes a poor choice for count data, which must be and integer, but a great choice for time data, temperature data (measured in Kelvins), or any other precise and positive variable. The graph below shows two probability density functions with different $\lambda$ values.

When a random variable $Z$ has an exponential distribution with parameter $\lambda$, we say $Z$ is exponential and write $Z$ ~ Exp($\lambda$).

Given a specific $\lambda$, the expect value of an exponential random variable is equal to the inverse of $\lambda$, that is:

$$E[Z|\lambda] = \frac{1}{\lambda}$$

In [44]:
a = np.linspace(0, 4, 100)
expo = stats.expon
lambda_ = [1, 0.5]

for l, c in zip(lambda_, colors):
    plt.plot(a, expo.pdf(a, scale=1. /l), color=c,
             label='$\lambda = %.1f$' % l,
    plt.fill_between(a, expo.pdf(a, scale=1./l), color=c, 
plt.ylabel('PDF at $z$')
plt.ylim(0, 1.2)
plt.title('Probability density function of an Expoential random variable: \n \
differing $\lambda$')
<matplotlib.text.Text at 0x1115338d0>

But what is $\lambda$ ?

This question is what motivates statistics. In the real world, $\lambda$ is hidden from us. We see only $Z$, and must go backwards to try and determine $\lambda$. The problem is difficult because there is no one-toone mapping from $Z$ to $\lambda$. Many different methods have been created to solve the problem of estimating $\lambda$, but since it is never actually observed, no one can say for certain which method is best!

Bayesian inference is concerned with beliefs about what $\lambda$ might be. Rather than try to guess exactly, we can only talk about what $\lambda$ is likely to be by assigning a probability distribution to $\lambda$.

This might seem odd at first, after all, $\lambda$ is fixed, but not necessarily random. How can we assign probabilities to values of a non-random variable. Recall that under Bayesian theory, we can assign probabilities if we interpret them as beliefs, contrary to the frequentist theory.

Example: Inferring behaviour from text messages

Here we model a more interesting example, from data that records the rate at which a user sends and receives text messages.

The data are daily text message counts from a single user over time.

In [87]:
import pandas as pd
count_data = pd.read_csv('',
    header=None, names=['msgs'])
N = len(count_data), count_data['msgs'], color="#9b59b6", alpha=0.8)
plt.xlabel('Time (days)')
plt.ylabel('count of msgs received')
plt.title('Did the texting habits change over time?')
plt.xlim(0, n)
(0, 74)

Before we start modeling, what can you see just by looking at the chart above? Is there a change in behaviour going on here?

How can we start to model this? As we have seen, a Poisson random variable is a very appropriate model for this type of count data. Denoting day $i$'s test message count by $C_i$

$$C_i \sim Poisson(\lambda)$$

We are not sure what the value of $\lambda$ parameter really is, however. Looking at the chart above it appears the rate may become higher later in the observation period (recall that higher $\lambda$ assigns more probability to larger outcomes).

How would we represent this mathematically? Let's assume that on some day during the observation period (call it $\tau$), the parameter $\lambda$ suddenly jumps to a higher value. We we really have two $\lambda$ parameters: one for the period before $\tau$, and one for the rest of the observation period. In the literature, a sudden transition would be called a switchpoint:

$$ \lambda=\begin{cases} \ \lambda _1 \ \ if \ t < \tau \\ \ \lambda _2 \ \ if \ t < \tau \\ \end{cases} $$

if, in reality, no sudden change occurred and indeed $\lambda _1 = \lambda _2$, then the $\lambda$s posterior distributions should look about equal.

We are interesting in inferring the unknown $\lambda$s. To use Bayesian inference, we need to assign prior probabilities to the different possible values of $\lambda$. What we be good prior probability distributions for them? Recall that $\lambda$ can be any positive number. As we saw earlier, the expoential distribution provides a continuous density function for positive numbers, so it might be a good choice for modeling $\lambda _i$. Yet, recall that the exponential distribution takes a parameter of its own, so we also need to include that variable in the model. The's call that parameter $\alpha$.

$$ \begin{align} \lambda _1 \ \sim \ Exp(\alpha) \\ \lambda _2 \ \sim \ Exp(\alpha) \\ \end{align} $$

$\alpha$ is called a hyper-parameter or parent variable. It is a parameter that influences other parameters. Our initial guess at $\alpha$ does not influence the model too strongly, so we have some flexibility in our choice. A good rule of thing is to set the exponential parameter equal to the inverse of the average count data. Since we are modeling $\lambda$ using an exponential distribution, we can expect the value identity shown earlier to get:

$$\frac{1}{N}\sum\limits_{i=0}^n C_i ≈ E[ \ λ \ | \ α]=\frac{1}{α}$$

An alternative, would be to have two prior: one for each $\lambda$. Creating two exponential distributions with different $\alpha$ values reflect our prior belief that the rate changed at some point during the observations.

What about $\tau$? Because of the noisiness in the data, it is hard to pick out a prior when $\tau$ might have occurred. Instead, we can assign a uniform prior belief to every possible day, equivalent to saying:

$$ \begin{align} \tau \ \sim \ DiscreteUniform(1, 70) \\ -> P(\tau = k) = \frac{1}{70} \\ \end{align} $$

So after all this, what does our overall prior distribution for the unknown variables look like? Frankly it doesn't matter. What we should understand is that it's an ugly complicated mess, and it only gets uglier. Regardless, all we really care about is the posterior distribution.

Next we will use PyMC, a python library for Bayesian analysis. Click the link below to follow along.

In [6]:
from IPython.core.display import HTML

def css_styling():
    styles = open("/users/ryankelly/desktop/custom_notebook2.css", "r").read()

    return HTML(styles)
In [5]:
def social():
    code = """
    <a style='float:left; margin-right:5px;' href="" class="twitter-share-button" data-text="Check this out" data-via="Ryanmdk">Tweet</a>
<script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);;js.src=p+'://';fjs.parentNode.insertBefore(js,fjs);}}(document, 'script', 'twitter-wjs');</script>
    <a style='float:left; margin-right:5px;' href="" class="twitter-follow-button" data-show-count="false">Follow @Ryanmdk</a>
<script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);;js.src=p+'://';fjs.parentNode.insertBefore(js,fjs);}}(document, 'script', 'twitter-wjs');</script>
    <a style='float:left; margin-right:5px;'target='_parent' href="" onclick="window.location = '' + encodeURIComponent(window.location); return false"> <img src="" alt="submit to reddit" border="0" /> </a>
<script src="//" type="text/javascript">
  lang: en_US
<script type="IN/Share"></script>
    return HTML(code)