pomegranate: fast and flexible probabilistic modelling

Author: Jacob Schreiber
Contact: [email protected]

In [1]:
%pylab inline
import seaborn, pandas, itertools, time, random
seaborn.set_style('whitegrid')
from pomegranate import *
numpy.set_printoptions(suppress=True)

random.seed(0)
numpy.random.seed(0)

import matplotlib
matplotlib.rcParams['xtick.labelsize'] = 14
matplotlib.rcParams['ytick.labelsize'] = 14
matplotlib.rcParams['axes.labelsize'] = 16
matplotlib.rcParams['axes.titlesize'] = 16
matplotlib.rcParams['figure.figsize'] = (10, 6)
matplotlib.rcParams['legend.fontsize'] = 16
Populating the interactive namespace from numpy and matplotlib

pomegranate is a probabilistic modelling library which focuses on combining ease of use with a speedy cython backend. Its precursor, yahmm, began while I was an undergraduate working in the nanopore group at UC Santa Cruz, which was developing a new DNA sequencing tool. Since the tool was in its beginning stages, sequence analysis was primarily done by hand, which was extremely slow and could suffer significantly from human bias. Through the use of hidden Markov models implemented in yahmm this process was automated, speeding up the analysis by several orders of magnitude while removing human bias and cutting out around 80% of the error (increasing accuracy from ~90% to ~98%).

pomegranate began when I was in graduate school at the University of Washington and taking machine learning courses. Homework assignments frequently required the implementation of a model, and I noticed that many implementations simply involved rearranging the components of yahmm into other useful models, such as Naive Bayes, mixture models, and Bayesian networks. Thus, pomegranate was born.

While efficient, pomegranate didn't fully utilize the benefits that cython can provide until last summer, when I was doing a summer internship working on sklearn. During the internship I was working on speeding up decision tree regressors, and a mathematical trick I proposed was able to speed them up by as much as 4x in some cases. This taught me how to write bare-bones and parallelizable cython code and to use computational tricks to speed up seemingly-complex code. I've taken this knowledge to significantly improve the internals of pomegranate and make it extremely fast.

The main models which pomegranate implements and I will be discussing in this talk are

  1. Probability Distributions
  2. Naive Bayes
  3. Markov Chains
  4. General Mixture Models
  5. Hidden Markov Models
  6. Bayesian Networks

One of the core tenets of pomegranate is that every model is really a probability distribution. We're all familiar with simple distributions such as normal or uniform distributions, but even a Bayesian network can be thought of as a distribution which is factored along a graph. This idea makes pomegranate extremely flexible, allowing for models to be stacked within each other to neatly produce hidden Markov models with general mixture model emissions, Naive Bayes classifiers with Bayesian network emissions, and mixtures of hidden Markov models.

Here is a chart of the ways in which models can be stacked. On the left is a chart of what is possible, with dark blue indicating that models can be stacked in a certain way, and light blue indicating a way models can be stacked which will be added in the near future. On the right is a comparison of pomegranate's models to other known packages, with light red indicating model stacking which is known to exist in other packages but for which pomegranate provides more support. For example, sklearn implements mixture models, but these are limited to Gaussian emissions, while pomegranate supports mixture models of arbitrary distributions. Dark red indicates new model stacking that no other known package supports, such as mixtures of hidden Markov models. If you know of a package which supports these combinations I'd love to know!

In [2]:
labels = 'Distributions', 'Naive Bayes', 'Markov Chains', 'GMM', 'HMM', 'Bayesian Networks' 
X = [[1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
     [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.5, 0.0, 0.0, 0.0],
     [1.0, 1.0, 0.0, 1.0, 1.0, 0.0],
     [0.0, 1.0, 0.0, 1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.5, 0.5, 0.0]]

Y = [[0.0, 0.5, 0.0, 0.5, 0.5, 0.0],
     [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
     [0.0, 0.5, 0.0, 0.0, 0.0, 0.0],
     [0.5, 1.0, 0.0, 1.0, 0.5, 0.0],
     [0.0, 1.0, 0.0, 1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0, 0.0, 0.0]]

plt.figure(figsize=(20, 10))
plt.subplot(121)
plt.title("Model Stacking Allowed", fontsize=24)
plt.imshow(X, interpolation='nearest', cmap='Blues')
plt.xticks(range(6), labels, rotation=90, fontsize=20)
plt.xlabel("..is allowed in this model", fontsize=20)
plt.ylabel("This model...", fontsize=20)
plt.yticks(range(6), labels, fontsize=20)

plt.subplot(122)
plt.title("Model Stacking Compared To Others", fontsize=24)
plt.imshow(Y, interpolation='nearest', cmap='Reds')
plt.xticks(range(6), labels, rotation=90, fontsize=20)
plt.xlabel("..is allowed in this model", fontsize=20)
plt.yticks(range(6), '', fontsize=18)
plt.show()

All of the models (and stacks of models) in pomegranate support the same core functionality and methods. You can read about the API more at the readthedocs site. These methods are almost all implemented with a cython backend which releases the GIL. This makes the operations especially fast, but it also makes them all parallelizable using ~multithreading~. In many cases, this can be much more efficient than multiprocessing.

To utilize this parallelization, pomegranate has implemented some parallelization wrappers, which can run model prediction, log probability calculations, or model fitting in parallel for any model or stack of models.
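Because the cython methods release the GIL, even plain Python threads can score different chunks of data at the same time. Below is a minimal sketch of that idea using only the standard library's thread pool rather than pomegranate's own parallel wrappers; it assumes the model's log_probability method accepts an array of samples, as in the examples later in this notebook.

from multiprocessing.pool import ThreadPool

def parallel_log_probability(model, X, n_jobs=4):
    # split the data into chunks and score each chunk in its own thread;
    # threads only help here because the underlying cython code releases the GIL
    pool = ThreadPool(n_jobs)
    chunks = numpy.array_split(X, n_jobs)
    results = pool.map(model.log_probability, chunks)
    pool.close()
    pool.join()
    return numpy.concatenate(results)

logp = parallel_log_probability(NormalDistribution(0, 1), numpy.random.randn(1000000))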

Models

Since all models share the same core set of functionality, let's start with a quick API reference. All models have the following methods:

  1. probability
  2. log_probability
  3. fit
  4. sample
  5. summarize
  6. from_summaries
  7. clear_summaries
  8. to_json
  9. from_json

The first three methods are likely the most used---evaluating samples under the model parameters (probability and log_probability) and updating the model parameters from data (fit).

Methods 5-7 implement pomegranate's out-of-core learning scheme. All models can summarize data down to its sufficient statistics using 'summarize', update model parameters exactly from these sufficient statistics using 'from_summaries', and then clear the stored sufficient statistics using 'clear_summaries'. This ultimately lets you decide whether to clear the summaries after each parameter update, or to continue accumulating them for online learning to get better estimates of a single process.

In addition, for models composed of simple distributions, a sklearn-like API is exposed for the common operations:

  1. predict
  2. predict_proba
  3. predict_log_proba

The 'predict' method returns the index of the model which most likely generated the sample, 'predict_proba' returns the posterior probabilities of each model having generated the sample, and 'predict_log_proba' returns the log probabilities.
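The serialization methods aren't demonstrated elsewhere in this notebook, so here is a minimal sketch of a JSON round trip; the exact layout of the JSON string is an implementation detail, and the assumption here is simply that from_json on the model class reverses to_json, as the method list above suggests.

d = NormalDistribution(0, 1)
d.fit(numpy.random.randn(1000))          # update the parameters from data

j = d.to_json()                          # a plain JSON string, safe to write to disk
d2 = NormalDistribution.from_json(j)     # rebuild an equivalent object

print d.parameters
print d2.parameters                      # same parameters as the model we serialized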

1. Probability Distributions

The basic unit of probabilistic modelling is the probability distribution. Most people are familiar with simple ones like the Normal distribution, but pomegranate supports a wide variety of them. Here is a full list of the probability distributions which pomegranate currently (8/12/2016) supports.

Univariate Distributions

  1. UniformDistribution
  2. NormalDistribution
  3. LogNormalDistribution
  4. ExponentialDistribution
  5. BetaDistribution
  6. GammaDistribution
  7. DiscreteDistribution
  8. PoissonDistribution

Kernel Densities

  1. GaussianKernelDensity
  2. UniformKernelDensity
  3. TriangleKernelDensity

Multivariate Distributions

  1. IndependentComponentsDistribution
  2. MultivariateGaussianDistribution
  3. DirichletDistribution
  4. ConditionalProbabilityTable
  5. JointProbabilityTable

As with the general API above, fit and log_probability are likely the most used methods---updating the parameters of a distribution and evaluating samples under it. The out-of-core methods work the same way here: summarize stores sufficient statistics from a dataset, from_summaries updates the model parameters using the stored sufficient statistics, and clear_summaries resets the sufficient statistics if you don't want past points to influence the next update. Lastly, to_json and from_json allow you to serialize your models.

Let's take a look at one of the simpler distributions, the Normal Distribution. We can specify it easily in terms of the mean $\mu$ and the standard deviation $\sigma$.

In [3]:
a = NormalDistribution(0, 2)
b = NormalDistribution(1, 0.5)

We can look at the log probability of some points under these distributions using the log_probability method.

In [4]:
print a.log_probability(5)
print b.log_probability(5)
-4.73708571376
-32.2257913526

We can fit any type of distribution to data using the class method Model.from_samples(X). This uses maximum likelihood estimates on the data to derive parameters for the distribution. In the case of the normal distribution, it just calculates the mean and standard deviation of the data.

Let's fit a normal distribution to some randomly generated data and take a look at it.

In [5]:
data = numpy.random.randn(100)

a = NormalDistribution.from_samples(data)

x = numpy.arange(-3.5, 3.5, 0.1)
y = a.probability(x)

plt.plot(x, y, c='b')
plt.fill_between(x, 0, y, color='b', alpha=0.3)
plt.scatter(data, numpy.zeros(100)-0.05)
plt.ylabel('Probability')
plt.xlim(-4, 4)
plt.show()

In addition to the basic fit method, pomegranate also supports out-of-core learning to train on datasets which don't fit in memory through the summarize and from_summaries methods. Essentially, the summarize method will reduce a batch of data down to its sufficient statistics, and the from_summaries method will calculate an exact update to the model parameters based on the stored sufficient statistics. A simple example of this is the Normal Distribution, where the three sufficient statistics are $\sum\limits_{i=1}^{n}w_{i}$, $\sum\limits_{i=1}^{n}w_{i}x_{i}$, and $\sum\limits_{i=1}^{n} w_{i}x_{i}^{2}$. Since this sum can be broken into parts, you can run summarize as many times as necessary to analyze the entire dataset. Once you're done summarizing data, you can then run from_summaries, which calculates the following:

\begin{align} \mu &= \frac{\sum\limits_{i=1}^{n}w_{i}x_{i}}{\sum\limits_{i=1}^{n}w_{i}} \\ \sigma^{2} &= \frac{\sum\limits_{i=1}^{n} w_{i}x_{i}^{2}}{\sum\limits_{i=1}^{n} w_{i}} - \frac{\left(\sum\limits_{i=1}^{n} w_{i}x_{i} \right)^{2}}{\left( \sum\limits_{i=1}^{n} w_{i} \right)^{2}} \end{align}
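Here is a quick numpy sketch of those update equations with unit weights, processing the data in chunks the way summarize would see it; this is just meant to illustrate the arithmetic, not pomegranate's internals.

import numpy

data = numpy.random.randn(5000)
w_sum = wx_sum = wx2_sum = 0.0

for chunk in numpy.array_split(data, 5):
    w = numpy.ones(len(chunk))            # per-sample weights w_i (all 1 here)
    w_sum   += w.sum()                    # sum of w_i
    wx_sum  += (w * chunk).sum()          # sum of w_i x_i
    wx2_sum += (w * chunk ** 2).sum()     # sum of w_i x_i^2

mu = wx_sum / w_sum
sigma = numpy.sqrt(wx2_sum / w_sum - (wx_sum / w_sum) ** 2)
print mu, sigma                           # matches data.mean(), data.std()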

Let's confirm that this works by comparing two distributions: one fit to the entire dataset at once, and one updated from summaries of five chunks of the data.

In [6]:
data = numpy.random.randn(5000)
a = NormalDistribution(5, 2)
b = NormalDistribution(5, 2)

a.fit(data)
b.summarize(data[:1000])
b.summarize(data[1000:2000])
b.summarize(data[2000:3000])
b.summarize(data[3000:4000])
b.summarize(data[4000:])
b.from_summaries()


x = numpy.arange(-3.5, 3.5, 0.1)
ya = map( a.probability, x )
yb = map( b.probability, x )

plt.plot(x, ya, c='b')
plt.plot(x, yb, c='r')
plt.fill_between(x, 0, ya, color='b', alpha=0.3)
plt.fill_between(x, 0, yb, color='r', alpha=0.3)
plt.ylabel('Probability')
plt.show()

print "Fit Mean:       {}, Fit STD:       {}".format( *a.parameters )
print "Summarize Mean: {}, Summarize STD: {}".format( *b.parameters )
Fit Mean:       -0.0174820965846, Fit STD:       0.986767322871
Summarize Mean: -0.0174820965846, Summarize STD: 0.986767322871

Speed

Let's compare the speed against numpy for extracting model parameters and scipy for calculating the log probability of a point under these parameters.

In [7]:
data = numpy.random.randn(1000)

print "numpy time:"
%timeit -n 100 data.mean(), data.std()
print
print "pomegranate time:"
%timeit -n 100 NormalDistribution.from_samples(data)
numpy time:
100 loops, best of 3: 79 µs per loop

pomegranate time:
100 loops, best of 3: 36.8 µs per loop

Looks like we're around twice as fast at extracting the mean and standard deviation. This is mostly because the mean and variance are calculated together in a single pass using the sufficient statistics, instead of going through the entire array twice to calculate the mean and then the standard deviation.

Let's take a look at doing this with multivariate Gaussian distributions on large amounts of data.

In [8]:
data = numpy.random.randn(10000000, 10)

print "numpy time:"
%timeit -n 10 data.mean(), numpy.cov(data.T)
print
print "pomegranate time:"
%timeit -n 10 MultivariateGaussianDistribution.from_samples(data)
numpy time:
10 loops, best of 3: 2.41 s per loop

pomegranate time:
10 loops, best of 3: 2.49 s per loop

Looks like pomegranate is competitive with numpy on large amounts of data; the timings above are essentially identical. The relative speed depends on the shape of the data: the advantage diminishes as more dimensions are added, converging to numpy's speed, but grows as more data points are added.

In [9]:
from scipy.stats import norm

d = NormalDistribution(0, 1)

print "scipy time:"
%timeit -n 100 norm.logpdf(2, 0, 1)
print
print "pomegranate time:"
%timeit -n 100 NormalDistribution(0, 1).log_probability(2)
print
print "pomegranate with (w/ created object)"
%timeit -n 100 d.log_probability(2)
print
print "logp difference: {}".format( norm.logpdf(2, 0, 1) - NormalDistribution(0, 1).log_probability( 2 ) )
scipy time:
100 loops, best of 3: 162 µs per loop

pomegranate time:
The slowest run took 124.01 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 4.7 µs per loop

pomegranate (w/ pre-created object)
100 loops, best of 3: 4.86 µs per loop

logp difference: -3.99236199655e-13

One of the reasons which pomegranate can be so fast at calculating log probabilities is that it is able to cache parts of the logpdf equation so that it doesn't need to do all of the calculations each time. For example, let's look at the Normal distribution pdf equation:

\begin{equation} P(X|\mu, \sigma) = \frac{1}{\sqrt{2\pi}\sigma} exp \left( -\frac{(x - \mu)^{2}}{2\sigma^{2}} \right) \\ \end{equation}

We can take the log of this to simplify it.

\begin{equation} logP(X|\mu, \sigma) = -\log \left(\sqrt{2\pi}\sigma \right) - \frac{(x-\mu)^{2}}{2\sigma^{2}} \end{equation}

pomegranate speeds up this calculation by caching $-\log(\sqrt{2\pi}\sigma)$ and $2\sigma^{2}$ when the object is created. This means that the equation is simplified to the following:

\begin{equation} logP(X|\mu, \sigma) = \alpha - \frac{(x - \mu)^{2}}{\beta} \end{equation}

We don't need to calculate any logs or exponentials here, just a difference, a multiplication, a division, and a subtraction.
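As a minimal sketch of the caching idea in pure Python (pomegranate's actual implementation is cython), the two constants can be computed once in the constructor and reused on every call:

import math

class CachedNormal(object):
    def __init__(self, mu, sigma):
        self.mu = mu
        self.alpha = -math.log(math.sqrt(2 * math.pi) * sigma)   # -log(sqrt(2 pi) sigma)
        self.beta = 2. * sigma ** 2                              # 2 sigma^2

    def log_probability(self, x):
        # no logs or exponentials per call, just cheap arithmetic
        return self.alpha - (x - self.mu) ** 2 / self.beta

print CachedNormal(0, 1).log_probability(2)   # same value as NormalDistribution(0, 1).log_probability(2)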

It is worth noting that numpy and scipy support vectorized operations across an array of samples at the same time. As the size of this array grows, it may become faster to use numpy or scipy than to use pomegranate. However, most of the models inside pomegranate only require the calculation of one value at a time. Because of this, it's more useful for pomegranate to have fast one-sample calculations than vectorized ones.

Example: Gossip Girl

Let's turn now to a practical and useful example of how probabilistic modelling can help us.

A few years ago, my girlfriend wanted me to watch Gossip Girl with her. The show follows a group of angsty teenagers in Manhattan who take turns being in relationships with each other while constantly disappointing their parents. In the backdrop, an enigmatic figure known as 'Gossip Girl' sends out text message 'blasts' to stir up trouble--usually just in time to cause needless drama which could be solved if the characters ever decided to talk to each other. A mystery of the show is 'Who is Gossip Girl?', with the implication that one of the main characters is secretly Gossip Girl.

Let's take a look at some of the blasts.

In [10]:
data = pandas.read_excel('GGBlasts.xlsx')
blasts = data['Blast']

print blasts[1]
print
print blasts[5]
Spotted: Lonely Boy. Can't believe the love of his life has returned. If only she knew who he was. But everyone knows Serena. And everyone is talking. Wonder what Blair Waldorf thinks. Sure, they're BFF's, but we always thought Blair's boyfriend Nate had a thing for Serena. 

Why'd she leave? Why'd she return? Send me all the deets. And who am I? That's the secret I'll never tell. The only one. —XOXO. Gossip Girl. 

I took that last blast as a challenge. I wanted to figure out which of the characters was gossip girl, and I wanted to use data science in order to do it.

To do this, the blasts were first all encoded as being either in favor of someone's agenda, or against someone's agenda. Presumably, the character whom the blasts help the most would be Gossip Girl. So let's now take a look at a bar chart of the number of positive blasts minus the number of negative blasts.

In [11]:
data = data.fillna(0)
data.sum(axis=0).plot(kind='bar', facecolor='m', edgecolor='m')
plt.ylabel('Positive Blasts - Negative Blasts')
plt.show()

This chart seems to have some important information in it. Specifically, it seems like Blair has by far the most blasts working against her agenda. It seems unlikely that she is Gossip Girl. Likewise, it seems unlikely that Chuck is Gossip Girl due to the negativity. This leaves Dan, Nate, and Vanessa as the most likely contenders to be Gossip Girl based entirely on the negativity index. In order to break this tie, let's attempt to use probabilistic modelling to see if we can figure out who it is.

This brings us to the concept of the beta distribution. Basically, the beta distribution is a probability distribution between 0 and 1, which is parameterized by two shape parameters $\alpha$ and $\beta$. Typically, this represents a distribution over the probability that an event occurs, with $\alpha$ representing the number of times which an event happened, and $\beta$ representing the number of times which it did not happen.

A classic example of using a beta distribution is modelling the probability of a coin coming up heads. A coin coming up heads is considered a 'success', and a coin coming up tails is considered a 'failure'. Typically a count of 1 is added to each parameter of the beta distribution on top of the observed counts, which corresponds to starting from a uniform prior. Let's take a look.

In [12]:
a = BetaDistribution(0+1, 0+1)
b = BetaDistribution(0+1, 2+2)
c = BetaDistribution(1+1, 2+2)
d = BetaDistribution(25+1, 25+1)

plt.figure(figsize=(10, 6))

x = numpy.arange(0, 1, 0.01)
ya = map( a.probability, x )
yb = map( b.probability, x )
yc = map( c.probability, x )
yd = map( d.probability, x )

plt.plot(x, ya, c='b', label="0 heads, 0 tails")
plt.plot(x, yb, c='c', label="0 heads, 2 tails")
plt.plot(x, yc, c='r', label="1 head, 2 tails")
plt.plot(x, yd, c='g', label="25 heads, 25 tails")

plt.fill_between(x, 0, ya, color='b', alpha=0.3)
plt.fill_between(x, 0, yb, color='c', alpha=0.3)
plt.fill_between(x, 0, yc, color='r', alpha=0.3)
plt.fill_between(x, 0, yd, color='g', alpha=0.3)

plt.legend()
plt.ylabel('Probability')
plt.show()

This seems intuitive. If we haven't seen any heads or any tails (blue), we really have a uniform distribution over the possible values of the probability of the coin flipping heads. If we've seen two tails (cyan) then it becomes impossible that there is a 100% chance of seeing heads, and more of the probability mass shifts toward a coin heavily biased toward tails. However, as soon as we see a single head (red), a 0% chance of heads becomes impossible as well. This makes sense, because we've observed a head. Continuing, if we flip 25 heads and 25 tails (green), we get a tight distribution around 0.5, representing our increased confidence that the parameter is around there. This makes sense, because if we flip a coin 2 times and get 1 head and 1 tail, we're less confident than if we've flipped it 2000 times and got 1000 heads and 1000 tails.

Let's apply this concept to the Gossip Girl data. We can start off by saying that a blast in favor of a character's agenda is a 'success', and a blast against their agenda is a 'failure'. Let's then learn beta distributions for each character for each of the four seasons, and see how these distributions evolve over time.

In [13]:
characters = { 
               'Blair': [[1, 1, 1, 1], [1, 1, 1, 1]],
               'Serena' : [[1, 1, 1, 1], [1, 1, 1, 1]],
               'Dan' : [[1, 1, 1, 1], [1, 1, 1, 1]],
               'Vanessa' : [[1, 1, 1, 1], [1, 1, 1, 1]],
               'Jenny' : [[1, 1, 1, 1], [1, 1, 1, 1]],
               'Chuck' : [[1, 1, 1, 1], [1, 1, 1, 1]]
             }


for i, row in data.iterrows():
    season_s = row['Season/episode']
    if isinstance( season_s, unicode ):
        season = int( row['Season/episode'][1] ) - 1 

    for character in characters.keys():
        if row[character] == 1:
            characters[character][0][season] += 1.
        elif row[character] == -1:
            characters[character][1][season] += 1.

for character, (good, bad) in characters.items():
    characters[character] = [ np.cumsum(good), np.cumsum(bad) ]

x = np.linspace(0, 1, 100)
plt.figure( figsize=(16, 8))

colors = { 
           'Blair' : 'r',
           'Serena' : 'y',
           'Dan' : 'g',
           'Jenny' : 'c',
           'Vanessa' : 'm',
           'Chuck' : 'b'
         }

for k in xrange(4):
    plt.subplot(len(good) / 2, 2, k + 1)
    plt.xlabel("After Season {}".format( k+1 ))
    
    for character, (good, bad) in characters.items():
        B = BetaDistribution(good[k], bad[k])
        y = numpy.exp([B.log_probability(i) for i in x])
        
        plt.plot( x, y, label=character, c=colors[character] )
        plt.fill_between(x, 0, y, color=colors[character], alpha=0.2)
        plt.vlines( good[k] / ( good[k]+bad[k] ), 0, 9, colors=colors[character], linestyles='dashed' )
    
    plt.legend()

plt.tight_layout()

Above we've plotted both the distributions and the maximum likelihood estimates of those distributions as vertical dashed bars. We can see that over the course of the show, the distributions tighten. Two interesting observations are that Dan appears to be the favorite to be Gossip Girl across all seasons, though not by much. Also, except at the end of Season 3, Blair is not the least likely to be Gossip Girl, being surpassed by Chuck. In fact, by the end of Season 1, she is above the median for being Gossip Girl.

Hopefully this practical example has shown how even basic probabilistic modelling can be applied to derive insight into important, real-world problems.

2. Naive Bayes Classifiers

Now that we've covered basic probability distributions, a typical thing you may want to do is create a machine learning classifier using them. After all, you can fit these distributions to some data, and use them to evaluate the likelihood of some future points. Wouldn't it be simple to just see which distribution produced the highest likelihood of a given point, and then assign that label to the point?

This is the basis of Naive Bayes classifiers, which are a form of probabilistic classifier which rely on Bayes Rule, shown below. M stands for model, and D stands for data.

\begin{equation} P(M|D) P(D) = P(D|M) P(M) \end{equation}

Let's look at the different parts of this equation. The first part is P(M|D), the probability of the model given the data. This is ultimately what we would like to calculate for each model (distribution) in the Naive Bayes classifier. We then return the model label corresponding to the model with the highest P(M|D) score for each point.

However, P(D) is a bit more ambiguous. What exactly is the probability of that dataset? Instead of worrying about that, we note that this term does not depend on the model, and so will be the same for each component. Dropping it, we are left with the proportionality:

\begin{equation} P(M|D) \propto P(D|M) P(M) \end{equation}

Next, we get to P(D|M), which is our basic likelihood function. All distributions implement model.log_probability(X), which calculates logP(D|M).

Lastly, we come to P(M), which is the prior distribution over the distributions in the Naive Bayes classifier. This term represents the probability that, without even having seen the data point yet, we would assign a specific label to it. This typically weights the likelihood by the frequency of that label, so that labels which occur far more frequently are recognized as being more likely.
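Putting the pieces together, here is a sketch of this arithmetic done by hand for a single point under two example normal distributions with a uniform prior over the models; the predict_proba method used later performs the same computation.

x = 2.0
prior = numpy.array([0.5, 0.5])                                       # P(M)
likelihood = numpy.array([NormalDistribution(0, 2).probability(x),
                          NormalDistribution(5, 1).probability(x)])   # P(D|M)

posterior = likelihood * prior
posterior /= posterior.sum()          # normalizing removes the P(D) term
print posterior                       # P(M|D) for the two models, sums to 1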

pomegranate implements Naive Bayes in a fast, flexible, and simple to use format. Let's see a basic example using two normal distributions.

We can specify the model in two ways, depending on if we want to fit the model to a dataset, or have specific distributions in mind which we'd like to use to predict.

In [14]:
a = NormalDistribution(0, 2)
b = NormalDistribution(5, 1)

x = numpy.arange(-8, 14, 0.1)
ya = a.probability(x)
yb = b.probability(x)

plt.plot(x, ya, c='b')
plt.plot(x, yb, c='r')

plt.fill_between(x, 0, ya, color='b', alpha=0.2)
plt.fill_between(x, 0, yb, color='r', alpha=0.2)

plt.ylabel("Probability")
plt.show()

Let's use the predict function on a grid to see where it is predicting 0 and where it is predicting 1s, corresponding respectively to the blue and red distributions.

In [15]:
data = numpy.arange(-5, 10, 0.5).reshape(-1, 1)

clf = NaiveBayes([a, b])
y = clf.predict(data)

n0 = data.shape[0] - y.sum()
n1 = y.sum()

x = numpy.arange(-8, 14, 0.1)
ya = a.probability(x)
yb = b.probability(x)

plt.plot(x, ya, c='b', alpha=0.2)
plt.plot(x, yb, c='r', alpha=0.2)

plt.fill_between(x, 0, ya, color='b', alpha=0.2)
plt.fill_between(x, 0, yb, color='r', alpha=0.2)

plt.scatter( data[y==0], numpy.zeros(n0)-0.02, color='b', edgecolor='b' )
plt.scatter( data[y==1], numpy.zeros(n1)+0.42, color='r', edgecolor='r' )
plt.ylabel("Probability")
plt.show()

This makes sense. The model switches from predicting 0 (blue) to 1 (red) at the position in the plot where the red distribution has a higher probability than the blue distribution. This is pretty much what we would expect from a Naive Bayes classifier with no prior distribution on the components.

Now, we can get a probabilistic view of the classification by looking at the predicted probabilities instead of just a hard classification. This ends up being P(M|D), which is a vector of probabilities that each model component generated that sample, which will sum to 1 for each sample. We can get these values by using model.predict_proba.

In [16]:
data = numpy.arange(-3, 7, 0.25).reshape(-1, 1)
y = clf.predict_proba(data)

x = numpy.arange(-8, 14, 0.1)
ya = a.probability(x)
yb = b.probability(x)

plt.plot(x, ya, c='b', alpha=0.2)
plt.plot(x, yb, c='r', alpha=0.2)

plt.fill_between(x, 0, ya, color='b', alpha=0.2)
plt.fill_between(x, 0, yb, color='r', alpha=0.2)

plt.scatter( data[y[:,1] <= 0.5], y[y[:,1] <= 0.5, 1]*0.4, color='b', edgecolor='b' )
plt.scatter( data[y[:,1] > 0.5],  y[y[:,1] > 0.5, 1]*0.4, color='r', edgecolor='r' )

plt.yticks(numpy.arange(0, 0.41, 0.05), numpy.arange(0, 1.1, 0.125))
plt.ylim(-0.05, 0.45)
plt.ylabel("P(Red Model | Data)")
plt.show()

These results mesh pretty well with what we saw before. As the input value increases, there is a higher probability that the sample was generated by the red distribution. The point at which the dots switch colors lines up with the point at which the red distribution has a higher probability than the blue distribution.

We can fit Naive Bayes classifiers in two ways. The first is by passing in an explicit list of instantiated models, possibly with a prior distribution over them, and then calling fit. The second is similar to sklearn, where we only pass a callable for the type of distribution which we would like to fit. We know the number of distributions which we'd like to fit from the label set.
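The second way is shown in the next cell; here is a sketch of the first way, assuming that fit accepts the data along with its labels.

a = NormalDistribution(0, 2)
b = NormalDistribution(5, 1)
clf = NaiveBayes([a, b])                 # start from explicit distributions

X = numpy.concatenate([numpy.random.normal(0, 2, (100, 1)),
                       numpy.random.normal(5, 1, (100, 1))])
y = numpy.concatenate([numpy.zeros(100), numpy.ones(100)])

clf.fit(X, y)                            # refines a, b and the class priors
print clf.predict(numpy.array([[0.5], [4.0]]))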

In [17]:
X = numpy.concatenate([numpy.random.normal(0, 2, (100, 1)), numpy.random.normal(4, 1.5, (100, 1))])
y = numpy.concatenate([numpy.zeros(100), numpy.ones(100)])

clf = NaiveBayes.from_samples(NormalDistribution, X, y)
print(clf)
{
    "models" : [
        {
            "frozen" : false,
            "class" : "Distribution",
            "parameters" : [
                -0.01013271317851311,
                2.0996425031230492
            ],
            "name" : "NormalDistribution"
        },
        {
            "frozen" : false,
            "class" : "Distribution",
            "parameters" : [
                3.9719553014786957,
                1.4745674916478821
            ],
            "name" : "NormalDistribution"
        }
    ],
    "weights" : [
        -0.6931471805599453,
        -0.6931471805599453
    ],
    "class" : "NaiveBayes"
}

Speed

sklearn implements a few versions of Naive Bayes classifiers, of which Gaussian Naive Bayes is probably used the most. Let's run a speed test comparing the speed of pomegranate with the speed of sklearn, as well as the accuracy of the two models.

In [18]:
from sklearn.naive_bayes import GaussianNB

def create_dataset(n_samples, n_dim, n_classes):
    """Create a random dataset with n_samples in each class."""
    
    X = numpy.concatenate([numpy.random.randn(n_samples, n_dim) + i for i in range(n_classes)])
    y = numpy.concatenate([numpy.zeros(n_samples) + i for i in range(n_classes)])
    return X, y

def plot( fit, predict, skl_error, pom_error, sizes, xlabel ):
    """Plot the results."""
    
    idx = numpy.arange(fit.shape[1])
    
    plt.figure( figsize=(14, 4))
    plt.plot( fit.mean(axis=0), c='c', label="Fitting")
    plt.plot( predict.mean(axis=0), c='m', label="Prediction")
    plt.plot( [0, fit.shape[1]-1], [1, 1], c='k', label="Baseline" )
    
    plt.fill_between( idx, fit.min(axis=0), fit.max(axis=0), color='c', alpha=0.3 )
    plt.fill_between( idx, predict.min(axis=0), predict.max(axis=0), color='m', alpha=0.3 )
    
    plt.xticks(idx, sizes, rotation=65, fontsize=14)
    plt.xlabel('{}'.format(xlabel), fontsize=14)
    plt.ylabel('pomegranate is x times faster', fontsize=14)
    plt.legend(fontsize=12, loc=4)
    plt.show()
    
    
    plt.figure( figsize=(14, 4))
    plt.plot( 1 - skl_error.mean(axis=0), alpha=0.5, c='c', label="sklearn accuracy" )
    plt.plot( 1 - pom_error.mean(axis=0), alpha=0.5, c='m', label="pomegranate accuracy" )
    
    plt.fill_between( idx, 1-skl_error.min(axis=0), 1-skl_error.max(axis=0), color='c', alpha=0.3 )
    plt.fill_between( idx, 1-pom_error.min(axis=0), 1-pom_error.max(axis=0), color='m', alpha=0.3 )
    
    plt.xticks(idx, sizes, rotation=65, fontsize=14)
    plt.xlabel('{}'.format(xlabel), fontsize=14)
    plt.ylabel('Accuracy', fontsize=14)
    plt.legend(fontsize=14) 
    plt.show()
    
def evaluate_models():
    sizes = numpy.around(numpy.exp( numpy.arange(7, 13))).astype('int')
    n, m = sizes.shape[0], 5
    
    skl_predict, pom_predict = numpy.zeros((m, n)), numpy.zeros((m, n))
    skl_fit, pom_fit = numpy.zeros((m, n)), numpy.zeros((m, n))
    skl_error, pom_error = numpy.zeros((m, n)), numpy.zeros((m, n))
    
    for i in range(m):
        for j, size in enumerate(sizes):
            X, y = create_dataset(size, 20, 2)
            
            # bench fit times
            tic = time.time()
            skl = GaussianNB()
            skl.fit( X, y )
            skl_fit[i, j] = time.time() - tic

            tic = time.time()
            pom = NaiveBayes.from_samples(NormalDistribution, X, y)
            pom_fit[i, j] = time.time() - tic

            # bench predict times
            tic = time.time()
            skl_predictions = skl.predict( X )
            skl_predict[i, j] = time.time() - tic

            tic = time.time()
            pom_predictions = pom.predict( X )
            pom_predict[i, j] = time.time() - tic
        
            # check number wrong
            skl_e = (y != skl_predictions).mean()
            pom_e = (y != pom_predictions).mean()

            skl_error[i, j] = min(skl_e, 1-skl_e)
            pom_error[i, j] = min(pom_e, 1-pom_e)
    
    fit = skl_fit / pom_fit
    predict = skl_predict / pom_predict
    
    plot(fit, predict, skl_error, pom_error, sizes, "samples per component")

evaluate_models()

Example: Classifying Noisy Signals

A common case in the analysis of signals is to have a signal which is a mixture of two different underlying phenomena, which each have different characteristics. For example, this could be the electricity usage in a house when the lights are on versus off, or a pedometer which measures when a person is running versus walking.

Let's assume that we have a perfect segmenter right now, which will divide the signal into regions which are different from the adjacent regions. The task becomes to take a segment of the signal and classify which phenomenon is occurring during it. Let's take a look at some data, colored cyan and magenta to indicate the underlying phenomena.

In [19]:
def plot_signal(X, n):
    plt.figure(figsize=(16, 6))

    t_current = 0
    for i in range(n):
        mu, std, t = X[i]
        t = int(t)
        chunk = numpy.random.normal(mu, std, t)
        plt.plot(numpy.arange(t_current, t_current+t), chunk, c='cm'[i % 2])
        t_current += t
    
    plt.ylim(20, 40)
    plt.show()

def create_signal(n):
    X, y = [], []

    for i in range(n):
        mu = numpy.random.normal(30.0, 0.4)
        std = numpy.random.lognormal(-0.1, 0.4)
        t = int(numpy.random.exponential(45)) + 1
        X.append([mu, std, t])
        y.append(0)

        mu = numpy.random.normal(30.5, 0.4)
        std = numpy.random.lognormal(-0.4, 0.4)
        t = int(numpy.random.exponential(250)) + 1
        X.append([mu, std, t])
        y.append(1)
    
    return numpy.array(X), numpy.array(y)

X_train, y_train = create_signal(1000)
X_test, y_test = create_signal(250)
plot_signal(X_train, 20)

Each segment is drawn from a normal distribution, with the mean drawn from a normal distribution, the standard deviation drawn from a log normal distribution, and the duration drawn from an exponential distribution. These distributions are fairly similar between the two phenomena, but together they may differ enough to produce a good classifier. Let's say that when the classifier gets a segment of signal, it will classify based on the mean, standard deviation, and duration of the segment, and not the raw data points. This turns into a basic classification task with three features.

We can create two classifiers in pomegranate and also see how the more widely used sklearn classifier does. The first uses the appropriate underlying distribution for each feature: a normal distribution for the mean, a log normal distribution for the standard deviation, and an exponential distribution for the duration. We can pass these directly into the from_samples method. The second classifier is a more typical multivariate Gaussian Naive Bayes model.

Let's fit our models to 1000 example segments, and then test the validation performance on 250 new segments.

In [20]:
from sklearn.naive_bayes import GaussianNB

distributions = [NormalDistribution, LogNormalDistribution, ExponentialDistribution]
model = NaiveBayes.from_samples(distributions, X_train, y_train)
print "Naive Bayes Using Normal, LogNormal, Exponential: ", (model.predict(X_test) == y_test).mean()

model = NaiveBayes.from_samples(NormalDistribution, X_train, y_train)
print "Multivariate Gaussian Naive Bayes: ", (model.predict(X_test) == y_test).mean()

clf = GaussianNB()
clf.fit(X_train, y_train)
print "sklearn Multivariate Gaussian Naive Bayes: ", ( clf.predict(X_test) == y_test).mean()
Naive Bayes Using Normal, LogNormal, Exponential:  0.844
Multivariate Gaussian Naive Bayes:  0.828
sklearn Multivariate Gaussian Naive Bayes:  0.828

Looks like using a multivariate Gaussian model is not a bad choice for this data, but using the proper distribution to model each of the features does give noticeably better performance (0.844 versus 0.828 here) and should generalize better. This improved performance is not just in the hard classifications; the soft calls (using model.predict_proba) will also be more reliable and accurate.

A difference between the pomegranate Naive Bayes classifier and the sklearn classifier is that pomegranate will learn the full covariance matrix for the classifier, while sklearn will only learn the diagonal. In many applications, learning the full covariance matrix can improve performance because the underlying features are correlated. In this example, all of the features are generated from independent distributions, and so there isn't much correlation.

Example: Bayesian Network Naive Bayes

Bayesian networks are just distributions which are factorized along a graphical structure. As such, we can plug them into a Bayes classifier to compare different structures. For example, let's compare a fully structured Bayesian network to one in which some of the dependencies have been removed. This will let us see whether adding structure between the variables is better than treating some of them independently.

In [21]:
intelligence = DiscreteDistribution({0: 0.7, 1: 0.3})
difficulty   = DiscreteDistribution({0: 0.4, 1: 0.6})
SAT = ConditionalProbabilityTable([[0, 0, 0.95],
                                   [0, 1, 0.05],
                                   [1, 0, 0.20],
                                   [1, 1, 0.80]], [intelligence])
grade = ConditionalProbabilityTable([[0, 0, 0, 0.70],
                                     [0, 0, 1, 0.30],
                                     [0, 1, 0, 0.95],
                                     [0, 1, 1, 0.05],
                                     [1, 0, 0, 0.90],
                                     [1, 0, 1, 0.10],
                                     [1, 1, 0, 0.30],
                                     [1, 1, 1, 0.70]], [intelligence, difficulty])
letter = ConditionalProbabilityTable([[0, 0, 0.90],
                                      [0, 1, 0.10],
                                      [1, 0, 0.40],
                                      [1, 1, 0.60]], [grade])

s1 = State(intelligence, name="intelligence")
s2 = State(difficulty, name="difficulty")
s3 = State(SAT, name="SAT")
s4 = State(grade, name="grade")
s5 = State(letter, name="letter")

dN = BayesianNetwork()
dN.add_nodes(s1, s2, s3, s4, s5)
dN.add_edge(s1, s3)
dN.add_edge(s1, s4)
dN.add_edge(s2, s4)
dN.add_edge(s4, s5)
dN.bake()

intelligence = DiscreteDistribution({0: 0.95, 1: 0.05})
difficulty   = DiscreteDistribution({0: 0.4, 1: 0.6})
SAT = ConditionalProbabilityTable([[0, 0, 0.95],
                                   [0, 1, 0.05],
                                   [1, 0, 0.20],
                                   [1, 1, 0.80]], [intelligence])
grade = DiscreteDistribution({0: 0.2, 1: 0.8})
letter = ConditionalProbabilityTable([[0, 0, 0.90],
                                      [0, 1, 0.10],
                                      [1, 0, 0.40],
                                      [1, 1, 0.60]], [grade])

s1 = State(intelligence, name="intelligence")
s2 = State(difficulty, name="difficulty")
s3 = State(SAT, name="SAT")
s4 = State(grade, name="grade")
s5 = State(letter, name="letter")

dC = BayesianNetwork()
dC.add_nodes(s1, s2, s3, s4, s5)
dC.add_edge(s1, s3)
dC.add_edge(s4, s5)
dC.bake()

Now we have two fully specified Bayesian networks; the second removes the dependency of grade on intelligence and difficulty, modelling a student who is cheating. The two changes that we make are that the distribution over intelligence favors those who are not exceptional, because they are more likely to cheat in this example, and that grade no longer depends on difficulty or intelligence, because cheating gives a good score regardless.

Let's build a Bayes classifier using these two networks to try to identify students who are cheating just based on this network structure.

In [22]:
X = numpy.array([[0, 0, 1, 0, 0],
                 [1, 0, 1, 1, 0],
                 [1, 1, 0, 0, 0],
                 [0, 0, 1, 0, 1],
                 [0, 0, 1, 0, 1],
                 [0, 1, 1, 0, 1],
                 [1, 0, 1, 0, 1],
                 [0, 1, 0, 1, 0]])

clf = BayesClassifier([dN, dC])
clf.predict_proba(X)
Out[22]:
array([[ 0.72058824,  0.27941176],
       [ 0.27272727,  0.72727273],
       [ 0.96428571,  0.03571429],
       [ 0.72058824,  0.27941176],
       [ 0.72058824,  0.27941176],
       [ 0.77777778,  0.22222222],
       [ 0.96610169,  0.03389831],
       [ 0.04402516,  0.95597484]])

The student who is most likely to not be cheating makes a lot of sense. It's a smart student taking an easy class who does not do well in it, but does well on their SAT, and ultimately ends up getting a letter of recommendation. Conversely, the student most likely to be cheating is a poor student taking a difficult class who does not do well on the SAT, yet somehow gets a good grade in the class and does not get a letter.

It is easy to see how these networks could be expanded to allow for more granularity at each level. For example, each grade could be the actual letter grade and SAT performance might be the number of standard deviations above or below the mean. More variables and connections could allow for more sophisticated differences between the networks. No matter how complicated these networks get, they can be fed into pomegranate.

3. Markov Chains

Markov Chains are a simple model based on conditional probability, where a sequence is modelled as the product of conditional probabilities. An n-th order Markov chain looks back n emissions to base its conditional probability on. For example, a 3rd order Markov chain models $P(X_{t}|X_{t-1},X_{t-2},X_{t-3})$.

However, a full Markov chain also needs to model the beginning of the sequence. The first observation can't be modelled using $P(X_{t}|X_{t-1},X_{t-2},X_{t-3})$, but can be modelled by $P(X_{t})$; the second observation has to be modelled by $P(X_{t}|X_{t-1})$, and so on. This means that these lower-order distributions have to be passed into the Markov chain as well.

We can initialize a Markov chain easily enough by passing in a list of the distributions.

In [23]:
a = DiscreteDistribution({'A': 0.10, 'C': 0.40, 'G': 0.40, 'T': 0.10})
b = ConditionalProbabilityTable([['A', 'A', 0.10],
                                 ['A', 'C', 0.50],
                                 ['A', 'G', 0.30],
                                 ['A', 'T', 0.10],
                                 ['C', 'A', 0.10],
                                 ['C', 'C', 0.40],
                                 ['C', 'T', 0.40],
                                 ['C', 'G', 0.10],
                                 ['G', 'A', 0.05],
                                 ['G', 'C', 0.45],
                                 ['G', 'G', 0.45],
                                 ['G', 'T', 0.05],
                                 ['T', 'A', 0.20],
                                 ['T', 'C', 0.30],
                                 ['T', 'G', 0.30],
                                 ['T', 'T', 0.20]], [a])

model = MarkovChain([a, b])

Markov chains have the log_probability, fit, summarize, and from_summaries methods implemented. They do not have classification capabilities by themselves, but when combined with a Naive Bayes classifier they can be used to discriminate between multiple models.

Let's see the log probability of some data.

In [24]:
model.log_probability( list('CAGCATCAGT') ) 
Out[24]:
-17.532789486599906
In [25]:
model.log_probability( list('C') )
Out[25]:
-0.916290731874155
In [26]:
model.log_probability( list('CACATCACGACTAATGATAAT') )
Out[26]:
-38.55615991599665
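As a sanity check, the first value above can be reproduced by hand: the first symbol is scored by the initial distribution and every later symbol by the transition table.

import math

initial = {'A': 0.10, 'C': 0.40, 'G': 0.40, 'T': 0.10}
transition = {('A','A'): 0.10, ('A','C'): 0.50, ('A','G'): 0.30, ('A','T'): 0.10,
              ('C','A'): 0.10, ('C','C'): 0.40, ('C','T'): 0.40, ('C','G'): 0.10,
              ('G','A'): 0.05, ('G','C'): 0.45, ('G','G'): 0.45, ('G','T'): 0.05,
              ('T','A'): 0.20, ('T','C'): 0.30, ('T','G'): 0.30, ('T','T'): 0.20}

seq = 'CAGCATCAGT'
logp = math.log(initial[seq[0]])
for prev, curr in zip(seq, seq[1:]):
    logp += math.log(transition[prev, curr])

print logp    # -17.5327894866, matching model.log_probability above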

We can fit the model to sequences which we pass in, and as expected, see that these sequences subsequently have a higher likelihood.

In [27]:
model.fit( map( list, ('CAGCATCAGT', 'C', 'ATATAGAGATAAGCT', 'GCGCAAGT', 'GCATTGC', 'CACATCACGACTAATGATAAT') ) )
print model.log_probability( list('CAGCATCAGT') ) 
print model.log_probability( list('C') )
print model.log_probability( list('CACATCACGACTAATGATAAT') )
-9.49627091139
-0.69314718056
-25.2575143893

After fitting, we can take a look at the initial distribution and see that it has different parameters now.

In [28]:
print model.distributions[0] 
{
    "frozen" :false,
    "class" :"Distribution",
    "parameters" :[
        {
            "A" :0.16666666666666666,
            "C" :0.5,
            "T" :0.0,
            "G" :0.3333333333333333
        }
    ],
    "name" :"DiscreteDistribution"
}

4. General Mixture Models

It is frequently the case that the data you have is not explained by a single underlying distribution. If we want to try to recover the underlying distributions, we need to have a model which has multiple components. An example is the following data.

In [29]:
data = np.concatenate((np.random.normal(1, 2, (250, 1)), np.random.normal(7.85, 1.2, (500, 1))))
numpy.random.shuffle(data)

plt.figure(figsize=(10, 4))
plt.hist( data, edgecolor='c', color='c', bins=50 )
plt.show()

This data is clearly made up of two Gaussian distributions, one centered around 8, and one centered around 1. To be exact, the distributions from which the samples were generated are centered around 1 and 7.85 respectively. If we had labels for each sample, we could easily train a Naive Bayes classifier using this data, or just learn the two underlying distributions simply by grouping points with the same label and fitting a Gaussian to that data.

Unfortunately, we don't have labels for these data points. Without labels for this data, we can't directly fit a Gaussian distribution to some points. If we had explicit parameters for a distribution, we could try to infer labels based on the likelihood of each sample under each distribution (like a Naive Bayes from earlier), but we don't have explicit parameters for the distributions. We're stuck in this chicken-and-egg problem of being able to figure out one given the other, but ultimately having neither.

We can solve this problem using the Expectation-Maximization (EM) algorithm. The gist is that you start off with a rough estimate of the parameters (such as randomly choosing a sample to be the $\mu$ and using unit variance for each component in a 1d Gaussian mixture), and then you iterate between inferring labels and performing parameter updates.

Expectation: Calculate 'responsibility' of each component for each sample, which is the probability that component generated that sample. This sums to 1 for each sample.

Maximization: Use weighted maximum likelihood estimates to update the parameters of each component, with the weight being the probability of that component generating that sample.

The difficulty can be in getting an appropriate initial estimate. pomegranate currently defaults to using a single iteration of k-means clustering in order to get initial estimates. Those clusters are then used to fit the initial distributions, and EM is run to refine those estimates until convergence.
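For concreteness, here is a compact numpy sketch of the E and M steps described above for a 1d mixture of two Gaussians; pomegranate's actual implementation is in cython and also handles the k-means initialization, so this is only meant to show the shape of the algorithm.

from scipy.stats import norm

def em_1d(data, mu, sigma, weights, n_iter=10):
    for _ in range(n_iter):
        # E step: responsibility of each component for each sample (columns sum to 1)
        p = numpy.array([w * norm.pdf(data, m, s)
                         for w, m, s in zip(weights, mu, sigma)])
        r = p / p.sum(axis=0)

        # M step: weighted maximum likelihood updates for each component
        mu = (r * data).sum(axis=1) / r.sum(axis=1)
        sigma = numpy.sqrt((r * (data - mu[:, None]) ** 2).sum(axis=1) / r.sum(axis=1))
        weights = r.sum(axis=1) / len(data)

    return mu, sigma, weights

# rough initial guesses; the estimates should drift toward roughly 1 and 7.85
print em_1d(data.ravel(), mu=numpy.array([0., 5.]),
            sigma=numpy.array([1., 1.]), weights=numpy.array([0.5, 0.5]))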

In [30]:
model = GeneralMixtureModel.from_samples(NormalDistribution, 2, data, verbose=True)
Improvement: 0.935615744235
Improvement: 0.213706399069
Improvement: 0.0611823687568
Total Improvement: 1.21050451206

Let's take a look both at the underlying data which was used, and at the resulting distributions and weights (prior probabilities) which were learned by the model.

In [31]:
a, b = model.distributions

plt.figure(figsize=(10, 4))

x = numpy.arange(-4, 12, 0.1)
p = numpy.exp(model.log_probability(x))
plt.plot(x, p, color='r')
plt.hist( data, edgecolor='c', color='c', normed=True, bins=50)

plt.ylabel("Probability", fontsize=14)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.xlim(-4, 12)
plt.show()

print numpy.exp(model.weights)
[ 0.6693856  0.3306144]

It seems like it did a fairly good job on this simple example. The two learned distributions (in the background) match what we expected by eye. In addition, the weights are pretty much spot on with the process which randomly generated the data (one third of the samples in the left hand group, two thirds in the right hand group).

We can now use these underlying distributions, and Bayes rule, to calculate the probabilities of samples having been generated from the distributions (the posterior probability, P(M|D)). This is the exact same calculation as the Naive Bayes classifier, since we have some underlying models and we want to know which one was most likely to have generated the sample.

In [32]:
y = model.predict(data)

plt.figure(figsize=(10, 4))

plt.hist( data[y==0], edgecolor='b', color='b', normed=True, bins=25 )
plt.hist( data[y==1], edgecolor='r', color='r', normed=True, bins=25 )

plt.ylabel("Probability", fontsize=14)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.xlim(-4, 12)
plt.show()

Just like with the Naive Bayes classifier, we can plot the posterior probability of the two models (the distributions) over a grid of points. Since the two distributions are further apart and have more similar variances than in the Naive Bayes example, we see a cleaner logistic curve representing the change in posterior probability between the two models.
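To see why the curve looks logistic, the posterior of the second component can be rewritten in terms of the log likelihoods and log priors of the two components:

\begin{equation} P(M_{2}|D) = \frac{1}{1 + \exp \left( \log P(D|M_{1}) + \log P(M_{1}) - \log P(D|M_{2}) - \log P(M_{2}) \right)} \end{equation}

When the two components are Gaussians with equal variance, the difference of log likelihoods is linear in the input, so the posterior traces out exactly a logistic curve; the more the variances differ, the more a quadratic term bends it away from that shape.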

In [33]:
x = numpy.array([numpy.arange(-3, 11, .2)]).T
y = model.predict_proba(x)

plt.figure(figsize=(10, 5))

plt.hist( data, edgecolor='c', color='c', normed=True, bins=50, alpha=0.3 )
plt.scatter(x, y[:,0]*0.30, c='b', edgecolor='b',  label="P(Model 1|D)")
plt.scatter(x, y[:,1]*0.30, c='r', edgecolor='r', label="P(Model 2|D)")

plt.ylabel("Probability", fontsize=14)
plt.xticks(fontsize=14)
plt.yticks(numpy.arange(0, 0.31, 0.06), numpy.arange(0, 1.1, 0.2), fontsize=14)
plt.xlim(-4, 12)
plt.legend(loc=6, fontsize=14)
plt.show()

The results of this make sense. But this is still pretty cool: we can take entirely unlabelled data and derive results similar to a Naive Bayes classifier, which requires labeled data.

The reason the model is called a General Mixture Model is that it supports mixtures of all types of distributions, not just Gaussians. This is one reason I chose not to abbreviate it to GMM, just to make this point explicit. We can create a mixture model of exponential distributions just as easily.

In [34]:
data = numpy.concatenate([ numpy.random.exponential(5, (1000, 1)), numpy.random.exponential(0.25, (1000, 1))])

plt.figure(figsize=(10, 5))
plt.hist( data[data < 20], edgecolor='c', color='c', normed=True, bins=100 )

plt.ylabel("Probability", fontsize=14)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.legend(loc=6, fontsize=14)
plt.show()
/home/jmschr/anaconda2/lib/python2.7/site-packages/matplotlib/axes/_axes.py:545: UserWarning: No labeled objects found. Use label='...' kwarg on individual plots.
  warnings.warn("No labeled objects found. "

The above data was created from two exponential distributions: one with a high peak which decays quickly (numpy scale 0.25, i.e. rate $\lambda = 4$), and one with a lower peak which decays slowly over time (numpy scale 5, i.e. rate $\lambda = 0.2$). Let's first fit a single exponential distribution to the data and see what it looks like.

In [35]:
model = ExponentialDistribution.from_samples(data)

x = numpy.arange(0, 20, 0.1)
y = map(model.probability, x)
plt.plot(x, y, color='r')
plt.hist(data[data < 20], edgecolor='c', facecolor='c', normed=True, bins=250)

plt.ylabel("Probability", fontsize=14)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.xlim(0, 20)
plt.legend(loc=6, fontsize=14)
plt.show()

Doesn't quite look like it is fitting the underlying data. Let's now fit a two component mixture model and see what we get.

In [36]:
model = GeneralMixtureModel.from_samples(ExponentialDistribution, 2, data, verbose=True, stop_threshold=1e-2)
Improvement: 87.5176450524
Improvement: 75.6222821658
Improvement: 67.8899702437
Improvement: 54.2432197872
Improvement: 37.7426503742
Improvement: 22.6741117622
Improvement: 11.9478628919
Improvement: 5.74704750839
Improvement: 2.62637502162
Improvement: 1.1700328187
Improvement: 0.514874542091
Improvement: 0.225203320711
Improvement: 0.0981964821772
Improvement: 0.042745321543
Improvement: 0.0185896465778
Improvement: 0.00808004362989
Total Improvement: 368.088886983
In [37]:
a, b = model.distributions
print a
print b
{
    "frozen" :false,
    "class" :"Distribution",
    "parameters" :[
        0.20738034451595228
    ],
    "name" :"ExponentialDistribution"
}
{
    "frozen" :false,
    "class" :"Distribution",
    "parameters" :[
        4.015765553450299
    ],
    "name" :"ExponentialDistribution"
}

Looks like we can get close, but not perfect. The fitted rates of $\lambda \approx 4.0$ and $\lambda \approx 0.21$ are close to the true rates of 4 and 0.2 respectively (numpy's exponential is parameterized by the scale $1/\lambda$, so the scales of 0.25 and 5 used to generate the data correspond to rates of 4 and 0.2). This is likely made more difficult by the two exponential distributions overlapping each other significantly and sharing a mode of 0. Let's take a look at what the distributions look like superimposed with the underlying data.
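If the parameterization seems confusing, a quick sanity check is to fit a single exponential distribution to samples drawn with a known numpy scale; numpy's exponential takes the scale $1/\lambda$ while the fitted distribution reports the rate $\lambda$, so the fitted parameter should land near the reciprocal of the scale.

samples = numpy.random.exponential(5.0, 100000)                  # scale 5, i.e. rate 0.2
print ExponentialDistribution.from_samples(samples).parameters   # approximately [0.2]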

In [38]:
x = numpy.arange(0, 20, 0.1)
y = model.probability(x)
plt.plot(x, y, color='r')
plt.hist(data[data < 20], edgecolor='c', facecolor='c', normed=True, bins=250)

plt.ylabel("Probability", fontsize=14)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.xlim(0, 20)
plt.legend(loc=6, fontsize=14)
plt.show()

The fit doesn't seem too bad. Much better than we would get using Gaussians. We can also calculate the log probability under either of the distributions, and the log probability of the data under the model. This returns the log likelihood of the data, which is log P(D|M), not the posterior P(M|D).

In [39]:
print "Mixture Fit: ", model.log_probability(data).sum()
print "Exp 1 Fit :  ", sum(a.log_probability(d) for d in data)
print "Exp 2 Fit :  ", sum(b.log_probability(d) for d in data)
Mixture Fit:  -3134.85530982
Exp 1 Fit :   -4215.57091754
Exp 2 Fit :   -17923.2112538

Continuing with the theme of being a ~General~ Mixture Model, you can also manually specify any combination of pre-instantiated distributions of any types, with any priors. In this case, let's try out using an exponential distribution and a uniform distribution together, with the exponential distribution modelling the spike near 0, and the uniform distribution modelling the tail.

In [40]:
model = GeneralMixtureModel([ExponentialDistribution(5), UniformDistribution(0, 20)])
model.fit(data, stop_threshold=1e-1, verbose=True)

x = numpy.arange(0, 20, 0.1)
y = model.probability(x)
plt.plot(x, y, color='r')
plt.hist(data[data < 20], edgecolor='c', facecolor='c', normed=True, bins=250)


plt.ylabel("Probability", fontsize=14)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.xlim(0, 20)
plt.legend(loc=6, fontsize=14)
plt.show()
Improvement: 1445.28046817
Improvement: 168.60089881
Improvement: 55.0567980666
Improvement: 23.3706549009
Improvement: 11.2217235544
Improvement: 5.76738335307
Improvement: 3.09542728326
Improvement: 1.71181971061
Improvement: 0.967311093627
Improvement: 0.555393796112
Improvement: 0.322726013158
Improvement: 0.189237034399
Improvement: 0.11173351242
Improvement: 0.0663230027617
Total Improvement: 1716.3178983

That doesn't look better than the previous fit, but this is just an example of what you ~can~ do in pomegranate, not necessarily what you ~should~ do.

Speed

We can compare the speed of pomegranate's GMM implementation to sklearn's GMM implementation for Gaussians.

In [41]:
from sklearn.mixture import GaussianMixture

def evaluate_models():
    sizes = numpy.around( numpy.exp( numpy.arange(7, 13) ) ).astype('int')
    n, m = sizes.shape[0], 5
    
    skl_predict, pom_predict = numpy.zeros((m, n)), numpy.zeros((m, n))
    skl_fit, pom_fit = numpy.zeros((m, n)), numpy.zeros((m, n))
    skl_error, pom_error = numpy.zeros((m, n)), numpy.zeros((m, n))

    for i in range(m):
        for j, size in enumerate(sizes):
            X, y = create_dataset(size, 1, 2)
            
            # bench fit times
            tic = time.time()
            skl = GaussianMixture(n_components=2, max_iter=10, tol=1e-10)
            skl.fit(X)
            skl_fit[i,