Notebook

This is one of the 100 recipes of the IPython Cookbook, the definitive guide to high-performance scientific computing and data science in Python.

7.5. Fitting a probability distribution to data with the maximum likelihood method¶

You need the statsmodels package to retrieve the test dataset. (http://statsmodels.sourceforge.net)

Statsmodels is a Python package for conducting statistical data analyses. It also contains real-world datasets that one can use when experimenting new methods. Here, we load the heart dataset.

In [ ]:

import numpy as np
import scipy.stats as st
import statsmodels.datasets
import matplotlib.pyplot as plt
%matplotlib inline

In [ ]:

data = statsmodels.datasets.heart.load_pandas().data

Let's take a look at this DataFrame.

In [ ]:

data.tail()

This dataset contains censored and uncensored data: a censor of 0 means that the patient was alive at the end of the study, so that we don't know the exact survival. We only know that the patient survived at least the indicated number of days. For simplicity here, we only keep uncensored data (we thereby create a bias toward patients that did not survive very long after their transplant...).

In [ ]:

data = data[data.censors==1]
survival = data.survival

Let's take a look at the data graphically, by plotting the raw survival data and the histogram.

In [ ]:

plt.figure(figsize=(10,4));
plt.subplot(121);
plt.plot(sorted(survival)[::-1], 'o');
plt.xlabel('Patient');
plt.ylabel('Survival time (days)');
plt.subplot(122);
plt.hist(survival, bins=15);
plt.xlabel('Survival time (days)');
plt.ylabel('Number of patients');

We observe that the histogram is decreasing very rapidly. Fortunately, the survival rates are today much higher (~70% after 5 years). Let's try to fit an exponential distribution to the data. According to this model, $S$ (number of days of survival) is an exponential random variable with parameter $\lambda$, and the observations $s_i$ are sampled from this distribution. Let:

$$\overline s = \frac 1 n \sum s_i$$

be the sample mean. The likelihood function of an exponential distribution is, by definition (see proof in the next section):

$$\mathcal L(\lambda, \{s_i\}) = P(\{s_i\} | \lambda) = \lambda^n \exp\left(-\lambda n \overline s\right)$$

The maximum likelihood estimate for the rate parameter is, by definition, the $\lambda$ that maximizes the likelihood function. In other words, it is the parameter that maximizes the probability of observing the data, assuming that the observations are sampled from an exponential distribution.

Here, it can be shown that the likelihood function has a maximum when $\lambda = 1/\overline s$, which is the maximum likelihood estimate for the rate parameter. Let's compute this parameter numerically.

In [ ]:

smean = survival.mean()
rate = 1./smean

To compare the fitted exponential distribution to the data, we first need to generate linearly spaced values for the x axis (days).

In [ ]:

smax = survival.max()
days = np.linspace(0., smax, 1000)
dt = smax / 999.  # bin size: interval between two
                  # consecutive values in `days`

We can obtain the probability density function of the exponential distribution with SciPy. The parameter is the scale, the inverse of the estimated rate.

In [ ]:

dist_exp = st.expon.pdf(days, scale=1./rate)

Now, let's plot the histogram and the obtained distribution. We need to rescale the theoretical distribution to the histogram (depending on the bin size and the total number of data points).

In [ ]:

nbins = 30
plt.figure(figsize=(5,3));
plt.hist(survival, nbins);
plt.plot(days, dist_exp*len(survival)*smax/nbins,
         '-r', lw=3);

The fit is far from perfect... We were able to find an analytical formula for the maximum likelihood estimate here. In more complex situations, this is not always possible, so that one needs to resort to numerical methods. SciPy actually integrates numerical maximum likelihood routines for a large number of distributions. Here, we use this other method to estimate the parameter of the exponential distribution.

In [ ]:

dist = st.expon
args = dist.fit(survival); args

We can use these parameters to perform a Kolmogorov-Smirnov test, which assesses the goodness of fit of the distribution with respect to the data. This test is based on a distance between the empirical distribution function of the data and the cumulative distribution function (CDF) of the reference distribution.

In [ ]:

st.kstest(survival, dist.cdf, args)

The second output value is the p-value. Here, it is very low: the null hypothesis (stating that the observed data stems from an exponential distribution with a maximum likelihood rate parameter) can be rejected with high confidence. Let's try another distribution, the Birnbaum-Sanders distribution, which is typically used to model failure times. (http://en.wikipedia.org/wiki/Birnbaum-Saunders_distribution)

In [ ]:

dist = st.fatiguelife
args = dist.fit(survival)
st.kstest(survival, dist.cdf, args)

This time, the p-value is 0.07, so that we would not reject the null hypothesis with a 5% confidence level. When plotting the resulting distribution, we observe a better fit than with the exponential distribution.

In [ ]:

dist_fl = dist.pdf(days, *args)
nbins = 30
plt.figure(figsize=(5,3));
plt.hist(survival, nbins);
plt.plot(days, dist_exp*len(survival)*smax/nbins,
         '-r', lw=3, label='exp');
plt.plot(days, dist_fl*len(survival)*smax/nbins,
         '--g', lw=3, label='BS');
plt.xlabel("Survival time (days)");
plt.ylabel("Number of patients");
plt.legend();

You'll find all the explanations, figures, references, and much more in the book (to be released later this summer).

IPython Cookbook, by Cyrille Rossant, Packt Publishing, 2014 (500 pages).