You need the statsmodels package to retrieve the test dataset. (http://statsmodels.sourceforge.net)

- Statsmodels is a Python package for conducting statistical data analyses. It also contains real-world datasets that one can use when experimenting new methods. Here, we load the
*heart*dataset.

In [ ]:

```
import numpy as np
import scipy.stats as st
import statsmodels.datasets
import matplotlib.pyplot as plt
%matplotlib inline
```

In [ ]:

```
data = statsmodels.datasets.heart.load_pandas().data
```

- Let's take a look at this DataFrame.

In [ ]:

```
data.tail()
```

*at least* the indicated number of days. For simplicity here, we only keep uncensored data (we thereby create a bias toward patients that did not survive very long after their transplant...).

In [ ]:

```
data = data[data.censors==1]
survival = data.survival
```

- Let's take a look at the data graphically, by plotting the raw survival data and the histogram.

In [ ]:

```
plt.figure(figsize=(10,4));
plt.subplot(121);
plt.plot(sorted(survival)[::-1], 'o');
plt.xlabel('Patient');
plt.ylabel('Survival time (days)');
plt.subplot(122);
plt.hist(survival, bins=15);
plt.xlabel('Survival time (days)');
plt.ylabel('Number of patients');
```

- We observe that the histogram is decreasing very rapidly. Fortunately, the survival rates are today much higher (~70% after 5 years). Let's try to fit an exponential distribution to the data. According to this model, $S$ (number of days of survival) is an exponential random variable with parameter $\lambda$, and the observations $s_i$ are sampled from this distribution. Let:

be the sample mean. The likelihood function of an exponential distribution is, by definition (see proof in the next section):

$$\mathcal L(\lambda, \{s_i\}) = P(\{s_i\} | \lambda) = \lambda^n \exp\left(-\lambda n \overline s\right)$$The **maximum likelihood estimate** for the rate parameter is, by definition, the $\lambda$ that maximizes the likelihood function. In other words, it is the parameter that maximizes the probability of observing the data, assuming that the observations are sampled from an exponential distribution.

Here, it can be shown that the likelihood function has a maximum when $\lambda = 1/\overline s$, which is the *maximum likelihood estimate for the rate parameter*. Let's compute this parameter numerically.

In [ ]:

```
smean = survival.mean()
rate = 1./smean
```

In [ ]:

```
smax = survival.max()
days = np.linspace(0., smax, 1000)
dt = smax / 999. # bin size: interval between two
# consecutive values in `days`
```

In [ ]:

```
dist_exp = st.expon.pdf(days, scale=1./rate)
```

In [ ]:

```
nbins = 30
plt.figure(figsize=(5,3));
plt.hist(survival, nbins);
plt.plot(days, dist_exp*len(survival)*smax/nbins,
'-r', lw=3);
```

In [ ]:

```
dist = st.expon
args = dist.fit(survival); args
```

- We can use these parameters to perform a
**Kolmogorov-Smirnov test**, which assesses the goodness of fit of the distribution with respect to the data. This test is based on a distance between the**empirical distribution function**of the data and the**cumulative distribution function**(CDF) of the reference distribution.

In [ ]:

```
st.kstest(survival, dist.cdf, args)
```

*Birnbaum-Sanders distribution*, which is typically used to model failure times. (http://en.wikipedia.org/wiki/Birnbaum-Saunders_distribution)

In [ ]:

```
dist = st.fatiguelife
args = dist.fit(survival)
st.kstest(survival, dist.cdf, args)
```

In [ ]:

```
dist_fl = dist.pdf(days, *args)
nbins = 30
plt.figure(figsize=(5,3));
plt.hist(survival, nbins);
plt.plot(days, dist_exp*len(survival)*smax/nbins,
'-r', lw=3, label='exp');
plt.plot(days, dist_fl*len(survival)*smax/nbins,
'--g', lw=3, label='BS');
plt.xlabel("Survival time (days)");
plt.ylabel("Number of patients");
plt.legend();
```

You'll find all the explanations, figures, references, and much more in the book (to be released later this summer).

IPython Cookbook, by Cyrille Rossant, Packt Publishing, 2014 (500 pages).