Notes:

- $\log(x)$ denotes the natural logarithm of $x$
- This notebook refers to the `scipy.stats` module as just `stats`

- See here for more info: http://en.wikipedia.org/wiki/Log-normal_distribution

Using scipy, you can easily use maximum likelihood estimation (MLE) to determine the parameters describing the distribution of a sample of data. However, you have to get the terminology straight for each type of distribution. This notebook aims to clear that up for lognormal distributions.

If you want to estimate the location $(\mu)$ and scale $(\sigma)$ (as defined by Wikipedia) that describe a lognormally distributed sample $(X)$ with `stats.lognorm.fit`, do the following:

- fit the distribution, making sure to feed a guess that scipy's `loc` is 0
- set $\sigma$ to scipy's shape parameter
- set $\mu$ to the natural log of scipy's scale parameter
- check to confirm that scipy's `loc` parameter is close to 0
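In code, the recipe above looks roughly like this (a sketch; the "true" parameters here are made up for illustration):

```python
import numpy
from scipy import stats

numpy.random.seed(0)
# made-up "true" parameters for illustration
X = numpy.random.lognormal(mean=2.25, sigma=0.875, size=37000)

# fit, feeding a guess that scipy's loc is 0
shape, loc, scale = stats.lognorm.fit(X, loc=0)

sigma_hat = shape           # sigma is scipy's shape parameter
mu_hat = numpy.log(scale)   # mu is the natural log of scipy's scale
print(mu_hat, sigma_hat, loc)  # loc should come out close to 0
```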

If you already know $\mu$ and $\sigma$ of your distribution, then you can create that distribution in the manner below:

`myDist = stats.lognorm(sigma, loc=0, scale=numpy.exp(mu))`
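As a quick sanity check of that construction (a sketch with hypothetical values): the median of a lognormal distribution is $e^\mu$ and its mean is $e^{\mu + \sigma^2/2}$, and the frozen scipy object agrees.

```python
import numpy
from scipy import stats

mu, sigma = 2.25, 0.875  # hypothetical example values
myDist = stats.lognorm(sigma, loc=0, scale=numpy.exp(mu))

print(numpy.isclose(myDist.median(), numpy.exp(mu)))                 # True
print(numpy.isclose(myDist.mean(), numpy.exp(mu + 0.5 * sigma**2)))  # True
```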

In [1]:

```
import numpy
from matplotlib import pyplot
import scipy
from scipy import stats
import seaborn
seaborn.set()
%matplotlib inline
print('numpy: {}'.format(numpy.version.full_version))
print('scipy: {}'.format(scipy.version.full_version))
```

A (perhaps over-)simplification of the lognormal distribution would state that a sample $(X)$ is lognormally distributed if $\log(X)$ is normally distributed. This implies that the value of a sample from a population that is *truly* distributed lognormally can never be zero. You can get really dang close, but not actually there.

So let's define the parameters that describe the distribution of $\log(X)$ as $\mu$ (kind of the mean) and $\sigma$ (kind of the standard deviation). In other words:

\begin{equation*} X = e^{(\mu + \sigma \: Z)} \end{equation*}

where $Z$ is a standard normal variable.
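To see that definition in action, here's a sketch (with arbitrary example values for $\mu$ and $\sigma$) that builds $X$ from a standard normal sample and confirms that $\log(X)$ looks normal:

```python
import numpy

numpy.random.seed(0)
mu, sigma = 2.25, 0.875  # arbitrary example values

Z = numpy.random.standard_normal(100000)  # standard normal variable
X = numpy.exp(mu + sigma * Z)             # lognormal by construction

# log(X) recovers a roughly normal sample with mean ~mu and std ~sigma
print(numpy.log(X).mean(), numpy.log(X).std())
```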

Wikipedia calls these the location and scale. For our purposes, let's change that to `wikiLoc` and `wikiScale`. That's a pain in the ass, but it gets worse due to things out of my control, so please bear with me.

In addition to $\mu$ and $\sigma$, we can also talk about the sample mean and the sample standard deviation. Let's call them `wikiM` and `wikiS`.

In python you can compute those very easily with e.g., `numpy.mean(X)`.
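For example (with a made-up sample), and noting that `numpy.std` defaults to the population form (`ddof=0`):

```python
import numpy

X = numpy.array([1.9, 2.8, 4.7, 10.6, 3.1])  # made-up sample
wikiM = numpy.mean(X)  # sample mean -> 4.62
wikiS = numpy.std(X)   # sample standard deviation (ddof=0 by default)
```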

## `numpy`

You can generate a random sample from a lognormal distribution using numpy and `wikiLoc` and `wikiScale`. The wording of the arguments (`mean` and `sigma`) is unfortunate, but I don't expect that to change anytime soon.

In [2]:

```
numpy.random.seed(0)
N = 37000
wikiLoc = 2.25
wikiScale = 0.875
# _n denotes numpy
X_n = numpy.random.lognormal(mean=wikiLoc, sigma=wikiScale, size=N)
```

In [3]:

```
bins = numpy.logspace(-1, 3, num=250)
fig, ax = pyplot.subplots(figsize=(8, 5))
_ = ax.hist(X_n, bins=bins, cumulative=False, histtype='step',
            density=True, lw=1.25, label='Hist of random numpy sample')
ax.set_xscale('log')
ax.set_xlabel(r'$X$')
ax.set_ylabel('PDF')
ax.legend()
```

Out[3]:

We can fit distributions to samples using scipy. The problem is that the terminology is pretty ambiguous. Scipy needs three parameters to fully define a lognormal distribution. Here they are with `sp` as a prefix to differentiate them from Wikipedia's parameters: `spShape`, `spLoc`, and `spScale`.

You might be thinking: `spShape` is weird, but the rest make sense. Not quite:

- `spShape` actually corresponds to `wikiScale`.
- Remember when we said that lognormal samples can't take values at or below 0? Scipy disagrees. `spLoc` is used to translate the PDF of $X$ anywhere along the x-axis. In other words, `spLoc` (almost) *always* should be set to 0.
- `spScale` doesn't directly correspond to `wikiLoc`. It's actually $e^{\mathtt{wikiLoc}}$ (equivalently, `wikiLoc` $= \log(\mathtt{spScale})$).
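One way to convince yourself of that mapping (a sketch with made-up values): the CDF of scipy's lognormal, built with `spShape = sigma` and `spScale = exp(mu)`, should agree with the normal CDF of $\log(x)$ using $\mu$ and $\sigma$ directly.

```python
import numpy
from scipy import stats

mu, sigma = 1.5, 0.6  # made-up example values
x = 3.0

lognorm_cdf = stats.lognorm(sigma, loc=0, scale=numpy.exp(mu)).cdf(x)
norm_cdf = stats.norm(loc=mu, scale=sigma).cdf(numpy.log(x))
print(numpy.isclose(lognorm_cdf, norm_cdf))  # True
```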

The takeaway here is that if you have a real-world sample that you think may be best described by a lognormal distribution, you can't just throw it into `scipy.stats.lognorm.fit` and take the returned parameters at face value.

When using the `.fit` method of `scipy.stats` distributions, you can guide the parameters by specifying a start/guess value. For us, this means doing:

`stats.lognorm.fit(mydata, loc=0)`
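If you're confident the location really is 0, scipy also lets you pin it exactly with `floc=0` instead of merely starting the search there (a sketch, reusing the sample parameters from this notebook):

```python
import numpy
from scipy import stats

numpy.random.seed(0)
data = numpy.random.lognormal(mean=2.25, sigma=0.875, size=37000)

# floc=0 fixes loc at exactly zero during the fit
shape, loc, scale = stats.lognorm.fit(data, floc=0)
print(loc)  # exactly 0.0
print(shape, numpy.log(scale))  # close to 0.875 and 2.25
```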

In [4]:

```
spShape, spLoc, spScale = stats.lognorm.fit(X_n, loc=0)
ln_dist = stats.lognorm(spShape, spLoc, spScale)
print("""
spShape = {:.3f}
spLoc = {:.3f}
spScale = {:.3f}
""".format(spShape, spLoc, spScale))
```

So as we expected, this doesn't really make sense at face value. But information we can use is in there somewhere.

In [5]:

```
fig, ax = pyplot.subplots(figsize=(8, 5))
_ = ax.hist(X_n, bins=bins, cumulative=False, histtype='step',
            density=True, lw=1.25, label='Numpy random number')
# theoretical PDF
x_hat = numpy.logspace(-1, 2, num=1000)
y_hat = ln_dist.pdf(x_hat)
ax.plot(x_hat, y_hat, '-', lw=1.25, label='Fit with scipy')
ax.set_xscale('log')
ax.set_xlabel(r'$X$')
ax.set_ylabel('PDF')
ax.legend()
```

Out[5]:

See? Buried in all that nonsense is really good info. Let's get it.

So let's estimate our sample statistics both directly and from the theoretical formulas in the wiki article:

\begin{equation*} m = e^{\mu + \frac{1}{2}\sigma^2} \end{equation*}

\begin{equation*} s = \sqrt{\left(e^{\sigma^2} - 1 \right) e^{2\mu + \sigma^2}} \end{equation*}

In [6]:

```
wikiM = numpy.exp(wikiLoc + 0.5*wikiScale**2)
wikiM_est = X_n.mean()
wikiS = numpy.sqrt((numpy.exp(wikiScale**2) - 1) * numpy.exp(2*wikiLoc + wikiScale**2))
wikiS_est = X_n.std()
summary = """
\tTheory\tSample
wikiM\t{0:.3f}\t{1:.3f}
wikiS\t{2:.3f}\t{3:.3f}
"""
print(summary.format(wikiM, wikiM_est, wikiS, wikiS_est))
```

Everything is pretty close. Cool.

## `wikiLoc` and `wikiScale` from our sample estimates

In [7]:

```
sampleLoc = numpy.log((wikiM_est**2) / numpy.sqrt(wikiS_est**2 + wikiM_est**2))
sampleScale = numpy.sqrt(numpy.log(1 + (wikiS_est/wikiM_est)**2))
summary = """
\tTheory\tSample
wikiLoc \t{0:.3f}\t{1:.3f}
wikiScale\t{2:.3f}\t{3:.3f}
"""
print(summary.format(wikiLoc, sampleLoc, wikiScale, sampleScale))
```

Again, pretty close, so we're cool.

## `wikiLoc` and `wikiScale` from the `.fit` method

Recall what scipy returned:

In [8]:

```
print("""
spShape = {:.3f}
spLoc = {:.3f}
spScale = {:.3f}
""".format(spShape, spLoc, spScale))
```

But we want the estimates of `wikiLoc` (known = 2.25) and `wikiScale` (0.875) from the scipy fit.

In [9]:

```
sp_wikiLoc = numpy.log(spScale)
sp_wikiScale = spShape
summary = """
\tTheory\tSample\tScipy
wikiLoc \t{0:.3f}\t{1:.3f}\t{2:.3f}
wikiScale\t{3:.3f}\t{4:.3f}\t{5:.3f}
"""
print(summary.format(wikiLoc, sampleLoc, sp_wikiLoc, wikiScale, sampleScale, sp_wikiScale))
```

Phew. It all makes sense.

So now we can use our original `wikiLoc` and `wikiScale` values to freeze a `stats.lognorm` object.

In [10]:

```
frozen_dist = stats.lognorm(wikiScale, loc=0, scale=numpy.exp(wikiLoc))
```

But the theoretical and fit distributions are so close we can't really see the difference.

In [11]:

```
fig, ax = pyplot.subplots(figsize=(8, 5))
_ = ax.hist(X_n, bins=bins, cumulative=False, histtype='step',
            density=True, lw=1.25, alpha=0.5,
            label='Numpy random number')
# theoretical PDF
x_hat = numpy.logspace(-1, 2, num=1000)
y_hat_1 = ln_dist.pdf(x_hat)
y_hat_2 = frozen_dist.pdf(x_hat)
ax.plot(x_hat, y_hat_1, '-', lw=1.25, label='Fit with scipy')
ax.plot(x_hat, y_hat_2, '--', lw=1.25, label='Theoretical')
ax.set_xscale('log')
ax.set_xlabel(r'$X$')
ax.set_ylabel('PDF')
ax.legend()
```

Out[11]: