I've recently been trying to characterise grainsize data, which are typically Weibull or lognormally distributed. The scipy.stats module has stats.distribution.fit() functions that use Maximum Likelihood Estimation to fit a curve to lists of individual measurements, but my data have been binned (by sieving or by a laser particle sizer). Fitting a curve to binned (or percentile / quantile) data requires a different approach: calculate a cumulative distribution, then fit it by least squares, using Python's lambda functions to fix any parameters that you want to hold constant.
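For comparison, here is a minimal sketch of the Maximum Likelihood route that scipy provides for *unbinned* data (using simulated measurements, not real grainsize data; `floc=0` pins the location parameter at zero):

```python
import numpy as np
from scipy import stats

# Simulated individual measurements from a lognormal distribution
# (median 40, geometric standard deviation 2)
raw = stats.lognorm.rvs(np.log(2.0), loc=0, scale=40.0, size=1000,
                        random_state=42)

# Built-in MLE fit; only works when you have the raw measurements
shape, loc, scale = stats.lognorm.fit(raw, floc=0)
print(shape, scale)  # shape should be close to log(2) ~ 0.69, scale close to 40
```

If all you have are bin counts or cumulative percentages, this route isn't available, which is what motivates the least-squares approach below.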

I have prepared this notebook to remind myself and to help others. Note that you can download the notebook from the link at the top of this page and play with it yourself.

### Fitting a probability distribution to binned or percentile data

import numpy as np # NumPy for arrays and random variates
import matplotlib.pyplot as plt # Matplotlib for plotting
from scipy import stats # Import the scipy.stats module
from scipy.optimize import curve_fit # Import the curve fitting module

## SET UP INPUT DATA
# Define a lognormal probability distribution from median and geometric
# standard deviation (see link below for tips on using lognormal
# distributions in Python)
s = 2.0 # Geometric standard deviation
M = 40.0 # Geometric mean == median
shape_in = np.log(s) # Scipy's shape parameter corresponds to log(geo. std. dev.)
scale_in = M # Scipy's scale parameter corresponds to median
# Calculate histogram for random variates (this is equivalent to my input data)
xx = stats.lognorm.rvs(shape_in, loc=0, scale=scale_in, size=1000) # Generate data
hist, bin_edges = np.histogram(xx, bins=50, density=True) # Calculate histogram
x_hist = bin_edges[1:] # Take the upper edges of the bins
y_hist = hist.cumsum()/hist.cumsum().max()  # Normalise the cumulative sum
y_hist = y_hist + np.random.normal(0, 0.02, len(y_hist)) # add some noise
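With real sieve data you usually skip the histogram step entirely: the cumulative mass fraction finer than each aperture plays the role of `y_hist`, and the apertures play the role of `x_hist`. A sketch with hypothetical sieve results (the aperture sizes and masses here are made up for illustration):

```python
import numpy as np

# Hypothetical sieve stack: upper aperture of each size class (microns)
# and mass retained in that class (grams)
apertures = np.array([10.0, 20.0, 40.0, 80.0, 160.0])
mass = np.array([5.0, 20.0, 35.0, 30.0, 10.0])

# Cumulative mass fraction finer than each aperture, normalised to 1
y_cdf = np.cumsum(mass) / mass.sum()
print(y_cdf)  # -> [0.05 0.25 0.6  0.9  1.  ]
```

These `(apertures, y_cdf)` pairs can be passed straight to `curve_fit` in place of `(x_hist, y_hist)` below.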

## FIT THE DISTRIBUTION
(shape_out, scale_out), pcov = curve_fit(
    lambda xdata, shape, scale: stats.lognorm.cdf(xdata, shape, loc=0, scale=scale),
    x_hist, y_hist, p0=[np.log(2), 40]) # p0 gives curve_fit its starting guesses

## HOW IT WORKS
# curve_fit takes a *function*, xdata and ydata as inputs.  It then varies the free
# parameters of the *function* to get the best fit.  In this case, we want to fit the
# cumulative distribution function, keeping the location parameter fixed at 0.  We use
# a lambda function as our input function: a temporary function that wraps the
# cumulative distribution function with the location argument already set.
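The lambda trick is easiest to see on a toy problem. This sketch (made-up data points) fits y = a*x + b while holding b fixed at 0, and also shows how to read a one-sigma uncertainty out of the covariance matrix `pcov` that curve_fit returns:

```python
import numpy as np
from scipy.optimize import curve_fit

def line(x, a, b):
    return a * x + b

xdata = np.array([0.0, 1.0, 2.0, 3.0])
ydata = np.array([0.1, 2.1, 3.9, 6.1])

# The lambda exposes only 'a' to curve_fit; 'b' is frozen at 0.0
(a_fit,), pcov = curve_fit(lambda x, a: line(x, a, 0.0), xdata, ydata)
a_err = np.sqrt(np.diag(pcov))[0]  # one-sigma uncertainty on the fitted slope
print(a_fit)  # close to 2
```

The same `np.sqrt(np.diag(pcov))` idiom gives uncertainties on `shape_out` and `scale_out` in the lognormal fit above.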

## PLOT THE GRAPH
plt.figure(figsize=(9,4))
ax1=plt.subplot(121)
ax1.hist(xx, bins=25, density=True, rwidth=0.8, alpha=0.5, label='Binned data') # normed was deprecated; use density
x = np.linspace(0.1, 150, num=25) # values for x-axis
ax1.plot(x,stats.lognorm.pdf(x, shape_out, loc=0, scale=scale_out), 'r-', lw=3, label='Fitted distribution')
plt.xlim(0,150)
plt.xlabel('x')
plt.ylabel('y')
leg=ax1.legend()
ax2=plt.subplot(122)
ax2.plot(x_hist, y_hist, 'ro', label='Binned data points')
ax2.plot(x,stats.lognorm.cdf(x, shape_out, loc=0, scale=scale_out), label='Fitted distribution')
plt.xlim(0,150)
ax2.set_ylim(0,1.0)
plt.xlabel('x')
plt.ylabel('y_cdf')
leg=ax2.legend(loc='lower right', numpoints=1)
results_txt="""M_in=%.2f
M_out=%.2f
s_in=%.2f
s_out=%.2f""" % (M, scale_out, s, np.exp(shape_out))
txt=plt.text(0.97, 0.3, results_txt, transform=ax2.transAxes, horizontalalignment='right', fontsize='large')
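The same recipe carries straight over to the Weibull distribution mentioned at the start: swap in `stats.weibull_min`, fix `loc=0` with the lambda, and fit the shape and scale. A sketch on synthetic data (the true parameters here, c=1.5 and scale=50, are arbitrary choices for illustration):

```python
import numpy as np
from scipy import stats
from scipy.optimize import curve_fit

# Synthetic Weibull "measurements", then binned into a cumulative curve
data = stats.weibull_min.rvs(1.5, loc=0, scale=50.0, size=1000, random_state=0)
hist, edges = np.histogram(data, bins=50, density=True)
x_w = edges[1:]                               # upper edges of the bins
y_w = hist.cumsum() / hist.cumsum().max()     # normalised cumulative sum

# Fit the Weibull CDF with location frozen at 0
(c_out, scale_w), _ = curve_fit(
    lambda x, c, scale: stats.weibull_min.cdf(x, c, loc=0, scale=scale),
    x_w, y_w, p0=[1.0, 40.0])
print(c_out, scale_w)  # should recover roughly c=1.5, scale=50
```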


### References

I am a geologist doing postdoctoral research at Edinburgh University. I blog and tweet about volcanoes, Iceland, GIS/maps & open source science software.