This cookbook shows how to build a skewed distribution from the normal probability density function and its cumulative distribution function. For an introduction to the skewness of a distribution, see this Wikipedia page.
This cookbook was written in response to this question on Stack Exchange.
%pylab --no-import-all inline
from scipy.special import erf
from scipy.optimize import curve_fit
matplotlib.rcParams["figure.figsize"] = (10, 6)
matplotlib.rcParams['lines.linewidth'] = 2
matplotlib.rcParams["font.family"] = "sans-serif"
matplotlib.rcParams["font.size"] = 20
Populating the interactive namespace from numpy and matplotlib
The normal probability density function reads
\begin{equation} \text{pdf}(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left(- \frac{(x - \mu)^2}{2\sigma^2} \right) \end{equation}
def normpdf(x, mu, sigma):
    return 1 / (sigma * np.sqrt(2 * np.pi)) * np.exp(-(x - mu)**2 / (2 * sigma**2))
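As a quick sanity check (not part of the original recipe), a probability density must integrate to one; a minimal standalone verification of normpdf using numerical integration with np.trapz:

```python
import numpy as np

def normpdf(x, mu, sigma):
    return 1 / (sigma * np.sqrt(2 * np.pi)) * np.exp(-(x - mu)**2 / (2 * sigma**2))

# integrate the standard normal density over a wide enough range;
# the result should be very close to 1
x = np.linspace(-10, 10, 1001)
area = np.trapz(normpdf(x, mu=0, sigma=1), x)
print(area)
```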
The cumulative distribution function reads
\begin{equation} \text{cdf}(x) = \frac{1}{2} \left[1 + \text{erf}\left(\frac{x - \mu}{\sigma\sqrt{2}} \right) \right] \end{equation}
The additional parameter $\alpha$ in the implementation, with default value 1, will be used below to control the skewness.
def normcdf(x, mu, sigma, alpha=1):
    return 0.5 * (1 + erf(alpha * (x - mu) / (sigma * np.sqrt(2))))
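A cumulative distribution function should run from 0 far to the left of $\mu$ up to 1 far to the right, passing through 0.5 at $x = \mu$; a small check of normcdf (an aside, not in the original cookbook):

```python
import numpy as np
from scipy.special import erf

def normcdf(x, mu, sigma, alpha=1):
    return 0.5 * (1 + erf(alpha * (x - mu) / (sigma * np.sqrt(2))))

# far left: ~0, far right: ~1, at the mean: exactly 0.5
print(normcdf(-100, 0, 1))
print(normcdf(100, 0, 1))
print(normcdf(0, 0, 1))
```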
To add skewness to the normal probability density function, it is multiplied by the normal cumulative distribution function. A parameter $\alpha$ is introduced into the cumulative distribution function to adjust the skewness, and we obtain a skewed density function which reads
\begin{equation} \text{skewed}(x) = \frac{A}{2\sigma \sqrt{2\pi}} \, \exp\left(- \frac{(x - \mu)^2}{2\sigma^2} \right) \left[1 + \text{erf}\left(\alpha \, \frac{x - \mu}{\sigma\sqrt{2}} \right) \right] \end{equation}
The prefactor $A$ is added for fitting purposes.
def skewed(x, mu, sigma, alpha, a):
    return a * normpdf(x, mu, sigma) * normcdf(x, mu, sigma, alpha)
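With $A = 2$ this expression is exactly the skew-normal density, so it integrates to one for any value of $\alpha$. A numerical check of that property (an addition to the recipe, not part of the original):

```python
import numpy as np
from scipy.special import erf

def normpdf(x, mu, sigma):
    return 1 / (sigma * np.sqrt(2 * np.pi)) * np.exp(-(x - mu)**2 / (2 * sigma**2))

def normcdf(x, mu, sigma, alpha=1):
    return 0.5 * (1 + erf(alpha * (x - mu) / (sigma * np.sqrt(2))))

def skewed(x, mu, sigma, alpha, a):
    return a * normpdf(x, mu, sigma) * normcdf(x, mu, sigma, alpha)

# with a = 2, each integral should be close to 1 regardless of alpha
x = np.linspace(-20, 20, 2001)
areas = [np.trapz(skewed(x, mu=0, sigma=1, alpha=alpha, a=2), x)
         for alpha in [0, 1, 5, -3]]
print(areas)
```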
The plots below show the effect of the parameter $\alpha$ in the $\text{cdf}$ part of the skewed distribution. With $\alpha = 0$ we recover the normal distribution.
fig, (ax1, ax2) = plt.subplots(ncols=2, sharex=True, sharey=True, figsize=(14, 6))
x = np.linspace(-5, 5, 200)
for alpha in [0, 1, 2, 5, 10]:
    ax1.plot(x, skewed(x, mu=1, sigma=1, alpha=alpha, a=1), label=r"$\alpha$=%d" % alpha)
ax1.set_title("Positive skewness")
ax1.legend(loc="upper left", fontsize=16)
for alpha in [0, 1, 2, 5, 10]:
    ax2.plot(x, skewed(x, mu=1, sigma=1, alpha=-alpha, a=1), label=r"$\alpha$=%d" % -alpha)
ax2.legend(loc="upper left", fontsize=16)
ax2.set_title("Negative skewness")
ax2.set_yticklabels([])
ax2.set_xlim(-4, 6)
fig.subplots_adjust(wspace=0)
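As a cross-check of the definitions above, the skew-normal distribution is also available in scipy.stats as skewnorm (in scipy 0.16 and later); with $a = 2$ the hand-written density should match skewnorm.pdf exactly:

```python
import numpy as np
from scipy.special import erf
from scipy.stats import skewnorm  # requires a recent scipy

def normpdf(x, mu, sigma):
    return 1 / (sigma * np.sqrt(2 * np.pi)) * np.exp(-(x - mu)**2 / (2 * sigma**2))

def normcdf(x, mu, sigma, alpha=1):
    return 0.5 * (1 + erf(alpha * (x - mu) / (sigma * np.sqrt(2))))

def skewed(x, mu, sigma, alpha, a):
    return a * normpdf(x, mu, sigma) * normcdf(x, mu, sigma, alpha)

# compare against scipy's skew-normal with shape a=2, loc=mu, scale=sigma
x = np.linspace(-5, 5, 200)
ours = skewed(x, mu=1, sigma=1, alpha=2, a=2)
ref = skewnorm.pdf(x, a=2, loc=1, scale=1)
print(np.allclose(ours, ref))
```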
In order to fit some data to the skewed distribution, we will use the curve_fit function of scipy.optimize.
help(curve_fit)
Help on function curve_fit in module scipy.optimize.minpack:

curve_fit(f, xdata, ydata, p0=None, sigma=None, absolute_sigma=False, check_finite=True, **kw)
    Use non-linear least squares to fit a function, f, to data.

    Assumes ``ydata = f(xdata, *params) + eps``

    Parameters
    ----------
    f : callable
        The model function, f(x, ...).  It must take the independent
        variable as the first argument and the parameters to fit as
        separate remaining arguments.
    xdata : An M-length sequence or an (k,M)-shaped array
        for functions with k predictors.
        The independent variable where the data is measured.
    ydata : M-length sequence
        The dependent data --- nominally f(xdata, ...)
    p0 : None, scalar, or N-length sequence, optional
        Initial guess for the parameters.  If None, then the initial
        values will all be 1 (if the number of parameters for the function
        can be determined using introspection, otherwise a ValueError
        is raised).
    sigma : None or M-length sequence, optional
        If not None, the uncertainties in the ydata array.  These are used as
        weights in the least-squares problem
        i.e. minimising ``np.sum( ((f(xdata, *popt) - ydata) / sigma)**2 )``
        If None, the uncertainties are assumed to be 1.
    absolute_sigma : bool, optional
        If False, `sigma` denotes relative weights of the data points.
        The returned covariance matrix `pcov` is based on *estimated*
        errors in the data, and is not affected by the overall
        magnitude of the values in `sigma`. Only the relative
        magnitudes of the `sigma` values matter.
        If True, `sigma` describes one standard deviation errors of
        the input data points. The estimated covariance in `pcov` is
        based on these values.
    check_finite : bool, optional
        If True, check that the input arrays do not contain nans of infs,
        and raise a ValueError if they do. Setting this parameter to
        False may silently produce nonsensical results if the input arrays
        do contain nans. Default is True.

    Returns
    -------
    popt : array
        Optimal values for the parameters so that the sum of the squared error
        of ``f(xdata, *popt) - ydata`` is minimized
    pcov : 2d array
        The estimated covariance of popt. The diagonals provide the variance
        of the parameter estimate. To compute one standard deviation errors
        on the parameters use ``perr = np.sqrt(np.diag(pcov))``.
        How the `sigma` parameter affects the estimated covariance
        depends on `absolute_sigma` argument, as described above.

    Raises
    ------
    OptimizeWarning
        if covariance of the parameters can not be estimated.
    ValueError
        if ydata and xdata contain NaNs.

    See Also
    --------
    leastsq

    Notes
    -----
    The algorithm uses the Levenberg-Marquardt algorithm through `leastsq`.
    Additional keyword arguments are passed directly to that algorithm.

    Examples
    --------
    >>> import numpy as np
    >>> from scipy.optimize import curve_fit
    >>> def func(x, a, b, c):
    ...     return a * np.exp(-b * x) + c
    >>> xdata = np.linspace(0, 4, 50)
    >>> y = func(xdata, 2.5, 1.3, 0.5)
    >>> ydata = y + 0.2 * np.random.normal(size=len(xdata))
    >>> popt, pcov = curve_fit(func, xdata, ydata)
First, compute points representing a skewed distribution with some random noise added.
# produce data to fit
sigma = 10
mu = 50
alpha = 2
a = 1
x = np.linspace(0, 100, 100)
y = skewed(x, mu, sigma, alpha=alpha, a=a)
yn = y + 0.002 * np.random.normal(size=len(x))
Now, we use curve_fit to optimize the parameters and then plot the resulting functions.
# fit to skewed distribution
sopt, scov = curve_fit(skewed, x, yn, p0=(20, 20, 1, 1))
y_fit = skewed(x, *sopt)
# fit to normal distribution
gopt, gcov = curve_fit(normpdf, x, yn, p0=(20, 20))
y_gfit = normpdf(x, *gopt)
# plot
#plt.plot(x, y, "r-")
plt.plot(x, yn, "bo", label="data")
plt.plot(x, y_fit, "g", label="fit skewed")
plt.plot(x, y_gfit, "r", label="fit normal")
plt.legend(loc="upper left", fontsize=16)
# parameters
print("Parameters skewed :")
print("-------------------")
print("mu :", sopt[0])
print("sigma :", sopt[1])
print("alpha :", sopt[2])
print("a :", sopt[3])
print("\nParameters normal :")
print("-------------------")
print("mu :", gopt[0])
print("sigma :", gopt[1])
Parameters skewed :
-------------------
mu : 50.0876601397
sigma : 9.76673500085
alpha : 1.94006355233
a : 0.976838902518

Parameters normal :
-------------------
mu : 56.9087655557
sigma : 14.5346539279
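curve_fit also returns the covariance matrix, from which one-standard-deviation errors on the fitted parameters follow as perr = np.sqrt(np.diag(pcov)). A self-contained sketch, regenerating the data with a fixed seed (so the numbers will differ slightly from the run above):

```python
import numpy as np
from scipy.special import erf
from scipy.optimize import curve_fit

def normpdf(x, mu, sigma):
    return 1 / (sigma * np.sqrt(2 * np.pi)) * np.exp(-(x - mu)**2 / (2 * sigma**2))

def normcdf(x, mu, sigma, alpha=1):
    return 0.5 * (1 + erf(alpha * (x - mu) / (sigma * np.sqrt(2))))

def skewed(x, mu, sigma, alpha, a):
    return a * normpdf(x, mu, sigma) * normcdf(x, mu, sigma, alpha)

np.random.seed(0)  # fixed seed so the result is reproducible
x = np.linspace(0, 100, 100)
y = skewed(x, mu=50, sigma=10, alpha=2, a=1)
yn = y + 0.002 * np.random.normal(size=len(x))

sopt, scov = curve_fit(skewed, x, yn, p0=(20, 20, 1, 1))
perr = np.sqrt(np.diag(scov))  # one-standard-deviation errors
for name, value, err in zip(("mu", "sigma", "alpha", "a"), sopt, perr):
    print("%s = %.3f +/- %.3f" % (name, value, err))
```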