In the previous notebook - User models in Sherpa - I showed how you could write a user model to fit the Cumulative Distribution Function (CDF) of a data set with a Gamma distribution. Rather than use the load_user_model function, I showed how to write the model by taking advantage of the Sherpa `Model` class, and the add_model function to register the class with Sherpa.

One question I got from a colleague was why I fit the CDF of the data, rather than its Probability Density Function (PDF). Whilst I am not going to answer that question here, I do aim to show how to fit the PDF, since it:

- lets me talk about writing models for "integrated" data sets
- lets me fit the data using a Maximum-Likelihood statistic, rather than the "least-squares" approach I used previously.

I have a follow-up to this, which is available as the notebook "plotting in Sherpa when using the low-level API". This was written on May 5 2015.

I have updated the notebook for the Sherpa 4.8.2 release in September 2016; you can now use Python 3.5 as well as 2.7. There is one minor change to the notebook, but it is due to a change in the IPython/Jupyter code rather than Sherpa: it is no longer necessary to say

```
import logging
logging.getLogger('sherpa').propagate = False
```

to stop seeing double messages from Sherpa. There has also been a fix to `matplotlib` since I last ran this notebook, which corrects the display of error bars on a logarithmic scale when the point itself is negative.

This was written by Douglas Burke on June 4 2015. This notebook, and others that may be of interest, can be found on GitHub at https://github.com/DougBurke/sherpa-standalone-notebooks.

The information in this document is placed into the Public Domain. It is not an official product of the Chandra X-ray Center, and I make no guarantee that it is free of bugs or embarrassing typos. Please contact me via the GitHub repository or on Twitter - at @doug_burke - if you have any questions.

When was this notebook last run?

In [1]:

```
import datetime
datetime.datetime.now().strftime("%Y-%m-%d %H:%M")
```

Out[1]:

First of all I am going to load in the necessary modules, and set up the CDF model I wrote earlier:

In [2]:

```
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
```

In [3]:

```
%matplotlib inline
```

In [4]:

```
from sherpa import utils
from sherpa.models import model
from sherpa.models.parameter import tinyval
```

In [5]:

```
from sherpa import ui
```

With the imports out of the way, here are routines to calculate the CDF and PDF of a Gamma distribution:

In [6]:

```
def calc_gamma_cdf(x, k, theta):
    """Return the CDF of a gamma distribution.

    The cumulative distribution function (CDF) of the gamma
    distribution is defined in [1]_ as::

        cdf(x;k,theta) = incomplete_gamma(k, x/theta)
                         ----------------------------
                                  gamma(k)

    Parameters
    ----------
    x : array
        The values at which to evaluate the CDF.
    k : number
        The shape parameter, which must be greater than 0.
    theta : number
        The scale parameter, which must be greater than 0.

    Returns
    -------
    cdf : array
        The CDF evaluated at each element of x.

    Notes
    -----
    The mean of the distribution is given by ``k * theta``,
    and the rate is ``1 / theta``.

    References
    ----------
    .. [1] http://en.wikipedia.org/wiki/Gamma_distribution
    """
    # Unfortunately igam does not accept a NumPy array as the
    # second argument, so need to map over the input array.
    #
    tval = theta * 1.0
    kval = k * 1.0
    x = np.asarray(x)
    out = np.zeros_like(x)
    for i, xi in enumerate(x):
        # igam is the "regularized" incomplete gamma (lower)
        # form, so already has the normalization by gamma(a)
        # included.
        out[i] = utils.igam(kval, xi / tval)
    return out


def calc_gamma_pdf(x, k, theta):
    """Return the PDF of a gamma distribution.

    The probability density function (PDF) of the gamma
    distribution is defined in [1]_ as::

        pdf(x;k,theta) = x^(k-1) e^(-x/theta)
                         --------------------
                          gamma(k) theta^k

    Parameters
    ----------
    x : array
        The values at which to evaluate the PDF.
    k : number
        The shape parameter, which must be greater than 0.
    theta : number
        The scale parameter, which must be greater than 0.

    Returns
    -------
    pdf : array
        The PDF evaluated at each element of x.

    Notes
    -----
    The mean of the distribution is given by ``k * theta``,
    and the rate is ``1 / theta``.

    References
    ----------
    .. [1] http://en.wikipedia.org/wiki/Gamma_distribution
    """
    tval = theta * 1.0
    kval = k * 1.0
    norm = utils.gamma(kval) * theta**kval
    return x**(kval - 1.0) * np.exp(-x / tval) / norm
```
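As a quick sanity check (using `scipy` rather than `sherpa.utils`, so it can be run outside of Sherpa), the PDF formula above can be compared against `scipy.stats.gamma`, which uses the same shape/scale parametrization (`a=k`, `scale=theta`):

```python
import numpy as np
from scipy import special, stats

def gamma_pdf(x, k, theta):
    """The formula from the docstring, written with scipy.special.gamma."""
    return x**(k - 1.0) * np.exp(-x / theta) / (special.gamma(k) * theta**k)

# Compare the hand-written formula to scipy's implementation over a grid
x = np.linspace(0.1, 10, 50)
assert np.allclose(gamma_pdf(x, 1.1, 2.6),
                   stats.gamma.pdf(x, a=1.1, scale=2.6))
```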

I am also going to need a routine that integrates the PDF over a range in x. This is just the difference of the CDF at the two end points, so it can be written as:

In [7]:

```
def calc_gamma_pdf_int(xlo, xhi, k, theta):
    """Return the PDF of a gamma distribution integrated over a bin.

    The probability density function (PDF) of the gamma
    distribution is defined in [1]_ as::

        pdf(x;k,theta) = x^(k-1) e^(-x/theta)
                         --------------------
                          gamma(k) theta^k

    Integrating this gives::

          theta^k igamma(k, x/theta)
        - -------------------------- + c
              gamma(k) theta^k

    where igamma is the incomplete gamma function. Note that this is
    the "upper" form (so int_x^infinity), and sherpa.utils.igam
    calculates the lower form (which is 1-upper), which is also
    a normalized form, so we get to::

        sherpa.utils.igam(k,xhi/theta) - sherpa.utils.igam(k,xlo/theta)

    which shouldn't be surprising as it's just::

        cdf(xhi;k,theta) - cdf(xlo;k,theta)

    This means that this is just::

        calc_gamma_cdf(xhi,k,t) - calc_gamma_cdf(xlo,k,t)

    Parameters
    ----------
    xlo, xhi : array
        The bin edges over which to integrate the PDF.
    k : number
        The shape parameter, which must be greater than 0.
    theta : number
        The scale parameter, which must be greater than 0.

    Returns
    -------
    pdf : array
        The PDF integrated over xlo to xhi.

    References
    ----------
    .. [1] http://en.wikipedia.org/wiki/Gamma_distribution
    """
    return calc_gamma_cdf(xhi, k, theta) - calc_gamma_cdf(xlo, k, theta)
```
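The CDF-difference identity can itself be checked numerically with `scipy`, where `special.gammainc` is the regularized lower incomplete gamma function (the same convention as `sherpa.utils.igam`):

```python
import numpy as np
from scipy import special, stats
from scipy.integrate import quad

k, theta = 1.1, 2.6
xlo, xhi = 1.0, 2.0

# Bin integral via the CDF difference (regularized lower incomplete gamma)
cdf_diff = special.gammainc(k, xhi / theta) - special.gammainc(k, xlo / theta)

# Bin integral via direct numerical quadrature of the PDF
quad_val, _ = quad(lambda x: stats.gamma.pdf(x, a=k, scale=theta), xlo, xhi)

assert abs(cdf_diff - quad_val) < 1e-10
```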

With these, I can write the `GammaCDF` class, which is essentially the same as in the previous version, *except* that I have now used a finite, but *small*, lower limit on the `k` and `theta` parameter values, rather than allow them to go down to `0`. The lower limit is given by `sherpa.models.parameter.tinyval`, which is:

In [8]:

```
tinyval
```

Out[8]:

In [9]:

```
class GammaCDF(model.ArithmeticModel):
    """A Gamma CDF.

    The cumulative distribution function (CDF) for the Gamma
    distribution, as defined by [1]_, is::

        cdf(x;k,theta) = incomplete_gamma(k, x/theta)
                         ----------------------------
                                  gamma(k)

    The model parameters are:

    k
        The shape parameter, which must be greater than 0.
    theta
        The scale parameter, which must be greater than 0.

    Notes
    -----
    The mean of the distribution is given by ``k * theta``,
    and the rate is ``1 / theta``.

    References
    ----------
    .. [1] http://en.wikipedia.org/wiki/Gamma_distribution
    """

    def __init__(self, name='gammacdf'):
        self.k = model.Parameter(name, 'k', 5, min=tinyval, hard_min=0)
        self.theta = model.Parameter(name, 'theta', 2, min=tinyval, hard_min=0)
        model.ArithmeticModel.__init__(self, name, (self.k, self.theta))

    @model.modelCacher1d
    def calc(self, pars, x, *args, **kwargs):
        (k, theta) = pars
        if len(args) == 1:
            # An integrated grid: evaluate at the mid-point of each bin.
            x = (x + args[0]) / 2.0
        return calc_gamma_cdf(x, k, theta)
```

The above model expects either one or two `x` arrays - the independent variable - which give either the locations at which to evaluate the model (one array), or the low and high edges of each bin over which the model should be evaluated (two arrays). In the CDF case I just use the mid-point of the bin; that is, I do not "integrate" the model. However, for the PDF case, where the data is "integrated" - i.e. it represents the number of events within each bin - we can integrate the model. The overall form of the `GammaPDF` class is very similar to `GammaCDF`, but:

- an extra parameter is needed (the number of events),
- the handling of the model evaluation is slightly different,
- and a `guess` routine is added so that the normalisation can be set based on the data,

In [10]:

```
class GammaPDF(model.ArithmeticModel):
    """A Gamma PDF.

    The probability density function (PDF) for the Gamma
    distribution, as defined by [1]_, is::

        pdf(x;k,theta) = x^(k-1) e^(-x/theta)
                         --------------------
                          gamma(k) theta^k

    The model parameters are:

    k
        The shape parameter, which must be greater than 0.
    theta
        The scale parameter, which must be greater than 0.
    norm
        Normalization. The soft minimum is set to 0, but the
        hard_min is left at the default value, so the minimum
        can be made negative if the user wants.

    Notes
    -----
    The mean of the distribution is given by ``k * theta``,
    and the rate is ``1 / theta``.

    References
    ----------
    .. [1] http://en.wikipedia.org/wiki/Gamma_distribution
    """

    def __init__(self, name='gammapdf'):
        self.k = model.Parameter(name, 'k', 5, min=tinyval, hard_min=0)
        self.theta = model.Parameter(name, 'theta', 2, min=tinyval, hard_min=0)
        self.norm = model.Parameter(name, 'norm', 1, min=0)  # allow -ve if they really want it
        model.ArithmeticModel.__init__(self, name,
                                       (self.k, self.theta, self.norm))

    @model.modelCacher1d
    def calc(self, pars, x, *args, **kwargs):
        (k, theta, norm) = pars
        if len(args) == 0:
            return norm * calc_gamma_pdf(x, k, theta)
        elif len(args) == 1:
            # note: this could be a 2D dataset
            return norm * calc_gamma_pdf_int(x, args[0], k, theta)
        else:
            raise ValueError("Expected x or xlo/xhi grid.")

    def guess(self, dep, *args, **kwargs):
        """Set the norm parameter based on the data.

        The norm parameter is set to the sum of the dependent axis
        and frozen. The ``k`` and ``theta`` parameters are not changed.

        Notes
        -----
        The mean of the values equals k * theta and the mode is
        (k-1) * theta - when k > 1 - so these could be used to set
        k and theta, but I am not convinced it is worth the effort.
        """
        self.norm.val = dep.sum()
        self.norm.frozen = True
```

These models can now be added to Sherpa:

In [11]:

```
ui.add_model(GammaCDF)
ui.add_model(GammaPDF)
```

Let's set up the same data as used in the previous notebook:

In [12]:

```
np.random.seed(1)
```

In [13]:

```
k_orig = 1.1
theta_orig = 2.6
ysim = stats.gamma.rvs(a=k_orig, scale=theta_orig, size=1000)
```

First I am going to repeat the CDF fit:

In [14]:

```
xcdf = ysim.copy()
xcdf.sort()
ycdf = np.arange(1, xcdf.size+1) * 1.0 / xcdf.size
```

In [15]:

```
ui.load_arrays('cdf', xcdf, ycdf)
ui.set_model('cdf', gammacdf.cdf)
ui.set_stat('leastsq')
ui.fit('cdf')
```

Now I can try the PDF model, which requires binning `ysim` up. I am going to arbitrarily choose 31 equal-width bins, and note that it is not actually the PDF that we fit, but the PDF times the number of events (so that I can use a Maximum-Likelihood statistic for the fit):

In [16]:

```
(ypdf, edges) = np.histogram(ysim, bins=31, density=False)
xlo = edges[:-1]
xhi = edges[1:]
```

This time the data is labelled as an "integrated" data set - that is, it is of type `ui.Data1DInt` rather than the default `ui.Data1D`. This means that the columns refer to the low and high edges of the independent axis, followed by the dependent-axis values. Since I have integer values, I can use the "Cash" statistic - from Parameter estimation in astronomy through application of the likelihood ratio - which is a Maximum-Likelihood statistic where the data values are assumed to follow a Poisson distribution. I have seen cases where the default optimiser - the Levenberg-Marquardt method - does not work well with the Cash statistic (normally when the normalisation of the model is very far away from the data, which is not an issue for the "pdf" data set, but is for the "check" data set I try later), so I chose the Nelder-Mead Simplex algorithm.
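For those unfamiliar with it, the Cash statistic has a simple closed form. Here is a minimal numpy sketch (not Sherpa's own implementation): for Poisson-distributed data D and model-predicted counts M, Cash (1979) defines C = 2 Σ (M - D ln M), which the optimiser minimizes just as it would a chi-square value:

```python
import numpy as np

def cash_stat(data, model):
    """A sketch of the Cash statistic: 2 * sum(model - data * ln(model))."""
    data = np.asarray(data, dtype=float)
    model = np.asarray(model, dtype=float)
    return 2.0 * np.sum(model - data * np.log(model))

d = np.array([5, 3, 0, 1])
# The statistic is smaller when the predicted counts are closer to the data
close = cash_stat(d, [5.0, 3.0, 0.5, 1.0])
far = cash_stat(d, [10.0, 6.0, 2.0, 4.0])
assert close < far
```

Note that, unlike chi-square, the absolute value of the statistic is not directly interpretable; only differences in C between parameter values are meaningful.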

In [17]:

```
ui.load_arrays('pdf', xlo, xhi, ypdf, ui.Data1DInt)
ui.set_stat('cash')
ui.set_method('neldermead')
```

When plotted, error bars are added. These values are *not* used in the fit, and are only there to help visualize the data (they can be turned off, but I have left them in here):

In [18]:

```
ui.plot_data('pdf')
```

Here I create an instance of the `GammaPDF` class - using the `gammapdf` function created by the `add_model` call - called `pdf`:

In [19]:

```
ui.set_source('pdf', gammapdf.pdf)
print(pdf)
```

As I went to the effort of writing a `guess` method for this class, let's see what it does:

In [20]:

```
ui.guess('pdf', pdf)
print(pdf)
```

The `norm` parameter is now set to the sum of the data - in this case 1000 - and has been frozen, so that it is not varied during the fit.

In [21]:

```
ui.fit('pdf')
```

The fit is in the right area of parameter space (the input `k` and `theta` values are 1.1 and 2.6), and the fit looks reasonable:

In [22]:

```
ui.plot_fit_resid('pdf')
```

Since the Y values cover more than one order of magnitude, let's use a log scale to check:

In [23]:

```
ui.plot_fit('pdf')
plt.yscale('log')
```

One point of fitting the PDF was to see what the errors on the parameters are (note that due to some oddity of the IPython notebook the "lower" and "upper" bound messages, which are displayed as the fit runs, and should appear before the table, are in fact displayed last here):

In [24]:

```
ui.conf('pdf')
```

As well as the individual errors, let's look at how they are correlated (the contours give the one, two, and three sigma ranges, and the cross is the best-fit location). I have increased the number of points used in this plot since the evaluation is quick (as there are no free parameters beyond `k` and `theta`).

I'll repeat this plot later on, adding the actual values and the results from the CDF fit.

In [25]:

```
ui.reg_proj(pdf.k, pdf.theta, id='pdf', nloop=[41,41])
```

The `GammaPDF` model was written to work with both integrated and non-integrated data, so let's see what happens when the latter is used. In this case, I use the same y values but place them at the mid-point of each bin:

In [26]:

```
ui.load_arrays('check', (xlo+xhi)/2, ypdf)
ui.set_source('check', gammapdf.pdf2)
ui.fit('check')
```

So, the results aren't too different - apart from the fact that the normalisation is lower than the "expected" value of 1000 (as I'm approximating a binned distribution with an un-binned one, I wanted to see what happened to the `norm` parameter, which is why I did not fix it).

The fit looks reasonable (but again, remember that the error bars should not be taken too seriously):

In [27]:

```
ui.plot_fit_resid('check')
```

How about the errors?

In [28]:

```
ui.conf('check')
```

I could try fixing the normalization - to 1000 - and fitting again, but this is enough of a digression.

The point of this notebook is to give examples of how to write a user model in Sherpa, not to determine the best way to fit a Gamma distribution, but let's just compare the results. I start with the region-projection plot shown earlier (Sherpa has cached the results for us, so we do not need to recalculate it). Then I add horizontal and vertical lines to show the true values (i.e. the `k_orig` and `theta_orig` values), and then two green points:

- the filled circle shows the fit to the CDF
- the filled triangle shows the fit to the "unbinned" PDF (i.e. the "check" data set)

As can be seen, all values are within the one-sigma contour of the PDF fit, but the CDF result - which has no error associated with it - is actually closer to the expected value. Now, this is only one data set, so if you really want to investigate this you would have to run a large number of simulations. However, that would be the topic for another post!

In [29]:

```
rplot = ui.get_reg_proj()
rplot.contour()
plt.xlabel('$k$')
plt.ylabel('$\\theta$')
xr = plt.xlim()
yr = plt.ylim()
plt.vlines(k_orig, yr[0], yr[1], linestyles='dotted')
plt.hlines(theta_orig, xr[0], xr[1], linestyles='dotted')
plt.plot(cdf.k.val, cdf.theta.val, 'go')
plt.plot(pdf2.k.val, pdf2.theta.val, 'g^')
```

Out[29]: