User models in Sherpa - II

In the previous notebook - User models in Sherpa - I showed how you could write a user model to fit the Cumulative Distribution Function (CDF) of a data set with a Gamma distribution. Rather than use the load_user_model function, I showed how to write the model taking advantage of the Sherpa Model class, and the add_model function to register the class with Sherpa.

One question I got from a colleague was why I fit the CDF, rather than the Probability Density Function (PDF) of the data? Whilst I am not going to answer that question here, I do aim to show how to fit the PDF, since it:

  • lets me talk about writing models for "integrated" data sets
  • lets me fit the data using a Maximim Likelihood statistic, rather than the "least-squares" approach I used previously.

Follow up

I have a follow-up to this, which is available as plotting in Sherpa when using the low-level API. This was written on May 5 2015.

I have updated the notebook for the Sherpa 4.8.2 release in September 2016; there are no changes to the notebook, but you can now use Python 3.5 as well as 2.7. Actually, there's one minor change to the notebook, but that is due to a change in the IPython/Jupyter code rather than Sherpa: it is no longer necessary to say

import logging
logging.getLogger('sherpa').propagate = False

to stop seeing double messages from Sherpa. There has also been a fix to matplotlib since I last ran this notebook which fixes the display of error bars on a logarithmic scale when the point itself is negative.

Author and disclaimer

This was written by Douglas Burke on June 4 2015. This notebook, and others that may be of interest, can be found on GitHub at https://github.com/DougBurke/sherpa-standalone-notebooks.

The information in this document is placed into the Publc Domain. It is not an official product of the Chandra X-ray Center, and I make no guarantee that it is not without bugs or embarassing typos. Please contact me via the GitHub repository or on Twitter - at @doug_burke - if you have any questions.

Last run

When was this notebook last run?

In [1]:
import datetime
datetime.datetime.now().strftime("%Y-%m-%d %H:%M")
Out[1]:
'2016-09-26 19:43'

Setting up

First of all I am going to load in the necessary modules, and set up the CDF model I wrote earlier:

In [2]:
import numpy as np
import matplotlib.pyplot as plt

from scipy import stats
In [3]:
%matplotlib inline
In [4]:
from sherpa import utils
from sherpa.models import model
from sherpa.models.parameter import tinyval
In [5]:
from sherpa import ui

With the imports out of the way, here are routines to calculate the CDF and PDF of a Gamma distribution:

In [6]:
def calc_gamma_cdf(x, k, theta):
    """Return the CDF of a gamma distribution.

    The cumulative distribution function (CDF) of the gamma
    distribution is defined in [1]_ as::

       cdf(x;k,theta) = incomplete_gamma(k, x/theta)
                        ----------------------------
                                  gamma(k)

    Parameters
    ----------
    x : array
       The values at which to evaluate the CDF.
    k : number
       The shape parameter, which must be greater than 0.
    theta : number
       The scale parameter, which must be greater than 0.

    Returns
    -------
    cdf : array
       The CDF evaluated at each element of x.

    Notes
    -----
    The mean of the distribution is given by ``k * theta``,
    and the rate is ``1 / theta``.
 
    References
    ----------

    .. [1] http://en.wikipedia.org/wiki/Gamma_distribution

    """

    # Unfortunately igam does not accept a Numpy array as the
    # second argument, so need to map over the input array.
    #
    tval = theta * 1.0
    kval = k * 1.0
    x = np.asarray(x)
    out = np.zeros_like(x)
    for i,xi in enumerate(x):
        # igam is the "regularized" incomplete gamma (lower)
        # form, so already has the normalization by gamma(a)
        # included.
        out[i] = utils.igam(kval, xi/tval)

    return out

def calc_gamma_pdf(x, k, theta):
    """Return the PDF of a gamma distribution.

    The probability density function (PDF) of the gamma
    distribution is defined in [1]_ as::

       pdf(x;k,theta) = x^(k-1) e^(-x/theta)
                        --------------------
                          gamma(k) theta^k

    Parameters
    ----------
    x : array
       The values at which to evaluate the PDF.
    k : number
       The shape parameter, which must be greater than 0.
    theta : number
       The scale parameter, which must be greater than 0.

    Returns
    -------
    pdf : array
       The PDF evaluated at each element of x.

    Notes
    -----
    The mean of the distribution is given by ``k * theta``,
    and the rate is ``1 / theta``.
 
    References
    ----------

    .. [1] http://en.wikipedia.org/wiki/Gamma_distribution

    """

    tval = theta * 1.0
    kval = k * 1.0
    norm = utils.gamma(kval) * theta**kval
    return x**(kval-1.0) * np.exp(-x / tval) / norm

I am also going to need a routine that integrates the PDF over a range in x. This is just the difference in the CDF at the two points, so can be written as:

In [7]:
def calc_gamma_pdf_int(xlo, xhi, k, theta):
    """Return the PDF of a gamma distribution integrated over a bin.

    The probability density function (PDF) of the gamma
    distribution is defined in [1]_ as::

       pdf(x;k,theta) = x^(k-1) e^(-x/theta)
                        --------------------
                          gamma(k) theta^k

    Integrating this gives::

      - theta^k igamma(k, x/theta) + c
      ----------------------------
            gamma(k) theta^k

    where igamma is the incomplete gamma function. Note that this is
    the "upper" form (so int_x^infinity), and sherpa.utils.igam
    calculates the lower-form (which is 1-upper), which is also
    a normalized form, so we get to::

       sherpa.utils.igam(k,xhi/theta) - sherpa.utils.igam(k,xlo/theta)

    which shouldn't be surprising as it's just::

       cdf(xhi;k,theta) - cdf(xlo;k,theta)

    This means that this is just::

       calc_gamma_cdf(xhi,k,t) - calc_gamma_cdf(xlo,k,t)

    Parameters
    ----------
    xlo, xhi : array
       The bin edges over which to integrate the PDF.
    k : number
       The shape parameter, which must be greater than 0.
    theta : number
       The scale parameter, which must be greater than 0.

    Returns
    -------
    pdf : array
       The PDF integrated over xlo to xhi.

    References
    ----------

    .. [1] http://en.wikipedia.org/wiki/Gamma_distribution

    """

    return calc_gamma_cdf(xhi, k, theta) - calc_gamma_cdf(xlo, k, theta)

With these, I can write the GammaCDF class, which is essentially the same as in the previous version, except that I have now used a finite, but small, lower limit on the k and theta parameter values, rather than allow them to go down to 0. The lower limit is given by sherpa.models.parameter.tinyval, which is:

In [8]:
tinyval
Out[8]:
1.1754943508222875e-38
In [9]:
class GammaCDF(model.ArithmeticModel):
    """A Gamma CDF.

    The cumulative distribution function (CDF) for the Gamma
    distribution, as defined by [1]_, is::

       cdf(x;k,theta) = incomplete_gamma(k, x/theta)
                        ----------------------------
                                  gamma(k)

    The model parameters are:

    k
       The shape parameter, which must be greater than 0.

    theta
       The scale parameter, which must be greater than 0.

    Notes
    -----
    The mean of the distribution is given by ``k * theta``,
    and the rate is ``1 / theta``.
    
    References
    ----------

    .. [1] http://en.wikipedia.org/wiki/Gamma_distribution

    """

    def __init__(self, name='gammacdf'):
        self.k = model.Parameter(name, 'k', 5, min=tinyval, hard_min=0)
        self.theta = model.Parameter(name, 'theta', 2, min=tinyval, hard_min=0)

        model.ArithmeticModel.__init__(self, name, (self.k, self.theta))

    @model.modelCacher1d
    def calc(self, pars, x, *args, **kwargs):
        (k, theta) = pars
        if len(args) == 1:
            x = (x + args[0]) / 2.0

        return calc_gamma_cdf(x, k, theta)

The above model expects either one or two x arrays - the independent variable value - which are the location at which to evaluate the model (one array), or the low and high edges of each bin, over which the model should be evaluated. In the CDF case I just use the mid-point of the bin; that is I do not "integrate" the model. However, for the PDF case, where the data is "integrated" - i.e. it represents the number of events within the range, we can integrate up the data. The overall form of the GammaPDF class is very similar to GammaCDF, but

  • an extra parameter is needed (the number of events),
  • the handling of the model evaluation is slightly different,
  • and a guess routine is added so that the normalisation can be set based on the data,

as shown below:

In [10]:
class GammaPDF(model.ArithmeticModel):
    """A Gamma PDF.

    The probability density function (PDF) for the Gamma
    distribution, as defined by [1]_, is::

       pdf(x;k,theta) = x^(k-1) e^(-x/theta)
                        --------------------
                          gamma(k) theta^k

    The model parameters are:

    k
       The shape parameter, which must be greater than 0.

    theta
       The scale parameter, which must be greater than 0.

    norm
       Normalization. The soft minimum is set to 0, but the
       hard_min is left at the default value, so the minimum
       can be made negative if the user wants.

    Notes
    -----
    The mean of the distribution is given by ``k * theta``,
    and the rate is ``1 / theta``.

    References
    ----------

    .. [1] http://en.wikipedia.org/wiki/Gamma_distribution

    """

    def __init__(self, name='gammapdf'):
        self.k = model.Parameter(name, 'k', 5, min=tinyval, hard_min=0)
        self.theta = model.Parameter(name, 'theta', 2, min=tinyval, hard_min=0)
        self.norm = model.Parameter(name, 'norm', 1, min=0) # allow -ve if they realy want it
        
        model.ArithmeticModel.__init__(self, name, (self.k, self.theta, self.norm))

    @model.modelCacher1d
    def calc(self, pars, x, *args, **kwargs):
        (k, theta, norm) = pars
        if len(args) == 0:
            return norm * calc_gamma_pdf(x, k, theta)
        elif len(args) == 1:
            # note: this could be a 2D dataset
            return norm * calc_gamma_pdf_int(x, args[0], k, theta)
        else:
            raise ValueError("Expected x or xlo/xhi grid.")
            
    def guess(self, dep, *args, **kwargs):
        """Set the norm parameter based on the data.
        
        The norm parameter is set to be the sum of the dependent axis
        and frozen. The ``k`` and ``theta`` parameters are not changed.
        
        Notes
        -----
        The mean of the values equals k * theta and the mode is
        (k-1) * theta - when k > 1 - so these could be used to set
        k and theta, but I am not convinced it is worth the effort.
        
        """
        norm = dep.sum()
        self.norm.val = dep.sum()
        self.norm.frozen = True

These models can now be added to Sherpa:

In [11]:
ui.add_model(GammaCDF)
ui.add_model(GammaPDF)

Let's set up the same data as used in the previous notebook:

In [12]:
np.random.seed(1)
In [13]:
k_orig = 1.1
theta_orig = 2.6
ysim = stats.gamma.rvs(a=k_orig, scale=theta_orig, size=1000)

Repeating the CDF fit

First I am going to repeat the CDF fit:

In [14]:
xcdf = ysim.copy()
xcdf.sort()
ycdf = np.arange(1, xcdf.size+1) * 1.0 / xcdf.size
In [15]:
ui.load_arrays('cdf', xcdf, ycdf)
ui.set_model('cdf', gammacdf.cdf)
ui.set_stat('leastsq')
ui.fit('cdf')
Dataset               = cdf
Method                = levmar
Statistic             = leastsq
Initial fit statistic = 230.647
Final fit statistic   = 0.0597901 at function evaluation 36
Data points           = 1000
Degrees of freedom    = 998
Change in statistic   = 230.587
   cdf.k          1.10221     
   cdf.theta      2.61027     

Fitting with the PDF model

Now I can try the PDF model, which requires binning ysim up. I am going to arbitrarily choose 31 equal-width bins and ensure that it's not actually the PDF that we fit, but the PDF times the number of events (so that I can use a Maximum-Likelihood statistic for the fit):

In [16]:
(ypdf, edges) = np.histogram(ysim, bins=31, density=False)
xlo = edges[:-1]
xhi = edges[1:]

This time the data is labelled as an "integrated" data set - that is, it is of type ui.Data1DInt rather than ui.Data1D, which is the default. This means that the columns refer to the low and high edges of the independent axis, and then the dependent axis values. Since I have integer values, I can use the "Cash" statistic - from Parameter estimation in astronomy through application of the likelihood ratio - which is a Maximum-Likelihood statistic where the data values are assumed to follow a Poisson distribution. I have seen cases where the default optimiser - the Levenberg-Marquardt method - does not work well with the Cash statistic (normally when the normalisation of the model is very-far away from the data, which is not an issue for the "pdf" data set, but is for the "check" data set I try later), so I chose the Nelder-Mead Simplex algorithm.

In [17]:
ui.load_arrays('pdf', xlo, xhi, ypdf, ui.Data1DInt)
ui.set_stat('cash')
ui.set_method('neldermead')

When plotted, data bars are added. These values are not used in the fit, and are only to help visualize the data (they can be turned off but I have left them in here):

In [18]:
ui.plot_data('pdf')
WARNING: The displayed errorbars have been supplied with the data or calculated using chi2xspecvar; the errors are not used in fits with cash

Here I create an instance of the GammaPDF class - using the gammapdf function created by the add_model call - called pdf:

In [19]:
ui.set_source('pdf', gammapdf.pdf)
print(pdf)
gammapdf.pdf
   Param        Type          Value          Min          Max      Units
   -----        ----          -----          ---          ---      -----
   pdf.k        thawed            5  1.17549e-38  3.40282e+38           
   pdf.theta    thawed            2  1.17549e-38  3.40282e+38           
   pdf.norm     thawed            1            0  3.40282e+38           

As I went to the effort of writing a guess method for this class, let's see what it does:

In [20]:
ui.guess('pdf', pdf)
print(pdf)
gammapdf.pdf
   Param        Type          Value          Min          Max      Units
   -----        ----          -----          ---          ---      -----
   pdf.k        thawed            5  1.17549e-38  3.40282e+38           
   pdf.theta    thawed            2  1.17549e-38  3.40282e+38           
   pdf.norm     frozen         1000            0  3.40282e+38           

The norm parameter is now set to the sum of the data - in this 1000 - and has been frozen, so that it is not varied during the fit.

In [21]:
ui.fit('pdf')
Dataset               = pdf
Method                = neldermead
Statistic             = cash
Initial fit statistic = -807.674
Final fit statistic   = -7315.31 at function evaluation 218
Data points           = 31
Degrees of freedom    = 29
Change in statistic   = 6507.64
   pdf.k          1.05479     
   pdf.theta      2.79661     

The fit is in the right area of parameter space (the input k and theta values are 1.1 and 2.6), and the fit looks reasonable:

In [22]:
ui.plot_fit_resid('pdf')
WARNING: The displayed errorbars have been supplied with the data or calculated using chi2xspecvar; the errors are not used in fits with cash
WARNING: The displayed errorbars have been supplied with the data or calculated using chi2xspecvar; the errors are not used in fits with cash

Since the Y values cover more than one order of magnitude, let's use a log scale to check:

In [23]:
ui.plot_fit('pdf')
plt.yscale('log')
WARNING: The displayed errorbars have been supplied with the data or calculated using chi2xspecvar; the errors are not used in fits with cash

One point of fitting the PDF was to see what the errors on the parameters are (note that due to some oddity of the IPython notebook the "lower" and "upper" bound messages, which are displayed as the fit runs, and should appear before the table, are in fact displayed last here):

In [24]:
ui.conf('pdf')
pdf.theta lower bound:	-0.156241
pdf.k lower bound:	-0.0525859
pdf.theta upper bound:	0.170394
pdf.k upper bound:	0.053836
Dataset               = pdf
Confidence Method     = confidence
Iterative Fit Method  = None
Fitting Method        = neldermead
Statistic             = cash
confidence 1-sigma (68.2689%) bounds:
   Param            Best-Fit  Lower Bound  Upper Bound
   -----            --------  -----------  -----------
   pdf.k             1.05479   -0.0525859     0.053836
   pdf.theta         2.79661    -0.156241     0.170394

As well as the individual errors, let's look at how they are correlated (the contours give the one, two, and three sigma ranges, and the cross is the best-fit location). I have increased the number of points used in this plot since the evaluation is quick (as there are no free parameters beyond k and theta).

I'll repeat this plot later on, adding on the actual values and the results from the CDF.

In [25]:
ui.reg_proj(pdf.k, pdf.theta, id='pdf', nloop=[41,41])

Trying out the "non-integrated" case

The GammaPDF model was written to work with both integrated and non-integrated data, so let's see what happens when the latter is used. In this case, I use the same y values but place them at the mid-point of each bin:

In [26]:
ui.load_arrays('check', (xlo+xhi)/2, ypdf)
ui.set_source('check', gammapdf.pdf2)
ui.fit('check')
Dataset               = check
Method                = neldermead
Statistic             = cash
Initial fit statistic = 11202.9
Final fit statistic   = -7315.15 at function evaluation 341
Data points           = 31
Degrees of freedom    = 28
Change in statistic   = 18518.1
   pdf2.k         1.06076     
   pdf2.theta     2.79291     
   pdf2.norm      845.983     

So, the results aren't too different - apart from the fact that the normalisation is lower than the "expected" value of 1000 (as I'm approximating a binned distribution with an un-binned one I wanted to see what happened to the norm parameter, which is why I did not fix it).

The fit looks reasonable (but again, remember that the error bars should not be taken too seriously):

In [27]:
ui.plot_fit_resid('check')
WARNING: The displayed errorbars have been supplied with the data or calculated using chi2xspecvar; the errors are not used in fits with cash
WARNING: The displayed errorbars have been supplied with the data or calculated using chi2xspecvar; the errors are not used in fits with cash

How about the errors?

In [28]:
ui.conf('check')
pdf2.theta lower bound:	-0.167317
pdf2.norm lower bound:	-26.8196
pdf2.k lower bound:	-0.0630497
pdf2.theta upper bound:	0.184758
pdf2.norm upper bound:	27.3998
pdf2.k upper bound:	0.0642998
Dataset               = check
Confidence Method     = confidence
Iterative Fit Method  = None
Fitting Method        = neldermead
Statistic             = cash
confidence 1-sigma (68.2689%) bounds:
   Param            Best-Fit  Lower Bound  Upper Bound
   -----            --------  -----------  -----------
   pdf2.k            1.06076   -0.0630497    0.0642998
   pdf2.theta        2.79291    -0.167317     0.184758
   pdf2.norm         845.983     -26.8196      27.3998

I could try fixing the normalization - to 1000 - and fitting again, but this is enough of a digression.

Comparing the answers

The point of this notebook is to give examples of how to write a user model in Sherpa, not what the best way to fit a Gamma distribution is, but let's just compare the results. I start with the region-projection plot shown earlier (Sherpa has cached the results for us so we do not need to recalculate it). Then I add on horizontal and vertical lines to show the true values (i.e. the k_orig and theta_orig values), and then two green points:

  • the filled circle shows the fit to the CDF
  • the filled triangle shows the fit to the "unbinned" PDF (i.e. the "check" data set)

As can be seen, all values are within the one-sigma contour of the PDF fit, but the CDF result - which has no error associated with it - is actually closer to the expected value. Now, this is only one data set, so if you really want to investigate this you would have to run a large number of simulations. However, that would be the topic for another post!

In [29]:
rplot = ui.get_reg_proj()
rplot.contour()
plt.xlabel('$k$')
plt.ylabel('$\\theta$')

xr = plt.xlim()
yr = plt.ylim()
plt.vlines(k_orig, yr[0], yr[1], linestyles='dotted')
plt.hlines(theta_orig, xr[0], xr[1], linestyles='dotted')

plt.plot(cdf.k.val, cdf.theta.val, 'go')
plt.plot(pdf2.k.val, pdf2.theta.val, 'g^')
Out[29]:
[<matplotlib.lines.Line2D at 0x7f2c36c37f98>]