The cell below loads the visual style of the notebook when run.
In [ ]:
from IPython.core.display import HTML
css_file = './styles/styles.css'
HTML(open(css_file, "r").read())

Fitting a model to data

Learning Objectives

  • Be able to describe what a figure of merit is, and what it is for.
  • Learn how to use scipy to fit models to data.
  • Understand why sometimes model-fitting algorithms do not give the right answer.


Probably the most common task for an observational astronomer is fitting a model to some data. Suppose we have a model $M$ with some parameters $\theta$. For example, our model, $M$ could be a straight line $y = mx+c$, with parameters $\theta = (m,c)$. Model fitting is the process of finding the parameters which best fit our data, and the corresponding uncertainties on them. To do this, we obviously have to have some objective measure of how well our model fits our data. Such a measure is often called a figure of merit.

Figures of merit

The sum-of-the-squares

Suppose we have a model $y(x)$ and we observe a set of points $(x_i,y_i)$. If our model is a good fit to the data then the model prediction at a certain data point $y(x_i)$ will lie close to the observation $y_i$. Therefore, the sum-of-the squares statistic

$${\rm SS} = \sum_i [y_i - y(x_i)]^2,$$

is a measure of how well our model fits the data. Smaller values of ${\rm SS}$ imply a better fit. For scientific data, the sum-of-the-squares is not the best figure of merit. This is because every data point contributes equally to the sum, regardless of it's uncertainty.

The $\chi^2$ statistic

This is the most commonly used measure of goodness-of-fit. Suppose again that we have a model $y(x)$ and we observe a set of points $(x_i,y_i)$, each with corresponding uncertainties $\sigma_i$. The $\chi^2$ statistic is

$$\chi^2 = \sum_i \left(\frac{y_i - y(x_i)}{\sigma_i} \right) ^2,$$

where $y(x_i)$ is the prediction of our model at $x_i$. Intuitively, we can see that $\chi^2$ measures goodness of fit. The quantity $ [y_i - y(x_i)]/\sigma_i $ is a measure of how many error bars a data point is away from our model. We saw in the lectures that if our model fits well, most data points lie within 1$\sigma$, but a third lie further away. Therefore, we'd expect $\chi^2$ for a good fit to roughly equal the number of data points. Poor fits will yield larger versions of $\chi^2$. A common way to fit a model is therefore to find the model parameters which minimise $\chi^2$.

(If you want to prove to yourself that minimising $\chi^2$ is not just intuitively correct, but formally correct, have a look at this companion notebook.)

Model fitting in Python

Model fitting is therefore just a matter of finding the model parameters that minimise $\chi^2$. You could do this by hand, of course, but this is the kind of task that computers are designed for. Python has several different methods available in libraries. They all do the same thing - they attempt to automatically search parameter space for the values that minimise $\chi^2$. Which you want to use depends a little on your task. We'll look at two cases: fitting simple polynomial models (e.g a straight line), and fitting more complex models

So we want to fit a straight line to data using python, and plot the results. This is very simple. First we need to import the relevant modules

In [ ]:
%matplotlib inline  
import numpy as np
import matplotlib.pyplot as plt

Now we wish to load in the data to fit. This is stored in a comma separated file. The loadtxt function in numpy will do this simply and will load each column into arrays called x, y and e.

In [ ]:
# note the "unpack" optional argument to unpack the columns automatically
x,y,e = np.loadtxt('data/data.txt',unpack=True)

Now we have our arrays of data we can plot it to see what it looks like.

In [ ]:
fig,ax = plt.subplots()

Fitting models with SciPy

In this case our data looks like a straight line will fit it. If we want to fit a polynomial to some data, the easiest way is to use the numpy.polyfit package. I show how to do that in the companion notebook here. For this lab, we are going to learn a different approach, which has the advantage that it will scale to more complex models.

If we want to fit a model which is more complex than a polynomial, we can use scipy's optimize library. This library has many routines for finding the parameters of a model which best fit your data. We're going to look at the curve fit function in scipy's optimize package, which uses the Levenberg-Marquardt algorithm to find the parameters which best fit our data.

The Levenberg-Marquardt algorithm: theory

Finding the model parameters which minimise $\chi^2$ for a function requires an algorithm, or recipe. You could imagine using a brute-force method where you calculate $\chi^2$ over a wide range of possible parameters, and find the best ones. Such a method does work, but is very slow - impractically so for models that have more than two parameters.

The Levenberg-Marquardt algorithm is an attempt to find the best parameters more efficiently. It is a type of "gradient-search" method, illustrated in the figure below:

The plot above shows how $\chi^2$ depends upon the model parameter for a model with a single free parameter. From a given starting point, the Levenberg-Marquardt algorithm finds the "downhill" direction in $\chi^2$, and steps in that direction until $\chi^2$ stops decreasing. This location gives the parameter values which minimise $\chi^2$. The steps taken are shown by the red dots and solid red line. The idea is easily extended to models with more than one free parameter (for example, a straight line fit has two - the slope and the gradient):

The greyscale in the image above shows the $\chi^2$ space for a model with two parameters, $a$ and $b$. Dark areas represent low values of $\chi^2$, blue contours show lines of constant $\chi^2$. From a given starting point, the Levenberg-Marquardt algorithm finds the "downhill" direction in $\chi^2$, and steps in that direction until $\chi^2$ stops decreasing. This location gives the parameter values which minimise $\chi^2$. The steps taken are shown by the red dots and solid red line.

This second figure also illustrates an important point about model-fitting: there can be more than one "valley" of low $\chi^2$ in the parameter space, and the Levenberg-Marquardt algorithm always sets off in the local downhill direction until. This can lead to the function finding a local $\chi^2$ minimum, which is not the global $\chi^2$ minimum, as illustrated by the dashed red line.

The Levenberg-Marquardt algorithm: practice

Suppose we want to fit a model (any model) to some data. The first step is to write a function that calculates the prediction of our model at some location, $x$. The first argument of this function should be the location $x$. The remaining arguments are our model parameters. Therefore, for a straight line, our function looks like this:

In [ ]:
def funcToFit(x,m,c):
    """Function to calculate a straight line
    x: np.ndarray, float
        the positions at which to evaluate the function
    m: float
        the gradient of the straight line
    c: float
        the intercept of the line
    np.ndarray, float
        the values predicted by our model at positions x
    return m*x + c

Now we can use the curve fit function in scipy's optimize package, which uses the Levenberg-Marquardt algorithm to find the parameters which best fit our data.

Look closely at the comments in the code below to understand how it works. curve_fit returns two variables. The first is a list of the best fitting model parameters. The second is the covariance matrix. You'll learn more about covariance matrices next year, for now you can assume that the diagonal elements of this matrix are our uncertainties, and these can easily be extracted using numpy's diag function

In [ ]:
from scipy.optimize import curve_fit

# we need an initial guess for our starting parameters
# this is a list with one entry for each parameter
p0 = [2,2]

We run curve fit itself below

The first argument is the function that defines our model
The next two arguments are the "x" and "y" data
The fourth argument is our starting guess
The final (optional) argument is the error bars, "e".

Without the error bars, curve_fit will minimise the SS.

Curve fit returns two variables. The first is the best fitting
model parameters, the second is the covariance matrix.
popt,pcov = curve_fit(funcToFit,x,y,p0,sigma=e)

# let's print out our fits
# I use numpy's diag function to pick out the diagonal
# elements of the covariance matrix
print( popt )
print( np.diag(pcov) )

# also, let's use python's string formatting to print out in a 
# nice format. Note how I ensure the correct number of decimal places
m,c = popt
em,ec = np.diag(pcov)
print( 'Gradient  = {0:.3f} +/- {1:.3f}'.format(m,em) )
print( 'Intercept = {0:.1f} +/- {1:.1f}'.format(c,ec) )

Plot the best fit

Make a plot of the data and the best fit to it. Use error bars for the data points. For the best fit, plot a solid line.

You will need to use your function funcToFit and the best fit parameters popt to calculate the best fitting model at the points x. You can then plot the result against x on the same graph as your data.

More complex models

Generalising the process to more complex models is easy - all we do is change the function we fit our data with - funcToFit (note, we could call this function anything, but funcToFit tells you what it is!)

Some complex modelling

The data file data/complex_model.txt contains some data in the same format as data.txt but the model we think explains the data is more complex. We have reason to believe the data is explained by the model

$$ y = a \cos{bx} + b \sin{ax} $$

In the cell below, write a new version of funcToFit which encodes this model. Use the version above as a template to make sure you get it right.

The test cell below should run without errors if you've done things right...

In [ ]:
In [ ]:
# this cell should raise no errors!
from import assert_almost_equal

Fit the data

In the cell(s) below, read the data from the file data/complex_model.txt into variables x, y and e. Use curve_fit to fit your function above to the data, and plot the data and best fitting model together.

Start from a guess for the parameters of $a=-12$, $b=2$. Does your model fit your data? If you didn't get a good fit, try again from a starting position of $a=5$, $b=2$. Do you get a good fit now?

Create a markdown cell, explaining the results you found above.

In [ ]:

Homework #7

Fitting Exoplanet Transit Data

As a reminder, we can use curve_fit to fit more or less any model we like, as long as we can write that model as a Python function. In the file I have written a Python function that calculates the shape of an exoplanet transit lightcurve. We can import this function from the Python file in the same we would import functions from scipy or numpy Let's import the functon and look at it's documentation:

In [ ]:
from transit import transit

Your task in this homework is to fit the transit lightcurve of the exoplanet Wasp-4b. However, we don't want to fit all the parameters taken by the function transit! I have been very kind, and found reference values for all but the planetary radius, the stellar radius and the inclination. These are the three parameters we want to fit, so we need to write a new function that hard-codes the other parameters in.

Q1: Writing our function to fit (4 points)

Complete the function below to write a new function which takes an array of times, plus three parameters for the planetary radius, the inclination and the limb darkening. Your function must call the transit function above, with the other parameters fixed at values found from the literature. These are:

Orbital period $P = 1.38823187$ days.

Time of mid-transit $T0 = 54748.150490$ days.

The limb darkening parameter $\mu = 0.311$.

In [ ]:
def funcToFit(t,Rs,Rp,i):
    """Calculate the transit shape due to an exoplanet.
    This function implements the model of exoplanetary transits found in
    Sackett et al (1999, ASIC, 532, 189). 
    t: np.ndarray, float
        the times at which to calcualte the transit lightcurve. Units of days (MJD)
    Rs: float
        radius of star, divided by the orbital separation
    Rp: float
        radius of exoplanet, divided by the orbital separation
    i: float
        inclination, in degrees
    np.ndarray, float
        the transit lightcurve, normalised to 1 outside of transit
    raise NotImplementedError()
In [ ]:
from import assert_almost_equal, assert_equal
t = [54822.925253333335,54748.150490]
rp = 0.02883
rs = 0.01827
i = 88.8
result = funcToFit(t,rs,rp,i)

Q2: Fit the transit data(4 points)

The transit data (time, flux, error) are stored in the text file data/wasp4_transit.txt. Using the examples provided above as a template, read in the data file and fit the data using curve_fit. Store the best fitting parameters in a variable named popt and the covariance matrix in a variable named pcov.

You will need reasonable starting guesses for the scaled star and planet radii and the inclination. I suggest 0.2, 0.02 and 88 degrees respectively.

In [ ]:
x,y,e = np.loadtxt('data/wasp4_transit.txt',unpack=True)
raise NotImplementedError()
In [ ]:

Q3: Plot the fit (2 points)

Plot the data and best fit on a single plot.

In [ ]:
raise NotImplementedError()