Zack Li, 7/27/2018
As a cosmologist, you'll probably face the following question.
Will this dataset be able to measure what I'm interested in?
More concrete examples from my own research include
If your observables have Gaussian uncertainties, Fisher information matrices can help answer these questions. They perform the change of variables that turns uncertainties on correlated observables (e.g. the CMB power spectrum) into uncertainties on correlated parameters (e.g. the inputs to the $\Lambda$CDM model). It is a crude (but useful!) form of error propagation, and applies to many cosmological datasets, including the CMB, weak lensing, and BAO.
Since Fisher forecasting is a numerical technique, this guide is packaged as an interactive notebook. It's intended for advanced undergraduates and early graduate students; the goal is to build intuition and be able to compute a Fisher matrix by the end. We'll cover the basic methods for computing Fisher information matrices, using the power spectrum of the cosmic microwave background (CMB) as an example. Useful references are Coe 2009 for implementation tips, as well as the more comprehensive Verde 2009, from which I have drawn heavily.
To use this notebook, you'll need to know a little calculus and cosmology, be able to state Bayes' theorem, and have CLASS installed with its Python wrapper.
%matplotlib inline
%config InlineBackend.figure_format = 'retina' # I use a Mac
import numpy as np
import matplotlib.pyplot as plt
from classy import Class # CLASS python wrapper
Suppose you have some probability distribution $P(x)$ in terms of the variable $x$, and want to transform it so it's in terms of the variable $y$. What is $Q(y)$, the transformed distribution? Changing variables should conserve the probability of a set of events, so in terms of differentials,

$$P(x) \, dx = Q(y) \, dy.$$

Dividing, we derive the familiar chain rule relating $P(x)$ and $Q(y)$,

$$P(x) = Q(y) \, \left|\frac{dy}{dx}\right|.$$

That is, you can transform probability distributions into new variables by using derivatives. In the case of multivariate Gaussian measurement uncertainty, the probability distributions can be described in terms of a covariance matrix. This is one way to interpret Fisher information matrix methods: you are transforming the covariance matrix describing your observables (i.e. measurement uncertainties) into the covariance matrix describing your parameters (i.e. uncertainties on numbers like $H_0$).
What do you need to get there? Following the example above, you need to know the covariance matrix of your observable, and you need to know the derivatives between the observables and parameters.
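As a quick numerical sanity check of the derivative idea, here is a minimal sketch (the variable names and the toy transform $y = x^2$ are my own, for illustration only): for a small Gaussian uncertainty $\sigma_x$ on an observable $x$, the propagated uncertainty on $y(x)$ is approximately $|dy/dx| \, \sigma_x$.

```python
import numpy as np

rng = np.random.default_rng(0)

# draw samples of an observable x with a small Gaussian uncertainty
x0, sigma_x = 2.0, 0.01          # fiducial value and error bar (made up)
x = rng.normal(x0, sigma_x, 100_000)

# a hypothetical "parameter" related to the observable by y = x**2
y = x**2

# error propagation via the derivative: sigma_y ~ |dy/dx| * sigma_x
sigma_y_propagated = abs(2 * x0) * sigma_x

print(np.std(y), sigma_y_propagated)  # the two should agree closely
```

The agreement is good precisely because the uncertainty is small enough that the transform is locally linear, which is the same regime in which Fisher forecasts are trustworthy.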
We now define the Fisher information matrix, $\mathbf{F}$, with elements $F_{ij}$,

$$F_{ij} = - \left\langle\frac{\partial^2}{\partial \theta_i \, \partial \theta_j} \ln f\right\rangle.$$

Here $f$ is the likelihood, and $\theta_i$ and $\theta_j$ are parameters you are interested in. Each element $F_{ij}$ of the Fisher information matrix is the Hessian of the log likelihood, averaged over realizations of your data. This is a more geometric interpretation of the Fisher information matrix: it describes the curvature of the log likelihood. If the likelihood is "sharp" or "peaked", our experiment is very good at measuring the parameters, and the curvature of the log likelihood at the peak will be high. Similarly, if the likelihood is wide and spread out, the curvature will be low, as will the Fisher information.
The Fisher matrix is useful because of the Cramér-Rao bound, which states that the variance of an unbiased estimator is at least as large as the inverse of the Fisher information. For parameters $x$ and $y$, the covariance $\sigma_{xy}$ is bounded by

$$\sigma_{xy} \geq \left(\mathbf{F}^{-1}\right)_{xy}.$$

In other words, the inverse of the Fisher information matrix gives you the best possible covariance matrix for the parameters you are trying to estimate. A nice informal derivation of this fact is on the Wikipedia page for Fisher information.
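To see the bound in action, here is a small sketch (not from the original notebook) using the textbook case of estimating the mean of Gaussian data: for $N$ independent samples with known $\sigma$, the Fisher information for the mean is $F = N/\sigma^2$, and the sample mean saturates the Cramér-Rao bound.

```python
import numpy as np

rng = np.random.default_rng(1)

# N Gaussian draws with known sigma; we estimate the mean mu
N, sigma, mu_true = 50, 2.0, 5.0

# Fisher information for mu from N independent Gaussian samples: F = N / sigma^2
F = N / sigma**2

# simulate many "experiments" and look at the scatter of the sample-mean estimator
estimates = rng.normal(mu_true, sigma, size=(20_000, N)).mean(axis=1)
var_estimator = estimates.var()

print(var_estimator, 1 / F)  # the sample mean saturates the Cramer-Rao bound
```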
We now return to the change-of-variables interpretation. Recall that the probabilities must sum to 1, so that $\int f\, d^n\theta = 1$. Taking a derivative of this identity and combining it with the definition of the Fisher information matrix, one can derive the rule for transforming the Fisher matrix from a set of variables $\{\lambda_i\}$ to a set $\{\theta_i\}$ (more details in Tegmark 1996),

$$ \mathbf{F}^{\theta} = \mathbf{J}^T \mathbf{F}^{\lambda} \mathbf{J}.$$

The matrix $\mathbf{J}$ is the Jacobian, with elements $J_{ij} \equiv \frac{\partial \lambda_i}{\partial \theta_j}$.
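In code, the transformation rule $\mathbf{F}^{\theta} = \mathbf{J}^T \mathbf{F}^{\lambda} \mathbf{J}$ is a single line of matrix algebra. A toy sketch, with a made-up linear reparametrization chosen purely for illustration:

```python
import numpy as np

# Fisher matrix in the "old" parameters lambda = (lam1, lam2)
F_lambda = np.diag([1.0, 4.0])

# Jacobian J_ij = d lam_i / d theta_j for the (made-up) linear
# reparametrization lam1 = th1 + th2, lam2 = th1 - th2
J = np.array([[1.0,  1.0],
              [1.0, -1.0]])

# transformation rule: F^theta = J^T F^lambda J
F_theta = J.T @ F_lambda @ J
print(F_theta)  # [[5. -3.] [-3. 5.]]
```

Note that even though $\mathbf{F}^{\lambda}$ is diagonal (uncorrelated old parameters), the new parameters come out correlated, which is exactly what happens with cosmological parameters.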
We've been speaking generally about covariances and derivatives. We'll now make a brief detour and discuss the main example of this notebook, the angular power spectrum $C_{\ell}$ of the fluctuations in the cosmic microwave background (CMB). There are many details here, and I'd recommend a textbook like Modern Cosmology by Dodelson if you want to learn more.
A short summary which might stir some memories if you've taken a cosmology course:
We don't need a very detailed knowledge of this to make progress here; if you want, just think of $C_{\ell}$ as a set of numbers you can measure, indexed by $\ell$. The Boltzmann code CLASS can generate theory curves of $C_{\ell}$ given a set of cosmological parameters. Let's generate the $\Lambda$CDM temperature (TT) spectrum now with CLASS. We'll name it fiducial because we'll use it later as the fiducial power spectrum for Fisher forecasts.
# Define the CLASS input dictionary, use defaults
params = {
    'output': 'tCl lCl',
    'l_max_scalars': 2000,
    'lensing': 'yes',
    'omega_cdm': 0.120,
    'omega_b': 0.0224,
    'h': 0.674
}
# The usual CLASS code for computing C_ell
cosmo = Class()
cosmo.set(params)
cosmo.compute()
fiducial = cosmo.lensed_cl(2000)
cosmo.struct_cleanup()
cosmo.empty()
We are choosing the parametrization $\{ \omega_{cdm}, \omega_b, h\}$, with $\omega_{cdm} = \Omega_{c} h^2$, $\omega_{b} = \Omega_{b} h^2$, and $h = H_0 / (100\:\mathrm{km/s/Mpc})$. Let's plot it in a common way to view power spectra, $\ell^2 C_{\ell}^{TT}$. The $\ell^2$ factor makes the high-$\ell$ features visible; they are otherwise hard to see due to Silk damping.
plt.plot(fiducial['ell'], fiducial['ell']**2*fiducial['tt'])
plt.xlabel(r'$\ell$')
plt.ylabel(r'$\ell^2 C_{\ell}^{TT}$');
We've suggested that derivatives are important for a change of variables. For the CMB, we can write the Fisher matrix for the parameters in terms of the Fisher matrix for $C_{\ell}$ (which we'll describe later), with the transformation rule

$$ \mathbf{F}^{\theta} = \mathbf{J}^T \mathbf{F}^{C_{\ell}} \mathbf{J}.$$

The variable we have is $C_{\ell}$, and the variables we want are a set of cosmological parameters $\{ \theta_i \}$, for example the normalized Hubble constant $h$. Thus, we will be interested in computing a Jacobian $\mathbf{J}$ with elements

$$ \frac{\partial C_{\ell}^{XY}}{\partial \theta_i} $$

for observables $XY \in \{ TT, TE, EE \}$. As an example, we will now estimate $\partial C_{\ell}^{TT} / \partial h$, the derivative of the TT power spectrum with respect to the normalized Hubble constant. This is just a matter of choosing a step size $\Delta h$ and computing $\Delta C_{\ell} / \Delta h$.
Good step sizes for Fisher derivatives are typically $\sim 1\%$ of the parameter value. Smaller steps are usually better for the derivative, but going too small can lead to numerical instability.
def utility_function_call_CLASS(input_dict, l_max=2000):
    """Compute C_ell with this utility function, to repeat less code."""
    cosmo = Class()
    cosmo.set(input_dict)
    cosmo.compute()
    temp_cl = cosmo.lensed_cl(l_max)
    cosmo.struct_cleanup()
    cosmo.empty()
    return temp_cl
# Define the CLASS input dictionary, use defaults
params = {
    'output': 'tCl lCl',
    'l_max_scalars': 2000,
    'lensing': 'yes',
    'omega_cdm': 0.120,
    'omega_b': 0.0224,
    'h': 0.674
}
# for left and right sides of the derivative, copy
# the dict above, and then change the parameter
# we are interested in.
h_step = 0.01
left_params = params.copy()
left_params['h'] = params['h'] - h_step
right_params = params.copy()
right_params['h'] = params['h'] + h_step
# get the C_l^TT and then compute the derivative!
cl_tt_left = utility_function_call_CLASS(left_params)['tt']
cl_tt_right = utility_function_call_CLASS(right_params)['tt']
dCltt_dh = (cl_tt_right - cl_tt_left) / (2 * h_step)
Let's plot the derivative! We'll normalize it by $C_{\ell}^{TT}$ so the features are clearer. The warning message below appears because the CLASS output has zeroes for $\ell < 2$.
plt.plot( dCltt_dh / fiducial['tt'] )
plt.ylabel(r'$(\partial C_{\ell}^{TT} / \partial h) / C_{\ell}^{TT}$')
plt.xlabel(r'$\ell$');
/home/zequnl/anaconda3/lib/python3.6/site-packages/ipykernel/__main__.py:1: RuntimeWarning: invalid value encountered in true_divide
  if __name__ == '__main__':
Write a function get_deriv() which takes in a parameter name (like omega_b), a channel (like 'tt'), and a step size (like 0.01), and returns the derivative $\partial C_{\ell}^{XY} / \partial \theta_i$. Test it out by reproducing the plot above.
# fill me in!
def get_deriv():
pass
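For reference, here is one possible sketch of a solution. The extra params dict and compute_cls arguments are my own additions (not part of the exercise) so the logic can be checked against a toy spectrum without running CLASS; in the notebook you would pass utility_function_call_CLASS and the params dict from above.

```python
import numpy as np

def get_deriv(param_name, channel, stepsize, params, compute_cls):
    """Central finite difference of C_ell^XY with respect to one parameter.

    compute_cls is any function mapping a CLASS-style parameter dict to a
    dict of spectra, e.g. utility_function_call_CLASS from earlier.
    """
    left, right = params.copy(), params.copy()
    left[param_name] = params[param_name] - stepsize
    right[param_name] = params[param_name] + stepsize
    return (compute_cls(right)[channel] - compute_cls(left)[channel]) / (2 * stepsize)

# quick check against a toy "Boltzmann code" with C_ell = h^2 * ell,
# for which dC_ell/dh = 2 h ell exactly
ells = np.arange(10)

def toy_boltzmann(p):
    return {'tt': p['h']**2 * ells}

deriv = get_deriv('h', 'tt', 1e-4, {'h': 0.674}, toy_boltzmann)
print(deriv)  # should be very close to 2 * 0.674 * ells
```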
We now have to revisit the angle brackets in the definition of the Fisher information matrix. Here is the definition again,

$$F_{ij} = - \left\langle\frac{\partial^2}{\partial \theta_i \, \partial \theta_j} \ln f\right\rangle.$$

The $\langle \cdots \rangle$ refers to an average over data realizations: an average over each possible way the data could have turned out, given the assumed noise properties. For continuous variables, this requires an integral in data space, weighted by the probability density of obtaining each dataset. In the case of CMB data, the power spectrum can have on the order of 50 bins, which would require a 50-dimensional integral. No thanks.
If the measurement uncertainty is described by a multivariate Gaussian, then we can avoid doing this (see Tegmark et al. 1996 for the derivation in the CMB context; an identical derivation exists for lensing). The signal in the sky for temperature (T), as well as curl-free (E) and divergence-free (B) polarization, then consists of a set over $m$,

$$ (a_{\ell m}^T, a_{\ell m}^E, a_{\ell m}^B).$$

We don't have multiple skies, but a Gaussian random field (GRF) has the nice property that we can estimate the variance of the $a_{\ell m}$ by looking at different parts of the sky. Unfortunately, this estimator is limited by the number of samples available at a given angular scale on the sky: we have only $2 \ell + 1$ of them. The resulting uncertainty on the estimate of $C_{\ell}$ is $(\Delta C_{\ell})^2 = \frac{2}{2\ell+1} C_{\ell}^2$, the so-called cosmic variance.
We therefore have for each positive integer $\ell$ an observable $C_{\ell}$ that we can measure. The observable $C_{\ell}$ has intrinsic variance $(\Delta C_{\ell})^2 = \frac{2}{2\ell+1} C_{\ell}^2$. Assuming that the B-modes are negligible :( we can write for correlated T and E spectra a matrix proportional to the covariance for each $\ell$,
$$ \mathbf{C}_{\ell} \equiv \left( {\begin{array}{cc} C_{\ell}^{TT} + N_{\ell}^{TT} & C_{\ell}^{TE} \\ C_{\ell}^{TE} & C_{\ell}^{EE} + N_{\ell}^{EE} \\ \end{array} } \right) $$where $N_{\ell}^{TT}$ and $N_{\ell}^{EE}$ refer to additional instrumental noise in the temperature and polarization, which are added to the noise from cosmic variance. The cross-spectrum noise $N_{\ell}^{TE}$ is usually negligible for typical CMB experiments, so we have assumed $N_{\ell}^{TE} = 0$. Then the Fisher matrix can be expressed as a sum over each $\ell$,
$$ F_{ij} = \sum_{\ell} \frac{2 \ell + 1}{2} f_{\mathrm{sky}} \, \mathrm{Tr}\,\left( \mathbf{C}_{\ell}^{-1} \frac{\partial \mathbf{C}_{\ell}}{\partial \theta_i} \mathbf{C}_{\ell}^{-1} \frac{\partial \mathbf{C}_{\ell}}{\partial \theta_j} \right) $$

Here, $f_{\mathrm{sky}}$ refers to the fraction of the sky which the experiment covers, since you can think of the $a_{\ell m}$ as providing $(2 \ell + 1)f_{\mathrm{sky}}$ samples to work with, and $\mathrm{Tr}$ refers to the trace of the matrix. You might consult Wu et al. 2014 if you need a second description.
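A sketch of this sum in numpy (the function name and array layout are my own choices) might look like the following; the toy check at the bottom uses identity matrices so the answer can be computed by hand.

```python
import numpy as np

def fisher_element(cov, dcov_i, dcov_j, f_sky=1.0):
    """Fisher matrix element F_ij from per-ell 2x2 spectrum covariances.

    cov, dcov_i, dcov_j have shape (lmax + 1, 2, 2), indexed by ell
    starting at 0; the unphysical ell = 0, 1 entries are skipped.
    """
    F = 0.0
    for ell in range(2, cov.shape[0]):
        Cinv = np.linalg.inv(cov[ell])
        F += (2 * ell + 1) / 2 * f_sky * np.trace(
            Cinv @ dcov_i[ell] @ Cinv @ dcov_j[ell])
    return F

# toy check: with C_ell = identity and dC/dtheta = identity, the trace is 2,
# so F = sum_{ell=2}^{4} (2 ell + 1) = 5 + 7 + 9 = 21
eye = np.tile(np.eye(2), (5, 1, 1))
print(fisher_element(eye, eye, eye))  # 21.0
```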
Now you'll compute your own Fisher matrix. Let's start with a full-sky, cosmic variance-limited (CV-limited) experiment. In a CV-limited experiment, we have $N_{\ell}^{TT} = 0$ and $N_{\ell}^{EE} = 0$. This is the best-possible CMB experiment we could ever make.
Using the fiducial spectra we computed earlier, print out the $2 \times 2$ matrix $\mathbf{C}_{\ell}$ for $\ell = 1000$. Convert it into a numpy array, and then print its inverse too.
# you might use for example...
fiducial['tt']
array([0.00000000e+00, 0.00000000e+00, 1.56306222e-10, ..., 4.73088092e-17, 4.72642305e-17, 4.72195451e-17])
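If you want to check the matrix algebra first, here is a sketch with made-up stand-in numbers for the spectra; in your solution the values would come from fiducial['tt'][1000] and friends (you may need to add pCl to the CLASS 'output' setting to get the polarization spectra).

```python
import numpy as np

# stand-in values for C_ell^TT, C_ell^TE, C_ell^EE at ell = 1000;
# these numbers are made up for illustration only
cl_tt, cl_te, cl_ee = 5e-17, 1e-17, 2e-17

C = np.array([[cl_tt, cl_te],
              [cl_te, cl_ee]])
C_inv = np.linalg.inv(C)
print(C)
print(C_inv)
print(C @ C_inv)  # should be close to the identity
```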
Compute a single element of the Fisher matrix, where $\theta_i = \omega_b$ and $\theta_j = h$. Use the derivative function you wrote earlier, and perform the matrix multiplications! Remember to sum over $\ell$.
Compute the full Fisher matrix over the set of parameters $\{ \omega_{cdm}, \omega_b, h \}$. That is, $F_{01}$ would refer to $\theta_0 = \omega_{cdm}$, and $\theta_1 = \omega_{b}$, and so on.
Print out your covariance! Recall that the best possible covariance of your parameters is the inverse of the Fisher matrix,
$$ \mathrm{Cov} = \mathbf{F}^{-1}. $$

This corresponds to the covariance matrix of the multivariate Gaussian which describes your parameters, also called the posterior. The ellipses one usually sees in Fisher forecasts are just the 2D marginalized 1- and 2-$\sigma$ contours. The 1-$\sigma$ constraint marginalized over all other parameters (i.e. the error quoted as $X \pm \sigma$) is the square root of the diagonal of the covariance matrix, $\sigma_i = \sqrt{\mathrm{Cov}_{ii}}$. What's the error bar on $h$ for your experiment?
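The mechanics of this step are just a matrix inverse and a square root. A sketch with a made-up (but symmetric, positive-definite) Fisher matrix standing in for the one you computed:

```python
import numpy as np

# a made-up 3x3 Fisher matrix for (omega_cdm, omega_b, h), for illustration only
F = np.array([[ 4.0e6, 1.0e6, -2.0e5],
              [ 1.0e6, 9.0e6,  3.0e5],
              [-2.0e5, 3.0e5,  1.0e6]])

cov = np.linalg.inv(F)          # best possible parameter covariance
sigmas = np.sqrt(np.diag(cov))  # marginalized 1-sigma errors
print(sigmas)                   # sigmas[2] would be the error on h
```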
Finally, we'll perform a more realistic forecast! We'll add in the noise terms $N_{\ell}^{TT}$ and $N_{\ell}^{EE}$. My favorite reference for this is Wu et al. 2014. White noise curves can be generated with

$$N_\ell^{XX'} = s^2 \exp \left( \ell(\ell+1) \frac{\theta^2_\mathrm{FWHM}}{8 \ln 2} \right),$$

where $s$ is the noise level in $\mu K$-radians and $\theta_\mathrm{FWHM}$ is the beam width in radians. A common forecasting prescription for Planck has temperature noise level $s_T = 33\,\mu\mathrm{K}$-arcmin, polarization noise level $s_P = \sqrt{2}\, s_T$, and beam $\theta_\mathrm{FWHM} = 7$ arcmin. Importantly, you must remove a factor of $10^{-6}\, T_{\mathrm{cmb}} / \mathrm{Kelvin}$ to match the unitless output of CLASS, or vice versa. Make a forecast for Planck 2015!
sT = 33. * (np.pi / 60. / 180.)          # convert from uK-arcmin to uK-radian
sP = sT * np.sqrt(2.)
theta_FWHM = 7. * (np.pi / 60. / 180.)   # convert from arcmin to radians
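Putting the noise formula together, here is a sketch of the noise curves (the cell repeats the unit conversions so it stands alone, and assumes the standard value $T_{\mathrm{cmb}} = 2.7255\,\mathrm{K}$ for the conversion to dimensionless CLASS units):

```python
import numpy as np

# Planck-like noise parameters from the text, converted from arcmin to radians
sT = 33. * (np.pi / 60. / 180.)          # temperature noise, uK-radian
sP = sT * np.sqrt(2.)                    # polarization noise, uK-radian
theta_FWHM = 7. * (np.pi / 60. / 180.)   # beam FWHM, radian

# divide by T_cmb in microkelvin to match the dimensionless CLASS spectra
Tcmb_uK = 2.7255e6

ells = np.arange(2, 2001)
beam = np.exp(ells * (ells + 1) * theta_FWHM**2 / (8. * np.log(2.)))
NlTT = (sT / Tcmb_uK)**2 * beam
NlEE = (sP / Tcmb_uK)**2 * beam
print(NlTT[0], NlTT[-1])  # the beam makes the noise blow up at high ell
```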
Once you've calculated a covariance matrix, it's straightforward to make triangle plots, as the math is just collapsing down the multivariate Gaussian you've computed. The implementation details for making pretty ellipses are in the very helpful Coe 2009.
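A sketch of the eigendecomposition step behind a single ellipse (a simplified version of the recipe in Coe 2009; the scaling factors for exact confidence levels are omitted, and the example covariance is made up):

```python
import numpy as np

def ellipse_params(cov, n_sigma=1.0):
    """Semi-axes and rotation angle of the n-sigma ellipse of a 2x2 covariance."""
    vals, vecs = np.linalg.eigh(cov)            # eigenvalues in ascending order
    a = n_sigma * np.sqrt(vals[1])              # semi-major axis
    b = n_sigma * np.sqrt(vals[0])              # semi-minor axis
    angle = np.arctan2(vecs[1, 1], vecs[0, 1])  # orientation of the major axis
    return a, b, angle

# example: a correlated pair of parameters (made-up covariance)
cov = np.array([[2.0, 0.6],
                [0.6, 1.0]])
a, b, angle = ellipse_params(cov)
print(a, b, np.degrees(angle))
```

You can feed a, b, and angle straight into matplotlib's Ellipse patch to draw the contour.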
It's sometimes useful to visualize what's happening with Fisher matrices. Here I describe one way, where you plot the signal and noise curves. The binned variance (for a binsize $\Delta \ell$ of perhaps 100-200) can be written

$$ \sigma_{C_{\ell}}^2 = \frac{2}{(2\ell+1)\, f_{\mathrm{sky}}\, \Delta \ell} \left( C_{\ell} + N_{\ell} \right)^2, $$

which is basically the term in the $\mathbf{C}_{\ell}$ elements that goes into $F_{ij}$. It is often convenient to plot the quantities $\Delta C_{\ell} / C_{\ell}$ and $\sigma_{C_{\ell}} / C_{\ell}$.
Fisher forecasts are pretty crude estimates of the posterior. They will fail for all sorts of reasons, in particular
I will also advertise my own Fisher code here; it's marginally better documented than the other options, and you can always contact me if something breaks!