In this note, I will show what spectroscopic data look like, and then write a small simulation code to demonstrate how line fitting is done. Since preprocessing and sky estimation are not the main topic, I will sometimes skip detailed issues and just give the big picture. A very good summary of how spectroscopic data reduction is done is given in the lecture notes by C. Harrison at Durham University.
The following image shows an example of spectroscopic data. In the middle is the spectrum of the target. As you expect, the horizontal axis is the wavelength direction (the orientation depends on the instrument, but here I drew the wavelength increasing towards the right). Since the slit has a finite width, much larger than the point source we are observing, there are sky spectra above and below the target. This vertical direction is called the spatial direction.
from IPython.display import Image
Image(filename='images/spec_concept.png', width='500')
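Before looking at a real frame, here is a minimal sketch (with made-up numbers, not real data) that builds a fake 2-D frame just to make the geometry concrete: wavelength runs along the horizontal axis, the spatial direction along the vertical axis, and the point source occupies only a few central rows on top of the sky.
import numpy as np
from matplotlib import pyplot as plt
# Fake 2-D spectral frame: 30 spatial rows x 100 wavelength columns (illustrative only).
np.random.seed(0)
nrow, ncol = 30, 100
sky = np.random.normal(loc=100, scale=10, size=(nrow, ncol))            # sky fills every row
profile = 50 * np.exp(-0.5 * ((np.arange(nrow) - nrow // 2) / 1.5)**2)  # point-source spatial profile
frame = sky + profile[:, np.newaxis]                                    # target only near the central row
plt.imshow(frame, origin='lower', aspect='auto')
plt.xlabel("wavelength direction (pixel)")
plt.ylabel("spatial direction (pixel)")
plt.colorbar(label="counts")
plt.show()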
In spectroscopy, basically we do wavelength calibration ("x = 1 is lambda = 500 nm, x = 2 is lambda = 502 nm, ...") and flux calibration ("count = 1 at lambda = 500 nm corresponds to 1.8 Jy, count = 1 at lambda = 502 nm corresponds to 1.2 Jy, ...").
import numpy as np
from matplotlib import pyplot as plt
from astropy.modeling.functional_models import Gaussian1D
from scipy.optimize import curve_fit
from astropy.modeling.fitting import LevMarLSQFitter
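A quick aside: as a toy illustration of those two calibrations (the numbers below are invented, and the snippet reuses the numpy import above), a wavelength solution maps pixel index to wavelength and a flux calibration maps counts to physical flux:
# Toy wavelength and flux calibration (made-up numbers, for illustration only).
pix = np.arange(5)                                  # pixel index along the dispersion axis
wavelength = 500 + 2.0 * pix                        # linear wavelength solution: 500, 502, ... nm
counts = np.array([1.0, 1.0, 2.0, 3.0, 1.0])        # measured counts at each pixel
count_to_Jy = np.array([1.8, 1.2, 1.1, 1.0, 0.9])   # calibration factor at each wavelength
flux_Jy = counts * count_to_Jy                      # flux-calibrated spectrum
print(np.column_stack([wavelength, flux_Jy]))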
After importing the above packages, you are ready to generate fake spectroscopic data. I will generate the data using the following information:
gain = 2.5 electrons/ADU
readout noise = 10 electrons
line profile: Gaussian (one could also use astropy.modeling.functional_models.Voigt1D for a Voigt profile)
First let's generate the fake data:
def fake_3rd(x, a=1.e-7, b=4.e-5, c=1.e-2, d=10):
    # 3rd-order polynomial used as a smooth (fake) sky background shape
    return a * x**3 + b * x**2 + c * x + d

def fake_e2ADU(signal_electron, gain=1.0):
    # Convert electrons to integer ADU counts using the gain [electrons/ADU]
    sig_ADU = signal_electron / gain
    sig_ADU_int = np.around(sig_ADU).astype(int)
    return sig_ADU_int
np.random.seed(1)
#==============================================================================
# What we know a priori from instrument spec:
nbins = 100
lambda_min= 500
lambda_max= 750
gain = 2.5 # electrons/ADU
ronoise = 10 # electrons
lambdas = np.linspace(lambda_min, lambda_max, nbins)
#==============================================================================
# Hypothesis
# Our null hypothesis for the line
line_amp = 50 # amplitude in electrons
line_cent = 600 # central wavelength in nm
line_width= 10 # line width in nm
line_prof = Gaussian1D(amplitude=line_amp, mean=line_cent, stddev=line_width)
#==============================================================================
# Fake sky:
# sky related parameters should have been obtained from sky pixels in spec data.
# In this example, randomly generate them.
# We can take sky values from, e.g., 7 pixels up/down from the line.
# sky_level: mean sky pixel value at lambda in *electrons* unit
# ( iid ~ N(100, 1^2) + 3rd order noise)
sky_level_e = ( np.random.normal(loc=100, scale=1, size=nbins) # random gaussian noise
+ fake_3rd(lambdas) # 3rd order noise
)
#==============================================================================
# Then generate fake data in electron units and convert them to ADU
# obs: # of *electrons* at each pixel.
obs = ( np.random.poisson(lam=sky_level_e) # sky from real sky value
+ np.random.normal(scale=ronoise, size=nbins) # ronoise
+ line_prof(lambdas) # line profile
)
obs_ADU = fake_e2ADU(obs, gain=gain)
plt.plot(lambdas, obs_ADU, label="obs data")
plt.plot(lambdas, sky_level_e/gain, label="sky (true value)")
plt.xlabel("Wavelength (nm)")
plt.ylabel("ADU")
plt.legend()
plt.grid(ls=":")
plt.show()
This is a simulation of a 1-D cut, e.g., along the central bright horizontal line of the first image.
Let's now subtract the sky. To mimic the real data reduction, I will first generate a "sky" region and fit a 3rd-order function to that sky data. The raw and sky data will then be converted to electron units, and the corresponding errors will be used to fit a Gaussian profile.
# Consider the ``obs_ADU`` as if it is the real data we obtained.
# I did not include the error term due to sky estimation, which is not strictly correct!
# If you want to take this into account, you can use bootstrap, etc
# obs_e: observed values converted to electrons
# obs_err: observation error bar in electrons
obs_e = obs_ADU * gain # This should be (almost) equal to ``obs``, up to the ADU rounding
obs_err = np.sqrt( obs_e + ronoise**2 ) # Poisson + ronoise
# Say we obtained sky in ADU from sky pixels at each wavelength bin.
# That will be in ADU (``sky_ADU``) and we will fit the sky values to
# a 3rd order function.
# generate sky again to mimic sky pixels:
sky_est_e = ( np.random.normal(loc=100, scale=1, size=nbins) # random gaussian noise
+ fake_3rd(lambdas) # 3rd order noise
)
# Fit the sky pixel values with 3rd order function and subtract it from obs:
sigmas = np.sqrt( sky_est_e + ronoise**2) # in e-
popt, pcov = curve_fit(fake_3rd, xdata=lambdas, ydata=sky_est_e,
sigma=sigmas, absolute_sigma=True)
# Be careful that ``absolute_sigma`` is ``False`` by default!!
obs_skysub_e = obs_e - fake_3rd(lambdas, *popt)
fitter = LevMarLSQFitter()
g_init = Gaussian1D(amplitude=line_amp, mean=line_cent, stddev=line_width)
g_fit = fitter(g_init, lambdas, obs_skysub_e, weights=1/obs_err**2)
g_fit_noweight = fitter(g_init, lambdas, obs_skysub_e)
# The format for weights is different from some other cases. We need 1/err**2 here.
fig, axs = plt.subplots(2, 1, sharex=True, figsize=(8,8))
axs[0].errorbar(lambdas, obs_skysub_e, yerr=obs_err, capsize=3, lw=1, marker='x', ls=':')
axs[0].plot(lambdas, line_prof(lambdas), ls=':', label="True profile")
axs[0].plot(lambdas, g_fit(lambdas), label="Fit with weighting")
axs[0].plot(lambdas, g_fit_noweight(lambdas), label="Fit without weighting")
axs[1].plot(lambdas, sky_level_e, label="Real sky")
axs[1].plot(lambdas, fake_3rd(lambdas, *popt), label="Fitted sky")
axs[0].set_ylabel("electrons")
axs[1].set_ylabel("electrons")
axs[1].set_xlabel("wavelength (nm)")
axs[0].set_title("1-D extraction, sky subtracted")
axs[0].legend()
axs[1].legend()
plt.show()
print("fit with weighting")
print("-"*80)
print(g_fit)
print("\n"+"="*80+"\n")
print("fit without weighting")
print("-"*80)
print(g_fit_noweight)
fit with weighting
--------------------------------------------------------------------------------
Model: Gaussian1D
Inputs: ('x',)
Outputs: ('y',)
Model set size: 1
Parameters:
      amplitude        mean          stddev
    ------------- ------------- -------------
    56.6903141203 599.364236991 6.81830068632

================================================================================

fit without weighting
--------------------------------------------------------------------------------
Model: Gaussian1D
Inputs: ('x',)
Outputs: ('y',)
Model set size: 1
Parameters:
      amplitude        mean          stddev
    ------------- ------------- -------------
    56.3854244005 599.322417536 7.25865308423
In the first figure, the fitted Gaussians are shown together with the original Gaussian profile. You can see that the amplitude is systematically overestimated and the line width (2 * stddev) is underestimated. The center (mean) is fitted to be slightly blue-shifted, by z ~ 1/600. The errors for these three parameters can be obtained by bootstrapping (a resampling technique); the resulting errors will decrease if you make more observations.
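As a concrete (though simplified) illustration of that bootstrapping idea, the sketch below perturbs the sky-subtracted spectrum by its error bars, refits the Gaussian each time, and takes the scatter of the fitted parameters as the error estimate. It reuses lambdas, obs_skysub_e, obs_err, g_init, and fitter defined above.
# A minimal parametric-bootstrap sketch for the parameter errors (not a full treatment).
n_boot = 500
boot_params = np.zeros((n_boot, 3))                     # columns: amplitude, mean, stddev
for b in range(n_boot):
    resampled = obs_skysub_e + np.random.normal(scale=obs_err)   # perturb by per-pixel errors
    g_b = fitter(g_init, lambdas, resampled, weights=1/obs_err**2)
    boot_params[b] = [g_b.amplitude.value, g_b.mean.value, g_b.stddev.value]
for name, col in zip(["amplitude", "mean", "stddev"], boot_params.T):
    print("{:>9s}: {:.2f} +- {:.2f}".format(name, col.mean(), col.std()))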
The second figure shows the justification for the sky fitting. In reality, the sky may not be fitted this well; for example, you may fit a 3rd-order function while the actual sky behaves like a 10th-order one.
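One quick sanity check for the chosen sky order (just a sketch, reusing lambdas, sky_est_e, and sigmas from above) is to fit several polynomial orders to the sky pixels and compare the reduced chi-square; a strong drop when increasing the order suggests the lower order was insufficient.
# Rough check of the sky polynomial order via reduced chi-square (illustrative only).
x = lambdas - lambdas.mean()            # center the wavelengths to keep the fit well conditioned
for order in (1, 2, 3, 4, 5):
    coeffs = np.polyfit(x, sky_est_e, deg=order, w=1/sigmas)
    resid = sky_est_e - np.polyval(coeffs, x)
    chi2_red = np.sum((resid / sigmas)**2) / (len(x) - (order + 1))
    print("order {}: reduced chi^2 = {:.2f}".format(order, chi2_red))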
Let's now regard the above data obs_skysub_e as given data, and do a realistic analysis. The big question is:
How likely is it that the peak is an actual line (or lines), and not due to random noise?
To test this, let's set up two hypotheses: H0, that there is no line (the data are pure noise), and H1, that there is a line. Note that we haven't specified the properties of the line (amplitude, width, etc.) yet. Deciding between H0 and H1 is called model selection, and estimating the line properties afterwards is called parameter estimation. Model selection deals with the discrete case (a finite number of different hypotheses), while parameter estimation usually deals with the continuous case.
Bayes' theorem states
$$P(H|D,I) = \frac{P(H|I) \times P(D|H,I)}{P(D|I)}\;.$$ Here, H, D, and I are the hypothesis, data, and prior information, respectively. The posterior P(H|D,I) is understood as "the probability that the hypothesis is true, given that the data equal D and we have prior information I." The prior P(H|I) can be quite subjective, since it means "the probability that the hypothesis is true." We take either a uniform prior or the Jeffreys prior. The denominator, $P(D|I) \equiv \int P(D|H,I)\,P(H|I)\,dH$, is understood as the normalization constant for the posterior. Usually it is thus simply written as
$$P(H|D,I) \propto P(H|I) \times P(D|H,I)\;.$$ The term P(D|H,I) is called the likelihood, and this is what we calculated in our high school coursework:
Given the hypothesis that a person has a probability of 0.5 of winning a game, what is the probability that she wins 3 games in a row?
In this case, the hypothesis is H: "p = 0.5", and the data are D = "won 3 in a row". The prior knowledge is that "the game rules do not change", "the person doesn't use foul means to make the probability change over time", etc., which are just trivial assumptions. In real scientific problems, I can be something like "the fundamental physical laws do not change over time", or "this galaxy is a spiral for sure" (if you're working on rotation curves of S galaxies, you may assume your samples are definitely S, not E or Irr), etc.
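For this textbook example the likelihood works out directly:
$$P(D|H,I) = 0.5 \times 0.5 \times 0.5 = 0.125\;.$$
The Bayesian analysis below combines exactly this kind of likelihood with a prior.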
The model selection is done based on the Bayes ratio, or the odds ratio:
$$R := \frac{P(H_1|D,I)}{P(H_0|D,I)}\;.$$ If R > 1, we select H1 over H0, and vice versa. If R = 1, we can't draw any conclusion. Substituting Bayes' theorem, P(D|I) cancels out and
$$R = \frac{P(H_1|I)}{P(H_0|I)} \times \frac{P(D|H_1,I)}{P(D|H_0,I)} = \mathrm{(prior\ ratio)} \times \mathrm{(likelihood\ ratio)}\;.$$ Equivalently, we can use $\log_e R$ and compare it with 0. This is convenient because we frequently get extremely large or small R values (e.g., $10^{-40}$), which can be problematic in computer programs.
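As a toy numerical illustration: if we have no prior preference between the two hypotheses (prior ratio of 1) and the data are, say, $e^{5} \approx 148$ times more probable under H1 than under H0, then $R = 1 \times e^{5}$ and $\log_e R = 5 > 0$, so H1 is preferred.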
There are two major priors: the uniform prior and the Jeffreys prior.
The uniform prior assumes a uniform probability for the parameter within a reasonable range. If we take this for our case, any model with an amplitude from 25 to 75 is equally likely. The lower/upper bounds can be set subjectively by the scientist; as long as the range is sufficiently large, the final result will not be affected much. Quantitatively, if A is the amplitude, $P(A|I) = 1/(75-25) = 1/50$ for all $A \in [25, 75]$.
The Jeffreys prior assumes a probability proportional to the inverse of the parameter. That is, if the parameter is small, we give more weight to it; this is reminiscent of information theory, which assigns more weight to "unlikely" data. For this reason, we usually use the Jeffreys prior for standard deviation estimation. Quantitatively, it is $P(A|I) = \frac{1}{A \log_e(75/25)}$. This can be understood as a uniform prior on the log of A: $$P(\log_e A|I)\, d(\log_e A) = \frac{d(\log_e A)}{\log_e A_{\max} - \log_e A_{\min}} \;\;\rightarrow\;\; P(A|I) = \frac{1}{A \log_e(A_{\max}/A_{\min})}\;.$$
Since the Jeffreys prior is ill-defined when we include the model A = 0, a modified Jeffreys prior is also used: $$P(A|I) = \frac{1}{A + A_0} \times \frac{1}{\log_e\frac{A_0 + A_{\max}}{A_0 + A_{\min}}}\;.$$ This is nothing but a shift of the parameter by a constant $A_0$, which can be taken to be a noise level, such as the readout noise in our case.
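To make the three priors concrete, here is a small numerical sketch (my own illustration, using the amplitude range of this example and A0 set to the readout noise as suggested above) that evaluates each prior density and checks that it integrates to 1:
# Evaluate the three priors over A in [25, 75]; each should integrate to 1.
import numpy as np
A_min, A_max, A0 = 25.0, 75.0, 10.0        # A0 = readout noise level (electrons)
A = np.linspace(A_min, A_max, 1001)
prior_uniform = np.full_like(A, 1.0 / (A_max - A_min))
prior_jeffreys = 1.0 / (A * np.log(A_max / A_min))
prior_mod_jeffreys = 1.0 / ((A + A0) * np.log((A0 + A_max) / (A0 + A_min)))
for name, p in [("uniform", prior_uniform),
                ("Jeffreys", prior_jeffreys),
                ("modified Jeffreys", prior_mod_jeffreys)]:
    integral = np.sum((p[:-1] + p[1:]) / 2 * np.diff(A))   # trapezoidal rule
    print("{:>17s}: integral = {:.3f}".format(name, integral))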
Assume all data values are independent and follow an identical distribution (the i.i.d. assumption). The "identical distribution" in our case can reasonably be taken to be a Gaussian distribution with zero mean and a standard deviation equal to the readout noise ($\sigma_r$). Strictly speaking we also have a Poissonian noise term and a sky-estimation noise term (the latter is not calculated above), so you may instead use the pixel-wise error values. Applying the product rule to independent measurements,
$$P(D|H,I) = P(D_1,\cdots,D_N|H,I) = P(D_1|H,I)\times\cdots\times P(D_N|H,I)\;.$$ Under H0, $P(D_i|H_0,I)$ is the probability of measuring $D_i$ electrons when there is only the readout noise $\sigma_r$ (no actual line). This is given by the normal distribution: $P(D_i|H_0,I) = \frac{1}{\sqrt{2\pi\sigma_r^2}} \exp\left(-\frac{(D_i-0)^2}{2\sigma_r^2}\right)$. Then
$$P(D|H_0,I) = \left(\frac{1}{\sqrt{2\pi}\,\sigma_r}\right)^N \times \exp\left(\sum_{i=1}^{N} -\frac{(D_i-0)^2}{2\sigma_r^2}\right)\;,$$ so
$$\log_e P(D|H_0,I) = -\frac{N}{2}\log_e(2\pi\sigma_r^2) - \sum_{i=1}^{N}\frac{(D_i-0)^2}{2\sigma_r^2}\;.$$
For H1, we just change the zero mean to the Gaussian line profile, $G(x_i)$: at the $i$-th bin, we know the expected value $E_i$ from the Gaussian profile with the given amplitude, center, and width, so $D_i - 0$ is replaced by $D_i - E_i$. Note that $E_i$ is nearly zero outside of mean $\pm$ 3 stddev.
Now let's do the real calculation to compare H0 and H1. Since in H1 we did not specify the amplitude (A), central wavelength ($\lambda_0$), or width (w), we have to marginalize over these parameters to calculate the likelihood. In the previous equation, $E_i$ should then be understood as a function of the given parameters: $E_i(A, \lambda_0, w)$.
Let me assume a uniform prior for all three parameters with domains $A \in [25, 75]$, $\lambda_0 \in [590, 610]$, and $w \in [5, 15]$, so that $P(A|I)\,dA = (1/50)\,dA$, etc. Assuming all parameters are independent, we can use the product rule and
$$P(D|H_1,I) = \int_{5}^{15}\!\!\int_{590}^{610}\!\!\int_{25}^{75} P(D|A,\lambda_0,w,I)\; P(A|I)\; P(\lambda_0|I)\; P(w|I)\; dA\, d\lambda_0\, dw\;,$$ or
$$\log_e P(D|H_1,I) = -\log_e(10000) + \log_e\left[\int_{5}^{15}\!\!\int_{590}^{610}\!\!\int_{25}^{75} P(D|A,\lambda_0,w,I)\; dA\, d\lambda_0\, dw\right]\;,$$ with $$P(D|A,\lambda_0,w,I) = \left(\frac{1}{\sqrt{2\pi}\,\sigma_r}\right)^N \times \exp\left[-\sum_{i=1}^{N}\frac{\left(D_i - E_i(A,\lambda_0,w)\right)^2}{2\sigma_r^2}\right]\;.$$
Since H0 has no free parameters, we just use the $\log_e P(D|H_0,I)$ formula above.
def likelihood_gaussian(x, data, error, amplitude, mean, stddev):
    # Likelihood of the data given a Gaussian line model and a single (scalar) error value.
    # if len(error) != 1:
    #     raise ValueError('One and only one error should be given.')
    N_data = len(data)
    model = Gaussian1D(amplitude=amplitude, mean=mean, stddev=stddev)
    const_term = (np.sqrt(2 * np.pi) * error)**(-N_data)
    chi_term = -(data - model(x))**2 / (2 * error**2)
    chi_sum = np.sum(chi_term)
    return const_term * np.exp(chi_sum)
arr_A = np.arange(25, 75, 1)
arr_L = np.arange(590, 610.1, 0.5)
arr_w = np.arange(5, 15.1, 0.5)
likelihoods = np.zeros((len(arr_A), len(arr_L), len(arr_w)))
for i in range(len(arr_A)):
for j in range(len(arr_L)):
for k in range(len(arr_w)):
likelihoods[i, j, k] = likelihood_gaussian(lambdas,
obs_skysub_e,
ronoise,
arr_A[i],
arr_L[j],
arr_w[k])
loglikelihood_H1 = -np.log(10000) + np.log(likelihoods.sum())
loglikelihood_H0 = (- len(obs_skysub_e) * np.log(np.sqrt(2 * np.pi * ronoise**2))
- (obs_skysub_e**2/(2*ronoise**2)).sum())
R = loglikelihood_H1 - loglikelihood_H0   # note: this is log_e R, to be compared with 0
print('R = {:.2f}'.format(R))
R = 76.97
Since $\log_e R \approx 77$ (far above 0, i.e., R is vastly larger than 1), we prefer H1 over H0. From now on, therefore, I will assume that there is a line.
If I refine the parameter grid to steps of 0.2, the log odds ratio comes out as 80.41, still far above 0.
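A side note on numerics (not part of the calculation above): with N = 100 the direct product of Gaussian factors still fits in double precision, but for larger data sets the likelihoods can underflow to zero. A common workaround is to work with log-likelihoods and combine them with scipy.special.logsumexp, as in this sketch that reuses the grid arrays defined above:
# Sketch: the same marginalization done in log space to avoid underflow.
from scipy.special import logsumexp

def loglikelihood_gaussian(x, data, error, amplitude, mean, stddev):
    model = Gaussian1D(amplitude=amplitude, mean=mean, stddev=stddev)
    n = len(data)
    return (-n * np.log(np.sqrt(2 * np.pi) * error)
            - np.sum((data - model(x))**2) / (2 * error**2))

loglikes = np.zeros_like(likelihoods)
for i in range(len(arr_A)):
    for j in range(len(arr_L)):
        for k in range(len(arr_w)):
            loglikes[i, j, k] = loglikelihood_gaussian(lambdas, obs_skysub_e, ronoise,
                                                       arr_A[i], arr_L[j], arr_w[k])
loglikelihood_H1_stable = -np.log(10000) + logsumexp(loglikes)   # same quantity as above
print(loglikelihood_H1_stable)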
TIP: What about the case when you think there are 2+ lines? In that case, more parameters are added to the calculation, and the computation time grows exponentially. If you can calculate everything on a grid as we did in the code above, that is the best way. We may, however, find the maximum-likelihood parameter set quickly by sacrificing the error calculation (not evaluating the likelihood everywhere but using an algorithm to find the minimum). An efficient way to choose a model then (under the Gaussian noise assumption) is to use the Bayesian Information Criterion (BIC). It only takes the minimum chi-square and the number of parameters of the competing models, and it has been mathematically shown to be a very powerful criterion for model selection under some assumptions. We will not cover it since it is out of our scope.
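Just to show what such a BIC comparison could look like, here is a hedged sketch. It assumes Gaussian noise with known sigma, so that $-2\log_e L_{\max} = \chi^2_{\min} + \mathrm{const}$ with the constant cancelling between models; the chi-square values are placeholders, not results from this note.
# Sketch: BIC = chi^2_min + k * ln(N), up to a model-independent constant.
def bic(chi2_min, n_params, n_data):
    return chi2_min + n_params * np.log(n_data)

N = len(obs_skysub_e)
chi2_no_line = 250.0    # placeholder value for the "no line" model (k = 0)
chi2_one_line = 120.0   # placeholder value for the "one Gaussian line" model (k = 3)
delta_bic = bic(chi2_no_line, 0, N) - bic(chi2_one_line, 3, N)
print("Delta BIC = {:.1f} (large positive values favor the one-line model)".format(delta_bic))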
TIP: When the signal-to-noise ratio is low (high noise level), the choice of prior can affect the model selection critically (e.g., R = 0.9 for a uniform prior but R = 11 for the Jeffreys prior). So it is better to try various priors and test whether the result is robust against the prior selection. This is out of our scope.
Now that we know there IS a line, what are the best-fit values of the parameters (A, $\lambda_0$, and w) and their corresponding errors? The most intuitive and mathematically valid way is to marginalize the probability. I normalize each marginalized pdf so that it sums to 1.
def pplot(axes, x, y, xlabel, ylabel):
x_max = x[np.argmax(y)]
axes.plot(x, y)
axes.grid(ls=':')
axes.set_xlabel(xlabel)
axes.set_ylabel(ylabel)
axes.axvline(x_max, ls=':', color='r')
print('MLE of {}: {:.2f}'.format(xlabel, x_max))
normaln = likelihoods.sum()
pdf_A = likelihoods.sum(axis=(1,2))/normaln
pdf_L = likelihoods.sum(axis=(0,2))/normaln
pdf_w = likelihoods.sum(axis=(0,1))/normaln
fig, axs = plt.subplots(1, 3, sharey=True)
pplot(axs[0], arr_A, pdf_A, xlabel='Amplitude (e)', ylabel='PDF')
pplot(axs[1], arr_L, pdf_L, xlabel='Center (nm)', ylabel='PDF')
pplot(axs[2], arr_w, pdf_w, xlabel='Width (nm)', ylabel='PDF')
plt.show()
MLE of Amplitude (e): 55.00
MLE of Center (nm): 599.50
MLE of Width (nm): 7.50
If you want to quantitatively estimate the error, you may fit a Gaussian to the pdf (not recommended). The recommended way is to find the minimum-width interval around the most likely value (MLE = maximum likelihood estimate) such that the integrated area is 0.683 (1-sigma). The interval may be asymmetric.
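A possible implementation of that recipe on the gridded, marginalized pdfs computed above (a sketch only): sort the grid points by probability and keep the most probable ones until 68.3% of the total is enclosed, which gives the minimum-width, possibly asymmetric, region on the grid.
# Approximate 68.3% credible interval from a marginalized, normalized, gridded pdf.
def credible_interval(x, pdf, level=0.683):
    order = np.argsort(pdf)[::-1]                      # grid points, most probable first
    cum = np.cumsum(pdf[order]) / pdf.sum()
    included = order[:np.searchsorted(cum, level) + 1]
    return x[included].min(), x[included].max()

for name, x, pdf in [("Amplitude (e)", arr_A, pdf_A),
                     ("Center (nm)", arr_L, pdf_L),
                     ("Width (nm)", arr_w, pdf_w)]:
    lo, hi = credible_interval(x, pdf)
    print("{}: 68.3% interval ~ [{:.1f}, {:.1f}]".format(name, lo, hi))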
Also, if you want a publication-quality result, you have to tune the search grid for this case: the width pdf seems to have non-negligible probability at lower values (< 5 nm), so you should extend the search grid to include the w < 5 domain.