#!/usr/bin/env python # coding: utf-8 # #

# # # # PLAsTiCC - RAMP: classification of astronomical transients # # ##### Alexandre Boucaud (Paris-Saclay Center for Data Science) # ##### Emille E. O. Ishida (Universite Clermont-Auvergne) # ## Introduction # # # Most of the sources we observe in the night sky evolve during a very long period of time (millions or even billions of years) - rendering them practically static in comparison to the time scale of a human life. However, this is only a partial description of possible events taking place in the Universe. # # For a long time, astronomers have [recorded the observation](https://en.wikipedia.org/wiki/SN_185) events whose life spam lasts for days or months. More recently, the advent of modern telescopes expanded our ability to detect astronomical events which happen from a few seconds to years. These are called **[transients](https://www.lsst.org/science/transient-optical-sky)** and they are the may concern of researchers working in **time domain astronomy**. # # # # Transients can are the observational consequence of a large variety of astronomical phenomena. For example, a cataclismic event (e.g. a stellar explosion in the case of [supernovae](https://www.nasa.gov/audience/forstudents/5-8/features/nasa-knows/what-is-a-supernova.html)); the physical process governing high density regions of the Universe (e.g. [active galactic nuclei (AGN)](https://ned.ipac.caltech.edu/level5/Cambridge/frames.html)) or our position in relation to a set of observed objects (e.g. [eclipsing binaries](http://www.physics.sfasu.edu/astro/ebstar/ebstar.html)). # # # These relatively rapid changing objects can provide important clues about themselves and their environement - as well as the evolution of the universe as a whole (e.g. [type Ia supernovae](http://hubblesite.org/hubble_discoveries/dark_energy/de-type_ia_supernovae.php) provided the first evidence of the current accelerated expansion of the Universe caused by [dark energy](http://astronomy.swin.edu.au/cosmos/D/Dark+Energy)). Therefore, the proper classification of transients is a crucial task in observational astronomy - specially in the light of large data volumes expected for the next generation of astronomical surveys. # # # In what follows, we will use the example of supernovae to illustrate the more general problem of transient classification. # ## The Data Problem # # # Once a supernova is detected, we need to scrutinize its light through two different channels before we can use it in subsequent astrophysical analysis. These two methods are called *spectroscopy* and *photometry*. # # **[Spectroscopy](http://loke.as.arizona.edu/~ckulesa/camp/spectroscopy_intro.html)** is the modern equivalent of using a prism to separate a beam of white light in the rainbown colours. It is a high resolution measurement which allows us to identify emission/absorption features indicative of specific chemical elements - and are also the main source of information enabling classification. # # Despite being paramount for the classification task, spectroscopy is an extremely time consuming process - with integration times hanging from 20 minutes to a few hours depending on the telescope and brightness of the source. # # # Traditionally, transient classification is completely based on spectroscopic measurements - however, given the volume of data expected from the upcoming large scale sky surveys, this strategy is not sustainable. An alternative approach, based on cheaper/lower resolution measurements must be found. #

**Figure 1**: Example of spectra of SNe Ia, Ic, Ib and II (from top to bottom) highlighting their main spectral features. # *Figure by Daniel Kasen* #

# # **Photometry** gives the overall information of how bright the source is (how much energy the object is emitting) at a particular moment. It can be described as the low resolution counterpart of spectroscopy - or the integral of all the information stored in the spectra. # # # A sequence of photometric observations made at different moments in time is called a *light curve*. It holds the information of how the energy of the source evolves with time and can also be used to characterize different types of supernovae (although the distinction between classes is much more discrete than we would hope it to be). # # # Photometry has the advantadge of being a relatevely straight forward and cheap process. Moreover, it is also capable of accessing a much larger volume of the universe - reaching dimmer and, consequently, further objects. #

**Figure 2**: Example of the light-curves (brightness as a function of time) for different supernova types. *Image: J. Craig Wheeler, 2012* #

# New instruments, like the [Large Synoptic Survey Telescope (LSST)](https://www.lsst.org/) - scheduled to begin observations in 2022 - will produce an unprecedented number of light curves. Once these transients are detected, we rely on agreements with other telescopes in order to acquire a small number of spectroscopic observations. # # # In summary: we must develop ways to classify transients using only their light curves. # # # In the context of machine learning, this means using the small, biased spectroscopic sample in order to train a classifier which will subsequently be applied to the larger, photometric-only sample. # ### Filters # # # In a real data scenario, the concetual description of a light curve given above has a further degree of complexity. # # The electromagnetic spectrum is divided in broad wavelength ranges and the signal from the source is integrate only within this range - and convolved with the filter transmission. # # The figure bellow shown an example of a set of broad-band filters superimposed to a type Ia supernova spectra. #

**Figure 3**: Example of DES broad-band filters transmission superimposed to a type Ia supernova spectrum at peak brightness. *Figure by Emille Ishida* #

# As a consequence, for each supernova we will have a set of 4 light curves. These express the evolution of the brightness of the object as a function of time *for each one of the wavelength intervals covered by the broad-band filters*. # ## The prediction task # For this specific challenge we wil focus on the binary classification of supernovae, between SNIa and non-SNIa. # ### The Road so Far # # # The photometric classification of supernovae has been studied in the astronomical literature for a long time. This culminated in the *Supernova Photometric Classification Challenge (SNPCC)* ([Kessler *et al.*, 2010](https://arxiv.org/pdf/1008.1024.pdf)), where the participants were invited to classify a simulated dataset designed to mimic the specifications of the *[Dark Energy Survey](https://www.darkenergysurvey.org/)*. # # # The challenge provided a state of the art description of the classification methods available in the literature, bringing together different groups who had been investigating the problem for a while. Although none of the algorithms performed obviously better than all others, its main legacy was the updated data set and corresponding labels which were made public to the community. # # # This is the data set that we will use in this RAMP. # ### References # # A recent review of how different machine learning methods perform on this dataset can be found at [Lochner *et al.*, 2016](https://arxiv.org/abs/1603.00882) and references therein. # ## Data # # # # The lightcurve data is made of **non-homogeneously sampled**, **non-periodic** time series with correlated errors obtained in several wavelength filters. # # The DES dataset provided with this RAMP has light curves in 4 filters *g*, *r*, *i* and *z*. # ### Downloading data # # To obtain the DES dataset, make sure to run the provided script `download_data.py` # In[ ]: get_ipython().system('python download_data.py') # ### Reading the data # # Here we provide methods to read the `pickle` files and convert the data to a `pandas` dataframe for convenience. # # The SN label is removed from the main dataframe and provided as a separate one. # In[1]: import gzip import pickle import pandas as pd import numpy as np def read_data(filename): """Read data from pickled file to a pandas dataframe""" with gzip.open(filename, 'rb') as f: data = pickle.load(f) X = to_dataframe(data) y = pd.get_dummies(X.type == 0, prefix='SNIa', drop_first=True) X = X.drop(columns=['comment', 'type']) return X, y def to_dataframe(data): """Converts from a python dictionary to a pandas dataframe""" for idx in data: sn = data[idx] for filt in 'griz': sn['mjd_%s' % filt] = np.array(sn[filt]['mjd']) sn['fluxcal_%s' % filt] = np.array(sn[filt]['fluxcal']) sn['fluxcalerr_%s' % filt] = np.array(sn[filt]['fluxcalerr']) del sn[filt] sn.update(sn['header']) del sn['header'] return pd.DataFrame.from_dict(data, orient='index') # Here is a peak at the data. # In[2]: X, y = read_data('data/des_train.pkl') # In[3]: # 5 first rows of the dataframe X.head() # And the labels, converted to a {0, 1} array. # In[4]: y.head() # In the DataFrame, each supernova is indexed by its ID # In[5]: X.head().index # and can be accessed via the `loc` accessor # In[6]: # SN 1186 X.loc[1186] # or via positional indexing with the `iloc` accessor. # In[7]: # Also SN 1186 sn0 = X.iloc[0] # Here is a method to visualize the data # In[8]: import matplotlib.pyplot as plt plt.style.use('seaborn') get_ipython().run_line_magic('matplotlib', 'inline') DES_FILTERS = 'griz' def plot_lightcurves(idx): fig, axes = plt.subplots(2, 2, sharex=True, sharey=True, figsize=(12, 8)) for id_f, f in enumerate(DES_FILTERS): ax = axes[id_f // 2, id_f % 2] ax.errorbar(X.iloc[idx]['mjd_%s' % f], X.iloc[idx]['fluxcal_%s' % f], X.iloc[idx]['fluxcalerr_%s' % f], fmt='o') ax.set_xlabel('MJD') ax.set_ylabel('Calibrated flux') ax.set_title('%s-band' % f) # Use this method to look at the raw light curves # In[9]: plot_lightcurves(0) # ## Starting kit pipeline elements # For this classification challenge, participants are required to submit two files: # * `feature_extractor.py` # * `classifier.py` # # In the following, we will go through the process of designing a baseline submission, that is writing the script for both the pre-processing step and the classification step. # ### Pre-processing - a.k.a. feature extraction # As the SN data is highly non-homogeneous a little pre-processing need to be done. In the example below we fit the data using a parametric function proposed by [Bazin et al., 2009](https://arxiv.org/pdf/0904.1066.pdf): # $$ f(t) = A \frac{\exp\left(-\frac{t - t_0}{\tau_{fall}}\right)}{1 + \exp\left(\frac{t - t_0}{\tau_{rise}}\right)} + B$$ # In[10]: def bazin(time, A, B, t0, tfall, trise): X = np.exp(-(time - t0) / tfall) / (1 + np.exp((time - t0) / trise)) return A * X + B # Although this parametric form does not have a physical motivation, it reproduces quiet well the behavior of most light curves. # #

# NOTE # This function is very sensitive to the magnitude of the time of observation. The use of [Modified Julian Date](https://bowie.gsfc.nasa.gov/time/) creates large time values, which are not relevant for such analysis so we use a scaled time instead in order to ensure the convergence of the fit. #

# # # We provide an example fitting function, so you can fit a light curve and plot reslts as follows: # In[11]: from scipy.optimize import least_squares def lightcurve_fit(time, flux): scaled_time = time - time.min() t0 = scaled_time[flux.argmax()] guess = (0, 0, t0, 40, -5) errfunc = lambda params: abs(flux - bazin(scaled_time, *params)) result = least_squares(errfunc, guess, method='lm') return result.x # The parametric fit on the light curves yields 5 parameters for each filter $A, B, t_0, \tau_{fall}, \tau_{rise}$. # In[12]: lightcurve_fit(sn0.mjd_g, sn0.fluxcal_g) # In[13]: def plot_lightcurves_with_fit(idx): fig, axes = plt.subplots(2, 2, sharex=True, sharey=True, figsize=(12, 8)) for id_f, f in enumerate(DES_FILTERS): ax = axes[id_f // 2, id_f % 2] time = X.iloc[idx]['mjd_%s' % f] flux = X.iloc[idx]['fluxcal_%s' % f] fit = lightcurve_fit(time, flux) stime = np.arange(time.min(), time.max()) ax.plot(time, flux, 'o') ax.plot(stime, bazin(stime - stime.min(), *fit)) ax.set_xlabel('MJD') ax.set_ylabel('Calibrated flux') ax.set_title('%s-band' % f) # In[14]: plot_lightcurves_with_fit(2) # ### Constructing the data matrix for classification # # Now that you can fit each individual light curve, you are ready to build your low dimension representation of the data. # # # In this example, each entry in the data matrix will correspond to one object and each column to a best-fit parameters. As we have 4 filters, we will concatenate the 20 best-fit results for different filters in the same row of the data matrix. # In[15]: def preprocessing(data): # Create palceholder for output matrix full_params = np.zeros((len(data), 5 * len(DES_FILTERS))) # Iterate over supernovae for idx, snid in enumerate(data.index): params = np.zeros((len(DES_FILTERS), 5)) # Iterate over filters for id_f, f in enumerate(DES_FILTERS): time = data.loc[snid, 'mjd_%s' % f] flux = data.loc[snid, 'fluxcal_%s' % f] try: params[id_f] = lightcurve_fit(time, flux) except ValueError: # If fit does not converge leave zeros continue full_params[idx] = params.ravel() return full_params # ### `feature_extractor.py` # # The submitted `feature_extractor.py` is expected to define part of the `FeatureExtractor` class, namely the `transform` method that will process the raw data `X`. # # Given the methods defined in the section above, the class would boil down to # In[ ]: # feature_extractor.py class FeatureExtractor(): def __init__(self): pass def fit(self, X_df, y): pass def transform(self, X_df): return preprocessing(X_df) # However it will need to have all the methods above implemented with the right imports **within the file**. # # Checkout the baseline submission for a working example: # In[ ]: get_ipython().run_line_magic('load', 'submissions/starting_kit/feature_extractor.py') # ## Classification # We can now use the feature matrix to train a classifier of our choosing. This is the second expected contribution to the challenge. # ### `classifier.py` # # As for the `feature_extractor.py`, the `classifier.py` needs a `Classifier` class with `__init__()`, `fit()` and `predict_proba()` methods to be implemented. # # Since this is not the main focus of the challenge, we chose a very simple implementation for the baseline, that is a random forest classifier. The starting_kit example looks like # In[ ]: # classifier.py from sklearn.base import BaseEstimator from sklearn.ensemble import RandomForestClassifier class Classifier(BaseEstimator): def __init__(self): self.clf = RandomForestClassifier() def fit(self, X, y): self.clf.fit(X, y) def predict_proba(self, X): return self.clf.predict_proba(X) # This classifier will take the output of the `FeatureExtractor.transform()` and train using the `y` labels from the beginning. # # Once trained, the RAMP pipeline will automatically then test the `Classifier` predictions against the expected labels, following scores/metrics defined in the `problem.py`. # ## Local testing # # To test locally that the pipeline works, the RAMP workflow provides a command line tool: `ramp_test_submission`. # # For relatively big datasets (e.g. the DES target set has 20 000 + entries), a smaller portion of the data is also delivered: # * `des_train_mini.pkl` # * `des_test_mini.pkl` # so that the local testing is only a few minutes long. # # To train/test on this minimalist dataset, use the `--quick-test` keyword. # In[16]: get_ipython().system('ramp_test_submission --quick-test') # ### Writing your own submissions locally # # To start working on a new submission, the easiest way is to copy the starting_kit into a new directory and then modify the files from the notebook using the `%%file` magic command. # In[ ]: get_ipython().system('cp -r submissions/starting_kit submissions/first_submission') # In[ ]: get_ipython().run_cell_magic('file', 'submissions/first_submission/feature_extractor.py', '\nclass FeatureExtractor():\n def __init__(self):\n pass\n\n def fit(self, X_df, y):\n pass\n\n def transform(self, X_df):\n return preprocessing(X_df)\n') # In[ ]: get_ipython().run_line_magic('file', 'submissions/first_submission/classifier.py') from sklearn.base import BaseEstimator from sklearn.ensemble import RandomForestClassifier class Classifier(BaseEstimator): def __init__(self): self.clf = RandomForestClassifier() def fit(self, X, y): self.clf.fit(X, y) def predict_proba(self, X): return self.clf.predict_proba(X) # Finally, test your new submission by providing the name of the submission # In[ ]: get_ipython().system('ramp_test_submission --submission=first_submission --quick-test') # # --- # In[17]: from IPython.display import HTML HTML(open("style/custom.css").read())