#!/usr/bin/env python
# coding: utf-8

# # PLAsTiCC - RAMP: classification of astronomical transients
#
# ##### Alexandre Boucaud (Paris-Saclay Center for Data Science)
# ##### Emille E. O. Ishida (Universite Clermont-Auvergne)

# ## Introduction
#
# Most of the sources we observe in the night sky evolve over very long periods of time (millions or even billions of years) - rendering them practically static on the time scale of a human life. However, this is only a partial description of the events taking place in the Universe.
#
# For a long time, astronomers have [recorded observations](https://en.wikipedia.org/wiki/SN_185) of events whose lifespan lasts for days or months. More recently, the advent of modern telescopes has expanded our ability to detect astronomical events lasting from a few seconds to years. These are called **[transients](https://www.lsst.org/science/transient-optical-sky)** and they are the main concern of researchers working in **time domain astronomy**.
#
# Transients are the observational consequence of a large variety of astronomical phenomena: for example, a cataclysmic event (e.g. a stellar explosion, in the case of [supernovae](https://www.nasa.gov/audience/forstudents/5-8/features/nasa-knows/what-is-a-supernova.html)), the physical processes governing high density regions of the Universe (e.g. [active galactic nuclei (AGN)](https://ned.ipac.caltech.edu/level5/Cambridge/frames.html)), or our position relative to a set of observed objects (e.g. [eclipsing binaries](http://www.physics.sfasu.edu/astro/ebstar/ebstar.html)).
#
# These relatively rapidly changing objects can provide important clues about themselves and their environment - as well as about the evolution of the Universe as a whole (e.g.
# [type Ia supernovae](http://hubblesite.org/hubble_discoveries/dark_energy/de-type_ia_supernovae.php) provided the first evidence of the current accelerated expansion of the Universe, caused by [dark energy](http://astronomy.swin.edu.au/cosmos/D/Dark+Energy)). The proper classification of transients is therefore a crucial task in observational astronomy - especially in light of the large data volumes expected from the next generation of astronomical surveys.
#
# In what follows, we will use the example of supernovae to illustrate the more general problem of transient classification.

# ## The Data Problem
#
# Once a supernova is detected, we need to scrutinize its light through two different channels before we can use it in subsequent astrophysical analyses. These two methods are called *spectroscopy* and *photometry*.
#
# **[Spectroscopy](http://loke.as.arizona.edu/~ckulesa/camp/spectroscopy_intro.html)** is the modern equivalent of using a prism to separate a beam of white light into the colours of the rainbow. It is a high resolution measurement which allows us to identify emission/absorption features indicative of specific chemical elements - and it is also the main source of information enabling classification.
#
# Despite being paramount for the classification task, spectroscopy is an extremely time consuming process - with integration times ranging from 20 minutes to a few hours depending on the telescope and the brightness of the source.
#
# Traditionally, transient classification is based entirely on spectroscopic measurements - however, given the volume of data expected from the upcoming large scale sky surveys, this strategy is not sustainable. An alternative approach, based on cheaper/lower resolution measurements, must be found.
#
# **Photometry** gives the overall information of how bright the source is (how much energy the object is emitting) at a particular moment.
# It can be described as the low resolution counterpart of spectroscopy - or the integral of all the information stored in the spectrum.
#
# A sequence of photometric observations made at different moments in time is called a *light curve*. It holds the information of how the energy output of the source evolves with time and can also be used to characterize different types of supernovae (although the distinction between classes is much less clear-cut than we would hope it to be).
#
# Photometry has the advantage of being a relatively straightforward and cheap process. Moreover, it is also capable of probing a much larger volume of the Universe - reaching dimmer and, consequently, more distant objects.
#
# New instruments, like the [Large Synoptic Survey Telescope (LSST)](https://www.lsst.org/) - scheduled to begin observations in 2022 - will produce an unprecedented number of light curves. Once these transients are detected, we rely on agreements with other telescopes in order to acquire a small number of spectroscopic observations.
#
# In summary: we must develop ways to classify transients using only their light curves.
#
# In the context of machine learning, this means using the small, biased spectroscopic sample to train a classifier which will subsequently be applied to the larger, photometric-only sample.

# ### Filters
#
# In a real data scenario, the conceptual description of a light curve given above has a further degree of complexity.
#
# The electromagnetic spectrum is divided into broad wavelength ranges, and the signal from the source is integrated only within each range - and convolved with the filter transmission.
#
# The figure below shows an example of a set of broad-band filters superimposed on a type Ia supernova spectrum.
#
# As a consequence, for each supernova we will have a set of 4 light curves.
# These express the evolution of the brightness of the object as a function of time *for each one of the wavelength intervals covered by the broad-band filters*.

# ## The prediction task

# For this specific challenge we will focus on the binary classification of supernovae: SNIa versus non-SNIa.

# ### The Road so Far
#
# The photometric classification of supernovae has been studied in the astronomical literature for a long time. This culminated in the *Supernova Photometric Classification Challenge (SNPCC)* ([Kessler *et al.*, 2010](https://arxiv.org/pdf/1008.1024.pdf)), where participants were invited to classify a simulated dataset designed to mimic the specifications of the *[Dark Energy Survey](https://www.darkenergysurvey.org/)*.
#
# The challenge provided a state-of-the-art description of the classification methods available in the literature, bringing together different groups who had been investigating the problem for a while. Although none of the algorithms performed obviously better than all the others, its main legacy was the updated data set and corresponding labels which were made public to the community.
#
# This is the data set that we will use in this RAMP.

# ### References
#
# A recent review of how different machine learning methods perform on this dataset can be found in [Lochner *et al.*, 2016](https://arxiv.org/abs/1603.00882) and references therein.

# ## Data
#
# The light curve data consists of **non-homogeneously sampled**, **non-periodic** time series with correlated errors, obtained in several wavelength filters.
#
# The DES dataset provided with this RAMP has light curves in 4 filters: *g*, *r*, *i* and *z*.

# ### Downloading data
#
# To obtain the DES dataset, make sure to run the provided script `download_data.py`

# In[ ]:

get_ipython().system('python download_data.py')

# ### Reading the data
#
# Here we provide methods to read the `pickle` files and convert the data to a `pandas` dataframe for convenience.
#
# The SN label is removed from the main dataframe and provided as a separate one.

# In[1]:

import gzip
import pickle

import numpy as np
import pandas as pd


def read_data(filename):
    """Read data from a gzipped pickle file into a pandas dataframe."""
    with gzip.open(filename, 'rb') as f:
        data = pickle.load(f)
    X = to_dataframe(data)
    y = pd.get_dummies(X.type == 0, prefix='SNIa', drop_first=True)
    X = X.drop(columns=['comment', 'type'])
    return X, y


def to_dataframe(data):
    """Convert a python dictionary of supernovae to a pandas dataframe."""
    for idx in data:
        sn = data[idx]
        for filt in 'griz':
            # Flatten each filter's time series into per-filter columns
            sn['mjd_%s' % filt] = np.array(sn[filt]['mjd'])
            sn['fluxcal_%s' % filt] = np.array(sn[filt]['fluxcal'])
            sn['fluxcalerr_%s' % filt] = np.array(sn[filt]['fluxcalerr'])
            del sn[filt]
        # Promote the header metadata to top-level columns
        sn.update(sn['header'])
        del sn['header']
    return pd.DataFrame.from_dict(data, orient='index')


# Here is a peek at the data.

# In[2]:

X, y = read_data('data/des_train.pkl')

# In[3]:

# 5 first rows of the dataframe
X.head()

# And the labels, converted to a {0, 1} array.

# In[4]:

y.head()

# In the DataFrame, each supernova is indexed by its ID

# In[5]:

X.head().index

# and can be accessed via the `loc` accessor

# In[6]:

# SN 1186
X.loc[1186]

# or via positional indexing with the `iloc` accessor.
# In[7]:

# Also SN 1186
sn0 = X.iloc[0]

# Here is a method to visualize the data

# In[8]:

import matplotlib.pyplot as plt

plt.style.use('seaborn')
get_ipython().run_line_magic('matplotlib', 'inline')

DES_FILTERS = 'griz'


def plot_lightcurves(idx):
    """Plot the 4 DES light curves of the supernova at position `idx`."""
    fig, axes = plt.subplots(2, 2, sharex=True, sharey=True, figsize=(12, 8))
    for id_f, f in enumerate(DES_FILTERS):
        ax = axes[id_f // 2, id_f % 2]
        ax.errorbar(X.iloc[idx]['mjd_%s' % f],
                    X.iloc[idx]['fluxcal_%s' % f],
                    X.iloc[idx]['fluxcalerr_%s' % f],
                    fmt='o')
        ax.set_xlabel('MJD')
        ax.set_ylabel('Calibrated flux')
        ax.set_title('%s-band' % f)


# Use this method to look at the raw light curves

# In[9]:

plot_lightcurves(0)

# ## Starting kit pipeline elements

# For this classification challenge, participants are required to submit two files:
# * `feature_extractor.py`
# * `classifier.py`
#
# In the following, we will go through the process of designing a baseline submission, that is, writing the scripts for both the pre-processing step and the classification step.

# ### Pre-processing - a.k.a. feature extraction

# As the SN data is highly non-homogeneous, a little pre-processing needs to be done. In the example below we fit the data with the parametric function proposed by [Bazin et al., 2009](https://arxiv.org/pdf/0904.1066.pdf):
# $$ f(t) = A \frac{\exp\left(-\frac{t - t_0}{\tau_{fall}}\right)}{1 + \exp\left(-\frac{t - t_0}{\tau_{rise}}\right)} + B$$

# In[10]:

def bazin(time, A, B, t0, tfall, trise):
    # Exponential decline modulated by a sigmoid rise (Bazin et al., 2009);
    # note the minus sign in the rise term, which produces the
    # characteristic rise-then-fall shape.
    X = np.exp(-(time - t0) / tfall) / (1 + np.exp(-(time - t0) / trise))
    return A * X + B


# Although this parametric form does not have a physical motivation, it reproduces quite well the behavior of most light curves.
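# As a quick sanity check of this parametrisation, the Bazin function can
# be fitted with `scipy.optimize.curve_fit`. The sketch below is not part
# of the starting kit: the light curve is synthetic (generated from the
# Bazin function itself, written with the minus sign in the rise term as
# in eq. 1 of Bazin et al., 2009, plus Gaussian noise), and the initial
# guess heuristic is an assumption. On real DES data one would pass the
# `mjd_*` and `fluxcal_*` arrays of a single supernova instead.

```python
# Sketch: recovering Bazin parameters from a synthetic light curve.
import numpy as np
from scipy.optimize import curve_fit


def bazin(time, A, B, t0, tfall, trise):
    # Bazin et al. (2009) parametric light curve model
    X = np.exp(-(time - t0) / tfall) / (1 + np.exp(-(time - t0) / trise))
    return A * X + B


rng = np.random.default_rng(0)
time = np.linspace(-30.0, 100.0, 40)           # days around maximum
flux = bazin(time, 50.0, 2.0, 0.0, 30.0, 5.0)  # "true" light curve
flux += rng.normal(0.0, 0.5, time.size)        # observational noise

# Crude but serviceable initial guess: amplitude, baseline, time of
# maximum, and typical fall/rise time scales in days.
p0 = (flux.max(), flux.min(), time[np.argmax(flux)], 20.0, 5.0)
params, _ = curve_fit(bazin, time, flux, p0=p0, maxfev=10000)

# The fit residuals should be at the level of the injected noise.
residuals = flux - bazin(time, *params)
rms = np.sqrt(np.mean(residuals ** 2))
```

# The five fitted parameters per filter make a compact, fixed-length
# summary of an irregularly sampled light curve, which is what a
# downstream classifier needs.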
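# To make the submission structure concrete, here is a minimal sketch of
# what the two submitted files could contain, using the Bazin parameters
# as features. This assumes the common RAMP convention of a
# `FeatureExtractor` with `fit`/`transform` and a `Classifier` with
# `fit`/`predict_proba` wrapping a scikit-learn estimator; the exact
# interface is defined by the starting kit, so check the kit itself
# before submitting. The random forest choice is just an illustration.

```python
import numpy as np
from scipy.optimize import curve_fit
from sklearn.ensemble import RandomForestClassifier


def bazin(time, A, B, t0, tfall, trise):
    # Bazin et al. (2009) parametric light curve model
    X = np.exp(-(time - t0) / tfall) / (1 + np.exp(-(time - t0) / trise))
    return A * X + B


# --- sketch of feature_extractor.py ---
class FeatureExtractor:
    def fit(self, X_df, y):
        pass  # nothing to learn: features are fitted per light curve

    def transform(self, X_df):
        features = []
        for _, sn in X_df.iterrows():
            row = []
            for f in 'griz':
                t, flux = sn['mjd_%s' % f], sn['fluxcal_%s' % f]
                p0 = (flux.max(), flux.min(), t[np.argmax(flux)], 20.0, 5.0)
                try:
                    prms, _ = curve_fit(bazin, t, flux, p0=p0, maxfev=2000)
                except RuntimeError:
                    prms = np.zeros(5)  # fallback when the fit diverges
                row.extend(prms)
            features.append(row)
        return np.array(features)  # 5 Bazin parameters x 4 filters


# --- sketch of classifier.py ---
class Classifier:
    def __init__(self):
        self.clf = RandomForestClassifier(n_estimators=100, random_state=42)

    def fit(self, X, y):
        self.clf.fit(X, np.ravel(y))

    def predict_proba(self, X):
        return self.clf.predict_proba(X)
```

# The two classes communicate only through the feature matrix, so either
# side can be improved independently in later submissions.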