#!/usr/bin/env python
# coding: utf-8

# # DC2 Object Catalog Run2.2i GCR tutorial -- Part I: GCR access
# 
# Owners: **Francois Lanusse [@EiffL](https://github.com/LSSTDESC/DC2-analysis/issues/new?body=@EiffL), Javier Sanchez [@fjaviersanchez](https://github.com/LSSTDESC/DC2-analysis/issues/new?body=@fjaviersanchez)**  
# Last Verifed to Run: **2020-07-02** (by @jrbogart)
# 
# This notebook will illustrate the basics of accessing the DPDD-like object catalog, which contains the detected objects at the coadd level, through the Generic Catalog Reader (GCR, https://github.com/yymao/generic-catalog-reader) as well as how to select useful samples of stars/galaxies from the DM outputs.
# 
# __Learning objectives__:
# 
# After going through this notebook, you should be able to:
#   1. Load and efficiently access a DC2 object catalog with the GCR
#   2. Understand and have references for the catalog schema
#   3. Apply cuts to the catalog using GCR functionalities
#   4. Have an example of quality cuts and simple star/galaxy separation cut
#   5. Load the HSC Public Data Release catalog in the XMM field [Advanced]
# 
# __Logistics__: This notebook is intended to be run through the JupyterHub NERSC interface available here: https://jupyter-dev.nersc.gov. To setup your NERSC environment, please follow the instructions available here: https://confluence.slac.stanford.edu/display/LSSTDESC/Using+Jupyter-dev+at+NERSC
# 
# __Other notes__: 
#    * If you restart your kernel, or if it automatically restarts for some reason, all imports and variables will become undefined so, you probably will have to re-run everything.

# In[1]:


get_ipython().run_line_magic('matplotlib', 'inline')
import numpy as np
import matplotlib.pyplot as plt


# ## Accessing the object catalog with the GCR
# 
# The [GCRCatalogs](https://github.com/LSSTDESC/gcr-catalogs) package is a DESC project which aims at gathering in one convenient location various simulation/data catalogs made available to the collaboration.  
# In this section, we illustrate how to use this tool to access the object catalogs from DC2 Run2.2i.

# In[2]:


import GCRCatalogs
# Load the object catalog
catalog = GCRCatalogs.load_catalog('dc2_object_run2.2i_dr3')


# A significant numbers of catalogs besides the DC2 object catalogs are already available, use `sorted(GCRCatalogs.get_available_catalogs(False))` to see a full list and visit the [DC2 Data Product](https://confluence.slac.stanford.edu/display/LSSTDESC/DC2+Data+Product+Overview) page to see all the DC2 related catalogs.

# ### DC2 object catalog Schema
# 
# 
# To see the quantities available in the catalog, you can use the following:

# In[3]:


sorted(catalog.list_all_quantities())


# The meaning of these fields is documented in the [SCHEMA.md](https://github.com/LSSTDESC/gcr-catalogs/blob/master/GCRCatalogs/SCHEMA.md#schema-for-dc2-coadd-catalogs) file of the `LSSTDESC/gcr-catalogs` repository.  
# As explained in that link, the values exposed here are not the native quantities produced by the Data Management stack, but instead this schema strives to follow the standard nomenclature of the LSST Data Products Definition Document [DPDD](http://ls.st/dpdd).
# 
# The DPDD is an effort made by the LSST project to standardize the format of the official Data Release Products (DRP). While the native outputs of the DM stack are succeptible to change, the DPDD will be more stable. An early adoption of these conventions by the DESC will save time and energy down the road.
# 
# This being said, not all use-cases and relevant quantities are covered by these conventions yet, so the GCR preserves access to the underlying native DM stack fieds, all 2046 of which can be listed using:  

# In[4]:


sorted(catalog.list_all_native_quantities())


# We can see that the catalog includes:
# 
# * Positions
# * Fluxes and magnitudes (PSF and [CModel](https://www.sdss.org/dr12/algorithms/magnitudes/#cmodel))
# * Shapes (using [GalSim's HSM](http://galsim-developers.github.io/GalSim/namespacegalsim_1_1hsm.html))
# * Quality flags: e.g, does the source have any interpolated pixels? Has any of the measurement algorithms returned an error?
# * Other useful quantities: `blendedness`, measure of how flux is affected by neighbors: (1 - flux.child/flux.parent) (see 4.9.11 of [Bosch et al. 2018](http://adsabs.harvard.edu/abs/2018PASJ...70S...5B)); `extendedness`, classifies sources as extended or psf-like.

# ### Accessing the data
# 
# Run2.2i is considerably larger than Run1.2i; accessing the full dataset can be challenging. In order to access the data efficiently, it is important to understand how it is physically stored and how to access it, one piece at the time. 
# 
# The coadds produced by the DM stack are structured in terms of large `tracts` and smaller `patches`, illustrated here for DC2 Run1.1i (Run2.2i covers 300 square degrees and has 165 tracts):
# <img src="assets/dc2_skymap.png">
# Here the tracts have large blue numbers, and the patches are denoted with an `(x,y)` format. (In the illustration each tract has 8x8 patches. In later runs, including 2.2i, each tract has 7x7 patches.)
# 
# You can learn more about how to make such a plot of the tract and patches [here](dm_butler_skymap.ipynb)
# 
# The **GCR object catalog preserves the tract structure of the data** so that any particular quantity can be accessed on a per-tract basis. The tracts available in the catalog can be listed using the following command:

# In[5]:


# Query all available tracts, only displays the first 5
catalog.available_tracts[:5]


# The DM stack includes functionality to get the tract and patch number corresponding to a certain position `(RA,DEC)`. However, it is out of the scope of this tutorial.

# GCR provides the following `native_filters` mechanism, which you can use to speed up data access if you only need certain chunks of the dataset.
# For the object catalog, the chunks are broken into `tract`, and hence that is what you can use in values for   `native_filters`:

# In[6]:


# Retrieve the ra,dec coordinates of all sources within tract number 4430
data = catalog.get_quantities(['ra', 'dec'], native_filters=['tract == 4430'])

# Plot a 2d histogram of sources
plt.figure(figsize=(10,7))
plt.hist2d(data['ra'], data['dec'],100); plt.gca().set_aspect('equal'); plt.colorbar(label='Number of objects')
plt.xlabel('RA [deg]');
plt.ylabel('dec [deg]');


# `native_filters` can be used to filter a catalog by some specific quantities that are related to its underlying data format
# (use `catalog._native_filter_quantities` to see them). You have more information [here](https://yymao.github.io/generic-catalog-reader/)

# The data returned by the GCR is structured as a native Python dictionary:

# In[7]:


data


# But it can also easily be converted into a [Pandas DataFrame](https://pandas.pydata.org), if you are so inclined ;-)

# In[8]:


import pandas
pdata = pandas.DataFrame(data)
pdata


# As a simple test, you can show the advantage of loading one tract at a time compared to the entire catalog:

# In[9]:


get_ipython().run_line_magic('time', "data = catalog.get_quantities(['ra', 'dec'], native_filters=['tract == 4431'])")


# In[10]:


get_ipython().run_line_magic('time', "data = catalog.get_quantities(['ra', 'dec']) #This cell takes a bit to execute so, if you are in a hurry you can skip this")


# In order to make accessing chunks of data convenient to the user, the `catalog.get_quantities` also provides the option to [return an iterator](https://yymao.github.io/generic-catalog-reader/#GCR.BaseGenericCatalog.get_quantities). This allows you to only read and work on one piece of data at a time while looping through the entire catalog. Remember that catalogs can be very large and might not fit in memory if you try to load the entire catalog at once. In the DC2 object catalog for Run2.2i, as we follow the DM structure of data, the catalog will iterate over `tract` when using `return_iterator`. 
# 
# You can find more general information about iterators [here](https://stackoverflow.com/questions/3294889/iterating-over-dictionaries-using-for-loops).

# In[11]:


# Loop through a few tracts using an iterator
for d in catalog.get_quantities(['ra', 'dec'],
                                native_filters=['tract >= 2900', 'tract < 3000'],
                                return_iterator=True):
    # Here we only handle a tract at a time
    plt.scatter(d['ra'], d['dec'], s=2);
    
plt.xlabel('RA');
plt.ylabel('Dec');
plt.title('2900 <= Tract < 3000');


# ### Applying filters and cuts
# 
# In order to avoid returning unecessary data, the GCR has a functionality to filter out entries as it reads the files. Note that this is different from the `native_filters` discussed above, which avoids reading part of the data altogether.
# 
# Defining these filters requires the `GCRQuery` module of the GCR package and can then be applied during the call to `get_quantities`:

# In[12]:


from GCR import GCRQuery

# Simple cut to remove unreliable detections
# More cuts can be added, as a logical AND, by appending GCRQuery objects to this list
simple_cuts = [
    GCRQuery('clean'), # The source has no flagged pixels (interpolated, saturated, edge, clipped...) 
                       # and was not skipped by the deblender
]

# Loads the data after cut
data_cut = catalog.get_quantities(['ra', 'dec'], 
                              filters = simple_cuts, 
                              native_filters=['tract == 3830'])

# Loads data without cuts
data_full = catalog.get_quantities(['ra', 'dec'], 
                              native_filters=['tract == 3830'])


# In[13]:


print(len(data_cut['ra']),len(data_full['ra']))


# In[14]:


# Plot a 2d histogram of sources
plt.figure(figsize=(15,7))
plt.subplot(121)
plt.hist2d(data_full['ra'], data_full['dec'],256); plt.gca().set_aspect('equal'); 
plt.xlabel('RA [deg]');
plt.ylabel('dec [deg]');
plt.title('Full sample')
plt.colorbar(label='Number of objects')

plt.subplot(122)
plt.hist2d(data_cut['ra'], data_cut['dec'],256); plt.gca().set_aspect('equal');
plt.xlabel('RA [deg]');
plt.ylabel('dec [deg]');
plt.title('Clean objects');
plt.colorbar(label='Number of objects');


# Admittedly, this plot is a little underwhelming, these quality cuts only remove a very small number of objects. This is due in part to the fact that there are relatively few pixel defects in the DC2 simulations that would be flagged by Instrument Signature Removal (ISR).  In general, the number of objects that are affected by pixel defects will grow as the coadds get deeper, but the total number of objects will also increase with depth.
# 
# To get a sense of the impact of these quality flags on real data, we can compare with a tract of the HSC PDR1 data ([Aihara et al (2018)](http://adsabs.harvard.edu/abs/2018PASJ...70S...8A)) which had been available on cori. Note that this HSC catalog follows the same schema as Run 1.2i. This tract is part of the XMM subfield of HSC (find out more about the HSC data release [here](https://hsc-release.mtk.nao.ac.jp/doc/) and [here](http://adsabs.harvard.edu/abs/2018PASJ...70S...8A)).
# 
# While this catalog was available, it was loaded, the same cuts were applied, and the data were plotted. Here is the result: 

# <img src="../tutorials/assets/GCR-obj-1-hsc-plot.png">

# This is a more dramatic plot, and illustrates the importance of selecting clean samples of objets.

# ## Example of filtering: Star/galaxy separation

# For now, we have `extendedness == base_ClassificationExtendedness_value` as a tool for star/galaxy classification. An object is considered extended if the the difference between the `PSF` magnitude and the [`CModel` magnitude](https://www.sdss.org/dr12/algorithms/magnitudes/#cmodel) is beyond certain threshold (0.0164). To know more about this see [Bosch et al. 2018](http://adsabs.harvard.edu/abs/2018PASJ...70S...5B) section 4.9.10

# In[15]:


star_cuts = [
    GCRQuery('clean'), # The source has no flagged pixels (interpolated, saturated, edge, clipped...) 
                       # and was not skipped by the deblender
    GCRQuery('extendedness==0'),
    GCRQuery((np.isfinite, 'mag_g_cModel')),
    GCRQuery((np.isfinite, 'mag_r_cModel')),
    GCRQuery((np.isfinite, 'mag_i_cModel')),
]

quantities = ['mag_g_cModel', 'mag_r_cModel', 'mag_i_cModel']

d = catalog.get_quantities(quantities, 
                           filters=star_cuts, 
                           native_filters=['tract == 3830'])


# __Note__: The cell above will output some runtime warnings related to objects that have negative or zero measured fluxes but we can safely ignore the warning since those objects will not appear in our 2D histogram.

# So now, we are selected what we think are stars. Let's take a look at the colors of these objects

# In[16]:


plt.hexbin(d['mag_g_cModel']-d['mag_r_cModel'],
           d['mag_r_cModel']-d['mag_i_cModel'], 
           bins='log', extent=[-1,2,-1,2]);
plt.xlabel('$g-r$')
plt.ylabel('$r-i$')
plt.colorbar(label='log(Number of objects)')


# Q: What else can you do to improve the star selection?

# __If you want to see more go to [Part II](object_gcr_2_lensing_cuts.ipynb)__