
Interactive Visualization with Bokeh, HoloViews, and Datashader

Owner: Keith Bechtol (@bechtol)
Last Verified to Run: 2020-04-11
Verified Stack Release: v21.0.0

This notebook demonstrates a few of the interactive features of the Bokeh, HoloViews, and Datashader plotting packages in the notebook environment. These packages are part of the PyViz set of Python tools intended for visualization in a web browser, and they can be used to create sophisticated dashboard-like interactive displays and widgets. The goal of this notebook is to provide an introduction and a starting point from which to create more advanced, custom interactive visualizations. To get inspired, check out this beautiful example notebook using HSC data created with the qa_explorer tools.

Learning Objectives

After working through and studying this notebook you should be able to

Use bokeh to create interactive figures with brushing and linking between multiple plots
Use holoviews and datashader to create two-dimensional histograms with dynamic binning to efficiently explore large datasets

Other techniques that are demonstrated, but not emphasized, in this notebook are

Use parquet to efficiently access large amounts of data

Logistics

This notebook is intended to be runnable on lsst-lsp-stable.ncsa.illinois.edu from a local git clone of https://github.com/LSSTScienceCollaborations/StackClub.

Note that occasionally the notebook may seem to stall, or the interactive features may seem disabled. If this happens, usually a restart of the kernel fixes the issue. You might also need to log out of the LSP and start a "large" instance of the JupyterLab environment. In some examples shown in this notebook, the order in which the cells are run is important for understanding the interactive features, so you may want to re-run the set of cells in a given section if you encounter unexpected behavior.

Setup

You can find the Stack version by using eups list -s on the terminal command line.

In [1]:

# What version of the Stack am I using?
! echo $HOSTNAME
! eups list -s | grep lsst_distrib

nb-kadrlica-r21-0-0
lsst_distrib          21.0.0+973e4c9e85 	current v21_0_0 setup

In [2]:

import numpy as np
import astropy.io.fits as pyfits

import bokeh
from bokeh.io import output_file, output_notebook, show
from bokeh.layouts import gridplot
from bokeh.models import ColumnDataSource, Range1d, HoverTool, Selection
from bokeh.plotting import figure

import holoviews as hv
from holoviews import streams
from holoviews.operation.datashader import datashade, dynspread, rasterize
from holoviews.plotting.util import process_cmap
hv.extension('bokeh')

In [3]:

# Need this line to display bokeh plots inline in the notebook
output_notebook()

Loading BokehJS ...

In [4]:

# Which versions of HoloViews and Datashader are we using?
print(hv.__version__)
import datashader as dsh
print(dsh.__version__)

1.13.5
0.11.1
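
If you want to record the installed versions of all three plotting packages without importing them, the standard library can look them up. This is a minimal sketch; the helper name `installed_versions` is hypothetical and not part of the notebook.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: version string}, with None for packages that are not installed."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


report = installed_versions(["bokeh", "holoviews", "datashader"])
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```

Comparing this report against the versions printed above is a quick first check if the interactive features behave differently in your environment.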

In [5]:

#Ignore all warnings 
import warnings
warnings.filterwarnings('ignore')

Prelude: Data Sample

The data in the following example come from the Dark Energy Survey Data Release 1 (DES DR1). The input data for this example were obtained with the M2 globular cluster database query in Appendix C of the DES DR1 paper, available from the DES Data Release page.

In [6]:

infile = '/project/kbechtol/des/dr1/dr1_m2_dered_test.fits'
reader = pyfits.open(infile)
data = reader[1].data
reader.close()

data = data[data['MAG_AUTO_G_DERED'] < 26.]
print(len(data))
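
If you cannot read the file path above, a synthetic stand-in with the same column names lets you run the Bokeh cells in Part 1. The sketch below is an assumption-laden placeholder: the values are random numbers, not DES DR1 measurements, and the name `mock` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000

# Hypothetical stand-in catalog: random values, NOT real DES DR1 data.
mock = np.zeros(n, dtype=[("COADD_OBJECT_ID", "i8"),
                          ("RA", "f8"),
                          ("DEC", "f8"),
                          ("MAG_AUTO_G_DERED", "f8"),
                          ("MAG_AUTO_R_DERED", "f8")])
mock["COADD_OBJECT_ID"] = np.arange(n)
mock["RA"] = 323.36 + rng.normal(0.0, 0.1, n)    # scattered around the M2 target position
mock["DEC"] = -0.82 + rng.normal(0.0, 0.1, n)
mock["MAG_AUTO_G_DERED"] = rng.uniform(16.0, 28.0, n)
mock["MAG_AUTO_R_DERED"] = mock["MAG_AUTO_G_DERED"] - rng.normal(0.5, 0.3, n)

# Apply the same magnitude cut as the cell above
mock = mock[mock["MAG_AUTO_G_DERED"] < 26.0]
print(len(mock))
```

Assigning `data = mock` would then let the later cells run unchanged, although the plots will show a featureless blob rather than a globular cluster.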

Part 1: Brushing and linking between scatter plots with Bokeh

First, an example with brushing and linking between two panels showing different representations of the same dataset. A selection applied to either panel will highlight the selected points in the other panel.

Based on http://bokeh.pydata.org/en/latest/docs/user_guide/interaction/linking.html#linked-brushing

In [7]:

ra_target, dec_target = 323.36, -0.82

mag = data['MAG_AUTO_G_DERED']
color = data['MAG_AUTO_G_DERED'] - data['MAG_AUTO_R_DERED']

# create a column data source for the plots to share
source = ColumnDataSource(data=dict(x0=data['RA'] - ra_target,
                                    y0=data['DEC'] - dec_target,
                                    x1=color,
                                    y1=mag,
                                    ra=data['RA'],
                                    dec=data['DEC'],
                                    coadd_object_id=data['COADD_OBJECT_ID']))

In [8]:

# Create a custom hover tool on both panels
hover_left = HoverTool(tooltips=[("(RA,DEC)", "(@ra, @dec)"),
                                 ("(g-r,g)", "(@x1, @y1)"),
                                 ("coadd_object_id", "@coadd_object_id")])
hover_right = HoverTool(tooltips=[("(RA,DEC)", "(@ra, @dec)"),
                                  ("(g-r,g)", "(@x1, @y1)"),
                                  ("coadd_object_id", "@coadd_object_id")])
TOOLS = "box_zoom,box_select,lasso_select,reset,help"
TOOLS_LEFT = [hover_left, TOOLS]
TOOLS_RIGHT = [hover_right, TOOLS]

In [9]:

# create a new plot and add a renderer
left = figure(tools=TOOLS_LEFT, plot_width=500, plot_height=500, output_backend="webgl",
              title='Spatial: Centered on (RA, Dec) = (%.2f, %.2f)'%(ra_target, dec_target))
left.circle('x0', 'y0', hover_color='firebrick', source=source,
            selection_fill_color='steelblue', selection_line_color='steelblue',
            nonselection_fill_color='silver', nonselection_line_color='silver')
left.x_range = Range1d(0.3, -0.3)
left.y_range = Range1d(-0.3, 0.3)
left.xaxis.axis_label = 'Delta RA'
left.yaxis.axis_label = 'Delta DEC'

# create another new plot and add a renderer
right = figure(tools=TOOLS_RIGHT, plot_width=500, plot_height=500, output_backend="webgl",
               title='CMD')
right.circle('x1', 'y1', hover_color='firebrick', source=source,
             selection_fill_color='steelblue', selection_line_color='steelblue',
             nonselection_fill_color='silver', nonselection_line_color='silver')
right.x_range = Range1d(-0.5, 2.5)
right.y_range = Range1d(26., 16.)
right.xaxis.axis_label = 'g - r'
right.yaxis.axis_label = 'g'

p = gridplot([[left, right]])

# The plots can be exported as html files with data embedded
#output_file("bokeh_m2_example.html", title="M2 Example")

show(p)

Use the hover tool to see information about individual datapoints (e.g., the coadd_object_id). This information should appear automatically as you hover the mouse over the datapoints. Notice the data points highlighted in red on one panel with the hover tool are also highlighted on the other panel.

Next, click on the selection box icon (with a "+" sign) or the selection lasso icon found in the upper right corner of the figure. Use the selection box and selection lasso to make various selections in either panel by clicking and dragging on either panel. The selected data points will be displayed in the other panel.

Introducing HoloViews Linked Streams

If we want to do subsequent calculations with the set of selected points, we can use HoloViews linked streams for custom interactivity. The following visualization is a modification of this example.

For this visualization, as in the example above, use the selection box and selection lasso to select datapoints on the left panel. The selected points should appear in the right panel.

Finally, notice that as you change the selection on the left panel, the mean x- and y-values for the selected datapoints are shown in the title of the right panel.

In [10]:

%%opts Points [tools=['box_select', 'lasso_select']]

# Declare some points
points = hv.Points((data['RA'] - ra_target, data['DEC'] - dec_target))

# Declare points as source of selection stream
selection = streams.Selection1D(source=points)

# Write function that uses the selection indices to slice points and compute stats
def selected_info(index):
    selected = points.iloc[index]
    if index:
        label = 'Mean x, y: %.3f, %.3f' % tuple(selected.array().mean(axis=0))
    else:
        label = 'No selection'
    return selected.relabel(label).options(color='red')

# Combine points and DynamicMap
# Notice the interesting syntax used here: the "+" sign makes side-by-side panels
points + hv.DynamicMap(selected_info, streams=[selection])

Out[10]:

In the next cell, we access the indices of the selected datapoints. We could use these indices to select a subset of the full sample for further examination.

In [11]:

print(selection.index)

[]
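
As an illustration of that last step, the sketch below uses a list of indices (like the one held by `selection.index`) to pull a subset out of plain NumPy arrays and summarize it. The helper `summarize_selection` and the toy arrays are hypothetical; the empty-list case mirrors the output shown above, where nothing has been selected yet.

```python
import numpy as np


def summarize_selection(x, y, index):
    """Return (n_selected, mean_x, mean_y); the means are None for an empty selection."""
    idx = np.asarray(index, dtype=int)
    if idx.size == 0:
        return 0, None, None
    return int(idx.size), float(np.mean(x[idx])), float(np.mean(y[idx]))


# Toy coordinates standing in for the plotted points
x = np.array([0.0, 0.1, 0.2, 0.3])
y = np.array([1.0, 1.0, 2.0, 4.0])

print(summarize_selection(x, y, []))       # nothing selected
print(summarize_selection(x, y, [2, 3]))   # two points selected
```

The same index list can be applied to any array that shares the row order of the plotted points, for example `data[selection.index]` for the catalog used here.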

Intermission: Rapid Data Access with Parquet

For the next example, we want to use a much larger dataset. Let's open up some data from Gaia Data Release 2 (Gaia DR2) with Parquet.

In [12]:

import glob
import pandas as pd
import pyarrow.parquet as pq

In [13]:

infiles = sorted(glob.glob('/project/shared/data/gaia_dr2_1am/*.parquet'))
print('There are %i total files in the directory'%(len(infiles)))

There are 500 total files in the directory

In [14]:

%%time
df_array = []
for ii in range(0, 10):
    print(infiles[ii])
    columns = ['ra', 'dec', 'phot_g_mean_mag'] # optionally add 'phot_bp_mean_mag', 'phot_rp_mean_mag'
    df_array.append(pq.read_table(infiles[ii], columns=columns).to_pandas())
df = pd.concat(df_array)

/project/shared/data/gaia_dr2_1am/part-00000-f1412da4-8053-4819-87f7-4874011b6d30_00000.c000.snappy.parquet
/project/shared/data/gaia_dr2_1am/part-00001-f1412da4-8053-4819-87f7-4874011b6d30_00001.c000.snappy.parquet
/project/shared/data/gaia_dr2_1am/part-00002-f1412da4-8053-4819-87f7-4874011b6d30_00002.c000.snappy.parquet
/project/shared/data/gaia_dr2_1am/part-00003-f1412da4-8053-4819-87f7-4874011b6d30_00003.c000.snappy.parquet
/project/shared/data/gaia_dr2_1am/part-00004-f1412da4-8053-4819-87f7-4874011b6d30_00004.c000.snappy.parquet
/project/shared/data/gaia_dr2_1am/part-00005-f1412da4-8053-4819-87f7-4874011b6d30_00005.c000.snappy.parquet
/project/shared/data/gaia_dr2_1am/part-00006-f1412da4-8053-4819-87f7-4874011b6d30_00006.c000.snappy.parquet
/project/shared/data/gaia_dr2_1am/part-00007-f1412da4-8053-4819-87f7-4874011b6d30_00007.c000.snappy.parquet
/project/shared/data/gaia_dr2_1am/part-00008-f1412da4-8053-4819-87f7-4874011b6d30_00008.c000.snappy.parquet
/project/shared/data/gaia_dr2_1am/part-00009-f1412da4-8053-4819-87f7-4874011b6d30_00009.c000.snappy.parquet
CPU times: user 3.69 s, sys: 2.62 s, total: 6.31 s
Wall time: 2.44 s

In [15]:

print('Dataframe contains %.2f M rows'%(len(df) / 1.e6))
print(df.columns.values)

Dataframe contains 38.73 M rows
['ra' 'dec' 'phot_g_mean_mag']
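
One detail worth knowing about `pd.concat`: by default it keeps each input's row labels. Assuming each per-file DataFrame comes back with its own zero-based index (an assumption about these particular files), the combined frame has repeated labels. The toy frames below are hypothetical and only illustrate the behavior.

```python
import pandas as pd

# Two small stand-ins for per-file DataFrames, each with its own 0-based index
chunks = [pd.DataFrame({"ra": [10.0, 10.1], "dec": [-1.0, -1.1]}),
          pd.DataFrame({"ra": [20.0, 20.1], "dec": [-2.0, -2.1]})]

stacked = pd.concat(chunks)                          # row labels repeat across chunks
renumbered = pd.concat(chunks, ignore_index=True)    # fresh 0..N-1 labels

print(stacked.index.tolist())
print(renumbered.index.tolist())
```

Boolean masks and positional lookups are unaffected by repeated labels, but a label-based lookup such as `stacked.loc[0]` returns one row per chunk, so `ignore_index=True` is the safer choice if you plan to look rows up by label later.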

Part 2: Visualizing Larger Datasets with Datashader

The interactive features of Bokeh work well with datasets up to a few tens of thousands of data points. To efficiently explore larger datasets, we'd like to use another visualization model that offers better scalability, namely Datashader.

In the examples below, notice that as one zooms in on the datashaded two-dimensional histograms, the bin sizes are dynamically adjusted to show finer or coarser granularity in the distribution. This allows one to interactively explore large datasets without having to manually adjust the bin sizes while panning and zooming. Zoom in all the way and you can see individual points (i.e., bins contain either zero or one count). If you zoom in far enough, the individual points are represented by extremely small pixels in datashader that are difficult to see. A solution is to wrap the datashaded plot in dynspread, which preserves a finite size for the plotted points.

In this particular example, as we zoom in, we can see that the Gaia dataset has been sharded into narrow stripes in declination.
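
To make the idea of dynamic binning concrete, here is a minimal NumPy sketch. It is not Datashader's implementation; it only shows the principle that the number of bins stays fixed (like the pixels of the plot) while the binned range follows the current viewport, so the bin width shrinks as you zoom in. The function name `rebin` and the random points are hypothetical.

```python
import numpy as np


def rebin(x, y, x_range, y_range, shape=(4, 4)):
    """Histogram only the points inside the viewport, using a fixed number of bins."""
    counts, x_edges, y_edges = np.histogram2d(x, y, bins=shape, range=[x_range, y_range])
    return counts, x_edges, y_edges


rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100_000)
y = rng.uniform(0, 10, 100_000)

# "Full view" and "zoomed view" share the same grid shape but not the same bin size
full, xe_full, _ = rebin(x, y, (0, 10), (0, 10))
zoom, xe_zoom, _ = rebin(x, y, (4, 5), (4, 5))

print("bin width, full view:  ", xe_full[1] - xe_full[0])
print("bin width, zoomed view:", xe_zoom[1] - xe_zoom[0])
print("points binned in zoomed view:", int(zoom.sum()))
```

In the notebook this re-aggregation happens automatically each time the plot range changes, which is why the histograms stay responsive even with tens of millions of rows.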

The next cell also uses the concept of linked Streams in HoloViews for custom interactivity, in this case to create a selection box. We'll use that selection box tool in the following cell.

In [16]:

#%%opts Points [tools=['box_select']]
points = hv.Points((df.ra, df.dec)) # Create a holoviews object to hold and plot data
#points = hv.Points(np.random.multivariate_normal((0, 0), [[1, 0.1], [0.1, 1]], (1000,))) # If you wanted a simple synthetic dataset

# Create the linked streams instance
boundsxy = (0, 0, 0, 0)
box = streams.BoundsXY(source=points, bounds=boundsxy)
bounds = hv.DynamicMap(lambda bounds: hv.Bounds(bounds), streams=[box]) 

# Apply the datashader
from holoviews.plotting.util import process_cmap
dynspread(datashade(points, cmap=process_cmap("Viridis", provider="bokeh"))) * bounds
# The "*" syntax puts multiple plot elements on the same panel
#datashade(points, cmap=bokeh.palettes.Viridis256) * bounds

Out[16]:

Next we add callback functionality to the plot above and retrieve the indices of the selected points. First, use the box selection tool to create a selection box for the two-dimensional histogram above. Then run the cell below to count the number of datapoints within the selection region.

In [17]:

selection = (points.data.x > box.bounds[0]) \
    & (points.data.y > box.bounds[1]) \
    & (points.data.x < box.bounds[2]) \
    & (points.data.y < box.bounds[3])
print('The selection box contains %i datapoints'%(np.sum(selection)))
if np.sum(selection) > 0:
    print('\nHere are some of the selected indices...')
    print(np.nonzero(selection.values)[0])

The selection box contains 0 datapoints
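
The zero count above is expected if no box has been drawn, because the stream still holds its initial bounds of (0, 0, 0, 0). The sketch below reproduces the masking logic on a few toy points; it assumes the bounds tuple is ordered (left, bottom, right, top), which is the order the cell above indexes. The helper name `in_box` is hypothetical.

```python
import numpy as np


def in_box(x, y, bounds):
    """Boolean mask of points strictly inside bounds = (left, bottom, right, top)."""
    left, bottom, right, top = bounds
    return (x > left) & (x < right) & (y > bottom) & (y < top)


x = np.array([0.5, 1.5, 2.5, 3.5])
y = np.array([0.5, 1.5, 2.5, 3.5])

# Degenerate initial box: nothing can fall strictly inside it
print(np.sum(in_box(x, y, (0, 0, 0, 0))))

# A drawn box: positional indices of the enclosed points
print(np.nonzero(in_box(x, y, (1, 1, 3, 3)))[0])
```

After drawing a box on the plot, re-running the counting cell picks up the updated `box.bounds`.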

Another option is to make a second linked plot paired with the box selection on the two-dimensional histogram.

In [18]:

# First, create a holoviews dataset instance. Here we label some of the columns.
kdims = [('ra', 'RA(deg)'), ('dec', 'Dec(deg)')]
vdims = [('phot_g_mean_mag', 'G(mag)')]
ds = hv.Dataset(df, kdims, vdims)
ds

Out[18]:

:Dataset   [ra,dec]   (phot_g_mean_mag)

In [19]:

points = hv.Points(ds)

#boundsxy = (0, 0, 0, 0)
boundsxy = (np.min(ds.data['ra']), np.min(ds.data['dec']), np.max(ds.data['ra']), np.max(ds.data['dec']))
box = streams.BoundsXY(source=points, bounds=boundsxy)
box_plot = hv.DynamicMap(lambda bounds: hv.Bounds(bounds), streams=[box])

In [20]:

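# This function defines the custom callback functionality to update the linked histogram.
# The default is the boundsxy tuple from the previous cell, not the `bounds` DynamicMap defined earlier.
def update_histogram(bounds=boundsxy):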
    
    selection = (ds.data['ra'] > bounds[0]) & \
                (ds.data['dec'] > bounds[1]) & \
                (ds.data['ra'] < bounds[2]) & \
                (ds.data['dec'] < bounds[3])
    
    selected_mag = ds.data.loc[selection]['phot_g_mean_mag']
    
    frequencies, edges = np.histogram(selected_mag)
    
    hist = hv.Histogram((np.log(frequencies), edges))
    return hist

In [21]:

%%output size=150
dmap = hv.DynamicMap(update_histogram, streams=[box])
datashade(points, cmap=process_cmap("Viridis", provider="bokeh")) * box_plot + dmap

Out[21]:

Notice that when you select different regions of the left panel with the box select tool, the histogram on the right is updated.
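
One caveat about the histogram callback: taking `np.log` of the raw counts gives negative infinity for empty bins, which a plot cannot draw (and the warning is hidden because warnings were silenced in Setup). The sketch below swaps in `np.log1p`, that is log(1 + n), which is a different transform from the one used in the notebook but stays finite for empty bins and still separates zero counts from single counts. The helper name `log1p_counts` is hypothetical.

```python
import numpy as np


def log1p_counts(values, bins=10):
    """Histogram `values` and return (log(1 + counts), bin edges)."""
    counts, edges = np.histogram(values, bins=bins)
    return np.log1p(counts), edges


# Toy magnitudes: three bright objects, one faint object, and empty bins in between
values = np.array([10.0, 10.1, 10.2, 19.9])
heights, edges = log1p_counts(values, bins=5)
print(heights)

# An empty selection also stays finite
empty_heights, _ = log1p_counts(np.array([]))
print(np.isfinite(empty_heights).all())
```

The returned pair can be passed to `hv.Histogram` in the same way as the `(frequencies, edges)` tuple in the callback.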

Part 3: Images

The next example demonstrates image visualization at the pixel level with datashader.

In [22]:

# Select the dataset to use
dataset='HSC'
#dataset='DC2'

# To reset the filter definitions
import lsst.obs.base as obsBase
obsBase.FilterDefinitionCollection.reset()


if dataset == 'HSC':
   
    dataDir = '/datasets/hsc/repo/rerun/RC/v20_0_0_rc1/DM-25349-sfm'
    dataId = {'filter': 'HSC-R', 'ccd': 50, 'visit': 1202}
    datasetType = "calexp"
    
    coadd_datadir = '/datasets/hsc/repo/rerun/RC/v20_0_0_rc1/DM-25349'
    coadd_dataId = {'filter':'HSC-I', 'tract': 9615, 'patch':'0,3'}
    # dataId = {'filter': 'HSC-R', 'ccd': 50, 'visit': 1202}
    coadd_dataset_type = "deepCoadd"
    frame='[height=512 width=300]' # The holoviews image frame

elif dataset == 'DC2':

    # DC2 WFD coadd
    dataDir = '/datasets/DC2/DR6/Run2.2i/patched/2021-02-10/rerun/run2.2i-coadd-wfd-dr6-v1'
    dataId = {'filter':'i', 'tract': 4226, 'patch':'0,4'}
    datasetType = "deepCoadd"
    frame='[height=512 width=600]' # The holoviews image frame
else:
    msg = "Unsupported dataset: %s"%dataset
    raise Exception(msg)
    

In [23]:

# First get a sensor image from the dataset selected previously. 
from lsst.daf.persistence import Butler
butler = Butler(dataDir)
#butler.queryMetadata('calexp', ['visit','detector','filter'], dataId=dataId)
image = butler.get(datasetType, dataId=dataId)

In [24]:

%%opts Image $frame
%%opts Bounds (color='white')
#%%output size=200

# Use an actual sensor image
bounds_img = (0, 0, image.getDimensions()[0], image.getDimensions()[1])
img = hv.Image(np.log10(image.image.array), 
               bounds=bounds_img).options(colorbar=True, 
                                          cmap=bokeh.palettes.Viridis256,
                                          # logz=True
                                         )

boundsxy = (0, 0, 0, 0)
box = streams.BoundsXY(source=img, bounds=boundsxy)
bounds = hv.DynamicMap(lambda bounds: hv.Bounds(bounds), streams=[box])

rasterize(img) * bounds

Out[24]:

As with the histograms, it is possible to use interactive callback features on the image plots, such as the selection box.

In [25]:

box

Out[25]:

BoundsXY(bounds=(0, 0, 0, 0))

Here's another version of the image with a tap stream instead of box select. Click on the image to place an 'X' marker.

In [26]:

%%opts Image  $frame
%%opts Points (color='white' marker='x' size=20)

posxy = hv.streams.Tap(source=img, x=0.5 * image.getDimensions()[0], y=0.5 * image.getDimensions()[1])
marker = hv.DynamicMap(lambda x, y: hv.Points([(x, y)]), streams=[posxy])

rasterize(img) * marker

Out[26]:

'X' marks the spot! What's the value at that location? Execute the next cell to find out.

In [27]:

print('The value at position (%.3f, %.3f) is %.3f'%(posxy.x, posxy.y, image.image.array[-int(posxy.y), int(posxy.x)]))

The value at position (1024.000, 2088.000) is 87.440
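
Converting a clicked (x, y) position into array indices is easy to get wrong by a row, so here is a small sketch of the mapping. It assumes that the plot draws array row 0 at the top, that the bounds run from (0, 0) to (width, height), and that each pixel covers one unit in x and y. The helper name `pixel_value` is hypothetical. Under these assumptions the row index is `n_rows - 1 - floor(y)`, which differs by one row from the `-int(posxy.y)` shortcut in the cell above.

```python
import numpy as np


def pixel_value(array, x, y):
    """Value of the pixel containing plot position (x, y), with row 0 drawn at the top."""
    n_rows, n_cols = array.shape
    col = int(np.floor(x))
    row_from_bottom = int(np.floor(y))
    if not (0 <= col < n_cols and 0 <= row_from_bottom < n_rows):
        raise IndexError(f"position ({x}, {y}) is outside the image")
    return array[n_rows - 1 - row_from_bottom, col]


# Toy 3x4 image whose values make the orientation easy to read off
demo = np.arange(12).reshape(3, 4)

print(pixel_value(demo, 0.5, 0.5))   # bottom-left pixel
print(pixel_value(demo, 3.5, 2.5))   # top-right pixel
```

The explicit bounds check also guards against clicks on the outer edge of the image, where a bare negative index would silently wrap around.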