Visualize Parquet data

Use the Measured Parameter Data Access Parquet format to visualize STOQS data

Executing this Notebook requires a personal STOQS server - these instructions are for a Docker installation. This Notebook builds on issues raised in https://github.com/stoqs/stoqs/issues/227.

Docker Instructions

Install and start the software as detailed in the README. (Note that on MacOS you will need to modify settings in your docker-compose.yml and .env files — look for comments referencing 'HOST_UID'.)

Then, from your $STOQS_HOME/docker directory start the Jupyter Notebook server - you can query from the remote database or from a copy that you've made to your local system:

Option A: Query from MBARI's master database

Start the Jupyter Notebook server pointing to MBARI's master STOQS database server. (Note: firewall rules limit unprivileged access to such resources):

docker-compose exec \
    -e DATABASE_URL=postgis://everyone:[email protected]:5432/stoqs \
    stoqs stoqs/manage.py shell_plus --notebook

Option B: Query from your local Docker Desktop

Restore a database of your choice from https://stoqs.shore.mbari.org/, for example below is how to make a local copy of the stoqs_september2013 database from MBARI's server onto your local database and then start the Jupyter Notebook server using the default DATABASE_URL (which should be your local system) also make sure that your Docker Desktop has at least 16 GB of RAM allocated to it:

cd $STOQS_HOME/docker
docker-compose exec stoqs createdb -U postgres stoqs_september2013
curl -k https://stoqs.shore.mbari.org/media/pg_dumps/stoqs_september2013.pg_dump | \
    docker exec -i stoqs pg_restore -Fc -U postgres -d stoqs_september2013
docker-compose exec stoqs stoqs/manage.py shell_plus --notebook

Opening this Notebook

Following execution of the stoqs/manage.py shell_plus --notebook command a message is displayed giving a URL for you to use in a browser on your host, e.g.:

http://127.0.0.1:8888/?token=<a_token_generated_upon_server_start>

In the browser window opened to this URL navigate to this file (visualize_parquet.ipynb) and open it. You will then be able to execute the cells and modify the code to suit your needs.

The information in the output cells result from execution on a 2019 MacBook Pro with a 2.4 GHz 8-Core Intel Core i9 processor, 32 GB 2667 MHz DDR4 RAM, running Docker Desktop 3.1.0 with 16 GB with 4 CPUs and 16 GB allocated.

In a browser navigate to https://stoqs.mbari.org/stoqs_september2013 and make selections as shown in this screen grab: Constructing a parquet download URL

We will attempt to recreate this image from Issue 227: biplot

but this time using Datashader which can handle a lot more data.

In [1]:
import time
t_start = time.time()

# Issuing a STOQS api request from inside the stoqs container - where this
# notebook is running - is not really possible.  For testing with a 
# host='localhost' url you need to make that request from your system and 
# then copy the .parquet file to this directory: stoqs/contrib/notebooks.

# We have make two downloads as lrauvs and dorado have different Parameter names.
# It's theoretically possible to download all Parameter names from all three
# platforms, but that exceeds the container's RAM in my 16 GB Docker machine.
# It's more efficient to download just what we need.

##host = 'localhost'
host = 'stoqs.shore.mbari.org'
url_dorado = (f'https://{host}/stoqs_september2013/api/measuredparameter.parquet?'
               'parameter__name=bbp420&parameter__name=fl700_uncorr&'
               'parameter__name=salinity&parameter__name=temperature&'
               'measurement__instantpoint__activity__platform__name=dorado&'
               'collect=name')
##print(url_dorado)   # Uncoment for 'localhost' download from system browser
url_lrauvs  = (f'https://{host}/stoqs_september2013/api/measuredparameter.parquet?'
               'parameter__name=bb470&parameter__name=chlorophyll&'
               'parameter__name=salinity&parameter__name=temperature&'
               'measurement__instantpoint__activity__platform__name=daphne&'
               'measurement__instantpoint__activity__platform__name=tethys&'
               'collect=name')
##print(url_lrauv)   # Uncoment for 'localhost' download from system browser

!time wget --no-check-certificate -O stoqs_september2013_dorado.parquet "{url_dorado}"
!time wget --no-check-certificate -O stoqs_september2013_lrauvs.parquet "{url_lrauvs}"
--2021-03-17 17:52:07--  https://stoqs.shore.mbari.org/stoqs_september2013/api/measuredparameter.parquet?parameter__name=bbp420&parameter__name=fl700_uncorr&parameter__name=salinity&parameter__name=temperature&measurement__instantpoint__activity__platform__name=dorado&collect=name
Resolving stoqs.shore.mbari.org (stoqs.shore.mbari.org)... 134.89.12.71
Connecting to stoqs.shore.mbari.org (stoqs.shore.mbari.org)|134.89.12.71|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8334441 (7.9M) [application/octet-stream]
Saving to: ‘stoqs_september2013_dorado.parquet’

stoqs_september2013 100%[===================>]   7.95M  4.80MB/s    in 1.7s    

2021-03-17 17:52:16 (4.80 MB/s) - ‘stoqs_september2013_dorado.parquet’ saved [8334441/8334441]

0.01user 0.11system 0:08.53elapsed 1%CPU (0avgtext+0avgdata 6860maxresident)k
0inputs+0outputs (0major+364minor)pagefaults 0swaps
--2021-03-17 17:52:16--  https://stoqs.shore.mbari.org/stoqs_september2013/api/measuredparameter.parquet?parameter__name=bb470&parameter__name=chlorophyll&parameter__name=salinity&parameter__name=temperature&measurement__instantpoint__activity__platform__name=daphne&measurement__instantpoint__activity__platform__name=tethys&collect=name
Resolving stoqs.shore.mbari.org (stoqs.shore.mbari.org)... 134.89.12.71
Connecting to stoqs.shore.mbari.org (stoqs.shore.mbari.org)|134.89.12.71|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 60694551 (58M) [application/octet-stream]
Saving to: ‘stoqs_september2013_lrauvs.parquet’

stoqs_september2013 100%[===================>]  57.88M  4.42MB/s    in 14s     

2021-03-17 17:53:46 (4.18 MB/s) - ‘stoqs_september2013_lrauvs.parquet’ saved [60694551/60694551]

0.14user 0.76system 1:29.96elapsed 1%CPU (0avgtext+0avgdata 6948maxresident)k
0inputs+0outputs (0major+363minor)pagefaults 0swaps
In [2]:
import pandas as pd

%time df_dorado = pd.read_parquet('stoqs_september2013_dorado.parquet')
print(f"dorado data: {df_dorado.shape}")
##print(df_dorado.head(2))

%time df_lrauvs = pd.read_parquet('stoqs_september2013_lrauvs.parquet')
print(f"lrauv data: {df_lrauvs.shape}")
##print(df_lrauvs.head(2))

# Combine into single DataFrame for more generalized follow-on processing
df = df_dorado.append(df_lrauvs)
CPU times: user 152 ms, sys: 32 ms, total: 184 ms
Wall time: 226 ms
dorado data: (120505, 4)
CPU times: user 1.7 s, sys: 204 ms, total: 1.9 s
Wall time: 2.05 s
lrauv data: (1531962, 4)
In [3]:
# Commit with do_plots = False, change to True for plots, but don't check it in that way
do_plots = False
plots = None
if do_plots:
    import colorcet
    import holoviews as hv
    from holoviews.operation.datashader import datashade

    hv.extension("bokeh")

    pts_dorado = hv.Points(df, kdims=['bbp420', 'fl700_uncorr'])
    pts_daphne = hv.Points(df, kdims=['bb470', 'chlorophyll'])
    pts_tethys = hv.Points(df, kdims=['bb470', 'chlorophyll'])

    plots = ( datashade(pts_dorado, cmap=colorcet.fire).opts(title='dorado')
            + datashade(pts_daphne, cmap=colorcet.fire).opts(title='daphne') 
            + datashade(pts_tethys, cmap=colorcet.fire).opts(title='tethys') )
plots
In [4]:
print(f"Time to execute this notebook: {(time.time() - t_start):.1f} seconds")
Time to execute this notebook: 103.8 seconds