This dataset was created by extracting specified variables from a collection of wrf2d output files, rechunking the data to better support a variety of extraction patterns, and adding CF conventions to allow easier analysis, visualization, and data extraction using Xarray and HoloViz.
import os
os.environ['USE_PYGEOS'] = '0'  # opt out of the deprecated PyGEOS backend before geo imports

import fsspec
import xarray as xr
import hvplot.xarray  # registers the .hvplot accessor on xarray objects
import intake
import metpy
import cartopy.crs as ccrs
# open the hytest data intake catalog
hytest_cat = intake.open_catalog(
r"https://raw.githubusercontent.com/hytest-org/hytest/main/dataset_catalog/hytest_intake_catalog.yml"
)
list(hytest_cat)
['conus404-catalog', 'conus404-drb-eval-tutorial-catalog', 'nhm-v1.0-daymet-catalog', 'nhm-v1.1-c404-bc-catalog', 'nhm-v1.1-gridmet-catalog', 'nwis-streamflow-usgs-gages-onprem', 'nwis-streamflow-usgs-gages-cloud', 'nwm21-streamflow-usgs-gages-onprem', 'nwm21-streamflow-usgs-gages-cloud', 'nwm21-streamflow-cloud', 'nwm21-scores', 'lcmap-cloud', 'rechunking-tutorial-cloud']
# open the conus404 sub-catalog
cat = hytest_cat['conus404-catalog']
list(cat)
['conus404-hourly-onprem', 'conus404-hourly-cloud', 'conus404-hourly-osn', 'conus404-hourly-osn2', 'conus404-daily-diagnostic-onprem', 'conus404-daily-diagnostic-cloud', 'conus404-daily-diagnostic-osn', 'conus404-daily-onprem', 'conus404-daily-cloud', 'conus404-daily-osn', 'conus404-daily-osn2', 'conus404-monthly-onprem', 'conus404-monthly-cloud', 'conus404-monthly-osn']
## NOTE: we happen to know this dataset's handle/name.
dataset = 'conus404-hourly-cloud'
## If you did not know this name, you could list the datasets in the catalog with
## the command `list(cat)`
## But since we do know the name, let's see its metadata
cat[dataset]
conus404-hourly-cloud:
  args:
    consolidated: true
    storage_options:
      requester_pays: true
    urlpath: s3://nhgf-development/conus404/conus404_hourly_202209.zarr
  description: 'CONUS404 Hydro Variable subset, 40 years of hourly values. These
    files were created from wrfout model output files (see ScienceBase data release
    for more details: https://www.sciencebase.gov/catalog/item/6372cd09d34ed907bf6c6ab1).
    This dataset is stored on AWS S3 cloud storage in a requester-pays bucket. You
    can work with this data for free if your workflow is running in the us-west-2
    region, but you will be charged according to AWS S3 pricing
    (https://aws.amazon.com/s3/pricing/) to read the data into a workflow running
    outside of the cloud or in a different AWS cloud region.'
  driver: intake_xarray.xzarr.ZarrSource
  metadata:
    catalog_dir: https://raw.githubusercontent.com/hytest-org/hytest/main/dataset_catalog/subcatalogs
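The catalog entry shows that this dataset lives in a requester-pays S3 bucket and is read with the intake Zarr driver. If you were not using intake, you could open the store directly with fsspec and xarray; a minimal sketch, assuming your AWS credentials are configured and you accept the requester-pays charges:
# Direct (non-intake) access to the same Zarr store -- illustrative only.
fs = fsspec.filesystem('s3', requester_pays=True)
ds_direct = xr.open_zarr(fs.get_mapper('s3://nhgf-development/conus404/conus404_hourly_202209.zarr'), consolidated=True)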
Some of the steps we will take are aware of parallel clustered compute environments using dask. We're going to start a cluster now so that future steps can take advantage of this ability. This is an optional step, but it speeds up data loading significantly, especially when accessing data from the cloud.
%run /shared/users/environment_set_up/Start_Dask_Cluster_Nebari.ipynb
## If this notebook is not being run on Nebari/ESIP, replace the above
## path name with a helper appropriate to your compute environment. Examples:
# %run ../environment_set_up/Start_Dask_Cluster_Denali.ipynb
# %run ../environment_set_up/Start_Dask_Cluster_Tallgrass.ipynb
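## If no helper notebook matches your environment, a plain local cluster
## is a reasonable fallback (a minimal sketch; worker count and memory
## are sized automatically from your machine):
# from dask.distributed import Client, LocalCluster
# cluster = LocalCluster()
# client = Client(cluster)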
The 'cluster' object can be used to adjust cluster behavior. i.e. 'cluster.adapt(minimum=10)'
The 'client' object can be used to directly interact with the cluster. i.e. 'client.submit(func)'
The link to view the client dashboard is:
> https://nebari.esipfed.org/gateway/clusters/dev.22d416af57f744f2a2caa71cde733857/status
client
Client: Client-ba1d2dfb-2bc7-11ee-80a0-7a8d66915ae8
Connection method: Cluster object | Cluster type: dask_gateway.GatewayCluster
Dashboard: https://nebari.esipfed.org/gateway/clusters/dev.22d416af57f744f2a2caa71cde733857/status
cluster.scale(30)  # request a fixed pool of 30 workers
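As the startup message noted, adaptive scaling is an alternative to a fixed worker count; a sketch (the bounds here are arbitrary):
# Alternative to a fixed size: let the cluster scale between bounds based on load.
# cluster.adapt(minimum=4, maximum=30)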
print(f"Reading {dataset} metadata...", end='')
ds = cat[dataset].to_dask().metpy.parse_cf()
print("done")
# Examine the grid data structure for SNOW:
ds.SNOW
Reading conus404-hourly-cloud metadata...done
<xarray.DataArray 'SNOW' (time: 368064, y: 1015, x: 1367)>
dask.array<open_dataset-60f4cd5b1dab559310716eb4d524a8baSNOW, shape=(368064, 1015, 1367), dtype=float32, chunksize=(144, 175, 175), chunktype=numpy.ndarray>
Coordinates:
    lat        (y, x) float32 dask.array<chunksize=(175, 175), meta=np.ndarray>
    lon        (y, x) float32 dask.array<chunksize=(175, 175), meta=np.ndarray>
  * time       (time) datetime64[ns] 1979-10-01 ... 2021-09-25T23:00:00
  * x          (x) float64 -2.732e+06 -2.728e+06 ... 2.728e+06 2.732e+06
  * y          (y) float64 -2.028e+06 -2.024e+06 ... 2.024e+06 2.028e+06
    metpy_crs  object Projection: lambert_conformal_conic
Attributes:
    description:   SNOW WATER EQUIVALENT
    grid_mapping:  crs
    long_name:     Snow water equivalent
    units:         kg m-2
Looks like this dataset is organized along three dimensions (x, y, and time). There is a `metpy_crs` coordinate attached:
crs = ds['SNOW'].metpy.cartopy_crs
crs
<cartopy.crs.LambertConformal object at 0x7f2dca241930>
%%time
da = ds.T2.sel(time='2009-12-24 00:00').load()
### NOTE: `load()` is dask-aware, so it will operate in parallel if
### a cluster has been started.
da.hvplot.quadmesh(x='lon', y='lat', rasterize=True, geo=True, tiles='OSM', cmap='viridis').opts('Image', alpha=0.5)
SIDE NOTE
To identify a point, we will start with its lat/lon coordinates. But the data is in a Lambert Conformal Conic projection, so we need to re-project/transform those coordinates using the built-in `crs` we examined earlier:
lat, lon = 39.978322, -105.2772194  # point of interest (geographic coordinates)
x, y = crs.transform_point(lon, lat, src_crs=ccrs.PlateCarree())
print(x, y)  # these values are in LCC projection coordinates
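To double-check the transform, you can round-trip the projected coordinates back to geographic space; a quick sketch with the same cartopy objects:
# Round-trip check: LCC x/y back to lon/lat should recover the original point.
lon_check, lat_check = ccrs.PlateCarree().transform_point(x, y, src_crs=crs)
print(lon_check, lat_check)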
%%time
# Extract a year of hourly values at the grid cell nearest our point.
da = ds.PREC_ACC_NC.sel(x=x, y=y, method='nearest').sel(time=slice('2013-01-01 00:00','2013-12-31 00:00')).load()
da.hvplot(x='time', grid=True)
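If PREC_ACC_NC holds per-hour accumulations (as this hourly dataset suggests), summing within resampled windows gives coarser totals; a sketch aggregating the series above to daily values:
# Sum the hourly accumulations at this point into daily totals.
daily = da.resample(time='1D').sum()
daily.hvplot(x='time', grid=True)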
%%time
# Average a year of hourly values over time at every grid cell.
da = ds.PREC_ACC_NC.sel(time=slice('2016-01-01 00:00','2017-01-01 00:00')).mean(dim='time').compute()
CPU times: user 784 ms, sys: 23.6 ms, total: 808 ms Wall time: 7.5 s
%%time
# Uncomment to average over the full 40-year record instead (much more expensive):
# da = ds.PREC_ACC_NC.mean(dim='time').compute()
da.hvplot.image(x='x', y='y', rasterize=True, crs=crs, tiles='OSM', alpha=0.66, cmap='viridis')
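If you want to keep a computed field like this annual mean for later use, you can write it out locally; a sketch (the filename is just an example):
# Persist the computed mean to a NetCDF file -- illustrative filename.
da.to_netcdf('prec_acc_nc_2016_mean.nc')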
client.close(); cluster.shutdown()  # release the cluster's resources when finished