This dataset was created by extracting specified variables from a collection of wrf2d output files, rechunking the data to better support a variety of extraction patterns, and adding CF conventions to allow easier analysis, visualization, and data extraction using Xarray and HoloViz.
import os
os.environ['USE_PYGEOS'] = '0'  # opt out of the deprecated PyGEOS backend before geo imports

import fsspec
import xarray as xr
import hvplot.xarray  # registers the .hvplot accessor on xarray objects
import intake
import metpy
import cartopy.crs as ccrs
# open the hytest data intake catalog
hytest_cat = intake.open_catalog(
r"https://raw.githubusercontent.com/hytest-org/hytest/main/dataset_catalog/hytest_intake_catalog.yml"
)
list(hytest_cat)
['conus404-catalog', 'conus404-drb-eval-tutorial-catalog', 'nhm-v1.0-daymet-catalog', 'nhm-v1.1-c404-bc-catalog', 'nhm-v1.1-gridmet-catalog', 'nwis-streamflow-usgs-gages-onprem', 'nwis-streamflow-usgs-gages-cloud', 'nwm21-streamflow-usgs-gages-onprem', 'nwm21-streamflow-usgs-gages-cloud', 'nwm21-streamflow-cloud', 'nwm21-scores', 'lcmap-cloud', 'rechunking-tutorial-cloud']
# open the conus404 sub-catalog
cat = hytest_cat['conus404-catalog']
list(cat)
['conus404-hourly-onprem', 'conus404-hourly-cloud', 'conus404-hourly-osn', 'conus404-hourly-osn2', 'conus404-daily-diagnostic-onprem', 'conus404-daily-diagnostic-cloud', 'conus404-daily-diagnostic-osn', 'conus404-daily-onprem', 'conus404-daily-cloud', 'conus404-daily-osn', 'conus404-daily-osn2', 'conus404-monthly-onprem', 'conus404-monthly-cloud', 'conus404-monthly-osn']
## NOTE: we happen to know this dataset's handle/name.
dataset = 'conus404-hourly-cloud'
## If you did not know this name, you could list the datasets in the catalog with
## the command `list(cat)`
## But since we do know the name, let's see its metadata
cat[dataset]
conus404-hourly-cloud:
  args:
    consolidated: true
    storage_options:
      requester_pays: true
    urlpath: s3://nhgf-development/conus404/conus404_hourly_202209.zarr
  description: 'CONUS404 Hydro Variable subset, 40 years of hourly values. These
    files were created from wrfout model output files (see ScienceBase data release
    for more details: https://www.sciencebase.gov/catalog/item/6372cd09d34ed907bf6c6ab1).
    This dataset is stored on AWS S3 cloud storage in a requester-pays bucket. You
    can work with this data for free if your workflow is running in the us-west-2
    region, but you will be charged according to AWS S3 pricing
    (https://aws.amazon.com/s3/pricing/) to read the data into a workflow running
    outside of the cloud or in a different AWS cloud region.'
  driver: intake_xarray.xzarr.ZarrSource
  metadata:
    catalog_dir: https://raw.githubusercontent.com/hytest-org/hytest/main/dataset_catalog/subcatalogs
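The catalog entry shows that this dataset lives in a requester-pays S3 bucket and is read with the intake Zarr driver. If you were not using intake, you could open the store directly with fsspec and xarray; a minimal sketch, assuming your AWS credentials are configured and you accept the requester-pays charges:
# Direct (non-intake) access to the same Zarr store -- illustrative only.
fs = fsspec.filesystem('s3', requester_pays=True)
ds_direct = xr.open_zarr(fs.get_mapper('s3://nhgf-development/conus404/conus404_hourly_202209.zarr'), consolidated=True)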
Some of the steps we will take are aware of parallel clustered compute environments using dask. We're going to start a cluster now so that future steps can take advantage of this ability. This is an optional step, but it speeds up data loading significantly, especially when accessing data from the cloud.
%run /shared/users/environment_set_up/Start_Dask_Cluster_Nebari.ipynb
## If this notebook is not being run on Nebari/ESIP, replace the above
## path name with a helper appropriate to your compute environment. Examples:
# %run ../environment_set_up/Start_Dask_Cluster_Denali.ipynb
# %run ../environment_set_up/Start_Dask_Cluster_Tallgrass.ipynb
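## If no helper notebook matches your environment, a plain local cluster
## is a reasonable fallback (a minimal sketch; worker count and memory
## are sized automatically from your machine):
# from dask.distributed import Client, LocalCluster
# cluster = LocalCluster()
# client = Client(cluster)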
The 'cluster' object can be used to adjust cluster behavior. i.e. 'cluster.adapt(minimum=10)'
The 'client' object can be used to directly interact with the cluster. i.e. 'client.submit(func)'
The link to view the client dashboard is:
> https://nebari.esipfed.org/gateway/clusters/dev.22d416af57f744f2a2caa71cde733857/status
client
Client: Client-ba1d2dfb-2bc7-11ee-80a0-7a8d66915ae8
Connection method: Cluster object | Cluster type: dask_gateway.GatewayCluster
Dashboard: https://nebari.esipfed.org/gateway/clusters/dev.22d416af57f744f2a2caa71cde733857/status
cluster.scale(30)  # request a fixed pool of 30 workers
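As the startup message noted, adaptive scaling is an alternative to a fixed worker count; a sketch (the bounds here are arbitrary):
# Alternative to a fixed size: let the cluster scale between bounds based on load.
# cluster.adapt(minimum=4, maximum=30)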
print(f"Reading {dataset} metadata...", end='')
ds = cat[dataset].to_dask().metpy.parse_cf()
print("done")
# Examine the grid data structure for SNOW:
ds.SNOW
Reading conus404-hourly-cloud metadata...done
<xarray.DataArray 'SNOW' (time: 368064, y: 1015, x: 1367)>
dask.array<open_dataset-60f4cd5b1dab559310716eb4d524a8baSNOW, shape=(368064, 1015, 1367), dtype=float32, chunksize=(144, 175, 175), chunktype=numpy.ndarray>
Coordinates:
    lat        (y, x) float32 dask.array<chunksize=(175, 175), meta=np.ndarray>
    lon        (y, x) float32 dask.array<chunksize=(175, 175), meta=np.ndarray>
  * time       (time) datetime64[ns] 1979-10-01 ... 2021-09-25T23:00:00
  * x          (x) float64 -2.732e+06 -2.728e+06 ... 2.728e+06 2.732e+06
  * y          (y) float64 -2.028e+06 -2.024e+06 ... 2.024e+06 2.028e+06
    metpy_crs  object Projection: lambert_conformal_conic
Attributes:
    description:   SNOW WATER EQUIVALENT
    grid_mapping:  crs
    long_name:     Snow water equivalent
    units:         kg m-2
Looks like this dataset is organized along three dimensions (x, y, and time). There is a `metpy_crs` coordinate attached:
crs = ds['SNOW'].metpy.cartopy_crs
crs
<cartopy.crs.LambertConformal object at 0x7f2dca241930>
%%time
da = ds.T2.sel(time='2009-12-24 00:00').load()
### NOTE: `load()` is dask-aware, so it will operate in parallel if
### a cluster has been started.
da.hvplot.quadmesh(x='lon', y='lat', rasterize=True, geo=True, tiles='OSM', cmap='viridis').opts('Image', alpha=0.5)
SIDE NOTE
To identify a point, we will start with its lat/lon coordinates. But the data is in a Lambert Conformal Conic projection, so we need to re-project/transform those coordinates using the built-in `crs` we examined earlier:
lat, lon = 39.978322, -105.2772194  # point of interest (geographic coordinates)
x, y = crs.transform_point(lon, lat, src_crs=ccrs.PlateCarree())
print(x, y)  # these values are in LCC projection coordinates
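To double-check the transform, you can round-trip the projected coordinates back to geographic space; a quick sketch with the same cartopy objects:
# Round-trip check: LCC x/y back to lon/lat should recover the original point.
lon_check, lat_check = ccrs.PlateCarree().transform_point(x, y, src_crs=crs)
print(lon_check, lat_check)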
%%time
# Extract a year of hourly values at the grid cell nearest our point.
da = ds.PREC_ACC_NC.sel(x=x, y=y, method='nearest').sel(time=slice('2013-01-01 00:00','2013-12-31 00:00')).load()
da.hvplot(x='time', grid=True)
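If PREC_ACC_NC holds per-hour accumulations (as this hourly dataset suggests), summing within resampled windows gives coarser totals; a sketch aggregating the series above to daily values:
# Sum the hourly accumulations at this point into daily totals.
daily = da.resample(time='1D').sum()
daily.hvplot(x='time', grid=True)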
%%time
# Average a year of hourly values over time at every grid cell.
da = ds.PREC_ACC_NC.sel(time=slice('2016-01-01 00:00','2017-01-01 00:00')).mean(dim='time').compute()
CPU times: user 784 ms, sys: 23.6 ms, total: 808 ms Wall time: 7.5 s
%%time
# Uncomment to average over the full 40-year record instead (much more expensive):
# da = ds.PREC_ACC_NC.mean(dim='time').compute()
da.hvplot.image(x='x', y='y', rasterize=True, crs=crs, tiles='OSM', alpha=0.66, cmap='viridis')
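If you want to keep a computed field like this annual mean for later use, you can write it out locally; a sketch (the filename is just an example):
# Persist the computed mean to a NetCDF file -- illustrative filename.
da.to_netcdf('prec_acc_nc_2016_mean.nc')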
client.close(); cluster.shutdown()  # release the cluster's resources when finished