Read a collection of GRIB2 files on AWS as a single dataset using the Zarr library, via fsspec's ReferenceFileSystem. This notebook also demonstrates how to generate the JSON file that fsspec uses, speeding up metadata extraction from each GRIB2 file by running Reference File Maker on a Dask cluster.
Requires development version of fsspec_reference_maker
pip install --user git+https://github.com/intake/fsspec-reference-maker
import xarray as xr
import hvplot.xarray
import datetime as dt
import dask
import json
import fsspec
from fsspec_reference_maker.grib2 import scan_grib
from fsspec_reference_maker.combine import MultiZarrToZarr
There is a new HRRR forecast every hour, so use forecast hour 1 from past forecasts and then append the latest forecast
# Anonymous S3 access to the public HRRR bucket; skip_instance_cache so the
# listing is fresh rather than served from a cached filesystem instance.
fs = fsspec.filesystem('s3', anon=True, skip_instance_cache=True)
today = dt.datetime.utcnow().strftime('%Y%m%d')
files = fs.glob(f's3://noaa-hrrr-bdp-pds/hrrr.{today}/conus/*wrfsfcf01.grib2')
# Cycle token (e.g. 't11z') parsed from the filename of the newest f01 file.
newest = files[-1]
latest = newest.split('/')[3].split('.')[1]
print(latest)
t11z
# All forecast-hour files of the latest cycle; skip the first two entries
# (presumably f00 and f01 — f01 is already in `files` from the glob above;
# verify against the bucket's naming if the pattern changes).
latest_files = fs.glob(f's3://noaa-hrrr-bdp-pds/hrrr.{today}/conus/hrrr.{latest}.wrfsfc*.grib2')
files.extend(latest_files[2:])
This is for the ESIP qhub: you will need to modify to work elsewhere.
# ESIP qhub-specific credentials and Dask Gateway cluster startup.
import os
import sys

sys.path.append(os.path.join(os.environ['HOME'], 'shared', 'users', 'lib'))
import ebdpy as ebd

profile = 'esip-qhub'
region = 'us-west-2'
endpoint = f's3.{region}.amazonaws.com'
# Single credentials call — the original called set_credentials twice; the
# first call (profile only) was immediately superseded by this fuller one.
ebd.set_credentials(profile=profile, region=region, endpoint=endpoint)

worker_max = 30
client, cluster = ebd.start_dask_cluster(profile=profile, worker_max=worker_max,
                                         region=region, use_existing_cluster=True,
                                         adaptive_scaling=False, wait_for_cluster=False,
                                         environment='pangeo', worker_profile='Pangeo Worker',
                                         propagate_env=True)
/home/conda/store/3d745bdbbc77faf1b06381a24d6e593eeec3375ed3ddf4003aa2fc578a214eab-pangeo/lib/python3.9/site-packages/dask_gateway/client.py:21: FutureWarning: format_bytes is deprecated and will be removed in a future release. Please use dask.utils.format_bytes instead. from distributed.utils import LoopRunner, format_bytes
No Cluster running. Starting new cluster. Setting Fixed Scaling workers=30 Reconnect client to clear cache client.dashboard_link (for new browser tab/window or dashboard searchbar in Jupyterhub): https://jupyter.qhub.esipfed.org/gateway/clusters/dev.ab10096f62044b71afd9beb36587a782/status Propagating environment variables to workers
# cfgrib-style filter: keep only fields at 2 m height above ground.
afilter = dict(typeOfLevel='heightAboveGround', level=2)
# Storage options for anonymous S3 reads with readahead caching.
so = dict(anon=True, default_cache_type='readahead')
# Coordinate variables shared by every GRIB2 message we scan.
common = ['time', 'step', 'latitude', 'longitude', 'valid_time']
def gen_json(u):
    """Scan one GRIB2 file at URL *u* and write its fsspec reference JSON to json_dir."""
    parts = u.split('/')
    run_date = parts[3].split('.')[1]        # e.g. '20210903' from 'hrrr.20210903'
    tokens = parts[5].split('.')[1:3]        # e.g. ['t11z', 'wrfsfcf01']
    target = f'{json_dir}{run_date}.{tokens[0]}.{tokens[1]}.json'
    # NOTE(review): 'inline_threashold' is the (misspelled) keyword accepted by
    # this version of fsspec_reference_maker.scan_grib — do not "correct" it.
    refs = scan_grib(u, common, so, inline_threashold=100, filter=afilter)
    with fs2.open(target, "w") as jf:
        jf.write(json.dumps(refs))
json_dir = 's3://esip-qhub/noaa/hrrr/jsons/'
# Authenticated filesystem for writing the reference JSONs.
fs2 = fsspec.filesystem('s3', anon=False, skip_instance_cache=True)
# Best-effort cleanup of JSONs from a previous run; the directory may simply
# not exist yet, which is fine. (Original used a bare `except:` that would
# also have hidden permission errors.)
try:
    fs2.rm(json_dir, recursive=True)
except FileNotFoundError:
    pass
urls = [f's3://{file}' for file in files]
#so = dict(mode='rb', anon=True, default_fill_cache=False, default_cache_type='first')
%%time
# Fan the per-file JSON generation out across the Dask workers; retries guard
# against transient S3 read failures.
_ = dask.compute(*[dask.delayed(gen_json)(u) for u in urls], retries=10);
CPU times: user 292 ms, sys: 99.9 ms, total: 392 ms Wall time: 3min 58s
Use `MultiZarrToZarr()` to combine into a single reference
flist2 = fs2.ls(json_dir)
# Sorted s3:// URLs of the per-file reference JSONs.
furls = sorted(f's3://{f}' for f in flist2)
print(furls[0])
print(furls[-1])
s3://esip-qhub/noaa/hrrr/jsons/20210903.t00z.wrfsfcf01.json s3://esip-qhub/noaa/hrrr/jsons/20210903.t11z.wrfsfcf18.json
# mzz = MultiZarrToZarr(furls,
# storage_options={'anon':False},
# remote_protocol='s3',
# remote_options={'anon' : 'True'}, #JSON files
# xarray_open_kwargs={
# 'decode_cf' : False,
# 'mask_and_scale' : False,
# 'decode_times' : False,
# 'use_cftime' : False,
# 'drop_variables': ['reference_time', 'crs'],
# 'decode_coords' : False
# },
# xarray_concat_args={
# # "data_vars": "minimal",
# # "coords": "minimal",
# # "compat": "override",
# "join": "override",
# "combine_attrs": "override",
# "dim": "time"
# }
# )
# Combine the per-file references into one logical dataset, concatenating
# along valid_time. storage_options authenticates reads of our JSONs;
# remote_options is for the public HRRR bucket the references point into.
mzz = MultiZarrToZarr(furls,
storage_options={'anon':False},
remote_protocol='s3',
remote_options={'anon': True},
xarray_concat_args={'dim': 'valid_time'})
%%time
#%%prun -D multizarr_profile
# Write the combined reference set to a local JSON file.
mzz.translate('hrrr_best.json')
CPU times: user 1.58 s, sys: 350 ms, total: 1.93 s Wall time: 10.6 s
# Publish the combined reference file to the public bucket.
rpath = 's3://esip-qhub-public/noaa/hrrr/hrrr_best.json'
fs2.put_file('hrrr_best.json', rpath)
rpath = 's3://esip-qhub-public/noaa/hrrr/hrrr_best.json'
s_opts = dict(requester_pays=True, skip_instance_cache=True)   # for reading the JSON itself
r_opts = dict(anon=True)                                       # for the public HRRR data it points to
# Open the combined references as a (read-only) filesystem; note this rebinds
# the module-level `fs` used for the earlier globs.
fs = fsspec.filesystem(
    "reference",
    fo=rpath,
    ref_storage_args=s_opts,
    remote_protocol='s3',
    remote_options=r_opts,
)
m = fs.get_mapper("")
ds2 = xr.open_dataset(m, engine="zarr", backend_kwargs=dict(consolidated=False))
# Inspect the variables exposed by the combined reference dataset.
ds2.data_vars
Data variables: d2m (valid_time, y, x) float32 ... pt (valid_time, y, x) float32 ... r2 (valid_time, y, x) float32 ... sh2 (valid_time, y, x) float32 ... t2m (valid_time, y, x) float32 ... unknown (valid_time, y, x) float32 ...
# 2 metre temperature, concatenated along valid_time.
ds2.t2m
<xarray.DataArray 't2m' (valid_time: 29, y: 1059, x: 1799)> [55249089 values with dtype=float32] Coordinates: heightAboveGround float64 2.0 latitude (y, x) float64 ... longitude (y, x) float64 ... step timedelta64[ns] 01:00:00 time (valid_time) datetime64[ns] 2021-09-03 ... NaT * valid_time (valid_time) datetime64[us] 2021-09-03T01:00:00 ... 20... Dimensions without coordinates: y, x Attributes: (12/33) GRIB_DxInMetres: 3000.0 GRIB_DyInMetres: 3000.0 GRIB_LaDInDegrees: 38.5 GRIB_Latin1InDegrees: 38.5 GRIB_Latin2InDegrees: 38.5 GRIB_LoVInDegrees: 262.5 ... ... GRIB_stepUnits: 1 GRIB_typeOfLevel: heightAboveGround GRIB_units: K long_name: 2 metre temperature standard_name: air_temperature units: K
[55249089 values with dtype=float32]
array(2.)
[1905141 values with dtype=float64]
[1905141 values with dtype=float64]
array(3600000000000, dtype='timedelta64[ns]')
array(['2021-09-03T00:00:00.000000000', '2021-09-03T01:00:00.000000000', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT'], dtype='datetime64[ns]')
array(['2021-09-03T01:00:00.000000', '2021-09-03T02:00:00.000000', '2021-09-03T03:00:00.000000', '2021-09-03T04:00:00.000000', '2021-09-03T05:00:00.000000', '2021-09-03T06:00:00.000000', '2021-09-03T07:00:00.000000', '2021-09-03T08:00:00.000000', '2021-09-03T09:00:00.000000', '2021-09-03T10:00:00.000000', '2021-09-03T11:00:00.000000', '2021-09-03T12:00:00.000000', '2021-09-03T13:00:00.000000', '2021-09-03T14:00:00.000000', '2021-09-03T15:00:00.000000', '2021-09-03T16:00:00.000000', '2021-09-03T17:00:00.000000', '2021-09-03T18:00:00.000000', '2021-09-03T19:00:00.000000', '2021-09-03T20:00:00.000000', '2021-09-03T21:00:00.000000', '2021-09-03T22:00:00.000000', '2021-09-03T23:00:00.000000', '2021-09-04T00:00:00.000000', '2021-09-04T01:00:00.000000', '2021-09-04T02:00:00.000000', '2021-09-04T03:00:00.000000', '2021-09-04T04:00:00.000000', '2021-09-04T05:00:00.000000'], dtype='datetime64[us]')
Hvplot wants lon [-180,180], not [0,360]:
# Wrap longitudes from [0, 360) into [-180, 180) for hvplot.
ds2 = ds2.assign_coords(longitude=(((ds2.longitude + 180) % 360) - 180))
ds2.t2m.hvplot.quadmesh(x='longitude', y='latitude', rasterize=True, geo=True,
tiles='OSM', cmap='turbo')
We are reading GRIB2 files, which compress the entire spatial domain as a single chunk. Therefore reading all the time values at a single point actually needs to load and uncompress all the data for that variable. But with a lot of cores, it doesn't take terribly long
%%time
# Time series at one grid point: each GRIB2 chunk is a full 2D field, so every
# time step's whole spatial domain must be read and decompressed for this.
ds2.t2m[:,500,500].hvplot(x='valid_time', grid=True)
CPU times: user 17.2 s, sys: 1.65 s, total: 18.8 s Wall time: 20 s
# Release the Dask client and tear down the cluster now that we're done.
client.close()
cluster.shutdown()