Explore the NWM Reanalysis (1979-2020) NetCDF files (all 367,439 of them) on AWS as a single xarray dataset! The only new file we created is a JSON file that points to the data chunks inside the original NetCDF files; the fsspec and zarr packages then read the data through that JSON.
See this blog post for how this works.
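Under the hood, fsspec's ReferenceFileSystem treats that JSON as a zarr store whose "chunks" are byte ranges inside the original NetCDF files. A minimal sketch of opening the references directly, bypassing Intake (the JSON path here is hypothetical; the real one is recorded in the catalog used below):

import xarray as xr
import fsspec

# hypothetical reference-JSON path; the Intake catalog resolves the real one
m = fsspec.get_mapper(
    'reference://',
    fo='s3://esip-qhub-public/noaa/nwm/nwm_reanalysis.json',
    remote_protocol='s3',            # protocol of the bucket holding the NetCDF bytes
    remote_options={'anon': True},   # public bucket, no credentials
    target_options={'anon': True},   # the JSON itself is also public
)
ds_direct = xr.open_zarr(m, consolidated=False)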
Important note on performance: the data in the original NetCDF files is chunked as the entire spatial domain at a single time step. Reading a long time series will therefore be very slow -- extracting a time series at a single location for the entire period requires reading and uncompressing the full 8 TB of data! Extracting a few days or weeks of data, however, is relatively fast.
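For example, once the dataset is opened below, a short window at a single reach stays cheap because only a handful of per-timestep chunks are touched (a sketch; feature_id 101 is simply the first ID visible in the coordinate listing further down):

# Fast: one week at one reach reads ~168 single-timestep chunks.
# Slow: the full 1979-2020 series at that reach would touch all 367,439 chunks.
week = ds.streamflow.sel(feature_id=101, time=slice('2017-06-01', '2017-06-07')).compute()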
import intake
The Intake catalog, the consolidated JSON file it accesses, and the NetCDF files the JSON references all live in public S3 buckets, so no AWS account or credentials are required!
%%time
cat = intake.open_catalog('s3://esip-qhub-public/noaa/nwm/nwm_catalog.yml')
CPU times: user 1.2 s, sys: 551 ms, total: 1.75 s
Wall time: 5.47 s
list(cat)
['nwm-reanalysis', 'nwm-forecast']
%%time
ds = cat['nwm-reanalysis'].to_dask()
/home/conda/store/896e738a7fff13f931bce6a4a04b3575ecd1f4cbd0e7da9d83afcc7273e57b60-pangeo/lib/python3.8/site-packages/intake_xarray/xzarr.py:31: RuntimeWarning: Failed to open Zarr store with consolidated metadata, falling back to try reading non-consolidated metadata. This is typically much slower for opening a dataset. To silence this warning, consider: 1. Consolidating metadata in this existing store with zarr.consolidate_metadata(). 2. Explicitly setting consolidated=False, to avoid trying to read consolidate metadata, or 3. Explicitly setting consolidated=True, to raise an error in this case instead of falling back to try reading non-consolidated metadata.
  self._ds = xr.open_zarr(self._mapper, **self.kwargs)
/home/conda/store/896e738a7fff13f931bce6a4a04b3575ecd1f4cbd0e7da9d83afcc7273e57b60-pangeo/lib/python3.8/site-packages/xarray/conventions.py:512: SerializationWarning: variable 'qBtmVertRunoff' has multiple fill values {-9999000, 0}, decoding all values to NaN.
  new_vars[k] = decode_cf_variable(
/home/conda/store/896e738a7fff13f931bce6a4a04b3575ecd1f4cbd0e7da9d83afcc7273e57b60-pangeo/lib/python3.8/site-packages/xarray/conventions.py:512: SerializationWarning: variable 'qBucket' has multiple fill values {-999900000, 0}, decoding all values to NaN.
  new_vars[k] = decode_cf_variable(
/home/conda/store/896e738a7fff13f931bce6a4a04b3575ecd1f4cbd0e7da9d83afcc7273e57b60-pangeo/lib/python3.8/site-packages/xarray/conventions.py:512: SerializationWarning: variable 'qSfcLatRunoff' has multiple fill values {-999900000, 0}, decoding all values to NaN.
  new_vars[k] = decode_cf_variable(
/home/conda/store/896e738a7fff13f931bce6a4a04b3575ecd1f4cbd0e7da9d83afcc7273e57b60-pangeo/lib/python3.8/site-packages/xarray/conventions.py:512: SerializationWarning: variable 'q_lateral' has multiple fill values {0, -99990}, decoding all values to NaN.
  new_vars[k] = decode_cf_variable(
/home/conda/store/896e738a7fff13f931bce6a4a04b3575ecd1f4cbd0e7da9d83afcc7273e57b60-pangeo/lib/python3.8/site-packages/xarray/conventions.py:512: SerializationWarning: variable 'streamflow' has multiple fill values {0, -999900}, decoding all values to NaN.
  new_vars[k] = decode_cf_variable(
/home/conda/store/896e738a7fff13f931bce6a4a04b3575ecd1f4cbd0e7da9d83afcc7273e57b60-pangeo/lib/python3.8/site-packages/xarray/conventions.py:512: SerializationWarning: variable 'velocity' has multiple fill values {0, -999900}, decoding all values to NaN.
  new_vars[k] = decode_cf_variable(
CPU times: user 42.1 s, sys: 2.69 s, total: 44.8 s
Wall time: 48.3 s
ds
<xarray.Dataset>
Dimensions:         (time: 367439, feature_id: 2776738)
Coordinates:
  * feature_id      (feature_id) float64 101.0 179.0 181.0 ... 1.18e+09 1.18e+09
    latitude        (feature_id) float32 dask.array<chunksize=(2776738,), meta=np.ndarray>
    longitude       (feature_id) float32 dask.array<chunksize=(2776738,), meta=np.ndarray>
  * time            (time) datetime64[ns] 1979-02-01T01:00:00 ... 2020-12-31T...
Data variables:
    elevation       (time, feature_id) float32 dask.array<chunksize=(1, 2776738), meta=np.ndarray>
    order           (time, feature_id) float64 dask.array<chunksize=(1, 2776738), meta=np.ndarray>
    qBtmVertRunoff  (time, feature_id) float64 dask.array<chunksize=(1, 2776738), meta=np.ndarray>
    qBucket         (time, feature_id) float64 dask.array<chunksize=(1, 2776738), meta=np.ndarray>
    qSfcLatRunoff   (time, feature_id) float64 dask.array<chunksize=(1, 2776738), meta=np.ndarray>
    q_lateral       (time, feature_id) float64 dask.array<chunksize=(1, 2776738), meta=np.ndarray>
    streamflow      (time, feature_id) float64 dask.array<chunksize=(1, 2776738), meta=np.ndarray>
    velocity        (time, feature_id) float64 dask.array<chunksize=(1, 2776738), meta=np.ndarray>
Attributes: (12/19)
    Conventions:               CF-1.6
    TITLE:                     OUTPUT FROM WRF-Hydro v5.2.0-beta2
    _NCProperties:             version=2,netcdf=4.7.4,hdf5=1.10.7,
    cdm_datatype:              Station
    code_version:              v5.2.0-beta2
    dev:                       dev_ prefix indicates development/internal me...
    ...                        ...
    model_output_type:         channel_rt
    model_output_valid_time:   1979-02-01_01:00:00
    model_total_valid_times:   1416
    proj4:                     +proj=lcc +units=m +a=6370000.0 +b=6370000.0 ...
    station_dimension:         feature_id
    stream_order_output:       1
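The decoded size of the streamflow variable alone confirms the 8 TB figure from the performance note above:

ds.streamflow.nbytes / 1e12   # ~8.2, i.e. roughly 8 TB of decoded float64 values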
ds.streamflow
<xarray.DataArray 'streamflow' (time: 367439, feature_id: 2776738)>
dask.array<open_dataset-706b3adf98cd6789e67ba86ca4df351fstreamflow, shape=(367439, 2776738), dtype=float64, chunksize=(1, 2776738), chunktype=numpy.ndarray>
Coordinates:
  * feature_id  (feature_id) float64 101.0 179.0 181.0 ... 1.18e+09 1.18e+09
    latitude    (feature_id) float32 dask.array<chunksize=(2776738,), meta=np.ndarray>
    longitude   (feature_id) float32 dask.array<chunksize=(2776738,), meta=np.ndarray>
  * time        (time) datetime64[ns] 1979-02-01T01:00:00 ... 2020-12-31T23:0...
Attributes:
    _Netcdf4Dimid:  0
    grid_mapping:   crs
    long_name:      River Flow
    units:          m3 s-1
    valid_range:    [0, 5000000]
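Since latitude and longitude are 1-D coordinates along feature_id, finding the reach nearest a point of interest is a simple nearest-neighbour search (a sketch with a hypothetical point; flat Euclidean distance in degrees is crude but fine for a one-off lookup):

lat0, lon0 = 39.0, -77.5   # hypothetical point of interest
dist2 = (ds.latitude - lat0) ** 2 + (ds.longitude - lon0) ** 2
fid = ds.feature_id[int(dist2.argmin())].values   # nearest reach ID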
ds.streamflow.encoding
{'chunks': (1, 2776738), 'preferred_chunks': {'time': 1, 'feature_id': 2776738}, 'compressor': Zlib(level=2), 'filters': None, 'missing_value': -999900, '_FillValue': 0, 'scale_factor': 0.009999999776482582, 'add_offset': 0.0, 'dtype': dtype('int32'), 'coordinates': 'longitude latitude'}
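The stored values are packed int32 with a scale/offset: on read, xarray applies physical = raw * scale_factor + add_offset, mapping the sentinel values to NaN. For a concrete value:

raw = 12345                  # a packed int32 value as stored on disk
physical = raw * 0.01 + 0.0  # -> 123.45 m3 s-1
# raw == missing_value (-999900) or raw == _FillValue (0) decodes to NaN instead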
For comparison, open one of the original NetCDF files directly and look at its encoding:
url = 's3://noaa-nwm-retrospective-2-1-pds/model_output/2020/202001011100.CHRTOUT_DOMAIN1.comp'
import xarray as xr
import fsspec
fs = fsspec.filesystem('s3', anon=True)
ds2 = xr.open_dataset(fs.open(url), drop_variables='reference_time', chunks={})
ds2.streamflow.encoding
{'chunksizes': (2776738,), 'fletcher32': False, 'shuffle': False, 'zlib': True, 'complevel': 2, 'source': '<File-like object S3FileSystem, noaa-nwm-retrospective-2-1-pds/model_output/2020/202001011100.CHRTOUT_DOMAIN1.comp>', 'original_shape': (2776738,), 'dtype': dtype('int32'), 'missing_value': array([-999900], dtype=int32), '_FillValue': array([-999900], dtype=int32), 'scale_factor': array([0.01], dtype=float32), 'add_offset': array([0.], dtype=float32), 'coordinates': 'latitude longitude'}
Comparing the two encodings: the dtype (int32), compression (zlib level 2), and scale/offset packing all carry over from the original file to the consolidated view. The one discrepancy is the fill value: the original NetCDF file uses -999900 for both missing_value and _FillValue, while the consolidated metadata reports _FillValue: 0 alongside missing_value: -999900 -- exactly the "multiple fill values" pair xarray warned about when the dataset was opened.
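A quick consistency check, keeping that fill-value mismatch in mind (a sketch; the file opened above corresponds to the 2020-01-01 11:00 time step):

import numpy as np

a = ds.streamflow.sel(time='2020-01-01 11:00').values   # via the reference JSON
b = ds2.streamflow.values.squeeze()                     # straight from the NetCDF file
mask = b != 0   # raw zeros decode to NaN in ds but to 0.0 in ds2 (see above)
np.testing.assert_allclose(a[mask], b[mask])            # NaNs compare equal by default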