Using Xarray, GeoPandas and Sparse
Goal: Spatially aggregate a model data variable conservatively, i.e. by exactly partitioning each grid cell into the precise region boundaries.
Approach: represent every grid cell and every region as a polygon in a common equal-area projection, compute their intersections with GeoPandas, and apply the resulting area weights as a sparse matrix multiplication. It is quite fast and transparent.
%xmode minimal
import xarray as xr
import geopandas as gp
import pandas as pd
import sparse
import hvplot.pandas
import hvplot.xarray
import dask
import cf_xarray
from pynhd import NLDI, WaterData
import intake
import cartopy.crs as ccrs
Exception reporting mode: Minimal
def configure_cluster(machine):
    """Start a Dask cluster appropriate for the named compute platform.

    Parameters
    ----------
    machine : str
        One of 'denali', 'tallgrass', 'local', or 'esip-qhub-gateway-v0.4'.

    Returns
    -------
    tuple
        ``(client, cluster)`` — the dask.distributed Client and the cluster
        object it is connected to.

    Raises
    ------
    ValueError
        If `machine` is not one of the recognized platform names.
    """
    if machine == 'denali':
        # On-prem HPC node: plain local cluster, one thread per worker.
        from dask.distributed import LocalCluster, Client
        cluster = LocalCluster(threads_per_worker=1)
        client = Client(cluster)
    elif machine == 'tallgrass':
        # SLURM-scheduled cluster; scales adaptively up to 10 jobs.
        from dask.distributed import Client
        from dask_jobqueue import SLURMCluster
        cluster = SLURMCluster(queue='cpu', cores=1,
                               walltime="01:00:00", account="woodshole",
                               interface='ib0', memory='6GB')
        cluster.adapt(maximum_jobs=10)
        client = Client(cluster)
    elif machine == 'local':
        import os
        import warnings
        from dask.distributed import LocalCluster, Client
        warnings.warn("Running locally can result in costly data transfers!\n")
        n_cores = os.cpu_count()  # set to match your machine
        cluster = LocalCluster(threads_per_worker=n_cores)
        client = Client(cluster)
    elif machine in ['esip-qhub-gateway-v0.4']:
        # ESIP JupyterHub: use the shared ebdpy helper to start a gateway
        # cluster with AWS credentials propagated to the workers.
        import sys, os
        sys.path.append(os.path.join(os.environ['HOME'], 'shared', 'users', 'lib'))
        import ebdpy as ebd
        aws_profile = 'esip-qhub'
        ebd.set_credentials(profile=aws_profile)
        aws_region = 'us-west-2'
        endpoint = f's3.{aws_region}.amazonaws.com'
        ebd.set_credentials(profile=aws_profile, region=aws_region, endpoint=endpoint)
        worker_max = 10
        client, cluster = ebd.start_dask_cluster(profile=aws_profile, worker_max=worker_max,
                                                 region=aws_region, use_existing_cluster=True,
                                                 adaptive_scaling=False, wait_for_cluster=False,
                                                 worker_profile='Medium Worker', propagate_env=True)
    else:
        # Previously an unrecognized name fell through to the return and
        # raised an opaque UnboundLocalError; fail fast with a clear message.
        raise ValueError(
            f"Unknown machine {machine!r}; expected one of "
            "'denali', 'tallgrass', 'local', 'esip-qhub-gateway-v0.4'"
        )
    return client, cluster
Here we start a Dask cluster, either on the USGS HPC Denali system or on a JupyterHub deployment with Kubernetes running in the cloud.
import os

# Choose the dataset and cluster by where we are running: the presence of
# SLURM_CLUSTER_NAME means on-prem HPC; otherwise assume the ESIP cloud hub.
slurm_name = os.environ.get('SLURM_CLUSTER_NAME')
if slurm_name is not None:  # on prem
    dataset = 'conus404-hourly-onprem'
    machine = slurm_name
else:  # on cloud
    dataset = 'conus404-hourly-cloud'
    machine = 'esip-qhub-gateway-v0.4'
client, cluster = configure_cluster(machine)
Region: us-west-2 Existing Dask clusters: Cluster Index c_idx: 0 / Name: dev.728cb3f028724f7891f170f0e1c1e08a ClusterStatus.RUNNING Using existing cluster [0]. Setting Fixed Scaling workers=10 Reconnect client to clear cache client.dashboard_link (for new browser tab/window or dashboard searchbar in Jupyterhub): https://nebari.esipfed.org/gateway/clusters/dev.728cb3f028724f7891f170f0e1c1e08a/status Propagating environment variables to workers Using environment: users/users-pangeo
Load with geopandas:
%%time
# USGS gage 01482100 Delaware River at Del Mem Bridge at Wilmington De
gage_id = '01482100'
nldi = NLDI()
# Basin upstream of the gage, then all HUC12 sub-basins within its footprint.
del_basins = nldi.get_basins(gage_id)
outlet_geom = del_basins.geometry[0]
huc12_basins = WaterData('huc12').bygeom(outlet_geom)
CPU times: user 969 ms, sys: 51.4 ms, total: 1.02 s Wall time: 2.41 s
regions_df = huc12_basins  # the HUC12 sub-basins are our aggregation regions
region_name = 'name'  # column in regions_df holding each region's label
Check it:
regions_df.plot(column=region_name, figsize=(10,4))
<AxesSubplot: >
All geodataframes should have a coordinate reference system. This is important (and sometimes unfamiliar to users coming from the global climate world).
# Record the regions' original CRS as an EPSG string (e.g. 'EPSG:4326').
crs_orig = f'EPSG:{regions_df.crs.to_epsg()}'
crs_orig
'EPSG:4326'
# Open the HyTEST intake catalog and list the available datasets.
url = 'https://raw.githubusercontent.com/hytest-org/hytest/main/dataset_catalog/hytest_intake_catalog.yml'
cat = intake.open_catalog(url)
list(cat)
['conus404-hourly-onprem', 'conus404-hourly-cloud', 'conus404-daily-onprem', 'conus404-daily-diagnostic-onprem', 'conus404-daily-cloud', 'conus404-daily-diagnostic-cloud', 'conus404-monthly-onprem', 'conus404-monthly-cloud', 'nwis-streamflow-usgs-gages-onprem', 'nwis-streamflow-usgs-gages-cloud', 'nwm21-streamflow-usgs-gages-onprem', 'nwm21-streamflow-usgs-gages-cloud', 'nwm21-streamflow-cloud', 'nwm21-scores', 'lcmap-cloud', 'conus404-hourly-cloud-dev', 'nhm-v1.0-daymet-byHRU-onprem', 'nhm-v1.0-daymet-byHW-musk-onprem', 'nhm-v1.0-daymet-byHW-musk-obs-onprem', 'nhm-v1.0-daymet-byHW-noroute-onprem', 'nhm-v1.0-daymet-byHW-noroute_obs-onprem', 'nhm-v1.1-gridmet-byHRU-onprem', 'nhm-v1.1-gridmet-byHW-onprem', 'nhm-v1.1-gridmet-byHWobs-onprem']
# Open the CONUS404 daily dataset lazily (dask-backed xarray Dataset).
# NOTE(review): this hard-codes 'conus404-daily-cloud' rather than using the
# `dataset` variable chosen earlier — confirm which is intended.
ds = cat['conus404-daily-cloud'].to_dask()
x = 'x' # projected x coordinate name
y = 'y' # projected y coordinate name
ds.crs
<xarray.DataArray 'crs' ()> [1 values with dtype=int32] Attributes: grid_mapping_name: lambert_conformal_conic latitude_of_projection_origin: 39.100006103515625 longitude_of_central_meridian: 262.0999984741211 semi_major_axis: 6370000.0 semi_minor_axis: 6370000.0 standard_parallel: [30.0, 50.0]
# Grab the CF grid-mapping attributes stored on the dataset's `crs` variable.
crs_info = ds.crs
xx = ds.x.values  # projected x coordinates (NOTE(review): appears unused below)
yy = ds.y.values  # projected y coordinates (NOTE(review): appears unused below)
# Rebuild the model's Lambert Conformal Conic projection on a sphere whose
# radius matches the semi-major/semi-minor axes in the crs attributes.
globe = ccrs.Globe(ellipse='sphere', semimajor_axis=6370000, semiminor_axis=6370000)
lcc = ccrs.LambertConformal(globe=globe,
                            central_longitude=crs_info.longitude_of_central_meridian,
                            central_latitude=crs_info.latitude_of_projection_origin,
                            standard_parallels=crs_info.standard_parallel)
lcc_wkt = lcc.to_wkt()
# Re-project the region polygons into the model's projected CRS, then take
# their combined bounding box (minx, miny, maxx, maxy).
regions_df = regions_df.to_crs(lcc_wkt)
bbox = tuple(regions_df.total_bounds)
bbox
(1751263.2290125347, 214989.48670458575, 1964829.8118029535, 619548.965931173)
Subset the gridded model results to the bounds of the spatial dataframe to save on memory and computation. This is most useful when the footprint of regions_df is much smaller than that of the gridded model.
ds = ds.sel(x=slice(bbox[0],bbox[2]), y=slice(bbox[1],bbox[3]))
Now we extract just the horizontal grid information. The dataset has information about the lat and lon bounds of each cell, which we need to create the CONUS404 grid cell polygons.
# The first assignment was immediately overwritten, so keep it as a comment
# (matching the commented alternatives used later in this notebook).
#var = 'T2' # 2m Temp
var = 'PREC_ACC_NC' # precip
# Keep only the horizontal grid information: drop time/lat/lon and the data
# variable itself, then load the small remaining coordinates into memory.
grid = ds[[var]].drop(['time', 'lon', 'lat', var]).reset_coords().load()
grid
<xarray.Dataset> Dimensions: (x: 54, y: 101) Coordinates: * x (x) float64 1.752e+06 1.756e+06 1.76e+06 ... 1.96e+06 1.964e+06 * y (y) float64 2.16e+05 2.2e+05 2.24e+05 ... 6.12e+05 6.16e+05 Data variables: *empty* Attributes: (12/148) AER_ANGEXP_OPT: 1 AER_ANGEXP_VAL: 1.2999999523162842 AER_AOD550_OPT: 1 AER_AOD550_VAL: 0.11999999731779099 AER_ASY_OPT: 1 AER_ASY_VAL: 0.8999999761581421 ... ... WEST-EAST_PATCH_START_STAG: 1 WEST-EAST_PATCH_START_UNSTAG: 1 W_DAMPING: 1 YSU_TOPDOWN_PBLMIX: 0 history: Tue Mar 29 16:35:22 2022: ncrcat -A -vW ... history_of_appended_files: Tue Mar 29 16:35:22 2022: Appended file ...
Now we "stack" the data into a single 1D array. This is the first step towards transitioning to pandas.
# Add cell-boundary coordinates (x_bounds / y_bounds) via cf-xarray, then
# flatten the 2D (y, x) grid into a single 1D 'point' dimension.
grid = grid.cf.add_bounds([x, y])
points = grid.stack(point=(y,x))
points
<xarray.Dataset> Dimensions: (bounds: 2, point: 5454) Coordinates: y_bounds (bounds, point) float64 2.14e+05 2.14e+05 ... 6.18e+05 6.18e+05 x_bounds (bounds, point) float64 1.75e+06 1.754e+06 ... 1.962e+06 1.966e+06 * point (point) object MultiIndex * y (point) float64 2.16e+05 2.16e+05 2.16e+05 ... 6.16e+05 6.16e+05 * x (point) float64 1.752e+06 1.756e+06 ... 1.96e+06 1.964e+06 Dimensions without coordinates: bounds Data variables: *empty* Attributes: (12/148) AER_ANGEXP_OPT: 1 AER_ANGEXP_VAL: 1.2999999523162842 AER_AOD550_OPT: 1 AER_AOD550_VAL: 0.11999999731779099 AER_ASY_OPT: 1 AER_ASY_VAL: 0.8999999761581421 ... ... WEST-EAST_PATCH_START_STAG: 1 WEST-EAST_PATCH_START_UNSTAG: 1 W_DAMPING: 1 YSU_TOPDOWN_PBLMIX: 0 history: Tue Mar 29 16:35:22 2022: ncrcat -A -vW ... history_of_appended_files: Tue Mar 29 16:35:22 2022: Appended file ...
This function creates geometries for a single pair of bounds. It is not fast, but it is fast enough here. Perhaps could be vectorized using pygeos...
from shapely.geometry import Polygon

def bounds_to_poly(x_bounds, y_bounds):
    """Build the rectangular polygon for one grid cell from its x/y bounds.

    `x_bounds` and `y_bounds` are 2-element sequences (min, max); the four
    corners are listed counter-clockwise and shapely closes the ring.
    """
    x_lo, x_hi = x_bounds[0], x_bounds[1]
    y_lo, y_hi = y_bounds[0], y_bounds[1]
    corners = [(x_lo, y_lo), (x_lo, y_hi), (x_hi, y_hi), (x_hi, y_lo)]
    return Polygon(corners)
We apply this function to each grid cell.
%%time
import numpy as np
# Apply bounds_to_poly to every stacked grid point. The 'bounds' core dim is
# consumed from each input; vectorize=True loops in Python, producing an
# object-dtype array of shapely Polygons.
boxes = xr.apply_ufunc(
    bounds_to_poly,
    points.x_bounds,
    points.y_bounds,
    input_core_dims=[("bounds",), ("bounds",)],
    output_dtypes=[np.dtype('O')],
    vectorize=True
)
CPU times: user 74.7 ms, sys: 15.4 ms, total: 90.1 ms Wall time: 67.8 ms
Finally, we convert to a GeoDataframe, specifying the projected CRS
# Wrap the cell polygons in a GeoDataFrame indexed by the stacked 'point'
# MultiIndex and tagged with the model's projected (LCC) CRS.
grid_df= gp.GeoDataFrame(
    data={"geometry": boxes.values, "y": boxes[y], "x": boxes[x]},
    index=boxes.indexes["point"],
    crs=lcc_wkt
)
We will now transform to an area-preserving projection. This is important because we want to do area-weighted regridding. Here we use the NSIDC EASE-Grid 2.0 grid for the Northern Hemisphere.
# Equal-area CRS (NSIDC EASE-Grid 2.0 North) so polygon areas can serve
# directly as conservative regridding weights.
crs_area = "EPSG:6931"
regions_df = regions_df.to_crs(crs_area)
grid_df = grid_df.to_crs(crs_area)
grid_df.crs
<Derived Projected CRS: EPSG:6931> Name: WGS 84 / NSIDC EASE-Grid 2.0 North Axis Info [cartesian]: - X[south]: Easting (metre) - Y[south]: Northing (metre) Area of Use: - name: Northern hemisphere. - bounds: (-180.0, 0.0, 180.0, 90.0) Coordinate Operation: - name: US NSIDC EASE-Grid 2.0 North - method: Lambert Azimuthal Equal Area Datum: World Geodetic System 1984 ensemble - Ellipsoid: WGS 84 - Prime Meridian: Greenwich
This is the magic of geopandas; it can calculate the overlap between the original grid and the regions. It is expensive because it has to compare 5454 grid boxes with 451 regions.
In this dataframe, the x and y values are from the grid, while all the other columns are from the regions.
overlay = grid_df.overlay(regions_df, keep_geom_type=True)
This is essentially already a sparse matrix mapping one grid space to the other. How sparse?
# Fraction of all possible (cell, region) pairs that actually overlap.
sparsity = len(overlay) / (len(grid_df) * len(regions_df))
sparsity
0.0024750477952660914
Let's explore these overlays a little bit
We can verify that each basin's area is preserved in the overlay operation.
overlay.geometry.area.groupby(overlay[region_name]).sum().nlargest(10)/1e6 # km2
name Delaware Bay-Deep 1073.773336 Ashokan Reservoir-Esopus Creek 205.798540 Lower Little Swatara Creek 194.683255 Lake Wallenpaupack-Wallenpaupack Creek 180.104323 Bear Creek 176.743216 Bear Kill 174.877729 Upper Mahanoy Creek 170.342913 Headwaters Paulins Kill River 167.072157 Black Creek 160.267392 Snitz Creek-Quittapahilla Creek 159.798726 dtype: float64
overlay[overlay[region_name] == "Delaware Bay-Deep"].geometry.plot(edgecolor='k', aspect='equal')
<AxesSubplot: >
regions_df.geometry.area.groupby(regions_df[region_name]).sum().nlargest(10)
name Delaware Bay-Deep 1.073773e+09 Ashokan Reservoir-Esopus Creek 2.057985e+08 Lower Little Swatara Creek 1.946833e+08 Lake Wallenpaupack-Wallenpaupack Creek 1.801043e+08 Bear Creek 1.767432e+08 Bear Kill 1.748777e+08 Upper Mahanoy Creek 1.703429e+08 Headwaters Paulins Kill River 1.670722e+08 Black Creek 1.602674e+08 Snitz Creek-Quittapahilla Creek 1.597987e+08 dtype: float64
This is another key step. This transform tells us how much of a region's total area comes from each of the grid cells. This is accurate because we used an area-preserving CRS.
# Fraction of each region's total area contributed by each overlap piece.
# transform('sum') is the vectorized pandas form — identical result to
# transform(lambda x: x / x.sum()) but without a Python call per group.
piece_area = overlay.geometry.area
grid_cell_fraction = piece_area / piece_area.groupby(overlay[region_name]).transform('sum')
grid_cell_fraction
0 0.006051 1 0.007286 2 0.000754 3 0.004720 4 0.014627 ... 6191 0.001910 6192 0.179560 6193 0.180305 6194 0.093585 6195 0.000193 Length: 6196, dtype: float64
We can verify that these all sum up to one.
grid_cell_fraction.groupby(overlay[region_name]).sum()
name Alexauken Creek-Delaware River 1.0 Allegheny Creek-Delaware River 1.0 Angelica Creek-Schuylkill River 1.0 Antietam Creek 1.0 Appenzell Creek 1.0 ... Wrangle Brook 1.0 Wright Brook-Headwaters West Brach Delaware River 1.0 Wright Creek-Lehigh River 1.0 Wyomissing Creek 1.0 Yaleville Brook-Susquehanna River 1.0 Length: 451, dtype: float64
The first step is making a MultiIndex
# Index each weight by (y, x, region name) so xarray can later unstack it
# into a 3D (y, x, name) array.
multi_index = overlay.set_index([y, x, region_name]).index
df_weights = pd.DataFrame({"weights": grid_cell_fraction.values}, index=multi_index)
df_weights
weights | |||
---|---|---|---|
y | x | name | |
216039.100363 | 1.919902e+06 | Delaware Bay-Deep | 0.006051 |
1.923902e+06 | Delaware Bay-Deep | 0.007286 | |
220039.100363 | 1.911902e+06 | Delaware Bay-Deep | 0.000754 |
1.915902e+06 | Delaware Bay-Deep | 0.004720 | |
1.919902e+06 | Delaware Bay-Deep | 0.014627 | |
... | ... | ... | ... |
616039.100363 | 1.859902e+06 | Mine Kill | 0.001910 |
1.863902e+06 | Mine Kill | 0.179560 | |
1.867902e+06 | Mine Kill | 0.180305 | |
1.871902e+06 | Mine Kill | 0.093585 | |
1.875902e+06 | Mine Kill | 0.000193 |
6196 rows × 1 columns
We can bring this directly into xarray as a 1D Dataset.
import xarray as xr  # already imported at the top; harmless re-import
# A MultiIndexed DataFrame converts directly into a 1D xarray Dataset.
ds_weights = xr.Dataset(df_weights)
Now we unstack it into a sparse array.
weights_sparse = ds_weights.unstack(sparse=True, fill_value=0.).weights
Here we can clearly see that this is a sparse matrix, mapping the input space (y, x) to the output space (name).
To regrid the data, we just have to multiply the original precip dataset by this matrix.
#regridded = xr.dot(ds[var], weights_sparse)
#regridded = sparse.einsum('ij,jk->ik', ds[var].data, weights_sparse.data)
Unfortunately, that doesn't work out of the box, because sparse doesn't implement einsum (see https://github.com/pydata/sparse/issues/31).
# regridded[0].compute() # fails
Sparse does implement matmul, so we can use that. But we have to do some reshaping to make it work with our data.
def apply_weights_matmul_sparse(weights, data):
    """Regrid one chunk of data via sparse matrix multiplication.

    `data` is a dense (time, y, x) ndarray; `weights` is a sparse
    (y, x, region) array. Both are flattened over the spatial dims and
    multiplied, yielding a dense (time, region) ndarray.
    """
    assert isinstance(weights, sparse.SparseArray)
    assert isinstance(data, np.ndarray)
    sparse_data = sparse.COO.from_numpy(data)

    # Flatten data to (ntime, ncell) where ncell = ny * nx.
    shape = sparse_data.shape
    ntime = shape[0]
    ncell = shape[1] * shape[2]
    flat_data = sparse_data.reshape((ntime, ncell))

    # Flatten weights to (ncell, nregion); the spatial sizes must agree.
    wshape = weights.shape
    nregion = wshape[2]
    assert ncell == wshape[0] * wshape[1]
    flat_weights = weights.reshape((ncell, nregion))

    product = sparse.matmul(flat_data, flat_weights)
    assert product.shape == (ntime, nregion)
    return product.todense()
#var = 'T2' # 2m Temp, grid cell centers
#var = 'PREC_ACC_NC' # precip, grid cell centers
%%time
with dask.config.set(**{'array.slicing.split_large_chunks': False}):
    # Apply the sparse matmul per dask chunk: the (y, x, name) core dims of
    # the weights and the (y, x) core dims of the data are consumed, and a
    # new 'name' dim is produced. join="left" keeps the weights' labels.
    var_regridded = xr.apply_ufunc(
        apply_weights_matmul_sparse,
        weights_sparse,
        ds[var],
        join="left",
        input_core_dims=[[y, x, region_name], [y, x]],
        output_core_dims=[[region_name]],
        dask="parallelized",
        # meta tells dask the output is a plain dense ndarray, not sparse.
        dask_gufunc_kwargs=dict(meta=[np.ndarray((0,))])
    )
var_regridded.compute()
CPU times: user 3.3 s, sys: 265 ms, total: 3.57 s Wall time: 16.9 s
<xarray.DataArray (time: 15336, name: 451)> array([[4.22613292e+01, 2.90950405e+01, 1.00138637e+01, ..., 1.92264791e+00, 6.52060319e+00, 2.13817739e+00], [1.87979162e-02, 1.18648805e-01, 7.38264858e-01, ..., 1.75915492e+00, 2.59606560e+00, 1.39184965e-01], [3.64235453e+01, 3.67771132e+01, 1.24645243e+01, ..., 4.19176092e+01, 4.02009541e+00, 1.60088584e+01], ..., [1.29928335e+01, 5.04678211e+01, 4.85013792e+01, ..., 6.39075503e+01, 5.25906782e+01, 6.25619592e+01], [9.69441831e+00, 1.70217795e+01, 5.50870547e-02, ..., 2.60921915e-01, 1.63261528e-04, 2.04692349e+00], [ nan, nan, nan, ..., nan, nan, nan]]) Coordinates: * name (name) object 'Alexauken Creek-Delaware River' ... 'Yaleville Br... * time (time) datetime64[ns] 1979-10-01 1979-10-02 ... 2021-09-25
# Monthly-mean ("MS" = month start) time series for two example basins.
ds_var = var_regridded.sel(name=['Delaware Bay-Deep', 'Black Creek']).resample(time="MS").mean().to_dataset(region_name)
ds_var.hvplot(x='time', grid=True, frame_width=1000)
# Time-mean per basin, shaped into a DataFrame column named after the
# variable, then merged back onto the region polygons (on the 'name' column).
df_mean = var_regridded.to_pandas().mean()
df_mean.name = var
df_mean = pd.DataFrame(df_mean).reset_index()
merged = pd.merge(regions_df, df_mean)
Convert back to geographic coordinates for plotting
# Geographic (lat/lon) CRS for web-map plotting.
crs_geo = 'EPSG:4326'
merged_geo = merged.to_crs(crs_geo)
Holoviz:
# Choropleth of the per-basin time-mean values over terrain map tiles.
merged_geo.hvplot(c=var, geo=True, cmap='viridis_r', frame_width=500, tiles='StamenTerrain',
                  title='CONUS404', alpha=0.7)
The cluster would shut itself down after 15 minutes of inactivity, but we can also shut it down explicitly:
#cluster.shutdown()