Xarray

DS Python for GIS and Geoscience
October, 2020

© 2020, Joris Van den Bossche and Stijn Van Hoey. Licensed under CC BY 4.0 Creative Commons

In [ ]:

import numpy as np
import matplotlib.pyplot as plt

import rasterio
from rasterio.plot import plotting_extent, reshape_as_image

Introduction¶

By now you probably already know how to read data files with rasterio:

In [ ]:

data_file = "./data/gent/raster/2020-09-17_Sentinel_2_L1C_True_color.tiff"

In [ ]:

with rasterio.open(data_file) as src:
    # extract data, metadata and extent into memory
    gent_profile = src.profile
    gent_data = src.read([1, 2, 3], out_dtype=float, masked=False)
    gent_ext = plotting_extent(src)

In [ ]:

plt.imshow(gent_data[0, :, :], extent=gent_ext, cmap="Reds")

Rasterio...

Benefits

Direct link with Numpy data types
Rasterio supports important GIS transformations (clip, mask, warp, merge, transformation,...)
Only load a subset of a large data set into memory

Drawbacks:

Coordinate information is decoupled from the data itself (you have to keep track of and organize the extent and metadata yourself)
You have to keep track of what each dimension represents (y-first, as arrays are organized along rows first)
Functionality overlap with GDAL (and sometimes installation issues)

Meet `xarray`¶

In [ ]:

import xarray as xr

In [ ]:

gent = xr.open_rasterio(data_file)
gent

In [ ]:

plt.imshow(gent.sel(band=1), cmap="Reds");

Xarray brings its own plotting methods, but relies on Matplotlib as well for the actual plotting:

In [ ]:

ax = gent.sel(band=1).plot.imshow(cmap="Reds", figsize=(12, 5))  # robust=True
# ax.axes.set_aspect('equal')

As a preview, plot the intersection of the data at x coordinate closest to 400000 for each band:

In [ ]:

gent.sel(x=400_000, method='nearest').plot.line(col='band')

But first, let's have a look at the data again:

In [ ]:

gent

The output of xarray is a bit different from what we've previously seen. Let's go through the different elements:

It is an xarray.DataArray, one of the main data types provided by xarray
It has 3 dimensions:
- band: 3 bands (RGB)
- y: the y coordinates of the data set
- x: the x coordinates of the data set
Each of these dimensions is defined by a coordinate (1D) array
Other metadata provided by the tiff are stored in the Attributes

Looking at the data itself (click on the icons on the right), we can see this is still a Numpy array

In [ ]:

#gent.values

In [ ]:

gent = gent.assign_coords(band=("band", ["R", "G", "B"]))
gent

Hence, we can name dimensions and also extract (slice) data using these names...

In [ ]:

gent.sel(band='R')

Using xarray:

Data is stored as Numpy arrays
Dimensions have a name
The coordinates of each of the dimensions can represent coordinates, categories, dates, ... instead of just an index

REMEMBER:

The xarray package introduces labels in the form of dimensions, coordinates and attributes on top of raw NumPy-like arrays. Xarray is inspired by and borrows heavily from Pandas.
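
To make these three kinds of labels concrete, here is a minimal sketch that builds a small DataArray by hand. The values, the coordinate values and the `description` attribute are made up for illustration and do not come from the Gent raster.

```python
import numpy as np
import xarray as xr

# Synthetic 2-band, 3x4 "raster"; all numbers are invented for illustration
values = np.arange(24, dtype=float).reshape(2, 3, 4)

da = xr.DataArray(
    values,
    dims=("band", "y", "x"),                      # a name for each axis
    coords={
        "band": ["b4", "b8"],                     # categorical labels
        "y": [30.0, 20.0, 10.0],                  # y decreases from top to bottom
        "x": [100.0, 110.0, 120.0, 130.0],
    },
    attrs={"description": "synthetic example"},   # free-form metadata
)

# Label-based selection returns a 2D DataArray with dims (y, x)
b8 = da.sel(band="b8")
print(b8.dims, b8.shape)

# The underlying data is still a plain Numpy array
print(type(da.values))
```

The dimension names, coordinate labels and attributes travel with the array, so a selection such as `da.sel(band="b8")` does not require remembering that the band is the first axis.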

Numpy with labels...¶

Recap the NDVI exercise of the Numpy notebook, using a stacked version of the 4th and 8th Sentinel band:

In [ ]:

xr_array = xr.open_rasterio("./data/gent/raster/2020-09-17_Sentinel_2_L1C_B0408.tiff")
xr_array

In Numpy, we would do:

In [ ]:

b48_bands = xr_array.values  # 0 is band 4 and 1 is band 8
b48_bands.shape

In [ ]:

ndvi_np = (b48_bands[1] - b48_bands[0])/(b48_bands[0] + b48_bands[1]) # or was it b48_bands[0] -  b48_bands[1] ?

In [ ]:

plt.imshow(ndvi_np, cmap="YlGn")

In xarray:

In [ ]:

xr_array = xr_array.assign_coords(band=("band", ["b4", "b8"]))
xr_data = xr_array.to_dataset(dim="band")

In [ ]:

ndvi_xr = (xr_data["b8"] - xr_data["b4"])/(xr_data["b8"] + xr_data["b4"])

In [ ]:

plt.imshow(ndvi_xr, cmap="YlGn")

The result is the same, but there is no more struggling to remember which index represents which variable!

In [ ]:

np.allclose(ndvi_xr.data, ndvi_np)
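
The same comparison can be reproduced without the Sentinel file. The sketch below uses two small made-up reflectance arrays (the numbers are assumptions, not real band values) to show that the positional Numpy version and the named xarray version give the same NDVI.

```python
import numpy as np
import xarray as xr

# Made-up reflectance values for red (b4) and near-infrared (b8)
red = np.array([[0.10, 0.20], [0.30, 0.40]])
nir = np.array([[0.50, 0.60], [0.30, 0.20]])

# Numpy: we have to remember that index 0 is b4 and index 1 is b8
stacked = np.stack([red, nir])
ndvi_np = (stacked[1] - stacked[0]) / (stacked[1] + stacked[0])

# xarray: the variables are addressed by name
ds = xr.Dataset({"b4": (("y", "x"), red), "b8": (("y", "x"), nir)})
ndvi_xr = (ds["b8"] - ds["b4"]) / (ds["b8"] + ds["b4"])

print(np.allclose(ndvi_xr.values, ndvi_np))
```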

We can keep the result together with the other data variables by adding a new variable to the data, in a very similar way as we created a new column in Pandas:

In [ ]:

xr_data["ndvi"] = ndvi_xr
xr_data

You already encountered xarray.DataArray, but now we created an xarray.Dataset:

An xarray.Dataset is the second main data type provided by xarray
It has 2 dimensions:
- y: the y coordinates of the data set
- x: the x coordinates of the data set
Each of these dimensions is defined by a coordinate (1D) array
It has 3 Data variables: b4, b8 and ndvi, which share the same coordinates
Other metadata provided by the tiff are stored in the Attributes

Looking at the data itself (click on the icons on the right), we can see that each of the Data variables is a Numpy ndarray:

In [ ]:

type(xr_data["b4"].data)

And also the coordinates that describe a dimension are Numpy ndarrays:

In [ ]:

type(xr_data.coords["x"].values)

Selecting data

Xarray’s labels make working with multidimensional data much easier:

In [ ]:

xr_array = xr.open_rasterio("./data/gent/raster/2020-09-17_Sentinel_2_L1C_B0408.tiff")

Rename the coordinates of the band dimension:

In [ ]:

xr_array = xr_array.assign_coords(band=("band", ["b4", "b8"]))

We could use the Numpy style of data slicing:

In [ ]:

xr_array[0]

However, it is often much more powerful to use xarray’s .sel() method for label-based indexing:

In [ ]:

xr_array.sel(band="b4")

We can select a specific set of coordinate values as a list and, for each of them, take the coordinate nearest to the given value:

In [ ]:

xr_array.sel(x=[406803, 410380, 413958], method="nearest")   # .sel(band="b4").plot.line(hue="x");

Sometimes, a specific range is required. The .sel() method also supports slicing, so we can select band 4 and slice a subset of the data along the x direction:

In [ ]:

xr_array.sel(x=slice(400_000, 420_000), band="b4").plot.imshow()
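
The selection options above can be tried out on a tiny synthetic array. The coordinate values below are invented for illustration. Note that y is stored in decreasing order, as is common for rasters, so a slice along y runs from the high value to the low value.

```python
import numpy as np
import xarray as xr

da = xr.DataArray(
    np.arange(12).reshape(3, 4),
    dims=("y", "x"),
    coords={"y": [30, 20, 10], "x": [100, 110, 120, 130]},
)

# Nearest-neighbour lookup: 112 is not a coordinate, 110 is the closest one
nearest = da.sel(x=112, method="nearest")
print(int(nearest.x))

# Label-based slices include both end points
subset = da.sel(x=slice(110, 120))
print(subset.x.values)

# y is stored in decreasing order, so the slice runs from high to low
top_rows = da.sel(y=slice(30, 20))
print(top_rows.y.values)
```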

Note: Switch between DataArray and Dataset as you like, it won't hurt your computer memory:

In [ ]:

xr_data = xr_array.to_dataset(dim="band")

In [ ]:

#xr_data.to_array()    # dim="band"
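
As a small sketch of this round trip, using a synthetic array instead of the Sentinel bands (the values are arbitrary):

```python
import numpy as np
import xarray as xr

da = xr.DataArray(
    np.arange(8.0).reshape(2, 2, 2),
    dims=("band", "y", "x"),
    coords={"band": ["b4", "b8"]},
)

# DataArray -> Dataset: each band label becomes a data variable
ds = da.to_dataset(dim="band")
print(list(ds.data_vars))

# Dataset -> DataArray: the variables are stacked along a new "band" dimension
back = ds.to_array(dim="band")
print(back.dims)
```

The values after the round trip are the same as in the original array; only the container changes.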

Reduction¶

Just like in numpy, we can reduce xarray DataArrays along any number of axes:

In [ ]:

xr_data["b4"].mean(axis=0).dims

In [ ]:

xr_data["b4"].mean(axis=1).dims

But we have dimensions with labels, so rather than performing reductions on axes (as in Numpy), we can perform them on dimensions. This turns out to be a huge convenience:

In [ ]:

xr_data["b4"].mean(dim="x").dims

Calculate minimum or quantile values for each of the bands separately:

In [ ]:

xr_array.min(dim=["x", "y"])

In [ ]:

xr_array.quantile([0.1, 0.5, 0.9], dim=["x", "y"])
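
The difference between reducing by axis and reducing by dimension name can be checked on a small synthetic array (values chosen arbitrarily for illustration):

```python
import numpy as np
import xarray as xr

da = xr.DataArray(
    np.arange(24.0).reshape(2, 3, 4),
    dims=("band", "y", "x"),
    coords={"band": ["b4", "b8"]},
)

b4 = da.sel(band="b4")          # dims: (y, x)

# Positional: axis=0 is "y" here, so "x" remains
print(b4.mean(axis=0).dims)

# Named: say which dimension to reduce, independent of its position
print(b4.mean(dim="x").dims)

# Reduce over both spatial dimensions: one value per band
band_min = da.min(dim=["x", "y"])
print(band_min.values)
```

With the named version, the code keeps working if the order of the dimensions changes, for example after a transpose.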

Element-wise computation¶

Xarray DataArrays and Datasets work seamlessly with arithmetic operators and numpy array functions.

In [ ]:

xr_data["b4"] /10.

In [ ]:

np.log(xr_data["b8"])

As we have seen in the NDVI example, we can combine multiple xarray DataArrays in arithmetic operations:

In [ ]:

xr_data["b8"] + xr_data["b4"]

Broadcasting¶

Performing an operation on arrays with different coordinates will result in automatic broadcasting:

In [ ]:

xr_data.x.shape, xr_data["b8"].shape

In [ ]:

xr_data["b8"] + xr_data.x  # Note, this calculation does not make much sense, but illustrates broadcasting
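
A minimal sketch of the same mechanism with two tiny, made-up 1D arrays. In xarray, dimensions are matched by name: since "y" and "x" are different dimensions, the result spans both.

```python
import numpy as np
import xarray as xr

rows = xr.DataArray([1, 2, 3], dims="y")
cols = xr.DataArray([10, 20], dims="x")

# The dimensions are matched by name, not by position:
# "y" and "x" differ, so the result spans both of them
total = rows + cols
print(total.dims, total.shape)
print(total.values)
```

Adding the equivalent plain Numpy arrays of shape (3,) and (2,) would raise an error, because Numpy only compares the shapes position by position.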

Plotting¶

Similar to Pandas, there is a plot method, which can be used for different plot types:

In [ ]:

xr_array = xr.open_rasterio("./data/gent/raster/2020-09-17_Sentinel_2_L1C_B0408.tiff")
xr_array = xr_array.assign_coords(band=("band", ["b4", "b8"]))

It supports both 2-dimensional (e.g. line) and 3-dimensional (e.g. imshow, pcolormesh) plots. When just using plot, xarray makes a best guess on how to plot the data. However, being explicit with plot.line, plot.imshow, plot.pcolormesh, plot.scatter, ... gives you more control.

In [ ]:

xr_array.sel(band="b4").plot();  # add .line() -> ValueError: For 2D inputs, please specify either hue, x or y.

In [ ]:

xr_array.sel(x=420000, method="nearest").plot.line(hue="band");

Faceting splits the data into subplots according to a dimension, e.g. band:

In [ ]:

xr_array.sel(x=420000, method="nearest").plot.line(col="band");  # row="band"

Use the robust option when there is a lack of visual difference. This will use the 2nd and 98th percentiles of the data to compute the color limits. The arrows on the color bar indicate that the colors include data points outside the bounds.

In [ ]:

ax = xr_array.sel(band="b4").plot(cmap="Reds", robust=True, figsize=(12, 5))
ax.axes.set_aspect('equal')
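
To get a feeling for the effect of the percentile rule, the sketch below computes both sets of color limits for synthetic data with one artificial outlier. It mimics the 2nd/98th percentile rule described above with `np.percentile`; it is not xarray's internal implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.uniform(0.0, 1.0, size=(100, 100))   # synthetic values between 0 and 1
data[0, 0] = 1000.0                             # a single artificial outlier

# Default limits follow the full data range, so the outlier dominates
print(data.min(), data.max())

# Robust limits: 2nd and 98th percentiles
vmin, vmax = np.percentile(data, [2, 98])
print(vmin, vmax)
```

With the full range, nearly all pixels would fall in the lowest part of the colormap; with the percentile-based limits the color scale covers the bulk of the data and the outlier is simply clipped.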

Compare data variables within a xarray Dataset:

In [ ]:

xr_data = xr_array.to_dataset(dim="band")
xr_data.plot.scatter(x="b4", y="b8", s=2)

Calculating and plotting the NDVI in three classes illustrates the options of the imshow method:

In [ ]:

xr_data["ndvi"] = (xr_data["b8"] - xr_data["b4"])/(xr_data["b8"] + xr_data["b4"])
xr_data["ndvi"].plot.imshow(levels=[-1, 0, 0.3, 1.], colors=["gray", "yellowgreen", "g"])

Let's practice!¶

The data set for the following exercises is from Argo floats, an international collaboration that collects high-quality temperature and salinity profiles from the upper 2000m of the ice-free global ocean and currents from intermediate depths.

These data do not represent full-coverage image data (like remote sensing images), but measurements of salinity and temperature as a function of water level (related to the pressure). Each measurement happens at a given date at a given location (lon/lat).

In [ ]:

import xarray as xr
argo = xr.load_dataset("./data/argo_float.nc")

In [ ]:

argo

The bold font (or * symbol in the plain text output) in the coordinate representation above indicates that level and date are 'dimension coordinates' (they describe the coordinates associated with data variable axes), while lon and lat are 'non-dimension coordinates'. We can make any variable a non-dimension coordinate.

Let's plot the coordinates of the available measurements and add a background map using contextily:

EXERCISE:

Add a new variable to the argo data set, called temperature_kelvin, by converting the temperature to Kelvin.

Kelvin = degrees Celsius + 273.15.

Hints

Remember that xarray works like Numpy and relies on the same broadcasting rules.

In [ ]:

# %load _solutions/13-xarray1.py

EXERCISE:

The water level classes define different water depths. The pressure is a proxy for the water depth. Verify the relationship between the pressure and the level using a scatter plot. Does a larger value for the level represent deeper water depths or not?

Hints

If you get the error ValueError: Dataset.plot cannot be called directly. Use an explicit plot method, e.g. ds.plot.scatter(...), read the message and do what it says.

In [ ]:

# %load _solutions/13-xarray2.py

EXERCISE:

Assume that buoyancy is defined by the following formula:

$$g \cdot ( 2\times 10^{-4} \cdot T - 7\times 10^{-4} \cdot P )$$

With:

$g$ = 9.8
$T$ = temperature
$P$ = pressure

Calculate the buoyancy and add it as a new variable buoyancy to the argo data set.

Make a 2D (image) plot with the x-axis the date, the y-axis the water level and the color intensity the buoyancy. As the level represents the depth of the water, it makes more sense to have 0 (surface) at the top of the y-axis: switch the y-axis direction.

Hints

Remember that xarray works like Numpy and relies on the same broadcasting rules.
The imshow method does not work on irregular intervals. Matplotlib and xarray also have pcolormesh.
Look for options in the xarray documentation to control the axis direction. (The ax.invert_yaxis() Matplotlib function is not supported for pcolormesh)

In [ ]:

# %load _solutions/13-xarray3.py

In [ ]:

# %load _solutions/13-xarray4.py

In [ ]:

# %load _solutions/13-xarray5.py

EXERCISE:

Make a line plot of the salinity as a function of time at level 10.

Hints

Break it down into different steps and chain the individual steps:

From the argo data set, select the variable salinity. This is similar to selecting a column in Pandas.
Next, use the sel method to select the level=10
Next, use the plot.line() method.

In [ ]:

# %load _solutions/13-xarray6.py

EXERCISE:

Make a line plot of the temperature as a function of time for the levels 10, 20 and 30 in the same graph.
Make a second line plot with each of the 3 levels (10, 20, 30) in a different subplot.

Hints

Break it down into different steps and chain these individual steps:

From the argo data set, select the variable temperature. This is similar to selecting a column in Pandas.
Next, use the sel method to select the levels 10, 20 and 30.
Next, use the plot.line() method, but make sure the hue changes for each level

For the subplots, check the faceting documentation of xarray.

In [ ]:

# %load _solutions/13-xarray7.py

In [ ]:

# %load _solutions/13-xarray8.py

EXERCISE:

You wonder how the temperature evolves with increasing latitude and what the effect is of the depth (level):

Create a scatter plot of the level as a function of the temperature colored by the latitude.
As a further exploration step, pick a subset of levels 1, 10, 25, and 50 and create a second scatter plot with the latitude of the measurement on the x-axis and the temperature on the y-axis. To compare the effect of the different levels, give each level a separate subplot, placed next to each other.

Hints

In a scatter plot, the color or hue can be linked to a variable.
From the argo data set, use the sel method to select the levels 1, 10, 25, and 50.
For the second scatter plot, make sure the col changes for each level and define which variable goes on which axis.

In [ ]:

# %load _solutions/13-xarray9.py

In [ ]:

# %load _solutions/13-xarray10.py

EXERCISE:

Make an image plot of the temperature as a function of time. Divide the colormap into 3 discrete categories:

x < 5
5 < x < 15
x > 15

Choose a custom colormap and adjust the label of the colorbar to 'Temperature (°C)'

Hints

Check the help of the plot function or the xarray documentation on discrete colormaps.
Adjustments to the colorbar settings can be defined with the cbar_kwargs as a dict. Adjust the label of the colorbar.

In [ ]:

# %load _solutions/13-xarray11.py

EXERCISE:

Calculate the average salinity and temperature as a function of level over the measurements taken between 2012-10-01 and 2012-12-01.

Make a separate line plot for each of them. Define the figure and 2 subplots first with Matplotlib.

Hints

xarray supports querying dates using a string representation.
Use the slice operator within the sel to select a range of the data.
Whereas in Numpy we used axis in reduction functions, xarray uses the dim name.
Also for line plots you can define which dimension should be on the x-axis and which on the y-axis by providing the name.
Use fig, (ax0, ax1) = plt.subplots(1, 2) to create subplots.

In [ ]:

# %load _solutions/13-xarray12.py

In [ ]:

# %load _solutions/13-xarray13.py

Pandas for multiple dimensions...¶

In [ ]:

argo = xr.load_dataset("./data/argo_float.nc")

If we are interested in the average over time for each of the levels, we can use a reduction function to get the averages of each of the variables at the same time:

In [ ]:

argo.mean(dim=["date"])

But if we wanted the average for each month of the year per level, we would first have to split the data set in a group for each month of the year, apply the average function on each of the months and combine the data again.

We already learned about the split-apply-combine approach when using Pandas. The syntax of Xarray’s groupby is almost identical to Pandas!

First, extract the month of the year (1-> 12) from each of the date coordinates:

In [ ]:

argo.date.dt.month  # The date coordinate is a Pandas datetime index

We can use these arrays in a groupby operation:

In [ ]:

argo.groupby(argo.date.dt.month)

Xarray also offers a more concise syntax when the variable you're grouping on is already present in the dataset. This is identical to the previous line:

In [ ]:

argo.groupby("date.month")

Next, we apply an aggregation function for each of the months over the date dimension in order to end up with: for each month of the year, the average (over time) for each of the levels:

In [ ]:

argo.groupby("date.month").mean(dim="date")        #["temperature"].sel(level=1).to_series().plot.barh()
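
The split-apply-combine steps can be verified on a synthetic data set where the answer is known in advance. In the sketch below (the data, the variable name `temperature` and the two levels are made up and unrelated to the Argo file), every daily value equals its month number, so the monthly mean per level has to return 1 to 12.

```python
import numpy as np
import pandas as pd
import xarray as xr

# One synthetic value per day of 2012; the value is simply the month number,
# so the expected monthly mean is easy to check
dates = pd.date_range("2012-01-01", "2012-12-31", freq="D")
month_number = dates.month.to_numpy().astype(float)

ds = xr.Dataset(
    {"temperature": (("level", "date"), np.stack([month_number, 2 * month_number]))},
    coords={"level": [0, 1], "date": dates},
)

# Split by month of the year, average over the dates in each group, combine
monthly = ds.groupby("date.month").mean(dim="date")
print(monthly["temperature"].sel(level=0).values)
```

The `date` dimension is replaced by a new `month` dimension with one entry per group, while the `level` dimension is left untouched.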

Another, similar operation, specifically for time series data, is to resample the data to another time aggregation. For example, resample to monthly (1M) or yearly (1Y) median values:

In [ ]:

argo.resample(date="1M").median()  # 1Y

In [ ]:

argo["salinity"].sel(level=1).plot.line(x="date");
argo["salinity"].resample(date="1M").median().sel(level=1).plot.line(x="date");  # 1Y

A similar, but different, functionality is rolling, which calculates rolling window aggregates:

In [ ]:

argo.rolling(level=10, center=True).std()

In [ ]:

argo["salinity"].sel(date='2012-10-31').plot.line(y="level", yincrease=False, color="grey");
argo["salinity"].sel(date='2012-10-31').rolling(level=10, center=True).median().plot.line(y="level", yincrease=False, linewidth=3, color="crimson");
plt.legend(), plt.title("");
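
How a centered rolling window treats the edges is easiest to see on a handful of values. The profile below is synthetic; `min_periods` is an existing option of `rolling` that is not used elsewhere in this notebook.

```python
import numpy as np
import xarray as xr

profile = xr.DataArray(
    [1.0, 2.0, 3.0, 4.0, 5.0],
    dims="level",
    coords={"level": [0, 1, 2, 3, 4]},
)

# Centered window of 3 levels: each value is averaged with its two neighbours
smoothed = profile.rolling(level=3, center=True).mean()
print(smoothed.values)

# min_periods=1 accepts incomplete windows at the edges instead of returning NaN
smoothed_edges = profile.rolling(level=3, center=True, min_periods=1).mean()
print(smoothed_edges.values)
```

By default the first and last values are NaN, because their windows are incomplete. This is also why a smoothed profile line is shorter at both ends than the raw one.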

REMEMBER:

The xarray groupby implements the split-apply-combine strategy with the same syntax as Pandas. The resample and rolling operations are available in xarray as well.

Note: Xarray adds a groupby_bins convenience function for binned groups (instead of each value).

Note: Values are only read from disk when needed. For example, the following statement only reads the coordinate information and the metadata. The data itself is not yet loaded:

In [ ]:

gent = xr.open_rasterio(data_file)
gent

load() will explicitly load the data into memory:

In [ ]:

xr.open_rasterio(data_file).load()

Acknowledgements and great thanks to https://earth-env-data-science.github.io for the inspiration, data and examples.
