Xarray
DS Python for GIS and Geoscience
October, 2020© 2020, Joris Van den Bossche and Stijn Van Hoey. Licensed under CC BY 4.0 Creative Commons
import numpy as np
import matplotlib.pyplot as plt
import rasterio
from rasterio.plot import plotting_extent, reshape_as_image
By this moment you probably already know how to read data files with rasterio:
data_file = "./data/gent/raster/2020-09-17_Sentinel_2_L1C_True_color.tiff"
with rasterio.open(data_file) as src:
# extract data, metadata and extent into memory
gent_profile = src.profile
gent_data = src.read([1, 2, 3], out_dtype=float, masked=False)
gent_ext = plotting_extent(src)
plt.imshow(gent_data[0, :, :], extent=gent_ext, cmap="Reds")
Rasterio...
Benefits
Drawbacks:
xarray
¶import xarray as xr
gent = xr.open_rasterio(data_file)
gent
plt.imshow(gent.sel(band=1), cmap="Reds");
Xarray brings its own plotting methods, but relies on Matplotlib as well for the actual plotting:
ax = gent.sel(band=1).plot.imshow(cmap="Reds", figsize=(12, 5)) # robust=True
# ax.axes.set_aspect('equal')
As a preview, plot the intersection of the data at x coordinate closest to 400000 for each band:
gent.sel(x=400_000, method='nearest').plot.line(col='band')
But first, let's have a look at the data again:
gent
The output of xarray is a bit different to what we've previous seen. Let's go through the different elements:
xarray.DataArray
, one of the main data types provided by xarrayband
: 3 bands (RGB)y
: the y coordinates of the data setx
: the x coordinates of the data settiff
are stored in the Attributes
Looking to the data itself (click on the icons on the right), we can see this is still a Numpy array
#gent.values
gent = gent.assign_coords(band=("band", ["R", "G", "B"]))
gent
Hence, we can name dimensions and also extract (slice) data using these names...
gent.sel(band='R')
Using xarray:
REMEMBER:
The xarray
package introduces labels in the form of dimensions, coordinates and attributes on top of raw numPy-like arrays. Xarray is inspired by and borrows heavily from Pandas.
Recap the NDVI exercise of the Numpy notebook, using a stacked version of the 4th and 8th Sentinel band:
xr_array = xr.open_rasterio("./data/gent/raster/2020-09-17_Sentinel_2_L1C_B0408.tiff")
xr_array
In Numpy, we would do:
b48_bands = xr_array.values # 0 is band 4 and 1 is band 8
b48_bands.shape
ndvi_np = (b48_bands[1] - b48_bands[0])/(b48_bands[0] + b48_bands[1]) # or was it b48_bands[0] - b48_bands[1] ?
plt.imshow(ndvi_np, cmap="YlGn")
In xarray:
xr_array = xr_array.assign_coords(band=("band", ["b4", "b8"]))
xr_data = xr_array.to_dataset(dim="band")
ndvi_xr = (xr_data["b8"] - xr_data["b4"])/(xr_data["b8"] + xr_data["b4"])
plt.imshow(ndvi_xr, cmap="YlGn")
The result is the same, but no more struggling on what index is representing which variable!
np.allclose(ndvi_xr.data, ndvi_np)
We can keep the result together with the other data variables by adding a new variable to the data, in a very similar way as we created a new column in Pandas:
xr_data["ndvi"] = ndvi_xr
xr_data
You already encountered xarray.DataArray
, but now we created a xarray.Dataset
:
xarray.Dataset
is the second main data type provided by xarrayy
: the y coordinates of the data setx
: the x coordinates of the data setband_4
, band_8
and ndvi
that share the same coordinatestiff
are stored in the Attributes
Looking to the data itself (click on the icons on the right), we can see each of the Data variables is a Numpy ndarrays:
type(xr_data["b4"].data)
And also the coordinates that describe a dimension are Numpy ndarrays:
type(xr_data.coords["x"].values)
Selecting data
Xarray’s labels make working with multidimensional data much easier:
xr_array = xr.open_rasterio("./data/gent/raster/2020-09-17_Sentinel_2_L1C_B0408.tiff")
Rename the coordinates of the band dimension:
xr_array = xr_array.assign_coords(band=("band", ["b4", "b8"]))
We could use the Numpy style of data slicing:
xr_array[0]
However, it is often much more powerful to use xarray’s .sel()
method to use label-based indexing:
xr_array.sel(band="b4")
We can select a specific set of coordinate values as a list and take the value that is most near to the given value:
xr_array.sel(x=[406803, 410380, 413958], method="nearest") # .sel(band="b4").plot.line(hue="x");
Sometimes, a specific range is required. The .sel()
method also supports slicing, so we can select band 4 and slice a subset of the data along the x direction:
xr_array.sel(x=slice(400_000, 420_000), band="b4").plot.imshow()
Note Switch in between Array
and Datasets
as you like, it won't hurt your computer memory:
xr_data = xr_array.to_dataset(dim="band")
#xr_data.to_array() # dim="band"
Just like in numpy, we can reduce xarray DataArrays along any number of axes:
xr_data["b4"].mean(axis=0).dims
xr_data["b4"].mean(axis=1).dims
But we have dimensions with labels, so rather than performing reductions on axes (as in Numpy), we can perform them on dimensions. This turns out to be a huge convenience:
xr_data["b4"].mean(dim="x").dims
Calculate minimum or quantile values for each of the bands separately:
xr_array.min(dim=["x", "y"])
xr_array.quantile([0.1, 0.5, 0.9], dim=["x", "y"])
Xarray DataArrays and Datasets work seamlessly with arithmetic operators and numpy array functions.
xr_data["b4"] /10.
np.log(xr_data["b8"])
As we seen in the example of the NDVI, we can combine multiple xarray datasets in arithemetic operations:
xr_data["b8"] + xr_data["b4"]
Perfoming an operation on arrays with differenty coordinates will result in automatic broadcasting:
xr_data.x.shape, xr_data["b8"].shape
xr_data["b8"] + xr_data.x # Note, this calculaton does not make much sense, but illustrates broadcasting
Similar to Pandas, there is a plot
method, which can be used for different plot types:
xr_array = xr.open_rasterio("./data/gent/raster/2020-09-17_Sentinel_2_L1C_B0408.tiff")
xr_array = xr_array.assign_coords(band=("band", ["b4", "b8"]))
It supports both 2 dimensional (e.g. line) as 3 (e.g. imshow, pcolormesh) dimensional plots. When just using plot
, xarray will do a best guess on how to plot the data. However being explicit plot.line
, plot.imshow
, plot.pcolormesh
, plot.scatter
,... gives you more control.
xr_array.sel(band="b4").plot(); # add .line() -> ValueError: For 2D inputs, please specify either hue, x or y.
xr_array.sel(x=420000, method="nearest").plot.line(hue="band");
facetting
splits the data in subplots according to a dimension, e.g. band
xr_array.sel(x=420000, method="nearest").plot.line(col="band"); # row="band"
Use the robust
option when there is a lack of visual difference. This will use the 2nd and 98th percentiles of the data to compute the color limits. The arrows on the color bar indicate that the colors include data points outside the bounds.
ax = xr_array.sel(band="b4").plot(cmap="Reds", robust=True, figsize=(12, 5))
ax.axes.set_aspect('equal')
Compare data variables within a xarray Dataset
:
xr_data = xr_array.to_dataset(dim="band")
xr_data.plot.scatter(x="b4", y="b8", s=2)
Calculating and plotting the NDVI in three classes illustrates the options of the imshow
method:
xr_data["ndvi"] = (xr_data["b8"] - xr_data["b4"])/(xr_data["b8"] + xr_data["b4"])
xr_data["ndvi"].plot.imshow(levels=[-1, 0, 0.3, 1.], colors=["gray", "yellowgreen", "g"])
The data set for the following exercises is from Argo floats, an international collaboration that collects high-quality temperature and salinity profiles from the upper 2000m of the ice-free global ocean and currents from intermediate depths.
These data do not represent full coverage image data (like remote sensing images), but measurements of salinity and temperature as a function of water level
(related to the pressure). Each measurements happens at a given date
on a given location (lon
/lat
).
import xarray as xr
argo = xr.load_dataset("./data/argo_float.nc")
argo
The bold font (or * symbol in plain text output version) in the coordinate representation above indicates that x and y are 'dimension coordinates' (they describe the coordinates associated with data variable axes) while band is a 'non-dimension coordinates'. We can make any variable a non-dimension coordinate.
Let's plot the coordinates of the available measurements and add a background map using contextly:
EXERCISE:
Add a new variable to the argo
data set, called temperature_kelvin
, by converting the temperature to Kelvin.
Degrees Kelvin = degrees celsius + 273.
# %load _solutions/13-xarray1.py
EXERCISE:
The water level classes define different water depths. The pressure is a proxy for the water depth. Verify the relationship between the pressure and the level using a scatter plot. Does a larger value for the level represent deeper water depths or not?
ValueError: Dataset.plot cannot be called directly. Use an explicit plot method, e.g. ds.plot.scatter(...)
, read the message and do what it says.# %load _solutions/13-xarray2.py
EXERCISE:
Assume that buoyancy is defined by the following formula:
$$g \cdot ( 2\times 10^{-4} \cdot T - 7\times 10^{-4} \cdot P )$$With:
Calculate the buoyancy and add it as a new variable buoyancy
to the argo
data set.
Make a 2D (image) plot with the x-axis the date, the y-axis the water level and the color intensity the buoyancy. As the level represents the depth of the water, it makes more sense to have 0 (surface) at the top of the y-axis: switch the y-axis direction.
imshow
method does not work on irregular intervals. Matplotlib and xarray also have pcolormesh
.ax.invert_yaxis()
Matplotlib function is not supported for pcolormesh)# %load _solutions/13-xarray3.py
# %load _solutions/13-xarray4.py
# %load _solutions/13-xarray5.py
EXERCISE:
Make a line plot of the salinity as a function of time at level 10
Break it down into different steps and chain the individual steps:
salinity
. This is similar to selecting a column in Pandas.sel
method to select the level=10
plot.line()
method.# %load _solutions/13-xarray6.py
EXERCISE:
Break it down into different steps and chain these individual steps:
temperature
. This is similar to selecting a column in Pandas.sel
method to select the levels 10, 20 and 30.plot.line()
method, but make sure the hue
changes for each levelFor the subplots, check the facetting documentation of xarray.
# %load _solutions/13-xarray7.py
# %load _solutions/13-xarray8.py
EXERCISE:
You wonder how the temperature evolves with increasing latitude and what the effect is of the depth (level):
Create a scatter plot of the level
as a function of the temperature
colored by the latitude
.
As a further exploration step, pick a subset of levels 1, 10, 25, and 50 and create a second scatter plot with in the x-axis the latitude of the measurement and in the y-axis the temperature. To compare the effect of the different levels, give each level a separate subplot next to each other.
sel
method to select the levels 1, 10, 25, and 50.col
changes for each level
and define which variables need to go to which axis.# %load _solutions/13-xarray9.py
# %load _solutions/13-xarray10.py
EXERCISE:
Make an image plot of the temperature as a function of time. Divide the colormap in 3 discrete categories:
Choose a custom colormap and adjust the label of the colorbar to 'Temperature (°C)'
plot
function or the xarray documentation on discrete colormaps.cbar_kwargs
as a dict. Adjust the label
of the colorbar.# %load _solutions/13-xarray11.py
EXERCISE:
Calculate the average salinity and temperature as a function of level over the measurements taken between 2012-10-01 and 2012-12-01.
Make a separate line plot for each of them. Define the figure and 2 subplots first with Matplotlib.
slice
operator within the sel
to select a range of the data.axis
in reduction functions, xarray uses the dim
name.fig, (ax0, ax1) = plt.subplots(1, 2)
to create subplots.# %load _solutions/13-xarray12.py
# %load _solutions/13-xarray13.py
argo = xr.load_dataset("./data/argo_float.nc")
If we are interested in the average over time for each of the levels, we can use a reducton function to get the averages of each of the variables at the same time:
argo.mean(dim=["date"])
But if we wanted the average for each month of the year per level, we would first have to split the data set in a group for each month of the year, apply the average function on each of the months and combine the data again.
We already learned about the split-apply-combine approach when using Pandas. The syntax of Xarray’s groupby is almost identical to Pandas!
First, extract the month of the year (1-> 12) from each of the date coordinates:
argo.date.dt.month # The coordinates is a Pandas datetime index
We can use these arrays in a groupby operation:
argo.groupby(argo.date.dt.month)
Xarray also offers a more concise syntax when the variable you're grouping on is already present in the dataset. This is identical to the previous line:
argo.groupby("date.month")
Next, we apply an aggregation function for each of the months over the date
dimension in order to end up with: for each month of the year, the average (over time) for each of the levels:
argo.groupby("date.month").mean(dim="date") #["temperature"].sel(level=1).to_series().plot.barh()
Another (alike) operation - specifically for time series data - is to resample
the data to another time-aggregation. For example, resample to monthly (1M
) or yearly (1Y
) median values:
argo.resample(date="1M").median() # 1Y
argo["salinity"].sel(level=1).plot.line(x="date");
argo["salinity"].resample(date="1M").median().sel(level=1).plot.line(x="date"); # 1Y
A similar, but different functionality is rolling
to calculate rolling window aggregates:
argo.rolling(level=10, center=True).std()
argo["salinity"].sel(date='2012-10-31').plot.line(y="level", yincrease=False, color="grey");
argo["salinity"].sel(date='2012-10-31').rolling(level=10, center=True).median().plot.line(y="level", yincrease=False, linewidth=3, color="crimson");
plt.legend(), plt.title("");
REMEMBER:
The xarray groupby
with the same syntax as Pandas implements the split-apply-combine strategy. Also resample
and rolling
are available in xarray.
Note: Xarray adds a groupby_bins
convenience function for binned groups (instead of each value).
Note: Values are only read from disk when needed. For example, the following statement only reads the coordinate information and the metadata. The data itself is not yet loaded:
gent = xr.open_rasterio(data_file)
gent
load()
will explicitly load the data into memory:
xr.open_rasterio(data_file).load()
Acknowledgements and great thanks to https://earth-env-data-science.github.io for the inspiration, data and examples.