CASE - Sea Surface Temperature data
DS Python for GIS and Geoscience
October 2020

© 2020, Joris Van den Bossche and Stijn Van Hoey. Licensed under CC BY 4.0 Creative Commons
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import xarray as xr
For this use case, we focus on the Extended Reconstructed Sea Surface Temperature (ERSST), a widely used and trusted gridded compilation of historical Sea Surface Temperature (SST).
The Extended Reconstructed Sea Surface Temperature (ERSST) dataset is a global monthly sea surface temperature dataset derived from the International Comprehensive Ocean–Atmosphere Dataset (ICOADS). It is produced on a 2° × 2° grid with spatial completeness enhanced using statistical methods. This monthly analysis begins in January 1854 and continues to the present, and includes anomalies computed with respect to a 1971–2000 monthly climatology.
First we download the dataset. We will use the NOAA Extended Reconstructed Sea Surface Temperature (ERSST) v4 product. Download the data from this link: https://psl.noaa.gov/thredds/fileServer/Datasets/noaa.ersst/sst.mnmean.v4.nc and store it in the same folder as the notebook as sst.mnmean.v4.nc.
Reading in the data set, ignoring the time_bnds variable:
data = './sst.mnmean.v4.nc'
ds = xr.open_dataset(data, drop_variables=['time_bnds'], engine="h5netcdf")
For this use case, we will focus on the years after 1960, so we slice the data from 1960 and load the data into our computer memory. By only loading the data after the initial slice, we make sure to only load into memory the data we specifically need:
ds = ds.sel(time=slice('1960', '2018')).load() # load into memory
ds
Data with the extension nc is in the NetCDF format. NetCDF (Network Common Data Format) is the most widely used format for distributing geoscience data. NetCDF is maintained by the Unidata organization; check the NetCDF website for more information. Xarray was designed to make reading NetCDF files in Python as easy, powerful, and flexible as possible.
Note: As the data is also available on an OPeNDAP server, we could load the NetCDF data directly without downloading anything. This would require us to add the netcdf4 package to our conda environment.
The data contains a single data variable sst and has 3 dimensions: lon, lat and time, each described by a coordinate. Let's first get some insight into the structure and content of the data.
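As a quick illustration of where these properties live, the snippet below uses a tiny synthetic stand-in (random values, not the real ERSST file, so it runs without the download) to show that shape and size belong to the xarray.DataArray, while a xarray.Dataset exposes dimension sizes through its own mapping:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Tiny synthetic stand-in for the ERSST data (random values, not real SST)
sst = xr.DataArray(
    np.random.rand(4, 3, 2).astype("float32"),
    dims=("time", "lat", "lon"),
    coords={
        "time": pd.date_range("1960-01-01", periods=4, freq="MS"),
        "lat": [50.0, 52.0, 54.0],
        "lon": [300.0, 302.0],
    },
    name="sst",
    attrs={"units": "degC"},
)
ds = sst.to_dataset()

print(ds.sizes)          # dimension sizes are queried on the Dataset
print(ds["sst"].shape)   # shape/size are DataArray attributes, not Dataset ones
print(ds.attrs == ds["sst"].attrs)  # Dataset and DataArray attributes can differ
```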
EXERCISE:

Are the attributes of ds the same as the attributes of the sst data itself?

Hints:
- The size of an array is an attribute of an xarray.DataArray and not of a xarray.Dataset.
- The shape of an array is an attribute of an xarray.DataArray. A xarray.Dataset has the dims attribute to query dimension sizes.

# %load _solutions/case-sea-surface-temperature1.py
# %load _solutions/case-sea-surface-temperature2.py
# %load _solutions/case-sea-surface-temperature3.py
# %load _solutions/case-sea-surface-temperature4.py
# %load _solutions/case-sea-surface-temperature5.py
As we work with a single data variable, we will introduce a new variable sst, which is the xarray.DataArray of the SST values. Note that we only keep the attributes on the xarray.DataArray level.
sst = ds["sst"]
sst
EXERCISE:

Make an image plot of the SST in the first month of the data set, January 1960. Adjust the range of the colorbar and switch to the coolwarm colormap.

Hints:
- Select the first month with the time 1960-01-01.
- Adjust the range of the colorbar with vmin and vmax.

# %load _solutions/case-sea-surface-temperature6.py
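One possible solution sketch; it uses a small synthetic grid (random values) so that it runs without the downloaded file, but the same selection and plot call work on the real sst variable:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, so this also runs without a display
import numpy as np
import pandas as pd
import xarray as xr

# Small synthetic stand-in for the real sst variable (random values)
sst = xr.DataArray(
    np.random.uniform(-2, 30, size=(2, 5, 8)),
    dims=("time", "lat", "lon"),
    coords={
        "time": pd.date_range("1960-01-01", periods=2, freq="MS"),
        "lat": np.linspace(-80, 80, 5),
        "lon": np.linspace(0, 350, 8),
    },
    name="sst",
)

first_month = sst.sel(time="1960-01-01")  # select January 1960
# fix the colorbar range and switch the colormap
first_month.plot(vmin=-2, vmax=30, cmap="coolwarm")
```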
Note: xarray uses xarray.plot.pcolormesh() as the default two-dimensional plot method because it is more flexible than xarray.plot.imshow(). However, for large arrays, imshow can be much faster than pcolormesh. If speed is important to you and you are plotting a regular mesh, consider using imshow.
EXERCISE:

How did the SST evolve in time for a specific location on the earth? Make a line plot of the SST at lon=300, lat=50 as a function of time. Do you recognize the seasonality of the data?

Hint:
- Use sel for both the lon/lat selection.

# %load _solutions/case-sea-surface-temperature7.py
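A sketch of the selection step on synthetic data (random values; the coordinates lon=300, lat=50 come from the exercise):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in: two years of monthly values on a small grid
sst = xr.DataArray(
    np.random.rand(24, 4, 4),
    dims=("time", "lat", "lon"),
    coords={
        "time": pd.date_range("1960-01-01", periods=24, freq="MS"),
        "lat": [46.0, 48.0, 50.0, 52.0],
        "lon": [296.0, 298.0, 300.0, 302.0],
    },
    name="sst",
)

point = sst.sel(lon=300, lat=50)  # pick one grid cell with sel
print(point.dims)  # only the time dimension is left
# point.plot.line() then draws SST against time for this location
```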
EXERCISE:

What is the evolution of the SST as a function of the month of the year?

Calculate the average SST with respect to the month of the year for all positions in the data set and store the result as a variable ds_mm.

Use the ds_mm variable to make a plot: for longitude 164, compare the monthly average at latitude -23.4 versus latitude 23.4. Use a line plot with the month of the year on the x-axis and the average SST on the y-axis.

Hints:
- Use sel for both the lon/lat selection.
- Use method="nearest" inside a selection.

# %load _solutions/case-sea-surface-temperature8.py
# %load _solutions/case-sea-surface-temperature9.py
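A sketch of the groupby step on synthetic data (random values; the coordinates 164, -23.4 and 23.4 come from the exercise):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in: three years of monthly values at two latitudes
sst = xr.DataArray(
    np.random.rand(36, 2, 1),
    dims=("time", "lat", "lon"),
    coords={
        "time": pd.date_range("1960-01-01", periods=36, freq="MS"),
        "lat": [-23.4, 23.4],
        "lon": [164.0],
    },
    name="sst",
)

# Average per month of the year, over all years: the monthly climatology
ds_mm = sst.groupby("time.month").mean(dim="time")

south = ds_mm.sel(lon=164, lat=-23.4, method="nearest")
north = ds_mm.sel(lon=164, lat=23.4, method="nearest")
# south.plot.line() and north.plot.line() then compare both latitudes
print(ds_mm.sizes["month"])  # 12 months of the year
```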
EXERCISE:

How does the zonal mean climatology for each month of the year change with the latitude?

Reuse the ds_mm variable from the previous exercise, or recalculate the average SST with respect to the month of the year for all positions in the data set and store the result as a variable ds_mm.

To check the mean climatology (aggregating over the longitudes) as a function of the latitude for each month of the year, calculate the average SST over the lon dimension from ds_mm. Plot the result as an image with the month of the year on the x-axis and the latitude on the y-axis.

Hint:
- You do not need groupby, but you need to calculate a reduction along a dimension.

# %load _solutions/case-sea-surface-temperature10.py
# %load _solutions/case-sea-surface-temperature11.py
# %load _solutions/case-sea-surface-temperature12.py
EXERCISE:

How different is the mean climatology between January and July?

Reuse the ds_mm variable from the previous exercises, or recalculate the average SST with respect to the month of the year for all positions in the data set and store the result as a variable ds_mm.

Calculate the difference of the mean climatology between January and July and plot the result as an image (map) with the longitude on the x-axis and the latitude on the y-axis.

Hint:
- You do not need groupby, only selections from the ds_mm variable.

# %load _solutions/case-sea-surface-temperature13.py
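A sketch of the selection-only approach on synthetic data (random values); subtracting two month selections drops the conflicting month coordinate and leaves a lat/lon map:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for the data behind the monthly climatology
sst = xr.DataArray(
    np.random.rand(24, 3, 4),
    dims=("time", "lat", "lon"),
    coords={
        "time": pd.date_range("1960-01-01", periods=24, freq="MS"),
        "lat": [-10.0, 0.0, 10.0],
        "lon": [0.0, 90.0, 180.0, 270.0],
    },
    name="sst",
)
ds_mm = sst.groupby("time.month").mean(dim="time")

# No groupby needed: select the two months and subtract
diff = ds_mm.sel(month=1) - ds_mm.sel(month=7)
print(diff.dims)  # a (lat, lon) map, ready for diff.plot()
```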
To understand how the SST temperature evolved as a function of time during the last decades, we want to remove this climatology from the dataset and examine the residual, called the anomaly, which is the interesting part from a climate perspective.
We will do this by subtracting the monthly average from the values of that specific month. Hence, subtract the average January value over the years from the January data, subtract the average February value over the years from the February data,...
Removing the seasonal climatology is an example of a transformation: it operates over a group, but does not change the size of the dataset, as we do the operation on each element (x - x.mean()).
This is not the same as the aggregations (e.g. average) we applied to each of the groups earlier. When using groupby, a calculation can be applied to each group and, just like in Pandas, these calculations can either be:

- an aggregation: reduces the size of the group (e.g. taking the average of each group)
- a transformation: preserves the group's full size (e.g. subtracting the group average from each element of the group)

One way to think about this is that we apply a function to each of the groups. For our anomaly calculation we want to do a transformation and apply the following function:
def remove_time_mean(x):
    """Subtract each value by the mean over time"""
    return x - x.mean(dim='time')
We can apply
this function to each of the groups:
sst = ds["sst"]
ds_anom = sst.groupby('time.month').apply(remove_time_mean)
ds_anom
In other words: subtract from each element the average over time of all elements of the month that element belongs to.
Xarray makes these sorts of transformations easy by supporting groupby arithmetic. This concept is easiest explained by applying it for our application:
gb = sst.groupby('time.month') # make groups (in this example each month of the year is a group)
ds_anom = gb - gb.mean(dim='time') # subtract each element of the group/month by the mean of that group/month over time
ds_anom
Now we can view the climate signal without the overwhelming influence of the seasonal cycle:
ds_anom.sel(lon=300, lat=50).plot.line()
Check the difference between Jan. 1 2018 and Jan. 1 1960 to see where the evolution in time is the largest:
(ds_anom.sel(time='2018-01-01') - ds_anom.sel(time='1960-01-01')).plot()
EXERCISE:

Compute the five-year median of the ds_anom variable for the location lon=300, lat=50, as well as the 12-month rolling median of the same data set. Store the output as respectively ds_anom_resample and ds_anom_rolling.

Make a line plot as a function of time for the location lon=300, lat=50 of the original ds_anom data, the resampled data and the rolling median data.

Hint:
- Check the resample and the rolling functions.

# %load _solutions/case-sea-surface-temperature14.py
# %load _solutions/case-sea-surface-temperature15.py
# %load _solutions/case-sea-surface-temperature16.py
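A sketch of both operations on a synthetic anomaly series (random values); the "5YS" frequency string (five years, anchored at year start) is a pandas offset alias:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in: ten years of monthly anomalies at one location
anom = xr.DataArray(
    np.random.randn(120),
    dims=("time",),
    coords={"time": pd.date_range("1960-01-01", periods=120, freq="MS")},
    name="sst",
)

ds_anom_resample = anom.resample(time="5YS").median()          # five-year median
ds_anom_rolling = anom.rolling(time=12, center=True).median()  # 12-month rolling median
print(ds_anom_resample.sizes["time"], ds_anom_rolling.sizes["time"])
```

Note that resample reduces the length of the time axis (one value per five-year bin), while rolling keeps the original length and pads the edges with NaN.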
The previous maps were the default outputs of xarray, without specification of the spatial context. For reporting, these plots are not appropriate. We can use the cartopy package to make our Matplotlib axes spatially aware.

To integrate the xarray data in a cartopy plot, the crucial element is the transform argument, which controls the coordinate system the given data is in. Add the transform keyword with an appropriate cartopy.crs.CRS instance from the cartopy.crs module:
import cartopy.crs as ccrs
import cartopy
map_proj = ccrs.Robinson() # Define the projection
fig, ax = plt.subplots(figsize = (16,9), subplot_kw={"projection": map_proj})
ax.gridlines()
ax.coastlines()
sst.sel(time="1960-01-01").plot(ax=ax, vmin=-2, vmax=30, center=5,
                                cmap='coolwarm', transform=ccrs.PlateCarree(),  # transform argument
                                cbar_kwargs={'shrink': 0.75})
EXERCISE:

Make a plot of the ds_anom variable of 2018-01-01 with cartopy.

Hint:
- Use ccrs.Orthographic with the central lon/lat on -20, 5.

# %load _solutions/case-sea-surface-temperature17.py
Apart from aggregations as a function of time, spatial aggregations using other (spatial) data sets can also be achieved. In the next section, we want to compute the average SST over different ocean basins. The data set at http://iridl.ldeo.columbia.edu/SOURCES/.NOAA/.NODC/.WOA09/.Masks/.basin/ contains the main ocean basins in lon/lat:
basin = xr.open_dataset("./data/basin.nc")
basin = basin.rename({'X': 'lon', 'Y': 'lat'})
basin["basin"]
The names of the basins are included in the attributes of the data set. Using Pandas, we can create a mapping between the basin names and the index used in the basin data set:
basin_names = basin["basin"].attrs['CLIST'].split('\n')
basin_s = pd.Series(basin_names, index=np.arange(1, len(basin_names)+1))
basin_s = basin_s.rename('basin')
basin_s.head()
We will use this mapping from identifier to label later in the analysis.
The basin data set provides multiple Z levels. We are interested in the division at surface level (0.0):
basin_surface = basin["basin"].sel(Z=0.0).drop_vars("Z")
basin_surface.plot(vmax=10, cmap='tab10')
The next step is to align both data sets. For this application, using the 'nearest' available data point will work to map both data sets onto each other. Xarray provides the method interp_like to interpolate basin_surface to the grid of the sst variable:
basin_surface_interp = basin_surface.interp_like(sst, method='nearest')
basin_surface_interp.plot(vmax=10, cmap='tab10')
EXERCISE:

Compute the mean SST (over all dimensions) for each of the basins in the basin_surface variable, starting from the sst variable.

Next, we want to plot a horizontal bar chart with the mean SST for each basin. To do so, link the result to the basin_s variable by merging on the index (identifiers of the basin names).

Hint:
- Use groupby with the basin_surface_interp as input.

# %load _solutions/case-sea-surface-temperature18.py
# %load _solutions/case-sea-surface-temperature19.py
# %load _solutions/case-sea-surface-temperature20.py
# %load _solutions/case-sea-surface-temperature21.py
# %load _solutions/case-sea-surface-temperature22.py
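A sketch of the grouping step with a synthetic SST field and a two-basin integer mask (random values; the basin names are made up for illustration). Grouping by a two-dimensional DataArray stacks the spatial dimensions, and passing ... to mean reduces over every remaining dimension:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic SST field and an integer basin mask on the same 4x4 grid
sst = xr.DataArray(
    np.random.uniform(-2, 30, size=(6, 4, 4)),
    dims=("time", "lat", "lon"),
    coords={"lat": [0.0, 2.0, 4.0, 6.0], "lon": [0.0, 2.0, 4.0, 6.0]},
    name="sst",
)
mask = xr.DataArray(
    np.array([[1, 1, 2, 2]] * 4),  # two basins: ids 1 and 2
    dims=("lat", "lon"),
    coords={"lat": sst["lat"], "lon": sst["lon"]},
    name="basin",
)

# Group the SST cells by basin id and average over all remaining dimensions
basin_mean = sst.groupby(mask).mean(...)

# Attach readable names via a pandas Series indexed by basin id (like basin_s)
names = pd.Series(["basin one", "basin two"], index=[1, 2], name="name")
result = basin_mean.to_series().to_frame("mean_sst").join(names)
# result.plot.barh(x="name", y="mean_sst") then draws the horizontal bar chart
print(result)
```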
Acknowledgements to https://earth-env-data-science.github.io/lectures/xarray/xarray-part2.html