Satellite data come from either polar or low-inclination orbits that circle the Earth, or geostationary orbits that hold a fixed position above a specific point on the Earth, near the Equator. The type of data from a satellite depends on both the orbit and the type of instrument. There are standard definitions for the different satellite data types.
Satellite data are distributed in several different formats: netCDF, HDF5, and GeoTIFF. There are also some newer cloud-optimized formats: Zarr and Cloud Optimized GeoTIFF (COG).
Data access can be easy and fast on the cloud, but to avoid potential issues down the road, spend time understanding the strengths and weaknesses of any dataset. Any science result will need to address uncertainties in the data, so it is easier to spend time at the beginning to understand them and avoid mis-applying the data.
These are BIG datasets that you can analyze on the cloud without downloading the data.
You can run this on your phone, a Raspberry Pi, laptop, or desktop.
By using public cloud data, your science is reproducible and easily shared!
The OHW Jupyter Hub is an m5.2xlarge instance shared by up to 4 users. What does that mean? An m5.2xlarge has 8 vCPUs and 32 GB of memory. If there are 3 other users on your node, the memory could be used up fast by large satellite datasets, so just be a little aware of that when you want to run something....
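For a sense of scale, here is a rough back-of-envelope memory estimate for the MUR SST data used below (the grid dimensions are approximate values for MUR's ~0.01-degree global grid):

```python
# Rough memory estimate for loading MUR SST timesteps (grid size approximate)
nlat, nlon = 17999, 36000    # ~0.01-degree global grid
bytes_per_value = 4          # float32
one_timestep_gb = nlat * nlon * bytes_per_value / 1e9
print(f"one timestep ~ {one_timestep_gb:.1f} GB")        # ~ 2.6 GB
print(f"ten timesteps ~ {10 * one_timestep_gb:.0f} GB")  # close to the node's whole 32 GB
```

This is why the examples below always subset in time and space before calling .load().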
Code is in the cells that have In [ ]: to the left of the cell and have a colored background.
To run the code: hold Shift down and press Enter.
Remember:
Esc then b inserts a new cell below.
Esc then dd deletes the current cell.
We will start with my favorite Analysis Ready Data (ARD) format: Zarr. Data stored in Zarr is fast and simple to use, and it keeps all the metadata normally found in a netCDF file, so you can easily figure out what is in the datastore.
Fast - Zarr is fast because all the metadata is consolidated into a single JSON file. Opening a massive dataset is lightning fast because only the metadata is read; the data itself is not read until it is needed for a computation.
Simple - Filenames? Who needs them? Who cares? Not I. Simply point your read routine to the data directory.
Metadata - all you want!
# filter some warning messages
import warnings
warnings.filterwarnings("ignore")
#libraries
import datetime as dt
import xarray as xr
import fsspec
import s3fs
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
# make datasets display nicely
xr.set_options(display_style="html")
import dask
from dask.distributed import performance_report, Client, progress
#magic functions: embed static images of your plots in the notebook
%matplotlib inline
plt.rcParams['figure.figsize'] = 12, 6
%config InlineBackend.figure_format = 'retina'
fsspec.get_mapper creates a mapping between your compute environment and the s3 bucket. This isn't necessary if the Zarr store is stored locally.
xr.open_zarr reads a Zarr store into an xarray Dataset.
%%time
file_location = 's3://mur-sst/zarr'
ikey = fsspec.get_mapper(file_location, anon=True)
ds_sst = xr.open_zarr(ikey,consolidated=True)
ds_sst
Select the analysed_sst variable over a specific time period, lat, and lon, and load the data into memory. This subset is small enough to load into memory, which will make calculating climatologies easier in the next step.
%%time
sst_timeseries = ds_sst['analysed_sst'].sel(time = slice('2010-01-01','2020-01-01'),
                                            lat = 47,
                                            lon = -145
                                           ).load()
sst_timeseries.plot()
Use the .groupby method to calculate the climatology, then the .resample method to average the anomalies into 1-month bins.
%%time
sst_climatology = sst_timeseries.groupby('time.dayofyear').mean('time',keep_attrs=True,skipna=False)
sst_anomaly = sst_timeseries.groupby('time.dayofyear')-sst_climatology
sst_anomaly_monthly = sst_anomaly.resample(time='1MS').mean(keep_attrs=True,skipna=False)
#plot the data
sst_anomaly.plot()
sst_anomaly_monthly.plot()
plt.axhline(linewidth=2,color='k')
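The climatology/anomaly pattern used above can be sketched on synthetic data. This is a hypothetical two-year daily series in plain pandas (not the MUR data), where year two carries a constant +1.0 offset that the anomaly recovers:

```python
import numpy as np
import pandas as pd

# Two years of synthetic daily "SST": a seasonal cycle, plus +1.0 in year two
time = pd.date_range("2018-01-01", "2019-12-31", freq="D")
seasonal = 10 * np.sin(2 * np.pi * time.dayofyear / 365.25)
sst = pd.Series(seasonal + np.where(time.year == 2019, 1.0, 0.0), index=time)

# Day-of-year climatology, then subtract it to get anomalies
climatology = sst.groupby(sst.index.dayofyear).mean()
anomaly = sst - climatology.reindex(sst.index.dayofyear).to_numpy()

# The two-year mean splits the +1.0 offset: year two sits +0.5 above it
print(anomaly.loc["2019"].mean().round(2))  # 0.5
```

xarray's groupby('time.dayofyear') does the same alignment for you across the extra lat/lon dimensions.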
# Repeat the timeseries at a high-latitude point (72N, 171W)
sst_timeseries = ds_sst['analysed_sst'].sel(time = slice('2010-01-01','2020-01-01'),
                                            lat = 72,
                                            lon = -171
                                           ).load()
sst_timeseries.plot()
Many L4 SST analyses blend infrared (~1 - 4 km) data with passive microwave (~50 km) data, so data availability will determine regional and temporal changes in spatial resolution.
Many L4 SST analyses also apply smoothing filters that may further reduce resolution.
%%time
subset = ds_sst['analysed_sst'].sel(time='2019-06-01',lat=slice(35,40),lon=slice(-126,-120))
subset.plot(vmin=282,vmax=289,cmap='inferno')
%%time
subset = ds_sst['analysed_sst'].sel(time='2019-05-15',lat=slice(35,40),lon=slice(-126,-120))
subset.plot(vmin=282,vmax=289,cmap='inferno')
Find publicly available data on AWS.
Which brings us to GOES SSTs.....
Info on the data is here: GOES has a lot of different parameters, and they are all stored in different files with names that are difficult to understand. There are 80 different data products. This link lists them all and explains the different directory names: AWS info on GOES SST data.
The GOES data are in netCDF format. There is a different file for each of the 80 products for each year/day/hour. A lot of files. I find it really useful to 'browse' s3: buckets so that I can understand the directory and data structure. Explore S3 structure. The directory structure is <Product>/<Year>/<Day of Year>/<Hour>/<Filename>
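Because every object key follows that <Product>/<Year>/<Day of Year>/<Hour>/<Filename> pattern, you can recover the date and hour from a key with plain string handling. A small sketch (the key below is a made-up example following the documented pattern, not a real file):

```python
import datetime as dt

# Hypothetical key following the documented GOES layout:
# <Product>/<Year>/<Day of Year>/<Hour>/<Filename>
key = "ABI-L2-SSTF/2020/210/14/OR_ABI-L2-SSTF-M6_G16_example.nc"

product, year, doy, hour, filename = key.split("/")

# Day-of-year 1 is January 1, hence the -1
when = dt.datetime(int(year), 1, 1) + dt.timedelta(days=int(doy) - 1, hours=int(hour))
print(product, when)  # ABI-L2-SSTF 2020-07-28 14:00:00
```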
In the code below we create a function that searches for all files from a certain day, then creates the keys for opening them. The files are messy enough that opening a day takes a little while. There are other ways you could write this function depending on your analysis goals; this is just one way to get some data in a reasonable amount of time.
This function uses:
s3fs.S3FileSystem, which holds a connection with an s3 bucket and allows you to list files, etc.
xr.open_mfdataset, which opens a list of file objects and concatenates the data along the specified dimensions.
def get_geo_data(sat,lyr,idyjl):
    # arguments
    # sat    goes-east, goes-west, himawari
    # lyr    year
    # idyjl  day of year
    d = dt.datetime(lyr,1,1) + dt.timedelta(days=idyjl)

    fs = s3fs.S3FileSystem(anon=True) #connect to s3 bucket!

    #create strings for the year and julian day
    imon,idym = d.month,d.day
    syr,sjdy,smon,sdym = str(lyr).zfill(4),str(idyjl).zfill(3),str(imon).zfill(2),str(idym).zfill(2)

    #use glob to list all the files in the directory
    file_location,var = [],None
    if sat=='goes-east':
        file_location,var = fs.glob('s3://noaa-goes16/ABI-L2-SSTF/'+syr+'/'+sjdy+'/*/*.nc'),'SST'
    if sat=='goes-west':
        file_location,var = fs.glob('s3://noaa-goes17/ABI-L2-SSTF/'+syr+'/'+sjdy+'/*/*.nc'),'SST'
    if sat=='himawari':
        file_location,var = fs.glob('s3://noaa-himawari8/AHI-L2-FLDK-SST/'+syr+'/'+smon+'/'+sdym+'/*/*L2P*.nc'),'sea_surface_temperature'

    if len(file_location)<1:
        return None  #no files found for this satellite/day

    #make a list of links to the file keys
    file_ob = [fs.open(file) for file in file_location] #open connection to files

    #open all the day's data
    ds = xr.open_mfdataset(file_ob,combine='nested',concat_dim='time') #note the file formatting is messy

    #clean up coordinates which are a MESS in GOES
    #rename one of the coordinates that doesn't match a dim & should
    if not sat=='himawari':
        ds = ds.rename({'t':'time'})
        ds = ds.reset_coords()
    else:
        ds = ds.rename({'ni':'x','nj':'y'})

    #put in to Celsius
    #ds[var] -= 273.15 #nice python shortcut: a -= 273.15 is the same as a = a - 273.15
    #ds[var].attrs['units'] = '$^\circ$C'

    return ds
%%time
lyr, idyjl = 2020, 210 #late July 2020
ds = dask.delayed(get_geo_data)('goes-east',lyr,idyjl)
ds
%%time
subset = ds.sel(x=slice(-0.01,0.07215601),y=slice(0.12,0.09)) #reduce to GS region
masked = subset.SST.where(subset.DQF==0).compute()
masked.isel(time=14).plot(vmin=14+273.15,vmax=30+273.15,cmap='inferno')
%%time
mean_dy = masked.mean('time',skipna=True).compute() #here I want all possible values so skipna=True
mean_dy.plot(vmin=14+273.15,vmax=30+273.15,cmap='inferno')
The data above are on the GOES grid, which isn't very helpful if you want to make a subset by a known lat/lon or to plot borders with Cartopy.
The function below takes a GOES-16/17 dataset and returns the same dataset with latitude (lat) and longitude (lon) added as 2D coordinate arrays. The math to do this was taken from https://makersportal.com/blog/2018/11/25/goes-r-satellite-latitude-and-longitude-grid-projection-algorithm
def calc_latlon(ds):
    # The math for this function was taken from
    # https://makersportal.com/blog/2018/11/25/goes-r-satellite-latitude-and-longitude-grid-projection-algorithm
    x = ds.x
    y = ds.y
    goes_imager_projection = ds.goes_imager_projection

    x,y = np.meshgrid(x,y)

    r_eq = goes_imager_projection.attrs["semi_major_axis"]
    r_pol = goes_imager_projection.attrs["semi_minor_axis"]
    l_0 = goes_imager_projection.attrs["longitude_of_projection_origin"] * (np.pi/180)
    h_sat = goes_imager_projection.attrs["perspective_point_height"]
    H = r_eq + h_sat

    a = np.sin(x)**2 + (np.cos(x)**2 * (np.cos(y)**2 + (r_eq**2 / r_pol**2) * np.sin(y)**2))
    b = -2 * H * np.cos(x) * np.cos(y)
    c = H**2 - r_eq**2

    r_s = (-b - np.sqrt(b**2 - 4*a*c))/(2*a)

    s_x = r_s * np.cos(x) * np.cos(y)
    s_y = -r_s * np.sin(x)
    s_z = r_s * np.cos(x) * np.sin(y)

    lat = np.arctan((r_eq**2 / r_pol**2) * (s_z / np.sqrt((H-s_x)**2 + s_y**2))) * (180/np.pi)
    lon = (l_0 - np.arctan(s_y / (H-s_x))) * (180/np.pi)

    ds = ds.assign_coords({
        "lat":(["y","x"],lat),
        "lon":(["y","x"],lon)
    })
    ds.lat.attrs["units"] = "degrees_north"
    ds.lon.attrs["units"] = "degrees_east"

    return ds
ds_latlon = calc_latlon(ds.compute())
ds_latlon
We could use .where() to select data, e.g.
subset = ds.where((ds.lon >= lon1) & (ds.lon <= lon2) & (ds.lat >= lat1) & (ds.lat <= lat2), drop=True)
Instead, we can create a function that returns the x/y min and max, so we can continue to use .sel() to quickly select data.
Note: the y slice must be done as slice(y2,y1), since the y coordinate is decreasing, so the larger value must come first.
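This larger-value-first rule comes from label-based slicing on a monotonically decreasing index, which xarray delegates to pandas. A minimal sketch in plain pandas:

```python
import pandas as pd

# A series indexed by a decreasing y coordinate, like the GOES grid
s = pd.Series([10, 20, 30], index=[0.12, 0.10, 0.08])

# With a decreasing index, the slice bounds must also decrease: larger value first
print(s.loc[0.12:0.08].tolist())  # [10, 20, 30]
```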
def get_xy_from_latlon(ds, lats, lons):
    lat1, lat2 = lats
    lon1, lon2 = lons

    lat = ds.lat.data
    lon = ds.lon.data

    x = ds.x.data
    y = ds.y.data

    x,y = np.meshgrid(x,y)

    x = x[(lat >= lat1) & (lat <= lat2) & (lon >= lon1) & (lon <= lon2)]
    y = y[(lat >= lat1) & (lat <= lat2) & (lon >= lon1) & (lon <= lon2)]

    return ((min(x), max(x)), (min(y), max(y)))
Now that we have x and y, we can subset in the same way as above:
lats = (31, 53)
lons = (-85, -56)
(x1,x2), (y1,y2) = get_xy_from_latlon(ds_latlon, lats, lons)
subset = ds_latlon.sel(x=slice(x1,x2), y=slice(y2,y1))
subset
%%time
import cartopy.crs as ccrs
import cartopy.feature as cfeature
masked = subset.where(subset.DQF==0).mean("time", skipna=True)
# For a faster-running example, plot a single time instead of taking a mean
# masked = subset.isel(time=14).where(subset.DQF==0)
fig = plt.figure()
ax = fig.add_subplot(111, projection=ccrs.PlateCarree())
masked.SST.plot(x="lon", y="lat", vmin=14+273.15,vmax=30+273.15,cmap='inferno', ax=ax)
ax.add_feature(cfeature.COASTLINE)
ax.add_feature(cfeature.STATES)
%%time
lyr, idyjl = 2020, 212
ds = dask.delayed(get_geo_data)('himawari',lyr,idyjl)
subset = ds.sel(x=slice(-0.05,0.08),y=slice(0.12,0.08))
masked = subset.sea_surface_temperature.where(subset.quality_level>=4)
mean_dy = masked.mean('time',skipna=True).compute() #here I want all possible values so skipna=True
mean_dy.plot(vmin=14+273.15,vmax=30+273.15,cmap='inferno')
masked.visualize()
The ERA5 data is organized into NetCDF files per variable, each containing a month of hourly data.
The directory structure is /{year}/{month}/main.nc for all the variables, or /{year}/{month}/data/{var}.nc for just a single variable. (The same data are also mirrored in Zarr format under /zarr/{year}/{month}/data/{var}.zarr, which is the layout the code below uses.)
For example, the full file path for sea surface temperature for January 2008 is:
/2008/01/data/sea_surface_temperature.nc
Note that due to the nature of the ERA5 forecast timing, which is run twice daily at 06:00 and 18:00 UTC, the monthly data file begins with data from 07:00 on the first of the month and continues through 06:00 of the following month. We'll see this in the coordinate values of a data file we download later in the notebook.
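For January 2008, that 07:00-to-06:00 window means the file should cover 2008-01-01 07:00 through 2008-02-01 06:00, i.e. 31 days of hourly steps. A quick sketch of the expected time coordinate:

```python
import pandas as pd

# Expected hourly time coordinate for the January 2008 ERA5 file,
# given the 07:00-to-06:00 monthly window described above
times = pd.date_range("2008-01-01 07:00", "2008-02-01 06:00", freq="h")
print(len(times), times[0], times[-1])
# 744 2008-01-01 07:00:00 2008-02-01 06:00:00
```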
Granule variable structure and metadata attributes are stored in main.nc. This file contains coordinate and auxiliary variable data. This file is also annotated using NetCDF CF metadata conventions.
# adapted from https://nbviewer.jupyter.org/github/awslabs/amazon-asdi/blob/main/examples/dask/notebooks/era5_zarr.ipynb
# by Zac Flamig
def fix_accum_var_dims(ds, var):
    # Some variables like precip have extra time bounds variables; we drop them here to allow merging with other variables
    # Select variable of interest (drops dims that are not linked to current variable)
    ds = ds[[var]]

    if var in ['air_temperature_at_2_metres',
               'dew_point_temperature_at_2_metres',
               'air_pressure_at_mean_sea_level',
               'northward_wind_at_10_metres',
               'eastward_wind_at_10_metres']:
        ds = ds.rename({'time0':'valid_time_end_utc'})
    elif var in ['precipitation_amount_1hour_Accumulation',
                 'integral_wrt_time_of_surface_direct_downwelling_shortwave_flux_in_air_1hour_Accumulation']:
        ds = ds.rename({'time1':'valid_time_end_utc'})
    else:
        print("Warning: haven't seen the {var} variable yet! Time renaming might not work.".format(var=var))

    return ds
@dask.delayed
def s3open(path):
    fs = s3fs.S3FileSystem(anon=True, default_fill_cache=False,
                           config_kwargs={'max_pool_connections': 20})
    return s3fs.S3Map(path, s3=fs)
def open_era5_range(start_year, end_year, variables):
    '''Opens ERA5 monthly Zarr stores on S3, given a start and end year (all months loaded) and a list of variables'''
    file_pattern = 'era5-pds/zarr/{year}/{month}/data/{var}.zarr/'

    years = list(np.arange(start_year, end_year+1, 1))
    months = ["01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12"]

    l = []
    for var in variables:
        print(var)

        # Get files
        files_mapper = [s3open(file_pattern.format(year=year, month=month, var=var)) for year in years for month in months]

        # Look up correct time dimension by variable name
        if var in ['precipitation_amount_1hour_Accumulation']:
            concat_dim = 'time1'
        else:
            concat_dim = 'time0'

        # Lazy load
        ds = xr.open_mfdataset(files_mapper, engine='zarr',
                               concat_dim=concat_dim, combine='nested',
                               coords='minimal', compat='override', parallel=True)

        # Fix dimension names
        ds = fix_accum_var_dims(ds, var)
        l.append(ds)

    ds_out = xr.merge(l)

    return ds_out
%%time
ds = open_era5_range(2018, 2020, ["sea_surface_temperature"])
ds
%%time
ds.sea_surface_temperature[0,:,:].plot()
ds.sea_surface_temperature.sel(lon=130,lat=20).plot()