Satellite data come from either polar or low-inclination orbits that circle the Earth, or geostationary orbits that hold a fixed position above a specific point on the Earth, near the Equator. The type of data from a satellite depends on both the orbit and the type of instrument. There are standard definitions for the different satellite data types.
Satellite data are distributed in several different formats: netCDF, HDF5, and GeoTIFF. There are also some newer cloud-optimized formats: Zarr and Cloud Optimized GeoTIFF (COG).
Data access can be easy and fast on the cloud, but to avoid potential issues down the road, spend time understanding the strengths and weaknesses of any dataset. Any science result will need to address uncertainties in the data, so it is easier to spend time at the beginning to understand them and avoid mis-applying the data.
These are BIG datasets that you can analyze on the cloud without downloading the data.
You can run this on your phone, a Raspberry Pi, laptop, or desktop.
By using public cloud data, your science is reproducible and easily shared!
The OHW Jupyter Hub is an m5.2xlarge instance shared by up to 4 users. What does that mean? An m5.2xlarge has 8 vCPUs and 32 GB of memory. If there are 3 other users on your node, the memory could be used up fast by large satellite datasets, so just be a little aware of that when you want to run something....
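For a sense of scale, here is a rough back-of-envelope memory estimate for the MUR SST data used below (the grid dimensions are approximate values for MUR's ~0.01-degree global grid):

```python
# Rough memory estimate for loading MUR SST timesteps (grid size approximate)
nlat, nlon = 17999, 36000    # ~0.01-degree global grid
bytes_per_value = 4          # float32
one_timestep_gb = nlat * nlon * bytes_per_value / 1e9
print(f"one timestep ~ {one_timestep_gb:.1f} GB")        # ~ 2.6 GB
print(f"ten timesteps ~ {10 * one_timestep_gb:.0f} GB")  # close to the node's whole 32 GB
```

This is why the examples below always subset in time and space before calling .load().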
Code is in the cells that have In [ ]: to the left of the cell and have a colored background.
To run the code: hold Shift down and press Enter.
Remember:
Esc then b inserts a new cell below.
Esc then dd deletes the current cell.
We will start with my favorite Analysis Ready Data (ARD) format: Zarr. Data stored in Zarr is fast and simple to use, and it keeps all the metadata normally found in a netCDF file, so you can easily figure out what is in the datastore.
Fast - Zarr is fast because all the metadata is consolidated into a single JSON file. Opening a massive dataset is lightning fast because only the metadata is read; the data itself is not read until it is needed for a computation.
Simple - Filenames? Who needs them? Who cares? Not I. Simply point your read routine to the data directory.
Metadata - all you want!
# filter some warning messages
import warnings
warnings.filterwarnings("ignore")
#libraries
import datetime as dt
import xarray as xr
import fsspec
import s3fs
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
# make datasets display nicely
xr.set_options(display_style="html")
import dask
from dask.distributed import performance_report, Client, progress
#magic functions: embed static images of your plots in the notebook
%matplotlib inline
plt.rcParams['figure.figsize'] = 12, 6
%config InlineBackend.figure_format = 'retina'
fsspec.get_mapper creates a mapping between your compute environment and the s3 bucket. This isn't necessary if the Zarr store is stored locally.
xr.open_zarr reads a Zarr store into an xarray Dataset.
%%time
file_location = 's3://mur-sst/zarr'
ikey = fsspec.get_mapper(file_location, anon=True)
ds_sst = xr.open_zarr(ikey,consolidated=True)
ds_sst
Select the analysed_sst variable over a specific time period, lat, and lon, and load the data into memory. This subset is small enough to load into memory, which will make calculating climatologies easier in the next step.
%%time
sst_timeseries = ds_sst['analysed_sst'].sel(time = slice('2010-01-01','2020-01-01'),
                                            lat = 47,
                                            lon = -145
                                           ).load()
sst_timeseries.plot()
Use the .groupby method to calculate the climatology, then the .resample method to average the anomalies into 1-month bins.
%%time
sst_climatology = sst_timeseries.groupby('time.dayofyear').mean('time',keep_attrs=True,skipna=False)
sst_anomaly = sst_timeseries.groupby('time.dayofyear')-sst_climatology
sst_anomaly_monthly = sst_anomaly.resample(time='1MS').mean(keep_attrs=True,skipna=False)
#plot the data
sst_anomaly.plot()
sst_anomaly_monthly.plot()
plt.axhline(linewidth=2,color='k')
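The climatology/anomaly pattern used above can be sketched on synthetic data. This is a hypothetical two-year daily series in plain pandas (not the MUR data), where year two carries a constant +1.0 offset that the anomaly recovers:

```python
import numpy as np
import pandas as pd

# Two years of synthetic daily "SST": a seasonal cycle, plus +1.0 in year two
time = pd.date_range("2018-01-01", "2019-12-31", freq="D")
seasonal = 10 * np.sin(2 * np.pi * time.dayofyear / 365.25)
sst = pd.Series(seasonal + np.where(time.year == 2019, 1.0, 0.0), index=time)

# Day-of-year climatology, then subtract it to get anomalies
climatology = sst.groupby(sst.index.dayofyear).mean()
anomaly = sst - climatology.reindex(sst.index.dayofyear).to_numpy()

# The two-year mean splits the +1.0 offset: year two sits +0.5 above it
print(anomaly.loc["2019"].mean().round(2))  # 0.5
```

xarray's groupby('time.dayofyear') does the same alignment for you across the extra lat/lon dimensions.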
# Repeat the timeseries at a high-latitude point (72N, 171W)
sst_timeseries = ds_sst['analysed_sst'].sel(time = slice('2010-01-01','2020-01-01'),
                                            lat = 72,
                                            lon = -171
                                           ).load()
sst_timeseries.plot()
Many L4 SST analyses blend infrared (~1 - 4 km) data with passive microwave (~50 km) data, so data availability will determine regional and temporal changes in spatial resolution.
Many L4 SST analyses also apply smoothing filters that may further reduce resolution.
%%time
subset = ds_sst['analysed_sst'].sel(time='2019-06-01',lat=slice(35,40),lon=slice(-126,-120))
subset.plot(vmin=282,vmax=289,cmap='inferno')
%%time
subset = ds_sst['analysed_sst'].sel(time='2019-05-15',lat=slice(35,40),lon=slice(-126,-120))
subset.plot(vmin=282,vmax=289,cmap='inferno')
Find publicly available data on AWS.
Which brings us to GOES SSTs.....
Info on the data is here: GOES has a lot of different parameters, and they are all stored in different files with names that are difficult to understand. There are 80 different data products. This link lists them all and explains the different directory names: AWS info on GOES SST data.
The GOES data are in netCDF format. There is a different file for each of the 80 products for each year/day/hour. A lot of files. I find it really useful to 'browse' s3: buckets so that I can understand the directory and data structure. Explore S3 structure. The directory structure is <Product>/<Year>/<Day of Year>/<Hour>/<Filename>
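Because every object key follows that <Product>/<Year>/<Day of Year>/<Hour>/<Filename> pattern, you can recover the date and hour from a key with plain string handling. A small sketch (the key below is a made-up example following the documented pattern, not a real file):

```python
import datetime as dt

# Hypothetical key following the documented GOES layout:
# <Product>/<Year>/<Day of Year>/<Hour>/<Filename>
key = "ABI-L2-SSTF/2020/210/14/OR_ABI-L2-SSTF-M6_G16_example.nc"

product, year, doy, hour, filename = key.split("/")

# Day-of-year 1 is January 1, hence the -1
when = dt.datetime(int(year), 1, 1) + dt.timedelta(days=int(doy) - 1, hours=int(hour))
print(product, when)  # ABI-L2-SSTF 2020-07-28 14:00:00
```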
In the code below we create a function that searches for all files from a certain day, then creates the keys for opening them. The files are messy enough that opening a day takes a little while. There are other ways you could write this function depending on your analysis goals; this is just one way to get some data in a reasonable amount of time.
This function uses:
s3fs.S3FileSystem, which holds a connection with an s3 bucket and allows you to list files, etc.
xr.open_mfdataset, which opens a list of file objects and concatenates the data along the specified dimensions.
def get_geo_data(sat,lyr,idyjl):
    # arguments
    # sat    goes-east, goes-west, himawari
    # lyr    year
    # idyjl  day of year
    d = dt.datetime(lyr,1,1) + dt.timedelta(days=idyjl)

    fs = s3fs.S3FileSystem(anon=True) #connect to s3 bucket!

    #create strings for the year and julian day
    imon,idym = d.month,d.day
    syr,sjdy,smon,sdym = str(lyr).zfill(4),str(idyjl).zfill(3),str(imon).zfill(2),str(idym).zfill(2)

    #use glob to list all the files in the directory
    file_location,var = [],None
    if sat=='goes-east':
        file_location,var = fs.glob('s3://noaa-goes16/ABI-L2-SSTF/'+syr+'/'+sjdy+'/*/*.nc'),'SST'
    if sat=='goes-west':
        file_location,var = fs.glob('s3://noaa-goes17/ABI-L2-SSTF/'+syr+'/'+sjdy+'/*/*.nc'),'SST'
    if sat=='himawari':
        file_location,var = fs.glob('s3://noaa-himawari8/AHI-L2-FLDK-SST/'+syr+'/'+smon+'/'+sdym+'/*/*L2P*.nc'),'sea_surface_temperature'

    if len(file_location)<1:
        return None  #no files found for this satellite/day

    #make a list of links to the file keys
    file_ob = [fs.open(file) for file in file_location] #open connection to files

    #open all the day's data
    ds = xr.open_mfdataset(file_ob,combine='nested',concat_dim='time') #note the file formatting is messy

    #clean up coordinates which are a MESS in GOES
    #rename one of the coordinates that doesn't match a dim & should
    if not sat=='himawari':
        ds = ds.rename({'t':'time'})
        ds = ds.reset_coords()
    else:
        ds = ds.rename({'ni':'x','nj':'y'})

    #put in to Celsius
    #ds[var] -= 273.15 #nice python shortcut: a -= 273.15 is the same as a = a - 273.15
    #ds[var].attrs['units'] = '$^\circ$C'

    return ds
%%time
lyr, idyjl = 2020, 210 #late July 2020
ds = dask.delayed(get_geo_data)('goes-east',lyr,idyjl)
ds
%%time
subset = ds.sel(x=slice(-0.01,0.07215601),y=slice(0.12,0.09)) #reduce to GS region
masked = subset.SST.where(subset.DQF==0).compute()
masked.isel(time=14).plot(vmin=14+273.15,vmax=30+273.15,cmap='inferno')
%%time
mean_dy = masked.mean('time',skipna=True).compute() #here I want all possible values so skipna=True
mean_dy.plot(vmin=14+273.15,vmax=30+273.15,cmap='inferno')
The data above are on the GOES grid, which isn't very helpful if you want to make a subset by a known lat/lon or to plot borders with Cartopy.
The function below takes a GOES-16/17 dataset and returns the same dataset with latitude (lat) and longitude (lon) added as 2D coordinate arrays. The math to do this was taken from https://makersportal.com/blog/2018/11/25/goes-r-satellite-latitude-and-longitude-grid-projection-algorithm
def calc_latlon(ds):
    # The math for this function was taken from
    # https://makersportal.com/blog/2018/11/25/goes-r-satellite-latitude-and-longitude-grid-projection-algorithm
    x = ds.x
    y = ds.y
    goes_imager_projection = ds.goes_imager_projection

    x,y = np.meshgrid(x,y)

    r_eq = goes_imager_projection.attrs["semi_major_axis"]
    r_pol = goes_imager_projection.attrs["semi_minor_axis"]
    l_0 = goes_imager_projection.attrs["longitude_of_projection_origin"] * (np.pi/180)
    h_sat = goes_imager_projection.attrs["perspective_point_height"]
    H = r_eq + h_sat

    a = np.sin(x)**2 + (np.cos(x)**2 * (np.cos(y)**2 + (r_eq**2 / r_pol**2) * np.sin(y)**2))
    b = -2 * H * np.cos(x) * np.cos(y)
    c = H**2 - r_eq**2

    r_s = (-b - np.sqrt(b**2 - 4*a*c))/(2*a)

    s_x = r_s * np.cos(x) * np.cos(y)
    s_y = -r_s * np.sin(x)
    s_z = r_s * np.cos(x) * np.sin(y)

    lat = np.arctan((r_eq**2 / r_pol**2) * (s_z / np.sqrt((H-s_x)**2 + s_y**2))) * (180/np.pi)
    lon = (l_0 - np.arctan(s_y / (H-s_x))) * (180/np.pi)

    ds = ds.assign_coords({
        "lat":(["y","x"],lat),
        "lon":(["y","x"],lon)
    })
    ds.lat.attrs["units"] = "degrees_north"
    ds.lon.attrs["units"] = "degrees_east"

    return ds
ds_latlon = calc_latlon(ds.compute())
ds_latlon
We could use .where() to select data, e.g.
subset = ds.where((ds.lon >= lon1) & (ds.lon <= lon2) & (ds.lat >= lat1) & (ds.lat <= lat2), drop=True)
Instead, we can create a function that returns the x/y min and max, so we can continue to use .sel() to quickly select data.
Note: the y slice must be done as slice(y2,y1), since the y coordinate is decreasing, so the larger value must come first.
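This larger-value-first rule comes from label-based slicing on a monotonically decreasing index, which xarray delegates to pandas. A minimal sketch in plain pandas:

```python
import pandas as pd

# A series indexed by a decreasing y coordinate, like the GOES grid
s = pd.Series([10, 20, 30], index=[0.12, 0.10, 0.08])

# With a decreasing index, the slice bounds must also decrease: larger value first
print(s.loc[0.12:0.08].tolist())  # [10, 20, 30]
```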
def get_xy_from_latlon(ds, lats, lons):
    lat1, lat2 = lats
    lon1, lon2 = lons

    lat = ds.lat.data
    lon = ds.lon.data

    x = ds.x.data
    y = ds.y.data

    x,y = np.meshgrid(x,y)

    x = x[(lat >= lat1) & (lat <= lat2) & (lon >= lon1) & (lon <= lon2)]
    y = y[(lat >= lat1) & (lat <= lat2) & (lon >= lon1) & (lon <= lon2)]

    return ((min(x), max(x)), (min(y), max(y)))
Now that we have x and y, we can subset in the same way as above:
lats = (31, 53)
lons = (-85, -56)
(x1,x2), (y1,y2) = get_xy_from_latlon(ds_latlon, lats, lons)
subset = ds_latlon.sel(x=slice(x1,x2), y=slice(y2,y1))
subset
%%time
import cartopy.crs as ccrs
import cartopy.feature as cfeature
masked = subset.where(subset.DQF==0).mean("time", skipna=True)
# For a faster-running example, plot a single time instead of taking a mean
# masked = subset.isel(time=14).where(subset.DQF==0)
fig = plt.figure()
ax = fig.add_subplot(111, projection=ccrs.PlateCarree())
masked.SST.plot(x="lon", y="lat", vmin=14+273.15,vmax=30+273.15,cmap='inferno', ax=ax)
ax.add_feature(cfeature.COASTLINE)
ax.add_feature(cfeature.STATES)
%%time
lyr, idyjl = 2020, 212
ds = dask.delayed(get_geo_data)('himawari',lyr,idyjl)
subset = ds.sel(x=slice(-0.05,0.08),y=slice(0.12,0.08))
masked = subset.sea_surface_temperature.where(subset.quality_level>=4)
mean_dy = masked.mean('time',skipna=True).compute() #here I want all possible values so skipna=True
mean_dy.plot(vmin=14+273.15,vmax=30+273.15,cmap='inferno')
masked.visualize()
The ERA5 data is organized into NetCDF files per variable, each containing a month of hourly data.
The directory structure is /{year}/{month}/main.nc for all the variables, or /{year}/{month}/data/{var}.nc for just a single variable. (The same data are also mirrored in Zarr format under /zarr/{year}/{month}/data/{var}.zarr, which is the layout the code below uses.)
For example, the full file path for sea surface temperature for January 2008 is:
/2008/01/data/sea_surface_temperature.nc
Note that due to the nature of the ERA5 forecast timing, which is run twice daily at 06:00 and 18:00 UTC, the monthly data file begins with data from 07:00 on the first of the month and continues through 06:00 of the following month. We'll see this in the coordinate values of a data file we download later in the notebook.
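For January 2008, that 07:00-to-06:00 window means the file should cover 2008-01-01 07:00 through 2008-02-01 06:00, i.e. 31 days of hourly steps. A quick sketch of the expected time coordinate:

```python
import pandas as pd

# Expected hourly time coordinate for the January 2008 ERA5 file,
# given the 07:00-to-06:00 monthly window described above
times = pd.date_range("2008-01-01 07:00", "2008-02-01 06:00", freq="h")
print(len(times), times[0], times[-1])
# 744 2008-01-01 07:00:00 2008-02-01 06:00:00
```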
Granule variable structure and metadata attributes are stored in main.nc. This file contains coordinate and auxiliary variable data. This file is also annotated using NetCDF CF metadata conventions.
# adapted from https://nbviewer.jupyter.org/github/awslabs/amazon-asdi/blob/main/examples/dask/notebooks/era5_zarr.ipynb
# by Zac Flamig
def fix_accum_var_dims(ds, var):
    # Some variables like precip have extra time bounds variables; we drop them here to allow merging with other variables
    # Select variable of interest (drops dims that are not linked to current variable)
    ds = ds[[var]]

    if var in ['air_temperature_at_2_metres',
               'dew_point_temperature_at_2_metres',
               'air_pressure_at_mean_sea_level',
               'northward_wind_at_10_metres',
               'eastward_wind_at_10_metres']:
        ds = ds.rename({'time0':'valid_time_end_utc'})
    elif var in ['precipitation_amount_1hour_Accumulation',
                 'integral_wrt_time_of_surface_direct_downwelling_shortwave_flux_in_air_1hour_Accumulation']:
        ds = ds.rename({'time1':'valid_time_end_utc'})
    else:
        print("Warning: haven't seen the {var} variable yet! Time renaming might not work.".format(var=var))

    return ds
@dask.delayed
def s3open(path):
    fs = s3fs.S3FileSystem(anon=True, default_fill_cache=False,
                           config_kwargs={'max_pool_connections': 20})
    return s3fs.S3Map(path, s3=fs)
def open_era5_range(start_year, end_year, variables):
    '''Opens ERA5 monthly Zarr stores on S3, given a start and end year (all months loaded) and a list of variables'''
    file_pattern = 'era5-pds/zarr/{year}/{month}/data/{var}.zarr/'

    years = list(np.arange(start_year, end_year+1, 1))
    months = ["01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12"]

    l = []
    for var in variables:
        print(var)

        # Get files
        files_mapper = [s3open(file_pattern.format(year=year, month=month, var=var)) for year in years for month in months]

        # Look up correct time dimension by variable name
        if var in ['precipitation_amount_1hour_Accumulation']:
            concat_dim = 'time1'
        else:
            concat_dim = 'time0'

        # Lazy load
        ds = xr.open_mfdataset(files_mapper, engine='zarr',
                               concat_dim=concat_dim, combine='nested',
                               coords='minimal', compat='override', parallel=True)

        # Fix dimension names
        ds = fix_accum_var_dims(ds, var)
        l.append(ds)

    ds_out = xr.merge(l)

    return ds_out
%%time
ds = open_era5_range(2018, 2020, ["sea_surface_temperature"])
ds
%%time
ds.sea_surface_temperature[0,:,:].plot()
ds.sea_surface_temperature.sel(lon=130,lat=20).plot()