%matplotlib inline
import glob
import pandas as pd
import xarray as xr
We have a collection of remote sensing data stored as GeoTIFF files, one file per time step, all on the same grid. We want to convert them to a single NetCDF file, rechunking the data with a time chunk size greater than 1 so that extracting time series at specific locations is fast.
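A little arithmetic shows why the time chunking matters. If each chunk holds a single time step covering the full 5490 x 5490 plane (one chunk per GeoTIFF), extracting one pixel's time series forces every whole plane to be read. This sketch uses this dataset's dimensions; the helper function is illustrative, not part of any library:

```python
import math

nt, ny, nx = 30, 5490, 5490   # stack shape (time, y, x); uint16 = 2 bytes/value

def bytes_read_for_point_series(chunk, itemsize=2):
    """Bytes pulled from disk to extract one full-length pixel time series,
    assuming whole chunks must always be read."""
    ct, cy, cx = chunk
    n_chunks = math.ceil(nt / ct)          # chunks stacked along time
    return n_chunks * ct * cy * cx * itemsize

# one chunk per time step: every full 5490 x 5490 plane is read (~1.8 GB)
print(bytes_read_for_point_series((1, ny, nx)))

# rechunked to (10, 100, 100): only three small bricks are read (600 kB)
print(bytes_read_for_point_series((10, 100, 100)))
```

The roughly 3000x reduction in bytes read is the whole motivation for rechunking before writing the NetCDF file.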
filenames = glob.glob('/sand/usgs/users/rsignell/data/josef/20NRKsS1-EBD/20NRKsS1_A_vh_mtfil/*.tif')
print(len(filenames))
filenames[0:2]
30
['/sand/usgs/users/rsignell/data/josef/20NRKsS1-EBD/20NRKsS1_A_vh_mtfil/20NRKsS1_A_vh_20171006_0164_B_mtfil_amp.tif', '/sand/usgs/users/rsignell/data/josef/20NRKsS1-EBD/20NRKsS1_A_vh_mtfil/20NRKsS1_A_vh_20170924_0164_B_mtfil_amp.tif']
Make a function to create timestamps from file names
def time_index_from_filenames(filenames):
    '''Helper function to create a pandas DatetimeIndex.
    Filename example: 20NRKsS1_A_vh_20171006_0164_B_mtfil_amp.tif'''
    return pd.DatetimeIndex([pd.Timestamp(f.split('/')[-1][14:22]) for f in filenames])
time = xr.Variable('time', time_index_from_filenames(filenames))
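As a quick sanity check, the fixed-offset slice [14:22] pulls the YYYYMMDD date field out of the example name from the docstring:

```python
from datetime import datetime

f = ('/sand/usgs/users/rsignell/data/josef/20NRKsS1-EBD/20NRKsS1_A_vh_mtfil/'
     '20NRKsS1_A_vh_20171006_0164_B_mtfil_amp.tif')

# basename, then characters 14:22 hold the date field
date_str = f.split('/')[-1][14:22]
print(date_str)                               # 20171006
print(datetime.strptime(date_str, '%Y%m%d')) # 2017-10-06 00:00:00
```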
Set up reading of all the GeoTIFF images. Because chunks are specified, Dask arrays are used under the hood, so no data is actually read until it is needed.
chunks = {'x': 5490, 'y': 5490, 'band': 1}
da = xr.concat([xr.open_rasterio(f, chunks=chunks).squeeze() for f in filenames], dim=time)
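The concat-over-a-new-dimension pattern used above, sketched with small in-memory arrays (shapes, values, and dates here are made up):

```python
import numpy as np
import pandas as pd
import xarray as xr

# one 2-D "plane" per input file
planes = [xr.DataArray(np.full((3, 3), i, dtype='uint16'), dims=('y', 'x'))
          for i in range(2)]

# a time Variable supplies both the new dimension name and its coordinates
time = xr.Variable('time', pd.to_datetime(['2017-09-24', '2017-10-06']))
stack = xr.concat(planes, dim=time)
print(stack.dims)   # ('time', 'y', 'x')
```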
Provide a name for the data array variable
da.name = 'sar'
We don't need the singleton dimension called 'band', so drop it.
da = da.drop(labels='band')
Rechunk the Dask array
da = da.chunk(chunks={'time': 10, 'x': 100, 'y': 100})
Let's see what we have now
da
<xarray.DataArray 'sar' (time: 30, y: 5490, x: 5490)> dask.array<rechunk-merge, shape=(30, 5490, 5490), dtype=uint16, chunksize=(10, 100, 100)> Coordinates: * y (y) float64 5e+05 5e+05 5e+05 5e+05 5e+05 4.999e+05 4.999e+05 ... * x (x) float64 8e+05 8e+05 8e+05 8e+05 8.001e+05 8.001e+05 ... * time (time) datetime64[ns] 2017-10-06 2017-09-24 2017-09-12 ... Attributes: crs: +init=epsg:32620
Write the NetCDF file, specifying the same chunking. This is where the data finally get read!
encoding = {'sar': {'chunksizes': (10, 100, 100)}}
da.to_netcdf('agg_chunked.nc', encoding=encoding, format='NETCDF4', engine='netcdf4', mode='w')
Now open the NetCDF file we just wrote
ds = xr.open_dataset('agg_chunked.nc')
ds
<xarray.Dataset> Dimensions: (time: 30, x: 5490, y: 5490) Coordinates: * y (y) float64 5e+05 5e+05 5e+05 5e+05 5e+05 4.999e+05 4.999e+05 ... * x (x) float64 8e+05 8e+05 8e+05 8e+05 8.001e+05 8.001e+05 ... * time (time) datetime64[ns] 2017-10-06 2017-09-24 2017-09-12 ... Data variables: sar (time, y, x) uint16 2388 2692 2847 2540 2009 1754 1795 1930 ...
Extract data at a specific time
dst = ds.sel(time='2017-09-24').squeeze()
Plot it up
dst['sar'].plot.imshow(vmin=0, vmax=4000);
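The payoff of the time-chunked file is fast point extraction: a single nearest-neighbour .sel call pulls the full time series at one location, touching only a few small chunks on disk. A minimal sketch on synthetic in-memory data (coordinates and values here are made up, standing in for the SAR stack):

```python
import numpy as np
import pandas as pd
import xarray as xr

# tiny synthetic stand-in for the (time, y, x) SAR stack
da = xr.DataArray(
    np.arange(2 * 4 * 4, dtype='uint16').reshape(2, 4, 4),
    dims=('time', 'y', 'x'),
    coords={'time': pd.to_datetime(['2017-09-24', '2017-10-06']),
            'y': np.linspace(500000.0, 499700.0, 4),
            'x': np.linspace(800000.0, 800300.0, 4)},
    name='sar',
)

# nearest-neighbour lookup of the full time series at one (x, y) location
ts = da.sel(x=800120.0, y=499930.0, method='nearest')
print(ts.values)
```

Against the real agg_chunked.nc, the same call on ds['sar'] with UTM easting/northing values would return the 30-step series at that pixel.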