%matplotlib inline
import pandas as pd, matplotlib.pyplot as plt, sqlite3, mpld3
import seaborn as sn, numpy as np, datetime as dt, os, time
from netCDF4 import Dataset
from dateutil import relativedelta
sn.set_context('notebook')
Jannicke would like to extract time series for ~1200 sites from a large collection of netCDF files. The files are located on the network here:
K:\Avdeling\214-Oseanografi\ABL\MARS_EU_PROJECT\Climate
but I've made a local copy as well because working over the network will be very slow.
The netCDF files come from two climate models (`GFDL-ESM2M` and `IPSL-CM5A-LR`) and each has been run for two emissions scenarios (`rcp4p5` and `rcp8p5`). The models produce a variety of output, but Jannicke is only interested in four of the variables: `pr`, `prsn`, `tas` and `wind`. There are also four time periods of interest:
Each netCDF file contains daily resolution output for one year for one variable under a particular model-scenario combination. The time stamps in each file correspond to days since 01/01/1860.
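The conversion from these day offsets to calendar dates can be sketched with `dateutil`, as used later in this notebook (the offsets passed in below are illustrative, not values read from a real file):

```python
import datetime as dt
from dateutil import relativedelta

# All netCDF time stamps are day counts relative to this base date
base_date = dt.date(1860, 1, 1)

def offset_to_date(days_since_base):
    """Convert a 'days since 01/01/1860' offset to a calendar date."""
    return base_date + relativedelta.relativedelta(days=days_since_base)
```

Note that 1860 is a leap year, so an offset of 366 (not 365) lands on 01/01/1861.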
Jannicke also has a list of ~1200 stations for which she would like to extract data. The raw stations file is here:
K:\Prosjekter\MARS EU project\WP 5 EU scale modelling\Task 5.3\ClimateScenarios\t210_ClimScen_formatted_JMO_20160616.txt
but I've tidied it slightly to create jannicke_sites_tidied.xlsx.
We'll start by reading the station data and defining a function to get the netCDF array indices for each site based on latitude and longitude co-ordinates.
def geo_idx(dd, dd_array):
    """ Get array indices for specified lat/long co-ordinate.
        Adapted from here:
        http://stackoverflow.com/questions/33789379/netcdf-and-python-finding-the-closest-lon-lat-index-given-actual-lon-lat-values

    Args:
        dd        Lat or long value in decimal degrees
        dd_array  Corresponding array of lat or long values from netCDF
    """
    geo_idx = (np.abs(dd_array - dd)).argmin()

    return geo_idx
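As a quick sanity check of the nearest-neighbour lookup, here is `geo_idx` applied to a made-up 0.5-degree latitude grid (the real grids are read from the netCDF `lat` and `lon` variables below):

```python
import numpy as np

def geo_idx(dd, dd_array):
    """Index of the grid value closest to dd (decimal degrees)."""
    return (np.abs(dd_array - dd)).argmin()

# Illustrative 0.5-degree global latitude grid (cell centres);
# the real grid comes from the netCDF file itself
lats = np.arange(-89.75, 90, 0.5)

idx = geo_idx(55.25, lats)  # 55.25 is an exact cell centre
```

A co-ordinate that falls between cell centres (e.g. 55.3) snaps to the nearest one, which is exactly the behaviour we want when matching Jannicke's sites to grid cells.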
Added 17/10/2016: When re-running this code for a different time period (see the end of this notebook) I noticed that Jannicke's sites list contains duplicates. Jannicke would like me to remove these, by extracting one time series for each unique grid cell in the list (see e-mail received 17/10/2016 at 13:21).
The code below has therefore been modified so that it now generates a new "site code" by concatenating the lat and lon co-ordinates, before dropping any duplicates. This leaves just 441 unique locations to consider.
# Read station locations
stn_xls = (r'C:\Data\James_Work\Staff\Jannicke_M\MARS\NetCDF_Processing\Data\jannicke_sites_tidied.xlsx')
stn_df = pd.read_excel(stn_xls, sheetname='Sheet1')
# Open one of the netCDF files
nc_file = (r'C:\Data\James_Work\Staff\Jannicke_M\MARS\NetCDF_Processing\Data'
r'\GFDL-ESM2M\rcp4p5\pr\pr_bced_1960_1999_GFDL-ESM2M_rcp4p5_2006.nc')
nc_data = Dataset(nc_file, 'r')
# Read lat and long grids from netCDF
nc_lats = nc_data.variables['lat'][:]
nc_lons = nc_data.variables['lon'][:]
# Get grid indices for each site
stn_df['lat_idx'] = stn_df['lat'].apply(geo_idx, args=(nc_lats,))
stn_df['lon_idx'] = stn_df['lon'].apply(geo_idx, args=(nc_lons,))
# Check for duplicates
print 'Number of records in dataframe:', len(stn_df)
print 'Number of duplicated site codes:', stn_df.duplicated(subset='site').sum()
print 'Number of duplicated co-ordinates:', stn_df.duplicated(subset=['lat', 'lon']).sum()
# Recalculate site column
stn_df['site'] = stn_df['lat'].map(str) + '_' + stn_df['lon'].map(str)
stn_df.drop_duplicates(inplace=True)
print 'Number of unique locations:', len(stn_df)
stn_df.head()
Number of records in dataframe: 1204
Number of duplicated site codes: 25
Number of duplicated co-ordinates: 763
Number of unique locations: 441
| | site | lat | lon | lat_idx | lon_idx |
|---|---|---|---|---|---|
| 0 | 62.75_28.75 | 62.75 | 28.75 | 54 | 77 |
| 1 | 61.25_27.25 | 61.25 | 27.25 | 57 | 74 |
| 2 | 61.25_27.75 | 61.25 | 27.75 | 57 | 75 |
| 3 | 61.25_28.25 | 61.25 | 28.25 | 57 | 76 |
| 4 | 61.75_27.75 | 61.75 | 27.75 | 56 | 75 |
We're going to be reading a lot of data: we have 1204 sites, times 365 days, times 40 years, times 4 climate variables, times two emissions scenarios, times two climate models.
$$ 1204 * 365 * 40 * 4 * 2 * 2 = 281,254,400 $$

We are therefore expecting to generate more than 280 million records. As a minimum, each record will have 6 pieces of information: the site, model, scenario, variable, date and value. If we store this directly as text allowing, on average, 15 characters for the site, 10 for the model, 5 for the scenario, 5 for the variable, 10 for the date and 10 for the value, we can expect each record to occupy something like 55 bytes. The total file size for the simplest possible text file would therefore be

$$ 55 * 281,254,400 = 15,468,992,000 \approx 15.5 \; GB $$

This file will be difficult to work with and it's also a rather inefficient way to store the data. However, Jannicke would prefer the file to be structured so that the four variables are present as columns, rather than in "long" format (i.e. with a single `value` column). In addition to being easier to read, this will save a bit of storage space, because the total number of records will be reduced by a factor of four, but each record will have four "value" columns instead of one. Each "wide" record then takes roughly $15 + 10 + 5 + 10 + 4 * 10 = 80$ bytes, so a plain text file in this format should therefore occupy roughly

$$ 80 * \frac{281,254,400}{4} = 5,625,088,000 \approx 5.6 \; GB $$
This is still a very large text file. A better option would be to load everything into a database like Access or, even better, SQLite. The database can then be queried to extract the desired information. The simplest approach to building the database is to use a single table with "raw" data types for each of the columns (i.e. string
for site, model, scenario and variable; timestamp
for date and float
for value). This is inefficient and will likely lead to a very large database file, but it'll still be better than a raw text file.
The best approach is to build a proper relational database with integers for the site, model, scenario and variable components. This will be much more efficient, but perhaps also more difficult for Jannicke to work with?
For now, I'll load everything into SQLite as a single, large table. This seems like a reasonable compromise between an unwieldy text file and a full relational database.
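The single-table approach can be sketched in a few lines using an in-memory database and one made-up record (the real database is the `climate_data_2006_2099.sqlite` file built below):

```python
import sqlite3
import pandas as pd

# In-memory stand-in for the real SQLite file; values are made up
con = sqlite3.connect(':memory:')
df = pd.DataFrame({'model': ['GFDL-ESM2M'],
                   'scen':  ['rcp4p5'],
                   'site':  ['55.25_-4.75'],
                   'date':  ['2006-01-31'],
                   'pr':    [138.6], 'prsn': [55.7],
                   'tas':   [276.5], 'wind': [8.9]})

# Append to a single 'all_data' table, with "raw" column types
df.to_sql('all_data', con=con, if_exists='append', index=False)

# The table can then be queried back with ordinary SQL
res = pd.read_sql_query("SELECT model, pr FROM all_data", con)
con.close()
```

Because `if_exists='append'` creates the table on first use and appends thereafter, the same call can sit inside the processing loop without any separate schema-creation step.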
def read_nc_data(clim_var, model, scen, year, lat_idx, lon_idx):
    """ Reads data from the specified netCDF file.

    Returns a dataframe.
    """
    # Open file
    nc_path = os.path.join(data_fold, model, scen, clim_var,
                           '%s_bced_1960_1999_%s_%s_%s.nc' % (clim_var,
                                                              model,
                                                              scen,
                                                              year))
    nc_data = Dataset(nc_path, 'r')

    # Get time period
    st_day = int(nc_data.variables['time'][0])
    st_date = base_date + relativedelta.relativedelta(days=st_day)  # The date for 1st record

    # Generate a series of dates for this file
    rec_len = nc_data.variables[clim_var].shape[0]
    dates = pd.date_range(start=st_date,
                          periods=rec_len,
                          freq='D')  # Daily

    # Extract data for site
    values = nc_data.variables[clim_var][:, lat_idx, lon_idx]

    # Build dataframe
    df = pd.DataFrame({clim_var: values}, index=dates)

    # Resample to monthly
    if clim_var in ['pr', 'prsn']:
        # Get the sum, then convert units from kg/m2/s to mm/month
        df = df.resample('M').sum()
        df[clim_var] = df[clim_var] * 60*60*24
    else:
        df = df.resample('M').mean()

    # Close nc file
    nc_data.close()

    return df
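The monthly resampling and unit handling in `read_nc_data` can be checked in isolation. The sketch below uses made-up constant daily values: precipitation-type variables are summed and converted via 86,400 seconds/day (giving mm per month, since the raw units of kg/m2/s are equivalent to mm/s), while the other variables are simply averaged:

```python
import numpy as np
import pandas as pd

# Two months of made-up daily data: constant precipitation and temperature
dates = pd.date_range('2006-01-01', periods=59, freq='D')  # Jan + Feb 2006
df = pd.DataFrame({'pr':  np.full(59, 1e-5),   # kg/m2/s
                   'tas': np.full(59, 276.0)},  # K
                  index=dates)

# Sum precipitation to monthly totals, then convert to mm/month
pr_month = df[['pr']].resample('M').sum() * 60 * 60 * 24

# Average temperature over each month
tas_month = df[['tas']].resample('M').mean()
```

For January this gives $31 * 10^{-5} * 86,400 = 26.784$ mm, while the monthly mean temperature is unchanged at 276 K, as expected for constant input.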
st_time = time.time()

# Define paths to data
data_fold = r'C:\Data\James_Work\Staff\Jannicke_M\MARS\NetCDF_Processing\Data'

# Create database and connect
db_path = (r'C:\Data\James_Work\Staff\Jannicke_M\MARS\NetCDF_Processing\Data'
           r'\climate_data_2006_2099.sqlite')
con = sqlite3.connect(db_path)

# Define models, scenarios, variables and time periods of interest
clim_models = ['GFDL-ESM2M', 'IPSL-CM5A-LR']
scens = ['rcp4p5', 'rcp8p5']
clim_vars = ['pr', 'prsn', 'tas', 'wind']
periods = [[2006, 2099],]

# All dates are expressed as days since 01/01/1860
base_date = dt.date(1860, 1, 1)

# Loop over data
for model in clim_models:
    print model
    for scen in scens:
        print '    ', scen
        for period in periods:
            print '        ', period
            # Get years for this period
            for year in range(period[0], period[1]+1):
                # Loop over sites
                for row in stn_df.itertuples():
                    # Loop over clim_vars
                    df_list = []
                    for clim_var in clim_vars:
                        df_list.append(read_nc_data(clim_var,
                                                    model,
                                                    scen,
                                                    year,
                                                    row.lat_idx,
                                                    row.lon_idx))

                    # Concat dfs
                    df = pd.concat(df_list, axis=1, join='outer')
                    df.reset_index(inplace=True)
                    df['date'] = df['index']
                    del df['index']

                    # Add columns for site, model and scen
                    df['site'] = row.site
                    df['model'] = model
                    df['scen'] = scen

                    # Reorder columns
                    df = df[['model', 'scen', 'site', 'date'] + clim_vars]

                    # Write to database
                    df.to_sql('all_data', con=con, if_exists='append', index=False)

con.close()
end_time = time.time()
print 'Finished. Runtime: %.2f hours.' % ((end_time - st_time) / (60. * 60.))
GFDL-ESM2M
     rcp4p5
         [2006, 2099]
     rcp8p5
         [2006, 2099]
IPSL-CM5A-LR
     rcp4p5
         [2006, 2099]
     rcp8p5
         [2006, 2099]
Finished. Runtime: 13.46 hours.
A trial of the code above running for just two years (instead of the full 40) takes just over an hour on my computer and produces a database with a file size of 0.32 GB. This implies that processing the full dataset should take about 22 hours, and create about 6.5 GB of data (as expected from the rough calculations above). This could be optimised a lot, both in terms of file size and speed, but it's probably not worth the effort at this stage. For now, I'd just like to check that the test output over two years looks reasonable, before running the main analysis.
As a test, let's choose the last site in Jannicke's dataset and see what the values look like. To make things easier to visualise, we'll resample the series from daily to monthly by summing the `pr` and `prsn` values in each month and averaging those for `tas` and `wind`.
# Find the site code for the last site
stn_df.iloc[-1]
site       55.25_-4.75
lat              55.25
lon              -4.75
lat_idx             69
lon_idx             10
Name: 1199, dtype: object
# Connect to db
con = sqlite3.connect(db_path)
# Get results for this site
sql = ('SELECT * FROM all_data '
'WHERE site = "55.25_-4.75"')
df = pd.read_sql_query(sql, con)
con.close()
# Tidy up
df['date'] = pd.to_datetime(df['date'])
df = df[['model', 'scen', 'site', 'date']+clim_vars]
df.head()
| | model | scen | site | date | pr | prsn | tas | wind |
|---|---|---|---|---|---|---|---|---|
| 0 | GFDL-ESM2M | rcp4p5 | 55.25_-4.75 | 2006-01-31 | 138.582535 | 55.705231 | 276.454407 | 8.947670 |
| 1 | GFDL-ESM2M | rcp4p5 | 55.25_-4.75 | 2006-02-28 | 125.084084 | 39.165077 | 276.629700 | 7.894089 |
| 2 | GFDL-ESM2M | rcp4p5 | 55.25_-4.75 | 2006-03-31 | 169.586319 | 57.175537 | 277.812744 | 8.662620 |
| 3 | GFDL-ESM2M | rcp4p5 | 55.25_-4.75 | 2006-04-30 | 66.351212 | 5.325392 | 281.408051 | 4.199535 |
| 4 | GFDL-ESM2M | rcp4p5 | 55.25_-4.75 | 2006-05-31 | 53.561909 | 0.000000 | 281.860382 | 5.012456 |
And now plot the data for the various model, scenario and variable combinations.
def dateplot(x, y, **kwargs):
    """ Function for mapping Pandas time series to Seaborn
        FacetGrid. See lower part of this page:
        https://stanford.edu/~mwaskom/software/seaborn/generated/seaborn.FacetGrid.html
    """
    ax = plt.gca()
    data = kwargs.pop("data")
    data.plot(x=x, y=y, ax=ax, **kwargs)
# Restructure data
df = pd.melt(df, id_vars=['model', 'scen', 'site', 'date'], var_name='var')
# Plot
g = sn.FacetGrid(df, row='model', col='var', hue='scen', sharey=False)
g.map_dataframe(dateplot, 'date', 'value')
plt.tight_layout()
mpld3.display()
The plot above is interactive - hovering the mouse over it will show pan and zoom tools in the bottom left-hand corner.
The blue and green lines on each plot show the curves for the two emissions scenarios.
This all seems to be working OK. I just need to find time to let my laptop run for ~22 hours undisturbed, which probably won't be until this weekend.
June 2016. Jannicke only needs monthly data, so the code now resamples each series to monthly. This reduces the total data volume by a factor of ~12.
October 2016. Jannicke would like the code re-running for the entire period from 2006 to 2099. Also, I've noticed that Jannicke's stations list includes some duplicated site codes, so some results may end up in the output more than once. Update 17/10/2016: I've corrected this - see additions to the processing of the stations table, described above.