Notebook

In [1]:

%matplotlib inline

import datetime
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.tsa.api import VAR
from statsmodels.tsa.api import ExponentialSmoothing, SimpleExpSmoothing
from statsmodels.tools.eval_measures import rmse, medianabs


# Reduces variance in results but won't eliminate it :-(
%env PYTHONHASHSEED=0
import random
random.seed(42)
np.random.seed(42)

/usr/local/lib/python3.7/dist-packages/statsmodels/tools/_testing.py:19: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.
  import pandas.util.testing as tm

env: PYTHONHASHSEED=0

Baseline Forecasts for Cambridge UK Weather Time Series¶

Building baseline models for time series analysis of Cambridge UK temperature measurements taken at the University computer lab weather station.

I'm primarily interested in short term temperature forecasts (less than 2 hours) but will include results up to 24 hours in the future.

See my previous work for further details:

Import Data¶

The measurements are relatively noisy and there are usually several hundred missing values every year. Observations have been extensively cleaned but may still have issues. Interpolation and missing value imputation have been used to fill all missing values. See the cleaning section in the Cambridge Temperature Model repository for details. Observations start in August 2008 and end in April 2021 and occur every 30 mins.

In [2]:

if 'google.colab' in str(get_ipython()):
    data_loc = "https://github.com/makeyourownmaker/CambridgeTemperatureNotebooks/blob/main/data/CamMetCleanish2021.04.26.csv?raw=true"
else:
    data_loc = "../data/CamMetCleanish2021.04.26.csv"
df = pd.read_csv(data_loc, parse_dates = True)

df['ds'] = pd.to_datetime(df['ds'])
df['y'] = df['y'] / 10
df['wind.speed.mean'] = df['wind.speed.mean'] / 10

df = df.loc[df['ds'] > '2008-08-01 00:00:00',]

print("Shape:")
print(df.shape)
print("\nInfo:")
print(df.info())
print("\nSummary stats:")
display(df.describe())
print("\nRaw data:")
display(df)
print("\n")


def plot_examples(data, x_var):
    """Plot 9 sets of observations in 3 * 3 matrix ..."""

    assert len(data) == 9

    cols = [col for col in data[0].columns if col != x_var]

    fig, axs = plt.subplots(3, 3, figsize = (15, 10))
    axs = axs.ravel()  # apl for the win :-)

    for i in range(9):
      for col in cols:
        axs[i].plot(data[i][x_var], data[i][col])
        axs[i].xaxis.set_tick_params(rotation = 20, labelsize = 10)
    
    fig.legend(cols, loc = 'upper center',  ncol = len(cols))
    
    return None


cols = ['ds', 'y', 'humidity', 'dew.point', 'pressure', 
        'wind.speed.mean', 'wind.bearing.mean']
plots  = 9
window = 24
starts = [random.randint(0, np.floor(df.shape[0] / window)) for _ in range(plots)]
p_data = [df.loc[starts[i] * window:starts[i] * window + window, cols] 
          for i in range(plots)]
plot_examples(p_data, 'ds')

Shape:
(225251, 7)

Info:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 225251 entries, 2 to 225252
Data columns (total 7 columns):
 #   Column             Non-Null Count   Dtype         
---  ------             --------------   -----         
 0   ds                 225251 non-null  datetime64[ns]
 1   y                  225251 non-null  float64       
 2   humidity           225251 non-null  float64       
 3   dew.point          225251 non-null  float64       
 4   pressure           225251 non-null  float64       
 5   wind.speed.mean    225251 non-null  float64       
 6   wind.bearing.mean  225251 non-null  float64       
dtypes: datetime64[ns](1), float64(6)
memory usage: 13.7 MB
None

Summary stats:

	y	humidity	dew.point	pressure	wind.speed.mean	wind.bearing.mean
count	225251.000000	225251.000000	225251.000000	225251.000000	225251.000000	225251.000000
mean	10.027882	78.619532	58.600183	1014.342062	4.434737	194.974558
std	6.509969	17.308646	51.645273	11.911991	4.011087	82.876095
min	-7.000000	20.000000	-100.000000	963.000000	0.000000	0.000000
25%	5.200000	68.000000	19.000000	1008.000000	1.200000	135.000000
50%	9.600000	83.000000	59.100000	1016.000000	3.500000	225.000000
75%	14.500000	92.000000	97.000000	1022.000000	6.600000	270.000000
max	36.100000	100.000000	209.000000	1051.000000	29.200000	360.000000

Raw data:

	ds	y	humidity	dew.point	pressure	wind.speed.mean	wind.bearing.mean
2	2008-08-01 00:30:00	19.5	65.75000	119.150000	1014.416667	1.150000	225.0
3	2008-08-01 01:00:00	19.1	49.75000	79.200000	1014.384615	1.461538	225.0
4	2008-08-01 01:30:00	19.1	66.17875	106.600000	1014.500000	1.508333	225.0
5	2008-08-01 02:00:00	19.1	58.50000	99.250000	1014.076923	1.430769	225.0
6	2008-08-01 02:30:00	19.1	66.95000	121.883333	1014.416667	1.133333	225.0
...	...	...	...	...	...	...	...
225248	2021-04-25 23:00:00	3.6	61.00000	-32.000000	1028.000000	1.400000	45.0
225249	2021-04-25 23:30:00	3.6	64.00000	-26.000000	1028.000000	2.600000	45.0
225250	2021-04-26 00:00:00	3.6	58.00000	-39.000000	1028.000000	4.300000	45.0
225251	2021-04-26 00:30:00	3.2	62.00000	-34.000000	1027.000000	5.400000	45.0
225252	2021-04-26 01:00:00	3.2	62.00000	-34.000000	1027.000000	4.200000	45.0

225251 rows × 7 columns

Data Processing and Feature Engineering¶

The data must be reformatted before model building.

The following steps are carried out:

Wind direction and speed transformation
Time conversion
Train test data separation
Data windowing
- Some baseline methods operate on initial training windows of data

Wind direction and speed transformation¶

The wind.bearing.mean column gives wind direction in degrees but is categorised at 45 degree increments, i.e. 0, 45, 90, 135, 180, 225, 270, 315. Wind direction shouldn't matter if the wind is not blowing.

The distribution of wind direction and speed looks like this:

In [3]:

plt.hist2d(df['wind.bearing.mean'], df['wind.speed.mean'], bins = (50, 50), vmax = 400)
plt.colorbar()
plt.xlabel('Wind Direction - degrees')
plt.ylabel('Wind Velocity - Knots')
plt.title('Wind bearing and velocity');

Convert wind direction and speed to x and y vectors, so the model can more easily interpret them.

In [4]:

wv = df['wind.speed.mean']

# Convert to radians
wd_rad = df['wind.bearing.mean'] * np.pi / 180

# Calculate the wind x and y components
df['wind.x'] = wv * np.cos(wd_rad)
df['wind.y'] = wv * np.sin(wd_rad)

df_orig = df

plt.hist2d(df['wind.x'], df['wind.y'], bins = (50, 50), vmax = 400)
plt.colorbar()
plt.xlabel('Wind X - Knots')
plt.ylabel('Wind Y - Knots')
plt.title('Wind velocity vectors');

Better, but not ideal. Data augmentation with the mixup method is carried out below.

From the mixup paper: "mixup trains a neural network on convex combinations of pairs of examples and their labels".

Further details on how I apply mixup to time series are included in the Window data section of my keras_mlp_fcn_resnet_time_series.ipynb notebook.

Here is an illustration of the improvement in wind velocity sparsity with mixup augmentation.

In [5]:

def mixup(data, alpha = 1.0, factor = 1):
    batch_size = len(data) - 1
    
    data['epoch'] = data.index.astype(np.int64) // 10**9
    
    # random sample lambda value from beta distribution
    l   = np.random.beta(alpha, alpha, batch_size * factor)
    X_l = l.reshape(batch_size * factor, 1)
    
    # Get a pair of inputs and outputs
    y1  = data['y'].shift(-1).dropna()
    y1_ = pd.concat([y1] * factor)
    
    y2  = data['y'][0:batch_size]
    y2_ = pd.concat([y2] * factor)

    X1  = data.drop('y', 1).shift(-1).dropna()
    X1_ = pd.concat([X1] * factor)
    
    X2  = data.drop('y', 1)
    X2  = X2[0:batch_size]
    X2_ = pd.concat([X2] * factor)

    # Perform mixup
    X = X1_ * X_l + X2_ * (1 - X_l)
    y = y1_ * l   + y2_ * (1 - l)    
    
    df = pd.DataFrame(y).join(X)    
    df = data.append(df).sort_values('epoch', ascending = True)
    df = df.drop('epoch', 1)
    
    df = df.drop_duplicates(keep = False)
    
    return df


df_mix = mixup(df.loc[:, ['y','wind.x','wind.y']], factor = 2)
plt.hist2d(df_mix['wind.x'], df_mix['wind.y'], bins = (50, 50), vmax = 400)
plt.colorbar()
plt.xlabel('Wind X - Knots')
plt.ylabel('Wind Y - Knots')
plt.title('Wind velocity with mixup data augmentation');

Mixup improves the categorical legacy of the wind velocity data. Unfortunately, if outliers are present their influence will likely be reinforced.

For now, I'm not using the wind vector or mixup augmentation with any of the baseline methods. See my keras_mlp_fcn_resnet_time_series.ipynb notebook for an illustration of the profoundly beneficial improvement mixup provides with this dataset.

Time conversion¶

Convert ds timestamps to "time of day" and "time of year" variables using sin and cos functions.

In [6]:

# Convert to secs
date_time   = pd.to_datetime(df['ds'], format = '%Y.%m.%d %H:%M:%S')
timestamp_s = date_time.map(datetime.datetime.timestamp)

day  = 24 * 60 * 60
year = (365.2425) * day

df['day.sin']  = np.sin(timestamp_s * (2 * np.pi / day))
df['day.cos']  = np.cos(timestamp_s * (2 * np.pi / day))
df['year.sin'] = np.sin(timestamp_s * (2 * np.pi / year))
df['year.cos'] = np.cos(timestamp_s * (2 * np.pi / year))

plt.plot(np.array(df['day.sin'])[49:98])
plt.plot(np.array(df['day.cos'])[49:98])
plt.xlabel('Time')
plt.legend(['sin', 'cos'], loc = 'lower right')
plt.title('Time of day signal');

The yearly time components may benefit from a single phase shift so they align with the seasonal temperature peak around the end of July and temperature trough around the end of January. Similarly, the daily components may benefit from small daily phase shifts.

I use these time components with the multi-variate VAR method.

Split data¶¶

I use data from 2018 for validation, 2019 for testing and the remaining data for training. These are entirely arbitrary choices. This results in an approximate 84%, 8%, 8% split for the training, validation, and test sets respectively.

In [7]:

keep_cols = ['y', 'humidity', 'dew.point', 'pressure', 'wind.x', 'wind.y',
             'day.sin', 'day.cos', 'year.sin', 'year.cos', 'ds']

df['year'] = df['ds'].dt.year
train_df = df.loc[(df['year'] != 2018) & (df['year'] != 2019)]
valid_df = df.loc[df['year'] == 2018]
test_df  = df.loc[df['year'] == 2019]

train_df = train_df.drop('year', axis = 1)  # inplace = True gives SettingWithCopyWarning
valid_df = valid_df.drop('year', axis = 1)  # ...
test_df  = test_df.drop('year',  axis = 1)
df       = df.drop('year',       axis = 1)

print("df.drop shape: ", df.shape)
print("train shape:   ", train_df.shape)
print("valid shape:   ", valid_df.shape)
print("test shape:    ", test_df.shape)

plt.figure(figsize = (15, 8))
plt.plot(train_df.ds, train_df.y)
plt.plot(test_df.ds,   test_df.y)
plt.title('Temperature - C')
plt.legend(['train', 'test'])
plt.show()

plt.figure(figsize = (15, 8))
plt.plot(test_df.ds, test_df.y, color='orange')
plt.title('Temperature - C (test data)')
plt.show();

df.drop shape:  (225251, 13)
train shape:    (190023, 13)
valid shape:    (17655, 13)
test shape:     (17573, 13)

The training data is used to calculate the seasonal average values.

Baselines Methods¶

This notebook is being developed on Google Colab primarily with the statsmodels package.

I will not use complicated methods just for comparison of baseline methods. Also, I don't want to spend a great deal of time optimising hyperparameter settings. This means I will use simpler methods and default settings as far as possible.

Model diagnostics will not be checked. Some of the models built may have issues.

Runtime may become an issue for complicated methods.

Univariate baseline methods:

mean - simple average
naive - persistent
- commonly used in the meteorology literature
seasonal average - historical average
SES - Simple Exponential Smoothing
HWES - Holt Winter's Exponential Smoothing

I would also like to look at results from, the AutoReg and STL functions but they are unavailable in the current version of statsmodels (0.10.2) installed on Google Colab (as of 28/04/21). The alternative AR function is depracated. ARIMA models would be the next logical choice but often require model checking and tuning. Installing packages on Google Colab is straight-forward. Upgrading packages may also be straight-forward, but it's probably more productive to move onto multi-variate baselines instead.

Multivariate baseline methods:

VAR - Vector AutoRegression

Forecast horizons:

next 30 mins - 1 step ahead
next 2 hours - 4 steps ahead
next 24 hours - 48 steps ahead
horizon is abbreviated to h in the interrim results tables

Some forecast methods are limited to single step forecasts. For example, simple average and seasonal average.

Metrics considered:

rmse - root mean squared error
mae - median absolute error

Univariate Baseline Methods¶

Firstly, the mean baseline:

In [8]:

# sa - simple average
sa = train_df.y.mean()
rmse_sa = rmse(test_df.y, sa)
mae_sa  = medianabs(test_df.y, sa)

print("simple average:", round(sa, 2))
print("rmse:", round(rmse_sa, 2))
print("mae: ", round(mae_sa,  2))

cols = ['type', 'method', 'metric', 'horizon', 'value']
metrics_h = []  # h for horizon

for i in range(1, 49):
  metrics_h.append(dict(zip(cols, ['univariate', 'mean', 'rmse', i, rmse_sa])))
  metrics_h.append(dict(zip(cols, ['univariate', 'mean',  'mae', i,  mae_sa])))

metrics = pd.DataFrame(metrics_h, columns = cols)

simple average: 10.09
rmse: 6.42
mae:  4.89

The mean baseline has equivalent metric values across all forecast horizons.

Secondly, the naive baseline:

In [9]:

def get_naive_metrics(data, horizon, method):
  y_hat = data.copy()
  y_hat['naive'] = y_hat.y.shift(horizon)
  y_hat = y_hat.iloc[horizon:]  # remove na value
  naive_rmse = rmse(y_hat.y, y_hat.naive)
  naive_mae  = medianabs(y_hat.y, y_hat.naive)
  
  return [naive_rmse, naive_mae]


def plot_metrics(metrics, main_title):
  fig, axs = plt.subplots(1, 2, figsize = (14, 7))
  fig.suptitle(main_title)
  axs = axs.ravel()  # APL ftw!

  methods = metrics.method.unique()

  for method in methods:
    df = metrics.query('metric == "rmse" & method == "%s"' % method)
    axs[0].plot(df.horizon, df.value)
  axs[0].set_xlabel("horizon - hours")
  axs[0].set_ylabel("rmse")
  axs[0].legend(methods)

  for method in methods:
    df = metrics.query('metric == "mae" & method == "%s"' % method)
    axs[1].plot(df.horizon, df.value)
  axs[1].set_xlabel("horizon - hours")
  axs[1].set_ylabel("mae")
  axs[1].legend(methods);


def update_metrics(metrics, test_data, method, get_metrics, model=None):
  metrics_h = []

  if method in ['SES', 'HWES']:
    horizons = [i for i in range(4, 49, 4)]
    horizons.insert(0, 1)
  else:
    horizons = range(1, 49)

  if method in ['VAR']:
    variates = 'multivariate'
  else:
    variates = 'univariate'

  print("h\trmse\tmae")
  for h in horizons:
    if method in ['VAR']:
      rmse_h, mae_h = get_metrics(test_df, h, method, model)
    else:
      rmse_h, mae_h = get_metrics(test_df, h, method)

    metrics_h.append(dict(zip(cols, [variates, method, 'rmse', h, rmse_h])))
    metrics_h.append(dict(zip(cols, [variates, method,  'mae', h,  mae_h])))
    
    if h in [1, 4, 48]:
      print(h, "\t", round(rmse_h, 2), "\t", round(mae_h, 2))

  print("\n")

  metrics_method = pd.DataFrame(metrics_h, columns = cols)
  metrics = metrics.append(metrics_method)

  return metrics


metrics = update_metrics(metrics, test_df, 'naive', get_naive_metrics)
plot_metrics(metrics, "Univariate Baseline Comparison - 2019 test data")

h	rmse	mae
1 	 0.66 	 0.4
4 	 1.86 	 0.9
48 	 3.08 	 2.0

Obviously, the naive method is an improvement over the simple average.

The daily seasonality is clearly evident in the naive method.

Third, seasonal average:

In [10]:

train_df['doy'] = train_df.ds.dt.strftime('%j')
train_df['hms'] = train_df.ds.dt.strftime('%H:%M:%S')

# ha - historical average
ha = pd.DataFrame(train_df.groupby(['doy', 'hms']).y.mean())
ha.reset_index(inplace=True)  

rmse_ha = rmse(test_df.y[:17568],      ha.y)
mae_ha  = medianabs(test_df.y[:17568], ha.y)

print("rmse:", round(rmse_ha, 2))
print("mae: ", round(mae_ha,  2))

metrics_h = []  # h for horizon

for h in range(1, 49):
  metrics_h.append(dict(zip(cols, ['univariate', 'seasonal mean', 'rmse', h, rmse_ha])))
  metrics_h.append(dict(zip(cols, ['univariate', 'seasonal mean',  'mae', h,  mae_ha])))

metrics_ha = pd.DataFrame(metrics_h, columns = cols)
metrics = metrics.append(metrics_ha)

plot_metrics(metrics, "Univariate Baseline Comparison - 2019 test data")

rmse: 4.65
mae:  2.97

The historical average, or seasonal average, is superior to the simple average. It also illustrates how the naive method is struggling with the daily seasonality component.

Fourth, simple exponential smoothing:

Run time is increasing
- I only run every 4th horizon value for SES and all subsequent methods
- SES takes approx. 15 mins to complete
Default settings for SimpleExpSmoothing function
- See statmodels SimpleExpSmoothing documentation for details
Using 48 temperature measurements for the initial training window

In [11]:

%%time

def es_rolling_cv(data, horizon, method):
    i =  0
    x = 48  # initial training window
    h = horizon
    rmse_roll, mae_roll = [], []

    while (i + x + h) < len(data):
        train_ts = data[(i):(i + x)].y.values
        test_ts  = data[(i + x):(i + x + h)].y.values
        # print("train:", train_ts, "\ntest:", test_ts)  # DEBUG

        if method == 'SES':
          model_roll = SimpleExpSmoothing(train_ts).fit()
        elif method == 'HWES':
          model_roll = ExponentialSmoothing(train_ts,
                                            seasonal_periods = 48,
                                            seasonal = 'add',
                                            ).fit()

        y_hat = model_roll.forecast(h)
        # print("yhat:", y_hat)  # DEBUG

        rmse_i = rmse(test_ts, y_hat)
        mae_i  = medianabs(test_ts, y_hat)
        # print("rmse:", rmse_i, "\nmae:", mae_i, "\n")  # DEBUG
        rmse_roll.append(rmse_i)
        mae_roll.append(mae_i)

        i = i + 1
   
    return [np.mean(rmse_roll).round(2), np.mean(mae_roll).round(2)]


metrics = update_metrics(metrics, test_df, 'SES', es_rolling_cv)
plot_metrics(metrics, "Univariate Baseline Comparison - 2019 test data")

h	rmse	mae
1 	 0.44 	 0.44
4 	 1.02 	 0.9
48 	 4.0 	 3.23


CPU times: user 15min 25s, sys: 47 s, total: 16min 12s
Wall time: 15min 1s

Clearly, the SES method is an improvement over the naive method for forecast horizons less than approx. 18 hours.

It may be possible to optimise the SES method by setting initial training window length train_window and/or the initialization_method parameter. See statsmodels docs for initialization_method options.

Fifth, Holt-Winters exponential smoothing:

Run time is getting quite long
- I run every 4th horizon value
- Still takes approx. 30 mins
seasonal_periods is set to 48 (24 half-hourly observations)
- yearly seasonality is not accounted for
- ExponentialSmoothing function does not support multiple seasonalities
seasonal is set to additive
Temperature measurements are mostly cyclic with little or no trend across the years
- trend and damped_trend are set to default values of None
See statmodels ExponentialSmoothing documentation for details

In [12]:

%%time

metrics = update_metrics(metrics, test_df, 'HWES', es_rolling_cv)
plot_metrics(metrics, "Univariate Baseline Comparison - 2019 test data")

h	rmse	mae

/usr/local/lib/python3.7/dist-packages/statsmodels/tsa/holtwinters.py:924: RuntimeWarning: divide by zero encountered in log
  aic = self.nobs * np.log(sse / self.nobs) + k * 2
/usr/local/lib/python3.7/dist-packages/statsmodels/tsa/holtwinters.py:929: RuntimeWarning: invalid value encountered in double_scalars
  aicc = aic + aicc_penalty
/usr/local/lib/python3.7/dist-packages/statsmodels/tsa/holtwinters.py:930: RuntimeWarning: divide by zero encountered in log
  bic = self.nobs * np.log(sse / self.nobs) + k * np.log(self.nobs)

1 	 2.43 	 2.43
4 	 2.5 	 2.42
48 	 2.8 	 2.24


CPU times: user 30min 31s, sys: 1min 13s, total: 31min 44s
Wall time: 30min 15s

On the whole, HWES is an improvement over the other methods even if short horizon forecasts are worse than the naive and SES methods.

The daily cyclic component present in the naive and SES results is not an issue with the HWES method as would be expected.

It's puzzling that the mean absolute error decreases slightly as the horizon increases. Adding error bars may partially address this issue.

It's also worth noting:

There are convergence warnings if the initial training window is not equal to the seaonality period of 48
Opportunities for HWES parameter tuning include the use_boxcox and initialization_method options

Multivariate Baseline Methods¶

Considering two models:

VAR - Vector Auto-regression

VAR models capture the linear interdependencies among multiple time series.

Firstly, VAR:

treating temperature (y), humidity, dew.point and pressure as endogenous varaibles
not using wind.x or wind.y in model
- Granger causality tests suggest poor performance for wind.x and wind.y measurements
- mixup data augmentation may give a different result
using daily and yearly sinusoidal time variables as exogenous data
- day.cos, day.sin, year.cos, year.sin
- not shown in this notebook but VAR model without these exogenous variables performs worse; particularly for longer horizon forecasts
limiting order of lagged variables to 12
only calculating rmse and mae for y
- not looking at forecast performance for other endogenous variables at this time
approx. 1 hour run time
See statmodels VAR documentation for details

In [13]:

from statsmodels.tsa.stattools import grangercausalitytests

maxlag = 12
test   = 'ssr_chi2test'

def grangers_causation_matrix(data, variables, test='ssr_chi2test', verbose=False):    
    """Check Granger Causality of all possible combinations of the Time series.
    The rows are the response variable, columns are predictors.  The values in the table 
    are the P-Values.  P-Values lesser than the significance level (0.05), implies 
    the Null Hypothesis that the coefficients of the corresponding past values is 
    zero, that is, the X does not cause Y can be rejected.

    data      : pandas dataframe containing the time series variables
    variables : list containing names of the time series variables.

    From https://www.machinelearningplus.com/time-series/vector-autoregression-examples-python/
    """

    df = pd.DataFrame(np.zeros((len(variables), len(variables))), columns=variables, index=variables)

    for c in df.columns:
        for r in df.index:
            test_result = grangercausalitytests(data[[r, c]], maxlag=maxlag, verbose=False)
            p_values = [round(test_result[i + 1][0][test][1], 4) for i in range(maxlag)]

            if verbose: print(f'Y = {r}, X = {c}, P Values = {p_values}')

            min_p_value  = np.min(p_values)
            df.loc[r, c] = min_p_value

    df.columns = [var + '_x' for var in variables]
    df.index   = [var + '_y' for var in variables]

    return df


vars = ['y', 'humidity', 'dew.point', 'pressure'] # ,'wind.x', 'wind.y']

gcm = grangers_causation_matrix(test_df, vars)
display(gcm)  

/usr/local/lib/python3.7/dist-packages/statsmodels/base/model.py:1752: ValueWarning: covariance of constraints does not have full rank. The number of constraints is 6, but rank is 1
  'rank is %d' % (J, J_), ValueWarning)

	y_x	humidity_x	dew.point_x	pressure_x
y_y	1.0	0.0000	0.0	0.0
humidity_y	0.0	1.0000	0.0	0.0
dew.point_y	0.0	0.0000	1.0	0.0
pressure_y	0.0	0.0017	0.0	1.0

WARNING: The function used to check Granger causality does not take multiple testing problems into account - ignore the questionable documentation regarding significance tests.

Onwards to model building:

In [14]:

%%time

# build model on train_df
endo_vars = ['y', 'humidity', 'dew.point', 'pressure'] # better model without wind.x & wind.y
exog_vars = ['day.cos', 'day.sin', 'year.cos', 'year.sin']
endo_df = train_df[endo_vars]
exog_df = train_df[exog_vars]

var_model = VAR(endog = endo_df, exog = exog_df)
var_fit   = var_model.fit(maxlags = 12, ic = 'aic')
print(var_fit.summary())


# rolling_cv with pre-trained model
def var_rolling_cv(data, horizon, method, model):
    lags = model.k_ar  # lag order
    i = lags
    h = horizon
    rmse_roll, mae_roll = [], []
    endo_vars = ['y', 'humidity', 'dew.point', 'pressure']
    exog_vars = ['day.cos', 'day.sin', 'year.cos', 'year.sin']
    #print("\ti:", i, "\th:", h, "\tlags:", lags)  # DEBUG

    while (i + h) < len(data):
        test_df = data[endo_vars].iloc[i:(i + h)]
        endo_df = data[endo_vars].iloc[(i - lags):i].values
        exog_df = data[exog_vars].iloc[i:(i + h)]
        #print("\tendo:", endo_df.shape, "\texog:", exog_df.shape)  # DEBUG
        # display(endo_df)  # DEBUG

        y_hat = model.forecast(endo_df, exog_future = exog_df, steps = h)
        #print("yhat:", y_hat)  # DEBUG

        preds = pd.DataFrame(y_hat, columns = endo_vars)

        rmse_i = rmse(test_df.y,      preds.y)
        mae_i  = medianabs(test_df.y, preds.y)
        # print("rmse:", rmse_i, "\nmae:", mae_i, "\n")  # DEBUG
        rmse_roll.append(rmse_i)
        mae_roll.append(mae_i)

        i = i + 1
   
    return [np.mean(rmse_roll).round(2), np.mean(mae_roll).round(2)]


metrics = update_metrics(metrics, test_df, 'VAR', var_rolling_cv, var_fit)

/usr/local/lib/python3.7/dist-packages/statsmodels/tsa/base/tsa_model.py:215: ValueWarning: An unsupported index was provided and will be ignored when e.g. forecasting.
  ' ignored when e.g. forecasting.', ValueWarning)

  Summary of Regression Results   
==================================
Model:                         VAR
Method:                        OLS
Date:           Fri, 10, Sep, 2021
Time:                     17:18:26
--------------------------------------------------------------------
No. of Equations:         4.00000    BIC:                    4.67359
Nobs:                     190011.    HQIC:                   4.66560
Log likelihood:      -1.52118e+06    FPE:                    105.875
AIC:                      4.66226    Det(Omega_mle):         105.757
--------------------------------------------------------------------
Results for equation y
================================================================================
                   coefficient       std. error           t-stat            prob
--------------------------------------------------------------------------------
const                -1.544437         0.137736          -11.213           0.000
exog0                -0.226294         0.003149          -71.873           0.000
exog1                 0.134527         0.002785           48.302           0.000
exog2                -0.153552         0.003575          -42.951           0.000
exog3                -0.024625         0.002600           -9.470           0.000
L1.y                  0.977004         0.003159          309.282           0.000
L1.humidity          -0.009401         0.000580          -16.214           0.000
L1.dew.point          0.004889         0.000260           18.836           0.000
L1.pressure          -0.018217         0.002418           -7.534           0.000
L2.y                  0.040475         0.004089            9.899           0.000
L2.humidity           0.010012         0.000692           14.465           0.000
L2.dew.point         -0.002872         0.000313           -9.170           0.000
L2.pressure           0.014847         0.003250            4.568           0.000
L3.y                  0.017295         0.004115            4.203           0.000
L3.humidity          -0.001800         0.000701           -2.569           0.010
L3.dew.point          0.001771         0.000316            5.596           0.000
L3.pressure           0.012485         0.003274            3.814           0.000
L4.y                  0.004718         0.004116            1.146           0.252
L4.humidity           0.004475         0.000701            6.384           0.000
L4.dew.point         -0.001763         0.000317           -5.570           0.000
L4.pressure          -0.003827         0.003275           -1.169           0.243
L5.y                 -0.027403         0.004119           -6.653           0.000
L5.humidity          -0.001565         0.000702           -2.230           0.026
L5.dew.point         -0.000117         0.000317           -0.370           0.712
L5.pressure           0.000103         0.003275            0.032           0.975
L6.y                  0.002218         0.004118            0.539           0.590
L6.humidity           0.004751         0.000702            6.771           0.000
L6.dew.point         -0.002048         0.000317           -6.465           0.000
L6.pressure           0.000590         0.003275            0.180           0.857
L7.y                 -0.006846         0.004118           -1.662           0.096
L7.humidity           0.001522         0.000702            2.168           0.030
L7.dew.point         -0.001229         0.000317           -3.879           0.000
L7.pressure           0.005164         0.003275            1.577           0.115
L8.y                 -0.024975         0.004119           -6.064           0.000
L8.humidity          -0.002039         0.000702           -2.905           0.004
L8.dew.point          0.001246         0.000317            3.933           0.000
L8.pressure          -0.011264         0.003275           -3.439           0.001
L9.y                  0.002985         0.004116            0.725           0.468
L9.humidity           0.000780         0.000701            1.113           0.266
L9.dew.point          0.000020         0.000317            0.064           0.949
L9.pressure           0.004215         0.003275            1.287           0.198
L10.y                -0.008054         0.004115           -1.957           0.050
L10.humidity         -0.000925         0.000701           -1.319           0.187
L10.dew.point        -0.000058         0.000316           -0.182           0.856
L10.pressure         -0.007244         0.003274           -2.213           0.027
L11.y                 0.002515         0.004087            0.615           0.538
L11.humidity          0.000662         0.000693            0.956           0.339
L11.dew.point        -0.000195         0.000313           -0.623           0.533
L11.pressure         -0.005101         0.003250           -1.569           0.117
================================================================================

Results for equation humidity
================================================================================
                   coefficient       std. error           t-stat            prob
--------------------------------------------------------------------------------
const                -0.010138         0.003169           -3.199           0.001
exog0                -0.003601         0.000580           -6.213           0.000
exog1                 0.001903         0.000259            7.345           0.000
exog2                 0.009757         0.002418            4.035           0.000
exog3                 9.805624         0.962126           10.192           0.000
L1.y                  1.000560         0.021993           45.493           0.000
L1.humidity          -0.416643         0.019455          -21.416           0.000
L1.dew.point          0.382083         0.024973           15.300           0.000
L1.pressure          -0.247444         0.018164          -13.623           0.000
L2.y                 -1.458346         0.022066          -66.090           0.000
L2.humidity           0.642661         0.004050          158.670           0.000
L2.dew.point         -0.016420         0.001813           -9.057           0.000
L2.pressure          -0.029554         0.016889           -1.750           0.080
L3.y                  0.631041         0.028562           22.094           0.000
L3.humidity           0.182591         0.004835           37.763           0.000
L3.dew.point          0.006746         0.002188            3.083           0.002
L3.pressure          -0.106031         0.022703           -4.670           0.000
L4.y                  0.225460         0.028742            7.844           0.000
L4.humidity           0.078080         0.004894           15.955           0.000
L4.dew.point          0.001727         0.002210            0.781           0.435
L4.pressure          -0.039336         0.022868           -1.720           0.085
L5.y                  0.202586         0.028751            7.046           0.000
L5.humidity           0.019582         0.004897            3.999           0.000
L5.dew.point          0.009068         0.002212            4.100           0.000
L5.pressure           0.009264         0.022877            0.405           0.686
L6.y                  0.071629         0.028769            2.490           0.013
L6.humidity           0.014463         0.004901            2.951           0.003
L6.dew.point          0.006990         0.002213            3.158           0.002
L6.pressure           0.019760         0.022877            0.864           0.388
L7.y                  0.149912         0.028768            5.211           0.000
L7.humidity           0.001403         0.004901            0.286           0.775
L7.dew.point         -0.000762         0.002213           -0.344           0.731
L7.pressure          -0.024300         0.022878           -1.062           0.288
L8.y                  0.015327         0.028767            0.533           0.594
L8.humidity           0.000971         0.004902            0.198           0.843
L8.dew.point          0.000302         0.002213            0.137           0.891
L8.pressure           0.039321         0.022878            1.719           0.086
L9.y                  0.066301         0.028770            2.305           0.021
L9.humidity          -0.010700         0.004902           -2.183           0.029
L9.dew.point         -0.001197         0.002213           -0.541           0.589
L9.pressure          -0.004615         0.022876           -0.202           0.840
L10.y                -0.016503         0.028752           -0.574           0.566
L10.humidity         -0.013503         0.004898           -2.757           0.006
L10.dew.point         0.005200         0.002212            2.351           0.019
L10.pressure          0.050434         0.022877            2.205           0.027
L11.y                 0.109473         0.028746            3.808           0.000
L11.humidity          0.022187         0.004895            4.532           0.000
L11.dew.point        -0.011784         0.002210           -5.333           0.000
L11.pressure          0.028116         0.022869            1.229           0.219
================================================================================

Results for equation dew.point
================================================================================
                   coefficient       std. error           t-stat            prob
--------------------------------------------------------------------------------
const                -0.092524         0.028550           -3.241           0.001
exog0                -0.026616         0.004839           -5.500           0.000
exog1                 0.011113         0.002187            5.081           0.000
exog2                 0.046912         0.022705            2.066           0.039
exog3                 0.029192         0.022139            1.319           0.187
L1.y                  0.015698         0.004049            3.877           0.000
L1.humidity          -0.007047         0.001810           -3.893           0.000
L1.dew.point          0.006485         0.016891            0.384           0.701
L1.pressure          -5.235230         1.996204           -2.623           0.009
L2.y                 -0.490357         0.045632          -10.746           0.000
L2.humidity           0.275447         0.040365            6.824           0.000
L2.dew.point         -0.321373         0.051813           -6.203           0.000
L2.pressure          -0.655026         0.037687          -17.381           0.000
L3.y                 -0.056794         0.045783           -1.241           0.215
L3.humidity          -0.120873         0.008403          -14.384           0.000
L3.dew.point          0.681391         0.003762          181.137           0.000
L3.pressure          -0.169970         0.035042           -4.851           0.000
L4.y                  0.006391         0.059259            0.108           0.914
L4.humidity           0.110637         0.010032           11.028           0.000
L4.dew.point          0.162959         0.004539           35.899           0.000
L4.pressure          -0.124674         0.047104           -2.647           0.008
L5.y                  0.375319         0.059633            6.294           0.000
L5.humidity           0.089129         0.010153            8.778           0.000
L5.dew.point          0.059155         0.004586           12.898           0.000
L5.pressure           0.086758         0.047447            1.829           0.067
L6.y                 -0.172983         0.059652           -2.900           0.004
L6.humidity          -0.035387         0.010161           -3.483           0.000
L6.dew.point          0.050800         0.004588           11.071           0.000
L6.pressure          -0.049732         0.047465           -1.048           0.295
L7.y                 -0.124365         0.059690           -2.084           0.037
L7.humidity           0.017902         0.010170            1.760           0.078
L7.dew.point          0.010530         0.004592            2.293           0.022
L7.pressure          -0.040725         0.047464           -0.858           0.391
L8.y                  0.317251         0.059687            5.315           0.000
L8.humidity           0.034609         0.010169            3.404           0.001
L8.dew.point         -0.015052         0.004591           -3.278           0.001
L8.pressure          -0.068948         0.047468           -1.453           0.146
L9.y                 -0.295447         0.059685           -4.950           0.000
L9.humidity          -0.037808         0.010170           -3.718           0.000
L9.dew.point          0.010976         0.004592            2.390           0.017
L9.pressure           0.117336         0.047466            2.472           0.013
L10.y                 0.007472         0.059691            0.125           0.900
L10.humidity         -0.004778         0.010171           -0.470           0.639
L10.dew.point        -0.010112         0.004592           -2.202           0.028
L10.pressure         -0.069403         0.047463           -1.462           0.144
L11.y                 0.158203         0.059654            2.652           0.008
L11.humidity         -0.036198         0.010162           -3.562           0.000
L11.dew.point         0.015412         0.004589            3.358           0.001
L11.pressure          0.200941         0.047464            4.233           0.000
================================================================================

Results for equation pressure
================================================================================
                   coefficient       std. error           t-stat            prob
--------------------------------------------------------------------------------
const                 0.118075         0.059641            1.980           0.048
exog0                 0.049769         0.010157            4.900           0.000
exog1                -0.028993         0.004585           -6.324           0.000
exog2                 0.024779         0.047449            0.522           0.602
exog3                -0.071992         0.059234           -1.215           0.224
L1.y                 -0.024298         0.010040           -2.420           0.016
L1.humidity           0.006812         0.004538            1.501           0.133
L1.dew.point          0.013954         0.047108            0.296           0.767
L1.pressure           0.049025         0.045933            1.067           0.286
L2.y                 -0.012004         0.008400           -1.429           0.153
L2.humidity           0.005862         0.003756            1.561           0.119
L2.dew.point          0.082341         0.035045            2.350           0.019
L2.pressure           2.737593         0.130708           20.944           0.000
L3.y                 -0.049294         0.002988          -16.498           0.000
L3.humidity          -0.002578         0.002643           -0.975           0.329
L3.dew.point         -0.010324         0.003393           -3.043           0.002
L3.pressure           0.003190         0.002468            1.293           0.196
L4.y                 -0.045975         0.002998          -15.336           0.000
L4.humidity          -0.004051         0.000550           -7.363           0.000
L4.dew.point          0.000844         0.000246            3.425           0.001
L4.pressure           0.898073         0.002294          391.407           0.000
L5.y                  0.007497         0.003880            1.932           0.053
L5.humidity           0.000917         0.000657            1.396           0.163
L5.dew.point         -0.000331         0.000297           -1.112           0.266
L5.pressure           0.161498         0.003084           52.361           0.000
L6.y                  0.007214         0.003905            1.848           0.065
L6.humidity           0.000152         0.000665            0.229           0.819
L6.dew.point          0.000009         0.000300            0.030           0.976
L6.pressure           0.036460         0.003107           11.736           0.000
L7.y                  0.003007         0.003906            0.770           0.441
L7.humidity           0.001021         0.000665            1.535           0.125
L7.dew.point         -0.000132         0.000300           -0.441           0.660
L7.pressure          -0.002062         0.003108           -0.664           0.507
L8.y                 -0.002104         0.003908           -0.538           0.590
L8.humidity           0.000300         0.000666            0.451           0.652
L8.dew.point         -0.000199         0.000301           -0.661           0.509
L8.pressure          -0.020225         0.003108           -6.508           0.000
L9.y                  0.011546         0.003908            2.954           0.003
L9.humidity           0.002001         0.000666            3.005           0.003
L9.dew.point         -0.000489         0.000301           -1.627           0.104
L9.pressure          -0.021942         0.003108           -7.059           0.000
L10.y                 0.003690         0.003908            0.944           0.345
L10.humidity         -0.001258         0.000666           -1.889           0.059
L10.dew.point         0.000419         0.000301            1.394           0.163
L10.pressure         -0.009384         0.003108           -3.019           0.003
L11.y                 0.000511         0.003908            0.131           0.896
L11.humidity         -0.000856         0.000666           -1.285           0.199
L11.dew.point         0.000231         0.000301            0.769           0.442
L11.pressure         -0.005998         0.003108           -1.930           0.054
================================================================================

Correlation matrix of residuals
                    y  humidity  dew.point  pressure
y            1.000000 -0.402649   0.169509 -0.047301
humidity    -0.402649  1.000000   0.640355  0.011679
dew.point    0.169509  0.640355   1.000000 -0.011011
pressure    -0.047301  0.011679  -0.011011  1.000000



h	rmse	mae
1 	 0.39 	 0.39
4 	 0.75 	 0.66
48 	 2.45 	 1.89


CPU times: user 54min 35s, sys: 34.9 s, total: 55min 10s
Wall time: 54min 55s

In [15]:

plot_metrics(metrics, "Multivariate Baseline Comparison - 2019 test data")

Forecasts from the multi-variate VAR model surpass all univariate methods in this notebook at all horizons tested. Obviously the combination of endogenous measurements and exogenous time components is superior to temperature measurements alone. Additionally, the VAR model may have utility for predicting the other endogenous variables.

Tuning lags should be the best approach to further improve VAR forecast accuracy. It would surprise me if lags in multiples of 48 were not significant, for at least the previous 2 or 3 days and perhaps up to a week. Meaning, the next set of lags to try would include combinations from: {1-12, 1-24, 1-48} and {48, 96, 144, 192, 240, 288, 336}.

Optimising time component frequencies and phase shifts would likely aid in reducing forecast errors and would be applicable to other methods.

Conclusion¶

Here is a brief summary of univariate and multivariate methods plus some potential future work.

Univariate methods:

Best method - HWES
Exponential smoothing models performed well
- As noted in the literature exponential smoothing approaches can beat more sophisticated methods

Multi-variate methods:

Best method - VAR

Univariate compared to multivariate methods:

VAR forecasts were substantially better than HWES on 2019 test data on all horizons tested

In conclusion, the VAR forecasts make a good baseline to beat.

Future work could include:

add error bars, confidence intervals etc
consider additional metrics such as Brier scores
- not MAPE because temperature can be 0 and below
investigate forecast performance for other variables in the VAR model
- particularly the wind vectors with and without mixup data augmentation
- optimise sinusoidal time component frequencies and phase shifts
- compare alternatives to the sinusoidal time components
  - Fourier terms
other methods
- upgrade statsmodels on Google Colab
  - check AutoReg and STL performance
- install sktime
  - TBATS implementation supports multiple seasonalities (but not exogenous variables)
- SARIMAX - Seasonal AutoRegressive Integrated Moving Average with eXogenous regressors
  - from the statsmodels package
  - iirc supports multiple seasonalities
  - likely to struggle with very long annual seasonality
  - separate notebook justified
- machine learning methods
  - Support vector regression
  - Gaussian processes
  - separate notebook(s) justified

Metadata¶

Python and Jupyter versions plus modules imported and their version strings. This is the poor man's python equivalent of R's sessionInfo().

Code for imported modules and versions adapted from this stackoverflow answer. There are simpler alternatives, such as watermark, but they all require installation.

In [16]:

import sys
import IPython

print("Python version:")
print(sys.executable)
print(sys.version)
print("\nIPython version:")
print(IPython.__version__)

Python version:
/usr/bin/python3
3.7.11 (default, Jul  3 2021, 18:01:19) 
[GCC 7.5.0]

IPython version:
5.5.0

In [17]:

import pkg_resources
import types

def get_imports():
    for name, val in globals().items():
        if isinstance(val, types.ModuleType):
            # Split ensures you get root package, 
            # not just imported function
            name = val.__name__.split(".")[0]

        elif isinstance(val, type):
            name = val.__module__.split(".")[0]

        # Some packages are weird and have different
        # imported names vs. system/pip names.  Unfortunately,
        # there is no systematic way to get pip names from
        # a package's imported name.  You'll have to add
        # exceptions to this list manually!
        poorly_named_packages = {
            "PIL":       "Pillow",
            "sklearn":   "scikit-learn",
        }
        if name in poorly_named_packages.keys():
            name = poorly_named_packages[name]

        yield name

imports = list(set(get_imports()))

# The only way I found to get the version of the root package
# from only the name of the package is to cross-check the names 
# of installed packages vs. imported packages
requirements = []
for m in pkg_resources.working_set:
    if m.project_name in imports and m.project_name != "pip":
        requirements.append((m.project_name, m.version))

reqs = pd.DataFrame(requirements, columns = ['name', 'version'])
print("Imported modules:")
reqs.style.hide_index()

Imported modules:

Out[17]:

name	version
statsmodels	0.10.2
pandas	1.1.5
numpy	1.19.5
matplotlib	3.2.2

In [18]:

!date

Fri Sep 10 18:25:17 UTC 2021

Archival¶

Archive code, markdown, history and formatted notebooks.

Assumes all pdf, html, latex etc dependencies are installed.

WARNING Will overwrite existing files.

Notebook name is hardcoded below because the alternative is ghastly globs of unreliable javascript or external ipython libraries I'd prefer to avoid installing :-(

In [ ]:

from time import sleep

notebook = "cammet_baselines_2021.ipynb"
# !jupyter nbconvert --to script {notebook}
# !jupyter nbconvert --execute --to html {notebook}
# !jupyter nbconvert --execute --to pdf {notebook}
# !jupyter nbconvert --to pdf {notebook}

%rm history.txt
%history -f history.txt

!jupyter nbconvert --to python {notebook}
sleep(5)
!jupyter nbconvert --to markdown {notebook}
sleep(5)
!jupyter nbconvert --to html {notebook}