%matplotlib inline
import datetime
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.tsa.api import VAR
from statsmodels.tsa.api import ExponentialSmoothing, SimpleExpSmoothing
from statsmodels.tools.eval_measures import rmse, medianabs
# Reduces variance in results but won't eliminate it :-(
%env PYTHONHASHSEED=0
import random
random.seed(42)
np.random.seed(42)
/usr/local/lib/python3.7/dist-packages/statsmodels/tools/_testing.py:19: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead. import pandas.util.testing as tm
env: PYTHONHASHSEED=0
Building baseline models for time series analysis of Cambridge UK temperature measurements taken at the University computer lab weather station.
I'm primarily interested in short term temperature forecasts (less than 2 hours) but will include results up to 24 hours in the future.
See my previous work for further details:
The measurements are relatively noisy and there are usually several hundred missing values every year. Observations have been extensively cleaned but may still have issues. Interpolation and missing value imputation have been used to fill all missing values. See the cleaning section in the Cambridge Temperature Model repository for details. Observations start in August 2008 and end in April 2021 and occur every 30 mins.
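The cleaning itself lives in that repository, but pandas time-based interpolation gives the flavour of the gap filling; a minimal sketch on made-up values (not the actual cleaning code):

```python
import numpy as np
import pandas as pd

# Two made-up half-hourly gaps, filled linearly in time
s = pd.Series([19.5, np.nan, np.nan, 19.1],
              index=pd.date_range('2008-08-01', periods=4, freq='30min'))
filled = s.interpolate(method='time')
```

`method='time'` weights by the actual timestamps, which matters when gaps are uneven.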
if 'google.colab' in str(get_ipython()):
data_loc = "https://github.com/makeyourownmaker/CambridgeTemperatureNotebooks/blob/main/data/CamMetCleanish2021.04.26.csv?raw=true"
else:
data_loc = "../data/CamMetCleanish2021.04.26.csv"
df = pd.read_csv(data_loc, parse_dates = True)
df['ds'] = pd.to_datetime(df['ds'])
df['y'] = df['y'] / 10
df['wind.speed.mean'] = df['wind.speed.mean'] / 10
df = df.loc[df['ds'] > '2008-08-01 00:00:00']
print("Shape:")
print(df.shape)
print("\nInfo:")
print(df.info())
print("\nSummary stats:")
display(df.describe())
print("\nRaw data:")
display(df)
print("\n")
def plot_examples(data, x_var):
"""Plot 9 sets of observations in 3 * 3 matrix ..."""
assert len(data) == 9
cols = [col for col in data[0].columns if col != x_var]
fig, axs = plt.subplots(3, 3, figsize = (15, 10))
axs = axs.ravel() # apl for the win :-)
for i in range(9):
for col in cols:
axs[i].plot(data[i][x_var], data[i][col])
axs[i].xaxis.set_tick_params(rotation = 20, labelsize = 10)
fig.legend(cols, loc = 'upper center', ncol = len(cols))
return None
cols = ['ds', 'y', 'humidity', 'dew.point', 'pressure',
'wind.speed.mean', 'wind.bearing.mean']
plots = 9
window = 24
starts = [random.randint(0, df.shape[0] // window - 1) for _ in range(plots)]
p_data = [df.loc[starts[i] * window:starts[i] * window + window, cols]
for i in range(plots)]
plot_examples(p_data, 'ds')
Shape:
(225251, 7)

Info:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 225251 entries, 2 to 225252
Data columns (total 7 columns):
 #   Column             Non-Null Count   Dtype
---  ------             --------------   -----
 0   ds                 225251 non-null  datetime64[ns]
 1   y                  225251 non-null  float64
 2   humidity           225251 non-null  float64
 3   dew.point          225251 non-null  float64
 4   pressure           225251 non-null  float64
 5   wind.speed.mean    225251 non-null  float64
 6   wind.bearing.mean  225251 non-null  float64
dtypes: datetime64[ns](1), float64(6)
memory usage: 13.7 MB
None

Summary stats:
 | y | humidity | dew.point | pressure | wind.speed.mean | wind.bearing.mean
---|---|---|---|---|---|---
count | 225251.000000 | 225251.000000 | 225251.000000 | 225251.000000 | 225251.000000 | 225251.000000 |
mean | 10.027882 | 78.619532 | 58.600183 | 1014.342062 | 4.434737 | 194.974558 |
std | 6.509969 | 17.308646 | 51.645273 | 11.911991 | 4.011087 | 82.876095 |
min | -7.000000 | 20.000000 | -100.000000 | 963.000000 | 0.000000 | 0.000000 |
25% | 5.200000 | 68.000000 | 19.000000 | 1008.000000 | 1.200000 | 135.000000 |
50% | 9.600000 | 83.000000 | 59.100000 | 1016.000000 | 3.500000 | 225.000000 |
75% | 14.500000 | 92.000000 | 97.000000 | 1022.000000 | 6.600000 | 270.000000 |
max | 36.100000 | 100.000000 | 209.000000 | 1051.000000 | 29.200000 | 360.000000 |
Raw data:
 | ds | y | humidity | dew.point | pressure | wind.speed.mean | wind.bearing.mean
---|---|---|---|---|---|---|---
2 | 2008-08-01 00:30:00 | 19.5 | 65.75000 | 119.150000 | 1014.416667 | 1.150000 | 225.0 |
3 | 2008-08-01 01:00:00 | 19.1 | 49.75000 | 79.200000 | 1014.384615 | 1.461538 | 225.0 |
4 | 2008-08-01 01:30:00 | 19.1 | 66.17875 | 106.600000 | 1014.500000 | 1.508333 | 225.0 |
5 | 2008-08-01 02:00:00 | 19.1 | 58.50000 | 99.250000 | 1014.076923 | 1.430769 | 225.0 |
6 | 2008-08-01 02:30:00 | 19.1 | 66.95000 | 121.883333 | 1014.416667 | 1.133333 | 225.0 |
... | ... | ... | ... | ... | ... | ... | ... |
225248 | 2021-04-25 23:00:00 | 3.6 | 61.00000 | -32.000000 | 1028.000000 | 1.400000 | 45.0 |
225249 | 2021-04-25 23:30:00 | 3.6 | 64.00000 | -26.000000 | 1028.000000 | 2.600000 | 45.0 |
225250 | 2021-04-26 00:00:00 | 3.6 | 58.00000 | -39.000000 | 1028.000000 | 4.300000 | 45.0 |
225251 | 2021-04-26 00:30:00 | 3.2 | 62.00000 | -34.000000 | 1027.000000 | 5.400000 | 45.0 |
225252 | 2021-04-26 01:00:00 | 3.2 | 62.00000 | -34.000000 | 1027.000000 | 4.200000 | 45.0 |
225251 rows × 7 columns
The data must be reformatted before model building.
The following steps are carried out:
The `wind.bearing.mean` column gives wind direction in degrees but is categorised in 45 degree increments, i.e. 0, 45, 90, 135, 180, 225, 270, 315. Wind direction shouldn't matter if the wind is not blowing.
The distribution of wind direction and speed looks like this:
plt.hist2d(df['wind.bearing.mean'], df['wind.speed.mean'], bins = (50, 50), vmax = 400)
plt.colorbar()
plt.xlabel('Wind Direction - degrees')
plt.ylabel('Wind Velocity - Knots')
plt.title('Wind bearing and velocity');
Convert wind direction and speed to x and y vectors, so the model can more easily interpret them.
wv = df['wind.speed.mean']
# Convert to radians
wd_rad = df['wind.bearing.mean'] * np.pi / 180
# Calculate the wind x and y components
df['wind.x'] = wv * np.cos(wd_rad)
df['wind.y'] = wv * np.sin(wd_rad)
df_orig = df.copy() # copy, otherwise later column additions also appear in df_orig
plt.hist2d(df['wind.x'], df['wind.y'], bins = (50, 50), vmax = 400)
plt.colorbar()
plt.xlabel('Wind X - Knots')
plt.ylabel('Wind Y - Knots')
plt.title('Wind velocity vectors');
Better, but not ideal. Data augmentation with the mixup method is carried out below.
From the mixup paper: "mixup trains a neural network on convex combinations of pairs of examples and their labels".
Further details on how I apply mixup to time series are included in the Window data section of my keras_mlp_fcn_resnet_time_series.ipynb notebook.
Here is an illustration of the improvement in wind velocity sparsity with mixup augmentation.
def mixup(data, alpha = 1.0, factor = 1):
    batch_size = len(data) - 1
    data['epoch'] = data.index.astype(np.int64) // 10**9
    # Randomly sample lambda values from a beta distribution
    l = np.random.beta(alpha, alpha, batch_size * factor)
    X_l = l.reshape(batch_size * factor, 1)
    # Get a pair of inputs and outputs (consecutive observations)
    y1 = data['y'].shift(-1).dropna()
    y1_ = pd.concat([y1] * factor)
    y2 = data['y'][0:batch_size]
    y2_ = pd.concat([y2] * factor)
    X1 = data.drop(columns = 'y').shift(-1).dropna()
    X1_ = pd.concat([X1] * factor)
    X2 = data.drop(columns = 'y')
    X2 = X2[0:batch_size]
    X2_ = pd.concat([X2] * factor)
    # Perform mixup: convex combinations of the pairs
    X = X1_ * X_l + X2_ * (1 - X_l)
    y = y1_ * l + y2_ * (1 - l)
    df = pd.DataFrame(y).join(X)
    # DataFrame.append is deprecated; use pd.concat instead
    df = pd.concat([data, df]).sort_values('epoch', ascending = True)
    df = df.drop(columns = 'epoch')
    df = df.drop_duplicates(keep = False)
    return df
df_mix = mixup(df.loc[:, ['y','wind.x','wind.y']], factor = 2)
plt.hist2d(df_mix['wind.x'], df_mix['wind.y'], bins = (50, 50), vmax = 400)
plt.colorbar()
plt.xlabel('Wind X - Knots')
plt.ylabel('Wind Y - Knots')
plt.title('Wind velocity with mixup data augmentation');
Mixup smooths out the categorical character of the wind velocity data. Unfortunately, if outliers are present their influence will likely be reinforced.
For now, I'm not using the wind vectors or mixup augmentation with any of the baseline methods. See my keras_mlp_fcn_resnet_time_series.ipynb notebook for an illustration of the substantial improvement mixup provides with this dataset.
Convert `ds` timestamps to "time of day" and "time of year" variables using `sin` and `cos` functions.
# Convert to secs
date_time = pd.to_datetime(df['ds'], format = '%Y.%m.%d %H:%M:%S')
timestamp_s = date_time.map(datetime.datetime.timestamp)
day = 24 * 60 * 60
year = (365.2425) * day
df['day.sin'] = np.sin(timestamp_s * (2 * np.pi / day))
df['day.cos'] = np.cos(timestamp_s * (2 * np.pi / day))
df['year.sin'] = np.sin(timestamp_s * (2 * np.pi / year))
df['year.cos'] = np.cos(timestamp_s * (2 * np.pi / year))
plt.plot(np.array(df['day.sin'])[49:98])
plt.plot(np.array(df['day.cos'])[49:98])
plt.xlabel('Time')
plt.legend(['sin', 'cos'], loc = 'lower right')
plt.title('Time of day signal');
The yearly time components may benefit from a single phase shift so they align with the seasonal temperature peak around the end of July and temperature trough around the end of January. Similarly, the daily components may benefit from small daily phase shifts.
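A phase shift is just an offset inside the sin/cos arguments; a sketch, where `peak_day` is an assumed tuning parameter (nothing is fitted here):

```python
import numpy as np

day = 24 * 60 * 60
year = 365.2425 * day

def shifted_year_components(timestamp_s, peak_day=209):
    """Yearly components shifted so the cosine peaks near
    day-of-year `peak_day` (~end of July) instead of 1 January."""
    angle = (timestamp_s - peak_day * day) * (2 * np.pi / year)
    return np.sin(angle), np.cos(angle)
```

The daily components could take a similar, much smaller, offset.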
I use these time components with the multi-variate VAR method.
I use data from 2018 for validation, 2019 for testing and the remaining data for training. These are entirely arbitrary choices. This results in an approximate 84%, 8%, 8% split for the training, validation, and test sets respectively.
keep_cols = ['y', 'humidity', 'dew.point', 'pressure', 'wind.x', 'wind.y',
'day.sin', 'day.cos', 'year.sin', 'year.cos', 'ds']
df['year'] = df['ds'].dt.year
train_df = df.loc[(df['year'] != 2018) & (df['year'] != 2019)]
valid_df = df.loc[df['year'] == 2018]
test_df = df.loc[df['year'] == 2019]
train_df = train_df.drop('year', axis = 1) # inplace = True gives SettingWithCopyWarning
valid_df = valid_df.drop('year', axis = 1) # ...
test_df = test_df.drop('year', axis = 1)
df = df.drop('year', axis = 1)
print("df.drop shape: ", df.shape)
print("train shape: ", train_df.shape)
print("valid shape: ", valid_df.shape)
print("test shape: ", test_df.shape)
plt.figure(figsize = (15, 8))
plt.plot(train_df.ds, train_df.y)
plt.plot(test_df.ds, test_df.y)
plt.title('Temperature - C')
plt.legend(['train', 'test'])
plt.show()
plt.figure(figsize = (15, 8))
plt.plot(test_df.ds, test_df.y, color='orange')
plt.title('Temperature - C (test data)')
plt.show();
df.drop shape:  (225251, 13)
train shape:  (190023, 13)
valid shape:  (17655, 13)
test shape:  (17573, 13)
The training data is used to calculate the seasonal average values.
This notebook is being developed on Google Colab primarily with the statsmodels package.
I will not use complicated methods just for comparison of baseline methods. Also, I don't want to spend a great deal of time optimising hyperparameter settings. This means I will use simpler methods and default settings as far as possible.
Model diagnostics will not be checked. Some of the models built may have issues.
Runtime may become an issue for complicated methods.
Univariate baseline methods:
I would also like to look at results from the AutoReg and STL functions, but they are unavailable in the version of statsmodels (0.10.2) installed on Google Colab (as of 28/04/21). The alternative AR function is deprecated. ARIMA models would be the next logical choice but often require model checking and tuning. Installing packages on Google Colab is straightforward. Upgrading packages may also be straightforward, but it's probably more productive to move on to multivariate baselines instead.
Multivariate baseline methods:
Forecast horizons: denoted `h` in the interim results tables. Some forecast methods, such as the simple average and seasonal average, are limited to single-step forecasts.
Metrics considered:
Firstly, the mean baseline:
# sa - simple average
sa = train_df.y.mean()
rmse_sa = rmse(test_df.y, sa)
mae_sa = medianabs(test_df.y, sa) # medianabs is the *median* absolute error, labelled "mae" throughout
print("simple average:", round(sa, 2))
print("rmse:", round(rmse_sa, 2))
print("mae: ", round(mae_sa, 2))
cols = ['type', 'method', 'metric', 'horizon', 'value']
metrics_h = [] # h for horizon
for i in range(1, 49):
metrics_h.append(dict(zip(cols, ['univariate', 'mean', 'rmse', i, rmse_sa])))
metrics_h.append(dict(zip(cols, ['univariate', 'mean', 'mae', i, mae_sa])))
metrics = pd.DataFrame(metrics_h, columns = cols)
simple average: 10.09
rmse: 6.42
mae:  4.89
The mean baseline has equivalent metric values across all forecast horizons.
Secondly, the naive baseline:
def get_naive_metrics(data, horizon, method):
y_hat = data.copy()
y_hat['naive'] = y_hat.y.shift(horizon)
y_hat = y_hat.iloc[horizon:] # remove na value
naive_rmse = rmse(y_hat.y, y_hat.naive)
naive_mae = medianabs(y_hat.y, y_hat.naive)
return [naive_rmse, naive_mae]
def plot_metrics(metrics, main_title):
fig, axs = plt.subplots(1, 2, figsize = (14, 7))
fig.suptitle(main_title)
axs = axs.ravel() # APL ftw!
methods = metrics.method.unique()
for method in methods:
df = metrics.query('metric == "rmse" & method == "%s"' % method)
axs[0].plot(df.horizon, df.value)
axs[0].set_xlabel("horizon - hours")
axs[0].set_ylabel("rmse")
axs[0].legend(methods)
for method in methods:
df = metrics.query('metric == "mae" & method == "%s"' % method)
axs[1].plot(df.horizon, df.value)
axs[1].set_xlabel("horizon - hours")
axs[1].set_ylabel("mae")
axs[1].legend(methods);
def update_metrics(metrics, test_data, method, get_metrics, model = None):
    metrics_h = []
    if method in ['SES', 'HWES']:
        horizons = [1] + list(range(4, 49, 4))
    else:
        horizons = list(range(1, 49))
    variates = 'multivariate' if method in ['VAR'] else 'univariate'
    print("h\trmse\tmae")
    for h in horizons:
        # use the test_data parameter, not the global test_df
        if method in ['VAR']:
            rmse_h, mae_h = get_metrics(test_data, h, method, model)
        else:
            rmse_h, mae_h = get_metrics(test_data, h, method)
        metrics_h.append(dict(zip(cols, [variates, method, 'rmse', h, rmse_h])))
        metrics_h.append(dict(zip(cols, [variates, method, 'mae', h, mae_h])))
        if h in [1, 4, 48]:
            print(h, "\t", round(rmse_h, 2), "\t", round(mae_h, 2))
    print("\n")
    metrics_method = pd.DataFrame(metrics_h, columns = cols)
    # DataFrame.append is deprecated; use pd.concat instead
    metrics = pd.concat([metrics, metrics_method], ignore_index = True)
    return metrics
metrics = update_metrics(metrics, test_df, 'naive', get_naive_metrics)
plot_metrics(metrics, "Univariate Baseline Comparison - 2019 test data")
h	rmse	mae
1	0.66	0.4
4	1.86	0.9
48	3.08	2.0
Obviously, the naive method is an improvement over the simple average.
The daily seasonality is clearly evident in the naive method.
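A seasonal-naive variant, which predicts the value from a whole number of days back rather than `horizon` steps back, would sidestep that daily cycle. A sketch (my addition, not one of the baselines above), assuming 48 observations per day:

```python
import numpy as np
import pandas as pd

def seasonal_naive_metrics(y, horizon, period=48):
    """Predict the observation from the most recent whole period
    at least `horizon` steps in the past."""
    shift = period * int(np.ceil(horizon / period))  # whole days back
    err = (y - y.shift(shift)).dropna()
    return np.sqrt((err ** 2).mean()), err.abs().median()
```

On a perfectly daily-periodic series this variant has zero error at every horizon, which is exactly where the plain naive method struggles.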
Third, seasonal average:
train_df['doy'] = train_df.ds.dt.strftime('%j')
train_df['hms'] = train_df.ds.dt.strftime('%H:%M:%S')
# ha - historical average
ha = pd.DataFrame(train_df.groupby(['doy', 'hms']).y.mean())
ha.reset_index(inplace=True)
rmse_ha = rmse(test_df.y[:17568], ha.y)
mae_ha = medianabs(test_df.y[:17568], ha.y)
print("rmse:", round(rmse_ha, 2))
print("mae: ", round(mae_ha, 2))
metrics_h = [] # h for horizon
for h in range(1, 49):
metrics_h.append(dict(zip(cols, ['univariate', 'seasonal mean', 'rmse', h, rmse_ha])))
metrics_h.append(dict(zip(cols, ['univariate', 'seasonal mean', 'mae', h, mae_ha])))
metrics_ha = pd.DataFrame(metrics_h, columns = cols)
metrics = pd.concat([metrics, metrics_ha], ignore_index = True) # DataFrame.append is deprecated
plot_metrics(metrics, "Univariate Baseline Comparison - 2019 test data")
rmse: 4.65
mae:  2.97
The historical average, or seasonal average, is superior to the simple average. It also illustrates how the naive method is struggling with the daily seasonality component.
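Forecasting with the historical average is then just a lookup keyed on day-of-year and time-of-day; a minimal sketch of how the `ha` table built above could be applied to new timestamps:

```python
import pandas as pd

def ha_forecast(ha, timestamps):
    """Look up the seasonal average for each timestamp.
    `ha` has columns doy, hms and y, as built above."""
    key = pd.DataFrame({
        'doy': timestamps.dt.strftime('%j'),
        'hms': timestamps.dt.strftime('%H:%M:%S'),
    })
    return key.merge(ha, on=['doy', 'hms'], how='left')['y']
```

The forecast is the same at every horizon, which is why its metric values are constant across `h`.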
Fourth, simple exponential smoothing:
Uses the `SimpleExpSmoothing` function.

%%time
def es_rolling_cv(data, horizon, method):
i = 0
x = 48 # initial training window
h = horizon
rmse_roll, mae_roll = [], []
while (i + x + h) < len(data):
train_ts = data[(i):(i + x)].y.values
test_ts = data[(i + x):(i + x + h)].y.values
# print("train:", train_ts, "\ntest:", test_ts) # DEBUG
if method == 'SES':
model_roll = SimpleExpSmoothing(train_ts).fit()
elif method == 'HWES':
model_roll = ExponentialSmoothing(train_ts,
seasonal_periods = 48,
seasonal = 'add',
).fit()
y_hat = model_roll.forecast(h)
# print("yhat:", y_hat) # DEBUG
rmse_i = rmse(test_ts, y_hat)
mae_i = medianabs(test_ts, y_hat)
# print("rmse:", rmse_i, "\nmae:", mae_i, "\n") # DEBUG
rmse_roll.append(rmse_i)
mae_roll.append(mae_i)
i = i + 1
return [np.mean(rmse_roll).round(2), np.mean(mae_roll).round(2)]
metrics = update_metrics(metrics, test_df, 'SES', es_rolling_cv)
plot_metrics(metrics, "Univariate Baseline Comparison - 2019 test data")
h	rmse	mae
1	0.44	0.44
4	1.02	0.9
48	4.0	3.23
CPU times: user 15min 25s, sys: 47 s, total: 16min 12s
Wall time: 15min 1s
Clearly, the SES method is an improvement over the naive method for forecast horizons less than approx. 18 hours.
It may be possible to optimise the SES method by setting the initial training window length (`train_window`) and/or the `initialization_method` parameter. See the statsmodels docs for `initialization_method` options.
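Under the hood SES is a one-line recursion, with `SimpleExpSmoothing` estimating the smoothing parameter by maximum likelihood; a hand-rolled sketch with `alpha` fixed rather than estimated:

```python
def ses_forecast(y, alpha=0.2):
    """Simple exponential smoothing: the h-step forecast is flat at
    the final smoothed level. The level is initialised to the first
    observation, roughly the 'heuristic' initialization_method."""
    level = y[0]
    for obs in y[1:]:
        level = alpha * obs + (1 - alpha) * level
    return level
```

The flat forecast explains why SES degrades quickly at longer horizons: it carries no seasonal shape at all.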
Fifth, Holt-Winters exponential smoothing:
- `seasonal_periods` is set to 48 (one day of half-hourly observations)
- the `ExponentialSmoothing` function does not support multiple seasonalities
- `seasonal` is set to additive
- `trend` and `damped_trend` are left at their default values of `None`
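The additive-seasonal update (no trend, matching the settings above) can be sketched by hand; `alpha` and `gamma` are fixed here, whereas `ExponentialSmoothing` estimates them:

```python
import numpy as np

def hw_additive_forecast(y, m, alpha=0.2, gamma=0.1):
    """One-step-ahead additive Holt-Winters with no trend.
    m is the season length (seasonal_periods)."""
    level = float(np.mean(y[:m]))                    # initial level
    season = np.asarray(y[:m], dtype=float) - level  # initial seasonals
    for t in range(m, len(y)):
        prev = level
        level = alpha * (y[t] - season[t % m]) + (1 - alpha) * level
        season[t % m] = gamma * (y[t] - prev) + (1 - gamma) * season[t % m]
    return level + season[len(y) % m]
```

The seasonal terms are what let HWES hold its accuracy at long horizons where naive and SES degrade.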
%%time
metrics = update_metrics(metrics, test_df, 'HWES', es_rolling_cv)
plot_metrics(metrics, "Univariate Baseline Comparison - 2019 test data")
/usr/local/lib/python3.7/dist-packages/statsmodels/tsa/holtwinters.py:924: RuntimeWarning: divide by zero encountered in log
/usr/local/lib/python3.7/dist-packages/statsmodels/tsa/holtwinters.py:929: RuntimeWarning: invalid value encountered in double_scalars
/usr/local/lib/python3.7/dist-packages/statsmodels/tsa/holtwinters.py:930: RuntimeWarning: divide by zero encountered in log

h	rmse	mae
1	2.43	2.43
4	2.5	2.42
48	2.8	2.24
CPU times: user 30min 31s, sys: 1min 13s, total: 31min 44s
Wall time: 30min 15s
On the whole, HWES is an improvement over the other methods even if short horizon forecasts are worse than the naive and SES methods.
The daily cyclic component present in the naive and SES results is not an issue with the HWES method as would be expected.
It's puzzling that the mean absolute error decreases slightly as the horizon increases. Adding error bars may partially address this issue.
It's also worth noting the `use_boxcox` and `initialization_method` options.

Two models are considered:
VAR models capture the linear interdependencies among multiple time series.
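At its core a VAR forecast is a linear recursion; a minimal VAR(1) sketch with made-up coefficients (the model fitted below uses many more lags plus exogenous terms):

```python
import numpy as np

# y_t = c + A @ y_{t-1} + noise; forecasting sets the noise to zero
A = np.array([[0.9, 0.05],
              [0.1, 0.8]])   # made-up coefficient matrix
c = np.array([0.5, 1.0])     # made-up intercepts

def var1_forecast(y_last, steps):
    preds, y = [], np.asarray(y_last, dtype=float)
    for _ in range(steps):
        y = c + A @ y
        preds.append(y)
    return np.array(preds)
```

The off-diagonal entries of `A` are the interdependencies: each series' forecast draws on the others' past values.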
Firstly, VAR:
- temperature (`y`), `humidity`, `dew.point` and `pressure` as endogenous variables
- no `wind.x` or `wind.y` in the model (it performs better without the `wind.x` and `wind.y` measurements)
- `day.cos`, `day.sin`, `year.cos` and `year.sin` as exogenous variables
- forecasts are evaluated on `y`
from statsmodels.tsa.stattools import grangercausalitytests
maxlag = 12
test = 'ssr_chi2test'
def grangers_causation_matrix(data, variables, test='ssr_chi2test', verbose=False):
    """Check Granger causality of all possible combinations of the time series.
    The rows are the response variables, the columns are the predictors. The values
    in the table are p-values. A p-value below the significance level (0.05) means
    the null hypothesis, that the coefficients of the corresponding past values are
    zero (i.e. that X does not Granger-cause Y), can be rejected.
    data : pandas dataframe containing the time series variables
    variables : list containing names of the time series variables.
    From https://www.machinelearningplus.com/time-series/vector-autoregression-examples-python/
    """
df = pd.DataFrame(np.zeros((len(variables), len(variables))), columns=variables, index=variables)
for c in df.columns:
for r in df.index:
test_result = grangercausalitytests(data[[r, c]], maxlag=maxlag, verbose=False)
p_values = [round(test_result[i + 1][0][test][1], 4) for i in range(maxlag)]
if verbose: print(f'Y = {r}, X = {c}, P Values = {p_values}')
min_p_value = np.min(p_values)
df.loc[r, c] = min_p_value
df.columns = [var + '_x' for var in variables]
df.index = [var + '_y' for var in variables]
return df
vars = ['y', 'humidity', 'dew.point', 'pressure'] # ,'wind.x', 'wind.y']
gcm = grangers_causation_matrix(test_df, vars)
display(gcm)
/usr/local/lib/python3.7/dist-packages/statsmodels/base/model.py:1752: ValueWarning: covariance of constraints does not have full rank. The number of constraints is 6, but rank is 1 'rank is %d' % (J, J_), ValueWarning)
y_x | humidity_x | dew.point_x | pressure_x | |
---|---|---|---|---|
y_y | 1.0 | 0.0000 | 0.0 | 0.0 |
humidity_y | 0.0 | 1.0000 | 0.0 | 0.0 |
dew.point_y | 0.0 | 0.0000 | 1.0 | 0.0 |
pressure_y | 0.0 | 0.0017 | 0.0 | 1.0 |
WARNING: The function used to check Granger causality does not take multiple testing problems into account - ignore the questionable documentation regarding significance tests.
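One crude fix (my addition, not part of the analysis above) is a Bonferroni correction, shrinking the significance threshold by the number of tests behind the matrix:

```python
def bonferroni_threshold(alpha, n_vars, maxlag):
    """Each off-diagonal cell is the minimum of `maxlag` p-values and
    there are n_vars * (n_vars - 1) such cells, so every cell
    effectively summarises that many tests."""
    n_tests = n_vars * (n_vars - 1) * maxlag
    return alpha / n_tests
```

With 4 variables and maxlag 12 that is 144 tests, so the 0.05 cutoff shrinks to roughly 0.00035. Bonferroni is conservative, but the p-values above are so small it would not change the conclusions.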
Onwards to model building:
%%time
# build model on train_df
endo_vars = ['y', 'humidity', 'dew.point', 'pressure'] # better model without wind.x & wind.y
exog_vars = ['day.cos', 'day.sin', 'year.cos', 'year.sin']
endo_df = train_df[endo_vars]
exog_df = train_df[exog_vars]
var_model = VAR(endog = endo_df, exog = exog_df)
var_fit = var_model.fit(maxlags = 12, ic = 'aic')
print(var_fit.summary())
# rolling_cv with pre-trained model
def var_rolling_cv(data, horizon, method, model):
lags = model.k_ar # lag order
i = lags
h = horizon
rmse_roll, mae_roll = [], []
endo_vars = ['y', 'humidity', 'dew.point', 'pressure']
exog_vars = ['day.cos', 'day.sin', 'year.cos', 'year.sin']
#print("\ti:", i, "\th:", h, "\tlags:", lags) # DEBUG
while (i + h) < len(data):
test_df = data[endo_vars].iloc[i:(i + h)]
endo_df = data[endo_vars].iloc[(i - lags):i].values
exog_df = data[exog_vars].iloc[i:(i + h)]
#print("\tendo:", endo_df.shape, "\texog:", exog_df.shape) # DEBUG
# display(endo_df) # DEBUG
y_hat = model.forecast(endo_df, exog_future = exog_df, steps = h)
#print("yhat:", y_hat) # DEBUG
preds = pd.DataFrame(y_hat, columns = endo_vars)
rmse_i = rmse(test_df.y, preds.y)
mae_i = medianabs(test_df.y, preds.y)
# print("rmse:", rmse_i, "\nmae:", mae_i, "\n") # DEBUG
rmse_roll.append(rmse_i)
mae_roll.append(mae_i)
i = i + 1
return [np.mean(rmse_roll).round(2), np.mean(mae_roll).round(2)]
metrics = update_metrics(metrics, test_df, 'VAR', var_rolling_cv, var_fit)
/usr/local/lib/python3.7/dist-packages/statsmodels/tsa/base/tsa_model.py:215: ValueWarning: An unsupported index was provided and will be ignored when e.g. forecasting. ' ignored when e.g. forecasting.', ValueWarning)
Summary of Regression Results
==================================
Model: VAR    Method: OLS
Date: Fri, 10, Sep, 2021    Time: 17:18:26
No. of Equations: 4    Nobs: 190011
AIC: 4.66226    BIC: 4.67359    HQIC: 4.66560
Log likelihood: -1.52118e+06    FPE: 105.875    Det(Omega_mle): 105.757

[Per-equation coefficient tables for y, humidity, dew.point and pressure omitted: each regresses on the four exogenous time components plus lags 1 to 11 of all four endogenous variables.]

Correlation matrix of residuals
              y        humidity  dew.point  pressure
y          1.000000  -0.402649   0.169509  -0.047301
humidity  -0.402649   1.000000   0.640355   0.011679
dew.point  0.169509   0.640355   1.000000  -0.011011
pressure  -0.047301   0.011679  -0.011011   1.000000

h	rmse	mae
1	0.39	0.39
4	0.75	0.66
48	2.45	1.89
CPU times: user 54min 35s, sys: 34.9 s, total: 55min 10s
Wall time: 54min 55s
plot_metrics(metrics, "Multivariate Baseline Comparison - 2019 test data")
Forecasts from the multivariate VAR model surpass all univariate methods in this notebook at every horizon tested. The combination of endogenous measurements and exogenous time components is clearly superior to temperature measurements alone. Additionally, the VAR model may have utility for predicting the other endogenous variables.
Tuning the lags is likely the most promising way to further improve VAR forecast accuracy. It would be surprising if lags in multiples of 48 (one day at 30-minute intervals) were not significant, at least for the previous 2 or 3 days and perhaps up to a week. That is, the next set of lags to try would include combinations from: {1-12, 1-24, 1-48} and {48, 96, 144, 192, 240, 288, 336}.
Optimising time component frequencies and phase shifts would likely help reduce forecast errors, and would also be applicable to the other methods.
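A sketch of what tunable time components might look like: sine/cosine features at a chosen period with an explicit phase shift that could be optimised, e.g. by grid search. The function name and parameters are illustrative, not from the notebook:

```python
import numpy as np

def time_components(n, period=48, phase=0.0):
    """Return sin/cos features of length n for one period and phase shift."""
    t = np.arange(n)
    angle = 2 * np.pi * t / period + phase
    return np.column_stack([np.sin(angle), np.cos(angle)])

# Two days of daily-period features, shifted by pi/6 radians
X = time_components(96, period=48, phase=np.pi / 6)
print(X.shape)  # (96, 2)
```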
Here is a brief summary of univariate and multivariate methods plus some potential future work.
Univariate methods:
Multivariate methods:
Univariate compared to multivariate methods:
In conclusion, the VAR forecasts make a good baseline to beat.
Future work could include:
Python and Jupyter versions, plus the modules imported and their version strings. This is a poor man's Python equivalent of R's sessionInfo().
Code for listing imported modules and versions is adapted from this Stack Overflow answer. There are simpler alternatives, such as watermark, but they all require installation.
import sys
import IPython
print("Python version:")
print(sys.executable)
print(sys.version)
print("\nIPython version:")
print(IPython.__version__)
Python version: /usr/bin/python3 3.7.11 (default, Jul 3 2021, 18:01:19) [GCC 7.5.0] IPython version: 5.5.0
import pkg_resources
import types
def get_imports():
    for name, val in globals().items():
        if isinstance(val, types.ModuleType):
            # Split ensures you get the root package,
            # not just the imported function
            name = val.__name__.split(".")[0]
        elif isinstance(val, type):
            name = val.__module__.split(".")[0]
        # Some packages are weird and have different
        # imported names vs. system/pip names. Unfortunately,
        # there is no systematic way to get pip names from
        # a package's imported name. You'll have to add
        # exceptions to this list manually!
        poorly_named_packages = {
            "PIL": "Pillow",
            "sklearn": "scikit-learn",
        }
        if name in poorly_named_packages.keys():
            name = poorly_named_packages[name]
        yield name
imports = list(set(get_imports()))
# The only way I found to get the version of the root package
# from only the name of the package is to cross-check the names
# of installed packages vs. imported packages
requirements = []
for m in pkg_resources.working_set:
if m.project_name in imports and m.project_name != "pip":
requirements.append((m.project_name, m.version))
reqs = pd.DataFrame(requirements, columns = ['name', 'version'])
print("Imported modules:")
reqs.style.hide_index()
Imported modules:
| name | version |
|---|---|
| statsmodels | 0.10.2 |
| pandas | 1.1.5 |
| numpy | 1.19.5 |
| matplotlib | 3.2.2 |
!date
Fri Sep 10 18:25:17 UTC 2021
Archive the code, markdown, history and formatted notebooks.
Assumes all PDF, HTML, LaTeX etc. dependencies are installed.
WARNING: will overwrite existing files.
The notebook name is hardcoded below because the alternative is ghastly globs of unreliable JavaScript or external IPython libraries I'd prefer to avoid installing :-(
from time import sleep
notebook = "cammet_baselines_2021.ipynb"
# !jupyter nbconvert --to script {notebook}
# !jupyter nbconvert --execute --to html {notebook}
# !jupyter nbconvert --execute --to pdf {notebook}
# !jupyter nbconvert --to pdf {notebook}
%rm history.txt
%history -f history.txt
!jupyter nbconvert --to python {notebook}
sleep(5)
!jupyter nbconvert --to markdown {notebook}
sleep(5)
!jupyter nbconvert --to html {notebook}