Created by: Colin Kälin and Alessio Baccelli, January/February 2020
Blog post: "Energy Demand Prediction in Switzerland with giotto-time" https://towardsdatascience.com/getting-started-with-giotto-time-d9b2088d60ca Summary: This notebook provides a tutorial for time series analysis with giotto-time. The goal is to predict the energy demand of the swissgrid network 21 days in advance.
Data: We are using hourly energy demand data from swissgrid, a Swiss transmission grid operator. The data is available here https://www.swissgrid.ch/en/home/operation/grid-data/generation.html and processed to represent the mean hourly demand per day.
Contents:
# giotto-time
# Feature creation
from gtime.compose import FeatureCreation
from gtime.feature_generation import Calendar, PeriodicSeasonal
from gtime.feature_extraction import (Detrender, CustomFeature, Polynomial,
Shift, MovingAverage, Exogenous)
# Causality testing
from gtime.causality import ShiftedLinearCoefficient
# Models
from gtime.forecasting import GAR, GARFF
from gtime.model_selection import FeatureSplitter, horizon_shift
from gtime.metrics import max_error, smape
from gtime.regressors import LinearRegressor
# Other libraries
# Data handling
import pandas as pd
import numpy as np
# Machine Learning
from sklearn.ensemble import (RandomForestRegressor, GradientBoostingRegressor, ExtraTreesRegressor,
AdaBoostRegressor)
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, max_error, r2_score
from src.scores import relative_mean_absolute_error, calculate_score, relative_mean_squared_error, highlight_top
# Plotting
from src.plotting import plot_time_series
import plotly
data = pd.read_csv('data/raw/energy_demand_2016_2019.csv')
data['datetime'] = [pd.to_datetime(d) for d in data['datetime'].values]
We have to do some preprocessing here.
time_series = data[['Load [MW]', 'datetime']].copy()
time_series.set_index('datetime', inplace=True)
time_series = time_series.loc[~time_series.index.duplicated()] # Due to the change to winter time, some indices were duplicated. Remove them.
new_dates = time_series.iloc[np.where(np.diff(np.array(time_series.index))
!= np.diff(np.array(time_series.index))[0])[0]].index + pd.DateOffset(hours=1) # Due to the change to summer time, some indices were missing. Add them.
new_values = time_series.iloc[np.where(np.diff(np.array(time_series.index)) != np.diff(np.array(time_series.index))[0])[0]].values.flatten()
df_new = pd.DataFrame([new_dates, new_values]).T
df_new.columns = ['date', 'Load [MW]']
df_new.set_index('date', inplace=True)
time_series = time_series.append(df_new)
time_series = time_series.sort_index()
time_series.tail()
Load [MW] | |
---|---|
2019-11-30 20:00:00 | 5082.19 |
2019-11-30 21:00:00 | 5295.39 |
2019-11-30 22:00:00 | 5527.06 |
2019-11-30 23:00:00 | 5610.04 |
2019-12-01 00:00:00 | 5756.34 |
We aggregate the data on a daily basis. The goal is to predict the daily mean of the energy demand.
time_series['date'] = time_series.index.date
time_series = time_series.groupby('date').apply(lambda g: g['Load [MW]'].mean())
time_series.drop(time_series.index.max(), inplace=True)
time_series.index = [pd.to_datetime(d) for d in time_series.index]
time_series = pd.DataFrame(time_series, columns=['Load [MW]'])
Let's plot the full time series.
plot_time_series(time_series,
y_columns=['Load [MW]'],
y_axis_title='mean energy demand per day [MW]',
title='Energy Demand Switzerland 2016-2019')
Before we start doing anything, we have to split the data into training and test part. We will use the data from January 1st 2016 to November 30th 2018 as the training set and the rest, i.e. the data from December 1st 2018 to December 1st 2019 as a test set.
time_series_train = time_series[time_series.index < pd.to_datetime('12-01-2018')]
time_series_test = time_series[time_series.index >= pd.to_datetime('12-01-2018')]
In this part we will
Let's start with the removal of a polynomial trend.
detrend_feature = Detrender(trend="polynomial", trend_x0=np.zeros(3))
detrended_ts = detrend_feature.fit_transform(time_series_train)
As it is possible to see, a constant equal to the mean of the time series (a trend polynomial of order 0) has been removed from the time series.
plot_time_series(detrended_ts,
y_columns=['Load [MW]__Detrender'],
title='Detrended time series',
y_axis_title='mean daily energy demand [MW]')
What about the seasonality? We can use a sine wave with the period as in the season. For this purpose we will use giotto-time's 'PeriodicSeasonalFeature' function and then shift the sine wave. In this case the period is one year.
# Shift the time series such that the mean is zero.
df_detrended = detrended_ts
df_detrended.columns=['detrended']
df_detrended = df_detrended.join(time_series_train)
plot_time_series(df_detrended,
y_columns=['detrended', 'Load [MW]'],
title='Detrended time series',
y_axis_title='daily energy demand [MW]')
# Create the seasonal feature with period = 1 year
year_period = PeriodicSeasonal(start_date=pd.to_datetime('01-01-2015'),
length=3000,
index_period=3000,
period='365 days',
amplitude=1000)
year_wave = year_period.transform()
year_wave = year_wave.shift(275) #Found empirically
year_wave.index = pd.date_range(start='01-01-2015', periods=3000)
The sine wave now follows the seasonality as shownd in the figure below.
df_train = pd.DataFrame(df_detrended['detrended'], columns=['detrended']).copy()
df_train = df_train.join(year_wave)
plot_time_series(df_train,
y_columns=['detrended', '0__PeriodicSeasonal'])
year_wave['day'] = year_wave.index.day
year_wave['month'] = year_wave.index.month
start_day = time_series_test.index.min().day
start_month = time_series_test.index.min().month
mean_value = year_wave[(year_wave['day']==start_day) & (year_wave['month']==start_month)]['0__PeriodicSeasonal']
df_test = pd.DataFrame((time_series_test - time_series_test.values.mean()).values, columns=['detrended']).copy()
df_test.index = time_series_test.index
df_test = df_test.join(year_wave['0__PeriodicSeasonal'])
plot_time_series(df_test,
y_columns=['detrended', '0__PeriodicSeasonal'])
The question we want to answer here is which time series causes another. In order to do this, we will fix the time series we want to make predictions. Then we shift another time series and calculate the correlation between the two to find the shift that gives us the highest correlation. Giotto-time then allows us to transform the dataframe by shifting each column by the best value.
shifts = [('s2', Shift(2), ['Load [MW]']),
('s5', Shift(5), ['Load [MW]'])]
feature_creation = FeatureCreation(shifts)
shifted_df = feature_creation.fit_transform(time_series)
df_causality = pd.concat([time_series, shifted_df], axis='columns', sort=False).fillna(value=0)
df_causality.head(10)
Load [MW] | s2__Load [MW]__Shift | s5__Load [MW]__Shift | |
---|---|---|---|
2016-01-01 | 5500.348928 | 0.000000 | 0.000000 |
2016-01-02 | 5655.343201 | 0.000000 | 0.000000 |
2016-01-03 | 5572.463944 | 5500.348928 | 0.000000 |
2016-01-04 | 6535.289954 | 5655.343201 | 0.000000 |
2016-01-05 | 6619.365039 | 5572.463944 | 0.000000 |
2016-01-06 | 6710.745709 | 6535.289954 | 5500.348928 |
2016-01-07 | 6728.084527 | 6619.365039 | 5655.343201 |
2016-01-08 | 6467.398174 | 6710.745709 | 5572.463944 |
2016-01-09 | 5693.247283 | 6728.084527 | 6535.289954 |
2016-01-10 | 5589.310101 | 6467.398174 | 6619.365039 |
Let's perform a causality test, i.e. let's try to find the shift necessary to optimize the linear fit coefficients.
As the two columns 's2' and 's5' are the same time series as the original one but with a shift of 2 and 5 respectively, in the best case we get a fit coefficient of 1 as can be seen in the dataframe below.
cause = ShiftedLinearCoefficient(target_col="Load [MW]")
cause.fit(df_causality)
cause.max_corrs_
y | Load [MW] | s2__Load [MW]__Shift | s5__Load [MW]__Shift |
---|---|---|---|
x | |||
Load [MW] | 0.729400 | 1.000000 | 1.000000 |
s2__Load [MW]__Shift | 0.509247 | 0.729901 | 1.000000 |
s5__Load [MW]__Shift | 0.428024 | 0.533226 | 0.787478 |
We can also look at the corresponding shifts directly. It turns out the test can successfully find the correct shifts 2 and 5.
cause.best_shifts_
y | Load [MW] | s2__Load [MW]__Shift | s5__Load [MW]__Shift |
---|---|---|---|
x | |||
Load [MW] | 1 | 2 | 5 |
s2__Load [MW]__Shift | 5 | 1 | 3 |
s5__Load [MW]__Shift | 2 | 4 | 1 |
This could be used to align two time series in a way that optimizes the predictive performance of one on the other without having to train and test the full model for each possible shift.
cause.transform(df_causality).head()
Load [MW] | s2__Load [MW]__Shift | s5__Load [MW]__Shift | |
---|---|---|---|
2016-01-01 | 5500.348928 | 5500.348928 | 5500.348928 |
2016-01-02 | 5655.343201 | 5655.343201 | 5655.343201 |
2016-01-03 | 5572.463944 | 5572.463944 | 5572.463944 |
2016-01-04 | 6535.289954 | 6535.289954 | 6535.289954 |
2016-01-05 | 6619.365039 | 6619.365039 | 6619.365039 |
As expected, the time series are aligned again.
# Some parameters for use later: important is the horizon, i.e. the maximal time frame to try to make predictions about
parameters = {'horizon': 21,
'kernel': [1,100, 2],
'test_horizon': 21}
One feature that we are interested in are holidays. Giotto-time makes it possible to create features according to the public holiday calendar. The 'kernel' determines how the values are set on the relevant
calendar = Calendar(region='europe',
country='Switzerland',
kernel=parameters['kernel'],
start_date=time_series_train.index.min())
ts_copy = time_series.copy()
ts_copy.index = ts_copy.index.to_period()
holidays = calendar.fit_transform(ts_copy)
holidays.index = time_series.index
Many other features are available, some of which are used here to create features for the model.
# List of all features
mv_avg_3 = MovingAverage(window_size=3)
mv_avg_7 = MovingAverage(window_size=7)
shift_7 = Shift(shift=7)
shift_1 = Shift(shift=1)
poly_2 = Polynomial(degree=2)
features_creation = FeatureCreation([
('mvg_3', mv_avg_3, ['Load [MW]']),
('mvg_7', mv_avg_7, ['Load [MW]']),
('s1', shift_1, ['Load [MW]']),
('s7', shift_7, ['Load [MW]']),
('pol2_0', poly_2, ['Load [MW]']),
])
X = features_creation.fit_transform(time_series_train)
X = X.join(holidays)
X['yearly_period'] = df_train['0__PeriodicSeasonal']
y = horizon_shift(time_series=time_series_train, horizon=parameters["horizon"])
X_test = features_creation.transform(time_series_test)
y_test = horizon_shift(time_series=time_series_test, horizon=parameters["horizon"])
X_test = X_test.join(holidays)
X_test['yearly_period'] = df_test['0__PeriodicSeasonal']
test_index = X_test.index
X_test = X_test.dropna()
featuresplitter = FeatureSplitter()
X_train, y_train, _, _ = featuresplitter.transform(X, y)
A short explanation for y_train and y_test:
y_train has 21 columns, one for each prediction 1 to 21 days into the future. The same applies to y_test. Here however we have NaNs for the date and time step combination for which no values are available.
y_test.tail(10)
y_1 | y_2 | y_3 | y_4 | y_5 | y_6 | y_7 | y_8 | y_9 | y_10 | ... | y_12 | y_13 | y_14 | y_15 | y_16 | y_17 | y_18 | y_19 | y_20 | y_21 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2019-11-21 | 5794.625081 | 5054.226366 | 5105.656933 | 5790.468850 | 5922.916668 | 5928.284579 | 5936.924728 | 5797.397415 | 5559.360358 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2019-11-22 | 5054.226366 | 5105.656933 | 5790.468850 | 5922.916668 | 5928.284579 | 5936.924728 | 5797.397415 | 5559.360358 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2019-11-23 | 5105.656933 | 5790.468850 | 5922.916668 | 5928.284579 | 5936.924728 | 5797.397415 | 5559.360358 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2019-11-24 | 5790.468850 | 5922.916668 | 5928.284579 | 5936.924728 | 5797.397415 | 5559.360358 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2019-11-25 | 5922.916668 | 5928.284579 | 5936.924728 | 5797.397415 | 5559.360358 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2019-11-26 | 5928.284579 | 5936.924728 | 5797.397415 | 5559.360358 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2019-11-27 | 5936.924728 | 5797.397415 | 5559.360358 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2019-11-28 | 5797.397415 | 5559.360358 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2019-11-29 | 5559.360358 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2019-11-30 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
10 rows × 21 columns
Giotto-time offers the possiblity to create a generalized autoregression model with feed forward (GARFF for short). It's possible to use any scikit-learn friendly estimator, create a GARFF model and then to use the well-known fit and predict methods to train and evaluate the model.
# Define the GARFF models
time_series_model_lr = GARFF(estimator=LinearRegression())
time_series_model_gb = GARFF(estimator=GradientBoostingRegressor(loss='lad', learning_rate=0.1,
n_estimators=100,
max_features='sqrt', random_state=29348))
time_series_model_rf = GARFF(estimator=RandomForestRegressor())
# Train the GARFF models
time_series_model_lr.fit(X_train, y_train)
time_series_model_gb.fit(X_train, y_train)
time_series_model_rf.fit(X_train, y_train)
# Make predictions with the GAR models
predictions_lr = time_series_model_lr.predict(X_test.dropna())
predictions_gb = time_series_model_gb.predict(X_test.dropna())
predictions_rf = time_series_model_rf.predict(X_test.dropna())
# Define the regressors with custom loss function
time_series_model_max = LinearRegressor(loss=max_error)
time_series_model_smape = LinearRegressor(loss=smape)
# Train the custom loss models
x0 = [0] * (X_train.shape[1] + 1)
time_series_model_max.fit(X_train, y_train['y_' + str(parameters['horizon'])], x0=x0)
time_series_model_smape.fit(X_train, y_train['y_' + str(parameters['horizon'])], x0=x0)
# Make predictions with the custom loss models
predictions_max = time_series_model_max.predict(X_test.dropna())
predictions_smape = time_series_model_smape.predict(X_test.dropna())
# Bring the results into the correct format
predictions_max = pd.DataFrame(predictions_max, columns=['y_' + str(parameters['horizon'])], index=X_test.index)
predictions_smape = pd.DataFrame(predictions_smape, columns=['y_' + str(parameters['horizon'])], index=X_test.index)
We can plot the predicted values against the reference values. Note that the values plotted here are the values we would expect 21 days later, i.e. we plot the predictions and reference values for a 21-days forecast.
Note: You can choose to plot another prediction by setting 'results = ... .copy()' to the predictions of another model.
# Plot predictions vs. reality
# Choose the predictions of one of the models trained above
results = predictions_lr.copy()
plot_data = pd.DataFrame(y_test['y_{}'.format(parameters['test_horizon'])].dropna())
plot_data.columns = ['reference']
plot_data = plot_data.join(results['y_{}'.format(parameters['test_horizon'])])
plot_data.index = y_test['y_{}'.format(parameters['test_horizon'])].dropna().index
plot_data = plot_data.dropna()
plot_data.index = plot_data.index + pd.Timedelta('{} days'.format(parameters['horizon'])) #Shift the data back
plot_data.columns = ['reference', 'prediction']
fig = plot_time_series(plot_data,
y_columns=['reference', 'prediction'],
title='Predictions vs. Reference Values with a 21 days horizon',
y_axis_title='mean energy demand per day [MW]')
fig
Note that predictions were made for all time steps not only for the ones that we have data for in y_test. The score calculated below is the relative mean absolute error.
Some remarks: You can clearly see that the model has learned that on Sundays the energy demand is lower than in the rest of the week. Furthermore, it follows the yearly trend pretty closely.
How does this result compare to the baseline model: In order to take the seasonality into account, we determine on the training set the ideal number of days that we have to look back and use the value there.
def find_best_shift(predictions, time_series, metric=relative_mean_absolute_error):
y_preds = predictions.copy()
scores = []
for i in range(max_days):
df = pd.DataFrame([y_preds[i].values, y_train['y_'+str(parameters['test_horizon'])]]).T
df.dropna(inplace=True)
scores.append(metric(y_pred=df[0], y_true=df[1]))
y_preds['ref'] = y_train['y_'+str(parameters['test_horizon'])].values
y_preds.dropna(inplace=True)
if metric!=r2_score:
shift = np.array(scores).argmin()
else:
shift = np.array(scores).argmax()
return shift
max_days = 400
predictions = []
index = []
for i in range(max_days):
indices = (y_train.index - pd.to_timedelta(i, unit='day')).date
index.append(pd.Series(indices))
predictions.append(time_series
.loc[time_series.index.intersection(index[-1])]
.reindex(index[-1])
.values
.flatten())
predictions = pd.DataFrame(predictions).T
shift_mae = find_best_shift(predictions, time_series, metric=relative_mean_absolute_error)
shift_mse = find_best_shift(predictions, time_series, metric=relative_mean_squared_error)
shift_maxerror = find_best_shift(predictions, time_series, metric=max_error)
shift_smape = find_best_shift(predictions, time_series, metric=smape)
shift_r2 = find_best_shift(predictions, time_series, metric=r2_score)
y_naive_mae = time_series.loc[y_test.index - pd.to_timedelta(shift_mae, unit='day')]
y_naive_mse = time_series.loc[y_test.index - pd.to_timedelta(shift_mse, unit='day')]
y_naive_maxerror = time_series.loc[y_test.index - pd.to_timedelta(shift_maxerror, unit='day')]
y_naive_smape = time_series.loc[y_test.index - pd.to_timedelta(shift_smape, unit='day')]
y_naive_r2 = time_series.loc[y_test.index - pd.to_timedelta(shift_r2, unit='day')]
mae_score = calculate_score(y_test['y_{}'.format(parameters['test_horizon'])], y_naive_mae, relative_mean_absolute_error)
mse_score = calculate_score(y_test['y_{}'.format(parameters['test_horizon'])], y_naive_mse, relative_mean_squared_error)
maxerror_score = calculate_score(y_test['y_{}'.format(parameters['test_horizon'])], y_naive_maxerror, max_error)
smape_score = calculate_score(y_test['y_{}'.format(parameters['test_horizon'])], y_naive_smape, smape)
rsquare_score = calculate_score(y_test['y_{}'.format(parameters['test_horizon'])], y_naive_r2, r2_score)
calculate_score(y_test['y_{}'.format(parameters['test_horizon'])], y_naive_mae, relative_mean_absolute_error)
0.09008899901518862
For the GARFF model with linear regression we get the following:
calculate_score(y_test['y_{}'.format(parameters['test_horizon'])],
predictions_lr['y_{}'.format(parameters['test_horizon'])],
relative_mean_absolute_error)
0.06849262080244989
Let's compare all the models trained above with different metrics. (Top results per metric are colored in yellow.)
metrics = ['rel. mean abs. error', 'rel. mean squared error', 'max. error', 'SMAPE', 'coeff. of determination']
models = ['baseline', 'GARFF Random Forest', 'GARFF Gradient Boosting', 'GARFF Lin. Reg.', 'Lin. Reg. with max error loss', 'Lin. Reg. with SMAPE loss']
predictions = [predictions_rf, predictions_gb, predictions_lr, predictions_max, predictions_smape]
scores = []
baseline = [mae_score, mse_score, maxerror_score, smape_score, rsquare_score]
scores.append(baseline)
for pred in predictions:
scores_model = [calculate_score(y_test['y_{}'.format(parameters['test_horizon'])],
pred['y_{}'.format(parameters['test_horizon'])], relative_mean_absolute_error),
calculate_score(y_test['y_{}'.format(parameters['test_horizon'])],
pred['y_{}'.format(parameters['test_horizon'])], relative_mean_squared_error),
calculate_score(y_test['y_{}'.format(parameters['test_horizon'])],
pred['y_{}'.format(parameters['test_horizon'])], max_error),
calculate_score(y_test['y_{}'.format(parameters['test_horizon'])],
pred['y_{}'.format(parameters['test_horizon'])], smape),
calculate_score(y_test['y_{}'.format(parameters['test_horizon'])],
pred['y_{}'.format(parameters['test_horizon'])], r2_score)
]
scores.append(scores_model)
scores = pd.DataFrame(np.array(scores), index=models, columns=metrics)
scores = scores.style.apply(highlight_top)
scores
rel. mean abs. error | rel. mean squared error | max. error | SMAPE | coeff. of determination | |
---|---|---|---|---|---|
baseline | 0.090089 | 0.01132 | 466.479 | 0.0868972 | 0.569792 |
GARFF Random Forest | 0.0856635 | 0.011686 | 1555.62 | 0.0839646 | 0.618687 |
GARFF Gradient Boosting | 0.08754 | 0.0125008 | 1381.25 | 0.0842262 | 0.619328 |
GARFF Lin. Reg. | 0.0684926 | 0.00842929 | 1484.84 | 0.065473 | 0.769918 |
Lin. Reg. with max error loss | 0.187018 | 0.0484672 | 2322.84 | 0.213068 | -0.553378 |
Lin. Reg. with SMAPE loss | 0.0802092 | 0.0108513 | 1484.89 | 0.0774084 | 0.698311 |
It turns out that the GARFF model with a linear regression base model performs best. It has to be noted that the maximal error very much depends on outliers (1st of August 2019 is an outlier visible in the plot).