Let's import tsboost, pandas and plotly for some visualisation (`pip install plotly --upgrade` if needed):
import tsboost as tsb
import pandas as pd
import numpy as np
np.random.seed(2017)
import plotly as py
import plotly.graph_objs as go
py.offline.init_notebook_mode()
Let's load the air passenger data:
data = pd.read_csv("air_passenger.csv", sep=";")
data.date = pd.to_datetime(data.date, dayfirst=True)
data.tail()
| | date | volume |
|---|---|---|
| 139 | 1960-08-01 | 606 |
| 140 | 1960-09-01 | 508 |
| 141 | 1960-10-01 | 461 |
| 142 | 1960-11-01 | 390 |
| 143 | 1960-12-01 | 432 |
This data represents the evolution of air passenger traffic; let's visualise it:
trace = [go.Scatter(x=data.date, y=data.volume)]
py.offline.iplot(trace)
For a start, let's just forecast 5 years ahead without doing any feature engineering / tuning.
We just need to set some data configuration and specify how many steps ahead we want to forecast:
data_config = {
'target' : "volume", # column we want to forecast
'date' : "date", # column containing the date
}
This is a monthly problem, so 5 years = 12*5 = 60 months:
model = tsb.TSRegressor(horizons=60)
results = model.fit_predict(data, **data_config)
results.tail()
| | date_last_data | horizon | date | forecast |
|---|---|---|---|---|
| 55 | 1960-12-01 | 56 | 1965-08-01 | 916.146024 |
| 56 | 1960-12-01 | 57 | 1965-09-01 | 748.292623 |
| 57 | 1960-12-01 | 58 | 1965-10-01 | 697.493432 |
| 58 | 1960-12-01 | 59 | 1965-11-01 | 590.572526 |
| 59 | 1960-12-01 | 60 | 1965-12-01 | 632.378526 |
The column "date_last_data" represents the starting point from which the forecasts were made; in a production context, it is the date of the last data at our disposal. The column "date" is the effective date of the forecast, and horizon = date - date_last_data (in the time-step unit of the problem, here months).
TSBoost runs in this mode when no cross-validation dates are passed through the "cv_dates" input parameter.
In this case, it is as if we were at 1960-12-01 and forecast the future from that date.
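The horizon arithmetic can be sanity-checked quickly. A small illustrative snippet, using the last row of the results table above:

```python
import pandas as pd

# horizon = date - date_last_data, counted in the problem's time step
# (months here); checked against the last row of the results table.
date_last_data = pd.Timestamp("1960-12-01")
date = pd.Timestamp("1965-12-01")
horizon = (date.year - date_last_data.year) * 12 + (date.month - date_last_data.month)
# horizon == 60, matching the table row with horizon 60
```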
Let's visualise the forecasts we've just made:
traces = [
go.Scatter(x=data.date, y=data.volume, name="data"),
go.Scatter(x=results.date, y=results.forecast, name="forecast"),
]
py.offline.iplot(traces)
TSBoost lets data scientists work in a classical machine learning context, so any feature engineering can be done, as well as meta-parameter tuning.
We can use period cross-validation, which takes time into consideration, to try to validate that the feature engineering & tuning will hold in the future.
We need to create periods of dates. Let's make 2 cross-validation periods: one for feature engineering & tuning validation, and one for final testing:
cv_valid = tsb.TSRegressor.generate_dates(begin_date="1959-01-01", horizon=12, time_step="month")
cv_test = tsb.TSRegressor.generate_dates(begin_date="1960-01-01", horizon=12, time_step="month")
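Assuming `generate_dates` with `time_step="month"` yields consecutive month-start dates (an assumption based on the inputs above), the `cv_valid` period would be equivalent to a plain pandas date range:

```python
import pandas as pd

# Presumed equivalent of generate_dates(begin_date="1959-01-01", horizon=12,
# time_step="month"): 12 consecutive month-start dates.
cv_valid_equiv = pd.date_range(start="1959-01-01", periods=12, freq="MS")
```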
Let's visualise those 2 CV periods:
traces = [
go.Scatter(x=data[data.date < "1959-01-01"].date, y=data[data.date < "1959-01-01"].volume, name="other_data"),
go.Scatter(x=data[data.date.isin(cv_valid)].date, y=data[data.date.isin(cv_valid)].volume, name="cv_valid"),
go.Scatter(x=data[data.date.isin(cv_test)].date, y=data[data.date.isin(cv_test)].volume, name="cv_test"),
]
py.offline.iplot(traces)
Let's build a 12-months-ahead forecasting algorithm, to speed things up:
horizons = 12
Any features can be added to the data. You just have to be careful not to use future information that you wouldn't have had access to in the past.
By default, TSBoost creates lag features from the past of the target we wish to forecast, as well as temporal features from the information contained in the date column (month, year, day of the week, etc.).
The meta parameter "inner_feature_eng" activates or deactivates this behaviour.
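The kind of inner feature engineering described above can be sketched with pandas. This is a simplified guess at the feature set (lags of the target plus calendar features), not TSBoost's actual code:

```python
import pandas as pd

# Illustrative sketch of lag + temporal features, on a tiny sample series.
df = pd.DataFrame({
    "date": pd.date_range("1949-01-01", periods=6, freq="MS"),
    "volume": [112, 118, 132, 129, 121, 135],
})
for lag in (1, 2, 3):
    df[f"lag_{lag}"] = df["volume"].shift(lag)  # past values of the target
df["month"] = df["date"].dt.month               # temporal features from the date
df["year"] = df["date"].dt.year
```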
Let's check whether this feature engineering is relevant for our problem and can help improve accuracy. First, let's test the algorithm without feature engineering on our validation period:
model1 = tsb.TSRegressor(horizons=horizons,
inner_feature_eng=False)
forecast1_cv_valid = model1.fit_predict(data, cv_dates=cv_valid, **data_config)
forecast1_cv_valid.head()
| | date_last_data | horizon | date | forecast |
|---|---|---|---|---|
| 0 | 1958-12-01 | 1 | 1959-01-01 | 344.489314 |
| 1 | 1959-01-01 | 1 | 1959-02-01 | 334.546558 |
| 2 | 1959-02-01 | 1 | 1959-03-01 | 387.062901 |
| 3 | 1959-03-01 | 1 | 1959-04-01 | 380.945582 |
| 4 | 1959-04-01 | 1 | 1959-05-01 | 412.223502 |
If we analyse the output, it looks like the 60-month forecast we made before, but the column "date_last_data" varies, so that we get all 12 horizons for every date in the cross-validation period and can compare those forecasts fairly.
TSBoost has a static function to get the MAPE for each horizon over the period; you can of course use whatever residual analysis you want:
results1_cv_valid = tsb.TSRegressor.get_result(data, forecast1_cv_valid, metric="mape", **data_config)
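For intuition, a per-horizon MAPE can be re-derived by hand. This is a rough, hypothetical re-implementation (`get_result` is the reference); the column names follow the frames shown above:

```python
import pandas as pd

# Hypothetical sketch of a per-horizon MAPE, for intuition only.
def mape_per_horizon(actuals: pd.DataFrame, forecasts: pd.DataFrame) -> pd.Series:
    # Align each forecast with the observed value at the same date.
    merged = forecasts.merge(actuals[["date", "volume"]], on="date")
    merged["ape"] = (merged["forecast"] - merged["volume"]).abs() / merged["volume"]
    # Average the absolute percentage error within each horizon.
    return merged.groupby("horizon")["ape"].mean() * 100  # in percent
```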
Let's do the same, but with the inner feature engineering of TSBoost:
model2 = tsb.TSRegressor(horizons=horizons,
inner_feature_eng=True)
forecast2_cv_valid = model2.fit_predict(data, cv_dates=cv_valid, **data_config)
results2_cv_valid = tsb.TSRegressor.get_result(data, forecast2_cv_valid, metric="mape", **data_config)
Let's visualise how those 2 models performed on the validation period with a graph:
on the x-axis you have the horizon (short-term to long-term forecasts),
on the y-axis you have the MAPE (Mean Absolute Percentage Error) on the validation period data.
# some plotly configs for graphs
layout = go.Layout(xaxis=dict(title= 'horizons'), yaxis=dict(title= 'MAPE'))
traces = [
go.Scatter(x=results1_cv_valid.index, y=results1_cv_valid.forecast, name="model_no_FE"),
go.Scatter(x=results2_cv_valid.index, y=results2_cv_valid.forecast, name="model_with_FE"),
]
fig= go.Figure(data=traces, layout=layout)
py.offline.iplot(fig)
We can see that on this period, adding TSBoost's feature engineering works quite well on almost all horizons.
We also see a global pattern that should appear in almost every time series problem:
the mean error of long-term forecasts is bigger than that of shorter-term ones, following a roughly logarithmic shape (to see it more clearly we could have selected larger cross-validation & test periods).
It's a general rule for most time series problems: it's easier to forecast short term than long term, mainly because of distribution shift: variable distributions have a higher chance of resembling the near-term distribution than the long-term one.
There are two main ways to do tuning:
we can tune TSBoost meta parameters,
or we can tune the optimizer's (xgboost or lightgbm) meta parameters.
By default, TSBoost uses an instance of xgboost.XGBRegressor() with default parameters, but we can pass a non-default instance.
We can also choose lightgbm.LGBMRegressor() as the optimizer. Let's see how xgboost's sibling performs on this problem:
model3 = tsb.TSRegressor(horizons=horizons,
optimizer="lgbm",
inner_feature_eng=True)
forecast3_cv_valid = model3.fit_predict(data, cv_dates=cv_valid, **data_config)
results3_cv_valid = tsb.TSRegressor.get_result(data, forecast3_cv_valid, metric="mape", **data_config)
traces = [
go.Scatter(x=results2_cv_valid.index, y=results2_cv_valid.forecast, name="model_xgboost_with_FE"),
go.Scatter(x=results3_cv_valid.index, y=results3_cv_valid.forecast, name="model_lightgbm_with_FE"),
]
fig= go.Figure(data=traces, layout=layout)
py.offline.iplot(fig)
We can observe that lightgbm performs nicely with the feature engineering.
Note that it is generally better to find good features than to over-tune the algorithm.
We can now use the test CV period to validate (or not) whether our previous feature engineering / tuning has a positive effect on unseen data, and perhaps conclude that it will also hold afterwards (in production usage):
results1 = model1.fit_predict(data, cv_dates=cv_test, **data_config)
results1 = tsb.TSRegressor.get_result(data, results1, metric="mape", **data_config)
results3 = model3.fit_predict(data, cv_dates=cv_test, **data_config)
results3 = tsb.TSRegressor.get_result(data, results3, metric="mape", **data_config)
traces = [
go.Scatter(x=results1.index, y=results1.forecast, name="model1"),
go.Scatter(x=results3.index, y=results3.forecast, name="model3"),
]
fig= go.Figure(data=traces, layout=layout)
py.offline.iplot(fig)
We can see that on the last CV period, the MAPE improvement is globally there (even if the short term could use some investigation).
We can suppose that the feature engineering & tuning should give better results in the future (even if that is not completely certain, because of distribution drift):
results = model3.fit_predict(data, **data_config)
traces = [
go.Scatter(x=data.date, y=data.volume, name="data"),
go.Scatter(x=results.date, y=results.forecast, name="forecast"),
]
py.offline.iplot(traces)
It is important to note that, for this example, we tuned and did feature engineering for all 12 step-ahead models at the same time.
TSBoost creates independent forecasts per step, so each horizon can have its own feature engineering & tuning without affecting the other horizons.
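The direct, one-model-per-horizon strategy described above can be sketched with a toy stand-in learner. This is a conceptual illustration, not TSBoost's internals:

```python
import numpy as np

# "Direct" multi-horizon forecasting: one independent model per horizon h,
# each trained to predict y[t + h] from information available at time t.
# Here each "model" is just a straight-line fit on the time index.
y = np.array([112., 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118])
horizons = [1, 2, 3]
models = {}
for h in horizons:
    t = np.arange(len(y) - h)            # feature: time index t
    models[h] = np.polyfit(t, y[h:], 1)  # target: the series shifted by h
# From the last observed index, each horizon gets its own independent forecast.
t_last = len(y) - 1
forecasts = {h: float(np.polyval(models[h], t_last)) for h in horizons}
```

Because the models are independent, changing the features or parameters of the h=3 model leaves the h=1 and h=2 forecasts untouched.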
If we only need a 3-months-ahead algorithm, for example, we can set horizons like this:
model = tsb.TSRegressor(horizons=[3])
You could also do batches of horizons:
model = tsb.TSRegressor(horizons=[3,4,5])