Let's import tsboost, pandas and plotly for some visualisation (`pip install plotly --upgrade` if needed):
import tsboost as tsb
import pandas as pd
import numpy as np
np.random.seed(2017)
import plotly as py
import plotly.graph_objs as go
py.offline.init_notebook_mode()
Let's load the air passenger data:
data = pd.read_csv("air_passenger.csv", sep=";")
data.date = pd.to_datetime(data.date, dayfirst=True)
data.tail()
| | date | volume |
|---|---|---|
| 139 | 1960-08-01 | 606 |
| 140 | 1960-09-01 | 508 |
| 141 | 1960-10-01 | 461 |
| 142 | 1960-11-01 | 390 |
| 143 | 1960-12-01 | 432 |
This data represents the evolution of air passenger traffic; let's visualise it:
trace = [go.Scatter(x=data.date, y=data.volume)]
py.offline.iplot(trace)
For a start, let's just forecast 5 years ahead without doing any feature engineering / tuning.
We just need to set some data configuration and specify how many steps ahead we want to forecast:
data_config = {
'target' : "volume", # column we want to forecast
'date' : "date", # column containing the date
}
This is a monthly problem, so 5 years = 12*5 = 60 months:
model = tsb.TSRegressor(horizons=60)
results = model.fit_predict(data, **data_config)
results.tail()
| | date_last_data | horizon | date | forecast |
|---|---|---|---|---|
| 55 | 1960-12-01 | 56 | 1965-08-01 | 916.146024 |
| 56 | 1960-12-01 | 57 | 1965-09-01 | 748.292623 |
| 57 | 1960-12-01 | 58 | 1965-10-01 | 697.493432 |
| 58 | 1960-12-01 | 59 | 1965-11-01 | 590.572526 |
| 59 | 1960-12-01 | 60 | 1965-12-01 | 632.378526 |
The column "date_last_data" represents the starting point from which the forecasts were made; in a production context, it is the date of the last data at our disposal. The column "date" is the effective date of the forecast, and horizon = date - date_last_data (in the time-step unit of the problem, here months).
TSBoost runs in this mode when no cross-validation dates are passed through the "cv_dates" input parameter.
In this case, it is as if we were at 1960-12-01 and forecast the future from that date.
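The horizon arithmetic can be sanity-checked quickly. A small illustrative snippet, using the last row of the results table above:

```python
import pandas as pd

# horizon = date - date_last_data, counted in the problem's time step
# (months here); checked against the last row of the results table.
date_last_data = pd.Timestamp("1960-12-01")
date = pd.Timestamp("1965-12-01")
horizon = (date.year - date_last_data.year) * 12 + (date.month - date_last_data.month)
# horizon == 60, matching the table row with horizon 60
```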
Let's visualise the forecasts we've just made:
traces = [
go.Scatter(x=data.date, y=data.volume, name="data"),
go.Scatter(x=results.date, y=results.forecast, name="forecast"),
]
py.offline.iplot(traces)
TSBoost lets data scientists work in a classical machine learning context, so any feature engineering can be done, as well as meta-parameter tuning.
We can use period cross-validation, which takes time into consideration, to try to validate that the feature engineering & tuning will hold in the future.
We need to create periods of dates. Let's make 2 cross-validation periods: one for feature engineering & tuning validation, and one for final testing:
cv_valid = tsb.TSRegressor.generate_dates(begin_date="1959-01-01", horizon=12, time_step="month")
cv_test = tsb.TSRegressor.generate_dates(begin_date="1960-01-01", horizon=12, time_step="month")
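Assuming `generate_dates` with `time_step="month"` yields consecutive month-start dates (an assumption based on the inputs above), the `cv_valid` period would be equivalent to a plain pandas date range:

```python
import pandas as pd

# Presumed equivalent of generate_dates(begin_date="1959-01-01", horizon=12,
# time_step="month"): 12 consecutive month-start dates.
cv_valid_equiv = pd.date_range(start="1959-01-01", periods=12, freq="MS")
```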
Let's visualise those 2 CV periods:
traces = [
go.Scatter(x=data[data.date < "1959-01-01"].date, y=data[data.date < "1959-01-01"].volume, name="other_data"),
go.Scatter(x=data[data.date.isin(cv_valid)].date, y=data[data.date.isin(cv_valid)].volume, name="cv_valid"),
go.Scatter(x=data[data.date.isin(cv_test)].date, y=data[data.date.isin(cv_test)].volume, name="cv_test"),
]
py.offline.iplot(traces)
Let's build a 12-months-ahead forecasting algorithm, to speed things up:
horizons = 12
Any features can be added to the data. You just have to be careful not to use future information that you wouldn't have had access to in the past.
By default, TSBoost creates lag features from the past of the target we wish to forecast, as well as temporal features from the information contained in the date column (month, year, day of the week, etc.).
The meta parameter "inner_feature_eng" activates or deactivates this behaviour.
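The kind of inner feature engineering described above can be sketched with pandas. This is a simplified guess at the feature set (lags of the target plus calendar features), not TSBoost's actual code:

```python
import pandas as pd

# Illustrative sketch of lag + temporal features, on a tiny sample series.
df = pd.DataFrame({
    "date": pd.date_range("1949-01-01", periods=6, freq="MS"),
    "volume": [112, 118, 132, 129, 121, 135],
})
for lag in (1, 2, 3):
    df[f"lag_{lag}"] = df["volume"].shift(lag)  # past values of the target
df["month"] = df["date"].dt.month               # temporal features from the date
df["year"] = df["date"].dt.year
```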
Let's check whether this feature engineering is relevant for our problem and can help improve accuracy. First, let's test the algorithm without feature engineering on our validation period:
model1 = tsb.TSRegressor(horizons=horizons,
inner_feature_eng=False)
forecast1_cv_valid = model1.fit_predict(data, cv_dates=cv_valid, **data_config)
forecast1_cv_valid.head()
| | date_last_data | horizon | date | forecast |
|---|---|---|---|---|
| 0 | 1958-12-01 | 1 | 1959-01-01 | 344.489314 |
| 1 | 1959-01-01 | 1 | 1959-02-01 | 334.546558 |
| 2 | 1959-02-01 | 1 | 1959-03-01 | 387.062901 |
| 3 | 1959-03-01 | 1 | 1959-04-01 | 380.945582 |
| 4 | 1959-04-01 | 1 | 1959-05-01 | 412.223502 |
If we analyse the output, it looks like the 60-month forecast we made before, but the column "date_last_data" varies, so that we get all 12 horizons for every date in the cross-validation period and can compare those forecasts fairly.
TSBoost has a static function to get the MAPE for each horizon over the period; you can of course use whatever residual analysis you want:
results1_cv_valid = tsb.TSRegressor.get_result(data, forecast1_cv_valid, metric="mape", **data_config)
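For intuition, a per-horizon MAPE can be re-derived by hand. This is a rough, hypothetical re-implementation (`get_result` is the reference); the column names follow the frames shown above:

```python
import pandas as pd

# Hypothetical sketch of a per-horizon MAPE, for intuition only.
def mape_per_horizon(actuals: pd.DataFrame, forecasts: pd.DataFrame) -> pd.Series:
    # Align each forecast with the observed value at the same date.
    merged = forecasts.merge(actuals[["date", "volume"]], on="date")
    merged["ape"] = (merged["forecast"] - merged["volume"]).abs() / merged["volume"]
    # Average the absolute percentage error within each horizon.
    return merged.groupby("horizon")["ape"].mean() * 100  # in percent
```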
Let's do the same, but with the inner feature engineering of TSBoost:
model2 = tsb.TSRegressor(horizons=horizons,
inner_feature_eng=True)
forecast2_cv_valid = model2.fit_predict(data, cv_dates=cv_valid, **data_config)
results2_cv_valid = tsb.TSRegressor.get_result(data, forecast2_cv_valid, metric="mape", **data_config)
Let's visualise how those 2 models performed on the validation period with a graph:
on the x-axis you have the horizon (short-term to long-term forecasts),
on the y-axis you have the MAPE (Mean Absolute Percentage Error) on the validation period data.
# some plotly configs for graphs
layout = go.Layout(xaxis=dict(title= 'horizons'), yaxis=dict(title= 'MAPE'))
traces = [
go.Scatter(x=results1_cv_valid.index, y=results1_cv_valid.forecast, name="model_no_FE"),
go.Scatter(x=results2_cv_valid.index, y=results2_cv_valid.forecast, name="model_with_FE"),
]
fig= go.Figure(data=traces, layout=layout)
py.offline.iplot(fig)
We can see that on this period, adding TSBoost's feature engineering works quite well on almost all horizons.
We also see a global pattern that should appear in almost every time series problem:
the mean error of long-term forecasts is bigger than that of shorter-term ones, following a roughly logarithmic shape (to see it more clearly we could have selected larger cross-validation & test periods).
It's a general rule for most time series problems: it's easier to forecast short term than long term, mainly because of distribution shift: variable distributions have a higher chance of resembling the near-term distribution than the long-term one.
There are two main ways to do tuning:
we can tune TSBoost meta parameters,
or we can tune the optimizer's (xgboost or lightgbm) meta parameters.
By default, TSBoost uses an instance of xgboost.XGBRegressor() with default parameters, but we can pass a non-default instance.
We can also choose lightgbm.LGBMRegressor() as the optimizer. Let's see how xgboost's sibling performs on this problem:
model3 = tsb.TSRegressor(horizons=horizons,
optimizer="lgbm",
inner_feature_eng=True)
forecast3_cv_valid = model3.fit_predict(data, cv_dates=cv_valid, **data_config)
results3_cv_valid = tsb.TSRegressor.get_result(data, forecast3_cv_valid, metric="mape", **data_config)
traces = [
go.Scatter(x=results2_cv_valid.index, y=results2_cv_valid.forecast, name="model_xgboost_with_FE"),
go.Scatter(x=results3_cv_valid.index, y=results3_cv_valid.forecast, name="model_lightgbm_with_FE"),
]
fig= go.Figure(data=traces, layout=layout)
py.offline.iplot(fig)
We can observe that lightgbm performs nicely with the feature engineering.
Note that it is generally better to find good features than to over-tune the algorithm.
We can now use the test CV period to validate (or not) whether our previous feature engineering / tuning has a positive effect on unseen data, and perhaps conclude that it will also hold afterwards (in production usage):
results1 = model1.fit_predict(data, cv_dates=cv_test, **data_config)
results1 = tsb.TSRegressor.get_result(data, results1, metric="mape", **data_config)
results3 = model3.fit_predict(data, cv_dates=cv_test, **data_config)
results3 = tsb.TSRegressor.get_result(data, results3, metric="mape", **data_config)
traces = [
go.Scatter(x=results1.index, y=results1.forecast, name="model1"),
go.Scatter(x=results3.index, y=results3.forecast, name="model3"),
]
fig= go.Figure(data=traces, layout=layout)
py.offline.iplot(fig)
We can see that on the last CV period, the MAPE improvement is globally there (even if the short term could use some investigation).
We can suppose that the feature engineering & tuning should give better results in the future (even if that is not completely certain, because of distribution drift):
results = model3.fit_predict(data, **data_config)
traces = [
go.Scatter(x=data.date, y=data.volume, name="data"),
go.Scatter(x=results.date, y=results.forecast, name="forecast"),
]
py.offline.iplot(traces)
It is important to note that, for this example, we tuned and did feature engineering for all 12 step-ahead models at the same time.
TSBoost creates independent forecasts per step, so each horizon can have its own feature engineering & tuning without affecting the other horizons.
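The direct, one-model-per-horizon strategy described above can be sketched with a toy stand-in learner. This is a conceptual illustration, not TSBoost's internals:

```python
import numpy as np

# "Direct" multi-horizon forecasting: one independent model per horizon h,
# each trained to predict y[t + h] from information available at time t.
# Here each "model" is just a straight-line fit on the time index.
y = np.array([112., 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118])
horizons = [1, 2, 3]
models = {}
for h in horizons:
    t = np.arange(len(y) - h)            # feature: time index t
    models[h] = np.polyfit(t, y[h:], 1)  # target: the series shifted by h
# From the last observed index, each horizon gets its own independent forecast.
t_last = len(y) - 1
forecasts = {h: float(np.polyval(models[h], t_last)) for h in horizons}
```

Because the models are independent, changing the features or parameters of the h=3 model leaves the h=1 and h=2 forecasts untouched.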
If we only need a 3-months-ahead algorithm, for example, we can set horizons like this:
model = tsb.TSRegressor(horizons=[3])
You could also do batches of horizons:
model = tsb.TSRegressor(horizons=[3,4,5])