Large Scale learning with ModelSelector¶

Very often we have many different products, regions, countries, or shops for which we need to deliver forecasts. This can be done easily with ModelSelector. ModelSelector does not, however, require multiple data partitions; it can also serve as a convenient layer for accessing relevant information quickly.

In [ ]:

import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('seaborn')
plt.rcParams['figure.figsize'] = [12, 6]

In [ ]:

from hcrystalball.model_selection import ModelSelector
from hcrystalball.utils import get_sales_data
from hcrystalball.wrappers import get_sklearn_wrapper, ThetaWrapper
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

Get Dummy Data¶

In [ ]:

df = get_sales_data(n_dates=365*2, 
                    n_assortments=1, 
                    n_states=1, 
                    n_stores=2)

In [ ]:

df.head()

In [ ]:

# let's start simple
df_minimal = df[['Store','Sales']]

Get predefined sklearn models¶

ModelSelector comes with a large set of predefined hcrystalball models, grouped by their classes. To get this predefined grid search, use the create_gridsearch method. It lets you create hundreds of different models in a second. Here, for the sake of time, we use the method only for its CV splits, default scorer, etc., and extend an empty grid with three models.

Note on hcb_verbose flag¶

To make the grid search run faster, we create an empty grid with create_gridsearch and later add just a few models to it.

Each wrapper defaults to hcb_verbose=True (so that warnings etc. are visible on the wrapper layer), whereas grid search defaults to hcb_verbose=False (so that the output of model selection stays reasonably quiet). We therefore add models to the grid search with hcb_verbose=False to more closely simulate the defaults of typical usage, where one selects e.g. prophet_models=True, sklearn_models=True in create_gridsearch and the hcb_verbose flag stays False by default.

In [ ]:

ms_minimal = ModelSelector(horizon=10, frequency='D')

In [ ]:

# see full default parameter grid in hands on exercise
ms_minimal.create_gridsearch(
    sklearn_models=False,
    n_splits=2,
    between_split_lag=None,
    sklearn_models_optimize_for_horizon=False,
    autosarimax_models=False,
    prophet_models=False,
    theta_models=False,
    tbats_models=False,
    exp_smooth_models=False,
    average_ensembles=False,
    stacking_ensembles=False)
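To make the cross-validation settings above more concrete, here is a minimal sketch of expanding-window time series splits for horizon=10 and n_splits=2. It is not hcrystalball's implementation: the helper name expanding_window_splits is hypothetical, and it assumes that between_split_lag=None falls back to the horizon, so that test windows do not overlap.

```python
def expanding_window_splits(n_samples, horizon, n_splits, between_split_lag=None):
    """Yield (train_indices, test_indices) pairs, oldest split first."""
    # Assumption: a missing lag falls back to the horizon (non-overlapping tests).
    lag = between_split_lag or horizon
    first_test_start = n_samples - horizon - (n_splits - 1) * lag
    if first_test_start <= 0:
        raise ValueError("not enough observations for the requested splits")
    for i in range(n_splits):
        test_start = first_test_start + i * lag
        train = list(range(test_start))  # everything before the test window
        test = list(range(test_start, test_start + horizon))
        yield train, test


# Two years of daily data, as in the dummy data set (365 * 2 rows per store).
splits = list(expanding_window_splits(n_samples=730, horizon=10, n_splits=2))
for train, test in splits:
    print(f"train size={len(train)}, test rows={test[0]}..{test[-1]}")
```

Each split trains on all observations before its test window, so later splits see more history than earlier ones.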

Extend with custom models¶

In [ ]:

ms_minimal.add_model_to_gridsearch(get_sklearn_wrapper(LinearRegression, hcb_verbose=False))
ms_minimal.add_model_to_gridsearch(get_sklearn_wrapper(RandomForestRegressor, random_state=42, hcb_verbose=False))
ms_minimal.add_model_to_gridsearch(ThetaWrapper(hcb_verbose=False))
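Conceptually, extending the grid can be pictured as appending entries to a scikit-learn style list of parameter dictionaries. The sketch below illustrates that idea with plain strings standing in for the wrapped models; the expand helper and the model__n_estimators entry are hypothetical and only show how one entry can fan out into several candidates. Whether ModelSelector stores its grid exactly this way is an assumption.

```python
from itertools import product


def expand(param_grid):
    """Enumerate every candidate setting in a list-of-dicts grid."""
    candidates = []
    for sub_grid in param_grid:
        keys = sorted(sub_grid)
        # Cartesian product of all value lists within one sub-grid.
        for values in product(*(sub_grid[k] for k in keys)):
            candidates.append(dict(zip(keys, values)))
    return candidates


param_grid = []  # start from an empty grid
param_grid.append({"model": ["linear_regression"]})
# Hypothetical tuning entry: one model, two settings, hence two candidates.
param_grid.append({"model": ["random_forest"], "model__n_estimators": [50, 100]})
param_grid.append({"model": ["theta"]})

candidates = expand(param_grid)
for candidate in candidates:
    print(candidate)
```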

Run model selection¶

The select_model method does most of the magic for you: it creates a forecast for each combination of values in the columns specified in partition_columns, and for each of these time series it runs the grid search described above. Optionally, one can select a list of columns over which the model selection will run in parallel using prefect (parallel_over_columns).
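The idea of running a separate model selection per partition can be sketched in plain Python. This is only an illustration of the concept, not what select_model does internally: the two toy forecasters, the single hold-out split, and the mean absolute error scoring are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean


def last_value(train, horizon):
    """Repeat the last observed value."""
    return [train[-1]] * horizon


def historical_mean(train, horizon):
    """Repeat the mean of the training data."""
    return [mean(train)] * horizon


CANDIDATES = {"last_value": last_value, "historical_mean": historical_mean}


def select_per_partition(rows, partition_key, target, horizon):
    # Split the long table into one series per partition value.
    series = defaultdict(list)
    for row in rows:
        series[row[partition_key]].append(row[target])
    best = {}
    for key, values in series.items():
        train, test = values[:-horizon], values[-horizon:]
        errors = {
            name: mean([abs(p - t) for p, t in zip(model(train, horizon), test)])
            for name, model in CANDIDATES.items()
        }
        best[key] = min(errors, key=errors.get)  # lowest error wins
    return best


rows = (
    [{"Store": "A", "Sales": float(v)} for v in [1, 2, 3, 4, 5, 6]]  # trending
    + [{"Store": "B", "Sales": float(v)} for v in [5, 1, 5, 1, 5, 1, 3, 3]]  # oscillating
)
best = select_per_partition(rows, "Store", "Sales", horizon=2)
print(best)
```

Different partitions can end up with different winning models, which is the reason for selecting per partition rather than once for the whole data set.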

The data must have a Datetime index and, unsurprisingly, a numerical column for target_col_name. All other columns except partition_columns are used as exogenous variables, i.e. as additional features for modeling.
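A small toy frame can make the expected layout easier to picture. The sketch below uses only pandas and NumPy; the Promo column and the store names are hypothetical, chosen to show how a column that is neither a partition column nor the target ends up being treated as an exogenous variable.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2015-01-01", periods=5, freq="D")
frames = []
for store in ["Store_1", "Store_2"]:  # hypothetical store names
    frames.append(
        pd.DataFrame(
            {
                "Store": store,                     # partition column
                "Sales": np.arange(5, dtype=float), # numerical target
                "Promo": [0, 1, 0, 1, 0],           # hypothetical exogenous feature
            },
            index=dates,
        )
    )
toy = pd.concat(frames)

partition_columns = ["Store"]
target_col_name = "Sales"
exogenous = [
    col for col in toy.columns if col not in partition_columns + [target_col_name]
]

# Basic checks matching the format described above.
has_datetime_index = isinstance(toy.index, pd.DatetimeIndex)
has_numeric_target = pd.api.types.is_numeric_dtype(toy[target_col_name])
print(exogenous, has_datetime_index, has_numeric_target)
```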

In [ ]:

ms_minimal.select_model(df=df_minimal, partition_columns=['Store'], target_col_name='Sales')

In [ ]:

ms_minimal.plot_results(plot_from='2015-06');

Persist and Load¶

ModelSelector stores its ModelSelectorResults in the given folder as pickle files, one per partition. As the data here is partitioned by Store and contains 2 stores, one file per store is written and loaded back.
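The persist-and-load round trip can be sketched with the standard library alone. This is not hcrystalball's file layout: the persist and load helpers, the hash-based file names, and the dictionary stand-ins for results are all hypothetical and only show the pattern of writing one pickle file per partition and reading them back.

```python
import hashlib
import json
import pickle
import tempfile
from pathlib import Path


def persist(results, folder):
    """Write one pickle file per result, named by a hash of its partition."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    for result in results:
        key = json.dumps(result["partition"], sort_keys=True).encode()
        name = hashlib.sha256(key).hexdigest()
        with open(folder / f"{name}.pkl", "wb") as fh:
            pickle.dump(result, fh)


def load(folder):
    """Read every pickle file in the folder back into a list."""
    loaded = []
    for path in sorted(Path(folder).glob("*.pkl")):
        with open(path, "rb") as fh:
            loaded.append(pickle.load(fh))
    return loaded


# Plain dictionaries standing in for per-partition results.
results = [
    {"partition": {"Store": "Store_1"}, "best_model": "linear"},
    {"partition": {"Store": "Store_2"}, "best_model": "theta"},
]

with tempfile.TemporaryDirectory() as tmp:
    persist(results, tmp)
    n_files = len(list(Path(tmp).glob("*.pkl")))
    restored = load(tmp)

print(n_files, [r["partition"] for r in restored])
```

As with any pickle-based storage, files should only be loaded from sources you trust.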

In [ ]:

ms_minimal

In [ ]:

ms_minimal.persist_results('results')

In [ ]:

from hcrystalball.model_selection import load_model_selector

In [ ]:

ms_loaded = load_model_selector('results')

In [ ]:

ms_loaded

In [ ]:

# cleanup: ignore_errors=True tolerates a missing folder without a bare except
import shutil
shutil.rmtree('results', ignore_errors=True)