Very often we have many different products, regions, countries, shops... for which we need to deliver forecasts. This can be easily done with ModelSelector. ModelSelector, though, does not bind you to use multiple data partitions and can also serve as a convenient layer for accessing relevant information quickly.
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('seaborn')  # on matplotlib >= 3.6 use 'seaborn-v0_8' instead
plt.rcParams['figure.figsize'] = [12, 6]
from hcrystalball.model_selection import ModelSelector
from hcrystalball.utils import get_sales_data
from hcrystalball.wrappers import get_sklearn_wrapper, ThetaWrapper
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
df = get_sales_data(n_dates=365*2,
                    n_assortments=1,
                    n_states=1,
                    n_stores=2)
df.head()
# let's start simple
df_minimal = df[['Store','Sales']]
ModelSelector comes with a large set of predefined hcrystalball models, organized by their classes. To get this predefined grid search, use the create_gridsearch method. It allows you to create hundreds of different models in a second. Here, for the sake of time, we take advantage of the method only for its CV splits, default scorer etc.: we create an empty grid with create_gridsearch and later add just a few models to it.
Each wrapper's default is hcb_verbose=True (so that one can see warnings etc. on the wrapper layer), but in grid search the default is hcb_verbose=False (so that the output of model selection is reasonably verbose). We therefore add models to the grid search with hcb_verbose=False to more closely simulate the default settings of typical usage (where one selects e.g. prophet_models=True, sklearn_models=True in create_gridsearch and the hcb_verbose flag stays False by default).
ms_minimal = ModelSelector(horizon=10, frequency='D')
# see full default parameter grid in hands on exercise
ms_minimal.create_gridsearch(
    sklearn_models=False,
    n_splits=2,
    between_split_lag=None,
    sklearn_models_optimize_for_horizon=False,
    autosarimax_models=False,
    prophet_models=False,
    theta_models=False,
    tbats_models=False,
    exp_smooth_models=False,
    average_ensembles=False,
    stacking_ensembles=False)
ms_minimal.add_model_to_gridsearch(get_sklearn_wrapper(LinearRegression, hcb_verbose=False))
ms_minimal.add_model_to_gridsearch(get_sklearn_wrapper(RandomForestRegressor, random_state=42, hcb_verbose=False))
ms_minimal.add_model_to_gridsearch(ThetaWrapper(hcb_verbose=False))
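To make the role of n_splits and horizon in the cross-validation more concrete, here is a minimal, library-independent sketch of time series splits. It is not hcrystalball's implementation: the function name expanding_window_splits is hypothetical, and the sketch assumes that test windows are taken from the end of the series, are each horizon observations long, and are shifted by horizon between splits.

```python
def expanding_window_splits(n_samples, n_splits, horizon):
    """Yield (train_indices, test_indices) pairs, oldest split first.

    Assumption for this sketch: consecutive test windows do not overlap
    and the training window always starts at the first observation.
    """
    for i in range(n_splits, 0, -1):
        test_end = n_samples - (i - 1) * horizon
        test_start = test_end - horizon
        if test_start <= 0:
            raise ValueError("not enough samples for the requested splits")
        yield range(0, test_start), range(test_start, test_end)


# Two years of daily data, two splits, 10-day horizon (as in the grid above)
splits = list(expanding_window_splits(n_samples=730, n_splits=2, horizon=10))
for train, test in splits:
    print(f"train: 0..{train.stop - 1}, test: {test.start}..{test.stop - 1}")
```

Each candidate model is fitted on the training part and scored on the following test window, so every split mimics a real forecast of horizon steps ahead.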
The select_model method does the majority of the magic for you: it creates a forecast for each combination of values in the columns specified in partition_columns, and for each of the resulting time series it runs the grid search mentioned above. Optionally, one can select a list of columns over which the model selection will run in parallel using prefect (parallel_over_columns).
The required data format is a Datetime index and, unsurprisingly, a numerical column for target_col_name. All other columns except partition_columns will be used as exogenous variables, i.e. as additional features for modeling.
ms_minimal.select_model(df=df_minimal, partition_columns=['Store'], target_col_name='Sales')
ms_minimal.plot_results(plot_from='2015-06');
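The idea of selecting a model per partition can be illustrated without the library. The following is a rough conceptual sketch only: the candidate forecasters, the select_per_partition helper and the toy numbers are all hypothetical, and a single holdout window stands in for the full cross-validated grid search that select_model runs.

```python
from collections import defaultdict
from statistics import mean

# Toy observations in time order: (partition value, target value).
# The numbers are made up purely for illustration.
observations = [
    ("A", 1.0), ("B", 4.0), ("A", 2.0), ("B", 6.0),
    ("A", 3.0), ("B", 4.0), ("A", 4.0), ("B", 6.0),
    ("A", 5.0), ("B", 5.0), ("A", 6.0), ("B", 5.0),
]

# Hypothetical candidate "models": each maps (history, horizon) to a forecast
candidates = {
    "last_value": lambda history, horizon: [history[-1]] * horizon,
    "mean": lambda history, horizon: [mean(history)] * horizon,
}


def mean_absolute_error(actual, predicted):
    return mean(abs(a - p) for a, p in zip(actual, predicted))


def select_per_partition(observations, horizon):
    """Return the name of the best candidate for each partition."""
    # Split the data into one time series per partition value
    series = defaultdict(list)
    for key, value in observations:
        series[key].append(value)

    best = {}
    for key, values in series.items():
        # Hold out the last `horizon` points and score every candidate on them
        history, holdout = values[:-horizon], values[-horizon:]
        scores = {
            name: mean_absolute_error(holdout, model(history, horizon))
            for name, model in candidates.items()
        }
        best[key] = min(scores, key=scores.get)
    return best


best_models = select_per_partition(observations, horizon=2)
print(best_models)
```

The trending series A is better served by the last observed value, while the oscillating series B is better served by its mean, which is why selecting one model per partition can beat a single global choice.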
ModelSelector stores multiple ModelSelectorResults in a given folder as pickle files. As we only have 1 partition, only 1 file is written and loaded back.
ms_minimal
ms_minimal.persist_results('results')
from hcrystalball.model_selection import load_model_selector
ms_loaded = load_model_selector('results')
ms_loaded
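The persistence pattern described above (one pickle file per partition, all in one folder) can be sketched with the standard library alone. This is an assumption-laden illustration, not what persist_results or load_model_selector do internally: the helper names, the file naming scheme and the stored dictionary are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path


def persist_partition_results(results, folder):
    """Write one pickle file per partition into `folder`."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    for partition_key, result in results.items():
        # Hypothetical naming scheme: <partition key>.pkl
        with open(folder / f"{partition_key}.pkl", "wb") as f:
            pickle.dump(result, f)


def load_partition_results(folder):
    """Read every pickle file in `folder` back into a dict."""
    loaded = {}
    for path in sorted(Path(folder).glob("*.pkl")):
        with open(path, "rb") as f:
            loaded[path.stem] = pickle.load(f)
    return loaded


# Illustrative placeholder standing in for a real result object
results = {"store_1": {"best_model": "theta", "mae": 12.5}}

with tempfile.TemporaryDirectory() as tmp:
    persist_partition_results(results, tmp)
    n_files = len(list(Path(tmp).glob("*.pkl")))
    restored = load_partition_results(tmp)

print(n_files, restored)
```

As with any pickle-based storage, files should only be loaded from sources you trust, since unpickling can execute arbitrary code.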
# cleanup
import shutil
# remove the folder if it exists; ignore the error if it does not
shutil.rmtree('results', ignore_errors=True)