mlmachine - Hyperparameter Tuning with Bayesian Optimization
Welcome to Example Notebook 4. If you're new to mlmachine, check out Example Notebook 1, Example Notebook 2 and Example Notebook 3.
Check out the GitHub repository.
Bayesian optimization is typically described as an advancement beyond exhaustive grid search, and rightfully so. This hyperparameter tuning strategy succeeds by using prior information to inform future parameter selections for a given estimator. Check out Will Koehrsen's article on Medium for an excellent overview of the technique.
mlmachine uses hyperopt as a foundation for performing Bayesian optimization, and takes hyperopt's functionality a step further with a simplified workflow that allows for optimization of multiple models in a single process execution. In this article, we are going to optimize four classifiers:
LogisticRegression()
XGBClassifier()
RandomForestClassifier()
KNeighborsClassifier()
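Before diving in, it helps to see the core idea in miniature: evaluate a point, record the result, and let the history of results steer where we sample next. The sketch below is a toy with a crude nearest-neighbor surrogate, not hyperopt's Tree-structured Parzen Estimator and not mlmachine code; every name in it is made up for illustration.

```python
import random

def objective(x):
    # toy loss surface we want to minimize (true optimum at x = 3)
    return (x - 3.0) ** 2

def surrogate(x, history):
    # crude surrogate model: inverse-distance-weighted average of past losses
    num = den = 0.0
    for xi, loss in history:
        w = 1.0 / (abs(x - xi) + 1e-6)
        num += w * loss
        den += w
    return num / den

def toy_bayes_opt(n_iters=50, seed=0):
    rng = random.Random(seed)
    # evaluate one random starting point
    x0 = rng.uniform(0, 10)
    history = [(x0, objective(x0))]
    for _ in range(n_iters):
        # propose many candidates, but spend the evaluation budget only on
        # the candidate the surrogate considers most promising
        candidates = [rng.uniform(0, 10) for _ in range(25)]
        best = min(candidates, key=lambda x: surrogate(x, history))
        history.append((best, objective(best)))
    return min(history, key=lambda pair: pair[1])

best_x, best_loss = toy_bayes_opt()
```

The contrast with grid search is the `history` argument: every proposal is scored against what has already been observed, so sampling concentrates around promising regions instead of marching through a fixed grid.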
First, we apply data preprocessing techniques to clean up our data. We'll start by creating two Machine()
objects - one for the training data and a second for the validation data.
# import libraries
import numpy as np
import pandas as pd
# import mlmachine tools
import mlmachine as mlm
from mlmachine.data import titanic
# use titanic() function to create DataFrames for training and validation datasets
df_train, df_valid = titanic()
# ordinal encoding hierarchy
ordinal_encodings = {"Pclass": [1, 2, 3]}
# instantiate a Machine object for the training data
mlmachine_titanic_train = mlm.Machine(
data=df_train,
target="Survived",
remove_features=["PassengerId","Ticket","Name","Cabin"],
identify_as_continuous=["Age","Fare"],
identify_as_count=["Parch","SibSp"],
identify_as_nominal=["Embarked"],
identify_as_ordinal=["Pclass"],
ordinal_encodings=ordinal_encodings,
is_classification=True,
)
# instantiate a Machine object for the validation data
mlmachine_titanic_valid = mlm.Machine(
data=df_valid,
remove_features=["PassengerId","Ticket","Name","Cabin"],
identify_as_continuous=["Age","Fare"],
identify_as_count=["Parch","SibSp"],
identify_as_nominal=["Embarked"],
identify_as_ordinal=["Pclass"],
ordinal_encodings=ordinal_encodings,
is_classification=True,
)
category label encoding
0 --> 0
1 --> 1
Now we process the data by imputing nulls and applying various binning, feature engineering and encoding techniques:
# import scikit-learn and category_encoders tools
from sklearn.impute import SimpleImputer
from sklearn.model_selection import KFold
from sklearn.preprocessing import (
OrdinalEncoder,
OneHotEncoder,
KBinsDiscretizer,
RobustScaler,
PolynomialFeatures,
)
from sklearn.pipeline import make_pipeline
from category_encoders import WOEEncoder, TargetEncoder, CatBoostEncoder
# import mlmachine tools
from mlmachine.features.preprocessing import (
DataFrameSelector,
PandasTransformer,
PandasFeatureUnion,
GroupbyImputer,
KFoldEncoder,
)
### create imputation PandasFeatureUnion pipeline
impute_pipe = PandasFeatureUnion([
("age", make_pipeline(
DataFrameSelector(include_columns=["Age","SibSp"]),
GroupbyImputer(null_column="Age", groupby_column="SibSp", strategy="mean")
)),
("fare", make_pipeline(
DataFrameSelector(include_columns=["Fare","Pclass"]),
GroupbyImputer(null_column="Fare", groupby_column="Pclass", strategy="mean")
)),
("embarked", make_pipeline(
DataFrameSelector(include_columns=["Embarked"]),
PandasTransformer(SimpleImputer(strategy="most_frequent"))
)),
("diff", make_pipeline(
DataFrameSelector(exclude_columns=["Age","Fare","Embarked"])
)),
])
# fit and transform training data, transform validation data
mlmachine_titanic_train.data = impute_pipe.fit_transform(mlmachine_titanic_train.data)
mlmachine_titanic_valid.data = impute_pipe.transform(mlmachine_titanic_valid.data)
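Under the hood, GroupbyImputer fills nulls in one column with a statistic computed within groups of another column (here, mean Age within each SibSp group). The helper below is a hypothetical plain-Python stand-in, not mlmachine's implementation:

```python
def groupby_impute(rows, null_key, groupby_key):
    # compute per-group means of the non-null values
    sums, counts = {}, {}
    for row in rows:
        value = row[null_key]
        if value is not None:
            group = row[groupby_key]
            sums[group] = sums.get(group, 0.0) + value
            counts[group] = counts.get(group, 0) + 1
    means = {group: sums[group] / counts[group] for group in sums}
    # fill each null with the mean of the row's own group
    return [
        {**row, null_key: means[row[groupby_key]] if row[null_key] is None else row[null_key]}
        for row in rows
    ]

rows = [
    {"Age": 22.0, "SibSp": 1},
    {"Age": None, "SibSp": 1},
    {"Age": 30.0, "SibSp": 0},
]
filled = groupby_impute(rows, "Age", "SibSp")
```

The null Age is filled with the mean Age among rows sharing its SibSp value, which is usually a better guess than a single global mean.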
### create polynomial feature PandasFeatureUnion pipeline
polynomial_pipe = PandasFeatureUnion([
("polynomial", make_pipeline(
DataFrameSelector(include_mlm_dtypes=["continuous"]),
PandasTransformer(PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)),
)),
("diff", make_pipeline(
DataFrameSelector(exclude_mlm_dtypes=["continuous"]),
)),
])
# fit and transform training data, transform validation data
mlmachine_titanic_train.data = polynomial_pipe.fit_transform(mlmachine_titanic_train.data)
mlmachine_titanic_valid.data = polynomial_pipe.transform(mlmachine_titanic_valid.data)
# update mlm_dtypes
mlmachine_titanic_train.update_dtypes()
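For reference, PolynomialFeatures(degree=2, include_bias=False) expands each row of continuous features with all degree-2 products. A stdlib sketch of the same column ordering scikit-learn produces for two features:

```python
from itertools import combinations_with_replacement

def polynomial_features(row):
    # degree-2 expansion without a bias term: original features first,
    # then all pairwise products (squares included)
    out = list(row)
    for i, j in combinations_with_replacement(range(len(row)), 2):
        out.append(row[i] * row[j])
    return out

expanded = polynomial_features([2.0, 3.0])
# columns: x1, x2, x1^2, x1*x2, x2^2
```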
### create simple encoding & binning PandasFeatureUnion pipeline
encode_pipe = PandasFeatureUnion([
("nominal", make_pipeline(
DataFrameSelector(include_columns=mlmachine_titanic_train.data.mlm_dtypes["nominal"]),
PandasTransformer(OneHotEncoder(drop="first")),
)),
("ordinal", make_pipeline(
DataFrameSelector(include_columns=list(ordinal_encodings.keys())),
PandasTransformer(OrdinalEncoder(categories=list(ordinal_encodings.values()))),
)),
("bin", make_pipeline(
DataFrameSelector(include_columns=mlmachine_titanic_train.data.mlm_dtypes["continuous"]),
PandasTransformer(KBinsDiscretizer(encode="ordinal")),
)),
("diff", make_pipeline(
DataFrameSelector(exclude_columns=mlmachine_titanic_train.data.mlm_dtypes["nominal"] + list(ordinal_encodings.keys())),
)),
])
# fit and transform training data, transform validation data
mlmachine_titanic_train.data = encode_pipe.fit_transform(mlmachine_titanic_train.data)
mlmachine_titanic_valid.data = encode_pipe.transform(mlmachine_titanic_valid.data)
# update mlm_dtypes
mlmachine_titanic_train.update_dtypes()
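The OneHotEncoder(drop="first") step above dummy-codes each nominal column while dropping one level, so the resulting columns aren't perfectly collinear. A minimal hypothetical stand-in:

```python
def one_hot_drop_first(values):
    # one output column per category except the first (sorted) category,
    # which is represented by the all-zeros row
    categories = sorted(set(values))
    kept = categories[1:]
    return [[1 if value == cat else 0 for cat in kept] for value in values], kept

encoded, columns = one_hot_drop_first(["S", "C", "Q", "S"])
```

Here "C" becomes the dropped baseline: a row of all zeros means "C", so the information survives with one fewer column.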
### create KFold encoding PandasFeatureUnion pipeline
target_encode_pipe = PandasFeatureUnion([
("target", make_pipeline(
DataFrameSelector(include_mlm_dtypes=["category"]),
KFoldEncoder(
target=mlmachine_titanic_train.target,
cv=KFold(n_splits=5, shuffle=True, random_state=0),
encoder=TargetEncoder,
),
)),
("woe", make_pipeline(
DataFrameSelector(include_mlm_dtypes=["category"]),
KFoldEncoder(
target=mlmachine_titanic_train.target,
cv=KFold(n_splits=5, shuffle=False),
encoder=WOEEncoder,
),
)),
("catboost", make_pipeline(
DataFrameSelector(include_mlm_dtypes=["category"]),
KFoldEncoder(
target=mlmachine_titanic_train.target,
cv=KFold(n_splits=5, shuffle=False),
encoder=CatBoostEncoder,
),
)),
("diff", make_pipeline(
DataFrameSelector(exclude_mlm_dtypes=["category"]),
)),
])
# fit and transform training data, transform validation data
mlmachine_titanic_train.data = target_encode_pipe.fit_transform(mlmachine_titanic_train.data)
mlmachine_titanic_valid.data = target_encode_pipe.transform(mlmachine_titanic_valid.data)
# update mlm_dtypes
mlmachine_titanic_train.update_dtypes()
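The key property of KFoldEncoder is that each row's encoding is computed from the other folds, so the row's own label never leaks into its feature. Here is a stdlib sketch of out-of-fold mean-target encoding; it's a hypothetical helper, and category_encoders' TargetEncoder additionally applies smoothing, which is omitted:

```python
from statistics import mean

def kfold_target_encode(categories, targets, n_splits=3):
    # encode each row with the target mean of its category, computed
    # only on rows outside the row's own fold
    n = len(categories)
    global_mean = mean(targets)
    encoded = [None] * n
    fold_size = -(-n // n_splits)  # ceiling division
    for start in range(0, n, fold_size):
        fold = range(start, min(start + fold_size, n))
        sums, counts = {}, {}
        for i in range(n):
            if i not in fold:
                c = categories[i]
                sums[c] = sums.get(c, 0.0) + targets[i]
                counts[c] = counts.get(c, 0) + 1
        for i in fold:
            c = categories[i]
            # category unseen outside this fold: fall back to the global mean
            encoded[i] = sums[c] / counts[c] if c in counts else global_mean
    return encoded

codes = kfold_target_encode(["a", "a", "b", "b", "a", "b"], [1, 0, 1, 1, 0, 0])
```

Notice that two rows with the same category can receive different encodings if they sit in different folds; that variation is the price of preventing target leakage.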
### scale values
scale = PandasTransformer(RobustScaler())
mlmachine_titanic_train.data = scale.fit_transform(mlmachine_titanic_train.data)
mlmachine_titanic_valid.data = scale.transform(mlmachine_titanic_valid.data)
mlmachine_titanic_train.data[:10]
Age | Age*Fare | Age*Fare_binned_5 | Age*Fare_binned_5_catboost_encoded | Age*Fare_binned_5_target_encoded | Age*Fare_binned_5_woe_encoded | Age^2 | Age^2_binned_5 | Age^2_binned_5_catboost_encoded | Age^2_binned_5_target_encoded | ... | Parch | Pclass_ordinal_encoded | Pclass_ordinal_encoded_catboost_encoded | Pclass_ordinal_encoded_target_encoded | Pclass_ordinal_encoded_woe_encoded | Sex_male | Sex_male_catboost_encoded | Sex_male_target_encoded | Sex_male_woe_encoded | SibSp | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | -0.622287 | -0.241862 | -1.0 | -0.061922 | 0.257553 | -0.090433 | -0.568680 | -0.5 | 0.196885 | 0.754887 | ... | 0.0 | 0.0 | -0.089623 | 0.000000 | -0.167183 | 0.0 | 0.017268 | 0.017703 | 0.000604 | 1.0 |
1 | 0.608483 | 2.880011 | 1.0 | 1.746016 | 1.330575 | 1.479878 | 0.726867 | 1.0 | 0.660192 | 0.000000 | ... | 0.0 | -2.0 | 1.760707 | 1.370847 | 1.550231 | -1.0 | 0.996585 | 0.989747 | 0.991390 | 1.0 |
2 | -0.314594 | -0.184856 | -0.5 | -1.034485 | -0.868374 | -1.118681 | -0.309570 | -0.5 | 0.196885 | -0.053122 | ... | 0.0 | 0.0 | -0.089623 | -0.121216 | -0.167183 | -1.0 | 0.996585 | 0.991811 | 0.991390 | 0.0 |
3 | 0.377713 | 1.838762 | 1.0 | 1.746016 | 1.529026 | 1.479878 | 0.431320 | 0.5 | -0.550145 | -0.358207 | ... | 0.0 | -2.0 | 1.760707 | 1.518257 | 1.550231 | -1.0 | 0.996585 | 1.018386 | 0.991390 | 1.0 |
4 | 0.377713 | -0.092152 | 0.0 | -0.528664 | -0.424139 | -0.541412 | 0.431320 | 0.5 | -0.550145 | -0.164268 | ... | 0.0 | 0.0 | -0.089623 | -0.092714 | -0.167183 | 0.0 | 0.017268 | -0.004054 | 0.000604 | 0.0 |
5 | 0.100602 | -0.111967 | -0.5 | -1.034485 | -0.626603 | -1.118681 | 0.108522 | 0.5 | -0.550145 | -0.301361 | ... | 0.0 | 0.0 | -0.089623 | -0.035279 | -0.167183 | 0.0 | 0.017268 | 0.000000 | 0.000604 | 0.0 |
6 | 1.839252 | 2.992442 | 1.0 | 1.746016 | 1.529026 | 1.479878 | 2.713372 | 1.0 | 0.660192 | 0.442971 | ... | 0.0 | -2.0 | 1.760707 | 1.518257 | 1.550231 | 0.0 | 0.017268 | 0.017703 | 0.000604 | 0.0 |
7 | -2.160748 | -0.385571 | -1.0 | -0.061922 | 0.257553 | -0.090433 | -1.216453 | -1.0 | 1.956874 | 1.754877 | ... | 1.0 | 0.0 | -0.089623 | 0.000000 | -0.167183 | 0.0 | 0.017268 | 0.017703 | 0.000604 | 3.0 |
8 | -0.237671 | -0.069069 | 0.0 | -0.528664 | -0.325717 | -0.541412 | -0.238045 | -0.5 | 0.196885 | -0.196302 | ... | 2.0 | 0.0 | -0.089623 | -0.035279 | -0.167183 | -1.0 | 0.996585 | 0.989747 | 0.991390 | 0.0 |
9 | -1.237671 | 0.078365 | 0.0 | -0.528664 | -0.358860 | -0.541412 | -0.957344 | -1.0 | 1.956874 | 1.193962 | ... | 0.0 | -1.0 | 0.909780 | 0.823122 | 0.797100 | -1.0 | 0.996585 | 0.935964 | 0.991390 | 1.0 |
10 rows × 43 columns
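A note on the RobustScaler applied above: it centers each column on its median and divides by the interquartile range, so outliers (such as extreme Fare values) don't dictate the scale. A single-column sketch, using numpy-style linear interpolation for the quantiles (hypothetical helper, not scikit-learn's code):

```python
def robust_scale(column):
    # center on the median, scale by the interquartile range (IQR)
    s = sorted(column)
    n = len(s)

    def quantile(q):
        # linear interpolation between order statistics
        pos = q * (n - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        return s[lo] + (pos - lo) * (s[hi] - s[lo])

    median = quantile(0.5)
    iqr = quantile(0.75) - quantile(0.25)
    return [(x - median) / iqr for x in column]

scaled = robust_scale([1.0, 2.0, 3.0, 4.0, 100.0])
```

The outlier is still large after scaling, but the median and IQR it was scaled by are untouched by it, unlike a mean/standard-deviation scaler.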
As a second preparatory step, we want to perform feature selection for each of our classifiers.
# import libraries
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
# estimator list for model-specific feature importance techniques
estimators = [
LogisticRegression,
XGBClassifier,
RandomForestClassifier,
KNeighborsClassifier,
]
# instantiate FeatureSelector object
fs = mlmachine_titanic_train.FeatureSelector(
data=mlmachine_titanic_train.data,
target=mlmachine_titanic_train.target,
estimators=estimators,
)
# run full feature selector suite, use accuracy metric and
# 0 CV folds where applicable
feature_selector_summary = fs.feature_selector_suite(
sequential_scoring="accuracy",
sequential_n_folds=0,
add_stats=True,
n_jobs=1,
save_to_csv=True,
)
For our final preparatory step, we use this feature selection summary to perform iterative cross-validation on smaller and smaller subsets of features for each of our estimators:
# use cross validation on progressively smaller feature subsets
# to find optimal feature set
cv_summary = fs.feature_selector_cross_val(
feature_selector_summary=feature_selector_summary,
estimators=estimators,
scoring=["accuracy"],
n_folds=5,
step=1,
n_jobs=4,
save_to_csv=True,
)
From this result, we extract our dictionary of optimum feature sets for each estimator.
The keys are estimator names, and the associated values are lists containing the column names of the best performing feature subset for each estimator. Taking XGBClassifier()
as an example, we used only 10 of the available 43 features to achieve the best average cross-validation accuracy on the validation dataset.
# create feature selector summary dictionary
cross_val_feature_dict = fs.create_cross_val_features_dict(
scoring="accuracy_score",
cv_summary=cv_summary,
feature_selector_summary=feature_selector_summary,
)
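Conceptually, the iterative procedure behind cv_summary looks like the sketch below: score the full ranked feature list, then repeatedly drop the weakest features and re-score, keeping the subset size with the best cross-validation score. This is a hypothetical illustration of the shape of the loop, not mlmachine's implementation, and the toy evaluator stands in for real cross-validation:

```python
def feature_elimination_curve(ranked_features, evaluate, step=1):
    # record a score for the full ranked feature list, then repeatedly
    # drop the weakest `step` features and score what remains
    scores = {}
    features = list(ranked_features)
    while features:
        scores[len(features)] = evaluate(features)
        features = features[:-step] if len(features) > step else []
    return scores

# toy evaluator: pretend only 2 features are informative, the rest add noise
ranked = ["f1", "f2", "f3", "f4"]
scores = feature_elimination_curve(ranked, lambda fs: 80 - max(0, len(fs) - 2))
best_n = max(scores, key=scores.get)
```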
With our processed dataset and optimum feature subsets in hand, it's time to use Bayesian optimization to tune the hyperparameters of our four estimators.
First, we need to establish the search space for each parameter of each estimator:
# import libraries
from hyperopt import hp
# create hyperopt parameter space for set of estimators
estimator_parameter_space = {
"LogisticRegression": {
"C": hp.loguniform("C", np.log(0.001), np.log(0.2)),
"penalty": hp.choice("penalty", ["l2"]),
},
"XGBClassifier": {
"colsample_bytree": hp.uniform("colsample_bytree", 0.5, 1.0),
"gamma": hp.uniform("gamma", 0.0, 10),
"learning_rate": hp.uniform("learning_rate", 0.01, 0.3),
"max_depth": hp.choice("max_depth", np.arange(2, 20, dtype=int)),
"min_child_weight": hp.uniform("min_child_weight", 1, 20),
"n_estimators": hp.choice("n_estimators", np.arange(100, 10000, 1, dtype=int)),
"subsample": hp.uniform("subsample", 0.3, 1),
},
"RandomForestClassifier": {
"bootstrap": hp.choice("bootstrap", [True, False]),
"max_depth": hp.choice("max_depth", np.arange(2, 20, dtype=int)),
"n_estimators": hp.choice("n_estimators", np.arange(100, 10000, 1, dtype=int)),
"max_features": hp.choice("max_features", ["auto", "sqrt"]),
"min_samples_split": hp.choice(
"min_samples_split", np.arange(2, 40, dtype=int)
),
"min_samples_leaf": hp.choice("min_samples_leaf", np.arange(2, 40, dtype=int)),
},
"KNeighborsClassifier": {
"algorithm": hp.choice("algorithm", ["auto", "ball_tree", "kd_tree", "brute"]),
"n_neighbors": hp.choice("n_neighbors", np.arange(1, 20, dtype=int)),
"weights": hp.choice("weights", ["distance", "uniform"]),
},
}
The outermost keys of the dictionary are names of classifiers, represented by strings. The associated values are also dictionaries, where the keys are parameter names, represented as strings, and the values are hyperopt sampling distributions from which parameter values will be chosen.
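To build intuition for what those sampling expressions do, here is a stdlib imitation of drawing one configuration from such a space. The `choice`, `uniform`, and `loguniform` helpers are hypothetical stand-ins for hyperopt's `hp.*` expressions, not hyperopt itself:

```python
import math
import random

rng = random.Random(0)

# hypothetical stand-ins for hyperopt's sampling expressions
def choice(options):
    return lambda: rng.choice(options)

def uniform(low, high):
    return lambda: rng.uniform(low, high)

def loguniform(low, high):
    # sample uniformly in log space, as hp.loguniform does
    return lambda: math.exp(rng.uniform(math.log(low), math.log(high)))

space = {
    "C": loguniform(0.001, 0.2),
    "penalty": choice(["l2"]),
}

# draw one candidate configuration from the space
params = {name: sample() for name, sample in space.items()}
```

Log-uniform sampling matters for parameters like C that span orders of magnitude: it gives 0.001–0.01 the same sampling weight as 0.02–0.2, which plain uniform sampling would not.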
Now we're ready to run our Bayesian optimization hyperparameter tuning job. We will use a built-in method belonging to our Machine object called exec_bayes_optim_search(). Let's see mlmachine in action:
# execute bayesian optimization grid search
mlmachine_titanic_train.exec_bayes_optim_search(
estimator_parameter_space=estimator_parameter_space,
data=mlmachine_titanic_train.data,
target=mlmachine_titanic_train.target,
columns=cross_val_feature_dict,
scoring="accuracy",
n_folds=5,
n_jobs=5,
iters=200,
show_progressbar=True,
)
Tuning LogisticRegression
100%|██████████| 20/20 [00:03<00:00, 6.42trial/s, best loss: 0.18967422007406953]
Tuning XGBClassifier
100%|██████████| 20/20 [01:16<00:00, 3.83s/trial, best loss: 0.16612265394513837]
Tuning RandomForestClassifier
100%|██████████| 20/20 [05:16<00:00, 15.83s/trial, best loss: 0.16945577804280965]
Tuning KNeighborsClassifier
100%|██████████| 20/20 [00:02<00:00, 7.45trial/s, best loss: 0.1784131567384346]
Let's review the parameters:
estimator_parameter_space: The dictionary-based parameter space we set up above.
data: Our observations.
target: Our target data.
columns: An optional parameter that allows us to subset the input dataset's features. It accepts a list of feature names, which applies equally to all estimators. It also accepts a dictionary, where the keys are estimator class names and the values are lists of feature names to be used with the associated estimator. In this example, we use the latter by passing in cross_val_feature_dict, the dictionary returned in the FeatureSelector workflow above.
scoring: The scoring metric to be evaluated.
n_folds: Number of folds to use in the cross-validation procedure.
iters: Total number of iterations to run the hyperparameter tuning process. In this example, we run the experiment for 200 iterations.
show_progressbar: Controls whether a progress bar displays and actively updates during the course of the process.
Anyone familiar with hyperopt will be wondering where the objective function is. mlmachine abstracts away this complexity.
The process runtime depends on several factors, including hardware, the number and type of estimators used, the number of cross-validation folds, the feature selection applied, and the number of sampling iterations. Runtimes can be quite lengthy. For this reason, exec_bayes_optim_search()
automatically saves the result of each iteration to a CSV.
# reload bayes optimization summary
bayes_optim_summary = pd.read_csv("data/bayes_optimization_summary_accuracy.csv", na_values="nan")
bayes_optim_summary[:20]
iteration | estimator | scoring | loss | mean_score | std_score | min_score | max_score | train_time | status | params | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | LogisticRegression | accuracy | 0.204218 | 0.795782 | 0.026768 | 0.754190 | 0.825843 | 1.080047 | ok | {'C': 0.008174796663349533, 'penalty': 'l2', '... |
1 | 2 | LogisticRegression | accuracy | 0.210922 | 0.789078 | 0.037798 | 0.720670 | 0.820225 | 1.232385 | ok | {'C': 0.007456580176063114, 'penalty': 'l2', '... |
2 | 3 | LogisticRegression | accuracy | 0.196384 | 0.803616 | 0.027326 | 0.769663 | 0.848315 | 1.175930 | ok | {'C': 0.015197336097355196, 'penalty': 'l2', '... |
3 | 4 | LogisticRegression | accuracy | 0.191890 | 0.808110 | 0.025218 | 0.780899 | 0.842697 | 1.189311 | ok | {'C': 0.016970348333069277, 'penalty': 'l2', '... |
4 | 5 | LogisticRegression | accuracy | 0.196391 | 0.803609 | 0.014813 | 0.786517 | 0.825843 | 1.207008 | ok | {'C': 0.03346957219746275, 'penalty': 'l2', 'n... |
5 | 6 | LogisticRegression | accuracy | 0.193064 | 0.806936 | 0.020473 | 0.780899 | 0.831461 | 1.180980 | ok | {'C': 0.18501222413286775, 'penalty': 'l2', 'n... |
6 | 7 | LogisticRegression | accuracy | 0.292838 | 0.707162 | 0.044509 | 0.625698 | 0.747191 | 1.202759 | ok | {'C': 0.0011479966819945002, 'penalty': 'l2', ... |
7 | 8 | LogisticRegression | accuracy | 0.210922 | 0.789078 | 0.037965 | 0.720670 | 0.820225 | 1.166395 | ok | {'C': 0.007362723951833776, 'penalty': 'l2', '... |
8 | 9 | LogisticRegression | accuracy | 0.195255 | 0.804745 | 0.023097 | 0.776536 | 0.837079 | 1.171910 | ok | {'C': 0.028150535343720546, 'penalty': 'l2', '... |
9 | 10 | LogisticRegression | accuracy | 0.298456 | 0.701544 | 0.042828 | 0.625698 | 0.747191 | 1.179324 | ok | {'C': 0.0010719047187090322, 'penalty': 'l2', ... |
10 | 11 | LogisticRegression | accuracy | 0.191903 | 0.808097 | 0.014580 | 0.792135 | 0.831461 | 1.179505 | ok | {'C': 0.04917602741481999, 'penalty': 'l2', 'n... |
11 | 12 | LogisticRegression | accuracy | 0.188551 | 0.811449 | 0.010974 | 0.797753 | 0.831461 | 1.170878 | ok | {'C': 0.06940715377237823, 'penalty': 'l2', 'n... |
12 | 13 | LogisticRegression | accuracy | 0.241215 | 0.758785 | 0.043890 | 0.681564 | 0.803371 | 1.184253 | ok | {'C': 0.0029054318718131663, 'penalty': 'l2', ... |
13 | 14 | LogisticRegression | accuracy | 0.193014 | 0.806986 | 0.023733 | 0.780899 | 0.837079 | 1.156993 | ok | {'C': 0.02173685952136848, 'penalty': 'l2', 'n... |
14 | 15 | LogisticRegression | accuracy | 0.193020 | 0.806980 | 0.015832 | 0.787709 | 0.831461 | 1.139732 | ok | {'C': 0.042661619723085194, 'penalty': 'l2', '... |
15 | 16 | LogisticRegression | accuracy | 0.240092 | 0.759908 | 0.044651 | 0.681564 | 0.803371 | 1.172876 | ok | {'C': 0.003325935337999729, 'penalty': 'l2', '... |
16 | 17 | LogisticRegression | accuracy | 0.195305 | 0.804695 | 0.019433 | 0.780899 | 0.831461 | 1.192614 | ok | {'C': 0.15489548435948908, 'penalty': 'l2', 'n... |
17 | 18 | LogisticRegression | accuracy | 0.186310 | 0.813690 | 0.010923 | 0.797753 | 0.831461 | 1.204993 | ok | {'C': 0.05931036664011822, 'penalty': 'l2', 'n... |
18 | 19 | LogisticRegression | accuracy | 0.201971 | 0.798029 | 0.028659 | 0.754190 | 0.831461 | 1.141704 | ok | {'C': 0.008415834542277328, 'penalty': 'l2', '... |
19 | 20 | LogisticRegression | accuracy | 0.194143 | 0.805857 | 0.015445 | 0.787709 | 0.831461 | 1.147760 | ok | {'C': 0.03837255565837678, 'penalty': 'l2', 'n... |
Our Bayesian optimization log maintains key information about each iteration, providing an immense amount of data with which to analyze and evaluate the effectiveness of the Bayesian optimization process.
First and foremost, we want to see if performance improved over the iterations. Let's visualize the XGBClassifier()
loss by iteration:
# generate model loss by iteration plot for XGBClassifier
mlmachine_titanic_train.model_loss_plot(
bayes_optim_summary=bayes_optim_summary,
estimator_class="XGBClassifier",
)
Each dot represents the performance of one of our 200 experiments. The key detail to notice is that the line of best fit has a clear downward slope - exactly what we want. This means that with each iteration, model performance tends to improve compared to the previous iterations.
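The "line of best fit" here is just an ordinary least-squares fit of loss against iteration index, so its sign is what tells the story. A quick stdlib check of that sign on a toy loss trace (hypothetical helper, not mlmachine's plotting code):

```python
def fit_slope(losses):
    # ordinary least-squares slope of loss against iteration index
    n = len(losses)
    mean_x = (n - 1) / 2
    mean_y = sum(losses) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(losses))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den

# a toy loss trace that improves over the iterations
trace = [0.30, 0.28, 0.25, 0.26, 0.22, 0.21]
slope = fit_slope(trace)
# a negative slope means later iterations tend to achieve lower loss
```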
One of the coolest parts of Bayesian optimization is seeing how parameter selection is optimized.
For each model, and for each of that model's parameters, we can generate a two-panel visual.
For numeric parameters, such as n_estimators or learning_rate, the panel pairs overlapping density plots of the actual parameter selections and the theoretical parameter distribution with a scatter plot of the values selected over the iterations.
For categorical parameters, such as a loss function, the panel pairs a bar chart counting the actual and theoretical parameter selections with a scatter plot of the categories selected over the iterations.
Let's review the parameter selection panels for KNeighborsClassifier()
:
# generate parameter selection panels for each KNeighborsClassifier parameter
mlmachine_titanic_train.model_param_plot(
bayes_optim_summary=bayes_optim_summary,
estimator_class="KNeighborsClassifier",
estimator_parameter_space=estimator_parameter_space,
n_iter=200,
)
* KNeighborsClassifier *
The built-in method model_param_plot()
cycles through each of the estimator's parameters and presents the appropriate panel given each parameter's type. Let's look at a numeric parameter and a categorical parameter separately.
First, we'll review the panel for the numeric parameter n_neighbors
:
On the left, we can see two overlapping kernel density plots summarizing the actual parameter selections and the theoretical parameter distribution. The purple line corresponds to the theoretical distribution, and, as expected, this curve is smooth and evenly distributed. The teal line corresponds to the actual parameter selections, and it's clear that hyperopt prefers values between 5 and 10.
On the right, the scatter plot visualizes the n_neighbors
value selections over the iterations. There is a slight downward slope to the line of best fit, as the Bayesian optimization process homes in on values around 7.
Next, we'll review the panel for the categorical parameter algorithm
:
On the left, we see a bar chart displaying the counts of parameter selections, faceted by actual parameter selections and selections from the theoretical distribution. The purple bars, representing selections from the theoretical distribution, are more evenly distributed than the teal bars, which represent the actual selections.
On the right, the scatter plot again visualizes the algorithm value selections over the iterations. There is a clear decrease in the selection of "ball_tree" and "auto" in favor of "kd_tree" and "brute" over the iterations.
Our Machine()
object has a built-in method called top_bayes_optim_models()
, which identifies the best model for each estimator type based on the results in our Bayesian optimization log.
# identify top performing model for each estimator
top_models = mlmachine_titanic_train.top_bayes_optim_models(
bayes_optim_summary=bayes_optim_summary,
num_models=1,
)
With this method, we can identify the top N models for each estimator based on mean cross-validation score. In this experiment, top_bayes_optim_models()
returns the dictionary below, which tells us that LogisticRegression()
identified its top model on iteration 30, XGBClassifier()
on iteration 61, RandomForestClassifier()
on iteration 46, and KNeighborsClassifier()
on iteration 109.
top_models
{'LogisticRegression': [30], 'XGBClassifier': [61], 'RandomForestClassifier': [46], 'KNeighborsClassifier': [109]}
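Conceptually, this selection amounts to grouping the log by estimator and keeping the iteration numbers with the highest mean cross-validation scores. A hypothetical re-implementation on a couple of made-up log rows (the real log, of course, has many more columns):

```python
def top_models_from_log(log_rows, num_models=1):
    # group log records by estimator, then keep the iteration numbers of
    # the records with the highest mean cross-validation scores
    by_estimator = {}
    for row in log_rows:
        by_estimator.setdefault(row["estimator"], []).append(row)
    return {
        estimator: [
            row["iteration"]
            for row in sorted(rows, key=lambda r: r["mean_score"], reverse=True)[:num_models]
        ]
        for estimator, rows in by_estimator.items()
    }

log = [
    {"estimator": "LogisticRegression", "iteration": 1, "mean_score": 0.796},
    {"estimator": "LogisticRegression", "iteration": 18, "mean_score": 0.814},
    {"estimator": "KNeighborsClassifier", "iteration": 3, "mean_score": 0.802},
]
best = top_models_from_log(log)
```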
To reinstantiate a model, we leverage our Machine()
object's built-in method BayesOptimClassifierBuilder()
. To use this method, we pass in our results log, specify an estimator class and iteration number. This will instantiate a model object with the parameters stored on that record of the log:
# reinstantiate top performing RandomForestClassifier
model = mlmachine_titanic_train.BayesOptimClassifierBuilder(
bayes_optim_summary=bayes_optim_summary,
estimator_class="RandomForestClassifier",
model_iter=46
)
print(model.custom_model)
RandomForestClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None, criterion='gini', max_depth=19, max_features='auto', max_leaf_nodes=None, max_samples=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=4, min_samples_split=7, min_weight_fraction_leaf=0.0, n_estimators=4019, n_jobs=4, oob_score=False, random_state=0, verbose=0, warm_start=False)
The models instantiated with BayesOptimClassifierBuilder()
use .fit()
and .predict()
in a way that should feel quite familiar.
Let's finish this article with a very basic model performance evaluation. We will fit this RandomForestClassifier()
on the training data and labels, generate predictions with the training data, and evaluate the model's performance by comparing these predictions to the ground-truth:
# fit the model
model.fit(mlmachine_titanic_train.data, mlmachine_titanic_train.target)
# generate predictions
y_pred_train = model.predict(mlmachine_titanic_train.data)
# summarize results
training_accuracy = sum(y_pred_train == mlmachine_titanic_train.target) / len(y_pred_train)
print("RandomForestClassifier, iter = 46 \nTraining accuracy: {:.2%}".format(training_accuracy))
RandomForestClassifier, iter = 46
Training accuracy: 94.16%
Star the GitHub repository, and stay tuned for additional notebooks.