useR! 2019 H2O Tutorial (bit.ly/useR2019_h2o_tutorial)
1 Agenda
- 09:00 to 09:30 Set Up & Introduction
- 09:30 to 10:30 Regression Example
- 10:30 to 11:00 Coffee Break
- 11:00 to 11:30 Classification Example
- 11:30 to 12:30 Bring Your Own Data + Q&A
2 Set Up
2.1 Download -> bit.ly/useR2019_h2o_tutorial
- setup.R: install required packages
- tutorial.Rmd: the main RMarkdown file with code
- tutorial.html: this webpage
- Full URL: https://github.com/woobe/useR2019_h2o_tutorial (if bit.ly doesn't work)
2.2 R Packages
- Check out setup.R
- For this tutorial:
  - h2o for machine learning
  - mlbench for the Boston Housing dataset
  - DALEX, breakDown & pdp for explaining model predictions
- For RMarkdown:
  - knitr for rendering this RMarkdown
  - rmdformats for the readthedown RMarkdown template
  - DT for nice tables
3 Introduction
General Data Protection Regulation (GDPR) is now in place. Are you ready to explain your models? This is a hands-on tutorial for R beginners. I will demonstrate the use of H2O and other R packages for automatic and interpretable machine learning. Participants will be able to follow and build regression and classification models quickly with H2O’s AutoML. They will then be able to explain the model outcomes with various methods.
It is a workshop for R beginners and anyone interested in machine learning. RMarkdown and the rendered HTML will be provided so everyone can follow without running the code.
(Now go to slides …)
4 Regression Part One: H2O AutoML
4.1 Data - Boston Housing from mlbench
data("BostonHousing")
datatable(head(BostonHousing),
rownames = FALSE, options = list(pageLength = 6, scrollX = TRUE))
Source: UCI Machine Learning Repository Link
- crim: per capita crime rate by town.
- zn: proportion of residential land zoned for lots over 25,000 sq.ft.
- indus: proportion of non-retail business acres per town.
- chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
- nox: nitrogen oxides concentration (parts per 10 million).
- rm: average number of rooms per dwelling.
- age: proportion of owner-occupied units built prior to 1940.
- dis: weighted mean of distances to five Boston employment centres.
- rad: index of accessibility to radial highways.
- tax: full-value property-tax rate per $10,000.
- ptratio: pupil-teacher ratio by town.
- b: 1000(Bk - 0.63)^2 where Bk is the proportion of people of African American descent by town.
- lstat: lower status of the population (percent).
- medv (This is the TARGET): median value of owner-occupied homes in $1000s.
4.2 Define Target and Features
target <- "medv" # Median House Value
features <- setdiff(colnames(BostonHousing), target)
print(features)
[1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
[8] "dis" "rad" "tax" "ptratio" "b" "lstat"
4.3 Start a local H2O Cluster (JVM)
H2O is not running yet, starting it now...
Note: In case of errors look at the following log files:
/var/folders/4z/p7yt7_4n4fj1jlyq6g4qhfbw0000gn/T//Rtmpeo8saO/h2o_jofaichow_started_from_r.out
/var/folders/4z/p7yt7_4n4fj1jlyq6g4qhfbw0000gn/T//Rtmpeo8saO/h2o_jofaichow_started_from_r.err
Starting H2O JVM and connecting: .. Connection successful!
R is connected to the H2O cluster:
H2O cluster uptime: 1 seconds 997 milliseconds
H2O cluster timezone: Europe/Paris
H2O data parsing timezone: UTC
H2O cluster version: 3.24.0.5
H2O cluster version age: 20 days
H2O cluster name: H2O_started_from_R_jofaichow_kkt017
H2O cluster total nodes: 1
H2O cluster total memory: 3.56 GB
H2O cluster total cores: 8
H2O cluster allowed cores: 8
H2O cluster healthy: TRUE
H2O Connection ip: localhost
H2O Connection port: 54321
H2O Connection proxy: NA
H2O Internal Security: FALSE
H2O API Extensions: Amazon S3, XGBoost, Algos, AutoML, Core V3, Core V4
R Version: R version 3.6.0 (2019-04-26)
4.4 Convert R dataframe into H2O dataframe
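The conversion code was not rendered here; a minimal sketch consistent with the split below (which uses `h_boston`) would be:

```r
# Push the R dataframe into the H2O cluster as an H2O frame
h_boston <- as.h2o(BostonHousing)
```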
4.5 Split Data into Train/Test
h_split <- h2o.splitFrame(h_boston, ratios = 0.8, seed = n_seed)
h_train <- h_split[[1]] # 80% for modelling
h_test <- h_split[[2]] # 20% for evaluation
[1] 411 14
[1] 95 14
4.6 Cross-Validation
4.7 Baseline Models
- h2o.glm(): H2O Generalized Linear Model
- h2o.randomForest(): H2O Random Forest Model
- h2o.gbm(): H2O Gradient Boosting Model
- h2o.deeplearning(): H2O Deep Neural Network Model
- h2o.xgboost(): H2O wrapper for eXtreme Gradient Boosting Model from DMLC
4.7.1 Baseline Generalized Linear Model (GLM)
model_glm <- h2o.glm(x = features, # All 13 features
y = target, # medv (median value of owner-occupied homes in $1000s)
training_frame = h_train, # H2O dataframe with training data
model_id = "baseline_glm", # Give the model a name
nfolds = 5, # Using 5-fold CV
seed = n_seed) # Your lucky seed
H2ORegressionMetrics: glm
** Reported on cross-validation data. **
** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
MSE: 23.04256
RMSE: 4.800267
MAE: 3.307191
RMSLE: NaN
Mean Residual Deviance : 23.04256
R^2 : 0.7076243
Null Deviance :32617.3
Null D.o.F. :410
Residual Deviance :9470.494
Residual D.o.F. :396
AIC :2487.815
H2ORegressionMetrics: glm
MSE: 28.87315
RMSE: 5.373374
MAE: 3.859189
RMSLE: 0.1861469
Mean Residual Deviance : 28.87315
R^2 : 0.7254239
Null Deviance :10402.21
Null D.o.F. :94
Residual Deviance :2742.949
Residual D.o.F. :80
AIC :621.075
Let’s use RMSE
4.7.2 Build Other Baseline Models (DRF, GBM, DNN & XGB)
# Baseline Distributed Random Forest (DRF)
model_drf <- h2o.randomForest(x = features,
y = target,
training_frame = h_train,
model_id = "baseline_drf",
nfolds = 5,
seed = n_seed)
# Baseline Gradient Boosting Model (GBM)
model_gbm <- h2o.gbm(x = features,
y = target,
training_frame = h_train,
model_id = "baseline_gbm",
nfolds = 5,
seed = n_seed)
# Baseline Deep Neural Network (DNN)
# By default, DNN is not reproducible with multi-core. You may get slightly different results here.
# You can enable the `reproducible` option but it will run on a single core (very slow).
model_dnn <- h2o.deeplearning(x = features,
y = target,
training_frame = h_train,
model_id = "baseline_dnn",
nfolds = 5,
seed = n_seed)
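The comparison below also references a `model_xgb`; a baseline XGBoost model, following the same pattern as the other baselines, would be:

```r
# Baseline eXtreme Gradient Boosting (XGBoost)
model_xgb <- h2o.xgboost(x = features,
                         y = target,
                         training_frame = h_train,
                         model_id = "baseline_xgb",
                         nfolds = 5,
                         seed = n_seed)
```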
4.7.3 Comparison (RMSE: Lower = Better)
# Create a table to compare RMSE from different models
d_eval <- data.frame(model = c("H2O GLM: Generalized Linear Model (Baseline)",
"H2O DRF: Distributed Random Forest (Baseline)",
"H2O GBM: Gradient Boosting Model (Baseline)",
"H2O DNN: Deep Neural Network (Baseline)",
"XGBoost: eXtreme Gradient Boosting Model (Baseline)"),
stringsAsFactors = FALSE)
d_eval$RMSE_cv <- NA
d_eval$RMSE_test <- NA
# Store RMSE values
d_eval[1, ]$RMSE_cv <- model_glm@model$cross_validation_metrics@metrics$RMSE
d_eval[2, ]$RMSE_cv <- model_drf@model$cross_validation_metrics@metrics$RMSE
d_eval[3, ]$RMSE_cv <- model_gbm@model$cross_validation_metrics@metrics$RMSE
d_eval[4, ]$RMSE_cv <- model_dnn@model$cross_validation_metrics@metrics$RMSE
d_eval[5, ]$RMSE_cv <- model_xgb@model$cross_validation_metrics@metrics$RMSE
d_eval[1, ]$RMSE_test <- h2o.rmse(h2o.performance(model_glm, newdata = h_test))
d_eval[2, ]$RMSE_test <- h2o.rmse(h2o.performance(model_drf, newdata = h_test))
d_eval[3, ]$RMSE_test <- h2o.rmse(h2o.performance(model_gbm, newdata = h_test))
d_eval[4, ]$RMSE_test <- h2o.rmse(h2o.performance(model_dnn, newdata = h_test))
d_eval[5, ]$RMSE_test <- h2o.rmse(h2o.performance(model_xgb, newdata = h_test))
4.8 Manual Tuning
4.8.1 Check out the hyper-parameters for each algo
4.8.2 Train an XGBoost model with manual settings
model_xgb_m <- h2o.xgboost(x = features,
y = target,
training_frame = h_train,
model_id = "model_xgb_m",
nfolds = 5,
seed = n_seed,
# Manual Settings based on experience
learn_rate = 0.1, # use a lower rate (more conservative)
ntrees = 100, # use more trees (due to lower learn_rate)
sample_rate = 0.9, # use random n% of samples for each tree
col_sample_rate = 0.9) # use random n% of features for each tree
4.8.3 Comparison (RMSE: Lower = Better)
d_eval_tmp <- data.frame(model = "XGBoost: eXtreme Gradient Boosting Model (Manual Settings)",
RMSE_cv = model_xgb_m@model$cross_validation_metrics@metrics$RMSE,
RMSE_test = h2o.rmse(h2o.performance(model_xgb_m, newdata = h_test)))
d_eval <- rbind(d_eval, d_eval_tmp)
datatable(d_eval, rownames = FALSE, options = list(pageLength = 10, scrollX = TRUE)) %>%
formatRound(columns = -1, digits = 4)
4.9 H2O AutoML
# Run AutoML (try n different models)
# Check out all options using ?h2o.automl
automl = h2o.automl(x = features,
y = target,
training_frame = h_train,
nfolds = 5, # 5-fold Cross-Validation
max_models = 20, # Max number of models
stopping_metric = "RMSE", # Metric to optimize
project_name = "automl_boston", # Specify a name so you can add more models later
seed = n_seed)
4.9.1 Leaderboard
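The leaderboard itself was not rendered above; it can be displayed, for example, in the same datatable style used elsewhere in this tutorial:

```r
# View the AutoML leaderboard (models ranked by the chosen metric)
datatable(as.data.frame(automl@leaderboard),
          rownames = FALSE, options = list(pageLength = 10, scrollX = TRUE)) %>%
  formatRound(columns = -1, digits = 4)
```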
4.9.2 Best Model (Leader)
Model Details:
==============
H2ORegressionModel: stackedensemble
Model ID: StackedEnsemble_AllModels_AutoML_20190709_072844
NULL
H2ORegressionMetrics: stackedensemble
** Reported on training data. **
MSE: 0.1426481
RMSE: 0.3776879
MAE: 0.2960617
RMSLE: 0.01988458
Mean Residual Deviance : 0.1426481
H2ORegressionMetrics: stackedensemble
** Reported on cross-validation data. **
** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
MSE: 6.775532
RMSE: 2.602985
MAE: 1.870888
RMSLE: 0.1238892
Mean Residual Deviance : 6.775532
4.9.3 Comparison (RMSE: Lower = Better)
d_eval_tmp <- data.frame(model = "Best Model from H2O AutoML",
RMSE_cv = automl@leader@model$cross_validation_metrics@metrics$RMSE,
RMSE_test = h2o.rmse(h2o.performance(automl@leader, newdata = h_test)))
d_eval <- rbind(d_eval, d_eval_tmp)
datatable(d_eval, rownames = FALSE, options = list(pageLength = 10, scrollX = TRUE)) %>%
formatRound(columns = -1, digits = 4)
4.10 Make Predictions
predict
1 35.73996
2 16.79900
3 19.85202
4 16.64819
5 17.82168
6 18.14047
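Predictions like the ones above can be generated with `h2o.predict()`, e.g. using the AutoML leader:

```r
# Predict on the test set with the best model from AutoML
yhat_test <- h2o.predict(automl@leader, newdata = h_test)
head(as.data.frame(yhat_test))
```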
5 Regression Part Two: XAI
Let’s look at the first house in h_test
datatable(as.data.frame(h_test[1, ]),
rownames = FALSE, options = list(pageLength = 10, scrollX = TRUE))
5.1 Using functions in h2o
- h2o.varimp() & h2o.varimp_plot(): Variable Importance (for GBM, DNN, GLM)
- h2o.partialPlot(): Partial Dependence Plots
- h2o.predict_contributions(): SHAP values (for GBM and XGBoost only)
# Look at the impact of feature `rm` (no. of rooms)
# Not Run
h2o.partialPlot(model_glm, data = h_test, cols = c("rm"))
h2o.partialPlot(model_drf, data = h_test, cols = c("rm"))
h2o.partialPlot(model_gbm, data = h_test, cols = c("rm"))
h2o.partialPlot(model_dnn, data = h_test, cols = c("rm"))
h2o.partialPlot(model_xgb, data = h_test, cols = c("rm"))
h2o.partialPlot(automl@leader, data = h_test, cols = c("rm"))
5.2 Package DALEX
- Website: https://pbiecek.github.io/DALEX/
- Original DALEX-H2O Example: https://raw.githack.com/pbiecek/DALEX_docs/master/vignettes/DALEX_h2o.html
5.2.1 The explain() Function
The first step of using the DALEX package is to wrap the black-box model with meta-data that unifies model interfacing.
To create an explainer we use the explain() function. The validation dataset for the models is h_test from part one. For models created by the h2o package we have to provide a custom predict function which takes two arguments, model and newdata, and returns a numeric vector with predictions.
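The code defining `custom_predict` was not rendered here; a common pattern for H2O models (as in the DALEX-H2O vignette linked above) is:

```r
# Custom predict function for DALEX: takes a model and an R dataframe,
# returns a plain numeric vector of predictions
custom_predict <- function(model, newdata) {
  newdata_h2o <- as.h2o(newdata)
  res <- as.data.frame(h2o.predict(model, newdata_h2o))
  return(as.numeric(res$predict))
}
```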
5.2.2 Explainer for H2O Models
explainer_drf <- DALEX::explain(model = model_drf,
data = as.data.frame(h_test)[, features],
y = as.data.frame(h_test)[, target],
predict_function = custom_predict,
label = "Random Forest")
explainer_dnn <- DALEX::explain(model = model_dnn,
data = as.data.frame(h_test)[, features],
y = as.data.frame(h_test)[, target],
predict_function = custom_predict,
label = "Deep Neural Networks")
explainer_xgb <- DALEX::explain(model = model_xgb,
data = as.data.frame(h_test)[, features],
y = as.data.frame(h_test)[, target],
predict_function = custom_predict,
label = "XGBoost")
explainer_automl <- DALEX::explain(model = automl@leader,
data = as.data.frame(h_test)[, features],
y = as.data.frame(h_test)[, target],
predict_function = custom_predict,
label = "H2O AutoML")
5.2.3 Variable importance
Using the DALEX package we are able to better understand which variables are important.
Model-agnostic variable importance is calculated by means of permutations: we compare the loss function calculated on the validation dataset with the values of a single variable permuted against the loss function calculated on the unmodified validation dataset. The difference measures how much the model relies on that variable.
This method is implemented in the variable_importance() function.
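For example, assuming the explainers created in section 5.2.2:

```r
# Permutation-based variable importance for each model
vi_drf    <- variable_importance(explainer_drf)
vi_dnn    <- variable_importance(explainer_dnn)
vi_xgb    <- variable_importance(explainer_xgb)
vi_automl <- variable_importance(explainer_automl)
plot(vi_drf, vi_dnn, vi_xgb, vi_automl)
```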
5.2.4 Partial Dependence Plots
Partial Dependence Plots (PDP) are one of the most popular methods for exploring the relationship between a continuous variable and the model outcome. The function variable_response() with the parameter type = "pdp" calls the pdp::partial() function to calculate the PDP response.
Let's look at the feature rm (no. of rooms)
pdp_drf_rm <- variable_response(explainer_drf, variable = "rm")
pdp_dnn_rm <- variable_response(explainer_dnn, variable = "rm")
pdp_xgb_rm <- variable_response(explainer_xgb, variable = "rm")
pdp_automl_rm <- variable_response(explainer_automl, variable = "rm")
plot(pdp_drf_rm, pdp_dnn_rm, pdp_xgb_rm, pdp_automl_rm)
5.2.5 Prediction Understanding
# Predictions from different models
yhat <- data.frame(model = c("H2O DRF: Distributed Random Forest (Baseline)",
"H2O DNN: Deep Neural Network (Baseline)",
"XGBoost: eXtreme Gradient Boosting Model (Baseline)",
"Best Model from H2O AutoML"))
yhat$prediction <- NA
yhat[1,]$prediction <- as.matrix(h2o.predict(model_drf, h_test[1,]))
yhat[2,]$prediction <- as.matrix(h2o.predict(model_dnn, h_test[1,]))
yhat[3,]$prediction <- as.matrix(h2o.predict(model_xgb, h_test[1,]))
yhat[4,]$prediction <- as.matrix(h2o.predict(automl@leader, h_test[1,]))
# Show the predictions
datatable(yhat, rownames = FALSE, options = list(pageLength = 10, scrollX = TRUE)) %>%
formatRound(columns = -1, digits = 3)
The function prediction_breakdown() is a wrapper around the breakDown package. The model prediction is visualized with Break Down Plots, which show the contribution of every variable present in the model. prediction_breakdown() generates variable attributions for the selected prediction, and the generic plot() function shows these attributions.
library(breakDown)
sample <- as.data.frame(h_test)[1, ] # Using the first sample from h_test
pb_drf <- prediction_breakdown(explainer_drf, observation = sample)
pb_dnn <- prediction_breakdown(explainer_dnn, observation = sample)
pb_xgb <- prediction_breakdown(explainer_xgb, observation = sample)
pb_automl <- prediction_breakdown(explainer_automl, observation = sample)
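The attributions can then be displayed with the generic plot() function, e.g. for the AutoML leader:

```r
plot(pb_automl) # Break Down plot for the AutoML leader's prediction
```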
6 Coffee Break 10:30 - 11:00
7 Classification Part One: H2O AutoML
7.1 Data - Pima Indians Diabetes from mlbench
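As in the regression part, the dataset can be loaded and previewed (the code was not rendered here; this mirrors the Boston Housing example above):

```r
data("PimaIndiansDiabetes")
datatable(head(PimaIndiansDiabetes),
          rownames = FALSE, options = list(pageLength = 6, scrollX = TRUE))
```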
7.2 Data Prep
# Convert pos and neg to 1 and 0
d_new <- PimaIndiansDiabetes[, -ncol(PimaIndiansDiabetes)]
d_new$diabetes <- 0
d_new[which(PimaIndiansDiabetes$diabetes == "pos"), ]$diabetes <- 1
PimaIndiansDiabetes <- d_new
rm(d_new)
[1] "pregnant" "glucose" "pressure" "triceps" "insulin" "mass"
[7] "pedigree" "age"
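The feature list printed above implies the target and features were redefined for this dataset, e.g.:

```r
target <- "diabetes"
features <- setdiff(colnames(PimaIndiansDiabetes), target)
print(features)
```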
7.3 Start a local H2O Cluster (JVM)
Connection successful!
R is connected to the H2O cluster:
H2O cluster uptime: 8 minutes 12 seconds
H2O cluster timezone: Europe/Paris
H2O data parsing timezone: UTC
H2O cluster version: 3.24.0.5
H2O cluster version age: 20 days
H2O cluster name: H2O_started_from_R_jofaichow_kkt017
H2O cluster total nodes: 1
H2O cluster total memory: 3.26 GB
H2O cluster total cores: 8
H2O cluster allowed cores: 8
H2O cluster healthy: TRUE
H2O Connection ip: localhost
H2O Connection port: 54321
H2O Connection proxy: NA
H2O Internal Security: FALSE
H2O API Extensions: Amazon S3, XGBoost, Algos, AutoML, Core V3, Core V4
R Version: R version 3.6.0 (2019-04-26)
7.4 Convert R dataframe into H2O dataframe
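The conversion code was not rendered here; a sketch consistent with the split below (which uses `h_diabetes`) would be. Note the target should be a factor so H2O treats this as a classification problem:

```r
h_diabetes <- as.h2o(PimaIndiansDiabetes)
h_diabetes$diabetes <- as.factor(h_diabetes$diabetes) # enum target -> classification
```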
7.5 Split Data into Train/Test
h_split <- h2o.splitFrame(h_diabetes, ratios = 0.8, seed = n_seed)
h_train <- h_split[[1]] # 80% for modelling
h_test <- h_split[[2]] # 20% for evaluation
[1] 618 9
[1] 150 9
7.6 H2O AutoML
# Run AutoML (try n different models)
# Check out all options using ?h2o.automl
automl = h2o.automl(x = features,
y = target,
training_frame = h_train,
nfolds = 5, # 5-fold Cross-Validation
max_models = 20, # Max number of models
stopping_metric = "logloss", # Metric to optimize
project_name = "automl_diabetes", # Specify a name so you can add more models later
sort_metric = "logloss",
seed = n_seed)
8 Classification Part Two: XAI
8.1 Package DALEX
8.1.1 The explain()
Function
8.1.2 Explainer for H2O Models
8.1.3 Variable importance
8.1.4 Partial Dependence Plots
Let's look at the feature age
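Mirroring the regression part, a PDP for age could be sketched as follows, assuming a classification explainer (e.g. explainer_automl) has been built as in section 5.2.2, with a predict function returning the predicted probability of diabetes:

```r
pdp_automl_age <- variable_response(explainer_automl, variable = "age")
plot(pdp_automl_age)
```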
9 Bring Your Own Data + Q&A
Get your hands dirty!