useR! 2019 H2O Tutorial (bit.ly/useR2019_h2o_tutorial)

1 Agenda

  • 09:00 to 09:30 Set Up & Introduction
  • 09:30 to 10:30 Regression Example
  • 10:30 to 11:00 Coffee Break
  • 11:00 to 11:30 Classification Example
  • 11:30 to 12:30 Bring Your Own Data + Q&A

2 Set Up

2.1 Download -> bit.ly/useR2019_h2o_tutorial

2.2 R Packages

  • Check out setup.R
  • For this tutorial:
    • h2o for machine learning
    • mlbench for Boston Housing dataset
    • DALEX, breakDown & pdp for explaining model predictions
  • For RMarkdown
    • knitr for rendering this RMarkdown
    • rmdformats for readthedown RMarkdown template
    • DT for nice tables
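
The contents of setup.R are not reproduced here. As a minimal sketch (an assumption, not the workshop's actual script), the following base R snippet reports which of the packages listed above are missing. The install line is left commented out so that nothing is installed unintentionally; `missing_pkgs` is an illustrative name.

```r
# Packages named in the list above
pkgs <- c("h2o", "mlbench", "DALEX", "breakDown", "pdp",
          "knitr", "rmdformats", "DT")

# TRUE for every package that can be loaded on this machine
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

missing_pkgs <- pkgs[!installed]
print(missing_pkgs)

# Uncomment to install whatever is missing from CRAN
# install.packages(missing_pkgs)
```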

3 Introduction

The General Data Protection Regulation (GDPR) is now in place. Are you ready to explain your models? This is a hands-on tutorial for R beginners. I will demonstrate how to use H2O and other R packages for automatic and interpretable machine learning. Participants will be able to follow along and quickly build regression and classification models with H2O’s AutoML, then explain the model outcomes with various methods.

This workshop is for R beginners and anyone interested in machine learning. The RMarkdown source and the rendered HTML are provided, so everyone can follow along without running the code.

(Now go to slides …)

4 Regression Part One: H2O AutoML

4.1 Data - Boston Housing from mlbench

Source: UCI Machine Learning Repository

  • crim: per capita crime rate by town.
  • zn: proportion of residential land zoned for lots over 25,000 sq.ft.
  • indus: proportion of non-retail business acres per town.
  • chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
  • nox: nitrogen oxides concentration (parts per 10 million).
  • rm: average number of rooms per dwelling.
  • age: proportion of owner-occupied units built prior to 1940.
  • dis: weighted mean of distances to five Boston employment centres.
  • rad: index of accessibility to radial highways.
  • tax: full-value property-tax rate per $10,000.
  • ptratio: pupil-teacher ratio by town.
  • b: 1000(Bk - 0.63)^2 where Bk is the proportion of people of African American descent by town.
  • lstat: lower status of the population (percent).
  • medv (This is the TARGET): median value of owner-occupied homes in $1000s.

4.2 Define Target and Features
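
The code for this step is not shown in the rendered output. One plausible way to arrive at the feature list printed below is sketched here, assuming the `BostonHousing` data frame from mlbench; the variable names `target` and `features` are illustrative.

```r
library(mlbench)

# Load the Boston Housing data frame shipped with mlbench
data("BostonHousing")

# medv is the target; every other column is a feature
target   <- "medv"
features <- setdiff(colnames(BostonHousing), target)

print(features)
```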

 [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
 [8] "dis"     "rad"     "tax"     "ptratio" "b"       "lstat"  

(Figure: machine learning overview)

4.3 Start a local H2O Cluster (JVM)
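
A minimal sketch of the call that starts (or connects to) a local cluster and yields output like that shown below. The `nthreads = -1` setting is an assumption; the workshop may have used the defaults.

```r
library(h2o)

# Start a local H2O cluster, or connect to one that is already running.
# nthreads = -1 lets H2O use all available cores.
h2o.init(nthreads = -1)
```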


H2O is not running yet, starting it now...

Note:  In case of errors look at the following log files:
    /var/folders/4z/p7yt7_4n4fj1jlyq6g4qhfbw0000gn/T//Rtmpeo8saO/h2o_jofaichow_started_from_r.out
    /var/folders/4z/p7yt7_4n4fj1jlyq6g4qhfbw0000gn/T//Rtmpeo8saO/h2o_jofaichow_started_from_r.err


Starting H2O JVM and connecting: .. Connection successful!

R is connected to the H2O cluster: 
    H2O cluster uptime:         1 seconds 997 milliseconds 
    H2O cluster timezone:       Europe/Paris 
    H2O data parsing timezone:  UTC 
    H2O cluster version:        3.24.0.5 
    H2O cluster version age:    20 days  
    H2O cluster name:           H2O_started_from_R_jofaichow_kkt017 
    H2O cluster total nodes:    1 
    H2O cluster total memory:   3.56 GB 
    H2O cluster total cores:    8 
    H2O cluster allowed cores:  8 
    H2O cluster healthy:        TRUE 
    H2O Connection ip:          localhost 
    H2O Connection port:        54321 
    H2O Connection proxy:       NA 
    H2O Internal Security:      FALSE 
    H2O API Extensions:         Amazon S3, XGBoost, Algos, AutoML, Core V3, Core V4 
    R Version:                  R version 3.6.0 (2019-04-26) 

4.4 Convert R dataframe into H2O dataframe
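
The conversion itself is a single call to `as.h2o()`. The sketch below also splits the data, because later sections refer to a test frame called h_test. The names `h_boston`, `h_split` and `h_train`, the 0.8 ratio, and the seed are assumptions, not values taken from the workshop.

```r
library(h2o)
library(mlbench)

h2o.init()
data("BostonHousing")

# Copy the R data frame into the H2O cluster as an H2OFrame
h_boston <- as.h2o(BostonHousing)

# Assumed split: roughly 80% for training, 20% held out for testing
h_split <- h2o.splitFrame(h_boston, ratios = 0.8, seed = 1234)
h_train <- h_split[[1]]
h_test  <- h_split[[2]]

dim(h_train)
dim(h_test)
```

Note that `h2o.splitFrame()` splits probabilistically, so the row counts are only approximately 80/20.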

4.6 Cross-Validation

(Figure: cross-validation)

4.7 Baseline Models

  • h2o.glm(): H2O Generalized Linear Model
  • h2o.randomForest(): H2O Random Forest Model
  • h2o.gbm(): H2O Gradient Boosting Model
  • h2o.deeplearning(): H2O Deep Neural Network Model
  • h2o.xgboost(): H2O wrapper for eXtreme Gradient Boosting Model from DMLC

4.7.1 Baseline Generalized Linear Model (GLM)
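
A sketch of how a baseline GLM with 5-fold cross-validation might be trained and evaluated, producing metric summaries of the kind shown below. The split ratio, seed, and the name `model_glm` are assumptions, so the numbers will not match the output exactly.

```r
library(h2o)
library(mlbench)

h2o.init()
data("BostonHousing")

target   <- "medv"
features <- setdiff(colnames(BostonHousing), target)

# Assumed 80/20 split; the seed is arbitrary
h_split <- h2o.splitFrame(as.h2o(BostonHousing), ratios = 0.8, seed = 1234)
h_train <- h_split[[1]]
h_test  <- h_split[[2]]

# Gaussian GLM with 5-fold cross-validation on the training frame
model_glm <- h2o.glm(x = features, y = target,
                     training_frame = h_train,
                     family = "gaussian",
                     nfolds = 5,
                     seed = 1234)

# Metrics on the combined cross-validation holdout predictions
h2o.performance(model_glm, xval = TRUE)

# Metrics on the held-out test frame
h2o.performance(model_glm, newdata = h_test)
```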

H2ORegressionMetrics: glm
** Reported on cross-validation data. **
** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **

MSE:  23.04256
RMSE:  4.800267
MAE:  3.307191
RMSLE:  NaN
Mean Residual Deviance :  23.04256
R^2 :  0.7076243
Null Deviance :32617.3
Null D.o.F. :410
Residual Deviance :9470.494
Residual D.o.F. :396
AIC :2487.815
H2ORegressionMetrics: glm

MSE:  28.87315
RMSE:  5.373374
MAE:  3.859189
RMSLE:  0.1861469
Mean Residual Deviance :  28.87315
R^2 :  0.7254239
Null Deviance :10402.21
Null D.o.F. :94
Residual Deviance :2742.949
Residual D.o.F. :80
AIC :621.075

Let’s use RMSE

(Figure: RMSE)
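
RMSE is the square root of the mean squared difference between actual and predicted values. A small base R illustration follows; the toy numbers are made up purely for demonstration.

```r
# Root mean squared error between actual and predicted values
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Toy numbers, made up purely for illustration
actual    <- c(24, 21, 35)
predicted <- c(25, 19, 33)

rmse(actual, predicted)
# [1] 1.732051

# RMSE is also the square root of MSE, e.g. the cross-validated GLM MSE above
sqrt(23.04256)
```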

4.7.3 Comparison (RMSE: Lower = Better)

4.8 Manual Tuning

4.8.1 Check out the hyper-parameters for each algo

4.8.3 Comparison (RMSE: Lower = Better)

4.9 H2O AutoML
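
A minimal sketch of an AutoML run covering the leaderboard, the leader, and predictions on the test frame. The 60-second budget, the split, the seed, and the names `aml`, `best_model` and `yhat_test` are assumptions rather than the workshop's settings.

```r
library(h2o)
library(mlbench)

h2o.init()
data("BostonHousing")

target   <- "medv"
features <- setdiff(colnames(BostonHousing), target)

h_split <- h2o.splitFrame(as.h2o(BostonHousing), ratios = 0.8, seed = 1234)
h_train <- h_split[[1]]
h_test  <- h_split[[2]]

# Let AutoML train and cross-validate a set of models within a time budget
aml <- h2o.automl(x = features, y = target,
                  training_frame = h_train,
                  nfolds = 5,
                  max_runtime_secs = 60,   # assumed budget
                  seed = 1234)

# Models ranked by cross-validated performance
print(aml@leaderboard)

# Best model, then its predictions for the test frame
best_model <- aml@leader
yhat_test  <- h2o.predict(best_model, h_test)
head(yhat_test)
```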

4.9.1 Leaderboard

4.9.2 Best Model (Leader)

Model Details:
==============

H2ORegressionModel: stackedensemble
Model ID:  StackedEnsemble_AllModels_AutoML_20190709_072844 
NULL


H2ORegressionMetrics: stackedensemble
** Reported on training data. **

MSE:  0.1426481
RMSE:  0.3776879
MAE:  0.2960617
RMSLE:  0.01988458
Mean Residual Deviance :  0.1426481



H2ORegressionMetrics: stackedensemble
** Reported on cross-validation data. **
** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **

MSE:  6.775532
RMSE:  2.602985
MAE:  1.870888
RMSLE:  0.1238892
Mean Residual Deviance :  6.775532

4.9.3 Comparison (RMSE: Lower = Better)

4.10 Make Predictions

   predict
1 35.73996
2 16.79900
3 19.85202
4 16.64819
5 17.82168
6 18.14047

5 Regression Part Two: XAI

Let’s look at the first house in h_test

5.2 Package DALEX

5.2.1 The explain() Function

The first step in using the DALEX package is to wrap the black-box model with metadata that unifies model interfacing.

To create an explainer, we use the explain() function. The validation dataset for the models is h_test from part one. For models created by the h2o package, we have to provide a custom predict function that takes two arguments, model and newdata, and returns a numeric vector of predictions.
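
A sketch of what such a wrapper could look like. The GBM is only a stand-in for whichever H2O model is being explained, and the names `h2o_predict`, `df_test`, `model_gbm` and the label are assumptions.

```r
library(h2o)
library(mlbench)
library(DALEX)

h2o.init()
data("BostonHousing")

target   <- "medv"
features <- setdiff(colnames(BostonHousing), target)

h_split <- h2o.splitFrame(as.h2o(BostonHousing), ratios = 0.8, seed = 1234)
h_train <- h_split[[1]]
h_test  <- h_split[[2]]

# Stand-in model; any H2O regression model could be wrapped the same way
model_gbm <- h2o.gbm(x = features, y = target,
                     training_frame = h_train, seed = 1234)

# Custom predict function: takes (model, newdata), returns a numeric vector
h2o_predict <- function(model, newdata) {
  preds <- h2o.predict(model, as.h2o(newdata))
  as.numeric(as.data.frame(preds)$predict)
}

# DALEX works on plain R data frames, so bring the test frame back into R
df_test <- as.data.frame(h_test)

explainer <- DALEX::explain(model = model_gbm,
                            data = df_test[, features],
                            y = df_test[[target]],
                            predict_function = h2o_predict,
                            label = "H2O GBM")
```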

5.2.3 Variable importance

Using the DALEX package, we can better understand which variables are important.

Model-agnostic variable importance is calculated by means of permutations. We permute the values of a single variable in the validation dataset, compute the loss function on this permuted dataset, and subtract the loss function computed on the original validation dataset. The larger the increase in loss, the more important the variable.

This method is implemented in the variable_importance() function.
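
To make the permutation idea concrete, here is a base R sketch. It is not DALEX's implementation: it uses a plain linear model as a stand-in for the H2O model and, for brevity, the full dataset in place of a separate validation set. Newer DALEX releases may also expose this functionality under different function names.

```r
library(mlbench)
data("BostonHousing")

set.seed(1234)

# Stand-in model: a plain linear regression on all features
model <- lm(medv ~ ., data = BostonHousing)

# Loss function: root mean squared error
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

features  <- setdiff(colnames(BostonHousing), "medv")
base_loss <- rmse(BostonHousing$medv, predict(model, BostonHousing))

# Permute one feature at a time and record how much the loss increases
importance <- vapply(features, function(feature) {
  permuted <- BostonHousing
  permuted[[feature]] <- sample(permuted[[feature]])
  rmse(permuted$medv, predict(model, permuted)) - base_loss
}, numeric(1))

sort(importance, decreasing = TRUE)
```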

5.2.4 Partial Dependence Plots

Partial Dependence Plots (PDP) are one of the most popular methods for exploring the relationship between a continuous variable and the model outcome. The function variable_response() with the parameter type = "pdp" calls pdp::partial() to calculate the PDP response.

Let’s look at feature rm (no. of rooms)
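
The calculation behind a partial dependence curve can be sketched in base R. This is an illustration of the technique, not the pdp or DALEX code path: a plain linear model stands in for the H2O model, and the 20-point grid is an arbitrary choice.

```r
library(mlbench)
data("BostonHousing")

# Stand-in model: a plain linear regression on all features
model <- lm(medv ~ ., data = BostonHousing)

# Grid of values spanning the observed range of rm
rm_grid <- seq(min(BostonHousing$rm), max(BostonHousing$rm), length.out = 20)

# For each grid value, set rm to that value in every row
# and average the model's predictions
pd <- vapply(rm_grid, function(value) {
  modified <- BostonHousing
  modified$rm <- value
  mean(predict(model, modified))
}, numeric(1))

plot(rm_grid, pd, type = "l",
     xlab = "rm (average number of rooms)",
     ylab = "average predicted medv")
```

For a linear model the curve is a straight line; a tree ensemble or neural network would typically show a non-linear shape.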

5.2.5 Prediction Understanding

The function prediction_breakdown() is a wrapper around the breakDown package. Model predictions are visualized with Break Down Plots, which show the contribution of every variable in the model. prediction_breakdown() generates variable attributions for a selected prediction, and the generic plot() function shows these attributions.
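
The idea of variable attributions can be sketched in base R. This is a simplified illustration, not the breakDown package's algorithm: features are fixed in plain column order rather than in a data-driven order, and a linear model stands in for the H2O model.

```r
library(mlbench)
data("BostonHousing")

# Stand-in model: a plain linear regression on all features
model <- lm(medv ~ ., data = BostonHousing)

features    <- setdiff(colnames(BostonHousing), "medv")
observation <- BostonHousing[1, ]

# Start from the average prediction over the whole dataset
modified <- BostonHousing
baseline <- mean(predict(model, modified))

# Fix one feature at a time to the observation's value and
# record how much the average prediction moves
contributions <- numeric(0)
previous <- baseline
for (feature in features) {
  modified[[feature]] <- rep(observation[[feature]], nrow(modified))
  current <- mean(predict(model, modified))
  contributions[feature] <- current - previous
  previous <- current
}

# Baseline plus all contributions equals the prediction for this observation
baseline + sum(contributions)
predict(model, observation)
```

For non-additive models the individual contributions depend on the order in which features are fixed, although their sum does not.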

6 Coffee Break 10:30 - 11:00

7 Classification Part One: H2O AutoML

7.1 Data - Pima Indians Diabetes from mlbench

7.3 Start a local H2O Cluster (JVM)

 Connection successful!

R is connected to the H2O cluster: 
    H2O cluster uptime:         8 minutes 12 seconds 
    H2O cluster timezone:       Europe/Paris 
    H2O data parsing timezone:  UTC 
    H2O cluster version:        3.24.0.5 
    H2O cluster version age:    20 days  
    H2O cluster name:           H2O_started_from_R_jofaichow_kkt017 
    H2O cluster total nodes:    1 
    H2O cluster total memory:   3.26 GB 
    H2O cluster total cores:    8 
    H2O cluster allowed cores:  8 
    H2O cluster healthy:        TRUE 
    H2O Connection ip:          localhost 
    H2O Connection port:        54321 
    H2O Connection proxy:       NA 
    H2O Internal Security:      FALSE 
    H2O API Extensions:         Amazon S3, XGBoost, Algos, AutoML, Core V3, Core V4 
    R Version:                  R version 3.6.0 (2019-04-26) 

7.6 H2O AutoML
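
For classification the call is the same as in the regression part; what changes is the target. A sketch assuming the `PimaIndiansDiabetes` data frame from mlbench, whose `diabetes` column is a factor, follows. The split, seed, time budget, and variable names are assumptions.

```r
library(h2o)
library(mlbench)

h2o.init()
data("PimaIndiansDiabetes")

# diabetes is a factor ("neg"/"pos"), so H2O treats this as classification
target   <- "diabetes"
features <- setdiff(colnames(PimaIndiansDiabetes), target)

h_split <- h2o.splitFrame(as.h2o(PimaIndiansDiabetes), ratios = 0.8, seed = 1234)
h_train <- h_split[[1]]
h_test  <- h_split[[2]]

aml <- h2o.automl(x = features, y = target,
                  training_frame = h_train,
                  nfolds = 5,
                  max_runtime_secs = 60,   # assumed budget
                  seed = 1234)

print(aml@leaderboard)

# Predicted class plus one probability column per class
pred <- h2o.predict(aml@leader, h_test)
head(pred)
```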

7.6.1 Leaderboard

9 Bring Your Own Data + Q&A

Get your hands dirty!

2019-07-09