# The hal9001 R package

> "Estimation of a regression function is a common goal of statistical learning. We propose a novel nonparametric regression estimator that, in contrast to many existing methods, does not rely on local smoothness assumptions nor is it constructed using local smoothing techniques. Instead, our estimator respects global smoothness constraints by virtue of falling in a class of right-hand continuous functions with left-hand limits that have variation norm bounded by a constant. Using empirical process theory, we establish a fast minimal rate of convergence of our proposed estimator and illustrate how such an estimator can be constructed using standard software."
>
> — excerpted from Benkeser & van der Laan (2016)
In brief, the Highly Adaptive Lasso (HAL) expresses the regression function as a finite linear combination of indicator basis functions, with the fit determined by the corresponding coefficients $d\psi_{m,s,j}$ summed over $s$ and $j$.
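To make the basis expansion concrete, here is a toy base-R sketch of a zero-order indicator basis for a single covariate, with knots placed at the observed values. This is an illustration of the idea only, not hal9001's actual internals:

```r
# Toy illustration (assumed construction, not hal9001's internals):
# zero-order indicator basis functions phi_j(x) = 1(x >= knot_j)
# for one covariate, with knots at the observed values
x <- c(0.2, 0.5, 0.9)
knots <- sort(x)
# each column of `basis` is one indicator basis function evaluated at x
basis <- outer(x, knots, FUN = ">=") * 1
basis
#      [,1] [,2] [,3]
# [1,]    1    0    0
# [2,]    1    1    0
# [3,]    1    1    1
```

HAL then fits an L1-penalized regression on columns like these, which is what keeps the variation norm of the fit bounded.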
## hal9001

First, let's load the packages we'll be using and some core project management tools.
```r
options(repr.plot.width = 5, repr.plot.height = 5)  ## resizing plots
options(scipen = 999)  ## has scientific notation ever annoyed you?
library(here)
library(usethis)
library(tidyverse)
library(data.table, quietly = TRUE)
```
```r
library(earth)
library(glmnet)
library(randomForest)
library(hal9001)
library(sl3)
set.seed(385971)
```
We begin by simulating a simple data set and illustrating a basic application of hal9001 for prediction.
```r
n <- 1000
p <- 3
x <- cbind(matrix(rnorm(n * p), n, p), rbinom(n, 1, 0.61))
y <- sin(x[, 1]) * sin(x[, 2]) + x[, 1]^2 * (x[, 1] > x[, 2]) + x[, 3] * x[, 4] +
  rnorm(n, mean = 0, sd = 0.3)
data <- as.data.table(cbind(x, y))
setnames(data, c(paste0("X", seq_len(p + 1)), "Y"))
head(data)
```
X1 | X2 | X3 | X4 | Y |
---|---|---|---|---|
0.2086134 | -0.16809762 | 0.3635787 | 0 | 0.4550150 |
-0.9477350 | -0.68371703 | -1.2229561 | 0 | 0.1571233 |
0.9254026 | -1.42357891 | 0.1638252 | 1 | 0.1943260 |
0.5603774 | -0.09578228 | 1.4152984 | 1 | 1.6964338 |
0.6430260 | 1.01289165 | -1.1249601 | 0 | 0.5796670 |
-0.9730101 | 0.18683905 | 0.3577011 | 0 | -0.7109583 |
```r
GGally::ggpairs(data)
```
Let's try a few standard prediction algorithms:
```r
# Intercept-only model (i.e., an intercept GLM: predict the mean)
mean_fit <- mean(y)
mean_pred <- rep(mean_fit, length(y))
(mse_mean <- mean((y - mean_pred)^2))
```
```r
# Simple GLM
glm_fit <- glm(Y ~ ., data = data)
glm_pred <- predict(glm_fit)
(mse_glm <- mean((y - glm_pred)^2))
```
```r
# Ridge (L2-penalized) regression; n.b., use cv.glmnet() to select lambda,
# since predict() on a plain glmnet fit returns one column per lambda value,
# which would silently average the MSE over the whole regularization path
ridge_fit <- cv.glmnet(x = x, y = y, alpha = 0)
ridge_pred <- predict(ridge_fit, newx = x, s = "lambda.min")
(mse_ridge <- mean((y - ridge_pred)^2))

# Lasso (L1-penalized) regression
lasso_fit <- cv.glmnet(x = x, y = y, alpha = 1)
lasso_pred <- predict(lasso_fit, newx = x, s = "lambda.min")
(mse_lasso <- mean((y - lasso_pred)^2))
```
```r
# Multivariate Adaptive Regression Splines (MARS)
mars_fit <- earth(Y ~ ., data = data)
mars_pred <- predict(mars_fit)
(mse_mars <- mean((y - mars_pred)^2))
```
```r
# Random Forest
rf_fit <- randomForest(Y ~ ., data = data)
# n.b., predict() without newdata returns out-of-bag predictions, so this
# MSE is not strictly comparable to the in-sample MSEs above
rf_pred <- predict(rf_fit)
(mse_rf <- mean((y - rf_pred)^2))
```
```r
# Highly Adaptive LASSO (HAL)
hal_fit <- fit_hal(X = x, Y = y)
hal_pred <- predict(hal_fit, new_data = x)
(mse_hal <- mean((y - hal_pred)^2))
```
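Note that all of the MSEs above are computed on the same data used for fitting, which flatters the more flexible learners. A fairer comparison evaluates on held-out data; here is a minimal base-R sketch of an 80/20 train/test split, shown for the GLM only under the same simulation (the split proportion is an arbitrary choice):

```r
# Held-out MSE via a simple 80/20 train/test split (base-R sketch)
set.seed(385971)
n <- 1000
p <- 3
x <- cbind(matrix(rnorm(n * p), n, p), rbinom(n, 1, 0.61))
y <- sin(x[, 1]) * sin(x[, 2]) + x[, 1]^2 * (x[, 1] > x[, 2]) +
  x[, 3] * x[, 4] + rnorm(n, mean = 0, sd = 0.3)
dat <- data.frame(x, Y = y)
# fit on 80% of the rows, evaluate squared error on the held-out 20%
train <- sample(n, size = 0.8 * n)
glm_fit <- glm(Y ~ ., data = dat[train, ])
glm_pred <- predict(glm_fit, newdata = dat[-train, ])
(mse_glm_test <- mean((dat$Y[-train] - glm_pred)^2))
```

The cross-validated risks reported by sl3's Super Learner below implement this idea more carefully, via V-fold cross-validation.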
## sl3

Now that we've generated some data, we'll construct an `sl3_Task` object for use with sl3:
```r
task <- sl3_Task$new(data, covariates = paste0("X", seq_len(p + 1)), outcome = "Y")
task
```
```
A sl3 Task with 1000 obs and these nodes:
$covariates
[1] "X1" "X2" "X3" "X4"

$outcome
[1] "Y"

$id
NULL

$weights
NULL

$offset
NULL
```
We can easily and flexibly build a stack of learners (n.b., the stack is the sl3 idiom for the `SL.library` argument of the classic `SuperLearner` package):
```r
lrnr_stack <- make_learner_stack("Lrnr_mean", "Lrnr_glm_fast", "Lrnr_randomForest")
lrnr_stack
```
```
[1] "Lrnr_mean"
[1] "Lrnr_glm_fast_TRUE_Cholesky"
[1] "Lrnr_randomForest_100_TRUE_5_NULL_FALSE"
[[1]]
[1] "Lrnr_mean"

[[2]]
NULL

[[3]]
NULL
```
Next, we'll train the Super Learner and predict on the observed data:
```r
sl <- Lrnr_sl$new(learners = list(lrnr_stack), metalearner = Lrnr_nnls$new())
sl_fit <- sl$train(task)
preds_sl <- sl_fit$predict()
(mse_sl <- mean((y - preds_sl)^2))
```
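The `Lrnr_nnls` metalearner finds non-negative weights on the candidate learners' cross-validated predictions that minimize squared error. Here is a toy base-R sketch of that idea, deliberately simplified to two candidates and a grid search over a single mixing weight (all names and the grid-search shortcut are illustrative, not how sl3 solves the NNLS problem):

```r
# Toy sketch of metalearning: combine two candidate predictors with a
# non-negative weight chosen to minimize empirical squared-error risk
set.seed(2)
y <- rnorm(100, mean = 3)
pred_a <- rep(mean(y), 100)         # weak candidate: the grand mean
pred_b <- y + rnorm(100, sd = 0.5)  # strong candidate: a noisy near-oracle
# grid-search the convex-combination weight on pred_b
alphas <- seq(0, 1, by = 0.01)
risks <- sapply(alphas, function(a)
  mean((y - (a * pred_b + (1 - a) * pred_a))^2))
(best_alpha <- alphas[which.min(risks)])
```

Most of the weight lands on the stronger candidate, mirroring how the Super Learner below puts nearly all its weight on random forest.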
What does the stacked ensemble look like?
```r
sl_fit
```
```
[1] "SuperLearner:"
List of 1
 $ : chr "Stack"
[1] "Lrnr_nnls"
                                      lrnrs weights
1:                                Lrnr_mean 0.00000
2:              Lrnr_glm_fast_TRUE_Cholesky 0.00000
3: Lrnr_randomForest_100_TRUE_5_NULL_FALSE 1.03947
[1] "Cross-validated risk (MSE, squared error loss):"
                                    learner coefficients mean_risk    SE_risk
1:                                Lrnr_mean      0.00000 1.9951799 0.14827550
2:              Lrnr_glm_fast_TRUE_Cholesky      0.00000 1.1664105 0.09117183
3: Lrnr_randomForest_100_TRUE_5_NULL_FALSE       1.03947 0.2158630 0.01532222
4:                             SuperLearner           NA 0.2129381 0.01503779
      fold_SD fold_min_risk fold_max_risk
1: 0.68348784     1.2554075     3.0880768
2: 0.37317242     0.6950223     1.7927896
3: 0.04590635     0.1337699     0.2771392
4: 0.04479534     0.1417545     0.2834144
```
What happens when we add HAL to the stacked ensemble? Yes, there's a `Lrnr_hal9001` now!
```r
lrnr_stack_hal <- make_learner_stack("Lrnr_mean", "Lrnr_glm_fast",
                                     "Lrnr_randomForest", "Lrnr_hal9001")
sl_hal <- Lrnr_sl$new(learners = list(lrnr_stack_hal), metalearner = Lrnr_nnls$new())
sl_hal_fit <- sl_hal$train(task)
preds_sl_hal <- sl_hal_fit$predict()
(mse_sl_hal <- mean((y - preds_sl_hal)^2))
```
```r
sl_hal_fit
```
```
[1] "SuperLearner:"
List of 1
 $ : chr "Stack"
[1] "Lrnr_nnls"
                                      lrnrs   weights
1:                                Lrnr_mean 0.0000000
2:              Lrnr_glm_fast_TRUE_Cholesky 0.0000000
3: Lrnr_randomForest_100_TRUE_5_NULL_FALSE 0.5168892
4:         Lrnr_hal9001_NULL_glmnet_10_TRUE 0.5030761
[1] "Cross-validated risk (MSE, squared error loss):"
                                    learner coefficients mean_risk    SE_risk
1:                                Lrnr_mean    0.0000000 1.9951799 0.14827550
2:              Lrnr_glm_fast_TRUE_Cholesky    0.0000000 1.1664105 0.09117183
3: Lrnr_randomForest_100_TRUE_5_NULL_FALSE    0.5168892 0.2197907 0.01535048
4:         Lrnr_hal9001_NULL_glmnet_10_TRUE    0.5030761 0.2166624 0.02211670
5:                             SuperLearner           NA 0.1862682 0.01446540
      fold_SD fold_min_risk fold_max_risk
1: 0.68348784     1.2554075     3.0880768
2: 0.37317242     0.6950223     1.7927896
3: 0.04455936     0.1359939     0.2765591
4: 0.06755858     0.1352412     0.3725420
5: 0.03469860     0.1249777     0.2387863
```
## Contributing to sl3

Update: Homework #3 is now officially due Friday, 06 April, by 3pm.
- What learners to contribute: https://github.com/tlverse/sl3/issues/114
- How learners work: https://sl3.tlverse.org/articles/custom_lrnrs.html
Tips and best practices:

- Fork the sl3 repo from https://github.com/tlverse/sl3 and add your new learner there (e.g., `Lrnr_ranger`).
- Generate a learner skeleton with `sl3::write_learner_template(here::here("R", "Lrnr_ranger.R"))`.
- Add unit tests in `tests/testthat/test_Lrnr_ranger.R` that ensure that your new learner works as intended/expected.
- Submit your learner as a pull request to the sl3 package.
- The tlverse ecosystem uses the "Tidyverse Style Guide". Make sure to follow the formatting guidelines for your code and for pull requests. You can use the styler R package to automatically reformat your code.