Optimal Ensemble Learning with the sl3 R package

Author: Nima Hejazi

Date: 14 February 2018

Attribution: based on materials by David Benkeser, Jeremy Coyle, Ivana Malenica, and Oleg Sofrygin

Introduction

In this demonstration, we illustrate the basic functionality of the sl3 R package. Specifically, we walk through the concept of machine learning pipelines, the construction of ensemble models, and simple optimality properties of stacked regression. After this introduction, we will be well prepared to discuss more advanced topics in ensemble learning, such as optimal kernel density estimation.

Resources

Setup

First, we'll load the packages required for this exercise and load a simple data set (cpp_imputed below) that we'll use for demonstration purposes:

In [1]:
set.seed(49753)

# packages we'll be using
library(data.table)
library(SuperLearner)
library(origami)
library(sl3)

# load example data set
data(cpp_imputed)

# take a peek at the data
head(cpp_imputed)
Loading required package: nnls
Super Learner
Version: 2.0-23-9000
Package created on 2017-11-29

origami: Generalized Cross-Validation Framework
Version: 1.0.0
  subjid agedays   wtkg htcm lencm      bmi       waz      haz   whz   baz
1      1       1  4.621   55    55 15.27603  2.380000  2.61000  0.19  1.35
3      1     366 14.500   79    79 23.23346  3.840000  1.35000  4.02  3.89
4      2       1  3.345   51    51 12.86044  0.060000  0.50000 -0.64 -0.43
6      2     366  8.400   73    73 15.76281 -1.270000 -1.17000 -0.96 -0.80
7      2    2558 19.100  114     0 14.69683 -1.372732 -1.46648  0.00  0.00
8      3       1  3.827   54    54 13.12414  0.990000  2.08000 -1.29 -0.22
  mmaritn  mmarit meducyrs sesn    ses parity gravida smoked mcignum comprisk
1       1 Married       12   50 Middle      1       1      0       0     none
3       1 Married       12   50 Middle      1       1      0       0     none
4       1 Married        0    0      .      0       0      1      35     none
6       1 Married        0    0      .      0       0      1      35     none
7       1 Married        0    0      .      0       0      1      35     none
8       1 Married        0    0      .      1       1      1      20     none

To use this data set with sl3, the object must be wrapped in a customized sl3 container, an sl3 "Task" object. A task is an idiom for all of the elements of a prediction problem other than the learning algorithms and prediction approach itself -- that is, a task delineates the structure of the data set of interest and any potential metadata (e.g., observation-level weights).

In [2]:
# here are the covariates we are interested in and, of course, the outcome
covars <- c("apgar1", "apgar5", "parity", "gagebrth", "mage", "meducyrs",
            "sexn")
outcome <- "haz"

# create the sl3 task
task <- make_sl3_Task(data = cpp_imputed, covariates = covars,
                      outcome = outcome, outcome_type = "continuous")

# let's take a look at the sl3 task
task
A sl3 Task with 1441 obs and these nodes:
$covariates
[1] "apgar1"   "apgar5"   "parity"   "gagebrth" "mage"     "meducyrs" "sexn"    

$outcome
[1] "haz"

$id
NULL

$weights
NULL

$offset
NULL

Interlude: Object Oriented Programming in R

sl3 is designed using basic OOP principles and the R6 OOP framework. While we’ve tried to make it easy to use sl3 without worrying much about OOP, it is helpful to have some intuition about how sl3 is structured. In this section, we briefly outline some key concepts from OOP. Readers familiar with OOP basics are invited to skip this section. The key concept of OOP is that of an object, a collection of data and functions that corresponds to some conceptual unit. Objects have two main types of elements: (1) fields, which can be thought of as nouns, are information about an object, and (2) methods, which can be thought of as verbs, are actions an object can perform. Objects are members of classes, which define what those specific fields and methods are. Classes can inherit elements from other classes (sometimes called base classes) – accordingly, classes that are similar, but not exactly the same, can share some parts of their definitions.

Many different implementations of OOP exist, with variations in how these concepts are implemented and used. R has several different implementations, including S3, S4, reference classes, and R6. sl3 uses the R6 implementation. In R6, methods and fields of a class object are accessed using the $ operator. The next section explains how these concepts are used in sl3 to model machine learning problems and algorithms.
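To make these ideas concrete, here is a minimal sketch of fields, methods, and inheritance using the R6 package directly. The classes Shape and Circle are hypothetical and exist only for illustration; they are not part of sl3.

```r
library(R6)

# A hypothetical base class with one field (name) and two methods
Shape <- R6Class("Shape",
  public = list(
    name = NULL,
    initialize = function(name) {
      self$name <- name
    },
    area = function() {
      stop("area() is not defined for the base class")
    },
    describe = function() {
      sprintf("%s with area %.2f", self$name, self$area())
    }
  )
)

# A class that inherits from Shape, adds a field, and overrides area()
Circle <- R6Class("Circle",
  inherit = Shape,
  public = list(
    radius = NULL,
    initialize = function(radius) {
      super$initialize("circle")
      self$radius <- radius
    },
    area = function() {
      pi * self$radius^2
    }
  )
)

circ <- Circle$new(radius = 1)
circ$name                # field access with `$`
circ$describe()          # method call with `$`; describe() is inherited from Shape
inherits(circ, "Shape")  # Circle objects are also Shape objects
```

The relationship between Circle and Shape here mirrors the one between Lrnr_glm and Lrnr_base described in the next section: the subclass supplies the specifics, while the shared interface lives in the base class.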

sl3 Learners

Lrnr_base is the base class for defining machine learning algorithms, as well as fits for those algorithms to particular sl3_Tasks. Different machine learning algorithms are defined in classes that inherit from Lrnr_base. For instance, the Lrnr_glm class inherits from Lrnr_base, and defines a learner that fits generalized linear models. We will use the term learners to refer to the family of classes that inherit from Lrnr_base. Learner objects can be constructed from their class definitions using the make_learner function:

In [3]:
# make learner object
lrnr_glm <- make_learner(Lrnr_glm)

Because all learners inherit from Lrnr_base, they have many features in common and can be used interchangeably. All learners define three main methods: train, predict, and chain. The first, train, takes an sl3_Task object and returns a learner fit, which has the same class as the learner that was trained:

In [4]:
# fit learner to task data
lrnr_glm_fit <- lrnr_glm$train(task)

# verify that the learner is fit
lrnr_glm_fit$is_trained
TRUE

Here, we fit the learner to the CPP task we defined above. Both lrnr_glm and lrnr_glm_fit are objects of class Lrnr_glm, although the former defines a learner and the latter defines a fit of that learner. We can distinguish between learners and learner fits using the is_trained field, which is TRUE for fits but not for learners.

Now that we’ve fit a learner, we can generate predictions using the predict method:

In [5]:
# get learner predictions
preds <- lrnr_glm_fit$predict()
head(preds)
  1. 0.362984982737989
  2. 0.362984982737989
  3. 0.259930715399135
  4. 0.259930715399135
  5. 0.259930715399135
  6. 0.0568026361161085

Here, we called predict without arguments, so predictions were generated for the task provided to train (called the training task), which is the default. Passing task explicitly would have given the same predictions. Alternatively, we could have provided a different task for which we want to generate predictions.
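As a sketch of predicting on a different task, the block below rebuilds the training task and the GLM fit from above, then creates a second task from the first 10 rows of the data. The object new_task and the choice of 10 rows are arbitrary and purely illustrative.

```r
library(sl3)

data(cpp_imputed)
covars <- c("apgar1", "apgar5", "parity", "gagebrth", "mage", "meducyrs",
            "sexn")

# training task and learner fit, as defined earlier in this demonstration
task <- make_sl3_Task(data = cpp_imputed, covariates = covars,
                      outcome = "haz", outcome_type = "continuous")
lrnr_glm_fit <- make_learner(Lrnr_glm)$train(task)

# a second task built from an arbitrary subset of the rows
new_task <- make_sl3_Task(data = cpp_imputed[1:10, ], covariates = covars,
                          outcome = "haz", outcome_type = "continuous")

# predictions for the training task (default) and for the new task
preds_default <- lrnr_glm_fit$predict()
preds_new <- lrnr_glm_fit$predict(new_task)
length(preds_new)
```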

The final important learner method, chain, will be discussed below, in the section on learner composition. As with sl3_Task, learners have a variety of fields and methods we haven’t discussed here. More information on these is available in the help for Lrnr_base.

Pipelines

A pipeline is a set of learners to be fit sequentially, where the fit from one learner is used to define the task for the next learner. There are many ways in which a learner can define the task for the downstream learner. The chain method defined by learners defines how this will work. Let’s look at the example of pre-screening variables. For now, we’ll rely on a screener from the SuperLearner package, although native sl3 screening algorithms will be implemented soon.

Below, we generate a screener object based on the SuperLearner function screen.corP and fit it to our task. Inspecting the fit, we see that it selected a subset of covariates:

In [6]:
screen_cor <- Lrnr_pkg_SuperLearner_screener$new("screen.corP")
screen_fit <- screen_cor$train(task)
print(screen_fit)
[1] "Lrnr_pkg_SuperLearner_screener_screen.corP"
$selected
[1] "parity"   "gagebrth"
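Before automating anything, the chaining step can be carried out by hand. The sketch below assumes the same sl3 version and screener class used in this demonstration; the names screened_task, screened_glm_fit, and screened_preds are our own. Calling chain on the screener fit should return a new task restricted to the selected covariates, on which a downstream learner can then be trained.

```r
library(SuperLearner)
library(sl3)

data(cpp_imputed)
covars <- c("apgar1", "apgar5", "parity", "gagebrth", "mage", "meducyrs",
            "sexn")
task <- make_sl3_Task(data = cpp_imputed, covariates = covars,
                      outcome = "haz", outcome_type = "continuous")

# fit the screener, as in the cell above
screen_cor <- Lrnr_pkg_SuperLearner_screener$new("screen.corP")
screen_fit <- screen_cor$train(task)

# chain the screener fit to obtain the task for the next learner
screened_task <- screen_fit$chain()
screened_task$nodes$covariates

# train the downstream learner on the screened task and predict
lrnr_glm <- make_learner(Lrnr_glm)
screened_glm_fit <- lrnr_glm$train(screened_task)
screened_preds <- screened_glm_fit$predict()
head(screened_preds)
```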

The Pipeline class automates the process of training a learner, chaining its fit into a new task, and training the next learner on that task. It takes an arbitrary number of learners and fits them sequentially, training and chaining each one in turn. Since Pipeline is a learner like any other, it shares the same interface. We can define a pipeline using make_learner, and use train and predict just as we did before:

In [7]:
sg_pipeline <- make_learner(Pipeline, screen_cor, lrnr_glm)
sg_pipeline_fit <- sg_pipeline$train(task)
sg_pipeline_preds <- sg_pipeline_fit$predict()
head(sg_pipeline_preds)
  1. 0.380844719477376
  2. 0.380844719477376
  3. 0.298876230525396
  4. 0.298876230525396
  5. 0.298876230525396
  6. -0.00987783994257363

Stacks

Like Pipelines, Stacks combine multiple learners. Stacks train learners simultaneously, so that their predictions can be either combined or compared. Again, Stack is just a special learner and so has the same interface as all other learners:

In [8]:
stack <- make_learner(Stack, lrnr_glm, sg_pipeline)
stack_fit <- stack$train(task)
stack_preds <- stack_fit$predict()
head(stack_preds)
   Lrnr_glm  Lrnr_pkg_SuperLearner_screener_screen.corP___Lrnr_glm
 0.36298498                                             0.38084472
 0.36298498                                             0.38084472
 0.25993072                                             0.29887623
 0.25993072                                             0.29887623
 0.25993072                                             0.29887623
 0.05680264                                            -0.00987784

Above, we’ve defined and fit a stack composed of a simple glm learner as well as a pipeline that combines a screening algorithm with that same learner. We could have included any arbitrary set of learners and pipelines, the latter of which are themselves just learners. We can see that the predict method now returns a matrix, with a column for each learner included in the stack.

The Super Learner Algorithm

Having defined a stack, we might want to compare the performance of learners in the stack, which we may do using cross-validation. The Lrnr_cv learner wraps another learner and performs training and prediction in a cross-validated fashion, using separate training and validation splits as defined by task$folds.
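To build intuition for what training and predicting "in a cross-validated fashion" means, here is a base-R sketch on simulated data. It does not use sl3 and is not how Lrnr_cv is implemented; the data, the two candidate models, and all object names are assumptions made for illustration.

```r
set.seed(49753)

# simulated data: a hypothetical stand-in for a real prediction task
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 + rnorm(n)

# assign each observation to exactly one of V validation folds
V <- 10
fold_id <- sample(rep(seq_len(V), length.out = n))

# validation-set predictions: one column per candidate learner
cv_preds <- matrix(NA_real_, nrow = n, ncol = 2,
                   dimnames = list(NULL, c("glm_full", "mean_only")))
for (v in seq_len(V)) {
  train_dat <- dat[fold_id != v, ]
  valid_dat <- dat[fold_id == v, ]
  fit_full <- glm(y ~ x1 + x2, data = train_dat)
  fit_mean <- glm(y ~ 1, data = train_dat)
  # each observation is predicted by fits that never saw it
  cv_preds[fold_id == v, "glm_full"] <- predict(fit_full, newdata = valid_dat)
  cv_preds[fold_id == v, "mean_only"] <- predict(fit_mean, newdata = valid_dat)
}

# cross-validated risk: mean squared-error loss on validation-set predictions
cv_risk <- colMeans((dat$y - cv_preds)^2)
cv_risk
```

Because every prediction comes from a model fit without the corresponding observation, these risks are honest estimates of out-of-sample performance, which is what makes them suitable for comparing learners.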

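Below, we define a new Lrnr_cv object based on the previously defined stack, train it, and generate predictions on the validation sets. We then use the cv_risk method to estimate the cross-validated risk of each learner in the stack under squared-error loss: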

In [9]:
cv_stack <- Lrnr_cv$new(stack)
cv_fit <- cv_stack$train(task)
cv_preds <- cv_fit$predict()
In [10]:
risks <- cv_fit$cv_risk(loss_squared_error)
print(risks)
                                             Lrnr_glm 
                                             1.604769 
Lrnr_pkg_SuperLearner_screener_screen.corP___Lrnr_glm 
                                             1.604186 
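The learner with the smallest cross-validated risk can be picked out directly from such a named vector. The sketch below re-enters the two risks printed above by hand so that it runs on its own; best_learner is our own name. The learner selected this way is what the exercise at the end of this demonstration calls the "discrete Super Learner".

```r
# cross-validated risks, copied from the output above
risks <- c(
  "Lrnr_glm" = 1.604769,
  "Lrnr_pkg_SuperLearner_screener_screen.corP___Lrnr_glm" = 1.604186
)

# name of the learner with the smallest cross-validated risk
best_learner <- names(risks)[which.min(risks)]
best_learner
```

Here the two risks differ only in the fourth decimal place, so the screened pipeline is selected by a very small margin.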

We can combine all of the above elements, Pipelines, Stacks, and cross-validation using Lrnr_cv, to easily define a Super Learner. The Super Learner algorithm works by fitting a “meta-learner”, which combines predictions from multiple stacked learners. It does this while avoiding overfitting by training the meta-learner on validation-set predictions in a manner that is cross-validated. Using some of the objects we defined in the above examples, this becomes a very simple operation:

In [11]:
metalearner <- make_learner(Lrnr_nnls)
cv_task <- cv_fit$chain()
ml_fit <- metalearner$train(cv_task)

Here, we used a special learner, Lrnr_nnls, for the meta-learning step. This fits a non-negative least squares meta-learner. It is important to note that any learner can be used as a meta-learner.
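For intuition about the meta-learning step, the following base-R sketch solves a non-negative least squares problem: it finds weights, constrained to be at least zero, that minimize the squared error between the outcome and a weighted combination of validation-set predictions. This is not the internals of Lrnr_nnls; the simulated predictions, the use of optim with box constraints, and all object names are assumptions made for illustration.

```r
set.seed(49753)

# hypothetical validation-set predictions from two learners of differing quality
n <- 200
y <- rnorm(n)
Z <- cbind(learner_a = y + rnorm(n, sd = 0.5),
           learner_b = y + rnorm(n, sd = 1.0))

# squared-error objective and its gradient with respect to the weights
sse <- function(w) sum((y - Z %*% w)^2)
sse_grad <- function(w) as.vector(-2 * t(Z) %*% (y - Z %*% w))

# box-constrained optimization enforces the non-negativity of each weight
fit <- optim(par = c(0.5, 0.5), fn = sse, gr = sse_grad,
             method = "L-BFGS-B", lower = 0)
w <- setNames(fit$par, colnames(Z))
w

# ensemble prediction: weighted combination of the learners' predictions
ensemble_preds <- as.vector(Z %*% w)
head(ensemble_preds)
```

In this simulation the less noisy learner receives the larger weight, which is the behavior we want from a meta-learner: better candidates contribute more to the ensemble.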

The Super Learner finally produced is defined as a pipeline with the learner stack trained on the full data and the meta-learner trained on the validation-set predictions. Below, we use a special behavior of pipelines: if all objects passed to a pipeline are learner fits (i.e., learner$is_trained is TRUE), the result will also be a fit:

In [12]:
sl_pipeline <- make_learner(Pipeline, stack_fit, ml_fit)
sl_preds <- sl_pipeline$predict()
head(sl_preds)
  1. 0.335036364767833
  2. 0.335036364767833
  3. 0.251128722856295
  4. 0.251128722856295
  5. 0.251128722856295
  6. 0.022648706935869

An optimal stacked regression model (or Super Learner) may be fit in a more streamlined manner using the Lrnr_sl learner. For simplicity, we will use the same set of learners and meta-learning algorithm as we did before:

In [13]:
sl <- Lrnr_sl$new(learners = stack,
                  metalearner = metalearner)
sl_fit <- sl$train(task)
lrnr_sl_preds <- sl_fit$predict()
head(lrnr_sl_preds)
  1. 0.335036364767833
  2. 0.335036364767833
  3. 0.251128722856295
  4. 0.251128722856295
  5. 0.251128722856295
  6. 0.022648706935869

We can see that this generates the same predictions as the more hands-on definition above.

Exercise

  • Construct a Super Learner using $5$ (or more) learning algorithms, fit it on the training data given below (task_train), and obtain predictions on the held-out set (task_valid).

  • At least $2$ of the learners that you choose should be variations of a single learner, differentiated from one another solely by the use of different values for $1$ (or more) hyperparameters.

  • After fitting the Super Learner, identify the "discrete Super Learner".

In [14]:
# let's split the data into training and validation sets
# (keep the sampled row indices so that the two sets are disjoint)
train_idx <- sample(nrow(cpp_imputed), floor(0.75 * nrow(cpp_imputed)))
train_cpp_imputed <- as.data.table(cpp_imputed[train_idx, ])
valid_cpp_imputed <- as.data.table(cpp_imputed[-train_idx, ])

# create the sl3 task and take a look at it
task_train <- make_sl3_Task(data = train_cpp_imputed, covariates = covars,
                            outcome = outcome, outcome_type = "continuous")
task_train

# we'll also create an sl3 task for the holdout set
task_valid <- make_sl3_Task(data = valid_cpp_imputed, covariates = covars,
                            outcome = outcome, outcome_type = "continuous")
A sl3 Task with 1080 obs and these nodes:
$covariates
[1] "apgar1"   "apgar5"   "parity"   "gagebrth" "mage"     "meducyrs" "sexn"    

$outcome
[1] "haz"

$id
NULL

$weights
NULL

$offset
NULL