# I. Nonparametric Density Estimation

For a moment, we will go back to simple data structures: we have observations that are realizations of univariate random variables, $$X_1, \ldots, X_n \sim F,$$ where $F$ denotes an unknown cumulative distribution function (CDF). The goal is to estimate the distribution $F$. In particular, we are interested in estimating the density $f = F'$, assuming that it exists.
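Before smoothing anything, note that the empirical CDF already gives a fully nonparametric estimate of $F$; however, it is a step function, so it has no density, which is what motivates the smoothed estimators below. A minimal sketch in R (simulated data, for illustration only):

# The empirical CDF estimates F nonparametrically, but it is a step
# function, so F' does not exist as a density -- hence the need for
# the smoothed density estimators below.
set.seed(271)
x <- rnorm(100)  # stand-ins for the observations X_1, ..., X_n
Fn <- ecdf(x)    # empirical CDF, a nonparametric estimate of F
Fn(0)            # estimated F(0); close to pnorm(0) = 0.5 here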

### Histograms

The histogram is the oldest and most popular density estimator. To specify the intervals (bins), we need to choose an "origin" $x_0$ and a class width $h$: $$I_j = (x_0 + j \cdot h,\ x_0 + (j + 1) \cdot h), \quad j = \ldots, -1, 0, 1, \ldots$$
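As a quick sketch (simulated data; the choices $x_0 = 0$ and $h = 0.5$ are arbitrary, for illustration only), such a histogram density estimate can be produced with hist() by constructing breakpoints from the origin and class width:

# Histogram density estimate with origin x0 and class width h
set.seed(271)
x <- rnorm(100)
x0 <- 0
h <- 0.5
# extend the breakpoints just far enough to cover the observed range
breaks <- seq(x0 + floor((min(x) - x0) / h) * h,
              x0 + ceiling((max(x) - x0) / h) * h, by = h)
hist(x, breaks = breaks, freq = FALSE)  # freq = FALSE rescales bars to a density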

### The Naïve Kernel Estimator

$$\hat{f}(x) = \frac{1}{nh} \sum_{i = 1}^{n} w \left(\frac{x - X_i}{h}\right),$$

where $$w(x) = \begin{cases} \frac{1}{2}, & |x| \leq 1 \\ 0, & \text{otherwise.} \end{cases}$$ This is merely a simple weight function that places a rectangular box of width $2h$ around each observation; equivalently, $\hat{f}(x)$ is the proportion of observations falling in the interval $(x - h, x + h)$, divided by $2h$.
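A minimal sketch of this estimator in R (the helper naive_kde() and the simulated data are ours, for illustration only):

# The naive kernel estimator, evaluated at a single point x
naive_kde <- function(x, obs, h) {
  w <- function(u) 0.5 * (abs(u) <= 1)       # rectangular weight function
  sum(w((x - obs) / h)) / (length(obs) * h)  # (1 / nh) sum_i w((x - X_i) / h)
}
set.seed(271)
obs <- rnorm(100)
naive_kde(0, obs, h = 0.5)  # proportion of obs within h of 0, divided by 2h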

By replacing $w$ with a general smooth kernel function, we get the definition of the kernel density estimator: $$\hat{f}(x) = \frac{1}{nh} \sum_{i = 1}^{n} K \left(\frac{x - X_i}{h}\right),$$ where $$K(x) \geq 0, \quad \int_{-\infty}^{\infty} K(x)\, dx = 1, \quad K(x) = K(-x).$$

The nonnegativity of the kernel function $K(\cdot)$ guarantees a nonnegative density estimate $\hat{f}(\cdot)$, and the normalization $\int K(x)\,dx = 1$ implies that $\int \hat{f}(x)\,dx = 1$, which is necessary for $\hat{f}(\cdot)$ to be a density. Typically, the kernel function $K(\cdot)$ is chosen as a probability density that is symmetric around $0$. Additionally, $\hat{f}(\cdot)$ inherits its smoothness from the smoothness of the kernel.
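To make the definition concrete, here is a short sketch (our own helper kde(), simulated data) of the kernel density estimator with a Gaussian kernel, checked against R's built-in density(), which computes the same quantity up to a fast binned approximation:

# Kernel density estimator with a Gaussian kernel, K = dnorm
kde <- function(xs, obs, h, K = dnorm) {
  sapply(xs, function(x) mean(K((x - obs) / h)) / h)  # (1 / nh) sum_i K((x - X_i) / h)
}
set.seed(271)
obs <- rnorm(100)
grid <- seq(-3, 3, length.out = 5)
cbind(manual  = kde(grid, obs, h = 0.5),
      builtin = density(obs, bw = 0.5, kernel = "gaussian",
                        from = -3, to = 3, n = 5)$y)  # nearly identical values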

In the above definition, the bandwidth $h$ is left as a tuning parameter, which can be chosen to minimize a suitable risk criterion, such as the mean integrated squared error (MISE), given the available data. For a large bandwidth $h$, the estimate $\hat{f}(x)$ tends to vary slowly as a function of $x$, while a small bandwidth produces a more variable function estimate.

### The Bandwidth $h$

The bandwidth $h$ is often also called the "smoothing parameter". It should be clear that as $h \to 0$, we will have spikes at every observation $X_i$, whereas $\hat{f}(\cdot) = \hat{f}_h(\cdot)$ becomes smoother as $h$ increases. Above, we use a global bandwidth, which we might choose optimally using cross-validation; we could also use variable bandwidths (locally changing bandwidths $h(x)$), the general idea here being to use a larger bandwidth in regions where the data are sparse. With respect to the bias-variance tradeoff: as $h$ increases, the (absolute value of the) bias of $\hat{f}$ increases and the variance of $\hat{f}$ decreases.
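The sketch below illustrates this tradeoff on simulated data (the extreme bandwidths 2 and 0.05 are arbitrary choices for illustration); stats::bw.ucv() implements unbiased (leave-one-out) cross-validation for choosing a global bandwidth:

# Over- vs. under-smoothing, plus a cross-validated global bandwidth
set.seed(271)
obs <- rnorm(500)
h_cv <- bw.ucv(obs)                          # unbiased CV choice of h
plot(density(obs, bw = 2), col = "grey60",   # h too large: smooth but biased
     main = "Effect of the bandwidth h")
lines(density(obs, bw = 0.05), col = "red")  # h too small: spiky, high variance
lines(density(obs, bw = h_cv), lwd = 2)      # CV-chosen compromise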

# II. Density Estimation with condensier

First, let's load the packages we'll be using and some core project management tools.

In [1]:
library(usethis)
usethis::create_project(".")
library(here)
library(tidyverse)
library(data.table, quietly = TRUE)

Changing active project to lab_09
✔ Writing a sentinel file '.here'
● Build robust paths within your project via here::here()

here() starts at /Users/nimahejazi/Dropbox/UC_Berkeley-grad/teaching/2018_Spring/tlbbd-labs/lab_09
── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 2.2.1     ✔ purrr   0.2.4
✔ tibble  1.4.2     ✔ dplyr   0.7.4
✔ tidyr   0.8.0     ✔ stringr 1.3.0
✔ readr   1.1.1     ✔ forcats 0.3.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──

Attaching package: ‘data.table’

The following objects are masked from ‘package:dplyr’:

between, first, last

The following object is masked from ‘package:purrr’:

transpose


In [2]:
library(simcausal)
library(condensier)
library(sl3)

condensier
The condensier package is still in beta testing. Interpret results with caution.


We begin by simulating a simple data set and illustrating how to use condensier to perform conditional density estimation:

In [3]:
D <- DAG.empty()
D <- D + node("W1", distr = "rbern", prob = 0.5) +
  node("W2", distr = "rbern", prob = 0.3) +
  node("W3", distr = "rbern", prob = 0.3) +
  node("A", distr = "rnorm", mean = (0.98 * W1 + 0.58 * W2 + 0.33 * W3), sd = 1)
D <- set.DAG(D, n.test = 10)

...automatically assigning order attribute to some nodes...
node W1, order:1
node W2, order:2
node W3, order:3
node A, order:4

In [4]:
plotDAG(D, xjitter = 0.3, yjitter = 0.04,
        edge_attrs = list(width = 0.5, arrow.width = 0.4, arrow.size = 0.8),
        vertex_attrs = list(size = 12, label.cex = 0.8))

[Figure: plot of the DAG, with arrows from W1, W2, and W3 into A]


Now that we've taken a look at the structure of the data-generating process (the DAG), let's generate some data and take a quick look at the data set:

In [5]:
data_O <- as.data.table(sim(D, n = 10000, rndseed = 57192))
head(data_O)

simulating observed dataset from the DAG object

ID W1 W2 W3            A
 1  0  0  0  0.264097302
 2  1  0  1  1.947282783
 3  1  0  0  0.343522286
 4  0  0  0 -1.342266575
 5  0  0  0  1.446429830
 6  0  0  0 -0.001315845
In [6]:
newdata <- data_O[sample(seq_len(1000)), c("W1", "W2", "W3", "A"), with = FALSE]


Now that we've generated some data, we will construct an sl3_Task object for use with sl3:

In [7]:
task <- sl3_Task$new(data_O, covariates = c("W1", "W2", "W3"), outcome = "A")
task

A sl3 Task with 10000 obs and these nodes:
$covariates
[1] "W1" "W2" "W3"

$outcome
[1] "A"

$id
NULL

$weights
NULL

$offset
NULL

We'll do this same conversion for the sub-sampled testing data as well:

In [8]:
new_task <- sl3_Task$new(newdata, covariates = c("W1", "W2", "W3"), outcome = "A")

In [13]:
sl3::sl3_list_properties()

 1. 'binomial'
 2. 'categorical'
 3. 'continuous'
 4. 'density'
 5. 'ids'
 6. 'multivariate_outcome'
 7. 'offset'
 8. 'preprocessing'
 9. 'timeseries'
10. 'weights'
11. 'wrapper'

In [14]:
lrn <- Lrnr_condensier$new(nbins = 10, bin_method = "equal.len",
                           bin_estimator = Lrnr_xgboost$new(nrounds = 50,
                                                            objective = "reg:logistic"))

In [15]:
trained_lrn <- lrn$train(task)
pred_probs <- trained_lrn$predict(new_task)
head(pred_probs)

likelihood
0.1054223
0.3310383
0.3310383
0.1044063
0.1429661
0.1044063

Next, we define two candidate conditional density learners and combine them in a Super Learner, with a density-appropriate meta-learner choosing the convex combination:

In [16]:
lrn1 <- Lrnr_condensier$new(nbins = 25, bin_method = "equal.len",
                            bin_estimator = Lrnr_glm_fast$new())
lrn2 <- Lrnr_condensier$new(nbins = 35, bin_method = "equal.len",
                            bin_estimator = Lrnr_xgboost$new(nrounds = 50,
                                                             objective = "reg:logistic"))

In [19]:
sl <- Lrnr_sl$new(learners = list(lrn1, lrn2),
                  metalearner = Lrnr_solnp_density$new())
sl_fit <- sl$train(task)
head(sl_fit$predict(new_task))

Iter: 1 fn: 17575.2858  Pars: 0.38205 0.61795
Iter: 2 fn: 17575.2858  Pars: 0.38205 0.61795
solnp--> Completed in 2 iterations
density meta-learner fit:
Lrnr_condensier_equal.len_25_20_FALSE_NA_FALSE_NULL 0.3820534
Lrnr_condensier_equal.len_35_20_FALSE_NA_FALSE_NULL 0.6179466

likelihood
0.15692777
0.32778128
0.29269600
0.13861217
0.21158060
0.04837965

In [18]:
?Lrnr_solnp_density

Lrnr_solnp_density {sl3}    R Documentation

## Nonlinear Optimization via Augmented Lagrange

### Description

This meta-learner provides fitting procedures for density estimation, finding convex combinations of candidate density estimators by minimizing the cross-validated negative log-likelihood loss of each candidate density. The optimization problem is solved by making use of solnp, using Lagrange multipliers. For further details, consult the documentation of the Rsolnp package.

### Usage

Lrnr_solnp_density

### Format

R6Class object.

### Value

Learner object with methods for training and prediction. See Lrnr_base for documentation on learners.

### Parameters

...: Not currently used.

### Common Parameters

Individual learners have their own sets of parameters. Below is a list of shared parameters, implemented by Lrnr_base and shared by all learners.

covariates: A character vector of covariates. The learner will use this to subset the covariates for any specified task.
outcome_type: A variable_type object used to control the outcome_type used by the learner. Overrides the task outcome_type if specified.
...: All other parameters should be handled by the individual learner classes. See the documentation for the learner class you're instantiating.

### See Also

Other Learners: Custom_chain, Lrnr_HarmonicReg, Lrnr_arima, Lrnr_base, Lrnr_bilstm, Lrnr_condensier, Lrnr_cv, Lrnr_define_interactions, Lrnr_expSmooth, Lrnr_glm_fast, Lrnr_glmnet, Lrnr_glm, Lrnr_h2o_grid, Lrnr_independent_binomial, Lrnr_lstm, Lrnr_mean, Lrnr_nnls, Lrnr_optim, Lrnr_pkg_SuperLearner, Lrnr_randomForest, Lrnr_rugarch, Lrnr_sl, Lrnr_solnp, Lrnr_subset_covariates, Lrnr_tsDyn, Lrnr_xgboost, Pipeline, Stack, define_h2o_X, undocumented_learner

[Package sl3 version 1.0.0]

### Nesting the Super Learner for bin hazards within the density Super Learner

Note that bin_estimator can also be a Super Learner object from sl3, in which case the bin hazards will be estimated by stacking several candidate estimators. For example, below, we define a single density learner lrn, with the hazard estimator defined by a Super Learner that stacks two candidates (a GLM and an xgboost GBM). Note that, in contrast to the above example, this Super Learner fit will be optimized for the logistic regression problem (estimating the pooled bin hazards), while still using internal 10-fold cross-validation.

In [20]:
lrn <- Lrnr_condensier$new(nbins = 35, bin_method = "equal.len",
                           bin_estimator = Lrnr_sl$new(
                             learners = list(
                               Lrnr_glm_fast$new(family = "binomial"),
                               Lrnr_xgboost$new(nrounds = 50, objective = "reg:logistic")
                             ),
                             metalearner = Lrnr_glm$new()
                           ))
binSL_fit <- lrn$train(task)
head(binSL_fit$predict(new_task))

likelihood
0.10960448
0.30398384
0.30398384
0.14400828
0.20825493
0.06986179
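Since predictions are returned as likelihood values (as in the outputs above), one quick, heuristic way to compare the fits on the held-out task is their mean negative log-likelihood, lower being better. A sketch, assuming the objects trained above are still in scope:

# Held-out mean negative log-likelihood for each of the fits above
mean(-log(trained_lrn$predict(new_task)$likelihood))
mean(-log(sl_fit$predict(new_task)$likelihood))
mean(-log(binSL_fit$predict(new_task)$likelihood))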

### Homework #3: Using and contributing to sl3

• What learners to contribute: https://github.com/tlverse/sl3/issues/114
• How learners work: https://sl3.tlverse.org/articles/custom_lrnrs.html

Tips and best practices:

• Fork the sl3 repo from https://github.com/tlverse/sl3
• Using git, create a new branch for your proposed contribution (e.g., Lrnr_ranger)
• Generate a template for your new learner via sl3::write_learner_template(here::here("R", "Lrnr_ranger.R")) (a rough sketch of the resulting structure appears after this list)
• Fill out the template based on the properties of your new learner. Try looking at already written learners to get a better idea of how to fill out the various methods slots.
• Write a set of unit tests under tests/testthat/test_Lrnr_ranger.R that ensure that your new learner works as intended/expected.
• For a guide on how to write unit tests properly, try looking at the numerous unit tests that already exist inside the sl3 package.
• For background on unit testing in R, take a look here: http://r-pkgs.had.co.nz/tests.html
• The tlverse ecosystem uses the "Tidyverse Style Guide". Make sure to follow the formatting guidelines for your code and for pull requests. You can use the styler R package to automatically reformat your code.
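To give a rough idea of what a filled-in template looks like, below is an untested sketch of a ranger-based learner. The slot and field names here (task$X, task$Y, private$.train, private$.predict, private$.fit_object) are assumptions based on the custom learners vignette linked above; generate the authoritative skeleton with sl3::write_learner_template() and adapt from existing learners.

# Rough sketch only -- field names are assumptions from the custom
# learners vignette; the real template comes from write_learner_template().
Lrnr_ranger <- R6::R6Class(
  classname = "Lrnr_ranger", inherit = Lrnr_base,
  portable = TRUE, class = TRUE,
  public = list(
    initialize = function(num.trees = 500, ...) {
      params <- list(num.trees = num.trees, ...)
      super$initialize(params = params, ...)
    }
  ),
  private = list(
    .properties = c("continuous", "binomial"),
    .train = function(task) {
      # fit ranger on the task's covariates and outcome; the returned
      # object is stored by Lrnr_base as the learner's fit object
      ranger::ranger(x = task$X, y = task$Y,
                     num.trees = self$params$num.trees)
    },
    .predict = function(task = NULL) {
      # predictions from the stored fit on the new task's covariates
      predict(private$.fit_object, data = task$X)$predictions
    }
  )
)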