Machine learning models learn by looking at examples. In supervised learning, each example has a known outcome, or label, and by looking at many examples the model finds a relationship between the data and the label. More examples make patterns easier to find; fewer examples make learning harder and lead to worse performance.

Often there are only a few examples available for certain groups in the data. In these cases, all that the model can do is “guess”, or predict an average of the few examples present. This often leads to poor performance.

Prior probabilities, also known as “offsets”, help to solve this problem. Instead of basing the prediction entirely on the examples present in the data, we can set the initial predictions, or “guesses”, based on our own knowledge or intuition. If there are a lot of examples present in the data, more weight is given to the model’s prediction; if there are few examples, more weight is given to the prior probability.
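The weighting idea in the paragraph above can be illustrated with a small credibility-style blend. This is a simplified sketch and not XGBoost's internal algorithm: the function name `blend_estimate` and the constant `k` (the number of examples at which the data and the prior get equal weight) are assumptions made for illustration.

```r
# Credibility-style blend of an observed average and a prior.
# Simplified illustration only; `blend_estimate` and `k` are hypothetical.
blend_estimate <- function(n, data_mean, prior, k = 20) {
  z <- n / (n + k)                 # weight on the data, between 0 and 1
  z * data_mean + (1 - z) * prior  # the remaining weight goes to the prior
}

# With 2 examples the estimate stays close to the prior of 0.2
few_examples <- blend_estimate(n = 2, data_mean = 1.0, prior = 0.2)

# With 2000 examples the estimate moves close to the observed mean of 1.0
many_examples <- blend_estimate(n = 2000, data_mean = 1.0, prior = 0.2)

round(c(few = few_examples, many = many_examples), 3)
```

The larger `k` is, the more examples are needed before the data outweighs the prior.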

Potential Applications:

When predicting a person’s annual healthcare costs, use their last year’s costs as the offset so that people with few records in the current year can use their data from the prior year.
In insurance ratemaking, when adjusting premiums, use the manual premium as the offset as described in this paper.
For a Kaggle competition, use a competitor’s model prediction as the offset to build a model that seeks to improve upon it using a new data source.
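As a rough sketch of the first application, an offset can be supplied to an ordinary GLM in base R. Everything here is simulated for illustration: the variable names (`last_year_cost`, `this_year_cost`, `age`), the Gamma family, and the data-generating process are assumptions, not part of the original example.

```r
set.seed(1)
n <- 200

# Simulated toy data (hypothetical): prior-year costs and age
last_year_cost <- rgamma(n, shape = 2, rate = 1 / 500)
age <- sample(20:80, n, replace = TRUE)

# Current-year costs: last year's costs scaled by an age effect plus noise
this_year_cost <- last_year_cost * exp(0.01 * (age - 50)) *
  rgamma(n, shape = 5, rate = 5)

# With a log link, offset(log(last_year_cost)) fixes last year's cost as the
# starting prediction; the model only estimates the adjustment on top of it.
fit <- glm(this_year_cost ~ age + offset(log(last_year_cost)),
           family = Gamma(link = "log"))

coef(fit)  # an intercept and an age effect; no coefficient for the offset
```

Because the offset enters with a fixed coefficient of 1, the fitted coefficients describe how this year's costs differ from last year's rather than the cost level itself.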

Example: Predicting whether a mushroom is poisonous or edible

In the agaricus data set, there are examples of different mushrooms described in terms of physical characteristics. Each is either poisonous or edible. There are a lot of characteristics.

library(xgboost)
library(tidyverse)
library(scales)
library(kableExtra)

data(agaricus.train, package='xgboost')
data(agaricus.test, package='xgboost')

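# Helper that renders a data frame as a styled HTML table.
# Note: naming it t() masks base::t() (matrix transpose) in the global environment.
t <- function(df){df %>% 
    knitr::kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed"))}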

agaricus.test$data %>% as.matrix() %>% as_tibble() %>% names()

##   [1] "cap-shape=bell"                   "cap-shape=conical"               
##   [3] "cap-shape=convex"                 "cap-shape=flat"                  
##   [5] "cap-shape=knobbed"                "cap-shape=sunken"                
##   [7] "cap-surface=fibrous"              "cap-surface=grooves"             
##   [9] "cap-surface=scaly"                "cap-surface=smooth"              
##  [11] "cap-color=brown"                  "cap-color=buff"                  
##  [13] "cap-color=cinnamon"               "cap-color=gray"                  
##  [15] "cap-color=green"                  "cap-color=pink"                  
##  [17] "cap-color=purple"                 "cap-color=red"                   
##  [19] "cap-color=white"                  "cap-color=yellow"                
##  [21] "bruises?=bruises"                 "bruises?=no"                     
##  [23] "odor=almond"                      "odor=anise"                      
##  [25] "odor=creosote"                    "odor=fishy"                      
##  [27] "odor=foul"                        "odor=musty"                      
##  [29] "odor=none"                        "odor=pungent"                    
##  [31] "odor=spicy"                       "gill-attachment=attached"        
##  [33] "gill-attachment=descending"       "gill-attachment=free"            
##  [35] "gill-attachment=notched"          "gill-spacing=close"              
##  [37] "gill-spacing=crowded"             "gill-spacing=distant"            
##  [39] "gill-size=broad"                  "gill-size=narrow"                
##  [41] "gill-color=black"                 "gill-color=brown"                
##  [43] "gill-color=buff"                  "gill-color=chocolate"            
##  [45] "gill-color=gray"                  "gill-color=green"                
##  [47] "gill-color=orange"                "gill-color=pink"                 
##  [49] "gill-color=purple"                "gill-color=red"                  
##  [51] "gill-color=white"                 "gill-color=yellow"               
##  [53] "stalk-shape=enlarging"            "stalk-shape=tapering"            
##  [55] "stalk-root=bulbous"               "stalk-root=club"                 
##  [57] "stalk-root=cup"                   "stalk-root=equal"                
##  [59] "stalk-root=rhizomorphs"           "stalk-root=rooted"               
##  [61] "stalk-root=missing"               "stalk-surface-above-ring=fibrous"
##  [63] "stalk-surface-above-ring=scaly"   "stalk-surface-above-ring=silky"  
##  [65] "stalk-surface-above-ring=smooth"  "stalk-surface-below-ring=fibrous"
##  [67] "stalk-surface-below-ring=scaly"   "stalk-surface-below-ring=silky"  
##  [69] "stalk-surface-below-ring=smooth"  "stalk-color-above-ring=brown"    
##  [71] "stalk-color-above-ring=buff"      "stalk-color-above-ring=cinnamon" 
##  [73] "stalk-color-above-ring=gray"      "stalk-color-above-ring=orange"   
##  [75] "stalk-color-above-ring=pink"      "stalk-color-above-ring=red"      
##  [77] "stalk-color-above-ring=white"     "stalk-color-above-ring=yellow"   
##  [79] "stalk-color-below-ring=brown"     "stalk-color-below-ring=buff"     
##  [81] "stalk-color-below-ring=cinnamon"  "stalk-color-below-ring=gray"     
##  [83] "stalk-color-below-ring=orange"    "stalk-color-below-ring=pink"     
##  [85] "stalk-color-below-ring=red"       "stalk-color-below-ring=white"    
##  [87] "stalk-color-below-ring=yellow"    "veil-type=partial"               
##  [89] "veil-type=universal"              "veil-color=brown"                
##  [91] "veil-color=orange"                "veil-color=white"                
##  [93] "veil-color=yellow"                "ring-number=none"                
##  [95] "ring-number=one"                  "ring-number=two"                 
##  [97] "ring-type=cobwebby"               "ring-type=evanescent"            
##  [99] "ring-type=flaring"                "ring-type=large"                 
## [101] "ring-type=none"                   "ring-type=pendant"               
## [103] "ring-type=sheathing"              "ring-type=zone"                  
## [105] "spore-print-color=black"          "spore-print-color=brown"         
## [107] "spore-print-color=buff"           "spore-print-color=chocolate"     
## [109] "spore-print-color=green"          "spore-print-color=orange"        
## [111] "spore-print-color=purple"         "spore-print-color=white"         
## [113] "spore-print-color=yellow"         "population=abundant"             
## [115] "population=clustered"             "population=numerous"             
## [117] "population=scattered"             "population=several"              
## [119] "population=solitary"              "habitat=grasses"                 
## [121] "habitat=leaves"                   "habitat=meadows"                 
## [123] "habitat=paths"                    "habitat=urban"                   
## [125] "habitat=waste"                    "habitat=woods"

Many of the characteristics are uncommon. In data science language, these are “sparse” features.

df <- agaricus.test$data %>% 
  as.matrix() %>% 
  as_tibble() %>% 
  summarise_all(mean) %>% 
  gather(feature, percent_one) %>% 
  arrange(desc(percent_one))

df %>% head() %>% t()

feature	percent_one
veil-type=partial	1.0000000
gill-attachment=free	0.9726878
veil-color=white	0.9720670
ring-number=one	0.9199255
gill-spacing=close	0.8454376
gill-size=broad	0.6703911

Some characteristics are very common, such as veil-type=partial or gill-attachment=free, which are present in 100% and 97% of the example mushrooms respectively.

df %>% tail() %>% t()

feature	percent_one
stalk-root=cup	0
stalk-root=rhizomorphs	0
veil-type=universal	0
ring-type=cobwebby	0
ring-type=sheathing	0
ring-type=zone	0

Other characteristics are very uncommon. In fact, none of these six appear in the test data set at all!

But just because they are not in the data set does not mean that we know nothing about them. Suppose that a botanist looks at this and makes a guess: “About 50% of mushrooms with a cup-shaped stalk-root are poisonous, and about 80% of mushrooms with cobwebby ring types are edible.” We could add these to the model as prior probabilities of being poisonous of 50% and 20%, respectively.
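To make the botanist's guesses usable as offsets for a logistic model, they would typically be expressed on the log-odds scale. The sketch below uses base R's `qlogis()` (the logit function) and `plogis()` (its inverse); the vector names are chosen here for illustration only.

```r
# The botanist's guesses as probabilities of being poisonous
prior_poisonous <- c(`stalk-root=cup` = 0.50, `ring-type=cobwebby` = 0.20)

# Convert probabilities to log-odds: log(p / (1 - p))
prior_log_odds <- qlogis(prior_poisonous)
round(prior_log_odds, 3)

# Converting back recovers the original probabilities
recovered <- plogis(prior_log_odds)
recovered
```

A prior of 50% maps to a log-odds of 0, which is why a model with no offset and a starting margin of 0 begins from a 50% guess.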

We fit an xgboost model which does not use prior probabilities.

dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
dtest <- xgb.DMatrix(agaricus.test$data, label = agaricus.test$label)
watchlist <- list(train = dtrain, eval = dtest)

## A simple xgb.train example:
param <- list(max_depth = 2, eta = 0.05, verbose = 0, nthread = 2,
              objective = "binary:logistic", eval_metric = "auc")

baseline <- xgb.train(param, dtrain, nrounds = 5, watchlist, verbose = 0)

df <- agaricus.test$data %>% as.matrix() %>% as_tibble() %>% 
  mutate(
    y_hat = predict(baseline, agaricus.test$data),
    y = agaricus.test$label)

df %>% 
  group_by(`odor=musty`) %>% 
  summarise(
    avg_pred = percent(mean(y_hat))
  ) %>% t()

odor=musty	avg_pred
0	49.6%
1	48.5%

Instead of the previous examples, which do not appear in the test data at all, consider the characteristic odor=musty. For mushrooms with a musty odor, the average predicted probability is 48.5%, almost the same as the 49.6% for non-musty mushrooms. In other words, the model is just blindly guessing (making a roughly 50% guess) and not differentiating musty mushrooms from non-musty mushrooms.

Fewer than 0.5% (32) of the mushrooms in the training data set are musty, so the model does not have many examples to learn from.

agaricus.train$data %>% as.matrix() %>% as_tibble() %>% 
  summarise(percent_musty = percent(mean(`odor=musty`)), n_musty = sum(`odor=musty`)) %>% t()

percent_musty	n_musty
0.491%	32

Suppose we had prior information telling us that musty mushrooms are edible 90% of the time, that is, poisonous 10% of the time. We can encode this prior by setting the base margin in xgboost.

# set the prior probability of being poisonous: 10% for musty-smelling mushrooms, 90% for all others
prior_probabilities <- agaricus.train$data %>% 
  as.matrix() %>% 
  as_tibble() %>% 
  transmute(prior_wts = ifelse(`odor=musty` == 1, 
                               yes = 0.1, 
                               no = 0.9)) %>% 
  unlist() %>% 
  as.numeric()

setinfo(dtrain, "base_margin", log(prior_probabilities))

## [1] TRUE
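One detail worth checking when setting a base margin: for the `binary:logistic` objective, the margin is read on the log-odds scale and passed through the sigmoid. The sketch below, in base R only, compares the probability implied by `log(p)` with the one implied by the logit `qlogis(p)`. The vector names are illustrative, and the results reported in this post were produced with `log()`.

```r
# Prior probabilities of being poisonous used in this example
p <- c(musty = 0.1, not_musty = 0.9)

# Starting probability implied when the margin is log(p)
implied_by_log <- plogis(log(p))

# Starting probability implied when the margin is the logit, log(p / (1 - p))
implied_by_logit <- plogis(qlogis(p))

round(rbind(implied_by_log, implied_by_logit), 4)
```

Under this reading, `log(0.9)` corresponds to a starting probability of about 47% rather than 90%, while `qlogis(0.9)` reproduces 90% exactly. Which conversion is appropriate depends on what the prior is meant to express.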

prior_weighted <- xgb.train(param, dtrain, nrounds = 30, watchlist, verbose = 0)

agaricus.test$data %>% as.matrix() %>% as_tibble() %>% 
  mutate(
    y_hat_baseline = predict(baseline, agaricus.test$data),
    y_hat_prior_wtd = predict(prior_weighted, agaricus.test$data),
    y = agaricus.test$label) %>% 
   group_by(`odor=musty`) %>% 
  summarise(
    baseline_xgboost = percent(mean(y_hat_baseline)),
    prior_xgboost = percent(mean(y_hat_prior_wtd)),
    percent_poisonous = percent(mean(y))
  ) %>% t()

odor=musty	baseline_xgboost	prior_xgboost	percent_poisonous
0	49.6%	49.4%	48.0%
1	48.5%	74.4%	100%

After adding the prior weights, musty mushrooms are much more likely to be predicted as poisonous: the average predicted probability rises from 48.5% to 74.4%. This is closer to reality, as 100% of the musty mushrooms in the test set are actually poisonous.
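When reading predictions from a model trained with an offset, it helps to remember that a logistic prediction is the sigmoid of the offset plus whatever the trees learned. The sketch below uses hypothetical numbers, not values taken from the fitted model, and assumes a starting margin of 0 when no offset is supplied.

```r
# Hypothetical values for illustration; they are not taken from the fitted model.
offset_log_odds <- qlogis(0.10)  # prior of 10% poisonous, on the log-odds scale
tree_adjustment <- 3.0           # hypothetical sum of the trees' outputs

# Prediction when the offset is supplied: sigmoid(offset + adjustment)
p_with_offset <- plogis(offset_log_odds + tree_adjustment)

# Prediction when no offset is supplied (assumed starting margin of 0, i.e. 50%)
p_without_offset <- plogis(0 + tree_adjustment)

round(c(with_offset = p_with_offset, without_offset = p_without_offset), 3)
```

The same learned adjustment gives different probabilities depending on whether the offset is present, so an offset used in training would usually also be attached to the prediction data, for example with `setinfo()` on the test `xgb.DMatrix`.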

Bayesian XGBoost: How to Use Prior Probabilities with XGBoost

R Tutorial

Sam Castillo

2019-09-15
