import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
A depth-one tree is a function of a single variable. In this setting the boosting algorithm becomes:
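1. Set $\hat{f}(x) = 0$ and $r_i = y_i$ for all $i$ in the training set.
2. For $b = 1, 2, \dots, B$, repeat:
   (a) Fit a tree $\hat{f}^b$ with a single split (a stump) to the training data $(X, r)$.
   (b) Update $\hat{f}$ by adding a shrunken version of the new stump: $\hat{f}(x) \leftarrow \hat{f}(x) + \lambda \hat{f}^b(x)$.
   (c) Update the residuals: $r_i \leftarrow r_i - \lambda \hat{f}^b(x_i)$.
3. Output the boosted model: $\hat{f}(x) = \sum_{b=1}^{B} \lambda \hat{f}^b(x)$.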
Because the model is constrained to depth-one trees, each $\hat{f}^b$ generated in step 2 is a function of a single feature, $\hat{f}^b(X_j)$, and so the model output in step 3 is a sum of functions, each of a single feature, which can be thought of as an additive model.
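As a quick illustration (not part of the original exercise; the data below is simulated purely for the sketch), scikit-learn's `GradientBoostingRegressor` with `max_depth=1` fits this kind of stump-based additive model:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Simulated data purely for illustration
X, y = make_regression(n_samples=200, n_features=5, noise=5, random_state=0)

# max_depth=1 restricts every tree to a single split (a stump), so the fitted
# model is a sum of single-feature functions, i.e. additive in the features
stump_boost = GradientBoostingRegressor(max_depth=1, n_estimators=500,
                                        learning_rate=0.01, random_state=0)
stump_boost.fit(X, y)
```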
Hint: In a setting with two classes, pˆm1 = 1 − pˆm2. You could make this plot by hand, but it will be much easier to make in R.
# pmk represents the proportion of training observations
# in the mth region that are from the kth class.
# Here we consider two classes (K=2) for a given region m
pm1 = np.arange(0.01, 1, 0.01)
pm2 = 1 - pm1

# Gini index
gini = (pm1 * (1 - pm1)) + (pm2 * (1 - pm2))
# Classification error is the proportion of observations assigned to the wrong class
err = []
for p in pm1:
    if p >= 0.5:
        err += [1 - p]
    else:
        err += [p]
# Entropy
entropy = -((pm1 * np.log(pm1)) + (pm2 * np.log(pm2)))
# Plot against pm1
df = pd.DataFrame(np.stack([pm1, gini, err, entropy], axis=1),
columns=['pm1', 'gini', 'error', 'entropy']).set_index('pm1')
plt.figure(figsize=(10, 7))
sns.lineplot(data=df);
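All three measures peak when the two classes are evenly mixed ($\hat{p}_{m1} = 0.5$) and fall to zero as the node approaches purity; the Gini index and entropy decline more smoothly near purity than the classification error, which is why they are generally preferred when growing the tree.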
# 10 bootstrapped estimates of P(Class is Red | X)
x_isred_boot = np.array([0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, 0.75])

def majority_clf(votes):
    # Classify as red if more than half of the estimates exceed 0.5
    pro_votes = (0.5 < votes).sum()
    majority_is_pro = (len(votes) / 2) < pro_votes
    return majority_is_pro, pro_votes

def avg_clf(votes):
    # Classify as red if the average estimate exceeds 0.5
    avg = np.mean(votes)
    return (0.5 < avg, avg)
print('Is red (by Majority): {}, votes_red={}'.format(*majority_clf(x_isred_boot)))
print('Is red (by Average) : {}, avg={}'.format(*avg_clf(x_isred_boot)))
Is red (by Majority): True, votes_red=6
Is red (by Average) : False, avg=0.45
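The two approaches disagree: six of the ten bootstrapped estimates exceed 0.5, so the majority vote classifies the observation as red, while the four small estimates pull the average probability down to 0.45, so classifying on the averaged probability does not predict red.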
A decision tree divides the predictor space into J distinct non-overlapping regions (see 4b above). But how do we decide where to place the splits?
One approach might be to consider every possible combination of splits across all predictors, but this is unlikely to be tractable: the number of possible split combinations grows extremely quickly for any moderately sized predictor space.
To get around this problem, a top-down greedy approach called recursive binary splitting is used. This reduces the search space by considering only the best single split at each step. Optimality is measured by the residual sum of squares (RSS): at each step we search for the one additional split in predictor space that most reduces the RSS. The process is repeated recursively, with all sub-regions of all features considered at each step, and with splits from previous steps retained (i.e. a new split cannot cross a pre-existing split).
As this process recurses, an additional split, and therefore an additional region, is added at each step. The regions at the bottom of the tree with no subsequent splits are called terminal nodes or leaves. Splitting continues until the number of observations in each leaf falls below some threshold $t$.
In the regression setting, the estimated value for an observation that falls into a given leaf node is the mean response of all training observations in that leaf.
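As a rough illustration (a minimal sketch of a single greedy step, not code from the original notebook), the search for the best split scores every predictor/cutpoint pair by the RSS of the two regions it creates, with each region predicted by its mean response:

```python
import numpy as np

def best_split(X, y):
    """Greedy search for the single split (feature j, cutpoint s) that minimises RSS.

    X : (n, p) array of predictors; y : (n,) array of responses.
    """
    best_j, best_s, best_rss = None, None, np.inf
    for j in range(X.shape[1]):              # consider every predictor
        for s in np.unique(X[:, j]):         # and every candidate cutpoint
            left, right = y[X[:, j] < s], y[X[:, j] >= s]
            if len(left) == 0 or len(right) == 0:
                continue                     # skip degenerate splits
            # Each region is predicted by its mean response; score by total RSS
            rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if rss < best_rss:
                best_j, best_s, best_rss = j, s, rss
    return best_j, best_s, best_rss
```

The full procedure applies this search recursively to each resulting region until the stopping criterion is met.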
If the threshold $t$ is small, the above procedure will produce a large, complex tree that is optimised with respect to the training set. The training set will likely contain some bias that is not representative of the true population; a tree that has accounted for it has overfit the training data, and this will impede its predictive power on new observations.
One approach to mitigating overfitting might be to continue step one only so long as the reduction in RSS at each step exceeds some threshold. The problem with this is that a very valuable split might follow a low-value split that doesn't clear the threshold, in which case tree growth is halted prematurely. I'm not sure how often this occurs in practice, but the ISL authors suggest it does.
Another approach is to grow a very large tree that overfits the training set and then prune it back by removing splits. How do we choose which splits to remove?
This can be achieved by adding a penalty to the RSS cost function that penalises a larger number of terminal nodes in the tree. The RSS is given by:
$$\sum_{j=1}^{J} \sum_{i \in R_j} (y_i - \hat{y}_{R_j})^2$$

Adding a penalty for the number of terminal nodes $|T|$ is simple:
$$\sum_{j=1}^{J} \sum_{i \in R_j} (y_i - \hat{y}_{R_j})^2 + \alpha|T|$$

Here $\alpha$ is a tuning parameter that controls the severity of the penalty: a value of $\alpha$ near 0 results in a large tree, which decreases in size as $\alpha$ is increased.
We can now use cross-validation over a range of $\alpha$ to choose the optimal tree size. Each $\alpha$ corresponds to a subtree of the full tree, whose test score we estimate by cross-validation. More generally, this is an optimisation of the bias-variance trade-off.
Finally, we choose the subtree (i.e. the value of $\alpha$) that produces the best cross-validation score and use it for prediction.
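As a hedged sketch (using scikit-learn's cost-complexity pruning on simulated data; this is not code from the original notebook), the cross-validated search over $\alpha$ might look like:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Simulated data purely for illustration
X, y = make_regression(n_samples=300, n_features=5, noise=10, random_state=0)

# Effective alphas at which the fully grown tree is successively pruned
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)

# Estimate the test score of each subtree (each alpha) by 5-fold cross-validation
cv_scores = [cross_val_score(DecisionTreeRegressor(ccp_alpha=a, random_state=0),
                             X, y, cv=5).mean()
             for a in path.ccp_alphas]

best_alpha = path.ccp_alphas[int(np.argmax(cv_scores))]
print('Best alpha: {:.4f}'.format(best_alpha))
```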