It is mentioned in Section 8.2.3 that boosting using depth-one trees (or stumps) leads to an additive model: that is, a model of the form $f(X) = \sum_{j=1}^{p} f_j(X_j)$.
Explain why this is the case. You can begin with (8.12) in Algorithm 8.2.
Each tree fit in Algorithm 8.2 is a stump, i.e. a single split on a single predictor, so every term added to the model is a function of just one of the predictors. The boosted model in (8.12) is a sum of these stumps (each scaled by λ), and collecting the terms that split on the same predictor into one function per predictor gives exactly the additive form above.
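Writing this out (the grouping step below is our own notation, not from the text; $j_b$ denotes the predictor on which the $b$-th stump splits):

$$
\hat{f}(x) \;=\; \sum_{b=1}^{B} \lambda \hat{f}^{\,b}(x)
\;=\; \sum_{j=1}^{p} \underbrace{\sum_{b:\,j_b = j} \lambda \hat{f}^{\,b}(x)}_{=\,f_j(X_j)}
\;=\; \sum_{j=1}^{p} f_j(X_j)
$$

Each $\hat{f}^{\,b}(x)$ depends only on $X_{j_b}$, so the inner sum is a function of $X_j$ alone, which we call $f_j(X_j)$.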
Consider the Gini index, classification error, and entropy in a simple classification setting with two classes. Create a single plot that displays each of these quantities as a function of pˆm1. Hint: In a setting with two classes, pˆm1 = 1 − pˆm2. You could make this plot by hand, but it will be much easier to make in R.
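For reference, the three impurity measures computed below are the ones defined in Section 8.1.2 of ISL, for a node $m$ with class proportions $\hat{p}_{mk}$ (classification error, Gini index, and entropy respectively):

$$
E = 1 - \max_k \hat{p}_{mk}, \qquad
G = \sum_{k=1}^{K} \hat{p}_{mk}(1 - \hat{p}_{mk}), \qquad
D = -\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}
$$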
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# binary classification: pm1 is the complement of pm2
pm1 = np.linspace(0, 1, 100)
pm2 = 1 - pm1
pmk = pd.DataFrame({'pm1':pm1,'pm2':pm2})
# binary classification: k = 2
# Gini
g1 = pmk.pm1 * (1 - pmk.pm1)
g2 = pmk.pm2 * (1 - pmk.pm2)
g = g1 + g2
# Entropy; replace p = 0 with 1 so the term becomes 0 * log(1) = 0
# rather than 0 * log(0) = NaN
e1 = pmk.pm1 * np.log(pmk.pm1.replace([0], 1))
e2 = pmk.pm2 * np.log(pmk.pm2.replace([0], 1))
e = -(e1 + e2)
# Classification error: the fraction of observations not belonging to the
# most common class; with two classes this is simply min(pm1, pm2)
ce = pmk.min(axis=1)
# plot
line_df = pd.DataFrame({'gini' : g,
'pm1': pmk.pm1,
'class_err': ce,
'entropy' : e})
sns.lineplot(x='pm1', y='gini', data=line_df, color='tab:red', label='Gini index')
sns.lineplot(x='pm1', y='entropy', data=line_df, color='tab:blue', label='Entropy')
sns.lineplot(x='pm1', y='class_err', data=line_df, color='tab:green', label='Classification error')
plt.ylabel('Impurity measure')
plt.legend()
(a) Sketch the tree corresponding to the partition of the predictor space illustrated in the left-hand panel of Figure 8.12. The numbers inside the boxes indicate the mean of Y within each region.
(b) Create a diagram similar to the left-hand panel of Figure 8.12, using the tree illustrated in the right-hand panel of the same figure. You should divide up the predictor space into the correct regions, and indicate the mean for each region.
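Since Figure 8.12 itself is not reproduced here, no answer is written out for (a) and (b). As a starting point, the following is only a minimal matplotlib sketch of how a partition diagram like the one asked for in (b) could be drawn; the split points and region means in it are placeholders, not values read from the figure, and need to be replaced with the ones shown in the right-hand panel.
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(5, 5))
ax.set_xlim(0, 3)
ax.set_ylim(0, 3)
# placeholder splits: one split on X1, then a split on X2 within X1 < 1
ax.axvline(x=1, color='black')                 # X1 = 1 (placeholder)
ax.hlines(y=2, xmin=0, xmax=1, color='black')  # X2 = 2 for X1 < 1 (placeholder)
# placeholder region means of Y
ax.text(0.5, 1.0, '5', ha='center')
ax.text(0.5, 2.5, '10', ha='center')
ax.text(2.0, 1.5, '15', ha='center')
ax.set_xlabel('X1')
ax.set_ylabel('X2')
plt.show()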
Suppose we produce ten bootstrapped samples from a data set containing red and green classes. We then apply a classification tree to each bootstrapped sample and, for a specific value of X, produce 10 estimates of P(Class is Red|X): 0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, and 0.75.
There are two common ways to combine these results together into a single class prediction. One is the majority vote approach discussed in this chapter. The second approach is to classify based on the average probability. In this example, what is the final classification under each of these two approaches?
# majority vote: predict Red if more than half of the 10 trees predict Red
ps = np.array([0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, 0.75])
res = 'Red' if (((ps > 0.5).sum() / len(ps)) > 0.5) else 'Green'
print('majority vote: {}'.format(res))
# average probability: predict Red if the mean of the 10 probabilities exceeds 0.5
res = 'Red' if (ps.mean() > 0.5) else 'Green'
print('mean probability: {}'.format(res))
majority vote: Red
mean probability: Green
If we choose a small value for the minimum number of observations in the terminal nodes, (1) will yield a tree with low bias and high variance. To obtain a lower-variance model, one can use cost-complexity pruning to collapse nodes, accepting some increase in bias in the hope of a larger reduction in variance and thus better overall predictive performance. In practice, pruning a single tree is not the most common approach in ML: the preferred means of reducing variance is an ensemble method such as bagging, random forests, or an additive boosted tree model, although these techniques are not as interpretable as a single tree.
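As a rough illustration of this bias-variance point, here is a minimal sketch (not part of the original answer) comparing a deep tree, a cost-complexity-pruned tree, and a random forest by cross-validated MSE. The make_friedman1 data set, the 5-fold CV setup, and all hyperparameter values are arbitrary choices for the demonstration; the pruning level is selected by cross-validation over scikit-learn's ccp_alpha path.
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor
# synthetic regression data (arbitrary choice for illustration)
X, y = make_friedman1(n_samples=500, noise=1.0, random_state=0)
def cv_mse(est):
    # 5-fold cross-validated mean squared error
    return -cross_val_score(est, X, y, cv=5,
                            scoring='neg_mean_squared_error').mean()
# (1) a deep tree: small terminal nodes, low bias, high variance
deep = DecisionTreeRegressor(min_samples_leaf=2, random_state=0)
# cost-complexity pruning: pick the alpha with the best CV error
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)
alphas = path.ccp_alphas[::10]  # thin the path to keep the search cheap
best_alpha = min(alphas,
                 key=lambda a: cv_mse(DecisionTreeRegressor(ccp_alpha=a,
                                                            random_state=0)))
pruned = DecisionTreeRegressor(ccp_alpha=best_alpha, random_state=0)
# ensemble alternative: averaging many trees reduces variance further
forest = RandomForestRegressor(n_estimators=200, random_state=0)
for name, est in [('deep tree', deep),
                  ('pruned tree', pruned),
                  ('random forest', forest)]:
    print('{:<13} CV MSE: {:.2f}'.format(name, cv_mse(est)))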