The essentials of a random forest are best illustrated by introducing its decision trees first. A decision tree predicts a label by repeatedly splitting the samples into purer subsets. To introduce some terms, consider the picture below.
The picture shows seven samples (x, y) that have been labelled "ocher" or "green". The outlined circle represents two identical samples (x, y) that only differ in label. Therefore, the samples (x, y) cannot be split into perfectly pure sets.
[Figure: the seven samples (x, y), labelled "ocher" or "green"; the outlined circle marks the two identical samples that only differ in label.]
The Classification And Regression Tree (CART) algorithm uses the Gini impurity to quantify the diversity of the labels assigned to a set of samples by:
$$I = \sum_{i=1}^{n} p_i \times (1 - p_i)$$
where $n$ is the number of labels, here two ("ocher", "green"), and $p_i$ is the proportion of the samples that carry the label $i$. The proportion $p_i$ ranges from zero to one. The picture shows that a label $i$ does not contribute to the Gini impurity $I$ when $p_i = 0$ or $p_i = 1$, and that it contributes maximally when $p_i = 1/2$.
[Figure: the contribution $p_i \times (1 - p_i)$ to the Gini impurity as a function of $p_i$.]
The Gini impurity of the set of all samples (x,y) is:
$$I_u = p_{\text{ocher}} \times (1 - p_{\text{ocher}}) + p_{\text{green}} \times (1 - p_{\text{green}}) = 3/7 \times 4/7 + 4/7 \times 3/7 = 24/49 \approx 0.49$$
The Gini impurity of the sets resulting from the partitioning by x1 in the picture is given by:
$$I_{<x_1} = p_{\text{ocher}} \times (1 - p_{\text{ocher}}) + p_{\text{green}} \times (1 - p_{\text{green}}) = 2/2 \times (1 - 2/2) + 0/2 \times (1 - 0/2) = 0$$
$$I_{\geq x_1} = p_{\text{ocher}} \times (1 - p_{\text{ocher}}) + p_{\text{green}} \times (1 - p_{\text{green}}) = 1/5 \times (1 - 1/5) + 4/5 \times (1 - 4/5) = 8/25 \approx 0.32$$
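As a minimal sketch, the impurities above can be reproduced in a few lines of Python; the label lists below merely mirror the label counts in the picture.

```python
from collections import Counter

def gini(labels):
    """Gini impurity I = sum_i p_i * (1 - p_i) of a list of labels."""
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

# Label counts taken from the picture: three "ocher" and four "green" samples.
all_samples = ["ocher"] * 3 + ["green"] * 4
left_of_x1  = ["ocher"] * 2                    # {x < x1}: two ocher samples
right_of_x1 = ["ocher"] + ["green"] * 4        # {x >= x1}: one ocher, four green

print(round(gini(all_samples), 2))    # 0.49  (= 24/49)
print(round(gini(left_of_x1), 2))     # 0.0
print(round(gini(right_of_x1), 2))    # 0.32  (= 8/25)
```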
The Table shows the Gini impurity I of the sets generated by all splits in the samples (x,y) in the picture:
| Split | Gini impurity I< | Gini impurity I≥ | Gini gain G |
|---|---|---|---|
| x0 | 0.00 | 0.44 | 0.11 |
| x1 | 0.00 | 0.32 | 0.26 |
| x2 | 0.44 | 0.38 | 0.08 |
| x3 | 0.48 | 0.00 | 0.15 |
| x4 | 0.50 | 0.00 | 0.06 |
| y | 0.48 | 0.50 | 0.01 |
The best split is the split that yields subsets that are both large and pure. The Gini gain G is just one of the optimisation criteria that may be used to identify the best split. The Gini gain G of a split equals the difference between the impurity $I_u$ of the set of all samples entering that split and the $p_j$-weighted sum of the impurities $I_j$ of the $m$ partitioned sets:
$$G = I_u - \sum_{j=1}^{m} I_j \times p_j$$
The Gini gain G resulting from the split of the set of all samples by x1 is given by:
$$G = 24/49 - (2/7 \times 0.00 + 5/7 \times 0.32) = 24/49 - 8/35 \approx 0.26$$
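Continuing the Python sketch above, the Gini gain of the split by x1 follows directly from the same label lists:

```python
def gini_gain(parent, subsets):
    """Gini gain G = I_u - sum_j p_j * I_j of splitting `parent` into `subsets`."""
    n = len(parent)
    return gini(parent) - sum(len(s) / n * gini(s) for s in subsets)

# Split of the seven samples by x1, reusing gini() and the label lists above.
print(round(gini_gain(all_samples, [left_of_x1, right_of_x1]), 2))   # 0.26
```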
The Table shows the Gini gain G of the candidate splits of the set of all samples (x, y). The split by x1 shows the largest Gini gain G. This means that the decision tree that splits by x1 best separates the labels ("green", "ocher") in terms of Gini gain G. So, the split by x1 yields subsets that are relatively large and pure. The picture below shows the decision tree with the largest Gini gain G:
[Figure: the decision tree that splits the samples by x1.]
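In practice, the best split is found by scanning all candidate thresholds and keeping the one with the largest Gini gain. The sketch below, which reuses gini_gain from above, illustrates such a search on made-up coordinates; the actual coordinates of the seven samples are only shown in the picture, so these numbers do not reproduce the Table.

```python
def best_split(values, labels):
    """Return (threshold, gain) of the split v < threshold with the largest Gini gain."""
    best = (None, 0.0)
    for threshold in sorted(set(values)):
        left = [lab for v, lab in zip(values, labels) if v < threshold]
        right = [lab for v, lab in zip(values, labels) if v >= threshold]
        if not left or not right:
            continue
        gain = gini_gain(labels, [left, right])
        if gain > best[1]:
            best = (threshold, gain)
    return best

# Made-up one-dimensional example, unrelated to the seven samples of the picture.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
labels = ["ocher", "ocher", "green", "ocher", "green", "green"]
threshold, gain = best_split(x, labels)
print(threshold, round(gain, 2))   # 3.0 0.25
```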
As the set {x < x1} is pure, i.e. $I_< = 0$, further splitting is superfluous. However, partitioning the set {x ≥ x1} may further reduce impurity. The picture shows the candidate splits:
[Figure: the candidate splits of the set {x ≥ x1}.]
The Table shows the Gini impurity I and the Gini gain G:
| Split | Gini impurity I≥x1;< | Gini impurity I≥x1;≥ | Gini gain G |
|---|---|---|---|
| x3 | 0.44 | 0.00 | 0.06 |
| x4 | 0.38 | 0.00 | 0.02 |
| y | 0.38 | 0.00 | 0.02 |
A split over x3 yields the largest Gini gain G. Therefore, extending the decision tree by a split over x3 will best separate the labels ("green", "ocher").
[Figure: the decision tree extended by the split of {x ≥ x1} by x3.]
Ultimately, the decision tree cannot be extended any further: the remaining impurity stems from the two identical samples (x, y) that only differ in label.
[Figure: the fully grown decision tree.]
Decision trees are known to be sensitive to the composition of the training set, while being insensitive to, for instance, the scale of the individual factors.
A random forest is a means to reduce the sensitivity to the composition of the training set while preserving the nice insensitivities of a decision tree to a large extent. A random forest classification aggregates the classifications of many independent decision trees that have been built on bootstrapped samples. Building trees on bootstrapped samples and aggregating their outputs is known as bagging (bootstrap aggregating).
Step 1 Create a bootstrapped sample (x, y). Bootstrapping is random sampling with replacement. Note that a bootstrapped sample typically does not contain all samples (x, y). The unselected samples are said to be out-of-bag samples.
[Figure: a bootstrapped sample drawn from the seven samples (x, y).]
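A minimal Python sketch of step 1, assuming NumPy (the demo script may implement this differently):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

n = 7                                     # the seven samples (x, y)
indices = np.arange(n)

# Step 1: bootstrapping = drawing n sample indices at random, with replacement.
bootstrap = rng.choice(indices, size=n, replace=True)

# Samples that were never drawn form the out-of-bag set.
out_of_bag = np.setdiff1d(indices, bootstrap)

print(np.sort(bootstrap))   # duplicates are expected
print(out_of_bag)           # typically a few unselected (out-of-bag) samples
```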
Step 2 Create a decision tree on the bootstrapped sample (x, y), but only consider a randomly selected subset of the factors at each node of the tree. For example, only the factor X is considered to assess the best split for the root node of the tree using the Gini gain G criterion. For this specific bootstrapped sample, the root node of the decision tree reduces the Gini impurity I of the subsets {x < x1} and {x ≥ x1} to zero. If it did not, further nodes would be added to the decision tree, each again using a randomly selected subset of the factors {X, Y}.
[Figure: a decision tree built on the bootstrapped sample, considering only the factor X at the root node.]
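The per-node selection of a random subset of factors may be sketched as follows, here with the two factors X and Y of the example; the size m of the subset is a tunable parameter of the forest:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

factors = np.array(["X", "Y"])   # all available factors
m = 1                            # number of factors considered at each node

# At every node, only this random subset of factors is assessed for the best split.
candidate_factors = rng.choice(factors, size=m, replace=False)
print(candidate_factors)         # e.g. ['X']: only X is assessed at this node
```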
By repeating step 1 and step 2 many times, a random forest of different decision trees is created. A new sample (x8, y8) with an unknown label is then evaluated by each of the decision trees in the random forest. Each decision tree votes whether the sample (x8, y8) is likely to be labelled "ocher" or "green". The random forest classifier then expresses the support of a particular label among the decision trees in the random forest.
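As a sketch, aggregating the votes of the trees amounts to a majority count; the votes below are hypothetical:

```python
from collections import Counter

# Hypothetical votes of the individual decision trees for the new sample (x8, y8).
votes = ["ocher", "green", "green", "ocher", "green", "green", "green"]

label, count = Counter(votes).most_common(1)[0]
support = count / len(votes)

print(label, round(support, 2))   # green 0.71: the majority label and its support
```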
To validate the random forest, the demo script partitions the sample into a training set and a test set. The oil samples in the training set will be used to construct the random forest. The oil samples in the test set will be evaluated by the random forest. As the labels of the samples in the test set are in fact known, it becomes clear whether the random forest classifier correctly labelled these oil samples.
[Figure: classification of the oil samples in the test set by the random forest.]
The picture shows that the random forest in the demo script correctly predicts the label "Age > 1 year" of the oil samples in the test set in this specific run of the script. This accuracy score indicates the random forest's capability to predict beyond the training set. In the picture above, the accuracy score is simply the proportion of correctly predicted labels that were assigned to the test set of oil samples.
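A sketch of such a validation step, assuming Python with scikit-learn and using a synthetic stand-in for the oil measurements (the demo script's actual data handling and settings may differ):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(seed=0)

# Synthetic stand-in for the oil measurements: 200 samples, 6 factors, and a
# binary label ("Age > 1 year") loosely coupled to the first factor.
X = rng.normal(size=(200, 6))
y = (X[:, 0] + rng.normal(scale=0.5, size=200)) > 0

# Partition the samples into a training set and a test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Construct the random forest on the training set and evaluate the test set.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# The accuracy is the proportion of correctly predicted labels in the test set.
print(accuracy_score(y_test, forest.predict(X_test)))
```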
To build trust in the predictions, decision makers are interested to know which factors are the most important. The demo script assigns both a Gini importance and a permutation importance to the various factors. Both importance measures have their pros and cons. Gini importance tends to favour factors that take many different values over factors that only take a few values. The permutation importance tends to ignore factors that are mutually dependent.
The Gini importance $E_{t,j}$ of an individual node $j$ in a decision tree $t$ follows from:
$$E_{t,j} = \frac{n_{t,j} \times G_{t,j}}{\sum_{j=1}^{m} n_{t,j} \times G_{t,j}}$$
where $n_{t,j}$ is the number of samples entering the node $j$ and $G_{t,j}$ is the Gini gain of the node $j$. For the decision tree above, constructed on the seven samples (x, y), the Gini importance $E_{t,1}$ of the top node is:
$$E_{t,1} = \frac{7 \times 0.26}{(7 \times 0.26) + (5 \times 0.06) + (3 \times 0.11)} = 0.74$$
Similarly, the Gini importance $E_{t,X}$ of a factor X in this decision tree $t$ is just the sum of the Gini importances $E_{t,j}$ of the nodes $j$ that involve this factor X:
$$E_{t,X} = \sum_{\forall j \mid X} E_{t,j}$$
For the above decision tree that has been constructed on the seven samples (x, y), the first two nodes involve the factor X. The Gini importance $E_{t,X}$ of the factor X is therefore:
$$E_{t,X} = E_{t,1} + E_{t,2} = 0.74 + 0.12 = 0.86$$
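Using the node statistics quoted above (samples entering the node, Gini gain, factor used), the Gini importances of this tree can be reproduced as follows; the third node is assumed to split on the factor Y, as the first two nodes split on X:

```python
# (samples entering the node, Gini gain of the node, factor used by the node)
nodes = [
    (7, 0.26, "X"),   # root node: split by x1
    (5, 0.06, "X"),   # second node: split by x3
    (3, 0.11, "Y"),   # third node (assumed to split on the factor Y)
]

total = sum(n * g for n, g, _ in nodes)                 # 2.45
node_importance = [n * g / total for n, g, _ in nodes]

factor_importance = {}
for (n, g, factor), e in zip(nodes, node_importance):
    factor_importance[factor] = factor_importance.get(factor, 0.0) + e

print([round(e, 2) for e in node_importance])   # [0.74, 0.12, 0.13]
print(factor_importance)                        # X: ~0.865 (the 0.86 above adds the
                                                # already rounded node values), Y: ~0.135
```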
The Gini importance of a factor X in a random forest is just the mean of the Gini importances $E_{t,X}$ over the $T$ individual decision trees $t$:
$$E_{RF,X} = \frac{1}{T} \times \sum_{t=1}^{T} E_{t,X}$$
The permutation importance $E_{RF,X}$ of a factor X in a specific random forest RF follows from:
$$E_{RF,X} = A - \frac{1}{K} \times \sum_{k=1}^{K} A_{k,X}$$
where $A$ is the accuracy of the random forest and $A_{k,X}$ is the accuracy of the random forest after randomly reshuffling (permuting) the factor X. If a random reshuffle of the factor X heavily affects the accuracy, the permutation importance $E_{RF,X}$ assigned to the factor X will be large. To reduce statistical noise, the factor X is reshuffled $K$ times and the mean of the accuracies $A_{k,X}$ is used.
If the label is a categorical variable, the relative frequency of the correctly predicted samples typically represents the accuracy $A$. If the label is numerical, the coefficient of determination $r^2$ may represent the accuracy.
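A sketch of this reshuffling loop, continuing the scikit-learn example above; the helper name is illustrative, and scikit-learn also ships a ready-made permutation_importance utility in sklearn.inspection:

```python
import numpy as np
from sklearn.metrics import accuracy_score

def permutation_drop(model, X_test, y_test, column, K=10, seed=0):
    """E = A - (1/K) * sum_k A_k: accuracy drop after reshuffling one factor K times."""
    rng = np.random.default_rng(seed)
    A = accuracy_score(y_test, model.predict(X_test))
    reshuffled_accuracies = []
    for _ in range(K):
        X_shuffled = X_test.copy()
        X_shuffled[:, column] = rng.permutation(X_shuffled[:, column])
        reshuffled_accuracies.append(accuracy_score(y_test, model.predict(X_shuffled)))
    return A - np.mean(reshuffled_accuracies)

# Reuses the forest, X_test and y_test of the validation sketch above.
for column in range(X_test.shape[1]):
    print(column, round(permutation_drop(forest, X_test, y_test, column), 3))
```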
This section compares the Gini importance and the permutation importance with respect to a random forest that has been inferred from the oil analysis data. The demo script presents both importance measures. For the label "Age > 1 year", the Table below shows the most important factors in a specific random forest.
| Factor | Gini importance | Factor | Permutation importance |
|---|---|---|---|
| CU | 0.2244810 | CU | 0.029167 |
| FE | 0.09556239 | WATER | 0.028125 |
| VIS40 | 0.07206340 | BRSTVD | 0.025000 |
| ISO4406 L | 0.06953566 | P | 0.023958 |
| BRSTVD | 0.06820062 | VIS99 | 0.019792 |
| WATER | 0.06671420 | FE | 0.019792 |
The Gini importance $E_{RF,CU}$ of the copper factor (CU) appears to be the largest, meaning that copper yields the largest mean decrease in impurity. As the copper factor takes more diverse values than some other factors, the Gini importance $E_{RF,CU}$ may overstate its importance relative to the decision maker's intuition.
The permutation importance $E_{RF,CU}$ of the copper factor (CU) also appears to be the largest, meaning that reshuffling the copper values yields the largest decrease in accuracy. As the factors in the oil samples are mutually dependent, other factors may compensate for the randomising of the copper factor. Then, the permutation importance of the copper factor may become very small, even if copper is actually a cause of the label.
In the absence of a genuine representation of importance, usually several imperfect indicators of importance will be compared. Awareness of their imperfections assists a decision maker in understanding the predictions of a random forest.