The essentials of a random forest are best illustrated by introducing its decision trees first. A decision tree predicts a label by repeatedly splitting the samples into purer subsets. To introduce some terms, consider the picture below.
The picture shows seven samples (x, y) that have been labelled "ocher" or "green". The outlined circle represents two identical samples (x, y) that only differ in label. Therefore, the samples (x, y) cannot be split into perfectly pure sets.
[Figure: the seven samples (x, y), labelled "ocher" or "green"; the outlined circle marks the two identical samples that only differ in label.]
The Classification And Regression Tree (CART) algorithm uses the Gini impurity to quantify the diversity of the labels assigned to a set of samples by:
$$I = \sum_{i=1}^{n} p_i \times (1 - p_i)$$
where $n$ is the number of labels, here two ("ocher", "green"), and $p_i$ is the proportion of the samples that carry the label $i$. The proportion $p_i$ ranges from zero to one. The picture shows that a label $i$ does not contribute to the Gini impurity $I$ when $p_i = 0$ or $p_i = 1$, and that it contributes maximally when $p_i = 1/2$.
[Figure: the contribution $p_i \times (1 - p_i)$ to the Gini impurity as a function of $p_i$.]
The Gini impurity of the set of all samples (x,y) is:
$$I_u = p_{\text{ocher}} \times (1 - p_{\text{ocher}}) + p_{\text{green}} \times (1 - p_{\text{green}}) = 3/7 \times 4/7 + 4/7 \times 3/7 = 24/49 \approx 0.49$$
The Gini impurity of the sets resulting from the partitioning by x1 in the picture is given by:
$$I_{<x_1} = p_{\text{ocher}} \times (1 - p_{\text{ocher}}) + p_{\text{green}} \times (1 - p_{\text{green}}) = 2/2 \times (1 - 2/2) + 0/2 \times (1 - 0/2) = 0$$
$$I_{\geq x_1} = p_{\text{ocher}} \times (1 - p_{\text{ocher}}) + p_{\text{green}} \times (1 - p_{\text{green}}) = 1/5 \times (1 - 1/5) + 4/5 \times (1 - 4/5) = 8/25 \approx 0.32$$
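As a minimal sketch, the impurities above can be reproduced in a few lines of Python; the label lists below merely mirror the label counts in the picture.

```python
from collections import Counter

def gini(labels):
    """Gini impurity I = sum_i p_i * (1 - p_i) of a list of labels."""
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

# Label counts taken from the picture: three "ocher" and four "green" samples.
all_samples = ["ocher"] * 3 + ["green"] * 4
left_of_x1  = ["ocher"] * 2                    # {x < x1}: two ocher samples
right_of_x1 = ["ocher"] + ["green"] * 4        # {x >= x1}: one ocher, four green

print(round(gini(all_samples), 2))    # 0.49  (= 24/49)
print(round(gini(left_of_x1), 2))     # 0.0
print(round(gini(right_of_x1), 2))    # 0.32  (= 8/25)
```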
The Table shows the Gini impurity I of the sets generated by all splits in the samples (x,y) in the picture:
| Split | Gini impurity I< | Gini impurity I≥ | Gini gain G |
|---|---|---|---|
| x0 | 0.00 | 0.44 | 0.11 |
| x1 | 0.00 | 0.32 | 0.26 |
| x2 | 0.44 | 0.38 | 0.08 |
| x3 | 0.48 | 0.00 | 0.15 |
| x4 | 0.50 | 0.00 | 0.06 |
| y | 0.48 | 0.50 | 0.01 |
The best split is the split that yields subsets that are both large and pure. The Gini gain G is just one of the optimisation criteria that may be used to identify the best split. The Gini gain G of a split equals the difference between the impurity $I_u$ of the set of all samples entering that split and the $p_j$-weighted sum of the impurities $I_j$ of the $m$ partitioned sets:
$$G = I_u - \sum_{j=1}^{m} I_j \times p_j$$
The Gini gain G resulting from the split of the set of all samples by x1 is given by:
$$G = 24/49 - (2/7 \times 0.00 + 5/7 \times 0.32) = 24/49 - 8/35 \approx 0.26$$
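Continuing the Python sketch above, the Gini gain of the split by x1 follows directly from the same label lists:

```python
def gini_gain(parent, subsets):
    """Gini gain G = I_u - sum_j p_j * I_j of splitting `parent` into `subsets`."""
    n = len(parent)
    return gini(parent) - sum(len(s) / n * gini(s) for s in subsets)

# Split of the seven samples by x1, reusing gini() and the label lists above.
print(round(gini_gain(all_samples, [left_of_x1, right_of_x1]), 2))   # 0.26
```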
The Table shows the Gini gain G of the candidate splits of the set of all samples (x, y). The split by x1 shows the largest Gini gain G. This means that the decision tree that splits by x1 best separates the labels ("green", "ocher") in terms of Gini gain G. So, the split by x1 yields subsets that are relatively large and pure. The picture below shows the decision tree with the largest Gini gain G:
[Figure: the decision tree that splits the samples by x1.]
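In practice, the best split is found by scanning all candidate thresholds and keeping the one with the largest Gini gain. The sketch below, which reuses gini_gain from above, illustrates such a search on made-up coordinates; the actual coordinates of the seven samples are only shown in the picture, so these numbers do not reproduce the Table.

```python
def best_split(values, labels):
    """Return (threshold, gain) of the split v < threshold with the largest Gini gain."""
    best = (None, 0.0)
    for threshold in sorted(set(values)):
        left = [lab for v, lab in zip(values, labels) if v < threshold]
        right = [lab for v, lab in zip(values, labels) if v >= threshold]
        if not left or not right:
            continue
        gain = gini_gain(labels, [left, right])
        if gain > best[1]:
            best = (threshold, gain)
    return best

# Made-up one-dimensional example, unrelated to the seven samples of the picture.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
labels = ["ocher", "ocher", "green", "ocher", "green", "green"]
threshold, gain = best_split(x, labels)
print(threshold, round(gain, 2))   # 3.0 0.25
```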
As the set {x < x1} is pure, i.e. $I_< = 0$, further splitting is superfluous. However, partitioning the set {x ≥ x1} may further reduce impurity. The picture shows the candidate splits:
[Figure: the candidate splits of the set {x ≥ x1}.]
The Table shows the Gini impurity I and the Gini gain G:
| Split | Gini impurity I≥x1;< | Gini impurity I≥x1;≥ | Gini gain G |
|---|---|---|---|
| x3 | 0.44 | 0.00 | 0.06 |
| x4 | 0.38 | 0.00 | 0.02 |
| y | 0.38 | 0.00 | 0.02 |
A split over x3 yields the largest Gini gain G. Therefore, extending the decision tree by a split over x3 will best separate the labels ("green", "ocher").
[Figure: the decision tree extended by the split of {x ≥ x1} by x3.]
Ultimately, the decision tree cannot be extended any further: the remaining impurity stems from the two identical samples (x, y) that only differ in label.
[Figure: the fully grown decision tree.]
Decision trees are known to be sensitive to the composition of the training set, while being insensitive to, for instance, the scale of the individual factors.
A random forest is a means to reduce the sensitivity to the composition of the training set while preserving the nice insensitivities of a decision tree to a large extent. A random forest classification aggregates the classifications of many independent decision trees that have been built on bootstrapped samples. Building trees on bootstrapped samples and aggregating their outputs is known as bagging (bootstrap aggregating).
Step 1 Create a bootstrapped sample (x, y). Bootstrapping is random sampling with replacement. Note that a bootstrapped sample typically does not contain all samples (x, y). The unselected samples are said to be out-of-bag samples.
[Figure: a bootstrapped sample drawn from the seven samples (x, y).]
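A minimal Python sketch of step 1, assuming NumPy (the demo script may implement this differently):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

n = 7                                     # the seven samples (x, y)
indices = np.arange(n)

# Step 1: bootstrapping = drawing n sample indices at random, with replacement.
bootstrap = rng.choice(indices, size=n, replace=True)

# Samples that were never drawn form the out-of-bag set.
out_of_bag = np.setdiff1d(indices, bootstrap)

print(np.sort(bootstrap))   # duplicates are expected
print(out_of_bag)           # typically a few unselected (out-of-bag) samples
```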
Step 2 Create a decision tree on the bootstrapped sample (x, y), but only consider a randomly selected subset of the factors at each node of the tree. For example, only the factor X is considered to assess the best split for the root node of the tree using the Gini gain G criterion. For this specific bootstrapped sample, the root node of the decision tree reduces the Gini impurity I of the subsets {x < x1} and {x ≥ x1} to zero. If it did not, further nodes would be added to the decision tree, each again using a randomly selected subset of the factors {X, Y}.
[Figure: a decision tree built on the bootstrapped sample, considering only the factor X at the root node.]
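The per-node selection of a random subset of factors may be sketched as follows, here with the two factors X and Y of the example; the size m of the subset is a tunable parameter of the forest:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

factors = np.array(["X", "Y"])   # all available factors
m = 1                            # number of factors considered at each node

# At every node, only this random subset of factors is assessed for the best split.
candidate_factors = rng.choice(factors, size=m, replace=False)
print(candidate_factors)         # e.g. ['X']: only X is assessed at this node
```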
By repeating step 1 and step 2 many times, a random forest of different decision trees is created. A new sample (x8, y8) with an unknown label is then evaluated by each of the decision trees in the random forest. Each decision tree votes whether the sample (x8, y8) is likely to be labelled "ocher" or "green". The random forest classifier then expresses the support of a particular label among the decision trees in the random forest.
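As a sketch, aggregating the votes of the trees amounts to a majority count; the votes below are hypothetical:

```python
from collections import Counter

# Hypothetical votes of the individual decision trees for the new sample (x8, y8).
votes = ["ocher", "green", "green", "ocher", "green", "green", "green"]

label, count = Counter(votes).most_common(1)[0]
support = count / len(votes)

print(label, round(support, 2))   # green 0.71: the majority label and its support
```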
To validate the random forest, the demo script partitions the sample into a training set and a test set. The oil samples in the training set will be used to construct the random forest. The oil samples in the test set will be evaluated by the random forest. As the labels of the samples in the test set are in fact known, it becomes clear whether the random forest classifier correctly labelled these oil samples.
[Figure: classification of the oil samples in the test set by the random forest.]
The picture shows that the random forest in the demo script correctly predicts the label "Age > 1 year" of the oil samples in the test set in this specific run of the script. This accuracy score indicates the random forest's capability to predict beyond the training set. In the picture above, the accuracy score is simply the proportion of correctly predicted labels that were assigned to the test set of oil samples.
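A sketch of such a validation step, assuming Python with scikit-learn and using a synthetic stand-in for the oil measurements (the demo script's actual data handling and settings may differ):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(seed=0)

# Synthetic stand-in for the oil measurements: 200 samples, 6 factors, and a
# binary label ("Age > 1 year") loosely coupled to the first factor.
X = rng.normal(size=(200, 6))
y = (X[:, 0] + rng.normal(scale=0.5, size=200)) > 0

# Partition the samples into a training set and a test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Construct the random forest on the training set and evaluate the test set.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# The accuracy is the proportion of correctly predicted labels in the test set.
print(accuracy_score(y_test, forest.predict(X_test)))
```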
To build trust in the predictions, decision makers are interested to know which factors are the most important. The demo script assigns both a Gini importance and a permutation importance to the various factors. Both importance measures have their pros and cons. Gini importance tends to favour factors that take many different values over factors that only take a few values. The permutation importance tends to ignore factors that are mutually dependent.
The Gini importance $E_{t,j}$ of an individual node $j$ in a decision tree $t$ follows from:
$$E_{t,j} = \frac{n_{t,j} \times G_{t,j}}{\sum_{j=1}^{m} n_{t,j} \times G_{t,j}}$$
where $n_{t,j}$ is the number of samples entering the node $j$ and $G_{t,j}$ is the Gini gain of the node $j$. For the decision tree above, constructed on the seven samples (x, y), the Gini importance $E_{t,1}$ of the top node is:
$$E_{t,1} = \frac{7 \times 0.26}{(7 \times 0.26) + (5 \times 0.06) + (3 \times 0.11)} = 0.74$$
Similarly, the Gini importance $E_{t,X}$ of a factor X in this decision tree $t$ is just the sum of the Gini importances $E_{t,j}$ of the nodes $j$ that involve this factor X:
$$E_{t,X} = \sum_{\forall j \mid X} E_{t,j}$$
For the above decision tree that has been constructed on the seven samples (x, y), the first two nodes involve the factor X. The Gini importance $E_{t,X}$ of the factor X is therefore:
$$E_{t,X} = E_{t,1} + E_{t,2} = 0.74 + 0.12 = 0.86$$
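Using the node statistics quoted above (samples entering the node, Gini gain, factor used), the Gini importances of this tree can be reproduced as follows; the third node is assumed to split on the factor Y, as the first two nodes split on X:

```python
# (samples entering the node, Gini gain of the node, factor used by the node)
nodes = [
    (7, 0.26, "X"),   # root node: split by x1
    (5, 0.06, "X"),   # second node: split by x3
    (3, 0.11, "Y"),   # third node (assumed to split on the factor Y)
]

total = sum(n * g for n, g, _ in nodes)                 # 2.45
node_importance = [n * g / total for n, g, _ in nodes]

factor_importance = {}
for (n, g, factor), e in zip(nodes, node_importance):
    factor_importance[factor] = factor_importance.get(factor, 0.0) + e

print([round(e, 2) for e in node_importance])   # [0.74, 0.12, 0.13]
print(factor_importance)                        # X: ~0.865 (the 0.86 above adds the
                                                # already rounded node values), Y: ~0.135
```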
The Gini importance of a factor X in a random forest is just the mean of the Gini importances $E_{t,X}$ over the $T$ individual decision trees $t$:
$$E_{RF,X} = \frac{1}{T} \times \sum_{t=1}^{T} E_{t,X}$$
The permutation importance $E_{RF,X}$ of a factor X in a specific random forest RF follows from:
$$E_{RF,X} = A - \frac{1}{K} \times \sum_{k=1}^{K} A_{k,X}$$
where $A$ is the accuracy of the random forest and $A_{k,X}$ is the accuracy of the random forest after randomly reshuffling (permuting) the factor X. If a random reshuffle of the factor X heavily affects the accuracy, the permutation importance $E_{RF,X}$ assigned to the factor X will be large. To reduce statistical noise, the factor X is reshuffled $K$ times and the mean of the accuracies $A_{k,X}$ is used.
If the label is a categorical variable, the relative frequency of the correctly predicted samples typically represents the accuracy $A$. If the label is numerical, the coefficient of determination $r^2$ may represent the accuracy.
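A sketch of this reshuffling loop, continuing the scikit-learn example above; the helper name is illustrative, and scikit-learn also ships a ready-made permutation_importance utility in sklearn.inspection:

```python
import numpy as np
from sklearn.metrics import accuracy_score

def permutation_drop(model, X_test, y_test, column, K=10, seed=0):
    """E = A - (1/K) * sum_k A_k: accuracy drop after reshuffling one factor K times."""
    rng = np.random.default_rng(seed)
    A = accuracy_score(y_test, model.predict(X_test))
    reshuffled_accuracies = []
    for _ in range(K):
        X_shuffled = X_test.copy()
        X_shuffled[:, column] = rng.permutation(X_shuffled[:, column])
        reshuffled_accuracies.append(accuracy_score(y_test, model.predict(X_shuffled)))
    return A - np.mean(reshuffled_accuracies)

# Reuses the forest, X_test and y_test of the validation sketch above.
for column in range(X_test.shape[1]):
    print(column, round(permutation_drop(forest, X_test, y_test, column), 3))
```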
This section compares the Gini importance and the permutation importance with respect to a random forest that has been inferred from the oil analysis data. The demo script presents both importance measures. For the label "Age > 1 year", the Table below shows the most important factors in a specific random forest.
| Factor | Gini importance | Factor | Permutation importance |
|---|---|---|---|
| CU | 0.2244810 | CU | 0.029167 |
| FE | 0.09556239 | WATER | 0.028125 |
| VIS40 | 0.07206340 | BRSTVD | 0.025000 |
| ISO4406 L | 0.06953566 | P | 0.023958 |
| BRSTVD | 0.06820062 | VIS99 | 0.019792 |
| WATER | 0.06671420 | FE | 0.019792 |
The Gini importance $E_{RF,CU}$ of the copper factor (CU) appears to be the largest, meaning that copper yields the largest mean decrease in impurity. As the copper factor takes more diverse values than some other factors, the Gini importance $E_{RF,CU}$ may overstate its importance relative to the decision maker's intuition.
The permutation importance $E_{RF,CU}$ of the copper factor (CU) also appears to be the largest, meaning that reshuffling the copper values yields the largest decrease in accuracy. As the factors in the oil samples are mutually dependent, other factors may compensate for the randomising of the copper factor. Then, the permutation importance of the copper factor may become very small, even if copper is actually a cause of the label.
In the absence of a genuine representation of importance, usually several imperfect indicators of importance will be compared. Awareness of their imperfections assists a decision maker in understanding the predictions of a random forest.