- Each tree node tests one attribute from the input vector $X$
- Each tree branch selects one value (or range) for the given attribute
- Each leaf node predicts an output, $Y$
Decision tree models can be applied to both regression and classification problems; the focus here, however, will be on classification trees. Note also that decision trees can represent any Boolean function. Such a tree has $2^N$ leaf nodes, where $N$ is the number of Boolean variables, and there are $2^{2^N}$ distinct Boolean functions of $N$ variables. For other decision trees (i.e. for non-Boolean variables) the optimal number of leaves is unknown, and the same input attribute may be tested at multiple nodes.
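For example, with $N = 2$ Boolean variables a full tree has $2^2 = 4$ leaves (one per truth-table row), and labelling each leaf 0 or 1 independently yields $2^{2^2} = 16$ distinct Boolean functions. A minimal sketch of that count (separate from the notebook's code):

import itertools

N = 2
rows = list(itertools.product([0, 1], repeat=N))               # the 2^N truth-table rows (leaf regions)
labelings = list(itertools.product([0, 1], repeat=len(rows)))  # each 0/1 labeling of the leaves is one Boolean function
print(len(rows))       # 4
print(len(labelings))  # 16 = 2^(2^2)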
The algorithms presented here are the so-called CART (classification and regression trees) methods. Another popular method is C5.0 (formerly ID3), presented by Quinlan. The resulting tree models will be binary decision trees (i.e. every node other than a leaf has exactly two outgoing edges).
$k(m) = \arg\max_k p^e_{mk} = \arg\max_k \frac{1}{N_m} \sum_{x_i \in R_m} I(y_i = k)$
where $I$ returns 1 if $y_i=k$ and 0 otherwise, $p^e_{mk}$ is the estimate of the probability of an observation in region $R_m$ belonging to class $k$ given the training data, and
$N_m = \#\\{x_i \in R_m\\}$
is the number of inputs from the training set in region $R_m$. NOTE: I would prefer to use $\hat{p}$ instead of $p^e$ but there appears to be a bug in IPython Notebook.
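As a concrete illustration (a minimal sketch, separate from the notebook's decision_trees module; the labels below are made up), these estimates can be computed directly from the training labels that fall in a region:

import numpy as np

y_m = np.array([0, 0, 1, 0, 1, 0])   # hypothetical labels of the x_i in region R_m
classes = [0, 1]

N_m = len(y_m)                                           # N_m = #{x_i in R_m}
p_e = [np.sum(y_m == k) / float(N_m) for k in classes]   # p^e_{mk} for each class k
k_m = classes[int(np.argmax(p_e))]                       # k(m) = argmax_k p^e_{mk}
print(p_e)   # [0.666..., 0.333...]
print(k_m)   # 0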
Misclassification error: $\frac{1}{N_m} \sum_{i\in R_m} I \left( y_i \ne k(m) \right) = 1 - p^e_{mk(m)}$
Gini Index: $\sum_{k \ne k'} p^e_{mk} p^e_{mk'} = \sum_{k=1}^{K} p^e_{mk} \left(1 - p^e_{mk} \right)$
Cross-entropy: $- \sum_{k=1}^K p^e_{mk} \log p^e_{mk}$
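A minimal sketch of all three node impurity measures, taking p as the vector of class proportions $p^e_{mk}$ for a single node (with the convention $0 \log 0 = 0$; the function names are invented for this sketch):

import numpy as np

def misclassification(p):
    # 1 - p^e_{m k(m)}: one minus the largest class proportion
    return 1.0 - float(np.max(p))

def gini(p):
    # sum_k p^e_{mk} (1 - p^e_{mk})
    p = np.asarray(p, dtype=float)
    return float(np.sum(p * (1.0 - p)))

def cross_entropy(p):
    # -sum_k p^e_{mk} log p^e_{mk}, taking 0 log 0 = 0
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

p = [2.0 / 3.0, 1.0 / 3.0]    # proportions from the example above
print(misclassification(p))   # 0.333...
print(gini(p))                # 0.444...
print(cross_entropy(p))       # 0.636...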
Given a choice of $Q_m$, it is a straightforward matter of iterating over each feature, $j$, and split value $s$ to determine the optimal choice. Note that even if a feature is continuous, implying an infinite space of possible split values, the training data set, from which $s$ must be selected, is finite.
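A hedged sketch of that exhaustive search, scoring each candidate split by the weighted Gini index of the two child regions (best_split and gini_of are names invented for this sketch, not part of decision_trees; X is a 2-D numpy array, y an integer numpy array):

import numpy as np

def gini_of(labels):
    # Gini index computed from raw integer class labels
    p = np.bincount(labels) / float(len(labels))
    return float(np.sum(p * (1.0 - p)))

def best_split(X, y):
    # exhaustively try every feature j and every split value s seen in the data
    best_j, best_s, best_q = None, None, float('inf')
    n = float(len(y))
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            if len(left) == 0 or len(right) == 0:
                continue
            q = (len(left) * gini_of(left) + len(right) * gini_of(right)) / n
            if q < best_q:
                best_j, best_s, best_q = j, s, q
    return best_j, best_s, best_q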
The stopping criterion is generally chosen as some number, $D$, of training values associated with the leaf nodes. That is, if any region $R_m$ contains more than $D$ inputs from the training data, continue the splitting process.
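Reusing best_split from the sketch above, the recursive growing loop with that stopping rule might look like the following (grow and the dict node layout are assumptions of this sketch, not the notebook's dt.build_tree):

import numpy as np

def grow(X, y, D=5):
    majority = int(np.argmax(np.bincount(y)))
    # stop once the region holds no more than D training points, or is pure
    if len(y) <= D or len(set(y)) == 1:
        return {'leaf': True, 'klass': majority}
    j, s, _ = best_split(X, y)
    if j is None:   # no split separates the data any further
        return {'leaf': True, 'klass': majority}
    mask = X[:, j] <= s
    return {'leaf': False, 'klass': majority, 'feature': j, 'split': s,
            'left': grow(X[mask], y[mask], D),
            'right': grow(X[~mask], y[~mask], D)}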
TODO
$C_{\lambda}(T) = \sum_{m=1}^{|T|} N_m Q_m(T) + \lambda |T|$
where $T \subseteq T_0$ is any subtree obtained by pruning the full tree $T_0$, $|T|$ is the number of leaf nodes in $T$, and $\lambda$ is a tuning parameter that governs the tradeoff between goodness of fit and tree size.
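The cost is just a sum over the leaves of the subtree; a minimal sketch, assuming each leaf is summarized by its pair $(N_m, Q_m)$ (the representation is hypothetical):

def cost_complexity(leaves, lam):
    # leaves: list of (N_m, Q_m) pairs, one per leaf node of T
    return sum(n_m * q_m for n_m, q_m in leaves) + lam * len(leaves)

# hypothetical three-leaf subtree
print(cost_complexity([(10, 0.18), (7, 0.0), (12, 0.32)], lam=0.5))   # ~7.14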
Reduced-Error Pruning

1. Split data into training and validation sets
2. Create tree that classifies the training data

Prune until further pruning is harmful:

1. Evaluate impact on validation data of pruning each possible node
2. Greedily remove the node that most improves the validation accuracy
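A hedged sketch of that greedy loop, against the dict-shaped tree from the grow sketch above (decide_sketch, internal_nodes, and accuracy are invented names; dt.decide would play the same role for the notebook's trees):

import numpy as np

def decide_sketch(node, x):
    # follow splits down to a leaf
    while not node['leaf']:
        node = node['left'] if x[node['feature']] <= node['split'] else node['right']
    return node['klass']

def accuracy(tree, X, y):
    return np.mean([decide_sketch(tree, xi) == yi for xi, yi in zip(X, y)])

def internal_nodes(node):
    if node['leaf']:
        return []
    return [node] + internal_nodes(node['left']) + internal_nodes(node['right'])

def reduced_error_prune(tree, X_val, y_val):
    while True:
        base = accuracy(tree, X_val, y_val)
        best_gain, best_node = -1.0, None
        for node in internal_nodes(tree):
            node['leaf'] = True     # tentatively collapse to its majority-class leaf
            gain = accuracy(tree, X_val, y_val) - base
            node['leaf'] = False    # undo
            if gain > best_gain:
                best_gain, best_node = gain, node
        if best_node is None or best_gain < 0:
            return tree             # further pruning would be harmful
        best_node['leaf'] = True    # commit the most helpful prune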
Rule Post-Pruning

1. Convert tree to equivalent set of rules
2. Prune each rule independently of the others
3. Sort final rules into desired sequence for use
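Step 1 can be done by walking every root-to-leaf path; a minimal sketch against the same dict-shaped tree (steps 2 and 3 would then drop rule preconditions that do not hurt validation accuracy and order the survivors):

def tree_to_rules(node, conds=()):
    # each root-to-leaf path becomes one IF <conds> THEN <class> rule
    if node['leaf']:
        return [(conds, node['klass'])]
    c_left = conds + ('x[%d] <= %s' % (node['feature'], node['split']),)
    c_right = conds + ('x[%d] > %s' % (node['feature'], node['split']),)
    return tree_to_rules(node['left'], c_left) + tree_to_rules(node['right'], c_right)

Below, the notebook's actual decision_trees module is exercised, first on a small synthetic grid and then on the SAheart data.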
import sys
sys.path.append('pysrc')
import decision_trees as dt
import networkx as nx
import numpy as np

# build a 4x4 grid of (x, y) points as the input vectors
inputs = np.zeros((16, 2))
outputs = []
row = 0
for x in range(4):
    for y in range(4):
        inputs[row][0] = x
        inputs[row][1] = y
        row += 1

# label each point: 1 in the lower-right (x > 1, y < 2) and upper-left
# (x < 2, y > 1) quadrants, 0 otherwise
for row in inputs:
    if (row[0] > 1 and row[1] < 2) or (row[0] < 2 and row[1] > 1):
        outputs.append(1)
    else:
        outputs.append(0)

clazz = [0, 1]
meta = ['x', 'y']
tree = dt.build_tree(inputs, outputs, clazz, meta)
dt.draw_tree(tree)
# assemble an indexed table of the inputs and outputs for inspection
data = np.zeros((16, 4))
for r in range(16):
    data[r][0] = r
    data[r][1] = inputs[r][0]
    data[r][2] = inputs[r][1]
    data[r][3] = outputs[r]
#print data
import sys
sys.path.append('pysrc')
import decision_trees as dt
import numpy as np
import networkx as nx
from matplotlib import pyplot as plt

# read the South African heart disease data; the last column is the 0/1 response
p = 'data/SAheart.data'
f = open(p, 'r')
all_lines = f.readlines()
f.close()

# hold out the last 25% of the rows for testing
train_cnt = int(0.75 * len(all_lines))
lines = all_lines[0:train_cnt]
col_cnt = len(lines[0].split(','))
row_cnt = len(lines)
outputs = [int(line.split(',')[col_cnt - 1][0]) for line in lines]
inputs = np.zeros((row_cnt, col_cnt - 1))
for row in range(row_cnt):
    line = lines[row].split(',')[0:col_cnt - 1]
    inputs[row] = [float(v) for v in line]
clazz = [0, 1]
tree = dt.build_tree(inputs, outputs, clazz,
                     meta=['sbp', 'tob', 'ldl', 'adip', 'famhist', 'typea', 'obes', 'alc', 'age'],
                     max_rm=5)
#dt.draw_tree(tree)
#compare the tree prediction to the training values; since we haven't pruned, this should be very accurate
diff = []
for row in range(train_cnt):
    p = dt.decide(tree, inputs[row])
    if p == outputs[row]:
        diff.append(0)
    else:
        diff.append(1)
misses = sum(diff)
print "In the training data, there were {0} misclassifications for {1} inputs, a rate of {2}%".format(misses, train_cnt, 100 * misses / float(train_cnt))
x = range(train_cnt)
f, axarr = plt.subplots(2, 1)
f.subplots_adjust(right=1.5)
f.subplots_adjust(top=1.5)

# plot the training comparison (0 = correct, 1 = misclassified)
ax1 = axarr[0]
ax1.scatter(x, diff)
#compare the tree prediction to actual values not used in the training set
test_lines = [line for line in all_lines[train_cnt:] if line.strip()]
actual_out = [int(line.split(',')[col_cnt - 1][0]) for line in test_lines]
row_cnt = len(test_lines)
test_in = np.zeros((row_cnt, col_cnt - 1))
for row in range(row_cnt):
    line = test_lines[row].split(',')[0:col_cnt - 1]
    test_in[row] = [float(v) for v in line]
diff = []
for row in range(len(test_in)):
    p = dt.decide(tree, test_in[row])
    if p == actual_out[row]:
        diff.append(0)
    else:
        diff.append(1)
misses = sum(diff)
print "In the hold out data, there were {0} misclassifications for {1} inputs, a rate of {2}%".format(misses, len(test_in), 100 * misses / float(len(test_in)))
x = range(len(diff))

# plot the hold-out comparison
ax2 = axarr[1]
ax2.scatter(x, diff)
In the training data, there were 95 misclassifications for 345 inputs, a rate of 27.5362318841%
In the hold out data, there were 27 misclassifications for 114 inputs, a rate of 23.6842105263%