Adapted from Chapter 8 of An Introduction to Statistical Learning
Why are we learning about decision trees?
Students will be able to:
Major League Baseball player data from 1986-87:
Group exercise:
Rules for segmenting:
Above are the regions created by a computer:
Note: Years and Hits are both integers, but the convention is to use the midpoint between adjacent values to label a split.
These regions are used to make predictions on out-of-sample data. Thus, there are only three possible predictions! (Is this different from how linear regression makes predictions?)
Below is the equivalent regression tree:
The first split is Years < 4.5, thus that split goes at the top of the tree. When a splitting rule is True, you follow the left branch. When a splitting rule is False, you follow the right branch.
For players in the left branch, the mean Salary is $166,000, thus you label it with that value. (Salary has been divided by 1000 and log-transformed to 5.11.)
For players in the right branch, there is a further split on Hits < 117.5, dividing players into two more Salary regions: $403,000 (transformed to 6.00), and $846,000 (transformed to 6.74).
What does this tree tell you about your data?
Question: What do you like and dislike about decision trees so far?
Your training data is a tiny dataset of used vehicle sale prices. Your goal is to predict price for testing data.
group_by
).Ideal approach: Consider every possible partition of the feature space (computationally infeasible)
"Good enough" approach: recursive binary splitting
# vehicle data
import pandas as pd
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/vehicles_train.csv'
train = pd.read_csv(url)
# before splitting anything, just predict the mean of the entire dataset
train['prediction'] = train.price.mean()
train
price | year | miles | doors | vtype | prediction | |
---|---|---|---|---|---|---|
0 | 22000 | 2012 | 13000 | 2 | car | 6571.428571 |
1 | 14000 | 2010 | 30000 | 2 | car | 6571.428571 |
2 | 13000 | 2010 | 73500 | 4 | car | 6571.428571 |
3 | 9500 | 2009 | 78000 | 4 | car | 6571.428571 |
4 | 9000 | 2007 | 47000 | 4 | car | 6571.428571 |
5 | 4000 | 2006 | 124000 | 2 | car | 6571.428571 |
6 | 3000 | 2004 | 177000 | 4 | car | 6571.428571 |
7 | 2000 | 2004 | 209000 | 4 | truck | 6571.428571 |
8 | 3000 | 2003 | 138000 | 2 | car | 6571.428571 |
9 | 1900 | 2003 | 160000 | 4 | car | 6571.428571 |
10 | 2500 | 2003 | 190000 | 2 | truck | 6571.428571 |
11 | 5000 | 2001 | 62000 | 4 | car | 6571.428571 |
12 | 1800 | 1999 | 163000 | 2 | truck | 6571.428571 |
13 | 1300 | 1997 | 138000 | 4 | car | 6571.428571 |
# calculate RMSE for those predictions
from sklearn import metrics
import numpy as np
np.sqrt(metrics.mean_squared_error(train.price, train.prediction))
5936.9819859959835
# define a function that calculates the RMSE for a given split of miles
def mileage_split(miles):
lower_mileage_price = train[train.miles < miles].price.mean()
higher_mileage_price = train[train.miles >= miles].price.mean()
train['prediction'] = np.where(train.miles < miles, lower_mileage_price, higher_mileage_price)
return np.sqrt(metrics.mean_squared_error(train.price, train.prediction))
# calculate RMSE for tree which splits on miles < 50000
print 'RMSE:', mileage_split(50000)
train
RMSE: 3984.09174254
price | year | miles | doors | vtype | prediction | |
---|---|---|---|---|---|---|
0 | 22000 | 2012 | 13000 | 2 | car | 15000.000000 |
1 | 14000 | 2010 | 30000 | 2 | car | 15000.000000 |
2 | 13000 | 2010 | 73500 | 4 | car | 4272.727273 |
3 | 9500 | 2009 | 78000 | 4 | car | 4272.727273 |
4 | 9000 | 2007 | 47000 | 4 | car | 15000.000000 |
5 | 4000 | 2006 | 124000 | 2 | car | 4272.727273 |
6 | 3000 | 2004 | 177000 | 4 | car | 4272.727273 |
7 | 2000 | 2004 | 209000 | 4 | truck | 4272.727273 |
8 | 3000 | 2003 | 138000 | 2 | car | 4272.727273 |
9 | 1900 | 2003 | 160000 | 4 | car | 4272.727273 |
10 | 2500 | 2003 | 190000 | 2 | truck | 4272.727273 |
11 | 5000 | 2001 | 62000 | 4 | car | 4272.727273 |
12 | 1800 | 1999 | 163000 | 2 | truck | 4272.727273 |
13 | 1300 | 1997 | 138000 | 4 | car | 4272.727273 |
# calculate RMSE for tree which splits on miles < 100000
print 'RMSE:', mileage_split(100000)
train
RMSE: 3530.14653008
price | year | miles | doors | vtype | prediction | |
---|---|---|---|---|---|---|
0 | 22000 | 2012 | 13000 | 2 | car | 12083.333333 |
1 | 14000 | 2010 | 30000 | 2 | car | 12083.333333 |
2 | 13000 | 2010 | 73500 | 4 | car | 12083.333333 |
3 | 9500 | 2009 | 78000 | 4 | car | 12083.333333 |
4 | 9000 | 2007 | 47000 | 4 | car | 12083.333333 |
5 | 4000 | 2006 | 124000 | 2 | car | 2437.500000 |
6 | 3000 | 2004 | 177000 | 4 | car | 2437.500000 |
7 | 2000 | 2004 | 209000 | 4 | truck | 2437.500000 |
8 | 3000 | 2003 | 138000 | 2 | car | 2437.500000 |
9 | 1900 | 2003 | 160000 | 4 | car | 2437.500000 |
10 | 2500 | 2003 | 190000 | 2 | truck | 2437.500000 |
11 | 5000 | 2001 | 62000 | 4 | car | 12083.333333 |
12 | 1800 | 1999 | 163000 | 2 | truck | 2437.500000 |
13 | 1300 | 1997 | 138000 | 4 | car | 2437.500000 |
# check all possible mileage splits
mileage_range = range(train.miles.min(), train.miles.max(), 1000)
RMSE = [mileage_split(miles) for miles in mileage_range]
# allow plots to appear in the notebook
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (6, 4)
plt.rcParams['font.size'] = 14
# plot mileage cutpoint (x-axis) versus RMSE (y-axis)
plt.plot(mileage_range, RMSE)
plt.xlabel('Mileage cutpoint')
plt.ylabel('RMSE (lower is better)')
<matplotlib.text.Text at 0x172f5b70>
Recap: Before every split, this process is repeated for every feature, and the feature and cutpoint that produces the lowest MSE is chosen.
# encode car as 0 and truck as 1
train['vtype'] = train.vtype.map({'car':0, 'truck':1})
# define X and y
feature_cols = ['year', 'miles', 'doors', 'vtype']
X = train[feature_cols]
y = train.price
# instantiate a DecisionTreeRegressor (with random_state=1)
from sklearn.tree import DecisionTreeRegressor
treereg = DecisionTreeRegressor(random_state=1)
treereg
DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None, max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, random_state=1, splitter='best')
# use leave-one-out cross-validation (LOOCV) to estimate the RMSE for this model
from sklearn.cross_validation import cross_val_score
scores = cross_val_score(treereg, X, y, cv=14, scoring='mean_squared_error')
np.mean(np.sqrt(-scores))
2928.5714285714284
The training error continues to go down as the tree size increases (due to overfitting), but the lowest cross-validation error occurs for a tree with 3 leaves.
Let's try to reduce the RMSE by tuning the max_depth parameter:
# try different values one-by-one
treereg = DecisionTreeRegressor(max_depth=1, random_state=1)
scores = cross_val_score(treereg, X, y, cv=14, scoring='mean_squared_error')
np.mean(np.sqrt(-scores))
3757.936507936508
Or, we could write a loop to try a range of values:
# list of values to try
max_depth_range = range(1, 8)
# list to store the average RMSE for each value of max_depth
RMSE_scores = []
# use LOOCV with each value of max_depth
for depth in max_depth_range:
treereg = DecisionTreeRegressor(max_depth=depth, random_state=1)
MSE_scores = cross_val_score(treereg, X, y, cv=14, scoring='mean_squared_error')
RMSE_scores.append(np.mean(np.sqrt(-MSE_scores)))
# plot max_depth (x-axis) versus RMSE (y-axis)
plt.plot(max_depth_range, RMSE_scores)
plt.xlabel('max_depth')
plt.ylabel('RMSE (lower is better)')
<matplotlib.text.Text at 0x17514c88>
# max_depth=3 was best, so fit a tree using that parameter
treereg = DecisionTreeRegressor(max_depth=3, random_state=1)
treereg.fit(X, y)
DecisionTreeRegressor(criterion='mse', max_depth=3, max_features=None, max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, random_state=1, splitter='best')
# "Gini importance" of each feature: the (normalized) total reduction of error brought by that feature
pd.DataFrame({'feature':feature_cols, 'importance':treereg.feature_importances_})
feature | importance | |
---|---|---|
0 | year | 0.798744 |
1 | miles | 0.201256 |
2 | doors | 0.000000 |
3 | vtype | 0.000000 |
# create a Graphviz file
from sklearn.tree import export_graphviz
export_graphviz(treereg, out_file='tree_vehicles.dot', feature_names=feature_cols)
# At the command line, run this to convert to PNG:
# dot -Tpng tree_vehicles.dot -o tree_vehicles.png
Reading the internal nodes:
Reading the leaves:
# read the testing data
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/vehicles_test.csv'
test = pd.read_csv(url)
test['vtype'] = test.vtype.map({'car':0, 'truck':1})
test
price | year | miles | doors | vtype | |
---|---|---|---|---|---|
0 | 3000 | 2003 | 130000 | 4 | 1 |
1 | 6000 | 2005 | 82500 | 4 | 0 |
2 | 12000 | 2010 | 60000 | 2 | 0 |
Question: Using the tree diagram above, what predictions will the model make for each observation?
# use fitted model to make predictions on testing data
X_test = test[feature_cols]
y_test = test.price
y_pred = treereg.predict(X_test)
y_pred
array([ 4000., 5000., 13500.])
# calculate RMSE
np.sqrt(metrics.mean_squared_error(y_test, y_pred))
1190.2380714238084
# calculate RMSE for your own tree!
y_test = [3000, 6000, 12000]
y_pred = [0, 0, 0]
from sklearn import metrics
np.sqrt(metrics.mean_squared_error(y_test, y_pred))
7937.2539331937714
Example: Predict whether Barack Obama or Hillary Clinton will win the Democratic primary in a particular county in 2008:
Questions:
regression trees | classification trees |
---|---|
predict a continuous response | predict a categorical response |
predict using mean response of each leaf | predict using most commonly occuring class of each leaf |
splits are chosen to minimize MSE | splits are chosen to minimize Gini index (discussed below) |
Common options for the splitting criteria:
Pretend we are predicting whether someone buys an iPhone or an Android:
Our goal in making splits is to reduce the classification error rate. Let's try splitting on gender:
Compare that with a split on age:
The decision tree algorithm will try every possible split across all features, and choose the split that reduces the error rate the most.
Calculate the Gini index before making a split:
$$1 - \left(\frac {iPhone} {Total}\right)^2 - \left(\frac {Android} {Total}\right)^2 = 1 - \left(\frac {10} {25}\right)^2 - \left(\frac {15} {25}\right)^2 = 0.48$$Evaluating the split on gender using Gini index:
$$\text{Males: } 1 - \left(\frac {2} {14}\right)^2 - \left(\frac {12} {14}\right)^2 = 0.24$$$$\text{Females: } 1 - \left(\frac {8} {11}\right)^2 - \left(\frac {3} {11}\right)^2 = 0.40$$$$\text{Weighted Average: } 0.24 \left(\frac {14} {25}\right) + 0.40 \left(\frac {11} {25}\right) = 0.31$$Evaluating the split on age using Gini index:
$$\text{30 or younger: } 1 - \left(\frac {4} {12}\right)^2 - \left(\frac {8} {12}\right)^2 = 0.44$$$$\text{31 or older: } 1 - \left(\frac {6} {13}\right)^2 - \left(\frac {7} {13}\right)^2 = 0.50$$$$\text{Weighted Average: } 0.44 \left(\frac {12} {25}\right) + 0.50 \left(\frac {13} {25}\right) = 0.47$$Again, the decision tree algorithm will try every possible split, and will choose the split that reduces the Gini index (and thus increases the "node purity") the most.
Note: There is another common splitting criteria called cross-entropy. It's numerically similar to Gini index, but slower to compute, thus it's not as popular as Gini index.
We'll build a classification tree using the Titanic data:
# read in the data
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/titanic.csv'
titanic = pd.read_csv(url)
# encode female as 0 and male as 1
titanic['Sex'] = titanic.Sex.map({'female':0, 'male':1})
# fill in the missing values for age with the median age
titanic.Age.fillna(titanic.Age.median(), inplace=True)
# create a DataFrame of dummy variables for Embarked
embarked_dummies = pd.get_dummies(titanic.Embarked, prefix='Embarked')
embarked_dummies.drop(embarked_dummies.columns[0], axis=1, inplace=True)
# concatenate the original DataFrame and the dummy DataFrame
titanic = pd.concat([titanic, embarked_dummies], axis=1)
# print the updated DataFrame
titanic.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Embarked_Q | Embarked_S | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 1 | 22 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 0 | 1 |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0 | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 0 | 0 |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 0 | 26 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 0 | 1 |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 35 | 1 | 0 | 113803 | 53.1000 | C123 | S | 0 | 1 |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | 1 | 35 | 0 | 0 | 373450 | 8.0500 | NaN | S | 0 | 1 |
# define X and y
feature_cols = ['Pclass', 'Sex', 'Age', 'Embarked_Q', 'Embarked_S']
X = titanic[feature_cols]
y = titanic.Survived
# fit a classification tree with max_depth=3 on all data
from sklearn.tree import DecisionTreeClassifier
treeclf = DecisionTreeClassifier(max_depth=3, random_state=1)
treeclf.fit(X, y)
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3, max_features=None, max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, random_state=1, splitter='best')
# create a Graphviz file
export_graphviz(treeclf, out_file='tree_titanic.dot', feature_names=feature_cols)
# At the command line, run this to convert to PNG:
# dot -Tpng tree_titanic.dot -o tree_titanic.png
Notice the split in the bottom right: the same class is predicted in both of its leaves. That split didn't affect the classification error rate, though it did increase the node purity, which is important because it increases the accuracy of our predicted probabilities.
# compute the feature importances
pd.DataFrame({'feature':feature_cols, 'importance':treeclf.feature_importances_})
feature | importance | |
---|---|---|
0 | Pclass | 0.242664 |
1 | Sex | 0.655584 |
2 | Age | 0.064494 |
3 | Embarked_Q | 0.000000 |
4 | Embarked_S | 0.037258 |
Advantages of decision trees:
Disadvantages of decision trees: