Adapted from Chapter 8 of An Introduction to Statistical Learning
 | continuous | categorical
---|---|---
supervised | regression | classification |
unsupervised | dimension reduction | clustering |
Let's look at a simple example to motivate our learning.
Our goal is to predict a baseball player's Salary based on Years (number of years playing in the major leagues) and Hits (number of hits he made in the previous year). Here is the training data, represented visually (low salary is blue/green, high salary is red/yellow):
How might you "stratify" or "segment" the feature space into regions, based on salary? Intuitively, you want to maximize the similarity (or "homogeneity") within a given region, and minimize the similarity between different regions.
Below is a regression tree that has been fit to the data by a computer. (We will talk later about how the fitting algorithm actually works.) Note that Salary is measured in thousands and has been log-transformed.
How do we make Salary predictions (for out-of-sample data) using a decision tree?
Example predictions:
How did we come up with the numbers at the bottom of the tree? Each number is just the mean Salary (in the training data) of the players who meet those criteria. Here's the same diagram as before, split into the three regions:
This diagram is essentially a combination of the two previous diagrams (except that the observations are no longer color-coded). In $R_1$, the mean log Salary was 5.11. In $R_2$, the mean log Salary was 6.00. In $R_3$, the mean log Salary was 6.74. Thus, those values are used to predict out-of-sample data.
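To make the leaf values concrete, here's a minimal sketch (not code from ISL) of how they could be computed by hand. It assumes a hypothetical DataFrame named `hitters` with columns `Years`, `Hits`, and `Salary`, and uses the cutpoints from the ISL example (Years < 4.5, Hits < 117.5):

```python
import numpy as np

def region_means(hitters):
    """Mean log(Salary) of the training players in each of the three regions."""
    log_salary = np.log(hitters['Salary'])
    in_r1 = hitters['Years'] < 4.5                                   # R1: short careers
    in_r2 = (hitters['Years'] >= 4.5) & (hitters['Hits'] < 117.5)    # R2: long careers, fewer hits
    in_r3 = (hitters['Years'] >= 4.5) & (hitters['Hits'] >= 117.5)   # R3: long careers, more hits
    # each leaf predicts the mean log Salary of the training players in its region
    return {'R1': log_salary[in_r1].mean(),
            'R2': log_salary[in_r2].mean(),
            'R3': log_salary[in_r3].mean()}
```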
Let's introduce some terminology:
How might you interpret the "meaning" of this tree?
What we have seen so far hints at the advantages and disadvantages of decision trees:
Advantages:
Disadvantages:
How do you build a decision tree? You're going to find out by building one in pairs!
Your training data is a tiny dataset of used vehicle sale prices. Your goal is to predict Price for out-of-sample data. Here are your instructions:
The ideal approach would be for the computer to consider every possible partition of the feature space. However, this is computationally infeasible, so an approach called recursive binary splitting is used instead:
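To give a feel for what a single step of recursive binary splitting looks like, here's a rough sketch (not scikit-learn's actual implementation, and the function name is made up): for every feature and every candidate cutpoint, compute the total squared error of the two resulting regions, and keep the split that minimizes it.

```python
import numpy as np

def best_split(X, y):
    """Find the single best binary split of (X, y), where X is a 2D array of
    features and y is a 1D array of responses."""
    best = None
    for j in range(X.shape[1]):                      # consider every feature...
        for cutpoint in np.unique(X[:, j]):          # ...and every candidate cutpoint
            left, right = y[X[:, j] < cutpoint], y[X[:, j] >= cutpoint]
            if len(left) == 0 or len(right) == 0:
                continue
            # total squared error if each region predicts its own mean
            sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if best is None or sse < best[0]:
                best = (sse, j, cutpoint)
    return best  # (total squared error, feature index, cutpoint)
```

The same search is then repeated on the observations in each of the two resulting regions, which is what makes the procedure both recursive and "greedy": it chooses the best split at each step without looking ahead.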
How does it know when to stop?
Method 2 involves setting a tuning parameter that penalizes the tree for having too many leaves. As the parameter is increased, branches automatically get pruned from the tree, resulting in smaller and smaller trees. The tuning parameter can be selected through cross-validation.
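For reference, the quantity that Method 2 minimizes (as presented in ISL) for a given value of the tuning parameter $\alpha$ is:

$$\sum_{m=1}^{|T|} \sum_{i:\, x_i \in R_m} (y_i - \hat{y}_{R_m})^2 + \alpha |T|$$

where $|T|$ is the number of leaves, $R_m$ is the region corresponding to the $m$th leaf, and $\hat{y}_{R_m}$ is the mean response of the training observations in that region. When $\alpha = 0$ this is just the training error of the full tree; as $\alpha$ increases, trees with many leaves pay a larger penalty, so smaller subtrees win.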
Note: Method 2 is not currently supported by scikit-learn, and so we will use Method 1 instead.
Here's an example of an unpruned tree, and a comparison of the training, test, and cross-validation errors for trees with different numbers of leaves:
As you can see, the training error continues to go down as the tree size increases, but the lowest cross-validation error occurs for a tree with 3 leaves.
# import pandas
import pandas as pd
# read in vehicle data
vehicles = pd.read_csv('https://raw.githubusercontent.com/justmarkham/DAT4/master/data/used_vehicles.csv')
# print out data
vehicles
 | price | year | miles | doors | type
---|---|---|---|---|---
0 | 22000 | 2012 | 13000 | 2 | car |
1 | 14000 | 2010 | 30000 | 2 | car |
2 | 13000 | 2010 | 73500 | 4 | car |
3 | 9500 | 2009 | 78000 | 4 | car |
4 | 9000 | 2007 | 47000 | 4 | car |
5 | 4000 | 2006 | 124000 | 2 | car |
6 | 3000 | 2004 | 177000 | 4 | car |
7 | 2000 | 2004 | 209000 | 4 | truck |
8 | 3000 | 2003 | 138000 | 2 | car |
9 | 1900 | 2003 | 160000 | 4 | car |
10 | 2500 | 2003 | 190000 | 2 | truck |
11 | 5000 | 2001 | 62000 | 4 | car |
12 | 1800 | 1999 | 163000 | 2 | truck |
13 | 1300 | 1997 | 138000 | 4 | car |
# convert car to 0 and truck to 1
vehicles['type'] = vehicles.type.map({'car':0, 'truck':1})
# select feature columns (every column except for the 0th column)
feature_cols = vehicles.columns[1:]
# define X (features) and y (response)
X = vehicles[feature_cols]
y = vehicles.price
# split into train/test
# note: in newer versions of scikit-learn, train_test_split lives in sklearn.model_selection
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
# print out each of the arrays
print(X_train)
print(y_train)
print(X_test)
print(y_test)
[[ 2003 190000 2 1]
 [ 2007 47000 4 0]
 [ 2010 30000 2 0]
 [ 1999 163000 2 1]
 [ 2012 13000 2 0]
 [ 1997 138000 4 0]
 [ 2003 160000 4 0]
 [ 2003 138000 2 0]
 [ 2001 62000 4 0]
 [ 2006 124000 2 0]]
[ 2500 9000 14000 1800 22000 1300 1900 3000 5000 4000]
[[ 2009 78000 4 0]
 [ 2004 209000 4 1]
 [ 2004 177000 4 0]
 [ 2010 73500 4 0]]
[ 9500 2000 3000 13000]
# import class, instantiate estimator, fit with training set
from sklearn.tree import DecisionTreeRegressor
treereg = DecisionTreeRegressor(random_state=1)
treereg.fit(X_train, y_train)
DecisionTreeRegressor(compute_importances=None, criterion='mse', max_depth=None, max_features=None, max_leaf_nodes=None, min_density=None, min_samples_leaf=1, min_samples_split=2, random_state=1, splitter='best')
# make predictions
preds = treereg.predict(X_test)
# print predictions and actual values
print(preds)
print(y_test)
[ 5000. 1900. 1900. 5000.]
[ 9500 2000 3000 13000]
# print RMSE
from sklearn import metrics
import numpy as np
np.sqrt(metrics.mean_squared_error(y_test, preds))
4622.4993239588475
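For reference, the RMSE being computed here is

$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$$

the square root of the average squared difference between the actual and predicted values, which puts the error back in the same units as the response (dollars, in this case).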
# use cross-validation to find best max_depth
# note: in newer versions of scikit-learn, cross_val_score lives in sklearn.model_selection
# and the scoring string is 'neg_mean_squared_error'
from sklearn.cross_validation import cross_val_score
# try max_depth=2
treereg = DecisionTreeRegressor(max_depth=2, random_state=1)
scores = cross_val_score(treereg, X, y, cv=3, scoring='mean_squared_error')
# the scorer returns negative MSE values, so negate them before taking the square root
np.mean(np.sqrt(-scores))
4804.3767888427128
# try max_depth=3
treereg = DecisionTreeRegressor(max_depth=3, random_state=1)
scores = cross_val_score(treereg, X, y, cv=3, scoring='mean_squared_error')
np.mean(np.sqrt(-scores))
4592.1554255755254
# try max_depth=4
treereg = DecisionTreeRegressor(max_depth=4, random_state=1)
scores = cross_val_score(treereg, X, y, cv=3, scoring='mean_squared_error')
np.mean(np.sqrt(-scores))
4704.0052694797387
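Rather than copying the cell for each candidate value, you could also loop over a range of depths; here's a small sketch using the objects already defined above (the range of 1 through 7 is arbitrary):

```python
# try a range of max_depth values and print the cross-validated RMSE for each
for depth in range(1, 8):
    treereg = DecisionTreeRegressor(max_depth=depth, random_state=1)
    scores = cross_val_score(treereg, X, y, cv=3, scoring='mean_squared_error')
    print(depth, np.mean(np.sqrt(-scores)))
```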
# max_depth=3 was best, so fit a tree using that parameter with ALL DATA
treereg = DecisionTreeRegressor(max_depth=3, random_state=1)
treereg.fit(X, y)
DecisionTreeRegressor(compute_importances=None, criterion='mse', max_depth=3, max_features=None, max_leaf_nodes=None, min_density=None, min_samples_leaf=1, min_samples_split=2, random_state=1, splitter='best')
# compute the "Gini importance" of each feature: the (normalized) total reduction of MSE brought by that feature
pd.DataFrame({'feature':feature_cols, 'importance':treereg.feature_importances_})
 | feature | importance
---|---|---
0 | year | 0.798744 |
1 | miles | 0.201256 |
2 | doors | 0.000000 |
3 | type | 0.000000 |
# create a Graphviz file
from sklearn.tree import export_graphviz
with open("15_vehicles.dot", 'wb') as f:
f = export_graphviz(treereg, out_file=f, feature_names=feature_cols)
# at the command line, run this to convert to PNG:
# dot -Tpng 15_vehicles.dot -o 15_vehicles.png
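As an alternative (assuming you have the graphviz Python package installed, and a scikit-learn version where export_graphviz accepts out_file=None and returns the dot source), you can render the tree without leaving Python:

```python
# sketch: render the tree to PNG without a separate command-line step
import graphviz
from sklearn.tree import export_graphviz

dot_data = export_graphviz(treereg, out_file=None, feature_names=feature_cols)
graphviz.Source(dot_data).render('15_vehicles', format='png')
```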
How do we read this decision tree?
Internal nodes:
Leaves:
How accurate is scikit-learn's regression tree at predicting the out-of-sample data?
# read in out-of-sample data
oos = pd.read_csv('https://raw.githubusercontent.com/justmarkham/DAT4/master/data/used_vehicles_oos.csv')
# convert car to 0 and truck to 1
oos['type'] = oos.type.map({'car':0, 'truck':1})
# print data
oos
 | price | year | miles | doors | type
---|---|---|---|---|---
0 | 3000 | 2003 | 130000 | 4 | 1 |
1 | 6000 | 2005 | 82500 | 4 | 0 |
2 | 12000 | 2010 | 60000 | 2 | 0 |
# define X and y
X_oos = oos[feature_cols]
y_oos = oos.price
# make predictions on out-of-sample data
preds = treereg.predict(X_oos)
# print predictions and actual values
print(preds)
print(y_oos.values)
[ 4000. 5000. 13500.]
[ 3000 6000 12000]
# print RMSE
np.sqrt(metrics.mean_squared_error(y_oos, preds))
1190.2380714238084
# print RMSE for the tree you created!
your_preds = [4000, 5000, 13500]
np.sqrt(metrics.mean_squared_error(y_oos, your_preds))
1190.2380714238084
Classification trees are very similar to regression trees. Here is a quick comparison:
regression trees | classification trees |
---|---|
predict a continuous response | predict a categorical response |
predict using mean response of each leaf | predict using most commonly occurring class of each leaf |
splits are chosen to minimize MSE | splits are chosen to minimize a different criterion (discussed below) |
Note that classification trees easily handle more than two response classes! (How have other classification models we've seen handled this scenario?)
Here's an example of a classification tree, which predicts whether or not a patient who presented with chest pain has heart disease:
Here are common options for the splitting criteria:

- classification error rate: the fraction of training observations in the region that do not belong to the most common class
- Gini index: a measure of total variance across the classes in a region; it takes a small value when the region is mostly a single class (a "pure" node)
- cross-entropy: numerically very similar to the Gini index, and also small for pure nodes
Which to use?
Why do some splits result in leaves with the same predicted class?
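Here's a small sketch (not scikit-learn's internals; the function names are made up) of how the Gini index and cross-entropy could be computed from a node's class proportions. It also illustrates the answer to the question above: a split can lower the weighted impurity of the children even when both children predict the same class as the parent.

```python
import numpy as np

def gini(p):
    """Gini index for a node with class proportions p; lower means purer."""
    p = np.asarray(p, dtype=float)
    return np.sum(p * (1 - p))

def cross_entropy(p):
    """Cross-entropy (deviance) for class proportions p; also lower for purer nodes."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # avoid log(0)
    return -np.sum(p * np.log2(p))

# example: a node that is 70% class 1 splits into two equal-sized children that
# are 60% and 90% class 1 -- both children still predict class 1, but the
# weighted Gini index drops from 0.42 to 0.33, so the split is still worthwhile
parent = gini([0.3, 0.7])
children = 0.5 * gini([0.4, 0.6]) + 0.5 * gini([0.1, 0.9])
print(parent, children)
```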
Some implementations of classification trees will allow you to handle categorical predictors without creating dummy variables. When splitting on a categorical predictor, they will try splitting on every possible combination of categories to find the best split. In the example above, "ChestPain:bc" means that the left-hand branch consists of observations with the second and third ChestPain categories, and the right-hand branch consists of remaining observations.
Unfortunately, scikit-learn's classification tree implementation does not support this approach. Instead, here's how you can handle categorical predictors:

- If a categorical feature has only two possible values, encode it as a single binary (0/1) feature.
- If it has more than two unordered values, create dummy variables (and drop one of them, since it is redundant).
- If its values have a natural ordering (like pclass), you can leave it as a single numeric feature.
We'll see examples of these strategies below.
We'll build a classification tree using the Titanic data provided by Kaggle.
# read in the data
titanic = pd.read_csv('https://raw.githubusercontent.com/justmarkham/DAT4/master/data/titanic.csv')
titanic.head(10)
 | survived | pclass | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked
---|---|---|---|---|---|---|---|---|---|---|---
0 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.0500 | NaN | S |
5 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
6 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54 | 0 | 0 | 17463 | 51.8625 | E46 | S |
7 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2 | 3 | 1 | 349909 | 21.0750 | NaN | S |
8 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27 | 0 | 2 | 347742 | 11.1333 | NaN | S |
9 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14 | 1 | 0 | 237736 | 30.0708 | NaN | C |
# look for missing values
titanic.isnull().sum()
survived      0
pclass        0
name          0
sex           0
age         177
sibsp         0
parch         0
ticket        0
fare          0
cabin       687
embarked      2
dtype: int64
Let's choose our response and a few features, and decide whether we need to adjust them:
# encode sex feature
titanic['sex'] = titanic.sex.map({'female':0, 'male':1})
# fill in missing values for age
titanic.age.fillna(titanic.age.mean(), inplace=True)
# print the updated DataFrame
titanic.head(10)
 | survived | pclass | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked
---|---|---|---|---|---|---|---|---|---|---|---
0 | 0 | 3 | Braund, Mr. Owen Harris | 1 | 22.000000 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0 | 38.000000 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 1 | 3 | Heikkinen, Miss. Laina | 0 | 26.000000 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 35.000000 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 0 | 3 | Allen, Mr. William Henry | 1 | 35.000000 | 0 | 0 | 373450 | 8.0500 | NaN | S |
5 | 0 | 3 | Moran, Mr. James | 1 | 29.699118 | 0 | 0 | 330877 | 8.4583 | NaN | Q |
6 | 0 | 1 | McCarthy, Mr. Timothy J | 1 | 54.000000 | 0 | 0 | 17463 | 51.8625 | E46 | S |
7 | 0 | 3 | Palsson, Master. Gosta Leonard | 1 | 2.000000 | 3 | 1 | 349909 | 21.0750 | NaN | S |
8 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | 0 | 27.000000 | 0 | 2 | 347742 | 11.1333 | NaN | S |
9 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | 0 | 14.000000 | 1 | 0 | 237736 | 30.0708 | NaN | C |
# create three dummy variables using get_dummies
pd.get_dummies(titanic.embarked, prefix='embarked').head(10)
 | embarked_C | embarked_Q | embarked_S
---|---|---|---
0 | 0 | 0 | 1 |
1 | 1 | 0 | 0 |
2 | 0 | 0 | 1 |
3 | 0 | 0 | 1 |
4 | 0 | 0 | 1 |
5 | 0 | 1 | 0 |
6 | 0 | 0 | 1 |
7 | 0 | 0 | 1 |
8 | 0 | 0 | 1 |
9 | 1 | 0 | 0 |
# create three dummy variables, drop the first dummy variable, and store this as a DataFrame
embarked_dummies = pd.get_dummies(titanic.embarked, prefix='embarked').iloc[:, 1:]
# concatenate the two dummy variable columns onto the original DataFrame
# note: axis=0 means rows, axis=1 means columns
titanic = pd.concat([titanic, embarked_dummies], axis=1)
# print the updated DataFrame
titanic.head(10)
 | survived | pclass | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | embarked_Q | embarked_S
---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 0 | 3 | Braund, Mr. Owen Harris | 1 | 22.000000 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 0 | 1 |
1 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0 | 38.000000 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 0 | 0 |
2 | 1 | 3 | Heikkinen, Miss. Laina | 0 | 26.000000 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 0 | 1 |
3 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 35.000000 | 1 | 0 | 113803 | 53.1000 | C123 | S | 0 | 1 |
4 | 0 | 3 | Allen, Mr. William Henry | 1 | 35.000000 | 0 | 0 | 373450 | 8.0500 | NaN | S | 0 | 1 |
5 | 0 | 3 | Moran, Mr. James | 1 | 29.699118 | 0 | 0 | 330877 | 8.4583 | NaN | Q | 1 | 0 |
6 | 0 | 1 | McCarthy, Mr. Timothy J | 1 | 54.000000 | 0 | 0 | 17463 | 51.8625 | E46 | S | 0 | 1 |
7 | 0 | 3 | Palsson, Master. Gosta Leonard | 1 | 2.000000 | 3 | 1 | 349909 | 21.0750 | NaN | S | 0 | 1 |
8 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | 0 | 27.000000 | 0 | 2 | 347742 | 11.1333 | NaN | S | 0 | 1 |
9 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | 0 | 14.000000 | 1 | 0 | 237736 | 30.0708 | NaN | C | 0 | 0 |
# create a list of feature columns
feature_cols = ['pclass', 'sex', 'age', 'embarked_Q', 'embarked_S']
# define X and y
X = titanic[feature_cols]
y = titanic.survived
# fit a classification tree with max_depth=3 on all data
from sklearn.tree import DecisionTreeClassifier
treeclf = DecisionTreeClassifier(max_depth=3, random_state=1)
treeclf.fit(X, y)
DecisionTreeClassifier(compute_importances=None, criterion='gini', max_depth=3, max_features=None, max_leaf_nodes=None, min_density=None, min_samples_leaf=1, min_samples_split=2, random_state=1, splitter='best')
# create a Graphviz file
with open("15_titanic.dot", 'wb') as f:
f = export_graphviz(treeclf, out_file=f, feature_names=feature_cols)
Notice the split in the bottom right, which was made only to increase node purity.
# compute the feature importances
pd.DataFrame({'feature':feature_cols, 'importance':treeclf.feature_importances_})
 | feature | importance
---|---|---
0 | pclass | 0.242664 |
1 | sex | 0.655584 |
2 | age | 0.064494 |
3 | embarked_Q | 0.000000 |
4 | embarked_S | 0.037258 |
Here are some advantages and disadvantages of decision trees that we haven't yet talked about:
Advantages:
Disadvantages:
Note that there is not just one decision tree algorithm; instead, there are many variations. A few common decision tree algorithms that are often referred to by name are C4.5, C5.0, and CART. (More details are available in the scikit-learn documentation.) scikit-learn uses an "optimized version" of CART.