#!/usr/bin/env python
# coding: utf-8

# # Decision Trees
#
# *Adapted from Chapter 8 of [An Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/)*

# Why are we learning about decision trees?
#
# - Can be applied to both regression and classification problems
# - Many useful properties
# - Very popular
# - Basis for more sophisticated models
# - Have a different way of "thinking" than the other models we have studied

# ## Lesson objectives
#
# Students will be able to:
#
# - Explain how a decision tree is created
# - Build a decision tree model in scikit-learn
# - Tune a decision tree model and explain how tuning impacts the model
# - Interpret a tree diagram
# - Describe the key differences between regression and classification trees
# - Decide whether a decision tree is an appropriate model for a given problem

# # Part 1: Regression trees
#
# Major League Baseball player data from 1986-87:
#
# - **Years** (x-axis): number of years playing in the major leagues
# - **Hits** (y-axis): number of hits in the previous year
# - **Salary** (color): low salary is blue/green, high salary is red/yellow

# ![Salary data](images/salary_color.png)

# Group exercise:
#
# - The data above is our **training data**.
# - We want to build a model that predicts the Salary of **future players** based on Years and Hits.
# - We are going to "segment" the feature space into regions, and then use the **mean Salary in each region** as the predicted Salary for future players.
# - Intuitively, you want to **maximize** the similarity (or "homogeneity") within a given region, and **minimize** the similarity between different regions.
#
# Rules for segmenting:
#
# - You can only use **straight lines**, drawn one at a time.
# - Your line must be either **vertical or horizontal**.
# - Your line **stops** when it hits an existing line.

# ![Salary regions](images/salary_regions.png)

# Above are the regions created by a computer:
#
# - $R_1$: players with **less than 5 years** of experience, mean Salary of **\$166,000**
# - $R_2$: players with **5 or more years** of experience and **less than 118 hits**, mean Salary of **\$403,000**
# - $R_3$: players with **5 or more years** of experience and **118 hits or more**, mean Salary of **\$846,000**
#
# **Note:** Years and Hits are both integers, but the convention is to use the **midpoint** between adjacent values to label a split.
#
# These regions are used to make predictions on **out-of-sample data**. Thus, there are only three possible predictions! (Is this different from how **linear regression** makes predictions?)
#
# Below is the equivalent regression tree:

# ![Salary tree](images/salary_tree.png)

# The first split is **Years < 4.5**, thus that split goes at the top of the tree. When a splitting rule is **True**, you follow the left branch. When a splitting rule is **False**, you follow the right branch.
#
# For players in the **left branch**, the mean Salary is \$166,000, thus you label it with that value. (Salary has been divided by 1000 and log-transformed to 5.11.)
#
# For players in the **right branch**, there is a further split on **Hits < 117.5**, dividing players into two more Salary regions: \$403,000 (transformed to 6.00) and \$846,000 (transformed to 6.74).

# ![Salary tree annotated](images/salary_tree_annotated.png)

# **What does this tree tell you about your data?**
#
# - Years is the most important factor determining Salary, with a lower number of Years corresponding to a lower Salary.
# - For a player with a lower number of Years, Hits is not an important factor determining Salary.
# - For a player with a higher number of Years, Hits is an important factor determining Salary, with a greater number of Hits corresponding to a higher Salary.
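# To make the tree's predictions concrete, here is a minimal sketch (not part of the original lesson) that hard-codes the three regions as a lookup function, using the split points 4.5 and 117.5 and the mean Salaries given above. Every future player receives one of only three possible predictions.

# In[ ]:

# illustrative sketch: the fitted regions as a simple lookup (mean Salary per region, in dollars)
def predict_salary(years, hits):
    if years < 4.5:        # R1: less than 5 years of experience
        return 166000
    elif hits < 117.5:     # R2: 5 or more years, fewer than 118 hits
        return 403000
    else:                  # R3: 5 or more years, 118 or more hits
        return 846000

predict_salary(years=6, hits=120)   # falls in R3, so the prediction is 846000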
# **Question:** What do you like and dislike about decision trees so far?

# ## Building a regression tree by hand
#
# Your **training data** is a tiny dataset of [used vehicle sale prices](https://raw.githubusercontent.com/justmarkham/DAT8/master/data/vehicles_train.csv). Your goal is to **predict price** for testing data.
#
# 1. Read the data into a Pandas DataFrame.
# 2. Explore the data by sorting, plotting, or split-apply-combine (aka `groupby`).
# 3. Decide which feature is the most important predictor, and use that to create your first splitting rule.
#     - Only binary splits are allowed.
# 4. After making your first split, split your DataFrame into two parts, and then explore each part to figure out what other splits to make.
# 5. Stop making splits once you are convinced that your tree strikes a good balance between underfitting and overfitting.
#     - Your goal is to build a model that generalizes well.
#     - You are allowed to split on the same variable multiple times!
# 6. Draw your tree, labeling the leaves with the mean price for the observations in that region.
#     - Make sure nothing is backwards: you follow the **left branch** if the rule is true, and the **right branch** if the rule is false.

# ## How does a computer build a regression tree?
#
# **Ideal approach:** Consider every possible partition of the feature space (computationally infeasible)
#
# **"Good enough" approach:** recursive binary splitting
#
# 1. Begin at the top of the tree.
# 2. For **every feature**, examine **every possible cutpoint**, and choose the feature and cutpoint such that the resulting tree has the lowest possible mean squared error (MSE). Make that split.
# 3. Examine the two resulting regions, and again make a **single split** (in one of the regions) to minimize the MSE.
# 4. Keep repeating step 3 until a **stopping criterion** is met:
#     - maximum tree depth (maximum number of splits required to arrive at a leaf)
#     - minimum number of observations in a leaf
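# The two stopping criteria above map directly onto scikit-learn parameters. A minimal sketch (not part of the original lesson; the parameter values here are just illustrative):

# In[ ]:

# illustrative sketch: the stopping criteria expressed as DecisionTreeRegressor parameters
from sklearn.tree import DecisionTreeRegressor
DecisionTreeRegressor(max_depth=3,         # maximum tree depth
                      min_samples_leaf=2)  # minimum number of observations in a leaf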
# ### Demo: Choosing the ideal cutpoint for a given feature

# In[1]:

# vehicle data
import pandas as pd
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/vehicles_train.csv'
train = pd.read_csv(url)

# In[2]:

# before splitting anything, just predict the mean of the entire dataset
train['prediction'] = train.price.mean()
train

# In[3]:

# calculate RMSE for those predictions
from sklearn import metrics
import numpy as np
np.sqrt(metrics.mean_squared_error(train.price, train.prediction))

# In[4]:

# define a function that calculates the RMSE for a given split of miles
def mileage_split(miles):
    lower_mileage_price = train[train.miles < miles].price.mean()
    higher_mileage_price = train[train.miles >= miles].price.mean()
    train['prediction'] = np.where(train.miles < miles, lower_mileage_price, higher_mileage_price)
    return np.sqrt(metrics.mean_squared_error(train.price, train.prediction))

# In[5]:

# calculate RMSE for a tree which splits on miles < 50000
print('RMSE:', mileage_split(50000))
train

# In[6]:

# calculate RMSE for a tree which splits on miles < 100000
print('RMSE:', mileage_split(100000))
train

# In[7]:

# check all possible mileage splits
mileage_range = range(train.miles.min(), train.miles.max(), 1000)
RMSE = [mileage_split(miles) for miles in mileage_range]

# In[8]:

# allow plots to appear in the notebook
get_ipython().run_line_magic('matplotlib', 'inline')
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (6, 4)
plt.rcParams['font.size'] = 14

# In[9]:

# plot mileage cutpoint (x-axis) versus RMSE (y-axis)
plt.plot(mileage_range, RMSE)
plt.xlabel('Mileage cutpoint')
plt.ylabel('RMSE (lower is better)')

# **Recap:** Before every split, this process is repeated for every feature, and the feature and cutpoint that produces the lowest MSE is chosen.

# ## Building a regression tree in scikit-learn

# In[10]:

# encode car as 0 and truck as 1
train['vtype'] = train.vtype.map({'car':0, 'truck':1})

# In[11]:

# define X and y
feature_cols = ['year', 'miles', 'doors', 'vtype']
X = train[feature_cols]
y = train.price

# In[12]:

# instantiate a DecisionTreeRegressor (with random_state=1)
from sklearn.tree import DecisionTreeRegressor
treereg = DecisionTreeRegressor(random_state=1)
treereg

# In[13]:

# use leave-one-out cross-validation (LOOCV) to estimate the RMSE for this model
# (cv=14 because there are 14 observations in the training data)
from sklearn.model_selection import cross_val_score
scores = cross_val_score(treereg, X, y, cv=14, scoring='neg_mean_squared_error')
np.mean(np.sqrt(-scores))

# ## What happens when we grow a tree too deep?
#
# - Left: Regression tree for Salary **grown deeper**
# - Right: Comparison of the **training, testing, and cross-validation errors** for trees with different numbers of leaves

# ![Salary tree grown deep](images/salary_tree_deep.png)

# The **training error** continues to go down as the tree size increases (due to overfitting), but the lowest **cross-validation error** occurs for a tree with 3 leaves.
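# The same effect shows up in the vehicle data. Here is a quick sketch (not part of the original lesson) that reuses X, y, and the imports from the cells above: an un-pruned tree nearly memorizes the 14 training observations, so its training RMSE is close to zero even though its LOOCV RMSE above is far higher.

# In[ ]:

# illustrative sketch: training RMSE of an un-pruned tree (expect a value near zero)
treereg = DecisionTreeRegressor(random_state=1)
treereg.fit(X, y)
np.sqrt(metrics.mean_squared_error(y, treereg.predict(X)))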
# ## Tuning a regression tree
#
# Let's try to reduce the RMSE by tuning the **max_depth** parameter:

# In[14]:

# try different values one-by-one
treereg = DecisionTreeRegressor(max_depth=1, random_state=1)
scores = cross_val_score(treereg, X, y, cv=14, scoring='neg_mean_squared_error')
np.mean(np.sqrt(-scores))

# Or, we could write a loop to try a range of values:

# In[15]:

# list of values to try
max_depth_range = range(1, 8)

# list to store the average RMSE for each value of max_depth
RMSE_scores = []

# use LOOCV with each value of max_depth
for depth in max_depth_range:
    treereg = DecisionTreeRegressor(max_depth=depth, random_state=1)
    MSE_scores = cross_val_score(treereg, X, y, cv=14, scoring='neg_mean_squared_error')
    RMSE_scores.append(np.mean(np.sqrt(-MSE_scores)))

# In[16]:

# plot max_depth (x-axis) versus RMSE (y-axis)
plt.plot(max_depth_range, RMSE_scores)
plt.xlabel('max_depth')
plt.ylabel('RMSE (lower is better)')

# In[17]:

# max_depth=3 was best, so fit a tree using that parameter
treereg = DecisionTreeRegressor(max_depth=3, random_state=1)
treereg.fit(X, y)

# In[18]:

# "Gini importance" of each feature: the (normalized) total reduction of error brought by that feature
pd.DataFrame({'feature':feature_cols, 'importance':treereg.feature_importances_})

# ## Creating a tree diagram

# In[19]:

# create a Graphviz file
from sklearn.tree import export_graphviz
export_graphviz(treereg, out_file='tree_vehicles.dot', feature_names=feature_cols)

# At the command line, run this to convert to PNG:
# dot -Tpng tree_vehicles.dot -o tree_vehicles.png

# ![Tree for vehicle data](images/tree_vehicles.png)

# Reading the internal nodes:
#
# - **samples:** number of observations in that node before splitting
# - **mse:** MSE calculated by comparing the actual response values in that node against the mean response value in that node
# - **rule:** rule used to split that node (go left if true, go right if false)
#
# Reading the leaves:
#
# - **samples:** number of observations in that node
# - **value:** mean response value in that node
# - **mse:** MSE calculated by comparing the actual response values in that node against "value"
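# If Graphviz is not installed, newer scikit-learn releases (0.21+) can also draw the fitted tree directly with matplotlib. A minimal optional sketch (not part of the original lesson):

# In[ ]:

# optional sketch: render the fitted tree in the notebook without Graphviz (scikit-learn 0.21+)
from sklearn.tree import plot_tree
plt.figure(figsize=(12, 6))
plot_tree(treereg, feature_names=feature_cols, filled=True)
plt.show()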
# ## Making predictions for the testing data

# In[20]:

# read the testing data
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/vehicles_test.csv'
test = pd.read_csv(url)
test['vtype'] = test.vtype.map({'car':0, 'truck':1})
test

# **Question:** Using the tree diagram above, what predictions will the model make for each observation?

# In[21]:

# use fitted model to make predictions on testing data
X_test = test[feature_cols]
y_test = test.price
y_pred = treereg.predict(X_test)
y_pred

# In[22]:

# calculate RMSE
np.sqrt(metrics.mean_squared_error(y_test, y_pred))

# In[23]:

# calculate RMSE for your own tree!
y_test = [3000, 6000, 12000]
y_pred = [0, 0, 0]
from sklearn import metrics
np.sqrt(metrics.mean_squared_error(y_test, y_pred))

# # Part 2: Classification trees
#
# **Example:** Predict whether Barack Obama or Hillary Clinton will win the Democratic primary in a particular county in 2008:

# ![Obama-Clinton decision tree](images/obama_clinton_tree.jpg)

# **Questions:**
#
# - What are the observations? How many observations are there?
# - What is the response variable?
# - What are the features?
# - What is the most predictive feature?
# - Why does the tree split on high school graduation rate twice in a row?
# - What is the class prediction for the following county: 15% African-American, 90% high school graduation rate, located in the South, high poverty, high population density?
# - What is the predicted probability for that same county?

# ## Comparing regression trees and classification trees
#
# |regression trees|classification trees|
# |---|---|
# |predict a continuous response|predict a categorical response|
# |predict using mean response of each leaf|predict using most commonly occurring class of each leaf|
# |splits are chosen to minimize MSE|splits are chosen to minimize Gini index (discussed below)|

# ## Splitting criteria for classification trees
#
# Common options for the splitting criteria:
#
# - **classification error rate:** fraction of training observations in a region that don't belong to the most common class
# - **Gini index:** measure of total variance across classes in a region

# ### Example of classification error rate
#
# Pretend we are predicting whether someone buys an iPhone or an Android:
#
# - At a particular node, there are **25 observations** (phone buyers), of whom **10 bought iPhones and 15 bought Androids**.
# - Since the majority class is **Android**, that's our prediction for all 25 observations, and thus the classification error rate is **10/25 = 40%**.
#
# Our goal in making splits is to **reduce the classification error rate**. Let's try splitting on gender:
#
# - **Males:** 2 iPhones and 12 Androids, thus the predicted class is Android
# - **Females:** 8 iPhones and 3 Androids, thus the predicted class is iPhone
# - Classification error rate after this split would be **5/25 = 20%**
#
# Compare that with a split on age:
#
# - **30 or younger:** 4 iPhones and 8 Androids, thus the predicted class is Android
# - **31 or older:** 6 iPhones and 7 Androids, thus the predicted class is Android
# - Classification error rate after this split would be **10/25 = 40%**
#
# The decision tree algorithm will try **every possible split across all features**, and choose the split that **reduces the error rate the most.**

# ### Example of Gini index
#
# Calculate the Gini index before making a split:
#
# $$1 - \left(\frac {iPhone} {Total}\right)^2 - \left(\frac {Android} {Total}\right)^2 = 1 - \left(\frac {10} {25}\right)^2 - \left(\frac {15} {25}\right)^2 = 0.48$$
#
# - The **maximum value** of the Gini index is 0.5, and occurs when the classes are perfectly balanced in a node.
# - The **minimum value** of the Gini index is 0, and occurs when there is only one class represented in a node.
# - A node with a lower Gini index is said to be more "pure".
#
# Evaluating the split on **gender** using Gini index:
#
# $$\text{Males: } 1 - \left(\frac {2} {14}\right)^2 - \left(\frac {12} {14}\right)^2 = 0.24$$
# $$\text{Females: } 1 - \left(\frac {8} {11}\right)^2 - \left(\frac {3} {11}\right)^2 = 0.40$$
# $$\text{Weighted Average: } 0.24 \left(\frac {14} {25}\right) + 0.40 \left(\frac {11} {25}\right) = 0.31$$
#
# Evaluating the split on **age** using Gini index:
#
# $$\text{30 or younger: } 1 - \left(\frac {4} {12}\right)^2 - \left(\frac {8} {12}\right)^2 = 0.44$$
# $$\text{31 or older: } 1 - \left(\frac {6} {13}\right)^2 - \left(\frac {7} {13}\right)^2 = 0.50$$
# $$\text{Weighted Average: } 0.44 \left(\frac {12} {25}\right) + 0.50 \left(\frac {13} {25}\right) = 0.47$$
#
# Again, the decision tree algorithm will try **every possible split**, and will choose the split that **reduces the Gini index (and thus increases the "node purity") the most.**
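# The arithmetic above is easy to verify in code. A minimal sketch (not part of the original lesson); small differences from the hand calculations come from rounding the per-node values:

# In[ ]:

# illustrative sketch: Gini index for each node of the gender split, then the weighted average
def gini(counts):
    total = sum(counts)
    return 1 - sum((n / total) ** 2 for n in counts)

males, females = [2, 12], [8, 3]       # iPhone and Android counts in each node
weighted = (sum(males) * gini(males) + sum(females) * gini(females)) / 25
gini(males), gini(females), weighted   # compare with 0.24, 0.40, 0.31 above (up to rounding)
# swapping in [4, 8] and [6, 7] reproduces the age split: 0.44, 0.50, 0.47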
# ### Comparing classification error rate and Gini index
#
# - Gini index is generally preferred because it will make splits that **increase node purity**, even if that split does not change the classification error rate.
# - Node purity is important because we're interested in the **class proportions** in each region, since that's how we calculate the **predicted probability** of each class.
# - scikit-learn's default splitting criterion for classification trees is the Gini index.
#
# Note: There is another common splitting criterion called **cross-entropy**. It's numerically similar to the Gini index, but slower to compute, thus it's not as popular as the Gini index.

# ## Building a classification tree in scikit-learn

# We'll build a classification tree using the Titanic data:

# In[24]:

# read in the data
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/titanic.csv'
titanic = pd.read_csv(url)

# encode female as 0 and male as 1
titanic['Sex'] = titanic.Sex.map({'female':0, 'male':1})

# fill in the missing values for age with the median age
titanic.Age.fillna(titanic.Age.median(), inplace=True)

# create a DataFrame of dummy variables for Embarked
embarked_dummies = pd.get_dummies(titanic.Embarked, prefix='Embarked')
embarked_dummies.drop(embarked_dummies.columns[0], axis=1, inplace=True)

# concatenate the original DataFrame and the dummy DataFrame
titanic = pd.concat([titanic, embarked_dummies], axis=1)

# print the updated DataFrame
titanic.head()

# - **Survived:** 0=died, 1=survived (response variable)
# - **Pclass:** 1=first class, 2=second class, 3=third class
#     - What will happen if the tree splits on this feature?
# - **Sex:** 0=female, 1=male
# - **Age:** numeric value
# - **Embarked:** C or Q or S

# In[25]:

# define X and y
feature_cols = ['Pclass', 'Sex', 'Age', 'Embarked_Q', 'Embarked_S']
X = titanic[feature_cols]
y = titanic.Survived

# In[26]:

# fit a classification tree with max_depth=3 on all data
from sklearn.tree import DecisionTreeClassifier
treeclf = DecisionTreeClassifier(max_depth=3, random_state=1)
treeclf.fit(X, y)

# In[27]:

# create a Graphviz file
export_graphviz(treeclf, out_file='tree_titanic.dot', feature_names=feature_cols)

# At the command line, run this to convert to PNG:
# dot -Tpng tree_titanic.dot -o tree_titanic.png

# ![Tree for Titanic data](images/tree_titanic.png)

# Notice the split in the bottom right: the **same class** is predicted in both of its leaves. That split didn't affect the **classification error rate**, though it did increase the **node purity**, which is important because it increases the accuracy of our predicted probabilities.
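# Because those class proportions are what drive the predicted probabilities, we can inspect them directly with predict_proba. A quick sketch (not part of the original lesson), reusing treeclf and X from the cells above:

# In[ ]:

# illustrative sketch: predicted probabilities for the first five passengers
pd.DataFrame(treeclf.predict_proba(X)[:5], columns=['prob_died', 'prob_survived'])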
# In[28]:

# compute the feature importances
pd.DataFrame({'feature':feature_cols, 'importance':treeclf.feature_importances_})

# # Part 3: Comparing decision trees with other models
#
# **Advantages of decision trees:**
#
# - Can be used for regression or classification
# - Can be displayed graphically
# - Highly interpretable
# - Can be specified as a series of rules, and more closely approximate human decision-making than other models
# - Prediction is fast
# - Features don't need scaling
# - Automatically learns feature interactions
# - Tends to ignore irrelevant features
# - Non-parametric (will outperform linear models if the relationship between features and response is highly non-linear; see the sketch at the end of this notebook)

# ![Trees versus linear models](images/tree_vs_linear.png)

# **Disadvantages of decision trees:**
#
# - Performance is (generally) not competitive with the best supervised learning methods
# - Can easily overfit the training data (tuning is required)
# - Small variations in the data can result in a completely different tree (high variance)
# - Recursive binary splitting makes "locally optimal" decisions that may not result in a globally optimal tree
# - Doesn't tend to work well if the classes are highly unbalanced
# - Doesn't tend to work well with very small datasets
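# As a closing illustration of the "non-parametric" advantage listed above (and of the trees-versus-linear-models figure), here is a small sketch (not part of the original lesson) on synthetic data with a step-shaped relationship: a shallow tree captures the step, while linear regression cannot.

# In[ ]:

# illustrative sketch: tree versus linear regression on a step-shaped relationship
from sklearn.linear_model import LinearRegression
rng = np.random.RandomState(1)
X_toy = rng.uniform(0, 10, 200).reshape(-1, 1)
y_toy = np.where(X_toy.ravel() < 5, 10, 20) + rng.normal(0, 1, 200)
print('linear regression R^2:', LinearRegression().fit(X_toy, y_toy).score(X_toy, y_toy))
print('decision tree R^2:    ', DecisionTreeRegressor(max_depth=2).fit(X_toy, y_toy).score(X_toy, y_toy))
# the tree's R^2 is much closer to 1 because it can model the abrupt jump at x = 5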