#!/usr/bin/env python
# coding: utf-8

# # 07 - Ensemble Methods
# 
# by [Alejandro Correa Bahnsen](http://www.albahnsen.com/) & [Iván Torroledo](http://www.ivantorroledo.com/)
# 
# version 1.3, June 2018
# 
# ## Part of the class [Applied Deep Learning](https://github.com/albahnsen/AppliedDeepLearningClass)
# 
# This notebook is licensed under a [Creative Commons Attribution-ShareAlike 3.0 Unported License](http://creativecommons.org/licenses/by-sa/3.0/deed.en_US). Special thanks go to [Kevin Markham](https://github.com/justmarkham)

# Why are we learning about ensembling?
# 
# - Very popular method for improving the predictive performance of machine learning models
# - Provides a foundation for understanding more sophisticated models

# ## Lesson objectives
# 
# Students will be able to:
# 
# - Define ensembling and its requirements
# - Identify the two basic methods of ensembling
# - Decide whether manual ensembling is a useful approach for a given problem
# - Explain bagging and how it can be applied to decision trees
# - Explain how out-of-bag error and feature importances are calculated from bagged trees
# - Explain the difference between bagged trees and Random Forests
# - Build and tune a Random Forest model in scikit-learn
# - Decide whether a decision tree or a Random Forest is a better model for a given problem

# # Part 1: Introduction

# Ensemble learning is a widely studied topic in the machine learning community. The main idea behind
# the ensemble methodology is to combine several individual base classifiers into a
# classifier that outperforms each of them.
# 
# Nowadays, ensemble methods are one
# of the most popular and well studied machine learning techniques, and it can be
# noted that since 2009 all the first-place and second-place winners of the [KDD-Cup](https://www.sigkdd.org/kddcup/) have used ensemble methods. The core
# principle of ensemble learning is to induce random perturbations into the learning procedure in
# order to produce several different base classifiers from a single training set, and then combine the
# base classifiers to make the final prediction. To induce these random perturbations,
# and thereby create the different base classifiers, several methods have been proposed, in
# particular:
# * bagging
# * pasting
# * random forests
# * random patches
# 
# Finally, after the base classifiers
# are trained, they are typically combined using either:
# * majority voting
# * weighted voting
# * stacking
# 
# There are three main reasons why ensemble
# methods perform better than single models: statistical, computational and representational. First, from a statistical point of view, when the learning set is too
# small, an algorithm can find several good models within the search space that achieve the same
# performance on the training set $\mathcal{S}$. Nevertheless, without a validation set, there is
# a risk of choosing the wrong model. The second reason is computational; in general, algorithms
# rely on some local search optimization and may get stuck in a local optimum. An ensemble may
# alleviate this by combining base classifiers that explore different regions of the search space. The last
# reason is representational. In most cases, for a learning set of finite size, the true function
# $f$ cannot be represented by any of the candidate models. By combining several models in an
# ensemble, it may be possible to obtain a model with a larger coverage across the space of
# representable functions.

# ![s](images/ch9_fig1.png)
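# As a quick illustration of the combination step (a minimal sketch with made-up toy
# predictions), majority voting picks the class chosen by most base classifiers, while
# weighted voting lets more reliable classifiers count for more:

# In[ ]:

import numpy as np

# rows = base classifiers, columns = observations (toy 0/1 class predictions)
base_preds = np.array([[1, 0, 1, 1],
                       [1, 1, 0, 1],
                       [0, 0, 1, 1]])

# majority voting: predict the class chosen by most of the base classifiers
print((base_preds.mean(axis=0) >= 0.5).astype(int))

# weighted voting: e.g. weights proportional to each classifier's (made-up) validation accuracy
weights = np.array([0.8, 0.7, 0.6])
print((np.average(base_preds, axis=0, weights=weights) >= 0.5).astype(int))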
# ## Example
# 
# Let's pretend that instead of building a single model to solve a binary classification problem, you created **five independent models**, and each model was correct about 70% of the time. If you combined these models into an "ensemble" and used their majority vote as a prediction, how often would the ensemble be correct?

# In[1]:

import numpy as np

# set a seed for reproducibility
np.random.seed(1234)

# generate 1000 random numbers (between 0 and 1) for each model, representing 1000 observations
mod1 = np.random.rand(1000)
mod2 = np.random.rand(1000)
mod3 = np.random.rand(1000)
mod4 = np.random.rand(1000)
mod5 = np.random.rand(1000)

# each model independently predicts 1 (the "correct response") if its random number is greater than 0.3
preds1 = np.where(mod1 > 0.3, 1, 0)
preds2 = np.where(mod2 > 0.3, 1, 0)
preds3 = np.where(mod3 > 0.3, 1, 0)
preds4 = np.where(mod4 > 0.3, 1, 0)
preds5 = np.where(mod5 > 0.3, 1, 0)

# print the first 20 predictions from each model
print(preds1[:20])
print(preds2[:20])
print(preds3[:20])
print(preds4[:20])
print(preds5[:20])

# In[2]:

# average the predictions and then round to 0 or 1
ensemble_preds = np.round((preds1 + preds2 + preds3 + preds4 + preds5) / 5.0).astype(int)

# print the ensemble's first 20 predictions
print(ensemble_preds[:20])

# In[3]:

# how accurate was each individual model?
print(preds1.mean())
print(preds2.mean())
print(preds3.mean())
print(preds4.mean())
print(preds5.mean())

# In[4]:

# how accurate was the ensemble?
print(ensemble_preds.mean())

# **Note:** As you add more (independent, better-than-chance) models to the voting process, the probability of error decreases, which is known as [Condorcet's Jury Theorem](http://en.wikipedia.org/wiki/Condorcet%27s_jury_theorem).

# ## What is ensembling?
# 
# **Ensemble learning (or "ensembling")** is the process of combining several predictive models in order to produce a combined model that is more accurate than any individual model.
# 
# - **Regression:** take the average of the predictions
# - **Classification:** take a vote and use the most common prediction, or take the average of the predicted probabilities
# 
# For ensembling to work well, the models must have the following characteristics:
# 
# - **Accurate:** they outperform the null model
# - **Independent:** their predictions are generated using different processes
# 
# **The big idea:** If you have a collection of individually imperfect (and independent) models, the "one-off" mistakes made by each model are probably not going to be made by the rest of the models, and thus the mistakes will be discarded when averaging the models.
# 
# There are two basic **methods for ensembling:**
# 
# - Manually ensemble your individual models
# - Use a model that ensembles for you

# ### Theoretical performance of an ensemble
# 
# If we assume that each one of the $T$ base classifiers has a probability $\rho$ of
# being correct, the probability of an ensemble making the correct decision, assuming independence,
# denoted by $P_c$, can be calculated using the binomial distribution
# 
# $$P_c = \sum_{j>T/2}^{T} {{T}\choose{j}} \rho^j(1-\rho)^{T-j}.$$
# 
# Furthermore, it can be shown that if $T\ge3$:
# 
# $$
# \lim_{T \to \infty} P_c= \begin{cases}
# 1 &\mbox{if } \rho>0.5 \\
# 0 &\mbox{if } \rho<0.5 \\
# 0.5 &\mbox{if } \rho=0.5 ,
# \end{cases}
# $$
# leading to the conclusion that
# $$
# \rho \ge 0.5 \quad \text{and} \quad T\ge3 \quad \Rightarrow \quad P_c\ge \rho.
# $$
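# To make the formula concrete, here is a minimal sketch (assuming SciPy is available) that
# evaluates $P_c$ directly. With $T=5$ and $\rho=0.7$ it gives roughly 0.84, close to the
# ensemble accuracy simulated in the Example above, and the value grows as $T$ increases.

# In[ ]:

from scipy.special import comb

def ensemble_accuracy(T, rho):
    # probability that more than half of the T independent base classifiers are correct
    return sum(comb(T, j) * rho**j * (1 - rho)**(T - j) for j in range(T // 2 + 1, T + 1))

for T in [5, 25, 101]:
    print(T, ensemble_accuracy(T, 0.7))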
# # Part 2: Manual ensembling
# 
# What makes a good manual ensemble?
# 
# - Different types of **models**
# - Different combinations of **features**
# - Different **tuning parameters**

# ![Machine learning flowchart](https://raw.githubusercontent.com/justmarkham/DAT8/master/notebooks/images/crowdflower_ensembling.jpg)
# 
# *Machine learning flowchart created by the [winner](https://github.com/ChenglongChen/Kaggle_CrowdFlower) of Kaggle's [CrowdFlower competition](https://www.kaggle.com/c/crowdflower-search-relevance)*

# In[5]:

# read in and prepare the vehicle training data
import zipfile
import pandas as pd

with zipfile.ZipFile('../datasets/vehicles_train.csv.zip', 'r') as z:
    f = z.open('vehicles_train.csv')
    train = pd.read_csv(f, sep=',', index_col=False)
train['vtype'] = train.vtype.map({'car': 0, 'truck': 1})

# read in and prepare the vehicle testing data
with zipfile.ZipFile('../datasets/vehicles_test.csv.zip', 'r') as z:
    f = z.open('vehicles_test.csv')
    test = pd.read_csv(f, sep=',', index_col=False)
test['vtype'] = test.vtype.map({'car': 0, 'truck': 1})

# In[6]:

train.head()

# ### Train different models

# In[7]:

from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsRegressor

# note: GaussianNB is a classifier, so here it treats each distinct price as a class
models = {'lr': LinearRegression(),
          'dt': DecisionTreeRegressor(),
          'nb': GaussianNB(),
          'nn': KNeighborsRegressor()}

# In[8]:

# train all the models
X_train = train.iloc[:, 1:]
X_test = test.iloc[:, 1:]
y_train = train.price
y_test = test.price

for model in models.keys():
    models[model].fit(X_train, y_train)

# In[9]:

# predict the test set with each model
y_pred = pd.DataFrame(index=test.index, columns=list(models.keys()))
for model in models.keys():
    y_pred[model] = models[model].predict(X_test)

# In[10]:

# evaluate each model
from sklearn.metrics import mean_squared_error

for model in models.keys():
    print(model, np.sqrt(mean_squared_error(y_pred[model], y_test)))

# ### Evaluate the error of the mean of the predictions

# In[11]:

np.sqrt(mean_squared_error(y_pred.mean(axis=1), y_test))

# ## Comparing manual ensembling with a single model approach
# 
# **Advantages of manual ensembling:**
# 
# - Increases predictive accuracy
# - Easy to get started
# 
# **Disadvantages of manual ensembling:**
# 
# - Decreases interpretability
# - Takes longer to train
# - Takes longer to predict
# - More complex to automate and maintain
# - Small gains in accuracy may not be worth the added complexity

# # Part 3: Bagging
# 
# The primary weakness of **decision trees** is that they don't tend to have the best predictive accuracy. This is partially due to **high variance**, meaning that different splits in the training data can lead to very different trees.
# 
# **Bagging** is a general purpose procedure for reducing the variance of a machine learning method, but is particularly useful for decision trees. Bagging is short for **bootstrap aggregation**, meaning the aggregation of bootstrap samples.
# 
# What is a **bootstrap sample**? A random sample with replacement:

# In[12]:

# set a seed for reproducibility
np.random.seed(1)

# create an array of 1 through 20
nums = np.arange(1, 21)
print(nums)

# sample that array 20 times with replacement
print(np.random.choice(a=nums, size=20, replace=True))
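# A quick side note (a minimal sketch): because a bootstrap sample draws with replacement,
# it contains on average only about 63% of the distinct original observations (roughly $1 - 1/e$).
# The left-out observations are the "out-of-bag" observations used later to estimate out-of-sample error.

# In[ ]:

# fraction of distinct observations that appear in each of 100 bootstrap samples of size 1000
n = 1000
fractions = [len(set(np.random.choice(a=n, size=n, replace=True))) / n for _ in range(100)]
print(np.mean(fractions))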
# **How does bagging work (for decision trees)?**
# 
# 1. Grow B trees using B bootstrap samples from the training data.
# 2. Train each tree on its bootstrap sample and make predictions.
# 3. Combine the predictions:
#     - Average the predictions for **regression trees**
#     - Take a vote for **classification trees**
# 
# Notes:
# 
# - **Each bootstrap sample** should be the same size as the original training set.
# - **B** should be a large enough value that the error seems to have "stabilized".
# - The trees are **grown deep** so that they have low bias/high variance.
# 
# Bagging increases predictive accuracy by **reducing the variance**, similar to how cross-validation reduces the variance associated with train/test split (for estimating out-of-sample error) by splitting many times and averaging the results.

# In[13]:

# set a seed for reproducibility
np.random.seed(123)

n_samples = train.shape[0]
n_B = 10

# create ten bootstrap samples (will be used to select rows from the DataFrame)
samples = [np.random.choice(a=n_samples, size=n_samples, replace=True) for _ in range(n_B)]
samples

# In[14]:

# show the rows for the first decision tree
train.iloc[samples[0], :]

# Build one tree for each sample

# In[15]:

from sklearn.tree import DecisionTreeRegressor

# grow each tree deep
treereg = DecisionTreeRegressor(max_depth=None, random_state=123)

# DataFrame for storing the predicted price from each tree
y_pred = pd.DataFrame(index=test.index, columns=list(range(n_B)))

# grow one tree for each bootstrap sample and make predictions on the testing data
for i, sample in enumerate(samples):
    X_train = train.iloc[sample, 1:]
    y_train = train.iloc[sample, 0]
    treereg.fit(X_train, y_train)
    y_pred[i] = treereg.predict(X_test)

# In[16]:

y_pred

# Results of each tree

# In[17]:

for i in range(n_B):
    print(i, np.sqrt(mean_squared_error(y_pred[i], y_test)))

# Results of the ensemble

# In[18]:

y_pred.mean(axis=1)

# In[19]:

np.sqrt(mean_squared_error(y_test, y_pred.mean(axis=1)))

# ## Bagged decision trees in scikit-learn (with B=500)

# In[20]:

# define the training and testing sets
X_train = train.iloc[:, 1:]
y_train = train.iloc[:, 0]
X_test = test.iloc[:, 1:]
y_test = test.iloc[:, 0]

# In[21]:

# instruct BaggingRegressor to use DecisionTreeRegressor as the "base estimator"
from sklearn.ensemble import BaggingRegressor
bagreg = BaggingRegressor(DecisionTreeRegressor(), n_estimators=500,
                          bootstrap=True, oob_score=True, random_state=1)

# In[22]:

# fit and predict
bagreg.fit(X_train, y_train)
y_pred = bagreg.predict(X_test)
y_pred

# In[23]:

# calculate RMSE
np.sqrt(mean_squared_error(y_test, y_pred))

# ## Estimating out-of-sample error
# 
# For bagged models, out-of-sample error can be estimated without using **train/test split** or **cross-validation**!
# 
# On average, each bagged tree uses about **two-thirds** of the observations. For each tree, the **remaining observations** are called "out-of-bag" observations.

# In[24]:

# show the first bootstrap sample
samples[0]

# In[25]:

# show the "in-bag" observations for each sample
for sample in samples:
    print(set(sample))

# In[26]:

# show the "out-of-bag" observations for each sample
for sample in samples:
    print(sorted(set(range(n_samples)) - set(sample)))

# How to calculate **"out-of-bag error":**
# 
# 1. For every observation in the training data, predict its response value using **only** the trees in which that observation was out-of-bag. Average those predictions (for regression) or take a vote (for classification).
# 2. Compare all predictions to the actual response values in order to compute the out-of-bag error.
# 
# When B is sufficiently large, the **out-of-bag error** is an accurate estimate of **out-of-sample error**.
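# As a sanity check, here is a minimal sketch (reusing the ten bootstrap `samples` from above) of
# that procedure done by hand: each training observation is predicted only by the trees that did
# not see it during fitting.

# In[ ]:

# collect out-of-bag predictions: one list of predictions per training observation
oob_preds = {i: [] for i in range(n_samples)}

for sample in samples:
    tree = DecisionTreeRegressor(max_depth=None, random_state=123)
    tree.fit(train.iloc[sample, 1:], train.iloc[sample, 0])
    oob_rows = sorted(set(range(n_samples)) - set(sample))
    for row, pred in zip(oob_rows, tree.predict(train.iloc[oob_rows, 1:])):
        oob_preds[row].append(pred)

# average the out-of-bag predictions and compare them with the actual prices
oob_avg = pd.Series({row: np.mean(preds) for row, preds in oob_preds.items() if preds})
actual = train.iloc[list(oob_avg.index), 0]
print(np.sqrt(mean_squared_error(actual, oob_avg)))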
# In[27]:

# compute the out-of-bag R-squared score (not MSE, unfortunately!) for B=500
bagreg.oob_score_

# ## Estimating feature importance
# 
# Bagging increases **predictive accuracy**, but decreases **model interpretability** because it's no longer possible to visualize the tree to understand the importance of each feature.
# 
# However, we can still obtain an overall summary of **feature importance** from bagged models:
# 
# - **Bagged regression trees:** calculate the total amount that **MSE** is decreased due to splits over a given feature, averaged over all trees
# - **Bagged classification trees:** calculate the total amount that **Gini index** is decreased due to splits over a given feature, averaged over all trees

# # Part 4: Random Forests
# 
# Random Forests is a **slight variation of bagged trees** that has even better performance:
# 
# - Exactly like bagging, we create an ensemble of decision trees using bootstrapped samples of the training set.
# - However, when building each tree, each time a split is considered, a **random sample of m features** is chosen as split candidates from the **full set of p features**. The split is only allowed to use **one of those m features**.
# - A new random sample of features is chosen for **every single tree at every single split**.
# - For **classification**, m is typically chosen to be the square root of p.
# - For **regression**, m is typically chosen to be somewhere between p/3 and p.
# 
# What's the point?
# 
# - Suppose there is **one very strong feature** in the data set. When using bagged trees, most of the trees will use that feature as the top split, resulting in an ensemble of similar trees that are **highly correlated**.
# - Averaging highly correlated quantities does not significantly reduce variance (which is the entire goal of bagging).
# - By randomly leaving out candidate features from each split, **Random Forests "decorrelates" the trees**, such that the averaging process can reduce the variance of the resulting model.
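# To make the effect concrete, here is a minimal sketch (reusing the vehicle training and testing
# sets from Part 3): max_features=None considers all features at every split (essentially bagged
# trees), while smaller values decorrelate the trees. On this tiny dataset the differences are
# noisy, so treat the numbers as illustrative only.

# In[ ]:

from sklearn.ensemble import RandomForestRegressor

for m in [None, 2, 1]:
    rf = RandomForestRegressor(n_estimators=500, max_features=m, random_state=1)
    rf.fit(X_train, y_train)
    print(m, np.sqrt(mean_squared_error(y_test, rf.predict(X_test))))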
# # Part 5: Building and tuning decision trees and Random Forests
# 
# - Major League Baseball player data from 1986-87: [data](https://github.com/justmarkham/DAT8/blob/master/data/hitters.csv), [data dictionary](https://cran.r-project.org/web/packages/ISLR/ISLR.pdf) (page 7)
# - Each observation represents a player
# - **Goal:** Predict player salary

# In[28]:

# read in the data
with zipfile.ZipFile('../datasets/hitters.csv.zip', 'r') as z:
    f = z.open('hitters.csv')
    hitters = pd.read_csv(f, sep=',', index_col=False)

# remove rows with missing values
hitters.dropna(inplace=True)
hitters.head()

# In[29]:

# encode categorical variables as integers
hitters['League'] = pd.factorize(hitters.League)[0]
hitters['Division'] = pd.factorize(hitters.Division)[0]
hitters['NewLeague'] = pd.factorize(hitters.NewLeague)[0]
hitters.head()

# In[30]:

# allow plots to appear in the notebook
get_ipython().run_line_magic('matplotlib', 'inline')
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

# In[31]:

# scatter plot of Years versus Hits colored by Salary
hitters.plot(kind='scatter', x='Years', y='Hits', c='Salary', colormap='jet', xlim=(0, 25), ylim=(0, 250))

# In[32]:

# define features: exclude career statistics (which start with "C") and the response (Salary)
feature_cols = hitters.columns[~hitters.columns.str.startswith('C')].drop('Salary')
feature_cols

# In[33]:

# define X and y
X = hitters[feature_cols]
y = hitters.Salary

# ## Predicting salary with a decision tree
# 
# Find the best **max_depth** for a decision tree using cross-validation:

# In[34]:

# list of values to try for max_depth
max_depth_range = range(1, 21)

# list to store the average RMSE for each value of max_depth
RMSE_scores = []

# use 10-fold cross-validation with each value of max_depth
from sklearn.model_selection import cross_val_score
for depth in max_depth_range:
    treereg = DecisionTreeRegressor(max_depth=depth, random_state=1)
    MSE_scores = cross_val_score(treereg, X, y, cv=10, scoring='neg_mean_squared_error')
    RMSE_scores.append(np.mean(np.sqrt(-MSE_scores)))

# In[35]:

# plot max_depth (x-axis) versus RMSE (y-axis)
plt.plot(max_depth_range, RMSE_scores)
plt.xlabel('max_depth')
plt.ylabel('RMSE (lower is better)')

# In[36]:

# show the best RMSE and the corresponding max_depth
sorted(zip(RMSE_scores, max_depth_range))[0]

# In[37]:

# max_depth=2 was best, so fit a tree using that parameter
treereg = DecisionTreeRegressor(max_depth=2, random_state=1)
treereg.fit(X, y)

# In[38]:

# compute feature importances
pd.DataFrame({'feature': feature_cols, 'importance': treereg.feature_importances_}).sort_values('importance')

# ## Predicting salary with a Random Forest

# In[39]:

from sklearn.ensemble import RandomForestRegressor

rfreg = RandomForestRegressor()
rfreg
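# Before tuning each parameter separately in the next two subsections, here is a compact
# alternative (a minimal sketch reusing the X and y defined above): GridSearchCV can search over
# n_estimators and max_features in one pass. The candidate values below are arbitrary choices for
# illustration. (WARNING: SLOW!)

# In[ ]:

from sklearn.model_selection import GridSearchCV

param_grid = {'n_estimators': [50, 150, 300],
              'max_features': [2, 4, 8]}
grid = GridSearchCV(RandomForestRegressor(random_state=1, n_jobs=-1),
                    param_grid, cv=5, scoring='neg_mean_squared_error')
grid.fit(X, y)
print(grid.best_params_, np.sqrt(-grid.best_score_))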
# ### Tuning n_estimators
# 
# One important tuning parameter is **n_estimators**, which is the number of trees that should be grown. It should be a large enough value that the error seems to have "stabilized".

# In[40]:

# list of values to try for n_estimators
estimator_range = range(10, 310, 10)

# list to store the average RMSE for each value of n_estimators
RMSE_scores = []

# use 5-fold cross-validation with each value of n_estimators (WARNING: SLOW!)
for estimator in estimator_range:
    rfreg = RandomForestRegressor(n_estimators=estimator, random_state=1, n_jobs=-1)
    MSE_scores = cross_val_score(rfreg, X, y, cv=5, scoring='neg_mean_squared_error')
    RMSE_scores.append(np.mean(np.sqrt(-MSE_scores)))

# In[41]:

# plot n_estimators (x-axis) versus RMSE (y-axis)
plt.plot(estimator_range, RMSE_scores)
plt.xlabel('n_estimators')
plt.ylabel('RMSE (lower is better)')

# ### Tuning max_features
# 
# The other important tuning parameter is **max_features**, which is the number of features that should be considered at each split.

# In[42]:

# list of values to try for max_features
feature_range = range(1, len(feature_cols) + 1)

# list to store the average RMSE for each value of max_features
RMSE_scores = []

# use 10-fold cross-validation with each value of max_features (WARNING: SLOW!)
for feature in feature_range:
    rfreg = RandomForestRegressor(n_estimators=150, max_features=feature, random_state=1, n_jobs=-1)
    MSE_scores = cross_val_score(rfreg, X, y, cv=10, scoring='neg_mean_squared_error')
    RMSE_scores.append(np.mean(np.sqrt(-MSE_scores)))

# In[43]:

# plot max_features (x-axis) versus RMSE (y-axis)
plt.plot(feature_range, RMSE_scores)
plt.xlabel('max_features')
plt.ylabel('RMSE (lower is better)')

# In[44]:

# show the best RMSE and the corresponding max_features
sorted(zip(RMSE_scores, feature_range))[0]

# ### Fitting a Random Forest with the best parameters

# In[ ]:

# max_features=8 is best and n_estimators=150 is sufficiently large
rfreg = RandomForestRegressor(n_estimators=150, max_features=8, oob_score=True, random_state=1)
rfreg.fit(X, y)

# In[46]:

# compute feature importances
pd.DataFrame({'feature': feature_cols, 'importance': rfreg.feature_importances_}).sort_values('importance')

# In[47]:

# compute the out-of-bag R-squared score
rfreg.oob_score_

# ### Reducing X to its most important features

# In[48]:

# check the shape of X
X.shape

# In[49]:

rfreg

# In[50]:

# set a threshold for which features to include
from sklearn.feature_selection import SelectFromModel
print(SelectFromModel(rfreg, threshold=0.1, prefit=True).transform(X).shape)
print(SelectFromModel(rfreg, threshold='mean', prefit=True).transform(X).shape)
print(SelectFromModel(rfreg, threshold='median', prefit=True).transform(X).shape)

# In[51]:

# create a new feature matrix that only includes important features
X_important = SelectFromModel(rfreg, threshold='mean', prefit=True).transform(X)

# In[53]:

# check the RMSE for a Random Forest that only includes important features
rfreg = RandomForestRegressor(n_estimators=150, max_features=3, random_state=1)
scores = cross_val_score(rfreg, X_important, y, cv=10, scoring='neg_mean_squared_error')
np.mean(np.sqrt(-scores))

# ## Comparing Random Forests with decision trees
# 
# **Advantages of Random Forests:**
# 
# - Performance is competitive with the best supervised learning methods
# - Provides a more reliable estimate of feature importance
# - Allows you to estimate out-of-sample error without using train/test split or cross-validation
# 
# **Disadvantages of Random Forests:**
# 
# - Less interpretable
# - Slower to train
# - Slower to predict
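# As a closing check, here is a minimal sketch (reusing X, y and the tuned parameters from above)
# that compares the cross-validated RMSE of the best single tree with the best Random Forest, to
# help decide which model to use for this problem:

# In[ ]:

for name, model in [('decision tree', DecisionTreeRegressor(max_depth=2, random_state=1)),
                    ('random forest', RandomForestRegressor(n_estimators=150, max_features=8,
                                                            random_state=1, n_jobs=-1))]:
    scores = cross_val_score(model, X, y, cv=10, scoring='neg_mean_squared_error')
    print(name, np.mean(np.sqrt(-scores)))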