#!/usr/bin/env python
# coding: utf-8

# # 07 - Ensemble Methods
# 
# by [Alejandro Correa Bahnsen](http://www.albahnsen.com/) & [Iván Torroledo](http://www.ivantorroledo.com/)
# 
# version 1.3, June 2018
# 
# ## Part of the class [Applied Deep Learning](https://github.com/albahnsen/AppliedDeepLearningClass)
# 
# This notebook is licensed under a [Creative Commons Attribution-ShareAlike 3.0 Unported License](http://creativecommons.org/licenses/by-sa/3.0/deed.en_US). Special thanks go to [Kevin Markham](https://github.com/justmarkham)

# Why are we learning about ensembling?
# 
# - Very popular method for improving the predictive performance of machine learning models
# - Provides a foundation for understanding more sophisticated models

# ## Lesson objectives
# 
# Students will be able to:
# 
# - Define ensembling and its requirements
# - Identify the two basic methods of ensembling
# - Decide whether manual ensembling is a useful approach for a given problem
# - Explain bagging and how it can be applied to decision trees
# - Explain how out-of-bag error and feature importances are calculated from bagged trees
# - Explain the difference between bagged trees and Random Forests
# - Build and tune a Random Forest model in scikit-learn
# - Decide whether a decision tree or a Random Forest is a better model for a given problem

# # Part 1: Introduction

# Ensemble learning is a widely studied topic in the machine learning community. The main idea behind
# the ensemble methodology is to combine several individual base classifiers into a
# classifier that outperforms each of them.
# 
# Nowadays, ensemble methods are one
# of the most popular and well studied machine learning techniques, and it can be
# noted that since 2009 all the first-place and second-place winners of the [KDD-Cup](https://www.sigkdd.org/kddcup/) have used ensemble methods. The core
# principle of ensemble learning is to induce random perturbations into the learning procedure in
# order to produce several different base classifiers from a single training set, and then combine the
# base classifiers to make the final prediction. To induce these random perturbations,
# and thereby create the different base classifiers, several methods have been proposed, in
# particular:
# * bagging
# * pasting
# * random forests
# * random patches
# 
# Finally, after the base classifiers
# are trained, they are typically combined using either:
# * majority voting
# * weighted voting
# * stacking
# 
# There are three main reasons why ensemble
# methods perform better than single models: statistical, computational and representational. First, from a statistical point of view, when the learning set is too
# small, an algorithm can find several good models within the search space that achieve the same
# performance on the training set $\mathcal{S}$. Nevertheless, without a validation set, there is
# a risk of choosing the wrong model. The second reason is computational; in general, algorithms
# rely on some local search optimization and may get stuck in a local optimum. An ensemble may
# alleviate this by combining base classifiers that explore different regions of the search space. The last
# reason is representational. In most cases, for a learning set of finite size, the true function
# $f$ cannot be represented by any of the candidate models. By combining several models in an
# ensemble, it may be possible to obtain a model with a larger coverage across the space of
# representable functions.

# ![s](images/ch9_fig1.png)
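# As a quick illustration of the combination step (a minimal sketch with made-up toy
# predictions), majority voting picks the class chosen by most base classifiers, while
# weighted voting lets more reliable classifiers count for more:

# In[ ]:

import numpy as np

# rows = base classifiers, columns = observations (toy 0/1 class predictions)
base_preds = np.array([[1, 0, 1, 1],
                       [1, 1, 0, 1],
                       [0, 0, 1, 1]])

# majority voting: predict the class chosen by most of the base classifiers
print((base_preds.mean(axis=0) >= 0.5).astype(int))

# weighted voting: e.g. weights proportional to each classifier's (made-up) validation accuracy
weights = np.array([0.8, 0.7, 0.6])
print((np.average(base_preds, axis=0, weights=weights) >= 0.5).astype(int))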
# ## Example
# 
# Let's pretend that instead of building a single model to solve a binary classification problem, you created **five independent models**, and each model was correct about 70% of the time. If you combined these models into an "ensemble" and used their majority vote as a prediction, how often would the ensemble be correct?

# In[1]:

import numpy as np

# set a seed for reproducibility
np.random.seed(1234)

# generate 1000 random numbers (between 0 and 1) for each model, representing 1000 observations
mod1 = np.random.rand(1000)
mod2 = np.random.rand(1000)
mod3 = np.random.rand(1000)
mod4 = np.random.rand(1000)
mod5 = np.random.rand(1000)

# each model independently predicts 1 (the "correct response") if its random number is greater than 0.3
preds1 = np.where(mod1 > 0.3, 1, 0)
preds2 = np.where(mod2 > 0.3, 1, 0)
preds3 = np.where(mod3 > 0.3, 1, 0)
preds4 = np.where(mod4 > 0.3, 1, 0)
preds5 = np.where(mod5 > 0.3, 1, 0)

# print the first 20 predictions from each model
print(preds1[:20])
print(preds2[:20])
print(preds3[:20])
print(preds4[:20])
print(preds5[:20])

# In[2]:

# average the predictions and then round to 0 or 1
ensemble_preds = np.round((preds1 + preds2 + preds3 + preds4 + preds5) / 5.0).astype(int)

# print the ensemble's first 20 predictions
print(ensemble_preds[:20])

# In[3]:

# how accurate was each individual model?
print(preds1.mean())
print(preds2.mean())
print(preds3.mean())
print(preds4.mean())
print(preds5.mean())

# In[4]:

# how accurate was the ensemble?
print(ensemble_preds.mean())

# **Note:** As you add more (independent, better-than-chance) models to the voting process, the probability of error decreases, which is known as [Condorcet's Jury Theorem](http://en.wikipedia.org/wiki/Condorcet%27s_jury_theorem).

# ## What is ensembling?
# 
# **Ensemble learning (or "ensembling")** is the process of combining several predictive models in order to produce a combined model that is more accurate than any individual model.
# 
# - **Regression:** take the average of the predictions
# - **Classification:** take a vote and use the most common prediction, or take the average of the predicted probabilities
# 
# For ensembling to work well, the models must have the following characteristics:
# 
# - **Accurate:** they outperform the null model
# - **Independent:** their predictions are generated using different processes
# 
# **The big idea:** If you have a collection of individually imperfect (and independent) models, the "one-off" mistakes made by each model are probably not going to be made by the rest of the models, and thus the mistakes will be discarded when averaging the models.
# 
# There are two basic **methods for ensembling:**
# 
# - Manually ensemble your individual models
# - Use a model that ensembles for you

# ### Theoretical performance of an ensemble
# 
# If we assume that each one of the $T$ base classifiers has a probability $\rho$ of
# being correct, the probability of an ensemble making the correct decision, assuming independence,
# denoted by $P_c$, can be calculated using the binomial distribution
# 
# $$P_c = \sum_{j>T/2}^{T} {{T}\choose{j}} \rho^j(1-\rho)^{T-j}.$$
# 
# Furthermore, it can be shown that if $T\ge3$:
# 
# $$
# \lim_{T \to \infty} P_c= \begin{cases}
# 1 &\mbox{if } \rho>0.5 \\
# 0 &\mbox{if } \rho<0.5 \\
# 0.5 &\mbox{if } \rho=0.5 ,
# \end{cases}
# $$
# leading to the conclusion that
# $$
# \rho \ge 0.5 \quad \text{and} \quad T\ge3 \quad \Rightarrow \quad P_c\ge \rho.
# $$
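# To make the formula concrete, here is a minimal sketch (assuming SciPy is available) that
# evaluates $P_c$ directly. With $T=5$ and $\rho=0.7$ it gives roughly 0.84, close to the
# ensemble accuracy simulated in the Example above, and the value grows as $T$ increases.

# In[ ]:

from scipy.special import comb

def ensemble_accuracy(T, rho):
    # probability that more than half of the T independent base classifiers are correct
    return sum(comb(T, j) * rho**j * (1 - rho)**(T - j) for j in range(T // 2 + 1, T + 1))

for T in [5, 25, 101]:
    print(T, ensemble_accuracy(T, 0.7))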
# # Part 2: Manual ensembling
# 
# What makes a good manual ensemble?
# 
# - Different types of **models**
# - Different combinations of **features**
# - Different **tuning parameters**

# ![Machine learning flowchart](https://raw.githubusercontent.com/justmarkham/DAT8/master/notebooks/images/crowdflower_ensembling.jpg)
# 
# *Machine learning flowchart created by the [winner](https://github.com/ChenglongChen/Kaggle_CrowdFlower) of Kaggle's [CrowdFlower competition](https://www.kaggle.com/c/crowdflower-search-relevance)*

# In[5]:

# read in and prepare the vehicle training data
import zipfile
import pandas as pd

with zipfile.ZipFile('../datasets/vehicles_train.csv.zip', 'r') as z:
    f = z.open('vehicles_train.csv')
    train = pd.read_csv(f, sep=',', index_col=False)
train['vtype'] = train.vtype.map({'car': 0, 'truck': 1})

# read in and prepare the vehicle testing data
with zipfile.ZipFile('../datasets/vehicles_test.csv.zip', 'r') as z:
    f = z.open('vehicles_test.csv')
    test = pd.read_csv(f, sep=',', index_col=False)
test['vtype'] = test.vtype.map({'car': 0, 'truck': 1})

# In[6]:

train.head()

# ### Train different models

# In[7]:

from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsRegressor

# note: GaussianNB is a classifier, so here it treats each distinct price as a class
models = {'lr': LinearRegression(),
          'dt': DecisionTreeRegressor(),
          'nb': GaussianNB(),
          'nn': KNeighborsRegressor()}

# In[8]:

# train all the models
X_train = train.iloc[:, 1:]
X_test = test.iloc[:, 1:]
y_train = train.price
y_test = test.price

for model in models.keys():
    models[model].fit(X_train, y_train)

# In[9]:

# predict the test set with each model
y_pred = pd.DataFrame(index=test.index, columns=list(models.keys()))
for model in models.keys():
    y_pred[model] = models[model].predict(X_test)

# In[10]:

# evaluate each model
from sklearn.metrics import mean_squared_error

for model in models.keys():
    print(model, np.sqrt(mean_squared_error(y_pred[model], y_test)))

# ### Evaluate the error of the mean of the predictions

# In[11]:

np.sqrt(mean_squared_error(y_pred.mean(axis=1), y_test))

# ## Comparing manual ensembling with a single model approach
# 
# **Advantages of manual ensembling:**
# 
# - Increases predictive accuracy
# - Easy to get started
# 
# **Disadvantages of manual ensembling:**
# 
# - Decreases interpretability
# - Takes longer to train
# - Takes longer to predict
# - More complex to automate and maintain
# - Small gains in accuracy may not be worth the added complexity

# # Part 3: Bagging
# 
# The primary weakness of **decision trees** is that they don't tend to have the best predictive accuracy. This is partially due to **high variance**, meaning that different splits in the training data can lead to very different trees.
# 
# **Bagging** is a general purpose procedure for reducing the variance of a machine learning method, but is particularly useful for decision trees. Bagging is short for **bootstrap aggregation**, meaning the aggregation of bootstrap samples.
# 
# What is a **bootstrap sample**? A random sample with replacement:

# In[12]:

# set a seed for reproducibility
np.random.seed(1)

# create an array of 1 through 20
nums = np.arange(1, 21)
print(nums)

# sample that array 20 times with replacement
print(np.random.choice(a=nums, size=20, replace=True))
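# A quick side note (a minimal sketch): because a bootstrap sample draws with replacement,
# it contains on average only about 63% of the distinct original observations (roughly $1 - 1/e$).
# The left-out observations are the "out-of-bag" observations used later to estimate out-of-sample error.

# In[ ]:

# fraction of distinct observations that appear in each of 100 bootstrap samples of size 1000
n = 1000
fractions = [len(set(np.random.choice(a=n, size=n, replace=True))) / n for _ in range(100)]
print(np.mean(fractions))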
# **How does bagging work (for decision trees)?**
# 
# 1. Grow B trees using B bootstrap samples from the training data.
# 2. Train each tree on its bootstrap sample and make predictions.
# 3. Combine the predictions:
#     - Average the predictions for **regression trees**
#     - Take a vote for **classification trees**
# 
# Notes:
# 
# - **Each bootstrap sample** should be the same size as the original training set.
# - **B** should be a large enough value that the error seems to have "stabilized".
# - The trees are **grown deep** so that they have low bias/high variance.
# 
# Bagging increases predictive accuracy by **reducing the variance**, similar to how cross-validation reduces the variance associated with train/test split (for estimating out-of-sample error) by splitting many times and averaging the results.

# In[13]:

# set a seed for reproducibility
np.random.seed(123)

n_samples = train.shape[0]
n_B = 10

# create ten bootstrap samples (will be used to select rows from the DataFrame)
samples = [np.random.choice(a=n_samples, size=n_samples, replace=True) for _ in range(n_B)]
samples

# In[14]:

# show the rows for the first decision tree
train.iloc[samples[0], :]

# Build one tree for each sample

# In[15]:

from sklearn.tree import DecisionTreeRegressor

# grow each tree deep
treereg = DecisionTreeRegressor(max_depth=None, random_state=123)

# DataFrame for storing the predicted price from each tree
y_pred = pd.DataFrame(index=test.index, columns=list(range(n_B)))

# grow one tree for each bootstrap sample and make predictions on the testing data
for i, sample in enumerate(samples):
    X_train = train.iloc[sample, 1:]
    y_train = train.iloc[sample, 0]
    treereg.fit(X_train, y_train)
    y_pred[i] = treereg.predict(X_test)

# In[16]:

y_pred

# Results of each tree

# In[17]:

for i in range(n_B):
    print(i, np.sqrt(mean_squared_error(y_pred[i], y_test)))

# Results of the ensemble

# In[18]:

y_pred.mean(axis=1)

# In[19]:

np.sqrt(mean_squared_error(y_test, y_pred.mean(axis=1)))

# ## Bagged decision trees in scikit-learn (with B=500)

# In[20]:

# define the training and testing sets
X_train = train.iloc[:, 1:]
y_train = train.iloc[:, 0]
X_test = test.iloc[:, 1:]
y_test = test.iloc[:, 0]

# In[21]:

# instruct BaggingRegressor to use DecisionTreeRegressor as the "base estimator"
from sklearn.ensemble import BaggingRegressor
bagreg = BaggingRegressor(DecisionTreeRegressor(), n_estimators=500,
                          bootstrap=True, oob_score=True, random_state=1)

# In[22]:

# fit and predict
bagreg.fit(X_train, y_train)
y_pred = bagreg.predict(X_test)
y_pred

# In[23]:

# calculate RMSE
np.sqrt(mean_squared_error(y_test, y_pred))

# ## Estimating out-of-sample error
# 
# For bagged models, out-of-sample error can be estimated without using **train/test split** or **cross-validation**!
# 
# On average, each bagged tree uses about **two-thirds** of the observations. For each tree, the **remaining observations** are called "out-of-bag" observations.

# In[24]:

# show the first bootstrap sample
samples[0]

# In[25]:

# show the "in-bag" observations for each sample
for sample in samples:
    print(set(sample))

# In[26]:

# show the "out-of-bag" observations for each sample
for sample in samples:
    print(sorted(set(range(n_samples)) - set(sample)))

# How to calculate **"out-of-bag error":**
# 
# 1. For every observation in the training data, predict its response value using **only** the trees in which that observation was out-of-bag. Average those predictions (for regression) or take a vote (for classification).
# 2. Compare all predictions to the actual response values in order to compute the out-of-bag error.
# 
# When B is sufficiently large, the **out-of-bag error** is an accurate estimate of **out-of-sample error**.
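# As a sanity check, here is a minimal sketch (reusing the ten bootstrap `samples` from above) of
# that procedure done by hand: each training observation is predicted only by the trees that did
# not see it during fitting.

# In[ ]:

# collect out-of-bag predictions: one list of predictions per training observation
oob_preds = {i: [] for i in range(n_samples)}

for sample in samples:
    tree = DecisionTreeRegressor(max_depth=None, random_state=123)
    tree.fit(train.iloc[sample, 1:], train.iloc[sample, 0])
    oob_rows = sorted(set(range(n_samples)) - set(sample))
    for row, pred in zip(oob_rows, tree.predict(train.iloc[oob_rows, 1:])):
        oob_preds[row].append(pred)

# average the out-of-bag predictions and compare them with the actual prices
oob_avg = pd.Series({row: np.mean(preds) for row, preds in oob_preds.items() if preds})
actual = train.iloc[list(oob_avg.index), 0]
print(np.sqrt(mean_squared_error(actual, oob_avg)))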
# In[27]:

# compute the out-of-bag R-squared score (not MSE, unfortunately!) for B=500
bagreg.oob_score_

# ## Estimating feature importance
# 
# Bagging increases **predictive accuracy**, but decreases **model interpretability** because it's no longer possible to visualize the tree to understand the importance of each feature.
# 
# However, we can still obtain an overall summary of **feature importance** from bagged models:
# 
# - **Bagged regression trees:** calculate the total amount that **MSE** is decreased due to splits over a given feature, averaged over all trees
# - **Bagged classification trees:** calculate the total amount that **Gini index** is decreased due to splits over a given feature, averaged over all trees

# # Part 4: Random Forests
# 
# Random Forests is a **slight variation of bagged trees** that has even better performance:
# 
# - Exactly like bagging, we create an ensemble of decision trees using bootstrapped samples of the training set.
# - However, when building each tree, each time a split is considered, a **random sample of m features** is chosen as split candidates from the **full set of p features**. The split is only allowed to use **one of those m features**.
# - A new random sample of features is chosen for **every single tree at every single split**.
# - For **classification**, m is typically chosen to be the square root of p.
# - For **regression**, m is typically chosen to be somewhere between p/3 and p.
# 
# What's the point?
# 
# - Suppose there is **one very strong feature** in the data set. When using bagged trees, most of the trees will use that feature as the top split, resulting in an ensemble of similar trees that are **highly correlated**.
# - Averaging highly correlated quantities does not significantly reduce variance (which is the entire goal of bagging).
# - By randomly leaving out candidate features from each split, **Random Forests "decorrelates" the trees**, such that the averaging process can reduce the variance of the resulting model.
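# To make the effect concrete, here is a minimal sketch (reusing the vehicle training and testing
# sets from Part 3): max_features=None considers all features at every split (essentially bagged
# trees), while smaller values decorrelate the trees. On this tiny dataset the differences are
# noisy, so treat the numbers as illustrative only.

# In[ ]:

from sklearn.ensemble import RandomForestRegressor

for m in [None, 2, 1]:
    rf = RandomForestRegressor(n_estimators=500, max_features=m, random_state=1)
    rf.fit(X_train, y_train)
    print(m, np.sqrt(mean_squared_error(y_test, rf.predict(X_test))))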
# # Part 5: Building and tuning decision trees and Random Forests
# 
# - Major League Baseball player data from 1986-87: [data](https://github.com/justmarkham/DAT8/blob/master/data/hitters.csv), [data dictionary](https://cran.r-project.org/web/packages/ISLR/ISLR.pdf) (page 7)
# - Each observation represents a player
# - **Goal:** Predict player salary

# In[28]:

# read in the data
with zipfile.ZipFile('../datasets/hitters.csv.zip', 'r') as z:
    f = z.open('hitters.csv')
    hitters = pd.read_csv(f, sep=',', index_col=False)

# remove rows with missing values
hitters.dropna(inplace=True)
hitters.head()

# In[29]:

# encode categorical variables as integers
hitters['League'] = pd.factorize(hitters.League)[0]
hitters['Division'] = pd.factorize(hitters.Division)[0]
hitters['NewLeague'] = pd.factorize(hitters.NewLeague)[0]
hitters.head()

# In[30]:

# allow plots to appear in the notebook
get_ipython().run_line_magic('matplotlib', 'inline')
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

# In[31]:

# scatter plot of Years versus Hits colored by Salary
hitters.plot(kind='scatter', x='Years', y='Hits', c='Salary', colormap='jet', xlim=(0, 25), ylim=(0, 250))

# In[32]:

# define features: exclude career statistics (which start with "C") and the response (Salary)
feature_cols = hitters.columns[~hitters.columns.str.startswith('C')].drop('Salary')
feature_cols

# In[33]:

# define X and y
X = hitters[feature_cols]
y = hitters.Salary

# ## Predicting salary with a decision tree
# 
# Find the best **max_depth** for a decision tree using cross-validation:

# In[34]:

# list of values to try for max_depth
max_depth_range = range(1, 21)

# list to store the average RMSE for each value of max_depth
RMSE_scores = []

# use 10-fold cross-validation with each value of max_depth
from sklearn.model_selection import cross_val_score
for depth in max_depth_range:
    treereg = DecisionTreeRegressor(max_depth=depth, random_state=1)
    MSE_scores = cross_val_score(treereg, X, y, cv=10, scoring='neg_mean_squared_error')
    RMSE_scores.append(np.mean(np.sqrt(-MSE_scores)))

# In[35]:

# plot max_depth (x-axis) versus RMSE (y-axis)
plt.plot(max_depth_range, RMSE_scores)
plt.xlabel('max_depth')
plt.ylabel('RMSE (lower is better)')

# In[36]:

# show the best RMSE and the corresponding max_depth
sorted(zip(RMSE_scores, max_depth_range))[0]

# In[37]:

# max_depth=2 was best, so fit a tree using that parameter
treereg = DecisionTreeRegressor(max_depth=2, random_state=1)
treereg.fit(X, y)

# In[38]:

# compute feature importances
pd.DataFrame({'feature': feature_cols, 'importance': treereg.feature_importances_}).sort_values('importance')

# ## Predicting salary with a Random Forest

# In[39]:

from sklearn.ensemble import RandomForestRegressor

rfreg = RandomForestRegressor()
rfreg
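# Before tuning each parameter separately in the next two subsections, here is a compact
# alternative (a minimal sketch reusing the X and y defined above): GridSearchCV can search over
# n_estimators and max_features in one pass. The candidate values below are arbitrary choices for
# illustration. (WARNING: SLOW!)

# In[ ]:

from sklearn.model_selection import GridSearchCV

param_grid = {'n_estimators': [50, 150, 300],
              'max_features': [2, 4, 8]}
grid = GridSearchCV(RandomForestRegressor(random_state=1, n_jobs=-1),
                    param_grid, cv=5, scoring='neg_mean_squared_error')
grid.fit(X, y)
print(grid.best_params_, np.sqrt(-grid.best_score_))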
# ### Tuning n_estimators
# 
# One important tuning parameter is **n_estimators**, which is the number of trees that should be grown. It should be a large enough value that the error seems to have "stabilized".

# In[40]:

# list of values to try for n_estimators
estimator_range = range(10, 310, 10)

# list to store the average RMSE for each value of n_estimators
RMSE_scores = []

# use 5-fold cross-validation with each value of n_estimators (WARNING: SLOW!)
for estimator in estimator_range:
    rfreg = RandomForestRegressor(n_estimators=estimator, random_state=1, n_jobs=-1)
    MSE_scores = cross_val_score(rfreg, X, y, cv=5, scoring='neg_mean_squared_error')
    RMSE_scores.append(np.mean(np.sqrt(-MSE_scores)))

# In[41]:

# plot n_estimators (x-axis) versus RMSE (y-axis)
plt.plot(estimator_range, RMSE_scores)
plt.xlabel('n_estimators')
plt.ylabel('RMSE (lower is better)')

# ### Tuning max_features
# 
# The other important tuning parameter is **max_features**, which is the number of features that should be considered at each split.

# In[42]:

# list of values to try for max_features
feature_range = range(1, len(feature_cols) + 1)

# list to store the average RMSE for each value of max_features
RMSE_scores = []

# use 10-fold cross-validation with each value of max_features (WARNING: SLOW!)
for feature in feature_range:
    rfreg = RandomForestRegressor(n_estimators=150, max_features=feature, random_state=1, n_jobs=-1)
    MSE_scores = cross_val_score(rfreg, X, y, cv=10, scoring='neg_mean_squared_error')
    RMSE_scores.append(np.mean(np.sqrt(-MSE_scores)))

# In[43]:

# plot max_features (x-axis) versus RMSE (y-axis)
plt.plot(feature_range, RMSE_scores)
plt.xlabel('max_features')
plt.ylabel('RMSE (lower is better)')

# In[44]:

# show the best RMSE and the corresponding max_features
sorted(zip(RMSE_scores, feature_range))[0]

# ### Fitting a Random Forest with the best parameters

# In[ ]:

# max_features=8 is best and n_estimators=150 is sufficiently large
rfreg = RandomForestRegressor(n_estimators=150, max_features=8, oob_score=True, random_state=1)
rfreg.fit(X, y)

# In[46]:

# compute feature importances
pd.DataFrame({'feature': feature_cols, 'importance': rfreg.feature_importances_}).sort_values('importance')

# In[47]:

# compute the out-of-bag R-squared score
rfreg.oob_score_

# ### Reducing X to its most important features

# In[48]:

# check the shape of X
X.shape

# In[49]:

rfreg

# In[50]:

# set a threshold for which features to include
from sklearn.feature_selection import SelectFromModel
print(SelectFromModel(rfreg, threshold=0.1, prefit=True).transform(X).shape)
print(SelectFromModel(rfreg, threshold='mean', prefit=True).transform(X).shape)
print(SelectFromModel(rfreg, threshold='median', prefit=True).transform(X).shape)

# In[51]:

# create a new feature matrix that only includes important features
X_important = SelectFromModel(rfreg, threshold='mean', prefit=True).transform(X)

# In[53]:

# check the RMSE for a Random Forest that only includes important features
rfreg = RandomForestRegressor(n_estimators=150, max_features=3, random_state=1)
scores = cross_val_score(rfreg, X_important, y, cv=10, scoring='neg_mean_squared_error')
np.mean(np.sqrt(-scores))

# ## Comparing Random Forests with decision trees
# 
# **Advantages of Random Forests:**
# 
# - Performance is competitive with the best supervised learning methods
# - Provides a more reliable estimate of feature importance
# - Allows you to estimate out-of-sample error without using train/test split or cross-validation
# 
# **Disadvantages of Random Forests:**
# 
# - Less interpretable
# - Slower to train
# - Slower to predict
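# As a closing check, here is a minimal sketch (reusing X, y and the tuned parameters from above)
# that compares the cross-validated RMSE of the best single tree with the best Random Forest, to
# help decide which model to use for this problem:

# In[ ]:

for name, model in [('decision tree', DecisionTreeRegressor(max_depth=2, random_state=1)),
                    ('random forest', RandomForestRegressor(n_estimators=150, max_features=8,
                                                            random_state=1, n_jobs=-1))]:
    scores = cross_val_score(model, X, y, cv=10, scoring='neg_mean_squared_error')
    print(name, np.mean(np.sqrt(-scores)))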