#!/usr/bin/env python
# coding: utf-8

# # Ensembling
# 
# *Adapted from Chapter 8 of [An Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/)*

# Why are we learning about ensembling?
# 
# - Very popular method for improving the predictive performance of machine learning models
# - Provides a foundation for understanding more sophisticated models

# ## Lesson objectives
# 
# Students will be able to:
# 
# - Define ensembling and its requirements
# - Identify the two basic methods of ensembling
# - Decide whether manual ensembling is a useful approach for a given problem
# - Explain bagging and how it can be applied to decision trees
# - Explain how out-of-bag error and feature importances are calculated from bagged trees
# - Explain the difference between bagged trees and Random Forests
# - Build and tune a Random Forest model in scikit-learn
# - Decide whether a decision tree or a Random Forest is a better model for a given problem

# # Part 1: Introduction
# 
# Let's pretend that instead of building a single model to solve a binary classification problem, you created **five independent models**, and each model was correct about 70% of the time. If you combined these models into an "ensemble" and used their majority vote as a prediction, how often would the ensemble be correct?

# In[1]:

import numpy as np

# set a seed for reproducibility
np.random.seed(1234)

# generate 1000 random numbers (between 0 and 1) for each model, representing 1000 observations
mod1 = np.random.rand(1000)
mod2 = np.random.rand(1000)
mod3 = np.random.rand(1000)
mod4 = np.random.rand(1000)
mod5 = np.random.rand(1000)

# each model independently predicts 1 (the "correct response") if its random number is greater than 0.3
preds1 = np.where(mod1 > 0.3, 1, 0)
preds2 = np.where(mod2 > 0.3, 1, 0)
preds3 = np.where(mod3 > 0.3, 1, 0)
preds4 = np.where(mod4 > 0.3, 1, 0)
preds5 = np.where(mod5 > 0.3, 1, 0)

# print the first 20 predictions from each model
print(preds1[:20])
print(preds2[:20])
print(preds3[:20])
print(preds4[:20])
print(preds5[:20])

# In[2]:

# average the predictions and then round to 0 or 1
ensemble_preds = np.round((preds1 + preds2 + preds3 + preds4 + preds5)/5.0).astype(int)

# print the ensemble's first 20 predictions
print(ensemble_preds[:20])

# In[3]:

# how accurate was each individual model?
print(preds1.mean())
print(preds2.mean())
print(preds3.mean())
print(preds4.mean())
print(preds5.mean())

# In[4]:

# how accurate was the ensemble?
print(ensemble_preds.mean())

# **Note:** As you add more models to the voting process, the probability of error decreases, which is known as [Condorcet's Jury Theorem](http://en.wikipedia.org/wiki/Condorcet%27s_jury_theorem).
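# The expected accuracy of the majority vote can also be computed exactly. The short check below is not part of the original lesson; it sums the binomial probabilities that at least 3 of the 5 independent models (each correct with probability 0.7) are correct, which should land near the simulated ensemble accuracy above.

# In[ ]:

# a minimal sketch: exact majority-vote accuracy for 5 independent models that are each 70% accurate
from math import factorial

def choose(n, k):
    return factorial(n) // (factorial(k) * factorial(n - k))

p = 0.7
# P(at least 3 of the 5 models are correct) -- roughly 0.837
print(sum(choose(5, k) * p**k * (1 - p)**(5 - k) for k in range(3, 6)))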
# ## What is ensembling?
# 
# **Ensemble learning (or "ensembling")** is the process of combining several predictive models in order to produce a combined model that is more accurate than any individual model.
# 
# - **Regression:** take the average of the predictions
# - **Classification:** take a vote and use the most common prediction, or take the average of the predicted probabilities
# 
# For ensembling to work well, the models must have the following characteristics:
# 
# - **Accurate:** they outperform the null model
# - **Independent:** their predictions are generated using different processes
# 
# **The big idea:** If you have a collection of individually imperfect (and independent) models, the "one-off" mistakes made by each model are probably not going to be made by the rest of the models, and thus the mistakes will be discarded when averaging the models.
# 
# There are two basic **methods for ensembling:**
# 
# - Manually ensemble your individual models
# - Use a model that ensembles for you

# # Part 2: Manual ensembling
# 
# What makes a good manual ensemble?
# 
# - Different types of **models**
# - Different combinations of **features**
# - Different **tuning parameters**

# ![Machine learning flowchart](images/crowdflower_ensembling.jpg)
# 
# *Machine learning flowchart created by the [winner](https://github.com/ChenglongChen/Kaggle_CrowdFlower) of Kaggle's [CrowdFlower competition](https://www.kaggle.com/c/crowdflower-search-relevance)*

# ## Comparing manual ensembling with a single model approach
# 
# **Advantages of manual ensembling:**
# 
# - Increases predictive accuracy
# - Easy to get started
# 
# **Disadvantages of manual ensembling:**
# 
# - Decreases interpretability
# - Takes longer to train
# - Takes longer to predict
# - More complex to automate and maintain
# - Small gains in accuracy may not be worth the added complexity

# # Part 3: Bagging
# 
# The primary weakness of **decision trees** is that they don't tend to have the best predictive accuracy. This is partially due to **high variance**, meaning that different splits in the training data can lead to very different trees.
# 
# **Bagging** is a general purpose procedure for reducing the variance of a machine learning method, and is particularly useful for decision trees. Bagging is short for **bootstrap aggregation**, meaning the aggregation of bootstrap samples.
# 
# What is a **bootstrap sample**? A random sample with replacement:

# In[5]:

# set a seed for reproducibility
np.random.seed(1)

# create an array of 1 through 20
nums = np.arange(1, 21)
print(nums)

# sample that array 20 times with replacement
print(np.random.choice(a=nums, size=20, replace=True))

# **How does bagging work (for decision trees)?**
# 
# 1. Grow B trees using B bootstrap samples from the training data.
# 2. Train each tree on its bootstrap sample and make predictions.
# 3. Combine the predictions:
#     - Average the predictions for **regression trees**
#     - Take a vote for **classification trees**
# 
# Notes:
# 
# - **Each bootstrap sample** should be the same size as the original training set.
# - **B** should be a large enough value that the error seems to have "stabilized".
# - The trees are **grown deep** so that they have low bias/high variance.
# 
# Bagging increases predictive accuracy by **reducing the variance**, similar to how cross-validation reduces the variance associated with train/test split (for estimating out-of-sample error) by splitting many times and averaging the results.
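# One property of bootstrap samples that matters later (for out-of-bag error) is that each sample contains only about two-thirds of the distinct training observations. The quick check below is not part of the original lesson; the sample size of 1000 is arbitrary.

# In[ ]:

# a minimal sketch: fraction of distinct observations that appear in one bootstrap sample
np.random.seed(1)
n_obs = 1000
boot = np.random.choice(a=n_obs, size=n_obs, replace=True)
print(len(set(boot)) / float(n_obs))    # simulated fraction of distinct observations
print(1 - (1 - 1.0/n_obs)**n_obs)       # theoretical fraction, which approaches 1 - 1/e (about 0.632)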
# ## Manually implementing bagged decision trees (with B=10)

# In[6]:

# read in and prepare the vehicle training data
import pandas as pd
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/vehicles_train.csv'
train = pd.read_csv(url)
train['vtype'] = train.vtype.map({'car':0, 'truck':1})
train

# In[7]:

# set a seed for reproducibility
np.random.seed(123)

# create ten bootstrap samples (will be used to select rows from the DataFrame)
samples = [np.random.choice(a=14, size=14, replace=True) for _ in range(1, 11)]
samples

# In[8]:

# show the rows for the first decision tree
train.iloc[samples[0], :]

# In[9]:

# read in and prepare the vehicle testing data
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/vehicles_test.csv'
test = pd.read_csv(url)
test['vtype'] = test.vtype.map({'car':0, 'truck':1})
test

# In[10]:

from sklearn.tree import DecisionTreeRegressor

# grow each tree deep
treereg = DecisionTreeRegressor(max_depth=None, random_state=123)

# list for storing predicted price from each tree
predictions = []

# define testing data
X_test = test.iloc[:, 1:]
y_test = test.iloc[:, 0]

# grow one tree for each bootstrap sample and make predictions on testing data
for sample in samples:
    X_train = train.iloc[sample, 1:]
    y_train = train.iloc[sample, 0]
    treereg.fit(X_train, y_train)
    y_pred = treereg.predict(X_test)
    predictions.append(y_pred)

# convert predictions from list to NumPy array
predictions = np.array(predictions)
predictions

# In[11]:

# average predictions
np.mean(predictions, axis=0)

# In[12]:

# calculate RMSE
from sklearn import metrics
y_pred = np.mean(predictions, axis=0)
np.sqrt(metrics.mean_squared_error(y_test, y_pred))

# ## Bagged decision trees in scikit-learn (with B=500)

# In[13]:

# define the training and testing sets
X_train = train.iloc[:, 1:]
y_train = train.iloc[:, 0]
X_test = test.iloc[:, 1:]
y_test = test.iloc[:, 0]

# In[14]:

# instruct BaggingRegressor to use DecisionTreeRegressor as the "base estimator"
from sklearn.ensemble import BaggingRegressor
bagreg = BaggingRegressor(DecisionTreeRegressor(), n_estimators=500, bootstrap=True, oob_score=True, random_state=1)

# In[15]:

# fit and predict
bagreg.fit(X_train, y_train)
y_pred = bagreg.predict(X_test)
y_pred

# In[16]:

# calculate RMSE
np.sqrt(metrics.mean_squared_error(y_test, y_pred))

# ## Estimating out-of-sample error
# 
# For bagged models, out-of-sample error can be estimated without using **train/test split** or **cross-validation**!
# 
# On average, each bagged tree uses about **two-thirds** of the observations. For each tree, the **remaining observations** are called "out-of-bag" observations.

# In[17]:

# show the first bootstrap sample
samples[0]

# In[18]:

# show the "in-bag" observations for each sample
for sample in samples:
    print(set(sample))

# In[19]:

# show the "out-of-bag" observations for each sample
for sample in samples:
    print(sorted(set(range(14)) - set(sample)))

# How to calculate **"out-of-bag error":**
# 
# 1. For every observation in the training data, predict its response value using **only** the trees in which that observation was out-of-bag. Average those predictions (for regression) or take a vote (for classification).
# 2. Compare all predictions to the actual response values in order to compute the out-of-bag error.
# 
# When B is sufficiently large, the **out-of-bag error** is an accurate estimate of **out-of-sample error**.
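# To make this procedure concrete, here is a minimal sketch (not part of the original lesson) that applies it to the ten manually bagged trees from above: each training observation is predicted using only the trees whose bootstrap sample excluded it, and those out-of-bag predictions are compared to the actual prices.

# In[ ]:

# a minimal sketch: out-of-bag RMSE for the ten manually bagged trees
# refit one tree per bootstrap sample, keeping the fitted trees this time
trees = []
for sample in samples:
    tree = DecisionTreeRegressor(max_depth=None, random_state=123)
    tree.fit(train.iloc[sample, 1:], train.iloc[sample, 0])
    trees.append(tree)

# for each training observation, average the predictions of the trees that did NOT see it
oob_preds = []
oob_actuals = []
for i in range(len(train)):
    preds = [tree.predict(train.iloc[[i], 1:])[0]
             for tree, sample in zip(trees, samples) if i not in sample]
    if preds:
        oob_preds.append(np.mean(preds))
        oob_actuals.append(train.iloc[i, 0])

# out-of-bag RMSE (a noisy estimate here, since B=10 is very small)
print(np.sqrt(metrics.mean_squared_error(oob_actuals, oob_preds)))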
# In[20]:

# compute the out-of-bag R-squared score (not MSE, unfortunately!) for B=500
bagreg.oob_score_

# ## Estimating feature importance
# 
# Bagging increases **predictive accuracy**, but decreases **model interpretability** because it's no longer possible to visualize the tree to understand the importance of each feature.
# 
# However, we can still obtain an overall summary of **feature importance** from bagged models:
# 
# - **Bagged regression trees:** calculate the total amount that **MSE** is decreased due to splits over a given feature, averaged over all trees
# - **Bagged classification trees:** calculate the total amount that **Gini index** is decreased due to splits over a given feature, averaged over all trees
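# For the bagged regression trees fit above, one way to get such a summary is to average the importances of the individual trees, which BaggingRegressor stores in its estimators_ attribute. This is a minimal sketch, not part of the original lesson.

# In[ ]:

# a minimal sketch: average the feature importances of the 500 trees inside the fitted BaggingRegressor
bagged_importances = np.mean([tree.feature_importances_ for tree in bagreg.estimators_], axis=0)
pd.DataFrame({'feature':X_train.columns, 'importance':bagged_importances})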
# # Part 4: Random Forests
# 
# Random Forests are a **slight variation of bagged trees** with even better performance:
# 
# - Exactly like bagging, we create an ensemble of decision trees using bootstrapped samples of the training set.
# - However, when building each tree, each time a split is considered, a **random sample of m features** is chosen as split candidates from the **full set of p features**. The split is only allowed to use **one of those m features**.
# - A new random sample of features is chosen for **every single tree at every single split**.
# - For **classification**, m is typically chosen to be the square root of p.
# - For **regression**, m is typically chosen to be somewhere between p/3 and p.
# 
# What's the point?
# 
# - Suppose there is **one very strong feature** in the data set. When using bagged trees, most of the trees will use that feature as the top split, resulting in an ensemble of similar trees that are **highly correlated**.
# - Averaging highly correlated quantities does not significantly reduce variance (which is the entire goal of bagging).
# - By randomly leaving out candidate features from each split, **Random Forests "decorrelate" the trees**, such that the averaging process can reduce the variance of the resulting model.
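# In scikit-learn, m corresponds to the **max_features** parameter of the Random Forest estimators. The sketch below is not part of the original lesson; it simply shows how the heuristics above could be set explicitly, and the n_estimators value is arbitrary.

# In[ ]:

# a minimal sketch: mapping the "m out of p features" heuristics onto scikit-learn's max_features
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# classification: consider roughly sqrt(p) features at each split
rfclf_sketch = RandomForestClassifier(n_estimators=100, max_features='sqrt', random_state=1)

# regression: consider roughly one-third of the features at each split (a float is treated as a fraction of p)
rfreg_sketch = RandomForestRegressor(n_estimators=100, max_features=1/3.0, random_state=1)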
# # Part 5: Building and tuning decision trees and Random Forests
# 
# - Major League Baseball player data from 1986-87: [data](https://github.com/justmarkham/DAT8/blob/master/data/hitters.csv), [data dictionary](https://cran.r-project.org/web/packages/ISLR/ISLR.pdf) (page 7)
# - Each observation represents a player
# - **Goal:** Predict player salary

# ## Preparing the data

# In[21]:

# read in the data
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/hitters.csv'
hitters = pd.read_csv(url)

# remove rows with missing values
hitters.dropna(inplace=True)

# In[22]:

hitters.head()

# In[23]:

# encode categorical variables as integers
hitters['League'] = pd.factorize(hitters.League)[0]
hitters['Division'] = pd.factorize(hitters.Division)[0]
hitters['NewLeague'] = pd.factorize(hitters.NewLeague)[0]
hitters.head()

# In[24]:

# allow plots to appear in the notebook
get_ipython().run_line_magic('matplotlib', 'inline')
import matplotlib.pyplot as plt

# In[25]:

# scatter plot of Years versus Hits colored by Salary
hitters.plot(kind='scatter', x='Years', y='Hits', c='Salary', colormap='jet', xlim=(0, 25), ylim=(0, 250))

# In[26]:

# define features: exclude career statistics (which start with "C") and the response (Salary)
feature_cols = hitters.columns[hitters.columns.str.startswith('C') == False].drop('Salary')
feature_cols

# In[27]:

# define X and y
X = hitters[feature_cols]
y = hitters.Salary

# ## Predicting salary with a decision tree
# 
# Find the best **max_depth** for a decision tree using cross-validation:

# In[28]:

# list of values to try for max_depth
max_depth_range = range(1, 21)

# list to store the average RMSE for each value of max_depth
RMSE_scores = []

# use 10-fold cross-validation with each value of max_depth
from sklearn.cross_validation import cross_val_score
for depth in max_depth_range:
    treereg = DecisionTreeRegressor(max_depth=depth, random_state=1)
    MSE_scores = cross_val_score(treereg, X, y, cv=10, scoring='mean_squared_error')
    RMSE_scores.append(np.mean(np.sqrt(-MSE_scores)))

# In[29]:

# plot max_depth (x-axis) versus RMSE (y-axis)
plt.plot(max_depth_range, RMSE_scores)
plt.xlabel('max_depth')
plt.ylabel('RMSE (lower is better)')

# In[30]:

# show the best RMSE and the corresponding max_depth
sorted(zip(RMSE_scores, max_depth_range))[0]

# In[31]:

# max_depth=2 was best, so fit a tree using that parameter
treereg = DecisionTreeRegressor(max_depth=2, random_state=1)
treereg.fit(X, y)

# In[32]:

# compute feature importances
pd.DataFrame({'feature':feature_cols, 'importance':treereg.feature_importances_}).sort('importance')

# ## Predicting salary with a Random Forest

# In[33]:

from sklearn.ensemble import RandomForestRegressor
rfreg = RandomForestRegressor()
rfreg

# ### Tuning n_estimators
# 
# One important tuning parameter is **n_estimators**, which is the number of trees that should be grown. It should be a large enough value that the error seems to have "stabilized".

# In[34]:

# list of values to try for n_estimators
estimator_range = range(10, 310, 10)

# list to store the average RMSE for each value of n_estimators
RMSE_scores = []

# use 5-fold cross-validation with each value of n_estimators (WARNING: SLOW!)
for estimator in estimator_range:
    rfreg = RandomForestRegressor(n_estimators=estimator, random_state=1)
    MSE_scores = cross_val_score(rfreg, X, y, cv=5, scoring='mean_squared_error')
    RMSE_scores.append(np.mean(np.sqrt(-MSE_scores)))

# In[35]:

# plot n_estimators (x-axis) versus RMSE (y-axis)
plt.plot(estimator_range, RMSE_scores)
plt.xlabel('n_estimators')
plt.ylabel('RMSE (lower is better)')

# ### Tuning max_features
# 
# The other important tuning parameter is **max_features**, which is the number of features that should be considered at each split.

# In[36]:

# list of values to try for max_features
feature_range = range(1, len(feature_cols)+1)

# list to store the average RMSE for each value of max_features
RMSE_scores = []

# use 10-fold cross-validation with each value of max_features (WARNING: SLOW!)
for feature in feature_range:
    rfreg = RandomForestRegressor(n_estimators=150, max_features=feature, random_state=1)
    MSE_scores = cross_val_score(rfreg, X, y, cv=10, scoring='mean_squared_error')
    RMSE_scores.append(np.mean(np.sqrt(-MSE_scores)))

# In[37]:

# plot max_features (x-axis) versus RMSE (y-axis)
plt.plot(feature_range, RMSE_scores)
plt.xlabel('max_features')
plt.ylabel('RMSE (lower is better)')

# In[38]:

# show the best RMSE and the corresponding max_features
sorted(zip(RMSE_scores, feature_range))[0]

# ### Fitting a Random Forest with the best parameters

# In[39]:

# max_features=8 is best and n_estimators=150 is sufficiently large
rfreg = RandomForestRegressor(n_estimators=150, max_features=8, oob_score=True, random_state=1)
rfreg.fit(X, y)

# In[40]:

# compute feature importances
pd.DataFrame({'feature':feature_cols, 'importance':rfreg.feature_importances_}).sort('importance')

# In[41]:

# compute the out-of-bag R-squared score
rfreg.oob_score_

# ### Reducing X to its most important features

# In[42]:

# check the shape of X
X.shape

# In[43]:

# set a threshold for which features to include
print(rfreg.transform(X, threshold=0.1).shape)
print(rfreg.transform(X, threshold='mean').shape)
print(rfreg.transform(X, threshold='median').shape)

# In[44]:

# create a new feature matrix that only includes important features
X_important = rfreg.transform(X, threshold='mean')

# In[45]:

# check the RMSE for a Random Forest that only includes important features
rfreg = RandomForestRegressor(n_estimators=150, max_features=3, random_state=1)
scores = cross_val_score(rfreg, X_important, y, cv=10, scoring='mean_squared_error')
np.mean(np.sqrt(-scores))

# ## Comparing Random Forests with decision trees
# 
# **Advantages of Random Forests:**
# 
# - Performance is competitive with the best supervised learning methods
# - Provides a more reliable estimate of feature importance
# - Allows you to estimate out-of-sample error without using train/test split or cross-validation
# 
# **Disadvantages of Random Forests:**
# 
# - Less interpretable
# - Slower to train
# - Slower to predict

# ![Machine learning flowchart](images/driver_ensembling.png)
# 
# *Machine learning flowchart created by the [second place finisher](http://blog.kaggle.com/2015/04/20/axa-winners-interview-learning-telematic-fingerprints-from-gps-data/) of Kaggle's [Driver Telematics competition](https://www.kaggle.com/c/axa-driver-telematics-analysis)*