#!/usr/bin/env python
# coding: utf-8

# [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/fonnesbeck/Bios8366/blob/master/notebooks/Section6_5-Decision-Trees.ipynb)

# # Supervised Learning: Decision Trees
# 
# For methods like support vector machines to work well, an appropriate kernel function must be identified, which is not always straightforward or intuitive.
# 
# An alternative to this is to perform statistical learning more directly, say, as a combination of the predictors (features). An **adaptive basis-function model** (ABM) is one example of this.
# 
# $$f(x) = w_0 + \sum_{j=1}^k w_j \phi_j(\mathbf{x})$$
# 
# Here, $\phi_j$ is a *basis function*, which is typically parametric:
# 
# $$\phi_j(\mathbf{x}) = \phi_j(\mathbf{x}|\alpha_j)$$
# 
# The parameter set for this model is thus $\theta = \{\mathbf{w} = w_0,\ldots,w_k; \mathbf{\alpha} = \alpha_1, \ldots, \alpha_k\}$. This model is *not* linear in the parameters.
# 
# **Decision trees** use an ABM to *recursively partition* the space of predictor variables into a piecewise-constant response surface. We can consider each component $j=1,\ldots,k$ to be a region in the response surface, and $w_j$ the expected response in that region.
# 
# $$f(x) = \sum_{j=1}^k w_j I(\mathbf{x} \in R_j)$$
# 
# Each parameter $\alpha_j$ encodes both (1) a variable used for splitting and (2) the corresponding threshold value. Specifically, the basis functions define the regions, and the weights encode the response value in each region.
# 
# This particular formulation implies a regression-type model, but we can generalize this to classification by storing the *distribution over classes* in each leaf, instead of the mean response.

# To get a sense of how decision trees work, consider the diabetes dataset that we have seen previously. In the plot below, the response variable (`target`, an index of disease progression) is color-coded as a function of two variables, body mass index (`bmi`) and a blood serum measurement (`ltg`).

# In[ ]:

get_ipython().run_cell_magic('capture', '', '!pip install shap\n')

# In[ ]:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_diabetes

# Predictors: "age" "sex" "bmi" "map" "tc" "ldl" "hdl" "tch" "ltg" "glu"
diabetes = load_diabetes()
y = diabetes['target']
bmi, ltg = diabetes['data'][:,[2,8]].T

plt.scatter(ltg, bmi, c=y, cmap="Blues")
plt.colorbar()
plt.xlabel('ltg'); plt.ylabel('bmi');

# One approach to building a predictive model is to subdivide the variable space into regions, by sequentially subdividing each variable. For example, if we split `ltg` at a threshold value of -0.01, it does a reasonable job of isolating the large values in one of the resulting subspaces.
# In[ ]:

ltg_split = -0.01

plt.scatter(ltg, bmi, c=y, cmap="Blues")
plt.vlines(ltg_split, *plt.gca().get_ylim(), linestyles='dashed')
plt.colorbar()
plt.xlabel('ltg'); plt.ylabel('bmi');

# However, that region still contains a fair number of low (blue) values, so we can similarly bisect the region using a `bmi` value of -0.03 as a threshold value:

# In[ ]:

bmi_split = -0.03

plt.scatter(ltg, bmi, c=y, cmap="Blues")
plt.vlines(ltg_split, *plt.gca().get_ylim(), linestyles='dashed')
plt.hlines(bmi_split, ltg_split, plt.gca().get_xlim()[1], linestyles='dashed')
plt.colorbar()
plt.xlabel('ltg'); plt.ylabel('bmi');

# We can use this partition to create a piecewise-constant function, which returns the average value of the observations in each region defined by the threshold values. We could then use this rudimentary function as a predictive model.

# In[ ]:

np.mean(y[(bmi>bmi_split) & (ltg>ltg_split)])

# In[ ]:

np.mean(y[(bmi<=bmi_split) & (ltg>ltg_split)])

# In[ ]:

np.mean(y[ltg<=ltg_split])

# This kind of recursive partitioning can be represented as a tree. As an example of a *classification* tree, consider the well-known tree below for predicting the survival of passengers on the Titanic, based on sex ($x_1$), age ($x_2$), and the number of siblings and spouses on board ($x_3$):
# 
# - The first split is on the sex of the passenger; if the passenger is female, we predict survival ($y=1$).
# - If the passenger is male, we then check whether his age was greater than 9.5 years; if so, the empirical probability of survival is $p(y=1 \mid x_1=\text{M}, x_2>9.5)=0.17$, so we predict death ($y=0$).
# - If, on the other hand, the passenger is younger than 9.5 years, we then check whether the number of siblings and spouses on board was higher than 2.5; if so, the probability of survival is $p(y=1 \mid x_1=\text{M}, x_2 \le 9.5, x_3>2.5)=0.05$, so we predict death; otherwise we predict survival, with $p(y=1 \mid x_1=\text{M}, x_2 \le 9.5, x_3 \le 2.5)=0.89$.
# 
# Hence, these probabilities are just the empirical fraction of positive examples that satisfy each conjunction of feature values, which defines a path from the root to a leaf.
# 
# ![titanic tree](http://upload.wikimedia.org/wikipedia/commons/f/f3/CART_tree_titanic_survivors.png)
# 
# There is no way to feasibly evaluate all possible partitions. Instead, the strategy is to use a top-down, **greedy** approach that is optimal (according to a particular cost function) for the current split only. By "greedy", we mean that at each step it chooses the most advantageous binary partition, not taking into account the impact of the choice on the quality of subsequent partitions.
# 
# $$(j^*, t^*) = \text{argmin}_{j,t} C(\{\mathbf{x}_i,y_i: x_{ij} \le t\}) + C(\{\mathbf{x}_i,y_i: x_{ij} \gt t\})$$
# 
# where $C$ is a cost function, and $j$ and $t$ are a variable index and cutpoint, respectively. We will restrict consideration to **binary partitions**.
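# To make the greedy search concrete, here is a minimal sketch (not part of any library; the helper names `region_cost` and `best_split` are ours) that exhaustively evaluates every candidate (variable, threshold) pair on the diabetes data loaded above, using the sum of squared deviations from the region mean as the cost $C$:

# In[ ]:

def region_cost(y_region):
    # Sum of squared deviations from the region mean (zero for an empty region)
    return ((y_region - y_region.mean())**2).sum() if len(y_region) else 0.0

def best_split(candidates, y):
    # Exhaustive search over all (variable, threshold) pairs for the split that
    # minimizes the combined cost of the two resulting regions
    best = None
    for name, x in candidates.items():
        for t in np.unique(x)[:-1]:
            cost = region_cost(y[x <= t]) + region_cost(y[x > t])
            if best is None or cost < best[2]:
                best = (name, t, cost)
    return best

best_split({'bmi': bmi, 'ltg': ltg}, y)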
# ## Classification Trees
# 
# In addition to regression trees, we can also use decision trees on categorical outcomes, and these are called classification trees. The primary difference in implementation is that the residual sum of squares is no longer an appropriate splitting criterion.
# 
# ### Entropy
# 
# An alternative splitting criterion for classification tree learning algorithms is *information gain*. It measures how well a particular attribute distinguishes among different target classifications. Information gain is measured in terms of the expected reduction in the entropy or impurity of the data. The entropy of a set of probabilities is:
# 
# $$H(p) = -\sum_i p_i \log_2(p_i)$$
# 
# If we have a set of binary responses from some variable, all of which are positive/true/1, then knowing the values of the variable does not hold any predictive value for us, since all the outcomes are positive. Hence, the entropy is zero:

# In[ ]:

import numpy as np

# Sum only over non-zero probabilities, using the convention 0*log2(0) = 0
entropy = lambda p: -np.sum([pi * np.log2(pi) for pi in p if pi > 0])

# In[ ]:

entropy([0.,1.])

# However, if the variable splits responses into equal numbers of positive and negative values, then entropy is maximized, and we wish to know about the feature:

# In[ ]:

entropy([0.5, 0.5])

# In[ ]:

pvals = np.linspace(0, 1)

plt.plot(pvals, [entropy([p,1-p]) for p in pvals])

# The entropy calculation tells us how much additional information we would obtain with knowledge of the variable.
# 
# So, if we have a set of candidate covariates from which to choose as a node in a decision tree, we should choose the one that gives us the most information about the response variable (*i.e.* the one that yields the largest reduction in entropy, or equivalently, the highest information gain).
# 
# ### Misclassification Rate
# 
# Alternatively, we can use the misclassification rate:
# 
# $$C(j,t) = \frac{1}{n_{jt}} \sum_{y_i: x_{ij} \gt t} I(y_i \ne \hat{y})$$
# 
# where $\hat{y}$ is the most probable class label and $n_{jt}$ is the number of observations in the data subset obtained from splitting via $j,t$.
# 
# ### Gini index
# 
# The Gini index is simply the expected error rate:
# 
# $$C(j,t) = \sum_{k=1}^K \hat{\pi}_{jt}[k] (1 - \hat{\pi}_{jt}[k]) = 1 - \sum_{k=1}^K \hat{\pi}_{jt}[k]^2$$
# 
# where $\hat{\pi}_{jt}[k]$ is the proportion of observations belonging to class $k$ in the data subset obtained from splitting via $j,t$ (hence, $(1 - \hat{\pi}_{jt}[k])$ is the probability of misclassifying an observation that is assigned to class $k$).

# In[ ]:

gini = lambda p: 1. - (np.array(p)**2).sum()

# In[ ]:

pvals = np.linspace(0, 1)

plt.plot(pvals, [entropy([p,1-p])/2. for p in pvals], label='Entropy')
plt.plot(pvals, [gini([p,1-p]) for p in pvals], label='Gini')
plt.legend()
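# The misclassification rate defined above can be added to the same comparison (a small sketch; the `misclass` helper is ours). For a binary response with class proportions $(p, 1-p)$, it is simply $1 - \max(p, 1-p)$:

# In[ ]:

misclass = lambda p: 1. - np.max(p)

pvals = np.linspace(0, 1)

plt.plot(pvals, [entropy([p,1-p])/2. for p in pvals], label='Entropy (scaled)')
plt.plot(pvals, [gini([p,1-p]) for p in pvals], label='Gini')
plt.plot(pvals, [misclass([p,1-p]) for p in pvals], label='Misclassification rate')
plt.legend()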
# ## ID3
# 
# A given cost function can be used to construct a decision tree via one of several algorithms. The Iterative Dichotomiser 3 (ID3) is one such algorithm, which uses entropy, and a related concept, *information gain*, to choose features and partitions at each classification step in the tree.
# 
# Information gain is the difference between the current entropy of a system and the entropy measured after a feature is chosen. If $S$ is a set of examples and $X$ is a possible feature on which to partition the examples, then:
# 
# $$G(S,X) = \text{Entropy}(S) - \sum_{x \in X} \frac{\#(S_x)}{\#(S)} \text{Entropy}(S_x)$$
# 
# where $\#$ is the count function and $x$ is a particular value of $X$.
# 
# Let's say $S$ is a set of survival events, $S = \{s_1=survived, s_2=died, s_3=died, s_4=died\}$ and a particular variable $X$ can have values $\{x_1, x_2, x_3\}$. To perform a sample calculation of information gain, we will say that:
# 
# * $X(s_1) = x_2$
# * $X(s_2) = x_2$
# * $X(s_3) = x_3$
# * $X(s_4) = x_1$

# In[ ]:

pd.DataFrame({'S': [1, 0, 0, 0], 'X': ['x2', 'x2', 'x3', 'x1']})

# The current entropy of this state is:
# 
# $$\begin{align}
# \text{Entropy}(S) &= -p^{(+)} \log_2(p^{(+)}) - p^{(-)} \log_2(p^{(-)}) \\
# &= -0.25 \log_2(0.25) - 0.75 \log_2(0.75) \\
# &= 0.5 + 0.311 = 0.811
# \end{align}$$
# 
# Now, we need to compute the information after selecting variable $X$, which is the sum of three terms:
# 
# $$\begin{align}
# \frac{\#(S_{x_1})}{\#(S)} \text{Entropy}(S_{x_1}) &= 0.25 \,(-0 \log_2(0) - 1 \log_2(1)) = 0\\
# \frac{\#(S_{x_2})}{\#(S)} \text{Entropy}(S_{x_2}) &= 0.5 \,(-0.5 \log_2(0.5) - 0.5 \log_2(0.5)) = 0.5\\
# \frac{\#(S_{x_3})}{\#(S)} \text{Entropy}(S_{x_3}) &= 0.25 \,(-0 \log_2(0) - 1 \log_2(1)) = 0\\
# \end{align}$$
# 
# Therefore, the information gain is:
# 
# $$G(S,X) = 0.811 - (0 + 0.5 + 0) = 0.311$$

# This algorithm is implemented in the following function:

# In[ ]:

import numpy as np

def info_gain(X, y, feature):
    # Calculates the information gain (reduction in entropy) from splitting on `feature`
    n = len(X)
    y = np.asarray(y)

    # Entropy of the parent node
    parent_probs = [np.mean(y == c) for c in set(y)]
    parent_entropy = entropy(parent_probs)

    # List the values that the feature can take
    values = list(set(X[feature]))
    feature_counts = np.zeros(len(values))
    E = np.zeros(len(values))
    ivalue = 0

    # Weighted entropy of the subsets obtained by splitting on each feature value
    post_split_entropy = 0
    # Find where those values appear in X[feature] and the corresponding class
    for value in values:
        new_y = [y[i] for i, d in enumerate(X[feature].values) if d == value]
        feature_counts[ivalue] += len(new_y)

        # Count the classes within this subset
        class_values = list(set(new_y))
        class_counts = np.zeros(len(class_values))
        iclass = 0
        for v in class_values:
            for c in new_y:
                if c == v:
                    class_counts[iclass] += 1
            iclass += 1

        nc = float(np.sum(class_counts))
        E[ivalue] = entropy([class_counts[c] / nc for c in range(len(class_values))])

        # Weight each subset's entropy by the fraction of observations it contains
        post_split_entropy += float(feature_counts[ivalue]) / n * E[ivalue]
        ivalue += 1

    return parent_entropy - post_split_entropy

# Consider a few variables from the titanic database:

# In[ ]:

titanic = pd.read_excel("../data/titanic.xls", "titanic")
titanic.head(3)

# Here, we have selected passenger class (`pclass`), sex, port of embarkation (`embarked`), and a derived variable called `adult`. We can calculate the information gain for each of these.

# In[ ]:

X = titanic.copy()
y = X.pop('survived')
X['adult'] = titanic.age<17  # note: True here flags passengers younger than 17

info_gain(X, y, 'pclass')

# In[ ]:

info_gain(X, y, 'sex')

# In[ ]:

info_gain(X, y, 'embarked')

# In[ ]:

info_gain(X, y, 'adult')

# Hence, the ID3 algorithm computes the information gain for each variable, selecting the one with the highest value (in this case, `sex`). In this way, it searches the "tree space" according to a greedy strategy.
# 
# A tree can be constructed by recursively selecting the feature from the current dataset with the largest information gain, then removing it from the dataset. Recursion stops when there are either no variables remaining, or there is only one class left in the subset (*e.g.* all `True` or all `False`).
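# Before writing out the full algorithm, a quick sanity check (a sketch using the toy dataset from above): applying `info_gain` to the four-observation example should reproduce the hand-calculated gain of 0.311.

# In[ ]:

toy = pd.DataFrame({'S': [1, 0, 0, 0], 'X': ['x2', 'x2', 'x3', 'x1']})
info_gain(toy[['X']], toy['S'], 'X')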
# The ID3 algorithm is as follows:
# 
# > * if all response data have the same class:
# >
# >   - return leaf with data label
# >
# > * else if no features:
# >
# >   - return leaf with most common label
# >
# > * else:
# >
# >   - choose the variable $X'$ that maximizes information gain to be a tree node
# >   - add a branch from the node for each value of $X'$
# >   - for each branch of the node:
# >
# >     * calculate $S_{x}$ by removing $X'$ from $S$
# >     * set $S=S_{x}$ and call the algorithm again
# 
# The greedy approach of maximizing information gain at each step tends to bias solutions towards smaller trees.

# ## Decision Trees in `scikit-learn`
# 
# Classification trees, either binary or multi-class, are implemented in `scikit-learn` in the `DecisionTreeClassifier` class. Class labels can be passed in their original coding (numeric or string); they are encoded internally and exposed via the fitted estimator's `classes_` attribute.
# 
# Let's build a decision tree on the wine dataset.

# In[ ]:

wine = pd.read_table("../data/wine.dat", sep=r'\s+')

attributes = ['Alcohol',
            'Malic acid',
            'Ash',
            'Alcalinity of ash',
            'Magnesium',
            'Total phenols',
            'Flavanoids',
            'Nonflavanoid phenols',
            'Proanthocyanins',
            'Color intensity',
            'Hue',
            'OD280/OD315 of diluted wines',
            'Proline']

grape = wine.pop('region')
y = grape
wine.columns = attributes
X = wine

# In[ ]:

from sklearn import tree
from sklearn import model_selection

X_train, X_test, y_train, y_test = model_selection.train_test_split(
        X, y, test_size=0.4, random_state=0)

clf = tree.DecisionTreeClassifier(criterion='entropy',
                                  max_features="sqrt",
                                  min_samples_leaf=10)
clf.fit(X_train, y_train)

# If you have [GraphViz](http://www.graphviz.org) and the `graphviz` Python package installed, you can draw the resulting tree:

# In[ ]:

import graphviz

gv_data = tree.export_graphviz(clf, out_file=None,
                     feature_names=attributes,
                     filled=True, rounded=True,
                     special_characters=True)
graphviz.Source(gv_data)

# In[ ]:

preds = clf.predict(X_test)

pd.crosstab(y_test, preds, rownames=['actual'],
            colnames=['prediction'])

# ### Pruning
# 
# Despite the **inductive bias** associated with trees that tends to make them small, the ID3 algorithm continues choosing nodes and branches until either it runs out of variables, or all outputs are of the same class. This can clearly lead to overfit trees.
# 
# To prevent overfitting, we can **stop growing** the tree if the information gain (or reduction in error, etc.) is not sufficient to justify the extra complexity of adding another node. However, this simple rule is not optimal, because a seemingly uninformative split can enable informative splits further down the tree.
# 
# The standard approach is therefore to grow a full tree, and then to **prune** it. The easiest approach is to remove the branches whose removal increases the error (or decreases the information gain) the least. To determine how far back to prune, we can evaluate the cross-validated error on each candidate pruning, and then pick the tree whose CV error is within 1 standard error of the minimum.
# 
# Analogous to the lasso or ridge regression, we can **penalize** the number of terminal nodes in a tree:
# 
# $$\sum_{m=1}^{|T|} \sum_{x_i \in R_m} (y_i - \hat{y}_{R_m})^2 + \alpha |T|$$
# 
# where $|T|$ is the number of terminal nodes in tree T.
# 
# ### Pruned Decision Tree Algorithm
# 
# 1. Use recursive binary splitting to grow a large tree, such that each terminal node has fewer than some minimum number of observations.
# 2. Apply pruning to obtain a sequence of best subtrees, as a function of $\alpha$.
# 3. Use k-fold cross-validation to choose $\alpha$. Average results and pick $\alpha$ to minimize the average error.
# 4. Return the subtree from (2) that corresponds to the chosen $\alpha$.
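# Recent versions of `scikit-learn` implement cost-complexity pruning directly via the `ccp_alpha` parameter. As a sketch (assuming a version that supports it), we can compute the pruning path for the wine tree and examine the cross-validated accuracy across candidate values of $\alpha$:

# In[ ]:

from sklearn.model_selection import cross_val_score

# Candidate alpha values from the cost-complexity pruning path of a fully grown tree
path = tree.DecisionTreeClassifier(criterion='entropy').cost_complexity_pruning_path(X_train, y_train)

# Cross-validated accuracy of the pruned tree for each alpha
cv_scores = [cross_val_score(tree.DecisionTreeClassifier(criterion='entropy', ccp_alpha=a),
                             X_train, y_train, cv=5).mean()
             for a in path.ccp_alphas]

plt.plot(path.ccp_alphas, cv_scores, marker='o')
plt.xlabel(r'$\alpha$'); plt.ylabel('mean CV accuracy');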
# ## Averaging Trees
# 
# Decision trees have several advantages:
# 
# * ease of interpretation
# * ease of implementation
# * handles continuous and discrete features (and mixed)
# * invariant to monotone transformation of features
# * variable selection automated (ignores redundant variables)
# * robust
# * scalable (can handle huge datasets)
# 
# However, relative to other statistical learning methods, trees do not predict very accurately, due to the greedy nature of the tree construction algorithm. Also, trees tend to be **unstable**, as small changes to the inputs can have large effects on the structure of the tree; poor decisions near the root of the tree will propagate to the rest of the tree. Hence, trees are **high variance** (*i.e.* noisy) estimators.
# 
# One way to reduce the variance of an estimate is to average together many estimates. In the case of decision trees, we can train $T$ different trees on random subsets of the data (sampled with replacement), then average according to:
# 
# $$\hat{f}(\mathbf{x}) = \frac{1}{T} \sum_{t=1}^T f_t(\mathbf{x})$$
# 
# where $f_t$ is the $t^{th}$ tree. This approach is called "bootstrap aggregating", or **bagging**.
# 
# Note that, since we are averaging over trees, there is *no need to prune*. With bagging, we reduce variance by averaging, rather than by pruning.

# In[ ]:

from sklearn.ensemble import BaggingClassifier

bc = BaggingClassifier(n_jobs=4, oob_score=True)
bc

# In[ ]:

bc.fit(X_train, y_train)

preds = bc.predict(X_test)

pd.crosstab(y_test, preds, rownames=['actual'],
            colnames=['prediction'])

# Test error of a bagged model is measured by estimating the **out-of-bag error**.
# 
# On average, each bagged tree uses about 2/3 of the observations, leaving the remaining third "out-of-bag". The predicted response for the $i$th observation is averaged over the trees from which that observation was excluded (on average, about B/3 of them). This is essentially the same as performing leave-one-out (LOO) cross-validation.

# In[ ]:

bc.oob_score_

# This approach is an **ensemble learning** method, because it takes a set of *weak* learners, and combines them to construct a *strong* learner that is more robust, with lower generalization error.
# 
# An average of B *independent* trees, each with variance $\sigma^2$, has variance $\sigma^2/B$. If the trees are merely identically distributed (not independent), with positive pairwise correlation $\rho$, then the variance of the average of the B trees is:
# 
# $$\rho \sigma^2 + \frac{1-\rho}{B}\sigma^2$$

# As the number of trees becomes large, the second term goes to zero. Further reductions in variance are limited by the size of the **correlation** among the trees, $\rho$.

# ## Random Forests
# 
# **Random forests** improve upon bagging by creating a set of decision trees that are less correlated than bootstrapped trees. This is done by selecting from only a subset $m$ out of $M$ possible predictors at each split. Typically, we choose approximately the square root of the available number.
# 
# This procedure is used to create a set of trees, most of which will be poor at predicting a given observation. However, classification is based on a *majority vote* of the constituent trees.
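# To see why reducing the correlation among trees matters, here is a small numerical sketch of the variance formula above: with strong correlation $\rho$, adding more trees quickly stops reducing the variance of the ensemble average.

# In[ ]:

# Ensemble variance rho*sigma^2 + (1 - rho)*sigma^2/B as a function of the number
# of trees B, for several between-tree correlations rho
sigma2 = 1.0
B = np.arange(1, 201)
for rho in [0.0, 0.25, 0.5, 0.75]:
    plt.plot(B, rho * sigma2 + (1 - rho) * sigma2 / B, label=r'$\rho$ = %.2f' % rho)
plt.xlabel('number of trees (B)')
plt.ylabel('variance of ensemble average')
plt.legend();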
# In[ ]:

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_jobs=4)
rf.fit(X_train, y_train)

preds = rf.predict(X_test)

pd.crosstab(y_test, preds, rownames=['actual'],
            colnames=['prediction'])

# With random forests, it is possible to quantify the **relative importance** of feature inputs for classification. In scikit-learn, the reduction in the Gini index (recall, a measure of expected error) achieved by each internal node that splits on a particular feature is calculated for each tree, weighted by the number of samples that were routed to the node (this approximates the probability of reaching that node). For each variable, this quantity is averaged over the trees in the forest to yield a measure of importance.

# In[ ]:

importances = rf.feature_importances_
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")

for f in range(X.shape[1]):
    print("%d. %s (%f)" % (f + 1, X.columns[indices[f]], importances[indices[f]]))

# In[ ]:

plt.figure()
plt.title("Feature importances")
plt.bar(range(X.shape[1]), importances[indices],
       color="r", align="center")
plt.xticks(range(X.shape[1]), X.columns[indices], rotation=90)
plt.xlim([-1, X.shape[1]]);

# `RandomForestClassifier` uses the Gini impurity index by default; one may instead use the entropy information gain as a criterion.

# In[ ]:

rf = RandomForestClassifier(n_jobs=4, criterion='entropy')
rf.fit(X_train, y_train)

importances = rf.feature_importances_
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")

for f in range(X.shape[1]):
    print("%d. %s (%f)" % (f + 1, X.columns[indices[f]], importances[indices[f]]))

# ## Partial Dependence Plots
# 
# To understand a feature's importance in a model, it is necessary to understand both how changing that feature impacts the model's output, and also the distribution of that feature's values. Partial dependence plots show the dependence between the response and a set of features, marginalizing over the values of all other features. Intuitively, we can interpret the partial dependence as the expected response as a function of the features we condition on.

# In[ ]:

import shap

shap.partial_dependence_plot(
    "Alcohol", rf.predict, X, ice=False,
    model_expected_value=True, feature_expected_value=True
)

# The gray horizontal line in the plot above represents the expected value of the model when applied to the wine dataset. The vertical gray line represents the average value of the alcohol feature. Note that the blue partial dependence line (which is the average value of the model output when we fix alcohol to a given value) always passes through the intersection of the two gray expected value lines. We can consider this intersection point as the "center" of the partial dependence plot with respect to the data distribution. Note the distribution of feature values, shown as a histogram on the x-axis.
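# As a check on our understanding (a sketch, not how `shap` computes it internally), we can reproduce the partial dependence curve by hand: for each value on a grid over `Alcohol`, replace every observation's `Alcohol` with that value and average the model's predictions.

# In[ ]:

alcohol_grid = np.linspace(X['Alcohol'].min(), X['Alcohol'].max(), 30)
pdp_by_hand = []
for v in alcohol_grid:
    X_mod = X.copy()
    X_mod['Alcohol'] = v        # fix Alcohol at this value for every observation
    pdp_by_hand.append(rf.predict(X_mod).mean())

plt.plot(alcohol_grid, pdp_by_hand)
plt.xlabel('Alcohol'); plt.ylabel('average model prediction');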
# ## SHAP Values
# 
# SHAP (SHapley Additive exPlanations) values allow users to interpret machine learning models with respect to the influence of each feature on the model's output. They are based on *Shapley values*, which are taken from cooperative game theory. In that setting, they are used to assign credit to players on a team for their contributions to the outcome of a game; analogously, SHAP values assign each feature an importance value for a particular prediction derived from a model.
# 
# A SHAP value quantifies the marginal effect that the observed level of a variable for an entity has on the final predicted probability for that entity.
# 
# To evaluate an existing model $f$ when only a subset $S$ of features are part of the model, we integrate out the other features using a conditional expected value formulation:
# 
# $$
# E[f(X) \mid X_S = x_S]
# $$
# 
# SHAP values can be very complicated to compute (they are NP-hard in general). When we are explaining a prediction $f(x)$, the SHAP value for a specific feature $i$ is just the difference between the expected model output and the partial dependence plot at the feature's value $x_i$:

# In[ ]:

explainer = shap.Explainer(rf.predict, X)
shap_values = explainer(X)

sample_ind = 18

shap.partial_dependence_plot(
    "Alcohol", rf.predict, X, model_expected_value=True,
    feature_expected_value=True, ice=False,
    shap_values=shap_values[sample_ind:sample_ind+1,:]
)

# ### The additive nature of Shapley values
# 
# One of the fundamental properties of Shapley values is that they always sum up to the difference between the game outcome when all players are present and the game outcome when no players are present. For machine learning models, this means that the SHAP values of all the input features will always sum up to the difference between the baseline (expected) model output and the current model output for the prediction being explained. The easiest way to see this is through a waterfall plot, which starts at our background expectation for the model output, $E[f(X)]$, and then adds features one at a time until we reach the current model output $f(x)$.
# 
# The waterfall plot shows how we get from the expected value of the model to the specific prediction for a given sample.

# In[ ]:

shap.plots.waterfall(shap_values[sample_ind])

# In[ ]:

shap.plots.waterfall(shap_values[103])

# In[ ]:

shap.plots.bar(shap_values[sample_ind])

# This illustrates a key feature of SHAP values: **local interpretability**. Each observation gets its own set of SHAP values. Using these, we can explain why a case receives its prediction and the contributions of the predictors. Other measures of feature importance only show results across the entire population, not for each individual case.
# 
# If we are willing to deal with a bit more complexity, we can use a beeswarm plot to summarize the entire distribution of SHAP values for each feature.
# 
# SHAP values can also be aggregated to provide **global interpretability**. The collective SHAP values in the beeswarm plot show how much each predictor contributes, either positively or negatively, to the target variable. This is similar to the standard variable importance plot, but adds information about the direction of each variable's relationship with the target.

# In[ ]:

shap.plots.beeswarm(shap_values)

# While a SHAP beeswarm plot gives a general overview of each feature, a SHAP scatter plot shows how the model output varies by feature value. Note that every dot is an observation, and the vertical dispersion at a single feature value results from interaction effects in the model.
# 
# The feature used for coloring is automatically chosen to highlight what might be driving these interactions.

# In[ ]:

shap.plots.scatter(shap_values[:, 'Alcohol'], color=shap_values)
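# Before moving on, we can confirm the additivity property described above numerically (a quick sketch): for any single observation, the baseline expectation plus the sum of that observation's SHAP values should reproduce the model output.

# In[ ]:

# Baseline + sum of SHAP values should (approximately) equal the model output
reconstructed = shap_values[sample_ind].base_values + shap_values[sample_ind].values.sum()
reconstructed, rf.predict(X.iloc[[sample_ind]])[0]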
# The heatmap plot function creates a plot with the instances on the x-axis, the model inputs on the y-axis, and the SHAP values encoded on a color scale. By default, the samples are ordered using hierarchical clustering, which orders the samples based on their explanation similarity. This results in samples that have the same model output for the same reason being grouped together.

# In[ ]:

shap.plots.heatmap(shap_values)

# ## Example: Baseball pitch classification
# 
# Today, over 6TB of data per game are collected by MLB Advanced Media. This consists of information from cameras, used primarily to track the players, and radar, used to track the ball. The TrackMan radar system follows the ball by tracking its seams at a rate of 20,000 frames per second. This results in information on where the pitcher released the ball (extension), how hard he threw (velocity), what sort of spin he had (spin rate), how fast it appeared to the batter (perceived velocity), how hard the ball was hit (exit velocity), how it came off the bat (launch angle), and exactly where it went (batted ball direction).
# 
# The dataset `pitches.csv` contains [Statcast](https://en.wikipedia.org/wiki/Statcast) data corresponding to over 13,000 pitches from 5 different pitchers over 3 years. The data dictionary describing each column is given in `pitches.md`.

# In[ ]:

DATA_URL = 'https://raw.githubusercontent.com/fonnesbeck/Bios8366/master/data/'

try:
    pitches = pd.read_csv('../data/pitches.csv', index_col=0)
except FileNotFoundError:
    pitches = pd.read_csv(DATA_URL + 'pitches.csv', index_col=0)

# In[ ]:

pitches.head()

# Recode a few character columns

# In[ ]:

pitch_lookup = {'CB': 0, 'SL': 1, 'SI': 2, 'CH': 3, 'CU': 4, 'FB': 5}

pitches_recoded = pitches.replace({'throws': {'R': 0, 'L': 1},
                                   'type': pitch_lookup})

# Extract predictor columns

# In[ ]:

prediction_cols = ['throws', 'height', 'initspeed', 'breakx', 'breakz',
                   'initposx', 'initposz', 'extension', 'spinrate']

# In[ ]:

X = pitches_recoded[prediction_cols].copy()
y = pitches_recoded.type

# In[ ]:

X_train, X_test, y_train, y_test = model_selection.train_test_split(
        X, y, test_size=0.4, random_state=0)

# Fit model

# In[ ]:

baseball_clf = RandomForestClassifier()
baseball_clf.fit(X_train, y_train)

# Generate predictions

# In[ ]:

preds = baseball_clf.predict(X_test)

# In[ ]:

baseball_clf.score(X_test, y_test)

# In[ ]:

from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, preds)

# We can tune several hyperparameters using cross-validation.

# In[ ]:

from sklearn.model_selection import RandomizedSearchCV

n_estimators = [50, 100, 150]
max_depth = [10, 20, 30, 40]
min_samples_split = [4, 6, 8]
min_samples_leaf = [1, 2, 4]

# Create the random grid
parameter_grid = {'n_estimators': n_estimators,
                  'max_depth': max_depth,
                  'min_samples_split': min_samples_split,
                  'min_samples_leaf': min_samples_leaf}

# In[ ]:

clf_baseball = RandomForestClassifier()
rf_cf = RandomizedSearchCV(clf_baseball, parameter_grid, n_iter=50, cv=5, n_jobs=2)
rf_cf.fit(X_train, y_train)

# In[ ]:

rf_cf.best_estimator_.get_params()

# In[ ]:

preds = rf_cf.best_estimator_.predict(X_test)

# In[ ]:

from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, preds)

# In[ ]:

pd.Series(rf_cf.best_estimator_.feature_importances_,
          index=prediction_cols).sort_values().plot.barh()

# ## Decision Tree Regression
# 
# It may not be immediately apparent how to use trees for regression analysis, but it requires only a straightforward modification to the algorithm. A popular tree-based regression algorithm is the **classification and regression tree** (CART).
# The file `TNNASHVI.txt` in your data directory contains daily temperature readings for Nashville, courtesy of the [Average Daily Temperature Archive](http://academic.udayton.edu/kissock/http/Weather/). This data, as one would expect, oscillates annually. We can use a decision tree regression model to fit the data.

# In[ ]:

daily_temps = pd.read_table("../data/TNNASHVI.txt", sep=r'\s+',
                            names=['month','day','year','temp'], na_values=-99)

# In[ ]:

daily_temps.temp[daily_temps.year>2010].plot(style='b.', figsize=(10,6))

# In this context, none of the cost functions considered so far would be appropriate. Instead, it would be more suitable to use something like mean squared error (MSE) to guide the growth of the tree. With this, we can proceed to choose:
# 
# 1. a variable on which to split the dataset
# 2. a value of the variable at which to place a node (in the case of continuous features)
# 
# Recall that the output of a tree is just a constant value for each leaf; here, we simply return the average of all the response values in the region. Thus, we choose a cut point that minimizes the MSE at each step.

# In[ ]:

# Transmogrify data
y = daily_temps.temp[daily_temps.year>2010]
X = np.atleast_2d(np.arange(len(y))).T

# In[ ]:

from sklearn.tree import DecisionTreeRegressor

clf = DecisionTreeRegressor(max_depth=7, min_samples_leaf=2)
clf.fit(X, y)

# In[ ]:

X_fit = np.linspace(0, len(X), 1000).reshape((-1, 1))
y_fit_1 = clf.predict(X_fit)

plt.plot(X.ravel(), y, '.k', alpha=0.3)
plt.plot(X_fit.ravel(), y_fit_1, color='red');

# A single decision tree allows us to estimate the signal in a non-parametric way, but clearly has some issues. In some regions, the model shows high bias and under-fits the data (seen in the long flat lines which don't follow the contours of the data), while in other regions the model shows high variance and over-fits the data (reflected in the narrow spikes which are influenced by noise in single points).
# 
# One way to address this is to use an ensemble method, like random forests, so that the effects of over-fitting average out.
# 
# Here we will use a random forest of 200 trees to reduce the tendency of each tree to overfit the data.

# In[ ]:

from sklearn.ensemble import RandomForestRegressor

clf = RandomForestRegressor(n_estimators=200, max_depth=9, min_samples_leaf=10)
clf.fit(X, y)

y_fit_200 = clf.predict(X_fit)

plt.plot(X.ravel(), y, '.k', alpha=0.3)
plt.plot(X_fit.ravel(), y_fit_200, color='red')

# ### Prediction intervals
# 
# The predictions from random forests are not accompanied by estimates of uncertainty, as is the case with Bayesian regression models. However, it is possible to obtain prediction intervals using a random forests approach. Since we are using an ensemble of trees, it is possible to track *all* predicted values for all leaf nodes in a random forest, rather than just the mean or modal value. This results in conditional distributions $P(y|X=x)$ for every $x$, from which percentiles can be calculated for desired endpoints in a prediction interval. This approach is called **quantile regression forests**.
# 
# To implement quantile regression forests in scikit-learn, we need to allow each tree to grow so that each leaf node contains exactly one value. Then, each tree returns a single response variable, from which a conditional distribution can be approximated. Of course, fully expanding trees will result in overfitting, but these can also be cross-validated.
# scikit-learn does not automatically calculate prediction intervals, but the estimator for each constituent tree in the `RandomForestRegressor` is available, from which individual tree predictions can be made.

# In[ ]:

def prediction_intervals(mod, X, alpha=0.05):
    # Collect the predictions of every tree in the fitted ensemble
    preds = [pred.predict(X) for pred in mod.estimators_]
    lower = np.percentile(preds, 100*(alpha/2), axis=0)
    upper = np.percentile(preds, 100*(1-alpha/2), axis=0)
    return lower, upper, np.array(preds)

# In[ ]:

X_train, X_test, y_train, y_test = model_selection.train_test_split(
        X, y, test_size=0.4, random_state=0)

# In[ ]:

clf = RandomForestRegressor(n_estimators=1000, min_samples_leaf=1)
clf.fit(X_train, y_train)

y_fit_200 = clf.predict(X_fit)

# In[ ]:

lower, upper, preds = prediction_intervals(clf, X_test, alpha=0.1)

# In[ ]:

x_sorted = np.sort(X_test.ravel())
order = np.argsort(X_test.ravel())

plt.errorbar(x_sorted, y_test.values[order],
             yerr=[(y_test.values-lower)[order], (upper-y_test.values)[order]])
plt.plot(X_test, y_test, '.r', alpha=0.3)

# ### Exercise
# 
# Select the optimal random forest regression model for the Nashville daily temperature data via cross-validation in `scikit-learn`. Use the number of estimators and the maximum leaf nodes as tuning parameters.

# In[ ]:

# Write your answer here

# ---
# 
# ## References
# 
# Meinshausen, N. (2006) [Quantile Regression Forests](http://www.jmlr.org/papers/volume7/meinshausen06a/meinshausen06a.pdf). Journal of Machine Learning Research 7, 983–999.
# 
# Hastie, T., Tibshirani, R. and Friedman, J. (2009) [The Elements of Statistical Learning: Data Mining, Inference, and Prediction](http://statweb.stanford.edu/~tibs/ElemStatLearn/), second edition. Springer.