#!/usr/bin/env python
# coding: utf-8

# ### Nonlinear Classification and Regression with Decision Trees
#
# #### Decision trees
#
# Decision trees are commonly learned by recursively splitting the set of training
# instances into subsets based on the instances' values for the explanatory variables.
#
# In classification tasks, the leaf nodes of the decision tree represent classes. In
# regression tasks, the values of the response variable for the instances contained in
# a leaf node may be averaged to produce the estimate for the response variable. After
# the decision tree has been constructed, making a prediction for a test instance
# requires only following the edges until a leaf node is reached.
#
# Let's create a decision tree using an algorithm called Iterative Dichotomiser 3 (ID3).
# Invented by Ross Quinlan, ID3 was one of the first algorithms used to train decision
# trees.
#
# But how do we choose which variable to split on first so that the resulting tree is
# small? We need a measure of how uncertain a set of labels is; entropy provides one.
#
# Measured in bits (when b is 2), entropy quantifies the amount of uncertainty in a
# variable. Entropy is given by the following equation, where n is the number of
# outcomes and P(x_i) is the probability of outcome i. Common values for b are 2, e,
# and 10. Because the log of a number less than one is negative, the entire sum is
# negated to return a positive value.
#
# **Entropy** $$ H(X) = -\sum_{i=1}^{n} P(x_i)\log_b P(x_i) $$
#
# #### Information gain
#
# Selecting the test that produces the subsets with the lowest average entropy can
# produce a suboptimal tree, because it does not account for how many instances fall
# into each subset. Instead, we measure the reduction in entropy with a metric called
# information gain: the difference between the entropy of the parent node, H(T), and
# the weighted average of the children nodes' entropies. For an explanatory variable a
# that splits T into subsets T_v, one for each value v of a, it is:
#
# $$ IG(T, a) = H(T) - \sum_{v \in vals(a)} \frac{|T_v|}{|T|} H(T_v) $$
#
# ![](data/information_gain.png)
#
# **ID3** is one of the most commonly used algorithms for building decision trees.
# **C4.5** is a modified version of ID3 that can be used with continuous explanatory
# variables and can accommodate missing values for features; it can also prune trees.
# Pruning reduces the size of a tree by replacing branches that classify few instances
# with leaf nodes. **CART**, the algorithm used by scikit-learn's implementation of
# decision trees, is another learning algorithm that supports pruning.
#
# #### Gini impurity
#
# Gini impurity measures the proportions of classes in a set. Gini impurity is given by
# the following equation, where j is the number of classes, t is the subset of instances
# for the node, and P(i|t) is the probability of selecting an element of class i from
# the node's subset:
#
# $$ Gini(t) = 1 - \sum_{i=1}^{j} P(i|t)^2 $$
#
# Intuitively, Gini impurity is zero when all of the elements of the set are the same
# class, as the probability of selecting an element of that class is then equal to one.
# Like entropy, Gini impurity is greatest when each class has an equal probability of
# being selected. The maximum value of Gini impurity depends on the number of possible
# classes, n, and is given by the following equation:
#
# $$ Gini_{max} = 1 - \frac{1}{n} $$
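# A minimal NumPy sketch (not from the original text) of the entropy and information
# gain formulas above follows; the helper names and the toy split are illustrative only.

# In[ ]:

import numpy as np

def entropy(labels):
    """H(X) = -sum_i P(x_i) * log2(P(x_i)), computed from label frequencies."""
    _, counts = np.unique(labels, return_counts=True)
    probabilities = counts / counts.sum()
    return -np.sum(probabilities * np.log2(probabilities))

def information_gain(parent, children):
    """Parent entropy minus the size-weighted average entropy of the children."""
    n = len(parent)
    weighted_children = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted_children

# Toy example: a balanced parent node split into two purer children.
parent = ['ad', 'ad', 'ad', 'ad', 'nonad', 'nonad', 'nonad', 'nonad']
children = [['ad', 'ad', 'ad', 'nonad'], ['ad', 'nonad', 'nonad', 'nonad']]
print('Parent entropy:   %.3f bits' % entropy(parent))  # 1.000 for a 50/50 split
print('Information gain: %.3f bits' % information_gain(parent, children))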
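# A minimal sketch (not from the original text) of the Gini impurity formula and its
# maximum value 1 - 1/n; the function name and the example labels are illustrative only.

# In[ ]:

import numpy as np

def gini_impurity(labels):
    """Gini(t) = 1 - sum_i P(i|t)^2, computed from label frequencies."""
    _, counts = np.unique(labels, return_counts=True)
    probabilities = counts / counts.sum()
    return 1 - np.sum(probabilities ** 2)

print(gini_impurity(['ad', 'ad', 'ad', 'ad']))        # 0.0: a pure node
print(gini_impurity(['ad', 'ad', 'nonad', 'nonad']))  # 0.5 = 1 - 1/2, the maximum for 2 classes
print(gini_impurity(['a', 'a', 'b', 'b', 'c', 'c']))  # ~0.667 = 1 - 1/3, the maximum for 3 classes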
# In[22]:

# Imports. GridSearchCV and train_test_split live in sklearn.model_selection in current
# scikit-learn (the old sklearn.grid_search module was removed), and the metrics used
# later are imported here as well.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier


# In[3]:

# The Internet Advertisements dataset: the last column holds the target
# ('ad.' or 'nonad.'); every other column is an explanatory variable.
df = pd.read_csv("data/ad.data", header=None)
explanatory_variable_columns = set(df.columns.values)
response_variable_column = df[len(df.columns.values) - 1]
explanatory_variable_columns.remove(len(df.columns.values) - 1)
y = [1 if e == 'ad.' else 0 for e in response_variable_column]
X = df[list(explanatory_variable_columns)]


# In[20]:

# Missing values are encoded as '?' (sometimes padded with spaces); replace them with -1.
# DataFrame.replace returns a new DataFrame, so assign the result back to X.
X = X.replace(to_replace=r' *\?', value=-1, regex=True)


# In[15]:

X_train, X_test, y_train, y_test = train_test_split(X, y)


# In[16]:

pipeline = Pipeline([
    ('clf', DecisionTreeClassifier(criterion='entropy'))
])
parameters = {
    'clf__max_depth': (150, 155, 160),
    'clf__min_samples_split': (2, 3),  # must be >= 2 in current scikit-learn
    'clf__min_samples_leaf': (1, 2, 3)
}


# In[17]:

grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='f1')


# In[21]:

grid_search.fit(X_train, y_train)


# In[ ]:

print('Best score: %0.3f' % grid_search.best_score_)
print('Best parameters set:')
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print('\t%s: %r' % (param_name, best_parameters[param_name]))

predictions = grid_search.predict(X_test)
print('Accuracy:', accuracy_score(y_test, predictions))
print('Confusion Matrix:', confusion_matrix(y_test, predictions))
print('Classification Report:', classification_report(y_test, predictions))


# #### Tree ensembles (RandomForestClassifier)
#
# Ensemble learning methods combine a set of models to produce an estimator that has
# better predictive performance than its individual components. A random forest is a
# collection of decision trees that have been trained on randomly selected subsets of
# the training instances and explanatory variables. Random forests usually make
# predictions by returning the mode or mean of the predictions of their constituent
# trees.
#
# Random forests are less prone to overfitting than decision trees because no single
# tree can learn from all of the instances and explanatory variables; no single tree
# can memorize all of the noise in the representation.

# In[23]:

pipeline = Pipeline([
    ('clf', RandomForestClassifier(criterion='entropy'))
])
parameters = {
    'clf__n_estimators': (5, 10, 20, 50),
    'clf__max_depth': (50, 150, 250),
    'clf__min_samples_split': (2, 3),  # must be >= 2 in current scikit-learn
    'clf__min_samples_leaf': (1, 2, 3)
}


# In[26]:

grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='f1')
grid_search.fit(X_train, y_train)


# #### The advantages and disadvantages of decision trees
#
# Decision trees are easy to use. Unlike many learning algorithms, decision trees do
# not require the data to have zero mean and unit variance. While decision trees can
# tolerate missing values for explanatory variables, scikit-learn's current
# implementation cannot. Decision trees can even learn to ignore explanatory variables
# that are not relevant to the task.
#
# Small decision trees can be easy to interpret and visualize with the export_graphviz
# function from scikit-learn's tree module (a small sketch follows in the next cell);
# the branches of a decision tree are conjunctions of logical predicates, and they are
# easily visualized as flowcharts.
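# The next cell is a small illustrative sketch (not part of the original analysis) of
# export_graphviz: it trains a shallow tree on the iris dataset, which is small enough
# to visualize, and writes the Graphviz DOT source to disk.

# In[ ]:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

iris = load_iris()
shallow_tree = DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=0)
shallow_tree.fit(iris.data, iris.target)

# With out_file=None, export_graphviz returns the tree as a Graphviz DOT string;
# render the saved file with, e.g., `dot -Tpng iris_tree.dot -o iris_tree.png`.
dot_source = export_graphviz(shallow_tree,
                             out_file=None,
                             feature_names=iris.feature_names,
                             class_names=list(iris.target_names),
                             filled=True,
                             rounded=True)
with open('iris_tree.dot', 'w') as f:
    f.write(dot_source)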
# Decision trees support multioutput tasks, and a single decision tree can be used for
# multiclass classification without employing a strategy like one-versus-all.
#
# Decision trees are eager learners. Eager learners must build an input-independent
# model from the training data before they can be used to estimate the values of test
# instances, but they can predict relatively quickly once the model has been built. In
# contrast, lazy learners such as the k-nearest neighbors algorithm defer all
# generalization until they must make a prediction. Lazy learners do not spend time
# training, but they often predict slowly compared to eager learners.
#
# Decision trees are more prone to overfitting than many other models. Pruning is a
# common strategy that removes some of the tallest nodes and leaves of a decision tree;
# recent versions of scikit-learn support it through minimal cost-complexity pruning
# (the ccp_alpha parameter). Similar effects can also be achieved by setting a maximum
# depth for the tree or by creating child nodes only when the number of training
# instances they will contain exceeds a threshold, as sketched in the cell below.
#
# Some well-known decision tree algorithms are ID3, C4.5 (implemented in Weka as J48),
# and CART; random forests are a widely used ensemble of trees.
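# The next cell is a small illustrative sketch (not part of the original analysis): on a
# synthetic dataset with hypothetical parameters, it compares an unconstrained tree with
# one limited by max_depth and min_samples_leaf to show the pruning-like effect
# described above.

# In[ ]:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy binary classification problem (parameters chosen arbitrarily).
X_toy, y_toy = make_classification(n_samples=500, n_features=20, n_informative=5,
                                   flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=0)

# An unconstrained tree grows until its leaves are pure and tends to memorize noise.
full_tree = DecisionTreeClassifier(criterion='entropy', random_state=0).fit(X_tr, y_tr)

# Limiting the depth and requiring a minimum number of instances per leaf keeps the
# tree small, which usually generalizes better on noisy data.
small_tree = DecisionTreeClassifier(criterion='entropy', max_depth=5,
                                    min_samples_leaf=10, random_state=0).fit(X_tr, y_tr)

for name, tree in [('unconstrained', full_tree), ('constrained', small_tree)]:
    print('%s: depth=%d, leaves=%d, train acc=%.3f, test acc=%.3f' % (
        name, tree.get_depth(), tree.get_n_leaves(),
        tree.score(X_tr, y_tr), tree.score(X_te, y_te)))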