#!/usr/bin/env python
# coding: utf-8

# ### Nonlinear Classification and Regression with Decision Trees
#
# #### Decision trees
#
# Decision trees are commonly learned by recursively splitting the set of training
# instances into subsets based on the instances' values for the explanatory variables.
#
# In classification tasks, the leaf nodes of the decision tree represent classes. In
# regression tasks, the values of the response variable for the instances contained in
# a leaf node may be averaged to produce the estimate for the response variable. After
# the decision tree has been constructed, making a prediction for a test instance
# requires only following the edges until a leaf node is reached.
#
# Let's create a decision tree using an algorithm called Iterative Dichotomiser 3 (ID3).
# Invented by Ross Quinlan, ID3 was one of the first algorithms used to train decision
# trees.
#
# But how do we choose which variable to split on first so that the resulting tree is
# small? We need a measure of how uncertain a set of labels is; entropy provides one.
#
# Measured in bits (when b is 2), entropy quantifies the amount of uncertainty in a
# variable. Entropy is given by the following equation, where n is the number of
# outcomes and P(x_i) is the probability of outcome i. Common values for b are 2, e,
# and 10. Because the log of a number less than one is negative, the entire sum is
# negated to return a positive value.
#
# **Entropy** $$ H(X) = -\sum_{i=1}^{n} P(x_i)\log_b P(x_i) $$
#
# #### Information gain
#
# Selecting the test that produces the subsets with the lowest average entropy can
# produce a suboptimal tree, because it does not account for how many instances fall
# into each subset. Instead, we measure the reduction in entropy with a metric called
# information gain: the difference between the entropy of the parent node, H(T), and
# the weighted average of the children nodes' entropies. For an explanatory variable a
# that splits T into subsets T_v, one for each value v of a, it is:
#
# $$ IG(T, a) = H(T) - \sum_{v \in vals(a)} \frac{|T_v|}{|T|} H(T_v) $$
#
# ![](data/information_gain.png)
#
# **ID3** is one of the most commonly used algorithms for building decision trees.
# **C4.5** is a modified version of ID3 that can be used with continuous explanatory
# variables and can accommodate missing values for features; it can also prune trees.
# Pruning reduces the size of a tree by replacing branches that classify few instances
# with leaf nodes. **CART**, the algorithm used by scikit-learn's implementation of
# decision trees, is another learning algorithm that supports pruning.
#
# #### Gini impurity
#
# Gini impurity measures the proportions of classes in a set. Gini impurity is given by
# the following equation, where j is the number of classes, t is the subset of instances
# for the node, and P(i|t) is the probability of selecting an element of class i from
# the node's subset:
#
# $$ Gini(t) = 1 - \sum_{i=1}^{j} P(i|t)^2 $$
#
# Intuitively, Gini impurity is zero when all of the elements of the set are the same
# class, as the probability of selecting an element of that class is then equal to one.
# Like entropy, Gini impurity is greatest when each class has an equal probability of
# being selected. The maximum value of Gini impurity depends on the number of possible
# classes, n, and is given by the following equation:
#
# $$ Gini_{max} = 1 - \frac{1}{n} $$
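# A minimal NumPy sketch (not from the original text) of the entropy and information
# gain formulas above follows; the helper names and the toy split are illustrative only.

# In[ ]:

import numpy as np

def entropy(labels):
    """H(X) = -sum_i P(x_i) * log2(P(x_i)), computed from label frequencies."""
    _, counts = np.unique(labels, return_counts=True)
    probabilities = counts / counts.sum()
    return -np.sum(probabilities * np.log2(probabilities))

def information_gain(parent, children):
    """Parent entropy minus the size-weighted average entropy of the children."""
    n = len(parent)
    weighted_children = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted_children

# Toy example: a balanced parent node split into two purer children.
parent = ['ad', 'ad', 'ad', 'ad', 'nonad', 'nonad', 'nonad', 'nonad']
children = [['ad', 'ad', 'ad', 'nonad'], ['ad', 'nonad', 'nonad', 'nonad']]
print('Parent entropy:   %.3f bits' % entropy(parent))  # 1.000 for a 50/50 split
print('Information gain: %.3f bits' % information_gain(parent, children))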
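# A minimal sketch (not from the original text) of the Gini impurity formula and its
# maximum value 1 - 1/n; the function name and the example labels are illustrative only.

# In[ ]:

import numpy as np

def gini_impurity(labels):
    """Gini(t) = 1 - sum_i P(i|t)^2, computed from label frequencies."""
    _, counts = np.unique(labels, return_counts=True)
    probabilities = counts / counts.sum()
    return 1 - np.sum(probabilities ** 2)

print(gini_impurity(['ad', 'ad', 'ad', 'ad']))        # 0.0: a pure node
print(gini_impurity(['ad', 'ad', 'nonad', 'nonad']))  # 0.5 = 1 - 1/2, the maximum for 2 classes
print(gini_impurity(['a', 'a', 'b', 'b', 'c', 'c']))  # ~0.667 = 1 - 1/3, the maximum for 3 classes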
# In[22]:

# Imports. GridSearchCV and train_test_split live in sklearn.model_selection in current
# scikit-learn (the old sklearn.grid_search module was removed), and the metrics used
# later are imported here as well.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier


# In[3]:

# The Internet Advertisements dataset: the last column holds the target
# ('ad.' or 'nonad.'); every other column is an explanatory variable.
df = pd.read_csv("data/ad.data", header=None)
explanatory_variable_columns = set(df.columns.values)
response_variable_column = df[len(df.columns.values) - 1]
explanatory_variable_columns.remove(len(df.columns.values) - 1)
y = [1 if e == 'ad.' else 0 for e in response_variable_column]
X = df[list(explanatory_variable_columns)]


# In[20]:

# Missing values are encoded as '?' (sometimes padded with spaces); replace them with -1.
# DataFrame.replace returns a new DataFrame, so assign the result back to X.
X = X.replace(to_replace=r' *\?', value=-1, regex=True)


# In[15]:

X_train, X_test, y_train, y_test = train_test_split(X, y)


# In[16]:

pipeline = Pipeline([
    ('clf', DecisionTreeClassifier(criterion='entropy'))
])
parameters = {
    'clf__max_depth': (150, 155, 160),
    'clf__min_samples_split': (2, 3),  # must be >= 2 in current scikit-learn
    'clf__min_samples_leaf': (1, 2, 3)
}


# In[17]:

grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='f1')


# In[21]:

grid_search.fit(X_train, y_train)


# In[ ]:

print('Best score: %0.3f' % grid_search.best_score_)
print('Best parameters set:')
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print('\t%s: %r' % (param_name, best_parameters[param_name]))

predictions = grid_search.predict(X_test)
print('Accuracy:', accuracy_score(y_test, predictions))
print('Confusion Matrix:', confusion_matrix(y_test, predictions))
print('Classification Report:', classification_report(y_test, predictions))


# #### Tree ensembles (RandomForestClassifier)
#
# Ensemble learning methods combine a set of models to produce an estimator that has
# better predictive performance than its individual components. A random forest is a
# collection of decision trees that have been trained on randomly selected subsets of
# the training instances and explanatory variables. Random forests usually make
# predictions by returning the mode or mean of the predictions of their constituent
# trees.
#
# Random forests are less prone to overfitting than decision trees because no single
# tree can learn from all of the instances and explanatory variables; no single tree
# can memorize all of the noise in the representation.

# In[23]:

pipeline = Pipeline([
    ('clf', RandomForestClassifier(criterion='entropy'))
])
parameters = {
    'clf__n_estimators': (5, 10, 20, 50),
    'clf__max_depth': (50, 150, 250),
    'clf__min_samples_split': (2, 3),  # must be >= 2 in current scikit-learn
    'clf__min_samples_leaf': (1, 2, 3)
}


# In[26]:

grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='f1')
grid_search.fit(X_train, y_train)


# #### The advantages and disadvantages of decision trees
#
# Decision trees are easy to use. Unlike many learning algorithms, decision trees do
# not require the data to have zero mean and unit variance. While decision trees can
# tolerate missing values for explanatory variables, scikit-learn's current
# implementation cannot. Decision trees can even learn to ignore explanatory variables
# that are not relevant to the task.
#
# Small decision trees can be easy to interpret and visualize with the export_graphviz
# function from scikit-learn's tree module (a small sketch follows in the next cell);
# the branches of a decision tree are conjunctions of logical predicates, and they are
# easily visualized as flowcharts.
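# The next cell is a small illustrative sketch (not part of the original analysis) of
# export_graphviz: it trains a shallow tree on the iris dataset, which is small enough
# to visualize, and writes the Graphviz DOT source to disk.

# In[ ]:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

iris = load_iris()
shallow_tree = DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=0)
shallow_tree.fit(iris.data, iris.target)

# With out_file=None, export_graphviz returns the tree as a Graphviz DOT string;
# render the saved file with, e.g., `dot -Tpng iris_tree.dot -o iris_tree.png`.
dot_source = export_graphviz(shallow_tree,
                             out_file=None,
                             feature_names=iris.feature_names,
                             class_names=list(iris.target_names),
                             filled=True,
                             rounded=True)
with open('iris_tree.dot', 'w') as f:
    f.write(dot_source)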
# Decision trees support multioutput tasks, and a single decision tree can be used for
# multiclass classification without employing a strategy like one-versus-all.
#
# Decision trees are eager learners. Eager learners must build an input-independent
# model from the training data before they can be used to estimate the values of test
# instances, but they can predict relatively quickly once the model has been built. In
# contrast, lazy learners such as the k-nearest neighbors algorithm defer all
# generalization until they must make a prediction. Lazy learners do not spend time
# training, but they often predict slowly compared to eager learners.
#
# Decision trees are more prone to overfitting than many other models. Pruning is a
# common strategy that removes some of the tallest nodes and leaves of a decision tree;
# recent versions of scikit-learn support it through minimal cost-complexity pruning
# (the ccp_alpha parameter). Similar effects can also be achieved by setting a maximum
# depth for the tree or by creating child nodes only when the number of training
# instances they will contain exceeds a threshold, as sketched in the cell below.
#
# Some well-known decision tree algorithms are ID3, C4.5 (implemented in Weka as J48),
# and CART; random forests are a widely used ensemble of trees.
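# The next cell is a small illustrative sketch (not part of the original analysis): on a
# synthetic dataset with hypothetical parameters, it compares an unconstrained tree with
# one limited by max_depth and min_samples_leaf to show the pruning-like effect
# described above.

# In[ ]:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy binary classification problem (parameters chosen arbitrarily).
X_toy, y_toy = make_classification(n_samples=500, n_features=20, n_informative=5,
                                   flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=0)

# An unconstrained tree grows until its leaves are pure and tends to memorize noise.
full_tree = DecisionTreeClassifier(criterion='entropy', random_state=0).fit(X_tr, y_tr)

# Limiting the depth and requiring a minimum number of instances per leaf keeps the
# tree small, which usually generalizes better on noisy data.
small_tree = DecisionTreeClassifier(criterion='entropy', max_depth=5,
                                    min_samples_leaf=10, random_state=0).fit(X_tr, y_tr)

for name, tree in [('unconstrained', full_tree), ('constrained', small_tree)]:
    print('%s: depth=%d, leaves=%d, train acc=%.3f, test acc=%.3f' % (
        name, tree.get_depth(), tree.get_n_leaves(),
        tree.score(X_tr, y_tr), tree.score(X_te, y_te)))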