The root node (the first decision node) partitions the data using the feature that provides the most information gain. There are two common measures of impurity used to choose this split: entropy and Gini impurity.
Information gain tells us how important a given attribute of the feature vectors is.
It is calculated as:
$$\text{Information Gain} = \text{entropy(parent)} - [\text{average entropy(children)}]$$
Where entropy is a common measure of target class impurity, given as:
$$\text{Entropy} = \sum_i -p_i \log_2 p_i$$
Where $i$ is each of the target classes and $p_i$ is the fraction of samples belonging to class $i$.
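To make these formulas concrete, here is a minimal sketch in plain NumPy. The parent and children label arrays are made-up toy examples (not the Titanic data), and the children's entropies are weighted by their sizes before averaging.
import numpy as np

def entropy(labels):
    # Entropy = sum over classes of -p_i * log2(p_i)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(parent, children):
    # Information gain = entropy(parent) - size-weighted average entropy(children)
    n = len(parent)
    avg_child_entropy = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - avg_child_entropy

# Toy example: a 50/50 parent split cleanly into two pure children
parent = np.array([0, 0, 1, 1])
children = [np.array([0, 0]), np.array([1, 1])]
print(entropy(parent))                     # 1.0
print(information_gain(parent, children))  # 1.0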
Gini Impurity is another measure of impurity and is calculated as follows:
$$\text{Gini} = 1 - \sum_i p_i^2$$
Where $i$ is each of the target classes.
Gini impurity is computationally faster, since it doesn't require calculating logarithms. In practice, though, which of the two measures you use rarely makes much of a difference.
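A similarly minimal sketch of Gini impurity, again using made-up label arrays rather than the Titanic data:
import numpy as np

def gini(labels):
    # Gini = 1 - sum over classes of p_i^2
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - (p ** 2).sum()

print(gini(np.array([0, 0, 1, 1])))  # 0.5 -- maximally impure for two classes
print(gini(np.array([1, 1, 1, 1])))  # 0.0 -- a pure node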
We’ll be using a decision tree to make predictions about the Titanic data set from Kaggle. This data set provides information on the Titanic passengers and can be used to predict whether a passenger survived or not.
Recall: Data pre-processing is the step that comes before creating, training, and testing our model; it cleans the data and prepares it for consumption by the model. No model is a good model without good data!
import pandas as pd
df = pd.read_csv('train.csv', index_col='PassengerId')
Let's take a look at the DataFrame we just created so that we can select the attributes we would like to use for our refined classification model (a decision tree).
df.head()
We will be using Pclass, Sex, Age, SibSp (siblings/spouses aboard), Parch (parents/children aboard), and Fare to predict whether a passenger survived.
# go ahead and re-assign the dataframe we created to only include the features listed above
# e.g. if we wanted only Sex, we'd do something like df = df[['Sex']]
# type your code here
df = df[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Survived']]
# print out the first 5 items of the dataframe to make sure we're on the right track!
df.head()
We need to convert ‘Sex’ into an integer value of 0 or 1.
# let's use pandas' built in map function to turn all the 'male' instances to 0 and all the 'female' instances to 1
# e.g. if you were to do this for handedness, it would look something like:
# df['handedness'] = df['handedness'].map({ 'right': 0, 'left': 1 })
# type your code here
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})
# print out the first 5 items of the dataframe to make sure we're on the right track!
df.head()
We will also drop any rows with missing values.
Missing values are bad as they tend to screw up our classifier. We don't want anything getting in the way of our super talented model!
The data (X) consists of our feature columns, and the target (y) is the corresponding survival outcome for each row.
df = df.dropna()
X = df.drop('Survived', axis=1)
y = df['Survived']
Now, we're going to want to split our dataset into training and testing instances. Remember, for both training and testing, we need data and labels.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
Let's initialize our model so we can take a look at its attributes.
from sklearn import tree
model = tree.DecisionTreeClassifier()
# Displays the model attributes
model
Defining some of the attributes like max_depth, max_leaf_nodes, min_impurity_split, and min_samples_leaf can help prevent overfitting the model to the training data.
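For instance, a sketch of a more constrained tree might look like the following; the specific values here are arbitrary choices for illustration, not tuned for this data set.
# A sketch of a constrained tree -- these values are only for illustration
constrained_model = tree.DecisionTreeClassifier(
    max_depth=4,          # limit how deep the tree can grow
    max_leaf_nodes=20,    # cap the total number of leaf nodes
    min_samples_leaf=5    # require at least 5 samples in every leaf
)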
First we fit our model using our training data.
# Fit the training data to the model
model.fit(X_train, y_train)
Then we score the model's predictions on the test data against our ground-truth test labels.
y_predict = model.predict(X_test)
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_predict)
We see an accuracy score of ~81.01%, which is significantly better than 50/50 guessing.
Let’s also take a look at our confusion matrix:
from sklearn.metrics import confusion_matrix
pd.DataFrame(
confusion_matrix(y_test, y_predict),
columns=['Predicted Not Survival', 'Predicted Survival'],
index=['True Not Survival', 'True Survival']
)
We can visualize our trained tree by exporting it to a Graphviz dot file:
#tree.export_graphviz(model.tree_, out_file='tree.dot', feature_names=X.columns)
We can then convert this dot file to a png file.
#from subprocess import call
#call(['dot', '-T', 'png', 'tree.dot', '-o', 'tree.png'])
We can then view our tree, which will look something like this:
The root node, with the most information gain, tells us that the biggest factor in determining survival is Sex.
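If you want to confirm this numerically rather than by reading the plot, one quick sketch (assuming the fitted model and feature matrix X from above) is to look at the classifier's feature_importances_ attribute:
# Impurity-based importance of each feature; the values sum to 1
pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)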
If we zoom in on some of the leaf nodes, we can follow some of the decisions down.
We have already zoomed into the part of the decision tree that describes males, with a ticket class below first class, who are under the age of 10.
The impurity is the Gini measure shown at the top of each node, samples is the number of observations remaining to classify, and value shows how many samples are in class 0 (did not survive) and how many are in class 1 (survived).
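These per-node quantities can also be read directly from the fitted tree object; here's a short sketch that prints them for the root node (node 0):
t = model.tree_
node = 0  # the root node
print('feature  :', X.columns[t.feature[node]])
print('threshold:', t.threshold[node])
print('impurity :', t.impurity[node])        # the Gini value shown in the plot
print('samples  :', t.n_node_samples[node])  # observations reaching this node
print('value    :', t.value[node])           # class distribution [did not survive, survived]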
Let's follow this part of the tree down; the nodes to the left are True and the nodes to the right are False:
We see that we have 19 observations left to classify: 9 did not survive and 10 did.
From this point the most information gain is how many siblings (SibSp) were aboard.
A. 9 out of the 10 samples with less than 2.5 siblings survived.
B. This leaves 10 observations: 9 did not survive and 1 did.
6 of these children that only had one parent (Parch) aboard did not survive.
None of the children aged > 3.5 survived.
Of the 2 remaining children, the one with > 4.5 siblings did not survive.