Authors: Maria Sumarokova, senior data scientist/analyst at Veon, and Yury Kashnitsky, data scientist at Mail.Ru Group. Translated and edited by Gleb Filatov, Aleksey Kiselev, Anastasia Manokhina, Egor Polusmak, and Yuanyuan Pao. All content is distributed under the Creative Commons CC BY-NC-SA 4.0 license.
Please fill in the answers in the web form.
Let's start by loading all necessary libraries:
%matplotlib inline
from matplotlib import pyplot as plt
plt.rcParams['figure.figsize'] = (10, 8)
import seaborn as sns
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
import collections
from sklearn.model_selection import GridSearchCV
from sklearn import preprocessing
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from ipywidgets import Image
from io import StringIO
import pydotplus  # pip install pydotplus
Your goal is to figure out how decision trees work by walking through a toy problem. While a single decision tree does not yield outstanding results, other performant algorithms like gradient boosting and random forests are based on the same idea. That is why knowing how decision trees work might be useful.
We'll go through a toy example of binary classification: Person A is deciding whether they will go on a second date with Person B. The decision will depend on Person B's looks, eloquence, alcohol consumption (just for the example), and how much money was spent on the first date.
# Create a dataframe with dummy variables
def create_df(dic, feature_list):
    out = pd.DataFrame(dic)
    out = pd.concat([out, pd.get_dummies(out[feature_list])], axis=1)
    out.drop(feature_list, axis=1, inplace=True)
    return out
# Some feature values are present in train and absent in test and vice versa.
def intersect_features(train, test):
    common_feat = list(set(train.keys()) & set(test.keys()))
    return train[common_feat], test[common_feat]
features = ['Looks', 'Alcoholic_beverage', 'Eloquence', 'Money_spent']
df_train = {}
df_train['Looks'] = ['handsome', 'handsome', 'handsome', 'repulsive',
                     'repulsive', 'repulsive', 'handsome']
df_train['Alcoholic_beverage'] = ['yes', 'yes', 'no', 'no', 'yes', 'yes', 'yes']
df_train['Eloquence'] = ['high', 'low', 'average', 'average', 'low',
                         'high', 'average']
df_train['Money_spent'] = ['lots', 'little', 'lots', 'little', 'lots',
                           'lots', 'lots']
df_train['Will_go'] = LabelEncoder().fit_transform(['+', '-', '+', '-', '-', '+', '+'])
df_train = create_df(df_train, features)
df_train
df_test = {}
df_test['Looks'] = ['handsome', 'handsome', 'repulsive']
df_test['Alcoholic_beverage'] = ['no', 'yes', 'yes']
df_test['Eloquence'] = ['average', 'high', 'average']
df_test['Money_spent'] = ['lots', 'little', 'lots']
df_test = create_df(df_test, features)
df_test
# Some feature values are present in train and absent in test and vice versa.
y = df_train['Will_go']
df_train, df_test = intersect_features(train=df_train, test=df_test)
df_train
df_test
1. What is the entropy $S_0$ of the initial system? By system states, we mean the values of the binary feature "Will_go": 0 or 1, two states in total.
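Recall that the entropy of a system with $k$ possible states is defined as

$$S = -\sum_{i=1}^{k} p_i \log_2 p_i,$$

where $p_i$ is the probability of finding the system in the $i$-th state.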
# your code here
2. Let's split the data by the feature "Looks_handsome". What is the entropy $S_1$ of the left group, the one where "Looks_handsome" equals 1? What is the entropy $S_2$ of the opposite group? What is the information gain (IG) of such a split?
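Recall that the information gain of a binary split is

$$IG = S_0 - \frac{N_1}{N} S_1 - \frac{N_2}{N} S_2,$$

where $N_1$ and $N_2$ are the sizes of the two groups and $N = N_1 + N_2$.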
# your code here
# your code here
# your code here
Consider the following warm-up example: we have 9 blue balls and 11 yellow balls. Let a ball have label 1 if it is blue and 0 otherwise.
balls = [1 for i in range(9)] + [0 for i in range(11)]
<img src = '../../img/decision_tree3.png'>
Next, we split the balls into two groups:
<img src = '../../img/decision_tree4.png'>
# two groups
balls_left = [1 for i in range(8)] + [0 for i in range(5)] # 8 blue and 5 yellow
balls_right = [1 for i in range(1)] + [0 for i in range(6)] # 1 blue and 6 yellow
def entropy(a_list):
    # your code here
    pass
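If you get stuck, here is one possible implementation (a minimal sketch using numpy; try writing your own version first, any equivalent computation of the Shannon entropy is fine):

def entropy(a_list):
    """Shannon entropy (base 2) of the distribution of values in a_list."""
    _, counts = np.unique(a_list, return_counts=True)  # counts of each distinct value
    probs = counts / len(a_list)                       # empirical probabilities
    return -np.sum(probs * np.log2(probs))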
Tests
print(entropy(balls))  # 9 blue and 11 yellow
print(entropy(balls_left))  # 8 blue and 5 yellow
print(entropy(balls_right))  # 1 blue and 6 yellow
print(entropy([1, 2, 3, 4, 5, 6]))  # entropy of a fair 6-sided die
3. What is the entropy of the state given by the list balls_left?
4. What is the entropy of a fair die? (Here we look at a die as a system with 6 equally probable states.)
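Hint: for a system with $k$ equally probable states, the entropy formula reduces to $S = \log_2 k$.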
# information gain calculation
def information_gain(root, left, right):
    '''root - initial data, left and right - two partitions of initial data'''
    # your code here
    pass
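If you get stuck, a possible implementation that reuses the entropy function above (a sketch; the entropies of the two partitions are weighted by their relative sizes):

def information_gain(root, left, right):
    """IG = S(root) minus the size-weighted average entropy of the two partitions."""
    n = len(root)
    return (entropy(root)
            - len(left) / n * entropy(left)
            - len(right) / n * entropy(right))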
5. What is the information gain from splitting the initial dataset into balls_left and balls_right?
def best_feature_to_split(X, y):
    '''Outputs information gain when splitting on best feature'''
    # your code here
    pass
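One way to fill in this function for our dummy-coded data, where every feature is binary (a sketch; it assumes X is a dataframe of 0/1 features like df_train above):

def best_feature_to_split(X, y):
    """Outputs the information gain when splitting on the best feature."""
    best_gain = 0
    for feature in X.columns:
        mask = X[feature] == 1  # binary dummy feature: split into ones and zeros
        gain = information_gain(y, y[mask], y[~mask])
        best_gain = max(best_gain, gain)
    return best_gain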
Dataset UCI Adult (no need to download it, we have a copy in the course repository): classify people using demographic data, predicting whether they earn more than \$50,000 per year or not.
Feature descriptions:

- Age – continuous feature
- Workclass – categorical feature
- fnlwgt – continuous feature
- Education – categorical feature
- Education_Num – number of years of education, continuous feature
- Martial_Status – categorical feature
- Occupation – categorical feature
- Relationship – categorical feature
- Race – categorical feature
- Sex – categorical feature
- Capital_Gain – continuous feature
- Capital_Loss – continuous feature
- Hours_per_week – continuous feature
- Country – categorical feature
- Target – earnings level, categorical (binary) feature
data_train = pd.read_csv('../../data/adult_train.csv', sep=';')
data_train.tail()
data_test = pd.read_csv('../../data/adult_test.csv', sep=';')
data_test.tail()
# remove rows with incorrect labels from the test dataset
data_test = data_test[(data_test['Target'] == ' >50K.')
                      | (data_test['Target'] == ' <=50K.')]
# encode target variable as integer
data_train.loc[data_train['Target'] == ' <=50K', 'Target'] = 0
data_train.loc[data_train['Target'] == ' >50K', 'Target'] = 1
data_test.loc[data_test['Target'] == ' <=50K.', 'Target'] = 0
data_test.loc[data_test['Target'] == ' >50K.', 'Target'] = 1
data_test.describe(include='all').T
data_train['Target'].value_counts()
fig = plt.figure(figsize=(25, 15))
cols = 5
rows = int(np.ceil(data_train.shape[1] / cols))
for i, column in enumerate(data_train.columns):
    ax = fig.add_subplot(rows, cols, i + 1)
    ax.set_title(column)
    if data_train.dtypes[column] == object:
        data_train[column].value_counts().plot(kind="bar", ax=ax)
    else:
        data_train[column].hist(ax=ax)
    plt.xticks(rotation="vertical")
plt.subplots_adjust(hspace=0.7, wspace=0.2)
data_train.dtypes
data_test.dtypes
As we can see, in the test data, Age is treated as type object. We need to fix this.
data_test['Age'] = data_test['Age'].astype(int)
We'll also cast all float features to int to keep types consistent between our train and test data.
data_test['fnlwgt'] = data_test['fnlwgt'].astype(int)
data_test['Education_Num'] = data_test['Education_Num'].astype(int)
data_test['Capital_Gain'] = data_test['Capital_Gain'].astype(int)
data_test['Capital_Loss'] = data_test['Capital_Loss'].astype(int)
data_test['Hours_per_week'] = data_test['Hours_per_week'].astype(int)
# choose categorical and continuous features from data
categorical_columns = [c for c in data_train.columns
                       if data_train[c].dtype.name == 'object']
numerical_columns = [c for c in data_train.columns
                     if data_train[c].dtype.name != 'object']
print('categorical_columns:', categorical_columns)
print('numerical_columns:', numerical_columns)
# fill missing data with the train mode (categorical) or the train median (numerical)
for c in categorical_columns:
    data_train[c].fillna(data_train[c].mode()[0], inplace=True)
    data_test[c].fillna(data_train[c].mode()[0], inplace=True)
for c in numerical_columns:
    data_train[c].fillna(data_train[c].median(), inplace=True)
    data_test[c].fillna(data_train[c].median(), inplace=True)
We'll dummy-code the categorical features: Workclass, Education, Martial_Status, Occupation, Relationship, Race, Sex, and Country. This can be done with the pandas method get_dummies.
data_train = pd.concat([data_train[numerical_columns],
                        pd.get_dummies(data_train[categorical_columns])], axis=1)
data_test = pd.concat([data_test[numerical_columns],
                       pd.get_dummies(data_test[categorical_columns])], axis=1)
set(data_train.columns) - set(data_test.columns)
data_train.shape, data_test.shape
# the test data has no examples with Country 'Holand-Netherlands', so add an all-zero column
data_test['Country_ Holand-Netherlands'] = 0
set(data_train.columns) - set(data_test.columns)
data_train.head(2)
data_test.head(2)
X_train = data_train.drop(['Target'], axis=1)
y_train = data_train['Target']
X_test = data_test.drop(['Target'], axis=1)
y_test = data_test['Target']
Train a decision tree (DecisionTreeClassifier) with a maximum depth of 3, and evaluate the accuracy metric on the test data. Use the parameter random_state=17 for reproducibility.
# your code here
# tree =
# tree.fit
Make a prediction with the trained model on the test data.
# your code here
# tree_predictions = tree.predict
# your code here
# accuracy_score
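Putting the three cells above together, a minimal sketch might look like this (reordering the test columns to match the training columns is a precaution I add here, since the column orders of the two dataframes may differ after get_dummies):

# align the test column order with the train column order
X_test = X_test[X_train.columns]

tree = DecisionTreeClassifier(max_depth=3, random_state=17)
tree.fit(X_train, y_train)
tree_predictions = tree.predict(X_test)
print(accuracy_score(y_test, tree_predictions))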
6. What is the test set accuracy of a decision tree with maximum tree depth of 3 and random_state = 17?
Train a decision tree (DecisionTreeClassifier, random_state = 17). Find the optimal maximum depth using 5-fold cross-validation (GridSearchCV).
tree_params = {'max_depth': range(2, 11)}
locally_best_tree = GridSearchCV  # your code here
locally_best_tree.fit  # your code here
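One possible way to complete the two lines above (a sketch; cv=5 matches the 5-fold cross-validation requested in the task):

locally_best_tree = GridSearchCV(DecisionTreeClassifier(random_state=17),
                                 tree_params, cv=5)
locally_best_tree.fit(X_train, y_train)
print(locally_best_tree.best_params_, locally_best_tree.best_score_)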
Train a decision tree with a maximum depth of 9 (the best max_depth in my case), and compute the test set accuracy. Use the parameter random_state=17 for reproducibility.
# your code here
# tuned_tree =
# tuned_tree.fit
# tuned_tree_predictions = tuned_tree.predict
# accuracy_score
7. What is the test set accuracy of a decision tree with maximum depth of 9 and random_state = 17?
Let's take a sneak peek at upcoming lectures and try to use a random forest for our task. For now, you can think of a random forest as a bunch of decision trees trained on slightly different subsets of the training data.
Train a random forest (RandomForestClassifier). Set the number of trees to 100 and use random_state = 17.
# your code here
# rf =
# rf.fit # your code here
Make predictions for the test data and assess accuracy.
# your code here
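A minimal sketch covering both cells (it assumes the X_test columns were already aligned with X_train, as in the decision tree sketch above):

rf = RandomForestClassifier(n_estimators=100, random_state=17)
rf.fit(X_train, y_train)
forest_predictions = rf.predict(X_test)
print(accuracy_score(y_test, forest_predictions))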
Train a random forest (RandomForestClassifier). Tune the maximum depth and maximum number of features for each tree using GridSearchCV.
# forest_params = {'max_depth': range(10, 21),
#                  'max_features': range(5, 105, 20)}
# locally_best_forest = GridSearchCV # your code here
# locally_best_forest.fit # your code here
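A possible way to fill in the commented lines (a sketch; cv=5 and n_jobs=-1 are my choices here, not requirements of the assignment, and the full grid search is computationally heavy):

forest_params = {'max_depth': range(10, 21),
                 'max_features': range(5, 105, 20)}
locally_best_forest = GridSearchCV(
    RandomForestClassifier(n_estimators=100, random_state=17),
    forest_params, cv=5, n_jobs=-1)
locally_best_forest.fit(X_train, y_train)
print(locally_best_forest.best_params_, locally_best_forest.best_score_)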
Make predictions for the test data and assess accuracy.
# your code here