mlcourse.ai – Open Machine Learning Course

Author: Yury Kashnitsky (@yorko). Edited by Anna Tarelina (@feuerengel) and Mikhail Korshchikov (@MS4). This material is subject to the terms and conditions of the Creative Commons CC BY-NC-SA 4.0 license. Free use is permitted for any non-commercial purpose.

Assignment #2. Fall 2019

Part 1. Decision trees for classification and regression

In this assignment, we will find out how a decision tree works in a regression task, then will build and tune classification decision trees for identifying heart diseases.

Prior to working on the assignment, you'd better check out the corresponding course material:

  1. Classification, Decision Trees and k Nearest Neighbors, also available as an interactive web-based Kaggle Kernel
  2. Ensembles
  3. You can also practice with demo assignments, which are simpler and already shared with solutions
  4. There are also 7 video lectures on trees, forests, boosting and their applications: mlcourse.ai/lectures

Your task is to:

  1. write code and perform computations in the cells below
  2. choose answers in the webform. Solutions will be shared only with those who've filled in this form

Deadline for A2: 2019 October 6, 20:59 CET (London time)

In [1]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier, export_graphviz

1. Decision trees for regression: a toy example

Let's consider the following one-dimensional regression problem. We need to build a function $\large a(x)$ to approximate the dependency $\large y = f(x)$ using the mean-squared error criterion: $\large \min \sum_i {(a(x_i) - f(x_i))}^2$.

In [2]:
X = np.linspace(-2, 2, 7)
y = X ** 3  # original dependency

plt.scatter(X, y)
plt.xlabel(r'$x$')
plt.ylabel(r'$y$');

Let's take several steps to build a decision tree. In a regression task, at prediction time a leaf returns the average target value of all training observations that fall into that leaf.

Let's start with a tree of depth 0, i.e. all observations are placed in a single leaf.


You'll need to build a tree with only one node (also called the root) that contains all training observations (instances).
What will the predictions of this tree look like for $x \in [-2, 2]$?
Create an appropriate plot using pen, paper, and Python if needed (no sklearn is needed yet).

In [3]:
# Your code here
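One possible sketch, following the averaging rule described above: the single root leaf predicts the mean of all training targets.

# depth-0 tree: one constant prediction (the mean of y) everywhere
xx = np.linspace(-2, 2, 100)
prediction = np.mean(y) * np.ones_like(xx)

plt.scatter(X, y, label='training data')
plt.plot(xx, prediction, 'r-', label='depth-0 prediction')
plt.xlabel(r'$x$')
plt.ylabel(r'$y$')
plt.legend();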

Making the first splits.
Let's split the data according to the condition $[x < 0]$. This gives us a tree of depth 1 with two leaves. To clarify: for all instances with $x \geqslant 0$ the tree will return one value, and for all instances with $x < 0$ it will return another. Let's create a similar plot for the predictions of this tree.

In [4]:
# Your code here
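One possible sketch, where each of the two leaves predicts the mean target of its own instances:

# depth-1 tree: split at x < 0, each leaf predicts the mean of its own targets
xx = np.linspace(-2, 2, 100)
left_mean, right_mean = np.mean(y[X < 0]), np.mean(y[X >= 0])
prediction = np.where(xx < 0, left_mean, right_mean)

plt.scatter(X, y, label='training data')
plt.plot(xx, prediction, 'r-', label='depth-1 prediction')
plt.xlabel(r'$x$')
plt.ylabel(r'$y$')
plt.legend();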

In the decision tree algorithm, the feature and the threshold for splitting are chosen according to some criterion. The commonly used criterion for regression is based on variance:

$$\large Q(X, y, j, t) = D(X, y) - \dfrac{|X_l|}{|X|} D(X_l, y_l) - \dfrac{|X_r|}{|X|} D(X_r, y_r),$$

where $\large X$ and $\large y$ are the feature matrix and the target vector (correspondingly) for the training instances in the current node, $\large X_l, y_l$ and $\large X_r, y_r$ are the splits of the samples $\large X, y$ into two parts w.r.t. $\large [x_j < t]$ (by the $\large j$-th feature and threshold $\large t$), $\large |X|$, $\large |X_l|$, $\large |X_r|$ (or, equivalently, $\large |y|$, $\large |y_l|$, $\large |y_r|$) are the sizes of the corresponding samples, and $\large D(X, y)$ is the variance of the answers $\large y$ for all instances in $\large X$:

$$\large D(X, y) = \dfrac{1}{|X|} \sum_{j=1}^{|X|}\left(y_j - \dfrac{1}{|X|}\sum_{i = 1}^{|X|}y_i\right)^2$$

Here $\large y_i = y(x_i)$ is the answer for the instance $\large x_i$. The feature index $\large j$ and the threshold $\large t$ are chosen to maximize the value of the criterion $\large Q(X, y, j, t)$ for each split.

In our 1D case, there's only one feature, so $\large Q$ depends only on the threshold $\large t$ and the training data $\large X$ and $\large y$. Let's denote it $\large Q_{1d}(X, y, t)$, meaning that the criterion no longer depends on the feature index $\large j$; i.e., in the 1D case $\large j = 1$.

In [5]:
def regression_var_criterion(X, y, t):
    # Your code here
    pass
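For reference, a possible implementation sketch (np.var computes the variance $D$ with the same normalization by the sample size):

def regression_var_criterion(X, y, t):
    # Q(X, y, t) = D(X, y) - |X_l|/|X| * D(X_l, y_l) - |X_r|/|X| * D(X_r, y_r)
    y_left, y_right = y[X < t], y[X >= t]
    return (np.var(y)
            - len(y_left) / len(y) * np.var(y_left)
            - len(y_right) / len(y) * np.var(y_right))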

Create the plot of criterion $\large Q_{1d}(X, y, t)$ as a function of threshold value $t$ on the interval $\large [-1.9, 1.9]$.

In [6]:
# Your code here
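A possible sketch of the plot, assuming the regression_var_criterion function defined above:

thresholds = np.linspace(-1.9, 1.9, 100)
crit_values = [regression_var_criterion(X, y, t) for t in thresholds]

plt.plot(thresholds, crit_values)
plt.xlabel(r'threshold $t$')
plt.ylabel(r'$Q_{1d}(X, y, t)$');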

Question 1. What is the worst threshold value (to perform a split) according to the variance criterion?

**Answer options:**

  • -1.9
  • -1.3
  • 0
  • 1.3
  • 1.9

For discussions, please stick to ODS Slack, channel #mlcourse_ai_news, pinned thread #a2_part1_fall2019

Now let's make a split in each of the two leaf nodes.
Take your tree with the first threshold $[x < 0]$.
Add a split in the left branch (the one with $x < 0$) using the criterion $[x < -1.5]$, and a split in the right branch (the one with $x \geqslant 0$) using the criterion $[x < 1.5]$.
This gives us a tree of depth 2 with 7 nodes and 4 leaves. Create a plot of this tree's predictions for $x \in [-2, 2]$.

In [7]:
# Your code here
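A possible sketch: hard-code the four leaves of the depth-2 tree and let each leaf return the mean target of its own training instances.

# depth-2 tree: split at x < 0, then at x < -1.5 (left) and x < 1.5 (right)
def depth2_prediction(x):
    if x < -1.5:
        return np.mean(y[X < -1.5])
    elif x < 0:
        return np.mean(y[(X >= -1.5) & (X < 0)])
    elif x < 1.5:
        return np.mean(y[(X >= 0) & (X < 1.5)])
    else:
        return np.mean(y[X >= 1.5])

xx = np.linspace(-2, 2, 200)
plt.scatter(X, y, label='training data')
plt.plot(xx, [depth2_prediction(x) for x in xx], 'r-', label='depth-2 prediction')
plt.xlabel(r'$x$')
plt.ylabel(r'$y$')
plt.legend();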

Question 2. The tree's prediction is a piecewise-constant function, right? How many "pieces" (horizontal segments in the plot that you've just built) are there in the interval [-2, 2]?

**Answer options:**

  • 2
  • 4
  • 6
  • 8

For discussions, please stick to ODS Slack, channel #mlcourse_ai_news, pinned thread #a2_part1_fall2019

2. Building a decision tree for predicting heart diseases

Let's read the data on heart diseases. The dataset can be downloaded from the course repo here (click Download, then choose the Save As option). If you work with Git, the dataset is already there: data/mlbootcamp5_train.csv.

Problem

Predict presence or absence of cardiovascular disease (CVD) using the patient examination results.

Data description

There are 3 types of input features:

  • Objective: factual information;
  • Examination: results of medical examination;
  • Subjective: information given by the patient.
| Feature | Variable Type | Variable | Value Type |
|---|---|---|---|
| Age | Objective Feature | age | int (days) |
| Height | Objective Feature | height | int (cm) |
| Weight | Objective Feature | weight | float (kg) |
| Gender | Objective Feature | gender | categorical code |
| Systolic blood pressure | Examination Feature | ap_hi | int |
| Diastolic blood pressure | Examination Feature | ap_lo | int |
| Cholesterol | Examination Feature | cholesterol | 1: normal, 2: above normal, 3: well above normal |
| Glucose | Examination Feature | gluc | 1: normal, 2: above normal, 3: well above normal |
| Smoking | Subjective Feature | smoke | binary |
| Alcohol intake | Subjective Feature | alco | binary |
| Physical activity | Subjective Feature | active | binary |
| Presence or absence of cardiovascular disease | Target Variable | cardio | binary |

All of the dataset values were collected at the moment of medical examination.

In [8]:
df = pd.read_csv('../../data/mlbootcamp5_train.csv', 
                 index_col='id', sep=';')
In [9]:
df.head()
Out[9]:
| id | age | gender | height | weight | ap_hi | ap_lo | cholesterol | gluc | smoke | alco | active | cardio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 18393 | 2 | 168 | 62.0 | 110 | 80 | 1 | 1 | 0 | 0 | 1 | 0 |
| 1 | 20228 | 1 | 156 | 85.0 | 140 | 90 | 3 | 1 | 0 | 0 | 1 | 1 |
| 2 | 18857 | 1 | 165 | 64.0 | 130 | 70 | 3 | 1 | 0 | 0 | 0 | 1 |
| 3 | 17623 | 2 | 169 | 82.0 | 150 | 100 | 1 | 1 | 0 | 0 | 1 | 1 |
| 4 | 17474 | 1 | 156 | 56.0 | 100 | 60 | 1 | 1 | 0 | 0 | 0 | 0 |

Transform the features:

  • create "age in years", dividing age by 365.25 and taking the floor ($\lfloor{x}\rfloor$ is the largest integer that is less than or equal to $x$);
  • create 3 binary features based on cholesterol;
  • create 3 binary features based on gluc.

Each binary feature indicates whether the original value equals 1, 2 or 3. This method is called dummy encoding or One Hot Encoding (OHE). It is convenient to use pandas.get_dummies. There is no need to keep the original cholesterol and gluc features after the encoding.
In [10]:
# Your code here
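A possible sketch of these transformations (the column name age_years and the dummy prefixes are just illustrative, not prescribed by the assignment):

# age in years (floor), plus dummy-encoded cholesterol and gluc
df['age_years'] = np.floor(df['age'] / 365.25).astype('int')
df = pd.concat([df.drop(['cholesterol', 'gluc'], axis=1),
                pd.get_dummies(df['cholesterol'], prefix='chol'),
                pd.get_dummies(df['gluc'], prefix='gluc')], axis=1)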

Split data into train and holdout parts in the proportion of 7/3 using sklearn.model_selection.train_test_split with random_state=17.

In [11]:
# Your code here
# X_train, X_valid, y_train, y_valid = ...
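A possible sketch of the split, assuming cardio is the target and all remaining columns are features:

# note: this reuses the names X and y from the toy regression example above
X = df.drop('cardio', axis=1)
y = df['cardio']
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.3, random_state=17)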

Train a decision tree on the dataset (X_train, y_train) with max depth equal to 3 and random_state=17. Plot this tree with sklearn.tree.export_graphviz and Graphviz. Note that sklearn doesn't draw decision trees on its own, but it can output a tree in the .dot format, which Graphviz can then render.

How to plot a decision tree, alternatives:

  1. Install Graphviz and pydotplus yourself (see below)
  2. Use our docker image with all needed packages already installed
  3. Easy way: execute print(dot_data.getvalue()) with dot_data defined below (this can be done without pydotplus and Graphviz), go to http://www.webgraphviz.com, paste the graph code string (digraph Tree {...) and generate a nice picture

There may be some trouble with Graphviz for Windows users. The error is 'GraphViz's executables not found'.
To fix it, install Graphviz from here.
Then add the Graphviz path to your system PATH variable. You can do this manually, but don't forget to restart the kernel.
Or just run this code:

In [12]:
import os
path_to_graphviz = '' # your path to graphviz (C:\\Program Files (x86)\\Graphviz2.38\\bin\\ for example) 
os.environ["PATH"] += os.pathsep + path_to_graphviz

Take a look at how trees are visualized in the 3rd part of the course materials.
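A possible sketch of the training and export steps (the file name tree.dot is arbitrary):

tree = DecisionTreeClassifier(max_depth=3, random_state=17)
tree.fit(X_train, y_train)

# write the tree in .dot format; render it with Graphviz,
# e.g. from the command line: dot -Tpng tree.dot -o tree.png
export_graphviz(tree, out_file='tree.dot',
                feature_names=X_train.columns, filled=True)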

Question 3. Which 3 features are used to make predictions in the created decision tree?

**Answer options:**

  • age, ap_lo, chol=1
  • age, ap_hi, chol=3
  • smoke, age, gender
  • alco, weight, gluc=3

For discussions, please stick to ODS Slack, channel #mlcourse_ai_news, pinned thread #a2_part1_fall2019

Make predictions for holdout data (X_valid, y_valid) with the trained decision tree. Calculate accuracy.

In [13]:
# Your code here
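A possible sketch, assuming the depth-3 tree trained above is called tree:

# accuracy on the holdout set before tuning
tree_pred = tree.predict(X_valid)
acc1 = accuracy_score(y_valid, tree_pred)
print(acc1)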

Tune the depth of the tree using cross-validation on the dataset (X_train, y_train) in order to improve the quality of the model. Use GridSearchCV with 5 folds. Fix random_state=17 and vary max_depth from 2 to 10.

In [14]:
tree_params = {'max_depth': list(range(2, 11))}

tree_grid = GridSearchCV  # Your code here
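A possible sketch of the grid-search set-up:

tree_grid = GridSearchCV(DecisionTreeClassifier(random_state=17),
                         tree_params, cv=5, scoring='accuracy', n_jobs=-1)
tree_grid.fit(X_train, y_train)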

Draw a plot showing how mean cross-validation accuracy changes with the max_depth value.

In [15]:
# Your code here
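A possible sketch, reading the mean test scores from the fitted tree_grid:

mean_scores = tree_grid.cv_results_['mean_test_score']
plt.plot(tree_params['max_depth'], mean_scores)
plt.xlabel('max_depth')
plt.ylabel('mean CV accuracy');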

Print the best value of max_depth, i.e., the one at which the mean cross-validation metric reaches its maximum. Also compute accuracy on the holdout data. Both can be obtained from the trained GridSearchCV instance.

In [16]:
# Your code here
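A possible sketch:

print(tree_grid.best_params_, tree_grid.best_score_)
# holdout accuracy of the best (refitted) estimator
acc2 = accuracy_score(y_valid, tree_grid.predict(X_valid))
print(acc2)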

Calculate the effect of GridSearchCV: compute the expression (acc2 - acc1) / acc1 * 100%, where acc1 and acc2 are accuracies on the holdout data before and after tuning max_depth with GridSearchCV, respectively.

In [17]:
# Your code here
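A possible sketch, assuming acc1 and acc2 were computed in the cells above:

print('{:.2f}%'.format((acc2 - acc1) / acc1 * 100))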

Question 4. Choose all correct statements.

**Answer options:**

  • There exists a local maximum of accuracy on the built validation curve
  • GridSearchCV increased holdout accuracy by more than 1%
  • There is no local maximum of accuracy on the built validation curve
  • GridSearchCV increased holdout accuracy by less than 1%

For discussions, please stick to ODS Slack, channel #mlcourse_ai_news, pinned thread #a2_part1_fall2019

Take a look at the SCORE table to estimate ten-year risk of fatal cardiovascular disease in Europe. Source paper.

Let's create new features according to this picture:

  • $age \in [40,50), age \in [50,55), age \in [55,60), age \in [60,65) $ (4 features)
  • systolic blood pressure: $ap\_hi \in [120,140), ap\_hi \in [140,160), ap\_hi \in [160,180)$ (3 features)

If the value of age or blood pressure doesn't fall into any of these intervals, then all the corresponding binary features will be equal to zero.


Add the smoke feature.
Build the cholesterol and gender features. Transform cholesterol into 3 binary features according to its 3 unique values (cholesterol=1, cholesterol=2 and cholesterol=3). Transform gender from 1 and 2 into 0 and 1; it is better to rename it to male (0 for women, 1 for men). In general, this is typically done with sklearn.preprocessing.LabelEncoder, but here, with only 2 unique values, it's not necessary.

Finally, the decision tree is built using these 12 binary features (excluding all original features that we had before this feature engineering part).

Create a decision tree with the limitation max_depth=3 and train it on the whole train data. Use the DecisionTreeClassifier class with fixed random_state=17, but all other arguments (except for max_depth and random_state) should be left with their default values.

Question 5. Which binary feature is the most important for heart disease detection (i.e., it is placed in the root of the tree)?

**Answer options:**

  • Systolic blood pressure from 160 to 180 (mmHg)
  • Cholesterol level == 3
  • Systolic blood pressure from 140 to 160 (mmHg)
  • Age from 50 to 55 (years)
  • Smokes / doesn't smoke
  • Age from 60 to 65 (years)

For discussions, please stick to ODS Slack, channel #mlcourse_ai_news, pinned thread #a2_part1_fall2019

In [18]:
# Your code here
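A possible sketch of this feature-engineering step. The feature names below (age_40_50, ap_hi_120_140, etc.) are only illustrative, and mapping gender code 2 to male is an assumption that should be verified on the data (e.g. by comparing average height per gender code).

# reload the original data, since cholesterol and gluc were dummy-encoded earlier
df = pd.read_csv('../../data/mlbootcamp5_train.csv', index_col='id', sep=';')
df['age_years'] = np.floor(df['age'] / 365.25).astype('int')

features = pd.DataFrame(index=df.index)
# 4 age bins
features['age_40_50'] = ((df['age_years'] >= 40) & (df['age_years'] < 50)).astype('int')
features['age_50_55'] = ((df['age_years'] >= 50) & (df['age_years'] < 55)).astype('int')
features['age_55_60'] = ((df['age_years'] >= 55) & (df['age_years'] < 60)).astype('int')
features['age_60_65'] = ((df['age_years'] >= 60) & (df['age_years'] < 65)).astype('int')
# 3 systolic blood pressure bins
features['ap_hi_120_140'] = ((df['ap_hi'] >= 120) & (df['ap_hi'] < 140)).astype('int')
features['ap_hi_140_160'] = ((df['ap_hi'] >= 140) & (df['ap_hi'] < 160)).astype('int')
features['ap_hi_160_180'] = ((df['ap_hi'] >= 160) & (df['ap_hi'] < 180)).astype('int')
# smoking, cholesterol dummies and gender
features['smoke'] = df['smoke']
features['chol_1'] = (df['cholesterol'] == 1).astype('int')
features['chol_2'] = (df['cholesterol'] == 2).astype('int')
features['chol_3'] = (df['cholesterol'] == 3).astype('int')
features['male'] = (df['gender'] == 2).astype('int')  # assumption: code 2 = men

tree = DecisionTreeClassifier(max_depth=3, random_state=17)
tree.fit(features, df['cardio'])
# inspect the root split, e.g. via export_graphviz as before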