Decision Tree with the Iris Dataset¶

For an explanation of decision trees, see our course notes.

This notebook uses example code from http://scikit-learn.org/stable/modules/tree.html.

Instructions¶

1. If you haven't already, follow the setup instructions here to get all necessary software installed.
2. Install the software specific to this notebook, as explained in the Setup section.
3. Read through the code in the sections that follow.
4. Complete one or both exercise options:
   - Exercise Option #1 - Standard Difficulty
   - Exercise Option #2 - Advanced Difficulty

Setup¶

Before you can run this code, you will need to install some extra software.

Install Homebrew (if you don't already have it) following the directions on its site.
Install the graphviz library, which will let us visualize the decision tree. In Terminal, run

brew install graphviz

Install the pydot library, which allows you to call graphviz from Python. In Terminal, run

pip3 install pydot
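
If you want to confirm that both installs worked before running the rest of the notebook, a quick check along the following lines may help. This is only a sketch: it assumes that graphviz puts its `dot` command-line program on your PATH, and the variable names `dot_path` and `pydot_found` are made up for this example.

```python
import importlib.util
import shutil

# graphviz ships a command-line program called "dot"; look for it on the PATH
dot_path = shutil.which('dot')

# check whether Python can locate the pydot package, without importing it
pydot_found = importlib.util.find_spec('pydot') is not None

print('graphviz dot program:', dot_path if dot_path else 'not found')
print('pydot installed:', pydot_found)
```

If either check reports a problem, repeat the matching install step above.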

In [1]:

from sklearn.datasets import load_iris # the iris dataset is included in scikit-learn
from sklearn import tree # for fitting our model

# these are all needed for the particular visualization we're doing
from io import StringIO # standard-library text buffer (no need for the six package on Python 3)
import pydot
import os.path

# to display graphs in this notebook
%matplotlib inline
import matplotlib.pyplot

Iris Dataset¶

Before you go on, make sure you understand this dataset. Modify the cell below to examine the different parts of the dataset, which are stored in the dictionary-like iris object.

What are the features? What are we trying to classify?

In [2]:

iris = load_iris()
iris.keys()

Out[2]:

dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])
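
As a starting point for the two questions above, the following sketch prints the shape of the feature matrix and the names stored alongside it. It only uses keys that appear in the output above; the comments describe what each key holds.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# Feature matrix: one row per flower, one column per measurement
print(iris.data.shape)
print(iris.feature_names)

# Labels: one integer class code per flower; each code indexes into target_names
print(iris.target.shape)
print(iris.target_names)
```

Printing iris.DESCR gives a longer text description of the dataset.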

You can also try looking at it using a pandas DataFrame.

In [3]:

import pandas
iris_df = pandas.DataFrame(iris.data)
iris_df.columns = iris.feature_names
iris_df['target'] = [iris.target_names[target] for target in iris.target]
iris_df.head()

Out[3]:

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)	target
0	5.1	3.5	1.4	0.2	setosa
1	4.9	3.0	1.4	0.2	setosa
2	4.7	3.2	1.3	0.2	setosa
3	4.6	3.1	1.5	0.2	setosa
4	5.0	3.6	1.4	0.2	setosa

In [4]:

iris_df.describe()

Out[4]:

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)
count	150.000000	150.000000	150.000000	150.000000
mean	5.843333	3.057333	3.758000	1.199333
std	0.828066	0.435866	1.765298	0.762238
min	4.300000	2.000000	1.000000	0.100000
25%	5.100000	2.800000	1.600000	0.300000
50%	5.800000	3.000000	4.350000	1.300000
75%	6.400000	3.300000	5.100000	1.800000
max	7.900000	4.400000	6.900000	2.500000
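
The summary above pools all three species together. To see how the species differ from each other, one option is to group the rows by the target column first. The sketch below rebuilds the DataFrame so that it runs on its own; `class_means` and `class_counts` are names chosen for this example.

```python
import pandas
from sklearn.datasets import load_iris

iris = load_iris()
iris_df = pandas.DataFrame(iris.data, columns=iris.feature_names)
iris_df['target'] = [iris.target_names[target] for target in iris.target]

# average of each measurement, computed separately for each species
class_means = iris_df.groupby('target').mean()
print(class_means)

# number of flowers of each species
class_counts = iris_df['target'].value_counts()
print(class_counts)
```

Features whose averages differ a lot between species are good candidates for the splits a decision tree will choose.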

Visualization of Dataset¶

Let's visualize our dataset, so that we can better understand what it looks like.

Change the first two variables to change which features you are looking at.

In [5]:

# Plot two of the features (the first and fourth columns, in this case)
x1_feature = 0
x2_feature = 3

x1 = iris.data[:,x1_feature]
x2 = iris.data[:,x2_feature]

# The data are in order by type. Find out where the other types start
start_type_one = list(iris.target).index(1)
start_type_two = list(iris.target).index(2)

# create a figure and label it
fig = matplotlib.pyplot.figure()
fig.suptitle('Two Features of the Iris Data Set')
matplotlib.pyplot.xlabel(iris.feature_names[x1_feature])
matplotlib.pyplot.ylabel(iris.feature_names[x2_feature])

# put the input data on the graph, with different colors and shapes for each type
scatter_0 = matplotlib.pyplot.scatter(x1[:start_type_one], x2[:start_type_one],
                                      c="red", marker="o", label=iris.target_names[0])
scatter_1 = matplotlib.pyplot.scatter(x1[start_type_one:start_type_two], x2[start_type_one:start_type_two],
                                      c="blue", marker="^", label=iris.target_names[1])
scatter_2 = matplotlib.pyplot.scatter(x1[start_type_two:], x2[start_type_two:],
                                      c="yellow", marker="*", label=iris.target_names[2])

# add a legend to explain which points are which
matplotlib.pyplot.legend(handles=[scatter_0, scatter_1, scatter_2])

# show the graph
matplotlib.pyplot.show()

Model Training¶

Next, we want to fit our decision tree model to the iris data we're using.

In [6]:

# Train the model
model = tree.DecisionTreeClassifier()
model.fit(iris.data, iris.target)

Out[6]:

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')
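
Once the model is fitted, you can ask it a few questions about the tree it built. The sketch below is one way to do that, assuming a scikit-learn version recent enough to provide get_depth and get_n_leaves (0.21 or later). Setting random_state=0 is an addition for this example so that repeated runs build the same tree.

```python
from sklearn import tree
from sklearn.datasets import load_iris

iris = load_iris()

# random_state is fixed here only to make the example repeatable
model = tree.DecisionTreeClassifier(random_state=0)
model.fit(iris.data, iris.target)

# overall size of the fitted tree
print('Depth:', model.get_depth())
print('Leaves:', model.get_n_leaves())

# how much each feature contributed to the splits (the values sum to 1)
for name, importance in zip(iris.feature_names, model.feature_importances_):
    print('{0}: {1:.3f}'.format(name, importance))
```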

Visualization of Model Output¶

Using graphviz and pydot, we can create a flowchart that shows the decisions the model makes. The flowchart will be saved as a PDF named iris_decision_tree.pdf on your desktop.

In [7]:

dot_data = StringIO()
tree.export_graphviz(model, out_file=dot_data, feature_names=iris.feature_names, class_names=iris.target_names,
                     filled=True, rounded=True, special_characters=True)
graph = pydot.graph_from_dot_data(dot_data.getvalue())[0]
graph.write_pdf(os.path.expanduser("~/Desktop/iris_decision_tree.pdf"))
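
If graphviz gives you trouble, a plain-text view of the same tree can be a useful fallback. The sketch below uses tree.export_text, which is available in scikit-learn 0.21 and later. It is an alternative to the PDF, not a replacement for the exercise, and it refits the model so that it runs on its own (random_state=0 is an assumption added for repeatability).

```python
from sklearn import tree
from sklearn.datasets import load_iris

iris = load_iris()
model = tree.DecisionTreeClassifier(random_state=0)
model.fit(iris.data, iris.target)

# one line per node: the split rule for decision nodes, the class for leaves
rules = tree.export_text(model, feature_names=list(iris.feature_names))
print(rules)
```

Each level of indentation in the printed text corresponds to one level of the flowchart.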

Prediction¶

Now we can make some predictions using the trained model. We'll pull out some examples from our training data and see what the model says about them.

In [8]:

# Use the first input from each class
inputs = [iris.data[0], iris.data[start_type_one], iris.data[start_type_two]]

print('Class predictions: {0}'.format(model.predict(inputs))) # guess which class
print('Probabilities:\n{0}'.format(model.predict_proba(inputs))) # give probability of each class

Class predictions: [0 1 2]
Probabilities:
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]
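
The model can also classify a flower it has never seen. In the sketch below, the four measurements in `new_flower` are made-up values chosen for illustration (they are not taken from the dataset), and the model is refit with random_state=0 so that the example runs on its own.

```python
from sklearn import tree
from sklearn.datasets import load_iris

iris = load_iris()
model = tree.DecisionTreeClassifier(random_state=0)
model.fit(iris.data, iris.target)

# hypothetical flower: sepal length, sepal width, petal length, petal width (cm)
new_flower = [[5.0, 3.4, 1.5, 0.2]]

# predict returns an integer class code; target_names turns it into a species name
predicted_code = model.predict(new_flower)[0]
print('Predicted species: {0}'.format(iris.target_names[predicted_code]))
```

Try changing the petal measurements and watch when the predicted species changes.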

Exercise Option #1 - Standard Difficulty¶

Answer the following questions. You may find it helpful to compare the PDF output to the graph above (remember you can change which columns the graph is displaying), to see the boundaries the decision tree is finding.

1. Submit the PDF you generated as a separate file in Canvas.
2. According to the PDF, what feature values would tell you with high probability that you were looking at a setosa iris?
3. According to the PDF, which features would you look at to tell a virginica from a versicolor?
4. What is the value array in the PDF showing?
5. The predictions just above are all 100% confident in the correct answer. If you try using other data points from the training data, you'll find the same thing. Why is that always true for our Decision Tree?
6. Try using subsets of the input data (look at the iris_inputs variable in LogisticRegressionIris to see how to use only some of the columns in the model). How does this change the decision tree?
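
For the last question, here is a minimal sketch of how a column subset might be selected and passed to the model. The names `columns`, `subset_inputs`, `subset_names` and `subset_model` are made up for this example and may not match the iris_inputs code in LogisticRegressionIris; the choice of the two petal columns is arbitrary.

```python
from sklearn import tree
from sklearn.datasets import load_iris

iris = load_iris()

# indexes of the columns to keep (here: petal length and petal width)
columns = [2, 3]
subset_inputs = iris.data[:, columns]
subset_names = [iris.feature_names[i] for i in columns]

# fit a tree that can only see the selected columns
subset_model = tree.DecisionTreeClassifier(random_state=0)
subset_model.fit(subset_inputs, iris.target)

print(subset_names)
print('Depth:', subset_model.get_depth(), 'Leaves:', subset_model.get_n_leaves())
```

If you export this tree to a PDF, remember to pass subset_names as the feature names so the labels match the columns the model actually saw.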

Exercise Option #2 - Advanced Difficulty¶

Try fitting a Random Forest model to the iris data. See this example to help you get started.

How do the performance and output of Random Forest compare to the single Decision Tree? Since you can't get a graphical representation of the Random Forest model the way we did for the single Decision Tree, you'll have to think of a different way to understand what the model is doing. Think about how we can validate the performance of our classifier models.
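
One possible starting point is sketched below. It assumes that 5-fold cross-validation is an acceptable way to compare the two models and that 100 trees is a reasonable forest size; both are choices made for this example, not requirements of the exercise. The names `models` and `mean_scores` are also made up here.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# random_state is fixed only so that repeated runs give the same numbers
models = {
    'decision tree': DecisionTreeClassifier(random_state=0),
    'random forest': RandomForestClassifier(n_estimators=100, random_state=0),
}

# score each model on data it was not trained on, using 5 train/test splits
mean_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, iris.data, iris.target, cv=5)
    mean_scores[name] = scores.mean()
    print('{0}: mean accuracy {1:.3f}'.format(name, scores.mean()))
```

Because each score comes from held-out data, these numbers say more about how the models generalize than the training-set predictions shown earlier.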
