Logistic Regression with the Iris Dataset¶

For more explanation of logistic regression, see

Instructions¶

If you haven't already, follow the setup instructions here to get all necessary software installed.
Read through the code in the following sections

Iris Dataset
Visualization
Model Training
Prediction

Complete at least one of the following exercises

Exercise Option #1 - Standard Difficulty
Exercise Option #2 - Advanced Difficulty

In [1]:

from sklearn import linear_model # for fitting our model
from sklearn.datasets import load_iris # the iris dataset is included in scikit-learn

# force numpy not to use scientific notation, to make it easier to read the numbers the program prints out
import numpy
numpy.set_printoptions(suppress=True)

# to display graphs in this notebook
%matplotlib inline
import matplotlib.pyplot

Iris Dataset¶

Before you go on, make sure you understand this dataset. Modify the cell below to examine different parts of the dataset that are contained in the 'iris' dictionary object.

What are the features? What are we trying to classify?

In [2]:

iris = load_iris()
iris.keys()

Out[2]:

dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names'])

You can also try looking at it using a pandas dataframe.

In [3]:

import pandas
iris_df = pandas.DataFrame(iris.data)
iris_df.columns = iris.feature_names
iris_df['target'] = [iris.target_names[target] for target in iris.target]
iris_df.head()

Out[3]:

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)	target
0	5.1	3.5	1.4	0.2	setosa
1	4.9	3.0	1.4	0.2	setosa
2	4.7	3.2	1.3	0.2	setosa
3	4.6	3.1	1.5	0.2	setosa
4	5.0	3.6	1.4	0.2	setosa

In [4]:

iris_df.describe()

Out[4]:

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)
count	150.000000	150.000000	150.000000	150.000000
mean	5.843333	3.054000	3.758667	1.198667
std	0.828066	0.433594	1.764420	0.763161
min	4.300000	2.000000	1.000000	0.100000
25%	5.100000	2.800000	1.600000	0.300000
50%	5.800000	3.000000	4.350000	1.300000
75%	6.400000	3.300000	5.100000	1.800000
max	7.900000	4.400000	6.900000	2.500000

For this tutorial, at least to start, we're not going to use the whole dataset, just because it is easier to visualize two features than four. The code below decides which two features we're going to use.

We'll also need to know at what location in the list each of the classes start at.

In [5]:

# Use just two columns (the first and fourth in this case).
x1_feature = 0
x2_feature = 3
iris_inputs = iris.data[:,[x1_feature,x2_feature]]

# The data are in order by class. Find out where the other classes start in the list
start_class_one = list(iris.target).index(1)
start_class_two = list(iris.target).index(2)

Visualization¶

Let's visualize our dataset, so that we can better understand what it looks like.

In [6]:

# split the two inputs into single dimensional arrays for plotting
x1 = iris_inputs[:,0]
x2 = iris_inputs[:,1]

# create a figure and label it
fig = matplotlib.pyplot.figure()
fig.suptitle('Iris Data Set')
matplotlib.pyplot.xlabel(iris.feature_names[x1_feature])
matplotlib.pyplot.ylabel(iris.feature_names[x2_feature])

# put the input data on the graph, with different colors and shapes for each type
scatter_0 = matplotlib.pyplot.scatter(x1[:start_class_one], x2[:start_class_one],
                                      c="red", marker="o", label=iris.target_names[0])
scatter_1 = matplotlib.pyplot.scatter(x1[start_class_one:start_class_two], x2[start_class_one:start_class_two],
                                      c="blue", marker="^", label=iris.target_names[1])
scatter_2 = matplotlib.pyplot.scatter(x1[start_class_two:], x2[start_class_two:],
                                      c="yellow", marker="*", label=iris.target_names[2])

# add a legend to explain which points are which
matplotlib.pyplot.legend(handles=[scatter_0, scatter_1, scatter_2])

# show the graph
matplotlib.pyplot.show()

Model Training¶

Next, we want to fit our logistic regression model to the subset of the data we're using.

In [7]:

model = linear_model.LogisticRegression()
model.fit(iris_inputs, iris.target)

print('Intercept: {0}  Coefficients: {1}'.format(model.intercept_, model.coef_))

Intercept: [ 0.96256986 -0.19641091 -1.7644289 ]  Coefficients: [[ 0.44374849 -4.60187424]
 [-0.17912292  0.45576962]
 [-0.77517855  4.03438217]]

Prediction¶

Now we can make some predictions using the trained model. We'll pull out some examples from our training data and see what the model says about them.

In [8]:

# Use the first input from each class
inputs = [iris_inputs[0], iris_inputs[start_class_one], iris_inputs[start_class_two]]

print('Class predictions: {0}'.format(model.predict(inputs))) # guess which class
print('Probabilities:\n{0}'.format(model.predict_proba(inputs))) # give probability of each class

Class predictions: [0 1 2]
Probabilities:
[[0.76937325 0.22444044 0.0061863 ]
 [0.14977904 0.54049543 0.30972553]
 [0.00030362 0.3188655  0.68083088]]

Exercise Option #1 - Standard Difficulty¶

Answer the following questions. You can also use the graph below, if seeing the data visually helps you understand the data.

What should we be expecting as the output for class predictions in the above cell? If the model is not giving the expected output, what are some of the reasons it might not be?
How do the probabilities output by the above cell relate to the class predictions? Why do you think the model might be more or less confident in its predictions?
Looking at the intercept and coefficient output further above, if a coefficient is negative, what has the model learned about this feature? In other words, if you took a datapoint and you increased the value of a feature that has a negative coefficient, what would you expect to happen to the probabilities the model gives this datapoint?
Do these two features allow you to predict the iris type well? How do you know? Explain using both the text output in the cells above and the graph below.
Try a different pair of features. Do these allow you to predict the iris type well? How do you know?

Exercise Option #2 - Advanced Difficulty¶

The plot above is only showing the data, and not anything about what the model learned. Come up with some ideas for how to show the model fit and implement one of them in code. Remember, we are here to help if you are not sure how to write the code for your ideas!

In [ ]: