For more explanation of logistic regression, see
from sklearn import linear_model # for fitting our model
from sklearn.datasets import load_iris # the iris dataset is included in scikit-learn
# force numpy not to use scientific notation, to make it easier to read the numbers the program prints out
import numpy
numpy.set_printoptions(suppress=True)
# to display graphs in this notebook
%matplotlib inline
import matplotlib.pyplot
Before you go on, make sure you understand this dataset. Modify the cell below to examine the different parts of the dataset, which are stored in the dictionary-like 'iris' object.
What are the features? What are we trying to classify?
iris = load_iris()
iris.keys()
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names'])
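As a starting point for that exploration, here is a minimal sketch that prints a few of the entries listed above. It reloads the dataset so that it runs on its own, and the name `iris_example` is only an illustrative choice, not something the rest of this notebook relies on.

```python
from sklearn.datasets import load_iris

iris_example = load_iris()  # reloaded here so the snippet is self-contained

# the names of the four measurements (the features)
print(iris_example.feature_names)
# the names of the three species (the classes we want to predict)
print(iris_example.target_names)
# one row per flower, one column per feature
print(iris_example.data.shape)
# the class labels (0, 1, or 2) of the first five flowers
print(iris_example.target[:5])
```

Printing `iris_example.DESCR` in the same way gives a longer text description of the dataset.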
You can also try looking at it using a pandas dataframe.
import pandas
iris_df = pandas.DataFrame(iris.data)
iris_df.columns = iris.feature_names
iris_df['target'] = [iris.target_names[target] for target in iris.target]
iris_df.head()
| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
iris_df.describe()
| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| count | 150.000000 | 150.000000 | 150.000000 | 150.000000 |
| mean | 5.843333 | 3.054000 | 3.758667 | 1.198667 |
| std | 0.828066 | 0.433594 | 1.764420 | 0.763161 |
| min | 4.300000 | 2.000000 | 1.000000 | 0.100000 |
| 25% | 5.100000 | 2.800000 | 1.600000 | 0.300000 |
| 50% | 5.800000 | 3.000000 | 4.350000 | 1.300000 |
| 75% | 6.400000 | 3.300000 | 5.100000 | 1.800000 |
| max | 7.900000 | 4.400000 | 6.900000 | 2.500000 |
For this tutorial, at least to start, we're not going to use the whole dataset, just because it is easier to visualize two features than four. The code below decides which two features we're going to use.
We'll also need to know the index at which each class starts in the list.
# Use just two columns (the first and fourth in this case).
x1_feature = 0
x2_feature = 3
iris_inputs = iris.data[:,[x1_feature,x2_feature]]
# The data are in order by class. Find out where the other classes start in the list
start_class_one = list(iris.target).index(1)
start_class_two = list(iris.target).index(2)
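The index lookup above only works because the rows are grouped by class. If you want to check that for yourself, the following sketch is one way to do it. It reloads the dataset so that it runs independently, and the names `iris_check`, `is_sorted`, `first_one`, `first_two`, and `class_counts` are illustrative.

```python
import numpy
from sklearn.datasets import load_iris

iris_check = load_iris()
targets = iris_check.target

# True if the labels never decrease, i.e. the rows are grouped by class
is_sorted = bool(numpy.all(numpy.diff(targets) >= 0))

# index of the first row of class 1 and of class 2
first_one = list(targets).index(1)
first_two = list(targets).index(2)

# number of rows in each class
class_counts = numpy.bincount(targets)

print(is_sorted, first_one, first_two, class_counts)
```

In the copy of iris bundled with scikit-learn, each class has 50 rows, so the three classes start at rows 0, 50, and 100.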
Let's visualize our dataset, so that we can better understand what it looks like.
# split the two inputs into single dimensional arrays for plotting
x1 = iris_inputs[:,0]
x2 = iris_inputs[:,1]
# create a figure and label it
fig = matplotlib.pyplot.figure()
fig.suptitle('Iris Data Set')
matplotlib.pyplot.xlabel(iris.feature_names[x1_feature])
matplotlib.pyplot.ylabel(iris.feature_names[x2_feature])
# put the input data on the graph, with different colors and shapes for each type
scatter_0 = matplotlib.pyplot.scatter(x1[:start_class_one], x2[:start_class_one],
                                      c="red", marker="o", label=iris.target_names[0])
scatter_1 = matplotlib.pyplot.scatter(x1[start_class_one:start_class_two], x2[start_class_one:start_class_two],
                                      c="blue", marker="^", label=iris.target_names[1])
scatter_2 = matplotlib.pyplot.scatter(x1[start_class_two:], x2[start_class_two:],
                                      c="yellow", marker="*", label=iris.target_names[2])
# add a legend to explain which points are which
matplotlib.pyplot.legend(handles=[scatter_0, scatter_1, scatter_2])
# show the graph
matplotlib.pyplot.show()
Next, we want to fit our logistic regression model to the subset of the data we're using.
model = linear_model.LogisticRegression()
model.fit(iris_inputs, iris.target)
print('Intercept: {0} Coefficients: {1}'.format(model.intercept_, model.coef_))
Intercept: [ 0.96256986 -0.19641091 -1.7644289 ] Coefficients: [[ 0.44374849 -4.60187424]
 [-0.17912292  0.45576962]
 [-0.77517855  4.03438217]]
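The intercepts and coefficients define one linear score per class: each input row is multiplied by that class's coefficients and the class's intercept is added. The sketch below checks this against scikit-learn's `decision_function`. It refits a model on the same two columns so that it runs independently, and names such as `demo_model` and `manual_scores` are illustrative. The fitted numbers depend on your scikit-learn version and its default solver, so they may not match the output printed above.

```python
import numpy
from sklearn import linear_model
from sklearn.datasets import load_iris

iris_demo = load_iris()
demo_inputs = iris_demo.data[:, [0, 3]]  # same two columns as above

demo_model = linear_model.LogisticRegression()
demo_model.fit(demo_inputs, iris_demo.target)

# one linear score per class: inputs times coefficients, plus intercepts
manual_scores = demo_inputs @ demo_model.coef_.T + demo_model.intercept_

# scikit-learn's own version of the same scores
library_scores = demo_model.decision_function(demo_inputs)

# True if the two ways of computing the scores agree
print(numpy.allclose(manual_scores, library_scores))
```

Each row of `manual_scores` has three entries, one per class, and a larger score means the model favors that class more strongly.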
Now we can make some predictions using the trained model. We'll pull out some examples from our training data and see what the model says about them.
# Use the first input from each class
inputs = [iris_inputs[0], iris_inputs[start_class_one], iris_inputs[start_class_two]]
print('Class predictions: {0}'.format(model.predict(inputs))) # guess which class
print('Probabilities:\n{0}'.format(model.predict_proba(inputs))) # give probability of each class
Class predictions: [0 1 2]
Probabilities:
[[0.76937325 0.22444044 0.0061863 ]
 [0.14977904 0.54049543 0.30972553]
 [0.00030362 0.3188655  0.68083088]]
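To see how the two kinds of output relate, here is a small sketch that checks that each row of probabilities adds up to 1 and that the predicted class is the one with the highest probability. It also reports the fraction of training examples the model labels correctly. The model is refit here so that the snippet runs independently, and the variable names are illustrative.

```python
import numpy
from sklearn import linear_model
from sklearn.datasets import load_iris

iris_demo = load_iris()
demo_inputs = iris_demo.data[:, [0, 3]]  # same two columns as above
demo_model = linear_model.LogisticRegression()
demo_model.fit(demo_inputs, iris_demo.target)

probabilities = demo_model.predict_proba(demo_inputs)
predictions = demo_model.predict(demo_inputs)

# each row of probabilities adds up to 1
rows_sum_to_one = numpy.allclose(probabilities.sum(axis=1), 1.0)

# the predicted class is the one with the highest probability
most_likely = demo_model.classes_[probabilities.argmax(axis=1)]
same_answer = numpy.array_equal(most_likely, predictions)

# fraction of training examples the model labels correctly
training_accuracy = demo_model.score(demo_inputs, iris_demo.target)

print(rows_sum_to_one, same_answer, training_accuracy)
```

Keep in mind that accuracy measured on the training data tends to be optimistic compared with accuracy on data the model has not seen.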
Answer the following questions. You can also use the graph below, if seeing the data visually helps you understand the data.
The plot above is only showing the data, and not anything about what the model learned. Come up with some ideas for how to show the model fit and implement one of them in code. Remember, we are here to help if you are not sure how to write the code for your ideas!
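If you get stuck, one possible idea (certainly not the only one) is to ask the model for a prediction at every point on a fine grid and shade the plot by predicted class. The sketch below does that under a few assumptions of ours: a 200 by 200 grid, a margin of 0.5 around the data, and a refit model called `demo_model` so that the snippet runs independently.

```python
import numpy
import matplotlib.pyplot
from sklearn import linear_model
from sklearn.datasets import load_iris

iris_demo = load_iris()
demo_inputs = iris_demo.data[:, [0, 3]]  # same two columns as above
demo_model = linear_model.LogisticRegression()
demo_model.fit(demo_inputs, iris_demo.target)

# build a grid of points covering the range of both features
x1_min, x1_max = demo_inputs[:, 0].min() - 0.5, demo_inputs[:, 0].max() + 0.5
x2_min, x2_max = demo_inputs[:, 1].min() - 0.5, demo_inputs[:, 1].max() + 0.5
grid_x1, grid_x2 = numpy.meshgrid(numpy.linspace(x1_min, x1_max, 200),
                                  numpy.linspace(x2_min, x2_max, 200))

# ask the model for a class at every grid point
grid_points = numpy.column_stack([grid_x1.ravel(), grid_x2.ravel()])
grid_classes = demo_model.predict(grid_points).reshape(grid_x1.shape)

# shade each region by predicted class, then draw the real data on top
matplotlib.pyplot.contourf(grid_x1, grid_x2, grid_classes, alpha=0.3)
matplotlib.pyplot.scatter(demo_inputs[:, 0], demo_inputs[:, 1],
                          c=iris_demo.target, edgecolors="black")
matplotlib.pyplot.xlabel(iris_demo.feature_names[0])
matplotlib.pyplot.ylabel(iris_demo.feature_names[3])
matplotlib.pyplot.show()
```

Points that sit in a region shaded for a different class are the ones the model misclassifies. Replacing `predict` with one column of `predict_proba` would show how confident the model is rather than just which class it picks.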