For more explanation of logistic regression, see
import numpy # for generating our dataset and formatting printed output
from sklearn import linear_model # for fitting our model
# force numpy not to use scientific notation, to make it easier to read the numbers the program prints out
numpy.set_printoptions(suppress=True)
# to display graphs in this notebook
%matplotlib inline
import matplotlib.pyplot
As we did in the linear regression notebook, we will be generating some fake data.
In this fake dataset, we have two types of plants.
Plant A tends to be taller (average 60cm) and thinner (average 8cm).
Plant B tends to be shorter (average 58cm) and wider (average 10cm).
The heights and widths of both plants are normally distributed (they follow a bell curve).
Class 0 will represent Plant A, and Class 1 will represent Plant B.
NUM_INPUTS = 50 # inputs per class
PLANT_A_AVG_HEIGHT = 60.0
PLANT_A_AVG_WIDTH = 8.0
PLANT_B_AVG_HEIGHT = 58.0
PLANT_B_AVG_WIDTH = 10.0
# Pick numbers randomly with a normal distribution centered around the averages
plant_a_heights = numpy.random.normal(loc=PLANT_A_AVG_HEIGHT, size=NUM_INPUTS)
plant_a_widths = numpy.random.normal(loc=PLANT_A_AVG_WIDTH, size=NUM_INPUTS)
plant_b_heights = numpy.random.normal(loc=PLANT_B_AVG_HEIGHT, size=NUM_INPUTS)
plant_b_widths = numpy.random.normal(loc=PLANT_B_AVG_WIDTH, size=NUM_INPUTS)
# this creates a 2-dimensional matrix, with heights in the first column and widths in the second
# the first half of rows are all plants of type a and the second half are type b
plant_inputs = list(zip(numpy.append(plant_a_heights, plant_b_heights),
numpy.append(plant_a_widths, plant_b_widths)))
# this is a list where the first half are 0s (representing plants of type a) and the second half are 1s (type b)
classes = [0]*NUM_INPUTS + [1]*NUM_INPUTS
Let's visualize our dataset, so that we can better understand what it looks like.
# create a figure and label it
fig = matplotlib.pyplot.figure()
fig.suptitle('Plant Data Set')
matplotlib.pyplot.xlabel('Height')
matplotlib.pyplot.ylabel('Width')
# put the generated points on the graph
a_scatter = matplotlib.pyplot.scatter(plant_a_heights, plant_a_widths, c="red", marker="o", label='plant a')
b_scatter = matplotlib.pyplot.scatter(plant_b_heights, plant_b_widths, c="blue", marker="^", label='plant b')
# add a legend to explain which points are which
matplotlib.pyplot.legend(handles=[a_scatter, b_scatter])
# show the graph
matplotlib.pyplot.show()
Next, we want to fit our logistic regression model to our dataset.
model = linear_model.LogisticRegression()
model.fit(plant_inputs, classes)
print('Intercept: {0} Coefficients: {1}'.format(model.intercept_, model.coef_))
Intercept: [0.4923611] Coefficients: [[-0.30765052 1.95764602]]
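What do these printed numbers mean? Logistic regression predicts the probability of class 1 as the sigmoid of a linear combination of the inputs: intercept + (height coefficient × height) + (width coefficient × width). Here is a minimal sketch of that relationship using a small toy dataset (not the plant data above), verifying a hand-computed sigmoid against `predict_proba`:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: two "short and wide" vs two "tall and thin" examples
X = [[60.0, 8.0], [61.0, 7.5], [58.0, 10.0], [57.5, 10.5]]
y = [0, 0, 1, 1]
model = LogisticRegression().fit(X, y)

# Compute the class-1 probability by hand for a new point
x_new = np.array([59.0, 9.0])
z = model.intercept_[0] + model.coef_[0] @ x_new # linear combination
p1 = 1.0 / (1.0 + np.exp(-z))                    # sigmoid squashes z into (0, 1)

# This should match the second column of predict_proba
print(p1)
print(model.predict_proba([x_new])[0, 1])
```

This also explains the signs of the coefficients in the output above: the negative height coefficient and positive width coefficient mean that shorter, wider plants get pushed toward class 1 (Plant B).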
Now we can make some predictions using the trained model. Note that we are generating the new data exactly the same way that we generated the training data above.
# Generate some new random values for two plants, one of each class
new_a_height = numpy.random.normal(loc=PLANT_A_AVG_HEIGHT)
new_a_width = numpy.random.normal(loc=PLANT_A_AVG_WIDTH)
new_b_height = numpy.random.normal(loc=PLANT_B_AVG_HEIGHT)
new_b_width = numpy.random.normal(loc=PLANT_B_AVG_WIDTH)
# Pull the values into a matrix, because that is what the predict function wants
inputs = [[new_a_height, new_a_width], [new_b_height, new_b_width]]
# Print out the outputs for these new inputs
print('Plant A: {0} {1}'.format(new_a_height, new_a_width))
print('Plant B: {0} {1}'.format(new_b_height, new_b_width))
print('Class predictions: {0}'.format(model.predict(inputs))) # guess which class
print('Probabilities:\n{0}'.format(model.predict_proba(inputs))) # give probability of each class
Plant A: 59.93207269053181 7.030073699859045
Plant B: 59.619487931898846 10.240893276828622
Class predictions: [0 1]
Probabilities:
[[0.98498204 0.01501796]
 [0.09989079 0.90010921]]
Answer the following questions. You can also refer back to the graph above, if seeing the data visually helps you understand it.
The plot above shows only the data, not anything about what the model learned. Come up with some ideas for how to show the model fit, and implement one of them in code. Remember, we are here to help if you are not sure how to write the code for your ideas!
If you have more than two classes, you can use multinomial logistic regression or the one-vs-rest technique: train one binomial logistic regression per class that decides whether an example is or is not in that class, then pick the class whose model is most confident. Try expanding the program with a third plant type and implementing your own one-vs-rest models. To test whether this is working, compare your output to scikit-learn's on the expanded dataset (scikit-learn handles more than two classes automatically; depending on the version and solver, it uses either a multinomial model or one-vs-rest).
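To make the one-vs-rest idea concrete, here is a minimal sketch on made-up data (three well-separated 2-D clusters, not the plant dataset). It trains one binary model per class, picks the class whose model gives the highest probability, and compares the result to scikit-learn's `OneVsRestClassifier` wrapper:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Three toy clusters centered at (0,0), (3,0), and (0,3)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 0.5, size=(20, 2)),
               rng.normal([3, 0], 0.5, size=(20, 2)),
               rng.normal([0, 3], 0.5, size=(20, 2))])
y = np.array([0] * 20 + [1] * 20 + [2] * 20)

# Hand-rolled one-vs-rest: one binary model per class,
# trained on "is this class" (1) vs "is anything else" (0)
binary_models = [LogisticRegression().fit(X, (y == k).astype(int))
                 for k in [0, 1, 2]]

def ovr_predict(points):
    # Each column holds one binary model's probability of "yes, this class";
    # the prediction is the class whose model is most confident
    scores = np.column_stack([m.predict_proba(points)[:, 1]
                              for m in binary_models])
    return scores.argmax(axis=1)

# Compare against scikit-learn's built-in one-vs-rest wrapper;
# the two should agree on points near the cluster centers
test_points = np.array([[0.1, 0.2], [2.9, -0.1], [0.2, 2.8]])
print(ovr_predict(test_points))
print(OneVsRestClassifier(LogisticRegression()).fit(X, y).predict(test_points))
```

The same pattern carries over to the plant data: add a third plant type with its own average height and width, build one binary classifier per type, and check your predictions against scikit-learn's.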