By Jen Selby and Carl Shan
This Jupyter Notebook will introduce you to building a Linear Regression model using the scikit-learn (aka sklearn) Python library.
You can see a basic example here:
http://scikit-learn.org/stable/modules/linear_model.html#ordinary-least-squares
and full documentation of the sklearn linear_model module here:
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
Make sure you've read and learned a bit about the Linear Regression model. Click here for course notes.
Read through the instructions and code behind the following sections:
Then, pick and complete at least one of the sets of exercises (Standard or Advanced) and write code that answers each set of questions.
First, make sure you have installed all of the necessary Python libraries, following the instructions here.
You should have sklearn, numpy, matplotlib, and pandas installed. If you haven't installed them yet, run pip install <library here> in your Terminal for each one.
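As a quick sanity check (a sketch, not part of the original notebook), you can confirm the four libraries import cleanly before going further:

```python
# Try importing each required library and print its version.
# If any import fails, install that library with pip first.
import importlib

for name in ('sklearn', 'numpy', 'matplotlib', 'pandas'):
    module = importlib.import_module(name)
    print(name, getattr(module, '__version__', '(version unknown)'))
```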
Next, we want to make sure we can display our graphs in this notebook and import all of the libraries we'll need into the notebook.
# We're going to be doing some plotting, and we want to be able to see these plots.
# To display graphs in this notebook, run this cell.
%matplotlib inline
# We're now going to import some important libraries
import numpy.random # for generating a noisy data set
from sklearn import linear_model # for training a linear model
import matplotlib.pyplot # for plotting in general
from mpl_toolkits.mplot3d import Axes3D # for 3D plotting
import pandas as pd
We're going to generate some fake data to test out our ideas about linear regression. These constant variables decide some of the characteristics of our data: the x range (which will also be used to set the size of the graph later) and how many inputs we should generate.
# Setting the limits and number of our first, X, variable
MIN_X = -10
MAX_X = 10
NUM_INPUTS = 50
Our first dataset has just one input feature. We are going to pick out 50 random real numbers between our min and max. Then, we will generate one output for each of these inputs following the function $y = 0.3x + 1$.
# randomly pick numbers for x
x_one_x = numpy.random.uniform(low=MIN_X, high=MAX_X, size=(NUM_INPUTS, 1))
print(x_one_x)
[[ 6.2042874 ]
 [-6.44558848]
 [ 6.15347981]
 [-5.84754016]
 [ 1.43601348]
 [-4.31411709]
 [-9.82494627]
 [ 8.48626601]
 [-7.62915955]
 [-3.29137353]
 [-9.99398847]
 [-8.37608792]
 [ 5.07202459]
 [ 5.50636949]
 [ 6.09568009]
 [-4.30089789]
 [-8.88273978]
 [ 9.12468103]
 [-7.73938696]
 [-9.33474834]
 [-3.49694032]
 [-8.9676608 ]
 [-2.80176355]
 [-5.03206763]
 [-0.68356   ]
 [ 1.73552019]
 [ 7.9379289 ]
 [-7.70543788]
 [-1.45995305]
 [ 5.09314035]
 [ 5.99847056]
 [ 3.34302821]
 [-8.10582136]
 [-2.26602336]
 [-2.27335965]
 [-4.09892983]
 [-8.99217476]
 [ 8.90280292]
 [-8.6455045 ]
 [-4.26283741]
 [ 0.11768981]
 [ 5.15041637]
 [ 8.15758258]
 [-5.45726117]
 [-6.92202854]
 [-9.78166627]
 [ 5.57196798]
 [ 4.4655849 ]
 [ 3.24344148]
 [ 5.48035288]]
Let's store this data in a pandas DataFrame object and name the column 'x'.
data_one_x = pd.DataFrame(data=x_one_x, columns=['x'])
data_one_x.head()
|   | x |
|---|---|
| 0 | 6.204287 |
| 1 | -6.445588 |
| 2 | 6.153480 |
| 3 | -5.847540 |
| 4 | 1.436013 |
Cool. Now we have some fake x data. Let's make the fake y data now.
Let's try to make data that follows the equation: $y = 0.3x + 1$.
data_one_x['y'] = 0.3 * data_one_x['x'] + 1
data_one_x.plot.scatter(x='x', y='y')
<matplotlib.axes._subplots.AxesSubplot at 0x1195b1290>
Okay. That looks too perfect. Most data in the real world looks less linear than that. So let's add a little bit of noise. Noise consists of random perturbations to your data that happen naturally in the real world; we will simulate some. Otherwise our linear model will have it too easy.
Note: We can generate noise by picking numbers from a normal distribution (also called a bell curve) centered around zero.
# First, let's create some noise to make our data a little bit more spread out.
# generate some normally distributed noise
noise_one_x = numpy.random.normal(size=NUM_INPUTS)
# Now let's create the 'y' variable
# It turns out you can make a new column in pandas just by doing the below.
# It's so simple!
data_one_x['y'] = data_one_x['y'] + noise_one_x
data_one_x.plot.scatter(x='x', y='y')
<matplotlib.axes._subplots.AxesSubplot at 0x11b6f7450>
Great!
This looks more like real data now.
Now that we have our data, we can train our model to find the best fit line. We will use the linear model module from the scikit-learn library to do this.
Note: you may get a warning about LAPACK. According to this discussion on the scikit-learn github page, this is safe to ignore.
# This creates an "empty" linear model
model_one_x = linear_model.LinearRegression()
First, we need to reshape our data.
Currently, our data looks like the following:
# data_one_x['x'] looks like
[-3.44342026, 9.60082542, 4.99683803, 7.11339915, 9.69287893, ...]
In other words, it's just a flat list. However, that isn't sufficient: later on we will call .fit(), and that method expects the data to look like a list of lists.
For example:
[[-3.44342026],
[ 9.60082542],
[ 4.99683803],
[ 7.11339915],
[ 9.69287893],
[-5.1383316 ],
[ 8.96638209],
...
[-9.12492363]]
We will use the .reshape() command to do this.
# Run this code
x_one_x = data_one_x['x'].values.reshape(-1, 1)
y_one_x = data_one_x['y'].values.reshape(-1, 1)
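To see what .reshape(-1, 1) actually does, here's a tiny standalone sketch (not in the original notebook): the -1 tells numpy to infer that dimension's size, and the 1 forces every value into its own single-element row.

```python
import numpy

a = numpy.array([-3.44, 9.60, 4.99])
print(a.shape)  # (3,) -- a flat list of values

b = a.reshape(-1, 1)  # -1 means "figure out this dimension for me"
print(b.shape)  # (3, 1) -- a list of single-element lists
print(b)
```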
There we go. Now we can "fit" the data.
"Fitting" the data means to give the "empty model" real data and ask it to find the "best parameters" that "best fits" the data.
Using the amazing sklearn
library, it's as easy as running the .fit()
command.
# Run this code
model_one_x.fit(X=x_one_x, y=y_one_x)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
Now, let's see what our model learned. We can look at the results numerically:
def print_model_fit(model):
    # Print out the parameters for the best-fit line
    print('Intercept: {i} Coefficients: {c}'.format(i=model.intercept_, c=model.coef_))
print_model_fit(model_one_x)
Intercept: [1.11825823] Coefficients: [[0.32606958]]
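As a sanity check (a sketch, not part of the original notebook): if we fit on noiseless data generated from $y = 0.3x + 1$, the learned intercept and coefficient come out at (almost exactly) 1 and 0.3. With the noisy data above, they land near those values instead, which is what the printout shows.

```python
import numpy
from sklearn import linear_model

# regenerate data from the same function, but with no noise added
x = numpy.random.uniform(low=-10, high=10, size=(50, 1))
y = 0.3 * x + 1

check_model = linear_model.LinearRegression()
check_model.fit(x, y)
print(check_model.intercept_, check_model.coef_)  # ≈ [1.] [[0.3]]
```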
## How would this model make predictions?
# Let's make some new data that have the following values and see how to predict their corresponding 'y' values.
# Print out the model's guesses for some values of x
new_x_values = [ [-1.23], [0.66], [1.98] ]
predictions = model_one_x.predict(new_x_values)
print(predictions)
[[0.71719265]
 [1.33346415]
 [1.76387599]]
# Let's print them a little bit nicer
for datapoint, prediction in zip(new_x_values, predictions):
    print('Model prediction for {}: {}'.format(datapoint[0], prediction))
Model prediction for -1.23: [0.71719265]
Model prediction for 0.66: [1.33346415]
Model prediction for 1.98: [1.76387599]
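To connect predict() back to the printed parameters: a prediction is just intercept + coefficient × x. This sketch (not from the original notebook) fits on exactly-linear data and checks the hand computation against predict().

```python
import numpy
from sklearn import linear_model

X = numpy.array([[0.0], [1.0], [2.0], [3.0]])
y = numpy.array([1.0, 1.3, 1.6, 1.9])  # exactly y = 0.3x + 1

m = linear_model.LinearRegression().fit(X, y)

# prediction by hand: intercept + coefficient * x
by_hand = m.intercept_ + m.coef_[0] * -1.23
print(by_hand, m.predict([[-1.23]])[0])  # the two values should match
```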
We can also look at them graphically.
def plot_best_fit_line(model, x, y):
    # create the figure
    fig = matplotlib.pyplot.figure(1)
    fig.suptitle('Data and Best-Fit Line')
    matplotlib.pyplot.xlabel('x values')
    matplotlib.pyplot.ylabel('y values')

    # put the generated dataset points on the graph
    matplotlib.pyplot.scatter(x, y)

    # To draw the best-fit line, predict for evenly spaced
    # inputs along the x range and plot those predictions.
    X = numpy.linspace(MIN_X, MAX_X)  # evenly spaced values across the x range
    Y = model.predict(X.reshape(-1, 1))
    matplotlib.pyplot.plot(X, Y)
plot_best_fit_line(model_one_x, x_one_x, y_one_x)
Answer the following questions about dataset 1:

Look at the output of the print_model_fit() function in the "Results and Visualization" section above. What numbers did you expect to see printed if the linear regression code was working, and why?
Let's look at a dataset that has two inputs, like the tree example in our notes.
NOTE: This will make it a little harder to visualize, particularly because you cannot rotate the graph interactively in the Jupyter notebook. If you are interested in looking more closely at this graph, you can copy the code in the next several cells into a file and run it through Python normally. This will open a graph window that allows you to drag to rotate the graph.
# generate some normally distributed noise
noise_two_x = numpy.random.normal(size=NUM_INPUTS)
# randomly pick pairs of numbers for x
x1_two_x = numpy.random.uniform(low=MIN_X, high=MAX_X, size=NUM_INPUTS)
x2_two_x = numpy.random.uniform(low=MIN_X, high=MAX_X, size=NUM_INPUTS)
y_two_x = 0.5 * x1_two_x - 2.7 * x2_two_x - 2 + noise_two_x
data_two_x = pd.DataFrame(data=x1_two_x, columns = ['x1'])
data_two_x['x2'] = x2_two_x
data_two_x['y'] = y_two_x
data_two_x.head()
|   | x1 | x2 | y |
|---|---|---|---|
| 0 | 2.557584 | 4.564261 | -13.463797 |
| 1 | -0.881504 | 3.287239 | -12.344193 |
| 2 | -0.231691 | 9.680882 | -27.295643 |
| 3 | 8.284173 | 1.276975 | -2.805065 |
| 4 | 3.542796 | 2.611530 | -7.216126 |
# use scikit-learn's linear regression model and fit to our data
model_two_x = linear_model.LinearRegression()
model_two_x.fit(data_two_x[['x1', 'x2']], data_two_x['y'])
# Print out the parameters for the best fit plane
print_model_fit(model_two_x)
Intercept: -2.0817409227903294 Coefficients: [ 0.48804114 -2.72217748]
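Same sanity check as before, now with two inputs (a sketch, not in the original notebook): fit on noiseless data generated from $y = 0.5x_1 - 2.7x_2 - 2$, and the intercept and both coefficients should come back (almost) exactly.

```python
import numpy
import pandas as pd
from sklearn import linear_model

# regenerate the two-input data with no noise
x1 = numpy.random.uniform(low=-10, high=10, size=50)
x2 = numpy.random.uniform(low=-10, high=10, size=50)

check_data = pd.DataFrame({'x1': x1, 'x2': x2})
check_data['y'] = 0.5 * check_data['x1'] - 2.7 * check_data['x2'] - 2

check_model = linear_model.LinearRegression()
check_model.fit(check_data[['x1', 'x2']], check_data['y'])
print(check_model.intercept_, check_model.coef_)  # ≈ -2.0 [ 0.5 -2.7]
```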
## Now create a function that can plot in 3D
def plot_3d(model, x1, x2, y):
    # create the figure
    fig = matplotlib.pyplot.figure(1)
    fig.suptitle('3D Data and Best-Fit Plane')

    # add axes that use a 3D projection
    # (fig.gca(projection='3d') is deprecated in newer matplotlib versions)
    axes = fig.add_subplot(projection='3d')
    axes.set_xlabel('x1')
    axes.set_ylabel('x2')
    axes.set_zlabel('y')

    # put the generated points on the graph
    axes.scatter(x1, x2, y)

    # predict for input points across the graph to find the best-fit plane
    # and arrange them into a grid for matplotlib
    X1 = X2 = numpy.arange(MIN_X, MAX_X, 0.05)
    X1, X2 = numpy.meshgrid(X1, X2)
    Y = numpy.array(model.predict(list(zip(X1.flatten(), X2.flatten())))).reshape(X1.shape)

    # put the predicted plane on the graph
    axes.plot_surface(X1, X2, Y, alpha=0.1)

    # show the plots
    matplotlib.pyplot.show()
# Now let's use the function
plot_3d(model_two_x, x1_two_x, x2_two_x, y_two_x)
Now, answer the following questions about Fake Dataset 2:

Look at the output of the print_model_fit() function for the dataset above. What output did you expect to see printed if the linear regression code was working, and why?
The new equation we'll try to model is $y = 0.7x^2 - 0.4x + 1.5$.
This dataset still just has one input, so the code is very similar to our first one. However, now the generating function is quadratic, so this one will be trickier to deal with.
Again, we'll go through dataset generation, training, and visualization.
# randomly pick numbers for x
x_quadratic = numpy.random.uniform(low=MIN_X, high=MAX_X, size=(NUM_INPUTS, 1))
data_quadratic = pd.DataFrame(data=x_quadratic, columns=['x'])
# Let's create some noise to make our data a little bit more spread out.
# generate some normally distributed noise
noise_quadratic = numpy.random.normal(size=NUM_INPUTS)
# Let's generate the y values
# Our equation:
# y = 0.7x^2 - 0.4x + 1.5
data_quadratic['y'] = 0.7 * data_quadratic['x'] * data_quadratic['x'] - 0.4 * data_quadratic['x'] + 1.5 + noise_quadratic
# reshape the input data into columns (lists of single-element lists)
x_quadratic = data_quadratic['x'].values.reshape(-1, 1)
y_quadratic = data_quadratic['y'].values.reshape(-1, 1)
# Let's try using scikit-learn's linear regression model and fitting it to our data
model_quadratic = linear_model.LinearRegression()
model_quadratic.fit(x_quadratic, y_quadratic)
# show results
print_model_fit(model_quadratic)
plot_best_fit_line(model_quadratic, x_quadratic, y_quadratic)
Intercept: [26.1204316] Coefficients: [[-0.22266468]]
First, look over and understand the data for Fake Dataset 3.
There are some issues here. Clearly, the linear model we have isn't fitting this data well.
Your challenge is to write some new code that will better fit a linear model to this data. There are a couple different ways to do this, but all of them will involve some new code. If you have ideas but just aren't sure how to translate them into code, please ask for help!
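As a hedged hint before you start (one possible direction, not necessarily the intended solution): linear regression is only linear in its *input features*, so nothing stops you from handing it a transformed copy of x, such as x², as an additional feature. This sketch shows the idea on noiseless data from the generating function $y = 0.7x^2 - 0.4x + 1.5$.

```python
import numpy
from sklearn import linear_model

x = numpy.random.uniform(low=-10, high=10, size=(50, 1))
y = 0.7 * x**2 - 0.4 * x + 1.5  # noiseless version of the quadratic data

# stack x and x^2 side by side as two input columns;
# the model is still linear in these two features
features = numpy.hstack([x, x**2])
hint_model = linear_model.LinearRegression().fit(features, y)
print(hint_model.intercept_, hint_model.coef_)  # ≈ [1.5] [[-0.4  0.7]]
```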
### Your code here
Try adding some regularization to your linear regression model. This will give you some practice in using the scikit-learn documentation to find new functions and figure out how to use them.
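As a starting point only (a sketch, assuming the usual scikit-learn API; the documentation covers the options in depth): Ridge is ordinary linear regression plus an L2 penalty on the coefficients, and its alpha parameter controls how strongly large coefficients are penalized.

```python
import numpy
from sklearn import linear_model

x = numpy.random.uniform(low=-10, high=10, size=(50, 1))
y = 0.3 * x + 1 + numpy.random.normal(size=(50, 1))  # noisy linear data

# alpha=1.0 is the default penalty strength; try varying it
ridge = linear_model.Ridge(alpha=1.0).fit(x, y)
print(ridge.intercept_, ridge.coef_)
```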
### Your code here