In school, students are taught to draw lines like the following.
$$ y = 2 x + 1$$They're taught to pick two values for $x$ and calculate the corresponding values for $y$ using the equation. Then they draw a set of axes, plot the points, and then draw a line extending through the two dots on their axes.
# Import matplotlib.
import matplotlib.pyplot as plt
# Draw some axes.
plt.plot([-1, 10], [0, 0], 'k-')
plt.plot([0, 0], [-1, 10], 'k-')
# Plot the red, blue and green lines.
plt.plot([1, 1], [-1, 3], 'b:')
plt.plot([-1, 1], [3, 3], 'r:')
# Plot the two points (1,3) and (2,5).
plt.plot([1, 2], [3, 5], 'ko')
# Join them with an (extending) green lines.
plt.plot([-1, 10], [-1, 21], 'g-')
# Set some reasonable plot limits.
plt.xlim([-1, 10])
plt.ylim([-1, 10])
# Show the plot.
plt.show()
Simple linear regression is about the opposite problem - what if you have some points and are looking for the equation? It's easy when the points are perfectly on a line already, but usually real-world data has some noise. The data might still look roughly linear, but aren't exactly so.
Suppose you are trying to weigh your suitcase to avoid an airline's extra charges. You don't have a weighing scales, but you do have a spring and some gym-style weights of masses 7KG, 14KG and 21KG. You attach the spring to the wall hook, and mark where the bottom of it hangs. You then hang the 7KG weight on the end and mark where the bottom of the spring is. You repeat this with the 14KG weight and the 21KG weight. Finally, you place your case hanging on the spring, and the spring hangs down halfway between the 7KG mark and the 14KG mark. Is your case over the 10KG limit set by the airline?
When you look at the marks on the wall, it seems that the 0KG, 7KG, 14KG and 21KG marks are evenly spaced. You wonder if that means your case weighs 10.5KG. That is, you wonder if there is a linear relationship between the distance the spring's hook is from its resting position, and the mass on the end of it.
You decide to experiment. You buy some new weights - a 1KG, a 2KG, a 3Kg, all the way up to 20KG. You place them each in turn on the spring and measure the distance the spring moves from the resting position. You tabulate the data and plot them.
Here we'll import the Python libraries we need for or investigations below.
# Make matplotlib show interactive plots in the notebook.
%matplotlib inline
# numpy efficiently deals with numerical multi-dimensional arrays.
import numpy as np
# matplotlib is a plotting library, and pyplot is its easy-to-use module.
import matplotlib.pyplot as plt
# This just sets the default plot size to be bigger.
plt.rcParams['figure.figsize'] = (8, 6)
Ignore the next couple of lines where I fake up some data. I'll use the fact that I faked the data to explain some results later. Just pretend that w is an array containing the weight values and d are the corresponding distance measurements.
w = np.arange(0.0, 21.0, 1.0)
d = 5.0 * w + 10.0 + np.random.normal(0.0, 5.0, w.size)
# Let's have a look at w.
w
array([ 0., 1., 2., 3., 4., 5., 6., 7., 8., 9., 10., 11., 12., 13., 14., 15., 16., 17., 18., 19., 20.])
# Let's have a look at d.
d
array([ 1.36251906, 19.73655544, 30.55256114, 29.16495485, 32.84776304, 32.23321566, 38.64757038, 50.38070997, 43.42743177, 49.58100346, 59.34981756, 67.27890951, 69.47270136, 76.30057939, 75.71234339, 88.99962243, 93.7598318 , 85.76109454, 96.9676159 , 105.60199837, 107.1682508 ])
Let's have a look at the data from our experiment.
# Create the plot.
plt.plot(w, d, 'k.')
# Set some properties for the plot.
plt.xlabel('Weight (KG)')
plt.ylabel('Distance (CM)')
# Show the plot.
plt.show()
It looks like the data might indeed be linear. The points don't exactly fit on a straight line, but they are not far off it. We might put that down to some other factors, such as the air density, or errors, such as in our tape measure. Then we can go ahead and see what would be the best line to fit the data.
All straight lines can be expressed in the form $y = mx + c$. The number $m$ is the slope of the line. The slope is how much $y$ increases by when $x$ is increased by 1.0. The number $c$ is the y-intercept of the line. It's the value of $y$ when $x$ is 0.
To fit a straight line to the data, we just must pick values for $m$ and $c$. These are called the parameters of the model, and we want to pick the best values possible for the parameters. That is, the best parameter values given the data observed. Below we show various lines plotted over the data, with different values for $m$ and $c$.
# Plot w versus d with black dots.
plt.plot(w, d, 'k.', label="Data")
# Overlay some lines on the plot.
x = np.arange(0.0, 21.0, 1.0)
plt.plot(x, 5.0 * x + 10.0, 'r-', label=r"$5x + 10$")
plt.plot(x, 6.0 * x + 5.0, 'g-', label=r"$6x + 5$")
plt.plot(x, 5.0 * x + 15.0, 'b-', label=r"$5x + 15$")
# Add a legend.
plt.legend()
# Add axis labels.
plt.xlabel('Weight (KG)')
plt.ylabel('Distance (CM)')
# Show the plot.
plt.show()
You can see that each of these lines roughly fits the data. Which one is best, and is there another line that is better than all three? Is there a "best" line?
It depends how you define the word best. Luckily, everyone seems to have settled on what the best means. The best line is the one that minimises the following calculated value.
$$ \sum_i (y_i - mx_i - c)^2 $$Here $(x_i, y_i)$ is the $i^{th}$ point in the data set and $\sum_i$ means to sum over all points. The values of $m$ and $c$ are to be determined. We usually denote the above as $Cost(m, c)$.
Where does the above calculation come from? It's easy to explain the part in the brackets $(y_i - mx_i - c)$. The corresponding value to $x_i$ in the dataset is $y_i$. These are the measured values. The value $m x_i + c$ is what the model says $y_i$ should have been. The difference between the value that was observed ($y_i$) and the value that the model gives ($m x_i + c$), is $y_i - mx_i - c$.
Why square that value? Well note that the value could be positive or negative, and you sum over all of these values. If we allow the values to be positive or negative, then the positive could cancel the negatives. So, the natural thing to do is to take the absolute value $\mid y_i - m x_i - c \mid$. Well it turns out that absolute values are a pain to deal with, and instead it was decided to just square the quantity instead, as the square of a number is always positive. There are pros and cons to using the square instead of the absolute value, but the square is used. This is usually called least squares fitting.
# Calculate the cost of the lines above for the data above.
cost = lambda m,c: np.sum([(d[i] - m * w[i] - c)**2 for i in range(w.size)])
print("Cost with m = %5.2f and c = %5.2f: %8.2f" % (5.0, 10.0, cost(5.0, 10.0)))
print("Cost with m = %5.2f and c = %5.2f: %8.2f" % (6.0, 5.0, cost(6.0, 5.0)))
print("Cost with m = %5.2f and c = %5.2f: %8.2f" % (5.0, 15.0, cost(5.0, 15.0)))
Cost with m = 5.00 and c = 10.00: 503.89 Cost with m = 6.00 and c = 5.00: 2148.92 Cost with m = 5.00 and c = 15.00: 1085.82
We want to calculate values of $m$ and $c$ that give the lowest value for the cost value above. For our given data set we can plot the cost value/function. Recall that the cost is:
$$ Cost(m, c) = \sum_i (y_i - mx_i - c)^2 $$This is a function of two variables, $m$ and $c$, so a plot of it is three dimensional. See the Advanced section below for the plot.
In the case of fitting a two-dimensional line to a few data points, we can easily calculate exactly the best values of $m$ and $c$. Some of the details are discussed in the Advanced section, as they involve calculus, but the resulting code is straight-forward. We first calculate the mean (average) values of our $x$ values and that of our $y$ values. Then we subtract the mean of $x$ from each of the $x$ values, and the mean of $y$ from each of the $y$ values. Then we take the dot product of the new $x$ values and the new $y$ values and divide it by the dot product of the new $x$ values with themselves. That gives us $m$, and we use $m$ to calculate $c$.
Remember that in our dataset $x$ is called $w$ (for weight) and $y$ is called $d$ (for distance). We calculate $m$ and $c$ below.
# Calculate the best values for m and c.
# First calculate the means (a.k.a. averages) of w and d.
w_avg = np.mean(w)
d_avg = np.mean(d)
# Subtract means from w and d.
w_zero = w - w_avg
d_zero = d - d_avg
# The best m is found by the following calculation.
m = np.sum(w_zero * d_zero) / np.sum(w_zero * w_zero)
# Use m from above to calculate the best c.
c = d_avg - m * w_avg
print("m is %8.6f and c is %6.6f." % (m, c))
m is 4.809677 and c is 11.632137.
Note that numpy has a function that will perform this calculation for us, called polyfit. It can be used to fit lines in many dimensions.
np.polyfit(w, d, 1)
array([ 4.80967703, 11.63213685])
So, the best values for $m$ and $c$ given our data and using least squares fitting are about $4.95$ for $m$ and about $11.13$ for $c$. We plot this line on top of the data below.
# Plot the best fit line.
plt.plot(w, d, 'k.', label='Original data')
plt.plot(w, m * w + c, 'b-', label='Best fit line')
# Add axis labels and a legend.
plt.xlabel('Weight (KG)')
plt.ylabel('Distance (CM)')
plt.legend()
# Show the plot.
plt.show()
Note that the $Cost$ of the best $m$ and best $c$ is not zero in this case.
print("Cost with m = %5.2f and c = %5.2f: %8.2f" % (m, c, cost(m, c)))
Cost with m = 4.99 and c = 8.84: 315.93
In this notebook we:
In the following sections we cover some of the more advanced concepts involved in fitting the line.
Earlier in the notebook we glossed over something important: we didn't actually do the weighing and measuring - we faked the data. A better term for this is simulation, which is an important tool in research, especially when testing methods such as simple linear regression.
We ran the following two commands to do this:
w = np.arange(0.0, 21.0, 1.0)
d = 5.0 * w + 10.0 + np.random.normal(0.0, 5.0, w.size)
The first command creates a numpy array containing all values between 1.0 and 21.0 (including 1.0 but not including 21.0) in steps of 1.0.
np.arange(0.0, 21.0, 1.0)
array([ 0., 1., 2., 3., 4., 5., 6., 7., 8., 9., 10., 11., 12., 13., 14., 15., 16., 17., 18., 19., 20.])
The second command is more complex.
First it takes the values in the w
array, multiplies each by 5.0 and then adds 10.0.
5.0 * w + 10.0
array([ 10., 15., 20., 25., 30., 35., 40., 45., 50., 55., 60., 65., 70., 75., 80., 85., 90., 95., 100., 105., 110.])
It then adds an array of the same length containing random values. The values are taken from what is called the normal distribution with mean 0.0 and standard deviation 5.0.
np.random.normal(0.0, 5.0, w.size)
array([-11.0513071 , -7.4718436 , -5.3680209 , 5.34030017, -2.78427481, -2.70242565, 3.64579591, 0.58326755, 4.63175628, -0.62130672, 6.7045042 , -7.47398151, -1.61108867, 0.99219224, 3.87220209, 6.471432 , 0.34497497, 10.80337659, -2.69464773, 4.36309782, 13.11690738])
The normal distribution follows a bell shaped curve. The curve is centred on the mean (0.0 in this case) and its general width is determined by the standard deviation (5.0 in this case).
# Plot the normal distrution.
normpdf = lambda mu, s, x: (1.0 / (2.0 * np.pi * s**2)) * np.exp(-((x - mu)**2)/(2 * s**2))
x = np.linspace(-20.0, 20.0, 100)
y = normpdf(0.0, 5.0, x)
plt.plot(x, y)
plt.show()
The idea here is to add a little bit of randomness to the measurements of the distance. The random values are entered around 0.0, with a greater than 99% chance they're within the range -15.0 to 15.0. The normal distribution is used because of the Central Limit Theorem which basically states that when a bunch of random effects happen together the outcome looks roughly like the normal distribution. (Don't quote me on that!)
We can plot the cost function for a given set of data points. Recall that the cost function involves two variables: $m$ and $c$, and that it looks like this:
$$ Cost(m,c) = \sum_i (y_i - mx_i - c)^2 $$To plot a function of two variables we need a 3D plot. It can be difficult to get the viewing angle right in 3D plots, but below you can just about make out that there is a low point on the graph around the $(m, c) = (\approx 5.0, \approx 10.0)$ point.
# This code is a little bit involved - don't worry about it.
# Just look at the plot below.
from mpl_toolkits.mplot3d import Axes3D
# Ask pyplot a 3D set of axes.
ax = plt.figure().gca(projection='3d')
# Make data.
mvals = np.linspace(4.5, 5.5, 100)
cvals = np.linspace(0.0, 20.0, 100)
# Fill the grid.
mvals, cvals = np.meshgrid(mvals, cvals)
# Flatten the meshes for convenience.
mflat = np.ravel(mvals)
cflat = np.ravel(cvals)
# Calculate the cost of each point on the grid.
C = [np.sum([(d[i] - m * w[i] - c)**2 for i in range(w.size)]) for m, c in zip(mflat, cflat)]
C = np.array(C).reshape(mvals.shape)
# Plot the surface.
surf = ax.plot_surface(mvals, cvals, C)
# Set the axis labels.
ax.set_xlabel('$m$', fontsize=16)
ax.set_ylabel('$c$', fontsize=16)
ax.set_zlabel('$Cost$', fontsize=16)
# Show the plot.
plt.show()
Earlier we used a cost function to determine the best line to fit the data. Usually the data do not perfectly fit on the best fit line, and so the cost is greater than 0. A quantity closely related to the cost is the coefficient of determination, also known as the R-squared value. The purpose of the R-squared value is to measure how much of the variance in $y$ is determined by $x$.
For instance, in our example the main thing that affects the distance the spring is hanging down is the weight on the end. It's not the only thing that affects it though. The room temperature and density of the air at the time of measurment probably affect it a little. The age of the spring, and how many times it has been stretched previously probably also have a small affect. There are probably lots of unknown factors affecting the measurment.
The R-squared value estimates how much of the changes in the $y$ value is due to the changes in the $x$ value compared to all of the other factors affecting the $y$ value. It is calculated as follows:
$$ R^2 = 1 - \frac{\sum_i (y_i - m x_i - c)^2}{\sum_i (y_i - \bar{y})^2} $$Note that sometimes the Pearson correlation coefficient is used instead of the R-squared value. You can just square the Pearson coefficient to get the R-squred value.
# Calculate the R-squared value for our data set.
rsq = 1.0 - (np.sum((d - m * w - c)**2)/np.sum((d - d_avg)**2))
print("The R-squared value is %6.4f" % rsq)
The R-squared value is 0.9838
# The same value using numpy.
np.corrcoef(w, d)[0][1]**2
0.9838143439861745
Earlier we used the following calculation to calculate $m$ and $c$ for the line of best fit. The code was:
w_zero = w - np.mean(w)
d_zero = d - np.mean(d)
m = np.sum(w_zero * d_zero) / np.sum(w_zero * w_zero)
c = np.mean(d) - m * np.mean(w)
In mathematical notation we write this as:
$$ m = \frac{\sum_i (x_i - \bar{x}) (y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2} \qquad \textrm{and} \qquad c = \bar{y} - m \bar{x} $$where $\bar{x}$ is the mean of $x$ and $\bar{y}$ that of $y$.
Where did these equations come from? They were derived using calculus. We'll give a brief overview of it here, but feel free to gloss over this section if it's not for you. If you can understand the first part, where we calculate the partial derivatives, then great!
The calculations look complex, but if you know basic differentiation, including the chain rule, you can easily derive them. First, we differentiate the cost function with respect to $m$ while treating $c$ as a constant, called a partial derivative. We write this as $\frac{\partial m}{ \partial Cost}$, using $\delta$ as opposed to $d$ to signify that we are treating the other variable as a constant. We then do the same with respect to $c$ while treating $m$ as a constant. We set both equal to zero, and then solve them as two simultaneous equations in two variables.
Here we let $n$ be the length of $x$, which is also the length of $y$.
$$ \begin{align} & m = \frac{\sum_i x_i y_i - c \sum_i x_i}{\sum_i x_i x_i} \\[0.2cm] & \Rightarrow m = \frac{\sum x_i y_i - (\bar{y} - m \bar{x}) \sum x_i}{\sum x_i x_i} \\ & \Rightarrow m \sum x_i x_i = \sum x_i y_i - \bar{y} \sum x_i + m \bar{x} \sum x_i \\ & \Rightarrow m \sum x_i x_i - m \bar{x} \sum x_i = \sum x_i y_i - \bar{y} \sum x_i \\[0.3cm] & \Rightarrow m = \frac{\sum x_i y_i - \bar{y} \sum x_i}{\sum x_i x_i - \bar{x} \sum x_i} \\[0.2cm] & \Rightarrow m = \frac{\sum (x_i y_i) - n \bar{y} \bar{x}}{\sum (x_i x_i) - n \bar{x} \bar{x}} \\ & \Rightarrow m = \frac{\sum (x_i y_i) - n \bar{y} \bar{x} - n \bar{y} \bar{x} + n \bar{y} \bar{x}}{\sum (x_i x_i) - n \bar{x} \bar{x} - n \bar{x} \bar{x} + n \bar{x} \bar{x}} \\ & \Rightarrow m = \frac{\sum (x_i y_i) - \sum y_i \bar{x} - \sum \bar{y} x_i + n \bar{y} \bar{x}}{\sum (x_i x_i) - \sum x_i \bar{x} - \sum \bar{x} x_i + n \bar{x} \bar{x}} \\ & \Rightarrow m = \frac{\sum_i (x_i - \bar{x}) (y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2} \\ \end{align} $$