Gradient descent

Simple linear regression

In a previous notebook, we solved the problem of simple linear regression - finding a straight line that best fits a data set with two variables. In that case, we were able to find the exact solution. In this notebook, we'll use a common technique to approximate that solution.

Why would we want to approximate a solution when we can easily find an exact one? In this case we wouldn't - but the technique discussed here can also be used in situations where an exact solution can't be found, or would be too expensive to compute. The approximation technique is called gradient descent.

Preliminaries

In [1]:
# numpy efficiently deals with numerical multi-dimensional arrays.
import numpy as np

# matplotlib is a plotting library, and pyplot is its easy-to-use module.
import matplotlib.pyplot as pl

# This just sets the default plot size to be bigger.
pl.rcParams['figure.figsize'] = (16.0, 8.0)

Simple linear regression model

In simple linear regression, we have some data points $(x_i, y_i)$, and we decide that they belong to a straight line with a little bit of error involved. Straight lines in two dimensions are of the form $y = mx + c$, and to fit a line to our data points we must find appropriate values for $m$ and $c$. Numpy has a function called polyfit that finds such values for us.

In [2]:
# Create x values 1 to 15 and corresponding y values on the line
# y = 5x + 10, with some normally distributed noise added.
w = np.arange(1.0, 16.0, 1.0)
d = 5.0 * w + 10.0 + np.random.normal(0.0, 5.0, w.size)

# Fit a straight line (a degree 1 polynomial) to the data.
m, c = np.polyfit(w, d, 1)
print("Best fit is m = %f and c = %f" % (m, c))

# Plot the best fit line.
pl.plot(w, d, 'k.', label='Original data')
pl.plot(w, m * w + c, 'b-', label='Best fit: $%0.1f x + %0.1f$' % (m,c))
pl.legend()
pl.show()
Best fit is m = 5.404076 and c = 7.755549

Gradient descent

In gradient descent, we start with a guess for a parameter (it can be random) and iteratively improve it. For instance, we might pick $1.0$ as our initial guess for $m$ and then loop, improving the value of $m$ on each iteration. To improve $m$, we first take the partial derivative of our cost function with respect to $m$.

Cost function

Recall that our cost function for simple linear regression is:

$$ Cost(m, c) = \sum_i (y_i - mx_i - c)^2 $$

Calculate the partial derivatives

We calculate the partial derivative of $Cost$ with respect to $m$ while treating $c$ as a constant. Note that the $x_i$ and $y_i$ values are all just constants. We'll also calculate the partial derivative with respect to $c$ here.

$$ \begin{align} Cost(m, c) &= \sum_i (y_i - mx_i - c)^2 \\[1cm] \frac{\partial Cost}{\partial m} &= \sum 2(y_i - m x_i -c)(-x_i) \\ &= -2 \sum x_i (y_i - m x_i -c) \\[0.5cm] \frac{\partial Cost}{\partial c} & = \sum 2(y_i - m x_i -c)(-1) \\ & = -2 \sum (y_i - m x_i -c) \\ \end{align} $$

Code the partial derivatives

Once we've calculated the partial derivatives, we'll code them up in Python. Here we create two functions, each taking four parameters. The first two parameters are arrays containing our $x_i$ and $y_i$ data set values. The last two are our current guesses for $m$ and $c$.

In [3]:
def grad_m(x, y, m, c):
  # Partial derivative of the cost with respect to m.
  return -2.0 * np.sum(x * (y - m * x - c))
In [4]:
def grad_c(x, y, m, c):
  # Partial derivative of the cost with respect to c.
  return -2.0 * np.sum(y - m * x - c)
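As a quick sanity check (a small sketch, assuming the cells above have been run so that w, d and the polyfit values of m and c are still defined), both partial derivatives should be approximately zero at the best fit values found by polyfit:

In [ ]:
# Both gradients should be close to zero at the polyfit solution,
# since that solution minimises the cost.
print("grad_m at best fit:", grad_m(w, d, m, c))
print("grad_c at best fit:", grad_c(w, d, m, c))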

Iterate

Now we can run our gradient descent algorithm. For $m$, we repeatedly replace its value with $m - \eta \, grad\_m(x, y, m, c)$ until it stops changing, and likewise we repeatedly replace $c$ with $c - \eta \, grad\_c(x, y, m, c)$ until it stops changing.

What is $\eta$? It's called the learning rate, and we set it to a small value relative to the data points; here we'll use $0.0001$.

You can see that, on each iteration, $m$ and $c$ get closer to the best fit values found by polyfit above.

In [5]:
eta = 0.0001       # Learning rate.
m, c = 1.0, 1.0    # Initial guesses for m and c.
delta = 0.0000001  # Threshold for deciding the values have stopped changing.

# Previous values, initialised so that the loop runs at least once.
mold, cold = m - 1.0, c - 1.0
i = 0
while abs(mold - m) > delta and abs(cold - c) > delta:
  mold, cold = m, c
  m = mold - eta * grad_m(w, d, mold, cold)
  c = cold - eta * grad_c(w, d, mold, cold)

  i = i + 1
  # Print the current values every 1000 iterations.
  if i % 1000 == 0:
    print("m: %20.16f  c: %20.16f" % (m, c))
m:   5.7149506777702488  c:   4.5518695120877757
m:   5.5629397868829846  c:   6.1183980049248472
m:   5.4852588608786546  c:   6.9189286986499470
m:   5.4455621902673004  c:   7.3280175818245992
m:   5.4252763139437015  c:   7.5370710455442040
m:   5.4149097826482908  c:   7.6439019874021357
m:   5.4096122559659570  c:   7.6984949605832087
m:   5.4069051027099357  c:   7.7263931768575951
m:   5.4055216875445105  c:   7.7406497821666767
m:   5.4048147318094859  c:   7.7479352226744274
m:   5.4044534617990276  c:   7.7516582438453998
m:   5.4042688448364773  c:   7.7535607899748742
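As a rough check (a small sketch; the names m_fit and c_fit are introduced here just for the comparison), we can re-run polyfit and compare its exact solution with the values gradient descent converged to. Since both minimise the same cost, the two pairs of values should agree closely.

In [ ]:
# m and c now hold the gradient descent estimates from the loop above.
m_fit, c_fit = np.polyfit(w, d, 1)
print("Gradient descent: m = %f, c = %f" % (m, c))
print("polyfit:          m = %f, c = %f" % (m_fit, c_fit))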

Newton's method for square roots

Newton's method for square roots is a method for approximating the square root of a number $x$. We begin with an initial guess $z_0$ of the square root - it doesn't have to be particularly good. We then apply the following calculation repeatedly, to calculate $z_1$, then $z_2$, and so on:

$$ z_{i+1} = z_i - \frac{z_i^2 - x}{2z_i} $$
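This is just Newton's root-finding iteration $z_{i+1} = z_i - \frac{f(z_i)}{f'(z_i)}$ applied to the function $f(z) = z^2 - x$, whose positive root is the square root of $x$:

$$ f(z) = z^2 - x, \qquad f'(z) = 2z, \qquad z_{i+1} = z_i - \frac{f(z_i)}{f'(z_i)} = z_i - \frac{z_i^2 - x}{2z_i} $$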

Coding the calculation

We can create a function that calculates the next value of $z$ based on the current value of $z$ as follows.

In [6]:
def next_z(x, z):
  return z - (z**2 - x) / (2 * z)

Calculating the square root of $x$

Suppose we want to calculate the square root of $x$. We start with an initial guess for the square root, $z_0$ (the code below just uses $2.0$). We then apply the next_z function repeatedly until the value of $z$ stops changing. Let's create a function to do this. We'll include the next_z function inside the newtsqrt function to make it self-contained.

In [7]:
def newtsqrt(x):
  next_z = lambda x, z: z - (z**2 - x) / (2 * z)
  z = 2.0           # Initial guess.
  n = next_z(x, z)

  # Iterate until successive estimates are exactly equal.
  while z != n:
    z, n = n, next_z(x, n)
    print(z)

  return z

newtsqrt(11)
3.75
3.341666666666667
3.316718620116376
3.3166247916826186
3.3166247903554
Out[7]:
3.3166247903554

Comparison with the standard library

We can compare the value returned by our square root function to the value calculated by the sqrt function in Python's math standard library module.

In [8]:
import math
math.sqrt(11)
Out[8]:
3.3166247903554

Being careful

Due to the limited precision of floating point numbers, the newtsqrt function can get into an infinite loop: successive estimates may oscillate between two nearby values without ever becoming exactly equal. For instance, calculating the square root of 10 results in an infinite loop.

In [9]:
# Uncommenting will result in infinite loop.
# newtsqrt(10)

To counteract this problem, the condition of the loop is better written as:

abs(z - n) > 0.001
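Here is a sketch of the function with that condition in place (the name newtsqrt_safe and the tolerance value are just choices for this example):

In [ ]:
def newtsqrt_safe(x, tol=0.001):
  # Same iteration as before, but stop when successive estimates are
  # within tol of each other rather than requiring exact equality.
  next_z = lambda x, z: z - (z**2 - x) / (2 * z)
  z = 2.0
  n = next_z(x, z)
  while abs(z - n) > tol:
    z, n = n, next_z(x, n)
  return z

newtsqrt_safe(10)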

Gradient descent for square roots

Newton's method for square roots is efficient, but we can also use gradient descent to approximate the square root of a positive real number $x$. Here, we use the following cost function.

$$ Cost(z \mid x) = (x - z^2)^2 $$

Example value

Let's use it to calculate the square root of 20, i.e. $x = 20$. Then the cost function is:

$$ Cost(z \mid x=20) = (20 - z^2)^2 $$

Our goal is to find the $z$ that minimises this.

Plotting the cost function

Let's plot the cost function. Given that we know the best $z$ will be between $4$ and $5$, we'll let $z$ range from $0$ to $10$.

In [10]:
i = np.linspace(0.0, 10.0, 1000)
j = (20.0 - i**2)**2

pl.plot(i, j, 'k-', label='$(20-z^2)^2$')
pl.legend()
pl.show()

The derivative

Looks like there's a low point at about $4.5$. Let's take the derivative of the cost function with respect to $z$.

$$ \begin{align} Cost(z) &= ( 20.0 - z^2 )^2 \\ \Rightarrow \frac{\partial Cost}{\partial z} &= 2(20.0 - z^2)(-2z) \\ &= 4z(z^2 - 20) \\ &= 4z^3 - 80z \end{align} $$

The derivative tells us the slope of the tangent to the cost function at any point. What does that mean? It means that if we pick a value of $z$, e.g. $8.0$, the derivative tells us that the line through the point $(8.0, (20.0 - (8.0)^2)^2)$ with slope $4(8.0)^3 - 80(8.0)$ just touches (is tangent to) the graph above. Let's plot that.

When you simplify, the point $(8.0, (20.0 - (8.0)^2)^2)$ becomes $(8, 1936)$, and the slope $4(8.0)^3 - 80(8.0)$ becomes $1408$. So the claim is that the line with slope $1408$ going through the point $(8, 1936)$ is tangent to the graph. To calculate the equation of the line, we'll use $(y - y_1) = m(x - x_1)$:

$$ y - 1936 = 1408(x - 8) \\ \Rightarrow y = 1408x - 11264 + 1936 \\ \Rightarrow y = 1408x - 9328 $$
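As a quick check of that arithmetic (a small sketch, not part of the derivation):

In [ ]:
z = 8.0
# The point on the cost curve and the slope of the tangent at z = 8.
print("Point: (%.0f, %.0f)" % (z, (20.0 - z**2)**2))  # (8, 1936)
print("Slope: %.0f" % (4 * z**3 - 80 * z))            # 1408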

Let's plot that line and the cost function together.

In [11]:
i = np.linspace(0.0, 10.0, 1000)
j = (20.0 - i**2)**2
k = 1408 * i - 9328

pl.plot(i, j, 'k-', label='$(20-z^2)^2$')
pl.plot(i, k, 'b-', label='$1408z - 9328$')
pl.legend()
pl.show()

Why do we care about the slope?

It's a bit hard to see, but the blue line is tangent to the curve at $z = 8$. We care about this because the slope of the tangent tells us which way to change $z$ to reduce the cost: at $z = 8$ the slope is positive, so increasing $z$ increases the cost and decreasing $z$ decreases it.

Gradient descent

Let's use the gradient descent algorithm to calculate the best $z$. We'll start with the guess $z=8$, and then use the derivative to move $z$ ever so slightly in the direction that decreases the cost. By ever so slightly, we mean $0.001$ times the slope:

$$ \begin{align} z_{i+1} &= z_i - \eta \frac{\partial Cost}{\partial z} \\ &= z_i - (0.001) (4 z_i^3 - 80 z_i) \end{align} $$

So, for our initial guess $z_0 = 8.0$ we get:

$$ \begin{align} z_1 &= 8.0 - (0.001) (4 (8.0)^3 - 80 (8.0)) \\ &= 8.0 - 1.408 = 6.592 \end{align} $$

Let's code it up.

In [12]:
def next_z(z, x, eta=0.001):
  # One gradient descent step. Note that 4z^3 - 80z is the derivative of
  # the cost for x = 20 specifically, since 80 = 4x.
  return z - eta * (4.0 * z**3 - 80 * z)

def sqrt_grad_desc(x, z, verbose=False):
  # Keep stepping until z changes by less than 0.001.
  while abs(z - next_z(z, x)) > 0.001:
    if verbose:
      print("Current: %14.8f\tNext: %14.8f" % (z, next_z(z, x)))
    z = next_z(z, x)
  return z

ans = sqrt_grad_desc(20.0, 8.0, True)
print("Square root:", ans, "\tSquared:", ans**2)
Current:     8.00000000	Next:     6.59200000
Current:     6.59200000	Next:     5.97355269
Current:     5.97355269	Next:     5.59881186
Current:     5.59881186	Next:     5.34469983
Current:     5.34469983	Next:     5.16157297
Current:     5.16157297	Next:     5.02444369
Current:     5.02444369	Next:     4.91903017
Current:     4.91903017	Next:     4.83645229
Current:     4.83645229	Next:     4.77084541
Current:     4.77084541	Next:     4.71815685
Current:     4.71815685	Next:     4.67548576
Current:     4.67548576	Next:     4.64069702
Current:     4.64069702	Next:     4.61218330
Current:     4.61218330	Next:     4.58871218
Current:     4.58871218	Next:     4.56932433
Current:     4.56932433	Next:     4.55326362
Current:     4.55326362	Next:     4.53992784
Current:     4.53992784	Next:     4.52883326
Current:     4.52883326	Next:     4.51958845
Current:     4.51958845	Next:     4.51187478
Current:     4.51187478	Next:     4.50543157
Current:     4.50543157	Next:     4.50004463
Current:     4.50004463	Next:     4.49553736
Current:     4.49553736	Next:     4.49176369
Current:     4.49176369	Next:     4.48860255
Current:     4.48860255	Next:     4.48595333
Current:     4.48595333	Next:     4.48373229
Current:     4.48373229	Next:     4.48186965
Current:     4.48186965	Next:     4.48030717
Current:     4.48030717	Next:     4.47899619
Current:     4.47899619	Next:     4.47789603
Square root: 4.477896027970839 	Squared: 20.05155283731702

A question

Let's try some other initial guesses: 4.0, 1.0 and -1.0. Can you explain the square root returned with -1.0?

In [13]:
print("With initial guess %6.2f: %10.6f" % (4.0, sqrt_grad_desc(20.0, 4.0, False)))
print("With initial guess %6.2f: %10.6f" % (1.0, sqrt_grad_desc(20.0, 1.0, False)))
print("With initial guess %6.2f: %10.6f" % (-1.0, sqrt_grad_desc(20.0, -1.0, False)))
With initial guess   4.00:   4.465951
With initial guess   1.00:   4.465963
With initial guess  -1.00:  -4.465963

End