In a previous notebook, we solved the problem of simple linear regression - finding a straight line that best fits a data set with two variables. In that case, we were able to find the exact solution. In this notebook, we'll use a common technique to approximate that solution.

Why would we want to approximate a solution when we can easily find an exact solution? We don't - it's just that the technique we discuss here can also be used in situations where we can't find an exact solution or don't want to try, for whatever reason. The approximation technique is called gradient descent.

In [1]:

```
# numpy efficiently deals with numerical multi-dimensional arrays.
import numpy as np
# matplotlib is a plotting library, and pyplot is its easy-to-use module.
import matplotlib.pyplot as pl
# This just sets the default plot size to be bigger.
pl.rcParams['figure.figsize'] = (16.0, 8.0)
```

In simple linear regression, we have some data points $(x_i, y_i)$, and we decide that they belong to a straight line with a little bit of error involved.
Straight lines in two dimensions are of the form $y = mx + c$, and to fit a line to our data points we must find appropriate values for $m$ and $c$.
Numpy has a function called `polyfit` that finds such values for us.

In [2]:

```
w = np.arange(1.0, 16.0, 1.0)
d = 5.0 * w + 10.0 + np.random.normal(0.0, 5.0, w.size)
m, c = np.polyfit(w, d, 1)
print("Best fit is m = %f and c = %f" % (m, c))
# Plot the best fit line.
pl.plot(w, d, 'k.', label='Original data')
pl.plot(w, m * w + c, 'b-', label='Best fit: $%0.1f x + %0.1f$' % (m,c))
pl.legend()
pl.show()
```

In gradient descent, we select a random guess of a parameter and iteratively improve that guess.
For instance, we might pick $1.0$ as our initial guess for $m$ and then create a `for` loop to iteratively improve the value of $m$.
The way we improve $m$ is to first take the partial derivative of our cost function with respect to $m$.

Recall that our cost function for simple linear regression is:

$$ Cost(m, c) = \sum_i (y_i - mx_i - c)^2 $$

We calculate the partial derivative of $Cost$ with respect to $m$ while treating $c$ as a constant. Note that the $x_i$ and $y_i$ values are all just constants. We'll also calculate the partial derivative with respect to $c$ here.

$$ \begin{align} Cost(m, c) &= \sum_i (y_i - mx_i - c)^2 \\[1cm] \frac{\partial Cost}{\partial m} &= \sum_i 2(y_i - m x_i - c)(-x_i) \\ &= -2 \sum_i x_i (y_i - m x_i - c) \\[0.5cm] \frac{\partial Cost}{\partial c} &= \sum_i 2(y_i - m x_i - c)(-1) \\ &= -2 \sum_i (y_i - m x_i - c) \\ \end{align} $$

Once we've calculated the partial derivatives, we'll code them up in Python. Here we create two functions, each taking four parameters. The first two parameters are arrays with our $x_i$ and $y_i$ data set values. The second two are our current guesses for $m$ and $c$.

In [3]:

```
def grad_m(x, y, m, c):
    return -2.0 * np.sum(x * (y - m * x - c))
```

In [4]:

```
def grad_c(x, y, m, c):
    return -2.0 * np.sum(y - m * x - c)
```
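Before using these gradients in the descent loop, it can be worth sanity-checking them numerically. The following sketch (an aside, not part of the original notebook) compares each analytic gradient against a central finite-difference approximation of the cost function on a noise-free version of the data; since the cost is quadratic in $m$ and $c$, the two should agree almost exactly.

```python
# Sanity check: compare the analytic gradients to central finite differences.
import numpy as np

def grad_m(x, y, m, c):
    return -2.0 * np.sum(x * (y - m * x - c))

def grad_c(x, y, m, c):
    return -2.0 * np.sum(y - m * x - c)

def cost(x, y, m, c):
    return np.sum((y - m * x - c)**2)

# Noise-free version of the notebook's data: true m = 5, true c = 10.
x = np.arange(1.0, 16.0, 1.0)
y = 5.0 * x + 10.0

m, c, h = 1.0, 1.0, 1e-6
# Central difference: (Cost(m + h) - Cost(m - h)) / 2h approximates dCost/dm.
fd_m = (cost(x, y, m + h, c) - cost(x, y, m - h, c)) / (2.0 * h)
fd_c = (cost(x, y, m, c + h) - cost(x, y, m, c - h)) / (2.0 * h)

print(fd_m - grad_m(x, y, m, c))  # both differences should be tiny
print(fd_c - grad_c(x, y, m, c))
```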

Now we can run our gradient descent algorithm. For $m$, we keep replacing its value with $m - \eta grad\_m(x, y, m, c)$ until it doesn't change. For $c$, we keep replacing its value with $c - \eta grad\_c(x, y, m, c)$ until it doesn't change.

What is $\eta$? It is called the learning rate, and we set it to a small value relative to the scale of the data points.
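To see why $\eta$ must be small, here is a quick aside (not part of the original notebook): the same update rule for $m$ is run with a sensible $\eta$ and with one a hundred times larger, holding $c$ fixed at its true value for simplicity. The large $\eta$ makes each step overshoot the minimum, so the guesses diverge.

```python
# Aside: too large a learning rate makes gradient descent diverge.
import numpy as np

def grad_m(x, y, m, c):
    return -2.0 * np.sum(x * (y - m * x - c))

# Noise-free version of the notebook's data: true m = 5, true c = 10.
x = np.arange(1.0, 16.0, 1.0)
y = 5.0 * x + 10.0

def run(eta, steps=50):
    m = 1.0
    for _ in range(steps):
        m = m - eta * grad_m(x, y, m, 10.0)  # c held at its true value
    return m

m_small = run(0.0001)  # converges towards the true slope 5
m_large = run(0.01)    # overshoots further on every step
print(m_small, m_large)
```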

You can see that, on each iteration, $m$ and $c$ get closer to their true values.

In [5]:

```
eta = 0.0001
m, c = 1.0, 1.0
delta = 0.0000001
mold, cold = m - 1.0, c - 1.0
i = 0
# Loop until neither m nor c changes by more than delta.
while abs(mold - m) > delta or abs(cold - c) > delta:
    mold, cold = m, c
    m = mold - eta * grad_m(w, d, mold, cold)
    c = cold - eta * grad_c(w, d, mold, cold)
    i = i + 1
    if i % 1000 == 0:
        print("m: %20.16f c: %20.16f" % (m, c))
```
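As a quick check on the algorithm (an aside, not part of the original notebook), we can run gradient descent on noise-free data, where the exact answer is known: it should recover $m = 5$ and $c = 10$.

```python
import numpy as np

def grad_m(x, y, m, c):
    return -2.0 * np.sum(x * (y - m * x - c))

def grad_c(x, y, m, c):
    return -2.0 * np.sum(y - m * x - c)

# Noise-free data, so the true parameters are exactly m = 5 and c = 10.
w = np.arange(1.0, 16.0, 1.0)
d = 5.0 * w + 10.0

eta, delta = 0.0001, 0.0000000001
m, c = 1.0, 1.0
mold, cold = m - 1.0, c - 1.0
# Descend until neither parameter changes by more than delta.
while abs(mold - m) > delta or abs(cold - c) > delta:
    mold, cold = m, c
    m = mold - eta * grad_m(w, d, mold, cold)
    c = cold - eta * grad_c(w, d, mold, cold)

print(m, c)  # very close to 5 and 10
```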

Newton's method for square roots approximates the square root of a number $x$. We begin with an initial guess $z_0$ of the square root - it doesn't have to be particularly good. We then apply the following calculation repeatedly, to calculate $z_1$, then $z_2$, and so on:

$$ z_{i+1} = z_i - \frac{z_i^2 - x}{2z_i} $$

We can create a function that calculates the next value of $z$ based on the current value of $z$ as follows.

In [6]:

```
def next_z(x, z):
    return z - (z**2 - x) / (2 * z)
```

Suppose we want to calculate the square root of $x$.
We start with a random guess for the square root, $z_0$.
We then apply the `next_z` function repeatedly until the value of $z$ stops changing.
Let's create a function to do this.
We'll include the `next_z` function inside the `newtsqrt` function to make it one all-inclusive package.

In [7]:

```
def newtsqrt(x):
    next_z = lambda x, z: z - (z**2 - x) / (2 * z)
    z = 2.0
    n = next_z(x, z)
    while z != n:
        z, n = n, next_z(x, n)
        print(z)
    return z

newtsqrt(11)
```

Out[7]:

We can compare our square root method's return value to the value calculated by Python's `math` standard library package. It has a `sqrt` function.

In [8]:

```
import math
math.sqrt(11)
```

Out[8]:

Due to the complexities of floating point numbers, the `newtsqrt` function could get into an infinite loop.
For instance, calculating the square root of 10 gives an infinite loop: the successive estimates end up alternating between two nearby floating point values that are never exactly equal.

In [9]:

```
# Uncommenting will result in infinite loop.
# newtsqrt(10)
```

To counteract this problem, the condition of the loop is better written as:

```
abs(z - n) > 0.001
```
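A sketch of `newtsqrt` rewritten with that condition (using the `0.001` tolerance from above; any small positive tolerance would do):

```python
def newtsqrt(x, tol=0.001):
    # One step of Newton's method for z**2 - x = 0.
    next_z = lambda x, z: z - (z**2 - x) / (2 * z)
    z = 2.0
    n = next_z(x, z)
    # Stop once successive estimates are within tol of each other,
    # avoiding the exact-equality test that can loop forever.
    while abs(z - n) > tol:
        z, n = n, next_z(x, n)
    return n

print(newtsqrt(10))  # approximately 3.1623
```

This version terminates for inputs like 10 that make the exact-equality version loop forever.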

Newton's method for square roots is efficient, but we can also use gradient descent to approximate the square root of a real number $x$. Here, we use the following cost function.

$$ Cost(z \mid x) = (x - z^2)^2 $$

Let's use it to calculate the square root of 20, i.e. $x = 20$. Then the cost function is:

$$ Cost(z \mid x=20) = (20 - z^2)^2 $$

Our goal is to find the $z$ that minimises this.

Let's plot the cost function. Given that we know the best $z$ will be between $4$ and $5$, we'll let $z$ range over 0 to 10.

In [10]:

```
i = np.linspace(0.0, 10.0, 1000)
j = (20.0 - i**2)**2
pl.plot(i, j, 'k-', label='$(20-z^2)^2$')
pl.legend()
pl.show()
```

Looks like there's a low point at about $4.5$. Let's take the derivative of the cost function with respect to $z$.

$$ \begin{align} Cost(z) &= ( 20.0 - z^2 )^2 \\ \Rightarrow \frac{\partial Cost}{\partial z} &= 2(20.0 - z^2)(-2z) \\ &= 4z(z^2 - 20) \\ &= 4z^3 - 80z \end{align} $$

The derivative tells us the slope of the tangent to the cost function at any point. What does that mean? It means that if we pick a value of $z$, e.g. $8.0$, the derivative tells us that a line going through the point $(8.0, (20.0 - (8.0)^2)^2)$ with the slope $4(8.0)^3 - 80(8.0)$ perfectly touches the graph above. Let's plot that.

When you simplify, the point $(8.0, (20.0 - (8.0)^2)^2)$ becomes $(8, 1936)$. The slope $4(8.0)^3 - 80(8.0)$ simplifies to $1408$. So, the claim is that the line with slope $1408$ going through the point $(8, 1936)$ touches the graph. To calculate the equation of the line, we'll use $(y - y_1) = m(x - x_1)$:

$$ y - 1936 = 1408(x - 8) \\ \Rightarrow y = 1408x - 11264 + 1936 \\ \Rightarrow y = 1408x - 9328 $$

Let's plot that line and the cost function together.

In [11]:

```
i = np.linspace(0.0, 10.0, 1000)
j = (20.0 - i**2)**2
k = 1408 * i - 9328
pl.plot(i, j, 'k-', label='$(20-z^2)^2$')
pl.plot(i, k, 'b-', label='$1408z - 9328$')
pl.legend()
pl.show()
```

It's a bit hard to see, but the blue line is tangent to the curve at $z = 8$. We care about this because the slope of the tangent tells us which way to change $z$ to reduce the cost: here the slope is positive, so increasing $z$ increases the cost and decreasing $z$ decreases it.

Let's use the gradient descent algorithm to calculate the best $z$. We'll start with the guess $z=8$, and then use the derivative to move $z$ ever so slightly in the direction that decreases the cost. By ever so slightly, we mean $0.001$ times the slope:

$$ \begin{align} z_{i+1} &= z_i - \eta \frac{\partial Cost}{\partial z} \\ &= z_i - (0.001) (4 z_i^3 - 80 z_i) \end{align} $$

So, for our initial guess $z_0 = 8.0$ we get:

$$ \begin{align} z_1 &= 8.0 - (0.001) (4 (8.0)^3 - 80 (8.0)) \\ &= 8.0 - 1.408 = 6.592 \end{align} $$

Let's code it up.

In [12]:

```
def next_z(z, x, eta=0.001):
    # Note: the derivative 4z**3 - 80z is specific to x = 20.
    return z - eta * (4.0 * z**3 - 80 * z)

def sqrt_grad_desc(x, z, verbose=False):
    while abs(z - next_z(z, x)) > 0.001:
        if verbose:
            print("Current: %14.8f\tNext: %14.8f" % (z, next_z(z, x)))
        z = next_z(z, x)
    return z

ans = sqrt_grad_desc(20.0, 8.0, True)
print("Square root:", ans, "\tSquared:", ans**2)
```

Let's try some other initial guesses: 4.0, 1.0 and -1.0. Can you explain the square root returned for the initial guess of -1.0?

In [13]:

```
print("With initial guess %6.2f: %10.6f" % (4.0, sqrt_grad_desc(20.0, 4.0, False)))
print("With initial guess %6.2f: %10.6f" % (1.0, sqrt_grad_desc(20.0, 1.0, False)))
print("With initial guess %6.2f: %10.6f" % (-1.0, sqrt_grad_desc(20.0, -1.0, False)))
```
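As a hint (an aside, not part of the original notebook): the cost $(20 - z^2)^2$ is an even function of $z$, so gradient descent started from a negative guess converges to the negative square root of $20$. A minimal sketch:

```python
# Sketch: the cost (20 - z**2)**2 is symmetric in z, so descent from a
# negative starting point finds the negative root.
def next_z(z, eta=0.001):
    # Derivative of (20 - z**2)**2 with respect to z is 4z**3 - 80z.
    return z - eta * (4.0 * z**3 - 80.0 * z)

def sqrt_grad_desc(z):
    while abs(z - next_z(z)) > 0.001:
        z = next_z(z)
    return z

pos = sqrt_grad_desc(1.0)
neg = sqrt_grad_desc(-1.0)
print(pos, neg)  # the two results are mirror images of each other
```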