Gradient Descent

Objective

In most cases, the objective in ML and DL boils down to one thing: minimizing error.

This error is computed by feeding our model some data, recording its predictions, and passing those predictions to a criterion, which evaluates the performance (or error) of the model so that we can backpropagate and train it. We will call the operation that evaluates performance our Loss function.

[Figure: data is fed to the model, and the model's predictions are passed to the Loss function]
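
To make this pipeline concrete, here is a minimal sketch (assuming NumPy, a made-up linear model, toy data, and mean squared error standing in for the criterion) of feeding data to a model, recording its predictions, and passing them to a Loss function:

```python
# A toy pipeline: data -> model -> predictions -> criterion -> Loss.
import numpy as np

def model(x, w):
    return x @ w                                  # a simple linear model

def criterion(predictions, targets):
    return np.mean((predictions - targets) ** 2)  # mean squared error

x = np.array([[1.0, 2.0], [3.0, 4.0]])            # toy data
y = np.array([5.0, 6.0])                          # toy targets
w = np.zeros(2)                                   # model parameters

loss = criterion(model(x, w), y)
print(loss)                                       # the error we want to minimize
```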

Gradient Descent

There are a multitude of strategies to minimize the Loss function, each with its own unique properties; however, in DL, the minimization is most often performed with some variation of Gradient Descent.

Why?

In essence, Gradient Descent takes a small "step" in the direction of steepest descent (hence its name) on your model's loss surface.

There are three main ingredients needed to take this step:

  1. A differentiable function/model ($\frac{\partial F}{\partial w_i}$)
  2. The partial derivative of the Loss function w.r.t. the model parameters ($\frac{\partial L}{\partial w_i}$)
  3. A learning rate ($\alpha$)

With these ingredients in hand, we formulate a single Gradient Descent update as:

$$ w_i = w_i - \frac{\partial L}{\partial w_i} * \alpha \tag{1} $$

More succinctly, in vector form:

$$ \mathbf{w} = \mathbf{w} - \nabla_{\mathbf{w}} L * \alpha $$

NOTE: Without a differentiable function ($\frac{\partial F}{\partial w_i}$), we would not be able to compute the partial derivative of the Loss function w.r.t. the model parameters ($\frac{\partial L}{\partial w_i}$).

It's a simple formulation but a powerful technique.
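
To see equation (1) in action, here is a minimal sketch on a toy loss $L(w) = (w - 3)^2$, whose gradient we can write down by hand (the starting point, learning rate, and step count are arbitrary choices for illustration):

```python
# Gradient Descent on the toy loss L(w) = (w - 3)^2, with dL/dw = 2 * (w - 3).
w = 0.0        # initial parameter (arbitrary)
alpha = 0.1    # learning rate

for step in range(100):
    grad = 2 * (w - 3)      # dL/dw
    w = w - grad * alpha    # equation (1)

print(w)  # approaches 3, the minimizer of L
```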

Now, Gradient Descent is often referred to as "stochastic" because the large amounts of data needed to train Deep Learning models usually cannot all be held in RAM at once. As such, we are instead forced to create batches of (usually) randomly sampled data. Because each batch is a slightly different distribution of the data, our model looks like it is taking random, or stochastic, steps that slowly converge to a minimum.

I say this technique "slowly" converges because Gradient Descent is only effective when it is iterated over thousands of cycles: the gradient is multiplied by a small $\alpha$, which forces each "step" to be tiny (I will expand more on this in the next section).
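
Here is a minimal sketch of that batching (assuming NumPy, made-up data shapes, a linear model, and hypothetical hyperparameters): each update uses only a small, randomly drawn batch of the data, which is what makes the steps stochastic.

```python
# Mini-batch stochastic gradient descent on a toy linear regression problem.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))                  # dataset too large to use all at once (illustrative)
true_w = np.array([1.0, -2.0, 0.5, 3.0, -1.0])
y = X @ true_w + rng.normal(scale=0.1, size=10_000)

w = np.zeros(5)
alpha = 0.01
batch_size = 32

for step in range(2_000):
    idx = rng.integers(0, len(X), size=batch_size)   # draw a random mini-batch
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ w - yb) / batch_size     # dL/dw on this batch only
    w = w - grad * alpha                             # same update as equation (1)

print(w)  # noisy steps, but w slowly converges toward true_w
```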

So, why does it work?

Theory

Let's revisit the definition of the derivative for a bit:

$$ \frac{dy}{dx}=\lim_{h\to0}\frac{f(x+h)-f(x)}{h} \tag{2} $$

Starting from $x$, if we take a small forward step $h$, by how much will $y$ increase or decrease?

Traditionally, this formulation is interpreted as the instantaneous rate of change (the "velocity") of the function at $x$. However, with some algebra we can reformulate the expression so that it can be interpreted as a linear approximation of $f(x+h)$, starting from our initial point $(x, f(x))$ and stepping to an arbitrary $x+h$.

$$ \begin{align*} \frac{dy}{dx} &= \lim_{h\to0}\frac{f(x+h)-f(x)}{h} \tag{2} \\ \frac{dy}{dx}*\lim_{h\to0}h &= f(x+h)-f(x) \\ f(x+h) - f(x) &= \frac{dy}{dx}*\lim_{h\to0}h \\ f(x+h) &= f(x) + \frac{dy}{dx}*\lim_{h\to0}h \\ f(x+h) &\approx f(x) + \frac{dy}{dx} * h \tag{3} \end{align*} $$

NOTE: the limit was dropped in the final step so as to give us the freedom to approximate $f(x+h)$ at any given step $h$; in exchange, the equality becomes an approximation.
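
As a quick numeric check of equation (3), here is a minimal sketch using $f(x) = x^2$ (so $\frac{dy}{dx} = 2x$) and a small step from $x = 4$ (the function, point, and step size are arbitrary choices):

```python
# Linear approximation of f(x + h) via equation (3), with f(x) = x^2 and dy/dx = 2x.
def f(x):
    return x ** 2

x, h = 4.0, 0.01
approx = f(x) + (2 * x) * h    # equation (3)
print(approx, f(x + h))        # 16.08 vs 16.0801 (very close for a small h)
```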

Does this formulation look familiar? It should! It looks very similar to our "step" in Gradient Descent:

$$ \begin{align} w_i &= w_i - \frac{\partial L}{\partial w_i} * \alpha && (\text{Gradient Descent}) \tag{1} \\ f(x+h) &\approx f(x) + \frac{dy}{dx} * h && (\text{Linear Approximation}) \tag{3} \end{align} $$

That is because we are applying the same formulation to a different problem.

For Gradient Descent, instead of taking a step forward along the gradient as we do with our Linear Approximation, we are actually taking a step backward against it (hence the negative sign).

As a result, in Gradient Descent we are approximating a step towards steepest descent, while in our Linear Approximation we are approximating a step towards steepest ascent.
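
To make the sign difference concrete, here is a minimal sketch on the same kind of toy loss used earlier, $L(w) = (w - 3)^2$ (an arbitrary choice): stepping against the gradient lowers the loss, while stepping along it raises the loss.

```python
# Compare one descent step (minus the gradient) with one ascent step (plus the gradient).
def L(w):
    return (w - 3) ** 2

w, alpha = 0.0, 0.1
grad = 2 * (w - 3)            # dL/dw at w = 0 is -6

print(L(w))                   # 9.0   current loss
print(L(w - grad * alpha))    # 5.76  descent step: loss decreases
print(L(w + grad * alpha))    # 12.96 ascent step: loss increases
```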

TIP: To understand why the gradient is the direction of steepest ascent/descent, please refer to this great explanation offered by Khan Academy

Criterion

Now that we are able to form a linear approximation of the steepest descent or ascent of our function, we must next:

  • decide what Loss function we will use to compare the truth against our prediction, and
  • understand the relationship between our Loss function and the learning rate $\alpha$

Let us treat our problem as one of regression and use the Squared Error (SE) to compute the Loss of our prediction based on the truth:

$$ \begin{align} \text{Squared Error} &= \left(f(x) + \frac{dy}{dx}h - f(x+h)\right)^2 && (\text{Steepest Ascent}) \tag{4} \\ \text{Squared Error} &= \left(f(x) - \frac{dy}{dx}h - f(x+h)\right)^2 && (\text{Steepest Descent}) \tag{5} \end{align} $$

The definition of the derivative we saw earlier implies the following:

as $h\to 0$, $\text{Error}\to0$

Conversely, if we are approximating a non-linear function, the following is generally true:

as $h\to\infty$, $\text{Error}\to\infty$
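
Here is a minimal sketch of that behaviour using equation (4) with $f(x) = x^2$ at $x = 4$ (both arbitrary choices): the Squared Error of the linear approximation shrinks toward $0$ as $h \to 0$ and blows up as $h$ grows, because $f$ is non-linear.

```python
# Squared Error of the linear approximation (equation (4)) for increasing step sizes h.
def f(x):
    return x ** 2

x, dydx = 4.0, 2 * 4.0          # f(x) = x^2, so dy/dx = 2x = 8 at x = 4

for h in (0.001, 0.1, 1.0, 10.0):
    se = (f(x) + dydx * h - f(x + h)) ** 2
    print(h, se)                # roughly 1e-12, 1e-04, 1.0, 1e+04: error grows rapidly with h
```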

Let us illustrate this concept with a simple example: approximating $f(x+h)$ for various steps $h$ from $x=4$, with $f(x)=x^2$.
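
The figure displayed below compares $f(x) = x^2$ with its linear approximation around $x = 4$. The construction of fig1 is not shown in this section, so here is a minimal sketch of how it could be built (assuming Plotly; the library choice, plotted range, and trace names are my own assumptions):

```python
# Plot f(x) = x^2 against its linear approximation f(x) + 2x * h around x = 4.
import numpy as np
import plotly.graph_objects as go

def f(x):
    return x ** 2

x0 = 4
h = np.linspace(-4, 4, 200)     # range of steps to visualize

fig1 = go.Figure()
fig1.add_trace(go.Scatter(x=x0 + h, y=f(x0 + h), name="f(x) = x^2"))
fig1.add_trace(go.Scatter(x=x0 + h, y=f(x0) + 2 * x0 * h, name="linear approximation at x = 4"))
fig1.update_layout(xaxis_title="x + h", yaxis_title="y")
```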

In [3]:
fig1.show()