In [ ]:
%matplotlib inline
from fastai import *

In this part of the lecture we explain Stochastic Gradient Descent (SGD), an optimization method commonly used to train neural networks. We illustrate the concepts with concrete examples.

Linear Regression problem

The goal of linear regression is to fit a line to a set of points.

In [ ]:
n=100
In [ ]:
x = torch.ones(n,2) 
x[:,0].uniform_(-1.,1)
x[:5]
Out[ ]:
tensor([[-0.4801,  1.0000],
        [-0.0147,  1.0000],
        [-0.4377,  1.0000],
        [-0.1611,  1.0000],
        [-0.6662,  1.0000]])
In [ ]:
a = tensor(3.,2); a
Out[ ]:
tensor([3., 2.])
In [ ]:
y = x@a + torch.rand(n)
In [ ]:
plt.scatter(x[:,0], y);

You want to find parameters (weights) a that minimize the error between the points and the line x@a. Note that here a is unknown. For a regression problem, the most common error function, or loss function, is the mean squared error.

In [ ]:
def mse(y_hat, y): return ((y_hat-y)**2).mean()
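
As a quick check on what mse computes, here is a minimal sketch in plain Python on three made-up points. The name mse_plain and the numbers are illustrative assumptions, not part of the notebook; the function mirrors the tensor version above using ordinary lists.

```python
def mse_plain(y_hat, y):
    """Mean squared error over two equal-length lists of floats."""
    return sum((p - t) ** 2 for p, t in zip(y_hat, y)) / len(y)

targets = [1.0, 2.0, 3.0]   # made-up true values
preds = [1.0, 2.0, 5.0]     # made-up predictions; only the last one is off, by 2

# The squared errors are 0, 0 and 4, so the mean is 4/3.
example_loss = mse_plain(preds, targets)
print(example_loss)
```

Because the errors are squared, one prediction that is off by 2 costs four times as much as one that is off by 1.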

Suppose we believe a = (-1.0, 1.0). Then we can compute y_hat, which is our prediction, and from it our error.

In [ ]:
a = tensor(-1.,1)
In [ ]:
y_hat = x@a
mse(y_hat, y)
Out[ ]:
tensor(8.9578)
In [ ]:
plt.scatter(x[:,0],y)
plt.scatter(x[:,0],y_hat);

So far we have specified the model (linear regression) and the evaluation criterion (the loss function). Now we need to handle optimization; that is, how do we find the best values for a? How do we find the best-fitting line?

Gradient Descent

We would like to find the values of a that minimize the mse loss.

Gradient descent is an algorithm that minimizes functions. Given a function defined by a set of parameters, gradient descent starts with an initial set of parameter values and iteratively moves toward a set of parameter values that minimize the function. This iterative minimization is achieved by taking steps in the negative direction of the function gradient.
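
To make the step rule concrete, the sketch below runs gradient descent on a one-parameter function, f(w) = (w - 3)^2, whose minimum is known to be at w = 3. The function, the starting point and the names f, grad_f and gradient_descent are assumptions chosen for illustration; plain Python is used so that each step is visible.

```python
def f(w):
    """Toy function to minimize; its minimum is at w = 3."""
    return (w - 3.0) ** 2

def grad_f(w):
    """Derivative of f with respect to w."""
    return 2.0 * (w - 3.0)

def gradient_descent(w0, lr, steps):
    """Repeatedly step against the gradient, scaled by the learning rate."""
    w = w0
    for _ in range(steps):
        w = w - lr * grad_f(w)
    return w

w_final = gradient_descent(w0=-1.0, lr=0.1, steps=100)
print(w_final, f(w_final))
```

With lr = 0.1 each step shrinks the distance to the minimum by a factor of 0.8, so 100 steps bring w very close to 3. A learning rate above 1.0 would make the iterates diverge on this particular function.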

Here is gradient descent implemented in PyTorch.

In [ ]:
a = nn.Parameter(a); a
Out[ ]:
Parameter containing:
tensor([-1.,  1.], requires_grad=True)
In [ ]:
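def update():
    y_hat = x@a                  # predictions with the current parameters
    loss = mse(y_hat, y)
    if t % 10 == 0: print(loss)  # t is the global loop counter
    loss.backward()              # fills a.grad with d(loss)/d(a)
    with torch.no_grad():        # the update itself must not be tracked by autograd
        a.sub_(lr * a.grad)      # step against the gradient, scaled by lr
        a.grad.zero_()           # reset so the next backward() does not accumulate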
In [ ]:
lr = 1e-1
for t in range(100): update()
tensor(8.9578, grad_fn=<MeanBackward1>)
tensor(1.3518, grad_fn=<MeanBackward1>)
tensor(0.3433, grad_fn=<MeanBackward1>)
tensor(0.1344, grad_fn=<MeanBackward1>)
tensor(0.0898, grad_fn=<MeanBackward1>)
tensor(0.0802, grad_fn=<MeanBackward1>)
tensor(0.0781, grad_fn=<MeanBackward1>)
tensor(0.0777, grad_fn=<MeanBackward1>)
tensor(0.0776, grad_fn=<MeanBackward1>)
tensor(0.0776, grad_fn=<MeanBackward1>)
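
loss.backward() computes the gradient automatically. For this particular model the gradient can also be written by hand: the derivative of the mean squared error with respect to a[j] is (2/n) * sum over i of (x_i@a - y_i) * x_i[j]. The sketch below applies the same update rule with that hand-written gradient in plain Python. It generates its own data in the same way as above, with a fixed seed, so its numbers will not match the outputs above exactly; fit_by_hand and its arguments are hypothetical names.

```python
import random

def fit_by_hand(n_pts=100, lr=0.1, steps=100, seed=0):
    """Full-batch gradient descent for y ~ slope*x + bias with a manual gradient."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n_pts)]
    ys = [3.0 * x + 2.0 + rng.random() for x in xs]   # same recipe as y = x@a + noise
    slope, bias = -1.0, 1.0                           # same starting guess as above
    losses = []
    for _ in range(steps):
        errs = [slope * x + bias - y for x, y in zip(xs, ys)]
        losses.append(sum(e * e for e in errs) / n_pts)
        # Gradient of the mean squared error with respect to each parameter.
        g_slope = 2.0 / n_pts * sum(e * x for e, x in zip(errs, xs))
        g_bias = 2.0 / n_pts * sum(errs)
        slope -= lr * g_slope
        bias -= lr * g_bias
    return (slope, bias), losses

a_hand, losses_hand = fit_by_hand()
print(a_hand, losses_hand[0], losses_hand[-1])
```

The noise is uniform on [0, 1), so it has mean 0.5 and variance 1/12. The fitted intercept should therefore settle near 2.5 rather than 2, and the loss cannot drop much below about 0.083, which is consistent with the plateau in the output above.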
In [ ]:
plt.scatter(x[:,0],y)
plt.scatter(x[:,0],x@a);

Animate it!

In [ ]:
from matplotlib import animation, rc
rc('animation', html='html5')

You'll need to uncomment the following to install the necessary plugin the first time you run this:

In [ ]:
#! sudo add-apt-repository ppa:mc3man/trusty-media  
#! sudo apt-get update  
#! sudo apt-get install ffmpeg  
#! sudo apt-get install frei0r-plugins 
In [ ]:
a = nn.Parameter(tensor(-1.,1))

fig = plt.figure()
plt.scatter(x[:,0], y, c='orange')
line, = plt.plot(x[:,0], x@a)
plt.close()

def animate(i):
    update()
    line.set_ydata(x@a)
    return line,

animation.FuncAnimation(fig, animate, np.arange(0, 100), interval=20)
Out[ ]:
[Animation: the line x@a moves toward the scattered points over 100 update steps.]

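In practice, we don't compute the loss and gradient on the whole dataset at once. Instead we use mini-batches: small random subsets of the data, with one parameter update per mini-batch.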
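
The mini-batch idea can be sketched as follows, again in plain Python with a hand-written gradient. This is an illustration under stated assumptions, not the notebook's method: sgd_fit, the batch size of 10, the 20 epochs and the synthetic data are all hypothetical choices.

```python
import random

def sgd_fit(xs, ys, lr=0.1, bs=10, epochs=20, seed=0):
    """Mini-batch SGD for y ~ slope*x + bias; returns (slope, bias)."""
    rng = random.Random(seed)
    slope, bias = -1.0, 1.0
    idx = list(range(len(xs)))
    for _ in range(epochs):                      # one epoch = one pass over the data
        rng.shuffle(idx)                         # new random order every epoch
        for start in range(0, len(idx), bs):
            batch = idx[start:start + bs]        # indices of one mini-batch
            errs = [slope * xs[i] + bias - ys[i] for i in batch]
            # Gradient of the mean squared error on this mini-batch only.
            g_slope = 2.0 / len(batch) * sum(e * xs[i] for e, i in zip(errs, batch))
            g_bias = 2.0 / len(batch) * sum(errs)
            slope -= lr * g_slope
            bias -= lr * g_bias
    return slope, bias

data_rng = random.Random(1)
xs_demo = [data_rng.uniform(-1.0, 1.0) for _ in range(100)]
ys_demo = [3.0 * x + 2.0 + data_rng.random() for x in xs_demo]
slope_sgd, bias_sgd = sgd_fit(xs_demo, ys_demo)
print(slope_sgd, bias_sgd)
```

Each update sees only 10 points, so the gradient is a noisy estimate of the full gradient. In exchange, the parameters are updated 10 times per pass over the 100 points instead of once.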

Vocab

  • Learning rate
  • Epoch
  • Minibatch
  • SGD
  • Model / Architecture
  • Parameters
  • Loss function

For classification problems, we use cross-entropy loss, also known as negative log likelihood loss. It penalizes predictions that are confident but incorrect most heavily, and it also penalizes predictions that are correct but unconfident.
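
A small worked example shows how the penalty behaves. For a single example, cross entropy is the negative log of the probability the model assigned to the correct class. The function name cross_entropy and the probabilities below are made up for illustration; this is a sketch of the formula, not a library API.

```python
import math

def cross_entropy(probs, target):
    """Negative log of the probability assigned to the correct class."""
    return -math.log(probs[target])

# Two-class problem where the correct class is index 0 in every case.
confident_right = cross_entropy([0.9, 0.1], target=0)
unsure_right = cross_entropy([0.6, 0.4], target=0)
confident_wrong = cross_entropy([0.1, 0.9], target=0)
print(confident_right, unsure_right, confident_wrong)
```

The three losses come out at roughly 0.105, 0.511 and 2.303. A correct but hesitant prediction is penalized more than a correct confident one, and a confident wrong prediction is penalized most of all.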
