$\newcommand{\xv}{\mathbf{x}} \newcommand{\Xv}{\mathbf{X}} \newcommand{\yv}{\mathbf{y}} \newcommand{\Yv}{\mathbf{Y}} \newcommand{\zv}{\mathbf{z}} \newcommand{\av}{\mathbf{a}} \newcommand{\Wv}{\mathbf{W}} \newcommand{\wv}{\mathbf{w}} \newcommand{\gv}{\mathbf{g}} \newcommand{\Hv}{\mathbf{H}} \newcommand{\dv}{\mathbf{d}} \newcommand{\Vv}{\mathbf{V}} \newcommand{\vv}{\mathbf{v}} \newcommand{\tv}{\mathbf{t}} \newcommand{\Tv}{\mathbf{T}} \newcommand{\zv}{\mathbf{z}} \newcommand{\Zv}{\mathbf{Z}} \newcommand{\muv}{\boldsymbol{\mu}} \newcommand{\sigmav}{\boldsymbol{\sigma}} \newcommand{\phiv}{\boldsymbol{\phi}} \newcommand{\Phiv}{\boldsymbol{\Phi}} \newcommand{\Sigmav}{\boldsymbol{\Sigma}} \newcommand{\Lambdav}{\boldsymbol{\Lambda}} \newcommand{\half}{\frac{1}{2}} \newcommand{\argmax}[1]{\underset{#1}{\operatorname{argmax}}} \newcommand{\argmin}[1]{\underset{#1}{\operatorname{argmin}}} \newcommand{\dimensionbar}[1]{\underset{#1}{\operatorname{|}}} $

06.1 Gradient Descent with Adam

06.1:

  • Modified Adam by replacing the exponentiation expressions in the computation of m_hat and v_hat with running products.
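That modification keeps running products such as beta1t, updated once per epoch, in place of recomputing powers like beta1 ** t every step; a minimal check that the two agree:

```python
beta1 = 0.9
beta1t = 1.0
for t in range(1, 6):
    beta1t *= beta1                          # running product, one multiply per epoch
    assert abs(beta1t - beta1 ** t) < 1e-12  # same value as recomputing the power
```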
In [1]:
import numpy as np
import matplotlib.pyplot as plt

import IPython.display as ipd  # for display and clear_output
import time  # for sleep

So far we have been using SGD or full-gradient steepest descent to find good weights. Sometimes other ways of using error gradients will lead to fewer steps to reach good weights.

Here we implement one of these ways, called Adam, for adaptive moment estimation. In class, we will review the information provided in the following links:
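Before building Adam into the full network training loop, its core update can be sketched as a standalone function. This is a minimal sketch (the name adam_step and the quadratic test problem are ours; the default constants are the commonly used Adam settings):

```python
import numpy as np

def adam_step(w, grad, m, v, t, rho=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Adam update of weight vector w, given its gradient at step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad        # running estimate of the gradient mean
    v = beta2 * v + (1 - beta2) * grad ** 2   # running estimate of the squared gradient
    mhat = m / (1 - beta1 ** t)               # correct the bias from zero initialization
    vhat = v / (1 - beta2 ** t)
    return w - rho * mhat / (np.sqrt(vhat) + epsilon), m, v

# Minimize f(w) = sum(w**2), whose gradient is 2 * w
w = np.array([5.0, -3.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
for t in range(1, 5001):
    w, m, v = adam_step(w, 2 * w, m, v, t, rho=0.01)
```

Notice that each weight's step size is scaled by its own gradient history through vhat, which is what makes Adam "adaptive".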

But first, to use general-purpose gradient descent algorithms, we should collect all of the weights in a neural network into one vector. This way we can use the gradient descent algorithms for any optimization problem for which we can arrange the parameters in a vector.

Say we have assigned weight matrices, $V$ and $W$, for our hidden and output layer weights, as we have before. To collect all of these weights into a single vector, we can use the following function.

In [2]:
def pack(V, W):
    return np.hstack((V.flat, W.flat))

Let's test this before going any further.

In [3]:
V = np.arange(10).reshape(5, 2).astype(np.float64)
W = 20 + np.arange(18).reshape(3, 6).astype(np.float64)
V.shape, W.shape
Out[3]:
((5, 2), (3, 6))
In [4]:
print(V)
print(W)
[[0. 1.]
 [2. 3.]
 [4. 5.]
 [6. 7.]
 [8. 9.]]
[[20. 21. 22. 23. 24. 25.]
 [26. 27. 28. 29. 30. 31.]
 [32. 33. 34. 35. 36. 37.]]
In [5]:
V
Out[5]:
array([[0., 1.],
       [2., 3.],
       [4., 5.],
       [6., 7.],
       [8., 9.]])
In [6]:
pack(V, W)
Out[6]:
array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 20., 21., 22.,
       23., 24., 25., 26., 27., 28., 29., 30., 31., 32., 33., 34., 35.,
       36., 37.])

Imagine we have sent this vector of weights off to an optimization function (which uses gradient descent) and it returns new weight values. Now we need to unpack this vector back into the $V$ and $W$ matrices. Here is a function to do that.

In [7]:
def unpack(w, Vshape, Wshape):
    Vrows, Vcols = Vshape
    Wrows, Wcols = Wshape
    V = w[:Vrows * Vcols].reshape(Vshape)
    W = w[Vrows * Vcols:].reshape(Wshape)
    return V, W
In [8]:
w = pack(V, W)
w
Out[8]:
array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 20., 21., 22.,
       23., 24., 25., 26., 27., 28., 29., 30., 31., 32., 33., 34., 35.,
       36., 37.])
In [9]:
V, W = unpack(w, V.shape, W.shape)
V
Out[9]:
array([[0., 1.],
       [2., 3.],
       [4., 5.],
       [6., 7.],
       [8., 9.]])
In [10]:
w[0]=2000
V
Out[10]:
array([[2.e+03, 1.e+00],
       [2.e+00, 3.e+00],
       [4.e+00, 5.e+00],
       [6.e+00, 7.e+00],
       [8.e+00, 9.e+00]])
In [11]:
w
Out[11]:
array([2.0e+03, 1.0e+00, 2.0e+00, 3.0e+00, 4.0e+00, 5.0e+00, 6.0e+00,
       7.0e+00, 8.0e+00, 9.0e+00, 2.0e+01, 2.1e+01, 2.2e+01, 2.3e+01,
       2.4e+01, 2.5e+01, 2.6e+01, 2.7e+01, 2.8e+01, 2.9e+01, 3.0e+01,
       3.1e+01, 3.2e+01, 3.3e+01, 3.4e+01, 3.5e+01, 3.6e+01, 3.7e+01])

We must update w in place for the changes to appear in $V$ and $W$, because unpack returns views into w's memory. An assignment like w = w * 0.1 creates a new array and rebinds the name w, leaving the original memory, and therefore $V$ and $W$, unchanged.
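The sharing can be checked directly with np.shares_memory; a small sketch mirroring the situation above, with w a flat array and V a view into it:

```python
import numpy as np

w = np.arange(6, dtype=np.float64)
V = w[:4].reshape(2, 2)          # a view, as unpack produces

w *= 0.1                         # in-place: V sees the change
print(np.shares_memory(w, V))    # True

w2 = w * 0.1                     # new array: no shared memory
print(np.shares_memory(w2, V))   # False
```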

In [12]:
w *= 0.1  # update w in place without creating new version of w.   w = w * 0.1 does create new version
In [13]:
w
Out[13]:
array([2.0e+02, 1.0e-01, 2.0e-01, 3.0e-01, 4.0e-01, 5.0e-01, 6.0e-01,
       7.0e-01, 8.0e-01, 9.0e-01, 2.0e+00, 2.1e+00, 2.2e+00, 2.3e+00,
       2.4e+00, 2.5e+00, 2.6e+00, 2.7e+00, 2.8e+00, 2.9e+00, 3.0e+00,
       3.1e+00, 3.2e+00, 3.3e+00, 3.4e+00, 3.5e+00, 3.6e+00, 3.7e+00])
In [14]:
V, W
Out[14]:
(array([[2.e+02, 1.e-01],
        [2.e-01, 3.e-01],
        [4.e-01, 5.e-01],
        [6.e-01, 7.e-01],
        [8.e-01, 9.e-01]]),
 array([[2. , 2.1, 2.2, 2.3, 2.4, 2.5],
        [2.6, 2.7, 2.8, 2.9, 3. , 3.1],
        [3.2, 3.3, 3.4, 3.5, 3.6, 3.7]]))

Slice assignment with w[:] is another way to update the values in an array without allocating new memory.

In [15]:
w[:] = w * 100
w
Out[15]:
array([2.0e+04, 1.0e+01, 2.0e+01, 3.0e+01, 4.0e+01, 5.0e+01, 6.0e+01,
       7.0e+01, 8.0e+01, 9.0e+01, 2.0e+02, 2.1e+02, 2.2e+02, 2.3e+02,
       2.4e+02, 2.5e+02, 2.6e+02, 2.7e+02, 2.8e+02, 2.9e+02, 3.0e+02,
       3.1e+02, 3.2e+02, 3.3e+02, 3.4e+02, 3.5e+02, 3.6e+02, 3.7e+02])
In [16]:
V, W
Out[16]:
(array([[2.e+04, 1.e+01],
        [2.e+01, 3.e+01],
        [4.e+01, 5.e+01],
        [6.e+01, 7.e+01],
        [8.e+01, 9.e+01]]),
 array([[200., 210., 220., 230., 240., 250.],
        [260., 270., 280., 290., 300., 310.],
        [320., 330., 340., 350., 360., 370.]]))
In [17]:
# Make some training data
n = 20
X = np.linspace(0.,20.0,n).reshape((n,1)) - 10
T = 0.2 + 0.05 * (X+10) + 0.4 * np.sin(X+10) + 0.2 * np.random.normal(size=(n,1))

# Make some testing data
Xtest = X + 0.1*np.random.normal(size=(n,1))
Ttest = 0.2 + 0.05 * (X+10) + 0.4 * np.sin(Xtest+10) + 0.2 * np.random.normal(size=(n,1))

nSamples = X.shape[0]
nOutputs = T.shape[1]
In [18]:
# Add constant column of 1's
def addOnes(A):
    return np.insert(A, 0, 1, axis=1)

X1 = addOnes(X)
Xtest1 = addOnes(Xtest)
In [19]:
nHiddens = 10
Vshape = (1 + 1, nHiddens)
Wshape = (1 + nHiddens, nOutputs)
np.prod(Vshape)
Out[19]:
20
In [20]:
# Set parameters of neural network
nHiddens = 10
rho = 0.1
rho = rho / (nSamples * nOutputs)

# Initialize weights to small uniformly distributed values between -0.1 and 0.1
Vshape = (1 + 1, nHiddens)
Wshape = (1 + nHiddens, nOutputs)
V = np.random.uniform(-0.1, 0.1, Vshape)
W = np.random.uniform(-0.1, 0.1, Wshape)
print('V\n', V)
print('W\n', W)
V
 [[ 0.0713602  -0.05832346 -0.0535283  -0.06959432 -0.05306654  0.07361231
  -0.03018895  0.06962067 -0.02572612 -0.04628996]
 [-0.05732666  0.02210679 -0.01809187  0.06924505  0.08151444  0.09424323
  -0.00152218  0.07970044  0.07187384  0.08567878]]
W
 [[-0.01004698]
 [-0.0135908 ]
 [-0.07701436]
 [ 0.09986805]
 [ 0.01787848]
 [ 0.0743431 ]
 [ 0.04365285]
 [ 0.06305609]
 [ 0.09842045]
 [ 0.04480712]
 [ 0.05129682]]
In [21]:
# Set parameters of neural network
nHiddens = 10
rho = 0.01

# Initialize weights to small uniformly distributed values between -0.1 and 0.1
Vshape = (1 + 1, nHiddens)
Wshape = (1 + nHiddens, nOutputs)
V = np.random.uniform(-0.1, 0.1, Vshape)
W = np.random.uniform(-0.1, 0.1, Wshape)
w = pack(V, W)
V, W = unpack(w, V.shape, W.shape)

# Take nEpochs steepest descent steps down the gradient of the mean-squared-error function
nEpochs = 40000

# collect training and testing errors for plotting
errorTrace = np.zeros((nEpochs, 2))

fig = plt.figure(figsize=(10, 20))
for epoch in range(nEpochs):

    # V, W = unpack(w, Vshape, Wshape)
    
    # Forward pass on training data
    Z = np.tanh(X1 @ V)
    Z1 = addOnes(Z)
    Y = Z1 @ W

    # Error in output
    error = T - Y

    # Backward pass
    
    gradE_V = - X1.T @ ( ( error @ W[1:, :].T) * (1 - Z**2))
    gradE_W = - Z1.T @ error
        
    gradE_w = pack(gradE_V, gradE_W)
    
    w -= rho * gradE_w
    
    # V, W = unpack(w, Vshape, Wshape)

    # error traces for plotting
    errorTrace[epoch, 0] = np.sqrt(np.mean((error**2)))
    
    Ytest = addOnes(np.tanh(Xtest1 @ V)) @ W  #!! Forward pass in one line
    errorTrace[epoch, 1] = np.sqrt(np.mean((Ytest - Ttest)**2))

    if epoch % 2000 == 0 or epoch == nEpochs - 1:
        plt.clf()
        plt.subplot(3, 1, 1)
        plt.plot(errorTrace[:epoch, :])
        plt.ylim(0, 0.7)
        plt.xlabel('Epochs')
        plt.ylabel('RMSE')
        plt.legend(('Train','Test'), loc='upper left')
        
        plt.subplot(3, 1, 2)
        plt.plot(X, T, 'o-', Xtest, Ttest, 'o-', Xtest, Ytest, 'o-')
        plt.xlim(-10, 10)
        plt.legend(('Training', 'Testing', 'Model'), loc='upper left')
        plt.xlabel('$x$')
        plt.ylabel('Actual and Predicted $f(x)$')
        
        plt.subplot(3, 1, 3)
        plt.plot(X, Z)
        plt.ylim(-1.1, 1.1)
        plt.xlabel('$x$')
        plt.ylabel('Hidden Unit Outputs ($z$)');
        
        ipd.clear_output(wait=True)
        ipd.display(fig)
ipd.clear_output(wait=True)

print('Final RMSE', np.sqrt(np.mean((T - Y)**2)))
Final RMSE 0.100633898080757

Now we can repeat this training loop, but this time using Adam.

In [22]:
# Set parameters of neural network
nHiddens = 10

epsilon = 1e-8
beta1 = 0.9
beta2 = 0.999

rho = 0.001

# Initialize weights to small uniformly distributed values between -0.1 and 0.1
Vshape = (1 + 1, nHiddens)
Wshape = (1 + nHiddens, nOutputs)
V = np.random.uniform(-0.1, 0.1, Vshape)
W = np.random.uniform(-0.1, 0.1, Wshape)
w = pack(V, W)
V, W = unpack(w, Vshape, Wshape)

mt = np.zeros_like(w)
vt = np.zeros_like(w)
beta1t = 1
beta2t = 1

# Take nEpochs Adam steps down the gradient of the mean-squared-error function
nEpochs = 50000

# collect training and testing errors for plotting
errorTrace = np.zeros((nEpochs, 2))

fig = plt.figure(figsize=(10, 20))

for epoch in range(nEpochs):

    # Forward pass on training data
    Z = np.tanh(X1 @ V)
    Z1 = addOnes(Z)
    Y = Z1 @ W

    # Error in output
    error = T - Y

    # Backward pass
    
    gradE_V = - X1.T @ ( ( error @ W[1:, :].T) * (1 - Z**2))
    gradE_W = - Z1.T @ error
        
    gradE_w = pack(gradE_V, gradE_W)
     
    # approximate first and second moment
    mt = beta1 * mt + (1 - beta1) * gradE_w
    vt = beta2 * vt + (1 - beta2) * np.square(gradE_w)
    
    # bias corrected moment estimates
    beta1t *= beta1
    beta2t *= beta2
    mhat = mt / (1 - beta1t )
    vhat = vt / (1 - beta2t )               
    
    w -= rho * mhat / (np.sqrt(vhat) + epsilon)

    errorTrace[epoch, 0] = np.sqrt(np.mean((error**2)))
    
    Ytest = addOnes(np.tanh(Xtest1 @ V)) @ W  #!! Forward pass in one line
    errorTrace[epoch, 1] = np.sqrt(np.mean((Ytest - Ttest)**2))

    if epoch % 2000 == 0 or epoch == nEpochs - 1:
        plt.clf()
        plt.subplot(3, 1, 1)
        plt.plot(errorTrace[:epoch, :])
        plt.ylim(0, 0.7)
        plt.xlabel('Epochs')
        plt.ylabel('RMSE')
        plt.legend(('Train','Test'), loc='upper left')
        
        plt.subplot(3, 1, 2)
        plt.plot(X, T, 'o-', Xtest, Ttest, 'o-', Xtest, Ytest, 'o-')
        plt.xlim(-10, 10)
        plt.legend(('Training', 'Testing', 'Model'), loc='upper left')
        plt.xlabel('$x$')
        plt.ylabel('Actual and Predicted $f(x)$')
        
        plt.subplot(3, 1, 3)
        plt.plot(X, Z)
        plt.ylim(-1.1, 1.1)
        plt.xlabel('$x$')
        plt.ylabel('Hidden Unit Outputs ($z$)');
        
        ipd.clear_output(wait=True)
        ipd.display(fig)
ipd.clear_output(wait=True)

print('Final RMSE', np.sqrt(np.mean((T - Y)**2)))
Final RMSE 0.09533104538919793
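The bias correction in the loop above matters mostly in early epochs, when mt and vt are still dominated by their zero initialization. A small illustration with a constant gradient g = 1 (variable names ours): the raw estimate mt starts near zero, while the corrected mhat recovers the true mean immediately.

```python
beta1 = 0.9
g = 1.0
mt = 0.0
for t in range(1, 4):
    mt = beta1 * mt + (1 - beta1) * g
    mhat = mt / (1 - beta1 ** t)
    print(t, round(mt, 4), round(mhat, 4))  # mt: 0.1, 0.19, 0.271; mhat: 1.0 every step
```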