$\newcommand{\xv}{\mathbf{x}} \newcommand{\Xv}{\mathbf{X}} \newcommand{\yv}{\mathbf{y}} \newcommand{\Yv}{\mathbf{Y}} \newcommand{\zv}{\mathbf{z}} \newcommand{\av}{\mathbf{a}} \newcommand{\Wv}{\mathbf{W}} \newcommand{\wv}{\mathbf{w}} \newcommand{\gv}{\mathbf{g}} \newcommand{\Hv}{\mathbf{H}} \newcommand{\dv}{\mathbf{d}} \newcommand{\Vv}{\mathbf{V}} \newcommand{\vv}{\mathbf{v}} \newcommand{\tv}{\mathbf{t}} \newcommand{\Tv}{\mathbf{T}} \newcommand{\Zv}{\mathbf{Z}} \newcommand{\muv}{\boldsymbol{\mu}} \newcommand{\sigmav}{\boldsymbol{\sigma}} \newcommand{\phiv}{\boldsymbol{\phi}} \newcommand{\Phiv}{\boldsymbol{\Phi}} \newcommand{\Sigmav}{\boldsymbol{\Sigma}} \newcommand{\Lambdav}{\boldsymbol{\Lambda}} \newcommand{\half}{\frac{1}{2}} \newcommand{\argmax}[1]{\underset{#1}{\operatorname{argmax}}} \newcommand{\argmin}[1]{\underset{#1}{\operatorname{argmin}}} \newcommand{\dimensionbar}[1]{\underset{#1}{\operatorname{|}}}$

06.1:

• Modified Adam by replacing the exponential expressions in m_hat and v_hat.
In [1]:
import numpy as np
import matplotlib.pyplot as plt

import IPython.display as ipd  # for display and clear_output
import time  # for sleep


So far we have been using SGD or full-gradient steepest descent to find good weights. Sometimes other ways of using error gradients will lead to fewer steps to reach good weights.

Here we implement one of these ways, called Adam, short for adaptive moment estimation. In class, we will review the information provided in the following links.

But first, to use general-purpose gradient descent algorithms, we should collect all of the weights in a neural network into one vector. This way we can use the gradient descent algorithms for any optimization problem for which we can arrange the parameters in a vector.

Say we have assigned weight matrices, $V$ and $W$, for our hidden and output layer weights, as we have before. To collect all of these weights into a single vector, we can use the following function.

In [2]:
def pack(V, W):
    return np.hstack((V.flat, W.flat))


Let's test this before going any further.

In [3]:
V = np.arange(10).reshape(5, 2).astype(np.float64)
W = 20 + np.arange(18).reshape(3, 6).astype(np.float64)
V.shape, W.shape

Out[3]:
((5, 2), (3, 6))
In [4]:
print(V)
print(W)

[[0. 1.]
[2. 3.]
[4. 5.]
[6. 7.]
[8. 9.]]
[[20. 21. 22. 23. 24. 25.]
[26. 27. 28. 29. 30. 31.]
[32. 33. 34. 35. 36. 37.]]

In [5]:
V

Out[5]:
array([[0., 1.],
[2., 3.],
[4., 5.],
[6., 7.],
[8., 9.]])
In [6]:
pack(V, W)

Out[6]:
array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 20., 21., 22.,
23., 24., 25., 26., 27., 28., 29., 30., 31., 32., 33., 34., 35.,
36., 37.])

Imagine we have sent this vector of weights off to an optimization function (which uses gradient descent) and it returns new weight values. Now we need to unpack this vector back into the $V$ and $W$ matrices. Here is a function to do that.

In [7]:
def unpack(w, Vshape, Wshape):
    Vrows, Vcols = Vshape
    Wrows, Wcols = Wshape
    V = w[:Vrows * Vcols].reshape(Vshape)
    W = w[Vrows * Vcols:].reshape(Wshape)
    return V, W

In [8]:
w = pack(V, W)
w

Out[8]:
array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 20., 21., 22.,
23., 24., 25., 26., 27., 28., 29., 30., 31., 32., 33., 34., 35.,
36., 37.])
In [9]:
V, W = unpack(w, V.shape, W.shape)
V

Out[9]:
array([[0., 1.],
[2., 3.],
[4., 5.],
[6., 7.],
[8., 9.]])
In [10]:
w[0] = 2000
V

Out[10]:
array([[2.e+03, 1.e+00],
[2.e+00, 3.e+00],
[4.e+00, 5.e+00],
[6.e+00, 7.e+00],
[8.e+00, 9.e+00]])
In [11]:
w

Out[11]:
array([2.0e+03, 1.0e+00, 2.0e+00, 3.0e+00, 4.0e+00, 5.0e+00, 6.0e+00,
7.0e+00, 8.0e+00, 9.0e+00, 2.0e+01, 2.1e+01, 2.2e+01, 2.3e+01,
2.4e+01, 2.5e+01, 2.6e+01, 2.7e+01, 2.8e+01, 2.9e+01, 3.0e+01,
3.1e+01, 3.2e+01, 3.3e+01, 3.4e+01, 3.5e+01, 3.6e+01, 3.7e+01])

We must update w in place to change the values in V and W, because unpack returns views into w. An assignment like w = w * 0.1 creates a new array, breaking the connection to V and W.

In [12]:
w *= 0.1  # update w in place without creating new version of w.   w = w * 0.1 does create new version

In [13]:
w

Out[13]:
array([2.0e+02, 1.0e-01, 2.0e-01, 3.0e-01, 4.0e-01, 5.0e-01, 6.0e-01,
7.0e-01, 8.0e-01, 9.0e-01, 2.0e+00, 2.1e+00, 2.2e+00, 2.3e+00,
2.4e+00, 2.5e+00, 2.6e+00, 2.7e+00, 2.8e+00, 2.9e+00, 3.0e+00,
3.1e+00, 3.2e+00, 3.3e+00, 3.4e+00, 3.5e+00, 3.6e+00, 3.7e+00])
In [14]:
V, W

Out[14]:
(array([[2.e+02, 1.e-01],
[2.e-01, 3.e-01],
[4.e-01, 5.e-01],
[6.e-01, 7.e-01],
[8.e-01, 9.e-01]]),
array([[2. , 2.1, 2.2, 2.3, 2.4, 2.5],
[2.6, 2.7, 2.8, 2.9, 3. , 3.1],
[3.2, 3.3, 3.4, 3.5, 3.6, 3.7]]))

Sliced assignment is another way to update the values in an array without allocating new memory.

In [15]:
w[:] = w * 100
w

Out[15]:
array([2.0e+04, 1.0e+01, 2.0e+01, 3.0e+01, 4.0e+01, 5.0e+01, 6.0e+01,
7.0e+01, 8.0e+01, 9.0e+01, 2.0e+02, 2.1e+02, 2.2e+02, 2.3e+02,
2.4e+02, 2.5e+02, 2.6e+02, 2.7e+02, 2.8e+02, 2.9e+02, 3.0e+02,
3.1e+02, 3.2e+02, 3.3e+02, 3.4e+02, 3.5e+02, 3.6e+02, 3.7e+02])
In [16]:
V, W

Out[16]:
(array([[2.e+04, 1.e+01],
[2.e+01, 3.e+01],
[4.e+01, 5.e+01],
[6.e+01, 7.e+01],
[8.e+01, 9.e+01]]),
array([[200., 210., 220., 230., 240., 250.],
[260., 270., 280., 290., 300., 310.],
[320., 330., 340., 350., 360., 370.]]))
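The training loops below depend on exactly this behavior: pack copies the weights into new memory, while unpack returns views into that vector. A quick sketch with np.shares_memory (repeating the pack and unpack definitions from above so it runs on its own) confirms this:

```python
import numpy as np

def pack(V, W):
    return np.hstack((V.flat, W.flat))

def unpack(w, Vshape, Wshape):
    Vrows, Vcols = Vshape
    V = w[:Vrows * Vcols].reshape(Vshape)
    W = w[Vrows * Vcols:].reshape(Wshape)
    return V, W

V = np.zeros((2, 3))
W = np.zeros((4, 1))
w = pack(V, W)

# pack copies the data: w does not share memory with the original V
print(np.shares_memory(w, V))   # False

# unpack returns views into w, so in-place changes to w show up in V2 and W2
V2, W2 = unpack(w, V.shape, W.shape)
print(np.shares_memory(w, V2))  # True
w += 1.0
print(V2[0, 0])                 # 1.0
```

This is why the loops below can update w with in-place operations and never need to call unpack again.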
In [17]:
# Make some training data
n = 20
X = np.linspace(0., 20.0, n).reshape((n, 1)) - 10
T = 0.2 + 0.05 * (X + 10) + 0.4 * np.sin(X + 10) + 0.2 * np.random.normal(size=(n, 1))

# Make some testing data
Xtest = X + 0.1 * np.random.normal(size=(n, 1))
Ttest = 0.2 + 0.05 * (Xtest + 10) + 0.4 * np.sin(Xtest + 10) + 0.2 * np.random.normal(size=(n, 1))

nSamples = X.shape[0]
nOutputs = T.shape[1]

In [18]:
def addOnes(A):
    # Add a constant column of 1's as the first column
    return np.insert(A, 0, 1, axis=1)


In [19]:
nHiddens = 10
Vshape = (1 + 1, nHiddens)
Wshape = (1 + nHiddens, nOutputs)
np.prod(Vshape)

Out[19]:
20
In [20]:
# Set parameters of neural network
nHiddens = 10
rho = 0.1
rho = rho / (nSamples * nOutputs)

# Initialize weights to small uniformly distributed values between -0.1 and 0.1
Vshape = (1 + 1, nHiddens)
Wshape = (1 + nHiddens, nOutputs)
V = np.random.uniform(-0.1, 0.1, Vshape)
W = np.random.uniform(-0.1, 0.1, Wshape)
print('V\n', V)
print('W\n', W)

V
[[ 0.0713602  -0.05832346 -0.0535283  -0.06959432 -0.05306654  0.07361231
-0.03018895  0.06962067 -0.02572612 -0.04628996]
[-0.05732666  0.02210679 -0.01809187  0.06924505  0.08151444  0.09424323
-0.00152218  0.07970044  0.07187384  0.08567878]]
W
[[-0.01004698]
[-0.0135908 ]
[-0.07701436]
[ 0.09986805]
[ 0.01787848]
[ 0.0743431 ]
[ 0.04365285]
[ 0.06305609]
[ 0.09842045]
[ 0.04480712]
[ 0.05129682]]

In [21]:
# Set parameters of neural network
nHiddens = 10
rho = 0.01

# Initialize weights to small uniformly distributed values between -0.1 and 0.1
Vshape = (1 + 1, nHiddens)
Wshape = (1 + nHiddens, nOutputs)
V = np.random.uniform(-0.1, 0.1, Vshape)
W = np.random.uniform(-0.1, 0.1, Wshape)
w = pack(V, W)
V, W = unpack(w, V.shape, W.shape)

# Add the constant column of 1's to the inputs
X1 = addOnes(X)
Xtest1 = addOnes(Xtest)

# Take nEpochs steepest descent steps to minimize the mean squared error
nEpochs = 40000

# collect training and testing errors for plotting
errorTrace = np.zeros((nEpochs, 2))

fig = plt.figure(figsize=(10, 20))
for epoch in range(nEpochs):

    # Forward pass on training data
    Z = np.tanh(X1 @ V)
    Z1 = addOnes(Z)
    Y = Z1 @ W

    # Error in output
    error = T - Y

    # Backward pass
    gradE_V = - X1.T @ ((error @ W[1:, :].T) * (1 - Z**2))
    gradE_W = - Z1.T @ error

    # Update w in place; V and W are views into w, so no unpacking is needed
    w -= rho * pack(gradE_V, gradE_W)

    # error traces for plotting
    errorTrace[epoch, 0] = np.sqrt(np.mean(error**2))

    Ytest = addOnes(np.tanh(Xtest1 @ V)) @ W  #!! Forward pass in one line
    errorTrace[epoch, 1] = np.sqrt(np.mean((Ytest - Ttest)**2))

    if epoch % 2000 == 0 or epoch == nEpochs - 1:
        plt.clf()
        plt.subplot(3, 1, 1)
        plt.plot(errorTrace[:epoch, :])
        plt.ylim(0, 0.7)
        plt.xlabel('Epochs')
        plt.ylabel('RMSE')
        plt.legend(('Train', 'Test'), loc='upper left')

        plt.subplot(3, 1, 2)
        plt.plot(X, T, 'o-', Xtest, Ttest, 'o-', Xtest, Ytest, 'o-')
        plt.xlim(-10, 10)
        plt.legend(('Training', 'Testing', 'Model'), loc='upper left')
        plt.xlabel('$x$')
        plt.ylabel('Actual and Predicted $f(x)$')

        plt.subplot(3, 1, 3)
        plt.plot(X, Z)
        plt.ylim(-1.1, 1.1)
        plt.xlabel('$x$')
        plt.ylabel('Hidden Unit Outputs ($z$)')

        ipd.clear_output(wait=True)
        ipd.display(fig)

ipd.clear_output(wait=True)

print('Final RMSE', np.sqrt(np.mean((T - Y)**2)))

Final RMSE 0.100633898080757


Now we can repeat this training loop, but this time using Adam.
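Before applying Adam to the network, the update rule can be sketched in isolation. This is a minimal sketch on the one-dimensional quadratic f(w) = (w - 3)^2, whose gradient is 2 (w - 3); the variable names mirror the loop below, but the step size rho and the number of steps are arbitrary choices for this toy problem:

```python
import numpy as np

rho, beta1, beta2, epsilon = 0.02, 0.9, 0.999, 1e-8

w = np.array([0.0])
mt = np.zeros_like(w)
vt = np.zeros_like(w)
beta1t = 1.0
beta2t = 1.0

for step in range(3000):
    grad = 2 * (w - 3)

    # approximate first and second moments of the gradient
    mt = beta1 * mt + (1 - beta1) * grad
    vt = beta2 * vt + (1 - beta2) * grad**2

    # bias-corrected moment estimates
    beta1t *= beta1
    beta2t *= beta2
    mhat = mt / (1 - beta1t)
    vhat = vt / (1 - beta2t)

    w -= rho * mhat / (np.sqrt(vhat) + epsilon)

print(w)  # close to the minimum at 3
```

Notice that the early steps have size close to rho regardless of the gradient's magnitude, because mhat and sqrt(vhat) have similar scale; this is the sense in which Adam adapts the step size.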

In [22]:
# Set parameters of neural network
nHiddens = 10

epsilon = 1e-8
beta1 = 0.9
beta2 = 0.999

rho = 0.001

# Initialize weights to small uniformly distributed values between -0.1 and 0.1
Vshape = (1 + 1, nHiddens)
Wshape = (1 + nHiddens, nOutputs)
V = np.random.uniform(-0.1, 0.1, Vshape)
W = np.random.uniform(-0.1, 0.1, Wshape)
w = pack(V, W)
V, W = unpack(w, Vshape, Wshape)

# Add the constant column of 1's to the inputs
X1 = addOnes(X)
Xtest1 = addOnes(Xtest)

mt = np.zeros_like(w)
vt = np.zeros_like(w)
beta1t = 1
beta2t = 1

# Take nEpochs Adam steps to minimize the mean squared error
nEpochs = 50000

# collect training and testing errors for plotting
errorTrace = np.zeros((nEpochs, 2))

fig = plt.figure(figsize=(10, 20))

for epoch in range(nEpochs):

    # Forward pass on training data
    Z = np.tanh(X1 @ V)
    Z1 = addOnes(Z)
    Y = Z1 @ W

    # Error in output
    error = T - Y

    # Backward pass
    gradE_V = - X1.T @ ((error @ W[1:, :].T) * (1 - Z**2))
    gradE_W = - Z1.T @ error
    gradE_w = pack(gradE_V, gradE_W)

    # approximate first and second moments of the gradient
    mt = beta1 * mt + (1 - beta1) * gradE_w
    vt = beta2 * vt + (1 - beta2) * np.square(gradE_w)

    # bias-corrected moment estimates
    beta1t *= beta1
    beta2t *= beta2
    mhat = mt / (1 - beta1t)
    vhat = vt / (1 - beta2t)

    # Update w in place; V and W are views into w, so no unpacking is needed
    w -= rho * mhat / (np.sqrt(vhat) + epsilon)

    errorTrace[epoch, 0] = np.sqrt(np.mean(error**2))

    Ytest = addOnes(np.tanh(Xtest1 @ V)) @ W  #!! Forward pass in one line
    errorTrace[epoch, 1] = np.sqrt(np.mean((Ytest - Ttest)**2))

    if epoch % 2000 == 0 or epoch == nEpochs - 1:
        plt.clf()
        plt.subplot(3, 1, 1)
        plt.plot(errorTrace[:epoch, :])
        plt.ylim(0, 0.7)
        plt.xlabel('Epochs')
        plt.ylabel('RMSE')
        plt.legend(('Train', 'Test'), loc='upper left')

        plt.subplot(3, 1, 2)
        plt.plot(X, T, 'o-', Xtest, Ttest, 'o-', Xtest, Ytest, 'o-')
        plt.xlim(-10, 10)
        plt.legend(('Training', 'Testing', 'Model'), loc='upper left')
        plt.xlabel('$x$')
        plt.ylabel('Actual and Predicted $f(x)$')

        plt.subplot(3, 1, 3)
        plt.plot(X, Z)
        plt.ylim(-1.1, 1.1)
        plt.xlabel('$x$')
        plt.ylabel('Hidden Unit Outputs ($z$)')

        ipd.clear_output(wait=True)
        ipd.display(fig)

ipd.clear_output(wait=True)

print('Final RMSE', np.sqrt(np.mean((T - Y)**2)))

Final RMSE 0.09533104538919793
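Because the weights are packed into a single vector, the Adam update can be factored out into a general-purpose function and applied to any problem for which we can compute a gradient. Here is a sketch of that idea; adam_step and its state dictionary are hypothetical names introduced for illustration, not part of the notebook's code:

```python
import numpy as np

# Hypothetical helper (not part of the notebook's code): one Adam update
# applied in place to a packed weight vector w. state carries mt, vt and
# the running powers of beta1 and beta2, as in the loop above.
def adam_step(w, grad, state, rho=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
    state['mt'] = beta1 * state['mt'] + (1 - beta1) * grad
    state['vt'] = beta2 * state['vt'] + (1 - beta2) * grad**2
    state['beta1t'] *= beta1
    state['beta2t'] *= beta2
    mhat = state['mt'] / (1 - state['beta1t'])
    vhat = state['vt'] / (1 - state['beta2t'])
    w -= rho * mhat / (np.sqrt(vhat) + epsilon)  # in place, so views into w stay valid

# Example on a simple problem: minimize the sum of squares of w,
# whose gradient is 2 * w.
w = np.array([2.0, -1.0])
state = {'mt': np.zeros_like(w), 'vt': np.zeros_like(w), 'beta1t': 1.0, 'beta2t': 1.0}
for _ in range(5000):
    adam_step(w, 2 * w, state, rho=0.01)
print(w)  # both components near 0
```

Because the update is in place, V and W obtained from unpack would continue to track w, just as in the training loops above.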