$\newcommand{\xv}{\mathbf{x}} \newcommand{\Xv}{\mathbf{X}} \newcommand{\yv}{\mathbf{y}} \newcommand{\Yv}{\mathbf{Y}} \newcommand{\zv}{\mathbf{z}} \newcommand{\av}{\mathbf{a}} \newcommand{\Wv}{\mathbf{W}} \newcommand{\wv}{\mathbf{w}} \newcommand{\gv}{\mathbf{g}} \newcommand{\Hv}{\mathbf{H}} \newcommand{\dv}{\mathbf{d}} \newcommand{\Vv}{\mathbf{V}} \newcommand{\vv}{\mathbf{v}} \newcommand{\tv}{\mathbf{t}} \newcommand{\Tv}{\mathbf{T}} \newcommand{\zv}{\mathbf{z}} \newcommand{\Zv}{\mathbf{Z}} \newcommand{\muv}{\boldsymbol{\mu}} \newcommand{\sigmav}{\boldsymbol{\sigma}} \newcommand{\phiv}{\boldsymbol{\phi}} \newcommand{\Phiv}{\boldsymbol{\Phi}} \newcommand{\Sigmav}{\boldsymbol{\Sigma}} \newcommand{\Lambdav}{\boldsymbol{\Lambda}} \newcommand{\half}{\frac{1}{2}} \newcommand{\argmax}[1]{\underset{#1}{\operatorname{argmax}}} \newcommand{\argmin}[1]{\underset{#1}{\operatorname{argmin}}} \newcommand{\dimensionbar}[1]{\underset{#1}{\operatorname{|}}} $

Activation Functions

Last time we saw how transforming the inputs with fixed nonlinear functions results in models that are nonlinear in the inputs. However, there is no general way to know which nonlinear functions to use.

A more general approach is to pick a form of nonlinear function whose shape is controlled by parameters, or weights. There are many possibilities for constructing such functions. Let's list some desirable properties:

  1. Computationally simple.
  2. For initial, small, weight values the function is close to linear. As the weight magnitudes grow, the function becomes increasingly nonlinear.
  3. The derivative of the function is computationally simple.
  4. The magnitude of the derivative decreases as the weight magnitudes grow, perhaps asymptotically.
  5. The maximum value of the magnitude of the derivative is limited.
In [ ]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Let's start with a linear weighted sum, $s = \wv^T \xv$. For a particular input sample $\xv$, $s$ is small in magnitude when the weights in $\wv$ are near zero, and the magnitude of $s$ grows as the weight magnitudes increase. So let's try to construct a function of $s$ with the shape we want for the derivative.

We know that using $s$ as an exponent can make functions asymptotically decrease to zero. Let's use base $e$, in case we deal with natural logarithms in the future.

In [ ]:
s = np.linspace(-10, 10, 100)
plt.plot(s, np.exp(-s));

Remember, we are constructing a derivative, so we want to limit the maximum value of the derivative.

In [ ]:
plt.plot(s, 1/(1+np.exp(-s)));

Looking good. It does what we want as $s$ grows more negative, but we also want to bring the function down to zero as $s$ becomes more positive.

In [ ]:
plt.plot(s, 1 - 1/(1+np.exp(-s)));

Okay. Now how can we combine these? How about just multiplying them?

In [ ]:
plt.plot(s, 1/(1+np.exp(-s)) * (1 - 1/(1+np.exp(-s))));

Hey! Looks great. It asymptotes to zero as the magnitude of $s$ grows and the maximum value is limited at 0.25.

How simple is it? Notice the common term, $\frac{1}{1+e^{-s}}$, which we can compute once and reuse.

In [ ]:
y = 1  / (1 + np.exp(-s))
plt.plot(s, y * (1 - y));
In [ ]:
import sympy
In [ ]:
sS = sympy.Symbol('s')
sS
In [ ]:
sympy.diff(sS**4, sS)
In [ ]:
sympy.init_printing(use_unicode=True)
In [ ]:
sympy.diff(sS**4, sS)
In [ ]:
y = 1  / (1 + sympy.exp(-sS))
sympy.integrate(y * (1 - y), sS)

Hey! So, if $f(s)=\frac{1}{1+e^{-s}}$, the derivative of $f(s)$ is $f(s) (1-f(s))$.

Ta-da! We just arrived at the common sigmoid function used in neural networks. Remembering what $s=\xv^T \wv$ is, we get

$$ f(\xv; \wv) = \frac{1}{1 + e^{-\xv^T \wv}} $$
In [ ]:
def fS(s):
    return 1 / (1 + sympy.exp(-s))
In [ ]:
fS(sS)
In [ ]:
sympy.diff(fS(sS), sS)
In [ ]:
sympy.simplify(sympy.diff(fS(sS), sS))
In [ ]:
sympy.simplify(fS(sS) * (1 - fS(sS)))
In [ ]:
def f(s):
    return 1 / (1 + np.exp(-s))
def df(f_value):
    return f_value * (1 - f_value)
plt.plot(s, f(s))
plt.plot(s, df(f(s)))
plt.legend(('f(s)', 'df(s)/ds'))
plt.grid(True);

Let's try using stochastic gradient descent (SGD) to fit this "sigmoid" function to some data; that is, find weight values that minimize the sum of squared errors between the function's output and the target values.

In [ ]:
X = np.linspace(0, 10, 20).reshape((-1,1))
T = X * 0.1 + np.random.uniform(-0.2, 0.2, size=(20, 1))
plt.plot(X, T, 'o');
In [ ]:
w = np.zeros((2, 1))
rho = 0.1
X1 = np.insert(X, 0, 1, 1)
for iter in range(100):
    for n in range(X.shape[0]):
        Xn = X1[n:n + 1, :]
        Tn = T[n:n + 1, :]
        yn = f(Xn @ w)
        w += rho * Xn.T * (Tn - yn) * df(yn) 
        
plt.plot(X, T, 'o')
plt.plot(X, f(X1 @ w))
w
In [ ]:
X = np.linspace(0, 10, 20).reshape((-1, 1))
T = 1 + -X * 0.1 + np.random.uniform(-0.2, 0.2, size=(20, 1))
plt.plot(X, T, 'o');
In [ ]:
w = np.zeros((2, 1))
rho = 0.1
X1 = np.insert(X, 0, 1, 1)
for iter in range(100):
    for n in range(X.shape[0]):
        Xn = X1[n:n+1, :]
        Tn = T[n:n+1, :]
        yn = f(Xn @ w)
        w += rho * Xn.T * (Tn - yn) * df(yn)
        
plt.plot(X, T, 'o')
plt.plot(X, f(X1 @ w))
w

How would you use this function to fit this data?

In [ ]:
X = np.linspace(-10, 10, 20).reshape((-1, 1))
T = X**2 * 0.01 + np.random.uniform(-0.2, 0.2, size=(20, 1))
plt.plot(X, T, 'o');
In [ ]:
w = np.zeros((2, 1))
eta = 0.1
X1 = np.insert(X, 0, 1, 1)
for iter in range(100):
    for n in range(X.shape[0]):
        Xn = X1[n:n+1, :]
        Tn = T[n:n+1, :]
        yn = f(Xn @ w)
        w += eta * Xn.T * (Tn - yn) * df(yn)
        
plt.plot(X, T, 'o')
plt.plot(X, f(X1 @ w));

Right. We could use two of these functions and add their outputs together.

Now we are talking multilayered neural networks! One layer has two units, each producing $f(\xv^T \wv)$ with its own $\wv$. The second layer has one linear unit with its own $\wv$.
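
Just to see that a sum of two sigmoids can take this shape, here is a minimal sketch with hand-picked (not trained) weights. It assumes the quadratic data `X`, `T` and the sigmoid `f` defined in the cells above.

In [ ]:
# Hand-picked weights, only to illustrate the idea; these are not trained values.
z1 = f(-1.0 * X - 5)     # hidden unit 1: slope -1, bias -5
z2 = f( 1.0 * X - 5)     # hidden unit 2: slope +1, bias -5
y = 1.0 * z1 + 1.0 * z2  # linear output unit: just add the two hidden outputs
plt.plot(X, T, 'o')
plt.plot(X, y);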

Linear Models as Neural Networks

Given $X$ and $T$, find $\wv_k$ that minimizes squared error in the $k^{th}$ output, then use it to make predictions. Collect all $\wv_k$ as columns in $\Wv$. $\tilde{\Xv}$ denotes $\Xv$ with constant 1 column. The target value for the $k^{th}$ output for the $n^{th}$ sample is $t_{n,k}$.

$$ \begin{align*} E(\Wv) &= \sum_{n=1}^N \sum_{k=1}^K (t_{n,k} - \tilde{\xv}_n^T \wv_k)^2\\ \Wv &= (\tilde{\Xv}^T \tilde{\Xv})^{-1} \tilde{\Xv}^T \Tv\\ ~\\ \Wv &= \begin{bmatrix} w_{0,1} & w_{0,2} & \cdots & w_{0,K}\\ w_{1,1} & w_{1,2} & \cdots & w_{1,K}\\ \vdots\\ w_{D,1} & w_{D,2} & \cdots & w_{D,K} \end{bmatrix}\\ ~\\ \Yv &= \tilde{\Xv} \Wv\\ ~\\ \tilde{\Xv} & \text{ is } N \times (D+1)\\ \Wv & \text{ is } (D+1) \times K\\ \Yv & \text{ is } N \times K \end{align*} $$
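
As a minimal numpy sketch of this solution (the small data set and variable names below are made up for illustration), using `np.linalg.lstsq`, which returns the same least-squares solution as the normal equations but is numerically more stable:

In [ ]:
# Hypothetical example: N=5 samples, D=2 inputs, K=1 output.
Xsmall = np.random.uniform(-1, 1, size=(5, 2))
Tsmall = Xsmall @ np.array([[2.0], [-1.0]]) + 0.5          # targets from a known linear function
Xtilde = np.insert(Xsmall, 0, 1, axis=1)                   # prepend the constant 1 column
W, _, _, _ = np.linalg.lstsq(Xtilde, Tsmall, rcond=None)   # same solution as (X~^T X~)^{-1} X~^T T
Y = Xtilde @ W                                             # predictions, N x K
W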

The calculation of $y_{n,k} = \tilde{\xv}_n^T \wv_k$ can be drawn as a network diagram in which each input component (plus the constant 1) connects directly to each of the $K$ output units.

Adding Nonlinear Combinations of Inputs

Transform $X$ into $\Phiv(X)$. For example,

$$ \Phiv(X) = \Phiv\left ( \begin{bmatrix} x_{0,1} & x_{0,2} & \cdots \\ x_{1,1} & x_{1,2} & \cdots \\ \vdots\\ x_{N-1,1} & x_{N-1,2} & \cdots \end{bmatrix} \right ) = \begin{bmatrix} x_{0,1} & x_{0,2}^5 & x_{0,2}^3 x_{0,4}^2 & \cdots \\ x_{1,1} & x_{1,2}^5 & x_{1,2}^3 x_{1,4}^2 & \cdots\\ \vdots\\ x_{N-1,1} & x_{N-1,2}^5 & x_{N-1,2}^3 x_{N-1,4}^2 & \cdots \end{bmatrix} $$

Now we just replace $\Xv$ by $\Phiv$ (which we use to represent $\Phiv(X)$, and $\Phiv_n = \Phiv(\xv_n)$) and proceed. We do our derivation to minimize

$$ E_k = \sum_{n=1}^N (t_{n,k} - \tilde\Phiv_n^T \wv_k)^2 $$

from which we find that

$$ \Wv = (\tilde{\Phiv}^T \tilde{\Phiv})^{-1} \tilde{\Phiv}^T \Tv $$

and use it like

$$ \begin{align*} \Yv = \tilde{\Phiv} \Wv \end{align*} $$
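
For instance, here is a minimal sketch (my own choice of basis functions, not the lecture's) that fits the quadratic data from the earlier cells with $\Phiv(X) = [X, X^2]$:

In [ ]:
# Assumes X and T are the quadratic data defined in an earlier cell.
Phi = np.hstack((X, X**2))                         # each row of Phi is (x, x^2)
Phi1 = np.insert(Phi, 0, 1, axis=1)                # add the constant 1 column
W, _, _, _ = np.linalg.lstsq(Phi1, T, rcond=None)  # least-squares weights
plt.plot(X, T, 'o')
plt.plot(X, Phi1 @ W);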

Now, what should we put in the yellow box?

Can we use the training data to figure this out?

Structure of a Two Layer Neural Network

We have just entered the world of neural networks, with $\phi(\xv)$ being the output of another layer of adaptive units, or $\phi(\xv) = h(\xv)$.

The matrices and their shapes: $\tilde{\Xv}$ is $N \times (D+1)$, $\Vv$ is $(D+1) \times M$, $\tilde{\Zv}$ is $N \times (M+1)$, $\Wv$ is $(M+1) \times K$, and $\Yv$ is $N \times K$.

$$ \begin{align*} \tilde{\Zv} & = h(\tilde{\Xv} \Vv),\\ \Yv & = \tilde{\Zv} \Wv, \text{ or }\\ \Yv & = \tilde{h}(\tilde{\Xv} \Vv) \Wv \end{align*} $$
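
A quick shape check of this forward pass with random weights (the sizes and names here are arbitrary, and `np.tanh` stands in for $h$):

In [ ]:
N, D, M, K = 20, 1, 3, 1                                                 # samples, inputs, hidden units, outputs
Xtilde = np.insert(np.random.uniform(-1, 1, size=(N, D)), 0, 1, axis=1)  # N x (D+1)
V = np.random.uniform(-0.1, 0.1, size=(D + 1, M))                        # (D+1) x M
W = np.random.uniform(-0.1, 0.1, size=(M + 1, K))                        # (M+1) x K
Ztilde = np.insert(np.tanh(Xtilde @ V), 0, 1, axis=1)                    # N x (M+1)
Y = Ztilde @ W                                                           # N x K
Xtilde.shape, V.shape, Ztilde.shape, W.shape, Y.shape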

The two layers are called the hidden and output layer.

$h$ is the activation function for the units in the hidden layer.

We will be doing gradient descent on the squared error, so we want an $h$ whose derivative doesn't grow out of control as $\vv$ grows, and whose derivative is easy to calculate.

What about polynomials?

In [ ]:
def h(x):
    return 3 + 2 * x + 1 * x**2 + 3 * x**5

def dh(x):
    return 2 + 2 * x + 15 * x**4

plt.figure(figsize=(8,5))
xs = np.linspace(-5,5,100)
plt.plot(xs, h(xs), 'b-', linewidth=3, label='$h(a)$')
plt.plot(xs, dh(xs), 'b--', linewidth=3, label='$dh(a)/da$')

plt.plot((-6,6),(0,0),'k')
plt.plot((0,0),(-10000,10000),'k')
plt.plot([0,0],[-1,1],'r--',linewidth=3,alpha=0.5,)
plt.legend(loc='best');

A gradient descent procedure takes steps of size proportional to the derivative. For this polynomial the magnitude of the derivative grows without bound as $|a|$ increases, so gradient descent is very unstable. Such huge steps can shoot the search far away from a nearby local minimum.

Two common choices for functions with well-behaved derivatives are, where $a = \tilde{\xv}^T \vv$:

  • sigmoid (asymmetric)
$$ \begin{align*} h_1(a) & = \frac{1}{1+e^{-a}} \end{align*} $$
  • tanh (symmetric)
$$ \begin{align*} h_2(a) & = \tanh(a) = \frac{e^{a} - e^{-a}}{e^{a} + e^{-a}} \end{align*} $$

Work out their derivatives in class.
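
Before implementing them numerically below, we can let sympy do the derivatives, as we did for the sigmoid earlier (sympy may print equivalent but differently arranged forms):

In [ ]:
import sympy
aS = sympy.Symbol('a')
h1S = 1 / (1 + sympy.exp(-aS))      # sigmoid
h2S = sympy.tanh(aS)                # tanh
sympy.simplify(sympy.diff(h1S, aS)), sympy.simplify(sympy.diff(h2S, aS))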

In [ ]:
def h1(a):
    return 1 / (1 + np.exp(-a))

def h2(a):
    return np.tanh(a)

def dh1(a):
    y = h1(a)
    return y * (1 - y)

def dh2(a):
    y = h2(a)
    return 1 - y*y

plt.figure(figsize=(8,5))
xs = np.linspace(-5, 5, 100)
plt.plot(xs, h1(xs), 'b-', linewidth=3, label='$h_1(a)$')
plt.plot(xs, dh1(xs), 'b--', linewidth=3, label='$d h_1(a)/da$')

plt.plot(xs, h2(xs), 'r-', linewidth=3, label='$h_2(a)$')
plt.plot(xs, dh2(xs), 'r--', linewidth=3, label='$d h_2(a)/da$')

plt.plot((-6, 6), (0, 0), 'k')
plt.plot((0, 0), (-1, 1), 'k')
plt.plot([0, 0], [-1, 1], 'r--', linewidth=3, alpha=0.5)
plt.legend(loc='best');

Training by Gradient Descent

The mean squared error between each target value $t_{n,k}$ and output (predicted) value $y_{n,k}$ is

$$ \begin{align*} E &= \frac{1}{N} \frac{1}{K} \sum_{n=1}^N \sum_{k=1}^K ( t_{n,k} - y_{n,k})^2 \end{align*} $$

Since $E$ is no longer a linear function of the parameters (weights), we cannot set the derivative equal to zero and solve for the parameters. Instead, we can do gradient descent in $E$ by making small changes to weights $v_{j,m}$ and $w_{m,k}$ in the negative gradient direction.

$$ \begin{align*} v_{j,m} &\leftarrow v_{j,m} - \rho_h \frac{\partial E}{\partial v_{j,m}}\\ w_{m,k} &\leftarrow w_{m,k} - \rho_o \frac{\partial E}{\partial w_{m,k}} \end{align*} $$

This is often presented with $\rho_h = \rho_o$, but using different learning rates in the two layers often improves convergence.

Will this find the global optimum (the values of $v$ and $w$ that minimize the mean squared error)?

First a Simplified View

Remember our picture of a full two-layer network? Let's focus on how to modify a single weight, $v_{1,1}$, based on a single error between $y_1$ and $t_1$.

Let's drop the subscripts for now.

Forward calculation (ignoring constant 1 input and all other terms)

$$ \begin{align*} y &= w\; h(v\;x) \text{ or }\\ y &= w\; z\\ z &= h(a)\\ a &= v\; x \end{align*} $$

Since $E = (t - y)^2$

$$ \begin{align*} \frac{\partial E}{\partial v} & = \frac{\partial (t - y)^2}{\partial v} \end{align*} $$

The Chain Rule to the rescue.

$$ \begin{align*} \frac{\partial E}{\partial v} & = \frac{\partial (t-y)^2}{\partial v}\\ & = \frac{\partial (t-y)^2}{\partial y} \frac{\partial y}{\partial z} \frac{\partial z}{\partial a} \frac{\partial a}{\partial v} \\ & = 2(t-y) \;\;(-1)\;\; w \;\; \frac{\partial h(a)}{\partial a} \;\; x \\ \end{align*} $$

If $h(a) = \tanh(a)$, then $\frac{\partial h(a)}{\partial a} = (1-h(a)^2) = (1-z^2)$. See this page at Mathematics Stack Exchange.

$$ \begin{align*} \frac{\partial E}{\partial v} & = -2(t-y) \;\; w \;\; \frac{\partial h(a)}{\partial a} \;\; x \\ & = -2(t-y) \;\; w \;\; (1-z^2) \;\; x \end{align*} $$

Let's add another output.

The Chain Rule again.

$$ \begin{align*} \frac{\partial E}{\partial v} & = \frac{\partial \left [ (t_1-y_1)^2 + (t_2-y_2)^2 \right ]}{\partial v}\\ & = \frac{\partial \left [ (t_1-y_1)^2 + (t_2-y_2)^2 \right ]}{\partial z} \frac{\partial z}{\partial a} \frac{\partial a}{\partial v} \\ & = \left ( \frac{\partial \left [ (t_1-y_1)^2 + (t_2-y_2)^2 \right ]}{\partial y_1}\frac{\partial y_1}{\partial z} + \frac{\partial \left [ (t_1-y_1)^2 + (t_2-y_2)^2 \right ]}{\partial y_2}\frac{\partial y_2}{\partial z} \right ) \frac{\partial z}{\partial a} \frac{\partial a}{\partial v} \\ & = \left ( -2(t_1-y_1) w_1 - 2(t_2 -y_2) w_2 \right ) \frac{\partial z}{\partial a} \frac{\partial a}{\partial v} \\ & = -2 \left ( (t_1-y_1) w_1 + (t_2-y_2) w_2 \right ) \frac{\partial h(a)}{\partial a} x \\ & = -2 \left ( (t_1-y_1) w_1 + (t_2-y_2) w_2 \right ) (1-z^2) x \end{align*} $$

Think of the errors calculated in the output units as being sent backwards to the units in the previous layer. If we label these errors as "delta" values, our derivative expressions form what are commonly referred to as "delta rules". The delta values are "backpropagated" to the previous layer. This process is called "error backpropagation", but is really just the chain rule of calculating derivatives!

First, remember these derivatives.

$$ \begin{align*} \frac{\partial E}{\partial v} & = -2 ( (t_1-y_1) w_1 + (t_2-y_2) w_2 ) (1-z^2) x \\ \frac{\partial E}{\partial w_1} & = -2 (t_1-y_1) z \end{align*} $$

Now the update rules involving the deltas ($\delta$), with the constant factor of 2 absorbed into the learning rates $\rho_o$ and $\rho_h$, are

$$ \begin{align*} w &\leftarrow w - \rho_o \frac{\partial E}{\partial w_1}\\ &\leftarrow w + \rho_o (t_1-y_1) z\\ &\leftarrow w + \rho_o \delta_1^{(o)} z\\ v & \leftarrow v - \rho_h \frac{\partial E}{\partial v}\\ & \leftarrow v + \rho_h \left ( (t_1-y_1) w_1 + (t_2-y_2) w_2 \right ) (1-z^2) x \\ & \leftarrow v + \rho_h \left ( \delta_1^{(o)} w_1 + \delta_2^{(o)} w_2 \right ) (1-z^2) x \\ & \leftarrow v + \rho_h \delta^{(h)} x\\ &\;\;\; \text{ where } \delta^{(h)} = ( \delta_1^{(o)} w_1 + \delta_2^{(o)} w_2 ) (1-z^2) \end{align*} $$
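
Here is a small numerical sketch of these delta rules (the values below are made up) for one input, one hidden tanh unit, and two outputs, with a finite-difference check of $\partial E / \partial v$:

In [ ]:
# Made-up scalar values for one sample, one hidden unit, two outputs.
x, t1, t2 = 0.5, 1.0, -1.0
v, w1, w2 = 0.3, 0.2, -0.4

def forward(v):
    z = np.tanh(v * x)
    return z, w1 * z, w2 * z

def error(v):
    _, y1, y2 = forward(v)
    return (t1 - y1)**2 + (t2 - y2)**2

z, y1, y2 = forward(v)
delta1, delta2 = t1 - y1, t2 - y2                          # output deltas
dE_dv = -2 * (delta1 * w1 + delta2 * w2) * (1 - z**2) * x  # from the delta rule above
eps = 1e-6
dE_dv, (error(v + eps) - error(v - eps)) / (2 * eps)       # should be nearly equal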

Now for the Full Version of Back-Propagation

$$ \begin{align*} v_{j,m} &\leftarrow v_{j,m} - \rho_h \frac{\partial E}{\partial v_{j,m}}\\ w_{m,k} & \leftarrow w_{m,k} - \rho_o \frac{\partial E}{\partial w_{m,k}} \end{align*} $$

A bit of calculus and algebra lead us to these expressions for doing gradient descent.

$$ \begin{align*} v_{j,m} &\leftarrow v_{j,m} + \rho_h \frac{1}{N} \frac{1}{K} \sum_{n=1}^N \sum_{k=1}^K (t_{n,k} - y_{n,k}) w_{m,k} (1-z_{n,m}^2) \tilde{x}_{n,j}\\ w_{m,k} &\leftarrow w_{m,k} + \rho_o \frac{1}{N} \frac{1}{K} \sum_{n=1}^N (t_{n,k} - y_{n,k}) \tilde{z}_{n,m} \end{align*} $$

Now we will derive this result. First work on $\frac{\partial E}{\partial w_{m,k}}$.

$$ \begin{align*} E &= \frac{1}{N} \frac{1}{K} \sum_{n=1}^N \sum_{k=1}^K (t_{n,k} - y_{n,k})^2\\ \frac{\partial E}{\partial w_{m,k}} & = -2 \frac{1}{N} \frac{1}{K} \sum_{n=1}^N \sum_{k'=1}^K (t_{n,k'} - y_{n,k'}) \frac{\partial y_{n,k'}}{\partial w_{m,k}}\\ \text{Since } & y_{n,k'} = \sum_{m'=0}^M w_{m',k'} \tilde{z}_{n,m'}\\ \frac{\partial E}{\partial w_{m,k}} & = -2 \frac{1}{N} \frac{1}{K} \sum_{n=1}^N \sum_{k'=1}^K (t_{n,k'} - y_{n,k'}) \frac{\partial \left ( \sum_{m'=0}^M w_{m',k'} \tilde{z}_{n,m'} \right )}{\partial w_{m,k}}\\ & = -2 \frac{1}{N} \frac{1}{K} \sum_{n=1}^N (t_{n,k} - y_{n,k}) \tilde{z}_{n,m} \end{align*} $$

Now the hard one, $\frac{\partial E}{\partial v_{j,m}}$.

$$ \begin{align*} E &= \frac{1}{N} \frac{1}{K} \sum_{n=1}^N \sum_{k=1}^K (t_{n,k} - y_{n,k})^2\\ \frac{\partial E}{\partial v_{j,m}} & = -2 \frac{1}{N} \frac{1}{K} \sum_{n=1}^N \sum_{k'=1}^K (t_{n,k'} - y_{n,k'}) \frac{\partial y_{n,k'}}{\partial v_{j,m}} \end{align*} $$

Knowing that

$$ \begin{align*} y_{n,k'} & = \sum_{m'=0}^M w_{m',k'} \tilde{z}_{n,m'}\\ & = \sum_{m'=0}^M w_{m',k'} \tilde{h}\left ( \sum_{j'=0}^D v_{j',m'} \tilde{x}_{n,j'} \right ) \end{align*} $$

we can continue.

$$ \begin{align*} \frac{\partial E}{\partial v_{j,m}} & = -2 \frac{1}{N} \frac{1}{K} \sum_{n=1}^N \sum_{k'=1}^K (t_{n,k'} - y_{n,k'}) \frac{\partial \left ( \sum_{m'=0}^M w_{m',k'} \tilde{h} \left ( \sum_{j'=0}^D v_{j',m'} \tilde{x}_{n,j'} \right ) \right )}{\partial v_{j,m}}\\ \text{Let } a_{n,m'} & = \sum_{j'=0}^D v_{j',m'} \tilde{x}_{n,j'}\\ \frac{\partial E}{\partial v_{j,m}} & = -2 \frac{1}{N} \frac{1}{K} \sum_{n=1}^N \sum_{k'=1}^K (t_{n,k'} - y_{n,k'}) \sum_{m'=0}^M w_{m',k'} \frac{\partial \tilde{h}(a_{n,m'})}{\partial a_{n,m'}} \frac{\partial a_{n,m'}}{\partial v_{j,m}}\\ & = -2 \frac{1}{N} \frac{1}{K} \sum_{n=1}^N \sum_{k'=1}^K (t_{n,k'} - y_{n,k'}) w_{m,k'} \frac{\partial \tilde{h}(a_{n,m})}{\partial a_{n,m}} \tilde{x}_{n,j} \end{align*} $$

The last step uses the fact that $\frac{\partial a_{n,m'}}{\partial v_{j,m}} = \tilde{x}_{n,j}$ when $m' = m$ and is zero otherwise, so only the $m' = m$ term of the sum over $m'$ survives.

To summarize:

$$ \begin{align*} E &= \frac{1}{N} \frac{1}{K} \sum_{n=1}^N \sum_{k=1}^K (t_{n,k} - y_{n,k})^2\\ \frac{\partial E}{\partial w_{m,k}} & = -2 \frac{1}{N} \frac{1}{K} \sum_{n=1}^N (t_{n,k} - y_{n,k}) \tilde{z}_{n,m}\\ \frac{\partial E}{\partial v_{j,m}} & = -2 \frac{1}{N} \frac{1}{K} \sum_{n=1}^N \sum_{k=1}^K (t_{n,k} - y_{n,k}) w_{m,k} (1-z_{n,m}^2) \tilde{x}_{n,j} \end{align*} $$

Forward pass

$$ \begin{align*} z_{n,m} &= h(\sum_{j=0}^D v_{j,m} \tilde{x}_{n,j})\\ y_{n,k} &= \sum_{m=0}^M w_{m,k} \tilde{z}_{n,m} \end{align*} $$

Backward pass

$$ \begin{align*} v_{j,m} & \leftarrow v_{j,m} + \rho_h \frac{1}{N} \frac{1}{K} \sum_{n=1}^N \sum_{k=1}^K (t_{n,k} - y_{n,k}) w_{m,k} (1-z_{n,m}^2) \tilde{x}_{n,j}\\ w_{m,k} & \leftarrow w_{m,k} + \rho_o \frac{1}{N} \frac{1}{K} \sum_{n=1}^N (t_{n,k} - y_{n,k}) \tilde{z}_{n,m} \end{align*} $$
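
Here is a direct, loop-based sketch of these scalar rules (my own illustrative code, deliberately slow; the lecture's vectorized version comes next). It assumes `Xtilde` and `Ztilde` include the constant 1 column, and that row 0 of `W` holds the bias weights, so hidden unit $m$ uses `W[m + 1, k]`.

In [ ]:
def backprop_step_loops(Xtilde, T, V, W, rho_h, rho_o):
    # Xtilde: N x (D+1), T: N x K, V: (D+1) x M, W: (M+1) x K
    N, K = T.shape
    D1, M = V.shape
    # Forward pass
    Z = np.tanh(Xtilde @ V)                  # N x M
    Ztilde = np.insert(Z, 0, 1, axis=1)      # N x (M+1)
    Y = Ztilde @ W                           # N x K
    # Backward pass, element by element, following the scalar equations above
    Vnew, Wnew = V.copy(), W.copy()
    for j in range(D1):                      # update each v_{j,m}
        for m in range(M):
            g = 0.0
            for n in range(N):
                for k in range(K):
                    g += (T[n, k] - Y[n, k]) * W[m + 1, k] * (1 - Z[n, m]**2) * Xtilde[n, j]
            Vnew[j, m] += rho_h * g / (N * K)
    for m in range(M + 1):                   # update each w_{m,k}, including the bias row
        for k in range(K):
            g = 0.0
            for n in range(N):
                g += (T[n, k] - Y[n, k]) * Ztilde[n, m]
            Wnew[m, k] += rho_o * g / (N * K)
    return Vnew, Wnew

Calling this repeatedly on the training data takes steepest-descent steps, exactly like the vectorized loop developed below.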

Preparing for Python

Convert these scalar expressions to matrix expressions.

Forward Pass

$$ \begin{align*} z_{n,m} &= h(\sum_{j=0}^D v_{j,m} \tilde{x}_{n,j})\\ z_{n,m} &= h( v_{*,m} \tilde{x}_{n,*})\\ z_{n,m} &= h( \tilde{x}_{n,*} v_{*,m})\\ z_{*_n,m} &= h( \tilde{x}_{*_n,*} v_{*,m})\\ z_{*_n,*_m} &= h( \tilde{x}_{*_n,*} v_{*,*_m})\\ \Zv &= h(\tilde{\Xv} \Vv)\\ y_{n,k} &= \sum_{m=0}^M w_{m,k} \tilde{z}_{n,m}\\ y_{n,k} &= \tilde{z}_{n,*} w_{*,k}\\ y_{*_n,*_k} &= \tilde{z}_{*_n,*} w_{*,*_k}\\ \Yv &= \tilde{\Zv} \Wv \end{align*} $$

Backward Pass for $\Vv$

$$ \begin{align*} v_{j,m} & \leftarrow v_{j,m} + \rho_h \frac{1}{N} \frac{1}{K} \sum_{n=1}^N \sum_{k=1}^K (t_{n,k} - y_{n,k}) w_{m,k} (1-z_{n,m}^2) \tilde{x}_{n,j}\\ v_{j,m} & \leftarrow v_{j,m} + \rho_h \frac{1}{N} \frac{1}{K} \sum_{n=1}^N (t_{n,*_k} - y_{n,*_k}) w_{m,*_k}^T (1-z_{n,m}^2) \tilde{x}_{n,j}\\ v_{j,m} & \leftarrow v_{j,m} + \rho_h \frac{1}{N} \frac{1}{K} \left ( (t_{*_n,*_k} - y_{*_n,*_k}) w_{m,*_k}^T \cdot (1-z_{*_n,m}^2) \dimensionbar{N\times 1} \right )^T \tilde{x}_{*_n,j} \end{align*} $$

Almost. Result on right hand side has subscripts $m,j$, but left side is $j,m$. So

$$ \begin{align*} v_{j,m} & \leftarrow v_{j,m} + \rho_h \frac{1}{N} \frac{1}{K} \left ( \left ( (t_{*_n,*_k} - y_{*_n,*_k}) w_{m,*_k}^T \cdot (1-z_{*_n,m}^2) \right )^T \tilde{x}_{*_n,j} \right )^T\\ & \leftarrow v_{j,m} + \rho_h \frac{1}{N} \frac{1}{K} \tilde{x}_{*_n,j}^T \left ( (t_{*_n,*_k} - y_{*_n,*_k}) w_{m,*_k}^T \cdot (1-z_{*_n,m}^2) \right )\\ v_{*_j,*_m} & \leftarrow v_{*_j,*_m} + \rho_h \frac{1}{N} \frac{1}{K} \tilde{x}_{*_n,*_j}^T \left ( (t_{*_n,*_k} - y_{*_n,*_k}) w_{*_m,*_k}^T \cdot (1-z_{*_n,*_m}^2) \right )\\ \Vv & \leftarrow \Vv + \rho_h \frac{1}{N} \frac{1}{K} \tilde{\Xv}^T \left ( (\Tv - \Yv) \hat{\Wv}^T \cdot (1-\Zv^2) \right )\\ \Vv & \leftarrow \Vv + \rho_h \frac{1}{N} \frac{1}{K} \tilde{\Xv}^T \dimensionbar{(D+1)\times N} \left ( (\Tv - \Yv) \dimensionbar{N\times K} \hat{\Wv}^T \dimensionbar{N\times M} \cdot (1-\Zv^2) \dimensionbar{N\times M}\right ) \dimensionbar{(D+1)\times M}\\ \text{where } & \hat{\Wv} \text{ is } \Wv \text{ without its constant-input row} \end{align*} $$

Backward Pass for $\Wv$

$$ \begin{align*} w_{m,k} & \leftarrow w_{m,k} + \rho_o \frac{1}{N} \frac{1}{K} \sum_{n=1}^N (t_{n,k} - y_{n,k}) \tilde{z}_{n,m}\\ & \leftarrow w_{m,k} + \rho_o \frac{1}{N} \frac{1}{K} (t_{*_n,k} - y_{*_n,k})^T \tilde{z}_{*_n,m} \end{align*} $$

Again, almost. Result on right hand side has subscripts $k,m$ but need $m,k$. So

$$ \begin{align*} w_{m,k} & \leftarrow w_{m,k} + \rho_o \frac{1}{N} \frac{1}{K} \left ( (t_{*_n,k} - y_{*_n,k})^T \tilde{z}_{*_n,m} \right )^T\\ & \leftarrow w_{m,k} + \rho_o \frac{1}{N} \frac{1}{K} \tilde{z}_{*_n,m}^T (t_{*_n,k} - y_{*_n,k}) \\ w_{*_m,*_k} & \leftarrow w_{*_m,*_k} + \rho_o \frac{1}{N} \frac{1}{K} \tilde{z}_{*_n,*_m}^T (t_{*_n,*_k} - y_{*_n,*_k}) \\ \Wv & \leftarrow \Wv + \rho_o \frac{1}{N} \frac{1}{K} \tilde{\Zv}^T (\Tv - \Yv)\\ \Wv & \leftarrow \Wv + \rho_o \frac{1}{N} \frac{1}{K} \tilde{\Zv}^T \dimensionbar{(M+1)\times N} (\Tv - \Yv) \dimensionbar{N\times K} \dimensionbar{(M+1)\times K} \end{align*} $$

All together now, first in math.

$$ \begin{align*} \Zv &= h(\tilde{\Xv} \Vv)\\ \Yv &= \tilde{\Zv} \Wv\\ \Vv & \leftarrow \Vv + \rho_h \frac{1}{N} \frac{1}{K} \tilde{\Xv}^T \left ( (\Tv - \Yv) \hat{\Wv}^T \cdot (1-\Zv^2) \right )\\ \Wv & \leftarrow \Wv + \rho_o \frac{1}{N} \frac{1}{K}\tilde{\Zv}^T (\Tv - \Yv) \end{align*} $$

Now in python.

# Given input X and target T, do forward pass to calculate network output, Y
X1 = addOnes(X)
Z = np.tanh(X1 @ V)
Z1 = addOnes(Z)
Y = Z1 @ W
# Do gradient descent on derivative of squared error with respect to each weight to update V and W.
error = T - Y
V = V + rh * X1.T @ ((error @ W[1:, :].T) * (1 - Z**2))
W = W + ro * Z1.T @ error
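
As a sanity check (my own, not part of the lecture code), we can compare these matrix gradient expressions against numerical finite differences of $E$ on tiny random data:

In [ ]:
def addOnes(A):
    return np.insert(A, 0, 1, axis=1)

np.random.seed(0)
N, D, M, K = 5, 2, 3, 2
X1 = addOnes(np.random.uniform(-1, 1, size=(N, D)))
T = np.random.uniform(-1, 1, size=(N, K))
V = np.random.uniform(-0.5, 0.5, size=(D + 1, M))
W = np.random.uniform(-0.5, 0.5, size=(M + 1, K))

def mse(V, W):
    return np.mean((T - addOnes(np.tanh(X1 @ V)) @ W)**2)

# Analytic gradients of the mean squared error (note the -2 / (N K) factor)
Z = np.tanh(X1 @ V)
Z1 = addOnes(Z)
error = T - Z1 @ W
gV = -2 / (N * K) * X1.T @ ((error @ W[1:, :].T) * (1 - Z**2))
gW = -2 / (N * K) * Z1.T @ error

# Finite-difference check of one element of each gradient
eps = 1e-6
Vp, Vm = V.copy(), V.copy()
Vp[0, 0] += eps; Vm[0, 0] -= eps
Wp, Wm = W.copy(), W.copy()
Wp[0, 0] += eps; Wm[0, 0] -= eps
print(gV[0, 0], (mse(Vp, W) - mse(Vm, W)) / (2 * eps))
print(gW[0, 0], (mse(V, Wp) - mse(V, Wm)) / (2 * eps))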

Finally, the Code

The above equations show a single step down the gradient of our mean square error, $E$. This is called steepest descent.

Here is an example. Let's fit a neural network to the function

$$ f(x) = 0.2 + 0.05 (x+10) + 0.4 \sin(x+10) + 0.2 \epsilon $$

where $\epsilon$ is a random variable drawn from the standard normal distribution, or $\epsilon \sim \mathcal{N}(0,1)$, and $x \in [-10,10]$.

In [ ]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import IPython.display as ipd  # for display and clear_output
import time  # for sleep
In [ ]:
# Make some training data
n = 20
X = np.linspace(0.,20.0,n).reshape((n,1)) - 10
T = 0.2 + 0.05 * (X+10) + 0.4 * np.sin(X+10) + 0.2 * np.random.normal(size=(n,1))

# Make some testing data
Xtest = X + 0.1*np.random.normal(size=(n,1))
Ttest = 0.2 + 0.05 * (Xtest+10) + 0.4 * np.sin(Xtest+10) + 0.2 * np.random.normal(size=(n,1))

nSamples = X.shape[0]
nOutputs = T.shape[1]
In [ ]:
# Add constant column of 1's
def addOnes(A):
    return np.insert(A, 0, 1, axis=1)

X1 = addOnes(X)
Xtest1 = addOnes(Xtest)
In [ ]:
# Set parameters of neural network
nHiddens = 10

rhoh = 0.5
rhoo = 0.1

rh = rhoh / (nSamples * nOutputs)
ro = rhoo / (nSamples * nOutputs)

# Initialize weights to small uniformly distributed values between -0.1 and 0.1
V = 0.1 * 2 * (np.random.uniform(size=(1 + 1, nHiddens)) - 0.5)
W = 0.1 * 2 * (np.random.uniform(size=(1 + nHiddens, nOutputs)) - 0.5)

# Take nReps steepest descent steps in gradient descent search in mean-squared-error function
nReps = 200000
# collect training and testing errors for plotting
errorTrace = np.zeros((nReps, 2))

fig = plt.figure(figsize=(10, 20))
for reps in range(nReps):

    # Forward pass on training data
    Z = np.tanh(X1 @ V)
    Z1 = addOnes(Z)
    Y = Z1 @ W

    # Error in output
    error = T - Y

    # Backward pass - the backpropagation and weight update steps
    V = V + rh * X1.T @ ( ( error @ W[1:, :].T) * (1 - Z**2))
    W = W + ro * Z1.T @ error

    # error traces for plotting
    errorTrace[reps, 0] = np.sqrt(np.mean((error**2)))
    Ytest = addOnes(np.tanh(Xtest1 @ V)) @ W  #!! Forward pass in one line
    errorTrace[reps, 1] = np.sqrt(np.mean((Ytest - Ttest)**2))

    if reps % 1000 == 0 or reps == nReps-1:
        plt.clf()
        plt.subplot(3, 1, 1)
        plt.plot(errorTrace[:reps, :])
        plt.ylim(0, 0.7)
        plt.xlabel('Epochs')
        plt.ylabel('RMSE')
        plt.legend(('Train','Test'), loc='upper left')
        
        plt.subplot(3, 1, 2)
        plt.plot(X, T, 'o-', Xtest, Ttest, 'o-', Xtest, Ytest, 'o-')
        plt.xlim(-10, 10)
        plt.legend(('Training', 'Testing', 'Model'), loc='upper left')
        plt.xlabel('$x$')
        plt.ylabel('Actual and Predicted $f(x)$')
        
        plt.subplot(3, 1, 3)
        plt.plot(X, Z)
        plt.ylim(-1.1, 1.1)
        plt.xlabel('$x$')
        plt.ylabel('Hidden Unit Outputs ($z$)');
        
        ipd.clear_output(wait=True)
        ipd.display(fig)
ipd.clear_output(wait=True)