$\newcommand{\xv}{\mathbf{x}} \newcommand{\Xv}{\mathbf{X}} \newcommand{\yv}{\mathbf{y}} \newcommand{\Yv}{\mathbf{Y}} \newcommand{\zv}{\mathbf{z}} \newcommand{\av}{\mathbf{a}} \newcommand{\Wv}{\mathbf{W}} \newcommand{\wv}{\mathbf{w}} \newcommand{\gv}{\mathbf{g}} \newcommand{\Hv}{\mathbf{H}} \newcommand{\dv}{\mathbf{d}} \newcommand{\Vv}{\mathbf{V}} \newcommand{\vv}{\mathbf{v}} \newcommand{\tv}{\mathbf{t}} \newcommand{\Tv}{\mathbf{T}} \newcommand{\zv}{\mathbf{z}} \newcommand{\Zv}{\mathbf{Z}} \newcommand{\muv}{\boldsymbol{\mu}} \newcommand{\sigmav}{\boldsymbol{\sigma}} \newcommand{\phiv}{\boldsymbol{\phi}} \newcommand{\Phiv}{\boldsymbol{\Phi}} \newcommand{\Sigmav}{\boldsymbol{\Sigma}} \newcommand{\Lambdav}{\boldsymbol{\Lambda}} \newcommand{\half}{\frac{1}{2}} \newcommand{\argmax}[1]{\underset{#1}{\operatorname{argmax}}} \newcommand{\argmin}[1]{\underset{#1}{\operatorname{argmin}}} \newcommand{\dimensionbar}[1]{\underset{#1}{\operatorname{|}}}$

### The Scaled Part

The first derivative of an error function with respect to the parameters of your model tells you which direction in the parameter space to proceed to reduce the error function. But how far do you go? So far we have just taken a small step by subtracting a small constant times the derivative from our current parameter values.

If we are in the vicinity of a minimum of the error function, we could do what Newton did: approximate the function at the current parameter value with a parabola and solve for the minimum of the parabola. Use this as the next guess at a good parameter value. If the error function is quadratic in the parameter, we jump to the true minimum immediately.

How would you fit a parabola to a function at a particular value of $x$? We can derive a way to do this using a truncated Taylor series (google that) to approximate the function about a value of $x$:

$$f(x+\Delta x) \approx \hat{f}(x+\Delta x) = f(x) + f'(x) \Delta x + \frac{1}{2} f''(x) \Delta x^2$$

Now we want to know what value of $\Delta x$ minimizes $\hat{f}(x+\Delta x)$. So take its derivative and set equal to zero.

\begin{align*} \frac{d \hat{f}(x+\Delta x)}{d\Delta x} &= f'(x) + \frac{1}{2} 2 f''(x) \Delta x\\ & = f'(x) + f''(x) \Delta x \end{align*}

Setting equal to zero we get

\begin{align*} 0 &= f'(x) + f''(x) \Delta x\\ \Delta x &= -\frac{f'(x)}{f''(x)} \end{align*}

Now we can update our guess for $x$ by adding $\Delta x$ to it. Then, fit a new parabola at the new value of $x$, calculate $\Delta x$, and update $x$ again. Actually, the last equation above does the parabola approximation and calculation of $\Delta x$.

Here is a simple example. Say we want to find the minimum of

$$f(x) = 2 x^4 + 3 x^3 + 3$$

To calculate

\begin{align*} \Delta x &= -\frac{f'(x)}{f''(x)} \end{align*}

we need the function's first and second derivatives. They are

\begin{align*} f'(x) &= 8 x^3 + 9 x^2\\ f''(x) &= 24 x^2 + 18 x \end{align*}
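Before animating Newton's method below, we can check the update rule numerically with just these two derivatives. Since $f'(x) = x^2(8x + 9)$, the minimum of $f$ is at $x = -9/8 = -1.125$; a few Newton updates from $x = -2$ should land there. A minimal sketch (no plotting):

```python
def df(x):
    return 8 * x**3 + 9 * x**2   # f'(x)

def ddf(x):
    return 24 * x**2 + 18 * x    # f''(x)

x = -2.0                          # initial guess
for step in range(6):
    x = x - df(x) / ddf(x)        # Newton update: x + delta x
print(abs(x - (-9 / 8)) < 1e-3)   # True: converged near the minimum
```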

All together now, in python!

In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import IPython.display as ipd  # for display and clear_output
import time  # for sleep

In [2]:
def f(x):
    return 2 * x**4 + 3 * x**3 + 3

def df(x):
    return 8 * x**3 + 9 * x**2

def ddf(x):
    return 24 * x**2 + 18 * x

def taylorf(x, dx):
    return f(x) + df(x) * dx + 0.5 * ddf(x) * dx**2

x = np.random.uniform(-2, 1)  # first guess at minimum

xs = np.linspace(-2,1,num=100)

fig = plt.figure()

dxs = np.linspace(-0.5,0.5,num=100)

for rep in range(10):
    time.sleep(2)  # sleep 2 seconds
    plt.clf()
    plt.plot(xs, f(xs))
    plt.grid('on')
    plt.plot(x + dxs, taylorf(x, dxs), 'g-', linewidth=5, alpha=0.4)
    plt.plot(x, f(x), 'ro')
    y0, y1 = plt.ylim()
    plt.plot([x, x], [y0, y1], 'r--')

    x = x - df(x) / ddf(x)
    plt.plot(x, f(x), 'go')
    plt.text(x, (y0 + y1) * 0.5, str(x), color='r')
    plt.legend(('$f(x)$', r'$\hat{f}(x)$'))

    ipd.clear_output(wait=True)
    ipd.display(fig)
ipd.clear_output(wait=True)


This has all been for a function $f(x)$ of a single, scalar variable $x$. To minimize a squared error function for a neural network, $x$ will consist of all the weights of the neural network. If all of the weights are collected into the vector $\wv$, then the first derivative of the squared error function, $f$, with respect to the weight vector, $\wv$, is a vector of derivatives like $\frac{\partial f}{\partial w_{i}}$. This is usually written as the gradient

$$\nabla_{\wv} f = (\frac{\partial f}{\partial w_{1}}, \frac{\partial f}{\partial w_{2}}, \ldots, \frac{\partial f}{\partial w_{n}}).$$

The second derivative will be an $n\times n$ matrix of values like $\frac{\partial^2 f}{\partial w_i \partial w_j}$, usually written as the Hessian

$$\nabla^2_{\wv} f = \begin{pmatrix} \frac{\partial^2 f}{\partial w_1 \partial w_1} & \frac{\partial^2 f}{\partial w_1 \partial w_2} & \cdots & \frac{\partial^2 f}{\partial w_1 \partial w_n}\\ \frac{\partial^2 f}{\partial w_2 \partial w_1} & \frac{\partial^2 f}{\partial w_2 \partial w_2} & \cdots & \frac{\partial^2 f}{\partial w_2 \partial w_n}\\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial w_n \partial w_1} & \frac{\partial^2 f}{\partial w_n \partial w_2} & \cdots & \frac{\partial^2 f}{\partial w_n \partial w_n} \end{pmatrix}$$

It is often impractical to construct and use the Hessian. We will consider ways to approximate the product of the Hessian and a vector as part of the Scaled Conjugate Gradient algorithm.
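To make the Hessian-vector idea concrete, here is a small sketch (not part of the SCG code) of the finite-difference approximation $\nabla^2_{\wv} f \, \vv \approx \left(\nabla_{\wv} f(\wv + \epsilon \vv) - \nabla_{\wv} f(\wv)\right) / \epsilon$, checked on a quadratic whose Hessian is known exactly:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])    # known Hessian of f(w) = 0.5 w^T A w

def grad(w):
    return A @ w              # gradient of the quadratic

w = np.array([1.0, -1.0])
v = np.array([0.5, 2.0])
eps = 1e-6

# finite-difference approximation to the Hessian-vector product
Hv = (grad(w + eps * v) - grad(w)) / eps
print(np.allclose(Hv, A @ v, atol=1e-4))  # True
```

Only two gradient evaluations are needed, and the $n \times n$ Hessian is never formed.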

### The Conjugate Part

Let $E(\wv)$ be the error function (mean squared error over training samples) we wish to minimize by finding the best $\wv$. Steepest descent will find new $\wv$ by minimizing $E(\wv)$ in successive directions $\dv_0, \dv_1, \ldots$ for which $\dv_i^T \dv_{i+1} = 0$. In other words, each search direction is orthogonal to the previous one, resulting in a zig-zag pattern of steps, some of which are in the same directions as earlier steps.

Another problem with orthogonal directions is that forcing the second direction, for example, to be orthogonal to the first means it will not point toward the minimum unless the error function is quadratic and its contours are circles.

We would rather choose a new direction based on the previous ones and on the curvature, or second derivative, of the error function at the current $\wv$. This is the idea behind conjugate gradient methods.
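As a concrete illustration (a sketch of plain linear conjugate gradient, not Møller's scaled version), successive directions are chosen to be conjugate with respect to the curvature, $\dv_i^T \Hv \dv_j = 0$, rather than orthogonal. On an $n$-dimensional quadratic, $n$ such steps reach the minimum exactly:

```python
import numpy as np

# Minimize f(w) = 0.5 w^T A w - b^T w, whose minimum solves A w = b.
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])

w = np.zeros(2)
g = A @ w - b              # gradient at w
d = -g                     # first direction: steepest descent
for k in range(2):         # n conjugate steps solve an n-dim quadratic exactly
    alpha = (g @ g) / (d @ A @ d)     # exact line search along d
    w = w + alpha * d
    g_new = A @ w - b
    beta = (g_new @ g_new) / (g @ g)  # Fletcher-Reeves coefficient
    d = -g_new + beta * d             # new direction: conjugate to previous
    g = g_new

print(np.allclose(w, np.linalg.solve(A, b)))  # True
```

Note how each new direction mixes the new negative gradient with the previous direction, instead of zig-zagging along orthogonal directions.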

The Scaled Conjugate Gradient (SCG) algorithm, described in Efficient Training of Feed-Forward Neural Networks by Møller, combines conjugate gradient directions with a local quadratic approximation to the error function, solving for the new value of $\wv$ that minimizes the quadratic. A number of additional steps are taken to improve the quadratic approximation.

In [3]:
!curl -O http://www.cs.colostate.edu/~anderson/cs445/notebooks/mlutilities.tar

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
100 30720  100 30720    0     0   115k      0 --:--:-- --:--:-- --:--:--  115k

In [4]:
!tar xvf mlutilities.tar

mlutilities.py

In [5]:
import mlutilities as ml

def parabola(x, xmin, s):
    d = x - xmin
    return np.dot(np.dot(d, s), d.T)

def parabolaGrad(x, xmin, s):
    # gradient of the quadratic with respect to x (s is symmetric)
    d = x - xmin
    return 2 * np.dot(s, d)

f = parabola
df = parabolaGrad
center = np.array([5, 5])
S = np.array([[5, 4], [4, 5]])

n = 10
xs = np.linspace(0,10,n)
ys = np.linspace(0,10,n)
X,Y = np.meshgrid(xs,ys)
both = np.vstack((X.flat,Y.flat)).T
nall = n*n
Z = np.zeros(nall)
for i in range(n * n):
    Z[i] = parabola(both[i, :], center, S)
Z.resize((n, n))

fig = plt.figure(figsize=(8,8))

for reps in range(10):
    time.sleep(2)

    firstx = np.random.uniform(0, 10, 2)

    resultSCG = ml.scg(firstx, f, df, center, S, xtracep=True)
    resultSteepest = ml.steepest(firstx, f, df, center, S, stepsize=0.05, xtracep=True)

    plt.clf()
    plt.contourf(X, Y, Z, 20, alpha=0.3)
    plt.axis('tight')

    xt = resultSteepest['xtrace']
    plt.plot(xt[:, 0], xt[:, 1], 'ro-')

    xt = resultSCG['xtrace']
    plt.plot(xt[:, 0], xt[:, 1], 'go-')

    plt.text(2, 3, "%s steps for Steepest Descent" % resultSteepest['xtrace'].shape[0], color='red')
    plt.text(2, 2.5, "%s steps for SCG" % resultSCG['xtrace'].shape[0], color='green')

    ipd.clear_output(wait=True)
    ipd.display(fig)
ipd.clear_output(wait=True)


Rosenbrock's function is often used to test optimization algorithms. It is

$$f(x,y) = (1-x)^2 + 100(y-x^2)^2$$
In [6]:
def rosen(x):
    v = 100 * ((x[1] - x[0]**2)**2) + (1.0 - x[0])**2
    return v

def rosenGrad(x):
    g1 = -400 * (x[1] - x[0]**2) * x[0] - 2 * (1 - x[0])
    g2 = 200 * (x[1] - x[0]**2)
    return np.array([g1, g2])

f = rosen
df = rosenGrad
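Before handing a hand-derived gradient to an optimizer, it is worth verifying it against finite differences. A self-contained check of the Rosenbrock gradient using central differences (the helper name `rosen_grad` is just for this sketch):

```python
import numpy as np

def rosen(x):
    return 100 * (x[1] - x[0]**2)**2 + (1.0 - x[0])**2

def rosen_grad(x):
    g1 = -400 * (x[1] - x[0]**2) * x[0] - 2 * (1 - x[0])
    g2 = 200 * (x[1] - x[0]**2)
    return np.array([g1, g2])

x = np.array([-0.3, 0.8])
eps = 1e-6
# central-difference gradient, one coordinate direction at a time
fd = np.array([(rosen(x + eps * e) - rosen(x - eps * e)) / (2 * eps)
               for e in np.eye(2)])
print(np.allclose(fd, rosen_grad(x), atol=1e-4))  # True
```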

In [7]:
n = 10
xmin, xmax = -1,2
xs = np.linspace(xmin, xmax, n)
ys = np.linspace(xmin, xmax, n)
X, Y = np.meshgrid(xs, ys)
both = np.vstack((X.flat, Y.flat)).T
nall = n * n
Z = np.zeros(nall)
for i in range(n * n):
    Z[i] = f(both[i, :])
Z.resize((n, n))

fig = plt.figure(figsize=(8, 8))

for reps in range(10):
    time.sleep(2)

    firstx = np.random.uniform(xmin, xmax, 2)

    # resultSCG = ml.scg(firstx, f, df, xPrecision=0.000001, xtracep=True)
    # resultSteepest = ml.steepest(firstx, f, df, stepsize=0.001, xPrecision=0.000001, xtracep=True)
    resultSCG = ml.scg(firstx, f, df, xtracep=True)
    resultSteepest = ml.steepest(firstx, f, df, stepsize=0.001, xtracep=True)

    plt.clf()
    plt.contourf(X, Y, Z, 20, alpha=0.3)
    plt.axis('tight')

    xt = resultSteepest['xtrace']
    plt.plot(xt[:, 0], xt[:, 1], 'ro-')

    xt = resultSCG['xtrace']
    plt.plot(xt[:, 0], xt[:, 1], 'go-')

    plt.text(0.7, -0.5, "%s steps for Steepest Descent" % resultSteepest['nIterations'], color='red')
    plt.text(0.7, -0.8, "%s steps for SCG" % resultSCG['nIterations'], color='green')

    ipd.clear_output(wait=True)
    ipd.display(fig)
ipd.clear_output(wait=True)


The only difficulty is that our scg (and steepest) implementation requires all parameters to be concatenated into a single vector. We will use pack and unpack functions to concatenate and extract the $V$ and $W$ matrices.
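The idea behind pack and unpack is just flattening and reshaping. A small sketch with hypothetical shapes (2x3 and 4x1, chosen only for illustration) showing that the round trip recovers both matrices:

```python
import numpy as np

V = np.arange(6.0).reshape(2, 3)
W = np.arange(4.0).reshape(4, 1)

w = np.hstack((V.flat, W.flat))       # pack: one flat parameter vector
V2 = w[:V.size].reshape(V.shape)      # unpack: recover each matrix
W2 = w[V.size:].reshape(W.shape)
print(np.array_equal(V, V2) and np.array_equal(W, W2))  # True
```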

Here is our example from last time again, but now using scg (scaled conjugate gradient).

In [8]:
# Make some training data
n = 20
X = np.linspace(0., 20.0, n).reshape((n, 1))
T = 0.2 + 0.05 * X + 0.4 * np.sin(X) + 0.2 * np.random.normal(size=(n, 1))

# Make some testing data
Xtest = X + 0.1 * np.random.normal(size=(n, 1))
Ttest = 0.2 + 0.05 * Xtest + 0.4 * np.sin(Xtest) + 0.2 * np.random.normal(size=(n, 1))

def addOnes(A):
    return np.insert(A, 0, 1, axis=1)

X1 = addOnes(X)
Xtest1 = addOnes(Xtest)

In [9]:
# Set parameters of neural network
nInputs = X.shape[1]
nHiddens = 10
nOutputs = T.shape[1]

# Initialize weights to small uniformly distributed values between -0.1 and 0.1
V = np.random.uniform(-0.1, 0.1, size=(1 + nInputs, nHiddens))
W = np.random.uniform(-0.1, 0.1, size=(1 + nHiddens, nOutputs))

### gradientDescent functions require all parameters in a vector.
def pack(V, W):
    return np.hstack((V.flat, W.flat))

def unpack(w):
    '''Assumes V, W, nInputs, nHiddens, nOutputs are defined in calling context'''
    V[:] = w[:(1 + nInputs) * nHiddens].reshape((1 + nInputs, nHiddens))
    W[:] = w[(1 + nInputs) * nHiddens:].reshape((1 + nHiddens, nOutputs))

In [10]:
### Function f to be minimized
def objectiveF(w):
    unpack(w)
    # Forward pass on training data
    Y = addOnes(np.tanh(X1 @ V)) @ W
    return 0.5 * np.mean((T - Y)**2)

In [11]:
### Gradient of f with respect to V,W
def gradientF(w):
    unpack(w)
    Z = np.tanh(X1 @ V)
    Z1 = addOnes(Z)
    Y = Z1 @ W
    nSamples = X1.shape[0]
    nOutputs = T.shape[1]
    error = -(T - Y) / (nSamples * nOutputs)
    dV = X1.T @ (error @ W[1:, :].T * (1 - Z**2))
    dW = Z1.T @ error
    return pack(dV, dW)

In [12]:
# Initialize weights to small uniformly distributed values between -0.1 and 0.1
V = np.random.uniform(-0.1, 0.1, size=(1 + nInputs, nHiddens))
W = np.random.uniform(-0.1, 0.1, size=(1 + nHiddens, nOutputs))

result = ml.scg(pack(V, W), objectiveF, gradientF, nIterations = 2000, ftracep = True)

unpack(result['x'])  # copy best parameters into V and W
errorTrace = result['ftrace']
print('Ran for', len(errorTrace), 'iterations')

Ran for 2001 iterations

In [13]:
errorTrace[:10]

Out[13]:
array([0.42939572, 0.08875698, 0.07167256, 0.07167256, 0.07167256,
0.07167256, 0.07167256, 0.07167256, 0.07167256, 0.07167256])
In [14]:
plt.figure(figsize=(10, 10))

plt.subplot(3, 1, 1)
plt.plot(errorTrace)
nEpochs = len(errorTrace)
plt.xlim(0 - 0.05 * nEpochs, nEpochs * 1.05)
plt.xlabel('Epochs')
plt.ylabel('Train RMSE')

plt.subplot(3, 1, 2)
Y = addOnes(np.tanh(X1 @ V)) @ W
Ytest = addOnes(np.tanh(Xtest1 @ V)) @ W
plt.plot(X, T, 'o-', Xtest, Ttest, 'o-', Xtest, Ytest, 'o-')
plt.xlim(0, 20)
plt.legend(('Training', 'Testing', 'Model'), loc='upper left')
plt.xlabel('$x$')
plt.ylabel('Actual and Predicted $f(x)$')

plt.subplot(3, 1, 3)
Z = np.tanh(X1 @ V)
plt.plot(X, Z)
plt.xlabel('$x$')
plt.ylabel('Hidden Unit Outputs ($z$)');


## Neural Network Class

Python includes the ability to define new classes. Let's write one for neural networks. First, let's discuss how it might be used; we want to make it as easy for the user as possible.

X = ...
T = ...
nnet = NeuralNetwork(1, 5, 1)  # 1 input, 5 hidden units, 1 output
nnet.train(X, T, nIterations=100)
Y = nnet.use(X)



This implementation is for any number of hidden layers!

In [15]:
%%writefile neuralnetworks.py

import numpy as np
import mlutilities as ml
import matplotlib.pyplot as plt
from copy import copy
import time

class NeuralNetwork:

    def __init__(self, ni, nhs, no):

        if isinstance(nhs, (list, tuple)):
            if len(nhs) == 1 and nhs[0] == 0:
                nihs = [ni]
                nhs = []
            else:
                nihs = [ni] + list(nhs)
        else:
            if nhs > 0:
                nihs = [ni, nhs]
                nhs = [nhs]
            else:
                nihs = [ni]
                nhs = []

        if len(nihs) > 1:
            self.Vs = [1 / np.sqrt(nihs[i]) *
                       np.random.uniform(-1, 1, size=(1 + nihs[i], nihs[i + 1]))
                       for i in range(len(nihs) - 1)]
            self.W = 1 / np.sqrt(nhs[-1]) * np.random.uniform(-1, 1, size=(1 + nhs[-1], no))
        else:
            self.Vs = []
            self.W = 1 / np.sqrt(ni) * np.random.uniform(-1, 1, size=(1 + ni, no))
        self.ni, self.nhs, self.no = ni, nhs, no
        self.Xmeans = None
        self.Xstds = None
        self.Tmeans = None
        self.Tstds = None
        self.trained = False
        self.reason = None
        self.errorTrace = None
        self.numberOfIterations = None
        self.trainingTime = None

    def __repr__(self):
        str = 'NeuralNetwork({}, {}, {})'.format(self.ni, self.nhs, self.no)
        # str += '  Standardization parameters' + (' not' if self.Xmeans == None else '') + ' calculated.'
        if self.trained:
            str += '\n   Network was trained for {} iterations that took {:.4f} seconds. Final error is {}.'.format(
                self.numberOfIterations, self.getTrainingTime(), self.errorTrace[-1])
        else:
            str += '  Network is not trained.'
        return str

    def _standardizeX(self, X):
        result = (X - self.Xmeans) / self.XstdsFixed
        result[:, self.Xconstant] = 0.0
        return result

    def _unstandardizeX(self, Xs):
        return self.Xstds * Xs + self.Xmeans

    def _standardizeT(self, T):
        result = (T - self.Tmeans) / self.TstdsFixed
        result[:, self.Tconstant] = 0.0
        return result

    def _unstandardizeT(self, Ts):
        return self.Tstds * Ts + self.Tmeans

    def _pack(self, Vs, W):
        return np.hstack([V.flat for V in Vs] + [W.flat])

    def _unpack(self, w):
        first = 0
        numInThisLayer = self.ni
        for i in range(len(self.Vs)):
            self.Vs[i][:] = w[first:first + (1 + numInThisLayer) * self.nhs[i]].reshape(
                (1 + numInThisLayer, self.nhs[i]))
            first += (numInThisLayer + 1) * self.nhs[i]
            numInThisLayer = self.nhs[i]
        self.W[:] = w[first:].reshape((1 + numInThisLayer, self.no))

    def _objectiveF(self, w, X, T):
        self._unpack(w)
        # Do forward pass through all layers
        Zprev = X
        for i in range(len(self.nhs)):
            V = self.Vs[i]
            Zprev = np.tanh(Zprev @ V[1:, :] + V[0:1, :])  # handling bias weight without adding column of 1's
        Y = Zprev @ self.W[1:, :] + self.W[0:1, :]
        return 0.5 * np.mean((T - Y)**2)

    def _gradientF(self, w, X, T):
        self._unpack(w)
        # Do forward pass through all layers
        Zprev = X
        Z = [Zprev]
        for i in range(len(self.nhs)):
            V = self.Vs[i]
            Zprev = np.tanh(Zprev @ V[1:, :] + V[0:1, :])
            Z.append(Zprev)
        Y = Zprev @ self.W[1:, :] + self.W[0:1, :]
        # Do backward pass, starting with delta in output layer
        delta = -(T - Y) / (X.shape[0] * T.shape[1])
        dW = np.vstack((np.ones((1, delta.shape[0])) @ delta,
                        Z[-1].T @ delta))
        dVs = []
        delta = (1 - Z[-1]**2) * (delta @ self.W[1:, :].T)
        for Zi in range(len(self.nhs), 0, -1):
            Vi = Zi - 1  # because X is first element of Z
            dV = np.vstack((np.ones((1, delta.shape[0])) @ delta,
                            Z[Zi - 1].T @ delta))
            dVs.insert(0, dV)
            delta = (delta @ self.Vs[Vi][1:, :].T) * (1 - Z[Zi - 1]**2)
        return self._pack(dVs, dW)

    def train(self, X, T, nIterations=100, verbose=False,
              weightPrecision=0, errorPrecision=0, saveWeightsHistory=False):

        if self.Xmeans is None:
            self.Xmeans = X.mean(axis=0)
            self.Xstds = X.std(axis=0)
            self.Xconstant = self.Xstds == 0
            self.XstdsFixed = copy(self.Xstds)
            self.XstdsFixed[self.Xconstant] = 1
        X = self._standardizeX(X)

        if T.ndim == 1:
            T = T.reshape((-1, 1))

        if self.Tmeans is None:
            self.Tmeans = T.mean(axis=0)
            self.Tstds = T.std(axis=0)
            self.Tconstant = self.Tstds == 0
            self.TstdsFixed = copy(self.Tstds)
            self.TstdsFixed[self.Tconstant] = 1
        T = self._standardizeT(T)

        startTime = time.time()

        scgresult = ml.scg(self._pack(self.Vs, self.W),
                           self._objectiveF, self._gradientF,
                           X, T,
                           xPrecision=weightPrecision,
                           fPrecision=errorPrecision,
                           nIterations=nIterations,
                           verbose=verbose,
                           ftracep=True,
                           xtracep=saveWeightsHistory)

        self._unpack(scgresult['x'])
        self.reason = scgresult['reason']
        self.errorTrace = np.sqrt(scgresult['ftrace'])  # * self.Tstds # to _unstandardize the MSEs
        self.numberOfIterations = len(self.errorTrace)
        self.trained = True
        self.weightsHistory = scgresult['xtrace'] if saveWeightsHistory else None
        self.trainingTime = time.time() - startTime
        return self

    def use(self, X, allOutputs=False):
        Zprev = self._standardizeX(X)
        Z = [Zprev]
        for i in range(len(self.nhs)):
            V = self.Vs[i]
            Zprev = np.tanh(Zprev @ V[1:, :] + V[0:1, :])
            Z.append(Zprev)
        Y = Zprev @ self.W[1:, :] + self.W[0:1, :]
        Y = self._unstandardizeT(Y)
        return (Y, Z[1:]) if allOutputs else Y

    def getNumberOfIterations(self):
        return self.numberOfIterations

    def getErrors(self):
        return self.errorTrace

    def getTrainingTime(self):
        return self.trainingTime

    def getWeightsHistory(self):
        return self.weightsHistory

    def draw(self, inputNames=None, outputNames=None, gray=False):
        ml.draw(self.Vs, self.W, inputNames, outputNames, gray)

if __name__ == '__main__':

    X = np.arange(10).reshape((-1, 1))
    T = X + 2

    net = NeuralNetwork(1, 0, 1)
    net.train(X, T, 10)
    print(net)

    net = NeuralNetwork(1, [5, 5], 1)
    net.train(X, T, 10)
    print(net)

Overwriting neuralnetworks.py


If the above class definition is placed in a file named neuralnetworks.py, then an instance of this class can be instantiated using code like

import neuralnetworks as nn
nnet = nn.NeuralNetwork(1, 4, 1)



The files neuralnetworks.py and mlutilities.py must be in your working directory.

In [16]:
import sys
sys.path = ['.'] + sys.path
sys.path

Out[16]:
['.',
'/home/anderson/lib/python',
'/home/anderson/anaconda3/lib/python36.zip',
'/home/anderson/anaconda3/lib/python3.6',
'',
'/home/anderson/.local/lib/python3.6/site-packages',
'/home/anderson/anaconda3/lib/python3.6/site-packages',
'/home/anderson/anaconda3/lib/python3.6/site-packages/IPython/extensions',
'/home/anderson/.ipython']
In [17]:
import neuralnetworks
neuralnetworks

Out[17]:
<module 'neuralnetworks' from './neuralnetworks.py'>
In [18]:
import neuralnetworks as nn
import importlib
importlib.reload(nn)  # in case neuralnetworks.py has been changed

# Make some training data
nSamples = 10
Xtrain = np.linspace(0, 10, nSamples).reshape((-1, 1))
Ttrain = 1.5 + 0.6 * Xtrain + 8 * np.sin(2.2 * Xtrain)
Ttrain[np.logical_and(Xtrain > 2, Xtrain < 3)] *= 3
Ttrain[np.logical_and(Xtrain > 5, Xtrain < 7)] *= 3

nSamples = 100
Xtest = np.linspace(0, 10, nSamples).reshape((-1, 1)) + 10.0/nSamples/2
Ttest = 1.5 + 0.6 * Xtest + 8 * np.sin(2.2 * Xtest) + np.random.uniform(-2, 2, size=(nSamples, 1))
Ttest[np.logical_and(Xtest > 2, Xtest < 3)] *= 3
Ttest[np.logical_and(Xtest > 5, Xtest < 7)] *= 3

# Create the neural network, with one hidden layer of 4 units
nnet = nn.NeuralNetwork(1, [4], 1)

# Train the neural network
nnet.train(Xtrain, Ttrain, nIterations=1000)

# Print some information when done
print('SCG stopped after', nnet.getNumberOfIterations(), 'iterations:', nnet.reason)
print('Training took', nnet.getTrainingTime(), 'seconds.')

# Print the training and testing RMSE
Ytrain = nnet.use(Xtrain)
Ytest, Ztest = nnet.use(Xtest, allOutputs=True)
print("Final RMSE: train", np.sqrt(np.mean((Ytrain - Ttrain)**2)),
"test", np.sqrt(np.mean((Ytest - Ttest)**2)))

plt.figure(figsize=(15, 15))

nHLayers = len(nnet.nhs)
nPlotRows = 3 + nHLayers

plt.subplot(nPlotRows, 1, 1)
plt.plot(nnet.getErrors())
plt.title('Regression Example')

plt.subplot(nPlotRows, 1, 2)
plt.plot(Xtrain, Ttrain, 'o-', label='Training Data')
plt.plot(Xtrain, Ytrain, 'o-', label='Train NN Output')
plt.ylabel('Training Data')
plt.legend(loc='lower right', prop={'size': 9})

plt.subplot(nPlotRows, 1, 3)
plt.plot(Xtest, Ttest, 'o-', label='Test Target')
plt.plot(Xtest, Ytest, 'o-', label='Test NN Output')
plt.ylabel('Testing Data')
plt.xlim(0, 10)
plt.legend(loc='lower right', prop={'size': 9})
for i in range(nHLayers):
    layer = nHLayers - i - 1
    plt.subplot(nPlotRows, 1, i + 4)
    plt.plot(Xtest, Ztest[layer])
    plt.xlim(0, 10)
    plt.ylim(-1.1, 1.1)
    plt.ylabel('Hidden Units')
    plt.text(8, 0, 'Layer {}'.format(layer + 1))

SCG stopped after 1001 iterations: completed iterations
Training took 0.20933842658996582 seconds.
Final RMSE: train 6.19644016290778 test 9.021723154407656

In [19]:
nnet.draw(['x'],['y'])

In [20]:
nnet

Out[20]:
NeuralNetwork(1, [4], 1)
Network was trained for 1001 iterations that took 0.2093 seconds. Final error is 0.3489259770721829.

What happens if we add another hidden layer, for a total of two hidden layers? Let's use 5 units in each hidden layer.

In [21]:
nnet = nn.NeuralNetwork(1, [5, 5], 1)

In [22]:
nnet

Out[22]:
NeuralNetwork(1, [5, 5], 1)  Network is not trained.

The rest of the code is the same. Even the plotting code written above works for as many hidden layers as we create.

In [23]:
nnet.train(Xtrain, Ttrain, nIterations=1000)
print("SCG stopped after", nnet.getNumberOfIterations(), "iterations:", nnet.reason)
Ytrain = nnet.use(Xtrain)
Ytest, Ztest = nnet.use(Xtest, allOutputs=True)
print("Final RMSE: train", np.sqrt(np.mean((Ytrain - Ttrain)**2)),
"test", np.sqrt(np.mean((Ytest - Ttest)**2)))

plt.figure(figsize=(10, 15))

nHLayers = len(nnet.nhs)
nPlotRows = 3 + nHLayers

plt.subplot(nPlotRows, 1, 1)
plt.plot(nnet.getErrors())
plt.title('Regression Example')

plt.subplot(nPlotRows, 1, 2)
plt.plot(Xtrain, Ttrain, 'o-', label='Training Data')
plt.plot(Xtrain, Ytrain, 'o-', label='Train NN Output')
plt.ylabel('Training Data')
plt.legend(loc='lower right', prop={'size': 9})

plt.subplot(nPlotRows, 1, 3)
plt.plot(Xtest, Ttest, 'o-', label='Test Target')
plt.plot(Xtest, Ytest, 'o-', label='Test NN Output')
plt.ylabel('Testing Data')
plt.xlim(0, 10)
plt.legend(loc='lower right', prop={'size': 9})
for i in range(nHLayers):
    layer = nHLayers - i - 1
    plt.subplot(nPlotRows, 1, i + 4)
    plt.plot(Xtest, Ztest[layer])
    plt.xlim(0, 10)
    plt.ylim(-1.1, 1.1)
    plt.ylabel('Hidden Units')
    plt.text(8, 0, 'Layer {}'.format(layer + 1))

2.937483399001997e-17
SCG stopped after 451 iterations: limit on machine precision
Final RMSE: train 1.7384279134351605e-07 test 6.751422827149613


For more fun, wrap the above code in a function to make it easy to try different network structures.

In [24]:
def run(Xtrain, Ttrain, Xtest, Ttest, hiddenUnits, nIterations=100, verbose=False):
    if Xtrain.shape[1] != 1 or Ttrain.shape[1] != 1:
        print('This function written for one-dimensional input samples, X, and one-dimensional targets, T.')
        return

    nnet = nn.NeuralNetwork(1, hiddenUnits, 1)

    nnet.train(Xtrain, Ttrain, nIterations=nIterations, verbose=verbose)

    Ytrain = nnet.use(Xtrain)
    Ytest, Ztest = nnet.use(Xtest, allOutputs=True)
    print('Training took {:.4f} seconds.'.format(nnet.getTrainingTime()))
    print("Final RMSE: train", np.sqrt(np.mean((Ytrain - Ttrain)**2)),
          "test", np.sqrt(np.mean((Ytest - Ttest)**2)))

    plt.figure(figsize=(10, 15))
    nHLayers = len(nnet.nhs)
    nPlotRows = 3 + nHLayers

    plt.subplot(nPlotRows, 1, 1)
    plt.plot(nnet.getErrors())
    plt.title('Regression Example')

    plt.subplot(nPlotRows, 1, 2)
    plt.plot(Xtrain, Ttrain, 'o-', label='Training Data')
    plt.plot(Xtrain, Ytrain, 'o-', label='Train NN Output')
    plt.ylabel('Training Data')
    plt.legend(loc='lower right', prop={'size': 9})

    plt.subplot(nPlotRows, 1, 3)
    plt.plot(Xtest, Ttest, 'o-', label='Test Target')
    plt.plot(Xtest, Ytest, 'o-', label='Test NN Output')
    plt.ylabel('Testing Data')
    plt.xlim(0, 10)
    plt.legend(loc='lower right', prop={'size': 9})
    for i in range(nHLayers):
        layer = nHLayers - i - 1
        plt.subplot(nPlotRows, 1, i + 4)
        plt.plot(Xtest, Ztest[layer])
        plt.xlim(0, 10)
        plt.ylim(-1.1, 1.1)
        plt.ylabel('Hidden Units')
        plt.text(8, 0, 'Layer {}'.format(layer + 1))
    return nnet

In [25]:
run(Xtrain, Ttrain, Xtest, Ttest, (2, 2), nIterations=1000)

Training took 0.2101 seconds.
Final RMSE: train 3.945078075580234 test 8.058688805566645

Out[25]:
NeuralNetwork(1, (2, 2), 1)
Network was trained for 1001 iterations that took 0.2101 seconds. Final error is 0.2221501678315113.
In [26]:
run(Xtrain, Ttrain, Xtest, Ttest, (2, 2, 2, 2), nIterations=1000)

Training took 0.4604 seconds.
Final RMSE: train 5.754717294637087 test 15.440924309145363

Out[26]:
NeuralNetwork(1, (2, 2, 2, 2), 1)
Network was trained for 1001 iterations that took 0.4604 seconds. Final error is 0.32405224645357705.
In [27]:
[2]*6

Out[27]:
[2, 2, 2, 2, 2, 2]
In [28]:
run(Xtrain, Ttrain, Xtest, Ttest, [2]*6, nIterations=4000)

Training took 2.1112 seconds.
Final RMSE: train 4.7391048255815535 test 12.452730140279542

Out[28]:
NeuralNetwork(1, [2, 2, 2, 2, 2, 2], 1)
Network was trained for 4001 iterations that took 2.1112 seconds. Final error is 0.2668623819870092.

Can you say "deep learning"?

This last example doesn't always work; its success depends a lot on good initial random weight values. Go back to the above cell and run it again and again, until you see it fail.

In [29]:
nnet = run(Xtrain, Ttrain, Xtest, Ttest, [50,10,3,1,3,10,50], nIterations=1000)
nnet

6.812155681528653e-17
Training took 0.4965 seconds.
Final RMSE: train 2.077639838020811e-08 test 21.3121065064551

Out[29]:
NeuralNetwork(1, [50, 10, 3, 1, 3, 10, 50], 1)
Network was trained for 660 iterations that took 0.4965 seconds. Final error is 1.1699338433404044e-09.