Notebook

Artificial Intelligence

Lesson 9

Neural Network from Scratch

The Bias Term
The Sigmoid Activation Function
Hidden Layer to Output
The Backwards Pass (Training)
Back Propagation
Gradient Descent
Gradient Descent for Neural Networks

***Original Tutorial by Cristian Dima:***
http://www.cristiandima.com/neural-networks-from-scratch-in-python/

OVERVIEW

Each node in the hidden layer holds a value representing an arbitrary combination of the nodes in the previous layer. For the first node in the hidden layer we can define this value as:

$z_1 = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n$

Where $n$ is the number of nodes in the input layer and $x_i$ is the value of input node $i$. Note this is nothing but a simple linear combination of all input values. We can define this in general form for each hidden layer node $k$ as:

$z_k = w_{1,k} x_1 + w_{2,k} x_2 + \cdots + w_{n,k} x_n$

We can define a matrix $W_{n \times d}$ and use it to store all our $w_{i,j}$ values as follows:

$W = \begin{pmatrix} w_{1,1} & w_{1,2} & \cdots & w_{1, d} \\ w_{2,1} & w_{2,2} & \cdots & w_{2, d} \\ \vdots & \vdots & \vdots & \vdots \\ w_{n,1} & w_{n,2} & \cdots & w_{n, d} \end{pmatrix}$

The $w_{i,j}$ values are called weights. They represent the "strength" of the connection between node $i$ (node on the left) and node $j$ (node on the right). If you look at the neural network figure in the beginning, each $w$ represents an arrow (going from node $i$ to node $j$).

This weights matrix $W$ holds a column for each node in the hidden layer (i.e. we have $d$ nodes in the hidden layer) and a row for each node in the input layer (i.e. we have $n$ nodes in the input layer). We can use each column in the weights matrix to compute those $z$ values above. Defining the input layer as a column vector $x$, we can compute all those $z$ values at once as follows:

$z = \begin{pmatrix} z_{1} \\ z_{2} \\ \vdots \\ z_{d} \end{pmatrix} = W^T x = \begin{pmatrix} w_{1,1} & w_{2,1} & \cdots & w_{n, 1} \\ w_{1,2} & w_{2,2} & \cdots & w_{n, 2} \\ \vdots & \vdots & \vdots & \vdots \\ w_{1,d} & w_{2,d} & \cdots & w_{n, d} \end{pmatrix} \begin{pmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{n} \end{pmatrix}$

Note we take the transpose of the weights matrix because we would like to do a dot product between each column in the untouched weights matrix and the column vector $x$ . Transposing $W$ allows us to express this as a simple matrix multiplication.

In Python you would do this operation as follows:

```python
import numpy as np

# n is the number of input nodes and d the number of hidden nodes
# (both must be defined before running this)
x = np.random.randn(n, 1)
W = np.random.randn(n, d)

z = W.T.dot(x)
```

```python
# Try it for yourself here and see what the resulting matrix looks like!
# Hint: remember, n and d must be defined. n is your inputs and d is your hidden nodes.
# Hint: observe what x and W look like as well.
```
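For comparison once you have tried the exercise, here is a minimal sketch with assumed sizes (`n = 3` inputs and `d = 4` hidden nodes, picked only for illustration). It checks that `W.T.dot(x)` gives the same values as computing each $z_k$ separately as a dot product between column $k$ of `W` and `x`.

```python
import numpy as np

np.random.seed(0)
n, d = 3, 4                      # assumed sizes: 3 input nodes, 4 hidden nodes

x = np.random.randn(n, 1)        # input layer as a column vector
W = np.random.randn(n, d)        # one column of weights per hidden node

z = W.T.dot(x)                   # all d values at once, shape (d, 1)

# The same values computed one hidden node at a time
z_loop = np.array([[W[:, k].dot(x[:, 0])] for k in range(d)])

print(z.shape)                   # (4, 1)
print(np.allclose(z, z_loop))    # True
```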

THE BIAS TERM

Something not mentioned until now is the bias term. A bias value is just another number we add to our $z$ value above. If you're familiar with linear regression (or just lines in general) you might have seen lines expressed as:

$y = wx + b$

That $b$ value is the bias term we are now talking about. It allows a linear function to shift. Learning the $w$ parameter allows us to change the steepness (slope) of the line we are learning. Learning a bias term as well, allows us to also shift the function up or down and thus produce a better model for our data.

In our case we simply produce another column vector $b$ and add it to $z$.

The code for this would be as follows (one step different from the above):

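```python
import numpy as np

# n input nodes and d hidden nodes, as before (both must be defined first)
x = np.random.randn(n, 1)
W = np.random.randn(n, d)
# b is a column vector of shape (d, 1) so it lines up with z;
# a flat vector of shape (d,) would broadcast the sum to shape (d, d)
b = np.random.randn(d, 1)

z = W.T.dot(x) + b
```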

We'll use this same concept and apply it to the rest of our neural network. Instead of looping through all the records in our dataset, we will construct a 2d matrix with all the records and feed it into the network using the dot product. For example, if we have 100 records with 5 features each, we will have a 100x5 2d matrix being fed through our network.
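As a rough sketch of the batched version, the block below feeds 100 records with 5 features each through a single layer. The hidden size `d = 4` and the random data are assumptions made for illustration. With one record per row, the bias can be a flat vector of length `d`, which NumPy broadcasts across every row.

```python
import numpy as np

np.random.seed(0)
s, n, d = 100, 5, 4           # 100 records, 5 features; d = 4 is an assumed hidden size

X = np.random.randn(s, n)     # one record per row
W = np.random.randn(n, d)
b = np.random.randn(d)        # one bias per hidden node

Z = X.dot(W) + b              # b is broadcast across all 100 rows
print(Z.shape)                # (100, 4)

# Row 0 of Z matches the single-record computation for record 0
z0 = W.T.dot(X[0]) + b
print(np.allclose(Z[0], z0))  # True
```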

THE SIGMOID ACTIVATION FUNCTION

The sigmoid function is defined as:

$\sigma(x) = \frac{1}{1 + e^{-x}}$

This function produces values between 0 and 1 and has an s-shaped plot. It looks like this:

You can define the sigmoid function in Python like so:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))
```

Numpy will do the computations elementwise so you can pass the sigmoid function a vector, a matrix, or even just a number (a scalar), and it will work fine.
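A quick way to see the elementwise behaviour is to call the function on a scalar, a vector, and a matrix. The input values here are arbitrary.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

print(sigmoid(0))                           # 0.5
print(sigmoid(np.array([-2.0, 0.0, 2.0])))  # approximately [0.119, 0.5, 0.881]
print(sigmoid(np.zeros((2, 3))).shape)      # (2, 3)
```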

This activation function will be applied to the $Z$ matrix we defined above to create the "activation" values of each neuron.

Thus we have finally arrived at the end of the first step in the forward pass, the input to hidden computation. This will be expressed as follows (again keep in mind the quick note above regarding the bias term):

$A = \sigma(Z) = \sigma(XW + b)$

and in Python:

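```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# s samples, n input features, d hidden nodes (all must be defined first)
X = np.random.randn(s, n)

W = np.random.randn(n, d)
b = np.random.randn(d)  # broadcast across the s rows of X.dot(W)

Z = X.dot(W) + b
A = sigmoid(Z)
```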

The $A$ matrix will hold the activation values for each node in the hidden layer across all samples. This is in fact the computed hidden layer. These computed values will move forward in the network.

Note the sigmoid activation function is just one popular function used in neural networks. Other popular functions are the hyperbolic tangent function, tanh, and the ReLU function (used a lot in convolutional neural networks), but there can be others. These functions are used a lot because they have some nice mathematical properties (easy to take the derivative of) and because they produce good empirical results.
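For reference, here is a minimal sketch of the two other activation functions mentioned above, together with their derivatives. The function names are chosen for this example, and returning 0 for the ReLU derivative at exactly zero is a common convention rather than something the text specifies.

```python
import numpy as np

def tanh(x):
    return np.tanh(x)

def tanh_grad(x):
    # d/dx tanh(x) = 1 - tanh(x)^2
    return 1 - np.tanh(x) ** 2

def relu(x):
    return np.maximum(0, x)

def relu_grad(x):
    # 0 for negative inputs, 1 for positive inputs (0 at x == 0 by convention)
    return (x > 0).astype(float)

x = np.array([-2.0, 0.0, 3.0])
print(tanh(x))       # values between -1 and 1
print(relu(x))       # [0. 0. 3.]
print(relu_grad(x))  # [0. 0. 1.]
```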

HIDDEN LAYER TO OUTPUT

Up until now, we've covered input to hidden layer, but the real results come from the output layer. After all, we require a prediction from our network right?! Let's take a look at how this is done.

Now that we have computed the hidden layer it's time to move forward in the network and arrive at the final layer, the output layer.

The computations done at this step are very much similar to the ones done in the previous step. There are only two differences. The first is that now instead of the input data $X$ we are working with the hidden layer values $A$ and the second is that the activation function is a bit different (and not like any of the others we have mentioned in passing either).

What we are trying to compute

Our hope is that once it is trained, our neural network will be able to make accurate predictions. Given some observed data we would like to predict to which class the item described by that data belongs. In our houses example we might want to place houses in categories such as higher class, middle class, or lower class. If we are working with images of people we might want to guess to which person the image belongs. These are all classification problems, which neural networks are quite good at solving.

A one-hot encoding (or vector) is what we use to numerically represent a class. A one-hot vector is a vector which has a single element set to one and the remaining elements set to zero. An example of one such vector:

$y = \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix}$

If we have three classes to which our data belongs we can represent these classes with the following matrix:

$Y = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}$

Each row in this matrix represents a distinct class. Whenever we observe data which belongs to the first class we simply add another [1, 0, 0] row to our observed values. If we observe data belonging to the third class we add another [0, 0, 1] row to our observed values and so on.
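One possible way to build such a target matrix in NumPy is to index the rows of an identity matrix with integer class labels. The labels below are made up for illustration, and the name `T` for the target matrix is an assumption of this sketch.

```python
import numpy as np

labels = np.array([0, 2, 1, 0])  # hypothetical class indices for four samples
c = 3                            # number of classes

T = np.eye(c)[labels]            # row i is the one-hot vector for labels[i]
print(T)
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]
#  [1. 0. 0.]]

# Recover the class index from each one-hot row
print(T.argmax(axis=1))          # [0 2 1 0]
```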

In the end we would like our neural network to produce vectors which are as close as possible to these target one-hot vectors. So if we have some features, and we know they describe an item belonging to the first class, we would like our neural network to produce an output layer having values looking something like:

$y = \begin{pmatrix} 0.98 \\ 0.01 \\ 0.01 \end{pmatrix}$

This vector is very close to the targeted [1, 0, 0] vector of interest. To obtain values looking like that we will use the softmax activation function.

Softmax activation

To compute the values associated with the output nodes we will first use the same linear combination we used in the previous step. Given the hidden layer activation values $A$, a weights matrix $W^{(2)}$ describing the connections between the hidden layer and the output layer, and a bias vector $b^{(2)}$, we can compute a new set of $Z^{(2)}$ values like so:

$Z^{(2)} = AW^{(2)} + b^{(2)}$

In code:

```python
import numpy as np

# A computed in the previous step is of shape s x d
# W2 is of shape d x c
# i.e. d hidden nodes and c output classes
W2 = np.random.randn(d, c)
b2 = np.random.randn(c)

Z2 = A.dot(W2) + b2
```

Given a $K$ dimensional vector $x$ , the softmax activation function is defined as:

$\mathrm{softmax}(x)_j = \frac{e^{x_j}}{\sum_{k=1}^{K} e^{x_k}}$ for $j = 1, \cdots, K$

In other words, this function takes a vector and squashes (or normalizes) each number inside the vector to a value between 0 and 1. If you ignore the exponentiation for a bit, all this function does is divide each vector element by the sum of all vector elements. It does this, however, after first exponentiating each vector element, which makes every term positive. This produces a new vector of elements between 0 and 1 that sum to 1.

In essence, this softmax function produces a probability distribution. Given a set of numbers it will assign higher probabilities to the higher numbers and lower probabilities to the lower numbers. A numerical example:

$\mathrm{softmax}( \begin{pmatrix} 5 \\ 1 \\ 1 \end{pmatrix} ) = \begin{pmatrix} 0.96466316 \\ 0.01766842 \\ 0.01766842 \end{pmatrix}$

$0.96466316 + 0.01766842 + 0.01766842 \approx 1$

In Python you can define this as follows:

```python
import numpy as np

def softmax(A):
    # A holds one sample per row, so we normalize across each row
    expA = np.exp(A)
    return expA / expA.sum(axis=1, keepdims=True)
```

Note how the softmax result looks a lot like the result we said we would like the neural network to produce. The idea is that if the neural network produces values which are good in magnitude (i.e. larger numbers for correct classes and smaller numbers for incorrect ones) the softmax function will squash those values to something looking a lot like the one-hot vector we want to predict.
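One practical caveat: `np.exp` overflows for large inputs, which turns the plain softmax output into `nan`. A common variant, which is not part of the original tutorial, subtracts the row-wise maximum before exponentiating. The result is mathematically the same because the extra factor cancels between numerator and denominator. The name `softmax_stable` is an assumption of this sketch.

```python
import numpy as np

def softmax(A):
    expA = np.exp(A)
    return expA / expA.sum(axis=1, keepdims=True)

def softmax_stable(A):
    # Subtracting the row-wise maximum leaves the result unchanged
    # but keeps np.exp from overflowing
    shifted = A - A.max(axis=1, keepdims=True)
    expA = np.exp(shifted)
    return expA / expA.sum(axis=1, keepdims=True)

A = np.array([[5.0, 1.0, 1.0]])
print(softmax(A))           # approximately [[0.96466316 0.01766842 0.01766842]]
print(np.allclose(softmax(A), softmax_stable(A)))  # True

big = np.array([[1000.0, 1.0, 1.0]])
print(softmax_stable(big))  # approximately [[1. 0. 0.]]
```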

The final neural network output computation in code:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def softmax(A):
    expA = np.exp(A)
    return expA / expA.sum(axis=1, keepdims=True)

# a 3 x 4 x 3 network example
s = 5 # five samples
n = 3 # three features per sample
d = 4 # four nodes in the hidden layer
c = 3 # three classes to predict

X = np.random.randn(s, n)

# input to hidden layer
W = np.random.randn(n, d)
b = np.random.randn(d)

Z = X.dot(W) + b
A = sigmoid(Z)

# hidden layer to output
W2 = np.random.randn(d, c)
b2 = np.random.randn(c)

Z2 = A.dot(W2) + b2
Y = softmax(Z2)
```

THE BACKWARD PASS (TRAINING)

After we compute the first forward pass of our neural network the results are gonna be quite bad. Random weights will produce random results. However, before we figure out how to improve our network we need to figure out how wrong we are in our computations. We do this via the loss (or error) function. This function quantifies how bad our current results are.

The negative log likelihood

If we have $s$ training samples and $c$ classes then the loss for our prediction $y$ with respect to the true labels $t$ is given by:

$L(t, y) = - \frac{1}{s}\sum_{i=1}^{s} \sum_{j=1}^{c} t_{i,j} \log y_{i,j}$

While the function may seem somewhat complicated it is actually not doing very much. It is taking an average across samples of the product between the log of our predicted values and the target values. If that seems a bit like a mouthful let's read it from right to left.

The $t \times \log(y)$ part multiplies the log of our predicted values by the target values. If you ignore the log part for a bit, what this does is an element-wise multiplication of two vectors which, if you remember from the previous sections, should look something like $t = [1, 0, 0]$ and, hopefully, $y = [0.98, 0.01, 0.01]$. Because $t$ is one-hot, only the term for the correct class survives the sum.

If our prediction is very close to 1 then log of that number will be very close to zero which means our error for that particular case will be very close to zero. If we maintain this performance across samples then the average error is also going to be very close to zero.

The minus sign at the beginning is there because the log of a number between zero and one is negative and we would like to work with positive values so as to minimize the function (i.e. go from values above zero to values as close as possible to zero).
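A minimal sketch of this loss in NumPy is shown below. The function name `nll_loss` and the example predictions are made up for illustration, and the sketch assumes the predictions contain no exact zeros (the log would be undefined there).

```python
import numpy as np

def nll_loss(T, Y):
    # T: one-hot targets, Y: predicted probabilities, both of shape (s, c)
    # Assumes Y has no exact zeros
    s = T.shape[0]
    return -(T * np.log(Y)).sum() / s

T = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])

good = np.array([[0.98, 0.01, 0.01],
                 [0.01, 0.98, 0.01]])
bad = np.array([[0.10, 0.45, 0.45],
                [0.45, 0.10, 0.45]])

print(nll_loss(T, good))  # about 0.02
print(nll_loss(T, bad))   # about 2.3
```

Confident, correct predictions give a loss close to zero, while predictions that put little probability on the true class give a much larger loss.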

BACK PROPAGATION

Finally we reach one of the more interesting parts of a neural network: the learning process. At this point we have one forward pass done, and we can compute how bad our neural network is using the negative log likelihood function. It's time to change our parameters so that on the next forward pass the neural network does better.

The backpropagation step involves the propagation of the neural network's error back through the network. Based on this error the neural network's weights can be updated so that they become better at minimizing the error.

This is the more math heavy part of a neural network. I will cover the math to some degree but I will add many links from across the web which cover this in more detail.

GRADIENT DESCENT

Gradient descent is an algorithm for iteratively finding the minimum of a function. Starting from a random point the algorithm takes small steps in the opposite direction of the gradient of the function at that point until it eventually reaches the minimum value of that function. This may seem a bit complicated but it's not that bad.

In math notation, if we are given a continuous function $f(x)$ and an initial random $x_0$ value, then we can find a "better" $x_1$ value (better in that the value of $f$ at $x_1$ is smaller, i.e. we are getting closer to the minimum) by taking a new $x_1$ as follows:

$x_{n} = x_{n-1} - \alpha \frac{\partial f}{\partial x_{n-1}}$

where $\alpha$ is called the learning rate and usually takes small values, on the order of $10^{-1}$ or less.

Note that the algorithm is in no way guaranteed to find the global minimum of the function (in fact it is highly unlikely that it will do so if the function has multiple local minima), nor is it guaranteed to even find a solution at all. While the expectation is that the solution will improve at every step it is possible for the solutions to diverge and for the function to take ever increasing values (for example, this often happens when the learning rate is set too high).
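To make the divergence case concrete, here is a small sketch on the one-dimensional function $f(x) = x^2$, chosen for this illustration. Each update multiplies $x$ by $(1 - 2\alpha)$, so the iterates shrink when that factor has magnitude below 1 and grow when it is above 1. The helper name `descend` and the two learning rates are assumptions of the example.

```python
f = lambda x: x ** 2
f_grad = lambda x: 2 * x

def descend(alpha, steps=20, x0=1.0):
    # Plain gradient descent starting from x0
    x = x0
    for _ in range(steps):
        x = x - alpha * f_grad(x)
    return x

small = descend(alpha=0.1)  # each step multiplies x by 0.8
large = descend(alpha=1.1)  # each step multiplies x by -1.2

print(f(small))  # far below the starting value f(1.0) = 1
print(f(large))  # far above 1: the iterates diverge
```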

Let's take a quick look at a simple numerical example to see this computation in practice. Let $f$ be the function

$f(x, y) = x^2 + y^2$

then

$\frac{\partial f}{\partial x} = 2x$

and

$\frac{\partial f}{\partial y} = 2y$

This function has a global minimum at $f(0,0) = 0$ . In code, we can define this function, along with the gradient descent computations, as follows:

```python
import numpy as np

f = lambda x: (x**2).sum()
f_grad = lambda x: 2*x

np.random.seed(0)

# random 2 dimensional vector
# with seed 0 it takes the values [0.55, 0.72] (rounded)
x = np.random.rand(2, 1)

# learning rate
alpha = 0.01

for i in range(1000):
    x = x - alpha * f_grad(x)

print(x)     # approximately (0, 0)
print(f(x))  # around 2.3e-18 so approximately 0
```

GRADIENT DESCENT FOR NEURAL NETWORKS

Applying gradient descent to our neural network is somewhat more involved in terms of the calculus required but the basic principles are the same. We have a loss function defined and the parameters of this function are the weights and biases of our neural network. So we need to find the weights of our neural network such that the value of our loss function is minimized. Just like we did in the simple example above what we have to do now is take the gradient of our loss function with respect to our parameters.

There are four derivatives we need to compute: $\frac{\partial L}{\partial W^{(2)}}, \frac{\partial L}{\partial b^{(2)}}, \frac{\partial L}{\partial W}, \frac{\partial L}{\partial b}$ . These have the following formulae:

$\frac{\partial L}{\partial W^{(2)}} = \frac{\partial L}{\partial Z^{(2)}} \frac{\partial Z^{(2)}}{\partial W^{(2)}}$

This is the derivative of our loss function with respect to our second set of weights, the ones corresponding to the hidden to output connections. On the right hand side we see it expressed via the chain rule. We will denote the first derivative in the chain as $\delta^{(2)} = \frac{\partial L}{\partial Z^{(2)}}$

The second derivative is much easier to compute. Remember that $Z^{(2)} = AW^{(2)} + b^{(2)}$. Taking the derivative of $Z^{(2)}$ with respect to $W^{(2)}$ leaves us with $A$.

So what about that $\delta^{(2)}$? You can think of it as the "error" of the nodes in its layer (our output layer in this case). Denoting that derivative as $\delta^{(2)}$ is also useful because it allows us to express derivatives across layers in a recursive fashion. We will see this in action later. Andrew Ng's famous Coursera course is a great resource on how this works. I will link further resources regarding these derivations at the end of this section but for now let's just see the math formulas we have to implement in code.

$\delta^{(2)} = Y - T$

$\frac{\partial L}{\partial W^{(2)}} = A^{T} \delta^{(2)}$

The only difference is we are using matrix operations to express all the computations at once.

At this same layer we also have:

$\frac{\partial L}{\partial b^{(2)}} = \delta^{(2)}$

Taking the derivative of $Z^{(2)}$ with respect to $b^{(2)}$ leaves us with $1$. So then:

$\frac{\partial L}{\partial b^{(2)}} = \frac{\partial L}{\partial Z^{(2)}} \frac{\partial Z^{(2)}}{\partial b^{(2)}} = \delta^{(2)}$

Moving on to the first set of weights, we have ( $\odot$ denoting element wise multiplication):

$\delta^{(1)} = \delta^{(2)} W^{(2)T} \odot \sigma^{'}(Z)$

$\delta^{(1)} = \delta^{(2)} W^{(2)T} \odot \sigma(Z) \odot (1 - \sigma(Z))$

$\delta^{(1)} = \delta^{(2)} W^{(2)T} \odot A \odot (1 - A)$

and

$\frac{\partial L}{\partial W} = X^{T} \delta^{(1)}$

$\frac{\partial L}{\partial b} = \delta^{(1)}$
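The step from $\sigma^{'}(Z)$ to $A \odot (1 - A)$ relies on the identity $\sigma^{'}(z) = \sigma(z)(1 - \sigma(z))$. As a sanity check, the sketch below compares that expression against a central finite-difference estimate at a few arbitrary points. The step size `eps` is an assumption of this example.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

z = np.linspace(-4, 4, 9)
a = sigmoid(z)

analytic = a * (1 - a)  # derivative written in terms of the activation
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)  # central difference

print(np.allclose(analytic, numeric))  # True
```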

Finally, to put it all in one place, what we are working with is the following:

$\delta^{(2)} = Y - T$

$\delta^{(1)} = \delta^{(2)} W^{(2)T} \odot A \odot (1 - A)$

$\frac{\partial L}{\partial W^{(2)}} = A^{T} \delta^{(2)}$

$\frac{\partial L}{\partial b^{(2)}} = \delta^{(2)}$

$\frac{\partial L}{\partial W} = X^{T} \delta^{(1)}$

$\frac{\partial L}{\partial b} = \delta^{(1)}$
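As a hedged sketch of how these formulas might look in NumPy, the block below computes the four gradients for the 3 x 4 x 3 example, checks one of them against a finite-difference estimate, and runs a few gradient descent steps. Two details are assumptions added here to make the matrix form agree with the averaged loss: $\delta^{(2)}$ is divided by the number of samples $s$, and the bias gradients are summed over the samples. The function names, the random targets `T`, and the learning rate are also made up for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def softmax(A):
    expA = np.exp(A)
    return expA / expA.sum(axis=1, keepdims=True)

def forward(X, W, b, W2, b2):
    A = sigmoid(X.dot(W) + b)
    Y = softmax(A.dot(W2) + b2)
    return A, Y

def loss(T, Y):
    return -(T * np.log(Y)).sum() / T.shape[0]

def gradients(X, T, A, Y, W2):
    s = X.shape[0]
    delta2 = (Y - T) / s                       # 1/s comes from the average in the loss
    delta1 = delta2.dot(W2.T) * A * (1 - A)
    return {
        "W2": A.T.dot(delta2),
        "b2": delta2.sum(axis=0),              # bias gradients summed over samples
        "W": X.T.dot(delta1),
        "b": delta1.sum(axis=0),
    }

np.random.seed(0)
s, n, d, c = 5, 3, 4, 3                        # same sizes as the forward pass example
X = np.random.randn(s, n)
T = np.eye(c)[np.random.randint(0, c, size=s)] # random one-hot targets (made up)
W, b = np.random.randn(n, d), np.random.randn(d)
W2, b2 = np.random.randn(d, c), np.random.randn(c)

A, Y = forward(X, W, b, W2, b2)
grads = gradients(X, T, A, Y, W2)

# Finite-difference check on one entry of W
eps = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[0, 0] += eps
W_minus[0, 0] -= eps
numeric = (loss(T, forward(X, W_plus, b, W2, b2)[1])
           - loss(T, forward(X, W_minus, b, W2, b2)[1])) / (2 * eps)
print(np.isclose(numeric, grads["W"][0, 0]))   # True

# A few gradient descent steps
alpha = 0.1
initial_loss = loss(T, Y)
for _ in range(200):
    A, Y = forward(X, W, b, W2, b2)
    g = gradients(X, T, A, Y, W2)
    W, b = W - alpha * g["W"], b - alpha * g["b"]
    W2, b2 = W2 - alpha * g["W2"], b2 - alpha * g["b2"]

final_loss = loss(T, forward(X, W, b, W2, b2)[1])
print(final_loss < initial_loss)               # True
```

Dropping the `1 / s` factor only rescales the gradients, which has the same effect as using a larger learning rate.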

Before moving on with this article here are a number of links which cover these derivations in some greater detail:

Ng's Machine Learning Course link
Backpropagation with softmax cross entropy link
Derivative of softmax loss function link
Implementing a neural network from scratch link
Vector, Matrix, and Tensor Derivatives link
How the backpropagation algorithm works link
Principles of training multi-layer neural network using backpropagation link
