We have access to labeled training samples $(\mathbf{x}^{(i)},y^{(i)})$. Neural networks give us a way of defining a complex, highly non-linear model function $h(\mathbf{x}; W, b)$, with parameters $W$ and $b$ that we can fit to our data. Given enough parameters, this nonlinear function can approximate very complicated relationships between inputs and outputs.
Below is one example of a feedforward neural network; the name comes from the fact that the connectivity graph has no directed loops or cycles.
This "neuron" is a computational unit/node that takes an input $\mathbf{a}$, and outputs the model function $h^{\text{single} }(\mathbf{a}; \mathbf{w}, b)$ (aka activation): $$ h^{\text{single} }(\mathbf{a}; \mathbf{w}, b) = f(\mathbf{w}^{\top} \mathbf{a} + b) = f\Big(\sum_{i=1}^3 w_i a_i +b\Big) $$ The $f(\cdot)$ is called an "activation function", common choices are $\tanh$, Sigmoid and ReLU: $$ \text{ReLU} (x) = \max(0, x), \; \sigma(x) = \frac{1}{1 + e^{-x}}, \; \tanh(x) = \frac{e^{x}-e^{-x}}{e^{x}+e^{-x}} $$
A neural network is put together by hooking many of these simple "neurons" together, so that the output of one neuron can be the input of another. For example, here is a small neural network (a slice of a bigger network):
In this figure, we have used circles to also denote the inputs to the network. The circles labeled "+1" are called bias units. Layer 1 is called the input layer, and Layer 3 is the output layer (which, in this example, has only one node). The middle layer, Layer 2, is called a hidden layer, because its values are not observed in the training set.
The neural network in our example has 2 input units (not counting the bias unit), 3 hidden units, and 1 output unit.
Our neural network has parameters $(W, b) := \big(W^{(1)},b^{(1)},W^{(2)},b^{(2)}\big)$.
We write $W^{(l)} = \big(w^{(l)}_{ij}\big)$ to denote the weight matrix, where the entry $w^{(l)}_{ij}$ is associated with the connection between unit $j$ in layer $l$ and unit $i$ in layer $l+1$. Note the order of the indices: $j$ indexes the layer closer to the input, i.e., the layer this matrix acts on.
$b^{(l)}_i$ is the bias associated with unit $i$ in layer $l+1$.
In our example above, we have $W^{(1)}\in \mathbb{R}^{3\times 2}$ and $W^{(2)}\in \mathbb{R}^{1\times 3}$. Note that bias units do not have inputs or connections going into them; for convenience we write their output as the constant value $+1$. When we count the number of units in layer $l$, we do not count the bias unit.
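As a concrete (hypothetical) illustration of these shapes, the parameters of the 2-3-1 example network above could be initialized as:

import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((3, 2))  # W^{(1)}: 2 input units -> 3 hidden units
b1 = np.zeros(3)                  # b^{(1)}: one bias per unit in layer 2
W2 = rng.standard_normal((1, 3))  # W^{(2)}: 3 hidden units -> 1 output unit
b2 = np.zeros(1)                  # b^{(2)}: one bias for the output unit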
We will write $a^{(l)}_i$ to denote the activation or output value of unit $i$ in layer $l$. For $l=1$, $a^{(1)}_i= x_i$ denotes the $i$-th input to this network. Given a fixed set of parameters $(W,b)$, and the input $\mathbf{x}$, the neural network above defines a model function $h(\mathbf{x}; W, b)$ made of layers of function compositions that outputs a real number. Specifically, the computation that this neural network represents is given by:
$$\begin{aligned} a_1^{(2)} &= f\big(w_{11}^{(1)}x_1 + w_{12}^{(1)} x_2 + b_1^{(1)}\big) \\ a_2^{(2)} &= f\big(w_{21}^{(1)}x_1 + w_{22}^{(1)} x_2 + b_2^{(1)}\big) \\ a_3^{(2)} &= f\big(w_{31}^{(1)}x_1 + w_{32}^{(1)} x_2 + b_3^{(1)}\big) \\ h(\mathbf{x}; W,b) &= a_1^{(3)} = f\big(w_{11}^{(2)} a_1^{(2)} + w_{12}^{(2)} a_2^{(2)} + w_{13}^{(2)} a_3^{(2)} + b_1^{(2)}\big) \end{aligned} $$ If we allow the activation function $f(\cdot)$ to act on vectors in an element-wise fashion: $f([z_1,z_2,z_3])=[f(z_1),f(z_2),f(z_3)]$, then we can write the equations above more compactly as: $$\begin{aligned} \mathbf{z}^{(2)} &= W^{(1)} \mathbf{x} + \mathbf{b}^{(1)} \\ \mathbf{a}^{(2)} &= f(\mathbf{z}^{(2)}) \\ \mathbf{z}^{(3)} &= W^{(2)} \mathbf{a}^{(2)} + \mathbf{b}^{(2)} \\ h(\mathbf{x}; W, b) &= \mathbf{a}^{(3)} = f(\mathbf{z}^{(3)}) \end{aligned} $$ More generally, recalling that $\mathbf{a}^{(1)}=\mathbf{x}$ also denotes the values from the input layer, then given layer $l$'s activations $\mathbf{a}^{(l)}$, we can compute layer $(l+1)$'s activations $\mathbf{a}^{(l+1)}$ as: $$ \begin{aligned} \mathbf{z}^{(l+1)} &= W^{(l)} \mathbf{a}^{(l)} + \mathbf{b}^{(l)} \\ \mathbf{a}^{(l+1)} &= f(\mathbf{z}^{(l+1)}) \end{aligned} $$ By organizing the parameters in matrices and using matrix-vector operations, we can take advantage of fast linear algebra routines to quickly perform calculations in our network.
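As a minimal sketch of this recursion in NumPy (assuming W and b are lists holding $W^{(l)}$ and $\mathbf{b}^{(l)}$ for each layer, and f is a vectorized activation function), the forward pass is just a loop:

def forward(x, W, b, f):
    # a^{(1)} = x: the activations of the input layer
    a = x
    for l in range(len(W)):
        # z^{(l+1)} = W^{(l)} a^{(l)} + b^{(l)}
        z = W[l] @ a + b[l]
        # a^{(l+1)} = f(z^{(l+1)})
        a = f(z)
    return a

This sketch follows the column-vector convention of the equations above ($W^{(l)}$ multiplies $\mathbf{a}^{(l)}$ from the left); the MNIST code further below stores samples as rows and therefore multiplies in the opposite order.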
Suppose we have a fixed, labeled training set $\{ (\mathbf{x}^{(1)}, y^{(1)}), \dots, (\mathbf{x}^{(N)}, y^{(N)}) \}$ of $N$ training examples. For a single training sample and its target value $(\mathbf{x}, y)$, we define the sample loss function with respect to this single example to be: $$ J(W,b; \mathbf{x},y) = \frac{1}{2} \big| h(\mathbf{x}; W,b) - y \big|^2, $$ or, if the label is a vector, $$ J(W,b; \mathbf{x},y) = \frac{1}{2} \big\| h(\mathbf{x}; W,b) - y \big\|^2. $$
The overall loss function is then the mean of the sample losses plus a regularization term (aka a weight decay term) that tends to decrease the magnitude of the weights $w_{ij}^{(l)}$ but not the biases, and helps prevent overfitting; the $1/2$ factor is included so that taking the derivative yields a clean expression without extra constants. $$ \begin{aligned} J(W,b) &= \frac{1}{N} \sum_{i=1}^N J(W,b;\mathbf{x}^{(i)},y^{(i)}) + \frac{\epsilon}{2} \sum_{l=1}^{n_l-1} \; \sum_{i=1}^{s_l} \; \sum_{j=1}^{s_{l+1}} \big( w^{(l)}_{ji} \big)^2 \\ &= \frac{1}{N} \sum_{i=1}^N \left( \frac{1}{2} \big\| h(\mathbf{x}^{(i)}; W,b) - y^{(i)} \big\|^2 \right) + \frac{\epsilon}{2} \sum_{l=1}^{n_l-1} \; \sum_{i=1}^{s_l} \; \sum_{j=1}^{s_{l+1}} \big( w^{(l)}_{ji} \big)^2 \end{aligned} $$ where $n_l$ denotes the number of layers in the network, and $s_l$ denotes the number of nodes in layer $l$ (not counting the bias unit).
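Continuing the forward-pass sketch from above, a minimal (hypothetical) NumPy implementation of this regularized loss, with epsilon playing the role of the weight-decay strength $\epsilon$, might look like:

import numpy as np

def loss(W, b, X, y, f, epsilon):
    # mean of the per-sample squared-error losses
    data_term = np.mean([0.5 * np.sum((forward(x_i, W, b, f) - y_i) ** 2)
                         for x_i, y_i in zip(X, y)])
    # weight decay on the weights only; the biases b are not penalized
    reg_term = 0.5 * epsilon * sum(np.sum(W_l ** 2) for W_l in W)
    return data_term + reg_term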
This loss function is often used both for classification and for regression problems. For classification, we let $y=0$ or $1$ represent the two class labels (recall that the sigmoid activation function outputs values in $[0,1]$; if we were using a $\tanh$ activation function, we would instead use $-1$ and $+1$ to denote the labels). For regression problems, we first scale our outputs to ensure that they lie in the $[0,1]$ range or the $[-1,1]$ range. Most of the time, rescaling the inputs is helpful as well.
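For instance, a simple min-max rescaling of some hypothetical regression targets into $[0,1]$ could look like:

import numpy as np

y = np.array([3.0, 7.5, 12.0, 9.0])             # hypothetical regression targets
y_scaled = (y - y.min()) / (y.max() - y.min())  # now lies in [0, 1]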
We will now perform forward passes with a trained neural network: the input layer has 784 input units (28x28 grayscale images), there is one hidden layer with 256 hidden units (neurons), the output layer has 10 units (one per class), and the activation is ReLU.
import numpy as np

# first, the implementation of a vectorized ReLU
def relu(x):
    return x*(x > 0)
# X is the input; its second dimension is 784
# if X is the training samples, X.shape = (60000, 784)
# if X is a single test sample, we should reshape X so that X.shape = (1, 784)
# W holds the weights, implemented as a list so that
# W[0].shape = (784, 256), W[0] maps the input layer to the hidden layer
# W[1].shape = (256, 10), W[1] maps the output from the hidden layer to the output layer (10 classes)
# b holds the biases, also implemented as a list so that
# b[0].shape = (256,) and b[0] is the bias vector for the hidden layer
# b[1].shape = (10,) and b[1] is the bias vector for the output layer
def h(X, W, b):
    # layer 1 = input layer
    a1 = X
    # layer 1 (input layer) -> layer 2 (hidden layer)
    z2 = np.matmul(a1, W[0]) + b[0]
    # layer 2 activation
    a2 = relu(z2)
    # layer 2 (hidden layer) -> layer 3 (output layer)
    z3 = np.matmul(a2, W[1]) + b[1]
    # output layer activation
    output = relu(z3)
    return output
# load the trained weights
W = np.load('weights.npz')['weights']
b = np.load('weights.npz')['bias']
X_test = np.load('mnist_test.npz')['X']/255
y_test = np.load('mnist_test.npz')['y'].astype(int)
y_pred = np.argmax(h(X_test, W, b), axis=1) # pick the index of the most activated output unit as our prediction
print("accuracy is:", np.mean(y_pred == y_test))