We have access to labeled training samples $(\mathbf{x}^{(i)},y^{(i)})$. Neural networks give us a way of defining a complex, highly non-linear model function $h(\mathbf{x}; W, b)$, with parameters $W$ and $b$ that we can fit to our data. Given enough parameters, this nonlinear function can approximate very complicated relationships between inputs and outputs.
Below is one example of a feedforward neural network; the name comes from the fact that the connectivity graph has no directed loops or cycles.
This "neuron" is a computational unit/node that takes an input $\mathbf{a}$, and outputs the model function $h^{\text{single} }(\mathbf{a}; \mathbf{w}, b)$ (aka activation): $$ h^{\text{single} }(\mathbf{a}; \mathbf{w}, b) = f(\mathbf{w}^{\top} \mathbf{a} + b) = f\Big(\sum_{i=1}^3 w_i a_i +b\Big) $$ The $f(\cdot)$ is called an "activation function", common choices are $\tanh$, Sigmoid and ReLU: $$ \text{ReLU} (x) = \max(0, x), \; \sigma(x) = \frac{1}{1 + e^{-x}}, \; \tanh(x) = \frac{e^{x}-e^{-x}}{e^{x}+e^{-x}} $$
A neural network is put together by hooking many of these simple "neurons" together, so that the output of one neuron can be the input of another. For example, here is a small neural network (a slice of a bigger network):
In this figure, we have used circles to also denote the inputs to the network. The circles labeled "+1" are called bias units. Layer 1 is called the input layer, and Layer 3 is the output layer (which, in this example, has only one node). The middle layer, Layer 2, is called a hidden layer, because its values are not observed in the training set.
The neural network in our example has 2 input units (not counting the bias unit), 3 hidden units, and 1 output unit.
Our neural network has parameters $(W, b) := \big(W^{(1)},b^{(1)},W^{(2)},b^{(2)}\big)$.
We write $W^{(l)} = \big(w^{(l)}_{ij}\big)$ to denote the weight matrix, where the entry $w^{(l)}_{ij}$ is associated with the connection between unit $j$ in layer $l$ and unit $i$ in layer $l+1$. Note the order of the indices: $j$ indexes the layer closer to the input, i.e., the layer this matrix acts on.
$b^{(l)}_i$ is the bias associated with unit $i$ in layer $l+1$.
In our example above, we have $W^{(1)}\in \mathbb{R}^{3\times 2}$ and $W^{(2)}\in \mathbb{R}^{1\times 3}$. Note that bias units do not have inputs or connections going into them; for convenience we write their output as the constant value $+1$. When we count the number of units in layer $l$, we do not count the bias unit.
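As a concrete (hypothetical) illustration of these shapes, the parameters of the 2-3-1 example network above could be initialized as:

import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((3, 2))  # W^{(1)}: 2 input units -> 3 hidden units
b1 = np.zeros(3)                  # b^{(1)}: one bias per unit in layer 2
W2 = rng.standard_normal((1, 3))  # W^{(2)}: 3 hidden units -> 1 output unit
b2 = np.zeros(1)                  # b^{(2)}: one bias for the output unit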
We will write $a^{(l)}_i$ to denote the activation or output value of unit $i$ in layer $l$. For $l=1$, $a^{(1)}_i= x_i$ denotes the $i$-th input to this network. Given a fixed set of parameters $(W,b)$, and the input $\mathbf{x}$, the neural network above defines a model function $h(\mathbf{x}; W, b)$ made of layers of function compositions that outputs a real number. Specifically, the computation that this neural network represents is given by:
$$\begin{aligned} a_1^{(2)} &= f\big(w_{11}^{(1)}x_1 + w_{12}^{(1)} x_2 + b_1^{(1)}\big) \\ a_2^{(2)} &= f\big(w_{21}^{(1)}x_1 + w_{22}^{(1)} x_2 + b_2^{(1)}\big) \\ a_3^{(2)} &= f\big(w_{31}^{(1)}x_1 + w_{32}^{(1)} x_2 + b_3^{(1)}\big) \\ h(\mathbf{x}; W,b) &= a_1^{(3)} = f\big(w_{11}^{(2)} a_1^{(2)} + w_{12}^{(2)} a_2^{(2)} + w_{13}^{(2)} a_3^{(2)} + b_1^{(2)}\big) \end{aligned} $$ If we allow the activation function $f(\cdot)$ to act on vectors in an element-wise fashion: $f([z_1,z_2,z_3])=[f(z_1),f(z_2),f(z_3)]$, then we can write the equations above more compactly as: $$\begin{aligned} \mathbf{z}^{(2)} &= W^{(1)} \mathbf{x} + \mathbf{b}^{(1)} \\ \mathbf{a}^{(2)} &= f(\mathbf{z}^{(2)}) \\ \mathbf{z}^{(3)} &= W^{(2)} \mathbf{a}^{(2)} + \mathbf{b}^{(2)} \\ h(\mathbf{x}; W, b) &= \mathbf{a}^{(3)} = f(\mathbf{z}^{(3)}) \end{aligned} $$ More generally, recalling that $\mathbf{a}^{(1)}=\mathbf{x}$ also denotes the values from the input layer, then given layer $l$'s activations $\mathbf{a}^{(l)}$, we can compute layer $(l+1)$'s activations $\mathbf{a}^{(l+1)}$ as: $$ \begin{aligned} \mathbf{z}^{(l+1)} &= W^{(l)} \mathbf{a}^{(l)} + \mathbf{b}^{(l)} \\ \mathbf{a}^{(l+1)} &= f(\mathbf{z}^{(l+1)}) \end{aligned} $$ By organizing the parameters in matrices and using matrix-vector operations, we can take advantage of fast linear algebra routines to quickly perform calculations in our network.
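As a minimal sketch of this recursion in NumPy (assuming W and b are lists holding $W^{(l)}$ and $\mathbf{b}^{(l)}$ for each layer, and f is a vectorized activation function), the forward pass is just a loop:

def forward(x, W, b, f):
    # a^{(1)} = x: the activations of the input layer
    a = x
    for l in range(len(W)):
        # z^{(l+1)} = W^{(l)} a^{(l)} + b^{(l)}
        z = W[l] @ a + b[l]
        # a^{(l+1)} = f(z^{(l+1)})
        a = f(z)
    return a

This sketch follows the column-vector convention of the equations above ($W^{(l)}$ multiplies $\mathbf{a}^{(l)}$ from the left); the MNIST code further below stores samples as rows and therefore multiplies in the opposite order.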
Suppose we have a fixed, labeled training set $\{ (\mathbf{x}^{(1)}, y^{(1)}), \dots, (\mathbf{x}^{(N)}, y^{(N)}) \}$ of $N$ training examples. For a single training sample and its target value $(\mathbf{x}, y)$, we define the sample loss function with respect to this single example to be: $$ J(W,b; \mathbf{x},y) = \frac{1}{2} \big| h(\mathbf{x}; W,b) - y \big|^2, $$ or, if the label is a vector, $$ J(W,b; \mathbf{x},y) = \frac{1}{2} \big\| h(\mathbf{x}; W,b) - y \big\|^2. $$
The overall loss function is then the mean of the sample losses plus a regularization term (aka a weight decay term) that tends to decrease the magnitude of the weights $w_{ij}^{(l)}$ but not the biases, and helps prevent overfitting; the $1/2$ factor is included so that taking the derivative yields a clean expression without extra constants. $$ \begin{aligned} J(W,b) &= \frac{1}{N} \sum_{i=1}^N J(W,b;\mathbf{x}^{(i)},y^{(i)}) + \frac{\epsilon}{2} \sum_{l=1}^{n_l-1} \; \sum_{i=1}^{s_l} \; \sum_{j=1}^{s_{l+1}} \big( w^{(l)}_{ji} \big)^2 \\ &= \frac{1}{N} \sum_{i=1}^N \left( \frac{1}{2} \big\| h(\mathbf{x}^{(i)}; W,b) - y^{(i)} \big\|^2 \right) + \frac{\epsilon}{2} \sum_{l=1}^{n_l-1} \; \sum_{i=1}^{s_l} \; \sum_{j=1}^{s_{l+1}} \big( w^{(l)}_{ji} \big)^2 \end{aligned} $$ where $n_l$ denotes the number of layers in the network, and $s_l$ denotes the number of nodes in layer $l$ (not counting the bias unit).
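Continuing the forward-pass sketch from above, a minimal (hypothetical) NumPy implementation of this regularized loss, with epsilon playing the role of the weight-decay strength $\epsilon$, might look like:

import numpy as np

def loss(W, b, X, y, f, epsilon):
    # mean of the per-sample squared-error losses
    data_term = np.mean([0.5 * np.sum((forward(x_i, W, b, f) - y_i) ** 2)
                         for x_i, y_i in zip(X, y)])
    # weight decay on the weights only; the biases b are not penalized
    reg_term = 0.5 * epsilon * sum(np.sum(W_l ** 2) for W_l in W)
    return data_term + reg_term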
This loss function is often used both for classification and for regression problems. For classification, we let $y=0$ or $1$ represent the two class labels (recall that the sigmoid activation function outputs values in $[0,1]$; if we were using a $\tanh$ activation function, we would instead use $-1$ and $+1$ to denote the labels). For regression problems, we first scale our outputs to ensure that they lie in the $[0,1]$ range or the $[-1,1]$ range. Most of the time, rescaling the inputs is helpful as well.
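For instance, a simple min-max rescaling of some hypothetical regression targets into $[0,1]$ could look like:

import numpy as np

y = np.array([3.0, 7.5, 12.0, 9.0])             # hypothetical regression targets
y_scaled = (y - y.min()) / (y.max() - y.min())  # now lies in [0, 1]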
We will now perform forward passes with a trained neural network: the input layer has 784 input units (28x28 grayscale images), there is one hidden layer with 256 hidden units (neurons), the output layer has 10 units (one per class), and the activation is ReLU.
import numpy as np

# first, the implementation of a vectorized ReLU
def relu(x):
    return x*(x > 0)
# X is the input; its second dimension is 784
# if X is the training samples, X.shape = (60000, 784)
# if X is a single test sample, we should reshape X so that X.shape = (1, 784)
# W holds the weights, implemented as a list so that
# W[0].shape = (784, 256), W[0] maps the input layer to the hidden layer
# W[1].shape = (256, 10), W[1] maps the output from the hidden layer to the output layer (10 classes)
# b holds the biases, also implemented as a list so that
# b[0].shape = (256,) and b[0] is the bias vector for the hidden layer
# b[1].shape = (10,) and b[1] is the bias vector for the output layer
def h(X, W, b):
    # layer 1 = input layer
    a1 = X
    # layer 1 (input layer) -> layer 2 (hidden layer)
    z2 = np.matmul(a1, W[0]) + b[0]
    # layer 2 activation
    a2 = relu(z2)
    # layer 2 (hidden layer) -> layer 3 (output layer)
    z3 = np.matmul(a2, W[1]) + b[1]
    # output layer activation
    output = relu(z3)
    return output
# load the trained weights
W = np.load('weights.npz')['weights']
b = np.load('weights.npz')['bias']
X_test = np.load('mnist_test.npz')['X']/255
y_test = np.load('mnist_test.npz')['y'].astype(int)
y_pred = np.argmax(h(X_test, W, b), axis=1) # pick the index of the most activated output unit as our prediction
print("accuracy is:", np.mean(y_pred == y_test))