This is actually a particular case of the following, which stems from the definition of matrix multiplication:
If $X = \begin{pmatrix}x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}$ and $Y = \begin{pmatrix}y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}$ then $X \odot Y = Y \odot X = \begin{pmatrix}x_1 y_1 \\ x_2 y_2 \\ \vdots \\ x_n y_n \end{pmatrix} = \begin{pmatrix}x_1 & 0 & \ldots & 0\\0 & x_2 & & \vdots \\ \vdots & & \ddots & \\ 0 & \ldots & & x_n\end{pmatrix} \begin{pmatrix}y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}$
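This identity is easy to check numerically; here's a quick sketch with NumPy (the example vectors are arbitrary):

```python
import numpy as np

# Two example vectors
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

# Hadamard (elementwise) product
hadamard = x * y

# Same result as multiplying y by the diagonal matrix built from x
diag_product = np.diag(x) @ y

print(hadamard)      # [ 4. 10. 18.]
print(diag_product)  # [ 4. 10. 18.]
```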
Transforming BP2: this is the exact same reasoning.
Now, to prove the last equation, we need only repeatedly substitute $\delta^{l+1}$ in $\delta^l = \Sigma'(z^l) (w^{l+1})^T \delta^{l+1}$ with $\Sigma'(z^{l+1}) (w^{l+2})^T \delta^{l+2}$, and so on until reaching $l+1 = L$, then use $\delta^L = \sigma'(z^L) \odot \nabla_a C = \Sigma'(z^L) \nabla_a C$ to finally get:
$$\delta^l = \Sigma'(z^l) (w^{l+1})^T \Sigma'(z^{l+1}) (w^{l+2})^T \ldots \Sigma'(z^{L-1}) (w^L)^T \Sigma'(z^L) \nabla_a C$$
By definition, $\delta_j^l = \frac{\partial C}{\partial z_j^l}$.
Applying the chain rule, we can write this in terms of partial derivatives with respect to the biases of every neuron in layer $l$:
$$\delta_j^l = \sum_k \frac{\partial C}{\partial b_k^l} \frac{\partial b_k^l}{\partial z_j^l}$$
Since $b_k^l$ depends only on $z_k^l$, all terms in this sum vanish except the one where $k = j$:
$$\delta_j^l = \frac{\partial C}{\partial b_j^l} \frac{\partial b_j^l}{\partial z_j^l}$$
Now since $z^l_j = \sum_k w^l_{jk} a^{l-1}_k+b^l_j$, and therefore $b^l_j = z^l_j - \sum_k w^l_{jk} a^{l-1}_k$, we have $\frac{\partial b_j^l}{\partial z_j^l} = 1$.
This gives us BP3:
$$\frac{\partial C}{\partial b^l_j} = \delta^l_j$$
As previously, $w_{jk}^l$ affects only $z_j^l$, so let's drop the sum right away and write the chain rule as:
$$\frac{\partial C}{\partial w_{jk}^l} = \frac{\partial C}{\partial z_j^l} \frac{\partial z_j^l}{\partial w_{jk}^l}$$
By definition, $\frac{\partial C}{\partial z_j^l} = \delta_j^l$, and since $z^l_j = \sum_k w^l_{jk} a^{l-1}_k+b^l_j$, we have $\frac{\partial z_j^l}{\partial w_{jk}^l} = a_k^{l-1}$.
And so we have BP4:
$$\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j$$
For this, we have to recognize which parts of the backpropagation equations actually depend on the precise form of the activation function, and which parts don't.
Let's say our modified neuron is the $i^{th}$ neuron in the $l^{th}$ layer.
In step 2 (feedforward), we will compute everything as before, except for $a_i^l = f(z_i^l)$.
Now:
If the modified neuron is in the output layer ($l = L$), we will have to separate the different error calculations in step 3, and compute $\delta_i^L = \frac{\partial C}{\partial a_i^L} f'(z_i^L)$ for our modified neuron, and $\delta_j^L = \frac{\partial C}{\partial a_j^L} \sigma'(z_j^L)$ for the other neurons. The other steps will be unchanged.
If the modified neuron is inside the network ($l < L$), steps 3 and 5 will be unchanged, but at step 4 we will have to compute things differently for our neuron. It helps here to consider the component form of BP2, which is the equation used in this step (in its not-yet-modified form):
$$\delta^l_j = \sum_k w^{l+1}_{kj} \delta^{l+1}_k \sigma'(z^l_j)$$
But this time, we will have for our modified neuron:
$$\delta^l_i = \sum_k w^{l+1}_{ki} \delta^{l+1}_k f'(z^l_i)$$
And for the other neurons:
$$\delta^l_j = \sum_k w^{l+1}_{kj} \delta^{l+1}_k \sigma'(z^l_j)$$
With the activation function $\sigma(z) = z$, we have $\sigma'(z) = 1$ and $a_j^l = z_j^l$, which simplifies the backpropagation equations: BP1 becomes $\delta^L = \nabla_a C$ and BP2 becomes $\delta^l = (w^{l+1})^T \delta^{l+1}$, while BP3 and BP4 keep their form.
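As a sanity check before turning to the code, BP3 and BP4 can be verified numerically with finite differences on a tiny one-layer sigmoid network with a quadratic cost (the network and values below are an arbitrary toy example, not the book's code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
a_prev = rng.standard_normal(3)   # activations from the previous layer
w = rng.standard_normal((2, 3))   # weights of the layer
b = rng.standard_normal(2)        # biases of the layer
y = np.array([0.0, 1.0])          # target

def cost(w, b):
    a = sigmoid(w @ a_prev + b)
    return 0.5 * np.sum((a - y) ** 2)

# Analytic error from BP1: delta_j = dC/da_j * sigma'(z_j)
z = w @ a_prev + b
a = sigmoid(z)
delta = (a - y) * a * (1 - a)

eps = 1e-6

# BP3: dC/db_j = delta_j, checked by finite differences
for j in range(2):
    b_eps = b.copy(); b_eps[j] += eps
    numeric = (cost(w, b_eps) - cost(w, b)) / eps
    assert abs(numeric - delta[j]) < 1e-5

# BP4: dC/dw_jk = a_prev_k * delta_j
for j in range(2):
    for k in range(3):
        w_eps = w.copy(); w_eps[j, k] += eps
        numeric = (cost(w_eps, b) - cost(w, b)) / eps
        assert abs(numeric - a_prev[k] * delta[j]) < 1e-5

print("BP3 and BP4 verified numerically")
```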
The code is in the `chap2p2` directory. Executing `exec_normal.py` makes use of `network.py`, which is the original Python 3 file, except that I've added a timer. Here's the output:
Epoch 0: 9078 / 10000, elapsed time: 8.82s
Epoch 1: 9258 / 10000, elapsed time: 18.31s
Epoch 2: 9319 / 10000, elapsed time: 27.38s
...
Epoch 27: 9434 / 10000, elapsed time: 234.64s
Epoch 28: 9457 / 10000, elapsed time: 242.92s
Epoch 29: 9444 / 10000, elapsed time: 251.13s
`network_matrix.py` implements the matrix-based approach. Let's execute `exec_matrix.py`:
Epoch 0: 8216 / 10000, elapsed time: 2.59s
Epoch 1: 8365 / 10000, elapsed time: 5.05s
Epoch 2: 8375 / 10000, elapsed time: 7.49s
...
Epoch 27: 9482 / 10000, elapsed time: 67.95s
Epoch 28: 9483 / 10000, elapsed time: 70.37s
Epoch 29: 9511 / 10000, elapsed time: 72.78s
On my computer, the matrix-based approach is 3.5 times faster.
Note that the method is mathematically identical, so the differences in accuracy are due only to randomness (weight initialization and mini-batch shuffling) and shouldn't be over-interpreted.
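To illustrate where the speedup comes from, here is a minimal sketch of the idea for a single layer (with hypothetical names, not the actual contents of `network_matrix.py`): stacking the mini-batch as columns of one matrix replaces many matrix-vector products with a single matrix-matrix product.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
w = rng.standard_normal((5, 4))   # one layer: 4 inputs, 5 outputs
b = rng.standard_normal((5, 1))   # bias as a column vector

# Per-example approach: one matrix-vector product per example
batch = [rng.standard_normal((4, 1)) for _ in range(10)]
per_example = np.hstack([sigmoid(w @ x + b) for x in batch])

# Matrix-based approach: stack the mini-batch as columns of X,
# then a single matrix-matrix product handles all examples at once
# (b broadcasts across the columns)
X = np.hstack(batch)
matrix_based = sigmoid(w @ X + b)

assert np.allclose(per_example, matrix_based)
```

The results are identical; the gain comes largely from NumPy delegating one large product to optimized BLAS routines instead of performing many small ones.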