Notebook

5. Neural Networks¶

Table of Contents¶

5.1 Feed-forward Network Functions
5.2 Network Training
- 5.2.1 Parameter optimization
- 5.2.2 Local quadratic approximation
- 5.2.4 Gradient descent optimization
5.3 Error Backpropagation
5.5 Regularization in Neural Networks
- 5.5.2 Early Stopping
- 5.5.3 Invariances
- 5.5.6 Convolutional networks
5.6 Mixture Density Networks

In [15]:

import random
import numpy as np
import matplotlib.pyplot as plt
from prml import nn
from prml.linear import Perceptron, LogisticRegression
from prml.preprocessing import OneHotEncoder
from prml.datasets import generate_toy_data, load_planar_dataset, plot_2d_decision_boundary, load_mnist_dataset
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Set random seed to make deterministic
np.random.seed(0)

# Ignore zero divisions and computation involving NaN values.
np.seterr(divide="ignore", invalid="ignore")

# Enable higher resolution plots
%config InlineBackend.figure_format = 'retina'

# Enable autoreload all modules before executing code
%reload_ext autoreload
%autoreload 2

5.1 Feed-forward Network Functions ¶

The linear models discussed in previous chapters are based on linear combinations of fixed (non)linear basis functions $\phi_j(\mathbf{x})$ and take the form,

$y(\mathbf{x},\mathbf{w}) = f\Bigg(\sum_{j=1}^M w_j\phi_j(\mathbf{x})\Bigg)$

where $f(\cdot)$ is a nonlinear activation function in the case of classification and the identity in the case of regression. Although such models have useful analytical properties, they are limited by the curse of dimentionality, and they need to adapt the basis functions to the data for large-scale problems. An alternative is to use a predefined number of basis functions but allow them to be adaptive during training. Thus, an extension of the model above is making the basis functions $\phi_j(\mathbf{x})$ depend on parameters and then adjust them along the coefficients $\{w_j\}$ , during training.

Neural networks use basis functions that follow the same form, that is, each basis function is itself a nonlinear function of a linear combination of the inputs, where the coefficients are adapative parameters. Thus, the basic neural network model is described as a series of functional transformations.

Given the input variables $x_1,\dots,x_D$ , we construct $M$ linear combinations in the form:

$a_j = \sum_{i=1}^D w_{ji}^{(1)}x_i + w_{j0}^{(1)}$

where $j=1,\dots,M$ , and the superscript $(1)$ indicates the corresponding parameters of the first layer of the network. The input layer often uses the superscript $(0)$ . The quantities $a_j$ are known as activations and each of them is transformed using a differentiable nonlinear activation function $h(\cdot)$ to give

$z_j = h(a_j)$

these correspond to the outputs of the basis functions, and in the context of neural networks are called hidden units. The following figure presents the two step computation of a single unit or neuron.

Following the same procedure, these output values from the first layer, are linearly combined again to give,

$a_k = \sum_{j=1}^M w_{ki}^{(2)}z_j + w_{k0}^{(2)}$

where $k=1,\dots,K$ . This transformation corresponds to the second layer of the network. These output activations are transformed again using an appropriate activation function $h$ to give a set of outputs $y_k$ . The following figure depicts the entire process for a $2$ -layer network.

It is called a $2$ -layer neural network because there are two layers of adaptive weights.

The output unit activation function is determined by the nature of the data and the assumed distribution of target variables. Thus, for regression problems, the activation function can be the identity, so that $y_k=\alpha_k$ , and for classification, the output uses a sigmoid, or softmax function, so that $y_k=\sigma(\alpha_k)$ .

We may combine these stages to obtain the overall network function (using a sigmoid output unit), as follows,

$y_k(\mathbf{x},\mathbf{w}) = \sigma\Bigg(\sum_{j=1}^M w_{kj}^{(2)}h\Bigg(\sum_{i=1}^D w_{ji}^{(1)}x_i + w_{j0}^{(1)}\Bigg) + w_{k0}^{(2)} \Bigg)$

where biases on each layer can be absorbed into the set of weight parameters by defining an additional input variable $x_0=1$ .

Important Notes:

The key difference compared to the perceptron is that the neural network uses continuous sigmoidal non-linear hidden units, whereas the perceptron uses step-function non-linearities. This means that the neural network function is differentiable which plays a central role in network training.
If the activation functions of all hidden units are taken to be linear, then any such network can be replaced by a network without the hidden units. This follows from the fact that the composition of successive linear transformations is itself a linear transformation.

In order to study these claims, lets consider a dataset, in which, the classes are separated by a non-linear decision boundary. Therefore, a simple linear model, without performing any sophisticated feature engineering, cannot capture.

In [5]:

x, y = load_planar_dataset()
plt.scatter(x[:, 0], x[:, 1])
plt.title("Planar dataset")
plt.xlabel("$X_1$")
plt.ylabel("$X_2$")
plt.show()

print(f"The shape of X is: {str(x.shape)}")
print(f"The shape of Y is: {str(y.shape)}")
print(f"We have {x.shape[1]} training examples!")

The shape of X is: (400, 2)
The shape of Y is: (400, 1)
We have 2 training examples!

Training either a simple perceptron or a logistic regression classifier, similar to the ones presented in Chapter 4, only manages to learn an insufficient linear decision boundary that cannot capture the underlying data distribution.

In [6]:

classifier = Perceptron()
classifier.fit(x, np.squeeze(y))
plot_2d_decision_boundary(lambda x: classifier.predict(x) > 0.5, x, y)

classifier = LogisticRegression()
classifier.fit(x, np.squeeze(y))
plot_2d_decision_boundary(lambda x: classifier.predict(x) > 0.5, x, y)

Linear Algebra Notation¶

For efficiency reasons, in an actual implementation, its more convinient to represent the activations in some $\ell$ -layer as an $M$ -dimensional vector,

$\mathbf{a}^{(\ell)} = \begin{bmatrix} a_1^{(\ell)} \\ a_2^{(\ell)} \\ \vdots \\ a_M^{(\ell)} \end{bmatrix} = \begin{bmatrix} \mathbf{w}_{1}^{(\ell)T}\mathbf{z}^{(\ell-1)} + w_{10}^{(\ell)} \\ \mathbf{w}_{2}^{(\ell)T}\mathbf{z}^{(\ell-1)} + w_{20}^{(\ell)} \\ \vdots \\ \mathbf{w}_{M}^{(\ell)T}\mathbf{z}^{(\ell-1)} + w_{M0}^{(\ell)} \\ \end{bmatrix} = \begin{bmatrix} \mathbf{w}_{1}^{(\ell)T} \\ \mathbf{w}_{2}^{(\ell)T} \\ \vdots \\ \mathbf{w}_{M}^{(\ell)T} \\ \end{bmatrix} \mathbf{z}^{(\ell-1)} + \mathbf{w}_{0}^{(\ell)} = \mathbf{W}^{(\ell)}\mathbf{z}^{(\ell-1)} + \mathbf{w}_{0}^{(\ell)}$

In terms of matrix dimensions, we have $(M \times D) (D \times 1) + (M \times 1)$ . Then, the activation function $h(\cdot)$ is applied on $\mathbf{a}^{(\ell)}$ to obtain,

$\mathbf{z}^{(\ell)} = h^{(\ell)}(\mathbf{a}^{(\ell)})$

This can also be generalized across $N$ training examples $\mathbf{x}_i$ , by stacking them in columns, creating a matrix $\mathbf{X}$ of dimensions $(D \times N)$ . Then, we obtain,

$\mathbf{W}^{(\ell)}\mathbf{Z}^{(\ell-1)} + \mathbf{w}_{0}^{(\ell)}$

and

$\mathbf{Z}^{(\ell)} = h^{(\ell)}(\mathbf{A}^{(\ell)})$

where $\mathbf{Z}^{(0)} = \mathbf{X}$ .

Activation Functions¶

In the general case, we can use as an activation function any non-linear $h(\mathbf{z})$ . Some popular choices are the following non-linear functions.

The sigmoid function or $\sigma(z) = \frac{1}{1 + e^{-z}}$ . Mostly used in the output layer for representing class probability. In the case of multi-class problems, the softmax activation function is used instead of the sigmoid.
The hyperbolic tangent or $\mathrm{tanh}(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$ . Note that the hyperbolic tangent is a shifted version of the sigmoid function, that crosses zero point and rescales so that it ranges from $-1$ to $1$ . That means that it has the effect of centering the data around zero which is a desired property for learning algorithms.

However, a downside of both the sigmoid and the hyperbolic tangent is that as $z$ gets very large or very small the slope of the function gets close to zero thus slowing down gradient descent.

The rectified linear unit or $\mathrm{relu}(z) = \max(0, z)$ . Thus, the derivative is $1$ as long as $z>0$ and $0$ when $z\leq0$ .
The leaky rectified linear unit or $\mathrm{leaky relu}(z) = \max(0.01z, z)$ , attempts to improves upon the dying ReLU problem, which is that all negative input values become zero immediately. Another variation is the parametric ReLU, which simply makes $0.01$ a parameter, i.e., $\max(\alpha z, z)$ .

Rectified linear unit actication functions, overcome the problem of vanishing gradients, and thus, they enable much faster training of neural networks.

The exponential linear unit (ELU) or $\mathrm{elu}(z) = \begin{cases} z & \quad z \geq 0 \\ \alpha(e^z -1 ) & \quad z < 0 \end{cases}$ uses a log curve to define the negative values unlike the parametric ReLU functions that use a straight line.
The self-gated activation function (Swish) or $\mathrm{swish}(z) = z\sigma(z)$ is a smooth function that does not abruptly change direction $x = 0$ . Rather, it smoothly bends from 0 towards values < 0 and then upwards again. It consistently matches or outperforms the ReLU activation functions.

In [7]:

import numpy as np


def sigmoid(z: np.ndarray) -> float:
    return 1 / (1 + np.exp(-z))


z = np.linspace(-5, 5)

plt.figure(figsize=(10, 4))

plt.subplot(2, 3, 1)
plt.tight_layout()
plt.plot(z, sigmoid(z))
plt.plot(z, sigmoid(z) * (1 - sigmoid(z)), "k", linestyle="dotted")
plt.xlabel("z")
plt.ylabel("$\sigma(z)$")
plt.legend(["sigmoid", "derivative"])

plt.subplot(2, 3, 2)
plt.tight_layout()
plt.plot(z, np.tanh(z), "r")
plt.plot(z, 1 - np.tanh(z) * np.tanh(z), "k", linestyle="dotted")
plt.xlabel("z")
plt.ylabel("$\mathtt{tanh}(z)$")
plt.legend(["tanh", "derivative"])

plt.subplot(2, 3, 3)
plt.tight_layout()
plt.plot(z, np.maximum(0, z), "g")
plt.plot(z, np.where(z > 0, 1, 0), "k", linestyle="dotted")
plt.xlabel("z")
plt.ylabel("$\mathtt{relu}(z)$")
plt.legend(["ReLU", "derivative"])

plt.subplot(2, 3, 4)
plt.tight_layout()
plt.plot(z, np.maximum(0.01 * z, z), "b")
plt.plot(z, np.where(z > 0, 1, 0.01), "k", linestyle="dotted")
plt.xlabel("z")
plt.ylabel("$\mathtt{leakyrelu}(z)$")
plt.legend(["Leaky ReLU", "derivative"])

plt.subplot(2, 3, 5)
plt.tight_layout()
plt.plot(z, np.where(z > 0, z, np.exp(z) - 1), "y")
plt.plot(z, np.where(z > 0, 1, np.where(z > 0, z, np.exp(z) - 1) + 1), "k", linestyle="dotted")
plt.xlabel("z")
plt.ylabel("$\mathtt{elu}(z)$")
plt.legend(["ELU (a=1)", "derivative"])

plt.subplot(2, 3, 6)
plt.tight_layout()
plt.plot(z, z * sigmoid(z), "m")
plt.plot(z, sigmoid(z) + z * sigmoid(z) * (1 - sigmoid(z)), "k", linestyle="dotted")
plt.xlabel("z")
plt.ylabel("$\mathtt{swish}(z)$")
plt.legend(["Swish", "derivative"])

plt.show()

5.2 Network Training¶

Regression¶

A simple approach to the problem of determining the network parameters is to revisit the discussion of polynomial curve fitting, and attempt to minimize a sum-of-squares error function. Thus, given a training set comprising a set of input vectors $\{\mathbf{x}_n\}$ , where $n=1,\dots,N$ , and a corresponding set of target vectors $\{\mathbf{t}_n\}$ , we minimize the error function,

$E_n(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^N ||\mathbf{y}(\mathbf{x}_n, \mathbf{w}) - \mathbf{t}_n)||^2$

Consider regression problems, and for the moment, a single target variable $t$ that may take any real value. Similar to Section 3.1, we assume that $t$ follows a Gaussian distribution having an $\mathbf{x}$ -dependent mean, which is given by the output of the neural network,

$p(t|\mathbf{x},\mathbf{w}) = \mathcal{N}(t|y(\mathbf{x},\mathbf{w}),\beta^{-1})$

For the above conditional distribution, it is sufficient to take the output unit activation function to be the identity, because such a network can approximate any continuous function from $\mathbf{x}$ to y. Given a data set of $N$ independent, identically distributed observations $\mathbf{X}$ , and the corresponding target values $\mathsf{t}$ , the likelihood function is as follows,

$p(\mathsf{t}|\mathbf{X},\mathbf{w},\beta) = \prod_{n=1}^N p(t_n|\mathbf{x}_n,\mathbf{w},\beta) = \prod_{n=1}^N \mathcal{N}(t_n|y(\mathbf{x}_n,\mathbf{w}),\beta^{-1})$

Then, by taking the negative logarithm, we obtain the same error function derived in (3.11). By minimizing the error function, we obtain the maximum likelihood solution $\mathbf{w}_{ML}$ . Having found $\mathbf{w}_{ML}$ , the value of $\beta$ can be found using (3.21), derived by minimizing the negative log likelihood.

IMPORTANT: Keep in mind, however, that the nonlinearity of the network function $y(\mathbf{x}_n,\mathbf{w})$ causes the error to be nonconvex, and so a local maxima of the likelihood may be found, corresponding to local minima of the error function.

Classification¶

In binary classification a single target variable $t$ such that $t=1$ for class $\mathcal{C}_1$ and $t=0$ for class $\mathcal{C}_2$ . We consider a network having a single output whose activation function is a logistic sigmoid,

$y = \sigma(\alpha) = \frac{1}{1 + \exp(-\alpha)}$

so that $0 \leq y(\mathbf{x},\mathbf{w}) \leq 1$ , is interpreted as the conditional probability $p(\mathcal{C}_k|\mathbf{x})$ given by $1 - y(\mathbf{x},\mathbf{w})$ . The conditional distribution of targets given inputs is then a Bernoulli distribution of the form,

$p(t|\mathbf{x},\mathbf{w}) = y(\mathbf{x},\mathbf{w})^t\{1-y(\mathbf{x},\mathbf{w})\}^{1-t}$

Given a training set of independent observations, the the error function by the negative log-likelihood, if the cross-entropy error function of the form,

$E(\mathbf{w}) = -\sum_{n=1}^N \{t_n\ln y_n + (1-t_n)\ln(1-y_n)\}$

NOTE: There is no analogue of the noise precision $\beta$ because the target values are assumed to be correctly labelled.

IMPORTANT: There is a natural choice of both output unit activation function and matching error function, according to the type of problem being solved. For regression we rely on linear outputs and sum-of-squares error, and for binary logistic sigmoid (binary) or softmax (multiclass) outputs and cross-entropy error function.

5.2.1 Parameter optimization ¶

The goal is to find a vector such that $E(\mathbf{w})$ is minimized. However, the error function has a highly nonlinear dependence on the weights, and so there are many points in the weight space at which the gradient vanishes. Since there is no hope of finding an analytical solution to the equation $\nabla E(\mathbf{w})=0$ we resort to iterative numerical procedures. Most of these techniques involve choosing an initial value $\mathbf{w}^{(0)}$ for the weight vector and then moving through weight space in a succession of steps of the form,

$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} + \Delta\mathbf{w}^{(\tau)}$

Such algorithms involve different choices for the weight vector update $\Delta\mathbf{w}^{(\tau)}$ . Usually, they make use of gradient information and therefore require that, after each update, the value of $\nabla E(\mathbf{w})$ is evaluated at the updated weight vector $\mathbf{w}^{(\tau+1)}$ . In order to understand the importance of gradient information, it is useful to consider a local approximation to the error function based on a Taylor expansion.

5.2.2 Local quadratic approximation¶

Consider the second-order Taylor expansion of $E(\mathbf{w})$ around some point $\hat{\mathbf{w}}$ ,

$E(\mathbf{w}) \approx E(\hat{\mathbf{w}}) +(\mathbf{w} - \hat{\mathbf{w}})^T\mathbf{b} + \frac{1}{2}(\mathbf{w} - \hat{\mathbf{w}})^T\mathbf{H}(\mathbf{w} - \hat{\mathbf{w}})$

where $\mathbf{b}=\nabla E|_{\mathbf{w}=\hat{\mathbf{w}}}$ and $\mathbf{H}=\nabla\nabla E$ is the Hessian matrix of second derivatives. Thus, the local approximation to the gradient is given by,

$E(\mathbf{w}) \approx \mathbf{b} + \mathbf{H}(\mathbf{w} - \hat{\mathbf{w}})$

for points $\mathbf{w}$ that are sufficiently close to $\hat{\mathbf{w}}$ , these expressions give reasonable approximations for the error and its gradient.

Consider, for instance, a simple 2-dimensional error function of the form $E(w_1, w_2) = w_1^2 + w_2^4 + w_1w_2$ .

In [8]:

w = np.linspace(-1, 1, 100)
w1, w2 = np.meshgrid(w, w)


def E(w1: float, w2: float) -> float:
    return w1**2 + w2**4 + w1 * w2


plt.contour(w1, w2, w1**2 + w2**4 + w1 * w2)
plt.xlabel("$w_1$")
plt.ylabel("$w_2$")
plt.title("$E(w_1, w_2) = w_1^2 + w_2^4 + w_1w_2$")
plt.show()

The gradient of $E(w_1, w_2)$ is defined as follows,

$\nabla E = \begin{bmatrix} \frac{\partial E}{w_1} \\[0.1cm] \frac{\partial E}{w_2} \end{bmatrix} = \begin{bmatrix} 2w_1 + w_2 \\[0.1cm] 4w_2 + w_1 \end{bmatrix}$

and the Hessian matrix $\mathbf{H}$ equals,

$\mathbf{H} = \nabla\nabla E = \begin{bmatrix} \frac{\partial E}{w_1w_1} & \frac{\partial E}{w_1w_2} \\[0.1cm] \frac{\partial E}{w_2w_1} & \frac{\partial E}{w_2w_2} \end{bmatrix} = \begin{bmatrix} 2 & 1 \\[0.1cm] 1 & 4 \end{bmatrix}$

In [9]:

def gradient(w1, w2):
    return np.array([2 * w1 + w2, 4 * w2 + w1])


H = np.array([[2, 1], [1, 4]])

Then, given some point $\hat{\mathbf{w}} = [\hat{w}_1, \hat{w}_2]$ , $E(w_1, w_2)$ can be approximated by,

$E(w_1, w_2) \approx E(\hat{w_1}, \hat{w_2}) + [w_1 - \hat{w_1}, w_2 - \hat{w_2}] \begin{bmatrix} 2\hat{w_1} + \hat{w_2} \\[0.1cm] 4\hat{w_2} + \hat{w_1} \end{bmatrix} + \frac{1}{2} [w_1 - \hat{w_1}, w_2 - \hat{w_2}]^T \begin{bmatrix} 2 & 1 \\[0.1cm] 1 & 4 \end{bmatrix} [w_1 - \hat{w_1}, w_2 - \hat{w_2}]$

In [10]:

w_hat = np.array([0, -0.8])


@np.vectorize
def E_approx(w1, w2):
    w = np.array([w1, w2])
    return E(*w_hat) + np.dot((w - w_hat), gradient(*w_hat)).T + 0.5 * np.dot(w - w_hat, np.dot(H, w - w_hat))


plt.contour(w1, w2, E_approx(w1, w2), 20, colors="lightcoral")
plt.contour(w1, w2, E(w1, w2), 20, colors="powderblue")
plt.scatter(*w_hat, color="black", marker="x")
plt.xlabel("$w_1$")
plt.ylabel("$w_2$")
plt.show()

5.2.4 Gradient descent optimization¶

The simplest approach to using gradient information is to choose the weight update to be a small step in the direction of the negative gradient, so that,

$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta\nabla E(\mathbf{w}^{(\tau)})$

where the parameter $\eta > 0$ is known as learning rate. At each step the weight vector is moved in the direction of the greatest rate of decrease of the error function, and so the approach is known as gradient descent or steepest descent. Note that the error function is defined with respect to a training set, and thus, each step requires that the entire training set be processed in order to evaluate $\nabla E$ . Techniques that use the whole data set at once are called batch methods. Although such an approach might intuitively seem reasonable, in fact it turns out to be a poor algorithm.

For batch optimization, there are more efficient methods, such as conjugate gradients and quasi-Newton methods, which are much more robust and much faster than simple gradient descent. Unlike gradient descent, these algorithms have the property that the error function always decreases at each iteration unless the weight vector has arrived at a local or global minimum.

There is, however, an on-line version of gradient descent, known as sequential gradient descent or stochastic gradient descent, that has proved useful in practice for training neural networks on large data sets. Error functions based on maximum likelihood for a set of independent observations comprise a sum of terms, one for each data point,

$E(\mathbf{w}) = \sum_{n=1}^N E_n(\mathbf{w})$

Stochastic gradient descent makes an update to the weight vector based on one data point at a time, so that

$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta\nabla E_n(\mathbf{w}^{(\tau)})$

The update is repeated by cycling through the data either in sequence or by selecting points at random with replacement. There are of course intermediate scenarios in which the updates are based on batches of data points. One advantage of on-line methods compared to batch methods is that the former handle redundancy in the data much more efficiently. To see, this consider an extreme example in which we take a data set and double its size by duplicating every data point. Note that this simply multiplies the error function by a factor of $2$ and so is equivalent to using the original error function. Batch methods will require double the computational effort to evaluate the batch error function gradient, whereas online methods will be unaffected. Another property of on-line gradient descent is the possibility of escaping from local minima, since a stationary point with respect to the error function for the whole data set will generally not be a stationary point for each data point individually.

5.3 Error Backpropagation ¶

Error backpropagation, or simply backprop, is an efficient technique for evaluating the gradient of an error function $E(\mathbf{w})$ for a feed-forward neural net. This can be achieved using a local message passing scheme in which information is sent alternately forwards and backwards through the network.

Most training algorithms involve an iterative procedure for minimization of an error function. At each such step, we can distinguish between two stages:

The derivatives of the error function must be evaluated. The important contribution of the backpropagation technique is in providing a computationally efficient method for evaluating such derivatives. Since the errors are propagated backwards through the network, we use the term backpropagation to describe the evaluation of derivatives.
The derivatives are then used to compute the adjustments to be made to the weights, e.g., gradient descent.

It is important to recognize that these stages are distinct. Thus, the first stage, namely the propagation of errors backwards through the network in order to evaluate derivatives, can be applied to many other kinds of networks and not just the multilayer perceptron. It can also be applied to error functions other than the simple sum-of-squares, and to the evaluation of other derivatives, such as, the Jacobian and Hessian matrices.

5.3.1 Evaluation of error-function derivatives ¶

The backpropagation algorithm can be applied to a general network of arbitrary topology, non-linear activation functions, and a broad class of error function. Many error functions of practical interest comprise a sum of terms, one of each data point, so that,

$E(\mathbf{w}) = \sum_{n=1}^N E_n(\mathbf{w})$

For simplicity, consider the evalution of a single term $\nabla E_n(\mathbf{w})$ . Consider the first linear model, where the outputs are linear combinations of the input variables.

$y_{nk} = y_k(\mathbf{x}_n, \mathbf{w}) = \sum_i w_{ki}x_i$

thus, the error function for input example $n$ , takes the form,

$E_n = \frac{1}{2} \sum_k(y_{nk} - t_{nk})^2$

and its gradient with respect to $w_{ji}$ , is given by,

$\frac{\partial E_n}{\partial w_{ji}} = (y_{nj} - t_{nj})x_{ni}$

In a general feed-forward network, each unit $j$ (in any layer $\ell$ ) computes a weighted sum of its inputs, as follows,

$a_j = \sum_i w_{ji}z_i$

where $z_i$ is the activation of a unit in the previous layer ( $\ell-1$ ), that sends a connection to unit $j$ , and $w_{ji}$ is the weight associated with the connection. Then, the sum $a_j$ is transformed by a non-linear activation function $h(\cdot)$ to give $z_j$ in the form,

$z_j = h(a_j)$

In turn, $z_j$ may be sent as a connection to a subsequent unit in order to participate in another activation. Note that using the vectorized notation introduced in the beginning of this chapter, we can compute the transformed activations of any layer $\ell$ as follows,

$\begin{aligned} \mathbf{a}^{\ell} &= \mathbf{W}^{\ell}\mathbf{z}^{\ell-1} \\ \mathbf{z}^{\ell} &= h(\mathbf{a}^{\ell}) \end{aligned}$

In order to apply backprop, we assume that we have computed the activations of all hidden and output units in the network, a process called forward propagation because it may be regarded as the forward flow of information through the network. Consider again the evaluation of the derivative $E_n$ . Given an arbitraty unit $j$ , $E_n$ depends on the weight $w_{ji}$ only via the input $a_j$ . Therefore, according to the chain rule for partial derivatives, we obtain,

$\frac{\partial E_n}{\partial w_{ji}} = \frac{\partial E_n}{\partial a_j} \frac{\partial a_j}{\partial w_{ji}} \overset{(5.48)}= \frac{\partial E_n}{\partial a_j} z_i$

Since $z_i$ is computed during forward propagation, we only need to compute $\frac{\partial E_n}{\partial a_j}$ . For simplicity, lets define $\delta_j = \frac{\partial E_n}{\partial a_j}$ .

For the output units, we have that $\delta_k = \frac{\partial E_n}{\partial y_k} \overset{(5.46)}= y_k - t_k$ .
For hidden units, we apply again the chain rule,

$\begin{aligned} \delta_j &= \frac{\partial E_n}{\partial a_j} = \sum_k \frac{\partial E_n}{\partial a_k} \frac{\partial a_k}{\partial a_j} \\ &= \sum_k \delta_k \frac{\partial a_k}{\partial a_j} \overset{(5.48)}= \sum_k \delta_k \Big( \frac{\partial}{\partial a_j} \sum_j w_{kj} z_j \Big) \overset{(5.49)}= \sum_k \delta_k \Big( \frac{\partial}{\partial a_j} \sum_j w_{kj} h(a_j) \Big) \\ &= h'(a_j) \sum_k w_{kj} \delta_k \end{aligned}$

where the sum runs over all $k$ units to which $j$ sends connections. Thus, for a particular hidden unit, $\delta$ is obtained by propagating the $\delta$ backwards from units higher up in the network.

Since, we know the $\delta$ for the output units, we can recursively evaluate the $\delta$ for all the hidden units in a feed-forward neural net, regardless of the topology.

Error Backpropagation algorithm

Forward propagate any input vector $\mathbf{x}_n$ using $(5.48)$ and $(5.49)$ .
Evalute $\delta_k$ for output units using $(5.54)$ .
Backpropagate $\delta$ using $(5.56)$ to obtain $\delta_j$ for each hidden unit.
Evalute the required derivatives using $(5.53)$ .

For batch methods, the derivative of the total error $E$ is obtained be repeating the above steps for each example $n$ and then summing over all examples.

Implementation notes¶

When implementing neural networks is much more performant to perform forward and backward propagation using the matrix notation introduced in the beginning of this chapter. Therefore, here we present both propagations in matrix notation across multiple training examples.

Forward Propagation

$\mathbf{Z}^{(0)} = \mathbf{X}$
Repeat for each layer $\ell$ :
- $\mathbf{A}^{(\ell)} = \mathbf{W}^{(\ell)}\mathbf{Z}^{(\ell-1)} + \mathbf{w}_{0}^{(\ell)}$
- and $\mathbf{Z}^{(\ell)} = h^{(\ell)}(\mathbf{A}^{(\ell)})$

Backward Propagation

For the output layer evalute $\boldsymbol\delta^{L} = \mathbf{y} - \mathtt{t}$
Backpropagate $\boldsymbol\delta^{\ell+1}$ to obtain $\boldsymbol\delta^{\ell} = h'(\mathbf{a}^{\ell}) \odot \big( \mathbf{W}^{(\ell+1)T} \boldsymbol\delta^{\ell+1} \big)$
Evaluate derivatives $\nabla E_n(\mathbf{W}^{\ell}) = \boldsymbol\delta^{\ell}\mathbf{z}^{\ell-1}$

Let's train a shallow $2$ -layer neural network for classification, on the planar dataset, using one hidden layer of hyperbolic tangent activation functions and one sigmoid output layer. In order to train the network for classification, we use a cross-entropy loss function, similar to logistic regression. The weights of the neural network are initialized at random, while biases are initialized to zero values.

In [18]:

x, y = load_planar_dataset()

net = nn.NeuralNetwork(
    nn.LinearLayer(2, 4),
    nn.TanH(),
    nn.LinearLayer(4, 1),
    nn.Sigmoid(),
)

L = nn.BinaryCrossEntropyLoss()
net.fit(x, y, loss=L)
plot_2d_decision_boundary(lambda x: net.predict(x) > 0.5, x, y)

-- Epoch 1 ---
Cost: 0.7080379029345033
-- Epoch 101 ---
Cost: 0.5332581467045885
-- Epoch 201 ---
Cost: 0.4320979117595954
-- Epoch 301 ---
Cost: 0.37866149726575815
-- Epoch 401 ---
Cost: 0.3473329935455369
-- Epoch 501 ---
Cost: 0.32698746950287444
-- Epoch 601 ---
Cost: 0.31283296402612243
-- Epoch 701 ---
Cost: 0.30246073314522465
-- Epoch 801 ---
Cost: 0.294544357721986
-- Epoch 901 ---
Cost: 0.28830257594631986

Weight initialization helps, among others, avoiding vanishing/exloding gradients. A common choice is Xavier initilization, defined as follows,

$w = \mathcal{N}(0, \sqrt{\frac{1}{N^{(\ell-1)}}})$

Xavier initilization works better for networks using tangent activation functions. A popular choice for ReLU activation functions is

$w = \mathcal{N}(0, \sqrt{\frac{2}{N^{(\ell-1)}}})$

Another popular alternative is

$w = \mathcal{N}(0, \sqrt{\frac{2}{N^{(\ell-1)} + N^{\ell}}})$

What happens if all weights and biases are initialized to the same value?¶

Then all hidden units are symmetric (completely identical), thus computing the same function, which is undesirable. Therefore, weights must be initialized randomly. Biases can still be zero since they represent a single dimension in the weight vectors of the hidden units which already differ due to the random initialization. As a proof of concept, lets re-train the same exact network but initialize weights and bias to a constant value.

In [19]:

net = nn.NeuralNetwork(
    nn.LinearLayer(2, 4, random_initialization=False),
    nn.TanH(),
    nn.LinearLayer(4, 1, random_initialization=False),
    nn.Sigmoid(),
)

L = nn.BinaryCrossEntropyLoss()
net.fit(x, y, loss=L)
plot_2d_decision_boundary(lambda x: net.predict(x) > 0.5, x, y)

-- Epoch 1 ---
Cost: 0.6932673944222819
-- Epoch 101 ---
Cost: 0.6930829250688935
-- Epoch 201 ---
Cost: 0.6884763177838701
-- Epoch 301 ---
Cost: 0.6706060650349758
-- Epoch 401 ---
Cost: 0.668686022681725
-- Epoch 501 ---
Cost: 0.6683657053643131
-- Epoch 601 ---
Cost: 0.6676242897407316
-- Epoch 701 ---
Cost: 0.6658298361918764
-- Epoch 801 ---
Cost: 0.6627349400852941
-- Epoch 901 ---
Cost: 0.659059060995905

Note that the behavior of the trained network is identical to logistic regression and the perceptron models we used in the beginning of the chapter.

In case of multiclass classification the sigmoid activation in the final layer of the network should be replaced by a softmax activation, similar to softmax regression. For instance, consider a synthetic dataset comprise $100$ training examples, each having $2$ input features and belonging to one of $3$ classes.

In [22]:

# number of training points
N = 100

x_train, t = make_classification(
    n_features=2, n_informative=2, n_redundant=0, n_classes=3, n_clusters_per_class=1, n_samples=N, random_state=21
)

encoder = OneHotEncoder()
t_one_hot = encoder.encode(t)

model = nn.NeuralNetwork(nn.LinearLayer(2, 4), nn.ReLU(), nn.LinearLayer(4, 3), nn.Softmax())

model.fit(
    x_train,
    t_one_hot,
    epochs=10000,
    loss=nn.CrossEntropyLoss(),
    optimizer=nn.GradientDescent(learning_rate=0.01),
    verbose=True,
)

x1, x2 = np.meshgrid(np.linspace(-5, 5, 100), np.linspace(-5, 5, 100))
x_test = np.array([x1, x2]).reshape(2, -1).T

predicted = np.argmax(model.predict(x_test), axis=1)

print("Training Error:")
print(classification_report(t, np.argmax(model.predict(x_train), axis=-1)))

plt.scatter(x_train[:, 0], x_train[:, 1], c=t)
plt.contourf(x1, x2, predicted.reshape(100, 100), alpha=0.2, levels=np.linspace(0, 2, 4))
plt.xlim(-5, 5)
plt.ylim(-5, 5)
plt.xlabel("$x_1$", fontsize=12)
plt.ylabel("$x_2$", fontsize=12)

plt.show()

-- Epoch 1 ---
Cost: 1.10166804109341
-- Epoch 1001 ---
Cost: 0.4274539415136976
-- Epoch 2001 ---
Cost: 0.38231954420096187
-- Epoch 3001 ---
Cost: 0.3679944991104277
-- Epoch 4001 ---
Cost: 0.3585322733626179
-- Epoch 5001 ---
Cost: 0.34823524562803826
-- Epoch 6001 ---
Cost: 0.3399993027070345
-- Epoch 7001 ---
Cost: 0.33318353101161735
-- Epoch 8001 ---
Cost: 0.3284922056777914
-- Epoch 9001 ---
Cost: 0.3244782822114613
Training Error:
              precision    recall  f1-score   support

           0       0.90      0.82      0.86        33
           1       0.94      0.86      0.90        35
           2       0.82      0.97      0.89        32

    accuracy                           0.88       100
   macro avg       0.88      0.88      0.88       100
weighted avg       0.89      0.88      0.88       100

5.3.3 Efficiency of backpropagation¶

Notes on the numerical approximation of the gradient¶

Let a cost function $f(\theta)$ , then by choosing a very small $\epsilon$ value, we can numerically approximate the gradient in any given point of the function by taking symmetrical central differences around that point, as depicted below,

Numerical differentiation is very important in practice, because a comparison of the derivatives calculated by backpropagation with those obtained using central differences provides a very accurate check on the correctness of any implementation of the backpropagation algorithm. When training networks in practice, derivatives should be evaluated using backpropagation, because this gives the greatest accuracy and numerical efficiency. However, the results should be compared with numerical differentiation in order to check the correctness of the implementation.

Basic recipe for training neural networks¶

In the presence of high bias, increasing the number of hidden units may increase performance.
If bias is reasonable, then variance may be high. In the presence of high variance, using more data or adding some form of regularization may increase performance.

Note that by using these steps, for many problems, you can achieve significant reduction in both bias and variance, in constrast to simpler models, where the bias-variance tradeoff is more hard or even impossible to overcome. The main drawback here is that larger networks require have higher computation cost to train, and more data are not always easy to find.

Normalizing inputs. When input features have very different scales, the cost function may be elongated, which, in turn, leads to slower learning, since gradient descent need small learning rate to converge. On the other hand, normalized inputs have spherical contours and, even for larger learning rates, gradient descent goes straight to the minimum.

5.5 Regularization in Neural Networks¶

The number $M$ of hidden units if a free parameter, in contrast to input and output units, and can be adjusted to obtain the best predictive performance. Note that $M$ indirectly controls the number of parameters (weights and biases) in the network, and therefore, we expect that by using maximum likelihood, we should find an optimal value for $M$ that yields the best generalization performance, corresponding to the optimal balance between an under-fit and over-fit.

The generalization error, however, is not a simple function of $M$ due to the presence of local minima in the error function. One approach of choosing $M$ is to plot a graph of $M$ against validation set performance and then choose the solution having the smallest validation set error, similar to model selection on Chapter 1.

Another approach of course is to choose a relatively large value for $M$ and add a regularization term to the error function in order to control the model complexity. The simplest regularizer is the quadratic, also known as weight decay in the context of neural networks,

$\tilde{E}(\mathbf{w}) = E(\mathbf{w}) + \frac{\lambda}{2}\mathbf{w}^T\mathbf{w}$

However, for the full neural network, the error function is obtained by,

$\tilde{E} = E + \frac{\lambda}{2} \sum_{\ell=1}^L||\mathbf{W}^{\ell}||_F^2$

where $||\mathbf{W}^{\ell}||_F^2$ is the Forbenius norm, defined as,

$||\mathbf{W}^{\ell}||_F^2 = \sum_{i=1}^{M^{\ell}}\sum_{j=1}^{M^{(\ell-1)}}w_{ij}^2$

The effective model complexity is then determined by the choice of $\lambda$ . As discussed in Chapter 1, the quadratic regularizer can be interpreted as the negative logarithm of a zero-mean Gaussian prior over the weight vector $\mathbf{w}$ . Then, adding a quadratic regularization term, the error function for input example $n$ , takes the form,

$\tilde{E}_n = \frac{1}{2} \sum_k(y_{nk} - t_{nk})^2 + \frac{\lambda}{2}\sum_{k=1}w_k^2$

where $\lambda$ is called the regularization parameter, and the gradient is obtained as follows,

$\frac{\partial \tilde{E}_n}{w_{ji}} = (y_{nj} - t_{nj})x_{ni} + \lambda w_j = \frac{\partial E_n}{w_{ji}} + \lambda w_j$

$\nabla \tilde{E}_n = (\mathbf{y}_n - \mathbf{t}_n)\mathbf{x}_n + \lambda\mathbf{w} = \nabla E_n + \lambda\mathbf{w}$

Then, by replacing back into $(5.43)$ , the stochastic gradient descent update becomes:

$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta\Big(\nabla E_n(\mathbf{w}^{(\tau)}) + \lambda\mathbf{w}^{(\tau)} \Big) = \mathbf{w}^{(\tau)} - \eta\nabla E_n(\mathbf{w}^{(\tau)}) - \eta\lambda\mathbf{w}^{(\tau)} = (1 - \eta\lambda)\mathbf{w}^{(\tau)} - \eta\nabla E_n(\mathbf{w}^{(\tau)})$

In [23]:

x_train, y_train = generate_toy_data(lambda x: np.sin(2 * np.pi * x), sample_size=10, std=0.25)
plt.scatter(x_train, y_train, marker="x", color="k")
plt.show()

In [24]:

x_space = np.linspace(0, 1, 100)[:, None]


def create_network(m: int) -> nn.NeuralNetwork:
    return nn.NeuralNetwork(
        nn.LinearLayer(1, m),
        nn.TanH(),
        nn.LinearLayer(m, 1),
        nn.Linear(),
    )


plt.figure(figsize=(20, 5))

for i, m in enumerate([1, 10]):
    model = create_network(m)
    model.fit(
        x_train[:, None],
        y_train[:, None],
        epochs=100000,
        loss=nn.SSELoss(),
        optimizer=nn.GradientDescent(learning_rate=0.01),
        verbose=False,
    )
    y = model(x_space)

    regularized_model = create_network(m)
    regularized_model.fit(
        x_train[:, None],
        y_train[:, None],
        epochs=100000,
        loss=nn.SSELoss(),
        optimizer=nn.GradientDescent(learning_rate=0.01, weight_decay=1e-3),
        verbose=False,
    )
    y_regularized = regularized_model(x_space)

    plt.subplot(1, 3, i + 1)
    plt.scatter(x_train.ravel(), y_train.ravel(), marker="x", color="black")
    plt.plot(x_space, np.sin(2 * np.pi * x_space), color="green")
    plt.plot(x_space.ravel(), y.ravel(), color="orangered")
    plt.plot(x_space.ravel(), y_regularized.ravel(), color="deepskyblue")
    plt.annotate(f"M={m}", (0.7, 0.5))
    plt.legend(["Training Data", "$\sin(2\pi x)$", "No regularization", "Weight Decay"])

plt.show()

How does regularization reduces the over-fit?¶

In the three figures above, the unregularized neural network (red line), has high bias on the left ( $M=1$ ) and high variance on the right ( $M=10$ ). Regularization helps combat high variance by making the network simpler by enforcing the deactivation of some hidden units (by penalizing weight parameters). Let's see an intuitive example. Assume that a neural network uses $\mathrm{tanh}(x)$ activation functions for the hidden units. The $\mathrm{tanh}(x)$ activation has a roughly linear form for values close to zero, as shown in the following figure.

In [25]:

x = np.linspace(-5, 5)
narrow_x = np.linspace(-1, 1)

plt.plot(x, np.tanh(x), "k")
plt.plot(narrow_x, np.tanh(narrow_x), "r")
plt.vlines([-1, 1], -1, 1, colors="r", linestyles="dotted")
plt.xlabel("x")
plt.ylabel("$\mathtt{tanh}(x)$")
plt.show()

As the $\lambda$ parameter gets larger, the effect of the regularization term penalizes the weight parameters to have smaller values (closer to zero). Note that hidden unit outputs (before activation) are a linear combination of the layer $\ell$ weights and the outputs of the previous layer, $\mathbf{A}^{(\ell)} = \mathbf{W}^{(\ell)}\mathbf{Z}^{(\ell-1)} + \mathbf{w}_{0}^{(\ell)}$ , thus, in turn, the resulting $\mathbf{A}^{(\ell)}$ would be also smaller since $\mathbf{Z}^{(\ell-1)}$ are multiplied by smaller values. Therefore, when passed through the activation functions $\mathrm{tanh}(z)$ would respond more linearly. Increasing regularization (larger $\lambda$ ) enforces roughly linear activations and leads to simpler models which helps combat high variance.

Dropout Regularization¶

Dropout is another form of regularization that eliminates a percentage of the hidden units in the network by chance. For instance, you may toss a fair coin and have $50\%$ change of keeping each hidden unit in some layer $\ell$ . Then, by removing these units from the parameters $\mathbf{W}^{\ell}$ of the $\ell$ layer, the resulting network is much smaller. Dropout defines an indicator vector (mask) of zeros and ones, per hidden unit in the layer $\ell$ , which is used for dropping or deactivating a percentage of hidden units. More formally, assuming tha layer $\ell$ has $M$ hidden units,

$\mathbf{d}^{\ell} \propto \mathrm{Bin}(M, p)$

where $p$ is the probability of keeping a hidden unit, and

$\mathbf{z}^{\ell} = \frac{\mathbf{d}^{\ell} \odot \mathbf{z}^{\ell}}{p}$

where the division by $p$ is called inverted dropout and ensures that the expected value of $z$ remains the same after dropping a percentage of hidden units. For each training example a new $\mathbf{d}^\ell$ vector should be randomly chosen. Therefore given $N$ training examples stacked in a matrix notation, the dropout mask is defined as,

$\mathbf{D}^{\ell} = \begin{bmatrix} \mathrm{Bin}_1(M, p)\\ \dots\\ \mathrm{Bin}_N(M, p)\\ \end{bmatrix}$

and

$\mathbf{Z}^{\ell} = \frac{\mathbf{D}^{\ell} \odot \mathbf{Z}^{\ell}}{p}$

Note that on prediction (test) time, dropout should not be used, that is, the network should not activate hidden units at random, but instead should use all hidden units. Moreover, due to to randomization, the error function is ill-defined and therefore it may not decrease in every iteration of gradient descent as expected.

In [26]:

x_space = np.linspace(0, 1, 100)[:, None]


def create_network(m: int, dropout: float = 0) -> nn.NeuralNetwork:
    return (
        nn.NeuralNetwork(
            nn.LinearLayer(1, m),
            nn.Dropout(dropout),
            nn.TanH(),
            nn.LinearLayer(m, 1),
            nn.Linear(),
        )
        if dropout > 0
        else nn.NeuralNetwork(
            nn.LinearLayer(1, m),
            nn.TanH(),
            nn.LinearLayer(m, 1),
            nn.Linear(),
        )
    )


plt.figure(figsize=(20, 5))

for i, m in enumerate([1, 10]):
    model = create_network(m)
    model.fit(
        x_train[:, None],
        y_train[:, None],
        epochs=100000,
        loss=nn.SSELoss(),
        optimizer=nn.GradientDescent(learning_rate=0.01),
        verbose=False,
    )
    y = model(x_space)

    regularized_model = create_network(m, dropout=0.99)
    regularized_model.fit(
        x_train[:, None],
        y_train[:, None],
        epochs=100000,
        loss=nn.SSELoss(),
        optimizer=nn.GradientDescent(learning_rate=0.01),
        verbose=False,
    )
    y_regularized = regularized_model(x_space)

    plt.subplot(1, 3, i + 1)
    plt.scatter(x_train.ravel(), y_train.ravel(), marker="x", color="black")
    plt.plot(x_space, np.sin(2 * np.pi * x_space), color="green")
    plt.plot(x_space.ravel(), y.ravel(), color="orangered")
    plt.plot(x_space.ravel(), y_regularized.ravel(), color="deepskyblue")
    plt.annotate(f"M={m}", (0.7, 0.5))
    plt.legend(["Training Data", "$\sin(2\pi x)$", "No regularization", "Dropout (90%)"])

Why does dropout work?¶

Since dropout randomly deactivates, in any given layer, a portion of the inputs, that means, intuitively, that the learning algorithm cannot rely on any feature, and so, it is forced to spread out the weights. To that end, it shrinks the squared norm of the weights. Note that dropout can be shown to be an adaptive form of $\ell_2$ -regularization.

Batch Normalization¶

As discussed earlier, to facilitate learning, weights are initialized to have zero mean and small variance. As training progresses and the parameters are updated to different extents, the initial normalization is lost, which, in turn, slows down training and amplifies changes as the network becomes deeper. Batch normalization reestablishes these normalizations for every mini-batch of $N$ training examples, on every layer $\ell$ . By making batch normalization part of the model architecture, we are able to use higher learning rates and pay less attention to the initialization parameters. Batch normalization additionally acts as a regularizer, reducing (and sometimes even eliminating) the need for Dropout. Batch normalization is usually applied on top of the activations $\boldsymbol\alpha^{(\ell)}$ as follows:

$\begin{aligned} \boldsymbol\mu^{(\ell)} &= \frac{1}{N} \sum_{i=1}^{N} \boldsymbol\alpha_i^{(\ell)} \\ {\boldsymbol\sigma^{(\ell)}}^2 &= \frac{1}{m} \sum_{i=1}^{N} (\boldsymbol\alpha_i^{(\ell)} - \boldsymbol\mu^{(\ell)})^2 \\ \boldsymbol\alpha_{i,norm}^{(\ell)} &= \frac{\boldsymbol\alpha_i^{(\ell)} - \boldsymbol\mu^{(\ell)}}{\sqrt{\boldsymbol\sigma^{(\ell)2} + \epsilon}} \\ \hat{\boldsymbol\alpha}_i^{(\ell)} &= \boldsymbol\gamma^{(\ell)} \boldsymbol\alpha_{i,norm}^{(\ell)} + \boldsymbol\beta^{(\ell)} \end{aligned}$

where $\boldsymbol\gamma^{(\ell)}$ and $\boldsymbol\beta^{(\ell)}$ are learnable parameters, thus allowing each layer to have a different distribution. In order words, during forward propagation, each neuron or hidden unit includes a batch normalization operator between the activations $\boldsymbol\alpha^{(\ell)}$ and the nonlinear activation functions $h(\cdot)$ . However, the addition of the batch normalization operator, changes the derivation of backprop. During backward propagation, $\boldsymbol\delta^{(\ell)} = \frac{\partial E_n}{\partial \boldsymbol\alpha_i^{(\ell)}}$ should be derived using the chain rule as follows,

$\frac{\partial E_n}{\partial \boldsymbol\alpha_i^{(\ell)}} = \frac{\partial E_n}{\partial \boldsymbol\alpha_{i,norm}^{(\ell)}} \frac{\partial \boldsymbol\alpha_{i,norm}^{(\ell)}}{\partial \boldsymbol\alpha_i^{(\ell)}} + \frac{\partial E_n}{\partial \boldsymbol\mu} \frac{\partial \boldsymbol\mu}{\partial \boldsymbol\alpha_i^{(\ell)}} + \frac{\partial E_n}{\partial {\boldsymbol\sigma^{(\ell)}}^2} \frac{\partial {\boldsymbol\sigma^{(\ell)}}^2}{\partial \boldsymbol\alpha_i^{(\ell)}}$

The chain rule results in a summation having three components. Let's start from the simpler terms, which are the second terms in the product of each component,

$\frac{\partial \boldsymbol\alpha_{i,norm}^{(\ell)}}{\partial \boldsymbol\alpha_i^{(\ell)}} = \frac{1}{\sqrt{{\boldsymbol\sigma^{(\ell)}}^2 + \epsilon}}, \quad \frac{\partial \boldsymbol\mu^{(\ell)}}{\partial \boldsymbol\alpha_i^{(\ell)}} = \frac{1}{m}, \quad \frac{\partial {\boldsymbol\sigma^{(\ell)}}^2}{\partial \boldsymbol\alpha_i^{(\ell)}} = \frac{2(\boldsymbol\alpha_i^{(\ell)} - \boldsymbol\mu^{(\ell)})}{m}$

then, let's move on to the first terms of the components,

$\frac{\partial E_n}{\partial \boldsymbol\alpha_{i,norm}^{(\ell)}} = \frac{\partial E_n}{\partial \hat{\boldsymbol\alpha_i}^{(\ell)}} \frac{\partial \hat{\boldsymbol\alpha_i}^{(\ell)}}{\partial \boldsymbol\alpha_{i,norm}^{(\ell)}} = \frac{\partial E_n}{\partial \hat{\boldsymbol\alpha_i}^{(\ell)}} \boldsymbol\gamma^{(\ell)}$

$\frac{\partial E_n}{\partial \boldsymbol\mu^{(\ell)}} = \frac{\partial E_n}{\partial \boldsymbol\alpha_{i,norm}^{(\ell)}} \frac{\partial \boldsymbol\alpha_{i,norm}^{(\ell)}}{\partial \boldsymbol\mu^{(\ell)}} + \frac{\partial E_n}{\partial {\boldsymbol\sigma^{(\ell)}}^2} \frac{\partial {\boldsymbol\sigma^{(\ell)}}^2}{\partial \boldsymbol\mu^{(\ell)}}$

$\frac{\partial \boldsymbol\alpha_{i,norm}^{(\ell)}}{\partial \boldsymbol\mu^{(\ell)}} = \frac{-1}{\sqrt{{\boldsymbol\sigma^{(\ell)}}^2 + \epsilon}}$

$\frac{\partial E_n}{\partial {\boldsymbol\sigma^{(\ell)}}^2} = \frac{\partial E_n}{\partial \boldsymbol\alpha_{i,norm}^{(\ell)}} \frac{\partial \boldsymbol\alpha_{i,norm}^{(\ell)}}{\partial {\boldsymbol\sigma^{(\ell)}}^2} = -\frac{1}{2} \sum_{i=1}^{N} (\boldsymbol\alpha_i^{(\ell)} - \boldsymbol\mu)({\boldsymbol\sigma^{(\ell)}}^2 + \epsilon)^{-1.5} \frac{\partial E_n}{\partial \boldsymbol\alpha_{i,norm}^{(\ell)}}$

$\frac{\partial {\boldsymbol\sigma^{(\ell)}}^2}{\partial \boldsymbol\mu^{(\ell)}} = -\frac{2}{m}\sum_{i=1}^N (\boldsymbol\alpha_i^{(\ell)} - \boldsymbol\mu^{(\ell)}) = -2 \Big( \frac{1}{m}\sum_{i=1}^N\alpha_i^{(\ell)} - \frac{1}{m}\sum_{i=1}^N\boldsymbol\mu^{(\ell)} \Big) = -2 \Big( \boldsymbol\mu^{(\ell)} - \frac{m \boldsymbol\mu^{(\ell)}}{m} \Big) = 0$

therefore,

$\frac{\partial E_n}{\partial \boldsymbol\mu^{(\ell)}} = \sum_{i=1}^{N} \frac{\partial E_n}{\partial \boldsymbol\alpha_{i,norm}^{(\ell)}} \frac{-1}{\sqrt{{\boldsymbol\sigma^{(\ell)}}^2 + \epsilon}}$

Finally, $\boldsymbol\delta^{\ell}$ is obtained as follows,

$\begin{aligned} \boldsymbol\delta^{(\ell)} &= \Big( \frac{\partial E_n}{\partial \boldsymbol\alpha_{i,norm}^{(\ell)}} \frac{1}{\sqrt{{\boldsymbol\sigma^{(\ell)}}^2 + \epsilon}} \Big) + \Big( \frac{1}{m} \sum_{j=1}^{N} \frac{\partial E_n}{\partial \boldsymbol\alpha_{j,norm}^{(\ell)}} \frac{-1}{\sqrt{{\boldsymbol\sigma^{(\ell)}}^2 + \epsilon}} \Big) + \Big( -\frac{1}{2} \sum_{j=1}^{N} (\boldsymbol\alpha_j^{(\ell)} - \boldsymbol\mu)({\boldsymbol\sigma^{(\ell)}}^2 + \epsilon)^{-1.5} \frac{\partial E_n}{\partial \boldsymbol\alpha_{j,norm}^{(\ell)}} \frac{2(\boldsymbol\alpha_i^{(\ell)} - \boldsymbol\mu^{(\ell)})}{m} \Big) \\ &= \Big( \frac{\partial E_n}{\partial \boldsymbol\alpha_{i,norm}^{(\ell)}} \frac{1}{\sqrt{{\boldsymbol\sigma^{(\ell)}}^2 + \epsilon}} \Big) - \Big( \sum_{j=1}^{N} \frac{\partial E_n}{\partial \boldsymbol\alpha_{j,norm}^{(\ell)}} \frac{1}{m\sqrt{{\boldsymbol\sigma^{(\ell)}}^2 + \epsilon}} \Big) - \Big( \sum_{j=1}^{N} (\boldsymbol\alpha_j^{(\ell)} - \boldsymbol\mu^{(\ell)})({\boldsymbol\sigma^{(\ell)}}^2 + \epsilon)^{-1.5} \frac{\partial E_n}{\partial \boldsymbol\alpha_{j,norm}^{(\ell)}} \frac{(\boldsymbol\alpha_i^{(\ell)} - \boldsymbol\mu^{(\ell)})}{m} \Big) \\ &= \Big( \frac{\partial E_n}{\partial \boldsymbol\alpha_{i,norm}^{(\ell)}} \frac{1}{\sqrt{{\boldsymbol\sigma^{(\ell)}}^2 + \epsilon}} \Big) - \Big( \sum_{j=1}^{N} \frac{\partial E_n}{\partial \boldsymbol\alpha_{j,norm}^{(\ell)}} \frac{1}{m\sqrt{{\boldsymbol\sigma^{(\ell)}}^2 + \epsilon}} \Big) - \Big( \sum_{j=1}^{N} \frac{(\boldsymbol\alpha_j^{(\ell)} - \boldsymbol\mu^{(\ell)})(\boldsymbol\alpha_i^{(\ell)} - \boldsymbol\mu^{(\ell)})}{m\sqrt{{\boldsymbol\sigma^{(\ell)}}^2 + \epsilon}^3} \frac{\partial E_n}{\partial \boldsymbol\alpha_{j,norm}^{(\ell)}} \Big) \\ &= \frac{1}{m\sqrt{{\boldsymbol\sigma^{(\ell)}}^2 + \epsilon}} \Big( m\frac{\partial E_n}{\partial \boldsymbol\alpha_{i,norm}^{(\ell)}} - \sum_{j=1}^{N} \frac{\partial E_n}{\partial \boldsymbol\alpha_{j,norm}^{(\ell)}} - \sum_{j=1}^{N} \frac{(\boldsymbol\alpha_i^{(\ell)} - \boldsymbol\mu^{(\ell)})(\boldsymbol\alpha_j^{(\ell)} - \boldsymbol\mu^{(\ell)})}{\sqrt{{\boldsymbol\sigma^{(\ell)}}^2 + \epsilon}^2} \frac{\partial E_n}{\partial \boldsymbol\alpha_{j,norm}^{(\ell)}} \Big) \\ &= \frac{1}{m\sqrt{{\boldsymbol\sigma^{(\ell)}}^2 + \epsilon}} \Big( m\frac{\partial E_n}{\partial \boldsymbol\alpha_{i,norm}^{(\ell)}} - \sum_{j=1}^{N} \frac{\partial E_n}{\partial \boldsymbol\alpha_{j,norm}^{(\ell)}} - \boldsymbol\alpha_{i,norm}^{(\ell)} \sum_{j=1}^{N} \boldsymbol\alpha_{j,norm}^{(\ell)} \frac{\partial E_n}{\partial \boldsymbol\alpha_{j,norm}^{(\ell)}} \Big) \end{aligned}$

The gradients for $\boldsymbol\gamma^{(\ell)}$ and $\boldsymbol\beta^{(\ell)}$ are obtained similarly,

$\frac{\partial E_n}{\partial \boldsymbol\gamma^{(\ell)}} = \sum_{i=1}^{N} \frac{\partial E_n}{\partial \hat{\boldsymbol\alpha_i}^{(\ell)}} \frac{\partial \hat{\boldsymbol\alpha_i}^{(\ell)}}{\partial \boldsymbol\gamma^{(\ell)}} = \sum_{i=1}^{N} \frac{\partial E_n}{\partial \hat{\boldsymbol\alpha_i}^{(\ell)}} \boldsymbol\alpha_{i,norm}^{(\ell)}$

and

$\frac{\partial E_n}{\partial \boldsymbol\beta^{(\ell)}} = \sum_{i=1}^{N} \frac{\partial E_n}{\partial \hat{\boldsymbol\alpha_i}^{(\ell)}} \frac{\partial \hat{\boldsymbol\alpha_i}^{(\ell)}}{\partial \boldsymbol\beta^{(\ell)}} = \sum_{i=1}^{N} \frac{\partial E_n}{\partial \hat{\boldsymbol\alpha_i}^{(\ell)}}$

In [28]:

x_space = np.linspace(0, 1, 100)[:, None]

model = nn.NeuralNetwork(
    nn.LinearLayer(1, 10),
    nn.BatchNorm(),
    nn.TanH(),
    nn.LinearLayer(10, 1),
    nn.Linear(),
)

model.fit(
    x_train[:, None],
    y_train[:, None],
    epochs=10000,
    loss=nn.SSELoss(),
    optimizer=nn.GradientDescent(learning_rate=0.01),
    verbose=False,
)
y = model(x_space)


plt.scatter(x_train.ravel(), y_train.ravel(), marker="x", color="black")
plt.plot(x_space, np.sin(2 * np.pi * x_space), color="green")
plt.plot(x_space.ravel(), y.ravel(), color="orangered")
plt.annotate(f"M={4}", (0.7, 0.5))
plt.legend(["Training Data", "$\sin(2\pi x)$", "Batch normalization"])
plt.show()

5.5.1 Consistent Gaussian priors¶

Simple weight decay is affected by certain scaling properties of network mappings. A regularizer that is invariant to re-scaling of the weights and to shifts to biases, given a $2$ -layer neural network is defined as,

$\frac{\lambda_1}{2}\sum_{w\in\mathcal{W}_1}w^2 + \frac{\lambda_2}{2}\sum_{w\in\mathcal{W}_2}w^2$

where $\mathcal{W}_1$ denotes the weights of the first layer and $\mathcal{W}_2$ denotes the set of weights of the second layer and biases are excluded from the summations. The corresponding Gaussian prior for this regularizer takes the form

$p(\mathbf{w}|\alpha_1,\alpha_2) \propto \Big( -\frac{\alpha_1}{2}\sum_{w\in\mathcal{W}_1}w^2 - \frac{\alpha_2}{2}\sum_{w\in\mathcal{W}_2}w^2 \Big)$

Note that priors of this form are improper, that means they cannot be normalized, because bias parameters are unconstrained. Since improper priors lead to zero evidence in the Bayesian framework, it is common practice to include separate priors for the biases.

In the general case, weights can be divided into any number of groups $\mathcal{W}_k$ , thus obtaining priors of the form,

$p(\mathbf{w}|\boldsymbol\alpha) \propto \Big( -\frac{1}{2}\sum_{k}\alpha_k ||\mathbf{w}||_k^2 \Big)$

where $\boldsymbol\alpha=(\alpha_1,\dots,\alpha_k)$ and $||\mathbf{w}||_k^2 = \sum_{j\in\mathcal{W}_k} w_j^2$ .

5.5.2 Early stopping¶

An alternative to regularization is the procedure of early stopping. For many algorithms used for network training, such as conjugate gradients, the error is a non-increasing function of the iteration index. However, the error measured on a validation set (independent data), often shows a decrease at first, followed by an increase as the network starts to over-fit. Training can thus be stopped at this point of smallest error with respect to the validation set in order to obtain good generalization performance.

Improving neural network performance guidelines¶

In order to achieve or surpass human-level performance, which can be considered as a proxy to optimal Bayes error, the model should minimize the error in the training set to be as close as possible to the error achieved by humans (avoidable bias). However, at the same time, the model should retain low variance, that is, low error on the validation or development set. In the first case, where the error in the training set is off by a large percentage compared to the error measured in humans, one should consider training a deeper model, using a better optimization algorithm or even alternative neural network architectures. On the other hand, when variance is high, the model may have overfitted, which can be dealt using regularization techniques, more training data or data augmentation to include invariances.

5.5.3 Invariances¶

Ideally, predictions should be unchanged or invariant under one or more transformations of the input variables. For example, in image classification tasks, such as digit recognition, the particular object should be assigned the same label irrespective of its position in the image (translation invariance) or of its size (scale invariance). Similar, in speech recognition, small levels of nonlinear warping along the time axis (assuming temporal ordering is perserved) should not change the interpretation of the signal.

Given a sufficiently large number of examples, an adaptive model can learn the invariance, even approximately. However, if the number of examples is limited, or there are several invariants, there is a number of alternative approaches for encouraging a model to exhibit the invariances:

Training set can be augmented to include replicas of training examples, transformed according to the desired invariances.
A regularization term can be added to the error function to penalize changes in the model output when the input is transformed, a technique called tangent propagation.
Extracting invariant features, thus building regression or classification systems that necessarily respect the invariances.
Build the invariance properties into the structure of the neural network, using local receptive fields and shared weights, such as convolutional neural networks.

5.5.6 Convolutional networks¶

Convolutional neural networks build invariance properties into the structure of the network and have been widely applied to image data. In general, image recognition may be performed using a fully connected neural network similar to the ones presented so far. Given sufficiently large training data, such a network could in principle yield a good solution and learn the appropriate invariances. However, typical neural networks ignore a key property of images, which is that nearby pixels are more strongly correlated than distance ones.

On the other hand, modern approaches to computer vision exploit this property by extracting local features that depend only on small subregions of the image. Information from such features are then merged in later stages of processing in order to detect higher-order features. Moreover, local features that are useful in one region of the image are likely to be useful in other regions of the image, for instance if the object of interest is translated.

These notions are incorporated into convolutional neural networks through three mechanisms:

Local receptive fields
Weight sharing
Subsampling

The convolution operation is one of the fundamental building blocks of the convolutional neural networks. As a motivating example, lets perform edge detection on an image in order to get a grasp of the convolution operation. Consider an input grayscale image of Dr. Freeman (the silent protagonist of Half-Life).

In [29]:

from PIL import Image

im = Image.open("../images/gordon_freeman.jpg").convert("L")
im_array = np.asarray(im)

THRESHOLD = 125
im_array = np.where(im_array > THRESHOLD, 255, np.where(im_array <= THRESHOLD, 0, im_array))

plt.figure(figsize=(10, 10))
plt.subplot(1, 2, 1)
plt.imshow(im, cmap="gray")
plt.title("Grayscale")
plt.subplot(1, 2, 2)
plt.imshow(im_array, cmap="gray")
plt.title("Black and White")
plt.show()

Each grayscale image comprises a set of pixel intensity values represented as a $2$ -dimensional array of integers in $[0,255]$ , where $0$ represents completely black and $255$ completely white. In order to make subsequent edge detection more apparent, we further transform the input image to black and white using thresholding on intensity value $125$ . To that end, the image of Dr. Freeman is represented by the following matrix:

In [30]:

im_array

Out[30]:

array([[255, 255, 255, ..., 255, 255, 255],
       [255, 255, 255, ..., 255, 255, 255],
       [255, 255, 255, ..., 255, 255, 255],
       ...,
       [255, 255, 255, ...,   0,   0,   0],
       [255, 255, 255, ...,   0,   0,   0],
       [255, 255, 255, ...,   0,   0,   0]], dtype=uint8)

Then, in order to detect edges, we may construct a small matrix (called a filter or a kernel in computer vision literature) and convolve it with the image of Dr. Freeman. If we wish to detect vertical edges, then the filter may have the following form,

$\begin{bmatrix} 1 & 0 & -1\\ 1 & 0 & -1\\ 1 & 0 & -1\\ \end{bmatrix}$

or, in case of horizontal edges, it may take the form,

$\begin{bmatrix} 1 & 1 & 1\\ 0 & 0 & 0\\ -1 & -1 & -1\\ \end{bmatrix}$

The convolution of the image and the filter may be implemented as follows:

In [31]:

def convolve2D(image: np.ndarray, filter: np.ndarray, padding: int = 0, strides: int = 1) -> np.ndarray:
    assert padding >= 0, "Padding cannot be negative"
    assert strides > 0, "Stride cannot be zero or negative"

    n, m = image.shape
    f, k = filter.shape

    # shape of output convolution
    output_n = int(((n - f + 2 * padding) / strides) + 1)
    output_m = int(((m - k + 2 * padding) / strides) + 1)
    output = np.zeros((output_n, output_m), dtype=np.uint8)

    # apply padding
    if padding > 0:
        padded_image = np.zeros((n + padding * 2, m + padding * 2))
        padded_image[padding:-padding, padding:-padding] = image
    else:
        padded_image = image

    for i in range(output_n):
        for j in range(output_m):
            output[i, j] = (filter * padded_image[i * strides : i * strides + f, j * strides : j * strides + f]).sum()

    return output

The following figure depicts the original black and white image along the horizontal and vertical edges detected by the filters.

In [32]:

horizontal_filter = np.array(
    [
        [1, 1, 1],
        [0, 0, 0],
        [-1, -1, -1],
    ]
)

vertical_filter = np.array(
    [
        [1, 0, -1],
        [1, 0, -1],
        [1, 0, -1],
    ]
)

plt.figure(figsize=(10, 10))
plt.subplot(1, 3, 1)
plt.imshow(im_array, cmap="gray")
plt.subplot(1, 3, 2)
plt.imshow(convolve2D(im_array, horizontal_filter), cmap="gray")
plt.subplot(1, 3, 3)
plt.imshow(convolve2D(im_array, vertical_filter), cmap="gray")
plt.show()

Moreover, padding may be used on the original image before convoluation in order for the output size to be the same as the input size.

In [33]:

print(
    f"Original size: {im_array.shape}, Output size (no padding): {convolve2D(im_array, horizontal_filter).shape}, Output size (padding): {convolve2D(im_array, horizontal_filter, 1).shape}"
)

Original size: (604, 604), Output size (no padding): (602, 602), Output size (padding): (604, 604)

These filters detect or emphasize edges having horizontal or vertical orientation. On the other hand, there are numerous filters developed in the image processing literature. To that end, the idea behind convolutional neural networks is to learn such filters that detect useful local features (e.g. edges) over the input images. Thus, the filters in a convolutional layer are represented using the following parametric form:

$\begin{bmatrix} w_1 & w_2 & w_3 \\ w_4 & w_5 & w_6 \\ w_7 & w_8 & w_9 \\ \end{bmatrix}$

In the convolutional layer the units are organized into planes, each of which is called a feature map. Units in a feature map take inputs only from a small subregion of the image, and all of the units in a feature map are constrained to share the same weight values. The convolutional layers can also learn filters or feature maps over multiple color channels. For instance, consider the RGB version of Dr. Freeman picture:

In [34]:

from PIL import Image

im = Image.open("../images/gordon_freeman.jpg")
im_array = np.asarray(im)

plt.imshow(im, cmap="gray")
plt.show()

Note that the image matrix has three channels in the third dimension.

In [35]:

im_array.shape

Out[35]:

(604, 604, 3)

The convolution over multiple color channels is similar to the grayscale one:

In [37]:

def convolve3D(image: np.ndarray, filter: np.ndarray, padding: int = 0, strides: int = 1) -> np.ndarray:
    assert padding >= 0, "Padding cannot be negative"
    assert strides > 0, "Stride cannot be zero or negative"
    assert image.shape[2] == filter.shape[2], "Image and filter should have the same number of channels"

    n, m, nc = image.shape
    f, k, nc = filter.shape

    # shape of output convolution
    output_n = int(((n - f + 2 * padding) / strides) + 1)
    output_m = int(((m - k + 2 * padding) / strides) + 1)
    output = np.zeros((output_n, output_m), dtype=np.uint8)

    # apply padding
    if padding > 0:
        padded_image = np.zeros((n + padding * 2, m + padding * 2, nc))
        padded_image[padding:-padding, padding:-padding, :] = image
    else:
        padded_image = image

    for i in range(output_n):
        for j in range(output_m):
            output[i, j] = (
                filter * padded_image[i * strides : i * strides + f, j * strides : j * strides + f, :]
            ).sum()

    return output

The following figure depicts the original RGB image along the horizontal and vertical edges detected by the 3-channel filters.

In [38]:

horizontal_filter = np.array(
    [
        [
            [1, 1, 1],
            [0, 0, 0],
            [-1, -1, -1],
        ],
        [
            [1, 1, 1],
            [0, 0, 0],
            [-1, -1, -1],
        ],
        [
            [1, 1, 1],
            [0, 0, 0],
            [-1, -1, -1],
        ],
    ]
)

vertical_filter = np.array(
    [
        [
            [1, 0, -1],
            [1, 0, -1],
            [1, 0, -1],
        ],
        [
            [1, 0, -1],
            [1, 0, -1],
            [1, 0, -1],
        ],
        [
            [1, 0, -1],
            [1, 0, -1],
            [1, 0, -1],
        ],
    ]
)

plt.figure(figsize=(10, 10))
plt.subplot(1, 3, 1)
plt.imshow(im_array)
plt.subplot(1, 3, 2)
plt.imshow(convolve3D(im_array, horizontal_filter))
plt.subplot(1, 3, 3)
plt.imshow(convolve3D(im_array, vertical_filter))
plt.show()

For RGB images the respective convolutional layers learn filters having multiple channels (usually three), by representing them as tensors.

Convolutional Layer¶

A single convolutional layer may take as input an RGB image and convolve it with a number of filters as depicted in the following figure:

Summary of the notation:

Symbol	Description
$n_h^{[l-1]}$	input height
$n_h^{[l]}$	output height
$n_w^{[l-1]}$	input width
$n_w^{[l]}$	output width
$f^{[l]}$	filter size
$p^{[l]}$	padding
$s^{[l]}$	stride
$n_c^{[l]}$	number of filters

Each filter convolution inside the layer yields an output matrix of dimension $\Big( \frac{n_h^{[0]} - f^{[0]} + 2 * p^{[0]}}{s^{[0]}} + 1 \Big) \times \Big( \frac{n_w^{[0]} - f^{[0]} + 2 * p^{[0]}}{s^{[0]}} + 1 \Big)$ . Then the bias parameter is added to the resulting matrix and the result is passed through the activation function $g$ . The final matrices across all $n_c^{[0]}$ filters of the layer are stacked together to yield a tensor of dimension $\Big( \frac{n_h^{[0]} - f^{[0]} + 2 * p^{[0]}}{s^{[0]}} + 1 \Big) \times \Big( \frac{n_w^{[0]} - f^{[0]} + 2 * p^{[0]}}{s^{[0]}} + 1 \Big) \times n_c^{[0]}$ (represented by the 3-dimensional rectangle). The parameter matrix $\mathbf{W}^{[l]}$ of the convolutional layer has dimension $f^{[l]} \times f^{[l]} \times n_c^{[l-1]} \times n_c^{[l]}$ .

Why Convolutions?

Parameter Sharing: A feature detector that is useful in one part of the image is probably useful in another part of the image. Therefore filter parameters derived during a learning process are used to detect features or edges over many parts of the image.

Sparsity of connections: In each layer, each output value depends only on a small number of inputs, that is, adjacent pixel values.

Both of these properties allows us to construct neural networks that have a lot fewer parameters than fully connected architectures. Moreover, fewer parameters can be learned using smaller training sets in contrast to the corresponding fully connected neural networks. Moreover, convolutional layers are effective at capturing translation invariance, which means that an image shifted by a few pixels should result in pretty similar features and to that end, classified in a similar way.

Pooling Layers¶

Pooling layers perform a mathematical operation, usually an aggregation, over sub-regions of an image. For instance, max pooling computes the maximum pixel intensity of each sub-region of the input image. Thus, given a $4 \times 4$ image a max pooling layer of size $2$ and stride $2$ , should yield the following:

Pooling layers are applied on each channel independently, thus computing an output tensor the same dimension. In the context of edge detection, a max pooling layer may intuitevely keep stronger or more apparent edges. However, in practice, pooling is used because it is shown experimentaly that works well and not because we have a concrete proof about its importance. There are other kinds of pooling, except max pooling, like average pooling which averages the pixel intensities of each region instead of computing the maximum. Note that pooling have no learnable parameters only a couple of hyperparameters.

In [39]:

pooled = nn.MaxPooling(pool_size=(25, 5), stride=5)._forward(im_array[None, :])

plt.imshow(pooled[0].astype(int))
plt.show()

LeNet-5 MNIST¶

In [46]:

images, labels = load_mnist_dataset(500)

random_indices = [random.randint(0, images.shape[0]) for _ in range(10)]

plt.figure(figsize=(10, 5))

for i, index in enumerate(random_indices):
    image = images[index]
    plt.subplot(int(len(random_indices) / 5) + 1, 5, i + 1)
    plt.imshow(image, cmap=plt.cm.gray)
    plt.title(f"Label: '{labels[index]}'")
    plt.axis("off")

In [47]:

images = (images - images.mean(axis=(1, 2), keepdims=True)) / images.std(axis=(1, 2), keepdims=True)

In [48]:

encoder = OneHotEncoder()
one_hot_labels = encoder.encode(labels)

images_train, images_test, one_hot_labels_train, one_hot_labels_test = train_test_split(
    images, one_hot_labels, test_size=0.2, stratify=one_hot_labels
)

In [49]:

images[0].shape

Out[49]:

(28, 28)

In [58]:

model = nn.NeuralNetwork(
    nn.ConvLayer(1, 6, kernel_size=(5, 5), padding=2),
    nn.ReLU(),
    nn.MaxPooling(pool_size=(2, 2), stride=2),
    nn.ConvLayer(6, 16, kernel_size=(5, 5)),
    nn.ReLU(),
    nn.MaxPooling(pool_size=(2, 2), stride=2),
    nn.Flatten(),
    nn.LinearLayer(400, 120),
    nn.ReLU(),
    nn.LinearLayer(120, 84),
    nn.ReLU(),
    nn.LinearLayer(84, 10),
    nn.Softmax(),
)

In [60]:

model.fit(
    images_train[:, :, :, None],
    one_hot_labels_train,
    epochs=50,
    batch_size=10,
    loss=nn.CrossEntropyLoss(),
    optimizer=nn.AdamW(learning_rate=0.00001, weight_decay=1e-2),
)

-- Epoch 1 ---
Cost: 1.598538332796852
-- Epoch 3 ---
Cost: 1.3129573812941102
-- Epoch 5 ---
Cost: 1.3086841008174646
-- Epoch 7 ---
Cost: 0.5296930303635164
-- Epoch 9 ---
Cost: 0.24401742821367545
-- Epoch 11 ---
Cost: 1.2518014840898435
-- Epoch 13 ---
Cost: 0.1519895430574175
-- Epoch 15 ---
Cost: 0.10023537642811078
-- Epoch 17 ---
Cost: 0.5512170802002273
-- Epoch 19 ---
Cost: 0.4524269162805252

In [61]:

train_predictions = np.argmax(model(images_train[:, :, :, None]), axis=1)
print(classification_report(encoder.decode(one_hot_labels_train), train_predictions, zero_division=0))

              precision    recall  f1-score   support

           0       0.95      0.97      0.96        39
           1       0.94      0.96      0.95        46
           2       0.97      0.88      0.93        42
           3       0.90      0.93      0.91        40
           4       1.00      0.72      0.84        39
           5       0.88      0.81      0.84        36
           6       0.83      1.00      0.90        38
           7       0.82      0.88      0.85        41
           8       0.84      0.82      0.83        39
           9       0.76      0.85      0.80        40

    accuracy                           0.88       400
   macro avg       0.89      0.88      0.88       400
weighted avg       0.89      0.88      0.88       400

In [62]:

test_predictions = np.argmax(model(images_test[:, :, :, None]), axis=1)
print(classification_report(encoder.decode(one_hot_labels_test), test_predictions, zero_division=0))

              precision    recall  f1-score   support

           0       0.82      0.90      0.86        10
           1       1.00      1.00      1.00        11
           2       1.00      0.50      0.67        10
           3       0.60      0.60      0.60        10
           4       0.88      0.70      0.78        10
           5       0.62      0.89      0.73         9
           6       1.00      1.00      1.00        10
           7       0.89      0.80      0.84        10
           8       0.70      0.70      0.70        10
           9       0.69      0.90      0.78        10

    accuracy                           0.80       100
   macro avg       0.82      0.80      0.80       100
weighted avg       0.82      0.80      0.80       100

5.6 Mixture Density Networks¶

Consider a dataset generated by sampling a variable $x$ uniformly over the interval $[0,1]$ and the corresponding target values $t$ by computing the function $f(x) = x + 0.3\sin(2\pi x)$ and adding uniform noise over the interval $[-0.1, 0.1]$ . The inverse dataset is obtained by exchanging the roles of $x$ and $t$ . Then, by training a $2$ -layer neural network having $6$ hidden units and a single linear output unit, we can see that it leads to a very poor model for the highly non-Gaussian inverse problem. That is because least squares corresponds to maximum likelihood under a Gaussian assumption.

In [66]:

x, y = generate_toy_data(lambda x: x + 0.3 * np.sin(2 * np.pi * x), sample_size=300, std=0.1, uniform=True)

model_xy = create_network(6)

model_xy.fit(x[:, None], y[:, None], epochs=100000, loss=nn.SSELoss(), verbose=False)

model_yx = create_network(6)

model_yx.fit(
    y[:, None],
    x[:, None],
    epochs=100000,
    optimizer=nn.GradientDescent(learning_rate=0.5),
    loss=nn.SSELoss(),
    verbose=False,
)

In [67]:

x_space = np.linspace(0, 1, 100)[:, None]
y_space = np.linspace(0, 1, 100)[:, None]

plt.figure(figsize=(10, 4))

plt.subplot(1, 2, 1)
plt.scatter(x, y, facecolors="none", edgecolors="green")
plt.plot(x_space, model_xy.predict(x_space), color="red")
plt.xlabel("x")
plt.ylabel("y")

plt.subplot(1, 2, 2)
plt.scatter(y, x, facecolors="none", edgecolors="green")
plt.plot(y_space, model_yx.predict(y_space), color="red")
plt.xlabel("y")
plt.ylabel("x")
plt.show()

We therefore seek a general framework for modelling conditional probability distributions. This can be achieved by using a mixture model for $p(\mathbf{t}|\mathbf{x})$ in which both the mixing coefficients as well as the component densities are parametric functions of the input $\mathbf{x}$ , giving rise to the mixture density network.

Any mixture of distributions may be used for the components, such as Bernoulli if the target variables are binary, however, we shall develop the model explicitly for Gaussian components, so that,

$p(\mathbf{t}|\mathbf{x}) = \sum_{k=1}^K \pi_k(\mathbf{x})\mathcal{N}\big(\mathbf{t}|\boldsymbol\mu_k(\mathbf{x}),\sigma_k^2(\mathbf{x})\mathbf{I}\big)$

which is an example of a heteroscedastic model specialized to the case of isotropic covariances for the components.

The partial derivatives with respect to the mixing coefficients are obtained by

$\begin{aligned} \frac{\partial{E_n}}{\partial{a}_{j}^{\pi}} &= - \frac{\partial}{\partial{a}_{j}^{\pi}} \ln \Big\{ \sum_{k=1}^K \pi_k \mathcal{N}_{nk} \Big\} \\ &= - \frac{1}{\sum_{k=1}^K \pi_k \mathcal{N}_{nk}} \sum_{k=1}^K \frac{\partial\pi_k}{\partial{a}_{j}^{\pi}} \mathcal{N}_{nk} \\ &= - \frac{1}{\sum_{k=1}^K \pi_k \mathcal{N}_{nk}} \sum_{k=1}^K \pi_j(I_{kj} - \pi_k) \mathcal{N}_{nk} \\ &= - \frac{1}{\sum_{k=1}^K \pi_k \mathcal{N}_{nk}} \Big( \pi_j\mathcal{N}_{jk} - \pi_j\sum_{k=1}^K \pi_k\mathcal{N}_{nk} \Big) \\ &= \pi_j - \frac{\pi_j\mathcal{N}_{jk}}{\sum_{k=1}^K \pi_k \mathcal{N}_{nk}} \\ &= \pi_j - \gamma_{nj} \end{aligned}$

Note that we use $a_{j}^{\pi}$ instead of $a_{k}^{\pi}$ for the subscript denoting the $j$ component in order to avoid confusion with the summation subscript $k$ . Thus, by changing the subscript back to $k$ after the proof, we arrive at the same result presented in $(5.154)$ , that is, $\frac{\partial{E_n}}{\partial{a}_{k}^{\pi}} = \pi_k - \gamma_{nk}$ .

The partial derivatives with respect to the component means are obtained by

$\begin{aligned} \frac{\partial{E_n}}{\partial{\mathbf{a}}_{k}^{\mu}} &\overset{(5.152)}{=} \frac{\partial{E_n}}{\partial{\boldsymbol\mu}_{k}} \\ &= - \frac{\partial}{\partial{\boldsymbol\mu}_{k}} \ln \Big\{ \sum_{k=1}^K \pi_k \mathcal{N}_{nk} \Big\} \\ &= - \frac{1}{\sum_{k=1}^K \pi_k \mathcal{N}_{nk}} \frac{\partial}{\partial{\boldsymbol\mu}_{k}} \pi_k \mathcal{N}_{nk} \\ &= - \frac{1}{\sum_{k=1}^K \pi_k \mathcal{N}_{nk}} \frac{\partial}{\partial{\boldsymbol\mu}_{k}} \Big\{ -\frac{1}{2} (\mathbf{t}_n - \boldsymbol\mu_k)^T (\sigma_k^2\mathbf{I})^{-1} (\mathbf{t}_n - \boldsymbol\mu_k) \Big\} \\ &= - \frac{\pi_k \mathcal{N}_{nk}}{\sum_{k=1}^K \pi_k \mathcal{N}_{nk}} \Big(-\frac{1}{2}\Big) \Big(-2\frac{1}{\sigma_k^2}\mathbf{I} (\mathbf{t}_n - \boldsymbol\mu_k)\Big) \\ &= \frac{\pi_k \mathcal{N}_{nk}}{\sum_{k=1}^K \pi_k \mathcal{N}_{nk}} \frac{\boldsymbol\mu_k - \mathbf{t}_n}{\sigma_k^2}\mathbf{I} \\ &\overset{(5.154)}{=} \gamma_{nk} \frac{\boldsymbol\mu_k - \mathbf{t}_n}{\sigma_k^2}\mathbf{I} \end{aligned}$

Thus, for a particular dimension $l$ , the final derivative would be,

$\frac{\partial{E_n}}{\partial{a}_{kl}^{\mu}} = \gamma_{nk} \frac{\mu_{kl} - t_{nl}}{\sigma_k^2}$

The partial derivatives with respect to the component of variances are obtained by

$\begin{aligned} \frac{\partial{E_n}}{\partial{a}_{k}^{\sigma}} &= - \frac{\partial}{\partial{a}_{k}^{\sigma}} \ln \Big\{ \sum_{k=1}^K \pi_k \mathcal{N}_{nk} \Big\} \\ &= - \frac{1}{\sum_{k=1}^K \pi_k \mathcal{N}_{nk}} \frac{\partial}{\partial{a}_{k}^{\sigma}} \pi_k\mathcal{N}_{nk} \\ &= - \frac{1}{\sum_{k=1}^K \pi_k \mathcal{N}_{nk}} \pi_k \frac{\partial}{\partial{a}_{k}^{\sigma}} \frac{1}{(2\pi)^{D/2}|\sigma_k\mathbf{I}|^{1/2}} \exp\Big\{-\frac{1}{2}(\mathbf{t}_n-\boldsymbol\mu_k)^\text{T}(\sigma_k\mathbf{I})^{-1}(\mathbf{t}_n-\boldsymbol\mu_k)\Big\} \\ &\overset{|\sigma_k\mathbf{I}| = \sigma_k^D}{=} - \frac{1}{\sum_{k=1}^K \pi_k \mathcal{N}_{nk}} \pi_k \frac{\partial}{\partial{a}_{k}^{\sigma}} \frac{1}{(2\pi)^{D/2}\sigma_k^{D/2}} \exp\Big\{-\frac{1}{2}(\mathbf{t}_n-\boldsymbol\mu_k)^\text{T}(\sigma_k\mathbf{I})^{-1}(\mathbf{t}_n-\boldsymbol\mu_k)\Big\} \\ &= - \frac{1}{\sum_{k=1}^K \pi_k \mathcal{N}_{nk}} \pi_k \Bigg[ \frac{1}{(2\pi)^{D/2}} \frac{\partial}{\partial{a}_{k}^{\sigma}} \frac{1}{\sigma_k^{D/2}} \exp\Big\{-\frac{1}{2}(\mathbf{t}_n-\boldsymbol\mu_k)^\text{T}(\sigma_k\mathbf{I})^{-1}(\mathbf{t}_n-\boldsymbol\mu_k)\Big\} + \frac{1}{(2\pi)^{D/2}\sigma_k^{D/2}} \exp\Big\{-\frac{1}{2}(\mathbf{t}_n-\boldsymbol\mu_k)^\text{T}(\sigma_k\mathbf{I})^{-1}(\mathbf{t}_n-\boldsymbol\mu_k)\Big\} \frac{\partial}{\partial{a}_{k}^{\sigma}} \Big( -\frac{1}{2}(\mathbf{t}_n-\boldsymbol\mu_k)^\text{T}(\sigma_k\mathbf{I})^{-1}(\mathbf{t}_n-\boldsymbol\mu_k) \Big) \Bigg] \\ &= - \frac{1}{\sum_{k=1}^K \pi_k \mathcal{N}_{nk}} \pi_k \Bigg[ \frac{-D}{2\sigma_k}\mathcal{N}_{nk} + \mathcal{N}_{nk} \frac{\partial}{\partial{a}_{k}^{\sigma}} \Big( -\frac{1}{2}(\mathbf{t}_n-\boldsymbol\mu_k)^\text{T}(\sigma_k\mathbf{I})^{-1}(\mathbf{t}_n-\boldsymbol\mu_k) \Big) \Bigg] \\ &= - \frac{1}{\sum_{k=1}^K \pi_k \mathcal{N}_{nk}} \pi_k\mathcal{N}_{nk} \Bigg[ -\frac{D}{2\sigma_k} + \frac{\partial}{\partial{a}_{k}^{\sigma}} \Big( -\frac{1}{2}(\mathbf{t}_n-\boldsymbol\mu_k)^\text{T}(\sigma_k\mathbf{I})^{-1}(\mathbf{t}_n-\boldsymbol\mu_k) \Big) \Bigg] \\ &= - \frac{\pi_k\mathcal{N}_{nk}}{\sum_{k=1}^K \pi_k \mathcal{N}_{nk}} \Bigg[ -\frac{D}{2\sigma_k} + \frac{\partial}{\partial{a}_{k}^{\sigma}} \Big( -\frac{1}{2}\frac{||\mathbf{t}_n-\boldsymbol\mu_k||^{2}}{\sigma_k}\mathbf{I} \Big) \Bigg] \\ &= - \frac{\pi_k\mathcal{N}_{nk}}{\sum_{k=1}^K \pi_k \mathcal{N}_{nk}} \Bigg[ -\frac{D}{2\sigma_k} + \frac{1}{2}\frac{||\mathbf{t}_n-\boldsymbol\mu_k||^{2}}{\sigma_k^2}\mathbf{I} \Bigg] \\ &= \frac{\pi_k\mathcal{N}_{nk}}{\sum_{k=1}^K \pi_k \mathcal{N}_{nk}} \Bigg[ \frac{D}{2\sigma_k} - \frac{1}{2}\frac{||\mathbf{t}_n-\boldsymbol\mu_k||^{2}}{\sigma_k^2}\mathbf{I} \Bigg] \\ &= \gamma_{nk} \Bigg[ \frac{D}{\sigma_k} - \frac{||\mathbf{t}_n-\boldsymbol\mu_k||^{2}}{\sigma_k^2}\mathbf{I} \Bigg] \\ \end{aligned}$

In [103]:

model = nn.NeuralNetwork(
    nn.LinearLayer(1, 5), nn.TanH(), nn.LinearLayer(5, 9), nn.Concat([nn.Softmax(), nn.Linear(), nn.Exp()])
)

model.fit(
    y[:, None],
    x[:, None],
    epochs=10000,
    batch_size=50,
    loss=nn.GaussianNLLLoss(n_components=3),
    optimizer=nn.AdamW(learning_rate=0.0001, weight_decay=1e-2),
)

-- Epoch 1 ---
Cost: 56.13401004150711
-- Epoch 1001 ---
Cost: -12.960075807796274
-- Epoch 2001 ---
Cost: -17.888693038278074
-- Epoch 3001 ---
Cost: -25.77117031181849
-- Epoch 4001 ---
Cost: -42.193536667020155
-- Epoch 5001 ---
Cost: -42.774433858274755
-- Epoch 6001 ---
Cost: -44.47439846519983
-- Epoch 7001 ---
Cost: -55.262623240038444
-- Epoch 8001 ---
Cost: -46.64232798438148
-- Epoch 9001 ---
Cost: -46.8022878753674

Once the mixture density network has been trained, it can predict the conditional density function of the target data for any given value of the input vector. For many problems we might be interested instead in finding one specific value for the output vector. The most likely value for the output vector, for a given input vector $\mathbf{x}$ is given by the maximum of the conditional density $p(\mathbf{t}|\mathbf{x})$ . Since this density is represented by a mixture model, the location of its global maximum is a problem of non-linear optimization. For applications where speed is important, a good approximation is to take the mean $\mu_i(\mathbf{x})$ fo the largest central value,

$\max_{i}\Bigg\{ \frac{\pi_i(\mathbf{x})}{\sigma_i(\mathbf{x})^c} \Bigg\}$

In [108]:

densities = model.predict(y_space)
pi, mu, sigma = np.array_split(densities, 3, axis=1)

predictions = np.take_along_axis(mu, (pi / sigma).argmax(axis=1)[:, None], axis=1)

plt.scatter(y, x, facecolors="none", edgecolors="green")
plt.plot(y_space, predictions, color="red")
plt.xlabel("y")
plt.ylabel("x")
plt.show()

In [105]:

plt.figure(figsize=(10, 4))
plt.subplot(1, 3, 1)
plt.plot(y_space, pi[:, 0], color="blue")
plt.plot(y_space, pi[:, 1], color="red")
plt.plot(y_space, pi[:, 2], color="green")
plt.title("$\pi$")

plt.subplot(1, 3, 2)
plt.plot(y_space, mu[:, 0], color="blue")
plt.plot(y_space, mu[:, 1], color="red")
plt.plot(y_space, mu[:, 2], color="green")
plt.title("$\mu$")

plt.subplot(1, 3, 3)
plt.plot(y_space, sigma[:, 0], color="blue")
plt.plot(y_space, sigma[:, 1], color="red")
plt.plot(y_space, sigma[:, 2], color="green")
plt.title("$\sigma$")
plt.show()

In [110]:

xx, yy = np.meshgrid(x_space.ravel(), y_space.ravel())
prob = pi * np.exp(-0.5 * ((y_space[:, None] - mu) ** 2) / sigma**2) / np.sqrt(2 * np.pi * sigma**2)

plt.contour(xx, yy, prob.sum(axis=-1), levels=25)
plt.scatter(y, x, facecolor="none", edgecolor="b")
plt.show()

In [ ]: