# 5.1 From Dense Layers to Convolutions¶

• Networks with many parameters either require a lot of data or a lot of regularization.
• Consider the task of distinguishing cats from dogs.
• We decide to use a good camera and take 1-megapixel photos.
• The input to the network then has 1 million dimensions.
• Even an aggressive reduction to 1,000 dimensions after the first layer means we need $10^6 \times 10^3 = 10^9$ parameters.
• Add in subsequent layers and it is clear that this approach is infeasible.
• Both humans and computers are able to distinguish cats from dogs quite well, often after only a few hundred images.
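As a sanity check on the arithmetic above, the weight count of such a dense first layer is easy to compute (a pure-Python sketch; the 1,000-unit hidden layer is the assumption from the bullets above):

```python
# Weight count for a dense first layer on a 1-megapixel input.
inputs = 1000 * 1000   # 10**6 input dimensions (1 megapixel)
hidden = 1000          # aggressive reduction to 1,000 dimensions
weights = inputs * hidden
print(weights)         # 1000000000, i.e. 10**9 weights before any biases
```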

## 5.1.1 Invariances (two key principles)¶

• [Translation Invariance] Object detectors should work the same regardless of where in the image an object is found.
• In other words, the ‘waldoness’ of a location in the image can be assessed without regard to its position within the image.
• [Locality] Object detection can be decided by considering only local information.
• In other words, the ‘waldoness’ of a location can be assessed without regard to what happens in the image at large distances.

## 5.1.2 Constraining the MLP¶

• We will treat images and hidden layers as two-dimensional arrays, i.e. $x[i,j]$ and $h[i,j]$ denote the values at position $(i,j)$ in the image and hidden layer respectively.
• In this case a dense layer can be written as follows: $$h[i,j] = \sum_{k,l} W[i,j,k,l] \cdot x[k,l] = \sum_{a, b} V[i,j,a,b] \cdot x[i+a,j+b]$$
• Here we set $V[i,j,a,b] = W[i,j,i+a, j+b]$, re-indexing the sum by the offsets $(a,b)$.
• For any given location $(i,j)$ in the hidden layer $h[i,j]$ we compute its value by summing over pixels in $x$, centered around $(i,j)$ and weighted by $V[i,j,a,b]$.
• Translation Invariance.

• This is only possible if $V$ doesn't actually depend on $(i,j)$,
• that is, we have $V[i,j,a,b] = V[a,b]$.
• As a result we can simplify the definition for $h$. $$h[i,j] = \sum_{a, b} V[a,b] \cdot x[i+a,j+b]$$
• This is a convolution!
• We are effectively weighting pixels $(i+a, j+b)$ in the vicinity of $(i,j)$ with coefficients $V[a,b]$ to obtain the value $h[i,j]$.
• Note that $V[a,b]$ needs far fewer coefficients than $V[i,j,a,b]$.
• For a 1-megapixel image, the offsets $a$ and $b$ each range over roughly $\pm 1{,}000$, so $V[a,b]$ has at most about 4 million coefficients.
• This is a reduction by a factor of roughly one million, since the weights no longer depend on the location within the image.
• Locality

• We should not have to look very far away from $(i,j)$ in order to glean relevant information to assess what is going on at $h[i,j]$.
• This means that outside some range $|a|, |b| > \Delta$, we should set $V[a,b] = 0$. $$h[i,j] = \sum_{a = -\Delta}^{\Delta} \sum_{b = -\Delta}^{\Delta} V[a,b] \cdot x[i+a,j+b]$$
• This is the convolutional layer.
• While a fully connected layer might need $10^9$ or more coefficients, we now only need $O(\Delta^2)$ terms.
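The locality-constrained layer above can be sketched in a few lines. This uses NumPy rather than the chapter's MXNet, purely as a dependency-free illustration of the formula:

```python
import numpy as np

def local_conv(x, V):
    # h[i, j] = sum_{a, b = -delta..delta} V[a, b] * x[i + a, j + b],
    # evaluated only where the window fits inside the image.
    delta = V.shape[0] // 2            # V has shape (2*delta + 1, 2*delta + 1)
    out = x.shape[0] - 2 * delta
    h = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            h[i, j] = (V * x[i:i + 2 * delta + 1, j:j + 2 * delta + 1]).sum()
    return h

x = np.arange(16.0).reshape(4, 4)
V = np.ones((3, 3))        # delta = 1, so only (2*1 + 1)**2 = 9 parameters
local_conv(x, V)           # shape (2, 2); V's size is independent of the image size
```

Note that $V$ keeps its $(2\Delta+1)^2$ coefficients no matter how large the input image grows.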

# 5.2 Convolutions for Images¶

• Strictly speaking, the name convolutional network is a slight misnomer (though only in notation), since the operations are typically expressed as cross-correlations.

## 5.2.1 The Cross-Correlation Operator¶

• kernel or filter
• The output array has a height of 2 and width of 2, and the four elements are derived from a two-dimensional cross-correlation operation:
$$0\times0+1\times1+3\times2+4\times3=19,\\ 1\times0+2\times1+4\times2+5\times3=25,\\ 3\times0+4\times1+6\times2+7\times3=37,\\ 4\times0+5\times1+7\times2+8\times3=43.$$

• Note that the output size is smaller than the input.
• Input size: $H \times W$
• Kernel size: $h \times w$
• Output size: $(H-h+1) \times (W-w+1)$.
In [1]:
from mxnet import autograd, nd
from mxnet.gluon import nn

def corr2d(X, K):  # This function has been saved in the gluonbook package for future use.
    h, w = K.shape
    Y = nd.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = (X[i: i + h, j: j + w] * K).sum()
    return Y

In [2]:
X = nd.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])
K = nd.array([[0, 1], [2, 3]])
corr2d(X, K)

Out[2]:
[[19. 25.]
[37. 43.]]
<NDArray 2x2 @cpu(0)>

## 5.2.2 Convolutional Layers¶

• The convolutional layer...
• 1) cross-correlates the input and kernels,
• 2) adds a scalar bias to get the output.
• The model parameters of the convolutional layer include the kernel and scalar bias.
In [3]:
class Conv2D(nn.Block):
    def __init__(self, kernel_size, **kwargs):
        super(Conv2D, self).__init__(**kwargs)
        self.weight = self.params.get('weight', shape=kernel_size)
        self.bias = self.params.get('bias', shape=(1,))

    def forward(self, x):
        return corr2d(x, self.weight.data()) + self.bias.data()


## 5.2.3 Object Edge Detection in Images¶

• Let’s look at a simple application of a convolutional layer:
• detecting the edge of an object in an image by finding the location where the pixel values change.
In [4]:
X = nd.ones((6, 8))
X[:, 2:6] = 0
X

Out[4]:
[[1. 1. 0. 0. 0. 0. 1. 1.]
[1. 1. 0. 0. 0. 0. 1. 1.]
[1. 1. 0. 0. 0. 0. 1. 1.]
[1. 1. 0. 0. 0. 0. 1. 1.]
[1. 1. 0. 0. 0. 0. 1. 1.]
[1. 1. 0. 0. 0. 0. 1. 1.]]
<NDArray 6x8 @cpu(0)>
In [5]:
K = nd.array([[1, -1]])

• The output is 1 at the edge from white to black and -1 at the edge from black to white.
• All other outputs are 0.
In [6]:
Y = corr2d(X, K)
Y

Out[6]:
[[ 0.  1.  0.  0.  0. -1.  0.]
[ 0.  1.  0.  0.  0. -1.  0.]
[ 0.  1.  0.  0.  0. -1.  0.]
[ 0.  1.  0.  0.  0. -1.  0.]
[ 0.  1.  0.  0.  0. -1.  0.]
[ 0.  1.  0.  0.  0. -1.  0.]]
<NDArray 6x7 @cpu(0)>
• Let’s apply the kernel to the transposed ‘image’. As expected, it vanishes.
• The kernel K only detects vertical edges.
In [7]:
corr2d(X.T, K)

Out[7]:
[[0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0.]]
<NDArray 8x5 @cpu(0)>

## 5.2.4 Learning a Kernel¶

• We can learn the kernel that generated $Y$ from $X$ by looking at the (input, output) pairs only.
• We use the built-in Conv2D class provided by Gluon below.
• We construct a convolutional layer with 1 output channel and a kernel array shape of (1, 2).
In [16]:
conv2d = nn.Conv2D(channels=1, kernel_size=(1, 2))
conv2d.initialize()

• The two-dimensional convolutional layer uses four-dimensional input and output in the format (example, channel, height, width).
• The batch size (number of examples in the batch): 1
• The number of channels: 1
In [17]:
X = X.reshape((1, 1, 6, 8))
Y = Y.reshape((1, 1, 6, 7))
(X, Y)

Out[17]:
(
[[[[1. 1. 0. 0. 0. 0. 1. 1.]
[1. 1. 0. 0. 0. 0. 1. 1.]
[1. 1. 0. 0. 0. 0. 1. 1.]
[1. 1. 0. 0. 0. 0. 1. 1.]
[1. 1. 0. 0. 0. 0. 1. 1.]
[1. 1. 0. 0. 0. 0. 1. 1.]]]]
<NDArray 1x1x6x8 @cpu(0)>,
[[[[ 0.  1.  0.  0.  0. -1.  0.]
[ 0.  1.  0.  0.  0. -1.  0.]
[ 0.  1.  0.  0.  0. -1.  0.]
[ 0.  1.  0.  0.  0. -1.  0.]
[ 0.  1.  0.  0.  0. -1.  0.]
[ 0.  1.  0.  0.  0. -1.  0.]]]]
<NDArray 1x1x6x7 @cpu(0)>)
In [18]:
for i in range(10):
    with autograd.record():
        Y_hat = conv2d(X)
        l = (Y_hat - Y) ** 2
    l.backward()
    # For the sake of simplicity, we ignore the bias here.
    conv2d.weight.data()[:] -= 3e-2 * conv2d.weight.grad()
    print('batch %d, loss %.3f' % (i + 1, l.sum().asscalar()))

batch 1, loss 12.495
batch 2, loss 5.132
batch 3, loss 2.111
batch 4, loss 0.871
batch 5, loss 0.360
batch 6, loss 0.150
batch 7, loss 0.063
batch 8, loss 0.027
batch 9, loss 0.012
batch 10, loss 0.005

In [19]:
conv2d.weight.data().reshape((1, 2))

Out[19]:
[[ 0.9917276 -0.9848021]]
<NDArray 1x2 @cpu(0)>

## 5.2.5 Cross-correlation and Convolution¶

• The core computation of a two-dimensional convolutional layer is a two-dimensional cross-correlation operation.
• This performs a cross-correlation operation on the two-dimensional input data and the kernel, and then adds a bias.
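To see the distinction concretely: a true convolution is a cross-correlation with the kernel flipped in both dimensions. The NumPy sketch below (an assumption of this note, not the book's code) reuses the corr2d logic from Section 5.2.1:

```python
import numpy as np

def corr2d(X, K):
    # plain cross-correlation, as implemented earlier in the chapter
    h, w = K.shape
    return np.array([[(X[i:i + h, j:j + w] * K).sum()
                      for j in range(X.shape[1] - w + 1)]
                     for i in range(X.shape[0] - h + 1)])

def conv2d(X, K):
    # a true convolution is cross-correlation with the kernel flipped
    # in both the height and width dimensions
    return corr2d(X, K[::-1, ::-1])

X = np.array([[0., 1, 2], [3, 4, 5], [6, 7, 8]])
K = np.array([[0., 1], [2, 3]])
corr2d(X, K)               # [[19, 25], [37, 43]], as in Out[2]
conv2d(X, K[::-1, ::-1])   # the same result: flipping the pre-flipped kernel restores K
```

Since a layer simply learns whatever kernel fits, the distinction does not matter in practice, which is why the misnomer is harmless.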

# 5.3 Padding and Stride¶

• Assuming the input shape is $n_h\times n_w$ and the convolution kernel window shape is $k_h\times k_w$, then the output shape will be $$(n_h-k_h+1) \times (n_w-k_w+1).$$
• Multiple layers of convolutions reduce the information available at the boundary, often by much more than we would want. If we start with a $240\times 240$ pixel image, 10 layers of $5\times 5$ convolutions reduce it to $200\times 200$ pixels, effectively slicing off 30% of the image and with it obliterating anything interesting on the boundaries.
• Padding mitigates this problem.
• Add extra pixels around the boundary of the image, thus increasing its effective size.
• The extra pixels typically assume the value 0.

• If a total of $p_h$ rows are padded on both sides of the height and a total of $p_w$ columns are padded on both sides of width, the output shape will be $$(n_h-k_h+p_h+1)\times(n_w-k_w+p_w+1)$$

• This means that the height and width of the output will increase by $p_h$ and $p_w$ respectively.
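The padded output-shape formula above can be wrapped in a small helper (the function name here is hypothetical):

```python
def padded_out_shape(n_h, n_w, k_h, k_w, p_h, p_w):
    # p_h and p_w are the *total* rows/columns of padding, split over both sides
    return (n_h - k_h + p_h + 1, n_w - k_w + p_w + 1)

print(padded_out_shape(8, 8, 3, 3, 0, 0))  # (6, 6): no padding shrinks the output
print(padded_out_shape(8, 8, 3, 3, 2, 2))  # (8, 8): p = k - 1 preserves the shape
```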

• In many cases, we will want to set $p_h=k_h-1$ and $p_w=k_w-1$ to give the input and output the same height and width.
• This will make it easier to predict the output shape of each layer when constructing the network.
• If $k_h$ is odd here, we will pad $p_h/2$ rows on both sides of the height.
• If $k_h$ is even, one possibility is to pad $\lceil p_h/2\rceil$ rows on the top of the input and $\lfloor p_h/2\rfloor$ rows on the bottom.
• We will pad both sides of the width in the same way.
• Convolutional neural networks often use convolution kernels with odd height and width, such as 1, 3, 5, and 7,
• so that the number of padding rows or columns on both sides is the same.
• In the following example, the output size is
• $(n_h-k_h+p_h+1)\times(n_w-k_w+p_w+1) = (8-3+2+1)\times(8-3+2+1) = 8 \times 8$
In [20]:
from mxnet import nd
from mxnet.gluon import nn

# We define a convenience function to calculate the convolutional layer. This function initializes
# the convolutional layer weights and performs corresponding dimensionality elevations and reductions
# on the input and output.
def comp_conv2d(conv2d, X):
    conv2d.initialize()
    # (1, 1) indicates that the batch size and the number of channels (described in later chapters) are both 1.
    X = X.reshape((1, 1) + X.shape)
    Y = conv2d(X)
    return Y.reshape(Y.shape[2:])  # Exclude the first two dimensions that do not interest us: batch and channel.

# Note that here 1 row or column is padded on either side, so a total of 2 rows or columns are added.
conv2d = nn.Conv2D(1, kernel_size=3, padding=1)
X = nd.random.uniform(shape=(8, 8))
comp_conv2d(conv2d, X).shape

Out[20]:
(8, 8)
• In the following example, the output size is
• $(n_h-k_h+p_h+1)\times(n_w-k_w+p_w+1) = (8-5+4+1)\times(8-3+2+1) = 8 \times 8$
In [23]:
# Here, we use a convolution kernel with a height of 5 and a width of 3. The padding numbers on
# both sides of the height and width are 2 and 1, respectively.
conv2d = nn.Conv2D(1, kernel_size=(5, 3), padding=(2, 1))
comp_conv2d(conv2d, X).shape

Out[23]:
(8, 8)

## 5.3.2 Stride¶

• Cross-correlation with strides of 3 and 2 for height and width respectively.
• When the stride for the height is $s_h$ and the stride for the width is $s_w$, the output shape is
$$\lfloor(n_h-k_h+p_h+s_h)/s_h\rfloor \times \lfloor(n_w-k_w+p_w+s_w)/s_w\rfloor.$$
• In the following example, the output size is
• $\lfloor(n_h-k_h+p_h+s_h)/s_h\rfloor \times \lfloor(n_w-k_w+p_w+s_w)/s_w\rfloor = \lfloor(8-3+2+2)/2\rfloor \times \lfloor(8-3+2+2)/2\rfloor = 4 \times 4$
In [22]:
conv2d = nn.Conv2D(1, kernel_size=3, padding=1, strides=2)
comp_conv2d(conv2d, X).shape

Out[22]:
(4, 4)
• In the following example, the output size is
• $\lfloor(n_h-k_h+p_h+s_h)/s_h\rfloor \times \lfloor(n_w-k_w+p_w+s_w)/s_w\rfloor = \lfloor(8-3+0+3)/3\rfloor \times \lfloor(8-5+2+4)/4\rfloor = 2 \times 2$
In [25]:
conv2d = nn.Conv2D(1, kernel_size=(3, 5), padding=(0, 1), strides=(3, 4))
comp_conv2d(conv2d, X).shape

Out[25]:
(2, 2)
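The floor-division formula can likewise be wrapped in a small per-axis helper (hypothetical name), reproducing the shapes of both strided examples above:

```python
def strided_out_dim(n, k, p, s):
    # floor((n - k + p + s) / s), where p is the total padding on that axis
    return (n - k + p + s) // s

print(strided_out_dim(8, 3, 2, 2))  # 4: kernel 3, one row/column of padding per side, stride 2
print(strided_out_dim(8, 3, 0, 3))  # 2: height of the kernel_size=(3, 5) example
print(strided_out_dim(8, 5, 2, 4))  # 2: width of the same example
```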
• When the padding number on both sides of the input height and width are $p_h$ and $p_w$ respectively, we call the padding $(p_h, p_w)$.
• Specifically, when $p_h = p_w = p$, the padding is $p$.
• When the strides on the height and width are $s_h$ and $s_w$, respectively, we call the stride $(s_h, s_w)$.
• Specifically, when $s_h = s_w = s$, the stride is $s$.

• By default, the padding is 0 and the stride is 1.
• In practice we rarely use inhomogeneous strides or padding, i.e. we usually have $p_h = p_w$ and $s_h = s_w$.

# 5.4 Multiple Input and Output Channels¶

• Assuming that the height and width of a color image are $h$ and $w$ (pixels), it can be represented in memory as a multi-dimensional array of shape $3\times h\times w$.
• We refer to this dimension, with a size of 3, as the channel dimension.

## 5.4.1 Multiple Input Channels¶

• When the input data contains multiple channels, we need to construct a convolution kernel with the same number of input channels, so that it can perform cross-correlation with the input data.
• The number of channels for the input data: $c_i$
• The convolution kernel window shape: $k_h\times k_w$,
• Finally, the shape of convolution kernel is $c_i\times k_h\times k_w$
• Cross correlation
• $(1\times1+2\times2+4\times3+5\times4)+(0\times0+1\times1+3\times2+4\times3)=56$
In [26]:
import gluonbook as gb
from mxnet import nd

def corr2d_multi_in(X, K):
    # First, traverse along the 0th dimension (channel dimension) of X and K.
    # Then, add the per-channel results together; the * turns the result list
    # into positional arguments of the add_n function.
    return nd.add_n(*[gb.corr2d(x, k) for x, k in zip(X, K)])

In [27]:
X = nd.array([
[[0, 1, 2], [3, 4, 5], [6, 7, 8]],
[[1, 2, 3], [4, 5, 6], [7, 8, 9]]
])
K = nd.array([
[[0, 1], [2, 3]],
[[1, 2], [3, 4]]
])

corr2d_multi_in(X, K)

Out[27]:
[[ 56.  72.]
[104. 120.]]
<NDArray 2x2 @cpu(0)>

## 5.4.2 Multiple Output Channels¶

• We might need more than one output
• for edge detection in different directions or
• for more advanced filters
• The number of input channels: $c_i$
• The number of output channels: $c_o$
• let $k_h$ and $k_w$ be the height and width of the kernel.
• To get an output with multiple channels, we can create a kernel array of shape $c_i\times k_h\times k_w$ for each output channel.
• We concatenate them on the output channel dimension, so that the shape of the convolution kernel is $c_o\times c_i\times k_h\times k_w$.
• In cross-correlation operations, each output channel is computed from its own $c_i\times k_h\times k_w$ kernel array and the entire input array.
In [29]:
def corr2d_multi_in_out(X, K):
    # Traverse along the 0th dimension of K and, each time, perform
    # cross-correlation operations with the input X. All of the results
    # are merged together using the stack function.
    return nd.stack(*[corr2d_multi_in(X, k) for k in K])

• We construct a convolution kernel with 3 output channels by concatenating the kernel array $K$ with $K+1$ (plus one for each element in $K$) and $K+2$.
In [31]:
K = nd.array([
[[0, 1], [2, 3]],
[[1, 2], [3, 4]]
])

K = nd.stack(K, K + 1, K + 2)
K.shape

Out[31]:
(3, 2, 2, 2)
In [32]:
corr2d_multi_in_out(X, K)

Out[32]:
[[[ 56.  72.]
[104. 120.]]

[[ 76. 100.]
[148. 172.]]

[[ 96. 128.]
[192. 224.]]]
<NDArray 3x2x2 @cpu(0)>

## 5.4.3 1×1 Convolutional Layer¶

• $1 \times 1$ convolution, i.e. $k_h = k_w = 1$

• A $1 \times 1$ convolution obviously does not correlate adjacent pixels.
• $1\times 1$ convolution loses the ability of the convolutional layer to recognize patterns composed of adjacent elements in the height and width dimensions.

• The main computation of the $1\times 1$ convolution occurs on the channel dimension.

• The inputs and outputs have the same height and width.
• Each element in the output is derived from a linear combination of elements in the same position in the height and width of the input between different channels.
• Assuming that the channel dimension is considered a feature dimension and that the elements in the height and width dimensions are considered data examples, then the $1\times 1$ convolutional layer is equivalent to the fully connected layer.
In [33]:
def corr2d_multi_in_out_1x1(X, K):
    c_i, h, w = X.shape
    c_o = K.shape[0]
    X = X.reshape((c_i, h * w))    # (c_i, h*w) = (3, 9)
    K = K.reshape((c_o, c_i))      # (c_o, c_i) = (2, 3)
    Y = nd.dot(K, X)               # Matrix multiplication in the fully connected layer.
    return Y.reshape((c_o, h, w))  # (2, 3, 3)

In [35]:
X = nd.random.uniform(shape=(3, 3, 3))
K = nd.random.uniform(shape=(2, 3, 1, 1))

Y1 = corr2d_multi_in_out_1x1(X, K)
Y2 = corr2d_multi_in_out(X, K)
(Y1 - Y2).norm().asscalar() < 1e-6

Out[35]:
True
• The $1\times 1$ convolutional layer is equivalent to the fully connected layer, when applied on a per pixel basis.

• The $1\times 1$ convolutional layer is typically used to adjust the number of channels between network layers and to control model complexity.

# 5.5 Pooling¶

• As we process images (or other data sources) we will eventually want to reduce the resolution of the images.
• Reasons
• 1) We typically want to output an estimate that does not depend on the dimensionality of the original image.
• 2) When detecting lower-level features, such as edge detection, we often want to have some degree of invariance to translation.

## 5.5.1 Maximum Pooling and Average Pooling¶

• Pooling computes an output for each fixed-shape window (also known as a pooling window) slid over the input data.
• The pooling layer directly calculates the maximum or average value of the elements in the pooling window.
• These operations are called maximum pooling and average pooling respectively.
• The four output elements are derived by taking the maximum in each window: $$\max(0,1,3,4)=4,\\ \max(1,2,4,5)=5,\\ \max(3,4,6,7)=7,\\ \max(4,5,7,8)=8.$$

• The pooling layer with a pooling window shape of $p \times q$ is called the $p \times q$ pooling layer.

• The pooling operation is called $p \times q$ pooling.
• For example, using a $2\times 2$ maximum pooling layer, we can still detect the pattern recognized by the convolutional layer if it moves by no more than one element in height or width.
In [36]:
from mxnet import nd
from mxnet.gluon import nn

def pool2d(X, pool_size, mode='max'):
    p_h, p_w = pool_size
    Y = nd.zeros((X.shape[0] - p_h + 1, X.shape[1] - p_w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            if mode == 'max':
                Y[i, j] = X[i: i + p_h, j: j + p_w].max()
            elif mode == 'avg':
                Y[i, j] = X[i: i + p_h, j: j + p_w].mean()
    return Y

In [37]:
X = nd.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])
pool2d(X, (2, 2))

Out[37]:
[[4. 5.]
[7. 8.]]
<NDArray 2x2 @cpu(0)>
In [38]:
pool2d(X, (2, 2), 'avg')

Out[38]:
[[2. 3.]
[5. 6.]]
<NDArray 2x2 @cpu(0)>
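The translation tolerance claimed at the start of this section can be checked with a small sketch (NumPy is an assumption here; the logic mirrors pool2d above):

```python
import numpy as np

def max_pool(X, p_h=2, p_w=2):
    # stride-1 maximum pooling with a p_h x p_w window
    return np.array([[X[i:i + p_h, j:j + p_w].max()
                      for j in range(X.shape[1] - p_w + 1)]
                     for i in range(X.shape[0] - p_h + 1)])

edge = np.array([[0., 1, 0, 0],
                 [0., 1, 0, 0]])      # a vertical edge response
shifted = np.roll(edge, 1, axis=1)    # the same edge moved one pixel right
max_pool(edge)                        # [[1., 1., 0.]]
max_pool(shifted)                     # [[0., 1., 1.]]  -- still fires at column 1
```

The pooled activation at column 1 is 1 in both cases, so a one-pixel shift of the input pattern leaves part of the pooled output unchanged.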

## 5.5.2 Padding and Stride¶

• We will demonstrate the use of padding and stride in the pooling layer through the two-dimensional maximum pooling layer MaxPool2D in the nn module.
• Pooling, combined with a stride larger than 1 can be used to reduce the resolution.
In [39]:
X = nd.arange(16).reshape((1, 1, 4, 4))
X

Out[39]:
[[[[ 0.  1.  2.  3.]
[ 4.  5.  6.  7.]
[ 8.  9. 10. 11.]
[12. 13. 14. 15.]]]]
<NDArray 1x1x4x4 @cpu(0)>
• Because there are no model parameters in the pooling layer, we do not need to call the parameter initialization function.
• By default, the stride in the MaxPool2D class has the same shape as the pooling window.
In [40]:
pool2d = nn.MaxPool2D(3)
pool2d(X)

Out[40]:
[[[[10.]]]]
<NDArray 1x1x1x1 @cpu(0)>
In [41]:
pool2d = nn.MaxPool2D(3, padding=1, strides=2)
pool2d(X)

Out[41]:
[[[[ 5.  7.]
[13. 15.]]]]
<NDArray 1x1x2x2 @cpu(0)>
In [42]:
pool2d = nn.MaxPool2D((2, 3), padding=(1, 2), strides=(2, 3))
pool2d(X)

Out[42]:
[[[[ 0.  3.]
[ 8. 11.]
[12. 15.]]]]
<NDArray 1x1x3x2 @cpu(0)>

## 5.5.3 Multiple Channels¶

• When processing multi-channel input data, the pooling layer pools each input channel separately.
• [Note] A convolutional layer, by contrast, sums over the input channels.
• This means that the number of output channels for the pooling layer is the same as the number of input channels.
In [44]:
X = nd.arange(16).reshape((1, 1, 4, 4))
X = nd.concat(X, X + 1, dim=1)
X

Out[44]:
[[[[ 0.  1.  2.  3.]
[ 4.  5.  6.  7.]
[ 8.  9. 10. 11.]
[12. 13. 14. 15.]]

[[ 1.  2.  3.  4.]
[ 5.  6.  7.  8.]
[ 9. 10. 11. 12.]
[13. 14. 15. 16.]]]]
<NDArray 1x2x4x4 @cpu(0)>
In [45]:
pool2d = nn.MaxPool2D(3, padding=1, strides=2)
pool2d(X)

Out[45]:
[[[[ 5.  7.]
[13. 15.]]

[[ 6.  8.]
[14. 16.]]]]
<NDArray 1x2x2x2 @cpu(0)>