- Networks with many parameters either require a lot of data or a lot of regularization.
- Consider the task of distinguishing cats from dogs.
- We decide to use a good camera and take 1 megapixel photos.
- The input into a network has 1 million dimensions.
- Even an aggressive reduction to 1,000 dimensions after the first layer means that we need $10^9$ parameters.

- Add in subsequent layers and it is clear that this approach is infeasible.

- Both humans and computers are able to distinguish cats from dogs quite well, often after only a few hundred images.

- *Translation Invariance*: Object detectors should work the same regardless of where in the image an object can be found.
- In other words, the ‘waldoness’ of a location in the image can be assessed without regard to the position within the image.

- *Locality*: Object detection can be answered by considering only local information.
- In other words, the ‘waldoness’ of a location can be assessed without regard to what else happens in the image at large distances.

- We will treat images and hidden layers as two-dimensional arrays, i.e. $x[i,j]$ and $h[i,j]$ denote the values at position $(i,j)$ in the image and in the hidden layer, respectively.
- In this case a dense layer can be written as follows: $$h[i,j] = \sum_{k,l} W[i,j,k,l] \cdot x[k,l] = \sum_{a, b} V[i,j,a,b] \cdot x[i+a,j+b]$$
- Here we re-index with $k = i+a$, $l = j+b$ and set $V[i,j,a,b] = W[i,j,i+a, j+b]$.
- For any given location $(i,j)$ in the hidden layer $h[i,j]$ we compute its value by summing over pixels in $x$, centered around $(i,j)$ and weighted by $V[i,j,a,b]$.

- *Translation Invariance*: This is only possible if $V$ doesn't actually depend on $(i,j)$,
- that is, we have $V[i,j,a,b] = V[a,b]$.

- As a result we can simplify the definition for $h$. $$h[i,j] = \sum_{a, b} V[a,b] \cdot x[i+a,j+b]$$
- This is a convolution!
- We are effectively weighting pixels $(i+a, j+b)$ in the vicinity of $(i,j)$ with coefficients $V[a,b]$ to obtain the value $h[i,j]$.
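The effect of sharing one $V[a,b]$ across all locations can be checked numerically. The following is a minimal NumPy sketch, not the notebook's MXNet code: the helper `corr2d`, the image size and the random kernel are all assumptions made for illustration. It shows that moving an object in the input moves the response by the same amount and leaves it otherwise unchanged.

```python
import numpy as np

def corr2d(x, v):
    """h[i, j] = sum_{a, b} v[a, b] * x[i + a, j + b] over all valid positions."""
    kh, kw = v.shape
    h = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(h.shape[0]):
        for j in range(h.shape[1]):
            h[i, j] = (x[i:i + kh, j:j + kw] * v).sum()
    return h

rng = np.random.default_rng(0)
v = rng.normal(size=(3, 3))      # one shared kernel for every location

x = np.zeros((10, 10))
x[3:5, 3:5] = 1.0                # a small "object", kept away from the border
x_shifted = np.roll(x, shift=(2, 2), axis=(0, 1))   # same object moved by (2, 2)

h = corr2d(x, v)
h_shifted = corr2d(x_shifted, v)

# The response pattern moves with the object but is otherwise identical.
same_up_to_shift = np.allclose(np.roll(h, shift=(2, 2), axis=(0, 1)), h_shifted)
print(same_up_to_shift)
```

The object is placed away from the border on purpose: near the boundary, part of the response would fall outside the valid output range and the comparison would no longer be exact.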

- Note that $V[a,b]$ needs a lot fewer coefficients than $V[i,j,a,b]$.
- For a 1 megapixel image, it has at most 1 million coefficients.
- This is a factor of 1 million fewer parameters, since $V$ no longer depends on the location within the image.

- *Locality*: We should not have to look very far away from $(i,j)$ in order to glean relevant information to assess what is going on at $h[i,j]$.
- This means that outside some range $|a|, |b| > \Delta$, we should set $V[a,b] = 0$. $$h[i,j] = \sum_{a = -\Delta}^{\Delta} \sum_{b = -\Delta}^{\Delta} V[a,b] \cdot x[i+a,j+b]$$
- This is the convolutional layer.
- While a fully connected layer might have needed $10^8$ or more coefficients, we now only need $O(\Delta^2)$ terms, namely the $(2\Delta+1)^2$ entries of $V$.
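The windowed sum over $a, b \in [-\Delta, \Delta]$ can be written out directly. This is a sketch in NumPy under two assumptions that the text does not fix: the function name `local_layer` is hypothetical, and pixels outside the image are treated as 0.

```python
import numpy as np

def local_layer(x, V, delta):
    """h[i, j] = sum_{a=-delta}^{delta} sum_{b=-delta}^{delta} V[a, b] * x[i + a, j + b].

    V is stored as a (2*delta + 1, 2*delta + 1) array, so the coefficient
    V[a, b] of the formula lives at V[a + delta, b + delta].
    Pixels outside the image count as 0 (an assumption for this sketch).
    """
    xp = np.pad(x, delta)                      # zero border of width delta
    size = 2 * delta + 1
    h = np.zeros(x.shape)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            h[i, j] = (V * xp[i:i + size, j:j + size]).sum()
    return h

delta = 2
x = np.arange(36.0).reshape(6, 6)

V = np.zeros((2 * delta + 1, 2 * delta + 1))
V[delta, delta] = 1.0                          # only V[0, 0] is non-zero
h = local_layer(x, V, delta)                   # so h reproduces x

print(V.size)          # number of coefficients in the local layer
print(x.size ** 2)     # coefficients of a dense map between two 6 x 6 arrays
```

For this toy 6 x 6 input the local layer uses 25 coefficients where a dense map would use 1296, and the gap widens quickly with image size because the local count does not depend on it.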

- Strictly speaking, convolutional networks are a slight misnomer (but for notation only), since the operations are typically expressed as *cross-correlations*.
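The difference between the two operations is only a flip of the kernel. The sketch below uses NumPy rather than the notebook's MXNet arrays; the helper names `corr2d` and `conv2d_strict` are assumptions introduced for this illustration.

```python
import numpy as np

def corr2d(x, k):
    """Cross-correlation: slide k over x without flipping it."""
    kh, kw = k.shape
    y = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(y.shape[0]):
        for j in range(y.shape[1]):
            y[i, j] = (x[i:i + kh, j:j + kw] * k).sum()
    return y

def conv2d_strict(x, k):
    """Strict convolution: flip the kernel along both axes, then cross-correlate."""
    return corr2d(x, k[::-1, ::-1])

x = np.arange(9.0).reshape(3, 3)
k = np.array([[0.0, 1.0], [2.0, 3.0]])

y_corr = corr2d(x, k)
y_conv = conv2d_strict(x, k)
print(y_corr)
print(y_conv)
```

Since the kernel is learned from data, a network trained with one convention simply learns the flipped version of what it would learn with the other, which is why the distinction matters for notation only.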

- The small array of weights that is slid over the input is called the *kernel* or *filter*.
- The output array has a height of 2 and width of 2, and the four elements are derived from a two-dimensional cross-correlation operation:

- Note that the output size is smaller than the input.
- Input size: $H \times W$
- Kernel size: $h \times w$
- Output size: $(H-h+1) \times (W-w+1)$.

In [1]:

```
from mxnet import autograd, nd
from mxnet.gluon import nn

def corr2d(X, K):  # This function has been saved in the gluonbook package for future use.
    h, w = K.shape
    Y = nd.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            # Multiply the current window elementwise with the kernel and sum.
            Y[i, j] = (X[i: i + h, j: j + w] * K).sum()
    return Y
```

In [2]:

```
X = nd.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])
K = nd.array([[0, 1], [2, 3]])
corr2d(X, K)
```

Out[2]:

- The convolutional layer...
- 1) cross-correlates the input and kernels,
- 2) adds a scalar bias to get the output.

- The model parameters of the convolutional layer include the kernel and scalar bias.

In [3]:

```
class Conv2D(nn.Block):
    def __init__(self, kernel_size, **kwargs):
        super(Conv2D, self).__init__(**kwargs)
        self.weight = self.params.get('weight', shape=kernel_size)
        self.bias = self.params.get('bias', shape=(1,))

    def forward(self, x):
        return corr2d(x, self.weight.data()) + self.bias.data()
```

- Let’s look at a simple application of a convolutional layer: *detecting the edge of an object in an image* by finding the location of the pixel change.

In [4]:

```
X = nd.ones((6, 8))
X[:, 2:6] = 0
X
```

Out[4]:

In [5]:

```
K = nd.array([[1, -1]])
```

- We will detect 1 for the edge from white to black and -1 for the edge from black to white.
- The rest of the outputs are 0.

In [6]:

```
Y = corr2d(X, K)
Y
```

Out[6]:

- Let’s apply the kernel to the transposed ‘image’. As expected, it vanishes.
- The kernel K only detects vertical edges.

In [7]:

```
corr2d(X.T, K)
```

Out[7]:

- We can learn the kernel that generated $Y$ from $X$ by looking at the (input, output) pairs only.
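To make the idea concrete without any framework, here is a minimal NumPy sketch that recovers the $1 \times 2$ edge kernel by plain gradient descent on the squared error, with the gradient written out by hand. The zero initial guess and the 50 steps are assumptions chosen for this illustration; the learning rate matches the one used in the Gluon version.

```python
import numpy as np

def corr2d(x, k):
    kh, kw = k.shape
    y = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(y.shape[0]):
        for j in range(y.shape[1]):
            y[i, j] = (x[i:i + kh, j:j + kw] * k).sum()
    return y

# The same 6 x 8 "image" and targets as in the edge detection example.
X = np.ones((6, 8))
X[:, 2:6] = 0
Y = corr2d(X, np.array([[1.0, -1.0]]))

K = np.zeros((1, 2))     # initial guess (assumed; any small values work)
lr = 3e-2
out_w = Y.shape[1]       # 7 output columns

for step in range(50):
    err = corr2d(X, K) - Y                      # residual, shape (6, 7)
    # d/dK[0, b] of sum(err ** 2) equals 2 * sum_{i,j} err[i, j] * X[i, j + b].
    grad = np.array([[2 * (err * X[:, b:b + out_w]).sum() for b in range(2)]])
    K -= lr * grad

print(K.round(3))        # close to the kernel [1, -1] that generated Y
```

Only the (input, output) pair is used; the true kernel appears solely when generating the targets.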

- We use the built-in `Conv2D` class provided by Gluon below.
- We construct a convolutional layer with 1 output channel and a kernel array shape of (1, 2).

In [16]:

```
conv2d = nn.Conv2D(channels=1, kernel_size=(1, 2))
conv2d.initialize()
```

- The two-dimensional convolutional layer uses four-dimensional input and output in the format of (example, channel, height, width)
- The batch size (number of examples in the batch): 1
- The number of channels: 1

In [17]:

```
X = X.reshape((1, 1, 6, 8))
Y = Y.reshape((1, 1, 6, 7))
(X, Y)
```

Out[17]:

In [18]:

```
for i in range(10):
    with autograd.record():
        Y_hat = conv2d(X)
        l = (Y_hat - Y) ** 2
    l.backward()
    # For the sake of simplicity, we ignore the bias here.
    conv2d.weight.data()[:] -= 3e-2 * conv2d.weight.grad()
    print('batch %d, loss %.3f' % (i + 1, l.sum().asscalar()))
```

In [19]:

```
conv2d.weight.data().reshape((1, 2))
```

Out[19]:

- The core computation of a two-dimensional convolutional layer is a two-dimensional cross-correlation operation.
- This performs a cross-correlation operation on the two-dimensional input data and the kernel, and then adds a bias.

- Assuming the input shape is $n_h\times n_w$ and the convolution kernel window shape is $k_h\times k_w$, then the output shape will be $$(n_h-k_h+1) \times (n_w-k_w+1).$$
- Padding
- Multiple layers of convolutions reduce the information available at the boundary, often by much more than what we would want. If we start with a 240x240 pixel image, 10 layers of 5x5 convolutions reduce the image to 200x200 pixels, effectively slicing off 30% of the image and with it obliterating anything interesting on the boundaries.
- Padding mitigates this problem.
- Add extra pixels around the boundary of the image, thus increasing the effective size of the image
- The extra pixels typically assume the value 0.

If a total of $p_h$ rows are padded on both sides of the height and a total of $p_w$ columns are padded on both sides of width, the output shape will be $$(n_h-k_h+p_h+1)\times(n_w-k_w+p_w+1)$$

This means that the height and width of the output will increase by $p_h$ and $p_w$ respectively.

- In many cases, we will want to set $p_h=k_h-1$ and $p_w=k_w-1$ to give the input and output the same height and width.
- This will make it easier to predict the output shape of each layer when constructing the network.
- If $k_h$ is odd here, we will pad $p_h/2$ rows on both sides of the height.
- If $k_h$ is even, one possibility is to pad $\lceil p_h/2\rceil$ rows on the top of the input and $\lfloor p_h/2\rfloor$ rows on the bottom.
- We will pad both sides of the width in the same way.
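The padding rule above can be sketched with `np.pad`. This is an illustration in NumPy, not the Gluon layer: the helper names `corr2d` and `pad_same` are assumptions, and for even kernel sizes it uses the ceil-on-top, floor-on-bottom split mentioned as one possibility.

```python
import math
import numpy as np

def corr2d(x, k):
    kh, kw = k.shape
    y = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(y.shape[0]):
        for j in range(y.shape[1]):
            y[i, j] = (x[i:i + kh, j:j + kw] * k).sum()
    return y

def pad_same(x, kh, kw):
    """Zero-pad x so that a stride-1 cross-correlation with a kh x kw kernel
    returns an output with the same height and width as x."""
    ph, pw = kh - 1, kw - 1                      # total padding per dimension
    top, bottom = math.ceil(ph / 2), ph // 2     # equal when kh is odd
    left, right = math.ceil(pw / 2), pw // 2     # equal when kw is odd
    return np.pad(x, ((top, bottom), (left, right)))

x = np.arange(64.0).reshape(8, 8)
shapes = {}
for kh, kw in [(3, 3), (5, 3), (2, 4)]:
    shapes[(kh, kw)] = corr2d(pad_same(x, kh, kw), np.ones((kh, kw))).shape
print(shapes)
```

All three kernels, including the even-sized one, give an 8 x 8 output for the 8 x 8 input.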

- Convolutional neural networks often use convolution *kernels with odd height and width values*, such as 1, 3, 5, and 7.
- So, __the number of padding rows or columns on both sides are the same__.

- In the following example, the output size is
- $(n_h-k_h+p_h+1)\times(n_w-k_w+p_w+1) = (8 - 3 + 2 + 1) \times (8 - 3 + 2 + 1) = 8 \times 8$

In [20]:

```
from mxnet import nd
from mxnet.gluon import nn

# We define a convenience function to calculate the convolutional layer. This function initializes
# the convolutional layer weights and performs corresponding dimensionality elevations and reductions
# on the input and output.
def comp_conv2d(conv2d, X):
    conv2d.initialize()
    # (1,1) indicates that the batch size and the number of channels (described in later chapters) are both 1.
    X = X.reshape((1, 1) + X.shape)
    Y = conv2d(X)
    return Y.reshape(Y.shape[2:])  # Exclude the first two dimensions that do not interest us: batch and channel.

# Note that here 1 row or column is padded on either side, so a total of 2 rows or columns are added.
conv2d = nn.Conv2D(1, kernel_size=3, padding=1)
X = nd.random.uniform(shape=(8, 8))
comp_conv2d(conv2d, X).shape
```

Out[20]:

- In the following example, the output size is
- $(n_h-k_h+p_h+1)\times(n_w-k_w+p_w+1) = (8 - 5 + 4 + 1) \times (8 - 3 + 2 + 1) = 8 \times 8$

In [23]:

```
# Here, we use a convolution kernel with a height of 5 and a width of 3. The padding numbers on
# both sides of the height and width are 2 and 1, respectively.
conv2d = nn.Conv2D(1, kernel_size=(5, 3), padding=(2, 1))
comp_conv2d(conv2d, X).shape
```

Out[23]:

- Cross-correlation with strides of 3 and 2 for height and width respectively.
- When the stride for the height is $s_h$ and the stride for the width is $s_w$, the output shape is $$\lfloor(n_h-k_h+p_h+s_h)/s_h\rfloor \times \lfloor(n_w-k_w+p_w+s_w)/s_w\rfloor.$$

- In the following example, the output size is
- $\lfloor(n_h-k_h+p_h+s_h)/s_h\rfloor \times \lfloor(n_w-k_w+p_w+s_w)/s_w\rfloor = \lfloor(8-3+2+2)/2\rfloor \times \lfloor(8-3+2+2)/2\rfloor = 4 \times 4$

In [22]:

```
conv2d = nn.Conv2D(1, kernel_size=3, padding=1, strides=2)
comp_conv2d(conv2d, X).shape
```

Out[22]:

- In the following example, the output size is
- $\lfloor(n_h-k_h+p_h+s_h)/s_h\rfloor \times \lfloor(n_w-k_w+p_w+s_w)/s_w\rfloor = \lfloor(8-3+0+3)/3\rfloor \times \lfloor(8-5+2+4)/4\rfloor = 2 \times 2$

In [25]:

```
conv2d = nn.Conv2D(1, kernel_size=(3, 5), padding=(0, 1), strides=(3, 4))
comp_conv2d(conv2d, X).shape
```

Out[25]:

- When the padding number on both sides of the input height and width are $p_h$ and $p_w$ respectively, we call the padding $(p_h, p_w)$.
- Specifically, when $p_h = p_w = p$, the padding is $p$.
- When the strides on the height and width are $s_h$ and $s_w$, respectively, we call the stride $(s_h, s_w)$.
- Specifically, when $s_h = s_w = s$, the stride is $s$.
- By default, the padding is 0 and the stride is 1.

- In practice we rarely use inhomogeneous strides or padding, i.e. we usually have $p_h = p_w$ and $s_h = s_w$.
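The output-shape formula $\lfloor(n-k+p+s)/s\rfloor$ is easy to check by hand with a small helper. The function name `conv_output_size` is hypothetical, and `p` here is the *total* padding over both sides of a dimension, as in the formulas above; a per-side padding of 1 passed to a layer therefore corresponds to `p=2`.

```python
def conv_output_size(n, k, p=0, s=1):
    """Output length along one dimension: floor((n - k + p + s) / s).

    n: input length, k: kernel length,
    p: total padding (both sides together), s: stride.
    """
    return (n - k + p + s) // s

# The four examples from this section, all on an 8 x 8 input.
same_3x3 = (conv_output_size(8, 3, p=2), conv_output_size(8, 3, p=2))
same_5x3 = (conv_output_size(8, 5, p=4), conv_output_size(8, 3, p=2))
stride_2 = (conv_output_size(8, 3, p=2, s=2), conv_output_size(8, 3, p=2, s=2))
mixed = (conv_output_size(8, 3, p=0, s=3), conv_output_size(8, 5, p=2, s=4))

print(same_3x3, same_5x3, stride_2, mixed)
```

With the defaults `p=0` and `s=1`, the helper reduces to the unpadded formula $n-k+1$.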

- Assuming that the height and width of the color image are $h$ and $w$ (pixels), it can be represented in the memory as a multi-dimensional array of $3\times h\times w$.
- We refer to this dimension, with a size of 3, as the *channel* dimension.

- When the input data contains multiple channels, we need to construct *a convolution kernel with the same number of input channels*, so that it can perform cross-correlation with the input data.
- The number of channels for the input data: $c_i$
- The convolution kernel window shape: $k_h\times k_w$,
- Finally, the shape of convolution kernel is $c_i\times k_h\times k_w$
- Cross-correlation: each channel is cross-correlated with its own kernel slice and the per-channel results are added, e.g.
- $(1\times1+2\times2+4\times3+5\times4)+(0\times0+1\times1+3\times2+4\times3)=56$
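The arithmetic above can be reproduced with a framework-free sketch. It uses NumPy instead of the notebook's MXNet arrays, and the helper names `corr2d` and `corr2d_multi_in_np` are assumptions chosen so they do not clash with the functions defined in the cells.

```python
import numpy as np

def corr2d(x, k):
    kh, kw = k.shape
    y = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(y.shape[0]):
        for j in range(y.shape[1]):
            y[i, j] = (x[i:i + kh, j:j + kw] * k).sum()
    return y

def corr2d_multi_in_np(X, K):
    """Cross-correlate each input channel with its kernel slice, then add the results."""
    return sum(corr2d(x, k) for x, k in zip(X, K))

X = np.array([[[0, 1, 2], [3, 4, 5], [6, 7, 8]],
              [[1, 2, 3], [4, 5, 6], [7, 8, 9]]], dtype=float)
K = np.array([[[0, 1], [2, 3]],
              [[1, 2], [3, 4]]], dtype=float)

Y = corr2d_multi_in_np(X, K)
print(Y)   # the top-left element is the 56 computed above
```

The two channels contribute 19 and 37 to the top-left element, which add up to 56.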

In [26]:

```
import gluonbook as gb
from mxnet import nd
def corr2d_multi_in(X, K):
    # First, traverse along the 0th dimension (channel dimension) of X and K.
    # Then, add them together by using * to turn the result list into a positional argument of the add_n function.
    return nd.add_n(*[gb.corr2d(x, k) for x, k in zip(X, K)])
```

In [27]:

```
X = nd.array([
[[0, 1, 2], [3, 4, 5], [6, 7, 8]],
[[1, 2, 3], [4, 5, 6], [7, 8, 9]]
])
K = nd.array([
[[0, 1], [2, 3]],
[[1, 2], [3, 4]]
])
corr2d_multi_in(X, K)
```

Out[27]:

- We might need more than one output channel
- for edge detection in different directions or
- for more advanced filters

- The number of input channels: $c_i$
- The number of output channels: $c_o$
- Let $k_h$ and $k_w$ be the height and width of the kernel.
- To get an output with multiple channels, we can create a kernel array of shape $c_i\times k_h\times k_w$ for each output channel.
- We concatenate them on the output channel dimension, so that the shape of the convolution kernel is $c_o\times c_i\times k_h\times k_w$.
- In cross-correlation operations, the result on each output channel is calculated from the kernel array of the convolution kernel on the same output channel and the entire input array.

In [29]:

```
def corr2d_multi_in_out(X, K):
    # Traverse along the 0th dimension of K, and each time,
    # perform cross-correlation operations with input X.
    # All of the results are merged together using the stack function.
    return nd.stack(*[corr2d_multi_in(X, k) for k in K])
```

- We construct a convolution kernel with 3 output channels by concatenating the kernel array $K$ with $K+1$ (plus one for each element in $K$) and $K+2$.

In [31]:

```
K = nd.array([
[[0, 1], [2, 3]],
[[1, 2], [3, 4]]
])
K = nd.stack(K, K + 1, K + 2)
K.shape
```

Out[31]:

In [32]:

```
corr2d_multi_in_out(X, K)
```

Out[32]:

$1 \times 1$ convolution, i.e. $k_h = k_w = 1$

- A $1 \times 1$ convolution does not correlate adjacent pixels.

$1\times 1$ convolution loses the ability of the convolutional layer to recognize patterns composed of adjacent elements in the height and width dimensions.

The main computation of the $1\times 1$ convolution occurs on the channel dimension.

- __The inputs and outputs have the same height and width__.
- Each element in the output is derived from a linear combination of elements in the same position in the height and width of the input between different channels.
- Assuming that the channel dimension is considered a feature dimension and that the elements in the height and width dimensions are considered data examples, then the $1\times 1$ convolutional layer is equivalent to the fully connected layer.

In [33]:

```
def corr2d_multi_in_out_1x1(X, K):
    c_i, h, w = X.shape
    c_o = K.shape[0]
    X = X.reshape((c_i, h * w))  # (c_i, h*w) = (3, 9)
    K = K.reshape((c_o, c_i))  # (c_o, c_i) = (2, 3)
    Y = nd.dot(K, X)  # Matrix multiplication in the fully connected layer.
    return Y.reshape((c_o, h, w))  # (2, 3, 3)
```

In [35]:

```
X = nd.random.uniform(shape=(3, 3, 3))
K = nd.random.uniform(shape=(2, 3, 1, 1))
Y1 = corr2d_multi_in_out_1x1(X, K)
Y2 = corr2d_multi_in_out(X, K)
(Y1 - Y2).norm().asscalar() < 1e-6
```

Out[35]:

The $1\times 1$ convolutional layer is equivalent to the fully connected layer, when applied on a per pixel basis.

The $1\times 1$ convolutional layer is typically used to

__adjust the number of channels between network layers and to control model complexity__.
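As a small illustration of adjusting the channel count, the sketch below applies a $1\times 1$ kernel with `np.einsum`. It is written in NumPy rather than MXNet, and the shapes (3 input channels, 2 output channels, a 4 x 4 image) are assumptions picked for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(size=(3, 4, 4))   # c_i = 3 channels, 4 x 4 pixels
W = rng.uniform(size=(2, 3))      # 1 x 1 kernel with c_o = 2, stored as (c_o, c_i)

# Each output pixel is a linear combination of the input channels at that pixel.
Y = np.einsum('oc,chw->ohw', W, X)

print(Y.shape)                                  # channels reduced from 3 to 2
print(np.allclose(Y[:, 1, 2], W @ X[:, 1, 2]))  # same as a dense layer at pixel (1, 2)
```

The layer needs only $c_o \times c_i$ weights regardless of the image size, which is what makes it a cheap way to change the number of channels.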

- As we process images (or other data sources) we will eventually want to reduce the resolution of the images.
- Reasons
- 1) We typically want to output an estimate that does not depend on the dimensionality of the original image.
- 2) When detecting lower-level features, such as edge detection, we often want to have some degree of invariance to translation.

- Pooling computes the output for each element in a fixed-shape window (also known as a pooling window) of input data.
- The pooling layer directly calculates the maximum or average value of the elements in the pooling window.
- These operations are called maximum pooling or average pooling respectively.

The four output elements are the maxima over the four $2\times 2$ pooling windows: $$ \max(0,1,3,4)=4,\\ \max(1,2,4,5)=5,\\ \max(3,4,6,7)=7,\\ \max(4,5,7,8)=8. $$

The pooling layer with a pooling window shape of $p \times q$ is called the $p \times q$ pooling layer.

- The pooling operation is called $p \times q$ pooling.
- That is to say, using the $2\times 2$ maximum pooling layer, we can still detect if the pattern recognized by the convolutional layer moves no more than one element in height and width.
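The tolerance to small shifts can be seen on a toy activation map. This is a NumPy sketch with a hypothetical helper `max_pool2d`; the 5 x 5 maps with a single active element stand in for the output of a convolutional layer.

```python
import numpy as np

def max_pool2d(x, ph, pw):
    """Stride-1 maximum pooling with a ph x pw window."""
    y = np.zeros((x.shape[0] - ph + 1, x.shape[1] - pw + 1))
    for i in range(y.shape[0]):
        for j in range(y.shape[1]):
            y[i, j] = x[i:i + ph, j:j + pw].max()
    return y

def single_activation(row, col):
    a = np.zeros((5, 5))
    a[row, col] = 1.0
    return a

original = max_pool2d(single_activation(2, 2), 2, 2)
moved_right = max_pool2d(single_activation(2, 3), 2, 2)
moved_diag = max_pool2d(single_activation(3, 3), 2, 2)

# Output element (2, 2) pools over rows 2-3 and columns 2-3, so it fires in all three cases.
print(original[2, 2], moved_right[2, 2], moved_diag[2, 2])
```

A shift of two elements, for example an activation at position (2, 4), falls outside that window and is no longer picked up by the same output element.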

In [36]:

```
from mxnet import nd
from mxnet.gluon import nn
def pool2d(X, pool_size, mode='max'):
    p_h, p_w = pool_size
    Y = nd.zeros((X.shape[0] - p_h + 1, X.shape[1] - p_w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            if mode == 'max':
                Y[i, j] = X[i: i + p_h, j: j + p_w].max()
            elif mode == 'avg':
                Y[i, j] = X[i: i + p_h, j: j + p_w].mean()
    return Y
```

In [37]:

```
X = nd.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])
pool2d(X, (2, 2))
```

Out[37]:

In [38]:

```
pool2d(X, (2, 2), 'avg')
```

Out[38]:

- We will demonstrate the use of padding and stride in the pooling layer through the two-dimensional maximum pooling layer `MaxPool2D` in the `nn` module.
- Pooling, combined with a stride larger than 1, can be used to reduce the resolution.

In [39]:

```
X = nd.arange(16).reshape((1, 1, 4, 4))
X
```

Out[39]:

- Because there are no model parameters in the pooling layer, we do not need to call the parameter initialization function.
- By default, the stride in the `MaxPool2D` class has the same shape as the pooling window.
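How stride and padding interact in pooling can be sketched in a few lines of NumPy. This is not the `MaxPool2D` implementation: the helper `max_pool2d` is hypothetical, it handles square windows only, it pads with $-\infty$ so that padded cells never win the maximum, and it rounds the output size down. All of these are assumptions of the sketch.

```python
import numpy as np

def max_pool2d(x, pool, stride=None, pad=0):
    """Square-window maximum pooling; stride defaults to the window size."""
    stride = pool if stride is None else stride
    xp = np.pad(x, pad, constant_values=-np.inf)   # padded cells can never be the max
    out_h = (xp.shape[0] - pool) // stride + 1
    out_w = (xp.shape[1] - pool) // stride + 1
    y = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            y[i, j] = xp[r:r + pool, c:c + pool].max()
    return y

x = np.arange(16.0).reshape(4, 4)

default_stride = max_pool2d(x, 3)                  # stride 3: a single window fits
padded_strided = max_pool2d(x, 3, stride=2, pad=1)

print(default_stride)
print(padded_strided)
```

With the default stride only one 3 x 3 window fits into the 4 x 4 input, while one row and column of padding per side together with a stride of 2 gives a 2 x 2 output.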

In [40]:

```
pool2d = nn.MaxPool2D(3)
pool2d(X)
```

Out[40]:

In [41]:

```
pool2d = nn.MaxPool2D(3, padding=1, strides=2)
pool2d(X)
```

Out[41]:

In [42]:

```
pool2d = nn.MaxPool2D((2, 3), padding=(1, 2), strides=(2, 3))
pool2d(X)
```

Out[42]:

- When processing multi-channel input data, the pooling layer pools each input channel separately.
- [Note] A convolutional layer, in contrast, sums the cross-correlation results over the input channels.

- This means that __the number of output channels for the pooling layer is the same as the number of input channels__.

In [44]:

```
X = nd.arange(16).reshape((1, 1, 4, 4))
X = nd.concat(X, X + 1, dim=1)
X
```

Out[44]:

In [45]:

```
pool2d = nn.MaxPool2D(3, padding=1, strides=2)
pool2d(X)
```

Out[45]: