#!/usr/bin/env python
# coding: utf-8
# # 5.1 From Dense Layers to Convolutions
# - Networks with many parameters either require a lot of data or a lot of regularization.
# - Consider the task of distinguishing cats from dogs.
# - We decide to use a good camera and take 1 megapixel photos
# - The input into a network has 1 million dimensions.
# - Even an aggressive reduction to 1,000 dimensions after the first layer means that...
# - we need $10^9$ parameters.
# - Add in subsequent layers and it is clear that this approach is infeasible.
# - Yet both humans and computers are able to distinguish cats from dogs quite well, often after only a few hundred images, so images must have structure that a network can exploit.
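# - The parameter count above can be checked with a few lines of arithmetic. This is a minimal sketch; the 4 bytes per parameter (32-bit floats) is an assumption added here for illustration and is not part of the original argument.

```python
# Parameter count of one dense layer: every input connects to every hidden unit.
n_inputs = 1_000_000   # 1 megapixel image, one value per pixel
n_hidden = 1_000       # hidden units after the first layer
dense_params = n_inputs * n_hidden
print(dense_params)                   # 1000000000, i.e. 10**9

# Assumed storage cost: 4 bytes per parameter (32-bit floats).
print(dense_params * 4 / 1e9, "GB")   # 4.0 GB
```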
# ## 5.1.1 Invariances (two key principles)
# ![](https://github.com/d2l-ai/d2l-en/raw/master/img/waldo.jpg)
# - [***Translation Invariance***] Object detectors should work the same regardless of where in the image an object can be found.
# - In other words, the ‘waldoness’ of a location in the image can be assessed regardless of its position within the image.
# - [***Locality***] Object detection can be answered by considering only local information.
# - In other words, the ‘waldoness’ of a location can be assessed regardless of what else happens in the image at large distances.
# ## 5.1.2 Constraining the MLP
# - We will treat images and hidden layers as two-dimensional arrays, i.e. $x[i,j]$ and $h[i,j]$ denote the values at position $(i,j)$ in the image and in the hidden layer, respectively.
# - In this case a dense layer can be written as follows: $$h[i,j] = \sum_{k,l} W[i,j,k,l] \cdot x[k,l] = \sum_{a, b} V[i,j,a,b] \cdot x[i+a,j+b]$$
# - we set $V[i,j,a,b] = W[i,j,i+a, j+b]$.
# - For any given location $(i,j)$ in the hidden layer $h[i,j]$ we compute its value by summing over pixels in $x$, centered around $(i,j)$ and weighted by $V[i,j,a,b]$.
# - ***Translation Invariance***.
# - A shift in the input $x$ should simply lead to a shift in the hidden representation $h$.
# - This is only possible if $V$ doesn't actually depend on $(i,j)$,
# - that is, we have $V[i,j,a,b] = V[a,b]$.
# - As a result we can simplify the definition for $h$.
# $$h[i,j] = \sum_{a, b} V[a,b] \cdot x[i+a,j+b]$$
# - This is a convolution!
# - We are effectively weighting pixels $(i+a, j+b)$ in the vicinity of $(i,j)$ with coefficients $V[a,b]$ to obtain the value $h[i,j]$.
# - Note that $V[a,b]$ needs far fewer coefficients than $V[i,j,a,b]$.
# - For a 1 megapixel image, $V[a,b]$ has on the order of 1 million coefficients.
# - This is a factor of 1 million fewer parameters, since $V$ no longer depends on the location $(i,j)$ within the image.
#
# - ***Locality***
# - We should not have to look very far away from $(i,j)$ in order to glean relevant information to assess what is going on at $h[i,j]$.
# - This means that outside some range $|a|, |b| > \Delta$, we should set $V[a,b] = 0$.
# $$h[i,j] = \sum_{a = -\Delta}^{\Delta} \sum_{b = -\Delta}^{\Delta} V[a,b] \cdot x[i+a,j+b]$$
# - This is the convolutional layer.
# - While a fully connected layer might need $10^9$ or more coefficients, we now only need $(2\Delta+1)^2 = O(\Delta^2)$ terms.
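# - The two constraints can be illustrated with a minimal pure-Python sketch of $h[i,j] = \sum_{a,b} V[a,b] \cdot x[i+a,j+b]$. The function name `local_conv`, the dictionary representation of $V$ and the choice to treat pixels outside the image as 0 are assumptions made for this illustration only.

```python
def local_conv(x, V, delta):
    """h[i][j] = sum over |a|, |b| <= delta of V[(a, b)] * x[i+a][j+b].

    Pixels outside the image count as 0 (an assumption made here so that
    the output has the same size as the input).
    """
    H, W = len(x), len(x[0])
    h = [[0.0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            for a in range(-delta, delta + 1):
                for b in range(-delta, delta + 1):
                    if 0 <= i + a < H and 0 <= j + b < W:
                        h[i][j] += V[(a, b)] * x[i + a][j + b]
    return h

delta = 1
# Arbitrary weights; only (2 * delta + 1) ** 2 = 9 of them are needed.
V = {(a, b): float(3 * a + b) for a in range(-1, 2) for b in range(-1, 2)}

# One bright pixel at (2, 2), and the same pixel moved to (3, 4).
x1 = [[0.0] * 7 for _ in range(7)]
x1[2][2] = 1.0
x2 = [[0.0] * 7 for _ in range(7)]
x2[3][4] = 1.0

h1 = local_conv(x1, V, delta)
h2 = local_conv(x2, V, delta)

# The response pattern moves with the pixel: h2 is h1 shifted by (1, 2).
shifted_ok = all(h2[i + 1][j + 2] == h1[i][j]
                 for i in range(6) for j in range(5))
print(len(V), shifted_ok)   # 9 True
```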
# ## 5.1.3 Convolutions
# # 5.2 Convolutions for Images
# - Strictly speaking, the name *convolutional network* is a slight misnomer (but for notation only), since the operations are typically expressed as ***cross correlations***.
# ## 5.2.1 The Cross-Correlation Operator
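# - In a cross-correlation, a two-dimensional input array is combined with a second array called the ***kernel*** (or ***filter***) to produce the output array.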
# - The output array has a height of 2 and width of 2, and the four elements are derived from a two-dimensional cross-correlation operation:
#
# $$ 0\times0+1\times1+3\times2+4\times3=19,\\ 1\times0+2\times1+4\times2+5\times3=25,\\ 3\times0+4\times1+6\times2+7\times3=37,\\ 4\times0+5\times1+7\times2+8\times3=43. $$
# ![](https://github.com/d2l-ai/d2l-en/raw/master/img/correlation.svg?sanitize=true)
# - Note that the output size is smaller than the input.
# - Input size: $H \times W$
# - Kernel size: $h \times w$
# - Output size: $(H-h+1) \times (W-w+1)$.
# In[1]:
from mxnet import autograd, nd
from mxnet.gluon import nn
def corr2d(X, K):  # This function has been saved in the gluonbook package for future use.
    h, w = K.shape
    Y = nd.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = (X[i: i + h, j: j + w] * K).sum()
    return Y
# In[2]:
X = nd.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])
K = nd.array([[0, 1], [2, 3]])
corr2d(X, K)
# ## 5.2.2 Convolutional Layers
# - The convolutional layer...
# - 1) cross-correlates the input and kernels,
# - 2) adds a scalar bias to get the output.
# - The model parameters of the convolutional layer include the kernel and scalar bias.
# In[3]:
class Conv2D(nn.Block):
    def __init__(self, kernel_size, **kwargs):
        super(Conv2D, self).__init__(**kwargs)
        self.weight = self.params.get('weight', shape=kernel_size)
        self.bias = self.params.get('bias', shape=(1,))

    def forward(self, x):
        return corr2d(x, self.weight.data()) + self.bias.data()
# ## 5.2.3 Object Edge Detection in Images
# - Let’s look at a simple application of a convolutional layer
# - ***Detecting the edge of an object in an image*** by finding the location of the pixel change.
# In[4]:
X = nd.ones((6, 8))
X[:, 2:6] = 0
X
# In[5]:
K = nd.array([[1, -1]])
# - We will detect 1 for the edge from white to black and -1 for the edge from black to white.
# - The rest of the outputs are 0.
# In[6]:
Y = corr2d(X, K)
Y
# - Let’s apply the kernel to the transposed ‘image’. As expected, the output vanishes (it is all zeros).
# - The kernel K only detects vertical edges.
# In[7]:
corr2d(X.T, K)
# ## 5.2.4 Learning a Kernel
# - We can learn the kernel that generated $Y$ from $X$ by looking at the (input, output) pairs only.
# - We use the built-in Conv2D class provided by Gluon below.
# - we construct a convolutional layer with 1 output channel and a kernel array shape of (1, 2).
#
# In[16]:
conv2d = nn.Conv2D(channels=1, kernel_size=(1, 2))
conv2d.initialize()
# - The two-dimensional convolutional layer uses four-dimensional input and output in the format of (example, channel, height, width)
# - The batch size (number of examples in the batch): 1
# - The number of channels: 1
# In[17]:
X = X.reshape((1, 1, 6, 8))
Y = Y.reshape((1, 1, 6, 7))
(X, Y)
# In[18]:
for i in range(10):
    with autograd.record():
        Y_hat = conv2d(X)
        l = (Y_hat - Y) ** 2
    l.backward()
    # For the sake of simplicity, we ignore the bias here.
    conv2d.weight.data()[:] -= 3e-2 * conv2d.weight.grad()
    print('batch %d, loss %.3f' % (i + 1, l.sum().asscalar()))
# In[19]:
conv2d.weight.data().reshape((1, 2))
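# - The same learning problem can be written without Gluon to make the gradient step explicit. This is a minimal pure-Python sketch, not the Gluon implementation: the names `image`, `target` and `w` are hypothetical, the kernel starts at zero instead of a random initialization, and the loop runs for 100 steps rather than 10 so that it converges fully.

```python
# The 6x8 image with a dark band in columns 2..5, and the edge map that
# the kernel [1, -1] produces from it.
image = [[0.0 if 2 <= j < 6 else 1.0 for j in range(8)] for _ in range(6)]
target = [[row[j] - row[j + 1] for j in range(7)] for row in image]

w = [0.0, 0.0]   # 1x2 kernel to learn (zero start is an arbitrary choice)
lr = 3e-2        # same step size as in the loop above
for step in range(100):
    grad = [0.0, 0.0]
    for i in range(6):
        for j in range(7):
            err = w[0] * image[i][j] + w[1] * image[i][j + 1] - target[i][j]
            # d(err**2)/dw_k = 2 * err * (input pixel under weight k)
            grad[0] += 2 * err * image[i][j]
            grad[1] += 2 * err * image[i][j + 1]
    w = [w[0] - lr * grad[0], w[1] - lr * grad[1]]

print([round(v, 4) for v in w])   # [1.0, -1.0]
```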
# ## 5.2.5 Cross-correlation and Convolution
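# - The core computation of a two-dimensional convolutional layer is a two-dimensional cross-correlation operation: the layer cross-correlates the two-dimensional input data with the kernel and then adds a bias.
# - A strict convolution differs only in that the kernel is flipped along both axes before it is applied. Since the kernel is learned from data, the output of the layer is unaffected by which of the two operations is used.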
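# - A minimal pure-Python sketch of how the two operations relate, assuming the textbook definition in which a strict convolution is a cross-correlation with the kernel flipped along both axes. The helper names `corr2d_py`, `flip` and `conv2d_py` are hypothetical and independent of MXNet.

```python
def corr2d_py(X, K):
    """Two-dimensional cross-correlation on nested lists (no padding)."""
    h, w = len(K), len(K[0])
    return [[sum(X[i + a][j + b] * K[a][b]
                 for a in range(h) for b in range(w))
             for j in range(len(X[0]) - w + 1)]
            for i in range(len(X) - h + 1)]

def flip(K):
    """Reverse the kernel along both axes."""
    return [row[::-1] for row in K[::-1]]

def conv2d_py(X, K):
    """Strict convolution: cross-correlation with the flipped kernel."""
    return corr2d_py(X, flip(K))

inp = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
ker = [[0, 1], [2, 3]]
print(corr2d_py(inp, ker))         # [[19, 25], [37, 43]]
print(conv2d_py(inp, ker))         # [[5, 11], [23, 29]]
# Flipping the kernel first turns the convolution back into the cross-correlation.
print(conv2d_py(inp, flip(ker)))   # [[19, 25], [37, 43]]
```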
# # 5.3 Padding and Stride
# - Assuming the input shape is $n_h\times n_w$ and the convolution kernel window shape is $k_h\times k_w$, then the output shape will be $$(n_h-k_h+1) \times (n_w-k_w+1).$$
# - Padding
# - Multiple layers of convolutions reduce the information available at the boundary, often by much more than what we would want. If we start with a 240x240 pixel image, 10 layers of 5x5 convolutions reduce the image to 200x200 pixels, effectively slicing off 30% of the image and with it obliterating anything interesting on the boundaries.
# - Padding mitigates this problem.
# - Add extra pixels around the boundary of the image, thus increasing the effective size of the image
# - The extra pixels typically assume the value 0.
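# - The shrinkage figures quoted above follow directly from the unpadded output-size formula. A short check, with no assumptions beyond the 240x240 image and the ten 5x5 layers mentioned in the text:

```python
size = 240
for layer in range(10):
    size = size - 5 + 1   # each unpadded 5x5 layer removes 4 rows and 4 columns
print(size)               # 200

# Fraction of the original pixel area that is lost.
lost = 1 - size ** 2 / 240 ** 2
print(round(lost, 3))     # 0.306
```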
#
# ![](https://github.com/d2l-ai/d2l-en/raw/master/img/conv_pad.svg?sanitize=true)
# - If a total of $p_h$ rows are padded on both sides of the height and a total of $p_w$ columns are padded on both sides of width, the output shape will be $$(n_h-k_h+p_h+1)\times(n_w-k_w+p_w+1)$$
#
# - This means that the height and width of the output will increase by $p_h$ and $p_w$ respectively.
# - In many cases, we will want to set $p_h=k_h-1$ and $p_w=k_w-1$ to give the input and output the same height and width.
# - This will make it easier to predict the output shape of each layer when constructing the network.
# - If $k_h$ is odd here, we will pad $p_h/2$ rows on both sides of the height.
# - If $k_h$ is even, one possibility is to pad $\lceil p_h/2\rceil$ rows on the top of the input and $\lfloor p_h/2\rfloor$ rows on the bottom.
# - We will pad both sides of the width in the same way.
# - Convolutional neural networks often use convolution ***kernels with odd height and width values***, such as 1, 3, 5, and 7,
# - so the number of padding rows or columns is the same on both sides.
# - In the following example, the output size is
# - $(n_h-k_h+p_h+1)\times(n_w-k_w+p_w+1) = (8 - 3 + 2 + 1) \times (8 - 3 + 2 + 1) = 8 \times 8$, i.e. shape (8, 8)
# In[20]:
from mxnet import nd
from mxnet.gluon import nn
# We define a convenience function to calculate the convolutional layer. This function initializes
# the convolutional layer weights and performs corresponding dimensionality elevations and reductions
# on the input and output.
def comp_conv2d(conv2d, X):
    conv2d.initialize()
    # (1,1) indicates that the batch size and the number of channels (described in later chapters) are both 1.
    X = X.reshape((1, 1) + X.shape)
    Y = conv2d(X)
    return Y.reshape(Y.shape[2:])  # Exclude the first two dimensions that do not interest us: batch and channel.
# Note that here 1 row or column is padded on either side, so a total of 2 rows or columns are added.
conv2d = nn.Conv2D(1, kernel_size=3, padding=1)
X = nd.random.uniform(shape=(8, 8))
comp_conv2d(conv2d, X).shape
# - In the following example, the output size is
# - $(n_h-k_h+p_h+1)\times(n_w-k_w+p_w+1) = (8 - 5 + 4 + 1) \times (8 - 3 + 2 + 1) = 8 \times 8$, i.e. shape (8, 8)
# In[23]:
# Here, we use a convolution kernel with a height of 5 and a width of 3. The padding numbers on
# both sides of the height and width are 2 and 1, respectively.
conv2d = nn.Conv2D(1, kernel_size=(5, 3), padding=(2, 1))
comp_conv2d(conv2d, X).shape
# ## 5.3.2 Stride
# - Cross-correlation with strides of 3 and 2 for height and width respectively.
# ![](https://github.com/d2l-ai/d2l-en/raw/master/img/conv_stride.svg?sanitize=true)
# - When the stride for the height is $s_h$ and the stride for the width is $s_w$, the output shape is
#
# $$\lfloor(n_h-k_h+p_h+s_h)/s_h\rfloor \times \lfloor(n_w-k_w+p_w+s_w)/s_w\rfloor.$$
# - In the following example, the output size is
# - $\lfloor(n_h-k_h+p_h+s_h)/s_h\rfloor \times \lfloor(n_w-k_w+p_w+s_w)/s_w\rfloor = \lfloor(8-3+2+2)/2\rfloor \times \lfloor(8-3+2+2)/2\rfloor$ = (4, 4)
# In[22]:
conv2d = nn.Conv2D(1, kernel_size=3, padding=1, strides=2)
comp_conv2d(conv2d, X).shape
# - In the following example, the output size is
# - $\lfloor(n_h-k_h+p_h+s_h)/s_h\rfloor \times \lfloor(n_w-k_w+p_w+s_w)/s_w\rfloor = \lfloor(8-3+0+3)/3\rfloor \times \lfloor(8-5+2+4)/4\rfloor$ = (2, 2)
# In[25]:
conv2d = nn.Conv2D(1, kernel_size=(3, 5), padding=(0, 1), strides=(3, 4))
comp_conv2d(conv2d, X).shape
# - When the numbers of rows and columns padded on each side of the input height and width are $p_h$ and $p_w$ respectively, we call the padding $(p_h, p_w)$.
# - Specifically, when $p_h = p_w = p$, the padding is $p$.
# - When the strides on the height and width are $s_h$ and $s_w$, respectively, we call the stride $(s_h, s_w)$.
# - Specifically, when $s_h = s_w = s$, the stride is $s$.
#
# - By default
# - the padding is 0 and the stride is 1.
# - In practice we rarely use inhomogeneous strides or padding, i.e. we usually have $p_h = p_w$ and $s_h = s_w$.
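# - The output-shape formula of this section is easy to check in plain Python. In this minimal sketch the helper names `conv_output_size` and `conv_output_shape` are hypothetical; `pad` follows the convention used in the code cells above, where the value is the number of rows or columns added on each side, so the total padding in the formula is twice that value.

```python
def conv_output_size(n, k, p=0, s=1):
    """Output length along one axis: floor((n - k + p + s) / s).

    p is the TOTAL padding along that axis (both sides together).
    """
    return (n - k + p + s) // s

def conv_output_shape(n, k, pad=(0, 0), stride=(1, 1)):
    """pad gives the rows/columns added on EACH side, hence 2 * pad in total."""
    return tuple(conv_output_size(n[d], k[d], 2 * pad[d], stride[d])
                 for d in range(2))

# The four layer configurations used in this section, on an 8x8 input.
print(conv_output_shape((8, 8), (3, 3), pad=(1, 1)))                 # (8, 8)
print(conv_output_shape((8, 8), (5, 3), pad=(2, 1)))                 # (8, 8)
print(conv_output_shape((8, 8), (3, 3), pad=(1, 1), stride=(2, 2)))  # (4, 4)
print(conv_output_shape((8, 8), (3, 5), pad=(0, 1), stride=(3, 4)))  # (2, 2)
```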
# # 5.4 Multiple Input and Output Channels
# - Assuming that the height and width of the color image are $h$ and $w$ (pixels), it can be represented in the memory as a multi-dimensional array of $3\times h\times w$.
# - We refer to this dimension, with a size of 3, as the ***channel*** dimension.
# ## 5.4.1 Multiple Input Channels
# - When the input data contains multiple channels, we need to construct ***a convolution kernel with the same number of input channels***, so that it can perform cross-correlation with the input data.
# - The number of channels for the input data: $c_i$
# - The convolution kernel window shape: $k_h\times k_w$,
# - Finally, the shape of convolution kernel is $c_i\times k_h\times k_w$
# - Cross correlation
# - $(1\times1+2\times2+4\times3+5\times4)+(0\times0+1\times1+3\times2+4\times3)=56$
# ![](https://github.com/d2l-ai/d2l-en/raw/master/img/conv_multi_in.svg?sanitize=true)
# In[26]:
from mxnet import nd
def corr2d_multi_in(X, K):
    # First, traverse along the 0th dimension (channel dimension) of X and K.
    # Then, add them together by using * to turn the result list into a positional argument of the add_n function.
    # corr2d is the function defined in section 5.2.1 above, so no extra package is needed.
    return nd.add_n(*[corr2d(x, k) for x, k in zip(X, K)])
# In[27]:
X = nd.array([
[[0, 1, 2], [3, 4, 5], [6, 7, 8]],
[[1, 2, 3], [4, 5, 6], [7, 8, 9]]
])
K = nd.array([
[[0, 1], [2, 3]],
[[1, 2], [3, 4]]
])
corr2d_multi_in(X, K)
# ## 5.4.2 Multiple Output Channels
# - We might need more than one output
# - for edge detection in different directions or
# - for more advanced filters
# - The number of input channels: $c_i$
# - The number of output channels: $c_o$
# - let $k_h$ and $k_w$ be the height and width of the kernel.
# - To get an output with multiple channels, we can create a kernel array of shape $c_i\times k_h\times k_w$ for each output channel.
# - We concatenate them on the output channel dimension, so that the shape of the convolution kernel is $c_o\times c_i\times k_h\times k_w$.
# - In cross-correlation operations, the result on each output channel is calculated from the kernel array of the convolution kernel on the same output channel and the entire input array.
# In[29]:
def corr2d_multi_in_out(X, K):
    # Traverse along the 0th dimension of K, and each time,
    # perform cross-correlation operations with input X.
    # All of the results are merged together using the stack function.
    return nd.stack(*[corr2d_multi_in(X, k) for k in K])
# - We construct a convolution kernel with 3 output channels by concatenating the kernel array $K$ with $K+1$ (plus one for each element in $K$) and $K+2$.
# In[31]:
K = nd.array([
[[0, 1], [2, 3]],
[[1, 2], [3, 4]]
])
K = nd.stack(K, K + 1, K + 2)
K.shape
# In[32]:
corr2d_multi_in_out(X, K)
# ## 5.4.3 1×1 Convolutional Layer
# - $1 \times 1$ convolution, i.e. $k_h = k_w = 1$
# - A $1 \times 1$ convolution does not correlate adjacent pixels.
#
# - $1\times 1$ convolution loses the ability of the convolutional layer to recognize patterns composed of adjacent elements in the height and width dimensions.
#
# - The main computation of the $1\times 1$ convolution occurs on the channel dimension.
# ![](https://github.com/d2l-ai/d2l-en/raw/master/img/conv_1x1.svg?sanitize=true)
# - The inputs and outputs have the same height and width.
# - Each element in the output is a linear combination of the input elements at the same height and width position across the different channels.
# - Assuming that the channel dimension is considered a feature dimension and that the elements in the height and width dimensions are considered data examples, then the $1\times 1$ convolutional layer is equivalent to the fully connected layer.
# In[33]:
def corr2d_multi_in_out_1x1(X, K):
    c_i, h, w = X.shape
    c_o = K.shape[0]
    X = X.reshape((c_i, h * w))  # (c_i, h*w) = (3, 9)
    K = K.reshape((c_o, c_i))  # (c_o, c_i) = (2, 3)
    Y = nd.dot(K, X)  # Matrix multiplication in the fully connected layer.
    return Y.reshape((c_o, h, w))  # (2, 3, 3)
# In[35]:
X = nd.random.uniform(shape=(3, 3, 3))
K = nd.random.uniform(shape=(2, 3, 1, 1))
Y1 = corr2d_multi_in_out_1x1(X, K)
Y2 = corr2d_multi_in_out(X, K)
(Y1 - Y2).norm().asscalar() < 1e-6
# - The $1\times 1$ convolutional layer is equivalent to the fully connected layer, when applied on a per pixel basis.
#
# - The $1\times 1$ convolutional layer is typically used to adjust the number of channels between network layers and to control model complexity.
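# - The per-pixel view can be made concrete with hand-picked weights. This is a minimal pure-Python sketch; the function name `conv1x1` and the example values are hypothetical and chosen so that the result is easy to read off.

```python
def conv1x1(X, K):
    """X: c_i x h x w nested lists; K: c_o x c_i weights. Returns c_o x h x w."""
    c_i, h, w = len(X), len(X[0]), len(X[0][0])
    return [[[sum(K[o][c] * X[c][i][j] for c in range(c_i))
              for j in range(w)]
             for i in range(h)]
            for o in range(len(K))]

feat = [[[1, 2], [3, 4]],            # input channel 0
        [[10, 20], [30, 40]],        # input channel 1
        [[100, 200], [300, 400]]]    # input channel 2
mix = [[1, 1, 1],   # output channel 0: sum of the three input channels
       [0, 1, 0]]   # output channel 1: copy of input channel 1

out = conv1x1(feat, mix)
print(out[0])   # [[111, 222], [333, 444]]
print(out[1])   # [[10, 20], [30, 40]]
```

# - Three input channels are reduced to two output channels while the height and width stay at 2x2, which is the channel-adjusting use described above.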
# # 5.5 Pooling
# - As we process images (or other data sources) we will eventually want to reduce the resolution of the images.
# - Reasons
# - 1) We typically want to output an estimate that does not depend on the dimensionality of the original image.
# - 2) When detecting lower-level features, such as edge detection, we often want to have some degree of invariance to translation.
# ## 5.5.1 Maximum Pooling and Average Pooling
# - Pooling computes the output for each element in a fixed-shape window (also known as a pooling window) of input data.
# - The pooling layer directly calculates the maximum or average value of the elements in the pooling window.
# - These operations are called maximum pooling or average pooling respectively.
# ![](https://github.com/d2l-ai/d2l-en/raw/master/img/pooling.svg?sanitize=true)
# - The four output elements are the maximum values of the four pooling windows: $$ \max(0,1,3,4)=4,\\ \max(1,2,4,5)=5,\\ \max(3,4,6,7)=7,\\ \max(4,5,7,8)=8. $$
#
# - The pooling layer with a pooling window shape of $p \times q$ is called the $p \times q$ pooling layer.
# - The pooling operation is called $p \times q$ pooling.
# - With a $2\times 2$ maximum pooling layer, we can still detect the pattern recognized by the convolutional layer as long as it moves by no more than one element in height and width.
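# - This tolerance to small shifts can be illustrated on a feature map with a single active element. A minimal pure-Python sketch; the helpers `max_pool2d` and `one_hot` are hypothetical names, and the pooling window slides with stride 1.

```python
def max_pool2d(X, p_h, p_w):
    """Max pooling with stride 1 and no padding on nested lists."""
    return [[max(X[i + a][j + b] for a in range(p_h) for b in range(p_w))
             for j in range(len(X[0]) - p_w + 1)]
            for i in range(len(X) - p_h + 1)]

def one_hot(i, j, size=4):
    """A size x size feature map that is 1 at (i, j) and 0 elsewhere."""
    return [[1 if (r, c) == (i, j) else 0 for c in range(size)]
            for r in range(size)]

# Move the activation away from (1, 1) and read the pooled output at [1][1],
# whose window covers rows 1..2 and columns 1..2.
results = {}
for di, dj in [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0)]:
    pooled = max_pool2d(one_hot(1 + di, 1 + dj), 2, 2)
    results[(di, dj)] = pooled[1][1]
print(results)
# {(0, 0): 1, (0, 1): 1, (1, 0): 1, (1, 1): 1, (2, 0): 0}
```

# - Shifts of at most one element leave the pooled value at 1; a shift of two rows moves the activation out of the window.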
# In[36]:
from mxnet import nd
from mxnet.gluon import nn
def pool2d(X, pool_size, mode='max'):
    p_h, p_w = pool_size
    Y = nd.zeros((X.shape[0] - p_h + 1, X.shape[1] - p_w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            if mode == 'max':
                Y[i, j] = X[i: i + p_h, j: j + p_w].max()
            elif mode == 'avg':
                Y[i, j] = X[i: i + p_h, j: j + p_w].mean()
    return Y
# In[37]:
X = nd.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])
pool2d(X, (2, 2))
# In[38]:
pool2d(X, (2, 2), 'avg')
# ## 5.5.2 Padding and Stride
# - We will demonstrate the use of padding and stride in the pooling layer through the two-dimensional maximum pooling layer `MaxPool2D` in the `nn` module.
# - Pooling, combined with a stride larger than 1 can be used to reduce the resolution.
# In[39]:
X = nd.arange(16).reshape((1, 1, 4, 4))
X
# - Because there are no model parameters in the pooling layer, we do not need to call the parameter initialization function.
# - By default, the stride in the `MaxPool2D` class has the same shape as the pooling window.
# In[40]:
pool2d = nn.MaxPool2D(3)
pool2d(X)
# In[41]:
pool2d = nn.MaxPool2D(3, padding=1, strides=2)
pool2d(X)
# In[42]:
pool2d = nn.MaxPool2D((2, 3), padding=(1, 2), strides=(2, 3))
pool2d(X)
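# - To see where the pooled values come from, here is a minimal pure-Python sketch of maximum pooling with padding and stride. It is not the `MaxPool2D` implementation: the name `max_pool2d_padded` is hypothetical, the output size is assumed to follow the floor formula from the padding and stride section for convolutions, and padded cells are assumed to never win the maximum (modelled here as negative infinity).

```python
def max_pool2d_padded(X, pool, pad=(0, 0), stride=None):
    """pad = cells added on EACH side; stride defaults to the window shape."""
    stride = stride or pool
    neg_inf = float("-inf")   # assumption: padded cells never win the max
    H, W = len(X), len(X[0])
    out_h = (H + 2 * pad[0] - pool[0]) // stride[0] + 1
    out_w = (W + 2 * pad[1] - pool[1]) // stride[1] + 1
    Y = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            top = i * stride[0] - pad[0]
            left = j * stride[1] - pad[1]
            row.append(max(
                X[r][c] if 0 <= r < H and 0 <= c < W else neg_inf
                for r in range(top, top + pool[0])
                for c in range(left, left + pool[1])))
        Y.append(row)
    return Y

grid = [[4 * i + j for j in range(4)] for i in range(4)]   # values 0..15
print(max_pool2d_padded(grid, (3, 3)))
# [[10]]
print(max_pool2d_padded(grid, (3, 3), pad=(1, 1), stride=(2, 2)))
# [[5, 7], [13, 15]]
print(max_pool2d_padded(grid, (2, 3), pad=(1, 2), stride=(2, 3)))
# [[0, 3], [8, 11], [12, 15]]
```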
# ## 5.5.3 Multiple Channels
# - When processing multi-channel input data, the pooling layer pools each input channel separately
# - [Note] In contrast, a convolutional layer sums the cross-correlation results of all input channels into each output channel.
# - This means that the number of output channels for the pooling layer is the same as the number of input channels.
#
# In[44]:
X = nd.arange(16).reshape((1, 1, 4, 4))
X = nd.concat(X, X + 1, dim=1)
X
# In[45]:
pool2d = nn.MaxPool2D(3, padding=1, strides=2)
pool2d(X)