#!/usr/bin/env python
# coding: utf-8

# # 5.1 From Dense Layers to Convolutions
# - Networks with many parameters either require a lot of data or a lot of regularization.
# - Consider the task of distinguishing cats from dogs.
#     - We decide to use a good camera and take 1 megapixel photos.
#     - The input into a network has 1 million dimensions.
#     - Even an aggressive reduction to 1,000 dimensions after the first layer means that we need $10^9$ parameters.
#     - Add in subsequent layers and it is clear that this approach is infeasible.
# - Both humans and computers are able to distinguish cats from dogs quite well, often after seeing only a few hundred images.

# ## 5.1.1 Invariances (two key principles)
# ![](https://github.com/d2l-ai/d2l-en/raw/master/img/waldo.jpg)
# - [***Translation Invariance***] Object detectors should work the same regardless of where in the image an object is found.
#     - In other words, the 'waldoness' of a location in the image can be assessed without regard to the position within the image.
# - [***Locality***] Object detection can be answered by considering only local information.
#     - In other words, the 'waldoness' of a location can be assessed without regard to what else happens in the image at large distances.

# ## 5.1.2 Constraining the MLP
# - We will treat images and hidden layers as two-dimensional arrays, i.e. $x[i,j]$ and $h[i,j]$ denote position $(i,j)$ in an image.
# - In this case a dense layer can be written as follows:
# $$h[i,j] = \sum_{k,l} W[i,j,k,l] \cdot x[k,l] = \sum_{a, b} V[i,j,a,b] \cdot x[i+a,j+b]$$
#     - where we set $V[i,j,a,b] = W[i,j,i+a,j+b]$.
#     - For any given location $(i,j)$ in the hidden layer $h[i,j]$, we compute its value by summing over pixels in $x$, centered around $(i,j)$ and weighted by $V[i,j,a,b]$.
# - ***Translation Invariance***
#     - This is only possible if $V$ doesn't actually depend on $(i,j)$,
#     - that is, we have $V[i,j,a,b] = V[a,b]$.
#     - As a result we can simplify the definition of $h$:
# $$h[i,j] = \sum_{a, b} V[a,b] \cdot x[i+a,j+b]$$
#     - This is a convolution!
#     - We are effectively weighting pixels $(i+a, j+b)$ in the vicinity of $(i,j)$ with coefficients $V[a,b]$ to obtain the value $h[i,j]$.
#     - Note that $V[a,b]$ needs far fewer coefficients than $V[i,j,a,b]$: for a 1 megapixel image it has at most 1 million coefficients, a factor of 1 million fewer parameters, since the weights no longer depend on the location within the image.
# - ***Locality***
#     - We should not have to look very far away from $(i,j)$ in order to glean relevant information to assess what is going on at $h[i,j]$.
#     - This means that outside some range $|a|, |b| > \Delta$, we should set $V[a,b] = 0$.
# $$h[i,j] = \sum_{a = -\Delta}^{\Delta} \sum_{b = -\Delta}^{\Delta} V[a,b] \cdot x[i+a,j+b]$$
#     - This is the convolutional layer.
#     - While in a fully connected layer we might have needed $10^8$ or more coefficients, we now only need $O(\Delta^2)$ terms.

# ## 5.1.3 Convolutions
# - In mathematics, the convolution of two functions is defined as $(f * g)(x) = \int f(z)\,g(x-z)\,dz$.
# - For two-dimensional arrays this becomes $(f * g)(i,j) = \sum_{a,b} f(a,b)\,g(i-a,j-b)$.
# - Note the minus signs: a true convolution flips the kernel, while the layer derived above uses $x[i+a,j+b]$, i.e. a cross-correlation; the two differ only by this flip, as the sketch below illustrates.
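# - A minimal sketch of the kernel flip (our own illustration; the helper names `corr2d_naive` and `conv2d_naive` are not from the book):

# In[ ]:


from mxnet import nd

def corr2d_naive(X, K):
    # Plain 2D cross-correlation: slide K over X without flipping it.
    h, w = K.shape
    Y = nd.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = (X[i: i + h, j: j + w] * K).sum()
    return Y

def conv2d_naive(X, K):
    # A true convolution flips the kernel along both axes, then cross-correlates.
    return corr2d_naive(X, nd.flip(nd.flip(K, axis=0), axis=1))

X = nd.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])
K = nd.array([[0, 1], [2, 3]])
# The two outputs differ only because convolution flips the kernel.
(corr2d_naive(X, K), conv2d_naive(X, K))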
# # 5.2 Convolutions for Images
# - Strictly speaking, the name convolutional networks is a slight misnomer (but for notation only), since the operations are typically expressed as ***cross-correlations***.

# ## 5.2.1 The Cross-Correlation Operator
# - The kernel (or filter) is slid across the input window by window; at each position, the elementwise product of the window and the kernel is summed to give one output element.
# - The output array has a height of 2 and width of 2, and the four elements are derived from a two-dimensional cross-correlation operation:
# $$
# 0\times0+1\times1+3\times2+4\times3=19,\\
# 1\times0+2\times1+4\times2+5\times3=25,\\
# 3\times0+4\times1+6\times2+7\times3=37,\\
# 4\times0+5\times1+7\times2+8\times3=43.
# $$
# ![](https://github.com/d2l-ai/d2l-en/raw/master/img/correlation.svg?sanitize=true)
# - Note that the output size is smaller than the input.
#     - Input size: $H \times W$
#     - Kernel size: $h \times w$
#     - Output size: $(H-h+1) \times (W-w+1)$

# In[1]:


from mxnet import autograd, nd
from mxnet.gluon import nn

def corr2d(X, K):
    # This function has been saved in the gluonbook package for future use.
    h, w = K.shape
    Y = nd.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = (X[i: i + h, j: j + w] * K).sum()
    return Y


# In[2]:


X = nd.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])
K = nd.array([[0, 1], [2, 3]])
corr2d(X, K)


# ## 5.2.2 Convolutional Layers
# - A convolutional layer
#     - 1) cross-correlates the input and the kernel, and
#     - 2) adds a scalar bias to get the output.
# - The model parameters of the convolutional layer are the kernel and the scalar bias.

# In[3]:


class Conv2D(nn.Block):
    def __init__(self, kernel_size, **kwargs):
        super(Conv2D, self).__init__(**kwargs)
        self.weight = self.params.get('weight', shape=kernel_size)
        self.bias = self.params.get('bias', shape=(1,))

    def forward(self, x):
        return corr2d(x, self.weight.data()) + self.bias.data()
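# - As a minimal usage sketch (our own example, assuming the $3\times 3$ input `X` from above), we can initialize the block and apply it:

# In[ ]:


# Instantiate the custom Conv2D block with a 2x2 kernel; the layer expects a
# two-dimensional input array. Output shape: (3-2+1, 3-2+1) = (2, 2).
conv = Conv2D(kernel_size=(2, 2))
conv.initialize()  # random weight, default-initialized bias
conv(X)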
# ## 5.2.3 Object Edge Detection in Images
# - Let's look at a simple application of a convolutional layer:
#     - ***detecting the edge of an object in an image*** by finding the location of the pixel change.

# In[4]:


X = nd.ones((6, 8))
X[:, 2:6] = 0
X


# In[5]:


K = nd.array([[1, -1]])


# - We will detect 1 for the edge from white to black and -1 for the edge from black to white.
# - The rest of the outputs are 0.

# In[6]:


Y = corr2d(X, K)
Y


# - Let's apply the kernel to the transposed 'image'. As expected, it vanishes.
# - The kernel `K` only detects vertical edges.

# In[7]:


corr2d(X.T, K)
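# - A small complementary check of our own (not in the original text): transposing the kernel as well turns it into a vertical $[1, -1]$ filter, which does detect the now-horizontal edges of the transposed image.

# In[ ]:


# K.T has shape (2, 1), so it compares vertically adjacent pixels and
# responds to the horizontal edges in X.T.
corr2d(X.T, K.T)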
# ## 5.2.4 Learning a Kernel
# - We can learn the kernel that generated $Y$ from $X$ by looking at the (input, output) pairs only.
# - We use the built-in `Conv2D` class provided by Gluon below.
#     - We construct a convolutional layer with 1 output channel and a kernel array shape of (1, 2).

# In[16]:


conv2d = nn.Conv2D(channels=1, kernel_size=(1, 2))
conv2d.initialize()


# - The two-dimensional convolutional layer uses four-dimensional input and output in the format (example, channel, height, width).
#     - The batch size (number of examples in the batch): 1
#     - The number of channels: 1

# In[17]:


X = X.reshape((1, 1, 6, 8))
Y = Y.reshape((1, 1, 6, 7))
(X, Y)


# In[18]:


for i in range(10):
    with autograd.record():
        Y_hat = conv2d(X)
        l = (Y_hat - Y) ** 2
    l.backward()
    # For the sake of simplicity, we ignore the bias here.
    conv2d.weight.data()[:] -= 3e-2 * conv2d.weight.grad()
    print('batch %d, loss %.3f' % (i + 1, l.sum().asscalar()))


# In[19]:


conv2d.weight.data().reshape((1, 2))


# ## 5.2.5 Cross-Correlation and Convolution
# - The core computation of a two-dimensional convolutional layer is a two-dimensional cross-correlation operation: it cross-correlates the two-dimensional input data with the kernel, and then adds a bias.
# - Since kernels are learned from data, the distinction between convolution and cross-correlation does not matter in practice: a layer defined via a strict convolution would simply learn the flipped kernel and produce the same outputs.
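# - As a sanity check (our own sketch, not from the original text), we can plant the edge-detection kernel into a Gluon layer and confirm that its output matches our `corr2d` result `Y`, i.e. that the layer really computes a cross-correlation:

# In[ ]:


# Specifying in_channels makes the weight shape known at construction time,
# so we can overwrite it with set_data before the first forward pass.
conv2d_fixed = nn.Conv2D(channels=1, in_channels=1, kernel_size=(1, 2),
                         use_bias=False)
conv2d_fixed.initialize()
conv2d_fixed.weight.set_data(nd.array([[1, -1]]).reshape((1, 1, 1, 2)))
# Zero difference: the layer output equals the cross-correlation Y.
(conv2d_fixed(X) - Y).abs().sum().asscalar()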
# # 5.3 Padding and Stride
# - Assuming the input shape is $n_h\times n_w$ and the convolution kernel window shape is $k_h\times k_w$, the output shape will be
# $$(n_h-k_h+1) \times (n_w-k_w+1).$$

# ## 5.3.1 Padding
# - Multiple layers of convolutions reduce the information available at the boundary, often by much more than we would want.
#     - If we start with a 240x240 pixel image, 10 layers of 5x5 convolutions reduce the image to 200x200 pixels, effectively slicing off 30% of the image and with it obliterating anything interesting on the boundaries.
# - Padding mitigates this problem:
#     - add extra pixels around the boundary of the image, thus increasing the effective size of the image.
#     - The extra pixels typically assume the value 0.
# ![](https://github.com/d2l-ai/d2l-en/raw/master/img/conv_pad.svg?sanitize=true)
# - If a total of $p_h$ rows are padded on both sides of the height and a total of $p_w$ columns are padded on both sides of the width, the output shape will be
# $$(n_h-k_h+p_h+1)\times(n_w-k_w+p_w+1)$$
# - This means that the height and width of the output will increase by $p_h$ and $p_w$ respectively.
# - In many cases, we will want to set $p_h=k_h-1$ and $p_w=k_w-1$ to give the input and output the same height and width.
#     - This will make it easier to predict the output shape of each layer when constructing the network.
#     - If $k_h$ is odd, we will pad $p_h/2$ rows on both sides of the height.
#     - If $k_h$ is even, one possibility is to pad $\lceil p_h/2\rceil$ rows on the top of the input and $\lfloor p_h/2\rfloor$ rows on the bottom.
#     - We will pad both sides of the width in the same way.
# - Convolutional neural networks often use convolution ***kernels with odd height and width values***, such as 1, 3, 5, and 7,
#     - so the number of padding rows or columns on both sides is the same.
# - In the following example, the output size is
#     - $(n_h-k_h+p_h+1)\times(n_w-k_w+p_w+1) = (8 - 3 + 2 + 1) \times (8 - 3 + 2 + 1) = 8 \times 8$

# In[20]:


from mxnet import nd
from mxnet.gluon import nn

# We define a convenience function to calculate the convolutional layer. This function initializes
# the convolutional layer weights and performs corresponding dimensionality elevations and reductions
# on the input and output.
def comp_conv2d(conv2d, X):
    conv2d.initialize()
    # (1, 1) indicates that the batch size and the number of channels (described in later chapters) are both 1.
    X = X.reshape((1, 1) + X.shape)
    Y = conv2d(X)
    return Y.reshape(Y.shape[2:])  # Exclude the first two dimensions that do not interest us: batch and channel.

# Note that here 1 row or column is padded on either side, so a total of 2 rows or columns are added.
conv2d = nn.Conv2D(1, kernel_size=3, padding=1)
X = nd.random.uniform(shape=(8, 8))
comp_conv2d(conv2d, X).shape


# - In the following example, the output size is
#     - $(n_h-k_h+p_h+1)\times(n_w-k_w+p_w+1) = (8 - 5 + 4 + 1) \times (8 - 3 + 2 + 1) = 8 \times 8$

# In[23]:


# Here, we use a convolution kernel with a height of 5 and a width of 3. The padding numbers on
# both sides of the height and width are 2 and 1, respectively.
conv2d = nn.Conv2D(1, kernel_size=(5, 3), padding=(2, 1))
comp_conv2d(conv2d, X).shape


# ## 5.3.2 Stride
# - Cross-correlation with strides of 3 and 2 for height and width respectively:
# ![](https://github.com/d2l-ai/d2l-en/raw/master/img/conv_stride.svg?sanitize=true)
# - When the stride for the height is $s_h$ and the stride for the width is $s_w$, the output shape is
# $$\lfloor(n_h-k_h+p_h+s_h)/s_h\rfloor \times \lfloor(n_w-k_w+p_w+s_w)/s_w\rfloor.$$
# - In the following example, the output size is
#     - $\lfloor(n_h-k_h+p_h+s_h)/s_h\rfloor \times \lfloor(n_w-k_w+p_w+s_w)/s_w\rfloor = \lfloor(8-3+2+2)/2\rfloor \times \lfloor(8-3+2+2)/2\rfloor = 4 \times 4$

# In[22]:


conv2d = nn.Conv2D(1, kernel_size=3, padding=1, strides=2)
comp_conv2d(conv2d, X).shape


# - In the following example, the output size is
#     - $\lfloor(n_h-k_h+p_h+s_h)/s_h\rfloor \times \lfloor(n_w-k_w+p_w+s_w)/s_w\rfloor = \lfloor(8-3+0+3)/3\rfloor \times \lfloor(8-5+2+4)/4\rfloor = 2 \times 2$

# In[25]:


conv2d = nn.Conv2D(1, kernel_size=(3, 5), padding=(0, 1), strides=(3, 4))
comp_conv2d(conv2d, X).shape


# - When the padding numbers on both sides of the input height and width are $p_h$ and $p_w$ respectively, we call the padding $(p_h, p_w)$.
#     - Specifically, when $p_h = p_w = p$, the padding is $p$.
# - When the strides on the height and width are $s_h$ and $s_w$ respectively, we call the stride $(s_h, s_w)$.
#     - Specifically, when $s_h = s_w = s$, the stride is $s$.
# - By default, the padding is 0 and the stride is 1.
# - In practice we rarely use inhomogeneous strides or padding, i.e. we usually have $p_h = p_w$ and $s_h = s_w$.
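# - The general formula is easy to check programmatically. Below is a small helper of our own (the name `conv_out_shape` is not from the book) that evaluates $\lfloor(n-k+p+s)/s\rfloor$ per dimension, matching the shapes Gluon reported above:

# In[ ]:


def conv_out_shape(n, k, p, s):
    # floor((n - k + p + s) / s) for one dimension; p is the total padding.
    return (n - k + p + s) // s

# Check the two stride examples above against the formula.
print(conv_out_shape(8, 3, 2, 2), conv_out_shape(8, 3, 2, 2))  # 4 4
print(conv_out_shape(8, 3, 0, 3), conv_out_shape(8, 5, 2, 4))  # 2 2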
# # 5.4 Multiple Input and Output Channels
# - Assuming that the height and width of a color image are $h$ and $w$ (pixels), it can be represented in memory as a multi-dimensional array of shape $3\times h\times w$.
# - We refer to this dimension, with a size of 3, as the ***channel*** dimension.

# ## 5.4.1 Multiple Input Channels
# - When the input data contains multiple channels, we need to construct ***a convolution kernel with the same number of input channels***, so that it can perform cross-correlation with the input data.
#     - The number of channels of the input data: $c_i$
#     - The convolution kernel window shape: $k_h\times k_w$
#     - So the shape of the convolution kernel is $c_i\times k_h\times k_w$.
# - Cross-correlation:
#     - $(1\times1+2\times2+4\times3+5\times4)+(0\times0+1\times1+3\times2+4\times3)=56$
# ![](https://github.com/d2l-ai/d2l-en/raw/master/img/conv_multi_in.svg?sanitize=true)

# In[26]:


import gluonbook as gb
from mxnet import nd

def corr2d_multi_in(X, K):
    # First, traverse along the 0th dimension (channel dimension) of X and K.
    # Then, add them together by using * to turn the result list into a
    # positional argument of the add_n function.
    return nd.add_n(*[gb.corr2d(x, k) for x, k in zip(X, K)])


# In[27]:


X = nd.array([[[0, 1, 2], [3, 4, 5], [6, 7, 8]],
              [[1, 2, 3], [4, 5, 6], [7, 8, 9]]])
K = nd.array([[[0, 1], [2, 3]], [[1, 2], [3, 4]]])
corr2d_multi_in(X, K)


# ## 5.4.2 Multiple Output Channels
# - We might need more than one output channel,
#     - for edge detection in different directions or
#     - for more advanced filters.
# - The number of input channels: $c_i$
# - The number of output channels: $c_o$
# - Let $k_h$ and $k_w$ be the height and width of the kernel.
# - To get an output with multiple channels, we create a kernel array of shape $c_i\times k_h\times k_w$ for each output channel.
# - We concatenate them on the output channel dimension, so that the shape of the convolution kernel is $c_o\times c_i\times k_h\times k_w$.
# - In cross-correlation operations, the result on each output channel is calculated from the kernel array of the convolution kernel on the same output channel and the entire input array.

# In[29]:


def corr2d_multi_in_out(X, K):
    # Traverse along the 0th dimension of K, and each time, perform
    # cross-correlation operations with input X. All of the results are
    # merged together using the stack function.
    return nd.stack(*[corr2d_multi_in(X, k) for k in K])


# - We construct a convolution kernel with 3 output channels by stacking the kernel array $K$ with $K+1$ (plus one for each element in $K$) and $K+2$.

# In[31]:


K = nd.array([[[0, 1], [2, 3]], [[1, 2], [3, 4]]])
K = nd.stack(K, K + 1, K + 2)
K.shape


# In[32]:


corr2d_multi_in_out(X, K)


# ## 5.4.3 1×1 Convolutional Layer
# - A $1 \times 1$ convolution, i.e. $k_h = k_w = 1$,
#     - obviously doesn't correlate adjacent pixels;
#     - it loses the ability of the convolutional layer to recognize patterns composed of adjacent elements in the height and width dimensions.
# - The main computation of the $1\times 1$ convolution occurs on the channel dimension.
# ![](https://github.com/d2l-ai/d2l-en/raw/master/img/conv_1x1.svg?sanitize=true)
# - The inputs and outputs have the same height and width.
# - Each element in the output is derived from a linear combination of elements at the same position in the height and width of the input across different channels.
# - If we regard the channel dimension as a feature dimension and the elements in the height and width dimensions as data examples, then the $1\times 1$ convolutional layer is equivalent to a fully connected layer.

# In[33]:


def corr2d_multi_in_out_1x1(X, K):
    c_i, h, w = X.shape
    c_o = K.shape[0]
    X = X.reshape((c_i, h * w))    # (c_i, h*w) = (3, 9)
    K = K.reshape((c_o, c_i))      # (c_o, c_i) = (2, 3)
    Y = nd.dot(K, X)               # Matrix multiplication in the fully connected layer.
    return Y.reshape((c_o, h, w))  # (2, 3, 3)


# In[35]:


X = nd.random.uniform(shape=(3, 3, 3))
K = nd.random.uniform(shape=(2, 3, 1, 1))
Y1 = corr2d_multi_in_out_1x1(X, K)
Y2 = corr2d_multi_in_out(X, K)
(Y1 - Y2).norm().asscalar() < 1e-6


# - The $1\times 1$ convolutional layer is equivalent to the fully connected layer when applied on a per-pixel basis.
# - The $1\times 1$ convolutional layer is typically used to adjust the number of channels between network layers and to control model complexity.

# # 5.5 Pooling
# - As we process images (or other data sources) we will eventually want to reduce their resolution.
# - Reasons:
#     - 1) We typically want to output an estimate that does not depend on the dimensionality of the original image.
#     - 2) When detecting lower-level features, such as edges, we often want to have some degree of invariance to translation.

# ## 5.5.1 Maximum Pooling and Average Pooling
# - Pooling computes the output for each element in a fixed-shape window (also known as a pooling window) of input data.
# - The pooling layer directly calculates the maximum or average value of the elements in the pooling window.
# - These operations are called maximum pooling and average pooling respectively.
# ![](https://github.com/d2l-ai/d2l-en/raw/master/img/pooling.svg?sanitize=true)
# - The four elements are derived by taking the maximum over each window:
# $$
# \max(0,1,3,4)=4,\\
# \max(1,2,4,5)=5,\\
# \max(3,4,6,7)=7,\\
# \max(4,5,7,8)=8.
# $$
# - A pooling layer with a pooling window shape of $p \times q$ is called a $p \times q$ pooling layer.
#     - The pooling operation is called $p \times q$ pooling.
# - Using a $2\times 2$ maximum pooling layer, we can still detect the pattern recognized by the convolutional layer if it moves no more than one element in height or width.

# In[36]:


from mxnet import nd
from mxnet.gluon import nn

def pool2d(X, pool_size, mode='max'):
    p_h, p_w = pool_size
    Y = nd.zeros((X.shape[0] - p_h + 1, X.shape[1] - p_w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            if mode == 'max':
                Y[i, j] = X[i: i + p_h, j: j + p_w].max()
            elif mode == 'avg':
                Y[i, j] = X[i: i + p_h, j: j + p_w].mean()
    return Y


# In[37]:


X = nd.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])
pool2d(X, (2, 2))


# In[38]:


pool2d(X, (2, 2), 'avg')


# ## 5.5.2 Padding and Stride
# - We will demonstrate the use of padding and stride in the pooling layer through the two-dimensional maximum pooling layer `MaxPool2D` in the `nn` module.
# - Pooling combined with a stride larger than 1 can be used to reduce the resolution.

# In[39]:


X = nd.arange(16).reshape((1, 1, 4, 4))
X


# - Because the pooling layer has no model parameters, we do not need to call the parameter initialization function.
# - By default, the stride in the `MaxPool2D` class has the same shape as the pooling window.

# In[40]:


pool2d = nn.MaxPool2D(3)
pool2d(X)


# In[41]:


pool2d = nn.MaxPool2D(3, padding=1, strides=2)
pool2d(X)


# In[42]:


pool2d = nn.MaxPool2D((2, 3), padding=(1, 2), strides=(2, 3))
pool2d(X)


# ## 5.5.3 Multiple Channels
# - When processing multi-channel input data, the pooling layer pools each input channel separately.
#     - [Note] A convolutional layer, in contrast, sums the cross-correlations over the input channels.
# - This means that the number of output channels for the pooling layer is the same as the number of input channels.

# In[44]:


X = nd.arange(16).reshape((1, 1, 4, 4))
X = nd.concat(X, X + 1, dim=1)
X


# In[45]:


pool2d = nn.MaxPool2D(3, padding=1, strides=2)
pool2d(X)
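# - Putting the pieces of this chapter together, a minimal sketch of our own (the variable `net` and the layer sizes are illustrative, not from the book): a padded convolution changes the number of channels while preserving resolution, and a strided max pooling layer then halves the resolution.

# In[ ]:


# On the (1, 2, 4, 4) input X from above: the padded 3x3 convolution keeps the
# 4x4 resolution and raises the channel count from 2 to 4; the strided 2x2 max
# pooling then reduces the resolution to 2x2.
net = nn.Sequential()
net.add(nn.Conv2D(channels=4, kernel_size=3, padding=1),
        nn.MaxPool2D(pool_size=2, strides=2))
net.initialize()
net(X).shape  # (1, 4, 2, 2)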