#!/usr/bin/env python
# coding: utf-8

# # 5.1 From Dense Layers to Convolutions
# - Networks with many parameters either require a lot of data or a lot of regularization.
# - Consider the task of distinguishing cats from dogs.
#     - We decide to use a good camera and take 1 megapixel photos.
#     - The input into a network has 1 million dimensions.
#     - Even an aggressive reduction to 1,000 dimensions after the first layer means that we need $10^9$ parameters.
#     - Add in subsequent layers and it is clear that this approach is infeasible.
# - Both humans and computers are able to distinguish cats from dogs quite well, often after seeing only a few hundred images.

# ## 5.1.1 Invariances (two key principles)
# ![](https://github.com/d2l-ai/d2l-en/raw/master/img/waldo.jpg)
# - [***Translation Invariance***] Object detectors should work the same regardless of where in the image an object is found.
#     - In other words, the 'waldoness' of a location in the image can be assessed without regard to the position within the image.
# - [***Locality***] Object detection can be answered by considering only local information.
#     - In other words, the 'waldoness' of a location can be assessed without regard to what else happens in the image at large distances.

# ## 5.1.2 Constraining the MLP
# - We will treat images and hidden layers as two-dimensional arrays, i.e. $x[i,j]$ and $h[i,j]$ denote position $(i,j)$ in an image.
# - In this case a dense layer can be written as follows:
# $$h[i,j] = \sum_{k,l} W[i,j,k,l] \cdot x[k,l] = \sum_{a, b} V[i,j,a,b] \cdot x[i+a,j+b]$$
#     - where we set $V[i,j,a,b] = W[i,j,i+a,j+b]$.
#     - For any given location $(i,j)$ in the hidden layer $h[i,j]$, we compute its value by summing over pixels in $x$, centered around $(i,j)$ and weighted by $V[i,j,a,b]$.
# - ***Translation Invariance***
#     - This is only possible if $V$ doesn't actually depend on $(i,j)$,
#     - that is, we have $V[i,j,a,b] = V[a,b]$.
#     - As a result we can simplify the definition of $h$:
# $$h[i,j] = \sum_{a, b} V[a,b] \cdot x[i+a,j+b]$$
#     - This is a convolution!
#     - We are effectively weighting pixels $(i+a, j+b)$ in the vicinity of $(i,j)$ with coefficients $V[a,b]$ to obtain the value $h[i,j]$.
#     - Note that $V[a,b]$ needs far fewer coefficients than $V[i,j,a,b]$: for a 1 megapixel image it has at most 1 million coefficients, a factor of 1 million fewer parameters, since the weights no longer depend on the location within the image.
# - ***Locality***
#     - We should not have to look very far away from $(i,j)$ in order to glean relevant information to assess what is going on at $h[i,j]$.
#     - This means that outside some range $|a|, |b| > \Delta$, we should set $V[a,b] = 0$.
# $$h[i,j] = \sum_{a = -\Delta}^{\Delta} \sum_{b = -\Delta}^{\Delta} V[a,b] \cdot x[i+a,j+b]$$
#     - This is the convolutional layer.
#     - While in a fully connected layer we might have needed $10^8$ or more coefficients, we now only need $O(\Delta^2)$ terms.

# ## 5.1.3 Convolutions
# - In mathematics, the convolution of two functions is defined as $(f * g)(x) = \int f(z)\,g(x-z)\,dz$.
# - For two-dimensional arrays this becomes $(f * g)(i,j) = \sum_{a,b} f(a,b)\,g(i-a,j-b)$.
# - Note the minus signs: a true convolution flips the kernel, while the layer derived above uses $x[i+a,j+b]$, i.e. a cross-correlation; the two differ only by this flip, as the sketch below illustrates.
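# - A minimal sketch of the kernel flip (our own illustration; the helper names `corr2d_naive` and `conv2d_naive` are not from the book):

# In[ ]:


from mxnet import nd

def corr2d_naive(X, K):
    # Plain 2D cross-correlation: slide K over X without flipping it.
    h, w = K.shape
    Y = nd.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = (X[i: i + h, j: j + w] * K).sum()
    return Y

def conv2d_naive(X, K):
    # A true convolution flips the kernel along both axes, then cross-correlates.
    return corr2d_naive(X, nd.flip(nd.flip(K, axis=0), axis=1))

X = nd.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])
K = nd.array([[0, 1], [2, 3]])
# The two outputs differ only because convolution flips the kernel.
(corr2d_naive(X, K), conv2d_naive(X, K))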
# # 5.2 Convolutions for Images
# - Strictly speaking, the name convolutional networks is a slight misnomer (but for notation only), since the operations are typically expressed as ***cross-correlations***.

# ## 5.2.1 The Cross-Correlation Operator
# - The kernel (or filter) is slid across the input window by window; at each position, the elementwise product of the window and the kernel is summed to give one output element.
# - The output array has a height of 2 and width of 2, and the four elements are derived from a two-dimensional cross-correlation operation:
# $$
# 0\times0+1\times1+3\times2+4\times3=19,\\
# 1\times0+2\times1+4\times2+5\times3=25,\\
# 3\times0+4\times1+6\times2+7\times3=37,\\
# 4\times0+5\times1+7\times2+8\times3=43.
# $$
# ![](https://github.com/d2l-ai/d2l-en/raw/master/img/correlation.svg?sanitize=true)
# - Note that the output size is smaller than the input.
#     - Input size: $H \times W$
#     - Kernel size: $h \times w$
#     - Output size: $(H-h+1) \times (W-w+1)$

# In[1]:


from mxnet import autograd, nd
from mxnet.gluon import nn

def corr2d(X, K):
    # This function has been saved in the gluonbook package for future use.
    h, w = K.shape
    Y = nd.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = (X[i: i + h, j: j + w] * K).sum()
    return Y


# In[2]:


X = nd.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])
K = nd.array([[0, 1], [2, 3]])
corr2d(X, K)


# ## 5.2.2 Convolutional Layers
# - A convolutional layer
#     - 1) cross-correlates the input and the kernel, and
#     - 2) adds a scalar bias to get the output.
# - The model parameters of the convolutional layer are the kernel and the scalar bias.

# In[3]:


class Conv2D(nn.Block):
    def __init__(self, kernel_size, **kwargs):
        super(Conv2D, self).__init__(**kwargs)
        self.weight = self.params.get('weight', shape=kernel_size)
        self.bias = self.params.get('bias', shape=(1,))

    def forward(self, x):
        return corr2d(x, self.weight.data()) + self.bias.data()
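# - As a minimal usage sketch (our own example, assuming the $3\times 3$ input `X` from above), we can initialize the block and apply it:

# In[ ]:


# Instantiate the custom Conv2D block with a 2x2 kernel; the layer expects a
# two-dimensional input array. Output shape: (3-2+1, 3-2+1) = (2, 2).
conv = Conv2D(kernel_size=(2, 2))
conv.initialize()  # random weight, default-initialized bias
conv(X)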
# ## 5.2.3 Object Edge Detection in Images
# - Let's look at a simple application of a convolutional layer:
#     - ***detecting the edge of an object in an image*** by finding the location of the pixel change.

# In[4]:


X = nd.ones((6, 8))
X[:, 2:6] = 0
X


# In[5]:


K = nd.array([[1, -1]])


# - We will detect 1 for the edge from white to black and -1 for the edge from black to white.
# - The rest of the outputs are 0.

# In[6]:


Y = corr2d(X, K)
Y


# - Let's apply the kernel to the transposed 'image'. As expected, it vanishes.
# - The kernel `K` only detects vertical edges.

# In[7]:


corr2d(X.T, K)
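# - A small complementary check of our own (not in the original text): transposing the kernel as well turns it into a vertical $[1, -1]$ filter, which does detect the now-horizontal edges of the transposed image.

# In[ ]:


# K.T has shape (2, 1), so it compares vertically adjacent pixels and
# responds to the horizontal edges in X.T.
corr2d(X.T, K.T)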
# ## 5.2.4 Learning a Kernel
# - We can learn the kernel that generated $Y$ from $X$ by looking at the (input, output) pairs only.
# - We use the built-in `Conv2D` class provided by Gluon below.
#     - We construct a convolutional layer with 1 output channel and a kernel array shape of (1, 2).

# In[16]:


conv2d = nn.Conv2D(channels=1, kernel_size=(1, 2))
conv2d.initialize()


# - The two-dimensional convolutional layer uses four-dimensional input and output in the format (example, channel, height, width).
#     - The batch size (number of examples in the batch): 1
#     - The number of channels: 1

# In[17]:


X = X.reshape((1, 1, 6, 8))
Y = Y.reshape((1, 1, 6, 7))
(X, Y)


# In[18]:


for i in range(10):
    with autograd.record():
        Y_hat = conv2d(X)
        l = (Y_hat - Y) ** 2
    l.backward()
    # For the sake of simplicity, we ignore the bias here.
    conv2d.weight.data()[:] -= 3e-2 * conv2d.weight.grad()
    print('batch %d, loss %.3f' % (i + 1, l.sum().asscalar()))


# In[19]:


conv2d.weight.data().reshape((1, 2))


# ## 5.2.5 Cross-Correlation and Convolution
# - The core computation of a two-dimensional convolutional layer is a two-dimensional cross-correlation operation: it cross-correlates the two-dimensional input data with the kernel, and then adds a bias.
# - Since kernels are learned from data, the distinction between convolution and cross-correlation does not matter in practice: a layer defined via a strict convolution would simply learn the flipped kernel and produce the same outputs.
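# - As a sanity check (our own sketch, not from the original text), we can plant the edge-detection kernel into a Gluon layer and confirm that its output matches our `corr2d` result `Y`, i.e. that the layer really computes a cross-correlation:

# In[ ]:


# Specifying in_channels makes the weight shape known at construction time,
# so we can overwrite it with set_data before the first forward pass.
conv2d_fixed = nn.Conv2D(channels=1, in_channels=1, kernel_size=(1, 2),
                         use_bias=False)
conv2d_fixed.initialize()
conv2d_fixed.weight.set_data(nd.array([[1, -1]]).reshape((1, 1, 1, 2)))
# Zero difference: the layer output equals the cross-correlation Y.
(conv2d_fixed(X) - Y).abs().sum().asscalar()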
# # 5.3 Padding and Stride
# - Assuming the input shape is $n_h\times n_w$ and the convolution kernel window shape is $k_h\times k_w$, the output shape will be
# $$(n_h-k_h+1) \times (n_w-k_w+1).$$

# ## 5.3.1 Padding
# - Multiple layers of convolutions reduce the information available at the boundary, often by much more than we would want.
#     - If we start with a 240x240 pixel image, 10 layers of 5x5 convolutions reduce the image to 200x200 pixels, effectively slicing off 30% of the image and with it obliterating anything interesting on the boundaries.
# - Padding mitigates this problem:
#     - add extra pixels around the boundary of the image, thus increasing the effective size of the image.
#     - The extra pixels typically assume the value 0.
# ![](https://github.com/d2l-ai/d2l-en/raw/master/img/conv_pad.svg?sanitize=true)
# - If a total of $p_h$ rows are padded on both sides of the height and a total of $p_w$ columns are padded on both sides of the width, the output shape will be
# $$(n_h-k_h+p_h+1)\times(n_w-k_w+p_w+1)$$
# - This means that the height and width of the output will increase by $p_h$ and $p_w$ respectively.
# - In many cases, we will want to set $p_h=k_h-1$ and $p_w=k_w-1$ to give the input and output the same height and width.
#     - This will make it easier to predict the output shape of each layer when constructing the network.
#     - If $k_h$ is odd, we will pad $p_h/2$ rows on both sides of the height.
#     - If $k_h$ is even, one possibility is to pad $\lceil p_h/2\rceil$ rows on the top of the input and $\lfloor p_h/2\rfloor$ rows on the bottom.
#     - We will pad both sides of the width in the same way.
# - Convolutional neural networks often use convolution ***kernels with odd height and width values***, such as 1, 3, 5, and 7,
#     - so the number of padding rows or columns on both sides is the same.
# - In the following example, the output size is
#     - $(n_h-k_h+p_h+1)\times(n_w-k_w+p_w+1) = (8 - 3 + 2 + 1) \times (8 - 3 + 2 + 1) = 8 \times 8$

# In[20]:


from mxnet import nd
from mxnet.gluon import nn

# We define a convenience function to calculate the convolutional layer. This function initializes
# the convolutional layer weights and performs corresponding dimensionality elevations and reductions
# on the input and output.
def comp_conv2d(conv2d, X):
    conv2d.initialize()
    # (1, 1) indicates that the batch size and the number of channels (described in later chapters) are both 1.
    X = X.reshape((1, 1) + X.shape)
    Y = conv2d(X)
    return Y.reshape(Y.shape[2:])  # Exclude the first two dimensions that do not interest us: batch and channel.

# Note that here 1 row or column is padded on either side, so a total of 2 rows or columns are added.
conv2d = nn.Conv2D(1, kernel_size=3, padding=1)
X = nd.random.uniform(shape=(8, 8))
comp_conv2d(conv2d, X).shape


# - In the following example, the output size is
#     - $(n_h-k_h+p_h+1)\times(n_w-k_w+p_w+1) = (8 - 5 + 4 + 1) \times (8 - 3 + 2 + 1) = 8 \times 8$

# In[23]:


# Here, we use a convolution kernel with a height of 5 and a width of 3. The padding numbers on
# both sides of the height and width are 2 and 1, respectively.
conv2d = nn.Conv2D(1, kernel_size=(5, 3), padding=(2, 1))
comp_conv2d(conv2d, X).shape


# ## 5.3.2 Stride
# - Cross-correlation with strides of 3 and 2 for height and width respectively:
# ![](https://github.com/d2l-ai/d2l-en/raw/master/img/conv_stride.svg?sanitize=true)
# - When the stride for the height is $s_h$ and the stride for the width is $s_w$, the output shape is
# $$\lfloor(n_h-k_h+p_h+s_h)/s_h\rfloor \times \lfloor(n_w-k_w+p_w+s_w)/s_w\rfloor.$$
# - In the following example, the output size is
#     - $\lfloor(n_h-k_h+p_h+s_h)/s_h\rfloor \times \lfloor(n_w-k_w+p_w+s_w)/s_w\rfloor = \lfloor(8-3+2+2)/2\rfloor \times \lfloor(8-3+2+2)/2\rfloor = 4 \times 4$

# In[22]:


conv2d = nn.Conv2D(1, kernel_size=3, padding=1, strides=2)
comp_conv2d(conv2d, X).shape


# - In the following example, the output size is
#     - $\lfloor(n_h-k_h+p_h+s_h)/s_h\rfloor \times \lfloor(n_w-k_w+p_w+s_w)/s_w\rfloor = \lfloor(8-3+0+3)/3\rfloor \times \lfloor(8-5+2+4)/4\rfloor = 2 \times 2$

# In[25]:


conv2d = nn.Conv2D(1, kernel_size=(3, 5), padding=(0, 1), strides=(3, 4))
comp_conv2d(conv2d, X).shape


# - When the padding numbers on both sides of the input height and width are $p_h$ and $p_w$ respectively, we call the padding $(p_h, p_w)$.
#     - Specifically, when $p_h = p_w = p$, the padding is $p$.
# - When the strides on the height and width are $s_h$ and $s_w$ respectively, we call the stride $(s_h, s_w)$.
#     - Specifically, when $s_h = s_w = s$, the stride is $s$.
# - By default, the padding is 0 and the stride is 1.
# - In practice we rarely use inhomogeneous strides or padding, i.e. we usually have $p_h = p_w$ and $s_h = s_w$.
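# - The general formula is easy to check programmatically. Below is a small helper of our own (the name `conv_out_shape` is not from the book) that evaluates $\lfloor(n-k+p+s)/s\rfloor$ per dimension, matching the shapes Gluon reported above:

# In[ ]:


def conv_out_shape(n, k, p, s):
    # floor((n - k + p + s) / s) for one dimension; p is the total padding.
    return (n - k + p + s) // s

# Check the two stride examples above against the formula.
print(conv_out_shape(8, 3, 2, 2), conv_out_shape(8, 3, 2, 2))  # 4 4
print(conv_out_shape(8, 3, 0, 3), conv_out_shape(8, 5, 2, 4))  # 2 2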
# # 5.4 Multiple Input and Output Channels
# - Assuming that the height and width of a color image are $h$ and $w$ (pixels), it can be represented in memory as a multi-dimensional array of shape $3\times h\times w$.
# - We refer to this dimension, with a size of 3, as the ***channel*** dimension.

# ## 5.4.1 Multiple Input Channels
# - When the input data contains multiple channels, we need to construct ***a convolution kernel with the same number of input channels***, so that it can perform cross-correlation with the input data.
#     - The number of channels of the input data: $c_i$
#     - The convolution kernel window shape: $k_h\times k_w$
#     - So the shape of the convolution kernel is $c_i\times k_h\times k_w$.
# - Cross-correlation:
#     - $(1\times1+2\times2+4\times3+5\times4)+(0\times0+1\times1+3\times2+4\times3)=56$
# ![](https://github.com/d2l-ai/d2l-en/raw/master/img/conv_multi_in.svg?sanitize=true)

# In[26]:


import gluonbook as gb
from mxnet import nd

def corr2d_multi_in(X, K):
    # First, traverse along the 0th dimension (channel dimension) of X and K.
    # Then, add them together by using * to turn the result list into a
    # positional argument of the add_n function.
    return nd.add_n(*[gb.corr2d(x, k) for x, k in zip(X, K)])


# In[27]:


X = nd.array([[[0, 1, 2], [3, 4, 5], [6, 7, 8]],
              [[1, 2, 3], [4, 5, 6], [7, 8, 9]]])
K = nd.array([[[0, 1], [2, 3]], [[1, 2], [3, 4]]])
corr2d_multi_in(X, K)


# ## 5.4.2 Multiple Output Channels
# - We might need more than one output channel,
#     - for edge detection in different directions or
#     - for more advanced filters.
# - The number of input channels: $c_i$
# - The number of output channels: $c_o$
# - Let $k_h$ and $k_w$ be the height and width of the kernel.
# - To get an output with multiple channels, we create a kernel array of shape $c_i\times k_h\times k_w$ for each output channel.
# - We concatenate them on the output channel dimension, so that the shape of the convolution kernel is $c_o\times c_i\times k_h\times k_w$.
# - In cross-correlation operations, the result on each output channel is calculated from the kernel array of the convolution kernel on the same output channel and the entire input array.

# In[29]:


def corr2d_multi_in_out(X, K):
    # Traverse along the 0th dimension of K, and each time, perform
    # cross-correlation operations with input X. All of the results are
    # merged together using the stack function.
    return nd.stack(*[corr2d_multi_in(X, k) for k in K])


# - We construct a convolution kernel with 3 output channels by stacking the kernel array $K$ with $K+1$ (plus one for each element in $K$) and $K+2$.

# In[31]:


K = nd.array([[[0, 1], [2, 3]], [[1, 2], [3, 4]]])
K = nd.stack(K, K + 1, K + 2)
K.shape


# In[32]:


corr2d_multi_in_out(X, K)


# ## 5.4.3 1×1 Convolutional Layer
# - A $1 \times 1$ convolution, i.e. $k_h = k_w = 1$,
#     - obviously doesn't correlate adjacent pixels;
#     - it loses the ability of the convolutional layer to recognize patterns composed of adjacent elements in the height and width dimensions.
# - The main computation of the $1\times 1$ convolution occurs on the channel dimension.
# ![](https://github.com/d2l-ai/d2l-en/raw/master/img/conv_1x1.svg?sanitize=true)
# - The inputs and outputs have the same height and width.
# - Each element in the output is derived from a linear combination of elements at the same position in the height and width of the input across different channels.
# - If we regard the channel dimension as a feature dimension and the elements in the height and width dimensions as data examples, then the $1\times 1$ convolutional layer is equivalent to a fully connected layer.

# In[33]:


def corr2d_multi_in_out_1x1(X, K):
    c_i, h, w = X.shape
    c_o = K.shape[0]
    X = X.reshape((c_i, h * w))    # (c_i, h*w) = (3, 9)
    K = K.reshape((c_o, c_i))      # (c_o, c_i) = (2, 3)
    Y = nd.dot(K, X)               # Matrix multiplication in the fully connected layer.
    return Y.reshape((c_o, h, w))  # (2, 3, 3)


# In[35]:


X = nd.random.uniform(shape=(3, 3, 3))
K = nd.random.uniform(shape=(2, 3, 1, 1))
Y1 = corr2d_multi_in_out_1x1(X, K)
Y2 = corr2d_multi_in_out(X, K)
(Y1 - Y2).norm().asscalar() < 1e-6


# - The $1\times 1$ convolutional layer is equivalent to the fully connected layer when applied on a per-pixel basis.
# - The $1\times 1$ convolutional layer is typically used to adjust the number of channels between network layers and to control model complexity.

# # 5.5 Pooling
# - As we process images (or other data sources) we will eventually want to reduce their resolution.
# - Reasons:
#     - 1) We typically want to output an estimate that does not depend on the dimensionality of the original image.
#     - 2) When detecting lower-level features, such as edges, we often want to have some degree of invariance to translation.

# ## 5.5.1 Maximum Pooling and Average Pooling
# - Pooling computes the output for each element in a fixed-shape window (also known as a pooling window) of input data.
# - The pooling layer directly calculates the maximum or average value of the elements in the pooling window.
# - These operations are called maximum pooling and average pooling respectively.
# ![](https://github.com/d2l-ai/d2l-en/raw/master/img/pooling.svg?sanitize=true)
# - The four elements are derived by taking the maximum over each window:
# $$
# \max(0,1,3,4)=4,\\
# \max(1,2,4,5)=5,\\
# \max(3,4,6,7)=7,\\
# \max(4,5,7,8)=8.
# $$
# - A pooling layer with a pooling window shape of $p \times q$ is called a $p \times q$ pooling layer.
#     - The pooling operation is called $p \times q$ pooling.
# - Using a $2\times 2$ maximum pooling layer, we can still detect the pattern recognized by the convolutional layer if it moves no more than one element in height or width.

# In[36]:


from mxnet import nd
from mxnet.gluon import nn

def pool2d(X, pool_size, mode='max'):
    p_h, p_w = pool_size
    Y = nd.zeros((X.shape[0] - p_h + 1, X.shape[1] - p_w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            if mode == 'max':
                Y[i, j] = X[i: i + p_h, j: j + p_w].max()
            elif mode == 'avg':
                Y[i, j] = X[i: i + p_h, j: j + p_w].mean()
    return Y


# In[37]:


X = nd.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])
pool2d(X, (2, 2))


# In[38]:


pool2d(X, (2, 2), 'avg')


# ## 5.5.2 Padding and Stride
# - We will demonstrate the use of padding and stride in the pooling layer through the two-dimensional maximum pooling layer `MaxPool2D` in the `nn` module.
# - Pooling combined with a stride larger than 1 can be used to reduce the resolution.

# In[39]:


X = nd.arange(16).reshape((1, 1, 4, 4))
X


# - Because the pooling layer has no model parameters, we do not need to call the parameter initialization function.
# - By default, the stride in the `MaxPool2D` class has the same shape as the pooling window.

# In[40]:


pool2d = nn.MaxPool2D(3)
pool2d(X)


# In[41]:


pool2d = nn.MaxPool2D(3, padding=1, strides=2)
pool2d(X)


# In[42]:


pool2d = nn.MaxPool2D((2, 3), padding=(1, 2), strides=(2, 3))
pool2d(X)


# ## 5.5.3 Multiple Channels
# - When processing multi-channel input data, the pooling layer pools each input channel separately.
#     - [Note] A convolutional layer, in contrast, sums the cross-correlations over the input channels.
# - This means that the number of output channels for the pooling layer is the same as the number of input channels.

# In[44]:


X = nd.arange(16).reshape((1, 1, 4, 4))
X = nd.concat(X, X + 1, dim=1)
X


# In[45]:


pool2d = nn.MaxPool2D(3, padding=1, strides=2)
pool2d(X)
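# - Putting the pieces of this chapter together, a minimal sketch of our own (the variable `net` and the layer sizes are illustrative, not from the book): a padded convolution changes the number of channels while preserving resolution, and a strided max pooling layer then halves the resolution.

# In[ ]:


# On the (1, 2, 4, 4) input X from above: the padded 3x3 convolution keeps the
# 4x4 resolution and raises the channel count from 2 to 4; the strided 2x2 max
# pooling then reduces the resolution to 2x2.
net = nn.Sequential()
net.add(nn.Conv2D(channels=4, kernel_size=3, padding=1),
        nn.MaxPool2D(pool_size=2, strides=2))
net.initialize()
net(X).shape  # (1, 4, 2, 2)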