- Let's review some of the practical challenges when training deep networks.
- Data preprocessing is a key aspect of effective modeling.
- We standardized input data to zero mean and unit variance.
- Standardizing input data makes the distributions of the features similar, which generally makes it easier to train effective models since the parameters are a priori at a similar scale.

- The activations in intermediate layers will assume rather different orders of magnitude.
- This can lead to convergence problems caused by the differing scales of the activations.
- If one layer has activation values that are 100x those of another layer, we need to adjust learning rates adaptively per layer (or even per parameter group per layer).

- Deeper networks are fairly complex and they are more prone to overfitting.
- This means that regularization becomes more critical.
- Dropout is nontrivial to use in convolutional layers and does not perform as well there.
- Hence we need a more appropriate type of regularization.

- During training, the last layers converge first, at which point the layers below start converging.
- Unfortunately, once this happens, the weights for the last layers are no longer optimal and they need to converge again.
- As training progresses, this gets worse.

- Batch normalization (BN) can be used to cope with the challenges of deep model training.
- During training, BN continuously adjusts the intermediate output of the neural network by utilizing the mean and standard deviation of the mini-batch.
- Ioffe and Szegedy, "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift," 2015 - https://arxiv.org/abs/1502.03167

- In a nutshell, the idea in Batch Normalization is to transform the activation at a given layer from $\mathbf{x}$ to $$\mathrm{BN}(\mathbf{x}) = \mathbf{\gamma} \odot \frac{\mathbf{x} - \hat{\mathbf{\mu}}}{\hat{\mathbf{\sigma}}} + \mathbf{\beta}$$
- $\hat{\mathbf{\mu}}$ is the estimate of the mean.
- $\hat{\mathbf{\sigma}}$ is the estimate of the standard deviation (the square root of the estimated variance).
- The activations are approximately rescaled to zero mean and unit variance.
- Since in some cases the activations need to differ from standardized data, we allow for a coordinate-wise scaling coefficient $\mathbf{\gamma}$ and an offset $\mathbf{\beta}$.

- Consequently, the activations of intermediate layers can no longer diverge.
- We are actively rescaling them back to a given order of magnitude via $\hat{\mathbf{\mu}}$ and $\hat{\mathbf{\sigma}}$.

- Consequently, we can be more aggressive in picking large learning rates.
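To make the transformation concrete, here is a minimal plain-Python sketch for a single feature observed over one minibatch. The function name `batch_norm_1d`, the sample activations, and the values chosen for `gamma` and `beta` are illustrative assumptions and not part of the lecture code; `eps` is the usual small constant that guards against division by zero.

```python
import math


def batch_norm_1d(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize the activations of one feature across a minibatch."""
    n = len(values)
    mean = sum(values) / n
    # Biased (population) variance of the minibatch.
    var = sum((v - mean) ** 2 for v in values) / n
    std = math.sqrt(var + eps)
    # Standardize, then apply the scale gamma and the offset beta.
    return [gamma * (v - mean) / std + beta for v in values]


# Hypothetical activations that live on a large scale.
activations = [100.0, 200.0, 300.0, 400.0]
normalized = batch_norm_1d(activations)
rescaled = batch_norm_1d(activations, gamma=2.0, beta=5.0)
```

With the default `gamma=1.0` and `beta=0.0` the result has approximately zero mean and unit variance regardless of the input scale; with other values the layer can move the activations to whatever mean and spread training prefers.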

- Batch Normalization for Fully Connected Layers
- We will put the batch normalization layer between the affine transformation and the activation function in the fully connected layer.
- We denote by $\mathbf{u}$ the input and by $\mathbf{x} = \mathbf{W}\mathbf{u} + \mathbf{b}$ the output of the linear transform.
- This yields the following variant of the batch norm: $$\mathbf{y} = \phi(\mathrm{BN}(\mathbf{x})) = \phi(\mathrm{BN}(\mathbf{W}\mathbf{u} + \mathbf{b}))$$
- Recall that the mean and variance are computed on the same minibatch $\mathcal{B}$ to which this transformation is applied.
- Also recall that the scaling coefficient $\mathbf{\gamma}$ and the offset $\mathbf{\beta}$ are parameters that need to be learned.
- They ensure that the effect of batch normalization can be neutralized as needed.

- Batch Normalization for Convolutional Layers
- Batch normalization occurs after the convolution computation and before the application of the activation function.
- If the convolution computation outputs multiple channels, we need to carry out batch normalization for each of the outputs of these channels, and each channel has an independent scale parameter and shift parameter.
- Assume that there are $m$ examples in the mini-batch.
- On a single channel, we assume that the height and width of the convolution computation output are $p$ and $q$, respectively.
- We need to carry out batch normalization for $m \times p \times q$ elements in this channel simultaneously.
- While carrying out the standardization computation for these elements, we use the same mean and variance.
- In other words, we use the means and variances of the $m \times p \times q$ elements in this channel rather than one per pixel.
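The per-channel statistics can be sketched in plain Python with nested lists indexed as `batch[example][channel][row][column]`. The helper `channel_stats` and the toy data are assumptions made for illustration only; a real implementation would use array operations instead.

```python
def channel_stats(batch):
    """Return per-channel means and variances over all m * p * q elements."""
    num_channels = len(batch[0])
    means, variances = [], []
    for c in range(num_channels):
        # Gather every value of channel c across examples, rows and columns.
        values = [v for example in batch for row in example[c] for v in row]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        means.append(mean)
        variances.append(var)
    return means, variances


# Toy minibatch: m = 2 examples, 2 channels, p = q = 2.
batch = [
    [[[1.0, 2.0], [3.0, 4.0]], [[10.0, 10.0], [10.0, 10.0]]],
    [[[5.0, 6.0], [7.0, 8.0]], [[20.0, 20.0], [20.0, 20.0]]],
]
means, variances = channel_stats(batch)
```

Each channel ends up with exactly one mean and one variance, each computed from $m \times p \times q = 8$ values, which is why the scale and shift parameters are also stored per channel.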

- Batch Normalization During Prediction
- At prediction time we might be required to make one prediction at a time.
- Estimates of $\mathbf{\mu}$ and $\mathbf{\sigma}$ computed from a single minibatch are noisy (with one example per batch the variance is zero), so they are highly undesirable once the model is trained.
- One way to mitigate this is to compute more stable estimates once over a larger set (e.g. via a moving average) and then keep them fixed at prediction time.
- Consequently, Batch Normalization behaves differently during training and test time.
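A moving average of the minibatch statistics can be sketched as follows. The helper name `update_moving_average` is hypothetical, and the constant minibatch mean of 5.0 is an artificial assumption chosen so that the behaviour is easy to follow; the momentum of 0.9 is just one common choice.

```python
def update_moving_average(moving, batch_value, momentum=0.9):
    """One exponential-moving-average step for a running statistic."""
    return momentum * moving + (1.0 - momentum) * batch_value


moving_mean = 0.0  # the running estimate starts at zero
history = []
for step in range(50):
    batch_mean = 5.0  # pretend every minibatch reports the same mean
    moving_mean = update_moving_average(moving_mean, batch_mean)
    history.append(moving_mean)
```

Because the estimate starts at zero, it approaches the true value only gradually: after $k$ updates it equals $5 \cdot (1 - 0.9^k)$, which is roughly 3.26 after 10 steps. The statistics used at prediction time are therefore only reliable after enough training iterations.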

In [1]:

```
import gluonbook as gb
from mxnet import autograd, gluon, init, nd
from mxnet.gluon import nn


def batch_norm(X, gamma, beta, moving_mean, moving_var, eps, momentum):
    # Use autograd to determine whether the current mode is training mode or
    # prediction mode.
    if not autograd.is_training():
        # In prediction mode, directly use the mean and variance obtained
        # from the incoming moving average.
        X_hat = (X - moving_mean) / nd.sqrt(moving_var + eps)
    else:
        assert len(X.shape) in (2, 4)
        if len(X.shape) == 2:
            # When using a fully connected layer, calculate the mean and
            # variance on the feature dimension.
            mean = X.mean(axis=0)
            var = ((X - mean) ** 2).mean(axis=0)
        else:
            # When using a two-dimensional convolutional layer, calculate the
            # mean and variance on the channel dimension (axis=1). Here we
            # need to maintain the shape of X, so that the broadcast operation
            # can be carried out later.
            mean = X.mean(axis=(0, 2, 3), keepdims=True)
            var = ((X - mean) ** 2).mean(axis=(0, 2, 3), keepdims=True)
        # In training mode, the current mean and variance are used for the
        # standardization.
        X_hat = (X - mean) / nd.sqrt(var + eps)
        # Update the mean and variance of the moving average.
        moving_mean = momentum * moving_mean + (1.0 - momentum) * mean
        moving_var = momentum * moving_var + (1.0 - momentum) * var
    Y = gamma * X_hat + beta  # Scale and shift.
    return Y, moving_mean, moving_var
```

- The `BatchNorm` layer retains the scale parameter `gamma` and the shift parameter `beta`, which take part in gradient computation and parameter updates.
- It also maintains the mean and variance obtained from the moving average, so that they can be used during model prediction.

In [2]:

```
class BatchNorm(nn.Block):
    def __init__(self, num_features, num_dims, **kwargs):
        super(BatchNorm, self).__init__(**kwargs)
        if num_dims == 2:
            shape = (1, num_features)
        else:
            shape = (1, num_features, 1, 1)
        # The scale parameter and the shift parameter, which take part in
        # gradient computation and updates, are initialized to 1 and 0
        # respectively.
        self.gamma = self.params.get('gamma', shape=shape, init=init.One())
        self.beta = self.params.get('beta', shape=shape, init=init.Zero())
        # All the variables that do not take part in gradient computation and
        # updates are initialized to 0 on the CPU.
        self.moving_mean = nd.zeros(shape)
        self.moving_var = nd.zeros(shape)

    def forward(self, X):
        # If X is not on the CPU, copy moving_mean and moving_var to the
        # device where X is located.
        if self.moving_mean.context != X.context:
            self.moving_mean = self.moving_mean.copyto(X.context)
            self.moving_var = self.moving_var.copyto(X.context)
        # Save the updated moving_mean and moving_var.
        Y, self.moving_mean, self.moving_var = batch_norm(
            X,
            self.gamma.data(),
            self.beta.data(),
            self.moving_mean,
            self.moving_var,
            eps=1e-5,
            momentum=0.9
        )
        return Y
```

- The `num_features` parameter required by the `BatchNorm` instance is the number of outputs for a fully connected layer and the number of output channels for a convolutional layer.
- The `num_dims` parameter also required by this instance is 2 for a fully connected layer and 4 for a convolutional layer.

In [3]:

```
net = nn.Sequential()
net.add(
    nn.Conv2D(6, kernel_size=5),
    BatchNorm(6, num_dims=4),
    nn.Activation('sigmoid'),
    nn.MaxPool2D(pool_size=2, strides=2),
    nn.Conv2D(16, kernel_size=5),
    BatchNorm(16, num_dims=4),
    nn.Activation('sigmoid'),
    nn.MaxPool2D(pool_size=2, strides=2),
    nn.Dense(120),
    BatchNorm(120, num_dims=2),
    nn.Activation('sigmoid'),
    nn.Dense(84),
    BatchNorm(84, num_dims=2),
    nn.Activation('sigmoid'),
    nn.Dense(10)
)
```

In [5]:

```
lr = 1.0
num_epochs = 5
batch_size = 256
ctx = gb.try_gpu()
net.initialize(ctx=ctx, init=init.Xavier(), force_reinit=True)
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': lr})
train_iter, test_iter = gb.load_data_fashion_mnist(batch_size)
gb.train_ch5(net, train_iter, test_iter, batch_size, trainer, ctx, num_epochs)
```

In [6]:

```
net[1].gamma.data().reshape((-1,)), net[1].beta.data().reshape((-1,))
```

Out[6]:

In [7]:

```
net = nn.Sequential()
net.add(
    nn.Conv2D(6, kernel_size=5),
    nn.BatchNorm(),
    nn.Activation('sigmoid'),
    nn.MaxPool2D(pool_size=2, strides=2),
    nn.Conv2D(16, kernel_size=5),
    nn.BatchNorm(),
    nn.Activation('sigmoid'),
    nn.MaxPool2D(pool_size=2, strides=2),
    nn.Dense(120),
    nn.BatchNorm(),
    nn.Activation('sigmoid'),
    nn.Dense(84),
    nn.BatchNorm(),
    nn.Activation('sigmoid'),
    nn.Dense(10)
)
```

In [ ]:

```
lr = 1.0
num_epochs = 5
batch_size = 256
ctx = gb.try_gpu()
net.initialize(ctx=ctx, init=init.Xavier(), force_reinit=True)
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': lr})
train_iter, test_iter = gb.load_data_fashion_mnist(batch_size)
gb.train_ch5(net, train_iter, test_iter, batch_size, trainer, ctx, num_epochs)
```

During model training, batch normalization continuously adjusts the intermediate output of the neural network by utilizing the mean and standard deviation of the mini-batch, so that the values of the intermediate output in each layer throughout the neural network are more stable.

Like a dropout layer, batch normalization layers have different computation results in training mode and prediction mode.

Batch Normalization has many beneficial side effects, primarily that of regularization.

On the other hand, the original motivation of reducing covariate shift seems not to be a valid explanation.

- Can we add a new layer to the neural network so that the fully trained model can reduce training errors more effectively?
- The added layer might make it easier to reduce training errors.

- In practice, however, with the addition of too many layers, training errors increase rather than decrease.
- Adding layers doesn't only make the network more expressive.

Function Classes

- Consider $\mathcal{F}$, the class of functions that a specific network architecture (together with learning rates and other hyperparameter settings) can reach.
- That is, for all $f \in \mathcal{F}$ there exists some set of parameters $W$ that can be obtained through training on a suitable dataset.

- Let's assume that $\hat{f}$ is the function that we really would like to find.


- Only if larger function classes contain the smaller ones are we guaranteed that increasing them strictly increases the expressive power of the network.

He Kaiming and his colleagues proposed the ResNet.

Papers

He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778). https://arxiv.org/abs/1512.03385

He, K., Zhang, X., Ren, S., & Sun, J. (2016, October). Identity mappings in deep residual networks. In European Conference on Computer Vision (pp. 630-645). Springer, Cham. - https://arxiv.org/abs/1603.05027

- It won the ImageNet Large Scale Visual Recognition Challenge in 2015.
- It had a profound influence on the design of subsequent deep neural networks.
- At the heart of ResNet is the idea that *every additional layer should contain the identity function as one of its elements*.
- This means that if we can train the newly-added layer into an identity mapping $f(\mathbf{x}) = \mathbf{x}$, the new model will be as effective as the original model.

- These considerations are rather profound but they led to a surprisingly simple solution, a *residual block*.

- Each block can be expressed in a general form: $$\begin{aligned} y_l &= h(x_l) + F(x_l, W_l) \\ x_{l+1} &= f(y_l) \end{aligned}$$
- $x_l$ and $x_{l+1}$ are the input and output of the $l$-th unit.
- $F$ is a residual function.
- $h(x_l) = x_l$ is an identity mapping.
- $f$ is a ReLU function.

- The central idea of ResNets: *learn the additive residual function $F$ with respect to an identity mapping $h(x_l) = x_l$*.
- This is realized by attaching an identity skip connection (“shortcut”).
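The general form above can be illustrated on plain Python lists. The names `relu`, `residual_unit`, `zero_residual` and `small_residual` are made up for this sketch, and the residual functions are stand-ins for the learned layers $F(x_l, W_l)$, not the convolutional branch used in a real ResNet.

```python
def relu(values):
    """Element-wise ReLU, playing the role of f in x_{l+1} = f(y_l)."""
    return [max(0.0, v) for v in values]


def residual_unit(x, residual_fn):
    """Compute f(h(x) + F(x)) with h the identity mapping."""
    y = [xi + ri for xi, ri in zip(x, residual_fn(x))]
    return relu(y)


def zero_residual(x):
    # A residual branch whose weights have all been driven to zero.
    return [0.0 for _ in x]


def small_residual(x):
    # A hypothetical learned correction on top of the identity.
    return [0.1 * xi for xi in x]


x = [1.0, 2.0, 3.0]
identity_out = residual_unit(x, zero_residual)
corrected_out = residual_unit(x, small_residual)
```

When the residual branch outputs zero, the unit passes a non-negative input through unchanged, so adding such a unit cannot make the network less expressive; the branch only has to learn the deviation from the identity.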

- ResNet follows VGG's full $3\times 3$ convolutional layer design.
- The residual block has two $3\times 3$ convolutional layers with the same number of output channels.
- Each convolutional layer is followed by a batch normalization layer and a ReLU activation function.
- Then, we skip these two convolution operations and add the input directly before the final ReLU activation function.
- The output of the two convolutional layers should be of the same shape as the input, so that they can be added together.

- If we want to change the number of channels or the stride, we need to introduce an additional $1\times 1$ convolutional layer to transform the input into the desired shape for the addition operation.

In [1]:

```
import gluonbook as gb
from mxnet import gluon, init, nd
from mxnet.gluon import nn


class Residual(nn.Block):  # This class is part of the gluonbook package
    def __init__(self, num_channels, use_1x1conv=False, strides=1, **kwargs):
        super(Residual, self).__init__(**kwargs)
        self.conv1 = nn.Conv2D(num_channels, kernel_size=3, padding=1,
                               strides=strides)
        self.conv2 = nn.Conv2D(num_channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm()
        self.bn2 = nn.BatchNorm()
        if use_1x1conv:
            # The 1x1 convolution reshapes the shortcut to match the main branch.
            self.conv3 = nn.Conv2D(num_channels, kernel_size=1,
                                   strides=strides)
        else:
            self.conv3 = None

    def forward(self, X):
        Y = nd.relu(self.bn1(self.conv1(X)))
        Y = self.bn2(self.conv2(Y))
        if self.conv3:
            X = self.conv3(X)
        return nd.relu(Y + X)
```

- The above code generates two types of networks:
- 1) Add the input to the output before applying the ReLU nonlinearity.
- 2) Whenever `use_1x1conv=True`, adjust channels and resolution by means of a $1 \times 1$ convolution before adding.
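Whether the two branches can be added is a matter of shape arithmetic. The sketch below assumes the standard output-size formula $\lfloor (n + 2p - k)/s \rfloor + 1$ for a convolution with kernel $k$, stride $s$ and padding $p$; the helpers `conv_out_size` and `residual_shapes` are hypothetical and only mirror the parameters of the residual block (two $3\times 3$ convolutions with padding 1, optional $1\times 1$ shortcut).

```python
def conv_out_size(n, kernel, stride=1, padding=0):
    """Spatial output size of a convolution (floor rounding assumed)."""
    return (n + 2 * padding - kernel) // stride + 1


def residual_shapes(in_channels, size, num_channels,
                    use_1x1conv=False, strides=1):
    """Return (main_branch_shape, shortcut_shape) as (channels, h, w)."""
    # Main branch: 3x3 conv with the given stride, then 3x3 conv with stride 1.
    main = conv_out_size(size, kernel=3, stride=strides, padding=1)
    main = conv_out_size(main, kernel=3, stride=1, padding=1)
    main_shape = (num_channels, main, main)
    if use_1x1conv:
        # Shortcut: 1x1 conv changes channels and applies the same stride.
        short = conv_out_size(size, kernel=1, stride=strides, padding=0)
        shortcut_shape = (num_channels, short, short)
    else:
        # Shortcut: the input is passed through unchanged.
        shortcut_shape = (in_channels, size, size)
    return main_shape, shortcut_shape


same = residual_shapes(3, 6, num_channels=3)
downsampled = residual_shapes(3, 6, num_channels=6,
                              use_1x1conv=True, strides=2)
mismatch = residual_shapes(3, 6, num_channels=6, strides=2)
```

For a $3$-channel, $6\times 6$ input, the plain block keeps the shape, the block with the $1\times 1$ convolution and stride 2 yields matching $6$-channel, $3\times 3$ branches, and changing channels or stride without the $1\times 1$ convolution leaves the two branches with different shapes that cannot be added.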

In [2]:

```
blk = Residual(num_channels=3)
blk.initialize()
X = nd.random.uniform(shape=(4, 3, 6, 6))
blk(X).shape
```

Out[2]:

In [3]:

```
blk = Residual(num_channels=6, use_1x1conv=True, strides=2)
blk.initialize()
X = nd.random.uniform(shape=(4, 3, 6, 6))
blk(X).shape
```

Out[3]:

- The first two layers of ResNet are the same as those of GoogLeNet:
- A $7\times 7$ convolutional layer with 64 output channels and a stride of 2.
- Then, a $3\times 3$ maximum pooling layer with a stride of 2 and padding of 1.

- The difference is the batch normalization layer added after each convolutional layer in ResNet.

In [18]:

```
net = nn.Sequential()
net.add(
    nn.Conv2D(64, kernel_size=7, strides=2, padding=3),
    nn.BatchNorm(),
    nn.Activation('relu'),
    nn.MaxPool2D(pool_size=3, strides=2, padding=1)
)
```

In [19]:

```
X = nd.random.uniform(shape=(1, 1, 28, 28))
net.initialize()
for layer in net:
    X = layer(X)
    print(layer.name, 'output shape:\t', X.shape)
```

In [20]:

```
net.add(
    # Since a maximum pooling layer with a stride of 2 has already been used,
    # it is not necessary to reduce the height and width at the first
    # residual block.
    Residual(num_channels=64),
    Residual(num_channels=64),
    Residual(num_channels=64),
    Residual(num_channels=128, use_1x1conv=True, strides=2),  # height and width are halved
    Residual(num_channels=128),
    Residual(num_channels=128),
    Residual(num_channels=256, use_1x1conv=True, strides=2),  # height and width are halved
    Residual(num_channels=256),
    Residual(num_channels=256),
    Residual(num_channels=512, use_1x1conv=True, strides=2),  # height and width are halved
    Residual(num_channels=512),
    Residual(num_channels=512)
)
```
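As a rough check of how the resolution shrinks through this network, the spatial size can be traced by hand for a $28\times 28$ input. This sketch assumes floor rounding for both convolution and pooling; `out_size` and the `stages` list are illustrative names, with one (channels, stride) entry per group of residual blocks since only the first block of a group changes the resolution.

```python
def out_size(n, kernel, stride, padding):
    """Spatial output size of a convolution or pooling layer (floor rounding)."""
    return (n + 2 * padding - kernel) // stride + 1


size = 28
size = out_size(size, kernel=7, stride=2, padding=3)  # 7x7 convolution
size = out_size(size, kernel=3, stride=2, padding=1)  # 3x3 max pooling
trace = [('stem', 64, size)]

# Stride of the first residual block in each group of blocks.
stages = [(64, 1), (128, 2), (256, 2), (512, 2)]
for channels, stride in stages:
    size = out_size(size, kernel=3, stride=stride, padding=1)
    trace.append(('stage', channels, size))
```

Under these assumptions the stem reduces $28\times 28$ to $7\times 7$, and the three strided groups reduce it further to $4\times 4$, $2\times 2$ and $1\times 1$ while the channel count doubles each time.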

In [21]:

```
net.add(
    nn.GlobalAvgPool2D(),
    nn.Dense(10)
)
```

In [23]:

```
X = nd.random.uniform(shape=(1, 1, 28, 28))
net.initialize(force_reinit=True)
for layer in net:
    X = layer(X)
    print(layer.name, 'output shape:\t', X.shape)
```

In [24]:

```
lr = 0.05
num_epochs = 5
batch_size = 256
ctx = gb.try_gpu()
net.initialize(force_reinit=True, ctx=ctx, init=init.Xavier())
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': lr})
train_iter, test_iter = gb.load_data_fashion_mnist(batch_size)
gb.train_ch5(net, train_iter, test_iter, batch_size, trainer, ctx, num_epochs)
```

- Residual blocks allow for a parametrization relative to the identity function $f(\mathbf{x}) = \mathbf{x}$.
- Adding residual blocks increases the function complexity in a well-defined manner.
- We can train an effective deep neural network by letting data pass forward through the cross-layer channels (shortcuts) of the residual blocks.
- ResNet had a major influence on the design of subsequent deep neural networks, of both convolutional and sequential nature.

- DenseNet is a logical extension of ResNet.
- To understand how to arrive at it, let's take a small detour to theory.
- Recall the Taylor expansion of a function around $x = 0$. $$f(x) = f(0) + f'(0) x + \frac{1}{2} f''(0) x^2 + \frac{1}{6} f'''(0) x^3 + o(x^3)$$
Function Decomposition

- It decomposes the function into increasingly higher order terms.
- ResNet decomposes functions into $$f(\mathbf{x}) = \mathbf{x} + g(\mathbf{x})$$
- That is, ResNet decomposes $f$ into a simple linear term and a more complex nonlinear one.
- What if we want to go beyond two terms?
- A solution was proposed by Huang et al. (2016) in the form of DenseNet.
- Huang, G., Liu, Z., Weinberger, K. Q., & van der Maaten, L. (2017). Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (Vol. 1, No. 2). https://arxiv.org/abs/1608.06993

The key difference between ResNet and DenseNet

- In DenseNet, outputs are concatenated rather than added.
- As a result we perform a mapping from $\mathbf{x}$ to its values after applying an increasingly complex sequence of functions. $$\mathbf{x} \to \left[\mathbf{x}, f_1(\mathbf{x}), f_2(\mathbf{x}, f_1(\mathbf{x})), f_3(\mathbf{x}, f_1(\mathbf{x}), f_2(\mathbf{x}, f_1(\mathbf{x}))), \ldots\right]$$
- In the end, all these functions are combined in an MLP to reduce the number of features again.
- The name DenseNet arises from the fact that...
- The dependency graph between variables becomes quite dense.
- The last layer of such a chain is densely connected to all previous layers.
- The main components that compose a DenseNet are dense blocks and transition layers.
- The dense block defines how the inputs and outputs are concatenated.
- The transition layers control the number of channels so that it does not become too large.
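Because outputs are concatenated rather than added, the channel count grows with every convolution inside a dense block. The following sketch counts channels only; `dense_block_channels` is a hypothetical helper, and the sample numbers (3 input channels, 2 convolutions, 10 output channels each) are chosen for illustration.

```python
def dense_block_channels(in_channels, num_convs, growth_rate):
    """Channel count after each concatenation inside one dense block."""
    channels = in_channels
    history = [channels]
    for _ in range(num_convs):
        # Each convolution sees all channels so far and appends its own
        # growth_rate output channels to them.
        channels += growth_rate
        history.append(channels)
    return history


history = dense_block_channels(in_channels=3, num_convs=2, growth_rate=10)
```

With addition, as in ResNet, the channel count would stay fixed; with concatenation it grows linearly, here from 3 to 13 to 23 channels, which is what makes a separate mechanism for reducing channels necessary.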

In [25]:

```
import gluonbook as gb
from mxnet import gluon, init, nd
from mxnet.gluon import nn
def conv_block(num_channels):
    blk = nn.Sequential()
    blk.add(
        nn.BatchNorm(),
        nn.Activation('relu'),
        nn.Conv2D(num_channels, kernel_size=3, padding=1)
    )
    return blk
```

In [33]:

```
class DenseBlock(nn.Block):
    def __init__(self, num_convs, num_channels, **kwargs):
        super(DenseBlock, self).__init__(**kwargs)
        self.net = nn.Sequential()
        for _ in range(num_convs):
            self.net.add(conv_block(num_channels))

    def forward(self, X):
        for blk in self.net:
            Y = blk(X)
            print("X.shape: {0} --> Y.shape: {1}".format(X.shape, Y.shape))
            # Concatenate the input and output of each block on the channel
            # dimension.
            X = nd.concat(X, Y, dim=1)
        return X
```

In [34]:

```
blk = DenseBlock(num_convs=2, num_channels=10)
blk.initialize(force_reinit=True)
X = nd.random.uniform(shape=(4, 3, 8, 8))
Y = blk(X)
Y.shape
```

Out[34]:

- Since each dense block will increase the number of channels, adding too many of them will lead to an excessively complex model.
- A transition layer is used to control the complexity of the model.
- It reduces the number of channels with a $1\times 1$ convolutional layer and halves the height and width with an average pooling layer with a stride of 2, further reducing the complexity of the model.
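The interplay between dense blocks and transition layers can be traced with simple arithmetic. This sketch assumes that every transition layer outputs exactly half of its input channels; the helper `densenet_channel_trace` and the sample configuration (64 stem channels, growth rate 32, four blocks of four convolutions) are illustrative choices.

```python
def densenet_channel_trace(stem_channels, growth_rate, convs_per_block):
    """List (layer_name, channels) for alternating dense and transition layers."""
    channels = stem_channels
    trace = [('stem', channels)]
    last = len(convs_per_block) - 1
    for i, num_convs in enumerate(convs_per_block):
        # A dense block adds growth_rate channels per convolution.
        channels += num_convs * growth_rate
        trace.append(('dense_block_%d' % (i + 1), channels))
        if i != last:
            # Assumed: the transition layer halves the channel count.
            channels //= 2
            trace.append(('transition_%d' % (i + 1), channels))
    return trace


trace = densenet_channel_trace(64, 32, [4, 4, 4, 4])
channel_counts = [channels for _, channels in trace]
```

Without the transition layers the channel count would reach $64 + 16 \cdot 32 = 576$; with halving after each of the first three blocks it stays at or below 248 in this configuration.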

In [35]:

```
def transition_block(num_channels):
    blk = nn.Sequential()
    blk.add(
        nn.BatchNorm(),
        nn.Activation('relu'),
        nn.Conv2D(num_channels, kernel_size=1),
        nn.AvgPool2D(pool_size=2, strides=2)
    )
    return blk
```

In [37]:

```
blk = transition_block(num_channels=10)
blk.initialize(force_reinit=True)
blk(Y).shape
```

Out[37]:

In [38]:

```
net = nn.Sequential()
net.add(
    nn.Conv2D(64, kernel_size=7, strides=2, padding=3),
    nn.BatchNorm(),
    nn.Activation('relu'),
    nn.MaxPool2D(pool_size=3, strides=2, padding=1)
)
```

In [39]:

```
num_channels = 64  # num_channels: the current number of channels.
growth_rate = 32
num_convs_in_dense_blocks = [4, 4, 4, 4]
for i, num_convs in enumerate(num_convs_in_dense_blocks):
    net.add(DenseBlock(num_convs=num_convs, num_channels=growth_rate))
    # This is the number of output channels of the dense block just added.
    num_channels += num_convs * growth_rate
    # A transition layer that halves the number of channels is added between
    # the dense blocks. Update num_channels so that it keeps tracking the
    # actual channel count.
    if i != len(num_convs_in_dense_blocks) - 1:
        num_channels //= 2
        net.add(transition_block(num_channels))
```

In [40]:

```
net.add(
    nn.BatchNorm(),
    nn.Activation('relu'),
    nn.GlobalAvgPool2D(),
    nn.Dense(10)
)
```

In [42]:

```
lr = 0.1
num_epochs = 5
batch_size = 256
ctx = gb.try_gpu()
net.initialize(ctx=ctx, init=init.Xavier(), force_reinit=True)
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': lr})
train_iter, test_iter = gb.load_data_fashion_mnist(batch_size, resize=96)
gb.train_ch5(net, train_iter, test_iter, batch_size, trainer, ctx, num_epochs)
```
