# 5.9 Network in Network (NiN)

• LeNet, AlexNet, and VGG all share a common design pattern:
  • extract spatial features through a sequence of convolutional and pooling layers,
  • then post-process the representations via fully connected layers.
• Using a dense layer too early would destroy the spatial structure of the data entirely, since a fully connected layer mixes all of its inputs together.
• NiN offers an alternative: it applies a small fully connected network at every pixel location, implemented with $1\times 1$ convolutions.

## 5.9.1 NiN Blocks

• The inputs and outputs of convolutional layers are usually four-dimensional arrays:
  • (example, channel, height, width)
• The inputs and outputs of fully connected layers are usually two-dimensional arrays:
  • (example, feature)
• Once data has passed through a fully connected layer, it is virtually impossible to recover the spatial structure of the representation.
• Instead, we can apply a fully connected layer at the pixel level.
  • Recall the $1\times 1$ convolutional layer.
  • It can be viewed as a fully connected layer that processes the channel activations of each pixel independently.
  • Equivalently, think of
    • each element in the spatial dimensions (height and width) as an example,
    • each channel as a feature.
• NiN block
  • It uses $1\times 1$ convolutional layers instead of a fully connected layer.
  • The spatial information is therefore passed on naturally to the subsequent layers.
  • It consists of one convolutional layer followed by multiple (here two) $1\times 1$ convolutional layers. Such a block can be used within the convolutional stack to allow for more per-pixel nonlinearity.
In [33]:
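import gluonbook as gb
from mxnet import gluon, init, nd
from mxnet.gluon import nn

# Assumed signature: the block takes the hyperparameters of its first convolution.
def nin_block(num_channels, kernel_size, strides, padding):
    blk = nn.Sequential()
    blk.add(
        # One ordinary convolution, followed by two 1 x 1 convolutions that act
        # as a per-pixel fully connected network.
        nn.Conv2D(num_channels, kernel_size, strides, padding, activation='relu'),
        nn.Conv2D(num_channels, kernel_size=1, activation='relu'),
        nn.Conv2D(num_channels, kernel_size=1, activation='relu')
    )
    return blk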
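
The per-pixel view of the $1\times 1$ convolution can be checked by hand. The following is a minimal pure-Python sketch, not Gluon code: `conv1x1` and `dense_per_pixel` are hypothetical helper names, the tensors are plain nested lists without a batch dimension, and bias terms are left out for brevity.

```python
def conv1x1(x, weight):
    """Apply a 1x1 convolution (no bias).

    x:      nested list with shape [c_in][height][width]
    weight: nested list with shape [c_out][c_in]
    """
    c_in, height, width = len(x), len(x[0]), len(x[0][0])
    return [[[sum(weight[o][c] * x[c][i][j] for c in range(c_in))
              for j in range(width)]
             for i in range(height)]
            for o in range(len(weight))]


def dense_per_pixel(x, weight):
    """Same computation, phrased as a dense layer applied to every pixel."""
    c_in, height, width = len(x), len(x[0]), len(x[0][0])
    # Each pixel becomes one "example" with c_in "features".
    examples = [[x[c][i][j] for c in range(c_in)]
                for i in range(height) for j in range(width)]
    # Ordinary dense layer: (height * width, c_in) -> (height * width, c_out).
    outputs = [[sum(w * f for w, f in zip(row, example)) for row in weight]
               for example in examples]
    # Fold the examples back into the [c_out][height][width] layout.
    return [[[outputs[i * width + j][o] for j in range(width)]
             for i in range(height)]
            for o in range(len(weight))]


x = [[[1, 2], [3, 4]],      # input channel 0
     [[5, 6], [7, 8]],      # input channel 1
     [[9, 10], [11, 12]]]   # input channel 2
weight = [[1, 0, -1],       # output channel 0
          [2, 1, 0]]        # output channel 1

conv_out = conv1x1(x, weight)
dense_out = dense_per_pixel(x, weight)
print(conv_out)  # [[[-8, -8], [-8, -8]], [[7, 10], [13, 16]]]
```

Both functions return the same nested list, which illustrates why a stack of $1\times 1$ convolutions behaves like a small fully connected network applied at every pixel.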


## 5.9.2 NiN Model

• NiN uses convolutional layers with convolution window shapes of 11 × 11, 5 × 5, and 3 × 3, and the corresponding numbers of output channels (96, 256, and 384) are the same as in AlexNet.
• Each NiN block is followed by a maximum pooling layer with a stride of 2 and a window shape of 3 × 3.
• The last NiN block has a number of output channels equal to the number of label classes, and then uses a global average pooling layer to average all elements in each channel for direct use in classification.
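• A global average pooling layer is an average pooling layer whose window shape equals the spatial shape (height and width) of its input.
• Replacing the fully connected layers in this way significantly reduces the number of model parameters, which mitigates overfitting.
• In other words, all learned operations are convolutions.
• However, this design sometimes increases the model training time.
• The Gluon model defined in this section is a scaled-down variant for $28\times 28$ inputs, with blocks of 24, 64, and 96 output channels.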
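
Global average pooling itself involves no learned parameters. As a rough illustration (a pure-Python sketch with a hypothetical helper `global_avg_pool`, operating on a single example stored as nested lists), each channel is reduced to the mean of its elements:

```python
def global_avg_pool(x):
    """Average every channel of x over its full height and width.

    x: nested list with shape [channels][height][width]
    Returns one value per channel.
    """
    pooled = []
    for channel in x:
        total = sum(sum(row) for row in channel)
        count = len(channel) * len(channel[0])
        pooled.append(total / count)
    return pooled


# Three channels, standing in for three label classes, on a 2 x 2 feature map.
feature_map = [[[1, 2], [3, 4]],
               [[5, 6], [7, 8]],
               [[9, 10], [11, 12]]]
scores = global_avg_pool(feature_map)
print(scores)  # [2.5, 6.5, 10.5]
```

When the last block outputs one channel per class, the pooled vector can be fed to the loss directly, without any dense layer in between.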
In [48]:
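net = nn.Sequential()
net.add(
    # Assumed nin_block arguments (output channels, kernel size, stride, padding):
    # one choice that reproduces the output shapes printed below.
    nin_block(24, kernel_size=3, strides=1, padding=0),
    nn.MaxPool2D(pool_size=2, strides=1),
    nin_block(64, kernel_size=3, strides=2, padding=2),
    nn.MaxPool2D(pool_size=2, strides=1),
    nin_block(96, kernel_size=3, strides=2, padding=0),
    nn.MaxPool2D(pool_size=3, strides=2),
    nn.Dropout(0.5),

    # There are 10 label classes.
    nin_block(10, kernel_size=3, strides=1, padding=1),

    # The global average pooling layer automatically sets the window shape to the height and width of the input.
    nn.GlobalAvgPool2D(),

    # Transform the four-dimensional output into two-dimensional output with a shape of (batch size, 10).
    nn.Flatten()
)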

In [49]:
X = nd.random.uniform(shape=(1, 1, 28, 28))
net.initialize(force_reinit=True)
for layer in net:
    X = layer(X)
    print("Layer.name {0:12s} - Output Shape - {1}".format(layer.name, X.shape))

Layer.name sequential71 - Output Shape - (1, 24, 26, 26)
Layer.name pool56       - Output Shape - (1, 24, 25, 25)
Layer.name sequential72 - Output Shape - (1, 64, 14, 14)
Layer.name pool57       - Output Shape - (1, 64, 13, 13)
Layer.name sequential73 - Output Shape - (1, 96, 6, 6)
Layer.name pool58       - Output Shape - (1, 96, 2, 2)
Layer.name dropout14    - Output Shape - (1, 96, 2, 2)
Layer.name sequential74 - Output Shape - (1, 10, 2, 2)
Layer.name pool59       - Output Shape - (1, 10, 1, 1)
Layer.name flatten14    - Output Shape - (1, 10)
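
The pooling rows in this trace follow from the usual output-size rule $\lfloor (n + 2p - k)/s \rfloor + 1$ for input size $n$, window $k$, stride $s$, and padding $p$. A small sketch, where `out_size` is a hypothetical helper and the pooling settings are the ones in the model definition (no padding):

```python
def out_size(n, kernel, stride=1, padding=0):
    """Height (or width) after a convolution or pooling layer."""
    return (n + 2 * padding - kernel) // stride + 1


# (input size, pool window, stride) for the three max pooling layers.
pools = [(26, 2, 1), (14, 2, 1), (6, 3, 2)]
pooled = [out_size(n, kernel, stride) for n, kernel, stride in pools]
print(pooled)  # [25, 13, 2]
```

The same rule applies to the convolutions inside each block, given their kernel size, stride, and padding.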


## 5.9.3 Data Acquisition and Training

• NiN removes the fully connected layers and replaces them with global average pooling (i.e. averaging over all spatial locations) after reducing the number of channels to the desired number of outputs.
• Removing the dense layers reduces overfitting. NiN has dramatically fewer parameters.
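
To get a feel for the savings, one can count parameters for two hypothetical classifier heads on a $96\times 2\times 2$ feature map with 10 classes. Everything here is an assumption made for illustration: the 4096-unit hidden layers imitate an AlexNet-style dense head, and the NiN-style head is taken to be a $3\times 3$ convolution followed by two $1\times 1$ convolutions. `dense_params` and `conv_params` are hypothetical helpers.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def conv_params(c_in, c_out, kernel):
    """Weights plus biases of a square convolution."""
    return c_in * c_out * kernel * kernel + c_out


channels, height, width, classes = 96, 2, 2, 10

# Hypothetical dense head: flatten -> 4096 -> 4096 -> classes.
dense_head = (dense_params(channels * height * width, 4096)
              + dense_params(4096, 4096)
              + dense_params(4096, classes))

# Hypothetical NiN-style head: 3 x 3 convolution, two 1 x 1 convolutions,
# then global average pooling, which has no parameters.
nin_head = (conv_params(channels, classes, 3)
            + 2 * conv_params(classes, classes, 1))

print(dense_head, nin_head)  # 18399242 8870
```

Under these assumptions the convolutional head is smaller by more than three orders of magnitude, which is the effect the bullets above describe.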
In [50]:
lr = 0.05
num_epochs = 5
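batch_size = 256

# Assumed data loader: Fashion-MNIST at its native 28 x 28 resolution,
# which matches the input shape used in the shape check above.
train_iter, test_iter = gb.load_data_fashion_mnist(batch_size)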

ctx = gb.try_gpu()

net.initialize(force_reinit=True, ctx=ctx, init=init.Xavier())

trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': lr})

gb.train_ch5(net, train_iter, test_iter, batch_size, trainer, ctx, num_epochs)

training on cpu(0)
epoch 1, loss 2.3018, train acc 0.126, test acc 0.240, time 158.0 sec
epoch 2, loss 2.2965, train acc 0.199, test acc 0.236, time 155.7 sec
epoch 3, loss 2.2469, train acc 0.214, test acc 0.235, time 159.2 sec
epoch 4, loss 1.9712, train acc 0.315, test acc 0.460, time 154.5 sec
epoch 5, loss 1.6260, train acc 0.446, test acc 0.498, time 158.6 sec


# 5.10 Networks with Parallel Concatenations (GoogLeNet)

• Szegedy et al., "Going Deeper with Convolutions," 2014 - https://arxiv.org/abs/1409.4842
• GoogLeNet gives a pragmatic answer to the question of which convolution window size is ideal:
  • $1\times 1$, $3\times 3$, $5\times 5$, or even larger?
  • It isn’t always clear which one is best.
  • As it turns out, a combination of all of them works best.
• Over the next few years, researchers made several improvements to GoogLeNet.

## 5.10.1 Inception Blocks

• There are four parallel paths in the Inception block.
• The first three paths
  • use convolutional layers with window sizes of $1\times 1$, $3\times 3$, and $5\times 5$
  • to extract information at different spatial scales.
• The middle two paths
  • first perform a $1\times 1$ convolution on the input to reduce the number of input channels,
  • which reduces the model's complexity.
• The fourth path
  • uses a $3\times 3$ maximum pooling layer,
  • followed by a $1\times 1$ convolutional layer to change the number of channels.
• All four paths use appropriate padding so that the output has the same height and width as the input.
• Finally, the outputs of the paths are concatenated along the channel dimension and passed to the next layer.
• The customizable parameters of the Inception block
  • are the numbers of output channels per layer, which control the model complexity.
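
The saving from the $1\times 1$ reduction on the middle paths can be estimated by counting weights. The sketch below is illustrative only: `conv_weights` is a hypothetical helper, biases are ignored, and the channel numbers (192 inputs, a reduction to 16 channels, 32 outputs from the $5\times 5$ convolution) are assumed for the example.

```python
def conv_weights(c_in, c_out, kernel):
    """Number of weights in a square convolution, ignoring biases."""
    return c_in * c_out * kernel * kernel


c_in, c_reduced, c_out = 192, 16, 32

# A 5 x 5 convolution applied directly to all input channels.
direct = conv_weights(c_in, c_out, 5)

# A 1 x 1 convolution that first shrinks the channels, then the 5 x 5 one.
with_reduction = conv_weights(c_in, c_reduced, 1) + conv_weights(c_reduced, c_out, 5)

print(direct, with_reduction)  # 153600 15872
```

With these numbers the reduced path needs roughly a tenth of the weights of the direct $5\times 5$ convolution.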
In [53]:
import gluonbook as gb
from mxnet import gluon, init, nd
from mxnet.gluon import nn

class Inception(nn.Block):
    # c1 - c4 are the number of output channels for each layer in the path.
    def __init__(self, c1, c2, c3, c4, **kwargs):
        super(Inception, self).__init__(**kwargs)

        # Path 1 is a single 1 x 1 convolutional layer.
        self.p1_1 = nn.Conv2D(c1, kernel_size=1, activation='relu')

        # Path 2 is a 1 x 1 convolutional layer followed by a 3 x 3 convolutional layer.
        self.p2_1 = nn.Conv2D(c2[0], kernel_size=1, activation='relu')
        self.p2_2 = nn.Conv2D(c2[1], kernel_size=3, padding=1, activation='relu')

        # Path 3 is a 1 x 1 convolutional layer followed by a 5 x 5 convolutional layer.
        self.p3_1 = nn.Conv2D(c3[0], kernel_size=1, activation='relu')
        self.p3_2 = nn.Conv2D(c3[1], kernel_size=5, padding=2, activation='relu')

        # Path 4 is a 3 x 3 maximum pooling layer followed by a 1 x 1 convolutional layer.
        # Stride 1 and padding 1 keep the height and width unchanged.
        self.p4_1 = nn.MaxPool2D(pool_size=3, strides=1, padding=1)
        self.p4_2 = nn.Conv2D(c4, kernel_size=1, activation='relu')

    def forward(self, x):
        p1 = self.p1_1(x)
        p2 = self.p2_2(self.p2_1(x))
        p3 = self.p3_2(self.p3_1(x))
        p4 = self.p4_2(self.p4_1(x))
        # Concatenate the outputs on the channel dimension.
        return nd.concat(p1, p2, p3, p4, dim=1)
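
Two properties of this block can be checked with simple arithmetic: the padding choices preserve height and width, and the number of output channels is the sum over the four paths. In the sketch below, `out_size` and `inception_out_channels` are hypothetical helpers, not part of Gluon, and the $3\times 3$ pooling path is assumed to use stride 1 and padding 1.

```python
def out_size(n, kernel, stride=1, padding=0):
    """Height (or width) after a convolution or pooling layer."""
    return (n + 2 * padding - kernel) // stride + 1


def inception_out_channels(c1, c2, c3, c4):
    """Channels after concatenation: only the last layer of each path counts."""
    return c1 + c2[1] + c3[1] + c4


# Kernel size -> padding, as used by the 1 x 1, 3 x 3, and 5 x 5 layers.
padding_for_kernel = {1: 0, 3: 1, 5: 2}
sizes = [out_size(12, kernel, stride=1, padding=padding)
         for kernel, padding in padding_for_kernel.items()]
print(sizes)  # [12, 12, 12]

block_channels = inception_out_channels(64, (96, 128), (16, 32), 32)
print(block_channels)  # 256
```

Because every path returns the same height and width, the concatenation along the channel dimension is well defined.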

• To understand why this works as well as it does, consider the combination of filters: they explore the image at different ranges.
• Details at different scales can therefore be recognized efficiently by different filters.

## 5.10.2 GoogLeNet Model

• GoogLeNet uses a stack of 9 Inception blocks in total, followed by global average pooling, to generate its estimates.
• Maximum pooling between Inception blocks reduces the spatial dimensionality.
• The first part resembles AlexNet and LeNet.
• The stack of blocks is inherited from VGG.
• Global average pooling avoids a stack of fully connected layers at the end.
In [54]:
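b1 = nn.Sequential()
b1.add(
    # Assumed stem (standard GoogLeNet layout), consistent with the
    # (1, 64, 24, 24) output shape printed below.
    nn.Conv2D(64, kernel_size=7, strides=2, padding=3, activation='relu'),
    nn.MaxPool2D(pool_size=3, strides=2, padding=1)
)

In [55]:
b2 = nn.Sequential()
b2.add(
    nn.Conv2D(64, kernel_size=1),
    # Assumed: 3 x 3 convolution to 192 channels, then stride-2 max pooling.
    nn.Conv2D(192, kernel_size=3, padding=1),
    nn.MaxPool2D(pool_size=3, strides=2, padding=1)
)

In [56]:
b3 = nn.Sequential()
b3.add(
    Inception(64, (96, 128), (16, 32), 32),
    Inception(128, (128, 192), (32, 96), 64),
    # Assumed: stride-2 max pooling halves the height and width.
    nn.MaxPool2D(pool_size=3, strides=2, padding=1)
)

In [57]:
b4 = nn.Sequential()
b4.add(
    Inception(192, (96, 208), (16, 48), 64),
    Inception(160, (112, 224), (24, 64), 64),
    Inception(128, (128, 256), (24, 64), 64),
    Inception(112, (144, 288), (32, 64), 64),
    Inception(256, (160, 320), (32, 128), 128),
    # Assumed: stride-2 max pooling halves the height and width.
    nn.MaxPool2D(pool_size=3, strides=2, padding=1)
)

In [58]:
b5 = nn.Sequential()
b5.add(
    Inception(256, (160, 320), (32, 128), 128),
    Inception(384, (192, 384), (48, 128), 128),
    nn.GlobalAvgPool2D()
)

net = nn.Sequential()
net.add(b1, b2, b3, b4, b5, nn.Dense(10))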

In [59]:
X = nd.random.uniform(shape=(1, 1, 96, 96))
net.initialize()
for layer in net:
    X = layer(X)
    print(layer.name, 'output shape:\t', X.shape)

sequential77 output shape:	 (1, 64, 24, 24)
sequential78 output shape:	 (1, 192, 12, 12)
sequential79 output shape:	 (1, 480, 6, 6)
sequential80 output shape:	 (1, 832, 3, 3)
sequential81 output shape:	 (1, 1024, 1, 1)
dense0 output shape:	 (1, 10)
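
The channel counts in this trace can be reproduced from the Inception arguments, since stages b3 to b5 each end with an Inception block whose output channels are the sum of its four paths. The spatial sizes follow if every stage before the last halves the height and width; the sketch assumes a stride-2 stem convolution (kernel 7, padding 3) and stride-2 max pooling (window 3, padding 1), which is the standard GoogLeNet layout rather than something spelled out in these notes. `out_size` and `inception_out_channels` are hypothetical helpers.

```python
def out_size(n, kernel, stride=1, padding=0):
    """Height (or width) after a convolution or pooling layer."""
    return (n + 2 * padding - kernel) // stride + 1


def inception_out_channels(c1, c2, c3, c4):
    """Channels after concatenating the four paths."""
    return c1 + c2[1] + c3[1] + c4


# Arguments of the last Inception block in stages b3, b4, and b5.
last_blocks = [
    (128, (128, 192), (32, 96), 64),
    (256, (160, 320), (32, 128), 128),
    (384, (192, 384), (48, 128), 128),
]
channels = [inception_out_channels(*args) for args in last_blocks]
print(channels)  # [480, 832, 1024]

# Spatial size after the assumed stem convolution, then after the assumed
# stride-2 pooling that closes b1, b2, b3, and b4.
size = out_size(96, kernel=7, stride=2, padding=3)
sizes = []
for _ in range(4):
    size = out_size(size, kernel=3, stride=2, padding=1)
    sizes.append(size)
print(sizes)  # [24, 12, 6, 3]
```

Global average pooling in b5 then collapses the remaining $3\times 3$ map to $1\times 1$, and the final dense layer maps the 1024 channels to the 10 classes.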


## 5.10.3 Data Acquisition and Training

In [60]:
lr = 0.1
num_epochs = 5
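batch_size = 256

# Assumed data loader: Fashion-MNIST resized to 96 x 96,
# which matches the input shape used in the shape check above.
train_iter, test_iter = gb.load_data_fashion_mnist(batch_size, resize=96)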

ctx = gb.try_gpu()

net.initialize(force_reinit=True, ctx=ctx, init=init.Xavier())

trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': lr})

gb.train_ch5(net, train_iter, test_iter, batch_size, trainer, ctx, num_epochs)

training on cpu(0)
epoch 1, loss 2.1636, train acc 0.207, test acc 0.280, time 526.9 sec
epoch 2, loss 1.1490, train acc 0.519, test acc 0.687, time 536.9 sec
epoch 3, loss 0.8065, train acc 0.693, test acc 0.771, time 485.1 sec
epoch 4, loss 0.6022, train acc 0.768, test acc 0.811, time 474.9 sec
epoch 5, loss 0.5084, train acc 0.805, test acc 0.831, time 477.2 sec

• The Inception block is equivalent to a subnetwork with four paths.
• It extracts information in parallel through convolutional layers of different window shapes and maximum pooling layers.
• $1 \times 1$ convolutions reduce channel dimensionality on a per-pixel level.
• Max-pooling reduces the resolution.
• GoogLeNet connects multiple well-designed Inception blocks with other layers in series.
• The ratio of the number of channels assigned in the Inception block is obtained through a large number of experiments on the ImageNet data set.
• GoogLeNet, as well as its succeeding versions, was one of the most efficient models on ImageNet, providing similar test accuracy with lower computational complexity.