5.6 Convolutional Neural Networks (LeNet)

  • Problems with MLPs on the Fashion-MNIST dataset
    • The pixels of each image are flattened row by row into a vector of length 784, which is then fed into the fully connected layer.
    • Pixels that are adjacent in the same column of the image may end up far apart in this vector.
      • The patterns they form may be difficult for the model to recognize.
      • The vectorial representation ignores position entirely.
    • For large input images, a fully connected layer easily makes the model too large.
  • CNN
    • The convolutional layer retains the input shape.
      • Correlations among image pixels in both the height and width directions can be recognized effectively.
    • The convolutional layer applies the same kernel repeatedly to different positions of the input through a sliding window (see the sketch below).
      • This avoids excessively large parameter sizes.
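To make the parameter savings concrete, here is a minimal back-of-the-envelope sketch (assuming a 28 x 28 grayscale input, as in Fashion-MNIST): a fully connected layer that merely maps the 784-dimensional vector to another 784-dimensional vector already needs over 600,000 parameters, while a convolutional layer with six 5 x 5 kernels reuses the same weights at every position.

fc_params = 784 * 784 + 784          # dense layer 784 -> 784: weights + biases
conv_params = 6 * (1 * 5 * 5) + 6    # six 5 x 5 kernels over 1 input channel, + biases
print(fc_params, conv_params)        # 615440 vs. 156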

5.6.1 LeNet

  • LeNet
    • invented by Yann LeCun and coworkers at AT&T Bell Labs in the early 90s.
  • LeNet is divided into two parts

    • Convolutional layers
    • Fully connected layers
  • The basic units in the convolutional block

    • a convolutional layer
      • used to recognize the spatial patterns in the image, such as lines and the parts of objects
      • $5\times 5$ kernel window
      • sigmoid activation function
        • [NOTE] ReLU works better, but it had not yet been invented in the 90s.
      • The number of output channels
        • 6 for the first convolutional layer
        • 16 for the second convolutional layer
    • a subsequent average pooling layer
      • used to reduce the dimensionality
        • [NOTE] Max-pooling works better, but it had not yet been invented in the 90s.
      • The window shape for the two average pooling layers of the convolutional layer block is $2\times 2$
      • The stride is 2
      • the pooling layer performs downsampling (a worked shape check follows this list)
  • The height and width of the input to the second convolutional layer are smaller than those of the input to the first.

    • Therefore, increasing the number of output channels makes the parameter sizes of the two convolutional layers similar.
  • The output shape of the convolutional layer block is (batch size, channel, height, width).

  • When the output of the convolutional layer block is passed into the fully connected layer block, the fully connected layer block flattens each example in the mini-batch.
    • The input of the fully connected layer block thus becomes two-dimensional: (batch size, channel $\times$ height $\times$ width).
    • The fully connected layer block has three fully connected layers.
      • They have 120, 84, and 10 outputs, respectively.
      • Here, 10 is the number of output classes.
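As a sanity check on the shapes printed below, recall the output-shape formula for convolution and pooling (a standard identity; here $p_h$ denotes the total padding added along the height, i.e. both sides combined):

$$\left\lfloor \frac{n_h - k_h + p_h + s_h}{s_h} \right\rfloor \times \left\lfloor \frac{n_w - k_w + p_w + s_w}{s_w} \right\rfloor$$

For the first convolutional layer ($28 \times 28$ input, $5 \times 5$ kernel, padding 2 per side, stride 1) this gives $\lfloor (28 - 5 + 4 + 1)/1 \rfloor = 28$, and the subsequent $2 \times 2$ average pooling with stride 2 gives $\lfloor (28 - 2 + 0 + 2)/2 \rfloor = 14$.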
In [120]:
import gluonbook as gb
import mxnet as mx
from mxnet import autograd, gluon, init, nd
from mxnet.gluon import loss as gloss, nn
import time

net = nn.Sequential()
net.add(
    nn.Conv2D(channels=6, kernel_size=5, padding=2, activation='sigmoid'),
    nn.AvgPool2D(pool_size=2, strides=2),
    nn.Conv2D(channels=16, kernel_size=5, activation='sigmoid'),
    nn.AvgPool2D(pool_size=2, strides=2),
    # Dense automatically transforms an input of shape (batch size, channel, height, width)
    # into an input of shape (batch size, channel * height * width) by default.
    nn.Dense(120, activation='sigmoid'),
    nn.Dense(84, activation='sigmoid'),
    nn.Dense(10)
)
In [121]:
X = nd.random.uniform(shape=(1, 1, 28, 28))
net.initialize()
for layer in net:
    X = layer(X)
    if layer.name.startswith("pool"):
        print("Layer Name: {0}, Output Shape: {1}".format(
            layer.name,
            X.shape
        ))       
    else:
        print("Layer Name: {0}, Parameter Shape: Weight-{1}/Bias-{2}, Output Shape: {3}".format(
            layer.name,
            layer.weight.data().shape,
            layer.bias.data().shape,            
            X.shape
        ))
Layer Name: conv133, Parameter Shape: Weight-(6, 1, 5, 5)/Bias-(6,), Output Shape: (1, 6, 28, 28)
Layer Name: pool83, Output Shape: (1, 6, 14, 14)
Layer Name: conv134, Parameter Shape: Weight-(16, 6, 5, 5)/Bias-(16,), Output Shape: (1, 16, 10, 10)
Layer Name: pool84, Output Shape: (1, 16, 5, 5)
Layer Name: dense60, Parameter Shape: Weight-(120, 400)/Bias-(120,), Output Shape: (1, 120)
Layer Name: dense61, Parameter Shape: Weight-(84, 120)/Bias-(84,), Output Shape: (1, 84)
Layer Name: dense62, Parameter Shape: Weight-(10, 84)/Bias-(10,), Output Shape: (1, 10)

5.6.2 Data Acquisition and Training

In [122]:
batch_size = 256
train_iter, test_iter = gb.load_data_fashion_mnist(batch_size=batch_size)
In [123]:
def try_gpu4(): # This function has been saved in the gluonbook package for future use.
    try:
        ctx = mx.gpu()
        _ = nd.zeros((1,), ctx=ctx) 
    except mx.base.MXNetError:
        ctx = mx.cpu() 
    return ctx

ctx = try_gpu4()
ctx
Out[123]:
cpu(0)
In [124]:
def evaluate_accuracy(data_iter, net, ctx):
    acc = nd.array([0], ctx=ctx)
    for X, y in data_iter:
        # If ctx is the GPU, copy the data to the GPU.
        X, y = X.as_in_context(ctx), y.as_in_context(ctx)
        acc += gb.accuracy(net(X), y)
    return acc.asscalar() / len(data_iter)
In [125]:
# This function has been saved in the gluonbook package for future use.
def train_ch5(net, train_iter, test_iter, batch_size, trainer, ctx, num_epochs):
    print('Training on', ctx)

    loss = gloss.SoftmaxCrossEntropyLoss()

    for epoch in range(num_epochs):
        train_loss_sum = 0.0 
        train_acc_sum = 0.0
        start = time.time()
        for X, y in train_iter:
            X, y = X.as_in_context(ctx), y.as_in_context(ctx)
            with autograd.record():
                y_hat = net(X)
                l = loss(y_hat, y)
            l.backward()
            trainer.step(batch_size)
            train_loss_sum += l.mean().asscalar()
            train_acc_sum += gb.accuracy(y_hat, y)
        test_acc = evaluate_accuracy(test_iter, net, ctx)
        print('Epoch %d, loss %.4f, train acc %.3f, test acc %.3f, time %.1f sec' % (
            epoch + 1, 
            train_loss_sum / len(train_iter),
            train_acc_sum / len(train_iter),
            test_acc, 
            time.time() - start
        ))
In [126]:
lr = 0.5
num_epochs = 5

net.initialize(force_reinit=True, ctx=ctx, init=init.Xavier())

trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': lr})

train_ch5(net, train_iter, test_iter, batch_size, trainer, ctx, num_epochs)
Training on cpu(0)
Epoch 1, loss 2.3193, train acc 0.101, test acc 0.099, time 18.3 sec
Epoch 2, loss 2.0512, train acc 0.234, test acc 0.548, time 18.0 sec
Epoch 3, loss 1.0410, train acc 0.587, test acc 0.657, time 18.1 sec
Epoch 4, loss 0.8653, train acc 0.664, test acc 0.701, time 18.6 sec
Epoch 5, loss 0.7620, train acc 0.707, test acc 0.727, time 18.0 sec

5.7 Deep Convolutional Neural Networks (AlexNet)

  • A good story about machine learning for image data
  • Deep learning vs. Machine learning
    • Deep learning is end-to-end learning.
    • In classical machine learning, feature engineering is very important
      • SIFT, SURF, HOG, bags of visual words, and similar feature extractors

5.7.1 Learning Feature Representation

  • Deep learning researchers believed that...
    • features themselves ought to be learned.
    • features ought to be hierarchically composed.
    • By jointly training many layers of a neural network, they might come to learn hierarchical representations of data.
  • In 2012, Alex Krizhevsky, Ilya Sutskever and Geoffrey Hinton designed a new variant of a CNN, AlexNet
    • It achieved excellent performance in the ImageNet challenge.
    • It learned good feature extractors in the lower layers.
    • The figure in the original text shows these lower-level image descriptors:
      • image filters learned by the first layer of AlexNet.
    • Higher layers might build upon these representations to represent larger structures
      • like eyes, noses, blades of grass, and other such features.
    • Yet higher layers might represent whole objects
      • like people, airplanes, dogs, or frisbees.
    • And ultimately, before the classification layer, the final hidden state might be a compact representation of the image
      • It summarizes the contents in such a way that data belonging to different categories become linearly separable.
    • The hierarchical representation of the input is determined by the parameters of the multilayer model, and these parameters are all obtained through learning.
  • The visual processing system of animals (and humans) works a bit like that.
    • At its lowest level, it contains mostly edge detectors, followed by more structured features.
  • Why, in the 90s and early 2000s, algorithms based on convex optimization were the preferred way of solving problems: two ingredients were missing.
    • Missing Ingredient - Data
    • Missing Ingredient - Hardware

5.7.2 AlexNet

  • This network proved, for the first time, that features obtained by learning can transcend manually-designed features, breaking the previous paradigm in computer vision.
  • Capacity Control and Preprocessing
    • AlexNet controls the model complexity of the fully connected layers with dropout
      • LeNet only uses weight decay.
    • AlexNet added a great deal of image augmentation, such as flipping, cropping, and color changes (see the sketch below).
      • This makes the model more robust.
      • The larger effective sample size reduces overfitting.
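As an illustration of such a pipeline (a minimal sketch, not used in the training runs below; the transform choices and parameter values are assumptions), Gluon's built-in vision transforms cover flipping, cropping, and color changes:

from mxnet.gluon.data.vision import transforms

augmenter = transforms.Compose([
    transforms.RandomResizedCrop(224),            # random crop, resized to 224 x 224
    transforms.RandomFlipLeftRight(),             # random horizontal flip
    transforms.RandomColorJitter(brightness=0.4,  # random color changes
                                 contrast=0.4,
                                 saturation=0.4),
    transforms.ToTensor(),
])
# Applied per sample, e.g.: mnist_train.transform_first(augmenter)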
In [127]:
import gluonbook as gb
from mxnet import gluon, init, nd
from mxnet.gluon import data as gdata, nn
import os
import sys

net = nn.Sequential()
# Here, we use a larger 11 x 11 window to capture objects. 
# At the same time, we use a stride of 4 to greatly reduce the height and width of the output.
# Here, the number of output channels is much larger than that in LeNet.
net.add(
    nn.Conv2D(96, kernel_size=11, strides=4, activation='relu'),
    nn.MaxPool2D(pool_size=3, strides=2),
    
    # Make the convolution window smaller, 
    # set padding to 2 for consistent height and width across the input and output, 
    # and increase the number of output channels
    nn.Conv2D(256, kernel_size=5, padding=2, activation='relu'),
    nn.MaxPool2D(pool_size=3, strides=2),
    
    # Use three successive convolutional layers and a smaller convolution window. 
    # Except for the final convolutional layer, the number of output channels is further increased.
    # Pooling layers are not used to reduce the height and width of the input after the first two convolutional layers.
    nn.Conv2D(384, kernel_size=3, padding=1, activation='relu'),
    nn.Conv2D(384, kernel_size=3, padding=1, activation='relu'),
    nn.Conv2D(256, kernel_size=3, padding=1, activation='relu'),
    nn.MaxPool2D(pool_size=3, strides=2),

    # Here, the number of outputs of the fully connected layer is several times larger than that in LeNet. 
    # Use the dropout layer to mitigate overfitting.
    nn.Dense(4096, activation="relu"),
    nn.Dropout(0.5),
    nn.Dense(4096, activation="relu"),
    nn.Dropout(0.5),

    # Output layer. Since we are using Fashion-MNIST, the number of classes is 10, instead of 1000 as in the paper.
    nn.Dense(10)
)
In [128]:
X = nd.random.uniform(shape=(1, 1, 224, 224))
net.initialize()
for layer in net:
    X = layer(X)
    if layer.name.startswith("pool") or layer.name.startswith("dropout"):
        print("Layer Name: {0}, Output Shape: {1}".format(
            layer.name,
            X.shape
        ))       
    else:
        print("Layer Name: {0}, Parameter Shape: Weight-{1}/Bias-{2}, Output Shape: {3}".format(
            layer.name,
            layer.weight.data().shape,
            layer.bias.data().shape,            
            X.shape
        ))
Layer Name: conv135, Parameter Shape: Weight-(96, 1, 11, 11)/Bias-(96,), Output Shape: (1, 96, 54, 54)
Layer Name: pool85, Output Shape: (1, 96, 26, 26)
Layer Name: conv136, Parameter Shape: Weight-(256, 96, 5, 5)/Bias-(256,), Output Shape: (1, 256, 26, 26)
Layer Name: pool86, Output Shape: (1, 256, 12, 12)
Layer Name: conv137, Parameter Shape: Weight-(384, 256, 3, 3)/Bias-(384,), Output Shape: (1, 384, 12, 12)
Layer Name: conv138, Parameter Shape: Weight-(384, 384, 3, 3)/Bias-(384,), Output Shape: (1, 384, 12, 12)
Layer Name: conv139, Parameter Shape: Weight-(256, 384, 3, 3)/Bias-(256,), Output Shape: (1, 256, 12, 12)
Layer Name: pool87, Output Shape: (1, 256, 5, 5)
Layer Name: dense63, Parameter Shape: Weight-(4096, 6400)/Bias-(4096,), Output Shape: (1, 4096)
Layer Name: dropout38, Output Shape: (1, 4096)
Layer Name: dense64, Parameter Shape: Weight-(4096, 4096)/Bias-(4096,), Output Shape: (1, 4096)
Layer Name: dropout39, Output Shape: (1, 4096)
Layer Name: dense65, Parameter Shape: Weight-(10, 4096)/Bias-(10,), Output Shape: (1, 10)
  • We can upsample a Fashion-MNIST sample image to 224 × 224
In [129]:
# This function has been saved in the gluonbook package for future use.
def load_data_fashion_mnist(batch_size, resize=None, root=os.path.join('~', '.mxnet', 'datasets', 'fashion-mnist')):
    root = os.path.expanduser(root)  # Expand the user path '~'.
    transformer = []
    if resize:
        transformer += [gdata.vision.transforms.Resize(resize)]
    transformer += [gdata.vision.transforms.ToTensor()]
    transformer = gdata.vision.transforms.Compose(transformer)
    
    mnist_train = gdata.vision.FashionMNIST(root=root, train=True)
    mnist_test = gdata.vision.FashionMNIST(root=root, train=False)
    
    num_workers = 0 if sys.platform.startswith('win32') else 4
    
    train_iter = gdata.DataLoader(
        mnist_train.transform_first(transformer), batch_size, shuffle=True,
        num_workers=num_workers
    )
    
    test_iter = gdata.DataLoader(
        mnist_test.transform_first(transformer), batch_size, shuffle=False,
        num_workers=num_workers
    )
    return train_iter, test_iter
In [130]:
batch_size = 128
train_iter, test_iter = load_data_fashion_mnist(batch_size, resize=224)
In [ ]:
lr = 0.01
num_epochs = 5
ctx = gb.try_gpu()

net.initialize(ctx=ctx, init=init.Xavier(), force_reinit=True)

trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': lr})

gb.train_ch5(net, train_iter, test_iter, batch_size, trainer, ctx, num_epochs)
  • A small AlexNet, scaled down to train quickly on 28 × 28 Fashion-MNIST inputs
In [131]:
import gluonbook as gb
from mxnet import gluon, init, nd
from mxnet.gluon import data as gdata, nn
import os
import sys

net = nn.Sequential()
# Since the Fashion-MNIST input is only 28 x 28, this scaled-down variant uses a small 2 x 2 window
# with a stride of 1 instead of the 11 x 11 window and stride of 4 used above.
# The number of output channels is still larger than in LeNet, but smaller than in the full AlexNet.
net.add(
    nn.Conv2D(24, kernel_size=2, strides=1, activation='relu'),
    nn.MaxPool2D(pool_size=3, strides=2),
    
    # Keep the 2 x 2 convolution window, 
    # set padding to 1, 
    # and increase the number of output channels
    nn.Conv2D(64, kernel_size=2, padding=1, activation='relu'),
    nn.MaxPool2D(pool_size=3, strides=2),
    
    # Use three successive convolutional layers with the same 2 x 2 convolution window. 
    # Except for the final convolutional layer, the number of output channels is further increased.
    # Pooling layers are not used to reduce the height and width of the input after the first two convolutional layers.
    nn.Conv2D(96, kernel_size=2, padding=1, activation='relu'),
    nn.Conv2D(96, kernel_size=2, padding=1, activation='relu'),
    nn.Conv2D(64, kernel_size=2, padding=1, activation='relu'),
    nn.MaxPool2D(pool_size=3, strides=2),

    # Here, the number of outputs of the fully connected layer is several times larger than that in LeNet. 
    # Use the dropout layer to mitigate overfitting.
    nn.Dense(1024, activation="relu"),
    nn.Dropout(0.5),
    nn.Dense(1024, activation="relu"),
    nn.Dropout(0.5),

    # Output layer. Since we are using Fashion-MNIST, the number of classes is 10, instead of 1000 as in the paper.
    nn.Dense(10)
)
In [132]:
X = nd.random.uniform(shape=(1, 1, 28, 28))
net.initialize()
for layer in net:
    X = layer(X)
    if layer.name.startswith("pool") or layer.name.startswith("dropout"):
        print("Layer Name: {0}, Output Shape: {1}".format(
            layer.name,
            X.shape
        ))       
    else:
        print("Layer Name: {0}, Parameter Shape: Weight-{1}/Bias-{2}, Output Shape: {3}".format(
            layer.name,
            layer.weight.data().shape,
            layer.bias.data().shape,            
            X.shape
        ))
Layer Name: conv140, Parameter Shape: Weight-(24, 1, 2, 2)/Bias-(24,), Output Shape: (1, 24, 27, 27)
Layer Name: pool88, Output Shape: (1, 24, 13, 13)
Layer Name: conv141, Parameter Shape: Weight-(64, 24, 2, 2)/Bias-(64,), Output Shape: (1, 64, 14, 14)
Layer Name: pool89, Output Shape: (1, 64, 6, 6)
Layer Name: conv142, Parameter Shape: Weight-(96, 64, 2, 2)/Bias-(96,), Output Shape: (1, 96, 7, 7)
Layer Name: conv143, Parameter Shape: Weight-(96, 96, 2, 2)/Bias-(96,), Output Shape: (1, 96, 8, 8)
Layer Name: conv144, Parameter Shape: Weight-(64, 96, 2, 2)/Bias-(64,), Output Shape: (1, 64, 9, 9)
Layer Name: pool90, Output Shape: (1, 64, 4, 4)
Layer Name: dense66, Parameter Shape: Weight-(1024, 1024)/Bias-(1024,), Output Shape: (1, 1024)
Layer Name: dropout40, Output Shape: (1, 1024)
Layer Name: dense67, Parameter Shape: Weight-(1024, 1024)/Bias-(1024,), Output Shape: (1, 1024)
Layer Name: dropout41, Output Shape: (1, 1024)
Layer Name: dense68, Parameter Shape: Weight-(10, 1024)/Bias-(10,), Output Shape: (1, 10)
In [133]:
batch_size = 256
train_iter, test_iter = gb.load_data_fashion_mnist(batch_size=batch_size)
In [134]:
lr = 0.05
num_epochs = 5
ctx = gb.try_gpu()

net.initialize(force_reinit=True, ctx=ctx, init=init.Xavier())

trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': lr})

gb.train_ch5(net, train_iter, test_iter, batch_size, trainer, ctx, num_epochs)
training on cpu(0)
epoch 1, loss 1.6177, train acc 0.379, test acc 0.640, time 96.9 sec
epoch 2, loss 0.8953, train acc 0.654, test acc 0.740, time 95.2 sec
epoch 3, loss 0.7166, train acc 0.725, test acc 0.770, time 95.0 sec
epoch 4, loss 0.6218, train acc 0.762, test acc 0.794, time 95.2 sec
epoch 5, loss 0.5587, train acc 0.785, test acc 0.824, time 94.9 sec

5.8 Networks Using Blocks (VGG)

  • Progress in this field mirrors that in chip design where engineers went from placing transistors (neurons) to logical elements (layers) to logic blocks (the topic of the current section).
  • The idea of using blocks was first proposed by the Visual Geometry Group (VGG) at Oxford University.
  • When using a modern deep learning framework, repeated structures can be expressed as code with for loops and subroutines.

5.8.1 VGG Blocks

  • The basic building block of a ConvNet is the combination of a convolutional layer (with padding to keep the resolution unchanged) followed by a nonlinearity such as a ReLU.
  • A VGG block is a sequence of such layers, followed by maximum pooling.
  • In 2014, Simonyan and Zisserman used...
    • convolution windows of size 3
    • maximum pooling with window width 2
    • a stride of 2
    • effectively halving the resolution after each block.
In [135]:
import gluonbook as gb
from mxnet import gluon, init, nd
from mxnet.gluon import nn

def vgg_block(num_convs, num_channels):
    blk = nn.Sequential()
    for _ in range(num_convs):
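        # Note: the VGG paper (and the text above) uses 3 x 3 windows; this notebook uses kernel_size=2.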
        blk.add(nn.Conv2D(num_channels, kernel_size=2, padding=1, activation='relu'))
    blk.add(nn.MaxPool2D(pool_size=2, strides=2))
    return blk

5.8.2 VGG Network

  • Several vgg_block modules are connected in series to form the convolutional part of the network; its hyperparameters are defined by the variable conv_arch.
    • This variable specifies the number of convolutional layers and the number of output channels in each VGG block.
In [136]:
conv_arch = ((1, 64), (1, 128), (2, 256), (2, 256), (2, 256))
  • Since this network uses 8 convolutional layers and 3 fully connected layers, it is often called VGG-11.
In [137]:
def vgg(conv_arch):
    net = nn.Sequential()
    # The convolutional layer part.
    for (num_convs, num_channels) in conv_arch:
        net.add(vgg_block(num_convs, num_channels))
        
    # The fully connected layer part.
    net.add(
        nn.Dense(1024, activation='relu'),
        nn.Dropout(0.5),
        nn.Dense(1024, activation='relu'),
        nn.Dropout(0.5),
        nn.Dense(10)
    )
    return net

net = vgg(conv_arch)
  • By halving the height and width while doubling the number of channels, VGG keeps the activation size and computational cost of most convolutional layers roughly the same (see the sketch below).
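A back-of-the-envelope sketch of this invariance (the channel and resolution values here are illustrative, assuming 3 x 3 kernels): the cost of a convolution scales as k*k * c_in * c_out * h * w, so halving h and w while doubling both channel counts leaves it unchanged.

def conv_flops(c_in, c_out, h, w, k=3):
    # Multiply-accumulate count of a k x k convolution, ignoring biases.
    return k * k * c_in * c_out * h * w

print(conv_flops(64, 64, 112, 112))   # 462422016
print(conv_flops(128, 128, 56, 56))   # 462422016: same cost one block later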

5.8.3 Model Training

In [138]:
ratio = 6
small_conv_arch = [(pair[0], pair[1] // ratio) for pair in conv_arch]
net = vgg(small_conv_arch)
In [139]:
X = nd.random.uniform(shape=(1, 1, 28, 28))
net.initialize()
for blk in net:
    X = blk(X)
    print("blk Name: {0}, Output Shape: {1}".format(
        blk.name,
        X.shape
    ))
blk Name: sequential91, Output Shape: (1, 10, 14, 14)
blk Name: sequential92, Output Shape: (1, 21, 7, 7)
blk Name: sequential93, Output Shape: (1, 42, 4, 4)
blk Name: sequential94, Output Shape: (1, 42, 3, 3)
blk Name: sequential95, Output Shape: (1, 42, 2, 2)
blk Name: dense72, Output Shape: (1, 1024)
blk Name: dropout44, Output Shape: (1, 1024)
blk Name: dense73, Output Shape: (1, 1024)
blk Name: dropout45, Output Shape: (1, 1024)
blk Name: dense74, Output Shape: (1, 10)
In [140]:
batch_size = 256
train_iter, test_iter = gb.load_data_fashion_mnist(batch_size=batch_size)
In [141]:
lr = 0.05
num_epochs = 5
batch_size = 256

ctx = gb.try_gpu()

net.initialize(ctx=ctx, init=init.Xavier(), force_reinit=True)

trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': lr})

train_iter, test_iter = gb.load_data_fashion_mnist(batch_size)

gb.train_ch5(net, train_iter, test_iter, batch_size, trainer, ctx, num_epochs)
training on cpu(0)
epoch 1, loss 2.2859, train acc 0.223, test acc 0.399, time 51.5 sec
epoch 2, loss 1.3229, train acc 0.482, test acc 0.638, time 51.3 sec
epoch 3, loss 0.8787, train acc 0.649, test acc 0.707, time 51.5 sec
epoch 4, loss 0.7683, train acc 0.708, test acc 0.758, time 51.6 sec
epoch 5, loss 0.6797, train acc 0.743, test acc 0.785, time 51.5 sec
  • VGG-11 constructs a network using reusable convolutional blocks.
  • Different VGG models can be defined by varying the number of convolutional layers and output channels in each block.
  • The use of blocks leads to very compact definitions of the network.
  • It allows for efficient design of complex networks.
  • Simonyan and Zisserman experimented with various architectures.
  • In particular, they found that several layers of deep and narrow convolutions (i.e., 3 × 3) were more effective than fewer layers of wider convolutions (see the sketch below).
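One way to see why: two stacked 3 x 3 convolutions cover the same 5 x 5 receptive field as a single 5 x 5 convolution, yet use fewer parameters and add an extra nonlinearity in between (a minimal sketch; the channel count is illustrative):

c = 128
two_3x3 = 2 * (3 * 3 * c * c)   # two stacked 3 x 3 conv layers, ignoring biases
one_5x5 = 5 * 5 * c * c         # one 5 x 5 conv layer, ignoring biases
print(two_3x3, one_5x5)         # 294912 vs. 409600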