CSC321 Tutorial 9: Transfer Learning

In this tutorial, we'll go through an example of transfer learning to classify American Sign Language (ASL) hand gestures for the letters A-I. We could train a CNN from scratch, but you will see that starting from pretrained AlexNet weights gives much better results.

American Sign Language (ASL)

American Sign Language (ASL) is a complete, complex language that employs signs made by moving the hands combined with facial expressions and postures of the body. It is the primary language of many North Americans who are deaf and is one of several communication options used by people who are deaf or hard-of-hearing.

The hand gestures representing the letters of the English alphabet are shown below. This tutorial focuses on classifying a subset of these hand gesture images using convolutional neural networks. Specifically, given an image of a hand signing one of the letters A-I, we want to detect which letter is being represented.

<img src="https://qualityansweringservice.com/wp-content/uploads/2010/01/images_abc1280x960.png" width="400px" />

In [ ]:
import matplotlib
import numpy as np
import matplotlib.pyplot as plt

import torch
import torch.nn as nn
import torch.optim as optim
import torchvision.models, torchvision.datasets

%matplotlib inline

Question 1. Data

Download the data set to your Google Drive by going to the link https://drive.google.com/drive/folders/1aPL24P610NHLvt9exk6-B7SzGk3R8Q48?usp=sharing and selecting "Add to My Drive".

This is faster than downloading the data from the course website.

Then, mount Google Drive from your Google Colab notebook:

In [ ]:
from google.colab import drive
drive.mount('/content/gdrive')

The file structure we use is intentional, so that we can use torchvision.datasets.ImageFolder to help load our data and create labels.

In [ ]:
train_path = "/content/gdrive/My Drive/CSC321/asl_data/train/" # edit me
valid_path = "/content/gdrive/My Drive/CSC321/asl_data/valid/" # edit me
test_path = "/content/gdrive/My Drive/CSC321/asl_data/test/"   # edit me

train_data = torchvision.datasets.ImageFolder(train_path, transform=torchvision.transforms.ToTensor())
valid_data = torchvision.datasets.ImageFolder(valid_path, transform=torchvision.transforms.ToTensor())
test_data = torchvision.datasets.ImageFolder(test_path, transform=torchvision.transforms.ToTensor())

Part (a)

Read up on what torchvision.datasets.ImageFolder does for us here https://pytorch.org/docs/stable/torchvision/datasets.html#imagefolder
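
As a quick sketch of the labeling convention ImageFolder uses: it assigns integer labels by sorting the class subfolder names. The folder names below are an assumption about how the ASL data is laid out:

```python
# ImageFolder expects one subfolder per class; a hypothetical layout
# for this dataset (folder names are an assumption) would be:
#   train/A/hand1.jpg  train/A/hand2.jpg  ...  train/I/handN.jpg
# Labels are the indices of the subfolder names in sorted order:
classes = sorted(["A", "B", "C", "D", "E", "F", "G", "H", "I"])
class_to_idx = {c: i for i, c in enumerate(classes)}
print(class_to_idx["A"], class_to_idx["I"])  # 0 8
```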

Part (b)

We can iterate through the training data one point at a time like this:

In [ ]:
for x, y in train_data:
    print(x, y)
    break # comment this out to see every data point

What do the variables x and y contain? What is the shape of our images? What are our labels? Based on what you learned in Part (a), how were the labels generated from the folder structure?

Part (c)

We saw in the earlier tutorials that PyTorch has a utility to help us create minibatches from our data. We can use the same DataLoader helper here:

In [ ]:
train_loader = torch.utils.data.DataLoader(train_data, batch_size=10, shuffle=True)

for x, y in train_loader:
    print(x, y)
    break # comment this out to see every minibatch

What do the variables x and y contain? What are their shapes? What data do they contain?

Part (d)

How many images are there in the training, validation, and test sets?

Question 2. CNN

For this part, you'll be working with this network.

In [ ]:
class CNN(nn.Module):
    def __init__(self):
        super(CNN, self).__init__()
        self.conv1 = nn.Conv2d(in_channels=3,
                               out_channels=4,
                               kernel_size=3,
                               padding=1)
        self.bn1 = nn.BatchNorm2d(4)
        self.conv2 = nn.Conv2d(in_channels=4,
                               out_channels=8,
                               kernel_size=3,
                               padding=1)
        self.bn2 = nn.BatchNorm2d(8)
        self.conv3 = nn.Conv2d(in_channels=8,
                               out_channels=16,
                               kernel_size=3,
                               padding=1)
        self.bn3 = nn.BatchNorm2d(16)
        self.conv4 = nn.Conv2d(in_channels=16,
                               out_channels=32,
                               kernel_size=3,
                               padding=1)
        self.bn4 = nn.BatchNorm2d(32)
        self.pool = nn.MaxPool2d(2, 2)
        self.fc1 = nn.Linear(32 * 14 * 14, 100)
        self.fc2 = nn.Linear(100, 9)
    def forward(self, x):
        x = self.pool(torch.relu(self.conv1(x)))
        x = self.bn1(x)
        x = self.pool(torch.relu(self.conv2(x)))
        x = self.bn2(x)
        x = self.pool(torch.relu(self.conv3(x)))
        x = self.bn3(x)
        x = self.pool(torch.relu(self.conv4(x)))
        x = self.bn4(x)
        x = x.view(-1, 32 * 14 * 14)
        x = torch.relu(self.fc1(x))
        return self.fc2(x)

Part (a)

We are using batch normalization because it makes our model significantly faster to train. What do you think is the difference between BatchNorm2d and BatchNorm1d? Why do we use BatchNorm2d in this network?
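
To make the distinction concrete, here is a NumPy sketch of the statistics each variant normalizes over (omitting the learned scale and shift parameters that the real PyTorch layers also have):

```python
import numpy as np

def batchnorm2d(x, eps=1e-5):
    # x has shape (N, C, H, W): one mean/variance per channel,
    # computed over the batch AND both spatial dimensions
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def batchnorm1d(x, eps=1e-5):
    # x has shape (N, C): one mean/variance per feature,
    # computed over the batch dimension only
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

x = np.random.randn(8, 4, 14, 14)   # a batch of 4-channel feature maps
y = batchnorm2d(x)
print(np.allclose(y.mean(axis=(0, 2, 3)), 0, atol=1e-6))  # True
```

Our convolution outputs are 4-dimensional (N, C, H, W) tensors, which is why the network above uses BatchNorm2d.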

Part (b)

The training code is written for you. Train the CNN() model for at least 6 epochs, and report on the maximum validation accuracy that you can attain.

As your model is training, you might want to move on to the next question.

In [ ]:
def get_accuracy(model, data):
    loader = torch.utils.data.DataLoader(data, batch_size=256)

    model.eval() # put the model in evaluation mode (e.g. use running batch-norm statistics)
    correct = 0
    total = 0
    for imgs, labels in loader:
        output = model(imgs) # We don't need to run torch.softmax
        pred = output.max(1, keepdim=True)[1] # get the index of the max log-probability
        correct += pred.eq(labels.view_as(pred)).sum().item()
        total += imgs.shape[0]
    return correct / total

def train(model, train_data, valid_data, batch_size=32, weight_decay=0.0,
          learning_rate=0.001, num_epochs=7):
    # training data
    train_loader = torch.utils.data.DataLoader(train_data,
                                               batch_size=batch_size,
                                               shuffle=True)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(),
                           lr=learning_rate,
                           weight_decay=weight_decay)
    iters, losses, train_acc, val_acc = [], [], [], []

    # training
    n = 0 # the number of iterations (for plotting)
    for epoch in range(num_epochs):
        for imgs, labels in iter(train_loader):
            if imgs.size()[0] < batch_size:
                continue

            model.train()
            out = model(imgs)
            loss = criterion(out, labels)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            n += 1

        # save the current training information
        loss = float(loss) # CrossEntropyLoss already averages over the batch
        tacc = get_accuracy(model, train_data)
        vacc = get_accuracy(model, valid_data)
        print("Iter %d; Loss %f; Train Acc %.3f; Val Acc %.3f" % (n, loss, tacc, vacc))

        iters.append(n)
        losses.append(loss)
        train_acc.append(tacc)
        val_acc.append(vacc)

    # plotting
    plt.title("Learning Curve")
    plt.plot(iters, losses, label="Train")
    plt.xlabel("Iterations")
    plt.ylabel("Loss")
    plt.show()

    plt.title("Learning Curve")
    plt.plot(iters, train_acc, label="Train")
    plt.plot(iters, val_acc, label="Validation")
    plt.xlabel("Iterations")
    plt.ylabel("Training Accuracy")
    plt.legend(loc='best')
    plt.show()

    print("Final Training Accuracy: {}".format(train_acc[-1]))
    print("Final Validation Accuracy: {}".format(val_acc[-1]))

In [ ]:
# training code
# cnn = CNN()
# train(cnn, train_data, valid_data, batch_size=32)

Question 3. Transfer Learning

For many image classification tasks, it is generally not a good idea to train a very large deep neural network model from scratch, due to the enormous compute requirements and the lack of sufficient training data.

One of the better options is to use an existing model that performs a similar task to the one you need to solve. This method of reusing a pre-trained network for a related task is broadly termed Transfer Learning. In this tutorial, we will use transfer learning to extract features from the hand gesture images, and then train a smaller network that takes these features as input and classifies the hand gestures.

As you learned in the CNN lecture, convolution layers extract various features from the images, which the fully connected layers then use for classification. The AlexNet architecture played a pivotal role in establishing deep neural nets as a go-to tool for image classification problems, and we will use an ImageNet pre-trained AlexNet model to extract features in this tutorial.

Part (a)

Here is the code to load the AlexNet network, with pretrained weights. When you first run the code, PyTorch will download the pretrained weights from the internet.

In [ ]:
import torchvision.models
alexnet = torchvision.models.alexnet(pretrained=True)

The alexnet model is split up into two components: alexnet.features and alexnet.classifier. The first component, alexnet.features, is used to compute convolutional features, which are taken as input by alexnet.classifier.

The neural network alexnet.features expects an image tensor of shape Nx3x224x224 as input and will output a tensor of shape Nx256x6x6 (N = batch size).
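
As a sanity check on that shape, the 224 → 6 spatial reduction can be reproduced with the standard convolution/pooling output-size formula; the layer hyperparameters below follow the torchvision AlexNet feature extractor:

```python
def out_size(n, k, s=1, p=0):
    """Output size of a conv/pool layer: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1

n = 224
n = out_size(n, 11, 4, 2)  # Conv2d(3, 64, kernel_size=11, stride=4, padding=2) -> 55
n = out_size(n, 3, 2)      # MaxPool2d(kernel_size=3, stride=2)                 -> 27
n = out_size(n, 5, 1, 2)   # Conv2d(64, 192, kernel_size=5, padding=2)          -> 27
n = out_size(n, 3, 2)      # MaxPool2d(3, 2)                                    -> 13
n = out_size(n, 3, 1, 1)   # Conv2d(192, 384, kernel_size=3, padding=1)         -> 13
n = out_size(n, 3, 1, 1)   # Conv2d(384, 256, kernel_size=3, padding=1)         -> 13
n = out_size(n, 3, 1, 1)   # Conv2d(256, 256, kernel_size=3, padding=1)         -> 13
n = out_size(n, 3, 2)      # MaxPool2d(3, 2)                                    -> 6
print(n, 256 * n * n)  # 6 9216
```

So each image yields 256 * 6 * 6 = 9216 features once flattened, which is the input size your classifier will need.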

Here is an example code snippet showing how you can compute the AlexNet features for some images (your actual code might be different):

In [ ]:
img, label = train_data[0]
features = alexnet.features(img.unsqueeze(0)).detach()

Note that the .detach() at the end will be necessary in your code. The reason is that PyTorch automatically builds computation graphs to be able to backpropagate gradients. If we did not explicitly "detach" this tensor from the AlexNet portion of the computation graph, PyTorch might try to backpropagate gradients to the AlexNet weights and tune them.

Compute the AlexNet features for each of your training, validation, and test data. In other words, create three new lists called train_data_features, valid_data_features and test_data_features. Each of these lists should contain tuples of the form (alexnet_features, label).

In [ ]:
train_data_features = []
for img, y in train_data:
    features = alexnet.features(img.unsqueeze(0)).detach() # compute the AlexNet features for this image
    train_data_features.append((features, y))

Part (b)

Create a multi-layer perceptron that takes these AlexNet features as input, and makes a prediction. Your model should be a subclass of nn.Module.

In [ ]:
class MLP(nn.Module):
    def __init__(self):
        super(MLP, self).__init__()
        # ... todo ...

    def forward(self, x):
        x = x.view(-1, 256 * 6 * 6)
        # ... todo ...
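
One possible shape for such a model is sketched below; the layer sizes (and the class name SmallMLP) are illustrative choices under the 256x6x6 input assumption, not the required answer:

```python
import torch
import torch.nn as nn

class SmallMLP(nn.Module):
    def __init__(self):
        super(SmallMLP, self).__init__()
        # 256 * 6 * 6 = 9216 flattened AlexNet features in, 9 letter classes out
        self.fc1 = nn.Linear(256 * 6 * 6, 100)
        self.fc2 = nn.Linear(100, 9)

    def forward(self, x):
        x = x.view(-1, 256 * 6 * 6)  # flatten the AlexNet feature maps
        x = torch.relu(self.fc1(x))
        return self.fc2(x)           # logits, one per letter A-I

out = SmallMLP()(torch.randn(2, 256, 6, 6))
print(out.shape)  # torch.Size([2, 9])
```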

Part (c)

Train the MLP() model for at least 6 epochs, and report on the maximum validation accuracy that you can attain.

This model should train much faster, and attain much better accuracy much faster!

In [ ]:
mlp = MLP()
train(mlp, train_data_features, valid_data_features)

Question 4

Report the test accuracy of your two models.

Question 5

Some conceptual questions to think about, and ask your TA.

Part (a)

Transfer learning worked very well in this example. However, transfer learning (using AlexNet) did not work well for the problem of classifying left shoe vs right shoe. Why do you think this is?

Part (b)

We decided to use AlexNet features as input to our MLP, and avoided tuning the AlexNet weights. However, we could have considered AlexNet to be a part of our model, and continued to tune the AlexNet weights to improve our model's performance. What are the advantages and disadvantages of continuing to tune the AlexNet weights?

Part (c)

Suppose we would like to create an autoencoder to be able to generate ASL images. Recall that the autoencoder has an encoder that reduces the dimensionality of our data, and a decoder that tries to reconstruct the data.

Do you think we could use the AlexNet features network as the encoder to an autoencoder? Why or why not?

Assume that we would not be tuning the AlexNet encoder weights---i.e. we would only train the decoder weights to be able to reconstruct the images.