Transfer learning / fine-tuning

This tutorial will guide you through the process of using transfer learning to learn an accurate image classifier from a relatively small number of training samples. Generally speaking, transfer learning refers to the process of leveraging the knowledge learned in one model for the training of another model.

More specifically, the process involves taking an existing neural network which was previously trained to good performance on a larger dataset, and using it as the basis for a new model which leverages that previous network's accuracy for a new task. This method has become popular in recent years as a way to improve the performance of a neural net trained on a small dataset; the intuition is that the new dataset may be too small to train to good performance by itself, but neural nets trained on image data tend to learn similar features anyway, especially at the early layers, where the features are more generic (edge detectors, blobs, and so on).

Transfer learning has been largely enabled by the open-sourcing of state-of-the-art models; for the top performing models in image classification tasks (like from ILSVRC), it is common practice now to not only publish the architecture, but to release the trained weights of the model as well. This lets amateurs use these top image classifiers to boost the performance of their own task-specific models.

Feature extraction vs. fine-tuning

At one extreme, transfer learning can involve taking the pre-trained network, freezing its weights, and using one of its hidden layers (usually the last one) as a feature extractor, feeding those features into a smaller neural net.

At the other extreme, we start with the pre-trained network, but don't freeze its weights, allowing them to be updated along with the new network. This procedure is called "fine-tuning" because we are slightly adjusting the pre-trained net's weights to the new task. We usually train such a network with a lower learning rate, since we expect the features are already relatively good and do not need to be changed too much.

Sometimes, we do something in-between. Freeze just the early/generic layers, but fine-tune the later layers. Which strategy is best depends on the size of your dataset, the number of classes, and how much it resembles the dataset the previous model was trained on (and thus, whether it can benefit from the same learned feature extractors). A more detailed discussion of how to strategize can be found in [1] [2].

Procedure

In this guide, we will go through the process of loading a state-of-the-art, 1000-class image classifier, VGG16, one of the top performers in the 2014 ImageNet challenge, and using it as a fixed feature extractor to train a smaller custom classifier on our own images. With very few code changes, you can try fine-tuning as well.

We will first load VGG16 and remove its final layer, the 1000-class softmax classification layer specific to ImageNet, and replace it with a new classification layer for the classes we are training over. We will then freeze all the weights in the network except the new ones connecting to the new classification layer, and then train the new classification layer over our new dataset.

We will also compare this method to training a small neural network from scratch on the new dataset, and as we shall see, it will dramatically improve our accuracy. We will do that part first.

As our test subject, we'll use a dataset consisting of around 6000 images belonging to 97 classes, and train an image classifier with around 80% accuracy on it. It's worth noting that this strategy scales well to image sets where you may have only a couple hundred images or fewer. As usual, performance will drop as the number of samples shrinks (and depends on which classes are involved), but it remains impressive considering those constraints.

Implementation details

This guide requires keras; install it if you have not done so already. It is highly recommended that you train these models on a GPU, as training will otherwise take much longer (though it is still possible on a CPU). If you have a GPU and are using Theano as your backend, you should run the following command before importing keras, to ensure it uses the GPU.

os.environ["THEANO_FLAGS"] = "mode=FAST_RUN,device=gpu,floatX=float32"

If you are using Tensorflow as the backend, this is unnecessary.
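
If you want to double-check which devices TensorFlow can see, a quick optional check (assuming a TensorFlow 1.x installation) is to list the local devices; a GPU should appear alongside the CPU:

# optional: list the devices TensorFlow can see (assumes TensorFlow 1.x)
from tensorflow.python.client import device_lib
print([d.name for d in device_lib.list_local_devices()])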

Note that this guide uses quite deep networks, much larger than the ones we trained in the convnets guide. If your system does not have enough memory, you may run into out-of-memory errors when running this guide. A version of this guide using smaller networks and smaller images is forthcoming -- in the meantime, you can try changing the architecture to suit your memory constraints.

To start, make sure the following import statements all work.

In [1]:
%matplotlib inline

import os

#if using Theano with GPU
#os.environ["THEANO_FLAGS"] = "mode=FAST_RUN,device=gpu,floatX=float32"

import random
import numpy as np
import keras

import matplotlib.pyplot as plt
from matplotlib.pyplot import imshow

from keras.preprocessing import image
from keras.applications.imagenet_utils import preprocess_input
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten, Activation
from keras.layers import Conv2D, MaxPooling2D
from keras.models import Model
Using Theano backend.

Getting a dataset

The first step is going to be to load our data. As our example, we will be using the dataset CalTech-101, which contains around 9000 labeled images belonging to 101 object categories. However, we will exclude the 5 categories which have the most images. This keeps the class distribution fairly balanced (around 50-100 images per class) and constrained to a smaller total number of images, around 6000.

To obtain this dataset, you can either run the download script download.sh in the data folder, or the following commands:

wget http://www.vision.caltech.edu/Image_Datasets/Caltech101/101_ObjectCategories.tar.gz
tar -xvzf 101_ObjectCategories.tar.gz

If you wish to use your own dataset, it should be arranged in the same fashion as 101_ObjectCategories, with all of the images organized into subfolders, one for each class. In that case, the following cell should load your custom dataset correctly if you just replace root with your folder. If you have an alternate structure, you just need to make sure that you produce the list data, in which every element is a dict where x is the data (a numpy array of shape (224, 224, 3)) and y is the label (an integer). Use the helper function get_image(path) to load the image correctly into the array, and note also that the images are being resized to 224x224. This is necessary because the input to VGG16 is a 224x224 RGB image. You do not need to resize them on your hard drive, as that is being done in the code below.

If you have 101_ObjectCategories in your data folder, the following cell should load all the data.

In [2]:
root = '../data/101_ObjectCategories'
exclude = ['BACKGROUND_Google', 'Motorbikes', 'airplanes', 'Faces_easy', 'Faces']
train_split, val_split = 0.7, 0.15

categories = [x[0] for x in os.walk(root) if x[0]][1:]
categories = [c for c in categories 
              if c not in [os.path.join(root, e) for e in exclude]]

# helper function to load image and return it and input vector
def get_image(path):
    img = image.load_img(path, target_size=(224, 224))
    x = image.img_to_array(img)
    x = np.expand_dims(x, axis=0)
    x = preprocess_input(x)
    return img, x

# load all the images from root folder
data = []
for c, category in enumerate(categories):
    images = [os.path.join(dp, f) for dp, dn, filenames 
              in os.walk(category) for f in filenames 
              if os.path.splitext(f)[1].lower() in ['.jpg','.png','.jpeg']]
    for img_path in images:
        img, x = get_image(img_path)
        data.append({'x':np.array(x[0]), 'y':c})

# count the number of classes
num_classes = len(categories)

# randomize the data order
random.shuffle(data)

# create training / validation / test split (70%, 15%, 15%)
idx_val = int(train_split * len(data))
idx_test = int((train_split + val_split) * len(data))
train = data[:idx_val]
val = data[idx_val:idx_test]
test = data[idx_test:]

# separate data for labels
x_train, y_train = np.array([t["x"] for t in train]), [t["y"] for t in train]
x_val, y_val = np.array([t["x"] for t in val]), [t["y"] for t in val]
x_test, y_test = np.array([t["x"] for t in test]), [t["y"] for t in test]

# normalize data
x_train = x_train.astype('float32') / 255.
x_val = x_val.astype('float32') / 255.
x_test = x_test.astype('float32') / 255.

# convert labels to one-hot vectors
y_train = keras.utils.to_categorical(y_train, num_classes)
y_val = keras.utils.to_categorical(y_val, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

# summary
print("finished loading %d images from %d categories"%(len(data), num_classes))
print("train / validation / test split: %d, %d, %d"%(len(x_train), len(x_val), len(x_test)))
print("training data shape: ", x_train.shape)
print("training labels shape: ", y_train.shape)
finished loading 6209 images from 97 categories
train / validation / test split: 4346, 931, 932
('training data shape: ', (4346, 224, 224, 3))
('training labels shape: ', (4346, 97))

If everything worked properly, you should have loaded a bunch of images, and split them into three sets: train, val, and test. The shape of the training data should be (n, 224, 224, 3), where n is the size of your training set, and the labels should be (n, c), where c is the number of classes (97 in the case of 101_ObjectCategories).

Notice that we divided all the data into three subsets -- a training set train, a validation set val, and a test set test. The reason for this is to properly evaluate the accuracy of our classifier. Only the training set is used to compute gradients and update the weights; the validation set is used during training to monitor how well the model generalizes to data it is not being fit to (for example, to tune hyperparameters or decide when to stop). The test set is always held out from the training process entirely, and is only used at the end to evaluate the final accuracy of our model.

Let's quickly look at a few sample images from our dataset.

In [59]:
images = [os.path.join(dp, f) for dp, dn, filenames in os.walk(root) for f in filenames if os.path.splitext(f)[1].lower() in ['.jpg','.png','.jpeg']]
idx = [int(len(images) * random.random()) for i in range(8)]
imgs = [image.load_img(images[i], target_size=(224, 224)) for i in idx]
concat_image = np.concatenate([np.asarray(img) for img in imgs], axis=1)
plt.figure(figsize=(16,4))
plt.imshow(concat_image)
Out[59]:
<matplotlib.image.AxesImage at 0x1f36ed790>

First training a neural net from scratch

Before doing the transfer learning, let's first build a neural network from scratch for doing classification on our dataset. This will give us a baseline to compare to our transfer-learned network later.

The network we will construct contains 4 convolutional layers, each followed by a max-pooling layer, with a dropout layer after every other conv/pool pair. After the last pooling layer and dropout, we will attach a fully-connected layer with 256 neurons, another dropout layer, then finally a softmax classification layer for our classes.

Our loss function will be, as usual, categorical cross-entropy, and our optimizer will be AdaDelta. Various things about this network could be changed to get better performance (perhaps a larger network or a different optimizer would help), but for the purposes of this notebook, the goal is just to get an approximate baseline for comparison's sake, so it isn't necessary to spend much time trying to optimize this network.

Upon compiling the network, let's run model.summary() to get a snapshot of its layers.

In [63]:
# build the network
model = Sequential()

model.add(Conv2D(32, (3, 3), input_shape=x_train.shape[1:]))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Conv2D(32, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Dropout(0.25))

model.add(Conv2D(32, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Conv2D(32, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Dropout(0.25))

model.add(Flatten())
model.add(Dense(256))
model.add(Activation('relu'))

model.add(Dropout(0.5))

model.add(Dense(num_classes))
model.add(Activation('softmax'))

# compile the model to use categorical cross-entropy loss function and adadelta optimizer
model.compile(loss='categorical_crossentropy',
              optimizer='adadelta',
              metrics=['accuracy'])

model.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv2d_9 (Conv2D)            (None, 222, 222, 32)      896       
_________________________________________________________________
activation_13 (Activation)   (None, 222, 222, 32)      0         
_________________________________________________________________
max_pooling2d_9 (MaxPooling2 (None, 111, 111, 32)      0         
_________________________________________________________________
conv2d_10 (Conv2D)           (None, 109, 109, 32)      9248      
_________________________________________________________________
activation_14 (Activation)   (None, 109, 109, 32)      0         
_________________________________________________________________
max_pooling2d_10 (MaxPooling (None, 54, 54, 32)        0         
_________________________________________________________________
dropout_7 (Dropout)          (None, 54, 54, 32)        0         
_________________________________________________________________
conv2d_11 (Conv2D)           (None, 52, 52, 32)        9248      
_________________________________________________________________
activation_15 (Activation)   (None, 52, 52, 32)        0         
_________________________________________________________________
max_pooling2d_11 (MaxPooling (None, 26, 26, 32)        0         
_________________________________________________________________
conv2d_12 (Conv2D)           (None, 24, 24, 32)        9248      
_________________________________________________________________
activation_16 (Activation)   (None, 24, 24, 32)        0         
_________________________________________________________________
max_pooling2d_12 (MaxPooling (None, 12, 12, 32)        0         
_________________________________________________________________
dropout_8 (Dropout)          (None, 12, 12, 32)        0         
_________________________________________________________________
flatten_3 (Flatten)          (None, 4608)              0         
_________________________________________________________________
dense_13 (Dense)             (None, 256)               1179904   
_________________________________________________________________
activation_17 (Activation)   (None, 256)               0         
_________________________________________________________________
dropout_9 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_14 (Dense)             (None, 97)                24929     
_________________________________________________________________
activation_18 (Activation)   (None, 97)                0         
=================================================================
Total params: 1,233,473.0
Trainable params: 1,233,473.0
Non-trainable params: 0.0
_________________________________________________________________

We've created a medium-sized network with ~1.2 million parameters (weights and biases). Most of them lead into the pre-softmax fully-connected layer (dense_13 in the summary above).
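
As a quick sanity check on that count, the parameters of that dense layer can be reproduced by hand: it connects the 4608-dimensional flattened output of the last pooling layer to 256 neurons, plus one bias per neuron.

# weights + biases going into the 256-neuron fully-connected layer
print(4608 * 256 + 256)   # 1179904, matching dense_13 in the summary above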

We can now go ahead and train our model for 100 epochs with a batch size of 128. We'll also record its history so we can plot the loss over time later.

In [64]:
history = model.fit(x_train, y_train,
                    batch_size=128,
                    epochs=100,
                    validation_data=(x_val, y_val))
Train on 4346 samples, validate on 931 samples
Epoch 1/100
4346/4346 [==============================] - 38s - loss: 4.5407 - acc: 0.0308 - val_loss: 4.4815 - val_acc: 0.0602
Epoch 2/100
4346/4346 [==============================] - 38s - loss: 4.4150 - acc: 0.0686 - val_loss: 4.3017 - val_acc: 0.0945
Epoch 3/100
4346/4346 [==============================] - 38s - loss: 4.2222 - acc: 0.1086 - val_loss: 4.0468 - val_acc: 0.1557
Epoch 4/100
4346/4346 [==============================] - 38s - loss: 3.9835 - acc: 0.1484 - val_loss: 3.7900 - val_acc: 0.1976
Epoch 5/100
4346/4346 [==============================] - 38s - loss: 3.7314 - acc: 0.1855 - val_loss: 3.6591 - val_acc: 0.2578
Epoch 6/100
4346/4346 [==============================] - 38s - loss: 3.5076 - acc: 0.2209 - val_loss: 3.4835 - val_acc: 0.2460
Epoch 7/100
4346/4346 [==============================] - 38s - loss: 3.2847 - acc: 0.2630 - val_loss: 3.2689 - val_acc: 0.2997
Epoch 8/100
4346/4346 [==============================] - 38s - loss: 3.0013 - acc: 0.3118 - val_loss: 3.1930 - val_acc: 0.3244
Epoch 9/100
4346/4346 [==============================] - 38s - loss: 2.7695 - acc: 0.3583 - val_loss: 2.9774 - val_acc: 0.3416
Epoch 10/100
4346/4346 [==============================] - 38s - loss: 2.5365 - acc: 0.3983 - val_loss: 2.8194 - val_acc: 0.3695
Epoch 11/100
4346/4346 [==============================] - 38s - loss: 2.3103 - acc: 0.4459 - val_loss: 2.7380 - val_acc: 0.3813
Epoch 12/100
4346/4346 [==============================] - 38s - loss: 2.1475 - acc: 0.4795 - val_loss: 2.6909 - val_acc: 0.3953
Epoch 13/100
4346/4346 [==============================] - 38s - loss: 1.8565 - acc: 0.5377 - val_loss: 2.7421 - val_acc: 0.3921
Epoch 14/100
4346/4346 [==============================] - 38s - loss: 1.6818 - acc: 0.5658 - val_loss: 2.5981 - val_acc: 0.4393
Epoch 15/100
4346/4346 [==============================] - 38s - loss: 1.5412 - acc: 0.6045 - val_loss: 2.5592 - val_acc: 0.4382
Epoch 16/100
4346/4346 [==============================] - 38s - loss: 1.3512 - acc: 0.6401 - val_loss: 2.6747 - val_acc: 0.4339
Epoch 17/100
4346/4346 [==============================] - 38s - loss: 1.2114 - acc: 0.6723 - val_loss: 2.6503 - val_acc: 0.4318
Epoch 18/100
4346/4346 [==============================] - 38s - loss: 1.0976 - acc: 0.7018 - val_loss: 2.6322 - val_acc: 0.4404
Epoch 19/100
4346/4346 [==============================] - 38s - loss: 0.9964 - acc: 0.7260 - val_loss: 2.6771 - val_acc: 0.4479
Epoch 20/100
4346/4346 [==============================] - 38s - loss: 0.9172 - acc: 0.7483 - val_loss: 2.6750 - val_acc: 0.4468
Epoch 21/100
4346/4346 [==============================] - 38s - loss: 0.8427 - acc: 0.7655 - val_loss: 2.6560 - val_acc: 0.4726
Epoch 22/100
4346/4346 [==============================] - 38s - loss: 0.7785 - acc: 0.7754 - val_loss: 2.7327 - val_acc: 0.4576
Epoch 23/100
4346/4346 [==============================] - 38s - loss: 0.7053 - acc: 0.7977 - val_loss: 2.8198 - val_acc: 0.4533
Epoch 24/100
4346/4346 [==============================] - 38s - loss: 0.6925 - acc: 0.7994 - val_loss: 2.7577 - val_acc: 0.4597
Epoch 25/100
4346/4346 [==============================] - 38s - loss: 0.6092 - acc: 0.8182 - val_loss: 2.8431 - val_acc: 0.4468
Epoch 26/100
4346/4346 [==============================] - 38s - loss: 0.5661 - acc: 0.8378 - val_loss: 2.9029 - val_acc: 0.4576
Epoch 27/100
4346/4346 [==============================] - 38s - loss: 0.5162 - acc: 0.8532 - val_loss: 2.9416 - val_acc: 0.4576
Epoch 28/100
4346/4346 [==============================] - 38s - loss: 0.5112 - acc: 0.8458 - val_loss: 2.9678 - val_acc: 0.4586
Epoch 29/100
4346/4346 [==============================] - 38s - loss: 0.4802 - acc: 0.8555 - val_loss: 2.9655 - val_acc: 0.4694
Epoch 30/100
4346/4346 [==============================] - 38s - loss: 0.4321 - acc: 0.8714 - val_loss: 2.9573 - val_acc: 0.4683
Epoch 31/100
4346/4346 [==============================] - 38s - loss: 0.4379 - acc: 0.8711 - val_loss: 3.0498 - val_acc: 0.4705
Epoch 32/100
4346/4346 [==============================] - 38s - loss: 0.3906 - acc: 0.8840 - val_loss: 3.0571 - val_acc: 0.4629
Epoch 33/100
4346/4346 [==============================] - 38s - loss: 0.3787 - acc: 0.8898 - val_loss: 3.0878 - val_acc: 0.4565
Epoch 34/100
4346/4346 [==============================] - 38s - loss: 0.3744 - acc: 0.8863 - val_loss: 3.0943 - val_acc: 0.4748
Epoch 35/100
4346/4346 [==============================] - 38s - loss: 0.3785 - acc: 0.8896 - val_loss: 3.0990 - val_acc: 0.4726
Epoch 36/100
4346/4346 [==============================] - 38s - loss: 0.3520 - acc: 0.8951 - val_loss: 3.1261 - val_acc: 0.4608
Epoch 37/100
4346/4346 [==============================] - 38s - loss: 0.3447 - acc: 0.8990 - val_loss: 3.0691 - val_acc: 0.4629
Epoch 38/100
4346/4346 [==============================] - 38s - loss: 0.3292 - acc: 0.8992 - val_loss: 3.1341 - val_acc: 0.4672
Epoch 39/100
4346/4346 [==============================] - 38s - loss: 0.2945 - acc: 0.9144 - val_loss: 3.1678 - val_acc: 0.4651
Epoch 40/100
4346/4346 [==============================] - 38s - loss: 0.2834 - acc: 0.9096 - val_loss: 3.2061 - val_acc: 0.4662
Epoch 41/100
4346/4346 [==============================] - 38s - loss: 0.2829 - acc: 0.9160 - val_loss: 3.2115 - val_acc: 0.4769
Epoch 42/100
4346/4346 [==============================] - 38s - loss: 0.2608 - acc: 0.9241 - val_loss: 3.2818 - val_acc: 0.4705
Epoch 43/100
4346/4346 [==============================] - 38s - loss: 0.2643 - acc: 0.9213 - val_loss: 3.2121 - val_acc: 0.4726
Epoch 44/100
4346/4346 [==============================] - 38s - loss: 0.2493 - acc: 0.9225 - val_loss: 3.2950 - val_acc: 0.4651
Epoch 45/100
4346/4346 [==============================] - 38s - loss: 0.2611 - acc: 0.9236 - val_loss: 3.3133 - val_acc: 0.4715
Epoch 46/100
4346/4346 [==============================] - 38s - loss: 0.2246 - acc: 0.9301 - val_loss: 3.2911 - val_acc: 0.4694
Epoch 47/100
4346/4346 [==============================] - 38s - loss: 0.2414 - acc: 0.9268 - val_loss: 3.2376 - val_acc: 0.4769
Epoch 48/100
4346/4346 [==============================] - 38s - loss: 0.2282 - acc: 0.9340 - val_loss: 3.2864 - val_acc: 0.4780
Epoch 49/100
4346/4346 [==============================] - 38s - loss: 0.2262 - acc: 0.9296 - val_loss: 3.3845 - val_acc: 0.4758
Epoch 50/100
4346/4346 [==============================] - 38s - loss: 0.2009 - acc: 0.9406 - val_loss: 3.3347 - val_acc: 0.4769
Epoch 51/100
4346/4346 [==============================] - 38s - loss: 0.1928 - acc: 0.9409 - val_loss: 3.3904 - val_acc: 0.4855
Epoch 52/100
4346/4346 [==============================] - 38s - loss: 0.1862 - acc: 0.9443 - val_loss: 3.4426 - val_acc: 0.4812
Epoch 53/100
4346/4346 [==============================] - 38s - loss: 0.1973 - acc: 0.9402 - val_loss: 3.4805 - val_acc: 0.4705
Epoch 54/100
4346/4346 [==============================] - 38s - loss: 0.2090 - acc: 0.9390 - val_loss: 3.3708 - val_acc: 0.4780
Epoch 55/100
4346/4346 [==============================] - 38s - loss: 0.1750 - acc: 0.9443 - val_loss: 3.4272 - val_acc: 0.4694
Epoch 56/100
4346/4346 [==============================] - 38s - loss: 0.2037 - acc: 0.9409 - val_loss: 3.4955 - val_acc: 0.4823
Epoch 57/100
4346/4346 [==============================] - 38s - loss: 0.1705 - acc: 0.9501 - val_loss: 3.5468 - val_acc: 0.4919
Epoch 58/100
4346/4346 [==============================] - 38s - loss: 0.1724 - acc: 0.9494 - val_loss: 3.4323 - val_acc: 0.4672
Epoch 59/100
4346/4346 [==============================] - 38s - loss: 0.1681 - acc: 0.9503 - val_loss: 3.5111 - val_acc: 0.4866
Epoch 60/100
4346/4346 [==============================] - 38s - loss: 0.1963 - acc: 0.9399 - val_loss: 3.4187 - val_acc: 0.4791
Epoch 61/100
4346/4346 [==============================] - 38s - loss: 0.1667 - acc: 0.9514 - val_loss: 3.5126 - val_acc: 0.4823
Epoch 62/100
4346/4346 [==============================] - 38s - loss: 0.1443 - acc: 0.9597 - val_loss: 3.5175 - val_acc: 0.4866
Epoch 63/100
4346/4346 [==============================] - 38s - loss: 0.1520 - acc: 0.9549 - val_loss: 3.4818 - val_acc: 0.4780
Epoch 64/100
4346/4346 [==============================] - 38s - loss: 0.1390 - acc: 0.9584 - val_loss: 3.6060 - val_acc: 0.4715
Epoch 65/100
4346/4346 [==============================] - 38s - loss: 0.1446 - acc: 0.9521 - val_loss: 3.4628 - val_acc: 0.4758
Epoch 66/100
4346/4346 [==============================] - 38s - loss: 0.1314 - acc: 0.9586 - val_loss: 3.5667 - val_acc: 0.4737
Epoch 67/100
4346/4346 [==============================] - 38s - loss: 0.1355 - acc: 0.9572 - val_loss: 3.5765 - val_acc: 0.4769
Epoch 68/100
4346/4346 [==============================] - 38s - loss: 0.1346 - acc: 0.9579 - val_loss: 3.5701 - val_acc: 0.4715
Epoch 69/100
4346/4346 [==============================] - 38s - loss: 0.1451 - acc: 0.9544 - val_loss: 3.5311 - val_acc: 0.4737
Epoch 70/100
4346/4346 [==============================] - 38s - loss: 0.1497 - acc: 0.9572 - val_loss: 3.6496 - val_acc: 0.4715
Epoch 71/100
4346/4346 [==============================] - 38s - loss: 0.1320 - acc: 0.9611 - val_loss: 3.5732 - val_acc: 0.4694
Epoch 72/100
4346/4346 [==============================] - 38s - loss: 0.1138 - acc: 0.9664 - val_loss: 3.6330 - val_acc: 0.4737
Epoch 73/100
4346/4346 [==============================] - 38s - loss: 0.1384 - acc: 0.9588 - val_loss: 3.6614 - val_acc: 0.4737
Epoch 74/100
4346/4346 [==============================] - 38s - loss: 0.1334 - acc: 0.9607 - val_loss: 3.6358 - val_acc: 0.4672
Epoch 75/100
4346/4346 [==============================] - 38s - loss: 0.1167 - acc: 0.9666 - val_loss: 3.6078 - val_acc: 0.4758
Epoch 76/100
4346/4346 [==============================] - 38s - loss: 0.1354 - acc: 0.9639 - val_loss: 3.6409 - val_acc: 0.4758
Epoch 77/100
4346/4346 [==============================] - 38s - loss: 0.1202 - acc: 0.9627 - val_loss: 3.5874 - val_acc: 0.4834
Epoch 78/100
4346/4346 [==============================] - 38s - loss: 0.1076 - acc: 0.9680 - val_loss: 3.7173 - val_acc: 0.4694
Epoch 79/100
4346/4346 [==============================] - 38s - loss: 0.1389 - acc: 0.9565 - val_loss: 3.5808 - val_acc: 0.4801
Epoch 80/100
4346/4346 [==============================] - 38s - loss: 0.1281 - acc: 0.9648 - val_loss: 3.6159 - val_acc: 0.4780
Epoch 81/100
4346/4346 [==============================] - 38s - loss: 0.1049 - acc: 0.9678 - val_loss: 3.6340 - val_acc: 0.4726
Epoch 82/100
4346/4346 [==============================] - 38s - loss: 0.1013 - acc: 0.9676 - val_loss: 3.6720 - val_acc: 0.4801
Epoch 83/100
4346/4346 [==============================] - 38s - loss: 0.0965 - acc: 0.9731 - val_loss: 3.7164 - val_acc: 0.4844
Epoch 84/100
4346/4346 [==============================] - 38s - loss: 0.0983 - acc: 0.9715 - val_loss: 3.7099 - val_acc: 0.4844
Epoch 85/100
4346/4346 [==============================] - 38s - loss: 0.1251 - acc: 0.9611 - val_loss: 3.6103 - val_acc: 0.4662
Epoch 86/100
4346/4346 [==============================] - 38s - loss: 0.0977 - acc: 0.9673 - val_loss: 3.5529 - val_acc: 0.4823
Epoch 87/100
4346/4346 [==============================] - 38s - loss: 0.0997 - acc: 0.9685 - val_loss: 3.7150 - val_acc: 0.4672
Epoch 88/100
4346/4346 [==============================] - 38s - loss: 0.0945 - acc: 0.9685 - val_loss: 3.8007 - val_acc: 0.4769
Epoch 89/100
4346/4346 [==============================] - 38s - loss: 0.1063 - acc: 0.9664 - val_loss: 3.8063 - val_acc: 0.4748
Epoch 90/100
4346/4346 [==============================] - 38s - loss: 0.1029 - acc: 0.9694 - val_loss: 3.7229 - val_acc: 0.4866
Epoch 91/100
4346/4346 [==============================] - 38s - loss: 0.1127 - acc: 0.9666 - val_loss: 3.6706 - val_acc: 0.4855
Epoch 92/100
4346/4346 [==============================] - 38s - loss: 0.0993 - acc: 0.9694 - val_loss: 3.6673 - val_acc: 0.4941
Epoch 93/100
4346/4346 [==============================] - 38s - loss: 0.1001 - acc: 0.9678 - val_loss: 3.7259 - val_acc: 0.4726
Epoch 94/100
4346/4346 [==============================] - 38s - loss: 0.0922 - acc: 0.9731 - val_loss: 3.6871 - val_acc: 0.4748
Epoch 95/100
4346/4346 [==============================] - 38s - loss: 0.0786 - acc: 0.9761 - val_loss: 3.7414 - val_acc: 0.4651
Epoch 96/100
4346/4346 [==============================] - 38s - loss: 0.1023 - acc: 0.9682 - val_loss: 3.7787 - val_acc: 0.4844
Epoch 97/100
4346/4346 [==============================] - 38s - loss: 0.0895 - acc: 0.9747 - val_loss: 3.7143 - val_acc: 0.4876
Epoch 98/100
4346/4346 [==============================] - 38s - loss: 0.1098 - acc: 0.9685 - val_loss: 3.6360 - val_acc: 0.4791
Epoch 99/100
4346/4346 [==============================] - 38s - loss: 0.0858 - acc: 0.9731 - val_loss: 3.7660 - val_acc: 0.4844
Epoch 100/100
4346/4346 [==============================] - 38s - loss: 0.0875 - acc: 0.9733 - val_loss: 3.6681 - val_acc: 0.4855

Let's plot the validation loss and validation accuracy over time.

In [36]:
fig = plt.figure(figsize=(16,4))
ax = fig.add_subplot(121)
ax.plot(history.history["val_loss"])
ax.set_title("validation loss")
ax.set_xlabel("epochs")

ax2 = fig.add_subplot(122)
ax2.plot(history.history["val_acc"])
ax2.set_title("validation accuracy")
ax2.set_xlabel("epochs")
ax2.set_ylim(0, 1)

plt.show()

Notice that the validation loss actually begins to rise after around 16 epochs, even though validation accuracy plateaus roughly between 40% and 50%. This suggests our model begins overfitting around then, and the best performance would have been achieved by stopping early around that point. Even so, our accuracy would likely not have exceeded 50%, and was probably lower.
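
If you wanted to automate that kind of early stopping, Keras provides callbacks for it. Below is a minimal sketch (the patience value and checkpoint filename are arbitrary choices, not something used elsewhere in this guide):

from keras.callbacks import EarlyStopping, ModelCheckpoint

# stop once val_loss has not improved for 5 consecutive epochs,
# and keep a copy of the best weights seen so far on disk
callbacks = [EarlyStopping(monitor='val_loss', patience=5),
             ModelCheckpoint('model_scratch_best.h5', monitor='val_loss', save_best_only=True)]

history = model.fit(x_train, y_train,
                    batch_size=128,
                    epochs=100,
                    validation_data=(x_val, y_val),
                    callbacks=callbacks)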

We can also get a final evaluation by running our model on the test set. Doing so, we get the following results:

In [5]:
loss, accuracy = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', loss)
print('Test accuracy:', accuracy)
('Test loss:', 3.7721598874857496)
('Test accuracy:', 0.49463519313304721)

Finally, we see that we have achieved a (top-1) accuracy of around 49%. That's not too bad for 6000 images, considering that if we were to use a naive strategy of taking random guesses, we would have only gotten around 1% accuracy.

Transfer learning by starting with existing network

Now we can move on to the main strategy for training an image classifier on our small dataset: starting with a larger network that has already been trained.

To start, we will load VGG16 from keras, which was trained on ImageNet and whose weights are available online. If this is your first time loading VGG16, you'll need to wait a bit for the weights to download from the web. Once the network is loaded, we can again inspect the layers with the summary() method.

In [6]:
vgg = keras.applications.VGG16(weights='imagenet', include_top=True)
vgg.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         (None, 224, 224, 3)       0         
_________________________________________________________________
block1_conv1 (Conv2D)        (None, 224, 224, 64)      1792      
_________________________________________________________________
block1_conv2 (Conv2D)        (None, 224, 224, 64)      36928     
_________________________________________________________________
block1_pool (MaxPooling2D)   (None, 112, 112, 64)      0         
_________________________________________________________________
block2_conv1 (Conv2D)        (None, 112, 112, 128)     73856     
_________________________________________________________________
block2_conv2 (Conv2D)        (None, 112, 112, 128)     147584    
_________________________________________________________________
block2_pool (MaxPooling2D)   (None, 56, 56, 128)       0         
_________________________________________________________________
block3_conv1 (Conv2D)        (None, 56, 56, 256)       295168    
_________________________________________________________________
block3_conv2 (Conv2D)        (None, 56, 56, 256)       590080    
_________________________________________________________________
block3_conv3 (Conv2D)        (None, 56, 56, 256)       590080    
_________________________________________________________________
block3_pool (MaxPooling2D)   (None, 28, 28, 256)       0         
_________________________________________________________________
block4_conv1 (Conv2D)        (None, 28, 28, 512)       1180160   
_________________________________________________________________
block4_conv2 (Conv2D)        (None, 28, 28, 512)       2359808   
_________________________________________________________________
block4_conv3 (Conv2D)        (None, 28, 28, 512)       2359808   
_________________________________________________________________
block4_pool (MaxPooling2D)   (None, 14, 14, 512)       0         
_________________________________________________________________
block5_conv1 (Conv2D)        (None, 14, 14, 512)       2359808   
_________________________________________________________________
block5_conv2 (Conv2D)        (None, 14, 14, 512)       2359808   
_________________________________________________________________
block5_conv3 (Conv2D)        (None, 14, 14, 512)       2359808   
_________________________________________________________________
block5_pool (MaxPooling2D)   (None, 7, 7, 512)         0         
_________________________________________________________________
flatten (Flatten)            (None, 25088)             0         
_________________________________________________________________
fc1 (Dense)                  (None, 4096)              102764544 
_________________________________________________________________
fc2 (Dense)                  (None, 4096)              16781312  
_________________________________________________________________
predictions (Dense)          (None, 1000)              4097000   
=================================================================
Total params: 138,357,544.0
Trainable params: 138,357,544.0
Non-trainable params: 0.0
_________________________________________________________________

Notice that VGG16 is much bigger than the network we constructed earlier. It contains 13 convolutional layers and three fully connected layers at the end, and has over 138 million parameters, around 100 times as many parameters as the network we made above. Like our first network, the majority of the parameters are stored in the connections leading into the first fully-connected layer.

VGG16 was made to solve ImageNet, and achieves an 8.8% top-5 error rate, which means that 91.2% of test samples were classified correctly within the top 5 predictions for each image. Its top-1 accuracy -- equivalent to the accuracy metric we've been using (the top prediction is correct) -- is 73%. This is especially impressive since there are not just 97, but 1000 classes, meaning that random guesses would get us only 0.1% accuracy.

In order to use this network for our task, we "remove" the final classification layer, the 1000-neuron softmax layer at the end, which corresponds to ImageNet, and instead replace it with a new softmax layer for our dataset, which contains 97 neurons in the case of the 101_ObjectCategories dataset.

In terms of implementation, it's easier to simply create a copy of VGG from its input layer through its second-to-last layer, and work with that, rather than modifying the VGG object directly. So technically we never "remove" anything, we just circumvent/ignore it. This can be done by using the keras Model class to initialize a new model whose input layer is the same as VGG's but whose output layer is our new softmax layer, called new_classification_layer. Note: although it appears we are duplicating this large network, internally Keras is actually just reusing all the layers by reference, so we don't need to worry about overloading the memory.

In [41]:
# make a reference to VGG's input layer
inp = vgg.input

# make a new softmax layer with num_classes neurons
new_classification_layer = Dense(num_classes, activation='softmax')

# connect our new layer to the second to last layer in VGG, and make a reference to it
out = new_classification_layer(vgg.layers[-2].output)

# create a new network between inp and out
model_new = Model(inp, out)

We are going to retrain this network, model_new, on the new dataset and labels. But first, we need to freeze the weights and biases in every layer except our new one at the end, with the expectation that the features learned in VGG should still be fairly relevant to the new image classification task. Not optimal, but most likely better than anything we could train from scratch on our limited dataset.

By setting the trainable flag of each layer to False (except for our new classification layer), we ensure that all the weights and biases in those layers remain fixed, and we simply train the weights in the one layer at the end. In some cases, it is desirable not to freeze all the pre-classification layers. If your dataset has enough samples and doesn't resemble ImageNet very much, it might be advantageous to fine-tune some of the VGG layers along with the new classifier, or possibly even all of them. To do this, you can change the code below to make more of the layers trainable.

In the case of CalTech-101, we will just do feature extraction, fearing that fine-tuning too much with this dataset may overfit. But maybe we are wrong? A good exercise would be to try out both, and compare the results.
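
If you do want to experiment with fine-tuning, one way you might set it up is sketched below: unfreeze the last convolutional block of VGG along with the new classifier, and recompile with a lower learning rate so the pre-trained weights are only nudged. The block choice and learning rate here are illustrative guesses, not values tuned for this dataset.

from keras.optimizers import SGD

# freeze everything first...
for layer in model_new.layers:
    layer.trainable = False

# ...then unfreeze block5 and the new classification layer
for layer in model_new.layers:
    if layer.name.startswith('block5') or layer == model_new.layers[-1]:
        layer.trainable = True

# a small learning rate keeps the pre-trained weights from changing too much
model_new.compile(loss='categorical_crossentropy',
                  optimizer=SGD(lr=1e-4, momentum=0.9),
                  metrics=['accuracy'])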

So we go ahead and freeze the layers, and compile the new model with exactly the same optimizer and loss function as in our first network, for the sake of a fair comparison. We then run summary again to look at the network's architecture.

In [42]:
# make all layers untrainable by freezing weights (except for last layer)
for l, layer in enumerate(model_new.layers[:-1]):
    layer.trainable = False

# ensure the last layer is trainable/not frozen
for l, layer in enumerate(model_new.layers[-1:]):
    layer.trainable = True

model_new.compile(loss='categorical_crossentropy',
              optimizer='adadelta',
              metrics=['accuracy'])

model_new.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         (None, 224, 224, 3)       0         
_________________________________________________________________
block1_conv1 (Conv2D)        (None, 224, 224, 64)      1792      
_________________________________________________________________
block1_conv2 (Conv2D)        (None, 224, 224, 64)      36928     
_________________________________________________________________
block1_pool (MaxPooling2D)   (None, 112, 112, 64)      0         
_________________________________________________________________
block2_conv1 (Conv2D)        (None, 112, 112, 128)     73856     
_________________________________________________________________
block2_conv2 (Conv2D)        (None, 112, 112, 128)     147584    
_________________________________________________________________
block2_pool (MaxPooling2D)   (None, 56, 56, 128)       0         
_________________________________________________________________
block3_conv1 (Conv2D)        (None, 56, 56, 256)       295168    
_________________________________________________________________
block3_conv2 (Conv2D)        (None, 56, 56, 256)       590080    
_________________________________________________________________
block3_conv3 (Conv2D)        (None, 56, 56, 256)       590080    
_________________________________________________________________
block3_pool (MaxPooling2D)   (None, 28, 28, 256)       0         
_________________________________________________________________
block4_conv1 (Conv2D)        (None, 28, 28, 512)       1180160   
_________________________________________________________________
block4_conv2 (Conv2D)        (None, 28, 28, 512)       2359808   
_________________________________________________________________
block4_conv3 (Conv2D)        (None, 28, 28, 512)       2359808   
_________________________________________________________________
block4_pool (MaxPooling2D)   (None, 14, 14, 512)       0         
_________________________________________________________________
block5_conv1 (Conv2D)        (None, 14, 14, 512)       2359808   
_________________________________________________________________
block5_conv2 (Conv2D)        (None, 14, 14, 512)       2359808   
_________________________________________________________________
block5_conv3 (Conv2D)        (None, 14, 14, 512)       2359808   
_________________________________________________________________
block5_pool (MaxPooling2D)   (None, 7, 7, 512)         0         
_________________________________________________________________
flatten (Flatten)            (None, 25088)             0         
_________________________________________________________________
fc1 (Dense)                  (None, 4096)              102764544 
_________________________________________________________________
fc2 (Dense)                  (None, 4096)              16781312  
_________________________________________________________________
dense_5 (Dense)              (None, 97)                397409    
=================================================================
Total params: 134,657,953.0
Trainable params: 397,409.0
Non-trainable params: 134,260,544.0
_________________________________________________________________

Looking at the summary, we see the network is identical to the VGG model we instantiated earlier, except that the last layer, formerly a 1000-neuron softmax, has been replaced by a new 97-neuron softmax. Additionally, we still have roughly 134 million parameters, but now the vast majority of them are "non-trainable params" because we froze the layers they belong to. We now have only 397,409 trainable parameters (4096 × 97 weights plus 97 biases), roughly a third of the number of trainable parameters in our first model.

As before, we go ahead and train the new model, using the same hyperparameters (batch size and number of epochs) as before, along with the same optimization algorithm. We also keep track of its history as we go.

In [54]:
history2 = model_new.fit(x_train, y_train, 
                         batch_size=128, 
                         epochs=100, 
                         validation_data=(x_val, y_val))
Train on 4346 samples, validate on 931 samples
Epoch 1/100
4346/4346 [==============================] - 36s - loss: 4.2942 - acc: 0.0900 - val_loss: 3.7752 - val_acc: 0.1472
Epoch 2/100
4346/4346 [==============================] - 36s - loss: 3.4561 - acc: 0.2584 - val_loss: 3.2105 - val_acc: 0.3212
Epoch 3/100
4346/4346 [==============================] - 36s - loss: 2.9204 - acc: 0.3684 - val_loss: 2.7843 - val_acc: 0.3910
Epoch 4/100
4346/4346 [==============================] - 36s - loss: 2.5652 - acc: 0.4473 - val_loss: 2.5515 - val_acc: 0.4823
Epoch 5/100
4346/4346 [==============================] - 36s - loss: 2.2863 - acc: 0.5166 - val_loss: 2.2879 - val_acc: 0.4791
Epoch 6/100
4346/4346 [==============================] - 36s - loss: 2.0762 - acc: 0.5511 - val_loss: 2.1597 - val_acc: 0.5188
Epoch 7/100
4346/4346 [==============================] - 36s - loss: 1.9124 - acc: 0.5874 - val_loss: 1.9776 - val_acc: 0.5585
Epoch 8/100
4346/4346 [==============================] - 36s - loss: 1.7626 - acc: 0.6197 - val_loss: 1.9062 - val_acc: 0.5671
Epoch 9/100
4346/4346 [==============================] - 36s - loss: 1.6470 - acc: 0.6523 - val_loss: 1.7965 - val_acc: 0.5994
Epoch 10/100
4346/4346 [==============================] - 36s - loss: 1.5568 - acc: 0.6707 - val_loss: 1.6949 - val_acc: 0.6079
Epoch 11/100
4346/4346 [==============================] - 36s - loss: 1.4616 - acc: 0.6901 - val_loss: 1.6447 - val_acc: 0.6047
Epoch 12/100
4346/4346 [==============================] - 36s - loss: 1.3878 - acc: 0.7103 - val_loss: 1.5615 - val_acc: 0.6337
Epoch 13/100
4346/4346 [==============================] - 36s - loss: 1.3259 - acc: 0.7112 - val_loss: 1.5190 - val_acc: 0.6488
Epoch 14/100
4346/4346 [==============================] - 36s - loss: 1.2700 - acc: 0.7333 - val_loss: 1.4809 - val_acc: 0.6574
Epoch 15/100
4346/4346 [==============================] - 36s - loss: 1.2156 - acc: 0.7430 - val_loss: 1.4060 - val_acc: 0.6874
Epoch 16/100
4346/4346 [==============================] - 36s - loss: 1.1614 - acc: 0.7545 - val_loss: 1.3809 - val_acc: 0.6702
Epoch 17/100
4346/4346 [==============================] - 36s - loss: 1.1213 - acc: 0.7549 - val_loss: 1.3757 - val_acc: 0.6735
Epoch 18/100
4346/4346 [==============================] - 36s - loss: 1.0823 - acc: 0.7623 - val_loss: 1.3053 - val_acc: 0.6992
Epoch 19/100
4346/4346 [==============================] - 36s - loss: 1.0498 - acc: 0.7743 - val_loss: 1.2646 - val_acc: 0.6917
Epoch 20/100
4346/4346 [==============================] - 36s - loss: 1.0119 - acc: 0.7780 - val_loss: 1.2462 - val_acc: 0.6982
Epoch 21/100
4346/4346 [==============================] - 36s - loss: 0.9782 - acc: 0.7902 - val_loss: 1.2202 - val_acc: 0.7014
Epoch 22/100
4346/4346 [==============================] - 36s - loss: 0.9491 - acc: 0.7936 - val_loss: 1.2190 - val_acc: 0.7025
Epoch 23/100
4346/4346 [==============================] - 36s - loss: 0.9309 - acc: 0.7927 - val_loss: 1.2055 - val_acc: 0.7068
Epoch 24/100
4346/4346 [==============================] - 36s - loss: 0.8900 - acc: 0.8125 - val_loss: 1.1590 - val_acc: 0.7186
Epoch 25/100
4346/4346 [==============================] - 36s - loss: 0.8782 - acc: 0.8127 - val_loss: 1.1377 - val_acc: 0.7272
Epoch 26/100
4346/4346 [==============================] - 36s - loss: 0.8557 - acc: 0.8097 - val_loss: 1.1375 - val_acc: 0.7057
Epoch 27/100
4346/4346 [==============================] - 36s - loss: 0.8281 - acc: 0.8180 - val_loss: 1.1416 - val_acc: 0.7111
Epoch 28/100
4346/4346 [==============================] - 36s - loss: 0.8066 - acc: 0.8233 - val_loss: 1.1032 - val_acc: 0.7218
Epoch 29/100
4346/4346 [==============================] - 36s - loss: 0.7975 - acc: 0.8208 - val_loss: 1.0881 - val_acc: 0.7272
Epoch 30/100
4346/4346 [==============================] - 36s - loss: 0.7724 - acc: 0.8302 - val_loss: 1.0782 - val_acc: 0.7304
Epoch 31/100
4346/4346 [==============================] - 36s - loss: 0.7553 - acc: 0.8329 - val_loss: 1.0862 - val_acc: 0.7175
Epoch 32/100
4346/4346 [==============================] - 37s - loss: 0.7379 - acc: 0.8382 - val_loss: 1.0506 - val_acc: 0.7390
Epoch 33/100
4346/4346 [==============================] - 36s - loss: 0.7256 - acc: 0.8376 - val_loss: 1.0220 - val_acc: 0.7551
Epoch 34/100
4346/4346 [==============================] - 36s - loss: 0.7063 - acc: 0.8433 - val_loss: 1.0208 - val_acc: 0.7379
Epoch 35/100
4346/4346 [==============================] - 36s - loss: 0.6978 - acc: 0.8442 - val_loss: 0.9990 - val_acc: 0.7540
Epoch 36/100
4346/4346 [==============================] - 36s - loss: 0.6751 - acc: 0.8530 - val_loss: 1.0387 - val_acc: 0.7272
Epoch 37/100
4346/4346 [==============================] - 36s - loss: 0.6700 - acc: 0.8442 - val_loss: 0.9966 - val_acc: 0.7476
Epoch 38/100
4346/4346 [==============================] - 36s - loss: 0.6557 - acc: 0.8573 - val_loss: 0.9849 - val_acc: 0.7594
Epoch 39/100
4346/4346 [==============================] - 36s - loss: 0.6431 - acc: 0.8585 - val_loss: 0.9924 - val_acc: 0.7465
Epoch 40/100
4346/4346 [==============================] - 36s - loss: 0.6315 - acc: 0.8679 - val_loss: 0.9706 - val_acc: 0.7487
Epoch 41/100
4346/4346 [==============================] - 36s - loss: 0.6202 - acc: 0.8603 - val_loss: 0.9716 - val_acc: 0.7530
Epoch 42/100
4346/4346 [==============================] - 37s - loss: 0.6090 - acc: 0.8677 - val_loss: 0.9680 - val_acc: 0.7583
Epoch 43/100
4346/4346 [==============================] - 37s - loss: 0.5978 - acc: 0.8677 - val_loss: 0.9738 - val_acc: 0.7530
Epoch 44/100
4346/4346 [==============================] - 36s - loss: 0.5870 - acc: 0.8744 - val_loss: 0.9347 - val_acc: 0.7766
Epoch 45/100
4346/4346 [==============================] - 37s - loss: 0.5753 - acc: 0.8771 - val_loss: 0.9095 - val_acc: 0.7701
Epoch 46/100
4346/4346 [==============================] - 36s - loss: 0.5675 - acc: 0.8792 - val_loss: 0.9525 - val_acc: 0.7573
Epoch 47/100
4346/4346 [==============================] - 36s - loss: 0.5587 - acc: 0.8753 - val_loss: 0.9459 - val_acc: 0.7476
Epoch 48/100
4346/4346 [==============================] - 36s - loss: 0.5555 - acc: 0.8794 - val_loss: 0.9261 - val_acc: 0.7530
Epoch 49/100
4346/4346 [==============================] - 36s - loss: 0.5383 - acc: 0.8836 - val_loss: 0.9454 - val_acc: 0.7454
Epoch 50/100
4346/4346 [==============================] - 36s - loss: 0.5342 - acc: 0.8840 - val_loss: 0.9138 - val_acc: 0.7615
Epoch 51/100
4346/4346 [==============================] - 36s - loss: 0.5291 - acc: 0.8870 - val_loss: 0.9070 - val_acc: 0.7573
Epoch 52/100
4346/4346 [==============================] - 36s - loss: 0.5148 - acc: 0.8861 - val_loss: 0.9018 - val_acc: 0.7680
Epoch 53/100
4346/4346 [==============================] - 36s - loss: 0.5096 - acc: 0.8886 - val_loss: 0.9226 - val_acc: 0.7530
Epoch 54/100
4346/4346 [==============================] - 36s - loss: 0.5027 - acc: 0.8939 - val_loss: 0.8943 - val_acc: 0.7744
Epoch 55/100
4346/4346 [==============================] - 36s - loss: 0.4936 - acc: 0.8960 - val_loss: 0.9074 - val_acc: 0.7691
Epoch 56/100
4346/4346 [==============================] - 36s - loss: 0.4879 - acc: 0.8953 - val_loss: 0.9124 - val_acc: 0.7540
Epoch 57/100
4346/4346 [==============================] - 36s - loss: 0.4792 - acc: 0.8965 - val_loss: 0.8896 - val_acc: 0.7744
Epoch 58/100
4346/4346 [==============================] - 36s - loss: 0.4755 - acc: 0.8960 - val_loss: 0.8891 - val_acc: 0.7658
Epoch 59/100
4346/4346 [==============================] - 36s - loss: 0.4682 - acc: 0.9008 - val_loss: 0.8810 - val_acc: 0.7701
Epoch 60/100
4346/4346 [==============================] - 36s - loss: 0.4641 - acc: 0.8985 - val_loss: 0.8869 - val_acc: 0.7820
Epoch 61/100
4346/4346 [==============================] - 36s - loss: 0.4495 - acc: 0.9036 - val_loss: 0.8652 - val_acc: 0.7744
Epoch 62/100
4346/4346 [==============================] - 37s - loss: 0.4485 - acc: 0.9040 - val_loss: 0.8858 - val_acc: 0.7680
Epoch 63/100
4346/4346 [==============================] - 37s - loss: 0.4396 - acc: 0.9103 - val_loss: 0.8642 - val_acc: 0.7755
Epoch 64/100
4346/4346 [==============================] - 37s - loss: 0.4323 - acc: 0.9116 - val_loss: 0.8702 - val_acc: 0.7798
Epoch 65/100
4346/4346 [==============================] - 36s - loss: 0.4302 - acc: 0.9059 - val_loss: 0.8904 - val_acc: 0.7594
Epoch 66/100
4346/4346 [==============================] - 37s - loss: 0.4272 - acc: 0.9110 - val_loss: 0.8638 - val_acc: 0.7744
Epoch 67/100
4346/4346 [==============================] - 36s - loss: 0.4167 - acc: 0.9146 - val_loss: 0.8752 - val_acc: 0.7723
Epoch 68/100
4346/4346 [==============================] - 37s - loss: 0.4100 - acc: 0.9183 - val_loss: 0.8653 - val_acc: 0.7787
Epoch 69/100
4346/4346 [==============================] - 37s - loss: 0.4076 - acc: 0.9144 - val_loss: 0.8573 - val_acc: 0.7658
Epoch 70/100
4346/4346 [==============================] - 36s - loss: 0.4026 - acc: 0.9162 - val_loss: 0.8291 - val_acc: 0.7798
Epoch 71/100
4346/4346 [==============================] - 36s - loss: 0.3964 - acc: 0.9185 - val_loss: 0.8721 - val_acc: 0.7701
Epoch 72/100
4346/4346 [==============================] - 36s - loss: 0.3944 - acc: 0.9156 - val_loss: 0.8509 - val_acc: 0.7787
Epoch 73/100
4346/4346 [==============================] - 37s - loss: 0.3897 - acc: 0.9197 - val_loss: 0.8458 - val_acc: 0.7744
Epoch 74/100
4346/4346 [==============================] - 37s - loss: 0.3819 - acc: 0.9206 - val_loss: 0.8388 - val_acc: 0.7852
Epoch 75/100
4346/4346 [==============================] - 36s - loss: 0.3787 - acc: 0.9215 - val_loss: 0.8281 - val_acc: 0.7691
Epoch 76/100
4346/4346 [==============================] - 37s - loss: 0.3757 - acc: 0.9213 - val_loss: 0.8352 - val_acc: 0.7701
Epoch 77/100
4346/4346 [==============================] - 37s - loss: 0.3713 - acc: 0.9287 - val_loss: 0.8344 - val_acc: 0.7766
Epoch 78/100
4346/4346 [==============================] - 36s - loss: 0.3664 - acc: 0.9243 - val_loss: 0.8259 - val_acc: 0.7830
Epoch 79/100
4346/4346 [==============================] - 37s - loss: 0.3614 - acc: 0.9273 - val_loss: 0.8162 - val_acc: 0.7830
Epoch 80/100
4346/4346 [==============================] - 36s - loss: 0.3563 - acc: 0.9271 - val_loss: 0.8301 - val_acc: 0.7734
Epoch 81/100
4346/4346 [==============================] - 37s - loss: 0.3527 - acc: 0.9261 - val_loss: 0.8223 - val_acc: 0.7863
Epoch 82/100
4346/4346 [==============================] - 37s - loss: 0.3499 - acc: 0.9294 - val_loss: 0.7862 - val_acc: 0.7927
Epoch 83/100
4346/4346 [==============================] - 36s - loss: 0.3399 - acc: 0.9342 - val_loss: 0.8135 - val_acc: 0.7766
Epoch 84/100
4346/4346 [==============================] - 37s - loss: 0.3410 - acc: 0.9337 - val_loss: 0.8146 - val_acc: 0.7830
Epoch 85/100
4346/4346 [==============================] - 36s - loss: 0.3354 - acc: 0.9335 - val_loss: 0.7957 - val_acc: 0.7905
Epoch 86/100
4346/4346 [==============================] - 36s - loss: 0.3355 - acc: 0.9314 - val_loss: 0.8126 - val_acc: 0.7820
Epoch 87/100
4346/4346 [==============================] - 36s - loss: 0.3307 - acc: 0.9344 - val_loss: 0.8042 - val_acc: 0.7820
Epoch 88/100
4346/4346 [==============================] - 36s - loss: 0.3240 - acc: 0.9386 - val_loss: 0.7895 - val_acc: 0.8002
Epoch 89/100
4346/4346 [==============================] - 37s - loss: 0.3210 - acc: 0.9388 - val_loss: 0.8042 - val_acc: 0.7927
Epoch 90/100
4346/4346 [==============================] - 37s - loss: 0.3167 - acc: 0.9365 - val_loss: 0.8036 - val_acc: 0.7809
Epoch 91/100
4346/4346 [==============================] - 36s - loss: 0.3138 - acc: 0.9416 - val_loss: 0.7923 - val_acc: 0.7863
Epoch 92/100
4346/4346 [==============================] - 37s - loss: 0.3126 - acc: 0.9390 - val_loss: 0.7948 - val_acc: 0.7916
Epoch 93/100
4346/4346 [==============================] - 36s - loss: 0.3083 - acc: 0.9422 - val_loss: 0.7973 - val_acc: 0.7895
Epoch 94/100
4346/4346 [==============================] - 37s - loss: 0.3057 - acc: 0.9422 - val_loss: 0.7880 - val_acc: 0.7916
Epoch 95/100
4346/4346 [==============================] - 36s - loss: 0.3004 - acc: 0.9448 - val_loss: 0.8244 - val_acc: 0.7755
Epoch 96/100
4346/4346 [==============================] - 36s - loss: 0.2986 - acc: 0.9439 - val_loss: 0.7961 - val_acc: 0.7884
Epoch 97/100
4346/4346 [==============================] - 36s - loss: 0.2966 - acc: 0.9459 - val_loss: 0.7892 - val_acc: 0.7927
Epoch 98/100
4346/4346 [==============================] - 36s - loss: 0.2932 - acc: 0.9439 - val_loss: 0.7866 - val_acc: 0.8002
Epoch 99/100
4346/4346 [==============================] - 36s - loss: 0.2887 - acc: 0.9473 - val_loss: 0.7858 - val_acc: 0.7948
Epoch 100/100
4346/4346 [==============================] - 36s - loss: 0.2862 - acc: 0.9471 - val_loss: 0.7880 - val_acc: 0.7841

Our validation accuracy hovers close to 80% towards the end, which is around a 30-point improvement over the original network trained from scratch (meaning that we now make the wrong prediction on roughly 20% of samples, rather than 50%).

It's worth noting also that this network actually trains slightly faster than the original network, despite having more than 100 times as many parameters! This is because freezing the weights means no gradients have to be computed, and no updates applied, for the frozen layers, which saves a lot of runtime.
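
In fact, since the frozen layers never change, you could go a step further and run the images through VGG only once, save the fc2 activations, and train a small classifier directly on those saved features. This is the "feature extractor" approach described at the start of the guide; the sketch below is one way you might do it, not something run in this notebook.

# build a model that outputs VGG's fc2 activations (a 4096-dim feature vector per image)
feat_extractor = Model(vgg.input, vgg.get_layer('fc2').output)

# compute the features once for each split
feat_train = feat_extractor.predict(x_train, batch_size=128)
feat_val = feat_extractor.predict(x_val, batch_size=128)

# train a tiny softmax classifier on the precomputed features
clf = Sequential()
clf.add(Dense(num_classes, activation='softmax', input_shape=(4096,)))
clf.compile(loss='categorical_crossentropy', optimizer='adadelta', metrics=['accuracy'])
clf.fit(feat_train, y_train, batch_size=128, epochs=100, validation_data=(feat_val, y_val))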

Let's plot the validation loss and accuracy again, this time comparing the original model trained from scratch (in blue) and the new transfer-learned model in green.

In [35]:
fig = plt.figure(figsize=(16,4))
ax = fig.add_subplot(121)
ax.plot(history.history["val_loss"])
ax.plot(history2.history["val_loss"])
ax.set_title("validation loss")
ax.set_xlabel("epochs")

ax2 = fig.add_subplot(122)
ax2.plot(history.history["val_acc"])
ax2.plot(history2.history["val_acc"])
ax2.set_title("validation accuracy")
ax2.set_xlabel("epochs")
ax2.set_ylim(0, 1)

plt.show()

Notice that whereas the original model began overfitting around epoch 16, the new model continued to slowly decrease its loss over time, and likely would have improved its accuracy slightly with more iterations. The new model made it to roughly 80% top-1 accuracy (in the validation set) and continued to improve slowly through 100 epochs.

It's possible that we could have improved the original model with better regularization or more dropout, but we surely would not have made up the 30+ point gap in accuracy.

Again, we do a final validation on the test set.

In [16]:
loss, accuracy = model_new.evaluate(x_test, y_test, verbose=0)

print('Test loss:', loss)
print('Test accuracy:', accuracy)
('Test loss:', 0.8078902146335324)
('Test accuracy:', 0.7821888412017167)
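
Earlier we mentioned VGG16's top-5 error rate on ImageNet. If you're curious, you can compute the analogous top-5 accuracy of our new model on the test set with a few lines of numpy (a quick sketch, not part of the original evaluation):

# a prediction counts as correct if the true class is among the 5 highest-scoring classes
probs = model_new.predict(x_test)
top5 = np.argsort(-probs, axis=1)[:, :5]
true_classes = np.argmax(y_test, axis=1)
top5_acc = np.mean([t in row for t, row in zip(true_classes, top5)])
print('Test top-5 accuracy:', top5_acc)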

To predict a new image, simply run the following code to get the probabilities for each class.

In [ ]:
img, x = get_image('../data/101_ObjectCategories/airplanes/image_0003.jpg')
probabilities = model_new.predict([x])
print(probabilities)
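
The output is a vector of 97 probabilities in the same order as the categories list built earlier, so you can map the most likely index back to a folder name (a small convenience snippet, assuming categories is still in scope):

# index of the highest-probability class, mapped back to its category folder name
predicted_idx = np.argmax(probabilities[0])
print('predicted class:', os.path.basename(categories[predicted_idx]))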

Improving the results

78.2% top-1 accuracy on 97 roughly evenly distributed classes is a pretty good result. It is not as impressive as the original VGG16, which achieved 73% top-1 accuracy on the much harder 1000-class ImageNet task. Nevertheless, it is much better than what we were able to achieve with our original network, and there is still room for improvement. Some techniques which could possibly have improved our performance:

  • Using data augmentation: augmentation refers to applying various modifications to the original training data (distortions, rotations, rescalings, lighting changes, and so on) to increase the effective size of the training set and make the model more tolerant of such variations; see the sketch after this list.
  • Trying a different optimizer, adding more regularization/dropout, or tuning other hyperparameters.
  • Training for longer (of course)
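
As an example of the first point, Keras's ImageDataGenerator can generate randomly transformed batches on the fly during training. Below is a minimal sketch (the specific transformation ranges are arbitrary illustrative values, not settings tested in this guide):

from keras.preprocessing.image import ImageDataGenerator

# randomly rotate, shift, zoom and flip the training images on the fly
datagen = ImageDataGenerator(rotation_range=20,
                             width_shift_range=0.1,
                             height_shift_range=0.1,
                             zoom_range=0.1,
                             horizontal_flip=True)

model_new.fit_generator(datagen.flow(x_train, y_train, batch_size=128),
                        steps_per_epoch=len(x_train) // 128,
                        epochs=100,
                        validation_data=(x_val, y_val))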

A more advanced example of transfer learning in Keras, involving augmentation for a small 2-class dataset, can be found in the Keras blog.