In this exercise you will implement a vanilla recurrent neural network and use it to train a model that can generate novel captions for images.
# As usual, a bit of setup
from __future__ import print_function
import time, os, json
import numpy as np
import matplotlib.pyplot as plt

from cs231n.gradient_check import eval_numerical_gradient, eval_numerical_gradient_array
from cs231n.rnn_layers import *
from cs231n.captioning_solver import CaptioningSolver
from cs231n.classifiers.rnn import CaptioningRNN
from cs231n.coco_utils import load_coco_data, sample_coco_minibatch, decode_captions
from cs231n.image_utils import image_from_url

%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0)  # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

# for auto-reloading external modules
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

def rel_error(x, y):
    """ returns relative error """
    return np.max(np.abs(x - y) / (np.maximum(1e-8, np.abs(x) + np.abs(y))))
The COCO dataset we will be using is stored in HDF5 format. To load HDF5 files, we will need to install the
h5py Python package. From the command line, run:
pip install h5py
If you receive a permissions error, you may need to run the command as root:
sudo pip install h5py
You can also run commands directly from the Jupyter notebook by prefixing the command with the "!" character:
!pip install h5py
Requirement already satisfied: h5py in /home/mada/anaconda3/envs/python27/lib/python2.7/site-packages
Requirement already satisfied: numpy>=1.7 in /home/mada/anaconda3/envs/python27/lib/python2.7/site-packages (from h5py)
Requirement already satisfied: six in /home/mada/.local/lib/python2.7/site-packages (from h5py)
For this exercise we will use the 2014 release of the Microsoft COCO dataset which has become the standard testbed for image captioning. The dataset consists of 80,000 training images and 40,000 validation images, each annotated with 5 captions written by workers on Amazon Mechanical Turk.
You should have already downloaded the data by changing to the
cs231n/datasets directory and running the script
get_assignment3_data.sh. If you haven't yet done so, run that script now. Warning: the COCO data download is ~1GB.
We have preprocessed the data and extracted features for you already. For all images we have extracted features from the fc7 layer of the VGG-16 network pretrained on ImageNet; these features are stored in the files
train2014_vgg16_fc7.h5 and val2014_vgg16_fc7.h5 respectively. To cut down on processing time and memory requirements, we have reduced the dimensionality of the features from 4096 to 512; these features can be found in the files train2014_vgg16_fc7_pca.h5 and val2014_vgg16_fc7_pca.h5.
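If you want to peek inside one of these HDF5 files yourself, you can open it directly with h5py; the snippet below is only a minimal sketch, and it assumes the files live under cs231n/datasets/coco_captioning and that the feature arrays are stored under the key 'features'. load_coco_data (used later in this notebook) is the supported way to read the data.

import h5py

# Minimal sketch: inspect the PCA-reduced validation features directly.
# The file path and the 'features' dataset name are assumptions; adjust them
# to match wherever the download script placed the data.
with h5py.File('cs231n/datasets/coco_captioning/val2014_vgg16_fc7_pca.h5', 'r') as f:
    print(list(f.keys()))            # names of the datasets stored in the file
    feats = f['features'][:10]       # read the first 10 feature vectors into memory
    print(feats.shape, feats.dtype)  # expect something like (10, 512) float32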
The raw images take up a lot of space (nearly 20GB) so we have not included them in the download. However all images are taken from Flickr, and URLs of the training and validation images are stored in the files
train2014_urls.txt and val2014_urls.txt respectively. This allows you to download images on the fly for visualization. Since images are downloaded on-the-fly, you must be connected to the internet to view images.
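To get a feel for what downloading on the fly looks like, here is a minimal sketch that reads the first URL from the training URL file and displays the image using the image_from_url helper imported in the setup cell; the file path is an assumption based on the dataset layout described above.

# Read one URL from the training URL file and fetch the image for display.
# The path below is an assumption; adjust it to match your dataset directory.
with open('cs231n/datasets/coco_captioning/train2014_urls.txt') as f:
    first_url = f.readline().strip()

img = image_from_url(first_url)  # downloads the image and returns it as a numpy array
plt.imshow(img)
plt.axis('off')
plt.show()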
Dealing with strings is inefficient, so we will work with an encoded version of the captions. Each word is assigned an integer ID, allowing us to represent a caption by a sequence of integers. The mapping between integer IDs and words is in the file
coco2014_vocab.json, and you can use the function
decode_captions from the file
cs231n/coco_utils.py to convert numpy arrays of integer IDs back into strings.
There are a couple special tokens that we add to the vocabulary. We prepend a special
<START> token and append an
<END> token to the beginning and end of each caption respectively. Rare words are replaced with a special
<UNK> token (for "unknown"). In addition, since we want to train with minibatches containing captions of different lengths, we pad short captions with a special
<NULL> token after the
<END> token and don't compute loss or gradient for
<NULL> tokens. Since they are a bit of a pain, we have taken care of all implementation details around special tokens for you.
You can load all of the MS-COCO data (captions, features, URLs, and vocabulary) using the
load_coco_data function from the file
cs231n/coco_utils.py. Run the following cell to do so:
# Load COCO data from disk; this returns a dictionary
# We'll work with dimensionality-reduced features for this notebook, but feel
# free to experiment with the original features by changing the flag below.
data = load_coco_data(pca_features=True)

# Print out all the keys and values from the data dictionary
for k, v in data.items():
    if type(v) == np.ndarray:
        print(k, type(v), v.shape, v.dtype)
    else:
        print(k, type(v), len(v))
idx_to_word <type 'list'> 1004
train_captions <type 'numpy.ndarray'> (400135, 17) int32
val_captions <type 'numpy.ndarray'> (195954, 17) int32
train_image_idxs <type 'numpy.ndarray'> (400135,) int32
val_features <type 'numpy.ndarray'> (40504, 512) float32
val_image_idxs <type 'numpy.ndarray'> (195954,) int32
train_features <type 'numpy.ndarray'> (82783, 512) float32
train_urls <type 'numpy.ndarray'> (82783,) |S63
val_urls <type 'numpy.ndarray'> (40504,) |S63
word_to_idx <type 'dict'> 1004
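To make the caption encoding and the special tokens described above concrete, here is a small sketch that decodes one training caption from the data dictionary; it assumes the token spellings '<NULL>', '<START>', '<END>', and '<UNK>' match the keys stored in word_to_idx.

# Decode one already-encoded training caption back into a string.
caption_ids = data['train_captions'][0]                   # length-17 array of integer word IDs
print(caption_ids)                                        # begins with the <START> ID, padded with <NULL> IDs
print(decode_captions(caption_ids, data['idx_to_word']))  # maps the IDs back to words

# The special tokens are ordinary vocabulary entries (spellings assumed as above):
for token in ['<NULL>', '<START>', '<END>', '<UNK>']:
    print(token, '->', data['word_to_idx'].get(token))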
It is always a good idea to look at examples from the dataset before working with it.
You can use the
sample_coco_minibatch function from the file
cs231n/coco_utils.py to sample minibatches of data from the data structure returned from
load_coco_data. Run the following to sample a small minibatch of training data and show the images and their captions. Running it multiple times and looking at the results will help you get a sense of the dataset.
Note that we decode the captions using the
decode_captions function and that we download the images on-the-fly using their Flickr URL, so you must be connected to the internet to view images.
# Sample a minibatch and show the images and captions
batch_size = 3

captions, features, urls = sample_coco_minibatch(data, batch_size=batch_size)
for i, (caption, url) in enumerate(zip(captions, urls)):
    plt.imshow(image_from_url(url))
    plt.axis('off')
    caption_str = decode_captions(caption, data['idx_to_word'])
    plt.title(caption_str)
    plt.show()