from theano.sandbox import cuda
Using gpu device 0: Tesla K80 (CNMeM is disabled, cuDNN 5103)
/home/ubuntu/anaconda2/lib/python2.7/site-packages/theano/sandbox/cuda/__init__.py:600: UserWarning: Your cuDNN version is more recent than the one Theano officially supports. If you see any problems, try updating Theano or downgrading cuDNN to version 5.
  warnings.warn(warn)
%matplotlib inline
import utils; reload(utils)
from utils import *
from __future__ import division, print_function
Using Theano backend.
model_path = 'data/imdb/models/'
%mkdir -p $model_path
We're going to look at the IMDB dataset, which contains movie reviews along with their sentiment labels. Keras comes with some helpers for this dataset.
from keras.datasets import imdb
idx = imdb.get_word_index()
This is the word list:
idx_arr = sorted(idx, key=idx.get)
idx_arr[:10]
['the', 'and', 'a', 'of', 'to', 'is', 'br', 'in', 'it', 'i']
...and this is the mapping from id to word:
idx2word = {v: k for k, v in idx.iteritems()}
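As a quick sanity check (my addition), the lowest id should map back to the most frequent word at the head of the sorted list above:
idx2word[1]   # -> 'the', matching the first entry of idx_arr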
We download the reviews using code copied from keras.datasets:
path = get_file('imdb_full.pkl',
origin='https://s3.amazonaws.com/text-datasets/imdb_full.pkl',
md5_hash='d091312047c43cf9e4e38fef92437263')
f = open(path, 'rb')
(x_train, labels_train), (x_test, labels_test) = pickle.load(f)
Downloading data from https://s3.amazonaws.com/text-datasets/imdb_full.pkl
65298432/65552540 [============================>.] - ETA: 0s
len(x_train)
Here's the first review. As you can see, the words have been replaced by ids, which can be looked up in idx2word.
', '.join(map(str, x_train[0]))
The first word of the first review is 23022. Let's see what that is.
idx2word[23022]
Here's the whole review, mapped from ids to words.
' '.join([idx2word[o] for o in x_train[0]])
The labels are 1 for positive, 0 for negative.
labels_train[:10]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
Reduce the vocab size by setting all rare words (ids of vocab_size-1 and above) to the max index, vocab_size-1.
vocab_size = 5000
trn = [np.array([i if i<vocab_size-1 else vocab_size-1 for i in s]) for s in x_train]
test = [np.array([i if i<vocab_size-1 else vocab_size-1 for i in s]) for s in x_test]
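A quick sanity check (my addition) that the clipping worked; every id should now be at most vocab_size-1:
max(s.max() for s in trn)   # expect 4999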
Look at the distribution of review lengths.
lens = np.array(map(len, trn))
(lens.max(), lens.min(), lens.mean())
(2493, 10, 237.71364)
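For a fuller picture than these summary statistics, a quick histogram shows the spread (plt is assumed to come in via the `from utils import *` above):
plt.hist(lens, bins=50);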
Pad (with zeros) or truncate each review to a consistent length.
seq_len = 500
trn = sequence.pad_sequences(trn, maxlen=seq_len, value=0)
test = sequence.pad_sequences(test, maxlen=seq_len, value=0)
This results in nice rectangular matrices that can be passed to ML algorithms. Reviews shorter than 500 words are pre-padded with zeros; longer reviews are truncated.
trn.shape
(25000, 500)
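To see the pre-padding behaviour on a toy input (Keras pads and truncates at the front by default):
sequence.pad_sequences([[1, 2, 3]], maxlen=5)   # -> array([[0, 0, 1, 2, 3]])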
The simplest model that tends to give reasonable results is a single-hidden-layer net, so let's try that. Note that we can't expect to get any useful results by feeding word ids directly into a neural net; instead we use an embedding to replace each id with a vector of 32 (initially random) floats.
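To make the embedding idea concrete, here's a toy numpy sketch (my own illustration, not what Keras does internally) of the lookup an Embedding layer performs:
# Each word id simply indexes a row of a (vocab_size x n_factors) matrix
toy_emb = np.random.normal(size=(vocab_size, 32))
toy_vectors = toy_emb[trn[0][:5]]   # first 5 (padded) ids -> shape (5, 32)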
model = Sequential([
Embedding(vocab_size, 32, input_length=seq_len),
Flatten(),
Dense(100, activation='relu'),
Dropout(0.7),
Dense(1, activation='sigmoid')])
model.compile(loss='binary_crossentropy', optimizer=Adam(), metrics=['accuracy'])
model.summary()
____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to
====================================================================================================
embedding_1 (Embedding)          (None, 500, 32)       160000      embedding_input_1[0][0]
____________________________________________________________________________________________________
flatten_1 (Flatten)              (None, 16000)         0           embedding_1[0][0]
____________________________________________________________________________________________________
dense_1 (Dense)                  (None, 100)           1600100     flatten_1[0][0]
____________________________________________________________________________________________________
dropout_1 (Dropout)              (None, 100)           0           dense_1[0][0]
____________________________________________________________________________________________________
dense_2 (Dense)                  (None, 1)             101         dropout_1[0][0]
====================================================================================================
Total params: 1760201
____________________________________________________________________________________________________
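The parameter counts check out: the embedding holds 5000 × 32 = 160,000 weights, and the first dense layer holds 16,000 × 100 weights plus 100 biases, i.e. 1,600,100.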
model.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=2, batch_size=64)
Train on 25000 samples, validate on 25000 samples
Epoch 1/2
25000/25000 [==============================] - 1s - loss: 0.4651 - acc: 0.7495 - val_loss: 0.2830 - val_acc: 0.8804
Epoch 2/2
25000/25000 [==============================] - 1s - loss: 0.1969 - acc: 0.9265 - val_loss: 0.3195 - val_acc: 0.8694
<keras.callbacks.History at 0x7f0e084f4210>
The Stanford paper that this dataset is from cites a state-of-the-art accuracy (without unlabelled data) of 0.883, so we're short of that, but on the right track.
A CNN is likely to work better, since it's designed to take advantage of ordered data. We'll need to use a 1D CNN, since a sequence of words is 1D.
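To make the 1D convolution concrete, here's a toy numpy sketch (my own illustration, using 'valid' rather than 'same' borders for simplicity):
# Slide a (width x n_factors) filter along the time axis, taking a dot
# product at each position - one output per position, per filter.
def conv1d_valid(seq, filt):
    w = len(filt)
    return np.array([np.sum(seq[i:i+w] * filt) for i in range(len(seq) - w + 1)])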
conv1 = Sequential([
Embedding(vocab_size, 32, input_length=seq_len, dropout=0.2),
Dropout(0.2),
Convolution1D(64, 5, border_mode='same', activation='relu'),
Dropout(0.2),
MaxPooling1D(),
Flatten(),
Dense(100, activation='relu'),
Dropout(0.7),
Dense(1, activation='sigmoid')])
conv1.compile(loss='binary_crossentropy', optimizer=Adam(), metrics=['accuracy'])
conv1.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=4, batch_size=64)
Train on 25000 samples, validate on 25000 samples
Epoch 1/4
25000/25000 [==============================] - 4s - loss: 0.4984 - acc: 0.7250 - val_loss: 0.2922 - val_acc: 0.8816
Epoch 2/4
25000/25000 [==============================] - 4s - loss: 0.2971 - acc: 0.8836 - val_loss: 0.2681 - val_acc: 0.8911
Epoch 3/4
25000/25000 [==============================] - 4s - loss: 0.2568 - acc: 0.8983 - val_loss: 0.2551 - val_acc: 0.8947
Epoch 4/4
25000/25000 [==============================] - 4s - loss: 0.2427 - acc: 0.9029 - val_loss: 0.2558 - val_acc: 0.8947
<keras.callbacks.History at 0x7f99cfa785d0>
That's well past the Stanford paper's accuracy - another win for CNNs!
conv1.save_weights(model_path + 'conv1.h5')
conv1.load_weights(model_path + 'conv1.h5')
You may want to look at wordvectors.ipynb before moving on.
In this section, we replicate the previous CNN, but using pre-trained embeddings.
def get_glove_dataset(dataset):
    """Download the requested glove dataset from files.fast.ai
    and return a location that can be passed to load_vectors.
    """
    # see wordvectors.ipynb for info on how these files were
    # generated from the original glove data.
    md5sums = {'6B.50d': '8e1557d1228decbda7db6dfd81cd9909',
               '6B.100d': 'c92dbbeacde2b0384a43014885a60b2c',
               '6B.200d': 'af271b46c04b0b2e41a84d8cd806178d',
               '6B.300d': '30290210376887dcc6d0a5a6374d8255'}
    glove_path = os.path.abspath('data/glove/results')
    %mkdir -p $glove_path
    return get_file(dataset,
                    'http://files.fast.ai/models/glove/' + dataset + '.tgz',
                    cache_subdir=glove_path,
                    md5_hash=md5sums.get(dataset, None),
                    untar=True)
def load_vectors(loc):
    return (load_array(loc+'.dat'),
            pickle.load(open(loc+'_words.pkl','rb')),
            pickle.load(open(loc+'_idx.pkl','rb')))
vecs, words, wordidx = load_vectors(get_glove_dataset('6B.50d'))
Downloading data from http://files.fast.ai/models/glove/6B.50d.tgz
80101376/80107627 [============================>.] - ETA: 0s
Untaring file...
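A quick look at what came back (my addition; the 400,000-word figure assumes the standard 6B GloVe vocabulary):
vecs.shape, len(words), len(wordidx)   # expect ((400000, 50), 400000, 400000)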
The GloVe word ids and the IMDB word ids use different indexes, so we create a simple function that builds an embedding matrix using the indexes from IMDB and the embeddings from GloVe (where they exist).
def create_emb():
    n_fact = vecs.shape[1]
    emb = np.zeros((vocab_size, n_fact))

    for i in range(1, len(emb)):
        word = idx2word[i]
        # Use the glove vector if the word is well-formed and present in glove
        # (checking `word in wordidx` avoids a KeyError on out-of-glove words)
        if word and re.match(r"^[a-zA-Z0-9\-]*$", word) and word in wordidx:
            src_idx = wordidx[word]
            emb[i] = vecs[src_idx]
        else:
            # If we can't find the word in glove, randomly initialize
            emb[i] = normal(scale=0.6, size=(n_fact,))

    # This is our "rare word" id - we want to randomly initialize it too
    emb[-1] = normal(scale=0.6, size=(n_fact,))
    emb /= 3
    return emb
emb = create_emb()
We pass our embedding matrix to the Embedding constructor, and set it to non-trainable.
model = Sequential([
Embedding(vocab_size, 50, input_length=seq_len, dropout=0.2,
weights=[emb], trainable=False),
Dropout(0.25),
Convolution1D(64, 5, border_mode='same', activation='relu'),
Dropout(0.25),
MaxPooling1D(),
Flatten(),
Dense(100, activation='relu'),
Dropout(0.7),
Dense(1, activation='sigmoid')])
model.compile(loss='binary_crossentropy', optimizer=Adam(), metrics=['accuracy'])
model.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=2, batch_size=64)
Train on 25000 samples, validate on 25000 samples
Epoch 1/2
25000/25000 [==============================] - 4s - loss: 0.5217 - acc: 0.7172 - val_loss: 0.2942 - val_acc: 0.8815
Epoch 2/2
25000/25000 [==============================] - 4s - loss: 0.3169 - acc: 0.8719 - val_loss: 0.2662 - val_acc: 0.8978
<keras.callbacks.History at 0x7f0de0f2d910>
We've already beaten our previous model! But let's fine-tune the embedding weights, especially since the words we couldn't find in GloVe just have random embeddings.
model.layers[0].trainable = True
# In Keras 1 a change to `trainable` only takes effect after re-compiling,
# so re-compile, with a lower learning rate for fine-tuning
model.compile(loss='binary_crossentropy', optimizer=Adam(1e-4), metrics=['accuracy'])
model.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=1, batch_size=64)
Train on 25000 samples, validate on 25000 samples
Epoch 1/1
25000/25000 [==============================] - 4s - loss: 0.2751 - acc: 0.8911 - val_loss: 0.2500 - val_acc: 0.9008
<keras.callbacks.History at 0x7f0de0c4e0d0>
As expected, that's given us a nice little boost. :)
model.save_weights(model_path+'glove50.h5')
This is an implementation of a multi-size CNN as shown in Ben Bowles' excellent blog post.
from keras.layers import Merge
We use the functional API to create multiple conv layers of different sizes, and then concatenate them.
# The graph's input is the output of the embedding layer: (seq_len, 50)
graph_in = Input((seq_len, 50))
convs = []
for fsz in range(3, 6):
    x = Convolution1D(64, fsz, border_mode='same', activation='relu')(graph_in)
    x = MaxPooling1D()(x)
    x = Flatten()(x)
    convs.append(x)
out = Merge(mode='concat')(convs)
graph = Model(graph_in, out)
emb = create_emb()
We then replace the conv/max-pool layer in our original CNN with the concatenated conv layers.
model = Sequential([
    Embedding(vocab_size, 50, input_length=seq_len, dropout=0.2, weights=[emb]),
    Dropout(0.2),
    graph,
    Dropout(0.5),
    Dense(100, activation='relu'),
    Dropout(0.7),
    Dense(1, activation='sigmoid')
    ])
model.compile(loss='binary_crossentropy', optimizer=Adam(), metrics=['accuracy'])
model.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=2, batch_size=64)
Train on 25000 samples, validate on 25000 samples
Epoch 1/2
25000/25000 [==============================] - 11s - loss: 0.3997 - acc: 0.8207 - val_loss: 0.3032 - val_acc: 0.8943
Epoch 2/2
25000/25000 [==============================] - 11s - loss: 0.2882 - acc: 0.8832 - val_loss: 0.2646 - val_acc: 0.9029
<keras.callbacks.History at 0x7f55b79b7990>
Interestingly, I found that in this case I got the best results when I started with the embedding layer trainable, and then set it to non-trainable after a couple of epochs. I have no idea why!
model.layers[0].trainable = False
# Re-compile so the change to `trainable` takes effect, and drop the learning rate
model.compile(loss='binary_crossentropy', optimizer=Adam(1e-5), metrics=['accuracy'])
model.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=2, batch_size=64)
Train on 25000 samples, validate on 25000 samples
Epoch 1/2
25000/25000 [==============================] - 11s - loss: 0.2556 - acc: 0.8949 - val_loss: 0.2534 - val_acc: 0.9024
Epoch 2/2
25000/25000 [==============================] - 11s - loss: 0.2360 - acc: 0.9057 - val_loss: 0.2577 - val_acc: 0.9036
<keras.callbacks.History at 0x7f55b74de110>
This more complex architecture has given us another boost in accuracy.
We haven't covered this bit yet! (This final model swaps the CNN for an LSTM.)
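The model below uses mask_zero=True, which tells downstream layers to skip the zero-padded timesteps rather than treat them as real words. Conceptually the mask is just a boolean array (an illustrative sketch, not Keras internals):
mask = (trn != 0)   # True where a real word id is present, False over the padding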
model = Sequential([
Embedding(vocab_size, 32, input_length=seq_len, mask_zero=True,
W_regularizer=l2(1e-6), dropout=0.2),
LSTM(100, consume_less='gpu'),
Dense(1, activation='sigmoid')])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to
====================================================================================================
embedding_13 (Embedding)         (None, 500, 32)       160064      embedding_input_13[0][0]
____________________________________________________________________________________________________
lstm_13 (LSTM)                   (None, 100)           53200       embedding_13[0][0]
____________________________________________________________________________________________________
dense_18 (Dense)                 (None, 1)             101         lstm_13[0][0]
====================================================================================================
Total params: 213365
____________________________________________________________________________________________________
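The LSTM's parameter count checks out: four gates, each with input, recurrent, and bias weights, give 4 × (32 × 100 + 100 × 100 + 100) = 53,200.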
model.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=5, batch_size=64)
Train on 25000 samples, validate on 25000 samples
Epoch 1/5
25000/25000 [==============================] - 100s - loss: 0.5007 - acc: 0.7446 - val_loss: 0.3475 - val_acc: 0.8531
Epoch 2/5
25000/25000 [==============================] - 100s - loss: 0.3524 - acc: 0.8507 - val_loss: 0.3602 - val_acc: 0.8453
Epoch 3/5
25000/25000 [==============================] - 99s - loss: 0.3750 - acc: 0.8342 - val_loss: 0.4758 - val_acc: 0.7710
Epoch 4/5
25000/25000 [==============================] - 99s - loss: 0.3238 - acc: 0.8652 - val_loss: 0.3094 - val_acc: 0.8725
Epoch 5/5
25000/25000 [==============================] - 99s - loss: 0.2681 - acc: 0.8920 - val_loss: 0.3018 - val_acc: 0.8776
<keras.callbacks.History at 0x7f9a16b12c50>