In [1]:
import keras
keras.__version__
Using TensorFlow backend.
Out[1]:
'2.0.8'

Sequence processing with convnets

This notebook contains the code samples found in Chapter 6, Section 4 of Deep Learning with Python. Note that the original text features far more content, in particular further explanations and figures: in this notebook, you will only find source code and related comments.

Implementing a 1D convnet

In Keras, you would use a 1D convnet via the Conv1D layer, which has a very similar interface to Conv2D. It takes as input 3D tensors with shape (samples, time, features) and also returns similarly-shaped 3D tensors. The convolution window is a 1D window on the temporal axis, axis 1 in the input tensor.

Let's build a simple 2-layer 1D convnet and apply it to the IMDB sentiment classification task that you are already familiar with.

As a reminder, this is the code for obtaining and preprocessing the data:

In [2]:
from keras.datasets import imdb
from keras.preprocessing import sequence

max_features = 10000  # number of words to consider as features
max_len = 500  # cut texts after this number of words (among top max_features most common words)

print('Loading data...')
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')

print('Pad sequences (samples x time)')
x_train = sequence.pad_sequences(x_train, maxlen=max_len)
x_test = sequence.pad_sequences(x_test, maxlen=max_len)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)
Loading data...
25000 train sequences
25000 test sequences
Pad sequences (samples x time)
x_train shape: (25000, 500)
x_test shape: (25000, 500)

1D convnets are structured in the same way as their 2D counter-parts that you have used in Chapter 5: they consist of a stack of Conv1D and MaxPooling1D layers, eventually ending in either a global pooling layer or a Flatten layer, turning the 3D outputs into 2D outputs, allowing to add one or more Dense layers to the model, for classification or regression.

One difference, though, is the fact that we can afford to use larger convolution windows with 1D convnets. Indeed, with a 2D convolution layer, a 3x3 convolution window contains 3*3 = 9 feature vectors, but with a 1D convolution layer, a convolution window of size 3 would only contain 3 feature vectors. We can thus easily afford 1D convolution windows of size 7 or 9.

This is our example 1D convnet for the IMDB dataset:

In [3]:
from keras.models import Sequential
from keras import layers
from keras.optimizers import RMSprop

model = Sequential()
model.add(layers.Embedding(max_features, 128, input_length=max_len))
model.add(layers.Conv1D(32, 7, activation='relu'))
model.add(layers.MaxPooling1D(5))
model.add(layers.Conv1D(32, 7, activation='relu'))
model.add(layers.GlobalMaxPooling1D())
model.add(layers.Dense(1))

model.summary()

model.compile(optimizer=RMSprop(lr=1e-4),
              loss='binary_crossentropy',
              metrics=['acc'])
history = model.fit(x_train, y_train,
                    epochs=10,
                    batch_size=128,
                    validation_split=0.2)
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, 500, 128)          1280000   
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 494, 32)           28704     
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 98, 32)            0         
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 92, 32)            7200      
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 32)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 33        
=================================================================
Total params: 1,315,937
Trainable params: 1,315,937
Non-trainable params: 0
_________________________________________________________________
Train on 20000 samples, validate on 5000 samples
Epoch 1/10
20000/20000 [==============================] - 4s - loss: 0.7713 - acc: 0.5287 - val_loss: 0.6818 - val_acc: 0.5970
Epoch 2/10
20000/20000 [==============================] - 3s - loss: 0.6631 - acc: 0.6775 - val_loss: 0.6582 - val_acc: 0.6646
Epoch 3/10
20000/20000 [==============================] - 3s - loss: 0.6142 - acc: 0.7580 - val_loss: 0.5987 - val_acc: 0.7118
Epoch 4/10
20000/20000 [==============================] - 3s - loss: 0.5156 - acc: 0.8124 - val_loss: 0.4936 - val_acc: 0.7736
Epoch 5/10
20000/20000 [==============================] - 3s - loss: 0.4029 - acc: 0.8469 - val_loss: 0.4123 - val_acc: 0.8358
Epoch 6/10
20000/20000 [==============================] - 3s - loss: 0.3455 - acc: 0.8653 - val_loss: 0.4040 - val_acc: 0.8382
Epoch 7/10
20000/20000 [==============================] - 3s - loss: 0.3078 - acc: 0.8634 - val_loss: 0.4059 - val_acc: 0.8240
Epoch 8/10
20000/20000 [==============================] - 3s - loss: 0.2812 - acc: 0.8535 - val_loss: 0.4147 - val_acc: 0.8098
Epoch 9/10
20000/20000 [==============================] - 3s - loss: 0.2554 - acc: 0.8334 - val_loss: 0.4296 - val_acc: 0.7878
Epoch 10/10
20000/20000 [==============================] - 3s - loss: 0.2356 - acc: 0.8052 - val_loss: 0.4296 - val_acc: 0.7600

Here are our training and validation results: validation accuracy is somewhat lower than that of the LSTM we used two sections ago, but runtime is faster, both on CPU and GPU (albeit the exact speedup will vary greatly depending on your exact configuration). At that point, we could re-train this model for the right number of epochs (8), and run it on the test set. This is a convincing demonstration that a 1D convnet can offer a fast, cheap alternative to a recurrent network on a word-level sentiment classification task.

In [4]:
import matplotlib.pyplot as plt

acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(len(acc))

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()

plt.figure()

plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()

plt.show()

Combining CNNs and RNNs to process long sequences

Because 1D convnets process input patches independently, they are not sensitive to the order of the timesteps (beyond a local scale, the size of the convolution windows), unlike RNNs. Of course, in order to be able to recognize longer-term patterns, one could stack many convolution layers and pooling layers, resulting in upper layers that would "see" long chunks of the original inputs -- but that's still a fairly weak way to induce order-sensitivity. One way to evidence this weakness is to try 1D convnets on the temperature forecasting problem from the previous section, where order-sensitivity was key to produce good predictions. Let's see:

In [2]:
# We reuse the following variables defined in the last section:
# float_data, train_gen, val_gen, val_steps

import os
import numpy as np

data_dir = '/home/ubuntu/data/'
fname = os.path.join(data_dir, 'jena_climate_2009_2016.csv')

f = open(fname)
data = f.read()
f.close()

lines = data.split('\n')
header = lines[0].split(',')
lines = lines[1:]

float_data = np.zeros((len(lines), len(header) - 1))
for i, line in enumerate(lines):
    values = [float(x) for x in line.split(',')[1:]]
    float_data[i, :] = values
    
mean = float_data[:200000].mean(axis=0)
float_data -= mean
std = float_data[:200000].std(axis=0)
float_data /= std

def generator(data, lookback, delay, min_index, max_index,
              shuffle=False, batch_size=128, step=6):
    if max_index is None:
        max_index = len(data) - delay - 1
    i = min_index + lookback
    while 1:
        if shuffle:
            rows = np.random.randint(
                min_index + lookback, max_index, size=batch_size)
        else:
            if i + batch_size >= max_index:
                i = min_index + lookback
            rows = np.arange(i, min(i + batch_size, max_index))
            i += len(rows)

        samples = np.zeros((len(rows),
                           lookback // step,
                           data.shape[-1]))
        targets = np.zeros((len(rows),))
        for j, row in enumerate(rows):
            indices = range(rows[j] - lookback, rows[j], step)
            samples[j] = data[indices]
            targets[j] = data[rows[j] + delay][1]
        yield samples, targets
        
lookback = 1440
step = 6
delay = 144
batch_size = 128

train_gen = generator(float_data,
                      lookback=lookback,
                      delay=delay,
                      min_index=0,
                      max_index=200000,
                      shuffle=True,
                      step=step, 
                      batch_size=batch_size)
val_gen = generator(float_data,
                    lookback=lookback,
                    delay=delay,
                    min_index=200001,
                    max_index=300000,
                    step=step,
                    batch_size=batch_size)
test_gen = generator(float_data,
                     lookback=lookback,
                     delay=delay,
                     min_index=300001,
                     max_index=None,
                     step=step,
                     batch_size=batch_size)

# This is how many steps to draw from `val_gen`
# in order to see the whole validation set:
val_steps = (300000 - 200001 - lookback) // batch_size

# This is how many steps to draw from `test_gen`
# in order to see the whole test set:
test_steps = (len(float_data) - 300001 - lookback) // batch_size
In [3]:
from keras.models import Sequential
from keras import layers
from keras.optimizers import RMSprop

model = Sequential()
model.add(layers.Conv1D(32, 5, activation='relu',
                        input_shape=(None, float_data.shape[-1])))
model.add(layers.MaxPooling1D(3))
model.add(layers.Conv1D(32, 5, activation='relu'))
model.add(layers.MaxPooling1D(3))
model.add(layers.Conv1D(32, 5, activation='relu'))
model.add(layers.GlobalMaxPooling1D())
model.add(layers.Dense(1))

model.compile(optimizer=RMSprop(), loss='mae')
history = model.fit_generator(train_gen,
                              steps_per_epoch=500,
                              epochs=20,
                              validation_data=val_gen,
                              validation_steps=val_steps)
Epoch 1/20
500/500 [==============================] - 124s - loss: 0.4189 - val_loss: 0.4521
Epoch 2/20
500/500 [==============================] - 11s - loss: 0.3629 - val_loss: 0.4545
Epoch 3/20
500/500 [==============================] - 11s - loss: 0.3399 - val_loss: 0.4527
Epoch 4/20
500/500 [==============================] - 11s - loss: 0.3229 - val_loss: 0.4721
Epoch 5/20
500/500 [==============================] - 11s - loss: 0.3122 - val_loss: 0.4712
Epoch 6/20
500/500 [==============================] - 11s - loss: 0.3030 - val_loss: 0.4705
Epoch 7/20
500/500 [==============================] - 11s - loss: 0.2935 - val_loss: 0.4870
Epoch 8/20
500/500 [==============================] - 11s - loss: 0.2862 - val_loss: 0.4676
Epoch 9/20
500/500 [==============================] - 11s - loss: 0.2817 - val_loss: 0.4738
Epoch 10/20
500/500 [==============================] - 11s - loss: 0.2775 - val_loss: 0.4896
Epoch 11/20
500/500 [==============================] - 11s - loss: 0.2715 - val_loss: 0.4765
Epoch 12/20
500/500 [==============================] - 11s - loss: 0.2683 - val_loss: 0.4724
Epoch 13/20
500/500 [==============================] - 11s - loss: 0.2644 - val_loss: 0.4842
Epoch 14/20
500/500 [==============================] - 11s - loss: 0.2606 - val_loss: 0.4910
Epoch 15/20
500/500 [==============================] - 11s - loss: 0.2558 - val_loss: 0.5000
Epoch 16/20
500/500 [==============================] - 11s - loss: 0.2539 - val_loss: 0.4960
Epoch 17/20
500/500 [==============================] - 11s - loss: 0.2516 - val_loss: 0.4875
Epoch 18/20
500/500 [==============================] - 11s - loss: 0.2501 - val_loss: 0.4884
Epoch 19/20
500/500 [==============================] - 11s - loss: 0.2444 - val_loss: 0.5024
Epoch 20/20
500/500 [==============================] - 11s - loss: 0.2444 - val_loss: 0.4821

Here are our training and validation Mean Absolute Errors:

In [5]:
import matplotlib.pyplot as plt

loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(len(loss))

plt.figure()

plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()

plt.show()

The validation MAE stays in the low 0.40s: we cannot even beat our common-sense baseline using the small convnet. Again, this is because our convnet looks for patterns anywhere in the input timeseries, and has no knowledge of the temporal position of a pattern it sees (e.g. towards the beginning, towards the end, etc.). Since more recent datapoints should be interpreted differently from older datapoints in the case of this specific forecasting problem, the convnet fails at producing meaningful results here. This limitation of convnets was not an issue on IMDB, because patterns of keywords that are associated with a positive or a negative sentiment will be informative independently of where they are found in the input sentences.

One strategy to combine the speed and lightness of convnets with the order-sensitivity of RNNs is to use a 1D convnet as a preprocessing step before a RNN. This is especially beneficial when dealing with sequences that are so long that they couldn't realistically be processed with RNNs, e.g. sequences with thousands of steps. The convnet will turn the long input sequence into much shorter (downsampled) sequences of higher-level features. This sequence of extracted features then becomes the input to the RNN part of the network.

This technique is not seen very often in research papers and practical applications, possibly because it is not very well known. It is very effective and ought to be more common. Let's try this out on the temperature forecasting dataset. Because this strategy allows us to manipulate much longer sequences, we could either look at data from further back (by increasing the lookback parameter of the data generator), or look at high-resolution timeseries (by decreasing the step parameter of the generator). Here, we will chose (somewhat arbitrarily) to use a step twice smaller, resulting in twice longer timeseries, where the weather data is being sampled at a rate of one point per 30 minutes.

In [11]:
# This was previously set to 6 (one point per hour).
# Now 3 (one point per 30 min).
step = 3
lookback = 720  # Unchanged
delay = 144 # Unchanged

train_gen = generator(float_data,
                      lookback=lookback,
                      delay=delay,
                      min_index=0,
                      max_index=200000,
                      shuffle=True,
                      step=step)
val_gen = generator(float_data,
                    lookback=lookback,
                    delay=delay,
                    min_index=200001,
                    max_index=300000,
                    step=step)
test_gen = generator(float_data,
                     lookback=lookback,
                     delay=delay,
                     min_index=300001,
                     max_index=None,
                     step=step)
val_steps = (300000 - 200001 - lookback) // 128
test_steps = (len(float_data) - 300001 - lookback) // 128

This is our model, starting with two Conv1D layers and following-up with a GRU layer:

In [12]:
model = Sequential()
model.add(layers.Conv1D(32, 5, activation='relu',
                        input_shape=(None, float_data.shape[-1])))
model.add(layers.MaxPooling1D(3))
model.add(layers.Conv1D(32, 5, activation='relu'))
model.add(layers.GRU(32, dropout=0.1, recurrent_dropout=0.5))
model.add(layers.Dense(1))

model.summary()

model.compile(optimizer=RMSprop(), loss='mae')
history = model.fit_generator(train_gen,
                              steps_per_epoch=500,
                              epochs=20,
                              validation_data=val_gen,
                              validation_steps=val_steps)
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv1d_6 (Conv1D)            (None, None, 32)          2272      
_________________________________________________________________
max_pooling1d_4 (MaxPooling1 (None, None, 32)          0         
_________________________________________________________________
conv1d_7 (Conv1D)            (None, None, 32)          5152      
_________________________________________________________________
gru_1 (GRU)                  (None, 32)                6240      
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 33        
=================================================================
Total params: 13,697
Trainable params: 13,697
Non-trainable params: 0
_________________________________________________________________
Epoch 1/20
500/500 [==============================] - 60s - loss: 0.3387 - val_loss: 0.3030
Epoch 2/20
500/500 [==============================] - 58s - loss: 0.3055 - val_loss: 0.2864
Epoch 3/20
500/500 [==============================] - 58s - loss: 0.2904 - val_loss: 0.2841
Epoch 4/20
500/500 [==============================] - 58s - loss: 0.2830 - val_loss: 0.2730
Epoch 5/20
500/500 [==============================] - 58s - loss: 0.2767 - val_loss: 0.2757
Epoch 6/20
500/500 [==============================] - 58s - loss: 0.2696 - val_loss: 0.2819
Epoch 7/20
500/500 [==============================] - 57s - loss: 0.2642 - val_loss: 0.2787
Epoch 8/20
500/500 [==============================] - 57s - loss: 0.2595 - val_loss: 0.2920
Epoch 9/20
500/500 [==============================] - 58s - loss: 0.2546 - val_loss: 0.2919
Epoch 10/20
500/500 [==============================] - 57s - loss: 0.2506 - val_loss: 0.2772
Epoch 11/20
500/500 [==============================] - 58s - loss: 0.2459 - val_loss: 0.2801
Epoch 12/20
500/500 [==============================] - 57s - loss: 0.2433 - val_loss: 0.2807
Epoch 13/20
500/500 [==============================] - 58s - loss: 0.2411 - val_loss: 0.2855
Epoch 14/20
500/500 [==============================] - 58s - loss: 0.2360 - val_loss: 0.2858
Epoch 15/20
500/500 [==============================] - 57s - loss: 0.2350 - val_loss: 0.2834
Epoch 16/20
500/500 [==============================] - 58s - loss: 0.2315 - val_loss: 0.2917
Epoch 17/20
500/500 [==============================] - 60s - loss: 0.2285 - val_loss: 0.2944
Epoch 18/20
500/500 [==============================] - 57s - loss: 0.2280 - val_loss: 0.2923
Epoch 19/20
500/500 [==============================] - 57s - loss: 0.2249 - val_loss: 0.2910
Epoch 20/20
500/500 [==============================] - 57s - loss: 0.2215 - val_loss: 0.2952
In [13]:
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(len(loss))

plt.figure()

plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()

plt.show()
<matplotlib.figure.Figure at 0x7f707ade8128>

Judging from the validation loss, this setup is not quite as good as the regularized GRU alone, but it's significantly faster. It is looking at twice more data, which in this case doesn't appear to be hugely helpful, but may be important for other datasets.

Wrapping up

Here's what you should take away from this section:

  • In the same way that 2D convnets perform well for processing visual patterns in 2D space, 1D convnets perform well for processing temporal patterns. They offer a faster alternative to RNNs on some problems, in particular NLP tasks.
  • Typically 1D convnets are structured much like their 2D equivalents from the world of computer vision: they consist of stacks of Conv1D layers and MaxPooling1D layers, eventually ending in a global pooling operation or flattening operation.
  • Because RNNs are extremely expensive for processing very long sequences, but 1D convnets are cheap, it can be a good idea to use a 1D convnet as a preprocessing step before a RNN, shortening the sequence and extracting useful representations for the RNN to process.

One useful and important concept that we will not cover in these pages is that of 1D convolution with dilated kernels.