Keras implementation of Character-Aware Neural Language Models by Yoon Kim et al.

Introduction/Disclaimer

This notebook tries to understand the structure of, and implement, the character-level RNN given in [Kim, 2016](https://arxiv.org/abs/1508.06615). There is [code](https://github.com/yoonkim/lstm-char-cnn) available online by the authors, but it is written in Torch (and several PyTorch implementations exist). The current notebook aims to help people recreate such models in Keras, rather than to provide a model that works out-of-the-box. I am also not yet fully confident about my own approach, so if you have any comments, please open an issue ticket.

Description

The model is a character-input sequential model for next-word prediction. It accepts a word as a series of characters (represented as integers) and outputs a probability distribution over the next word. It differs from other character RNN models in that it considers a whole word at each timestep, instead of a single character, and from word RNN models in that it accepts characters-as-integers rather than words-as-integers. Below we show how we pre-process a corpus .txt file to feed into the model, how we build the model in Keras, and finally how we train & evaluate it.

Requirements

Python packages keras, numpy and tensorflow (it might also work with Theano, but I haven't tried it). A GPU with CUDA is also highly recommended.

Preprocessing

Assume we have a corpus train.txt for training, valid.txt for validation and test.txt for testing. We split each corpus into words.

Note: The corpus we use here is the same as the corpus used by Kim, Y. et al. in their repo. The prepared corpus has some words that appear only once (e.g. some very rare proper names); these have been replaced with the token <unk>.

Training has words $W_{tr}=[w_1,w_2,...,w_T]$. Validation has words $W_{vd}$ and testing $W_{ts}$.

In [1]:
import numpy as np

with open('train.txt') as f:
    W_tr = f.read().split()
    
with open('valid.txt') as f:
    W_vd = f.read().split()
    
with open('test.txt') as f:
    W_ts = f.read().split()

For each word $w_t$, $t = 1 \dots T$, in the training set:

  1. Convert $w_t$ into a series of characters: $w_t = [c_1, c_2, ..., c_N]$
  2. Assign each word $w_t$ and each character $c_n$ an integer $t$ and $n$ respectively in the Look Up Tables (LUTs) word_idx and char_idx. Also create the inverse LUTs idx_word and idx_char that map from integers back to words (and characters).
  3. Replace $c_n$ and $w_t$ with $n$ and $t$ respectively.

Note: We also include the characters begin and end (they are not Python characters, but for our purposes we can treat them as such) to signify the beginning and end of each word.

Note 2: We assign the integer 0 to the character '' (no character). Since we use Keras, we need to pad our inputs, and padding is done by filling with 0. It is therefore important that 0 does not correspond to a valid character. (The word LUT is only used for one-hot outputs, which are never padded, so it can safely start at 0.)

In [2]:
# First gather all the words and characters. 
words_training = list(set(W_tr))
chars_training = []
for word in words_training:
    for char in word:
        if char not in chars_training:
            chars_training.append(char)
            
chars_training.append('begin')
chars_training.append('end')

# Create the look-up tables as python dictionaries
word_idx = {w: n for n, w in enumerate(words_training)}

# Characters start at 1; 0 is reserved for padding (see Note 2 above)
char_idx = {c: n + 1 for n, c in enumerate(chars_training)}
char_idx[''] = 0

# Also create the inverse mappings as python dictionaries
idx_word = {n: w for w, n in word_idx.items()}
idx_char = {n: c for c, n in char_idx.items()}
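
A quick sanity check of the LUTs (illustrative only; the exact integers depend on the corpus and on Python's set ordering, so your values will differ):

w = words_training[0]
assert idx_word[word_idx[w]] == w               # word round-trip
assert idx_char[char_idx['begin']] == 'begin'   # character round-trip
print(w, '->', word_idx[w], '| begin ->', char_idx['begin'], '| pad ->', char_idx[''])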

In order to train the model later, we need to turn our corpora into pairs of (Input, Output). We do this with the function below (it accepts a corpus as a list of words). We make this a function since we are going to use it several times below.

Note: Since <unk> stands in for words that appear only once, such pairs could be omitted from training; here we keep them (see the commented-out filter below).

In [3]:
def prepare_inputs_outputs(words):
    # Remember that the output is a single word.
    output_words = [] 
    
    # The input is a word split into characters. 
    input_words = []
    
    for n in range(len(words) - 1):
        # Uncomment the check below to omit pairs involving <unk>
        # (words that appear only once):
        #if words[n] != '<unk>' and words[n+1] != '<unk>':
        input_words.append(words[n])
        output_words.append(words[n+1])
            
    # The input words split into sequence of characters
    input_seqs = []
    for n in range(len(input_words)):
        input_seqs.append(
            # Remember that each word starts with the character begin and ends with end.
            ['begin'] + [c for c in input_words[n]] + ['end']
        )
        
    # Final input, output
    inputs = input_seqs
    outputs = output_words
    
    return inputs, outputs

    
inputs_tr, outputs_tr = prepare_inputs_outputs(W_tr)
inputs_vd, outputs_vd = prepare_inputs_outputs(W_vd)
inputs_ts, outputs_ts = prepare_inputs_outputs(W_ts)


print("First 10 (input, output) pairs in training corpus: ")
for n in range(10):

    print('[Sample {}]'.format(n), 'Input:',inputs_tr[n], 'Output:',outputs_tr[n])
First 10 (input, output) pairs in training corpus: 
[Sample 0] Input: ['begin', 'a', 'e', 'r', 'end'] Output: banknote
[Sample 1] Input: ['begin', 'b', 'a', 'n', 'k', 'n', 'o', 't', 'e', 'end'] Output: berlitz
[Sample 2] Input: ['begin', 'b', 'e', 'r', 'l', 'i', 't', 'z', 'end'] Output: calloway
[Sample 3] Input: ['begin', 'c', 'a', 'l', 'l', 'o', 'w', 'a', 'y', 'end'] Output: centrust
[Sample 4] Input: ['begin', 'c', 'e', 'n', 't', 'r', 'u', 's', 't', 'end'] Output: cluett
[Sample 5] Input: ['begin', 'c', 'l', 'u', 'e', 't', 't', 'end'] Output: fromstein
[Sample 6] Input: ['begin', 'f', 'r', 'o', 'm', 's', 't', 'e', 'i', 'n', 'end'] Output: gitano
[Sample 7] Input: ['begin', 'g', 'i', 't', 'a', 'n', 'o', 'end'] Output: guterman
[Sample 8] Input: ['begin', 'g', 'u', 't', 'e', 'r', 'm', 'a', 'n', 'end'] Output: hydro-quebec
[Sample 9] Input: ['begin', 'h', 'y', 'd', 'r', 'o', '-', 'q', 'u', 'e', 'b', 'e', 'c', 'end'] Output: ipo

In the paper we see that training is done in groups of seq_len=35 timesteps for non-Arabic corpora, so we group the inputs and outputs accordingly:

In [4]:
seq_len = 35

inputs_tr_seq = []
outputs_tr_seq = []

for n in range(0, len(inputs_tr)-seq_len, seq_len):
    # rearrange the input into groups of seq_len samples;
    # say the input has words-as-character-sequences W = [W_1, W_2, ..., W_N]
    input_seq = inputs_tr[n:n+seq_len]
    
    # Then, since the goal is to predict the next word, the output
    # has the corresponding sequence of words-as-words W' = [W_2, W_3, ..., W_N+1]
    output_seq = outputs_tr[n:n+seq_len]
    
    inputs_tr_seq.append(input_seq)
    outputs_tr_seq.append(output_seq)
    
# Before, each (characters, word) pair was a single sample. Now a single sample
# is a pair of sequences: ((characters, ..., characters), (word, ..., word)).

# Example.
for n in range(2):
    print('[Sample {}]'.format(n))
    print('Input:\n',inputs_tr_seq[n])
    print('Output:\n',outputs_tr_seq[n])   
    print('')
[Sample 0]
Input:
 [['begin', 'a', 'e', 'r', 'end'], ['begin', 'b', 'a', 'n', 'k', 'n', 'o', 't', 'e', 'end'], ['begin', 'b', 'e', 'r', 'l', 'i', 't', 'z', 'end'], ['begin', 'c', 'a', 'l', 'l', 'o', 'w', 'a', 'y', 'end'], ['begin', 'c', 'e', 'n', 't', 'r', 'u', 's', 't', 'end'], ['begin', 'c', 'l', 'u', 'e', 't', 't', 'end'], ['begin', 'f', 'r', 'o', 'm', 's', 't', 'e', 'i', 'n', 'end'], ['begin', 'g', 'i', 't', 'a', 'n', 'o', 'end'], ['begin', 'g', 'u', 't', 'e', 'r', 'm', 'a', 'n', 'end'], ['begin', 'h', 'y', 'd', 'r', 'o', '-', 'q', 'u', 'e', 'b', 'e', 'c', 'end'], ['begin', 'i', 'p', 'o', 'end'], ['begin', 'k', 'i', 'a', 'end'], ['begin', 'm', 'e', 'm', 'o', 't', 'e', 'c', 'end'], ['begin', 'm', 'l', 'x', 'end'], ['begin', 'n', 'a', 'h', 'b', 'end'], ['begin', 'p', 'u', 'n', 't', 's', 'end'], ['begin', 'r', 'a', 'k', 'e', 'end'], ['begin', 'r', 'e', 'g', 'a', 't', 't', 'a', 'end'], ['begin', 'r', 'u', 'b', 'e', 'n', 's', 'end'], ['begin', 's', 'i', 'm', 'end'], ['begin', 's', 'n', 'a', 'c', 'k', '-', 'f', 'o', 'o', 'd', 'end'], ['begin', 's', 's', 'a', 'n', 'g', 'y', 'o', 'n', 'g', 'end'], ['begin', 's', 'w', 'a', 'p', 'o', 'end'], ['begin', 'w', 'a', 'c', 'h', 't', 'e', 'r', 'end'], ['begin', 'p', 'i', 'e', 'r', 'r', 'e', 'end'], ['begin', '<', 'u', 'n', 'k', '>', 'end'], ['begin', 'N', 'end'], ['begin', 'y', 'e', 'a', 'r', 's', 'end'], ['begin', 'o', 'l', 'd', 'end'], ['begin', 'w', 'i', 'l', 'l', 'end'], ['begin', 'j', 'o', 'i', 'n', 'end'], ['begin', 't', 'h', 'e', 'end'], ['begin', 'b', 'o', 'a', 'r', 'd', 'end'], ['begin', 'a', 's', 'end'], ['begin', 'a', 'end']]
Output:
 ['banknote', 'berlitz', 'calloway', 'centrust', 'cluett', 'fromstein', 'gitano', 'guterman', 'hydro-quebec', 'ipo', 'kia', 'memotec', 'mlx', 'nahb', 'punts', 'rake', 'regatta', 'rubens', 'sim', 'snack-food', 'ssangyong', 'swapo', 'wachter', 'pierre', '<unk>', 'N', 'years', 'old', 'will', 'join', 'the', 'board', 'as', 'a', 'nonexecutive']

[Sample 1]
Input:
 [['begin', 'n', 'o', 'n', 'e', 'x', 'e', 'c', 'u', 't', 'i', 'v', 'e', 'end'], ['begin', 'd', 'i', 'r', 'e', 'c', 't', 'o', 'r', 'end'], ['begin', 'n', 'o', 'v', '.', 'end'], ['begin', 'N', 'end'], ['begin', 'm', 'r', '.', 'end'], ['begin', '<', 'u', 'n', 'k', '>', 'end'], ['begin', 'i', 's', 'end'], ['begin', 'c', 'h', 'a', 'i', 'r', 'm', 'a', 'n', 'end'], ['begin', 'o', 'f', 'end'], ['begin', '<', 'u', 'n', 'k', '>', 'end'], ['begin', 'n', '.', 'v', '.', 'end'], ['begin', 't', 'h', 'e', 'end'], ['begin', 'd', 'u', 't', 'c', 'h', 'end'], ['begin', 'p', 'u', 'b', 'l', 'i', 's', 'h', 'i', 'n', 'g', 'end'], ['begin', 'g', 'r', 'o', 'u', 'p', 'end'], ['begin', 'r', 'u', 'd', 'o', 'l', 'p', 'h', 'end'], ['begin', '<', 'u', 'n', 'k', '>', 'end'], ['begin', 'N', 'end'], ['begin', 'y', 'e', 'a', 'r', 's', 'end'], ['begin', 'o', 'l', 'd', 'end'], ['begin', 'a', 'n', 'd', 'end'], ['begin', 'f', 'o', 'r', 'm', 'e', 'r', 'end'], ['begin', 'c', 'h', 'a', 'i', 'r', 'm', 'a', 'n', 'end'], ['begin', 'o', 'f', 'end'], ['begin', 'c', 'o', 'n', 's', 'o', 'l', 'i', 'd', 'a', 't', 'e', 'd', 'end'], ['begin', 'g', 'o', 'l', 'd', 'end'], ['begin', 'f', 'i', 'e', 'l', 'd', 's', 'end'], ['begin', 'p', 'l', 'c', 'end'], ['begin', 'w', 'a', 's', 'end'], ['begin', 'n', 'a', 'm', 'e', 'd', 'end'], ['begin', 'a', 'end'], ['begin', 'n', 'o', 'n', 'e', 'x', 'e', 'c', 'u', 't', 'i', 'v', 'e', 'end'], ['begin', 'd', 'i', 'r', 'e', 'c', 't', 'o', 'r', 'end'], ['begin', 'o', 'f', 'end'], ['begin', 't', 'h', 'i', 's', 'end']]
Output:
 ['director', 'nov.', 'N', 'mr.', '<unk>', 'is', 'chairman', 'of', '<unk>', 'n.v.', 'the', 'dutch', 'publishing', 'group', 'rudolph', '<unk>', 'N', 'years', 'old', 'and', 'former', 'chairman', 'of', 'consolidated', 'gold', 'fields', 'plc', 'was', 'named', 'a', 'nonexecutive', 'director', 'of', 'this', 'british']

In order to actually use the inputs and outputs with our model, we convert them to integers. This is as straightforward as looking up each word and character in the LUTs defined above (word_idx for words, char_idx for characters) and replacing it with its corresponding integer. We could try that directly, but we would get a MemoryError: the one-hot output alone has len(outputs_tr_seq)*seq_len*len(words_training) ≈ 25000*35*9999 ≈ 9 billion elements, which requires over 32GB of storage even at 4 bytes per element! We will take a different approach and construct a generator: a function that yields a single batch of input paired with a single batch of output on every call. We will later use that generator with the fit_generator method of our model.

Note: Since we are using a generator, we need to specify the number of samples in a batch in advance.
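
If you have not used Python generators before, here is a toy one (illustrative only, unrelated to the model) showing the yield-and-resume behaviour we rely on:

def counter(start=0):
    # Yields start, start+1, start+2, ... one value per next() call,
    # resuming from where it left off each time.
    n = start
    while 1:
        yield n
        n += 1

c = counter()
print(next(c), next(c), next(c))  # 0 1 2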

In [5]:
import numpy as np

# Since in keras pretty much everything 
# has to be of fixed size, figure out the
# maximum number of characters in a 
# word, and add 2 (for the characters begin and end)
max_word_len = max([len(w) for w in words_training]) + 2

## Uncomment the code below to get a MemoryError

# Create the output array for the inputs (this will probably give a MemoryError.
# Even if it doesn't, still use the generator below, since people will start
# complaining about you taking too much memory (assuming it is a shared machine)).
# y = np.zeros(
#     (len(outputs_tr_seq), # Same number of samples as input
#      seq_len,             # Same number of timesteps
#      len(words_training)  # Categorical distribution
#     )
# )

batch_size = 20

def generator(words,                 # The corpus as a list of words
              batch_size=batch_size, # The desired batch size
              seq_len=seq_len,       # The timesteps at each batch
              char_idx=char_idx,     # The character-to-integer LUT
              word_idx=word_idx,     # The word-to-integer LUT
             ):
    
    inputs, outputs = prepare_inputs_outputs(words) # Prepare the inputs and outputs as above
    
    x = np.zeros((batch_size, seq_len, max_word_len))
    y = np.zeros((batch_size, seq_len, len(word_idx)))
    
    I = 0                            # Since the generator never "terminates"
                                     # we traverse through our samples from the
                                     # beginning to the end, back to the 
                                     # beginning, etc...
    while 1:
        # Clear the vectors at each yield.
        x[:,:,:] = 0.0
        y[:,:,:] = 0.0
        
        for sample in range(batch_size):
            for time in range(seq_len):
                if I >= len(inputs):
                    I = 0            # If we are at the last sample and last timestep, 
                                     # move to the beginning.
                        
                for n, char in enumerate(inputs[I]):
                    x[sample, time, n] = char_idx[char]  # Replace with integer
                    
                y[sample, time, word_idx[outputs[I]]] = 1.0
                                                         # Set 1.0 to the position of 
                                                         # the vector given by the integer
                                                         # from the word LUT for the word
                                                         # at outputs[I]
                                
                I += 1 
        yield x, y
                    
# Example: Show the first 10 input/output shapes as vectors
gen = generator(W_tr)

I = 0
for x, y in gen:
    if I >= 10:
        break
    print("[Batch: {}]".format(I),"Input: ",x.shape,"Output:",y.shape)
    I += 1
[Batch: 0] Input:  (20, 35, 21) Output: (20, 35, 9999)
[Batch: 1] Input:  (20, 35, 21) Output: (20, 35, 9999)
[Batch: 2] Input:  (20, 35, 21) Output: (20, 35, 9999)
[Batch: 3] Input:  (20, 35, 21) Output: (20, 35, 9999)
[Batch: 4] Input:  (20, 35, 21) Output: (20, 35, 9999)
[Batch: 5] Input:  (20, 35, 21) Output: (20, 35, 9999)
[Batch: 6] Input:  (20, 35, 21) Output: (20, 35, 9999)
[Batch: 7] Input:  (20, 35, 21) Output: (20, 35, 9999)
[Batch: 8] Input:  (20, 35, 21) Output: (20, 35, 9999)
[Batch: 9] Input:  (20, 35, 21) Output: (20, 35, 9999)

Implementation

The model is given in Figure 1 of the paper and looks like the figure below:

[model figure]

Note that this is for one time step. At time $t$ the input is a word-as-string-of-characters $w_t = [c_1 c_2 \dots c_N]$ (where each $c_n$ is an integer), and the output is the probability distribution over all possible next words $w_{t+1}$.

Note that the LSTM stage has a recurrent connection to itself: besides the output of the Highway network at time $t$, its state at time $t$ also depends on its own output at time $t-1$.

In order to train it, we will "unroll" the LSTM for seq_len=35 time steps. Since the LSTM at each timestep also depends on the output of the highway network at that timestep, we also have to distribute the previous stages across time. The final model will then be:

[model_unrolled figure]

In the block diagram above, with pink we have the inputs/outputs and with teal the various components of the model. We will go about building each component separately as a keras model and then we will put them together in order to build the composite model.

Embeddings layer

The embeddings layer is quite simple: it takes as input a series of characters as integers $[c_1, c_2, ..., c_N]$ and outputs a matrix $V = [v_1 v_2 \dots v_N]$ where each row $v_i$ (of embeddings_size = 15 columns) is the embedding of character $c_i$. Let's implement it:

In [6]:
# First import the relevant keras modules
import keras
from keras import layers

# embeddings_size is the dimension of the embeddings vector of each character
embeddings_size = 15

# Because we want to build the composite model step-by-step
# each layer we define is going to be wrapped into a model
Embeddings = keras.Sequential()

Embeddings.add(
    layers.Embedding(
        len(char_idx) + 1,      # Input dim: the number of character indices (the +1 spare row is harmless)
        embeddings_size,        # Output is the dimension of the embeddings vector
        name = 'Embeddings',
        #embeddings_initializer='random_normal',
        input_shape=(max_word_len,)
    )
)
Embeddings.name = 'Embeddings'

# Change to False to keep the embeddings fixed as they are (as used in the paper). Setting it to True
# leads to a massive increase in performance.
Embeddings.trainable = True

# Show details to check on input/output dimensions
Embeddings.summary()

from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot

SVG(model_to_dot(Embeddings, show_shapes=True).create(prog='dot', format='svg'))
Using TensorFlow backend.
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
Embeddings (Embedding)       (None, 21, 15)            780       
=================================================================
Total params: 780
Trainable params: 780
Non-trainable params: 0
_________________________________________________________________
Out[6]:
[model plot: Embeddings_input (InputLayer), (None, 21) → Embeddings (Embedding), (None, 21, 15)]

In the paper, $V$ stacks the embeddings as column-vectors instead of row-vectors. The first dimension in (None, 21, 15) is the batch dimension; the second is the number of characters $N$ in each word (padded to max_word_len = 21). Let's see how the embeddings initially look (they will change during training).

In [7]:
# We generated some x's and y's above, use a slice for visualization purposes
sample_input = x[:, 0, :]
embeddings_output = Embeddings.predict(sample_input)

import matplotlib.pyplot as plt
%matplotlib inline

plt.figure(figsize=(21,5))
# Plot first word as integer characters
plt.plot(sample_input[0,:])
plt.xlabel('position in word')
plt.ylabel('character as integer')

plt.figure(figsize=(21,5))
plt.imshow(embeddings_output[0,:,:].T, aspect='auto')
plt.xlabel('position in word')
plt.ylabel('embedding dimension')
Out[7]:
Text(0,0.5,'embedding dimension')

Convolutional

The convolutional layer takes the embeddings and applies a number of 1D filters of various widths across the character dimension. For the small model described in the paper, the filter bank is:

num_filters   w
 25           1
 50           2
 75           3
100           4
125           5
150           6
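
As a sanity check on the parameter counts in the model summary further below: a Conv1D with kernel width $w$ and $k$ filters over the 15-dimensional embeddings has $15 \cdot w \cdot k + k$ weights, e.g. $15 \cdot 1 \cdot 25 + 25 = 400$ for the first row and $15 \cdot 2 \cdot 50 + 50 = 1550$ for the second.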

This can be implemented in keras using the functional API and a for loop. For each number of filters num_filters and width w in the table above:

  1. Create a Conv1D layer with num_filters filters of width w
  2. Append its output to a list L

(Max pooling over time with GlobalMaxPool1D is applied to the entries of L in the next section.)
In [8]:
num_filters_per_layer = [25, 50, 75, 100, 125, 150]

filter_widths_per_layer = [1, 2, 3, 4, 5, 6]

inputs = layers.Input(
    shape=(max_word_len, embeddings_size),  # The input to this layer comes from the output
                                            # of the embeddings layer, therefore must have 
                                            # the same shape as the embedding's output
    name='InputFromEmbeddings'
)

L = [] # Will keep record of the outputs of maxpooling

for n in range(len(num_filters_per_layer)):
    num_filters = num_filters_per_layer[n]
    w = filter_widths_per_layer[n]
    
    # Create a Conv1D layer
    x = layers.Conv1D(num_filters, 
                      w,
                      activation='tanh',  # Hyperbolic tangent is used as the activation function
                      name='Conv1D_{}_{}'.format(num_filters, w) # Give the layer a representative name
                      )(inputs)                                  # for debugging purposes.
    
    # Append to outputs that will be concatenated
    L.append(x)

# Again, wrap it into a model.

Convolutional = keras.Model(
    inputs=inputs,
    outputs=L,
    name='Convolutional'
)

Convolutional.summary()
SVG(model_to_dot(Convolutional, show_shapes=True).create(prog='dot', format='svg'))
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
InputFromEmbeddings (InputLayer (None, 21, 15)       0                                            
__________________________________________________________________________________________________
Conv1D_25_1 (Conv1D)            (None, 21, 25)       400         InputFromEmbeddings[0][0]        
__________________________________________________________________________________________________
Conv1D_50_2 (Conv1D)            (None, 20, 50)       1550        InputFromEmbeddings[0][0]        
__________________________________________________________________________________________________
Conv1D_75_3 (Conv1D)            (None, 19, 75)       3450        InputFromEmbeddings[0][0]        
__________________________________________________________________________________________________
Conv1D_100_4 (Conv1D)           (None, 18, 100)      6100        InputFromEmbeddings[0][0]        
__________________________________________________________________________________________________
Conv1D_125_5 (Conv1D)           (None, 17, 125)      9500        InputFromEmbeddings[0][0]        
__________________________________________________________________________________________________
Conv1D_150_6 (Conv1D)           (None, 16, 150)      13650       InputFromEmbeddings[0][0]        
==================================================================================================
Total params: 34,650
Trainable params: 34,650
Non-trainable params: 0
__________________________________________________________________________________________________
Out[8]:
[model plot: InputFromEmbeddings (None, 21, 15) feeding the six parallel Conv1D layers, Conv1D_25_1 through Conv1D_150_6]

Let's see how our input is transformed at this stage (remember the output from the embeddings is in embeddings_output, of shape (20, 21, 15)):

In [9]:
convolutional_output = Convolutional.predict(embeddings_output)
plt.figure(figsize=(21,5))

for n, outp in enumerate(convolutional_output):
    plt.subplot(1, len(convolutional_output), n+1)
    plt.imshow(outp[0,:,:].T)
    plt.title('Conv1D_{}_{}'.format(num_filters_per_layer[n],filter_widths_per_layer[n]))

MaxPoolingOverTime

MaxPooling over time is implemented by applying a GlobalMaxPool1D layer to each output of the Convolutional layer. Those outputs are then concatenated to give a fixed-size representation.

In [10]:
inputs = [] # list that holds the inputs, 
            # must be of similar size as the outputs of the Convolutional block
L = []      # Stores the outputs of the maxpooling in order to concatenate them later
    
for n in range(len(convolutional_output)):
    # This looks complicated but is really an input layer that matches
    # the output of the convolutional layer.
    inputs.append(
        layers.Input(
            shape=(
                convolutional_output[n].shape[1], 
                convolutional_output[n].shape[2]
            ),
            name = 'FromConvolutional_{}'.format(n)
        
        )
    )
    
    # Max Pooling over time for input n
    x = layers.GlobalMaxPool1D(name='MaxPoolingOverTime_{}'.format(n))(inputs[n])
    
    # Append the output of max pooling to L
    L.append(x)
    
outputs = layers.Concatenate()(L) # Concatenate the outputs in L

# Wrap it into a model 
MaxPoolingOverTime = keras.Model(
    inputs = inputs,
    outputs = outputs,
    name = 'MaxPoolingOverTime'
)

MaxPoolingOverTime.summary()

# Block diagram
SVG(model_to_dot(MaxPoolingOverTime, show_shapes=True).create(prog='dot', format='svg'))
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
FromConvolutional_0 (InputLayer (None, 21, 25)       0                                            
__________________________________________________________________________________________________
FromConvolutional_1 (InputLayer (None, 20, 50)       0                                            
__________________________________________________________________________________________________
FromConvolutional_2 (InputLayer (None, 19, 75)       0                                            
__________________________________________________________________________________________________
FromConvolutional_3 (InputLayer (None, 18, 100)      0                                            
__________________________________________________________________________________________________
FromConvolutional_4 (InputLayer (None, 17, 125)      0                                            
__________________________________________________________________________________________________
FromConvolutional_5 (InputLayer (None, 16, 150)      0                                            
__________________________________________________________________________________________________
MaxPoolingOverTime_0 (GlobalMax (None, 25)           0           FromConvolutional_0[0][0]        
__________________________________________________________________________________________________
MaxPoolingOverTime_1 (GlobalMax (None, 50)           0           FromConvolutional_1[0][0]        
__________________________________________________________________________________________________
MaxPoolingOverTime_2 (GlobalMax (None, 75)           0           FromConvolutional_2[0][0]        
__________________________________________________________________________________________________
MaxPoolingOverTime_3 (GlobalMax (None, 100)          0           FromConvolutional_3[0][0]        
__________________________________________________________________________________________________
MaxPoolingOverTime_4 (GlobalMax (None, 125)          0           FromConvolutional_4[0][0]        
__________________________________________________________________________________________________
MaxPoolingOverTime_5 (GlobalMax (None, 150)          0           FromConvolutional_5[0][0]        
__________________________________________________________________________________________________
concatenate_1 (Concatenate)     (None, 525)          0           MaxPoolingOverTime_0[0][0]       
                                                                 MaxPoolingOverTime_1[0][0]       
                                                                 MaxPoolingOverTime_2[0][0]       
                                                                 MaxPoolingOverTime_3[0][0]       
                                                                 MaxPoolingOverTime_4[0][0]       
                                                                 MaxPoolingOverTime_5[0][0]       
==================================================================================================
Total params: 0
Trainable params: 0
Non-trainable params: 0
__________________________________________________________________________________________________
Out[10]:
[model plot: the six FromConvolutional_n inputs, each through its MaxPoolingOverTime_n (GlobalMaxPooling1D), concatenated by concatenate_1 into (None, 525)]

Here is an example of the output at this layer for one word:

In [11]:
maxpooling_output = MaxPoolingOverTime.predict(convolutional_output)
plt.figure(figsize=(21,11))
plt.plot(maxpooling_output[0,:])
plt.xlabel('coefficient')
plt.ylabel('value')
plt.title('Concatenate')
Out[11]:
Text(0.5,1,'Concatenate')

Highway network

A Highway network lives up to its name: it transforms its input while also carrying part of it through to the output intact. It is easily implemented using the functional API (see the paper for the equations, reproduced below):
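
For reference, with $y$ the input, the paper computes

$$z = t \odot g(W_H y + b_H) + (1 - t) \odot y, \qquad t = \sigma(W_T y + b_T)$$

where $t$ is the transform gate, $(1 - t)$ the carry gate, and $g$ a nonlinearity (ReLU in the code below).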

In [12]:
inputs = layers.Input(shape=(
    maxpooling_output.shape[1],))  # The input must match the output of
                                   # the MaxPoolingOverTime layer

transform_gate = layers.Dense(maxpooling_output.shape[1],
                              activation='sigmoid',
                              name='transform_gate')(inputs)

carry_gate = layers.Lambda(lambda x: 1-x,
                           name='carry_gate')(transform_gate)

z = layers.Add()([
    layers.Multiply()([
        transform_gate,
        layers.Dense(maxpooling_output.shape[1],
                     activation='relu')(inputs)
    ]),
    layers.Multiply()([carry_gate, inputs])
])

Highway = keras.Model(
    inputs=inputs, 
    outputs=z,
    name='Highway'
)

Highway.summary()

# Block diagram
SVG(model_to_dot(Highway, show_shapes=True, show_layer_names=True).create(prog='dot', format='svg'))
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_1 (InputLayer)            (None, 525)          0                                            
__________________________________________________________________________________________________
transform_gate (Dense)          (None, 525)          276150      input_1[0][0]                    
__________________________________________________________________________________________________
dense_1 (Dense)                 (None, 525)          276150      input_1[0][0]                    
__________________________________________________________________________________________________
carry_gate (Lambda)             (None, 525)          0           transform_gate[0][0]             
__________________________________________________________________________________________________
multiply_1 (Multiply)           (None, 525)          0           transform_gate[0][0]             
                                                                 dense_1[0][0]                    
__________________________________________________________________________________________________
multiply_2 (Multiply)           (None, 525)          0           carry_gate[0][0]                 
                                                                 input_1[0][0]                    
__________________________________________________________________________________________________
add_1 (Add)                     (None, 525)          0           multiply_1[0][0]                 
                                                                 multiply_2[0][0]                 
==================================================================================================
Total params: 552,300
Trainable params: 552,300
Non-trainable params: 0
__________________________________________________________________________________________________
Out[12]:
[model plot: input_1 → transform_gate (Dense, sigmoid) and dense_1 (Dense, relu); carry_gate = 1 − transform_gate (Lambda); multiply_1 = transform_gate × dense_1; multiply_2 = carry_gate × input_1; add_1 sums them into (None, 525)]

As before, an example of the output:

In [13]:
highway_output = Highway.predict(maxpooling_output)
plt.figure(figsize=(21,11))
plt.plot(maxpooling_output[0])
plt.plot(highway_output[0])
plt.legend(['maxpooling_output', 'highway_output'])
Out[13]:
<matplotlib.legend.Legend at 0x7fc7c35337b8>

Combining Embedding, Convolutional, MaxPool, and Highway

Before implementing our LSTM layer, we will combine the above layers into a FeatureExtract layer. We will then distribute that layer across time for all time instants $t-\mathrm{seq\_len}+1, \dots, t-1, t$ (the same FeatureExtract weights are shared across timesteps), and provide the result to the LSTM layer unrolled for those time instants.

In [14]:
inputs = Embeddings.inputs
x = Embeddings(inputs=inputs)
x = Convolutional(inputs=x)
x = MaxPoolingOverTime(inputs=x)
x = Highway(inputs=x)

FeatureExtract = keras.Model(inputs=inputs, outputs=x)
FeatureExtract.summary()

# Block diagram
SVG(model_to_dot(FeatureExtract, show_shapes=True).create(prog='dot', format='svg'))
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
Embeddings_input (InputLayer)   (None, 21)           0                                            
__________________________________________________________________________________________________
Embeddings (Sequential)         (None, 21, 15)       780         Embeddings_input[0][0]           
__________________________________________________________________________________________________
Convolutional (Model)           [(None, 21, 25), (No 34650       Embeddings[1][0]                 
__________________________________________________________________________________________________
MaxPoolingOverTime (Model)      (None, 525)          0           Convolutional[1][0]              
                                                                 Convolutional[1][1]              
                                                                 Convolutional[1][2]              
                                                                 Convolutional[1][3]              
                                                                 Convolutional[1][4]              
                                                                 Convolutional[1][5]              
__________________________________________________________________________________________________
Highway (Model)                 (None, 525)          552300      MaxPoolingOverTime[1][0]         
==================================================================================================
Total params: 587,730
Trainable params: 587,730
Non-trainable params: 0
__________________________________________________________________________________________________
Out[14]:
G 140498816003824 Embeddings_input: InputLayer input: output: (None, 21) (None, 21) 140498816004048 Embeddings: Sequential input: output: (None, 21) (None, 21, 15) 140498816003824->140498816004048 140496709026480 Convolutional: Model input: output: (None, 21, 15) [(None, 21, 25), (None, 20, 50), (None, 19, 75), (None, 18, 100), (None, 17, 125), (None, 16, 150)] 140498816004048->140496709026480 140495952899880 MaxPoolingOverTime: Model input: output: [(None, 21, 25), (None, 20, 50), (None, 19, 75), (None, 18, 100), (None, 17, 125), (None, 16, 150)] (None, 525) 140496709026480->140495952899880 140495943115272 Highway: Model input: output: (None, 525) (None, 525) 140495952899880->140495943115272

FeatureExtract processes one word at a time, but at training time we process sequences of words of length seq_len=35. We are going to apply a FeatureExtract at each timestep with the TimeDistributed wrapper, and finally add 2 LSTM layers with hidden_units = 300 hidden units and a time-distributed Dense layer. In the paper there is a softmax activation after the last fully connected layer. We will omit it, however, and defer its calculation to K.categorical_crossentropy(target, pred, from_logits=True).
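
As a sanity check on the LSTM sizes in the summary below: an LSTM with input dimension $d$ and $h$ hidden units has $4(dh + h^2 + h)$ parameters, so RNN1 has $4(525 \cdot 300 + 300^2 + 300) = 991{,}200$ and RNN2 has $4(300 \cdot 300 + 300^2 + 300) = 721{,}200$, matching the summary.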

In [15]:
CharRNN = keras.Sequential()
CharRNN.add(
    layers.TimeDistributed(
        FeatureExtract,
        input_shape=(seq_len,max_word_len), # We need to declare shape
        name='SequenceFeatureExtract'
    )
)

# Add the 2 LSTM layers
hidden_units = 300

CharRNN.add(
    layers.LSTM(
        hidden_units,
        return_sequences=True,
        name='RNN1'
    )
)

# Add a dropout of 0.5 between the two LSTM layers
CharRNN.add(
    layers.Dropout(0.5)
)

CharRNN.add(
    layers.LSTM(
        hidden_units,
        return_sequences=True,
        name='RNN2'
    )
)

# Add a dropout of 0.5 to the hidden->softmax connection
CharRNN.add(
    layers.Dropout(0.5)
)


CharRNN.add(
    layers.TimeDistributed(
        layers.Dense(
            len(word_idx), 
            activation='linear' # See the third link in the credits. 
                                # Softmax is calculated in the cross entropy loss
                                # by tensorflow.
        )
    )
)

CharRNN.summary()

# Block diagram
SVG(model_to_dot(CharRNN, show_shapes=True).create(prog='dot', format='svg'))
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
SequenceFeatureExtract (Time (None, 35, 525)           587730    
_________________________________________________________________
RNN1 (LSTM)                  (None, 35, 300)           991200    
_________________________________________________________________
dropout_1 (Dropout)          (None, 35, 300)           0         
_________________________________________________________________
RNN2 (LSTM)                  (None, 35, 300)           721200    
_________________________________________________________________
dropout_2 (Dropout)          (None, 35, 300)           0         
_________________________________________________________________
time_distributed_1 (TimeDist (None, 35, 9999)          3009699   
=================================================================
Total params: 5,309,829
Trainable params: 5,309,829
Non-trainable params: 0
_________________________________________________________________
Out[15]:
[model plot: SequenceFeatureExtract_input (None, 35, 21) → TimeDistributed(FeatureExtract) (None, 35, 525) → RNN1 (LSTM) → dropout_1 → RNN2 (LSTM) → dropout_2 → TimeDistributed(Dense) (None, 35, 9999)]

In the paper, SGD is used for training with an initial learning rate of 1.0 (halved if the loss does not improve after 1 epoch) and a gradient clipping value of 5. First we define the loss (cross-entropy on the logits) and the perplexity metric $PPL$:

In [16]:
import keras.backend as K

# See the second link in the credits for this. Instead of computing a final
# softmax, we defer it to the cross-entropy computation (from_logits=True).
def categorical_crossentropy(y_true, y_pred):
    return K.categorical_crossentropy(y_true, y_pred, from_logits=True)

# Note: K.categorical_crossentropy is in nats, so this base-2 perplexity
# underestimates the conventional (base-e) one; see the note after training.
def PPL(y_true, y_pred):
    return K.pow(2.0, K.mean(K.categorical_crossentropy(y_true, y_pred, from_logits=True)))
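
For comparison, here is a nats-based perplexity (an assumption about the convention on my part, not something from the paper): since K.categorical_crossentropy is in natural-log units, the conventional perplexity is its exponential.

def PPL_nats(y_true, y_pred):
    # exp of the mean cross-entropy in nats; for the same loss value CE
    # this exceeds 2**CE by a factor of exp(CE * (1 - ln 2)).
    return K.exp(K.mean(K.categorical_crossentropy(y_true, y_pred, from_logits=True)))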
    

Then the optimizer for the model (a norm clipping value of 5.0 is applied):

In [17]:
from keras.optimizers import SGD

opt = SGD(lr=1.0, clipnorm=5.)

Finally compile the model. It is also a good idea to save the untrained model.

In [18]:
CharRNN.compile(loss=categorical_crossentropy, metrics=['acc', PPL], optimizer=opt)
CharRNN.save('char-rnn-cnn_untrained.hdf5')

We can now train the model. First we define several callbacks:

  1. ReduceLROnPlateau will halve the learning rate if the monitored quantity does not improve after one epoch.
  2. ModelCheckpoint (optional but recommended) will save a checkpoint of the model at each epoch.
  3. TensorBoard (optional) will use Google's TensorBoard to show training progress.

Finally, we train the model using the fit_generator method with the generator objects defined above.

Note: You can observe training progress with TensorBoard by running tensorboard --logdir=logs in the directory where this notebook resides.

In [19]:
from keras.callbacks import ReduceLROnPlateau, ModelCheckpoint, TensorBoard
from time import strftime

model_name = 'char-rnn-cnn'

# Define generators: train_gen for training, val_gen for validation (see above)
train_gen = generator(W_tr)
val_gen = generator(W_vd, batch_size=1)

# Callbacks (see above). Comment out the ones you don't want.
callbacks = [
    ReduceLROnPlateau(
        factor=0.5,                       # new LR = old LR * 0.5
        patience=1,                       # After one epoch
        #min_lr=0.000976562,              # Minimum value of the learning rate
        monitor='val_PPL',                # Monitor validation perplexity
        epsilon=1.                        # Amount below which a plateau is declared
    ),    
    ModelCheckpoint('checkpoints/{}_{{epoch:02d}}-{{val_loss:.2f}}.hdf5'.format(model_name)),
    TensorBoard(log_dir='logs/{}/{}'.format(
                    model_name,
                    strftime('%d-%m-%y/%H:%M:%S')
                    ),
                write_images=True,
               )
]

history = CharRNN.fit_generator(
    train_gen,
    epochs=25, #25 epochs for non-arabic languages (from the paper)
    steps_per_epoch=len(inputs_tr)//seq_len//batch_size, # number of batches to feed
    verbose=1,
    validation_data = val_gen,
    validation_steps = len(inputs_vd)//seq_len,          # We don't use minibatches to validate
    shuffle=False,
    callbacks=callbacks
)
Epoch 1/25
1267/1267 [==============================] - 101s 79ms/step - loss: 6.2205 - acc: 0.0510 - PPL: 109.0537 - val_loss: 6.5892 - val_acc: 0.0486 - val_PPL: 106.9790
Epoch 2/25
1267/1267 [==============================] - 99s 78ms/step - loss: 6.0385 - acc: 0.0576 - PPL: 88.5834 - val_loss: 6.4777 - val_acc: 0.0505 - val_PPL: 100.3288
Epoch 3/25
1267/1267 [==============================] - 99s 78ms/step - loss: 5.8674 - acc: 0.0691 - PPL: 81.3396 - val_loss: 6.3364 - val_acc: 0.0833 - val_PPL: 92.7788
Epoch 4/25
1267/1267 [==============================] - 99s 78ms/step - loss: 5.8825 - acc: 0.0763 - PPL: 78.2120 - val_loss: 6.3529 - val_acc: 0.0943 - val_PPL: 92.6208
Epoch 5/25
1267/1267 [==============================] - 99s 78ms/step - loss: 5.7617 - acc: 0.0847 - PPL: 72.3175 - val_loss: 6.1295 - val_acc: 0.1016 - val_PPL: 81.4286
Epoch 6/25
1267/1267 [==============================] - 99s 78ms/step - loss: 5.7181 - acc: 0.0904 - PPL: 68.4978 - val_loss: 6.0488 - val_acc: 0.0976 - val_PPL: 77.2231
Epoch 7/25
1267/1267 [==============================] - 99s 78ms/step - loss: 5.6775 - acc: 0.0969 - PPL: 64.7776 - val_loss: 5.9409 - val_acc: 0.1063 - val_PPL: 72.7255
Epoch 8/25
1267/1267 [==============================] - 99s 78ms/step - loss: 5.5066 - acc: 0.0975 - PPL: 61.7341 - val_loss: 5.8394 - val_acc: 0.1203 - val_PPL: 67.7843
Epoch 9/25
1267/1267 [==============================] - 98s 78ms/step - loss: 5.5073 - acc: 0.1008 - PPL: 60.8387 - val_loss: 5.7717 - val_acc: 0.1280 - val_PPL: 64.6741
Epoch 10/25
1267/1267 [==============================] - 98s 77ms/step - loss: 5.4586 - acc: 0.1032 - PPL: 60.2961 - val_loss: 5.7277 - val_acc: 0.1350 - val_PPL: 63.0129
Epoch 11/25
1267/1267 [==============================] - 98s 78ms/step - loss: 5.4499 - acc: 0.1031 - PPL: 63.0168 - val_loss: 5.6674 - val_acc: 0.1446 - val_PPL: 60.2713
Epoch 12/25
1267/1267 [==============================] - 98s 78ms/step - loss: 5.4814 - acc: 0.1055 - PPL: 63.1000 - val_loss: 5.8431 - val_acc: 0.1285 - val_PPL: 68.5861
Epoch 13/25
1267/1267 [==============================] - 98s 77ms/step - loss: 5.4975 - acc: 0.1084 - PPL: 64.7933 - val_loss: 5.6355 - val_acc: 0.1427 - val_PPL: 58.4656
Epoch 14/25
1267/1267 [==============================] - 98s 77ms/step - loss: 5.5273 - acc: 0.1063 - PPL: 70.0334 - val_loss: 5.8094 - val_acc: 0.1262 - val_PPL: 66.5901
Epoch 15/25
1267/1267 [==============================] - 98s 77ms/step - loss: 5.5689 - acc: 0.1098 - PPL: 74.0235 - val_loss: 5.6621 - val_acc: 0.1411 - val_PPL: 60.1453
Epoch 16/25
1267/1267 [==============================] - 98s 77ms/step - loss: 5.5261 - acc: 0.1129 - PPL: 64.9730 - val_loss: 5.6639 - val_acc: 0.1374 - val_PPL: 60.1021
Epoch 17/25
1267/1267 [==============================] - 98s 77ms/step - loss: 5.3318 - acc: 0.1282 - PPL: 51.7135 - val_loss: 5.5873 - val_acc: 0.1571 - val_PPL: 56.5843
Epoch 18/25
1267/1267 [==============================] - 98s 77ms/step - loss: 5.3151 - acc: 0.1196 - PPL: 54.3754 - val_loss: 5.8620 - val_acc: 0.1399 - val_PPL: 69.4699
Epoch 19/25
1267/1267 [==============================] - 98s 77ms/step - loss: 5.3355 - acc: 0.1210 - PPL: 54.4442 - val_loss: 5.7665 - val_acc: 0.1284 - val_PPL: 64.4040
Epoch 20/25
1267/1267 [==============================] - 98s 77ms/step - loss: 5.2400 - acc: 0.1362 - PPL: 49.0685 - val_loss: 5.5783 - val_acc: 0.1639 - val_PPL: 55.6973
Epoch 21/25
1267/1267 [==============================] - 98s 77ms/step - loss: 5.2198 - acc: 0.1397 - PPL: 48.4728 - val_loss: 5.5208 - val_acc: 0.1634 - val_PPL: 54.1354
Epoch 22/25
1267/1267 [==============================] - 98s 77ms/step - loss: 5.2313 - acc: 0.1400 - PPL: 48.6355 - val_loss: 5.5302 - val_acc: 0.1630 - val_PPL: 54.3647
Epoch 23/25
1267/1267 [==============================] - 98s 77ms/step - loss: 5.1974 - acc: 0.1397 - PPL: 47.8129 - val_loss: 5.5249 - val_acc: 0.1654 - val_PPL: 54.6090
Epoch 24/25
1267/1267 [==============================] - 98s 77ms/step - loss: 5.1583 - acc: 0.1399 - PPL: 47.0824 - val_loss: 5.5541 - val_acc: 0.1639 - val_PPL: 55.1477
Epoch 25/25
1267/1267 [==============================] - 98s 77ms/step - loss: 5.1568 - acc: 0.1392 - PPL: 47.3577 - val_loss: 5.5370 - val_acc: 0.1648 - val_PPL: 54.6469

Finally we evaluate the model. Here is a snapshot of the training process in TensorBoard:

[TensorBoard training curves]

Orange lines are with untrainable character embeddings (just a random character LUT) and blue lines with trainable embeddings.

Note: There is probably something wrong with the metrics, since we get a perplexity much lower than reported (51.7 vs 92.3 in the paper). A likely culprit: K.categorical_crossentropy is in nats, so PPL should arguably use K.exp rather than 2**x. For the test loss of ≈5.49 nats, exp gives e^5.49 ≈ 242 while the base-2 version gives 2^5.49 ≈ 45.

In [22]:
test_gen = generator(W_ts,batch_size=1)
loss, acc, ppl = CharRNN.evaluate_generator(test_gen, steps=len(inputs_ts)//seq_len)

print("Test Loss: {}, Perplexity: {}".format(loss, ppl))
Test Loss: 5.487452989824623, Perplexity: 51.72303666000001

Credits

Final Word

In the checkpoints directory I have a checkpoint from the last training epoch. You can load it in Keras and play with it as you please.
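
For example, here is a minimal sketch of loading a checkpoint and predicting a next word. The filename is hypothetical (use whatever ModelCheckpoint actually wrote), and the custom loss/metric must be passed in via custom_objects:

from keras.models import load_model

model = load_model(
    'checkpoints/char-rnn-cnn_25-5.54.hdf5',  # hypothetical filename
    custom_objects={'categorical_crossentropy': categorical_crossentropy,
                    'PPL': PPL}
)

x_batch, _ = next(generator(W_ts, batch_size=1))  # (1, seq_len, max_word_len)
logits = model.predict(x_batch)                   # (1, seq_len, len(word_idx))
print(idx_word[int(np.argmax(logits[0, -1]))])    # most likely word after the last timestep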

Please note I wrote this tutorial in order to understand how to build such models myself. If you found it useful or have any questions, criticism, suggestions or other feedback, please drop me an e-mail.

I am also interested in whether I defined perplexity correctly; if someone has any suggestions on that, please open an issue or send me an e-mail.

Feedback

This area is for feedback I receive by e-mail.