Have you ever wondered how Gmail's automatic reply works? Or how a neural network can generate musical notes? The general way of generating a sequence of text is to train a model to predict the next word/character given all previous words/characters. Such a model is called a statistical language model. So what is a statistical language model? A statistical language model tries to capture the statistical structure (latent space) of the text it's trained on. Recurrent Neural Network (RNN) models are usually used for this task because they are very powerful and expressive: they remember and process past information through their high-dimensional hidden state units. The main goal of any language model is to learn the joint probability distribution of sequences of characters/words in the training text, i.e. to learn the joint probability function. For example, if we're trying to predict a sequence of $T$ words, we try to make the joint probability $P(w_1, w_2, ..., w_T)$ as big as we can, which, by the chain rule, equals the product of the conditional probabilities $\prod_{t = 1}^T P(w_t|w_1, ..., w_{t-1})$ over all time steps $t$.
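To make the chain rule concrete, here is a tiny sketch with made-up numbers: the joint probability is the product of the per-step conditional probabilities, and its log is the sum of the per-step log probabilities, which is exactly the quantity (with a minus sign) that the cross-entropy loss below accumulates.
import numpy as np
# Made-up conditional probabilities P(w_t | w_1, ..., w_{t-1}) for a
# 4-token sequence; purely illustrative numbers
cond_probs = [0.2, 0.5, 0.9, 0.7]
# Chain rule: the joint probability is the product of the conditionals
joint_prob = np.prod(cond_probs)
# Taking logs turns the product into a sum; the negative of this sum is the
# cross-entropy loss summed over time steps
log_joint = np.sum(np.log(cond_probs))
print(joint_prob, np.exp(log_joint))  # both are approximately 0.063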
In this notebook, we'll cover the character-level language model, where almost all of the concepts also hold for other language models such as word-level language models. The main task of a character-level language model is to predict the next character given all previous characters in a sequence of data, i.e. to generate text character by character. More formally, given a training sequence $(x^1, ... , x^T)$, the RNN uses the sequence of its output vectors $(o^1, ... , o^T)$ to obtain a sequence of predictive distributions $P(x^t|x^{<t}) = \text{softmax}(o^t)$.
Let's illustrate how the character-level language model works using my first name ("imad") as an example (see figure 1 for all the details of this example).
The objective is to make the green numbers as big as we can and the red numbers as small as we can in the probability distribution layer. The reason is that the true index should have the highest probability, as close to 1 as we can make it. The way to do that is to measure the loss using cross-entropy, compute the gradients of the loss w.r.t. all parameters, and then update the parameters in the opposite direction of the gradient. Repeating this process many times, each time adjusting the parameters based on the gradient direction, the model becomes able to correctly predict the next character given all previous ones, using all the names in the training text. Notice that the hidden state $h^4$ carries all past information about the preceding characters.
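As a small, made-up illustration of that loss and its gradient: the cross-entropy at one time step is the negative log of the probability the model assigned to the true character, and its gradient w.r.t. the logits is the predicted distribution with 1 subtracted at the true index, which is exactly what the dy[y[t]] -= 1 line in the backward pass further down computes.
import numpy as np
# Made-up probability distribution over a 4-character vocabulary at one
# time step, and the index of the true next character
probs = np.array([0.1, 0.6, 0.2, 0.1])
true_idx = 1
# Cross-entropy loss for this step: -log P(true character)
loss = -np.log(probs[true_idx])  # ~0.51
# Gradient of the loss w.r.t. the logits (softmax and cross-entropy combined):
# the predicted distribution minus the one-hot target
dlogits = probs.copy()
dlogits[true_idx] -= 1  # [0.1, -0.4, 0.2, 0.1]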
The dataset we'll be using has 5,163 names: 4,275 male names, 1,219 female names, and 331 names that can be either female or male. The RNN architecture we'll be using to train the character-level language model is called many-to-many, where the number of time steps of the input $(T_x)$ equals the number of time steps of the output $(T_y)$. In other words, the input and output sequences are synced (see figure 2).
In this section, we'll go over four main parts:
1. Forward pass.
2. Backward pass (Backpropagation Through Time).
3. Sampling.
4. Fitting the model.
We'll be using Stochastic Gradient Descent (SGD) where each batch consists of only one example. In other words, the RNN model will learn from each example (name) separately, i.e. run both forward and backward passes on each example and update parameters accordingly. Below are all the steps needed for a forward pass:
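In equations, these steps (which rnn_forward below implements at each time step $t$) are:
$$h^t = \tanh(W_{xh}\, x^t + W_{hh}\, h^{t-1} + b)$$
$$o^t = W_{hy}\, h^t + c$$
$$y^t = \text{softmax}(o^t)$$
$$\mathcal{L} = -\sum_{t=1}^{T_x} \log y^t[\text{target}_t]$$
where $x^t$ is the one-hot vector of the input character at time step $t$, $\text{target}_t$ is the index of the true next character, and $W_{xh}$, $W_{hh}$, $W_{hy}$, $b$, $c$ are the parameters retrieved in the code below.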
Notice that we use the hyperbolic tangent $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$ as the non-linear function. One of the main advantages of the hyperbolic tangent is that it resembles the identity function near the origin, where $\tanh(x) \approx x$.
The softmax layer has the same dimension as the output layer, which is vocab_size x 1. As a result, $y^t[i]$ is the probability of index $i$ being the next character at time step $t$.
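The softmax helper we import below from the scripts directory isn't shown in this notebook; here is a minimal, numerically stable sketch of what it's assumed to compute:
import numpy as np

def softmax(z):
    # Shift by the max logit for numerical stability (doesn't change the
    # result), exponentiate, then normalize so the entries sum to 1
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)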
Since we'll be using SGD, the loss will be noisy and have many oscillations, so it's good practice to smooth it out using an exponentially weighted average.
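The smooth_loss helper is also imported rather than defined in this notebook; a sketch of the exponentially weighted average it's assumed to compute (the 0.999 decay factor is an assumption):
def smooth_loss(smoothed_loss, current_loss, beta=0.999):
    # Keep most of the running average and mix in a small fraction of the
    # latest per-example loss
    return beta * smoothed_loss + (1 - beta) * current_loss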
# Load packages
import os
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
os.chdir("../scripts/")
from character_level_language_model import (initialize_parameters,
initialize_rmsprop,
softmax,
smooth_loss,
update_parameters_with_rmsprop)
os.chdir("../notebooks/")
%matplotlib inline
sns.set_context("notebook")
plt.style.use("fivethirtyeight")
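The helper functions imported above (such as initialize_parameters) live in ../scripts/character_level_language_model.py and aren't shown in this notebook. Based on how the parameters are used in rnn_forward below, a plausible sketch of initialize_parameters looks like the following (the 0.01 scale on the random weights is an assumption):
def initialize_parameters(vocab_size, hidden_layer_size):
    # Hypothetical sketch of the imported helper: small random weights and
    # zero biases, with shapes matching how they're used in rnn_forward:
    # Wxh: (n_h, vocab_size), Whh: (n_h, n_h), Why: (vocab_size, n_h)
    n_h = hidden_layer_size
    parameters = {
        "Wxh": np.random.randn(n_h, vocab_size) * 0.01,
        "Whh": np.random.randn(n_h, n_h) * 0.01,
        "b": np.zeros((n_h, 1)),
        "Why": np.random.randn(vocab_size, n_h) * 0.01,
        "c": np.zeros((vocab_size, 1)),
    }
    return parameters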
def rnn_forward(x, y, h_prev, parameters):
"""
    Implement the forward pass on one name.
Arguments
---------
    x : list
        list of integers for the indices of the characters in the example,
        shifted one position to the right (position 0 is None and acts as a
        dummy start-of-name input).
    y : list
        list of integers for the indices of the target characters, i.e. x
        shifted one step to the left with the index of the newline character
        appended at the end.
h_prev : array
last hidden state from the previous example.
parameters : python dict
dictionary containing the parameters.
Returns
-------
loss : float
cross-entropy loss.
cache : tuple
contains three python dictionaries:
xs -- input of all time steps.
hs -- hidden state of all time steps.
probs -- probability distribution of each character at each time
step.
"""
# Retrieve parameters
Wxh, Whh, b = parameters["Wxh"], parameters["Whh"], parameters["b"]
Why, c = parameters["Why"], parameters["c"]
# Initialize inputs, hidden state, output, and probabilities dictionaries
xs, hs, os, probs = {}, {}, {}, {}
# Initialize x0 to zero vector
xs[0] = np.zeros((vocab_size, 1))
    # Initialize the loss and store h_prev as the last hidden state in hs
loss = 0
hs[-1] = np.copy(h_prev)
# Forward pass: loop over all characters of the name
for t in range(len(x)):
# Convert to one-hot vector
if t > 0:
xs[t] = np.zeros((vocab_size, 1))
xs[t][x[t]] = 1
# Hidden state
hs[t] = np.tanh(np.dot(Wxh, xs[t]) + np.dot(Whh, hs[t - 1]) + b)
# Logits
os[t] = np.dot(Why, hs[t]) + c
# Probs
probs[t] = softmax(os[t])
# Loss
loss -= np.log(probs[t][y[t], 0])
cache = (xs, hs, probs)
return loss, cache
With RNN-based models, the gradient-based technique that will be used is called Backpropagation Through Time (BPTT). We start at the last time step $T$ and backpropagate the gradients of the loss w.r.t. all parameters across all time steps, summing them up as we go (see figure 3).
Note that at the last time step $T$, we initialize $dh_{next}$ to zeros since we can't get values from the future. To stabilize the updates at each time step, since SGD may have many oscillations, we'll be using one of the adaptive learning rate optimizers; more specifically, Root Mean Squared Propagation (RMSProp), which tends to have acceptable performance.
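The RMSProp helpers initialize_rmsprop and update_parameters_with_rmsprop are also imported from the scripts directory rather than defined here; below is a sketch of the standard RMSProp update they're assumed to implement (the decay rate, learning rate, and epsilon values are assumptions):
def initialize_rmsprop(parameters):
    # One running average of squared gradients per parameter, initialized to zeros
    return {"d" + name: np.zeros_like(param) for name, param in parameters.items()}

def update_parameters_with_rmsprop(parameters, grads, s, beta=0.9,
                                   learning_rate=0.01, epsilon=1e-8):
    # Hypothetical sketch of the imported helper: scale each gradient by the
    # root of its running average of squared gradients before taking a step
    for name in parameters:
        s["d" + name] = beta * s["d" + name] + (1 - beta) * grads["d" + name] ** 2
        parameters[name] -= (learning_rate * grads["d" + name]
                             / (np.sqrt(s["d" + name]) + epsilon))
    return parameters, s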
def clip_gradients(gradients, max_value):
"""
    Implements element-wise gradient clipping so that all gradients lie in
    the interval [-max_value, max_value].
Arguments
----------
gradients : python dict
dictionary that stores all the gradients.
max_value : scalar
edge of the interval [-max_value, max_value].
Returns
-------
gradients : python dict
dictionary where all gradients were clipped.
"""
for grad in gradients.keys():
np.clip(gradients[grad], -max_value, max_value, out=gradients[grad])
return gradients
def rnn_backward(y, parameters, cache):
"""
Implements Backpropagation on one name.
Arguments
---------
y : list
list of integers for the index of the characters in the example.
parameters : python dict
dictionary containing the parameters.
cache : tuple
contains three python dictionaries:
xs -- input of all time steps.
hs -- hidden state of all time steps.
probs -- probability distribution of each character at each time
step.
Returns
-------
grads : python dict
dictionary containing all the gradients.
h_prev : array
last hidden state from the current example.
"""
# Retrieve xs, hs, and probs
xs, hs, probs = cache
# Initialize all gradients to zero
dh_next = np.zeros_like(hs[0])
parameters_names = ["Whh", "Wxh", "b", "Why", "c"]
grads = {}
for param_name in parameters_names:
grads["d" + param_name] = np.zeros_like(parameters[param_name])
# Iterate over all time steps in reverse order starting from Tx
for t in reversed(range(len(xs))):
dy = np.copy(probs[t])
dy[y[t]] -= 1
grads["dWhy"] += np.dot(dy, hs[t].T)
grads["dc"] += dy
dh = np.dot(parameters["Why"].T, dy) + dh_next
dhraw = (1 - hs[t] ** 2) * dh
grads["dWhh"] += np.dot(dhraw, hs[t - 1].T)
grads["dWxh"] += np.dot(dhraw, xs[t].T)
grads["db"] += dhraw
dh_next = np.dot(parameters["Whh"].T, dhraw)
# Clip the gradients using [-5, 5] as the interval
grads = clip_gradients(grads, 5)
# Get the last hidden state
h_prev = hs[len(xs) - 1]
return grads, h_prev
As we increase the randomness, the generated text loses local structure; as we decrease it, the generated text starts to look more real and preserves more of its local structure. For this exercise, we will sample from the distribution generated by the model, which can be seen as an intermediate level of randomness between maximum and minimum entropy (see figure 4). Using this sampling strategy on the above distribution, index 0 has a $20$% probability of being picked, while index 2 has a $40$% probability of being picked.
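One common way to move along this randomness dial (not used in this notebook, shown only for illustration) is to divide the logits by a temperature before the softmax: a high temperature pushes the distribution toward uniform (maximum entropy), a temperature close to zero pushes it toward picking the argmax (minimum entropy), and a temperature of 1 recovers the model's own distribution, which is what the sample function below draws from.
def sample_with_temperature(logits, temperature=1.0):
    # Rescale the logits, renormalize with a (numerically stable) softmax,
    # then draw an index from the resulting distribution
    scaled = logits / temperature
    exp_scaled = np.exp(scaled - np.max(scaled))
    probs = exp_scaled / np.sum(exp_scaled)
    return np.random.choice(probs.size, p=probs.ravel())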
def sample(parameters, idx_to_chars, chars_to_idx, n):
"""
    Implements sampling of a sequence of up to n characters. The sampling is
    based on the probability distribution output of the RNN.
Arguments
---------
parameters : python dict
dictionary storing all the parameters of the model.
idx_to_chars : python dict
dictionary mapping indices to characters.
chars_to_idx : python dict
dictionary mapping characters to indices.
n : scalar
number of characters to output.
Returns
-------
sequence : str
sequence of characters sampled.
"""
    # Retrieve parameters, shapes, and vocab size
Whh, Wxh, b = parameters["Whh"], parameters["Wxh"], parameters["b"]
Why, c = parameters["Why"], parameters["c"]
n_h, n_x = Wxh.shape
vocab_size = c.shape[0]
    # Initialize h_prev and x to zero vectors
h_prev = np.zeros((n_h, 1))
x = np.zeros((n_x, 1))
# Initialize empty sequence
indices = []
idx = -1
counter = 0
while (counter <= n and idx != chars_to_idx["\n"]):
# Fwd propagation
h = np.tanh(np.dot(Whh, h_prev) + np.dot(Wxh, x) + b)
o = np.dot(Why, h) + c
probs = softmax(o)
# Sample the index of the character using generated probs distribution
idx = np.random.choice(vocab_size, p=probs.ravel())
# Get the character of the sampled index
char = idx_to_chars[idx]
# Add the char to the sequence
indices.append(idx)
        # Update h_prev and x
h_prev = np.copy(h)
x = np.zeros((n_x, 1))
x[idx] = 1
counter += 1
sequence = "".join([idx_to_chars[idx] for idx in indices if idx != 0])
return sequence
def model(
file_path, chars_to_idx, idx_to_chars, hidden_layer_size, vocab_size,
num_epochs=10, learning_rate=0.01):
"""
Implements RNN to generate characters.
Arguments
---------
    file_path : str
        path to the file of the raw data.
    chars_to_idx : python dict
        dictionary mapping characters to indices.
    idx_to_chars : python dict
        dictionary mapping indices to characters.
    hidden_layer_size : int
        number of hidden units in the hidden layer.
    vocab_size : int
        size of the vocabulary.
    num_epochs : int
        number of passes the optimization algorithm makes over the training
        data.
    learning_rate : float
        step size of learning.
Returns
-------
parameters : python dict
dictionary storing all the parameters of the model.
overall_loss : list
list stores smoothed loss per epoch.
"""
# Get the data
with open(file_path) as f:
data = f.readlines()
examples = [x.lower().strip() for x in data]
# Initialize parameters
parameters = initialize_parameters(vocab_size, hidden_layer_size)
    # Initialize RMSProp parameters
s = initialize_rmsprop(parameters)
    # Initialize the smoothed loss to the loss of a model that assigns a
    # uniform probability (1 / vocab_size) to each character of a
    # 7-character name
    smoothed_loss = -np.log(1 / vocab_size) * 7
# Initialize hidden state h0 and overall loss
h_prev = np.zeros((hidden_layer_size, 1))
overall_loss = []
# Iterate over number of epochs
for epoch in range(num_epochs):
print(f"\033[1m\033[94mEpoch {epoch}")
print(f"\033[1m\033[92m=======")
# Sample one name
print(f"""Sampled name: {sample(parameters, idx_to_chars, chars_to_idx,
10).capitalize()}""")
print(f"Smoothed loss: {smoothed_loss:.4f}\n")
# Shuffle examples
np.random.shuffle(examples)
# Iterate over all examples (SGD)
for example in examples:
x = [None] + [chars_to_idx[char] for char in example]
y = x[1:] + [chars_to_idx["\n"]]
# Fwd pass
loss, cache = rnn_forward(x, y, h_prev, parameters)
# Compute smooth loss
smoothed_loss = smooth_loss(smoothed_loss, loss)
# Bwd pass
grads, h_prev = rnn_backward(y, parameters, cache)
# Update parameters
parameters, s = update_parameters_with_rmsprop(
parameters, grads, s)
overall_loss.append(smoothed_loss)
return parameters, overall_loss
# Load names
data = open("../data/names.txt", "r").read()
# Convert characters to lower case
data = data.lower()
# Construct the vocabulary using the unique characters, sort it in ascending
# order, then construct two dictionaries that map characters to indices and
# indices to characters.
chars = list(sorted(set(data)))
chars_to_idx = {ch:i for i, ch in enumerate(chars)}
idx_to_chars = {i:ch for ch, i in chars_to_idx.items()}
# Get the size of the data and vocab size
data_size = len(data)
vocab_size = len(chars_to_idx)
print(f"There are {data_size} characters and {vocab_size} unique characters.")
# Fitting the model
parameters, loss = model("../data/names.txt", chars_to_idx, idx_to_chars, 100, vocab_size, 100, 0.01)
# Plotting the loss
plt.plot(range(len(loss)), loss)
plt.xlabel("Epochs")
plt.ylabel("Smoothed loss");
There are 36121 characters and 27 unique characters.

Epoch 0 | Sampled name: Nijqikkgzst | Smoothed loss: 23.0709
Epoch 1 | Sampled name: Balrccanevi | Smoothed loss: 16.5925
Epoch 2 | Sampled name: Onon | Smoothed loss: 15.7529
Epoch 3 | Sampled name: Flereny | Smoothed loss: 15.6335
Epoch 4 | Sampled name: Riela | Smoothed loss: 15.5693
Epoch 5 | Sampled name: Fribella | Smoothed loss: 15.5195
Epoch 6 | Sampled name: Esialina | Smoothed loss: 15.3628
Epoch 7 | Sampled name: Daliangea | Smoothed loss: 15.1252
Epoch 8 | Sampled name: Lgbedo | Smoothed loss: 15.0611
Epoch 9 | Sampled name: Colisha | Smoothed loss: 14.9486
Epoch 10 | Sampled name: Milton | Smoothed loss: 14.7446
Epoch 11 | Sampled name: Kenn | Smoothed loss: 14.7901
Epoch 12 | Sampled name: Joy | Smoothed loss: 14.7042
Epoch 13 | Sampled name: Kendie | Smoothed loss: 14.5615
Epoch 14 | Sampled name: Balloro | Smoothed loss: 14.5283
Epoch 15 | Sampled name: Roster | Smoothed loss: 14.5440
Epoch 16 | Sampled name: Thane | Smoothed loss: 14.5419
Epoch 17 | Sampled name: Rennida | Smoothed loss: 14.4142
Epoch 18 | Sampled name: Krona | Smoothed loss: 14.3520
Epoch 19 | Sampled name: Cynna | Smoothed loss: 14.2286
Epoch 20 | Sampled name: Moita | Smoothed loss: 14.2191
Epoch 21 | Sampled name: Mostela | Smoothed loss: 14.1889
Epoch 22 | Sampled name: Sustin | Smoothed loss: 14.1924
Epoch 23 | Sampled name: Lehna | Smoothed loss: 14.0645
Epoch 24 | Sampled name: Alda | Smoothed loss: 14.1150
Epoch 25 | Sampled name: Leetha | Smoothed loss: 14.0461
Epoch 26 | Sampled name: Laina | Smoothed loss: 13.9593
Epoch 27 | Sampled name: Ltoranna | Smoothed loss: 13.9691
Epoch 28 | Sampled name: Ida | Smoothed loss: 14.1112
Epoch 29 | Sampled name: Shaejaliisa | Smoothed loss: 13.9906
Epoch 30 | Sampled name: Dangelyn | Smoothed loss: 13.8179
Epoch 31 | Sampled name: Lilinnda | Smoothed loss: 13.8924
Epoch 32 | Sampled name: Mindy | Smoothed loss: 13.6877
Epoch 33 | Sampled name: Lucinda | Smoothed loss: 13.7732
Epoch 34 | Sampled name: Leynallon | Smoothed loss: 13.8123
Epoch 35 | Sampled name: Dannie | Smoothed loss: 13.7393
Epoch 36 | Sampled name: Tuera | Smoothed loss: 13.7140
Epoch 37 | Sampled name: Kerry | Smoothed loss: 13.8100
Epoch 38 | Sampled name: Trena | Smoothed loss: 13.8851
Epoch 39 | Sampled name: Lyno | Smoothed loss: 13.8151
Epoch 40 | Sampled name: Chalita | Smoothed loss: 13.7768
Epoch 41 | Sampled name: Ueana | Smoothed loss: 13.7901
Epoch 42 | Sampled name: Mady | Smoothed loss: 13.6581
Epoch 43 | Sampled name: Nada | Smoothed loss: 13.7101
Epoch 44 | Sampled name: Shaunce | Smoothed loss: 13.4868
Epoch 45 | Sampled name: Jeman | Smoothed loss: 13.6186
Epoch 46 | Sampled name: Bellen | Smoothed loss: 13.5687
Epoch 47 | Sampled name: Loneith | Smoothed loss: 13.6583
Epoch 48 | Sampled name: Breena | Smoothed loss: 13.6168
Epoch 49 | Sampled name: Daa | Smoothed loss: 13.4808
Epoch 50 | Sampled name: Colira | Smoothed loss: 13.6772
Epoch 51 | Sampled name: Deonora | Smoothed loss: 13.6995
Epoch 52 | Sampled name: Eya | Smoothed loss: 13.5731
Epoch 53 | Sampled name: Oleina | Smoothed loss: 13.5367
Epoch 54 | Sampled name: Meild | Smoothed loss: 13.5455
Epoch 55 | Sampled name: Narielie | Smoothed loss: 13.6152
Epoch 56 | Sampled name: Dar | Smoothed loss: 13.5110
Epoch 57 | Sampled name: Genna | Smoothed loss: 13.5699
Epoch 58 | Sampled name: Tressa | Smoothed loss: 13.5733
Epoch 59 | Sampled name: Lecelyn | Smoothed loss: 13.5511
Epoch 60 | Sampled name: Aliene | Smoothed loss: 13.4716
Epoch 61 | Sampled name: Grace | Smoothed loss: 13.5585
Epoch 62 | Sampled name: Dosha | Smoothed loss: 13.5014
Epoch 63 | Sampled name: Libornie | Smoothed loss: 13.5098
Epoch 64 | Sampled name: Naula | Smoothed loss: 13.5603
Epoch 65 | Sampled name: Teney | Smoothed loss: 13.5932
Epoch 66 | Sampled name: Akilla | Smoothed loss: 13.4078
Epoch 67 | Sampled name: Ina | Smoothed loss: 13.4269
Epoch 68 | Sampled name: Ticki | Smoothed loss: 13.5426
Epoch 69 | Sampled name: Dernaio | Smoothed loss: 13.4338
Epoch 70 | Sampled name: Lacira | Smoothed loss: 13.3782
Epoch 71 | Sampled name: Uidshinva | Smoothed loss: 13.4009
Epoch 72 | Sampled name: Leus | Smoothed loss: 13.4333
Epoch 73 | Sampled name: Teanna | Smoothed loss: 13.4269
Epoch 74 | Sampled name: Conda | Smoothed loss: 13.3653
Epoch 75 | Sampled name: Ceth | Smoothed loss: 13.4187
Epoch 76 | Sampled name: Loma | Smoothed loss: 13.3606
Epoch 77 | Sampled name: Dilis | Smoothed loss: 13.4309
Epoch 78 | Sampled name: Lasamia | Smoothed loss: 13.4144
Epoch 79 | Sampled name: Lanni | Smoothed loss: 13.4627
Epoch 80 | Sampled name: Tammora | Smoothed loss: 13.4617
Epoch 81 | Sampled name: Iararea | Smoothed loss: 13.4516
Epoch 82 | Sampled name: Lyn | Smoothed loss: 13.3161
Epoch 83 | Sampled name: Nym | Smoothed loss: 13.3647
Epoch 84 | Sampled name: Latrica | Smoothed loss: 13.3848
Epoch 85 | Sampled name: Tiedann | Smoothed loss: 13.2875
Epoch 86 | Sampled name: Mora | Smoothed loss: 13.3485
Epoch 87 | Sampled name: Lito | Smoothed loss: 13.3280
Epoch 88 | Sampled name: Lung | Smoothed loss: 13.3447
Epoch 89 | Sampled name: Lilomannala | Smoothed loss: 13.2688
Epoch 90 | Sampled name: Tomone | Smoothed loss: 13.3204
Epoch 91 | Sampled name: Kelia | Smoothed loss: 13.3137
Epoch 92 | Sampled name: Nashristina | Smoothed loss: 13.2945
Epoch 93 | Sampled name: Adaurta | Smoothed loss: 13.4576
Epoch 94 | Sampled name: Tissie | Smoothed loss: 13.3241
Epoch 95 | Sampled name: Lanosh | Smoothed loss: 13.2851
Epoch 96 | Sampled name: Mariau | Smoothed loss: 13.4743
Epoch 97 | Sampled name: Emher | Smoothed loss: 13.2563
Epoch 98 | Sampled name: Tara | Smoothed loss: 13.3592
Epoch 99 | Sampled name: Cathranda | Smoothed loss: 13.3380
As you may notice, the generated names started to get more interesting after 15 epochs. One of the interesting names is "Yasira," which is an Arabic name :).