In this chapter, we will create vector representations for variable-length sentences, see why averaging word embeddings works but loses word order, and build a simple recurrent neural network that learns order-aware sentence representations.
[Andrej Karpathy] "There is something magical about recurrent neural networks."
In this chapter, we'll expand on the intuition of an embedding conveying the meaning of a word by creating embeddings that convey the meaning of variable-length phrases and sentences. If we want a vector that holds an entire sequence of symbols in the same way a word embedding holds information about a single word, how could we build it?
If we concatenated or stacked each word embedding in the sentence, we would get a vector that could represent the whole sentence.
But this approach leaves something to be desired: longer sentences produce more stacked embeddings, so two sentences of different lengths have different shapes and can't be compared directly. In theory, two sentences with the same meaning should be very similar, and comparing their vectors should indicate high similarity.
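As a quick illustration (a minimal sketch with made-up embeddings; `sentence_a`, `sentence_b`, and `embedding_dim` are illustrative, not part of the code below), stacking word vectors gives arrays whose size depends on sentence length, so two sentences of different lengths can't be compared element-wise:

import numpy as np

np.random.seed(0)
embedding_dim = 4
# hypothetical embeddings for two sentences of different lengths
sentence_a = np.random.rand(5, embedding_dim)   # 5 words
sentence_b = np.random.rand(8, embedding_dim)   # 8 words

stacked_a = sentence_a.reshape(-1)  # shape (20,)
stacked_b = sentence_b.reshape(-1)  # shape (32,)
print(stacked_a.shape, stacked_b.shape)  # different shapes -> no direct comparison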
The act of comparing two vectors is useful because it lets us understand what the network sees. Even though we can't read the vectors directly, we can check their similarity. In this case, we are interested in analyzing the perspective of the neural network.
We want to create sentence vectors that are useful for predicting things about the sentence. At a minimum, similar sentences should result in similar vector representations.
But what about averaging the word vectors of the sentence? Averaging turns out to be a strong technique for creating sentence embeddings, although not a perfect one.
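Here's a minimal sketch of the idea with tiny made-up embeddings (`toy_embeddings`, `avg_vector`, and `cosine` are illustrative helpers, not part of the code that follows): averaging produces a fixed-size vector regardless of sentence length, and similar sentences end up with similar averages:

import numpy as np

# hypothetical 3-dimensional word embeddings
toy_embeddings = {
    'the':   np.array([0.1, 0.0, 0.1]),
    'movie': np.array([0.5, 0.2, 0.1]),
    'film':  np.array([0.5, 0.3, 0.1]),
    'was':   np.array([0.0, 0.1, 0.0]),
    'great': np.array([0.1, 0.9, 0.8]),
    'good':  np.array([0.2, 0.8, 0.7]),
}

def avg_vector(sentence):
    # average the word embeddings into one fixed-size sentence vector
    return np.mean([toy_embeddings[w] for w in sentence.split()], axis=0)

def cosine(a, b):
    return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

v1 = avg_vector('the movie was great')
v2 = avg_vector('the film was good')
print(v1.shape, v2.shape)  # both (3,), regardless of sentence length
print(cosine(v1, v2))      # close to 1.0 for these two similar sentences

With that intuition in place, let's train real word embeddings on the IMDB reviews dataset so we have something meaningful to average: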
import sys, random, math
from collections import Counter
import numpy as np
np.random.seed(1)
random.seed(1)
IMDB_PATH = '/Users/mohamedakramzaytar/data/2019/Q2/kaggle/IMDB/reviews.txt'
IMDB_LABEL_PATH = '/Users/mohamedakramzaytar/data/2019/Q2/kaggle/IMDB/labels.txt'
f = open(IMDB_PATH, 'r')
raw_reviews = f.readlines()
f.close()
# tokenize each review into a list of words
tokens = list(map(lambda x: x.split(" "), raw_reviews))

# count tokens across all reviews (counts are stored as negatives; only the vocabulary itself is used below)
word_counter = Counter()
for review in tokens:
    for token in review:
        word_counter[token] -= 1
_ = word_counter.most_common()

# build the vocabulary and a word -> index lookup
vocab = list(set(map(lambda x: x[0], word_counter.most_common())))
word2index = {}
for i, word in enumerate(vocab):
    word2index[word] = i
# convert each review to a list of word indices;
# `concatenated` holds every index in the corpus (used for drawing negative samples)
concatenated = list()
input_dataset = list()
for review in tokens:
    review_indices = list()
    for token in review:
        try:
            review_indices.append(word2index[token])
            concatenated.append(word2index[token])
        except KeyError:
            pass
    input_dataset.append(review_indices)
concatenated = np.array(concatenated)
random.shuffle(input_dataset)
# hyperparameters: learning rate, epochs, embedding size, context window, negative samples
lr, epochs = (.05, 2)
hidden_size, window, negative = 50, 2, 5

# W0: word embeddings, W1: output weights used for the negative-sampling predictions
W0 = (np.random.rand(len(vocab), hidden_size) - 0.5) * 0.2
W1 = np.random.rand(len(vocab), hidden_size) * 0

# target for each prediction: 1 for the true word, 0 for the negative samples
layer_2_target = np.zeros(negative + 1)
layer_2_target[0] = 1
def similar(target='beautiful'):
    # rank words by (negative) Euclidean distance to the target's embedding
    target_index = word2index[target]
    scores = Counter()
    for word, index in word2index.items():
        raw_difference = W0[index] - W0[target_index]
        squared_difference = raw_difference * raw_difference
        scores[word] = -math.sqrt(sum(squared_difference))
    return scores.most_common(10)
def sigmoid(x):
    return 1 / (1 + np.exp(-x))
for review_i, review in enumerate(input_dataset * epochs):
    for target_i in range(len(review)):
        # the true target word plus `negative` randomly drawn words
        target_samples = [review[target_i]] + list(concatenated[(np.random.rand(negative)*len(concatenated)).astype('int').tolist()])
        # surrounding context words
        left_context = review[max(0, target_i-window):target_i]
        right_context = review[target_i+1: min(len(review), target_i+window)]
        # average the context embeddings, then predict the target vs. the negative samples
        layer_1 = np.mean(W0[left_context+right_context], axis=0)
        layer_2 = sigmoid(layer_1.dot(W1[target_samples].T))
        # backpropagate and update both weight matrices
        layer_2_delta = layer_2 - layer_2_target
        layer_1_delta = layer_2_delta.dot(W1[target_samples])
        W0[left_context+right_context] -= layer_1_delta*lr
        W1[target_samples] -= np.outer(layer_2_delta, layer_1)*lr
    if(review_i % 5000 == 0):
        print('\rProgress:'+str(review_i/float(len(input_dataset)*epochs)) + " " + str(similar('terrible')))
print(similar('terrible'))
Progress:0.0 [('terrible', -0.0), ('genital', -0.39326532865973723), ('midkoff', -0.3938513073457206), ('ethan', -0.3968892076212083), ('streep', -0.3979724210300289), ('bjorlin', -0.3981196278255343), ('injure', -0.3982249442410485), ('munchies', -0.39891945750964025), ('orgolini', -0.40061746244471097), ('vega', -0.4015861557887159)]
Progress:0.1 [('terrible', -0.0), ('unique', -1.927270765664731), ('magnificent', -2.045146852188629), ('student', -2.062339523062212), ('deeply', -2.0687588315141556), ('fantastic', -2.08371892800822), ('rare', -2.0969774764288616), ('tough', -2.128273679772459), ('worthwhile', -2.149016740284797), ('dreadful', -2.1524088693570316)]
Progress:0.2 [('terrible', -0.0), ('horrible', -3.108306431979032), ('brilliant', -3.396949380238907), ('fantastic', -3.4444297632941083), ('compelling', -3.6848868031292823), ('lame', -3.6906613668414274), ('remarkable', -3.7214051894822826), ('great', -3.749394484829831), ('superb', -3.764204679513025), ('tedious', -3.78968255882158)]
Progress:0.3 [('terrible', -0.0), ('horrible', -3.498158178893929), ('tremendous', -3.761329799226001), ('fantastic', -3.7628561218791723), ('mediocre', -3.766116610681203), ('laughable', -3.7805893810816893), ('marvelous', -3.813188768185005), ('dreadful', -3.8134367604842216), ('breathtaking', -3.862370005681857), ('spectacular', -3.8694653810324677)]
Progress:0.4 [('terrible', -0.0), ('horrible', -2.967964656933557), ('brilliant', -3.5420185698927042), ('pathetic', -3.7138289510002624), ('remarkable', -3.770923141465496), ('superb', -3.7941187646097148), ('fantastic', -3.8364360547450773), ('laughable', -3.845696114343685), ('great', -3.906869347583601), ('bad', -3.921186208070328)]
Progress:0.5 [('terrible', -0.0), ('horrible', -3.1989896124441066), ('brilliant', -3.3647957125926413), ('superb', -3.446848086835425), ('pathetic', -3.6557981030906634), ('fantastic', -3.673566586441344), ('lame', -3.8944061428190775), ('remarkable', -4.004842620824482), ('marvelous', -4.053430718224174), ('great', -4.0758362861596575)]
Progress:0.6 [('terrible', -0.0), ('horrible', -2.7739353122626755), ('fantastic', -3.050940700260329), ('brilliant', -3.232619045738981), ('pathetic', -3.5566710306301847), ('superb', -3.713381489932097), ('lame', -3.739970542211976), ('marvelous', -3.841008362820846), ('haunting', -3.846184382564499), ('magnificent', -3.847871555156092)]
Progress:0.7 [('terrible', -0.0), ('horrible', -2.5027876972527996), ('brilliant', -3.387431880847732), ('fantastic', -3.805817922179162), ('superb', -3.8359369916740205), ('magnificent', -3.8404436166584484), ('pathetic', -3.9100961380425856), ('laughable', -3.9562924360256875), ('marvelous', -3.9658570094816135), ('lame', -4.01274210326551)]
Progress:0.8 [('terrible', -0.0), ('horrible', -3.3525614136062547), ('fantastic', -3.787226665684703), ('dire', -3.812683534917237), ('brilliant', -3.824813307264604), ('dreadful', -3.9178990948548855), ('laughable', -3.9454124321935535), ('fabulous', -3.9960864707094257), ('miserable', -4.04534807188886), ('phenomenal', -4.051791600383227)]
Progress:0.9 [('terrible', -0.0), ('horrible', -2.8685809548008545), ('brilliant', -3.2730169748091433), ('horrendous', -3.675608725486543), ('dreadful', -3.7564170172164517), ('phenomenal', -3.7578615711368135), ('bad', -3.7787265252348763), ('marvelous', -3.815732576454664), ('fantastic', -3.9545000164096757), ('great', -3.975672000914959)]
[('terrible', -0.0), ('horrible', -2.7898917830688132), ('brilliant', -3.3469631230517645), ('phenomenal', -3.690963614102198), ('pathetic', -3.795400367744756), ('marvelous', -3.9134681822713833), ('superb', -3.920458050440294), ('masterful', -3.979517882443988), ('bad', -3.9993630029279785), ('mediocre', -4.138258710066284)]
Let's experiment with average sentence embeddings:
import numpy as np

# scale each embedding by its squared norm
norms = np.sum(W0 * W0, axis=1)
norms.resize(norms.shape[0], 1)
normed_weights = W0 * norms

def make_sent_vect(words):
    # keep only words that are in the vocabulary, then look up their indices
    indices = list(map(lambda x: word2index[x], filter(lambda x: x in word2index, words)))
    # the sentence vector is the mean of the (scaled) word embeddings
    return np.mean(normed_weights[indices], axis=0)

reviews_to_vectors = list()
for review in tokens:
    reviews_to_vectors.append(make_sent_vect(review))
reviews_to_vectors = np.array(reviews_to_vectors)
def most_similar_reviews(review):
    # score every review by the dot product of its vector with the query vector
    v = make_sent_vect(review)
    scores = Counter()
    for i, val in enumerate(reviews_to_vectors.dot(v)):
        scores[i] = val
    # return the first 40 characters of the three highest-scoring reviews
    most_similar = list()
    for idx, score in scores.most_common(3):
        most_similar.append(raw_reviews[idx][0:40])
    return most_similar
most_similar_reviews(['beautiful', 'amazing'])
['i have never seen such terrible performa', 'i think this show is definitely the grea', 'if you havn t seen this movie i highly ']
There appears to be interesting statistical information within these vectors, such as the way negative & positive embeddings cluster together.
To understand the meaning of embeddings, let's visualize a word vector as a squiggly line, like this one:
Instead of thinking of a vector as just a list of numbers, we can visualize it as a line that goes through low and high points corresponding to the low and high numbers in the vector. If we selected several words from the corpus, they might look like these:
By comparing the curves of various words, we notice that each word's shape is unique; however, "terrible" & "boring" share certain similarities in their shapes. The most interesting thing is that parts of these squiggles carry meaning in and of themselves.
Consider the negative words, for example: there is nothing magical about the spike they have in common. If we retrained the network, the spike would appear somewhere else. What gives the shape its "negative" sensation is simply that all negative words share it.
These shapes were molded by training so that every part of the curve conveys a meaning that proves useful for the correlation-summarization task. When we average the curves of multiple word embeddings, the shared meaning holds while opposing meanings cancel out or are weakened.
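A tiny numerical sketch of this cancellation effect (the numbers are made up): a component the two words share survives the average, while components on which they disagree cancel out:

import numpy as np

# hypothetical embeddings: both share a "bump" in the first component,
# but disagree on the second and third
terrible = np.array([0.9,  0.8, -0.7])
boring   = np.array([0.9, -0.8,  0.7])

print(np.mean([terrible, boring], axis=0))  # -> [0.9, 0.0, 0.0]: only the shared bump remains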
The neural network simply looks for the various bumps and variations in the input layer and compresses that information to predict the target label (a missing word, a sentiment, ...).
Through training, the neural network shapes the word embeddings in a way that clusters them, all with the goal of making the target label easier to predict. It compresses the signal in a way that minimizes its loss function (i.e., correctly predicts the target label).
Now consider longer sentences: the longer the sentence, the more its representation gets averaged toward zero, so averaging word embeddings is a bit mushy. In other words, the longer the sentence, the more likely its averaged curve is to flatten into a nearly straight line. In short, storing a sentence in a fixed-length averaged vector doesn't degrade gracefully with sentence length. That said, typical sentences aren't that long anyway.
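A rough sketch of this "mushiness", assuming roughly uncorrelated random vectors as a stand-in for real embeddings (the dimension and sentence lengths here are arbitrary): the average of more and more vectors drifts toward the flat zero line:

import numpy as np

np.random.seed(0)
dim = 50
for n_words in [2, 10, 100, 1000]:
    random_embeddings = np.random.randn(n_words, dim)
    avg = random_embeddings.mean(axis=0)
    # the norm of the average shrinks roughly like 1/sqrt(n_words)
    print(n_words, round(np.linalg.norm(avg), 3))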
The biggest issue with averaged word embeddings is that they lose word order, and with it grammar and syntax. This way of averaging or summing a bunch of word vectors to represent a sentence or text is called the bag-of-words approach: the representation is a "bag" because order isn't preserved, as the sketch below confirms.
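We can check this with the make_sent_vect helper defined above (assuming the earlier cells have been run and these words appear in the IMDB vocabulary): reversing the word order gives exactly the same vector:

# bag-of-words averaging ignores order: both orderings give the same vector
v_forward  = make_sent_vect(['this', 'movie', 'was', 'great'])
v_reversed = make_sent_vect(['great', 'was', 'movie', 'this'])
print(np.allclose(v_forward, v_reversed))  # True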
We want a way to represent sentences that preserves order and yields different vectors for different word orders. More importantly, the way in which order changes the resulting vector should be learned: by learning it, the neural network can find an order representation that minimizes its loss function and captures interesting properties of word order in general.
One of the most famous and successful ways of generating vectors for sequences is the Recurrent Neural Network (RNN). To build up to it, recall one property of identity matrices: if you multiply any vector by an identity matrix, you get the same vector back.
This is the standard way of simply summing up the word embeddings to form a fixed length sentence embedding.
The example above adds another step while summing: it multiplies each partial sum by the identity matrix. We're not doing anything special here; this identity-based way of summing the elements yields the same vector as the first example. But consider this: if the matrices used are anything but the identity matrix, changing the order of the word embeddings will change the resulting vector. Let's see this in code:
import numpy as np
a = np.array([1,2,3])
b = np.array([.1, .2, .3])
c = np.array([-1, -.5, 0])
d = np.array([0, 0, 0])
identity = np.eye(3)
identity
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])
a.dot(identity), b.dot(identity), c.dot(identity), d.dot(identity)
(array([1., 2., 3.]), array([0.1, 0.2, 0.3]), array([-1. , -0.5, 0. ]), array([0., 0., 0.]))
this = np.array([2, 4, 6])
movie = np.array([10, 10, 10])
rocks = np.array([1, 1, 1])
print(this + movie + rocks)
print(((this.dot(identity) + movie).dot(identity) + rocks).dot(identity))
[13 15 17]
[13. 15. 17.]
We got the same results only because the identity matrix is a very special matrix. In fact, the identity matrix is the only matrix guaranteed to return exactly the same vector as the direct sum; no other matrix offers that guarantee.
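To see that any other matrix breaks this guarantee, here's a small sketch reusing `this`, `movie`, and `rocks` from above with a hypothetical random transition matrix (`transition` is illustrative, not a trained weight): with a non-identity matrix, reversing the word order changes the result:

np.random.seed(0)
transition = np.random.rand(3, 3)  # any non-identity matrix

forward  = ((this.dot(transition) + movie).dot(transition) + rocks).dot(transition)
backward = ((rocks.dot(transition) + movie).dot(transition) + this).dot(transition)

print(forward)   # the two orderings now produce different vectors
print(backward)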
Let's remember the goal: generating sequence embeddings that cluster according to the meaning of the sentence, while caring about word order. Our hypothesis is that if we dot each partial sum with a matrix as above, but turn those identity matrices into learnable weight matrices, the network will preserve order and extract meaningful sentence representations that make use of word order.
Whenever we train a neural network, we need to set a task that forces it to learn useful representations. In our case, which task will let the network extract useful sentence embeddings? The answer: train a network that takes a list of words and learns to predict the next word. In other words, a language model.
The idea is to create a sentence embedding, use it to predict the next word, and then modify the respective parts (word embeddings, weight matrices) to make that prediction a little better. The inner components of the neural network are the word embeddings, a recurrent transition matrix, and a classification layer that predicts the next word. The RNN processes data in two steps: it first updates its hidden state by multiplying the previous hidden state by the transition matrix and adding the current word's embedding, and it then feeds the resulting hidden state through the classification layer to predict the next word.
We will initialize the transition matrices as identity matrices; however, they will change throughout training, just like the word embeddings and the weights of the dense classification layer. We'll also constrain the transition matrices to be the same at every timestep, meaning a single shared matrix is reused across the whole sequence:
import numpy as np
def softmax(v):
    v = np.atleast_2d(v)
    return np.exp(v)/np.sum(np.exp(v), axis=1, keepdims=True)
# initial word embeddings
word_vects = {}
word_vects['yankees'] = np.array([[0., 0., 0.]])
word_vects['bears'] = np.array([[0., 0., 0.]])
word_vects['braves'] = np.array([[0., 0., 0.]])
word_vects['red'] = np.array([[0., 0., 0.]])
word_vects['sox'] = np.array([[0., 0., 0.]])
word_vects['lose'] = np.array([[0., 0., 0.]])
word_vects['defeat'] = np.array([[0., 0., 0.]])
word_vects['beat'] = np.array([[0., 0., 0.]])
word_vects['tie'] = np.array([[0., 0., 0.]])
# sentence embedding to output classification weights
sent2output = np.random.rand(3, len(word_vects))
# the transition matrix, initialized to the identity
identity = np.eye(3)
The above code creates 3 sets of weights: the word embeddings (word_vects), the classification weights (sent2output), and the transition matrix (identity). The classification layer predicts the next word over the vocabulary from the sentence vector representation. With these tools, forward propagation is trivial; let's take an example:
layer_0 = word_vects['red']
layer_1 = layer_0.dot(identity) + word_vects['sox']  # fold in each word, one timestep at a time
layer_2 = layer_1.dot(identity) + word_vects['defeat']
pred = softmax(layer_2.dot(sent2output))  # predict the next word over the vocabulary
pred
array([[0.11111111, 0.11111111, 0.11111111, 0.11111111, 0.11111111, 0.11111111, 0.11111111, 0.11111111, 0.11111111]])
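The prediction is perfectly uniform because every word vector is still all zeros, so the logits going into the softmax are all zero, and a softmax over nine equal logits gives 1/9 everywhere:

# all-zero logits -> uniform distribution over the 9 words
print(softmax(np.zeros((1, len(word_vects)))))  # every entry is 1/9 ≈ 0.1111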
To calculate the gradients, we backpropagate from the prediction back through the identity matrices and into the word vectors. Given the output gradient, we compute layer_2_delta, backpropagate it into the embedding at that timestep (word_vects['defeat']), then across the identity matrix to create layer_1_delta, and repeat the process for the earlier timesteps.
y = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0]) # one-hot target vector for 'yankees'
pred_delta = pred - y
layer_2_delta = pred_delta.dot(sent2output.T)
defeat_delta = layer_2_delta * 1  # the "* 1" (derivative of the addition) can be ignored
layer_1_delta = layer_2_delta.dot(identity.T)
sox_delta = layer_1_delta * 1
layer_0_delta = layer_1_delta.dot(identity.T) # which is `red_delta`
lr = .01
word_vects['red'] -= layer_0_delta*lr
word_vects['sox'] -= sox_delta*lr
word_vects['defeat'] -= defeat_delta*lr
# the transition matrix is shared across timesteps, so it accumulates both gradients
identity -= np.outer(layer_0, layer_1_delta)*lr
identity -= np.outer(layer_1, layer_2_delta)*lr
sent2output -= np.outer(layer_2, pred_delta)*lr
word_vects
{'yankees': array([[0., 0., 0.]]), 'bears': array([[0., 0., 0.]]), 'braves': array([[0., 0., 0.]]), 'red': array([[ 0.00127242, -0.00064627, 0.00298193]]), 'sox': array([[ 0.00127242, -0.00064627, 0.00298193]]), 'lose': array([[0., 0., 0.]]), 'defeat': array([[ 0.00127242, -0.00064627, 0.00298193]]), 'beat': array([[0., 0., 0.]]), 'tie': array([[0., 0., 0.]])}
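As a quick sanity check (a sketch assuming the cells above have been run exactly once), we can rerun the forward pass: after the single gradient step, the probability assigned to 'yankees' should now sit slightly above the uniform 1/9:

# rerun the forward pass with the updated weights
layer_0 = word_vects['red']
layer_1 = layer_0.dot(identity) + word_vects['sox']
layer_2 = layer_1.dot(identity) + word_vects['defeat']
pred = softmax(layer_2.dot(sent2output))

# the probability of 'yankees' (index 0) should have nudged above 1/9
print(pred[0][0])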
Let's first train the network on a toy dataset called bAbI. This dataset is a synthetically generated question-answering corpus designed to teach machines to answer questions about an environment. First, let's download and decompress the bAbI dataset using bash commands:
!wget http://www.thespermwhale.com/jaseweston/babi/tasks_1-20_v1-1.tar.gz
--2020-11-13 15:44:18--  http://www.thespermwhale.com/jaseweston/babi/tasks_1-20_v1-1.tar.gz
Resolving www.thespermwhale.com (www.thespermwhale.com)... 69.65.3.211
Connecting to www.thespermwhale.com (www.thespermwhale.com)|69.65.3.211|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1282454 (1.2M) [application/x-gzip]
Saving to: ‘tasks_1-20_v1-1.tar.gz’
tasks_1-20_v1-1.tar 100%[===================>]   1.22M   389KB/s    in 3.2s
2020-11-13 15:44:22 (389 KB/s) - ‘tasks_1-20_v1-1.tar.gz’ saved [1282454/1282454]
!mv tasks_1-20_v1-1.tar.gz static/data/
!tar -xvf static/data/tasks_1-20_v1-1.tar.gz
x tasksv11/ x tasksv11/en/ x tasksv11/LICENSE x tasksv11/README x tasksv11/shuffled/ x tasksv11/shuffled/qa10_indefinite-knowledge_test.txt x tasksv11/shuffled/qa10_indefinite-knowledge_train.txt x tasksv11/shuffled/qa11_basic-coreference_test.txt x tasksv11/shuffled/qa11_basic-coreference_train.txt x tasksv11/shuffled/qa12_conjunction_test.txt x tasksv11/shuffled/qa12_conjunction_train.txt x tasksv11/shuffled/qa13_compound-coreference_test.txt x tasksv11/shuffled/qa13_compound-coreference_train.txt x tasksv11/shuffled/qa14_time-reasoning_test.txt x tasksv11/shuffled/qa14_time-reasoning_train.txt x tasksv11/shuffled/qa15_basic-deduction_test.txt x tasksv11/shuffled/qa15_basic-deduction_train.txt x tasksv11/shuffled/qa16_basic-induction_test.txt x tasksv11/shuffled/qa16_basic-induction_train.txt x tasksv11/shuffled/qa17_positional-reasoning_test.txt x tasksv11/shuffled/qa17_positional-reasoning_train.txt x tasksv11/shuffled/qa18_size-reasoning_test.txt x tasksv11/shuffled/qa18_size-reasoning_train.txt x tasksv11/shuffled/qa19_path-finding_test.txt x tasksv11/shuffled/qa19_path-finding_train.txt x tasksv11/shuffled/qa1_single-supporting-fact_test.txt x tasksv11/shuffled/qa1_single-supporting-fact_train.txt x tasksv11/shuffled/qa20_agents-motivations_test.txt x tasksv11/shuffled/qa20_agents-motivations_train.txt x tasksv11/shuffled/qa2_two-supporting-facts_test.txt x tasksv11/shuffled/qa2_two-supporting-facts_train.txt x tasksv11/shuffled/qa3_three-supporting-facts_test.txt x tasksv11/shuffled/qa3_three-supporting-facts_train.txt x tasksv11/shuffled/qa4_two-arg-relations_test.txt x tasksv11/shuffled/qa4_two-arg-relations_train.txt x tasksv11/shuffled/qa5_three-arg-relations_test.txt x tasksv11/shuffled/qa5_three-arg-relations_train.txt x tasksv11/shuffled/qa6_yes-no-questions_test.txt x tasksv11/shuffled/qa6_yes-no-questions_train.txt x tasksv11/shuffled/qa7_counting_test.txt x tasksv11/shuffled/qa7_counting_train.txt x tasksv11/shuffled/qa8_lists-sets_test.txt x tasksv11/shuffled/qa8_lists-sets_train.txt x tasksv11/shuffled/qa9_simple-negation_test.txt x tasksv11/shuffled/qa9_simple-negation_train.txt x tasksv11/en/qa10_indefinite-knowledge_test.txt x tasksv11/en/qa10_indefinite-knowledge_train.txt x tasksv11/en/qa11_basic-coreference_test.txt x tasksv11/en/qa11_basic-coreference_train.txt x tasksv11/en/qa12_conjunction_test.txt x tasksv11/en/qa12_conjunction_train.txt x tasksv11/en/qa13_compound-coreference_test.txt x tasksv11/en/qa13_compound-coreference_train.txt x tasksv11/en/qa14_time-reasoning_test.txt x tasksv11/en/qa14_time-reasoning_train.txt x tasksv11/en/qa15_basic-deduction_test.txt x tasksv11/en/qa15_basic-deduction_train.txt x tasksv11/en/qa16_basic-induction_test.txt x tasksv11/en/qa16_basic-induction_train.txt x tasksv11/en/qa17_positional-reasoning_test.txt x tasksv11/en/qa17_positional-reasoning_train.txt x tasksv11/en/qa18_size-reasoning_test.txt x tasksv11/en/qa18_size-reasoning_train.txt x tasksv11/en/qa19_path-finding_test.txt x tasksv11/en/qa19_path-finding_train.txt x tasksv11/en/qa1_single-supporting-fact_test.txt x tasksv11/en/qa1_single-supporting-fact_train.txt x tasksv11/en/qa20_agents-motivations_test.txt x tasksv11/en/qa20_agents-motivations_train.txt x tasksv11/en/qa2_two-supporting-facts_test.txt x tasksv11/en/qa2_two-supporting-facts_train.txt x tasksv11/en/qa3_three-supporting-facts_test.txt x tasksv11/en/qa3_three-supporting-facts_train.txt x tasksv11/en/qa4_two-arg-relations_test.txt x tasksv11/en/qa4_two-arg-relations_train.txt x 
tasksv11/en/qa5_three-arg-relations_test.txt x tasksv11/en/qa5_three-arg-relations_train.txt x tasksv11/en/qa6_yes-no-questions_test.txt x tasksv11/en/qa6_yes-no-questions_train.txt x tasksv11/en/qa7_counting_test.txt x tasksv11/en/qa7_counting_train.txt x tasksv11/en/qa8_lists-sets_test.txt x tasksv11/en/qa8_lists-sets_train.txt x tasksv11/en/qa9_simple-negation_test.txt x tasksv11/en/qa9_simple-negation_train.txt
With some simple Python, we can open, clean, & preprocess the data to feed it to the RNN:
import sys, random, math
from collections import Counter
import numpy as np
f = open('static/data/tasksv11/en/qa1_single-supporting-fact_train.txt', 'r')
raw = f.readlines()
f.close()
sentences = list()
for line in raw[:1000]:
    sentences.append(line.lower().replace("\n", "").split(" ")[1:])
sentences[:3]
[['mary', 'moved', 'to', 'the', 'bathroom.'], ['john', 'went', 'to', 'the', 'hallway.'], ['where', 'is', 'mary?', '\tbathroom\t1']]
This dataset contains many simple statements and questions, and each question is followed by the correct answer. When used for QA, the neural network reads the statements and then answers the question. For now, though, our network will simply attempt to finish each sentence when given one or more starting words.
vocab = set()
for sent in sentences:
    for word in sent:
        vocab.add(word)
vocab = list(vocab)

word2index = {}
for i, word in enumerate(vocab):
    word2index[word] = i

def sent2indices(sent):
    indices = list()
    for word in sent:
        indices.append(word2index[word])
    return indices

def softmax(v):
    # subtract the max for numerical stability
    e_v = np.exp(v - np.max(v))
    return e_v / e_v.sum(axis=0)
We set the word embedding size to 10:
np.random.seed(1)
embed_size = 10
word_embeddings = (np.random.rand(len(vocab), embed_size) - 0.5) * 0.1
W0 = np.eye(embed_size)  # recurrent transition matrix, initialized to identity
empty_sentence_embedding = np.zeros(embed_size)  # the initial (empty) hidden state
W1 = (np.random.rand(embed_size, len(vocab)) - 0.5) * 0.1  # decoder: hidden state -> vocabulary
# one-hot target vectors for the loss function
y_hots = np.eye(len(vocab))
W0.shape, W1.shape
((10, 10), (10, 82))
Instead of predicting only the last word, we will make a prediction at every timestep using all previous words. This is more efficient than doing a new forward propagation from the beginning of the phrase every time we want to predict a new word.
def predict(sent):
    layers = list()
    layer = {}
    layer['hidden'] = empty_sentence_embedding
    layers.append(layer)
    loss = 0
    # forward propagation
    preds = list()
    for target_i in range(len(sent)):
        layer = {}
        # try to predict the next term from the previous hidden state
        layer['pred'] = softmax(layers[-1]['hidden'].dot(W1))
        # `sent[target_i]` is the index of the true next word; accumulate its negative log probability
        loss += -np.log(layer['pred'][sent[target_i]])
        # generate the next hidden state
        layer['hidden'] = layers[-1]['hidden'].dot(W0) + word_embeddings[sent[target_i]]
        layers.append(layer)
    return layers, loss
The list called layers represents a new way of doing forward propagation: the longer the sentence, the more forward-propagation steps we end up performing.
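Before implementing backpropagation, we can sanity-check predict on the untrained model (a quick sketch assuming the cells above have been run): with random, near-zero weights, every prediction is roughly uniform over the vocabulary, so the per-word perplexity starts out close to the vocabulary size:

sent = sent2indices(sentences[0][1:])
layers, loss = predict(sent)
print(len(vocab))                # 82 words in the vocabulary
print(np.exp(loss / len(sent)))  # roughly 82 before any training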
Let's implement backpropagation over arbitrary sequence lengths. The most important object is the layers list, in which each layer stores two vectors, layer['pred'] and layer['hidden']. To backpropagate, we take the output gradient and add new entries, layer['output_delta'] and layer['hidden_delta'], which represent the gradients at that layer. We're building the logic so that it can consume the variable-length sequences produced by the forward-propagation logic.
sentences[1][1:]
['went', 'to', 'the', 'hallway.']
for iter in range(30000):
    # forward propagation
    lr = .001
    # get the sentence indices without the 1st word
    # why drop the first word of each sentence? -> the empty starting hidden state takes its place
    sent = sent2indices(sentences[iter%len(sentences)][1:])
    layers, loss = predict(sent) # predicts, returns hidden layer embeddings + loss
    # backpropagation
    for layer_i in reversed(range(len(layers))): # loop over hidden states (layers)
        layer = layers[layer_i] # get the corresponding layer
        target = sent[layer_i - 1] # get the target word
        if (layer_i > 0): # if not the first layer
            layer['output_delta'] = layer['pred'] - y_hots[target] # prediction delta
            new_hidden_delta = layer['output_delta'].dot(W1.transpose()) # gradient through the decoder
            # if this is the last layer, don't pull from a later one, because it doesn't exist
            # each hidden layer's delta depends on its own decoding step + the next layer's gradient
            if(layer_i == len(layers)-1):
                layer['hidden_delta'] = new_hidden_delta
            else:
                layer['hidden_delta'] = new_hidden_delta + layers[layer_i+1]['hidden_delta'].dot(W0.transpose())
        else: # the first layer only receives gradient from the next layer
            layer['hidden_delta'] = layers[layer_i+1]['hidden_delta'].dot(W0.transpose())
Putting it all together, here is the complete training loop, including the weight updates:
for iter in range(30000):
    # forward propagation
    lr = .001
    # get the sentence indices without the 1st word
    # why drop the first word of each sentence? -> the empty starting hidden state takes its place
    sent = sent2indices(sentences[iter%len(sentences)][1:])
    layers, loss = predict(sent) # predicts, returns hidden layer embeddings + loss
    # backpropagation
    for layer_i in reversed(range(len(layers))): # loop over hidden states (layers)
        layer = layers[layer_i] # get the corresponding layer
        target = sent[layer_i - 1] # get the target word
        if (layer_i > 0): # if not the first layer
            layer['output_delta'] = layer['pred'] - y_hots[target] # prediction delta
            new_hidden_delta = layer['output_delta'].dot(W1.transpose()) # gradient through the decoder
            # if this is the last layer, don't pull from a later one, because it doesn't exist
            # each hidden layer's delta depends on its own decoding step + the next layer's gradient
            if(layer_i == len(layers)-1):
                layer['hidden_delta'] = new_hidden_delta
            else:
                layer['hidden_delta'] = new_hidden_delta + layers[layer_i+1]['hidden_delta'].dot(W0.transpose())
        else: # the first layer only receives gradient from the next layer
            layer['hidden_delta'] = layers[layer_i+1]['hidden_delta'].dot(W0.transpose())
    # update the embedding that stands in for the empty starting hidden state
    empty_sentence_embedding -= layers[0]['hidden_delta'] * lr / float(len(sent))
    for layer_i, layer in enumerate(layers[1:]):
        # update the decoder
        W1 -= np.outer(layers[layer_i]["hidden"], layer['output_delta'])*lr / float(len(sent))
        embed_i = sent[layer_i]
        # update the word embeddings
        word_embeddings[embed_i] -= layers[layer_i]['hidden_delta']*lr/float(len(sent))
        # update the recurrent (transition) matrix
        W0 -= np.outer(layers[layer_i]['hidden'], layer['hidden_delta']) * lr / float(len(sent))
    if (iter % 1000 == 0):
        print("Perplexity :" + str(np.exp(loss/len(sent))))
Perplexity :81.95028707568486
Perplexity :81.78060980022715
Perplexity :81.51436435831427
Perplexity :81.01004082432779
Perplexity :79.96135229022907
Perplexity :77.53659835083533
Perplexity :70.45797177871069
Perplexity :41.73186994211299
Perplexity :24.3413252435313
Perplexity :19.820219349001245
Perplexity :18.502932946608617
Perplexity :17.16337351696595
Perplexity :15.200622187379624
Perplexity :12.259293966301598
Perplexity :8.975708726197752
Perplexity :7.120714615818728
Perplexity :6.122010097753815
Perplexity :5.486491492025226
Perplexity :5.079309978044186
Perplexity :4.798546702336612
Perplexity :4.611469221507808
Perplexity :4.5010239902745655
Perplexity :4.440082283729496
Perplexity :4.387575014051199
Perplexity :4.324829616690472
Perplexity :4.251929353507349
Perplexity :4.174533982649985
Perplexity :4.098962793552853
Perplexity :4.03059393832612
Perplexity :3.9706506361787377
In information theory, perplexity is a measurement of how well a probability distribution or probability model predicts a sample; it can be used to compare probability models. A low perplexity indicates that the model is good at predicting the sample.
Concretely, perplexity takes the probability assigned to the correct label (word), passes it through a log, negates it, and exponentiates the average (e^x). It measures the difference between two probability distributions: it is high when the predicted and true distributions don't match, and low (approaching 1) when the two distributions are close to each other. However, this metric tells us little about what's going on inside the weights.
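In code, this is exactly what the training loop prints: the average negative log probability of the correct words, exponentiated. A minimal sketch of the computation (p_correct is a hypothetical list of probabilities assigned to the correct word at each timestep):

import numpy as np

# hypothetical probabilities assigned to the correct word at each timestep
p_correct = np.array([0.25, 0.1, 0.5, 0.05])

loss = -np.sum(np.log(p_correct))           # summed cross-entropy, as in predict()
perplexity = np.exp(loss / len(p_correct))  # == geometric mean of 1/p_correct
print(perplexity)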
Perplexity has faced some criticism over the years (particularly in the language-modeling community) for being overused as a metric.
Let's look a little more closely at the predictions:
sent_index = 4
l, _ = predict(sent2indices(sentences[sent_index]))
print(sentences[sent_index])
['sandra', 'moved', 'to', 'the', 'garden.']
for i, each_layer in enumerate(l[1:-1]):
    input = sentences[sent_index][i]
    true = sentences[sent_index][i+1]
    pred = vocab[each_layer['pred'].argmax()]
    print("Prev Input:" + input + (' ' * (12 - len(input))) + "True: " + true + (' ' * (15 - len(true))) + "Pred: " + pred)
Prev Input:sandra True: moved Pred: is
Prev Input:moved True: to Pred: to
Prev Input:to True: the Pred: the
Prev Input:the True: garden. Pred: bedroom.
This code takes a sentence and, at each step, prints the word the model thinks is most likely to come next. Watching the network's predictions as it trains shows not only what kinds of patterns it picks up, but also the order in which it picks them up. Neural networks tend to start off random, but after some training an RNN typically latches onto the most frequent word and predicts it as a default (this is an extremely common failure mode in RNNs). It's important to note that there is almost no way this network can predict the next word perfectly; more context would be needed to solve the task. But the fact that the task is unsolvable makes analyzing the ways in which it fails educational.
In this chapter, we learned how to create vector representations of arbitrarily long sentences.
One question remains: how does a neural network fit a variable-length sequence into a fixed-size vector? The truth is, sentence vectors don't encode everything about the sequence. The name of the game in RNNs is not just what these vectors remember, but also what they forget.
In the case of predicting the next word, the RNN learns that only the last few words really matter.