In this chapter, we will create vector representations for variable-length sentences, see why averaging word embeddings works but loses word order, and build a simple recurrent neural network that learns order-aware sentence representations.
[Andrej Karpathy] "There is something magical about recurrent neural networks."
In this chapter, we'll expand on the intuition of an embedding conveying the meaning of a word by creating embeddings that convey the meaning of variable-length phrases and sentences. If we want a vector that holds an entire sequence of symbols in the same way a word embedding holds information about a single word, how could we build it?
If we concatenated or stacked each word embedding in the sentence, we would get a vector that could represent the whole sentence.
But this approach leaves something to be desired: longer sentences produce more stacked embeddings, so two sentences of different lengths have different shapes and can't be compared directly. In theory, two sentences with the same meaning should be very similar, and comparing their vectors should indicate high similarity.
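As a quick illustration (a minimal sketch with made-up embeddings; `sentence_a`, `sentence_b`, and `embedding_dim` are illustrative, not part of the code below), stacking word vectors gives arrays whose size depends on sentence length, so two sentences of different lengths can't be compared element-wise:

import numpy as np

np.random.seed(0)
embedding_dim = 4
# hypothetical embeddings for two sentences of different lengths
sentence_a = np.random.rand(5, embedding_dim)   # 5 words
sentence_b = np.random.rand(8, embedding_dim)   # 8 words

stacked_a = sentence_a.reshape(-1)  # shape (20,)
stacked_b = sentence_b.reshape(-1)  # shape (32,)
print(stacked_a.shape, stacked_b.shape)  # different shapes -> no direct comparison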
The act of comparing two vectors is useful because it lets us understand what the network sees. Even though we can't read the vectors directly, we can check their similarity. In this case, we are interested in analyzing the perspective of the neural network.
We want to create sentence vectors that are useful for predicting things about the sentence. At a minimum, similar sentences should result in similar vector representations.
But what about averaging the word vectors of the sentence? Averaging turns out to be a strong technique for creating sentence embeddings, although not a perfect one.
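Here's a minimal sketch of the idea with tiny made-up embeddings (`toy_embeddings`, `avg_vector`, and `cosine` are illustrative helpers, not part of the code that follows): averaging produces a fixed-size vector regardless of sentence length, and similar sentences end up with similar averages:

import numpy as np

# hypothetical 3-dimensional word embeddings
toy_embeddings = {
    'the':   np.array([0.1, 0.0, 0.1]),
    'movie': np.array([0.5, 0.2, 0.1]),
    'film':  np.array([0.5, 0.3, 0.1]),
    'was':   np.array([0.0, 0.1, 0.0]),
    'great': np.array([0.1, 0.9, 0.8]),
    'good':  np.array([0.2, 0.8, 0.7]),
}

def avg_vector(sentence):
    # average the word embeddings into one fixed-size sentence vector
    return np.mean([toy_embeddings[w] for w in sentence.split()], axis=0)

def cosine(a, b):
    return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

v1 = avg_vector('the movie was great')
v2 = avg_vector('the film was good')
print(v1.shape, v2.shape)  # both (3,), regardless of sentence length
print(cosine(v1, v2))      # close to 1.0 for these two similar sentences

With that intuition in place, let's train real word embeddings on the IMDB reviews dataset so we have something meaningful to average: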
import sys, random, math
from collections import Counter
import numpy as np
np.random.seed(1)
random.seed(1)
IMDB_PATH = '/Users/mohamedakramzaytar/data/2019/Q2/kaggle/IMDB/reviews.txt'
IMDB_LABEL_PATH = '/Users/mohamedakramzaytar/data/2019/Q2/kaggle/IMDB/labels.txt'
f = open(IMDB_PATH, 'r')
raw_reviews = f.readlines()
f.close()
# tokenize each review into a list of words
tokens = list(map(lambda x: x.split(" "), raw_reviews))

# count tokens across all reviews (counts are stored as negatives; only the vocabulary itself is used below)
word_counter = Counter()
for review in tokens:
    for token in review:
        word_counter[token] -= 1
_ = word_counter.most_common()

# build the vocabulary and a word -> index lookup
vocab = list(set(map(lambda x: x[0], word_counter.most_common())))
word2index = {}
for i, word in enumerate(vocab):
    word2index[word] = i
# convert each review to a list of word indices;
# `concatenated` holds every index in the corpus (used for drawing negative samples)
concatenated = list()
input_dataset = list()
for review in tokens:
    review_indices = list()
    for token in review:
        try:
            review_indices.append(word2index[token])
            concatenated.append(word2index[token])
        except KeyError:
            pass
    input_dataset.append(review_indices)
concatenated = np.array(concatenated)
random.shuffle(input_dataset)
# hyperparameters: learning rate, epochs, embedding size, context window, negative samples
lr, epochs = (.05, 2)
hidden_size, window, negative = 50, 2, 5

# W0: word embeddings, W1: output weights used for the negative-sampling predictions
W0 = (np.random.rand(len(vocab), hidden_size) - 0.5) * 0.2
W1 = np.random.rand(len(vocab), hidden_size) * 0

# target for each prediction: 1 for the true word, 0 for the negative samples
layer_2_target = np.zeros(negative + 1)
layer_2_target[0] = 1
def similar(target='beautiful'):
    # rank words by (negative) Euclidean distance to the target's embedding
    target_index = word2index[target]
    scores = Counter()
    for word, index in word2index.items():
        raw_difference = W0[index] - W0[target_index]
        squared_difference = raw_difference * raw_difference
        scores[word] = -math.sqrt(sum(squared_difference))
    return scores.most_common(10)
def sigmoid(x):
    return 1 / (1 + np.exp(-x))
for review_i, review in enumerate(input_dataset * epochs):
    for target_i in range(len(review)):
        # the true target word plus `negative` randomly drawn words
        target_samples = [review[target_i]] + list(concatenated[(np.random.rand(negative)*len(concatenated)).astype('int').tolist()])
        # surrounding context words
        left_context = review[max(0, target_i-window):target_i]
        right_context = review[target_i+1: min(len(review), target_i+window)]
        # average the context embeddings, then predict the target vs. the negative samples
        layer_1 = np.mean(W0[left_context+right_context], axis=0)
        layer_2 = sigmoid(layer_1.dot(W1[target_samples].T))
        # backpropagate and update both weight matrices
        layer_2_delta = layer_2 - layer_2_target
        layer_1_delta = layer_2_delta.dot(W1[target_samples])
        W0[left_context+right_context] -= layer_1_delta*lr
        W1[target_samples] -= np.outer(layer_2_delta, layer_1)*lr
    if(review_i % 5000 == 0):
        print('\rProgress:'+str(review_i/float(len(input_dataset)*epochs)) + " " + str(similar('terrible')))
print(similar('terrible'))
Progress:0.0 [('terrible', -0.0), ('genital', -0.39326532865973723), ('midkoff', -0.3938513073457206), ('ethan', -0.3968892076212083), ('streep', -0.3979724210300289), ('bjorlin', -0.3981196278255343), ('injure', -0.3982249442410485), ('munchies', -0.39891945750964025), ('orgolini', -0.40061746244471097), ('vega', -0.4015861557887159)]
Progress:0.1 [('terrible', -0.0), ('unique', -1.927270765664731), ('magnificent', -2.045146852188629), ('student', -2.062339523062212), ('deeply', -2.0687588315141556), ('fantastic', -2.08371892800822), ('rare', -2.0969774764288616), ('tough', -2.128273679772459), ('worthwhile', -2.149016740284797), ('dreadful', -2.1524088693570316)]
Progress:0.2 [('terrible', -0.0), ('horrible', -3.108306431979032), ('brilliant', -3.396949380238907), ('fantastic', -3.4444297632941083), ('compelling', -3.6848868031292823), ('lame', -3.6906613668414274), ('remarkable', -3.7214051894822826), ('great', -3.749394484829831), ('superb', -3.764204679513025), ('tedious', -3.78968255882158)]
Progress:0.3 [('terrible', -0.0), ('horrible', -3.498158178893929), ('tremendous', -3.761329799226001), ('fantastic', -3.7628561218791723), ('mediocre', -3.766116610681203), ('laughable', -3.7805893810816893), ('marvelous', -3.813188768185005), ('dreadful', -3.8134367604842216), ('breathtaking', -3.862370005681857), ('spectacular', -3.8694653810324677)]
Progress:0.4 [('terrible', -0.0), ('horrible', -2.967964656933557), ('brilliant', -3.5420185698927042), ('pathetic', -3.7138289510002624), ('remarkable', -3.770923141465496), ('superb', -3.7941187646097148), ('fantastic', -3.8364360547450773), ('laughable', -3.845696114343685), ('great', -3.906869347583601), ('bad', -3.921186208070328)]
Progress:0.5 [('terrible', -0.0), ('horrible', -3.1989896124441066), ('brilliant', -3.3647957125926413), ('superb', -3.446848086835425), ('pathetic', -3.6557981030906634), ('fantastic', -3.673566586441344), ('lame', -3.8944061428190775), ('remarkable', -4.004842620824482), ('marvelous', -4.053430718224174), ('great', -4.0758362861596575)]
Progress:0.6 [('terrible', -0.0), ('horrible', -2.7739353122626755), ('fantastic', -3.050940700260329), ('brilliant', -3.232619045738981), ('pathetic', -3.5566710306301847), ('superb', -3.713381489932097), ('lame', -3.739970542211976), ('marvelous', -3.841008362820846), ('haunting', -3.846184382564499), ('magnificent', -3.847871555156092)]
Progress:0.7 [('terrible', -0.0), ('horrible', -2.5027876972527996), ('brilliant', -3.387431880847732), ('fantastic', -3.805817922179162), ('superb', -3.8359369916740205), ('magnificent', -3.8404436166584484), ('pathetic', -3.9100961380425856), ('laughable', -3.9562924360256875), ('marvelous', -3.9658570094816135), ('lame', -4.01274210326551)]
Progress:0.8 [('terrible', -0.0), ('horrible', -3.3525614136062547), ('fantastic', -3.787226665684703), ('dire', -3.812683534917237), ('brilliant', -3.824813307264604), ('dreadful', -3.9178990948548855), ('laughable', -3.9454124321935535), ('fabulous', -3.9960864707094257), ('miserable', -4.04534807188886), ('phenomenal', -4.051791600383227)]
Progress:0.9 [('terrible', -0.0), ('horrible', -2.8685809548008545), ('brilliant', -3.2730169748091433), ('horrendous', -3.675608725486543), ('dreadful', -3.7564170172164517), ('phenomenal', -3.7578615711368135), ('bad', -3.7787265252348763), ('marvelous', -3.815732576454664), ('fantastic', -3.9545000164096757), ('great', -3.975672000914959)]
[('terrible', -0.0), ('horrible', -2.7898917830688132), ('brilliant', -3.3469631230517645), ('phenomenal', -3.690963614102198), ('pathetic', -3.795400367744756), ('marvelous', -3.9134681822713833), ('superb', -3.920458050440294), ('masterful', -3.979517882443988), ('bad', -3.9993630029279785), ('mediocre', -4.138258710066284)]
Let's experiment with average sentence embeddings:
import numpy as np

# scale each embedding by its squared norm
norms = np.sum(W0 * W0, axis=1)
norms.resize(norms.shape[0], 1)
normed_weights = W0 * norms

def make_sent_vect(words):
    # keep only words that are in the vocabulary, then look up their indices
    indices = list(map(lambda x: word2index[x], filter(lambda x: x in word2index, words)))
    # the sentence vector is the mean of the (scaled) word embeddings
    return np.mean(normed_weights[indices], axis=0)

reviews_to_vectors = list()
for review in tokens:
    reviews_to_vectors.append(make_sent_vect(review))
reviews_to_vectors = np.array(reviews_to_vectors)
def most_similar_reviews(review):
    # score every review by the dot product of its vector with the query vector
    v = make_sent_vect(review)
    scores = Counter()
    for i, val in enumerate(reviews_to_vectors.dot(v)):
        scores[i] = val
    # return the first 40 characters of the three highest-scoring reviews
    most_similar = list()
    for idx, score in scores.most_common(3):
        most_similar.append(raw_reviews[idx][0:40])
    return most_similar
most_similar_reviews(['beautiful', 'amazing'])
['i have never seen such terrible performa', 'i think this show is definitely the grea', 'if you havn t seen this movie i highly ']
There appears to be interesting statistical information within these vectors, such as the way negative & positive embeddings cluster together.
To understand the meaning of embeddings, let's visualize a word vector as a squiggly line, like this one:
Instead of thinking of a vector as just a list of numbers, we can visualize it as a line that goes through low and high points corresponding to the low and high numbers in the vector. If we selected several words from the corpus, they might look like these:
By comparing the curves of various words, we notice that each word's shape is unique; however, "terrible" & "boring" share certain similarities in their shapes. The most interesting thing is that parts of these squiggles carry meaning in and of themselves.
Consider the negative words, for example: there is nothing magical about the spike they have in common. If we retrained the network, the spike would appear somewhere else. What gives the shape its "negative" sensation is simply that all negative words share it.
These shapes were molded by training so that every part of the curve conveys a meaning that proves useful for the correlation-summarization task. When we average the curves of multiple word embeddings, the shared meaning holds while opposing meanings cancel out or are weakened.
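A tiny numerical sketch of this cancellation effect (the numbers are made up): a component the two words share survives the average, while components on which they disagree cancel out:

import numpy as np

# hypothetical embeddings: both share a "bump" in the first component,
# but disagree on the second and third
terrible = np.array([0.9,  0.8, -0.7])
boring   = np.array([0.9, -0.8,  0.7])

print(np.mean([terrible, boring], axis=0))  # -> [0.9, 0.0, 0.0]: only the shared bump remains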
The neural network simply looks for the various bumps and variations in the input layer and compresses that information to predict the target label (a missing word, a sentiment, ...).
Through training, the neural network shapes the word embeddings in a way that clusters them, all with the goal of making the target label easier to predict. It compresses the signal in a way that minimizes its loss function (i.e., correctly predicts the target label).
Now consider longer sentences: the longer the sentence, the more its representation gets averaged toward zero, so averaging word embeddings is a bit mushy. In other words, the longer the sentence, the more likely its averaged curve is to flatten into a nearly straight line. In short, storing a sentence in a fixed-length averaged vector doesn't degrade gracefully with sentence length. That said, typical sentences aren't that long anyway.
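A rough sketch of this "mushiness", assuming roughly uncorrelated random vectors as a stand-in for real embeddings (the dimension and sentence lengths here are arbitrary): the average of more and more vectors drifts toward the flat zero line:

import numpy as np

np.random.seed(0)
dim = 50
for n_words in [2, 10, 100, 1000]:
    random_embeddings = np.random.randn(n_words, dim)
    avg = random_embeddings.mean(axis=0)
    # the norm of the average shrinks roughly like 1/sqrt(n_words)
    print(n_words, round(np.linalg.norm(avg), 3))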
The biggest issue with averaged word embeddings is that they lose word order, and with it grammar and syntax. This way of averaging or summing a bunch of word vectors to represent a sentence or text is called the bag-of-words approach: the representation is a "bag" because order isn't preserved, as the sketch below confirms.
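We can check this with the make_sent_vect helper defined above (assuming the earlier cells have been run and these words appear in the IMDB vocabulary): reversing the word order gives exactly the same vector:

# bag-of-words averaging ignores order: both orderings give the same vector
v_forward  = make_sent_vect(['this', 'movie', 'was', 'great'])
v_reversed = make_sent_vect(['great', 'was', 'movie', 'this'])
print(np.allclose(v_forward, v_reversed))  # True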
We want a way to represent sentences that preserves order and yields different vectors for different word orders. More importantly, the way in which order changes the resulting vector should be learned: by learning it, the neural network can find an order representation that minimizes its loss function and captures interesting properties of word order in general.
One of the most famous and successful ways of generating vectors for sequences is the Recurrent Neural Network (RNN). To build up to it, recall one property of identity matrices: if you multiply any vector by an identity matrix, you get the same vector back.
This is the standard way of simply summing up the word embeddings to form a fixed length sentence embedding.
The example above adds another step while summing: it multiplies each partial sum by the identity matrix. We're not doing anything special here; this identity-based way of summing the elements yields the same vector as the first example. But consider this: if the matrices used are anything but the identity matrix, changing the order of the word embeddings will change the resulting vector. Let's see this in code:
import numpy as np
a = np.array([1,2,3])
b = np.array([.1, .2, .3])
c = np.array([-1, -.5, 0])
d = np.array([0, 0, 0])
identity = np.eye(3)
identity
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])
a.dot(identity), b.dot(identity), c.dot(identity), d.dot(identity)
(array([1., 2., 3.]), array([0.1, 0.2, 0.3]), array([-1. , -0.5, 0. ]), array([0., 0., 0.]))
this = np.array([2, 4, 6])
movie = np.array([10, 10, 10])
rocks = np.array([1, 1, 1])
print(this + movie + rocks)
print(((this.dot(identity) + movie).dot(identity) + rocks).dot(identity))
[13 15 17]
[13. 15. 17.]
We got the same results only because the identity matrix is a very special matrix. In fact, the identity matrix is the only matrix guaranteed to return exactly the same vector as the direct sum; no other matrix offers that guarantee.
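To see that any other matrix breaks this guarantee, here's a small sketch reusing `this`, `movie`, and `rocks` from above with a hypothetical random transition matrix (`transition` is illustrative, not a trained weight): with a non-identity matrix, reversing the word order changes the result:

np.random.seed(0)
transition = np.random.rand(3, 3)  # any non-identity matrix

forward  = ((this.dot(transition) + movie).dot(transition) + rocks).dot(transition)
backward = ((rocks.dot(transition) + movie).dot(transition) + this).dot(transition)

print(forward)   # the two orderings now produce different vectors
print(backward)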
Let's remember the goal: generating sequence embeddings that cluster according to the meaning of the sentence, while caring about word order. Our hypothesis is that if we dot each partial sum with a matrix as above, but turn those identity matrices into learnable weight matrices, the network will preserve order and extract meaningful sentence representations that make use of word order.
Whenever we train a neural network, we need to set a task that forces it to learn useful representations. In our case, which task will let the network extract useful sentence embeddings? The answer: train a network that takes a list of words and learns to predict the next word. In other words, a language model.
The idea is to create a sentence embedding, use it to predict the next word, and then modify the respective parts (word embeddings, weight matrices) to make that prediction a little better. The inner components of the neural network are the word embeddings, a recurrent transition matrix, and a classification layer that predicts the next word. The RNN processes data in two steps: it first updates its hidden state by multiplying the previous hidden state by the transition matrix and adding the current word's embedding, and it then feeds the resulting hidden state through the classification layer to predict the next word.
We will initialize the transition matrices as identity matrices; however, they will change throughout training, just like the word embeddings and the weights of the dense classification layer. We'll also constrain the transition matrices to be the same at every timestep, meaning a single shared matrix is reused across the whole sequence:
import numpy as np
def softmax(v):
    v = np.atleast_2d(v)
    return np.exp(v)/np.sum(np.exp(v), axis=1, keepdims=True)
# initial word embeddings
word_vects = {}
word_vects['yankees'] = np.array([[0., 0., 0.]])
word_vects['bears'] = np.array([[0., 0., 0.]])
word_vects['braves'] = np.array([[0., 0., 0.]])
word_vects['red'] = np.array([[0., 0., 0.]])
word_vects['sox'] = np.array([[0., 0., 0.]])
word_vects['lose'] = np.array([[0., 0., 0.]])
word_vects['defeat'] = np.array([[0., 0., 0.]])
word_vects['beat'] = np.array([[0., 0., 0.]])
word_vects['tie'] = np.array([[0., 0., 0.]])
# sentence embedding to output classification weights
sent2output = np.random.rand(3, len(word_vects))
# the transition matrix, initialized to the identity
identity = np.eye(3)
The above code creates 3 sets of weights: the word embeddings (word_vects), the classification weights (sent2output), and the transition matrix (identity). The classification layer predicts the next word over the vocabulary from the sentence vector representation. With these tools, forward propagation is trivial; let's take an example:
layer_0 = word_vects['red']
layer_1 = layer_0.dot(identity) + word_vects['sox']  # fold in each word, one timestep at a time
layer_2 = layer_1.dot(identity) + word_vects['defeat']
pred = softmax(layer_2.dot(sent2output))  # predict the next word over the vocabulary
pred
array([[0.11111111, 0.11111111, 0.11111111, 0.11111111, 0.11111111, 0.11111111, 0.11111111, 0.11111111, 0.11111111]])
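The prediction is perfectly uniform because every word vector is still all zeros, so the logits going into the softmax are all zero, and a softmax over nine equal logits gives 1/9 everywhere:

# all-zero logits -> uniform distribution over the 9 words
print(softmax(np.zeros((1, len(word_vects)))))  # every entry is 1/9 ≈ 0.1111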
To calculate the gradients, we backpropagate from the prediction back through the identity matrices and into the word vectors. Given the output gradient, we compute layer_2_delta, backpropagate it into the embedding at that timestep (word_vects['defeat']), then across the identity matrix to create layer_1_delta, and repeat the process for the earlier timesteps.
y = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0]) # one-hot target vector for 'yankees'
pred_delta = pred - y
layer_2_delta = pred_delta.dot(sent2output.T)
defeat_delta = layer_2_delta * 1  # the "* 1" (derivative of the addition) can be ignored
layer_1_delta = layer_2_delta.dot(identity.T)
sox_delta = layer_1_delta * 1
layer_0_delta = layer_1_delta.dot(identity.T) # which is `red_delta`
lr = .01
word_vects['red'] -= layer_0_delta*lr
word_vects['sox'] -= sox_delta*lr
word_vects['defeat'] -= defeat_delta*lr
# the transition matrix is shared across timesteps, so it accumulates both gradients
identity -= np.outer(layer_0, layer_1_delta)*lr
identity -= np.outer(layer_1, layer_2_delta)*lr
sent2output -= np.outer(layer_2, pred_delta)*lr
word_vects
{'yankees': array([[0., 0., 0.]]), 'bears': array([[0., 0., 0.]]), 'braves': array([[0., 0., 0.]]), 'red': array([[ 0.00127242, -0.00064627, 0.00298193]]), 'sox': array([[ 0.00127242, -0.00064627, 0.00298193]]), 'lose': array([[0., 0., 0.]]), 'defeat': array([[ 0.00127242, -0.00064627, 0.00298193]]), 'beat': array([[0., 0., 0.]]), 'tie': array([[0., 0., 0.]])}
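As a quick sanity check (a sketch assuming the cells above have been run exactly once), we can rerun the forward pass: after the single gradient step, the probability assigned to 'yankees' should now sit slightly above the uniform 1/9:

# rerun the forward pass with the updated weights
layer_0 = word_vects['red']
layer_1 = layer_0.dot(identity) + word_vects['sox']
layer_2 = layer_1.dot(identity) + word_vects['defeat']
pred = softmax(layer_2.dot(sent2output))

# the probability of 'yankees' (index 0) should have nudged above 1/9
print(pred[0][0])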
Let's first train the network on a toy dataset called bAbI. This dataset is a synthetically generated question-answering corpus designed to teach machines to answer questions about an environment. First, let's download and decompress the bAbI dataset using bash commands:
!wget http://www.thespermwhale.com/jaseweston/babi/tasks_1-20_v1-1.tar.gz
--2020-11-13 15:44:18--  http://www.thespermwhale.com/jaseweston/babi/tasks_1-20_v1-1.tar.gz
Resolving www.thespermwhale.com (www.thespermwhale.com)... 69.65.3.211
Connecting to www.thespermwhale.com (www.thespermwhale.com)|69.65.3.211|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1282454 (1.2M) [application/x-gzip]
Saving to: ‘tasks_1-20_v1-1.tar.gz’
tasks_1-20_v1-1.tar 100%[===================>]   1.22M   389KB/s    in 3.2s
2020-11-13 15:44:22 (389 KB/s) - ‘tasks_1-20_v1-1.tar.gz’ saved [1282454/1282454]
!mv tasks_1-20_v1-1.tar.gz static/data/
!tar -xvf static/data/tasks_1-20_v1-1.tar.gz
x tasksv11/ x tasksv11/en/ x tasksv11/LICENSE x tasksv11/README x tasksv11/shuffled/ x tasksv11/shuffled/qa10_indefinite-knowledge_test.txt x tasksv11/shuffled/qa10_indefinite-knowledge_train.txt x tasksv11/shuffled/qa11_basic-coreference_test.txt x tasksv11/shuffled/qa11_basic-coreference_train.txt x tasksv11/shuffled/qa12_conjunction_test.txt x tasksv11/shuffled/qa12_conjunction_train.txt x tasksv11/shuffled/qa13_compound-coreference_test.txt x tasksv11/shuffled/qa13_compound-coreference_train.txt x tasksv11/shuffled/qa14_time-reasoning_test.txt x tasksv11/shuffled/qa14_time-reasoning_train.txt x tasksv11/shuffled/qa15_basic-deduction_test.txt x tasksv11/shuffled/qa15_basic-deduction_train.txt x tasksv11/shuffled/qa16_basic-induction_test.txt x tasksv11/shuffled/qa16_basic-induction_train.txt x tasksv11/shuffled/qa17_positional-reasoning_test.txt x tasksv11/shuffled/qa17_positional-reasoning_train.txt x tasksv11/shuffled/qa18_size-reasoning_test.txt x tasksv11/shuffled/qa18_size-reasoning_train.txt x tasksv11/shuffled/qa19_path-finding_test.txt x tasksv11/shuffled/qa19_path-finding_train.txt x tasksv11/shuffled/qa1_single-supporting-fact_test.txt x tasksv11/shuffled/qa1_single-supporting-fact_train.txt x tasksv11/shuffled/qa20_agents-motivations_test.txt x tasksv11/shuffled/qa20_agents-motivations_train.txt x tasksv11/shuffled/qa2_two-supporting-facts_test.txt x tasksv11/shuffled/qa2_two-supporting-facts_train.txt x tasksv11/shuffled/qa3_three-supporting-facts_test.txt x tasksv11/shuffled/qa3_three-supporting-facts_train.txt x tasksv11/shuffled/qa4_two-arg-relations_test.txt x tasksv11/shuffled/qa4_two-arg-relations_train.txt x tasksv11/shuffled/qa5_three-arg-relations_test.txt x tasksv11/shuffled/qa5_three-arg-relations_train.txt x tasksv11/shuffled/qa6_yes-no-questions_test.txt x tasksv11/shuffled/qa6_yes-no-questions_train.txt x tasksv11/shuffled/qa7_counting_test.txt x tasksv11/shuffled/qa7_counting_train.txt x tasksv11/shuffled/qa8_lists-sets_test.txt x tasksv11/shuffled/qa8_lists-sets_train.txt x tasksv11/shuffled/qa9_simple-negation_test.txt x tasksv11/shuffled/qa9_simple-negation_train.txt x tasksv11/en/qa10_indefinite-knowledge_test.txt x tasksv11/en/qa10_indefinite-knowledge_train.txt x tasksv11/en/qa11_basic-coreference_test.txt x tasksv11/en/qa11_basic-coreference_train.txt x tasksv11/en/qa12_conjunction_test.txt x tasksv11/en/qa12_conjunction_train.txt x tasksv11/en/qa13_compound-coreference_test.txt x tasksv11/en/qa13_compound-coreference_train.txt x tasksv11/en/qa14_time-reasoning_test.txt x tasksv11/en/qa14_time-reasoning_train.txt x tasksv11/en/qa15_basic-deduction_test.txt x tasksv11/en/qa15_basic-deduction_train.txt x tasksv11/en/qa16_basic-induction_test.txt x tasksv11/en/qa16_basic-induction_train.txt x tasksv11/en/qa17_positional-reasoning_test.txt x tasksv11/en/qa17_positional-reasoning_train.txt x tasksv11/en/qa18_size-reasoning_test.txt x tasksv11/en/qa18_size-reasoning_train.txt x tasksv11/en/qa19_path-finding_test.txt x tasksv11/en/qa19_path-finding_train.txt x tasksv11/en/qa1_single-supporting-fact_test.txt x tasksv11/en/qa1_single-supporting-fact_train.txt x tasksv11/en/qa20_agents-motivations_test.txt x tasksv11/en/qa20_agents-motivations_train.txt x tasksv11/en/qa2_two-supporting-facts_test.txt x tasksv11/en/qa2_two-supporting-facts_train.txt x tasksv11/en/qa3_three-supporting-facts_test.txt x tasksv11/en/qa3_three-supporting-facts_train.txt x tasksv11/en/qa4_two-arg-relations_test.txt x tasksv11/en/qa4_two-arg-relations_train.txt x 
tasksv11/en/qa5_three-arg-relations_test.txt x tasksv11/en/qa5_three-arg-relations_train.txt x tasksv11/en/qa6_yes-no-questions_test.txt x tasksv11/en/qa6_yes-no-questions_train.txt x tasksv11/en/qa7_counting_test.txt x tasksv11/en/qa7_counting_train.txt x tasksv11/en/qa8_lists-sets_test.txt x tasksv11/en/qa8_lists-sets_train.txt x tasksv11/en/qa9_simple-negation_test.txt x tasksv11/en/qa9_simple-negation_train.txt
With some simple Python, we can open, clean, & preprocess the data to feed it to the RNN:
import sys, random, math
from collections import Counter
import numpy as np
f = open('static/data/tasksv11/en/qa1_single-supporting-fact_train.txt', 'r')
raw = f.readlines()
f.close()
sentences = list()
for line in raw[:1000]:
    sentences.append(line.lower().replace("\n", "").split(" ")[1:])
sentences[:3]
[['mary', 'moved', 'to', 'the', 'bathroom.'], ['john', 'went', 'to', 'the', 'hallway.'], ['where', 'is', 'mary?', '\tbathroom\t1']]
This dataset contains many simple statements and questions, and each question is followed by the correct answer. When used for QA, the neural network reads the statements and then answers the question. For now, though, our network will simply attempt to finish each sentence when given one or more starting words.
vocab = set()
for sent in sentences:
    for word in sent:
        vocab.add(word)
vocab = list(vocab)

word2index = {}
for i, word in enumerate(vocab):
    word2index[word] = i

def sent2indices(sent):
    indices = list()
    for word in sent:
        indices.append(word2index[word])
    return indices

def softmax(v):
    # subtract the max for numerical stability
    e_v = np.exp(v - np.max(v))
    return e_v / e_v.sum(axis=0)
We set the word embedding size to 10:
np.random.seed(1)
embed_size = 10
word_embeddings = (np.random.rand(len(vocab), embed_size) - 0.5) * 0.1
W0 = np.eye(embed_size)  # recurrent transition matrix, initialized to identity
empty_sentence_embedding = np.zeros(embed_size)  # the initial (empty) hidden state
W1 = (np.random.rand(embed_size, len(vocab)) - 0.5) * 0.1  # decoder: hidden state -> vocabulary
# one-hot target vectors for the loss function
y_hots = np.eye(len(vocab))
W0.shape, W1.shape
((10, 10), (10, 82))
Instead of predicting only the last word, we will make a prediction at every timestep using all previous words. This is more efficient than doing a new forward propagation from the beginning of the phrase every time we want to predict a new word.
def predict(sent):
    layers = list()
    layer = {}
    layer['hidden'] = empty_sentence_embedding
    layers.append(layer)
    loss = 0
    # forward propagation
    preds = list()
    for target_i in range(len(sent)):
        layer = {}
        # try to predict the next term from the previous hidden state
        layer['pred'] = softmax(layers[-1]['hidden'].dot(W1))
        # `sent[target_i]` is the index of the true next word; accumulate its negative log probability
        loss += -np.log(layer['pred'][sent[target_i]])
        # generate the next hidden state
        layer['hidden'] = layers[-1]['hidden'].dot(W0) + word_embeddings[sent[target_i]]
        layers.append(layer)
    return layers, loss
The list called layers represents a new way of doing forward propagation: the longer the sentence, the more forward-propagation steps we end up performing.
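Before implementing backpropagation, we can sanity-check predict on the untrained model (a quick sketch assuming the cells above have been run): with random, near-zero weights, every prediction is roughly uniform over the vocabulary, so the per-word perplexity starts out close to the vocabulary size:

sent = sent2indices(sentences[0][1:])
layers, loss = predict(sent)
print(len(vocab))                # 82 words in the vocabulary
print(np.exp(loss / len(sent)))  # roughly 82 before any training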
Let's implement backpropagation over arbitrary sequence lengths. The most important object is the layers list, in which each layer stores two vectors, layer['pred'] and layer['hidden']. To backpropagate, we take the output gradient and add new entries, layer['output_delta'] and layer['hidden_delta'], which represent the gradients at that layer. We're building the logic so that it can consume the variable-length sequences produced by the forward-propagation logic.
sentences[1][1:]
['went', 'to', 'the', 'hallway.']
for iter in range(30000):
    # forward propagation
    lr = .001
    # get the sentence indices without the 1st word
    # why drop the first word of each sentence? -> the empty starting hidden state takes its place
    sent = sent2indices(sentences[iter%len(sentences)][1:])
    layers, loss = predict(sent) # predicts, returns hidden layer embeddings + loss
    # backpropagation
    for layer_i in reversed(range(len(layers))): # loop over hidden states (layers)
        layer = layers[layer_i] # get the corresponding layer
        target = sent[layer_i - 1] # get the target word
        if (layer_i > 0): # if not the first layer
            layer['output_delta'] = layer['pred'] - y_hots[target] # prediction delta
            new_hidden_delta = layer['output_delta'].dot(W1.transpose()) # gradient through the decoder
            # if this is the last layer, don't pull from a later one, because it doesn't exist
            # each hidden layer's delta depends on its own decoding step + the next layer's gradient
            if(layer_i == len(layers)-1):
                layer['hidden_delta'] = new_hidden_delta
            else:
                layer['hidden_delta'] = new_hidden_delta + layers[layer_i+1]['hidden_delta'].dot(W0.transpose())
        else: # the first layer only receives gradient from the next layer
            layer['hidden_delta'] = layers[layer_i+1]['hidden_delta'].dot(W0.transpose())
Putting it all together, here is the complete training loop, including the weight updates:
for iter in range(30000):
    # forward propagation
    lr = .001
    # get the sentence indices without the 1st word
    # why drop the first word of each sentence? -> the empty starting hidden state takes its place
    sent = sent2indices(sentences[iter%len(sentences)][1:])
    layers, loss = predict(sent) # predicts, returns hidden layer embeddings + loss
    # backpropagation
    for layer_i in reversed(range(len(layers))): # loop over hidden states (layers)
        layer = layers[layer_i] # get the corresponding layer
        target = sent[layer_i - 1] # get the target word
        if (layer_i > 0): # if not the first layer
            layer['output_delta'] = layer['pred'] - y_hots[target] # prediction delta
            new_hidden_delta = layer['output_delta'].dot(W1.transpose()) # gradient through the decoder
            # if this is the last layer, don't pull from a later one, because it doesn't exist
            # each hidden layer's delta depends on its own decoding step + the next layer's gradient
            if(layer_i == len(layers)-1):
                layer['hidden_delta'] = new_hidden_delta
            else:
                layer['hidden_delta'] = new_hidden_delta + layers[layer_i+1]['hidden_delta'].dot(W0.transpose())
        else: # the first layer only receives gradient from the next layer
            layer['hidden_delta'] = layers[layer_i+1]['hidden_delta'].dot(W0.transpose())
    # update the embedding that stands in for the empty starting hidden state
    empty_sentence_embedding -= layers[0]['hidden_delta'] * lr / float(len(sent))
    for layer_i, layer in enumerate(layers[1:]):
        # update the decoder
        W1 -= np.outer(layers[layer_i]["hidden"], layer['output_delta'])*lr / float(len(sent))
        embed_i = sent[layer_i]
        # update the word embeddings
        word_embeddings[embed_i] -= layers[layer_i]['hidden_delta']*lr/float(len(sent))
        # update the recurrent (transition) matrix
        W0 -= np.outer(layers[layer_i]['hidden'], layer['hidden_delta']) * lr / float(len(sent))
    if (iter % 1000 == 0):
        print("Perplexity :" + str(np.exp(loss/len(sent))))
Perplexity :81.95028707568486
Perplexity :81.78060980022715
Perplexity :81.51436435831427
Perplexity :81.01004082432779
Perplexity :79.96135229022907
Perplexity :77.53659835083533
Perplexity :70.45797177871069
Perplexity :41.73186994211299
Perplexity :24.3413252435313
Perplexity :19.820219349001245
Perplexity :18.502932946608617
Perplexity :17.16337351696595
Perplexity :15.200622187379624
Perplexity :12.259293966301598
Perplexity :8.975708726197752
Perplexity :7.120714615818728
Perplexity :6.122010097753815
Perplexity :5.486491492025226
Perplexity :5.079309978044186
Perplexity :4.798546702336612
Perplexity :4.611469221507808
Perplexity :4.5010239902745655
Perplexity :4.440082283729496
Perplexity :4.387575014051199
Perplexity :4.324829616690472
Perplexity :4.251929353507349
Perplexity :4.174533982649985
Perplexity :4.098962793552853
Perplexity :4.03059393832612
Perplexity :3.9706506361787377
In information theory, perplexity is a measurement of how well a probability distribution or probability model predicts a sample; it can be used to compare probability models. A low perplexity indicates that the model is good at predicting the sample.
Concretely, perplexity takes the probability assigned to the correct label (word), passes it through a log, negates it, and exponentiates the average (e^x). It measures the difference between two probability distributions: it is high when the predicted and true distributions don't match, and low (approaching 1) when the two distributions are close to each other. However, this metric tells us little about what's going on inside the weights.
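In code, this is exactly what the training loop prints: the average negative log probability of the correct words, exponentiated. A minimal sketch of the computation (p_correct is a hypothetical list of probabilities assigned to the correct word at each timestep):

import numpy as np

# hypothetical probabilities assigned to the correct word at each timestep
p_correct = np.array([0.25, 0.1, 0.5, 0.05])

loss = -np.sum(np.log(p_correct))           # summed cross-entropy, as in predict()
perplexity = np.exp(loss / len(p_correct))  # == geometric mean of 1/p_correct
print(perplexity)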
Perplexity has faced some criticism over the years (particularly in the language-modeling community) for being overused as a metric.
Let's look a little more closely at the predictions:
sent_index = 4
l, _ = predict(sent2indices(sentences[sent_index]))
print(sentences[sent_index])
['sandra', 'moved', 'to', 'the', 'garden.']
for i, each_layer in enumerate(l[1:-1]):
    input = sentences[sent_index][i]
    true = sentences[sent_index][i+1]
    pred = vocab[each_layer['pred'].argmax()]
    print("Prev Input:" + input + (' ' * (12 - len(input))) + "True: " + true + (' ' * (15 - len(true))) + "Pred: " + pred)
Prev Input:sandra True: moved Pred: is
Prev Input:moved True: to Pred: to
Prev Input:to True: the Pred: the
Prev Input:the True: garden. Pred: bedroom.
This code takes a sentence and, at each step, prints the word the model thinks is most likely to come next. Watching the network's predictions as it trains shows not only what kinds of patterns it picks up, but also the order in which it picks them up. Neural networks tend to start off random, but after some training an RNN typically latches onto the most frequent word and predicts it as a default (this is an extremely common failure mode in RNNs). It's important to note that there is almost no way this network can predict the next word perfectly; more context would be needed to solve the task. But the fact that the task is unsolvable makes analyzing the ways in which it fails educational.
In this chapter, we learned how to create vector representations of arbitrarily long sentences.
One question remains: how does a neural network fit a variable-length sequence into a fixed-size vector? The truth is, sentence vectors don't encode everything about the sequence. The name of the game in RNNs is not just what these vectors remember, but also what they forget.
In the case of predicting the next word, the RNN learns that only the last few words really matter.