In this chapter, we will:
"Man is a slow, sloppy, and brilliant thinker; computers are fast, accurate, and stupid!" (John Pfeiffer)
NLP is divided into a collection of tasks and challenges. We present a few types of classification problems that are common in NLP:
NLP tasks seek to do one of three things:
Until recently, most state-of-the-art (SoTA) NLP algorithms were advanced, probabilistic, non-parametric models, but the recent development and popularization of two major neural algorithms has swept the field of NLP:
NLP also plays a very special role in AGI (Artificial General Intelligence), because language is the bedrock of human logic and communication.
Up until now, we represented inputs as numbers, but NLP uses text as input. So how do we process text? We know that neural networks map input numbers to output numbers, so we need to convert our words into a corresponding numerical representation. As it turns out, the way we transform text into numbers is extremely important!
In order to find a good numerical representation for text, we need to look at the underlying input-to-output problem. Let's take an example:
The IMDB Reviews Dataset is a collection of Review/Rating Pairs that often looks like the following:
"This Movie was terrible, The Plot was Dry, The acting unconvincing, and I spilled popcorn on my shirt!" — Rating: 1 Stars.
The entire dataset consists of around 50K reviews. The input reviews are usually a few sentences long, and the output rating is between 1 and 5 stars. Note that this sentiment dataset is quite different from other sentiment datasets, such as product reviews or hospital patient reviews.
While preparing the data, we will map the 1-to-5-star range onto 0 to 1 so we can use a sigmoid output (binary classification). On top of that, the input data arrives as a sequence of characters, which presents a few problems:
We will opt to use each "word" as a single entity instead of individual characters, since we would not expect single characters to correlate with the output (sentiment). Words such as "terrible", "unconvincing", and "bad", on the other hand, give a strong indication of the reviewer's sentiment. Several words have a negative correlation with the output; by negative, we mean that as the frequency of these words increases, ratings tend to decrease in number of stars.
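To make that concrete, here is a quick sketch (using the example review from above, with a bit of ad-hoc cleanup) contrasting the two tokenizations; the informative units only show up at the word level:

```python
review = "This Movie was terrible, The Plot was Dry, The acting unconvincing!"

char_tokens = list(review)                                            # ['T', 'h', 'i', 's', ' ', ...]
word_tokens = review.lower().replace(",", "").replace("!", "").split()

print(char_tokens[:10])   # individual characters carry essentially no sentiment signal
print(word_tokens)        # words like 'terrible' and 'unconvincing' clearly do
```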
import numpy as np
import re
import pandas as pd
We first need to download the dataset into a suitable directory:
IMDB_PATH = '/Users/mohamedakramzaytar/data/2019/Q2/kaggle/IMDB/imdb_master.csv'
!ls $IMDB_PATH
/Users/mohamedakramzaytar/data/2019/Q2/kaggle/IMDB/imdb_master.csv
df = pd.read_csv(IMDB_PATH, encoding="ISO-8859-1") # added encoding to fix error
df.head(7)
| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |
| 5 | Probably my all-time favorite movie, a story o... | positive |
| 6 | I sure would like to see a resurrection of a u... | positive |
# let's take a look at one review:
df.loc[0].review, df.loc[0].sentiment
("One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fact that it goes where other shows wouldn't dare. Forget pretty pictures painted for mainstream audiences, forget charm, forget romance...OZ doesn't mess around. The first episode I ever saw struck me as so nasty it was surreal, I couldn't say I was ready for it, but as I watched more, I developed a taste for Oz, and got accustomed to the high levels of graphic violence. Not just violence, but injustice (crooked guards who'll be sold out for a nickel, inmates who'll kill on order and get away with it, well mannered, middle class inmates being turned into prison bitches due to their lack of street skills or prison experience) Watching Oz, you may become comfortable with what is uncomfortable viewing....thats if you can get in touch with your darker side.", 'positive')
A common preprocessing step is to create a matrix where each row represents a review and each column represents whether a review contains a particular word in the vocabulary. To create a vector for a review, we just need to loop over the content and put 1s in places where the corresponding vocabulary words are present in the review.
The size of the review vectors depends on the global vocabulary of the reviews: if we have 2,000 unique words, we need vectors of length 2,000. This form of storage, called one-hot encoding, is the most common way to store binary information, in our case the presence or absence of particular vocabulary words in the text of a review.
If our vocabulary has only 4 words, then the one-hot encodings might look like this:
one_hots = {}
one_hots['cat'] = np.array([1, 0, 0, 0])
one_hots['the'] = np.array([0, 1, 0, 0])
one_hots['dog'] = np.array([0, 0, 1, 0])
one_hots['sat'] = np.array([0, 0, 0, 1])
sentence = ['the', 'cat', 'sat']
x = one_hots[sentence[0]] + one_hots[sentence[1]] + one_hots[sentence[2]]
print('Sent Encoding:' + str(x))
Sent Encoding:[1 1 0 1]
We create a vector for each term in the vocabulary. Then we use vector addition to represent a set of words present in a sentence.
import re
import numpy as np
import pandas as pd
from collections import Counter
IMDB_PATH = '/Users/mohamedakramzaytar/data/2019/Q2/kaggle/IMDB/imdb_master.csv'
df = pd.read_csv(IMDB_PATH, encoding="ISO-8859-1")
df = df[df['sentiment'].isin(['negative', 'positive'])]
all_reviews_text = " ".join(df.review.tolist())
# build the vocabulary: keep the 10,000 most frequent tokens
all_tokens = all_reviews_text.split(" ")
unique_tokens = [v for (v, _) in Counter(all_tokens).most_common(10000)]
len(all_tokens), len(unique_tokens)
(11557297, 10000)
# helper: get the unique tokens of a piece of text
def get_tokens(text):
return list(set(text.split(" ")))
# map each token to an index (and each index back to its token)
word_to_index, index_to_word = {}, {}
for i, word in enumerate(unique_tokens):
word_to_index[word], index_to_word[i] = i, word
df['words_count'] = df['review'].apply(lambda x: len(x.split(" ")))
df.describe()
| | words_count |
|---|---|
| count | 50000.000000 |
| mean | 231.145940 |
| std | 171.326419 |
| min | 4.000000 |
| 25% | 126.000000 |
| 50% | 173.000000 |
| 75% | 280.000000 |
| max | 2470.000000 |
We will set the size of the one-hot vector to 10,000 (representing the 10K most frequent words in the corpus). In this case the review length doesn't matter: we just add up the one-hot vector of each word in the review to get a final 10,000-dimensional representation of the review.
Let's preprocess the training data:
test_idx = int(len(df) * (1 - 0.2))
train, test = df.iloc[:test_idx], df.iloc[test_idx:]
train.shape, test.shape
((40000, 3), (10000, 3))
# we delete columns we're not interested in
train = train.drop(columns=['words_count'])
# now we transform label into a number
train['y'] = train['sentiment'].replace({'negative': 0, 'positive': 1})
train = train.drop('sentiment', axis=1)
<ipython-input-34-50996a7211ae>:2: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy train['y'] = train['sentiment'].replace({'negative': 0, 'positive': 1})
# shuffle train now ..
train = train.sample(frac=1).reset_index(drop=True)
x, y = [], []
for _, r in train.iterrows():
review, label = r['review'], r['y']
one_hot = np.zeros(10000)
tokens = get_tokens(review)
for token in tokens:
if token in word_to_index:
one_hot[word_to_index[token]] = int(1)
x.append(one_hot)
y.append(label)
x, y = np.array(x), np.array(y)
x.shape, y.shape
((40000, 10000), (40000,))
Now we have the representations we need to move forward and create a dense neural network to train.
We know that the first layer is the input data. It is followed by what's called a linear layer, then an activation layer (the implementation below uses a sigmoid), then another linear layer, and finally the output, which is the prediction layer.
As it turns out, we can take a bit of a shortcut to layer 1 by replacing the first linear layer with an embedding layer. The important thing to notice is that multiplying a vector of 1s and 0s by a matrix is mathematically equivalent to summing several rows of that matrix. So we just have to sum the rows of `W_0` that correspond to the words present in the review to form the hidden "embedding layer". It's much more efficient to select the relevant rows of `W_0` and sum them than to do a big vector-matrix multiplication.
Because the sentiment vocabulary is on the order of 70k words, most of a full vector-matrix multiplication would be spent multiplying zeros in the input vector by weights before summing them; embeddings skip that work, and summing a handful of rows is much faster.
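To see the equivalence, here is a minimal sketch with a tiny hypothetical vocabulary and random toy weights (`W0_toy`); the one-hot vector-matrix product and the row-sum produce exactly the same hidden layer:

```python
import numpy as np

np.random.seed(0)
vocab_size, embedding_size = 6, 3              # tiny, hypothetical sizes
W0_toy = np.random.randn(vocab_size, embedding_size)

word_indices = [1, 4, 5]                       # indices of the words present in a review
one_hot = np.zeros(vocab_size)
one_hot[word_indices] = 1

dense = one_hot.dot(W0_toy)                    # full vector-matrix multiplication
embedded = W0_toy[word_indices].sum(axis=0)    # just select and sum the relevant rows

print(np.allclose(dense, embedded))            # True: both give the same hidden layer
```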
import numpy as np
import sys
IMDB_PATH = '/Users/mohamedakramzaytar/data/2019/Q2/kaggle/IMDB/reviews.txt'
IMDB_LABEL_PATH = '/Users/mohamedakramzaytar/data/2019/Q2/kaggle/IMDB/labels.txt'
f = open(IMDB_PATH, mode='r')
raw_reviews = f.readlines()
f.close()
f = open(IMDB_LABEL_PATH, mode='r')
raw_labels = f.readlines()
f.close()
len(raw_reviews), len(raw_labels)
(25000, 25000)
# python's map object is an iterator
# we can also convert map objects to lists, tuples, ..
tokens = list(map(lambda x: set(x.split(" ")), raw_reviews))
# let's extract the vocab
vocab = set()
for sent in tokens:
for word in sent:
if (len(word)>0):
vocab.add(word)
vocab = list(vocab)
word2index = {}
for i, word in enumerate(vocab):
word2index[word] = i
# transform all reviews to vectors
input_dataset = list()
for sent in tokens:
sent_indices = list()
for word in sent:
try:
sent_indices.append(word2index[word])
except:
""
input_dataset.append(list(set(sent_indices)))
# same for target data
target_dataset = list()
for label in raw_labels:
if label == "positive\n":
target_dataset.append(1)
else:
target_dataset.append(0)
import numpy as np
np.random.seed(1)
def sigmoid(x):
return 1/(1+np.exp(-x))
lr, epochs = 0.01, 1
embedding_layer_size = 100
W0 = (0.2 * np.random.random((len(vocab), embedding_layer_size))) - 0.1
W1 = (0.2 * np.random.random((embedding_layer_size, 1))) - 0.1
# training loop
correct, total = (0, 0)
for epoch in range(epochs):
# leave last 1000 for testing
for i in range(len(input_dataset) - 1000):
# Forward Propagation
x, y = input_dataset[i], target_dataset[i]
layer_1 = sigmoid(np.sum(W0[x], axis=0))
layer_2 = sigmoid(layer_1.dot(W1))
# Gradients Calc
layer_2_delta = (layer_2 - y)
layer_1_delta = layer_2_delta.dot(W1.T)
# Backpropagation
W0[x] -= layer_1_delta*lr # update only corresponding embeddings (w/o attached input to gradient).
W1 -= np.outer(layer_1, layer_2_delta) * lr
# training accuracy
if(np.abs(layer_2_delta) < 0.5):
correct += 1
total += 1
if (i%1000 == 0):
progress = 100 * i / float(len(input_dataset))
print(f"Iter: {i} | Progress: {round(progress, 2)}% | Training Accuracy: {round(correct / float(total), 2)}%")
# test set evaluation
correct, total = (0, 0)
for i in range(len(input_dataset) - 1000, len(input_dataset)):
x, y = input_dataset[i], target_dataset[i]
layer_1 = sigmoid(np.sum(W0[x], axis=0))
layer_2 = sigmoid(layer_1.dot(W1))
if(np.abs(layer_2-y) < 0.5):
correct += 1
total += 1
print(f"Test accuracy: {correct / float(total)}")
Iter: 0 | Progress: 0.0% | Training Accuracy: 0.0% Iter: 1000 | Progress: 4.0% | Training Accuracy: 0.5% Iter: 2000 | Progress: 8.0% | Training Accuracy: 0.62% Iter: 3000 | Progress: 12.0% | Training Accuracy: 0.68% Iter: 4000 | Progress: 16.0% | Training Accuracy: 0.71% Iter: 5000 | Progress: 20.0% | Training Accuracy: 0.73% Iter: 6000 | Progress: 24.0% | Training Accuracy: 0.74% Iter: 7000 | Progress: 28.0% | Training Accuracy: 0.76% Iter: 8000 | Progress: 32.0% | Training Accuracy: 0.77% Iter: 9000 | Progress: 36.0% | Training Accuracy: 0.78% Iter: 10000 | Progress: 40.0% | Training Accuracy: 0.79% Iter: 11000 | Progress: 44.0% | Training Accuracy: 0.8% Iter: 12000 | Progress: 48.0% | Training Accuracy: 0.8% Iter: 13000 | Progress: 52.0% | Training Accuracy: 0.81% Iter: 14000 | Progress: 56.0% | Training Accuracy: 0.81% Iter: 15000 | Progress: 60.0% | Training Accuracy: 0.81% Iter: 16000 | Progress: 64.0% | Training Accuracy: 0.81% Iter: 17000 | Progress: 68.0% | Training Accuracy: 0.81% Iter: 18000 | Progress: 72.0% | Training Accuracy: 0.82% Iter: 19000 | Progress: 76.0% | Training Accuracy: 0.82% Iter: 20000 | Progress: 80.0% | Training Accuracy: 0.82% Iter: 21000 | Progress: 84.0% | Training Accuracy: 0.83% Iter: 22000 | Progress: 88.0% | Training Accuracy: 0.83% Iter: 23000 | Progress: 92.0% | Training Accuracy: 0.83% Test accuracy: 0.847
The network was looking for correlation between the input data points and the output data points. It's extremely valuable to know what kinds of patterns the network detected during training and used as signal for predicting sentiment; just because the network found correlation between the input and the output doesn't mean it found every pattern of language. Understanding the difference between what the network can currently learn from datasets and what it should learn to truly understand language is essential if we ever want to approach artificial general intelligence.
To answer this question, let's start by considering what was presented to the network. We presented a presence/absence binary indicator for each of the top 10,000 most frequent words in the corpus. We'd expect the network to learn which words correlate strongly with negative opinions and which with positive ones, but this isn't the whole story.
Hidden layers group the data points coming from the previous layer into n groups. Each hidden neuron takes in a data point and asks, "is this data point in my group?", and as the hidden layer learns, it searches for useful groupings. So what are the useful groupings for our task? A grouping is useful if it finds hidden, interesting structure in the data.
For example, understanding the difference between "terrible" and "not terrible" is a powerful grouping. However, because the input to the network is a set of vocabulary words and not a sequence, a sentence such as "It is great, not terrible" will be interpreted exactly like "It is terrible, not great".
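A quick sketch with a hypothetical mini-vocabulary makes this concrete; both sentences collapse to the same bag-of-words vector:

```python
import numpy as np

mini_vocab = ['it', 'is', 'not', 'great', 'terrible']      # hypothetical mini-vocabulary
mini_index = {w: i for i, w in enumerate(mini_vocab)}

def bag_of_words(sentence):
    # order-insensitive encoding: 1 if the word is present, regardless of position
    vec = np.zeros(len(mini_vocab))
    for word in sentence.lower().split():
        vec[mini_index[word]] = 1
    return vec

a = bag_of_words("it is great not terrible")
b = bag_of_words("it is terrible not great")
print(np.array_equal(a, b))   # True: the network receives identical inputs for both sentences
```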
If we can construct two examples that produce the same hidden-layer activation while the pattern of interest is present in one and absent in the other, then the network will not be able to detect that pattern.
We'd expect words with similar predictive power to subscribe to similar groups.
Words that subscribe to similar groups, and therefore have similar weights, will have similar linguistic meaning with regard to the task at hand (sentiment analysis).
from collections import Counter
import math
def similar(target='beautiful'):
target_index = word2index[target]
scores = Counter()
for word, index in word2index.items():
raw_difference = W0[index] - W0[target_index]
squared_difference = raw_difference**2
scores[word] = -math.sqrt(sum(squared_difference))
return scores.most_common(10)
This procedure lets us easily find the words most similar to a target word. Examples:
print(similar('beautiful'))
[('beautiful', -0.0), ('wonderfully', -0.7347476245943578), ('each', -0.7397281670618566), ('recommended', -0.7700989754926751), ('job', -0.8021862760775765), ('fascinating', -0.803780429603366), ('masterpiece', -0.806440875020042), ('true', -0.8087076223072098), ('especially', -0.8096794093609967), ('sweet', -0.8117635303801121)]
print(similar('terrible'))
[('terrible', -0.0), ('annoying', -0.7717906985613661), ('poorly', -0.8084446689686995), ('avoid', -0.8088320312353884), ('worse', -0.8246951041670193), ('stupid', -0.8309245272531632), ('boring', -0.8385096157034106), ('bad', -0.8395554304307457), ('disappointment', -0.8654898536150686), ('unfortunately', -0.8780291885742453)]
print(similar('average'))
[('average', -0.0), ('clearence', -0.6276424599813579), ('bizniss', -0.6339327432204244), ('brock', -0.6346237642100628), ('swordsmanship', -0.6370031733437977), ('sexegenarian', -0.6379006068844714), ('breckinridge', -0.6381035731502563), ('burnside', -0.6421944971499771), ('nudges', -0.6422029970777722), ('floorpan', -0.6482336019908111)]
print(similar('love'))
[('love', -0.0), ('friendship', -0.7000998912887154), ('believable', -0.7095907323713758), ('nice', -0.716328595780204), ('worth', -0.7191739646420174), ('bit', -0.7275198610668071), ('together', -0.7277400327527178), ('true', -0.7297715834947054), ('also', -0.7372723172510404), ('gives', -0.743701219426435)]
What we see is a standard phenomenon of the correlation summarization: the network seeks to create similar latent representations internally, compressing information so it can arrive at the correct target label.
Notice that `beautiful` and `recommended` are nearly identical, but only in the context of sentiment prediction. Outside this task, their meanings are quite different.
The meaning of a neuron in the network depends entirely on the target labels. The network is entirely ignorant of any meaning outside the task it was trained on. So how do we make the meaning of a neuron broader? If we give the network a task that requires a broad understanding of language, it will learn new complexities and its neurons will become much more general.
The task we'll use to learn more interesting word embeddings is "fill in the blank". There is nearly unlimited training data for it (the internet), which provides a practically endless signal to the network, and learning to fill in the blank requires at least some contextual understanding of language.
The following example uses almost the same architecture as before, with minor modifications. We'll slide a small window (a few words on each side) over the text, remove the middle word (the focus term), and train the network to predict the focus term from the surrounding words.
We'll also use a technique called negative sampling to make the network train a bit faster. Consider that in order to predict the focus term, we nominally need one output for every word in the vocabulary, which is tens of thousands of outputs and would make training slow. To overcome this, we randomly ignore most of the outputs during each forward pass: we score only the true focus term plus a small random sample of other words. Although this seems crude, it works well in practice.
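Before the full training loop, here is a minimal sketch of what a single training example looks like under this scheme (the sentence, the toy vocabulary, and the uniform negative sampling are illustrative assumptions, not the actual code below):

```python
import numpy as np

np.random.seed(1)
sentence = ['the', 'movie', 'was', 'surprisingly', 'good']   # toy example, not from the dataset
window, negative = 2, 3

target_i = 2                                          # focus term: 'was'
left = sentence[max(0, target_i - window):target_i]   # ['the', 'movie']
right = sentence[target_i + 1:target_i + 1 + window]  # ['surprisingly', 'good']
context = left + right                                # input: surrounding words, focus term removed

# negative sampling: score only the true focus term plus a few randomly drawn "wrong" words
# (the real loop below draws negatives from the corpus token stream, i.e. frequency-weighted)
toy_vocab = ['the', 'movie', 'was', 'surprisingly', 'good', 'terrible', 'plot', 'actor']
negative_ids = np.random.randint(0, len(toy_vocab), size=negative)
samples = [sentence[target_i]] + [toy_vocab[i] for i in negative_ids]

print(context)   # words whose embeddings get averaged as the input
print(samples)   # the only outputs scored this step: index 0 should be 1, the rest 0
```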
import sys, random, math
from collections import Counter
import numpy as np
IMDB_PATH = '/Users/mohamedakramzaytar/data/2019/Q2/kaggle/IMDB/reviews.txt'
IMDB_LABEL_PATH = '/Users/mohamedakramzaytar/data/2019/Q2/kaggle/IMDB/labels.txt'
np.random.seed(1)
random.seed(1)
f = open(IMDB_PATH, 'r')
raw_reviews = f.readlines()
f.close()
len(raw_reviews)
25000
tokens = list(map(lambda x: x.split(" "), raw_reviews))
len(tokens[0]), len(tokens[1]), len(tokens[2])
(185, 127, 537)
word_counter = Counter()
for review in tokens:
for token in review:
word_counter[token] -= 1
_ = word_counter.most_common() # least common in this case.
`most_common()` just sorts the entries; it doesn't keep only the top N tokens unless you force it to by passing an argument.
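For reference, a toy Counter illustrates the difference:

```python
from collections import Counter

c = Counter("abracadabra")
print(c.most_common())    # [('a', 5), ('b', 2), ('r', 2), ('c', 1), ('d', 1)] -- everything, sorted
print(c.most_common(2))   # [('a', 5), ('b', 2)] -- only the top 2
```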
vocab = list(set(map(lambda x: x[0], word_counter.most_common())))
word2index = {}
for i, word in enumerate(vocab):
word2index[word] = i
concatenated = list()
input_dataset = list()
for review in tokens:
review_indices = list()
for token in review:
try:
review_indices.append(word2index[token])
concatenated.append(word2index[token])
except:
""
input_dataset.append(review_indices)
concatenated = np.array(concatenated)
random.shuffle(input_dataset)
lr, epochs = (.05, 2)
hidden_size, window, negative = 50, 2, 5
W0 = (np.random.rand(len(vocab), hidden_size) - 0.5) * 0.2
W1 = np.zeros((len(vocab), hidden_size))
W0.shape, W1.shape
((74075, 50), (74075, 50))
layer_2_target = np.zeros(negative+1)
layer_2_target[0] = 1
def similar(target='beautiful', top=7):
target_index = word2index[target]
scores = Counter()
for word, index in word2index.items():
raw_difference = W0[index] - W0[target_index]
squared_difference = raw_difference * raw_difference
scores[word] = -math.sqrt(sum(squared_difference))
return scores.most_common(top)
def sigmoid(x):
return 1 / (1 + np.exp(-x))
for review_i, review in enumerate(input_dataset * epochs):
for target_i in range(len(review)):
# predict only a random subset, because it's really expensive to predict every vocab
# We can't do a softmax over all possible words, we will predict for the target word + a subset of the total vocab
target_samples = [review[target_i]] + list(concatenated[(np.random.rand(negative)*len(concatenated)).astype('int').tolist()])
# get tokens on the right & on Left of target word
left_context = review[max(0, target_i-window):target_i]
right_context = review[target_i+1: min(len(review), target_i+window)]
# feed forward
# context words w/o target word
# mean instead of sum, interesting
layer_1 = np.mean(W0[left_context+right_context], axis=0)
# using sigmoid here is kind of weird because there is only one true target token
layer_2 = sigmoid(layer_1.dot(W1[target_samples].T))
layer_2_delta = layer_2 - layer_2_target
layer_1_delta = layer_2_delta.dot(W1[target_samples])
# update weights
W0[left_context+right_context] -= layer_1_delta*lr
W1[target_samples] -= np.outer(layer_2_delta, layer_1)*lr
if(review_i % 1000 == 0):
print(f"Progress: {round(review_i/float(len(input_dataset)*epochs), 3)} | `Terrible` nearest neighbors: {similar('terrible', top=5)}")
print(similar('terrible'))
Progress: 0.0 | `Terrible` nearest neighbors: [('terrible', -0.0), ('misperceived', -0.37629838414529776), ('origination', -0.38645406817879285), ('bumpuses', -0.3866817101079805), ('recognition', -0.3870682177410917)] Progress: 0.02 | `Terrible` nearest neighbors: [('terrible', -0.0), ('superb', -0.9395550125335567), ('fantastic', -0.9785445410150915), ('brilliant', -1.0009723994783375), ('excellent', -1.0507492536281997)] Progress: 0.04 | `Terrible` nearest neighbors: [('terrible', -0.0), ('brilliant', -1.4181573161847985), ('horrible', -1.435351250932622), ('hilarious', -1.4828872270341869), ('fantastic', -1.5224293865473035)] Progress: 0.06 | `Terrible` nearest neighbors: [('terrible', -0.0), ('fantastic', -1.3645701568801774), ('brilliant', -1.452030183491021), ('horrible', -1.4872861958792385), ('convincing', -1.6451485631159968)] Progress: 0.08 | `Terrible` nearest neighbors: [('terrible', -0.0), ('fantastic', -1.635188731505891), ('brilliant', -1.6478061610864096), ('convincing', -1.8233281433244541), ('lame', -1.903116578071796)] Progress: 0.1 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -1.8219440801238125), ('fantastic', -1.8671267127593203), ('lame', -1.9050522776833492), ('fascinating', -1.9732384841951174)] Progress: 0.12 | `Terrible` nearest neighbors: [('terrible', -0.0), ('brilliant', -2.1433165398420138), ('horrible', -2.1638703481599477), ('fascinating', -2.2868941326807817), ('weak', -2.3292512816130797)] Progress: 0.14 | `Terrible` nearest neighbors: [('terrible', -0.0), ('brilliant', -2.5092534705974803), ('fantastic', -2.6508016767739924), ('fascinating', -2.7452444884676908), ('horrible', -2.865792216162147)] Progress: 0.16 | `Terrible` nearest neighbors: [('terrible', -0.0), ('brilliant', -2.6959329307597475), ('horrible', -2.988478730789771), ('fascinating', -3.1247256619089545), ('terrific', -3.129899894825866)] Progress: 0.18 | `Terrible` nearest neighbors: [('terrible', -0.0), ('brilliant', -2.8446078019182597), ('horrible', -2.956595849168358), ('superb', -3.1012022870890967), ('fascinating', -3.262813698847574)] Progress: 0.2 | `Terrible` nearest neighbors: [('terrible', -0.0), ('brilliant', -2.9033044381747177), ('horrible', -3.0115624244181727), ('fantastic', -3.2470408994109534), ('superb', -3.256887985703758)] Progress: 0.22 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -3.0915151786846438), ('fantastic', -3.1265211166700086), ('brilliant', -3.1454738767589787), ('superb', -3.183344902892115)] Progress: 0.24 | `Terrible` nearest neighbors: [('terrible', -0.0), ('brilliant', -3.0127884429926897), ('horrible', -3.15807211093518), ('fantastic', -3.2550240363481096), ('superb', -3.4725565464616666)] Progress: 0.26 | `Terrible` nearest neighbors: [('terrible', -0.0), ('brilliant', -2.7865435546532855), ('horrible', -3.057266418335675), ('fantastic', -3.6015249499836965), ('remarkable', -3.7331652129773283)] Progress: 0.28 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -2.908750351571196), ('brilliant', -3.0256961917363117), ('remarkable', -3.3441415219255313), ('fantastic', -3.346827479733814)] Progress: 0.3 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -3.1965770870240044), ('brilliant', -3.232387427139472), ('fantastic', -3.3406354622067576), ('laughable', -3.4525431309832544)] Progress: 0.32 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -2.7193204217548046), ('fantastic', -3.4924465018991375), ('brilliant', -3.578142963089179), ('pathetic', 
-3.69027244678774)] Progress: 0.34 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -2.7508534380433756), ('laughable', -3.564716387829581), ('brilliant', -3.6256016784964853), ('ridiculous', -3.691640952693665)] Progress: 0.36 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -2.6235371223238038), ('brilliant', -3.456400101821386), ('fantastic', -3.591454386466738), ('pathetic', -3.6334012075001842)] Progress: 0.38 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -2.683096639333961), ('fantastic', -3.4242003788819737), ('pathetic', -3.4360802291007326), ('brilliant', -3.4872253519455176)] Progress: 0.4 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -3.1624804317677495), ('fantastic', -3.5925838881609518), ('superb', -3.768105964314105), ('wonderful', -3.862213327361215)] Progress: 0.42 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -3.437289914569698), ('fantastic', -3.5216332851140426), ('wonderful', -3.650556914213743), ('superb', -3.657396309163536)] Progress: 0.44 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -3.2723437444549783), ('wonderful', -3.4368907041506995), ('brilliant', -3.7929820192531545), ('superb', -3.849470415444905)] Progress: 0.46 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -3.2460430522999157), ('fantastic', -3.701891129630293), ('superb', -3.8686900453638264), ('wonderful', -3.8932476027869205)] Progress: 0.48 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -2.9614105229354526), ('fantastic', -3.6383683495777577), ('wonderful', -3.7598207425786945), ('marvelous', -3.8547592780789506)] Progress: 0.5 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -3.1209779081546825), ('pathetic', -3.9281919379578993), ('bad', -3.9341848616327995), ('ridiculous', -3.973607969275994)] Progress: 0.52 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -3.3213812570213213), ('superb', -3.6699560346070927), ('pathetic', -3.6779200874922333), ('brilliant', -3.741357710902307)] Progress: 0.54 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -2.9042261348652203), ('ridiculous', -3.523791388919498), ('brilliant', -3.553038115161444), ('pathetic', -3.670621870859373)] Progress: 0.56 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -3.019056443525238), ('ridiculous', -3.491907182197584), ('brilliant', -3.5144127474855265), ('pathetic', -3.637777041244951)] Progress: 0.58 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -3.047805663412288), ('ridiculous', -3.284331453269096), ('pathetic', -3.416744552532168), ('fantastic', -3.5162930264465393)] Progress: 0.6 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -3.0350582648337365), ('pathetic', -3.141418301245476), ('brilliant', -3.3284278791026645), ('ridiculous', -3.3681009224333156)] Progress: 0.62 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -2.8424670383329955), ('brilliant', -3.190611111009397), ('fantastic', -3.3554779094550637), ('pathetic', -3.613499339625761)] Progress: 0.64 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -2.6288033645941358), ('ridiculous', -3.1037980754584007), ('fantastic', -3.4010410214913493), ('brilliant', -3.414474269349876)] Progress: 0.66 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -2.8451699935635197), ('ridiculous', -3.360397446507512), ('fantastic', -3.466029776873303), ('brilliant', -3.578601473087416)] 
Progress: 0.68 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -2.66673055814208), ('fantastic', -3.3362097606735692), ('brilliant', -3.373603786961304), ('magnificent', -3.624886774249765)] Progress: 0.7 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -2.8630220563840676), ('fantastic', -3.3992359956414386), ('wonderful', -3.566726961067627), ('magnificent', -3.591300947805222)] Progress: 0.72 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -2.9368471700599974), ('magnificent', -3.608067713166859), ('brilliant', -3.655464546153343), ('dreadful', -3.7050087712182838)] Progress: 0.74 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -2.635848871982613), ('brilliant', -3.6651707790712265), ('fantastic', -3.7062670345209163), ('horrid', -3.7328953795618287)] Progress: 0.76 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -2.503166496540591), ('fantastic', -3.629923785322746), ('wonderful', -3.86941026115452), ('brilliant', -3.901485156616881)] Progress: 0.78 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -2.412898394155387), ('horrid', -3.549930811490535), ('dreadful', -3.5937964712495667), ('dire', -3.758413186706391)] Progress: 0.8 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -2.7875323979515803), ('brilliant', -3.553032704687312), ('fantastic', -3.5650466133229215), ('horrid', -3.6895680657641807)] Progress: 0.82 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -2.505351530365435), ('horrid', -3.5555145292149475), ('dreadful', -3.723125002700873), ('horrendous', -3.7799764137444125)] Progress: 0.84 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -2.52225303835475), ('horrid', -3.659388277805759), ('horrendous', -3.6596442985085273), ('brilliant', -3.7064657941555823)] Progress: 0.86 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -2.403871102526396), ('brilliant', -3.4543509427136447), ('dreadful', -3.6327538154677845), ('horrendous', -3.638227523309254)] Progress: 0.88 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -2.580263267663093), ('brilliant', -3.168048700465928), ('fantastic', -3.47530255959384), ('horrendous', -3.5527378429665823)] Progress: 0.9 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -2.5560655161199812), ('brilliant', -3.2765502997525595), ('phenomenal', -3.6192939150249934), ('marvelous', -3.627323403235288)] Progress: 0.92 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -2.773181520872384), ('brilliant', -3.393156773756116), ('superb', -3.628130376783132), ('fantastic', -3.6532165585570455)] Progress: 0.94 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -2.7740421486011266), ('pathetic', -3.3994302985804645), ('phenomenal', -3.4011259732447283), ('brilliant', -3.4238172462044756)] Progress: 0.96 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -2.7104808985858138), ('pathetic', -3.3380165149787935), ('phenomenal', -3.5040384247065166), ('brilliant', -3.6382402279199804)] Progress: 0.98 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -2.421661394464606), ('phenomenal', -3.4302258651982), ('pathetic', -3.5310157422128037), ('fantastic', -3.5986885562168713)] [('terrible', -0.0), ('horrible', -2.9616723194748578), ('pathetic', -3.53126362760028), ('phenomenal', -3.704929909889765), ('brilliant', -3.8445386415610043), ('dreadful', -3.9673549749081625), ('superb', -4.113236415218835)]
The word embeddings get trained according to the task the neural network is trained on. Let's look at a few examples:
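For instance, we can probe the new embedding space with the `similar()` helper defined above (the exact neighbors will differ between runs, so treat the comments as a tendency rather than a fixed output):

```python
# probe the context-prediction embeddings learned above
print(similar('love'))
print(similar('hate'))
# With these embeddings, 'love' and 'hate' typically land near each other,
# since they are used in very similar phrases, regardless of sentiment.
```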
In this sense, hate & love are pretty close!
Before, words were clustered according to the likelihood that the review is positive or negative. Now, they are clustered based on the likelihood that they occur in the same phrases (regardless of the sentiment of the review).
Our key takeaway is that even though we are training on the same dataset, using a very similar network architecture, we can influence what the network learns by changing the loss function (task). Even though it's looking at the same information, we can alter its learning behavior by simply changing the output structure.
Let's call the process of choosing what the network should learn intelligence targeting. Changing how the network measures error, its architecture, and its regularization are also ways of performing intelligence targeting.
In deep learning research, all of the above techniques fall under the umbrella term: Loss function construction.
Considering that learning is all about minimizing a loss function, this gives us a broader understanding of how neural networks work.
Different kinds of layers, activations, regularization techniques, and datasets aren't really that different. For example, if the network is overfitting, we can adjust the loss function by choosing simpler non-linearities, adding dropout, adding regularization, adding more data, and so on. All of these techniques have a similar effect on the loss function and on the learning behavior.
With learning, everything is contained within the loss function, and if something is going wrong, remember that the solution is in the loss function.
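As a tiny, self-contained illustration of loss-function construction, here is a sketch (with made-up toy values and an assumed penalty strength `lam`) of how adding an L2 penalty changes a weight update compared to the plain update used in the training loops above:

```python
import numpy as np

# toy setup: one hidden activation vector and one output weight vector
hidden = np.array([0.2, 0.7, 0.1])
w_out = np.array([[0.05], [-0.1], [0.3]])
target, lr, lam = 1.0, 0.01, 1e-4                # label, learning rate, assumed L2 strength

pred = 1 / (1 + np.exp(-hidden.dot(w_out)))      # sigmoid output
delta = pred - target                            # same error signal as in the training loops above

# plain update vs. L2-regularized update: the penalty adds a term that pulls weights toward 0,
# which is one concrete way of "constructing" a different loss for the same network
w_out_plain = w_out - np.outer(hidden, delta) * lr
w_out_l2 = w_out - (np.outer(hidden, delta) + 2 * lam * w_out) * lr
print(w_out_plain.ravel(), w_out_l2.ravel())
```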
The task of filling in the blank creates an interesting property called "word analogies". Analogies are one of the famous properties of word embeddings (or trained vectors).
We can take different embeddings and perform algebraic operations on them to discover these analogies.
def analogy(positive=['terrible', 'good'], negative=['bad']):
norms = np.sum(W0*W0, axis=1)
norms.resize((norms.shape[0], 1))
# normalize weights for vector-level operations
normed_weights = W0 * norms
query_vect = np.zeros(W0.shape[1])
for word in positive:
query_vect += normed_weights[word2index[word]]
for word in negative:
query_vect -= normed_weights[word2index[word]]
scores = Counter()
for word, index in word2index.items():
raw_difference = W0[index] - query_vect
squared_difference = raw_difference * raw_difference
scores[word] = -math.sqrt(sum(squared_difference))
return scores.most_common(10)[1:]
analogy(['elizabeth', 'he'], ['she'])
[('lee', -174.27474476700854), ('been', -174.49725533677292), ('david', -174.61356143328723), ('william', -174.6143457630514), ('walken', -174.70899829891496), ('st', -174.73844365112527), ('simon', -174.76909796361042), ('sean', -174.8055256904309), ('smith', -174.89679055038982)]
Even though "Word Analogy" Discovery was initially very exciting, the deep learning NLP paradigm didn't move forward from that to discover new features, instead, current language models rely on Recurrent Neural Networks to do language modeling (This book was released before ELMO, BERT, & GPT-2, that is why the author considers RNNs to be the SoTA in Language modeling).
Nevertheless, we need to understand why this concept emerged out of the network as a result of us training the network to fill in the blank? If we imagine the word embeddings to have two dimensions, then it would be easier to know why word analogies work:
king = [.6, .1]
man = [.5, .0]
woman = [.0, .8]
queen = [.1, 1.0]
[x_i - y_i for (x_i, y_i) in zip(king, man)]
[0.09999999999999998, 0.1]
[x_i - y_i for (x_i, y_i) in zip(queen, woman)]
[0.1, 0.19999999999999996]
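Completing the analogy with the same toy vectors (the numbers are purely illustrative), `king - man + woman` lands close to `queen`:

```python
# king - man + woman, computed element-wise on the toy 2-D embeddings above
king, man, woman, queen = [.6, .1], [.5, .0], [.0, .8], [.1, 1.0]
analogy_vec = [k - m + w for (k, m, w) in zip(king, man, woman)]
print(analogy_vec)   # roughly [0.1, 0.9] (up to floating-point noise), close to queen = [0.1, 1.0]
```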
The king/man and queen/woman pairs are similarly useful to the final prediction because the difference between "king" and "man" leaves a vector that represents something like royalty. There is one grouping of male/female-related words and another grouping of king/queen-related words, and because the offset between the two groupings is roughly constant, the distances between corresponding items in each grouping end up roughly the same.
This phenomenon can be traced back to the chosen loss. What's important is that learning analogies is more about the properties of language than about deep learning: any linear compression of these co-occurrence statistics will yield similar results.