In this chapter, we will:
"Man is a slow, sloppy, and brilliant thinker; computers are fast, accurate, and stupid!" (John Pfeiffer)
NLP is divided into a collection of tasks and challenges. We present a few types of classification problems that are common in NLP:
NLP tasks seek to do one of three things:
Until recently, most state-of-the-art (SoTA) NLP algorithms were advanced, probabilistic, non-parametric models, but the recent development and popularization of two major neural algorithms has swept the field of NLP:
NLP also plays a very special role in AGI (Artificial General Intelligence), because language is the bedrock of human logic and communication.
Up until now, we represented inputs as numbers, but NLP uses text as input. So how do we process text? We know that neural networks map input numbers to output numbers, so we need to convert our words into a corresponding numerical representation. As it turns out, the way we transform text into numbers is extremely important!
In order to find a good numerical representation for text, we need to look at the underlying input-to-output problem. Let's take an example:
The IMDB Reviews Dataset is a collection of Review/Rating Pairs that often looks like the following:
"This Movie was terrible, The Plot was Dry, The acting unconvincing, and I spilled popcorn on my shirt!" — Rating: 1 Stars.
The entire dataset consists of around 50K reviews. The input reviews are usually a few sentences long, and the output rating is between 1 and 5 stars. Note that this sentiment dataset is quite different from other sentiment datasets, such as product reviews or hospital patient reviews.
While preparing the data, we will map the 1-to-5-star range onto 0 to 1 so we can use a sigmoid output (binary classification). On top of that, the input data arrives as a sequence of characters, which presents a few problems:
We will opt to use each "word" as a single entity instead of individual characters, since we would not expect single characters to correlate with the output (sentiment). Words such as "terrible", "unconvincing", and "bad", on the other hand, give a strong indication of the reviewer's sentiment. Several words have a negative correlation with the output; by negative, we mean that as the frequency of these words increases, ratings tend to decrease in number of stars.
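To make that concrete, here is a quick sketch (using the example review from above, with a bit of ad-hoc cleanup) contrasting the two tokenizations; the informative units only show up at the word level:

```python
review = "This Movie was terrible, The Plot was Dry, The acting unconvincing!"

char_tokens = list(review)                                            # ['T', 'h', 'i', 's', ' ', ...]
word_tokens = review.lower().replace(",", "").replace("!", "").split()

print(char_tokens[:10])   # individual characters carry essentially no sentiment signal
print(word_tokens)        # words like 'terrible' and 'unconvincing' clearly do
```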
import numpy as np
import re
import pandas as pd
We first need to download the dataset into a suitable directory:
IMDB_PATH = '/Users/mohamedakramzaytar/data/2019/Q2/kaggle/IMDB/imdb_master.csv'
!ls $IMDB_PATH
/Users/mohamedakramzaytar/data/2019/Q2/kaggle/IMDB/imdb_master.csv
df = pd.read_csv(IMDB_PATH, encoding="ISO-8859-1") # added encoding to fix error
df.head(7)
| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |
| 5 | Probably my all-time favorite movie, a story o... | positive |
| 6 | I sure would like to see a resurrection of a u... | positive |
# let's take a look at one review:
df.loc[0].review, df.loc[0].sentiment
("One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fact that it goes where other shows wouldn't dare. Forget pretty pictures painted for mainstream audiences, forget charm, forget romance...OZ doesn't mess around. The first episode I ever saw struck me as so nasty it was surreal, I couldn't say I was ready for it, but as I watched more, I developed a taste for Oz, and got accustomed to the high levels of graphic violence. Not just violence, but injustice (crooked guards who'll be sold out for a nickel, inmates who'll kill on order and get away with it, well mannered, middle class inmates being turned into prison bitches due to their lack of street skills or prison experience) Watching Oz, you may become comfortable with what is uncomfortable viewing....thats if you can get in touch with your darker side.", 'positive')
A common preprocessing step is to create a matrix where each row represents a review and each column represents whether a review contains a particular word in the vocabulary. To create a vector for a review, we just need to loop over the content and put 1s in places where the corresponding vocabulary words are present in the review.
The size of the review vectors depends on the global vocabulary of the reviews: if we have 2,000 unique words, we need vectors of length 2,000. This form of storage, called one-hot encoding, is the most common way to store binary information, in our case the presence or absence of particular vocabulary words in the text of a review.
If our vocabulary has only 4 words, then the one-hot encodings might look like this:
one_hots = {}
one_hots['cat'] = np.array([1, 0, 0, 0])
one_hots['the'] = np.array([0, 1, 0, 0])
one_hots['dog'] = np.array([0, 0, 1, 0])
one_hots['sat'] = np.array([0, 0, 0, 1])
sentence = ['the', 'cat', 'sat']
x = one_hots[sentence[0]] + one_hots[sentence[1]] + one_hots[sentence[2]]
print('Sent Encoding:' + str(x))
Sent Encoding:[1 1 0 1]
We create a vector for each term in the vocabulary. Then we use vector addition to represent a set of words present in a sentence.
import re
import numpy as np
import pandas as pd
from collections import Counter
IMDB_PATH = '/Users/mohamedakramzaytar/data/2019/Q2/kaggle/IMDB/imdb_master.csv'
df = pd.read_csv(IMDB_PATH, encoding="ISO-8859-1")
df = df[df['sentiment'].isin(['negative', 'positive'])]
all_reviews_text = " ".join(df.review.tolist())
# build the vocabulary: keep the 10,000 most frequent tokens
all_tokens = all_reviews_text.split(" ")
unique_tokens = [v for (v, _) in Counter(all_tokens).most_common(10000)]
len(all_tokens), len(unique_tokens)
(11557297, 10000)
# helper: get the unique tokens of a piece of text
def get_tokens(text):
return list(set(text.split(" ")))
# map each token to an index (and each index back to its token)
word_to_index, index_to_word = {}, {}
for i, word in enumerate(unique_tokens):
word_to_index[word], index_to_word[i] = i, word
df['words_count'] = df['review'].apply(lambda x: len(x.split(" ")))
df.describe()
| | words_count |
|---|---|
| count | 50000.000000 |
| mean | 231.145940 |
| std | 171.326419 |
| min | 4.000000 |
| 25% | 126.000000 |
| 50% | 173.000000 |
| 75% | 280.000000 |
| max | 2470.000000 |
We will set the size of the one-hot vector to 10,000 (representing the 10K most frequent words in the corpus). In this case the review length doesn't matter: we just add up the one-hot vector of each word in the review to get a final 10,000-dimensional representation of the review.
Let's preprocess the training data:
test_idx = int(len(df) * (1 - 0.2))
train, test = df.iloc[:test_idx], df.iloc[test_idx:]
train.shape, test.shape
((40000, 3), (10000, 3))
# we delete columns we're not interested in
train = train.drop(columns=['words_count'])
# now we transform label into a number
train['y'] = train['sentiment'].replace({'negative': 0, 'positive': 1})
train = train.drop('sentiment', axis=1)
<ipython-input-34-50996a7211ae>:2: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy train['y'] = train['sentiment'].replace({'negative': 0, 'positive': 1})
# shuffle train now ..
train = train.sample(frac=1).reset_index(drop=True)
x, y = [], []
for _, r in train.iterrows():
review, label = r['review'], r['y']
one_hot = np.zeros(10000)
tokens = get_tokens(review)
for token in tokens:
if token in word_to_index:
one_hot[word_to_index[token]] = int(1)
x.append(one_hot)
y.append(label)
x, y = np.array(x), np.array(y)
x.shape, y.shape
((40000, 10000), (40000,))
Now we have the representations we need to move forward and create a dense neural network to train.
We know that the first layer is the input data. It is followed by what's called a linear layer, then an activation layer (the implementation below uses a sigmoid), then another linear layer, and finally the output, which is the prediction layer.
As it turns out, we can take a bit of a shortcut to layer 1 by replacing the first linear layer with an embedding layer. The important thing to notice is that multiplying a vector of 1s and 0s by a matrix is mathematically equivalent to summing several rows of that matrix. So we just have to sum the rows of `W_0` that correspond to the words present in the review to form the hidden "embedding layer". It's much more efficient to select the relevant rows of `W_0` and sum them than to do a big vector-matrix multiplication.
Because the sentiment vocabulary is on the order of 70k words, most of a full vector-matrix multiplication would be spent multiplying zeros in the input vector by weights before summing them; embeddings skip that work, and summing a handful of rows is much faster.
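To see the equivalence, here is a minimal sketch with a tiny hypothetical vocabulary and random toy weights (`W0_toy`); the one-hot vector-matrix product and the row-sum produce exactly the same hidden layer:

```python
import numpy as np

np.random.seed(0)
vocab_size, embedding_size = 6, 3              # tiny, hypothetical sizes
W0_toy = np.random.randn(vocab_size, embedding_size)

word_indices = [1, 4, 5]                       # indices of the words present in a review
one_hot = np.zeros(vocab_size)
one_hot[word_indices] = 1

dense = one_hot.dot(W0_toy)                    # full vector-matrix multiplication
embedded = W0_toy[word_indices].sum(axis=0)    # just select and sum the relevant rows

print(np.allclose(dense, embedded))            # True: both give the same hidden layer
```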
import numpy as np
import sys
IMDB_PATH = '/Users/mohamedakramzaytar/data/2019/Q2/kaggle/IMDB/reviews.txt'
IMDB_LABEL_PATH = '/Users/mohamedakramzaytar/data/2019/Q2/kaggle/IMDB/labels.txt'
f = open(IMDB_PATH, mode='r')
raw_reviews = f.readlines()
f.close()
f = open(IMDB_LABEL_PATH, mode='r')
raw_labels = f.readlines()
f.close()
len(raw_reviews), len(raw_labels)
(25000, 25000)
# python's map object is an iterator
# we can also convert map objects to lists, tuples, ..
tokens = list(map(lambda x: set(x.split(" ")), raw_reviews))
# let's extract the vocab
vocab = set()
for sent in tokens:
for word in sent:
if (len(word)>0):
vocab.add(word)
vocab = list(vocab)
word2index = {}
for i, word in enumerate(vocab):
word2index[word] = i
# transform all reviews to vectors
input_dataset = list()
for sent in tokens:
sent_indices = list()
for word in sent:
try:
sent_indices.append(word2index[word])
except:
""
input_dataset.append(list(set(sent_indices)))
# same for target data
target_dataset = list()
for label in raw_labels:
if label == "positive\n":
target_dataset.append(1)
else:
target_dataset.append(0)
import numpy as np
np.random.seed(1)
def sigmoid(x):
return 1/(1+np.exp(-x))
lr, epochs = 0.01, 1
embedding_layer_size = 100
W0 = (0.2 * np.random.random((len(vocab), embedding_layer_size))) - 0.1
W1 = (0.2 * np.random.random((embedding_layer_size, 1))) - 0.1
# training loop
correct, total = (0, 0)
for epoch in range(epochs):
# leave last 1000 for testing
for i in range(len(input_dataset) - 1000):
# Forward Propagation
x, y = input_dataset[i], target_dataset[i]
layer_1 = sigmoid(np.sum(W0[x], axis=0))
layer_2 = sigmoid(layer_1.dot(W1))
# Gradients Calc
layer_2_delta = (layer_2 - y)
layer_1_delta = layer_2_delta.dot(W1.T)
# Backpropagation
W0[x] -= layer_1_delta*lr # update only corresponding embeddings (w/o attached input to gradient).
W1 -= np.outer(layer_1, layer_2_delta) * lr
# training accuracy
if(np.abs(layer_2_delta) < 0.5):
correct += 1
total += 1
if (i%1000 == 0):
progress = 100 * i / float(len(input_dataset))
print(f"Iter: {i} | Progress: {round(progress, 2)}% | Training Accuracy: {round(correct / float(total), 2)}%")
# test set evaluation
correct, total = (0, 0)
for i in range(len(input_dataset) - 1000, len(input_dataset)):
x, y = input_dataset[i], target_dataset[i]
layer_1 = sigmoid(np.sum(W0[x], axis=0))
layer_2 = sigmoid(layer_1.dot(W1))
if(np.abs(layer_2-y) < 0.5):
correct += 1
total += 1
print(f"Test accuracy: {correct / float(total)}")
Iter: 0 | Progress: 0.0% | Training Accuracy: 0.0% Iter: 1000 | Progress: 4.0% | Training Accuracy: 0.5% Iter: 2000 | Progress: 8.0% | Training Accuracy: 0.62% Iter: 3000 | Progress: 12.0% | Training Accuracy: 0.68% Iter: 4000 | Progress: 16.0% | Training Accuracy: 0.71% Iter: 5000 | Progress: 20.0% | Training Accuracy: 0.73% Iter: 6000 | Progress: 24.0% | Training Accuracy: 0.74% Iter: 7000 | Progress: 28.0% | Training Accuracy: 0.76% Iter: 8000 | Progress: 32.0% | Training Accuracy: 0.77% Iter: 9000 | Progress: 36.0% | Training Accuracy: 0.78% Iter: 10000 | Progress: 40.0% | Training Accuracy: 0.79% Iter: 11000 | Progress: 44.0% | Training Accuracy: 0.8% Iter: 12000 | Progress: 48.0% | Training Accuracy: 0.8% Iter: 13000 | Progress: 52.0% | Training Accuracy: 0.81% Iter: 14000 | Progress: 56.0% | Training Accuracy: 0.81% Iter: 15000 | Progress: 60.0% | Training Accuracy: 0.81% Iter: 16000 | Progress: 64.0% | Training Accuracy: 0.81% Iter: 17000 | Progress: 68.0% | Training Accuracy: 0.81% Iter: 18000 | Progress: 72.0% | Training Accuracy: 0.82% Iter: 19000 | Progress: 76.0% | Training Accuracy: 0.82% Iter: 20000 | Progress: 80.0% | Training Accuracy: 0.82% Iter: 21000 | Progress: 84.0% | Training Accuracy: 0.83% Iter: 22000 | Progress: 88.0% | Training Accuracy: 0.83% Iter: 23000 | Progress: 92.0% | Training Accuracy: 0.83% Test accuracy: 0.847
The network was looking for correlation between the input data points and the output data points. It's extremely valuable to know what kinds of patterns the network detected during training and used as signal for predicting sentiment; just because the network found correlation between the input and the output doesn't mean it found every pattern of language. Understanding the difference between what the network can currently learn from datasets and what it should learn to truly understand language is essential if we ever want to approach artificial general intelligence.
To answer this question, let's start by considering what was presented to the network. We presented a presence/absence binary indicator for each of the top 10,000 most frequent words in the corpus. We'd expect the network to learn which words correlate strongly with negative opinions and which with positive ones, but this isn't the whole story.
Hidden layers group the data points coming from the previous layer into n groups. Each hidden neuron takes in a data point and asks, "is this data point in my group?", and as the hidden layer learns, it searches for useful groupings. So what are the useful groupings for our task? A grouping is useful if it finds hidden, interesting structure in the data.
For example, understanding the difference between "terrible" and "not terrible" is a powerful grouping. However, because the input to the network is a set of vocabulary words and not a sequence, a sentence such as "It is great, not terrible" will be interpreted exactly like "It is terrible, not great".
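A quick sketch with a hypothetical mini-vocabulary makes this concrete; both sentences collapse to the same bag-of-words vector:

```python
import numpy as np

mini_vocab = ['it', 'is', 'not', 'great', 'terrible']      # hypothetical mini-vocabulary
mini_index = {w: i for i, w in enumerate(mini_vocab)}

def bag_of_words(sentence):
    # order-insensitive encoding: 1 if the word is present, regardless of position
    vec = np.zeros(len(mini_vocab))
    for word in sentence.lower().split():
        vec[mini_index[word]] = 1
    return vec

a = bag_of_words("it is great not terrible")
b = bag_of_words("it is terrible not great")
print(np.array_equal(a, b))   # True: the network receives identical inputs for both sentences
```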
If we can construct two examples that produce the same hidden-layer activation while the pattern of interest is present in one and absent in the other, then the network will not be able to detect that pattern.
We'd expect words with similar predictive power to subscribe to similar groups.
Words that subscribe to similar groups, and therefore have similar weights, will have similar linguistic meaning with regard to the task at hand (sentiment analysis).
from collections import Counter
import math
def similar(target='beautiful'):
target_index = word2index[target]
scores = Counter()
for word, index in word2index.items():
raw_difference = W0[index] - W0[target_index]
squared_difference = raw_difference**2
scores[word] = -math.sqrt(sum(squared_difference))
return scores.most_common(10)
This procedure lets us easily find the words most similar to a target word. Examples:
print(similar('beautiful'))
[('beautiful', -0.0), ('wonderfully', -0.7347476245943578), ('each', -0.7397281670618566), ('recommended', -0.7700989754926751), ('job', -0.8021862760775765), ('fascinating', -0.803780429603366), ('masterpiece', -0.806440875020042), ('true', -0.8087076223072098), ('especially', -0.8096794093609967), ('sweet', -0.8117635303801121)]
print(similar('terrible'))
[('terrible', -0.0), ('annoying', -0.7717906985613661), ('poorly', -0.8084446689686995), ('avoid', -0.8088320312353884), ('worse', -0.8246951041670193), ('stupid', -0.8309245272531632), ('boring', -0.8385096157034106), ('bad', -0.8395554304307457), ('disappointment', -0.8654898536150686), ('unfortunately', -0.8780291885742453)]
print(similar('average'))
[('average', -0.0), ('clearence', -0.6276424599813579), ('bizniss', -0.6339327432204244), ('brock', -0.6346237642100628), ('swordsmanship', -0.6370031733437977), ('sexegenarian', -0.6379006068844714), ('breckinridge', -0.6381035731502563), ('burnside', -0.6421944971499771), ('nudges', -0.6422029970777722), ('floorpan', -0.6482336019908111)]
print(similar('love'))
[('love', -0.0), ('friendship', -0.7000998912887154), ('believable', -0.7095907323713758), ('nice', -0.716328595780204), ('worth', -0.7191739646420174), ('bit', -0.7275198610668071), ('together', -0.7277400327527178), ('true', -0.7297715834947054), ('also', -0.7372723172510404), ('gives', -0.743701219426435)]
What we see is a standard phenomenon of the correlation summarization: the network seeks to create similar latent representations internally, compressing information so it can arrive at the correct target label.
Notice that `beautiful` and `recommended` are nearly identical, but only in the context of sentiment prediction. Outside this task, their meanings are quite different.
The meaning of a neuron in the network depends entirely on the target labels. The network is entirely ignorant of any meaning outside the task it was trained on. So how do we make the meaning of a neuron broader? If we give the network a task that requires a broad understanding of language, it will learn new complexities and its neurons will become much more general.
The task we'll use to learn more interesting word embeddings is "fill in the blank". There is nearly unlimited training data for it (the internet), which provides a practically endless signal to the network, and learning to fill in the blank requires at least some contextual understanding of language.
The following example uses almost the same architecture as before, with minor modifications. We'll slide a small window (a few words on each side) over the text, remove the middle word (the focus term), and train the network to predict the focus term from the surrounding words.
We'll also use a technique called negative sampling to make the network train a bit faster. Consider that in order to predict the focus term, we nominally need one output for every word in the vocabulary, which is tens of thousands of outputs and would make training slow. To overcome this, we randomly ignore most of the outputs during each forward pass: we score only the true focus term plus a small random sample of other words. Although this seems crude, it works well in practice.
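Before the full training loop, here is a minimal sketch of what a single training example looks like under this scheme (the sentence, the toy vocabulary, and the uniform negative sampling are illustrative assumptions, not the actual code below):

```python
import numpy as np

np.random.seed(1)
sentence = ['the', 'movie', 'was', 'surprisingly', 'good']   # toy example, not from the dataset
window, negative = 2, 3

target_i = 2                                          # focus term: 'was'
left = sentence[max(0, target_i - window):target_i]   # ['the', 'movie']
right = sentence[target_i + 1:target_i + 1 + window]  # ['surprisingly', 'good']
context = left + right                                # input: surrounding words, focus term removed

# negative sampling: score only the true focus term plus a few randomly drawn "wrong" words
# (the real loop below draws negatives from the corpus token stream, i.e. frequency-weighted)
toy_vocab = ['the', 'movie', 'was', 'surprisingly', 'good', 'terrible', 'plot', 'actor']
negative_ids = np.random.randint(0, len(toy_vocab), size=negative)
samples = [sentence[target_i]] + [toy_vocab[i] for i in negative_ids]

print(context)   # words whose embeddings get averaged as the input
print(samples)   # the only outputs scored this step: index 0 should be 1, the rest 0
```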
import sys, random, math
from collections import Counter
import numpy as np
IMDB_PATH = '/Users/mohamedakramzaytar/data/2019/Q2/kaggle/IMDB/reviews.txt'
IMDB_LABEL_PATH = '/Users/mohamedakramzaytar/data/2019/Q2/kaggle/IMDB/labels.txt'
np.random.seed(1)
random.seed(1)
f = open(IMDB_PATH, 'r')
raw_reviews = f.readlines()
f.close()
len(raw_reviews)
25000
tokens = list(map(lambda x: x.split(" "), raw_reviews))
len(tokens[0]), len(tokens[1]), len(tokens[2])
(185, 127, 537)
word_counter = Counter()
for review in tokens:
for token in review:
word_counter[token] -= 1
_ = word_counter.most_common() # least common in this case.
`most_common()` just sorts the entries; it doesn't keep only the top N tokens unless you force it to by passing an argument.
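For reference, a toy Counter illustrates the difference:

```python
from collections import Counter

c = Counter("abracadabra")
print(c.most_common())    # [('a', 5), ('b', 2), ('r', 2), ('c', 1), ('d', 1)] -- everything, sorted
print(c.most_common(2))   # [('a', 5), ('b', 2)] -- only the top 2
```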
vocab = list(set(map(lambda x: x[0], word_counter.most_common())))
word2index = {}
for i, word in enumerate(vocab):
word2index[word] = i
concatenated = list()
input_dataset = list()
for review in tokens:
review_indices = list()
for token in review:
try:
review_indices.append(word2index[token])
concatenated.append(word2index[token])
except:
""
input_dataset.append(review_indices)
concatenated = np.array(concatenated)
random.shuffle(input_dataset)
lr, epochs = (.05, 2)
hidden_size, window, negative = 50, 2, 5
W0 = (np.random.rand(len(vocab), hidden_size) - 0.5) * 0.2
W1 = np.zeros((len(vocab), hidden_size))
W0.shape, W1.shape
((74075, 50), (74075, 50))
layer_2_target = np.zeros(negative+1)
layer_2_target[0] = 1
def similar(target='beautiful', top=7):
target_index = word2index[target]
scores = Counter()
for word, index in word2index.items():
raw_difference = W0[index] - W0[target_index]
squared_difference = raw_difference * raw_difference
scores[word] = -math.sqrt(sum(squared_difference))
return scores.most_common(top)
def sigmoid(x):
return 1 / (1 + np.exp(-x))
for review_i, review in enumerate(input_dataset * epochs):
for target_i in range(len(review)):
# predict only a random subset, because it's really expensive to predict every vocab
# We can't do a softmax over all possible words, we will predict for the target word + a subset of the total vocab
target_samples = [review[target_i]] + list(concatenated[(np.random.rand(negative)*len(concatenated)).astype('int').tolist()])
# get tokens on the right & on Left of target word
left_context = review[max(0, target_i-window):target_i]
right_context = review[target_i+1: min(len(review), target_i+window)]
# feed forward
# context words w/o target word
# mean instead of sum, interesting
layer_1 = np.mean(W0[left_context+right_context], axis=0)
# using sigmoid here is kind of weird because there is only one true target token
layer_2 = sigmoid(layer_1.dot(W1[target_samples].T))
layer_2_delta = layer_2 - layer_2_target
layer_1_delta = layer_2_delta.dot(W1[target_samples])
# update weights
W0[left_context+right_context] -= layer_1_delta*lr
W1[target_samples] -= np.outer(layer_2_delta, layer_1)*lr
if(review_i % 1000 == 0):
print(f"Progress: {round(review_i/float(len(input_dataset)*epochs), 3)} | `Terrible` nearest neighbors: {similar('terrible', top=5)}")
print(similar('terrible'))
Progress: 0.0 | `Terrible` nearest neighbors: [('terrible', -0.0), ('misperceived', -0.37629838414529776), ('origination', -0.38645406817879285), ('bumpuses', -0.3866817101079805), ('recognition', -0.3870682177410917)] Progress: 0.02 | `Terrible` nearest neighbors: [('terrible', -0.0), ('superb', -0.9395550125335567), ('fantastic', -0.9785445410150915), ('brilliant', -1.0009723994783375), ('excellent', -1.0507492536281997)] Progress: 0.04 | `Terrible` nearest neighbors: [('terrible', -0.0), ('brilliant', -1.4181573161847985), ('horrible', -1.435351250932622), ('hilarious', -1.4828872270341869), ('fantastic', -1.5224293865473035)] Progress: 0.06 | `Terrible` nearest neighbors: [('terrible', -0.0), ('fantastic', -1.3645701568801774), ('brilliant', -1.452030183491021), ('horrible', -1.4872861958792385), ('convincing', -1.6451485631159968)] Progress: 0.08 | `Terrible` nearest neighbors: [('terrible', -0.0), ('fantastic', -1.635188731505891), ('brilliant', -1.6478061610864096), ('convincing', -1.8233281433244541), ('lame', -1.903116578071796)] Progress: 0.1 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -1.8219440801238125), ('fantastic', -1.8671267127593203), ('lame', -1.9050522776833492), ('fascinating', -1.9732384841951174)] Progress: 0.12 | `Terrible` nearest neighbors: [('terrible', -0.0), ('brilliant', -2.1433165398420138), ('horrible', -2.1638703481599477), ('fascinating', -2.2868941326807817), ('weak', -2.3292512816130797)] Progress: 0.14 | `Terrible` nearest neighbors: [('terrible', -0.0), ('brilliant', -2.5092534705974803), ('fantastic', -2.6508016767739924), ('fascinating', -2.7452444884676908), ('horrible', -2.865792216162147)] Progress: 0.16 | `Terrible` nearest neighbors: [('terrible', -0.0), ('brilliant', -2.6959329307597475), ('horrible', -2.988478730789771), ('fascinating', -3.1247256619089545), ('terrific', -3.129899894825866)] Progress: 0.18 | `Terrible` nearest neighbors: [('terrible', -0.0), ('brilliant', -2.8446078019182597), ('horrible', -2.956595849168358), ('superb', -3.1012022870890967), ('fascinating', -3.262813698847574)] Progress: 0.2 | `Terrible` nearest neighbors: [('terrible', -0.0), ('brilliant', -2.9033044381747177), ('horrible', -3.0115624244181727), ('fantastic', -3.2470408994109534), ('superb', -3.256887985703758)] Progress: 0.22 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -3.0915151786846438), ('fantastic', -3.1265211166700086), ('brilliant', -3.1454738767589787), ('superb', -3.183344902892115)] Progress: 0.24 | `Terrible` nearest neighbors: [('terrible', -0.0), ('brilliant', -3.0127884429926897), ('horrible', -3.15807211093518), ('fantastic', -3.2550240363481096), ('superb', -3.4725565464616666)] Progress: 0.26 | `Terrible` nearest neighbors: [('terrible', -0.0), ('brilliant', -2.7865435546532855), ('horrible', -3.057266418335675), ('fantastic', -3.6015249499836965), ('remarkable', -3.7331652129773283)] Progress: 0.28 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -2.908750351571196), ('brilliant', -3.0256961917363117), ('remarkable', -3.3441415219255313), ('fantastic', -3.346827479733814)] Progress: 0.3 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -3.1965770870240044), ('brilliant', -3.232387427139472), ('fantastic', -3.3406354622067576), ('laughable', -3.4525431309832544)] Progress: 0.32 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -2.7193204217548046), ('fantastic', -3.4924465018991375), ('brilliant', -3.578142963089179), ('pathetic', 
-3.69027244678774)] Progress: 0.34 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -2.7508534380433756), ('laughable', -3.564716387829581), ('brilliant', -3.6256016784964853), ('ridiculous', -3.691640952693665)] Progress: 0.36 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -2.6235371223238038), ('brilliant', -3.456400101821386), ('fantastic', -3.591454386466738), ('pathetic', -3.6334012075001842)] Progress: 0.38 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -2.683096639333961), ('fantastic', -3.4242003788819737), ('pathetic', -3.4360802291007326), ('brilliant', -3.4872253519455176)] Progress: 0.4 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -3.1624804317677495), ('fantastic', -3.5925838881609518), ('superb', -3.768105964314105), ('wonderful', -3.862213327361215)] Progress: 0.42 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -3.437289914569698), ('fantastic', -3.5216332851140426), ('wonderful', -3.650556914213743), ('superb', -3.657396309163536)] Progress: 0.44 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -3.2723437444549783), ('wonderful', -3.4368907041506995), ('brilliant', -3.7929820192531545), ('superb', -3.849470415444905)] Progress: 0.46 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -3.2460430522999157), ('fantastic', -3.701891129630293), ('superb', -3.8686900453638264), ('wonderful', -3.8932476027869205)] Progress: 0.48 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -2.9614105229354526), ('fantastic', -3.6383683495777577), ('wonderful', -3.7598207425786945), ('marvelous', -3.8547592780789506)] Progress: 0.5 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -3.1209779081546825), ('pathetic', -3.9281919379578993), ('bad', -3.9341848616327995), ('ridiculous', -3.973607969275994)] Progress: 0.52 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -3.3213812570213213), ('superb', -3.6699560346070927), ('pathetic', -3.6779200874922333), ('brilliant', -3.741357710902307)] Progress: 0.54 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -2.9042261348652203), ('ridiculous', -3.523791388919498), ('brilliant', -3.553038115161444), ('pathetic', -3.670621870859373)] Progress: 0.56 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -3.019056443525238), ('ridiculous', -3.491907182197584), ('brilliant', -3.5144127474855265), ('pathetic', -3.637777041244951)] Progress: 0.58 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -3.047805663412288), ('ridiculous', -3.284331453269096), ('pathetic', -3.416744552532168), ('fantastic', -3.5162930264465393)] Progress: 0.6 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -3.0350582648337365), ('pathetic', -3.141418301245476), ('brilliant', -3.3284278791026645), ('ridiculous', -3.3681009224333156)] Progress: 0.62 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -2.8424670383329955), ('brilliant', -3.190611111009397), ('fantastic', -3.3554779094550637), ('pathetic', -3.613499339625761)] Progress: 0.64 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -2.6288033645941358), ('ridiculous', -3.1037980754584007), ('fantastic', -3.4010410214913493), ('brilliant', -3.414474269349876)] Progress: 0.66 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -2.8451699935635197), ('ridiculous', -3.360397446507512), ('fantastic', -3.466029776873303), ('brilliant', -3.578601473087416)] 
Progress: 0.68 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -2.66673055814208), ('fantastic', -3.3362097606735692), ('brilliant', -3.373603786961304), ('magnificent', -3.624886774249765)] Progress: 0.7 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -2.8630220563840676), ('fantastic', -3.3992359956414386), ('wonderful', -3.566726961067627), ('magnificent', -3.591300947805222)] Progress: 0.72 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -2.9368471700599974), ('magnificent', -3.608067713166859), ('brilliant', -3.655464546153343), ('dreadful', -3.7050087712182838)] Progress: 0.74 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -2.635848871982613), ('brilliant', -3.6651707790712265), ('fantastic', -3.7062670345209163), ('horrid', -3.7328953795618287)] Progress: 0.76 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -2.503166496540591), ('fantastic', -3.629923785322746), ('wonderful', -3.86941026115452), ('brilliant', -3.901485156616881)] Progress: 0.78 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -2.412898394155387), ('horrid', -3.549930811490535), ('dreadful', -3.5937964712495667), ('dire', -3.758413186706391)] Progress: 0.8 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -2.7875323979515803), ('brilliant', -3.553032704687312), ('fantastic', -3.5650466133229215), ('horrid', -3.6895680657641807)] Progress: 0.82 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -2.505351530365435), ('horrid', -3.5555145292149475), ('dreadful', -3.723125002700873), ('horrendous', -3.7799764137444125)] Progress: 0.84 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -2.52225303835475), ('horrid', -3.659388277805759), ('horrendous', -3.6596442985085273), ('brilliant', -3.7064657941555823)] Progress: 0.86 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -2.403871102526396), ('brilliant', -3.4543509427136447), ('dreadful', -3.6327538154677845), ('horrendous', -3.638227523309254)] Progress: 0.88 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -2.580263267663093), ('brilliant', -3.168048700465928), ('fantastic', -3.47530255959384), ('horrendous', -3.5527378429665823)] Progress: 0.9 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -2.5560655161199812), ('brilliant', -3.2765502997525595), ('phenomenal', -3.6192939150249934), ('marvelous', -3.627323403235288)] Progress: 0.92 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -2.773181520872384), ('brilliant', -3.393156773756116), ('superb', -3.628130376783132), ('fantastic', -3.6532165585570455)] Progress: 0.94 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -2.7740421486011266), ('pathetic', -3.3994302985804645), ('phenomenal', -3.4011259732447283), ('brilliant', -3.4238172462044756)] Progress: 0.96 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -2.7104808985858138), ('pathetic', -3.3380165149787935), ('phenomenal', -3.5040384247065166), ('brilliant', -3.6382402279199804)] Progress: 0.98 | `Terrible` nearest neighbors: [('terrible', -0.0), ('horrible', -2.421661394464606), ('phenomenal', -3.4302258651982), ('pathetic', -3.5310157422128037), ('fantastic', -3.5986885562168713)] [('terrible', -0.0), ('horrible', -2.9616723194748578), ('pathetic', -3.53126362760028), ('phenomenal', -3.704929909889765), ('brilliant', -3.8445386415610043), ('dreadful', -3.9673549749081625), ('superb', -4.113236415218835)]
The word embeddings get trained according to the task the neural network is trained on. Let's look at a few examples:
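For instance, we can probe the new embedding space with the `similar()` helper defined above (the exact neighbors will differ between runs, so treat the comments as a tendency rather than a fixed output):

```python
# probe the context-prediction embeddings learned above
print(similar('love'))
print(similar('hate'))
# With these embeddings, 'love' and 'hate' typically land near each other,
# since they are used in very similar phrases, regardless of sentiment.
```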
In this sense, hate & love are pretty close!
Before, words were clustered according to the likelihood that the review is positive or negative. Now, they are clustered based on the likelihood that they occur in the same phrases (regardless of the sentiment of the review).
Our key takeaway is that even though we are training on the same dataset, using a very similar network architecture, we can influence what the network learns by changing the loss function (task). Even though it's looking at the same information, we can alter its learning behavior by simply changing the output structure.
Let's call the process of choosing what the network should learn intelligence targeting. Changing how the network measures error, its architecture, and its regularization are also ways of performing intelligence targeting.
In deep learning research, all of the above techniques fall under the umbrella term: Loss function construction.
Considering that learning is all about minimizing a loss function, this gives us a broader understanding of how neural networks work.
Different kinds of layers, activations, regularization techniques, and datasets aren't really that different. For example, if the network is overfitting, we can adjust the loss function by choosing simpler non-linearities, adding dropout, adding regularization, adding more data, and so on. All of these techniques have a similar effect on the loss function and on the learning behavior.
With learning, everything is contained within the loss function, and if something is going wrong, remember that the solution is in the loss function.
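As a tiny, self-contained illustration of loss-function construction, here is a sketch (with made-up toy values and an assumed penalty strength `lam`) of how adding an L2 penalty changes a weight update compared to the plain update used in the training loops above:

```python
import numpy as np

# toy setup: one hidden activation vector and one output weight vector
hidden = np.array([0.2, 0.7, 0.1])
w_out = np.array([[0.05], [-0.1], [0.3]])
target, lr, lam = 1.0, 0.01, 1e-4                # label, learning rate, assumed L2 strength

pred = 1 / (1 + np.exp(-hidden.dot(w_out)))      # sigmoid output
delta = pred - target                            # same error signal as in the training loops above

# plain update vs. L2-regularized update: the penalty adds a term that pulls weights toward 0,
# which is one concrete way of "constructing" a different loss for the same network
w_out_plain = w_out - np.outer(hidden, delta) * lr
w_out_l2 = w_out - (np.outer(hidden, delta) + 2 * lam * w_out) * lr
print(w_out_plain.ravel(), w_out_l2.ravel())
```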
The task of filling in the blank creates an interesting property called "word analogies". Analogies are one of the famous properties of word embeddings (or trained vectors).
We can take different embeddings and perform algebraic operations on them to discover these analogies.
def analogy(positive=['terrible', 'good'], negative=['bad']):
norms = np.sum(W0*W0, axis=1)
norms.resize((norms.shape[0], 1))
# normalize weights for vector-level operations
normed_weights = W0 * norms
query_vect = np.zeros(W0.shape[1])
for word in positive:
query_vect += normed_weights[word2index[word]]
for word in negative:
query_vect -= normed_weights[word2index[word]]
scores = Counter()
for word, index in word2index.items():
raw_difference = W0[index] - query_vect
squared_difference = raw_difference * raw_difference
scores[word] = -math.sqrt(sum(squared_difference))
return scores.most_common(10)[1:]
analogy(['elizabeth', 'he'], ['she'])
[('lee', -174.27474476700854), ('been', -174.49725533677292), ('david', -174.61356143328723), ('william', -174.6143457630514), ('walken', -174.70899829891496), ('st', -174.73844365112527), ('simon', -174.76909796361042), ('sean', -174.8055256904309), ('smith', -174.89679055038982)]
Even though "Word Analogy" Discovery was initially very exciting, the deep learning NLP paradigm didn't move forward from that to discover new features, instead, current language models rely on Recurrent Neural Networks to do language modeling (This book was released before ELMO, BERT, & GPT-2, that is why the author considers RNNs to be the SoTA in Language modeling).
Nevertheless, we need to understand why this concept emerged out of the network as a result of us training the network to fill in the blank? If we imagine the word embeddings to have two dimensions, then it would be easier to know why word analogies work:
king = [.6, .1]
man = [.5, .0]
woman = [.0, .8]
queen = [.1, 1.0]
[x_i - y_i for (x_i, y_i) in zip(king, man)]
[0.09999999999999998, 0.1]
[x_i - y_i for (x_i, y_i) in zip(queen, woman)]
[0.1, 0.19999999999999996]
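Completing the analogy with the same toy vectors (the numbers are purely illustrative), `king - man + woman` lands close to `queen`:

```python
# king - man + woman, computed element-wise on the toy 2-D embeddings above
king, man, woman, queen = [.6, .1], [.5, .0], [.0, .8], [.1, 1.0]
analogy_vec = [k - m + w for (k, m, w) in zip(king, man, woman)]
print(analogy_vec)   # roughly [0.1, 0.9] (up to floating-point noise), close to queen = [0.1, 1.0]
```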
The king/man and queen/woman pairs are similarly useful to the final prediction because the difference between "king" and "man" leaves a vector that represents something like royalty. There is one grouping of male/female-related words and another grouping of king/queen-related words, and because the offset between the two groupings is roughly constant, the distances between corresponding items in each grouping end up roughly the same.
This phenomenon can be traced back to the chosen loss. What's important is that learning analogies is more about the properties of language than about deep learning: any linear compression of these co-occurrence statistics will yield similar results.