Here we are implementing this useful algorithm with a library we know and trust. With luck this will be more accessible than reading the papers but more in-depth than typical "install gensim and just do what I say" tutorials, and still easy to understand for anyone whose maths skills have atrophied to nothing (like me). This is all based on the great work by Nejc Ilenic and reading the referenced papers and gensim's source.
doc2vec descends from word2vec, which in its basic form is a model trained to predict the missing word in a context. Given sentences like "the cat ___ on the mat" it should predict "sat", and in doing so learn a useful representation of words. We can then extract the internal weights and re-use them as "word embeddings": vectors giving each word a position in N-dimensional space that is hopefully close to similar words and an appropriate distance from related words.

doc2vec, or "Paragraph Vectors", extends the word2vec idea by simply adding a document id to each context. This helps the network learn associations between contexts and the documents they come from, and produces vectors that position each paragraph (document) in space.
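Purely for intuition, a single training example built from a document reading "the cat sat on the mat" might look something like this (illustrative only; the real pipeline below works with integer ids rather than strings):

{"doc_id": 0,                             # which document the context came from
 "context": ["the", "cat", "on", "the"],  # two words either side of the gap
 "target": "sat"}                         # the missing word to predict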
First we need to load the data. We'll begin by overfitting on a tiny dataset just to check all the parts fit together.
import pandas as pd
import spacy
nlp = spacy.load("en_core_web_sm")
pd.set_option("display.max_colwidth", 100)
example_df = pd.read_csv("data/example.csv")
def tokenize_text(df):
    df["tokens"] = df.text.str.lower().str.strip().apply(
        lambda x: [token.text.strip() for token in nlp(x) if token.text.isalnum()])
    return df
example_df = tokenize_text(example_df)
example_df
| text | tokens
---|---|---
0 | In the week before their departure to Arrakis, when all the final scurrying about had reached a ... | [in, the, week, before, their, departure, to, arrakis, when, all, the, final, scurrying, about, ... |
1 | It was a warm night at Castle Caladan, and the ancient pile of stone that had served the Atreide... | [it, was, a, warm, night, at, castle, caladan, and, the, ancient, pile, of, stone, that, had, se... |
2 | The old woman was let in by the side door down the vaulted passage by Paul's room and she was al... | [the, old, woman, was, let, in, by, the, side, door, down, the, vaulted, passage, by, paul, room... |
3 | By the half-light of a suspensor lamp, dimmed and hanging near the floor, the awakened boy could... | [by, the, half, light, of, a, suspensor, lamp, dimmed, and, hanging, near, the, floor, the, awak... |
We will need to construct a vocabulary so we can reference every word by an ID.
from collections import Counter

class Vocab:
    def __init__(self, all_tokens, min_count=2):
        self.min_count = min_count
        self.freqs = {t: n for t, n in Counter(all_tokens).items() if n >= min_count}
        self.words = sorted(self.freqs.keys())
        self.word2idx = {w: i for i, w in enumerate(self.words)}
vocab = Vocab([tok for tokens in example_df.tokens for tok in tokens], min_count=1)
print(f"Dataset comprises {len(example_df)} documents and {len(vocab.words)} unique words (over the limit of {vocab.min_count} occurrences)")
Dataset comprises 4 documents and 106 unique words (over the limit of 1 occurrences)
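The two lookups we'll rely on later are word2idx (word to integer id) and freqs (word to count). For instance, for a word we can be sure is in the vocabulary:

vocab.word2idx["the"], vocab.freqs["the"]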
Words that appear extremely rarely can harm performance, so we add a simple mechanism to strip those out.
def clean_tokens(df, vocab):
    df["length"] = df.tokens.apply(len)
    df["clean_tokens"] = df.tokens.apply(lambda x: [t for t in x if t in vocab.freqs])
    df["clean_length"] = df.clean_tokens.apply(len)
    return df
example_df = clean_tokens(example_df, vocab)
example_df[:5]
| text | tokens | length | clean_tokens | clean_length
---|---|---|---|---|---
0 | In the week before their departure to Arrakis, when all the final scurrying about had reached a ... | [in, the, week, before, their, departure, to, arrakis, when, all, the, final, scurrying, about, ... | 32 | [in, the, week, before, their, departure, to, arrakis, when, all, the, final, scurrying, about, ... | 32 |
1 | It was a warm night at Castle Caladan, and the ancient pile of stone that had served the Atreide... | [it, was, a, warm, night, at, castle, caladan, and, the, ancient, pile, of, stone, that, had, se... | 39 | [it, was, a, warm, night, at, castle, caladan, and, the, ancient, pile, of, stone, that, had, se... | 39 |
2 | The old woman was let in by the side door down the vaulted passage by Paul's room and she was al... | [the, old, woman, was, let, in, by, the, side, door, down, the, vaulted, passage, by, paul, room... | 34 | [the, old, woman, was, let, in, by, the, side, door, down, the, vaulted, passage, by, paul, room... | 34 |
3 | By the half-light of a suspensor lamp, dimmed and hanging near the floor, the awakened boy could... | [by, the, half, light, of, a, suspensor, lamp, dimmed, and, hanging, near, the, floor, the, awak... | 53 | [by, the, half, light, of, a, suspensor, lamp, dimmed, and, hanging, near, the, floor, the, awak... | 53 |
The difficulty with our "the cat _ on the mat" problem is that the missing word could be any word in the vocabulary V, so the network would need |V| outputs for each input: a huge vector containing (if the network were perfectly trained) zero for every word in the vocabulary and some positive number for "sat". To calculate a loss we need to turn that into a probability distribution, i.e. softmax it, and computing the softmax of such a large vector is expensive.
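To put a number on it: the softmax probability of the missing word w requires a normalising sum over the entire vocabulary (here s_w is the network's raw score for word w),

$$ p(w \mid \text{context}) = \frac{\exp(s_w)}{\sum_{w' \in V} \exp(s_{w'})} $$

so every single training example costs O(|V|) work just to normalise the output.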
So the trick (one of many possible) we will use is Noise Contrastive Estimation (NCE). We change our "the cat _ on the mat" problem into a multiple-choice problem, asking the network to choose between "sat" and some random wrong answers like "hopscotch" and "luxuriated". This is far cheaper to normalise, since the problem is now a handful of binary (right-or-wrong-answer) classifications and the output is simply a vector of size 1 + k, where k is the number of random incorrect options.
Happily, this alternative problem still learns equally useful word representations. We just need to adjust the examples and the loss function. There is a simplified version of the NCE loss function called Negative Sampling (NEG) that we can use here.
Notes on Noise Contrastive Estimation and Negative Sampling (C. Dyer) explains the derivation of the NCE and NEG loss functions.
When we implement the loss function, we assume that the first element in a samples/scores vector is the score for the positive sample and the rest are negative samples. This convention saves us from having to pass around an auxiliary vector indicating which sample was positive.
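Concretely, the objective we implement below is, for a single example with positive score s_0 and negative scores s_1 ... s_k (σ is the sigmoid):

$$ L = -\log \sigma(s_0) - \sum_{i=1}^{k} \log \sigma(-s_i) $$

which we then average over the batch. Pushing s_0 up and the s_i down drives the loss towards zero.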
import torch
import torch.nn as nn

class NegativeSampling(nn.Module):
    def __init__(self):
        super(NegativeSampling, self).__init__()
        self.log_sigmoid = nn.LogSigmoid()

    def forward(self, scores):
        batch_size = scores.shape[0]
        n_negative_samples = scores.shape[1] - 1  # TODO average or sum the negative samples? Summing seems to be correct by the paper
        positive = self.log_sigmoid(scores[:, 0])
        negatives = torch.sum(self.log_sigmoid(-scores[:, 1:]), dim=1)
        return -torch.sum(positive + negatives) / batch_size  # average for batch

loss = NegativeSampling()
loss = NegativeSampling()
It's helpful to play with some values to reassure ourselves that this function does the right thing.
import torch
data = [[[1, -1, -1, -1]], # this dummy data uses -1 to 1, but the real model is unconstrained
[[0.5, -1, -1, -1]],
[[0, -1, -1, -1]],
[[0, 0, 0, 0]],
[[0, 0, 0, 1]],
[[0, 1, 1, 1]],
[[0.5, 1, 1, 1]],
[[1, 1, 1, 1]]]
loss_df = pd.DataFrame(data, columns=["scores"])
loss_df["loss"] = loss_df.scores.apply(lambda x: loss(torch.FloatTensor([x])))
loss_df
| scores | loss
---|---|---
0 | [1, -1, -1, -1] | tensor(1.2530) |
1 | [0.5, -1, -1, -1] | tensor(1.4139) |
2 | [0, -1, -1, -1] | tensor(1.6329) |
3 | [0, 0, 0, 0] | tensor(2.7726) |
4 | [0, 0, 0, 1] | tensor(3.3927) |
5 | [0, 1, 1, 1] | tensor(4.6329) |
6 | [0.5, 1, 1, 1] | tensor(4.4139) |
7 | [1, 1, 1, 1] | tensor(4.2530) |
Higher scores for the positive sample (always the first element) reduce the loss but higher scores for the negative samples increase the loss. This looks like the right behaviour.
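As a spot check, we can reproduce the first row by hand. With scores [1, -1, -1, -1], negating the negative scores feeds +1 into the sigmoid for every term, so the loss is

$$ -\log \sigma(1) - 3 \log \sigma(1) = -4 \log \sigma(1) \approx 4 \times 0.3133 \approx 1.253 $$

which matches the tensor(1.2530) above.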
With that in the bag, let's look at creating training data. The general idea is to create a set of examples where each example has:

  * a document id
  * a samples vector: the id of the positive (missing) word followed by the ids of some randomly chosen negative words
  * the ids of the surrounding context words

For example, if our context size was 2, the first example from the above dataset would be:
{"doc_id": 0,
"sample_ids": [word2idx[x] for x in ["week", "random-word-from-vocab", "random-word-from-vocab"...],
"context_ids": [word2idx[x] for x in ["in", "the", "before", "their"]]}
The random words are chosen according to a probability distribution: a unigram distribution raised to the 3/4 power, as proposed by T. Mikolov et al. in Distributed Representations of Words and Phrases and their Compositionality. This has the effect of slightly increasing the relative probability of rare words (look at the graph of y=x^0.75 below and see how the lower end is raised above y=x).
import altair as alt
import numpy as np
data = pd.DataFrame(zip(np.arange(0,1,0.01), np.power(np.arange(0,1,0.01), 0.75)), columns=["x", "y"])
alt.Chart(data, title="x^0.75").mark_line().encode(x="x", y="y")
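To make that concrete, here's a quick numeric sketch with made-up counts (nothing to do with our dataset): after raising to the 3/4 power and renormalising, the rarest word's probability roughly triples while the most common word's probability drops.

import numpy as np

freqs = np.array([100, 10, 1])        # toy word counts: common, uncommon, rare
unigram = freqs / freqs.sum()         # plain unigram distribution: ~[0.901, 0.090, 0.009]
smoothed = freqs ** 0.75
smoothed = smoothed / smoothed.sum()  # unigram^0.75, renormalised: ~[0.827, 0.147, 0.026]
unigram, smoothed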
import numpy as np

class NoiseDistribution:
    def __init__(self, vocab):
        self.probs = np.array([vocab.freqs[w] for w in vocab.words])
        self.probs = np.power(self.probs, 0.75)
        self.probs /= np.sum(self.probs)

    def sample(self, n):
        "Returns the indices of n words randomly sampled from the vocabulary."
        return np.random.choice(a=self.probs.shape[0], size=n, p=self.probs)

noise = NoiseDistribution(vocab)
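A quick way to eyeball the distribution is to draw a few noise words and map the indices back through the vocabulary (the exact words will differ every run):

[vocab.words[i] for i in noise.sample(5)]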
With this distribution, we advance through the documents creating examples. Note that we are always putting the positive sample first in the samples vector, following the convention the loss function expects.
import torch

def example_generator(df, context_size, noise, n_negative_samples, vocab):
    for doc_id, doc in df.iterrows():
        for i in range(context_size, len(doc.clean_tokens) - context_size):
            positive_sample = vocab.word2idx[doc.clean_tokens[i]]
            sample_ids = noise.sample(n_negative_samples).tolist()
            # Fix a wee bug - ensure negative samples don't accidentally include the positive
            sample_ids = [sample_id if sample_id != positive_sample else -1 for sample_id in sample_ids]
            sample_ids.insert(0, positive_sample)
            context = doc.clean_tokens[i - context_size:i] + doc.clean_tokens[i + 1:i + context_size + 1]
            context_ids = [vocab.word2idx[w] for w in context]
            yield {"doc_ids": torch.tensor(doc_id),  # we use plural here because it will be batched
                   "sample_ids": torch.tensor(sample_ids),
                   "context_ids": torch.tensor(context_ids)}
examples = example_generator(example_df, context_size=5, noise=noise, n_negative_samples=5, vocab=vocab)
Now we package this up as a good old PyTorch dataset and dataloader.
from torch.utils.data import Dataset, DataLoader

class NCEDataset(Dataset):
    def __init__(self, examples):
        self.examples = list(examples)  # just naively evaluate the whole damn thing - suboptimal!

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, index):
        return self.examples[index]

dataset = NCEDataset(examples)
dataloader = DataLoader(dataset, batch_size=2, drop_last=True, shuffle=True)  # TODO bigger batch size when not dummy data
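Before going further it's worth confirming the tensor shapes of one batch: with batch_size=2, context_size=5 and 5 negative samples we should see doc_ids of shape (2,), context_ids of shape (2, 10) and sample_ids of shape (2, 6).

batch = next(iter(dataloader))
batch["doc_ids"].shape, batch["context_ids"].shape, batch["sample_ids"].shape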
It's going to also be useful to have a way to convert batches back to a readable form for debugging, so we add a helper function.
def describe_batch(batch, vocab):
    results = []
    for doc_id, context_ids, sample_ids in zip(batch["doc_ids"], batch["context_ids"], batch["sample_ids"]):
        context = [vocab.words[i] for i in context_ids]
        context.insert(len(context_ids) // 2, "____")
        samples = [vocab.words[i] for i in sample_ids]
        result = {"doc_id": doc_id,
                  "context": " ".join(context),
                  "context_ids": context_ids,
                  "samples": samples,
                  "sample_ids": sample_ids}
        results.append(result)
    return results
describe_batch(next(iter(dataloader)), vocab)
[{'doc_id': tensor(2), 'context': 'was allowed a moment to ____ in at him where he', 'context_ids': tensor([ 99, 5, 0, 61, 93, 52, 11, 48, 103, 47]), 'samples': ['peer', 'moment', 'atreides', 'a', 'caladan', 'night'], 'sample_ids': tensor([71, 61, 12, 0, 20, 65])}, {'doc_id': tensor(3), 'context': 'mother the old woman was ____ witch shadow hair like matted', 'context_ids': tensor([ 62, 91, 67, 105, 99, 104, 79, 44, 59, 60]), 'samples': ['a', 'where', 'in', 'cooled', 'an', 'woman'], 'sample_ids': tensor([ 0, 103, 52, 24, 6, -1])}]
Let's jump into creating the model itself. There isn't much to it: we combine a document's paragraph vector with the vectors of the context words and multiply the result by the output layer. Combining the paragraph and word vectors is done by summing here, but it could also be done by concatenating them. The original paper actually found concatenation works better, perhaps because summing loses word order information.
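As an aside, a concatenation-based combine step might look roughly like this sketch (illustrative only, not used in the model below; the output layer would then need (2 * context_size + 1) * vec_dim rows rather than vec_dim):

import torch

def concat_inputs(paragraph_matrix, word_matrix, doc_ids, context_ids):
    # Sketch of the concatenation variant: keep each context word's vector separate
    # and flatten, rather than summing everything down to a single vec_dim vector.
    doc_vecs = paragraph_matrix[doc_ids, :].unsqueeze(dim=1)  # (batch_size, 1, vec_dim)
    word_vecs = word_matrix[context_ids, :]                   # (batch_size, 2 * context_size, vec_dim)
    combined = torch.cat([doc_vecs, word_vecs], dim=1)        # (batch_size, 2 * context_size + 1, vec_dim)
    return combined.reshape(combined.shape[0], -1)            # (batch_size, (2 * context_size + 1) * vec_dim)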
import torch.nn as nn

class DistributedMemory(nn.Module):
    def __init__(self, vec_dim, n_docs, n_words):
        super(DistributedMemory, self).__init__()
        self.paragraph_matrix = nn.Parameter(torch.randn(n_docs, vec_dim))
        self.word_matrix = nn.Parameter(torch.randn(n_words, vec_dim))
        self.outputs = nn.Parameter(torch.zeros(vec_dim, n_words))

    def forward(self, doc_ids, context_ids, sample_ids):
        # first add the doc vector to the summed context word vectors to make the inputs
        inputs = torch.add(self.paragraph_matrix[doc_ids, :],                   # (batch_size, vec_dim)
                           torch.sum(self.word_matrix[context_ids, :], dim=1))  # (batch_size, 2x context, vec_dim) -> sum to (batch_size, vec_dim)
        #
        # select the subset of the output layer for the NCE test
        outputs = self.outputs[:, sample_ids]  # (vec_dim, batch_size, n_negative_samples + 1)
        #
        return torch.bmm(inputs.unsqueeze(dim=1),             # then multiply with some munging to make the tensor shapes line up
                         outputs.permute(1, 0, 2)).squeeze()  # -> (batch_size, n_negative_samples + 1)
model = DistributedMemory(vec_dim=50,
n_docs=len(example_df),
n_words=len(vocab.words))
Let's take it for a spin!
with torch.no_grad():
    logits = model.forward(**next(iter(dataloader)))
logits
tensor([[0., 0., 0., 0., 0., 0.], [0., 0., 0., 0., 0., 0.]])
Oh yeah, the output layer was initialized with zeros. Time to bash out a standard issue PyTorch training loop.
from tqdm import tqdm, trange
from torch.optim import Adam  # Ilenic uses Adam, but gensim uses plain SGD
import numpy as np

def train(model, dataloader, epochs=40, lr=1e-3):
    optimizer = Adam(model.parameters(), lr=lr)
    training_losses = []
    try:
        for epoch in trange(epochs, desc="Epochs"):
            epoch_losses = []
            for batch in dataloader:
                model.zero_grad()
                logits = model.forward(**batch)
                batch_loss = loss(logits)
                epoch_losses.append(batch_loss.item())
                batch_loss.backward()
                optimizer.step()
            training_losses.append(np.mean(epoch_losses))
    except KeyboardInterrupt:
        print(f"Interrupted on epoch {epoch}!")
    finally:
        return training_losses
Now we'll sanity check by overfitting the example data. The training loss should drop from its untrained value to something close to the minimum possible.
training_losses = train(model, dataloader, epochs=40, lr=1e-3)
Epochs: 100%|██████████| 40/40 [00:02<00:00, 21.83it/s]
import altair as alt
df_loss = pd.DataFrame(enumerate(training_losses), columns=["epoch", "training_loss"])
alt.Chart(df_loss).mark_bar().encode(alt.X("epoch"), alt.Y("training_loss", scale=alt.Scale(type="log")))
And because we're paranoid types, let's check a prediction.
with torch.no_grad():
    logits = model.forward(**next(iter(dataloader)))
logits
tensor([[ 4.0063, -7.1900, -6.8099, -5.7869, -6.4321, -3.4320], [ 4.9544, -6.3152, -7.6040, -6.6827, -7.8661, -5.0460]])
The positive sample gets a positive score and the negatives get negative scores. Super.
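If we want to read those scores as the model's confidence that each candidate is the real missing word rather than noise, we can squash them with a sigmoid (the same σ the loss uses):

torch.sigmoid(logits)  # close to 1 for the positive column, close to 0 for the negatives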
We should be able to get the paragraph vectors for the documents and do things like check them for similarity to one another.
from sklearn.preprocessing import normalize

def most_similar(paragraph_matrix, docs_df, index, n=None):
    pm = normalize(paragraph_matrix, norm="l2")  # in a smarter implementation we would cache this somewhere
    sims = np.dot(pm, pm[index, :])
    df = pd.DataFrame(enumerate(sims), columns=["doc_id", "similarity"])
    n = n if n is not None else len(sims)
    return (df.merge(docs_df[["text"]].reset_index(drop=True), left_index=True, right_index=True)
              .sort_values(by="similarity", ascending=False)[:n])
most_similar(model.paragraph_matrix.data, example_df, 1, n=10)
| doc_id | similarity | text
---|---|---|---
1 | 1 | 1.000000 | It was a warm night at Castle Caladan, and the ancient pile of stone that had served the Atreide... |
0 | 0 | 0.177416 | In the week before their departure to Arrakis, when all the final scurrying about had reached a ... |
3 | 3 | 0.081760 | By the half-light of a suspensor lamp, dimmed and hanging near the floor, the awakened boy could... |
2 | 2 | -0.044768 | The old woman was let in by the side door down the vaulted passage by Paul's room and she was al... |
It's not particularly illuminating for our tiny set of dummy data though. We can also use PCA to reduce our n-dimensional paragraph vectors to 2 dimensions and see if they are clustered nicely.
from sklearn.decomposition import PCA

def pca_2d(paragraph_matrix, groups):
    pca = PCA(n_components=2)
    reduced_dims = pca.fit_transform(paragraph_matrix)
    print(f"2-component PCA, explains {sum(pca.explained_variance_):.2f}% of variance")
    df = pd.DataFrame(reduced_dims, columns=["x", "y"])
    df["group"] = groups
    return df
example_2d = pca_2d(model.paragraph_matrix.data, ["0","1","2","3"])
alt.Chart(example_2d).mark_point().encode(x="x", y="y", color="group")
2-component PCA, explains 48.18% of variance
Not much to see on such a tiny dataset without any labelled groups.
For a more realistic test we'll use the BBC news articles dataset. The dataset was created by Derek Greene at UCD and all articles are copyright Auntie. I've munged it into a file per topic.
dfs = []
for document_set in ("sport",
                     "business",
                     "politics",
                     "tech",
                     "entertainment"):
    df_ = pd.read_csv(f"data/bbc/{document_set}.csv.bz2", encoding="latin1")
    df_ = tokenize_text(df_)
    df_["group"] = document_set
    dfs.append(df_)
bbc_df = pd.concat(dfs, ignore_index=True)  # reset the index so each document gets a unique doc_id
bbc_df[:4]
| text | tokens | group
---|---|---|---
0 | Claxton hunting first major medal British hurdler Sarah Claxton is confident she can win her fi... | [claxton, hunting, first, major, medal, british, hurdler, sarah, claxton, is, confident, she, ca... | sport |
1 | O'Sullivan could run in Worlds Sonia O'Sullivan has indicated that she would like to participat... | [could, run, in, worlds, sonia, has, indicated, that, she, would, like, to, participate, in, nex... | sport |
2 | Greene sets sights on world title Maurice Greene aims to wipe out the pain of losing his Olympi... | [greene, sets, sights, on, world, title, maurice, greene, aims, to, wipe, out, the, pain, of, lo... | sport |
3 | IAAF launches fight against drugs The IAAF - athletics' world governing body - has met anti-dop... | [iaaf, launches, fight, against, drugs, the, iaaf, athletics, world, governing, body, has, met, ... | sport |
bbc_vocab = Vocab([tok for tokens in bbc_df.tokens for tok in tokens])
bbc_df = clean_tokens(bbc_df, bbc_vocab)
print(f"Dataset comprises {len(bbc_df)} documents and {len(bbc_vocab.words)} unique words")
Dataset comprises 2225 documents and 19063 unique words
bbc_noise = NoiseDistribution(bbc_vocab)
bbc_examples = list(example_generator(bbc_df, context_size=5, noise=bbc_noise, n_negative_samples=5, vocab=bbc_vocab))
bbc_dataset = NCEDataset(bbc_examples)
bbc_dataloader = DataLoader(bbc_dataset, batch_size=1024, drop_last=True, shuffle=True) # TODO could tolerate a larger batch size
bbc_model = DistributedMemory(vec_dim=50,
n_docs=len(bbc_df),
n_words=len(bbc_vocab.words))
bbc_training_losses = train(bbc_model, bbc_dataloader, epochs=80, lr=1e-3)
Epochs: 100%|██████████| 80/80 [36:14<00:00, 26.78s/it]
alt.Chart(pd.DataFrame(enumerate(bbc_training_losses), columns=["epoch", "training_loss"])).mark_bar().encode(x="epoch", y="training_loss")
Let's take a look at the reduced dimensionality paragraph vectors.
bbc_2d = pca_2d(bbc_model.paragraph_matrix.data, bbc_df.group.to_numpy())
chart = alt.Chart(bbc_2d).mark_point().encode(x="x", y="y", color="group")
# Uncomment to print chart inline, but beware it will inflate the notebook size
# chart
2-component PCA, explains 2.65% of variance
These results aren't great, but we can see the beginnings of separation. If we look at just two topics it becomes more obvious.
chart = alt.Chart(bbc_2d[bbc_2d["group"].isin(["sport", "business"])]).mark_point().encode(x="x", y="y", color="group")
# Uncomment to print chart inline, but beware it will inflate the notebook size
# chart
Likewise we can see sorting by similarity produces reasonable, but not ideal, results.
most_similar(bbc_model.paragraph_matrix.data, bbc_df, 0, n=10)
| doc_id | similarity | text
---|---|---|---
0 | 0 | 1.000000 | Claxton hunting first major medal British hurdler Sarah Claxton is confident she can win her fi... |
37 | 37 | 0.504319 | Radcliffe proves doubters wrong This won't go down as one of the greatest marathons of Paula's ... |
41 | 41 | 0.499603 | Radcliffe enjoys winning comeback Paula Radcliffe made a triumphant return to competitive runni... |
1545 | 1545 | 0.499484 | Search wars hit desktop PCs Another front in the on-going battle between Microsoft and Google i... |
1266 | 1266 | 0.490500 | Student 'inequality' exposed Teenagers from well-off backgrounds are six times more likely to g... |
19 | 19 | 0.442955 | Edwards tips Idowu for Euro gold World outdoor triple jump record holder and BBC pundit Jonatha... |
348 | 348 | 0.430447 | Italy aim to rattle England Italy coach John Kirwan believes his side can upset England as the ... |
251 | 251 | 0.429918 | Ferguson rues failure to cut gap Boss Sir Alex Ferguson was left ruing Manchester United's fail... |
24 | 24 | 0.429485 | El Guerrouj targets cross country Double Olympic champion Hicham El Guerrouj is set to make a r... |
464 | 464 | 0.412518 | Henin-Hardenne beaten on comeback Justine Henin-Hardenne lost to Elena Dementieva in a comeback... |
That's all for now! I honestly hope that was fun and educational (it was for me, anyway).
But data science projects are notorious for never being finished. To carry this on, we could:
  * compare our implementation and results against gensim and Ilenic's PyTorch implementation; it should be very similar to the latter