Neighbors, Clusters, Classification

By Allison Parrish

(Note: Rough draft! Notes incomplete. But very usable!)

In this notebook, I'm going to take you through a few simple, well-known techniques for exploring small sequences of text (like lines of poetry or sentences). These techniques include:

  • Making vectors for text sequences
  • Nearest-neighbor lookups for semantic similarity
  • Visualizing corpora with t-SNE
  • Clustering sentence vectors to discover similar items
  • Classification with vectors

We're going to work with a corpus of roughly three million lines of poetry, scraped from Project Gutenberg. Before you continue, download the file by executing the cell below:

In [ ]:
!curl -L -O http://static.decontextualize.com/gutenberg-poetry-v001.ndjson.gz

I'm going to use spaCy extensively, both as a way to parse text into sentences and also as a source for pre-trained word vectors. Make sure you have it installed, along with the en_core_web_md or en_core_web_lg models. This notebook also assumes that you have scikit-learn, numpy and simpleneighbors installed.
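If any of those are missing, the cell below shows one way to get them set up (this assumes you're installing with pip; the spaCy model comes from spaCy's own download command):

In [ ]:
!pip install spacy scikit-learn numpy simpleneighbors
!python -m spacy download en_core_web_md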

Sentence vectors

Review the concept of word vectors before continuing.

Word vectors work great when we're interested in individual words. More often, though, we're interested in longer stretches of text, like sentences, lines, or paragraphs. If we had a way to represent these longer stretches of text as vectors, we could perform all of the same operations on them that word vectors allow us to perform on words. But how do we represent stretches of text as vectors?

There are lots of different ways! The classic technique in machine learning is to use the frequency of the terms found in each sequence (methods like tf-idf). Another is to learn embeddings for whole sequences directly, whether with a shallow model like doc2vec or with a neural network (like an LSTM or transformer) trained for the task. There are any number of pre-trained models you can download and use for this, including Google's Universal Sentence Encoder and the Sentence-Transformers Python package.
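To make the term-frequency idea concrete, here's a minimal sketch using scikit-learn's TfidfVectorizer (which this notebook already assumes you have installed). Each text becomes a sparse vector of weighted word counts, which means two texts that share no words end up with completely dissimilar vectors:

In [ ]:
# a minimal sketch of the term-frequency approach: each text becomes a sparse
# vector of tf-idf-weighted word counts over the corpus vocabulary
from sklearn.feature_extraction.text import TfidfVectorizer
example_texts = ["I don't want the words, I want the sound.",
                 "Wild winds went straying,"]
vectorizer = TfidfVectorizer()
doc_term_matrix = vectorizer.fit_transform(example_texts)
doc_term_matrix.shape  # (number of texts, size of the vocabulary)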

But a surprisingly effective technique is to simply average together the word vectors for each word in the sentence. A big advantage of this technique is that no further training is needed, beyond the training needed to calculate the word vectors; if you're using pre-trained vectors, even that step can be skipped. You won't get state-of-the-art results on NLP benchmarks with this technique, but it's a good baseline and still useful for many tasks.

In the section below, I sample ten thousand lines of poetry from a Project Gutenberg poetry corpus and assign each a vector using this averaging technique.

In [1]:
import numpy as np
import spacy
In [2]:
import gzip, json, random

Load spaCy's language model:

In [3]:
nlp = spacy.load('en_core_web_md')

And then load up all of the lines of poetry:

In [4]:
lines = []
for line in gzip.open("./gutenberg-poetry-v001.ndjson.gz"):
    data = json.loads(line)
    lines.append(data['s'])
In [5]:
len(lines)
Out[5]:
3085117

To make things a bit faster and less memory-intensive, I'm only going to use ten thousand lines of poetry, sampled randomly. (You can change this number to something bigger if you want! But note that some of the stuff we're doing later on in the notebook will take longer.)

In [6]:
sampled_lines = random.sample(lines, 10000)

Averages and weighted averages

Every spaCy doc and span (a document, a sentence, etc.) has a .vector attribute, which is calculated as the average of the vectors for each token it contains. The summary() function below parses the string you pass to it with spaCy and returns the vector that spaCy computes. (Here I disable the parser, tagger and ner pipeline components in order to make the process faster; we're just after the tokens and their vectors, so we don't need parts of speech, dependencies, or named entities.)

In [7]:
def summary(sent):
    return nlp(sent, disable=['parser', 'tagger', 'ner']).vector
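If you want to convince yourself that the .vector attribute really is just the average of the token vectors, here's a quick sanity check (nothing below depends on it); the hand-rolled mean and spaCy's precomputed vector should agree up to floating-point error:

In [ ]:
# compare spaCy's precomputed .vector with a hand-rolled average of the
# individual token vectors; np.allclose should return True
doc = nlp("I don't want the words, I want the sound.",
          disable=['parser', 'tagger', 'ner'])
manual_average = np.mean([token.vector for token in doc], axis=0)
np.allclose(doc.vector, manual_average)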

The code in the cell below computes "summary vectors" for the lines of poetry sampled above:

In [8]:
embeddings = [summary(line) for line in sampled_lines]

And here's what they look like:

In [10]:
rand_idx = random.randrange(len(sampled_lines))
sampled_lines[rand_idx], embeddings[rand_idx]
Out[10]:
('(Her brother he) this costly present gave.',
 array([-1.75622299e-01,  1.32986456e-01, -8.44215900e-02,  4.82430980e-02,
         1.74502004e-02, -8.76883119e-02,  1.76149942e-02, -3.61497015e-01,
        -5.74538112e-02,  2.43268013e+00, -1.48494095e-01,  2.99035579e-01,
         1.11827075e-01, -1.91019416e-01, -1.60742804e-01,  1.32711336e-01,
        -3.05573996e-02,  8.21499228e-01, -1.68846309e-01,  5.52493632e-02,
         3.93133946e-02, -1.54419899e-01, -1.10142693e-01,  8.04464519e-02,
         6.40499145e-02, -2.25680182e-03, -4.54931036e-02, -1.31961077e-01,
        -6.36553019e-02, -7.24741071e-02, -6.74557984e-02,  7.51990154e-02,
        -9.89093930e-02,  3.45098961e-04,  9.51817632e-02, -8.35796669e-02,
         6.16111644e-02,  2.46332996e-02,  1.22698285e-01, -8.34121853e-02,
         1.00834802e-01,  2.57775076e-02,  1.40923887e-01, -8.89457017e-02,
         8.39405321e-03, -8.10271651e-02, -1.75619200e-01,  2.75510009e-02,
         9.43362117e-02,  4.86863144e-02,  1.02795996e-02, -6.38375152e-03,
         1.01606309e-01,  5.86330891e-04, -4.61001706e-04,  1.47682592e-01,
        -3.08114588e-02,  4.82391007e-02,  6.68640947e-03, -1.32405698e-01,
        -1.06554486e-01, -1.82260737e-01,  9.83719006e-02,  2.03293949e-01,
         1.25748012e-02, -2.88178977e-02,  1.98377207e-01,  7.94450045e-02,
        -1.91405006e-02,  1.09067008e-01,  1.45120006e-02,  3.76129453e-03,
        -7.55779445e-03, -4.35703881e-02, -7.15392977e-02,  9.29853171e-02,
         2.03223303e-01, -7.56087899e-02, -7.68765956e-02, -1.34497881e-04,
        -1.02373092e-02,  1.06746197e-01, -1.79291472e-01,  1.02564707e-01,
        -5.03048711e-02, -3.99849981e-01,  1.66823298e-01, -3.60850692e-01,
         4.65191081e-02, -5.36334999e-02, -4.26915959e-02,  1.10784806e-01,
        -4.71474975e-02, -2.70334575e-02,  1.73046201e-01, -1.28632322e-01,
        -2.47203968e-02, -5.20690009e-02, -1.43306583e-01,  1.73910689e-02,
         3.92426066e-02, -3.27980034e-02, -1.47105843e-01, -4.50260937e-02,
        -3.54429968e-02, -3.48001033e-01,  5.12190349e-02,  6.53546443e-03,
         9.74060968e-02,  3.29941995e-02,  1.64787501e-01, -2.80001964e-02,
        -5.91099868e-03,  5.34908064e-02, -1.70863897e-01, -8.83822814e-02,
        -3.04464996e-02, -2.38255989e-02,  6.25573024e-02, -3.96150090e-02,
         5.61850965e-02, -6.56950995e-02,  1.14461914e-01,  2.37020403e-02,
        -9.73643959e-02,  6.75332993e-02, -1.15192510e-01, -9.93870646e-02,
         1.62405998e-01, -1.85996287e-05,  2.38820054e-02, -1.54578984e-02,
        -8.09850022e-02,  2.11081784e-02,  6.75455108e-02,  1.04365788e-01,
         4.40144017e-02,  1.38173491e-01,  1.66973501e-01,  2.98656039e-02,
        -1.20869637e+00,  1.13973998e-01,  1.44356415e-01,  9.65137035e-02,
        -1.02534197e-01,  1.96033046e-02, -2.20774382e-01, -8.33212882e-02,
         1.26080010e-02, -1.48068577e-01, -3.25329974e-02,  1.79310784e-01,
         1.00549102e-01, -6.15501776e-04, -8.60620104e-03, -6.00807965e-02,
        -2.24144101e-01, -5.30852489e-02,  7.25568980e-02, -6.83693066e-02,
        -1.31818712e-01, -1.25280619e-01, -4.18565646e-02, -1.62337318e-01,
        -2.43057013e-01, -1.28554717e-01, -1.81620065e-02,  3.14226821e-02,
         2.86392987e-01,  1.27790391e-01, -1.09302104e-01, -6.31299987e-02,
         6.34147972e-02,  1.05533032e-02, -9.54764336e-02, -1.52642623e-01,
        -2.45969812e-03,  5.27312048e-02, -9.85375494e-02, -9.40263271e-06,
        -1.02721594e-01, -2.11546332e-01,  9.45660193e-03,  6.08954057e-02,
        -1.55054014e-02, -1.35336090e-02, -1.46346599e-01,  1.72033206e-01,
         1.02590472e-02,  1.79677010e-02,  7.66730215e-03,  5.76320961e-02,
        -6.72889799e-02,  1.79475129e-01,  3.75980278e-03, -3.17634009e-02,
        -1.35662094e-01, -1.62346214e-01, -8.91004875e-02,  2.43386418e-01,
        -1.51126519e-01,  1.64557006e-02, -2.38958210e-01, -8.08289051e-02,
         1.32009894e-01, -6.37291968e-02,  1.44125491e-01, -1.34023100e-01,
        -3.62050012e-02,  1.33394075e-04, -4.13244441e-02, -9.65869054e-02,
        -6.03779918e-03, -1.69263005e-01,  1.23964407e-01,  1.35745127e-02,
         6.21954985e-02,  2.79022995e-02, -1.72945634e-01,  6.10213168e-02,
         4.57272008e-02, -1.54524595e-01, -1.13604188e-01,  7.88334981e-02,
         7.51482025e-02,  6.91546053e-02,  6.71735033e-02,  5.26516624e-02,
        -7.16200992e-02,  4.70744967e-02,  7.22332578e-03, -1.43274993e-01,
         2.73157991e-02, -4.08897027e-02, -3.95579152e-02, -1.11864947e-01,
        -5.76461479e-02,  5.85939065e-02, -2.77539855e-03,  9.78328958e-02,
        -4.59123962e-02,  5.13046980e-02, -8.27012062e-02,  2.36173004e-01,
         1.57255292e-01, -2.05496103e-01, -5.07364050e-02, -1.42242998e-01,
        -1.33783802e-01,  6.17518015e-02,  7.71891922e-02,  6.56182989e-02,
        -3.19690183e-02, -8.98913965e-02,  5.90162054e-02,  1.40180901e-01,
         2.06935972e-01, -1.05141088e-01, -1.69202983e-02,  9.94717106e-02,
         3.61782946e-02,  1.80594996e-01,  1.73227079e-02,  6.58436790e-02,
        -7.31033012e-02, -2.94967033e-02, -9.17861536e-02, -4.97900769e-02,
         2.21351579e-01, -5.01069948e-02,  2.30176210e-01,  1.01559699e-01,
        -1.15857206e-01, -9.82692689e-02, -5.75364307e-02,  7.65967965e-02,
        -3.71842086e-02,  9.50946584e-02, -1.08238399e-01,  1.34326488e-01,
         2.52946019e-01,  1.52743086e-01,  1.57660082e-01, -1.49053603e-01,
         1.70904659e-02,  1.32871981e-04,  1.20634004e-01, -3.90610024e-02,
         3.74647900e-02,  4.06374000e-02, -8.93395990e-02,  4.30078804e-02,
        -1.60323046e-02,  8.46801922e-02,  7.74760544e-03,  9.40140057e-03,
         8.19397271e-02,  5.43425977e-02,  6.63953200e-02, -6.93768114e-02],
       dtype=float32))
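The heading of this section mentions weighted averages. The plain average treats every token equally, so frequent function words ("the", "of", "and") can pull every line's vector toward the same neighborhood. The rest of this notebook sticks with the plain average, but here's one hypothetical weighting scheme, sketched purely for illustration, that down-weights stop words so content words dominate:

In [ ]:
# an illustrative sketch only (not used elsewhere in this notebook): weight each
# token's vector before averaging, giving stop words a much smaller say
def weighted_summary(sent, stop_weight=0.1):
    doc = nlp(sent, disable=['parser', 'tagger', 'ner'])
    vectors = [token.vector for token in doc]
    weights = [stop_weight if token.is_stop else 1.0 for token in doc]
    return np.average(vectors, axis=0, weights=weights)

weighted_summary("I don't want the words, I want the sound.")[:8]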

Build your own little search engine

Sentence vectors aren't especially interesting on their own! One thing we can use them for is to build a tiny "search engine" based on semantic similarity between lines. To do this, we need to calculate the distance between a target sentence's vector (not necessarily a sentence from our corpus) and the vector of every sentence in the corpus, and return the sentences ranked by that distance. Doing this comparison exhaustively gets computationally expensive and slow as the corpus grows. Instead, we'll use an approximate nearest neighbors algorithm, which uses some tricks to make the computation faster, at the cost of a little bit of accuracy. I'm going to use the Simple Neighbors package as a way to build an approximate nearest neighbors index quickly and easily.
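For a sense of what the approximate index is standing in for, here's a sketch of the exhaustive version: compute the cosine similarity between a target vector and every embedding in our sample, then take the top five. It works fine at ten thousand lines, but it redoes all of that arithmetic on every single query:

In [ ]:
# brute-force nearest neighbors: cosine similarity of the target against every
# embedding, then take the five most similar lines
# (the 1e-8 guards against division by zero for any all-zero vectors)
embedding_matrix = np.array(embeddings)
target = summary("I don't want the words, I want the sound.")
norms = np.linalg.norm(embedding_matrix, axis=1) * np.linalg.norm(target)
similarities = embedding_matrix @ target / (norms + 1e-8)
closest = np.argsort(-similarities)[:5]
[sampled_lines[i] for i in closest]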

In [12]:
from simpleneighbors import SimpleNeighbors

In the cell below, I build a nearest-neighbor lookup for the sampled lines of poetry:

In [13]:
lookup = SimpleNeighbors(300)
for vec, line in zip(embeddings, sampled_lines):
    lookup.add_one(line, vec)
lookup.build()

The .nearest() method returns the sentences from the corpus whose vectors are closest to the vector you pass in. The code in the cell below uses the summary() function to return the sentences most similar to the sentence you type in. The second argument controls how many sentences should be returned.

In [15]:
lookup.nearest(summary("I don't want the words, I want the sound."), 5)
Out[15]:
['I do you wrong?  I do not hear your praise',
 'I think you know it. Fitzwalter, I can save you,',
 "Mother, if you don't mind, I should like to become the boatman of",
 "I tell them they can't get me through the door, though:",
 'And I would tell it all to you;']

To get neighbors for a random item in the corpus:

In [16]:
lookup.neighbors(random.choice(lookup.corpus))
Out[16]:
['Wild winds went straying,',
 'Of wailing winds, and naked woods, and meadows brown and sere.',
 'Girt with rough skins, hies to the deserts wild,',
 'On Cowper Green I stray, tis a desert strange and chill,',
 'To slay wild beasts and chase the roving hind,',
 'And, brightly leaping down the hills,',
 'With thicket overgrown, grotesque and wild,',
 "No--'twas the wind that, whirring, rose,",
 'To bear her cloudy flame,',
 'The little rills, and waters numberless,',
 'Walked hog-eyed Allen, terror of the hills',
 'Sat upon a grassy hillock,']

Visualize poem space in two dimensions

Another thing you can do with sentence vectors is visualize them. But the vectors are high-dimensional (in our case, 300 dimensions), and there's no obvious way to map them onto 2-dimensional space. Thankfully, there are a number of algorithms for reducing the dimensionality of vectors. We're going to use t-SNE ("t-distributed stochastic neighbor embedding"), but there are others to experiment with that might be just as good or better for your application (like PCA or UMAP).

Note that in the code below, I'm using an even smaller subset of the data: that's what the [:2000] is doing, keeping just the first 2000 samples. This is because t-SNE is slow, as is drawing the results of a t-SNE.

In [17]:
from sklearn.manifold import TSNE
mapped_embeddings = TSNE(n_components=2,
                         metric='cosine',
                         init='pca',
                         verbose=1).fit_transform(embeddings[:2000])
[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 2000 samples in 0.001s...
[t-SNE] Computed neighbors for 2000 samples in 0.208s...
[t-SNE] Computed conditional probabilities for sample 1000 / 2000
[t-SNE] Computed conditional probabilities for sample 2000 / 2000
[t-SNE] Mean sigma: 0.103510
[t-SNE] KL divergence after 250 iterations with early exaggeration: 82.455208
[t-SNE] KL divergence after 1000 iterations: 2.354854

The following cell draws a very large image with the results of the t-SNE. (You might want to right-click to save the image and then bring it up in an image viewer to see the details.)

In [18]:
%matplotlib inline

import matplotlib.pyplot as plt

plt.figure(figsize=(100, 100))
x = mapped_embeddings[:,0]
y = mapped_embeddings[:,1]
plt.scatter(x, y)

for i, txt in enumerate(sampled_lines[:2000]):
    plt.annotate(txt, (x[i], y[i]))
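As mentioned above, PCA and UMAP are alternatives to t-SNE. UMAP requires the separate umap-learn package, but PCA ships with scikit-learn, so here's a sketch of swapping it in. It runs much faster, though because the projection is linear the resulting layout tends to be less visibly clustered; you can reuse the same scatter-and-annotate code on its output:

In [ ]:
# dimensionality reduction with PCA instead of t-SNE: much faster, but nearby
# points aren't pulled into the tight clusters that t-SNE produces
from sklearn.decomposition import PCA
pca_embeddings = PCA(n_components=2).fit_transform(np.array(embeddings[:2000]))
pca_embeddings.shape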