Quick and dirty keywords from small texts with spaCy

By Allison Parrish

The idea of "keyword extraction" goes by a number of names, depending on your field of study ("keyphrase selection," "lexical feature selection," "automatic terms recognition"), but the basic idea is the same: for a given text, find the words in the text that are most representative of what that text is about.

I wanted to figure out a performant and elegant way to do keyword extraction that works well on individual texts of small to medium size (like individual poems), is fairly easy to explain and reason about, doesn't require tweaking but is easy to tweak if necessary, and doesn't require any external data or libraries outside of spaCy and what you get by default with Anaconda. This notebook shows a solution I found that meets these criteria.

Keyword extraction

There's enough academic literature about keyword extraction that someone could easily write a survey article about all of the survey articles. These two overviews were the most helpful for me in understanding the scope of the problem and formulating a solution:

Here's an outline of the solution I came up with:

  • Tokenize the source text with spaCy
  • Compute the observed frequency of each token in the source text
  • Calculate the expected frequency of each token using spaCy's built-in unigram probabilities
  • Find the significance of the difference between observed and expected frequency using a G-test (as implemented with scipy's chi2_contingency function)
  • Return the tokens with the highest significance

To be clear, the approach of using a likelihood ratio test to measure keyness is nothing new (see below for Dunning's proposal from 1993!). The only real trick in my implementation is using spaCy's unigram probabilities as the "reference corpus."

Implementation

First, import spacy and load the model. (You need the medium model or the large model for this to work; the small model doesn't come with accurate probability information.)

In [92]:
import spacy
In [93]:
nlp = spacy.load('en_core_web_md')

In the following few cells, I pasted in verbatim some small documents to test the procedure on.

In [94]:
genesis_txt = """\
In the beginning God created the heaven and the earth. 
And the earth was without form, and void; and darkness was upon the face of the deep. And the Spirit of God moved upon the face of the waters. 
And God said, Let there be light: and there was light. 
And God saw the light, that it was good: and God divided the light from the darkness. 
And God called the light Day, and the darkness he called Night. And the evening and the morning were the first day. 
And God said, Let there be a firmament in the midst of the waters, and let it divide the waters from the waters. 
And God made the firmament, and divided the waters which were under the firmament from the waters which were above the firmament: and it was so. 
And God called the firmament Heaven. And the evening and the morning were the second day. 
And God said, Let the waters under the heaven be gathered together unto one place, and let the dry land appear: and it was so. 
And God called the dry land Earth; and the gathering together of the waters called he Seas: and God saw that it was good. 
And God said, Let the earth bring forth grass, the herb yielding seed, and the fruit tree yielding fruit after his kind, whose seed is in itself, upon the earth: and it was so. 
And the earth brought forth grass, and herb yielding seed after his kind, and the tree yielding fruit, whose seed was in itself, after his kind: and God saw that it was good. 
And the evening and the morning were the third day. 
And God said, Let there be lights in the firmament of the heaven to divide the day from the night; and let them be for signs, and for seasons, and for days, and years: 
And let them be for lights in the firmament of the heaven to give light upon the earth: and it was so. 
And God made two great lights; the greater light to rule the day, and the lesser light to rule the night: he made the stars also. 
And God set them in the firmament of the heaven to give light upon the earth, 
And to rule over the day and over the night, and to divide the light from the darkness: and God saw that it was good. 
And the evening and the morning were the fourth day. 
And God said, Let the waters bring forth abundantly the moving creature that hath life, and fowl that may fly above the earth in the open firmament of heaven. 
And God created great whales, and every living creature that moveth, which the waters brought forth abundantly, after their kind, and every winged fowl after his kind: and God saw that it was good. 
And God blessed them, saying, Be fruitful, and multiply, and fill the waters in the seas, and let fowl multiply in the earth. 
And the evening and the morning were the fifth day. 
And God said, Let the earth bring forth the living creature after his kind, cattle, and creeping thing, and beast of the earth after his kind: and it was so. 
And God made the beast of the earth after his kind, and cattle after their kind, and every thing that creepeth upon the earth after his kind: and God saw that it was good. 
And God said, Let us make man in our image, after our likeness: and let them have dominion over the fish of the sea, and over the fowl of the air, and over the cattle, and over all the earth, and over every creeping thing that creepeth upon the earth. 
So God created man in his own image, in the image of God created he him; male and female created he them. 
And God blessed them, and God said unto them, Be fruitful, and multiply, and replenish the earth, and subdue it: and have dominion over the fish of the sea, and over the fowl of the air, and over every living thing that moveth upon the earth. 
And God said, Behold, I have given you every herb bearing seed, which is upon the face of all the earth, and every tree, in the which is the fruit of a tree yielding seed; to you it shall be for meat. 
And to every beast of the earth, and to every fowl of the air, and to every thing that creepeth upon the earth, wherein there is life, I have given every green herb for meat: and it was so. 
And God saw every thing that he had made, and, behold, it was very good. And the evening and the morning were the sixth day. 
"""
In [95]:
frost_txt = """\
Two roads diverged in a yellow wood,
And sorry I could not travel both
And be one traveler, long I stood
And looked down one as far as I could
To where it bent in the undergrowth;

Then took the other, as just as fair,
And having perhaps the better claim,
Because it was grassy and wanted wear;
Though as for that the passing there
Had worn them really about the same,

And both that morning equally lay
In leaves no step had trodden black.
Oh, I kept the first for another day!
Yet knowing how way leads on to way,
I doubted if I should ever come back.

I shall be telling this with a sigh
Somewhere ages and ages hence:
Two roads diverged in a wood, and I —
I took the one less travelled by,
And that has made all the difference.
"""
In [96]:
hd_txt = """\
Rose, harsh rose, 
marred and with stint of petals, 
meagre flower, thin, 
spare of leaf,

more precious 
than a wet rose 
single on a stem -- 
you are caught in the drift.

Stunted, with small leaf, 
you are flung on the sand, 
you are lifted 
in the crisp sand 
that drives in the wind.

Can the spice-rose 
drip such acrid fragrance 
hardened in a leaf?
"""

Using spaCy to calculate the expected count of a token

Using spaCy's unigram probabilities, you can guess at how probable a given token is to appear in any given English text. The problem is that spaCy gives us the unigram probabilities as natural logarithms, but in order to calculate the G-test, we need absolute counts: both the frequency of the token and the total number of tokens in the reference corpus. My solution to this was to just... guess at the total number of tokens, and return numbers based on that guess. The expected() function below performs this operation. It takes a spaCy Language object and a word to look up and returns (as a tuple) the number of times the token is expected to occur and the size of the reference corpus that this guess is based on. (You can override the assumed total number of tokens by passing the named parameter assumed_total.)

In [102]:
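from math import e
def expected(nlp, word, assumed_total=1e7):
    # .prob is a natural-log probability: undo the log, then scale it
    # by the assumed size of the reference corpus
    guess = (e**nlp.vocab[word].prob) * assumed_total
    return (guess, assumed_total)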

The call below shows us that the token cat is expected to occur around 735 times out of every ten million words:

In [108]:
expected(nlp, "cat")
Out[108]:
(735.454676714558, 10000000.0)

Whereas the token the occurs nearly three hundred thousand times:

In [109]:
expected(nlp, "the")
Out[109]:
(293410.81908600265, 10000000.0)
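
As a sanity check on the conversion, here's a small sketch that runs it in both directions. It starts from the expected count for cat shown above and recovers the log probability that count implies. The helper names log_prob_to_count and count_to_log_prob are made up for this illustration and aren't part of spaCy.

```python
from math import exp, log

def log_prob_to_count(log_prob, assumed_total=1e7):
    # Same arithmetic as expected(): undo the natural log, then scale.
    return exp(log_prob) * assumed_total

def count_to_log_prob(count, assumed_total=1e7):
    # Inverse operation: turn an expected count back into a log probability.
    return log(count / assumed_total)

cat_count = 735.454676714558  # expected count for "cat" from the output above
cat_log_prob = count_to_log_prob(cat_count)
print(round(cat_log_prob, 2))  # -9.52

# Changing the assumed corpus size rescales the count but not the proportion.
print(log_prob_to_count(cat_log_prob, assumed_total=1e6))  # about 73.5
```

The guess at the corpus size only changes the scale of the numbers, not the underlying probability, though it does affect how much weight the reference corpus carries in the G-test.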

Counting observed tokens

I'm using the first chapter of the King James Version of the Book of Genesis in the code below. The first step is to parse the text with spaCy so we can get a list of tokens:

In [110]:
genesis_doc = nlp(genesis_txt)

Then I use Python's built-in Counter object to count the number of times each token occurs in the text and get the total number of tokens:

In [111]:
from collections import Counter
observed = Counter([item.text for item in genesis_doc])
observed_total = sum(observed.values())

Simple enough!
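
Two details of this counting step matter later on: the counts are case-sensitive, and punctuation tokens get counted like any other token. The sketch below shows both with a hand-tokenized stand-in for a spaCy document (an assumption made so the example runs without loading a model).

```python
from collections import Counter

# Hand-tokenized stand-in for the tokens spaCy would produce (illustrative only).
tokens = ["And", "God", "said", ",", "Let", "there", "be", "light", ":",
          "and", "there", "was", "light", "."]

observed = Counter(tokens)
observed_total = sum(observed.values())

print(observed["light"])                 # 2
print(observed["And"], observed["and"])  # 1 1 (counted separately)
print(observed["firmament"])             # 0 (missing tokens count as zero)
print(observed_total)                    # 14 (punctuation included)
```

This is why And and and are scored as separate tokens in the results further down.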

G-test with scipy

For our purposes, you can think of a G-test as a way of determining whether the difference in token counts between two corpora is significant or not, based on the size of the corpora in question. The likelihood ratio calculation used here was originally proposed in the following paper:

  • Dunning, Ted. “Accurate Methods for the Statistics of Surprise and Coincidence.” Computational Linguistics, vol. 19, no. 1, 1993, pp. 61–74.

Ted Dunning has also written an informative blog post about the method, and maintains a GitHub repository with an implementation of the log-likelihood ratio in question.
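
To make the statistic less of a black box, here's a minimal pure-Python sketch of the log-likelihood ratio for a table of counts. The function name g_statistic is made up for this illustration. The sketch also skips the continuity correction that scipy's chi2_contingency applies by default to 2x2 tables, so its numbers won't exactly match scipy's.

```python
from math import log

def g_statistic(table):
    """Log-likelihood ratio (G) for a contingency table given as rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    g = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            if obs == 0:
                continue  # a zero cell contributes nothing to the sum
            # Count we'd expect in this cell if both rows had the same proportions
            exp = row_totals[i] * col_totals[j] / grand_total
            g += obs * log(obs / exp)
    return 2 * g

# Rows with identical proportions: G is (essentially) zero.
same = g_statistic([[10, 1000], [100, 10000]])
# Same second row, but the first row's count is ten times higher: G is large.
different = g_statistic([[100, 1000], [100, 10000]])
print(same, different)
```

The bigger the mismatch between the two rows' proportions (and the more data behind it), the larger the value of G.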

I'm going to use the scipy.stats module, which has a function called chi2_contingency. When called with lambda_=0, it computes the same log-likelihood ratio that Dunning describes. You give it a 2x2 matrix: the first row holds the observed frequency of a token and the total number of observed tokens; the second row holds the expected frequency of the token according to the reference corpus (spaCy's unigram probabilities, in our case) and the total number of tokens in the reference corpus. It returns the result of the G-test as the first item in a tuple:

In [112]:
from scipy.stats import chi2_contingency
chi2_contingency([(observed["waters"], observed_total), expected(nlp, "waters")], lambda_=0)
Out[112]:
(139.73209059296906,
 3.0464897814259877e-32,
 1,
 array([[  5.52243782e-03,   9.62994478e+02],
        [  5.73464529e+01,   9.99998901e+06]]))
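
The other items in the returned tuple are the p-value, the degrees of freedom, and the array of expected frequencies. With one degree of freedom, the p-value can be recovered from the statistic using only the standard library, as in the sketch below. (p_value_one_dof is a made-up helper name, not a scipy function.)

```python
from math import erfc, sqrt

def p_value_one_dof(g):
    # Survival function of the chi-squared distribution with 1 degree of freedom:
    # the probability of seeing a statistic at least this large by chance.
    return erfc(sqrt(g / 2))

g_waters = 139.73209059296906  # statistic from the output above
print(p_value_one_dof(g_waters))  # about 3.05e-32, the second item in the tuple

# For reference, the conventional 0.05 threshold corresponds to G of about 3.84.
print(round(p_value_one_dof(3.841459), 3))  # 0.05
```

For ranking keywords, only the statistic itself is needed, which is why the code that follows keeps just the first item.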

The following function performs this G-test on a word, given the Counter object for the text in question and the total number of tokens:

In [151]:
def spacygtest(nlp, ob_counts, ob_total, word, assumed_total=1e7):
    # pass assumed_total through so that overriding it actually takes effect
    g = chi2_contingency([[ob_counts[word], ob_total], expected(nlp, word, assumed_total)], lambda_=0)
    return g[0]

The G-test says that the frequency of waters in the text differs significantly from what the reference corpus predicts:

In [152]:
spacygtest(nlp, observed, observed_total, "waters")
Out[152]:
139.73209059296906

While the difference for was is much less significant:

In [154]:
spacygtest(nlp, observed, observed_total, "was")
Out[154]:
16.32021435023545

Keywords at last

With all of the above in mind, getting a list of keywords from our source text is a simple matter of sorting words in reverse order by the result of the G-test:

In [115]:
scored = [(item, spacygtest(nlp, observed, observed_total, item)) for item in observed.keys()]
In [117]:
sorted(scored, key=lambda x: x[1], reverse=True)[:15]
Out[117]:
[('God', 264.28722467522732),
 ('earth', 198.04293506348199),
 ('And', 171.795604760217),
 ('firmament', 149.287711944471),
 ('waters', 139.73209059296906),
 ('the', 124.22752085499357),
 ('\n', 101.79227284635938),
 ('fowl', 88.192035359442002),
 ('and', 80.254213997373384),
 ('upon', 77.033264373268906),
 ('yielding', 68.555046765251475),
 ('light', 63.50760252933739),
 ('seed', 56.44045670161627),
 ('heaven', 54.068266900441202),
 ('evening', 53.712557047220272)]

This list of words is perhaps a little unintuitive, since it includes words that you might not consider to be "key," like And and the (and even the newline token \n). But if you actually take a look at the original source text, I think you'll find that these words really do occur in the first chapter of Genesis in a way that is peculiar to that text.

The keyness() function nicely wraps up the spacygtest() function. Pass it a spaCy Language object and a list of tokens, and you'll get back the tokens in order of their keyness:

In [155]:
def keyness(nlp, tokens):
    observed = Counter(tokens)
    observed_total = sum(observed.values())
    scored = [(item, spacygtest(nlp, observed, observed_total, item)) for item in observed.keys()]
    return sorted(scored, key=lambda x: x[1], reverse=True)

You can filter and transform the tokens however you'd like before sending them to this function. For example, in the cell below, I convert every token to lowercase and filter out any tokens that aren't purely alphabetic:

In [156]:
keyness(nlp, [item.text.lower() for item in genesis_doc if item.is_alpha])[:10]
Out[156]:
[('god', 282.15227472985032),
 ('earth', 217.36279310994706),
 ('and', 210.96032879214761),
 ('firmament', 152.29209884772735),
 ('the', 151.26609219696854),
 ('waters', 143.43776984372414),
 ('fowl', 90.13927244329129),
 ('upon', 80.369633883611058),
 ('let', 73.736463442332791),
 ('yielding', 70.149066167234366)]

Quick, dirty, dead simple

To make things even simpler, I made the keywords() function defined below. It takes a spaCy Language object and a string, tokenizes the string with spaCy, and returns a list of keywords from it (ten by default). You can pass a named parameter n to the function to set the number of keywords to return.

In [157]:
def keywords(nlp, text, n=10, transform=None, filter_=None):
    doc = nlp(text)
    if transform is None:
        transform = lambda x: x.text.lower()
    if filter_ is None:
        filter_ = lambda x: x.pos_ in ('NOUN', 'PROPN', 'ADJ')
    tokens = [transform(item) for item in doc if filter_(item)]
    scored = keyness(nlp, tokens)
    return [item[0] for item in scored][:n]

Twelve keywords from Genesis:

In [158]:
keywords(nlp, genesis_txt, n=12)
Out[158]:
['god',
 'earth',
 'firmament',
 'waters',
 'fowl',
 'light',
 'heaven',
 'seed',
 'evening',
 'kind',
 'day',
 'morning']

And from Robert Frost's "The Road Not Taken":

In [159]:
keywords(nlp, frost_txt, n=12)
Out[159]:
['roads',
 'ages',
 'wood',
 'undergrowth',
 'grassy',
 'traveler',
 'way',
 'sigh',
 'yellow',
 'passing',
 'leaves',
 'morning']

And from H.D.'s "Sea Rose":

In [160]:
keywords(nlp, hd_txt, n=12)
Out[160]:
['rose',
 'leaf',
 'sand',
 'acrid',
 'meagre',
 'petals',
 'stunted',
 'fragrance',
 'stint',
 'crisp',
 'drift',
 'spice']

By default, the keywords() function converts all words to lower case before processing, and only includes nouns, proper nouns and adjectives as candidate words. You can override these defaults with the named parameters filter_ and transform, which allow you to pass in functions that will be used to filter out tokens and transform them before counting (respectively). The functions will be passed a single parameter, which is the Token object from the spaCy document. For example, to use the lemmas of the words instead of the words themselves:

In [139]:
keywords(nlp, genesis_txt, n=10,
         transform=lambda x: x.lemma_)
Out[139]:
['god',
 'earth',
 '-PRON-',
 'firmament',
 'light',
 'fowl',
 'water',
 'heaven',
 'seed',
 'evening']

Or to find only the keywords that are adjectives with seven or more characters:

In [140]:
keywords(nlp, open("sonnets.txt").read(), n=20,
         filter_=lambda x: x.pos_ == 'ADJ' and len(x.text) >= 7)
Out[140]:
['beauteous',
 'fairest',
 'eternal',
 'gracious',
 'precious',
 'sweetest',
 'outward',
 'mistress',
 'strange',
 'antique',
 'forsworn',
 'contented',
 'virtuous',
 'heavenly',
 'sovereign',
 "imprison'd",
 "unfather'd",
 "confin'd",
 'unrespected',
 'outworn']
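
Since filter_ is just a function of one token, filters can also be built up from smaller pieces. Here's a sketch of that idea; pos_in, min_length and all_of are hypothetical helper names, and the stand-in objects only mimic the text and pos_ attributes of spaCy tokens so that the example runs without a model.

```python
from types import SimpleNamespace

def pos_in(*tags):
    """Build a filter that keeps tokens whose part of speech is one of tags."""
    return lambda token: token.pos_ in tags

def min_length(n):
    """Build a filter that keeps tokens with at least n characters."""
    return lambda token: len(token.text) >= n

def all_of(*filters):
    """Combine filters: a token passes only if every filter accepts it."""
    return lambda token: all(f(token) for f in filters)

long_adjectives = all_of(pos_in("ADJ"), min_length(7))

# Stand-ins with the same attribute names as spaCy tokens (illustrative only).
fake_tokens = [
    SimpleNamespace(text="beauteous", pos_="ADJ"),
    SimpleNamespace(text="fair", pos_="ADJ"),
    SimpleNamespace(text="mistress", pos_="NOUN"),
]
print([t.text for t in fake_tokens if long_adjectives(t)])  # ['beauteous']
```

With a model loaded, the combined filter could be passed along as keywords(nlp, text, filter_=long_adjectives).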

A comparison

I haven't tested this method of keyword extraction against any of the established benchmarks, but it seems to work pretty well in informal case studies (i.e., I think the results are pretty good). For the sake of comparison with other methods of keyword extraction, take a look at the results of using Gensim's keywords method (based on TextRank) on H.D.'s "Sea Rose":

In [148]:
from gensim.summarization import keywords as gensim_kw  # requires gensim < 4.0 (module removed in 4.0)
In [149]:
print(gensim_kw(hd_txt))
rose
fragrance
meagre

Or "The Road Not Taken":

In [150]:
print(gensim_kw(frost_txt))
equally
black
wear

These results are admittedly cherry-picked, but I do think they're illustrative of the advantages of the approach outlined above.