NLP concepts with spaCy

By Allison Parrish

“Natural Language Processing” is a field at the intersection of computer science, linguistics and artificial intelligence which aims to make the underlying structure of language available to computer programs for analysis and manipulation. It’s a vast and vibrant field with a long history! New research and techniques are being developed constantly.

The aim of this notebook is to introduce a few simple concepts and techniques from NLP—just the stuff that’ll help you do creative things quickly, and maybe open the door for you to understand more sophisticated NLP concepts that you might encounter elsewhere. This tutorial is written for Python 3.6+.

There are a number of libraries for performing natural language processing tasks in Python, but in this tutorial we'll be using one called spaCy, which is very powerful and easy for newcomers to understand. It's been among the most important tools in my text processing toolbox for many years!

Natural language

“Natural language” is a loaded phrase: what makes one stretch of language “natural” while another stretch is not? NLP techniques are opinionated about what language is and how it works; as a consequence, you’ll sometimes find yourself having to conceptualize your text with uncomfortable abstractions in order to make it work with NLP. (This is especially true of poetry, which almost by definition breaks most “conventional” definitions of how language behaves and how it’s structured.)

Of course, a computer can never really fully “understand” human language. Even when the text you’re using fits the abstractions of NLP perfectly, the results of NLP analysis are always going to be at least a little bit inaccurate. But often even inaccurate results can be “good enough”—and in any case, inaccurate output from NLP procedures can be an excellent source of the sublime and absurd juxtapositions that we (as poets) are constantly in search of.

Language support

Historically, most NLP researchers have focused their efforts on English specifically. But many natural language processing libraries now support a wide range of languages. You can find the full list of supported languages on the spaCy website, though the robustness of these models varies from one language to the next, as do the specifics of how each model works. (For example, different languages have different ideas about what a "part of speech" is.) The examples in this notebook are primarily in English. If you're having trouble applying these techniques to other languages, send me an e-mail—I'd be happy to help you figure out how to get things working for languages other than English!
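
For example, once you've installed spaCy and downloaded a model for your language (see "Installing spaCy" below), working in another language looks just like working in English. Here's a minimal sketch assuming the German model de_core_news_sm has been downloaded; the model name is just an example, and the setup steps are explained below.

import spacy

# load a German model (downloaded beforehand with:
#   python -m spacy download de_core_news_sm
# the model name here is just an example)
nlp_de = spacy.load('de_core_news_sm')
for token in nlp_de("Ich liebe die Gerichte aus der neuen Cafeteria."):
    print(token.text, token.pos_)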

English grammar: a crash course

The only thing I believe about English grammar is this:

"Oh yes, the sentence," Creeley once told the critic Burton Hatlen, "that's what we call it when we put someone in jail."

There is no such thing as a sentence, or a phrase, or a part of speech, or even a "word"—these are all pareidolic fantasies occasioned by glints of sunlight we see reflected on the surface of the ocean of language; fantasies that we comfort ourselves with when faced with language's infinite and unknowable variability.

Regardless, we may find it occasionally helpful to think about language using these abstractions. The following is a gross oversimplification of both how English grammar works, and how theories of English grammar work in the context of NLP. But it should be enough to get us going!

Sentences and parts of speech

English texts can roughly be divided into "sentences." Sentences are themselves composed of individual words, each of which has a function in expressing the meaning of the sentence. The function of a word in a sentence is called its "part of speech"—i.e., a word functions as a noun, a verb, an adjective, etc. Here's a sentence, with words marked for their part of speech:

I        really  love  entrees        from         the         new        cafeteria.
pronoun  adverb  verb  noun (plural)  preposition  determiner  adjective  noun

Of course, the "part of speech" of a word isn't a property of the word itself. We know this because a single "word" can function as two different parts of speech:

I love cheese.

The word "love" here is a verb. But here:

Love is a battlefield.

... it's a noun. For this reason (and others), it's difficult for computers to accurately determine the part of speech for a word in a sentence. (It's difficult sometimes even for humans to do this.) But NLP procedures do their best!

Phrases and larger syntactic structures

There are several different ways of talking about larger syntactic structures in sentences. The scheme used by spaCy is called a "dependency grammar." We'll talk about the details of this below.

Installing spaCy

Follow the instructions here. To install on Anaconda, you'll need to open a Terminal window (or the equivalent on your operating system) and type

conda install -c conda-forge spacy

If you're not using Anaconda, you can install with pip:

pip install spacy

This line installs the library. You'll also need to download a language model. For that, type:

python -m spacy download en_core_web_md

(Replace en_core_web_md with the name of the model you want to install. The spaCy documentation explains the difference between the various models. I suggest downloading at least the "medium" model, if it's available for the language you want to use.)
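
If you want to double-check that the model downloaded correctly and is compatible with your installed version of spaCy, you can run spaCy's validate command:

python -m spacy validate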

The language model contains machine learning models for splitting texts into sentences and words, tagging words with their parts of speech, identifying entities, and discovering the syntactic structure of sentences. The model also contains various bits of static data, including word vectors for some subset of the model's vocabulary. (See my word vector tutorial for an example of using spaCy to play around with word vectors.)

Basic usage

Import spacy like any other Python module:

In [1]:
import spacy

Create a new spaCy object using spacy.load('en_core_web_md'). (The name in the parentheses is the same as the name of the model you downloaded above. If you downloaded a different model, you can put its name here instead. You can also just write 'en' and spaCy will load the best model it has for that language.)

In [2]:
nlp = spacy.load('en_core_web_md')

It's more fun doing natural language processing on text that you're interested in. I recommend grabbing something from Project Gutenberg. Download a plain text file and put it in the same directory as this notebook, taking care to replace the filename in the cell below with the name of the file you downloaded.

In [56]:
# replace "84-0.txt" with the name of your own text file, then run this cell with CTRL+Enter.
text = open("84-0.txt").read()
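
If you'd rather fetch the file directly from Python instead of downloading it by hand, here's a minimal sketch using urllib from the standard library. (The URL below is just an example; replace it with the "Plain Text UTF-8" link for whichever book you chose.)

import urllib.request

# fetch a plain-text book straight from Project Gutenberg (example URL)
url = "https://www.gutenberg.org/files/84/84-0.txt"
text = urllib.request.urlopen(url).read().decode("utf-8")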

Now, use spaCy to parse it. (This might take a while, depending on the size of your text.)

In [57]:
doc = nlp(text)

Right off the bat, the spaCy library gives us access to a number of interesting units of text:

  • All of the sentences (doc.sents)
  • All of the words (doc)
  • All of the "named entities," like names of places, people, #brands, etc. (doc.ents)
  • All of the "noun chunks," i.e., nouns in the text plus surrounding matter like adjectives and articles

In the cell below, we extract these into variables so we can play around with them a little bit.

In [58]:
sentences = list(doc.sents)
words = [w for w in list(doc) if w.is_alpha]
noun_chunks = list(doc.noun_chunks)
entities = list(doc.ents)

With this information in hand, we can answer interesting questions like: how many sentences are in the text?

In [59]:
len(sentences)
Out[59]:
3873

Using random.sample(), we can get a small, randomly-selected sample from these lists. Here are five random sentences:

In [66]:
import random

for item in random.sample(sentences, 5):
    print(">", item.text.strip().replace("\n", " "))
> Surely in that moment I should have been possessed by frenzy and have destroyed my miserable existence but that my vow was heard and that I was reserved for vengeance.
> Years will pass, and you will have visitings of despair and yet be tortured by hope.
> This sound disturbed an old woman who was sleeping in a chair beside me.
> but I was quickly restored by the cold gale of the mountains.
> I never saw a man in so wretched a condition.

Ten random words:

In [61]:
for item in random.sample(words, 10):
    print(item.text)
father
in
has
had
magistrate
plentiful
we
be
friend
a

Ten random noun chunks:

In [62]:
for item in random.sample(noun_chunks, 10):
    print(item.text)
kings
any pursuit
my retreat
the place
my men
some shepherd
what new scene
the instruments
sister
it

Ten random entities:

In [67]:
for item in random.sample(entities, 10):
    print(item.text)
Chapter 18
Italy
September 2d
Greek
Justine
Pélissier
Cornelius Agrippa
Safie
Rotterdam
This morning

spaCy data types

Note that the values that spaCy returns belong to specific spaCy data types. You can read more about these data types in the spaCy documentation, in particular spans and tokens. (Spans represent sequences of tokens; a sentence in spaCy is a span, and a word is a token.) If you want a list of strings instead of a list of spaCy objects, use the .text attribute, which works for spans and tokens alike. For example:

In [80]:
sentence_strs = [item.text for item in doc.sents]
In [82]:
random.sample(sentence_strs, 10)
Out[82]:
["The Foundation's principal office is located at 4557",
 'In\nthe _Sorrows of Werter_, besides the interest of its simple and affecting\nstory, so many opinions are canvassed and so many lights thrown upon\nwhat had hitherto been to me obscure subjects that I found in it a\nnever-ending source of speculation and astonishment.  ',
 'It contained but two\nrooms, and these exhibited all the squalidness of the most miserable\npenury.  ',
 'I rushed towards\nthe window, and drawing a pistol from my bosom, fired; but he eluded me,\nleaped from his station, and running with the swiftness of lightning,\nplunged into the lake.\n\n',
 'And was I really as mad as the whole world would\nbelieve me to be if I disclosed the object of my suspicions?  ',
 'Our journey here lost the interest arising from beautiful scenery, but we\narrived in a few days at Rotterdam, whence we proceeded by sea to England.\n',
 'Once, after\nthe poor animals that conveyed me had with incredible toil gained the\nsummit of a sloping ice mountain, and one, sinking under his fatigue,\ndied, I viewed the expanse before me with anguish, when suddenly my eye\ncaught a dark speck upon the dusky plain.  ',
 'He approached; his countenance bespoke bitter anguish,\ncombined with disdain and malignity, while its unearthly ugliness\nrendered it almost too horrible for human eyes.  ',
 'Thus situated, employed in the most detestable occupation, immersed in\na solitude where nothing could for an instant call my attention from\nthe actual scene in which I was engaged, my spirits became unequal; I\ngrew restless and nervous.  ',
 'You have hope, and the world before you, and have no cause for\ndespair.']

Parts of speech

The spaCy parser allows us to check what part of speech a word belongs to. In the cell below, we create four different lists—nouns, verbs, adjs and advs—that contain only words of the specified parts of speech. (There's a full list of part of speech tags here).

In [68]:
nouns = [w for w in words if w.pos_ == "NOUN"]
verbs = [w for w in words if w.pos_ == "VERB"]
adjs = [w for w in words if w.pos_ == "ADJ"]
advs = [w for w in words if w.pos_ == "ADV"]

And now we can print out a random sample of any of these:

In [70]:
for item in random.sample(nouns, 20): # change "nouns" to "verbs" or "adjs" or "advs" to sample from those lists!
    print(item.text)
thoughts
board
breakfast
selfishness
perish
murderer
sound
steeples
hold
accomplishment
ocean
hands
signification
communication
iron
sleep
hair
hopes
society
occupations

Entity types

The parser in spaCy not only identifies "entities" but also assigns them to a particular type. See a full list of entity types here. Using this information, the following cell builds lists of the people, locations, and times mentioned in the text:

In [71]:
people = [e for e in entities if e.label_ == "PERSON"]
locations = [e for e in entities if e.label_ == "LOC"]
times = [e for e in entities if e.label_ == "TIME"]

And then you can print out a random sample:

In [72]:
for item in random.sample(times, 20): # change "times" to "people" or "locations" to sample those lists
    print(item.text.strip())
That hour
night
night
night
the next hour
the ensuing hours
evening
a single hour
night
night
a few minutes
the next morning
the hours and months
About two hours
the night
a few hours
several hours
the
morning
the next morning
midnight

Finding the most common

After we've parsed the text out into meaningful units, it might be interesting to see which examples of those units are the most common in a text.

One of the most common tasks in text analysis is counting how many times things occur in a text. The easiest way to do this in Python is with the Counter object, contained in the collections module. Run the following cell to create a Counter object to count your words.

In [76]:
from collections import Counter
word_count = Counter([w.text for w in words])

Once you've created the counter, you can check to see how many times any word occurs like so:

In [74]:
word_count['heaven']
Out[74]:
15

The Counter object's .most_common() method gives you access to a list of tuples with words and their counts, sorted in reverse order by count:

In [77]:
word_count.most_common(10)
Out[77]:
[('the', 4070),
 ('and', 3006),
 ('I', 2847),
 ('of', 2746),
 ('to', 2155),
 ('my', 1635),
 ('a', 1402),
 ('in', 1135),
 ('was', 1019),
 ('that', 1018)]

The code in the following cell prints this out nicely:

In [78]:
for word, count in word_count.most_common(20):
    print(word, count)
the 4070
and 3006
I 2847
of 2746
to 2155
my 1635
a 1402
in 1135
was 1019
that 1018
me 867
with 705
had 684
not 576
which 565
but 552
you 550
his 502
for 494
as 492

You'll note that the list of most frequent words here likely reflects the overall frequency of words in English. Consult my Quick and dirty keywords tutorial for some simple strategies for extracting words that are most unique to a text (rather than simply the most frequent words). You may also consider removing stop words from the list.
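
As a quick sketch of the stop word idea, the cell below rebuilds the word count while skipping anything spaCy marks as a stop word, using each token's .is_stop attribute. (This reuses the Counter import from above; it's just one simple way to do it.)

# count only the words spaCy does NOT consider stop words
content_word_count = Counter([w.text.lower() for w in words if not w.is_stop])
content_word_count.most_common(10)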

Writing to a file

You might want to export lists of words or other things that you make with spaCy to a file, so that you can bring them into other Python programs (or just other programs that form a part of your workflow). One way to do this is to write each item to a single line in a text file. The code in the following cell does exactly this for the word list that we just created:

In [83]:
with open("words.txt", "w") as fh:
    fh.write("\n".join([w.text for w in words]))

The following cell defines a function that performs this for any list of spaCy values you pass to it:

In [84]:
def save_spacy_list(filename, t):
    with open(filename, "w") as fh:
        fh.write("\n".join([item.text for item in t]))

Here's how to use it:

In [85]:
save_spacy_list("words.txt", words)

Since we're working with Counter objects a bunch in this notebook, it makes sense to find a way to save these as files too. The following cell defines a function for writing data from a Counter object to a file. The file is in "tab-separated values" format, which you can open using most spreadsheet programs. Execute it before you continue:

In [86]:
def save_counter_tsv(filename, counter, limit=1000):
    with open(filename, "w") as outfile:
        outfile.write("key\tvalue\n")
        # write out only the `limit` most common items
        for item, count in counter.most_common(limit):
            outfile.write(item.strip() + "\t" + str(count) + "\n")

Now, run the following cell. You'll end up with a file in the same directory as this notebook called 100_common_words.tsv that has two columns, one for the words and one for their associated counts:

In [87]:
save_counter_tsv("100_common_words.tsv", word_count, 100)

Try opening this file in Excel or Google Docs or Numbers!

If you want to write the data from another Counter object to a file:

  • Change the filename to whatever you want (though you should probably keep the .tsv extension)
  • Replace word_count with the name of any of the other Counter objects we've made in this notebook
  • Change the number to the number of rows you want to include in your spreadsheet.

When do things happen in this text?

Here's another example. Using the times entities, we can make a spreadsheet of how often particular "times" (durations, times of day, etc.) are mentioned in the text.

In [88]:
time_counter = Counter([e.text.lower().strip() for e in times])
save_counter_tsv("time_count.tsv", time_counter, 100)

Do the same thing, but with people:

In [89]:
people_counter = Counter([e.text.lower() for e in people])
save_counter_tsv("people_count.tsv", people_counter, 100)

More about words

The list of words that we made above is actually a list of spaCy Token objects, which have several interesting attributes. The .text attribute gives the text of the word (as a Python string), and the .lemma_ attribute gives the word's "lemma" (explained below):

In [91]:
for word in random.sample(words, 50):
    print(word.text, "→", word.lemma_)
Sometimes → sometimes
university → university
that → that
pursue → pursue
But → but
England → England
an → an
and → and
revolved → revolve
over → over
the → the
as → as
violently → violently
the → the
if → if
surrounded → surround
My → -PRON-
for → for
more → more
a → a
the → the
far → far
Happy → happy
a → a
where → where
to → to
with → with
and → and
rules → rule
said → say
I → -PRON-
be → be
is → be
thyself → thyself
greater → great
consider → consider
the → the
had → have
moon → moon
walked → walk
made → make
America → America
food → food
alluring → alluring
insurmountable → insurmountable
letter → letter
was → be
of → of
sight → sight
it → -PRON-

A word's "lemma" is its most "basic" form, the form without any morphology applied to it. "Sing," "sang," "singing," are all different "forms" of the lemma sing. Likewise, "octopi" is the plural of "octopus"; the "lemma" of "octopi" is octopus.

"Lemmatizing" a text is the process of going through the text and replacing each word with its lemma. This is often done in an attempt to reduce a text to its most "essential" meaning, by eliminating pesky things like verb tense and noun number.

Individual sentences can also be iterated over to get a list of words in that sentence:

In [92]:
sentence = random.choice(sentences)
for word in sentence:
    print(word.text)
Little
did
I
then
expect
the
calamity
that


was
in
a
few
moments
to
overwhelm
me
and
extinguish
in
horror
and
despair


all
fear
of
ignominy
or
death
.



Parts of speech

Token objects are tagged with their part of speech. The pos_ attribute gives a general part of speech; the tag_ attribute gives a more specific designation. (A list of what these tags mean can be found here.) We used this attribute earlier in the notebook to extract lists of words that had particular parts of speech, but you can access the attribute in other contexts as well:

In [144]:
for item in random.sample(words, 24):
    print(item.text, "/", item.pos_, "/", item.tag_)
remained / VERB / VBD
well / ADV / RB
tale / NOUN / NN
is / AUX / VBZ
the / DET / DT
you / PRON / PRP
in / ADP / IN
will / VERB / MD
General / PROPN / NNP
words / NOUN / NNS
and / CCONJ / CC
returned / VERB / VBD
strange / ADJ / JJ
His / DET / PRP$
superior / ADJ / JJ
state / NOUN / NN
are / AUX / VBP
moment / NOUN / NN
spot / NOUN / NN
there / PRON / EX
them / PRON / PRP
during / ADP / IN
you / PRON / PRP
and / CCONJ / CC

Specific verb forms with .tag_

The .pos_ attribute only gives us general information about the part of speech. The .tag_ attribute allows us to be more specific about the kinds of verbs we want. For example, this code gives us only the verbs in past participle form:

In [96]:
only_past = [item.text for item in doc if item.tag_ == 'VBN']
In [97]:
random.sample(only_past, 12)
Out[97]:
['included',
 'engaged',
 'appeased',
 'bestowed',
 'been',
 'been',
 'rekindled',
 'separated',
 'given',
 'condemned',
 'taught',
 'snatched']

Larger syntactic units

Okay, so we can get individual words and small phrases, like named entities and noun chunks. Great! But what if we want larger chunks, based on their syntactic role in the sentence? For this, we'll need to learn about how spaCy parses sentences into its syntactic components.

Understanding dependency grammars

The spaCy library parses the underlying sentences using a dependency grammar. Dependency grammars look different from the kinds of sentence diagramming you may have done in high school, and even from tree-based phrase structure grammars commonly used in descriptive linguistics. The idea of a dependency grammar is that every word in a sentence is a "dependent" of some other word, which is that word's "head." Those "head" words are in turn dependents of other words. The finite verb in the sentence is the ultimate "head" of the sentence, and is not itself dependent on any other word. (The dependents of a particular head are sometimes called its "children.")

The question of how to know what constitutes a "head" and a "dependent" is complicated. As a starting point, here's a passage from Dependency Grammar and Dependency Parsing:

Here are some of the criteria that have been proposed for identifying a syntactic relation between a head H and a dependent D in a construction C (Zwicky, 1985; Hudson, 1990):

  1. H determines the syntactic category of C and can often replace C.
  2. H determines the semantic category of C; D gives semantic specification.
  3. H is obligatory; D may be optional.
  4. H selects D and determines whether D is obligatory or optional.
  5. The form of D depends on H (agreement or government).
  6. The linear position of D is specified with reference to H.

Dependents are related to their heads by a syntactic relation. The name of the syntactic relation describes the relationship between the head and the dependent. You can use spaCy's displaCy visualizer to see how a particular sentence is parsed, and what the relations between the heads and dependents are.
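
Here's a minimal sketch of rendering a dependency parse inline in the notebook with displaCy, reusing the example sentence from earlier. (The rendered diagram isn't reproduced here.)

from spacy import displacy

# draw the dependency parse of a short sentence right in the notebook
displacy.render(nlp("I really love entrees from the new cafeteria."), style="dep", jupyter=True)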

Every token object in a spaCy document or sentence has attributes that tell you what the word's head is, what the dependency relationship is between that word and its head, and a list of that word's children (dependents). The following code prints out each word in the sentence, the tag, the word's head, the word's dependency relation with its head, and the word's children (i.e., dependent words). (This code isn't especially useful on its own; it's just here to help show you how this functionality works.)

In [112]:
sent = random.choice(sentences)
print("Original sentence:", sent.text.replace("\n", " "))
for word in sent:
    print()
    print("Word:", word.text)
    print("Tag:", word.tag_)
    print("Head:", word.head.text)
    print("Dependency relation:", word.dep_)
    print("Children:", list(word.children))
Original sentence: When I reflected on his crimes and malice, my hatred and revenge burst all bounds of moderation.

Word: When
Tag: WRB
Head: reflected
Dependency relation: advmod
Children: []

Word: I
Tag: PRP
Head: reflected
Dependency relation: nsubj
Children: [
]

Word: 

Tag: _SP
Head: I
Dependency relation: 
Children: []

Word: reflected
Tag: VBD
Head: burst
Dependency relation: advcl
Children: [When, I, on]

Word: on
Tag: IN
Head: reflected
Dependency relation: prep
Children: [crimes]

Word: his
Tag: PRP$
Head: crimes
Dependency relation: poss
Children: []

Word: crimes
Tag: NNS
Head: on
Dependency relation: pobj
Children: [his, and, malice]

Word: and
Tag: CC
Head: crimes
Dependency relation: cc
Children: []

Word: malice
Tag: NN
Head: crimes
Dependency relation: conj
Children: []

Word: ,
Tag: ,
Head: burst
Dependency relation: punct
Children: []

Word: my
Tag: PRP$
Head: hatred
Dependency relation: poss
Children: []

Word: hatred
Tag: NN
Head: burst
Dependency relation: nsubj
Children: [my, and, revenge]

Word: and
Tag: CC
Head: hatred
Dependency relation: cc
Children: []

Word: revenge
Tag: NN
Head: hatred
Dependency relation: conj
Children: []

Word: burst
Tag: VBD
Head: burst
Dependency relation: ROOT
Children: [reflected, ,, hatred, bounds, .]

Word: all
Tag: DT
Head: bounds
Dependency relation: det
Children: []

Word: bounds
Tag: NNS
Head: burst
Dependency relation: dobj
Children: [all, 
, of]

Word: 

Tag: _SP
Head: bounds
Dependency relation: 
Children: []

Word: of
Tag: IN
Head: bounds
Dependency relation: prep
Children: [moderation]

Word: moderation
Tag: NN
Head: of
Dependency relation: pobj
Children: []

Word: .
Tag: .
Head: burst
Dependency relation: punct
Children: []

Here's a list of a few dependency relations and what they mean, with a quick example of putting one of them to use right after the list. (A more complete list can be found here.)

  • nsubj: this word's head is a verb, and this word is itself the subject of the verb
  • nsubjpass: same as above, but for subjects in sentences in the passive voice
  • dobj: this word's head is a verb, and this word is itself the direct object of the verb
  • iobj: same as above, but indirect object
  • aux: this word's head is a verb, and this word is an "auxiliary" verb (like "have", "will", "be")
  • attr: this word's head is a copula (like "to be"), and this is the description attributed to the subject of the sentence (e.g., in "This product is a global brand", brand is dependent on is with the attr dependency relation)
  • det: this word's head is a noun, and this word is a determiner of that noun (like "the," "this," etc.)
  • amod: this word's head is a noun, and this word is an adjective describing that noun
  • prep: this word is a preposition that modifies its head
  • pobj: this word is a dependent (object) of a preposition
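
Here's the quick example mentioned above: every token's .dep_ attribute names its relation and .head points at its head, so one simple sketch is to collect adjective/noun pairs by looking for amod dependents. (The next section shows a more powerful approach using subtrees.)

# collect (adjective, noun) pairs: words whose relation to their head is "amod"
adj_noun_pairs = [(w.text, w.head.text) for w in doc if w.dep_ == 'amod']
random.sample(adj_noun_pairs, 10)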

Using .subtree for extracting syntactic units

That's all pretty abstract, so let's get a bit more concrete and write some code that will let us extract syntactic units based on their dependency relation. There are a couple of things we need in order to do this. Every token has a .subtree attribute, which evaluates to a generator that can be flattened by passing it to list(). This gives the word itself along with all of its syntactic descendants—essentially, the phrase or "clause" headed by that word.

This function merges a subtree and returns a string with the text of the words contained in it:

In [113]:
def flatten_subtree(st):
    return ''.join([w.text_with_ws for w in list(st)]).strip()

With this function in our toolbox, we can write a loop that prints out the subtree for each word in a sentence. (Again, this code is just here to demonstrate what the process of grabbing subtrees looks like—it doesn't do anything useful yet!)

In [121]:
sent = random.choice(sentences)
print("Original sentence:", sent.text.replace("\n", " "))
for word in sent:
    print()
    print("Word:", word.text.replace("\n", " "))
    print("Flattened subtree: ", flatten_subtree(word.subtree).replace("\n", " "))
Original sentence: But the fresh air and bright sun seldom failed to restore me to some degree of composure, and on my return I met the salutations of my friends with a readier smile and a more cheerful heart.  

Word: But
Flattened subtree:  But

Word: the
Flattened subtree:  the

Word: fresh
Flattened subtree:  fresh

Word: air
Flattened subtree:  the fresh air and

Word: and
Flattened subtree:  and

Word:  
Flattened subtree:  

Word: bright
Flattened subtree:  bright

Word: sun
Flattened subtree:  bright sun

Word: seldom
Flattened subtree:  seldom

Word: failed
Flattened subtree:  But the fresh air and bright sun seldom failed to restore me to some degree of composure, and on my return I met the salutations of my friends with a readier smile and a more cheerful heart.

Word: to
Flattened subtree:  to

Word: restore
Flattened subtree:  to restore me to some degree of composure

Word: me
Flattened subtree:  me

Word: to
Flattened subtree:  to some degree of composure

Word: some
Flattened subtree:  some

Word: degree
Flattened subtree:  some degree of composure

Word: of
Flattened subtree:  of composure

Word: composure
Flattened subtree:  composure

Word: ,
Flattened subtree:  ,

Word: and
Flattened subtree:  and

Word:  
Flattened subtree:  

Word: on
Flattened subtree:  on my return

Word: my
Flattened subtree:  my

Word: return
Flattened subtree:  my return

Word: I
Flattened subtree:  I

Word: met
Flattened subtree:  on my return I met the salutations of my friends with a readier smile and a more cheerful heart.

Word: the
Flattened subtree:  the

Word: salutations
Flattened subtree:  the salutations of my friends

Word: of
Flattened subtree:  of my friends

Word: my
Flattened subtree:  my

Word: friends
Flattened subtree:  my friends

Word: with
Flattened subtree:  with a readier smile and a more cheerful heart

Word: a
Flattened subtree:  a

Word: readier
Flattened subtree:  readier

Word: smile
Flattened subtree:  a readier smile and a more cheerful heart

Word:  
Flattened subtree:  

Word: and
Flattened subtree:  and

Word: a
Flattened subtree:  a

Word: more
Flattened subtree:  more

Word: cheerful
Flattened subtree:  more cheerful

Word: heart
Flattened subtree:  a more cheerful heart

Word: .
Flattened subtree:  .

Word:   
Flattened subtree:  

Using the subtree and our knowledge of dependency relation types, we can write code that extracts larger syntactic units based on their relationship with the rest of the sentence. For example, to get all of the noun phrases that are subjects of a verb:

In [122]:
subjects = []
for word in doc:
    if word.dep_ in ('nsubj', 'nsubjpass'):
        subjects.append(flatten_subtree(word.subtree))
In [125]:
random.sample(subjects, 12)
Out[125]:
['The path',
 'which',
 'He',
 'that',
 'I',
 'Wandering spirits',
 'we',
 'I',
 'It',
 'this',
 'I',
 'rage and hatred']

Or every prepositional phrase:

In [128]:
prep_phrases = []
for word in doc:
    if word.dep_ == 'prep':
        prep_phrases.append(flatten_subtree(word.subtree).replace("\n", " "))
In [131]:
random.sample(prep_phrases, 12)
Out[131]:
['on a sudden',
 'even in the excess of misery',
 'with',
 'of feeling',
 'of my soul',
 'into my hovel',
 'to you',
 'of the river',
 'of your country',
 'in a singular manner',
 'of my friend',
 'with the joy']

Generating text from extracted units

One thing I like to do is put together text from parts we've disarticulated with spaCy. Let's use Tracery to do this. If you don't know how to use Tracery, feel free to consult my Tracery tutorial before continuing.

So I want to generate sentences based on things that I've extracted from my text. My first idea: get subjects of sentences, verbs of sentences, nouns and adjectives, and prepositional phrases:

In [193]:
subjects = [flatten_subtree(word.subtree).replace("\n", " ")
            for word in doc if word.dep_ in ('nsubj', 'nsubjpass')]
past_tense_verbs = [word.text for word in words if word.tag_ == 'VBD' and word.lemma_ != 'be']
adjectives = [word.text for word in words if word.tag_.startswith('JJ')]
nouns = [word.text for word in words if word.tag_.startswith('NN')]
prep_phrases = [flatten_subtree(word.subtree).replace("\n", " ")
                for word in doc if word.dep_ == 'prep']

Notes on the code above:

  • The .replace("\n", " ") is in there because spaCy treats linebreaks as normal whitespace, and retains them when we ask for the span's text. For formatting reasons, we want to get rid of this.
  • I'm using .startswith() in the checks for parts of speech in order to capture other related parts of speech (e.g., JJR is comparative adjectives, NNS is plural nouns).
  • I use only past tense verbs so we don't have to worry about subject/verb agreement in English. I'm excluding forms of to be because it is the only verb that agrees with its subject in the past tense.

Now I'll import Tracery...

In [194]:
import tracery
from tracery.modifiers import base_english

... and define a grammar. The "trick" of this example is that I grab entire rule expansions from the units extracted from the text using spaCy. The grammar itself is built around producing sentences that look and feel like English.

In [227]:
rules = {
    "origin": [
        "#subject.capitalize# #predicate#.",
        "#subject.capitalize# #predicate#.",
        "#prepphrase.capitalize#, #subject# #predicate#."
    ],
    "predicate": [
        "#verb#",
        "#verb# #nounphrase#",
        "#verb# #prepphrase#"
    ],
    "nounphrase": [
        "the #noun#",
        "the #adj# #noun#",
        "the #noun# #prepphrase#",
        "the #noun# and the #noun#",
        "#noun.a#",
        "#adj.a# #noun#",
        "the #noun# that #predicate#"
    ],
    "subject": subjects,
    "verb": past_tense_verbs,
    "noun": nouns,
    "adj": adjectives,
    "prepphrase": prep_phrases
}
grammar = tracery.Grammar(rules)
grammar.add_modifiers(base_english)
grammar.flatten("#origin#")
Out[227]:
'Your family sought the scenes and the dream.'

Let's generate a whole paragraph of this and format it nicely:

In [236]:
from textwrap import fill
output = " ".join([grammar.flatten("#origin#") for i in range(12)])
print(fill(output, 60))
He approached the length and the fiend. I declined with the
greatest tenderness. He solicited the fall that reflected a
Frenchwoman. The earth flitted. Of those mighty friends, you
went. Of your ship, You saw. I uttered. They shed the old
Clerval. I reflected the children. The dashing waves seemed
above all the rest. To your destruction and infallible
misery, I read the asleep Agatha. Of her, I brought an
ignorant union.

I like this approach for a number of reasons. Because I'm using a hand-written grammar, I have a great deal of control over the shape and rhythm of the sentences that are generated. But spaCy lets me pre-populate my grammar's vocabulary without having to write each item by hand.

Further reading and resources

We've barely scratched the surface of what it's possible to do with spaCy. There's a great page of tutorials on the official site that you should check out!