Natural Language Processing (NLP) with Python

Who am I?

  • Jon Ashley
  • Business & Empirical Research Librarian at the Univ. of Virginia Law Library
  • [email protected]

Outline for this workshop:

  1. Intro to spaCy
  2. Deep dive into spaCy
    • tokenization
    • sentence detection
    • entity recognition
    • stemming / lemmatization
    • part of speech tagging
    • shape analysis
    • entity types
    • token frequency
    • token categorization
  3. Topic modeling with LDA - overview
  4. Topic modeling with LDA - intro to gensim & example

SpaCy provides a number of functions for describing text. Let's fire it up and get started.

In [ ]:
# Set up spaCy

from spacy.en import English
parser = English()

# "parser" loads a set of functions that we can apply to text.  
# Documentation and examples often use "nlp" instead "parser."

Tokenization

Since computers don't process language in the same way humans do, we break our corpus down into its basic parts, or "tokens." For example, "My dog has fleas." becomes:

"My", "dog", "has", "fleas", "."

SpaCy provides ways of describing the properties of these tokens and this is what we'll spend our time on first. Here's an example:

In [ ]:
sample_text = parser("Haikus are easy " \
                     "but sometimes they don't make sense. " \
                     "Refrigerator.")

for token in sample_text:
    print(token)

Notice how it split the contraction "don't" into two tokens, "do" and "n't"?
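
Each token also remembers where it came from: token.idx gives the character offset of the token in the original string, which is handy for mapping tokens back to the source text. A quick sketch on the same sample_text:

In [ ]:
# character offset of each token in the original string
for token in sample_text:
    print(token.idx, '\t', token)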

Stemming & lemmatization

Many words can be further reduced to their primary forms (their "lemma"). Some examples:

am, are, is -> be

brother, brothers, brother's, brothers' -> brother

runs, running, ran -> run

This becomes especially useful later for topic modeling. In the meantime, here's an example:

In [ ]:
sample_text = parser("My cats are eating beans and growling.")

for token in sample_text:
    print(token.lemma_)

Let's look at some other properties spaCy returns.

Entity Recognition

spaCy can also identify entities in text (".ents"). For example:

In [ ]:
sample_text = parser("Fred Smith was born in El Segundo, California and spent four years at The Putney School " \
                     "in Vermont before joining the French Foreign Legion where he was stationed " \
                     "in Djibouti protecting a UNICEF distribution center for the knowledge-hungry children of " \
                     "Tadjoura. He is the world hero of Malagagaga, an Earth-class 4 planet in the Zeta " \
                     "Reticuli star system, with a secret base located in the Red Sea.  He is paid $30,000 a year" \
                     "in fish for his efforts.")

for entity in sample_text.ents:
    print(entity.label_, '\t', entity)

The spaCy documentation (see "Resources" section at the end of this notebook) lists the kinds of entities recognized and their abbreviations.
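
One way to get a feel for which entity types dominate a text is to tally the labels. A minimal sketch using Python's collections.Counter on the sample_text parsed above (the Counter import is the only thing new here):

In [ ]:
# tally how often each entity type appears in the sample above
from collections import Counter

entity_counts = Counter(entity.label_ for entity in sample_text.ents)

for label, count in entity_counts.most_common():
    print(label, '\t', count)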

Part of speech tagging

spaCy identifies a token's part of speech and is designed to handle messy data. Let's test it.

In [ ]:
# part of speech tagging

sample_text = parser("The quick brown fox jumps over the lazy dog.")

for token in sample_text:
    print(token.pos_, '\t', token)

Let's test this a little bit more.

In [ ]:
# spaCy tries to handle messy data as well
# This is from Far East Movement's summer 2010 hit "Like a G6"

sample_text = parser("Popping bottles in the ice, like a blizzard " \
                     "When we drink we do it right gettin' slizzard " \
                     "Sippin sizzurp in my ride, like Three 6. " \
                     "Now I'm feeling so fly like a G6.")

for token in sample_text:
    print(token.pos_, '\t', token)
    
print("-----------------------------------")

# abbreviations and text-speak
sample_text = parser("BTW Susie and Timmy are a rad cpl.")

for token in sample_text:
    print(token.pos_, '\t', token)

Additional token properties

The above covers some of the major token-level properties, but there are more:

  • is the token a stopword?
  • is the token whitespace?
  • is the token a number?
  • is the token part of spaCy's default vocabulary?

SpaCy can also determine the following:

  • prefix
  • suffix
  • shape

Yet more options can be found in spaCy's documentation (see "Resources" below). The prefix, suffix, and shape attributes get a short sketch after the next cell.

Some examples:

In [ ]:
# sample text from the musical "Oliver!"

sample_text = parser("Food, glorious food, " \
                    "Hot sausage and mustard! " \
                    "While we're in the mood " \
                    "Cold jelly and custard! " \
                    "Peas pudding and saveloys " \
                    "What next is the question? " \
                    "Rich gentlemen have it, boys " \
                    "in digestion! 42. ")

lemma = []
stopword = []
punctuation = []
number = []
out_of_vocab = []

for token in sample_text:
    lemma.append(token.lemma_)
    if token.is_stop:
        stopword.append(token)
    elif token.is_punct:
        punctuation.append(token)
    elif token.like_num:
        number.append(token)
    elif token.is_oov:
        out_of_vocab.append(token)
        
# print one of the categories; swap in any of the other lists to inspect them
for token in stopword:
    print(token)
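
The prefix, suffix, and shape attributes listed above aren't exercised in the loop; here's a quick look at what they return for the first few tokens of the same sample_text (the slice is only there to keep the output short):

In [ ]:
# shape, prefix, and suffix for the first ten tokens
for token in sample_text[:10]:
    print(token.shape_, '\t', token.prefix_, '\t', token.suffix_, '\t', token)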

Sentence Detection

spaCy can also recognize sentences in your parsed text. Here's an example from Lewis Carroll's poem Jabberwocky:

In [ ]:
sample_text = parser("One, two! One, two! And through and through " \
                     "The vorpal blade went snicker-snack! " \
                     "He left it dead, and with its head " \
                     "He went galumphing back.")

for sent in sample_text.sents:
    print(sent)
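
The outline also promises token frequency. A minimal sketch using Python's collections.Counter over the lemmas of the Jabberwocky sample, skipping punctuation and whitespace:

In [ ]:
# tally lemma frequencies across the sample
from collections import Counter

lemma_counts = Counter(token.lemma_ for token in sample_text
                       if not token.is_punct and not token.is_space)

print(lemma_counts.most_common(5))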

Topic Modeling with Latent Dirichlet Allocation (LDA)

Latent Dirichlet Allocation (LDA) is an unsupervised machine learning technique used to infer latent topics within a corpus. Among the many topic modeling techniques, it's one of the most widely used, which is why we'll focus on it here.

The steps in developing an LDA topic model are:

  1. Pre-process your text.

  2. Create a dictionary of all the words across the documents in your corpus. We'll use another Python package, gensim, to handle this.

  3. Create a "bag-of-words" representation for each text in your corpus.

  4. Set the parameters for your model.

  5. Run the model.

  6. Analyze results.

We'll walk through each of these steps with a small toy corpus:

In [ ]:
sample_text = parser("Cheese and salami are my favorite snacks, especially salami." \
                     "I like cheese and salami on my pizza." \
                     "Did I tell you I like cheese and salami? " \
                     "My brother has a really ugly dog." \
                     "I think mutts make the best dogs. " \
                     "Sometimes my dog eats cheese. " \
                     "My sister says salami smells like dog's breath.")

Pre-process text

"Garbage in, garbage out" is a popular phrase in computing and data science and it's no less true for topic modeling. The following pre-processing steps help avoid returning garbage from your model:

  1. Tokenize - reduce text to its core components.
  2. Lemmatize - reduce text even further by reducing words to their primary form.
  3. Strip out stopwords, punctuation, and whitespace - words like "the," "a," or "and" don't add meaning for most topic models and are usually stripped out. Punctuation and whitespace are treated similarly.

Every corpus is different, and there may be additional stopwords to include depending on the subject domain of your documents. Medical literature is different from legal texts, which are different from mathematics articles, which are different from 18th-century British poetry, and so on. In short, topic modeling is not a substitute for knowing your corpus, or at least having some domain knowledge. We'll sketch one way to layer in your own stopwords after looking at spaCy's defaults.

We've encountered some of these before but let's look at spaCy's stopwords:

In [ ]:
stopwords = []

from spacy import en

for word in en.STOPWORDS:
    stopwords.append(word)
   
stopwords.sort()

print(stopwords)
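
If your corpus needs domain-specific stopwords, one simple approach is to keep your own set alongside spaCy's defaults and filter against it. A minimal sketch; the extra terms and the sample sentence here are purely illustrative:

In [ ]:
# made-up example: layer domain-specific stopwords on top of spaCy's defaults
custom_stopwords = set(en.STOPWORDS) | {"plaintiff", "defendant", "court"}

sample = parser("The plaintiff asked the court to dismiss the case.")
print([token for token in sample if str(token).lower() not in custom_stopwords])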

Now let's lemmatize our text and remove any stopwords, punctuation, and whitespace.

In [ ]:
def remove_text_junk(token):
    """
    flag tokens that are punctuation, whitespace, or stopwords so they can be filtered out
    """
    
    return token.is_punct or token.is_space or token.is_stop
In [ ]:
parsed_sentences = []

for sentence in sample_text.sents:
    parsed_sentence = [token.lemma_ for token in sentence if not remove_text_junk(token)]
    parsed_sentences.append(parsed_sentence)

for sent in parsed_sentences:
    print(sent)

Create a dictionary

Here we switch to gensim to develop our topic model. First step, create a dictionary.

In [ ]:
from gensim.models import LdaModel
from gensim.corpora import Dictionary

sample_dictionary = Dictionary(parsed_sentences)

print(sample_dictionary)
print('\n', '--------------------', '\n')
print(sample_dictionary.token2id)

Create a bag-of-words

Gensim's doc2bow() converts each tokenized document into a sparse list of (token_id, token_count) pairs keyed to the dictionary we just built.

In [ ]:
sample_bow = [sample_dictionary.doc2bow(text) for text in parsed_sentences]

print(sample_bow)
In [ ]:
# another example to illustrate how everything ties together

print(parsed_sentences[0])
print('\n-------------\n')
print(sample_dictionary.token2id)
print('\n-------------\n')
print(sample_bow[0])

Set parameters and run the model.

Now we load up our dictionary and bag of words to generate our LDA topic model.

Topic modeling with LDA can take a long time, depending on how much text you're processing. In addition to the key parameters, gensim's LdaMulticore class provides a way to use additional cores (if you have them) to speed things along; a sketch of that call follows the next cell. More passes produce a more reliable model but also require more time. For this tiny demonstration corpus 5,000 passes is fine, but for a real application handling thousands or millions of documents it would be an unwise choice.

In [ ]:
# parameters to set here can include: your bag-of-words, number of topics (num_topics),
# your dictionary, number of passes (passes)

lda = LdaModel(sample_bow, num_topics=2, id2word=sample_dictionary, passes=5000)
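
For reference, the multicore variant mentioned above takes the same arguments plus a workers parameter. A minimal sketch (the lda_mc name and workers=4 are just illustrative; we use LdaMulticore for real on the bigger corpus later in this notebook):

In [ ]:
# the same tiny model built with LdaMulticore, spreading passes across worker processes
from gensim.models import LdaMulticore

lda_mc = LdaMulticore(sample_bow, num_topics=2, id2word=sample_dictionary, passes=5000, workers=4)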
In [ ]:
def explore_topic(topic_number, topn=3):
    """
    given a topic number, print a formatted list of the top terms
    """

    print('{:15} {}'.format('term', 'freq') + '\n')

    for term, freq in lda.show_topic(topic_number, topn=topn):
        print('{:15} {}'.format(term, round(freq, 4)))
In [ ]:
print("        Topic 0", '\n')
explore_topic(topic_number=0)
print('\n',"---------------------------", '\n')
print("        Topic 1", '\n')
explore_topic(topic_number=1)

One topic seems to be about food and the other about dogs.

Under LDA, each document is a mixture of topics. Let's look at the topic frequencies for a few sample documents.

In [ ]:
def get_sample_sentence(sentence_number):
    """
    retrieve the sentence at the given index from our sample text and return it
    """
    sentence_list = []
    
    for sentence in sample_text.sents:
        sentence_list.append(sentence)
        
    return sentence_list[sentence_number]
In [ ]:
def lda_description(sample_text, min_topic_freq=0.05):
    """
    parse a sentence, run it through the LDA model, and print the topics whose
    frequency meets min_topic_freq
    """

    # convert parsed sentence back to a normal string
    sample_text = str(sample_text)
    
    # parse sentence text with spaCy
    parsed_sentence = parser(sample_text)
    
    # lemmatize text and remove punctuation, whitespace, stopwords
    lemma_sentence = [token.lemma_ for token in parsed_sentence
                      if not remove_text_junk(token)]
    
    # create a bag-of-words representation
    sentence_bow = sample_dictionary.doc2bow(lemma_sentence)
    
    # create an LDA representation
    sentence_lda = lda[sentence_bow]
    
    # sort topics by frequency, highest first
    sentence_lda = sorted(sentence_lda, key=lambda topic: -topic[1])

    for topic_number, freq in sentence_lda:

        if freq < min_topic_freq:
            break
        else:
            print(topic_number, freq)
In [ ]:
sample_sentence = get_sample_sentence(3)
print(sample_sentence)
print('\n')
# find topic frequencies for this sentence/document
lda_description(sample_sentence)

Again. Bigger.

Let's do this all again, this time with a real example that may have some useful application.

We'll pull all 4th Circuit Court of Appeals cases from CourtListener, a service that provides free bulk access to the text of court opinions. If you're not familiar with the organization of the U.S. judicial system: federal cases typically begin in one of 94 federal district courts. If a case is appealed, it is heard by one of the 12 regional courts of appeals. If appealed again, it may be selected to be heard by the Supreme Court (but probably not). We're going to model the opinions of the Fourth Circuit Court of Appeals, which happens to include Virginia. Still confused? This may help: http://www.uscourts.gov/about-federal-courts/court-role-and-structure

All 4th Circuit cases can be pulled from here:

https://www.courtlistener.com/api/bulk-data/opinions/ca4.tar.gz

After unpacking we have a new directory, "ca4", and 123,580 cases (2.56GB) to explore. Each case is its own JSON file and the opinion text appears in the "plain_text" attribute. Here's an example:

In [ ]:
with open('ca4/71.json', 'r') as infile:
    for line in infile:
        print(line)
In [ ]:
%%time

import json
import os
import re

ca4_dir = '/Users/overlord_of_fresh/Desktop/spaCy_workshop_20161025/ca4/'
all_opinions_file = 'all_opinions.txt'

if 0 == 1:  # this can take a while

    with open(all_opinions_file, 'w') as outfile:

        # loop through files in ca4 & write out "plain_text" attributes to a new file
        for file in os.listdir(ca4_dir):

            with open(os.path.join(ca4_dir, file), 'r') as json_file:
                parsed_json = json.load(json_file)

            opinion = parsed_json['plain_text']
            if opinion != "":

                # replace line feeds with single spaces
                opinion = re.sub('\n', ' ', opinion)

                # collapse runs of whitespace into single spaces and strip leading whitespace
                opinion = re.sub(r'\s+', ' ', opinion).lstrip() + '\n'

                # rejoin words that were hyphenated across line breaks
                opinion = re.sub('- ', '', opinion)

                outfile.write(opinion)

A little more parsing, this time to remove some common words.

In [ ]:
%%time

parsed_opinions_file = 'parsed_opinions.txt'

if 0 == 1:  # this can take a while

    # some other stopwords to get rid of - there must be a better way
    # (compile the pattern once, outside the loop)
    pattern = re.compile(r'a\/k\/a|fourth|versus|united|u\.s\.|states|v\.|’s|defendant|plaintiff|government|§|u\.s\.c\.|america|unpublished|judge|attorney|publish|per curiam|district|opinion.*? |court|appeal.*? |circuit|appell.*? |', flags=re.I)

    with open(parsed_opinions_file, 'w') as outfile:
        with open(all_opinions_file, 'r') as infile:
            for opinion in infile:
                opinion = re.sub(pattern, '', opinion)
                outfile.write(opinion)
In [ ]:
with open(parsed_opinions_file, 'r') as infile:
    first_line = infile.readline()
    print(first_line)

Let's lemmatize all the opinions in the parsed_opinions_file, filtering out the junk tokens as we go.

In [ ]:
def remove_hyphen(token):
    """flag tokens that contain a hyphen"""
    return "-" in str(token)

def remove_text_junk(token):
    """flag tokens that are punctuation, whitespace, stopwords, numbers, or hyphenated"""
    return token.is_punct or token.is_space or token.is_stop or token.like_num or remove_hyphen(token)

spaCy's parser.pipe() streams texts through the pipeline in batches and can use multiple threads. This should help speed things up.

In [ ]:
%%time

lemma_file = 'lemma_opinions.txt'

if 0 == 1:  # this can take a while
    
    with open(lemma_file, 'w') as outfile:
        with open(parsed_opinions_file, 'r') as infile:
            for i, opinion in enumerate(parser.pipe(infile, batch_size=5000, n_threads=4)):
                lemma_line = ' '.join([token.lemma_ for token in opinion
                                      if not remove_text_junk(token)])
                outfile.write(lemma_line + '\n')
In [ ]:
with open(lemma_file, 'r') as infile:
    first_line = infile.readline()
    print(first_line)

Now that we have our text roughly in the shape we want, we create a dictionary and a bag-of-words representation so that we can build a topic model.

In [ ]:
%%time

from gensim.corpora import MmCorpus, Dictionary
from gensim.models.word2vec import LineSentence
from gensim.models import LdaMulticore

if 0 == 1:  # this can take a while (0:24)

    opinion = LineSentence(lemma_file)
    opinion_dictionary = Dictionary(opinion)
    print(opinion_dictionary)
In [ ]:
def opinion_bow_generator(filepath):
    """
    stream a bag-of-words representation of each opinion, one line at a time
    """

    for opinion in LineSentence(filepath):
        yield opinion_dictionary.doc2bow(opinion)
In [ ]:
%%time

opinion_bow = 'opinion_bow.mm'

if 0 == 1: # this can take a while (0:37)
    
    MmCorpus.serialize(opinion_bow, opinion_bow_generator(lemma_file))

opinion_bow_corpus = MmCorpus(opinion_bow)

Again, gensim can spread the work across multiple worker processes (LdaMulticore) to help generate the topic model more quickly.

In [ ]:
%%time

lda_model_file = 'lda_model_all'

if 0 == 1:  # this can take a while (6:07)

    lda = LdaMulticore(opinion_bow_corpus, num_topics=50, id2word=opinion_dictionary, workers=4)
    lda.save(lda_model_file)

Load the saved model.

In [ ]:
lda2 = LdaMulticore.load(lda_model_file)
In [ ]:
def explore_legal_topics(topic_number, topn=20):
    """
    given a topic number, print a formatted list of the top terms
    """

    print('{:15} {}'.format('term', 'frequency') + '\n')

    for term, frequency in lda2.show_topic(topic_number, topn=topn):
        print('{:15} {}'.format(term, round(frequency, 4)))
In [ ]:
print("        Topic ", '\n')
explore_legal_topics(topic_number=6)
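
Exploring one topic at a time gets tedious with 50 topics. Here's a quick way to scan the top terms of every topic in the loaded model (num_topics and show_topic are read straight off lda2):

In [ ]:
# print the top 5 terms for each topic in the saved model
for topic_number in range(lda2.num_topics):
    top_terms = [term for term, frequency in lda2.show_topic(topic_number, topn=5)]
    print(topic_number, '\t', ' '.join(top_terms))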