Text Analysis

Motivation: Text Analysis on University of Maryland Patents

Recall that we were able to use the Patentsview API to pull data from patents awarded to inventors at University of Maryland, including the abstract from each of the patents. Suppose we wanted to know what types of patents were awarded to UMD. We can look at the information from the abstracts and read through them, but this would take a very long time since there are almost 1,300 abstracts. Instead, we can use text analysis to help us.

However, as-is, the text from abstracts can be difficult to analyze. We aren't able to use traditional statistical techniques without some heavy data manipulation, because the text is essentially a categorical variable with unique values for each patent. We first need to break the text apart and clean it before we can apply our data analysis techniques.

In this notebook, we will go through the process of cleaning and processing the text data to prepare it for topic modeling, which is the process of automatically assigning topics to individual documents (in this case, an individual document is an abstract from a patent). We will use a technique called Latent Dirichlet Allocation as our topic modeling technique, and try to determine what sorts of patents were awarded to University of Maryland.

Introduction to Text Analysis

Text analysis is used to extract useful information from or summarize a large amount of unstructured text stored in documents. This opens up the opportunity of using text data alongside more conventional data sources (e.g. surveys and administrative data). The goal of text analysis is to take a large corpus of complex and unstructured text data and extract important and meaningful messages in a comprehensible way.

Text analysis can help with the following tasks:

  • Information Retrieval: Find relevant information in a large database, for tasks such as a systematic literature review that would be very time-consuming for humans to do manually.

  • Clustering and Text Categorization: Summarize a large corpus of text by finding the most important phrases, using methods like topic modeling.

  • Text Summarization: Create category-sensitive text summaries of a large corpus of text.

  • Machine Translation: Translate documents from one language to another.

Glossary of Terms

  • Corpus: A corpus is the set of all text documents used in your analysis; for example, your corpus of text may include hundreds of abstracts from patent data.

  • Tokenize: Tokenization is the process by which text is separated into meaningful terms or phrases. In English this is easy to do for individual words, as they are separated by whitespace; however, it can get more complicated to automate determining which groups of words constitute meaningful phrases.

  • Stemming: Stemming is normalizing text by reducing all forms or conjugations of a word to the word's most basic form. In English, this can mean making a rule of removing the suffixes "ed" or "ing" from the end of all words, but it gets more complex. For example, "to go" is irregular, so suffix rules alone cannot tell that "went" and "goes" share a common lemma and should be considered alternate forms of the word "go." Handling such cases requires lemmatization, which looks up the dictionary form of each word.

  • TF-IDF: TF-IDF (term frequency-inverse document frequency) is an example of feature engineering where the most important words are extracted by taking into account their frequency in individual documents and in the corpus as a whole.

  • Topic Modeling: Topic modeling is an unsupervised learning method where groups of words that often appear together are clustered into topics. Typically, the words in one topic should be related and make sense (e.g. boat, ship, captain). Individual documents can fall under one topic or multiple topics.

  • LDA: LDA (Latent Dirichlet Allocation) is a type of probabilistic model commonly used for topic modeling.

  • Stop Words: Stop words are words that have little semantic meaning but occur very frequently, like prepositions, articles and pronouns. For example, every document (in English) will probably contain the words "and" and "the" many times. You will often remove them as part of preprocessing using a list of stop words.

Setup

In [ ]:
# interacting with websites and web-APIs
import requests # easy way to interact with web sites and services

# data manipulation
import pandas as pd
import numpy as np

import nltk

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn import preprocessing

from nltk.corpus import stopwords
from nltk import SnowballStemmer
import string

Load the Data

To start, we'll load the data using the PatentsView API. For more information about how this part works, view the API notebook.

In [ ]:
url = 'http://www.patentsview.org/api/patents/query?'
PARAMS = {'q': '{"assignee_organization":"university of maryland"}',
         'f': '["patent_title","patent_year", "patent_type", "patent_abstract"]',
         'o': '{"per_page":1300}'}
r = requests.get(url, params=PARAMS)  # Get response from the URL
r.status_code # Check status code
In [ ]:
json = r.json()
In [ ]:
patents = pd.DataFrame(json['patents'])
In [ ]:
patents.head()
In [ ]:
abstracts = patents.patent_abstract

Topic Modeling

We are going to apply topic modeling, an unsupervised learning method, to our corpus to find the high-level topics in our corpus. Through this process, we'll discuss how to clean and preprocess our data to get the best results. Topic modeling is a broad subfield of machine learning and natural language processing. We are going to focus on a common modeling approach called Latent Dirichlet Allocation (LDA).

To use topic modeling, we first have to assume that topics exist in our corpus, and that some small number of these topics can "explain" the corpus. A topic in this context is a list of words from the corpus, ranked by probability. A single document can be explained by multiple topics. For instance, an article on net neutrality would fall under the topic "technology" as well as the topic "politics." The set of topics used by a document is known as the document's allocation. Hence the name Latent Dirichlet Allocation: each document has an allocation of latent topics, drawn from a Dirichlet distribution.
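
To make the idea of an "allocation" concrete, here is a minimal sketch that draws one document's topic shares from a Dirichlet distribution. It uses only the standard library (normalized Gamma draws are one standard way to sample a Dirichlet); the topic names and the alpha values are made up for illustration and are not taken from the patent data.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one sample from a Dirichlet distribution by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
topic_names = ["technology", "politics", "sports"]  # hypothetical topics

# Small alpha values favor documents dominated by one or two topics.
allocation = sample_dirichlet([0.5, 0.5, 0.5], rng)

for name, share in zip(topic_names, allocation):
    print(f"{name}: {share:.2f}")
```

The shares are non-negative and sum to 1, which is what lets us read them as "how much of this document belongs to each topic."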

We will use topic modeling in order to determine what types of inventions have been produced at the University of Maryland.

Preparing Text Data for Natural Language Processing (NLP)

The first important step in working with text data is cleaning and processing the data, which includes (but is not limited to):

  • forming a corpus of text
  • stemming and lemmatization
  • tokenization
  • removing stop-words
  • finding words co-located together (N-grams)

The ultimate goal is to transform our text data into a form an algorithm can work with, because a document or a corpus of text cannot be fed directly into an algorithm. Algorithms expect numerical feature vectors with certain fixed sizes, and can't handle documents, which are basically sequences of symbols with variable length. We will be transforming our text corpus into a bag of n-grams to be further analyzed. In this form our text data is represented as a matrix where each row refers to a specific patent abstract (document) and each column holds the number of occurrences of a word (feature).
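
As a rough illustration of that matrix, the sketch below builds one by hand for two made-up sentences standing in for abstracts. It uses plain Python rather than the CountVectorizer we rely on later, so the variable names here are only for this example.

```python
from collections import Counter

docs = [
    "a laser system and a laser method",
    "a method for protein analysis",
]  # made-up stand-ins for abstracts

tokenized = [doc.split() for doc in docs]
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# One row per document, one column per vocabulary word.
matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

Every row has the same length (the size of the vocabulary) no matter how long the original sentence was, which is the fixed-size representation an algorithm needs.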

Stemming and Lemmatization - Distilling text data

We want to process our text through stemming and lemmatization, or replacing words with their root or simplest form. For example "systems," "systematic," and "system" are all different words, but we can replace all these words with "system" without sacrificing much meaning.

  • A lemma is the original dictionary form of a word (e.g. the lemma for "lies," "lied," and "lying" is "lie").
  • The process of turning a word into its simplest form is stemming. There are several well known stemming algorithms -- Porter, Snowball, Lancaster -- that all have their respective strengths and weaknesses.

In this notebook, we'll use the Snowball Stemmer:

In [ ]:
# Examples of how a Stemmer works:
stemmer = SnowballStemmer("english")
print(stemmer.stem('lies'))
print(stemmer.stem("lying"))
print(stemmer.stem('systematic'))
print(stemmer.stem("running"))
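
For contrast, here is a deliberately naive suffix-stripping rule written in plain Python. It is a toy for illustration only, not the Snowball algorithm, and the function name and the suffix list are our own assumptions.

```python
def naive_stem(word):
    """Strip a few common suffixes; a toy rule, not the Snowball algorithm."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters of the word remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for word in ["patented", "systems", "running", "went"]:
    print(word, "->", naive_stem(word))
```

The rule handles "patented" and "systems" but leaves the doubled consonant in "running" and does nothing for the irregular "went," which is why we lean on an established stemmer instead of writing our own rules.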

Removing Punctuation

For some purposes, we might want to preserve punctuation. For example, if we wanted to be able to detect sentiment of text, we might want to keep exclamation points, because they signify something about the text. For our purposes, however, we will simply strip the punctuation so that it does not affect our analysis. To do this, we use the string package, creating a translator that takes any string and "translates" it into a string in which every punctuation character has been replaced by a space.

An example using the first abstract in our corpus is shown below.

In [ ]:
# Before
abstracts[0]
In [ ]:
# Create translator
translator=str.maketrans(string.punctuation, ' ' * len(string.punctuation))

# After
abstracts[0].translate(translator)
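
One design choice worth seeing side by side is replacing punctuation with spaces, as in the cell above, versus deleting it outright. The sketch below uses a made-up sentence; the variable names are only for this example.

```python
import string

text = "A state-of-the-art sensor (patent pending)."  # made-up example

# Replace each punctuation character with a space, as in the cell above.
to_spaces = str.maketrans(string.punctuation, " " * len(string.punctuation))

# Alternative: delete punctuation outright (third argument of maketrans).
to_nothing = str.maketrans("", "", string.punctuation)

spaced_words = text.translate(to_spaces).split()
deleted_words = text.translate(to_nothing).split()

print(spaced_words)
print(deleted_words)
```

Replacing with spaces keeps hyphenated terms as separate words, while deleting fuses them into a single token such as "stateoftheart," which would rarely match anything else in the corpus.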

Tokenizing

We want to separate text into individual tokens (generally individual words). To do this, we'll first write a function that takes a string and splits it up into individual words. We'll do the whole process of removing punctuation, stemming, and tokenizing all in one function.

In [ ]:
def tokenize(text):
    translator=str.maketrans(string.punctuation, ' '*len(string.punctuation)) # translator that replaces punctuation with empty spaces
    return [stemmer.stem(i) for i in text.translate(translator).split()]

The tokenize function actually does several things at the same time. First, it removes any punctuation using the translate method. Then, the split method breaks it apart into individual words. Then, using stemmer.stem, it creates a list of the stemmed versions of each of those individual words.

Let's take a look at an example of how this works using the first abstract in our corpus.

In [ ]:
tokenize(abstracts[0])

What we get out of it is something called a bag of words. This is a list of all of the words that are in the abstract, cleaned of all punctuation and stemmed. The paragraph is now represented as a vector of individual words rather than as one whole entity.

We can apply this to each abstract in our corpus using CountVectorizer. This will not only do the tokenizing, but it will also count any duplicates of words and create a matrix that contains the frequency of each word. This will be quite a large matrix (number of columns will be number of unique words), so it outputs the data as a sparse matrix.

Similar to how we fit models using sklearn, we will first create the vectorizer object (you can think of this like a model object), and then fit it with our abstracts. This should give us back our overall corpus bag of words, as well as a list of features (that is, the unique words in all the abstracts).

In [ ]:
vectorizer = CountVectorizer(analyzer= "word", # unit of features are single words rather than phrases of words
                            tokenizer=tokenize, # function to create tokens
                            ngram_range=(1,1), # single words (unigrams) only
                            strip_accents='unicode',
                            min_df = 0.05,
                            max_df = 0.95)
In [ ]:
bag_of_words = vectorizer.fit_transform(abstracts) # transform our corpus into a bag of words
features = vectorizer.get_feature_names_out() # scikit-learn >= 1.0; older versions use get_feature_names()
In [ ]:
print(bag_of_words[0])
In [ ]:
features[0:10]
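
The min_df and max_df arguments above are given as floats, which CountVectorizer reads as proportions of documents: words that appear in too few or too many documents are dropped. The following is a small hand-rolled sketch of that filtering on made-up tokenized documents, with thresholds chosen only to make the effect visible on four documents.

```python
from collections import Counter

tokenized_docs = [
    ["laser", "system", "method"],
    ["laser", "protein", "method"],
    ["laser", "antenna", "method"],
    ["laser", "system", "signal"],
]  # made-up tokenized documents

n_docs = len(tokenized_docs)

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for tokens in tokenized_docs for word in set(tokens))

min_df, max_df = 0.30, 0.95  # illustrative thresholds, as proportions
kept = sorted(
    word for word, df in doc_freq.items()
    if min_df <= df / n_docs <= max_df
)
print(kept)
```

Here "laser" is dropped because it appears in every document, and "protein," "antenna," and "signal" are dropped because each appears in only one.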

Now that we have our bag of words, we can start using it for models such as Latent Dirichlet Allocation.

Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) is a statistical model that groups documents according to the words that tend to occur together in them. This is an example of an unsupervised machine learning model. That is, we don't have any sort of outcome variable -- we're just trying to group the abstracts into rough categories.

Let's try fitting an LDA model. The way we do it is very similar to the models we've fit before from sklearn. We first create a LatentDirichletAllocation object, then fit it using our corpus bag of words.

In [ ]:
lda = LatentDirichletAllocation(learning_method='online') 

doctopic = lda.fit_transform( bag_of_words )
In [ ]:
ls_keywords = []
for i, topic in enumerate(lda.components_):
    word_idx = np.argsort(topic)[::-1][:5] # indices of the 5 highest-weighted words
    keywords = ', '.join(features[j] for j in word_idx)
    ls_keywords.append(keywords)
    print(i, keywords)

This doesn't look very helpful! There are way too many common words in the corpus, such as 'a', 'of', and so on. We need to remove them, because they don't actually have any interesting information about the documents.

Removing meaningless text - Stopwords

Stopwords are words that are found commonly throughout a text and carry little semantic meaning. Examples of common stopwords are prepositions ("to", "on", "in"), articles ("the", "an", "a"), conjunctions ("and", "or", "but") and pronouns. For example, the words "the" and "of" are totally ubiquitous, so they won't serve as meaningful features, whether to distinguish documents from each other or to tell what a given document is about. You may also run into words that you want to remove based on where you obtained your corpus of text or what it's about. There are many lists of common stopwords available for you to use, both for general documents and for specific contexts, so you don't have to start from scratch.
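
At its core, stopword removal is just a membership check. A minimal sketch with a tiny hand-picked list (an assumption for illustration, far shorter than any real stopword list) looks like this:

```python
# Hand-picked for this example; real stopword lists are much longer.
tiny_stop_list = {"a", "an", "the", "of", "and", "for", "to", "in"}

tokens = "a method for the detection of proteins in a sample".split()
content_tokens = [t for t in tokens if t not in tiny_stop_list]
print(content_tokens)
```

Only the words that say something about the subject matter remain.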

We can eliminate stopwords by checking all the words in our corpus against a list of commonly occurring stopwords that comes with NLTK.

In [ ]:
# Download most current stopwords
nltk.download('stopwords')
In [ ]:
stop = stopwords.words('english')
stop[0:10]
In [ ]:
# Tokenize stop words to match
eng_stopwords = [tokenize(s)[0] for s in stop]
In [ ]:
vectorizer = CountVectorizer(analyzer= "word", # unit of features are single words rather than phrases of words
                            tokenizer=tokenize, # function to create tokens
                            ngram_range=(1,1), # single words (unigrams) only
                            strip_accents='unicode',
                            stop_words=eng_stopwords,
                            min_df = 0.05,
                            max_df = 0.95)

# Creating bag of words
bag_of_words = vectorizer.fit_transform(abstracts) # transform our corpus into a bag of words
features = vectorizer.get_feature_names_out() # scikit-learn >= 1.0; older versions use get_feature_names()

# Fitting LDA model
lda = LatentDirichletAllocation(n_components = 5, learning_method='online')
doctopic = lda.fit_transform( bag_of_words )

# Displaying the top keywords in each topic
ls_keywords = []
for i, topic in enumerate(lda.components_):
    word_idx = np.argsort(topic)[::-1][:5] # indices of the 5 highest-weighted words
    keywords = ', '.join(features[j] for j in word_idx)
    ls_keywords.append(keywords)
    print(i, keywords)

N-grams - Adding context by creating N-grams

Obviously, reducing a document to a bag of words means losing much of its meaning - we put words in certain orders, and group words together in phrases and sentences, precisely to give them more meaning. If you follow the processing steps we've gone through so far, splitting your document into individual words and then removing stopwords, you'll completely lose all phrases like "kick the bucket," "commander in chief," or "sleeps with the fishes."

One way to address this is to break down each document similarly, but rather than treating each word as an individual unit, treat each group of 2 words, or 3 words, or n words, as a unit. We call this a "bag of n-grams," where n is the number of words in each chunk. Then you can analyze which groups of words commonly occur together (in a fixed order).
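
Generating n-grams from a list of tokens takes only a sliding window. The helper below is a plain-Python sketch (the function name is our own), separate from the CountVectorizer option we use in the next cell.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["commander", "in", "chief", "speaks"]
print(ngrams(tokens, 1))
print(ngrams(tokens, 2))
print(ngrams(tokens, 3))
```

With n set to 3, the phrase "commander in chief" survives as a single feature instead of being split into three unrelated words.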

In [ ]:
vectorizer = CountVectorizer(analyzer= "word", # build features from words rather than characters
                            tokenizer=tokenize, # function to create tokens
                            ngram_range=(1,2), # single words and bigrams
                            strip_accents='unicode',
                            stop_words=eng_stopwords,
                            min_df = 0.05,
                            max_df = 0.95)

# Creating bag of words
bag_of_words = vectorizer.fit_transform(abstracts) # transform our corpus into a bag of words
features = vectorizer.get_feature_names_out() # scikit-learn >= 1.0; older versions use get_feature_names()

# Fitting LDA model
lda = LatentDirichletAllocation(n_components = 5, learning_method='online')
doctopic = lda.fit_transform( bag_of_words )

# Displaying the top keywords in each topic
ls_keywords = []
for i, topic in enumerate(lda.components_):
    word_idx = np.argsort(topic)[::-1][:10] # indices of the 10 highest-weighted terms
    keywords = ', '.join(features[j] for j in word_idx)
    ls_keywords.append(keywords)
    print(i, keywords)

TF-IDF - Weighting terms based on frequency

A final step in cleaning and processing our text data is Term Frequency-Inverse Document Frequency (TF-IDF). TF-IDF is based on the idea that the words (or terms) that are most related to a certain topic will occur frequently in documents on that topic, and infrequently in unrelated documents. TF-IDF re-weights words so that we emphasize words that are distinctive to a document and suppress words that are common throughout the corpus: a term's weight increases with its frequency within the document and decreases with the number of documents in the corpus that contain it.
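
To see the arithmetic, here is a hand-computed sketch on three made-up tokenized documents. It follows the sublinear term frequency (1 + ln tf) and smoothed inverse document frequency (ln((1 + n) / (1 + df)) + 1) that scikit-learn documents for the options this notebook passes to TfidfTransformer; treat it as an illustration of the idea rather than a replacement for the library.

```python
import math
from collections import Counter

tokenized_docs = [
    ["laser", "laser", "method"],
    ["protein", "method"],
    ["antenna", "method"],
]  # made-up tokenized documents

n_docs = len(tokenized_docs)
doc_freq = Counter(word for tokens in tokenized_docs for word in set(tokens))

def tfidf(tokens):
    """Weight one document's terms using sublinear tf and smoothed idf."""
    weights = {}
    for word, count in Counter(tokens).items():
        tf = 1 + math.log(count)  # sublinear term frequency
        idf = math.log((1 + n_docs) / (1 + doc_freq[word])) + 1  # smoothed idf
        weights[word] = tf * idf
    return weights

weights = tfidf(tokenized_docs[0])
for word, weight in weights.items():
    print(f"{word}: {weight:.3f}")
```

"method" appears in every document, so its weight stays at the minimum of 1, while "laser" is both repeated and specific to the first document and ends up weighted much higher.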

Let's look at how to use TF-IDF:

In [ ]:
stop = stopwords.words('english') + ['invent', 'produce', 'method', 'use', 'first', 'second']
full_stopwords = [tokenize(s)[0] for s in stop]
In [ ]:
vectorizer = CountVectorizer(analyzer= "word", # build features from words rather than characters
                            tokenizer=tokenize, # function to create tokens
                            ngram_range=(1,2), # single words and bigrams
                            strip_accents='unicode',
                            stop_words=full_stopwords,
                            min_df = 0.05,
                            max_df = 0.95)
# Creating bag of words
bag_of_words = vectorizer.fit_transform(abstracts) # transform our corpus into a bag of words
features = vectorizer.get_feature_names_out() # scikit-learn >= 1.0; older versions use get_feature_names()

# Use TfidfTransformer to re-weight bag of words
transformer = TfidfTransformer(norm = None, smooth_idf = True, sublinear_tf = True)
tfidf = transformer.fit_transform(bag_of_words)

# Fitting LDA model
lda = LatentDirichletAllocation(n_components = 5, learning_method='online')
doctopic = lda.fit_transform(tfidf)

# Displaying the top keywords in each topic
ls_keywords = []
for i, topic in enumerate(lda.components_):
    word_idx = np.argsort(topic)[::-1][:10] # indices of the 10 highest-weighted terms
    keywords = ', '.join(features[j] for j in word_idx)
    ls_keywords.append(keywords)
    print(i, keywords)
In [ ]:
ls_keywords
In [ ]:
doctopic
In [ ]:
topic_df = pd.DataFrame(doctopic, columns = ls_keywords)
topic_df.head()
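
One simple way to read a document-topic matrix is to pick, for each document, the topic with the largest share. The sketch below does this on a small made-up matrix in plain Python; the numbers and keyword labels are hypothetical and are not output from the model above.

```python
doc_topic_rows = [
    [0.70, 0.20, 0.10],
    [0.05, 0.15, 0.80],
    [0.40, 0.45, 0.15],
]  # made-up topic shares, one row per document

topic_labels = ["laser, optic", "protein, cell", "signal, network"]  # hypothetical keywords

dominant = []
for doc_id, row in enumerate(doc_topic_rows):
    best = max(range(len(row)), key=lambda k: row[k])  # index of the largest share
    dominant.append(best)
    print(doc_id, topic_labels[best], row[best])
```

The third document shows why this is only a rough summary: its top two topics are nearly tied, so a single label hides the fact that it draws on both.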

Supervised Learning: Document Classification

Previously, we used topic modeling to infer relationships between patent abstracts within the data. That is an example of unsupervised learning: we were looking to uncover structure in the form of topics, or groups of abstracts, but we did not necessarily know the ground truth of how many groups we should find or which abstracts belonged in which group.

We can also do supervised learning with text data. In supervised learning, we have a known outcome or label (Y) that we want to produce given some data (X), and in general, we want to be able to produce this Y when we don't know it, or when we only have X.

In order to produce labels we need to first have examples our algorithm can learn from, a "training set." In the context of text analysis, developing a training set can be very expensive, as it can require a large amount of human labor or linguistic expertise. Document classification is an example of supervised learning in which we want to characterize our documents based on their contents (X). A common example of document classification is spam e-mail detection. Another example of supervised learning in text analysis is sentiment analysis, where X is our documents and Y is the state of the author. This "state" is dependent on the question you're trying to answer, and can range from the author being happy or unhappy with a product to the author being politically conservative or liberal. Another example is part-of-speech tagging, where X is the individual words and Y is the part of speech of each word.
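
To give a feel for the spam example, here is a minimal sketch of a naive Bayes classifier with add-one smoothing, written in plain Python. Naive Bayes is our choice for illustration only; the four training messages and their labels are made up, and a real training set would need to be far larger.

```python
import math
from collections import Counter, defaultdict

training_set = [
    ("win a free prize now", "spam"),
    ("free money claim your prize", "spam"),
    ("meeting agenda for monday", "ham"),
    ("please review the project report", "ham"),
]  # made-up labeled examples (X = text, Y = label)

word_counts = defaultdict(Counter)  # label -> word counts
label_counts = Counter()            # label -> number of documents
for text, label in training_set:
    label_counts[label] += 1
    word_counts[label].update(text.split())

vocabulary = {w for counts in word_counts.values() for w in counts}

def predict(text):
    """Return the label with the highest log-probability under naive Bayes."""
    scores = {}
    for label in label_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / len(training_set))
        for word in text.split():
            # Add-one smoothing keeps unseen words from zeroing out a label.
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        scores[label] = score
    return max(scores, key=scores.get)

print(predict("claim your free prize"))
print(predict("project meeting on monday"))
```

The classifier learns Y from the labeled examples and then produces a label for text it has not seen, which is exactly the situation described above where we only have X.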

Further Resources

A great resource for NLP in Python is Natural Language Processing with Python.