Topic Modelling

This notebook introduces topic modelling. It's part of The Art of Literary Text Analysis (and assumes that you've already worked through previous notebooks – see the table of contents). In this notebook we'll look in particular at functions with keyword arguments, creating bags of words, topic modelling with Gensim, and network graphing.

Topic modelling is a text mining technique that attempts to automatically identify groups of terms that are more likely to occur together in individual documents (or other text units, such as document segments). Topic modelling tends to be less interested in terms that occur uniformly throughout a text (like function words) and less interested in terms that occur rarely, and sometimes we're left with clusters of words – called topics – that are suggestive of a coherent group.

For a good collection of readings on topic modelling in the humanities, see this issue of the Journal of Digital Humanities.

For this notebook you'll need to install two libraries (see Getting Setup for more information on installing libraries).

sudo pip3 install -U gensim
sudo pip3 install -U networkx

You'll also need to have installed NLTK as described in the Getting NLTK notebook.

Loading Shakespeare's Sonnets

Let's start by loading Shakespeare's sonnets as an NLTK corpus. This was done previously in the Sentiment Analysis notebook, but if you need to quickly recapitulate those steps, you can copy and paste the code below and execute it first:

import urllib.request
sonnetsUrl = "http://www.gutenberg.org/cache/epub/1041/pg1041.txt"
sonnetsString = urllib.request.urlopen(sonnetsUrl).read().decode()
import re, os
filteredSonnetsStart = sonnetsString.find("  I\r\n") # title of first sonnet
filteredSonnetsEnd = sonnetsString.find("End of Project Gutenberg's") # end of sonnets
filteredSonnetsString = sonnetsString[filteredSonnetsStart:filteredSonnetsEnd].rstrip()
sonnetsList = re.split("  [A-Z]+\r\n\r\n", filteredSonnetsString)
sonnetsPath = 'sonnets' # this subdirectory will be relative to the current notebook
if not os.path.exists(sonnetsPath):
    os.makedirs(sonnetsPath)
for index, sonnet in enumerate(sonnetsList): # loop through our list as enumeration to get index
    if len(sonnet.strip()) > 0: # make sure we have text, not empty after stripping out whitespace
        filename = str(index).zfill(3)+".txt" # create filename from index
        pathname = os.path.join(sonnetsPath, filename) # directory name and filename
        with open(pathname, "w") as f: # the file is closed automatically when the block ends
            f.write(sonnet.rstrip()) # write out our sonnet into the file

Assuming we have a local directory called "sonnets" with our texts, we can use the PlaintextCorpusReader to load all .txt files in the directory.

In [1]:
from nltk.corpus import PlaintextCorpusReader
sonnetsCorpus = PlaintextCorpusReader("sonnets", r".*\.txt") # raw string so the \. reaches the regular expression intact
print(len(sonnetsCorpus.fileids()))

Functions with Keyword Arguments

Topic modelling typically treats each document as a collection (or bag) of words, so we need to create a list of lists where the outer list is for the documents and the inner list is for the words in each document.

[ # outer list for each document
    [term1, term2, term3], # inner list of terms for document1
    [term1, term2, term3],
]

In the simplest scenario, we could do something like this:

tokens = [sonnetsCorpus.words(fileid) for fileid in sonnetsCorpus.fileids()]

That would provide a list of tokens, but we'd then probably want to do further filtering for word tokens, stoplists, parts of speech, etc.

As we saw with sentiment analysis, it can be helpful to create reusable functions so that we can experiment easily with different settings. In the previous notebook we looked at defining functions with arguments that take default values if they're not specified:

def multiply(left, right=1):
    return left * right

multiply(5) # 5 (second argument is 1 by default)
multiply(5, 5) # 25

This flexibility is great, but it can get unwieldy if we want the possibility of multiple optional arguments since the order of the arguments matters. Take this as an example:

def process_string(string, leftStrip=False, rightStrip=False, convertToLower=True, reverseDirection=False):
    # process the string according to the options and return it
    return string # placeholder body so that the example runs

What happens if I only want to set the reverseDirection argument? With positional arguments, I have to specify all of the arguments before it just to do so:

process_string(" test ", False, False, True, True)

Far more useful would be something like this:

process_string(" test ", reverseDirection=True)

We can do this in Python with keyword arguments: when calling a function we can name an argument explicitly (name=value), so the order no longer matters and the remaining arguments keep their defaults. A function can also accept arbitrary keyword arguments by declaring a parameter with a special ** prefix. Each name and value pair is separated by a comma when we call the function, but the function receives all of these arguments in one dictionary (by convention called "kwargs" for keyword arguments).
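
As a minimal sketch of how this works (the function name describe_options and the option names passed to it are made up for illustration), a function declared with **kwargs receives any extra named arguments as an ordinary dictionary:

```python
def describe_options(text, **kwargs):
    # kwargs is a plain dict holding whatever named arguments the caller supplied
    options = ", ".join(name + "=" + repr(value) for name, value in sorted(kwargs.items()))
    return text.strip() + " [" + options + "]"

print(describe_options(" test "))                                  # no keyword arguments at all
print(describe_options(" test ", reverseDirection=True, minLen=3)) # two keyword arguments

# an existing dictionary can also be unpacked into keyword arguments with **
settings = {"minLen": 3}
print(describe_options(" test ", **settings))
```

Inside the function we can then check whether a given name is present in kwargs before acting on it, which is the approach used in the next section.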

Creating Bags of Words

Let's now write a function that takes a corpus and returns bags of words, or a list of lists of words. We'll provide functionality to filter out stop-words and to consider only specific parts of speech.

In [ ]:
import nltk

def get_lists_of_words(corpus, **kwargs): # the ** in front of kwargs does the magic of keyword arguments
    documents = [] # list of documents where each document is a list of words
    for fileid in corpus.fileids(): # go through each file in our corpus
        
        # keep only words and convert them to lowercase
        words = [token.lower() for token in corpus.words(fileid) if token[0].isalpha()]
        
        # look for "minLen" in our keyword arguments and if it's defined, filter our list
        if "minLen" in kwargs and kwargs["minLen"]: 
            words = [word for word in words if len(word) >= kwargs["minLen"]]
        
        # look for "stopwords" in our keyword arguments and if any are defined, filter our list
        if "stopwords" in kwargs and kwargs["stopwords"]: 
            words = [word for word in words if word not in kwargs["stopwords"]]

        # look for "pos" in our keyword arguments and if any are defined, filter our list
        if "pos" in kwargs and kwargs["pos"]: 
            tagged = nltk.pos_tag(words)
            words = [word for word, pos in tagged if pos in kwargs["pos"]]
        
        documents.append(words) # add our list of words
    
    return documents # return our list of documents

We could run our new function this way, with only the corpus defined:

get_lists_of_words(sonnetsCorpus)

But let's at least use the NLTK stoplist (with a bit of tweaking for Shakespeare's language), as well as a minimum word length (minLen) of 3 characters.

In [ ]:
sonnetsStopwords = nltk.corpus.stopwords.words('english') # load the default stopword list
sonnetsStopwords += ["thee", "thou", "thy"] # append a few more obvious words
sonnetsWords = get_lists_of_words(sonnetsCorpus, stopwords=sonnetsStopwords, minLen=3)

# have a peek:
for i in range(0,2): # first two documents
    print("document", str(i), sonnetsWords[i][0:5])

Excellent, we now have a list of lists of words, and we can do the topic modelling.

Topic Modelling with Gensim

Topic modelling is a good example of a text mining technique that can be challenging to understand (without considerable mathematical and computer training). The fact that we don't fully understand how it works shouldn't necessarily stop us from using it, but it does mean we should approach any results with additional circumspection, since we may be tempted to make interpretations that aren't necessarily justified. Even throwing a bag of words in the air and seeing how they land might also produce some intriguing and even useful results. Topic modelling can suggest some compelling associations between terms that we may not have considered otherwise, but it's probably best to go back to the texts to investigate further.

One strength of topic modelling (in some circumstances) is that it's a form of unsupervised text mining, which means that it doesn't require any prior training sets in order to start working. Like frequency counts of strings, it doesn't care which language it's working with (and it can even be used for analyzing non-linguistic sequences).

We'll use the Python library Gensim for our topic modelling. This library has the benefit of being relatively easy to use, though alternatives like Mallet (in the programming language Java) are also widely used.

Gensim offers a way to compute topics using a technique called Latent Dirichlet Allocation. As can be seen from the Gensim LdaModel() documentation, there are a number of parameters that can be set and tweaked that affect the modelling work. Again, these parameters (alpha, decay, etc.) can be rather opaque to understand, but they should also be seen as an invitation to experiment – to try different settings with one's particular corpus to see which ones produce the most promising results.

We won't go into much detail for transforming our bags of words (list of terms for our list of documents) into a topic model, but the process is actually fairly short. We just need to interact with the high-level gensim library.

In [ ]:
from gensim import corpora, models

def get_lda_from_lists_of_words(lists_of_words, **kwargs):
    dictionary = corpora.Dictionary(lists_of_words) # this dictionary maps terms to integers
    corpus = [dictionary.doc2bow(text) for text in lists_of_words] # create a bag of words from each document
    tfidf = models.TfidfModel(corpus) # this models the significance of words by document
    corpus_tfidf = tfidf[corpus]
    kwargs["id2word"] = dictionary # set the dictionary
    return models.LdaModel(corpus_tfidf, **kwargs) # do the LDA topic modelling
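
To make the first two steps of that function less opaque, here is a rough plain-Python sketch of the idea behind the dictionary and the bag-of-words conversion. The helper names build_vocabulary and to_bag_of_words are hypothetical, and this is not how Gensim implements these steps (Gensim may well assign different integer ids); it only illustrates the shape of the data, where each document becomes a list of (term id, count) pairs.

```python
from collections import Counter

def build_vocabulary(lists_of_words):
    # assign each distinct term an integer id, in order of first appearance
    vocabulary = {}
    for words in lists_of_words:
        for word in words:
            if word not in vocabulary:
                vocabulary[word] = len(vocabulary)
    return vocabulary

def to_bag_of_words(words, vocabulary):
    # count each known term and return sorted (term id, count) pairs
    counts = Counter(vocabulary[word] for word in words if word in vocabulary)
    return sorted(counts.items())

toy_documents = [["love", "time", "love"], ["time", "beauty"]] # made-up example data
vocabulary = build_vocabulary(toy_documents)
bags = [to_bag_of_words(words, vocabulary) for words in toy_documents]
print(vocabulary)
print(bags)
```

Note that word order within each document is discarded at this point, which is why the representation is called a bag of words.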

Without further ado, let's generate an LDA topic model from our lists of words, requesting 10 topics for the corpus. Drum roll please…

In [ ]:
sonnetsLda = get_lda_from_lists_of_words(sonnetsWords, num_topics=10, passes=20) # small corpus, so more passes
print(sonnetsLda)

Well, that was a bit anticlimactic. But we just need to better understand what was returned. Essentially we have a list of however many topics we requested (in this case 10) and for each topic, every word in our corpus is listed. The order of the topics doesn't mean anything, but the order of the terms in each topic is ranked by significance to that topic.

Let's define a function to output the top terms for each of our topics.

In [ ]:
def print_top_terms(lda, num_terms=10):
    for i in range(0, lda.num_topics):
        terms = [term for term, val in lda.show_topic(i, num_terms)] # recent Gensim versions return (term, weight) pairs
        print("Top", num_terms, "terms for topic #" + str(i) + ":", ", ".join(terms))
In [ ]:
print_top_terms(sonnetsLda)

Now we're talking! (or now we're modelling). We'll refrain from reading too much into these results and instead reiterate the need to approach them with some wariness and some willingness to experiment. There may be no better reason to treat these results carefully than the fact that rerunning the modelling will lead to different lists of words (because generation of the topics starts from randomly set conditions).

It's also worth noting that our corpus may not be ideal for topic modelling since each document (sonnet) is so short. Topic modelling can be most effective with documents that are longer but still short enough so that co-occurrence of terms in the same document may be significant. So, for instance, we could take a long text and divide it into segments of 1,000 words and perform the modelling on those segments.
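
As a minimal sketch of that segmentation step (the function name split_into_segments and the toy word list are assumptions made for illustration), a long list of words can be sliced into consecutive chunks like this:

```python
def split_into_segments(words, segment_size=1000):
    # slice the list of words into consecutive chunks of at most segment_size words
    return [words[i:i + segment_size] for i in range(0, len(words), segment_size)]

toy_words = ["word" + str(n) for n in range(2500)] # stand-in for the words of a long text
segments = split_into_segments(toy_words)
print([len(segment) for segment in segments]) # the last segment may be shorter than the others
```

The result is again a list of lists of words, so it has the same shape as sonnetsWords and could be passed to get_lda_from_lists_of_words() in the same way.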

Network Graphing

The list of terms above is probably useful, but can be a bit difficult to read. For instance, how many terms repeat in these topics?
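
One rough way to answer that question before drawing anything is simply to count, for each term, how many topics it appears in. The sketch below uses made-up toy topics and a hypothetical helper called count_shared_terms; with a real model, the lists of terms could be built from lda.show_topic() as in the function above.

```python
from collections import Counter

def count_shared_terms(topics_terms):
    # count in how many topics each term appears (a term counts once per topic)
    counts = Counter(term for terms in topics_terms for term in set(terms))
    # keep only the terms that occur in more than one topic
    return {term: count for term, count in counts.items() if count > 1}

toy_topics = [ # made-up top terms for three topics
    ["love", "time", "eyes"],
    ["time", "beauty", "love"],
    ["heart", "time", "sweet"],
]
print(count_shared_terms(toy_topics))
```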

One way we might explore and visualize the topics is by creating a network graph that associates each topic with each term. Network graphs are defined by a set of nodes with links or edges between them. Imagine these relationships:

  • student A went to school X
  • student A went to school Y
  • student B went to school X
  • student C went to school Y

That could be represented graphically like this, which would serve to show that student A (who went to both school X and school Y) is in some ways central to the graph. The schools are also more central (since they're shared) and students B and C are on the periphery since they have only one relationship.

Simple Network Graph
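
The notion of centrality described here can be made a little more concrete by counting how many links each node has. This is only a plain-Python sketch of that one simple measure (the helper name count_links is made up), using the four relationships listed above:

```python
from collections import Counter

# the four relationships listed above, as (student, school) pairs
edges = [("A", "X"), ("A", "Y"), ("B", "X"), ("C", "Y")]

def count_links(edges):
    # each edge contributes one link to both of its endpoints
    degrees = Counter()
    for left, right in edges:
        degrees[left] += 1
        degrees[right] += 1
    return dict(degrees)

print(count_links(edges))
```

Student A and both schools have two links each, while students B and C have only one, which matches their peripheral position in the graph.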

In the simplest form, we can create this same graph by merely defining the edges using the NetworkX library.

In [ ]:
import networkx as nx
import matplotlib.pyplot as plt
%matplotlib inline
G = nx.Graph()
G.add_edge("A", "X") # student A went to school X
G.add_edge("A", "Y") # student A went to school Y
G.add_edge("B", "X") # student B went to school X
G.add_edge("C", "Y") # student C went to school Y
nx.draw(G)

Because of an issue with node labelling, we actually need to do a slightly more involved version.

In [ ]:
pos = nx.spring_layout(G)
nx.draw_networkx_labels(G, pos, font_color='r') # font colour is "r" for red
nx.draw_networkx_edges(G, pos, alpha=0.1) # set the line alpha transparency to .1
plt.axis('off') # don't show the axes for this plot
plt.show()

We can treat our topics similarly – instead of schools we have topics and instead of students we have terms.

Graphing Topic Terms

The code below to generate the graph data shares some similarities with our code to print terms for each topic:

  • Go through each topic
    • Go through each term
      • Create a link (edge) between the topic and the term
In [ ]:
import networkx as nx
import matplotlib.pyplot as plt
%matplotlib inline

def graph_terms_to_topics(lda, num_terms=10):
    
    # create a new graph and size it
    G = nx.Graph()
    plt.figure(figsize=(10,10))

    # generate the edges
    for i in range(0, lda.num_topics):
        topicLabel = "topic "+str(i)
        terms = [term for term, val in lda.show_topic(i, num_terms)]
        for term in terms:
            G.add_edge(topicLabel, term)
    
    pos = nx.spring_layout(G) # positions for all nodes
    
    # we'll plot topic labels and terms labels separately to have different colours
    g = G.subgraph([topic for topic, _ in pos.items() if "topic " in topic])
    nx.draw_networkx_labels(g, pos,  font_color='r')
    g = G.subgraph([term for term, _ in pos.items() if "topic " not in term])
    nx.draw_networkx_labels(g, pos)
    
    # plot edges
    nx.draw_networkx_edges(G, pos, edgelist=G.edges(), alpha=0.1)

    plt.axis('off')
    plt.show()

graph_terms_to_topics(sonnetsLda)

The terms on the outside appear in only one topic. Some of these terms cluster closer to certain topics because of the way force-directed graphs try to position nodes (topic and term labels) so that lines are as short as possible and overlap is kept to a minimum. Terms in the middle appear in multiple topics. Is this a useful way to read Shakespeare's sonnets?

Next Steps

Try the following tasks to see if you can refine the topics:

  • Experiment with arguments to get_lists_of_words()
    • minLen (minimum length of words)
    • Stop-words
    • Parts-of-speech arguments (remember that these are Treebank codes)
    • Add an argument to the function (and try it) that determines if words are converted to lowercase
  • Experiment with arguments to get_lda_from_lists_of_words(), in other words to LdaModel()
  • Which tweaks seem to make the most difference?

Let's continue on to Document Similarity.


CC BY-SA From The Art of Literary Text Analysis by Stéfan Sinclair & Geoffrey Rockwell. Edited and revised by Melissa Mony.
Created March 23, 2015 and last modified December 9, 2015 (Jupyter 4)
