
Topic Modeling based on Digitised Volumes of theatrical English, Scottish, and Irish playbills between 1600 - 1902 from data.bl.uk¶

Topic models are a type of statistical language model used for discovering hidden structure in a collection of texts.

This example is based on a dataset that comprises 264 volumes of digitised theatrical playbills published between 1660 – 1902 (mostly 19th century) from England, Scotland, Wales and Ireland. Digitised from the British Library's physical collection of over 500 volumes of playbills, the dataset contains text files produced by Optical Character Recognition (OCR). More information about the dataset is available at https://data.bl.uk/playbills/

Setting up things¶

In [ ]:

import sys
import requests
import pandas as pd
import re
import gensim
from gensim.utils import simple_preprocess
from nltk.corpus import wordnet
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
import nltk
nltk.download('wordnet')
nltk.download('punkt')

Reading the CSV file¶

Note: the original dataset did not include a CSV file. It was generated from an Excel file.

In [ ]:

# Read data into playbills
playbills = pd.read_csv('playbills-ocr-text/playbills.csv', encoding='iso-8859-1')

# Print head
playbills.head()

Data cleaning¶

Since the goal of this analysis is to perform topic modeling, we will focus on the text data of each volume and remove the metadata columns that are not needed.

In [ ]:

# Remove the metadata columns that are not needed
playbills = playbills.drop(columns=['Ingestion Order', 'Shelf Mark', 'PID', 'Path', 'File Name (.PDF)', 'File Size (MB)'])

# Print out the first rows of playbills
playbills.head()

Reading the files and extracting the text¶

In [ ]:

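for index, row in playbills.iterrows():
    # The OCR text of each volume is stored in a file named after its LSID
    file = f"playbills-ocr-text/lsidyv{row['LSID']}.txt"
    try:
        # The with statement closes the file even if reading fails
        with open(file, "r") as f:
            playbills.loc[index, 'original_text'] = f.read()
    except (OSError, UnicodeDecodeError) as err:
        # Missing or unreadable file: report it and fall back to an empty text
        print("An exception occurred", err)
        playbills.loc[index, 'original_text'] = ''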

Reviewing the content of the files¶

In [ ]:

playbills.head()

Remove punctuation/lower casing/stopwords¶

Next, let's apply some simple preprocessing to the content to make it more amenable to analysis and to obtain more reliable results. We use a regular expression to remove punctuation, lowercase the text, remove stopwords, and then remove non-English words, since the OCR output may contain errors.

We use WordNet to check whether each word exists. We have also added some specific stopwords to improve the results.

The initial_clean function performs an initial clean by removing punctuation and digits, lowercasing the text, and splitting it into tokens.

In [ ]:

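def initial_clean(text):
    """
    Clean the text: remove digits and punctuation, lowercase, and tokenize.
    """
    # Remove digits and special characters; keep whitespace (including
    # newlines) so that words on adjacent lines are not fused together
    text = re.sub(r"[^a-zA-Z\s]", "", text)

    text = text.lower()  # lowercase the text
    text = nltk.word_tokenize(text)  # split into a list of tokens
    return text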
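
One detail worth checking in this step is how the character filter treats line breaks, which are frequent in OCR text. The sketch below compares a pattern that allows only letters and the space character with one that allows any whitespace. The sample string is hypothetical, and str.split is used here as a simple stand-in for the NLTK tokenizer so that the cell runs without downloaded data.

```python
import re

# Hypothetical OCR fragment with a line break between "Lane" and "This"
sample = "Theatre Royal, Drury-Lane\nThis Evening, 1821!"

# Allowing only letters and the space character also deletes the newline,
# so the last word of one line is fused with the first word of the next
letters_and_space = re.sub(r"[^a-zA-Z ]", "", sample).lower().split()

# Allowing any whitespace (\s) keeps the line break as a word boundary
letters_and_whitespace = re.sub(r"[^a-zA-Z\s]", "", sample).lower().split()

print(letters_and_space)
print(letters_and_whitespace)
```

Fused tokens such as "drurylanethis" would later be discarded as non-English words, so keeping line breaks as boundaries preserves more of the original text.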

The next function, stem_words(), reduces each word to its stem so that variant forms of the same word are counted together.

In [ ]:

stemmer = PorterStemmer()
def stem_words(text):
    """
    Stem each word in a list of tokens
    """
    text = [stemmer.stem(word) for word in text]
    text = [word for word in text if len(word) > 2]  # drop stems of one or two letters
    return text

Let's see an example

In [ ]:

some_words = "William Shakespeare was perhaps the most famous author"
some_words_tokens = nltk.word_tokenize(some_words)
print(stem_words(some_words_tokens))

We will use WordNet to remove non-existent words. Because the text in the dataset was produced by OCR, many tokens are not real words. Removing them should improve the quality of the topics.

In [ ]:

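def remove_non_english_words(text):
    """
    Keep only the tokens that WordNet recognises as English words.
    Note: stop_words must be defined (see the stopwords cell) before calling this.
    """
    filtered_text = []

    for token in text:
        if len(token) == 1:
            # Single letters are usually OCR noise
            continue
        elif token in stop_words:
            continue
        elif not wordnet.synsets(token):
            # Not an English word
            continue
        else:
            # English word
            filtered_text.append(token)
    return filtered_text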
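
The filtering logic can be tried out without WordNet. The following is a minimal sketch only: known_words and demo_stop_words are small hypothetical stand-ins for WordNet and the NLTK stopword list, and keep_known_words is an illustrative name that is not part of the notebook.

```python
# Hypothetical stand-ins: a tiny vocabulary instead of WordNet,
# and a tiny stopword set instead of the NLTK list
known_words = {"theatre", "royal", "evening", "comedy", "the", "of"}
demo_stop_words = {"the", "of"}

def keep_known_words(tokens, vocabulary, stop_words):
    """Drop one-letter tokens, stopwords, and tokens missing from the vocabulary."""
    return [
        token for token in tokens
        if len(token) > 1 and token not in stop_words and token in vocabulary
    ]

# "roval" and "errurs" imitate typical OCR misreadings
ocr_tokens = ["the", "theatre", "roval", "royal", "a", "comedy", "of", "errurs"]
filtered = keep_known_words(ocr_tokens, known_words, demo_stop_words)
print(filtered)
```

A dictionary lookup of this kind also discards correctly recognised words that the vocabulary does not contain, such as proper names, which is a trade-off to keep in mind for playbills.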

In general, common words known as stopwords are removed from text since they could be considered as noise when used in text algorithms.

In [ ]:

from nltk.corpus import stopwords
nltk.download('stopwords')
# A set removes duplicates and makes the membership test fast
stop_words = set(stopwords.words('english'))
stop_words.update(['news', 'say', 'use', 'not', 'would', 'could', '_', 'be', 'know', 'good', 'go', 'get', 'do', 'took', 'time', 'year',
'done', 'try', 'many', 'some', 'nice', 'thank', 'think', 'see', 'rather', 'easy', 'easily', 'lot', 'lack', 'make', 'want', 'seem', 'run', 'need', 'even', 'right', 'line', 'also', 'may', 'take', 'come', 'new', 'said', 'like', 'people'])
def remove_stop_words(text):
    return [word for word in text if word not in stop_words]

We create a function that performs the whole process:

In [ ]:

def apply_all(text):
    """
    Apply all the functions above in sequence: clean, filter, then stem
    """
    return stem_words(remove_stop_words(remove_non_english_words(initial_clean(text))))

Finally, we process the original text by passing the function to the pandas apply method.

In [ ]:

# clean the volumes and create the new column "tokenized_text"
import time
t1 = time.time()
playbills['tokenized_text'] = playbills['original_text'].apply(apply_all)
t2 = time.time()
print("Time to clean and tokenize", len(playbills), "volumes:", (t2-t1)/60, "min")

Checking the result¶

In [ ]:

playbills.head()

Create Gensim Dictionary and Corpus¶

Topic modeling with LDA (Latent Dirichlet Allocation) requires a dictionary and a corpus. This example uses the gensim library to build both.

In [ ]:

# LDA
import gensim
from gensim import corpora, models, similarities 

In [ ]:

tokenized = playbills['tokenized_text']

# Create the term dictionary of the corpus, where each unique term is assigned an index.
dictionary = corpora.Dictionary(tokenized)
# Keep terms that occur in at least 1 volume and in no more than 80% of the volumes.
dictionary.filter_extremes(no_below=1, no_above=0.8)
# Convert each tokenized text into a bag-of-words vector of (term id, count) pairs
corpus = [dictionary.doc2bow(tokens) for tokens in tokenized]
#print(corpus[:1])
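
To make the two structures concrete, here is a simplified pure-Python sketch of what a dictionary and a bag-of-words corpus contain. It only approximates the thresholds used above: gensim's own implementation may differ in details such as rounding and its cap on vocabulary size. The three toy documents, token2id and to_bow are hypothetical names for illustration.

```python
from collections import Counter

# Three hypothetical tokenized documents
docs = [
    ["theatre", "royal", "evening", "farce"],
    ["theatre", "royal", "comedy"],
    ["theatre", "benefit", "comedy", "comedy"],
]

# Document frequency: in how many documents does each token appear?
doc_freq = Counter(token for doc in docs for token in set(doc))

# Keep tokens found in at least no_below documents and
# in no more than no_above (a fraction) of all documents
no_below, no_above = 1, 0.8
max_docs = no_above * len(docs)
kept = sorted(t for t, df in doc_freq.items() if no_below <= df <= max_docs)

# The dictionary maps each kept token to an integer id
token2id = {token: i for i, token in enumerate(kept)}

def to_bow(doc):
    """Return sorted (token id, count) pairs, skipping tokens not in the dictionary."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

toy_corpus = [to_bow(doc) for doc in docs]
print(token2id)
print(toy_corpus)
```

In this toy example "theatre" appears in all three documents, which is more than 80% of them, so it is dropped. With no_below=1 nothing is removed at the rare end, because every token occurs in at least one document.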

Building the Topic Model¶

In this step, num_topics is the number of topics to be created and passes is the number of times the algorithm iterates over the entire corpus. Running the LDA algorithm yields the topics.
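
For reference, the standard generative model that LDA assumes can be written as follows. This is the textbook formulation, not something specific to this notebook. The symbols are the usual ones: K is the number of topics (num_topics here), alpha and eta are Dirichlet priors, theta_d is the topic mixture of document d, beta_k is the word distribution of topic k, z_{d,n} is the topic assigned to the n-th word of document d, and w_{d,n} is that word.

```latex
\begin{align*}
\beta_k  &\sim \mathrm{Dirichlet}(\eta),   & k &= 1, \dots, K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1, \dots, D \\
z_{d,n} \mid \theta_d &\sim \mathrm{Categorical}(\theta_d), & n &= 1, \dots, N_d \\
w_{d,n} \mid z_{d,n}, \beta &\sim \mathrm{Categorical}(\beta_{z_{d,n}})
\end{align*}
```

Training estimates the topic-word distributions and the per-document topic mixtures from the observed words. The words printed for each topic are those with the highest probability under the corresponding beta_k.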

In [ ]:

import warnings
warnings.simplefilter("ignore", DeprecationWarning)

#LDA
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics = 5, id2word=dictionary, passes=15)
ldamodel.save('model_combined.gensim')
topics = ldamodel.print_topics(num_words=4)
for topic in topics:
   print(topic)

This output shows the 5 topics created and the 4 words within each topic that best describe it. From the above output we could guess that each topic and its corresponding words revolve around a common theme (for example, topic 2 is related to bologna and money).
