Topic models are a type of statistical language model used for discovering hidden structure in a collection of texts.
This example is based on a dataset of 264 volumes of digitised theatrical playbills published between 1660 and 1902 (mostly in the 19th century) from England, Scotland, Wales and Ireland. Digitised from the British Library's physical collection of over 500 volumes of playbills, the dataset contains text files produced by Optical Character Recognition (OCR). More information about the dataset is available at https://data.bl.uk/playbills/
import sys
import requests
import pandas as pd
import re
import gensim
from gensim.utils import simple_preprocess
from nltk.corpus import wordnet
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
import nltk
nltk.download('wordnet')
nltk.download('punkt')
Note: the original dataset did not include a CSV file. The CSV used here was generated from an Excel file.
# Read data into playbills
playbills = pd.read_csv('playbills-ocr-text/playbills.csv', encoding='iso-8859-1')
# Print head
playbills.head()
Since the goal of this analysis is topic modeling, we focus on the text of each record and remove the metadata columns that are not needed.
# Remove the metadata columns that are not needed
playbills = playbills.drop(columns=['Ingestion Order', 'Shelf Mark', 'PID', 'Path', 'File Name (.PDF)', 'File Size (MB)'])
# Print out the first rows of playbills
playbills.head()
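# Load the OCR text of each volume into a new column
for index, row in playbills.iterrows():
    file = "playbills-ocr-text/lsidyv" + str(row['LSID']) + ".txt"
    try:
        # The context manager closes the file even if reading fails
        with open(file, "r") as f:
            playbills.loc[index, 'original_text'] = f.read()
    except Exception as err:
        # Fall back to an empty text when the file is missing or unreadable
        print("An exception occurred", type(err).__name__)
        playbills.loc[index, 'original_text'] = ''
playbills.head()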
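The file-reading step can also be factored into a small helper, which makes the fallback behaviour easier to check on its own. The following is a minimal sketch and not part of the original notebook: the name `read_text_or_empty` is hypothetical, and the default encoding and the choice to replace undecodable bytes are assumptions about the OCR files.

```python
import tempfile
from pathlib import Path


def read_text_or_empty(path, encoding="utf-8"):
    """Return the contents of a text file, or '' if it cannot be read.

    Hypothetical helper; the default encoding is an assumption, and
    undecodable bytes are replaced instead of raising an error.
    """
    try:
        return Path(path).read_text(encoding=encoding, errors="replace")
    except OSError as err:
        print(f"Could not read {path}: {err}")
        return ""


# Small self-contained demonstration using a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    existing = Path(tmp) / "example.txt"
    existing.write_text("THEATRE ROYAL", encoding="utf-8")
    found_text = read_text_or_empty(existing)
    missing_text = read_text_or_empty(Path(tmp) / "missing.txt")
```

A helper like this could be applied to a column of file paths with pandas' `apply`, which avoids updating the DataFrame row by row.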
Next, we apply some simple preprocessing to make the text more amenable to analysis and the results more reliable. We use a regular expression to remove punctuation and digits, lowercase the text, remove stopwords, and then remove non-English words, since the OCR output may contain errors.
We use WordNet to check whether a word exists. We have also added some custom stopwords to improve the results.
The initial_clean function performs the initial cleaning: it removes digits and punctuation, lowercases the text, and splits it into tokens.
def initial_clean(text):
    """
    Clean the text: remove digits and punctuation, lowercase, and tokenize.
    """
    # Remove digits and special characters; whitespace (including line breaks)
    # is kept so that words on adjacent lines are not merged
    text = re.sub(r"[^a-zA-Z\s]", "", text)
    text = text.lower()  # lowercase the text
    text = nltk.word_tokenize(text)
    return text
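A detail worth checking in a cleaning step like this is how the character class treats line breaks, since OCR text files typically span many lines. The sketch below compares two patterns on a made-up string. To keep the example free of downloads, `str.split` stands in for `nltk.word_tokenize`; that substitution is an assumption of this sketch.

```python
import re

# Made-up sample with a line break, punctuation and digits.
sample = "Theatre\nRoyal, Drury-Lane 1805"

# Pattern that keeps only letters and the space character:
# the newline is deleted, so "Theatre" and "Royal" are glued together.
letters_and_space = re.sub(r"[^a-zA-Z ]", "", sample).lower().split()

# Pattern that keeps letters and any whitespace (\s):
# the newline survives and still separates the two words.
letters_and_whitespace = re.sub(r"[^a-zA-Z\s]", "", sample).lower().split()

print(letters_and_space)       # ['theatreroyal', 'drurylane']
print(letters_and_whitespace)  # ['theatre', 'royal', 'drurylane']
```

Merged tokens such as `theatreroyal` would later be discarded by a dictionary lookup, so keeping whitespace preserves more usable words.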
The next function, stem_words(), reduces each word to its stem so that variant forms of a word are counted as the same term. It also drops tokens of one or two characters.
stemmer = PorterStemmer()
def stem_words(text):
    """
    Stem each word and drop very short tokens.
    """
    text = [stemmer.stem(word) for word in text]
    text = [word for word in text if len(word) > 2]  # drop tokens of one or two characters
    return text
Let's see an example:
some_words = "William Shakespeare was perhaps the most famous author"
some_words_tokens = nltk.word_tokenize(some_words)
print(stem_words(some_words_tokens))
We will use WordNet to remove tokens that are not real words. Because the text in the dataset comes from OCR, many tokens are not valid English words. Removing them should improve the quality of the resulting topics.
def remove_non_english_words(text):
    filtered_text = []
    for token in text:
        if len(token) == 1:
            continue
        elif token in stop_words:
            continue
        elif not wordnet.synsets(token):
            # Not an English word
            continue
        else:
            # English word
            filtered_text.append(token)
    return filtered_text
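The filtering logic above can be tried out without WordNet by passing the vocabulary in explicitly. This is only an illustrative sketch: `keep_known_words`, the toy vocabulary and the toy stopword set are all hypothetical, and a plain Python set stands in for the WordNet lookup. It also reports the share of tokens removed, which gives a rough sense of how noisy the OCR text is.

```python
def keep_known_words(tokens, vocabulary, stop_words):
    """Keep tokens that are longer than one character, are not stopwords,
    and appear in the given vocabulary (a stand-in for a WordNet lookup)."""
    return [
        token for token in tokens
        if len(token) > 1 and token not in stop_words and token in vocabulary
    ]


# Toy data: "thcatre" imitates an OCR misreading of "theatre".
toy_vocabulary = {"theatre", "royal", "evening", "the"}
toy_stop_words = {"the"}
tokens = ["the", "thcatre", "royal", "a", "evening", "theatre"]

kept = keep_known_words(tokens, toy_vocabulary, toy_stop_words)
share_removed = 1 - len(kept) / len(tokens)

print(kept)           # ['royal', 'evening', 'theatre']
print(share_removed)  # 0.5
```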
In general, common words known as stopwords are removed from text, since they act as noise in text-analysis algorithms.
from nltk.corpus import stopwords
nltk.download('stopwords')
# A set gives fast membership tests when filtering a large number of tokens
stop_words = set(stopwords.words('english'))
# Custom stopwords added for this analysis
stop_words.update(['news', 'say', 'use', 'not', 'would', 'could', '_', 'be', 'know', 'good', 'go', 'get', 'do', 'took', 'time', 'year',
                   'done', 'try', 'many', 'some', 'nice', 'thank', 'think', 'see', 'rather', 'easy', 'easily', 'lot', 'lack', 'make', 'want', 'seem', 'run', 'need', 'even', 'right', 'line', 'also', 'may', 'take', 'come', 'new', 'said', 'like', 'people'])
def remove_stop_words(text):
    return [word for word in text if word not in stop_words]
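One way to choose corpus-specific stopwords such as the ones added above is to look at the tokens that appear in the largest number of documents. The sketch below does this with the standard library only. The function name `most_common_tokens` and the toy documents are hypothetical and are not taken from the playbills data.

```python
from collections import Counter


def most_common_tokens(tokenized_docs, n=3):
    """Return the n tokens that occur in the most documents,
    together with the number of documents containing each."""
    document_frequency = Counter()
    for tokens in tokenized_docs:
        # set() so that a token is counted once per document
        document_frequency.update(set(tokens))
    return document_frequency.most_common(n)


toy_docs = [
    ["theatre", "royal", "evening", "theatre"],
    ["theatre", "comedy", "evening"],
    ["theatre", "tragedy"],
]
top_tokens = most_common_tokens(toy_docs, n=2)
print(top_tokens)  # [('theatre', 3), ('evening', 2)]
```

Tokens that appear in nearly every document carry little information for separating topics, so they are natural candidates for a custom stopword list.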
We create a function that chains all of the steps above:
def apply_all(text):
    """
    Apply all of the functions above in sequence.
    """
    return stem_words(remove_stop_words(remove_non_english_words(initial_clean(text))))
Finally, we process the original text by passing this function to the pandas apply method.
# Clean the playbill texts and create the new column "tokenized_text"
import time
t1 = time.time()
playbills['tokenized_text'] = playbills['original_text'].apply(apply_all)
t2 = time.time()
print("Time to clean and tokenize", len(playbills), "playbills:", (t2-t1)/60, "min")
playbills.head()
Topic modeling with LDA (Latent Dirichlet Allocation) requires a dictionary and a corpus. This example uses the gensim library to build both.
# LDA
import gensim
from gensim import corpora, models, similarities
tokenized = playbills['tokenized_text']
# Create the term dictionary of the corpus, where each unique term is assigned an index
dictionary = corpora.Dictionary(tokenized)
# Keep terms that occur in at least 1 document (no_below=1 sets no lower limit)
# and in no more than 80% of the documents
dictionary.filter_extremes(no_below=1, no_above=0.8)
# Convert each tokenized document to a bag-of-words vector
corpus = [dictionary.doc2bow(tokens) for tokens in tokenized]
#print(corpus[:1])
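To make the two data structures concrete, the following sketch builds a token-to-id mapping and a bag-of-words corpus in plain Python. It illustrates the idea only and is not gensim's implementation: the function names are hypothetical, ids are assigned alphabetically here (gensim assigns them differently), and the `no_above` check is a simplified version of the document-frequency filter.

```python
from collections import Counter


def build_vocabulary(tokenized_docs, no_above=0.8):
    """Map each token to an integer id, skipping tokens that occur in
    more than `no_above` (a fraction) of the documents."""
    document_frequency = Counter()
    for tokens in tokenized_docs:
        document_frequency.update(set(tokens))
    max_docs = no_above * len(tokenized_docs)
    kept = sorted(t for t, df in document_frequency.items() if df <= max_docs)
    return {token: idx for idx, token in enumerate(kept)}


def to_bag_of_words(tokens, token2id):
    """Return sorted (token_id, count) pairs for tokens in the vocabulary."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


toy_docs = [
    ["theatre", "royal", "comedy", "comedy"],
    ["theatre", "tragedy"],
    ["theatre", "royal", "farce"],
]
# "theatre" occurs in all 3 documents (more than 80%), so it is filtered out.
token2id = build_vocabulary(toy_docs, no_above=0.8)
toy_corpus = [to_bag_of_words(doc, token2id) for doc in toy_docs]

print(token2id)    # {'comedy': 0, 'farce': 1, 'royal': 2, 'tragedy': 3}
print(toy_corpus)  # [[(0, 2), (2, 1)], [(3, 1)], [(1, 1), (2, 1)]]
```

Each document becomes a list of (id, count) pairs, which is the general shape of input that the LDA model consumes.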
In this step, num_topics is the number of topics to create and passes is the number of times the algorithm iterates over the entire corpus. Running the LDA algorithm produces the topics.
import warnings
warnings.simplefilter("ignore", DeprecationWarning)
#LDA
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics = 5, id2word=dictionary, passes=15)
ldamodel.save('model_combined.gensim')
topics = ldamodel.print_topics(num_words=4)
for topic in topics:
    print(topic)
This output shows the 5 topics created and the 4 words that best describe each topic. From the output above we could guess that each topic and its corresponding words revolve around a common theme (e.g., topic 2 is related to bologna and money).
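Each topic is printed as a string of weighted terms, which is awkward to compare or tabulate. The sketch below parses such a string into (word, weight) pairs. It assumes the `weight*"word" + ...` format produced by print_topics; the helper name `parse_topic` is hypothetical, and the example string is made up rather than taken from the playbills model.

```python
import re

# Matches one `weight*"word"` term in a topic description string.
TERM_PATTERN = re.compile(r'([0-9.]+)\*"([^"]+)"')


def parse_topic(description):
    """Split a topic description string into (word, weight) pairs."""
    return [
        (word, float(weight))
        for weight, word in TERM_PATTERN.findall(description)
    ]


# Made-up string in the same format; not a result from the playbills model.
example = '0.031*"theatr" + 0.024*"even" + 0.019*"royal" + 0.015*"perform"'
pairs = parse_topic(example)
words = [word for word, _ in pairs]

print(words)  # ['theatr', 'even', 'royal', 'perform']
```

With the terms in this form, the topics can be collected into a pandas DataFrame for side-by-side inspection.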