In this lecture, we'll see how to use the term-document matrix from last time, in combination with some nice algorithms from scikit-learn, to perform topic modeling. The idea of topic modeling is to find "topics" in documents: groups of words that tend to appear together. Our overall aim is to get a coarse, topic-level summary of the plot of the short book Alice's Adventures in Wonderland by Lewis Carroll.
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import nltk
from nltk.corpus import gutenberg
# need to do this once to download the data
# nltk.download('gutenberg')
Let's briefly review the steps that we took to construct our term-document matrix. First, we used the gutenberg module to read in the raw text of the book and split it into chapters.
s = gutenberg.raw("carroll-alice.txt")
chapters = s.split("CHAPTER")[1:]
Then, we created a nice, tidy data frame in which we stored the complete text of each chapter.
df = pd.DataFrame({
"chapter" : range(1, len(chapters) + 1),
"text" : chapters
})
df
| | chapter | text |
|---|---|---|
| 0 | 1 | I. Down the Rabbit-Hole\n\nAlice was beginnin... |
| 1 | 2 | II. The Pool of Tears\n\n'Curiouser and curio... |
| 2 | 3 | III. A Caucus-Race and a Long Tale\n\nThey we... |
| 3 | 4 | IV. The Rabbit Sends in a Little Bill\n\nIt w... |
| 4 | 5 | V. Advice from a Caterpillar\n\nThe Caterpill... |
| 5 | 6 | VI. Pig and Pepper\n\nFor a minute or two she... |
| 6 | 7 | VII. A Mad Tea-Party\n\nThere was a table set... |
| 7 | 8 | VIII. The Queen's Croquet-Ground\n\nA large r... |
| 8 | 9 | IX. The Mock Turtle's Story\n\n'You can't thi... |
| 9 | 10 | X. The Lobster Quadrille\n\nThe Mock Turtle s... |
| 10 | 11 | XI. Who Stole the Tarts?\n\nThe King and Quee... |
| 11 | 12 | XII\n\n Alice's Evidence\n\n\n'Here... |
Then, and this is the complex part, we used the CountVectorizer from sklearn to construct the term-document matrix. In this example, I've used a few more of the arguments for CountVectorizer. In particular, because I'd eventually like to see how topics evolve between chapters, I use the max_df argument to specify that I'd like to include only words that appear in at most 50% of the chapters.
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer(max_df = 0.5, min_df = 0, stop_words = "english")
Next, we can use this CountVectorizer to create the term-document matrix and collect it all as a nice, tidy data frame.
counts = vec.fit_transform(df['text'])
counts = counts.toarray()
count_df = pd.DataFrame(counts, columns = vec.get_feature_names_out()) # on scikit-learn < 1.0, use vec.get_feature_names()
df = pd.concat((df, count_df), axis = 1)
df.head(3)
| | chapter | text | _i_ | abide | able | absence | absurd | acceptance | accident | accidentally | ... | year | years | yelled | yelp | yer | yesterday | young | youth | zealand | zigzag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | I. Down the Rabbit-Hole\n\nAlice was beginnin... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 2 | II. The Pool of Tears\n\n'Curiouser and curio... | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 3 | III. A Caucus-Race and a Long Tale\n\nThey we... | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
3 rows × 2148 columns
Now we are ready to run our model! Topic modeling is an unsupervised machine learning framework, which means that there's no set of true labels y. So, we just need to create the predictor variables X. To do this, we drop the text and chapter columns.
X = df.drop(['text', 'chapter'], axis = 1)
There are many algorithms for topic modeling. We will use nonnegative matrix factorization, or NMF, for now. As usual with scikit-learn, the steps are easy: import the model, instantiate it, and fit it to the data.
NMF requires us to specify n_components, which is the number of topics to find. Choosing the right number of topics is a bit of an art, but there are also quantitative approaches, based on Bayesian statistics, that we won't go into here.
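One rough quantitative heuristic (not the only one, and not the Bayesian approach mentioned above) is to fit the model for several candidate topic counts and watch the reconstruction error for diminishing returns. A minimal sketch on a random nonnegative matrix:

```python
import numpy as np
from sklearn.decomposition import NMF

# random nonnegative stand-in for a term-document matrix: 12 docs x 30 words
rng = np.random.default_rng(0)
X_toy = rng.random((12, 30))

# fit NMF for several candidate topic counts and record the error
errs = []
for k in range(1, 6):
    m = NMF(n_components = k, init = "random", random_state = 0, max_iter = 1000)
    m.fit(X_toy)
    errs.append(m.reconstruction_err_)

# the error shrinks as k grows; look for the "elbow" where gains level off
print(np.round(errs, 3))
```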
from sklearn.decomposition import NMF
model = NMF(n_components = 4, init = "random", random_state = 0)
model.fit(X)
NMF(init='random', n_components=4, random_state=0)
There are two important parts of NMF. First, we have the topics themselves, which are stored in the components_ attribute of the model.
model.components_
array([[0.        , 0.        , 0.        , ..., 0.03184396, 0.        , 0.00530733],
       [0.1975168 , 0.04338739, 0.        , ..., 0.        , 0.        , 0.        ],
       [0.20127845, 0.21670879, 0.2105765 , ..., 1.11826124, 0.11323678, 0.18637687],
       [0.        , 0.00499915, 0.        , ..., 0.        , 0.00330162, 0.        ]])
model.components_.shape
(4, 2146)
Uh, what does that mean? We can think of each component as a collection of weights, one for each word. We can find the most important words in each component by finding the words with the highest weights within that component. We can do this with a handy function called np.argsort(), which returns the indices that would sort an array in increasing order, so the largest entries come last.
orders = np.argsort(model.components_, axis = 1)
orders
array([[   0, 1272, 1271, ...,  806, 1162, 1966],
       [2145,  926, 1784, ...,  244,  980, 1442],
       [1634, 1527, 1524, ...,  247,  244, 1174],
       [   0, 1277, 1274, ..., 1117,  493,  835]])
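If np.argsort() is new to you, here's a tiny stand-alone example (toy numbers, not from the model) showing what those index arrays mean:

```python
import numpy as np

a = np.array([30, 10, 20])
order = np.argsort(a)  # indices that would sort a in increasing order

print(order)     # [1 2 0]: a[1] = 10 is smallest, a[0] = 30 is largest
print(a[order])  # [10 20 30]
```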
We can then use numpy "fancy" indexing to arrange the words in the needed order.
important_words = np.array(X.columns)[orders]
important_words
array([['_i_', 'painting', 'paint', ..., 'gryphon', 'mock', 'turtle'],
       ['zigzag', 'inquisitively', 'station', ..., 'cat', 'king', 'queen'],
       ['sheep', 'riper', 'rightly', ..., 'caterpillar', 'cat', 'mouse'],
       ['_i_', 'panted', 'pairs', ..., 'march', 'dormouse', 'hatter']],
      dtype=object)
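Fancy indexing just means indexing one array with another array of indices. A minimal sketch with a hypothetical topic of three words: sorting the weights gives an index order, and applying that order to the word array lines the words up from least to most important.

```python
import numpy as np

# hypothetical words and weights for a single topic
words = np.array(["cat", "hatter", "queen"])
weights_demo = np.array([0.2, 0.9, 0.5])

order = np.argsort(weights_demo)  # [0 2 1]: least to most important
print(words[order])               # ['cat' 'queen' 'hatter']
print(words[order][-2:])          # the two most important words
```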
It's convenient to write a function to automate this for us:
def top_words(X, model, component, num_words):
    orders = np.argsort(model.components_, axis = 1)
    important_words = np.array(X.columns)[orders]
    return important_words[component][-num_words:]
top_words(X, model, 3, 5)
array(['tea', 'hare', 'march', 'dormouse', 'hatter'], dtype=object)
The next important aspect of topic modeling is the assignment of topics to documents, again expressed as weights. We can access these weights using the transform() method of the model.
weights = model.transform(X)
weights
array([[9.77501978e-05, 1.97664522e-02, 5.17821692e-01, 3.07225844e-02],
       [0.00000000e+00, 0.00000000e+00, 9.62026820e-01, 0.00000000e+00],
       [2.58343542e-04, 0.00000000e+00, 9.42349279e-01, 0.00000000e+00],
       [0.00000000e+00, 5.37128674e-03, 8.67735980e-01, 2.36338240e-04],
       [1.31577250e-02, 0.00000000e+00, 8.51527332e-01, 0.00000000e+00],
       [0.00000000e+00, 2.70835534e-01, 1.00413330e+00, 9.09882306e-02],
       [0.00000000e+00, 0.00000000e+00, 2.05082222e-02, 1.85495376e+00],
       [0.00000000e+00, 1.64643881e+00, 0.00000000e+00, 0.00000000e+00],
       [8.26943575e-01, 3.86067385e-01, 0.00000000e+00, 0.00000000e+00],
       [1.16895457e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 8.61097347e-01, 0.00000000e+00, 9.03937864e-01],
       [3.73729071e-02, 9.74766421e-01, 1.75619195e-02, 7.02019118e-02]])
fig, ax = plt.subplots(1)
ax.imshow(weights.T)
<matplotlib.image.AxesImage at 0x7f8604f90dc0>
The weights indicate the relative presence of each topic in each chapter. For example, Topic 2 is highly present in the first six chapters, but then mostly absent for the rest of the book. Topic 3 appears in Chapters 7 and 11, and so on.
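If you just want the single dominant topic per chapter, np.argmax() along the rows does it; normalizing each row to sum to 1 also makes the weights easier to compare across chapters. A minimal sketch on a hypothetical 3-chapter, 2-topic weight matrix (not the real model output):

```python
import numpy as np

# hypothetical topic-weight matrix: 3 chapters x 2 topics
w = np.array([
    [0.9, 0.1],
    [0.0, 1.2],
    [0.4, 0.3],
])

# dominant topic for each chapter
dominant = np.argmax(w, axis = 1)
print(dominant)  # [0 1 0]

# normalize each row to sum to 1, so weights are comparable across chapters
shares = w / w.sum(axis = 1, keepdims = True)
print(shares.round(2))
```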
We can also visualize the same information as a line chart. Let's add as labels some of the top words for each topic.
fig, ax = plt.subplots(1)
for i in range(4):
    ax.plot(df['chapter'], weights[:,i], label = top_words(X, model, i, 5))
ax.legend(bbox_to_anchor=(1.05, 0.65), loc="upper left")
<matplotlib.legend.Legend at 0x7f86069d70d0>
This plot allows us to easily see several major features of the plot of the novel, including the tea party with the March Hare, the Mad Hatter, and the Dormouse (Chapter 7); the croquet game in the court of the Queen of Hearts (Chapter 8); the appearance of the Mock Turtle and the Lobster (Chapters 9 and 10); and the reappearance of many characters in Chapter 11.