DS4420: Topic modeling w/PLSA¶

In-class exercise implementing parameters estimation via Expectation-Maximization.

The model¶

$z_{dn} \sim \text{Categorical}(\theta_d)$

$x_{dn}|z_{dn} \sim \text{Categorical}(\beta_{z_{dn}})$

Suppose we have some data; we'll work with 20newgroups, which you will see again in homeworks!

In [1]:

from sklearn.datasets import fetch_20newsgroups
categories = ['alt.atheism', 'soc.religion.christian']
D = fetch_20newsgroups(subset='train', shuffle=True, 
                                       categories=categories,
                                       remove=('headers', 'footers', 'quotes'))

In [2]:

D = D["data"]
print(D[0][:500])

Library of Congress to Host Dead Sea Scroll Symposium April 21-22
 To: National and Assignment desks, Daybook Editor
 Contact: John Sullivan, 202-707-9216, or Lucy Suddreth, 202-707-9191
          both of the Library of Congress

   WASHINGTON, April 19  -- A symposium on the Dead Sea 
Scrolls will be held at the Library of Congress on Wednesday,
April 21, and Thursday, April 22.  The two-day program, cosponsored
by the library and Baltimore Hebrew University, with additional
support from the Pr

In [3]:

# how many documents?
n_docs = len(D)
print(n_docs)

In [4]:

# how many words?
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(max_features=5000) # limit to 5k features for now
X = vect.fit_transform(D)
X.shape

Out[4]:

(1079, 5000)

In [5]:

v = X.shape[1]

Let's implement Probabalistic Latent Semantic Analysis (PLSA)!¶

In [6]:

n_topics = 3 # hyperparameter

In [7]:

import numpy as np

# Initialize variables: theta, beta. 
# what are our shapes?
theta = np.zeros((n_docs, n_topics)) # topic probabilities per document
beta = np.zeros((v, n_topics))  # word probabilities per topic
phi = np.zeros((n_docs, v, n_topics)) # word-topic assignment probabilities

Aside: How should we initialize these variables?? What do we want? Could use random values, but what are the potential issues?

Idea: Use a Dirichlet distribution. Why? Because it will provide row-vectors that sum to 1, and we can control how peaky these are easily. We'll return to this in a bit; for now, just think of it as a convienent way to initialize row vectors that sum to 1.

In [8]:

def init_matrix_rows(m, scale=1.0):
    cols = m.shape[1]
    for row_idx in range(m.shape[0]):
        r = np.random.dirichlet(np.ones(cols)/scale, size=1)
        m[row_idx,:] = r

In [9]:

init_matrix_rows(theta, scale=1)
theta[0,:]

Out[9]:

array([0.27429133, 0.21086857, 0.51484011])

In [10]:

init_matrix_rows(beta)

In [11]:

# we have an extra dimension for phi
for i in range(n_docs):
    init_matrix_rows(phi[i,:,:], scale=.01)

In [12]:

phi[0,0,:]

Out[12]:

array([0.37974323, 0.36414193, 0.25611484])

In [13]:

X = X.todense() # convert sparse matrix to dense

In [14]:

X.shape

Out[14]:

(1079, 5000)

Pull above together into an initialization method

In [15]:

def init_params(s=0.1):
    theta = np.zeros((n_docs, n_topics)) # topic probabilities per document
    beta = np.zeros((v, n_topics)) # word probabilities per topic
    phi = np.zeros((n_docs, v, n_topics)) # word-topic assignment probabilities

    init_matrix_rows(theta, scale=s)
    init_matrix_rows(beta, scale=s)
    for i in range(n_docs):
        init_matrix_rows(phi[i,:,:], scale=s)
    

E-step¶

$\phi_{dnk}=\frac{\theta_{dk} \beta_{kv}}{\sum_l \theta_{dl} \beta_{lv}}$

In [16]:

# E-Step: Update word-topic assignments

def e_step():
    # update \phi (word-topic params), given current parameters
    for d in range(n_docs):
        for w in range(v):
            # \phi is distribution over topics for 
            # word w in doc d.
            z = np.dot(theta[d,:], beta[w,:])
            # the top here is an element-wise mult
            phi[d,w,:] = (theta[d,:] * beta[w,:]) / z 
    print("e-step done.")

M-step¶

$\beta_{kv} = \frac{N_{kv}}{\sum_w N_{kw}}$ Where $N_{kv} = \sum_{d,n} \phi_{dnk} x_{nv}$

$\theta_k = \frac{N_k}{\sum_l N_l}$ Where $N_l = \sum_{d,n} \phi_{dnk}$

In [17]:

# M-Step: Update parameters

def m_step():
    for k in range(n_topics):
        # \beta: word-topic probabilities
        
        # total expectation of words for this topic
        N_kz = np.sum(phi[:,:,k])
        for w in range(v):
            # p(w|t) = [expected count of w in k] / [expected count of k over all words]
            beta[w,k] = np.sum(phi[:,w,k]) / N_kz
        
        for d in range(n_docs):
            # \theta: document-topic probabilities
            
            # total mass over topics in this doc
            z_k = np.sum(phi[d,:,:])

            theta[d, k] = np.sum(phi[d,:,k]) / z_k
    print("m-step done")

In [18]:

def fit(iters=2):
    init_params(s=0.1)
    for iter in range(iters):
        print(iter)
        e_step()
        m_step()

In [19]:

fit(iters=10)

0
e-step done.
m-step done
1
e-step done.
m-step done
2
e-step done.
m-step done
3
e-step done.
m-step done
4
e-step done.
m-step done
5
e-step done.
m-step done
6
e-step done.
m-step done
7
e-step done.
m-step done
8
e-step done.
m-step done
9
e-step done.
m-step done

In [20]:

e_step()

e-step done.

In [21]:

m_step()

m-step done

In [22]:

theta[3,:]

Out[22]:

array([0.32411854, 0.3643046 , 0.31157687])

In [23]:

theta[10,:]

Out[23]:

array([0.24457848, 0.30548981, 0.44993172])

Let's inspect the topics. We're particularly interested in $\beta$ parameters, which give topic-word probabilities.

In [24]:

def get_top_words_for_topic(k, m=10):
    top_word_indices = np.argsort(beta[:,k])[-m:]
    
    # now retrieve the actual tokens from our vectorizer
    top_words = [vect.get_feature_names()[idx] for idx in top_word_indices]
    return top_words

In [25]:

np.argsort(beta[1,:])[-20:]

Out[25]:

array([2, 1, 0])

In [26]:

np.sort(beta[1,:])[:20]

Out[26]:

array([0.00017543, 0.00020041, 0.00022405])

In [28]:

get_top_words_for_topic(0,m=20)

Out[28]:

['herod',
 'church',
 'ezra',
 'reasoning',
 'perform',
 'apocrypha',
 'former',
 'unlimited',
 'open',
 'division',
 'aimed',
 'orthodox',
 'some',
 'appropriate',
 'chooses',
 'offended',
 'simply',
 'cast',
 'baltimore',
 'british']

In [ ]: