CS 562/662 (Natural Language Processing): Latent semantic analysis (Kyle Gorman)

In [1]:
from re import findall
from collections import Counter
from numpy import allclose, diag, dot, zeros
from numpy.linalg import norm, svd

%pylab
%matplotlib inline
Using matplotlib backend: MacOSX
Populating the interactive namespace from numpy and matplotlib

Introduction

In the previous lecture we introduced the notion of a term-document matrix, which summarizes word coocurence statistics for multiple "documents". Each row $t_i$ in a term-document matrix represents a term (i.e., a word or stem) and the frequency with which it occurs in each document. And, each column $d_j$ in such a matrix represents a document and the frequency with which it employs each term. A term-document matrix thus contains a great deal of information about the associations between words and between documents (in the collection of documents that make it up).

More specifically (and wonkishly), given a term-document matrix $X$ with $T$ terms and $D$ documents:

  • the dot product of two rows of terms ($t{_i}{^T} t{_q}$) in $X$ is the correlation between terms $t_i$ and $t_j$, and the matrix $X X^T$ contains all such dot products
  • the dot product of two columns of documents ($d{_j}{^T} d{_q}$) in $X$ is the correlation between two documents, and the matrix $X^T X$ contains all such dot products

Latent semantic analysis (LSA), also sometimes known as latent semantic indexing (LSI), is a method to exploit information in term-document matrices using first principles from linear algebra. In particular, we use dimensionality reduction techniques to create a much-reduced form of the term-document matrix, and then use this to project terms and documents into a low-dimensionality "topic space", in which we can perform basic clustering and comparison of both terms and documents.

Constructing a term-document matrix

The following example comes from Deeerwester et al. 1990. The first set of "documents" (actually, paper titles) are related to human-computer interaction (HCI), and the second are related to graph theory. We assume that terms which are not "underlined" (e.g., _term_) are filtered out (as stopwords or due to low term or document frequency.)

In [2]:
# HCI-related documents
c0 = """
_Human_ machine _interface_ for Lab ABC _computer_ applications
A _survey_ of _user_ opinion of _computer_ _system_ _response_ _time_
The _EPS_ _user_ _interface_ management _system_
_System_ and _human_ _system_ engineering testing of _EPS_
Relation of _user_-perceived _response_ _time_ to error measurement
"""
# Graph-theory-related documents
c1 = """
The generation of random binary unordered _trees_
The intersection _graph_ of paths in _trees_
_Graph_ _minors_ IV: Widths of _trees_ and well-quasi-ordering
_Graph_ _minors_: A _survey_
"""

def prep_corpus(lines):
    """
    Given `lines` (a string) of text, generate the corresponding
    corpus (a list of "documents", where each document is a list 
    of terms
    """
    for line in lines.split("\n"):
        # ignore empty lines
        if not line:
            continue
        yield [term.upper() for term in findall(r"_(.*?)_", line)]
    
corpus = prep_corpus(c0 + c1)

The following function constructs a dense (i.e., full-rank) term-document matrix from the corpus object.

In [3]:
def termdoc_index(corpus):
    """
    Given a `corpus` (a list of documents, which are themselves 
    lists of terms), return a dense term-document matrix and a 
    term index
    
    Many things you might do (such as filter by DF or TF) are not
    implemented here
    """
    # collect sparse frequencies
    terms = set() # to populate a term index
    termdoc_sparse = [] # to populate dense a t-d matrix
    for doc in corpus:
        # compute term frequencies for this document
        column_sparse = Counter(doc)
        # save term frequencies
        termdoc_sparse.append(column_sparse)
        # add new terms to term set
        terms.update(column_sparse.iterkeys())
    # convert term set to index
    index = {term: i for (i, term) in enumerate(terms)}
    # build dense matrix
    termdoc_dense = zeros((len(terms), len(termdoc_sparse)))
    for (j, column_sparse) in enumerate(termdoc_sparse):
        # a pointer to a column in the term-document matrix:
        column_dense = termdoc_dense[:, j]
        for (term, freq) in column_sparse.iteritems():
            i = index[term]
            column_dense[i] = freq
            # equivalently: `termdoc_dense[i, j] = freq`
    return (termdoc_dense, index)

(X, index) = termdoc_index(corpus)
print(X)
[[ 0.  0.  0.  0.  0.  0.  0.  1.  1.]
 [ 0.  0.  0.  0.  0.  0.  1.  1.  1.]
 [ 0.  1.  1.  2.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  1.  1.  1.  0.]
 [ 0.  0.  1.  1.  0.  0.  0.  0.  0.]
 [ 1.  1.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.  0.  0.  0.  1.]
 [ 0.  1.  1.  0.  1.  0.  0.  0.  0.]
 [ 1.  0.  0.  1.  0.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  1.  0.  0.  0.  0.]
 [ 1.  0.  1.  0.  0.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  1.  0.  0.  0.  0.]]

We'll also hold on to some pointers to terms and documents, for later.

In [4]:
t_human = X[index["HUMAN"], :]
t_user = X[index["USER"], :]
t_graph = X[index["GRAPH"], :]
d_ABC = X[:, 0]
d_response = X[:, 4]
d_survey = X[:, 8]
print t_graph  # "GRAPH" occurs only in the graph-theory corpus
[ 0.  0.  0.  0.  0.  0.  1.  1.  1.]

Dimensionality reduction

One problem when working in a larger scale (than this toy example) is that the dense term-document matrix grows very rapidly, even if we remove terms with low term frequency or low document frequency. So we wish to generate a low-rank approximation of this matrix. This is accomplished using a matrix factorization technique known as singular value decomposition defined as follows. The singular value decomposition of a matrix $X$ is given by three matrices $U$, $\Sigma$, and $V$ such that

$$\hat{X} = U \Sigma V^T$$

where $U$, $\Sigma$, and $V$ are all matrices. In other words, $X$ can be approximated by the product of three square matrices.

In [5]:
(U, Sigma_diag, V) = svd(X, full_matrices=False)
Sigma = diag(Sigma_diag)
print allclose(X, dot(U, dot(Sigma, V)))  # not a bad approximation!
True

In this decomposition, $U$ is a orthogonal matrix; $\Sigma$ is a diagonal square matrix; $V$ is a orthogonal square matrix with the dimensionality determined by the number of documents.

In [6]:
print "U dimensionality:\t{}".format(U.shape)
print "Sigma dimensionality:\t{}".format(Sigma.shape)
print "V dimensionality:\t{}".format(V.shape)
U dimensionality:	(12, 9)
Sigma dimensionality:	(9, 9)
V dimensionality:	(9, 9)

So, the dimensionality of the SVD approximation grows quadratically both in the number of terms and number of documents. However, we can hold this constant by throwing away all but first $k$ terms in the approximation. This results in a new approximation

$$\hat{X_k} = U_k \Sigma_k V{_k}{^T}$$

which is known to be the optimal $k$-dimensional approximation. Here, we will use $k = 2$.

In [7]:
k = 2
U_k = U[:, :k]
# Sigma_diag is just an array...and going forward, 
# we only need its inverse, so we'll compute that
invSigma_k = diag(1. / Sigma_diag[:k])
# equivalently, `inv(diag(Sigma[:k]))`
V_k = V[:, :k]

The "topic space" translation

We can now translate each document into the $k$-dimensional "topic space" as follows. If $d_j$ is a document column in $X$, then

$$\hat{d_j} = {\Sigma_k}^{-1} U{_k}{^T} d_j~.$$
In [8]:
def doc_translate(d, U_k, invSigma_k):
    """
    Translate a document `t` into a topic space U \Sigma V.
    We tranpose `d` because numpy forgets that it's really a column
    vector.
    """
    return dot(dot(invSigma_k, U_k.T), d.T)

v_ABC = doc_translate(d_ABC, U_k, invSigma_k)
v_response = doc_translate(d_response, U_k, invSigma_k)
v_survey = doc_translate(d_survey, U_k, invSigma_k)
print(v_ABC)
print(v_response)
[-0.1973928  -0.05591352]
[-0.27946911  0.10677472]

The same can be done for novel queries; we simply treat each query $d_q$ as if it were a new document and apply the same translation.

We also can translate terms into topic space. If $t_i$ is a term row in $X$, then

$$\hat{t_i} = {\Sigma_k}^{-1} V{_k}{^T} t_i~.$$
In [9]:
def term_translate(t, invSigma_k, V_k):
    """
    Translate a term `t` into a topic space U_k \Sigma_k V_k 
    """
    return dot(dot(invSigma_k, V_k.T), t)

v_human = term_translate(t_human, invSigma_k, V_k)
v_user = term_translate(t_user, invSigma_k, V_k)
v_graph = term_translate(t_graph, invSigma_k, V_k)
print(v_human)
print(v_user)
[-0.34337556 -0.24969073]
[ 0.02994261 -0.21169323]

Similarity in topic space

Most importantly, we can compare pairs of vectors in the topic space $v_0$, $v_1$, regardless of their source (terms, queries, or documents). We take the relatedness of two topic-space vector to be monotonically related to the angle between the two vectors $\theta$. More specifically, we compute the cosine of this angle, a measure called cosine similarity, which is defined as

$$cos~\theta = \frac{v_0~\cdot~v_1}{\|v_0\| \|v_1\|}$$

where $\|v\|$ represents the Euclidean norm of the vector $v$. The value of this measure lines in $[-1, 1]$, where $1$ indicates vector identity and $-1$ indicates maximal vector dissimilarity.

In [10]:
def cosine_similarity(v0, v1):
    """
    Compute cosine similarity between two vectors `v0`, `v1`
    """
    numerator = dot(v0, v1)
    denominator = norm(v0) * norm(v1)
    return numerator / denominator

We can use this measure to estimate document and term similarities. As expected, the HCI terms and documents are more similar to each other than they are to graph-theory terms and documents, respectively.

In [11]:
# term similarities
print "HUMAN v. USER:\t{: .4f}".format(cosine_similarity(v_human, v_user))
print "HUMAN v. GRAPH:\t{:.4f}".format(cosine_similarity(v_human, v_graph))
HUMAN v. USER:	 0.4690
HUMAN v. GRAPH:	-0.1364
In [12]:
# document similarities
print "'ABC' article v. 'RESPONSE' article:\t{: .4f}".format(
      cosine_similarity(v_ABC, v_response))
print "'ABC' article v. 'SURVEY' article:\t{: .4f}".format(
      cosine_similarity(v_ABC, v_survey))
'ABC' article v. 'RESPONSE' article:	 0.8015
'ABC' article v. 'SURVEY' article:	-0.1223

When $k = 2$, we can also visualize terms and documents using a Cartesian coordinates. (In practice, we usually choose a much larger value of $k$.)

In [13]:
def plot_2d_translations(translations):
    plot([0, 0], [0, 0], ".")
    for translation in translations:
        plot([0, translation[0]], [0, translation[1]], "-")
In [14]:
plot_2d_translations([v_human, v_user, v_graph])
# green: 'HUMAN'
# red: 'USER'
# cyan: 'GRAPH'
In [15]:
plot_2d_translations([v_ABC, v_response, v_survey])
# green: 'ABC'
# red: 'RESPONSE'
# cyan: 'SURVEY'

Efficient and feasible LSA

A limitation of the LSA method illustrated above is that the naïve approach requires a full-rank term-document matrix which fits in memory; though this matrix is relatively sparse, the dense variant often grows on both dimensions every time a new document is entered! However, there exist many incremental (i.e., one document at a time) approximations of SVD, and one fast variant (Brand 2006) is implemented in the Gensim library (Řehůřek and Sojka 2010).

References

M. Brand. 2006. Fast low-rank modifications of the thin singular value decomposition. Linear Algebra and its Applications 415(1): 20-30.

S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. 1990. Indexing by latent semantic analysis. 190. Journal of the American Society for Information Science 41(6): 391-407.

R. Řehůřek and P. Sojka. 2010. Software framework for topic modelling with large corpora. In LREC, 45-50.