Notebook

Latent semantic analysis¶

In [1]:

import collections
import re

import numpy

%pylab
%matplotlib inline

Using matplotlib backend: Qt5Agg
Populating the interactive namespace from numpy and matplotlib

Introduction¶

In the previous lecture we introduced the notion of a term-document matrix, which summarizes word co-occurence statistics for multiple "documents". Each row $t_i$ in a term-document matrix represents a term (i.e., a word or stem) and the frequency with which it occurs in each document. And, each column $d_j$ in such a matrix represents a document and the frequency with which it employs each term. A term-document matrix thus contains a great deal of information about the associations between words and between documents (in the collection of documents that make it up).

More specifically (and wonkishly), given a term-document matrix $X$ with $T$ terms and $D$ documents:

the dot product of two rows of terms ( $t{_i}{^\top} t{_q}$ ) in $X$ is the correlation between terms $t_i$ and $t_j$ , and the matrix $X X^\top$ contains all such dot products
the dot product of two columns of documents ( $d{_j}{^\top} d{_q}$ ) in $X$ is the correlation between two documents, and the matrix $X^\top X$ contains all such dot products

Latent semantic analysis (LSA), also sometimes known as latent semantic indexing (LSI), is a method to exploit information in term-document matrices using first principles from linear algebra. In particular, we use dimensionality reduction techniques to create a much-reduced form of the term-document matrix, and then use this to project terms and documents into a low-dimensionality "topic space", in which we can perform basic clustering and comparison of both terms and documents.

Constructing a term-document matrix¶

The following example comes from Deerwester et al. 1990. The first set of "documents" (actually, paper titles) are related to human-computer interaction (HCI), and the second are related to graph theory. We assume that terms which are not "underlined" (e.g., _term_) are filtered out (as stopwords or due to low term or document frequency.)

In [2]:

# HCI-related documents
c0 = """
_Human_ machine _interface_ for Lab ABC _computer_ applications
A _survey_ of _user_ opinion of _computer_ _system_ _response_ _time_
The _EPS_ _user_ _interface_ management _system_
_System_ and _human_ _system_ engineering testing of _EPS_
Relation of _user_-perceived _response_ _time_ to error measurement
"""
# Graph-theory-related documents
c1 = """
The generation of random binary unordered _trees_
The intersection _graph_ of paths in _trees_
_Graph_ _minors_ IV: Widths of _trees_ and well-quasi-ordering
_Graph_ _minors_: A _survey_
"""

The following function constructs a dense (i.e., full-rank) term-document matrix from the corpus object.

In [3]:

def termdoc_index(corpus):
    """Constructs a dense term-document matrix and a term index."""
    # Many things you might want to do here, such as filtering by
    # DF or TF thresholds, are not implemented yet.
    terms = set()  # To populate a term index.
    termdoc_sparse = []  # To populate a dense t-d matrix.
    for doc in corpus:
        # Computes term frequencies for this document.
        column_sparse = collections.Counter(doc)
        # Saves term frequencies.
        termdoc_sparse.append(column_sparse)
        # Adds new terms to term set.
        terms.update(column_sparse.keys())
    # Converts term set to index.
    index = {term: i for (i, term) in enumerate(terms)}
    # Builds dense matrix.
    termdoc_dense = numpy.zeros((len(terms), len(termdoc_sparse)))
    for (j, column_sparse) in enumerate(termdoc_sparse):
        # A pointer to a column in the term-document matrix.
        column_dense = termdoc_dense[:, j]
        for (term, freq) in column_sparse.items():
            i = index[term]
            column_dense[i] = freq
            # Equivalently: `termdoc_dense[i, j] = freq`.
    return (termdoc_dense, index)


def prep_corpus(lines):
    for line in lines.split("\n"):
        if not line:
            continue
        yield [term.upper() for term in re.findall(r"_(.*?)_", line)]
    
    
corpus = prep_corpus(c0 + c1)
(X, index) = termdoc_index(corpus)
print(X)

[[1. 1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 1. 0. 0. 0. 0.]
 [0. 1. 0. 0. 1. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 1.]
 [1. 0. 1. 0. 0. 0. 0. 0. 0.]
 [0. 1. 1. 2. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 1. 1.]
 [0. 0. 1. 1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 1. 1.]
 [1. 0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 1. 1. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 1. 1. 0.]]

We'll also hold on to some pointers to terms and documents, for later.

In [4]:

t_human = X[index["HUMAN"], :]
t_user = X[index["USER"], :]
t_graph = X[index["GRAPH"], :]
d_ABC = X[:, 0]
d_response = X[:, 4]
d_survey = X[:, 8]

Dimensionality reduction¶

One problem when working in a larger scale (than this toy example) is that the dense term-document matrix grows very rapidly, even if we remove terms with low term frequency or low document frequency. So we wish to generate a low-rank approximation of this matrix. This is accomplished using a matrix factorization technique known as singular value decomposition defined as follows. The singular value decomposition of a matrix $X$ is given by three matrices $U$ , $S$ , and $V$ such that

$\hat{X} = U S V^\top$

where $U$ , $S$ , and $V$ are all matrices. In other words, $X$ can be approximated by the product of three square matrices.

In [5]:

(U, S_diag, V) = numpy.linalg.svd(X, full_matrices=False)
S = numpy.diag(S_diag)
assert numpy.allclose(X, U @ S @ V)

In this decomposition, $U$ is a orthogonal matrix; $S$ is a diagonal square matrix; $V$ is a orthogonal square matrix with the dimensionality determined by the number of documents.

In [6]:

print(f"U dimensionality:\t{U.shape}")
print(f"S dimensionality:\t{S.shape}")
print(f"V dimensionality:\t{V.shape}")

U dimensionality:	(12, 9)
S dimensionality:	(9, 9)
V dimensionality:	(9, 9)

So, the dimensionality of the SVD approximation grows quadratically both in the number of terms and number of documents. However, we can hold this constant by throwing away all but first $k$ terms in the approximation. This results in a new approximation

$\hat{X_k} = U_k S_k V{_k}{^\top}$

which is known to be the optimal $k$ -dimensional approximation. Here, we will use $k = 2$ .

In [7]:

k = 2
U_k = U[:, :k]
# S_diag is just an array...and going forward, 
# we only need its inverse, so we'll compute that.
invS_k = numpy.linalg.inv(numpy.diag(S_diag[:k]))
# Equivalently, `inv(diag(S[:k]))`
V_k = V[:, :k]

The "topic space" translation¶

We can now translate each document into the $k$ -dimensional "topic space" as follows. If $d_j$ is a document column in $X$ , then

$\hat{d_j} = {S_k}^{-1} U{_k}{^\top} d_j~.$

In [8]:

def doc_translate(d, U_k, invS_k):
    """Translates a document into the topic space."""
    return invS_k @ U_k.T @ d.T


v_ABC = doc_translate(d_ABC, U_k, invS_k)
v_response = doc_translate(d_response, U_k, invS_k)
v_survey = doc_translate(d_survey, U_k, invS_k)
print(v_ABC)
print(v_response)

[-0.1973928   0.05591352]
[-0.27946911 -0.10677472]

The same can be done for novel queries; we simply treat each query $d_q$ as if it were a new document and apply the same translation.

We also can translate terms into topic space. If $t_i$ is a term row in $X$ , then

$\hat{t_i} = {S_k}^{-1} V{_k}{^\top} t_i~.$

In [9]:

def term_translate(t, invS_k, V_k):
    """Translates a term into the topic space."""
    return invS_k @ V_k.T @ t


v_human = term_translate(t_human, invS_k, V_k)
v_user = term_translate(t_user, invS_k, V_k)
v_graph = term_translate(t_graph, invS_k, V_k)
print(v_human)
print(v_user)

[ 0.22520754 -0.22714763]
[-0.02994261  0.21169323]

Similarity in topic space¶

Most importantly, we can compare pairs of vectors in the topic space $\textbf{v}_1$ , $\textbf{v}_2$ , regardless of their source (terms, queries, or documents). We assume the relatedness of two topic-space vector is measured by the angle between the two vectors. More specifically, we compute the cosine of this angle, a measure called cosine similarity, which is defined as

$\cos(\textbf{v}_1, \textbf{v}_2) = \frac{\textbf{v}_1~\cdot~\textbf{v}_2}{\|\textbf{v}_1\|_2 \times \|\textbf{v}_2\|_2}$

where $\|\textbf{v}\|_2$ represents the Euclidean norm of $\textbf{v}$ . This measure ranges from $[-1, 1]$ where $1$ indicates vector identity and $-1$ indicates maximal vector dissimilarity.

In [10]:

def cosine_similarity(v1, v2):
    """Computes cosine similarity between two vectors."""
    numerator = v1 @ v2
    denominator = numpy.linalg.norm(v1) * numpy.linalg.norm(v2)
    return numerator / denominator

We can use this measure to estimate document and term similarities.

In [11]:

# Term similarities.
print(f"HUMAN v. USER:\t{cosine_similarity(v_human, v_user): .4f}")
print(f"HUMAN v. GRAPH:\t{cosine_similarity(v_human, v_graph): .4f}")

HUMAN v. USER:	-0.8017
HUMAN v. GRAPH:	-0.9585

In [12]:

# Document similarities.
print("'ABC' article v. 'RESPONSE' article:\t"
      f"{cosine_similarity(v_ABC, v_response): .4f}")
print("'ABC' article v. 'SURVEY' article:\t"
      f"{cosine_similarity(v_ABC, v_survey): .4f}")

'ABC' article v. 'RESPONSE' article:	 0.8015
'ABC' article v. 'SURVEY' article:	-0.1223

When $k = 2$ , we can also visualize terms and documents using a Cartesian coordinates. (In practice, we usually choose a much larger value of $k$ .)

In [13]:

def plot_2d_translations(translations):
    plot([0, 0], [0, 0], ".")
    for translation in translations:
        plot([0, translation[0]], [0, translation[1]], "-")

In [14]:

plot_2d_translations([v_human, v_user, v_graph])
# orange: 'HUMAN'
# green: 'USER'
# red: 'GRAPH'

In [15]:

plot_2d_translations([v_ABC, v_response, v_survey])
# orange: 'ABC'
# green: 'RESPONSE'
# red: 'SURVEY'

Efficient and feasible LSA¶

A limitation of the LSA method illustrated above is that the naïve approach requires a full-rank term-document matrix which fits in memory; though this matrix is relatively sparse, the dense variant often grows on both dimensions every time a new document is entered! However, there exist many incremental (i.e., one document at a time) approximations of SVD, and one fast variant (Brand 2006) is implemented in the Gensim library (Řehůřek and Sojka 2010). For more information, see a walkthrough of the Deerwester et al. example in Gensim here.

References¶

M. Brand. 2006. Fast low-rank modifications of the thin singular value decomposition. Linear Algebra and its Applications 415(1): 20-30.

S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. 1990. Indexing by latent semantic analysis. 190. Journal of the American Society for Information Science 41(6): 391-407.

R. Řehůřek and P. Sojka. 2010. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, 45-50.