import collections
import re
import numpy
%pylab
%matplotlib inline
Using matplotlib backend: Qt5Agg Populating the interactive namespace from numpy and matplotlib
In the previous lecture we introduced the notion of a term-document matrix, which summarizes word co-occurence statistics for multiple "documents". Each row ti in a term-document matrix represents a term (i.e., a word or stem) and the frequency with which it occurs in each document. And, each column dj in such a matrix represents a document and the frequency with which it employs each term. A term-document matrix thus contains a great deal of information about the associations between words and between documents (in the collection of documents that make it up).
More specifically (and wonkishly), given a term-document matrix X with T terms and D documents:
Latent semantic analysis (LSA), also sometimes known as latent semantic indexing (LSI), is a method to exploit information in term-document matrices using first principles from linear algebra. In particular, we use dimensionality reduction techniques to create a much-reduced form of the term-document matrix, and then use this to project terms and documents into a low-dimensionality "topic space", in which we can perform basic clustering and comparison of both terms and documents.
The following example comes from Deerwester et al. 1990. The first set of "documents" (actually, paper titles) are related to human-computer interaction (HCI), and the second are related to graph theory. We assume that terms which are not "underlined" (e.g., _term_
) are filtered out (as stopwords or due to low term or document frequency.)
# HCI-related documents
c0 = """
_Human_ machine _interface_ for Lab ABC _computer_ applications
A _survey_ of _user_ opinion of _computer_ _system_ _response_ _time_
The _EPS_ _user_ _interface_ management _system_
_System_ and _human_ _system_ engineering testing of _EPS_
Relation of _user_-perceived _response_ _time_ to error measurement
"""
# Graph-theory-related documents
c1 = """
The generation of random binary unordered _trees_
The intersection _graph_ of paths in _trees_
_Graph_ _minors_ IV: Widths of _trees_ and well-quasi-ordering
_Graph_ _minors_: A _survey_
"""
The following function constructs a dense (i.e., full-rank) term-document matrix from the corpus
object.
def termdoc_index(corpus):
"""Constructs a dense term-document matrix and a term index."""
# Many things you might want to do here, such as filtering by
# DF or TF thresholds, are not implemented yet.
terms = set() # To populate a term index.
termdoc_sparse = [] # To populate a dense t-d matrix.
for doc in corpus:
# Computes term frequencies for this document.
column_sparse = collections.Counter(doc)
# Saves term frequencies.
termdoc_sparse.append(column_sparse)
# Adds new terms to term set.
terms.update(column_sparse.keys())
# Converts term set to index.
index = {term: i for (i, term) in enumerate(terms)}
# Builds dense matrix.
termdoc_dense = numpy.zeros((len(terms), len(termdoc_sparse)))
for (j, column_sparse) in enumerate(termdoc_sparse):
# A pointer to a column in the term-document matrix.
column_dense = termdoc_dense[:, j]
for (term, freq) in column_sparse.items():
i = index[term]
column_dense[i] = freq
# Equivalently: `termdoc_dense[i, j] = freq`.
return (termdoc_dense, index)
def prep_corpus(lines):
for line in lines.split("\n"):
if not line:
continue
yield [term.upper() for term in re.findall(r"_(.*?)_", line)]
corpus = prep_corpus(c0 + c1)
(X, index) = termdoc_index(corpus)
print(X)
[[1. 1. 0. 0. 0. 0. 0. 0. 0.] [0. 1. 0. 0. 1. 0. 0. 0. 0.] [0. 1. 0. 0. 1. 0. 0. 0. 0.] [0. 1. 0. 0. 0. 0. 0. 0. 1.] [1. 0. 1. 0. 0. 0. 0. 0. 0.] [0. 1. 1. 2. 0. 0. 0. 0. 0.] [0. 0. 0. 0. 0. 0. 0. 1. 1.] [0. 0. 1. 1. 0. 0. 0. 0. 0.] [0. 0. 0. 0. 0. 0. 1. 1. 1.] [1. 0. 0. 1. 0. 0. 0. 0. 0.] [0. 1. 1. 0. 1. 0. 0. 0. 0.] [0. 0. 0. 0. 0. 1. 1. 1. 0.]]
We'll also hold on to some pointers to terms and documents, for later.
t_human = X[index["HUMAN"], :]
t_user = X[index["USER"], :]
t_graph = X[index["GRAPH"], :]
d_ABC = X[:, 0]
d_response = X[:, 4]
d_survey = X[:, 8]
One problem when working in a larger scale (than this toy example) is that the dense term-document matrix grows very rapidly, even if we remove terms with low term frequency or low document frequency. So we wish to generate a low-rank approximation of this matrix. This is accomplished using a matrix factorization technique known as singular value decomposition defined as follows. The singular value decomposition of a matrix X is given by three matrices U, S, and V such that
ˆX=USV⊤where U, S, and V are all matrices. In other words, X can be approximated by the product of three square matrices.
(U, S_diag, V) = numpy.linalg.svd(X, full_matrices=False)
S = numpy.diag(S_diag)
assert numpy.allclose(X, U @ S @ V)
In this decomposition, U is a orthogonal matrix; S is a diagonal square matrix; V is a orthogonal square matrix with the dimensionality determined by the number of documents.
print(f"U dimensionality:\t{U.shape}")
print(f"S dimensionality:\t{S.shape}")
print(f"V dimensionality:\t{V.shape}")
U dimensionality: (12, 9) S dimensionality: (9, 9) V dimensionality: (9, 9)
So, the dimensionality of the SVD approximation grows quadratically both in the number of terms and number of documents. However, we can hold this constant by throwing away all but first k terms in the approximation. This results in a new approximation
^Xk=UkSkVk⊤which is known to be the optimal k-dimensional approximation. Here, we will use k=2.
k = 2
U_k = U[:, :k]
# S_diag is just an array...and going forward,
# we only need its inverse, so we'll compute that.
invS_k = numpy.linalg.inv(numpy.diag(S_diag[:k]))
# Equivalently, `inv(diag(S[:k]))`
V_k = V[:, :k]
We can now translate each document into the k-dimensional "topic space" as follows. If dj is a document column in X, then
^dj=Sk−1Uk⊤dj .def doc_translate(d, U_k, invS_k):
"""Translates a document into the topic space."""
return invS_k @ U_k.T @ d.T
v_ABC = doc_translate(d_ABC, U_k, invS_k)
v_response = doc_translate(d_response, U_k, invS_k)
v_survey = doc_translate(d_survey, U_k, invS_k)
print(v_ABC)
print(v_response)
[-0.1973928 0.05591352] [-0.27946911 -0.10677472]
The same can be done for novel queries; we simply treat each query dq as if it were a new document and apply the same translation.
We also can translate terms into topic space. If ti is a term row in X, then
^ti=Sk−1Vk⊤ti .def term_translate(t, invS_k, V_k):
"""Translates a term into the topic space."""
return invS_k @ V_k.T @ t
v_human = term_translate(t_human, invS_k, V_k)
v_user = term_translate(t_user, invS_k, V_k)
v_graph = term_translate(t_graph, invS_k, V_k)
print(v_human)
print(v_user)
[ 0.22520754 -0.22714763] [-0.02994261 0.21169323]
Most importantly, we can compare pairs of vectors in the topic space v1, v2, regardless of their source (terms, queries, or documents). We assume the relatedness of two topic-space vector is measured by the angle between the two vectors. More specifically, we compute the cosine of this angle, a measure called cosine similarity, which is defined as
cos(v1,v2)=v1 ⋅ v2‖v1‖2×‖v2‖2where ‖v‖2 represents the Euclidean norm of v. This measure ranges from [−1,1] where 1 indicates vector identity and −1 indicates maximal vector dissimilarity.
def cosine_similarity(v1, v2):
"""Computes cosine similarity between two vectors."""
numerator = v1 @ v2
denominator = numpy.linalg.norm(v1) * numpy.linalg.norm(v2)
return numerator / denominator
We can use this measure to estimate document and term similarities.
# Term similarities.
print(f"HUMAN v. USER:\t{cosine_similarity(v_human, v_user): .4f}")
print(f"HUMAN v. GRAPH:\t{cosine_similarity(v_human, v_graph): .4f}")
HUMAN v. USER: -0.8017 HUMAN v. GRAPH: -0.9585
# Document similarities.
print("'ABC' article v. 'RESPONSE' article:\t"
f"{cosine_similarity(v_ABC, v_response): .4f}")
print("'ABC' article v. 'SURVEY' article:\t"
f"{cosine_similarity(v_ABC, v_survey): .4f}")
'ABC' article v. 'RESPONSE' article: 0.8015 'ABC' article v. 'SURVEY' article: -0.1223
When k=2, we can also visualize terms and documents using a Cartesian coordinates. (In practice, we usually choose a much larger value of k.)
def plot_2d_translations(translations):
plot([0, 0], [0, 0], ".")
for translation in translations:
plot([0, translation[0]], [0, translation[1]], "-")
plot_2d_translations([v_human, v_user, v_graph])
# orange: 'HUMAN'
# green: 'USER'
# red: 'GRAPH'
plot_2d_translations([v_ABC, v_response, v_survey])
# orange: 'ABC'
# green: 'RESPONSE'
# red: 'SURVEY'
A limitation of the LSA method illustrated above is that the naïve approach requires a full-rank term-document matrix which fits in memory; though this matrix is relatively sparse, the dense variant often grows on both dimensions every time a new document is entered! However, there exist many incremental (i.e., one document at a time) approximations of SVD, and one fast variant (Brand 2006) is implemented in the Gensim library (Řehůřek and Sojka 2010). For more information, see a walkthrough of the Deerwester et al. example in Gensim here.
M. Brand. 2006. Fast low-rank modifications of the thin singular value decomposition. Linear Algebra and its Applications 415(1): 20-30.
S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. 1990. Indexing by latent semantic analysis. 190. Journal of the American Society for Information Science 41(6): 391-407.
R. Řehůřek and P. Sojka. 2010. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, 45-50.