In [1]:

```
from re import findall
from collections import Counter
from numpy import allclose, diag, dot, zeros
from numpy.linalg import norm, svd
%pylab inline
```

In the previous lecture we introduced the notion of a *term-document matrix*, which summarizes word co-occurrence statistics for multiple "documents". Each row $t_i$ in a term-document matrix represents a *term* (i.e., a word or stem) and the frequency with which it occurs in each document. Each column $d_j$ in such a matrix represents a *document* and the frequency with which it employs each term. A term-document matrix thus contains a great deal of information about the associations between words and between documents (in the collection of documents that make it up).

More specifically (and wonkishly), given a term-document matrix $X$ with $T$ terms and $D$ documents:

- the dot product of two term rows ($t_i^T t_q$) in $X$ is the correlation between terms $t_i$ and $t_q$, and the matrix $X X^T$ contains all such dot products
- the dot product of two document columns ($d_j^T d_q$) in $X$ is the correlation between documents $d_j$ and $d_q$, and the matrix $X^T X$ contains all such dot products
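As a quick sanity check of these identities, here is a self-contained sketch on a made-up matrix (`M` is invented for this illustration; it is not the lecture's term-document matrix):

```python
from numpy import allclose, array

# a toy 3-term x 2-document matrix (made-up counts)
M = array([[1., 0.],
           [2., 1.],
           [0., 3.]])

# entry (i, q) of M M^T is the dot product of term rows t_i and t_q
MMt = M.dot(M.T)
assert allclose(MMt[0, 1], M[0, :].dot(M[1, :]))

# entry (j, q) of M^T M is the dot product of document columns d_j and d_q
MtM = M.T.dot(M)
assert allclose(MtM[0, 1], M[:, 0].dot(M[:, 1]))
```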

*Latent semantic analysis* (LSA), also sometimes known as *latent semantic indexing* (LSI), is a method to exploit information in term-document matrices using first principles from linear algebra. In particular, we use dimensionality reduction techniques to create a much-reduced form of the term-document matrix, and then use this to project terms and documents into a low-dimensionality "topic space", in which we can perform basic clustering and comparison of both terms and documents.

The following example comes from Deerwester et al. 1990. The first set of "documents" (actually, paper titles) are related to human-computer interaction (HCI), and the second are related to graph theory. We assume that terms which are not "underlined" (i.e., not marked like `_term_`) have been filtered out (as stopwords, or due to low term or document frequency).

In [2]:

```
# HCI-related documents
c0 = """
_Human_ machine _interface_ for Lab ABC _computer_ applications
A _survey_ of _user_ opinion of _computer_ _system_ _response_ _time_
The _EPS_ _user_ _interface_ management _system_
_System_ and _human_ _system_ engineering testing of _EPS_
Relation of _user_-perceived _response_ _time_ to error measurement
"""
# Graph-theory-related documents
c1 = """
The generation of random binary unordered _trees_
The intersection _graph_ of paths in _trees_
_Graph_ _minors_ IV: Widths of _trees_ and well-quasi-ordering
_Graph_ _minors_: A _survey_
"""
def prep_corpus(lines):
    """
    Given `lines` (a string) of text, generate the corresponding
    corpus (a list of "documents", where each document is a list
    of terms)
    """
    for line in lines.split("\n"):
        # ignore empty lines
        if not line:
            continue
        yield [term.upper() for term in findall(r"_(.*?)_", line)]

corpus = prep_corpus(c0 + c1)
```

The following function constructs a dense term-document matrix from the `corpus` object.

In [3]:

```
def termdoc_index(corpus):
    """
    Given a `corpus` (a list of documents, which are themselves
    lists of terms), return a dense term-document matrix and a
    term index.
    Many things you might do (such as filter by DF or TF) are not
    implemented here
    """
    # collect sparse frequencies
    terms = set()        # to populate a term index
    termdoc_sparse = []  # to populate a dense t-d matrix
    for doc in corpus:
        # compute term frequencies for this document
        column_sparse = Counter(doc)
        # save term frequencies
        termdoc_sparse.append(column_sparse)
        # add new terms to term set
        terms.update(column_sparse.keys())
    # convert term set to index
    index = {term: i for (i, term) in enumerate(terms)}
    # build dense matrix
    termdoc_dense = zeros((len(terms), len(termdoc_sparse)))
    for (j, column_sparse) in enumerate(termdoc_sparse):
        # a view of a column in the term-document matrix:
        column_dense = termdoc_dense[:, j]
        for (term, freq) in column_sparse.items():
            i = index[term]
            column_dense[i] = freq
            # equivalently: `termdoc_dense[i, j] = freq`
    return (termdoc_dense, index)

(X, index) = termdoc_index(corpus)
print(X)
```

We'll also hold on to some pointers to terms and documents, for later.

In [4]:

```
t_human = X[index["HUMAN"], :]
t_user = X[index["USER"], :]
t_graph = X[index["GRAPH"], :]
d_ABC = X[:, 0]
d_response = X[:, 4]
d_survey = X[:, 8]
print(t_graph)  # "GRAPH" occurs only in the graph-theory corpus
```

One problem when working at a larger scale (than this toy example) is that the dense term-document matrix grows very rapidly, even if we remove terms with low term frequency or low document frequency. So we wish to generate a low-rank approximation of this matrix. This is accomplished using a matrix factorization technique known as *singular value decomposition* (SVD), defined as follows. The singular value decomposition of a matrix $X$ is given by three matrices $U$, $\Sigma$, and $V$ such that

$$X = U \Sigma V^T$$

where $U$ and $V$ are orthogonal matrices and $\Sigma$ is a diagonal matrix of *singular values*. **In other words, $X$ can be factored exactly into the product of three matrices.**

In [5]:

```
(U, Sigma_diag, V) = svd(X, full_matrices=False)
Sigma = diag(Sigma_diag)
print(allclose(X, dot(U, dot(Sigma, V))))  # True: at full rank, the reconstruction is exact
```

In this decomposition, $U$ is a $T \times D$ matrix with orthonormal columns; $\Sigma$ is a $D \times D$ square diagonal matrix; and $V$ is a $D \times D$ orthogonal matrix, with the dimensionality determined by the number of documents. (Note that NumPy's `svd` actually returns $V^T$ rather than $V$, which is why no transpose appears in the reconstruction above.)

In [6]:

```
print("U dimensionality:\t{}".format(U.shape))
print("Sigma dimensionality:\t{}".format(Sigma.shape))
print("V dimensionality:\t{}".format(V.shape))
```

So, the size of the full SVD factorization grows with both the number of terms and the number of documents. However, we can hold it constant by throwing away all but the first $k$ singular values (and the corresponding columns of $U$ and $V$). This results in a new approximation

$$\hat{X}_k = U_k \Sigma_k V_k^T$$

which is known to be the optimal rank-$k$ approximation. Here, we will use $k = 2$.
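Before truncating our own decomposition, here is a small self-contained check of this optimality property on random data (all `_demo` names are invented here, so they do not clobber the lecture's $U$, $\Sigma$, $V$): keeping the $j$ largest singular values yields a rank-$j$ matrix, and the reconstruction error shrinks as $j$ grows.

```python
from numpy import dot
from numpy.linalg import matrix_rank, norm, svd
from numpy.random import seed, standard_normal

seed(0)
# random data of the same shape as our term-document matrix
A_demo = standard_normal((12, 9))
(U_demo, s_demo, Vt_demo) = svd(A_demo, full_matrices=False)

def truncate_demo(j):
    # rank-j reconstruction: keep the j largest singular values
    return dot(U_demo[:, :j] * s_demo[:j], Vt_demo[:j, :])

assert matrix_rank(truncate_demo(2)) == 2
# the (Frobenius-norm) error is non-increasing in j, and ~zero at full rank
errors = [norm(A_demo - truncate_demo(j)) for j in range(1, 10)]
assert all(e0 >= e1 for (e0, e1) in zip(errors, errors[1:]))
assert errors[-1] < 1e-8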

In [7]:

```
k = 2
U_k = U[:, :k]
# Sigma_diag is just an array...and going forward,
# we only need its inverse, so we'll compute that
invSigma_k = diag(1. / Sigma_diag[:k])
# equivalently: `inv(diag(Sigma_diag[:k]))`
# recall that `svd` returned V transposed, so we take the
# first k *rows* and transpose them back
V_k = V[:k, :].T
```

We can now translate each document into the $k$-dimensional "topic space" as follows. If $d_j$ is a document column in $X$, then

$$\hat{d_j} = \Sigma_k^{-1} U_k^T d_j~.$$

In [8]:

```
def doc_translate(d, U_k, invSigma_k):
    """
    Translate a document (column) `d` into the k-dimensional
    topic space
    """
    return dot(dot(invSigma_k, U_k.T), d)
v_ABC = doc_translate(d_ABC, U_k, invSigma_k)
v_response = doc_translate(d_response, U_k, invSigma_k)
v_survey = doc_translate(d_survey, U_k, invSigma_k)
print(v_ABC)
print(v_response)
```

The same can be done for novel queries; we simply treat each query $d_q$ as if it were a new document and apply the same translation.
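For instance, in this self-contained sketch (the matrix, query, and `_demo` names are made up for illustration and kept separate from the lecture's variables), a single-term query lands closest in topic space to the documents that use that term:

```python
from numpy import diag, dot, zeros
from numpy.linalg import norm, svd

# a made-up 4-term x 3-document count matrix
X_demo = zeros((4, 3))
X_demo[0, 0] = X_demo[1, 0] = 1.  # doc 0 uses terms 0 and 1
X_demo[1, 1] = X_demo[2, 1] = 1.  # doc 1 uses terms 1 and 2
X_demo[2, 2] = X_demo[3, 2] = 1.  # doc 2 uses terms 2 and 3

(U_demo, s_demo, _) = svd(X_demo, full_matrices=False)
U_k_demo = U_demo[:, :2]
invSigma_k_demo = diag(1. / s_demo[:2])

# a query is a pseudo-document: a column of term frequencies
query = zeros(4)
query[1] = 1.  # the query consists of term 1 alone

# apply exactly the same translation used for documents
v_query = dot(dot(invSigma_k_demo, U_k_demo.T), query)
v_docs = [dot(dot(invSigma_k_demo, U_k_demo.T), X_demo[:, j])
          for j in range(3)]

# cosine similarity in topic space: the query shares term 1 with
# docs 0 and 1, but not with doc 2
sims = [dot(v_query, v) / (norm(v_query) * norm(v)) for v in v_docs]
assert sims[0] > sims[2] and sims[1] > sims[2]
```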

We also can translate terms into topic space. If $t_i$ is a term row in $X$, then

$$\hat{t_i} = \Sigma_k^{-1} V_k^T t_i~.$$

In [9]:

```
def term_translate(t, invSigma_k, V_k):
    """
    Translate a term (row) `t` into the k-dimensional topic space
    """
    return dot(dot(invSigma_k, V_k.T), t)
v_human = term_translate(t_human, invSigma_k, V_k)
v_user = term_translate(t_user, invSigma_k, V_k)
v_graph = term_translate(t_graph, invSigma_k, V_k)
print(v_human)
print(v_user)
```

Most importantly, we can compare pairs of vectors $v_0$, $v_1$ in the topic space, regardless of their source (terms, queries, or documents). We take the relatedness of two topic-space vectors to be monotonically related to the angle $\theta$ between them. More specifically, we compute the cosine of this angle, a measure called *cosine similarity*, which is defined as

$$\cos \theta = \frac{v_0 \cdot v_1}{\|v_0\| \, \|v_1\|}$$

where $\|v\|$ represents the Euclidean norm of the vector $v$. The value of this measure lies in $[-1, 1]$, where $1$ indicates vector identity and $-1$ indicates maximal vector dissimilarity.

In [10]:

```
def cosine_similarity(v0, v1):
    """
    Compute cosine similarity between two vectors `v0`, `v1`
    """
    numerator = dot(v0, v1)
    denominator = norm(v0) * norm(v1)
    return numerator / denominator
```

We can use this measure to estimate document and term similarities. As expected, the HCI terms and documents are more similar to each other than they are to graph-theory terms and documents, respectively.

In [11]:

```
# term similarities
print("HUMAN v. USER:\t{: .4f}".format(cosine_similarity(v_human, v_user)))
print("HUMAN v. GRAPH:\t{: .4f}".format(cosine_similarity(v_human, v_graph)))
```

In [12]:

```
# document similarities
print("'ABC' article v. 'RESPONSE' article:\t{: .4f}".format(
    cosine_similarity(v_ABC, v_response)))
print("'ABC' article v. 'SURVEY' article:\t{: .4f}".format(
    cosine_similarity(v_ABC, v_survey)))
```

When $k = 2$, we can also visualize terms and documents using Cartesian coordinates. (In practice, we usually choose a much larger value of $k$.)

In [13]:

```
def plot_2d_translations(translations):
    plot([0, 0], [0, 0], ".")
    for translation in translations:
        # draw each translation as a segment from the origin
        plot([0, translation[0]], [0, translation[1]], "-")
```

In [14]:

```
plot_2d_translations([v_human, v_user, v_graph])
# green: 'HUMAN'
# red: 'USER'
# cyan: 'GRAPH'
```

In [15]:

```
plot_2d_translations([v_ABC, v_response, v_survey])
# green: 'ABC'
# red: 'RESPONSE'
# cyan: 'SURVEY'
```

A limitation of the LSA method illustrated above is that the naïve approach requires a dense term-document matrix which fits in memory; though the underlying data is relatively sparse, the dense matrix grows in both dimensions as new documents (and previously unseen terms) arrive! However, there exist many incremental (i.e., one-document-at-a-time) approximations of SVD, and one fast variant (Brand 2006) is implemented in the Gensim library (Řehůřek and Sojka 2010).

M. Brand. 2006. Fast low-rank modifications of the thin singular value decomposition. *Linear Algebra and its Applications* 415(1): 20-30.

S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. 1990. Indexing by latent semantic analysis. *Journal of the American Society for Information Science* 41(6): 391-407.

R. Řehůřek and P. Sojka. 2010. Software framework for topic modelling with large corpora. In *LREC*, 45-50.