In [1]:

```
from re import findall
from collections import Counter
from numpy import allclose, diag, dot, zeros
from numpy.linalg import norm, svd
%pylab
%matplotlib inline
```

A *term-document matrix* summarizes word co-occurrence statistics for multiple "documents". Each row $t_i$ in a term-document matrix represents a *term* (i.e., a word or stem) and the frequency with which it occurs in each document. Each column $d_j$ in such a matrix represents a *document* and the frequency with which it employs each term. A term-document matrix thus contains a great deal of information about the associations between words and between documents (in the collection of documents that make it up).

More specifically (and wonkishly), given a term-document matrix $X$ with $T$ terms and $D$ documents:

- the dot product of two term rows ($t_i^T t_q$) in $X$ is the correlation between terms $t_i$ and $t_q$, and the matrix $X X^T$ contains all such dot products
- the dot product of two document columns ($d_j^T d_q$) in $X$ is the correlation between documents $d_j$ and $d_q$, and the matrix $X^T X$ contains all such dot products
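
As a quick sketch of these two products (on a made-up 2-term × 3-document matrix, not the corpus used below):

```python
from numpy import array, dot

# hypothetical term-document matrix: 2 terms x 3 documents
X = array([[1, 0, 2],
           [0, 3, 1]])

term_products = dot(X, X.T)  # 2x2: dot products between term rows
doc_products = dot(X.T, X)   # 3x3: dot products between document columns

print(term_products)  # [[ 5  2]
                      #  [ 2 10]]
print(doc_products.shape)  # (3, 3)
```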

*Latent semantic analysis* (LSA), also sometimes known as *latent semantic indexing* (LSI), is a method to exploit information in term-document matrices using first principles from linear algebra. In particular, we use dimensionality reduction techniques to create a much-reduced form of the term-document matrix, and then use this to project terms and documents into a low-dimensionality "topic space", in which we can perform basic clustering and comparison of both terms and documents.

In the toy corpus below, each term is marked with underscores (e.g., `_term_`); all unmarked words are filtered out (as stopwords or due to low term or document frequency).

In [2]:

```
# HCI-related documents
c0 = """
_Human_ machine _interface_ for Lab ABC _computer_ applications
A _survey_ of _user_ opinion of _computer_ _system_ _response_ _time_
The _EPS_ _user_ _interface_ management _system_
_System_ and _human_ _system_ engineering testing of _EPS_
Relation of _user_-perceived _response_ _time_ to error measurement
"""
# Graph-theory-related documents
c1 = """
The generation of random binary unordered _trees_
The intersection _graph_ of paths in _trees_
_Graph_ _minors_ IV: Widths of _trees_ and well-quasi-ordering
_Graph_ _minors_: A _survey_
"""
def prep_corpus(lines):
    """
    Given `lines` (a string) of text, generate the corresponding
    corpus (a list of "documents", where each document is a list
    of terms)
    """
    for line in lines.split("\n"):
        # ignore empty lines
        if not line:
            continue
        yield [term.upper() for term in findall(r"_(.*?)_", line)]
corpus = prep_corpus(c0 + c1)
```

Note that `corpus` is a generator object.
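Because it is a generator, the corpus yields documents lazily and can be consumed only once. A minimal sketch of what it yields, restating the parser on a hypothetical one-line `sample` string:

```python
from re import findall

def prep_corpus(lines):
    # yield one list of (uppercased) terms per non-empty line
    for line in lines.split("\n"):
        if not line:
            continue
        yield [term.upper() for term in findall(r"_(.*?)_", line)]

sample = "_Graph_ _minors_: A _survey_"
print(list(prep_corpus(sample)))  # [['GRAPH', 'MINORS', 'SURVEY']]
```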

In [3]:

```
def termdoc_index(corpus):
    """
    Given a `corpus` (a list of documents, which are themselves
    lists of terms), return a dense term-document matrix and a
    term index.
    Many things you might do (such as filter by DF or TF) are not
    implemented here.
    """
    # collect sparse frequencies
    terms = set()        # to populate a term index
    termdoc_sparse = []  # to populate a dense t-d matrix
    for doc in corpus:
        # compute term frequencies for this document
        column_sparse = Counter(doc)
        # save term frequencies
        termdoc_sparse.append(column_sparse)
        # add new terms to term set
        terms.update(column_sparse.keys())
    # convert term set to index
    index = {term: i for (i, term) in enumerate(terms)}
    # build dense matrix
    termdoc_dense = zeros((len(terms), len(termdoc_sparse)))
    for (j, column_sparse) in enumerate(termdoc_sparse):
        # a view onto a column of the term-document matrix:
        column_dense = termdoc_dense[:, j]
        for (term, freq) in column_sparse.items():
            i = index[term]
            column_dense[i] = freq
            # equivalently: `termdoc_dense[i, j] = freq`
    return (termdoc_dense, index)
(X, index) = termdoc_index(corpus)
print(X)
```

We'll also hold on to some pointers to terms and documents, for later.

In [4]:

```
t_human = X[index["HUMAN"], :]
t_user = X[index["USER"], :]
t_graph = X[index["GRAPH"], :]
d_ABC = X[:, 0]
d_response = X[:, 4]
d_survey = X[:, 8]
print(t_graph)  # "GRAPH" occurs only in the graph-theory corpus
```

One problem when working at a larger scale (than this toy example) is that the dense term-document matrix grows very rapidly, even if we remove terms with low term frequency or low document frequency. So we wish to generate a low-rank approximation of this matrix. This is accomplished using a matrix factorization technique known as *singular value decomposition* (SVD). The singular value decomposition of a matrix $X$ is given by three matrices $U$, $\Sigma$, and $V$ such that

$$X = U \Sigma V^T$$

where $U$ and $V$ have orthonormal columns and $\Sigma$ is a diagonal matrix of *singular values*. **In other words, $X$ can be written exactly as the product of three matrices.**

In [5]:

```
(U, Sigma_diag, V) = svd(X, full_matrices=False)  # NB: numpy returns V^T as its third value
Sigma = diag(Sigma_diag)
print(allclose(X, dot(U, dot(Sigma, V))))  # not a bad approximation!
```

In [6]:

```
print("U dimensionality:\t{}".format(U.shape))
print("Sigma dimensionality:\t{}".format(Sigma.shape))
print("V dimensionality:\t{}".format(V.shape))
```

So, the dimensionality of the SVD grows with both the number of terms and the number of documents. However, we can hold it constant by throwing away all but the first $k$ singular values (and the corresponding singular vectors). This results in a new approximation

$$\hat{X}_k = U_k \Sigma_k V_k^T$$

which is known to be the optimal rank-$k$ approximation of $X$ in the least-squares sense. Here, we will use $k = 2$.
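
A quick numerical sketch of this truncation (on a small random matrix, not the term-document matrix above): the Frobenius error of the rank-$k$ reconstruction is exactly the root of the sum of the squared discarded singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))

# thin SVD; note that numpy returns V^T as the third value
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # rank-k approximation

err = np.linalg.norm(A - A_k, "fro")
print(np.isclose(err, np.sqrt((s[k:] ** 2).sum())))  # True
```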

In [7]:

```
k = 2
U_k = U[:, :k]
# Sigma_diag is just an array...and going forward,
# we only need its inverse, so we'll compute that
invSigma_k = diag(1. / Sigma_diag[:k])
# equivalently: `inv(diag(Sigma_diag[:k]))`
V_k = V[:k, :].T  # `svd` returned V^T, so take its first k rows, transposed
```

We can now translate each document into the $k$-dimensional "topic space" as follows. If $d_j$ is a document column in $X$, then

$$\hat{d}_j = \Sigma_k^{-1} U_k^T d_j~.$$

In [8]:

```
def doc_translate(d, U_k, invSigma_k):
    """
    Translate a document `d` into the k-dimensional topic space
    """
    return dot(dot(invSigma_k, U_k.T), d)
v_ABC = doc_translate(d_ABC, U_k, invSigma_k)
v_response = doc_translate(d_response, U_k, invSigma_k)
v_survey = doc_translate(d_survey, U_k, invSigma_k)
print(v_ABC)
print(v_response)
```

We also can translate terms into topic space. If $t_i$ is a term row in $X$, then

$$\hat{t}_i = \Sigma_k^{-1} V_k^T t_i~.$$

In [9]:

```
def term_translate(t, invSigma_k, V_k):
    """
    Translate a term `t` into the k-dimensional topic space
    """
    return dot(dot(invSigma_k, V_k.T), t)
v_human = term_translate(t_human, invSigma_k, V_k)
v_user = term_translate(t_user, invSigma_k, V_k)
v_graph = term_translate(t_graph, invSigma_k, V_k)
print(v_human)
print(v_user)
```

Most importantly, we can compare pairs of vectors $v_0$, $v_1$ in the topic space, regardless of their source (terms, queries, or documents). We take the relatedness of two topic-space vectors to be monotonically related to the angle $\theta$ between them. More specifically, we compute the cosine of this angle, a measure called *cosine similarity*, which is defined as

$$\cos \theta = \frac{v_0 \cdot v_1}{\|v_0\| \|v_1\|}$$

where $\|v\|$ represents the Euclidean norm of the vector $v$. The value of this measure lies in $[-1, 1]$, where $1$ indicates identical direction and $-1$ indicates maximal vector dissimilarity.

In [10]:

```
def cosine_similarity(v0, v1):
    """
    Compute cosine similarity between two vectors `v0`, `v1`
    """
    numerator = dot(v0, v1)
    denominator = norm(v0) * norm(v1)
    return numerator / denominator
```
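
For instance (restating the function so the snippet stands alone), orthogonal vectors score $0$ and parallel vectors score (numerically) $1$:

```python
from numpy import dot
from numpy.linalg import norm

def cosine_similarity(v0, v1):
    return dot(v0, v1) / (norm(v0) * norm(v1))

print(cosine_similarity([1, 0], [0, 1]))            # 0.0
print(round(cosine_similarity([1, 2], [2, 4]), 6))  # 1.0
```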

In [11]:

```
# term similarities
print("HUMAN v. USER:\t{: .4f}".format(cosine_similarity(v_human, v_user)))
print("HUMAN v. GRAPH:\t{: .4f}".format(cosine_similarity(v_human, v_graph)))
```

In [12]:

```
# document similarities
print("'ABC' article v. 'RESPONSE' article:\t{: .4f}".format(
    cosine_similarity(v_ABC, v_response)))
print("'ABC' article v. 'SURVEY' article:\t{: .4f}".format(
    cosine_similarity(v_ABC, v_survey)))
```

In [13]:

```
def plot_2d_translations(translations):
    plot([0, 0], [0, 0], ".")
    for translation in translations:
        plot([0, translation[0]], [0, translation[1]], "-")
```

In [14]:

```
plot_2d_translations([v_human, v_user, v_graph])
# green: 'HUMAN'
# red: 'USER'
# cyan: 'GRAPH'
```

In [15]:

```
plot_2d_translations([v_ABC, v_response, v_survey])
# green: 'ABC'
# red: 'RESPONSE'
# cyan: 'SURVEY'
```

M. Brand. 2006. Fast low-rank modifications of the thin singular value decomposition. *Linear Algebra and its Applications* 415(1): 20-30.

S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. 1990. Indexing by latent semantic analysis. *Journal of the American Society for Information Science* 41(6): 391-407.

R. Řehůřek and P. Sojka. 2010. Software framework for topic modelling with large corpora. In *LREC*, 45-50.