__author__ = "Christopher Potts"
__version__ = "CS224u, Stanford, Spring 2020"
The matrix weighting schemes reviewed in the first notebook for this unit deliver solid results. However, they are not capable of capturing higher-order associations in the data.
With dimensionality reduction, the goal is to eliminate correlations in the input VSM and capture such higher-order notions of co-occurrence, thereby improving the overall space.
As a motivating example, consider the adjectives gnarly and wicked used as slang positive adjectives. Since both are positive, we expect them to be similar in a good VSM. However, at least stereotypically, gnarly is Californian and wicked is Bostonian. Thus, they are unlikely to occur often in the same texts, and so the methods we've reviewed so far will not be able to model their similarity.
Dimensionality reduction techniques are often capable of capturing such semantic similarities (and have the added advantage of shrinking the size of our data structures).
Make sure your environment meets all the requirements for the cs224u repository. For help getting set-up, see setup.ipynb.
Make sure you've downloaded the data distribution for this course, unpacked it, and placed it in the current directory (or wherever you point DATA_HOME
to below).
from mittens import GloVe
import numpy as np
import os
import pandas as pd
import scipy.stats
from torch_autoencoder import TorchAutoencoder
import utils
import vsm
# Set all the random seeds for reproducibility:
utils.fix_random_seeds()
DATA_HOME = os.path.join('data', 'vsmdata')
imdb5 = pd.read_csv(
os.path.join(DATA_HOME, 'imdb_window5-scaled.csv.gz'), index_col=0)
imdb20 = pd.read_csv(
os.path.join(DATA_HOME, 'imdb_window20-flat.csv.gz'), index_col=0)
giga5 = pd.read_csv(
os.path.join(DATA_HOME, 'giga_window5-scaled.csv.gz'), index_col=0)
giga20 = pd.read_csv(
os.path.join(DATA_HOME, 'giga_window20-flat.csv.gz'), index_col=0)
Latent Semantic Analysis (LSA) is a prominent dimensionality reduction technique. It is an application of truncated singular value decomposition (SVD) and so uses only techniques from linear algebra (no machine learning needed).
The central mathematical result is that, for any matrix of real numbers $X$ of dimension $m \times n$, there is a factorization of $X$ into matrices $T$, $S$, and $D$ such that
$$X_{m \times n} = T_{m \times m}S_{m\times m}D_{n \times m}^{\top}$$The matrices $T$ and $D$ are orthonormal – their columns are length-normalized and orthogonal to one another (that is, they each have cosine distance of $1$ from each other). The singular-value matrix $S$ is a diagonal matrix arranged by size, so that the first dimension corresponds to the greatest source of variability in the data, followed by the second, and so on.
Of course, we don't want to factorize and rebuild the original matrix, as that wouldn't get us anywhere. The truncation part means that we include only the top $k$ dimensions of $S$. Given our row-oriented perspective on these matrices, this means using
$$T[1{:}m, 1{:}k]S[1{:}k, 1{:}k]$$which gives us a version of $T$ that includes only the top $k$ dimensions of variation.
To build up intuitions, imagine that everyone on the Stanford campus is associated with a 3d point representing their position: $x$ is east–west, $y$ is north–south, and $z$ is zenith–nadir. Since the campus is spread out and has relatively few deep basements and tall buildings, the top two dimensions of variation will be $x$ and $y$, and the 2d truncated SVD of this space will leave $z$ out. This will, for example, capture the sense in which someone at the top of Hoover Tower is close to someone at its base.
We can also return to our original motivating example of wicked and gnarly. Here is a matrix reflecting those assumptions:
gnarly_df = pd.DataFrame(
np.array([
[1,0,1,0,0,0],
[0,1,0,1,0,0],
[1,1,1,1,0,0],
[0,0,0,0,1,1],
[0,0,0,0,0,1]], dtype='float64'),
index=['gnarly', 'wicked', 'awesome', 'lame', 'terrible'])
gnarly_df
0 | 1 | 2 | 3 | 4 | 5 | |
---|---|---|---|---|---|---|
gnarly | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
wicked | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
awesome | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 |
lame | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
terrible | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
No column context includes both gnarly and wicked together so our count matrix places them far apart:
vsm.neighbors('gnarly', gnarly_df)
gnarly 0.000000 awesome 0.292893 wicked 1.000000 lame 1.000000 terrible 1.000000 dtype: float64
Reweighting doesn't help. For example, here is the attempt with Positive PMI:
vsm.neighbors('gnarly', vsm.pmi(gnarly_df))
gnarly 0.000000 awesome 0.292893 wicked 1.000000 lame 1.000000 terrible 1.000000 dtype: float64
However, both words tend to occur with awesome and not with lame or terrible, so there is an important sense in which they are similar. LSA to the rescue:
gnarly_lsa_df = vsm.lsa(gnarly_df, k=2)
vsm.neighbors('gnarly', gnarly_lsa_df)
gnarly 0.0 wicked 0.0 awesome 0.0 lame 1.0 terrible 1.0 dtype: float64
Here's an example that begins to convey the effect that this can have empirically.
First, the original count matrix:
vsm.neighbors('superb', imdb5).head()
superb 0.000000 . 0.814956 e 0.889682 k 0.891759 g 0.894064 dtype: float64
And then LSA with $k=100$:
imdb5_svd = vsm.lsa(imdb5, k=100)
vsm.neighbors('superb', imdb5_svd).head()
superb 0.000000 notch 0.015580 talents 0.028776 poor 0.029328 voice 0.031802 dtype: float64
A common pattern in the literature is to apply PMI first. The PMI values tend to give the count matrix a normal (Gaussian) distribution that better satisfies the assumptions underlying SVD:
imdb5_pmi = vsm.pmi(imdb5, positive=False)
imdb5_pmi_svd = vsm.lsa(imdb5_pmi, k=100)
vsm.neighbors('superb', imdb5_pmi_svd).head()
superb 0.000000 outstanding 0.017650 terrific 0.021071 fantastic 0.033293 brilliant 0.050633 dtype: float64
The sklearn.decomposition module contains an implementation of LSA (TruncatedSVD) that you might want to switch to for real experiments:
The sklearn
version is more flexible than the above in that it can operate on both dense matrices (Numpy arrays) and sparse matrices (from Scipy).
The sklearn
version will make it easy to try out other dimensionality reduction methods in your own code; Principal Component Analysis (PCA) and Non-Negative Matrix Factorization (NMF) are closely related methods that are worth a look.
Pennington et al. (2014) introduce an objective function for semantic word representations. Roughly speaking, the objective is to learn vectors for words $w_{i}$ and $w_{j}$ such that their dot product is proportional to their probability of co-occurrence:
$$w_{i}^{\top}\widetilde{w}_{k} + b_{i} + \widetilde{b}_{k} = \log(X_{ik})$$The paper is exceptionally good at motivating this objective from first principles. In their equation (6), they define
$$w_{i}^{\top}\widetilde{w}_{k} = \log(P_{ik}) = \log(X_{ik}) - \log(X_{i})$$If we allow that the rows and columns can be different, then we would do
$$w_{i}^{\top}\widetilde{w}_{k} = \log(P_{ik}) = \log(X_{ik}) - \log(X_{i} \cdot X_{*k})$$where, as in the paper, $X_{i}$ is the sum of the values in row $i$, and $X_{*k}$ is the sum of the values in column $k$.
The rightmost expression is PMI by the equivalence $\log(\frac{x}{y}) = \log(x) - \log(y)$, and hence we can see GloVe as aiming to make the dot product of two learned vectors equal to the PMI!
The full model is a weighting of this objective:
$$\sum_{i, j=1}^{|V|} f\left(X_{ij}\right) \left(w_i^\top \widetilde{w}_j + b_i + \widetilde{b}_j - \log X_{ij}\right)^2$$where $V$ is the vocabulary and $f$ is a scaling factor designed to diminish the impact of very large co-occurrence counts:
$$f(x) \begin{cases} (x/x_{\max})^{\alpha} & \textrm{if } x < x_{\max} \\ 1 & \textrm{otherwise} \end{cases}$$Typically, $\alpha$ is set to $0.75$ and $x_{\max}$ to $100$ (though it is worth assessing how many of your non-zero counts are above this; in dense word $\times$ word matrices, you could be flattening more than you want to).
The implementation in vsm.glove
is the most stripped-down, bare-bones version of the GloVe method I could think of. As such, it is quite slow.
The required mittens package includes a vectorized implementation that is much, much faster, so we'll mainly use that.
For really large jobs, the official C implementation released by the GloVe team is probably the best choice.
GloVe should do well on our gnarly/wicked evaluation, though you will see a lot variation due to the small size of this VSM:
gnarly_glove = vsm.glove(gnarly_df, n=5, max_iter=1000)
Stopping at iteration 592 with error 9.987450681981009e-05
vsm.neighbors('gnarly', gnarly_glove)
gnarly 0.000000 terrible 0.333939 lame 0.718500 awesome 1.027851 wicked 1.299370 dtype: float64
It is not easy analyze GloVe values derived from real data, but the following little simulation suggests that vsm.glove
is working as advertised: it does seem to reliably deliver vectors whose dot products are proportional to the log co-occurrence probability:
glove_test_count_df = pd.DataFrame(
np.array([
[10.0, 2.0, 3.0, 4.0],
[ 2.0, 10.0, 4.0, 1.0],
[ 3.0, 4.0, 10.0, 2.0],
[ 4.0, 1.0, 2.0, 10.0]]),
index=['A', 'B', 'C', 'D'],
columns=['A', 'B', 'C', 'D'])
glove_test_df = vsm.glove(glove_test_count_df, max_iter=1000, n=4)
Stopping at iteration 531 with error 9.96918936920818e-05
def correlation_test(true, pred):
mask = true > 0
M = pred.dot(pred.T)
with np.errstate(divide='ignore'):
log_cooccur = np.log(true)
log_cooccur[np.isinf(log_cooccur)] = 0.0
row_log_prob = np.log(true.sum(axis=1))
row_log_prob = np.outer(row_log_prob, np.ones(true.shape[1]))
prob = log_cooccur - row_log_prob
return np.corrcoef(prob[mask], M[mask])[0, 1]
correlation_test(glove_test_count_df.values, glove_test_df.values)
0.9474600254341587
The vsm.glove
implementation is too slow to use on real matrices. The distribution in the mittens
package is significantly faster, making its use possible even without a GPU (and it will be very fast indeed on a GPU machine):
glove_model = GloVe()
imdb5_glv = glove_model.fit(imdb5.values)
imdb5_glv = pd.DataFrame(imdb5_glv, index=imdb5.index)
WARNING: Logging before flag parsing goes to stderr. W1117 13:38:00.728562 140735705969536 deprecation.py:506] From /Applications/anaconda3/envs/nlu/lib/python3.7/site-packages/tensorflow_core/python/ops/resource_variable_ops.py:1630: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version. Instructions for updating: If using Keras pass *_constraint arguments to layers. W1117 13:38:00.800172 140735705969536 deprecation.py:506] From /Applications/anaconda3/envs/nlu/lib/python3.7/site-packages/tensorflow_core/python/training/adagrad.py:76: calling Constant.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version. Instructions for updating: Call initializer instance with the dtype argument instead of passing it to the constructor Iteration 100: loss: 379389.9375
vsm.neighbors('superb', imdb5_glv).head()
superb 0.000000 outstanding 0.033546 terrific 0.039682 excellent 0.052638 performances 0.071205 dtype: float64
An autoencoder is a machine learning model that seeks to learn parameters that predict its own input. This is meaningful when there are intermediate representations that have lower dimensionality than the inputs. These provide a reduced-dimensional view of the data akin to those learned by LSA, but now we have a lot more design choices and a lot more potential to learn higher-order associations in the underyling data.
The module torch_autoencoder
uses PyToch to implement a simple one-layer autoencoder:
Here, we assume that the hidden representation $h$ has a low dimensionality like 100, and that $\mathbf{f}$ is a non-linear activation function (the default for TorchAutoencoder
is tanh
). These are the major design choices internal to the network. It might also be meaningful to assume that there are two matrices of weights $W_{xh}$ and $W_{hx}$, rather than using $W^{\top}$ for the output step.
The objective function for autoencoders will implement some kind of assessment of the distance between the inputs and their predicted outputs. For example, one could use the one-half mean squared error:
$$\frac{1}{m}\sum_{i=1}^{m} \frac{1}{2}(\widehat{X[i]} - X[i])^{2}$$where $X$ is the input matrix of examples (dimension $m \times n$) and $X[i]$ corresponds to the $i$th example.
When you call the fit
method of TorchAutoencoder
, it returns the matrix of hidden representations $h$, which is the new embedding space: same row count as the input, but with the column count set by the hidden_dim
parameter.
For much more on autoencoders, see the 'Autoencoders' chapter of Goodfellow et al. 2016.
Here's an evaluation that is meant to test the autoencoder implementation – we expect it to be able to full encode the input matrix because we know its rank is equal to the dimensionality of the hidden representation.
def randmatrix(m, n, sigma=0.1, mu=0):
return sigma * np.random.randn(m, n) + mu
def autoencoder_evaluation(nrow=1000, ncol=100, rank=20, max_iter=20000):
"""This an evaluation in which `TfAutoencoder` should be able
to perfectly reconstruct the input data, because the
hidden representations have the same dimensionality as
the rank of the input matrix.
"""
X = randmatrix(nrow, rank).dot(randmatrix(rank, ncol))
ae = TorchAutoencoder(hidden_dim=rank, max_iter=max_iter)
ae.fit(X)
X_pred = ae.predict(X)
mse = (0.5 * (X_pred - X)**2).mean()
return(X, X_pred, mse)
ae_max_iter = 100
_, _, ae = autoencoder_evaluation(max_iter=ae_max_iter)
print("Autoencoder evaluation MSE after {0} evaluations: {1:0.04f}".format(ae_max_iter, ae))
Finished epoch 100 of 100; error is 0.00022285695013124496
Autoencoder evaluation MSE after 100 evaluations: 0.0001
You can apply the autoencoder directly to the count matrix, but this could interact very badly with the internal activation function: if the counts are all very high or very low, then everything might get pushed irrevocably towards the extreme values of the activation.
Thus, it's a good idea to first normalize the values somehow. Here, I use vsm.length_norm
:
imdb5_l2 = imdb5.apply(vsm.length_norm, axis=1)
imdb5_l2_ae = TorchAutoencoder(
max_iter=100, hidden_dim=50, eta=0.001).fit(imdb5_l2)
Finished epoch 100 of 100; error is 0.00019485459779389203
vsm.neighbors('superb', imdb5_l2_ae).head()
superb 0.000000 witch 0.000909 instrument 0.000922 postage 0.000965 board 0.000970 dtype: float64
This is very slow and seems not to work all that well. To speed things up, one can first apply LSA or similar:
imdb5_l2_svd100 = vsm.lsa(imdb5_l2, k=100)
imdb_l2_svd100_ae = TorchAutoencoder(
max_iter=1000, hidden_dim=50, eta=0.01).fit(imdb5_l2_svd100)
Finished epoch 1000 of 1000; error is 0.0001060685608536005
vsm.neighbors('superb', imdb_l2_svd100_ae).head()
superb 0.000000 outstanding 0.041158 terrific 0.052538 excellent 0.081740 exceptional 0.102115 dtype: float64
The label word2vec picks out a family of models in which the embedding for a word $w$ is trained to predict the words that co-occur with $w$. This intuition can be cashed out in numerous ways. Here, we review just the skip-gram model, due to Mikolov et al. 2013.
The most natural starting point is to transform a corpus into a supervised data set by mapping each word to a subset (maybe all) of the words that it occurs with in a given window. Schematically:
Corpus: it was the best of times, it was the worst of times, ...
With window size 2:
(it, was)
(it, the)
(was, it)
(was, the)
(was, best)
(the, was)
(the, it)
(the, best)
(the, of)
...
The basic skip-gram model estimates the probability of an input–output pair $(a, b)$ as
$$P(b \mid a) = \frac{\exp(x_{a}w_{b})}{\sum_{b'\in V}\exp(x_{a}w_{b'})}$$where $x_{a}$ is the row (word) vector representation of word $a$ and $w_{b}$ is the column (context) vector representation of word $b$. The objective is to minimize the following quantity:
$$ -\sum_{i=1}^{m}\sum_{k=1}^{|V|} \textbf{1}\{c_{i}=k\} \log \frac{ \exp(x_{i}w_{k}) }{ \sum_{j=1}^{|V|}\exp(x_{i}w_{j}) }$$where $V$ is the vocabulary.
The inputs $x_{i}$ are the word representations, which get updated during training, and the outputs are one-hot vectors $c$. For example, if was
is the 560th element in the vocab, then the output $c$ for the first example in the corpus above would be a vector of all $0$s except for a $1$ in the 560th position. $x$ would be the representation of it
in the embedding space.
The distribution over the entire output space for a given input word $a$ is thus a standard softmax classifier; here we add a bias term for good measure:
$$c = \textbf{softmax}(x_{a}W + b)$$If we think of this model as taking the entire matrix $X$ as input all at once, then it becomes
$$c = \textbf{softmax}(XW + b)$$and it is now very clear that we are back to the core insight that runs through all of our reweighting and dimensionality reduction methods: we have a word matrix $X$ and a context matrix $W$, and we are trying to push the dot products of these two embeddings in a specific direction: here, to maximize the likelihood of the observed co-occurrences in the corpus.
Training the basic skip-gram model directly is extremely expensive for large vocabularies, because $W$, $b$, and the outputs $c$ get so large.
A straightforward way to address this is to change the objective to use noise contrastive estimation (negative sampling). Where $\mathcal{D}$ is the original training corpus and $\mathcal{D}'$ is a sample of pairs not in the corpus, we minimize
$$\sum_{a, b \in \mathcal{D}}-\log\sigma(x_{a}w_{b}) + \sum_{a, b \in \mathcal{D}'}\log\sigma(x_{a}w_{b})$$with $\sigma$ the sigmoid activation function $\frac{1}{1 + \exp(-x)}$.
The advice of Mikolov et al. is to sample $\mathcal{D}'$ proportional to a scaling of the frequency distribution of the underlying vocabulary in the corpus:
$$P(w) = \frac{\textbf{count}(w)^{0.75}}{\sum_{w'\in V} \textbf{count}(w')}$$where $V$ is the vocabulary.
Although this new objective function is a substantively different objective than the previous one, Mikolov et al. (2013) say that it should approximate it, and it is building on the same insight about words and their contexts. See Levy and Golberg 2014 for a proof that this objective reduces to PMI shifted by a constant value. See also Cotterell et al. 2017 for an interpretation of this model as a variant of PCA.
In the usual presentation, word2vec training involves looping repeatedly over the sequence of tokens in the corpus, sampling from the context window from each word to create the positive training pairs. I assume that this same process could be modeled by sampling (row, column) index pairs from our count matrices proportional to their cell values. However, I couldn't get this to work well. I'd be grateful if someone got it work or figured out why it won't!
Luckily, there are numerous excellent resources for word2vec. The TensorFlow tutorial Vector representations of words is very clear and links to code that is easy to work with. Because TensorFlow has a built in loss function called tf.nn.nce_loss
, it is especially simple to define these models – one pretty much just sets up an embedding $X$, a context matrix $W$, and a bias $b$, and then feeds them plus a training batch to the loss function.
The excellent Gensim package has an implementation that handles the scalability issues related to word2vec.
Learning word representations is one of the most active areas in NLP right now, so I can't hope to offer a comprehensive summary. I'll settle instead for identifying some overall trends and methods:
The LexVec model of Salle et al. 2016 combines the core insight of GloVe (learn vectors that approximate PMI) with the insight from word2vec that we should additionally try to push words that don't appear together farther apart in the VSM. (GloVe simply ignores 0 count cells and so can't do this.)
There is growing awareness that many apparently diverse models can be expressed as matrix factorization methods like SVD/LSA. See especially
Singh and Gordon 2008, Levy and Golberg 2014, Cotterell et al. 2017.
Subword modeling (reviewed briefly in the previous notebook) is increasingly yielding dividends. (It would already be central if most of NLP focused on languages with complex morphology!) Check out the papers at the Subword and Character-Level Models for NLP Workshops: SCLeM 2017, SCLeM 2018.
Contextualized word representations have proven valuable in many contexts. These methods do not provide representations for individual words, but rather represent them in their linguistic context. This creates space for modeling how word senses vary depending on their context of use. We will study these methods later in the quarter, mainly in the context of identifying ways that might achieve better results on your projects.
These are largely meant to give you a feel for the material, but some of them could lead to projects and help you with future work for the course. These are not for credit.
Try out some pipelines of reweighting, vsm.lsa
at various dimensions, and TorchAutoencoder
to see which seems best according to your sampling around with vsm.neighbors
and high-level visualization with vsm.tsne_viz
. Feel free to use other factorization methods defined in sklearn.decomposition as well.
What happens if you set k=1
using vsm.lsa
? What do the results look like then? What do you think this first (and now only) dimension is capturing?
Modify vsm.glove
so that it uses the AdaGrad optimization method as in the original paper. It's fine to use the authors' implementation, Jon Gauthier's implementation, or the mittens Numpy implementation as references, but you might enjoy the challenge of doing this with no peeking at their code.