Distributional word representations

In [1]:
__author__ = "Christopher Potts"
__version__ = "CS224u, Stanford, Spring 2016 term"

Overview

This codelab is about methods for creating effective vector representations of words from co-occurrence patterns in text. These are often called distributional representations, and the models are often called vector space models (VSMs).

Terminological notes:

  • Distributional representations are a specific kind of distributed representation. Later in the term, we'll look at models that use supervised learning to obtain vector-based word representations. These aren't purely distributional, in that they take advantage of more than just co-occurrence patterns among items in the vocabulary, but they share the idea that words can be modeled with vectors.

  • If a neural network is used to train the representations, then they might be called neural representations.

  • The term word embedding is also used for distributed representations, including distributional ones. This term is a reminder that vector representations are meaningful only when embedded in and compared with others in a unified space (usually a matrix) of representations of the same type.

Set-up

  • Make sure your environment includes all the requirements for the cs224u repository.
  • Download the data distribution for this unit, unpack it, and place it in the directory containing the course repository. (If you want to put it somewhere else, change vsmdata_home below.)
  • Download the Wikipedia 2014 + Gigaword 5 distribution of the pretrained GloVe vectors, unzip it, and put the resulting folder in the same directory as this notebook. (If you want to put it somewhere else, change glove_home below.)
In [2]:
vsmdata_home = "vsmdata"
glove_home = "glove.6B"
In [3]:
import os
import sys
import csv
import random
import itertools
from operator import itemgetter
from collections import defaultdict
import numpy as np
import scipy
import scipy.spatial.distance
from numpy.linalg import svd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import utils
In [4]:
%matplotlib inline

Distributional matrices

Here's a basic recipe for building a word $\times$ word matrix:

  1. Define a notion of co-occurrence context. This could be an entire document, a paragraph, a sentence, a clause, an NP — whatever domain seems likely to capture the associations you care about.
  2. Scan through your corpus building a dictionary $d$ mapping word-pairs to counts. Every time a pair of words $w$ and $w'$ occurs in the same context (as you defined it in 1), increment $d[(w, w')]$ by $1$.
  3. Using the count dictionary $d$ that you collected in 2, establish your full vocabulary $V$, an ordered list of word types. For large collections of documents, $|V|$ will typically be huge. You will probably want to winnow the vocabulary at this point. You might do this by filtering to a specific subset, or just imposing a minimum count threshold. You might impose a minimum count threshold even if $|V|$ is small — for words with very low counts, you simply don't have enough evidence to say anything interesting.
  4. Now build a matrix $M$ of dimension $|V| \times |V|$. Both the rows and the columns of $M$ represent words. Each cell $M[i, j]$ is filled with the count $d[(w_i, w_j)]$.
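
As a quick illustration of this recipe, here is a minimal sketch using the modules imported above and a tiny made-up corpus. The names toy_corpus, vocab, and M are purely illustrative (they are not part of the data distribution); the contexts are whole token lists, and self-pairs are counted:

toy_corpus = [
    ['it', 'was', 'a', 'superb', 'film'],
    ['it', 'was', 'a', 'good', 'film']]

# Step 2: count word-pair co-occurrences within each context:
d = defaultdict(int)
for context in toy_corpus:
    for w, w_prime in itertools.product(context, repeat=2):
        d[(w, w_prime)] += 1

# Step 3: establish the vocabulary (no count thresholding for this tiny example):
vocab = sorted(set(w for (w, _) in d))

# Step 4: fill the |V| x |V| matrix of counts:
M = np.zeros((len(vocab), len(vocab)))
for i, w in enumerate(vocab):
    for j, w_prime in enumerate(vocab):
        M[i, j] = d[(w, w_prime)]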

For different designs, the procedure differs slightly. For example, if you are building a word $\times$ document matrix, then the rows of $M$ represent words and the columns of $M$ represent documents. The scan in step 2 then just keeps track of (word, document) pairs — compiling the number of times that each word appears in each document. Such matrices are often used in information retrieval, because the columns are multi-set representations of documents. They are much sparser than the word $\times$ word matrices we will work with here. (In my experience, they yield lower-quality lexicons, but others have reported good results with them.)

The data distribution includes two pre-computed matrices of co-occurrence counts in IMDB movie reviews. The build function in the utils module for this repository allows you to read them in:

Let's read these in now for use in later examples:

In [5]:
ww = utils.build(os.path.join(vsmdata_home, 'imdb-wordword.csv'))
wd = utils.build(os.path.join(vsmdata_home, 'imdb-worddoc.csv'))

There are some great pre-computed matrices available online too. These aren't matrices of counts, but rather more abstract values computed using methods like those under discussion here. Just for kicks, let's load in some GloVe vectors:

In [6]:
glv = utils.build_glove(os.path.join(glove_home, 'glove.6B.50d.txt'))

Vector comparison

Vector comparisons form the heart of our analyses in this context. For the most part, we are interested in measuring the distance between vectors. The guiding idea is that semantically related words should be close together in the vector spaces we build, and semantically unrelated words should be far apart.

The scipy.spatial.distance module has a lot of vector comparison methods, so you might check them out if you want to go beyond the functions defined and explored here. Read the documentation closely, though: many of those methods are defined only for binary vectors, whereas the VSMs we'll use allow all float values in principle.

Euclidean distance

The most basic and intuitive distance measure between vectors is Euclidean distance. The Euclidean distance between two vectors $u$ and $v$ of dimension $n$ is

$$\sqrt{\sum_{i=1}^{n} |u_{i}-v_{i}|^2}$$

In two dimensions, this corresponds to the length of the most direct line between the two points.

Here, we just rely on scipy to define it:

In [7]:
def euclidean(u, v):    
    """Eculidean distance between 1d np.arrays `u` and `v`, which must 
    have the same dimensionality. Returns a float."""
    # Use scipy's method:
    return scipy.spatial.distance.euclidean(u, v)
    # Or define it yourself:
    # return vector_length(u - v)    

The comment above shows how to define this measure yourself. The function used there is the length of a vector $u$ of dimension $n$, which is defined as

$$\|u\| = \sqrt{\sum_{i=1}^{n} u_{i}^{2}}$$

Here's the code:

In [8]:
def vector_length(u):
    """Length (L2) of the 1d np.array `u`. Returns a new np.array with the 
    same dimensions as `u`."""
    return np.sqrt(np.dot(u, u))

Here's the tiny vector space from the screencast on vector comparisons associated with this notebook:

In [9]:
ABC = np.array([
    [ 2.0,  4.0],  # A
    [10.0, 15.0],  # B
    [14.0, 10.0]]) # C

def plot_ABC(m):
    plt.plot(m[:,0], m[:,1], marker='', linestyle='')
    plt.xlim([0,np.max(m)*1.2])
    plt.ylim([0,np.max(m)*1.2])
    for i, x in enumerate(['A','B','C']):
        plt.annotate(x, m[i,:])

plot_ABC(ABC)

The euclidean distances align well with the raw visual distance in the plot:

In [10]:
euclidean(ABC[0], ABC[1])
Out[10]:
13.601470508735444
In [11]:
euclidean(ABC[1], ABC[2])
Out[11]:
6.4031242374328485

However, suppose we think of the vectors as word meanings in the vector-space sense. In that case, the values don't look good: the distributions of B and C are more or less directly opposed, suggesting very different meanings, whereas A and B are rather closely aligned, abstracting away from the fact that the first is far less frequent than the second. In terms of the large models we will soon explore, A and B resemble a pair like superb and good, which have similar meanings but very different frequencies. In contrast, B and C are like good and disappointing — similar overall frequencies but different distributions with respect to the overall vocabulary.

Length normalization

These affinities are immediately apparent if we normalize the vectors by their length. To do this, we use vector_length to define length_norm:

In [12]:
def length_norm(u):
    """L2 norm of the 1d np.array `u`. Returns a float."""
    return u / vector_length(u)
In [13]:
plot_ABC(np.array([length_norm(row) for row in ABC]))

Here, the connection between A and B is more apparent, as is the opposition between B and C.

Cosine distance

Cosine distance controls for differences in overall length; it depends only on the angle between the vectors. The cosine distance between two vectors $u$ and $v$ of dimension $n$ is

$$1 - \left(\frac{\sum_{i=1}^{n} u_{i} \cdot v_{i}}{\|u\|\cdot \|v\|}\right)$$

The similarity part of this (the righthand term of the subtraction) is the cosine of the angle between the two vectors. The result is the same (in terms of rank order) as one gets from first normalizing both vectors using length_norm and then calculating their Euclidean distance.

In [14]:
def cosine(u, v):        
    """Cosine distance between 1d np.arrays `u` and `v`, which must have 
    the same dimensionality. Returns a float."""
    # Use scipy's method:
    return scipy.spatial.distance.cosine(u, v)
    # Or define it yourself:
    # return 1.0 - (np.dot(u, v) / (vector_length(u) * vector_length(v)))
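
As a quick sanity check on the rank-order claim, we can compare cosine distance against the Euclidean distance between length-normalized vectors in the toy ABC space (a small illustrative loop using the functions defined above):

for name, (i, j) in [('A,B', (0, 1)), ('B,C', (1, 2)), ('A,C', (0, 2))]:
    print('%s   cosine: %0.4f   normalized euclidean: %0.4f' % (
        name,
        cosine(ABC[i], ABC[j]),
        euclidean(length_norm(ABC[i]), length_norm(ABC[j]))))

Both measures order the pairs the same way: (A, B) closest, then (B, C), then (A, C).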

Matching-based methods

Matching-based methods are also common in the literature. The basic matching measure takes the smaller of the two values at each coordinate and sums these minimums:

In [15]:
def matching(u, v):    
    """Matching coefficient between the 1d np.array vectors `u` and `v`, 
    which must have the same dimensionality. Returns a float."""
    # The scipy implementation is for binary vectors only. 
    # This version is more general.
    return np.sum(np.minimum(u, v))

One approach to normalizing the matching values is the Jaccard coefficient. The numerator is the matching coefficient. The denominator (the normalizer) is intuitively like the set union: for binary vectors, it gives the cardinality of the union of the two sets being compared:

In [16]:
def jaccard(u, v):    
    """Jaccard distance between the 1d np.arrays `u` and `v`, which must 
    have the same dimensionality. Returns a float."""
    # The scipy implementation is for binary vectors only. 
    # This version is more general.
    return 1.0 - (matching(u, v) / np.sum(np.maximum(u, v)))

Summary

Suppose we set for ourselves the goal of associating A with B and disassociating B from C, in keeping with the semantic intuition expressed above. Then we can assess distance measures by whether they achieve this goal:

In [17]:
for m in (euclidean, cosine, jaccard):
    fmt = {'n': m.__name__,  
           'AB': m(ABC[0], ABC[1]), 
           'BC': m(ABC[1], ABC[2])}
    print('%(n)15s(A, B) = %(AB)5.2f %(n)15s(B, C) = %(BC)5.2f' % fmt)
      euclidean(A, B) = 13.60       euclidean(B, C) =  6.40
         cosine(A, B) =  0.01          cosine(B, C) =  0.07
        jaccard(A, B) =  0.76         jaccard(B, C) =  0.31

Distributional neighbors

The neighbors function is an investigative aide. For a given word, it ranks all the words in the vocabulary rownames according to their distance from word, as measured by distfunc in matrix mat:

In [18]:
def neighbors(word, mat, rownames, distfunc=cosine):    
    """Tool for finding the nearest neighbors of `word` in `mat` according 
    to `distfunc`. The comparisons are between row vectors.
    
    Parameters
    ----------
    word : str
        The anchor word. Assumed to be in `rownames`.
        
    mat : np.array
        The vector-space model.
        
    rownames : list of str
        The rownames of mat.
            
    distfunc : function mapping vector pairs to floats (default: `cosine`)
        The measure of distance between vectors. Can also be `euclidean`, 
        `matching`, `jaccard`, as well as any other distance measure  
        between 1d vectors.
        
    Raises
    ------
    ValueError
        If word is not in rownames.
    
    Returns
    -------    
    list of tuples
        The list is ordered by closeness to `word`. Each member is a pair 
        (word, distance) where word is a str and distance is a float.
    
    """
    if word not in rownames:
        raise ValueError('%s is not in this VSM' % word)
    w = mat[rownames.index(word)]
    dists = [(rownames[i], distfunc(w, mat[i])) for i in range(len(mat))]
    return sorted(dists, key=itemgetter(1), reverse=False)

By playing around with this function, you can start to get a sense for how the distance functions differ. Here are some example calls; you might try some new words to get a feel for what these matrices are like and how different words look.

In [19]:
neighbors(word='superb', mat=ww[0], rownames=ww[1], distfunc=cosine)[: 5]
Out[19]:
[('superb', 0.0),
 ('excellent', 0.0026965023912962627),
 ('outstanding', 0.0027344413235226295),
 ('beautifully', 0.0027345163104325332),
 ('brilliant', 0.0027888643627086429)]
In [20]:
neighbors(word='superb', mat=ww[0], rownames=ww[1], distfunc=euclidean)[: 5]
Out[20]:
[('superb', 0.0),
 ('familiar', 1448.8919904533948),
 ('violent', 1630.3723501090174),
 ('follows', 1647.0276257549538),
 ('convincing', 1701.2260284865147)]

The above rankings actually tend to look pretty good, with cosine less likely than euclidean to associate words simply because they happen to have similar overall frequencies.

The GloVe vectors look even better — but they are based on much more than just raw counts, as we'll see soon:

In [21]:
neighbors(word='superb', mat=glv[0], rownames=glv[1], distfunc=cosine)[: 5]
Out[21]:
[('superb', 2.2204460492503131e-16),
 ('brilliant', 0.15809110259014769),
 ('impressive', 0.19352861376442676),
 ('masterful', 0.22871323564771928),
 ('excellent', 0.22928471014596696)]

Matrix reweighting

The goal of reweighting is to amplify the important, trustworthy, and unusual, while deemphasizing the mundane and the quirky. Absent a defined objective function, this will remain fuzzy, but the intuition behind moving away from raw counts is that frequency is a poor proxy for our target semantic ideas.

Normalization

Normalization (row-wise or column-wise) is perhaps the simplest form of reweighting. length_norm, defined above, normalizes a vector by its vector_length. We can also normalize each row by the sum of its values, which turns each row into a probability distribution over the columns:

In [22]:
def prob_norm(u):
    """Normalize 1d np.array `u` into a probability distribution. Assumes 
    that all the members of `u` are positive. Returns a 1d np.array of 
    the same dimensionality as `u`."""
    return u / np.sum(u)

These normalization measures are insensitive to the magnitude of the underlying counts. This is often a mistake in the messy world of large data sets; $[1,10]$ and $[1000,10000]$ are very different in ways that will be partly or totally obscured by normalization.
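
For instance, prob_norm maps those two example vectors to exactly the same distribution:

prob_norm(np.array([1.0, 10.0]))        # array([ 0.0909...,  0.9090...])
prob_norm(np.array([1000.0, 10000.0]))  # same values; the scale is gone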

Pointwise Mutual Information

Pointwise Mutual Information (PMI) addresses this issue, at least in part. The PMI for word $\times$ context pair $(w,c)$ is

$$\log\left(\frac{P(w,c)}{P(w) \cdot P(c)}\right)$$

with $\log(0) = 0$. This is a measure of how far that cell's value deviates from what we would expect given the row and column sums for that cell.
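
For example, if $P(w,c) = 0.02$ while $P(w) = P(c) = 0.1$, then the expected joint probability under independence is $P(w) \cdot P(c) = 0.01$, so the PMI is $\log(2) \approx 0.69$: the pair co-occurs about twice as often as chance would predict.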

Positive PMI (PPMI) maps all negative PMI values to 0.0. Our function pmi has positive=True as a default, in light of the arguments in Levy and Goldberg 2014, section 3.3.

In [23]:
def pmi(mat, rownames=None, positive=True):
    """Pointwise Mutual Information with Positive on by default.
    
    Parameters
    ----------
    mat : 2d np.array
       The matrix to operate on.
           
    rownames : list of str or None
        Not used; it's an argument only for consistency with other methods 
        defined here.
        
    positive : bool (default: True)
        Implements Positive PMI.
        
    Returns
    -------    
    (np.array, list of str)
       The first member is the PMI-transformed version of `mat`, and the 
       second member is `rownames` (unchanged).
    
    """    
    # Joint probability table:
    p = mat / np.sum(mat, axis=None)
    # Pre-compute column sums:
    colprobs = np.sum(p, axis=0)
    # Vectorize this function so that it can be applied rowwise:
    np_pmi_log = np.vectorize((lambda x : _pmi_log(x, positive=positive)))
    p = np.array([np_pmi_log(row / (np.sum(row)*colprobs)) for row in p])   
    return (p, rownames)

def _pmi_log(x, positive=True):
    """With `positive=False`, return log(x) if possible, else 0.
    With `positive=True`, log(x) is mapped to 0 where negative."""
    val = 0.0
    if x > 0.0:
        val = np.log(x)
    if positive:
        val = max([val,0.0])
    return val

Here, we reweight the word $\times$ word IMDB matrix from above using PPMI:

In [24]:
ww_ppmi = pmi(mat=ww[0], rownames=ww[1], positive=True)
In [25]:
neighbors(word='superb', mat=ww_ppmi[0], rownames=ww_ppmi[1], distfunc=cosine)[: 5]
Out[25]:
[('superb', 0.0),
 ('excellent', 0.41348274842943578),
 ('performances', 0.44391628568702501),
 ('brilliant', 0.45785117509986173),
 ('performance', 0.46856555779383202)]

TF-IDF

Perhaps the best known reweighting scheme is Term Frequency–Inverse Document Frequency (TF-IDF), which is, I believe, still the backbone of today's Web search technologies. As the name suggests, it is built from TF and IDF measures:

For a word $w$ and collection of documents $D$ containing document $d$:

  • TF$(w,d)$: $P(w \mid d)$. (In our VSMs, this is column-normalization using prob_norm.)
  • IDF$(w,D)$: $\log\left(\frac{|D|}{|\{d \in D : w \in d\}|}\right)$, where $\log(0)=0$.
  • TFIDF$(w,d,D)$: TF$(w,d)$ * IDF$(w,D)$
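
For example, if a word accounts for 10% of the tokens in document $d$ (TF$(w,d) = 0.1$) and occurs in 10 of the 1,000 documents in $D$, then IDF$(w,D) = \log(100) \approx 4.6$, so TFIDF$(w,d,D) \approx 0.46$.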
In [26]:
def tfidf(mat, rownames=None):
    """TF-IDF 
    
    Parameters
    ----------
    mat : 2d np.array
       The matrix to operate on.
       
    rownames : list of str or None
        Not used; it's an argument only for consistency with other methods 
        defined here.
        
    Returns
    -------
    (np.array, list of str)    
       The first member is the TF-IDF-transformed version of `mat`, and 
       the second member is `rownames` (unchanged).
    
    """
    colsums = np.sum(mat, axis=0)
    doccount = mat.shape[1]
    w = np.array([_tfidf_row_func(row, colsums, doccount) for row in mat])
    return (w, rownames)

def _tfidf_row_func(row, colsums, doccount):
    df = float(len([x for x in row if x > 0]))
    idf = 0.0
    # This ensures a defined IDF value >= 0.0:
    if df > 0.0 and df != doccount:
        idf = np.log(doccount / df)
    tfs = row/colsums
    return tfs * idf

TF-IDF generally performs best with sparse matrices. It severely punishes words that appear in many documents; if a word appears in every document, then its IDF value is 0. As a result, it can even be problematic with very dense word $\times$ word matrices like ww, where most words appear with most other words due to the permissive notion of co-occurrence used to create it.

Here's an example using our word $\times$ document matrix wd:

In [27]:
wd_tfidf = tfidf(mat=wd[0], rownames=wd[1])
In [28]:
neighbors(word='superb', mat=wd_tfidf[0], rownames=wd_tfidf[1], distfunc=cosine)[: 5]
Out[28]:
[('superb', 1.1102230246251565e-16),
 ('outstanding', 0.72256301656130362),
 ('remain', 0.73606093603489886),
 ('viewed', 0.74639986506462108),
 ('and', 0.74880553661599958)]

For a more full-featured version of TF-IDF, see sklearn.feature_extraction.text.TfidfTransformer.

Important: sklearn's version assumes that term frequency (TF) is defined row-wise and document frequency is defined column-wise. That is, it assumes sklearn's basic document $\times$ word design, which makes sense for classification tasks, where the design is example $\times$ features. This is the reverse of the design used above.
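
If you want to try sklearn's version on the matrices used here, one option is to transpose into its document $\times$ word orientation and then transpose the result back. This is just a sketch (the name wd_tfidf_sk is illustrative, and sklearn's smoothing and normalization defaults mean the values will not exactly match the tfidf function above):

from sklearn.feature_extraction.text import TfidfTransformer

transformer = TfidfTransformer()
# wd[0] is word x document; sklearn expects document x word, so transpose
# going in and transpose the (dense) result coming back out:
wd_tfidf_sk = (transformer.fit_transform(wd[0].T).toarray().T, wd[1])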

Dimensionality reduction

The above methods deliver solid results. However, they are not capable of capturing higher-order associations in the data. For example, both gnarly and wicked are used as slangily positive adjectives. We thus expect them to have many of the same neighbors. However, at least stereotypically, gnarly is Californian and wicked is Bostonian. Thus, they are unlikely to occur often in the same texts. Dimensionality reduction techniques are often capable of capturing their semantic similarity (and have the added advantage of shrinking the size of our data structures).

The general goal of dimensionality reduction is to eliminate rows/columns that are highly correlated while bringing similar things together and pushing dissimilar things apart. Latent Semantic Analysis (LSA) is a prominent method. It is an application of truncated singular value decomposition (SVD). SVD is a central matrix operation; 'truncation' here means looking only at submatrices of the full decomposition. LSA seeks not only to find a reduced-sized matrix but also to capture similarities that come not just from direct co-occurrence, but also from second-order co-occurrence.

In [29]:
def lsa(mat=None, rownames=None, k=100):
    """Latent Semantic Analysis using pure scipy.
    
    Parameters
    ----------
    mat : 2d np.array
       The matrix to operate on.
           
    rownames : list of str or None
        Not used; it's an argument only for consistency with other methods 
        defined here.
        
    k : int (default: 100)
        Number of dimensions to truncate to.
        
    Returns
    -------    
    (np.array, list of str)
        The first member is the SVD-reduced version of `mat` with 
        dimension (m x k), where m is the rowcount of mat and `k` is 
        either the user-supplied k or the column count of `mat`, whichever 
        is smaller. The second member is `rownames` (unchanged).

    """    
    rowmat, singvals, colmat = svd(mat, full_matrices=False)
    singvals = np.diag(singvals)
    trunc = np.dot(rowmat[:, 0:k], singvals[0:k, 0:k])
    return (trunc, rownames)

Here's a look at the example from the slides:

In [30]:
gnmat = np.array([
    [1,0,1,0,0,0],
    [0,1,0,1,0,0],
    [1,1,1,1,0,0],
    [0,0,0,0,1,1],
    [0,0,0,0,0,1]], dtype='float64')
gn_rownames = ['gnarly', 'wicked', 'awesome', 'lame', 'terrible']
In [31]:
neighbors(word='gnarly', mat=gnmat, rownames=gn_rownames)
Out[31]:
[('gnarly', 2.2204460492503131e-16),
 ('awesome', 0.29289321881345254),
 ('wicked', 1.0),
 ('lame', 1.0),
 ('terrible', 1.0)]

We see that gnarly and wicked are not close to each other. (Well, it's a small space, but they are as close as gnarly and lame.) Reweighting by PMI, PPMI, or TF-IDF is no help. LSA to the rescue:

In [32]:
gnmat_lsa = lsa(mat=gnmat, rownames=gn_rownames, k=2)
In [33]:
neighbors(word='gnarly', mat=gnmat_lsa[0], rownames=gnmat_lsa[1])
Out[33]:
[('gnarly', 0.0),
 ('wicked', 0.0),
 ('awesome', 0.0),
 ('lame', 1.0),
 ('terrible', 1.0)]

The sklearn.decomposition module contains an implementation of LSA (TruncatedSVD) that you might want to switch to for real experiments:

  • The sklearn version is more flexible than the above in that it can operate on both dense matrices (Numpy arrays) and sparse matrices (from Scipy).
  • The sklearn version will make it easy to try out other dimensionality reduction methods in your own code; Principal Component Analysis (PCA) and Non-Negative Matrix Factorization (NMF) are definitely worth a look.
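
As a rough sketch of the switch (the wrapper name lsa_sklearn is just illustrative; TruncatedSVD's randomized solver means the values may differ slightly from the lsa function above, though the geometry is essentially the same):

from sklearn.decomposition import TruncatedSVD

def lsa_sklearn(mat, rownames=None, k=100):
    """Sketch of an sklearn-based alternative to `lsa` above."""
    tsvd = TruncatedSVD(n_components=k)
    return (tsvd.fit_transform(mat), rownames)

# For example: gnmat_lsa_sk = lsa_sklearn(gnmat, rownames=gn_rownames, k=2)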

Visualization

You can begin to get a feel for what your matrix is like by poking around with the neighbors function to see who is close to or far from whom. But this kind of sampling is unlikely to lead to robust new insights, unless you luck out and start to see an interesting cluster of associations developing.

t-Distributed Stochastic Neighbor Embedding (t-SNE) is a powerful method for visualizing high-dimensional vector spaces in 2d. It allows you to find associations in an intuitive way, to guide later and more precise investigations. sklearn now contains a high-performance implementation: sklearn.manifold.TSNE.

In [34]:
def tsne_viz(
        mat,
        rownames,
        colors=None,
        output_filename=None,
        figheight=40,
        figwidth=50):     
    """2d plot of `mat` using t-SNE, with the points labeled by `rownames`, 
    aligned with `colors` (defaults to all black).
    
    Parameters
    ----------    
    mat : 2d np.array
        The matrix to visualize.
        
    rownames : list of str
        Names of the points to visualize.
                
    colors : list of colornames or None (default: None)
        Optional list of colors for rownames. The color names just need to 
        be interpretable by matplotlib. If they are supplied, they need to 
        have the same length as rownames. 
        If `colors=None`, then all the words are displayed in black.
      
    output_filename : str (default: None)
        If not None, then the output image is written to this location. The 
        filename suffix determines the image type. If None, then 
        `plt.plot()` is called, with the behavior determined by the 
        environment.
        
    figheight : int (default: 40)
        Height in display units of the output.
            
    figwidth : int (default: 50)
        Width in display units of the output.
        
    """
    indices = list(range(len(rownames)))
    # Colors:
    if not colors:
        colors = ['black' for i in indices]    
    # Recommended reduction via PCA or similar:
    n_components = 50 if mat.shape[1] >= 50 else mat.shape[1]
    dimreduce = PCA(n_components=n_components)
    mat = dimreduce.fit_transform(mat)
    # t-SNE:
    tsne = TSNE(n_components=2, random_state=0)
    np.set_printoptions(suppress=True)    
    tsnemat = tsne.fit_transform(mat) 
    # Plot values:
    vocab = np.array(rownames)[indices]
    xvals = tsnemat[indices, 0] 
    yvals = tsnemat[indices, 1]
    # Plotting:
    fig, ax = plt.subplots(nrows=1, ncols=1)
    fig.set_figheight(figheight)
    fig.set_figwidth(figwidth)
    ax.plot(xvals, yvals, marker='', linestyle='')
    # Text labels:
    for word, x, y, color in zip(vocab, xvals, yvals, colors):
        ax.annotate(word, (x, y), fontsize=8, color=color)
    # Output:
    if output_filename:
        plt.savefig(output_filename, bbox_inches='tight')
    else:
        plt.show()

Here's the code for running this on ww_ppmi using the default settings:

In [35]:
tsne_viz(mat=ww_ppmi[0], rownames=ww_ppmi[1])