Bake-off: Word analogies

Important: This isn't being run as a bake-off this year. It's included in the repository in case people want to do additional exploration or incorporate this kind of evaluation into a project.

In [1]:
__author__ = "Christopher Potts"
__version__ = "CS224u, Stanford, Spring 2018 term"


Word analogies provide another kind of evaluation for distributed representations. Here, we are given three vectors A, B, and C, in the relationship

A is to B as C is to __

and asked to identify the fourth that completes the analogy. This section conducts such analyses using a large, automatically collected analogies dataset from Google. These analogies are by and large substantially easier than the classic brain-teaser analogies that used to appear on tests like the SAT, but it's still an interesting, demanding task.

The core idea is that we make predictions by creating the vector

(B - A) + C

and then ranking all vectors by their distance from this new vector, choosing the closest as our prediction.
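
To make the arithmetic concrete, here is a minimal self-contained sketch with a made-up 2d vocabulary (the words and vector values are invented purely for illustration; the notebook code below does the real version against a VSM with vsm.cosine):

import numpy as np
from scipy.spatial.distance import cosine

# Toy 2d vectors, invented for illustration:
vocab = {
    "man":   np.array([1.0, 0.0]),
    "woman": np.array([1.0, 1.0]),
    "king":  np.array([3.0, 0.0]),
    "queen": np.array([3.0, 1.0]),
    "apple": np.array([0.0, 3.0])}

# "man is to woman as king is to __": form (B - A) + C ...
newvec = (vocab["woman"] - vocab["man"]) + vocab["king"]

# ... then rank the remaining words by distance from `newvec`:
candidates = [w for w in vocab if w not in ("man", "woman", "king")]
candidates.sort(key=lambda w: cosine(newvec, vocab[w]))
print(candidates[0])  # 'queen'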


In [2]:
from collections import defaultdict
import csv
import numpy as np
import os
import pandas as pd
from scipy.stats import spearmanr
import vsm
In [3]:
data_home = 'vsmdata'

analogies_home = os.path.join(data_home, 'question-data')

Analogy completion

The function analogy_completion implements the analogy calculation on VSMs.

In [4]:
def analogy_completion(a, b, c, df, distfunc=vsm.cosine):
    """a is to b as c is to predicted, where predicted is the
    closest to (b-a) + c"""
    for x in (a, b, c):
        if x not in df.index:
            raise ValueError('{} is not in this VSM'.format(x))
    avec = df.loc[a]
    bvec = df.loc[b]
    cvec = df.loc[c]
    newvec = (bvec - avec) + cvec
    # Distance from `newvec` to every word vector (row) in the VSM:
    dists = df.apply(lambda row: distfunc(newvec, row), axis=1)
    # Exclude the three given words from the ranking:
    dists = dists.drop([a, b, c])
    return pd.Series(dists).sort_values()
In [5]:
imdb20 = pd.read_csv(
    os.path.join(data_home, "imdb_window20-flat.csv.gz"), index_col=0)
In [6]:
imdb_ppmi = vsm.pmi(imdb20)
In [7]:
x = analogy_completion("good", "great", "bad", imdb_ppmi)
In [8]:
x.head()
Out[8]:
awful       0.589598
terrible    0.603033
acting      0.604288
horrible    0.634886
worst       0.648661
dtype: float64
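
The distfunc parameter lets you experiment with other measures. For instance, assuming vsm also defines euclidean (as the docstring of analogy_evaluation below indicates), you could rank by Euclidean distance instead:

analogy_completion("good", "great", "bad", imdb_ppmi, distfunc=vsm.euclidean).head()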


The function analogy_evaluation evaluates a VSM against one of the files in analogies_home. The default is gram1-adjective-to-adverb.txt, but there are many to choose from. The calculations are somewhat slow, so you can use the verbose=True option to see the predictions as they are made.

In [9]:
def analogy_evaluation(
        df,
        src_filename='gram1-adjective-to-adverb.txt',
        distfunc=vsm.cosine,
        verbose=True):
    """Basic analogies evaluation for a file `src_filename`
    in `question-data/`.

    Parameters
    ----------
    df : pd.DataFrame
        The VSM being evaluated.
    src_filename : str
        Basename of the file to be evaluated. It's assumed to be in
        `analogies_home`.
    distfunc : function mapping vector pairs to floats (default: `cosine`)
        The measure of distance between vectors. Can also be `euclidean`,
        `matching`, `jaccard`, as well as any other distance measure
        between 1d vectors.
    verbose : bool
        If True, print each prediction as it is made.

    Returns
    -------
    (float, defaultdict)
        The first is the mean reciprocal rank of the predictions and
        the second maps True/False to counts of correct and incorrect
        predictions.

    """
    src_filename = os.path.join(analogies_home, src_filename)
    # Read in the data and restrict to problems we can solve:
    with open(src_filename) as f:
        data = [line.split() for line in f]
    data = [prob for prob in data if set(prob) <= set(df.index)]
    # Run the evaluation, collecting accuracy and rankings:
    results = defaultdict(int)
    ranks = []
    for a, b, c, d in data:
        ranking = analogy_completion(a, b, c, df=df, distfunc=distfunc)
        predicted = ranking.index[0]
        # Accuracy:
        results[predicted == d] += 1
        # Rank of the gold answer, starting at 1:
        rank = ranking.index.get_loc(d) + 1
        ranks.append(rank)
        if verbose:
            print("{} is to {} as {} is to {} (gold: {} at rank {})".format(
                a, b, c, predicted, d, rank))
    # Return the mean reciprocal rank and the accuracy results:
    mrr = np.mean(1.0 / np.array(ranks))
    return (mrr, results)
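
For reference: the mean reciprocal rank (MRR) over N problems is (1/N) * (1/rank_1 + ... + 1/rank_N), where rank_i is the position of the gold answer in the ranking for problem i, so it is 1.0 exactly when every gold answer is ranked first.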


My baseline is PPMI on imdb20 as loaded above:

In [10]:
analogy_evaluation(imdb_ppmi, src_filename='family.txt', verbose=False)
Out[10]:
(0.5446665665352782, defaultdict(int, {False: 297, True: 209}))
In [11]:
analogy_evaluation(imdb_ppmi, src_filename='gram1-adjective-to-adverb.txt', verbose=False)
Out[11]:
(0.14621311421118469, defaultdict(int, {False: 904, True: 88}))
In [12]:
analogy_evaluation(imdb_ppmi, src_filename='gram2-opposite.txt', verbose=False)
Out[12]:
(0.11918977012507004, defaultdict(int, {False: 756, True: 56}))
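
As one illustration of building a VSM that differs from this baseline, you might reduce the dimensionality of the PPMI matrix before evaluating. The following is only a sketch, and it assumes that vsm provides an lsa function (truncated SVD keeping k dimensions), as in the course's vsm.py:

imdb_ppmi_lsa = vsm.lsa(imdb_ppmi, k=100)  # assumes `vsm.lsa` as in the course code
analogy_evaluation(imdb_ppmi_lsa, src_filename='family.txt', verbose=False)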

Bake-off submission

A submission should consist of:

  1. The name of the count matrix you started with (must be one in vsmdata).
  2. A description of the steps you took to create your bake-off VSM – must be different from the above baseline.
  3. Your mean reciprocal rank scores for the following files in analogies_home (the sketch after this list shows one way to compute them):
    • 'family.txt'
    • 'gram1-adjective-to-adverb.txt'
    • 'gram2-opposite.txt'
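
Here is one way to collect the three required scores in a single pass, using the functions defined above (substitute your own bake-off VSM for imdb_ppmi):

for fname in ('family.txt', 'gram1-adjective-to-adverb.txt', 'gram2-opposite.txt'):
    mrr, results = analogy_evaluation(imdb_ppmi, src_filename=fname, verbose=False)
    print("{}: MRR = {:.4f}".format(fname, mrr))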