Bake-off: Word analogies

Important: This isn't being run as a bake-off this year. It's included in the repository in case people want to do additional exploration or incorporate this kind of evaluation into a project.

In [1]:
__author__ = "Christopher Potts"
__version__ = "CS224u, Stanford, Spring 2018 term"

Overview

Word analogies provide another kind of evaluation for distributed representations. Here, we are given three vectors A, B, and C, in the relationship

A is to B as C is to __

and asked to identify the fourth term that completes the analogy. This section conducts such analyses using a large, automatically collected analogies dataset from Google. These analogies are by and large substantially easier than the classic brain-teaser analogies that used to appear on tests like the SAT, but the task is still interesting and demanding.

The core idea is that we make predictions by creating the vector

$$(B - A) + C$$

and then ranking all vectors based on their distance from this new vector, choosing the closest as our prediction.
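
To make the arithmetic concrete, here is a minimal sketch with made-up 3-dimensional vectors. The words and values are invented for illustration, and cosine_dist is just a stand-in for vsm.cosine:

import numpy as np

A = np.array([1.0, 0.0, 0.0])   # toy vector for "good"
B = np.array([1.0, 1.0, 0.0])   # toy vector for "great"
C = np.array([0.0, 0.0, 1.0])   # toy vector for "bad"

# The prediction target is the vector (B - A) + C:
target = (B - A) + C

def cosine_dist(u, v):
    # Stand-in for vsm.cosine: 1 minus the cosine similarity.
    return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

candidates = {
    "awful":    np.array([0.1, 1.0, 1.0]),
    "pleasant": np.array([1.0, 0.9, 0.1])}

# Rank candidates by distance to the target; the closest is the prediction:
print(min(candidates, key=lambda w: cosine_dist(target, candidates[w])))  # awful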

Set-up

In [2]:
from collections import defaultdict
import csv
import numpy as np
import os
import pandas as pd
from scipy.stats import spearmanr
import vsm
In [3]:
data_home = 'vsmdata'

analogies_home = os.path.join(data_home, 'question-data')

Analogy completion

The function analogy_completion implements the analogy calculation on VSMs.

In [4]:
def analogy_completion(a, b, c, df, distfunc=vsm.cosine):
    """a is to be as c is to predicted, where predicted is the 
    closest to (b-a) + c"""
    for x in (a, b, c):
        if x not in df.index:
            raise ValueError('{} is not in this VSM'.format(x))
    avec = df.loc[a]
    bvec = df.loc[b]
    cvec = df.loc[c]
    newvec = (bvec - avec) + cvec
    # Distance from the predicted vector to every row in the VSM:
    dists = df.apply(lambda row: distfunc(newvec, row), axis=1)
    # Exclude the three given words from the ranking:
    dists = dists.drop([a, b, c])
    return dists.sort_values()
In [5]:
imdb20 = pd.read_csv(
    os.path.join(data_home, "imdb_window20-flat.csv.gz"), index_col=0)
In [6]:
imdb_ppmi = vsm.pmi(imdb20)
In [7]:
x = analogy_completion("good", "great", "bad", imdb_ppmi)
In [8]:
x.head()
Out[8]:
awful       0.589598
terrible    0.603033
acting      0.604288
horrible    0.634886
worst       0.648661
dtype: float64
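
The head of this ranking is the model's prediction. For example, with x as computed above:

x.index[0]  # 'awful': the vector closest to (great - good) + bad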

Evaluation

The function analogy_evaluation evaluates a VSM against one of the files in analogies_home. The default is gram1-adjective-to-adverb.txt, but there are many others to choose from in that directory. The calculations are somewhat slow, so you can use the verbose=True option to see each prediction as it is made.

In [9]:
def analogy_evaluation(
        df, 
        src_filename='gram1-adjective-to-adverb.txt', 
        distfunc=vsm.cosine,
        verbose=True):
    """Basic analogies evaluation for a file `src_filename `
    in `question-data/`.
    
    Parameters
    ----------    
    df : pd.DataFrame
        The VSM being evaluated.
    src_filename : str
        Basename of the file to be evaluated. It's assumed to be in
        `analogies_home`.        
    distfunc : function mapping vector pairs to floats (default: `cosine`)
        The measure of distance between vectors. Can also be `euclidean`, 
        `matching`, `jaccard`, as well as any other distance measure 
        between 1d vectors.
    
    Returns
    -------
    (float, defaultdict)
        The first is the mean reciprocal rank of the predictions, and
        the second maps True/False to counts of correct and incorrect
        predictions.
    
    """
    src_filename = os.path.join(analogies_home, src_filename)
    # Read in the data and restrict to problems we can solve:
    with open(src_filename) as f:    
        data = [line.split() for line in f.read().splitlines()]
    data = [prob for prob in data if set(prob) <= set(df.index)]
    # Run the evaluation, collecting accuracy and rankings:
    results = defaultdict(int)
    ranks = []
    for a, b, c, d in data:
        ranking = analogy_completion(a, b, c, df=df, distfunc=distfunc)       
        predicted = ranking.index[0]
        # Tally correct (True) vs. incorrect (False) predictions:
        results[predicted == d] += 1
        # Rank of actual, starting at 1:
        rank = ranking.index.get_loc(d) + 1
        ranks.append(rank)        
        if verbose:
            print("{} is to {} as {} is to {} (gold: {} at rank {})".format(
                a, b, c, predicted, d, rank))        
    # Return the mean reciprocal rank and the prediction tallies:
    mrr = np.mean(1.0 / np.array(ranks))
    return (mrr, results)
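
As a quick illustration of the mean reciprocal rank (a toy calculation with hypothetical ranks, not output from the function above): if the gold completions for three problems were ranked 1, 2, and 4, then

ranks = np.array([1, 2, 4])
mrr = np.mean(1.0 / ranks)  # (1/1 + 1/2 + 1/4) / 3 = 0.5833...

Accuracy counts only rank-1 predictions, so MRR gives partial credit for near misses.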

Baseline

My baseline is PPMI on imdb20 as loaded above:

In [10]:
analogy_evaluation(imdb_ppmi, src_filename='family.txt', verbose=False)
Out[10]:
(0.5446665665352782, defaultdict(int, {False: 297, True: 209}))
In [11]:
analogy_evaluation(imdb_ppmi, src_filename='gram1-adjective-to-adverb.txt', verbose=False)
Out[11]:
(0.14621311421118469, defaultdict(int, {False: 904, True: 88}))
In [12]:
analogy_evaluation(imdb_ppmi, src_filename='gram2-opposite.txt', verbose=False)
Out[12]:
(0.11918977012507004, defaultdict(int, {False: 756, True: 56}))

Bake-off submission

Submit the following:

  1. The name of the count matrix you started with (must be one in vsmdata).
  2. A description of the steps you took to create your bake-off VSM; it must be different from the baseline above (one possible recipe is sketched after this list).
  3. Your mean reciprocal rank scores for the following files in analogies_home:
    • 'family.txt'
    • 'gram1-adjective-to-adverb.txt'
    • 'gram2-opposite.txt'
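
For item 2, here is one possible recipe, only as an illustration and not a tuned submission. It assumes that vsm.lsa takes a dimensionality argument k, as in the course's vsm module:

# PPMI reweighting followed by LSA to 100 dimensions (assumed vsm.lsa signature):
imdb_ppmi_lsa = vsm.lsa(vsm.pmi(imdb20), k=100)

for fname in ('family.txt', 'gram1-adjective-to-adverb.txt', 'gram2-opposite.txt'):
    mrr, _ = analogy_evaluation(imdb_ppmi_lsa, src_filename=fname, verbose=False)
    print(fname, mrr)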