# Bake-off: Word analogies

Important: This isn't being run as a bake-off this year. It's included in the repository in case people want to do additional exploration or incorporate this kind of evaluation into a project.

In [1]:
__author__ = "Christopher Potts"
__version__ = "CS224u, Stanford, Spring 2018 term"


## Overview

Word analogies provide another kind of evaluation for distributed representations. Here, we are given three vectors A, B, and C, in the relationship

A is to B as C is to __

and asked to identify the fourth that completes the analogy. This section conducts such analyses using a large, automatically collected analogies dataset from Google. These analogies are by and large substantially easier than the classic brain-teaser analogies that used to appear on tests like the SAT, but it's still an interesting, demanding task.

The core idea is that we make predictions by creating the vector

$$(B - A) + C$$

and then ranking all word vectors by their distance from this new vector, choosing the closest as our prediction.
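The vector-offset idea can be seen in a toy example. The following is a minimal sketch with made-up 2d vectors (the vocabulary and coordinates here are hypothetical, chosen only to make the geometry visible); it mirrors the computation done by analogy_completion below, using scipy's cosine distance:

```python
import numpy as np
from scipy.spatial.distance import cosine

# Hypothetical 2d word vectors, for illustration only:
vocab = {
    "good":  np.array([1.0, 0.2]),
    "great": np.array([1.0, 0.9]),
    "bad":   np.array([-1.0, 0.2]),
    "awful": np.array([-1.0, 0.9]),
    "table": np.array([0.1, -1.0]),
}

a, b, c = "good", "great", "bad"

# The offset vector (b - a) + c:
newvec = (vocab[b] - vocab[a]) + vocab[c]

# Rank the remaining words by cosine distance to the offset vector:
candidates = {w: cosine(newvec, v) for w, v in vocab.items() if w not in (a, b, c)}
prediction = min(candidates, key=candidates.get)
print(prediction)  # prints "awful"
```

Here (great − good) captures an intensification direction, and adding it to bad lands closest to awful.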

## Set-up

In [2]:
from collections import defaultdict
import csv
import numpy as np
import os
import pandas as pd
from scipy.stats import spearmanr
import vsm

In [3]:
data_home = 'vsmdata'

analogies_home = os.path.join(data_home, 'question-data')


## Analogy completion

The function analogy_completion implements the analogy calculation on VSMs.

In [4]:
def analogy_completion(a, b, c, df, distfunc=vsm.cosine):
    """a is to b as c is to predicted, where predicted is the
    vocabulary item closest to (b - a) + c."""
    for x in (a, b, c):
        if x not in df.index:
            raise ValueError('{} is not in this VSM'.format(x))
    avec = df.loc[a]
    bvec = df.loc[b]
    cvec = df.loc[c]
    newvec = (bvec - avec) + cvec
    dists = df.apply(lambda row: distfunc(newvec, row), axis=1)
    dists = dists.drop([a, b, c])
    return dists.sort_values()

In [5]:
imdb20 = pd.read_csv(
    os.path.join(data_home, "imdb_window20-flat.csv.gz"), index_col=0)

In [6]:
imdb_ppmi = vsm.pmi(imdb20)

In [7]:
x = analogy_completion("good", "great", "bad", imdb_ppmi)

In [8]:
x.head()

Out[8]:
awful       0.589598
terrible    0.603033
acting      0.604288
horrible    0.634886
worst       0.648661
dtype: float64

## Evaluation

The function analogy_evaluation evaluates a VSM against one of the files in analogies_home. The default is gram1-adjective-to-adverb.txt, but there are many others to choose from. The calculations are somewhat slow, so you can use the verbose=True option to see the predictions as they are made.

In [9]:
def analogy_evaluation(
        df,
        src_filename='gram1-adjective-to-adverb.txt',
        distfunc=vsm.cosine,
        verbose=True):
    """Basic analogies evaluation for a file src_filename
    in question-data/.

    Parameters
    ----------
    df : pd.DataFrame
        The VSM being evaluated.
    src_filename : str
        Basename of the file to be evaluated. It's assumed to be in
        analogies_home.
    distfunc : function mapping vector pairs to floats (default: cosine)
        The measure of distance between vectors. Can also be euclidean,
        matching, jaccard, as well as any other distance measure
        between 1d vectors.

    Returns
    -------
    (float, defaultdict)
        The first member is the mean reciprocal rank of the predictions,
        and the second is a dict of True/False counts recording how many
        predictions were correct and incorrect.

    """
    src_filename = os.path.join(analogies_home, src_filename)
    # Read in the data and restrict to problems we can solve:
    with open(src_filename) as f:
        data = [line.split() for line in f.read().splitlines()]
    data = [prob for prob in data if set(prob) <= set(df.index)]
    # Run the evaluation, collecting accuracy and rankings:
    results = defaultdict(int)
    ranks = []
    for a, b, c, d in data:
        ranking = analogy_completion(a, b, c, df=df, distfunc=distfunc)
        predicted = ranking.index[0]
        # Accuracy:
        results[predicted == d] += 1
        # Rank of the gold answer, starting at 1:
        rank = ranking.index.get_loc(d) + 1
        ranks.append(rank)
        if verbose:
            print("{} is to {} as {} is to {} (gold: {} at rank {})".format(
                a, b, c, predicted, d, rank))
    # Return the mean reciprocal rank and the accuracy results:
    mrr = np.mean(1.0 / np.array(ranks))
    return (mrr, results)
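The mean reciprocal rank computed at the end averages 1/rank of the gold answer over all problems, so it rewards predictions near the top of the ranking even when they are not exactly right. A minimal sketch with hypothetical ranks:

```python
import numpy as np

# Hypothetical ranks of the gold answer for three analogy problems:
ranks = [1, 2, 4]

# Mean reciprocal rank, as in the final step of analogy_evaluation:
mrr = np.mean(1.0 / np.array(ranks))
print(mrr)  # (1 + 0.5 + 0.25) / 3 ≈ 0.5833
```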


## Baseline

My baseline is PPMI on imdb20 as loaded above:

In [10]:
analogy_evaluation(imdb_ppmi, src_filename='family.txt', verbose=False)

Out[10]:
(0.5446665665352782, defaultdict(int, {False: 297, True: 209}))

In [11]:
analogy_evaluation(imdb_ppmi, src_filename='gram1-adjective-to-adverb.txt', verbose=False)

Out[11]:
(0.14621311421118469, defaultdict(int, {False: 904, True: 88}))

In [12]:
analogy_evaluation(imdb_ppmi, src_filename='gram2-opposite.txt', verbose=False)

Out[12]:
(0.11918977012507004, defaultdict(int, {False: 756, True: 56}))
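The second member of each returned pair is a defaultdict of True/False prediction counts, from which accuracy is easy to recover. A minimal sketch using the family.txt counts above:

```python
from collections import defaultdict

# Counts as returned for family.txt above:
results = defaultdict(int, {False: 297, True: 209})

# Accuracy is the fraction of correct predictions:
accuracy = results[True] / (results[True] + results[False])
print(round(accuracy, 4))  # 0.413
```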

## Bake-off submission

1. The name of the count matrix you started with (it must be one in vsmdata).
2. A description of the steps you took to create your bake-off VSM (it must be different from the baseline above).
3. Your mean reciprocal rank scores for the following files in analogies_home:
   - 'family.txt'