Important: This isn't being run as a bake-off this year. It's included in the repository in case people want to do additional exploration or incorporate this kind of evaluation into a project.
__author__ = "Christopher Potts"
__version__ = "CS224u, Stanford, Spring 2018 term"
Word analogies provide another kind of evaluation for distributed representations. Here, we are given three vectors A, B, and C, in the relationship
A is to B as C is to __
and asked to identify the fourth that completes the analogy. This section conducts such analyses using a large, automatically collected analogies dataset from Google. These analogies are by and large substantially easier than the classic brain-teaser analogies that used to appear on tests like the SAT, but it's still an interesting, demanding task.
The core idea is that we make predictions by creating the vector
$$(B−A)+C$$and then ranking all vectors based on their distance from this new vector, choosing the closest as our prediction.
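As a toy illustration of this arithmetic (the word vectors below are invented for the example, and `scipy`'s `cosine` stands in for `vsm.cosine`):

```python
import pandas as pd
from scipy.spatial.distance import cosine

# Toy VSM: each row is a (made-up) word vector.
toy = pd.DataFrame(
    [[1.0, 0.2], [1.0, 1.2], [0.2, 0.1], [0.3, 1.0], [0.9, 0.9]],
    index=['good', 'great', 'bad', 'awful', 'nice'],
    columns=['dim0', 'dim1'])

# good : great :: bad : ?  -- form (B - A) + C ...
newvec = (toy.loc['great'] - toy.loc['good']) + toy.loc['bad']

# ... and rank the remaining words by cosine distance from it:
dists = toy.drop(['good', 'great', 'bad']).apply(
    lambda row: cosine(newvec, row), axis=1)
print(dists.sort_values().index[0])
```

With these particular vectors, `awful` comes out closest, mirroring the `good : great :: bad : awful` pattern seen on the real data below.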
from collections import defaultdict
import csv
import numpy as np
import os
import pandas as pd
from scipy.stats import spearmanr
import vsm
data_home = 'vsmdata'
analogies_home = os.path.join(data_home, 'question-data')
The function analogy_completion
implements the analogy calculation on VSMs.
def analogy_completion(a, b, c, df, distfunc=vsm.cosine):
    """a is to b as c is to predicted, where predicted is the
    closest to (b - a) + c"""
    for x in (a, b, c):
        if x not in df.index:
            raise ValueError('{} is not in this VSM'.format(x))
    avec = df.loc[a]
    bvec = df.loc[b]
    cvec = df.loc[c]
    newvec = (bvec - avec) + cvec
    # Rank every row vector by its distance from `newvec`
    # (`axis=1` so that `distfunc` is applied to rows, not columns):
    dists = df.apply(lambda row: distfunc(newvec, row), axis=1)
    # Exclude the three input words from the ranking:
    dists = dists.drop([a, b, c])
    return dists.sort_values()
imdb20 = pd.read_csv(
os.path.join(data_home, "imdb_window20-flat.csv.gz"), index_col=0)
imdb_ppmi = vsm.pmi(imdb20)
x = analogy_completion("good", "great", "bad", imdb_ppmi)
x.head()
awful       0.589598
terrible    0.603033
acting      0.604288
horrible    0.634886
worst       0.648661
dtype: float64
The function analogy_evaluation
evaluates a VSM against one of the files in analogies_home
. The default is to use gram1-adjective-to-adverb.txt
, but there are many others to choose from. The calculations are somewhat slow, so you can use the verbose=True
option to see the predictions as they are made.
def analogy_evaluation(
        df,
        src_filename='gram1-adjective-to-adverb.txt',
        distfunc=vsm.cosine,
        verbose=True):
    """Basic analogies evaluation for a file `src_filename`
    in `question-data/`.

    Parameters
    ----------
    df : pd.DataFrame
        The VSM being evaluated.
    src_filename : str
        Basename of the file to be evaluated. It's assumed to be in
        `analogies_home`.
    distfunc : function mapping vector pairs to floats (default: `cosine`)
        The measure of distance between vectors. Can also be `euclidean`,
        `matching`, `jaccard`, as well as any other distance measure
        between 1d vectors.
    verbose : bool
        If True, print each prediction as it is made.

    Returns
    -------
    (float, defaultdict)
        The first member is the mean reciprocal rank of the predictions,
        and the second maps True/False to counts of correct and
        incorrect predictions.

    """
    src_filename = os.path.join(analogies_home, src_filename)
    # Read in the data and restrict to problems we can solve:
    with open(src_filename) as f:
        data = [line.split() for line in f.read().splitlines()]
    data = [prob for prob in data if set(prob) <= set(df.index)]
    # Run the evaluation, collecting accuracy and rankings:
    results = defaultdict(int)
    ranks = []
    for a, b, c, d in data:
        ranking = analogy_completion(a, b, c, df=df, distfunc=distfunc)
        predicted = ranking.index[0]
        # Accuracy:
        results[predicted == d] += 1
        # Rank of the actual answer, starting at 1:
        rank = ranking.index.get_loc(d) + 1
        ranks.append(rank)
        if verbose:
            print("{} is to {} as {} is to {} (gold: {} at rank {})".format(
                a, b, c, predicted, d, rank))
    # Return the mean reciprocal rank and the accuracy counts:
    mrr = np.mean(1.0 / np.array(ranks))
    return (mrr, results)
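As a small, self-contained illustration of the mean reciprocal rank calculation above (the ranks are made up):

```python
import numpy as np

# Suppose the gold answers for three problems were ranked
# 1st, 2nd, and 4th by the model (illustrative values only):
ranks = np.array([1, 2, 4])

# MRR is the mean of the reciprocal ranks: (1 + 0.5 + 0.25) / 3
mrr = np.mean(1.0 / ranks)
print(mrr)
```

A perfect model would have MRR 1.0; values fall toward 0 as the gold answers drift down the rankings.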
My baseline is PPMI on imdb20
as loaded above:
analogy_evaluation(imdb_ppmi, src_filename='family.txt', verbose=False)
(0.5446665665352782, defaultdict(int, {False: 297, True: 209}))
analogy_evaluation(imdb_ppmi, src_filename='gram1-adjective-to-adverb.txt', verbose=False)
(0.14621311421118469, defaultdict(int, {False: 904, True: 88}))
analogy_evaluation(imdb_ppmi, src_filename='gram2-opposite.txt', verbose=False)
(0.11918977012507004, defaultdict(int, {False: 756, True: 56}))
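Since the second member of each returned pair is a dict of True/False counts rather than a ratio, accuracy has to be computed from it. A quick sketch, using the family.txt counts above:

```python
from collections import defaultdict

# Counts as returned for family.txt above:
results = defaultdict(int, {False: 297, True: 209})

# Accuracy is the fraction of problems answered correctly:
accuracy = results[True] / (results[True] + results[False])
print(accuracy)
```

So the PPMI baseline gets about 41% of the family analogies right, but under 9% of the adjective-to-adverb ones.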