Bake-off: Word similarity tasks

In [1]:
__author__ = "Christopher Potts"
__version__ = "CS224u, Stanford, Spring 2018 term"


Word similarity datasets have long been used to evaluate distributed representations. This section provides basic code for conducting such analyses with four datasets:

  1. WordSim-353
  2. MTurk-287
  3. MTurk-771
  4. MEN

For the first three, the numeral in the name is the number of pairs the dataset contains.

If you want to push this task further, consider using additional datasets from and perhaps even taking advantage of the evaluation infrastructure it provides. (For additional details, see the associated paper.)


  1. Each of the similarity datasets contains word pairs with an associated human-annotated similarity score. (We convert these to distances to align intuitively with our distance measure functions.)

  2. The evaluation code measures the distance between the two words of each pair in your chosen VSM (which should be a pd.DataFrame).

  3. The evaluation metric is the Spearman correlation coefficient between the annotated scores and your distances.

  4. We also macro-average these correlations across the four datasets for an overall summary.
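
The four steps above can be sketched end to end on toy data. The vectors and ratings below are invented for illustration, and the hand-written `cosine_distance` and `spearman` helpers are stand-ins for `vsm.cosine` and `scipy.stats.spearmanr`, so treat this as a minimal sketch rather than the evaluation code itself.

```python
import math

def cosine_distance(u, v):
    """1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def average_ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Pearson correlation of the rank-transformed inputs."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy VSM: one short vector per word.
toy_vsm = {
    'cat': [1.0, 0.9, 0.1],
    'dog': [0.9, 1.0, 0.2],
    'car': [0.1, 0.2, 1.0],
    'bus': [0.2, 0.1, 0.9]}

# Hypothetical human ratings (higher means more similar).
toy_pairs = [
    ('cat', 'dog', 9.0),
    ('car', 'bus', 8.5),
    ('cat', 'car', 1.5),
    ('dog', 'bus', 2.0)]

# Step 1: negate the similarity ratings so they behave like distances.
gold = [-score for _, _, score in toy_pairs]
# Step 2: measure the distance between each pair in the VSM.
dists = [cosine_distance(toy_vsm[w1], toy_vsm[w2]) for w1, w2, _ in toy_pairs]
# Step 3: rank-correlate the two lists.
rho = spearman(gold, dists)
print("Spearman r: {0:0.03f}".format(rho))
```

Because the ratings are negated, a VSM that places highly rated pairs close together yields a positive correlation. In this toy case the two rankings agree exactly, so the correlation is at its maximum.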

Based on my reading of the literature, I'd say that the best VSMs report scores in this range:

Dataset        Competitive scores
WordSim-353    ≈0.75
MTurk-287      ≈0.75
MTurk-771      ≈0.75
MEN            ≈0.70

Your scores won't quite be comparable because you'll be missing a few vocabulary items, but these are still good targets.


In [2]:
from collections import defaultdict
import csv
import numpy as np
import os
import pandas as pd
from scipy.stats import spearmanr
import vsm
In [3]:
data_home = 'vsmdata'

wordsim_home = os.path.join(data_home, 'wordsim')

Dataset readers

In [4]:
def wordsim_dataset_reader(src_filename, header=False, delimiter=','):
    """Basic reader that works for all four files, since they all have the
    format word1,word2,score, differing only in whether or not they include
    a header line and what delimiter they use.

    Parameters
    ----------
    src_filename : str
        Full path to the source file.
    header : bool (default: False)
        Whether `src_filename` has a header.
    delimiter : str (default: ',')
        Field delimiter in `src_filename`.

    Yields
    ------
    (str, str, float)
       (w1, w2, score) where `score` is the negative of the similarity
       score in the file so that we are intuitively aligned with our
       distance-based code.

    """
    with open(src_filename) as f:
        reader = csv.reader(f, delimiter=delimiter)
        if header:
            # Skip the header line:
            next(reader)
        for row in reader:
            w1, w2, score = row
            # Negative of scores to align intuitively with distance functions:
            score = -float(score)
            yield (w1, w2, score)
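
To see how the `header` and `delimiter` options interact, here is a small sketch that applies the same reading logic to in-memory text instead of files. The helper name `read_pairs` and the sample rows are made up for illustration; they are not taken from the datasets.

```python
import csv
import io

def read_pairs(f, header=False, delimiter=','):
    """Same logic as the file-based reader, but for an open file-like object."""
    reader = csv.reader(f, delimiter=delimiter)
    if header:
        next(reader)  # Discard the header row.
    for w1, w2, score in reader:
        yield (w1, w2, -float(score))

# Invented comma-separated sample with a header line.
comma_file = io.StringIO("word1,word2,score\ncat,dog,7.5\ncup,mug,9.0\n")
# Invented space-separated sample without a header line.
space_file = io.StringIO("sun sky 40.0\nriver water 45.0\n")

comma_pairs = list(read_pairs(comma_file, header=True))
space_pairs = list(read_pairs(space_file, delimiter=' '))
print(comma_pairs)
print(space_pairs)
```

Forgetting `header=True` on a file that has a header would raise a `ValueError` at `float(score)`, since the header's third field is not a number.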

def wordsim353_reader():
    src_filename = os.path.join(wordsim_home, 'wordsim353.csv')
    return wordsim_dataset_reader(src_filename, header=True)

def mturk287_reader():
    src_filename = os.path.join(wordsim_home, 'MTurk-287.csv')
    return wordsim_dataset_reader(src_filename, header=False)

def mturk771_reader():
    src_filename = os.path.join(wordsim_home, 'MTURK-771.csv')
    return wordsim_dataset_reader(src_filename, header=False)

def men_reader():
    src_filename = os.path.join(wordsim_home, 'MEN_dataset_natural_form_full')
    return wordsim_dataset_reader(src_filename, header=False, delimiter=' ')


In [5]:
def word_similarity_evaluation(reader, df, distfunc=vsm.cosine, verbose=True):
    """Word-similarity evaluation framework.

    Parameters
    ----------
    reader : function
        A zero-argument reader for a word-similarity dataset. Calling it
        just has to return an iterator that yields tuples
        (word1, word2, score).
    df : pd.DataFrame
        The VSM being evaluated.
    distfunc : function mapping vector pairs to floats (default: `vsm.cosine`)
        The measure of distance between vectors. Can also be `vsm.euclidean`,
        `vsm.matching`, `vsm.jaccard`, as well as any other distance measure
        between 1d vectors.
    verbose : bool
        Whether to print information about how much of the vocab
        `df` covers.

    Prints
    ------
    To standard output
        Size of the vocabulary overlap between the evaluation set and
        rownames. We limit the evaluation to the overlap, paying no price
        for missing words (which is not fair, but it's reasonable given
        that we're working with very small VSMs in this notebook).

    Returns
    -------
    float
        The Spearman rank correlation coefficient between the dataset
        scores and the distances obtained from `df` using `distfunc`.
        This evaluation is sensitive only to rankings, not to absolute
        values.

    """
    sims = defaultdict(list)
    rownames = df.index
    vocab = set()
    excluded = set()
    for w1, w2, score in reader():
        if w1 in rownames and w2 in rownames:
            sims[w1].append((w2, score))
            sims[w2].append((w1, score))
            vocab |= {w1, w2}
        else:
            # At least one word is missing from the VSM, so skip the pair:
            excluded |= {w1, w2}
    all_words = vocab | excluded
    if verbose:
        print("Evaluation vocab: {:,} of {:,}".format(len(vocab), len(all_words)))
    # Evaluate the matrix by creating a vector of all_scores for data
    # and all_dists for df's distances.
    all_scores = []
    all_dists = []
    for word in vocab:
        vec = df.loc[word]
        vals = sims[word]
        cmps, scores = zip(*vals)
        all_scores += scores
        all_dists += [distfunc(vec, df.loc[w]) for w in cmps]
    rho, pvalue = spearmanr(all_scores, all_dists)
    return rho
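
The vocabulary bookkeeping in this function can be easy to misread, so here is a minimal sketch of just that part. The helper `split_vocab`, the row names, and the pairs are all hypothetical.

```python
def split_vocab(pairs, rownames):
    """Return (vocab, excluded) using the same rule as the evaluation code:
    a pair is kept only if both of its words have a row in the VSM."""
    vocab, excluded = set(), set()
    for w1, w2, _ in pairs:
        if w1 in rownames and w2 in rownames:
            vocab |= {w1, w2}
        else:
            excluded |= {w1, w2}
    return vocab, excluded

# Invented VSM row names and negated similarity scores.
toy_rownames = {'cat', 'dog', 'car'}
toy_pairs = [
    ('cat', 'dog', -9.0),    # both words present: pair is evaluated
    ('cat', 'zebra', -3.0),  # 'zebra' missing: pair is skipped
    ('car', 'bus', -8.0)]    # 'bus' missing: pair is skipped

vocab, excluded = split_vocab(toy_pairs, toy_rownames)
all_words = vocab | excluded
print("Evaluation vocab: {:,} of {:,}".format(len(vocab), len(all_words)))
```

Two details are worth noticing. A word such as 'cat' can land in both sets when it occurs in one kept pair and one skipped pair, which is why the total is computed from the union. And 'car' has a row in the VSM but still does not count toward the evaluation vocabulary, because its only partner is missing.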

Evaluation is then simple. The following lets us evaluate a VSM against all four datasets:

In [6]:
def full_word_similarity_evaluation(df, verbose=True):
    """Evaluate a VSM against all four datasets.

    Parameters
    ----------
    df : pd.DataFrame
        The VSM being evaluated.

    Returns
    -------
    dict
        Mapping dataset names to Spearman r values

    """
    scores = {}
    for reader in (wordsim353_reader, mturk287_reader, mturk771_reader, men_reader):
        if verbose:
            print(reader.__name__)
        score = word_similarity_evaluation(reader, df, verbose=verbose)
        scores[reader.__name__] = score
        if verbose:
            print('Spearman r: {0:0.03f}'.format(score))
    # Macro-average: every dataset counts equally, whatever its size.
    mu = np.array(list(scores.values())).mean()
    if verbose:
        print("Mean Spearman r: {0:0.03f}".format(mu))
    return scores


My baseline is PPMI on imdb20:

In [7]:
imdb20 = pd.read_csv(
    os.path.join(data_home, "imdb_window20-flat.csv.gz"), index_col=0)
In [8]:
imdb20_ppmi = vsm.pmi(imdb20)
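
For readers who want to see what the reweighting does, here is a minimal pure-Python sketch of positive PMI on a tiny count matrix. It follows the standard textbook definition (log of observed over expected count, with negative values and zero counts mapped to 0); whether `vsm.pmi` handles every edge case the same way is an assumption, and the counts are invented.

```python
import math

def ppmi(counts):
    """Positive PMI for a list-of-lists count matrix."""
    total = sum(sum(row) for row in counts)
    row_sums = [sum(row) for row in counts]
    col_sums = [sum(col) for col in zip(*counts)]
    result = []
    for i, row in enumerate(counts):
        new_row = []
        for j, count in enumerate(row):
            if count == 0:
                # log(0) is undefined; by convention the cell becomes 0.
                new_row.append(0.0)
                continue
            expected = row_sums[i] * col_sums[j] / total
            # Clip negative PMI values to 0.
            new_row.append(max(0.0, math.log(count / expected)))
        result.append(new_row)
    return result

# Invented symmetric co-occurrence counts for three words.
toy_counts = [
    [10, 0, 2],
    [0, 8, 2],
    [2, 2, 4]]

toy_ppmi = ppmi(toy_counts)
for row in toy_ppmi:
    print(["{0:0.03f}".format(x) for x in row])
```

Cells where two words co-occur less often than chance would predict, such as the 2 in the first row, end up at 0, so only above-chance associations carry weight in the distance calculations.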
In [9]:
full_word_similarity_evaluation(imdb20_ppmi, verbose=True)
Evaluation vocab: 418 of 437
Spearman r: 0.469
Evaluation vocab: 499 of 499
Spearman r: 0.599
Evaluation vocab: 1,113 of 1,113
Spearman r: 0.462
Evaluation vocab: 751 of 751
Spearman r: 0.572
Mean Spearman r: 0.525
{'men_reader': 0.5724487594295152,
 'mturk287_reader': 0.5986597600532505,
 'mturk771_reader': 0.4615212813179131,
 'wordsim353_reader': 0.46888766456156583}
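
The mean in the output above is a macro-average: each dataset contributes equally, regardless of how many pairs it contains. A quick check using the four values printed above (the variable names are just for illustration):

```python
from statistics import mean

# Per-dataset Spearman r values copied from the output above.
baseline_scores = {
    'men_reader': 0.5724487594295152,
    'mturk287_reader': 0.5986597600532505,
    'mturk771_reader': 0.4615212813179131,
    'wordsim353_reader': 0.46888766456156583}

# Unweighted mean over datasets, not over individual word pairs.
macro_avg = mean(baseline_scores.values())
summary = "Mean Spearman r: {0:0.03f}".format(macro_avg)
print(summary)
```

A micro-average over all pairs would instead weight the larger datasets more heavily and generally give a different number.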

Bake-off submission

  1. The name of the count matrix you started with (must be one in vsmdata).
  2. A description of the steps you took to create your bake-off VSM (this must differ from the baseline above).
  3. Your Spearman r value for each of the four evaluation datasets and your average across all four.

Submission URL: