Homework 1: Word similarity tasks

In [1]:
__author__ = "Christopher Potts"
__version__ = "CS224u, Stanford, Spring 2019"

Overview

Word similarity datasets have long been used to evaluate distributed representations. This notebook provides basic code for conducting such analyses with a number of datasets:

Dataset                  Pairs  Task-type    Current best Spearman $\rho$  Best $\rho$ paper
WordSim-353                353  Relatedness  82.8                          Speer et al. 2017
MTurk-771                  771  Relatedness  81.0                          Speer et al. 2017
The MEN Test Collection  3,000  Relatedness  86.6                          Speer et al. 2017
SimVerb-3500-dev           500  Similarity   61.1                          Mrkšić et al. 2016
SimVerb-3500-test        3,000  Similarity   62.4                          Mrkšić et al. 2016

Each of the similarity datasets contains word pairs with an associated human-annotated similarity score. (We convert these to distances to align intuitively with our distance measure functions.) The evaluation code measures the distance between the word pairs in your chosen VSM (which should be a pd.DataFrame).

The evaluation metric for each dataset is the Spearman correlation coefficient $\rho$ between the annotated scores and your distances, as is standard in the literature. We also macro-average these correlations across the datasets for an overall summary. (In using the macro-average, we are saying that we care about all the datasets equally, even though they vary in size.)
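
To make the metric concrete, here is a toy illustration with made-up numbers (separate from the evaluation code below). Because the readers negate the similarity scores, larger values mean "more distant" on both sides, so a good model yields a positive correlation:

In [ ]:
from scipy.stats import spearmanr
import pandas as pd

# Toy gold similarity scores and model distances for three word pairs.
gold_similarities = [9.0, 5.5, 1.0]
gold_as_distances = [-s for s in gold_similarities]  # as the readers do
model_distances = [0.12, 0.48, 0.91]

rho, _ = spearmanr(gold_as_distances, model_distances)
print(rho)  # 1.0 -- only the rankings matter, not the absolute values

# The macro-average is the unweighted mean of the per-dataset correlations:
per_dataset = pd.Series({'datasetA': 0.33, 'datasetB': 0.40})
print(per_dataset.mean())  # 0.365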

This homework (questions at the bottom of this notebook) asks you to write code that uses the count matrices in data/vsmdata to create and evaluate some baseline models as well as an original model $M$ that you design. This accounts for 9 of the 10 points for this assignment.

For the associated bake-off, we will distribute two new word similarity or relatedness datasets and associated reader code, and you will evaluate $M$ (no additional training or tuning allowed!) on those new datasets. Systems that enter will receive the additional homework point, and systems that achieve the top score will receive an additional 0.5 points.

Set-up

In [2]:
from collections import defaultdict
import csv
import itertools
import numpy as np
import os
import pandas as pd
from scipy.stats import spearmanr
import vsm
In [3]:
VSM_HOME = os.path.join('data', 'vsmdata')

WORDSIM_HOME = os.path.join('data', 'wordsim')

Dataset readers

In [4]:
def wordsim_dataset_reader(
        src_filename, 
        header=False, 
        delimiter=',', 
        score_col_index=2):
    """Basic reader that works for all similarity datasets. They are 
    all tabular-style releases where the first two columns give the 
    word and a later column (`score_col_index`) gives the score.

    Parameters
    ----------
    src_filename : str
        Full path to the source file.
    header : bool
        Whether `src_filename` has a header. Default: False
    delimiter : str
        Field delimiter in `src_filename`. Default: ','
    score_col_index : int
        Column containing the similarity scores. Default: 2

    Yields
    ------
    (str, str, float)
       (w1, w2, score) where `score` is the negative of the similarity
       score in the file so that we are intuitively aligned with our
       distance-based code. To align with our VSMs, all the words are 
       downcased.

    """
    with open(src_filename, encoding='utf8') as f:
        reader = csv.reader(f, delimiter=delimiter)
        if header:
            next(reader)
        for row in reader:
            w1 = row[0].strip().lower()
            w2 = row[1].strip().lower()
            score = row[score_col_index]
            # Negative of scores to align intuitively with distance functions:
            score = -float(score)
            yield (w1, w2, score)

def wordsim353_reader():
    """WordSim-353: http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/"""
    src_filename = os.path.join(
        WORDSIM_HOME, 'wordsim353', 'combined.csv')
    return wordsim_dataset_reader(
        src_filename, header=True)

def mturk771_reader():
    """MTURK-771: http://www2.mta.ac.il/~gideon/mturk771.html"""
    src_filename = os.path.join(
        WORDSIM_HOME, 'MTURK-771.csv')
    return wordsim_dataset_reader(
        src_filename, header=False)

def simverb3500dev_reader():
    """SimVerb-3500: http://people.ds.cam.ac.uk/dsg40/simverb.html"""
    src_filename = os.path.join(
        WORDSIM_HOME, 'SimVerb-3500', 'SimVerb-500-dev.txt')
    return wordsim_dataset_reader(
        src_filename, delimiter="\t", header=True, score_col_index=3)

def simverb3500test_reader():
    """SimVerb-3500: http://people.ds.cam.ac.uk/dsg40/simverb.html"""
    src_filename = os.path.join(
        WORDSIM_HOME, 'SimVerb-3500', 'SimVerb-3000-test.txt')
    return wordsim_dataset_reader(
        src_filename, delimiter="\t", header=True, score_col_index=3)

def men_reader():
    """MEN: http://clic.cimec.unitn.it/~elia.bruni/MEN"""
    src_filename = os.path.join(
        WORDSIM_HOME, 'MEN', 'MEN_dataset_natural_form_full')
    return wordsim_dataset_reader(
        src_filename, header=False, delimiter=' ') 

This collection of readers will be useful for flexible evaluations:

In [5]:
READERS = (wordsim353_reader, mturk771_reader, simverb3500dev_reader, 
           simverb3500test_reader, men_reader)
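
As a quick sanity check (assuming the data files are unpacked under data/wordsim as above), we can peek at the first few triples that a reader yields; note the negated scores:

In [ ]:
import itertools

# Print the first three (word1, word2, score) triples from WordSim-353.
for w1, w2, score in itertools.islice(wordsim353_reader(), 3):
    print(w1, w2, score)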

Dataset comparisons

This section does some basic analysis of the datasets. The goal is to obtain a deeper understanding of what problem we're solving – what strengths and weaknesses the datasets have and how they relate to each other. For a full-fledged project, we would want to continue work like this and report on it in the paper, to provide context for the results.

In [6]:
def get_reader_name(reader):
    """Return a cleaned-up name for the similarity dataset 
    iterator `reader`
    """
    return reader.__name__.replace("_reader", "")

Vocab overlap

How many vocabulary items are shared across the datasets?

In [7]:
def get_reader_vocab(reader):
    """Return the set of words (str) in `reader`."""
    vocab = set()
    for w1, w2, _ in reader():
        vocab.add(w1)
        vocab.add(w2)
    return vocab
In [8]:
def get_reader_vocab_overlap(readers=READERS):
    """Get data on the vocab-level relationships between pairs of 
    readers. Returns a a pd.DataFrame containing this information.
    """
    data = []
    for r1, r2 in itertools.product(readers, repeat=2):       
        v1 = get_reader_vocab(r1)
        v2 = get_reader_vocab(r2)
        d = {
            'd1': get_reader_name(r1),
            'd2': get_reader_name(r2),
            'overlap': len(v1 & v2), 
            'union': len(v1 | v2),
            'd1_size': len(v1),
            'd2_size': len(v2)}
        data.append(d)
    return pd.DataFrame(data)
In [9]:
vocab_overlap = get_reader_vocab_overlap()
In [10]:
def vocab_overlap_crosstab(vocab_overlap):
    """Return an intuitively formatted `pd.DataFrame` giving 
    vocab-overlap counts for all the datasets represented in 
    `vocab_overlap`, the output of `get_reader_vocab_overlap`.
    """        
    xtab = pd.crosstab(
        vocab_overlap['d1'], 
        vocab_overlap['d2'], 
        values=vocab_overlap['overlap'], 
        aggfunc=np.mean)
    # Blank out the upper right to reduce visual clutter:
    for i in range(0, xtab.shape[0]):
        for j in range(i+1, xtab.shape[1]):
            xtab.iloc[i, j] = ''        
    return xtab        
In [11]:
vocab_overlap_crosstab(vocab_overlap)
Out[11]:
d2               men  mturk771  simverb3500dev  simverb3500test  wordsim353
d1
men              751
mturk771         230      1113
simverb3500dev    23        67             536
simverb3500test   30        94             532              823
wordsim353        86       158              13               17         437

This looks reasonable. By design, the SimVerb dev and test sets have a lot of overlap. The other overlap numbers are pretty small, even adjusting for dataset size.

Pair overlap and score correlations

How many word pairs are shared across datasets and, for shared pairs, what is the correlation between their scores? That is, do the datasets agree?

In [12]:
def get_reader_pairs(reader):
    """Return the set of alphabetically-sorted word (str) tuples 
    in `reader`
    """
    return {tuple(sorted([w1, w2])): score for w1, w2, score in reader()}
In [13]:
def get_reader_pair_overlap(readers=READERS):
    """Return a `pd.DataFrame` giving the number of overlapping 
    word-pairs in pairs of readers, along with the Spearman 
    correlations.
    """    
    data = []
    for r1, r2 in itertools.product(readers, repeat=2):
        if r1.__name__ != r2.__name__:
            d1 = get_reader_pairs(r1)
            d2 = get_reader_pairs(r2)
            overlap = []
            for p, s in d1.items():
                if p in d2:
                    overlap.append([s, d2[p]])
            if overlap:
                s1, s2 = zip(*overlap)
                rho = spearmanr(s1, s2)[0]
            else:
                rho = None        
            d = {
                'd1': get_reader_name(r1),
                'd2': get_reader_name(r2), 
                'pair_overlap': len(overlap),
                'rho': rho}
            data.append(d)
    df = pd.DataFrame(data)
    df = df.sort_values('pair_overlap', ascending=False)
    # Return only every other row to avoid repeats:
    return df[::2].reset_index(drop=True)
In [14]:
get_reader_pair_overlap()
Out[14]:
   d1               d2               pair_overlap       rho
0  men              mturk771                   11  0.592191
1  wordsim353       men                         5  0.700000
2  simverb3500test  mturk771                    4  0.400000
3  men              simverb3500test             2  1.000000
4  simverb3500test  simverb3500dev              1       NaN
5  wordsim353       simverb3500dev              0       NaN
6  simverb3500test  wordsim353                  0       NaN
7  simverb3500dev   wordsim353                  0       NaN
8  mturk771         wordsim353                  0       NaN
9  men              simverb3500dev              0       NaN

This looks reasonable: none of the datasets have a lot of overlapping pairs, so we don't have to worry too much about places where they give conflicting scores.

Evaluation

This section builds up the evaluation code that you'll use for the homework and bake-off. For illustrations, I'll read in a VSM created from data/vsmdata/giga_window5-scaled.csv.gz:

In [15]:
giga5 = pd.read_csv(
    os.path.join(VSM_HOME, "giga_window5-scaled.csv.gz"), index_col=0)

Dataset evaluation

In [16]:
def word_similarity_evaluation(reader, df, distfunc=vsm.cosine):
    """Word-similarity evalution framework.
    
    Parameters
    ----------
    reader : iterator
        A reader for a word-similarity dataset. Just has to yield
        tuples (word1, word2, score).    
    df : pd.DataFrame
        The VSM being evaluated.        
    distfunc : function mapping vector pairs to floats.
        The measure of distance between vectors. Can also be 
        `vsm.euclidean`, `vsm.matching`, `vsm.jaccard`, as well as 
        any other distance measure between 1d vectors.    
        
    Raises
    ------
    ValueError
        If any word in `reader` is not in `df.index`.
    
    Returns
    -------
    float, data
        `float` is the Spearman rank correlation coefficient between 
        the dataset scores and the distances obtained from `df` using 
        `distfunc`. This evaluation is sensitive only to rankings, not 
        to absolute values. `data` is a `pd.DataFrame` with columns 
        ['word1', 'word2', 'score', 'distance'].
        
    """
    data = []
    for w1, w2, score in reader():
        d = {'word1': w1, 'word2': w2, 'score': score}
        for w in [w1, w2]:
            if w not in df.index:
                raise ValueError(
                    "Word '{}' is in the similarity dataset {} but not in the "
                    "DataFrame, making this evaluation ill-defined. Please "
                    "switch to a DataFrame with an appropriate vocabulary.".
                    format(w, get_reader_name(reader))) 
        d['distance'] = distfunc(df.loc[w1], df.loc[w2])
        data.append(d)
    data = pd.DataFrame(data)
    rho, pvalue = spearmanr(data['score'].values, data['distance'].values)
    return rho, data
In [17]:
rho, eval_df = word_similarity_evaluation(men_reader, giga5)
In [18]:
rho
Out[18]:
0.40375964105441753
In [19]:
eval_df.head()
Out[19]:
   distance  score  word1       word2
0  0.956828  -50.0  sun         sunlight
1  0.979143  -50.0  automobile  car
2  0.970105  -49.0  river       water
3  0.980475  -49.0  stairs      staircase
4  0.963624  -49.0  morning     sunrise

Dataset error analysis

For error analysis, we can look at the words with the largest delta between the gold score and the distance value in our VSM. We do these comparisons based on ranks, just as with our primary metric (Spearman $\rho$), and we normalize both rankings so that they have a comparable number of levels.

In [20]:
def word_similarity_error_analysis(eval_df):    
    eval_df['distance_rank'] = _normalized_ranking(eval_df['distance'])
    eval_df['score_rank'] = _normalized_ranking(eval_df['score'])
    eval_df['error'] =  abs(eval_df['distance_rank'] - eval_df['score_rank'])
    return eval_df.sort_values('error')
    
    
def _normalized_ranking(series):
    ranks = series.rank(method='dense')
    return ranks / ranks.sum()    
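
To see what the normalization is doing, here is a toy illustration (separate from the analysis): dense ranking maps values to 1, 2, 3, ... with tied values sharing a rank, and dividing by the rank sum puts the two rankings on a comparable scale:

In [ ]:
import pandas as pd

s = pd.Series([0.2, 0.9, 0.9, 0.5])
ranks = s.rank(method='dense')  # 1.0, 3.0, 3.0, 2.0
print(ranks / ranks.sum())      # 0.111..., 0.333..., 0.333..., 0.222...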

Best predictions:

In [21]:
word_similarity_error_analysis(eval_df).head()
Out[21]:
      distance  score  word1        word2    distance_rank  score_rank         error
1041  0.975007  -32.0  hummingbird  pelican       0.000243    0.000244  2.434543e-07
2315  0.980834  -13.0  lily         pigs          0.000488    0.000487  4.016842e-07
2951  0.983473   -4.0  bucket       girls         0.000602    0.000603  4.151568e-07
150   0.968690  -43.0  night        sunset        0.000102    0.000103  6.520315e-07
2062  0.979721  -17.0  oak          petals        0.000435    0.000436  7.162632e-07

Worst predictions:

In [22]:
word_similarity_error_analysis(eval_df).tail()
Out[22]:
     distance  score  word1   word2    distance_rank  score_rank     error
67   0.984622  -45.0  branch  twigs         0.000630    0.000077  0.000553
190  0.987704  -43.0  birds   stork         0.000657    0.000103  0.000554
185  0.990993  -43.0  bloom   tulip         0.000663    0.000103  0.000561
167  0.991760  -43.0  bloom   blossom       0.000664    0.000103  0.000561
198  0.992406  -43.0  bloom   rose          0.000664    0.000103  0.000561

Full evaluation

A full evaluation is just a loop over all the readers on which one wants to evaluate, with a macro-average at the end:

In [23]:
def full_word_similarity_evaluation(df, readers=READERS, distfunc=vsm.cosine):
    """Evaluate a VSM against all datasets in `readers`.
    
    Parameters
    ----------
    df : pd.DataFrame
    readers : tuple 
        The similarity dataset readers on which to evaluate.
    distfunc : function mapping vector pairs to floats.
        The measure of distance between vectors. Can also be 
        `vsm.euclidean`, `vsm.matching`, `vsm.jaccard`, as well as 
        any other distance measure between 1d vectors. 
    
    Returns
    -------
    pd.Series
        Mapping dataset names, plus a 'Macro-average' entry, to Spearman r values.
        
    """        
    scores = {}     
    for reader in readers:
        score, data_df = word_similarity_evaluation(reader, df, distfunc=distfunc)
        scores[get_reader_name(reader)] = score
    series = pd.Series(scores, name='Spearman r')
    series['Macro-average'] = series.mean()
    return series
In [24]:
full_word_similarity_evaluation(giga5, distfunc=vsm.cosine)
Out[24]:
wordsim353         0.327831
mturk771           0.143146
simverb3500dev    -0.068038
simverb3500test   -0.066348
men                0.403760
Macro-average      0.148070
Name: Spearman r, dtype: float64

Homework questions

Please embed your homework responses in this notebook, and do not delete any cells from the notebook. (You are free to add as many cells as you like as part of your responses.)

PPMI as a baseline [1 point]

The insight behind PPMI is a recurring theme in word representation learning, so it is a natural baseline for our task. For this question, submit code to do the following:

  1. Read the two count matrices created with a window of 20 and a flat scaling function into pd.DataFrames, as is done in the VSM notebooks. The files are data/vsmdata/giga_window20-flat.csv.gz and data/vsmdata/imdb_window20-flat.csv.gz, and the VSM notebooks provide examples of the needed code.

  2. Reweight these count matrices with PPMI.

  3. Evaluate these reweighted matrices using full_word_similarity_evaluation. (A rough sketch of this pipeline is given below.)
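
A rough sketch of that pipeline, assuming the course's vsm module provides vsm.pmi(df, positive=True) for PPMI reweighting; treat this as a starting point to adapt and verify, not as the reference solution:

In [ ]:
giga20 = pd.read_csv(
    os.path.join(VSM_HOME, "giga_window20-flat.csv.gz"), index_col=0)
imdb20 = pd.read_csv(
    os.path.join(VSM_HOME, "imdb_window20-flat.csv.gz"), index_col=0)

# PPMI reweighting (positive=True clips negative PMI values at 0):
giga20_ppmi = vsm.pmi(giga20, positive=True)
imdb20_ppmi = vsm.pmi(imdb20, positive=True)

# Evaluate both reweighted matrices on all of the similarity datasets:
print(full_word_similarity_evaluation(giga20_ppmi))
print(full_word_similarity_evaluation(imdb20_ppmi))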

In [ ]:
 

Gigaword with LSA at a few dimensions [1 point]

We might expect PPMI and LSA to form a solid pipeline that combines the strengths of PPMI with those of dimensionality reduction. However, LSA has a hyper-parameter $k$ – the dimensionality of the final representations – that will impact performance. For this problem, submit code to do the following:

  1. Apply LSA with $k \in \{100, 500, 1000\}$ to the PPMI reweighted version of data/vsmdata/giga_window20-flat.csv.gz that you created in the previous question.

  2. Print out each $k$ and its associated score. (For concise, clear code, you can use results.loc['Macro-average'] where results is the pd.Series returned by full_word_similarity_evaluation.) A skeleton for this sweep is sketched below.
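
A skeleton for that sweep (a sketch assuming giga20_ppmi is the PPMI-reweighted matrix from the previous question and that vsm.lsa(df, k=...) performs the truncated SVD, as in the course's vsm module):

In [ ]:
# Sweep over the LSA dimensionality and report the macro-average for each k.
for k in (100, 500, 1000):
    giga20_lsa = vsm.lsa(giga20_ppmi, k=k)
    results = full_word_similarity_evaluation(giga20_lsa)
    print(k, results.loc['Macro-average'])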

In [ ]:
 

Gigaword with GloVe for a small number of iterations [1 point]

Ideally, we would run GloVe for a very large number of iterations on a GPU machine to compare it against its close cousin PMI. However, we don't want this homework to cost you a lot of money or monopolize a lot of your available computing resources, so let's instead just probe GloVe a little bit to see if it has promise for our task. For this problem, submit code to do the following:

  1. Run GloVe for 10, 100, and 200 iterations on data/vsmdata/giga_window20-flat.csv.gz, using the mittens implementation of GloVe.
    • For all the other parameters to mittens.GloVe besides max_iter, use the package's defaults.
    • Because of the way that implementation is designed, these will have to be separate runs, but they should be relatively quick.
  2. Print out each value of max_iter and its associated score according to full_word_similarity_evaluation; a skeleton for this loop is sketched after the import cell below. The trend should give you a sense for whether it is worth running GloVe for more iterations.
    • Note: your trained GloVe matrix X needs to be wrapped in a pd.DataFrame to work with full_word_similarity_evaluation. pd.DataFrame(X, index=giga20.index) will do the trick.
    • Note: if glv is your GloVe model, then running glv.sess.close() after each model is trained will silence warnings from TensorFlow about interactive sessions being active.
In [25]:
from mittens import GloVe
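
A skeleton for these runs (a sketch assuming giga20 is the raw count DataFrame read in for the PPMI question; mittens.GloVe is fit on the dense count matrix and returns an embedding array):

In [ ]:
# Train GloVe for each iteration budget and evaluate the resulting vectors.
for max_iter in (10, 100, 200):
    glv = GloVe(max_iter=max_iter)
    X = glv.fit(giga20.values)
    glv_df = pd.DataFrame(X, index=giga20.index)
    results = full_word_similarity_evaluation(glv_df)
    print(max_iter, results.loc['Macro-average'])
    glv.sess.close()  # silence TensorFlow interactive-session warnings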

t-test reweighting [2 points]

The t-test statistic can be thought of as a reweighting scheme. For a count matrix $X$, row index $i$, and column index $j$:

$$\textbf{ttest}(X, i, j) = \frac{ P(X, i, j) - \big(P(X, i, *)P(X, *, j)\big) }{ \sqrt{(P(X, i, *)P(X, *, j))} }$$

where $P(X, i, j)$ is $X_{ij}$ divided by the total values in $X$, $P(X, i, *)$ is the sum of the values in row $i$ of $X$ divided by the total values in $X$, and $P(X, *, j)$ is the sum of the values in column $j$ of $X$ divided by the total values in $X$.

For this problem, implement this reweighting scheme. You can use test_ttest_implementation below to check that your implementation is correct; one possible vectorized formulation is sketched after the test function as a point of comparison. You do not need to use this for any evaluations, though we hope you will be curious enough to do so!

In [26]:
def test_ttest_implementation(func):
    """`func` should be an implementation of t-test reweighting as 
    defined above.
    """
    X = pd.DataFrame(np.array([
        [  4.,   4.,   2.,   0.],
        [  4.,  61.,   8.,  18.],
        [  2.,   8.,  10.,   0.],
        [  0.,  18.,   0.,   5.]]))    
    actual = np.array([
        [ 0.33056, -0.07689,  0.04321, -0.10532],
        [-0.07689,  0.03839, -0.10874,  0.07574],
        [ 0.04321, -0.10874,  0.36111, -0.14894],
        [-0.10532,  0.07574, -0.14894,  0.05767]])    
    predicted = func(X)
    assert np.array_equal(predicted.round(5), actual)
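
One possible vectorized formulation of the definition above, offered only as a point of comparison for your own implementation (verify it against test_ttest_implementation before trusting it):

In [ ]:
def ttest_reweighting_sketch(df):
    """Vectorized t-test reweighting of a count DataFrame `df`, 
    following the definition given above."""
    X = df.values
    P = X / X.sum()                       # P(X, i, j)
    row = P.sum(axis=1, keepdims=True)    # P(X, i, *)
    col = P.sum(axis=0, keepdims=True)    # P(X, *, j)
    expected = row * col
    weighted = (P - expected) / np.sqrt(expected)
    return pd.DataFrame(weighted, index=df.index, columns=df.columns)

# test_ttest_implementation(ttest_reweighting_sketch)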
In [ ]:
 

Your original system [4 points]

This question asks you to design your own model. You can of course include steps made above (ideally, the above questions informed your system design!), but your model should not be literally identical to any of the above models. Other ideas: retrofitting, autoencoders, GloVe, subword modeling, ...

Your code needs to be able to

  1. operate on all of the count matrices in data/vsmdata (except gigawordnyt-advmod-matrix.csv.gz, which doesn't have the right vocab for this task and is included just for fun); other pretrained vectors cannot be introduced; and
  2. be self-contained, so that we can work with your model directly in your homework submission notebook. (If your model depends on external data or other resources, please upload to Canvas a zip archive containing those resources and your submission notebook.)

Bake-off [1 point]

For the bake-off, we will release two additional datasets right after class on April 15. The announcement will go out on Piazza. We will also release reader code for these datasets that you can paste into this notebook. You will evaluate your custom model $M$ (from the previous question) on these new datasets using full_word_similarity_evaluation. Rules:

  1. Only one evaluation is permitted.
  2. No additional system tuning is permitted once the bake-off has started.

To enter the bake-off, upload this notebook on Canvas:

https://canvas.stanford.edu/courses/99711/assignments/187240

The cells below this one constitute your bake-off entry.

People who enter will receive the additional homework point, and people whose systems achieve the top score will receive an additional 0.5 points. We will test the top-performing systems ourselves, and only systems for which we can reproduce the reported results will win the extra 0.5 points.

The bake-off will close at 4:30 pm on April 17. Late entries will be accepted, but they cannot earn the extra 0.5 points. Similarly, you cannot win the bake-off unless your homework is submitted on time.

In [27]:
# Enter your bake-off assessment code into this cell. 
# Please do not remove this comment.
In [28]:
# On an otherwise blank line in this cell, please enter
# your "Macro-average" value as reported by the code above. 
# Please enter only a number between 0 and 1 inclusive.
# Please do not remove this comment.