Bake-off: Word-level entailment with neural networks

In [1]:
__author__ = "Christopher Potts"
__version__ = "CS224u, Stanford, Spring 2018"


Problem: Word-level natural language inference.

Training examples are pairs $((w_{L}, w_{R}), y)$, where $y$ is one of the following relations:

  • synonym: very roughly identical meanings; symmetric
  • hyponym: e.g., puppy is a hyponym of dog
  • hypernym: e.g., dog is a hypernym of puppy
  • antonym: semantically opposed within a domain; symmetric

The dataset is due to Bowman et al. 2015. See below for details on how it was processed for this bake-off.


  1. Make sure your environment includes all the requirements for the cs224u repository.

  2. Make sure you have the Wikipedia 2014 + Gigaword 5 distribution of pretrained GloVe vectors downloaded and unzipped, and that glove_home below is pointing to it.

  3. Make sure wordentail_filename below is pointing to the full path for nli_wordentail_bakeoff_data.json, which is included in the archive.

In [2]:
from collections import defaultdict
import json
import numpy as np
import os
import pandas as pd
import tensorflow as tf
from tf_shallow_neural_classifier import TfShallowNeuralClassifier
import nli
import utils
In [3]:
nlidata_home = 'nlidata'

wordentail_filename = os.path.join(
    nlidata_home, 'nli_wordentail_bakeoff_data.json')

glove_home = os.path.join("vsmdata", "glove.6B")


As noted above, the dataset was originally released by Bowman et al. 2015, who derived it from WordNet using some heuristics (and thus it might contain some errors or unintuitive pairings).

I've processed the data into three different train/test splits, in an effort to put some pressure on our models to actually learn these semantic relations, as opposed to exploiting regularities in the sample.

  • edge_disjoint: The train and dev edge sets are disjoint, but many words appear in both train and dev.
  • word_disjoint: The train and dev vocabularies are disjoint, and thus the edges are disjoint as well.
  • word_disjoint_balanced: Like word_disjoint, but with each word appearing at most one time as the left word and at most one time on the right for a given relation type.

These are progressively harder problems:

  • For word_disjoint, there is real pressure on the model to learn abstract relationships, as opposed to memorizing properties of individual words.

  • For word_disjoint_balanced, the model can't even learn that some terms tend to appear more on the left or the right. This might be a step too far. For example, appearing more on the right for hypernym corresponds in a deep way with being a more general term, which is a non-trivial lexical property that we want our models to learn.

In [4]:
with open(wordentail_filename) as f:
    wordentail_data = json.load(f)

The outer keys are the three splits plus a list giving the vocabulary for the entire dataset:

In [5]:
wordentail_data.keys()
dict_keys(['edge_disjoint', 'vocab', 'word_disjoint', 'word_disjoint_balanced'])

Edge disjoint

In [6]:
wordentail_data['edge_disjoint'].keys()
dict_keys(['dev', 'train'])

This is what the split looks like; all three have this same format:

In [7]:
wordentail_data['edge_disjoint']['dev'][: 5]
[[['archived', 'records'], 'synonym'],
 [['stage', 'station'], 'synonym'],
 [['engineers', 'design'], 'hypernym'],
 [['save', 'book'], 'hypernym'],
 [['match', 'supply'], 'hypernym']]

Let's test to make sure no edges are shared between train and dev:

In [8]:
nli.get_edge_overlap_size(wordentail_data, 'edge_disjoint')
0

As we expect, a lot of vocabulary items are shared between train and dev:

In [9]:
nli.get_vocab_overlap_size(wordentail_data, 'edge_disjoint')

This is a large percentage of the entire vocab:

In [10]:

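If you want to see what these helpers are computing, the overlap counts can be reproduced directly from the split data. This is a sketch with hypothetical helper names; the `nli` implementations may differ in details:

```python
def edge_overlap_size(split_data):
    """Number of (left, right) word pairs shared by train and dev."""
    train = {tuple(ex) for ex, _ in split_data['train']}
    dev = {tuple(ex) for ex, _ in split_data['dev']}
    return len(train & dev)

def vocab_overlap_size(split_data):
    """Number of words appearing in both train and dev, in either position."""
    train = {w for ex, _ in split_data['train'] for w in ex}
    dev = {w for ex, _ in split_data['dev'] for w in ex}
    return len(train & dev)
```

For example, `edge_overlap_size(wordentail_data['edge_disjoint'])` should agree with the `nli.get_edge_overlap_size` result above.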
Here's the distribution of labels in the train set. It's highly imbalanced, which will pose a challenge. (I'll go ahead and reveal that the dev set is similarly distributed.)

In [11]:
def label_distribution(split):
    return pd.DataFrame(wordentail_data[split]['train'])[1].value_counts()
In [12]:
label_distribution('edge_disjoint')
synonym     8865
hypernym    6475
hyponym     1044
antonym      629
Name: 1, dtype: int64
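To calibrate expectations, here's a sketch of what always guessing the majority class would achieve on this distribution (using the train counts above, and assuming, as the author reveals, that dev is similarly distributed):

```python
# Counts copied from the edge_disjoint train distribution above:
counts = {'synonym': 8865, 'hypernym': 6475, 'hyponym': 1044, 'antonym': 629}
total = sum(counts.values())

# Accuracy of always predicting the majority class, 'synonym':
majority_acc = counts['synonym'] / total

# For the all-'synonym' predictor, only 'synonym' has nonzero F1:
# its precision is the class frequency and its recall is 1.0.
p = counts['synonym'] / total
synonym_f1 = 2 * p / (p + 1.0)

# Macro-averaged F1 over the four classes:
macro_f1 = synonym_f1 / len(counts)
```

So a trivial baseline gets around 52% accuracy but a much lower macro F1, which is why the evaluations below report per-class precision, recall, and F1.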

Word disjoint

In [13]:
wordentail_data['word_disjoint'].keys()
dict_keys(['dev', 'train'])

In the word_disjoint split, no words are shared between train and dev:

In [14]:
nli.get_vocab_overlap_size(wordentail_data, 'word_disjoint')
0

Because no words are shared between train and dev, no edges are either:

In [15]:
nli.get_edge_overlap_size(wordentail_data, 'word_disjoint')
0

The label distribution is similar to that of edge_disjoint, though the overall number of examples is a bit smaller:

In [16]:
label_distribution('word_disjoint')
synonym     5610
hypernym    3993
hyponym      627
antonym      386
Name: 1, dtype: int64

There is still an important bias in the data: some words appear much more often than others, and in specific positions. For example, the very general term part appears on the right in a large number of cases, many of them hypernym.

In [17]:
[[ex, y] for ex, y in wordentail_data['word_disjoint']['train'] 
 if ex[1] == 'part']
[[['frames', 'part'], 'hypernym'],
 [['heaven', 'part'], 'hypernym'],
 [['pan', 'part'], 'synonym'],
 [['middle', 'part'], 'hypernym'],
 [['shared', 'part'], 'synonym'],
 [['shares', 'part'], 'synonym'],
 [['ended', 'part'], 'hypernym'],
 [['twin', 'part'], 'synonym'],
 [['meal', 'part'], 'synonym'],
 [['bit', 'part'], 'hypernym'],
 [['sections', 'part'], 'synonym'],
 [['capacity', 'part'], 'hypernym'],
 [['beginning', 'part'], 'hypernym'],
 [['divorce', 'part'], 'hypernym'],
 [['paradise', 'part'], 'hypernym'],
 [['ends', 'part'], 'hypernym'],
 [['reduced', 'part'], 'hypernym'],
 [['units', 'part'], 'hypernym'],
 [['corner', 'part'], 'hypernym'],
 [['air', 'part'], 'hypernym'],
 [['section', 'part'], 'synonym'],
 [['something', 'part'], 'synonym'],
 [['reduce', 'part'], 'hypernym'],
 [['some', 'part'], 'synonym'],
 [['heavy', 'part'], 'hypernym'],
 [['segment', 'part'], 'hypernym'],
 [['share', 'part'], 'synonym'],
 [['hat', 'part'], 'hypernym'],
 [['maria', 'part'], 'hypernym'],
 [['way', 'part'], 'hypernym'],
 [['interests', 'part'], 'synonym']]

These tabulations suggest that a classifier could do well just by learning where words tend to appear:

In [18]:
def count_label_position_instances(split, pos=0):
    examples = wordentail_data[split]['train']    
    return pd.Series([(ex[pos], label) for ex, label in examples]).value_counts()
In [19]:
count_label_position_instances('word_disjoint', pos=0).head()
(forms, hypernym)       9
(have, synonym)         8
(question, synonym)     8
(questions, synonym)    8
(items, synonym)        8
dtype: int64
In [20]:
count_label_position_instances('word_disjoint', pos=1).head()
(be, hypernym)        51
(take, hypernym)      39
(alter, hypernym)     38
(person, hypernym)    33
(modify, hypernym)    32
dtype: int64

Word disjoint and balanced

To see how much our models are leveraging the uneven distribution of words across the left and right positions, we also have a split in which each word $w$ appears in at most one item $((w, w_{R}), y)$ and at most one item $((w_{L}, w), y)$.

The following tests establish that the dataset has the desired properties:

In [21]:
wordentail_data['word_disjoint_balanced'].keys()
dict_keys(['dev', 'train'])
In [22]:
nli.get_edge_overlap_size(wordentail_data, 'word_disjoint_balanced')
0
In [23]:
nli.get_vocab_overlap_size(wordentail_data, 'word_disjoint_balanced')
0
In [24]:
[[ex, y] for ex, y in wordentail_data['word_disjoint_balanced']['train'] 
 if ex[1] == 'part']
[[['frames', 'part'], 'hypernym'], [['pan', 'part'], 'synonym']]
In [25]:
count_label_position_instances('word_disjoint_balanced', pos=0).head()
(remove, synonym)      1
(close, hyponym)       1
(seminar, hypernym)    1
(wants, hyponym)       1
(reform, synonym)      1
dtype: int64
In [26]:
count_label_position_instances('word_disjoint_balanced', pos=1).head()
(remove, synonym)       1
(attitude, synonym)     1
(relation, hypernym)    1
(weak, synonym)         1
(soon, synonym)         1
dtype: int64
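The at-most-once property can also be checked programmatically. Here's a sketch with a hypothetical helper, assuming the [[left, right], label] example format shown above:

```python
from collections import Counter

def is_balanced(examples):
    """True if no (word, label) pair occurs more than once in either
    the left (pos=0) or right (pos=1) position."""
    for pos in (0, 1):
        counts = Counter((ex[pos], label) for ex, label in examples)
        if any(n > 1 for n in counts.values()):
            return False
    return True
```

Running this on `wordentail_data['word_disjoint_balanced']['train']` should return True, in line with the tabulations above.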


Even in deep learning, feature representation is the most important thing and requires care! For our task, feature representation has two parts: representing the individual words and combining those representations into a single network input.

Representing words: vector_func

Let's consider two baseline word representation methods:

  1. Random vectors (as returned by utils.randvec).
  2. 50-dimensional GloVe representations.
In [27]:
def randvec(w, n=50, lower=-1.0, upper=1.0):
    """Returns a random vector of length `n`. `w` is ignored."""
    return utils.randvec(n=n, lower=lower, upper=upper)
In [28]:
# Any of the files in glove.6B will work here:
glove50_src = os.path.join(glove_home, 'glove.6B.50d.txt')

# Creates a dict mapping strings (words) to GloVe vectors:
GLOVE50 = utils.glove2dict(glove50_src)

def glove50vec(w):    
    """Return `w`'s GloVe representation if available, else return 
    a random vector."""
    return GLOVE50.get(w, randvec(w, n=50))
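If you're curious what utils.glove2dict is doing: a GloVe .txt file has one token per line followed by its vector components, separated by spaces. A minimal reader looks roughly like this (a sketch with a hypothetical name; the utils version may differ in details):

```python
import numpy as np

def glove2dict_sketch(src_filename):
    """Map each word in a GloVe .txt file to its vector as an np.array."""
    data = {}
    with open(src_filename, encoding='utf8') as f:
        for line in f:
            parts = line.strip().split()
            # First field is the token; the rest are the vector components.
            data[parts[0]] = np.array(parts[1:], dtype=float)
    return data
```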

Combining words into inputs: vector_combo_func

Here we decide how to combine the two word vectors into a single representation. In more detail, where u is a vector representation of the left word and v is a vector representation of the right word, we need a function vector_combo_func such that vector_combo_func(u, v) returns a new input vector z of dimension m. A simple example is concatenation:

In [29]:
def vec_concatenate(u, v):
    """Concatenate np.array instances `u` and `v` into a new np.array"""
    return np.concatenate((u, v))

vector_combo_func could instead be vector average, vector difference, etc. (even combinations of those) – there's lots of space for experimentation here.
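A few of those alternatives, sketched as hypothetical helpers (the first two assume u and v have the same dimension and preserve it; the last expands it):

```python
import numpy as np

def vec_diff(u, v):
    """Elementwise difference; keeps the input dimensionality."""
    return u - v

def vec_avg(u, v):
    """Elementwise mean; keeps the input dimensionality."""
    return (u + v) / 2.0

def vec_concat_with_diff(u, v):
    """Concatenation augmented with the difference: dimension 3n."""
    return np.concatenate((u, v, u - v))
```

Any of these can be passed wherever vec_concatenate is used, since they share its (u, v) -> z interface.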

Classifier model

For a baseline model, I chose TfShallowNeuralClassifier with a pretty large hidden layer and a correspondingly high number of iterations.

In [30]:
net = TfShallowNeuralClassifier(hidden_dim=200, max_iter=500)
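If you want to sanity-check results against a different implementation with the same fit/predict interface, scikit-learn's MLPClassifier is one option. This is a hedged suggestion, not part of the original baseline, and the settings below are only roughly comparable:

```python
from sklearn.neural_network import MLPClassifier

# Roughly comparable settings: one hidden layer of 200 tanh units,
# capped at 500 training iterations.
sk_net = MLPClassifier(
    hidden_layer_sizes=(200,),
    activation='tanh',
    max_iter=500)
```

Because it exposes fit and predict with the usual array inputs and label outputs, it should slot into bakeoff_experiment in place of net.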

Baseline results

The following puts the above pieces together, using vector_func=glove50vec, since vector_func=randvec seems so hopelessly misguided for word_disjoint and word_disjoint_balanced!

First, we build the dataset:

In [31]:
X = nli.build_bakeoff_dataset(
    wordentail_data,
    vector_func=glove50vec,
    vector_combo_func=vec_concatenate)

And then we run the experiment with nli.bakeoff_experiment. This trains and tests on all three splits, and additionally trains on word_disjoint's train portion and tests on word_disjoint_balanced's dev portion, to see what distribution of examples is more effective for this balanced evaluation.

Since the bake-off focus is word_disjoint, you might want to run just that evaluation. To do that, use:

In [32]:
nli.bakeoff_experiment(X, net, conditions=['word_disjoint'])
Iteration 500: loss: 9.8024865388870243
             precision    recall  f1-score   support

    antonym       0.00      0.00      0.00       150
   hypernym       0.54      0.43      0.48      1594
    hyponym       0.22      0.01      0.03       275
    synonym       0.59      0.77      0.67      2229

avg / total       0.52      0.57      0.53      4248

This will run the complete evaluation:

In [33]:
nli.bakeoff_experiment(X, net)
Iteration 500: loss: 15.278596043586731
Iteration 2: loss: 12.505775809288025
             precision    recall  f1-score   support

    antonym       0.00      0.00      0.00       392
   hypernym       0.58      0.43      0.49      4310
    hyponym       0.51      0.04      0.07       710
    synonym       0.59      0.80      0.68      5930

avg / total       0.56      0.59      0.55     11342

Iteration 3: loss: 4.142282485961914884
             precision    recall  f1-score   support

    antonym       0.00      0.00      0.00       150
   hypernym       0.54      0.43      0.48      1594
    hyponym       0.33      0.02      0.04       275
    synonym       0.59      0.78      0.67      2229

avg / total       0.53      0.57      0.53      4248

Iteration 2: loss: 13.00054156780243554
             precision    recall  f1-score   support

    antonym       0.00      0.00      0.00       115
   hypernym       0.48      0.28      0.35       511
    hyponym       0.27      0.03      0.05       118
    synonym       0.55      0.84      0.67       831

avg / total       0.47      0.54      0.47      1575

Iteration 500: loss: 9.7987087368965157
word_disjoint_balanced, training on word_disjoint
             precision    recall  f1-score   support

    antonym       0.00      0.00      0.00       115
   hypernym       0.49      0.43      0.46       511
    hyponym       0.20      0.01      0.02       118
    synonym       0.58      0.79      0.67       831

avg / total       0.48      0.56      0.50      1575

Bake-off submission

The goal: achieve the highest average F1 score on word_disjoint.

Your submission should include:

  • Your score on the word_disjoint split.
  • A description of the method you used:
    • Your approach to representing words.
    • Your approach to combining them into inputs.
    • The model you used for predictions.

Submission URL:

A few notes:

  • For the methods, the only requirement is that they differ in some way from the baseline above. They don't have to be completely different, though. For example, you might want to stick with the model but represent examples differently, or the reverse.

  • You must train only on the train split. No outside training instances can be brought in. You can, though, bring in outside information via your input vectors, as long as this information is not from dev or edge_disjoint.

  • You can also augment your training data. For example, if ((A, B), synonym) is a training instance, then so should be ((B, A), synonym). Similarly, if ((A, B), hyponym) and ((B, C), hyponym) are training cases, then so should be ((A, C), hyponym).

  • Since the evaluation is for word_disjoint, you're not going to get very far with random input vectors! A GloVe featurizer is defined above. Feel free to look around for new word vectors on the Web, or even train your own using our VSM notebooks.

  • You're not required to stick to TfShallowNeuralClassifier. For instance, you could create deeper feed-forward networks, change how they optimize, etc. As long as you have fit and predict methods with the same input and output types as our networks, you should be able to use bakeoff_experiment. For notes on how to extend the TensorFlow models included in this repository, see tensorflow_models.ipynb.
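The augmentation idea above can be sketched as follows, as a hypothetical helper: it adds symmetric pairs for synonym and antonym (both symmetric relations) and one step of transitive closure for hyponym and hypernym. Apply it only to the train split:

```python
def augment(examples):
    """Given [[left, right], label] examples, add the symmetric pair
    for synonym/antonym and one step of transitive closure for
    hyponym/hypernym, deduplicating against the originals."""
    seen = {(tuple(ex), y) for ex, y in examples}
    new = set()
    for (a, b), y in examples:
        if y in ('synonym', 'antonym'):
            # Symmetric relations: (A, B) licenses (B, A).
            new.add(((b, a), y))
        if y in ('hyponym', 'hypernym'):
            # Transitive relations: (A, B) and (B, C) license (A, C).
            for (c, d), y2 in examples:
                if y2 == y and c == b and d != a:
                    new.add(((a, d), y))
    return [[list(ex), y] for ex, y in seen | new]
```

Repeated application would compute the full transitive closure; one step is shown here to keep the sketch simple.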