__author__ = "Christopher Potts"
__version__ = "CS224u, Stanford, Spring 2018"
Problem: word-level natural language inference.

Training examples are pairs of words $(w_{L}, w_{R}), y$ with $y$ a relation in {synonym, hypernym, hyponym, antonym}.

The dataset is due to Bowman et al. 2015. See below for details on how it was processed for this bake-off.

Make sure your environment includes all the requirements for the cs224u repository.

Make sure you have the Wikipedia 2014 + Gigaword 5 distribution of pretrained GloVe vectors downloaded and unzipped, and that glove_home below points to it.

Make sure wordentail_filename below points to the full path for nli_wordentail_bakeoff_data.json, which is included in the nlidata.zip archive.
from collections import defaultdict
import json
import numpy as np
import os
import pandas as pd
import tensorflow as tf
from tf_shallow_neural_classifier import TfShallowNeuralClassifier
import nli
import utils
nlidata_home = 'nlidata'
wordentail_filename = os.path.join(
    nlidata_home, 'nli_wordentail_bakeoff_data.json')
glove_home = os.path.join("vsmdata", "glove.6B")
As noted above, the dataset was originally released by Bowman et al. 2015, who derived it from WordNet using some heuristics (and thus it might contain some errors or unintuitive pairings).

I've processed the data into three different train/test splits, in an effort to put some pressure on our models to actually learn these semantic relations, as opposed to exploiting regularities in the sample.

edge_disjoint: The train and dev edge sets are disjoint, but many words appear in both train and dev.

word_disjoint: The train and dev vocabularies are disjoint, and thus the edges are disjoint as well.

word_disjoint_balanced: Like word_disjoint, but with each word appearing at most one time as the left word and at most one time as the right word for a given relation type.

These are progressively harder problems:

For word_disjoint, there is real pressure on the model to learn abstract relationships, as opposed to memorizing properties of individual words.

For word_disjoint_balanced, the model can't even learn that some terms tend to appear more on the left or the right. This might be a step too far: for example, appearing more on the right for hypernym corresponds in a deep way with being a more general term, which is a non-trivial lexical property that we want our models to learn.
with open(wordentail_filename) as f:
    wordentail_data = json.load(f)
The outer keys are the three splits plus a list giving the vocabulary for the entire dataset:
wordentail_data.keys()
dict_keys(['edge_disjoint', 'vocab', 'word_disjoint', 'word_disjoint_balanced'])
wordentail_data['edge_disjoint'].keys()
dict_keys(['dev', 'train'])
This is what the split looks like; all three have this same format:
wordentail_data['edge_disjoint']['dev'][:5]
[[['archived', 'records'], 'synonym'], [['stage', 'station'], 'synonym'], [['engineers', 'design'], 'hypernym'], [['save', 'book'], 'hypernym'], [['match', 'supply'], 'hypernym']]
Let's test to make sure no edges are shared between train
and dev
:
nli.get_edge_overlap_size(wordentail_data, 'edge_disjoint')
0
As we expect, a lot of vocabulary items are shared between train
and dev
:
nli.get_vocab_overlap_size(wordentail_data, 'edge_disjoint')
4769
This is a large percentage of the entire vocab:
len(wordentail_data['vocab'])
6560
Here's the distribution of labels in the train
set. It's highly imbalanced, which will pose a challenge. (I'll go ahead and reveal that the dev
set is similarly distributed.)
def label_distribution(split):
    return pd.DataFrame(wordentail_data[split]['train'])[1].value_counts()
label_distribution('edge_disjoint')
synonym     8865
hypernym    6475
hyponym     1044
antonym      629
Name: 1, dtype: int64
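To make the imbalance concrete, a classifier that always predicts synonym would already be right on a little over half of these training examples. A rough back-of-the-envelope check using the counts above:

```python
# Label counts from the edge_disjoint train split shown above.
counts = {"synonym": 8865, "hypernym": 6475, "hyponym": 1044, "antonym": 629}

total = sum(counts.values())                  # 17013 examples in all
majority_accuracy = counts["synonym"] / total # about 0.521
```

So any model worth its salt has to beat roughly 52% accuracy here, and macro-averaged F1 will punish ignoring the small antonym and hyponym classes.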
wordentail_data['word_disjoint'].keys()
dict_keys(['dev', 'train'])
In the word_disjoint
split, no words are shared between train
and dev
:
nli.get_vocab_overlap_size(wordentail_data, 'word_disjoint')
0
Because no words are shared between train
and dev
, no edges are either:
nli.get_edge_overlap_size(wordentail_data, 'word_disjoint')
0
The label distribution is similar to that of edge_disjoint
, though the overall number of examples is a bit smaller:
label_distribution('word_disjoint')
synonym     5610
hypernym    3993
hyponym      627
antonym      386
Name: 1, dtype: int64
There is still an important bias in the data: some words appear much more often than others, and in specific positions. For example, the very general term part
appears on the right in a large number of cases, many of them hypernym
.
[[ex, y] for ex, y in wordentail_data['word_disjoint']['train']
if ex[1] == 'part']
[[['frames', 'part'], 'hypernym'], [['heaven', 'part'], 'hypernym'], [['pan', 'part'], 'synonym'], [['middle', 'part'], 'hypernym'], [['shared', 'part'], 'synonym'], [['shares', 'part'], 'synonym'], [['ended', 'part'], 'hypernym'], [['twin', 'part'], 'synonym'], [['meal', 'part'], 'synonym'], [['bit', 'part'], 'hypernym'], [['sections', 'part'], 'synonym'], [['capacity', 'part'], 'hypernym'], [['beginning', 'part'], 'hypernym'], [['divorce', 'part'], 'hypernym'], [['paradise', 'part'], 'hypernym'], [['ends', 'part'], 'hypernym'], [['reduced', 'part'], 'hypernym'], [['units', 'part'], 'hypernym'], [['corner', 'part'], 'hypernym'], [['air', 'part'], 'hypernym'], [['section', 'part'], 'synonym'], [['something', 'part'], 'synonym'], [['reduce', 'part'], 'hypernym'], [['some', 'part'], 'synonym'], [['heavy', 'part'], 'hypernym'], [['segment', 'part'], 'hypernym'], [['share', 'part'], 'synonym'], [['hat', 'part'], 'hypernym'], [['maria', 'part'], 'hypernym'], [['way', 'part'], 'hypernym'], [['interests', 'part'], 'synonym']]
These tabulations suggest that a classifier could do well just by learning where words tend to appear:
def count_label_position_instances(split, pos=0):
    examples = wordentail_data[split]['train']
    return pd.Series([(ex[pos], label) for ex, label in examples]).value_counts()
count_label_position_instances('word_disjoint', pos=0).head()
(forms, hypernym)       9
(have, synonym)         8
(question, synonym)     8
(questions, synonym)    8
(items, synonym)        8
dtype: int64
count_label_position_instances('word_disjoint', pos=1).head()
(be, hypernym)        51
(take, hypernym)      39
(alter, hypernym)     38
(person, hypernym)    33
(modify, hypernym)    32
dtype: int64
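To see how far position statistics alone can go, here is a hypothetical sketch of such a baseline: predict the most frequent training label for the word in a given position, falling back to the global majority label for unseen words. (The function name and toy data are my own, not part of the course code.)

```python
from collections import Counter, defaultdict

def position_majority_predict(train, test, pos=1):
    """Predict each test example's label from the most frequent training
    label for the word at position `pos` (default: the right word),
    falling back to the overall majority label for unseen words."""
    by_word = defaultdict(Counter)
    global_counts = Counter()
    for pair, label in train:
        by_word[pair[pos]][label] += 1
        global_counts[label] += 1
    fallback = global_counts.most_common(1)[0][0]
    return [
        by_word[pair[pos]].most_common(1)[0][0] if pair[pos] in by_word else fallback
        for pair, _ in test]

# Toy illustration:
train = [[["bit", "part"], "hypernym"], [["pan", "part"], "synonym"],
         [["ends", "part"], "hypernym"], [["hot", "cold"], "antonym"]]
test = [[["way", "part"], "hypernym"], [["new", "word"], "synonym"]]
preds = position_majority_predict(train, test)  # ['hypernym', 'hypernym']
```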
To see how much our models are leveraging the uneven distribution of words across the left and right positions, we also have a split in which each word $w$ appears in at most one item $((w, w_{R}), y)$ and at most one item $((w_{L}, w), y)$.
The following tests establish that the dataset has the desired properties:
wordentail_data['word_disjoint_balanced'].keys()
dict_keys(['dev', 'train'])
nli.get_edge_overlap_size(wordentail_data, 'word_disjoint_balanced')
0
nli.get_vocab_overlap_size(wordentail_data, 'word_disjoint_balanced')
0
[[ex, y] for ex, y in wordentail_data['word_disjoint_balanced']['train']
if ex[1] == 'part']
[[['frames', 'part'], 'hypernym'], [['pan', 'part'], 'synonym']]
count_label_position_instances('word_disjoint_balanced', pos=0).head()
(remove, synonym)      1
(close, hyponym)       1
(seminar, hypernym)    1
(wants, hyponym)       1
(reform, synonym)      1
dtype: int64
count_label_position_instances('word_disjoint_balanced', pos=1).head()
(remove, synonym)       1
(attitude, synonym)     1
(relation, hypernym)    1
(weak, synonym)         1
(soon, synonym)         1
dtype: int64
Even in deep learning, feature representation is the most important thing and requires care! For our task, feature representation has two parts: representing the individual words and combining those representations into a single network input.
Let's consider two baseline word representation methods:

1. Random vectors (as returned by utils.randvec).
2. 50-dimensional GloVe vectors (loaded via utils.glove2dict).

def randvec(w, n=50, lower=-1.0, upper=1.0):
    """Returns a random vector of length `n`. `w` is ignored."""
    return utils.randvec(n=n, lower=lower, upper=upper)
# Any of the files in glove.6B will work here:
glove50_src = os.path.join(glove_home, 'glove.6B.50d.txt')
# Creates a dict mapping strings (words) to GloVe vectors:
GLOVE50 = utils.glove2dict(glove50_src)
def glove50vec(w):
    """Return `w`'s GloVe representation if available, else return
    a random vector."""
    return GLOVE50.get(w, randvec(w, n=50))
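For intuition, utils.glove2dict just parses the whitespace-separated GloVe text format: each line is a word followed by its vector components. A minimal sketch of that parsing (the helper name here is illustrative, not the repository's code):

```python
import numpy as np

def parse_glove_line(line):
    """Parse one line of a GloVe text file: a word followed by its
    space-separated vector components."""
    fields = line.rstrip().split(" ")
    return fields[0], np.array(fields[1:], dtype=float)

word, vec = parse_glove_line("puppy 0.1 -0.2 0.3")
# word == 'puppy'; vec is a 3-dimensional float array
```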
Here we decide how to combine the two word vectors into a single representation. In more detail, where u
is a vector representation of the left word and v
is a vector representation of the right word, we need a function vector_combo_func
such that vector_combo_func(u, v)
returns a new input vector z
of dimension m
. A simple example is concatenation:
def vec_concatenate(u, v):
    """Concatenate np.array instances `u` and `v` into a new np.array."""
    return np.concatenate((u, v))
vector_combo_func
could instead be vector average, vector difference, etc. (even combinations of those) – there's lots of space for experimentation here.
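As a sketch of such alternatives (these particular helper names are my own, not part of the course code):

```python
import numpy as np

def vec_average(u, v):
    """Elementwise mean; keeps the input dimensionality."""
    return (u + v) / 2.0

def vec_diff(u, v):
    """Elementwise difference; asymmetric, which might help with
    asymmetric relations like hypernym/hyponym."""
    return u - v

def vec_concat_with_diff(u, v):
    """Combinations are also possible, e.g. concatenation plus difference."""
    return np.concatenate((u, v, u - v))

u = np.array([1.0, 2.0])
v = np.array([3.0, 4.0])
z = vec_average(u, v)           # array([2., 3.])
z2 = vec_concat_with_diff(u, v) # 6-dimensional vector
```

Any of these can be passed as vector_combo_func below, since each maps a pair of word vectors to a single fixed-dimensional input.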
For a baseline model, I chose TfShallowNeuralClassifier
with a pretty large hidden layer and a correspondingly high number of iterations.
net = TfShallowNeuralClassifier(hidden_dim=200, max_iter=500)
The following puts the above pieces together, using vector_func=glove50vec
, since vector_func=randvec
seems so hopelessly misguided for word_disjoint
and word_disjoint_balanced
!
First, we build the dataset:
X = nli.build_bakeoff_dataset(
    wordentail_data,
    vector_func=glove50vec,
    vector_combo_func=vec_concatenate)
And then we run the experiment with nli.bakeoff_experiment
. This trains and tests on all three splits, and additionally trains on word_disjoint
's train
portion and tests on word_disjoint_balanced
's dev
portion, to see what distribution of examples is more effective for this balanced evaluation.
Since the bake-off focus is word_disjoint
, you might want to run just that evaluation. To do that, use:
nli.bakeoff_experiment(X, net, conditions=['word_disjoint'])
Iteration 500: loss: 9.8024865388870243
======================================================================
word_disjoint
             precision    recall  f1-score   support

    antonym       0.00      0.00      0.00       150
   hypernym       0.54      0.43      0.48      1594
    hyponym       0.22      0.01      0.03       275
    synonym       0.59      0.77      0.67      2229

avg / total       0.52      0.57      0.53      4248
This will run the complete evaluation:
nli.bakeoff_experiment(X, net)
Iteration 500: loss: 15.278596043586731
Iteration 2: loss: 12.505775809288025
======================================================================
edge_disjoint
             precision    recall  f1-score   support

    antonym       0.00      0.00      0.00       392
   hypernym       0.58      0.43      0.49      4310
    hyponym       0.51      0.04      0.07       710
    synonym       0.59      0.80      0.68      5930

avg / total       0.56      0.59      0.55     11342
Iteration 3: loss: 4.142282485961914884
======================================================================
word_disjoint
             precision    recall  f1-score   support

    antonym       0.00      0.00      0.00       150
   hypernym       0.54      0.43      0.48      1594
    hyponym       0.33      0.02      0.04       275
    synonym       0.59      0.78      0.67      2229

avg / total       0.53      0.57      0.53      4248
Iteration 2: loss: 13.00054156780243554
======================================================================
word_disjoint_balanced
             precision    recall  f1-score   support

    antonym       0.00      0.00      0.00       115
   hypernym       0.48      0.28      0.35       511
    hyponym       0.27      0.03      0.05       118
    synonym       0.55      0.84      0.67       831

avg / total       0.47      0.54      0.47      1575
Iteration 500: loss: 9.7987087368965157
======================================================================
word_disjoint_balanced, training on word_disjoint
             precision    recall  f1-score   support

    antonym       0.00      0.00      0.00       115
   hypernym       0.49      0.43      0.46       511
    hyponym       0.20      0.01      0.02       118
    synonym       0.58      0.79      0.67       831

avg / total       0.48      0.56      0.50      1575
The goal: achieve the highest average F1 score on word_disjoint.

Submit: your results on the word_disjoint split.

Submission URL: https://goo.gl/forms/CizXwS3kfPjsThxA3
Notes:
For the methods, the only requirement is that they differ in some way from the baseline above. They don't have to be completely different, though. For example, you might want to stick with the model but represent examples differently, or the reverse.
You must train only on the train
split. No outside training instances can be brought in. You can, though, bring in outside information via your input vectors, as long as this information is not from dev
or edge_disjoint
.
You can also augment your training data. For example, if ((A, B), synonym) is a training instance, then so should be ((B, A), synonym). Similarly, if ((A, B), hyponym) and ((B, C), hyponym) are training cases, then so should be ((A, C), hyponym).
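A minimal sketch of these two augmentation steps, as a single pass over the training edges (the function name and toy data are mine; iterate to closure if you want all derived pairs):

```python
def augment(train):
    """One augmentation pass: add (B, A) for each synonym pair (A, B),
    and (A, C) for each hyponym chain (A, B), (B, C)."""
    edges = {(tuple(pair), label) for pair, label in train}
    new = set()
    for (a, b), label in edges:
        if label == "synonym":
            new.add(((b, a), "synonym"))
        if label == "hyponym":
            for (c, d), label2 in edges:
                if label2 == "hyponym" and c == b:
                    new.add(((a, d), "hyponym"))
    return [[list(pair), label] for pair, label in edges | new]

# Toy illustration:
train = [[["couch", "sofa"], "synonym"],
         [["pug", "dog"], "hyponym"],
         [["dog", "animal"], "hyponym"]]
augmented = augment(train)
# Adds [['sofa', 'couch'], 'synonym'] and [['pug', 'animal'], 'hyponym']
```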
Since the evaluation is for word_disjoint
, you're not going to get very far with random input vectors! A GloVe featurizer is defined above. Feel free to look around for new word vectors on the Web, or even train your own using our VSM notebooks.
You're not required to stick to TfShallowNeuralClassifier
. For instance, you could create deeper feed-forward networks, change how they optimize, etc. As long as you have fit
and predict
methods with the same input and output types as our networks, you should be able to use bakeoff_experiment
. For notes on how to extend the TensorFlow models included in this repository, see tensorflow_models.ipynb.