__author__ = "Christopher Potts"
__version__ = "CS224u, Stanford, Spring 2018"
This notebook defines and explores a number of models for NLI. The general plot is familiar from our work with the Stanford Sentiment Treebank:
The twist here is that, while NLI is another classification problem, the inputs have important high-level structure: a premise and a hypothesis. This invites exploration of a host of neural model designs:
In sentence-encoding models, the premise and hypothesis are analyzed separately, combined only for the final classification step.
In chained models, the premise is processed first, then the hypothesis, giving a unified representation of the pair.
NLI resembles sequence-to-sequence problems like machine translation and language modeling. The central modeling difference is that NLI doesn't produce an output sequence, but rather consumes two sequences to produce a label. Still, there are enough affinities that many ideas have been shared among these fields.
See the previous notebook for set-up instructions for this unit.
Additionally, make sure you still have the Wikipedia 2014 + Gigaword 5 distribution of the pretrained GloVe vectors. This is probably already in vsmdata.
from collections import Counter
from itertools import product
import numpy as np
from sklearn.linear_model import LogisticRegression
import tensorflow as tf
from tf_rnn_classifier import TfRNNClassifier
from tf_shallow_neural_classifier import TfShallowNeuralClassifier
import nli
import os
import sst
import utils
glove_home = os.path.join('vsmdata', 'glove.6B')
data_home = "nlidata"
snli_home = os.path.join(data_home, "snli_1.0")
multinli_home = os.path.join(data_home, "multinli_1.0")
We begin by looking at models based in sparse, hand-built feature representations. As in earlier units of the course, we will see that these models are competitive: easy to design, fast to optimize, and highly effective.
The guiding idea for NLI sparse features is that one wants to knit together the premise and hypothesis, so that the model can learn about their relationships rather than just about each part separately.
With word_overlap_phi, we just get the set of words that occur in both the premise and hypothesis.
def word_overlap_phi(t1, t2):
    """Basis for features for the words in both the premise and hypothesis.
    This tends to produce very sparse representations.

    Parameters
    ----------
    t1, t2 : `nltk.tree.Tree`
        As given by `str2tree`.

    Returns
    -------
    collections.Counter
        Maps each word in both `t1` and `t2` to 1.

    """
    overlap = set(t1.leaves()) & set(t2.leaves())
    return Counter(overlap)
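To make the behavior concrete, here is a minimal self-contained sketch. `ToyTree` is a hypothetical stand-in for `nltk.tree.Tree` (only `leaves()` is needed), and the two sentences are invented for illustration.

```python
from collections import Counter


class ToyTree:
    """Hypothetical stand-in for `nltk.tree.Tree`; only `leaves()` is used."""

    def __init__(self, words):
        self._words = list(words)

    def leaves(self):
        return list(self._words)


def word_overlap_phi(t1, t2):
    # Words that occur in both the premise and the hypothesis, each mapped to 1.
    overlap = set(t1.leaves()) & set(t2.leaves())
    return Counter(overlap)


premise = ToyTree("a dog runs in the park".split())
hypothesis = ToyTree("a dog is outside".split())
overlap_feats = word_overlap_phi(premise, hypothesis)
print(sorted(overlap_feats.items()))  # [('a', 1), ('dog', 1)]
```

Only two features fire for this pair, which illustrates why these representations are so sparse.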
With word_cross_product_phi, we count all the pairs $(w_{1}, w_{2})$ where $w_{1}$ is a word from the premise and $w_{2}$ is a word from the hypothesis. This creates a very large feature space. These models are very strong right out of the box, and they can be supplemented with more fine-grained features.
def word_cross_product_phi(t1, t2):
    """Basis for cross-product features. This tends to produce pretty
    dense representations.

    Parameters
    ----------
    t1, t2 : `nltk.tree.Tree`
        As given by `str2tree`.

    Returns
    -------
    collections.Counter
        Maps each (w1, w2) in the cross-product of `t1.leaves()` and
        `t2.leaves()` to its count. This is a multi-set cross-product
        (repetitions matter).

    """
    return Counter([(w1, w2) for w1, w2 in product(t1.leaves(), t2.leaves())])
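The multi-set behavior is easiest to see on a tiny example. The sketch below works on plain word lists rather than trees; the helper name `cross_product_counts` and the sentences are illustrative assumptions, not part of the course code.

```python
from collections import Counter
from itertools import product


def cross_product_counts(premise_words, hypothesis_words):
    """Count every (premise word, hypothesis word) pair, with repetitions."""
    return Counter(product(premise_words, hypothesis_words))


premise_words = "a dog chased a cat".split()   # 5 tokens, 4 distinct words
hypothesis_words = "a pet ran".split()         # 3 tokens, 3 distinct words
pair_counts = cross_product_counts(premise_words, hypothesis_words)

# 'a' occurs twice in the premise, so each of its pairs is counted twice.
print(pair_counts[("a", "pet")])     # 2
# The counts sum to len(premise) * len(hypothesis).
print(sum(pair_counts.values()))     # 15
# The number of distinct features is the product of the distinct word counts.
print(len(pair_counts))              # 12
```

Even this short pair yields 12 distinct features, which suggests how quickly the feature space grows over a full corpus.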
Our experiment framework is basically the same as the one we used for the Stanford Sentiment Treebank. Here, I actually use sst.fit_classifier_with_crossvalidation (from that unit) to create a wrapper around LogisticRegression for cross-validation of hyperparameters. At this point, I am not sure what parameters will be good for our NLI datasets, so this hyperparameter search is vital.
def fit_maxent_with_crossvalidation(X, y):
    """A MaxEnt model of dataset with hyperparameter cross-validation.

    Parameters
    ----------
    X : 2d np.array
        The matrix of features, one example per row.
    y : list
        The list of labels for rows in `X`.

    Returns
    -------
    sklearn.linear_model.LogisticRegression
        A trained model instance, the best model found.

    """
    # `liblinear` supports both the 'l1' and 'l2' penalties searched below:
    basemod = LogisticRegression(fit_intercept=True, solver='liblinear')
    cv = 3
    param_grid = {'C': [0.4, 0.6, 0.8, 1.0],
                  'penalty': ['l1', 'l2']}
    return sst.fit_classifier_with_crossvalidation(X, y, basemod, cv, param_grid)
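For readers who don't have the `sst` module at hand, the following is a rough sketch of what such a cross-validation wrapper could look like when built on scikit-learn's `GridSearchCV`. The function name `grid_search_sketch`, the `f1_macro` scoring choice, and the synthetic data are all assumptions for illustration; the course's own wrapper may differ.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV


def grid_search_sketch(X, y, basemod, cv, param_grid, scoring="f1_macro"):
    """Hypothetical stand-in for `sst.fit_classifier_with_crossvalidation`."""
    search = GridSearchCV(basemod, param_grid, cv=cv, scoring=scoring)
    search.fit(X, y)
    print("Best params", search.best_params_)
    print("Best score: {:0.03f}".format(search.best_score_))
    # By default, GridSearchCV refits the best setting on all of X, y.
    return search.best_estimator_


# Tiny synthetic two-class problem, just to exercise the search.
rng = np.random.RandomState(42)
X_toy = rng.randn(60, 5)
y_toy = (X_toy[:, 0] + 0.1 * rng.randn(60) > 0).astype(int)

best_mod = grid_search_sketch(
    X_toy, y_toy,
    basemod=LogisticRegression(fit_intercept=True, solver="liblinear"),
    cv=3,
    param_grid={"C": [0.4, 0.6, 0.8, 1.0], "penalty": ["l1", "l2"]})
```

The returned estimator can be used directly for prediction, which is what lets it slot in as a `train_func`.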
Because SNLI and MultiNLI are huge, we can't afford to do experiments on the full datasets all the time. Thus, we will mainly work within the training sets, using the train readers to sample smaller datasets that can then be divided for training and assessment.
Here, we sample 10% of the training examples. I set the random seed (random_state=42) so that we get consistency across the samples; setting random_state=None will give new random samples each time.
train_reader = nli.SNLITrainReader(
samp_percentage=0.10, random_state=42)
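The effect of the seed can be sketched without the `nli` module. The code below assumes that each example is kept independently with probability `samp_percentage`; that is an assumption about how such a reader might sample, and `sample_examples` is a hypothetical helper, not the reader's actual code.

```python
import random


def sample_examples(examples, samp_percentage, random_state=None):
    """Keep each example independently with probability `samp_percentage`."""
    rng = random.Random(random_state)
    return [ex for ex in examples if rng.random() <= samp_percentage]


examples = list(range(1000))  # stand-in for a stream of training examples
sample_a = sample_examples(examples, 0.10, random_state=42)
sample_b = sample_examples(examples, 0.10, random_state=42)
sample_c = sample_examples(examples, 0.10, random_state=None)

print(sample_a == sample_b)            # True: same seed, same sample
print(len(sample_a), len(sample_c))    # roughly 10% each, not exactly 100
```

Under this scheme the sample size itself varies from run to run, which would be one reason for support counts to differ slightly between experiments.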
An experimental dataset can be built directly from the reader and a feature function:
dataset = nli.build_dataset(train_reader, word_overlap_phi)
dataset.keys()
dict_keys(['X', 'y', 'vectorizer', 'raw_examples'])
However, it's more efficient to use nli.experiment to bring all these pieces together. This wrapper will work for all the models we consider.
_ = nli.experiment(
train_reader=nli.SNLITrainReader(samp_percentage=0.10),
phi=word_overlap_phi,
train_func=fit_maxent_with_crossvalidation,
assess_reader=None,
random_state=42)
Best params {'C': 0.6, 'penalty': 'l2'}
Best score: 0.412
               precision    recall  f1-score   support

contradiction      0.436     0.621     0.513      5572
   entailment      0.455     0.388     0.419      5498
      neutral      0.379     0.272     0.317      5496

  avg / total      0.423     0.428     0.416     16566
_ = nli.experiment(
train_reader=nli.SNLITrainReader(samp_percentage=0.10),
phi=word_cross_product_phi,
train_func=fit_maxent_with_crossvalidation,
assess_reader=None,
random_state=42)
Best params {'C': 0.4, 'penalty': 'l1'}
Best score: 0.605
               precision    recall  f1-score   support

contradiction      0.673     0.633     0.652      5520
   entailment      0.616     0.694     0.653      5517
      neutral      0.595     0.554     0.574      5396

  avg / total      0.628     0.628     0.627     16433
As expected, word_cross_product_phi is very strong. Let's take the hyperparameters chosen there and use them for an experiment in which we train on the entire training set and evaluate on the dev set; this seems like a good way to balance responsible search over hyperparameters with our resource limitations.
def fit_maxent_classifier_with_preselected_params(X, y):
    mod = LogisticRegression(
        fit_intercept=True,
        penalty='l1',
        solver='saga', ## Required for penalty='l1'.
        C=0.4)
    mod.fit(X, y)
    return mod
%%time
_ = nli.experiment(
train_reader=nli.SNLITrainReader(samp_percentage=1.0),
assess_reader=nli.SNLIDevReader(samp_percentage=1.0),
phi=word_cross_product_phi,
train_func=fit_maxent_classifier_with_preselected_params,
random_state=None)
               precision    recall  f1-score   support

contradiction      0.762     0.729     0.745      3278
   entailment      0.708     0.795     0.749      3329
      neutral      0.716     0.657     0.685      3235

  avg / total      0.729     0.728     0.727      9842

CPU times: user 19min 17s, sys: 9.56 s, total: 19min 26s
Wall time: 19min 26s
This baseline is very similar to the one established in the original SNLI paper by Bowman et al. for models like this one.
We turn now to sentence-encoding models. The hallmark of these is that the premise and hypothesis get their own representation in some sense, and then those representations are combined to predict the label. Bowman et al. 2015 explore models of this form as part of introducing SNLI.
The feed-forward networks we used in the word-level bake-off are members of this family of models: each word was represented separately, and the concatenation of those representations was used as the input to the model.
Perhaps the simplest sentence-encoding model sums (or averages, etc.) the word representations for the premise, does the same for the hypothesis, and concatenates those two representations for use as the input to a linear classifier.
Here's a diagram that is meant to suggest the full space of models of this form:
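Here's an implementation of this model where the word vectors are 50-dimensional GloVe vectors, the word representations are summed by default, the premise and hypothesis vectors are concatenated, and a softmax classifier is used at the top: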
glove_lookup = utils.glove2dict(
os.path.join(glove_home, 'glove.6B.50d.txt'))
def glove_leaves_phi(t1, t2, np_func=np.sum):
    """Represent `t1` and `t2` as the concatenation of the combined
    vectors of their words.

    Parameters
    ----------
    t1 : nltk.Tree
    t2 : nltk.Tree
    np_func : function (default: np.sum)
        A numpy matrix operation that can be applied columnwise,
        like `np.mean`, `np.sum`, or `np.prod`. The requirement is that
        the function take `axis=0` as one of its arguments (to ensure
        columnwise combination) and that it return a vector of a
        fixed length, no matter what the size of the tree is.

    Returns
    -------
    np.array

    """
    prem_vecs = _get_tree_vecs(t1, glove_lookup, np_func)
    hyp_vecs = _get_tree_vecs(t2, glove_lookup, np_func)
    return np.concatenate((prem_vecs, hyp_vecs))


def _get_tree_vecs(tree, lookup, np_func):
    allvecs = np.array([lookup[w] for w in tree.leaves() if w in lookup])
    if len(allvecs) == 0:
        # No known words: fall back to a zero vector of the right size.
        dim = len(next(iter(lookup.values())))
        feats = np.zeros(dim)
    else:
        feats = np_func(allvecs, axis=0)
    return feats
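A toy run may help clarify what the feature function returns. The sketch below replaces GloVe with a hypothetical three-dimensional lookup and works on word lists instead of trees; `toy_lookup`, `combine_words`, and `toy_leaves_phi` are illustrative names only.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for GloVe vectors.
toy_lookup = {
    "dog": np.array([1.0, 0.0, 2.0]),
    "runs": np.array([0.0, 1.0, 1.0]),
    "animal": np.array([1.0, 1.0, 0.0])}


def combine_words(words, lookup, np_func=np.sum):
    """Combine the vectors of known words columnwise; zeros if none are known."""
    vecs = np.array([lookup[w] for w in words if w in lookup])
    if len(vecs) == 0:
        dim = len(next(iter(lookup.values())))
        return np.zeros(dim)
    return np_func(vecs, axis=0)


def toy_leaves_phi(prem_words, hyp_words, np_func=np.sum):
    prem = combine_words(prem_words, toy_lookup, np_func)
    hyp = combine_words(hyp_words, toy_lookup, np_func)
    return np.concatenate((prem, hyp))


summed = toy_leaves_phi(["dog", "runs"], ["animal", "moves"])
averaged = toy_leaves_phi(["dog", "runs"], ["animal", "moves"], np_func=np.mean)
unknown = toy_leaves_phi(["xyzzy"], ["animal"])

print(summed.tolist())    # [1.0, 1.0, 3.0, 1.0, 1.0, 0.0]
print(averaged.tolist())  # [0.5, 0.5, 1.5, 1.0, 1.0, 0.0]
print(unknown.tolist())   # [0.0, 0.0, 0.0, 1.0, 1.0, 0.0]
```

The output always has twice the embedding dimension, regardless of sentence length, and out-of-vocabulary words such as "moves" are silently skipped.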
_ = nli.experiment(
train_reader=nli.SNLITrainReader(samp_percentage=0.10),
phi=glove_leaves_phi,
train_func=fit_maxent_with_crossvalidation,
assess_reader=None,
random_state=42,
vectorize=False) # Ask `experiment` not to featurize; we did it already.
Best params {'C': 0.6, 'penalty': 'l1'}
Best score: 0.508
               precision    recall  f1-score   support

contradiction      0.492     0.471     0.481      5456
   entailment      0.499     0.558     0.527      5558
      neutral      0.531     0.492     0.511      5543

  avg / total      0.508     0.507     0.506     16557
A small tweak to the above is to use a neural network instead of a softmax classifier at the top:
def fit_shallow_neural_classifier_with_crossvalidation(X, y):
    basemod = TfShallowNeuralClassifier(max_iter=1000)
    cv = 3
    param_grid = {'hidden_dim': [25, 50, 100]}
    return sst.fit_classifier_with_crossvalidation(X, y, basemod, cv, param_grid)
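To show what the extra layer adds, here is a NumPy sketch of the forward pass of a one-hidden-layer classifier. It assumes a tanh hidden activation and random, untrained weights; it is not the `TfShallowNeuralClassifier` implementation, and all names are illustrative.

```python
import numpy as np


def softmax(z):
    """Row-wise softmax, shifted for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    exp_z = np.exp(z)
    return exp_z / exp_z.sum(axis=1, keepdims=True)


def shallow_forward(X, W_xh, b_h, W_hy, b_y):
    """One tanh hidden layer followed by a softmax over the labels."""
    h = np.tanh(X.dot(W_xh) + b_h)
    return softmax(h.dot(W_hy) + b_y)


rng = np.random.RandomState(0)
input_dim, hidden_dim, n_labels = 100, 50, 3  # 100 = two concatenated 50d vectors
X_batch = rng.randn(4, input_dim)             # 4 featurized examples
probs = shallow_forward(
    X_batch,
    W_xh=rng.randn(input_dim, hidden_dim) * 0.1,
    b_h=np.zeros(hidden_dim),
    W_hy=rng.randn(hidden_dim, n_labels) * 0.1,
    b_y=np.zeros(n_labels))

print(probs.shape)        # (4, 3)
print(probs.sum(axis=1))  # each row sums to 1, up to rounding
```

Removing the hidden layer (feeding `X` straight into the softmax) recovers the softmax classifier used in the previous experiment.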
_ = nli.experiment(
train_reader=nli.SNLITrainReader(samp_percentage=0.10),
phi=glove_leaves_phi,
train_func=fit_shallow_neural_classifier_with_crossvalidation,
assess_reader=None,
random_state=42,
vectorize=False) # Ask `experiment` not to featurize; we did it already.
Iteration 1000: loss: 31.445966720581055
Best params {'hidden_dim': 100}
Best score: 0.538
               precision    recall  f1-score   support

contradiction      0.623     0.431     0.510      5438
   entailment      0.551     0.591     0.570      5432
      neutral      0.507     0.624     0.560      5529

  avg / total      0.560     0.549     0.547     16399
A more sophisticated sentence-encoding model processes the premise and hypothesis with separate RNNs and uses the concatenation of their final states as the basis for the classification decision at the top:
This model is particularly easy to implement using the TensorFlow framework for this course: subclass TfRNNClassifier, define a build_graph method that creates the premise and hypothesis RNNs and concatenates their final states, and define train_dict and test_dict to featurize the incoming examples X as pairs of lists of words, one for the premise and the other for the hypothesis. Here is a complete implementation:
class TfNLISentenceRepRNN(TfRNNClassifier):

    def build_graph(self):
        self._define_embedding()

        # Separate RNN graphs:
        self.prem_last = self.build_premise_graph()
        self.hyp_last = self.build_hypothesis_graph()

        # The outputs are labels as usual:
        self.outputs = tf.placeholder(
            tf.float32, shape=[None, self.output_dim])

        # Output softmax layer:
        self.last = tf.concat((self.prem_last, self.hyp_last), axis=1)
        self.last_dim = self.hidden_dim * 2
        self.W_ly = self.weight_init(
            self.last_dim, self.output_dim, 'W_ly')
        self.b_y = self.bias_init(self.output_dim, 'b_y')
        self.model = tf.matmul(self.last, self.W_ly) + self.b_y

    def build_premise_graph(self):
        self.premises = tf.placeholder(
            tf.int32, [None, self.max_length])
        self.prem_lengths = tf.placeholder(tf.int32, [None])
        self.prem_feats = tf.nn.embedding_lookup(
            self.embedding, self.premises)
        self.prem_cell = self.cell_class(
            self.hidden_dim, activation=self.hidden_activation)
        with tf.variable_scope('premise'):
            prem_outputs, prem_state = tf.nn.dynamic_rnn(
                self.prem_cell,
                self.prem_feats,
                dtype=tf.float32,
                sequence_length=self.prem_lengths)
        prem_last = self._get_final_state(self.prem_cell, prem_state)
        return prem_last

    def build_hypothesis_graph(self):
        self.hypotheses = tf.placeholder(
            tf.int32, [None, self.max_length])
        self.hyp_lengths = tf.placeholder(tf.int32, [None])
        self.hyp_feats = tf.nn.embedding_lookup(
            self.embedding, self.hypotheses)
        self.hyp_cell = self.cell_class(
            self.hidden_dim, activation=self.hidden_activation)
        with tf.variable_scope('hypothesis'):
            hyp_outputs, hyp_state = tf.nn.dynamic_rnn(
                self.hyp_cell,
                self.hyp_feats,
                dtype=tf.float32,
                sequence_length=self.hyp_lengths)
        hyp_last = self._get_final_state(self.hyp_cell, hyp_state)
        return hyp_last

    def train_dict(self, X, y):
        X_prem, X_hyp = zip(*X)
        X_prem, prem_lengths = self._convert_X(X_prem)
        X_hyp, hyp_lengths = self._convert_X(X_hyp)
        return {self.premises: X_prem,
                self.hypotheses: X_hyp,
                self.prem_lengths: prem_lengths,
                self.hyp_lengths: hyp_lengths,
                self.outputs: y}

    def test_dict(self, X):
        X_prem, X_hyp = zip(*X)
        X_prem, prem_lengths = self._convert_X(X_prem)
        X_hyp, hyp_lengths = self._convert_X(X_hyp)
        return {self.premises: X_prem,
                self.hypotheses: X_hyp,
                self.prem_lengths: prem_lengths,
                self.hyp_lengths: hyp_lengths}
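The computation that this graph defines can be sketched in a few lines of NumPy for a single example. This assumes a simple tanh RNN cell and random, untrained weights, so it only illustrates the data flow (two separate RNNs, concatenated final states, one output layer); the helper names are hypothetical.

```python
import numpy as np


def rnn_final_state(X_seq, W_xh, W_hh, b_h, h0=None):
    """Run a simple tanh RNN over `X_seq` (n_steps x embed_dim); return the last state."""
    h = np.zeros(W_hh.shape[0]) if h0 is None else h0
    for x_t in X_seq:
        h = np.tanh(x_t.dot(W_xh) + h.dot(W_hh) + b_h)
    return h


def init_rnn(rng, embed_dim, hidden_dim):
    """Random (W_xh, W_hh, b_h) for one RNN."""
    return (rng.randn(embed_dim, hidden_dim) * 0.1,
            rng.randn(hidden_dim, hidden_dim) * 0.1,
            np.zeros(hidden_dim))


rng = np.random.RandomState(0)
embed_dim, hidden_dim, n_labels = 8, 5, 3
prem_params = init_rnn(rng, embed_dim, hidden_dim)  # premise RNN
hyp_params = init_rnn(rng, embed_dim, hidden_dim)   # separate hypothesis RNN
W_ly = rng.randn(hidden_dim * 2, n_labels) * 0.1
b_y = np.zeros(n_labels)

premise = rng.randn(6, embed_dim)     # 6 embedded premise tokens
hypothesis = rng.randn(4, embed_dim)  # 4 embedded hypothesis tokens

prem_last = rnn_final_state(premise, *prem_params)
hyp_last = rnn_final_state(hypothesis, *hyp_params)
last = np.concatenate((prem_last, hyp_last))  # size hidden_dim * 2
logits = last.dot(W_ly) + b_y

print(last.shape, logits.shape)  # (10,) (3,)
```

The doubled size of `last` corresponds to `self.last_dim = self.hidden_dim * 2` in the class above.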
For evaluation, we define a wrapper for TfNLISentenceRepRNN:
def fit_sentence_rep_rnn(X, y):
    vocab = get_vocab(X, n_words=2000)
    # Reduce the network size or `max_iter` for non-GPU usage:
    mod = TfNLISentenceRepRNN(vocab, hidden_dim=50, max_iter=1000)
    mod.fit(X, y)
    return mod
Examples are represented as pairs of lists of words:
def sentence_rep_rnn_phi(t1, t2):
    return [t1.leaves(), t2.leaves()]
We carry over our usual method for getting a vocabulary for the RNN:
def get_vocab(X, n_words=None):
    wc = Counter([w for pair in X for ex in pair for w in ex])
    wc = wc.most_common(n_words) if n_words else wc.items()
    vocab = {w for w, c in wc}
    vocab.add("$UNK")
    return sorted(vocab)
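A quick check on invented data shows what this function returns and how `n_words` truncates the vocabulary. The function body is repeated so that the snippet runs on its own; `X_toy` is made up for illustration.

```python
from collections import Counter


def get_vocab(X, n_words=None):
    # X is a list of (premise words, hypothesis words) pairs.
    wc = Counter([w for pair in X for ex in pair for w in ex])
    wc = wc.most_common(n_words) if n_words else wc.items()
    vocab = {w for w, c in wc}
    vocab.add("$UNK")
    return sorted(vocab)


X_toy = [
    (["a", "dog", "runs"], ["a", "dog", "moves"]),
    (["a", "cat", "sleeps"], ["a", "dog", "sleeps"])]

full_vocab = get_vocab(X_toy)
small_vocab = get_vocab(X_toy, n_words=2)

print(full_vocab)   # ['$UNK', 'a', 'cat', 'dog', 'moves', 'runs', 'sleeps']
print(small_vocab)  # ['$UNK', 'a', 'dog']
```

With `n_words=2`, only the two most frequent words survive; every other word would be treated as `$UNK` by the model.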
And finally a basic experiment; for a real analysis, we would train for much longer, find the optimal hyperparameters, and then scale this up to a full train/dev evaluation.
_ = nli.experiment(
train_reader=nli.SNLITrainReader(samp_percentage=0.10),
phi=sentence_rep_rnn_phi,
train_func=fit_sentence_rep_rnn,
assess_reader=None,
random_state=42,
vectorize=False)
Iteration 1000: loss: 33.303855180740356
               precision    recall  f1-score   support

contradiction      0.553     0.574     0.563      5700
   entailment      0.601     0.582     0.592      5511
      neutral      0.544     0.540     0.542      5305

  avg / total      0.566     0.566     0.566     16516
Given that we already explored tree-structured neural networks (TreeNNs), it's natural to consider these as the basis for sentence-encoding NLI models:
And this is just the beginning: any model used to represent sentences is presumably a candidate for use in sentence-encoding NLI!
The final major class of NLI designs we look at are those in which the premise and hypothesis are processed sequentially, as a pair. These don't deliver representations of the premise or hypothesis separately. They bear the strongest resemblance to classic sequence-to-sequence models.
In the simplest version of this model, we just concatenate the premise and hypothesis. The model itself is identical to the one we used for the Stanford Sentiment Treebank:
To implement this, we can use TfRNNClassifier out of the box. We just need to concatenate the leaves of the premise and hypothesis trees:
def simple_chained_rep_rnn_phi(t1, t2):
    """Map `t1` and `t2` to a single list of leaf nodes.

    A slight variant might insert a designated boundary symbol between
    the premise leaves and the hypothesis leaves. Be sure to add it to
    the vocab in that case, else it will be $UNK.

    """
    return t1.leaves() + t2.leaves()
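The boundary-symbol variant mentioned in the docstring might look like the following sketch. The symbol name `$SEP` and both helper functions are hypothetical; any token that cannot occur in the data would serve.

```python
BOUNDARY = "$SEP"  # hypothetical boundary token, assumed absent from the data


def chained_with_boundary(prem_words, hyp_words, boundary=BOUNDARY):
    """Concatenate premise and hypothesis words with a boundary token between."""
    return list(prem_words) + [boundary] + list(hyp_words)


def add_boundary_to_vocab(vocab, boundary=BOUNDARY):
    """Make sure the boundary token is in the vocab so it is not mapped to $UNK."""
    return sorted(set(vocab) | {boundary})


seq = chained_with_boundary(["a", "dog", "runs"], ["an", "animal", "moves"])
vocab = add_boundary_to_vocab(["$UNK", "a", "animal", "dog"])

print(seq)    # ['a', 'dog', 'runs', '$SEP', 'an', 'animal', 'moves']
print(vocab)  # ['$SEP', '$UNK', 'a', 'animal', 'dog']
```

The boundary gives the RNN an explicit signal for where the premise ends, which plain concatenation does not provide.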
Here's a quick evaluation, just to get a feel for this model:
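def fit_simple_chained_rnn(X, y):
    # `get_vocab` iterates over (premise, hypothesis) pairs, so wrap each
    # chained sequence in a 1-tuple to count words rather than characters:
    vocab = get_vocab([(ex,) for ex in X], n_words=2000)
    # Reduce the network size or `max_iter` for non-GPU usage:
    mod = TfRNNClassifier(vocab, hidden_dim=50, max_iter=1000)
    mod.fit(X, y)
    return mod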
_ = nli.experiment(
train_reader=nli.SNLITrainReader(samp_percentage=0.10),
phi=simple_chained_rep_rnn_phi,
train_func=fit_simple_chained_rnn,
assess_reader=None,
random_state=42,
vectorize=False)
Iteration 1000: loss: 40.29029595851898
               precision    recall  f1-score   support

contradiction      0.380     0.277     0.320      5507
   entailment      0.465     0.473     0.469      5407
      neutral      0.430     0.540     0.478      5497

  avg / total      0.425     0.429     0.422     16411
A natural variation on the above is to give the premise and hypothesis each their own RNN:
This greatly increases the number of parameters, but it gives the model more chances to learn that appearing in the premise is different from appearing in the hypothesis. One could even push this idea further by giving the premise and hypothesis their own embeddings as well.
Implementing this in our TensorFlow framework is very easy, involving only minor modifications to TfNLISentenceRepRNN above. Note that tf.nn.dynamic_rnn has a keyword parameter initial_state that can be a Tensor. Thus, the final state of the premise RNN can be passed in here to chain the two RNNs together.
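The chaining idea can be illustrated outside TensorFlow. The NumPy sketch below assumes simple tanh RNNs with random, untrained weights; `run_rnn` is a hypothetical helper, and the only point is that the hypothesis RNN starts from the premise RNN's final state instead of from zeros.

```python
import numpy as np


def run_rnn(X_seq, W_xh, W_hh, h0):
    """Run a simple tanh RNN from initial state `h0`; return the final state."""
    h = h0
    for x_t in X_seq:
        h = np.tanh(x_t.dot(W_xh) + h.dot(W_hh))
    return h


rng = np.random.RandomState(1)
embed_dim, hidden_dim = 8, 5

# Separate parameters for the premise and hypothesis RNNs.
prem_W_xh = rng.randn(embed_dim, hidden_dim) * 0.1
prem_W_hh = rng.randn(hidden_dim, hidden_dim) * 0.1
hyp_W_xh = rng.randn(embed_dim, hidden_dim) * 0.1
hyp_W_hh = rng.randn(hidden_dim, hidden_dim) * 0.1

premise = rng.randn(6, embed_dim)     # 6 embedded premise tokens
hypothesis = rng.randn(4, embed_dim)  # 4 embedded hypothesis tokens

zeros = np.zeros(hidden_dim)
prem_last = run_rnn(premise, prem_W_xh, prem_W_hh, h0=zeros)

# Chained: the hypothesis RNN is initialized with the premise's final state.
chained_last = run_rnn(hypothesis, hyp_W_xh, hyp_W_hh, h0=prem_last)
# Unchained, for comparison: the hypothesis RNN starts from zeros.
unchained_last = run_rnn(hypothesis, hyp_W_xh, hyp_W_hh, h0=zeros)

print(chained_last.shape)  # (5,)
# How much the premise state changes the final representation:
print(np.abs(chained_last - unchained_last).max())
```

In the chained model only `chained_last` feeds the classifier, so the output layer takes `hidden_dim` inputs rather than `hidden_dim * 2`.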
Many of the best-performing systems in the SNLI leaderboard use attention mechanisms to help the model learn important associations between words in the premise and words in the hypothesis. I believe Rocktäschel et al. (2015) were the first to explore such models for NLI.
For instance, if puppy appears in the premise and dog in the hypothesis, then that might be a high-precision indicator that the correct relationship is entailment.
This diagram is a high-level schematic for adding attention mechanisms to a chained RNN model for NLI:
Since TensorFlow will handle the details of backpropagation, implementing these models is largely reduced to figuring out how to wrangle the states of the model in the desired way.
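As a rough illustration of that state wrangling, here is a NumPy sketch of one simple form of attention: dot-product scores between the final hypothesis state and each premise state. This is a deliberately simplified variant chosen for brevity, not the specific formulation of Rocktäschel et al., and the random states stand in for real RNN outputs.

```python
import numpy as np


def softmax(z):
    """Softmax over a 1d vector, shifted for numerical stability."""
    exp_z = np.exp(z - z.max())
    return exp_z / exp_z.sum()


def attend(prem_states, hyp_last):
    """Dot-product attention of the final hypothesis state over premise states."""
    scores = prem_states.dot(hyp_last)   # one score per premise position
    weights = softmax(scores)            # normalized to a distribution
    context = weights.dot(prem_states)   # weighted average of premise states
    return weights, context


rng = np.random.RandomState(0)
prem_states = rng.randn(6, 5)  # hidden states for 6 premise tokens
hyp_last = rng.randn(5)        # final hidden state of the hypothesis

weights, context = attend(prem_states, hyp_last)
# One option is to classify from the context vector plus the hypothesis state.
combined = np.concatenate((context, hyp_last))

print(weights.round(3), weights.sum())
print(combined.shape)  # (10,)
```

The attention weights are what allow a hypothesis word like dog to pick out a related premise word like puppy, wherever it occurs in the premise.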
A high-level lesson of the SNLI leaderboard is that one can do extremely well with simple neural models whose hyperparameters are selected via extensive cross-validation. This is mathematically interesting but might be dispiriting to those of us without vast resources to devote to these computations! (On the flip side, cleverly designed linear models or ensembles with sparse feature representations might beat all of these entrants with a fraction of the computational budget.)
In an outstanding project for this course in 2016, Leonid Keselman observed that one can do much better than chance on SNLI by processing only the hypothesis. This relates to observations we made in the word-level bake-off about how certain terms will tend to appear more on the right in entailment pairs than on the left.
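A hypothesis-only baseline of this kind is simple to set up as a feature function. The sketch below is an illustrative guess at the general idea, not the project's actual implementation; it takes word lists, whereas a version for nli.experiment would take trees and call `t2.leaves()`.

```python
from collections import Counter


def hypothesis_only_phi(prem_words, hyp_words):
    """Ignore the premise entirely; count only the hypothesis unigrams."""
    return Counter(hyp_words)


feats = hypothesis_only_phi(
    ["a", "man", "is", "playing", "guitar"],
    ["nobody", "is", "playing", "music"])

print(sorted(feats.items()))
# [('is', 1), ('music', 1), ('nobody', 1), ('playing', 1)]
```

Any accuracy above chance from such features reflects regularities in how hypotheses were written rather than reasoning about the premise.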
As we pointed out at the start of this unit, Dagan et al. (2006) pitched NLI as a general-purpose NLU task. We might then hope that the representations we learn on this task will transfer to others. So far, the evidence for this is decidedly mixed. I suspect the core scientific idea is sound, but that we still lack the needed methods for doing transfer learning.
For SNLI, we seem to have entered the inevitable phase in machine learning problems where ensembles do best.
These are largely meant to give you a feel for the material, but some of them could lead to projects and help you with future work for the course. These are not for credit.
When we feed dense representations to a simple classifier, what is the effect of changing the combination functions (e.g., changing sum to mean; changing concatenate to difference)? What happens if we swap out LogisticRegression for, say, an sklearn.ensemble.RandomForestClassifier instance?
Implement the Separate premise and hypothesis RNN and evaluate it, comparing in particular against the version that simply concatenates the premise and hypothesis. Does having all these additional parameters pay off? Do you need more training examples to start to see the value of this idea?
The illustrations above all use SNLI. It is worth experimenting with MultiNLI as well. It has both matched and mismatched dev sets that are worth exploring. It's also interesting to think about combining SNLI and MultiNLI, to get additional training instances, to push the models to generalize more, and to assess transfer learning hypotheses.