__author__ = "Christopher Potts"
__version__ = "CS224u, Stanford, Spring 2020"
Natural Language Inference (NLI) is the task of predicting the logical relationships between words, phrases, sentences, (paragraphs, documents, ...). Such relationships are crucial for all kinds of reasoning in natural language: arguing, debating, problem solving, summarization, and so forth.
Dagan et al. (2006), one of the foundational papers on NLI (also called Recognizing Textual Entailment; RTE), make a case for the generality of this task in NLU:
It seems that major inferences, as needed by multiple applications, can indeed be cast in terms of textual entailment. For example, a QA system has to identify texts that entail a hypothesized answer. [...] Similarly, for certain Information Retrieval queries the combination of semantic concepts and relations denoted by the query should be entailed from relevant retrieved documents. [...] In multi-document summarization a redundant sentence, to be omitted from the summary, should be entailed from other sentences in the summary. And in MT evaluation a correct translation should be semantically equivalent to the gold standard translation, and thus both translations should entail each other. Consequently, we hypothesize that textual entailment recognition is a suitable generic task for evaluating and comparing applied semantic inference models. Eventually, such efforts can promote the development of entailment recognition "engines" which may provide useful generic modules across applications.
Our NLI data will look like this:
Premise | Relation | Hypothesis |
---|---|---|
turtle | contradiction | linguist |
A turtled danced | entails | A turtle moved |
Every reptile danced | entails | Every turtle moved |
Some turtles walk | contradicts | No turtles move |
James Byron Dean refused to move without blue jeans | entails | James Dean didn't dance without pants |
In the word-entailment bakeoff, we looked at a special case of this where the premise and hypothesis are single words. This notebook begins to introduce the problem of NLI more fully.
We're going to focus on three NLI corpora:
The first was collected by a group at Stanford, led by Sam Bowman, and the second was collected by a group at NYU, also led by Sam Bowman. Both have the same format and were crowdsourced using the same basic methods. However, SNLI is entirely focused on image captions, whereas MultiNLI includes a greater range of contexts.
The third corpus was collected by a group at Facebook AI and UNC Chapel Hill. The team's goal was to address the fact that datasets like SNLI and MultiNLI seem to be artificially easy – models trained on them can often surpass stated human performance levels but still fail on examples that are simple and intuitive for people. The dataset is "Adversarial" because the annotators were asked to try to construct examples that fooled strong models but still passed muster with other human readers.
This notebook presents tools for working with these corpora. The second notebook in the unit concerns models of NLI.
As usual, you need to be fully set up to work with the CS224u repository.
If you haven't already, download the course data, unpack it, and place it in the directory containing the course repository – the same directory as this notebook. (If you want to put it somewhere else, change DATA_HOME
below.)
import nli
import os
import pandas as pd
import random
DATA_HOME = os.path.join("data", "nlidata")
SNLI_HOME = os.path.join(DATA_HOME, "snli_1.0")
MULTINLI_HOME = os.path.join(DATA_HOME, "multinli_1.0")
ANNOTATIONS_HOME = os.path.join(DATA_HOME, "multinli_1.0_annotations")
ANLI_HOME = os.path.join(DATA_HOME, "anli_v0.1")
For SNLI (and MultiNLI), MTurk annotators were presented with premise sentences and asked to produce new sentences that entailed, contradicted, or were neutral with respect to the premise. A subset of the examples were then validated by an additional four MTurk annotators.
All the premises are captions from the Flickr30K corpus.
Some of the sentences rather depressingly reflect stereotypes (Rudinger et al. 2017).
550,152 train examples; 10K dev; 10K test
Mean length in tokens:
Clause-types
Vocab size: 37,026
56,951 examples validated by four additional annotators
Top scores currently around 90%.
The following readers should make it easy to work with SNLI:
nli.SNLITrainReader
nli.SNLIDevReader
Writing a Test
reader is easy and so left to the user who decides that a test-set evaluation is appropriate. We omit that code as a subtle way of discouraging use of the test set during project development.
The base class, nli.NLIReader
, is used by all the readers discussed here.
Because the datasets are so large, it is often useful to be able to randomly sample from them. All of the reader classes discussed here support this with their keyword argument samp_percentage
. For example, the following samples approximately 10% of the examples from the SNLI training set:
nli.SNLITrainReader(SNLI_HOME, samp_percentage=0.10, random_state=42)
"NLIReader({'src_filename': 'data/nlidata/snli_1.0/snli_1.0_train.jsonl', 'filter_unlabeled': True, 'samp_percentage': 0.1, 'random_state': 42, 'gold_label_attr_name': 'gold_label'})
The precise number of examples will vary somewhat because of the way the sampling is done. (Here, we trade efficiency for precision in the number of cases we return; see the implementation for details.)
All of the readers have a read
method that yields NLIExample
example instances. For SNLI, these have the following attributes:
list of str
str
str
str
str
nltk.tree.Tree
nltk.tree.Tree
str
nltk.tree.Tree
nltk.tree.Tree
The following creates the label distribution for the training data:
snli_labels = pd.Series(
[ex.gold_label for ex in nli.SNLITrainReader(
SNLI_HOME, filter_unlabeled=False).read()])
snli_labels.value_counts()
entailment 183416 contradiction 183187 neutral 182764 - 785 dtype: int64
Use filter_unlabeled=True
(the default) to silently drop the examples for which gold_label
is -
.
Let's look at a specific example in some detail:
snli_iterator = iter(nli.SNLITrainReader(SNLI_HOME).read())
snli_ex = next(snli_iterator)
print(snli_ex)
"NLIExample({'annotator_labels': ['neutral'], 'captionID': '3416050480.jpg#4', 'gold_label': 'neutral', 'pairID': '3416050480.jpg#4r1n', 'sentence1': 'A person on a horse jumps over a broken down airplane.', 'sentence1_binary_parse': Tree('X', [Tree('X', [Tree('X', ['A', 'person']), Tree('X', ['on', Tree('X', ['a', 'horse'])])]), Tree('X', [Tree('X', ['jumps', Tree('X', ['over', Tree('X', ['a', Tree('X', ['broken', Tree('X', ['down', 'airplane'])])])])]), '.'])]), 'sentence1_parse': Tree('ROOT', [Tree('S', [Tree('NP', [Tree('NP', [Tree('DT', ['A']), Tree('NN', ['person'])]), Tree('PP', [Tree('IN', ['on']), Tree('NP', [Tree('DT', ['a']), Tree('NN', ['horse'])])])]), Tree('VP', [Tree('VBZ', ['jumps']), Tree('PP', [Tree('IN', ['over']), Tree('NP', [Tree('DT', ['a']), Tree('JJ', ['broken']), Tree('JJ', ['down']), Tree('NN', ['airplane'])])])]), Tree('.', ['.'])])]), 'sentence2': 'A person is training his horse for a competition.', 'sentence2_binary_parse': Tree('X', [Tree('X', ['A', 'person']), Tree('X', [Tree('X', ['is', Tree('X', [Tree('X', ['training', Tree('X', ['his', 'horse'])]), Tree('X', ['for', Tree('X', ['a', 'competition'])])])]), '.'])]), 'sentence2_parse': Tree('ROOT', [Tree('S', [Tree('NP', [Tree('DT', ['A']), Tree('NN', ['person'])]), Tree('VP', [Tree('VBZ', ['is']), Tree('VP', [Tree('VBG', ['training']), Tree('NP', [Tree('PRP$', ['his']), Tree('NN', ['horse'])]), Tree('PP', [Tree('IN', ['for']), Tree('NP', [Tree('DT', ['a']), Tree('NN', ['competition'])])])])]), Tree('.', ['.'])])])})
snli_ex
"NLIExample({'annotator_labels': ['neutral'], 'captionID': '3416050480.jpg#4', 'gold_label': 'neutral', 'pairID': '3416050480.jpg#4r1n', 'sentence1': 'A person on a horse jumps over a broken down airplane.', 'sentence1_binary_parse': Tree('X', [Tree('X', [Tree('X', ['A', 'person']), Tree('X', ['on', Tree('X', ['a', 'horse'])])]), Tree('X', [Tree('X', ['jumps', Tree('X', ['over', Tree('X', ['a', Tree('X', ['broken', Tree('X', ['down', 'airplane'])])])])]), '.'])]), 'sentence1_parse': Tree('ROOT', [Tree('S', [Tree('NP', [Tree('NP', [Tree('DT', ['A']), Tree('NN', ['person'])]), Tree('PP', [Tree('IN', ['on']), Tree('NP', [Tree('DT', ['a']), Tree('NN', ['horse'])])])]), Tree('VP', [Tree('VBZ', ['jumps']), Tree('PP', [Tree('IN', ['over']), Tree('NP', [Tree('DT', ['a']), Tree('JJ', ['broken']), Tree('JJ', ['down']), Tree('NN', ['airplane'])])])]), Tree('.', ['.'])])]), 'sentence2': 'A person is training his horse for a competition.', 'sentence2_binary_parse': Tree('X', [Tree('X', ['A', 'person']), Tree('X', [Tree('X', ['is', Tree('X', [Tree('X', ['training', Tree('X', ['his', 'horse'])]), Tree('X', ['for', Tree('X', ['a', 'competition'])])])]), '.'])]), 'sentence2_parse': Tree('ROOT', [Tree('S', [Tree('NP', [Tree('DT', ['A']), Tree('NN', ['person'])]), Tree('VP', [Tree('VBZ', ['is']), Tree('VP', [Tree('VBG', ['training']), Tree('NP', [Tree('PRP$', ['his']), Tree('NN', ['horse'])]), Tree('PP', [Tree('IN', ['for']), Tree('NP', [Tree('DT', ['a']), Tree('NN', ['competition'])])])])]), Tree('.', ['.'])])])})
As you can see from the above attribute list, there are three versions of the premise and hypothesis sentences:
snli_ex.sentence1
'A person on a horse jumps over a broken down airplane.'
The binary parses lack node labels; so that we can use nltk.tree.Tree
with them, the label X
is added to all of them:
snli_ex.sentence1_binary_parse
Here's the full parse tree with syntactic categories:
snli_ex.sentence1_parse
The leaves of either tree are tokenized versions of them:
snli_ex.sentence1_parse.leaves()
['A', 'person', 'on', 'a', 'horse', 'jumps', 'over', 'a', 'broken', 'down', 'airplane', '.']
Train premises drawn from five genres:
Additional genres just for dev and test (the mismatched condition):
392,702 train examples; 20K dev; 20K test
19,647 examples validated by four additional annotators
Test-set labels available as a Kaggle competition.
For MultiNLI, we have the following readers:
nli.MultiNLITrainReader
nli.MultiNLIMatchedDevReader
nli.MultiNLIMismatchedDevReader
The MultiNLI test sets are available on Kaggle (matched version and mismatched version).
The interface to these is the same as for the SNLI readers:
nli.MultiNLITrainReader(MULTINLI_HOME, samp_percentage=0.10, random_state=42)
"NLIReader({'src_filename': 'data/nlidata/multinli_1.0/multinli_1.0_train.jsonl', 'filter_unlabeled': True, 'samp_percentage': 0.1, 'random_state': 42, 'gold_label_attr_name': 'gold_label'})
The NLIExample
instances for MultiNLI have the same attributes as those for SNLI. Here is the list repeated from above for convenience:
list of str
str
str
str
str
nltk.tree.Tree
nltk.tree.Tree
str
nltk.tree.Tree
nltk.tree.Tree
The full label distribution:
multinli_labels = pd.Series(
[ex.gold_label for ex in nli.MultiNLITrainReader(
MULTINLI_HOME, filter_unlabeled=False).read()])
multinli_labels.value_counts()
contradiction 130903 neutral 130900 entailment 130899 dtype: int64
No examples in the MultiNLI train set lack a gold label, so the value of the filter_unlabeled
parameter has no effect here, but it does have an effect in the Dev
versions.
MultiNLI includes additional annotations for a subset of the dev examples. The goal is to help people understand how well their models are doing on crucial NLI-related linguistic phenomena.
matched_ann_filename = os.path.join(
ANNOTATIONS_HOME,
"multinli_1.0_matched_annotations.txt")
mismatched_ann_filename = os.path.join(
ANNOTATIONS_HOME,
"multinli_1.0_mismatched_annotations.txt")
def view_random_example(annotations, random_state=42):
random.seed(random_state)
ann_ex = random.choice(list(annotations.items()))
pairid, ann_ex = ann_ex
ex = ann_ex['example']
print("pairID: {}".format(pairid))
print(ann_ex['annotations'])
print(ex.sentence1)
print(ex.gold_label)
print(ex.sentence2)
matched_ann = nli.read_annotated_subset(matched_ann_filename, MULTINLI_HOME)
view_random_example(matched_ann)
pairID: 63218c [] Recently, however, I have settled down and become decidedly less experimental. contradiction I am still as experimental as ever, and I am always on the move.
The ANLI dataset was created in response to evidence that datasets like SNLI and MultiNLI are artificially easy for modern machine learning models to solve. The team sought to tackle this weakness head-on, by designing a crowdsourcing task in which annotators were explicitly trying to confuse state-of-the-art models. In broad outline, the task worked like this:
The crowdworker is presented with a premise (context) text and asked to construct a hypothesis sentence that entails, contradicts, or is neutral with respect to that premise. (The precise wording is more informally, along the lines of the SNLI/MultiNLI task).
The crowdworker submits a hypothesis text.
The premise/hypothesis pair is fed to a trained model that makes a prediction about the correct NLI label.
If the model's prediction is correct, then the crowdworker loops back to step 2 to try again. If the model's prediction is incorrect, then the example is validated by different crowdworkers.
The dataset consists of three rounds, each involving a different model and a different set of sources for the premise texts:
Round | Model | Training data | Context sources |
---|---|---|---|
1 | BERT-large | SNLI + MultiNLI | Wikipedia |
2 | ROBERTa | SNLI + MultiNLI + NLI-FEVER + Round 1 | Wikipedia |
3 | ROBERTa | SNLI + MultiNLI + NLI-FEVER + Round 1 | Various |
Each round has train/dev/test splits. The sizes of these splits and their label distributions are calculated just below.
The project README seeks to establish some rules for how the rounds can be used for training and evaluation.
For ANLI, we have the following readers:
nli.ANLITrainReader
nli.ANLIDevReader
As with SNLI, we leave the writing of a Test
version to the user, as a way of discouraging inadvertent use of the test set during project development.
Because ANLI is distributed in three rounds, and the rounds can be used independently or pooled, the interface has a rounds
argument. The default is rounds=(1,2,3)
, but any subset of them can be specified. Here are some illustrations using the Train
reader; the Dev
interface is the same:
for rounds in ((1,), (2,), (3,), (1,2,3)):
count = len(list(nli.ANLITrainReader(ANLI_HOME, rounds=rounds).read()))
print("R{0:}: {1:,}".format(rounds, count))
R(1,): 16,946 R(2,): 45,460 R(3,): 100,459 R(1, 2, 3): 162,865
The above figures correspond to those in Table 2 of the paper. I am not sure what accounts for the differences of 100 examples in round 2 (and, in turn, in the grand total).
ANLI uses a different set of attributes from SNLI/MultiNLI. Here is a summary of what NLIExample
instances offer for this corpus:
pairID
in SNLI/MultiNLIsentence1
in SNLI/MultiNLIsentence2
in SNLI/MultiNLIgold_label
in SNLI/MultiNLITrue
if the annotator contributed only dev (test) exmples, else False
; in turn, it is False
for all train examples.All these attribute are str
-valued except for emturk
, which is bool
-valued.
The labels in this datset are conceptually the same as for SNLI/MultiNLI
, but they are encoded differently:
anli_labels = pd.Series([ex.label for ex in nli.ANLITrainReader(ANLI_HOME).read()])
anli_labels.value_counts()
n 68789 e 52111 c 41965 dtype: int64
For the dev set, the label
and model_label
values are always different, suggesting that these evaluations will be very challenging for present-day models:
pd.Series(
[ex.label == ex.model_label for ex in nli.ANLIDevReader(ANLI_HOME).read()]
).value_counts()
False 3200 dtype: int64
In the train set, they do sometimes correspond, and you can track the changes in the rate of correct model predictions across the rounds:
for r in (1,2,3):
dist = pd.Series(
[ex.label == ex.model_label for ex in nli.ANLITrainReader(ANLI_HOME, rounds=(r,)).read()]
).value_counts()
dist = dist / dist.sum()
dist.name = "Round {}".format(r)
print(dist, end="\n\n")
True 0.821197 False 0.178803 Name: Round 1, dtype: float64 True 0.932028 False 0.067972 Name: Round 2, dtype: float64 True 0.915916 False 0.084084 Name: Round 3, dtype: float64
This corresponds to Table 2, "Model error rate (Verified)", in the paper. (I am not sure what accounts for the slight differences in the percentages.)
The FraCaS textual inference test suite is a smaller, hand-built dataset that is great for evaluating a model's ability to handle complex logical patterns.
SemEval 2013 had a wide range of interesting data sets for NLI and related tasks.
The SemEval 2014 semantic relatedness shared task used an NLI dataset called Sentences Involving Compositional Knowledge (SICK).
MedNLI is specialized to the medical domain, using data derived from MIMIC III.
XNLI is a multilingual NLI dataset derived from MultiNLI.
Diverse Natural Language Inference Collection (DNC) transforms existing annotations from other tasks into NLI problems for a diverse range of reasoning challenges.
SciTail is an NLI dataset derived from multiple-choice science exam questions and Web text.
NLI Style FEVER is a version of the FEVER dataset put into a standard NLI format. It was used by the Adversarial NLI team to train models for their annotation round 2.
Models for NLI might be adapted for use with the 30M Factoid Question-Answer Corpus.
Models for NLI might be adapted for use with the Penn Paraphrase Database.