Notebook

Natural language inference: task and datasets¶

In [1]:

__author__ = "Christopher Potts"
__version__ = "CS224u, Stanford, Spring 2020"

Contents¶

Overview
Our version of the task
Primary resources
Set-up
SNLI
SNLI properties
Working with SNLI
MultiNLI
MultiNLI properties
Working with MultiNLI
Annotated MultiNLI subsets
Adversarial NLI
Adversarial NLI properties
Working with Adversarial NLI
Other NLI datasets

Overview¶

Natural Language Inference (NLI) is the task of predicting the logical relationships between words, phrases, sentences, (paragraphs, documents, ...). Such relationships are crucial for all kinds of reasoning in natural language: arguing, debating, problem solving, summarization, and so forth.

Dagan et al. (2006), one of the foundational papers on NLI (also called Recognizing Textual Entailment; RTE), make a case for the generality of this task in NLU:

It seems that major inferences, as needed by multiple applications, can indeed be cast in terms of textual entailment. For example, a QA system has to identify texts that entail a hypothesized answer. [...] Similarly, for certain Information Retrieval queries the combination of semantic concepts and relations denoted by the query should be entailed from relevant retrieved documents. [...] In multi-document summarization a redundant sentence, to be omitted from the summary, should be entailed from other sentences in the summary. And in MT evaluation a correct translation should be semantically equivalent to the gold standard translation, and thus both translations should entail each other. Consequently, we hypothesize that textual entailment recognition is a suitable generic task for evaluating and comparing applied semantic inference models. Eventually, such efforts can promote the development of entailment recognition "engines" which may provide useful generic modules across applications.

Our version of the task¶

Our NLI data will look like this:

Premise	Relation	Hypothesis
turtle	contradiction	linguist
A turtled danced	entails	A turtle moved
Every reptile danced	entails	Every turtle moved
Some turtles walk	contradicts	No turtles move
James Byron Dean refused to move without blue jeans	entails	James Dean didn't dance without pants

In the word-entailment bakeoff, we looked at a special case of this where the premise and hypothesis are single words. This notebook begins to introduce the problem of NLI more fully.

Primary resources¶

We're going to focus on three NLI corpora:

The first was collected by a group at Stanford, led by Sam Bowman, and the second was collected by a group at NYU, also led by Sam Bowman. Both have the same format and were crowdsourced using the same basic methods. However, SNLI is entirely focused on image captions, whereas MultiNLI includes a greater range of contexts.

The third corpus was collected by a group at Facebook AI and UNC Chapel Hill. The team's goal was to address the fact that datasets like SNLI and MultiNLI seem to be artificially easy – models trained on them can often surpass stated human performance levels but still fail on examples that are simple and intuitive for people. The dataset is "Adversarial" because the annotators were asked to try to construct examples that fooled strong models but still passed muster with other human readers.

This notebook presents tools for working with these corpora. The second notebook in the unit concerns models of NLI.

Set-up¶

As usual, you need to be fully set up to work with the CS224u repository.
If you haven't already, download the course data, unpack it, and place it in the directory containing the course repository – the same directory as this notebook. (If you want to put it somewhere else, change DATA_HOME below.)

In [2]:

import nli
import os
import pandas as pd
import random

In [3]:

DATA_HOME = os.path.join("data", "nlidata")

SNLI_HOME = os.path.join(DATA_HOME, "snli_1.0")

MULTINLI_HOME = os.path.join(DATA_HOME, "multinli_1.0")

ANNOTATIONS_HOME = os.path.join(DATA_HOME, "multinli_1.0_annotations")

ANLI_HOME = os.path.join(DATA_HOME, "anli_v0.1")

SNLI¶

SNLI properties¶

For SNLI (and MultiNLI), MTurk annotators were presented with premise sentences and asked to produce new sentences that entailed, contradicted, or were neutral with respect to the premise. A subset of the examples were then validated by an additional four MTurk annotators.

All the premises are captions from the Flickr30K corpus.
Some of the sentences rather depressingly reflect stereotypes (Rudinger et al. 2017).
550,152 train examples; 10K dev; 10K test
Mean length in tokens:
- Premise: 14.1
- Hypothesis: 8.3
Clause-types
- Premise S-rooted: 74%
- Hypothesis S-rooted: 88.9%
Vocab size: 37,026
56,951 examples validated by four additional annotators
- 58.3% examples with unanimous gold label
- 91.2% of gold labels match the author's label
- 0.70 overall Fleiss kappa
Top scores currently around 90%.

Working with SNLI¶

The following readers should make it easy to work with SNLI:

nli.SNLITrainReader
nli.SNLIDevReader

Writing a Test reader is easy and so left to the user who decides that a test-set evaluation is appropriate. We omit that code as a subtle way of discouraging use of the test set during project development.

The base class, nli.NLIReader, is used by all the readers discussed here.

Because the datasets are so large, it is often useful to be able to randomly sample from them. All of the reader classes discussed here support this with their keyword argument samp_percentage. For example, the following samples approximately 10% of the examples from the SNLI training set:

In [4]:

nli.SNLITrainReader(SNLI_HOME, samp_percentage=0.10, random_state=42)

Out[4]:

"NLIReader({'src_filename': 'data/nlidata/snli_1.0/snli_1.0_train.jsonl', 'filter_unlabeled': True, 'samp_percentage': 0.1, 'random_state': 42, 'gold_label_attr_name': 'gold_label'})

The precise number of examples will vary somewhat because of the way the sampling is done. (Here, we trade efficiency for precision in the number of cases we return; see the implementation for details.)

All of the readers have a read method that yields NLIExample example instances. For SNLI, these have the following attributes:

annotator_labels: list of str
captionID: str
gold_label: str
pairID: str
sentence1: str
sentence1_binary_parse: nltk.tree.Tree
sentence1_parse: nltk.tree.Tree
sentence2: str
sentence2_binary_parse: nltk.tree.Tree
sentence2_parse: nltk.tree.Tree

The following creates the label distribution for the training data:

In [5]:

snli_labels = pd.Series(
    [ex.gold_label for ex in nli.SNLITrainReader(
        SNLI_HOME, filter_unlabeled=False).read()])

snli_labels.value_counts()

Out[5]:

entailment       183416
contradiction    183187
neutral          182764
-                   785
dtype: int64

Use filter_unlabeled=True (the default) to silently drop the examples for which gold_label is -.

Let's look at a specific example in some detail:

In [6]:

snli_iterator = iter(nli.SNLITrainReader(SNLI_HOME).read())

In [7]:

snli_ex = next(snli_iterator)

In [8]:

print(snli_ex)

"NLIExample({'annotator_labels': ['neutral'], 'captionID': '3416050480.jpg#4', 'gold_label': 'neutral', 'pairID': '3416050480.jpg#4r1n', 'sentence1': 'A person on a horse jumps over a broken down airplane.', 'sentence1_binary_parse': Tree('X', [Tree('X', [Tree('X', ['A', 'person']), Tree('X', ['on', Tree('X', ['a', 'horse'])])]), Tree('X', [Tree('X', ['jumps', Tree('X', ['over', Tree('X', ['a', Tree('X', ['broken', Tree('X', ['down', 'airplane'])])])])]), '.'])]), 'sentence1_parse': Tree('ROOT', [Tree('S', [Tree('NP', [Tree('NP', [Tree('DT', ['A']), Tree('NN', ['person'])]), Tree('PP', [Tree('IN', ['on']), Tree('NP', [Tree('DT', ['a']), Tree('NN', ['horse'])])])]), Tree('VP', [Tree('VBZ', ['jumps']), Tree('PP', [Tree('IN', ['over']), Tree('NP', [Tree('DT', ['a']), Tree('JJ', ['broken']), Tree('JJ', ['down']), Tree('NN', ['airplane'])])])]), Tree('.', ['.'])])]), 'sentence2': 'A person is training his horse for a competition.', 'sentence2_binary_parse': Tree('X', [Tree('X', ['A', 'person']), Tree('X', [Tree('X', ['is', Tree('X', [Tree('X', ['training', Tree('X', ['his', 'horse'])]), Tree('X', ['for', Tree('X', ['a', 'competition'])])])]), '.'])]), 'sentence2_parse': Tree('ROOT', [Tree('S', [Tree('NP', [Tree('DT', ['A']), Tree('NN', ['person'])]), Tree('VP', [Tree('VBZ', ['is']), Tree('VP', [Tree('VBG', ['training']), Tree('NP', [Tree('PRP$', ['his']), Tree('NN', ['horse'])]), Tree('PP', [Tree('IN', ['for']), Tree('NP', [Tree('DT', ['a']), Tree('NN', ['competition'])])])])]), Tree('.', ['.'])])])})

In [9]:

snli_ex

Out[9]:

"NLIExample({'annotator_labels': ['neutral'], 'captionID': '3416050480.jpg#4', 'gold_label': 'neutral', 'pairID': '3416050480.jpg#4r1n', 'sentence1': 'A person on a horse jumps over a broken down airplane.', 'sentence1_binary_parse': Tree('X', [Tree('X', [Tree('X', ['A', 'person']), Tree('X', ['on', Tree('X', ['a', 'horse'])])]), Tree('X', [Tree('X', ['jumps', Tree('X', ['over', Tree('X', ['a', Tree('X', ['broken', Tree('X', ['down', 'airplane'])])])])]), '.'])]), 'sentence1_parse': Tree('ROOT', [Tree('S', [Tree('NP', [Tree('NP', [Tree('DT', ['A']), Tree('NN', ['person'])]), Tree('PP', [Tree('IN', ['on']), Tree('NP', [Tree('DT', ['a']), Tree('NN', ['horse'])])])]), Tree('VP', [Tree('VBZ', ['jumps']), Tree('PP', [Tree('IN', ['over']), Tree('NP', [Tree('DT', ['a']), Tree('JJ', ['broken']), Tree('JJ', ['down']), Tree('NN', ['airplane'])])])]), Tree('.', ['.'])])]), 'sentence2': 'A person is training his horse for a competition.', 'sentence2_binary_parse': Tree('X', [Tree('X', ['A', 'person']), Tree('X', [Tree('X', ['is', Tree('X', [Tree('X', ['training', Tree('X', ['his', 'horse'])]), Tree('X', ['for', Tree('X', ['a', 'competition'])])])]), '.'])]), 'sentence2_parse': Tree('ROOT', [Tree('S', [Tree('NP', [Tree('DT', ['A']), Tree('NN', ['person'])]), Tree('VP', [Tree('VBZ', ['is']), Tree('VP', [Tree('VBG', ['training']), Tree('NP', [Tree('PRP$', ['his']), Tree('NN', ['horse'])]), Tree('PP', [Tree('IN', ['for']), Tree('NP', [Tree('DT', ['a']), Tree('NN', ['competition'])])])])]), Tree('.', ['.'])])])})

As you can see from the above attribute list, there are three versions of the premise and hypothesis sentences:

Regular string representations of the data
Unlabeled binary parses
Labeled parses

In [10]:

snli_ex.sentence1

Out[10]:

'A person on a horse jumps over a broken down airplane.'

The binary parses lack node labels; so that we can use nltk.tree.Tree with them, the label X is added to all of them:

In [11]:

snli_ex.sentence1_binary_parse

Out[11]:

Here's the full parse tree with syntactic categories:

In [12]:

snli_ex.sentence1_parse

Out[12]:

The leaves of either tree are tokenized versions of them:

In [13]:

snli_ex.sentence1_parse.leaves()

Out[13]:

['A',
 'person',
 'on',
 'a',
 'horse',
 'jumps',
 'over',
 'a',
 'broken',
 'down',
 'airplane',
 '.']

MultiNLI¶

MultiNLI properties¶

Train premises drawn from five genres:
1. Fiction: works from 1912–2010 spanning many genres
2. Government: reports, letters, speeches, etc., from government websites
3. The Slate website
4. Telephone: the Switchboard corpus
5. Travel: Berlitz travel guides
Additional genres just for dev and test (the mismatched condition):
1. The 9/11 report
2. Face-to-face: The Charlotte Narrative and Conversation Collection
3. Fundraising letters
4. Non-fiction from Oxford University Press
5. Verbatim articles about linguistics
392,702 train examples; 20K dev; 20K test
19,647 examples validated by four additional annotators
- 58.2% examples with unanimous gold label
- 92.6% of gold labels match the author's label
Test-set labels available as a Kaggle competition.
- Top matched scores currently around 0.81.
- Top mismatched scores currently around 0.83.

Working with MultiNLI¶

For MultiNLI, we have the following readers:

nli.MultiNLITrainReader
nli.MultiNLIMatchedDevReader
nli.MultiNLIMismatchedDevReader

The MultiNLI test sets are available on Kaggle (matched version and mismatched version).

The interface to these is the same as for the SNLI readers:

In [14]:

nli.MultiNLITrainReader(MULTINLI_HOME, samp_percentage=0.10, random_state=42)

Out[14]:

"NLIReader({'src_filename': 'data/nlidata/multinli_1.0/multinli_1.0_train.jsonl', 'filter_unlabeled': True, 'samp_percentage': 0.1, 'random_state': 42, 'gold_label_attr_name': 'gold_label'})

The NLIExample instances for MultiNLI have the same attributes as those for SNLI. Here is the list repeated from above for convenience:

annotator_labels: list of str
captionID: str
gold_label: str
pairID: str
sentence1: str
sentence1_binary_parse: nltk.tree.Tree
sentence1_parse: nltk.tree.Tree
sentence2: str
sentence2_binary_parse: nltk.tree.Tree
sentence2_parse: nltk.tree.Tree

The full label distribution:

In [15]:

multinli_labels = pd.Series(
    [ex.gold_label for ex in nli.MultiNLITrainReader(
        MULTINLI_HOME, filter_unlabeled=False).read()])

multinli_labels.value_counts()

Out[15]:

contradiction    130903
neutral          130900
entailment       130899
dtype: int64

No examples in the MultiNLI train set lack a gold label, so the value of the filter_unlabeled parameter has no effect here, but it does have an effect in the Dev versions.

Annotated MultiNLI subsets¶

MultiNLI includes additional annotations for a subset of the dev examples. The goal is to help people understand how well their models are doing on crucial NLI-related linguistic phenomena.

In [16]:

matched_ann_filename = os.path.join(
    ANNOTATIONS_HOME,
    "multinli_1.0_matched_annotations.txt")

mismatched_ann_filename = os.path.join(
    ANNOTATIONS_HOME, 
    "multinli_1.0_mismatched_annotations.txt")

In [17]:

def view_random_example(annotations, random_state=42):
    random.seed(random_state)
    ann_ex = random.choice(list(annotations.items()))
    pairid, ann_ex = ann_ex
    ex = ann_ex['example']   
    print("pairID: {}".format(pairid))
    print(ann_ex['annotations'])
    print(ex.sentence1)
    print(ex.gold_label)
    print(ex.sentence2)

In [18]:

matched_ann = nli.read_annotated_subset(matched_ann_filename, MULTINLI_HOME)

In [19]:

view_random_example(matched_ann)

pairID: 63218c
[]
Recently, however, I have settled down and become decidedly less experimental.
contradiction
I am still as experimental as ever, and I am always on the move.

Adversarial NLI¶

Adversarial NLI properties¶

The ANLI dataset was created in response to evidence that datasets like SNLI and MultiNLI are artificially easy for modern machine learning models to solve. The team sought to tackle this weakness head-on, by designing a crowdsourcing task in which annotators were explicitly trying to confuse state-of-the-art models. In broad outline, the task worked like this:

The crowdworker is presented with a premise (context) text and asked to construct a hypothesis sentence that entails, contradicts, or is neutral with respect to that premise. (The precise wording is more informally, along the lines of the SNLI/MultiNLI task).
The crowdworker submits a hypothesis text.
The premise/hypothesis pair is fed to a trained model that makes a prediction about the correct NLI label.
If the model's prediction is correct, then the crowdworker loops back to step 2 to try again. If the model's prediction is incorrect, then the example is validated by different crowdworkers.

The dataset consists of three rounds, each involving a different model and a different set of sources for the premise texts:

Round	Model	Training data	Context sources
1	BERT-large	SNLI + MultiNLI	Wikipedia
2	ROBERTa	SNLI + MultiNLI + NLI-FEVER + Round 1	Wikipedia
3	ROBERTa	SNLI + MultiNLI + NLI-FEVER + Round 1	Various

Each round has train/dev/test splits. The sizes of these splits and their label distributions are calculated just below.

The project README seeks to establish some rules for how the rounds can be used for training and evaluation.

Working with Adversarial NLI¶

For ANLI, we have the following readers:

nli.ANLITrainReader
nli.ANLIDevReader

As with SNLI, we leave the writing of a Test version to the user, as a way of discouraging inadvertent use of the test set during project development.

Because ANLI is distributed in three rounds, and the rounds can be used independently or pooled, the interface has a rounds argument. The default is rounds=(1,2,3), but any subset of them can be specified. Here are some illustrations using the Train reader; the Dev interface is the same:

In [20]:

for rounds in ((1,), (2,), (3,), (1,2,3)):
    count = len(list(nli.ANLITrainReader(ANLI_HOME, rounds=rounds).read()))
    print("R{0:}: {1:,}".format(rounds, count))

R(1,): 16,946
R(2,): 45,460
R(3,): 100,459
R(1, 2, 3): 162,865

The above figures correspond to those in Table 2 of the paper. I am not sure what accounts for the differences of 100 examples in round 2 (and, in turn, in the grand total).

ANLI uses a different set of attributes from SNLI/MultiNLI. Here is a summary of what NLIExample instances offer for this corpus:

uid: a unique identifier; akin to pairID in SNLI/MultiNLI
context: the premise; corresponds to sentence1 in SNLI/MultiNLI
hypothesis: the hypothesis; corresponds to sentence2 in SNLI/MultiNLI
label: the gold label; corresponds to gold_label in SNLI/MultiNLI
model_label: the label predicted by the model used in the current round
reason: a crowdworker's free-text hypothesis about why the model made an incorrect prediction for the current context/hypothesis pair
emturk: for dev (and test), this is True if the annotator contributed only dev (test) exmples, else False; in turn, it is False for all train examples.
genre: the source for the context text
tag: information about the round and train/dev/test classification

All these attribute are str-valued except for emturk, which is bool-valued.

The labels in this datset are conceptually the same as for SNLI/MultiNLI, but they are encoded differently:

In [21]:

anli_labels = pd.Series([ex.label for ex in nli.ANLITrainReader(ANLI_HOME).read()])

anli_labels.value_counts()

Out[21]:

n    68789
e    52111
c    41965
dtype: int64

For the dev set, the label and model_label values are always different, suggesting that these evaluations will be very challenging for present-day models:

In [22]:

pd.Series(
    [ex.label == ex.model_label for ex in nli.ANLIDevReader(ANLI_HOME).read()]
).value_counts()

Out[22]:

False    3200
dtype: int64

In the train set, they do sometimes correspond, and you can track the changes in the rate of correct model predictions across the rounds:

In [23]:

for r in (1,2,3):
    dist = pd.Series(
        [ex.label == ex.model_label for ex in nli.ANLITrainReader(ANLI_HOME, rounds=(r,)).read()]
    ).value_counts()
    dist = dist / dist.sum()
    dist.name = "Round {}".format(r)
    print(dist, end="\n\n")

True     0.821197
False    0.178803
Name: Round 1, dtype: float64

True     0.932028
False    0.067972
Name: Round 2, dtype: float64

True     0.915916
False    0.084084
Name: Round 3, dtype: float64

This corresponds to Table 2, "Model error rate (Verified)", in the paper. (I am not sure what accounts for the slight differences in the percentages.)

Other NLI datasets¶

The FraCaS textual inference test suite is a smaller, hand-built dataset that is great for evaluating a model's ability to handle complex logical patterns.
SemEval 2013 had a wide range of interesting data sets for NLI and related tasks.
The SemEval 2014 semantic relatedness shared task used an NLI dataset called Sentences Involving Compositional Knowledge (SICK).
MedNLI is specialized to the medical domain, using data derived from MIMIC III.
XNLI is a multilingual NLI dataset derived from MultiNLI.
Diverse Natural Language Inference Collection (DNC) transforms existing annotations from other tasks into NLI problems for a diverse range of reasoning challenges.
SciTail is an NLI dataset derived from multiple-choice science exam questions and Web text.
NLI Style FEVER is a version of the FEVER dataset put into a standard NLI format. It was used by the Adversarial NLI team to train models for their annotation round 2.
Models for NLI might be adapted for use with the 30M Factoid Question-Answer Corpus.
Models for NLI might be adapted for use with the Penn Paraphrase Database.