Natural language inference: Task and datasets

In [1]:
__author__ = "Christopher Potts"
__version__ = "CS224u, Stanford, Spring 2020"


Natural Language Inference (NLI) is the task of predicting the logical relationships between words, phrases, sentences, (paragraphs, documents, ...). Such relationships are crucial for all kinds of reasoning in natural language: arguing, debating, problem solving, summarization, and so forth.

Dagan et al. (2006), one of the foundational papers on NLI (also called Recognizing Textual Entailment; RTE), make a case for the generality of this task in NLU:

It seems that major inferences, as needed by multiple applications, can indeed be cast in terms of textual entailment. For example, a QA system has to identify texts that entail a hypothesized answer. [...] Similarly, for certain Information Retrieval queries the combination of semantic concepts and relations denoted by the query should be entailed from relevant retrieved documents. [...] In multi-document summarization a redundant sentence, to be omitted from the summary, should be entailed from other sentences in the summary. And in MT evaluation a correct translation should be semantically equivalent to the gold standard translation, and thus both translations should entail each other. Consequently, we hypothesize that textual entailment recognition is a suitable generic task for evaluating and comparing applied semantic inference models. Eventually, such efforts can promote the development of entailment recognition "engines" which may provide useful generic modules across applications.

Our version of the task

Our NLI data will look like this:

Premise                                             | Relation    | Hypothesis
turtle                                              | contradicts | linguist
A turtle danced                                     | entails     | A turtle moved
Every reptile danced                                | entails     | Every turtle moved
Some turtles walk                                   | contradicts | No turtles move
James Byron Dean refused to move without blue jeans | entails     | James Dean didn't dance without pants

In the word-entailment bakeoff, we looked at a special case of this where the premise and hypothesis are single words. This notebook begins to introduce the problem of NLI more fully.

Primary resources

We're going to focus on two large, human-labeled, relatively naturalistic entailment corpora:

  • The Stanford Natural Language Inference corpus (SNLI)
  • The Multi-Genre NLI Corpus (MultiNLI)

The first was collected by a group at Stanford, led by Sam Bowman, and the second was collected by a group at NYU, also led by Sam Bowman. They have the same format and were crowdsourced using the same basic methods. However, SNLI is entirely focused on image captions, whereas MultiNLI includes a greater range of contexts.

This notebook presents tools for working with these corpora. The second notebook in the unit concerns models of NLI.

NLI model landscape


Set-up

  • As usual, you need to be fully set up to work with the CS224u repository.

  • If you haven't already, download the course data, unpack it, and place it in the directory containing the course repository – the same directory as this notebook. (If you want to put it somewhere else, change DATA_HOME below.)

In [2]:
import nli
import os
import pandas as pd
import random
In [3]:
DATA_HOME = os.path.join("data", "nlidata")

SNLI_HOME = os.path.join(DATA_HOME, "snli_1.0")

MULTINLI_HOME = os.path.join(DATA_HOME, "multinli_1.0")

ANNOTATIONS_HOME = os.path.join(DATA_HOME, "multinli_1.0_annotations")

Properties of the corpora

For both SNLI and MultiNLI, MTurk annotators were presented with premise sentences and asked to produce new sentences that entailed, contradicted, or were neutral with respect to the premise. A subset of the examples were then validated by an additional four MTurk annotators.
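
To make the labeling protocol concrete, here is a minimal sketch of how a gold label could be derived from the author's label plus four validation labels. It assumes the three-of-five majority rule reported for SNLI, with `-` marking examples that lack a majority, and it assumes that unvalidated examples simply keep the author's label. The function name `gold_label_from_votes` is hypothetical and is not part of the course's `nli` module.

```python
from collections import Counter


def gold_label_from_votes(labels, min_votes=3):
    """Return a gold label for one example, or '-' if there is no majority.

    Assumptions: an unvalidated example has a single (author) label,
    which is used directly; a validated example needs at least
    `min_votes` of its five labels to agree.
    """
    if not labels:
        return "-"
    if len(labels) == 1:
        return labels[0]
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= min_votes else "-"


votes = ["entailment", "entailment", "neutral", "entailment", "contradiction"]
print(gold_label_from_votes(votes))

# No label reaches three votes, so this example would be left unlabeled.
split = ["entailment", "entailment", "neutral", "neutral", "contradiction"]
print(gold_label_from_votes(split))
```

Examples that end up with `-` are the ones a reader would typically want to filter out before training.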

SNLI properties

  • All the premises are captions from the Flickr30K corpus.

  • Some of the sentences rather depressingly reflect stereotypes (Rudinger et al. 2017).

  • 550,152 train examples; 10K dev; 10K test

  • Mean length in tokens:

    • Premise: 14.1
    • Hypothesis: 8.3
  • Clause-types

    • Premise S-rooted: 74%
    • Hypothesis S-rooted: 88.9%
  • Vocab size: 37,026

  • 56,951 examples validated by four additional annotators

    • 58.3% of examples with unanimous gold label
    • 91.2% of gold labels match the author's label
    • 0.70 overall Fleiss kappa
  • Top scores currently around 89%.
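
The kappa figure above measures agreement among annotators after correcting for chance. As an illustration only, the standard Fleiss' kappa formula can be sketched as follows. The toy count table is invented for this example and is not SNLI data, and this is not the code that produced the 0.70 figure.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table of per-item category counts.

    `counts[i][j]` is the number of raters who assigned item i to
    category j. Every item must have the same number of raters.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of ratings")
    n_cats = len(counts[0])

    # Observed agreement: for each item, the proportion of rater pairs
    # that agree, averaged over items.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items

    # Chance agreement, from the overall proportion of each category.
    total = n_items * n_raters
    p_e = sum(
        (sum(row[j] for row in counts) / total) ** 2 for j in range(n_cats)
    )
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical data: 4 items, 5 raters each.
# Columns: entailment, contradiction, neutral.
toy = [
    [5, 0, 0],
    [0, 5, 0],
    [3, 0, 2],
    [0, 1, 4],
]
print(round(fleiss_kappa(toy), 3))
```

A value of 1 means perfect agreement, and a value near 0 means agreement no better than chance.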

MultiNLI properties

  • Train premises drawn from five genres:

    1. Fiction: works from 1912–2010 spanning many genres
    2. Government: reports, letters, speeches, etc., from government websites
    3. The Slate website
    4. Telephone: the Switchboard corpus
    5. Travel: Berlitz travel guides
  • Additional genres just for dev and test (the mismatched condition):

    1. The 9/11 report
    2. Face-to-face: The Charlotte Narrative and Conversation Collection
    3. Fundraising letters
    4. Non-fiction from Oxford University Press
    5. Verbatim articles about linguistics
  • 392,702 train examples; 20K dev; 20K test

  • 19,647 examples validated by four additional annotators

    • 58.2% of examples with unanimous gold label
    • 92.6% of gold labels match the author's label
  • Test-set labels available as a Kaggle competition.

    • Top matched scores currently around 0.81.
    • Top mismatched scores currently around 0.83.

Working with SNLI and MultiNLI


The following readers should make it easy to work with these corpora:

  • nli.SNLITrainReader
  • nli.SNLIDevReader
  • nli.MultiNLITrainReader
  • nli.MultiNLIMatchedDevReader
  • nli.MultiNLIMismatchedDevReader

The base class is nli.NLIReader, which should be easy to use to define additional readers.

If you changed DATA_HOME, SNLI_HOME, or MULTINLI_HOME above, then you'll need to call these readers with dirname as an argument, where dirname is your SNLI_HOME or MULTINLI_HOME, as appropriate.

Because the datasets are so large, it is often useful to be able to randomly sample from them. All of the reader classes allow this with their keyword argument samp_percentage. For example, the following samples approximately 10% of the examples from the SNLI training set:

In [4]:
nli.SNLITrainReader(SNLI_HOME, samp_percentage=0.10)
"NLIReader({'src_filename': 'data/nlidata/snli_1.0/snli_1.0_train.jsonl', 'filter_unlabeled': True, 'samp_percentage': 0.1, 'random_state': None})

The precise number of examples will vary somewhat because of the way the sampling is done. (Here, we trade efficiency for precision in the number of cases we return; see the implementation for details.)
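
One way to get this behavior is an independent coin flip per line while streaming through the file. That avoids loading or counting the file first, which is also why the sample size only approximates the requested fraction. The following is a sketch of that general idea and not the `nli` module's code; `sample_jsonl` and the fake corpus are hypothetical.

```python
import io
import json
import random


def sample_jsonl(lines, samp_percentage=None, seed=None):
    """Yield parsed JSON objects from an iterable of JSON lines.

    Each line is kept with probability `samp_percentage`; if it is
    None, every line is kept.
    """
    rng = random.Random(seed)
    for line in lines:
        if samp_percentage is None or rng.random() <= samp_percentage:
            yield json.loads(line)


# A small in-memory stand-in for a corpus file (hypothetical data).
fake_corpus = io.StringIO(
    "\n".join(json.dumps({"pairID": str(i)}) for i in range(1000))
)
sample = list(sample_jsonl(fake_corpus, samp_percentage=0.10, seed=0))

# Roughly 100 examples, but not exactly 100.
print(len(sample))
```

Getting an exact sample size would require knowing the number of lines in advance or using a reservoir-style method.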

The NLIExample class

All of the readers have a read method that yields NLIExample instances, which have the following attributes:

  • annotator_labels: list of str
  • captionID: str
  • gold_label: str
  • pairID: str
  • sentence1: str
  • sentence1_binary_parse: nltk.tree.Tree
  • sentence1_parse: nltk.tree.Tree
  • sentence2: str
  • sentence2_binary_parse: nltk.tree.Tree
  • sentence2_parse: nltk.tree.Tree
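
The `.jsonl` filename in the reader's repr suggests that each line of a corpus file is a JSON object. Assuming its keys match the attribute names above, the mapping from a line to an example object can be sketched as below. This is an illustration rather than the `NLIExample` implementation: the parse fields are left out, and the field values are copied from the first SNLI train example shown in this notebook.

```python
import json
from types import SimpleNamespace

# One corpus line, reduced to its non-parse fields.
line = json.dumps({
    "annotator_labels": ["neutral"],
    "captionID": "3416050480.jpg#4",
    "gold_label": "neutral",
    "pairID": "3416050480.jpg#4r1n",
    "sentence1": "A person on a horse jumps over a broken down airplane.",
    "sentence2": "A person is training his horse for a competition.",
})

# Expose the JSON keys as attributes, as an example object would.
ex = SimpleNamespace(**json.loads(line))

print(ex.gold_label)
print(ex.sentence1)
print(ex.sentence2)
```
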
In [5]:
snli_iterator = iter(nli.SNLITrainReader(SNLI_HOME).read())
In [6]:
snli_ex = next(snli_iterator)
In [7]:
print(snli_ex.sentence1)
print(snli_ex.sentence2)
A person on a horse jumps over a broken down airplane.
A person is training his horse for a competition.
In [8]:
snli_ex
"NLIExample({'annotator_labels': ['neutral'], 'captionID': '3416050480.jpg#4', 'gold_label': 'neutral', 'pairID': '3416050480.jpg#4r1n', 'sentence1': 'A person on a horse jumps over a broken down airplane.', 'sentence1_binary_parse': Tree('X', [Tree('X', [Tree('X', ['A', 'person']), Tree('X', ['on', Tree('X', ['a', 'horse'])])]), Tree('X', [Tree('X', ['jumps', Tree('X', ['over', Tree('X', ['a', Tree('X', ['broken', Tree('X', ['down', 'airplane'])])])])]), '.'])]), 'sentence1_parse': Tree('ROOT', [Tree('S', [Tree('NP', [Tree('NP', [Tree('DT', ['A']), Tree('NN', ['person'])]), Tree('PP', [Tree('IN', ['on']), Tree('NP', [Tree('DT', ['a']), Tree('NN', ['horse'])])])]), Tree('VP', [Tree('VBZ', ['jumps']), Tree('PP', [Tree('IN', ['over']), Tree('NP', [Tree('DT', ['a']), Tree('JJ', ['broken']), Tree('JJ', ['down']), Tree('NN', ['airplane'])])])]), Tree('.', ['.'])])]), 'sentence2': 'A person is training his horse for a competition.', 'sentence2_binary_parse': Tree('X', [Tree('X', ['A', 'person']), Tree('X', [Tree('X', ['is', Tree('X', [Tree('X', ['training', Tree('X', ['his', 'horse'])]), Tree('X', ['for', Tree('X', ['a', 'competition'])])])]), '.'])]), 'sentence2_parse': Tree('ROOT', [Tree('S', [Tree('NP', [Tree('DT', ['A']), Tree('NN', ['person'])]), Tree('VP', [Tree('VBZ', ['is']), Tree('VP', [Tree('VBG', ['training']), Tree('NP', [Tree('PRP$', ['his']), Tree('NN', ['horse'])]), Tree('PP', [Tree('IN', ['for']), Tree('NP', [Tree('DT', ['a']), Tree('NN', ['competition'])])])])]), Tree('.', ['.'])])])})


In [9]:
snli_labels = pd.Series(
    [ex.gold_label for ex in nli.SNLITrainReader(SNLI_HOME, filter_unlabeled=False).read()])
In [10]:
snli_labels.value_counts()
entailment       183416
contradiction    183187
neutral          182764
-                   785
dtype: int64
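
As a quick check on these counts, the arithmetic below uses plain Python with the numbers copied from the SNLI output above. It confirms that the four counts sum to the 550,152 training examples reported earlier. It also shows that the `-` examples, which are presumably the ones skipped when `filter_unlabeled=True`, make up a very small share. The variable names are illustrative.

```python
from collections import Counter

# Label counts copied from the SNLI train output.
snli_counts = Counter({
    "entailment": 183416,
    "contradiction": 183187,
    "neutral": 182764,
    "-": 785,
})

total = sum(snli_counts.values())
labeled = total - snli_counts["-"]

print(total)
print(labeled)
for label, count in snli_counts.most_common():
    # Share of the full training set carrying each label.
    print("{:<14}{:.2%}".format(label, count / total))
```
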
In [11]:
multinli_labels = pd.Series(
    [ex.gold_label for ex in nli.MultiNLITrainReader(MULTINLI_HOME, filter_unlabeled=False).read()])
In [12]:
multinli_labels.value_counts()
contradiction    130903
neutral          130900
entailment       130899
dtype: int64

Tree representations

Both corpora contain three versions of the premise and hypothesis sentences:

  1. Regular string representations of the data
  2. Unlabeled binary parses
  3. Labeled parses
In [13]:
snli_ex.sentence1
'A person on a horse jumps over a broken down airplane.'

The binary parses lack node labels; so that we can use nltk.tree.Tree with them, the label X is added to all of them:

In [14]:
snli_ex.sentence1_binary_parse

Here's the full parse tree with syntactic categories:

In [15]:
snli_ex.sentence1_parse

The leaves of either tree are a tokenized version of the example:

In [16]:
snli_ex.sentence1_parse.leaves()
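
If the binary parses are stored as space-separated bracketings without node labels, then adding the X label amounts to a string rewrite before parsing. The sketch below illustrates that idea with a tiny hand-written parser instead of nltk, so that it runs on its own. All function names are hypothetical, and the input string is reconstructed from the binary tree of the first SNLI example printed earlier.

```python
def add_x_labels(binary_parse):
    """Give every node of an unlabeled bracketing the dummy label X."""
    return binary_parse.replace("(", "(X")


def parse_brackets(s):
    """Parse a labeled bracketing into nested (label, children) tuples."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return read_node()


def leaves(node):
    """Return the words at the leaves, left to right."""
    if isinstance(node, str):
        return [node]
    _, children = node
    return [word for child in children for word in leaves(child)]


binary = ("( ( ( A person ) ( on ( a horse ) ) ) "
          "( ( jumps ( over ( a ( broken ( down airplane ) ) ) ) ) . ) )")
tree = parse_brackets(add_x_labels(binary))

print(tree[0])
print(leaves(tree))
```

The same `parse_brackets` and `leaves` functions also work on a labeled bracketing such as `(ROOT (S (NP (DT A) (NN person)) ...))`, since only the node labels differ.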

Annotated MultiNLI subsets

MultiNLI includes additional annotations for a subset of the dev examples. The goal is to help people understand how well their models are doing on crucial NLI-related linguistic phenomena.

In [17]:
matched_ann_filename = os.path.join(

mismatched_ann_filename = os.path.join(
In [18]:
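def view_random_example(annotations):
    # Pick one (pairID, annotation record) pair at random.
    pairid, ann_ex = random.choice(list(annotations.items()))
    ex = ann_ex['example']
    print("pairID: {}".format(pairid))
    # Premise and hypothesis of the selected example.
    print(ex.sentence1)
    print(ex.sentence2)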
In [20]:
matched_ann = nli.read_annotated_subset(matched_ann_filename, MULTINLI_HOME)
In [21]:
view_random_example(matched_ann)
pairID: 2367n
On the window above the sink a small container is stuffed with bits of leftovers--the red berries of barberry, small twigs of willow, cuttings of hinoki cypress with its fruits attached, and the pendulous leathery seed pods of wisteria.
There is a small jar on the window.

Other NLI datasets