Natural language inference: Task and datasets

In [1]:
__author__ = "Christopher Potts"
__version__ = "CS224u, Stanford, Spring 2018"

Overview

Natural Language Inference (NLI) is the task of predicting the logical relationships between words, phrases, sentences (paragraphs, documents, ...). Such relationships are crucial for all kinds of reasoning in natural language: arguing, debating, problem solving, summarization, and so forth.

Dagan et al. (2006), one of the foundational papers on NLI (also called Recognizing Textual Entailment; RTE), make a case for the generality of this task in NLU:

It seems that major inferences, as needed by multiple applications, can indeed be cast in terms of textual entailment. For example, a QA system has to identify texts that entail a hypothesized answer. [...] Similarly, for certain Information Retrieval queries the combination of semantic concepts and relations denoted by the query should be entailed from relevant retrieved documents. [...] In multi-document summarization a redundant sentence, to be omitted from the summary, should be entailed from other sentences in the summary. And in MT evaluation a correct translation should be semantically equivalent to the gold standard translation, and thus both translations should entail each other. Consequently, we hypothesize that textual entailment recognition is a suitable generic task for evaluating and comparing applied semantic inference models. Eventually, such efforts can promote the development of entailment recognition "engines" which may provide useful generic modules across applications.

Our version of the task

Our NLI data will look like this:

Premise | Relation | Hypothesis
turtle | contradicts | linguist
A turtle danced | entails | A turtle moved
Every reptile danced | entails | Every turtle moved
Some turtles walk | contradicts | No turtles move
James Byron Dean refused to move without blue jeans | entails | James Dean didn't dance without pants
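
To make the format concrete, the same rows could be written in code as simple (premise, relation, hypothesis) triples (purely illustrative; the corpora introduced below use a richer JSON format with parses and annotator labels):

# Illustrative only: NLI items as (premise, relation, hypothesis) triples.
examples = [
    ("turtle", "contradicts", "linguist"),
    ("A turtle danced", "entails", "A turtle moved"),
    ("Some turtles walk", "contradicts", "No turtles move"),
]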

In the word-entailment bakeoff, we looked at a special case of this where the premise and hypothesis are single words. This notebook begins to introduce the problem of NLI more fully.

Primary resources

We're going to focus on two large, human-labeled, relatively naturalistic entailment corpora:

  • The Stanford Natural Language Inference corpus (SNLI)
  • The Multi-Genre Natural Language Inference corpus (MultiNLI)

The first was collected by a group at Stanford, led by Sam Bowman, and the second was collected by a group at NYU, also led by Sam Bowman. They have the same format and were crowdsourced using the same basic methods. However, SNLI is entirely focused on image captions, whereas MultiNLI includes a greater range of contexts.

This notebook presents tools for working with these corpora. The second notebook in the unit concerns models of NLI.

NLI model landscape

Set-up

  1. As usual, you need to be fully set up to work with the CS224u repository.

  2. Make sure you have the nlidata.zip distribution downloaded and unpacked in the current directory. If you did the bake-off, then you likely already did this.

  3. Get the corpus distributions, place them in nlidata, and unpack them:

In [2]:
import nli
import os
import pandas as pd
import random
In [3]:
data_home = "nlidata"

snli_home = os.path.join(data_home, "snli_1.0")

multinli_home = os.path.join(data_home, "multinli_1.0")

annotations_home = os.path.join(data_home, "multinli_1.0_annotations")
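
Before reading anything, it can be worth a quick check that the unpacked corpora are actually at these paths (an optional snippet, not one of the original notebook cells):

# Optional sanity check that the three expected directories are in place:
for path in (snli_home, multinli_home, annotations_home):
    if not os.path.isdir(path):
        print("Missing directory: {}".format(path))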

Properties of the corpora

For both SNLI and MultiNLI, MTurk annotators were presented with premise sentences and asked to produce new sentences that entailed, contradicted, or were neutral with respect to the premise. A subset of the examples was then validated by an additional four MTurk annotators.

SNLI properties

  • All the premises are captions from the Flickr30K corpus.

  • Some of the sentences rather depressingly reflect stereotypes (Rudinger et al. 2017).

  • 550,152 train examples; 10K dev; 10K test

  • Mean length in tokens:

    • Premise: 14.1
    • Hypothesis: 8.3
  • Clause-types

    • Premise S-rooted: 74%
    • Hypothesis S-rooted: 88.9%
  • Vocab size: 37,026

  • 56,951 examples validated by four additional annotators

    • 58.3% examples with unanimous gold label
    • 91.2% of gold labels match the author's label
    • 0.70 overall Fleiss kappa
  • Top scores currently around 89%.

MultiNLI properties

  • Train premises drawn from five genres:

    1. Fiction: works from 1912–2010 spanning many genres
    2. Government: reports, letters, speeches, etc., from government websites
    3. The Slate website
    4. Telephone: the Switchboard corpus
    5. Travel: Berlitz travel guides
  • Additional genres just for dev and test (the mismatched condition):

    1. The 9/11 report
    2. Face-to-face: The Charlotte Narrative and Conversation Collection
    3. Fundraising letters
    4. Non-fiction from Oxford University Press
    5. Verbatim articles about linguistics
  • 392,702 train examples; 20K dev; 20K test

  • 19,647 examples validated by four additional annotators

    • 58.2% examples with unanimous gold label
    • 92.6% of gold labels match the author's label
  • Test-set labels available as a Kaggle competition.

    • Top matched scores currently around 0.81.
    • Top mismatched scores currently around 0.83.

Working with SNLI and MultiNLI

Readers

The following readers should make it easy to work with these corpora:

  • nli.SNLITrainReader
  • nli.SNLIDevReader
  • nli.MultiNLITrainReader
  • nli.MultiNLIMatchedDevReader
  • nli.MultiNLIMismatchedDevReader

The base class is nli.NLIReader, which should be easy to use to define additional readers.

If you did change data_home, snli_home, or multinli_home above, then you'll need to call these readers with dirname as an argument, where dirname is your snli_home or multinli_home, as appropriate.
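
For example, if your copy of SNLI is somewhere other than the default location, a call along these lines should work (a sketch; snli_home is the path variable defined in the set-up above):

# Point a reader at a non-default location via `dirname`:
reader = nli.SNLITrainReader(dirname=snli_home)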

Because the datasets are so large, it is often useful to be able to randomly sample from them. All of the reader classes allow this with their keyword argument samp_percentage. For example, the following samples approximately 10% of the examples from the SNLI training set:

In [4]:
nli.SNLITrainReader(samp_percentage=0.10)
Out[4]:
"NLIReader({'src_filename': 'nlidata/snli_1.0/snli_1.0_train.jsonl', 'filter_unlabeled': True, 'samp_percentage': 0.1, 'random_state': None})

The precise number of examples will vary somewhat because of the way the sampling is done. (Here, we trade efficiency for precision in the number of cases we return; see the implementation for details.)
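
Conceptually, this kind of sampling flips a biased coin for each example as the file is streamed, which is fast but only approximately hits the requested percentage. Here is a minimal sketch of that idea (an assumption about the approach; see nli.py for the actual implementation):

# Sketch of per-example Bernoulli sampling over a streamed corpus:
def stream_sample(examples, samp_percentage):
    for ex in examples:
        if random.random() <= samp_percentage:
            yield ex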

The NLIExample class

All of the readers have a read method that yields NLIExample instances, which have the following attributes:

  • annotator_labels: list of str
  • captionID: str
  • gold_label: str
  • pairID: str
  • sentence1: str
  • sentence1_binary_parse: nltk.tree.Tree
  • sentence1_parse: nltk.tree.Tree
  • sentence2: str
  • sentence2_binary_parse: nltk.tree.Tree
  • sentence2_parse: nltk.tree.Tree
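
Since pandas is already imported, one convenient pattern (purely illustrative, not one of the notebook's own cells) is to pull a small random sample of these attributes into a DataFrame for inspection:

# Illustrative: collect a ~1% sample as (premise, hypothesis, label) rows.
rows = [(ex.sentence1, ex.sentence2, ex.gold_label)
        for ex in nli.SNLITrainReader(samp_percentage=0.01).read()]
df = pd.DataFrame(rows, columns=['sentence1', 'sentence2', 'gold_label'])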
In [5]:
snli_iterator = iter(nli.SNLITrainReader().read())
In [6]:
snli_ex = next(snli_iterator)
In [7]:
print(snli_ex)
A person on a horse jumps over a broken down airplane.
neutral
A person is training his horse for a competition.
In [8]:
snli_ex
Out[8]:
"NLIExample({'annotator_labels': ['neutral'], 'captionID': '3416050480.jpg#4', 'gold_label': 'neutral', 'pairID': '3416050480.jpg#4r1n', 'sentence1': 'A person on a horse jumps over a broken down airplane.', 'sentence1_binary_parse': Tree('X', [Tree('X', [Tree('X', ['A', 'person']), Tree('X', ['on', Tree('X', ['a', 'horse'])])]), Tree('X', [Tree('X', ['jumps', Tree('X', ['over', Tree('X', ['a', Tree('X', ['broken', Tree('X', ['down', 'airplane'])])])])]), '.'])]), 'sentence1_parse': Tree('ROOT', [Tree('S', [Tree('NP', [Tree('NP', [Tree('DT', ['A']), Tree('NN', ['person'])]), Tree('PP', [Tree('IN', ['on']), Tree('NP', [Tree('DT', ['a']), Tree('NN', ['horse'])])])]), Tree('VP', [Tree('VBZ', ['jumps']), Tree('PP', [Tree('IN', ['over']), Tree('NP', [Tree('DT', ['a']), Tree('JJ', ['broken']), Tree('JJ', ['down']), Tree('NN', ['airplane'])])])]), Tree('.', ['.'])])]), 'sentence2': 'A person is training his horse for a competition.', 'sentence2_binary_parse': Tree('X', [Tree('X', ['A', 'person']), Tree('X', [Tree('X', ['is', Tree('X', [Tree('X', ['training', Tree('X', ['his', 'horse'])]), Tree('X', ['for', Tree('X', ['a', 'competition'])])])]), '.'])]), 'sentence2_parse': Tree('ROOT', [Tree('S', [Tree('NP', [Tree('DT', ['A']), Tree('NN', ['person'])]), Tree('VP', [Tree('VBZ', ['is']), Tree('VP', [Tree('VBG', ['training']), Tree('NP', [Tree('PRP$', ['his']), Tree('NN', ['horse'])]), Tree('PP', [Tree('IN', ['for']), Tree('NP', [Tree('DT', ['a']), Tree('NN', ['competition'])])])])]), Tree('.', ['.'])])])})

Labels

In [9]:
snli_labels = pd.Series(
    [ex.gold_label for ex in nli.SNLITrainReader(filter_unlabeled=False).read()])
In [10]:
snli_labels.value_counts()
Out[10]:
entailment       183416
contradiction    183187
neutral          182764
-                   785
dtype: int64
In [11]:
multinli_labels = pd.Series(
    [ex.gold_label for ex in nli.MultiNLITrainReader(filter_unlabeled=False).read()])
In [12]:
multinli_labels.value_counts()
Out[12]:
contradiction    130903
neutral          130900
entailment       130899
dtype: int64
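
The 785 '-' cases in the SNLI counts are examples where the validators did not converge on a gold label; the readers exclude them by default (filter_unlabeled=True, as the reader repr above shows). If you prefer proportions to raw counts, pandas can normalize the counts (illustrative follow-up):

# Class balance as proportions, leaving out the unlabeled '-' cases:
snli_labels[snli_labels != '-'].value_counts(normalize=True)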

Tree representations

Both corpora contain three versions of the premise and hypothesis sentences:

  1. Regular string representations of the data
  2. Unlabeled binary parses
  3. Labeled parses
In [13]:
snli_ex.sentence1
Out[13]:
'A person on a horse jumps over a broken down airplane.'

The binary parses lack node labels; so that we can use nltk.tree.Tree with them, the label X is added to all of them:

In [14]:
snli_ex.sentence1_binary_parse
Out[14]:
(rendered NLTK tree for sentence1_binary_parse, with every node labeled X)

Here's the full parse tree with syntactic categories:

In [15]:
snli_ex.sentence1_parse
Out[15]:
(rendered NLTK tree for sentence1_parse, with its full syntactic category labels)

The leaves of either tree are a tokenized version of the example:

In [16]:
snli_ex.sentence1_parse.leaves()
Out[16]:
['A',
 'person',
 'on',
 'a',
 'horse',
 'jumps',
 'over',
 'a',
 'broken',
 'down',
 'airplane',
 '.']
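
Since the label-X trees and the full parses come from the same tokenization, the two leaf sequences should match; a quick illustrative check:

# The binary parse and the labeled parse should yield the same tokens:
snli_ex.sentence1_binary_parse.leaves() == snli_ex.sentence1_parse.leaves()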

Annotated MultiNLI subsets

MultiNLI includes additional annotations for a subset of the dev examples. The goal is to help people understand how well their models are doing on crucial NLI-related linguistic phenomena.

In [17]:
matched_ann_filename = os.path.join(
    annotations_home,
    "multinli_1.0_matched_annotations.txt")

mismatched_ann_filename = os.path.join(
    annotations_home, 
    "multinli_1.0_mismatched_annotations.txt")
In [18]:
def view_random_example(annotations):
    """Print a randomly chosen annotated example: its pairID, its
    annotation tags, and its premise/gold-label/hypothesis triple."""
    pairid, ann_ex = random.choice(list(annotations.items()))
    ex = ann_ex['example']
    print("pairID: {}".format(pairid))
    print(ann_ex['annotations'])
    print(ex.sentence1)
    print(ex.gold_label)
    print(ex.sentence2)
In [19]:
matched_ann = nli.read_annotated_subset(matched_ann_filename)
In [20]:
view_random_example(matched_ann)
pairID: 24665n
['#NEGATION', '#TENSE_DIFFERENCE', '#LONG_SENTENCE']
It recalls William Randolph Hearst's castle in Caleornia, with its imaginative juxtaposition of ancient Roman and Chinese sculpture, fine Venetian glass chandeliers, Syvres porcelain, old Flemish masters, and naughty French erotica.
neutral
William Randolph Hearst didn't love chandeliers, but kept them to make his lady happy.
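
The mismatched annotations load the same way:

# Same workflow for the mismatched dev annotations:
mismatched_ann = nli.read_annotated_subset(mismatched_ann_filename)
view_random_example(mismatched_ann)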

Other NLI datasets