CoNLL 2016 Shared Task Data Format

The data format is identical to the one we used last year, but we made slight changes to some of the file names in the package to avoid confusion with last year's files. The package name indicates the language (en or zh), the date of creation (MM-DD-YY), and the data split (train, dev, trial, etc.). Once you unpack the package, you can expect the following files and folders:

  • parses.json - The input file for the main task and the supplementary task (pdtb-parses.json in 2015)
  • relations-no-senses.json - The input file for the supplementary task (new this year)
  • relations.json - The gold standard discourse relations (pdtb-data.json in 2015)
  • raw/DocID - Plain text file, one per document, with no extension. The file name matches the DocID field in relations.json and the key in parses.json.
  • conll_format/DocID.conll - CoNLL format for the training data (one .conll file per document)

We will show you how to work with each of these files in order to train your systems for the main task and the supplementary task in the language of your choice.

In [3]:
ls -l conll16st-en-01-12-16-trial
total 496
drwxr-xr-x+ 3 te  staff     102 Jan 12 09:42 conll_format/
-rw-r--r--+ 1 te  staff    9950 Jan 13 11:42 output.json
-rw-r--r--+ 1 te  staff  150222 Jan 12 09:40 parses.json
drwxr-xr-x+ 3 te  staff     102 Jan 12 09:42 raw/
-rw-r--r--+ 1 te  staff   41739 Jan 12 09:42 relations-no-senses.json
-rw-r--r--+ 1 te  staff   42610 Jan 12 09:40 relations.json
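
Before training, it can help to confirm that an unpacked package has the layout listed above. The following is a minimal sketch, not part of the official tools: the function name missing_entries is our own, and the expected names are taken from the file list above (output.json in the listing is a system output, not input data).

```python
import os

# Names taken from the file list above. relations.json and conll_format/
# are only expected in packages that ship gold annotation (train, dev, trial).
EXPECTED_FILES = ['parses.json', 'relations.json', 'relations-no-senses.json']
EXPECTED_DIRS = ['raw', 'conll_format']


def missing_entries(package_dir):
    """Return the expected files and folders that are absent from package_dir."""
    missing = []
    for name in EXPECTED_FILES:
        if not os.path.isfile(os.path.join(package_dir, name)):
            missing.append(name)
    for name in EXPECTED_DIRS:
        if not os.path.isdir(os.path.join(package_dir, name)):
            missing.append(name + '/')
    return missing
```

Calling missing_entries('conll16st-en-01-12-16-trial') should return an empty list for a complete trial package.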

relations.json : Gold standard discourse relation annotation

This file comes from the Penn Discourse Treebank (PDTB) for English and the Chinese Discourse Treebank (CDTB) for Chinese. It holds the gold standard annotation for both the main task and the supplementary task. Each line in the file is a JSON object. In Python, you can turn it into a dictionary; in Java, you can turn it into a HashMap. Please do not use regular expressions to parse JSON: your system will most likely break during evaluation.

The dictionary describes the following components of a relation:

  • Arg1 : the text span of Arg1 of the relation
  • Arg2 : the text span of Arg2 of the relation
  • Connective : the text span of the connective of the relation
  • DocID : the ID of the document that contains the relation
  • ID : the relation id, which is unique across training, dev, and test sets.
  • Sense : the sense of the relation
  • Type : the type of relation (Explicit, Implicit, EntRel, AltLex, or NoRel)

The text span is in the same format for Arg1, Arg2, and Connective. A text span has the following fields:

  • CharacterSpanList : the list of character offsets (beginning, end) in the raw untokenized data file.
  • RawText : the raw untokenized text of the span
  • TokenList : the list of the addresses of the tokens in the form of (character offset begin, character offset end, token offset within the document, sentence offset, token offset within the sentence)

For example,

In [4]:
import json
import codecs
pdtb_file = codecs.open('conll16st-en-01-12-16-trial/relations.json', encoding='utf8')
relations = [json.loads(x) for x in pdtb_file]
example_relation = relations[10]
example_relation
Out[4]:
{u'Arg1': {u'CharacterSpanList': [[2493, 2517]],
  u'RawText': u'and told them to cool it',
  u'TokenList': [[2493, 2496, 465, 15, 8],
   [2497, 2501, 466, 15, 9],
   [2502, 2506, 467, 15, 10],
   [2507, 2509, 468, 15, 11],
   [2510, 2514, 469, 15, 12],
   [2515, 2517, 470, 15, 13]]},
 u'Arg2': {u'CharacterSpanList': [[2526, 2552]],
  u'RawText': u"they're ruining the market",
  u'TokenList': [[2526, 2530, 472, 15, 15],
   [2530, 2533, 473, 15, 16],
   [2534, 2541, 474, 15, 17],
   [2542, 2545, 475, 15, 18],
   [2546, 2552, 476, 15, 19]]},
 u'Connective': {u'CharacterSpanList': [[2518, 2525]],
  u'RawText': u'because',
  u'TokenList': [[2518, 2525, 471, 15, 14]]},
 u'DocID': u'wsj_1000',
 u'ID': 14887,
 u'Sense': [u'Contingency.Cause.Reason'],
 u'Type': u'Explicit'}
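
To make the span fields concrete, here is a small sketch that unpacks the token addresses of Arg1 from the example above and slices the raw text with them. The names TokenAddress and span_pieces are our own, and raw_text is a synthetic stand-in for the contents of raw/wsj_1000 (padding followed by the span), not the real file. In this example the end offsets behave as exclusive, as in Python slicing.

```python
from collections import namedtuple

# Field names are illustrative; the order follows the TokenList description above.
TokenAddress = namedtuple(
    'TokenAddress',
    ['char_begin', 'char_end', 'doc_token', 'sentence', 'sentence_token'])


def span_pieces(raw_text, character_span_list):
    """Slice raw_text once per (begin, end) pair of a CharacterSpanList."""
    return [raw_text[begin:end] for begin, end in character_span_list]


# Arg1 of the example relation above.
arg1 = {
    'CharacterSpanList': [[2493, 2517]],
    'RawText': 'and told them to cool it',
    'TokenList': [[2493, 2496, 465, 15, 8], [2497, 2501, 466, 15, 9],
                  [2502, 2506, 467, 15, 10], [2507, 2509, 468, 15, 11],
                  [2510, 2514, 469, 15, 12], [2515, 2517, 470, 15, 13]],
}

# Synthetic stand-in for raw/wsj_1000: padding up to offset 2493, then the span.
raw_text = ' ' * 2493 + arg1['RawText']

addresses = [TokenAddress(*item) for item in arg1['TokenList']]
# Recover each token's surface form from its character offsets.
words = [raw_text[a.char_begin:a.char_end] for a in addresses]
```

With the real raw file read in place of raw_text, span_pieces(raw_text, arg1['CharacterSpanList']) should give back the RawText of the span.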

Differences in Chinese data

The Chinese data and the English data are identical in format, except that the Chinese data have one extra field, Punctuation. Punctuation marks in Chinese have some discourse functions, so they are annotated as well, but you are not required to detect them as part of the task. Discourse annotation in Chinese differs quite a bit from English from a linguistic perspective. Please refer to the original paper on the Chinese Discourse Treebank.

In [5]:
data = codecs.open('conll16st-zh-01-08-2016-trial/relations.json', encoding='utf8')
chinese_relations = [json.loads(x) for x in data]
chinese_relations[13]
Out[5]:
{u'Arg1': {u'CharacterSpanList': [[500, 511]],
  u'RawText': u'\u6210\u4ea4 \u836f\u54c1 \u4e00\u4ebf\u591a \u5143',
  u'TokenList': [[500, 502, 187, 5, 27],
   [503, 505, 188, 5, 28],
   [506, 509, 189, 5, 29],
   [510, 511, 190, 5, 30]]},
 u'Arg2': {u'CharacterSpanList': [[514, 526]],
  u'RawText': u'\u6ca1\u6709 \u53d1\u73b0 \u4e00 \u4f8b \u56de\u6263',
  u'TokenList': [[514, 516, 192, 5, 32],
   [517, 519, 193, 5, 33],
   [520, 521, 194, 5, 34],
   [522, 523, 195, 5, 35],
   [524, 526, 196, 5, 36]]},
 u'Connective': {u'CharacterSpanList': [], u'RawText': u'', u'TokenList': []},
 u'DocID': u'chtb_0001',
 u'ID': 13,
 u'Punctuation': {u'CharacterSpanList': [[512, 513]],
  u'PunctuationType': u'Comma',
  u'RawText': u'\uff0c',
  u'TokenList': [[512, 513, 191, 5, 31]]},
 u'Sense': [u'Conjunction'],
 u'Type': u'Implicit'}
In [6]:
print 'Arg1 : %s\nArg2 : %s' % (chinese_relations[13]['Arg1']['RawText'], chinese_relations[13]['Arg2']['RawText'])
Arg1 : 成交 药品 一亿多 元
Arg2 : 没有 发现 一 例 回扣
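
If the same code has to handle both languages, the extra Punctuation field can be read defensively. This is a minimal sketch under the assumption that English relations simply lack the field; the function name punctuation_info is our own.

```python
def punctuation_info(relation):
    """Return (PunctuationType, RawText), or None when the field is absent."""
    punctuation = relation.get('Punctuation')
    if punctuation is None:
        return None
    return punctuation.get('PunctuationType'), punctuation.get('RawText')
```

For the Chinese example above this would return the pair ('Comma', u'\uff0c'), and for an English relation it would return None.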

parses.json : Input for the main task and the supplementary task

This is the file that your system will have to process during evaluation. It provides the automatic parses and part-of-speech tags. Note that, unlike the discourse relation JSON file, this file contains only one line. Suppose we want the parse for the sentence in the relation above, which is sentence #15 according to its TokenList.

In [7]:
parse_file = codecs.open('conll16st-en-01-12-16-trial/parses.json', encoding='utf8')
en_parse_dict = json.load(parse_file)

en_example_relation = relations[10]
en_doc_id = en_example_relation['DocID']
print en_parse_dict[en_doc_id]['sentences'][15]['parsetree']
( (S (NP (PRP We)) (VP (VBP 've) (VP (VP (VBN talked) (PP (TO to) (NP (NP (NNS proponents)) (PP (IN of) (NP (NN index) (NN arbitrage)))))) (CC and) (VP (VBD told) (NP (PRP them)) (S (VP (TO to) (VP (VB cool) (NP (PRP it)) (SBAR (IN because) (S (NP (PRP they)) (VP (VBP 're) (VP (VBG ruining) (NP (DT the) (NN market)))))))))))) (. .)) )

In [8]:
parse_file = codecs.open('conll16st-zh-01-08-2016-trial/parses.json', encoding='utf8')
zh_parse_dict = json.load(parse_file)

zh_example_relation = chinese_relations[13]
zh_doc_id = zh_example_relation['DocID']
print zh_parse_dict[zh_doc_id]['sentences'][5]['parsetree']
( (IP (NP (CP (IP (LCP (NP (NT 去年)) (LC 初)) (NP (NP (NR 浦东)) (NP (NN 新区))) (VP (VV 诞生))) (DEC 的)) (NP (NP (NR 中国)) (QP (OD 第一) (CLP (M 家))) (NP (NN 医疗) (NN 机构))) (NP (NN 药品) (NN 采购) (NN 服务) (NN 中心))) (PU ,) (VP (VP (PP (ADVP (AD 正)) (PP (P 因为) (IP (IP (VP (ADVP (AD 一)) (VP (VV 开始)))) (VP (ADVP (AD 就)) (ADVP (AD 比较)) (VP (VA 规范)))))) (PU ,) (VP (VV 运转) (IP (VP (ADVP (AD 至今)) (PU ,) (VP (VV 成交) (NP (NN 药品)) (QP (CD 一亿多) (CLP (M 元)))))))) (PU ,) (VP (ADVP (AD 没有)) (VP (VV 发现) (NP (QP (CD 一) (CLP (M 例))) (NP (NN 回扣)))))) (PU 。)) )
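
The parsetree field is a bracketed string. One way to get the (POS tag, word) pairs out of it without an external library is sketched below. The function name tree_leaves is our own, and the sketch assumes that literal brackets in the text never appear unescaped inside the tree.

```python
def tree_leaves(parsetree):
    """Return (POS tag, word) pairs for the leaves of a bracketed parse string."""
    # Put spaces around brackets so that a plain split() yields clean tokens.
    tokens = parsetree.replace('(', ' ( ').replace(')', ' ) ').split()
    leaves = []
    for i in range(len(tokens) - 3):
        opening, tag, word, closing = tokens[i:i + 4]
        # A leaf is exactly "( TAG word )" with no bracket in the middle.
        if (opening == '(' and closing == ')'
                and tag not in '()' and word not in '()'):
            leaves.append((tag, word))
    return leaves
```

For example, tree_leaves("( (S (NP (PRP We)) (VP (VBP 've))) )") gives [('PRP', 'We'), ('VBP', "'ve")].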

In [9]:
en_parse_dict[en_doc_id]['sentences'][15]['dependencies']
Out[9]:
[[u'nsubj', u'talked-3', u'We-1'],
 [u'aux', u'talked-3', u"'ve-2"],
 [u'root', u'ROOT-0', u'talked-3'],
 [u'prep', u'talked-3', u'to-4'],
 [u'pobj', u'to-4', u'proponents-5'],
 [u'prep', u'proponents-5', u'of-6'],
 [u'nn', u'arbitrage-8', u'index-7'],
 [u'pobj', u'of-6', u'arbitrage-8'],
 [u'cc', u'talked-3', u'and-9'],
 [u'conj', u'talked-3', u'told-10'],
 [u'dobj', u'told-10', u'them-11'],
 [u'aux', u'cool-13', u'to-12'],
 [u'xcomp', u'told-10', u'cool-13'],
 [u'dobj', u'cool-13', u'it-14'],
 [u'mark', u'ruining-18', u'because-15'],
 [u'nsubj', u'ruining-18', u'they-16'],
 [u'aux', u'ruining-18', u"'re-17"],
 [u'advcl', u'cool-13', u'ruining-18'],
 [u'det', u'market-20', u'the-19'],
 [u'dobj', u'ruining-18', u'market-20']]
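
Each dependency is a triple of label, head, and dependent, where the head and dependent are strings of the form word-index. The sketch below splits those strings and builds a lookup from each dependent to its head. The function names are our own, and the sketch assumes the part after the last hyphen is always a plain integer, as in the output above.

```python
def split_node(node):
    """Split a node such as 'talked-3' into ('talked', 3)."""
    # rsplit keeps hyphenated words intact: only the last hyphen separates the index.
    word, index = node.rsplit('-', 1)
    return word, int(index)


def heads_by_token(dependencies):
    """Map each dependent's token index to (label, head token index)."""
    heads = {}
    for label, head, dependent in dependencies:
        heads[split_node(dependent)[1]] = (label, split_node(head)[1])
    return heads
```

Note that in the output above the indices start at 1, with ROOT at 0: the connective is because-15 here, while its sentence-level token offset in the TokenList of the relation is 14.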

You can iterate over the tokens through the words field of each sentence. Note that the Linkers field indicates whether the token is part of an argument. Each entry has the form arg1_ID or arg2_ID, where ID corresponds to the ID field in the relation JSON.

In [10]:
en_parse_dict[en_doc_id]['sentences'][15]['words'][0]
Out[10]:
[u'We',
 {u'CharacterOffsetBegin': 2447,
  u'CharacterOffsetEnd': 2449,
  u'Linkers': [u'arg2_14886', u'arg1_14888'],
  u'PartOfSpeech': u'PRP'}]
In [11]:
en_parse_dict[en_doc_id]['sentences'][15]['words'][1]
Out[11]:
[u"'ve",
 {u'CharacterOffsetBegin': 2449,
  u'CharacterOffsetEnd': 2452,
  u'Linkers': [u'arg2_14886', u'arg1_14888'],
  u'PartOfSpeech': u'VBP'}]
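
The sentences in parses.json can also be used to compute document-level token indices, which are the third item of each token address in relations.json. The sketch below assumes that document-level indices simply count tokens in order across sentences, starting from 0; the function names are our own.

```python
def sentence_start_offsets(sentences):
    """Return, for each sentence, the document-level index of its first token."""
    offsets = []
    total = 0
    for sentence in sentences:
        offsets.append(total)
        total += len(sentence['words'])
    return offsets


def document_token_index(offsets, sentence_index, token_index):
    """Convert a (sentence, token-in-sentence) pair to a document-level index."""
    return offsets[sentence_index] + token_index
```

With offsets = sentence_start_offsets(en_parse_dict[en_doc_id]['sentences']), the call document_token_index(offsets, 15, 8) should agree with the document-level index 465 given in the example relation if the assumption holds.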

relations-no-senses.json : Input for the supplementary task

Systems participating in the supplementary task (sense classification) take this file as input. The file is the same as relations.json, but the Type and Sense fields are left empty. The format is the same for Chinese and English, except for the Punctuation field.

In [12]:
supp_data = open('conll16st-en-01-12-16-trial/relations-no-senses.json')
relations_no_senses = [json.loads(x) for x in supp_data]
relations_no_senses[10]
Out[12]:
{u'Arg1': {u'CharacterSpanList': [[2493, 2517]],
  u'RawText': u'and told them to cool it',
  u'TokenList': [[2493, 2496, 465, 15, 8],
   [2497, 2501, 466, 15, 9],
   [2502, 2506, 467, 15, 10],
   [2507, 2509, 468, 15, 11],
   [2510, 2514, 469, 15, 12],
   [2515, 2517, 470, 15, 13]]},
 u'Arg2': {u'CharacterSpanList': [[2526, 2552]],
  u'RawText': u"they're ruining the market",
  u'TokenList': [[2526, 2530, 472, 15, 15],
   [2530, 2533, 473, 15, 16],
   [2534, 2541, 474, 15, 17],
   [2542, 2545, 475, 15, 18],
   [2546, 2552, 476, 15, 19]]},
 u'Connective': {u'CharacterSpanList': [[2518, 2525]],
  u'RawText': u'because',
  u'TokenList': [[2518, 2525, 471, 15, 14]]},
 u'DocID': u'wsj_1000',
 u'ID': 14887,
 u'Sense': [],
 u'Type': u''}
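
As a rough illustration of what a supplementary-task system does with this input, the sketch below fills the two empty fields with a trivial rule. It is only a placeholder, not a recommended baseline: the function name and the two default labels are our own choices (the labels are simply the senses that appear in the examples of this tutorial), and a real system would predict them from the arguments and the connective.

```python
import copy

# Placeholder labels for illustration only; a real system would predict these.
DEFAULT_EXPLICIT_SENSE = 'Contingency.Cause.Reason'
DEFAULT_NON_EXPLICIT_SENSE = 'Expansion.Conjunction'


def fill_placeholder_sense(relation):
    """Return a copy of the relation with Type and Sense filled by a trivial rule."""
    predicted = copy.deepcopy(relation)
    # Treat any relation with connective tokens as Explicit (a simplification).
    has_connective = bool(relation['Connective']['TokenList'])
    predicted['Type'] = 'Explicit' if has_connective else 'Implicit'
    predicted['Sense'] = [DEFAULT_EXPLICIT_SENSE if has_connective
                          else DEFAULT_NON_EXPLICIT_SENSE]
    return predicted
```

The input relation is left untouched, so the same list can be passed to several candidate classifiers.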

"The CoNLL Format"

The JSON format makes your code much more readable than a bunch of opaque column indices would, and the CoNLL format of this dataset is wicked sparse. Here's our suggested way to get something similar: use the Linkers field in each token dictionary. Here's an example.

In [13]:
all_tokens = [token for sentence in en_parse_dict[en_doc_id]['sentences'] for token in sentence['words']]
for token in all_tokens[0:20]:
    for linker in token[1]['Linkers']:
        role, relation_id = linker.split('_')
        print '%s \t is part of %s in relation id %s' % (token[0], role, relation_id)
Kemper 	 is part of arg1 in relation id 14890
Financial 	 is part of arg1 in relation id 14890
Services 	 is part of arg1 in relation id 14890
Inc. 	 is part of arg1 in relation id 14890
, 	 is part of arg1 in relation id 14890
charging 	 is part of arg1 in relation id 14890
that 	 is part of arg1 in relation id 14890
program 	 is part of arg1 in relation id 14890
trading 	 is part of arg1 in relation id 14890
is 	 is part of arg1 in relation id 14890
ruining 	 is part of arg1 in relation id 14890
the 	 is part of arg1 in relation id 14890
stock 	 is part of arg1 in relation id 14890
market 	 is part of arg1 in relation id 14890
, 	 is part of arg1 in relation id 14890
cut 	 is part of arg1 in relation id 14890
off 	 is part of arg1 in relation id 14890
four 	 is part of arg1 in relation id 14890
big 	 is part of arg1 in relation id 14890
Wall 	 is part of arg1 in relation id 14890
In [14]:
print 'Relation ID is %s' % relations[13]['ID']
print 'Arg 1 : %s' % relations[13]['Arg1']['RawText']
Relation ID is 14890
Arg 1 : Kemper Financial Services Inc., charging that program trading is ruining the stock market, cut off four big Wall Street firms from doing any of its stock-trading business
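
The same Linkers information can be collected per relation instead of printed per token. The following sketch groups the words of a document by relation ID and role; the function name tokens_by_linker is our own, and it assumes every Linkers entry has the role_ID form shown above.

```python
from collections import defaultdict


def tokens_by_linker(sentences):
    """Group the words of a document by (relation ID, role) using Linkers."""
    groups = defaultdict(list)
    for sentence in sentences:
        for word, info in sentence['words']:
            for linker in info.get('Linkers', []):
                role, relation_id = linker.split('_')
                groups[(int(relation_id), role)].append(word)
    return dict(groups)
```

With groups = tokens_by_linker(en_parse_dict[en_doc_id]['sentences']), the entry groups[(14890, 'arg1')] should start with the tokens Kemper, Financial, Services printed above.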

We also provide the CoNLL format for those who prefer it, but it is not very pretty. These files can also be used for training. The CoNLL format will not be provided during evaluation.

In [15]:
for x in open('conll16st-en-01-12-16-trial/conll_format/wsj_1000.conll').readlines()[0:5]:
    print x[0:40]
0	0	0	Kemper	NNP	arg1	_	_	_	_	_	_	_	_	_	
1	0	1	Financial	NNP	arg1	_	_	_	_	_	_	_	_
2	0	2	Services	NNPS	arg1	_	_	_	_	_	_	_	_
3	0	3	Inc.	NNP	arg1	_	_	_	_	_	_	_	_	_	_	
4	0	4	,	,	arg1	_	_	_	_	_	_	_	_	_	_	_	_	_

Here is an explanation of each field for a document with n relations:

  • Document-level token index
  • Sentence index
  • Sentence-level token index
  • Token
  • POS tag
  • Relation 1 information
  • Relation 2 information
  • ...
  • Relation n information

The relation information field can take many forms:

  • arg1 : part of Arg1 of the relation
  • arg2 : part of Arg2 of the relation
  • conn|Comparison.Concession : part of the discourse connective, and the sense of the relation is Comparison.Concession (Explicit relations only)
  • arg2|EntRel : part of Arg2 of the relation, and the sense of the relation is EntRel (EntRel and NoRel relations only)
  • arg2|because|Contingency.Pragmatic cause : part of Arg2 of the relation, where because is the implicit connective and the sense of the relation is Contingency.Pragmatic cause (Implicit relations only)
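
For those who do use the CoNLL files, a line can be split into its fields as sketched below. The sketch assumes tab-separated columns in the order seen in the sample output above (document-level token index, sentence index, sentence-level token index, token, POS tag, then one column per relation) and treats _ as "not part of this relation". The function name parse_conll_line is our own.

```python
def parse_conll_line(line):
    """Split one CoNLL line into its fixed fields and its relation columns."""
    fields = line.rstrip('\n').split('\t')
    doc_index, sentence_index, token_index = (int(x) for x in fields[:3])
    token, pos = fields[3], fields[4]
    relations = {}
    for column, value in enumerate(fields[5:]):
        # Skip empty trailing columns and the '_' filler.
        if value and value != '_':
            # e.g. 'conn|Comparison.Concession' -> ['conn', 'Comparison.Concession']
            relations[column] = value.split('|')
    return doc_index, sentence_index, token_index, token, pos, relations
```

The keys of the returned relations dictionary are column positions (0 for the first relation column), not relation IDs.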

What should the system output look like?

The system output must be in JSON format. It is very similar to the training set except for the TokenList field, which is now a list of document-level token indices. If the relation is not explicit, the Connective field must still be present, and its TokenList must be an empty list. You may, however, add any other fields to the JSON to help you debug or develop your system. Below is an example of a relation given by a system.

You can also run the sample parser:

python sample_parser.py conll16st-en-01-12-16-trial inputrun tutorial.

In [24]:
output_relations = [json.loads(x) for x in codecs.open('output.json', encoding='utf8')]
output_relations[10]
Out[24]:
{u'Arg1': {u'TokenList': [275,
   276,
   277,
   278,
   279,
   280,
   281,
   282,
   283,
   284,
   285,
   286,
   287,
   288,
   289,
   290,
   291,
   292,
   293,
   294,
   295,
   296,
   297,
   298,
   299,
   300,
   301,
   302,
   303,
   304,
   305,
   306,
   307,
   308,
   309,
   310,
   311,
   312,
   313,
   314,
   315,
   316,
   317,
   318,
   319,
   320,
   321,
   322,
   323,
   324,
   325,
   326,
   327]},
 u'Arg2': {u'TokenList': [329,
   330,
   331,
   332,
   333,
   334,
   335,
   336,
   337,
   338,
   339,
   340,
   341,
   342,
   343,
   344,
   345,
   346,
   347,
   348,
   349,
   350,
   351,
   352,
   353,
   354,
   355,
   356,
   357,
   358,
   359,
   360,
   361,
   362,
   363]},
 u'Connective': {u'TokenList': []},
 u'DocID': u'wsj_1000',
 u'Sense': [u'Expansion.Conjunction'],
 u'Type': u'Implicit'}
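
To see how the output format relates to the training format, the sketch below converts a relation in the relations.json format into the output format by keeping only the document-level token index (the third item of each token address). The function names are our own, and the one-relation-per-line layout follows the way output.json is read above.

```python
import json


def to_system_output(relation):
    """Reduce a relations.json-style relation to the system output format."""
    output = {
        'DocID': relation['DocID'],
        'Sense': relation['Sense'],
        'Type': relation['Type'],
    }
    for part in ('Arg1', 'Arg2', 'Connective'):
        # Each token address is (char begin, char end, doc token, sentence, sentence token).
        output[part] = {
            'TokenList': [address[2] for address in relation[part]['TokenList']]
        }
    return output


def to_json_lines(relations):
    """Serialize relations as one JSON object per line."""
    return ''.join(json.dumps(to_system_output(r)) + '\n' for r in relations)


# The explicit example relation from relations.json, with text fields omitted.
gold = {
    'Arg1': {'TokenList': [[2493, 2496, 465, 15, 8], [2497, 2501, 466, 15, 9],
                           [2502, 2506, 467, 15, 10], [2507, 2509, 468, 15, 11],
                           [2510, 2514, 469, 15, 12], [2515, 2517, 470, 15, 13]]},
    'Arg2': {'TokenList': [[2526, 2530, 472, 15, 15], [2530, 2533, 473, 15, 16],
                           [2534, 2541, 474, 15, 17], [2542, 2545, 475, 15, 18],
                           [2546, 2552, 476, 15, 19]]},
    'Connective': {'TokenList': [[2518, 2525, 471, 15, 14]]},
    'DocID': 'wsj_1000',
    'Sense': ['Contingency.Cause.Reason'],
    'Type': 'Explicit',
}
converted = to_system_output(gold)
```

Here converted['Connective'] is {'TokenList': [471]}, and a relation with an empty connective TokenList keeps an empty list.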

Validator and scorer

Suppose you already have a system and you want to evaluate it. We provide validator.py and scorer.py to help you validate the format of the system output and evaluate the system, respectively. These utility scripts can be downloaded from the CoNLL Shared Task GitHub repository. The usage is documented in the scripts themselves.
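
While developing, a quick local check of each output relation can catch obvious format slips early. The sketch below is not validator.py and does not replace it: the function name format_problems is our own, and it only checks the requirements stated in the output format section (the required fields, TokenList as a list of document-level indices, and an empty Connective TokenList for non-explicit relations).

```python
def format_problems(relation):
    """Return a list of human-readable format problems for one output relation."""
    problems = []
    for part in ('Arg1', 'Arg2', 'Connective'):
        token_list = relation.get(part, {}).get('TokenList')
        if not isinstance(token_list, list):
            problems.append('%s.TokenList is missing or not a list' % part)
        elif not all(isinstance(index, int) for index in token_list):
            problems.append('%s.TokenList must hold document-level token indices' % part)
    for field in ('DocID', 'Sense', 'Type'):
        if field not in relation:
            problems.append('%s is missing' % field)
    connective_tokens = relation.get('Connective', {}).get('TokenList')
    if relation.get('Type') != 'Explicit' and connective_tokens:
        problems.append('Connective.TokenList should be empty for non-explicit relations')
    return problems
```

An empty list means that none of these particular checks failed; the official validator remains the reference.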

That should be all that you need! Let's get the fun started.

If you find any errors or have suggestions, please post to the forum or email the organizing committee at [email protected]. We hope you enjoy solving this challenging task of shallow discourse parsing. Together, we can make progress in understanding discourse phenomena.