The data format is identical to the one we used last year, but we made slight changes to some of the file names in the package to avoid confusion with last year's files. The package name indicates the language (en or zh), the date of creation (MM-DD-YY), and the data split (train, dev, trial, etc.). Once you unpack the package, you can expect the following files and folders:
parses.json - The input file for the main task and the supplementary task (pdtb-parses.json in 2015)
relations-no-senses.json - The input file for the supplementary task (new this year)
relations.json - The gold standard discourse relations (pdtb-data.json in 2015)
raw/DocID - Plain text file. One file per document. No extension. The file name matches the DocID field in relations.json and the key in parses.json.
conll_format/DocID.conll - CoNLL format for the training data (one .conll file per document)

We will show you how to work with each of these files in order to train your systems for the main task and the supplementary task in the language of your choice.
ls -l conll16st-en-01-12-16-trial
total 496
drwxr-xr-x+ 3 te staff    102 Jan 12 09:42 conll_format/
-rw-r--r--+ 1 te staff   9950 Jan 13 11:42 output.json
-rw-r--r--+ 1 te staff 150222 Jan 12 09:40 parses.json
drwxr-xr-x+ 3 te staff    102 Jan 12 09:42 raw/
-rw-r--r--+ 1 te staff  41739 Jan 12 09:42 relations-no-senses.json
-rw-r--r--+ 1 te staff  42610 Jan 12 09:40 relations.json
relations.json: Gold standard discourse relation annotation
This file is from the Penn Discourse Treebank (PDTB) for English and the Chinese Discourse Treebank (CDTB) for Chinese. It contains the gold standard annotations for both the main task and the supplementary task. Each line in the file is a JSON object. In Python, you can turn it into a dictionary. Similarly, you can turn it into a HashMap in Java. But please do not use regex to parse JSON. Your system will most likely break during evaluation.
The dictionary describes the following components of a relation:
Arg1: the text span of Arg1 of the relation
Arg2: the text span of Arg2 of the relation
Connective: the text span of the connective of the relation
DocID: the id of the document that contains the relation
ID: the relation id, which is unique across training, dev, and test sets
Sense: the sense of the relation
Type: the type of relation (Explicit, Implicit, Entrel, AltLex, or NoRel)

The text span is in the same format for Arg1, Arg2, and Connective. A text span has the following fields:
CharacterSpanList: the list of character offsets (beginning, end) in the raw untokenized data file
RawText: the raw untokenized text of the span
TokenList: the list of the addresses of the tokens, each in the form (character offset begin, character offset end, token offset within the document, sentence offset, token offset within the sentence)
For example,
import json
import codecs
pdtb_file = codecs.open('conll16st-en-01-12-16-trial/relations.json', encoding='utf8')
relations = [json.loads(x) for x in pdtb_file]
example_relation = relations[10]
example_relation
{u'Arg1': {u'CharacterSpanList': [[2493, 2517]], u'RawText': u'and told them to cool it', u'TokenList': [[2493, 2496, 465, 15, 8], [2497, 2501, 466, 15, 9], [2502, 2506, 467, 15, 10], [2507, 2509, 468, 15, 11], [2510, 2514, 469, 15, 12], [2515, 2517, 470, 15, 13]]}, u'Arg2': {u'CharacterSpanList': [[2526, 2552]], u'RawText': u"they're ruining the market", u'TokenList': [[2526, 2530, 472, 15, 15], [2530, 2533, 473, 15, 16], [2534, 2541, 474, 15, 17], [2542, 2545, 475, 15, 18], [2546, 2552, 476, 15, 19]]}, u'Connective': {u'CharacterSpanList': [[2518, 2525]], u'RawText': u'because', u'TokenList': [[2518, 2525, 471, 15, 14]]}, u'DocID': u'wsj_1000', u'ID': 14887, u'Sense': [u'Contingency.Cause.Reason'], u'Type': u'Explicit'}
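The character offsets in CharacterSpanList index into the raw text file under raw/. In the example above they behave like a Python slice (2517 - 2493 is exactly the length of the Arg1 RawText), so a span can be recovered roughly as sketched below. The helper name span_text is made up for illustration, and joining discontinuous spans with a single space is an assumption, not something the data format specifies.

```python
def span_text(raw_text, character_span_list):
    """Return the text covered by a CharacterSpanList.

    Hypothetical helper: each [begin, end] pair is treated as a Python
    slice, and discontinuous spans are joined with a space (an assumption).
    """
    return ' '.join(raw_text[begin:end] for begin, end in character_span_list)


# Stand-in for the contents of raw/wsj_1000: padding followed by the text
# of the example relation, so that the offsets from the example line up.
raw_text = ' ' * 2493 + "and told them to cool it because they're ruining the market"

arg1_text = span_text(raw_text, [[2493, 2517]])
connective_text = span_text(raw_text, [[2518, 2525]])
arg2_text = span_text(raw_text, [[2526, 2552]])
```

With the real data, raw_text would instead be read from the raw file, for instance with codecs.open('conll16st-en-01-12-16-trial/raw/wsj_1000', encoding='utf8').read().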
The Chinese data and the English data are identical in format, except that the Chinese data have one extra field, Punctuation. Punctuation marks in Chinese have some discourse functions, so they are annotated as well, but you are not required to detect them as part of the task. From a linguistic perspective, discourse annotation in Chinese differs quite a bit from English. Please refer to the original paper on the Chinese Discourse Treebank.
data = codecs.open('conll16st-zh-01-08-2016-trial/relations.json', encoding='utf8')
chinese_relations = [json.loads(x) for x in data]
chinese_relations[13]
{u'Arg1': {u'CharacterSpanList': [[500, 511]], u'RawText': u'\u6210\u4ea4 \u836f\u54c1 \u4e00\u4ebf\u591a \u5143', u'TokenList': [[500, 502, 187, 5, 27], [503, 505, 188, 5, 28], [506, 509, 189, 5, 29], [510, 511, 190, 5, 30]]}, u'Arg2': {u'CharacterSpanList': [[514, 526]], u'RawText': u'\u6ca1\u6709 \u53d1\u73b0 \u4e00 \u4f8b \u56de\u6263', u'TokenList': [[514, 516, 192, 5, 32], [517, 519, 193, 5, 33], [520, 521, 194, 5, 34], [522, 523, 195, 5, 35], [524, 526, 196, 5, 36]]}, u'Connective': {u'CharacterSpanList': [], u'RawText': u'', u'TokenList': []}, u'DocID': u'chtb_0001', u'ID': 13, u'Punctuation': {u'CharacterSpanList': [[512, 513]], u'PunctuationType': u'Comma', u'RawText': u'\uff0c', u'TokenList': [[512, 513, 191, 5, 31]]}, u'Sense': [u'Conjunction'], u'Type': u'Implicit'}
print 'Arg1 : %s\nArg2 : %s' % (chinese_relations[13]['Arg1']['RawText'], chinese_relations[13]['Arg2']['RawText'])
Arg1 : 成交 药品 一亿多 元
Arg2 : 没有 发现 一 例 回扣
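Because Punctuation only exists in the Chinese data, code that handles both languages has to treat the field as optional. Here is a minimal sketch using trimmed-down copies of the two example relations; the function name punctuation_text is hypothetical.

```python
def punctuation_text(relation):
    """Return the RawText of the Punctuation field, or None if it is absent."""
    punctuation = relation.get('Punctuation')
    if punctuation is None:
        return None
    return punctuation['RawText']


# Trimmed-down copies of the example relations (most fields omitted).
zh_relation = {
    'DocID': 'chtb_0001',
    'Type': 'Implicit',
    'Punctuation': {'CharacterSpanList': [[512, 513]],
                    'PunctuationType': 'Comma',
                    'RawText': u'\uff0c',
                    'TokenList': [[512, 513, 191, 5, 31]]},
}
en_relation = {'DocID': 'wsj_1000', 'Type': 'Explicit'}

zh_punct = punctuation_text(zh_relation)   # the full-width comma
en_punct = punctuation_text(en_relation)   # None, the field does not exist
```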
parses.json: Input for the main task and the supplementary task
This is the file that your system will have to process during evaluation.
The automatic parses and part-of-speech tags are provided in this file.
Note that, unlike the discourse relation json file, this file contains only one line.
Suppose we want the parse for the sentence in the relation above, which is sentence #15 as shown in TokenList.
parse_file = codecs.open('conll16st-en-01-12-16-trial/parses.json', encoding='utf8')
en_parse_dict = json.load(parse_file)
en_example_relation = relations[10]
en_doc_id = en_example_relation['DocID']
print en_parse_dict[en_doc_id]['sentences'][15]['parsetree']
( (S (NP (PRP We)) (VP (VBP 've) (VP (VP (VBN talked) (PP (TO to) (NP (NP (NNS proponents)) (PP (IN of) (NP (NN index) (NN arbitrage)))))) (CC and) (VP (VBD told) (NP (PRP them)) (S (VP (TO to) (VP (VB cool) (NP (PRP it)) (SBAR (IN because) (S (NP (PRP they)) (VP (VBP 're) (VP (VBG ruining) (NP (DT the) (NN market)))))))))))) (. .)) )
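Rather than hard-coding the sentence number, you can read it from the fourth element of each TokenList entry. The sketch below collects the sentence offsets that a text span touches. The name sentence_indices is a hypothetical helper, and the token addresses are copied from the example relation.

```python
def sentence_indices(text_span):
    """Return the sorted sentence offsets covered by a text span.

    Each TokenList entry is (char begin, char end, token offset in the
    document, sentence offset, token offset in the sentence).
    """
    return sorted(set(address[3] for address in text_span['TokenList']))


# Arg1 of the example relation, copied from relations.json.
arg1 = {
    'RawText': 'and told them to cool it',
    'TokenList': [[2493, 2496, 465, 15, 8], [2497, 2501, 466, 15, 9],
                  [2502, 2506, 467, 15, 10], [2507, 2509, 468, 15, 11],
                  [2510, 2514, 469, 15, 12], [2515, 2517, 470, 15, 13]],
}

arg1_sentences = sentence_indices(arg1)
```

Each index in arg1_sentences can then be used in place of the literal 15 when looking up en_parse_dict[en_doc_id]['sentences'].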
parse_file = codecs.open('conll16st-zh-01-08-2016-trial/parses.json', encoding='utf8')
zh_parse_dict = json.load(parse_file)
zh_example_relation = chinese_relations[13]
zh_doc_id = zh_example_relation['DocID']
print zh_parse_dict[zh_doc_id]['sentences'][5]['parsetree']
( (IP (NP (CP (IP (LCP (NP (NT 去年)) (LC 初)) (NP (NP (NR 浦东)) (NP (NN 新区))) (VP (VV 诞生))) (DEC 的)) (NP (NP (NR 中国)) (QP (OD 第一) (CLP (M 家))) (NP (NN 医疗) (NN 机构))) (NP (NN 药品) (NN 采购) (NN 服务) (NN 中心))) (PU ,) (VP (VP (PP (ADVP (AD 正)) (PP (P 因为) (IP (IP (VP (ADVP (AD 一)) (VP (VV 开始)))) (VP (ADVP (AD 就)) (ADVP (AD 比较)) (VP (VA 规范)))))) (PU ,) (VP (VV 运转) (IP (VP (ADVP (AD 至今)) (PU ,) (VP (VV 成交) (NP (NN 药品)) (QP (CD 一亿多) (CLP (M 元)))))))) (PU ,) (VP (ADVP (AD 没有)) (VP (VV 发现) (NP (QP (CD 一) (CLP (M 例))) (NP (NN 回扣)))))) (PU 。)) )
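The parsetree value is a bracketed tree stored as a string. When only the tokens and their part-of-speech tags are needed, the leaves can be pulled out with a regular expression instead of a full tree reader, as sketched below. This assumes, based on the two examples above, that every leaf looks like (TAG token) and that tokens never contain whitespace or literal parentheses. The name tagged_leaves is hypothetical.

```python
import re

# A leaf is "(TAG token)" where neither part contains spaces or parentheses.
LEAF_PATTERN = re.compile(r'\(([^\s()]+) ([^\s()]+)\)')


def tagged_leaves(parse_tree):
    """Return the (token, tag) pairs of a bracketed parse tree, left to right."""
    return [(token, tag) for tag, token in LEAF_PATTERN.findall(parse_tree)]


# A shortened version of the English example tree.
short_tree = "( (S (NP (PRP We)) (VP (VBP 've) (VP (VBN talked))) (. .)) )"
leaves = tagged_leaves(short_tree)
```

For anything that needs the tree structure itself (constituents, paths, heads), a proper tree reader is the safer choice. This shortcut only recovers the leaves.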
en_parse_dict[en_doc_id]['sentences'][15]['dependencies']
[[u'nsubj', u'talked-3', u'We-1'], [u'aux', u'talked-3', u"'ve-2"], [u'root', u'ROOT-0', u'talked-3'], [u'prep', u'talked-3', u'to-4'], [u'pobj', u'to-4', u'proponents-5'], [u'prep', u'proponents-5', u'of-6'], [u'nn', u'arbitrage-8', u'index-7'], [u'pobj', u'of-6', u'arbitrage-8'], [u'cc', u'talked-3', u'and-9'], [u'conj', u'talked-3', u'told-10'], [u'dobj', u'told-10', u'them-11'], [u'aux', u'cool-13', u'to-12'], [u'xcomp', u'told-10', u'cool-13'], [u'dobj', u'cool-13', u'it-14'], [u'mark', u'ruining-18', u'because-15'], [u'nsubj', u'ruining-18', u'they-16'], [u'aux', u'ruining-18', u"'re-17"], [u'advcl', u'cool-13', u'ruining-18'], [u'det', u'market-20', u'the-19'], [u'dobj', u'ruining-18', u'market-20']]
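Each dependency is a three-element list. Judging from the output above (for example root, ROOT-0, talked-3), the order appears to be relation label, head, dependent, and each node is a token joined by a hyphen to its position in the sentence, with ROOT at position 0. That reading is inferred from the example rather than stated by the data description. A minimal sketch for splitting the nodes, with hypothetical helper names:

```python
def split_node(node):
    """Split 'talked-3' into ('talked', 3).

    The split is done on the last hyphen so that hyphenated tokens
    such as 'stock-trading-5' keep their internal hyphens.
    """
    word, index = node.rsplit('-', 1)
    return word, int(index)


def parse_dependency(triple):
    """Turn [label, head, dependent] into (label, (word, i), (word, j))."""
    label, head, dependent = triple
    return label, split_node(head), split_node(dependent)


# The first three dependencies from the example sentence.
dependencies = [['nsubj', 'talked-3', 'We-1'],
                ['aux', 'talked-3', "'ve-2"],
                ['root', 'ROOT-0', 'talked-3']]
parsed = [parse_dependency(triple) for triple in dependencies]
```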
You can iterate over the tokens of a sentence through its words field. Note that the Linkers field indicates whether a token is part of an Arg or not. Each entry has a form such as arg1_ID, where the ID corresponds to the ID field in the relation json.
en_parse_dict[en_doc_id]['sentences'][15]['words'][0]
[u'We', {u'CharacterOffsetBegin': 2447, u'CharacterOffsetEnd': 2449, u'Linkers': [u'arg2_14886', u'arg1_14888'], u'PartOfSpeech': u'PRP'}]
en_parse_dict[en_doc_id]['sentences'][15]['words'][1]
[u"'ve", {u'CharacterOffsetBegin': 2449, u'CharacterOffsetEnd': 2452, u'Linkers': [u'arg2_14886', u'arg1_14888'], u'PartOfSpeech': u'VBP'}]
relations-no-senses.json: Input for the supplementary task
The systems participating in the supplementary task (sense classification) take this file as input. The file is the same as relations.json, but the Type and Sense fields are left empty. This is the same for Chinese and English except for the Punctuation field.
supp_data = open('conll16st-en-01-12-16-trial/relations-no-senses.json')
relations_no_senses = [json.loads(x) for x in supp_data]
relations_no_senses[10]
{u'Arg1': {u'CharacterSpanList': [[2493, 2517]], u'RawText': u'and told them to cool it', u'TokenList': [[2493, 2496, 465, 15, 8], [2497, 2501, 466, 15, 9], [2502, 2506, 467, 15, 10], [2507, 2509, 468, 15, 11], [2510, 2514, 469, 15, 12], [2515, 2517, 470, 15, 13]]}, u'Arg2': {u'CharacterSpanList': [[2526, 2552]], u'RawText': u"they're ruining the market", u'TokenList': [[2526, 2530, 472, 15, 15], [2530, 2533, 473, 15, 16], [2534, 2541, 474, 15, 17], [2542, 2545, 475, 15, 18], [2546, 2552, 476, 15, 19]]}, u'Connective': {u'CharacterSpanList': [[2518, 2525]], u'RawText': u'because', u'TokenList': [[2518, 2525, 471, 15, 14]]}, u'DocID': u'wsj_1000', u'ID': 14887, u'Sense': [], u'Type': u''}
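A sense classifier for the supplementary task therefore only has to fill in Sense and Type for each relation. The sketch below is a deliberately naive placeholder, not a recommended system. It labels a relation Explicit when the Connective span has tokens and Implicit otherwise, and it always predicts the sense Expansion.Conjunction, which is used here purely as a dummy default. The function name fill_in_baseline is hypothetical.

```python
import copy


def fill_in_baseline(relation, default_sense='Expansion.Conjunction'):
    """Return a copy of the relation with Type and Sense filled in naively."""
    filled = copy.deepcopy(relation)
    has_connective = len(relation['Connective']['TokenList']) > 0
    filled['Type'] = 'Explicit' if has_connective else 'Implicit'
    filled['Sense'] = [default_sense]
    return filled


# A trimmed-down relation in the relations-no-senses.json layout.
unlabeled = {
    'Arg1': {'TokenList': [[2493, 2496, 465, 15, 8]]},
    'Arg2': {'TokenList': [[2526, 2530, 472, 15, 15]]},
    'Connective': {'TokenList': [[2518, 2525, 471, 15, 14]]},
    'DocID': 'wsj_1000',
    'ID': 14887,
    'Sense': [],
    'Type': '',
}
labeled = fill_in_baseline(unlabeled)
```

A real system would replace the dummy default with a trained classifier over features of Arg1, Arg2, and the connective.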
The JSON format makes your code much more readable than a bunch of opaque column indices would.
The CoNLL format of this dataset is wicked sparse. Here's our suggested way to get something similar.
You can use the Linkers field in each token dictionary. Here's an example.
all_tokens = [token for sentence in en_parse_dict[en_doc_id]['sentences'] for token in sentence['words']]
for token in all_tokens[0:20]:
    for linker in token[1]['Linkers']:
        role, relation_id = linker.split('_')
        print '%s \t is part of %s in relation id %s' % (token[0], role, relation_id)
Kemper is part of arg1 in relation id 14890
Financial is part of arg1 in relation id 14890
Services is part of arg1 in relation id 14890
Inc. is part of arg1 in relation id 14890
, is part of arg1 in relation id 14890
charging is part of arg1 in relation id 14890
that is part of arg1 in relation id 14890
program is part of arg1 in relation id 14890
trading is part of arg1 in relation id 14890
is is part of arg1 in relation id 14890
ruining is part of arg1 in relation id 14890
the is part of arg1 in relation id 14890
stock is part of arg1 in relation id 14890
market is part of arg1 in relation id 14890
, is part of arg1 in relation id 14890
cut is part of arg1 in relation id 14890
off is part of arg1 in relation id 14890
four is part of arg1 in relation id 14890
big is part of arg1 in relation id 14890
Wall is part of arg1 in relation id 14890
print 'Relation ID is %s' % relations[13]['ID']
print 'Arg 1 : %s' % relations[13]['Arg1']['RawText']
Relation ID is 14890
Arg 1 : Kemper Financial Services Inc., charging that program trading is ruining the stock market, cut off four big Wall Street firms from doing any of its stock-trading business
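The same Linkers information can be collected once into a lookup table, so that the tokens of any argument are available by relation id. The sketch below groups tokens by (relation id, role). The helper name tokens_by_relation is hypothetical, and the two sample tokens are copied from the words output shown earlier.

```python
from collections import defaultdict


def tokens_by_relation(words):
    """Map (relation_id, role) to the list of token strings, in input order."""
    grouped = defaultdict(list)
    for token, info in words:
        for linker in info['Linkers']:
            role, relation_id = linker.split('_')
            grouped[(int(relation_id), role)].append(token)
    return dict(grouped)


# The first two tokens of sentence 15 in wsj_1000, as shown earlier.
words = [
    ['We', {'CharacterOffsetBegin': 2447, 'CharacterOffsetEnd': 2449,
            'Linkers': ['arg2_14886', 'arg1_14888'], 'PartOfSpeech': 'PRP'}],
    ["'ve", {'CharacterOffsetBegin': 2449, 'CharacterOffsetEnd': 2452,
             'Linkers': ['arg2_14886', 'arg1_14888'], 'PartOfSpeech': 'VBP'}],
]
grouped = tokens_by_relation(words)
```

Passing all_tokens instead of the two sample tokens would build the table for the whole document.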
We also provide the CoNLL format for those who prefer it, but it is not very pretty. These files can also be used for training. The CoNLL format will not be provided during evaluation.
for x in open('conll16st-en-01-12-16-trial/conll_format/wsj_1000.conll').readlines()[0:5]:
    print x[0:40]
0 0 0 Kemper NNP arg1 _ _ _ _ _ _ _ _ _
1 0 1 Financial NNP arg1 _ _ _ _ _ _ _ _
2 0 2 Services NNPS arg1 _ _ _ _ _ _ _ _
3 0 3 Inc. NNP arg1 _ _ _ _ _ _ _ _ _ _
4 0 4 , , arg1 _ _ _ _ _ _ _ _ _ _ _ _ _
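The relation information values in these files (for instance arg1, conn|Comparison.Concession, or arg2|because|Contingency.Pragmatic cause) are pipe-separated. A minimal sketch for splitting one such value is given below. Treating _ as "not part of this relation" is an assumption read off the sample rows, and the function name parse_relation_info is hypothetical.

```python
def parse_relation_info(value):
    """Split one relation-information value from a .conll file.

    Returns None for the '_' placeholder, otherwise a dict with the role
    and, when present, the connective and the sense.
    """
    if value == '_':
        return None
    parts = value.split('|')
    info = {'role': parts[0], 'connective': None, 'sense': None}
    if len(parts) == 2:
        # e.g. conn|Comparison.Concession or arg2|EntRel
        info['sense'] = parts[1]
    elif len(parts) == 3:
        # e.g. arg2|because|Contingency.Pragmatic cause
        info['connective'] = parts[1]
        info['sense'] = parts[2]
    return info


explicit_conn = parse_relation_info('conn|Comparison.Concession')
implicit_arg2 = parse_relation_info('arg2|because|Contingency.Pragmatic cause')
```

Note that a sense such as Contingency.Pragmatic cause contains a space, so splitting a whole row on whitespace may not be safe.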
Here's the explanation of each field if a document has n relations:
The relation information field can take many forms:
arg1 - part of Arg1 of the relation
arg2 - part of Arg2 of the relation
conn|Comparison.Concession - part of the discourse connective AND the sense of that relation is Comparison.Concession (Explicit relations only)
arg2|EntRel - part of Arg2 of the relation AND the sense of that relation is EntRel (Entrel and Norel relations only)
arg2|because|Contingency.Pragmatic cause - part of Arg2 (Implicit relations only)

The system output must be in json format. It is very similar to the training set except for the TokenList field.
The TokenList field is now a list of document-level token indices.
If the relation is not explicit, the Connective field must still be there, and its TokenList must be an empty list.
You may, however, add whatever fields you like to the json to help yourself debug or develop the system.
Below is an example of a relation given by a system.
You can also run the sample parser:
python sample_parser.py conll16st-en-01-12-16-trial inputrun tutorial
output_relations = [json.loads(x) for x in codecs.open('output.json', encoding='utf8')]
output_relations[10]
{u'Arg1': {u'TokenList': [275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285, 286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299, 300, 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 312, 313, 314, 315, 316, 317, 318, 319, 320, 321, 322, 323, 324, 325, 326, 327]}, u'Arg2': {u'TokenList': [329, 330, 331, 332, 333, 334, 335, 336, 337, 338, 339, 340, 341, 342, 343, 344, 345, 346, 347, 348, 349, 350, 351, 352, 353, 354, 355, 356, 357, 358, 359, 360, 361, 362, 363]}, u'Connective': {u'TokenList': []}, u'DocID': u'wsj_1000', u'Sense': [u'Expansion.Conjunction'], u'Type': u'Implicit'}
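To see how the output format relates to the gold format, the sketch below converts a gold relation into the output layout by keeping only the document-level token offset (the third element) of each TokenList entry. This is an illustration of the format, not part of the official tooling, and to_output_format is a hypothetical name.

```python
def to_output_format(relation):
    """Convert a gold-style relation into the system output layout.

    Each TokenList entry (char begin, char end, document token offset,
    sentence offset, sentence token offset) is reduced to its document
    token offset. An empty Connective TokenList stays an empty list.
    """
    output = {
        'DocID': relation['DocID'],
        'Sense': relation['Sense'],
        'Type': relation['Type'],
    }
    for field in ('Arg1', 'Arg2', 'Connective'):
        token_list = relation[field]['TokenList']
        output[field] = {'TokenList': [address[2] for address in token_list]}
    return output


# The English example relation, with RawText and CharacterSpanList omitted.
gold = {
    'Arg1': {'TokenList': [[2493, 2496, 465, 15, 8], [2497, 2501, 466, 15, 9],
                           [2502, 2506, 467, 15, 10], [2507, 2509, 468, 15, 11],
                           [2510, 2514, 469, 15, 12], [2515, 2517, 470, 15, 13]]},
    'Arg2': {'TokenList': [[2526, 2530, 472, 15, 15], [2530, 2533, 473, 15, 16],
                           [2534, 2541, 474, 15, 17], [2542, 2545, 475, 15, 18],
                           [2546, 2552, 476, 15, 19]]},
    'Connective': {'TokenList': [[2518, 2525, 471, 15, 14]]},
    'DocID': 'wsj_1000',
    'Sense': ['Contingency.Cause.Reason'],
    'Type': 'Explicit',
}
converted = to_output_format(gold)
```

Each converted relation can then be written as one line of the output file with json.dumps.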
Suppose you already have a system and you want to evaluate it.
We provide validator.py and scorer.py to help you validate the format of the system output and evaluate the system, respectively.
These utilities can be downloaded from the CoNLL Shared Task GitHub repository.
Usage instructions are included in the scripts.
If you find any errors or have suggestions, please post to the forum or email the organizing committee at conll16st@gmail.com.
We hope you enjoy solving this challenging task of shallow discourse parsing.
Together, we can make progress in understanding discourse phenomena.