%%html
<script>
function code_toggle() {
if (code_shown){
$('div.input').hide('500');
$('#toggleButton').val('Show Code')
} else {
$('div.input').show('500');
$('#toggleButton').val('Hide Code')
}
code_shown = !code_shown
}
$( document ).ready(function(){
code_shown=false;
$('div.input').hide()
});
</script>
<form action="javascript:code_toggle()"><input type="submit" id="toggleButton" value="Show Code"></form>
<style>
.rendered_html td {
font-size: xx-large;
text-align: left !important;
}
.rendered_html th {
font-size: xx-large;
text-align: left !important;
}
</style>
%%capture
%load_ext autoreload
%autoreload 2
import sys
sys.path.append("..")
from statnlpbook.util import execute_notebook
import statnlpbook.parsing as parsing
from statnlpbook.transition import *
from statnlpbook.dep import *
import pandas as pd
from io import StringIO
from IPython.display import display, HTML
execute_notebook('transition-based_dependency_parsing.ipynb')
%load_ext tikzmagic
Parsing motivation
Background: parsing (10 min.)
Exercise: multi-word expressions (10 min.)
Background: Universal Dependencies (5 min.)
Background: transition-based parsing (10 min.)
Break (10 min.)
Example: transition-based parsing (5 min.)
Motivation: natural language understanding (5 min.)
Background: learning to parse (10 min.)
Math: dependency parsing evaluation (5 min.)
Examples: dependency parsers (5 min.)
Background: semantic parsing (15 min.)
Dechra Pharmaceuticals, which has just made its second acquisition, had previously purchased Genitrix.
Trinity Mirror plc, the largest British newspaper, purchased Local World, its rival.
Kraft, owner of Milka, purchased Cadbury Dairy Milk and is now gearing up for a roll-out of its new brand.
Parsing is the process of constructing these graphs:
How is this done?
Task: determine the syntactic relations between words
Kraft, owner of Milka, purchased Cadbury Dairy Milk and is now gearing up for a roll-out of its new brand.
conllu = """
# ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
1 Kraft Kraft NOUN NN _ 7 nsubj _ _
2 , , PUNCT , _ 1 punct _ _
3 owner owner NOUN NN _ 1 appos _ _
4 of of ADP IN _ 5 case _ _
5 Milka Milka PROPN NNP _ 3 nmod _ _
6 , , PUNCT , _ 7 punct _ _
7 purchased purchase VERB VBD _ 0 root _ _
8 Cadbury Cadbury PROPN NNP _ 7 dobj _ _
9 Dairy Dairy PROPN NNP _ 8 flat _ _
10 Milk milk PROPN NNP _ 8 flat _ _
"""
arcs, tokens = to_displacy_graph(*load_arcs_tokens(conllu))
render_displacy(arcs, tokens,"1200px")
(in CoNLL-U Format)
conllu = """
# ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
1 Alice _ _ _ _ 2 nsubj _ _
2 saw _ _ _ _ 0 root _ _
3 Bob _ _ _ _ 2 dobj _ _
"""
display(HTML(pd.read_csv(StringIO(conllu), sep="\t").to_html(index=False)))
arcs, tokens = to_displacy_graph(*load_arcs_tokens(conllu))
render_displacy(arcs, tokens,"900px")
# ID | FORM | LEMMA | UPOS | XPOS | FEATS | HEAD | DEPREL | DEPS | MISC |
---|---|---|---|---|---|---|---|---|---|
1 | Alice | _ | _ | _ | _ | 2 | nsubj | _ | _ |
2 | saw | _ | _ | _ | _ | 0 | root | _ | _ |
3 | Bob | _ | _ | _ | _ | 2 | dobj | _ | _ |
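The ten tab-separated CoNLL-U columns shown above can be read with a few lines of plain Python. The following is only a sketch: `read_conllu` and `to_arcs` are hypothetical helpers written for illustration (they are not the `statnlpbook` functions used in this notebook), and the code assumes a single sentence whose HEAD column is filled in, skipping multiword-token and empty-node lines.

```python
# Hypothetical helpers for illustration; not part of statnlpbook.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]

def read_conllu(text):
    """Read one CoNLL-U sentence into a list of token dicts."""
    tokens = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        if "-" in cols[0] or "." in cols[0]:
            continue  # skip multiword tokens (1-2) and empty nodes (1.1)
        token = dict(zip(FIELDS, cols))
        token["id"] = int(token["id"])
        token["head"] = int(token["head"])  # assumes HEAD is not "_"
        tokens.append(token)
    return tokens

def to_arcs(tokens):
    """Return (head, dependent, label) triples; head 0 stands for ROOT."""
    return [(t["head"], t["id"], t["deprel"]) for t in tokens]

rows = [
    ("1", "Alice", "_", "_", "_", "_", "2", "nsubj", "_", "_"),
    ("2", "saw",   "_", "_", "_", "_", "0", "root",  "_", "_"),
    ("3", "Bob",   "_", "_", "_", "_", "2", "dobj",  "_", "_"),
]
example = "\n".join("\t".join(r) for r in rows)
arcs = to_arcs(read_conllu(example))
print(arcs)
```

Each token contributes exactly one arc, because every word in a dependency tree has exactly one head.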
English and Danish are similar, while other languages are more distant:
conllu = """
# ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
1 Alice Alice PROPN _ _ 2 nsubj _ _
2 så se VERB _ _ 0 root _ _
3 Bob Bob PROPN _ _ 2 obj _ _
"""
display(HTML(pd.read_csv(StringIO(conllu), sep="\t").to_html(index=False)))
arcs, tokens = to_displacy_graph(*load_arcs_tokens(conllu))
render_displacy(arcs, tokens,"900px")
# ID | FORM | LEMMA | UPOS | XPOS | FEATS | HEAD | DEPREL | DEPS | MISC |
---|---|---|---|---|---|---|---|---|---|
1 | Alice | Alice | PROPN | _ | _ | 2 | nsubj | _ | _ |
2 | så | se | VERB | _ | _ | 0 | root | _ | _ |
3 | Bob | Bob | PROPN | _ | _ | 2 | obj | _ | _ |
conllu = """
# ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
1 앨리스는 앨리스+는 NOUN _ _ 3 nsubj _ _
2 밥을 밥+을 NOUN _ _ 3 obj _ _
3 보았다 보+았+다 VERB _ _ 0 root _ _
"""
display(HTML(pd.read_csv(StringIO(conllu), sep="\t").to_html(index=False)))
arcs, tokens = to_displacy_graph(*load_arcs_tokens(conllu))
render_displacy(arcs, tokens,"900px")
# ID | FORM | LEMMA | UPOS | XPOS | FEATS | HEAD | DEPREL | DEPS | MISC |
---|---|---|---|---|---|---|---|---|---|
1 | 앨리스는 | 앨리스+는 | NOUN | _ | _ | 3 | nsubj | _ | _ |
2 | 밥을 | 밥+을 | NOUN | _ | _ | 3 | obj | _ | _ |
3 | 보았다 | 보+았+다 | VERB | _ | _ | 0 | root | _ | _ |
conllu = """
# ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
1 Kraft Kraft NOUN NN _ 7 nsubj _ _
2 , , PUNCT , _ 1 punct _ _
3 owner owner NOUN NN _ 1 appos _ _
4 of of ADP IN _ 5 case _ _
5 Milka Milka PROPN NNP _ 3 nmod _ _
6 , , PUNCT , _ 7 punct _ _
7 purchased purchase VERB VBD _ 0 root _ _
8 Cadbury Cadbury PROPN NNP _ 7 dobj _ _
9 Dairy Dairy PROPN NNP _ 8 flat _ _
10 Milk milk PROPN NNP _ 8 flat _ _
"""
display(HTML(pd.read_csv(StringIO(conllu), sep="\t").to_html(index=False)))
arcs, tokens = to_displacy_graph(*load_arcs_tokens(conllu))
render_displacy(arcs, tokens,"1200px")
# ID | FORM | LEMMA | UPOS | XPOS | FEATS | HEAD | DEPREL | DEPS | MISC |
---|---|---|---|---|---|---|---|---|---|
1 | Kraft | Kraft | NOUN | NN | _ | 7 | nsubj | _ | _ |
2 | , | , | PUNCT | , | _ | 1 | punct | _ | _ |
3 | owner | owner | NOUN | NN | _ | 1 | appos | _ | _ |
4 | of | of | ADP | IN | _ | 5 | case | _ | _ |
5 | Milka | Milka | PROPN | NNP | _ | 3 | nmod | _ | _ |
6 | , | , | PUNCT | , | _ | 7 | punct | _ | _ |
7 | purchased | purchase | VERB | VBD | _ | 0 | root | _ | _ |
8 | Cadbury | Cadbury | PROPN | NNP | _ | 7 | dobj | _ | _ |
9 | Dairy | Dairy | PROPN | NNP | _ | 8 | flat | _ | _ |
10 | Milk | milk | PROPN | NNP | _ | 8 | flat | _ | _ |
| | Nominals | Clauses | Modifier words | Function Words |
|---|---|---|---|---|
| Core arguments | nsubj obj iobj | csubj ccomp xcomp | | |
| Non-core dependents | obl vocative expl dislocated | advcl | advmod discourse | aux cop mark |
| Nominal dependents | nmod appos nummod | acl | amod | det clf case |

| Coordination | MWE | Loose | Special | Other |
|---|---|---|---|---|
| conj cc | fixed flat compound | list parataxis | orphan goeswith reparandum | punct root dep |
conllu = """
# ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
1 Alice _ _ _ _ 2 nsubj _ _
2 saw _ _ _ _ 0 root _ _
3 Bob _ _ _ _ 2 dobj _ _
"""
display(HTML(pd.read_csv(StringIO(conllu), sep="\t").to_html(index=False)))
# ID | FORM | LEMMA | UPOS | XPOS | FEATS | HEAD | DEPREL | DEPS | MISC |
---|---|---|---|---|---|---|---|---|---|
1 | Alice | _ | _ | _ | _ | 2 | nsubj | _ | _ |
2 | saw | _ | _ | _ | _ | 0 | root | _ | _ |
3 | Bob | _ | _ | _ | _ | 2 | dobj | _ | _ |
A parser configuration consists of a buffer, a stack, and the set of arcs created so far.
Buffer: the tokens waiting to be processed.
render_transitions_displacy(transitions[0:1], tokenized_sentence)
stack | buffer | parse | action |
---|---|---|---|
ROOT | Alice saw Bob | | |
Stack: the tokens currently being processed.
render_transitions_displacy(transitions[2:3],tokenized_sentence)
stack | buffer | parse | action |
---|---|---|---|
ROOT Alice saw | Bob | | shift |
Parse: the tree built so far.
render_transitions_displacy(transitions[6:7], tokenized_sentence)
stack | buffer | parse | action |
---|---|---|---|
ROOT | | | rightArc-root |
We use the following actions:
shift: Push the word at the top of the buffer to the stack.
$$ (S, i|B, A)\rightarrow(S|i, B, A) $$
render_transitions_displacy(transitions[0:2], tokenized_sentence)
stack | buffer | parse | action |
---|---|---|---|
ROOT | Alice saw Bob | | |
ROOT Alice | saw Bob | | shift |
rightArc (with label $l$): Add a labeled arc from the second-topmost node of the stack, $i$, to the top of the stack, $j$. Pop the top of the stack.
$$ (S|i|j, B, A) \rightarrow (S|i, B, A\cup\{(i,j,l)\}) $$
render_transitions_displacy(transitions[4:7], tokenized_sentence)
stack | buffer | parse | action |
---|---|---|---|
ROOT saw Bob | | | shift |
ROOT saw | | | rightArc-dobj |
ROOT | | | rightArc-root |
leftArc (with label $l$): Add a labeled arc from the top of the stack, $j$, to the second-topmost node of the stack, $i$. Remove the second-topmost node from the stack.
$$ (S|i|j, B, A) \rightarrow (S|j, B, A\cup\{(j,i,l)\}) $$
render_transitions_displacy(transitions[2:4], tokenized_sentence)
stack | buffer | parse | action |
---|---|---|---|
ROOT Alice saw | Bob | | shift |
ROOT saw | Bob | | leftArc-nsubj |
render_transitions_displacy(transitions[:], tokenized_sentence)
stack | buffer | parse | action |
---|---|---|---|
ROOT | Alice saw Bob | | |
ROOT Alice | saw Bob | | shift |
ROOT Alice saw | Bob | | shift |
ROOT saw | Bob | | leftArc-nsubj |
ROOT saw Bob | | | shift |
ROOT saw | | | rightArc-dobj |
ROOT | | | rightArc-root |
ROOT | | | |
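The three actions and the trace above can be reproduced in a few lines of plain Python. This is a minimal sketch rather than the `statnlpbook.transition` implementation: `Configuration`, `initial`, `is_terminal` and `step` are names invented here, and nodes are identified by their word form, which only works when no word occurs twice in the sentence.

```python
from dataclasses import dataclass, field

@dataclass
class Configuration:
    """Illustrative (stack, buffer, arcs) triple; names are assumptions."""
    stack: list
    buffer: list
    arcs: set = field(default_factory=set)

def initial(tokens):
    # Only ROOT on the stack, the whole sentence in the buffer, no arcs.
    return Configuration(stack=["ROOT"], buffer=list(tokens))

def is_terminal(c):
    # Only ROOT left on the stack and nothing left in the buffer.
    return c.stack == ["ROOT"] and not c.buffer

def step(c, action):
    """Return the configuration reached by applying `action` to `c`."""
    stack, buffer, arcs = list(c.stack), list(c.buffer), set(c.arcs)
    if action == "shift":
        stack.append(buffer.pop(0))      # (S, i|B, A) -> (S|i, B, A)
    elif action.startswith("rightArc-"):
        label = action.split("-", 1)[1]
        j = stack.pop()                  # pop the top of the stack
        i = stack[-1]
        arcs.add((i, j, label))          # arc from i to j
    elif action.startswith("leftArc-"):
        label = action.split("-", 1)[1]
        j = stack.pop()
        i = stack.pop()                  # remove the second-topmost node
        arcs.add((j, i, label))          # arc from j to i
        stack.append(j)
    else:
        raise ValueError(f"unknown action: {action}")
    return Configuration(stack, buffer, arcs)

actions = ["shift", "shift", "leftArc-nsubj",
           "shift", "rightArc-dobj", "rightArc-root"]
config = initial(["Alice", "saw", "Bob"])
for action in actions:
    config = step(config, action)
    print(action, config.stack, config.buffer)
```

Using token indices instead of word forms would remove the restriction on repeated words; word forms are used here only to keep the printed trace comparable with the table.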
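Configuration: a triple $(S, B, A)$ of a stack $S$, a buffer $B$, and a set of arcs $A$.
We further define two special configurations: the initial configuration, with only ROOT on the stack, the whole sentence in the buffer, and no arcs; and the terminal configuration, with only ROOT on the stack and an empty buffer.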
conllu = """
# ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
1 I _ _ _ _ 2 nsubj _ _
2 saw _ _ _ _ 0 root _ _
3 the _ _ _ _ 4 det _ _
4 star _ _ _ _ 2 dobj _ _
5 with _ _ _ _ 7 case _ _
6 the _ _ _ _ 7 det _ _
7 telescope _ _ _ _ 2 obl _ _
"""
display(HTML(pd.read_csv(StringIO(conllu), sep="\t").to_html(index=False)))
arcs, tokens = to_displacy_graph(*load_arcs_tokens(conllu))
render_displacy(arcs, tokens,"900px")
# ID | FORM | LEMMA | UPOS | XPOS | FEATS | HEAD | DEPREL | DEPS | MISC |
---|---|---|---|---|---|---|---|---|---|
1 | I | _ | _ | _ | _ | 2 | nsubj | _ | _ |
2 | saw | _ | _ | _ | _ | 0 | root | _ | _ |
3 | the | _ | _ | _ | _ | 4 | det | _ | _ |
4 | star | _ | _ | _ | _ | 2 | dobj | _ | _ |
5 | with | _ | _ | _ | _ | 7 | case | _ | _ |
6 | the | _ | _ | _ | _ | 7 | det | _ | _ |
7 | telescope | _ | _ | _ | _ | 2 | obl | _ | _ |
conllu = """
# ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
1 I _ _ _ _ 2 nsubj _ _
2 saw _ _ _ _ 0 root _ _
3 the _ _ _ _ 4 det _ _
4 star _ _ _ _ 2 dobj _ _
5 with _ _ _ _ 7 case _ _
6 the _ _ _ _ 7 det _ _
7 telescope _ _ _ _ 4 nmod _ _
"""
display(HTML(pd.read_csv(StringIO(conllu), sep="\t").to_html(index=False)))
arcs, tokens = to_displacy_graph(*load_arcs_tokens(conllu))
render_displacy(arcs, tokens,"900px")
# ID | FORM | LEMMA | UPOS | XPOS | FEATS | HEAD | DEPREL | DEPS | MISC |
---|---|---|---|---|---|---|---|---|---|
1 | I | _ | _ | _ | _ | 2 | nsubj | _ | _ |
2 | saw | _ | _ | _ | _ | 0 | root | _ | _ |
3 | the | _ | _ | _ | _ | 4 | det | _ | _ |
4 | star | _ | _ | _ | _ | 2 | dobj | _ | _ |
5 | with | _ | _ | _ | _ | 7 | case | _ | _ |
6 | the | _ | _ | _ | _ | 7 | det | _ | _ |
7 | telescope | _ | _ | _ | _ | 4 | nmod | _ | _ |
How to decide what action to take?
How do we get training data for the classifier?
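One common way to obtain such training data, sketched here as an assumption since the slide leaves the answer open, is a static oracle: given a gold dependency tree, replay the transition system and record which action is correct in each configuration. The function name `oracle_examples` is hypothetical, tokens are numbered from 1 with 0 standing for ROOT, and the gold tree is assumed to be projective.

```python
def oracle_examples(heads, labels):
    """Derive (configuration, action) pairs from a gold tree.

    heads[k-1] and labels[k-1] describe token k; head 0 is ROOT.
    Illustrative static oracle for the shift/leftArc/rightArc system.
    """
    n = len(heads)
    head = dict(enumerate(heads, start=1))
    label = dict(enumerate(labels, start=1))
    n_children = [0] * (n + 1)           # gold number of dependents per node
    for h in heads:
        n_children[h] += 1
    attached = [0] * (n + 1)             # dependents attached so far
    stack, buffer, examples = [0], list(range(1, n + 1)), []
    while buffer or len(stack) > 1:
        state = (tuple(stack), tuple(buffer))   # configuration before acting
        i, j = (stack[-2], stack[-1]) if len(stack) >= 2 else (None, None)
        if i is not None and i != 0 and head[i] == j:
            action = "leftArc-" + label[i]
            attached[j] += 1
            del stack[-2]
        elif i is not None and head[j] == i and attached[j] == n_children[j]:
            # j may only be popped once it has collected all its dependents
            action = "rightArc-" + label[j]
            attached[i] += 1
            stack.pop()
        elif buffer:
            action = "shift"
            stack.append(buffer.pop(0))
        else:
            raise ValueError("gold tree is not projective or not a tree")
        examples.append((state, action))
    return examples

# Gold tree for "Alice saw Bob".
examples = oracle_examples([2, 0, 2], ["nsubj", "root", "dobj"])
for state, action in examples:
    print(state, action)
```

Each recorded pair can then serve as one training example: features are extracted from the configuration, and the action is the class to predict.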
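UAS (unlabeled attachment score) is the percentage of tokens that are assigned the correct head; LAS (labeled attachment score) additionally requires the correct dependency label.
Always 0 $\leq$ LAS $\leq$ UAS $\leq$ 100%, because every token with the correct head and label also has the correct head.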
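Both scores can be computed by comparing gold and predicted (head, label) pairs token by token. The sketch below uses a hypothetical helper `attachment_scores`; the gold tree is the first analysis of "I saw the star with the telescope" from above, while the predicted tree is made up for illustration, with one wrong head and one wrong label.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) in percent for lists of (head, label) pairs."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and aligned")
    # UAS counts tokens whose head is correct.
    correct_heads = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    # LAS counts tokens whose head and label are both correct.
    correct_both = sum(g == p for g, p in zip(gold, predicted))
    n = len(gold)
    return 100.0 * correct_heads / n, 100.0 * correct_both / n

gold = [(2, "nsubj"), (0, "root"), (4, "det"), (2, "dobj"),
        (7, "case"), (7, "det"), (2, "obl")]
# Hypothetical parser output: token 4 has the wrong label,
# token 7 has the wrong head (and therefore also counts against LAS).
predicted = [(2, "nsubj"), (0, "root"), (4, "det"), (2, "iobj"),
             (7, "case"), (7, "det"), (4, "nmod")]

uas, las = attachment_scores(gold, predicted)
print(f"UAS = {uas:.1f}%, LAS = {las:.1f}%")
```

Here six of seven heads are correct but only five of seven (head, label) pairs, so LAS is lower than UAS.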