Udapi is an API and framework for processing Universal Dependencies. In this tutorial, we will focus on the Python version of Udapi. Perl and Java versions are available as well, but they are missing some of the features.
Udapi can be used from the shell (e.g. Bash), using the wrapper script udapy. It can also be used as a library, from Python, IPython or Jupyter notebooks. We will show both of these ways below.
This tutorial uses Details sections for extra info (if you want to know more or if you run into problems). Click on such a section to show its content.
First, make sure you have the newest version of Udapi. If you have already installed Udapi using git clone, just run git pull. If you have not installed Udapi yet, run the command below. Alternatively, with pip3 install --user --upgrade udapi you can install the last version released on PyPI (possibly older).
!pip3 install --user --upgrade git+https://github.com/udapi/udapi-python.git
Now, make sure you can run the command-line interface udapy, e.g. by printing the help message.
!udapy -h
usage: udapy [optional_arguments] scenario

udapy - Python interface to Udapi - API for Universal Dependencies

Examples of usage:
  udapy -s read.Sentences udpipe.En < in.txt > out.conllu
  udapy -T < sample.conllu | less -R
  udapy -HAM ud.MarkBugs < sample.conllu > bugs.html

positional arguments:
  scenario    A sequence of blocks and their parameters.

optional arguments:
  -h, --help    show this help message and exit
  -q, --quiet    Warning, info and debug messages are suppressed. Only fatal errors are reported.
  -v, --verbose    Warning, info and debug messages are printed to the STDERR.
  -s, --save    Add write.Conllu to the end of the scenario
  -T, --save_text_mode_trees    Add write.TextModeTrees color=1 to the end of the scenario
  -H, --save_html    Add write.TextModeTreesHtml color=1 to the end of the scenario
  -A, --save_all_attributes    Add attributes=form,lemma,upos,xpos,feats,deprel,misc (to be used after -T and -H)
  -C, --save_comments    Add print_comments=1 (to be used after -T and -H)
  -M, --marked_only    Add marked_only=1 to the end of the scenario (to be used after -T and -H)
  -N, --no_color    Add color=0 to the end of the scenario, this overrides color=1 of -T and -H

See http://udapi.github.io
If you installed Udapi with pip3 --user, it is installed into ~/.local/lib/python3.6/site-packages/udapi/ (or similar, depending on your Python version) and the wrapper into ~/.local/bin. Thus you need to add that directory to your PATH:
export PATH="$HOME/.local/bin/:$PATH"
!wget http://ufal.mff.cuni.cz/~popel/udapi/ud20sample.tgz
!tar -xf ud20sample.tgz
%cd sample
--2020-12-01 07:53:37--  http://ufal.mff.cuni.cz/~popel/udapi/ud20sample.tgz
Resolving ufal.mff.cuni.cz (ufal.mff.cuni.cz)... 195.113.20.52
Connecting to ufal.mff.cuni.cz (ufal.mff.cuni.cz)|195.113.20.52|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4670982 (4,5M) [application/x-gzip]
Saving to: ‘ud20sample.tgz.1’

ud20sample.tgz.1    100%[===================>]   4,45M  1,49MB/s    in 3,0s

2020-12-01 07:53:40 (1,49 MB/s) - ‘ud20sample.tgz.1’ saved [4670982/4670982]

/home/martin/udapi/python/notebook/sample
Let's choose one of the sample files and see the raw CoNLL-U format.
The cd command is not prefixed by an exclamation mark because that would run it in a sub-shell, which "forgets" the changed directory when finished. It is prefixed by a percent sign, which marks it as IPython magic.
cat is another IPython magic command, this time an alias for the shell command of the same name (so you can prefix cat with an exclamation mark, if you prefer), which prints a given file. With automagic on, you can use it without the percent sign.
In this tutorial, we append | head to show just the first 10 lines of the output (which keeps the ipynb file small). You can ignore the "cat: write error: Broken pipe" warning.
When using Jupyter, you can omit | head because long outputs are automatically wrapped in a text box with a scrollbar.
less UD_Ancient_Greek/sample.conllu
cat UD_Ancient_Greek/sample.conllu | head
# newdoc id = tlg0008.tlg001.perseus-grc1.13.tb.xml
# sent_id = tlg0008.tlg001.perseus-grc1.13.tb.xml@1144
# text = ἐρᾷ μὲν ἁγνὸς οὐρανὸς τρῶσαι χθόνα, ἔρως δὲ γαῖαν λαμβάνει γάμου τυχεῖν·
1	ἐρᾷ	ἐράω	VERB	v3spia---	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act	0	root	_	_
2	μὲν	μέν	ADV	d--------	_	1	advmod	_	_
3	ἁγνὸς	ἁγνός	ADJ	a-s---mn-	Case=Nom|Gender=Masc|Number=Sing	4	nmod	_	_
4	οὐρανὸς	οὐρανός	NOUN	n-s---mn-	Case=Nom|Gender=Masc|Number=Sing	1	nsubj	_	_
5	τρῶσαι	τιτρώσκω	VERB	v--ana---	Tense=Past|VerbForm=Inf|Voice=Act	1	xcomp	_	_
6	χθόνα	χθών	NOUN	n-s---fa-	Case=Acc|Gender=Fem|Number=Sing	5	obj	_	SpaceAfter=No
7	,	,	PUNCT	u--------	_	1	punct	_	_
cat: write error: Broken pipe
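Each token line in the output above has ten tab-separated columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC). Udapi parses these for you, so the following is only a minimal plain-Python sketch of what the columns mean. The helper name parse_token_line is hypothetical (it is not a Udapi function), and the sketch ignores comment lines, multi-word tokens and empty nodes.

```python
# Column names of a CoNLL-U token line, in order.
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")


def parse_token_line(line):
    """Split one CoNLL-U token line into a dict keyed by column name.

    Hypothetical helper for illustration only; not part of Udapi.
    """
    values = line.rstrip("\n").split("\t")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} columns, got {len(values)}")
    token = dict(zip(FIELDS, values))
    # FEATS is either "_" (empty) or a |-separated list of Name=Value pairs.
    if token["feats"] == "_":
        token["feats"] = {}
    else:
        token["feats"] = dict(pair.split("=", 1)
                              for pair in token["feats"].split("|"))
    return token


# The fourth token of the Ancient Greek sentence shown above.
line = "\t".join(["4", "οὐρανὸς", "οὐρανός", "NOUN", "n-s---mn-",
                  "Case=Nom|Gender=Masc|Number=Sing", "1", "nsubj", "_", "_"])
token = parse_token_line(line)
print(token["form"], token["upos"], token["head"], token["feats"]["Case"])
```

Note that HEAD is kept as a string here; it refers to the ID of the parent token, with 0 reserved for the root.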
udapy -T
While the CoNLL-U format was designed with readability (by both machines and humans) in mind, it may still be a bit difficult for humans to read and interpret. Let's visualize the dependency tree structure using ASCII art by piping the CoNLL-U file into udapy -T.
cat UD_Ancient_Greek/sample.conllu | udapy -T | head -n 20
2020-12-01 08:00:33,276 [   INFO] execute - No reader specified, using read.Conllu
2020-12-01 08:00:33,276 [   INFO] execute -  ---- ROUND ----
2020-12-01 08:00:33,276 [   INFO] execute - Executing block Conllu
2020-12-01 08:00:33,305 [   INFO] execute - Executing block TextModeTrees
docname = tlg0008.tlg001.perseus-grc1.13.tb.xml
loaded_from = -
# sent_id = tlg0008.tlg001.perseus-grc1.13.tb.xml@1144
# text = ἐρᾷ μὲν ἁγνὸς οὐρανὸς τρῶσαι χθόνα, ἔρως δὲ γαῖαν λαμβάνει γάμου τυχεῖν·
─┮
 ╰─┮ ἐρᾷ VERB root
   ┡─╼ μὲν ADV advmod
   │ ╭─╼ ἁγνὸς ADJ nmod
   ┡─┶ οὐρανὸς NOUN nsubj
   ┡─┮ τρῶσαι VERB xcomp
   │ ╰─╼ χθόνα NOUN obj
   ┡─╼ , PUNCT punct
   │               ╭─╼ ἔρως NOUN nsubj
   ┡─╼ δὲ CCONJ cc │
   │               ┢─╼ γαῖαν NOUN obj
   ┡───────────────┾ λαμβάνει VERB conj
   │               │ ╭─╼ γάμου NOUN obj
   │               ╰─┶ τυχεῖν VERB xcomp
   ╰─╼ · PUNCT punct
If you pipe the output into less, you need to instruct it to display the colors with -R:
cat UD_Ancient_Greek/sample.conllu | udapy -T | less -R
Use udapy -T -N to disable the colors.
udapy -q suppresses all Udapi messages (warnings, info, debug) printed on the standard error output, so only fatal errors are printed. By default, only debug messages are suppressed, but these can be printed with udapy -v.
But you already know this from udapy -h, am I right?
udapy -T is a shortcut for udapy write.TextModeTrees color=1, where write.TextModeTrees is a so-called block (a basic Udapi processing unit) and color=1 is its parameter. See the documentation (or even the source code) of write.TextModeTrees to learn about further parameters. Now, let's also print the LEMMA and MISC columns and display the columns vertically aligned using the parameters layout=align attributes=form,lemma,upos,deprel,misc.
cat UD_Ancient_Greek/sample.conllu | udapy -q write.TextModeTrees color=1 layout=align attributes=form,lemma,upos,deprel,misc | head -n 20
docname = tlg0008.tlg001.perseus-grc1.13.tb.xml
loaded_from = -
# sent_id = tlg0008.tlg001.perseus-grc1.13.tb.xml@1144
# text = ἐρᾷ μὲν ἁγνὸς οὐρανὸς τρῶσαι χθόνα, ἔρως δὲ γαῖαν λαμβάνει γάμου τυχεῖν·
─┮
 ╰─┮         ἐρᾷ      ἐράω     VERB  root   _
   ┡─╼       μὲν      μέν      ADV   advmod _
   │ ╭─╼     ἁγνὸς    ἁγνός    ADJ   nmod   _
   ┡─┶       οὐρανὸς  οὐρανός  NOUN  nsubj  _
   ┡─┮       τρῶσαι   τιτρώσκω VERB  xcomp  _
   │ ╰─╼     χθόνα    χθών     NOUN  obj    SpaceAfter=No
   ┡─╼       ,        ,        PUNCT punct  _
   │   ╭─╼   ἔρως     ἔρως     NOUN  nsubj  _
   ┡─╼ │     δὲ       δέ       CCONJ cc     _
   │   ┢─╼   γαῖαν    γαῖα     NOUN  obj    _
   ┡───┾     λαμβάνει λαμβάνω  VERB  conj   _
   │   │ ╭─╼ γάμου    γάμος    NOUN  obj    _
   │   ╰─┶   τυχεῖν   τυγχάνω  VERB  xcomp  SpaceAfter=No
   ╰─╼       ·        ·        PUNCT punct  _
So far, we have been using Udapi only via its command-line interface udapy, which is handy, but not very Pythonic. So let's now use Udapi as a library, load the English CoNLL-U sample file into a document doc, and visualize the sixth tree (i.e. doc[5] in zero-based indexing).
import udapi
doc = udapi.Document("UD_English/sample.conllu")
doc[5].draw()
# sent_id = weblog-juancole.com_juancole_20051126063000_ENG_20051126_063000-0006
# text = The third was being run by the head of an investment firm.
─┮
 │   ╭─╼ The DET det
 │ ╭─┶ third ADJ nsubj:pass
 │ ┢─╼ was AUX aux
 │ ┢─╼ being AUX aux:pass
 ╰─┾ run VERB root
   │ ╭─╼ by ADP case
   │ ┢─╼ the DET det
   ┡─┾ head NOUN obl
   │ │ ╭─╼ of ADP case
   │ │ ┢─╼ an DET det
   │ │ ┢─╼ investment NOUN compound
   │ ╰─┶ firm NOUN nmod
   ╰─╼ . PUNCT punct
doc = udapi.Document(filename) is a shortcut for
import udapi.core.document
doc = udapi.core.document.Document(filename)
The whole document can be printed with doc.draw().
doc.draw(**kwargs) is a shortcut for creating a write.TextModeTrees block and applying it on the document:
import udapi.block.write.textmodetrees
block = udapi.block.write.textmodetrees.TextModeTrees(**kwargs)
block.run(doc)
The draw() method takes the same parameters as the write.TextModeTrees block, so we can, for example, display only the node ID (aka ord, i.e. word-order index), form and universal (morpho-syntactic) features.
doc[5].draw(layout="align", attributes="ord,form,feats")
# sent_id = weblog-juancole.com_juancole_20051126063000_ENG_20051126_063000-0006
# text = The third was being run by the head of an investment firm.
─┮
 │   ╭─╼   1  The        Definite=Def|PronType=Art
 │ ╭─┶     2  third      Degree=Pos|NumType=Ord
 │ ┢─╼     3  was        Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin
 │ ┢─╼     4  being      VerbForm=Ger
 ╰─┾       5  run        Tense=Past|VerbForm=Part|Voice=Pass
   │ ╭─╼   6  by         _
   │ ┢─╼   7  the        Definite=Def|PronType=Art
   ┡─┾     8  head       Number=Sing
   │ │ ╭─╼ 9  of         _
   │ │ ┢─╼ 10 an         Definite=Ind|PronType=Art
   │ │ ┢─╼ 11 investment Number=Sing
   │ ╰─┶   12 firm       Number=Sing
   ╰─╼     13 .          _
A Udapi document consists of a sequence of so-called bundles, mirroring a sequence of sentences in a typical natural language text.
A bundle corresponds to a sentence, possibly in multiple versions or with different representations, such as sentence-tuples from parallel corpora, or paraphrases in the same language or alternative analyses (e.g. parses produced by different parsers). If there are more trees in a bundle, they must be distinguished by a so-called zone (a label which contains the language code).
Each tree is represented by a special (artificial) root node, which is added to the top of a CoNLL-U tree in the Udapi model. The root node bears the ID of a given tree/sentence (sent_id) and its word order (ord) is 0. Technically, Root is a subclass of Node, with some extra methods.
The Node class corresponds to a node of a dependency tree. It provides access to all the CoNLL-U-defined attributes (ord, form, lemma, upos, xpos, feats, deprel, deps, misc). There are methods for tree traversal (parent, root, children, descendants); word-order traversal (next_node, prev_node); tree manipulation (parent setter) including word-order changes (shift_after_node(x), shift_before_subtree(x), etc.); and utility methods: is_descendant_of(x), is_nonprojective(), precedes(x), is_leaf(), is_root(), get_attrs([]), compute_text(), draw().
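A few of the attributes and traversal methods listed above can be tried on a tiny tree. The sketch below is only an illustration under some assumptions: the four-word sentence and its annotation are hand-written toy data (not taken from the sample files), the document is loaded from a string with doc.from_conllu_string(), and the demo is silently skipped if Udapi is not installed.

```python
try:
    import udapi
except ImportError:  # Udapi is not installed; the demo below is skipped
    udapi = None

# Hand-written toy sentence (illustrative data, not from the UD samples).
ROWS = [
    "1 The the DET _ Definite=Def|PronType=Art 2 det _ _",
    "2 head head NOUN _ Number=Sing 0 root _ _",
    "3 of of ADP _ _ 4 case _ _",
    "4 firms firm NOUN _ Number=Plur 2 nmod _ _",
]
# CoNLL-U columns are tab-separated and a sentence ends with an empty line.
CONLLU = ("# sent_id = toy-1\n"
          + "\n".join("\t".join(row.split()) for row in ROWS)
          + "\n\n")

summary = []
if udapi is not None:
    doc = udapi.Document()
    doc.from_conllu_string(CONLLU)
    root = doc[0].get_tree()        # the artificial root of the first tree
    print(root.sent_id, root.ord)   # the root bears sent_id and has ord 0
    for node in root.descendants:   # all nodes of the tree in word order
        summary.append((node.ord, node.form, node.deprel,
                        node.parent.ord,
                        [child.form for child in node.children]))
    for item in summary:
        print(item)
```

For each node, the tuple lists its ord, form and deprel, followed by the ord of its parent and the forms of its children.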
Prepositions and postpositions are together called adpositions and assigned the ADP universal part-of-speech tag (upos) in UD. Some languages (e.g. English) use mostly prepositions, others mostly postpositions.
Complete the following code to find out how many prepositions and postpositions there are in UD_English/sample.conllu (which has been loaded into doc).
prepositions, postpositions = 0, 0
# Iterate over all nodes in the document (in all trees)
for node in doc.nodes:
    if node.upos == "ADP":
        # TODO: fix this code to actually distinguish prepositions and postpositions
        prepositions += 1
# Print the results
prepositions, postpositions
If you don't know how to proceed, click on the following hints.
doc.nodes iterates over all nodes in the document sorted by the word order, but this would be cumbersome to exploit. Find a method of Node to detect the relative word order of two nodes (within the same tree/sentence).
Use node.parent and node.precedes(another_node). The latter is a shortcut for node.ord < another_node.ord.
for node in doc.nodes:
    if node.upos == "ADP":
        if node.precedes(node.parent):
            prepositions += 1
        else:
            postpositions += 1
The previous exercise indicates there are 7 occurrences of postpositions in the English sample. Find these 7 occurrences and visualize them using node.draw(). Count which adpositions (lemma) with which dependency relations (deprel) are responsible for these occurrences. Recompute these statistics on the bigger English training data. Can you explain these occurrences? What are the reasons? Is any occurrence an annotation error?
# For the statistics, you may find useful: count["any string"] += 1
import collections
count = collections.Counter()
big_doc = udapi.Document("UD_English/train.conllu")
for node in doc.nodes:
    # TODO detect postposition
    pass
# Print the statistics
count.most_common()
for node in doc.nodes:
    if node.upos == "ADP" and node.parent.precedes(node):
        node.parent.draw()
        count[node.lemma + " " + node.deprel] += 1
Particles of phrasal verbs follow their verb and are tagged ADP according to the UD guidelines.
Let's filter out those cases, focus on the rest, and switch to the big training data.
count = collections.Counter()
for node in big_doc.nodes:
    if node.upos == "ADP" and node.parent.precedes(node) and node.parent.upos != "VERB":
        count[node.lemma + " " + node.deprel] += 1
count.most_common()
As an alternative to node.parent.upos != "VERB", you could filter with node.deprel != "compound:prt", or focus directly on node.deprel == "case".
node.deprel == "fixed" is used for multi-word adpositions, such as "because of", where "of" depends on "because" for technical (and consistency) reasons, but the whole multi-word adposition precedes its governing noun, so it is actually a multi-word preposition.
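The two cases discussed so far (verb particles and the second word of a multi-word adposition) can be reproduced on a toy sentence without loading any treebank. The sketch below uses plain tuples instead of Udapi nodes, and the comparison head < ord mirrors node.parent.precedes(node). The sentence and its annotation are hand-made for illustration, not taken from the UD data.

```python
import collections

# (ord, lemma, upos, head, deprel) for "He gave up because of the firm".
# Hand-annotated toy data for illustration only.
TOKENS = [
    (1, "he", "PRON", 2, "nsubj"),
    (2, "give", "VERB", 0, "root"),
    (3, "up", "ADP", 2, "compound:prt"),
    (4, "because", "ADP", 7, "case"),
    (5, "of", "ADP", 4, "fixed"),
    (6, "the", "DET", 7, "det"),
    (7, "firm", "NOUN", 2, "obl"),
]

count = collections.Counter()
remaining = []
for ord_, lemma, upos, head, deprel in TOKENS:
    # An adposition that follows its parent looks like a "postposition".
    if upos == "ADP" and head < ord_:
        count[lemma + " " + deprel] += 1
        # Keep only the cases not explained by particles or fixed expressions.
        if deprel not in {"compound:prt", "fixed"}:
            remaining.append(lemma)

print(count.most_common())
print(remaining)
```

In this toy sentence, both apparent postpositions are explained by their deprel, so the list of remaining occurrences is empty. The word "because" precedes its parent "firm" and is therefore not counted at all.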
What about the remaining occurrences, after filtering out node.deprel not in {"compound:prt", "fixed"}?
In the next tutorial, 02-blocks.ipynb (not finished yet), we will explore several useful Udapi blocks, some of which may be handy when working further on Exercise 2 or similar tasks.