Introduction¶

Udapi is an API and framework for processing Universal Dependencies. In this tutorial, we will focus on the Python version of Udapi. Perl and Java versions are available as well, but they are missing some of the features.

Udapi can be used from the shell (e.g. Bash), using the wrapper script udapy. It can also be used as a library, from Python, IPython or Jupyter notebooks. We will show both of these ways below.

This tutorial uses Details sections for extra info (if you want to know more or if you run into problems). Click on a Details section to show its content.

Details

It is a substitute for footnotes. The content may be long and showing it in the main text may be distracting.

Install (upgrade) Udapi¶

First, make sure you have the newest version of Udapi. If you have already installed Udapi using git clone, just run git pull. If you have not installed Udapi yet, run the pip3 command below.

Details

The command below installs Udapi from GitHub (from the master branch). With pip3 install --user --upgrade udapi, you can install the latest version released on PyPI (possibly older than the master branch).
The exclamation mark (!) in Jupyter or IPython means that the following command will be executed by the system shell (e.g. Bash).

In [ ]:

!pip3 install --user --upgrade git+https://github.com/udapi/udapi-python.git

Now, make sure you can run the command-line interface udapy, e.g. by printing the help message.

In [1]:

!udapy -h

usage: udapy [optional_arguments] scenario

udapy - Python interface to Udapi - API for Universal Dependencies

Examples of usage:
  udapy -s read.Sentences udpipe.En < in.txt > out.conllu
  udapy -T < sample.conllu | less -R
  udapy -HAM ud.MarkBugs < sample.conllu > bugs.html

positional arguments:
  scenario              A sequence of blocks and their parameters.

optional arguments:
  -h, --help            show this help message and exit
  -q, --quiet           Warning, info and debug messages are suppressed. Only fatal errors are reported.
  -v, --verbose         Warning, info and debug messages are printed to the STDERR.
  -s, --save            Add write.Conllu to the end of the scenario
  -T, --save_text_mode_trees
                        Add write.TextModeTrees color=1 to the end of the scenario
  -H, --save_html       Add write.TextModeTreesHtml color=1 to the end of the scenario
  -A, --save_all_attributes
                        Add attributes=form,lemma,upos,xpos,feats,deprel,misc (to be used after -T and -H)
  -C, --save_comments   Add print_comments=1 (to be used after -T and -H)
  -M, --marked_only     Add marked_only=1 to the end of the scenario (to be used after -T and -H)
  -N, --no_color        Add color=0 to the end of the scenario, this overrides color=1 of -T and -H

See http://udapi.github.io

Details: If the previous command fails with "udapy: command not found"

This means that Udapi is not properly installed. When installing Udapi with pip3 --user, it is installed into ~/.local/lib/python3.6/site-packages/udapi/ (or similar depending on your Python version) and the wrapper into ~/.local/bin. Thus you need to add that directory to your PATH:

export PATH="$HOME/.local/bin/:$PATH"
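As a quick check from Python, the following minimal sketch looks up the udapy wrapper on the current PATH using only the standard library. The helper name find_udapy is made up for this illustration; it is not part of Udapi.

```python
import shutil


def find_udapy():
    """Return the full path of the udapy wrapper, or None if it is not on PATH."""
    # find_udapy is a hypothetical helper for this tutorial, not a Udapi function.
    return shutil.which("udapy")


path = find_udapy()
if path is None:
    print("udapy not found; check that ~/.local/bin is on your PATH")
else:
    print("udapy found at", path)
```

If the lookup fails inside Jupyter even after editing PATH in a terminal, the notebook server was probably started with the old environment and needs a restart.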

Browse CoNLL-U files¶

Get sample UD data¶

Download and extract ud20sample.tgz. There are just 100 sentences for each of the 70 treebanks (sample.conllu), plus 4 bigger files (train.conllu and dev.conllu) for German, English, French and Czech. For full UD (2.0 or newer), go to Lindat.

In [3]:

!wget http://ufal.mff.cuni.cz/~popel/udapi/ud20sample.tgz
!tar -xf ud20sample.tgz
%cd sample

--2020-12-01 07:53:37--  http://ufal.mff.cuni.cz/~popel/udapi/ud20sample.tgz
Resolving ufal.mff.cuni.cz (ufal.mff.cuni.cz)... 195.113.20.52
Connecting to ufal.mff.cuni.cz (ufal.mff.cuni.cz)|195.113.20.52|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4670982 (4,5M) [application/x-gzip]
Saving to: ‘ud20sample.tgz.1’

ud20sample.tgz.1    100%[===================>]   4,45M  1,49MB/s    in 3,0s    

2020-12-01 07:53:40 (1,49 MB/s) - ‘ud20sample.tgz.1’ saved [4670982/4670982]

/home/martin/udapi/python/notebook/sample

Let's choose one of the sample files and see the raw CoNLL-U format.

Details: executing from Bash, IPython, Jupyter

If you see "No such file or directory" error, make sure you executed the previous cell. Note that the cd command is not prefixed by an exclamation mark because that would run in a sub-shell, which "forgets" the changed directory when finished. It is prefixed by a percent sign, which marks it as IPython magic.
cat is another IPython magic command, this time an alias for the shell command of the same name (so you can prefix cat with an exclamation mark, if you prefer), which prints a given file. With automagic on, you can use it without the percent sign.
In this tutorial, we use | head to show just the first 10 lines of the output (this keeps the ipynb file small). You can ignore the "cat: write error: Broken pipe" warning.
When using Jupyter, you can omit the | head because long outputs are automatically wrapped in a text box with a scrollbar.
When running this from IPython or Bash, you can use a pager: less UD_Ancient_Greek/sample.conllu

In [4]:

cat UD_Ancient_Greek/sample.conllu | head

# newdoc id = tlg0008.tlg001.perseus-grc1.13.tb.xml
# sent_id = tlg0008.tlg001.perseus-grc1.13.tb.xml@1144
# text = ἐρᾷ μὲν ἁγνὸς οὐρανὸς τρῶσαι χθόνα, ἔρως δὲ γαῖαν λαμβάνει γάμου τυχεῖν·
1	ἐρᾷ	ἐράω	VERB	v3spia---	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act	0	root	_	_
2	μὲν	μέν	ADV	d--------	_	1	advmod	_	_
3	ἁγνὸς	ἁγνός	ADJ	a-s---mn-	Case=Nom|Gender=Masc|Number=Sing	4	nmod	_	_
4	οὐρανὸς	οὐρανός	NOUN	n-s---mn-	Case=Nom|Gender=Masc|Number=Sing	1	nsubj	_	_
5	τρῶσαι	τιτρώσκω	VERB	v--ana---	Tense=Past|VerbForm=Inf|Voice=Act	1	xcomp	_	_
6	χθόνα	χθών	NOUN	n-s---fa-	Case=Acc|Gender=Fem|Number=Sing	5	obj	_	SpaceAfter=No
7	,	,	PUNCT	u--------	_	1	punct	_	_
cat: write error: Broken pipe
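To make the ten tab-separated columns of a token line explicit, here is a minimal standard-library sketch that splits one line from the output above into named fields. It is only an illustration of the format, not how Udapi reads files; the function names parse_token_line and parse_feats are assumptions of this example.

```python
# Column names as defined by the CoNLL-U format, in file order.
COLUMNS = ["id", "form", "lemma", "upos", "xpos",
           "feats", "head", "deprel", "deps", "misc"]


def parse_token_line(line):
    """Split one tab-separated CoNLL-U token line into a dict keyed by column name."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} columns, got {len(values)}")
    return dict(zip(COLUMNS, values))


def parse_feats(feats):
    """Turn 'Case=Nom|Number=Sing' into a dict; '_' means no features."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


# Token 4 of the Ancient Greek sentence shown above.
line = ("4\tοὐρανὸς\tοὐρανός\tNOUN\tn-s---mn-\t"
        "Case=Nom|Gender=Masc|Number=Sing\t1\tnsubj\t_\t_")
token = parse_token_line(line)
print(token["form"], token["upos"], token["head"], token["deprel"])
print(parse_feats(token["feats"]))
```

The sketch ignores comment lines (starting with #), multi-word tokens and empty nodes, which a real reader has to handle.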

Browse conllu files with `udapy -T`¶

While the CoNLL-U format was designed with readability (by both machines and humans) in mind, it may still be a bit difficult for humans to read and interpret. Let's visualize the dependency tree structure using ASCII-art by piping the conllu file into udapy -T.

In [5]:

cat UD_Ancient_Greek/sample.conllu | udapy -T | head -n 20

2020-12-01 08:00:33,276 [   INFO] execute - No reader specified, using read.Conllu
2020-12-01 08:00:33,276 [   INFO] execute -  ---- ROUND ----
2020-12-01 08:00:33,276 [   INFO] execute - Executing block Conllu
2020-12-01 08:00:33,305 [   INFO] execute - Executing block TextModeTrees
docname = tlg0008.tlg001.perseus-grc1.13.tb.xml
loaded_from = -
# sent_id = tlg0008.tlg001.perseus-grc1.13.tb.xml@1144
# text = ἐρᾷ μὲν ἁγνὸς οὐρανὸς τρῶσαι χθόνα, ἔρως δὲ γαῖαν λαμβάνει γάμου τυχεῖν·
─┮
 ╰─┮ ἐρᾷ VERB root
   ┡─╼ μὲν ADV advmod
   │ ╭─╼ ἁγνὸς ADJ nmod
   ┡─┶ οὐρανὸς NOUN nsubj
   ┡─┮ τρῶσαι VERB xcomp
   │ ╰─╼ χθόνα NOUN obj
   ┡─╼ , PUNCT punct
   │               ╭─╼ ἔρως NOUN nsubj
   ┡─╼ δὲ CCONJ cc │
   │               ┢─╼ γαῖαν NOUN obj
   ┡───────────────┾ λαμβάνει VERB conj
   │               │ ╭─╼ γάμου NOUN obj
   │               ╰─┶ τυχεῖν VERB xcomp
   ╰─╼ · PUNCT punct

Details:

You may be used to seeing dependency trees where the root node is on the top and words are ordered horizontally (left to right). Here, the root is on the left and words are ordered vertically (top to bottom).
The colors are implemented using the colorama package and ANSI escape codes. When running this from IPython or Bash and using less, you need to instruct it to display the colors with -R: cat UD_Ancient_Greek/sample.conllu | udapy -T | less -R
You can also use udapy -T -N to disable the colors.
udapy -q suppresses all Udapi messages (warnings, info, debug) printed on the standard error output, so only fatal errors are printed. By default only debug messages are suppressed, but these can be printed with udapy -v.
But you already know this because you have read udapy -h, am I right?

udapy -T is a shortcut for udapy write.TextModeTrees color=1, where write.TextModeTrees is a so-called block (a basic Udapi processing unit) and color=1 is its parameter. See the documentation (or even the source code) of write.TextModeTrees to learn about further parameters. Now, let's also print the LEMMA and MISC columns and display the columns vertically aligned using parameters layout=align attributes=form,lemma,upos,deprel,misc.

In [6]:

cat UD_Ancient_Greek/sample.conllu | udapy -q write.TextModeTrees color=1 layout=align attributes=form,lemma,upos,deprel,misc | head -n 20

docname = tlg0008.tlg001.perseus-grc1.13.tb.xml
loaded_from = -
# sent_id = tlg0008.tlg001.perseus-grc1.13.tb.xml@1144
# text = ἐρᾷ μὲν ἁγνὸς οὐρανὸς τρῶσαι χθόνα, ἔρως δὲ γαῖαν λαμβάνει γάμου τυχεῖν·
─┮                                         
 ╰─┮         ἐρᾷ      ἐράω     VERB  root   _
   ┡─╼       μὲν      μέν      ADV   advmod _
   │ ╭─╼     ἁγνὸς    ἁγνός    ADJ   nmod   _
   ┡─┶       οὐρανὸς  οὐρανός  NOUN  nsubj  _
   ┡─┮       τρῶσαι   τιτρώσκω VERB  xcomp  _
   │ ╰─╼     χθόνα    χθών     NOUN  obj    SpaceAfter=No
   ┡─╼       ,        ,        PUNCT punct  _
   │   ╭─╼   ἔρως     ἔρως     NOUN  nsubj  _
   ┡─╼ │     δὲ       δέ       CCONJ cc     _
   │   ┢─╼   γαῖαν    γαῖα     NOUN  obj    _
   ┡───┾     λαμβάνει λαμβάνω  VERB  conj   _
   │   │ ╭─╼ γάμου    γάμος    NOUN  obj    _
   │   ╰─┶   τυχεῖν   τυγχάνω  VERB  xcomp  SpaceAfter=No
   ╰─╼       ·        ·        PUNCT punct  _

Browse conllu files from IPython/Jupyter¶

So far, we have used Udapi only via its command-line interface udapy, which is handy, but not very Pythonic. So let's now use Udapi as a library and load the English conllu sample file into a document doc and visualize the sixth tree (i.e. doc[5] in zero-based indexing).

In [7]:

import udapi
doc = udapi.Document("UD_English/sample.conllu")
doc[5].draw()

# sent_id = weblog-juancole.com_juancole_20051126063000_ENG_20051126_063000-0006
# text = The third was being run by the head of an investment firm.
─┮
 │   ╭─╼ The DET det
 │ ╭─┶ third ADJ nsubj:pass
 │ ┢─╼ was AUX aux
 │ ┢─╼ being AUX aux:pass
 ╰─┾ run VERB root
   │ ╭─╼ by ADP case
   │ ┢─╼ the DET det
   ┡─┾ head NOUN obl
   │ │ ╭─╼ of ADP case
   │ │ ┢─╼ an DET det
   │ │ ┢─╼ investment NOUN compound
   │ ╰─┶ firm NOUN nmod
   ╰─╼ . PUNCT punct

Details:

doc = udapi.Document(filename) is a shortcut for

import udapi.core.document
doc = udapi.core.document.Document(filename)

We can print the whole document using doc.draw().

doc.draw(**kwargs) is a shortcut for creating a write.TextModeTrees block and applying it on the document:

import udapi.block.write.textmodetrees
block = udapi.block.write.textmodetrees.TextModeTrees(**kwargs)
block.run(doc)

The draw() method takes the same parameters as the write.TextModeTrees block, so we can for example display only the node ID (aka ord, i.e. word-order index), form and universal (morpho-syntactic) features.

In [8]:

doc[5].draw(layout="align", attributes="ord,form,feats")

# sent_id = weblog-juancole.com_juancole_20051126063000_ENG_20051126_063000-0006
# text = The third was being run by the head of an investment firm.
─┮                      
 │   ╭─╼   1  The        Definite=Def|PronType=Art
 │ ╭─┶     2  third      Degree=Pos|NumType=Ord
 │ ┢─╼     3  was        Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin
 │ ┢─╼     4  being      VerbForm=Ger
 ╰─┾       5  run        Tense=Past|VerbForm=Part|Voice=Pass
   │ ╭─╼   6  by         _
   │ ┢─╼   7  the        Definite=Def|PronType=Art
   ┡─┾     8  head       Number=Sing
   │ │ ╭─╼ 9  of         _
   │ │ ┢─╼ 10 an         Definite=Ind|PronType=Art
   │ │ ┢─╼ 11 investment Number=Sing
   │ ╰─┶   12 firm       Number=Sing
   ╰─╼     13 .          _

Document representation in Udapi¶

A Udapi document consists of a sequence of so-called bundles, mirroring a sequence of sentences in a typical natural language text.

A bundle corresponds to a sentence, possibly in multiple versions or with different representations, such as sentence-tuples from parallel corpora, or paraphrases in the same language or alternative analyses (e.g. parses produced by different parsers). If there are more trees in a bundle, they must be distinguished by a so-called zone (a label which contains the language code).

Each tree is represented by a special (artificial) root node, which is added to the top of a CoNLL-U tree in the Udapi model. The root node bears the ID of a given tree/sentence (sent_id) and its word order (ord) is 0. Technically, Root is a subclass of Node, with some extra methods.

The Node class corresponds to a node of a dependency tree. It provides access to all the CoNLL-U-defined attributes (ord, form, lemma, upos, xpos, feats, deprel, deps, misc). There are methods for tree traversal (parent, root, children, descendants); word-order traversal (next_node, prev_node); tree manipulation (parent setter) including word-order changes (shift_after_node(x), shift_before_subtree(x), etc.); and utility methods: is_descendant_of(x), is_nonprojective(), precedes(x), is_leaf(), is_root(), get_attrs([]), compute_text(), draw().
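The tree model described above (an artificial root with ord 0, parent and children links, word-order comparison) can be illustrated with a small toy class. This is a minimal sketch for intuition only: ToyNode is a made-up class, not Udapi's Node, and it turns everything into plain attributes and methods, whereas Udapi exposes some of these as properties.

```python
class ToyNode:
    """A toy stand-in for a dependency node; NOT Udapi's Node class."""

    def __init__(self, ord, form, parent=None):
        self.ord = ord          # word-order index; 0 is reserved for the artificial root
        self.form = form
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def is_root(self):
        return self.parent is None

    def is_leaf(self):
        return not self.children

    def precedes(self, other):
        """True if this node comes before `other` in word order."""
        return self.ord < other.ord

    def descendants(self):
        """All nodes below this one, sorted by word order."""
        found = []
        stack = list(self.children)
        while stack:
            node = stack.pop()
            found.append(node)
            stack.extend(node.children)
        return sorted(found, key=lambda n: n.ord)


# A hand-built four-word tree: <ROOT> -> run -> {third -> The, was}
root = ToyNode(0, "<ROOT>")
run = ToyNode(4, "run", parent=root)
third = ToyNode(2, "third", parent=run)
the = ToyNode(1, "The", parent=third)
was = ToyNode(3, "was", parent=run)

print([n.form for n in root.descendants()])
```

Note that the tree structure (who is whose parent) and the word order (ord) are independent: descendants are collected by following children links and only then sorted by ord.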

Exercise 1: Count prepositions and postpositions¶

Prepositions and postpositions are together called adpositions and assigned the ADP universal part-of-speech tag (upos) in UD. Some languages (e.g. English) use mostly prepositions, others mostly postpositions.

Do you know any English postpositions?
Guess the typical adposition type (i.e. whether a given language uses more prepositions or postpositions) for at least 10 languages of your choice (from those in UD2.0).
Complete the following code and find out how many prepositions and postpositions are in UD_English/sample.conllu (which has been loaded into doc).

In [ ]:

prepositions, postpositions = 0, 0
# Iterate over all nodes in the document (in all trees)
for node in doc.nodes:
    if node.upos == "ADP":
        # TODO: fix this code to actually distinguish prepositions and postpositions
        prepositions += 1
# Print the results
prepositions, postpositions

If you don't know how to proceed click on the following hints.

Hint 1:

In some dependency grammars, adpositions govern nouns (i.e. the adposition is the *parent* of a given noun node). In other dependency grammars, adpositions depend on nouns (i.e. the noun is the *parent* of a given adposition). Find out which style is being used by UD. Check the UD documentation or inspect some of the tree visualizations and guess.

Hint 2:

See the Node documentation and find out how to obtain dependency parent and dependency children. Note that these are properties of a given node, rather than methods, so you should not write parentheses () after the property name.

Hint 3:

doc.nodes iterates over all nodes in the document sorted by the word order, but this would be cumbersome to exploit. Find a method of Node to detect the relative word order of two nodes (within the same tree/sentence).

Hint 4:

Use node.parent and node.precedes(another_node). The latter is a shortcut for node.ord < another_node.ord.

Solution:

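prepositions, postpositions = 0, 0
for node in doc.nodes:
    if node.upos == "ADP":
        # In UD the adposition depends on the noun, so compare its word order with its parent
        if node.precedes(node.parent):
            prepositions += 1
        else:
            postpositions += 1
prepositions, postpositions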
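The same word-order test can be tried without Udapi on plain tuples. The sketch below assumes a hypothetical (ord, form, upos, head) tuple layout; the first sentence reuses the two adpositions of the English tree drawn earlier, and the second is a hand-made example that is not taken from the treebank.

```python
def count_adpositions(sentences):
    """Count ADP tokens that precede their head (prepositions) or follow it (postpositions)."""
    prepositions = postpositions = 0
    for sentence in sentences:
        for ord_, form, upos, head in sentence:
            if upos != "ADP":
                continue
            # Mirrors node.precedes(node.parent): compare word-order indices.
            if ord_ < head:
                prepositions += 1
            else:
                postpositions += 1
    return prepositions, postpositions


sentences = [
    # "... by the head of an investment firm": by -> head (8), of -> firm (12)
    [(6, "by", "ADP", 8), (8, "head", "NOUN", 5),
     (9, "of", "ADP", 12), (12, "firm", "NOUN", 8)],
    # Hand-made: "They busted up", with the particle attached to the verb
    [(1, "They", "PRON", 2), (2, "busted", "VERB", 0), (3, "up", "ADP", 2)],
]

print(count_adpositions(sentences))
```

As in the solution above, an adposition attached directly to the artificial root (head 0) would be counted as a postposition, because no token can precede ord 0.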

Exercise 2: Explore English postpositions¶

The previous exercise indicates there are 7 occurrences of postpositions in the English sample. Find these 7 occurrences and visualize them using node.draw(). Count which adpositions (lemma) with which dependency relations (deprel) are responsible for these occurrences. Recompute these statistics on the bigger English training data. Can you explain these occurrences? What are the reasons? Is any occurrence an annotation error?

In [ ]:

# For the statistics, you may find useful: count["any string"] += 1
import collections
count = collections.Counter()
big_doc = udapi.Document("UD_English/train.conllu")

for node in doc.nodes:
    # TODO detect postposition
    pass

# Print the statistics
count.most_common()

Solution 1:

for node in doc.nodes:
    if node.upos == "ADP" and node.parent.precedes(node):
        node.parent.draw()
        count[node.lemma + " " + node.deprel] += 1

Hint 1:

We can see there are many particles of phrasal verbs, e.g. "busted up". These seem to be correctly annotated as ADP according to the UD guidelines. Let's filter out those cases and focus on the rest, and let's switch to the big train data.

Solution 2:

count = collections.Counter()
for node in big_doc.nodes:
    if node.upos == "ADP" and node.parent.precedes(node) and node.parent.upos != "VERB":
        count[node.lemma + " " + node.deprel] += 1
count.most_common()

As an alternative to node.parent.upos != "VERB", you could require node.deprel != "compound:prt", or focus directly on node.deprel == "case".
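To see how the three filters mentioned above can differ, here is a minimal sketch that applies each of them to a few hand-made records. The (lemma, deprel, parent_upos) tuples and the tally helper are assumptions of this example; the records are invented for illustration and are not taken from UD_English.

```python
import collections

# Invented records for ADP nodes that follow their parent: (lemma, deprel, parent_upos)
records = [
    ("up", "compound:prt", "VERB"),
    ("up", "compound:prt", "VERB"),
    ("of", "fixed", "ADP"),
    ("ago", "case", "NOUN"),
]


def tally(records, keep):
    """Count 'lemma deprel' keys for the records accepted by the keep(deprel, parent_upos) test."""
    count = collections.Counter()
    for lemma, deprel, parent_upos in records:
        if keep(deprel, parent_upos):
            count[lemma + " " + deprel] += 1
    return count


not_verb_parent = tally(records, lambda deprel, parent_upos: parent_upos != "VERB")
not_particle = tally(records, lambda deprel, parent_upos: deprel != "compound:prt")
only_case = tally(records, lambda deprel, parent_upos: deprel == "case")

print(not_verb_parent.most_common())
print(not_particle.most_common())
print(only_case.most_common())
```

On these toy records the first two filters agree, while the third also drops the "fixed" record; on real data the first two can differ as well, for example when a particle hangs on a parent that is not tagged VERB.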

Partial answer:

Most of the occurrences are actually annotated correctly, although they are not typically considered postpositions. For example, node.deprel == "fixed" is used for multi-word adpositions, such as "because of", where "of" depends on "because" for technical (and consistency) reasons, but the whole multi-word adposition precedes its governing noun, so it is actually a multi-word preposition.

What about the remaining occurrences, i.e. those left after keeping only nodes with node.deprel not in {"compound:prt", "fixed"}?

In the next tutorial, 02-blocks.ipynb (not finished yet), we will explore several useful Udapi blocks, some of which may be handy when working further on Exercise 2 or similar tasks.
