#!/usr/bin/env python
# coding: utf-8

# # Introduction
# Udapi is an API and framework for processing [Universal Dependencies](http://universaldependencies.org/). In this tutorial, we will focus on the Python version of Udapi. Perl and Java versions are [available](http://udapi.github.io/) as well, but they are missing some of the features.
#
# Udapi can be used from the shell (e.g. Bash), using the wrapper script `udapy`. It can also be used as a library, from Python, IPython or Jupyter notebooks. We will show both of these ways below.
#
# This tutorial uses Details sections for extra info (if you want to know more or if you run into problems). Click on a Details section to show its content.
#

Details

# It is a substitute for footnotes. The content may be long and showing it in the main text may be distracting. #

#
# ### Install (upgrade) Udapi
# First, make sure you have the newest version of Udapi. If you have already installed Udapi [using git clone](https://github.com/udapi/udapi-python#install-udapi-for-developers), just run `git pull`. If you have not installed Udapi yet, run
#

Details

The command below installs Udapi from GitHub (from the master branch). With `pip3 install --user --upgrade udapi`, you can install the latest version released on PyPI (possibly older than the master branch). #
The exclamation mark (!) in Jupyter or IPython means that the following command will be executed by the system shell (e.g. Bash). #

# In[ ]:

get_ipython().system('pip3 install --user --upgrade git+https://github.com/udapi/udapi-python.git')

# Now, make sure you can run the command-line interface `udapy`, e.g. by printing the help message.

# In[1]:

get_ipython().system('udapy -h')

#

Details: If the previous command fails with "udapy: command not found"

# This means that Udapi is not properly installed. When installing Udapi with pip3 --user, it is installed into ~/.local/lib/python3.6/site-packages/udapi/ (or similar depending on your Python version) and the wrapper into ~/.local/bin. Thus you need to #

# export PATH="$HOME/.local/bin/:$PATH"
#
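
The same check can be done from Python before touching `PATH`. The following is a minimal sketch using only the standard library; the helper name `locate_command` and its `extra_dir` parameter are made up for this illustration and are not part of Udapi.

```python
import os
import shutil


def locate_command(name, extra_dir=None):
    """Return the full path of `name`, or None if it cannot be found."""
    # First look in the directories listed in the PATH environment variable.
    found = shutil.which(name)
    if found is None and extra_dir is not None:
        # Then look in one extra directory that may be missing from PATH.
        found = shutil.which(name, path=os.path.expanduser(extra_dir))
    return found


path = locate_command("udapy", extra_dir="~/.local/bin")
if path is None:
    print("udapy not found; check the installation")
else:
    print("udapy found at", path)
```

If the wrapper is reported under `~/.local/bin` but the shell still cannot find it, the `export PATH=...` line above is most likely what is missing.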

#
# # Browse CoNLL-U files
# ### Get sample UD data
#
# Download and extract [ud20sample.tgz](http://ufal.mff.cuni.cz/~popel/udapi/ud20sample.tgz). There are just 100 sentences for each of the 70 treebanks (`sample.conllu`), plus 4 bigger files (`train.conllu` and `dev.conllu`) for German, English, French and Czech. For full UD ([2.0](https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-1983) or [newer](https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-3424)), go to [Lindat](https://lindat.cz).

# In[3]:

get_ipython().system('wget http://ufal.mff.cuni.cz/~popel/udapi/ud20sample.tgz')
get_ipython().system('tar -xf ud20sample.tgz')
get_ipython().run_line_magic('cd', 'sample')

# Let's choose one of the sample files and see the raw [CoNLL-U format](https://universaldependencies.org/format.html).
#

Details: executing from Bash, IPython, Jupyter

If you see "No such file or directory" error, make sure you executed the previous cell. Note that the cd command is not prefixed by an exclamation mark because that would run in a sub-shell, which "forgets" the changed directory when finished. It is prefixed by a percent sign, which marks it as IPython magic. #
cat is another IPython magic command, this time an alias for the shell command of the same name (so you can prefix cat with an exclamation mark, if you prefer), which prints a given file. With automagic on, you can use it without the percent sign. #
In this tutorial, we use | head to show just the first 10 lines of the output (which keeps the ipynb file small). You can ignore the "cat: write error: Broken pipe" warning. #
When using Jupyter, you can omit the | head because long outputs are automatically wrapped in a text box with a scrollbar. #
When running this from IPython or Bash, you can use a pager: less UD_Ancient_Greek/sample.conllu #
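
As background for reading the raw file: CoNLL-U stores each token on one line with ten tab-separated fields, comment lines start with `#`, and a blank line ends a sentence. The sketch below is a plain-Python illustration of that layout, not Udapi's reader. The sample sentence and the names `FIELDS`, `SAMPLE` and `read_sentences` are made up for this illustration.

```python
# The ten CoNLL-U columns, in file order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]

# A tiny invented sentence (not taken from any treebank).
SAMPLE = (
    "# sent_id = demo-1\n"
    "# text = Dogs bark.\n"
    "1\tDogs\tdog\tNOUN\t_\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)


def read_sentences(text):
    """Split CoNLL-U text into sentences with comments and token rows."""
    sentences, comments, tokens = [], [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if tokens:
                sentences.append({"comments": comments, "tokens": tokens})
                comments, tokens = [], []
        elif line.startswith("#"):
            comments.append(line[1:].strip())
        else:
            # Every field stays a string; "_" marks an empty value.
            tokens.append(dict(zip(FIELDS, line.split("\t"))))
    if tokens:  # the file may lack a final blank line
        sentences.append({"comments": comments, "tokens": tokens})
    return sentences


for token in read_sentences(SAMPLE)[0]["tokens"]:
    print(token["id"], token["form"], token["upos"], token["head"], token["deprel"])
```

The sketch keeps multiword-token lines and empty nodes (IDs such as `1-2` or `1.1`) as ordinary rows; real tools, including Udapi, treat them specially.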

#

# In[4]:

cat UD_Ancient_Greek/sample.conllu | head

# ### Browse conllu files with `udapy -T`
# While the CoNLL-U format was designed with readability (by both machines and humans) in mind, it may still be a bit difficult for humans to read and interpret. Let's visualize the dependency tree structure using ASCII art by piping the conllu file into `udapy -T`.

# In[5]:

cat UD_Ancient_Greek/sample.conllu | udapy -T | head -n 20

#

Details:

You may be used to seeing dependency trees where the root node is at the top and words are ordered horizontally (left to right). Here, the root is on the left and words are ordered vertically (top to bottom). #
The colors are implemented using the colorama package and ANSI escape codes. When running this from IPython or Bash and using less, you need to instruct it to display the colors with -R: #

# cat UD_Ancient_Greek/sample.conllu | udapy -T | less -R
#

You can also use udapy -T -N to disable the colors. #
udapy -q suppresses all Udapi messages (warnings, info, debug) printed on the standard error output, so only fatal errors are printed. By default, only debug messages are suppressed, but these can be printed with udapy -v. #
But you already know this because you have read udapy -h, am I right? #

#
# `udapy -T` is a shortcut for `udapy write.TextModeTrees color=1`, where `write.TextModeTrees` is a so-called *block* (a basic Udapi processing unit) and `color=1` is its parameter. See [the documentation](https://udapi.readthedocs.io/en/latest/udapi.block.write.html#module-udapi.block.write.textmodetrees) (or even [the source code](https://github.com/udapi/udapi-python/blob/master/udapi/block/write/textmodetrees.py)) of `write.TextModeTrees` to learn about further parameters. Now, let's also print the LEMMA and MISC columns and display the columns vertically aligned using the parameters `layout=align attributes=form,lemma,upos,deprel,misc`.

# In[6]:

cat UD_Ancient_Greek/sample.conllu | udapy -q write.TextModeTrees color=1 layout=align attributes=form,lemma,upos,deprel,misc | head -n 20

# ### Browse conllu files from IPython/Jupyter
# So far, we have used Udapi only via its command-line interface `udapy`, which is handy, but not very Pythonic. So let's now use Udapi as a library, load the English conllu sample file into a document `doc` and visualize the sixth tree (i.e. `doc[5]` in zero-based indexing).

# In[7]:

import udapi
doc = udapi.Document("UD_English/sample.conllu")
doc[5].draw()

#

Details:

doc = udapi.Document(filename) is a shortcut for #

# import udapi.core.document
# doc = udapi.core.document.Document(filename)
#

We can print the whole document using doc.draw(). #

doc.draw(**kwargs) is a shortcut for creating a write.TextModeTrees block and applying it to the document: #

# import udapi.block.write.textmodetrees
# block = udapi.block.write.textmodetrees.TextModeTrees(**kwargs)
# block.run(doc)
#

#
# The `draw()` method takes the same parameters as the `write.TextModeTrees` block, so we can for example display only the node ID (aka `ord`, i.e. word-order index), form and [universal (morpho-syntactic) features](https://universaldependencies.org/u/feat/index.html).
#

# In[8]:

doc[5].draw(layout="align", attributes="ord,form,feats")

# # Document representation in Udapi
#
# A Udapi [document](https://github.com/udapi/udapi-python/blob/master/udapi/core/document.py) consists of a sequence of so-called *bundles*, mirroring a sequence of sentences in a typical natural language text.
#
# A [bundle](https://github.com/udapi/udapi-python/blob/master/udapi/core/bundle.py) corresponds to a sentence, possibly in multiple versions or with different representations, such as sentence-tuples from parallel corpora, or paraphrases in the same language or alternative analyses (e.g. parses produced by different parsers). If there are more trees in a bundle, they must be distinguished by a so-called *zone* (a label which contains the language code).
#
# Each tree is represented by a special (artificial) [root](https://github.com/udapi/udapi-python/blob/master/udapi/core/root.py) node, which is added to the top of a CoNLL-U tree in the Udapi model. The root node bears the ID of a given tree/sentence (`sent_id`) and its word order (`ord`) is 0. Technically, Root is a subclass of Node, with some extra methods.
#
# The [Node](https://github.com/udapi/udapi-python/blob/master/udapi/core/node.py) class corresponds to a node of a dependency tree. It provides access to all the CoNLL-U-defined attributes (`ord`, `form`, `lemma`, `upos`, `xpos`, `feats`, `deprel`, `deps`, `misc`).
# There are methods for tree traversal (`parent`, `root`, `children`, `descendants`); word-order traversal (`next_node`, `prev_node`); tree manipulation (`parent` setter) including word-order changes (`shift_after_node(x)`, `shift_before_subtree(x)`, etc.); and utility methods: `is_descendant_of(x)`, `is_nonprojective()`, `precedes(x)`, `is_leaf()`, `is_root()`, `get_attrs([])`, `compute_text()`, `draw()`.
#
# ## Exercise 1: Count prepositions and postpositions
# [Prepositions and postpositions](https://en.wikipedia.org/wiki/Preposition_and_postposition) are together called *adpositions* and assigned the [ADP](https://universaldependencies.org/u/pos/ADP.html) universal part-of-speech tag (`upos`) in UD. Some languages (e.g. English) use mostly prepositions, others mostly postpositions.
# * Do you know any English postpositions?
# * Guess the typical adposition type (i.e. whether a given language uses more prepositions or postpositions) for at least 10 languages of your choice (from those in UD2.0).
# * Complete the following code and find out how many prepositions and postpositions are in `UD_English/sample.conllu` (which has been loaded into `doc`).

# In[ ]:

prepositions, postpositions = 0, 0
# Iterate over all nodes in the document (in all trees)
for node in doc.nodes:
    if node.upos == "ADP":
        # TODO: fix this code to actually distinguish prepositions and postpositions
        prepositions += 1
# Print the results
prepositions, postpositions

# If you don't know how to proceed, click on the following hints.
#

Hint 1:

# In some dependency grammars, adpositions govern nouns (i.e. an adposition is the *parent* of a given noun node). In other dependency grammars, adpositions depend on nouns (i.e. a noun is the *parent* of a given adposition). Find out which style is being used by UD. Check the UD documentation or inspect some of the tree visualizations and guess. #

Hint 2:

# See the Node documentation and find out how to obtain dependency parent and dependency children. Note that these are properties of a given node, rather than methods, so you should not write parentheses () after the property name. #

Hint 3:

# doc.nodes iterates over all nodes in the document sorted by the word order, but this would be cumbersome to exploit. Find a method of Node to detect the relative word order of two nodes (within the same tree/sentence). #

Hint 4:

# Use node.parent and node.precedes(another_node). # The latter is a shortcut for node.ord < another_node.ord. #

Solution:

# for node in doc.nodes:
#     if node.upos == "ADP":
#         if node.precedes(node.parent):
#             prepositions += 1
#         else:
#             postpositions += 1
#
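
To see why the solution works without loading any data, here is a minimal self-contained sketch of the same logic. `ToyNode` is a hypothetical stand-in written for this illustration, not Udapi's `Node` class; it only mimics the `ord`, `upos`, `parent` and `precedes()` names used above, and the two phrases are invented toy data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyNode:
    """Hypothetical stand-in for a tree node (not udapi.core.node.Node)."""
    ord: int                      # word-order index within the sentence
    form: str
    upos: str
    parent: Optional["ToyNode"] = None

    def precedes(self, other):
        # Same idea as in Hint 4: compare word-order indices.
        return self.ord < other.ord


def count_adpositions(nodes):
    """Count ADP nodes standing before vs. after their parent."""
    prepositions = postpositions = 0
    for node in nodes:
        if node.upos == "ADP" and node.parent is not None:
            if node.precedes(node.parent):
                prepositions += 1
            else:
                postpositions += 1
    return prepositions, postpositions


# Toy phrase 1: the adposition comes before the noun it depends on.
house = ToyNode(2, "house", "NOUN")
in_ = ToyNode(1, "in", "ADP", parent=house)
# Toy phrase 2: the adposition comes after the noun it depends on.
tokyo = ToyNode(3, "Tokyo", "PROPN")
ni = ToyNode(4, "ni", "ADP", parent=tokyo)

print(count_adpositions([in_, house, tokyo, ni]))
```

Because the adposition depends on the noun, comparing the adposition's position with its parent's position is enough to tell the two types apart.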

#
# ## Exercise 2: Explore English postpositions
# The previous exercise indicates there are 7 occurrences of postpositions in the English sample. Find these 7 occurrences and visualize them using `node.draw()`. Count which adpositions (`lemma`) with which dependency relations (`deprel`) are responsible for these occurrences. Recompute these statistics on the bigger English training data. Can you explain these occurrences? What are the reasons? Is any occurrence an annotation error?

# In[ ]:

# For the statistics, you may find useful: count["any string"] += 1
import collections
count = collections.Counter()
big_doc = udapi.Document("UD_English/train.conllu")

for node in doc.nodes:
    # TODO detect postposition
    pass

# Print the statistics
count.most_common()

#

Solution 1:

# for node in doc.nodes:
#     if node.upos == "ADP" and node.parent.precedes(node):
#         node.parent.draw()
#         count[node.lemma + " " + node.deprel] += 1
#

Hint 1:

# We can see there are many particles of phrasal verbs, e.g. "busted up". # These seem to be correctly annotated as ADP according to the UD guidelines. # Let's filter out those cases and focus on the rest, and let's switch to the big train data. #

Solution 2:

# count = collections.Counter()
# for node in big_doc.nodes:
#     if node.upos == "ADP" and node.parent.precedes(node) and node.parent.upos != "VERB":
#         count[node.lemma + " " + node.deprel] += 1
# count.most_common()
#

# As an alternative to the condition node.parent.upos != "VERB", # you could also use the condition node.deprel != "compound:prt", # or directly focus on node.deprel == "case" #
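
The three filters mentioned above do not always select the same nodes. The sketch below compares them on a handful of invented rows (not taken from the treebank), so it runs without Udapi; the `tally` helper and the row layout are assumptions made for this illustration.

```python
import collections

# Invented (lemma, deprel, parent_upos) rows for ADP nodes that follow
# their parent; real values would come from node.lemma, node.deprel and
# node.parent.upos.
observations = [
    ("up", "compound:prt", "VERB"),
    ("of", "fixed", "ADP"),
    ("of", "fixed", "ADP"),
    ("out", "compound:prt", "VERB"),
    ("through", "case", "NOUN"),
]


def tally(rows, keep):
    """Count 'lemma deprel' strings for rows accepted by keep(deprel, parent_upos)."""
    count = collections.Counter()
    for lemma, deprel, parent_upos in rows:
        if keep(deprel, parent_upos):
            count[lemma + " " + deprel] += 1
    return count


not_verb_parent = tally(observations, lambda deprel, upos: upos != "VERB")
no_particles = tally(observations, lambda deprel, upos: deprel != "compound:prt")
only_case = tally(observations, lambda deprel, upos: deprel == "case")

print(not_verb_parent.most_common())
print(no_particles.most_common())
print(only_case.most_common())
```

On these toy rows the first two filters agree, while the third one is stricter because it also drops the `fixed` cases.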

Partial answer:

# Most of the occurrences are actually annotated correctly, # although they are not typically considered postpositions. # For example, node.deprel == "fixed" is used for multi-word adpositions, # such as "because of", where "of" depends on "because" for technical (and consistency) reasons, # but the whole multi-word adposition precedes its governing noun, so it is actually a multi-word preposition. # # What about the remaining occurrences, after keeping only the nodes with node.deprel not in {"compound:prt", "fixed"}? #

# In the next tutorial, 02-blocks.ipynb (not finished yet), we will explore several useful Udapi blocks, some of which may be handy when working further on Exercise 2 or similar tasks.