#!/usr/bin/env python
# coding: utf-8

# # Introduction
# Udapi is an API and framework for processing [Universal Dependencies](http://universaldependencies.org/). In this tutorial, we will focus on the Python version of Udapi. Perl and Java versions are [available](http://udapi.github.io/) as well, but they are missing some of the features.
#
# Udapi can be used from the shell (e.g. Bash), using the wrapper script `udapy`. It can also be used as a library, from Python, IPython or Jupyter notebooks. We will show both of these ways below.
#
# This tutorial uses Details sections for extra info (if you want to know more or if you run into problems). You need to click on a section to show its content.
#
Details # It is a substitute for footnotes. The content may be long and showing it in the main text may be distracting. #
# # ### Install (upgrade) Udapi # First, make sure you have the newest version of Udapi. If you have already installed Udapi [using git clone](https://github.com/udapi/udapi-python#install-udapi-for-developers), just run `git pull`. If you have not installed Udapi yet, run #
Details # #
# In[ ]:

get_ipython().system('pip3 install --user --upgrade git+https://github.com/udapi/udapi-python.git')

# Now, make sure you can run the command-line interface `udapy`, e.g. by printing the help message.

# In[1]:

get_ipython().system('udapy -h')

#
Details: If the previous command fails with "udapy: command not found"
# This means that Udapi is not properly installed. When installing Udapi with pip3 --user, it is installed into ~/.local/lib/python3.6/site-packages/udapi/ (or similar depending on your Python version) and the wrapper into ~/.local/bin. Thus you need to add the wrapper directory to your PATH:
#
# export PATH="$HOME/.local/bin/:$PATH"
# 
#
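# The PATH check described above can also be done from Python. The following is a minimal sketch using only the standard library; the helper name `find_wrapper` is hypothetical (it is not part of Udapi), and `~/.local/bin` is assumed to be the directory where `pip3 --user` placed the wrapper.

```python
import os
import shutil


def find_wrapper(name="udapy", extra_dir=None):
    """Return the full path of `name` if it is found on PATH, else None.

    If `extra_dir` is given, it is searched first (without changing the
    real environment), which mimics prepending it to PATH.
    """
    search_path = os.environ.get("PATH", "")
    if extra_dir:
        search_path = extra_dir + os.pathsep + search_path
    return shutil.which(name, path=search_path)


# Assumed install location of the wrapper after `pip3 install --user`
local_bin = os.path.join(os.path.expanduser("~"), ".local", "bin")

print("udapy on the current PATH:", find_wrapper())
print("udapy with ~/.local/bin prepended:", find_wrapper(extra_dir=local_bin))
```

# If the first line prints `None` and the second prints a path, the wrapper is installed but the directory is missing from your PATH.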
# # Browse CoNLL-U files
# ### Get sample UD data
#
# Download and extract [ud20sample.tgz](http://ufal.mff.cuni.cz/~popel/udapi/ud20sample.tgz). There are just 100 sentences for each of the 70 treebanks (`sample.conllu`), plus 4 bigger files (`train.conllu` and `dev.conllu`) for German, English, French and Czech. For full UD ([2.0](https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-1983) or [newer](https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-3424)), go to [Lindat](https://lindat.cz).

# In[3]:

get_ipython().system('wget http://ufal.mff.cuni.cz/~popel/udapi/ud20sample.tgz')
get_ipython().system('tar -xf ud20sample.tgz')
get_ipython().run_line_magic('cd', 'sample')

# Let's choose one of the sample files and see the raw [CoNLL-U format](https://universaldependencies.org/format.html).
#
Details: executing from Bash, IPython, Jupyter # #
#
# In[4]:

cat UD_Ancient_Greek/sample.conllu | head

# ### Browse conllu files with `udapy -T`
# While the CoNLL-U format was designed with readability (by both machines and humans) in mind, it may still be a bit difficult for humans to read and interpret. Let's visualize the dependency tree structure using ASCII art by piping the conllu file into `udapy -T`.

# In[5]:

cat UD_Ancient_Greek/sample.conllu | udapy -T | head -n 20

#
Details: # #
#
# `udapy -T` is a shortcut for `udapy write.TextModeTrees color=1`, where `write.TextModeTrees` is a so-called *block* (a basic Udapi processing unit) and `color=1` is its parameter. See [the documentation](https://udapi.readthedocs.io/en/latest/udapi.block.write.html#module-udapi.block.write.textmodetrees) (or even [the source code](https://github.com/udapi/udapi-python/blob/master/udapi/block/write/textmodetrees.py)) of `write.TextModeTrees` to learn about further parameters. Now, let's also print the LEMMA and MISC columns and display the columns vertically aligned using the parameters `layout=align attributes=form,lemma,upos,deprel,misc`.

# In[6]:

cat UD_Ancient_Greek/sample.conllu | udapy -q write.TextModeTrees color=1 layout=align attributes=form,lemma,upos,deprel,misc | head -n 20

# ### Browse conllu files from IPython/Jupyter
# So far, we have been using Udapi only via its command-line interface `udapy`, which is handy, but not very Pythonic. So let's now use Udapi as a library, load the English conllu sample file into a document `doc` and visualize the sixth tree (i.e. `doc[5]` in zero-based indexing).

# In[7]:

import udapi
doc = udapi.Document("UD_English/sample.conllu")
doc[5].draw()

#
Details: # #
#
# The `draw()` method takes the same parameters as the `write.TextModeTrees` block, so we can for example display only the node ID (aka `ord`, i.e. word-order index), form and [universal (morpho-syntactic) features](https://universaldependencies.org/u/feat/index.html).
#

# In[8]:

doc[5].draw(layout="align", attributes="ord,form,feats")

# # Document representation in Udapi
#
# A Udapi [document](https://github.com/udapi/udapi-python/blob/master/udapi/core/document.py) consists of a sequence of so-called *bundles*, mirroring a sequence of sentences in a typical natural language text.
#
# A [bundle](https://github.com/udapi/udapi-python/blob/master/udapi/core/bundle.py) corresponds to a sentence,
# possibly in multiple versions or with different representations, such as sentence-tuples from parallel corpora, or paraphrases in the same language or alternative analyses (e.g. parses produced by different parsers). If there are more trees in a bundle, they must be distinguished by a so-called *zone* (a label which contains the language code).
#
# Each tree is represented by a special (artificial) [root](https://github.com/udapi/udapi-python/blob/master/udapi/core/root.py) node, which is added to the top of a CoNLL-U tree in the Udapi model. The root node bears the ID of a given tree/sentence (`sent_id`) and its word order (`ord`) is 0. Technically, Root is a subclass of Node, with some extra methods.
#
# The [Node](https://github.com/udapi/udapi-python/blob/master/udapi/core/node.py) class corresponds to a node
# of a dependency tree. It provides access to all the CoNLL-U-defined attributes (`ord`, `form`, `lemma`, `upos`, `xpos`, `feats`, `deprel`, `deps`, `misc`).
# There are methods for tree traversal (`parent`, `root`, `children`, `descendants`); word-order traversal (`next_node`, `prev_node`); tree manipulation (`parent` setter) including word-order changes (`shift_after_node(x)`, `shift_before_subtree(x)`, etc.); and utility methods: `is_descendant_of(x)`, `is_nonprojective()`, `precedes(x)`, `is_leaf()`, `is_root()`, `get_attrs([])`, `compute_text()`, `draw()`.
#
# ## Exercise 1: Count prepositions and postpositions
# [Prepositions and postpositions](https://en.wikipedia.org/wiki/Preposition_and_postposition) are together called *adpositions* and assigned the [ADP](https://universaldependencies.org/u/pos/ADP.html) universal part-of-speech tag (`upos`) in UD. Some languages (e.g. English) use mostly prepositions, others mostly postpositions.
# * Do you know any English postpositions?
# * Guess the typical adposition type (i.e. whether a given language uses more prepositions or postpositions) for at least 10 languages of your choice (from those in UD2.0).
# * Complete the following code and find out how many prepositions and postpositions are in `UD_English/sample.conllu` (which has been loaded into `doc`).

# In[ ]:

prepositions, postpositions = 0, 0
# Iterate over all nodes in the document (in all trees)
for node in doc.nodes:
    if node.upos == "ADP":
        # TODO: fix this code to actually distinguish prepositions and postpositions
        prepositions += 1
# Print the results
prepositions, postpositions

# If you don't know how to proceed, click on the following hints.
#
Hint 1: # In some dependency grammars, adpositions govern nouns (i.e. an adposition is the *parent* of a given noun node). In other dependency grammars, adpositions depend on nouns (i.e. a noun is the *parent* of a given adposition). Find out which style is being used by UD. Check the UD documentation or inspect some of the tree visualizations and guess. #
#
Hint 2: # See the Node documentation and find out how to obtain dependency parent and dependency children. Note that these are properties of a given node, rather than methods, so you should not write parentheses () after the property name. #
#
Hint 3: # doc.nodes iterates over all nodes in the document sorted by the word order, but this would be cumbersome to exploit. Find a method of Node to detect the relative word order of two nodes (within the same tree/sentence). #
#
Hint 4: # Use node.parent and node.precedes(another_node). # The latter is a shortcut for node.ord < another_node.ord. #
#
Solution: #
# for node in doc.nodes:
#     if node.upos == "ADP":
#         if node.precedes(node.parent):
#             prepositions += 1
#         else:
#             postpositions += 1
# 
#
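# The logic of the solution (an adposition that precedes its dependency parent counts as a preposition, otherwise as a postposition) can also be sketched without Udapi, directly on the CoNLL-U columns. This is only an illustration under stated assumptions: the two toy sentences are hand-written for this example (they are not taken from the UD sample), and the helpers `conllu_line` and `count_adpositions` are hypothetical names, not Udapi functions.

```python
def conllu_line(ord_, form, upos, head, deprel):
    """Build one tab-separated CoNLL-U token line; unused columns are '_'."""
    return "\t".join([str(ord_), form, "_", upos, "_", "_", str(head), deprel, "_", "_"])


def count_adpositions(conllu_text):
    """Count ADP tokens that precede (prepositions) or follow (postpositions) their head."""
    prepositions = postpositions = 0
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue  # skip multi-word token ranges (1-2) and empty nodes (1.1)
        ord_, upos, head = int(cols[0]), cols[3], int(cols[6])
        if upos != "ADP":
            continue
        if ord_ < head:  # same comparison as node.precedes(node.parent)
            prepositions += 1
        else:
            postpositions += 1
    return prepositions, postpositions


# Toy data (invented for illustration): "The cat sat on the mat" and "He gave up"
sample = "\n".join([
    "# sent_id = toy-1",
    conllu_line(1, "The", "DET", 2, "det"),
    conllu_line(2, "cat", "NOUN", 3, "nsubj"),
    conllu_line(3, "sat", "VERB", 0, "root"),
    conllu_line(4, "on", "ADP", 6, "case"),
    conllu_line(5, "the", "DET", 6, "det"),
    conllu_line(6, "mat", "NOUN", 3, "obl"),
    "",
    "# sent_id = toy-2",
    conllu_line(1, "He", "PRON", 2, "nsubj"),
    conllu_line(2, "gave", "VERB", 0, "root"),
    conllu_line(3, "up", "ADP", 2, "compound:prt"),
])

print(count_adpositions(sample))  # → (1, 1)
```

# Here "on" precedes its parent "mat", while "up" follows its parent "gave", so this purely positional test counts it as a postposition.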
#
# ## Exercise 2: Explore English postpositions
# The previous exercise indicates that there are 7 occurrences of postpositions in the English sample. Find these 7 occurrences and visualize them using `node.draw()`. Count which adpositions (`lemma`) with which dependency relations (`deprel`) are responsible for these occurrences. Recompute these statistics on the bigger English training data. Can you explain these occurrences? What are the reasons? Is any occurrence an annotation error?

# In[ ]:

# For the statistics, you may find this useful: count["any string"] += 1
import collections
count = collections.Counter()
big_doc = udapi.Document("UD_English/train.conllu")

for node in doc.nodes:
    # TODO detect postposition
    pass

# Print the statistics
count.most_common()

#
Solution 1: #
# for node in doc.nodes:
#     if node.upos == "ADP" and node.parent.precedes(node):
#         node.parent.draw()
#         count[node.lemma + " " + node.deprel] += 1
# 
#
#
Hint 1: # We can see there are many particles of phrasal verbs, e.g. "busted up". # These seem to be correctly annotated as ADP according to the UD guidelines. # Let's filter out those cases and focus on the rest, and let's switch to the big training data. #
#
Solution 2: #
# count = collections.Counter()
# for node in big_doc.nodes:
#     if node.upos == "ADP" and node.parent.precedes(node) and node.parent.upos != "VERB":
#         count[node.lemma + " " + node.deprel] += 1
# count.most_common()
# 
# As an alternative to the condition node.parent.upos != "VERB", # you could also require node.deprel != "compound:prt", # or directly focus on node.deprel == "case" #
#
Partial answer: # Most of the occurrences are actually annotated correctly, # although they are not typically considered postpositions. # For example, node.deprel == "fixed" is used for multi-word adpositions, # such as "because of", where "of" depends on "because" for technical (and consistency) reasons, # but the whole multi-word adposition precedes its governing noun, so it is actually a multi-word preposition. # # What about the remaining occurrences, i.e. those with node.deprel not in {"compound:prt", "fixed"}? #
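# The counting-and-filtering pattern used in the solutions above (a `collections.Counter` keyed by `lemma + " " + deprel`, optionally skipping some relations) can be tried out in isolation. This is a minimal sketch: the `observed` pairs are placeholders invented for illustration (they are not counts from the UD data), and `tally` is a hypothetical helper name.

```python
import collections

# Placeholder (lemma, deprel) pairs standing in for postposed ADP nodes;
# "X" marks an unspecified lemma. These are NOT real treebank statistics.
observed = [
    ("up", "compound:prt"),
    ("of", "fixed"),
    ("up", "compound:prt"),
    ("X", "case"),
    ("of", "fixed"),
    ("up", "compound:prt"),
]


def tally(pairs, skip_deprels=frozenset()):
    """Count 'lemma deprel' strings, ignoring pairs whose deprel is in skip_deprels."""
    count = collections.Counter()
    for lemma, deprel in pairs:
        if deprel not in skip_deprels:
            count[lemma + " " + deprel] += 1
    return count


all_counts = tally(observed)
rest = tally(observed, skip_deprels={"compound:prt", "fixed"})

print(all_counts.most_common())  # most frequent combinations first
print(rest.most_common())        # what remains after the filter
```

# With real data, the pairs would come from `(node.lemma, node.deprel)` of the nodes selected in Solution 2.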
# In the next tutorial, 02-blocks.ipynb (not finished yet), we will explore several useful Udapi blocks, some of which may be handy when working further on Exercise 2 or similar tasks.