Making new annotations

It is possible to make your own annotations to the existing nodes and edges of a LAF resource, and use them later on.

This notebook shows how that can be done.

The original LAF source will not be changed.

You make your own annotations and they will be saved in a file. By putting this file in a location where LAF-Fabric can find it, and by instructing LAF-Fabric to include this file, you can do analysis on the original source plus your new annotations.

Annotations are organized by annotation spaces. If you choose a space that is different from the annotation spaces in the main source, your own annotations will be distinguishable from the original annotations.

But you can also choose to override original annotations with your own ones. In that case you have to create your annotations in the shebanq space.
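
The load logs later in this notebook name features by joining annotation space, label and feature name, as in F.dirk_part_intro. The sketch below only illustrates that naming pattern. The helper feature_key is hypothetical and not part of LAF-Fabric.

```python
def feature_key(space, label, name):
    """Join annotation space, label and feature name, as seen in the load log.

    Hypothetical helper for illustration; LAF-Fabric builds these names itself.
    """
    return "{}_{}_{}".format(space, label, name)


# A feature in your own space keeps a name distinct from the original ones.
own = feature_key("dirk", "part", "intro")
original = feature_key("etcbc4", "ft", "sp")
print(own, original)
```

Because the space is part of the name, a feature in the dirk space can never be confused with one in the space of the main source.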

This notebook is not honed by practice yet. There are clumsy steps, such as manually copying files, putting them into directories, and editing certain header files.

That said, this notebook performs the LAF-specific things, and further adaptations do not require deep dives into LAF-Fabric.

Preparation

In order to run this notebook, it is necessary to have an extra annotations package called testparticipants on your system. If you have downloaded data from the given link, you have that directory.

In [1]:
import sys
import collections
import shutil

import pandas
from IPython.display import display
pandas.set_option('display.notebook_repr_html', True)

from laf.fabric import LafFabric
from etcbc.annotating import GenForm

fabric = LafFabric()
  0.00s This is LAF-Fabric 4.3.3
http://laf-fabric.readthedocs.org/en/latest/texts/API-reference.html
In [2]:
API = fabric.load('etcbc4', '--', 'annox_workflow', {
    "primary": True,
    "xmlids": {"node": True, "edge": False},
    "features": ("otype oid monads typ sp book chapter verse", ""),
}, verbose='DETAIL')
exec(fabric.localnames.format(var='fabric'))
  0.00s LOADING API: please wait ... 
  0.00s DETAIL: COMPILING m: UP TO DATE
  0.00s INFO: USING DATA COMPILED AT: 2014-07-14T16-45-08
  0.00s DETAIL: COMPILING a: UP TO DATE
  0.01s DETAIL: load main: P.node_anchor
  0.06s DETAIL: load main: P.node_anchor_items
  0.30s DETAIL: load main: G.node_anchor_min
  0.36s DETAIL: load main: G.node_anchor_max
  0.41s DETAIL: load main: P.node_events
  0.49s DETAIL: load main: P.node_events_items
  0.77s DETAIL: load main: P.node_events_k
  0.85s DETAIL: load main: P.node_events_n
  0.99s DETAIL: load main: G.node_sort
  1.04s DETAIL: load main: G.node_sort_inv
  1.44s DETAIL: load main: G.edges_from
  1.51s DETAIL: load main: G.edges_to
  1.58s DETAIL: load main: P.primary_data
  1.63s DETAIL: load main: X. [node]  -> 
  2.80s DETAIL: load main: X. [node]  <- 
  3.51s DETAIL: load main: F.etcbc4_db_monads [node] 
  4.44s DETAIL: load main: F.etcbc4_db_oid [node] 
  5.30s DETAIL: load main: F.etcbc4_db_otype [node] 
  5.99s DETAIL: load main: F.etcbc4_ft_sp [node] 
  6.20s DETAIL: load main: F.etcbc4_ft_typ [node] 
  6.56s DETAIL: load main: F.etcbc4_sft_book [node] 
  6.58s DETAIL: load main: F.etcbc4_sft_chapter [node] 
  6.59s DETAIL: load main: F.etcbc4_sft_verse [node] 
  6.61s LOGFILE=/Users/dirk/laf-fabric-output/etcbc4/annox_workflow/__log__annox_workflow.txt
  6.61s INFO: DATA LOADED FROM SOURCE etcbc4 AND ANNOX -- FOR TASK annox_workflow AT 2014-07-15T13-57-51

Workflow

This is the workflow for new annotation data:

  1. Tell LAF-Fabric to construct a spreadsheet.
  2. Fill in that spreadsheet.
  3. Ask LAF-Fabric to read the filled-in spreadsheet and turn it into a LAF file.
  4. Place the new annotations file in the right place.
  5. Use the new annotations.
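
The first three steps each leave a file behind, named after the form. The sketch below collects the file names that the cells in this notebook use. The helper workflow_files is hypothetical and only mirrors the naming seen in those cells.

```python
def workflow_files(form_name):
    """Return the file names used in steps 1-3 for a given form name.

    Hypothetical helper; the names mirror the my_file(...) calls in this notebook.
    """
    return {
        "form": "form_{}.csv".format(form_name),    # step 1: blank spreadsheet
        "data": "data_{}.csv".format(form_name),    # step 2: filled-in spreadsheet
        "annot": "annot_{}.xml".format(form_name),  # step 3: LAF annotation file
    }


files = workflow_files("dirk_intro_role")
for step, file_name in sorted(files.items()):
    print(step, file_name)
```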

Step 1: Construct a spreadsheet

In the configuration dictionary below, you can specify a spreadsheet with rows and columns.

On each row you see the textual representation of the objects you specify, from the passages you specify. You can also ask for other columns with feature information for reference. There are extra columns for new features, with names that you specify.

The columns contain the following information:

  1. the corresponding XML identifier in the LAF resource
  2. words
  3. phrases
  4. phrase_type values
  5. part_of_speech values
  6. empty cells for dirk_part_intro
  7. empty cells for dirk_part_role
In [3]:
form = GenForm(API, "dirk_intro_role", {
    'target_types': [
        'word',
        'phrase',
    ],
    'show_features': {
        'etcbc4': {
            'node': [
                "ft.typ,sp",
            ],
        },
    },
    'new_features': {
        'dirk': {
            'node': [
                "part.intro,role",
            ],
        },
    },
    'passages': {
        'Genesis': '1-3',
        'Jesaia': '40,66',
    },
})
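
The feature specifications above are compact: "part.intro,role" stands for two features with the label part. The sketch below shows how such a string could be expanded into the column headers dirk:part.intro and dirk:part.role that appear in the form. The function expand_spec is hypothetical, and how GenForm parses the string internally is an assumption.

```python
def expand_spec(space, spec):
    """Expand 'label.name1,name2' into ['space:label.name1', 'space:label.name2'].

    Hypothetical helper; it only reproduces the column headers seen in the form.
    """
    label, names = spec.split(".", 1)
    return ["{}:{}.{}".format(space, label, name) for name in names.split(",")]


new_columns = []
for spec in ["part.intro,role"]:
    new_columns.extend(expand_spec("dirk", spec))
print(new_columns)
```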

The form create function

Run the make_form function:

In [4]:
form.make_form()
  8.02s Reading the data ...
Genesis1,2,3,**********Jesaia40,66,***************************  9.79s Done

Look at the form in its text form:

In [5]:
form_data = pandas.read_csv(my_file("form_{}.csv".format(form.name)), sep='\t', na_filter=False)
form_data.head(15)
Out[5]:
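 | passage | word | phrase | typ | sp | dirk:part.intro | dirk:part.role
0 | #Genesis 1:1 | | | | | |
1 | n7 | | בְּרֵאשִׁ֖ית | PP | | |
2 | n5 | בְּ | | | prep | |
3 | n12 | רֵאשִׁ֖ית | | | subs | |
4 | n13 | בָּרָ֣א | | | verb | |
5 | n15 | | בָּרָ֣א | VP | | |
6 | n16 | אֱלֹהִ֑ים | | | subs | |
7 | n18 | | אֱלֹהִ֑ים | NP | | |
8 | n22 | | אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ | PP | | |
9 | n20 | אֵ֥ת | | | prep | |
10 | n23 | הַ | | | art | |
11 | n24 | שָּׁמַ֖יִם | | | subs | |
12 | n26 | וְ | | | conj | |
13 | n27 | אֵ֥ת | | | prep | |
14 | n28 | הָ | | | art | |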

15 rows × 7 columns

Step 2: Fill in the spreadsheet

It is time to fill in your form. First we rename it, so that when you create a new blank form, it will not overwrite any data you have already filled in.

We replace the prefix form_ in the file name by data_.

In [6]:
my_form = my_file("form_{}.csv".format(form.name))
my_data = my_file("data_{}.csv".format(form.name))
shutil.move(my_form, my_data)
Out[6]:
'/Users/dirk/laf-fabric-output/etcbc4/annox_workflow/data_dirk_intro_role.csv'
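
A word of caution: if a data file already exists when you run the move again, shutil.move may overwrite it, depending on the platform. The sketch below refuses to do that. The helper move_keeping_existing is hypothetical, and the demo runs in a temporary directory with made-up file names.

```python
import os
import shutil
import tempfile


def move_keeping_existing(src, dst):
    """Move src to dst, but refuse to overwrite a dst that already exists."""
    if os.path.exists(dst):
        raise FileExistsError("{} already exists; rename it first".format(dst))
    return shutil.move(src, dst)


# Demo in a throwaway directory: the filled-in data file must survive.
with tempfile.TemporaryDirectory() as tmp:
    form_path = os.path.join(tmp, "form_demo.csv")
    data_path = os.path.join(tmp, "data_demo.csv")
    with open(form_path, "w") as fh:
        fh.write("blank form\n")
    with open(data_path, "w") as fh:
        fh.write("filled in\n")
    try:
        move_keeping_existing(form_path, data_path)
        refused = False
    except FileExistsError:
        refused = True
    with open(data_path) as fh:
        kept = fh.read()

print(refused, repr(kept))
```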

Open the data file as a spreadsheet. OpenOffice is recommended for that, because it handles Unicode well.

Fill in your feature values as desired and save.

On Mac OS X we can open the spreadsheet straight away:

In [7]:
!open /Applications/OpenOffice.app --args {my_data}

Here you see the latest content of the form:

In [8]:
form_data = pandas.read_csv(my_file("data_{}.csv".format(form.name)), sep='\t', na_filter=False)
form_data.head(15)
Out[8]:
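 | passage | word | phrase | typ | sp | dirk:part.intro | dirk:part.role
0 | #Genesis 1:1 | | | | | |
1 | n7 | | בְּרֵאשִׁ֖ית | PP | | |
2 | n5 | בְּ | | | prep | |
3 | n12 | רֵאשִׁ֖ית | | | subs | |
4 | n13 | בָּרָ֣א | | | verb | |
5 | n15 | | בָּרָ֣א | VP | | |
6 | n16 | אֱלֹהִ֑ים | | | subs | |
7 | n18 | | אֱלֹהִ֑ים | NP | | |
8 | n22 | | אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ | PP | | |
9 | n20 | אֵ֥ת | | | prep | aap |
10 | n23 | הַ | | | art | |
11 | n24 | שָּׁמַ֖יִם | | | subs | |
12 | n26 | וְ | | | conj | |
13 | n27 | אֵ֥ת | | | prep | | noot
14 | n28 | הָ | | | art | |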

15 rows × 7 columns
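
Most rows stay empty, so it helps to list only the rows where a new feature has been filled in. The sketch below does that with the standard csv module on a small in-memory sample. The sample rows and their column layout are an assumption modelled on the form shown above, not the real data file.

```python
import csv
import io

# Illustrative sample in the same tab-separated layout as the data file.
sample = (
    "passage\tword\tphrase\ttyp\tsp\tdirk:part.intro\tdirk:part.role\n"
    "#Genesis 1:1\t\t\t\t\t\t\n"
    "n20\tאֵ֥ת\t\t\tprep\taap\t\n"
    "n23\tהַ\t\t\tart\t\t\n"
    "n27\tאֵ֥ת\t\t\tprep\t\tnoot\n"
)

new_columns = ["dirk:part.intro", "dirk:part.role"]
filled = []
for row in csv.DictReader(io.StringIO(sample), delimiter="\t"):
    if row["passage"].startswith("#"):
        continue  # passage heading row, not an object
    values = {col: row[col] for col in new_columns if row[col]}
    if values:
        filled.append((row["passage"], values))

for xml_id, values in filled:
    print(xml_id, values)
```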

Step 3: Turn the spreadsheet into annotations

The data file is going to be turned into an XML file with annotations to the original LAF resource.

Here is the function doing that.

In [9]:
form.make_annots()
my_annots = my_file("annot_{}.xml".format(form.name))

Look at the freshly created annotations.

In [10]:
!cat {my_annots}
<?xml version="1.0" encoding="UTF-8"?>
    <graph xmlns="http://www.xces.org/ns/GrAF/1.0/" xmlns:graf="http://www.xces.org/ns/GrAF/1.0/">
    <graphHeader>
        <labelsDecl/>
        <dependencies/>
        <annotationSpaces/>
    </graphHeader>
    <a xml:id="a1" as="dirk" label="part" ref="n27"><fs>
	<f name="role" value="noot"/>
</fs></a>
<a xml:id="a2" as="dirk" label="part" ref="n47"><fs>
	<f name="intro" value="mies"/>
	<f name="role" value="karel"/>
</fs></a>
<a xml:id="a3" as="dirk" label="part" ref="n20"><fs>
	<f name="intro" value="aap"/>
</fs></a>
</graph>
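
To inspect such a file programmatically rather than by eye, the standard xml.etree.ElementTree module is enough. The sketch below reads a shortened copy of the annotations above. The function read_annotations is hypothetical and is not how LAF-Fabric itself parses annotation files.

```python
import xml.etree.ElementTree as ET

GRAF = "{http://www.xces.org/ns/GrAF/1.0/}"  # namespace declared on <graph>

# Shortened copy of the annotation file shown above.
sample = """<graph xmlns="http://www.xces.org/ns/GrAF/1.0/">
<a xml:id="a1" as="dirk" label="part" ref="n27"><fs>
    <f name="role" value="noot"/>
</fs></a>
<a xml:id="a2" as="dirk" label="part" ref="n47"><fs>
    <f name="intro" value="mies"/>
    <f name="role" value="karel"/>
</fs></a>
</graph>"""


def read_annotations(xml_text):
    """Return (ref, space, label, features) for every <a> element."""
    root = ET.fromstring(xml_text)
    result = []
    for annot in root.iter(GRAF + "a"):
        features = {f.get("name"): f.get("value") for f in annot.iter(GRAF + "f")}
        result.append((annot.get("ref"), annot.get("as"), annot.get("label"), features))
    return result


annots = read_annotations(sample)
for annot in annots:
    print(annot)
```

Each annotation targets one node through ref, and carries its space and label as attributes, with the feature values inside the fs element.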

Step 4: Place the new annotation file in the right directory

Here is the directory where extra annotations live. We call it the annox directory.

In [11]:
fabric.lafapi.names.env['a_source_dir'][0:-3]
Out[11]:
'/Users/dirk/laf-fabric-data/etcbc4/annotations'

Extra annotations are organized in packages. You can place multiple new annotation files in a package.

Let us add this file to the existing package called testparticipants, a subdirectory of the annox directory.

We put the new file in this directory:

In [12]:
shutil.copy(
    my_annots,
    "{}/testparticipants".format(fabric.lafapi.names.env['a_source_dir'][0:-3])
)
Out[12]:
'/Users/dirk/laf-fabric-data/etcbc4/annotations/testparticipants/annot_dirk_intro_role.xml'

If you have more files, you can place them in the same directory, and for each file you have to add a line to the header file in that directory, like the existing line, resulting in:

<annotation f.id="f_dirk1" loc="annot_dirk_intro_role.xml"/>
<annotation f.id="f_dirk2" loc="annot_dirk_other.xml"/>

The point is that every file in the package must be mentioned in the header file.
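
Writing these lines by hand is error prone once a package holds several files. The sketch below generates them from a list of file names. The helper header_lines is hypothetical, and the numbering scheme for f.id (f_dirk1, f_dirk2, ...) is an assumption taken from the two example lines above. You still have to paste the result into the header file yourself.

```python
from xml.sax.saxutils import quoteattr


def header_lines(file_names, id_prefix="f_dirk"):
    """Build one <annotation .../> line per annotation file.

    Hypothetical helper; the id scheme follows the example lines in this notebook.
    """
    lines = []
    for number, file_name in enumerate(file_names, start=1):
        file_id = "{}{}".format(id_prefix, number)
        lines.append(
            "<annotation f.id={} loc={}/>".format(quoteattr(file_id), quoteattr(file_name))
        )
    return lines


lines = header_lines(["annot_dirk_intro_role.xml", "annot_dirk_other.xml"])
print("\n".join(lines))
```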

Step 5: Use the new annotations

You can invoke an extra annotation package by mentioning it in the statement where you initialize the processor.

The processor looks for the package, checks whether it has to be compiled, and if so, compiles it. Then the data corresponding to testparticipants is loaded after the main source has been loaded.

Whatever task you perform, it will have access to the new annotations.
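
To get a feeling for what "checks whether it has to be compiled" can mean, here is a sketch of one common approach: compare modification times of the source files with that of the compiled result. This is an assumption for illustration only. The function needs_compile is hypothetical, and LAF-Fabric may decide differently.

```python
import os
import tempfile


def needs_compile(source_paths, compiled_path):
    """Return True if there is no compiled result or a source file is newer."""
    if not os.path.exists(compiled_path):
        return True
    compiled_time = os.path.getmtime(compiled_path)
    return any(os.path.getmtime(path) > compiled_time for path in source_paths)


# Demo with made-up files and explicit timestamps.
with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "annot_demo.xml")
    compiled = os.path.join(tmp, "compiled.bin")
    open(source, "w").close()
    before = needs_compile([source], compiled)      # nothing compiled yet
    open(compiled, "w").close()
    os.utime(source, (1000, 1000))
    os.utime(compiled, (2000, 2000))
    up_to_date = needs_compile([source], compiled)  # compiled after the source
    os.utime(source, (3000, 3000))
    stale = needs_compile([source], compiled)       # source changed afterwards

print(before, up_to_date, stale)
```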

In [13]:
fabric.load('etcbc4', 'testparticipants', 'annox_workflow', {
    "primary": True,
    "xmlids": {
        "node": True,
        "edge": False,
    },
    "features": {
        "etcbc4": {
            "node": [
                "db.otype",
                "sft.label",
            ],
            "edge": [
            ],
        },
        "dirk": {
            "node": [
                "part.intro,role",
            ],
        }
    },
})
exec(fabric.localnames.format(var='fabric'))
  0.00s LOADING API: please wait ... 
  0.00s DETAIL: COMPILING m: UP TO DATE
  0.00s INFO: USING DATA COMPILED AT: 2014-07-14T16-45-08
  0.00s BEGIN COMPILE a: testparticipants
  0.00s DETAIL: load main: X. [node]  -> 
  1.35s DETAIL: load main: X. [e]  -> 
  3.51s DETAIL: load main: G.node_anchor_min
  3.57s DETAIL: load main: G.node_anchor_max
  3.62s DETAIL: load main: G.node_sort
  3.67s DETAIL: load main: G.node_sort_inv
  4.21s DETAIL: load main: G.edges_from
  4.28s DETAIL: load main: G.edges_to
  4.35s LOGFILE=/Users/dirk/laf-fabric-data/etcbc4/bin/A/testparticipants/__log__compile__.txt
  4.35s PARSING ANNOTATION FILES
  4.36s INFO: parsing annot_dirk_intro_role.xml
  4.36s INFO: END PARSING
         0 good   regions  and     0 faulty ones
         0 linked nodes    and     0 unlinked ones
         0 good   edges    and     0 faulty ones
         3 good   annots   and     0 faulty ones
         4 good   features and     0 faulty ones
         3 distinct xml identifiers

  4.36s MODELING RESULT FILES
  4.36s INFO: CONNECTIVITY
  4.55s WRITING RESULT FILES for a
  4.56s DETAIL: write annox: F.dirk_part_intro [node] 
  4.56s DETAIL: write annox: F.dirk_part_role [node] 
  4.56s END   COMPILE a: testparticipants
  4.66s INFO: USING DATA COMPILED AT: 2014-07-15T13-59-19
  4.66s DETAIL: keep main: P.node_anchor
  4.66s DETAIL: keep main: P.node_anchor_items
  4.66s DETAIL: keep main: G.node_anchor_min
  4.66s DETAIL: keep main: G.node_anchor_max
  4.67s DETAIL: keep main: P.node_events
  4.67s DETAIL: keep main: P.node_events_items
  4.67s DETAIL: keep main: P.node_events_k
  4.67s DETAIL: keep main: P.node_events_n
  4.67s DETAIL: keep main: G.node_sort
  4.67s DETAIL: keep main: G.node_sort_inv
  4.67s DETAIL: keep main: G.edges_from
  4.67s DETAIL: keep main: G.edges_to
  4.67s DETAIL: keep main: P.primary_data
  4.67s DETAIL: keep main: X. [node]  -> 
  4.67s DETAIL: keep main: X. [node]  <- 
  4.67s DETAIL: keep main: F.etcbc4_db_otype [node] 
  4.67s DETAIL: clear main: F.etcbc4_db_monads [node] 
  4.67s DETAIL: clear main: F.etcbc4_db_oid [node] 
  4.67s DETAIL: clear main: F.etcbc4_ft_sp [node] 
  4.67s DETAIL: clear main: F.etcbc4_ft_typ [node] 
  4.67s DETAIL: clear main: F.etcbc4_sft_book [node] 
  4.67s DETAIL: clear main: F.etcbc4_sft_chapter [node] 
  4.67s DETAIL: clear main: F.etcbc4_sft_verse [node] 
  4.67s DETAIL: load main: F.dirk_part_intro [node] 
  4.68s DETAIL: load main: F.dirk_part_role [node] 
  4.68s DETAIL: load main: F.etcbc4_sft_label [node] 
  4.69s DETAIL: load annox: F.dirk_part_intro [node] 
  4.70s DETAIL: load annox: F.dirk_part_role [node] 
  4.70s DETAIL: load annox: F.etcbc4_db_otype [node] 
  4.70s DETAIL: load annox: F.etcbc4_sft_label [node] 
  4.70s INFO: DATA LOADED FROM SOURCE etcbc4 AND ANNOX testparticipants FOR TASK annox_workflow AT 2014-07-15T13-59-19

So let us check which objects have got annotations.

For every object we show its type, its XML id in the LAF source, its primary data and the two new feature values, if applicable.

In [14]:
msg("Looking for fresh annotations ...")
cur_verse = None

for node in NN():
    otype = F.otype.v(node)
    if otype == 'verse':
        cur_verse = node
        continue
    intro = F.dirk_part_intro.v(node)
    role = F.dirk_part_role.v(node)
    if intro is not None or role is not None:
        verse = F.label.v(cur_verse)
        text = " ".join([txt for (n, txt) in P.data(node)])
        xmlid = X.r(node)
        print("{:<12} {:<6} id={:<8} {:<17}{:<16} {:<20}".format(
            verse,
            otype, 
            xmlid,
            "intro={:<10} ".format(intro) if intro is not None else '',
            "role={:<10} ".format(role) if role is not None else '',
            text,
        ))
msg("Done")
    10s Looking for fresh annotations ...
    15s Done
 GEN 01,01   word   id=n20      intro=aap                         אֵ֥ת                
 GEN 01,01   word   id=n27                       role=noot        אֵ֥ת                
 GEN 01,02   word   id=n47      intro=mies       role=karel       תֹ֨הוּ֙             
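
The loop above relies on one pattern: walk the nodes in order and remember the most recent verse, so that every annotated object can be reported with its verse. The sketch below isolates that pattern on a small hand-made list. The tuples are illustrative stand-ins for LAF-Fabric nodes, and the pattern assumes that a verse node comes before the objects it contains.

```python
# Illustrative stand-ins for nodes: (object type, identifier, new feature values).
nodes = [
    ("verse", "GEN 01,01", None),
    ("word", "n20", {"intro": "aap"}),
    ("word", "n23", {}),
    ("word", "n27", {"role": "noot"}),
    ("verse", "GEN 01,02", None),
    ("word", "n47", {"intro": "mies", "role": "karel"}),
]


def annotated_with_verse(stream):
    """Yield (verse, identifier, features) for every object that has new features."""
    cur_verse = None
    for otype, ident, features in stream:
        if otype == "verse":
            cur_verse = ident  # remember the verse for the objects that follow
            continue
        if features:
            yield (cur_verse, ident, features)


found = list(annotated_with_verse(nodes))
for item in found:
    print(item)
```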