About this notebook¶

This notebook illustrates how LAF-Fabric works. LAF-Fabric is a tool to analyze the data inside LAF (Linguistic Annotation Framework) resources. We use it for a particular LAF resource: the Hebrew Bible with linguistic annotations. The software that converts the Hebrew Bible to LAF is part of LAF-Fabric (see emdros2laf).

NB 1. This is a static copy of the Gender notebook. You can download it, and if you have IPython and LAF-Fabric installed, you can run it. You can also create many more notebooks like this one to look for patterns in the Hebrew Bible.

NB 2. All software involved is open source, and the data is Open Access (not for commercial use).

Gender in the Hebrew Bible¶

In the data, each Hebrew word is marked as masculine, feminine, or of unknown gender.

We want to plot the percentage of masculine and feminine words per chapter.

The LAF way¶

In the Hebrew LAF data, some nodes are annotated as word, and some nodes as chapter (there are many more kinds of node, of course).

The names of chapters and the genders of words are coded as features inside annotations to these nodes.

More on feature names¶

The features we need are present in an annotation space named etcbc4 (after the name and version of this LAF resource). The chapter features are labeled with sft and the other features with ft.

When LAF-Fabric compiles features into binary data, it forgets the annotations in which the features come, but the annotation space and label are retained in a double prefix to the feature name.

LAF-Fabric remembers those features by their fully qualified names: etcbc4:ft.gn, etcbc4:sft.chapter etc. There may also be annotations without feature contents.
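To make the naming scheme concrete, here is a minimal sketch that splits a fully qualified name into annotation space, label and feature name, and joins the parts with underscores in the style of F.etcbc4_db_otype. The helpers split_feature and member_name are hypothetical illustrations, not part of the LAF-Fabric API.

```python
from typing import NamedTuple


class FeatureName(NamedTuple):
    space: str  # annotation space, e.g. "etcbc4"
    label: str  # annotation label, e.g. "ft" or "sft"
    name: str   # bare feature name, e.g. "gn"


def split_feature(qualified: str) -> FeatureName:
    """Split 'space:label.name' into its three parts (hypothetical helper)."""
    space, rest = qualified.split(":", 1)
    label, name = rest.split(".", 1)
    return FeatureName(space, label, name)


def member_name(feature: FeatureName) -> str:
    """Join the three parts with underscores (hypothetical helper)."""
    return "_".join(feature)


gender = split_feature("etcbc4:ft.gn")
chapter = split_feature("etcbc4:sft.chapter")
print(gender, member_name(gender))
print(chapter, member_name(chapter))
```

The double prefix is what keeps two features with the same bare name apart when they live under different labels or annotation spaces.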

Importing¶

The next cell loads the required libraries and creates a task processor.

In [1]:

import sys
import collections

from laf.fabric import LafFabric
fabric = LafFabric(verbose='DETAIL')

  0.00s This is LAF-Fabric 4.8.1
API reference: http://laf-fabric.readthedocs.org/en/latest/texts/API-reference.html
Feature doc: https://shebanq.ancient-data.org/static/docs/featuredoc/texts/welcome.html

  0.00s DETAIL: Data dir = /Users/dirk/laf/laf-fabric-data
  0.00s DETAIL: Laf dir = /Users/dirk/laf/laf-fabric-data
  0.00s DETAIL: Output dir = /Users/dirk/laf/laf-fabric-output

Loading¶

The processor needs data. Here is where we say what data to load. We do not need the XML identifiers as they show up in the original LAF resource. But we do need a few features of nodes, namely the ones that give us the gender of the words, and the numbers of the chapters and the books in which the chapters are contained.

The load function actually draws that data in, and it will take a few seconds.

It needs to know the name of the source. This name corresponds with a subdirectory in your work_dir.

The '--' means that we do not draw in an annox (extra annotation package). If you want to do that, this is the place to give the name of such a package, which must be a subdirectory name inside the annotations directory in your work_dir.

Then gender is just a name we choose to give to this task. This name determines where on the filesystem the log file and output (if any) will be put: a subdirectory gender inside the source directory inside your output_dir.
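As a rough sketch of that layout, the following composes the task directory from an output directory, a source name and a task name. The function names are hypothetical. The concrete values and the log file name pattern are taken from the LOGFILE line in the loading output of this notebook, and the code only builds path strings without touching the file system.

```python
from pathlib import PurePosixPath


def task_dir(output_dir: str, source: str, task: str) -> PurePosixPath:
    """Directory for a task: <output_dir>/<source>/<task> (hypothetical helper)."""
    return PurePosixPath(output_dir) / source / task


def log_file(output_dir: str, source: str, task: str) -> PurePosixPath:
    """Log file inside the task directory, named after the task."""
    return task_dir(output_dir, source, task) / "__log__{}.txt".format(task)


path = log_file("/Users/dirk/laf/laf-fabric-output", "etcbc4b", "gender")
print(path)
```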

The last argument to load() is a dictionary of data items to load.

The primary key indicates whether the primary data itself must be loaded. Tasks can then use methods to find the primary data that is attached to a node. For the Hebrew data this is hardly necessary, because the words have textual information as features on them.

The xmlids are tables mapping nodes and edges to the XML identifiers they have in the original LAF source. Most tasks do not need this. A task needs the original identifiers only when it links new annotations to nodes and edges and writes the result as an additional LAF file.

The features to be loaded are specified by two strings, one for node features and one for edge features. For all these features, data will be loaded, and all other features' data will be unloaded, if still loaded.

Caution: Missing feature data

If you forget to mention a feature in the load declaration and you do use it in your task, LAF-Fabric will stop your task and shout error messages at you. If you declare features that do not exist in the LAF data, you just get a warning. But if you try to use such features, you also get a loud error.
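The shape of the feature declaration and the idea behind the caution can be sketched in plain Python. This is an illustrative stand-in only: parse_features, require and UndeclaredFeature are hypothetical names, and the sketch does not reproduce how LAF-Fabric itself checks or reports missing features.

```python
class UndeclaredFeature(KeyError):
    """Raised when a task uses a feature it did not declare (hypothetical)."""


def parse_features(spec):
    """Turn ("node features", "edge features") into two lists of names."""
    node_spec, edge_spec = spec
    return {"node": node_spec.split(), "edge": edge_spec.split()}


def require(declared, kind, name):
    """Return the name if it was declared for this kind, else raise."""
    if name not in declared[kind]:
        raise UndeclaredFeature(
            "{} feature '{}' was not declared".format(kind, name)
        )
    return name


# Same declaration as in the load cell: four node features, no edge features.
declared = parse_features(("otype gn chapter book", ""))
print(declared)

try:
    require(declared, "node", "lex")  # used but never declared
except UndeclaredFeature as err:
    print("error:", err)
```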

In [2]:

fabric.load('etcbc4b', '--', 'gender',
{
    "primary": False,
    "xmlids": {"node": False, "edge": False},
    "features": ("otype gn chapter book", ""),
})
exec(fabric.localnames.format(var='fabric'))

  0.00s LOADING API: please wait ... 
  0.00s USING main: etcbc4b DATA COMPILED AT: 2015-11-02T15-08-56
  2.01s LOGFILE=/Users/dirk/laf/laf-fabric-output/etcbc4b/gender/__log__gender.txt
  2.01s INFO: DATA LOADED FROM SOURCE etcbc4b AND ANNOX  FOR TASK gender AT 2016-09-09T14-48-44

API¶

In order to write an efficient task, it is convenient to import the names of the API methods as local variables. The lookup of names in Python is fastest for local names, and it makes the code much cleaner. The exec(fabric.localnames.format(var='fabric')) statement that follows load() in the cell above does this. See the API reference for full documentation.

F¶

All that you want to know about features and are not afraid to ask. It is an object, and for each feature that you have declared, it has a member with a handy name. For example,

F.etcbc4_db_otype

is a feature object that corresponds with the LAF feature given in an annotation in the annotation space etcbc4, with label db and name otype. It is a node feature; for an edge feature we would have to use FE instead of F.

You do not have to mention the annotation space and label: LAF-Fabric will work out what they should be from the available features. If there is ambiguity, LAF-Fabric will tell you, and you can supply fuller names.

You can look up a feature value of this feature, say for node n, by saying

F.otype.v(n)

NN(test=function value=something values=list of somethings)¶

If you want to walk through all the nodes, possibly skipping some, then this is your method. It is an iterator that yields a new node every time it is advanced. The order is the so-called primary data order, which is explained below. The test, value and values arguments are optional. If given, test should be a callable with one argument, returning a string; value should be a string, and values a list of strings. test is called for each passing node, and if the value returned is neither equal to value nor a member of values, the node is skipped.
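The filtering rule can be mimicked with a small generator. This is a sketch of the rule as described here, not LAF-Fabric's implementation: walk is a hypothetical name, and a plain dictionary lookup stands in for a feature lookup such as F.otype.v.

```python
def walk(nodes, test=None, value=None, values=None):
    """Yield nodes in the given order, skipping those that fail the test."""
    values = () if values is None else values
    for node in nodes:
        if test is not None:
            result = test(node)
            # Skip when the result matches neither `value` nor any of `values`.
            if result != value and result not in values:
                continue
        yield node


# Toy node set: node number -> object type (stand-in for a feature lookup).
otype = {0: "chapter", 1: "word", 2: "word", 3: "verse", 4: "word"}

everything = list(walk(sorted(otype)))
words = list(walk(sorted(otype), test=otype.get, value="word"))
units = list(walk(sorted(otype), test=otype.get, values=["chapter", "verse"]))
print(everything, words, units)
```

Filtering inside the iterator saves the task from repeating the same type check at the top of every loop body.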

msg¶

Issues a timed message to the standard error and to the log file.

infile(filename)¶

Creates an open file handle for reading a file in your task output directory.

outfile(filename)¶

Creates an open file handle for writing a file in your task output directory.

my_file(filename)¶

Gives the full path to a file in your task output directory.
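How such helpers might fit together can be sketched with a small class that hands out file handles inside one directory and remembers them, so that a single close() can close them all, which is how this notebook describes close() further on. The class TaskFiles is a hypothetical illustration using a temporary directory, not the LAF-Fabric implementation.

```python
import os
import tempfile


class TaskFiles:
    """Hand out file handles in one directory and remember them (hypothetical)."""

    def __init__(self, directory):
        self.directory = directory
        self._handles = []

    def my_file(self, filename):
        """Full path of a file in the task directory."""
        return os.path.join(self.directory, filename)

    def outfile(self, filename):
        """Open a file for writing and register the handle."""
        handle = open(self.my_file(filename), "w", encoding="utf-8")
        self._handles.append(handle)
        return handle

    def infile(self, filename):
        """Open a file for reading and register the handle."""
        handle = open(self.my_file(filename), "r", encoding="utf-8")
        self._handles.append(handle)
        return handle

    def close(self):
        """Close every handle that was handed out so far."""
        for handle in self._handles:
            handle.close()
        self._handles.clear()


with tempfile.TemporaryDirectory() as tmp:
    files = TaskFiles(tmp)
    out = files.outfile("table.tsv")
    out.write("book chapter,masculine,feminine\n")
    files.close()  # flushes and closes the writer
    reader = files.infile("table.tsv")
    content = reader.read()
    files.close()
    all_closed = out.closed and reader.closed

print(repr(content), all_closed)
```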

Available features¶

The F_all component delivers a list of available node features in the chosen source, and likewise FE_all yields the edge features. Let us see what we have got. For convenience, the components fF_all and fFE_all produce formatted output for these feature lists.

In [6]:

print(fF_all)
print(fFE_all)

etcbc4:
	db.maxmonad:
	db.minmonad:
	db.monads:
	db.oid:
	db.otype:
	ft.code:
	ft.det:
	ft.dist:
	ft.dist_unit:
	ft.domain:
	ft.function:
	ft.g_cons:
	ft.g_cons_utf8:
	ft.g_lex:
	ft.g_lex_utf8:
	ft.g_nme:
	ft.g_nme_utf8:
	ft.g_pfm:
	ft.g_pfm_utf8:
	ft.g_prs:
	ft.g_prs_utf8:
	ft.g_uvf:
	ft.g_uvf_utf8:
	ft.g_vbe:
	ft.g_vbe_utf8:
	ft.g_vbs:
	ft.g_vbs_utf8:
	ft.g_word:
	ft.g_word_utf8:
	ft.gn:
	ft.is_root:
	ft.kind:
	ft.language:
	ft.lex:
	ft.lex_utf8:
	ft.ls:
	ft.mother_object_type:
	ft.nme:
	ft.nu:
	ft.number:
	ft.pdp:
	ft.pfm:
	ft.prs:
	ft.ps:
	ft.rela:
	ft.sp:
	ft.st:
	ft.tab:
	ft.trailer_utf8:
	ft.txt:
	ft.typ:
	ft.uvf:
	ft.vbe:
	ft.vbs:
	ft.vs:
	ft.vt:
	sft.book:
	sft.chapter:
	sft.label:
	sft.verse:
etcbc4:
	ft.distributional_parent:
	ft.functional_parent:
	ft.mother:
laf:
	('', 'x'):
	('', 'y'):

Task Execution¶

We need to get an output file to write to. A simple method provides a handle to a file open for writing. The file will be created in the output_dir, under the subdir etcbc4b, under the subdir gender.

In [3]:

table = outfile('table.tsv')

All open files (reading and writing) will be closed with

close()

below.

Walking the nodes¶

Here we loop over a bunch of nodes (in fact over all nodes), in a convenient document order.

Node order¶

There is an implicit partial order on nodes. The short story is: the nodes that are linked to primary data, inherit the order that is present in the primary data. The long story is a bit more complicated, since nodes may be attached to multiple ranges of primary data.

See node order for details. If you skip that, it is enough to know that embedding nodes always come before embedded nodes: if one node is attached to a big piece of primary data and a second node to a part of that data, then the node with the bigger attachment comes first.

When there is no inclusion either way, and the start and end points are the same, the order is left undefined.
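A simplified sketch of this ordering: assume every node is attached to one contiguous range (start, end) of primary data, and sort by start ascending, then by end descending, so that an embedding node precedes the nodes it embeds. The single-range assumption and the toy spans are illustrative only; the real ordering also has to deal with nodes attached to multiple ranges.

```python
def order_key(span):
    """Sort key: earlier start first; for equal starts, the longer span first."""
    start, end = span
    return (start, -end)


# Toy nodes with the range of primary data each is attached to.
spans = {
    "word_2": (3, 5),
    "chapter": (0, 9),
    "word_1": (0, 2),
    "verse": (0, 5),
}

ordered = sorted(spans, key=lambda node: order_key(spans[node]))
print(ordered)  # ['chapter', 'verse', 'word_1', 'word_2']
```

Two nodes with identical spans get the same key, so this key alone does not decide between them, which mirrors the undefined case mentioned above.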

Initialization¶

We initialize the counters in which we store the word counts. We keep track of the chapter we are in and accumulate counts of the words, masculine and feminine. For each chapter we create entries in the ch, m and f lists.

Note also the progress messages after each chapter.

In [4]:

stats = [0, 0, 0]
cur_chapter = None
cur_book = None
ch = []
m = []
f = []

In [5]:

for node in NN():
    otype = F.otype.v(node)
    if otype == "word":
        stats[0] += 1
        if F.gn.v(node) == "m":
            stats[1] += 1
        elif F.gn.v(node) == "f":
            stats[2] += 1
    elif otype == "chapter":
        if cur_chapter is not None:
            masc = 0 if not stats[0] else 100 * float(stats[1]) / stats[0]
            fem = 0 if not stats[0] else 100 * float(stats[2]) / stats[0]
            ch.append(cur_chapter)
            m.append(masc)
            f.append(fem)
            table.write("{},{},{}\n".format(cur_chapter, masc, fem))
        else:
            table.write("{},{},{}\n".format('book chapter', 'masculine', 'feminine'))
        this_book = F.book.v(node)
        this_chapnum = F.chapter.v(node)
        this_chapter = "{} {}".format(this_book, this_chapnum)
        if this_book != cur_book:
            sys.stderr.write("\n{}".format(this_book))
            cur_book = this_book
        sys.stderr.write(" {}".format(this_chapnum))
        stats = [0, 0, 0]
        cur_chapter = this_chapter

Genesis 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
Exodus 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
Leviticus 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
Numeri 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
Deuteronomium 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
Josua 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
Judices 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
Samuel_I 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
Samuel_II 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
Reges_I 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
Reges_II 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
Jesaia 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66
Jeremia 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52
Ezechiel 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
Hosea 1 2 3 4 5 6 7 8 9 10 11 12 13 14
Joel 1 2 3 4
Amos 1 2 3 4 5 6 7 8 9
Obadia 1
Jona 1 2 3 4
Micha 1 2 3 4 5 6 7
Nahum 1 2 3
Habakuk 1 2 3
Zephania 1 2 3
Haggai 1 2
Sacharia 1 2 3 4 5 6 7 8 9 10 11 12 13 14
Maleachi 1 2 3
Psalmi 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150
Iob 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42
Proverbia 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
Ruth 1 2 3 4
Canticum 1 2 3 4 5 6 7 8
Ecclesiastes 1 2 3 4 5 6 7 8 9 10 11 12
Threni 1 2 3 4 5
Esther 1 2 3 4 5 6 7 8 9 10
Daniel 1 2 3 4 5 6 7 8 9 10 11 12
Esra 1 2 3 4 5 6 7 8 9 10
Nehemia 1 2 3 4 5 6 7 8 9 10 11 12 13
Chronica_I 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
Chronica_II 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
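One property of the loop in cell In [5] is worth spelling out: a chapter's figures are emitted only when the next chapter node arrives, so the figures of the very last chapter are emitted only if an extra step follows the loop. The sketch below shows the same accumulate-and-flush pattern with a final flush. It runs on a made-up stream of (otype, value) pairs, and the function names are hypothetical.

```python
def percentages(total, masc, fem):
    """Percentages of masculine and feminine words; zeros for an empty chapter."""
    if not total:
        return 0.0, 0.0
    return 100.0 * masc / total, 100.0 * fem / total


def chapter_percentages(stream):
    """Accumulate word counts per chapter and emit one row per chapter."""
    rows = []
    current = None
    counts = [0, 0, 0]  # total, masculine, feminine
    for otype, value in stream:
        if otype == "chapter":
            if current is not None:
                rows.append((current,) + percentages(*counts))
            current = value
            counts = [0, 0, 0]
        elif otype == "word":
            counts[0] += 1
            if value == "m":
                counts[1] += 1
            elif value == "f":
                counts[2] += 1
    # Final flush: without it the last chapter would never be emitted.
    if current is not None:
        rows.append((current,) + percentages(*counts))
    return rows


# Synthetic stream, not taken from the Hebrew data.
stream = [
    ("chapter", "A 1"), ("word", "m"), ("word", "f"),
    ("word", "unknown"), ("word", "m"),
    ("chapter", "A 2"), ("word", "f"), ("word", "f"),
]
rows = chapter_percentages(stream)
print(rows)
```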

Closing¶

We need to close open files. This is exactly what the next statement does.

In [7]:

close()

    22s Results directory:
/Users/dirk/SURFdrive/laf-fabric-output/etcbc4b/gender

__log__gender.txt                       217 Fri Nov 13 14:09:01 2015
table.tsv                             43234 Fri Nov 13 14:09:01 2015

Showing off¶

Everything is still in memory. Now is the time to generate a graphical representation of the data.

The matplotlib package is full of instruments to do that.

But let us first have a look at a few rows of the data itself.

In [8]:

import pandas
import matplotlib.pyplot as plt
from IPython.display import display
pandas.set_option('display.notebook_repr_html', True)
%matplotlib inline

The files that have been generated reside in your task output directory. You can easily refer to them as follows:

In [9]:

table_file = my_file('table.tsv')
df = pandas.read_csv(table_file)

In [10]:

df.head(100)

Out[10]:

	book chapter	masculine	feminine
0	Genesis 1	42.347697	5.794948
1	Genesis 2	38.663968	7.692308
2	Genesis 3	37.474950	10.020040
3	Genesis 4	43.046358	11.920530
4	Genesis 5	40.748441	18.918919
5	Genesis 6	36.613272	9.610984
6	Genesis 7	33.596838	11.462451
7	Genesis 8	31.300813	9.959350
8	Genesis 9	37.972167	9.741551
9	Genesis 10	30.679157	4.683841
10	Genesis 11	38.416988	15.057915
11	Genesis 12	31.151832	10.209424
12	Genesis 13	36.994220	3.757225
13	Genesis 14	40.393013	4.148472
14	Genesis 15	36.187845	5.524862
15	Genesis 16	29.794521	22.945205
16	Genesis 17	41.299790	11.111111
17	Genesis 18	34.028892	8.346709
18	Genesis 19	30.900243	10.705596
19	Genesis 20	35.638298	11.702128
20	Genesis 21	35.559265	12.520868
21	Genesis 22	38.132296	5.252918
22	Genesis 23	38.873995	9.115282
23	Genesis 24	32.801822	12.984055
24	Genesis 25	42.572464	10.326087
25	Genesis 26	36.930091	8.662614
26	Genesis 27	42.129630	7.754630
27	Genesis 28	36.444444	8.444444
28	Genesis 29	30.745342	18.012422
29	Genesis 30	32.655654	15.374841
...	...	...	...
70	Exodus 21	40.703518	10.050251
71	Exodus 22	39.555556	8.000000
72	Exodus 23	38.218924	6.493506
73	Exodus 24	45.187166	4.010695
74	Exodus 25	33.438986	15.055468
75	Exodus 26	33.152909	18.809202
76	Exodus 27	37.333333	18.666667
77	Exodus 28	39.756098	13.048780
78	Exodus 29	36.707566	6.952965
79	Exodus 30	38.473520	11.059190
80	Exodus 31	31.104651	10.174419
81	Exodus 32	40.183246	4.450262
82	Exodus 33	40.120968	2.016129
83	Exodus 34	42.469471	4.884668
84	Exodus 35	36.060606	11.363636
85	Exodus 36	34.505208	17.578125
86	Exodus 37	34.637965	16.634051
87	Exodus 38	35.598706	16.343042
88	Exodus 39	39.303483	11.815920
89	Exodus 40	38.923077	4.923077
90	Leviticus 1	38.186813	4.945055
91	Leviticus 2	39.007092	16.312057
92	Leviticus 3	38.418079	6.497175
93	Leviticus 4	37.997433	10.654685
94	Leviticus 5	32.947020	13.245033
95	Leviticus 6	38.927739	13.519814
96	Leviticus 7	35.714286	13.025210
97	Leviticus 8	36.829559	7.985697
98	Leviticus 9	36.686391	7.692308
99	Leviticus 10	40.652174	7.173913

100 rows × 3 columns
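Rows like these can also be ranked without plotting. As a small sketch, the following uses only the standard library csv module on four rows copied from the table above (Genesis 14 to 17) and sorts them by feminine percentage. The variable names are illustrative.

```python
import csv
import io

# Four rows copied from the output above, in the comma-separated
# layout that the task writes to table.tsv.
sample = """book chapter,masculine,feminine
Genesis 14,40.393013,4.148472
Genesis 15,36.187845,5.524862
Genesis 16,29.794521,22.945205
Genesis 17,41.299790,11.111111
"""

rows = list(csv.DictReader(io.StringIO(sample)))

# Highest feminine percentage first.
ranked = sorted(rows, key=lambda row: float(row["feminine"]), reverse=True)
top = ranked[0]["book chapter"]

for row in ranked:
    print(row["book chapter"], row["feminine"])
```

Within this small sample Genesis 16 comes out on top. Running the same ranking over the whole file would be one way to locate peaks before looking at a chart.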

Now let's put matplotlib to work. Here we just show a line graph of 20 chapters. If you want to see another series of chapters, modify the start and end variables below and execute the cell again by pressing Shift+Enter. You can repeat this as often as you like without re-running earlier steps.

In [11]:

x = range(len(ch))
start = 100
end = 120
fig = plt.figure()
plt.plot(x[start:end], m[start:end], 'b-', x[start:end], f[start:end], 'r-')
plt.axis([start, end, 0, 50])
plt.xticks(x[start:end], ch[start:end], rotation='vertical')
plt.margins(0.2)
plt.subplots_adjust(bottom=0.15);
plt.title('gender');

Note the chapters where the feminine words peak: Leviticus 12 and 18.¶

Finally, save the chart.

In [12]:

fig.savefig('gender.png')

saved chart