Cooccurrences of lexemes between the books of the Hebrew Bible

Research Question

What does linguistic variation between the books of the Hebrew Bible tell us about their origins, and about the evolution and transmission of their texts?


We study the co-occurrences of lexemes across the books of the Bible and represent this data as an undirected weighted graph, where the books are nodes. There is an edge between every pair of books that share a lexeme occurrence. Edges are weighted: the more lexemes a pair of books shares, the heavier the edge. However, the weight is corrected and normalized as well.

  • correction: frequent lexemes contribute less to the weight than rare lexemes,
  • normalization: the weight contribution of a lexeme is divided by the number of lexemes in the union of two books.

The initial plan was to consider only common nouns, but we are also experimenting with nouns in general, verbs, and all lexemes. Moreover, we also experiment with two measures of normalization:

  • normal: divide by the number of distinct lexemes in the union of the two books,
  • quadratic: as in normal, but divide by the square of that number.

More formally:

Let $B$ be the set of books in the Bible.

The support of a lexeme $l$ is defined as $S(l) = card\{b \in B\ \vert\ l \in b\}$.

The lexeme content of book $b$ is defined as $L(b) = \{l\ \vert\ l \in b\}$,

and the lexeme content of two books $b_1$ and $b_2$ is defined as $L(b_1, b_2) = L(b_1)\ \cup\ L(b_2)$.

The co-occurrence content of those two books is defined as $C(b_1, b_2) = L(b_1)\ \cap\ L(b_2)$.

We now define two weight measures for the co-occurrence edge between two books $b_1$ and $b_2$:

$$W_1(b_1,b_2) = {\sum \{{1\over S(l)}\ \vert\ l \in C(b_1, b_2)\} \over card\,L(b_1, b_2)}$$

$$W_2(b_1,b_2) = {\sum \{{1\over S(l)}\ \vert\ l \in C(b_1, b_2)\} \over (card\,L(b_1, b_2))^2}$$
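These definitions can be sketched in a few lines of Python. This is a toy example with made-up lexeme strings, not the ETCBC transliterations:

```python
from collections import defaultdict

# Toy corpus: three tiny "books", each given as a set of lexemes.
books = {
    'A': {'dbr', 'mlk', 'bjt'},
    'B': {'dbr', 'mlk', 'jm'},
    'C': {'dbr', 'qr'},
}

# S(l): the support of lexeme l = number of books in which it occurs.
support = defaultdict(int)
for content in books.values():
    for l in content:
        support[l] += 1

def weights(b1, b2):
    common = books[b1] & books[b2]   # C(b1, b2)
    union = books[b1] | books[b2]    # L(b1, b2)
    raw = sum(1 / support[l] for l in common)
    return raw / len(union), raw / len(union) ** 2  # W_1, W_2

# The rarer lexeme (support 2) contributes more than the ubiquitous one (support 3).
w1, w2 = weights('A', 'B')
```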


Import the Python modules, the plot modules, and the LAF-Fabric module (laf), and initialize the laf processor.

In [1]:
import sys
import collections
import matplotlib.pyplot as plt
from laf.fabric import LafFabric
fabric = LafFabric()
  0.00s This is LAF-Fabric 4.4.7
API reference:
Feature doc:

Load the data, especially the features we need. Note that the task will be named cooccurrences. After loading we retrieve the names by which we can access the various pieces of the LAF data.

In [2]:
fabric.load('etcbc4', '--', 'cooccurrences', {
    "xmlids": {"node": False, "edge": False},
    "features": ("otype sp lex_utf8 book", ""),
})
  0.00s LOADING API: please wait ... 
  0.10s INFO: USING DATA COMPILED AT: 2014-07-23T09-31-37
  2.91s LOGFILE=/Users/dirk/Dropbox/laf-fabric-output/etcbc4/cooccurrences/__log__cooccurrences.txt
  2.91s INFO: DATA LOADED FROM SOURCE etcbc4 AND ANNOX -- FOR TASK cooccurrences AT 2014-11-12T10-05-55

For your convenience:

  • NN: iterator of nodes in primary data order
  • F: feature data

You can inspect the API by giving commands like F.*?, NN??


We are going to generate data files for Gephi, in its native XML format. Here we specify the subtasks and weighting methods.

  • Subtasks correspond to the kind of lexemes we are counting.
  • Methods correspond to the kind of normalization that we are applying: dividing by the sum or the square of the sum.

We also spell out the XML header of a Gephi file.

In [3]:
tasks = {
    'noun_common': {
        '1': outfile("noun_common_1.gexf"),
        '2': outfile("noun_common_2.gexf"),
    },
    'noun_proper': {
        '1': outfile("noun_proper_1.gexf"),
        '2': outfile("noun_proper_2.gexf"),
    },
    'verb': {
        '1': outfile("verb_1.gexf"),
        '2': outfile("verb_2.gexf"),
    },
    'verb-noun_common': {
        '1': outfile("verb-noun_common_1.gexf"),
        '2': outfile("verb-noun_common_2.gexf"),
    },
    'all': {
        '1': outfile("all_1.gexf"),
        '2': outfile("all_2.gexf"),
    },
}

methods = {
    '1': lambda x, y: float(x) / y,
    '2': lambda x, y: float(x) / y / y,
}

data_header = '''<?xml version="1.0" encoding="UTF-8"?>
<gexf xmlns:viz="http:///" xmlns="" version="1.2">
<graph defaultedgetype="undirected" idtype="string" type="static">
'''


In [4]:
book_name = None
books = []
lexemes = collections.defaultdict(lambda: collections.defaultdict(lambda:collections.defaultdict(lambda:0)))
lexeme_support_book = collections.defaultdict(lambda: collections.defaultdict(lambda: {}))
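The nested defaultdicts let the collection loop below record counts without initializing any keys first: every (task, book, lexeme) path springs into existence on first access. A small illustration of this behavior (the lexeme strings here are hypothetical):

```python
import collections

# Same shape as the lexemes structure above: task -> book -> lexeme -> count.
counts = collections.defaultdict(
    lambda: collections.defaultdict(lambda: collections.defaultdict(lambda: 0)))

counts['verb']['Genesis']['BR>'] += 1  # all three levels are auto-created
counts['verb']['Genesis']['BR>'] += 1
total = counts['verb']['Genesis']['BR>']  # counted occurrences so far
missing = counts['all']['Exodus']['MLK']  # an unseen path simply reads as 0
```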

Walk through the relevant nodes and collect the data:

In [5]:
for node in NN():
    this_type = F.otype.v(node)
    if this_type == "word":
        lexeme = F.lex_utf8.v(node)

        lexemes['all'][book_name][lexeme] += 1
        lexeme_support_book['all'][lexeme][book_name] = 1

        p_o_s = F.sp.v(node)
        if p_o_s == "subs":
            lexemes['noun_common'][book_name][lexeme] += 1
            lexeme_support_book['noun_common'][lexeme][book_name] = 1
            lexemes['verb-noun_common'][book_name][lexeme] += 1
            lexeme_support_book['verb-noun_common'][lexeme][book_name] = 1
        elif p_o_s == 'nmpr':
            lexemes['noun_proper'][book_name][lexeme] += 1
            lexeme_support_book['noun_proper'][lexeme][book_name] = 1
        elif p_o_s == "verb":
            lexemes['verb'][book_name][lexeme] += 1
            lexeme_support_book['verb'][lexeme][book_name] = 1
            lexemes['verb-noun_common'][book_name][lexeme] += 1
            lexeme_support_book['verb-noun_common'][lexeme][book_name] = 1

    elif this_type == "book":
        book_name = F.book.v(node)
        books.append(book_name)
        msg("{} ".format(book_name))
    30s Genesis 
    30s Exodus 
    30s Leviticus 
    30s Numeri 
    31s Deuteronomium 
    31s Josua 
    31s Judices 
    31s Samuel_I 
    31s Samuel_II 
    31s Reges_I 
    31s Reges_II 
    32s Jesaia 
    32s Jeremia 
    32s Ezechiel 
    32s Hosea 
    32s Joel 
    32s Amos 
    32s Obadia 
    32s Jona 
    32s Micha 
    32s Nahum 
    32s Habakuk 
    32s Zephania 
    32s Haggai 
    32s Sacharia 
    32s Maleachi 
    32s Psalmi 
    33s Iob 
    33s Proverbia 
    33s Ruth 
    33s Canticum 
    33s Ecclesiastes 
    33s Threni 
    33s Esther 
    33s Daniel 
    33s Esra 
    33s Nehemia 
    33s Chronica_I 
    33s Chronica_II 
    33s Done

Sort the data according to the various subtasks, and compute the edges with their weights.

In [6]:
nodes_header = '''<nodes count="{}">\n'''.format(len(books))

for this_type in tasks:

    lexeme_support = {}
    for lexeme in lexeme_support_book[this_type]:
        lexeme_support[lexeme] = len(lexeme_support_book[this_type][lexeme])
    book_size = collections.defaultdict(lambda: 0)
    for book in lexemes[this_type]:
        book_size[book] = len(lexemes[this_type][book])
    node_data = []
    for node in range(len(books)):
        node_data.append('''<node id="{}" label="{}"/>\n'''.format(node + 1, books[node]))

    edge_id = 0
    edge_data = collections.defaultdict(lambda: [])
    for src in range(len(books)):
        for tgt in range(src + 1, len(books)):
            book_src = books[src]
            book_tgt = books[tgt]
            lexemes_src = {}
            lexemes_tgt = {}
            lexemes_src = lexemes[this_type][book_src]
            lexemes_tgt = lexemes[this_type][book_tgt]
            intersection_size = 0
            weights = collections.defaultdict(lambda: 0)
            for lexeme in lexemes_src:
                if lexeme not in lexemes_tgt:
                    continue
                pre_weight = lexeme_support[lexeme]
                for this_method in tasks[this_type]:
                    # 1000 / S(l): rare lexemes weigh in more heavily
                    weights[this_method] += methods[this_method](1000, pre_weight)
                intersection_size += 1
            combined_size = book_size[book_src] + book_size[book_tgt] - intersection_size
            edge_id += 1
            for this_method in tasks[this_type]:
                edge_data[this_method].append('''<edge id="{}" source="{}" target="{}" weight="{:.3g}"/>\n'''.
                    format(edge_id, src + 1, tgt + 1, weights[this_method]/combined_size))
    for this_method in tasks[this_type]:
        edges_header = '''<edges count="{}">\n'''.format(len(edge_data[this_method]))
        out_file = tasks[this_type][this_method]
        out_file.write(data_header)
        out_file.write(nodes_header)
        for node_line in node_data:
            out_file.write(node_line)
        out_file.write("</nodes>\n")
        out_file.write(edges_header)
        for edge_line in edge_data[this_method]:
            out_file.write(edge_line)
        out_file.write("</edges>\n</graph></gexf>\n")
        out_file.close()
    msg("{}: nodes:  {}; edges: {}".format(this_type, len(books), edge_id))
    43s all: nodes:  39; edges: 741
    43s noun_proper: nodes:  39; edges: 741
    44s verb-noun_common: nodes:  39; edges: 741
    44s verb: nodes:  39; edges: 741
    44s noun_common: nodes:  39; edges: 741
    44s Results directory:

__log__cooccurrences.txt               1115 Wed Nov 12 11:06:39 2014
all_1.gexf                            41738 Wed Nov 12 11:06:39 2014
all_2.gexf                            42256 Wed Nov 12 11:06:39 2014
noun_common_1.gexf                    41714 Wed Nov 12 11:06:39 2014
noun_common_2.gexf                    42239 Wed Nov 12 11:06:39 2014
noun_proper_1.gexf                    41868 Wed Nov 12 11:06:39 2014
noun_proper_2.gexf                    42548 Wed Nov 12 11:06:39 2014
verb-noun_common_1.gexf               41751 Wed Nov 12 11:06:39 2014
verb-noun_common_2.gexf               42247 Wed Nov 12 11:06:39 2014
verb_1.gexf                           41755 Wed Nov 12 11:06:39 2014
verb_2.gexf                           42252 Wed Nov 12 11:06:39 2014
In [10]:
!head -n 100 {my_file('verb-noun_common_1.gexf')}
<?xml version="1.0" encoding="UTF-8"?>
<gexf xmlns:viz="http:///" xmlns="" version="1.2">
<graph defaultedgetype="undirected" idtype="string" type="static">
<nodes count="39">
<node id="1" label="Genesis"/>
<node id="2" label="Exodus"/>
<node id="3" label="Leviticus"/>
<node id="4" label="Numeri"/>
<node id="5" label="Deuteronomium"/>
<node id="6" label="Josua"/>
<node id="7" label="Judices"/>
<node id="8" label="Samuel_I"/>
<node id="9" label="Samuel_II"/>
<node id="10" label="Reges_I"/>
<node id="11" label="Reges_II"/>
<node id="12" label="Jesaia"/>
<node id="13" label="Jeremia"/>
<node id="14" label="Ezechiel"/>
<node id="15" label="Hosea"/>
<node id="16" label="Joel"/>
<node id="17" label="Amos"/>
<node id="18" label="Obadia"/>
<node id="19" label="Jona"/>
<node id="20" label="Micha"/>
<node id="21" label="Nahum"/>
<node id="22" label="Habakuk"/>
<node id="23" label="Zephania"/>
<node id="24" label="Haggai"/>
<node id="25" label="Sacharia"/>
<node id="26" label="Maleachi"/>
<node id="27" label="Psalmi"/>
<node id="28" label="Iob"/>
<node id="29" label="Proverbia"/>
<node id="30" label="Ruth"/>
<node id="31" label="Canticum"/>
<node id="32" label="Ecclesiastes"/>
<node id="33" label="Threni"/>
<node id="34" label="Esther"/>
<node id="35" label="Daniel"/>
<node id="36" label="Esra"/>
<node id="37" label="Nehemia"/>
<node id="38" label="Chronica_I"/>
<node id="39" label="Chronica_II"/>
<edges count="741">
<edge id="1" source="1" target="2" weight="27.2"/>
<edge id="2" source="1" target="3" weight="17.2"/>
<edge id="3" source="1" target="4" weight="23.4"/>
<edge id="4" source="1" target="5" weight="23.6"/>
<edge id="5" source="1" target="6" weight="17.8"/>
<edge id="6" source="1" target="7" weight="21.7"/>
<edge id="7" source="1" target="8" weight="24.1"/>
<edge id="8" source="1" target="9" weight="22.4"/>
<edge id="9" source="1" target="10" weight="20.1"/>
<edge id="10" source="1" target="11" weight="20.5"/>
<edge id="11" source="1" target="12" weight="24.8"/>
<edge id="12" source="1" target="13" weight="24.2"/>
<edge id="13" source="1" target="14" weight="23.7"/>
<edge id="14" source="1" target="15" weight="15.5"/>
<edge id="15" source="1" target="16" weight="8.15"/>
<edge id="16" source="1" target="17" weight="12.7"/>
<edge id="17" source="1" target="18" weight="2.32"/>
<edge id="18" source="1" target="19" weight="5.68"/>
<edge id="19" source="1" target="20" weight="9.54"/>
<edge id="20" source="1" target="21" weight="6.56"/>
<edge id="21" source="1" target="22" weight="7.75"/>
<edge id="22" source="1" target="23" weight="6.75"/>
<edge id="23" source="1" target="24" weight="4.91"/>
<edge id="24" source="1" target="25" weight="12.8"/>
<edge id="25" source="1" target="26" weight="6.62"/>
<edge id="26" source="1" target="27" weight="26.6"/>
<edge id="27" source="1" target="28" weight="24.1"/>
<edge id="28" source="1" target="29" weight="21.1"/>
<edge id="29" source="1" target="30" weight="7.3"/>
<edge id="30" source="1" target="31" weight="10.8"/>
<edge id="31" source="1" target="32" weight="11.9"/>
<edge id="32" source="1" target="33" weight="11.1"/>
<edge id="33" source="1" target="34" weight="8.81"/>
<edge id="34" source="1" target="35" weight="14.1"/>
<edge id="35" source="1" target="36" weight="9.5"/>
<edge id="36" source="1" target="37" weight="14.3"/>
<edge id="37" source="1" target="38" weight="16.6"/>
<edge id="38" source="1" target="39" weight="19.3"/>
<edge id="39" source="2" target="3" weight="28.6"/>
<edge id="40" source="2" target="4" weight="33.2"/>
<edge id="41" source="2" target="5" weight="28"/>
<edge id="42" source="2" target="6" weight="19"/>
<edge id="43" source="2" target="7" weight="20.1"/>
<edge id="44" source="2" target="8" weight="20.4"/>
<edge id="45" source="2" target="9" weight="18.8"/>
<edge id="46" source="2" target="10" weight="21.1"/>
<edge id="47" source="2" target="11" weight="19.4"/>
<edge id="48" source="2" target="12" weight="24.9"/>
<edge id="49" source="2" target="13" weight="21.6"/>
<edge id="50" source="2" target="14" weight="26.1"/>
<edge id="51" source="2" target="15" weight="12.1"/>
<edge id="52" source="2" target="16" weight="6.46"/>


The output files can be loaded into Gephi and subjected to various graph rendering algorithms. After some experimenting, Gephi yields a rendering of the co-occurrence graph. (Figure: Gephi rendering of the co-occurrence graph.)

The Python module networkx is also capable of graph layout; let us try the most obvious methods.

In [8]:
%matplotlib inline
import networkx as nx
In [9]:
g_nc1 = nx.read_gexf(my_file('verb-noun_common_1.gexf'), relabel=True)
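The subsequent cells have lost their code in this rendering. A minimal sketch of what "the most obvious methods" could look like, using a hypothetical three-book stand-in for g_nc1 (edge weights taken from the W1 output above):

```python
import networkx as nx

# Hypothetical mini-graph standing in for g_nc1.
g = nx.Graph()
g.add_weighted_edges_from([
    ('Genesis', 'Exodus', 27.2),
    ('Genesis', 'Leviticus', 17.2),
    ('Exodus', 'Leviticus', 28.6),
])

# Two obvious layouts: force-directed (heavier edges pull nodes closer)
# and a simple circular arrangement.
pos_spring = nx.spring_layout(g, weight='weight', seed=1)
pos_circle = nx.circular_layout(g)
# nx.draw_networkx(g, pos_spring) would then render the graph inline.
```

On the full graph one would pass g_nc1 instead of g; the spring layout is the one most likely to reveal clusters of lexically related books.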