Both the BHSA and the OpenScriptures represent efforts to add linguistic markup to the Hebrew Bible.
The BHSA is the product of years of encoding work by researchers, in a strongly algorithmic fashion, although not without human decisions at the micro level.
OpenScriptures represents a crowd-sourced approach.
Regardless of theoretical considerations on the validity of these approaches, it is worthwhile to be able to compare them. Moreover, for some research problems, it might be helpful to use both encodings in one toolkit.
In this repo we develop a way of doing exactly this.
We make a link between the morphology in OpenScriptures and the linguistics in the BHSA.
We proceed as follows:
With this in hand, we have the OpenScriptures morphology in Text-Fabric, aligned to the BHSA. That opens the way for further comparisons, which take the actual morphology into account.
When we first made the comparison, in 2017, only 88% of the OpenScriptures morphology was fixed.
In 2021 we pulled the same Open Scriptures repository again, and used a new version of the BHSA as well. It turns out that 100% of the words have now been morphologically annotated by OpenScriptures.
This notebook sets the stage for focused comparisons between the BHSA features on words and the OSM morphology.
See
import os
import sys
import collections
import yaml
from glob import glob
from lxml import etree
from itertools import zip_longest
from functools import reduce
from unicodedata import normalize, category
from tf.fabric import Fabric
from tf.core.helpers import rangesFromSet, formatMeta
import utils
if "SCRIPT" not in locals():
    SCRIPT = False
    FORCE = True
    CORE_NAME = "bhsa"
    NAME = "bridging"
    VERSION = "2021"


def stop(good=False):
    if SCRIPT:
        sys.exit(0 if good else 1)
This notebook can run a lot of tests and create a lot of examples.
However, when run in the pipeline, we only want to create the two osm features.
So, further on, there will be quite a bit of code under the condition `not SCRIPT`.
The conversion is executed in an environment of directories, so that sources, temp files and results are in convenient places and do not have to be shifted around.
repoBase = os.path.expanduser("~/github/etcbc")
coreRepo = "{}/{}".format(repoBase, CORE_NAME)
thisRepo = "{}/{}".format(repoBase, NAME)
coreTf = "{}/tf/{}".format(coreRepo, VERSION)
thisTemp = "{}/_temp/{}".format(thisRepo, VERSION)
thisTempTf = "{}/tf".format(thisTemp)
thisTf = "{}/tf/{}".format(thisRepo, VERSION)
Check whether this conversion is needed in the first place. Only when run as a script.
if SCRIPT:
    (good, work) = utils.mustRun(
        None, "{}/.tf/{}.tfx".format(thisTf, "osm"), force=FORCE
    )
    if not good:
        stop(good=False)
    if not work:
        stop(good=True)
utils.caption(4, "Load the existing TF dataset")
TF = Fabric(locations=coreTf, modules=[""])
api = TF.load(
"""
book
g_cons_utf8 g_word_utf8
"""
)
api.makeAvailableIn(globals())
..............................................................................................
. 0.00s Load the existing TF dataset .
..............................................................................................
This is Text-Fabric 9.1.7
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html
114 features found and 0 ignored
0.00s loading features ...
| 0.00s Dataset without structure sections in otext:no structure functions in the T-API
11s All features loaded/computed - for details use TF.isLoaded()
[('Computed', 'computed-data', ('C Computed', 'Call AllComputeds', 'Cs ComputedString')), ('Features', 'edge-features', ('E Edge', 'Eall AllEdges', 'Es EdgeString')), ('Fabric', 'loading', ('TF',)), ('Locality', 'locality', ('L Locality',)), ('Nodes', 'navigating-nodes', ('N Nodes',)), ('Features', 'node-features', ('F Feature', 'Fall AllFeatures', 'Fs FeatureString')), ('Search', 'search', ('S Search',)), ('Text', 'text', ('T Text',))]
NB_DIR = os.getcwd()
OS_BASE = os.path.expanduser("~/github/openscriptures/morphhb/wlc")
os.chdir(OS_BASE)
OSM uses abbreviated book names. We map them onto the (latin) book names of the BHSA.
Here is a list of the BHSA books.
bhsBooks = [F.book.v(n) for n in F.otype.s("book")]
utils.caption(0, " ".join(bhsBooks))
| 11s Genesis Exodus Leviticus Numeri Deuteronomium Josua Judices Samuel_I Samuel_II Reges_I Reges_II Jesaia Jeremia Ezechiel Hosea Joel Amos Obadia Jona Micha Nahum Habakuk Zephania Haggai Sacharia Maleachi Psalmi Iob Proverbia Ruth Canticum Ecclesiastes Threni Esther Daniel Esra Nehemia Chronica_I Chronica_II
The next cell can be used to retrieve the OSM book names, from which the ordered list `osmBooks` below can be composed manually.
osmBookSet = set(fn[0:-4] for fn in glob("*.xml") if fn != "VerseMap.xml")
utils.caption(0, " ".join(sorted(osmBookSet)))
| 11s 1Chr 1Kgs 1Sam 2Chr 2Kgs 2Sam Amos Dan Deut Eccl Esth Exod Ezek Ezra Gen Hab Hag Hos Isa Jer Job Joel Jonah Josh Judg Lam Lev Mal Mic Nah Neh Num Obad Prov Ps Ruth Song Zech Zeph
We list the books in the "canonical" order (as given in the BHSA).
osmBooks = """
Gen Exod Lev Num Deut
Josh Judg 1Sam 2Sam 1Kgs 2Kgs
Isa Jer Ezek Hos Joel Amos Obad
Jonah Mic Nah Hab Zeph Hag Zech Mal
Ps Job Prov Ruth Song Eccl Lam Esth
Dan Ezra Neh 1Chr 2Chr
""".strip().split()
We check that we have not overlooked any books or missed changes in the OSM abbreviations of the books.
osmBookSet == set(osmBooks)
True
Now we can construct the mapping, both ways.
osmBookFromBhs = {}
bhsBookFromOsm = {}
for (i, bhsBook) in enumerate(bhsBooks):
    osmBook = osmBooks[i]
    osmBookFromBhs[bhsBook] = osmBook
    bhsBookFromOsm[osmBook] = bhsBook
For alignment purposes, we reduce all textual material to its consonantal representation. Sometimes we need to blur the distinction between final consonants and their normal counterparts.
In order to strip consonants of all their diacritical marks, we use Unicode canonical decomposition (normalization form NFD) and Unicode character categories.
NS = "{http://www.bibletechnologies.net/2003/OSIS/namespace}"
NFD = "NFD"
LO = "Lo"
finals = {
"ך": "כ",
"ם": "מ",
"ן": "נ",
"ף": "פ",
"ץ": "צ",
}
finalsI = {v: k for (k, v) in finals.items()}
`toCons(s)`: strip all pointing (accents, vowels, dagesh, shin/sin dot) from all characters in string `s`.
`final(c)`: replace consonant `c` by its final counterpart, if there is one, otherwise return `c`.
`finalCons(s)`: replace the last character of `s` by its final counterpart.
`unFinal(s)`: replace all consonants in `s` by their non-final counterparts.
def toCons(s):
    return "".join(c for c in normalize(NFD, s) if category(c) == LO)


def final(c):
    return finalsI.get(c, c)


def finalCons(s):
    return s[0:-1] + final(s[-1])


def unFinal(s):
    return "".join(finals.get(c, c) for c in s)
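As a quick illustration of these helpers, the following sketch applies them to two tiny samples. The functions are restated so that the cell runs on its own; the sample strings are our own choice and are written with Unicode escapes to make the individual code points visible.

```python
from unicodedata import normalize, category

# final consonant -> normal counterpart (same five pairs as above)
finals = {
    "\u05DA": "\u05DB",  # final kaf -> kaf
    "\u05DD": "\u05DE",  # final mem -> mem
    "\u05DF": "\u05E0",  # final nun -> nun
    "\u05E3": "\u05E4",  # final pe -> pe
    "\u05E5": "\u05E6",  # final tsadi -> tsadi
}
finalsI = {v: k for (k, v) in finals.items()}


def toCons(s):
    # decompose, then keep only letters (category Lo); points are dropped
    return "".join(c for c in normalize("NFD", s) if category(c) == "Lo")


def final(c):
    return finalsI.get(c, c)


def finalCons(s):
    return s[0:-1] + final(s[-1])


def unFinal(s):
    return "".join(finals.get(c, c) for c in s)


# sample 1: bet + sheva + dagesh, a pointed preposition
pointed = "\u05D1\u05B0\u05BC"
print(toCons(pointed) == "\u05D1")

# sample 2: he + final mem
word = "\u05D4\u05DD"
print(unFinal(word) == "\u05D4\u05DE")
print(finalCons(unFinal(word)) == word)
```

Comparing strings through `unFinal` or `finalCons` makes two spellings equal when they differ only in the use of final letter forms.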
We are going to read the OSM files. They correspond to books.
We drill down to verse nodes and pick up the `<w>` elements.
What we need from these elements is the full text content and the attributes `lemma` and `morph`.
We ignore markup within the full text content of the `<w>` elements.
The material we extract may contain `/`.
We split the text content and the `lemma` and `morph` content on `/`, and recombine the resulting parts into OSM morpheme entries, each having a full-text bit, a morph bit and a lemma bit.
Caveat: when splitting the `morph` string, we should first split off the first character, which indicates the language, and then add it to all the parts!
So, one `<w>` element may give rise to several morpheme entries.
The full text is fully pointed. We also compute a consonantal version of the full text and store it within the morpheme entries.
We end up with a list, `osmMorphemes`, of morpheme entries.
In passing, we count the `<w>` elements without `morph` attributes, and those without textual content.
We also store the book, chapter, verse and sequence number of the `<w>` element in each entry.
def readOsmBook(osmBook, osmMorphemes, stats):
    infile = "{}.xml".format(osmBook)
    parser = etree.XMLParser(remove_blank_text=True, ns_clean=True)
    root = etree.parse(infile, parser).getroot()
    osisTextNode = root[0]
    divNode = osisTextNode[1]
    chapterNodes = list(divNode)
    utils.caption(
        0,
        "reading {:<5} ({:<15}) {:>3} chapters".format(
            osmBook, bhsBookFromOsm[osmBook], len(chapterNodes)
        ),
    )
    ch = 0
    for chapterNode in chapterNodes:
        if chapterNode.tag != NS + "chapter":
            continue
        ch += 1
        vs = 0
        for verseNode in list(chapterNode):
            if verseNode.tag != NS + "verse":
                continue
            vs += 1
            w = 0
            for wordNode in list(verseNode):
                if wordNode.tag != NS + "w":
                    continue
                w += 1
                lemma = wordNode.get("lemma", None)
                morph = wordNode.get("morph", None)
                text = "".join(x for x in wordNode.itertext())
                lemmas = lemma.split("/") if lemma is not None else []
                morphs = morph.split("/") if morph is not None else []
                # the language code is only on the first part: copy it to the others
                if len(morphs) > 1:
                    lang = morphs[0][0]
                    morphs = [morphs[0]] + [lang + m for m in morphs[1:]]
                texts = text.split("/") if text is not None else []
                # zip_longest accommodates unequal lengths of its operands
                # for missing values we fill in ''
                for (lm, mph, tx) in zip_longest(lemmas, morphs, texts, fillvalue=""):
                    txc = None if tx is None else toCons(tx)
                    osmMorphemes.append((tx, txc, mph, lm, osmBook, ch, vs, w))
                    if not mph:
                        stats["noMorph"] += 1
                    if not tx:
                        stats["noContent"] += 1
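To make the splitting step concrete, here is a minimal sketch that isolates it from the XML handling. The function name `splitWord` is hypothetical. The attribute values are modelled on the first word of Genesis 1:1, on the assumption that the source carries the language code only on the first part of `morph`, and the text is an ASCII placeholder rather than the real pointed text.

```python
from itertools import zip_longest


def splitWord(text, lemma, morph):
    """Split the content of one <w> element into (text, morph, lemma) parts."""
    lemmas = lemma.split("/") if lemma is not None else []
    morphs = morph.split("/") if morph is not None else []
    if len(morphs) > 1:
        # the first character of the morph string is the language code;
        # it is present on the first part only, so we copy it to the rest
        lang = morphs[0][0]
        morphs = [morphs[0]] + [lang + m for m in morphs[1:]]
    texts = text.split("/") if text is not None else []
    # parts that are missing on one side are filled with ''
    return list(zip_longest(texts, morphs, lemmas, fillvalue=""))


# placeholder text "be/reshit"; lemma and morph values are assumed
for part in splitWord("be/reshit", "b/7225", "HR/Ncfsa"):
    print(part)
```

Each resulting part carries its own complete morph code, so the parts can later be compared one by one without looking back at the `<w>` element they came from.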
That was the definition of the read function, now we are going to execute it.
osmMorphemes = []
stats = dict(noMorph=0, noContent=0)
for bn in F.otype.s("book"):
    bhsBook = T.sectionFromNode(bn, lang="la")[0]
    osmBook = osmBookFromBhs[bhsBook]
    readOsmBook(osmBook, osmMorphemes, stats)
utils.caption(
    0,
    """
BHS words: {:>6}
OSM Morphemes: {:>6}
No morphology: {:>6}
No content: {:>6}
{} % of the words are morphologically annotated.
""".format(
        F.otype.maxSlot,
        len(osmMorphemes),
        stats["noMorph"],
        stats["noContent"],
        round(
            100
            * (len(osmMorphemes) - stats["noMorph"] - stats["noContent"])
            / len(osmMorphemes)
        ),
    ),
)
| 11s reading Gen (Genesis ) 50 chapters
| 11s reading Exod (Exodus ) 40 chapters
| 11s reading Lev (Leviticus ) 27 chapters
| 11s reading Num (Numeri ) 36 chapters
| 11s reading Deut (Deuteronomium ) 34 chapters
| 12s reading Josh (Josua ) 24 chapters
| 12s reading Judg (Judices ) 21 chapters
| 12s reading 1Sam (Samuel_I ) 31 chapters
| 12s reading 2Sam (Samuel_II ) 24 chapters
| 12s reading 1Kgs (Reges_I ) 22 chapters
| 12s reading 2Kgs (Reges_II ) 25 chapters
| 12s reading Isa (Jesaia ) 66 chapters
| 13s reading Jer (Jeremia ) 52 chapters
| 13s reading Ezek (Ezechiel ) 48 chapters
| 13s reading Hos (Hosea ) 14 chapters
| 13s reading Joel (Joel ) 4 chapters
| 13s reading Amos (Amos ) 9 chapters
| 13s reading Obad (Obadia ) 1 chapters
| 13s reading Jonah (Jona ) 4 chapters
| 13s reading Mic (Micha ) 7 chapters
| 13s reading Nah (Nahum ) 3 chapters
| 13s reading Hab (Habakuk ) 3 chapters
| 13s reading Zeph (Zephania ) 3 chapters
| 13s reading Hag (Haggai ) 2 chapters
| 13s reading Zech (Sacharia ) 14 chapters
| 13s reading Mal (Maleachi ) 3 chapters
| 13s reading Ps (Psalmi ) 150 chapters
| 13s reading Job (Iob ) 42 chapters
| 14s reading Prov (Proverbia ) 31 chapters
| 14s reading Ruth (Ruth ) 4 chapters
| 14s reading Song (Canticum ) 8 chapters
| 14s reading Eccl (Ecclesiastes ) 12 chapters
| 14s reading Lam (Threni ) 5 chapters
| 14s reading Esth (Esther ) 10 chapters
| 14s reading Dan (Daniel ) 12 chapters
| 14s reading Ezra (Esra ) 10 chapters
| 14s reading Neh (Nehemia ) 13 chapters
| 14s reading 1Chr (Chronica_I ) 29 chapters
| 14s reading 2Chr (Chronica_II ) 36 chapters
| 14s
BHS words: 426590
OSM Morphemes: 469440
No morphology: 1
No content: 1
100 % of the words are morphologically annotated.
To give an impression of the contents of this list, we show the first few members. The column specification is:
fully-pointed consonantal morph lemma book chapter verse `w`-number
list(osmMorphemes[0:15])
[('בְּ', 'ב', 'HR', 'b', 'Gen', 1, 1, 1), ('רֵאשִׁ֖ית', 'ראשית', 'HNcfsa', '7225', 'Gen', 1, 1, 1), ('בָּרָ֣א', 'ברא', 'HVqp3ms', '1254 a', 'Gen', 1, 1, 2), ('אֱלֹהִ֑ים', 'אלהים', 'HNcmpa', '430', 'Gen', 1, 1, 3), ('אֵ֥ת', 'את', 'HTo', '853', 'Gen', 1, 1, 4), ('הַ', 'ה', 'HTd', 'd', 'Gen', 1, 1, 5), ('שָּׁמַ֖יִם', 'שמים', 'HNcmpa', '8064', 'Gen', 1, 1, 5), ('וְ', 'ו', 'HC', 'c', 'Gen', 1, 1, 6), ('אֵ֥ת', 'את', 'HTo', '853', 'Gen', 1, 1, 6), ('הָ', 'ה', 'HTd', 'd', 'Gen', 1, 1, 7), ('אָֽרֶץ', 'ארץ', 'HNcbsa', '776', 'Gen', 1, 1, 7), ('וְ', 'ו', 'HC', 'c', 'Gen', 1, 2, 1), ('הָ', 'ה', 'HTd', 'd', 'Gen', 1, 2, 1), ('אָ֗רֶץ', 'ארץ', 'HNcbsa', '776', 'Gen', 1, 2, 1), ('הָיְתָ֥ה', 'היתה', 'HVqp3fs', '1961', 'Gen', 1, 2, 2)]
We now face the task of mapping the BHSA words to the OSM morphemes.
We will encounter the challenge that in some places the consonantal content of the WLC (the source of the OSM) differs from that of the BHS, the source of the BHSA.
Another challenge is that at some points the analysis behind the OSM differs from that of the BHSA in such a way that the BHSA has a word-split within an OSM morpheme.
Yet another source of problems is that the BHSA inserts "empty" articles in places where the pointing in the surrounding material allows us to conclude that an article is present, although it no longer has a consonantal presence.
We need a function to quickly show what is going on in difficult spots.
`showCase(w, j, ln)` shows the BHSA from word `w` onward, and the OSM from morpheme `j` onward.
It lists `ln` positions in both sources.
def showCase(w, j, ln):
    print(T.sectionFromNode(w))
    print("BHS")
    for n in range(w, w + ln):
        print("word {} = [{}]".format(n, toCons(F.g_cons_utf8.v(n))))
    print("OSM")
    for n in range(j, j + ln):
        print("morph {} = [{}]".format(n, osmMorphemes[n][1]))
We also define another function to easily inspect difficult spots.
`BHSvsOSM(ws, js)` compares the BHSA words specified by list `ws` with the OSM morphemes specified by list `js`.
Here we bump into the fact that the BHSA deals with whole words, and the OSM splits into morphemes. In this case, the pronominal suffix is treated as a separate morpheme.
def BHSvsOSM(ws, js):
    print(
        "{}\n{:<25}BHS {:<30} = {}\n{:<25}OSM {:<30} = {}".format(
            "{} {}:{}".format(*T.sectionFromNode(ws[0])),
            " ",
            ", ".join(str(w) for w in ws),
            "/".join(F.g_word_utf8.v(w) for w in ws),
            " ",
            ", ".join("w{}".format(osmMorphemes[j][7]) for j in js),
            "/".join(osmMorphemes[j][0] for j in js),
        )
    )
We have to develop a way of aligning each BHS word with one or more OSM morphemes.
For each BHS word, we grab OSM morphemes until all consonants in the BHS word have been matched. If needed, we grab additional BHS words when the current OSM string happens to be longer than the current BHS word.
We will encounter cases where this method breaks down: exceptions. We will collect them for later inspection.
The exceptions are coded as follows:
If `w: n` is in the dictionary of exceptions, it means that slot (word) `w` in the BHSA is different from its counterpart morpheme(s) in the OSM.
If `n > 0`, that many OSM morphemes will be gobbled to align with slot `w`.
If `n < 0`, then `-n` slots, starting at `w`, will be gobbled to match the current OSM morpheme.
There are various subtleties involved, see the inline comments in the code below.
allExceptions = {
"2017": {
215253: 1,
266189: 1,
287360: 2,
376865: 1,
383405: 2,
384049: 1,
384050: 1,
405102: -2,
},
"2021": {
215256: 1,
266192: 1,
287363: 2,
376869: 1,
383409: 2,
384053: 1,
384054: 1,
405108: -2,
},
}
exceptions = allExceptions[VERSION]
# index in the osmMorphemes list
j = -1

# mapping from BHSA slot numbers to OSM morphemes indices
osmFromBhs = {}

u = None
remainingErrors = False

for w in F.otype.s("word"):
    # the previous iteration may have already dealt with this word
    # in that case, we skip to the next word
    # the signal is: w <= u
    if u is not None and w <= u:
        continue

    # we get the consonantal BHSA word string
    bhs = toCons(F.g_cons_utf8.v(w))

    # if the BHSA word is empty, we do not link it to any OSM morpheme
    # and continue
    if bhs == "":
        continue

    # we are going to collect OSM morphemes
    # as long as the consonantal reps of the morpheme fit into the BHSA word
    j += 1
    startJ = j
    startW = w
    osm = osmMorphemes[j][1]

    # but if the word is listed as exception, we collect as many morphemes
    # as specified in the exception
    maxGobble = exceptions.get(w, None)
    gobble = 1
    while (len(osm) < len(bhs) and bhs.startswith(osm)) or (
        maxGobble is not None and maxGobble > 0
    ):
        if maxGobble is not None and gobble >= maxGobble:
            break
        j += 1
        osm += osmMorphemes[j][1]
        gobble += 1

    # if the OSM morphemes have become longer than the BHSA word,
    # we eat up the following BHSA word(s)
    # we let u hold the new BHSA word position
    u = w
    gobble = 1
    while (len(osm) > len(bhs) and osm.startswith(bhs)) or (
        maxGobble is not None and maxGobble < 0
    ):
        if maxGobble is not None and gobble >= -maxGobble:
            break
        u += 1
        bhs += toCons(F.g_cons_utf8.v(u))
        gobble += 1

    gobble = 1
    # if the BHSA words exceed the OSM morphemes found so far, we draw in additional OSM morphemes
    # (for the last time)
    while len(osm) < len(bhs) and bhs.startswith(osm):
        if maxGobble is not None and gobble >= maxGobble:
            break
        j += 1
        osm += osmMorphemes[j][1]
        gobble += 1

    # now we have gathered a BHSA string of material, and an OSM string of material
    # We test if both strings are equal (modulo final consonant issues)
    # If not: alignment breaks down, we stop the loop and show the offending case.
    # The programmer should inspect the case and add an exception.
    if maxGobble is None and finalCons(bhs) != finalCons(osm):
        utils.caption(
            0,
            """Mismatch in {} at BHS-{} OS-{}->{}:\nbhs=[{}]\nos=[{}]""".format(
                "{} {}:{}".format(*T.sectionFromNode(w)),
                w,
                startJ,
                j,
                bhs,
                osm,
            ),
        )
        showCase(w - 5, startJ - 5, j - startJ + 10)
        remainingErrors = True
        break

    # but if all is well, we link the BHSA words in question to the OSM morphemes in question
    # If the BHSA string contains multiple words, we link all those words to all morphemes
    for k in range(startW, u + 1):
        for m in range(startJ, j + 1):
            osmFromBhs.setdefault(k, []).append(m)

if not remainingErrors:
    utils.caption(0, "Succeeded in aligning BHS with OSM")
    utils.caption(
        0,
        "{} BHS words matched against {} OSM morphemes with {} known exceptions".format(
            len(osmFromBhs),
            len(osmMorphemes),
            len(exceptions),
        ),
    )
| 16s Succeeded in aligning BHS with OSM
| 16s 420109 BHS words matched against 469440 OSM morphemes with 8 known exceptions
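The core of the alignment can be tried out on toy data. The sketch below is a deliberately simplified variant, not the loop used above: it has no exception table and no treatment of final consonants, and it raises an error instead of printing the offending spot. The function name `align` and the ASCII sample data are hypothetical.

```python
def align(words, morphemes):
    """Greedily align words with morphemes by their concatenated text.

    Returns a dict that maps word indices to lists of morpheme indices.
    """
    mapping = {}
    j = -1  # index of the last morpheme consumed
    u = -1  # index of the last word consumed
    for (w, word) in enumerate(words):
        # skip words consumed by an earlier iteration, and empty words
        if w <= u or word == "":
            continue
        j += 1
        if j >= len(morphemes):
            raise ValueError("morphemes exhausted at word {}".format(w))
        startJ = j
        u = w
        wordStr = word
        morphStr = morphemes[j]
        # extend the shorter side as long as it is a prefix of the other side
        while wordStr != morphStr:
            if morphStr.startswith(wordStr) and u + 1 < len(words):
                u += 1
                wordStr += words[u]
            elif wordStr.startswith(morphStr) and j + 1 < len(morphemes):
                j += 1
                morphStr += morphemes[j]
            else:
                raise ValueError(
                    "mismatch at word {}: [{}] vs [{}]".format(w, wordStr, morphStr)
                )
        # link all words in the span to all morphemes in the span
        for k in range(w, u + 1):
            mapping.setdefault(k, []).extend(range(startJ, j + 1))
    return mapping


# word 0 spans two morphemes; words 2 and 3 share one morpheme; word 1 is empty
print(align(["ab", "", "c", "d"], ["a", "b", "cd"]))
```

In this toy run the empty word at index 1 is not linked to anything, just as empty BHSA words are left out of `osmFromBhs` when they do not fall inside a larger span.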
In passing, we have constructed the mapping `osmFromBhs`, which maps BHSA words onto corresponding sequences of OSM morphemes.
We also compute the inverse of this, `bhsFromOsm`.
# mapping from OSM morphemes (by index in osmMorphemes list) to BHSA slot numbers
# It is the inverse of osmFromBhs
bhsFromOsm = {}
for (w, js) in osmFromBhs.items():
    for j in js:
        bhsFromOsm.setdefault(j, []).append(w)
utils.caption(0, "{} morphemes mapped in bhsFromOsm".format(len(bhsFromOsm)))
| 18s 469440 morphemes mapped in bhsFromOsm
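The effect of the inversion is easiest to see on a small mapping. This sketch uses made-up indices and a hypothetical helper `invert`; it also shows how morphemes that are shared by several words show up in the inverse.

```python
def invert(mapping):
    """Invert a dict of lists: every value becomes a key listing its sources."""
    inverse = {}
    for (w, js) in mapping.items():
        for j in js:
            inverse.setdefault(j, []).append(w)
    return inverse


# toy data: word 0 has two morphemes, words 2 and 3 share morpheme 2
osmFromBhsToy = {0: [0, 1], 2: [2], 3: [2]}
bhsFromOsmToy = invert(osmFromBhsToy)

# morphemes linked to more than one word
shared = {j: ws for (j, ws) in bhsFromOsmToy.items() if len(ws) > 1}
print(bhsFromOsmToy)
print(shared)
```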
We have encountered irregularities, but we want to make sure we have seen all potential alignment problems. We do this by adding a sanity check: find all cases where the consonantal material in a BHSA word is not the concatenation of the consonantal material in its OSM morphemes.
We now have several irregularities to inspect.
Using `bhsFromOsm` we can find the OSM morphemes that contain consonantal material from multiple BHSA words.
These are interesting points of difference between the BHSA and OSM encoding, because in these cases the OSM produces other word/morpheme boundaries than the BHSA.
Now we want to make a comprehensive list of all problematic cases encountered during alignment.
We will add the BHSA word numbers involved in a problematic case to the set `problematic`.
When we proceed to compare morphology, we will exclude the problematic cases.
problematic = set()
We gather the cases of multiple BHSA words against a single OSM morpheme.
multipleOSM = {}  # OSM morphemes in correspondence with multiple BHS slots
noOSM = set()  # OSM morphemes that do not correspond to any BHSA word
countMultipleOSM = (
    collections.Counter()
)  # how many times n BHSA words are linked to the same OSM morpheme
for (j, ws) in bhsFromOsm.items():
    nws = len(ws)
    if nws > 1:
        multipleOSM[j] = nws
        countMultipleOSM[nws] += 1
    elif nws == 0:
        noOSM.add(j)
utils.caption(
    0,
    "OSM morphemes without corresponding BHSA word: {:>5}".format(
        len(noOSM)
    ),
)
utils.caption(
    0,
    "OSM morphemes corresponding to multiple BHSA words: {:>5}".format(
        len(multipleOSM)
    ),
)
for (nws, amount) in sorted(countMultipleOSM.items()):
    utils.caption(
        0,
        "OSM morphemes corresponding to {} BHSA words: {:>5}".format(
            nws, amount
        ),
    )
| 18s OSM morphemes without corresponding BHSA word: 0
| 18s OSM morphemes corresponding to multiple BHSA words: 122
| 18s OSM morphemes corresponding to 2 BHSA words: 115
| 18s OSM morphemes corresponding to 3 BHSA words: 7
for j in multipleOSM:
    ws = bhsFromOsm[j]
    problematic |= set(ws)
    if not SCRIPT:
        BHSvsOSM(ws, [j])
Genesis 24:65 BHS 12370, 12371 = הַ/לָּזֶה֙ OSM w6 = הַלָּזֶה֙ Genesis 37:19 BHS 20517, 20518 = הַ/לָּזֶ֖ה OSM w8 = הַלָּזֶ֖ה Genesis 50:10 BHS 28426, 28427 = הָ/אָטָ֗ד OSM w4 = הָאָטָ֗ד Genesis 50:11 BHS 28460, 28461 = הָֽ/אָטָ֔ד OSM w8 = הָֽאָטָ֔ד Numbers 13:21 BHS 78125, 78126 = לְ/בֹ֥א OSM w9 = לְבֹ֥א Numbers 34:8 BHS 91530, 91531 = לְ/בֹ֣א OSM w4 = לְבֹ֣א Deuteronomy 33:2 BHS 112265, 112266 = אשׁ/דת OSM w15 = אשדת Joshua 13:5 BHS 120469, 120470 = לְ/בֹ֥וא OSM w13 = לְב֥וֹא Joshua 18:24 BHS 123359, 123360 = ה/עמני OSM w2 = העמני Joshua 18:28 BHS 123394, 123395 = הָ/אֶ֜לֶף OSM w2 = הָאֶ֜לֶף Joshua 19:46 BHS 123994, 123995 = הַ/יַּרְקֹ֖ון OSM w2 = הַיַּרְק֖וֹן Judges 3:3 BHS 128774, 128775 = לְ/בֹ֥וא OSM w15 = לְב֥וֹא Judges 6:5 BHS 130514, 130515 = י/באו OSM w6 = יבאו Judges 6:11 BHS 130645, 130646 = הָֽ/עֶזְרִ֑י OSM w12 = הָֽעֶזְרִ֑י Judges 6:20 BHS 130861, 130862 = הַ/לָּ֔ז OSM w13 = הַלָּ֔ז Judges 6:24 BHS 130966, 130967 = הָ/עֶזְרִֽי OSM w16 = הָעֶזְרִֽי Judges 8:32 BHS 132817, 132818 = הָֽ/עֶזְרִֽי OSM w13 = הָֽעֶזְרִֽי Judges 15:19 BHS 137184, 137185 = הַ/קֹּורֵא֙ OSM w19 = הַקּוֹרֵא֙ 1_Samuel 6:14 BHS 144465, 144466 = הַ/שִּׁמְשִׁי֙ OSM w7 = הַשִּׁמְשִׁי֙ 1_Samuel 6:18 BHS 144606, 144607 = הַ/שִּׁמְשִֽׁי OSM w29 = הַשִּׁמְשִֽׁי 1_Samuel 13:18 BHS 148208, 148209 = הַ/צְּבֹעִ֖ים OSM w15 = הַצְּבֹעִ֖ים 1_Samuel 14:1 BHS 148335, 148336 = הַ/לָּ֑ז OSM w18 = הַלָּ֑ז 1_Samuel 16:1 BHS 150297, 150298 = הַ/לַּחְמִ֔י OSM w24 = הַלַּחְמִ֔י 1_Samuel 16:18 BHS 150666, 150667 = הַ/לַּחְמִי֒ OSM w10 = הַלַּחְמִי֒ 1_Samuel 17:2 BHS 150821, 150822 = הָ/אֵלָ֑ה OSM w7 = הָאֵלָ֑ה 1_Samuel 17:19 BHS 151163, 151164 = הָֽ/אֵלָ֑ה OSM w7 = הָֽאֵלָ֑ה 1_Samuel 17:26 BHS 151348, 151349 = הַ/לָּ֔ז OSM w15 = הַלָּ֔ז 1_Samuel 17:58 BHS 152160, 152161 = הַ/לַּחְמִֽי OSM w14 = הַלַּחְמִֽי 1_Samuel 21:10 BHS 154603, 154604 = הָ/אֵלָ֗ה OSM w9 = הָאֵלָ֗ה 1_Samuel 23:28 BHS 155965, 155966 = הַֽ/מַּחְלְקֹֽות OSM w15 = הַֽמַּחְלְקֽוֹת 2_Samuel 3:26 BHS 162277, 162278 = הַ/סִּרָ֑ה OSM w12 = 
הַסִּרָ֑ה 2_Samuel 12:22 BHS 166917, 166918 = י/חנני OSM w11 = יחנ 2_Samuel 12:22 BHS 166917, 166918 = י/חנני OSM w11 = ני 2_Samuel 20:15 BHS 173466, 173467 = הַֽ/מַּעֲכָ֔ה OSM w6 = הַֽמַּעֲכָ֔ה 2_Samuel 21:16 BHS 174125, 174126 = בְּ/נֹ֜ב OSM w2 = בְּנֹ֜ב 2_Samuel 21:19 BHS 174223, 174224 = הַ/לַּחְמִ֗י OSM w13 = הַלַּחְמִ֗י 2_Samuel 23:8 BHS 174932, 174933, 174934 = בַּ//שֶּׁ֜בֶת OSM w7 = בַּשֶּׁ֜בֶת 2_Samuel 23:33 BHS 175372, 175373 = הָ/ארָרִֽי OSM w6 = הָארָרִֽי 1_Kings 1:9 BHS 176264, 176265 = הַ/זֹּחֶ֔לֶת OSM w8 = הַזֹּחֶ֔לֶת 1_Kings 8:65 BHS 183614, 183615 = לְּ/בֹ֥וא OSM w12 = לְּב֥וֹא 1_Kings 16:34 BHS 189794, 189795 = הָ/אֱלִ֖י OSM w5 = הָאֱלִ֖י 2_Kings 4:25 BHS 196995, 196996 = הַ/לָּֽז OSM w21 = הַלָּֽז 2_Kings 7:15 BHS 199376, 199377 = בה/חפזם OSM w14 = ב 2_Kings 7:15 BHS 199376, 199377 = בה/חפזם OSM w14 = החפז 2_Kings 7:15 BHS 199376, 199377 = בה/חפזם OSM w14 = ם 2_Kings 14:25 BHS 204159, 204160 = לְּ/בֹ֥וא OSM w6 = לְּב֥וֹא 2_Kings 18:27 BHS 207188, 207189 = שׁ/יניהם OSM w25 = שיני 2_Kings 18:27 BHS 207188, 207189 = שׁ/יניהם OSM w25 = הם 2_Kings 19:13 BHS 207681, 207682, 207683 = לָ//עִ֣יר OSM w7 = לָעִ֣יר 2_Kings 23:17 BHS 210347, 210348 = הַ/לָּ֔ז OSM w4 = הַלָּ֔ז Isaiah 22:18 BHS 219268, 219269, 219270 = כַּ//דּ֕וּר OSM w4 = כַּדּ֕וּר Isaiah 36:12 BHS 224046, 224047 = שׁי/ניהם OSM w24 = שיני Isaiah 36:12 BHS 224046, 224047 = שׁי/ניהם OSM w24 = הם Isaiah 37:13 BHS 224517, 224518, 224519 = לָ//עִ֣יר OSM w7 = לָעִ֣יר Isaiah 49:13 BHS 229391, 229392 = י/פצחו OSM w5 = יפצחו Jeremiah 6:21 BHS 238058, 238059 = י/אבדו OSM w18 = יאבדו Jeremiah 6:29 BHS 238177, 238178 = אשׁ/תם OSM w3 = אשת Jeremiah 6:29 BHS 238177, 238178 = אשׁ/תם OSM w3 = ם Jeremiah 13:16 BHS 241521, 241522 = י/שׁית OSM w17 = ישית Jeremiah 17:13 BHS 243339, 243340 = י/סורי OSM w7 = יסור Jeremiah 17:13 BHS 243339, 243340 = י/סורי OSM w7 = י Jeremiah 21:9 BHS 245189, 245190 = י/חיה OSM w14 = יחיה Jeremiah 29:23 BHS 250010, 250011 = ה/וידע OSM w18 = הו Jeremiah 29:23 BHS 250010, 250011 = 
ה/וידע OSM w19 = ידע Jeremiah 38:2 BHS 255634, 255635 = י/חיה OSM w14 = יחיה Jeremiah 48:18 BHS 260639, 260640 = י/שׁבי OSM w3 = ישבי Ezekiel 36:35 BHS 283125, 283126 = הַ/לֵּ֨זוּ֙ OSM w3 = הַלֵּ֨זוּ֙ Ezekiel 42:14 BHS 287019, 287020 = י/לבשׁו OSM w18 = ילבשו Ezekiel 43:6 BHS 287226, 287227 = מִ/דַּבֵּ֥ר OSM w2 = מִדַּבֵּ֥ר Ezekiel 45:5 BHS 288552, 288553 = י/היה OSM w8 = יהיה Ezekiel 47:15 BHS 290023, 290024 = לְ/בֹ֥וא OSM w11 = לְב֥וֹא Ezekiel 47:16 BHS 290038, 290039 = הַ/תִּיכֹ֔ון OSM w12 = הַתִּיכ֔וֹן Ezekiel 47:20 BHS 290129, 290130 = לְ/בֹ֣וא OSM w8 = לְב֣וֹא Ezekiel 48:1 BHS 290210, 290211 = לְֽ/בֹוא OSM w10 = לְֽבוֹא Amos 6:14 BHS 297188, 297189 = לְּ/בֹ֥וא OSM w14 = לְּב֥וֹא Micah 1:10 BHS 299725, 299726 = לְ/עַפְרָ֔ה OSM w8 = לְעַפְרָ֔ה Nahum 3:3 BHS 301919, 301920 = י/כשׁלו OSM w14 = יכשלו Zechariah 2:8 BHS 305502, 305503 = הַ/לָּ֖ז OSM w7 = הַלָּ֖ז Zechariah 9:8 BHS 307601, 307602 = מִ/צָּבָה֙ OSM w3 = מִצָּבָה֙ Zechariah 14:6 BHS 309064, 309065 = י/קפאון OSM w8 = יקפאו Zechariah 14:6 BHS 309064, 309065 = י/קפאון OSM w8 = ן Psalms 9:1 BHS 311564, 311565, 311566 = לַ//בֵּ֗ן OSM w3 = לַבֵּ֗ן Psalms 9:10 BHS 311656, 311657, 311658 = בַּ//צָּרָֽה OSM w7 = בַּצָּרָֽה Psalms 10:1 BHS 311776, 311777, 311778 = בַּ//צָּרָֽה OSM w7 = בַּצָּרָֽה Psalms 10:10 BHS 311885, 311886 = חל/כאים OSM w5 = חלכאים Psalms 41:3 BHS 317247, 317248 = י/אשׁר OSM w4 = יאשר Psalms 55:16 BHS 319513, 319514 = ישׁי/מות OSM w1 = ישימות Psalms 123:4 BHS 332922, 332923 = גא/יונים OSM w8 = גְאֵ֥יוֹנִֽים Job 6:14 BHS 337706, 337707 = מֵ/רֵעֵ֣הוּ OSM w2 = מֵרֵעֵ֣ Job 6:14 BHS 337706, 337707 = מֵ/רֵעֵ֣הוּ OSM w2 = הוּ Job 9:30 BHS 338535, 338536 = ב/מו OSM w3 = במו Job 10:20 BHS 338784, 338785 = י/חדל OSM w4 = יחדל Job 10:20 BHS 338786, 338787 = י/שׁית OSM w5 = ישית Job 38:12 BHS 345551, 345552 = ידעת/ה OSM w4 = ידעתה Proverbs 18:17 BHS 351776, 351777 = י/בא OSM w4 = יבא Proverbs 19:4 BHS 351882, 351883 = מֵ/רֵ֥עהוּ OSM w6 = מֵרֵ֥ע Proverbs 19:4 BHS 351882, 351883 = מֵ/רֵ֥עהוּ OSM w6 = הוּ 
Proverbs 20:4 BHS 352164, 352165 = י/שׁאל OSM w5 = ישאל Ecclesiastes 6:10 BHS 361436, 361437 = שׁה/תקיף OSM w14 = ש Ecclesiastes 6:10 BHS 361436, 361437 = שׁה/תקיף OSM w14 = התקיף Daniel 8:16 BHS 375554, 375555 = הַ/לָּ֖ז OSM w10 = הַלָּ֖ז Daniel 11:12 BHS 377178, 377179 = י/רום OSM w3 = ירום Ezra 2:61 BHS 378928, 378929 = הַ/קֹּ֑וץ OSM w6 = הַקּ֑וֹץ Ezra 10:29 BHS 383320, 383321 = י/רמות OSM w8 = ירמות Nehemiah 3:4 BHS 384335, 384336 = הַ/קֹּ֔וץ OSM w8 = הַקּ֔וֹץ Nehemiah 3:6 BHS 384370, 384371 = הַ/יְשָׁנָ֜ה OSM w3 = הַיְשָׁנָ֜ה Nehemiah 3:12 BHS 384483, 384484 = הַ/לֹּוחֵ֔שׁ OSM w6 = הַלּוֹחֵ֔שׁ Nehemiah 3:21 BHS 384676, 384677 = הַ/קֹּ֖וץ OSM w7 = הַקּ֖וֹץ Nehemiah 7:63 BHS 386961, 386962 = הַ/קֹּ֑וץ OSM w6 = הַקּ֑וֹץ Nehemiah 10:25 BHS 388830, 388831 = הַ/לֹּוחֵ֥שׁ OSM w1 = הַלּוֹחֵ֥שׁ Nehemiah 11:35 BHS 389765, 389766 = הַ/חֲרָשִֽׁים OSM w4 = הַחֲרָשִֽׁים Nehemiah 12:39 BHS 390290, 390291 = הַ/יְשָׁנָ֜ה OSM w6 = הַיְשָׁנָ֜ה 1_Chronicles 4:7 BHS 392906, 392907 = י/צחר OSM w4 = יצחר 1_Chronicles 7:34 BHS 395488, 395489 = י/חבה OSM w5 = יחבה 1_Chronicles 13:5 BHS 398556, 398557 = לְ/בֹ֣וא OSM w10 = לְב֣וֹא 1_Chronicles 18:12 BHS 401053, 401054 = הַ/מֶּ֔לַח OSM w8 = הַמֶּ֔לַח 1_Chronicles 24:10 BHS 403641, 403642 = הַ/קֹּוץ֙ OSM w1 = הַקּוֹץ֙ 1_Chronicles 27:12 BHS 405108, 405109 = בן/ימיני OSM w6 = בנימיני 2_Chronicles 7:8 BHS 410242, 410243 = לְּ/בֹ֥וא OSM w15 = לְּב֥וֹא 2_Chronicles 25:11 BHS 419159, 419160 = הַ/מֶּ֑לַח OSM w8 = הַמֶּ֑לַח 2_Chronicles 36:21 BHS 426512, 426513 = הָ/שַּׁמָּה֙ OSM w13 = הָשַּׁמָּ 2_Chronicles 36:21 BHS 426512, 426513 = הָ/שַּׁמָּה֙ OSM w13 = ה֙
Which non-empty BHSA words are not the concatenation of their OSM morphemes?
We do not consider the cases where more than one BHSA word corresponds to an OSM morpheme, because we have already gathered those cases above.
insaneBHS = set() # alignment problems by BHSA slot number
insaneOSM = set() # alignment problems by OSM morpheme index in osmMorphemes
# We compute the slot numbers that are part of a multiple slot alignment to a morpheme
multipleBHS = reduce(set.union, (bhsFromOsm[j] for j in multipleOSM), set())
# Gather the insanities
for (w, js) in osmFromBhs.items():
    if w in multipleBHS:
        continue
    cw = toCons(F.g_cons_utf8.v(w))
    cjs = "".join(osmMorphemes[j][1] for j in js)
    if unFinal(cw) != unFinal(cjs):
        insaneBHS.add(w)
        insaneOSM |= set(js)
utils.caption(0, "insane BHS words: {:>4}".format(len(insaneBHS)))
utils.caption(0, "insane OSM morphemes: {:>4}".format(len(insaneOSM)))
| 20s insane BHS words: 6
| 20s insane OSM morphemes: 8
for w in sorted(insaneBHS):
    problematic.add(w)
    js = osmFromBhs[w]
    if not SCRIPT:
        BHSvsOSM([w], js)
Ezekiel 4:6 BHS 266192 = ימוני OSM w7 = ימיני Ezekiel 43:11 BHS 287363 = צורתו OSM w17, w17 = צורת/י Daniel 10:19 BHS 376869 = כְ OSM w10 = בְ Ezra 10:44 BHS 383409 = נשׂאו OSM w3, w3 = נשא/י Nehemiah 2:13 BHS 384053 = הם OSM w17 = ה Nehemiah 2:13 BHS 384054 = פרוצים OSM w17 = מפרוצים
Let's study the mapping of BHSA words to OSM morphemes in a bit more detail. We are interested in the question: to how many morphemes can words map?
Later we shall see that we can deal with 1 and 2 morphemes per word.
We deem words that map to more than two morphemes problematic.
This turns out to be a very small minority.
morphemesPerWord = collections.Counter()
tooMany = set()
for (w, js) in osmFromBhs.items():
    n = len(js)
    morphemesPerWord[n] += 1
    if n > 2:
        tooMany.add(w)
for (ln, amount) in sorted(morphemesPerWord.items()):
    utils.caption(0, "{:>2} morphemes per word: {:>6}".format(ln, amount))
| 20s 1 morphemes per word: 370680
| 20s 2 morphemes per word: 49400
| 20s 3 morphemes per word: 27
| 20s 4 morphemes per word: 2
for w in sorted(tooMany):
    js = osmFromBhs[w]
    if not SCRIPT:
        BHSvsOSM([w], js)
    problematic.add(w)
Numbers 33:46 BHS 91209 = עַלְמֹ֥ן דִּבְלָתָֽיְמָה OSM w5, w6, w6 = עַלְמֹ֥ן/דִּבְלָתָֽיְמָ/ה Numbers 33:47 BHS 91213 = עַלְמֹ֣ן דִּבְלָתָ֑יְמָה OSM w2, w3, w3 = עַלְמֹ֣ן/דִּבְלָתָ֑יְמָ/ה Deuteronomy 10:6 BHS 99249 = בְּאֵרֹ֥ת בְּנֵי־יַעֲקָ֖ן OSM w4, w5, w6 = בְּאֵרֹ֥ת/בְּנֵי/יַעֲקָ֖ן Joshua 13:17 BHS 120713 = בֵ֖ית בַּ֥עַל מְעֹֽון OSM w9, w10, w11 = בֵ֖ית/בַּ֥עַל/מְעֽוֹן Joshua 15:32 BHS 121905 = עַ֣יִן וְרִמֹּ֑ון OSM w3, w4, w4 = עַ֣יִן/וְ/רִמּ֑וֹן Joshua 15:62 BHS 122150 = עִיר־הַמֶּ֖לַח OSM w2, w3, w3 = עִיר/הַ/מֶּ֖לַח Judges 8:13 BHS 132427 = מַעֲלֵ֖ה הֶחָֽרֶס OSM w7, w8, w8 = מַעֲלֵ֖ה/הֶ/חָֽרֶס 1_Samuel 4:1 BHS 143295 = הָאֶ֣בֶן הָעֵ֔זֶר OSM w13, w13, w14 = הָ/אֶ֣בֶן/הָעֵ֔זֶר 1_Samuel 25:3 BHS 156565 = כלבו OSM w17, w17, w17 = כ/לב/ו 1_Kings 7:36 BHS 181670 = ומסגרתיה OSM w6, w6, w6 = ו/מסגרתי/ה 1_Kings 15:20 BHS 188754 = אָבֵ֣ל בֵּֽית־מַעֲכָ֑ה OSM w22, w23, w24 = אָבֵ֣ל/בֵּֽית/מַעֲכָ֑ה 2_Kings 7:15 BHS 199376 = בה OSM w14, w14, w14 = ב/החפז/ם 2_Kings 7:15 BHS 199377 = חפזם OSM w14, w14, w14 = ב/החפז/ם 2_Kings 10:12 BHS 201428 = בֵּֽית־עֵ֥קֶד הָרֹעִ֖ים OSM w6, w7, w8 = בֵּֽית/עֵ֥קֶד/הָרֹעִ֖ים 2_Kings 15:29 BHS 204852 = אָבֵ֣ל בֵּֽית־מַעֲכָ֡ה OSM w14, w15, w16 = אָבֵ֣ל/בֵּֽית/מַעֲכָ֡ה Isaiah 8:1 BHS 214752 = מַהֵ֥ר שָׁלָ֖ל חָ֥שׁ בַּֽז OSM w12, w13, w14, w15 = מַהֵ֥ר/שָׁלָ֖ל/חָ֥שׁ/בַּֽז Isaiah 8:3 BHS 214783 = מַהֵ֥ר שָׁלָ֖ל חָ֥שׁ בַּֽז OSM w12, w13, w14, w15 = מַהֵ֥ר/שָׁלָ֖ל/חָ֥שׁ/בַּֽז Jeremiah 39:3 BHS 256418 = נֵרְגַ֣ל שַׂר־֠אֶצֶר OSM w9, w10, w11 = נֵרְגַ֣ל/שַׂר/אֶ֠צֶר Jeremiah 39:3 BHS 256423 = נֵרְגַ֤ל שַׂר־אֶ֨צֶר֙ OSM w18, w19, w20 = נֵרְגַ֤ל/שַׂר/אֶ֨צֶר֙ Jeremiah 39:13 BHS 256650 = נֵרְגַ֥ל שַׂר־אֶ֖צֶר OSM w8, w9, w10 = נֵרְגַ֥ל/שַׂר/אֶ֖צֶר Ezekiel 44:24 BHS 288287 = ושׁפטהו OSM w7, w7, w7 = ו/שפט/הו Psalms 91:12 BHS 326518 = יִשָּׂא֑וּנְךָ OSM w3, w3, w3 = יִשָּׂא֑וּ/נְ/ךָ Job 30:4 BHS 343152 = לַחְמָֽם OSM w7, w7, w7 = לַ/חְמָֽ/ם Proverbs 11:3 BHS 349658 = ושׁדם OSM w6, w6, w6 = ו/שד/ם Proverbs 12:26 BHS 350148 = מֵרֵעֵ֣הוּ OSM w2, w2, w2 = 
מֵ/רֵעֵ֣/הוּ Daniel 11:26 BHS 377478 = פַת־בָּגֹ֛ו OSM w2, w3, w3 = פַת/בָּג֛/וֹ 1_Chronicles 2:54 BHS 392520 = עַטְרֹ֖ות בֵּ֣ית יֹואָ֑ב OSM w6, w7, w8 = עַטְר֖וֹת/בֵּ֣ית/יוֹאָ֑ב 2_Chronicles 32:21 BHS 423602 = מיציאו OSM w20, w20, w20 = מ/יציא/ו 2_Chronicles 34:6 BHS 424586 = הר בתיהם OSM w7, w8, w8 = הר/בתי/הם
Finally, we inspect the cases that correspond to the manual exceptions.
for (w, n) in exceptions.items():
    if n > 0:
        js = osmFromBhs[w]
        if not SCRIPT:
            BHSvsOSM([w], js)
        problematic.add(w)
    else:
        j = osmFromBhs[w][0]
        ws = bhsFromOsm[j]
        if not SCRIPT:
            BHSvsOSM(ws, [j])
        problematic |= set(ws)
Isaiah 9:6 BHS 215256 = מרבה OSM w1 = םרבה Ezekiel 4:6 BHS 266192 = ימוני OSM w7 = ימיני Ezekiel 43:11 BHS 287363 = צורתו OSM w17, w17 = צורת/י Daniel 10:19 BHS 376869 = כְ OSM w10 = בְ Ezra 10:44 BHS 383409 = נשׂאו OSM w3, w3 = נשא/י Nehemiah 2:13 BHS 384053 = הם OSM w17 = ה Nehemiah 2:13 BHS 384054 = פרוצים OSM w17 = מפרוצים 1_Chronicles 27:12 BHS 405108, 405109 = בן/ימיני OSM w6 = בנימיני
Here is the number of problematic words in the BHSA that we will exclude from comparisons.
utils.caption(
    0, f"There are {len(problematic)} problematic words in the BHSA wrt to OSM"
)
utils.caption(0, "These will be excluded from further comparisons")
| 20s There are 259 problematic words in the BHSA wrt to OSM
| 20s These will be excluded from further comparisons
We collect the non-empty word nodes for which none of the corresponding morphemes has been tagged with morphology.
noMorphWords = set()
for w in F.otype.s("word"):
    if not F.g_word_utf8.v(w):
        continue
    hasMorph = False
    for j in osmFromBhs.get(w, []):
        if osmMorphemes[j][2]:
            hasMorph = True
            break
    if not hasMorph:
        noMorphWords.add(w)
if len(noMorphWords):
    utils.caption(0, f"No OSM morphology for {len(noMorphWords)} non-empty BHSA words")
else:
    utils.caption(0, "There is OSM morphology for all non-empty BHSA words")
| 20s There is OSM morphology for all non-empty BHSA words
Let's get a feeling for how the words without tagged morphology are distributed. First we represent them as a list of intervals, using the TF utility function rangesFromSet, and then we tabulate the lengths of those intervals.
noMorphIntervals = rangesFromSet(noMorphWords)
noMorphLengths = collections.Counter()
for interval in noMorphIntervals:
    noMorphLengths[interval[1] - interval[0] + 1] += 1
if noMorphLengths:
    utils.caption(0, "Non-marked-up stretches having length x: y times")
    for (ln, amount) in sorted(noMorphLengths.items()):
        utils.caption(0, "{:>4}: {:>5}".format(ln, amount))
else:
    utils.caption(0, "no non-marked-up stretches")
| 20s no non-marked-up stretches
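Since the real data no longer has any untagged stretches, here is a minimal sketch of the same summary on made-up node numbers. The helper toRanges is a hypothetical pure-Python stand-in written for this illustration; it is only assumed to resemble rangesFromSet in that it produces inclusive (start, end) pairs.

```python
import collections


def toRanges(nodes):
    """Collapse a set of integers into sorted, inclusive (start, end) ranges.

    Hypothetical stand-in for illustration, not the TF implementation.
    """
    ranges = []
    for n in sorted(nodes):
        if ranges and n == ranges[-1][1] + 1:
            # n continues the current range: extend its end point
            ranges[-1] = (ranges[-1][0], n)
        else:
            # n starts a new range
            ranges.append((n, n))
    return ranges


# made-up word nodes without morphology
toyWords = {3, 4, 5, 9, 12, 13}
toyIntervals = toRanges(toyWords)

# count how often each interval length occurs
toyLengths = collections.Counter(e - b + 1 for (b, e) in toyIntervals)
for (ln, amount) in sorted(toyLengths.items()):
    print("{:>4}: {:>5}".format(ln, amount))
```

In this toy set there is one stretch each of length 1, 2 and 3, so a single long untagged passage would show up as one high length with count 1.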
We now proceed to compile the OSM morphology into Text-Fabric features.
The basic idea is: create a feature osm and, for each BHSA word, let it contain the contents of the corresponding morph attribute in the OSM source.
There are several things to deal with, or not to deal with.
We will ignore the problematic cases. More precisely, whenever a BHSA word belongs to a problematic case, as diagnosed before, we fill its osm feature with the value *.
There are BHSA words that do not correspond to OSM morphemes: the empty words. We will not give them an osm value.
There are BHSA words that correspond to more than two morphemes. We have added them to our problematic list.
The vast majority of BHSA words correspond to a single OSM morpheme. The osm feature of those words will be filled with the morph attribute of the corresponding OSM morpheme. No problem here.
The remaining cases consist of BHSA words that correspond to exactly two morphemes. We will use the morph value of the first morpheme to fill the osm feature for such words, and we will make a new feature, osm_sf, and fill it with the morph value of the second morpheme.
So, we will create a TF module consisting of two features: osm and osm_sf (osm suffix).
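The rules above can be tried out on a toy mapping before running them on the real data. This is only a sketch: the function assignOsm, the node numbers and the morph strings are all made up for illustration and are not taken from the OSM source.

```python
def assignOsm(osmFromBhs, morphOf, problematic):
    """Apply the rules above to a word -> morpheme-index mapping.

    Hypothetical helper for illustration only.
    """
    osm = {}
    osm_sf = {}
    for (w, js) in osmFromBhs.items():
        if w in problematic:
            # problematic words get the placeholder value
            osm[w] = "*"
            continue
        # first morpheme fills osm
        osm[w] = morphOf[js[0]]
        if len(js) > 1:
            # second morpheme, if any, fills osm_sf
            osm_sf[w] = morphOf[js[1]]
    return (osm, osm_sf)


# made-up morph values per morpheme index
toyMorph = ["HC", "HNcmsc", "HSp3ms", "HR", "HNcfsc", "HSp3fs"]
# word 1: one morpheme; word 2: two morphemes; word 3: three morphemes
# word 4 is an empty word: it has no entry at all
toyOsmFromBhs = {1: [0], 2: [1, 2], 3: [3, 4, 5]}
toyProblematic = {3}

(toyOsm, toyOsmSf) = assignOsm(toyOsmFromBhs, toyMorph, toyProblematic)
print(toyOsm)
print(toyOsmSf)
```

Note that the empty word 4 ends up in neither feature, and that the problematic word 3 gets no osm_sf value.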
Let's assemble the feature data.
osmData = {}
osm_sfData = {}
for (w, js) in osmFromBhs.items():
    if w in problematic:
        osmData[w] = "*"
        continue
    osmData[w] = osmMorphemes[js[0]][2]
    if len(js) > 1:
        osm_sfData[w] = osmMorphemes[js[1]][2]
genericMetaPath = f"{thisRepo}/yaml/generic.yaml"
bridgingMetaPath = f"{thisRepo}/yaml/bridging.yaml"
with open(genericMetaPath) as fh:
    genericMeta = yaml.load(fh, Loader=yaml.FullLoader)
    genericMeta["version"] = VERSION
with open(bridgingMetaPath) as fh:
    bridgingMeta = formatMeta(yaml.load(fh, Loader=yaml.FullLoader))
metaData = {"": genericMeta, **bridgingMeta}
nodeFeatures = dict(osm=osmData, osm_sf=osm_sfData)
for f in nodeFeatures:
    metaData[f]["valueType"] = "str"
And combine it with a bit of metadata.
utils.caption(4, "Writing tree feature to TF")
TFw = Fabric(locations=thisTempTf, silent=True)
TFw.save(nodeFeatures=nodeFeatures, edgeFeatures={}, metaData=metaData)
.............................................................................................. . 39s Writing tree feature to TF . ..............................................................................................
True
Check differences with previous versions.
utils.checkDiffs(thisTempTf, thisTf, only=set(nodeFeatures))
.............................................................................................. . 5m 50s Check differences with previous version . .............................................................................................. | 5m 50s no features to add | 5m 50s no features to delete | 5m 50s 2 features in common | 5m 50s osm ... no changes | 5m 51s osm_sf ... no changes | 5m 51s Done
Copy the new TF features from the temporary location where they have been created to their final destination.
utils.deliverDataset(thisTempTf, thisTf)
.............................................................................................. . 5m 56s Deliver data set to /Users/dirk/github/etcbc/bridging/tf/2021 . ..............................................................................................
utils.caption(4, "Load and compile the new TF features")
.............................................................................................. . 6m 06s Load and compile the new TF features . ..............................................................................................
TF = Fabric(locations=[coreTf, thisTf], modules=[""])
api = TF.load("language " + " ".join(nodeFeatures))
api.makeAvailableIn(globals())
This is Text-Fabric 9.0.4 Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html 117 features found and 0 ignored 0.00s loading features ... | 0.00s Dataset without structure sections in otext:no structure functions in the T-API | 0.85s T osm from ~/github/etcbc/bridging/tf/2021 | 0.15s T osm_sf from ~/github/etcbc/bridging/tf/2021 4.50s All features loaded/computed - for details use TF.isLoaded()
[('Computed', 'computed-data', ('C Computed', 'Call AllComputeds', 'Cs ComputedString')), ('Features', 'edge-features', ('E Edge', 'Eall AllEdges', 'Es EdgeString')), ('Fabric', 'loading', ('TF',)), ('Locality', 'locality', ('L Locality',)), ('Nodes', 'navigating-nodes', ('N Nodes',)), ('Features', 'node-features', ('F Feature', 'Fall AllFeatures', 'Fs FeatureString')), ('Search', 'search', ('S Search',)), ('Text', 'text', ('T Text',))]
utils.caption(4, "Basic tests")
utils.caption(4, "Language according to BHSA and OSM")
.............................................................................................. . 6m 53s Basic tests . .............................................................................................. .............................................................................................. . 6m 53s Language according to BHSA and OSM . ..............................................................................................
langBhsFromOsm = dict(A="Aramaic", H="Hebrew")
langOsmFromBhs = dict((y, x) for (x, y) in langBhsFromOsm.items())
xLanguage = set()
strangeLanguage = collections.Counter()
for w in F.otype.s("word"):
    osm = F.osm.v(w)
    if osm is None or osm == "" or osm == "*":
        continue
    osmLanguage = osm[0]
    trans = langBhsFromOsm.get(osmLanguage, None)
    if trans is None:
        strangeLanguage[osmLanguage] += 1
    else:
        if langBhsFromOsm[osm[0]] != F.language.v(w):
            xLanguage.add(w)
if strangeLanguage:
    utils.caption(0, "Strange languages")
    for (ln, amount) in sorted(strangeLanguage.items()):
        utils.caption(0, "Strange language {}: {:>5}x".format(ln, amount))
else:
    utils.caption(
        0, "No other languages encountered than {}".format(", ".join(langBhsFromOsm))
    )
utils.caption(0, "Language discrepancies: {}".format(len(xLanguage)))
for w in sorted(xLanguage):
    passage = "{} {}:{}".format(*T.sectionFromNode(w))
    utils.caption(0, f"{passage} word {w}: {F.g_word_utf8.v(w):>12} - BHSA: {F.language.v(w)}; OSM: {langBhsFromOsm[F.osm.v(w)[0]]}")
| 16m 49s No other languages encountered than A, H | 16m 49s Language discrepancies: 2 | 16m 49s Psalms 116:12 word 330987: כָּֽל - BHSA: Aramaic; OSM: Hebrew | 16m 49s Psalms 116:12 word 330988: תַּגְמוּלֹ֥והִי - BHSA: Aramaic; OSM: Hebrew
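The test above relies on the first character of an osm value encoding the language. A minimal sketch of that decision for a single word, without any TF data loaded, could look as follows. The function compareLanguage and the sample morph strings are hypothetical and serve only as illustration.

```python
langBhsFromOsm = dict(A="Aramaic", H="Hebrew")


def compareLanguage(osm, bhsaLanguage):
    """Classify one word by comparing the OSM language letter with the BHSA language.

    Hypothetical helper for illustration only.
    """
    if osm is None or osm == "" or osm == "*":
        # no usable OSM value: empty or problematic word
        return "skipped"
    trans = langBhsFromOsm.get(osm[0])
    if trans is None:
        # first letter is neither A nor H
        return "strange"
    return "same" if trans == bhsaLanguage else "different"


# made-up sample values
print(compareLanguage("HNcmsa", "Hebrew"))
print(compareLanguage("HNcmsa", "Aramaic"))
print(compareLanguage("*", "Hebrew"))
```

The two words reported in Psalms 116:12 correspond to the "different" outcome: OSM marks them as Hebrew while the BHSA marks them as Aramaic.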
If this notebook is run only to generate data, it stops here.
Everything after this point consists of tests and examples.
if SCRIPT:
    stop(good=True)
Now you can write notebooks to process BHSA data and grab the OSM morphology as you go, like so:
from tf.app import use
A = use("ETCBC/bhsa", mod="etcbc/bridging/tf", hoist=globals())
Before we flesh out the alignment algorithm, let's find the first point where BHSA and OSM diverge.
for (i, w) in enumerate(F.otype.s("word")):
    bhs = toCons(F.g_cons_utf8.v(w))
    osm = osmMorphemes[i][1]
    if bhs != osm:
        utils.caption(
            0, "Mismatch at BHS-{} OSM-{}: bhs=[{}] osm=[{}]".format(w, i, bhs, osm)
        )
        break
| 17m 02s Mismatch at BHS-62 OSM-61: bhs=[] osm=[אור]
showCase(62, 61, 5)
('Genesis', 1, 5) BHS word 62 = [] word 63 = [אור] word 64 = [יום] word 65 = [ו] word 66 = [ל] OSM morph 61 = [אור] morph 62 = [יום] morph 63 = [ו] morph 64 = [ל] morph 65 = [חשך]
This is a case of an empty article in the BHSA. Let's circumvent this, and move on.
j = -1
for w in F.otype.s("word"):
    bhs = toCons(F.g_cons_utf8.v(w))
    if bhs == "":
        continue
    j += 1
    osm = osmMorphemes[j][1]
    if bhs != osm:
        utils.caption(
            0,
            """Mismatch at BHS-{} OSM-{}:\nbhs=[{}]\nos=[{}]""".format(w, j, bhs, osm),
        )
        break
| 17m 52s Mismatch at BHS-194 OSM-187: bhs=[מינו] os=[מינ]
showCase(194, 187, 5)
('Genesis', 1, 11) BHS word 194 = [מינו] word 195 = [אשר] word 196 = [זרעו] word 197 = [בו] word 198 = [על] OSM morph 187 = [מינ] morph 188 = [ו] morph 189 = [אשר] morph 190 = [זרע] morph 191 = [ו]
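The two probes above follow the same pattern: walk both sequences in step, skip empty BHSA words, and stop at the first difference. Here is a minimal sketch of that pattern on toy data. The function firstMismatch is hypothetical, and Latin placeholder strings stand in for the Hebrew consonantal forms.

```python
def firstMismatch(bhsWords, osmMorphs):
    """Return (word index, morpheme index) of the first divergence, or None.

    Empty BHSA words are skipped, because they have no OSM counterpart.
    Hypothetical helper for illustration only.
    """
    j = -1
    for (i, bhs) in enumerate(bhsWords):
        if bhs == "":
            continue
        j += 1
        if j >= len(osmMorphs) or bhs != osmMorphs[j]:
            return (i, j)
    return None


# word 1 is empty (like an elided article);
# word 3 carries a suffix that the other side splits off as a separate morpheme
toyBhs = ["b", "", "abc", "xyzw", "k"]
toyOsm = ["b", "abc", "xyz", "w", "k"]

print(firstMismatch(toyBhs, toyOsm))
```

The toy run stops at the word with the attached suffix, which mirrors the situation at BHS word 194: from there on a one-to-one walk is no longer enough, and a word has to be matched against one or more morphemes.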