Q: What is the use of a phonetic transliteration of the Hebrew Bible? What can anyone wish beyond the careful, meticulous Masoretic system of consonants, vowels and accents?
A: Several things:
Q: But surely, there already exist transliterations of Hebrew? Why not use them?
Here are a few pragmatic reasons:
Q: But how good is your transliteration?
we do not know, ..., yet. A few remarks though:
phono
, then we
will collect them.

Q: To me, this is not entirely satisfying.
A: Fair enough. Consider jumping to Bible Online Learner, where they have built in a pretty good transliteration, based on a different method of rule application. It is documented in an article by Nicolai Winther-Nielsen: Transliteration of Biblical Hebrew for the Role-Lexical Module and additional information can be found in Claus Tøndering's Bible Online Learner, Software on GitHub. See also Lex: A software project for linguists.
We are planning to conduct an automatic comparison of both transliteration schemes over the whole corpus.
Q: Who is the we?
That is the author of this notebook, Dirk Roorda, working together with Martijn Naaijer and getting input from Nicolai Winther-Nielsen and Willem van Peursen.
phono(
ETCBC-original, ...):
phonetic transliteration.

The test files show a prepared set of cases against which to test new versions of the algorithm. These results have been obtained on version c of the BHSA dataset.
Note, that in the non-verb qamets puzzles we have tacitly made the assumption that qamets qatan and gadol are not phonological variants of each other. In other words, it never occurs that a qamets gadol becomes shortened into a qamets qatan. From the grammar rules it follows that short versions of the qamets can only be
and never
Whether this hypothesis is right is beyond my competence. We just use it as a working hypothesis.
This method is not a pure method, in the sense of a method that works only with the information given in the source string.
We cheat, i.e. we use morphological information from the BHSA database to steer us in the right direction. To this end, the input of phono() is always a Text-Fabric node, from which we can get all information we need.
More precisely, the input is a sequence of nodes. This sequence is meant to correspond to a sequence of slots belonging to words that are written adjacently (no space between, no maqef between). From these nodes we can look up:
You can use phono() to transliterate multiple words at the same time, but you can also do individual words, even if in Hebrew they are written together.
However, it is better to feed combined words to phono() in one go, because the prefix word may influence the transliteration of the postfix word. Think of the article followed by a word starting with a BGDKPT letter.
The dagesh in the BGDKPT is interpreted as a lene if the word stands on its own, but as a forte if it is combined.
However, it is not advised to feed longer strings to phono(), because when phono() retrieves lexical information, it uses the information of the last node that matches a word in the input string.
We determine "primary" and "secondary" stress in our transliteration, but this must not be taken in a phonetic sense. Every syllable that carries an accent pointing will get a primary stress mark. However, a few specific accent pointings are not deemed to produce an accent, and another group of accents is deemed to produce only a secondary accent. The last syllable of a word also gets a secondary accent by default. We have not yet tried to be more precise in this, so segolates do not get the treatment they deserve.
The main rationale for accents is that they prevent a qamets from being read as qatan.
We have made a careful selection of UNICODE symbols to represent Hebrew sounds. Sometimes we follow the phonetic usage of the symbols, sometimes we follow widespread custom. The actual mapping can be plugged in quite easily, and the intermediate stages in the transformation do not use these symbols, so the algorithm can be easily adapted to other choices.
Provided it is not part of a long vowel, we write י as y, whilst j would be more in line with the phonetic alphabet.
Likewise, we write ו as w, if it is not part of a long vowel.
If a word ends in יו, the ו is not a mater lectionis, and the י gets elided.
We represent this phonetically as ʸw.
With regard to the BGDKPT letters, it would have been attractive to use the letters b g d k p t without diacritic for the plosive variants, and with a suitable diacritic for the fricative variants.
Alas, the UNICODE table does not offer such a suitable diacritic that is available for all these particular 6 letters.
So, we use b g d k p t for the plosives, but for the fricatives we use v ḡ ḏ ḵ f ṯ.
With regard to the emphatic consonants ט and ח and צ, we represent them with a dot below: ṭ ḥ ṣ.
ק is just q.
ע and א translate to ʕ and ʔ.
שׁ and שׂ translate to š and ś.
ס is just s.
When א and ה are matres lectionis, they are left out. A ה with mappiq becomes just h, like every ה which is not a mater lectionis.
We do not mark the deviant final forms of the consonants ך and ם and ן and ף and ץ, assuming that this is just a scriptural peculiarity, with no effect on the actual sounds.
The remaining consonants go as follows:
ל | l |
מ | m |
נ | n |
ר | r |
ז | z |
The short vowels (patah, segol, hireq, qamets qatan, qibbuts) are just a e i o u.
However, the furtive patah is a ₐ in front of its consonant.
The long vowels without yod or waw (qamets gadol, tsere, holam) have a bar above: ā ē ō.
The complex vowels (tsere or hireq plus yod, holam plus waw, waw with dagesh) have a circumflex: ê î ô û.
A segol followed by yod becomes eʸ.
The composite schwas (patah, segol, qamets) are written as superscripts ᵃ ᵉ ᵒ.
The simple schwa is left out if silent, and otherwise it becomes ᵊ.
The primary and secondary stress are marked as ˈ ˌ and are placed in front of the vowel they occur with.
The sof-pasuq ׃ becomes . (a period).
If it is followed by ס (setumah) or ף (petuhah) or ׆ (nun-hafukha), these extra symbols are omitted.
The maqef ־ (between words) becomes -.
If words are juxtaposed without space in the Hebrew, they are also juxtaposed without space in the phonetic transliteration.
The tetragrammaton is transliterated with the vowels it is encountered with, but the whole is put between square brackets [ ].
We base the phonetics on the (vocalized) qere, if a qere is present.
The ketiv is then ignored. We precede each such word by a * to indicate that the qere deviates from the ketiv. Using the data view in SHEBANQ it is possible to see what the ketiv is.
We leave the accents and the schwas in the end product of the phono() function, despite the fact that the accents, as they appear, do not have consistent phonetic significance.
And it can be argued that every schwa is silent.
If you do not care for schwas and accents, it is easy to remove them.
Also, if you find the results in separating the qamets into qatan and gadol unsatisfying or irrelevant, you can just replace them both by a single symbol, such as å.
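The clean-up suggested above could look roughly like the following sketch. It works on an already transliterated string and uses only the symbols listed in this section. The function name `simplify`, its parameters and the sample string are assumptions made for illustration; they are not part of the notebook. It also assumes that vowels are stored as precomposed characters, so that replacing o does not touch ō or ô.

```python
import re

# Symbols taken from the mapping described in this section.
STRESS_MARKS = "ˈˌ"   # primary and secondary stress
SIMPLE_SCHWA = "ᵊ"    # audible simple schwa
QAMETS_GADOL = "ā"
QAMETS_QATAN = "o"
MERGED_QAMETS = "å"


def simplify(phono_text, drop_stress=True, drop_schwa=True, merge_qamets=True):
    """Hypothetical post-processing of a transliterated string."""
    result = phono_text
    if drop_stress:
        # remove both stress marks
        result = re.sub("[{}]".format(STRESS_MARKS), "", result)
    if drop_schwa:
        # remove the simple schwa only; composite schwas are kept
        result = result.replace(SIMPLE_SCHWA, "")
    if merge_qamets:
        # collapse qamets gadol and qamets qatan into one symbol
        result = result.replace(QAMETS_GADOL, MERGED_QAMETS)
        result = result.replace(QAMETS_QATAN, MERGED_QAMETS)
    return result


# made-up input, not verified output of phono()
sample = "d" + QAMETS_GADOL + "vˈ" + QAMETS_GADOL + "r k" + QAMETS_QATAN + "l"
print(simplify(sample))
```

Each of the three simplifications can be switched off separately, for instance `simplify(sample, merge_qamets=False)` keeps the gadol/qatan distinction.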
Quite a bit of code is dedicated to counting special cases, to testing, and to producing neat tables with interesting forms.
It is also possible to call the phono() function in debug mode, which will write to a text file all stages in the transliteration from BHSA original into the phonetic result.
import sys
import os
import collections
import re
import yaml
import utils
from tf.fabric import Fabric
from tf.writing.transcription import Transcription
from tf.core.helpers import formatMeta
if "SCRIPT" not in locals():
    SCRIPT = False
    FORCE = True
    CORE_NAME = "bhsa"
    NAME = "phono"
    VERSION = "2021"


def stop(good=False):
    if SCRIPT:
        sys.exit(0 if good else 1)
This notebook can run a lot of tests and create a lot of examples.
However, when run in the pipeline, we only want to create the two phono features.
So, further on, there will be quite a bit of code under the condition not SCRIPT.
The conversion is executed in an environment of directories, so that sources, temp files and results are in convenient places and do not have to be shifted around.
repoBase = os.path.expanduser("~/github/etcbc")
coreRepo = "{}/{}".format(repoBase, CORE_NAME)
thisRepo = "{}/{}".format(repoBase, NAME)
coreTf = "{}/tf/{}".format(coreRepo, VERSION)
thisTemp = "{}/_temp/{}".format(thisRepo, VERSION)
thisTempTf = "{}/tf".format(thisTemp)
thisTf = "{}/tf/{}".format(thisRepo, VERSION)
Check whether this conversion is needed in the first place. Only when run as a script.
if SCRIPT:
    (good, work) = utils.mustRun(
        None, "{}/.tf/{}.tfx".format(thisTf, "phono"), force=FORCE
    )
    if not good:
        stop(good=False)
    if not work:
        stop(good=True)
utils.caption(4, "Load the existing TF dataset")
TF = Fabric(locations=coreTf, modules=[""])
.............................................................................................. . 0.00s Load the existing TF dataset . .............................................................................................. This is Text-Fabric 9.1.7 Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html 114 features found and 0 ignored
api = TF.load(
"""
qere qere_trailer
g_word_utf8 g_cons_utf8 trailer
g_word g_cons lex_utf8 lex lex0
sp vs vt gn nu ps st
uvf prs g_prs pfm vbs vbe
languageISO
"""
)
api.makeAvailableIn(globals())
0.00s loading features ... | 0.00s Dataset without structure sections in otext:no structure functions in the T-API 10s All features loaded/computed - for details use TF.isLoaded()
[('Computed', 'computed-data', ('C Computed', 'Call AllComputeds', 'Cs ComputedString')), ('Features', 'edge-features', ('E Edge', 'Eall AllEdges', 'Es EdgeString')), ('Fabric', 'loading', ('TF',)), ('Locality', 'locality', ('L Locality',)), ('Nodes', 'navigating-nodes', ('N Nodes',)), ('Features', 'node-features', ('F Feature', 'Fall AllFeatures', 'Fs FeatureString')), ('Search', 'search', ('S Search',)), ('Text', 'text', ('T Text',))]
Here is what we use as our starting point: the BHSA transliteration, with one or two tweaks.
The BHSA transliteration encodes also what comes after each word until the next word. Sometimes we want that extra bit, and sometimes not, and sometimes part of it.
# punctuation
punctuation = re.compile(
r"""
(?: [ -]\s*\Z) # space, (no maqef) or nospace
| (?:
0[05] # sof pasuq or paseq
(?:_[SNP])* # nun hafukha, setumah, petuhah at end of verse
\s*\Z
)
| (?:_[SPN]\s*\Z) # nun hafukha, setumah, petuhah between words
""",
re.X,
)
split_punctuation = re.compile(
r"""
(.*?) # part before punctuation
((?: # punctuation itself
(?: [ &-]\s*) # space, maqef, or nospace
| (?:
0[05] # sof pasuq or paseq
(?:_[SNP])* # nun hafukha, setumah, petuhah at end of verse
\s*
)
| (?:_[SPN]\s*) # nun hafukha, setumah, petuhah between words
)*)
""",
re.X,
)
start_punct = re.compile(
r"""
(?: \A[ &-]\s*) # space, maqef or nospace
| (?:
\A
0[05] # sof pasuq or paseq
(?:_[SNP])* # nun hafukha, setumah, petuhah at end of verse
\s*
)
| (?:\A\s*_[SPN]\s*) # nun hafukha, setumah, petuhah between words
""",
re.X,
)
noorigspace = re.compile(
r"""
(?: [&-]\Z) # space, maqef or nospace
| (?:
0[05] # sof pasuq or paseq
(?:_[SNP])* # nun hafukha, setumah, petuhah at end of verse
\Z
)
| (?:_[SPN])+ # nun hafukha, setumah, petuhah between words
""",
re.X,
)
setumah and petuhah
Usually, setumah and petuhah occur after the end of verse sign.
In that case we can strip them.
Sometimes they occur inter-word. Then we have to replace them by a space
because the words are otherwise adjacent.
This operation must be performed before originals are glued together, because the _S and _P can only be reliably detected if they are at the end of a word.
So: set_pet is to be used before phono(), in get_orig, but only if get_orig is used for phono().
set_pet_pattern = re.compile(r"((?:0[05])?)(_[SNP])+\Z")
tetra_lex = "JHWH/"
def set_pet_pattern_repl(match):
    (punct, nsp) = match.groups()
    sep = " § " if punct == "" and nsp != "" else ""
    return punct + sep
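To see what this substitution does, here is a small stand-alone sketch that repeats the pattern and the replacement function from above and applies them to three made-up strings. The wrapper `strip_set_pet` and the sample strings are assumptions for the sake of the demonstration; in the notebook the substitution is applied inside get_orig.

```python
import re

# copied from the notebook
set_pet_pattern = re.compile(r"((?:0[05])?)(_[SNP])+\Z")


def set_pet_pattern_repl(match):
    (punct, nsp) = match.groups()
    sep = " § " if punct == "" and nsp != "" else ""
    return punct + sep


def strip_set_pet(orig):
    # hypothetical wrapper, only for this demonstration
    return set_pet_pattern.sub(set_pet_pattern_repl, orig)


# after an end-of-verse sign the marker is simply stripped
end_of_verse = strip_set_pet("H>RY00_S")
# between words the marker is replaced by a spaced placeholder
between_words = strip_set_pet("WJ>MR_P")
# strings without such a marker are left alone
plain = strip_set_pet("DBR")

print(repr(end_of_verse), repr(between_words), repr(plain))
```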
def get_orig(w, punct=True, set_pet=False, tetra=True, give_ketiv=False):
    proto = F.g_word.v(w) + F.trailer.v(w)
    qere = F.qere.v(w)
    qere_trailer = F.qere_trailer.v(w)
    if qere_trailer == "":
        qere_trailer = "-"
    orig = proto if give_ketiv or qere is None else qere + qere_trailer
    if tetra and F.lex.v(w) == tetra_lex:
        (mat, sep) = split_punctuation.fullmatch(orig).groups()
        orig = "[ " + mat + " ]" + sep
    if not punct:
        orig = punctuation.sub("", orig)
    else:
        # if not noorigspace.search(orig):
        #     orig += ' '
        if not set_pet:
            orig = set_pet_pattern.sub(set_pet_pattern_repl, orig)
    return orig
Find the first occurrence of the string orig in the verse (ETCBC representation). Then deliver the sequence of nodes corresponding to that sequence. It turns out that too much is happening with accents, so I will "normalize" the accents for the sake of looking up.
digit = re.compile("[0-9]+")
def find_w(passage, orig, debug=False):
    if len(orig) == 0:
        return None
    vn = T.nodeFromSection(passage, lang="la")
    verse_words = L.d(vn, "word")
    results = None
    orig = orig.strip() + " "
    lvw = len(verse_words)
    for i in range(lvw):
        target = orig
        for j in range(i, lvw + 1):
            target = start_punct.sub("", target)
            target = digit.sub("", target)
            if len(target) == 0:
                results = verse_words[i:j]
                break
            if j >= lvw:
                break
            j_orig = digit.sub(
                "",
                get_orig(
                    verse_words[j],
                    punct=False,
                    tetra=False,
                    give_ketiv=True,
                ),
            ).rstrip("&")
            if target.startswith(j_orig):
                if debug:
                    TF.info("{}-{}: [{}] <= [{}]".format(i, j, j_orig, target))
                target = target[len(j_orig) :]
                if debug:
                    TF.info("{}-{}: [{}]".format(i, j, target))
                continue
            if debug:
                TF.info("{}-{}: [{}] <! [{}]".format(i, j, j_orig, target))
            break
    return results
# partition a list of nodes into chunks
# whenever a node has an orig string that does not end with a - start a new chunk
def partition_w(wnodes):
    results = []
    cur_chunk = []
    orig = None
    for w in wnodes:
        cur_chunk.append(w)
        orig = get_orig(w, tetra=False)
        if orig.endswith("-"):
            continue
        results.append(tuple(cur_chunk))
        cur_chunk = []
    if len(cur_chunk):
        results.append(tuple(cur_chunk))
    return results
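The chunking logic can be tried out without a loaded dataset. The sketch below mirrors the loop of partition_w, but takes the lookup function as a parameter, so that a plain dictionary can stand in for get_orig. The name `partition_by_hyphen`, the node numbers and the strings are made up for illustration.

```python
def partition_by_hyphen(items, get_text):
    """Group items into chunks; a chunk is closed by the first item
    whose text does not end with '-'."""
    results = []
    cur_chunk = []
    for item in items:
        cur_chunk.append(item)
        if get_text(item).endswith("-"):
            # this item is glued to the next one: keep collecting
            continue
        results.append(tuple(cur_chunk))
        cur_chunk = []
    if cur_chunk:
        # trailing items that were never closed off
        results.append(tuple(cur_chunk))
    return results


# made-up node numbers and strings, standing in for get_orig(w, tetra=False)
fake_orig = {1: "B.:-", 2: "R;>CIJT ", 3: "B.@R@> ", 4: "W:-"}
chunks = partition_by_hyphen([1, 2, 3, 4], fake_orig.get)
print(chunks)
```

Nodes 1 and 2 end up in one chunk because the text of node 1 ends in a hyphen; node 4 forms a final, unclosed chunk.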
Here is the list of symbols that constitutes the mapping from BHSA transcription codes to a phonetic transcription. It is a series of triplets (bhsa symbol, name, phonetic symbol).
If changes are needed to the appearance of the phonetic transcriptions (not to its logic), here is the place to tweak.
Note that the order is important. In the final stage of the transformation process, these substitutions will be applied in the order they appear here.
This is especially important for, but not only for, the BGDKPT letters.
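Why the order matters can be shown with a three-row excerpt for the letter b. This is a simplified sketch: the helper `apply_ordered` is hypothetical, and it treats all three rows as plain string replacements, whereas the notebook handles the rows with a dagesh separately.

```python
def apply_ordered(text, table):
    """Apply (symbol, glyph) replacements one after another, in table order."""
    for (sym, glyph) in table:
        text = text.replace(sym, glyph)
    return text


# excerpt for the letter b: plosive placeholder, fricative, final plosive
subset = (("b.", "B"), ("b", "v"), ("B", "b"))

# intended order: "b." is protected as "B" before the bare "b" becomes "v"
good = apply_ordered("b.ab", subset)
# reversed order: the bare "b" rule fires first and destroys the "b." context
bad = apply_ordered("b.ab", tuple(reversed(subset)))

print(good, bad)
```

The intermediate capital letter keeps the plosive out of reach of the fricative rule, which is the reason the table lists each of these letters three times.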
specials = (
(">", "alef", "ʔ"),
("<", "ayin", "ʕ"),
("v", "tet", "ṭ"),
("y", "tsade", "ṣ"),
("x", "chet", "ḥ"),
("c", "shin", "š"),
("f", "sin", "ś"),
("#", "s(h)in", "ŝ"),
("ij", "long hireq", "î"),
("I", "short hireq", "i"),
(";j", "long tsere", "ê"),
("ow", "long holam", "ô"),
("w.", "long `qibbuts`", "û"),
("ej", "e glide", "eʸ"),
("j", "yod", "y"),
(":a", "hataf patach", "ᵃ"),
(":@", "hataf qamats", "ᵒ"),
(":e", "hataf segol", "ᵉ"),
("%", "schwa mobile", "ᵊ"),
(":", "schwa quiescens", ""),
("@", "qamats gadol", "ā"),
("a", "patach", "a"),
("`", "furtive patach", "ₐ"),
("+", "qamats", "å"),
("e", "segol", "e"),
(
";",
"tsere",
"ē",
),
("i", "hireq", "i"),
("o", "holam", "ō"),
("^", "qamats qatan", "o"),
("u", "qibbuts", "u"),
("b.", "b plosive", "B"),
("g.", "g plosive", "G"),
("d.", "d plosive", "D"),
("k.", "k plosive", "K"),
("p.", "p plosive", "P"),
("t.", "t plosive", "T"),
("b", "b fricative", "v"),
("g", "g fricative", "ḡ"),
("d", "d fricative", "ḏ"),
("k", "k fricative", "ḵ"),
("p", "p fricative", "f"),
("t", "t fricative", "ṯ"),
("B", "b plosive", "b"),
("G", "g plosive", "g"),
("D", "d plosive", "d"),
("K", "k plosive", "k"),
("P", "p plosive", "p"),
("T", "t plosive", "t"),
("w", "waw", "w"),
("l", "lamed", "l"),
("m", "mem", "m"),
("n", "nun", "n"),
("r", "resh", "r"),
("z", "zajin", "z"),
("!", "primary accent", "ˈ"),
("/", "secundary accent", "ˌ"),
("&", "maqef", "-"),
("*", "masora", "*"),
)
specials2 = (
("$", "sof pasuq", "."),
("|", "paseq", " "),
("§", "interword setumah and petuhah", " "),
)
We compile the table of symbols in handy dictionaries for ease of processing later.
We need to quickly detect the dagesh lenes later on, so we store them in a dictionary.
Our treatment of accents is still primitive.
We ignore some accents (irrelevant accents below) and we consider some accents as indicators of a mere secondary accent (secundary accents below).
The sound_dict is the resulting (ordered) mapping of all source characters to "phonetic" characters.
dagesh_lenes = {"b.", "g.", "d.", "k.", "p.", "t."}
dagesh_lene_dict = dict()
irrelevant_accents = (
("01", "segol"), # occurs always with another accent
("03", "pashta"), # by definition on last syllable: not relevant for accent
("04", "telisha qetana"),
("14", "telisha gedola"),
("24", "telisha qetana"),
("44", "telisha gedola"),
)
secundary_accents = (
("71", "merkha"), # ??
("63", "qadma"), # ??
("73", "tipeha"), # ??
)
punctuation_accents = (
("00", "sof pasuq"),
("05", "paseq"),
)
known_accents = {
x[0] for x in irrelevant_accents + secundary_accents + punctuation_accents
}
primary_accents = {
"{:>02}".format(i) for i in range(100) if "{:>02}".format(i) not in known_accents
}
sound_dict = collections.OrderedDict()
sound_dict2 = collections.OrderedDict()
for (sym, let, glyph) in specials:
    if sym in dagesh_lenes:
        dagesh_lene_dict[sym[0]] = glyph
    else:
        sound_dict[sym] = glyph
for (sym, let, glyph) in specials2:
    sound_dict2[sym] = glyph
The phono() function that we will define (far) below performs an ordered sequence of transformations.
Most of these are defined as regular expressions, and some parts of those expressions occur over and over again, e.g. subpatterns for vowel and consonant.
Here we define the shortcuts that we are going to use in the regular expressions.
Normally, when a pattern matches a string, the string is consumed: the parts of the pattern that match consume corresponding stretches of the string. However, in many cases a pattern specifies specific contexts in which a match should be found. In those cases we do not want that the context parts of the pattern are responsible for string consumption, because in those parts there could be another relevant match.
In regular expressions there is a solution for that: look-ahead and look-behind assertions, and we use them frequently.
(?<= before-pattern ) pattern (?= behind-pattern )
A match of this pattern in a string is a portion of the string that matches pattern, provided that portion is preceded by before-pattern and followed by behind-pattern.
If there is a match, and new matches must be searched for, the search will start right after pattern.
Instead of the above positive look-ahead and look-behind assertions, there are also negative variants:
(?<! before-pattern ) pattern (?! behind-pattern )
In those cases the match is good if the before-pattern does not match the preceding material, and analogously for the behind-pattern.
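A few lines of Python make the difference between consuming context and asserting context concrete. The strings are toy examples, not Hebrew transliterations.

```python
import re

text = "ababa"

# The context is consumed: after the first match, the search resumes at
# the second "b", so the second "aba" is missed.
consuming = re.findall(r"aba", text)

# The context is only asserted: both "b"s are found, because the "a"
# in the middle can serve as context for each of them.
contextual = re.findall(r"(?<=a)b(?=a)", text)

# A replacement touches only the matched part, never the context.
replaced = re.sub(r"(?<=a)b(?=a)", "B", text)

# Negative variants: a "b" that is neither preceded nor followed by "a".
negative = re.findall(r"(?<!a)b(?!a)", "abba cbc")

print(consuming, contextual, replaced, negative)
```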
In Python there is a restriction on look-behind patterns: they must be patterns that only have matches of a predictable, fixed length. That will make some of our patterns slightly more complicated. For example, vowels can be simple or complex, and hence have variable length. If we want to specify a consonant, provided it is preceded by a vowel, we have to be careful.
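The restriction can be checked directly. The sketch below uses the vowel1 and vowel subpatterns that this notebook defines further on: every alternative of vowel1 is exactly two characters wide, while vowel mixes one- and two-character alternatives. The helper `compiles_in_lookbehind` is hypothetical and exists only for this check.

```python
import re

# copied from the definitions further on in this notebook
vowel1 = r"(?:(?::[ea@])|(?:w\.)|(?:[i;]j)|(?:ow)|(?:.[%@\^;aeiIou`]))"
vowel = r"(?:(?::[ea@])|(?:w\.)|(?:[i;]j)|(?:ow)|(?:[%@\^;aeiIou`]))"


def compiles_in_lookbehind(subpattern):
    """Report whether Python accepts the subpattern inside a look-behind."""
    try:
        re.compile(r"(?<={})k".format(subpattern))
    except re.error:
        # variable-width look-behind patterns are rejected at compile time
        return False
    return True


fixed_ok = compiles_in_lookbehind(vowel1)
variable_ok = compiles_in_lookbehind(vowel)
print(fixed_ok, variable_ok)
```

The extra `.` in the last alternative of vowel1 is what pads a simple vowel to the same width as the complex ones.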
In regular expressions there are greedy, non-greedy and possessive quantifiers. Greedy ones try to match as many times as possible at first; non-greedy ones try to match as few times as possible at first. Possessive quantifiers are like greedy ones, but greedy ones will give back occurrences if that helps to achieve a match. Possessive ones do not do that.
kind | greedy | non-greedy | possessive |
---|---|---|---|
0 or more | * | *? | *+ |
1 or more | + | +? | ++ |
at least *n*, at most *m* | {*n*,*m*} | {*n*,*m*}? | {*n*,*m*}+ |
For example, the pattern [ab]*b matches substrings of a's and b's that end in a b.
In order to match the string aaaaab, the [ab]* part starts with greedily consuming the whole string, but after discovering that the b part in the pattern should also match something, the [ab]* part reluctantly gives back one occurrence. That will do the trick.
However, [ab]*+b will not match aaaaab, because the possessive quantifier gives nothing back.
Possessive quantifiers are desirable in combination with negative look-ahead assertions.
For example, take [ab]*+(?!c). This will match substrings of a's and b's that are not followed by c.
So it matches ababab but not abababc.
However, the non-possessive variant, [ab]*(?!c), matches both. So how does it match abababc?
First, the [ab]* part matches all a's and b's. Then the look-ahead assertion that c does not follow is violated. So [ab]* backtracks one occurrence, a b. At that point the look-ahead assertion finds a b, which is not c, and the match succeeds.
Before version 3.11, Python lacked possessive quantifiers in regular expressions, so again, this makes some expressions below more complicated than they would otherwise be.
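The backtracking behaviour described above can be observed in Python. Because possessive quantifiers are not available in every Python version, the sketch emulates one with a well-known idiom: a look-ahead captures the longest run, and a backreference then consumes exactly that run. This emulation is an illustration only; it is not a technique used in this notebook.

```python
import re

# greedy quantifier: may give back characters to satisfy the look-ahead
greedy = re.compile(r"[ab]*(?!c)")

# emulated possessive quantifier: the look-ahead captures the longest run
# of a's and b's, and \1 consumes that run without giving anything back
atomic = re.compile(r"(?=([ab]*))\1(?!c)")

# the greedy version backtracks one "b" and then succeeds
greedy_hit = greedy.match("abababc").group(0)

# the emulated possessive version refuses to match at the start
atomic_miss = atomic.match("abababc")

# without a following "c" both versions take the whole string
atomic_hit = atomic.match("ababab").group(0)

print(greedy_hit, atomic_miss, atomic_hit)
```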
We want to test for vowels in look-behind conditions. Python insists that look-behind conditions match patterns with fixed length. Vowels have variable length, so we need to take a bit more context. This extra context is dependent on whether the vowel occurs in front of a consonant or after it.
vowel1 is for before, vowel2 is for after; both are usable in look-behind conditions.
vowel matches purely vowels of variable length, and is not usable in look-behind conditions.
vowel1 = r"(?:(?::[ea@])|(?:w\.)|(?:[i;]j)|(?:ow)|(?:.[%@\^;aeiIou`]))"
vowel2 = r"(?:(?::[ea@])|(?:w\.)|(?:[i;]j)|(?:ow)|(?:[%@\^;aeiIou`].))"
vowel = r"(?:(?::[ea@])|(?:w\.)|(?:[i;]j)|(?:ow)|(?:[%@\^;aeiIou`]))"
# lvowel are long vowels only (including compositions)
# svowel are short vowels only, including composite schwas
lvowel1 = r"(?:(?:w\.)|(?:[i;]j)|(?:ow)|(?:.[@;o]))"
svowel = r"(?:(?::[ea@])|(?:[%@\^;aeiIou`]))"
gadol = sound_dict["@"]
qatan = sound_dict["^"]
a_like = {":a", "a"}
o_like = {":@", "o", "ow", "u", "w."}
e_like = {":", ":e", ";", ";j", "e", "i", "ij"}
# complex i/w vowel: the composite vowels with waw and yod, after translation
complex_i_vowel = "".join(sound_dict[s] for s in {"ij", ";j"})
complex_w_vowel = "".join(sound_dict[s] for s in {"ow"})
# consonants
ncons = "[^>bgdhwzxvjklmns<pyqrfct _&$-]" # not a consonant
cons = "[>bgdhwzxvjklmns<pyqrfct]" # any consonant
consx = "[bgdwzxvjklmns<pyqrfct]" # any consonant except alef
bgdkpt = "[bgdkpt]" # begadkefat consonant
nbgdkpt = "[wzxvjlmns<yqrfc]" # non-begadkefat consonant
prep = "[bkl]" # proclitic preposition
accents
acc = "[ˈˌ]" # primary and secundary accent
Here are the patterns, but also the replacement functions we are going to carry out when the patterns match.
How exactly the patterns and replacement functions hang together is a matter for the phono function itself.
The rafe indicates a fricative pronunciation. It cancels a dagesh lene on a BGDKPT letter.
If it occurs in other situations, we ignore it.
We have to reverse any CV pattern at word ends where the V is a patah, and the C is a guttural (i.e. cheth, ayin or he-mappiq).
If there is an accent on the guttural, we ignore it in these cases, because the guttural does not initiate a syllable.
rafe
rafe = re.compile(r"({b})\.,".format(b=bgdkpt))
def rafe_repl(match):
    return match.group(1)
# furtive patah
# note that we will deliberately lose any accent on the guttural
furtive_patah = re.compile(r"([x<]|(?:h\.))(?:[/!]?)a(?=\Z|[ &-])")
def furtive_patah_repl(match):
    return "`" + match.group(1)
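The two substitutions can be tried in isolation. The sketch below repeats the patterns and replacement functions from above and applies them to a few made-up lower-case strings in the intermediate notation; the inputs are illustrations, not words taken from the corpus.

```python
import re

# copied from the notebook
bgdkpt = "[bgdkpt]"
rafe = re.compile(r"({b})\.,".format(b=bgdkpt))
furtive_patah = re.compile(r"([x<]|(?:h\.))(?:[/!]?)a(?=\Z|[ &-])")


def rafe_repl(match):
    return match.group(1)


def furtive_patah_repl(match):
    return "`" + match.group(1)


# a rafe after a dagesh cancels the dagesh
no_dagesh = rafe.sub(rafe_repl, "b.,a")

# word-final guttural + patah: the vowel mark moves in front of the guttural
moved = furtive_patah.sub(furtive_patah_repl, "rw.xa")

# an accent on the guttural is dropped in the process
accent_dropped = furtive_patah.sub(furtive_patah_repl, "rw.x!a")

# not at the end of a word: nothing happens
untouched = furtive_patah.sub(furtive_patah_repl, "xam")

print(no_dagesh, moved, accent_dropped, untouched)
```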
explicit accents
# let's assume that any cantillation mark or accent indicates that the vowel is stressed
# except for some types of mark (qadma, pashta)
sep_accent = re.compile("([0-9]{2})")
remove_accent = re.compile("|".join("~{}".format(x[0]) for x in irrelevant_accents))
primary_accent = re.compile("|".join("~{}".format(x) for x in primary_accents))
secundary_accent = re.compile("|".join("~{}".format(x[0]) for x in secundary_accents))
punctuation_accent = re.compile(
"({})".format("|".join("~{}".format(x[0]) for x in punctuation_accents))
)
condense_accents = re.compile("({v})([!/]+)".format(v=vowel))
def sep_accent_repl(match):
    return "~" + match.group(1)
def condense_accents_repl(match):
    accent = "!" if "!" in match.group(2) else "/"
    return accent + match.group(1)
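Taken together, these patterns turn two-digit accent codes into a stress mark placed in front of the vowel. The following stand-alone sketch rebuilds the patterns from the accent codes listed earlier and runs the steps in the order used by doaccents. The wrapper `mark_accents` and the input strings are assumptions made for this demonstration.

```python
import re

vowel = r"(?:(?::[ea@])|(?:w\.)|(?:[i;]j)|(?:ow)|(?:[%@\^;aeiIou`]))"

# accent codes as listed in the notebook
irrelevant = ("01", "03", "04", "14", "24", "44")
secondary = ("71", "63", "73")
punct = ("00", "05")
known = set(irrelevant + secondary + punct)
primary = sorted(
    "{:>02}".format(i) for i in range(100) if "{:>02}".format(i) not in known
)

sep_accent = re.compile("([0-9]{2})")
remove_accent = re.compile("|".join("~" + x for x in irrelevant))
secundary_accent = re.compile("|".join("~" + x for x in secondary))
primary_accent = re.compile("|".join("~" + x for x in primary))
condense_accents = re.compile("({v})([!/]+)".format(v=vowel))


def mark_accents(word):
    """Hypothetical wrapper around the explicit-accent steps."""
    result = sep_accent.sub(lambda m: "~" + m.group(1), word)
    result = remove_accent.sub("", result)
    result = secundary_accent.sub("/", result)
    result = primary_accent.sub("!", result)
    # move the stress mark in front of the vowel it belongs to
    return condense_accents.sub(
        lambda m: ("!" if "!" in m.group(2) else "/") + m.group(1), result
    )


with_primary = mark_accents("d.@b@92r")
with_secondary = mark_accents("b.:r;>ci73jt")
with_irrelevant = mark_accents("h@03")
print(with_primary, with_secondary, with_irrelevant)
```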
# implicit accents
last_part = re.compile(r"([^&-]*)\Z")
default_accent1 = re.compile(r"({v}`?{c}?\.?(?:\Z|[ ]))".format(v=svowel, c=cons))
default_accent2 = re.compile(r"({v}(?:\Z|[ ]))".format(v=lvowel1))
strip_accents = re.compile(r"[0-9*]")
# wrong last accents
last_accent = re.compile(r"[/!]+(?=[ ]|\Z)")
def default_accent_repl(match):
    return "/" + match.group(1)
def punctuation_accent_repl(match):
    if match.group(1) == "~00":
        return " $"
    return " | "
Separate the phonetic representation from the interword material after it.
To be used at the end of phono().
specials2 specifies how punctuation (sof pasuq, paseq, interword setumah-petuhah) is translated.
phono_sep = re.compile("(.*?)([ {}]*)".format("".join(x[2] for x in specials2)))
multiple_space = re.compile(" +")
verse_end_phono = re.compile(r"(\. *)\Z")
def verse_end_phono_repl(match):
    return match.group(1).replace(" ", "")
stats = collections.Counter()
def doaccents(orig, debug=False, count=False):
    dout = []
    # prepare
    if debug:
        dout.append(("orig", orig))
    if count:
        pre = orig
    result = orig.lower().replace("_", " ")
    if debug:
        dout.append(("trim", result))
    if count and pre != result:
        stats["trim"] += 1
    # explicit accents
    if count:
        pre = result
    result = sep_accent.sub(sep_accent_repl, result)
    result = remove_accent.sub("", result)
    result = secundary_accent.sub("/", result)
    result = primary_accent.sub("!", result)
    result = condense_accents.sub(condense_accents_repl, result)
    if debug:
        dout.append(("accents", result))
    if count and pre != result:
        stats["accents"] += 1
    # punctuation
    if count:
        pre = result
    result = punctuation_accent.sub(punctuation_accent_repl, result)
    result = strip_accents.sub("", result)
    if debug:
        dout.append(("punctuation", result))
    if count and pre != result:
        stats["punctuation"] += 1
    # rafe
    if count:
        pre = result
    result = rafe.sub(rafe_repl, result)
    result = result.replace(",", "")
    if debug:
        dout.append(("rafe", result))
    if count and pre != result:
        stats["rafe"] += 1
    # furtive patah
    if count:
        pre = result
    result = furtive_patah.sub(furtive_patah_repl, result)
    if debug:
        dout.append(("furtive_patah", result))
    if count and pre != result:
        stats["furtive_patah"] += 1
    # implicit accents
    if count:
        pre = result
    hotpart = last_part.search(result).group(1)
    if "!" not in hotpart and "/" not in hotpart:
        result = default_accent1.sub(default_accent_repl, result)
        if "/" not in result:
            result = default_accent2.sub(default_accent_repl, result)
    result = last_accent.sub("", result)
    if debug:
        dout.append(("default accent", result))
    if count and pre != result:
        stats["default_accent"] += 1
    # deliver
    return (result, dout) if debug else result
qamets qatan
NB: all patterns stipulate that the qamets (@) in question is unaccented.
# near end of word:
qamets_qatan1 = re.compile(
r"(?<={c})(\.?)@(?={c}(?:\.?[/!]?(?:[ &-]|\Z)))".format(c=consx)
)
# before dagesh forte:
qamets_qatan2 = re.compile(r"(?<={c})(\.?)@(?={c}\.)".format(c=cons))
# if the following consonant is BGDKPT and does not have dagesh, the @ is in an open syllable:
qamets_qatan3 = re.compile(
r"(?<={c})(\.?)@(?={c}:(?:{nb}|(?:{b}\.)))".format(c=cons, b=bgdkpt, nb=nbgdkpt)
)
# assimilation of qamets with following composite schwa of type (chatef qamets),
# but if the qamets is under a preposition BCL, not if it is under the article H:
qamets_qatan4a = re.compile(r"(?<={p})(\.?[!/]?)@(?=-{c}:@)".format(p=prep, c=cons))
# or word-internal
qamets_qatan4b = re.compile(r"(?<={c})(\.?[!/]?)@(?={c}:@)".format(c=cons))
# before another qamets qatan, provided the syllable is unaccented
qamets_qatan5 = re.compile(r"(?<={c})(\.?)@(?={c}\.?[/!]?\^)".format(c=cons))
In a pronominal suffix, qamets never becomes qatan. This pattern will be applied only to words that do have a non-empty pronominal suffix. The pattern will spot the qamets qatan in front of the last consonant, if there is such a qatan.
qamets_qatan_prs = re.compile(r"\^(?=[0-9]*{c}\.?[/!]?(?:[ &-]|\Z))".format(c=cons))
def qamets_qatan_repl(match):
    return match.group(1) + "^"
# there are exceptions to the heuristic of interpreting qamets by voting between occurrences
qamets_qatan_x = """
BJT/ => 1A
JM/ => 1O
JWMM => 2A
JRB<M/ => 1A
JHWNTN/ => 2A
"""
xxx = """
<YBT/ => 2A
"""
# there are unaccented conjugated verb forms that must not be subjected to qamets-qatan transformation
qamets_qatan_verb_x = {
"verb qal perf 3sf",
"verb qal perf 3p-",
"verb nif impf 1s-",
"verb nif impf 1p-",
"verb nif impf 2sf",
"verb nif impf 2pm",
"verb nif impf 3pm",
"verb nif impv 2sf",
"verb nif impv 2pm",
}
qqv_experimental = {
"verb qal impf 3pm",
}
qamets_qatan_verb_x |= qqv_experimental
def qamets_qatan_verb_x_repl(match):
    return match.group(1) + "@"
for the use of applying individual corrections:
Here is the function that carries out rule based qamets qatan detection, without going into verb paradigms and exceptions. It is the first go at it.
def doplainqamets(word, accentless=False, debug=False, count=False):
    dout = []
    result = word
    if accentless:
        result = result.replace("!", "").replace("/", "")
    if count:
        pre = result
    result = qamets_qatan1.sub(qamets_qatan_repl, result)
    if debug:
        dout.append(("qamets_qatan1", result))
    if count and pre != result:
        stats["qamets_qatan1"] += 1
    if count:
        pre = result
    result = qamets_qatan2.sub(qamets_qatan_repl, result)
    if debug:
        dout.append(("qamets_qatan2", result))
    if count and pre != result:
        stats["qamets_qatan2"] += 1
    if count:
        pre = result
    result = qamets_qatan3.sub(qamets_qatan_repl, result)
    if debug:
        dout.append(("qamets_qatan3", result))
    if count and pre != result:
        stats["qamets_qatan3"] += 1
    if count:
        pre = result
    result = qamets_qatan4a.sub(qamets_qatan_repl, result)
    if debug:
        dout.append(("qamets_qatan4a", result))
    if count and pre != result:
        stats["qamets_qatan4a"] += 1
    if count:
        pre = result
    result = qamets_qatan4b.sub(qamets_qatan_repl, result)
    if debug:
        dout.append(("qamets_qatan4b", result))
    if count and pre != result:
        stats["qamets_qatan4b"] += 1
    return (result, dout) if debug else result
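As an illustration of how a single rule behaves, the sketch below isolates the first pattern (qamets near the end of a word) and applies it to two made-up strings in the intermediate notation. It shows only this one rule, not the full chain of rules, verb paradigms and exceptions.

```python
import re

# copied from the notebook
consx = "[bgdwzxvjklmns<pyqrfct]"  # any consonant except alef
qamets_qatan1 = re.compile(
    r"(?<={c})(\.?)@(?={c}(?:\.?[/!]?(?:[ &-]|\Z)))".format(c=consx)
)


def qamets_qatan_repl(match):
    return match.group(1) + "^"


# unaccented qamets in a closed final syllable: becomes qatan (^)
closed_unaccented = qamets_qatan1.sub(qamets_qatan_repl, "k.@l")

# here the final qamets carries an accent (!) and the first one is not
# in the final syllable, so the rule leaves both untouched
accented = qamets_qatan1.sub(qamets_qatan_repl, "d.@b!@r")

print(closed_unaccented, accented)
```

This also shows the rationale given earlier for the accents: the stress mark in front of a qamets keeps this rule from reading it as qatan.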
The rules for the schwa that I have found are contradictory.
These rules I have seen (e.g.)
But there are examples where rules 1 and 3 apply at the same time.
And the qal 3 sg f
forms end with a tav with schwa, often preceded by a consonant with also schwa.
In this case the tav has a dagesh, which by the rules for dagesh cannot be a lene. So it must be a forte.
So this violates rule 2.
We will cut this matter short, and make any final schwa quiescens.
As to rule 4, there are cases where the schwa in question is also followed by a final consonant with schwa. In those cases it seems that the schwa in question is silent.
# mobile schwa
mobile_schwa1 = re.compile(
r"""
( # here is what goes before the schwa in question
(?:(?:\A|[ &-]).\.?)| # an initial consonant or
(?:.\.)| # a consonant with dagesh (which must be forte then) or
(?::.\.?)| # another schwa and then a consonant
(?: # a long vowel such as the following
(?:
@>?| # qamets possibly with alef as mater lectionis (the remaining qametses are gadol)
;j?| # tsere, possibly followed by yod
ij| # hireq with yod
o[>w]?| # holam possibly followed by alef or waw
w\. # waw with dagesh
)
{c} # and then a consonant
)
)
:
(?![@ae]) # the schwa may not be composite
""".format(
c=cons
),
re.X,
)
mobile_schwa2 = re.compile(
r":(?={b}(?:[^.]|[ &-]|\Z))".format(b=bgdkpt)
) # before `BGDKPT` letter without dagesh
# second last consonant with schwa when last consonant also has schwa
mobile_schwa3 = re.compile(r"[%:](?={c}\.?{a}?[%:](?:[ &]|\Z))".format(a=acc, c=cons))
# all schwas at the end of the word are quiescens, only if the words are not glued together
mobile_schwa4 = re.compile(r"[%:](?=[ &]|\Z)")
def mobile_schwa1_repl(match):
return match.group(1) + "%"
# dagesh
dages_forte_lene = re.compile(
r"(?<={v1})(-*)({b})\.(?=[/!]?{v2})".format(v1=vowel1, v2=vowel, b=bgdkpt)
)
dages_forte = re.compile(
r"(?<={v1})(-?[h>]*-*)([^h])\.(?=[/!]?{v2})".format(v1=vowel1, v2=vowel)
)
dages_lene = re.compile(r"({b})\.".format(b=bgdkpt))
def dages_forte_lene_repl(match):
return match.group(1) + (dagesh_lene_dict[match.group(2)] * 2)
def dages_lene_repl(match):
return dagesh_lene_dict[match.group(1)]
def dages_forte_repl(match):
return match.group(1) + match.group(2) * 2
# silent aleph
silent_aleph = re.compile("(?<=[^ &-])>(?!(?:[/!]|{v}))".format(v=vowel))
# final mater lectionis
# I assume that heh and alef are only matres lectionis after a LONG vowel
last_ml = re.compile(r"(?<={v1})[>h]+(?=[ &-]|\Z)".format(v1=lvowel1))
last_ml_jw = re.compile(r"jw(?=[ &-]|\Z)")
# mappiq heh
mappiq_heh = re.compile(r"h\.")
fixit_i = re.compile(r"([{v}])\.".format(v=complex_i_vowel))
fixit_w = re.compile(r"([{v}])\.".format(v=complex_w_vowel))
fixit = re.compile(r"(.)\.")
split_sep = re.compile(
"^(.*?)([ .&$\n-]*)$"
) # to split the result in the phono part and the interword part
def fixit_repl(match):
return match.group(1) * 2
def fixit_i_repl(match):
return match.group(1) + "j"
def fixit_w_repl(match):
return match.group(1) + "w"
END OF REGULAR EXPRESSIONS AND REPLACEMENT FUNCTIONS
For some words we need specific corrections. The rules for qamets qatan are not specific enough.
We define a function apply_corr(wordq, corr) that can apply a correction instruction to wordq, which is a word in pre-transliterated form, i.e. a word that has undergone the transliteration steps up to and including qamets interpretation, including the special verb cases.
The corr is a comma-separated list of basic instructions, which have the form number letter. It will interpret the number-th qamets as a gadol or qatan, depending on whether letter = ā or o.
Later on we compile a dictionary qamets_corrections of pre-computed corrections.
This dictionary is keyed by the pre-transliterated form, and valued by the corresponding correction string. Here we initialize this dictionary.
The phono() function that carries out the complete transliteration looks by default in qamets_corrections, but this can be overridden. These corrections will not be carried out for the special verb cases.
qamets_corrections = {}  # dict of translits that must be corrected


# apply correction instructions to a word
def apply_corr(wordq, corr):
    if corr == "":
        return wordq
    corrs = corr.split(",")
    # positions of the qametses; an "@" right after ":" is part of a composite schwa
    indices = []
    for (i, ch) in enumerate(wordq):
        if ch == "^" or (ch == "@" and (i == 0 or wordq[i - 1] != ":")):
            indices.append(i)
    resultlist = list(wordq)
    for c in corrs:
        (pos, kind) = c
        pos = int(pos) - 1
        repl = "^" if kind == "o" else "@"
        if pos >= len(indices):
            TF.error("{}: pos={} out of range {}".format(wordq, pos, indices))
            continue
        rpos = indices[pos]
        resultlist[rpos] = repl
    return "".join(resultlist)
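To see what a correction string does to a word, here is a self-contained sketch of the same idea. It assumes, as in this notebook, that `@` stands for qamets gadol, `^` for qamets qatan, and that an `@` directly after `:` belongs to a composite schwa. The function name `apply_corrections` and the sample words are made up for illustration.

```python
def apply_corrections(wordq, corr):
    """Apply instructions such as "2o" or "1ā,2o" to the qametses in wordq."""
    if corr == "":
        return wordq
    # positions of the qametses that can be corrected, in reading order
    positions = [
        i
        for (i, ch) in enumerate(wordq)
        if ch == "^" or (ch == "@" and (i == 0 or wordq[i - 1] != ":"))
    ]
    chars = list(wordq)
    for instruction in corr.split(","):
        number, letter = instruction[:-1], instruction[-1]
        index = int(number) - 1
        if index >= len(positions):
            continue  # instruction points past the last qamets: skip it
        chars[positions[index]] = "^" if letter == "o" else "@"
    return "".join(chars)


print(apply_corrections("q@d@c", "1o"))
```

The first qamets becomes a qatan and the second one is left alone. An instruction whose number exceeds the number of qametses is silently skipped in this sketch, whereas the notebook reports an error.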
We need concise, normalized values for the lexical features.
undefs = {"NA", "unknown", "n/a", "absent"}
png = dict(
NA="-",
unknown="-",
p1="1",
p2="2",
p3="3",
sg="s",
du="d",
pl="p",
m="m",
f="f",
a="a",
c="c",
e="e",
)
png["n/a"] = "-"
We need a label for lexical information such as part of speech, person, number, gender.
declensed = {"subs", "nmpr", "adjv", "prps", "prde", "prin"}
def get_lex_info(w):
    sp = F.sp.v(w)
    lex_infos = [sp]
    if sp == "verb":
        lex_infos.extend(
            [
                F.vs.v(w),
                F.vt.v(w),
                "{}{}{}".format(png[F.ps.v(w)], png[F.nu.v(w)], png[F.gn.v(w)]),
            ]
        )
    elif sp in declensed:
        lex_infos.append("{}{}".format(png[F.nu.v(w)], png[F.gn.v(w)]))
    lex_info = " ".join(lex_infos)
    if sp == "verb" or sp in declensed:
        prs = F.g_prs.v(w)
        if prs not in undefs:
            lex_info += ",{}".format(prs.lower())
    return lex_info


def get_decl(lex_info):
    if lex_info is None:
        lex_info = ""
    parts = lex_info.split(",")
    return lex_info if len(parts) == 1 else parts[0]


def get_prs(lex_info):
    if lex_info is None:
        lex_info = ""
    parts = lex_info.split(",")
    return "" if len(parts) == 1 else parts[1]
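The label built above has two parts separated by a comma: the declension information and, optionally, the pronominal suffix. A compact sketch of taking such a label apart again, using `str.partition`, might look as follows. The name `split_lex_info` and the sample labels are assumptions for illustration only.

```python
def split_lex_info(lex_info):
    """Split a label into (declension part, pronominal suffix part)."""
    if lex_info is None:
        return ("", "")
    # partition splits at the first comma only
    decl, _, prs = lex_info.partition(",")
    return (decl, prs)


print(split_lex_info("verb qal perf 3sm,ow"))
print(split_lex_info("subs pm"))
```

For labels with at most one comma this gives the same two values as get_decl and get_prs; with more commas the suffix part would keep everything after the first comma.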
interesting_stats = [
"total",
"qamets_verb_suppress_qatan",
"qamets_prs_suppress_qatan",
"qamets_qatan_corrections",
]
if suppress_in_verb, phono will suppress qatan interpretation in certain verb paradigmatic forms
if suppress_in_prs, phono will suppress qatan interpretation in pronominal suffixes
if correct is 1, phono will apply individual corrections
if correct is 0, phono will not apply individual corrections
if correct is -1, phono will stop just before applying the qamets qatan corrections and return the intermediate result
def phono_qamets(
ws,
result,
lex_info,
debug,
count,
dout,
suppress_in_verb,
suppress_in_prs,
correct,
corrections,
):
# qamets qatan
# check whether we are in a verb paradigm that requires suppressing qamets => qatan
if count:
pre = result
suppr = True
decl = get_decl(lex_info)
if suppress_in_verb:
suppr = False
if decl == "":
if debug:
dout.append(("qamets qatan", "no special verb form invoked"))
elif decl not in qamets_qatan_verb_x:
if debug:
dout.append(("qamets qatan", "no special verb form: {}".format(decl)))
elif "@" not in result:
if debug:
dout.append(("qamets qatan", "special verb form: no qamets present"))
elif "!" in result:
if debug:
dout.append(
("qamets qatan", "special verb form: primary accent present")
)
suppr = True
else:
suppr = True
if count:
stats["qamets_verb_suppress_qatan"] += 1
else:
if debug:
dout.append(("qamets qatan", "suppression for verb forms is switched off"))
suppr = False
if suppr:
if debug:
dout.append(
(
"qamets qatan",
"special verb form: qatan suppressed for {}".format(decl),
)
)
else:
if debug:
(result, this_dout) = doplainqamets(result, debug=True, count=count)
dout.extend(this_dout)
else:
result = doplainqamets(result, count=count)
# check whether we have a pronominal suffix that requires suppressing qamets => qatan
if count:
pre = result
suppr = True
prs = get_prs(lex_info)
if suppress_in_prs:
suppr = False
if prs == "":
if debug:
dout.append(("qamets qatan", "no pron suffix indicated"))
elif "@" not in prs:
if debug:
dout.append(("qamets qatan", "pronominal suffix: no qamets present"))
elif not qamets_qatan_prs.search(result):
if debug:
dout.append(
(
"qamets qatan",
"pron suffix {}: no qamets qatan present".format(prs),
)
)
else:
suppr = True
if count:
stats["qamets_prs_suppress_qatan"] += 1
else:
if debug:
dout.append(("qamets qatan", "suppression for pron suffix is switched off"))
suppr = False
if suppr:
result = qamets_qatan_prs.sub("@", result)
if debug:
dout.append(
("qamets qatan", "pron suffix {}: qatan suppressed".format(prs))
)
dout.append(("qamets qatan prs", result))
# now change gadol into qatan in front of other qatan
if count:
pre = result
result = qamets_qatan5.sub(qamets_qatan_repl, result)
if debug:
dout.append(("qamets_qatan5", result))
if count and pre != result:
stats["qamets_qatan5"] += 1
# handle desired corrections
if count:
pre = result
if correct == -1:
return (result, True)
if correct == 1 and decl not in qamets_qatan_verb_x:
if corrections is None:
corrections = qamets_corrections
parts = result.split("-")
hotpart = parts[-1]
wordq = phono(ws[-1], correct=-1, punct=False)
if wordq in corrections:
hotpartn = apply_corr(hotpart, corrections[wordq])
if debug:
dout.append(
("qamets qatan", "correction: {} => {}".format(hotpart, hotpartn))
)
parts[-1] = hotpartn
result = "-".join(parts)
if debug:
dout.append(("qamets_qatan_corr", result))
if count and pre != result:
stats["qamets_qatan_corrections"] += 1
return (result, False)
def phono_patterns(result, debug, count, dout):
# mobile schwa
if count:
pre = result
result = mobile_schwa1.sub(mobile_schwa1_repl, result)
if debug:
dout.append(("mobile_schwa1", result))
if count and pre != result:
stats["mobile_schwa1"] += 1
if count:
pre = result
result = mobile_schwa2.sub("%", result)
if debug:
dout.append(("mobile_schwa2", result))
if count and pre != result:
stats["mobile_schwa2"] += 1
if count:
pre = result
result = mobile_schwa3.sub("", result)
if debug:
dout.append(("mobile_schwa3", result))
if count and pre != result:
stats["mobile_schwa3"] += 1
if count:
pre = result
result = mobile_schwa4.sub("", result)
if debug:
dout.append(("mobile_schwa4", result))
if count and pre != result:
stats["mobile_schwa4"] += 1
# dagesh
if count:
pre = result
result = dages_forte_lene.sub(dages_forte_lene_repl, result)
if debug:
dout.append(("dagesh_forte_lene", result))
if count and pre != result:
stats["dagesh_forte_lene"] += 1
if count:
pre = result
result = result.replace("ij.", "Ijj")
result = dages_forte.sub(dages_forte_repl, result)
if debug:
dout.append(("dagesh_forte", result))
if count and pre != result:
stats["dagesh_forte"] += 1
if count:
pre = result
result = dages_lene.sub(dages_lene_repl, result)
if debug:
dout.append(("dagesh_lene", result))
if count and pre != result:
stats["dagesh_lene"] += 1
# silent aleph (but not in tetra)
if count:
pre = result
if "[" not in result:
result = silent_aleph.sub("", result)
if debug:
dout.append(("silent_aleph", result))
if count and pre != result:
stats["silent_aleph"] += 1
# final mater lectionis (but not in tetra)
if count:
pre = result
if "[" not in result:
result = last_ml_jw.sub("ʸw", result)
result = last_ml.sub("", result)
if debug:
dout.append(("last_ml", result))
if count and pre != result:
stats["last_ml"] += 1
# mappiq heh
if count:
pre = result
result = mappiq_heh.sub("h", result)
if debug:
dout.append(("mappiq_heh", result))
if count and pre != result:
stats["mappiq_heh"] += 1
return result
def phono_symbols(ws, result, debug, count, dout):
# split the result in parts corresponding with the word nodes of the original
resultparts = result.split("-")
results = []
for (i, w) in enumerate(ws):
resultp = resultparts[i]
result = resultp
# masora
if F.qere.v(w) is not None:
result = "*" + result
for (sym, repl) in sound_dict.items():
result = result.replace(sym, repl)
if debug:
dout.append(("symbols", result))
# fix left over dagesh and mappiq
if count:
pre = result
result = fixit_i.sub(fixit_i_repl, result)
if debug:
dout.append(("fixit_i", result))
if count and pre != result:
stats["fixit_i"] += 1
if count:
pre = result
result = fixit_w.sub(fixit_w_repl, result)
if debug:
dout.append(("fixit_w", result))
if count and pre != result:
stats["fixit_w"] += 1
if count:
pre = result
result = fixit.sub(fixit_repl, result)
if count and pre != result:
stats["fixit"] += 1
if debug:
dout.append(("fixit", result))
if count:
pre = result
for (sym, repl) in sound_dict2.items():
result = result.replace(sym, repl)
if debug:
dout.append(("punct", result))
if count and pre != result:
stats["punct"] += 1
# zero width word boundary
if count:
pre = result
result = multiple_space.sub(" ", result)
result = result.replace("[ ", "[").replace(" ]", "]") # tetra
if debug:
dout.append(("cleanup", result))
if count and pre != result:
stats["cleanup"] += 1
results.append(result)
return results
Here the rule fabrics are woven together, exceptions invoked.
def phono(
ws,
suppress_in_verb=True,
suppress_in_prs=True,
correct=1,
corrections=None,
inparts=False,
debug=False,
count=False,
punct=True,
):
if type(ws) is int:
ws = [ws]
if count:
stats["total"] += 1
dout = []
# collect information
orig = "".join(get_orig(w, punct=True) for w in ws)
lex_info = get_lex_info(ws[-1])
# strip punctuation at the end, if needed
if not punct:
orig = punctuation.sub("", orig)
# account for ketiv-qere if in debug mode
if debug:
for w in ws:
if F.qere.v(w) is not None:
dout.append(
(
"ketiv-qere",
"{} => {}".format(
F.g_word.v(w), F.qere.v(w) + F.qere_trailer.v(w)
),
)
)
# accents
if debug:
(result, dout) = doaccents(orig, debug=True, count=count)
else:
result = doaccents(orig, count=count)
# qamets
(result, deliver) = phono_qamets(
ws,
result,
lex_info,
debug,
count,
dout,
suppress_in_verb,
suppress_in_prs,
correct,
corrections,
)
if deliver:
return (result, dout) if debug else result
# patterns
result = phono_patterns(result, debug, count, dout)
# symbols
results = phono_symbols(ws, result, debug, count, dout)
result = "".join(results) if not inparts else results
# deliver
return (result, dout) if debug else result
We have to do more work for the qamets. Sometimes a word form on its own is not enough to determine whether a qamets is gadol or qatan. In those cases, we analyse all occurrences of the same lexeme, and for each syllable position we measure whether an A-like vowel or an O-like vowel tends to occur in that syllable.
In order to do that, we need to compute a vowel skeleton for each word.
A word may have extra syllables, due to inflections, such as plurals, feminine forms, or suffixes. Let us call this the paradigmatic material of a word.
Now, we strip from the initial vowel skeleton a number of trailing vowels that corresponds to the number of consonants found in the paradigmatic material. This is rather crude, but it will do.
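The stripping step can be pictured with a few lines of code. This is only a sketch: the name `trim_skeleton` and the sample skeletons are made up, and unlike a plain slice it clamps the result when there are more paradigmatic consonants than skeleton positions.

```python
def trim_skeleton(skeleton, n_paradigmatic):
    """Drop as many trailing positions as there are paradigmatic consonants."""
    keep = max(0, len(skeleton) - n_paradigmatic)
    return skeleton[0:keep]


print(trim_skeleton("AEO", 1))
print(trim_skeleton("AEO", 0))
```

A word without paradigmatic material keeps its full skeleton; a plural noun, for example, would lose its last position.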
# we need the number of letters in a defined value of a morpho feature
def len_suffix(v):
    if v is None:
        return 0
    if v in undefs:
        return 0
    return len(v.replace("=", "").replace("W", "").replace("J", ""))


# we need a function that returns 1 for plural/dual subs/adj and for fem adj
def len_ending(sp, n, g):
    if sp == "subs":
        return 1 if n in {"pl", "du"} else 0
    if sp == "adjv":
        return 1 if n in {"pl", "du"} or g == "f" else 0
    return 0


# return the number of consonants in the suffixes
def len_morpho(w):
    return max(
        (
            len_suffix(F.prs.v(w)) + len_suffix(F.uvf.v(w)),
            len_ending(F.sp.v(w), F.nu.v(w), F.gn.v(w)),
        )
    )
Next, we reduce the vowel skeleton to a skeleton pattern. We are not interested in all vowels, only in whether the vowel is a qamets (gadol or qatan), A-like, O-like, or other (which we dub E-like).
# the qamets gadol/qatan skeleton
qamets_qatan_skel = re.compile("([^@^])")
# the vowel skeleton where the qamets gadol/qatan are preserved as @ and ^
# another o-like vowel becomes O (holam, qamets chatuf) (no waws nor yods)
# another a-like vowel becomes A (patah, patah chatuf) (no alefs)
silent_alef_start = re.compile(r"([ &-]|\A)>([!/]?(?:[^!/.:;@^aeiou]|\Z))")
def silent_alef_start_repl(match):
return match.group(1) + "E" + match.group(2)
qamets_qatan_fullskel = re.compile(
r"""
(
E # replacement of silent initial alef without vowels
| (?::[@ae]?) # a (composite) schwa
| (?:[;i]j) | (?:ow) | (?:w.) # a composite vowel
| [@a;eiou^] # a vowel point
| . # anything else
)
""",
re.X,
)
def qamets_qatan_fullskel_repl(match):
found = match.group(1)
if found == "E":
return "E"
if found == "@":
return gadol
if found == "^":
return qatan
if found in a_like:
return "A"
if found in o_like:
return "O"
if found in e_like:
return "E"
return ""
def get_full_skel(w, debug=False):
    wordq = phono(w, correct=-1, punct=False)
    wordqr = silent_alef_start.sub(silent_alef_start_repl, wordq)
    fullskel = qamets_qatan_fullskel.sub(qamets_qatan_fullskel_repl, wordqr)
    ending_length = len_morpho(w)
    relevant_part = len(fullskel) - ending_length
    if debug:
        # look up the original here instead of relying on a global variable
        orig = get_orig(w, punct=False)
        TF.info(
            "{}: {} => {} => {} : {} minus {} = {}".format(
                w,
                orig,
                wordq,
                wordqr,
                fullskel,
                ending_length,
                fullskel[0:relevant_part],
            )
        )
    return fullskel[0:relevant_part]
A lot of work is needed to get the qamets gadol-qatan right. This involves looking at accents, verb paradigms and special cases among the non-verbs.
# find lexemes which have an occurrence with a qamets (except verbs)
utils.caption(0, "\tLooking for non-verb qamets")
qq_words = set()
qq_lex = collections.defaultdict(lambda: [])
| 11s Looking for non-verb qamets
for w in F.otype.s("word"):
ln = F.languageISO.v(w)
if ln != "hbo":
continue
sp = F.sp.v(w)
if sp == "verb":
continue
orig = get_orig(w, punct=False, tetra=False)
if "@" not in orig:
continue # no qamets in word
word = doaccents(orig)
lex = F.lex.v(w)
if word in qq_words:
continue
qq_words.add(word)
qq_lex[lex].append(w)
utils.caption(
0, "\t{} lexemes and {} unique occurrences".format(len(qq_lex), len(qq_words))
)
| 13s 4056 lexemes and 13451 unique occurrences
utils.caption(0, "\tFiltering lexemes with varied occurrences")
qq_varied = collections.defaultdict(lambda: [])
nocc = 0
for lex in qq_lex:
ws = qq_lex[lex]
if len(ws) == 1:
continue
occs = []
skel_set = set()
has_qatan = False
has_gadol = False
for w in ws:
wordq = phono(w, correct=-1, punct=False)
skel = (
qamets_qatan_skel.sub("", wordq.replace(":@", ""))
.replace("@", gadol)
.replace("^", qatan)
)
if gadol in skel:
has_gadol = True
if qatan in skel:
has_qatan = True
skel_set.add(skel)
occs.append((skel, w))
if len(skel_set) > 1 and has_qatan and has_gadol:
for (skel, w) in occs:
fullskel = get_full_skel(w)
qq_varied[lex].append((skel, fullskel, w))
nocc += 1
utils.caption(
0,
"\t{} interesting lexemes with {} unique occurrences".format(len(qq_varied), nocc),
)
| 13s Filtering lexemes with varied occurrences
| 13s 161 interesting lexemes with 1704 unique occurrences
qamets_qatan_xc = dict(
(x[0], x[1]) for x in (y.split(" => ") for y in qamets_qatan_x.strip().split("\n"))
)
qamets_qatan_xcompiled = collections.defaultdict(lambda: {})
for (lex, corrstr) in qamets_qatan_xc.items():
corrs = corrstr.split(",")
for corr in corrs:
(pos, ins) = corr
pos = int(pos) - 1
qamets_qatan_xcompiled[lex][pos] = ins
def compile_occs(lex, occs):
vowel_counts = collections.defaultdict(lambda: collections.Counter())
for (skel, fullskel, w) in occs:
for (i, c) in enumerate(fullskel):
vowel_counts[i][c] += 1
occs_compiled = {}
for i in sorted(vowel_counts):
vowel_count = vowel_counts[i]
a_ish = vowel_count.get(gadol, 0) + vowel_count.get("A", 0)
o_ish = vowel_count.get(qatan, 0) + vowel_count.get("O", 0)
if a_ish != o_ish:
occs_compiled[i] = gadol if a_ish > o_ish else qatan
if lex in qamets_qatan_xcompiled:
override = qamets_qatan_xcompiled[lex]
for i in override:
ins = override[i]
old_ins = occs_compiled.get(i, "")
new_ins = gadol if ins == "A" else qatan
if old_ins == new_ins:
TF.info(
"\t{}: No override needed for syllable {} which is {}".format(
lex,
i + 1,
old_ins,
),
tm=False,
)
else:
TF.info(
"\t{}: Override for syllable {}: {} becomes {}".format(
lex,
i + 1,
old_ins,
new_ins,
),
tm=False,
)
occs_compiled[i] = new_ins
return occs_compiled
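The voting step in compile_occs can be studied on its own. The sketch below counts, per skeleton position, how often an A-like or an O-like mark occurs and keeps the majority; ties are left undecided. It assumes that gadol is written as `ā` and qatan as `o`, as the log output of this notebook suggests; the function name `vote_per_position` and the sample skeletons are made up.

```python
from collections import Counter, defaultdict


def vote_per_position(skeletons, a_marks=("ā", "A"), o_marks=("o", "O")):
    """Return {position: "ā" or "o"} for positions with a clear majority."""
    counts = defaultdict(Counter)
    for skel in skeletons:
        for (i, c) in enumerate(skel):
            counts[i][c] += 1
    decided = {}
    for i in sorted(counts):
        a_ish = sum(counts[i][m] for m in a_marks)
        o_ish = sum(counts[i][m] for m in o_marks)
        if a_ish != o_ish:
            decided[i] = "ā" if a_ish > o_ish else "o"
    return decided


print(vote_per_position(["āE", "AE", "oE"]))
```

In the sample, position 0 has two A-like marks against one O-like mark, so it is decided as gadol; position 1 has neither and stays out of the result.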
def guess_qq(occ, occs_compiled, debug=False):
(skel, fullskel, w) = occ
guess = ""
for (i, c) in enumerate(fullskel):
guess += occs_compiled.get(i, c) if c == gadol or c == qatan else c
if debug:
TF.info("{}".format(w), tm=False)
return guess
def get_corr(fullskel, guess, debug=False):
n = 0
corr = []
for (i, fc) in enumerate(fullskel):
if fc != qatan and fc != gadol:
continue
n += 1
gc = guess[i]
if fc == gc:
continue
corr.append("{}{}".format(n, gc))
if debug:
TF.info("{} guess {} corr {}".format(fullskel, guess, corr), tm=False)
return ",".join(corr)
utils.caption(0, "\tGuessing between gadol and qatan")
qamets_corrections = {}
qq_varied_remaining = set()
ndiff_occs = 0
ndiff_lexs = 0
nconflicts = 0
for lex in qq_varied:
debug = False
occs = qq_varied[lex]
occs_compiled = compile_occs(lex, occs)
this_ndiff_occs = 0
for occ in occs:
(skel, fullskel, w) = occ
guess = guess_qq(occ, occs_compiled, debug=debug)
corr = get_corr(fullskel, guess, debug=debug)
if corr:
this_ndiff_occs += 1
wordq = phono(w, correct=-1, punct=False)
if wordq in qamets_corrections:
old_corr = qamets_corrections[wordq]
if old_corr != corr:
TF.error(
"\t\tConflicting corrections for {} {} {} ({} => {}): first {} and then {}".format(
lex,
wordq,
skel,
fullskel,
guess,
old_corr,
corr,
)
)
nconflicts += 1
qamets_corrections[wordq] = corr
if this_ndiff_occs:
ndiff_lexs += 1
ndiff_occs += this_ndiff_occs
qq_varied_remaining.add(lex)
utils.caption(
0, "\t{} lexemes with modified occurrences ({})".format(ndiff_lexs, ndiff_occs)
)
utils.caption(0, "\t{} patterns with conflicts".format(nconflicts))
| 13s Guessing between gadol and qatan
JM/: Override for syllable 1: ā becomes o
BJT/: Override for syllable 1: o becomes ā
JWMM: Override for syllable 2: becomes ā
JHWNTN/: Override for syllable 2: becomes ā
JRB<M/: No override needed for syllable 1 which is ā
| 13s 107 lexemes with modified occurrences (224)
| 13s 0 patterns with conflicts
def stats_prog():
return " ".join(str(stats.get(stat, 0)) for stat in interesting_stats)
utils.caption(4, "Generating data in two ways ... ")
.............................................................................................. . 13s Generating data in two ways ... . ..............................................................................................
phono_file = []
word_file = []
stats = collections.Counter()
nv = 0
nchunk = 1000
nvc = 0
for v in F.otype.s("verse"):
nv += 1
nvc += 1
if nvc == nchunk:
utils.caption(0, "\t{:>5} verses {}".format(nv, stats_prog()))
nvc = 0
words = partition_w(L.d(v, "word"))
phonos = []
for ws in words:
lws = len(ws)
phono_w = phono(ws, inparts=True, count=True)
phono_file.append("".join(phono_w))
for (i, w) in enumerate(ws):
(real_phono, sep) = phono_sep.fullmatch(phono_w[i]).groups()
word_file.append((w, real_phono, sep))
if not phono_file[-1].endswith(". "):
word_file.append((None, "", "+"))
| 14s 1000 verses 13316 62 0 21
| 16s 2000 verses 27407 123 2 79
| 17s 3000 verses 40963 174 5 125
| 18s 4000 verses 54143 242 8 143
| 19s 5000 verses 67151 308 13 171
| 20s 6000 verses 82448 394 15 196
| 22s 7000 verses 97551 457 17 254
| 23s 8000 verses 113748 529 18 287
| 24s 9000 verses 129602 573 20 327
| 26s 10000 verses 146217 624 20 438
| 27s 11000 verses 159809 749 20 487
| 28s 12000 verses 174192 891 24 524
| 30s 13000 verses 190555 1018 28 576
| 31s 14000 verses 205104 1168 32 622
| 32s 15000 verses 218610 1290 33 728
| 33s 16000 verses 227944 1336 39 777
| 34s 17000 verses 235635 1379 48 827
| 35s 18000 verses 243258 1396 51 866
| 35s 19000 verses 250709 1429 59 906
| 36s 20000 verses 260118 1470 60 960
| 38s 21000 verses 275083 1533 63 979
| 39s 22000 verses 286442 1590 65 1007
| 40s 23000 verses 301302 1645 66 1075
utils.caption(0, "\t{:>5} verses done {}".format(nv, stats_prog()))
for stat in sorted(stats):
amount = stats[stat]
utils.caption(
0,
"\t{:<1} {:>6} {}".format(
"#" if amount == 0 else "",
amount,
stat,
),
)
| 40s 23213 verses done 304800 1650 66 1081
| 40s 270191 accents
| 40s 9006 cleanup
| 40s 45235 dagesh_forte
| 40s 21511 dagesh_forte_lene
| 40s 59612 dagesh_lene
| 40s 16322 default_accent
| 40s 968 fixit
| 40s 2658 furtive_patah
| 40s 28195 last_ml
| 40s 2201 mappiq_heh
| 40s 93898 mobile_schwa1
| 40s 2255 mobile_schwa2
| 40s 179 mobile_schwa3
| 40s 7702 mobile_schwa4
| 40s 25498 punct
| 40s 25498 punctuation
| 40s 66 qamets_prs_suppress_qatan
| 40s 5257 qamets_qatan1
| 40s 243 qamets_qatan2
| 40s 1791 qamets_qatan3
| 40s 28 qamets_qatan4a
| 40s 256 qamets_qatan4b
| 40s 209 qamets_qatan5
| 40s 1081 qamets_qatan_corrections
| 40s 1650 qamets_verb_suppress_qatan
| 40s 12 rafe
| 40s 21098 silent_aleph
| 40s 304800 total
| 40s 304796 trim
We take the just generated phono and wordph files.
From the phono file we strip the passage indicators, and from the wordph we strip the node numbers.
They should be consistent.
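The idea of the check is that gluing the per-word material and separators together must give exactly the running text. A minimal sketch, with made-up data and a made-up function name `parts_match_text`, where a separator `+` acts as a marker that contributes no text:

```python
def parts_match_text(text_chunks, word_parts):
    """Compare the running text with the text rebuilt from (material, sep) pairs."""
    running = "".join(text_chunks)
    rebuilt = "".join(
        material + ("" if sep == "+" else sep) for (material, sep) in word_parts
    )
    return running == rebuilt


chunks = ["ab ", "cd-", "ef. "]
parts = [("ab", " "), ("cd", "-"), ("ef", ". ")]
print(parts_match_text(chunks, parts))
```

Any word whose material or separator was changed on one side only makes the comparison fail, which is what makes this a useful sanity check before the features are written.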
utils.caption(0, "{} items in phono".format(len(phono_file)))
word_test = []
TF.info("Reading word")
i = 0
for (w, mat, sep) in word_file:
rsep = "" if sep == "+" else sep
word_test.append(mat + rsep)
if ". " in sep or "+" in sep:
i += 1
utils.caption(0, "\t{} lines".format(i))
| 40s 304800 items in phono
40s Reading word
| 41s 23213 lines
phono_text = "".join(phono_file)
word_text = "".join(word_test)
if phono_text != word_text:
utils.caption(0, "\tERROR: phono text and word info are NOT consistent")
else:
utils.caption(0, "\tOK: phono text and word info are CONSISTENT")
| 41s OK: phono text and word info are CONSISTENT
We generate the features phono and phono_trailer.
They are defined for words.
We also generate a config feature otext@phono, which will be picked up by Text-Fabric automatically.
In it we define the phonetic format, so that Text-Fabric can output text in phonetic representation.
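A text format is a template in which feature names are placeholders that are filled in per word. The sketch below illustrates that idea with plain Python string formatting. It is not how Text-Fabric implements formats; the function name `render_text`, the node numbers and the feature values are assumptions made for the example.

```python
def render_text(nodes, features, template="{phono}{phono_trailer}"):
    """Fill the template for each node and concatenate the results."""
    pieces = []
    for node in nodes:
        # a node without a value for a feature contributes an empty string
        values = {name: data.get(node, "") for (name, data) in features.items()}
        pieces.append(template.format(**values))
    return "".join(pieces)


# made-up sample data, for illustration only
features = {
    "phono": {1: "ab", 2: "cd"},
    "phono_trailer": {1: "-", 2: " "},
}
print(render_text([1, 2], features))
```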
genericMetaPath = f"{thisRepo}/yaml/generic.yaml"
phonoMetaPath = f"{thisRepo}/yaml/phono.yaml"
with open(genericMetaPath) as fh:
genericMeta = yaml.load(fh, Loader=yaml.FullLoader)
genericMeta["version"] = VERSION
with open(phonoMetaPath) as fh:
phonoMeta = formatMeta(yaml.load(fh, Loader=yaml.FullLoader))
metaData = {"": genericMeta, **phonoMeta}
utils.caption(4, "Writing TF phono features")
nodeFeatures = dict(
phono=dict(((ln[0], ln[1]) for ln in word_file if ln[0] is not None)),
phono_trailer=dict(((ln[0], ln[2]) for ln in word_file if ln[0] is not None)),
)
edgeFeatures = {}
metaData["otext@phono"] = {
"about": "Provides phonetic transcriptions to Hebrew Words",
"see": "https://github.com/ETCBC/phono",
"fmt:text-phono-full": "{phono}{phono_trailer}",
}
metaData["phono"]["valueType"] = "str"
metaData["phono_trailer"]["valueType"] = "str"
TF = Fabric(locations=thisTempTf, silent=True)
TF.save(nodeFeatures=nodeFeatures, edgeFeatures=edgeFeatures, metaData=metaData)
.............................................................................................. . 3m 09s Writing TF phono features . ..............................................................................................
True
Check differences with previous versions.
utils.checkDiffs(thisTempTf, thisTf, only=set(nodeFeatures))
.............................................................................................. . 6m 08s Check differences with previous version . .............................................................................................. | 6m 08s no features to add | 6m 08s no features to delete | 6m 08s 2 features in common | 6m 08s phono ... no changes | 6m 08s phono_trailer ... no changes | 6m 08s Done
Copy the new TF features from the temporary location where they have been created to their final destination.
utils.deliverDataset(thisTempTf, thisTf)
.............................................................................................. . 6m 11s Deliver data set to /Users/werk/github/etcbc/phono/tf/2021 . ..............................................................................................
utils.caption(4, "Load and compile the new TF features")
.............................................................................................. . 6m 14s Load and compile the new TF features . ..............................................................................................
TF = Fabric(locations=[coreTf, thisTf], modules=[""])
api = TF.load(" ".join(nodeFeatures))
api.makeAvailableIn(globals())
This is Text-Fabric 9.1.7 Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html 117 features found and 0 ignored 0.00s loading features ... | 0.00s Dataset without structure sections in otext:no structure functions in the T-API | 1.01s T phono from ~/github/etcbc/phono/tf/2021 | 0.60s T phono_trailer from ~/github/etcbc/phono/tf/2021 15s All features loaded/computed - for details use TF.isLoaded()
[('Computed', 'computed-data', ('C Computed', 'Call AllComputeds', 'Cs ComputedString')), ('Features', 'edge-features', ('E Edge', 'Eall AllEdges', 'Es EdgeString')), ('Fabric', 'loading', ('TF',)), ('Locality', 'locality', ('L Locality',)), ('Nodes', 'navigating-nodes', ('N Nodes',)), ('Features', 'node-features', ('F Feature', 'Fall AllFeatures', 'Fs FeatureString')), ('Search', 'search', ('S Search',)), ('Text', 'text', ('T Text',))]
utils.caption(4, "Basic tests")
.............................................................................................. . 6m 33s Basic tests . ..............................................................................................
utils.caption(4, "First verses in phonetic transcription")
for v in F.otype.s("verse")[0:10]:
utils.caption(0, "{} {}:{}".format(*T.sectionFromNode(v)), continuation=True)
utils.caption(0, T.text(L.d(v, "word"), fmt="text-phono-full"), continuation=True)
.............................................................................................. . 6m 36s First verses in phonetic transcription . .............................................................................................. Genesis 1:1 bᵊrēšˌîṯ bārˈā ʔᵉlōhˈîm ʔˌēṯ haššāmˌayim wᵊʔˌēṯ hāʔˈāreṣ . Genesis 1:2 wᵊhāʔˈāreṣ hāyᵊṯˌā ṯˈōhû wāvˈōhû wᵊḥˌōšeḵ ʕal-pᵊnˈê ṯᵊhˈôm wᵊrˈûₐḥ ʔᵉlōhˈîm mᵊraḥˌefeṯ ʕal-pᵊnˌê hammˈāyim . Genesis 1:3 wayyˌōmer ʔᵉlōhˌîm yᵊhˈî ʔˈôr wˈayᵊhî-ʔˈôr . Genesis 1:4 wayyˈar ʔᵉlōhˈîm ʔeṯ-hāʔˌôr kî-ṭˈôv wayyavdˈēl ʔᵉlōhˈîm bˌên hāʔˌôr ûvˌên haḥˈōšeḵ . Genesis 1:5 wayyiqrˌā ʔᵉlōhˈîm lāʔôr yˈôm wᵊlaḥˌōšeḵ qˈārā lˈāyᵊlā wˈayᵊhî-ʕˌerev wˈayᵊhî-vˌōqer yˌôm ʔeḥˈāḏ . f Genesis 1:6 wayyˈōmer ʔᵉlōhˈîm yᵊhˌî rāqˌîₐʕ bᵊṯˈôḵ hammˈāyim wiyhˈî mavdˈîl bˌên mˌayim lāmˈāyim . Genesis 1:7 wayyˈaʕaś ʔᵉlōhîm ʔeṯ-hārāqîˌₐʕ wayyavdˈēl bˈên hammˈayim ʔᵃšˌer mittˈaḥaṯ lārāqˈîₐʕ ûvˈên hammˈayim ʔᵃšˌer mēʕˈal lārāqˈîₐʕ wˈayᵊhî-ḵˈēn . Genesis 1:8 wayyiqrˈā ʔᵉlōhˈîm lˈārāqˌîₐʕ šāmˈāyim wˈayᵊhî-ʕˌerev wˈayᵊhî-vˌōqer yˌôm šēnˈî . f Genesis 1:9 wayyˈōmer ʔᵉlōhˈîm yiqqāwˌû hammˈayim mittˈaḥaṯ haššāmˈayim ʔel-māqˈôm ʔeḥˈāḏ wᵊṯērāʔˌeh hayyabbāšˈā wˈayᵊhî-ḵˈēn . Genesis 1:10 wayyiqrˌā ʔᵉlōhˈîm layyabbāšˌā ʔˈereṣ ûlᵊmiqwˌē hammˌayim qārˈā yammˈîm wayyˌar ʔᵉlōhˌîm kî-ṭˈôv .
utils.caption(4, "First verse in all formats")
for fmt in T.formats:
utils.caption(0, "{}".format(fmt), continuation=True)
utils.caption(0, "\t{}".format(T.text(range(1, 12), fmt=fmt)), continuation=True)
.............................................................................................. . 6m 41s First verse in all formats . .............................................................................................. lex-orig-full בְּ רֵאשִׁית בָּרָא אֱלֹה אֵת הַ שָּׁמַי וְ אֵת הָ אָרֶץ lex-orig-plain ב ראשׁית ברא אלהים את ה שׁמים ו את ה ארץ lex-trans-full B.:- R;>CIJT B.@R@> >:ELOH >;T HA- C.@MAJ W:- >;T H@- >@REY lex-trans-plain B R>CJT BR> >LHJM >T H CMJM W >T H >RY text-orig-full בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃ text-orig-full-ketiv בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃ text-orig-plain בראשׁית ברא אלהים את השׁמים ואת הארץ׃ text-phono-full bᵊrēšˌîṯ bārˈā ʔᵉlōhˈîm ʔˌēṯ haššāmˌayim wᵊʔˌēṯ hāʔˈāreṣ . text-trans-full B.:-R;>CI73JT B.@R@74> >:ELOHI92JM >;71T HA-C.@MA73JIM W:->;71T H@->@75REY00 text-trans-full-ketiv B.:-R;>CI73JT B.@R@74> >:ELOHI92JM >;71T HA-C.@MA73JIM W:->;71T H@->@75REY00 text-trans-plain BR>CJT BR> >LHJM >T HCMJM W>T H>RY00
If this notebook is run with the purpose of generating data, this is the end.
After this point, tests and examples are run.
if SCRIPT:
stop(good=True)
The function below reads a text file with tests.
A test is a tab-separated line with the following fields:
passage ETCBC-original phono-transcription expected-result bol-reference comments
The testing routine executes all tests, checks the results, produces on-screen output, debug output in file, and pretty output in a HTML file.
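Reading one such line comes down to splitting on tabs. The sketch below is not the notebook's own reader; the class `PhonoTest`, its field names and the sample line are made up to match the six fields listed above. Missing trailing fields are padded with empty strings.

```python
from typing import NamedTuple


class PhonoTest(NamedTuple):
    passage: str
    orig: str
    phono: str
    expected: str
    bol: str
    comment: str


def parse_test_line(line):
    """Turn one tab-separated line into a PhonoTest record."""
    fields = line.rstrip("\n").split("\t")
    fields += [""] * (6 - len(fields))  # pad short lines
    return PhonoTest(*fields[:6])


sample = "Genesis 1:1\tB.@R@>\tbara\tbara\t\tmade-up example\n"
test = parse_test_line(sample)
print(test.passage, test.expected, test.comment)
```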
Load the features needed for testing.
api = TF.load(
"""
qere qere_trailer
g_word_utf8 g_cons_utf8 trailer
g_word g_cons lex_utf8 lex lex0
sp vs vt gn nu ps st
uvf prs g_prs pfm vbs vbe
languageISO
"""
)
api.makeAvailableIn(globals())
0.00s loading features ... 0.04s All features loaded/computed - for details use loadLog()
Given an occurrence in ETCBC transliteration in a passage, or a node number, we want to easily compile a test out of it.
Say we are looking for orig.
The match need not be perfect.
We want to find the node w, which carries a transliteration that occurs at the end of orig.
If there are multiple, we want the longest.
If there are multiple longest ones, we want the first that occurs in the passage.
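The selection rule just described (suffix match, longest wins, first in the passage wins on ties) can be sketched as follows. This is an illustration only, not the lookup function used in this notebook; the name `find_best_node`, the node numbers and the transliterations are made up.

```python
def find_best_node(candidates, orig):
    """Pick the node whose transliteration is the longest suffix of orig.

    candidates is a list of (node, translit) pairs in passage order.
    """
    best = None
    best_len = 0
    for (node, translit) in candidates:
        # strict ">" keeps the earliest node when lengths are equal
        if translit and orig.endswith(translit) and len(translit) > best_len:
            best = node
            best_len = len(translit)
    return best


words = [(10, "R"), (11, "BR"), (12, "BR"), (13, "XY")]
print(find_best_node(words, "D.:BR"))
```

Nodes 11 and 12 both carry the longest matching suffix; node 11 is chosen because it comes first.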
def get_hebrew(orig):
    # Convert an ETCBC transliteration to pointed Hebrew, dropping hyphens
    origm = Transcription.suffix_and_finales(orig)
    return Transcription.to_hebrew(origm[0] + origm[1]).replace("-", "")


def get_passage(w):
    return T.sectionFromNode(w, lang="la")


def tupleFromStr(passage):
    # "Genesis 1:7" -> ("Genesis", 1, 7)
    (book, rest) = passage.split()
    (chapter, verse) = rest.split(":")
    return (book, int(chapter), int(verse))


def maketest(ws=None, orig=None, passageStr=None, expected=None, comment=None):
    if comment is None:
        comment = "isolated case"
    passage = None if passageStr is None else tupleFromStr(passageStr)
    if ws is None:
        # No node given: look it up by passage and transliteration
        if passage is not None and orig is not None:
            ws = find_w(passage, orig)
        if ws is None:
            TF.error("Cannot make test: {}: {} not found".format(passageStr, orig))
            return None
    else:
        if type(ws) is int:
            ws = [ws]
        passage = get_passage(ws[-1])
    if expected is None:
        # No expectation given: take the current output as the expected value
        expected = phono(ws, punct=False)
    test = (ws, expected.rstrip(" "), comment)
    return test
Here are some HTML/CSS definitions for formatting test results.
def h_esc(txt):
    # Escape the characters that are special in HTML text
    return txt.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")


def test_html_head(title, stats, mystats):
    return (
"""<html>
<head>
<meta http-equiv="Content-Type"
content="text/html; charset=UTF-8" />
<title>"""
+ title
+ """</title>
<style type="text/css">
.h {
font-family: Ezra SIL, SBL Hebrew, Verdana, sans-serif;
font-size: x-large;
text-align: right;
}
.t {
font-family: Menlo, Courier New, Courier, monospace;
font-size: small;
color: #0000cc;
}
.tl {
font-family: Menlo, Courier New, Courier, monospace;
font-size: medium;
font-weight: bold;
color: #000000;
}
.p {
font-family: Verdana, Arial, sans-serif;
font-size: medium;
}
.l {
font-family: Verdana, Arial, sans-serif;
font-size: small;
color: #440088;
}
.v {
font-family: Verdana, Arial, sans-serif;
font-size: x-small;
color: #666666;
}
.c {
font-family: Ezra SIL, SBL Hebrew, Verdana, sans-serif;
font-size: small;
background-color: #ffffdd;
width: 20%;
}
.cor {
font-family: Menlo, Courier New, Courier, monospace;
font-weight: bold;
font-size: medium;
}
.exact {
background-color: #88ffff;
}
.good {
background-color: #88ff88;
}
.error {
background-color: #ff8888;
}
.norm {
background-color: #8888ff;
}
.ca {
background-color: #88ffff;
}
.cr {
background-color: #ffff33;
}
</style>
</head><body>
"""
+ (("<p>" + stats + "</p>") if stats else "")
+ (("<p>" + mystats + "</p>") if mystats else "")
+ """
<table>
"""
)
test_html_tail = """</table>
</body>
</html>
"""
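The escaping that h_esc() performs is meant to replace the ampersand and both angle brackets by their HTML entities; the ETCBC transliteration uses > and < as consonant codes, so this matters for every table cell. As a sketch, the standard library offers the same substitution through html.escape with quote=False. The name h_esc_replace and the sample string are only there for the comparison.

```python
import html


def h_esc_replace(txt):
    # Hand-written variant: the ampersand must be replaced first
    return txt.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")


# Transliterations taken from this notebook, joined by an ampersand for the demo
sample = ">:ELOHI92JM & <AY.@92BET"

# quote=False leaves single and double quotes alone, like the hand-written variant
escaped = html.escape(sample, quote=False)
print(escaped)
print(escaped == h_esc_replace(sample))
```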
This is the function that runs a sequence of tests. If the second argument is a string, it reads a tab-separated file with tests from a file with that name. Otherwise it should be a list of tests, a test being a list or tuple consisting of:
source, orig, lex-info, expected, comment
where source is either a string passage or a number w.
If it is a w, it is the node corresponding to the word, and it is used to get the passage, orig and lex_info, which are then allowed to be empty.
If it is a passage, the node will be looked up on the basis of it plus orig.
If the node is found, it will be used to get the lex_info; if not, the given lex_info will be used.
def vfname(inpath):
    # Insert the dataset version between the base name and the extension
    (indir, infile) = os.path.split(inpath)
    (inbase, inext) = os.path.splitext(infile)
    return os.path.join(indir, inbase + VERSION + inext)


def runtests(title, testsource, outfilename, htmlfilename, order=True, screen=False):
    skipped = 0
    if type(testsource) is list:
        tests = testsource
    else:
        # Read the tests from a tab-separated file
        tests = []
        test_in_file = open(testsource)
        for tline in test_in_file:
            (passageStr, orig, expected, comment) = tline.rstrip("\n").split("\t")
            this_test = maketest(
                orig=orig, passageStr=passageStr, expected=expected, comment=comment
            )
            if this_test is not None:
                tests.append(this_test)
            else:
                skipped += 1
        test_in_file.close()

    lines = []
    htmllines = []
    longlines = []
    nexact = 0
    ngood = 0
    ntests = len(tests)

    test_sequence = sorted(tests, key=lambda x: (x[1], x[2], x[0])) if order else tests

    for (i, (wset, expected, comment)) in enumerate(test_sequence):
        passage = get_passage(wset[-1])
        passageStr = "{} {}:{}".format(*passage)
        wss = partition_w(wset)
        orig = "".join(get_orig(w, punct=True, set_pet=True, tetra=False) for w in wset)
        wordph = ""
        lex_info = ""
        dout = []
        for (j, ws) in enumerate(wss):
            this_lex_info = get_lex_info(ws[-1])
            (this_wordph, this_dout) = phono(
                ws, punct=not (j == len(wss) - 1), debug=True
            )
            wordph += this_wordph
            lex_info += this_lex_info
            dout.extend(this_dout)
        wordph = wordph.rstrip(" ")

        # "=": exact match; "~": match when stress marks and hyphens are ignored; "#": failure
        if wordph == expected:
            isgood = "="
            nexact += 1
        elif wordph.replace("ˌ", "").replace("ˈ", "").replace(
            "-", ""
        ) == expected.replace("ˌ", "").replace("ˈ", "").replace("-", ""):
            isgood = "~"
            ngood += 1
        else:
            isgood = "#"

        line_text = "{:>3} {:<19} {:>6} {:<17} {:<22} {:<20} {} {:<20}".format(
            i + 1,
            passageStr,
            ws[-1],
            lex_info,
            orig,
            wordph,
            isgood,
            "" if isgood == "=" else expected,
        )
        lines.append(line_text)

        # Passing tests are shown only on request; failures are always shown
        if screen:
            if isgood in {"=", "~"}:
                TF.info(line_text, tm=False)
        if isgood not in {"=", "~"}:
            TF.info(line_text, tm=False)

        longlines.append(
            "{:>3} {:<19} {:>6} {:<17} {:<25} => {:<25} < {} {:<25} # {}\n{}\n\n".format(
                i + 1,
                passageStr,
                ws[-1],
                lex_info,
                orig,
                wordph,
                isgood,
                "" if isgood == "=" else expected,
                comment,
                "\n".join("{:<7} {:<20} {}".format("", x[0], x[1]) for x in dout),
            )
        )
        htmllines.append(
            (
                """
<tr>
<td class="{st}">{i}</td>
<td class="v">{v} {w}</td>
<td class="t">{t}</td>
<td class="h">{h}</td>
<td class="l">{l}</td>
<td class="p {st}">{p}</td>
<td class="p{est}">{e}</td>
<td class="c">{c}</td>
</tr>
"""
            ).format(
                st="exact" if isgood == "=" else "good" if isgood == "~" else "error",
                i=i + 1,
                v=passageStr,
                # Same node as in the plain-text report (w was undefined here)
                w=ws[-1],
                t=h_esc(orig),
                l=lex_info,
                h=get_hebrew(orig),
                p=wordph,
                e="" if isgood == "=" else expected,
                est="" if isgood == "=" else " ca" if isgood == "~" else " norm",
                c=h_esc(comment),
            )
        )

    line_text = "\n".join(lines)
    longline_text = "\n".join(longlines)
    test_out_file = open(vfname(outfilename), "w")
    test_out_file.write("{}\n\n{}\n".format(line_text, longline_text))
    stats = "{} tests; {} skipped; {} failed; {} passed of which {} exactly.".format(
        ntests + skipped,
        skipped,
        ntests - ngood - nexact,
        ngood + nexact,
        nexact,
    )
    TF.info(
        "ntests={}, skipped={}, ngood={}, nexact={}".format(
            ntests, skipped, ngood, nexact
        )
    )
    test_out_file.close()

    test_html_file = open(vfname(htmlfilename), "w")
    test_html_headline = """
<tr>
<th class="v">v</th>
<th class="v">verse</th>
<th class="t">etcbc</th>
<th class="h">hebrew</th>
<th class="l">lexical</th>
<th class="p">phono</th>
<th class="p norm">expected</th>
<th class="c">comment</th>
</tr>
"""
    test_html_file.write(
        "{}{}{}{}".format(
            test_html_head(title, stats, ""),
            test_html_headline,
            "".join(htmllines),
            test_html_tail,
        )
    )
    test_html_file.close()
    TF.info(stats, tm=False)
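The three-way verdict that runtests() assigns to each test can be isolated as a small helper. The sketch below follows the comparison in the code above: an exact match gives "=", a match after removing the two stress marks and the hyphen gives "~", and anything else gives "#". The name classify and the use of str.translate are choices made for this example only.

```python
# Characters ignored in the lenient comparison: secondary stress, primary stress, hyphen
IGNORED = str.maketrans("", "", "ˌˈ-")


def classify(actual, expected):
    """Return "=" (exact), "~" (equal up to stress marks and hyphens) or "#" (different)."""
    if actual == expected:
        return "="
    if actual.translate(IGNORED) == expected.translate(IGNORED):
        return "~"
    return "#"


print(classify("bᵊrēšˌîṯ", "bᵊrēšˌîṯ"))  # identical strings
print(classify("bᵊrēšˌîṯ", "bᵊrēšˈîṯ"))  # differ only in the stress mark
print(classify("bᵊrēšˌîṯ", "bārˈā"))      # different words
```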
This is a variant on runtests().
It produces overviews of the cases where the corpus-dependent rules have been applied.
def showcases(title, stats, testsource, order=True):
    ctitle = title + " cases"
    ttitle = title + " tests"
    fctitle = ctitle.replace(" ", "_")
    fttitle = ttitle.replace(" ", "_")
    test_file_name = vfname(fttitle + ".txt")
    html_file_name = vfname(fctitle + ".html")
    TF.info("Generating HTML in {}".format(html_file_name))
    TF.info("Generating test set {} in {}".format(title, test_file_name))
    htmllines = []
    ncorr = 0
    test_sequence = (
        sorted(testsource, key=lambda x: (x[3], x[0], x[1], x[5]))
        if order
        else testsource
    )
    ntests = len(testsource)

    test_file = open(test_file_name, "w")
    for (i, (corr, wordph, wordph_c, lex, orig, w, comment)) in enumerate(
        test_sequence
    ):
        passage = get_passage(w)
        passageStr = "{} {}:{}".format(*passage)
        lex_info = get_lex_info(w)
        test_file.write(
            "{}\t{}\t{}\t{}\n".format(
                passageStr,
                orig,
                wordph_c,
                comment,
            )
        )
        heb = get_hebrew(orig)
        if corr:
            ncorr += 1
        htmllines.append(
            (
                """
<tr>
<td class="v">{i}</td>
<td class="cor{st}">{cr}</td>
<td class="tl">{tl}</td>
<td class="v">{v} {w}</td>
<td class="l">{l}</td>
<td class="h">{h}</td>
<td class="p {st}">{p}</td>
<td class="p {st1}">{pc}</td>
<td class="t">{t}</td>
<td class="c">{c}</td>
</tr>
"""
            ).format(
                i=i + 1,
                st=" cr" if corr else "",
                st1=" good" if corr else "",
                cr=corr,
                tl=h_esc(lex),
                v=passageStr,
                w="" if w is None else w,
                l=lex_info,
                h=heb,
                p=wordph if wordph != wordph_c else "",
                pc=wordph_c,
                t=h_esc(orig),
                c=h_esc(comment),
            )
        )
    test_file.close()

    mystats = "{} occurrences and {} corrections".format(
        ntests,
        ncorr,
    )
    test_html_headline = """
<tr>
<th class="v">n</th>
<th class="cor cr">correction</th>
<th class="tl">lexeme</th>
<th class="v">verse</th>
<th class="l">lexical</th>
<th class="h">hebrew</th>
<th class="p cr">phono<br/>uncorrected</th>
<th class="p good">phono<br/>corrected</th>
<th class="t">etcbc</th>
<th class="c">comment</th>
</tr>
"""
    test_html_file = open(html_file_name, "w")
    test_html_file.write(
        "{}{}{}{}".format(
            test_html_head(ctitle, stats, mystats),
            test_html_headline,
            "".join(htmllines),
            test_html_tail,
        )
    )
    test_html_file.close()
    if stats:
        TF.info(stats, tm=False)
    if mystats:
        TF.info(mystats, tm=False)
for tname in [
    "mixed",
    "qamets_nonverb_tests",
    "qamets_verb_tests",
    "qamets_prs_tests",
]:
    runtests(
        tname,
        "{}.txt".format(tname),
        "{}_debug.txt".format(tname),
        "{}.html".format(tname),
        screen=False,
    )
9.49s ntests=86, skipped=0, ngood=19, nexact=67
86 tests; 0 skipped; 0 failed; 86 passed of which 67 exactly.
  10s ntests=1574, skipped=0, ngood=197, nexact=1377
1574 tests; 0 skipped; 0 failed; 1574 passed of which 1377 exactly.
  10s ntests=513, skipped=0, ngood=30, nexact=483
513 tests; 0 skipped; 0 failed; 513 passed of which 483 exactly.
  11s ntests=209, skipped=0, ngood=30, nexact=179
209 tests; 0 skipped; 0 failed; 209 passed of which 179 exactly.
special_tests = [
dict(passageStr="Joel 1:17", orig="<@B:C74W.", comment="qamets gadol or qatan"),
dict(ws=7494, expected=None, comment="schwa in front of BGDKPT without dagesh"),
dict(ws=5, expected=None, comment="article in isolation"),
dict(ws=6, expected=None, comment="word after article in isolation"),
dict(ws=106, expected=None, comment="proclitic min"),
dict(
ws=107, expected=None, comment="word starting with BGDKPT after proclitic min"
),
dict(
passageStr="Genesis 1:7",
orig="MI-T.A74XAT",
expected=None,
comment="proclitic min combined with word starting with BGDKPT",
),
dict(ws=1684, expected=None, comment="Tetra with end of verse"),
dict(
passageStr="Genesis 4:1",
orig="J:HW@75H00",
expected=None,
comment="Tetra with end of verse",
),
dict(ws=27477, expected=None, comment="pronominal suffix after verb"),
dict(ws=155387, expected=None, comment="peculiar representation of tetragrammaton"),
dict(
passageStr="Proverbia 10:10",
orig="<AY.@92BET",
expected=None,
comment="the qamets should be gadol",
),
dict(passageStr="Genesis 9:21", orig="*>HLH", expected=None, comment="ketiv qere"),
dict(
passageStr="Genesis 1:27",
orig="H@95->@D@M03",
expected=None,
comment="qamets gadol",
),
]
compiled_tests = []
for t in special_tests:
    this_test = maketest(**t)
    if this_test is not None:
        compiled_tests.append(this_test)
runtests(
"special cases",
compiled_tests,
"special_cases_out.txt",
"special_cases.html",
screen=True,
)
  1 Genesis 9:21          4420 subs sm,+h        >@H:@LO75W00           *ʔohᵒlˈô             =
  2 Genesis 4:1           1685 nmpr sm,          J:HW@75H00             [yᵊhwˈāh]            =
  3 Genesis 17:11         7494 subs sm,          B.:FA74R               bᵊśˈar               =
  4 Genesis 1:27           539 subs sm,          H@95->@D@M03           hˈāʔāḏˌām            =
  5 Genesis 1:1              6 art               HA-                    hˌa                  =
  6 Genesis 1:7            108 subs sm,          MI-T.A74XAT            mittˈaḥaṯ            =
  7 Genesis 1:7            107 prep              MI-                    mˌi                  =
  8 Samuel_I 23:3       155387 nmpr s-,          Q:<IL@80H              qᵊʕilˈā              =
  9 Genesis 4:1           1684 prep              >ET&                   ʔeṯ-                 =
 10 Genesis 48:9         27477 prep              >;LA73J                ʔēlˌay               =
 11 Genesis 1:1              5 prep              >;71T                  ʔˌēṯ                 =
 12 Genesis 1:7            106 conj              >:ACER03               ʔᵃšˌer               =
 13 Proverbia 10:10     349420 subs sf,          <AY.@92BET             ʕaṣṣˈāveṯ            =
 14 Joel 1:17           294304 verb qal perf 3p-, <@B:C74W.             ʕāvᵊšˈû              =
32s ntests=14, skipped=0, ngood=0, nexact=14
14 tests; 0 skipped; 0 failed; 14 passed of which 14 exactly.
We have generated a number of corrections of the qamets interpretation in non-verbs, and we have applied exceptions to those corrections. Here is the list of representative occurrences where corrections and/or exceptions have been applied.
TF.info("Showing lexemes with varied occurrences")
qqi_filename = "qamets_qatan_individuals"
qqi = open("{}.txt".format(qqi_filename), "w")
nvcases = []
noccs = 0
ncorrs = 0
for lex in sorted(qq_varied):
    if lex not in qq_varied_remaining:
        continue
    occs = qq_varied[lex]
    for (skel, fullskel, w) in sorted(occs, key=lambda x: (x[1], x[2])):
        orig = get_orig(w, punct=False, tetra=False)
        wordq = phono(w, punct=False, correct=-1)
        corr = qamets_corrections.get(wordq, "")
        if corr:
            ncorrs += 1
        noccs += 1
        wordph = phono(w, punct=False, correct=0)
        wordph_c = phono(w, punct=False, correct=1)
        comment = "on the basis of other occurrences" if corr else "by the rules"
        qqi.write(
            "{:<1}\t{:<5}\t{:<16}\t{:<16}\t{:<10}\t{:<20}\t{}\n".format(
                "*" if corr else "",
                corr,
                wordph,
                wordph_c,
                lex,
                orig,
                w,
            )
        )
        nvcases.append((corr, wordph, wordph_c, lex, orig, w, comment))
    qqi.write("\n")
qqi.close()
TF.info(
"{} lexemes with {} occurrences and {} corrections written".format(
len(qq_varied_remaining),
noccs,
ncorrs,
)
)
showcases(
"qamets nonverb",
"{} lexemes".format(len(qq_varied_remaining)),
nvcases,
order=False,
)
1m 08s Showing lexemes with varied occurrences
1m 10s 107 lexemes with 1192 occurrences and 224 corrections written
1m 10s Generating HTML in qamets_nonverb_casesc.html
1m 10s Generating test set qamets nonverb in qamets_nonverb_testsc.txt
107 lexemes
1192 occurrences and 224 corrections
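The trailing c in the generated file names comes from vfname(), which places the dataset version between the base name and the extension. The sketch below repeats that function with VERSION set to "c", an assumption that matches the file names in the output above; in the notebook VERSION is defined elsewhere.

```python
import os

VERSION = "c"  # assumption for this example; the notebook sets VERSION elsewhere


def vfname(inpath):
    # Split off the directory and the extension, then re-join with the version inserted
    (indir, infile) = os.path.split(inpath)
    (inbase, inext) = os.path.splitext(infile)
    return os.path.join(indir, inbase + VERSION + inext)


# showcases() builds its file names from the title in the same way
title = "qamets nonverb"
tests_name = vfname((title + " tests").replace(" ", "_") + ".txt")
cases_name = vfname((title + " cases").replace(" ", "_") + ".html")
print(tests_name)
print(cases_name)
```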
Usually, accents take care that potential qatans are read as gadols. But sometimes the accents are missing. We have used a list of paradigm labels where such cases might occur, and there we suppress the qamets-as-qatan interpretation. We look at the verb paradigms to fill in the missing information.
Here we list the cases where this occurs, and show them.
qq_verb_words = set()
qq_verb_specials = []
TF.info("Finding qamets qatan special verb cases")
for w in F.otype.s("word"):
    ln = F.languageISO.v(w)
    if ln != "hbo":
        continue
    sp = F.sp.v(w)
    if sp != "verb":
        continue
    orig = get_orig(w, punct=False, tetra=False)
    if "@" not in orig:
        continue  # no qamets in word
    word = doaccents(orig)
    wordq = doplainqamets(word, accentless=True)
    if "^" not in wordq:
        continue  # no risk of unwanted qamets qatan
    # if '!' in word: continue  # primary accent has been marked
    lex_info = get_lex_info(w)
    decl = get_decl(lex_info)
    if decl in qamets_qatan_verb_x:
        if (word, lex_info) in qq_verb_words:
            continue
        qq_verb_words.add((word, lex_info))
        qq_verb_specials.append((w, orig, word))
TF.info("{} cases".format(len(qq_verb_specials)))
1m 15s Finding qamets qatan special verb cases
1m 16s 524 cases
TF.info("Showing verb cases")
ncorr = 0
ngood = 0
vcases = []
verb_lexemes = set()
for (w, orig, word) in qq_verb_specials:
    wordph = phono(w, punct=False)
    wordph_ns = phono(w, punct=False, suppress_in_verb=False)
    corr = ""
    lex = F.lex.v(w)
    verb_lexemes.add(lex)
    if wordph == wordph_ns:
        ngood += 1
        corr = ""
        comment = "qamets: no need to suppress qatan"
    else:
        ncorr += 1
        corr = "gadol"
        comment = "qamets: gadol maintained because of verb paradigm"
    vcases.append((corr, wordph_ns, wordph, lex, orig, w, comment))
showcases(
"qamets verb",
"{} lexemes".format(len(verb_lexemes)),
vcases,
order=True,
)
1m 22s Showing verb cases
1m 22s Generating HTML in qamets_verb_casesc.html
1m 22s Generating test set qamets verb in qamets_verb_testsc.txt
192 lexemes
524 occurrences and 300 corrections
Usually, rules involving closed unaccented syllables trigger the qatan interpretation of a qamets. But in pronominal suffixes a qamets is always gadol. We detect these cases and suppress the qamets-as-qatan interpretation there.
qq_prs_words = set()
qq_prs_specials = []
TF.info("Finding qamets qatan in pronominal suffixes")
for w in F.otype.s("word"):
    ln = F.languageISO.v(w)
    if ln != "hbo":
        continue
    lex_info = get_lex_info(w)
    prs = get_prs(lex_info)
    if prs == "":
        continue
    orig = get_orig(w, punct=False, tetra=False)
    if "@" not in prs:
        continue  # no qamets in suffix
    word = doaccents(orig)
    wordq = doplainqamets(word, accentless=False)
    if "^" not in wordq:
        continue  # no risk of unwanted qamets qatan
    if (word, lex_info) in qq_prs_words:
        continue
    qq_prs_words.add((word, lex_info))
    qq_prs_specials.append((w, orig, word))
TF.info("{} potential cases".format(len(qq_prs_specials)))
1m 26s Finding qamets qatan in pronominal suffixes
1m 28s 197 potential cases
TF.info("Showing prs cases")
ncorr = 0
ngood = 0
pcases = []
prs_lexemes = set()
for (w, orig, word) in qq_prs_specials:
    lex = F.lex.v(w)
    prs_lexemes.add(lex)
    wordph = phono(w)
    wordph_ns = phono(w, suppress_in_prs=False)
    corr = ""
    if wordph == wordph_ns:
        ngood += 1
        corr = ""
        comment = "qamets: no need to suppress qatan"
    else:
        ncorr += 1
        corr = "gadol"
        comment = "qamets: gadol maintained in pronominal suffix"
    pcases.append((corr, wordph_ns, wordph, lex, orig, w, comment))
showcases(
"qamets prs",
"{} lexemes".format(len(prs_lexemes)),
pcases,
order=True,
)
1m 32s Showing prs cases
1m 32s Generating HTML in qamets_prs_casesc.html
1m 32s Generating test set qamets prs in qamets_prs_testsc.txt
131 lexemes
197 occurrences and 60 corrections
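The verb pass and the pronominal-suffix pass share one pattern: each word is transcribed twice, once with the suppression rule switched on and once with it switched off, and a difference between the two results counts as a correction. The sketch below shows only that bookkeeping on toy strings; tally_corrections is a hypothetical helper and the strings are placeholders, not output of phono().

```python
def tally_corrections(pairs):
    """Count how often switching a rule off changes the result.

    pairs: iterable of (with_rule, without_rule) transcriptions.
    Returns (ngood, ncorr, labels), where labels holds "gadol" for every
    pair that differs and "" otherwise, as in the loops above.
    """
    ngood = 0
    ncorr = 0
    labels = []
    for (with_rule, without_rule) in pairs:
        if with_rule == without_rule:
            ngood += 1
            labels.append("")
        else:
            ncorr += 1
            labels.append("gadol")
    return (ngood, ncorr, labels)


# Placeholder strings standing in for transcriptions
toy_pairs = [("form-a", "form-a"), ("form-b", "form-B"), ("form-c", "form-c")]
(ngood, ncorr, labels) = tally_corrections(toy_pairs)
print("{} occurrences and {} corrections".format(ngood + ncorr, ncorr))
```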