
Phonetic Transliteration of Hebrew Masoretic Text¶

Frequently asked questions¶

Q: What is the use of a phonetic transliteration of the Hebrew Bible? What can anyone wish beyond the careful, meticulous Masoretic system of consonants, vowels and accents?

A: Several things:

the Hebrew Bible may be a subject of study in various fields where the people involved have not mastered the Hebrew script; a phonetic transcription removes a hurdle for them.
in computational linguistics there are many tools that deal with written language in Latin alphabets; even a task as simple as getting the consonant-vowel pattern of a word is unnecessarily complicated when using the Hebrew script.
in phonetics and language learning theory, it is important to represent the sounds without being burdened by the idiosyncrasies of the writing system and the spelling.

Q: But surely, there already exist transliterations of Hebrew? Why not use them?

A: Here are a few pragmatic reasons:

we want to be able to compute a transliteration based upon our own data;
we want to gain insight into the extent to which the transliteration can be purely rule-based, and the extent to which it depends on lexical information that you just need to know;
we want to make available a well-documented transliteration that can be studied, borrowed and improved by others.

Q: But how good is your transliteration?

A: We do not know yet. A few remarks, though:

we have applied most of the rules that we could find in Hebrew grammars;
we have suspended some of the rules for some verb paradigms where it is known that they lead to incorrect results;
where the rules did not suffice, we have searched the corpus for other occurrences of the same word, to get clues;
where we knew that clues pointed in the wrong direction, we have applied a list of exceptions (currently a list of only one word: בָּתִּֽים, *bottˈîm => bāttˈîm);
we have a fair test set with critical cases that all pass;
we have a few tables of all cases where the algorithm has made corpus-based decisions and lexical decisions;
we are open to your corrections: log in to SHEBANQ, go to a passage with an offending phonetic transliteration, and make a manual note. Tip: give that note the keyword phono, then we will collect them.

Q: To me, this is not entirely satisfying.

A: Fair enough. Consider jumping to Bible Online Learner, where they have built in a pretty good transliteration, based on a different method of rule application. It is documented in an article by Nicolai Winther-Nielsen: Transliteration of Biblical Hebrew for the Role-Lexical Module and additional information can be found in Claus Tøndering's Bible Online Learner, Software on GitHub. See also Lex: A software project for linguists.

We are planning to conduct an automatic comparison of both transliteration schemes over the whole corpus.

Q: Who is the we?

A: That is the author of this notebook, Dirk Roorda, working together with Martijn Naaijer and getting input from Nicolai Winther-Nielsen and Willem van Peursen.

Overview of the results¶

The main result is a Python function phono(ETCBC-original, ...): phonetic transliteration.
Showcases and tests: how the function solves particular classes of problems. The cases file shows a set of cases that have been generated in the last run.

The tests files show a prepared set of cases, against which to test new versions of the algorithm. These results have been obtained on version c of the BHSA dataset.

mixed with log file mixed_debug.
qamets-non-verb cases and qamets-non-verb tests with log file qamets-nonverb_tests_debug. The result of searching the corpus for related occurrences and having them vote for qatan/gadol interpretation of the qamets.
qamets-verb cases and qamets-verb tests with log file qamets-verb tests-debug. The result of suppressing the qatan interpretation of the qamets regardless of accent for a definite set of verb forms.
qamets-prs cases and qamets-prs tests with log file qamets-prs tests-debug. The result of suppressing the qatan interpretation of the qamets in pronominal suffixes.
A plain text with the complete text in BHSA transliteration and phonetic transcription, verse by verse.

Overview of the method¶

High-level description¶

BHSA transliteration Our starting point is the BHSA full transliteration of the Hebrew Masoretic text. This transliteration is in 1-1 correspondence with the Masoretic text, including all vowels and accents.
Grammar rules We have implemented the rules we find in grammars of Hebrew about long and short qamets, mobile and silent schwa, dagesh, and mater lectionis. The implementation takes the form of a sequence of regular expressions, with which we transliterate targeted pieces of the original. These regular expressions are delicately formulated, and must be applied in the given order. Beware: seemingly innocent modifications in these expressions, or in the order of application, may ruin the transcription completely.
Qamets puzzles: verbs In many verb forms the grammar rules would dictate that a certain qamets is qatan while in fact it is gadol. In most cases this is caused by the fact that no accent has been marked on the syllable that carries the qamets in question. There is a limited set of verb paradigms where this occurs. We detect those and suppress qamets qatan interpretation for them.
Qamets puzzles: non-verbs There are quite a few non-verb occurrences where the accent pattern of a word invites a qamets to become qatan, that is, by the grammar rules. Yet, other occurrences of the same lexeme have other accent patterns, and lead to a gadol interpretation of the same qamets. In this case we count the unique cases in favor of gadol versus qatan, and let the majority decide for all occurrences. In cases where we know that the majority votes wrong, we have intervened.
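
The majority vote for the non-verb qamets cases can be sketched as follows. This is a minimal illustration, not the implementation used in this notebook: the function name decide_qamets, the tuple layout of the input, the lexeme labels and the tie-break in favour of gadol are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def decide_qamets(cases, overrides=None):
    """Let unique cases vote per lexeme for 'gadol' or 'qatan'.

    cases: iterable of (lexeme, word_form, reading) tuples (hypothetical layout),
           where reading is what the grammar rules alone would yield.
    overrides: optional dict lexeme -> reading, for majorities known to be wrong.
    """
    votes = defaultdict(Counter)
    for lexeme, _form, reading in set(cases):  # each unique case votes once
        votes[lexeme][reading] += 1

    decisions = {}
    for lexeme, counts in votes.items():
        # assumption: a tie is resolved in favour of gadol
        decisions[lexeme] = "gadol" if counts["gadol"] >= counts["qatan"] else "qatan"

    # manual intervention wins over the vote
    decisions.update(overrides or {})
    return decisions


cases = [
    ("LEX_A", "form1", "gadol"),
    ("LEX_A", "form2", "gadol"),
    ("LEX_A", "form3", "qatan"),
    ("LEX_A", "form3", "qatan"),  # duplicate: counted once
    ("LEX_B", "form4", "qatan"),
]
print(decide_qamets(cases))
print(decide_qamets(cases, overrides={"LEX_B": "gadol"}))
```

The decision is taken per lexeme and then applied to all its occurrences, which is why a single override entry is enough to correct a whole group of forms.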

Qamets work hypothesis¶

Note that in the non-verb qamets puzzles we have tacitly assumed that qamets qatan and gadol are not phonological variants of each other. In other words, it never occurs that a qamets gadol becomes shortened into a qamets qatan. From the grammar rules it follows that short versions of the qamets can only be

patah
schwa
composite schwa with patah

and never

qamets qatan
composite schwa with qamets

Whether this hypothesis is right is beyond my competence. We just use it as a working hypothesis.

Lexical information¶

This method is not a pure method in the sense of one that works only with the information given in the source string. We cheat, i.e. we use morphological information from the BHSA database to steer us in the right direction. To this end, the input of phono() is always a Text-Fabric node, from which we can get all the information we need.

More precisely, the input is a sequence of nodes. This sequence is meant to correspond to a sequence of slots belonging to words that are written adjacently (no space between, no maqef between). From these nodes we can look up:

the BHSA transliteration
the qere (if there is a discrepancy between ketiv and qere)
additional lexical information (taken from the last node)

Combined words¶

You can use phono() to transliterate multiple words at the same time, but you can also do individual words, even if in Hebrew they are written together. However, it is better to feed combined words to phono() in one go, because the prefix word may influence the transliteration of the postfix word. Think of the article followed by a word starting with a BGDKPT letter. The dagesh in the BGDKPT letter is interpreted as a lene if the word stands on its own, but as a forte if it is combined.

However, it is not advised to feed longer strings to phono(), because when phono() retrieves lexical information, it uses the information of the last node that matches a word in the input string.

Accents¶

We determine "primary" and "secondary" stress in our transliteration, but this must not be taken in a phonetic sense. Every syllable that carries an accent pointing will get a primary stress mark. However, a few specific accent pointings are not deemed to produce an accent, and another group of accents is deemed to produce only a secondary accent. The last syllable of a word also gets a secondary accent by default. We have not yet tried to be more precise in this, so segolates do not get the treatment they deserve.

The main rationale for accents is that they prevent a qamets from being read as qatan.

Individual symbols¶

We have made a careful selection of UNICODE symbols to represent Hebrew sounds. Sometimes we follow the phonetic usage of the symbols, sometimes we follow widespread custom. The actual mapping can be plugged in quite easily, and the intermediate stages in the transformation do not use these symbols, so the algorithm can be easily adapted to other choices.

Consonants¶

Provided it is not part of a long vowel, we write י as y, whilst j would be more in line with the phonetic alphabet.

Likewise, we write ו as w, if it is not part of a long vowel. If a word ends in יו the ו is not a mater lectionis, and the י gets elided. We represent this phonetically as ʸw.

With regard to the BGDKPT letters, it would have been attractive to use the letters b g d k p t without diacritic for the plosive variants, and with a suitable diacritic for the fricative variants. Alas, the UNICODE table does not offer a suitable diacritic that is available for all 6 of these letters.

So, we use b g d k p t for the plosives, but for the fricatives we use v ḡ ḏ ḵ f ṯ.

With regard to the emphatic consonants ט and ח and צ we represent them with a dot below: ṭ ḥ ṣ. ק is just q.

ע and א translate to ʕ and ʔ.

שׁ and שׂ translate to š and ś. ס is just s.

When א and ה are mater lectionis, they are left out. A ה with mappiq becomes just h, like every ה which is not a mater lectionis.

We do not mark the deviant final forms of the consonants ך and ם and ן and ף and ץ, assuming that this is just a scriptural peculiarity, with no effect on the actual sounds.

The remaining consonants go as follows:

ל	l
מ	m
נ	n
ר	r
ז	z

Vowels¶

The short vowels (patah, segol, hireq, qamets qatan, qibbuts) are just a e i o u.

However, the furtive patah is a ₐ in front of its consonant.

The long vowels without yod or waw (qamets gadol, tsere, holam) have a bar above ā ē ō.

The complex vowels (tsere or hireq plus yod, holam plus waw, waw with dagesh) have a circumflex ê î ô û.

A segol followed by yod becomes eʸ

The composite schwas (patah, segol, qamets) are written as superscripts ᵃ ᵉ ᵒ.

The simple schwa is left out if silent, and otherwise it becomes ᵊ.

Accent¶

The primary and secondary stress are marked as ˈ ˌ and are placed in front of the vowel they occur with.

Punctuation¶

The sof-pasuq ׃ becomes a full stop (.). If it is followed by ס (setumah) or ף (petuhah) or ̇׆ (nun-hafukha), these extra symbols are omitted.

The maqef ־ (between words) becomes a hyphen (-).

If words are juxtaposed without space in the Hebrew, they are also juxtaposed without space in the phonetic transliteration.

Tetragrammaton¶

The tetragrammaton is transliterated with the vowels it is encountered with, but the whole is put between square brackets [ ].

Ketiv-qere¶

We base the phonetics on the (vocalized) qere, if a qere is present. The ketiv is then ignored. We precede each such word by a * to indicate that the qere is deviant from the ketiv. Using the data view in SHEBANQ it is possible to see what the ketiv is.

Cleaning up¶

We leave the accents and the schwas in the end product of the phono() function, despite the fact that the accents, as they appear, do not have consistent phonetic significance. And it can be argued that every schwa is silent. If you do not care for schwas and accents, it is easy to remove them. Also, if you find the results in separating the qamets into qatan and gadol unsatisfying or irrelevant, you can just replace them both by a single symbol, such as å.
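
A minimal sketch of such a clean-up step is given below. It is not part of the notebook: the function name simplify and its keyword arguments are assumptions, and the input string is merely illustrative. The sketch relies on the symbol choices described above: ˈ and ˌ for stress, ᵊ for the mobile schwa, ā for qamets gadol and a plain o for qamets qatan (holam is ō or ô, so it is not affected).

```python
import unicodedata

STRESS_MARKS = "\u02c8\u02cc"  # ˈ primary and ˌ secondary stress
MOBILE_SCHWA = "\u1d4a"  # ᵊ
QAMETS_GADOL = "\u0101"  # ā
QAMETS_QATAN = "o"
MERGED_QAMETS = "\u00e5"  # å


def simplify(phono_text, stress=True, schwa=True, qamets=True):
    """Strip stress marks and mobile schwas, and merge both qamets readings."""
    # compose characters first, so that ā is one code point however it was typed
    result = unicodedata.normalize("NFC", phono_text)
    if stress:
        result = result.translate({ord(c): None for c in STRESS_MARKS})
    if schwa:
        result = result.replace(MOBILE_SCHWA, "")
    if qamets:
        result = result.replace(QAMETS_GADOL, MERGED_QAMETS)
        result = result.replace(QAMETS_QATAN, MERGED_QAMETS)
    return result


# illustrative input: d ā v ˈ ā r
print(simplify("d\u0101v\u02c8\u0101r"))
```

Each of the three simplifications can be switched off separately, for instance simplify(text, qamets=False) keeps the qatan/gadol distinction but drops stress marks and schwas.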

Testing¶

Quite a bit of code is dedicated to counting special cases, to testing, and to producing neat tables with interesting forms. It is also possible to call the phono() function in debug mode, which will write to a text file all stages in the transliteration from BHSA original into the phonetic result.

Load the modules¶

In [1]:

import sys
import os
import collections
import re
import yaml
import utils
from tf.fabric import Fabric
from tf.writing.transcription import Transcription
from tf.core.helpers import formatMeta

Pipeline¶

See operation for how to run this script in the pipeline.

In [2]:

if "SCRIPT" not in locals():
    SCRIPT = False
    FORCE = True
    CORE_NAME = "bhsa"
    NAME = "phono"
    VERSION = "2021"

In [3]:

def stop(good=False):
    if SCRIPT:
        sys.exit(0 if good else 1)

This notebook can run a lot of tests and create a lot of examples. However, when run in the pipeline, we only want to create the two phono features.

So, further on, there will be quite a bit of code under the condition not SCRIPT.

Setting up the context: source file and target directories¶

The conversion is executed in an environment of directories, so that sources, temp files and results are in convenient places and do not have to be shifted around.

In [4]:

repoBase = os.path.expanduser("~/github/etcbc")
coreRepo = "{}/{}".format(repoBase, CORE_NAME)
thisRepo = "{}/{}".format(repoBase, NAME)

In [5]:

coreTf = "{}/tf/{}".format(coreRepo, VERSION)

In [6]:

thisTemp = "{}/_temp/{}".format(thisRepo, VERSION)
thisTempTf = "{}/tf".format(thisTemp)

In [7]:

thisTf = "{}/tf/{}".format(thisRepo, VERSION)

Test¶

Check whether this conversion is needed in the first place. Only when run as a script.

In [8]:

if SCRIPT:
    (good, work) = utils.mustRun(
        None, "{}/.tf/{}.tfx".format(thisTf, "phono"), force=FORCE
    )
    if not good:
        stop(good=False)
    if not work:
        stop(good=True)

Load the TF data¶

In [9]:

utils.caption(4, "Load the existing TF dataset")
TF = Fabric(locations=coreTf, modules=[""])

..............................................................................................
.       0.00s Load the existing TF dataset                                                   .
..............................................................................................
This is Text-Fabric 9.1.7
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html

114 features found and 0 ignored

In [10]:

api = TF.load(
    """
        qere qere_trailer
        g_word_utf8 g_cons_utf8 trailer
        g_word g_cons lex_utf8 lex lex0
        sp vs vt gn nu ps st
        uvf prs g_prs pfm vbs vbe
        languageISO
"""
)
api.makeAvailableIn(globals())

  0.00s loading features ...
   |     0.00s Dataset without structure sections in otext:no structure functions in the T-API
    10s All features loaded/computed - for details use TF.isLoaded()

Out[10]:

[('Computed',
  'computed-data',
  ('C Computed', 'Call AllComputeds', 'Cs ComputedString')),
 ('Features', 'edge-features', ('E Edge', 'Eall AllEdges', 'Es EdgeString')),
 ('Fabric', 'loading', ('TF',)),
 ('Locality', 'locality', ('L Locality',)),
 ('Nodes', 'navigating-nodes', ('N Nodes',)),
 ('Features',
  'node-features',
  ('F Feature', 'Fall AllFeatures', 'Fs FeatureString')),
 ('Search', 'search', ('S Search',)),
 ('Text', 'text', ('T Text',))]

The source string¶

Here is what we use as our starting point: the BHSA transliteration, with one or two tweaks.

The BHSA transliteration encodes also what comes after each word until the next word. Sometimes we want that extra bit, and sometimes not, and sometimes part of it.

Patterns¶

In [11]:

# punctuation
punctuation = re.compile(
    r"""
      (?: [ -]\s*\Z)        # space, (no maqef) or nospace
    | (?:
           0[05]            # sof pasuq or paseq
           (?:_[SNP])*      # nun hafukha, setumah, petuhah at end of verse
           \s*\Z
      )
    | (?:_[SPN]\s*\Z)       #  nun hafukha, setumah, petuhah between words
""",
    re.X,
)

In [12]:

split_punctuation = re.compile(
    r"""
  (.*?)                 # part before punctuation
  ((?:                  # punctuation itself
      (?: [ &-]\s*)         # space, maqef, or nospace
    | (?:
           0[05]            # sof pasuq or paseq
           (?:_[SNP])*      # nun hafukha, setumah, petuhah at end of verse
           \s*
      )
    | (?:_[SPN]\s*)         #  nun hafukha, setumah, petuhah between words
  )*)
""",
    re.X,
)

In [13]:

start_punct = re.compile(
    r"""
      (?: \A[ &-]\s*)       # space, maqef or nospace
    | (?:
           \A
           0[05]            # sof pasuq or paseq
           (?:_[SNP])*      # nun hafukha, setumah, petuhah at end of verse
           \s*
      )
    | (?:\A\s*_[SPN]\s*)    #  nun hafukha, setumah, petuhah between words
""",
    re.X,
)

In [14]:

noorigspace = re.compile(
    r"""
      (?: [&-]\Z)           # maqef or nospace
    | (?:
           0[05]            # sof pasuq or paseq
           (?:_[SNP])*      # nun hafukha, setumah, petuhah at end of verse
           \Z
      )
    | (?:_[SPN])+           #  nun hafukha, setumah, petuhah between words
""",
    re.X,
)

setumah and petuhah Usually, setumah and petuhah occur after the end-of-verse sign. In that case we can strip them. Sometimes they occur between words. Then we have to replace them by a space, because the words are otherwise adjacent. This operation must be performed before originals are glued together, because the _S and _P can only be reliably detected if they are at the end of a word. So the set_pet replacement is to be applied before phono(), in get_orig, but only if get_orig is used for phono().

In [15]:

set_pet_pattern = re.compile(r"((?:0[05])?)(_[SNP])+\Z")
tetra_lex = "JHWH/"

In [16]:

def set_pet_pattern_repl(match):
    (punct, nsp) = match.groups()
    sep = " § " if punct == "" and nsp != "" else ""
    return punct + sep
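
The effect of this replacement can be tried out on a few strings. The pattern and the replacement function are copied from the cells above. The wrapper strip_set_pet and the placeholder input strings are assumptions for the sake of illustration, not actual forms from the corpus.

```python
import re

set_pet_pattern = re.compile(r"((?:0[05])?)(_[SNP])+\Z")


def set_pet_pattern_repl(match):
    (punct, nsp) = match.groups()
    sep = " § " if punct == "" and nsp != "" else ""
    return punct + sep


def strip_set_pet(orig):
    # hypothetical helper: apply the replacement to one word string
    return set_pet_pattern.sub(set_pet_pattern_repl, orig)


for example in ("WORD00_P", "WORD00_S_P", "WORD_S", "WORD"):
    print(repr(example), "=>", repr(strip_set_pet(example)))
```

After a sof pasuq (00) the markers simply disappear. Between words they are turned into the temporary separator §, which the table specials2 further on maps to a space.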

Actions¶

In [17]:

def get_orig(w, punct=True, set_pet=False, tetra=True, give_ketiv=False):
    proto = F.g_word.v(w) + F.trailer.v(w)
    qere = F.qere.v(w)
    qere_trailer = F.qere_trailer.v(w)
    if qere_trailer == "":
        qere_trailer = "-"
    orig = proto if give_ketiv or qere is None else qere + qere_trailer
    if tetra and F.lex.v(w) == tetra_lex:
        (mat, sep) = split_punctuation.fullmatch(orig).groups()
        orig = "[ " + mat + " ]" + sep
    if not punct:
        orig = punctuation.sub("", orig)
    else:
        # if not noorigspace.search(orig):
        #    orig += ' '
        if not set_pet:
            orig = set_pet_pattern.sub(set_pet_pattern_repl, orig)
    return orig

Find the first occurrence of the string orig in the verse (ETCBC representation), then deliver the sequence of nodes corresponding to that occurrence. It turns out that too much is happening with accents, so we "normalize" the accents, by stripping all digits, for the sake of looking up.

In [18]:

digit = re.compile("[0-9]+")

In [19]:

def find_w(passage, orig, debug=False):
    if len(orig) == 0:
        return None
    vn = T.nodeFromSection(passage, lang="la")
    verse_words = L.d(vn, "word")
    results = None
    orig = orig.strip() + " "
    lvw = len(verse_words)
    for i in range(lvw):
        target = orig
        for j in range(i, lvw + 1):
            target = start_punct.sub("", target)
            target = digit.sub("", target)
            if len(target) == 0:
                results = verse_words[i:j]
                break
            if j >= lvw:
                break
            j_orig = digit.sub(
                "",
                get_orig(
                    verse_words[j],
                    punct=False,
                    tetra=False,
                    give_ketiv=True,
                ),
            ).rstrip("&")
            if target.startswith(j_orig):
                if debug:
                    TF.info("{}-{}: [{}] <= [{}]".format(i, j, j_orig, target))
                target = target[len(j_orig) :]
                if debug:
                    TF.info("{}-{}: [{}]".format(i, j, target))
                continue
            if debug:
                TF.info("{}-{}: [{}] <! [{}]".format(i, j, j_orig, target))
            break
    return results

In [20]:

# partition a list of nodes into chunks
# whenever a node has an orig string that does not end with a -, close the chunk and start a new one
def partition_w(wnodes):
    results = []
    cur_chunk = []
    orig = None
    for w in wnodes:
        cur_chunk.append(w)
        orig = get_orig(w, tetra=False)
        if orig.endswith("-"):
            continue
        results.append(tuple(cur_chunk))
        cur_chunk = []
    if len(cur_chunk):
        results.append(tuple(cur_chunk))
    return results

The phonological symbols¶

Here is the list of symbols that constitutes the mapping from BHSA transcription codes to a phonetic transcription. It is a series of triplets (bhsa symbol, name, phonetic symbol).

If changes are needed to the appearance of the phonetic transcriptions (not to its logic), here is the place to tweak.

Note that the order is important. In the final stage of the transformation process, these substitutions will be applied in the order they appear here.

This is especially important for, but not only for, the BGDKPT letters.
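
Why the order matters can be seen from a small excerpt of the table. The sketch below is an illustration only: the helper substitute and the input string are assumptions, and the real phono() function does more than plain string replacement. The four pairs are taken from the table below, in the same relative order.

```python
from collections import OrderedDict

# excerpt of the symbol table, in the order of the full table
ordered = OrderedDict([("ij", "î"), ("j", "y"), ("b", "v"), ("B", "b")])
# the same pairs in reverse order, to show what goes wrong
reversed_order = OrderedDict(reversed(list(ordered.items())))


def substitute(text, table):
    # apply every substitution in the order of the table
    for source, target in table.items():
        text = text.replace(source, target)
    return text


# "B" stands for a plosive b in an intermediate stage, "b" for a fricative one
print(substitute("Bijb", ordered))
print(substitute("Bijb", reversed_order))
```

In the right order, ij is consumed as a long hireq before a lone j can become y, and the fricative b has become v before the temporary B is turned into the plosive b. In the reversed order the plosive is first written as b and then mistaken for a fricative, and the long hireq is never recognized.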

In [21]:

specials = (
    (">", "alef", "ʔ"),
    ("<", "ayin", "ʕ"),
    ("v", "tet", "ṭ"),
    ("y", "tsade", "ṣ"),
    ("x", "chet", "ḥ"),
    ("c", "shin", "š"),
    ("f", "sin", "ś"),
    ("#", "s(h)in", "ŝ"),
    ("ij", "long hireq", "î"),
    ("I", "short hireq", "i"),
    (";j", "long tsere", "ê"),
    ("ow", "long holam", "ô"),
    ("w.", "long `qibbuts`", "û"),
    ("ej", "e glide", "eʸ"),
    ("j", "yod", "y"),
    (":a", "hataf patach", "ᵃ"),
    (":@", "hataf qamats", "ᵒ"),
    (":e", "hataf segol", "ᵉ"),
    ("%", "schwa mobile", "ᵊ"),
    (":", "schwa quiescens", ""),
    ("@", "qamats gadol", "ā"),
    ("a", "patach", "a"),
    ("`", "furtive patach", "ₐ"),
    ("+", "qamats", "å"),
    ("e", "segol", "e"),
    (";", "tsere", "ē"),
    ("i", "hireq", "i"),
    ("o", "holam", "ō"),
    ("^", "qamats qatan", "o"),
    ("u", "qibbuts", "u"),
    ("b.", "b plosive", "B"),
    ("g.", "g plosive", "G"),
    ("d.", "d plosive", "D"),
    ("k.", "k plosive", "K"),
    ("p.", "p plosive", "P"),
    ("t.", "t plosive", "T"),
    ("b", "b fricative", "v"),
    ("g", "g fricative", "ḡ"),
    ("d", "d fricative", "ḏ"),
    ("k", "k fricative", "ḵ"),
    ("p", "p fricative", "f"),
    ("t", "t fricative", "ṯ"),
    ("B", "b plosive", "b"),
    ("G", "g plosive", "g"),
    ("D", "d plosive", "d"),
    ("K", "k plosive", "k"),
    ("P", "p plosive", "p"),
    ("T", "t plosive", "t"),
    ("w", "waw", "w"),
    ("l", "lamed", "l"),
    ("m", "mem", "m"),
    ("n", "nun", "n"),
    ("r", "resh", "r"),
    ("z", "zajin", "z"),
    ("!", "primary accent", "ˈ"),
    ("/", "secundary accent", "ˌ"),
    ("&", "maqef", "-"),
    ("*", "masora", "*"),
)

In [22]:

specials2 = (
    ("$", "sof pasuq", "."),
    ("|", "paseq", " "),
    ("§", "interword setumah and petuhah", " "),
)

Assembling the symbols in dictionaries¶

We compile the table of symbols in handy dictionaries for ease of processing later.

We need to quickly detect the dagesh lenes later on, so we store them in a dictionary.

Our treatment of accents is still primitive.

We ignore some accents (irrelevant accents below) and we consider some accents as indicators of a mere secondary accent (secundary accents below).

The sound_dict is the resulting (ordered) mapping of all source characters to "phonetic" characters.

In [23]:

dagesh_lenes = {"b.", "g.", "d.", "k.", "p.", "t."}
dagesh_lene_dict = dict()

In [24]:

irrelevant_accents = (
    ("01", "segol"),  # occurs always with another accent
    ("03", "pashta"),  # by definition on last syllable: not relevant for accent
    ("04", "telisha qetana"),
    ("14", "telisha gedola"),
    ("24", "telisha qetana"),
    ("44", "telisha gedola"),
)
secundary_accents = (
    ("71", "merkha"),  # ??
    ("63", "qadma"),  # ??
    ("73", "tipeha"),  # ??
)
punctuation_accents = (
    ("00", "sof pasuq"),
    ("05", "paseq"),
)

In [25]:

known_accents = {
    x[0] for x in irrelevant_accents + secundary_accents + punctuation_accents
}

In [26]:

primary_accents = {
    "{:>02}".format(i) for i in range(100) if "{:>02}".format(i) not in known_accents
}
sound_dict = collections.OrderedDict()
sound_dict2 = collections.OrderedDict()

In [27]:

for (sym, let, glyph) in specials:
    if sym in dagesh_lenes:
        dagesh_lene_dict[sym[0]] = glyph
    else:
        sound_dict[sym] = glyph

In [28]:

for (sym, let, glyph) in specials2:
    sound_dict2[sym] = glyph

Patterns¶

The phono() function that we will define (far) below, performs an ordered sequence of transformations. Most of these are defined as regular expressions, and some parts of those expressions occur over and over again, e.g. subpatterns for vowel and consonant.

Here we define the shortcuts that we are going to use in the regular expressions.

Details of the matching process¶

Normally, when a pattern matches a string, the string is consumed: the parts of the pattern that match consume corresponding stretches of the string. However, in many cases a pattern specifies a context in which a match should be found. In those cases we do not want the context parts of the pattern to consume the string, because another relevant match could start inside those parts.

Regular expressions offer a solution for that: look-ahead and look-behind assertions, and we use them frequently.

(?<= before-pattern ) pattern (?= behind-pattern )

A match of this pattern in a string is a portion of the string that matches pattern, provided that portion is preceded by before-pattern and followed by behind-pattern.

If there is a match, and new matches must be searched for, the search will start right after pattern.

Instead of the above positive look-ahead and look-behind assertions, there are also negative variants:

(?<! before-pattern ) pattern (?! behind-pattern )

In those cases the match is good if the before-pattern does not match the preceding material and the behind-pattern does not match the following material.
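
A small experiment with Python's re module shows the difference between context that is consumed and context that is only asserted. The strings and the replacement letters are made up for the illustration.

```python
import re

text = "ababa"

# context is consumed: the a after the first b is used up by the first match,
# so the second b is no longer preceded by an unconsumed a
consuming = re.sub(r"aba", "ava", text)

# context is only asserted: both b's are found, they share the a in the middle
asserting = re.sub(r"(?<=a)b(?=a)", "v", text)

# negative look-behind: only the b that is not preceded by an a is changed
negative = re.sub(r"(?<!a)b", "B", "ab b")

print(consuming, asserting, negative)
```

With the look-around version the search for the next match resumes right after the matched b, so the a that follows it is still available as context for the next b.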

In Python there is a restriction on look-behind patterns: they must be patterns that only have matches of a predictable, fixed length. That will make some of our patterns slightly more complicated. For example, vowels can be simple or complex, and hence have variable length. If we want to specify a consonant, provided it is preceded by a vowel, we have to be careful.

In regular expressions there are greedy, non-greedy and possessive quantifiers. Greedy ones try to match as many times as possible at first; non-greedy ones try to match as few times as possible at first. Possessive quantifiers are like greedy ones, but greedy ones will give back occurrences if that helps to achieve a match. Possessive ones do not do that.

kind	greedy	non-greedy	possessive
0 or more	`*`	`*?`	`*+`
1 or more	`+`	`+?`	`++`
at least n, at most m	`{n,m}`	`{n,m}?`	`{n,m}+`

For example, the pattern [ab]*b matches strings of as and bs that end in a b. In order to match the string aaaaab, the [ab]* part starts by greedily consuming the whole string, but after discovering that the b part of the pattern should also match something, the [ab]* part reluctantly gives back one occurrence. That will do the trick.

However, [ab]*+b will not match aaaaab, because the possessive quantifier gives nothing back.

Possessive quantifiers are desirable in combination with negative look-ahead assertions.

For example, take [ab]*+(?!c). Matched at the start of a string, this matches a run of as and bs that is not followed by c. So it matches ababab but not abababc. However, the non-possessive variant, [ab]*(?!c), matches both. So how does it match abababc? First, the [ab]* part matches all as and bs. Then the look-ahead assertion that c does not follow is violated. So [ab]* backtracks one occurrence, a b. At that point the look-ahead assertion finds a b, which is not c, and the match succeeds.

Python's re module only gained possessive quantifiers in version 3.11, and the expressions below do not rely on them. This makes some of them more complicated than they would otherwise be.
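
The backtracking behaviour described above can be observed directly. The sketch below also shows one well-known way to get possessive behaviour without a possessive quantifier: capture the run inside a look-ahead and consume it with a backreference. This emulation is an assumption on our part for the sake of illustration. It is not necessarily how the expressions in this notebook work around the limitation.

```python
import re

# greedy: [ab]* gives back the last b so that (?!c) can succeed
greedy = re.match(r"[ab]*(?!c)", "abababc")
print(greedy.group())

# possessive-like: the look-ahead captures the whole run of a's and b's,
# \1 consumes exactly that run, and nothing is given back afterwards
possessive_like = re.compile(r"(?=([ab]*))\1(?!c)")
print(possessive_like.match("abababc"))
print(possessive_like.match("ababab").group())

# plain greedy backtracking is what makes [ab]*b match aaaaab at all
print(re.fullmatch(r"[ab]*b", "aaaaab") is not None)
```

The greedy version reports a match that stops one character early, which is exactly the kind of unintended match that a negative assertion after a quantifier can produce.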

We want to test for vowels in look-behind conditions. Python insists that look-behind conditions match patterns of fixed length. Vowels have variable length, so we need to take a bit more context. This extra context depends on whether the vowel occurs in front of a consonant or after it: vowel1 is for before, vowel2 is for after, and both are usable in look-behind conditions. vowel matches purely vowels, of variable length, and is not usable in look-behind conditions.

In [29]:

vowel1 = r"(?:(?::[ea@])|(?:w\.)|(?:[i;]j)|(?:ow)|(?:.[%@\^;aeiIou`]))"
vowel2 = r"(?:(?::[ea@])|(?:w\.)|(?:[i;]j)|(?:ow)|(?:[%@\^;aeiIou`].))"
vowel = r"(?:(?::[ea@])|(?:w\.)|(?:[i;]j)|(?:ow)|(?:[%@\^;aeiIou`]))"
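
The difference between these patterns shows up as soon as they are put inside a look-behind. In the sketch below, vowel1 and vowel are copied from the cell above. The helper compiles and the test strings are assumptions made for the illustration.

```python
import re

vowel1 = r"(?:(?::[ea@])|(?:w\.)|(?:[i;]j)|(?:ow)|(?:.[%@\^;aeiIou`]))"
vowel = r"(?:(?::[ea@])|(?:w\.)|(?:[i;]j)|(?:ow)|(?:[%@\^;aeiIou`]))"


def compiles(pattern):
    # report whether Python accepts the pattern
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# every alternative of vowel1 is exactly two characters wide: accepted
print(compiles("(?<={v})b".format(v=vowel1)))
# vowel has alternatives of one and of two characters: rejected
print(compiles("(?<={v})b".format(v=vowel)))

# a b preceded by a vowel, with one character of extra context in front
lookbehind_b = re.compile("(?<={v})b".format(v=vowel1))
print(lookbehind_b.search("d@b") is not None)
print(lookbehind_b.search("rdb") is not None)
```

The price of the fixed width is the extra character of context: in vowel1 a simple vowel is always matched together with the character in front of it.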

In [30]:

# lvowel are long vowels only (including compositions)
# svowel are short vowels only, including composite schwas
lvowel1 = r"(?:(?:w\.)|(?:[i;]j)|(?:ow)|(?:.[@;o]))"
svowel = r"(?:(?::[ea@])|(?:[%@\^;aeiIou`]))"

In [31]:

gadol = sound_dict["@"]
qatan = sound_dict["^"]
a_like = {":a", "a"}
o_like = {":@", "o", "ow", "u", "w."}
e_like = {":", ":e", ";", ";j", "e", "i", "ij"}

In [32]:

# complex i/w vowel: the composite vowels with waw and yod, after translation
complex_i_vowel = "".join(sound_dict[s] for s in {"ij", ";j"})
complex_w_vowel = "".join(sound_dict[s] for s in {"ow"})

In [33]:

# consonants
ncons = "[^>bgdhwzxvjklmns<pyqrfct _&$-]"  # not a consonant
cons = "[>bgdhwzxvjklmns<pyqrfct]"  # any consonant
consx = "[bgdwzxvjklmns<pyqrfct]"  # any consonant except alef
bgdkpt = "[bgdkpt]"  # begadkefat consonant
nbgdkpt = "[wzxvjlmns<yqrfc]"  # non-begadkefat consonant
prep = "[bkl]"  # proclitic preposition

accents

In [34]:

acc = "[ˈˌ]"  # primary and secundary accent

Regular expressions¶

Here are the patterns, but also the replacement functions we are going to carry out when the patterns match. How exactly the patterns and replacement functions hang together, is a matter for the phono function itself.

Rafe and furtive patah¶

Rafe¶

The rafe indicates a fricative pronunciation. It cancels a dagesh lene on a BGDKPT letter. If it occurs in other situations, we ignore it.

Furtive patah¶

We have to reverse any CV pattern at word ends where the V is a patah, and the C is a guttural (i.e. cheth, ayin or he-mappiq).

If there is an accent on the guttural, we ignore it in these cases, because the guttural does not initiate a syllable.

rafe

In [35]:

rafe = re.compile(r"({b})\.,".format(b=bgdkpt))

In [36]:

def rafe_repl(match):
    return match.group(1)

In [37]:

# furtive patah
# note that we will deliberately lose any accent on the guttural
furtive_patah = re.compile(r"([x<]|(?:h\.))(?:[/!]?)a(?=\Z|[ &-])")

In [38]:

def furtive_patah_repl(match):
    return "`" + match.group(1)
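
The two replacements can be tried out in isolation. The patterns and replacement functions are copied from the cells above. The lowercase input strings are illustrative stand-ins for intermediate stages, not guaranteed to be actual forms from the corpus.

```python
import re

bgdkpt = "[bgdkpt]"
rafe = re.compile(r"({b})\.,".format(b=bgdkpt))
furtive_patah = re.compile(r"([x<]|(?:h\.))(?:[/!]?)a(?=\Z|[ &-])")


def rafe_repl(match):
    return match.group(1)


def furtive_patah_repl(match):
    return "`" + match.group(1)


# rafe (,) cancels the dagesh (.) on a BGDKPT letter
print(rafe.sub(rafe_repl, "b.,a"))

# word-final guttural + patah is reversed; ` is the code for furtive patah
print(furtive_patah.sub(furtive_patah_repl, "rw.xa"))
# an accent mark between guttural and patah is dropped
print(furtive_patah.sub(furtive_patah_repl, "rw.x!a "))
# he with mappiq counts as a guttural as well
print(furtive_patah.sub(furtive_patah_repl, "g@boh.a"))
# no change when the patah is not at the end of the word
print(furtive_patah.sub(furtive_patah_repl, "xam"))
```

The look-ahead at the end of furtive_patah makes sure that the space, maqef or nospace sign after the word is left in place.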

Accents¶

Patterns¶

explicit accents

In [39]:

# let's assume that any cantillation mark or accent indicates that the vowel is stressed
# except for some types of mark (qadma, pashta)
sep_accent = re.compile("([0-9]{2})")
remove_accent = re.compile("|".join("~{}".format(x[0]) for x in irrelevant_accents))
primary_accent = re.compile("|".join("~{}".format(x) for x in primary_accents))
secundary_accent = re.compile("|".join("~{}".format(x[0]) for x in secundary_accents))
punctuation_accent = re.compile(
    "({})".format("|".join("~{}".format(x[0]) for x in punctuation_accents))
)
condense_accents = re.compile("({v})([!/]+)".format(v=vowel))

In [40]:

def sep_accent_repl(match):
    return "~" + match.group(1)

In [41]:

def condense_accents_repl(match):
    accent = "!" if "!" in match.group(2) else "/"
    return accent + match.group(1)
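
The explicit accent stage can be summarized in a condensed sketch. It folds the separate regular expressions above into a single callback, so it is a simplification and not the code that phono() runs. The names classify and explicit_accents are assumptions, and the input strings are illustrative BHSA-style words.

```python
import re

vowel = r"(?:(?::[ea@])|(?:w\.)|(?:[i;]j)|(?:ow)|(?:[%@\^;aeiIou`]))"
irrelevant = {"01", "03", "04", "14", "24", "44"}
secondary = {"71", "63", "73"}
punctuation = {"00", "05"}
# every other two-digit code counts as a primary accent
primary = {"{:02d}".format(i) for i in range(100)} - irrelevant - secondary - punctuation

sep_accent = re.compile("([0-9]{2})")
condense_accents = re.compile("({v})([!/]+)".format(v=vowel))


def classify(match):
    code = match.group(1)
    if code in irrelevant:
        return ""
    if code in secondary:
        return "/"
    if code in primary:
        return "!"
    return "~" + code  # punctuation codes are dealt with in a later stage


def condense(match):
    # keep one mark, primary wins, and put it in front of the vowel
    accent = "!" if "!" in match.group(2) else "/"
    return accent + match.group(1)


def explicit_accents(orig):
    result = orig.lower().replace("_", " ")
    result = sep_accent.sub(classify, result)
    return condense_accents.sub(condense, result)


for example in ("D.@B@75R", "R;>CI73JT", "MAJIM03"):
    print(example, "=>", explicit_accents(example))
```

Note that the accent mark ends up in front of the vowel it belongs to, which is also where the stress marks ˈ and ˌ appear in the final transliteration.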

In [42]:

# implicit accents
last_part = re.compile(r"([^&-]*)\Z")
default_accent1 = re.compile(r"({v}`?{c}?\.?(?:\Z|[ ]))".format(v=svowel, c=cons))
default_accent2 = re.compile(r"({v}(?:\Z|[ ]))".format(v=lvowel1))
strip_accents = re.compile(r"[0-9*]")

In [43]:

# wrong last accents
last_accent = re.compile(r"[/!]+(?=[ ]|\Z)")

In [44]:

def default_accent_repl(match):
    return "/" + match.group(1)

In [45]:

def punctuation_accent_repl(match):
    if match.group(1) == "~00":
        return " $"
    return " | "

Separate the phonetic representation from the interword material after it. To be used at the end of phono(). specials2 specifies how punctuation (sof pasuq, paseq, interword setumah-petuhah) is translated.

In [46]:

phono_sep = re.compile("(.*?)([ {}]*)".format("".join(x[2] for x in specials2)))
multiple_space = re.compile("  +")

In [47]:

verse_end_phono = re.compile(r"(\. *)\Z")

In [48]:

def verse_end_phono_repl(match):
    return match.group(1).replace(" ", "")

Actions¶

In [49]:

stats = collections.Counter()

In [50]:

def doaccents(orig, debug=False, count=False):
    dout = []

    # prepare
    if debug:
        dout.append(("orig", orig))
    if count:
        pre = orig
    result = orig.lower().replace("_", " ")
    if debug:
        dout.append(("trim", result))
    if count and pre != result:
        stats["trim"] += 1

    # explicit accents
    if count:
        pre = result
    result = sep_accent.sub(sep_accent_repl, result)
    result = remove_accent.sub("", result)
    result = secundary_accent.sub("/", result)
    result = primary_accent.sub("!", result)
    result = condense_accents.sub(condense_accents_repl, result)
    if debug:
        dout.append(("accents", result))
    if count and pre != result:
        stats["accents"] += 1

    # punctuation
    if count:
        pre = result
    result = punctuation_accent.sub(punctuation_accent_repl, result)
    result = strip_accents.sub("", result)
    if debug:
        dout.append(("punctuation", result))
    if count and pre != result:
        stats["punctuation"] += 1

    # rafe
    if count:
        pre = result
    result = rafe.sub(rafe_repl, result)
    result = result.replace(",", "")
    if debug:
        dout.append(("rafe", result))
    if count and pre != result:
        stats["rafe"] += 1

    # furtive patah
    if count:
        pre = result
    result = furtive_patah.sub(furtive_patah_repl, result)
    if debug:
        dout.append(("furtive_patah", result))
    if count and pre != result:
        stats["furtive_patah"] += 1

    # implicit accents
    if count:
        pre = result
    hotpart = last_part.search(result).group(1)
    if "!" not in hotpart and "/" not in hotpart:
        result = default_accent1.sub(default_accent_repl, result)
        if "/" not in result:
            result = default_accent2.sub(default_accent_repl, result)
    result = last_accent.sub("", result)
    if debug:
        dout.append(("default accent", result))
    if count and pre != result:
        stats["default_accent"] += 1

    # deliver
    return (result, dout) if debug else result
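The implicit accent step only looks at the last part of a glued word group. The sketch below isolates that check, using the last_part and last_accent patterns from above; the helper name needs_default_accent and the input strings are hypothetical.

```python
import re

# Same patterns as in the cells above.
last_part = re.compile(r"([^&-]*)\Z")
last_accent = re.compile(r"[/!]+(?=[ ]|\Z)")


def needs_default_accent(word):
    """Hypothetical helper: True if the last part of the word has no accent yet."""
    hotpart = last_part.search(word).group(1)
    return "!" not in hotpart and "/" not in hotpart


print(needs_default_accent("w:-h@->@rec"))    # last part is ">@rec"
print(needs_default_accent("b.:-r;>c!ijt"))   # last part already has "!"
print(last_accent.sub("", "melek/"))          # dangling final accent removed
```

Only when the check succeeds does doaccents try the default accent patterns; the dangling accent removal runs in every case.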

Qamets gadol and qatan¶

Patterns¶

qamets qatan NB: all patterns stipulate that the qamets (@) in question is unaccented

In [51]:

# near end of word:
qamets_qatan1 = re.compile(
    r"(?<={c})(\.?)@(?={c}(?:\.?[/!]?(?:[ &-]|\Z)))".format(c=consx)
)

In [52]:

# before dagesh forte:
qamets_qatan2 = re.compile(r"(?<={c})(\.?)@(?={c}\.)".format(c=cons))

In [53]:

# if the following consonant is BGDKPT and does not have dagesh, the @ is in an open syllable:
qamets_qatan3 = re.compile(
    r"(?<={c})(\.?)@(?={c}:(?:{nb}|(?:{b}\.)))".format(c=cons, b=bgdkpt, nb=nbgdkpt)
)

In [54]:

# assimilation of qamets with following composite schwa of type (chatef qamets),
#     but if the qamets is under a preposition BCL, not if it is under the article H:
qamets_qatan4a = re.compile(r"(?<={p})(\.?[!/]?)@(?=-{c}:@)".format(p=prep, c=cons))

In [55]:

#     or word-internal
qamets_qatan4b = re.compile(r"(?<={c})(\.?[!/]?)@(?={c}:@)".format(c=cons))

In [56]:

# before another qamets qatan, provided the syllable is unaccented
qamets_qatan5 = re.compile(r"(?<={c})(\.?)@(?={c}\.?[/!]?\^)".format(c=cons))

In a pronominal suffix, qamets never becomes qatan. This pattern will be applied only to words that do have a non-empty pronominal suffix. The pattern will spot the qamets qatan in front of the last consonant, if there is such a qatan.

In [57]:

qamets_qatan_prs = re.compile(r"\^(?=[0-9]*{c}\.?[/!]?(?:[ &-]|\Z))".format(c=cons))

In [58]:

def qamets_qatan_repl(match):
    return match.group(1) + "^"
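As an illustration of how these patterns and qamets_qatan_repl cooperate, the sketch below runs the "before dagesh forte" pattern on a few made-up strings. The consonant class CONS is an assumption that stands in for the notebook's own cons, which is defined earlier; the inputs are hypothetical.

```python
import re

# Assumption: a simplified consonant class standing in for the notebook's `cons`.
CONS = "[>bgdhwzxvjklmns<pyqrfct]"

# Same shape as qamets_qatan2 above: an unaccented qamets before a consonant with dagesh.
qamets_qatan2 = re.compile(r"(?<={c})(\.?)@(?={c}\.)".format(c=CONS))


def qamets_qatan_repl(match):
    # keep an optional dagesh dot, turn the qamets (@) into a qatan (^)
    return match.group(1) + "^"


for word in ["x@n.;nij", "k.@l.", "d.@b@r"]:
    print(word, "=>", qamets_qatan2.sub(qamets_qatan_repl, word))
```

The third string has no consonant with dagesh after a qamets, so it passes through unchanged.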

In [59]:

# there are exceptions to the heuristic of interpreting qamets by voting between occurrences
qamets_qatan_x = """
BJT/ => 1A
JM/ => 1O
JWMM => 2A
JRB<M/ => 1A
JHWNTN/ => 2A
"""

In [60]:

xxx = """
<YBT/ => 2A
"""

In [61]:

# there are unaccented conjugated verb forms that must not be subjected to qamets-qatan transformation
qamets_qatan_verb_x = {
    "verb qal perf 3sf",
    "verb qal perf 3p-",
    "verb nif impf 1s-",
    "verb nif impf 1p-",
    "verb nif impf 2sf",
    "verb nif impf 2pm",
    "verb nif impf 3pm",
    "verb nif impv 2sf",
    "verb nif impv 2pm",
}
qqv_experimental = {
    "verb qal impf 3pm",
}

In [62]:

qamets_qatan_verb_x |= qqv_experimental

In [63]:

def qamets_qatan_verb_x_repl(match):
    return match.group(1) + "@"

for the use of applying individual corrections:

Actions¶

Here is the function that carries out rule based qamets qatan detection, without going into verb paradigms and exceptions. It is the first go at it.

In [64]:

def doplainqamets(word, accentless=False, debug=False, count=False):
    dout = []
    result = word
    if accentless:
        result = result.replace("!", "").replace("/", "")
    if count:
        pre = result
    result = qamets_qatan1.sub(qamets_qatan_repl, result)
    if debug:
        dout.append(("qamets_qatan1", result))
    if count and pre != result:
        stats["qamets_qatan1"] += 1

    if count:
        pre = result
    result = qamets_qatan2.sub(qamets_qatan_repl, result)
    if debug:
        dout.append(("qamets_qatan2", result))
    if count and pre != result:
        stats["qamets_qatan2"] += 1

    if count:
        pre = result
    result = qamets_qatan3.sub(qamets_qatan_repl, result)
    if debug:
        dout.append(("qamets_qatan3", result))
    if count and pre != result:
        stats["qamets_qatan3"] += 1

    if count:
        pre = result
    result = qamets_qatan4a.sub(qamets_qatan_repl, result)
    if debug:
        dout.append(("qamets_qatan4a", result))
    if count and pre != result:
        stats["qamets_qatan4a"] += 1

    if count:
        pre = result
    result = qamets_qatan4b.sub(qamets_qatan_repl, result)
    if debug:
        dout.append(("qamets_qatan4b", result))
    if count and pre != result:
        stats["qamets_qatan4b"] += 1

    return (result, dout) if debug else result

Schwa and dagesh¶

Schwa¶

The rules for the schwa that I have found are contradictory.

These are some of the rules I have seen:

1. if two consecutive consonants both have a schwa, the second one is mobile;
2. a schwa under a consonant with dagesh forte is mobile;
3. a schwa under the last consonant of a word is quiescens;
4. a schwa on a consonant that follows a long vowel is mobile.

But there are examples where rules 1 and 3 apply at the same time.

And the qal perfect 2 sg f forms end with a tav with schwa, often preceded by a consonant that also has a schwa. In this case the tav has a dagesh, which by the rules for dagesh cannot be a lene. So it must be a forte. So this violates rule 2.

We will cut this matter short, and make any final schwa quiescens.

As to rule 4, there are cases where the schwa in question is also followed by a final consonant with schwa. In those cases it seems that the schwa in question is silent.

In [65]:

# mobile schwa
mobile_schwa1 = re.compile(
    r"""
    (                           # here is what goes before the schwa in question
        (?:(?:\A|[ &-]).\.?)|   # an initial consonant or
        (?:.\.)|                # a consonant with dagesh (which must be forte then) or
        (?::.\.?)|              # another schwa and then a consonant
        (?:                     # a long vowel such as the following
            (?:
                @>?|               # qamets possibly with alef as mater lectionis (the remaining qametses are gadol)
                ;j?|               # tsere, possibly followed by yod
                ij|                # hireq with yod
                o[>w]?|            # holam possibly followed by alef or waw
                w\.                # waw with dagesh
            )
            {c}                 # and then a consonant
        )
    )
    :
    (?![@ae])                   # the schwa may not be composite
""".format(
        c=cons
    ),
    re.X,
)

In [66]:

mobile_schwa2 = re.compile(
    r":(?={b}(?:[^.]|[ &-]|\Z))".format(b=bgdkpt)
)  # before `BGDKPT` letter without dagesh

In [67]:

# second last consonant with schwa when last consonant also has schwa
mobile_schwa3 = re.compile(r"[%:](?={c}\.?{a}?[%:](?:[ &]|\Z))".format(a=acc, c=cons))

In [68]:

# all schwas at the end of the word are quiescens, only if the words are not glued together
mobile_schwa4 = re.compile(r"[%:](?=[ &]|\Z)")

In [69]:

def mobile_schwa1_repl(match):
    return match.group(1) + "%"
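The decision to silence final schwas can be sketched in isolation. The code below rebuilds mobile_schwa3 and mobile_schwa4 with assumed stand-ins CONS and ACC for the notebook's cons and acc, and applies them to a hand-made string that ends in two consonants with schwa.

```python
import re

# Assumptions: simplified stand-ins for the notebook's `cons` and `acc`.
CONS = "[>bgdhwzxvjklmns<pyqrfct]"
ACC = "[/!]"

# Same shapes as mobile_schwa3 and mobile_schwa4 above.
mobile_schwa3 = re.compile(r"[%:](?={c}\.?{a}?[%:](?:[ &]|\Z))".format(a=ACC, c=CONS))
mobile_schwa4 = re.compile(r"[%:](?=[ &]|\Z)")

word = "k@tab:t.:"                       # hypothetical input
step3 = mobile_schwa3.sub("", word)      # schwa before the final consonant is dropped
step4 = mobile_schwa4.sub("", step3)     # word-final schwa is dropped
print(word, "=>", step3, "=>", step4)

# A schwa before a hyphen (glued words) is not word-final and stays.
print(mobile_schwa4.sub("", "b.:-r"))
```

The dagesh dot on the tav survives both steps; it is dealt with by the dagesh patterns that follow.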

In [70]:

# dagesh
dages_forte_lene = re.compile(
    r"(?<={v1})(-*)({b})\.(?=[/!]?{v2})".format(v1=vowel1, v2=vowel, b=bgdkpt)
)
dages_forte = re.compile(
    r"(?<={v1})(-?[h>]*-*)([^h])\.(?=[/!]?{v2})".format(v1=vowel1, v2=vowel)
)
dages_lene = re.compile(r"({b})\.".format(b=bgdkpt))

In [71]:

def dages_forte_lene_repl(match):
    return match.group(1) + (dagesh_lene_dict[match.group(2)] * 2)

In [72]:

def dages_lene_repl(match):
    return dagesh_lene_dict[match.group(1)]

In [73]:

def dages_forte_repl(match):
    return match.group(1) + match.group(2) * 2

Mater lectionis and final fixes¶

In [74]:

# silent aleph
silent_aleph = re.compile("(?<=[^ &-])>(?!(?:[/!]|{v}))".format(v=vowel))

In [75]:

# final mater lectionis
# I assume that heh and alef are only matres lectionis after a LONG vowel
last_ml = re.compile(r"(?<={v1})[>h]+(?=[ &-]|\Z)".format(v1=lvowel1))
last_ml_jw = re.compile(r"jw(?=[ &-]|\Z)")

In [76]:

# mappiq heh
mappiq_heh = re.compile(r"h\.")

In [77]:

fixit_i = re.compile(r"([{v}])\.".format(v=complex_i_vowel))
fixit_w = re.compile(r"([{v}])\.".format(v=complex_w_vowel))
fixit = re.compile(r"(.)\.")

In [78]:

split_sep = re.compile(
    "^(.*?)([ .&$\n-]*)$"
)  # to split the result in the phono part and the interword part

In [79]:

def fixit_repl(match):
    return match.group(1) * 2

In [80]:

def fixit_i_repl(match):
    return match.group(1) + "j"

In [81]:

def fixit_w_repl(match):
    return match.group(1) + "w"
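The catch-all fixit pattern doubles whatever character still carries a dot after the more specific steps have run. A minimal sketch, with made-up input strings:

```python
import re

# Same pattern and replacement function as in the cells above.
fixit = re.compile(r"(.)\.")


def fixit_repl(match):
    # a leftover dot means gemination: repeat the character before it
    return match.group(1) * 2


for word in ["hin.eh", "melek"]:
    print(word, "=>", fixit.sub(fixit_repl, word))
```

fixit_i and fixit_w work the same way, except that they append j or w after a complex vowel instead of doubling the character.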

END OF REGULAR EXPRESSIONS AND REPLACEMENT FUNCTIONS

Qamets corrections¶

For some words we need specific corrections. The rules for qamets qatan are not specific enough.

Correction mechanism¶

We define a function apply_corr(wordq, corr) that can apply a correction instruction to wordq, which is a word in pre-transliterated form, i.e. a word that has undergone the transliteration steps up to and including qamets interpretation, including the special verb cases.

The corr is a comma-separated list of basic instructions, which have the form number letter. It will interpret the number-th qamets as a gadol or a qatan, depending on whether letter = ā or o.

Precomputed list of corrections¶

Later on we compile a dictionary qamets_corrections of pre-computed corrections. This dictionary is keyed by the pre-transliterated form, and valued by the corresponding correction string. Here we initialize this dictionary.

The phono() function, which carries out the complete transliteration, looks in qamets_corrections by default, but this can be overridden. These corrections will not be carried out for the special verb cases.

In [82]:

qamets_corrections = {}  # dictionary of translits that must be corrected

apply correction instructions to a word

In [83]:

def apply_corr(wordq, corr):
    if corr == "":
        return wordq
    corrs = corr.split(",")
    indices = []
    for (i, ch) in enumerate(wordq):
        if ch == "^" or (ch == "@" and (i == 0 or wordq[i - 1] != ":")):
            indices.append(i)
    resultlist = list(wordq)
    for c in corrs:
        (pos, kind) = c
        pos = int(pos) - 1
        repl = "^" if kind == "o" else "@"
        if pos >= len(indices):
            TF.error("Correction {}: pos={} out of range {}".format(c, pos, indices))
            continue
        rpos = indices[pos]
        resultlist[rpos] = repl
    return "".join(resultlist)
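To show what a correction instruction does, here is a simplified restatement of apply_corr that runs without Text-Fabric. It is a sketch, not the notebook's function: the name apply_corr_sketch is hypothetical, it raises an error instead of logging through TF, and it reads the number from everything before the last character, so that numbers above 9 would also work. The input strings are made up.

```python
def apply_corr_sketch(wordq, corr):
    """Simplified, hypothetical variant of apply_corr for illustration."""
    if corr == "":
        return wordq
    # positions of every qamets (@ or ^) that is not part of a chatef qamets (:@)
    indices = [
        i
        for (i, ch) in enumerate(wordq)
        if ch == "^" or (ch == "@" and (i == 0 or wordq[i - 1] != ":"))
    ]
    chars = list(wordq)
    for instruction in corr.split(","):
        pos = int(instruction[:-1]) - 1  # 1-based number of the qamets
        kind = instruction[-1]           # "o" means qatan, anything else gadol
        if pos >= len(indices):
            raise ValueError("pos={} out of range {}".format(pos, indices))
        chars[indices[pos]] = "^" if kind == "o" else "@"
    return "".join(chars)


print(apply_corr_sketch("y@h:@rajim", "1o"))  # the chatef qamets is not counted
print(apply_corr_sketch("d.@b@r", "1o,2ā"))
```

Note that the chatef qamets in the first example is skipped when counting, so instruction 1 addresses the first full qamets.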

Feature value normalization¶

We need concise, normalized values for the lexical features.

In [84]:

undefs = {"NA", "unknown", "n/a", "absent"}

In [85]:

png = dict(
    NA="-",
    unknown="-",
    p1="1",
    p2="2",
    p3="3",
    sg="s",
    du="d",
    pl="p",
    m="m",
    f="f",
    a="a",
    c="c",
    e="e",
)
png["n/a"] = "-"

Lexical info¶

We need a label for lexical information such as part of speech, person, number, gender.

In [86]:

declensed = {"subs", "nmpr", "adjv", "prps", "prde", "prin"}

In [87]:

def get_lex_info(w):
    sp = F.sp.v(w)
    lex_infos = [sp]
    if sp == "verb":
        lex_infos.extend(
            [
                F.vs.v(w),
                F.vt.v(w),
                "{}{}{}".format(png[F.ps.v(w)], png[F.nu.v(w)], png[F.gn.v(w)]),
            ]
        )
    elif sp in declensed:
        lex_infos.append("{}{}".format(png[F.nu.v(w)], png[F.gn.v(w)]))
    lex_info = " ".join(lex_infos)
    if sp == "verb" or sp in declensed:
        prs = F.g_prs.v(w)
        if prs not in undefs:
            lex_info += ",{}".format(prs.lower())
    return lex_info

In [88]:

def get_decl(lex_info):
    if lex_info is None:
        lex_info = ""
    parts = lex_info.split(",")
    return lex_info if len(parts) == 1 else parts[0]

In [89]:

def get_prs(lex_info):
    if lex_info is None:
        lex_info = ""
    parts = lex_info.split(",")
    return "" if len(parts) == 1 else parts[1]
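The label built by get_lex_info has the shape "declension info,suffix", and get_decl and get_prs take it apart again. The sketch below shows the round trip on hand-written labels. The helper split_lex_info is hypothetical and uses str.partition, which behaves like the two functions above as long as the label contains at most one comma; the suffix value in the example is made up.

```python
def split_lex_info(lex_info):
    """Hypothetical helper: return (decl, prs) for a lexical info label."""
    if lex_info is None:
        lex_info = ""
    (decl, _sep, prs) = lex_info.partition(",")
    return (decl, prs)


print(split_lex_info("verb qal perf 3sf,@h"))  # verb with a pronominal suffix
print(split_lex_info("subs sm"))               # noun without a suffix
print(split_lex_info(None))
```

The first component is what phono_qamets compares against qamets_qatan_verb_x; the second is what it inspects for a qamets in the suffix.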

The phono function¶

Here is the definition of the function that generates the phonological transliteration. It is a function with a big definition, so we have broken it into parts.

Phono parts¶

In [90]:

interesting_stats = [
    "total",
    "qamets_verb_suppress_qatan",
    "qamets_prs_suppress_qatan",
    "qamets_qatan_corrections",
]

if suppress_in_verb, phono will suppress qatan interpretation in certain verb paradigmatic forms
if suppress_in_prs, phono will suppress qatan interpretation in pronominal suffixes
if correct is 1, phono will apply individual corrections
if correct is 0, phono will not apply individual corrections
if correct is -1, phono will stop just before applying the qamets qatan corrections and return the intermediate result

In [91]:

def phono_qamets(
    ws,
    result,
    lex_info,
    debug,
    count,
    dout,
    suppress_in_verb,
    suppress_in_prs,
    correct,
    corrections,
):
    # qamets qatan

    # check whether we are in a verb paradigm that requires suppressing qamets => qatan
    if count:
        pre = result
    suppr = True
    decl = get_decl(lex_info)

    if suppress_in_verb:
        suppr = False
        if decl == "":
            if debug:
                dout.append(("qamets qatan", "no special verb form invoked"))
        elif decl not in qamets_qatan_verb_x:
            if debug:
                dout.append(("qamets qatan", "no special verb form: {}".format(decl)))
        elif "@" not in result:
            if debug:
                dout.append(("qamets qatan", "special verb form: no qamets present"))
        elif "!" in result:
            if debug:
                dout.append(
                    ("qamets qatan", "special verb form: primary accent present")
                )
            suppr = True
        else:
            suppr = True
            if count:
                stats["qamets_verb_suppress_qatan"] += 1
    else:
        if debug:
            dout.append(("qamets qatan", "suppression for verb forms is switched off"))
        suppr = False

    if suppr:
        if debug:
            dout.append(
                (
                    "qamets qatan",
                    "special verb form: qatan suppressed for {}".format(decl),
                )
            )
    else:
        if debug:
            (result, this_dout) = doplainqamets(result, debug=True, count=count)
            dout.extend(this_dout)
        else:
            result = doplainqamets(result, count=count)

    # check whether we have a pronominal suffix that requires suppressing qamets => qatan

    if count:
        pre = result
    suppr = True
    prs = get_prs(lex_info)
    if suppress_in_prs:
        suppr = False
        if prs == "":
            if debug:
                dout.append(("qamets qatan", "no pron suffix indicated"))
        elif "@" not in prs:
            if debug:
                dout.append(("qamets qatan", "pronominal suffix: no qamets present"))
        elif not qamets_qatan_prs.search(result):
            if debug:
                dout.append(
                    (
                        "qamets qatan",
                        "pron suffix {}: no qamets qatan present".format(prs),
                    )
                )
        else:
            suppr = True
            if count:
                stats["qamets_prs_suppress_qatan"] += 1
    else:
        if debug:
            dout.append(("qamets qatan", "suppression for pron suffix is switched off"))
        suppr = False

    if suppr:
        result = qamets_qatan_prs.sub("@", result)
        if debug:
            dout.append(
                ("qamets qatan", "pron suffix {}: qatan suppressed".format(prs))
            )
            dout.append(("qamets qatan prs", result))

    # now change gadol in qatan in front of other qatan
    if count:
        pre = result
    result = qamets_qatan5.sub(qamets_qatan_repl, result)
    if debug:
        dout.append(("qamets_qatan5", result))
    if count and pre != result:
        stats["qamets_qatan5"] += 1

    # handle desired corrections
    if count:
        pre = result
    if correct == -1:
        return (result, True)
    if correct == 1 and decl not in qamets_qatan_verb_x:
        if corrections is None:
            corrections = qamets_corrections
        parts = result.split("-")
        hotpart = parts[-1]
        wordq = phono(ws[-1], correct=-1, punct=False)
        if wordq in corrections:
            hotpartn = apply_corr(hotpart, corrections[wordq])
            if debug:
                dout.append(
                    ("qamets qatan", "correction: {} => {}".format(hotpart, hotpartn))
                )
            parts[-1] = hotpartn
            result = "-".join(parts)
    if debug:
        dout.append(("qamets_qatan_corr", result))
    if count and pre != result:
        stats["qamets_qatan_corrections"] += 1

    return (result, False)

In [92]:

def phono_patterns(result, debug, count, dout):

    # mobile schwa
    if count:
        pre = result
    result = mobile_schwa1.sub(mobile_schwa1_repl, result)
    if debug:
        dout.append(("mobile_schwa1", result))
    if count and pre != result:
        stats["mobile_schwa1"] += 1

    if count:
        pre = result
    result = mobile_schwa2.sub("%", result)
    if debug:
        dout.append(("mobile_schwa2", result))
    if count and pre != result:
        stats["mobile_schwa2"] += 1

    if count:
        pre = result
    result = mobile_schwa3.sub("", result)
    if debug:
        dout.append(("mobile_schwa3", result))
    if count and pre != result:
        stats["mobile_schwa3"] += 1

    if count:
        pre = result
    result = mobile_schwa4.sub("", result)
    if debug:
        dout.append(("mobile_schwa4", result))
    if count and pre != result:
        stats["mobile_schwa4"] += 1

    # dagesh
    if count:
        pre = result
    result = dages_forte_lene.sub(dages_forte_lene_repl, result)
    if debug:
        dout.append(("dagesh_forte_lene", result))
    if count and pre != result:
        stats["dagesh_forte_lene"] += 1

    if count:
        pre = result
    result = result.replace("ij.", "Ijj")
    result = dages_forte.sub(dages_forte_repl, result)
    if debug:
        dout.append(("dagesh_forte", result))
    if count and pre != result:
        stats["dagesh_forte"] += 1

    if count:
        pre = result
    result = dages_lene.sub(dages_lene_repl, result)
    if debug:
        dout.append(("dagesh_lene", result))
    if count and pre != result:
        stats["dagesh_lene"] += 1

    # silent aleph (but not in tetra)
    if count:
        pre = result
    if "[" not in result:
        result = silent_aleph.sub("", result)
    if debug:
        dout.append(("silent_aleph", result))
    if count and pre != result:
        stats["silent_aleph"] += 1

    # final mater lectionis (but not in tetra)
    if count:
        pre = result
    if "[" not in result:
        result = last_ml_jw.sub("ʸw", result)
        result = last_ml.sub("", result)
    if debug:
        dout.append(("last_ml", result))
    if count and pre != result:
        stats["last_ml"] += 1

    # mappiq heh
    if count:
        pre = result
    result = mappiq_heh.sub("h", result)
    if debug:
        dout.append(("mappiq_heh", result))
    if count and pre != result:
        stats["mappiq_heh"] += 1

    return result

In [93]:

def phono_symbols(ws, result, debug, count, dout):

    # split the result in parts corresponding with the word nodes of the original
    resultparts = result.split("-")
    results = []
    for (i, w) in enumerate(ws):
        resultp = resultparts[i]
        result = resultp
        # masora
        if F.qere.v(w) is not None:
            result = "*" + result

        for (sym, repl) in sound_dict.items():
            result = result.replace(sym, repl)
        if debug:
            dout.append(("symbols", result))

        # fix left over dagesh and mappiq
        if count:
            pre = result
        result = fixit_i.sub(fixit_i_repl, result)
        if debug:
            dout.append(("fixit_i", result))
        if count and pre != result:
            stats["fixit_i"] += 1

        if count:
            pre = result
        result = fixit_w.sub(fixit_w_repl, result)
        if debug:
            dout.append(("fixit_w", result))
        if count and pre != result:
            stats["fixit_w"] += 1

        if count:
            pre = result
        result = fixit.sub(fixit_repl, result)
        if count and pre != result:
            stats["fixit"] += 1
        if debug:
            dout.append(("fixit", result))

        if count:
            pre = result
        for (sym, repl) in sound_dict2.items():
            result = result.replace(sym, repl)
        if debug:
            dout.append(("punct", result))
        if count and pre != result:
            stats["punct"] += 1

        # zero width word boundary
        if count:
            pre = result
        result = multiple_space.sub(" ", result)
        result = result.replace("[ ", "[").replace(" ]", "]")  # tetra
        if debug:
            dout.append(("cleanup", result))
        if count and pre != result:
            stats["cleanup"] += 1
        results.append(result)

    return results

Phono whole¶

Here the rule fabrics are woven together, exceptions invoked.

In [94]:

def phono(
    ws,
    suppress_in_verb=True,
    suppress_in_prs=True,
    correct=1,
    corrections=None,
    inparts=False,
    debug=False,
    count=False,
    punct=True,
):
    if type(ws) is int:
        ws = [ws]
    if count:
        stats["total"] += 1
    dout = []
    # collect information
    orig = "".join(get_orig(w, punct=True) for w in ws)
    lex_info = get_lex_info(ws[-1])
    # strip punctuation at the end, if needed
    if not punct:
        orig = punctuation.sub("", orig)
    # account for ketiv-qere if in debug mode
    if debug:
        for w in ws:
            if F.qere.v(w) is not None:
                dout.append(
                    (
                        "ketiv-qere",
                        "{} => {}".format(
                            F.g_word.v(w), F.qere.v(w) + F.qere_trailer.v(w)
                        ),
                    )
                )
    # accents
    if debug:
        (result, dout) = doaccents(orig, debug=True, count=count)
    else:
        result = doaccents(orig, count=count)
    # qamets
    (result, deliver) = phono_qamets(
        ws,
        result,
        lex_info,
        debug,
        count,
        dout,
        suppress_in_verb,
        suppress_in_prs,
        correct,
        corrections,
    )
    if deliver:
        return (result, dout) if debug else result
    # patterns
    result = phono_patterns(result, debug, count, dout)
    # symbols
    results = phono_symbols(ws, result, debug, count, dout)
    result = "".join(results) if not inparts else results
    # deliver
    return (result, dout) if debug else result

Skeleton analysis¶

We have to do more work for the qamets. Sometimes a word form on its own is not enough to determine whether a qamets is gadol or qatan. In those cases, we analyse all occurrences of the same lexeme, and for each syllable position we measure whether an A-like vowel or an O-like vowel tends to occur in that syllable.

In order to do that, we need to compute a vowel skeleton for each word.

Stripping paradigmatic material¶

A word may have extra syllables, due to inflections, such as plurals, feminine forms, or suffixes. Let us call this the paradigmatic material of a word.

Now, we strip from the initial vowel skeleton a number of trailing vowels that corresponds to the number of consonants found in the paradigmatic material. This is rather crude, but it will do.

In [95]:

# we need the number of letters in a defined value of a morpho feature
def len_suffix(v):
    if v is None:
        return 0
    if v in undefs:
        return 0
    return len(v.replace("=", "").replace("W", "").replace("J", ""))

In [96]:

# we need a function that returns 1 for plural/dual subs/adj and for fem adj
def len_ending(sp, n, g):
    if sp == "subs":
        return 1 if n in {"pl", "du"} else 0
    if sp == "adjv":
        return 1 if n in {"pl", "du"} or g == "f" else 0
    return 0

In [97]:

# return the number of consonants in the suffixes
def len_morpho(w):
    return max(
        (
            len_suffix(F.prs.v(w)) + len_suffix(F.uvf.v(w)),
            len_ending(F.sp.v(w), F.nu.v(w), F.gn.v(w)),
        )
    )
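The crude stripping described above can be sketched without Text-Fabric. Below, len_suffix is restated as in the cell above, and strip_paradigmatic is a hypothetical helper that cuts as many trailing skeleton positions as there are counted letters. The feature values and the skeleton string are made up for the example.

```python
UNDEFS = {"NA", "unknown", "n/a", "absent"}


def len_suffix(v):
    # number of letters in a defined feature value, not counting =, W and J
    if v is None or v in UNDEFS:
        return 0
    return len(v.replace("=", "").replace("W", "").replace("J", ""))


def strip_paradigmatic(fullskel, n):
    """Hypothetical helper: drop the last n positions of a vowel skeleton."""
    return fullskel[0 : len(fullskel) - n]


print(len_suffix("HM"), len_suffix("NW"), len_suffix("absent"))
print(strip_paradigmatic("EāAE", len_suffix("NW")))
```

Waw and yod are left out of the count, so a suffix value such as NW contributes one position, not two.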

Skeleton patterns¶

Next, we reduce the vowel skeleton to a skeleton pattern. We are not interested in all vowels, only in whether the vowel is a qamets (gadol or qatan), A-like, O-like, or other (which we dub E-like).

In [98]:

# the qamets gadol/qatan skeleton
qamets_qatan_skel = re.compile("([^@^])")

In [99]:

# the vowel skeleton where the qamets gadol/qatan are preserved as @ and ^
# another o-like vowel becomes O (holam, qamets chatuf) (no waws nor yods)
# another a-like vowel becomes A (patah, patah chatuf) (no alefs)
silent_alef_start = re.compile(r"([ &-]|\A)>([!/]?(?:[^!/.:;@^aeiou]|\Z))")

In [100]:

def silent_alef_start_repl(match):
    return match.group(1) + "E" + match.group(2)

In [101]:

qamets_qatan_fullskel = re.compile(
    r"""
    (
        E                                         # replacement of silent initial alef without vowels
    |   (?::[@ae]?)                               # a (composite) schwa
    |   (?:[;i]j) | (?:ow) | (?:w.)               # a composite vowel
    |   [@a;eiou^]                                # a vowel point
    |   .                                         # anything else
    )
""",
    re.X,
)

In [102]:

def qamets_qatan_fullskel_repl(match):
    found = match.group(1)
    if found == "E":
        return "E"
    if found == "@":
        return gadol
    if found == "^":
        return qatan
    if found in a_like:
        return "A"
    if found in o_like:
        return "O"
    if found in e_like:
        return "E"
    return ""

In [103]:

def get_full_skel(w, debug=False):
    wordq = phono(w, correct=-1, punct=False)
    wordqr = silent_alef_start.sub(silent_alef_start_repl, wordq)
    fullskel = qamets_qatan_fullskel.sub(qamets_qatan_fullskel_repl, wordqr)
    ending_length = len_morpho(w)
    relevant_part = len(fullskel) - ending_length
    if debug:
        TF.info(
            "{}: {} => {} => {} : {} minus {} = {}".format(
                w,
                F.g_word.v(w),
                wordq,
                wordqr,
                fullskel,
                ending_length,
                fullskel[0:relevant_part],
            )
        )

    return fullskel[0:relevant_part]

Qamets gadol qatan: sophisticated¶

A lot of work is needed to get the qamets gadol-qatan right. This involves looking at accents, verb paradigms and special cases among the non-verbs.

Qamets gadol qatan: non-verbs¶

Sometimes a qamets is gadol or qatan for lexical reasons, i.e. it can not be derived by rules based on the word occurrence itself, but other occurrences have to be invoked.

All candidates¶

In [104]:

# find lexemes which have an occurrence with a qamets (except verbs)
utils.caption(0, "\tLooking for non-verb qamets")
qq_words = set()
qq_lex = collections.defaultdict(lambda: [])

|         11s 	Looking for non-verb qamets

In [105]:

for w in F.otype.s("word"):
    ln = F.languageISO.v(w)
    if ln != "hbo":
        continue
    sp = F.sp.v(w)
    if sp == "verb":
        continue
    orig = get_orig(w, punct=False, tetra=False)
    if "@" not in orig:
        continue  # no qamets in word
    word = doaccents(orig)
    lex = F.lex.v(w)
    if word in qq_words:
        continue
    qq_words.add(word)
    qq_lex[lex].append(w)
utils.caption(
    0, "\t{} lexemes and {} unique occurrences".format(len(qq_lex), len(qq_words))
)

|         13s 	4056 lexemes and 13451 unique occurrences

Filtering interesting candidates¶

In [106]:

utils.caption(0, "\tFiltering lexemes with varied occurrences")
qq_varied = collections.defaultdict(lambda: [])
nocc = 0
for lex in qq_lex:
    ws = qq_lex[lex]
    if len(ws) == 1:
        continue
    occs = []
    skel_set = set()
    has_qatan = False
    has_gadol = False
    for w in ws:
        wordq = phono(w, correct=-1, punct=False)
        skel = (
            qamets_qatan_skel.sub("", wordq.replace(":@", ""))
            .replace("@", gadol)
            .replace("^", qatan)
        )
        if gadol in skel:
            has_gadol = True
        if qatan in skel:
            has_qatan = True
        skel_set.add(skel)
        occs.append((skel, w))
    if len(skel_set) > 1 and has_qatan and has_gadol:
        for (skel, w) in occs:
            fullskel = get_full_skel(w)
            qq_varied[lex].append((skel, fullskel, w))
            nocc += 1
utils.caption(
    0,
    "\t{} interesting lexemes with {} unique occurrences".format(len(qq_varied), nocc),
)

|         13s 	Filtering lexemes with varied occurrences
|         13s 	161 interesting lexemes with 1704 unique occurrences

Guess the qamets¶

In [107]:

qamets_qatan_xc = dict(
    (x[0], x[1]) for x in (y.split(" => ") for y in qamets_qatan_x.strip().split("\n"))
)
qamets_qatan_xcompiled = collections.defaultdict(lambda: {})
for (lex, corrstr) in qamets_qatan_xc.items():
    corrs = corrstr.split(",")
    for corr in corrs:
        (pos, ins) = corr
        pos = int(pos) - 1
        qamets_qatan_xcompiled[lex][pos] = ins
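The compilation of the exception table can be followed on a small excerpt. The sketch below parses two of the lines from qamets_qatan_x into the nested lexeme, position, instruction structure. The function name compile_exceptions is hypothetical, and the number is read from everything before the last character rather than by unpacking a two-character string.

```python
import collections

# Excerpt of the exception table shown earlier.
table = """
BJT/ => 1A
JWMM => 2A
"""


def compile_exceptions(text):
    """Hypothetical helper: map lexeme -> {zero-based syllable position: instruction}."""
    compiled = collections.defaultdict(dict)
    for line in text.strip().split("\n"):
        (lex, corrstr) = line.split(" => ")
        for corr in corrstr.split(","):
            pos = int(corr[:-1]) - 1  # positions in the table are 1-based
            compiled[lex][pos] = corr[-1]
    return dict(compiled)


print(compile_exceptions(table))
```

The positions are stored zero-based, which is why compile_occs reports syllable i + 1 in its messages.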

In [108]:

def compile_occs(lex, occs):
    vowel_counts = collections.defaultdict(lambda: collections.Counter())
    for (skel, fullskel, w) in occs:
        for (i, c) in enumerate(fullskel):
            vowel_counts[i][c] += 1
    occs_compiled = {}
    for i in sorted(vowel_counts):
        vowel_count = vowel_counts[i]
        a_ish = vowel_count.get(gadol, 0) + vowel_count.get("A", 0)
        o_ish = vowel_count.get(qatan, 0) + vowel_count.get("O", 0)
        if a_ish != o_ish:
            occs_compiled[i] = gadol if a_ish > o_ish else qatan
    if lex in qamets_qatan_xcompiled:
        override = qamets_qatan_xcompiled[lex]
        for i in override:
            ins = override[i]
            old_ins = occs_compiled.get(i, "")
            new_ins = gadol if ins == "A" else qatan
            if old_ins == new_ins:
                TF.info(
                    "\t{}: No override needed for syllable {} which is {}".format(
                        lex,
                        i + 1,
                        old_ins,
                    ),
                    tm=False,
                )
            else:
                TF.info(
                    "\t{}: Override for syllable {}: {} becomes {}".format(
                        lex,
                        i + 1,
                        old_ins,
                        new_ins,
                    ),
                    tm=False,
                )
                occs_compiled[i] = new_ins
    return occs_compiled
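
The voting step in compile_occs() can be tried out in isolation. The following is a minimal sketch, not the notebook's own code: it assumes that gadol is rendered as `ā` and qatan as `o` (as the log output further down suggests), the skeletons are invented, and `majority_qamets` is a hypothetical name.

```python
import collections

GADOL = "ā"  # assumed rendering of qamets gadol
QATAN = "o"  # assumed rendering of qamets qatan


def majority_qamets(fullskels):
    """Return {position: winning qamets reading} for a list of vowel skeletons."""
    counts = collections.defaultdict(collections.Counter)
    for fullskel in fullskels:
        for (i, c) in enumerate(fullskel):
            counts[i][c] += 1
    winners = {}
    for i in sorted(counts):
        # plain A and O vowels count as evidence for gadol and qatan respectively
        a_ish = counts[i][GADOL] + counts[i]["A"]
        o_ish = counts[i][QATAN] + counts[i]["O"]
        if a_ish != o_ish:  # a tie leaves the position undecided
            winners[i] = GADOL if a_ish > o_ish else QATAN
    return winners


# three invented skeletons of one lexeme; only position 0 is contested
example = majority_qamets(["oE", "āE", "AE"])
tie = majority_qamets(["oE", "āE"])
```

Positions where the two camps are equally strong get no entry, so guess_qq() leaves the rule-based reading there untouched.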

In [109]:

def guess_qq(occ, occs_compiled, debug=False):
    (skel, fullskel, w) = occ
    guess = ""
    for (i, c) in enumerate(fullskel):
        guess += occs_compiled.get(i, c) if c == gadol or c == qatan else c
    if debug:
        TF.info("{}".format(w), tm=False)
    return guess

In [110]:

def get_corr(fullskel, guess, debug=False):
    n = 0
    corr = []
    for (i, fc) in enumerate(fullskel):
        if fc != qatan and fc != gadol:
            continue
        n += 1
        gc = guess[i]
        if fc == gc:
            continue
        corr.append("{}{}".format(n, gc))
    if debug:
        TF.info("{} guess {} corr {}".format(fullskel, guess, corr), tm=False)
    return ",".join(corr)
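
The correction strings built by get_corr() number the qamets positions of a word from 1 and attach the reading that replaces the rule-based one; several corrections are joined by commas. A small standalone sketch of that encoding, again assuming `ā` for gadol and `o` for qatan, with a hypothetical function name and invented skeletons:

```python
GADOL = "ā"  # assumed rendering of qamets gadol
QATAN = "o"  # assumed rendering of qamets qatan


def corr_string(fullskel, guess):
    """Encode where guess differs from fullskel, counting only qamets positions."""
    corr = []
    n = 0
    for (fc, gc) in zip(fullskel, guess):
        if fc not in (GADOL, QATAN):
            continue  # other vowels are never corrected
        n += 1
        if fc != gc:
            corr.append("{}{}".format(n, gc))
    return ",".join(corr)


changed = corr_string(QATAN + "E" + QATAN, QATAN + "E" + GADOL)
unchanged = corr_string(QATAN + "E" + QATAN, QATAN + "E" + QATAN)
```

An empty string means that the guess agrees with the rule-based result, which is why the main loop only records occurrences with a non-empty correction.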

Carrying out the guess work¶

In [111]:

utils.caption(0, "\tGuessing between gadol and qatan")
qamets_corrections = {}
qq_varied_remaining = set()
ndiff_occs = 0
ndiff_lexs = 0
nconflicts = 0
for lex in qq_varied:
    debug = False
    occs = qq_varied[lex]
    occs_compiled = compile_occs(lex, occs)
    this_ndiff_occs = 0
    for occ in occs:
        (skel, fullskel, w) = occ
        guess = guess_qq(occ, occs_compiled, debug=debug)
        corr = get_corr(fullskel, guess, debug=debug)
        if corr:
            this_ndiff_occs += 1
            wordq = phono(w, correct=-1, punct=False)
            if wordq in qamets_corrections:
                old_corr = qamets_corrections[wordq]
                if old_corr != corr:
                    TF.error(
                        "\t\tConflicting corrections for {} {} {} ({} => {}): first {} and then {}".format(
                            lex,
                            wordq,
                            skel,
                            fullskel,
                            guess,
                            old_corr,
                            corr,
                        )
                    )
                    nconflicts += 1
            qamets_corrections[wordq] = corr

    if this_ndiff_occs:
        ndiff_lexs += 1
        ndiff_occs += this_ndiff_occs
        qq_varied_remaining.add(lex)
utils.caption(
    0, "\t{} lexemes with modified occurrences ({})".format(ndiff_lexs, ndiff_occs)
)
utils.caption(0, "\t{} patterns with conflicts".format(nconflicts))

|         13s 	Guessing between gadol and qatan
	JM/: Override for syllable 1: ā becomes o
	BJT/: Override for syllable 1: o becomes ā
	JWMM: Override for syllable 2:  becomes ā
	JHWNTN/: Override for syllable 2:  becomes ā
	JRB<M/: No override needed for syllable 1 which is ā
|         13s 	107 lexemes with modified occurrences (224)
|         13s 	0 patterns with conflicts

Generate phonological data¶

In [112]:

def stats_prog():
    return " ".join(str(stats.get(stat, 0)) for stat in interesting_stats)

In [113]:

utils.caption(4, "Generating data in two ways ... ")

..............................................................................................
.         13s Generating data in two ways ...                                                .
..............................................................................................

In [114]:

phono_file = []
word_file = []

In [115]:

stats = collections.Counter()
nv = 0
nchunk = 1000
nvc = 0
for v in F.otype.s("verse"):
    nv += 1
    nvc += 1
    if nvc == nchunk:
        utils.caption(0, "\t{:>5} verses {}".format(nv, stats_prog()))
        nvc = 0

    words = partition_w(L.d(v, "word"))
    phonos = []

    for ws in words:
        lws = len(ws)
        phono_w = phono(ws, inparts=True, count=True)
        phono_file.append("".join(phono_w))
        for (i, w) in enumerate(ws):
            (real_phono, sep) = phono_sep.fullmatch(phono_w[i]).groups()
            word_file.append((w, real_phono, sep))

    if not phono_file[-1].endswith(". "):
        word_file.append((None, "", "+"))

|         14s 	 1000 verses 13316 62 0 21
|         16s 	 2000 verses 27407 123 2 79
|         17s 	 3000 verses 40963 174 5 125
|         18s 	 4000 verses 54143 242 8 143
|         19s 	 5000 verses 67151 308 13 171
|         20s 	 6000 verses 82448 394 15 196
|         22s 	 7000 verses 97551 457 17 254
|         23s 	 8000 verses 113748 529 18 287
|         24s 	 9000 verses 129602 573 20 327
|         26s 	10000 verses 146217 624 20 438
|         27s 	11000 verses 159809 749 20 487
|         28s 	12000 verses 174192 891 24 524
|         30s 	13000 verses 190555 1018 28 576
|         31s 	14000 verses 205104 1168 32 622
|         32s 	15000 verses 218610 1290 33 728
|         33s 	16000 verses 227944 1336 39 777
|         34s 	17000 verses 235635 1379 48 827
|         35s 	18000 verses 243258 1396 51 866
|         35s 	19000 verses 250709 1429 59 906
|         36s 	20000 verses 260118 1470 60 960
|         38s 	21000 verses 275083 1533 63 979
|         39s 	22000 verses 286442 1590 65 1007
|         40s 	23000 verses 301302 1645 66 1075

In [116]:

utils.caption(0, "\t{:>5} verses done {}".format(nv, stats_prog()))
for stat in sorted(stats):
    amount = stats[stat]
    utils.caption(
        0,
        "\t{:<1} {:>6} {}".format(
            "#" if amount == 0 else "",
            amount,
            stat,
        ),
    )

|         40s 	23213 verses done 304800 1650 66 1081
|         40s 	  270191 accents
|         40s 	    9006 cleanup
|         40s 	   45235 dagesh_forte
|         40s 	   21511 dagesh_forte_lene
|         40s 	   59612 dagesh_lene
|         40s 	   16322 default_accent
|         40s 	     968 fixit
|         40s 	    2658 furtive_patah
|         40s 	   28195 last_ml
|         40s 	    2201 mappiq_heh
|         40s 	   93898 mobile_schwa1
|         40s 	    2255 mobile_schwa2
|         40s 	     179 mobile_schwa3
|         40s 	    7702 mobile_schwa4
|         40s 	   25498 punct
|         40s 	   25498 punctuation
|         40s 	      66 qamets_prs_suppress_qatan
|         40s 	    5257 qamets_qatan1
|         40s 	     243 qamets_qatan2
|         40s 	    1791 qamets_qatan3
|         40s 	      28 qamets_qatan4a
|         40s 	     256 qamets_qatan4b
|         40s 	     209 qamets_qatan5
|         40s 	    1081 qamets_qatan_corrections
|         40s 	    1650 qamets_verb_suppress_qatan
|         40s 	      12 rafe
|         40s 	   21098 silent_aleph
|         40s 	  304800 total
|         40s 	  304796 trim

Consistency check¶

We take the phono and wordph data that were just generated. From the phono data we strip the passage indicators, and from the wordph data we strip the node numbers.

They should be consistent.

In [117]:

utils.caption(0, "{} items in phono".format(len(phono_file)))
word_test = []
TF.info("Reading word")
i = 0
for (w, mat, sep) in word_file:
    rsep = "" if sep == "+" else sep
    word_test.append(mat + rsep)
    if ". " in sep or "+" in sep:
        i += 1
utils.caption(0, "\t{} lines".format(i))

|         40s 304800 items in phono
    40s Reading word
|         41s 	23213 lines

In [118]:

phono_text = "".join(phono_file)
word_text = "".join(word_test)
if phono_text != word_text:
    utils.caption(0, "\tERROR: phono text and word info are NOT consistent")
else:
    utils.caption(0, "\tOK: phono text and word info are CONSISTENT")

|         41s 	OK: phono text and word info are CONSISTENT
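
The idea behind the check can be shown on a toy scale. The sketch below uses invented ASCII stand-ins for the transcriptions and mirrors the shape of phono_file (one string per word group) and word_file (one tuple per word, plus a `+` sentinel where a verse does not end in `. `); `rebuild` is a hypothetical helper.

```python
# invented sample data in the shape of phono_file and word_file
phono_sample = ["bereshit ", "bara ", "haarets . ", "wayyomer"]
word_sample = [
    (1, "be", ""),
    (2, "reshit", " "),
    (3, "bara", " "),
    (4, "ha", ""),
    (5, "arets", " . "),
    (6, "wayyomer", ""),
    (None, "", "+"),  # sentinel: the preceding verse did not end in ". "
]


def rebuild(words):
    """Concatenate word material and separators, ignoring the + sentinel."""
    return "".join(mat + ("" if sep == "+" else sep) for (_node, mat, sep) in words)


consistent = "".join(phono_sample) == rebuild(word_sample)
# verse ends are marked either by ". " in the separator or by the sentinel
n_lines = sum(1 for (_node, _mat, sep) in word_sample if ". " in sep or "+" in sep)
```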

Generating phono module for Text-Fabric¶

We generate the features phono and phono_trailer. They are defined for words.

We also generate a config feature otext@phono, which will be picked up by Text-Fabric automatically. In it we define the phonetic format, so that Text-Fabric can output text in phonetic representation.

In [122]:

genericMetaPath = f"{thisRepo}/yaml/generic.yaml"
phonoMetaPath = f"{thisRepo}/yaml/phono.yaml"

with open(genericMetaPath) as fh:
    genericMeta = yaml.load(fh, Loader=yaml.FullLoader)
    genericMeta["version"] = VERSION
with open(phonoMetaPath) as fh:
    phonoMeta = formatMeta(yaml.load(fh, Loader=yaml.FullLoader))
    
metaData = {"": genericMeta, **phonoMeta}

In [124]:

utils.caption(4, "Writing TF phono features")
nodeFeatures = dict(
    phono=dict(((ln[0], ln[1]) for ln in word_file if ln[0] is not None)),
    phono_trailer=dict(((ln[0], ln[2]) for ln in word_file if ln[0] is not None)),
)
edgeFeatures = {}
metaData["otext@phono"] = {
    "about": "Provides phonetic transcriptions to Hebrew Words",
    "see": "https://github.com/ETCBC/phono",
    "fmt:text-phono-full": "{phono}{phono_trailer}",
}
metaData["phono"]["valueType"] = "str"
metaData["phono_trailer"]["valueType"] = "str"

TF = Fabric(locations=thisTempTf, silent=True)
TF.save(nodeFeatures=nodeFeatures, edgeFeatures=edgeFeatures, metaData=metaData)

..............................................................................................
.      3m 09s Writing TF phono features                                                      .
..............................................................................................

Out[124]:

True
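
The format `text-phono-full` is a template that is filled in per word with the values of the two features. Its effect can be imitated in plain Python. This is only an illustration with `str.format` and two words taken from the Genesis 1:4 output; it does not claim to show how Text-Fabric evaluates formats internally, and the node numbers and `render` are made up.

```python
template = "{phono}{phono_trailer}"

# made-up node numbers; the values follow the Genesis 1:4 output
phono_feature = {1: "ʔeṯ", 2: "hāʔˌôr"}
phono_trailer_feature = {1: "-", 2: " "}


def render(nodes):
    """Fill the template for every node and concatenate the results."""
    return "".join(
        template.format(
            phono=phono_feature.get(n, ""),
            phono_trailer=phono_trailer_feature.get(n, ""),
        )
        for n in nodes
    )


rendered = render([1, 2])
```

Because the trailer carries the hyphen or space after each word, no extra separator is needed when joining.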

Diffs¶

Check differences with previous versions.

In [125]:

utils.checkDiffs(thisTempTf, thisTf, only=set(nodeFeatures))

..............................................................................................
.      6m 08s Check differences with previous version                                        .
..............................................................................................
|      6m 08s 	no features to add
|      6m 08s 	no features to delete
|      6m 08s 	2 features in common
|      6m 08s phono                     ... no changes
|      6m 08s phono_trailer             ... no changes
|      6m 08s Done

Deliver¶

Copy the new TF features from the temporary location where they have been created to their final destination.

In [126]:

utils.deliverDataset(thisTempTf, thisTf)

..............................................................................................
.      6m 11s Deliver data set to /Users/werk/github/etcbc/phono/tf/2021                     .
..............................................................................................

Compile TF¶

In [127]:

utils.caption(4, "Load and compile the new TF features")

..............................................................................................
.      6m 14s Load and compile the new TF features                                           .
..............................................................................................

In [128]:

TF = Fabric(locations=[coreTf, thisTf], modules=[""])
api = TF.load(" ".join(nodeFeatures))
api.makeAvailableIn(globals())

This is Text-Fabric 9.1.7
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html

117 features found and 0 ignored
  0.00s loading features ...
   |     0.00s Dataset without structure sections in otext:no structure functions in the T-API
   |     1.01s T phono                from ~/github/etcbc/phono/tf/2021
   |     0.60s T phono_trailer        from ~/github/etcbc/phono/tf/2021
    15s All features loaded/computed - for details use TF.isLoaded()

Out[128]:

[('Computed',
  'computed-data',
  ('C Computed', 'Call AllComputeds', 'Cs ComputedString')),
 ('Features', 'edge-features', ('E Edge', 'Eall AllEdges', 'Es EdgeString')),
 ('Fabric', 'loading', ('TF',)),
 ('Locality', 'locality', ('L Locality',)),
 ('Nodes', 'navigating-nodes', ('N Nodes',)),
 ('Features',
  'node-features',
  ('F Feature', 'Fall AllFeatures', 'Fs FeatureString')),
 ('Search', 'search', ('S Search',)),
 ('Text', 'text', ('T Text',))]

In [129]:

utils.caption(4, "Basic tests")

..............................................................................................
.      6m 33s Basic tests                                                                    .
..............................................................................................

In [130]:

utils.caption(4, "First verses in phonetic transcription")
for v in F.otype.s("verse")[0:10]:
    utils.caption(0, "{} {}:{}".format(*T.sectionFromNode(v)), continuation=True)
    utils.caption(0, T.text(L.d(v, "word"), fmt="text-phono-full"), continuation=True)

..............................................................................................
.      6m 36s First verses in phonetic transcription                                         .
..............................................................................................
Genesis 1:1
bᵊrēšˌîṯ bārˈā ʔᵉlōhˈîm ʔˌēṯ haššāmˌayim wᵊʔˌēṯ hāʔˈāreṣ .
Genesis 1:2
wᵊhāʔˈāreṣ hāyᵊṯˌā ṯˈōhû wāvˈōhû wᵊḥˌōšeḵ ʕal-pᵊnˈê ṯᵊhˈôm wᵊrˈûₐḥ ʔᵉlōhˈîm mᵊraḥˌefeṯ ʕal-pᵊnˌê hammˈāyim .
Genesis 1:3
wayyˌōmer ʔᵉlōhˌîm yᵊhˈî ʔˈôr wˈayᵊhî-ʔˈôr .
Genesis 1:4
wayyˈar ʔᵉlōhˈîm ʔeṯ-hāʔˌôr kî-ṭˈôv wayyavdˈēl ʔᵉlōhˈîm bˌên hāʔˌôr ûvˌên haḥˈōšeḵ .
Genesis 1:5
wayyiqrˌā ʔᵉlōhˈîm lāʔôr yˈôm wᵊlaḥˌōšeḵ qˈārā lˈāyᵊlā wˈayᵊhî-ʕˌerev wˈayᵊhî-vˌōqer yˌôm ʔeḥˈāḏ . f
Genesis 1:6
wayyˈōmer ʔᵉlōhˈîm yᵊhˌî rāqˌîₐʕ bᵊṯˈôḵ hammˈāyim wiyhˈî mavdˈîl bˌên mˌayim lāmˈāyim .
Genesis 1:7
wayyˈaʕaś ʔᵉlōhîm ʔeṯ-hārāqîˌₐʕ wayyavdˈēl bˈên hammˈayim ʔᵃšˌer mittˈaḥaṯ lārāqˈîₐʕ ûvˈên hammˈayim ʔᵃšˌer mēʕˈal lārāqˈîₐʕ wˈayᵊhî-ḵˈēn .
Genesis 1:8
wayyiqrˈā ʔᵉlōhˈîm lˈārāqˌîₐʕ šāmˈāyim wˈayᵊhî-ʕˌerev wˈayᵊhî-vˌōqer yˌôm šēnˈî . f
Genesis 1:9
wayyˈōmer ʔᵉlōhˈîm yiqqāwˌû hammˈayim mittˈaḥaṯ haššāmˈayim ʔel-māqˈôm ʔeḥˈāḏ wᵊṯērāʔˌeh hayyabbāšˈā wˈayᵊhî-ḵˈēn .
Genesis 1:10
wayyiqrˌā ʔᵉlōhˈîm layyabbāšˌā ʔˈereṣ ûlᵊmiqwˌē hammˌayim qārˈā yammˈîm wayyˌar ʔᵉlōhˌîm kî-ṭˈôv .

In [131]:

utils.caption(4, "First verse in all formats")
for fmt in T.formats:
    utils.caption(0, "{}".format(fmt), continuation=True)
    utils.caption(0, "\t{}".format(T.text(range(1, 12), fmt=fmt)), continuation=True)

..............................................................................................
.      6m 41s First verse in all formats                                                     .
..............................................................................................
lex-orig-full
	בְּ רֵאשִׁית בָּרָא אֱלֹה אֵת הַ שָּׁמַי וְ אֵת הָ אָרֶץ 
lex-orig-plain
	ב ראשׁית ברא אלהים את ה שׁמים ו את ה ארץ 
lex-trans-full
	B.:- R;>CIJT B.@R@> >:ELOH >;T HA- C.@MAJ W:- >;T H@- >@REY 
lex-trans-plain
	B R>CJT BR> >LHJM >T H CMJM W >T H >RY 
text-orig-full
	בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃ 
text-orig-full-ketiv
	בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃ 
text-orig-plain
	בראשׁית ברא אלהים את השׁמים ואת הארץ׃ 
text-phono-full
	bᵊrēšˌîṯ bārˈā ʔᵉlōhˈîm ʔˌēṯ haššāmˌayim wᵊʔˌēṯ hāʔˈāreṣ . 
text-trans-full
	B.:-R;>CI73JT B.@R@74> >:ELOHI92JM >;71T HA-C.@MA73JIM W:->;71T H@->@75REY00 
text-trans-full-ketiv
	B.:-R;>CI73JT B.@R@74> >:ELOHI92JM >;71T HA-C.@MA73JIM W:->;71T H@->@75REY00 
text-trans-plain
	BR>CJT BR> >LHJM >T HCMJM W>T H>RY00

End of pipeline¶

If this notebook is run only to generate data, the pipeline ends here.

What follows are tests and examples.

In [41]:

if SCRIPT:
    stop(good=True)

Testing¶

The function below reads a text file with tests.

A test is a tab-separated line with the following fields:

passage ETCBC-original expected-result comment

The testing routine executes all tests, checks the results, and produces on-screen output, debug output in a file, and pretty output in an HTML file.
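
As an illustration of the line format, here is a minimal sketch of reading one test line. It follows the four fields that runtests() unpacks (passage, ETCBC original, expected result, comment); `TestLine` and `parse_test_line` are hypothetical names, and the sample line is assembled from values shown elsewhere in this notebook.

```python
import collections

TestLine = collections.namedtuple("TestLine", "passage orig expected comment")


def parse_test_line(line):
    """Split a tab-separated test line and turn the passage into a tuple."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 4:
        raise ValueError("expected 4 tab-separated fields, got {}".format(len(fields)))
    (passage_str, orig, expected, comment) = fields
    (book, rest) = passage_str.split()
    (chapter, verse) = rest.split(":")
    return TestLine((book, int(chapter), int(verse)), orig, expected, comment)


sample = parse_test_line("Genesis 1:7\tMI-T.A74XAT\tmittˈaḥaṯ\tproclitic min\n")
```

Checking the field count up front gives a clearer message than a failing tuple unpacking when a line in the test file is malformed.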

Load the features needed for testing.

In [48]:

api = TF.load(
    """
        qere qere_trailer
        g_word_utf8 g_cons_utf8 trailer
        g_word g_cons lex_utf8 lex lex0
        sp vs vt gn nu ps st
        uvf prs g_prs pfm vbs vbe
        languageISO
"""
)
api.makeAvailableIn(globals())

  0.00s loading features ...
  0.04s All features loaded/computed - for details use loadLog()

Auxiliary functions¶

Composing tests¶

Given an occurrence in ETCBC transliteration in a passage, or a node number, we want to easily compile a test out of it. Say we are looking for orig.

The match need not be perfect. We want to find the node w, which carries a transliteration that occurs at the end of orig. If there are multiple, we want the longest. If there are multiple longest ones, we want the first that occurs in the passage.
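
The selection rule can be sketched as follows. This is not the notebook's find_w(): `pick_suffix_match` is a hypothetical helper that receives the candidate words of a passage as (node, transliteration) pairs in text order, and the node numbers are invented.

```python
def pick_suffix_match(orig, candidates):
    """Return the node whose transliteration is the longest suffix of orig.

    candidates: (node, transliteration) pairs in passage order.
    Ties are resolved in favour of the earliest candidate.
    """
    best = None
    for (node, translit) in candidates:
        if translit and orig.endswith(translit):
            # strictly longer only, so the first of equally long matches stays
            if best is None or len(translit) > len(best[1]):
                best = (node, translit)
    return None if best is None else best[0]


words = [(10, "AT"), (11, "T.A74XAT"), (12, "T.A74XAT"), (13, "MI-")]
found = pick_suffix_match("MI-T.A74XAT", words)
missing = pick_suffix_match("MI-T.A74XAT", [(13, "MI-")])
```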

In [ ]:

def get_hebrew(orig):
    origm = Transcription.suffix_and_finales(orig)
    return Transcription.to_hebrew(origm[0] + origm[1]).replace("-", "")

In [ ]:

def get_passage(w):
    return T.sectionFromNode(w, lang="la")

In [ ]:

def tupleFromStr(passage):
    (book, rest) = passage.split()
    (chapter, verse) = rest.split(":")
    return (book, int(chapter), int(verse))

In [49]:

def maketest(ws=None, orig=None, passageStr=None, expected=None, comment=None):
    if comment is None:
        comment = "isolated case"
    passage = None if passageStr is None else tupleFromStr(passageStr)
    if ws is None:
        if passage is not None and orig is not None:
            ws = find_w(passage, orig)
    if ws is None:
        TF.error("Cannot make test: {}: {} not found".format(passageStr, orig))
        return None
    else:
        if type(ws) is int:
            ws = [ws]
        passage = get_passage(ws[-1])
        if expected is None:
            expected = phono(ws, punct=False)
        test = (ws, expected.rstrip(" "), comment)
    return test

Formatting test results¶

Here are some HTML/CSS definitions for formatting test results.

In [ ]:

def h_esc(txt):
    return txt.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")

In [ ]:

def test_html_head(title, stats, mystats):
    return (
        """<html>
<head>
    <meta http-equiv="Content-Type"
          content="text/html; charset=UTF-8" />
    <title>"""
        + title
        + """</title>
    <style type="text/css">
        .h {
            font-family: Ezra SIL, SBL Hebrew, Verdana, sans-serif;
            font-size: x-large;
            text-align: right;
        }
        .t {
            font-family: Menlo, Courier New, Courier, monospace;
            font-size: small;
            color: #0000cc;
        }
        .tl {
            font-family: Menlo, Courier New, Courier, monospace;
            font-size: medium;
            font-weight: bold;
            color: #000000;
        }
        .p {
            font-family: Verdana, Arial, sans-serif;
            font-size: medium;
        }
        .l {
            font-family: Verdana, Arial, sans-serif;
            font-size: small;
            color: #440088;
        }
        .v {
            font-family: Verdana, Arial, sans-serif;
            font-size: x-small;
            color: #666666;
        }
        .c {
            font-family: Ezra SIL, SBL Hebrew, Verdana, sans-serif;
            font-size: small;
            background-color: #ffffdd;
            width: 20%;
        }
        .cor {
            font-family: Menlo, Courier New, Courier, monospace;
            font-weight: bold;
            font-size: medium;
        }
        .exact {
            background-color: #88ffff;
        }
        .good {
            background-color: #88ff88;
        }
        .error {
            background-color: #ff8888;
        }
        .norm {
            background-color: #8888ff;
        }
        .ca {
            background-color: #88ffff;
        }
        .cr {
            background-color: #ffff33;
        }
    </style>
</head><body>
"""
        + (("<p>" + stats + "</p>") if stats else "")
        + (("<p>" + mystats + "</p>") if mystats else "")
        + """
<table>
"""
    )

In [50]:

test_html_tail = """</table>
</body>
</html>
"""

Run tests¶

This is the function that runs a sequence of tests. If the second argument is a string, it names a tab-separated file from which the tests are read; each line is turned into a test by maketest(). Otherwise it should be a list of tests as compiled by maketest(), a test being a tuple consisting of:

ws, expected, comment

where ws is the list of word nodes under test, expected is the expected phonetic transcription, and comment is free text. The passage, the ETCBC original and the lexical information shown in the output are looked up from the nodes.

In [ ]:

def vfname(inpath):
    (indir, infile) = os.path.split(inpath)
    (inbase, inext) = os.path.splitext(infile)
    return os.path.join(indir, inbase + VERSION + inext)

In [51]:

def runtests(title, testsource, outfilename, htmlfilename, order=True, screen=False):
    skipped = 0
    if type(testsource) is list:
        tests = testsource
    else:
        tests = []
        with open(testsource) as test_in_file:
            for tline in test_in_file:
                (passageStr, orig, expected, comment) = tline.rstrip("\n").split("\t")
                this_test = maketest(
                    orig=orig, passageStr=passageStr, expected=expected, comment=comment
                )
                if this_test is not None:
                    tests.append(this_test)
                else:
                    skipped += 1

    lines = []
    htmllines = []
    longlines = []
    nexact = 0
    ngood = 0
    ntests = len(tests)
    test_sequence = sorted(tests, key=lambda x: (x[1], x[2], x[0])) if order else tests

    for (i, (wset, expected, comment)) in enumerate(test_sequence):
        passage = get_passage(wset[-1])
        passageStr = "{} {}:{}".format(*passage)
        wss = partition_w(wset)
        orig = "".join(get_orig(w, punct=True, set_pet=True, tetra=False) for w in wset)
        wordph = ""
        lex_info = ""
        dout = []
        for (j, ws) in enumerate(wss):
            this_lex_info = get_lex_info(ws[-1])
            (this_wordph, this_dout) = phono(
                ws, punct=not (j == len(wss) - 1), debug=True
            )
            wordph += this_wordph
            lex_info += this_lex_info
            dout.extend(this_dout)
        wordph = wordph.rstrip(" ")
        if wordph == expected:
            isgood = "="
            nexact += 1
        elif wordph.replace("ˌ", "").replace("ˈ", "").replace(
            "-", ""
        ) == expected.replace("ˌ", "").replace("ˈ", "").replace("-", ""):
            isgood = "~"
            ngood += 1
        else:
            isgood = "#"
        line_text = "{:>3} {:<19} {:>6} {:<17} {:<22} {:<20} {} {:<20}".format(
            i + 1,
            passageStr,
            ws[-1],
            lex_info,
            orig,
            wordph,
            isgood,
            "" if isgood == "=" else expected,
        )
        lines.append(line_text)
        if screen:
            if isgood in {"=", "~"}:
                TF.info(line_text, tm=False)
        if isgood not in {"=", "~"}:
            TF.info(line_text, tm=False)
        longlines.append(
            "{:>3} {:<19} {:>6} {:<17} {:<25} => {:<25} < {} {:<25} # {}\n{}\n\n".format(
                i + 1,
                passageStr,
                ws[-1],
                lex_info,
                orig,
                wordph,
                isgood,
                "" if isgood == "=" else expected,
                comment,
                "\n".join("{:<7} {:<20} {}".format("", x[0], x[1]) for x in dout),
            )
        )
        htmllines.append(
            (
                """
    <tr>
        <td class="{st}">{i}</td>
        <td class="v">{v} {w}</td>
        <td class="t">{t}</td>
        <td class="h">{h}</td>
        <td class="l">{l}</td>
        <td class="p {st}">{p}</td>
        <td class="p{est}">{e}</td>
        <td class="c">{c}</td>
    </tr>
    """
            ).format(
                st="exact" if isgood == "=" else "good" if isgood == "~" else "error",
                i=i + 1,
                v=passageStr,
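                # use the last word node of this test; a bare w is not defined
                # in this function and would pick up a stale global
                w=wset[-1],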
                t=h_esc(orig),
                l=lex_info,
                h=get_hebrew(orig),
                p=wordph,
                e="" if isgood == "=" else expected,
                est="" if isgood == "=" else " ca" if isgood == "~" else " norm",
                c=h_esc(comment),
            )
        )

    line_text = "\n".join(lines)
    longline_text = "\n".join(longlines)
    test_out_file = open(vfname(outfilename), "w")
    test_out_file.write("{}\n\n{}\n".format(line_text, longline_text))
    stats = "{} tests; {} skipped; {} failed; {} passed of which {} exactly.".format(
        ntests + skipped,
        skipped,
        ntests - ngood - nexact,
        ngood + nexact,
        nexact,
    )
    TF.info(
        "ntests={}, skipped={}, ngood={}, nexact={}".format(
            ntests, skipped, ngood, nexact
        )
    )
    test_out_file.close()
    test_html_file = open(vfname(htmlfilename), "w")
    test_html_headline = """
    <tr>
        <th class="v">v</th>
        <th class="v">verse</th>
        <th class="t">etcbc</th>
        <th class="h">hebrew</th>
        <th class="l">lexical</th>
        <th class="p">phono</th>
        <th class="p norm">expected</th>
        <th class="c">comment</th>
    </tr>
    """
    test_html_file.write(
        "{}{}{}{}".format(
            test_html_head(title, stats, ""),
            test_html_headline,
            "".join(htmllines),
            test_html_tail,
        )
    )
    test_html_file.close()
    TF.info(stats, tm=False)
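
The three verdicts that runtests() hands out (`=` for an exact match, `~` for a match after removing stress marks and hyphens, `#` for a failure) can be isolated in a small helper. This is a sketch with a hypothetical function name; it assumes that the stress marks are the code points U+02C8 and U+02CC.

```python
PRIMARY = "\u02c8"  # assumed code point of the primary stress mark
SECONDARY = "\u02cc"  # assumed code point of the secondary stress mark


def strip_marks(text):
    """Remove stress marks and hyphens, which are ignored in the lenient comparison."""
    return text.replace(SECONDARY, "").replace(PRIMARY, "").replace("-", "")


def verdict(actual, expected):
    """Return '=' for exact, '~' for equal up to stress and hyphens, '#' otherwise."""
    if actual == expected:
        return "="
    if strip_marks(actual) == strip_marks(expected):
        return "~"
    return "#"


exact = verdict("bara", "bara")
lenient = verdict("ba" + PRIMARY + "ra", "ba" + SECONDARY + "ra-")
failed = verdict("bara", "bora")
```

Keeping the lenient comparison separate makes it easy to see that a `~` result never hides a difference in consonants or vowels.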

Produce showcases¶

This is a variant of runtests().

It produces overviews of the cases where the corpus-dependent rules have been applied.

In [52]:

def showcases(title, stats, testsource, order=True):
    ctitle = title + " cases"
    ttitle = title + " tests"
    fctitle = ctitle.replace(" ", "_")
    fttitle = ttitle.replace(" ", "_")
    test_file_name = vfname(fttitle + ".txt")
    html_file_name = vfname(fctitle + ".html")

    TF.info("Generating HTML in {}".format(html_file_name))
    TF.info("Generating test set {} in {}".format(title, test_file_name))

    htmllines = []
    ncorr = 0
    test_sequence = (
        sorted(testsource, key=lambda x: (x[3], x[0], x[1], x[5]))
        if order
        else testsource
    )
    ntests = len(testsource)

    test_file = open(test_file_name, "w")
    for (i, (corr, wordph, wordph_c, lex, orig, w, comment)) in enumerate(
        test_sequence
    ):
        passage = get_passage(w)
        passageStr = "{} {}:{}".format(*passage)
        lex_info = get_lex_info(w)
        test_file.write(
            "{}\t{}\t{}\t{}\n".format(
                passageStr,
                orig,
                wordph_c,
                comment,
            )
        )
        heb = get_hebrew(orig)
        if corr:
            ncorr += 1
        htmllines.append(
            (
                """
    <tr>
        <td class="v">{i}</td>
        <td class="cor{st}">{cr}</td>
        <td class="tl">{tl}</td>
        <td class="v">{v} {w}</td>
        <td class="l">{l}</td>
        <td class="h">{h}</td>
        <td class="p {st}">{p}</td>
        <td class="p {st1}">{pc}</td>
        <td class="t">{t}</td>
        <td class="c">{c}</td>
    </tr>
"""
            ).format(
                i=i + 1,
                st=" cr" if corr else "",
                st1=" good" if corr else "",
                cr=corr,
                tl=h_esc(lex),
                v=passageStr,
                w="" if w is None else w,
                l=lex_info,
                h=heb,
                p=wordph if wordph != wordph_c else "",
                pc=wordph_c,
                t=h_esc(orig),
                c=h_esc(comment),
            )
        )
    test_file.close()

    mystats = "{} occurrences and {} corrections".format(
        ntests,
        ncorr,
    )
    test_html_headline = """
    <tr>
        <th class="v">n</th>
        <th class="cor cr">correction</th>
        <th class="tl">lexeme</th>
        <th class="v">verse</th>
        <th class="l">lexical</th>
        <th class="h">hebrew</th>
        <th class="p cr">phono<br/>uncorrected</th>
        <th class="p good">phono<br/>corrected</th>
        <th class="t">etcbc</th>
        <th class="c">comment</th>
    </tr>
    """
    test_html_file = open(html_file_name, "w")
    test_html_file.write(
        "{}{}{}{}".format(
            test_html_head(ctitle, stats, mystats),
            test_html_headline,
            "".join(htmllines),
            test_html_tail,
        )
    )
    test_html_file.close()
    if stats:
        TF.info(stats, tm=False)
    if mystats:
        TF.info(mystats, tm=False)

Test the existing examples¶

In [53]:

for tname in [
    "mixed",
    "qamets_nonverb_tests",
    "qamets_verb_tests",
    "qamets_prs_tests",
]:
    runtests(
        tname,
        "{}.txt".format(tname),
        "{}_debug.txt".format(tname),
        "{}.html".format(tname),
        screen=False,
    )

  9.49s ntests=86, skipped=0, ngood=19, nexact=67
86 tests; 0 skipped; 0 failed; 86 passed of which 67 exactly.
    10s ntests=1574, skipped=0, ngood=197, nexact=1377
1574 tests; 0 skipped; 0 failed; 1574 passed of which 1377 exactly.
    10s ntests=513, skipped=0, ngood=30, nexact=483
513 tests; 0 skipped; 0 failed; 513 passed of which 483 exactly.
    11s ntests=209, skipped=0, ngood=30, nexact=179
209 tests; 0 skipped; 0 failed; 209 passed of which 179 exactly.

Testing: Special cases¶

In [ ]:

special_tests = [
    dict(passageStr="Joel 1:17", orig="<@B:C74W.", comment="qamets gadol or qatan"),
    dict(ws=7494, expected=None, comment="schwa in front of BGDKPT without dagesh"),
    dict(ws=5, expected=None, comment="article in isolation"),
    dict(ws=6, expected=None, comment="word after article in isolation"),
    dict(ws=106, expected=None, comment="proclitic min"),
    dict(
        ws=107, expected=None, comment="word starting with BGDKPT after proclitic min"
    ),
    dict(
        passageStr="Genesis 1:7",
        orig="MI-T.A74XAT",
        expected=None,
        comment="proclitic min combined with word starting with BGDKPT",
    ),
    dict(ws=1684, expected=None, comment="Tetra with end of verse"),
    dict(
        passageStr="Genesis 4:1",
        orig="J:HW@75H00",
        expected=None,
        comment="Tetra with end of verse",
    ),
    dict(ws=27477, expected=None, comment="pronominal suffix after verb"),
    dict(ws=155387, expected=None, comment="peculiar representation of tetragrammaton"),
    dict(
        passageStr="Proverbia 10:10",
        orig="<AY.@92BET",
        expected=None,
        comment="the qamets should be gadol",
    ),
    dict(passageStr="Genesis 9:21", orig="*>HLH", expected=None, comment="ketiv qere"),
    dict(
        passageStr="Genesis 1:27",
        orig="H@95->@D@M03",
        expected=None,
        comment="qamets gadol",
    ),
]

In [ ]:

compiled_tests = []
for t in special_tests:
    this_test = maketest(**t)
    if this_test is not None:
        compiled_tests.append(this_test)

In [54]:

runtests(
    "special cases",
    compiled_tests,
    "special_cases_out.txt",
    "special_cases.html",
    screen=True,
)

  1 Genesis 9:21          4420 subs sm,+h        >@H:@LO75W00           *ʔohᵒlˈô             =                     
  2 Genesis 4:1           1685 nmpr sm,          J:HW@75H00             [yᵊhwˈāh]            =                     
  3 Genesis 17:11         7494 subs sm,          B.:FA74R               bᵊśˈar               =                     
  4 Genesis 1:27           539 subs sm,          H@95->@D@M03           hˈāʔāḏˌām            =                     
  5 Genesis 1:1              6 art               HA-                    hˌa                  =                     
  6 Genesis 1:7            108 subs sm,          MI-T.A74XAT            mittˈaḥaṯ            =                     
  7 Genesis 1:7            107 prep              MI-                    mˌi                  =                     
  8 Samuel_I 23:3       155387 nmpr s-,          Q:<IL@80H              qᵊʕilˈā              =                     
  9 Genesis 4:1           1684 prep              >ET&                   ʔeṯ-                 =                     
 10 Genesis 48:9         27477 prep              >;LA73J                ʔēlˌay               =                     
 11 Genesis 1:1              5 prep              >;71T                  ʔˌēṯ                 =                     
 12 Genesis 1:7            106 conj              >:ACER03               ʔᵃšˌer               =                     
 13 Proverbia 10:10     349420 subs sf,          <AY.@92BET             ʕaṣṣˈāveṯ            =                     
 14 Joel 1:17           294304 verb qal perf 3p-, <@B:C74W.              ʕāvᵊšˈû              =                     
    32s ntests=14, skipped=0, ngood=0, nexact=14
14 tests; 0 skipped; 0 failed; 14 passed of which 14 exactly.

Making new tests: Qamets gadol qatan: non-verbs¶

For non-verbs we have generated a number of corrections to the qamets interpretation, based on other occurrences of the same lexeme, and we have applied exceptions to those corrections. Below is the list of representative occurrences where corrections and/or exceptions have been applied.

In [ ]:

TF.info("Showing lexemes with varied occurrences")
qqi_filename = "qamets_qatan_individuals"
# the phonetic forms contain non-ASCII characters, so fix the encoding explicitly
qqi = open("{}.txt".format(qqi_filename), "w", encoding="utf8")
nvcases = []

In [55]:

noccs = 0
ncorrs = 0
for lex in sorted(qq_varied):
    if lex not in qq_varied_remaining:
        continue
    occs = qq_varied[lex]
    for (skel, fullskel, w) in sorted(occs, key=lambda x: (x[1], x[2])):
        orig = get_orig(w, punct=False, tetra=False)
        wordq = phono(w, punct=False, correct=-1)
        corr = qamets_corrections.get(wordq, "")
        if corr:
            ncorrs += 1
        noccs += 1
        wordph = phono(w, punct=False, correct=0)
        wordph_c = phono(w, punct=False, correct=1)
        comment = "on the basis of other occurrences" if corr else "by the rules"
        qqi.write(
            "{:<1}\t{:<5}\t{:<16}\t{:<16}\t{:<10}\t{:<20}\t{}\n".format(
                "*" if corr else "",
                corr,
                wordph,
                wordph_c,
                lex,
                orig,
                w,
            )
        )
        nvcases.append((corr, wordph, wordph_c, lex, orig, w, comment))
    qqi.write("\n")
qqi.close()
TF.info(
    "{} lexemes with {} occurrences and {} corrections written".format(
        len(qq_varied_remaining),
        noccs,
        ncorrs,
    )
)
showcases(
    "qamets nonverb",
    "{} lexemes".format(len(qq_varied_remaining)),
    nvcases,
    order=False,
)

 1m 08s Showing lexemes with varied occurrences
 1m 10s 107 lexemes with 1192 occurrences and 224 corrections written
 1m 10s Generating HTML in qamets_nonverb_casesc.html
 1m 10s Generating test set qamets nonverb in qamets_nonverb_testsc.txt
107 lexemes
1192 occurrences and 224 corrections
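
The tab-separated file written above can be read back for inspection. The following is a minimal sketch and not part of the original notebook: the function name `read_qamets_table` and the column labels are our own names for the seven fields that the `qqi.write` call emits, and the sample rows are invented placeholders, not corpus data.

```python
# Column labels are assumptions that mirror the argument order of qqi.write().
COLUMNS = ("flag", "corr", "wordph", "wordph_c", "lex", "orig", "node")
# Same layout as the format string used when the file is written.
ROW_FORMAT = "{:<1}\t{:<5}\t{:<16}\t{:<16}\t{:<10}\t{:<20}\t{}\n"


def read_qamets_table(lines):
    """Parse rows of the qamets table into dicts; blank separator lines are skipped."""
    records = []
    for line in lines:
        if not line.strip():
            continue  # an empty line separates the occurrences of two lexemes
        fields = [field.strip() for field in line.rstrip("\n").split("\t")]
        if len(fields) != len(COLUMNS):
            raise ValueError(
                "expected {} fields, got {}".format(len(COLUMNS), len(fields))
            )
        record = dict(zip(COLUMNS, fields))
        record["node"] = int(record["node"])
        record["corrected"] = record["flag"] == "*"
        records.append(record)
    return records


# Invented placeholder rows, laid out like the notebook's output.
sample = [
    ROW_FORMAT.format("*", "c1", "ph1", "ph1c", "LEX1", "ORIG1", 101),
    ROW_FORMAT.format("", "", "ph2", "ph2", "LEX1", "ORIG2", 102),
    "\n",
]
records = read_qamets_table(sample)
print(len(records), sum(r["corrected"] for r in records))
```

With the real file one would pass an open file handle instead of `sample`, for example `open("qamets_qatan_individuals.txt", encoding="utf8")`, provided the file was written in that encoding.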

Making new tests: Qamets gadol qatan: verbs¶

Usually the accents ensure that a potential qatan is read as gadol. But sometimes the accent is missing. For those cases we use a list of verb paradigm labels in which this can happen, and for words in these paradigms we suppress the qamets-as-qatan interpretation. In other words, the verb paradigm supplies the information that the missing accent would have given.

Here we list the cases where this occurs, and show them.
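
The suppression described above can be illustrated with a small sketch. It assumes, as the notebook's code suggests, that a qamets which the syllable rules would read as qatan is marked with `^` in an intermediate form. The function `resolve_qamets`, the paradigm labels and the sample word are hypothetical; they only stand in for what `phono` and `qamets_qatan_verb_x` do in full.

```python
GADOL = "ā"
QATAN = "o"

# Hypothetical paradigm labels; the real list lives in qamets_qatan_verb_x.
SUPPRESS_QATAN_IN = {"paradigm-A", "paradigm-B"}


def resolve_qamets(marked, paradigm, suppress_in=SUPPRESS_QATAN_IN):
    """Replace every '^' (candidate qatan) by gadol or qatan.

    In a paradigm listed in suppress_in the qatan reading is suppressed,
    so the candidate is read as gadol; elsewhere it becomes qatan.
    """
    vowel = GADOL if paradigm in suppress_in else QATAN
    return marked.replace("^", vowel)


# Toy word, not taken from the corpus.
print(resolve_qamets("k^tvû", "paradigm-A"))  # qatan suppressed
print(resolve_qamets("k^tvû", "paradigm-Z"))  # ordinary rule applies
```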

Look up the cases¶

In [ ]:

qq_verb_words = set()
qq_verb_specials = []

In [56]:

TF.info("Finding qamets qatan special verb cases")
for w in F.otype.s("word"):
    ln = F.languageISO.v(w)
    if ln != "hbo":
        continue
    sp = F.sp.v(w)
    if sp != "verb":
        continue
    orig = get_orig(w, punct=False, tetra=False)
    if "@" not in orig:
        continue  # no qamets in word
    word = doaccents(orig)
    wordq = doplainqamets(word, accentless=True)
    if "^" not in wordq:
        continue  # no risk of unwanted qamets qatan
    #    if '!' in word: continue       # primary accent has been marked

    lex_info = get_lex_info(w)
    decl = get_decl(lex_info)
    if decl in qamets_qatan_verb_x:
        if (word, lex_info) in qq_verb_words:
            continue
        qq_verb_words.add((word, lex_info))
        qq_verb_specials.append((w, orig, word))
TF.info("{} cases".format(len(qq_verb_specials)))

 1m 15s Finding qamets qatan special verb cases
 1m 16s 524 cases
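
The lookup above keeps only the first word for each distinct combination of accented form and lexical information, so the reported number counts distinct combinations and not all words. A minimal sketch of that pattern, with a hypothetical helper `first_per_key` and invented sample tuples:

```python
def first_per_key(items, key):
    """Return the items whose key has not been seen before, in original order."""
    seen = set()
    firsts = []
    for item in items:
        k = key(item)
        if k in seen:
            continue  # a word with the same form and lexical info was kept already
        seen.add(k)
        firsts.append(item)
    return firsts


# Invented (node, form, lex_info) triples; nodes 1 and 2 share form and lex_info.
words = [(1, "formA", "info1"), (2, "formA", "info1"), (3, "formA", "info2")]
unique = first_per_key(words, key=lambda t: (t[1], t[2]))
print(unique)
```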

Show the cases¶

In [ ]:

TF.info("Showing verb cases")

In [ ]:

ncorr = 0
ngood = 0
vcases = []
verb_lexemes = set()
for (w, orig, word) in qq_verb_specials:
    wordph = phono(w, punct=False)  # with suppression in verb paradigms
    wordph_ns = phono(w, punct=False, suppress_in_verb=False)  # without it
    lex = F.lex.v(w)
    verb_lexemes.add(lex)
    if wordph == wordph_ns:
        # suppression made no difference for this word
        ngood += 1
        corr = ""
        comment = "qamets: no need to suppress qatan"
    else:
        # suppression changed the outcome: the qamets stays gadol
        ncorr += 1
        corr = "gadol"
        comment = "qamets: gadol maintained because of verb paradigm"
    vcases.append((corr, wordph_ns, wordph, lex, orig, w, comment))

In [57]:

showcases(
    "qamets verb",
    "{} lexemes".format(len(verb_lexemes)),
    vcases,
    order=True,
)

 1m 22s Showing verb cases
 1m 22s Generating HTML in qamets_verb_casesc.html
 1m 22s Generating test set qamets verb in qamets_verb_testsc.txt
192 lexemes
524 occurrences and 300 corrections

Making new tests: Qamets gadol qatan: pronominal suffixes¶

Usually a qamets in a closed, unaccented syllable is interpreted as qatan. But in pronominal suffixes a qamets is always gadol. We detect these cases and suppress the qamets-as-qatan interpretation there.
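
As a rough illustration of this rule, the sketch below treats a word as a stem plus a pronominal suffix, with `^` marking a qamets that the syllable rules would read as qatan. `toy_phono` is a hypothetical stand-in for the notebook's `phono`; only the keyword `suppress_in_prs` is borrowed from it, and the sample strings are invented, not real word forms.

```python
GADOL = "ā"
QATAN = "o"


def toy_phono(stem, suffix, suppress_in_prs=True):
    """Toy transliteration of stem + suffix, where '^' is a candidate qatan.

    In the stem the candidate is read as qatan. In the suffix it is read
    as gadol, unless suppress_in_prs is switched off.
    """
    stem_out = stem.replace("^", QATAN)
    suffix_out = suffix.replace("^", GADOL if suppress_in_prs else QATAN)
    return stem_out + suffix_out


def needs_correction(stem, suffix):
    """True if suppressing qatan in the suffix changes the result."""
    return toy_phono(stem, suffix) != toy_phono(stem, suffix, suppress_in_prs=False)


print(toy_phono("k^l", "-^m"))
print(toy_phono("k^l", "-^m", suppress_in_prs=False))
print(needs_correction("k^l", "-^m"), needs_correction("k^l", "-im"))
```

Comparing the outcome with and without suppression is also how the notebook decides whether a case counts as a correction.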

Look up the cases¶

In [ ]:

qq_prs_words = set()
qq_prs_specials = []

In [58]:

TF.info("Finding qamets qatan in pronominal suffixes")
for w in F.otype.s("word"):
    ln = F.languageISO.v(w)
    if ln != "hbo":
        continue
    lex_info = get_lex_info(w)
    prs = get_prs(lex_info)
    if "@" not in prs:
        continue  # no suffix, or no qamets in suffix
    orig = get_orig(w, punct=False, tetra=False)
    word = doaccents(orig)
    wordq = doplainqamets(word, accentless=False)
    if "^" not in wordq:
        continue  # no risk of unwanted qamets qatan
    if (word, lex_info) in qq_prs_words:
        continue
    qq_prs_words.add((word, lex_info))
    qq_prs_specials.append((w, orig, word))
TF.info("{} potential cases".format(len(qq_prs_specials)))

 1m 26s Finding qamets qatan in pronominal suffixes
 1m 28s 197 potential cases

Show the cases¶

In [ ]:

TF.info("Showing prs cases")

In [ ]:

ncorr = 0
ngood = 0
pcases = []
prs_lexemes = set()
for (w, orig, word) in qq_prs_specials:
    lex = F.lex.v(w)
    prs_lexemes.add(lex)
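    wordph = phono(w)  # with suppression in pronominal suffixes
    wordph_ns = phono(w, suppress_in_prs=False)  # without it
    if wordph == wordph_ns:
        # suppression made no difference for this word
        ngood += 1
        corr = ""
        comment = "qamets: no need to suppress qatan"
    else:
        # suppression changed the outcome: the qamets stays gadol
        ncorr += 1
        corr = "gadol"
        comment = "qamets: gadol maintained in pronominal suffix"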
    pcases.append((corr, wordph_ns, wordph, lex, orig, w, comment))

In [59]:

showcases(
    "qamets prs",
    "{} lexemes".format(len(prs_lexemes)),
    pcases,
    order=True,
)

 1m 32s Showing prs cases
 1m 32s Generating HTML in qamets_prs_casesc.html
 1m 32s Generating test set qamets prs in qamets_prs_testsc.txt
131 lexemes
197 occurrences and 60 corrections