Getting Heads 😶¶

conj and

מְצֹרָ֞ע

subs have skin-disease pual ptcp

phrase Subj NP

וַ

conj and

phrase Subj NP

חֲסַר־

adjv lacking

לָֽחֶם׃

subs bread

phrase 7

phrase Subj NP

advb even

art the

subs king

conj and

subs whole

subs servant

phrase 8

phrase Objc NP

subs horse

conj and

subs chariot

conj and

subs power

adjv heavy

phrase 9

phrase Objc NP

subs sound

subs chariot

subs sound

subs horse

subs sound

subs power

adjv great

phrase 10

phrase Subj NP

subs chief

art the

subs king

conj and

subs ten

subs man

phrase 11

phrase Subj NP

subs fish

art the

subs sea

conj and

subs birds

art the

subs heavens

conj and

subs wild animal

art the

subs open field

conj and

subs whole

art the

subs creeping animals

phrase Subj NP

conj and

phrase Subj NP

כֹל֙

subs whole

הָֽ

art the

אָדָ֔ם

subs human, mankind

phrase 12

phrase Subj NP

subs weight

art the

subs house

art the

prde this

art the

adjv at the back

phrase 13

phrase Subj NP

עָרֶ֖יהָ

subs town

phrase Subj NP

סְבִיבֹתֶ֑יהָ

subs surrounding

phrase Subj NP

conj and

phrase Subj NP

art the

subs south

conj and

art the

subs low land

phrase 14

phrase Time NP

יָמִ֣ים

subs day

רַבִּ֔ים

adjv much

phrase Time NP

subs eight

conj and

subs hundred

subs day

phrase 15

phrase Objc NP

subs voice

subs horn

subs pipe

subs zither

subs sambuca

subs psaltery

conj and

subs symphony

conj and

subs whole

subs sort

subs music for strings

phrase 16

phrase Objc NP

גַ֣ם

advb even

אֱֽלֹהֵיהֶ֡ם

subs god(s)

phrase Objc NP

עִם־

prep with

נְסִֽכֵיהֶם֩

subs libation

phrase Objc NP

עִם־

prep with

כְּלֵ֨י

subs tool

חֶמְדָּתָ֜ם

subs what is desirable

כֶּ֧סֶף

subs silver

conj and

זָהָ֛ב

subs gold

phrase 17

phrase Objc NP

subs weight

art the

subs silver

conj and

art the

subs gold

conj and

art the

subs tool

phrase 18

phrase Cmpl PP

prep upon

subs defilement

art the

subs priesthood

phrase Cmpl PP

conj and

phrase Cmpl PP

subs covenant

art the

subs priesthood

conj and

art the

subs Levite

phrase 19

phrase Adju NP

subs four

conj and

subs four

subs thousand

conj and

subs seven

subs hundred

conj and

subs six

phrase 20

phrase Objc NP

subs disk

subs bread

conj and

subs <type of cake>

conj and

subs raisin cake

Data Discovery¶

The queries which follow were written at different times during the code construction for the heads algorithm.

This section asks questions whose answers are needed to ensure the code is written correctly, and queries the BHSA data to answer them. These are questions like, "Do we need to check for relational independence only in noun phrases?" (no); and "Does every phrase type have a word with a corresponding pdp?" (no).

Make definitions available for exploration:¶

In [9]:

# mapping from phrase type to its head part of speech
type_to_pdp = {
    "VP": "verb",  # verb
    "NP": "subs",  # noun
    "PrNP": "nmpr",  # proper-noun
    "AdvP": "advb",  # adverbial
    "PP": "prep",  # prepositional
    "CP": "conj",  # conjunctive
    "PPrP": "prps",  # personal pronoun
    "DPrP": "prde",  # demonstrative pronoun
    "IPrP": "prin",  # interrogative pronoun
    "InjP": "intj",  # interjectional
    "NegP": "nega",  # negative
    "InrP": "inrg",  # interrogative
    "AdjP": "adjv",  # adjective
}

Test for non-NP phrases with valid `pdp` but invalid head¶

These tests demonstrate that subphrase relation checks are also needed for phrase types besides noun phrases. The only valid subphrase/phrase_atom relations for any potential head word are NA or par/Para. While a few phrase types, e.g. personal pronoun phrases, do not need additional relational checks, we can go ahead and handle all phrases consistently in the same way.

The only exception to the above rule is the VP, for which there are 14 cases of the VP's head word (verb) that is also in a subphrase with a regens (rec) relation.
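The rule above can be sketched as a small predicate. This is only an illustration under stated assumptions, not the notebook's implementation: `is_independent` and its parameters are hypothetical names, and the check ignores the extra test on parallel (par) words, which also depends on what the word is parallel to.

```python
# Hypothetical helper, not part of the heads algorithm itself.
# Relations under which a word can still be a head candidate.
INDEPENDENT_RELAS = {"NA", "par", "Para"}


def is_independent(relas, phrase_type=None, pdp=None):
    """Return True if every relation in `relas` is NA or parallel.

    Assumption: a verb inside a VP may additionally stand in a `rec`
    relation and still count as the head (the exception noted above).
    """
    allowed = set(INDEPENDENT_RELAS)
    if phrase_type == "VP" and pdp == "verb":
        allowed.add("rec")
    return set(relas) <= allowed


print(is_independent({"NA", "par"}))
print(is_independent({"rec"}, phrase_type="PP", pdp="prep"))
print(is_independent({"rec"}, phrase_type="VP", pdp="verb"))
```

Passing the relations as a set keeps the predicate independent of how they were collected, so the same check could be applied to phrases or phrase atoms.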

The operational question of these tests was:

Are there cases in which a non-NP phrase(atom) contains a word with the corresponding pdp value, but which is probably not a head?

To answer the question, we first survey all cases where the phrase type's head candidate is in a subphrase with a relation that is not normally "independent." Based on the survey, we manually check the most pertinent phrase types and results. The tests reveal that, indeed, relation checks are needed for many phrase types.

In [15]:

def test_pdp_safe(phrase_object="phrase_atom"):

    """
    Make a survey of phrase types and their matching `pdp` words,
    count what kinds of subphrase relations these words
    occur in. The survey can then be used to investigate
    whether phrase types besides noun phrases require relationship
    checks for independency.
    """

    pdp_relas_survey = collections.defaultdict(lambda: collections.Counter())
    headless = 0

    for phrase in F.otype.s(phrase_object):

        typ = F.typ.v(phrase)  # phrase type

        head_pdp = type_to_pdp[typ]

        maybe_heads = [w for w in L.d(phrase, "word") if F.pdp.v(w) == head_pdp]

        # this check shows that many
        # phrases don't have a word
        # with a corresponding pdp!
        if not maybe_heads:
            headless += 1

        # survey the candidate heads' relations
        for word in maybe_heads:

            head_name = typ + "|" + head_pdp
            subphrases = L.u(word, "subphrase")
            sp_relas = (
                set(F.rela.v(sp) for sp in subphrases) if subphrases else {"NA"}
            )  # <- handle cases without any subphrases (i.e. verbs)

            pdp_relas_survey[head_name].update(sp_relas)

    print(f"{phrase_object}s without matching pdp: {headless}\n")
    print("subphrase relation survey: ")
    for name, rela_counts in pdp_relas_survey.items():

        print(name)

        for r, count in rela_counts.items():
            print("\t", r, "-", count)

In [16]:

# for phrase_atoms
test_pdp_safe()

phrase_atoms without matching pdp: 847

subphrase relation survey: 
PP|prep
	 NA - 64521
	 par - 3824
	 adj - 42
	 rec - 8
VP|verb
	 NA - 69010
	 rec - 14
	 par - 1
NP|subs
	 NA - 53881
	 par - 5868
	 rec - 11628
	 adj - 2936
	 atr - 69
CP|conj
	 NA - 53859
AdvP|advb
	 NA - 5131
	 par - 102
	 mod - 49
	 adj - 1
AdjP|adjv
	 NA - 1848
	 par - 135
	 atr - 5
	 adj - 3
	 rec - 1
InjP|intj
	 NA - 1872
	 par - 11
DPrP|prde
	 NA - 790
PrNP|nmpr
	 NA - 11794
	 par - 1478
	 adj - 210
	 rec - 83
NegP|nega
	 NA - 6742
PPrP|prps
	 NA - 4468
	 par - 9
IPrP|prin
	 NA - 797
	 par - 1
InrP|inrg
	 NA - 1288
	 par - 3

In [17]:

# and for phrases
test_pdp_safe(phrase_object="phrase")

phrases without matching pdp: 679

subphrase relation survey: 
PP|prep
	 NA - 62315
	 par - 3678
	 adj - 42
	 rec - 9
VP|verb
	 NA - 69010
	 rec - 14
	 par - 1
NP|subs
	 NA - 50092
	 par - 5808
	 rec - 11214
	 adj - 2927
	 atr - 51
CP|conj
	 NA - 52544
AdvP|advb
	 NA - 5083
	 par - 101
	 mod - 46
	 adj - 1
AdjP|adjv
	 NA - 1800
	 par - 118
	 atr - 5
	 adj - 3
	 rec - 1
InjP|intj
	 NA - 1872
	 par - 11
DPrP|prde
	 NA - 791
PrNP|nmpr
	 NA - 11138
	 par - 1380
	 rec - 1267
	 adj - 209
NegP|nega
	 NA - 6742
PPrP|prps
	 NA - 4388
	 par - 9
IPrP|prin
	 NA - 797
	 par - 1
InrP|inrg
	 NA - 1288
	 par - 3

^ These surveys tell us that for several of these phrase types, e.g. InjP, we can automatically take the word with the pdp value that corresponds with its phrase type as the head.
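One way to read the phrase_atom survey programmatically is to separate the phrase types whose candidates only ever show NA or par from those with other relations. The sketch below hard-codes the counts printed above; the names `survey`, `safe_types` and `needs_check` are made up for this illustration, and treating par as unproblematic is a simplifying assumption.

```python
# Counts copied from the phrase_atom survey output above.
survey = {
    "PP": {"NA": 64521, "par": 3824, "adj": 42, "rec": 8},
    "VP": {"NA": 69010, "rec": 14, "par": 1},
    "NP": {"NA": 53881, "par": 5868, "rec": 11628, "adj": 2936, "atr": 69},
    "CP": {"NA": 53859},
    "AdvP": {"NA": 5131, "par": 102, "mod": 49, "adj": 1},
    "AdjP": {"NA": 1848, "par": 135, "atr": 5, "adj": 3, "rec": 1},
    "InjP": {"NA": 1872, "par": 11},
    "DPrP": {"NA": 790},
    "PrNP": {"NA": 11794, "par": 1478, "adj": 210, "rec": 83},
    "NegP": {"NA": 6742},
    "PPrP": {"NA": 4468, "par": 9},
    "IPrP": {"NA": 797, "par": 1},
    "InrP": {"NA": 1288, "par": 3},
}

independent = {"NA", "par"}

# Phrase types whose head candidates never show a dependent relation.
safe_types = sorted(t for t, relas in survey.items() if set(relas) <= independent)

# Phrase types that need a relation check, with the relations to inspect.
needs_check = {
    t: sorted(set(relas) - independent)
    for t, relas in survey.items()
    if set(relas) - independent
}

print(safe_types)
for typ, relas in needs_check.items():
    print(typ, relas)
```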

There are also quite a few cases where the phrase type does not have a word with a matching pdp value: 847 for phrase atoms and 679 for phrases. In the subsequent section we will run tests to find out why this is the case.

Back to the question of this section: There are 14 examples of VP with verbs that have a rec (nomen regens) relation. Are these heads or not? We check now...

In [31]:

def find_and_show(search_pattern):
    results = sorted(B.search(search_pattern))
    print(len(results), "results")
    B.show(results, end=20, condenseType="phrase", withNodes=True)

In [32]:

# run notebook locally to see HTML-formatted results for the below searches


rec_verbs = """

phrase_atom typ=VP
    subphrase rela=rec
        word pdp=verb
"""

find_and_show(rec_verbs)

  0.86s 14 results
14 results

phrase 1

664760

phrase 664760 PreS VP

21355

בְּ

prep in

21356

עֵ֣ת

subs time

21357

לִדְתָּ֑הּ

verb bear qal infc

phrase 2

672346

phrase 672346 PreS VP

33361

prep in

33362

עֲב֖וּר

subs way

33363

הַרְאֹתְךָ֣

verb see hif infc

phrase 3

691339

phrase 691339 PreS VP

68049

prep from

68050

שְּׁנַת֙

subs year

68051

הִמָּ֣כְרֹו

verb sell nif infc

phrase 4

760045

phrase 760045 Pred VP

188292

prep from

188293

דֵּי־

subs sufficiency

188294

בֹ֥א

verb come qal infc

phrase 5

765331

phrase 765331 PreS VP

196651

prep from

196652

דֵּ֣י

subs sufficiency

196653

עָבְרֹ֔ו

verb pass qal infc

phrase 6

770954

phrase 770954 PreS VP

206121

בִּ

prep in

206122

תְחִלַּת֙

subs beginning

206123

שִׁבְתָּ֣ם

verb sit qal infc

phrase 7

774060

phrase 774060 PreS VP

212011

בִּ

prep in

212012

שְׁנַ֣ת

subs year

212013

מָלְכֹ֗ו

verb be king qal infc

phrase 8

779643

phrase 779643 PreS VP

221146

prep from

221147

דֵּ֤י

subs sufficiency

221148

עָבְרֹו֙

verb pass qal infc

phrase 9

799114

phrase 799114 PreS VP

251009

prep from

251010

דֵּ֤י

subs sufficiency

251011

דַבְּרִי֙

verb speak piel infc

phrase 10

804887

phrase 804887 PreS VP

261523

prep from

261524

קֹּ֣ול

subs sound

261525

נִפְלָ֔ם

verb fall qal infc

phrase 11

810542

phrase 810542 PreC VP

270676

prep from

270677

בְּלִ֣י

subs destruction

270678

עֹובֵ֔ר

verb pass qal ptca

phrase 12

834935

phrase 834935 Pred VP

310033

prep from

310034

אֵ֣ין

subs <NEG>

310035

עֹ֗וד

subs duration

310036

פְּנֹות֙

verb turn qal infc

phrase 13

902929

phrase 902929 PreS VP

422921

prep from

422922

לְּ

prep to

422923

בַ֞ד

subs linen, part, stave

422924

הִתְיַחְשָׂ֣ם

verb register hit infc

phrase 14

903656

phrase 903656 PreS VP

424337

לִ

prep to

424338

פְנֵ֖י

subs face

424339

הִכָּנְעֹ֑ו

verb be humble nif infc

In all 14 results, the verb serves as the true head word of the VP.

Note: The verb will prove to be an exception, as words of all other parts of speech in a rec relation are not head words.

The PP also has some strange relations. We see what's going on with the same kind of inspection. First we look at the rec (regens) relations.

In [33]:

rec_preps = """

phrase_atom typ=PP
    subphrase rela=rec
        word pdp=prep
"""

find_and_show(rec_preps)

  0.87s 12 results
12 results

phrase 1

701824

phrase 701824 Adju PP

87947

prep from

87948

לְּ

prep to

87949

בַד֩

subs linen, part, stave

87950

עֹלַ֨ת

subs burnt-offering

87951

art the

87952

חֹ֜דֶשׁ

subs month

87953

conj and

87954

מִנְחָתָ֗הּ

subs present

87955

conj and

87956

עֹלַ֤ת

subs burnt-offering

87957

art the

87958

תָּמִיד֙

subs continuity

87959

conj and

87960

מִנְחָתָ֔הּ

subs present

87961

conj and

87962

נִסְכֵּיהֶ֖ם

subs libation

87963

כְּ

prep as

87964

מִשְׁפָּטָ֑ם

subs justice

phrase 2

729575

phrase 729575 Objc PP

138117

אֶת־

prep <object marker>

138118

יַ֤ד

subs hand

138119

אַחַד֙

subs one

138120

prep from

138121

בָּנָ֔יו

subs son

phrase 3

763550

phrase 763550 PreC PP

193931

כִּ

prep as

193932

דְבַ֛ר

subs word

193933

אַחַ֥ד

subs one

193934

מֵהֶ֖ם

prep from

phrase 4

793213

phrase 793213 Adju PP

240898

prep from

240899

רָעַ֣ת

subs evil

240900

יֹֽשְׁבֵי־

subs sit qal ptca

240901

בָ֗הּ

prep in

phrase 5

848705

phrase 848705 Adju PP

329764

מֵ֝

prep from

329765

רָעַ֗ת

subs evil

329766

יֹ֣שְׁבֵי

subs sit qal ptca

329767

בָֽהּ׃

prep in

phrase 6

893491

phrase 893491 Adju PP

404050

prep to

404051

עֻמַּת֙

subs side

404052

כַּ

prep as

404053

art the

404054

קָּטֹ֣ן

subs small

404055

כַּ

prep as

404056

art the

404057

גָּדֹ֔ול

subs great

404058

מֵבִ֖ין

subs understand hif ptca

404059

עִם־

prep with

404060

תַּלְמִֽיד׃ פ

subs scholar

The PP is different. In cases where the preposition sits in a subphrase with a rec relation, the preposition is not the head. Thus, the algorithm will need to check for these cases.

Now for the adj subphrase relation in PP:

In [34]:

adj_preps = """

phrase_atom typ=PP
    subphrase rela=adj
        word pdp=prep
"""

find_and_show(adj_preps)

  0.88s 42 results
42 results

phrase 1

659013

phrase 659013 Cmpl PP

12691

מֵֽ

prep from

12692

חֲוִילָ֜ה

nmpr Havilah

12693

prep unto

12694

שׁ֗וּר

nmpr Shur

phrase 2

674965

phrase 674965 Time PP

37803

prep in

37804

art the

37805

בֹּ֣קֶר

subs morning

37806

prep in

37807

art the

37808

בֹּ֔קֶר

subs morning

phrase 3

675546

phrase 675546 Time PP

38736

prep from

38737

art the

38738

בֹּ֖קֶר

subs morning

38739

prep unto

38740

art the

38741

עָֽרֶב׃

subs evening

phrase 4

675571

phrase 675571 Time PP

38778

prep from

38779

בֹּ֥קֶר

subs morning

38780

prep unto

38781

עָֽרֶב׃

subs evening

phrase 5

676757

phrase 676757 Subj NP

40647

art the

40648

גְּנֵבָ֗ה

subs what is stolen

phrase 676757 Subj NP

40649

prep from

40650

שֹּׁ֧ור

subs bullock

40651

prep unto

40652

חֲמֹ֛ור

subs he-ass

40653

prep unto

40654

שֶׂ֖ה

subs lamb

phrase 6

677725

phrase 677725 Adju PP

42226

prep from

42227

קָּצָה֙

subs end

42228

prep from

42229

זֶּ֔ה

prde this

phrase 7

677728

phrase 677728 Adju PP

42233

prep from

42234

קָּצָ֖ה

subs end

42235

prep from

42236

זֶּ֑ה

prde this

phrase 8

678437

phrase 678437 Time PP

43705

prep from

43706

עֶ֥רֶב

subs evening

43707

prep unto

43708

בֹּ֖קֶר

subs morning

phrase 9

679353

phrase 679353 Time PP

45628

prep in

45629

art the

45630

בֹּ֣קֶר

subs morning

45631

prep in

45632

art the

45633

בֹּ֗קֶר

subs morning

phrase 10

681309

phrase 681309 Time PP

49254

prep in

49255

art the

49256

בֹּ֥קֶר

subs morning

49257

prep in

49258

art the

49259

בֹּֽקֶר׃

subs morning

phrase 11

684131

phrase 684131 Time PP

55004

prep in

55005

art the

55006

בֹּ֣קֶר

subs morning

55007

prep in

55008

art the

55009

בֹּ֑קֶר

subs morning

phrase 12

686147

phrase 686147 Cmpl PP

58901

אֶל־

prep to

58902

אַהֲרֹ֣ן

nmpr Aaron

phrase 686147 Cmpl PP

58903

art the

58904

כֹּהֵ֔ן

subs priest

phrase 686147 Cmpl PP

58905

אֹ֛ו

conj or

phrase 686147 Cmpl PP

58906

אֶל־

prep to

58907

אַחַ֥ד

subs one

58908

prep from

58909

בָּנָ֖יו

subs son

phrase 686147 Cmpl PP

58910

art the

58911

כֹּהֲנִֽים׃

subs priest

phrase 13

695140

phrase 695140 PreC PP

76198

prep from

76199

עֶ֣רֶב

subs evening

76200

prep unto

76201

בֹּ֔קֶר

subs morning

phrase 14

701824

phrase 701824 Adju PP

87947

prep from

87948

לְּ

prep to

87949

בַד֩

subs linen, part, stave

87950

עֹלַ֨ת

subs burnt-offering

87951

art the

87952

חֹ֜דֶשׁ

subs month

87953

conj and

87954

מִנְחָתָ֗הּ

subs present

87955

conj and

87956

עֹלַ֤ת

subs burnt-offering

87957

art the

87958

תָּמִיד֙

subs continuity

87959

conj and

87960

מִנְחָתָ֔הּ

subs present

87961

conj and

87962

נִסְכֵּיהֶ֖ם

subs libation

87963

כְּ

prep as

87964

מִשְׁפָּטָ֑ם

subs justice

phrase 15

702443

phrase 702443 Objc NP

89462

אֶחָ֣ד׀

subs one

89463

אָחֻ֣ז

subs seize qal ptcp

phrase 702443 Objc NP

89464

prep from

89465

art the

89466

חֲמִשִּׁ֗ים

subs five

phrase 702443 Objc NP

89467

prep from

89468

art the

89469

אָדָ֧ם

subs human, mankind

89470

prep from

89471

art the

89472

בָּקָ֛ר

subs cattle

89473

prep from

89474

art the

89475

חֲמֹרִ֥ים

subs he-ass

89476

conj and

89477

prep from

89478

art the

89479

צֹּ֖אן

subs cattle

89480

prep from

89481

כָּל־

subs whole

89482

art the

89483

בְּהֵמָ֑ה

subs cattle

phrase 16

705970

phrase 705970 Cmpl PP

96035

בְּ

prep in

96036

זַרְעֹ֖ו

subs seed

96037

אַחֲרָ֑יו

prep after

phrase 17

706038

phrase 706038 Cmpl PP

96162

אֶל־

prep to

96163

אַחַ֛ת

subs one

96164

prep from

96165

הֶ

art the

96166

עָרִ֥ים

subs town

96167

art the

96168

אֵ֖ל

prde these

phrase 18

709891

phrase 709891 Adju PP

103115

בֵּֽין־

prep interval

103116

דָּ֨ם׀

subs blood

103117

prep to

103118

דָ֜ם

subs blood

103119

בֵּֽין־

prep interval

103120

דִּ֣ין

subs claim

103121

prep to

103122

דִ֗ין

subs claim

103123

conj and

103124

בֵ֥ין

prep interval

103125

נֶ֨גַע֙

subs stroke

103126

לָ

prep to

103127

נֶ֔גַע

subs stroke

phrase 19

722079

phrase 722079 Cmpl PP

125932

בֵּינֵ֣ינוּ

prep interval

125933

conj and

125934

בֵינֵיכֶ֗ם

prep interval

125935

conj and

125936

בֵ֣ין

prep interval

125937

דֹּרֹותֵינוּ֮

subs generation

125938

אַחֲרֵינוּ֒

prep after

phrase 20

749909

phrase 749909 Adju PP

170284

advb even

170285

prep to

170286

דָוִ֖ד

nmpr David

170287

גַּ֥ם

advb even

170288

prep to

170289

אַבְשָׁלֹֽם׃ ס

nmpr Absalom

The results above show that prepositions in a subphrase with an adj relation are also not heads. These cases have to be excluded.

Now we move on to test the adverb relations reflected in the survey...

In [35]:

adv_adj = """

phrase_atom typ=AdvP
    subphrase rela=adj
        word pdp=advb

"""

find_and_show(adv_adj)

  0.76s 1 result
1 results

phrase 1

883872

phrase 883872 Modi AdvP

383762

הַרְבֵּ֥ה

advb be many hif infa

383763

מְאֹֽד׃

advb might

The adverb in an adj relation within the adverbial phrase is also not a true head. Now for the mod (modifier) relation.

In [36]:

adv_mod = """

phrase_atom typ=AdvP
    subphrase rela=mod
        word pdp=advb

"""

find_and_show(adv_mod)

  0.77s 49 results
49 results

phrase 1

655580

phrase 655580 Modi AdvP

7275

גַ֥ם

advb even

7276

הֲלֹ֛ם

advb hither

phrase 2

656048

phrase 656048 Modi AdvP

8037

אַ֥ף

advb even

8038

אֻמְנָ֛ם

advb really

phrase 3

657025

phrase 657025 Modi AdvP

9488

גַם־

advb even

9489

אָמְנָ֗ה

advb indeed

phrase 4

661761

phrase 661761 Modi AdvP

16648

advb even

16649

אָכֹ֖ול

advb eat qal infa

phrase 5

665316

phrase 665316 Loca AdvP

22190

גַם־

advb even

22191

פֹּה֙

advb here

phrase 6

667158

phrase 667158 Time AdvP

25005

advb even

25006

עַתָּ֥ה

advb now

phrase 7

667895

phrase 667895 Modi AdvP

26052

גַם־

advb even

26053

עָלֹ֑ה

advb ascend qal infa

phrase 8

672097

phrase 672097 Modi AdvP

32927

רַ֛ק

advb only

32928

הַרְחֵ֥ק

advb be far hif infa

phrase 9

697540

phrase 697540 Modi AdvP

80306

advb even

80307

הִשְׂתָּרֵֽר׃

advb rule hit infa

phrase 10

700460

phrase 700460 Modi AdvP

85109

advb even

85110

קֹ֖ב

advb curse qal infc

phrase 11

700463

phrase 700463 Modi AdvP

85113

advb even

85114

בָּרֵ֖ךְ

advb bless piel infa

phrase 12

705225

phrase 705225 Adju PP

94602

prep to

94603

בַ֛ד

subs linen, part, stave

phrase 705225 Adju PP

94604

prep from

94605

עָרֵ֥י

subs town

94606

art the

94607

פְּרָזִ֖י

subs open country

phrase 705225 Adju PP

94608

הַרְבֵּ֥ה

advb be many hif infa

94609

מְאֹֽד׃

advb might

phrase 13

716546

phrase 716546 Modi AdvP

114323

הַרְחֵ֨ק

advb be far hif infa

114324

מְאֹ֜ד

advb might

phrase 14

719700

phrase 719700 Modi AdvP

120380

הַרְבֵּֽה־

advb be many hif infa

120381

מְאֹ֖ד

advb might

phrase 15

721834

phrase 721834 Adju PP

125415

בִּ

prep in

125416

נְכָסִ֨ים

subs riches

125417

רַבִּ֜ים

adjv much

phrase 721834 Adju PP

125421

וּ

conj and

phrase 721834 Adju PP

125422

בְ

prep in

125423

מִקְנֶ֣ה

subs purchase

125424

רַב־

adjv much

125425

מְאֹ֔ד

advb might

phrase 721834 Adju PP

125426

בְּ

prep in

125427

כֶ֨סֶף

subs silver

125428

וּ

conj and

125429

בְ

prep in

125430

זָהָ֜ב

subs gold

125431

וּ

conj and

125432

בִ

prep in

125433

נְחֹ֧שֶׁת

subs bronze

125434

וּ

conj and

125435

בְ

prep in

125436

בַרְזֶ֛ל

subs iron

125437

וּ

conj and

125438

בִ

prep in

125439

שְׂלָמֹ֖ות

subs wrapper

phrase 721834 Adju PP

125440

הַרְבֵּ֣ה

advb be many hif infa

125441

מְאֹ֑ד

advb might

phrase 16

731154

phrase 731154 Modi AdvP

140793

אַךְ֩

advb only

140794

נִגֹּ֨וף

advb hurt nif infa

phrase 17

735506

phrase 735506 Time AdvP

147570

advb even

147571

עַתָּה֙

advb now

phrase 18

741485

phrase 741485 Time AdvP

156845

advb even

156846

לַ֖יְלָה

advb night

156847

advb even

156848

יֹומָ֑ם

advb by day

phrase 741485 Time AdvP

156849

כָּל־

subs whole

156850

יְמֵ֛י

subs day

phrase 19

744756

phrase 744756 Time AdvP

162031

advb even

162032

תְּמֹול֙

advb yesterday

162033

advb even

162034

שִׁלְשֹׁ֔ם

advb day before yesterday

phrase 20

745299

phrase 745299 Time AdvP

162945

advb even

162946

אֶתְמֹ֣ול

advb yesterday

162947

advb even

162948

שִׁלְשֹׁ֗ום

advb day before yesterday

In this case, it appears that mod is also an invalid relation for adverb phrases. An example is גם הלם ('also here') where גם is the adverb in mod relation, but the head is really הלם "here" (also an adverb). In several cases, the modifier modifies a verb. In these cases the "head," often a participle or infinitive, acts as the adverb, even though it is not explicitly marked as such.
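To make the גם הלם example concrete, here is a toy sketch of how such a filter might look. It does not use Text-Fabric: the word rows are written out by hand, `advp_heads` is a hypothetical helper, and the NA relation given to הלם is an assumption, since the search output above only confirms the mod relation on גם.

```python
# Relations that the inspections above rule out for AdvP head candidates.
ADVP_NON_HEAD_RELAS = {"adj", "mod"}


def advp_heads(words):
    """Return nodes of adverbs that are not in an excluded relation.

    `words` is a list of (node, gloss, pdp, relas) tuples for one AdvP.
    """
    return [
        node
        for node, gloss, pdp, relas in words
        if pdp == "advb" and not (relas & ADVP_NON_HEAD_RELAS)
    ]


# Hand-written rows for phrase 655580 (nodes and glosses from the output above).
gam_halom = [
    (7275, "even", "advb", {"mod"}),
    (7276, "hither", "advb", {"NA"}),  # relation assumed for illustration
]

print(advp_heads(gam_halom))
```

Only the second adverb survives the filter, which matches the manual reading that הלם is the head.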

Now we move on to the last examination, that of the AdjP (adjective phrase). There are three relations of interest:

atr - 5
adj - 3
rec - 1

In [37]:

adj_atr = """

phrase_atom typ=AdjP
    subphrase rela=atr
        word pdp=adjv

"""

find_and_show(adj_atr)

  0.77s 5 results
5 results

phrase 1

661695

phrase 661695 PreC AdjP

16552

עֲקֻדִּ֥ים

adjv twisted

16553

נְקֻדִּ֖ים

adjv speckled

16554

conj and

16555

בְרֻדִּֽים׃

adjv speckled

phrase 2

661716

phrase 661716 PreC AdjP

16584

עֲקֻדִּ֥ים

adjv twisted

16585

נְקֻדִּ֖ים

adjv speckled

16586

conj and

16587

בְרֻדִּ֑ים

adjv speckled

phrase 3

715006

phrase 715006 PreC AdjP

111758

לְחֻ֥מֵי

adjv feed qal ptcp

111759

רֶ֖שֶׁף

subs flame

111760

conj and

111761

קֶ֣טֶב

subs sting

111762

מְרִירִ֑י

adjv bitter

In [38]:

adj_adj = """

phrase_atom typ=AdjP
    subphrase rela=adj
        word pdp=adjv

"""

find_and_show(adj_adj)

  0.78s 3 results
3 results

phrase 1

686718

phrase 686718 PreC AdjP

59788

לְבָנָ֣ה

adjv white

59789

אֲדַמְדֶּ֔מֶת

adjv reddish

phrase 2

853166

phrase 853166 PreC AdjP

336212

תָּ֧ם

adjv complete

336213

conj and

336214

יָשָׁ֛ר

adjv right

336215

יְרֵ֥א

adjv afraid

336216

אֱלֹהִ֖ים

subs god(s)

phrase 3

853425

phrase 853425 PreC AdjP

336595

תָּ֧ם

adjv complete

336596

conj and

336597

יָשָׁ֛ר

adjv right

336598

יְרֵ֥א

adjv afraid

336599

אֱלֹהִ֖ים

subs god(s)

In [39]:

adj_rec = """

phrase_atom typ=AdjP
    subphrase rela=rec
        word pdp=adjv

"""

find_and_show(adj_rec)

  0.83s 1 result
1 results

phrase 1

715006

phrase 715006 PreC AdjP

111758

לְחֻ֥מֵי

adjv feed qal ptcp

111759

רֶ֖שֶׁף

subs flame

111760

conj and

111761

קֶ֣טֶב

subs sting

111762

מְרִירִ֑י

adjv bitter

The results for the three searches above show indeed that adjectives in the atr, adj, and rec relations are not head words.

Tests for phrase types without a word that has a valid `pdp` value¶

The initial survey above revealed that 847 phrase atoms and 679 phrases lack a word with a corresponding pdp value. Here we investigate to see why that is the case. Is there a way to compensate for this problem? Are these truly phrases that lack heads?

We run another survey and count the phrase types against the non-matching pdp values found within them. At this point, we must also exclude words that have dependent relations, i.e. any subphrase relation other than the independent values defined above (NA or parallel).

In [49]:

count_no_pdp = collections.defaultdict(lambda: collections.Counter())
record_no_pdp = collections.defaultdict(lambda: collections.defaultdict(list))

for phrase in F.otype.s("phrase_atom"):

    typ = F.typ.v(phrase)

    # check whether any word has the corresponding `pdp` value
    corres_pdp = type_to_pdp[typ]
    corresponding_pdps = [w for w in L.d(phrase, "word") if F.pdp.v(w) == corres_pdp]

    if not corresponding_pdps:

        # put potential heads here
        maybe_heads = []

        # calculate subphrase relations
        for word in L.d(phrase, "word"):

            # get subphrase relations
            word_subphrs = L.u(word, "subphrase")
            sp_relas = set(F.rela.v(sp) for sp in word_subphrs) or {"NA"}

            # check subphrase relations for independence
            if sp_relas == {"NA"}:
                maybe_heads.append(word)

            # test parallel relation for independence
            elif sp_relas == {"NA", "par"} or sp_relas == {"par"}:

                # check for good, head mothers
                good_mothers = set(
                    sp for w in maybe_heads for sp in L.u(w, "subphrase")
                )
                this_daughter = [sp for sp in word_subphrs if F.rela.v(sp) == "par"][0]
                this_mother = E.mother.f(this_daughter)

                if this_mother in good_mothers:
                    maybe_heads.append(word)

        # sanity check
        # maybe_heads should have SOMETHING
        if not maybe_heads:
            raise Exception(f"phrase {phrase} looks HEADLESS!")

        # count pdp types
        head_pdps = [F.pdp.v(w) for w in maybe_heads]
        count_no_pdp[typ].update(head_pdps)

        # save for examination
        for word in maybe_heads:
            record_no_pdp[typ][F.pdp.v(word)].append((phrase, word))

for name, counts in count_no_pdp.items():

    print(name)

    for pdp, count in counts.items():
        print("\t", pdp, "-", count)

AdvP
	 nmpr - 253
	 subs - 499
	 art - 190
	 conj - 13
PrNP
	 subs - 9
	 art - 3
CP
	 prep - 85
	 subs - 79
	 advb - 6
NP
	 intj - 1

These results are a bit puzzling. The numbers here are words within the phrase atoms that have NO subphrase relations. That means, for example, words such as הַ "the" do not appear to have any subphrase relation to their modified nouns. That again illustrates the shortcoming of the ETCBC data in this respect. There should be a relation from the article to the determined noun.

From this point forward, I will begin working through all four phrase types and the cases reflected in the survey.

We begin with the AdvP type and the article. Upon some initial inspection, I've found that in many of the AdvP with the article, there is also a substantive (subs) that was found by the search. Are there any cases where there is no nmpr or subs found alongside the article? We can use the dict record_no_pdp which has recorded all cases reflected in the survey. Below I look to see if all 190 cases of an article in these AdvP phrases also have a corresponding noun.

In [50]:

no_noun = []

# each record is a (phrase_atom, word) pair
for phrase, word in record_no_pdp["AdvP"]["art"]:

    pdps = set(F.pdp.v(w) for w in L.d(phrase, "word"))

    if not {"nmpr", "subs"} & pdps:
        no_noun.append((phrase, word))

print(len(no_noun), "without nouns found...")

0 without nouns found...

There it is. So all cases of these articles can be discarded. In these cases, the noun serves as the head of the adverbial phrase. An example of this is when the noun marks the location of the action (hence adverb).

Next, we check the conjunctions found in the adverbial phrases. Are any of those heads?

In [51]:

# B.show(record_no_pdp['AdvP']['conj']) # uncomment me!

All conjunctions in these AdvP phrases function to mark coordinate elements (only ו in these results). They can also be discarded as not possible heads.

Now we investigate the PrNP results with subs and art...

In [52]:

# B.show(record_no_pdp['PrNP']['subs']) # uncomment me!

In [53]:

# B.show(record_no_pdp['PrNP']['art']) # uncomment me!

The articles found in the second search are not heads, but are all related to a substantive. All of the results in subs are heads. Thus, the only acceptable pdp for PrNP besides a proper noun is subs.

Now we dig into CP results. 85 of them have no pdp of conjunction, but have a preposition instead. Let's see what's going on...

In [54]:

# B.show(record_no_pdp['CP']['prep'][:20]) # uncomment me!

These are very interesting results. These conjunction phrases are made up of constructions like ב+עבור and ב+טרם. Together these words function as a conjunction, but alone they are prepositions and particles. Is it even possible in this case to say that there is a "head"?

It could be said that these combinations of words mean more than the sum of their parts; they are good examples of constructions, i.e. combinations of words whose meaning cannot be inferred simply from their individual words. Constructions illustrate the vague boundary between syntax and lexicon (cf. e.g. Goldberg, 1995, Constructions).

While these words are indeed marked as conjunction phrases, it is better in this case to analyse them as prepositional phrases (which they also are...this is another shortcoming of our data, or perhaps a mistake??). Thus, the head is the preposition, not the prepositional object.

We should expect that the remaining subs and advb groups are in fact the objects of those prepositions (and hence excluded). Let's test that assumption by looking for a preposition immediately before these words...

In [55]:

no_prep = []

for (phrase, word) in record_no_pdp["CP"]["subs"] + record_no_pdp["CP"]["advb"]:

    possible_prep = word - 1

    if F.sp.v(possible_prep) != "prep":
        no_prep.append((phrase, word))

print(f"subs|advb with no preceding prepositions: {len(no_prep)}")

subs|advb with no preceding prepositions: 0

Here we see. We can confirm that none of the nouns or adverbs will be the head of a conjunction phrase. A preposition is the only other kind of head for the CP besides a conjunction itself.
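The findings so far can be collected into a small lookup of acceptable head parts of speech per phrase type. This is an illustrative summary rather than the algorithm's actual data structure: `PRIMARY_PDP` repeats only the relevant entries of `type_to_pdp`, while `FALLBACK_PDP` and `acceptable_head_pdps` are names invented here.

```python
# Primary head pdp per phrase type (subset of the notebook's type_to_pdp).
PRIMARY_PDP = {"AdvP": "advb", "PrNP": "nmpr", "CP": "conj", "NP": "subs"}

# Extra head pdp values suggested by the inspections above.
FALLBACK_PDP = {
    "AdvP": {"subs", "nmpr"},  # a noun can head an adverbial phrase
    "PrNP": {"subs"},          # a substantive can head a proper-noun phrase
    "CP": {"prep"},            # constructions like preposition + particle
}


def acceptable_head_pdps(phrase_type):
    """Return the set of pdp values a head of `phrase_type` may have."""
    fallbacks = FALLBACK_PDP.get(phrase_type, set())
    return {PRIMARY_PDP[phrase_type]} | fallbacks


for typ in PRIMARY_PDP:
    print(typ, sorted(acceptable_head_pdps(typ)))
```

Articles and coordinating conjunctions are deliberately absent from the fallbacks, since the checks above showed they never act as heads.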

Finally, we're left with one last noun phrase (NP) for which no matching noun was found. The search found an intj (interjection) instead. Let's see it.

In [56]:

# B.show(record_no_pdp['NP']['intj']) # uncomment me

In this case, the word אוי "woe" functions like a noun. This appears to be another mislabelled pdp value, since it should read subs. This, like the previous example, will not receive a head value due to the mistake.

Retrieving Quantified Words¶

When the heads algorithm looks for a noun without any subphrase relations in the phrase, it will often return a quantifier noun such as a number, e.g. שבעה "seven", or another descriptor such as כל. But these words function semantically in a more descriptive role than a head role. Thus, we want our algorithm to isolate quantified nouns from their quantifiers. To do that, we must first know how the ETCBC encodes the relationship between a quantifier and the quantified noun.

In a previous algorithm used for extracting quantified nouns, we looked for a nomen regens relation on the quantifier and located the noun within the related subphrase. This approach works well for the quantifier כל. But for cardinal numbers, the relation adj (adjunct) is often used as well (as seen in the surveys below).

To illustrate with the search below, the quantifier שבעה "seven" has no nomen regens relation:

In [57]:

# B.show([(2217,)]) # uncomment me!

Rather than reflecting a regens/rectum relation, the second word שנים "years," the quantified noun, has a subphrase relation of adj "adjunct":

In [58]:

print(T.text(L.d(652883, "word")))
print()

for sp in L.u(2218, "subphrase"):  # subphrases belonging to "years"
    print(sp, "(subphrase)")
    print("\t", T.text(L.d(sp, "word")))
    print("\t", "rela:", F.rela.v(sp))
    print()

שֶׁ֣בַע שָׁנִ֔ים וּשְׁמֹנֶ֥ה מֵאֹ֖ות שָׁנָ֑ה 

1301096 (subphrase)
	 שָׁנִ֔ים 
	 rela: adj

1301097 (subphrase)
	 שֶׁ֣בַע שָׁנִ֔ים 
	 rela: NA

Let's see what other kinds of subphrase relations are reflected by quantifieds.

Below we make a survey of all mother-daughter relations between a quantifier subphrase and its daughters. The goal is to isolate those relationships which contain the quantified noun. We work through examples to get an idea of the meaning of the features. And we write a few TF search queries further below to confirm hypotheses about these relationships.

In [59]:

quant_relas = collections.defaultdict(lambda: collections.Counter())
quant_ex = collections.defaultdict(lambda: collections.defaultdict(list))
quants = [word for lex in F.ls.s("card") for word in L.d(lex, "word")]

for word in quants:

    subphrases = L.u(word, "subphrase")
    sp_daughters = [E.mother.t(sp) for sp in subphrases if E.mother.t(sp)]
    sp_daughters += [E.mother.t(word)] if E.mother.t(word) else list()
    sp_relas = [F.rela.v(sp[0]) for sp in sp_daughters]
    quant_relas[F.lex.v(word)].update(sp_relas)

    for rela in sp_relas:
        quant_ex[F.lex.v(word)][rela].append((L.u(word, "phrase")[0], word))

for name, counts in quant_relas.items():

    print(name)

    for rela, count in counts.items():
        print("\t", rela, "-", count)

>XD/
	 par - 62
	 adj - 42
	 rec - 69
	 Appo - 1
	 Spec - 6
	 atr - 8
	 mod - 4
CNJM/
	 rec - 345
	 adj - 267
	 par - 109
	 mod - 14
	 atr - 5
	 Spec - 6
	 dem - 3
	 Sfxs - 1
>RB</
	 adj - 285
	 par - 100
	 rec - 113
	 atr - 8
	 Spec - 1
	 mod - 1
CB</
	 par - 87
	 adj - 268
	 rec - 169
	 mod - 1
	 dem - 1
	 atr - 23
	 Spec - 4
CLC/
	 par - 148
	 adj - 331
	 rec - 175
	 atr - 4
	 dem - 1
	 Spec - 1
M>H/
	 rec - 30
	 adj - 302
	 par - 217
	 atr - 2
CMNH/
	 adj - 128
	 par - 28
	 rec - 8
	 mod - 1
TC</
	 adj - 46
	 par - 43
	 rec - 22
XMC/
	 adj - 282
	 par - 175
	 rec - 96
	 mod - 1
	 Attr - 1
	 atr - 4
	 Spec - 2
<FRH/
	 adj - 96
	 par - 16
	 Spec - 1
<FR=/
	 adj - 11
	 par - 6
	 rec - 30
CC/
	 adj - 176
	 par - 75
	 rec - 103
	 atr - 4
<FRJM/
	 adj - 209
	 par - 149
	 Spec - 1
<FR/
	 adj - 97
	 par - 12
	 mod - 2
	 atr - 6
	 rec - 2
	 Spec - 2
<FRH=/
	 adj - 52
	 rec - 49
	 par - 10
	 dem - 2
	 Spec - 1
>LP=/
	 adj - 143
	 rec - 33
	 par - 198
	 Spec - 1
	 atr - 2
<CTJ/
	 adj - 13
	 rec - 19
	 par - 1
RBW>/
	 adj - 3
	 par - 3
XD/
	 atr - 1
	 Spec - 2
CTJN/
	 par - 1
CT/
TLT/
	 adj - 2
TRJN/
	 rec - 2
>LP/
	 rec - 1
<FRJN/
TLTJN/

Based on the inspection below, it can be seen that quantified nouns are connected to their quantifier via a subphrase relation of either adj (adjunctive) or rec (regens), as mentioned at the beginning of this inquiry.

In [60]:

# B.show(quant_ex['CB</']['rec']) # uncomment me!

The query below shows that the relation par most frequently refers to a parallel number, e.g. שבעים ושבעה "seventy and seven" where "and seven" is in a parallel relationship.

In [61]:

# B.show(quant_ex['CB</']['par']) # uncomment me!

The atr relation appears when an adjective is used to describe a quantifier:

In [62]:

# B.show(quant_ex['>XD/']['atr']) # uncomment me!

The Spec relations (a phrase_atom rela) are cases where a phrase atom adds adjectival information about the quantifier.

In [63]:

# B.show(quant_ex['>XD/']['Spec']) # uncomment me!

The mod relation covers cases where the quantifier is modified with particles like גם or רק.

In [64]:

# B.show(quant_ex['CNJM/']['mod']) # uncomment me!

The dem relation appears when a demonstrative like אלה modifies the quantifier.

In [65]:

# B.show(quant_ex['CB</']['dem']) # uncomment me!

Based on the analysis up to this point, there are two kinds of relations which lead back to the quantified noun: the rec (regens) and adj (adjunct) relations. What about in cases where both of these relations are present? Is there ever a case where it is ambiguous which relation contains the quantified noun?

We use a TF search pattern to build a query for these cases. We look for cases that have three subphrases. The first has a word (w1) which is also contained in a lex object (second to bottom block) that has a ls (lexical set) value of card (cardinal number). Then we look for two other subphrases that have a relation either to the first subphrase (in the case of rela=adj) or the quantifier word contained in the first subphrase (in the case of a regens relation). Within sp2 and sp3, we also select the first word so we can highlight it in the B.show below.

In [66]:

quant_rec_adj = """

sp1:subphrase
    w1:word

sp2:subphrase rela=rec
    =: word

sp3:subphrase rela=adj
    =: word

lex ls=card
   w2:word
   
w1 = w2
w1 <mother- sp2
sp1 <mother- sp3

sp2 <: sp3
"""

quant_rec_adj = B.search(quant_rec_adj)

len(quant_rec_adj)

Out[66]:

In [67]:

# B.show(sorted(quant_rec_adj)) # uncomment me!

There are 245 cases with both relations. Based on inspection, it seems that the word in the rec relation is usually another quantifier. Are there cases where it is not?

We apply a filter with a list comprehension to the results below to filter out cases where there is a cardinal number in sp2.

In [68]:

non_card = [r for r in quant_rec_adj if F.ls.v(L.u(r[3], "lex")[0]) != "card"]

len(non_card)

Out[68]:

In [69]:

# B.show(non_card) # uncomment me!

The example below illustrates a complexity here. We iterate through every subphrase in one of the phrases from the result above. We print the subphrase number, the text, the relation, and the number of the subphrase mother...

In [70]:

show_subphrases(686936)

subphrase 	text 	relation 	mother
שְׁתֵּֽי־צִפֳּרִ֥ים חַיֹּ֖ות טְהֹרֹ֑ות  1316539 NA ()
שְׁתֵּֽי־צִפֳּרִ֥ים  1316535 NA ()
שְׁתֵּֽי־ 1316533 NA ()
צִפֳּרִ֥ים  1316534 rec (60261,)
חַיֹּ֖ות טְהֹרֹ֑ות  1316538 adj (1316535,)
חַיֹּ֖ות  1316536 NA ()
טְהֹרֹ֑ות  1316537 atr (1316536,)
עֵ֣ץ אֶ֔רֶז  1316542 par (1316539,)
עֵ֣ץ אֶ֔רֶז  1316543 NA ()
עֵ֣ץ  1316540 NA ()
אֶ֔רֶז  1316541 rec (60266,)
שְׁנִ֥י תֹולַ֖עַת  1316546 par (1316543,)
שְׁנִ֥י תֹולַ֖עַת  1316547 NA ()
שְׁנִ֥י  1316544 NA ()
תֹולַ֖עַת  1316545 rec (60269,)
אֵזֹֽב׃  1316548 par (1316547,)

In the example of שְׁתֵּֽי־צִפֳּרִ֥ים "two [of] birds" there is a rec relation between "two" and "birds". Then there is an adjunct relation, adj, further describing the whole subphrase "two birds": חַיֹּ֖ות טְהֹרֹ֑ות "pure beasts". In this case, it is the rec relation which contains the valid head, while "pure beasts" is a secondary description. This example illustrates that the rec relationship should take priority.

In [71]:

show_subphrases(682231)

subphrase 	text 	relation 	mother
שְׁתֵּי֙ הָעֲבֹתֹ֣ת  1314251 NA ()
שְׁתֵּי֙  1314249 NA ()
הָעֲבֹתֹ֣ת  1314250 rec (51341,)
הַזָּהָ֔ב  1314252 adj (1314251,)

This other example shows the same inner structure, as do the other 3 that we've manually inspected. This confirms that the rec relation should be checked first: when it contains a noun, that noun is taken as the quantified noun. Only afterwards is the adj relation checked.

Finally, we want to test that there always is a quantified noun in the adj relation. Are there other cases, based on the findings above, where the adj relation does not actually contain the quantified nouns? We create a looser query than the one above to cover all cases of the adj relation. Then we filter the results...

In [72]:

no_quants_adj = """

sp1:subphrase
    w1:word

sp2:subphrase rela=adj
    =: word

lex ls=card
   w2:word
   
w1 = w2
sp1 <mother- sp2
"""

no_quants_adj = sorted(B.search(no_quants_adj))
no_quants_adj = [r for r in no_quants_adj if F.pdp.v(r[3]) not in {"subs", "nmpr"}]

len(no_quants_adj)

Out[72]:

In [73]:

# B.show(no_quants_adj[:10]) # uncomment me

Many of the cases are due to the presence of an article or a determiner.

In one case the noun that is present is a demonstrative pronoun (prde) אלה for which there is no further specification. For now we exclude determiners and demonstratives and consider their role afterwards.

In [74]:

no_quants_adj = [r for r in no_quants_adj if F.pdp.v(r[3]) not in {"art", "prde"}]

len(no_quants_adj)

Out[74]:

In [75]:

# B.show(sorted(no_quants_adj)) # uncomment me

An interesting result occurs in Micah 5:4 where the word in the adjunct position has a pdp of adjv (adjective) when it should be subs. This is a mistake in the data. In this case, the participle should be nominalised as a subs:

In [76]:

show_subphrases(829120)

subphrase 	text 	relation 	mother
שִׁבְעָ֣ה רֹעִ֔ים  1379983 NA ()
שִׁבְעָ֣ה  1379981 NA ()
רֹעִ֔ים  1379982 adj (1379981,)
שְׁמֹנָ֖ה נְסִיכֵ֥י אָדָֽם׃  1379988 par (1379983,)
שְׁמֹנָ֖ה  1379984 NA ()
נְסִיכֵ֥י אָדָֽם׃  1379987 adj (1379984,)
נְסִיכֵ֥י  1379985 NA ()
אָדָֽם׃  1379986 rec (300679,)

In [77]:

# B.show([(829120,)])

This case will be excluded from our algorithm due to the mistake.

In other cases, the first word in the adj related subphrase is used as an adjective in a construct relation:

In [78]:

show_subphrases(738097)

subphrase 	text 	relation 	mother
חֲמִשָּׁ֣ה  1342693 NA ()
חַלֻּקֵֽי־אֲבָנִ֣ים׀  1342696 adj (1342693,)
חַלֻּקֵֽי־ 1342694 NA ()
אֲבָנִ֣ים׀  1342695 rec (151691,)

If we print the pdp values of the words within the related subphrase חלקי־אבנים, we will find the "subs":

In [79]:

[F.pdp.v(w) for w in L.d(1342696, "word")]

Out[79]:

['adjv', 'subs']

There are only a couple of these cases in the results. Thus, it will be safe for the algorithm to take the first subs word that it comes across.

Besides these cases, there are cases where there is no subs but a participle occurs as a subs with a pdp of verb due to the presence of satellite objects around the verb:

In [80]:

# B.show([(898716,)]) # uncomment me

For these cases, the algorithm should look for cases where there is no other pdp candidate and there is a verb that has a vt of participle.
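The selection rule described so far can be sketched without a Text-Fabric session. The sketch below uses plain dictionaries in place of word nodes. The helper name `pick_quantified`, the toy words, and the participle tense values `ptca`/`ptcp` are assumptions made for illustration, not the notebook's actual implementation.

```python
# Toy stand-ins for word nodes: each dict carries the features consulted.
# The vt values "ptca"/"ptcp" are assumed labels for participles.
NOMINAL_PDP = {"subs", "nmpr"}
PARTICIPLES = {"ptca", "ptcp"}


def pick_quantified(words):
    """Return the first nominal word, else the first participle, else None."""
    for word in words:
        if word["pdp"] in NOMINAL_PDP:
            return word
    # fallback: no nominal candidate, so accept a participle coded as a verb
    for word in words:
        if word["pdp"] == "verb" and word.get("vt") in PARTICIPLES:
            return word
    return None


# adjective in construct followed by a noun (pdp pattern ['adjv', 'subs'])
construct = [{"text": "w1", "pdp": "adjv"}, {"text": "w2", "pdp": "subs"}]
# participle with a satellite, coded as a verb
participle = [{"text": "w3", "pdp": "verb", "vt": "ptca"}, {"text": "w4", "pdp": "prep"}]

print(pick_quantified(construct)["text"])   # w2
print(pick_quantified(participle)["text"])  # w3
```

The participle is only consulted after the nominal scan fails, which mirrors the requirement that there be no other pdp candidate.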

There are also several cases where the quantified noun is a personal pronoun, as exemplified below:

In [81]:

# B.show([(867246,)]) # uncomment me

These should be added to the set of acceptable quantified heads.

Below we remove the cases accounted for thus far.

In [82]:

no_quants_adj = [
    r for r in no_quants_adj if F.pdp.v(r[3]) not in {"adjv", "verb", "prps"}
]

len(no_quants_adj)

Out[82]:

In [83]:

# B.show(no_quants_adj) # uncomment me

What remains are 5 instances of quantified prepositional phrases. These are cases where the number is truly acting in a nominal capacity. In these cases, the algorithm should not return any quantified nouns since the quantifier itself semantically functions as the noun.

In [84]:

no_quants_adj = [r for r in no_quants_adj if F.pdp.v(r[3]) not in {"prep"}]

len(no_quants_adj)

Out[84]:

That is all the cases in which there is not a traditional "subs" within the adjunct of a quantifier. The final set of acceptable pdp tags for a quantified noun in an adj related subphrase are as follows:

In [85]:

acceptable_adj_quantifieds = {
    "subs",  # noun
    "nmpr",  # proper noun
    "prde",  # demonstrative
    "prps",  # pronoun
    "verb",  # for participles
}

The queries above raise the equivalent question for rec related, quantified subphrases: Are there other kinds of acceptable quantified nouns in the rec relationship besides subs and nmpr? We make a query and test whether we also need a similar set to the one above.

In [86]:

rec_quants = """

sp1:subphrase
    w1:word

sp2:subphrase rela=rec
    =: word

lex ls=card
   w2:word
   
w1 = w2
w1 <mother- sp2
"""

rec_quants = sorted(B.search(rec_quants))

# apply filters:
rec_quants = [
    r
    for r in rec_quants
    if F.pdp.v(r[3]) not in {"subs", "nmpr"} and F.ls.v(L.u(r[3], "lex")[0]) != "card"
]

len(rec_quants)

Out[86]:

In [87]:

# B.show(rec_quants) # uncomment me

Many of these appear to be cases where the article is in the first position followed by a subs. We exclude those below...

In [88]:

rec_quants = [r for r in rec_quants if F.pdp.v(r[3]) != "art"]

len(rec_quants)

Out[88]:

In [89]:

# B.show(rec_quants) # uncomment me

There are a couple cases where the demonstrative occurs in the quantified position, exemplified below:

In [90]:

# B.show([(676421,)]) # uncomment me

The rest of the cases seem to be problematic examples of prepositions. They are problematic since they should be coded with a relation of adj rather than rec. In any case, the sets of acceptable solutions should not include the preposition, the same as with the adj.

Based on this analysis, the rec quantified subphrases can utilize the same check-set as the adj quantifieds.

Conclusion¶

This analysis has found that quantifieds should be processed in the order of rec subphrase relations first. If an acceptable part of speech tag is not found within the rec subphrase, then the subsequent adj subphrase (adjunct) should be checked. In a handful of cases, there will not be a quantified noun since the quantifier itself functions as a nominal element.

For both the rec and adj related subphrases, the same pdp check set can be used to isolate viable heads.
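The ordering stated in this conclusion can be sketched on toy data. Everything below is hypothetical: the function name `toy_find_quantified`, the `(rela, words)` representation of daughter subphrases, and the English glosses standing in for words. Skipping a daughter whose first word is a preposition is my reading of the findings above, not a confirmed detail of the algorithm.

```python
# Same check set for rec and adj daughters, as concluded above.
# "verb" is meant for participles only; the tense check is omitted here.
ACCEPTABLE_PDP = {"subs", "nmpr", "prde", "prps", "verb"}


def toy_find_quantified(daughters):
    """
    daughters: list of (rela, words) pairs for one quantifier, where
    words is a list of (text, pdp) tuples. rec daughters are checked
    before adj daughters. Returns the quantified text or None.
    """
    for wanted in ("rec", "adj"):
        for rela, words in daughters:
            if rela != wanted or not words:
                continue
            if words[0][1] == "prep":
                continue  # quantified PP: the quantifier itself is nominal
            for text, pdp in words:
                if pdp in ACCEPTABLE_PDP:
                    return text
    return None


two_birds = [("adj", [("beasts", "subs"), ("pure", "adjv")]),
             ("rec", [("birds", "subs")])]
seven_years = [("adj", [("years", "subs")])]
one_of_them = [("adj", [("from", "prep"), ("them", "prps")])]

print(toy_find_quantified(two_birds))    # birds
print(toy_find_quantified(seven_years))  # years
print(toy_find_quantified(one_of_them))  # None
```

In the first case the adj daughter is listed first and contains a noun, yet the rec daughter still wins because the outer loop fixes the order of relations.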

Appendix: Which Subphrase?¶

In cases where the quantified noun is related by the subphrase rela=adj, to which subphrase of the quantifier will it relate? It is assumed that it would relate to the largest one...

A good solution would be to progressively move up from the smallest subphrase to the largest subphrase and check for relations on each one until it is found. That is what we follow in the algorithm.
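A minimal sketch of that climb follows, using subphrase numbers from the two tables printed earlier in place of live Text-Fabric lookups. The function name `quantified_via_climb` and the two dictionaries are assumptions for illustration. In the notebook these would come from `L.u(word, "subphrase")` and `E.mother.t(...)`.

```python
def quantified_via_climb(enclosing, adj_daughters):
    """
    enclosing: dict of subphrase id -> word count, for every subphrase
               that contains the quantifier.
    adj_daughters: dict of subphrase id -> id of its adj daughter.
    Moves from the smallest enclosing subphrase to the largest and
    returns the first adj daughter found, or None.
    """
    for sp in sorted(enclosing, key=enclosing.get):
        if sp in adj_daughters:
            return adj_daughters[sp]
    return None


# "two birds": the adj daughter hangs on the two-word subphrase 1316535,
# not on the single-word subphrase 1316533 of the quantifier
birds_enclosing = {1316533: 1, 1316535: 2, 1316539: 4}
birds_adj = {1316535: 1316538}

# "eight princes": the adj daughter hangs directly on the quantifier
princes_enclosing = {1379984: 1, 1379988: 3}
princes_adj = {1379984: 1379987}

print(quantified_via_climb(birds_enclosing, birds_adj))      # 1316538
print(quantified_via_climb(princes_enclosing, princes_adj))  # 1379987
```

The two examples show why a fixed choice of "largest" or "smallest" would not work: the adj daughter attaches at different levels.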

Adjective -> Subs Missed Results¶

In the first test, several nouns are missed due to the presence of an adjectival element. Let's look at those cases and see what's going on. I have copied the phrase numbers of a few relevant examples.

In [91]:

adj_examples = [(771933,), (799523,)]

# B.show(adj_examples) # uncomment me

In [92]:

show_subphrases(adj_examples[0][0])

subphrase 	text 	relation 	mother
בֶן־ 1355711 NA ()
אָמֹ֔וץ  1355712 rec (207817,)

In [93]:

for word in L.d(adj_examples[0][0], "word"):
    print(T.text([word]), F.pdp.v(word))

יְשַֽׁעְיָ֣הוּ  nmpr
בֶן־ subs
אָמֹ֔וץ  nmpr

In this case, the substantive is not detected by the algorithm since it is in a dependent subphrase, a construct relation, with its modifying adjective. How to extract these nouns?

This is very similar to the quantifier case, where the word in the rectum is actually the head (e.g. שתי שנה "two years" where "two" is registered as the head, but the substantive "years" is the semantic head). This kind of relationship is differentiated from non-heads by the fact that the adjective itself is independent. Thus, in cases where the adjective is independent and has a daughter rectum subphrase, the algorithm should retrieve the attributed noun.

proposed solution: Add adjv to the set of acceptable pdp for the NP. Any adjectives will be processed for dependency: most will fail that test. But for the dozens of cases where the adjective does not fail, the algorithm will apply a separate check for a rec related subphrase which contains the true head.
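The proposal can be sketched as a two-step check on toy data. The function name `toy_find_attributed`, the dictionary layout, and the English glosses are all hypothetical. The independence test reuses the idea that only NA and parallel relations count as independent.

```python
INDEPENDENT_RELAS = {"NA", "par", "Para"}


def toy_find_attributed(adjective):
    """
    adjective: dict with
      'relas'        - relations of the subphrases enclosing the adjective
      'rec_daughter' - (text, pdp) tuples of its rec daughter, may be empty
    Returns the attributed noun for an independent adjective, else None.
    """
    if set(adjective["relas"]) - INDEPENDENT_RELAS:
        return None  # dependent adjective: fails the head test
    for text, pdp in adjective["rec_daughter"]:
        if pdp in {"subs", "nmpr"}:
            return text  # the noun in the rectum is the true head
    return None


# independent adjective in construct, e.g. "lacking [of] bread"
lacking = {"relas": ["NA"], "rec_daughter": [("bread", "subs")]}
# ordinary attributive adjective, dependent on a noun
great = {"relas": ["atr"], "rec_daughter": []}

print(toy_find_attributed(lacking))  # bread
print(toy_find_attributed(great))    # None
```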

Participle -> Head Missed Results¶

Other phrases that end up headless are noun phrases that have a participle which serves as the nominal element, but since it has satellites is coded as a "verb":

In [94]:

verb_examples = [(709010,), (711593,), (756104,)]

# B.show(verb_examples) # uncomment me

In [95]:

for phrase in verb_examples:
    show_subphrases(phrase[0])
    print()

subphrase 	text 	relation 	mother

subphrase 	text 	relation 	mother

subphrase 	text 	relation 	mother

There are mixed cases here due to the shortcomings of the current data model. In these cases, the participle is marked as a "verb" since it also has objects or descriptors. In the first example above, the noun גרה functions as the object of the verb. The head is מעלה. But the same logic does not hold for the second or third case. In the second case, פצוע־דכא gives an attribute or quality of שפכה. In the third case, מצק "poured" describes an attribute of נחשת "bronze". Thus the opposite of example 1 is true, that is, the head noun is the attributed noun in the construct relation.

Since the specific role of the noun or the verb is not specified at this lower phrase level, is there even a way to differentiate these cases?

In [96]:

for phrase in verb_examples:
    print(phrase[0])
    for word in L.d(phrase[0], "word"):
        print(T.text([word]), F.pdp.v(word))
    print()

709010
וְ conj

711593
אֵ֥ין  nega

756104
בֵיתֹו֩  subs

It actually appears that the database treats all 3 the same: as adjectives at the phrase-dependent part of speech level. Thus, these cases will receive the same treatments as the adjective cases above.

`KL/` relation problems¶

I found an instance in Numbers 3:15 where the subphrase relationship that connects כל with its quantified noun is atr. That is probably wrong. Are there other cases with the same problem?

In [97]:

kl_prob = """

sp1:subphrase
    w1:word lex=KL/ st=a

sp2:subphrase rela=atr

sp2 -mother> sp1
sp2 >> sp1
w1 :: sp1
"""

kl_prob = sorted(B.search(kl_prob))

len(kl_prob)

Out[97]:

In [98]:

# B.show(kl_prob) # uncomment me

It seems that the adjectives are not nominalised in this construction as pdp of subs. Most of the findings are adjectives in construct with כל. But there are several cases of the participle also.

Is this encoding correct?

If the rela code were properly rec as most are, then this would simply be a matter of adding an additional acceptable pdp to the list within the get_quantified function.

In [99]:

# kl_prob = [r for r in kl_prob if not {'adjv'} and set(F.pdp.v(w) for w in L.d(r[2]))]

len(kl_prob)

Out[99]:

In [100]:

kl_prob = set(r[0] for r in kl_prob)

len(kl_prob)

Out[100]:

Subphrase by Subphrase Approach?¶

Here I experiment with switching from a word-by-word approach to a subphrase-by-subphrase approach. The first iteration of the get_heads function iterated word by word to identify valid heads with independent subphrase relations. A more efficient and methodologically sound approach would be to work from the subphrase down to the word.

In [193]:

test_phrases = [
    ph
    for ph in F.typ.s("NP")
    if len(L.d(ph, "word")) == 5 and F.otype.v(ph) == "phrase"
]

In [194]:

test = test_phrases[20]

test

Out[194]:

In [195]:

show_subphrases(test)

subphrase 	text 	relation 	mother
יְלִ֥יד בֵּֽיתְךָ֖  1302879 NA ()
יְלִ֥יד  1302877 NA ()
בֵּֽיתְךָ֖  1302878 rec (7530,)
מִקְנַ֣ת כַּסְפֶּ֑ךָ  1302882 par (1302879,)
מִקְנַ֣ת  1302880 NA ()
כַּסְפֶּ֑ךָ  1302881 rec (7533,)

In [196]:

head_cands = [sp for sp in L.d(test, "subphrase") if F.rela.v(sp) == "NA"]

head_cands

Out[196]:

[1302879, 1302877, 1302880]

Note above that the heads are those within NA relations that consist of single words. How consistent is this? Are there any cases where the head does not receive its own individual subphrase with a NA relation? Or are there cases of NA relations of non-head elements? Below we run a couple of tests, and then we build a primitive head finder based on this hypothesis in order to manually inspect what happens.
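The observation can be made concrete with the rows of the table above. This is only a toy restatement: the `rows` list is copied by hand from that table (subphrase number, word count, relation) rather than read from the database.

```python
# (subphrase id, number of words, relation), copied from the table above
rows = [
    (1302879, 2, "NA"),
    (1302877, 1, "NA"),
    (1302878, 1, "rec"),
    (1302882, 2, "par"),
    (1302880, 1, "NA"),
    (1302881, 1, "rec"),
]

# keep only unrelated subphrases that consist of a single word
single_word_na = [sp for sp, n_words, rela in rows if rela == "NA" and n_words == 1]

print(single_word_na)  # [1302877, 1302880]
```

The two-word subphrase 1302879 drops out, leaving the single-word subphrases of the two coordinated nouns.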

In [201]:

for word in F.otype.s("word"):

    subphrases = L.u(word, "subphrase")

    if not subphrases:
        continue

    sp_relas = set(F.rela.v(sp) for sp in subphrases)

    if not {"NA", "par"} & sp_relas:
        print("example found: ")
        print(word, subphrases, sp_relas)
        break

else:
    # only reached when the loop finishes without a break
    print("search complete with 0 results")

example found: 
23 (1300568,) {'rec'}

In [202]:

T.text(L.d(1300568, "word"))

Out[202]:

'תְהֹ֑ום '

In [224]:

no_na = """

sp1:subphrase
    w1:word
sp2:subphrase

sp2 -mother> w1
"""

no_na = sorted(S.search(no_na))

len(no_na)

Out[224]:

In [229]:

no_na_filtered = []

for r in no_na:

    reg = r[1]

    reg_subphrases = L.u(reg, "subphrase")
    reg_sp_relas = set(F.rela.v(sp) for sp in reg_subphrases)

    if "NA" not in reg_sp_relas:
        no_na_filtered.append(r)

print(f"words with construct relation and no NA subphrase: {len(no_na_filtered)}")

words with construct relation and no NA subphrase: 0

The search above shows that whenever a word is in a construct relation with a subphrase, the word also has a NA (no relation) subphrase.

Let's broaden the inquiry a bit. What are the specific situations in which there is NO non-related subphrase at all? What kinds of relations are present? What kinds of phrases are they?

In [236]:

na_survey = collections.Counter()

for phrase in F.otype.s("phrase"):

    subphrase_relas = tuple(
        sorted(set(F.rela.v(sp) for sp in L.d(phrase, "subphrase")))
    )

    if not subphrase_relas:
        na_survey["NO subphrases"] += 1

    elif "NA" in subphrase_relas:
        na_survey["has NA"] += 1

    else:
        na_survey[subphrase_relas] += 1

pprint(na_survey)

Counter({'NO subphrases': 215258, 'has NA': 37952})

This count shows that there are only two situations in the data: either a phrase has no subphrases at all, or it has a subphrase with a relation of "NA". There are NO cases of phrases that lack an NA subphrase but have other relations. That is good for our hypothesis...

In the experiment below, two important assumptions are made about the head:

First, it is assumed that the head is the first valid pdp word in the phrase, with the exception of quantifieds and attributed nouns which are handled differently.

Second, it is assumed that the first NA-relation subphrase contains the head. We test that assumption by manually inspecting the output.

In [292]:

def primitive_head_hunter(phrase):

    """
    Looks at noun phrases for heads.
    Returns a one-word tuple, or None when no head is found.
    """

    good_pdp = {"subs", "nmpr"}

    subphrase_candidates = [
        sp
        for sp in L.d(phrase, "subphrase")
        if F.rela.v(sp) == "NA" and F.rela.v(L.u(sp, "phrase_atom")[0]) == "NA"
    ]

    # handle simple phrases
    if not subphrase_candidates:
        head_candidates = [w for w in L.d(phrase, "word") if F.pdp.v(w) in good_pdp]
        if head_candidates:
            return (head_candidates[0],)
        print(f"exception at {phrase}")
        return None  # do not fall through to the empty candidate list

    # attempt simple head assignment
    first_na_subphrase = subphrase_candidates[0]
    try:
        the_head = next(
            w for w in L.d(first_na_subphrase, "word") if F.pdp.v(w) in good_pdp
        )
        return (the_head,)
    except StopIteration:
        # adjective-initial subphrases are knowingly left without a head
        if F.pdp.v(L.d(first_na_subphrase, "word")[0]) != "adjv":
            raise Exception(phrase)

In [296]:

test_results = [primitive_head_hunter(ph) for ph in test_phrases]

random.shuffle(test_results)

In [298]:

# B.show(test_results) # uncomment me

As it turns out, the assumption about NA phrase type is workable. But the complications of this approach (explained below) make it an unlikely solution for now.

Conclusion¶

I've done some initial testing with the subphrase by subphrase approach. It is a promising method, but requires a more complicated implementation with nested searches through each level of the phrase hierarchy. A simple subphrase by subphrase approach is not sufficient: one needs to go phrase by phrase, phrase_atom by phrase_atom, subphrase by subphrase, and even beyond. It is a recursive problem that cannot be navigated with the present, limited data model. There is more to say about the present state of the data model which I will save for the final report.

At present, the word-by-word approach provides an elegant (though limited) solution that is able to navigate the quirks of the present data model and provide an acceptable level of accuracy, with some exceptions for more complicated phrase constructions.

Handling Parallels¶

What is the best way to handle parallel head elements? In general, a phrase has only one real "head". That is, often the first head element determines the grammatical gender or number of the verb (thanks to Constantijn Sikkel for this conversation). Yet, the nouns which are coordinate to the head are often of interest for both grammatical and semantic studies.

There are two approaches to collecting coordinate heads. One is to check, for every word with a relation of "parallel", whether its mother is already established as a head. Another approach is to recursively search for nouns that are coordinate with the head word. Up until this inquiry, I have opted for option 1 due to the complexity of checking the necessary relationships for a head candidate. But the current approach misses a phrase in Deuteronomy 12:17, which contains a chain of head nouns in construct with the quantifier מעשר "tenth".

It is possible to edit the algorithm to accommodate these cases. But the example raises the broader question of whether option 1 is truly sufficient and methodologically sound. In this section, I test whether option 2 is a better alternative. First, we are unsure about how to separate a head word from a larger, paralleled subphrase. In option 1, individual words are tested, each for dependent relationships. But option 2 will go the opposite direction: beginning at the subphrase level and working down to the word. Does this affect our ability to separate the head noun of the phrase?
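To make option 2 concrete, here is a minimal sketch of a recursive search for coordinate heads on a toy graph. The function name `collect_coordinates`, the subphrase ids, and the glosses are all hypothetical, and the chain of parallels is simplified: in the real data a parallel subphrase and the subphrase that serves as mother of the next parallel can be distinct nodes over the same words.

```python
def collect_coordinates(subphrase, par_daughters, first_noun):
    """
    subphrase: id of a subphrase that contains an established head.
    par_daughters: dict of id -> list of ids related to it with rela=par.
    first_noun: dict of id -> head word of that subphrase.
    Follows chains of par relations depth-first and returns the head
    words of every subphrase coordinated with the starting one.
    """
    found = []
    for daughter in par_daughters.get(subphrase, []):
        found.append(first_noun[daughter])
        found.extend(collect_coordinates(daughter, par_daughters, first_noun))
    return found


# toy chain: sp1 <-par- sp2 <-par- sp3 <-par- sp4
par_daughters = {"sp1": ["sp2"], "sp2": ["sp3"], "sp3": ["sp4"]}
first_noun = {"sp2": "wood", "sp3": "scarlet", "sp4": "hyssop"}

print(collect_coordinates("sp1", par_daughters, first_noun))
# ['wood', 'scarlet', 'hyssop']
```

Unlike option 1, nothing here tests individual words for dependent relations, which is exactly the open question: the `first_noun` lookup hides the step of separating a head noun from a larger paralleled subphrase.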

In [40]:

def OLD_get_heads(phrase):
    """
    Extracts and returns the heads of a supplied
    phrase or phrase atom based on that phrase's type
    and the relations reflected within the phrase.

    --input--
    phrase(atom) node number

    --output--
    tuple of head word node(s)
    """

    # mapping from phrase type to good part of speech values for heads
    head_pdps = {
        "VP": {"verb"},  # verb
        "NP": {"subs", "adjv", "nmpr"},  # noun
        "PrNP": {"nmpr", "subs"},  # proper-noun
        "AdvP": {"advb", "nmpr", "subs"},  # adverbial
        "PP": {"prep"},  # prepositional
        "CP": {"conj", "prep"},  # conjunctive
        "PPrP": {"prps"},  # personal pronoun
        "DPrP": {"prde"},  # demonstrative pronoun
        "IPrP": {"prin"},  # interrogative pronoun
        "InjP": {"intj"},  # interjectional
        "NegP": {"nega"},  # negative
        "InrP": {"inrg"},  # interrogative
        "AdjP": {"adjv"},  # adjective
    }

    # get phrase-head's part of speech value and list of candidate matches
    phrase_type = F.typ.v(phrase)
    head_candidates = [
        w for w in L.d(phrase, "word") if F.pdp.v(w) in head_pdps[phrase_type]
    ]

    # VP with verbs require no further processing, return the head verb
    if phrase_type == "VP":
        return tuple(head_candidates)

    # go head-hunting!
    heads = []

    for word in head_candidates:

        # gather the word's subphrase (+ phrase_atom if otype is phrase) relations
        word_phrases = list(L.u(word, "subphrase"))
        word_phrases += (
            list(L.u(word, "phrase_atom"))
            if (F.otype.v(phrase) == "phrase")
            else list()
        )
        word_relas = set(F.rela.v(phr) for phr in word_phrases) or {"NA"}

        # check (sub)phrase relations for independency
        if word_relas - {"NA", "par", "Para"}:
            continue

        # check parallel relations for independency
        elif word_relas & {"par", "Para"} and mother_is_head(word_phrases, heads):
            this_head = find_quantified(word) or find_attributed(word) or word
            heads.append(this_head)

        # save all others as heads, check for quantifiers first
        elif word_relas == {"NA"}:
            this_head = find_quantified(word) or find_attributed(word) or word
            heads.append(this_head)

    return tuple(sorted(set(heads)))


def mother_is_head(word_phrases, previous_heads):

    """
    Test and validate parallel relationships for independency.
    Must gather the mother for each relation and check whether
    the mother contains a head word.

    --input--
    * list of phrase nodes for a given word (includes subphrases)
    * list of previously approved heads

    --output--
    boolean
    """

    # get word's enclosing phrases that are parallel
    parallel_phrases = [ph for ph in word_phrases if F.rela.v(ph) in {"par", "Para"}]
    # get the mother for the parallel phrases
    parallel_mothers = [E.mother.f(ph)[0] for ph in parallel_phrases]
    # get mothers' words, by mother
    parallel_mom_words = [set(L.d(mom, "word")) for mom in parallel_mothers]
    # test for head in each mother
    test_mothers = [
        bool(phrs_words & set(previous_heads)) for phrs_words in parallel_mom_words
    ]

    return all(test_mothers)

How many subphrases with a parallel relation to a validated head consist of more than one word?¶

We take the first head element for every noun phrase and check its parallel elements.

In [41]:

par_word_count = collections.Counter()
par_word_list = collections.defaultdict(list)

for np in F.typ.s("NP"):

    heads = OLD_get_heads(np)

    if not heads:
        continue

    the_head = heads[0]

    if not L.u(the_head, "subphrase"):
        continue

    head_smallest_sp = sorted(sp for sp in L.u(the_head, "subphrase"))[0]

    par_daughter = [d for d in E.mother.t(head_smallest_sp) if F.rela.v(d) == "par"]

    for pd in par_daughter:

        word_length = len(L.d(par_daughter[0], "word"))

        par_word_count[word_length] += 1
        par_word_list[word_length].append((the_head, head_smallest_sp))

for w_count, count in par_word_count.items():
    print("length:", w_count)
    print("\t", count)

length: 1
	 3346
length: 2
	 686
length: 3
	 86
length: 4
	 24
length: 9
	 10
length: 5
	 4
length: 6
	 3

Let's see some of the larger cases...

In [47]:

B.show(par_word_list[6], condenseType="phrase", withNodes=True)

phrase 1

713759

phrase 713759 Objc NP

109564

art the

109565

מַּסֹּות֙

subs proving

109566

art the

109567

גְּדֹלֹ֔ת

adjv great

phrase 713759 Objc NP

109571

art the

109572

אֹתֹ֧ת

subs sign

109573

conj and

109574

art the

109575

מֹּפְתִ֛ים

subs sign

109576

art the

109577

גְּדֹלִ֖ים

adjv great

109578

art the

109579

הֵֽם׃

prde they

phrase 2

897414

phrase 897414 Objc NP

412359

נְגִידִ֔ים

subs chief

412360

conj and

412361

אֹצְרֹ֥ות

subs supply

412362

מַאֲכָ֖ל

subs food

412363

conj and

412364

שֶׁ֥מֶן

subs oil

412365

וָ

conj and

412366

יָֽיִן׃

subs wine

In [46]:

ex_subphrase = par_word_list[6][1][1]

for daughter in [d for d in E.mother.t(ex_subphrase) if F.rela.v(d) == "par"]:

    print(daughter, F.rela.v(daughter), T.text(L.d(daughter, "word")))

1409887 par אֹצְרֹ֥ות מַאֲכָ֖ל וְשֶׁ֥מֶן וָיָֽיִן׃

And now some examples of 2 word lengths...

In [49]:

B.show(par_word_list[2][:5], condenseType="phrase", withNodes=True)

phrase 1

651919

phrase 651919 Subj NP

676

art the

677

שָּׁמַ֥יִם

subs heavens

678

conj and

679

art the

680

אָ֖רֶץ

subs earth

681

conj and

682

כָל־

subs whole

683

צְבָאָֽם׃

subs service

phrase 2

652841

phrase 652841 Time NP

2152

שְׁלֹשִׁ֤ים

subs three

2153

conj and

2154

מְאַת֙

subs hundred

2155

שָׁנָ֔ה

subs year

phrase 3

653522

phrase 653522 Time NP

3540

חֲמִשִּׁ֥ים

subs five

3541