
Noun Semantics in the Hebrew Bible

In this notebook, I compare the syntactic contexts of the top 200 most frequent nouns in the Hebrew Bible. This notebook essentially walks through my process and includes limited commentary throughout. Full descriptions borrowed from the paper will soon be transferred here as well. [content coming soon -Cody, 10 Jan 2019]

In [1]:
# ETCBC's BHSA data
from tf.fabric import Fabric
from tf.app import use

# stats & data-containers
import collections, re, random, csv
import pandas as pd
import numpy as np
import scipy.stats as stats
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import pairwise_distances
from kneed import KneeLocator # https://github.com/arvkevi/kneed

# data visualizations
import seaborn as sns
sns.set(style="whitegrid")
import matplotlib.pyplot as plt
from matplotlib import rcParams
rcParams['font.family'] = 'serif'
rcParams['font.serif'] = ['Times New Roman']
from IPython.display import HTML, display
from adjustText import adjust_text # fixes overlapping scatterplot annotations

# custom modules
from pyscripts.contextcount import ContextCounter, ContextTester
from pyscripts.contextparameters import deliver_params

# prep the Hebrew syntax data

name = 'noun_semantics'
hebrew_data = ['~/github/etcbc/{}/tf/c'.format(direc) for direc in ('bhsa','lingo/heads', 'heads', 'phono')] # data dirs
load_features = '''
typ phono lex_utf8 lex
voc_lex_utf8 voc_lex gloss
freq_lex pdp sp ls
language
rela number function
vs vt

head obj_prep sem_set nhead
heads noun_heads
''' 
# TF load statements
TF = Fabric(locations=hebrew_data)
api = TF.load(load_features)
B = use('bhsa', api=api, hoist=globals(), silent=True) # Bhsa functions for search and visualizing text
This is Text-Fabric 7.3.9
Api reference : https://annotation.github.io/text-fabric/Api/Fabric/

124 features found and 0 ignored
  0.00s loading features ...
   |     0.12s B lex                  from /Users/cody/github/etcbc/bhsa/tf/c
   |     0.15s B lex_utf8             from /Users/cody/github/etcbc/bhsa/tf/c
   |     0.18s B phono                from /Users/cody/github/etcbc/phono/tf/c
   |     0.01s B voc_lex_utf8         from /Users/cody/github/etcbc/bhsa/tf/c
   |     0.18s B typ                  from /Users/cody/github/etcbc/bhsa/tf/c
   |     0.01s B voc_lex              from /Users/cody/github/etcbc/bhsa/tf/c
   |     0.00s B gloss                from /Users/cody/github/etcbc/bhsa/tf/c
   |     0.08s B freq_lex             from /Users/cody/github/etcbc/bhsa/tf/c
   |     0.10s B pdp                  from /Users/cody/github/etcbc/bhsa/tf/c
   |     0.11s B sp                   from /Users/cody/github/etcbc/bhsa/tf/c
   |     0.10s B ls                   from /Users/cody/github/etcbc/bhsa/tf/c
   |     0.11s B language             from /Users/cody/github/etcbc/bhsa/tf/c
   |     0.18s B rela                 from /Users/cody/github/etcbc/bhsa/tf/c
   |     0.18s B number               from /Users/cody/github/etcbc/bhsa/tf/c
   |     0.06s B function             from /Users/cody/github/etcbc/bhsa/tf/c
   |     0.10s B vs                   from /Users/cody/github/etcbc/bhsa/tf/c
   |     0.10s B vt                   from /Users/cody/github/etcbc/bhsa/tf/c
   |     0.33s B head                 from /Users/cody/github/etcbc/heads/tf/c
   |     0.14s B obj_prep             from /Users/cody/github/etcbc/heads/tf/c
   |     0.03s B sem_set              from /Users/cody/github/etcbc/heads/tf/c
   |     0.40s B nhead                from /Users/cody/github/etcbc/heads/tf/c
   |     0.66s B heads                from /Users/cody/github/etcbc/lingo/heads/tf/c
   |     0.88s B noun_heads           from /Users/cody/github/etcbc/lingo/heads/tf/c
    10s All features loaded/computed - for details use loadLog()
Using bhsa commit 9374f7a8d075a94bc6b7e69e08a7ca86e725215f
  in /Users/cody/text-fabric-data/__apps__/bhsa
In [2]:
def reverse_hb(heb_text):
    '''
    Reverses the order of right-to-left Hebrew text
    so that matplotlib, which renders strings left-to-right,
    displays it correctly.
    '''
    return ''.join(reversed(heb_text))

def show_word_list(word_nodes, joiner='  |', title=''):
    '''
    Displays Hebrew for a pipe-separated list of word nodes
    Good for seeing lexemes without taking up screen space.
    '''
    formatted = joiner.join(B.plain(node, _asString=True, withPassage=False) for node in word_nodes)
    display(HTML(formatted))
    
def show_subphrases(phrase, direction=L.d):
    '''
    A simple function to print subphrases
    and their relations to each other.
    '''
    for sp in direction(phrase, 'subphrase'):
        
        mother = E.mother.f(sp)[0] if E.mother.f(sp) else ''
        mother_text = T.text(mother)
        
        print('-'*7 + str(sp) + '-'*16)
        print()
        print(f'{T.text(sp)} -{F.rela.v(sp)}-> {mother_text}')
        print(f'nodes:  {sp} -{F.rela.v(sp)}-> {mother}')
        print(f'slots:  {L.d(sp, "word")} -{F.rela.v(sp)}-> {L.d(mother or 0, "word")}')
        print('-'*30)

Corpus Size

Below is the number of words included in the corpus of BHSA.

In [3]:
len(list(F.otype.s('word')))
Out[3]:
426584

Demonstrating the Collocational Principle

Here is a query for all nouns that serve as the object to the verb אכל "to eat". This query demonstrates how the collocation patterns of syntactic context can be informative for semantic meaning. This is the driving principle behind this project.

In [4]:
eat_obj = '''

clause
    phrase function=Pred
        word pdp=verb lex=>KL[
    phrase function=Objc
        <head- w1:word pdp=subs
        
lex
    w2:word
    
w1 = w2
'''

eat_obj = B.search(eat_obj)
eaten_lexs = collections.Counter(T.text(r[5]) for r in eat_obj)

for word, count in eaten_lexs.most_common(10):
    print(f'{count}\t{word}')
  1.53s 285 results
59	לֶחֶם 
21	בָּשָׂר 
17	פְּרִי 
14	מַצָּה 
10	אַרְמֹון 
9	דָּם 
5	נְבֵלָה 
5	דְּבַשׁ 
4	חֵלֶב 
4	טְרֵיפָה 

Define a Target Noun Set

Insert discussion about the semantic relationship between iconicity and frequency with regard to the most frequent noun lexemes in the HB.

In [5]:
raw_search = '''

lex language=Hebrew sp=subs

'''

raw_nouns = B.search(raw_search)
  0.04s 3706 results

Now we order the results on the basis of lexeme frequency.

In [6]:
raw_terms_ordered = sorted(((F.freq_lex.v(res[0]), res[0]) for res in raw_nouns), reverse=True)

Below we have a look at the top 50 terms from the selected set. Pay attention to the feature ls, i.e. "lexical set." This feature gives us some rudimentary semantic information about the nouns and their usual functions, and it suggests that some additional restrictions are necessary for the noun selection procedure. Note especially that several of these nouns are used in adjectival or prepositional roles (e.g. כל, אחד, אין, תחת).

In [7]:
raw_nnodes = [res[1] for res in raw_terms_ordered] # isolate the word nodes of the sample
B.displaySetup(extraFeatures={'ls', 'freq_lex'}) # config B to display ls and freq_lex


# display lexeme data
for i, node in enumerate(raw_nnodes[:50]):
    B.prettyTuple((node,), seq=i)

result 0

כֹּל
כֹּל
K.OL whole freq_lex=5412 language=Hebrew ls=nmdi sp=subs

result 1

בֵּן
בֵּן
B.;N son freq_lex=4937 language=Hebrew sp=subs

result 2

אֱלֹהִים
אֱלֹהִים
>:ELOHIJM god(s) freq_lex=2601 language=Hebrew sp=subs

result 3

מֶלֶךְ
מֶלֶךְ
MELEK: king freq_lex=2523 language=Hebrew sp=subs

result 4

אֶרֶץ
אֶרֶץ
>EREY earth freq_lex=2504 language=Hebrew sp=subs

result 5

יֹום
יֹום
JOWM day freq_lex=2304 language=Hebrew ls=padv sp=subs

result 6

אִישׁ
אִישׁ
>IJC man freq_lex=2186 language=Hebrew ls=nmdi sp=subs

result 7

פָּנֶה
פָּנֶה
P.@NEH face freq_lex=2127 language=Hebrew sp=subs

result 8

בַּיִת
בַּיִת
B.AJIT house freq_lex=2063 language=Hebrew sp=subs

result 9

עַם
עַם
<AM people freq_lex=1866 language=Hebrew sp=subs

result 10

יָד
יָד
J@D hand freq_lex=1618 language=Hebrew sp=subs

result 11

דָּבָר
דָּבָר
D.@B@R word freq_lex=1441 language=Hebrew sp=subs

result 12

אָב
אָב
>@B father freq_lex=1217 language=Hebrew sp=subs

result 13

עִיר
עִיר
<IJR town freq_lex=1090 language=Hebrew sp=subs

result 14

אֶחָד
אֶחָד
>EX@D one freq_lex=970 language=Hebrew ls=card sp=subs

result 15

עַיִן
עַיִן
<AJIN eye freq_lex=887 language=Hebrew sp=subs

result 16

שָׁנָה
שָׁנָה
C@N@H year freq_lex=876 language=Hebrew sp=subs

result 17

שֵׁם
שֵׁם
C;M name freq_lex=864 language=Hebrew sp=subs

result 18

עֶבֶד
עֶבֶד
<EBED servant freq_lex=800 language=Hebrew sp=subs

result 19

אַיִן
אַיִן
>AJIN <NEG> freq_lex=788 language=Hebrew ls=nmcp sp=subs

result 20

אִשָּׁה
אִשָּׁה
>IC.@H woman freq_lex=781 language=Hebrew ls=nmdi sp=subs

result 21

שְׁנַיִם
שְׁנַיִם
C:NAJIM two freq_lex=768 language=Hebrew ls=card sp=subs

result 22

נֶפֶשׁ
נֶפֶשׁ
NEPEC soul freq_lex=754 language=Hebrew sp=subs

result 23

כֹּהֵן
כֹּהֵן
K.OH;N priest freq_lex=750 language=Hebrew sp=subs

result 24

אַחַר
אַחַר
>AXAR after freq_lex=715 language=Hebrew ls=ppre sp=subs

result 25

דֶּרֶךְ
דֶּרֶךְ
D.EREK: way freq_lex=706 language=Hebrew ls=ppre sp=subs

result 26

אָח
אָח
>@X brother freq_lex=629 language=Hebrew ls=nmdi sp=subs

result 27

שָׁלֹשׁ
שָׁלֹשׁ
C@LOC three freq_lex=602 language=Hebrew ls=card sp=subs

result 28

לֵב
לֵב
L;B heart freq_lex=601 language=Hebrew sp=subs

result 29

רֹאשׁ
רֹאשׁ
RO>C head freq_lex=599 language=Hebrew sp=subs

result 30

בַּת
בַּת
B.AT daughter freq_lex=588 language=Hebrew sp=subs

result 31

מַיִם
מַיִם
MAJIM water freq_lex=582 language=Hebrew sp=subs

result 32

מֵאָה
מֵאָה
M;>@H hundred freq_lex=579 language=Hebrew ls=card sp=subs

result 33

הַר
הַר
HAR mountain freq_lex=558 language=Hebrew sp=subs

result 34

גֹּוי
גֹּוי
G.OWJ people freq_lex=555 language=Hebrew sp=subs

result 35

אָדָם
אָדָם
>@D@M human, mankind freq_lex=553 language=Hebrew sp=subs

result 36

חָמֵשׁ
חָמֵשׁ
X@M;C five freq_lex=506 language=Hebrew ls=card sp=subs

result 37

קֹול
קֹול
QOWL sound freq_lex=505 language=Hebrew sp=subs

result 38

תַּחַת
תַּחַת
T.AXAT under part freq_lex=505 language=Hebrew ls=ppre sp=subs

result 39

פֶּה
פֶּה
P.EH mouth freq_lex=498 language=Hebrew sp=subs

result 40

אֶלֶף
אֶלֶף
>ELEP thousand freq_lex=492 language=Hebrew ls=card sp=subs

result 41

עֹוד
עֹוד
<OWD duration freq_lex=490 language=Hebrew ls=padv sp=subs

result 42

שֶׁבַע
שֶׁבַע
CEBA< seven freq_lex=490 language=Hebrew ls=card sp=subs

result 43

צָבָא
צָבָא
Y@B@> service freq_lex=486 language=Hebrew sp=subs

result 44

קֹדֶשׁ
קֹדֶשׁ
QODEC holiness freq_lex=469 language=Hebrew sp=subs

result 45

אַרְבַּע
אַרְבַּע
>AR:B.A< four freq_lex=454 language=Hebrew ls=card sp=subs

result 46

עֹולָם
עֹולָם
<OWL@M eternity freq_lex=438 language=Hebrew sp=subs

result 47

מִשְׁפָּט
מִשְׁפָּט
MIC:P.@V justice freq_lex=422 language=Hebrew sp=subs

result 48

שַׂר
שַׂר
FAR chief freq_lex=421 language=Hebrew sp=subs

result 49

שָׁמַיִם
שָׁמַיִם
C@MAJIM heavens freq_lex=421 language=Hebrew sp=subs
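
Before deciding on exclusions, we can tally the ls values across these 200 most frequent raw nouns (a quick sketch; None marks lexemes without an ls assignment):

ls_tally = collections.Counter(F.ls.v(node) for node in raw_nnodes[:200])
print(ls_tally.most_common())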

Based on the nouns that are present, we should make some key exclusions. Many substantives play more functional or adjectival roles. Undesirable categories include copulative nouns (nmcp, e.g. אין), cardinal numbers (card), and potential prepositions (ppre, e.g. תחת). The ls category of potential adverb (padv) contains desirable nouns like יום, but also more functionally adverbial nouns like עוד. Thus we can see that there is a range of adverbial tendencies within this category. Since it would be interesting to see these tendencies play out in the data, we decide to keep these instances.

To be sure, the very phenomenon of "functional" versus "nominal" is worthy of further, quantitative investigation. The ls feature is an experimental and incomplete feature in the ETCBC, and this is precisely the kind of shortcoming this present work seeks to address. Nouns and adverbs likely sit along a sliding scale of adverbial tendencies, with adverbs nearly always functioning in such a role, and nouns exhibiting various statistical tendencies. But due to the scope of this investigation, we limit ourselves to mainly nominal words with a small inclusion of some adverbial-like substantives.

We can eliminate more functional nouns by restricting the possible lexical set (ls) values. Below we apply those restrictions to the search template. In the case of certain quantifiers such as כל there is an ls feature of distributive noun (nmdi), yet this feature is likewise applied to nouns such as אח ("brother"). So it is undesirable to exclude all of these cases. Thus we depend, instead, on an additional filter list that excludes quantifiers.

A few terms such as דרך and עבר are eliminated because the ETCBC labels them as potential prepositions. This is a speculative classification, so we define a separate parameter in the template that saves these instances.

In [8]:
exclude = '|'.join(('KL/', 'M<V/', 'JTR/', 'M<FR/', 'XYJ/')) # exclude quantifiers
include = '|'.join(('padv', 'nmdi'))  # ok ls features
keep = '|'.join(('DRK/', '<BR/'))

'''
Below is a TF search query for three cases:
One is a lexeme with included ls features.
The second is a lexeme with a null ls feature.
The third is lexemes we want to protect from exclusion.
In all cases the unwanted lexemes are excluded.
'''

select_noun_search = f'''

lex language=Hebrew
/with/
sp=subs ls={include} lex#{exclude}
/or/
sp=subs ls# lex#{exclude}
/or/
sp=subs lex={keep}
/-/

'''

select_nouns = B.search(select_noun_search)
noun_dat_ordered = sorted(((F.freq_lex.v(res[0]), res[0]) for res in select_nouns), reverse=True)
nnodes_ordered = list(noun_dat[1] for noun_dat in noun_dat_ordered)
filtered_lexs = list(node for node in raw_nnodes if node not in nnodes_ordered)

print(f'\t{len(raw_nouns) - len(select_nouns)} results filtered out of raw noun list...')
print('\tfiltered lexemes shown below:')
show_word_list(filtered_lexs)

Plot the Nouns in Order of Frequency

Now that we have obtained a filtered noun-set, we must decide on a cut-off point at which to limit the present analysis. Below we plot the attested nouns and their respective frequencies.

In [9]:
# plot data
y_freqs = [lex_data[0] for lex_data in noun_dat_ordered]
x_rank = [i+1 for i in range(0, len(y_freqs))]
title = 'Noun Frequencies in the Hebrew Bible'
xlabel = 'Rank'
ylabel = 'Frequency'

# first plot
plt.figure(figsize=(12, 6))
plt.plot(x_rank, y_freqs)
plt.title(title + f' (ranks 1-{len(x_rank)})', size=18)
plt.xlabel(xlabel, size=18)
plt.ylabel(ylabel, size=18)
plt.plot()
Out[9]:
[]

We zoom in closer to view ranks 1-1000...

Consider using a subplot here with 4 different zooms

In [10]:
# second plot
plt.figure(figsize=(12, 6))
plt.plot(x_rank[:1000], y_freqs[:1000])
plt.xlabel(xlabel, size=18)
plt.ylabel(ylabel, size=18)
plt.axvline(200, color='red')
plt.savefig('plots/noun_frequencies1-1000.png', dpi=300, bbox_inches='tight') # save the plot (without title)
plt.title(title + f' (ranks 1-1000)', size=18)
plt.show()

This curve is typical of Zipf's law:

Zipf's law states that given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table (wikipedia)

The curve "elbows" sharply at around rank 15. Between ranks 50-100 there is still an appreciable drop-off, and the curve starts to flatten significantly after rank 200. We thus decide on a cut-off point at rank 200, based on the fact that the curve shows no further significant drop-off after this point.
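
As a quick numerical check of this inverse proportionality (a sketch reusing the x_rank and y_freqs lists defined above; under a perfect Zipf distribution the product frequency x rank stays roughly constant):

# sanity check: under Zipf's law, freq * rank should remain roughly stable
for rank in (1, 10, 50, 100, 200):
    freq = y_freqs[rank-1]
    print(f'rank {rank:>3}\tfreq {freq:>5}\tfreq*rank {freq*rank}')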

In [11]:
target_nouns = nnodes_ordered[:200]
tnoun_instances = set(word for lex in target_nouns for word in L.d(lex, 'word'))

show_word_list(target_nouns) # temporary comment out while bug is fixed
print(f'\n{len(tnoun_instances)} nouns ready for searches')
בֵּן  |אֱלֹהִים  |מֶלֶךְ  |אֶרֶץ  |יֹום  |אִישׁ  |פָּנֶה  |בַּיִת  |עַם  |יָד  |דָּבָר  |אָב  |עִיר  |עַיִן  |שָׁנָה  |שֵׁם  |עֶבֶד  |אִשָּׁה  |נֶפֶשׁ  |כֹּהֵן  |דֶּרֶךְ  |אָח  |לֵב  |רֹאשׁ  |בַּת  |מַיִם  |הַר  |גֹּוי  |אָדָם  |קֹול  |פֶּה  |עֹוד  |צָבָא  |קֹדֶשׁ  |עֹולָם  |מִשְׁפָּט  |שַׂר  |שָׁמַיִם  |תָּוֶךְ  |חֶרֶב  |כֶּסֶף  |מִזְבֵּחַ  |מָקֹום  |יָם  |זָהָב  |אֵשׁ  |רוּחַ  |נְאֻם  |שַׁעַר  |דָּם  |אֹהֶל  |סָבִיב  |אָדֹון  |עֵץ  |כְּלִי  |שָׂדֶה  |נָבִיא  |רָעָה  |מִלְחָמָה  |מְאֹד  |לֶחֶם  |עֵת  |חַטָּאת  |עֹלָה  |חֹדֶשׁ  |בְּרִית  |אַף  |פַּרְעֹה  |צֹאן  |אֶבֶן  |מִדְבָּר  |בָּשָׂר  |מַטֶּה  |לֵבָב  |רֶגֶל  |אַמָּה  |חֶסֶד  |חַיִל  |נַעַר  |גְּבוּל  |שָׁלֹום  |אֵל  |מַעֲשֶׂה  |עָוֹן  |זֶרַע  |קֶרֶב  |לַיְלָה  |בַּד  |נַחֲלָה  |אֲדָמָה  |מֹועֵד  |תֹּורָה  |אֵם  |בֶּגֶד  |מַחֲנֶה  |בֹּקֶר  |מַלְאָךְ  |מִנְחָה  |אֲרֹון  |כָּבֹוד  |חָצֵר  |כַּף  |שֶׁמֶן  |שֵׁבֶט  |בְּהֵמָה  |מִשְׁפָּחָה  |אֹזֶן  |רֵעַ  |סֵפֶר  |בָּקָר  |מִצְוָה  |שָׂפָה  |דֹּור  |בַּעַל  |חוּץ  |פֶּתַח  |אַיִל  |זֶבַח  |מָוֶת  |גִּבֹּור  |צְדָקָה  |רֹב  |צָפֹון  |חָכְמָה  |עֵדָה  |חַיִּים  |עֲבֹדָה  |יַיִן  |מַעַל  |מִשְׁכָּן  |נַחַל  |יָמִין  |נְחֹשֶׁת  |סוּס  |כִּסֵּא  |שֶׁמֶשׁ  |מִסְפָּר  |עֶרֶב  |חֹומָה  |פַּר  |נָשִׂיא  |חֹק  |אֶמֶת  |חֵמָה  |כֹּחַ  |קָהָל  |עֶצֶם  |בְּכֹר  |צֶדֶק  |רֶכֶב  |נָהָר  |פְּרִי  |פַּעַם  |תֹּועֵבָה  |לָשֹׁון  |מִשְׁפַּחַת  |אֹור  |מִגְרָשׁ  |אָחֹות  |שֶׁקֶר  |נֶגֶב  |שַׁבָּת  |עַמּוּד  |עָפָר  |כָּנָף  |כֶּבֶשׂ  |מְלָאכָה  |תָּמִיד  |חֻקָּה  |בָּמָה  |מַרְאֶה  |רָעָב  |רֹחַב  |חַיָּה  |עֹור  |חֲמֹור  |אֹרֶךְ  |שִׂמְחָה  |מַמְלָכָה  |פֶּשַׁע  |דַּעַת  |גֵּר  |כֶּרֶם  |קָצֶה  |חֵלֶב  |מַלְכוּת  |זְרֹועַ  |כְּרוּב  |עֵבֶר  |יֶלֶד  |שֶׁקֶל  |שֶׁלֶם  |דֶּלֶת  |קֶדֶם  |עֵצָה  |עָנָן  |פֵּאָה  |הָמֹון  |עֵדוּת  |הֵיכָל
73991 nouns ready for searches

Strategy for Context Selection

See pyscripts/contextparameters.py for the full delineation of these patterns and to see how they've been selected and tokenized. This is a work in progress.

  • phrase-type relations
    • PP and their objcs
  • subphrase relations
    • parallel (par) — syndetic connections
    • regens/rectum (rec) — nomen regens/rectum relations
    • adjunct (adj)
    • attribute (atr)
  • phrase_atom relations
    • apposition (appo) — re-identification relations
    • parallel and link, what is the difference here?
    • specification, what is it?
  • clause-constituent relations
    • "is a" relations with היה (subjects || predicate complements)
    • all other predicate / function roles — FRAMES
In [12]:
contexts = deliver_params(tnoun_instances, tf=api)
In [13]:
print('The following contextual relations will be queried:')
for i, param in enumerate(contexts):
    name = param['name']
    print(f'{i+1}. {name}')
The following contextual relations will be queried:
1. T.function→ st.verb.lex
2. T.prep.funct→ st.verb.lex
3. lex.PreC→ T.Subj
4. lex.prep.PreC→ T.Subj
5. T.PreC→ lex.Subj
6. T.prep.PreC→ lex.Subj
7. lex.coord→ T
8. T.coord→ lex
9. lex.atr→ T
10. lex.coord→ T (phrase atoms)
11. T.coord→ lex (phrase atoms)
12. lex.appo→ T
13. T.appo→ lex
In [14]:
counts = ContextCounter(contexts, tf=api, report=True)
running query on template [ T.function→ st.verb.lex ]...
	19878 results found.
running query on template [ T.prep.funct→ st.verb.lex ]...
	15012 results found.
running query on template [ lex.PreC→ T.Subj ]...
	2555 results found.
running query on template [ lex.prep.PreC→ T.Subj ]...
	1138 results found.
running query on template [ T.PreC→ lex.Subj ]...
	932 results found.
running query on template [ T.prep.PreC→ lex.Subj ]...
	1505 results found.
running query on template [ lex.coord→ T ]...
	4214 results found.
running query on template [ T.coord→ lex ]...
	4197 results found.
running query on template [ lex.atr→ T ]...
	1590 results found.
running query on template [ lex.coord→ T (phrase atoms) ]...
	704 results found.
running query on template [ T.coord→ lex (phrase atoms) ]...
	600 results found.
running query on template [ lex.appo→ T ]...
	1410 results found.
running query on template [ T.appo→ lex ]...
	3640 results found.
<><> Tests Done with 57375 results <><>

Random Samples of the Data

In [15]:
# randomized = [r for r in counts.search2result['T.const→ lex (with article separation)']]

# random.shuffle(randomized)
In [16]:
# B.show(randomized, end=50, condenseType='phrase', withNodes=True, extraFeatures={'sem_set'})

Excursus: Checking Context Tags and Gathering Examples

In this section I will inspect the tokens that are generated and counted, as well as pull out some examples and their counts for the presentation.

In [17]:
# patterns = {'funct.-> st.verb.lex': '\D*\.-> \D*\.\D*\[',
#             'funct.prep-> st.verb.lex': '\D*\.\D+\-> \D*\.\D*\['}

# token_examps = collections.defaultdict(list)

# for token in counts.data.index:
#     for query, pattern in patterns.items():
#         if re.match(pattern, token):
#             token_examps[query].append(token)

# for query in token_examps:
#     random.shuffle(token_examps[query])
#     examples = token_examps[query][:10]
#     targets = list()
    
#     # get example target nouns
#     for ex in examples:
#         ex_target = counts.data.loc[ex].sort_values(ascending=False).index[0]
#         targets.append(ex_target)
        
#     show_random = [f'target: {target} \t {ex}' for target, ex in zip(targets, examples)]
    
#     print('QUERY: ', query)
#     print('-'*5)
#     print('\n'.join(show_random))
#     print('-'*20, '\n')

Now some more specific counts...

In [18]:
counts.data['לב.n1']['T.Objc→ זכה.v1.piel'].sum()
Out[18]:
1.0
In [19]:
counts.data['פתח.n1']['T.Cmpl→ עמד.v1.qal'].sum()
Out[19]:
10.0
In [20]:
counts.data['אישׁ.n1']['T.Subj→ פקד.v1.hit'].sum()
Out[20]:
2.0
In [21]:
counts.data['שׁער.n1']['T.Loca→ שׁית.v1.qal'].sum()
Out[21]:
1.0
In [22]:
counts.data['גוי.n1']['T.ב.Adju→ אמר.v1.qal'].sum()
Out[22]:
2.0
In [23]:
counts.data['יד.n1']['T.מן.Cmpl→ ישׁע.v1.hif'].sum()
Out[23]:
17.0
In [24]:
counts.data['עת.n1']['T.ב.Time→ נתן.v1.nif'].sum()
Out[24]:
1.0
In [25]:
counts.data['דרך.n1']['T.ל.Cmpl→ פנה.v1.qal'].sum()
Out[25]:
1.0

Examining the Dataset

Below we look at the number of dimensions in the data:

In [26]:
counts.data.shape
Out[26]:
(13107, 199)

And a sample of the data is below, sorted on the results of אלהים in order to bring up interesting examples.

In [27]:
counts.data.sort_values(ascending=False, by='אלהים.n1').head(10)
Out[27]:
אלהים.n1 שׁמים.n1 ארץ.n1 אור.n1 יום.n1 לילה.n1 מים.n1 ים.n1 עץ.n1 נפשׁ.n1 ... ממלכה.n1 דעת.n1 עולם.n1 שׂמחה.n1 היכל.n1 עת.n1 תוך.n1 רחב.n2 ארך.n2 רב.n2
T.appo→ יהוה.n1 730.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
אחר.n2.atr→ T 61.0 0.0 2.0 0.0 5.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
T.Subj→ אמר.v1.qal 50.0 0.0 1.0 0.0 0.0 0.0 0.0 2.0 3.0 2.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
T.Objc→ עבד.v1.qal 33.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
T.Subj→ נתן.v1.qal 32.0 2.0 7.0 0.0 0.0 0.0 0.0 0.0 3.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
T.Subj→ עשׂה.v1.qal 23.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
T.appo→ אלהים.n1 18.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
אלהים.n1.appo→ T 18.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
T.אחר.n1.Cmpl→ הלך.v1.qal 15.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
T.Objc→ ירא.v1.qal 13.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

10 rows × 199 columns

Next we look at a few example counts:

In [28]:
pd.DataFrame(counts.data['אלהים.n1'][counts.data['אלהים.n1'] > 0].sort_values(ascending=False)).head(15)
Out[28]:
אלהים.n1
T.appo→ יהוה.n1 730.0
אחר.n2.atr→ T 61.0
T.Subj→ אמר.v1.qal 50.0
T.Objc→ עבד.v1.qal 33.0
T.Subj→ נתן.v1.qal 32.0
T.Subj→ עשׂה.v1.qal 23.0
אלהים.n1.appo→ T 18.0
T.appo→ אלהים.n1 18.0
T.אחר.n1.Cmpl→ הלך.v1.qal 15.0
T.Objc→ ירא.v1.qal 13.0
T.Subj→ ראה.v1.qal 13.0
T.Subj→ דבר.v1.piel 12.0
T.PreC→ יהוה.n1.Subj 11.0
T.Subj→ שׁמע.v1.qal 10.0
T.Objc→ עשׂה.v1.qal 9.0

This gives a good idea of the content of the co-occurrence counts.

Various Tag Searches Below

Below I isolate a few tags of interest to serve as examples in the paper.

TODO: Extract and display all the exact examples.

In [29]:
prec = [tag for tag in counts.data.index if 'PreC' in tag and 'אישׁ.n1' in tag]

prec
Out[29]:
['T.PreC→ אישׁ.n1.Subj',
 'T.אל.PreC→ אישׁ.n1.Subj',
 'T.את.PreC→ אישׁ.n1.Subj',
 'T.ב.PreC→ אישׁ.n1.Subj',
 'T.דרך.n1.PreC→ אישׁ.n1.Subj',
 'T.כ.PreC→ אישׁ.n1.Subj',
 'T.ל.PreC→ אישׁ.n1.Subj',
 'T.מן.PreC→ אישׁ.n1.Subj',
 'T.על.PreC→ אישׁ.n1.Subj',
 'אישׁ.n1.PreC→ T.Subj',
 'אישׁ.n1.כ.PreC→ T.Subj',
 'אישׁ.n1.ל.PreC→ T.Subj',
 'אישׁ.n1.על.PreC→ T.Subj']
In [30]:
target = 'עלה.n1'

target_counts = counts.data[target][counts.data[target]>0].sort_values(ascending=False)

prec_contexts = target_counts[target_counts.index.str.contains('ל.PreC')]

prec_contexts
Out[30]:
רצון.n1.ל.PreC→ T.Subj      2.0
רב.n2.ל.PreC→ T.Subj        1.0
נשׂיא.n1.על.PreC→ T.Subj    1.0
נגד.n1.ל.PreC→ T.Subj       1.0
T.ל.PreC→ אלה.Subj          1.0
T.ל.PreC→ אחד.n1.Subj       1.0
Name: עלה.n1, dtype: float64

Adjusting the Counts

We will apply two primary adjustments:

  1. We drop co-occurrences that are unique to a single noun; the dropped observations are treated as outliers. While such items are useful for describing the uniqueness of a given lexeme, they are unhelpful for drawing comparisons between our sets.
  2. We convert the counts into a measure of statistical significance. For this we use Fisher's exact test, which is well-suited to contingency tables containing counts of less than 5; our matrix is likely to have many such counts. The resulting p-values, where p < 0.05 indicates a statistically significant colexeme, will be log-transformed, and counts that fall below their expected frequencies receive a negative sign (see the sketch following this list).
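
A minimal sketch of the signed transformation described in point 2, assuming a single cell's p-value p together with its observed and expected counts:

import numpy as np

def signed_log(p, observed, expected):
    '''Signed log10 of a p-value: positive marks attraction, negative repulsion.'''
    return -np.log10(p) if observed >= expected else np.log10(p)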

Remove Co-occurrence Outliers

We will remove colexemes/bases that occur with only one target noun. This is done by subtracting the row total from each item in the row: if a cell becomes 0, that cell accounted for the entire row, which means the row has a unique colexeme that occurs with only one target noun (we will call that a hapax_colex here). We will remove these rows further down.
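
A toy illustration of the subtraction trick (hypothetical counts; the 0 that remains flags a context confined to a single noun):

row = pd.Series({'noun_a': 3, 'noun_b': 0, 'noun_c': 0}) # a context attested with one noun only
print(row.sub(row.sum())) # noun_a -> 0; noun_b, noun_c -> -3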

In [31]:
colex_counts = counts.data.sum(1)
remaining_counts = counts.data.sub(colex_counts, axis=0) # subtract colex_counts
hapax_colex = remaining_counts[(remaining_counts == 0).any(1)] # select rows that have a 0 value anywhere

Below is an example just to make sure we've selected the right indices. The value has been manually chosen from hapax_colex.

In [32]:
counts.data.loc['T.Adju→ אכל.v1.pual'].sort_values(ascending=False).head()
Out[32]:
חרב.n1      1.0
רב.n2       0.0
מלחמה.n1    0.0
רע.n2       0.0
איל.n2      0.0
Name: T.Adju→ אכל.v1.pual, dtype: float64

Indeed this context tag is only attested with חרב, thus it is not useful for drawing meaningful comparisons to this noun. Below we see that there are 8929 such basis elements in total. We remove these data points in the next cell and name the new dataset data.

In [33]:
hapax_colex.shape
Out[33]:
(8929, 199)
In [34]:
data = counts.data.drop(labels=hapax_colex.index, axis=0)

print(f'New data dimensions: {data.shape}')
print(f'New total observations: {data.sum().sum()}')
print(f'Observations removed: {counts.data.sum().sum() - data.sum().sum()}')
New data dimensions: (4178, 199)
New total observations: 44776.0
Observations removed: 12592.0

Random example to make sure there are no unique colexemes in the new dataset:

In [35]:
data.loc['T.Adju→ בוא.v1.hif'].sort_values(ascending=False).head(5)
Out[35]:
כלי.n1      1.0
זהב.n1      1.0
איל.n2      1.0
כסף.n1      1.0
מלחמה.n1    0.0
Name: T.Adju→ בוא.v1.hif, dtype: float64

Check for Orphaned Target Nouns

I want to see if any target nouns in the dataset now have 0 basis observations (i.e. are "orphaned") as a result of our data pruning. The test below shows that there are no columns in the table with a sum of 0.

In [36]:
data.loc[:, (data == 0).all(0)].shape
Out[36]:
(4178, 0)

How many zero counts are there?

The raw count matrix is quite sparse. Below we count the zeros, along with the other values found in the matrix.

In [37]:
unique_values, value_counts = np.unique(data.values, return_counts=True)
unique_counts = pd.DataFrame.from_dict(dict(zip(unique_values, value_counts)), orient='index', columns=['count'])
display(HTML('<h5>Top 10 Unique Values and Their Counts in Dataset</h5>'))
unique_counts.head(10)
Top 10 Unique Values and Their Counts in Dataset
Out[37]:
count
0.0 812715
1.0 12349
2.0 2934
3.0 1151
4.0 625
5.0 342
6.0 269
7.0 185
8.0 129
9.0 111
In [38]:
zero = unique_counts.loc[0.0][0]
non_zero = unique_counts[unique_counts.index > 0].sum()[0]
non_zero_ratio, zero_ratio = non_zero / (non_zero+zero), zero / (non_zero+zero)

print(f'Number of zero count variables: {zero} ({round(zero_ratio, 2)})')
print(f'Number of non-zero count variables: {non_zero} ({round(non_zero_ratio, 2)})')
Number of zero count variables: 812715 (0.98)
Number of non-zero count variables: 18707 (0.02)

Below the total number of remaining observations is given:

In [39]:
data.sum().sum()
Out[39]:
44776.0

Apply Fisher's Exact Test

Now we apply Fisher's exact test to the dataset. This involves supplying values to a 2x2 contingency table that is fed to scipy.stats.fisher_exact.
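
For a single cell this looks as follows (a sketch; the values a, b, c, d are the contingency parts for מלך.n1 and T.Objc→ נתן.v1.qal, which are extracted individually further down in this notebook):

a, b, c, d = 10, 1564, 639, 42563 # noun in context, noun elsewhere, other nouns in context, remainder
oddsratio, pvalue = stats.fisher_exact(np.array([[a, b], [c, d]]))
print(round(pvalue, 4)) # 0.0035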

Number of Datapoints To Iterate Over

Fisher's exact test takes some time to run because it must iterate over every noun x context pair. The number of pairs is printed below.

In [40]:
print(data.shape[0]*data.shape[1])
831422

Apply the Tests

The whole run takes 5.5-6.0 minutes on a 2017 MacBook Pro.

In [41]:
# data for contingency tables
target_obs = data.apply(lambda col: col.sum(), axis=0, result_type='broadcast') # total target lexeme observations
colex_obs = data.apply(lambda col: col.sum(), axis=1, result_type='broadcast') # total colexeme/basis observations
total_obs = data.sum().sum() # total observations

# preprocess parts of contingency formula; 
# NB: a_matrix = data
b_matrix = target_obs.sub(data)
c_matrix = colex_obs.sub(data)
d_matrix = pd.DataFrame.copy(data, deep=True)
d_matrix[:] = total_obs
d_matrix = d_matrix.sub(data+b_matrix+c_matrix)

fisher_transformed = collections.defaultdict(lambda: collections.defaultdict())

i = 0 # counter for messages
indent(reset=True) # TF utility for timed messages
info('applying Fisher\'s test to dataset...')
indent(level=1, reset=True)

for lex in data.columns:
    for colex in data.index:
        a = data[lex][colex]
        b = b_matrix[lex][colex]
        c = c_matrix[lex][colex]
        d = d_matrix[lex][colex]
        contingency = np.matrix([[a, b], [c, d]])
        oddsratio, pvalue = stats.fisher_exact(contingency)
        fisher_transformed[lex][colex] = pvalue
        i += 1
        if i % 100000 == 0: # update message every 100,000 iterations
            info(f'finished iteration {i}...')
            
indent(level=0)
info(f'DONE at iteration {i}!')

fisherdata = pd.DataFrame(fisher_transformed)
  0.00s applying Fisher's test to dataset...
   |       45s finished iteration 100000...
   |    1m 31s finished iteration 200000...
   |    2m 14s finished iteration 300000...
   |    2m 55s finished iteration 400000...
   |    3m 36s finished iteration 500000...
   |    4m 17s finished iteration 600000...
   |    4m 57s finished iteration 700000...
   |    5m 35s finished iteration 800000...
 5m 47s DONE at iteration 831422!
In [42]:
fisherdata.head(10)
Out[42]:
אלהים.n1 שׁמים.n1 ארץ.n1 אור.n1 יום.n1 לילה.n1 מים.n1 ים.n1 עץ.n1 נפשׁ.n1 ... ממלכה.n1 דעת.n1 עולם.n1 שׂמחה.n1 היכל.n1 עת.n1 תוך.n1 רחב.n2 ארך.n2 רב.n2
T.Adju→ אסף.v1.nif 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.000000 ... 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
T.Adju→ בוא.v1.hif 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.000000 ... 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
T.Adju→ גור.v2.qal 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.000000 ... 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
T.Adju→ דבר.v1.piel 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.000000 ... 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
T.Adju→ הלך.v1.qal 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.000000 ... 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
T.Adju→ זבח.v1.qal 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.000000 ... 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
T.Adju→ ילד.v1.qal 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.031096 ... 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
T.Adju→ יצא.v1.qal 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.000000 ... 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
T.Adju→ ישׁב.v1.qal 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.000000 ... 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
T.Adju→ לקח.v1.qal 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.000000 ... 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0

10 rows × 199 columns

log10 transformation

The p-values are now converted into signed log scores: where an observed count exceeds its expected frequency, the score is -log10(p) (attraction); where it falls below, the score is log10(p), i.e. negative (repulsion). The expected frequency for a cell is its row total times its column total divided by the table total, computed below as expectedfreqs.

In [43]:
expectedfreqs = (data+b_matrix) * (data+c_matrix) / (data+b_matrix+c_matrix+d_matrix)
fishertransf = collections.defaultdict(lambda: collections.defaultdict())

indent(reset=True)
info('applying log10 transformation to Fisher\'s data...')

for lex in data.columns:
    for colex in data.index:
        observed_freq = data[lex][colex]
        exp_freq = expectedfreqs[lex][colex]
        pvalue = fisherdata[lex][colex]
        if observed_freq < exp_freq:
            logv = np.log10(pvalue)
            fishertransf[lex][colex] = logv
        else:
            logv = -np.log10(pvalue)
            fishertransf[lex][colex] = logv
    
info('finished transformations!')
            
fishertransf = pd.DataFrame(fishertransf)
  0.00s applying log10 transformation to Fisher's data...
/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:16: RuntimeWarning: divide by zero encountered in log10
  app.launch_new_instance()
    22s finished transformations!

The Fisher's test has produced zero values, indicating a very high degree of attraction between certain lexemes and colexemes. A log-transformed zero equals infinity. Below those values are isolated.

In [44]:
display(HTML('<h5>contexts x nouns with a p-value of 0 :</h5>'))
inf_nouns = fishertransf.columns[(fishertransf == np.inf).any()]
inf_data = [] # inf data contains column/index information needed to assign the new values
for inf_noun in inf_nouns:
    inf_noun2context = pd.DataFrame(fishertransf[inf_noun][fishertransf[inf_noun] == np.inf])
    inf_data.append(inf_noun2context)
    display(inf_noun2context)
contexts x nouns with a p-value of 0 :
אלהים.n1
T.appo→ יהוה.n1 inf

In this case the Fisher's test has returned a zero value. A p-value of 0 means that the likelihood that אלהים and יהוה are independent variables is essentially null. We can thus reject the null hypothesis that the two values are unrelated; rather, there is a maximum level of confidence that these two values are interrelated. The np.inf value that resulted from -log10(0) is not viable for calculating vector distances. Thus, we need to substitute an arbitrary, but appropriate, value. Below we access the lowest non-zero p-values in the dataset.

In [45]:
minimum_pvalues = fisherdata.min()[fisherdata.min() > 0].sort_values()
minmin_noun = minimum_pvalues.index[0]
minmin_context = fisherdata[minimum_pvalues.index[0]].sort_values().index[0]
minimum_pvalues.head(10)
Out[45]:
קול.n1     2.587459e-189
יד.n1      1.080958e-171
שׁם.n1     7.552549e-161
ברית.n1    5.858883e-131
דבר.n1     2.480769e-126
בן.n1      1.118129e-123
אמה.n2     3.888547e-123
שׁנה.n1    6.213908e-118
רחב.n2     5.041356e-106
לחם.n1     2.673358e-101
dtype: float64

The minimum noun x context score is shown below.

In [46]:
minmin_noun
Out[46]:
'קול.n1'
In [47]:
minmin_context
Out[47]:
'T.ב.Cmpl→ שׁמע.v1.qal'

The small p-value listed above is used to substitute for the infinite values below.

In [48]:
# make the substitutions 
for inf_dat in inf_data:
    for noun in inf_dat.columns:
        for context in inf_dat.index:
            print(f'adjusting infinite score for {noun}')
            new_pvalue, new_transf = fisherdata[minmin_noun][minmin_context], fishertransf[minmin_noun][minmin_context]
            fisherdata[noun][context] = new_pvalue
            print(f'\tpvalue updated to {new_pvalue}')
            fishertransf[noun][context] = new_transf
            print(f'\ttransformed pvalue updated to {new_transf}')
adjusting infinite score for אלהים.n1
	pvalue updated to 2.5874592836059496e-189
	transformed pvalue updated to 188.58712647556183

Below we double-check to ensure that all infinite values have been removed. The test should read False.

In [49]:
# infinites in dataset?
bool(len(fishertransf[(fishertransf == np.inf).any(1)].index))
Out[49]:
False

Comparing Raw and Adjusted Counts

What kinds of counts are "upvoted" and "downvoted" in the adjusted numbers? This information is helpful for gaining insight into the adjustment process and the efficacy of its results.

Below I isolate and compare counts for a set of key lexemes: מלך "king", עיר "city", and חכמה "wisdom". The counts are analyzed by comparing context-tag rankings and looking for the contexts that are most affected by the adjustment (i.e. show the largest absolute rank differences).

In [50]:
examine_nouns = ['מלך.n1', 'עיר.n1', 'חכמה.n1']

context_rankings = {}

# gather context rankings into dataframes
for noun in examine_nouns:
    
    # make raw context DF, sorted, with columns count and rank
    rawcounts = pd.DataFrame(data[noun].values, 
                             columns=['count'], 
                             index=data.index).sort_values(ascending=False, by='count')
    rawcounts['rank'] = np.arange(len(rawcounts))+1 # add column "rank"
    
    # make adjusted context DF, sorted, with columns count and rank
    adjcounts = pd.DataFrame(fishertransf[noun].values, 
                             columns=['count'], 
                             index=fishertransf.index).sort_values(ascending=False, by='count')
    adjcounts['rank'] = np.arange(len(adjcounts))+1
    
    # put both DFs into dict mapped to noun
    context_rankings[noun]={'raw':rawcounts, 'adj':adjcounts}
    
    
# print for each noun a report on top up/downgrades
for noun, rankset in context_rankings.items():
    raw, adj = rankset['raw'], rankset['adj']
    upgrades = pd.DataFrame((raw['rank']-adj['rank']).sort_values(ascending=False))
    downgrades = pd.DataFrame((raw['rank']-adj['rank']).sort_values())
    upgrades.columns, downgrades.columns = [['difference']]*2
    upgrades['previous rank'], downgrades['previous rank'] = [raw['rank']]*2
    upgrades['new rank'], downgrades['new rank'] = [adj['rank']]*2

    display(HTML(f'<h3>{noun}</h3>'))
    print('top 10 raw counts:')
    display(raw.head(10))
    print('top 10 adjusted counts:')
    display(adj.head(10))
    print('top 10 rank upgrades')
    display(upgrades.head(10))
    print('top 10 rank downgrades')
    display(downgrades.head(10))
    print('-'*40)
    print()

מלך.n1

top 10 raw counts:
count rank
T.Subj→ אמר.v1.qal 127.0 1
T.appo→ אדון.n1 58.0 2
T.אל.Cmpl→ אמר.v1.qal 39.0 3
T.אל.Cmpl→ בוא.v1.qal 35.0 4
T.Subj→ שׁלח.v1.qal 34.0 5
דוד.n3.appo→ T 32.0 6
T.Subj→ בוא.v1.qal 28.0 7
T.Subj→ עשׂה.v1.qal 26.0 8
T.coord→ מלך.n1 25.0 9
מלך.n1.coord→ T 25.0 10
top 10 adjusted counts:
count rank
T.appo→ אדון.n1 83.023931 1
T.Subj→ אמר.v1.qal 78.126007 2
T.Subj→ שׁלח.v1.qal 30.922073 3
דוד.n3.appo→ T 30.375898 4
T.appo→ נבוכדראצר.n1 29.639071 5
מלך.n1.coord→ T 25.275193 6
T.ל.Cmpl→ משׁח.v1.qal 22.080721 7
T.Subj→ צוה.v1.piel 20.632771 8
רחבעם.n1.appo→ T 16.366377 9
T.appo→ צדקיהו.n1 16.263470 10
top 10 rank upgrades
difference previous rank new rank
T.ב.Cmpl→ נחה.v1.qal 2720 3570 850
T.ב.Cmpl→ פגע.v1.qal 2713 3539 826
T.ב.Cmpl→ ענה.v1.qal 2713 3534 821
T.ב.Cmpl→ עצר.v1.qal 2713 3535 822
T.ב.Cmpl→ עשׁן.v1.qal 2713 3537 824
T.ב.Cmpl→ עשׂה.v1.qal 2713 3538 825
T.ב.Cmpl→ ערב.v3.hit 2713 3536 823
T.ב.Cmpl→ פוץ.v1.hif 2713 3540 827
T.ב.Cmpl→ עמד.v1.hif 2712 3532 820
T.ב.Cmpl→ עלה.v1.qal 2712 3531 819
top 10 rank downgrades
difference previous rank new rank
T.Objc→ נתן.v1.qal -4133 33 4166
T.Objc→ שׂים.v1.qal -4060 60 4120
T.Objc→ ראה.v1.qal -4037 59 4096
T.appo→ יהוה.n1 -4024 153 4177
בן.n1.appo→ T -3871 293 4164
T.Subj→ אכל.v1.qal -3858 237 4095
בן.n1.coord→ T -3846 317 4163
T.Objc→ עלה.v1.hif -3834 313 4147
רב.n1.atr→ T -3822 331 4153
T.Objc→ לקח.v1.qal -3768 408 4176
----------------------------------------

עיר.n1

top 10 raw counts:
count rank
T.Objc→ נתן.v1.qal 25.0 1
חצר.n1.coord→ T 21.0 2
T.coord→ עיר.n1 20.0 3
עיר.n1.coord→ T 20.0 4
T.Objc→ בנה.v1.qal 19.0 5
T.אל.Cmpl→ בוא.v1.qal 15.0 6
T.Cmpl→ בוא.v1.qal 15.0 7
T.ב.Cmpl→ ישׁב.v1.qal 15.0 8
T.Objc→ לכד.v1.qal 14.0 9
בצור.n1.atr→ T 14.0 10
top 10 adjusted counts:
count rank
חצר.n1.coord→ T 31.664448 1
עיר.n1.coord→ T 23.079220 2
בצור.n1.atr→ T 23.032384 3
T.coord→ עיר.n1 16.543465 4
T.Objc→ לכד.v1.qal 15.462452 5
T.ב.Cmpl→ קבר.v1.qal 14.940743 6
T.ב.Cmpl→ קבר.v1.nif 14.393720 7
מגרשׁ.n1.coord→ T 13.281312 8
T.Cmpl→ בוא.v1.qal 10.175419 9
T.appo→ ירושׁלם.n1 9.903256 10
top 10 rank upgrades
difference previous rank new rank
T.ב.Cmpl→ יצא.v1.qal 2679 3509 830
T.ב.Cmpl→ יצב.v1.hit 2679 3510 831
T.ב.Cmpl→ יצג.v1.hif 2679 3511 832
T.ב.Cmpl→ יצת.v1.nif 2679 3512 833
T.ב.Cmpl→ ירד.v1.qal 2679 3513 834
T.ב.Cmpl→ ירה.v1.hif 2679 3514 835
T.ב.Cmpl→ ירה.v2.qal 2679 3515 836
T.ב.Cmpl→ ישׁר.v1.qal 2679 3516 837
T.ב.Cmpl→ יתר.v1.nif 2679 3517 838
T.ב.Cmpl→ כבד.v1.nif 2679 3518 839
top 10 rank downgrades
difference previous rank new rank
T.Objc→ לקח.v1.qal -4103 67 4170
T.Subj→ בוא.v1.qal -4103 51 4154
T.Objc→ שׂים.v1.qal -4058 47 4105
T.Objc→ שׁמר.v1.qal -4023 137 4160
T.Objc→ עשׂה.v1.qal -4001 176 4177
בן.n1.coord→ T -3995 157 4152
T.אל.Cmpl→ אמר.v1.qal -3979 183 4162
T.Objc→ אכל.v1.qal -3934 219 4153
T.ל.Cmpl→ נתן.v1.qal -3932 235 4167
T.ל.Cmpl→ עמד.v1.qal -3731 410 4141
----------------------------------------

חכמה.n1

top 10 raw counts:
count rank
T.Objc→ נתן.v1.qal 9.0 1
T.Objc→ ראה.v1.qal 5.0 2
תבונה.n1.coord→ T 5.0 3
T.Objc→ ידע.v1.qal 4.0 4
T.Objc→ שׁמע.v1.qal 4.0 5
טוב.n1.PreC→ T.Subj 4.0 6
T.Objc→ קנה.v1.qal 3.0 7
T.coord→ דבר.n1 3.0 8
T.Subj→ מצא.v1.nif 2.0 9
T.ב.Adju→ עשׂה.v1.qal 2.0 10
top 10 adjusted counts:
count rank
תבונה.n1.coord→ T 11.787864 1
טוב.n1.PreC→ T.Subj 5.262299 2
הוללה.n1.coord→ T 4.754802 3
T.Objc→ נתן.v1.qal 4.502330 4
T.coord→ דבר.n1 4.455632 5
T.Objc→ קנה.v1.qal 4.221725 6
T.Objc→ ידע.v1.qal 4.198602 7
T.ל.Cmpl→ קשׁב.v1.hif 3.912472 8
T.על.Cmpl→ שׁמע.v1.qal 3.679773 9
T.Objc→ ראה.v1.qal 3.476901 10
top 10 rank upgrades
difference previous rank new rank
T.Subj→ עטף.v2.hit 1486 3663 2177
T.Subj→ עטף.v2.qal 683 3828 3145
T.Subj→ עזב.v1.pual 680 3144 2464
T.appo→ אבשׁי.n1 324 4051 3727
T.Subj→ עיף.v1.qal 169 3829 3660
T.appo→ אבשׁלום.n1 166 4086 3920
T.appo→ אדון.n1 165 4087 3922
T.appo→ אברהם.n1 164 3921 3757
T.Subj→ זרח.v1.qal 164 2535 2371
T.Subj→ עזב.v1.qal 164 2629 2465
top 10 rank downgrades
difference previous rank new rank
T.Subj→ אמר.v1.qal -2057 2119 4176
T.Objc→ עשׂה.v1.qal -1282 2896 4178
T.Objc→ לקח.v1.qal -1140 3035 4175
T.Objc→ שׁבח.v1.piel -452 2177 2629
T.Objc→ שׁבר.v1.piel -350 2277 2627
T.Objc→ שׁאר.v1.hif -287 2694 2981
T.Objc→ שׁבה.v1.qal -259 2628 2887
T.coord→ בן.n1 -193 3728 3921
T.Objc→ יסד.v1.qal -161 2724 2885
T.Subj→ לון.v1.nif -129 2372 2501
----------------------------------------

Export Data for מלך for Paper

In [51]:
context_rankings['מלך.n1']['raw'].head(10).to_csv('spreadsheets/king_top10_raw.csv')
round(context_rankings['מלך.n1']['adj'].head(10), 2).to_csv('spreadsheets/king_top10_adj.csv')

Extracting Specific Examples for the Paper (on מלך) to Illustrate Count Adjustments

Below the four separate parts of the contingency table are extracted for מלך "king". These were previously calculated above.

In [52]:
data['מלך.n1']['T.Objc→ נתן.v1.qal'] # A
Out[52]:
10.0
In [53]:
b_matrix['מלך.n1']['T.Objc→ נתן.v1.qal'] # B
Out[53]:
1564.0
In [54]:
c_matrix['מלך.n1']['T.Objc→ נתן.v1.qal'] # C
Out[54]:
639.0
In [55]:
d_matrix['מלך.n1']['T.Objc→ נתן.v1.qal'] # D
Out[55]:
42563.0

Where do the 10 cases happen?

In [56]:
passages = []
for res in counts.target2basis2result['מלך.n1']['T.Objc→ נתן.v1.qal']:
    passages.append('{} {}:{}'.format(*T.sectionFromNode(res[0])))
print('; '.join(passages))
Deuteronomy 7:24; Joshua 6:2; Joshua 8:1; Joshua 10:30; 1_Samuel 8:6; 1_Samuel 12:13; Hosea 13:10; Hosea 13:11; Nehemiah 13:26; 2_Chronicles 2:10

What is the result of the Fisher's test?

In [57]:
round(fisherdata['מלך.n1']['T.Objc→ נתן.v1.qal'], 4)
Out[57]:
0.0035

What is the value of the expected count?

In [58]:
round(expectedfreqs['מלך.n1']['T.Objc→ נתן.v1.qal'], 2)
Out[58]:
22.81
In [59]:
round(fishertransf['מלך.n1']['T.Objc→ נתן.v1.qal'], 2)
Out[59]:
-2.45

How has the rank changed?

In [60]:
context_rankings['מלך.n1']['raw'].loc['T.Objc→ נתן.v1.qal']
Out[60]:
count    10.0
rank     33.0
Name: T.Objc→ נתן.v1.qal, dtype: float64
In [61]:
context_rankings['מלך.n1']['adj'].loc['T.Objc→ נתן.v1.qal']
Out[61]:
count      -2.451746
rank     4166.000000
Name: T.Objc→ נתן.v1.qal, dtype: float64

Excursus: A Random Sample Examined

We saw that the model seems to be succeeding at isolating intuitive associations with קול. Let's look at another example at random, in this case the noun ארץ ("land"). Below are the transformed p-values for that noun.

In [62]:
fishertransf['ארץ.n1'].sort_values(ascending=False).head(10)
Out[62]:
T.Objc→ ירשׁ.v1.qal      54.042522
T.מן.Cmpl→ יצא.v1.hif    40.014625
T.coord→ שׁמים.n1        38.416243
T.מן.Cmpl→ עלה.v1.hif    23.970178
T.ב.Cmpl→ ישׁב.v1.qal    20.373228
T.coord→ ארץ.n1          17.908065
ארץ.n1.appo→ T           15.355883
T.אל.Cmpl→ בוא.v1.hif    13.967356
ארץ.n1.coord→ T          13.836144
T.appo→ ארץ.n1           13.783808
Name: ארץ.n1, dtype: float64

The most associated variables include cases where ארץ is an object to the verb ירשׁ, where ארץ serves as the complement from which something is brought (hifil of יצא and hifil of עלה), frequently in construct to עם "people", the participle of ישׁב "inhabitant(s)", and ממלכה "kingdom", as well as other satisfying and expected occasions of use. These examples show that the model is working well.


Comparing the Nouns

The nouns are now ready to be compared. I will do so in two ways.

  1. Principal Component Analysis — We have a semantic space with 4,178 dimensions. That is a lot of potential angles from which to compare the vectors. One method that is commonly used in semantic space analysis is principal component analysis, or PCA. PCA is a dimensionality reduction method that projects the multi-dimensional vectors onto the orthogonal components along which the nouns show the most variance. We can visualize said space by plotting the first two components on an X and Y axis.
  2. Cosine Similarity — This measure allows us to compare the vectors on the basis of their trajectories. This method is particularly well-suited for semantic spaces because it ignores differences in raw frequency (vector magnitude) and compares, rather, the closeness of the vectors' directions; a sketch of the measure follows this list.
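
A minimal sketch of the cosine measure, using the pairwise_distances function imported at the top of the notebook (cosine similarity = 1 - cosine distance; rows and columns are the 199 target nouns):

cos_sims = 1 - pairwise_distances(fishertransf.T.values, metric='cosine')
cos_sims = pd.DataFrame(cos_sims, index=fishertransf.columns, columns=fishertransf.columns)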

PCA Analysis

We want to apply PCA in order to plot nouns in an imaginary space. The goal is to use the visualization to identify patterns and groups amongst the 199 target nouns. Nouns that are more similar should fall within the same general areas relative to the origin (0, 0). PCA seeks to identify the directions of maximum variance among the noun vectors.

In [63]:
pca = PCA(10) # PCA with 10 principal components
noun_fit = pca.fit(fishertransf.T.values) # fit the PCA model to the noun vectors
pca_nouns = noun_fit.transform(fishertransf.T.values) # get coordinates

plt.figure(figsize=(8, 6))
sns.barplot(x=np.arange(10)+1, y=noun_fit.explained_variance_ratio_[:10])
plt.xlabel('Principal Component', size=20)
plt.ylabel('Ratio of Explained Variance', size=20)
plt.title('Ratio of Explained Variance for Principal Components 1-10 (Scree Plot)', size=20)
plt.show()

Variance accounted for by PC1 and PC2:

In [64]:
noun_fit.explained_variance_ratio_[0]+noun_fit.explained_variance_ratio_[1]
Out[64]:
0.1267060585344464

The plot above, also called a scree plot, tells us that the first two principal components account for only about 13% of the total variance in the dataset. Thus the PCA noun space is rather noisy. This may be explained by the fact that we are combining many different kinds of syntactic contexts into one dataset. It may also be due to the rather diffuse nature of lexical data.

Below we extract the top 25 features that are most influential for the first two principal components. These "loadings" are computed by scaling each component's eigenvector by the square root of its explained variance.

In [65]:
# loading scores = eigenvectors scaled by the square root of their eigenvalues
loadings = noun_fit.components_.T * np.sqrt(noun_fit.explained_variance_)
loadings = pd.DataFrame(loadings.T, index=np.arange(10)+1, columns=data.index) # rows = PCs 1-10, columns = context tags
In [66]:
pc1_loadings = pd.DataFrame(loadings.loc[1].sort_values(ascending=False))
pc2_loadings = pd.DataFrame(loadings.loc[2].sort_values(ascending=False))

pc1_loadings_above0 = pc1_loadings[pc1_loadings[1] > 0.1] # isolate loadings > 0.1

# automatically detect elbow in graph:
elbow = KneeLocator(x=np.arange(pc1_loadings_above0.shape[0]), 
                    y=pc1_loadings_above0[1].values, 
                    curve='convex', 
                    direction='decreasing').knee

# plot it all
plt.figure(figsize=(8, 6))
plt.plot(pc1_loadings_above0.values)
plt.title('Loading Scores >0.1 by Rank for Principal Component 1', size=20)
plt.ylabel('Loading Score', size=20)
plt.xlabel('Rank', size=20)
plt.xticks(np.arange(pc1_loadings_above0.shape[0], step=20), size=20)
plt.yticks(size=20)
plt.axvline(elbow, color='red') # plot elbow with red line
plt.show()

Top PCX Loadings and Scores (for data exploration)

In [67]:
# pcx_loadings = pd.DataFrame(loadings.loc[4].sort_values(ascending=False)) # for experiments

# pcx_loadings.head(25)

Top 25 PC1 Loadings and Scores

In [68]:
pc1_loadings.round(2).head(25).to_csv('spreadsheets/PC1_loadings.csv')
pc1_loadings.head(25)
Out[68]:
1
בן.n1.coord→ T 7.902528
T.coord→ בן.n1 6.291447
בן.n1.appo→ T 5.999691
T.appo→ בן.n1 5.949898
בת.n1.coord→ T 5.327555
T.coord→ אהרן.n1 3.197214
T.Objc→ ילד.v1.qal 3.143915
T.אל.Cmpl→ דבר.v1.piel 2.002405
T.Objc→ ילד.v1.hif 1.703356
T.appo→ יהושׁע.n1 1.559670
אח.n1.coord→ T 1.354382
כהן.n1.appo→ T 1.316869
T.appo→ שׁלמה.n2 1.117382
T.appo→ בניהו.n1 1.070644
שׁבט.n1.coord→ T 0.860355
בן.n1.PreC→ T.Subj 0.745098
T.appo→ ירבעם.n1 0.725176
T.PreC→ בן.n1.Subj 0.612568
בכר.n1.appo→ T 0.601718
T.coord→ אתה 0.585144
T.appo→ אלעזר.n1 0.575114
T.coord→ אשׁה.n1 0.571152
T.appo→ פקח.n2 0.559075
T.appo→ שׁמעי.n2 0.535425
T.appo→ פינחס.n1 0.533944

PC1 Verb Contexts and Loadings

In [69]:
pc1_loadings[pc1_loadings.index.str.contains('v1')].round(2).head(15).to_csv('spreadsheets/top15_animate_verbs.csv')

top_pc1_loadings = pc1_loadings[pc1_loadings[1] >= 0.30]

pc1_loadings[pc1_loadings.index.str.contains('v1')].head(15)
Out[69]:
1
T.Objc→ ילד.v1.qal 3.143915
T.אל.Cmpl→ דבר.v1.piel 2.002405
T.Objc→ ילד.v1.hif 1.703356
T.Subj→ בנה.v1.qal 0.469168
T.Subj→ נסע.v1.qal 0.442882
T.Subj→ עשׂה.v1.qal 0.438378
T.Subj→ סמך.v1.qal 0.421702
T.Objc→ צוה.v1.piel 0.395177
T.ל.Cmpl→ לקח.v1.qal 0.336636
T.Subj→ ילד.v1.nif 0.333946
T.Subj→ ילד.v1.pual 0.333793
T.Objc→ לקח.v1.qal 0.331304
T.Subj→ מות.v1.qal 0.319758
T.Subj→ חנה.v1.qal 0.281649
T.Subj→ שׁמר.v1.qal 0.275626

Looking at T.ל.Cmpl→ לקח.v1.qal

This is an interesting top verbal context. Is it related to marriage situations?

In [70]:
take_contexts = [r for r in counts.basis2result['T.ל.Cmpl→ לקח.v1.qal']]
random.seed(213214) # shuffle random, preserve state
random.shuffle(take_contexts)
B.show(take_contexts, condenseType='clause', withNodes=True, end=5)
display(HTML(f'<h4>...{len(take_contexts)-5} other results cutoff...</h4>'))

result 1

Deuteronomy 7:3
445729
clause 445729 WxY0
phrase 706740 Conj CP
97490
conj and freq_lex=50272 language=Hebrew
phrase 706741 Objc NP
97491
subs daughter freq_lex=588 language=Hebrew
phrase 706742 Nega NegP
97492
nega not freq_lex=5167 language=Hebrew
phrase 706743 Pred VP
97493
verb take qal impf freq_lex=965 language=Hebrew
phrase 706744 Cmpl PP
97494
prep to freq_lex=20069 language=Hebrew
97495
subs son freq_lex=4937 language=Hebrew

result 2

Genesis 25:20
430015
clause 430015 Adju InfC
phrase 659030 PreS VP
12722
prep in freq_lex=15542 language=Hebrew
12723
verb take qal infc freq_lex=965 language=Hebrew
phrase 659031 Objc PP
12724
prep <object marker> freq_lex=10989 language=Hebrew
12725
nmpr Rebekah freq_lex=30 language=Hebrew
phrase 659031 Objc PP|NP
12726
subs daughter freq_lex=588 language=Hebrew
12727
nmpr <father of Laban> freq_lex=9 language=Hebrew
phrase 659031 Objc PP|NP
12728
art the freq_lex=30386 language=Hebrew
12729
subs Aramean freq_lex=12 language=Hebrew ls=gntl
phrase 659031 Objc PP
12730
prep from freq_lex=7562 language=Hebrew
12731
subs field? freq_lex=11 language=Hebrew
12732
nmpr Aram freq_lex=143 language=Hebrew
phrase 659031 Objc PP|NP
12733
subs sister freq_lex=114 language=Hebrew ls=nmdi
12734
nmpr Laban freq_lex=54 language=Hebrew
phrase 659031 Objc PP|NP
12735
art the freq_lex=30386 language=Hebrew
12736
subs Aramean freq_lex=12 language=Hebrew ls=gntl
phrase 659032 Supp PP
12737
prep to freq_lex=20069 language=Hebrew
phrase 659033 Cmpl PP
12738
prep to freq_lex=20069 language=Hebrew
12739
subs woman freq_lex=781 language=Hebrew ls=nmdi

result 3

Genesis 28:9
430475
clause 430475 Way0
phrase 660387 Conj CP
14654
conj and freq_lex=50272 language=Hebrew
phrase 660388 Pred VP
14655
verb take qal wayq freq_lex=965 language=Hebrew
phrase 660389 Objc PP
14656
prep <object marker> freq_lex=10989 language=Hebrew
14657
nmpr Mahalath freq_lex=2 language=Hebrew
phrase 660389 Objc PP|NP
14658
subs daughter freq_lex=588 language=Hebrew
14659
nmpr Ishmael freq_lex=48 language=Hebrew
phrase 660389 Objc PP|NP
14660
subs son freq_lex=4937 language=Hebrew
14661
nmpr Abraham freq_lex=175 language=Hebrew
phrase 660389 Objc PP|NP
14662
subs sister freq_lex=114 language=Hebrew ls=nmdi
14663
nmpr Nebaioth freq_lex=5 language=Hebrew
phrase 660390 Adju PP
14664
prep upon freq_lex=5766 language=Hebrew
14665
subs woman freq_lex=781 language=Hebrew ls=nmdi
phrase 660391 Supp PP
14666
prep to freq_lex=20069 language=Hebrew
phrase 660392 Cmpl PP
14667
prep to freq_lex=20069 language=Hebrew
14668
subs woman freq_lex=781 language=Hebrew ls=nmdi

result 4

Deuteronomy 25:5
447531
clause 447531 WQt0
phrase 712266 Conj CP
106888
conj and freq_lex=50272 language=Hebrew
phrase 712267 PreO VP
106889
verb take qal perf freq_lex=965 language=Hebrew
phrase 712268 Supp PP
106890
prep to freq_lex=20069 language=Hebrew
phrase 712269 Cmpl PP
106891
prep to freq_lex=20069 language=Hebrew
106892
subs woman freq_lex=781 language=Hebrew ls=nmdi

result 5

Genesis 12:19
428577
clause 428577 Way0
phrase 654740 Conj CP
5837
conj and freq_lex=50272 language=Hebrew
phrase 654741 Pred VP
5838
verb take qal wayq freq_lex=965 language=Hebrew
phrase 654742 Objc PP
5839
prep <object marker> freq_lex=10989 language=Hebrew
phrase 654743 Supp PP
5840
prep to freq_lex=20069 language=Hebrew
phrase 654744 Cmpl PP
5841
prep to freq_lex=20069 language=Hebrew
5842
subs woman freq_lex=781 language=Hebrew ls=nmdi

...36 other results cutoff...

In [71]:
'; '.join(['{} {}:{}'.format(*T.sectionFromNode(r[0])) for r in sorted(take_contexts)])
Out[71]:
'Genesis 12:19; Genesis 24:3; Genesis 24:4; Genesis 24:7; Genesis 24:37; Genesis 24:38; Genesis 24:40; Genesis 24:48; Genesis 25:20; Genesis 28:9; Genesis 34:4; Genesis 34:21; Genesis 43:18; Exodus 6:7; Exodus 6:20; Exodus 6:23; Exodus 6:25; Exodus 34:16; Numbers 8:8; Numbers 35:31; Deuteronomy 7:3; Deuteronomy 21:11; Deuteronomy 24:3; Deuteronomy 25:5; Joshua 9:4; Judges 14:2; Judges 20:10; 1_Samuel 17:17; 1_Samuel 25:39; 1_Samuel 25:40; 2_Samuel 12:9; 1_Kings 4:15; 2_Kings 4:1; Isaiah 66:21; Jeremiah 29:6; Ezekiel 44:22; Job 40:28; Proverbs 22:25; Esther 2:7; Esther 2:15; Nehemiah 10:31'
In [72]:
len(take_contexts)
Out[72]:
41

PC2 Loadings, top 25

In [73]:
pc2_loadings.head(25)
Out[73]:
2
T.ב.Cmpl→ שׁמע.v1.qal 12.016610
T.Objc→ שׁמע.v1.qal 7.507874
קול.n1.coord→ T 4.155558
T.coord→ קול.n1 3.829006
T.Subj→ שׁמע.v1.nif 2.416500
בן.n1.coord→ T 1.789764
T.ל.Cmpl→ שׁמע.v1.qal 1.685684
T.Objc→ דבר.v1.piel 1.668926
T.coord→ בן.n1 1.375580
בן.n1.appo→ T 1.370616
T.appo→ בן.n1 1.308421
בת.n1.coord→ T 1.220185
T.Objc→ שׁמע.v1.hif 1.126040
גדול.n1.atr→ T 1.084169
T.Objc→ רום.v1.hif 0.784451
T.coord→ אהרן.n1 0.736672
T.Objc→ ילד.v1.qal 0.726335
T.Objc→ נגד.v1.hif 0.634912
אור.n1.coord→ T 0.548433
T.Objc→ אבד.v1.hif 0.447747
T.Objc→ כתב.v1.qal 0.438189
T.Objc→ נשׂא.v1.qal 0.421637
T.אל.Cmpl→ דבר.v1.piel 0.416091
T.כ.Adju→ עשׂה.v1.qal 0.392442
T.Objc→ ילד.v1.hif 0.391060
In [74]:
def plot_PCA(pca_nouns, 
             zoom=tuple(), 
             noun_xy_dict=False, 
             save='', 
             annotate=True, 
             title='', 
             components=None):
    '''
    Plots a PCA noun space.
    Function is useful for presenting various zooms on the data.
    '''
    
    # default to the first two principal components;
    # computed here rather than in the signature so the
    # default tracks the pca_nouns argument, not a global
    x, y = components if components is not None else (pca_nouns[:,0], pca_nouns[:,1])
    
    # plot coordinates
    plt.figure(figsize=(12, 10))
    plt.scatter(x, y)

    if zoom:
        xmin, xmax, ymin, ymax = zoom
        plt.xlim(xmin, xmax)
        plt.ylim(ymin, ymax)
    
    if title:
        plt.title(title, size=18)
    plt.xlabel('PC1', size=18)
    plt.ylabel('PC2', size=18)
    plt.axhline(color='red', linestyle=':')
    plt.axvline(color='red', linestyle=':')
    
    # annotate points
    noun_xy = {} # noun -> (x, y); defined up front so the return below cannot hit a NameError
    if annotate:
        noun_lexs = [f'{reverse_hb(F.voc_lex_utf8.v(counts.target2lex[n]))}' for n in fishertransf.columns]
        for i, noun in enumerate(noun_lexs):
            noun_x, noun_y = x[i], y[i]
            noun_xy[fishertransf.columns[i]] = (noun_x, noun_y)
            if zoom: # avoid annotating outside the field of view (makes plot small)
                if any([noun_x < xmin, noun_x > xmax, noun_y < ymin, noun_y > ymax]):
                    continue # skip noun
            plt.annotate(noun, xy=(noun_x, noun_y), size=18)
    
    if save:
        plt.savefig(save, dpi=300, bbox_inches='tight')
    
    plt.show()
    
    if noun_xy_dict:
        return noun_xy

test_components = (pca_nouns[:,0], pca_nouns[:,1])
        

Whole PCA Space

In [75]:
pca_nouns_xy = plot_PCA(pca_nouns, noun_xy_dict=True, save='plots/PCA_whole.png', components=test_components)

We can already see some interesting tendencies in the data. קול and דבר are grouped in the same quadrant. In the upper right quadrant we see בן and בת. The lower left quadrant presents a particularly interesting match: יד "hand" and אלהים "God".
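
As a rough check on these groupings, we can tally how many nouns land in each quadrant. A minimal sketch, using the pca_nouns_xy dict returned by the plotting call above (nouns sitting exactly on an axis are lumped with the left/lower side):

# count nouns per PCA quadrant (sketch; pca_nouns_xy maps noun -> (x, y))
quadrants = collections.Counter(
    f"{'upper' if y > 0 else 'lower'} {'right' if x > 0 else 'left'}"
    for x, y in pca_nouns_xy.values())
quadrants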

We zoom in below for a closer look at these tendencies.

Main Cluster of PCA Space

In [76]:
plot_PCA(pca_nouns, zoom=((-3, 3, -2.5, 1)), save='plots/PCA_main.png')

~Animate Nouns

Note that nouns in the lower right quadrant tend to be people, while on the lower left there are primarily things.

The plot below shows person nouns.

In [77]:
plot_PCA(pca_nouns, zoom=((-0.1, 5, -2.5, 0.1)), save='plots/PCA_~animates.png')

Let's see what the nouns to the right of the y-axis have most in common. This could corroborate the intuition that the nouns on the right are personal.

First we isolate the nouns with an x-axis value > 0. These are shown below; they are clearly personal nouns.

In [78]:
nouns_xy = pd.DataFrame.from_dict(pca_nouns_xy, orient='index', columns=['x', 'y'])
possibly_animate = pd.DataFrame(nouns_xy[nouns_xy.x > 0])
possibly_animate['gloss'] = [F.gloss.v(counts.target2lex[targ]) for targ in possibly_animate.index]
possibly_animate = possibly_animate.reindex(['gloss', 'x', 'y'], axis=1)
In [79]:
x_animate = pd.DataFrame(possibly_animate.drop('y', axis=1).sort_values(ascending=False, by='x'))
round(x_animate,2).to_csv('spreadsheets/animate_x.csv')
print(f'total number of ~animate nouns {x_animate.shape[0]}')
x_animate
total number of ~animate nouns 33
Out[79]:
gloss x
בן.n1 son 221.889462
בת.n1 daughter 49.192676
אח.n1 brother 13.490358
פר.n1 young bull 10.365245
כהן.n1 priest 10.071992
אשׁה.n1 woman 8.293637
מלך.n1 king 5.483404
שׁבט.n1 rod 4.117483
בכר.n1 first-born 3.527541
אישׁ.n1 man 3.413651
נשׂיא.n1 chief 3.367279
אב.n1 father 3.191826
עם.n1 people 3.010478
עבד.n1 servant 2.670238
בגד.n1 garment 2.437223
עדה.n1 gathering 1.785993
אדון.n1 lord 1.731945
גר.n1 sojourner 1.325033
אם.n1 mother 1.195948
פרעה.n1 pharaoh 1.093217
נביא.n1 prophet 0.958122
שׂר.n1 chief 0.888966
זרע.n1 seed 0.794056
בקר.n1 cattle 0.746797
משׁפחה.n1 clan 0.600403
נער.n1 boy 0.338308
ילד.n1 boy 0.313588
צאן.n1 cattle 0.239695
רע.n2 fellow 0.231876
כבשׂ.n1 young ram 0.152592
גבור.n1 vigorous 0.105801
אחות.n1 sister 0.066346
אשׁ.n1 fire 0.019037

Why בגד?

Why has בגד "garment" made it into the set? We compare the top PC1 loadings against the top-scoring contexts for בגד.

In [80]:
def cf_PC_Noun(pc_loadings, noun_counts, noun, pc_name='PC1', ascending=False):
    '''
    Compares PC loadings and noun counts.
    Returns a DF containing the top common
    counts sorted on the PC.
    '''
    top_cts = noun_counts[noun][noun_counts[noun]>0] # isolate non-zero counts
    pc_word = pc_loadings.copy() # make copy of PC loadings for modifications
    pc_word.columns = [pc_name] # rename col to PCX
    pc_word[noun] = top_cts[[i for i in top_cts.index if i in pc_word.index]] # add new column for noun
    pc_word = pc_word[pc_word[noun] > 0].sort_values(by=pc_name, ascending=ascending) # drop zero counts, sort on the PC column
    return pc_word
    
bgd_pc1 = cf_PC_Noun(pc1_loadings, fishertransf, 'בגד.n1')

bgd_pc1[bgd_pc1.PC1 >= 0.3].round(2).to_csv('spreadsheets/BGD_pc1.csv')
    
bgd_pc1[bgd_pc1.PC1 >= 0.3]
Out[80]:
PC1 בגד.n1
בן.n1.coord→ T 7.902528 1.798861
T.coord→ בן.n1 6.291447 1.548492
T.coord→ אהרן.n1 3.197214 6.072582
T.Objc→ לקח.v1.qal 0.331304 4.241531

Show passages for coord relations for paper:

In [227]:
etcbc2sbl = {
'Genesis': 'Gen', 'Exodus': 'Exod', 'Leviticus': 'Lev', 'Numbers': 'Num',
'Deuteronomy': 'Deut', 'Joshua': 'Josh', 'Judges': 'Judg', '1_Samuel': '1 Sam', '2_Samuel': '2 Sam',
'1_Kings': '1 Kgs', '2_Kings': '2 Kgs', 'Isaiah': 'Isa', 'Jeremiah': 'Jer', 'Ezekiel': 'Ezek',
'Hosea': 'Hos', 'Joel': 'Joel', 'Amos': 'Amos', 'Obadiah': 'Obad', 'Jonah': 'Jonah', 'Micah': 'Mic',
'Nahum': 'Nah', 'Habakkuk': 'Hab', 'Zephaniah': 'Zeph', 'Haggai': 'Hag', 'Zechariah': 'Zech',
'Malachi': 'Mal', 'Psalms': 'Ps', 'Job': 'Job', 'Proverbs': 'Prov', 'Ruth': 'Ruth',
'Song_of_songs': 'Song', 'Ecclesiastes': 'Eccl', 'Lamentations': 'Lam', 'Esther': 'Esth',
'Daniel': 'Dan', 'Ezra': 'Ezra', 'Nehemiah': 'Neh', '1_Chronicles': '1 Chr', '2_Chronicles': '2 Chr'}

def formatPassages(resultslist):
    '''
    Formats biblical passages with SBL style
    for a list of results.
    '''
    book2ch2vs = collections.defaultdict(lambda: collections.defaultdict(set))
    
    for result in resultslist:
        book, chapter, verse = T.sectionFromNode(result[0])
        book = etcbc2sbl[book]                
        book2ch2vs[book][chapter].add(str(verse))
            
    # assemble into a readable passages list
    passages = []
    for book, chapters in book2ch2vs.items():
        ch_verses = []
        for chapter, verses in chapters.items():
            verses = ', '.join(f'{chapter}:{verse}' for verse in sorted(verses, key=int)) # numeric sort; verses are stored as strings
            ch_verses.append(verses)
        passage = f'{book} {", ".join(ch_verses)}'
        passages.append(passage)
            
    return '; '.join(passages)

def collectPassages(contextslist, targetnoun):
    '''
    Collects and returns neatly 
    formatted passages
    for use in the paper.
    '''
    # map the passages with dicts to avoid repeats
    results = sorted(res for context in contextslist for res in counts.target2basis2result[targetnoun][context])
    return formatPassages(results)
    

collectPassages(bgd_pc1.head(4).index[bgd_pc1.head(4).index.str.contains('coord')], 'בגד.n1')
Out[227]:
'Exod 29:21; Lev 8:2, 8:30'
In [83]:
# B.show(counts.target2basis2result['בגד.n1']['T.coord→ אהרן.n1'], condenseType='phrase', withNodes=True)

Now we find the context tags that score highest across the set. We pull the third quartile (75th percentile) of the context tags to see which ones are most shared across these nouns.
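
To illustrate what quantile(0.75, axis=1) computes here, consider a toy stand-in DataFrame with context tags as rows and nouns as columns (the names are hypothetical, not from the dataset):

# toy illustration of the row-wise 75th percentile (Q3)
toy = pd.DataFrame({'noun_a': [1.0, 0.0],
                    'noun_b': [2.0, 0.5],
                    'noun_c': [3.0, 0.0]},
                   index=['ctx1', 'ctx2'])
toy.quantile(0.75, axis=1) # ctx1 -> 2.50, ctx2 -> 0.25

A context tag earns a high Q3 only when a sizable share of the nouns attest it strongly, which is the sense of "most shared" intended here.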

In [84]:
animate_context = fishertransf[possibly_animate.index].quantile(0.75, axis=1).sort_values(ascending=False)
pd.DataFrame(animate_context.head(15))
Out[84]:
0.75
T.אל.Cmpl→ אמר.v1.qal 2.378328
T.Subj→ אמר.v1.qal 1.997787
T.Objc→ לקח.v1.qal 1.871772
T.Subj→ בוא.v1.qal 1.435907
T.Subj→ עשׂה.v1.qal 1.272516
T.ל.Cmpl→ נתן.v1.qal 1.119660
T.coord→ בן.n1 0.978105
T.coord→ אח.n1 0.920191
T.coord→ אתה 0.909587
T.ל.Cmpl→ אמר.v1.qal 0.901707
T.coord→ אשׁה.n1 0.841112
T.Subj→ מות.v1.qal 0.838907
T.Subj→ ישׁב.v1.qal 0.828674
עבד.n1.coord→ T 0.790248
T.ל.Cmpl→ נגד.v1.hif 0.757463

PCA Space: Focus on Bordering ~Animate Nouns

In [85]:
plot_PCA(pca_nouns, zoom=((-0.5, 0.5, -1.5, -1)), save='plots/PCA_~animate_border.png')
In [184]:
nouns_xy[(nouns_xy.x < 0) & (nouns_xy.x > -0.4)].sort_values(ascending=False, by='x')
Out[184]:
x y
שׁמשׁ.n1 -0.033623 -1.374995
מעל.n1 -0.051413 -1.274193
אדם.n1 -0.106503 -1.223870
חמור.n1 -0.109288 -1.332184
בהמה.n1 -0.129031 -1.329458
קהל.n1 -0.179291 -1.304472
רוח.n1 -0.207986 -1.017615
חיה.n1 -0.217819 -1.116523
מטה.n1 -0.223322 -1.232570
גבול.n1 -0.223492 -1.308986
משׁפחת.n1 -0.245602 -1.383507
עצם.n1 -0.270069 -1.410259
שׁמן.n1 -0.290189 -1.303102
אף.n1 -0.297584 -1.262932
מחנה.n1 -0.298186 -1.292041
לשׁון.n1 -0.310659 -1.007489
עולם.n1 -0.312791 -1.359431
עוד.n1 -0.316034 -1.328066
רכב.n1 -0.338771 -1.421778
מלכות.n1 -0.343997 -1.347886
בקר.n2 -0.349800 -1.392131
מאד.n1 -0.350481 -1.393674
שׁקל.n1 -0.354065 -1.337142
רב.n2 -0.357386 -1.275989
קדם.n1 -0.362267 -1.352268
נחל.n1 -0.374899 -1.321821
תמיד.n1 -0.381627 -1.371093
עור.n2 -0.382345 -1.362990

Verbs are the greatest distinguishing factor here, with אמר, בוא, נתן, לקח, and others playing a major role. מות "die" also contributes. These are precisely the contexts we would expect with animate nouns.
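
To make that claim concrete, we can filter the top-quantile contexts down to verbal frames. A minimal sketch, assuming animate_context from In [84] and relying on the .v1 lexeme marker visible in the tags above:

# keep only context tags built on a verb lexeme (sketch)
verb_contexts = animate_context[animate_context.index.str.contains('.v1.', regex=False)]
verb_contexts.head(10)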

~Inanimate Nouns

The nouns to the left of the y axis appear to be mostly inanimate.

In [86]:
plot_PCA(pca_nouns, zoom=((-2, 0, -2.5, 0)), title='PCA Space: ~Inanimate Noun Cluster')

Below we pull the tendencies for the nouns with PC1 < 0 and PC2 < 0 (the lower left quadrant). These nouns appear to be impersonal in nature.

In [87]:
possibly_inanimate = pd.DataFrame(nouns_xy[(nouns_xy.x < 0) & (nouns_xy.y < 0)])
possibly_inanimate['gloss'] = [F.gloss.v(counts.target2lex[targ]) for targ in possibly_inanimate.index]
possibly_inanimate = possibly_inanimate.reindex(['gloss', 'x', 'y'], axis=1)

x_inanimate = pd.DataFrame(possibly_inanimate.drop('y', axis=1).sort_values(by='x'))
round(x_inanimate,2).head(x_animate.shape[0]).to_csv('spreadsheets/inanimate_x.csv')
print(f'Number of total ~inanimates: {x_inanimate.shape[0]}')
print(f'Top ~inanimates: ')
x_inanimate.head(x_animate.shape[0])
Number of total ~inanimates: 156
Top ~inanimates: 
Out[87]:
gloss x
אלהים.n1 god(s) -55.870922
יד.n1 hand -39.091970
שׁם.n1 name -10.190253
ארץ.n1 earth -5.343297
שׁנה.n1 year -5.307550
מצוה.n1 commandment -3.786202
יום.n1 day -3.660531
מלאך.n1 messenger -3.541798
בית.n1 house -3.514084
דרך.n1 way -3.319412
חסד.n1 loyalty -2.961852
פנה.n1 face -2.683260
עין.n1 eye -2.508275
ברית.n1 covenant -2.376186
מזבח.n1 altar -2.251410
לחם.n1 bread -2.016252
חקה.n1 regulation -1.947438
עיר.n1 town -1.921311
חק.n1 portion -1.842204
צדקה.n1 justice -1.803451
רחב.n2 breadth -1.763256
מלאכה.n1 work -1.718575
זהב.n1 gold -1.710359
ארך.n2 length -1.661483
עלה.n1 burnt-offering -1.562637
מים.n1 water -1.516691
פה.n1 mouth -1.480237
כף.n1 palm -1.477370
לב.n1 heart -1.461574
שׁמים.n1 heavens -1.459879
כסף.n1 silver -1.440435
עץ.n1 tree -1.375057
דם.n1 blood -1.318799

Top Influencing ~Inanimate Contexts

In [88]:
pc1_loadings.tail(25).sort_values(by=1).round(2).to_csv('spreadsheets/PC1_loadings_negative.csv')

pc1_loadings.tail(25).sort_values(by=1)
Out[88]:
1
T.appo→ יהוה.n1 -4.295334
T.ב.Cmpl→ שׁמע.v1.qal -3.471581
T.Objc→ שׁמע.v1.qal -2.759099
T.ב.Cmpl→ נתן.v1.qal -2.245783
T.Objc→ עשׂה.v1.qal -1.259654
קול.n1.coord→ T -1.197493
T.coord→ קול.n1 -1.115596
T.מן.Cmpl→ נצל.v1.hif -0.996863
אחר.n2.atr→ T -0.962485
T.Objc→ דבר.v1.piel -0.838895
T.Objc→ קרא.v1.qal -0.801873
T.Objc→ שׁלח.v1.qal -0.763998
T.Subj→ שׁמע.v1.nif -0.716243
יד.n1.coord→ T -0.580984
גדול.n1.atr→ T -0.565519
T.coord→ יד.n1 -0.549651
T.Objc→ שׁמר.v1.qal -0.510847
T.ל.Cmpl→ שׁמע.v1.qal -0.482308
T.Objc→ עבד.v1.qal -0.471573
T.ב.Cmpl→ נתן.v1.nif -0.465951
T.Objc→ נשׂא.v1.qal -0.461200
T.Objc→ נתן.v1.qal -0.440112
T.ב.Cmpl→ לקח.v1.qal -0.439552
T.Objc→ סמך.v1.qal -0.434933
T.Objc→ בנה.v1.qal -0.423190

What about מלאך?

Why is מלאך rated in this list of mostly "inanimates"?

In [89]:
pc_mlak = cf_PC_Noun(pc1_loadings, fishertransf, 'מלאך.n1', ascending=True)

pc_mlak[pc_mlak.PC1 <= -0.2].round(2).to_csv('spreadsheets/MLAK_pc1.csv')

pc_mlak.head(10)
Out[89]:
PC1 מלאך.n1
אחר.n2.atr→ T -0.962485 1.065590
T.Objc→ שׁלח.v1.qal -0.763998 82.774958
T.appo→ אלהים.n1 -0.336296 0.959777
T.אחר.n1.Cmpl→ הלך.v1.qal -0.276358 1.004653
T.ב.Cmpl→ שׂים.v1.qal -0.157517 0.600816
T.Subj→ מצא.v1.qal -0.144336 0.860745
T.Subj→ ברך.v1.piel -0.126118 1.093149
T.Objc→ ענה.v1.qal -0.089964 3.816321
T.Subj→ דבר.v1.piel -0.070704 5.898547
T.Subj→ ישׁע.v1.hif -0.049809 1.263254

Note that several of the top four contexts are related to אלהים. We pull out a few examples with אלהים for use in the paper.

In [162]:
collectPassages(['T.אחר.n1.Cmpl→ הלך.v1.qal'], 'אלהים.n1')
Out[162]:
'Deut 6:14, 8:19, 11:28, 13:3, 28:14; Judg 2:12, 2:19; 1 Kgs 11:10; Jer 7:6, 7:9, 11:10, 13:10, 16:11, 25:6, 35:15'
In [164]:
collectPassages(['T.אחר.n1.Cmpl→ הלך.v1.qal'], 'מלאך.n1')
Out[164]:
'1 Sam 25:42'
In [167]:
collectPassages(['אחר.n2.atr→ T'], 'מלאך.n1')
Out[167]:
'1 Sam 19:21; Zech 2:7'
In [170]:
collectPassages(['T.appo→ אלהים.n1'], 'מלאך.n1')
Out[170]:
'Zech 12:8'

The next plot shows nouns to the left of the y-origin. Note especially the terms between y(-0.5) and y(0.0); these are more conceptual nouns. This same trajectory extends up into the far parts of the upper left quadrant through דבר and קול.

Here is a closer look at the larger cluster near the left side of the y-origin.

In [91]:
plot_PCA(pca_nouns, zoom=((-0.5, -0.1, -1.5, -1)))

Moving over one more notch:

In [92]:
plot_PCA(pca_nouns, zoom=((-1, -0.5, -2, -0.5)))