The purpose of this notebook is to examine the shape and contents of the datasets generated with the parameters stored in project_code/parameters.py. Specifically, I am looking for consistency in the data, and I am checking that the data to be analyzed indeed represents what I expect it to. Selection of linguistic data inevitably involves a number of assumptions about the shape of the data. When there is a mismatch between expectation and the selected data, the result is incorrect analysis. I obviously want to avoid that!
In this notebook, I load all of the experiments defined in the parameters module. For each dataset, I analyze the contents and distribution of features, and I examine the matched clauses behind any unexpected or surprising cases.
import numpy as np
import pandas as pd
import collections, os, sys, random, time, pickle, dill, copy, re
from IPython.display import clear_output
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from tf.fabric import Fabric
from tf.extra.bhsa import Bhsa
from project_code.experiments2 import Experiment
from project_code.semspace import SemSpace
bhsa_data_paths=['~/github/etcbc/bhsa/tf/c',
'~/github/verb_semantics/project_code/lingo/heads/tf/c',
'~/github/verb_semantics/project_code/sdbh']
TF = Fabric(bhsa_data_paths)
tf_api = TF.load('''
function lex vs language
pdp freq_lex gloss domain ls
mother rela typ sp st code txt instruction
heads prep_obj
prs prs_gn prs_nu prs_ps
sem_domain sem_domain_code
''', silent=True)
tf_api.makeAvailableIn(globals())
B = Bhsa(api=tf_api, name='', version='c')
This is Text-Fabric 5.4.3
Api reference : https://dans-labs.github.io/text-fabric/Api/General/
Tutorial : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb
Example data : https://github.com/Dans-labs/text-fabric-data
118 features found and 0 ignored
Documentation: BHSA Feature docs BHSA API Text-Fabric API 5.4.3 Search Reference
There are two kinds of experiment parameters: inventories and frames. Inventories count cooccurring features with verbs individually. Frames count all features within the verb's clause as a single unit or frame. I will load both types into a single dictionary.
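To make the distinction concrete, here is a minimal sketch with hypothetical basis tags (the real counting happens inside the Experiment class):

# hypothetical clause: one verb cooccurring with two bases
verb = 'NTN[.qal'
clause_bases = ['Objc.animate', 'Cmpl.L_animate']

inventory_counts = collections.Counter()
frame_counts = collections.Counter()

# inventory: each cooccurring feature is its own observation
for basis in clause_bases:
    inventory_counts[(verb, basis)] += 1

# frame: all features of the clause fuse into a single observation
frame_counts[(verb, '|'.join(sorted(clause_bases)))] += 1

print(inventory_counts)  # two observations for NTN[.qal
print(frame_counts)      # one observation: 'Cmpl.L_animate|Objc.animate'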
success = []
run_new, cached_data = True, True

if run_new:
    good_runs = []
    from project_code.parameters import *  # import all definitions and functions from the parameters module
    experiments = {}
    for label, exp_type in params.items():
        print(f'\nprocessing {label} experiments...')
        for name, experiment_params in exp_type.items():
            if name in success:
                continue
            print(f'\tbuilding {name}...')
            is_frame = label != 'inventory'
            min_obs = 10 if name != 'vd_par_lex' else 1
            experiments[name] = Experiment(experiment_params, tf=tf_api, frame=is_frame, min_observation=min_obs)
            good_runs.append(name)
            print('\t\tfinished')
    print('finished processing experiments...')
    print(f'\t{len(experiments)} experiments loaded.')
    print('dumping experiment into cache...')
    with open('/Users/cody/Documents/experiments.dill', 'wb') as outfile:
        dill.dump(experiments, outfile)
    print('\tDone!')
else:
    print('Loading cached experiments...')
    with open('/Users/cody/Documents/experiments.dill', 'rb') as infile:
        experiments = dill.load(infile)
    print(f'{len(experiments)} experiments loaded.')
All parameters ready!
processing inventory experiments...
	building vi_subj_lex... finished
	building vi_subj_domain... finished
	building vi_subj_animacy... finished
	building vi_objc_pa... finished
	building vi_objc_lex... finished
	building vi_objc_domain... finished
	building vi_objc_animacy... finished
	building vi_cmpl_pa... finished
	building vi_cmpl_lex... finished
	building vi_cmpl_domain... finished
	building vi_cmpl_animacy... finished
	building vi_adj+_pa... finished
	building vi_adj+_lex... finished
	building vi_adj+_domain... finished
	building vi_adj+_animacy... finished
	building vi_coad_pa... finished
	building vi_coad_lex... finished
	building vi_coad_domain... finished
	building vi_coad_animacy... finished
	building vi_allarg_pa... finished
	building vi_allarg_lex... finished
	building vi_allarg_domain... finished
	building vi_allarg_animacy... finished
	building vd_par_lex... finished
	building vd_con_window... finished
	building vd_con_clause... finished
	building vd_con_chain... finished
	building vd_domain_simple... finished
	building vd_domain_embed... finished
	building vg_tense... finished
processing frame experiments...
	building vf_argAll_pa... finished
	building vf_argAll_lex... finished
	building vf_argAll_domain... finished
	building vf_argAll_animacy... finished
	building vf_obj_pa... finished
	building vf_obj_lex... finished
	building vf_obj_domain... finished
	building vf_obj_animacy... finished
	building vf_cmpl_pa... finished
	building vf_cmpl_lex... finished
	building vf_cmpl_domain... finished
	building vf_cmpl_animacy... finished
	building vf_adju_pa... finished
	building vf_adju_lex... finished
	building vf_adju_domain... finished
	building vf_adju_animacy... finished
	building vf_coad_pa... finished
	building vf_coad_lex... finished
	building vf_coad_domain... finished
	building vf_coad_animacy... finished
finished processing experiments...
	50 experiments loaded.
dumping experiment into cache...
	Done!
The experiments are distinguished by their basis units or tests, i.e. domain, lexeme, presence/absence (pa). Below they are sorted by the shape of their data:
for shape, exp in sorted((experiments[exp].data.shape, exp) for exp in experiments):
    print(f'{exp}:\t{shape}')
vi_subj_animacy:	(2, 180)
vi_objc_pa:	(2, 714)
vi_adj+_pa:	(2, 734)
vi_cmpl_pa:	(2, 734)
vi_coad_pa:	(2, 734)
vf_obj_pa:	(3, 694)
vd_domain_simple:	(3, 704)
vf_cmpl_pa:	(4, 725)
vi_allarg_pa:	(4, 786)
vi_objc_animacy:	(5, 173)
vf_adju_pa:	(7, 733)
vf_obj_animacy:	(8, 127)
vf_coad_pa:	(8, 734)
vg_tense:	(8, 734)
vi_cmpl_animacy:	(39, 174)
vf_argAll_pa:	(43, 703)
vi_adj+_animacy:	(46, 108)
vi_coad_animacy:	(51, 241)
vd_domain_embed:	(73, 646)
vf_cmpl_animacy:	(88, 158)
vi_allarg_animacy:	(92, 370)
vf_adju_animacy:	(96, 78)
vf_coad_animacy:	(200, 192)
vi_subj_domain:	(247, 231)
vd_par_lex:	(305, 365)
vf_argAll_animacy:	(378, 207)
vi_objc_domain:	(448, 245)
vf_obj_domain:	(584, 213)
vi_cmpl_domain:	(1033, 223)
vf_cmpl_domain:	(1128, 207)
vi_adj+_domain:	(1219, 217)
vf_adju_domain:	(1575, 180)
vi_coad_domain:	(1735, 386)
vi_subj_lex:	(1959, 290)
vi_objc_lex:	(2251, 305)
vi_allarg_domain:	(2902, 527)
vf_obj_lex:	(3012, 274)
vf_coad_domain:	(3055, 301)
vi_adj+_lex:	(3478, 295)
vf_cmpl_lex:	(4074, 267)
vi_cmpl_lex:	(4079, 281)
vf_adju_lex:	(4184, 263)
vd_con_window:	(4463, 790)
vf_argAll_domain:	(5122, 339)
vd_con_clause:	(5477, 900)
vi_coad_lex:	(6765, 475)
vd_con_chain:	(8308, 1218)
vf_coad_lex:	(9337, 418)
vi_allarg_lex:	(11180, 652)
vf_argAll_lex:	(14761, 482)
#len(experiments['vf_argAll_pa'].basis2result['Adju|Adju|Adju|Adju|Cmpl'])
#B.prettySetup(features={'rela'})
#B.show(experiments['vf_argAll_pa'].basis2result['Adju|Adju|Adju|Adju|Cmpl'][:10], withNodes=True)
Complement domain elements are more sophisticated, since these combine domain tags with preposition lexemes:
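As a sketch of how such a tag might be assembled (this is not the Experiment implementation; heads and prep_obj are the custom edge features loaded above):

def cmpl_tag(phrase, domain_of):
    '''Sketch: build a tag like "L_Associate" from a Cmpl phrase.'''
    head = next((h for h in E.heads.f(phrase)), None)   # phrase head via the heads edge
    if head is None or F.pdp.v(head) != 'prep':
        return None
    obj = next((o for o in E.prep_obj.f(head)), None)   # the preposition's object
    # combine the preposition's lexeme with the object's domain tag
    return f'{F.lex.v(head)}_{domain_of(obj)}' if obj else None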
Next, I check to make sure that there are no missing elements within the frame tests.
frame_exps = {'vf_argAll_pa', 'vf_argAll_lex', 'vf_argAll_domain', 'vf_argAll_animacy',
              'vf_obj_pa', 'vf_obj_lex', 'vf_obj_domain', 'vf_obj_animacy',
              'vf_cmpl_pa', 'vf_cmpl_lex', 'vf_cmpl_domain', 'vf_cmpl_animacy',
              'vf_adju_pa', 'vf_adju_lex', 'vf_adju_domain', 'vf_adju_animacy'}

# expected phrase functions / clause relations per experiment type
exp2functs = {
    'cmpl': {'Cmpl'},
    'adj': {'Adju', 'PrAd', 'Time', 'Loca'},
    'obj': {'Objc', 'PreO', 'PtcO'},
    'arg': {'Cmpl', 'Adju', 'PrAd', 'Time', 'Loca', 'Objc', 'PreO', 'PtcO'},
}

problems = collections.defaultdict(list)

for exp in frame_exps:
    # get expected relations based on the experiment name
    exp_key = next(key for key in exp2functs if re.search(key, exp))
    expected_functs = exp2functs[exp_key]
    samples = [(basis, match) for basis in experiments[exp].basis2result
                              for match in experiments[exp].basis2result[basis]]
    for basis, sample in samples:
        clauses = sorted(n for n in sample if F.otype.v(n) == 'clause')
        # the target clause is the clause whose mother lies outside the sample
        target_clause = next(cl for cl in clauses
                             if next((m for m in E.mother.f(cl)), 0) not in clauses)
        # check that all matching phrase functions are accounted for in the frame result
        for phrase in L.d(target_clause, 'phrase'):
            if F.function.v(phrase) in expected_functs and phrase not in sample and target_clause not in {514162}:
                problems[exp].append((basis, sample, f'missing {phrase} a {F.otype.v(phrase)} with {F.function.v(phrase)}'))
        # check all daughter clause relations
        for d_cl in E.mother.t(target_clause):
            if F.rela.v(d_cl) in expected_functs and d_cl not in sample:
                problems[exp].append((basis, sample, f'missing {d_cl} a {F.otype.v(d_cl)} with {F.rela.v(d_cl)}'))
len(problems)
0
I have made a number of delimitations on how the data is selected, the full scope of which can be seen in parameters.py. One important example is the object frames, which exclude target clauses containing relative particles. A relative particle in the ETCBC serves primarily as a connector to the mother clause; the database does not specify what role the particle plays within its immediately enclosing clause. Often these particles serve as objects of the verb, but because the ETCBC does not disambiguate their clause-internal role, these cases must be excluded.
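The exclusion itself is expressed with a /without/ quantifier in the search templates. A minimal sketch of the idea (the real templates in parameters.py carry many more constraints):

no_rela = B.search('''
clause
/without/
phrase function=Rela
/-/
''')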
How do these kinds of exclusions affect verb lexeme distributions? Are there verbs whose distributions become significantly under-represented due to the selection restrictions? This might be the case, for example, if a verb lexeme is strongly associated with an excluded construction.
In this test, I iterate through the presence/absence experiments, comparing each verb's occurrence ratio in the raw template search against its ratio in the experiment data.
# standard target clause requirements
pred_target = '''

c1:clause
p1:phrase

/with/
clause typ#Ptcp
    p:phrase function={pred_funct}
        -heads> word pdp=verb language=Hebrew
p = p1
/or/
clause typ=Ptcp
    p:phrase function={ptcp_funct}
        -heads> word pdp=verb language=Hebrew
p = p1
/-/

target:word pdp=verb
{basis}

lex freq_lex>9
    lexword:word

lexword = target
'''
all_preds = 'Pred|PreO|PreS|PtcO' # all predicate phrase functions
all_ptcp = 'PreC|PtcO'
baseline = B.search(pred_target.format(basis='', pred_funct=all_preds, ptcp_funct=all_ptcp))
base_lexs = collections.Counter(f'{F.lex.v(r[2])}.{F.vs.v(r[2])}' for r in baseline)
base_lexs = pd.Series(base_lexs)
base_lex_ratio = base_lexs / base_lexs.sum()
print(f'number of base lexemes (+stems): {len(base_lexs)}')
base_lexs.sort_values(ascending=False).head()
65323 results
number of base lexemes (+stems): 1749

>MR[.qal    5273
HJH[.qal    3533
<FH[.qal    2446
BW>[.qal    1969
NTN[.qal    1910
dtype: int64
for exp in experiments:
    # skip non frame / presence-absence experiments
    if not re.search('vf.*_pa', exp):
        continue
    # get lex sums/ratios for the experiment
    lex_sums = experiments[exp].data.sum()
    lex_ratio = lex_sums / lex_sums.sum()
    # make comparisons between base and experiment
    base_dif = lex_ratio.subtract(base_lex_ratio)
    #base_lex_ratio.combine(lex_ratio, lambda s1, s2: s1-s2) # absolute differences
    # print the biggest differences
    print(f'{exp} differences from base:')
    print('\tPLUS:')
    print(base_dif.sort_values(ascending=False).head(20))
    print('\tMINUS:')
    print(base_dif.sort_values().head(20))
    print('\n', '-'*30, '\n')
vf_argAll_pa differences from base:
	PLUS: >MR[.qal 0.008794  HJH[.qal 0.004967  LQX[.qal 0.001538  R>H[.qal 0.001506  CM<[.qal 0.001384  BW>[.qal 0.001372  HLK[.qal 0.001281  CWB[.qal 0.001245  QR>[.qal 0.000910  >KL[.qal 0.000853  QWM[.qal 0.000852  JD<[.qal 0.000718  FJM[.qal 0.000710  NF>[.qal 0.000678  MWT[.qal 0.000656  BW>[.hif 0.000611  CWB[.hif 0.000572  <NH[.qal 0.000566  JR>[.qal 0.000522  <LH[.qal 0.000513
	MINUS: YWH[.piel -0.002399  <FH[.qal -0.001568  DBR[.piel -0.000813  BXR[.qal -0.000704  MY>[.nif -0.000685  CB<[.nif -0.000663  JTR[.nif -0.000478  C>R[.nif -0.000428  NG<[.qal -0.000401  XV>[.qal -0.000373  <FH[.nif -0.000327  KTB[.qal -0.000309  GWR[.qal -0.000294  DBR[.qal -0.000284  XV>[.hif -0.000282  QR>[.nif -0.000248  JCB[.qal -0.000237  XPY[.qal -0.000182  GLH[.hif -0.000179  BNH[.qal -0.000164
------------------------------
vf_obj_pa differences from base:
	PLUS: >MR[.qal 0.009396  HJH[.qal 0.006854  BW>[.qal 0.002395  HLK[.qal 0.001931  R>H[.qal 0.001854  CM<[.qal 0.001637  CWB[.qal 0.001608  LQX[.qal 0.001104  QWM[.qal 0.001089  >KL[.qal 0.000997  MWT[.qal 0.000985  QR>[.qal 0.000955  <LH[.qal 0.000810  JY>[.qal 0.000779  JD<[.qal 0.000618  NPL[.qal 0.000596  NGD[.hif 0.000555  JR>[.qal 0.000542  <NH[.qal 0.000528  <BR[.qal 0.000508
	MINUS: YWH[.piel -0.003581  <FH[.qal -0.002199  NTN[.qal -0.001361  CLX[.qal -0.001261  BXR[.qal -0.000710  CB<[.nif -0.000612  MY>[.nif -0.000607  DBR[.piel -0.000563  JTR[.nif -0.000446  C>R[.nif -0.000405  BNH[.qal -0.000375  NG<[.qal -0.000346  KTB[.qal -0.000318  KRT[.qal -0.000311  <FH[.nif -0.000309  XV>[.qal -0.000304  XV>[.hif -0.000294  DBR[.qal -0.000275  GWR[.qal -0.000269  JY>[.hif -0.000219
------------------------------
vf_cmpl_pa differences from base:
	PLUS: >MR[.qal 0.005542  HJH[.qal 0.002983  <FH[.qal 0.001991  R>H[.qal 0.001532  HLK[.qal 0.001054  JD<[.qal 0.001044  >KL[.qal 0.000965  LQX[.qal 0.000903  BW>[.qal 0.000897  CM<[.qal 0.000881  MWT[.qal 0.000697  CWB[.qal 0.000677  JCB[.qal 0.000646  NKH[.hif 0.000640  QWM[.qal 0.000551  NTN[.qal 0.000510  NF>[.qal 0.000506  <LH[.qal 0.000431  JY>[.qal 0.000401  MLK[.qal 0.000395
	MINUS: DBR[.piel -0.002099  CLX[.qal -0.001041  NGD[.hif -0.000984  CB<[.nif -0.000433  CMR[.nif -0.000231  BXR[.qal -0.000186  NGD[.hof -0.000136  <WD[.hif -0.000132  QRB[.qal -0.000081  KRT[.qal -0.000079  FWF[.qal -0.000048  XNN[.hit -0.000042  BDL[.hif -0.000042  QHL[.hif -0.000038  C>L[.qal -0.000037  J<Y[.nif -0.000037  NW<[.hif -0.000030  FKR[.qal -0.000029  GLH[.nif -0.000025  SWT[.hif -0.000024
------------------------------
vf_adju_pa differences from base:
	PLUS: >MR[.qal 0.004488  HJH[.qal 0.003013  BW>[.qal 0.001672  <FH[.qal 0.001612  NTN[.qal 0.001589  HLK[.qal 0.001177  DBR[.piel 0.000928  R>H[.qal 0.000815  CM<[.qal 0.000812  LQX[.qal 0.000801  JCB[.qal 0.000649  JY>[.qal 0.000627  >KL[.qal 0.000593  CWB[.qal 0.000558  QR>[.qal 0.000557  <LH[.qal 0.000512  FJM[.qal 0.000493  CLX[.qal 0.000474  BW>[.hif 0.000468  NF>[.qal 0.000467
	MINUS: JD<[.qal -1.098729e-04  JKL[.qal -8.046618e-05  XLL[.hif -3.196558e-05  <WD[.hif -1.763629e-05  CBT[.qal -9.185770e-06  CB<[.hif -7.471264e-06  J>L[.hif 1.220051e-07  C>R[.hif 2.440102e-07  RWM[.piel 5.265521e-06  GWR==[.qal 8.572527e-06  BHL[.piel 8.572527e-06  SQL[.qal 8.572527e-06  SKK[.qal 8.572527e-06  KLM[.hif 8.572527e-06  CQD[.qal 8.572527e-06  RWX[.hif 8.572527e-06  NQM[.qal 8.572527e-06  BDL[.nif 8.572527e-06  MWL[.qal 8.572527e-06  XRD[.hif 8.572527e-06
------------------------------
vf_coad_pa differences from base:
	PLUS: >MR[.qal 0.004481  HJH[.qal 0.003116  BW>[.qal 0.001713  <FH[.qal 0.001569  NTN[.qal 0.001564  HLK[.qal 0.001217  DBR[.piel 0.000944  R>H[.qal 0.000831  CM<[.qal 0.000794  LQX[.qal 0.000779  JCB[.qal 0.000671  JY>[.qal 0.000649  >KL[.qal 0.000630  CWB[.qal 0.000577  <LH[.qal 0.000529  CLX[.qal 0.000490  BW>[.hif 0.000468  NF>[.qal 0.000467  QR>[.qal 0.000463  FJM[.qal 0.000461
	MINUS: NGD[.hif -6.170166e-04  JD<[.qal -1.694724e-04  CB<[.nif -5.958201e-05  CB<[.hif -3.906980e-05  <WD[.hif -1.666865e-05  CBT[.qal -8.452710e-06  DMH[.piel -3.783067e-06  KXD[.piel -2.896492e-06  J>L[.hif 6.498080e-07  C>R[.hif 1.299616e-06  RWM[.piel 5.969259e-06  SPD[.qal 7.742409e-06  SKK[.qal 8.865751e-06  N>P[.piel 8.865751e-06  RWX[.hif 8.865751e-06  MWV[.qal 8.865751e-06  MWL[.qal 8.865751e-06  XRD[.hif 8.865751e-06  GWR==[.qal 8.865751e-06  XBQ[.piel 8.865751e-06
------------------------------
One effect that can be seen is that stems other than qal tend to receive a slightly smaller representation in the experiment samples. In the object presence/absence experiment, one case of potential interest is the selection's negative effect on niphal representation. Let's see why that could be the case...
nifal_find = pred_target.format(basis='''
w3:word lex=MY>[ vs=nif
w3 = target
''', pred_funct=all_preds, ptcp_funct=all_ptcp)
nifal_find = B.search(nifal_find)
B.show(nifal_find[5:15])
The basic survey above offers a clue. The exclusion of relative particles may negatively affect niphal representation due to constructions such as אשר יִמָצֵא. Let's see whether the niphal accounts for a higher proportion of these constructions than the qal.
without_rela = pred_target.format(basis='''
c2:clause
/without/
phrase function=Rela
/-/
c1 = c2
''', pred_funct=all_preds, ptcp_funct=all_ptcp)
with_rela = pred_target.format(basis='''
phrase function=Rela
''', pred_funct=all_preds, ptcp_funct=all_ptcp)
def rela_vs_noRela(relaPat, noRelaPat):
    vs_count = collections.defaultdict(lambda: collections.Counter())
    for r in B.search(noRelaPat):
        vs_count['øRela'][F.vs.v(r[2])] += 1
    for r in B.search(relaPat):
        vs_count['Rela'][F.vs.v(r[2])] += 1
    rela_count = pd.Series(vs_count['Rela'])
    no_rela_count = pd.Series(vs_count['øRela'])
    rela_prop = rela_count / rela_count.sum()
    no_rela_prop = no_rela_count / no_rela_count.sum()
    print('\nrela ratios:')
    print(rela_prop.sort_values(ascending=False))
    print('\nø rela ratios:')
    print(no_rela_prop.sort_values(ascending=False))
print('Relative clause verb stem proportional representations:\n')
rela_vs_noRela(with_rela, without_rela)
Relative clause verb stem proportional representations:

60636 results
4687 results

rela ratios:
qal     0.652230
piel    0.121613
hif     0.109878
nif     0.087903
hit     0.011521
hof     0.008748
pual    0.007041
hsht    0.000853
pasq    0.000213
dtype: float64

ø rela ratios:
qal     0.700178
hif     0.134607
piel    0.089204
nif     0.051257
hit     0.012072
hof     0.004898
pual    0.004717
hsht    0.002721
hotp    0.000132
tif     0.000066
nit     0.000049
pasq    0.000033
etpa    0.000033
poal    0.000016
htpo    0.000016
dtype: float64
We see a marginal increase in the proportions of passive-type verb stems in relative clauses: nif +4%, hof +0.4%, pual +0.3% (the piel also receives a sizable boost, which is in itself interesting). These are minor increases, but so are the differences between the base and experiment distributions. It is at least valid to say that an exclusion of relative particles will slightly decrease the representation of the nifal and increase the qal (which accounts for 70% in øRela clauses versus 65% in Rela clauses). This simple search also does not take into account lexical collocation preferences for certain constructions. How does the root מצא in the nifal compare in its use of the relative?
nifal_rela = pred_target.format(basis='''
w3:word lex=MY>[
p2:phrase function=Rela
w3 = target
p2 < p1
''', pred_funct=all_preds, ptcp_funct=all_ptcp)
nifal_no_rela = pred_target.format(basis='''
c2:clause
/without/
phrase function=Rela
/-/
w3:word lex=MY>[
c2 = c1
w3 = target
''', pred_funct=all_preds, ptcp_funct=all_ptcp)
print('Clauses with מצא; verb stem representations with and without relative particles:\n')
rela_vs_noRela(nifal_rela, nifal_no_rela)
Clauses with מצא; verb stem representations with and without relative particles:

364 results
79 results

rela ratios:
nif    0.683544
qal    0.316456
dtype: float64

ø rela ratios:
qal    0.760989
nif    0.219780
hif    0.019231
dtype: float64
Here we get the confirmation. The nifal of מצא has a much higher representation alongside the relative particle: +46 percentage points(!). This explains the decrease in this verb's overall representation in the experiment sample.
This loop makes sure that each result is only counted once per experiment.
problems = collections.defaultdict(lambda: collections.Counter())
for exp in experiments:
    # skip contextual searches
    if exp in ['vd_con_window', 'vd_con_clause', 'vd_con_chain']:
        continue
    samples = [(basis, match) for basis in experiments[exp].basis2result
                              for match in experiments[exp].basis2result[basis]]
    covered = set()
    for basis, sample in samples:
        if tuple(sample) in covered:
            problems[exp][tuple(sample)] += 1
        else:
            covered.add(tuple(sample))
len(problems)
0
Evaluating datasets for accuracy is extremely important, especially since I am using two custom datasets in my research: heads and semantic domains. The first dataset is my own design, and I am aware of some edge cases that are not always selected properly; I have attempted to exclude these edges in my experiment parameters. The semantic domains dataset has been converted from UBS's Semantic Dictionary of Biblical Hebrew via an XML representation (courtesy of Reinier de Blois). There are two versions: "domain" and "domain2." "domain" is quite experimental: in it I attempt to map the SDBH categories to three custom categories: animate, inanimate, and events (an SDBH category). This mapping does not always work well, due to contextual features or a lack of available data (the SDBH is not complete). It is also important to check data from this source carefully to ensure that the converted data accurately reflects its source.
How to strategically evaluate the datasets?
For one thing, the total numbers of observations in the datasets are very high:
for total, exp in sorted((experiments[exp].data.sum().sum(), exp) for exp in experiments):
    print(f'{exp}:\t{total}')
vd_par_lex:	723.0
vf_adju_animacy:	2491.0
vi_adj+_animacy:	3712.0
vf_obj_animacy:	6576.0
vf_cmpl_animacy:	7992.0
vf_adju_domain:	8089.0
vi_objc_animacy:	8771.0
vf_coad_animacy:	9415.0
vi_cmpl_animacy:	9894.0
vi_adj+_domain:	10384.0
vf_argAll_animacy:	10532.0
vf_cmpl_domain:	10695.0
vi_subj_animacy:	10941.0
vf_obj_domain:	11961.0
vf_adju_lex:	12543.0
vi_cmpl_domain:	12962.0
vi_subj_domain:	13761.0
vi_coad_animacy:	14528.0
vi_objc_domain:	14883.0
vi_adj+_lex:	14947.0
vf_cmpl_lex:	15587.0
vf_obj_lex:	16654.0
vi_subj_lex:	16719.0
vf_coad_domain:	16934.0
vi_cmpl_lex:	18108.0
vi_objc_lex:	19759.0
vf_argAll_domain:	20418.0
vi_allarg_animacy:	24591.0
vi_coad_domain:	25137.0
vf_coad_lex:	26376.0
vf_argAll_lex:	32077.0
vi_coad_lex:	35176.0
vi_allarg_domain:	41752.0
vd_domain_embed:	49864.0
vf_obj_pa:	54983.0
vf_argAll_pa:	56815.0
vi_allarg_lex:	57126.0
vi_objc_pa:	57928.0
vf_cmpl_pa:	59667.0
vd_domain_simple:	59732.0
vf_coad_pa:	61748.0
vf_adju_pa:	61860.0
vg_tense:	62037.0
vi_adj+_pa:	62037.0
vi_cmpl_pa:	62037.0
vi_coad_pa:	62037.0
vd_con_window:	70942.0
vi_allarg_pa:	71471.0
vd_con_clause:	114732.0
vd_con_chain:	927079.0
One possibility is to make a small script that guides me through a manual review of N random samples. The sample size could realistically only be 50-100 per relevant experiment. Though this is a small number compared to the total number of observations, it would at least help me find any glaring mistakes that are reproduced frequently. Some experiments, such as the discourse spaces (vd_con_window, vd_con_clause), probably do not need rigorous review, since these spaces are more or less straightforward.
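A back-of-the-envelope justification for that sample size: the probability that a defect affecting a fraction p of a dataset shows up at least once in n random samples is 1 - (1 - p)**n. The rates below are illustrative, not measured:

n = 50
for p in (0.01, 0.05, 0.10):
    print(f'error rate {p:.0%}: caught with probability {1 - (1 - p)**n:.2f}')
# 1%: 0.39, 5%: 0.92, 10%: 0.99

So 50 samples per experiment will very likely surface a mistake that is reproduced frequently, while rare one-off errors will mostly slip through.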
I will build a script below. It does the following:
- draws a fixed number of random samples, without repeats, from each experiment's basis2result mapping
- stores the samples in a review file, keyed by experiment name and basis
- displays each sample with B.prettyTuple and prompts for a judgment: '1' for good, '2' to attach a note
- saves progress to the review file (and a backup) after every judgment, so a session can be quit with 'q' and resumed later
experiments.keys()
dict_keys(['vi_subj_lex', 'vi_subj_domain', 'vi_subj_animacy', 'vi_objc_pa', 'vi_objc_lex', 'vi_objc_domain', 'vi_objc_animacy', 'vi_cmpl_pa', 'vi_cmpl_lex', 'vi_cmpl_domain', 'vi_cmpl_animacy', 'vi_adj+_pa', 'vi_adj+_lex', 'vi_adj+_domain', 'vi_adj+_animacy', 'vi_coad_pa', 'vi_coad_lex', 'vi_coad_domain', 'vi_coad_animacy', 'vi_allarg_pa', 'vi_allarg_lex', 'vi_allarg_domain', 'vi_allarg_animacy', 'vd_par_lex', 'vd_con_window', 'vd_con_clause', 'vd_con_chain', 'vd_domain_simple', 'vd_domain_embed', 'vg_tense', 'vf_argAll_pa', 'vf_argAll_lex', 'vf_argAll_domain', 'vf_argAll_animacy', 'vf_obj_pa', 'vf_obj_lex', 'vf_obj_domain', 'vf_obj_animacy', 'vf_cmpl_pa', 'vf_cmpl_lex', 'vf_cmpl_domain', 'vf_cmpl_animacy', 'vf_adju_pa', 'vf_adju_lex', 'vf_adju_domain', 'vf_adju_animacy', 'vf_coad_pa', 'vf_coad_lex', 'vf_coad_domain', 'vf_coad_animacy'])
# review_data['to_review'][experiment_name][basis] = list(Nsamples)
# review_data['reviewed'][experiment_name][basis][sample] = note
# build randomized samples
review_file = '/Users/cody/github/verb_semantics/project_code/datareview/viArgAllReview.dill'
review_data = {'to_review': collections.defaultdict(lambda: collections.defaultdict(list)),
               'reviewed': collections.defaultdict(lambda: collections.defaultdict(lambda: collections.defaultdict(dict)))}
to_review = {'vi_allarg_pa', 'vi_allarg_lex', 'vi_allarg_domain'}

for exp_name, experiment in experiments.items():
    if exp_name not in {'vi_allarg_pa', 'vi_allarg_lex', 'vi_allarg_domain', 'vi_allarg_animacy'}:
        continue
    sample_size = 50
    exp_bases = list(experiment.basis2result.keys())
    picked_samples = list()
    # assemble randomly picked samples
    while len(picked_samples) < sample_size:
        basis = random.choice(exp_bases)
        result = random.choice(experiment.basis2result[basis])
        sample = (basis, result)
        # ensure no repeat selections
        while sample in picked_samples:
            basis = random.choice(exp_bases)
            result = random.choice(experiment.basis2result[basis])
            sample = (basis, result)
        picked_samples.append(sample)  # pick it
    for basis, result in sorted(picked_samples):
        review_data['to_review'][exp_name][basis].append(result)

with open(review_file, 'wb') as outfile:
    dill.dump(review_data, outfile)

print('review file exported...')
review file exported...
# review_data['reviewed'][experiment_name][basis][sample] = note

def review(pickle_file, backup=''):
    '''
    A simple reviewer function that reviews
    random samples of my experiment data.
    '''
    with open(pickle_file, 'rb') as infile:
        review_data = dill.load(infile)
    to_review = review_data['to_review']
    reviewed = review_data['reviewed']
    previous = tuple()
    new_to_review = copy.deepcopy(to_review)
    completed = []  # strings of completed reviews

    for i, experiment_name in enumerate(to_review):
        print(f'reviewing {experiment_name}')
        time.sleep(.5)
        clear_output()
        start_i = len([sample for basis in reviewed[experiment_name]
                          for sample in reviewed[experiment_name][basis]])
        basis_i = 1 + start_i
        len_bases = len([sample for basis in to_review[experiment_name]
                            for sample in to_review[experiment_name][basis]]) + start_i

        for basis_name, samples in to_review[experiment_name].items():
            for sample in samples:
                name = f'{experiment_name}/{basis_name}/{basis_i}of{len_bases}'
                print('1 for good; 2 for notes; 3 for get last; q for quit\n')
                B.prettyTuple(sample, withNodes=True, seqNumber=name)
                print(f'Exp. {i+1}/{len(to_review)}\t{name}\n')

                while True:
                    choice = input()
                    if choice == '1':
                        reviewed[experiment_name][basis_name][name] = {'review': 'good', 'result': sample}
                        new_to_review[experiment_name][basis_name].remove(sample)
                        break
                    elif choice == '2':
                        note = input('input note:')
                        reviewed[experiment_name][basis_name][name] = {'review': note, 'result': sample}
                        new_to_review[experiment_name][basis_name].remove(sample)
                        break
                    elif choice == '3':
                        print('\n', previous)
                    elif choice == 'q':
                        print('quitting...')
                        print(f'\ncompleted:{completed}')
                        with open(pickle_file, 'wb') as outfile:
                            save_data = {'to_review': new_to_review, 'reviewed': reviewed}
                            dill.dump(save_data, outfile)
                        with open(f'/Users/cody/Documents/{backup}.dill', 'wb') as outfile:
                            dill.dump(save_data, outfile)
                        return 'data successfully saved...'
                    else:
                        print('input invalid...')

                # save constantly
                with open(pickle_file, 'wb') as outfile:
                    save_data = {'to_review': new_to_review, 'reviewed': reviewed}
                    dill.dump(save_data, outfile)
                with open(f'/Users/cody/Documents/{backup}.dill', 'wb') as outfile:  # f-prefix was missing here
                    dill.dump(save_data, outfile)

                previous = (name, sample)
                basis_i += 1
                clear_output()

        completed.append(experiment_name)

    print('**REVIEW COMPLETE**')
review_file = '/Users/cody/github/verb_semantics/project_code/datareview/viArgAllReview.dill'
backup = 'viArgAllReviewBackup'
B.prettySetup(features={'sem_domain', 'sem_domain_code'})
review(review_file, backup=backup)
**REVIEW COMPLETE**
review_file1 = '/Users/cody/github/verb_semantics/project_code/datareview/dataReview2.dill'
review_file2 = '/Users/cody/github/verb_semantics/project_code/datareview/dataReview3.dill'
review_file3 = '/Users/cody/github/verb_semantics/project_code/datareview/dataReview4.dill'
completed = dict()
for file in (review_file1, review_file2, review_file3):
    with open(file, 'rb') as infile:
        completed.update(dill.load(infile)['reviewed'])

problems = {}
for experiment, bases in completed.items():
    for basis, tags in bases.items():
        for tag, tagdata in tags.items():
            status = tagdata['review']
            if status != 'good':
                problems[tag] = tagdata
len(problems)
46
i = 0
for tag, data in problems.items():
    i += 1
    print(tag)
    print(f'\t{data["review"]}')
    print(data['result'])
    #B.prettyTuple(data['result'], withNodes=True, seqNumber=i)
    print()
vi_subj_animacy/inanimate/31of50 | bad.woman listed as "object reference"? | (433836, 670388, 30142, 670389, 30143, 1438630, 30142)
vi_subj_animacy/inanimate/35of50 | bad.exclude quantity | (441656, 694413, 74552, 694412, 74551, 1437671, 74552)
vi_subj_animacy/inanimate/40of50 | caution.chariot is inanimate, but it moves! | (462674, 758019, 184988, 758020, 184989, 1437643, 184988)
vi_subj_animacy/inanimate/41of50 | bad.quantity is animate | (464896, 764517, 195403, 764518, 195404, 1437746, 195403)
vi_subj_animacy/inanimate/42of50 | bad.house is figurative for people | (477519, 799567, 251811, 799568, 251812, 1438660, 251811)
vi_subj_animacy/inanimate/44of50 | bad.town figurative for people | (489850, 834267, 308947, 834268, 308950, 1437643, 308947)
vi_subj_animacy/inanimate/45of50 | bad.FIXED | (491898, 839388, 316345, 839387, 316344, 1441676, 316345)
vi_subj_animacy/inanimate/46of50 | bad.frame is animate | (497767, 853839, 337162, 853838, 337160, 1440211, 337162)
vi_subj_animacy/inanimate/50of50 | bad.quantity | (508872, 884678, 385282, 884677, 385281, 1438481, 385282)
vi_objc_lex/>T>CR_<LL[/11of50 | caution.conjunction is object >t | (434569, 672607, 33809, 434570, 672609, 672610, 33818, 1438336, 33809)
vi_objc_animacy/KJ_inanimate/16of50 | bad.FIXED | (482724, 814469, 276550, 482725, 814470, 814472, 276553, 1437785, 276550)
vi_objc_animacy/KJ_inanimate/17of50 | bad.FIXED | (482835, 814771, 276991, 482836, 814772, 814774, 276994, 1437785, 276991)
vi_objc_animacy/KJ_inanimate/18of50 | bad.FIXED | (483208, 815817, 278764, 483209, 815818, 815820, 278767, 1437785, 278764)
vi_objc_animacy/animate/21of50 | caution.prep shouldnt be added with object | (430120, 659363, 13193, 659365, 13195, 13198, 1437753, 13193)
vi_objc_animacy/inanimate/37of50 | bad.Sex frame = male = person | (444294, 702341, 89191, 702342, 89193, 1437871, 89191)
vi_cmpl_domain/L_Associate/39of50 | caution.beast as "Associate"? | (460928, 752795, 175016, 752797, 175018, 175020, 1437988, 175016)
vi_cmpl_animacy/>L_inanimate/12of50 | bad.FIXED | (429089, 656291, 8370, 656292, 8371, 8372, 1438034, 8370)
vi_cmpl_animacy/MN_animate/41of50 | caution.Egypt not quite person | (434968, 673871, 36038, 673873, 36040, 36041, 1437643, 36038)
vi_cmpl_animacy/MN_animate/43of50 | caution.egypt | (437250, 680862, 48156, 680863, 48157, 48158, 1437643, 48156)
vi_cmpl_animacy/animate/49of50 | bad.place | (473954, 789670, 235538, 789671, 235539, 1438009, 235538)
vi_cmpl_animacy/animate/50of50 | bad.place | (476843, 797634, 248420, 797637, 248425, 1437837, 248420)
vi_adj+_lex/>KL[/7of50 | note: there is a coordinate clause with the infinitive clause | (486559, 825260, 294857, 486560, 825261, 294858, 1437754, 294857)
vi_adj+_animacy/>XR/_inanimate/16of50 | bad.frequency/quantity animate | (473794, 789238, 234782, 789240, 234786, 234787, 1438972, 234782)
vi_coad_animacy/>T==_inanimate/12of50 | bad.quantity person | (483784, 817483, 281441, 817485, 281443, 281444, 1438034, 281441)
vi_coad_animacy/MN_animate/35of50 | caution.name of groups; animate? "Israel" | (457014, 740910, 155999, 740911, 156000, 156002, 1437948, 155999)
vi_coad_animacy/animate/50of50 | bad.is a location here: should exclude 1\.003001009 | (432999, 667931, 26120, 667933, 26122, 1437759, 26120)
vf_argAll_pa/Cmpl|Objc|Objc|adj+|adj+|adj+|adj+/20of50 | bad.discourse object mismatched due to 999 mother/daughter rela but no verbum dicendi | (789280, 789281, 789282, 789283, 789284, 473809, 473810, 473811, 234868, 789279, 563289, 563290, 1437759)
vf_argAll_lex/Cmpl.B_>LHJM/|Cmpl.B_MLK//24of50 | note.multiple heads | (215104, 215106, 215107, 1438021, 468806, 775932, 775933, 215102, 215103)
vf_argAll_lex/Objc.JLQ//39of50 | bad.ETCBC mislabels as Object; probably is a Subject! | (1439072, 488274, 302118, 302119, 830106, 830107)
vf_argAll_lex/Objc.NBLH=/|Objc.VRJPH/|adj+.<D_<TH|adj+.MN_N<WRJM//47of50 | note.multiple heads; will fix | (266347, 266349, 266351, 266352, 807761, 266353, 807763, 807764, 266355, 480374, 266356, 1437754)
vf_argAll_animacy/Cmpl.animate|Objc.animate|adj+.B_inanimate/24of50 | bad.group is place | (377092, 880648, 880649, 880650, 880651, 377101, 377103, 377104, 377105, 507508, 1437759)
vf_argAll_animacy/Objc.animate|adj+.B_inanimate|adj+.inanimate/30of50 | bad.quantity is persons | (192617, 192620, 192621, 192622, 762671, 192625, 762674, 192626, 762675, 762673, 464251, 1437885)
vf_argAll_animacy/Objc.inanimate|adj+.B_animate|adj+.L_inanimate/35of50 | bad.group is place | (412291, 412292, 412293, 412294, 412295, 1437768, 412296, 513259, 897398, 897399, 897400, 897401)
vf_argAll_animacy/adj+.B_inanimate|adj+.K_inanimate|adj+.MN_animate/44of50 | bad.quantifier is person...also, head is missed | (1437762, 437030, 680203, 680204, 680205, 680206, 47124, 47125, 47127, 47128, 47130, 47133, 47134)
vf_obj_domain/KJ_Act|People/24of50 | bad.participle head missed here due to noun phrase | (185574, 185576, 1437609, 185578, 185581, 462775, 462776, 758362, 758364, 758365, 758366)
vf_obj_animacy/animate/15of50 | caution.animate here refers to dead meat | (345824, 345828, 500159, 860555, 860557, 1439167)
vf_obj_animacy/animate|animate/24of50 | caution.shouldn't this be apposition? | (189730, 189731, 189732, 189733, 760781, 760782, 760783, 463568, 1437750)
vf_cmpl_animacy/<D_inanimate|>XR/_inanimate/1of50 | caution.vehicles are animate | (1438304, 129732, 129733, 129735, 451499, 129740, 129741, 724245, 724246, 724247)
vf_cmpl_animacy/>L_inanimate|MN_animate/23of50 | bad.group is location | (463008, 186784, 759075, 759076, 759078, 186783, 186777, 186778, 186779, 1437759)
vf_cmpl_animacy/L_animate|MN_animate/41of50 | bad.noun is place "egypt" | (897024, 897026, 897027, 1437643, 513132, 411673, 411675, 411676, 411677, 411678)
vf_adju_lex/>XR/|L_W/10of50 | bad.prep_obj needs to be fixed. See To Do | (425704, 425705, 515530, 425706, 425707, 904312, 904313, 904314, 1439167)
vf_adju_animacy/B_animate|inanimate/28of50 | bad.group is not animate | (1437762, 162567, 162569, 458380, 162572, 162573, 745080, 745081, 745082)
vf_coad_animacy/<L_animate|B_inanimate|B_inanimate/6of50 | bad.group not animate "Israel" | (189408, 760609, 189409, 189411, 760612, 760613, 189414, 760614, 189416, 463511, 189400, 1439034, 189407)
vf_coad_animacy/<L_animate|L_animate/7of50 | 1 | (1440995, 207084, 771500, 771502, 207086, 207087, 771503, 467283, 207091, 207092)
vf_coad_animacy/B_inanimate|MN_animate/36of50 | bad.group is location "Egypt" | (184992, 758019, 184995, 758021, 758022, 1437643, 462674, 184988, 184990, 184991)
vf_coad_animacy/MN_animate|MN_inanimate/46of50 | bad.not animate group | (700224, 700225, 700227, 84804, 84805, 84806, 443558, 84810, 84811, 1438623)
To-do list for data adjustments after the review:

- Check whether the phrase is a Loca (location) function phrase. If it is, and if it has a code for either "reference to person" or "reference to group," the tag "inanimate" is assigned.

- Fix this kind of datapoint:

vi_subj_animacy/inanimate/46of50 | bad.frame is animate | (497767, 853839, 337162, 853838, 337160, 1440211, 337162)

The problem is that a participle functioning as a noun is marked with a frame domain, all of which I have marked as "inanimate." The problem is solved here by adding a further requirement, sp#verb, to all animacy noun candidates. This prevents participles from being selected. Another requirement, sp#adjv, is also added to prevent cases where an adjective is mislabeled as inanimate (several adjectives have a code of 2.*). This excludes ~135 words (instances) from the sample, but that is very small compared to the sample size. A sketch of the combined constraint follows this item.
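In Text-Fabric template terms, the two requirements might be stated as a single negated value set. This is a sketch of the constraint only, not the full basis template from parameters.py:

word sp#verb|adjv

That is, an animacy noun candidate may be neither a participle (sp=verb) nor an adjective (sp=adjv).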
- Consider building an experiment that uses mid- or high-level domain codes from SDBH. (A possible collapsing step is sketched below.)
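A minimal sketch of such a collapsing step, assuming SDBH codes nest in three-digit groups after the top-level digit (as in "1.003001009" above); the depth parameter is hypothetical:

def mid_level(code, depth=2):
    '''Sketch: truncate an SDBH domain code to `depth` levels.'''
    top, _, rest = code.partition('.')
    return f'{top}.{rest[:3 * depth]}'

print(mid_level('1.003001009'))  # '1.003001'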
- Note on the following datapoint, marked "bad.participle head missed here due to noun phrase":

vf_obj_domain/KJ_Act|People/24of50 | (185574, 185576, 1437609, 185578, 185581, 462775, 462776, 758362, 758364, 758365, 758366)

The participle עשה here has a phrase-dependent part of speech of adjv. The interpretation of the ETCBC is thus not a verbal participle but an adjectival description of the head noun מלאכה, i.e. "doer of work / work doer". No action is required on this item.
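A quick way to gauge how common this pattern is, using only features already loaded in this notebook (words whose lexical part of speech is verb but whose phrase-dependent part of speech is adjv):

adjv_ptcps = [w for w in F.otype.s('word')
              if F.sp.v(w) == 'verb' and F.pdp.v(w) == 'adjv']
print(f'{len(adjv_ptcps)} words with sp=verb but pdp=adjv')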
- Add a verbum dicendi requirement for speech objects, or fix the mismatches some other way. Done with >MR and DBR, based on a query. (See the sketch of the condition below.)
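The shape of that condition, as a hedged sketch (the actual fix lives in the experiment parameters): a daughter speech clause only counts as an object when the mother clause contains a verb of speaking.

VERBA_DICENDI = {'>MR[', 'DBR['}  # the two lexemes handled so far

def has_verbum_dicendi(mother_clause):
    '''Sketch: does the mother clause contain a speech verb?'''
    return any(F.lex.v(w) in VERBA_DICENDI
               for w in L.d(mother_clause, 'word')
               if F.pdp.v(w) == 'verb')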
- Fix prep_obj on suffixed prepositions followed by a conjunction. See prep_obj on 425706, which is a conjunction rather than a prepositional object. This happens because the prepositional object is a suffix. It occurs 217 times in the corpus. The fix is to add conj to the word selection parameters as an ineligible part of speech (this should have been there anyway). The affected cases match the pattern:

phrase typ=PP
    -heads> word prs
    -prep_obj> word pdp=conj
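For reference, the pattern can be run directly; B.search should report the 217 cases mentioned above:

suffixed_prep_conj = B.search('''
phrase typ=PP
    -heads> word prs
    -prep_obj> word pdp=conj
''')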
In the first iteration of these experiments, I attempted to map all of the Semantic Dictionary of Biblical Hebrew (SDBH) categories to one of three tags: animate, inanimate, and event. Events are native SDBH categories, while I mapped "objects" and semantic frames to animacy categories. Upon several inspections of the data, it is apparent that events and frames cannot be consistently mapped to animacy categories, because these lexemes are used too frequently in varied contexts. For example, because the SDBH lists both participles and adjectives under "events," the animacy mapping fails in the many cases where participles and adjectives stand in for persons. Presumably the contextual senses would indicate animacy in these cases; Isaiah 54:13 offers a good example: Know > Human, where the participle event "Know" is transformed into its human referent. But these categories appear to still be in development.
Animacy can be consistently mapped to object categories (codes 001*), object references (codes 003001*), and a handful of frames. I will collect those and modify the semantic domain code accordingly. This means that there will now be two collections of semantic domain data where previously there was one set of templates. "domain2" has been using the native SDBH categories and will need the old selection procedures; I will rename "domain2" to simply "domain," and I will create a new category specifically for animacy. The revamped experiments will thus follow accordingly:

- the revamped animacy experiments will use only animate or inanimate tags (a sketch of the selection follows).
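A minimal sketch of that selection, assuming sem_domain_code values shaped like "1.001..." (objects) and "1.003001..." (object references); the real mapping, including the handful of eligible frames, belongs in the experiment parameters:

def animacy_eligible(word):
    '''Sketch: is this word a candidate for animate/inanimate tagging?'''
    code = F.sem_domain_code.v(word) or ''
    return bool(re.match(r'1\.(001|003001)', code))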