Sharing data features

Explore additional data

The ETCBC has a few other repositories with data that work in conjunction with the BHSA data. One of them you have already seen: phono, which provides phonetic transcriptions. There is also parallels, for detecting parallel passages, and valence, for studying the patterns around verbs that determine their meanings.

Make your own data

If you study the additional data, you can observe how that data is created and also how it is turned into a Text-Fabric data module. The last step is remarkably easy: any Python dictionary whose keys are node numbers and whose values are strings or numbers can be written out as a Text-Fabric feature. When you are creating data, you have already constructed such dictionaries, so writing them out is just one method call. See, for example, how the flowchart notebook in valence writes out verb sense data.
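As a minimal sketch (the node numbers, values, and metadata below are all invented, and TF stands for an already loaded tf.fabric.Fabric instance):

```python
# Minimal sketch: a Python dict with node numbers as keys becomes a
# Text-Fabric node feature. All node numbers and values are invented.
senseData = {
    1: "d-",  # hypothetical word node -> sense label
    2: "-p",
}

metaData = {
    "sense": {
        "valueType": "str",
        "description": "hypothetical verb sense labels",
    },
}

# With a loaded Fabric instance TF, one call writes the feature to disk:
# TF.save(nodeFeatures={"sense": senseData}, metaData=metaData)
```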

Share your new data

You can then easily share your new features on GitHub, so that your colleagues everywhere can try it out for themselves.
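For reference, a shared feature module is simply a set of .tf files in a GitHub repository, addressed as {org}/{repo}/{path}, with a subdirectory per data version. A hypothetical layout:

```
myorg/myrepo/
  tf/
    c/
      sense.tf
      coref.tf
```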

Here is how you draw in other data

You can add such data on the fly by passing a mod={org}/{repo}/{path} parameter, or several of them separated by commas.

If the data is there, it will be downloaded automatically and stored on your machine.

Let's do it.

In [1]:
%load_ext autoreload
%autoreload 2

Incantation

The ins and outs of installing Text-Fabric, getting the corpus, and initializing a notebook are explained in the start tutorial.

In [2]:
from tf.app import use
In [3]:
A = use(
    'bhsa',
    mod=(
        'etcbc/valence/tf,'
        'etcbc/lingo/heads/tf,'
        'ch-jensen/Semantic-mapping-of-participants/actor/tf'
    ),
    hoist=globals(),
)
Using TF-app in /Users/dirk/github/annotation/app-bhsa/code:
	repo clone offline under ~/github (local github)
	connecting to online GitHub repo etcbc/bhsa ... connected
Using data in /Users/dirk/text-fabric-data/etcbc/bhsa/tf/c:
	rv1.6 (latest release)
	connecting to online GitHub repo etcbc/phono ... connected
Using data in /Users/dirk/text-fabric-data/etcbc/phono/tf/c:
	r1.2 (latest release)
	connecting to online GitHub repo etcbc/parallels ... connected
Using data in /Users/dirk/text-fabric-data/etcbc/parallels/tf/c:
	r1.2 (latest release)
	connecting to online GitHub repo etcbc/valence ... connected
Using data in /Users/dirk/text-fabric-data/etcbc/valence/tf/c:
	r1.1 (latest release)
	connecting to online GitHub repo etcbc/lingo ... connected
Using data in /Users/dirk/text-fabric-data/etcbc/lingo/heads/tf/c:
	r0.1 (latest release)
	connecting to online GitHub repo ch-jensen/Semantic-mapping-of-participants ... connected
	downloading https://github.com/ch-jensen/participants/releases/download/1.3/actor-tf-c.zip ... 
	unzipping ... 
	saving data
	could not save data to /Users/dirk/text-fabric-data/ch-jensen/Semantic-mapping-of-participants/actor/tf/c
	Will try something else
	actor/tf/c/actor.tf...downloaded
	actor/tf/c/coref.tf...downloaded
	actor/tf/c/prs_actor.tf...downloaded
	OK
Using data in /Users/dirk/text-fabric-data/ch-jensen/Semantic-mapping-of-participants/actor/tf/c:
	r1.3=#1c17398f92c0836c06de5e1798687c3fa18133cf (latest release)

You see that the features from the etcbc/valence/tf, etcbc/lingo/heads/tf and ch-jensen/Semantic-mapping-of-participants/actor/tf modules have been added to the mix.

If you want to check for data updates, you can add a check=True argument.

Note that edge features are in bold italic.

sense from valence

Let's find out about sense.

In [4]:
F.sense.freqList()
Out[4]:
(('--', 17999),
 ('d-', 9979),
 ('-p', 6193),
 ('-c', 4250),
 ('-i', 2869),
 ('dp', 1853),
 ('dc', 1073),
 ('di', 889),
 ('l.', 876),
 ('i.', 629),
 ('n.', 533),
 ('-b', 66),
 ('db', 61),
 ('c.', 57),
 ('k.', 54))

Which nodes have a sense feature?

In [5]:
{F.otype.v(n) for n in N() if F.sense.v(n)}
Out[5]:
{'word'}
In [6]:
results = A.search('''
word sense
''')
  0.32s 47381 results

Let's show some of the rarer sense values:

In [7]:
results = A.search('''
word sense=k.
''')
  0.39s 54 results

If we do a pretty display, the sense feature shows up.

In [9]:
A.show(results, start=1, end=1, withNodes=True)

result 1

sentence 1172573 53|287
clause 427940 WayX
phrase 652698 Conj CP
1943
conj and
phrase 652699 Pred VP
1944
verb know qal wayq sense=d-
phrase 652701 Objc PP
sentence 1172574 54|288
clause 427941 Way0
phrase 652702 Conj CP
1948
conj and
phrase 652703 Pred VP
1949
verb be pregnant qal wayq sense=--
sentence 1172575 55|289
clause 427942 Way0
phrase 652704 Conj CP
1950
conj and
phrase 652705 Pred VP
1951
verb bear qal wayq sense=d-
phrase 652706 Objc PP
1952
prep <object marker>
sentence 1172576 56|290
clause 427943 WayX|Way0
phrase 652707 Conj CP
phrase 652708 Pred VP
1955
verb be qal wayq sense=--
clause 427944 Subj Ptcp
phrase 652709 PreC VP
1956
verb build qal ptca sense=d-

actor from Semantic-mapping-of-participants

Let's find out about actor.

In [10]:
fl = F.actor.freqList()
len(fl)
Out[10]:
411
In [11]:
fl[0:10]
Out[11]:
(('JHWH', 358),
 ('BN JFR>L', 203),
 ('>JC', 103),
 ('2sm"YOUSgmas"', 66),
 ('MCH', 61),
 ('>RY', 58),
 ('>TM', 45),
 ('JFR>L', 35),
 ('NPC', 35),
 ('>X "YOUSgmas"', 34))

Which nodes have an actor feature?

In [12]:
{F.otype.v(n) for n in N() if F.actor.v(n)}
Out[12]:
{'phrase_atom', 'subphrase'}
In [13]:
results = A.search('''
phrase_atom actor
''')
  0.18s 2073 results

Let's show some of the rarer actor values:

In [14]:
results = A.search('''
phrase_atom actor=KHN
''')
  0.27s 30 results
In [16]:
A.show(results, start=1, end=1)

result 1

sentence 7|1258
clause WQtX
phrase Conj CP
conj and
phrase Pred VP
actor=>JC >JC
verb cut nif perf
phrase Subj NP
actor=>JC >JC
phrase Cmpl PP
actor=QRB <M >JC >JC
clause Adju xYqX
phrase Conj CP
phrase Pred VP
actor=BN JFR>L
verb come hif impf
phrase Subj NP
actor=BN JFR>L
phrase Objc PP
actor=ZBX BN JFR>L
prep <object marker>
clause Attr Ptcp
phrase Rela CP
conj <relative>
phrase Subj PPrP
actor=BN JFR>L
phrase PreC VP
actor=BN JFR>L
verb slaughter qal ptca
clause Coor WQt0
phrase Conj CP
conj and
phrase PreO VP
actor=BN JFR>L
verb come hif perf
phrase Cmpl PP
actor=JHWH
phrase Cmpl PP
actor=PTX >HL MW<D
phrase Cmpl PP
actor=KHN
clause Coor WQt0
phrase Conj CP
conj and
phrase Pred VP
actor=BN JFR>L
verb slaughter qal perf
phrase Objc NP|PP
actor=JHWH
phrase Objc PP
prep <object marker>

heads from lingo

Now, heads is an edge feature: we cannot directly make it visible in pretty displays, but we can use it in queries.
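As a rough mental model (not the actual implementation), an edge feature maps nodes to the nodes they point to; in Text-Fabric you would read it with E.heads.f(phraseNode). A toy illustration with invented node numbers:

```python
# Toy model of the heads edge feature: each phrase node maps to the word
# nodes that are its heads. All node numbers here are invented.
headsEdge = {
    100: (3,),     # a Pred VP phrase -> its verb
    101: (6, 10),  # an Objc PP phrase -> its nominal heads
}

# In Text-Fabric itself, E.heads.f(100) would return the head words of
# phrase 100, and E.heads.t(3) the phrases of which word 3 is a head.
for phrase, words in sorted(headsEdge.items()):
    print(phrase, "->", words)
```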

We also want to make the feature sense visible, so we mention the feature in the query, without restricting the results.

In [17]:
results = A.search('''
book book=Genesis
  chapter chapter=1
    clause
      phrase
      -heads> word sense*
'''
)
  0.57s 402 results

We make the feature sense visible:

In [18]:
A.show(results, start=1, end=3, withNodes=True)

result 1

clause 427553 xQtX
phrase 651543 Pred VP
3
verb create qal perf sense=d-
phrase 651545 Objc PP
5
prep <object marker>
6
art the
8
conj and
9
prep <object marker>
10
art the

result 2

clause 427553 xQtX
phrase 651543 Pred VP
3
verb create qal perf sense=d-
phrase 651545 Objc PP
5
prep <object marker>
6
art the
8
conj and
9
prep <object marker>
10
art the

result 3

clause 427553 xQtX
phrase 651543 Pred VP
3
verb create qal perf sense=d-
phrase 651545 Objc PP
5
prep <object marker>
6
art the
8
conj and
9
prep <object marker>
10
art the

Note how the words that are heads of their phrases are highlighted within their phrases.

All together!

Here is a query that shows results with all features.

In [19]:
results = A.search('''
book book=Leviticus
  phrase sense*
    phrase_atom actor=KHN
  -heads> word
''')
  0.74s 30 results
In [20]:
A.displaySetup(condensed=True, condenseType='verse')
A.show(results, start=8, end=8)
A.displaySetup()

verse 8

sentence 27|1610
clause CPen
phrase Conj CP
conj and
phrase Frnt NP
actor=KHN
clause Resu xYq0
phrase Conj CP
phrase Pred VP
actor=KHN
verb buy qal impf sense=d-
phrase Objc NP
actor=NPC
clause Coor XYqt
phrase Subj PPrP
actor=NPC
phrase Pred VP
actor=NPC
verb eat qal impf sense=-p
phrase Cmpl PP
sentence 28|1611
clause CPen
phrase Conj CP
conj and
phrase Frnt NP
actor=JLJD BJT KHN
clause Resu XYqt
phrase Subj PPrP
actor=JLJD BJT KHN
phrase Pred VP
actor=JLJD BJT KHN
verb eat qal impf sense=-p

Features from custom locations

If you want to load your features from your own local github repositories, instead of from the data that TF has downloaded for you into ~/text-fabric-data, you can do so by passing the parameter checkout='clone'.

In [21]:
A = use('bhsa', checkout='clone', hoist=globals())
Using TF-app in /Users/dirk/github/annotation/app-bhsa/code:
	repo clone offline under ~/github (local github)
Using data in /Users/dirk/github/etcbc/bhsa/tf/c:
	repo clone offline under ~/github (local github)
Using data in /Users/dirk/github/etcbc/phono/tf/c:
	repo clone offline under ~/github (local github)
Using data in /Users/dirk/github/etcbc/parallels/tf/c:
	repo clone offline under ~/github (local github)

Hover over the features to see where they come from: your local github repo.

You may load extra features by specifying locations and modules manually.

Here we get the valence features, not as a module, but in a custom way.

In [22]:
A = use('bhsa', locations='~/text-fabric-data/etcbc/valence/tf', modules='c', hoist=globals())
Using TF-app in /Users/dirk/github/annotation/app-bhsa/code:
	repo clone offline under ~/github (local github)
	connecting to online GitHub repo etcbc/bhsa ... connected
Using data in /Users/dirk/text-fabric-data/etcbc/bhsa/tf/c:
	rv1.6 (latest release)
	connecting to online GitHub repo etcbc/phono ... connected
Using data in /Users/dirk/text-fabric-data/etcbc/phono/tf/c:
	r1.2 (latest release)
	connecting to online GitHub repo etcbc/parallels ... connected
Using data in /Users/dirk/text-fabric-data/etcbc/parallels/tf/c:
	r1.2 (latest release)

Still, all features of the main corpus and the standard modules have been loaded.

Using locations and modules is useful if you want to load extra features from custom locations on your computer.

Fewer features

If you want to load fewer features, you can set up TF in the traditional way first, and then wrap the app API around it.

Here we load just the minimal set of features to get going.

In [23]:
from tf.fabric import Fabric
In [24]:
TF = Fabric(locations='~/github/etcbc/bhsa/tf', modules='c')
This is Text-Fabric 7.5.4
Api reference : https://annotation.github.io/text-fabric/Api/Fabric/

114 features found and 0 ignored
In [25]:
api = TF.load('pdp vs vt gn nu ps lex')
  0.00s loading features ...
   |     0.10s B lex                  from /Users/dirk/github/etcbc/bhsa/tf/c
   |     0.11s B pdp                  from /Users/dirk/github/etcbc/bhsa/tf/c
   |     0.11s B vs                   from /Users/dirk/github/etcbc/bhsa/tf/c
   |     0.11s B vt                   from /Users/dirk/github/etcbc/bhsa/tf/c
   |     0.08s B gn                   from /Users/dirk/github/etcbc/bhsa/tf/c
   |     0.10s B nu                   from /Users/dirk/github/etcbc/bhsa/tf/c
   |     0.10s B ps                   from /Users/dirk/github/etcbc/bhsa/tf/c
  4.58s All features loaded/computed - for details use loadLog()

And finally we wrap the app around it:

In [26]:
A = use('bhsa', api=api, hoist=globals())
Using TF-app in /Users/dirk/github/annotation/app-bhsa/code:
	repo clone offline under ~/github (local github)

This loads much faster.

A small test: what are the verbal stems?

In [27]:
F.vs.freqList()
Out[27]:
(('NA', 352874),
 ('qal', 50205),
 ('hif', 9407),
 ('piel', 6811),
 ('nif', 4145),
 ('hit', 960),
 ('peal', 654),
 ('pual', 492),
 ('hof', 427),
 ('hsht', 172),
 ('haf', 163),
 ('pael', 88),
 ('htpe', 53),
 ('peil', 40),
 ('htpa', 30),
 ('shaf', 15),
 ('etpa', 8),
 ('hotp', 8),
 ('pasq', 6),
 ('poel', 5),
 ('tif', 5),
 ('afel', 4),
 ('etpe', 3),
 ('htpo', 3),
 ('nit', 3),
 ('poal', 3))

Next steps

  • display become an expert in creating pretty displays of your text structures
  • search turbo charge your hand-coding with search templates
  • exportExcel make tailor-made spreadsheets out of your results
  • export export your dataset as an Emdros database

Back to start