Notebook

Tutorial¶

This notebook gets you started with using Text-Fabric for coding in the Dead-Sea Scrolls.

Familiarity with the underlying data model is recommended.

Cookbook¶

This tutorial and its sister tutorials are meant to showcase most of things TF can do.

But we also have a cookbook with a set of focused recipes on tricky things.

Installing Text-Fabric¶

See here

Tip¶

If you start computing with this tutorial, first copy its parent directory to somewhere else, outside your repository. If you pull changes from the repository later, your work will not be overwritten. Where you put your tutorial directory is up to you. It will work from any directory.

Data¶

Text-Fabric will fetch the data set for you from GitHub, and check for updates.

The data will be stored in the text-fabric-data in your home directory.

Features¶

The data of the corpus is organized in features. They are columns of data. Think of the corpus as a gigantic spreadsheet, where row 1 corresponds to the first sign, row 2 to the second sign, and so on, for all ~ 1.5 M signs, followed by ~ 500 K word nodes and yet another 200 K nodes of other types.

The information which reading each sign has, constitutes a column in that spreadsheet. The DSS corpus contains > 50 columns.

Instead of putting that information in one big table, the data is organized in separate columns. We call those columns features.

In [1]:

%load_ext autoreload
%autoreload 2

In [2]:

import os
import collections

Incantation¶

The simplest way to get going is by this incantation:

In [3]:

from tf.app import use

In [4]:

A = use("ETCBC/dss", hoist=globals())

Locating corpus resources ...

app: ~/text-fabric-data/github/ETCBC/dss/app

data: ~/text-fabric-data/github/ETCBC/dss/tf/1.9

data: ~/text-fabric-data/github/ETCBC/dss/parallels/tf/1.9

   |     0.73s T otype                from ~/text-fabric-data/github/ETCBC/dss/tf/1.9
   |     7.90s T oslots               from ~/text-fabric-data/github/ETCBC/dss/tf/1.9
   |     1.20s T fullo                from ~/text-fabric-data/github/ETCBC/dss/tf/1.9
   |     0.98s T glex                 from ~/text-fabric-data/github/ETCBC/dss/tf/1.9
   |     0.37s T lang                 from ~/text-fabric-data/github/ETCBC/dss/tf/1.9
   |     0.97s T glexo                from ~/text-fabric-data/github/ETCBC/dss/tf/1.9
   |     1.09s T lexe                 from ~/text-fabric-data/github/ETCBC/dss/tf/1.9
   |     0.07s T punce                from ~/text-fabric-data/github/ETCBC/dss/tf/1.9
   |     1.13s T lexo                 from ~/text-fabric-data/github/ETCBC/dss/tf/1.9
   |     0.07s T punco                from ~/text-fabric-data/github/ETCBC/dss/tf/1.9
   |     1.81s T after                from ~/text-fabric-data/github/ETCBC/dss/tf/1.9
   |     0.07s T punc                 from ~/text-fabric-data/github/ETCBC/dss/tf/1.9
   |     0.14s T fragment             from ~/text-fabric-data/github/ETCBC/dss/tf/1.9
   |     0.11s T line                 from ~/text-fabric-data/github/ETCBC/dss/tf/1.9
   |     4.05s T glyph                from ~/text-fabric-data/github/ETCBC/dss/tf/1.9
   |     0.14s T scroll               from ~/text-fabric-data/github/ETCBC/dss/tf/1.9
   |     0.97s T glexe                from ~/text-fabric-data/github/ETCBC/dss/tf/1.9
   |     3.94s T glyphe               from ~/text-fabric-data/github/ETCBC/dss/tf/1.9
   |     1.04s T morpho               from ~/text-fabric-data/github/ETCBC/dss/tf/1.9
   |     1.27s T fulle                from ~/text-fabric-data/github/ETCBC/dss/tf/1.9
   |     3.95s T glypho               from ~/text-fabric-data/github/ETCBC/dss/tf/1.9
   |     1.19s T lex                  from ~/text-fabric-data/github/ETCBC/dss/tf/1.9
   |     1.33s T full                 from ~/text-fabric-data/github/ETCBC/dss/tf/1.9
   |      |     0.22s C __levels__           from otype, oslots, otext
   |      |     8.01s C __order__            from otype, oslots, __levels__
   |      |     0.39s C __rank__             from otype, __order__
   |      |       13s C __levUp__            from otype, oslots, __rank__
   |      |     5.69s C __levDown__          from otype, __levUp__, __rank__
   |      |     1.26s C __characters__       from otext
   |      |     4.58s C __boundary__         from otype, oslots, __rank__
   |      |     0.72s C __sections__         from otype, oslots, otext, __levUp__, __levDown__, __levels__, scroll, fragment, line
   |     0.00s T alt                  from ~/text-fabric-data/github/ETCBC/dss/tf/1.9
   |     0.47s T biblical             from ~/text-fabric-data/github/ETCBC/dss/tf/1.9
   |     0.45s T book                 from ~/text-fabric-data/github/ETCBC/dss/tf/1.9
   |     3.61s T book_etcbc           from ~/text-fabric-data/github/ETCBC/dss/tf/1.9
   |     0.44s T chapter              from ~/text-fabric-data/github/ETCBC/dss/tf/1.9
   |     0.68s T cl                   from ~/text-fabric-data/github/ETCBC/dss/tf/1.9
   |     0.00s T cl2                  from ~/text-fabric-data/github/ETCBC/dss/tf/1.9
   |     0.01s T cor                  from ~/text-fabric-data/github/ETCBC/dss/tf/1.9
   |     0.46s T g_cons               from ~/text-fabric-data/github/ETCBC/dss/tf/1.9
   |     0.88s T g_nme_etcbc          from ~/text-fabric-data/github/ETCBC/dss/tf/1.9
   |     0.90s T g_prs_etcbc          from ~/text-fabric-data/github/ETCBC/dss/tf/1.9
   |     0.44s T gn                   from ~/text-fabric-data/github/ETCBC/dss/tf/1.9
   |     0.11s T gn2                  from ~/text-fabric-data/github/ETCBC/dss/tf/1.9
   |     0.00s T gn3                  from ~/text-fabric-data/github/ETCBC/dss/tf/1.9
   |     0.42s T gn_etcbc             from ~/text-fabric-data/github/ETCBC/dss/tf/1.9
   |     0.00s T halfverse            from ~/text-fabric-data/github/ETCBC/dss/tf/1.9
   |     0.00s T intl                 from ~/text-fabric-data/github/ETCBC/dss/tf/1.9
   |     3.93s T lang_etcbc           from ~/text-fabric-data/github/ETCBC/dss/tf/1.9
   |     1.09s T lex_etcbc            from ~/text-fabric-data/github/ETCBC/dss/tf/1.9
   |     0.01s T md                   from ~/text-fabric-data/github/ETCBC/dss/tf/1.9
   |     0.00s T merr                 from ~/text-fabric-data/github/ETCBC/dss/tf/1.9
   |     0.00s T nr                   from ~/text-fabric-data/github/ETCBC/dss/tf/1.9
   |     0.44s T nu                   from ~/text-fabric-data/github/ETCBC/dss/tf/1.9
   |     0.11s T nu2                  from ~/text-fabric-data/github/ETCBC/dss/tf/1.9
   |     0.00s T nu3                  from ~/text-fabric-data/github/ETCBC/dss/tf/1.9
   |     1.02s T nu_etcbc             from ~/text-fabric-data/github/ETCBC/dss/tf/1.9
   |     0.53s T occ                  from ~/text-fabric-data/github/ETCBC/dss/tf/1.9
   |     0.12s T ps                   from ~/text-fabric-data/github/ETCBC/dss/tf/1.9
   |     0.11s T ps2                  from ~/text-fabric-data/github/ETCBC/dss/tf/1.9
   |     0.00s T ps3                  from ~/text-fabric-data/github/ETCBC/dss/tf/1.9
   |     1.03s T ps_etcbc             from ~/text-fabric-data/github/ETCBC/dss/tf/1.9
   |     1.15s T rec                  from ~/text-fabric-data/github/ETCBC/dss/tf/1.9
   |     0.01s T rem                  from ~/text-fabric-data/github/ETCBC/dss/tf/1.9
   |     0.01s T script               from ~/text-fabric-data/github/ETCBC/dss/tf/1.9
   |     0.12s T sim                  from ~/text-fabric-data/github/ETCBC/dss/parallels/tf/1.9
   |     0.97s T sp                   from ~/text-fabric-data/github/ETCBC/dss/tf/1.9
   |     0.90s T sp_etcbc             from ~/text-fabric-data/github/ETCBC/dss/tf/1.9
   |     0.90s T srcLn                from ~/text-fabric-data/github/ETCBC/dss/tf/1.9
   |     0.24s T st                   from ~/text-fabric-data/github/ETCBC/dss/tf/1.9
   |     4.15s T type                 from ~/text-fabric-data/github/ETCBC/dss/tf/1.9
   |     0.18s T unc                  from ~/text-fabric-data/github/ETCBC/dss/tf/1.9
   |     0.85s T uvf_etcbc            from ~/text-fabric-data/github/ETCBC/dss/tf/1.9
   |     0.01s T vac                  from ~/text-fabric-data/github/ETCBC/dss/tf/1.9
   |     0.44s T verse                from ~/text-fabric-data/github/ETCBC/dss/tf/1.9
   |     0.16s T vs                   from ~/text-fabric-data/github/ETCBC/dss/tf/1.9
   |     1.01s T vs_etcbc             from ~/text-fabric-data/github/ETCBC/dss/tf/1.9
   |     0.16s T vt                   from ~/text-fabric-data/github/ETCBC/dss/tf/1.9
   |     1.01s T vt_etcbc             from ~/text-fabric-data/github/ETCBC/dss/tf/1.9

TF: TF API 12.6.2, ETCBC/dss/app v3, Search Reference
Data: ETCBC - dss 1.9, Character table, Feature docs

Node types

Name	# of nodes	# slots / node	% coverage
scroll	1001	1428.81	100
lex	10450	129.14	94
fragment	11182	127.91	100
line	52895	27.04	100
clause	125	12.85	0
cluster	101099	6.68	47
phrase	315	5.10	0
word	500995	2.81	99
sign	1430241	1.00	100

Sets: no custom sets
Features:

ETCBC/dss/parallels/tf

sim

int

similarity between lines, as a percentage of the common material wrt the combined material

Dead Sea Scrolls

after

str

space behind the word, if any

alt

int

alternative reading

biblical

int

whether we are in biblical material or not

book

str

acronym of the book in which the word occurs

book_etcbc

str

Dead Sea Scrolls: additions based on BHSA

chapter

str

label of the chapter in which the word occurs

str

class (morphology tag)

cl2

str

class (for part 2) (morphology tag)

cor

int

correction made by an ancient or modern editor

fragment

str

label of a fragment of a scroll

full

str

full transcription (Unicode) of a word including flags and brackets

fulle

str

full transcription (ETCBC transliteration) of a word including flags and brackets

fullo

str

full transcription (original source) of a word including flags and brackets

g_cons

str

Dead Sea Scrolls: additions based on BHSA and machine learning

g_nme_etcbc

str

Dead Sea Scrolls: additions based on BHSA and machine learning

g_prs_etcbc

str

Dead Sea Scrolls: additions based on BHSA and machine learning

glex

str

representation (Unicode) of a lexeme leaving out non-letters

glexe

str

representation (ETCBC transliteration) of a lexeme leaving out non-letters

glexo

str

representation (original source) of a lexeme leaving out non-letters

glyph

str

representation (Unicode) of a word or sign

glyphe

str

representation (ETCBC transliteration) of a word or sign

glypho

str

representation (original source) of a word or sign

str

gender (morphology tag)

gn2

str

gender (for part 2) (morphology tag)

gn3

str

gender (for part 3) (morphology tag)

gn_etcbc

str

Dead Sea Scrolls: additions based on BHSA and machine learning

halfverse

str

label of the half-verse in which the word occurs

intl

int

interlinear material, the value indicates the sequence number of the interlinear line

lang

str

language of a word or sign, only if it is not Hebrew

lang_etcbc

str

Dead Sea Scrolls: additions based on BHSA

lex

str

representation (Unicode) of a lexeme

lex_etcbc

str

Dead Sea Scrolls: additions based on BHSA and machine learning

lexe

str

representation (ETCBC transliteration) of a lexeme

lexo

str

representation (original source) of a lexeme

line

str

label of a line of a fragment of a scroll

str

mood (morphology tag)

merr

str

errors in parsing the morphology tag

morpho

str

morphological tag (by Abegg)

str

Dead Sea Scrolls: additions based on BHSA and machine learning

str

number (morphology tag)

nu2

str

number (for part 2) (morphology tag)

nu3

str

number (for part 3) (morphology tag)

nu_etcbc

str

Dead Sea Scrolls: additions based on BHSA and machine learning

otype

str

Dead Sea Scrolls: biblical and non-biblical scrolls

str

person (morphology tag)

ps2

str

person (for part 2) (morphology tag)

ps3

str

person (for part 3) (morphology tag)

ps_etcbc

str

Dead Sea Scrolls: additions based on BHSA and machine learning

punc

str

trailing punctuation (Unicode) of a word

punce

str

trailing punctuation (ETCBC transliteration) of a word

punco

str

trailing punctuation (original source) of a word

rec

int

reconstructed by a modern editor

rem

int

removed by an ancient or modern editor

script

str

script in which the word or sign is written if it is not Hebrew

scroll

str

acronym of a scroll

str

part of speech (morphology tag)

sp_etcbc

str

Dead Sea Scrolls: additions based on BHSA and machine learning

srcLn

int

the line number of the word in the source data file

str

state (morphology tag)

type

str

type of sign or cluster

unc

int

uncertain material in various degrees: higher degree is less certain

uvf_etcbc

str

Dead Sea Scrolls: additions based on BHSA and machine learning

vac

int

empty, unwritten space

verse

str

label of the verse in which the word occurs

str

verbal stem (morphology tag)

vs_etcbc

str

Dead Sea Scrolls: additions based on BHSA and machine learning

str

verbal tense/aspect (morphology tag)

vt_etcbc

str

Dead Sea Scrolls: additions based on BHSA and machine learning

occ

none

edge feature from a lexeme to its occurrences

oslots

none

Dead Sea Scrolls: biblical and non-biblical scrolls

Settings:

specified

apiVersion: 3
appName: ETCBC/dss
appPath: /Users/me/text-fabric-data/github/ETCBC/dss/app
commit: gd796845ffd026d7896a29d71f730d471cba06631
css:
.full,.glyph,.punc { font-family: "Ezra SIL", "SBL Hebrew", sans-serif; } .scriptpaleohebrew { border: 1px dashed navy; } .scriptgreekcapital { border: 1px dashed brown; } .langa { text-decoration: underline; } .intl1 { vertical-align: -0.25em; } .intl2 { vertical-align: -0.5em; } .langg { font-family: serif; text-decoration: underline; } .vac1 { background-color: #aaaaaa; border 2pt solid #dd3333; border-radius: 4pt; } .rem1 { font-weight: bold; color: red; text-decoration: line-through; } .rem2 { font-weight: bold; color: maroon; text-decoration: line-through; } .rec1 { color: teal; font-size: 80%; } .cor1 { font-weight: bold; color: dodgerblue; text-decoration: overline; } .cor2 { font-weight: bold; color: navy; text-decoration: overline; } .cor3 { font-weight: bold; color: navy; text-decoration: overline; vertical-align: super; } .alt1 { text-decoration: overline; } /* UNSURE: italic*/ .unc1 { font-weight: bold; color: #888888; } .unc2 { font-weight: bold; color: #bbbbbb; } .unc3 { font-weight: bold; color: #bbbbbb; text-shadow: #cccccc 1px 1px; } .unc4 { font-weight: bold; color: #dddddd; text-shadow: #eeeeee 2px 2px; } .empty { color: #ff0000; }
dataDisplay:
- noneValues:
  - unknown
  - no value
- showVerseInTuple: True
- textFormats:
  - layout-orig-full: {method: layoutOrig}
  - layout-source-full: {method: layoutSource}
  - layout-trans-full: {method: layoutTrans}
docs:
- docPage: about
- featureBase: {docBase}/transcription.md
- featurePage: transcription
interfaceDefaults: {lineNumbers: 0}
isCompatible: True
local: local
localDir: /Users/me/text-fabric-data/github/ETCBC/dss/_temp
provenanceSpec:
- corpus: Dead Sea Scrolls
- doi: 10.5281/zenodo.2652849
- moduleSpecs:
  :
  
  backend: no value
  corpus: Parallel Passages
  docUrl:
  https://nbviewer.jupyter.org/github/etcbc/dss/blob/master/programs/parallels.ipynb
  doi: 10.5281/zenodo.2652849
  org: ETCBC
  relative: parallels/tf
  repo: dss
- org: ETCBC
- relative: /tf
- repo: dss
- version: 1.9
- webBase: https://www.deadseascrolls.org.il/explore-the-archive
- webHint: Show this scroll in the Leon Levy library
- webUrl: {webBase}/search#q='<1>'
release: v1.9
typeDisplay:
- clause: {label: {nr}}
- cluster:
  - label: {type}
  - stretch: 0
- fragment: {features: biblical}
- lex:
  - features: lex lexe
  - featuresBare: lexo
  - label: True
  - lexOcc: word
  - template: True
- line: {features: biblical}
- phrase: {label: {nr}}
- scroll: {features: biblical}
- sign: {exclude: {type: term}}
- word:
  - base: True
  - features: lang lex cl ps gn nu st vs vt md
  - featuresBare: sp
  - label: True
  - lineNumber: srcLn
  - wrap: 0
writing: hbo

TF API: names N F E L T S C TF Fs Fall Es Eall Cs Call directly usable

You can see which features have been loaded, and if you click on a feature name, you find its documentation. If you hover over a name, you see where the feature is located on your system.

API¶

The result of the incantation is that we have a bunch of special variables at our disposal that give us access to the text and data of the corpus.

At this point it is helpful to throw a quick glance at the text-fabric API documentation (see the links under API Members above).

The most essential thing for now is that we can use F to access the data in the features we've loaded. But there is more, such as N, which helps us to walk over the text, as we see in a minute.

The API members above show you exactly which new names have been inserted in your namespace. If you click on these names, you go to the API documentation for them.

Search¶

Text-Fabric contains a flexible search engine, that does not only work for the data, of this corpus, but also for other corpora and data that you add to corpora.

Search is the quickest way to come up-to-speed with your data, without too much programming.

Jump to the dedicated search search tutorial first, to whet your appetite.

The real power of search lies in the fact that it is integrated in a programming environment. You can use programming to:

compose dynamic queries
process query results

Therefore, the rest of this tutorial is still important when you want to tap that power. If you continue here, you learn all the basics of data-navigation with Text-Fabric.

Counting¶

In order to get acquainted with the data, we start with the simple task of counting.

Count all nodes¶

We use the N.walk() generator to walk through the nodes.

We compared the TF data to a gigantic spreadsheet, where the rows correspond to the signs. In Text-Fabric, we call the rows slots, because they are the textual positions that can be filled with signs.

We also mentioned that there are also other textual objects. They are the clusters, lines, faces and documents. They also correspond to rows in the big spreadsheet.

In Text-Fabric we call all these rows nodes, and the N() generator carries us through those nodes in the textual order.

Just one extra thing: the info statements generate timed messages. If you use them instead of print you'll get a sense of the amount of time that the various processing steps typically need.

In [5]:

A.indent(reset=True)
A.info("Counting nodes ...")

i = 0
for n in N.walk():
    i += 1

A.info("{} nodes".format(i))

  0.00s Counting nodes ...
  0.11s 2108303 nodes

Here you see it: over 2M nodes.

What are those nodes?¶

Every node has a type, like sign, or line, face. But what exactly are they?

Text-Fabric has two special features, otype and oslots, that must occur in every Text-Fabric data set. otype tells you for each node its type, and you can ask for the number of slots in the text.

Here we go!

In [6]:

F.otype.slotType

Out[6]:

'sign'

In [7]:

F.otype.maxSlot

Out[7]:

In [8]:

F.otype.maxNode

Out[8]:

In [9]:

F.otype.all

Out[9]:

('scroll',
 'lex',
 'fragment',
 'line',
 'clause',
 'cluster',
 'phrase',
 'word',
 'sign')

In [10]:

C.levels.data

Out[10]:

(('scroll', 1428.8121878121879, 1605868, 1606868),
 ('lex', 129.1396172248804, 1542523, 1552972),
 ('fragment', 127.90565194061885, 1531341, 1542522),
 ('line', 27.03924756593251, 1552973, 1605867),
 ('clause', 12.848, 2107864, 2107988),
 ('cluster', 6.678582379647672, 1430242, 1531340),
 ('phrase', 5.098412698412698, 2107989, 2108303),
 ('word', 2.814359424744758, 1606869, 2107863),
 ('sign', 1, 1, 1430241))

This is interesting: above you see all the textual objects, with the average size of their objects, the node where they start, and the node where they end.

Count individual object types¶

This is an intuitive way to count the number of nodes in each type. Note in passing, how we use the indent in conjunction with info to produce neat timed and indented progress messages.

In [11]:

A.indent(reset=True)
A.info("counting objects ...")

for otype in F.otype.all:
    i = 0

    A.indent(level=1, reset=True)

    for n in F.otype.s(otype):
        i += 1

    A.info("{:>7} {}s".format(i, otype))

A.indent(level=0)
A.info("Done")

  0.00s counting objects ...
   |     0.00s    1001 scrolls
   |     0.00s   10450 lexs
   |     0.00s   11182 fragments
   |     0.01s   52895 lines
   |     0.00s     125 clauses
   |     0.01s  101099 clusters
   |     0.00s     315 phrases
   |     0.06s  500995 words
   |     0.17s 1430241 signs
  0.26s Done

Viewing textual objects¶

You can use the A API (the extra power) to display cuneiform text.

See the display tutorial.

Feature statistics¶

F gives access to all features. Every feature has a method freqList() to generate a frequency list of its values, higher frequencies first. Here are the parts of speech:

In [12]:

F.sp.freqList()

Out[12]:

(('ptcl', 154464),
 ('subs', 108562),
 ('unknown', 80256),
 ('verb', 58873),
 ('suff', 45747),
 ('adjv', 10633),
 ('numr', 6526),
 ('pron', 5784))

Signs, words and clusters have types. We can count them separately:

In [13]:

F.type.freqList("cluster")

Out[13]:

(('rec', 93733),
 ('vac', 3522),
 ('cor3', 1582),
 ('unc2', 906),
 ('rem2', 706),
 ('alt', 333),
 ('cor2', 147),
 ('cor', 95),
 ('rem', 75))

In [14]:

F.type.freqList("word")

Out[14]:

(('glyph', 470605), ('punct', 29927), ('numr', 463))

In [15]:

F.type.freqList("sign")

Out[15]:

(('cons', 1156780),
 ('empty', 98407),
 ('missing', 53864),
 ('sep', 46453),
 ('punct', 29927),
 ('unc', 27168),
 ('term', 15532),
 ('numr', 2029),
 ('add', 65),
 ('foreign', 16))

Word matters¶

Top 20 frequent words¶

We represent words by their essential symbols, collected in the feature glyph (which also exists for signs).

In [16]:

for (w, amount) in F.glyph.freqList("word")[0:20]:
    print(f"{amount:>5} {w}")

45393 ו
20491 ה
19378 ל
18225 ב
 6389 את
 5863 מ
 4894 אשר
 4789 יהוה
 4355 א
 4236 כול
 4185 על
 4172 אל
 3262 כי
 3091 כ
 3005 לא
 2841 כל
 2424 לוא
 1938 ארץ
 1829 ישראל
 1653 יום

Word distribution¶

Let's do a bit more fancy word stuff.

Hapaxes¶

A hapax can be found by picking the words with frequency 1. We do have lexeme information in this corpus, let's use it for determining hapaxes.

We print 20 hapaxes.

In [17]:

hapaxes1 = sorted(lx for (lx, amount) in F.lex.freqList("word") if amount == 1)
len(hapaxes1)

Out[17]:

In [18]:

for lx in hapaxes1[0:20]:
    print(lx)

 #  #  #  #  # 
 #  #  #  #  #  #  #  #  # 
 #  #  #  #  # ות
 #  #  #  #  # ל #  #  # 
 #  #  #  #  # ם
 #  #  #  # ב
 #  #  #  # ה
 #  #  #  # ו # 
 #  #  #  # ך
 #  #  #  # ל #  # 
 #  #  #  # תא
 #  #  # ד
 #  #  # דב
 #  #  # דה
 #  #  # ה #  # 
 #  #  # הו
 #  #  # הם
 #  #  # ות
 #  #  # ט
 #  #  # כת

An other way to find lexemes with only one occurrence is to use the occ edge feature from lexeme nodes to the word nodes of its occurrences.

In [19]:

hapaxes2 = sorted(F.lex.v(lx) for lx in F.otype.s("lex") if len(E.occ.f(lx)) == 1)
len(hapaxes2)

Out[19]:

In [20]:

for lx in hapaxes2[0:20]:
    print(lx)

 #  #  #  #  # 
 #  #  #  #  #  #  #  #  # 
 #  #  #  #  # ות
 #  #  #  #  # ל #  #  # 
 #  #  #  #  # ם
 #  #  #  # ב
 #  #  #  # ה
 #  #  #  # ו # 
 #  #  #  # ך
 #  #  #  # ל #  # 
 #  #  #  # תא
 #  #  # ד
 #  #  # דב
 #  #  # דה
 #  #  # ה #  # 
 #  #  # הו
 #  #  # הם
 #  #  # ות
 #  #  # ט
 #  #  # כת

The feature lex contains lexemes that may have uncertain characters in it.

The function glex has all those characters stripped. Let's use glex instead.

In [21]:

hapaxes1g = sorted(lx for (lx, amount) in F.glex.freqList("word") if amount == 1)
len(hapaxes1)

Out[21]:

In [22]:

for lx in hapaxes1g[0:20]:
    print(lx)

If we are not interested in the numerals:

In [23]:

for lx in [x for x in hapaxes1g if not x.isdigit()][0:20]:
    print(lx)

 ידה
 לוט
 נַחַל
 שֵׂעָר
ֶ
אֱגֹוז
אֱלִידָד
אֱלִיעָם
אֱלִישֶׁבַע
אֲבִיאֵל
אֲבִיטַל
אֲבִיעֶזְרִי
אֲבִיעֶזֶר
אֲבִישׁוּעַ
אֲבַטִּיחַ
אֲגֹורָה
אֲדַמְדַּם
אֲדָר
אֲדֹנִי
אֲדֹנִיָּה

Small occurrence base¶

The occurrence base of a word are the scrolls in which occurs.

We compute the occurrence base of each word, based on lexemes according to the glex feature.

In [24]:

occurrenceBase1 = collections.defaultdict(set)

A.indent(reset=True)
A.info("compiling occurrence base ...")
for w in F.otype.s("word"):
    scroll = T.sectionFromNode(w)[0]
    occurrenceBase1[F.glex.v(w)].add(scroll)
A.info(f"{len(occurrenceBase1)} entries")

  0.00s compiling occurrence base ...
  2.83s 8265 entries

Wow, that took long!

We looked up the scroll for each word.

But there is another way:

Start with scrolls, and iterate through their words.

In [25]:

occurrenceBase2 = collections.defaultdict(set)

A.indent(reset=True)
A.info("compiling occurrence base ...")
for s in F.otype.s("scroll"):
    scroll = F.scroll.v(s)
    for w in L.d(s, otype="word"):
        occurrenceBase2[F.glex.v(w)].add(scroll)
A.info("done")
A.info(f"{len(occurrenceBase2)} entries")

  0.00s compiling occurrence base ...
  0.19s done
  0.19s 8265 entries

Much better. Are the results equal?

In [26]:

occurrenceBase1 == occurrenceBase2

Out[26]:

True

Yes.

In [27]:

occurrenceBase = occurrenceBase2

An overview of how many words have how big occurrence bases:

In [28]:

occurrenceSize = collections.Counter()

for (w, scrolls) in occurrenceBase.items():
    occurrenceSize[len(scrolls)] += 1

occurrenceSize = sorted(
    occurrenceSize.items(),
    key=lambda x: (-x[1], x[0]),
)

for (size, amount) in occurrenceSize[0:10]:
    print(f"base size {size:>4} : {amount:>5} words")
print("...")
for (size, amount) in occurrenceSize[-10:]:
    print(f"base size {size:>4} : {amount:>5} words")

base size    1 :  2789 words
base size    2 :  1109 words
base size    3 :   692 words
base size    4 :   462 words
base size    5 :   335 words
base size    6 :   256 words
base size    7 :   219 words
base size    8 :   182 words
base size    9 :   177 words
base size   10 :   122 words
...
base size  457 :     1 words
base size  459 :     1 words
base size  538 :     1 words
base size  600 :     1 words
base size  605 :     1 words
base size  629 :     1 words
base size  745 :     1 words
base size  761 :     1 words
base size  844 :     1 words
base size  997 :     1 words

Let's give the predicate private to those words whose occurrence base is a single scroll.

In [29]:

privates = {w for (w, base) in occurrenceBase.items() if len(base) == 1}
len(privates)

Out[29]:

Peculiarity of scrolls¶

As a final exercise with scrolls, lets make a list of all scrolls, and show their

total number of words
number of private words
the percentage of private words: a measure of the peculiarity of the scroll

In [30]:

scrollList = []

empty = set()
ordinary = set()

for d in F.otype.s("scroll"):
    scroll = T.scrollName(d)
    words = {F.glex.v(w) for w in L.d(d, otype="word")}
    a = len(words)
    if not a:
        empty.add(scroll)
        continue
    o = len({w for w in words if w in privates})
    if not o:
        ordinary.add(scroll)
        continue
    p = 100 * o / a
    scrollList.append((scroll, a, o, p))

scrollList = sorted(scrollList, key=lambda e: (-e[3], -e[1], e[0]))

print(f"Found {len(empty):>4} empty scrolls")
print(f"Found {len(ordinary):>4} ordinary scrolls (i.e. without private words)")

Found    0 empty scrolls
Found  507 ordinary scrolls (i.e. without private words)

In [31]:

print(
    "{:<20}{:>5}{:>5}{:>5}\n{}".format(
        "scroll",
        "#all",
        "#own",
        "%own",
        "-" * 35,
    )
)

for x in scrollList[0:20]:
    print("{:<20} {:>4} {:>4} {:>4.1f}%".format(*x))
print("...")
for x in scrollList[-20:]:
    print("{:<20} {:>4} {:>4} {:>4.1f}%".format(*x))

scroll               #all #own %own
-----------------------------------
4Q341                  32   21 65.6%
4Q340                  15    5 33.3%
11Q26                   6    2 33.3%
4Q313a                  3    1 33.3%
4Q358                   3    1 33.3%
4Q347                  10    3 30.0%
4Q124                  86   25 29.1%
4Q282d                  7    2 28.6%
1Q70bis                11    3 27.3%
1Q70                   24    6 25.0%
4Q346a                  4    1 25.0%
4Q357                   4    1 25.0%
1Q41                    9    2 22.2%
3Q15                  269   58 21.6%
4Q561                  73   15 20.5%
4Q559                 129   26 20.2%
4Q360a                 20    4 20.0%
1Q58                    5    1 20.0%
4Q250b                  5    1 20.0%
4Q468bb                 5    1 20.0%
...
4Q427                 343    2  0.6%
4Q2                   174    1  0.6%
4Q366                 185    1  0.5%
4Q98                  192    1  0.5%
4Q56                  963    5  0.5%
4Q394                 194    1  0.5%
4Q59                  404    2  0.5%
4Q88                  208    1  0.5%
11Q20                 429    2  0.5%
4Q57                  875    4  0.5%
11Q11                 222    1  0.5%
4Q58                  450    2  0.4%
4Q174                 241    1  0.4%
4Q13                  257    1  0.4%
4Q524                 280    1  0.4%
4Q271                 293    1  0.3%
4Q84                  350    1  0.3%
4Q33                  365    1  0.3%
4Q428                 385    1  0.3%
1QpHab                463    1  0.2%

Tip¶

See the lexeme recipe in the cookbook for how you get from a lexeme node to its word occurrence nodes.

Locality API¶

We travel upwards and downwards, forwards and backwards through the nodes. The Locality-API (L) provides functions: u() for going up, and d() for going down, n() for going to next nodes and p() for going to previous nodes.

These directions are indirect notions: nodes are just numbers, but by means of the oslots feature they are linked to slots. One node contains an other node, if the one is linked to a set of slots that contains the set of slots that the other is linked to. And one if next or previous to an other, if its slots follow or precede the slots of the other one.

L.u(node) Up is going to nodes that embed node.

L.d(node) Down is the opposite direction, to those that are contained in node.

L.n(node) Next are the next adjacent nodes, i.e. nodes whose first slot comes immediately after the last slot of node.

L.p(node) Previous are the previous adjacent nodes, i.e. nodes whose last slot comes immediately before the first slot of node.

All these functions yield nodes of all possible node types. By passing an optional parameter, you can restrict the results to nodes of that type.

The result are ordered according to the order of things in the text.

The functions return always a tuple, even if there is just one node in the result.

Going up¶

We go from the first word to the scroll it contains. Note the [0] at the end. You expect one scroll, yet L returns a tuple. To get the only element of that tuple, you need to do that [0].

If you are like me, you keep forgetting it, and that will lead to weird error messages later on.

In [32]:

firstScroll = L.u(1, otype="scroll")[0]
print(firstScroll)

And let's see all the containing objects of sign 3:

In [33]:

s = 3
for otype in F.otype.all:
    if otype == F.otype.slotType:
        continue
    up = L.u(s, otype=otype)
    upNode = "x" if len(up) == 0 else up[0]
    print("sign {} is contained in {} {}".format(s, otype, upNode))

sign 3 is contained in scroll 1605868
sign 3 is contained in lex 1542524
sign 3 is contained in fragment 1531341
sign 3 is contained in line 1552973
sign 3 is contained in clause x
sign 3 is contained in cluster x
sign 3 is contained in phrase x
sign 3 is contained in word 1606870

Going next¶

Let's go to the next nodes of the first scroll.

In [34]:

afterFirstScroll = L.n(firstScroll)
for n in afterFirstScroll:
    print(
        "{:>7}: {:<13} first slot={:<6}, last slot={:<6}".format(
            n,
            F.otype.v(n),
            E.oslots.s(n)[0],
            E.oslots.s(n)[-1],
        )
    )
secondScroll = L.n(firstScroll, otype="scroll")[0]

  17149: sign          first slot=17149 , last slot=17149 
1612982: word          first slot=17149 , last slot=17149 
1553387: line          first slot=17149 , last slot=17176 
1531359: fragment      first slot=17149 , last slot=18207 
1605869: scroll        first slot=17149 , last slot=33885

Going previous¶

And let's see what is right before the second scroll.

In [35]:

for n in L.p(secondScroll):
    print(
        "{:>7}: {:<13} first slot={:<6}, last slot={:<6}".format(
            n,
            F.otype.v(n),
            E.oslots.s(n)[0],
            E.oslots.s(n)[-1],
        )
    )

1605868: scroll        first slot=1     , last slot=17148 
1531358: fragment      first slot=15658 , last slot=17148 
1553386: line          first slot=17099 , last slot=17148 
1612981: word          first slot=17147 , last slot=17148 
  17148: sign          first slot=17148 , last slot=17148

Going down¶

We go to the fragments of the first scroll, and just count them.

In [36]:

fragments = L.d(firstScroll, otype="fragment")
print(len(fragments))

The first line¶

We pick two nodes and explore what is above and below them: the first line and the first word.

In [37]:

for n in [
    F.otype.s("word")[0],
    F.otype.s("line")[0],
]:
    A.indent(level=0)
    A.info("Node {}".format(n), tm=False)
    A.indent(level=1)
    A.info("UP", tm=False)
    A.indent(level=2)
    A.info("\n".join(["{:<15} {}".format(u, F.otype.v(u)) for u in L.u(n)]), tm=False)
    A.indent(level=1)
    A.info("DOWN", tm=False)
    A.indent(level=2)
    A.info("\n".join(["{:<15} {}".format(u, F.otype.v(u)) for u in L.d(n)]), tm=False)
A.indent(level=0)
A.info("Done", tm=False)

Node 1606869
   |   UP
   |      |   1542523         lex
   |      |   1552973         line
   |      |   1531341         fragment
   |      |   1605868         scroll
   |   DOWN
   |      |   2               sign
Node 1552973
   |   UP
   |      |   1531341         fragment
   |      |   1605868         scroll
   |   DOWN
   |      |   1430242         cluster
   |      |   1               sign
   |      |   1606869         word
   |      |   2               sign
   |      |   1606870         word
   |      |   3               sign
   |      |   4               sign
   |      |   5               sign
   |      |   1606871         word
   |      |   6               sign
   |      |   7               sign
   |      |   8               sign
   |      |   9               sign
   |      |   1606872         word
   |      |   10              sign
   |      |   11              sign
   |      |   1606873         word
   |      |   12              sign
   |      |   13              sign
   |      |   14              sign
   |      |   15              sign
   |      |   16              sign
   |      |   1606874         word
   |      |   17              sign
   |      |   18              sign
   |      |   19              sign
   |      |   1606875         word
   |      |   20              sign
   |      |   1606876         word
   |      |   21              sign
   |      |   22              sign
   |      |   23              sign
   |      |   24              sign
   |      |   1606877         word
   |      |   25              sign
   |      |   1606878         word
   |      |   26              sign
   |      |   27              sign
   |      |   28              sign
   |      |   29              sign
Done

Text API¶

So far, we have mainly seen nodes and their numbers, and the names of node types. You would almost forget that we are dealing with text. So let's try to see some text.

In the same way as F gives access to feature data, T gives access to the text. That is also feature data, but you can tell Text-Fabric which features are specifically carrying the text, and in return Text-Fabric offers you a Text API: T.

Formats¶

DSS text can be represented in a number of ways:

orig: unicode
trans: ETCBC transcription
source: as in Abegg's data files

All three can be represented in two flavours:

full: all glyphs, but no bracketings and flags
extra: everything

If you wonder where the information about text formats is stored: not in the program text-fabric, but in the data set. It has a feature otext, which specifies the formats and which features must be used to produce them. otext is the third special feature in a TF data set, next to otype and oslots. It is an optional feature. If it is absent, there will be no T API.

Here is a list of all available formats in this data set.

In [38]:

T.formats

Out[38]:

{'lex-default': 'word',
 'lex-orig-full': 'word',
 'lex-source-full': 'word',
 'lex-trans-full': 'word',
 'morph-source-full': 'word',
 'text-orig-extra': 'word',
 'text-orig-full': 'sign',
 'text-source-extra': 'word',
 'text-source-full': 'sign',
 'text-trans-extra': 'word',
 'text-trans-full': 'sign',
 'layout-orig-full': 'sign',
 'layout-source-full': 'sign',
 'layout-trans-full': 'sign'}

Using the formats¶

The T.text() function is central to get text representations of nodes. Its most basic usage is

T.text(nodes, fmt=fmt)

where nodes is a list or iterable of nodes, usually word nodes, and fmt is the name of a format. If you leave out fmt, the default text-orig-full is chosen.

The result is the text in that format for all nodes specified:

You see for each format in the list above its intended level of operation: sign or word.

If TF formats a node according to a defined text-format, it will descend to constituent nodes and represent those constituent nodes.

In this case, the formats ending in -extra specify the word level as the descend type. Because, in this dataset, the features that contain the text-critical brackets are only defined at the word level. At the sign level, those brackets are no longer visible, but they have left their traces in other features.

If we do not specify a format, the default format is used (text-orig-full).

We examine a portion of biblical material at the start 1Q1.

In [39]:

fragmentNode = T.nodeFromSection(("1Q1", "f1"))
fragmentNode

Out[39]:

In [40]:

signs = L.d(fragmentNode, otype="sign")
words = L.d(fragmentNode, otype="word")
lines = L.d(fragmentNode, otype="line")
print(
    f"""
Fragment {T.sectionFromNode(fragmentNode)} with
  {len(signs):>3} signs
  {len(words):>3} words
  {len(lines):>3} lines
"""
)

Fragment ('1Q1', 'f1') with
  157 signs
   57 words
    3 lines

In [41]:

T.text(signs[0:100])

Out[41]:

'וירא אלהים כי טוב ׃ ויהי ערב ויהי בקר יום רביעי ׃ ויאמר ╱ אלהים ישרוצו המים שרץ נפש חיה ועוף יעופף על הארץ על פני רקיע השמים '

In [42]:

T.text(words[0:20])

Out[42]:

'וירא אלהים כי טוב ׃ ויהי ערב ויהי בקר יום רביעי ׃ ויאמר אלהים ישרוצו ה'

In [43]:

T.text(lines[0:2])

Out[43]:

'וירא אלהים כי טוב ׃ ויהי ערב ויהי בקר יום רביעי ׃ ויאמר ╱ אלהים ישרוצו המים שרץ נפש חיה ועוף יעופף על הארץ על פני רקיע השמים ׃ ╱ '

The `-extra` formats¶

In order to use non-default formats, we have to specify them in the fmt parameter.

In [44]:

T.text(signs[0:100], fmt="text-orig-extra")

Out[44]:

''

We do not get much, let's ask why.

In [45]:

T.text(signs[0:2], fmt="text-orig-extra", explain=True)

EXPLANATION: T.text() called with parameters:
	nodes  : iterable of 2 nodes
	fmt    : text-orig-extra targeted at word
	descend: implicit
	func   : no custom format implementation

	NODE: sign 770999
		TARGET LEVEL: word  (descend=None) (format target type)
		EXPANSION: 0 words 
		FORMATTING: explicit text-orig-extra does <function Text._compileFormat.<locals>.g at 0x3539e9080>
		MATERIAL:
	NODE: sign 771000
		TARGET LEVEL: word  (descend=None) (format target type)
		EXPANSION: 0 words 
		FORMATTING: explicit text-orig-extra does <function Text._compileFormat.<locals>.g at 0x3539e9080>
		MATERIAL:

Out[45]:

''

The reason can be found in TARGET LEVEL: word and EXPANSION 0 words. We are applying the word targeted format text-orig-extra to a sign, which does not contain words.

In [46]:

T.text(words[0:20], fmt="text-orig-extra")

Out[46]:

'[ וירא אל ]הים כי [ טוב ׃ ויהי ערב ויהי בקר יום רביעי ׃ ויאמר ] [ אלהים יש ]רוצו ה'

In [47]:

T.text(lines[0:2], fmt="text-orig-extra")

Out[47]:

'[ וירא אל ]הים כי [ טוב ׃ ויהי ערב ויהי בקר יום רביעי ׃ ויאמר ] [ אלהים יש ]רוצו המים שר#[ ץ נפש חיה ועוף יעופף על הארץ על פני רקיע השמים ׃ '

Note that the direction of the brackets look wrong, because they have not been adapted to the right-to-left writing direction.

We can view them in ETCBC transcription as well:

In [48]:

T.text(words[0:20], fmt="text-trans-extra")

Out[48]:

'[ WJR> >L ]HJm KJ [ VWB 00 WJHJ <RB WJHJ BQR JWm RBJ<J 00 WJ>MR ] [ >LHJm J# ]RWYW H'

In [49]:

T.text(lines[0:2], fmt="text-trans-extra")

Out[49]:

'[ WJR> >L ]HJm KJ [ VWB 00 WJHJ <RB WJHJ BQR JWm RBJ<J 00 WJ>MR ] [ >LHJm J# ]RWYW HMJm #R#[ y NP# XJH W<Wp J<WPp <L H>Ry <L PNJ RQJ< H#MJm 00 '

Or in Abegg's source encoding:

In [50]:

T.text(words[0:20], fmt="text-source-extra")

Out[50]:

']wyra al[hyM ky ]fwb . wyhy orb wyhy bqr ywM rbyoy . wyamr[ ]alhyM yC[rwxw h'

In [51]:

T.text(lines[0:2], fmt="text-source-extra")

Out[51]:

']wyra al[hyM ky ]fwb . wyhy orb wyhy bqr ywM rbyoy . wyamr[ ]alhyM yC[rwxw hmyM Cr«]X npC jyh wowP yowpP ol harX ol pny rqyo hCmyM . '

The function T.text() works with nodes of many types.

We compose a set of example nodes and run T.text on them:

In [52]:

exampleNodes = [
    F.otype.s("sign")[1],
    F.otype.s("word")[1],
    F.otype.s("cluster")[0],
    F.otype.s("line")[0],
    F.otype.s("fragment")[0],
    F.otype.s("scroll")[0],
    F.otype.s("lex")[1],
]
exampleNodes

Out[52]:

[2, 1606870, 1430242, 1552973, 1531341, 1605868, 1542524]

In [53]:

for n in exampleNodes:
    print(f"This is {F.otype.v(n)} {n}:")
    text = T.text(n)
    if len(text) > 200:
        text = text[0:200] + f"\nand {len(text) - 200} characters more"
    print(text)
    print("")

This is sign 2:
ו

This is word 1606870:
עתה 

This is cluster 1430242:
  

This is line 1552973:
  ועתה שמעו כל יודעי צדק ובינו במעשי 

This is fragment 1531341:
  ועתה שמעו כל יודעי צדק ובינו במעשי אל ׃ כי ריב ל׳ו עם כל בשר ומשפט יעשה בכל מנאצי׳ו ׃ כי במועל׳ם אשר עזבו׳הו הסתיר פני׳ו מישראל וממקדש׳ו ויתנ׳ם לחרב ׃ ובזכר׳ו ברית ראשנים השאיר שאירית לישראל ולא נתנ
and 827 characters more

This is scroll 1605868:
  ועתה שמעו כל יודעי צדק ובינו במעשי אל ׃ כי ריב ל׳ו עם כל בשר ומשפט יעשה בכל מנאצי׳ו ׃ כי במועל׳ם אשר עזבו׳הו הסתיר פני׳ו מישראל וממקדש׳ו ויתנ׳ם לחרב ׃ ובזכר׳ו ברית ראשנים השאיר שאירית לישראל ולא נתנ
and 21145 characters more

This is lex 1542524:
h-עַתָּה-<AT.@H-oAt;Dh

Look at the last case, the lexeme node: obviously, the text-format that has been invoked provides the language (h) of the lexeme, plus its representations in UNICODE, ETCBC, and Abegg transcription.

But what format exactly has been invoked? Let's ask.

In [54]:

T.text(exampleNodes[-1], explain=True)

EXPLANATION: T.text() called with parameters:
	nodes  : single node
	fmt    : implicit
	descend: implicit
	func   : no custom format implementation

	NODE: lex 1542524
		TARGET LEVEL: lex (no expansion needed) (descend=None) (format target type)
		EXPANSION: 1 lex 1542524
		FORMATTING: implicit lex-default does <function Text._compileFormat.<locals>.g at 0x3539e82c0>
		MATERIAL:
			lex 1542524 ADDS "h-עַתָּה-<AT.@H-oAt;Dh "

Out[54]:

'h-עַתָּה-<AT.@H-oAt;Dh '

The clue is in FORMATTING: implicit lex-default.

Remember that we saw the format lex-default in T.formats.

The Text-API has matched the type of the lexeme node we provided with this default format and applies it, thereby skipping the expansion of the lexeme node to its occurrences.

But we can force the expansion:

In [55]:

T.text(exampleNodes[-1], fmt="lex-default", descend=True)

Out[55]:

'h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh h-עַתָּה-<AT.@H-oAt;Dh '

Using the formats¶

Now let's use those formats to print out the first biblical line in this corpus.

Note that the formats starting with layout- are not usable for this. Also the format lex-default is not useful, so we leave that out as well.

For the layout- formats, see display.

In [56]:

usefulFormats = [
    fmt
    for fmt in sorted(T.formats)
    if not fmt.startswith("layout-") and not fmt == "lex-default"
]
len(usefulFormats)

Out[56]:

In [57]:

firstLine = T.nodeFromSection(("1Q1", "f1", "1"))
for fmt in usefulFormats:
    if not fmt.startswith("layout-"):
        print(
            "{}:\n\t{}\n".format(
                fmt,
                T.text(firstLine, fmt=fmt),
            )
        )

lex-orig-full:
	h-וְh-ראה h-אֱלֹהִים h-כִּי h-טֹוב h-׃ h-וְh-היה h-עֶרֶב h-וְh-היה h-בֹּקֶר h-יֹום h-רְבִיעִי h-׃ h-וְh-אמר 

lex-source-full:
	h-w◊h-rah h-aTløhIyM h-k;Iy h-føwb h-. h-w◊h-hyh h-oRr®b h-w◊h-hyh h-b;Oq®r h-yøwM h-r√bIyoIy h-. h-w◊h-amr 

lex-trans-full:
	h-W:h-R>H h->:ELOHIJm h-K.IJ h-VOWB h-00 h-W:h-HJH h-<EREB h-W:h-HJH h-B.OQER h-JOWm h-R:BIJ<IJ h-00 h-W:h->MR 

morph-source-full:
	Pcvqw3msj ncmp Pc ams . Pcvqw3msj ncms Pcvqw3msj ncms ncms uomsa . Pcvqw3ms 

text-orig-extra:
	[ וירא אל ]הים כי [ טוב ׃ ויהי ערב ויהי בקר יום רביעי ׃ ויאמר ] 

text-orig-full:
	וירא אלהים כי טוב ׃ ויהי ערב ויהי בקר יום רביעי ׃ ויאמר ╱ 

text-source-extra:
	]wyra al[hyM ky ]fwb . wyhy orb wyhy bqr ywM rbyoy . wyamr[ 

text-source-full:
	wyra alhyM ky fwb . wyhy orb wyhy bqr ywM rbyoy . wyamr ╱ 

text-trans-extra:
	[ WJR> >L ]HJm KJ [ VWB 00 WJHJ <RB WJHJ BQR JWm RBJ<J 00 WJ>MR ] 

text-trans-full:
	WJR> >LHJm KJ VWB 00 WJHJ <RB WJHJ BQR JWm RBJ<J 00 WJ>MR ╱

Whole text in all formats in a few seconds¶

Part of the pleasure of working with computers is that they can crunch massive amounts of data. The text of the Dead Sea Scrolls is a piece of cake.

It takes just a few seconds to have that cake and eat it. In all useful formats.

In [58]:

A.indent(reset=True)
A.info("writing plain text of all scrolls in all text formats")

text = collections.defaultdict(list)

for ln in F.otype.s("line"):
    for fmt in usefulFormats:
        if fmt.startswith("text-"):
            text[fmt].append(T.text(ln, fmt=fmt, descend=True))

A.info("done {} formats".format(len(text)))

for fmt in sorted(text):
    print("{}\n{}\n".format(fmt, "\n".join(text[fmt][0:5])))

  0.00s writing plain text of all scrolls in all text formats
  4.31s done 6 formats
text-orig-extra
ועתה שמעו כל יודעי צדק ובינו במעשי 
אל ׃ כי ריב ל׳ו עם כל בשר ומשפט יעשה בכל מנאצי׳ו ׃ 
כי במועל׳ם אשר עזבו׳הו הסתיר פני׳ו מישראל וממקדש׳ו 
ו?יתנ׳ם לחרב ׃ ובזכר׳ו ברית ראשנים השאיר שאירית 
לישראל ולא נתנ׳ם לכלה ׃ ובקץ חרון שנים שלוש מאות 

text-orig-full
  ועתה שמעו כל יודעי צדק ובינו במעשי 
אל ׃ כי ריב ל׳ו עם כל בשר ומשפט יעשה בכל מנאצי׳ו ׃ 
כי במועל׳ם אשר עזבו׳הו הסתיר פני׳ו מישראל וממקדש׳ו 
ויתנ׳ם לחרב ׃ ובזכר׳ו ברית ראשנים השאיר שאירית 
לישראל ולא נתנ׳ם לכלה ׃ ובקץ חרון שנים שלוש מאות 

text-source-extra
woth Cmow kl ywdoy xdq wbynw bmoCy 
al . ky ryb l/w oM kl bCr wmCpf yoCh bkl mnaxy/w . 
ky bmwol/M aCr ozbw/hw hstyr pny/w myCral wmmqdC/w 
wØytn/M ljrb . wbzkr/w bryt raCnyM hCayr Cayryt 
lyCral wla ntn/M lklh . wbqX jrwN CnyM ClwC mawt 

text-source-full
□ woth Cmow kl ywdoy xdq wbynw bmoCy 
al . ky ryb l/w oM kl bCr wmCpf yoCh bkl mnaxy/w . 
ky bmwol/M aCr ozbw/hw hstyr pny/w myCral wmmqdC/w 
wytn/M ljrb . wbzkr/w bryt raCnyM hCayr Cayryt 
lyCral wla ntn/M lklh . wbqX jrwN CnyM ClwC mawt 

text-trans-extra
W<TH #M<W KL JWD<J YDQ WBJNW BM<#J 
>L 00 KJ RJB L'W <m KL B#R WM#PV J<#H BKL MN>YJ'W 00 
KJ BMW<L'm >#R <ZBW'HW HSTJR PNJ'W MJ#R>L WMMQD#'W 
W?JTN'm LXRB 00 WBZKR'W BRJT R>#NJm H#>JR #>JRJT 
LJ#R>L WL> NTN'm LKLH 00 WBQy XRWn #NJm #LW# M>WT 

text-trans-full
  W<TH #M<W KL JWD<J YDQ WBJNW BM<#J 
>L 00 KJ RJB L'W <m KL B#R WM#PV J<#H BKL MN>YJ'W 00 
KJ BMW<L'm >#R <ZBW'HW HSTJR PNJ'W MJ#R>L WMMQD#'W 
WJTN'm LXRB 00 WBZKR'W BRJT R>#NJm H#>JR #>JRJT 
LJ#R>L WL> NTN'm LKLH 00 WBQy XRWn #NJm #LW# M>WT

The full plain text¶

We write all formats to file, in your Downloads folder.

In [59]:

for fmt in T.formats:
    if fmt.startswith("text-"):
        with open(
            os.path.expanduser(f"~/Downloads/{fmt}.txt"),
            "w",
            # encoding='utf8',
        ) as f:
            f.write("\n".join(text[fmt]))

(if this errors, uncomment the line with encoding)

Sections¶

A section in the DSS is a scroll, a fragment or a line. Knowledge of sections is not baked into Text-Fabric. The config feature otext.tf may specify three section levels, and tell what the corresponding node types and features are.

From that knowledge it can construct mappings from nodes to sections, e.g. from line nodes to tuples of the form:

(scroll acronym, fragment label, line number)

You can get the section of a node as a tuple of relevant scroll, fragment, and line nodes. Or you can get it as a passage label, a string.

You can ask for the passage corresponding to the first slot of a node, or the one corresponding to the last slot.

If you are dealing with scroll and fragment nodes, you can ask to fill out the line and fragment parts as well.

Here are examples of getting the section that corresponds to a node and vice versa.

NB: sectionFromNode always delivers a line specification, either from the first slot belonging to that node, or, if lastSlot, from the last slot belonging to that node.

In [60]:

someNodes = (
    F.otype.s("sign")[100000],
    F.otype.s("word")[10000],
    F.otype.s("cluster")[5000],
    F.otype.s("line")[15000],
    F.otype.s("fragment")[1000],
    F.otype.s("scroll")[500],
)

In [61]:

for n in someNodes:
    nType = F.otype.v(n)
    d = f"{n:>7} {nType}"
    first = A.sectionStrFromNode(n)
    last = A.sectionStrFromNode(n, lastSlot=True, fillup=True)
    tup = (
        T.sectionTuple(n),
        T.sectionTuple(n, lastSlot=True, fillup=True),
    )
    print(f"{d:<16} - {first:<18} {last:<18} {tup}")

 100001 sign     - 1QHa 25:31         1QHa 25:31         ((1605874, 1531445, 1555227), (1605874, 1531445, 1555227))
1616869 word     - 1QS 8:10           1QS 8:10           ((1605869, 1531366, 1553578), (1605869, 1531366, 1553578))
1435242 cluster  - 1Q29 f2:3          1Q29 f2:3          ((1605890, 1531685, 1556400), (1605890, 1531685, 1556400))
1567973 line     - 4Q368 f3:4         4Q368 f3:4         ((1606221, 1534207, 1567973), (1606221, 1534207, 1567973))
1532341 fragment - 4Q186 f2ii         4Q186 f2ii:3       ((1605991, 1532341), (1605991, 1532341, 1559220))
1606368 scroll   - 4Q471b             4Q471b f1a_d:10    ((1606368,), (1606368, 1536089, 1575660))

Clean caches¶

Text-Fabric pre-computes data for you, so that it can be loaded faster. If the original data is updated, Text-Fabric detects it, and will recompute that data.

But there are cases, when the algorithms of Text-Fabric have changed, without any changes in the data, that you might want to clear the cache of precomputed results.

There are two ways to do that:

Locate the .tf directory of your dataset, and remove all .tfx files in it. This might be a bit awkward to do, because the .tf directory is hidden on Unix-like systems.
Call TF.clearCache(), which does exactly the same.

It is not handy to execute the following cell all the time, that's why I have commented it out. So if you really want to clear the cache, remove the comment sign below.

In [62]:

# TF.clearCache()

Next steps¶

By now you have an impression how to compute around in the corpus. While this is still the beginning, I hope you already sense the power of unlimited programmatic access to all the bits and bytes in the data set.

Here are a few directions for unleashing that power.

display become an expert in creating pretty displays of your text structures
search turbo charge your hand-coding with search templates
exportExcel make tailor-made spreadsheets out of your results
share draw in other people's data and let them use yours
similarLines spot the similarities between lines

See the cookbook for recipes for small, concrete tasks.

CC-BY Dirk Roorda

Tutorial¶

Cookbook¶

Installing Text-Fabric¶

Tip¶

Data¶

Features¶

Incantation¶

API¶

Search¶

Counting¶

Count all nodes¶

What are those nodes?¶

Count individual object types¶

Viewing textual objects¶

Feature statistics¶

Word matters¶

Top 20 frequent words¶

Word distribution¶

Hapaxes¶

Small occurrence base¶

Peculiarity of scrolls¶

Tip¶

Locality API¶

Going up¶

Going next¶

Going previous¶

Going down¶

The first line¶

Text API¶

Formats¶

Using the formats¶

The -extra formats¶

Using the formats¶

Whole text in all formats in a few seconds¶

The full plain text¶

Sections¶

Clean caches¶

Next steps¶

The `-extra` formats¶