This notebook gets you started with using Text-Fabric for coding in the Peshitta, the Syriac Old Testament.
Familiarity with the underlying data model is recommended.
If you start computing with this tutorial, first copy its parent directory to somewhere else, outside your repository. That way, if you pull changes from the repository later, your work will not be overwritten. Where you put your tutorial directory is up to you; it will work from any directory.
%load_ext autoreload
%autoreload 2
import os
import collections
from tf.app import use
The data of the corpus is organized in features. They are columns of data. Think of the text as a gigantic spreadsheet, where row 1 corresponds to the first word, row 2 to the second word, and so on, for all 400,000+ words.
The letters of each word form a column in that spreadsheet.
The corpus contains about a dozen columns: not only for the words, but also for textual objects, such as books, chapters, and verses.
Instead of putting that information in one big table, the data is organized in separate columns. We call those columns features.
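To make the column metaphor concrete, here is a toy sketch in plain Python (not the real Text-Fabric implementation): each feature is essentially a mapping from node numbers (rows) to values (the cells of one column). The sample values are the first words of Genesis in ETCBC transliteration, as they appear later in this tutorial.

```python
# Toy sketch of the feature model: a feature maps node numbers to values.
word_etcbc = {1: "BRCJT", 2: "BR>", 3: ">LH>"}  # first three words of Genesis
otype = {1: "word", 2: "word", 3: "word"}

def feature_value(feature, node):
    # Analogous to F.<feature>.v(node) in Text-Fabric
    return feature.get(node)

print(feature_value(word_etcbc, 1))  # BRCJT
print(feature_value(otype, 3))       # word
```

The real data is stored far more compactly, of course, but the mental model of "one column per feature, one row per node" is exactly this.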
For the very last version, use the checkout specifier hot.
For the latest release, use latest.
If you have cloned the repos (TF app and data), use clone.
If you do not want or need to upgrade, leave out the checkout specifier.
A = use("etcbc/peshitta", hoist=globals())
This is Text-Fabric 9.2.2 Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html 12 features found and 0 ignored
The result of all this is that we have a bunch of special variables at our disposal that give us access to the text and data of the Peshitta.
At this point it is helpful to throw a quick glance at the text-fabric API documentation (see the links under API Members above).
The most essential thing for now is that we can use F to access the data in the features we've loaded. But there is more, such as N, which helps us to walk over the text, as we will see in a minute.
In order to get acquainted with the data, we start with the simple task of counting.
We use the N.walk() generator to walk through the nodes.
We compared the corpus to a gigantic spreadsheet, where the rows correspond to the words. In Text-Fabric, we call the rows slots, because they are the textual positions that can be filled with words.
We also mentioned that there are more textual objects: the verses, chapters and books. They also correspond to rows in the big spreadsheet. In Text-Fabric we call all these rows nodes, and the N.walk() generator carries us through those nodes in textual order.
Just one extra thing: the info statements generate timed messages. If you use them instead of print, you'll get a sense of the amount of time that the various processing steps typically need.
A.indent(reset=True)
A.info("Counting nodes ...")
i = 0
for n in N.walk():
i += 1
A.info("{} nodes".format(i))
0.00s Counting nodes ... 0.06s 459510 nodes
Every node has a type, such as word, verse, chapter, or book. We know that we have approximately 425,000 words and a few other nodes. But what exactly are they?
Text-Fabric has two special features, otype and oslots, that must occur in every Text-Fabric data set. otype tells you the type of each node, and you can ask for the number of slots in the text.
Here we go!
F.otype.slotType
'word'
F.otype.maxSlot
426835
F.otype.maxNode
459510
F.otype.all
('book', 'chapter', 'verse', 'word')
C.levels.data
(('book', 6566.692307692308, 426836, 426900), ('chapter', 336.3553979511426, 426901, 428169), ('verse', 13.619061293513289, 428170, 459510), ('word', 1, 1, 426835))
This is interesting: above you see all the textual objects, with the average size of their objects, the node where they start, and the node where they end.
This is an intuitive way to count the number of nodes in each type.
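Indeed, since each entry of C.levels.data gives the node type, its average size, and its first and last node numbers, the count per type is simply last - first + 1. A quick plain-Python sketch using the values printed above:

```python
# Derive node counts per type from the (type, avgSize, first, last) tuples
levels = (
    ("book", 6566.692307692308, 426836, 426900),
    ("chapter", 336.3553979511426, 426901, 428169),
    ("verse", 13.619061293513289, 428170, 459510),
    ("word", 1, 1, 426835),
)
for (otype, avgSize, first, last) in levels:
    print(f"{otype:<8} {last - first + 1:>7} nodes")
```

The numbers this produces (65 books, 1269 chapters, 31341 verses, 426835 words) agree with the counting loop below.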
Note in passing how we use indent in conjunction with info to produce neat, timed, indented progress messages.
A.indent(reset=True)
A.info("counting objects ...")
for otype in F.otype.all:
i = 0
A.indent(level=1, reset=True)
for n in F.otype.s(otype):
i += 1
A.info("{:>7} {}s".format(i, otype))
A.indent(level=0)
A.info("Done")
0.00s counting objects ... | 0.00s 65 books | 0.00s 1269 chapters | 0.00s 31341 verses | 0.04s 426835 words 0.05s Done
We use the A API (the extra power) to peek into the corpus.
Let's inspect some words.
wordShow = (1000, 10000, 100000)
for word in wordShow:
A.pretty(word, withNodes=True)
F gives access to all features. Every feature has a method freqList() to generate a frequency list of its values, higher frequencies first.
Here are the words in ETCBC transliteration, only the top 10:
F.word_etcbc.freqList()[0:10]
(('MN', 8920), ('<L', 5323), ('MRJ>', 4587), ('L>', 4459), ('MVL', 3971), ('WL>', 3218), ('>JK', 3206), ('LH', 3127), ('>NWN', 2623), ('HW>', 2488))
We travel upwards and downwards, forwards and backwards through the nodes. The Layer API (L) provides functions: u() for going up, d() for going down, n() for going to next nodes, and p() for going to previous nodes.
These directions are indirect notions: nodes are just numbers, but by means of the oslots feature they are linked to slots. One node contains another node if the one is linked to a set of slots that contains the set of slots that the other is linked to. And one node is next or previous to another if its slots follow or precede the slots of the other one.
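Here is a toy illustration of that containment rule in plain Python (not the TF internals), using slot ranges that appear later in this tutorial: verse node 428170 spans slots 1-7, and book node 426836 spans slots 1-20080.

```python
# Containment: node X contains node Y iff X's slot set is a superset of Y's
oslots = {
    428170: set(range(1, 8)),      # a verse linked to slots 1..7
    426836: set(range(1, 20081)),  # a book linked to slots 1..20080
}

def contains(outer, inner):
    return oslots[outer] >= oslots[inner]

print(contains(426836, 428170))  # True: the book contains the verse
print(contains(428170, 426836))  # False
```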
L.u(node) Up: the nodes that embed node.
L.d(node) Down: the opposite direction, the nodes that are contained in node.
L.n(node) Next: the adjacent nodes whose first slot comes immediately after the last slot of node.
L.p(node) Previous: the adjacent nodes whose last slot comes immediately before the first slot of node.
All these functions yield nodes of all possible node types. By passing an optional parameter, you can restrict the results to nodes of that type.
The results are ordered according to the order of things in the text.
The functions always return a tuple, even if there is just one node in the result.
We go from the first word to the book that contains it.
Note the [0] at the end. You expect one book, yet L returns a tuple. To get the only element of that tuple, you need that [0].
If you are like me, you keep forgetting it, and that will lead to weird error messages later on.
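If you want to guard against forgetting the [0] (or against empty results), a tiny helper like this can take the head of the tuple safely. This is my own convenience function, not part of the TF API:

```python
def first_or_none(nodes):
    # L.u() and friends always return a tuple; take its head safely
    return nodes[0] if nodes else None

print(first_or_none((426836,)))  # 426836
print(first_or_none(()))         # None
```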
firstBook = L.u(1, otype="book")[0]
print(firstBook)
426836
And let's see all the containing objects of word 3:
w = 3
for otype in F.otype.all:
if otype == F.otype.slotType:
continue
up = L.u(w, otype=otype)
upNode = "x" if len(up) == 0 else up[0]
print("word {} is contained in {} {}".format(w, otype, upNode))
word 3 is contained in book 426836 word 3 is contained in chapter 426901 word 3 is contained in verse 428170
Let's go to the next nodes of the first book.
afterFirstBook = L.n(firstBook)
for n in afterFirstBook:
print(
"{:>7}: {:<13} first slot={:<6}, last slot={:<6}".format(
n,
F.otype.v(n),
E.oslots.s(n)[0],
E.oslots.s(n)[-1],
)
)
secondBook = L.n(firstBook, otype="book")[0]
20081: word first slot=20081 , last slot=20081 429703: verse first slot=20081 , last slot=20091 426951: chapter first slot=20081 , last slot=20335 426837: book first slot=20081 , last slot=36417
And let's see what is right before the second book.
for n in L.p(secondBook):
print(
"{:>7}: {:<13} first slot={:<6}, last slot={:<6}".format(
n,
F.otype.v(n),
E.oslots.s(n)[0],
E.oslots.s(n)[-1],
)
)
426836: book first slot=1 , last slot=20080 426950: chapter first slot=19735 , last slot=20080 429702: verse first slot=20071 , last slot=20080 20080: word first slot=20080 , last slot=20080
We go to the chapters of the second book, and just count them.
chapters = L.d(secondBook, otype="chapter")
print(len(chapters))
40
We pick the first verse and the first word, and explore what is above and below them.
for n in [1, L.u(1, otype="verse")[0]]:
A.indent(level=0)
A.info("Node {}".format(n), tm=False)
A.indent(level=1)
A.info("UP", tm=False)
A.indent(level=2)
A.info("\n".join(["{:<15} {}".format(u, F.otype.v(u)) for u in L.u(n)]), tm=False)
A.indent(level=1)
A.info("DOWN", tm=False)
A.indent(level=2)
A.info("\n".join(["{:<15} {}".format(u, F.otype.v(u)) for u in L.d(n)]), tm=False)
A.indent(level=0)
A.info("Done", tm=False)
Node 1 | UP | | 428170 verse | | 426901 chapter | | 426836 book | DOWN | | Node 428170 | UP | | 426901 chapter | | 426836 book | DOWN | | 1 word | | 2 word | | 3 word | | 4 word | | 5 word | | 6 word | | 7 word Done
So far, we have mainly seen nodes and their numbers, and the names of node types. You would almost forget that we are dealing with text. So let's try to see some text.
In the same way as F gives access to feature data, T gives access to the text. That is also feature data, but you can tell Text-Fabric which features specifically carry the text, and in return Text-Fabric offers you a Text API: T.
Syriac text can be represented in a number of ways:
If you wonder where the information about text formats is stored: not in the program text-fabric, but in the data set. It has a feature otext, which specifies the formats and which features must be used to produce them. otext is the third special feature in a TF data set, next to otype and oslots.
It is an optional feature. If it is absent, there will be no T API.
Here is a list of all available formats in this data set.
sorted(T.formats)
['text-orig-full', 'text-trans-full']
We can pretty display in other formats:
for word in wordShow:
A.pretty(word, fmt="text-trans-full")
Now let's use those formats to print out the first verse of the Peshitta.
for fmt in sorted(T.formats):
print("{}:\n\t{}".format(fmt, T.text(range(1, 12), fmt=fmt)))
text-orig-full: ܒܪܫܝܬ ܒܪܐ ܐܠܗܐ. ܝܬ ܫܡܝܐ ܘܝܬ ܐܪܥܐ. ܐܪܥܐ ܗܘܬ ܬܘܗ ܘܒܘܗ̇. text-trans-full: BRCJT BR> >LH>=. JT CMJ> WJT >R<>=. >R<> HWT TWH WBWH^=.
If we do not specify a format, the default format (text-orig-full) is used.
print(T.text(range(1, 12)))
ܒܪܫܝܬ ܒܪܐ ܐܠܗܐ. ܝܬ ܫܡܝܐ ܘܝܬ ܐܪܥܐ. ܐܪܥܐ ܗܘܬ ܬܘܗ ܘܒܘܗ̇.
Part of the pleasure of working with computers is that they can crunch massive amounts of data. The text of the Peshitta is a piece of cake.
It takes just a couple of seconds to have that cake and eat it, in both formats.
A.indent(reset=True)
A.info("writing plain text of whole Peshitta in all formats")
text = collections.defaultdict(list)
for v in F.otype.s("verse"):
words = L.d(v, "word")
for fmt in sorted(T.formats):
text[fmt].append(T.text(words, fmt=fmt))
A.info("done {} formats".format(len(text)))
for fmt in sorted(text):
print("{}\n{}\n".format(fmt, "\n".join(text[fmt][0:5])))
0.00s writing plain text of whole Peshitta in all formats 1.58s done 2 formats text-orig-full ܒܪܫܝܬ ܒܪܐ ܐܠܗܐ. ܝܬ ܫܡܝܐ ܘܝܬ ܐܪܥܐ. ܐܪܥܐ ܗܘܬ ܬܘܗ ܘܒܘܗ̇. ܘܚܫܘܟܐ ܥܠ ܐ̈ܦܝ ܬܗܘܡܐ. ܘܪܘܚܗ ܕܐܠܗܐ ܡܪܚܦܐ ܥܠ ܐ̈ܦܝ ܡ̈ܝܐ. ܘܐܡ̣ܪ ܐܠܗܐ ܢܗܘܐ ܢܘܗܪܐ. ܘܗܘ̣ܐ ܢܘܗܪܐ. ܘܚ̣ܙܐ ܐܠܗܐ ܠܢܘܗܪܐ ܕܫܦܝܪ. ܘܦ̣ܪܫ ܐܠܗܐ ܒܝܬ ܢܘܗܪܐ ܠܚܫܘܟܐ. ܘܩ̣ܪܐ ܐܠܗܐ ܠܢܘܗܪܐ ܐܝܡܡܐ. ܘܠܚܫܘܟܐ ܩ̣ܪܐ ܠܠܝܐ. ܘܗܘ̣ܐ ܪܡܫܐ ܘܗܘܐ ܨܦܪܐ ܝܘܡܐ ܚܕ. text-trans-full BRCJT BR> >LH>=. JT CMJ> WJT >R<>=. >R<> HWT TWH WBWH^=. WXCWK> <L >"PJ THWM>=. WRWXH D>LH> MRXP> <L >"PJ M"J>=. W>M#R >LH> NHW> NWHR>=. WHW#> NWHR>=. WX#Z> >LH> LNWHR> DCPJR=. WP#RC >LH> BJT NWHR> LXCWK>=. WQ#R> >LH> LNWHR> >JMM>=. WLXCWK> Q#R> LLJ>=. WHW#> RMC> WHW> YPR> JWM> XD=.
We write a few formats to file, in your Downloads folder.
There is one subtlety: some books come in two versions, A and B, which are based on different sets of manuscripts (witnesses). We will export two Peshittas: one where for each book the A version is chosen, and one where for each book the B version is chosen.
We also write out book names, chapter and verse numbers.
orig = "text-orig-full"
trans = "text-trans-full"
for fmt in (orig, trans):
for witness in ("A", "B"):
with open(
os.path.expanduser(f"~/Downloads/Peshitta-{witness}-{fmt}.txt"), "w"
) as f:
for b in F.otype.s("book"):
thisWitness = F.witness.v(b)
if thisWitness and thisWitness != witness:
continue
book = T.sectionFromNode(b)[0]
acro = F.book.v(b)
f.write(f"{book} ({acro})\n\n")
for c in L.d(b, otype="chapter"):
f.write(f"{acro} {F.chapter.v(c)}\n\n")
for v in L.d(c, otype="verse"):
f.write(f"{F.verse.v(v)} {T.text(v, fmt=fmt, descend=True)}\n")
f.write("\n")
f.write("\n")
!head -n 20 ~/Downloads/Peshitta-A-{orig}.txt
Genesis (Gn) Gn 1 1 ܒܪܫܝܬ ܒܪܐ ܐܠܗܐ. ܝܬ ܫܡܝܐ ܘܝܬ ܐܪܥܐ. 2 ܐܪܥܐ ܗܘܬ ܬܘܗ ܘܒܘܗ̇. ܘܚܫܘܟܐ ܥܠ ܐ̈ܦܝ ܬܗܘܡܐ. ܘܪܘܚܗ ܕܐܠܗܐ ܡܪܚܦܐ ܥܠ ܐ̈ܦܝ ܡ̈ܝܐ. 3 ܘܐܡ̣ܪ ܐܠܗܐ ܢܗܘܐ ܢܘܗܪܐ. ܘܗܘ̣ܐ ܢܘܗܪܐ. 4 ܘܚ̣ܙܐ ܐܠܗܐ ܠܢܘܗܪܐ ܕܫܦܝܪ. ܘܦ̣ܪܫ ܐܠܗܐ ܒܝܬ ܢܘܗܪܐ ܠܚܫܘܟܐ. 5 ܘܩ̣ܪܐ ܐܠܗܐ ܠܢܘܗܪܐ ܐܝܡܡܐ. ܘܠܚܫܘܟܐ ܩ̣ܪܐ ܠܠܝܐ. ܘܗܘ̣ܐ ܪܡܫܐ ܘܗܘܐ ܨܦܪܐ ܝܘܡܐ ܚܕ. 6 ܘܐܡ̣ܪ ܐܠܗܐ ܢܗܘܐ ܐܪܩܝܥܐ ܒܡܨܥܬ ܡ̈ܝܐ. ܘܢܗܘܐ ܦܪܫ̇ ܒܝܬ ܡ̈ܝܐ ܠܡ̈ܝܐ. 7 ܘܥܒ̣ܕ ܐܠܗܐ ܐܪܩܝܥܐ. ܘܦ̣ܪܫ ܒܝܬ ܡ̈ܝܐ ܕܠܬܚܬ ܡܢ ܐܪܩܝܥܐ. ܘܒܝܬ ܡ̈ܝܐ ܕܠܥܠ ܡܢ ܐܪܩܝܥܐ. ܘܗܘ̣ܐ ܗܟܢܐ. 8 ܘܩ̣ܪܐ ܐܠܗܐ ܠܐܪܩܝܥܐ ܫܡܝܐ. ܘܗܘ̣ܐ ܪܡܫܐ ܘܗܘܐ ܨܦܪܐ ܝܘܡܐ ܕܬܪ̈ܝܢ. 9 ܘܐܡ̣ܪ ܐܠܗܐ ܢܬܟܢܫܘܢ ܡ̈ܝܐ ܕܠܬܚܬ ܡܢ ܫܡܝܐ ܠܐܬܪܐ ܚܕ ܘܬܬܚܙܐ ܝܒܝܫܬܐ. ܘܗܘ̣ܐ ܗܟܢܐ. 10 ܘܩ̣ܪܐ ܐܠܗܐ ܠܝܒܝܫܬܐ ܐܪܥܐ. ܘܠܟܢܫܐ ܕܡ̈ܝܐ ܩ̣ܪܐ ܝܡ̈ܡܐ. ܘܚ̣ܙܐ ܐܠܗܐ ܕܫܦܝܪ. 11 ܘܐܡ̣ܪ ܐܠܗܐ ܬܦܩ ܐܪܥܐ ܬܕܐܐ ܥܣܒܐ ܕܡܙܕܪܥ ܙܪܥܐ ܠܓܢܣܗ. ܘܐܝܠܢܐ ܕܦܐܪ̈ܐ ܕܥܒ̇ܕ ܦܐܪ̈ܐ ܠܓܢܣܗ ܕܢܨܒܬܗ ܒܗ ܥܠ ܐܪܥܐ. ܘܗ̣ܘܐ ܗܟܢܐ. 12 ܘܐܦ̣ܩܬ ܐܪܥܐ ܬܕܐܐ. ܥܣܒܐ ܕܡܙܕܪܥ ܙܪܥܐ ܠܓܢܣܗ. ܘܐܝܠܢܐ ܕܥܒ̇ܕ ܦܐܪ̈ܐ ܕܢܨܒܬܗ ܒܗ ܠܓܢܣܗ. ܘܚ̣ܙܐ ܐܠܗܐ ܕܫܦܝܪ. 13 ܘܗܘ̣ܐ ܪܡܫܐ ܘܗܘ̣ܐ ܨܦܪܐ. ܝܘܡܐ ܕܬܠܬܐ. 14 ܘܐܡ̣ܪ ܐܠܗܐ. ܢܗܘܘܢ ܢܗܝܪ̈ܐ ܒܐܪܩܝܥܐ ܕܫܡܝܐ. ܠܡܦܪܫ ܒܝܬ ܐܝܡܡܐ ܠܠܠܝܐ . ܘܢܗܘܘܢ ܠܐܬ̈ܘܬܐ ܘܠܙܒ̈ܢܐ. ܘܠܝ̈ܘܡܬܐ ܘܠܫ̈ܢܝܐ. 15 ܘܢܗܘܘܢ ܡܢܗܪ̈ܝܢ ܒܐܪܩܝܥܐ ܕܫܡܝܐ ܠܡܢܗܪܘ ܥܠ ܐܪܥܐ. ܘܗܘ̣ܐ ܗܟܢܐ. 16 ܘܥܒ̣ܕ ܐܠܗܐ ܬܪ̈ܝܢ ܢܗܝܪ̈ܐ ܪ̈ܘܪܒܐ. ܢܗܝܪܐ ܪܒܐ ܠܫܘܠܛܢܐ ܕܐܝܡܡܐ. ܘܢܗܝܪܐ ܙܥܘܪܐ ܠܫܘܠܛܢܐ ܕܠܠܝܐ ܘܟܘܟ̈ܒܐ.
!head -n 20 ~/Downloads/Peshitta-B-{trans}.txt
Genesis (Gn) Gn 1 1 BRCJT BR> >LH>=. JT CMJ> WJT >R<>=. 2 >R<> HWT TWH WBWH^=. WXCWK> <L >"PJ THWM>=. WRWXH D>LH> MRXP> <L >"PJ M"J>=. 3 W>M#R >LH> NHW> NWHR>=. WHW#> NWHR>=. 4 WX#Z> >LH> LNWHR> DCPJR=. WP#RC >LH> BJT NWHR> LXCWK>=. 5 WQ#R> >LH> LNWHR> >JMM>=. WLXCWK> Q#R> LLJ>=. WHW#> RMC> WHW> YPR> JWM> XD=. 6 W>M#R >LH> NHW> >RQJ<> BMY<T M"J>=. WNHW> PRC^ BJT M"J> LM"J>=. 7 W<B#D >LH> >RQJ<>=. WP#RC BJT M"J> DLTXT MN >RQJ<>=. WBJT M"J> DL<L MN >RQJ<>=. WHW#> HKN>=. 8 WQ#R> >LH> L>RQJ<> CMJ>=. WHW#> RMC> WHW> YPR> JWM> DTR"JN=. 9 W>M#R >LH> NTKNCWN M"J> DLTXT MN CMJ> L>TR> XD WTTXZ> JBJCT>=. WHW#> HKN>=. 10 WQ#R> >LH> LJBJCT> >R<>=. WLKNC> DM"J> Q#R> JM"M>=. WX#Z> >LH> DCPJR=. 11 W>M#R >LH> TPQ >R<> TD>> <SB> DMZDR< ZR<> LGNSH=. W>JLN> DP>R"> D<B^D P>R"> LGNSH DNYBTH BH <L >R<>=. WH#W> HKN>=. 12 W>P#QT >R<> TD>>=. <SB> DMZDR< ZR<> LGNSH=. W>JLN> D<B^D P>R"> DNYBTH BH LGNSH=. WX#Z> >LH> DCPJR=. 13 WHW#> RMC> WHW#> YPR>=. JWM> DTLT>=. 14 W>M#R >LH>=. NHWWN NHJR"> B>RQJ<> DCMJ>=. LMPRC BJT >JMM> LLLJ> =. WNHWWN L>T"WT> WLZB"N>=. WLJ"WMT> WLC"NJ>=. 15 WNHWWN MNHR"JN B>RQJ<> DCMJ> LMNHRW <L >R<>=. WHW#> HKN>=. 16 W<B#D >LH> TR"JN NHJR"> R"WRB>=. NHJR> RB> LCWLVN> D>JMM>=. WNHJR> Z<WR> LCWLVN> DLLJ> WKWK"B>=.
!sed -n '29196,29216p' ~/Downloads/Peshitta-A-{orig}.txt
43 ܘܕܒ̇ܚܘ ܒܝܘܡܐ ܗ̇ܘ ܕܒ̈ܚܐ ܪ̈ܘܪܒܐ ܘܚܕܝܘ ܡܛܠ ܕܡܪܝܐ ܚ̇ܕܝ ܐܢܘܢ ܚܕܘܬܐ ܪܒܬܐ ܘܐܦ ܢܫ̈ܐ ܘܛܠܝ̈ܐ ܚܕܝܘ ܘܐܫܬܡܥܬ ܚܕܘܬܐ ܕܐܘܪܫܠܡ ܠܪܘܚܩܐ 44 ܘܐܫܠܛܘ ܒܝܘܡܐ ܗ̇ܘ ܓܒܪ̈ܐ ܐܝܠܝܢ ܕܝܗܒܝܢ ܗܘܘ ܡܢ ܐܘܨܪ̈ܐ ܠܡܠܟܐ ܒ̈ܬܐ ܠܡܩܦܣܘ ܒܗܘܢ ܪ̈ܫܝܬܐ ܘܡܥܣܪ̈ܐ ܕܪ̈ܫܐ ܕܩܘܪ̈ܝܐ ܐܝܟ ܕܟܬܝܒ ܒܟܬܒܐ ܕܢܡܘܣܐ ܠܟܘܡܪ̈ܐ ܘܠܠܘܝ̈ܐ ܡܛܠ ܕܚܕܘܬܐ ܕܝܗ̈ܘܕܝܐ ܥܠ ܟܘܡܪ̈ܐ ܘܠܘܝ̈ܐ ܐܝܠܝܢ ܕܩܝܡܝܢ 45 ܘܢܛܪܝܢ ܡܛܪܬܐ ܒܒܝܬܐ ܕܐܠܗܗܘܢ ܘܢܛܪ̈ܝ ܢܛܘܪ̈ܬܐ ܕܟܝܐܝܬ ܘܡܫܡ̈ܫܢܐ ܘܬ̇ܪ̈ܥܐ ܐܝܟ ܦܘܩܕܢܐ ܕܕܘܝܕ ܘܕܫܠܝܡܘܢ ܒܪܗ 46 ܡܛܠ ܕܒܝܘܡ̈ܬܗ ܕܕܘܝܕ ܗ̣ܘܐ ܐܣܦ ܘܩܡ ܒܪܫܐ ܕܡܫܡ̈ܫܢܐ ܘܡܫܒܚ ܗܘܐ ܘܡܘܕܐ ܩܕܡ ܡܪܝܐ ܐܠܗܐ 47 ܘܟܘܠܗ ܐܝܣܪܐܝܠ ܒܝܘܡ̈ܘܗܝ ܕܙܘܪܒܒܠ ܘܒܝܘܡ̈ܬܗ ܕܢܚܡܝܐ ܝܗܒܝܢ ܡܘܗܒ̈ܬܐ ܠܡܫܡ̈ܫܢܐ ܘܬ̇ܪ̈ܥܐ ܡܦܩܝܢ ܘܝܗܒܝܢ ܝܘܡ ܒܝܘܡܗ ܘܡܩܕܫܝܢ ܠܠܘܝ̈ܐ ܘܠܘܝ̈ܐ ܡܩܕܫܝܢ ܠܒܢ̈ܝ ܐܗܪܘܢ Neh 13 1 ܒܗ ܒܝܘܡܐ ܗ̇ܘ ܐܬܩܪܝ ܟܬܒܐ ܕܢܡܘܣܐ ܕܡܘܫܐ ܒܐܕܢܝ̈ ܥܡܐ ܘܐܫܬܟܚ ܕܟܬܝܒ ܒܗ ܕܠܐ ܢܥܠܘܢ ܥܡܘܢܝ̈ܐ ܘܡܘܐܒܝ̈ܐ ܠܟܢܘܫܬܗ ܕܡܪܝܐ ܥܕܡܐ ܠܥܠܡ 2 ܡܛܠ ܕܠܐ ܐܪܥܘ ܠܒܢ̈ܝ ܐܝܣܪܝܠ ܒܠܚܡܐ ܘܒܡܝ̈ܐ ܘܐܓܪܘ ܠܗܘܢ ܠܒܠܥܡ ܠܡܠܛ ܐܢܘܢ ܘܐܗܦܟ ܐܠܗܢ ܠܘ̈ܛܬܗ ܠܒܘܪ̈ܟܬܐ 3 ܗܝܕܝܢ ܟܕ ܫܡܥܘ ܡ̈ܠܐ ܕܢܡܘܣܐ ܐܬܦܪܫܘ ܟܠܗܘܢ ܥܪ̈ܘܒܐ ܡܢ ܐܝܣܪܐܝܠ 4 ܘܐ̣ܬܐ ܐܠܝܫܒ ܟܘܡܪܐ 5 ܘܒ̣ܢܐ ܠܗ ܬܡܢ ܕܪܬܐ ܚܕܐ ܪܒܬܐ ܘܬܡܢ ܡܢ ܠܩܘܕܡܝܢ ܣܝܡܝܢ ܗܘܘ ܩܘܪ̈ܒܢܐ ܘܠܒܘܢܬܐ ܘܡܐܢ̈ܐ ܕܡܥܣܪ̈ܐ ܕܥܒܘܪܐ ܘܕܚܡ̣ܪܐ ܘܕܡܫܚܐ ܘܒܩܘܪ̈ܝܐ ܕܠܘܝ̈ܐ ܘܕܡܫܡ̈ܫܢܐ ܘܕܬ̇ܪ̈ܥܐ ܘܪ̈ܝܫܝܬܐ ܕܟܗ̈ܢܐ 6 ܘܒܟܠܗܝܢ ܗܠܝܢ ܠܐ ܗܘܝ̇ܬ ܒܐܘܪܫܠܡ ܡܛܠ ܕܒܫܢܬ ܬܠܬܝܢ ܘܬܪ̈ܬܝܢ ܠܐܪܛܚܫܫܬ ܡܠܟܐ ܕܒܒܠ ܐܬܝ̇ܬ ܠܘܬ ܡܠܟܐ ܘܒܚܪܬܐ ܕܝܘܡ̈ܬܐ ܐܫܬܐܠܬ̇ ܡܢ ܡܠܟܐ 7 ܘܐܬܝ̇ܬ ܠܐܘܪܫܠܡ ܘܐܣ̇ܬܟܠܬ ܒܝܫܬܐ ܕܥ̣ܒܕ ܐܠܝܫܒ ܠܛܘܒܝܐ ܕܥ̣ܒܕ ܠܗ ܒܝܬܐ ܒܕܪܬܐ ܕܡܪܝܐ 8 ܘܐܬܐܒܫ ܠܝ ܛܒ ܘܫ̇ܕܝܬ ܟܠܗܘܢ ܡܐܢ̈ܐ ܕܒܝܬܗ ܕܛܘܒܝܐ ܒܫܘܩܐ ܠܒܪ ܡܢ ܕܪܬܐ 9 ܘܐܡܪܬ̇ ܘܕܟܝܘ ܕܪܬܐ ܘܐܗܦܟܬ̇ ܠܬܡܢ ܡܐܢ̈ܝ ܒܝܬܗ ܕܡܪܝܐ ܘܩܘܪ̈ܒܢܐ ܘܠܒܘܢܬܐ 10 ܘܝܕܥܬ̇ ܕܡܢܬܐ ܕܠܘܝ̈ܐ ܠܐ ܡܬܝܗܒܐ ܘܥܪܩܘ ܓܒܪ ܠܚܩܠܗ ܠܘܝ̈ܐ ܘܡܫܡ̈ܫܢܐ ܘܥܒ̈ܕܝ ܥܒܝ̈ܕܬܐ 11 ܘܐܢܐ ܕܢܬ̇ ܥܡ ܪ̈ܫܐ ܘܐܡܪܬ̇ ܠܗܘܢ ܡܛܠ ܡܢܐ ܫܒܝܩ ܒܝܬܗ ܕܡܪܝܐ ܘܟܢܫ̇ܬ ܐܢܘܢ ܘܐܩܝܡ̇ܬ ܐܢܘܢ ܥܠ ܩܝܡܗܘܢ 12 ܘܟܠܗܘܢ ܝܗܘ̈ܕܝܐ ܐܝܬܝܘ ܡܥܣܪ̈ܐ ܕܥܒܘܪܐ ܘܕܚܡ̣ܪܐ ܘܕܡܫܚܐ ܠܐܘܨܪ̈ܐ 13 ܘܐܫܠܛܬ̇ ܥܠ ܐܘܨܪ̈ܐ ܠܫܠܡܝܐ ܟܘܡܪܐ ܘܠܨܕܘܩ ܣ̇ܦܪܐ ܘܠܦܪܝܐ ܒܪ ܠܘܝ̈ܐ ܘܥܡܗܘܢ ܚܢܢ ܒܪ ܙܟܘܪ ܒܪ ܡܬܢܝܐ ܡܛܠ ܕܫܪ̈ܝܪܐ ܐܬܚܫܒܘ ܘܦܨܬܗܘܢ ܣܠܩܬ ܠܡܗܘܐ ܪܝܫܐ ܠܐܚܝ̈ܗܘܢ 14 ܐܬܕܟܪ ܠܝ ܐܠܗܝ ܡܛܠ ܗܕܐ ܘܠܐ ܬܥܒܪ ܛܝܒܘܬܝ ܕܥܒܕܬ ܒܒܝܬ ܐܠܗܝ ܘܒܢܛܘܪ̈ܬܗ
!sed -n '29196,29216p' ~/Downloads/Peshitta-B-{orig}.txt
40 ܘܥܠ ܬܪ̈ܬܝܢ ܟܢ̈ܘܫܢ ܠܒܝܬܗ ܕܡܪܝܐ ܘܐܢܐ ܘܦܠܓܗܘܢ ܕܪ̈ܫܢܐ ܕܐܝܬ ܗܘܘ ܥܡܝ 41 ܘܟܘܡܪ̈ܐ ܐܠܝܩܝܡ ܘܡܥܣܝܐ ܡܚܠܝܢ ܡܝܟܐ ܐܠܝܗܘ ܥܢܢܝ ܙܟܪܝܐ ܚܢܢܝܐ ܒܩܪ̈ܢܬܐ 42 ܡܥܣܝܐ ܘܫܡܥܝܐ ܘܠܥܙܪ ܘܥܙܝ ܘܝܘܚܢܢ ܘܡܠܟܝܐ ܘܥܠܡ ܘܥܙܘܪ ܘܫܡܘܥ ܡܫܡ̈ܫܢܐ ܘܙܪܚܝܐ ܪܫܐ 43 ܘܕܒ̇ܚܘ ܒܝܘܡܐ ܗ̇ܘ ܕܒ̈ܚܐ ܪ̈ܘܪܒܐ ܘܚܕܝܘ ܡܛܠ ܕܡܪܝܐ ܚ̇ܕܝ ܐܢܘܢ ܚܕܘܬܐ ܪܒܬܐ ܘܐܦ ܢܫ̈ܐ ܘܛܠܝ̈ܐ ܚܕܝܘ ܘܐܫܬܡܥܬ ܚܕܘܬܐ ܕܐܘܪܫܠܡ ܠܪܘܚܩܐ 44 ܘܐܫܠܛܘ ܒܝܘܡܐ ܗ̇ܘ ܓܒܪ̈ܐ ܐܝܠܝܢ ܕܝܗܒܝܢ ܗܘܘ ܡܢ ܐܘܨܪ̈ܐ ܠܡܠܟܐ ܒ̈ܬܐ ܠܡܩܦܣܘ ܒܗܘܢ ܪ̈ܫܝܬܐ ܘܡܥܣܪ̈ܐ ܕܪ̈ܫܐ ܕܩܘܪ̈ܝܐ ܐܝܟ ܕܟܬܝܒ ܒܟܬܒܐ ܕܢܡܘܣܐ ܠܟܘܡܪ̈ܐ ܘܠܠܘܝ̈ܐ ܡܛܠ ܕܚܕܘܬܐ ܕܝܗ̈ܘܕܝܐ ܥܠ ܟܘܡܪ̈ܐ ܘܠܘܝ̈ܐ ܐܝܠܝܢ ܕܩܝܡܝܢ 45 ܘܢܛܪܝܢ ܡܛܪܬܐ ܒܒܝܬܐ ܕܐܠܗܗܘܢ ܘܢܛܪ̈ܝ ܢܛܘܪ̈ܬܐ ܕܟܝܐܝܬ ܘܡܫܡ̈ܫܢܐ ܘܬ̇ܪ̈ܥܐ ܐܝܟ ܦܘܩܕܢܐ ܕܕܘܝܕ ܘܕܫܠܝܡܘܢ ܒܪܗ 46 ܡܛܠ ܕܒܝܘܡ̈ܬܗ ܕܕܘܝܕ ܗ̣ܘܐ ܐܣܦ ܘܩܡ ܒܪܫܐ ܕܡܫܡ̈ܫܢܐ ܘܡܫܒܚ ܗܘܐ ܘܡܘܕܐ ܩܕܡ ܡܪܝܐ ܐܠܗܐ 47 ܘܟܘܠܗ ܐܝܣܪܐܝܠ ܒܝܘܡ̈ܘܗܝ ܕܙܘܪܒܒܠ ܘܒܝܘܡ̈ܬܗ ܕܢܚܡܝܐ ܝܗܒܝܢ ܡܘܗܒ̈ܬܐ ܠܡܫܡ̈ܫܢܐ ܘܬ̇ܪ̈ܥܐ ܡܦܩܝܢ ܘܝܗܒܝܢ ܝܘܡ ܒܝܘܡܗ ܘܡܩܕܫܝܢ ܠܠܘܝ̈ܐ ܘܠܘܝ̈ܐ ܡܩܕܫܝܢ ܠܒܢ̈ܝ ܐܗܪܘܢ Neh 13 1 ܒܗ ܒܝܘܡܐ ܗ̇ܘ ܐܬܩܪܝ ܟܬܒܐ ܕܢܡܘܣܐ ܕܡܘܫܐ ܒܐܕܢܝ̈ ܥܡܐ ܘܐܫܬܟܚ ܕܟܬܝܒ ܒܗ ܕܠܐ ܢܥܠܘܢ ܥܡܘܢܝ̈ܐ ܘܡܘܐܒܝ̈ܐ ܠܟܢܘܫܬܗ ܕܡܪܝܐ ܥܕܡܐ ܠܥܠܡ 2 ܡܛܠ ܕܠܐ ܐܪܥܘ ܠܒܢ̈ܝ ܐܝܣܪܝܠ ܒܠܚܡܐ ܘܒܡܝ̈ܐ ܘܐܓܪܘ ܠܗܘܢ ܠܒܠܥܡ ܠܡܠܛ ܐܢܘܢ ܘܐܗܦܟ ܐܠܗܢ ܠܘ̈ܛܬܗ ܠܒܘܪ̈ܟܬܐ 3 ܗܝܕܝܢ ܟܕ ܫܡܥܘ ܡ̈ܠܐ ܕܢܡܘܣܐ ܐܬܦܪܫܘ ܟܠܗܘܢ ܥܪ̈ܘܒܐ ܡܢ ܐܝܣܪܐܝܠ 4 ܘܐ̣ܬܐ ܐܠܝܫܒ ܟܘܡܪܐ 5 ܘܒ̣ܢܐ ܠܗ ܬܡܢ ܕܪܬܐ ܚܕܐ ܪܒܬܐ ܘܬܡܢ ܡܢ ܠܩܘܕܡܝܢ ܣܝܡܝܢ ܗܘܘ ܩܘܪ̈ܒܢܐ ܘܠܒܘܢܬܐ ܘܡܐܢ̈ܐ ܕܡܥܣܪ̈ܐ ܕܥܒܘܪܐ ܘܕܚܡ̣ܪܐ ܘܕܡܫܚܐ ܘܒܩܘܪ̈ܝܐ ܕܠܘܝ̈ܐ ܘܕܡܫܡ̈ܫܢܐ ܘܕܬ̇ܪ̈ܥܐ ܘܪ̈ܝܫܝܬܐ ܕܟܗ̈ܢܐ 6 ܘܒܟܠܗܝܢ ܗܠܝܢ ܠܐ ܗܘܝ̇ܬ ܒܐܘܪܫܠܡ ܡܛܠ ܕܒܫܢܬ ܬܠܬܝܢ ܘܬܪ̈ܬܝܢ ܠܐܪܛܚܫܫܬ ܡܠܟܐ ܕܒܒܠ ܐܬܝ̇ܬ ܠܘܬ ܡܠܟܐ ܘܒܚܪܬܐ ܕܝܘܡ̈ܬܐ ܐܫܬܐܠܬ̇ ܡܢ ܡܠܟܐ 7 ܘܐܬܝ̇ܬ ܠܐܘܪܫܠܡ ܘܐܣ̇ܬܟܠܬ ܒܝܫܬܐ ܕܥ̣ܒܕ ܐܠܝܫܒ ܠܛܘܒܝܐ ܕܥ̣ܒܕ ܠܗ ܒܝܬܐ ܒܕܪܬܐ ܕܡܪܝܐ 8 ܘܐܬܐܒܫ ܠܝ ܛܒ ܘܫ̇ܕܝܬ ܟܠܗܘܢ ܡܐܢ̈ܐ ܕܒܝܬܗ ܕܛܘܒܝܐ ܒܫܘܩܐ ܠܒܪ ܡܢ ܕܪܬܐ 9 ܘܐܡܪܬ̇ ܘܕܟܝܘ ܕܪܬܐ ܘܐܗܦܟܬ̇ ܠܬܡܢ ܡܐܢ̈ܝ ܒܝܬܗ ܕܡܪܝܐ ܘܩܘܪ̈ܒܢܐ ܘܠܒܘܢܬܐ 10 ܘܝܕܥܬ̇ ܕܡܢܬܐ ܕܠܘܝ̈ܐ ܠܐ ܡܬܝܗܒܐ ܘܥܪܩܘ ܓܒܪ ܠܚܩܠܗ ܠܘܝ̈ܐ ܘܡܫܡ̈ܫܢܐ ܘܥܒ̈ܕܝ ܥܒܝ̈ܕܬܐ 11 ܘܐܢܐ ܕܢܬ̇ ܥܡ ܪ̈ܫܐ ܘܐܡܪܬ̇ ܠܗܘܢ ܡܛܠ ܡܢܐ ܫܒܝܩ ܒܝܬܗ ܕܡܪܝܐ ܘܟܢܫ̇ܬ ܐܢܘܢ ܘܐܩܝܡ̇ܬ ܐܢܘܢ ܥܠ ܩܝܡܗܘܢ
For Bible book names, we can use several languages. Well, in this case we have just English.
Here are the languages that we can use for book names.
These languages come from the features book@ll, where ll is a two-letter ISO language code. Have a look in your data directory; you can't miss them.
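The naming convention is easy to exploit: the language code is just the part after the @ in the feature name. A small sketch, assuming only the book@ll pattern described above:

```python
def lang_of_feature(feature_name):
    # "book@en" -> "en"; a plain "book" feature has no language suffix
    return feature_name.split("@", 1)[1] if "@" in feature_name else ""

print(lang_of_feature("book@en"))  # en
print(lang_of_feature("book"))     # (empty string: the default language)
```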
T.languages
{'': {'language': 'default', 'languageEnglish': 'default'}, 'en': {'language': 'English', 'languageEnglish': 'English'}}
A section is a book, a chapter or a verse. Knowledge of sections is not baked into Text-Fabric. The config feature otext.tf may specify three section levels, and tell what the corresponding node types and features are.
From that knowledge it can construct mappings from nodes to sections, e.g. from verse nodes to tuples of the form:
(bookName, chapterNumber, verseNumber)
Here are examples of getting the section that corresponds to a node and vice versa.
NB: sectionFromNode always delivers a verse specification, either from the first slot belonging to that node, or, if lastSlot=True is passed, from the last slot belonging to that node.
for x in (
("section of first word", T.sectionFromNode(1)),
("node of Genesis 1:1", T.nodeFromSection(("Genesis", 1, 1))),
("node of book Genesis", T.nodeFromSection(("Genesis",))),
("node of chapter Genesis 1", T.nodeFromSection(("Genesis", 1))),
("section of book node", T.sectionFromNode(109641)),
("idem, now last word", T.sectionFromNode(109641, lastSlot=True)),
("section of chapter node", T.sectionFromNode(109668)),
("idem, now last word", T.sectionFromNode(109668, lastSlot=True)),
):
print("{:<30} {}".format(*x))
section of first word ('Genesis', 1, 1) node of Genesis 1:1 428170 node of book Genesis 426836 node of chapter Genesis 1 426901 section of book node ('Samuel_1', 4, 20) idem, now last word ('Samuel_1', 4, 20) section of chapter node ('Samuel_1', 4, 21) idem, now last word ('Samuel_1', 4, 21)
By now you have an impression how to compute around in the text. While this is still the beginning, I hope you already sense the power of unlimited programmatic access to all the bits and bytes in the data set.
Here are a few directions for unleashing that power.
Text-Fabric contains a flexible search engine, that does not only work for this data, but also for data that you add to it. There is a tutorial dedicated to search.
If you study the additional data, you can observe how that data is created and also how it is turned into a Text-Fabric data module. The last step is incredibly easy. You can write out any Python dictionary whose keys are numbers and whose values are strings or numbers as a Text-Fabric feature. When you are creating data, you have already constructed those dictionaries, so writing them out is just one method call.
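As a sketch of what that looks like: you build the mapping, plus a bit of metadata, and hand both to TF's save method. The actual call at the end is shown as a comment because it needs a live Text-Fabric session; the method name TF.save and its parameters are as I recall them from the TF documentation, so treat them as an assumption and check the API reference.

```python
# Hypothetical new feature: the length of each word's transliteration.
# In a live session you would compute this from the corpus, e.g.
#   wordLength = {w: len(F.word_etcbc.v(w)) for w in F.otype.s("word")}
wordLength = {1: 5, 2: 3, 3: 4}  # toy values for illustration

metaData = {
    "wordLength": {
        "valueType": "int",
        "description": "length of the ETCBC transliteration of the word",
    }
}

# In a live Text-Fabric session (sketch; verify against the TF API docs):
# TF.save(nodeFeatures={"wordLength": wordLength}, metaData=metaData)
```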
You can then easily share your new features on GitHub, so that your colleagues everywhere can try it out for themselves.
EMDROS, written by Ulrik Petersen, is a text database system with the powerful topographic query language MQL. The ideas are based on a model devised by Christ-Jan Doedens in Text Databases: One Database Model and Several Retrieval Languages.
Text-Fabric's model of slots, nodes and edges is a fairly straightforward translation of the models of Christ-Jan Doedens and Ulrik Petersen.
SHEBANQ uses EMDROS to let users execute and save MQL queries against the Hebrew Text Database of the ETCBC.
So it is kind of logical and convenient to be able to work with a Text-Fabric resource through MQL.
If you have obtained an MQL dataset somehow, you can turn it into a Text-Fabric data set with importMQL(), which we will not show here.
And if you want to export a Text-Fabric data set to MQL, that is also possible. After the Fabric(modules=...) call, you can call exportMQL() in order to save all features of the indicated modules into a big MQL dump, which can be imported by an EMDROS database.
TF.exportMQL("peshitta", "~/Downloads")
0.00s Checking features of dataset peshitta
| 17s feature "book@en" => "book_en"
0.00s 9 features to export to MQL ... 0.00s Loading 9 features 0.01s Writing enumerations trailer : 14 values, 14 not a name, e.g. « » trailer_etcbc : 14 values, 14 not a name, e.g. « » | 0.11s Writing an all-in-one enum with 132 values 0.12s Mapping 9 features onto 4 object types 0.36s Writing 9 features as data in 4 object types | 0.00s word data ... | | 0.25s batch of size 6.3MB with 50000 of 50000 words | | 0.46s batch of size 6.3MB with 50000 of 100000 words | | 0.66s batch of size 6.4MB with 50000 of 150000 words | | 0.88s batch of size 6.4MB with 50000 of 200000 words | | 1.10s batch of size 6.4MB with 50000 of 250000 words | | 1.32s batch of size 6.4MB with 50000 of 300000 words | | 1.53s batch of size 6.4MB with 50000 of 350000 words | | 1.75s batch of size 6.4MB with 50000 of 400000 words | | 1.86s batch of size 3.5MB with 26835 of 426835 words | 1.86s word data: 426835 objects | 0.00s verse data ... | | 0.16s batch of size 3.1MB with 31341 of 31341 verses | 0.16s verse data: 31341 objects | 0.00s chapter data ... | | 0.04s batch of size 113.7KB with 1269 of 1269 chapters | 0.04s chapter data: 1269 objects | 0.00s book data ... | | 0.03s batch of size 6.4KB with 65 of 65 books | 0.03s book data: 65 objects 2.45s Done
Now you have a file ~/Downloads/peshitta.mql of 61 MB.
You can import it into an Emdros database by saying:
cd ~/Downloads
rm peshitta
mql -b 3 < peshitta.mql
The result is an SQLite3 database peshitta
in the same directory (24 MB).
You can run a query against it by creating a text file test.mql with these contents:
select all objects where
[book book=Gn
[chapter chapter=1
[verse verse=1
[word]
]
]
]
And then say
mql -b 3 -d peshitta test.mql
You will see raw query results: all words in Genesis 1:1.
It is not very pretty, and probably you should use a more visual Emdros tool to run those queries. You see a lot of node numbers, but the good thing is, you can look those node numbers up in Text-Fabric.