This notebook gets you started with using Text-Fabric for coding in the Dhammapada.
Familiarity with the underlying data model is recommended.
Short introductions to other TF datasets:
If you start computing with this tutorial, first copy its parent directory to somewhere else,
outside your dhammapada
directory.
If you pull changes from the dhammapada
repository later, your work will not be overwritten.
Where you put your tutorial directory is up to you.
It will work from any directory.
Text-Fabric will fetch a standard set of features for you from the newest GitHub release binaries.
It will fetch version 0.1.
The data will be stored in the text-fabric-data directory in your home directory.
The simplest way to get going is by this incantation:
from tf.app import use
For the very latest version, use hot.
For the latest release, use latest.
If you have cloned the repos (TF app and data), use clone.
If you do not want/need to upgrade, leave out the checkout specifiers.
A = use('etcbc/dhammapada:hot', hoist=globals())
rate limit is 5000 requests per hour, with 4999 left for this hour connecting to online GitHub repo etcbc/dhammapada ... connected app/__init__.py...downloaded app/app.py...downloaded app/config.yaml...downloaded app/static...directory app/static/display.css...downloaded app/static/logo.png...downloaded OK
This is Text-Fabric 9.2.0 Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html 16 features found and 0 ignored
The data of the Dhammapada is organized in features. They are columns of data. Think of the corpus as a big spreadsheet, where row 1 corresponds to the first word, row 2 to the second word, and so on, for all 13,000 words.
One column contains the letters of each Pali word.
Another column contains the letters of each Latin word.
There are columns which tell whether words are parts of quotations, or between [ ] (uncertain), or between ( ) (for clarity), and so on.
Instead of putting that information in one big table, the data is organized in separate columns. We call those columns features.
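The column idea can be sketched in plain Python. This is a toy illustration, not how Text-Fabric stores its data: the word values below are real (they are the first words of the corpus), but the freq_occ numbers are made up.

```python
# Features sketched as plain-Python "columns": each feature maps a
# node number (the row) to that node's value in the column.
# The word values are from the corpus; the freq_occ numbers are made up.
pali = {1: "Yamakavagga", 2: "manopubbaṅgamā", 3: "dhammā"}
freq_occ = {1: 1, 2: 2, 3: 7}

# Reading "row 2" across both columns gives all data for word 2:
w = 2
print(pali[w], freq_occ[w])
```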
By clicking on the triangle in front of Dhammapada-Latine you can see which features have been loaded, with a short description, and from there you can expand more information. If you click on a feature name, you find its documentation. If you hover over a name, you see where the feature is located on your system.
Edge features are marked by *bold italic* formatting.
We only have one edge feature: oslots, which is a standard TF feature.
Corpora might add more edge features, and probably newer versions of this corpus will have edge features.
The result of the incantation is that we have a bunch of special variables at our disposal that give us access to the corpus.
At this point it is helpful to throw a quick glance at the text-fabric API documentation (see the links under API Members above).
The most essential thing for now is that we can use F to access the data in the features we've loaded.
But there is more, such as N, which helps us to walk over the text, as we will see in a minute.
The API members above show you exactly which new names have been inserted in your namespace. If you click on these names, you go to the API documentation for them.
Text-Fabric contains a flexible search engine that not only works for the data of this corpus, but also for data that you add to it.
Search is the quickest way to come up to speed with your data, without too much programming.
For example, let's display a number of words with frequencies higher than some threshold.
query = """
word freq_occ>20
"""
results = A.search(query)
A.show(results, start=1, end=5, condenseType="clause", condensed=True)
A.displayReset("tupleFeatures")
0.01s 2604 results
clause 1
clause 2
clause 3
clause 4
clause 5
Jump to the dedicated search tutorial first, to whet your appetite further.
The real power of search lies in the fact that it is integrated in a programming environment. You can use programming to:
Therefore, the rest of this tutorial is still important when you want to tap that power. If you continue here, you learn all the basics of data-navigation with Text-Fabric.
Before we start coding, we load some modules that we will need along the way:
%load_ext autoreload
%autoreload 2
import os
import collections
from itertools import chain
In order to get acquainted with the data, we start with the simple task of counting.
We use the N.walk() generator to walk through the nodes.
We compared the corpus data to a gigantic spreadsheet, where the rows correspond to the words.
In Text-Fabric, we call the rows slots, because they are the textual positions that can be filled with words.
Besides the words there are other objects: clauses, sentences, stanzas, vaggas. They also correspond to rows in the big spreadsheet.
In Text-Fabric we call all these rows nodes, and the N.walk() generator carries us through those nodes in the textual order.
Just one extra thing: the info statements generate timed messages.
If you use them instead of print, you'll get a sense of the amount of time that the various processing steps typically need.
A.indent(reset=True)
A.info("Counting nodes ...")
i = 0
for n in N.walk():
    i += 1
A.info("{} nodes".format(i))
0.00s Counting nodes ... 0.00s 16664 nodes
Every node has a type, like word, clause, or sentence. We know that we have approximately 13,000 words and some 3,500 other nodes. But what exactly are they?
Text-Fabric has two special features, otype and oslots, that must occur in every Text-Fabric data set.
otype tells you the type of each node, and you can ask for the number of slots in the text.
Here we go!
F.otype.slotType
'word'
F.otype.maxSlot
12922
F.otype.maxNode
16664
F.otype.all
('vagga', 'stanza', 'sentence', 'clause', 'word')
C.levels.data
(('vagga', 497.0, 16639, 16664), ('stanza', 27.20421052631579, 16164, 16638), ('sentence', 14.153340635268346, 15251, 16163), ('clause', 5.5506872852233675, 12923, 15250), ('word', 1, 1, 12922))
This is interesting: above you see all the textual objects, with the average size of their objects, the node where they start, and the node where they end.
This is an intuitive way to count the number of nodes in each type.
Note in passing how we use indent in conjunction with info to produce neatly timed and indented progress messages.
A.indent(reset=True)
A.info("counting objects ...")
for otype in F.otype.all:
    i = 0
    A.indent(level=1, reset=True)
    for n in F.otype.s(otype):
        i += 1
    A.info("{:>7} {}s".format(i, otype))
A.indent(level=0)
A.info("Done")
0.00s counting objects ... | 0.00s 26 vaggas | 0.00s 475 stanzas | 0.00s 913 sentences | 0.00s 2328 clauses | 0.00s 12922 words 0.00s Done
We use the A API (the extra power) to peek into the corpus.
First some words. Just to make sure that node 1 has type "word":
F.otype.v(1)
'word'
Some words in plain view:
wordShows = (90, 2007, 9001)
for word in wordShows:
    A.plain(word, withPassage=True)
You see, words can be Pali or Latin.
Before the words you see the vagga and stanza references. There is in fact a hyperlink underneath them. Click on it, and you go to the same stanza online, on the Tipitaka site. This site provides an English translation and commentary.
We can improve the layout a bit by setting the text format to a different value:
A.displaySetup(fmt="layout-orig-full")
We do the same command again:
wordShows = (90, 2007, 9001)
for word in wordShows:
    A.plain(word, withPassage=True)
You can leave out the passage reference:
for word in wordShows:
    A.plain(word, withPassage=False)
Now we show other objects, both with and without passage reference.
normalShow = dict(
    wordShow=wordShows[0],
    clauseShow=13290,
    sentenceShow=15228,
)
sectionShow = dict(
    stanzaShow=16431,
    vaggaShow=16580,
)
for (name, n) in normalShow.items():
    A.dm(f"**{name}** = node `{n}`\n")
    A.plain(n)
    A.plain(n, withPassage=False)
    A.dm("\n---\n")
for (name, n) in sectionShow.items():
    if name == "verseShow":
        continue
    A.dm(f"**{name}** = node `{n}`\n")
    A.plain(n)
    A.plain(n, withPassage=False)
    A.dm("\n---\n")
stanzaShow = node 16431
vaggaShow = node 16580
Note that for vagga nodes the withPassage parameter has little effect.
The passage is the thing that is hyperlinked. The node is represented as a textual reference to the piece of text in question.
We can also dive into the structure of the textual objects, provided they are not too large.
The function pretty gives a display of the object that a node stands for, together with the structure below that node.
for (name, n) in normalShow.items():
    A.dm(f"**{name}** = node `{n}`\n")
    A.pretty(n)
    A.dm("\n---\n")
Note: if you need a link to the Tipitaka site for just any node:
tenthousand = 10000
A.webLink(tenthousand)
We can show some standard features in the display:
for (name, n) in list(normalShow.items()) + list(sectionShow.items()):
    A.dm(f"**{name}** = node `{n}`\n")
    A.pretty(n, standardFeatures=True)
    A.dm("\n---\n")
wordShow = node 90
clauseShow = node 13290
sentenceShow = node 15228
stanzaShow = node 16431
vaggaShow = node 16580
Or we can ask for a specific feature to show up:
for (name, n) in list(normalShow.items()):
    A.dm(f"**{name}** = node `{n}`\n")
    A.pretty(n, extraFeatures="freq_occ")
    A.dm("\n---\n")
F gives access to all features.
Every feature has a method freqList() to generate a frequency list of its values, higher frequencies first.
Here is a top 20 of the Pali words:
F.pali.freqList()[0:20]
(('ca', 181), ('na', 143), ('va', 73), ('yo', 54), ("n'", 47), ('atthi', 41), ('tam', 38), ('so', 36), ('hi', 35), ('hoti', 33), ('taṃ', 30), ('ve', 30), ('te', 28), ('pi', 26), ('attano', 24), ('ce', 24), ('etaṃ', 22), ('eva', 22), ('vā', 22), ('bhikkhu', 21))
And here for Latin:
F.latin.freqList()[0:20]
(('non', 220), ('et', 150), ('est', 137), ('in', 120), ('velut', 66), ('qui', 64), ('eum', 48), ('homo', 39), ('hoc', 37), ('vel', 37), ('Non', 36), ('dico', 36), ('is', 35), ('ego', 34), ('ad', 33), ('brāhmanam', 33), ('sapiens', 33), ('fit', 32), ('a', 30), ('gaudium', 30))
Let's do some fancier word stuff.
A hapax is a word that occurs only once. Note that we do not (yet) have lexeme information, so all we count are word occurrences. We are oblivious to the fact that the same lexeme may occur in several forms.
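The idea behind such counts can be sketched in plain Python with collections.Counter. The word list below is made up for illustration, not taken from the corpus:

```python
from collections import Counter

# A hapax is a word form that occurs exactly once.
# Toy word list (not corpus data) to illustrate the idea:
words = ["ca", "na", "ca", "atthi", "bhikkhu", "ca", "atthi"]
freq = Counter(words)
hapaxes = sorted(w for (w, n) in freq.items() if n == 1)
print(hapaxes)  # the forms with frequency 1
```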
We print 10 Pali hapaxes and 10 Latin hapaxes.
Let's do it with search templates.
Remember that we have a feature trans that indicates whether an object belongs to the Pali text or to the Latin text.
But we forgot the details. Let's call them up!
A.isLoaded("trans")
trans node (int) whether the node belongs to the original text or a translation
Good, but a little bit more info please:
A.isLoaded("trans", pretty=True, meta=True)
trans node (int) converters = Dirk Roorda (Text-Fabric) copynote1 = Digitisation supported by Shri Brihad Bhartiya Samaj 20 February 2020 dateWritten = 2021-12-24T14:49:10Z description = whether the node belongs to the original text or a translation digitizers = Bee Scherer, Yvonne Mataar edition = 2nd editor = V. Fausboll format = 1 (=Latin translation) or absent (=Pali original) institute = Text and Traditions, VU Amsterdam language = pli,lat place = London project = Dhammapada-latine publisher = Luzac & Co. researcher = Bee Scherer sourceFormat = plain text stamp = 50480 subtitle = being a collection of moral verses in Pali title = The Dhammapada version = 0.2 writtenBy = Text-Fabric yearPublished = 1900
We see under key format: value 1 means Latin, absence of a value means Pali.
In queries, we can select for exactly that:
trans# means: feature trans does not have a value for the node
trans means: feature trans has a value for the node
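The distinction between "has a value" and "has no value" can be mimicked in plain Python (with invented toy node numbers): a node has a value if it occurs as a key in the feature mapping.

```python
# Toy sketch (invented node numbers, not corpus data).
# The trans feature assigns 1 to translated (Latin) words;
# for Pali words it is simply absent, which is what `trans#` selects.
trans = {4: 1, 5: 1}           # nodes 4 and 5 carry a value
allWords = [1, 2, 3, 4, 5]

paliWords = [w for w in allWords if w not in trans]   # like `word trans#`
latinWords = [w for w in allWords if w in trans]      # like `word trans`
print(paliWords, latinWords)
```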
So here are two templates: one for the Pali hapaxes and one for the Latin hapaxes. We run them both.
query = """
word trans# freq_occ=1
"""
paliResults = A.search(query, sort=True)
query = """
word trans freq_occ=1
"""
latinResults = A.search(query, sort=True)
0.01s 2006 results 0.01s 1841 results
Now we print the first 10 results of both:
A.table(paliResults, end=10)
A.table(latinResults, end=10)
A.displayReset("tupleFeatures")
We can also get hapaxes by means of ordinary Python programming. We show this lower-level way of working as well, because we are going to need it.
We use the features freq_occ and trans again.
paliHapaxes = []
latinHapaxes = []
for w in F.otype.s("word"):
    if F.freq_occ.v(w) == 1:
        if F.trans.v(w):
            latinHapaxes.append(F.latin.v(w))
        else:
            paliHapaxes.append(F.pali.v(w))
    if len(paliHapaxes) >= 10 and len(latinHapaxes) >= 10:
        break
print("pali-hapaxes")
for hapax in paliHapaxes[0:10]:
    print(hapax)
print("\nlatin-hapaxes")
for hapax in latinHapaxes[0:10]:
    print(hapax)
pali-hapaxes Yamakavagga paduṭṭhena cakkaṃ vahato pasannena chāyā anapāyinī upanayihanti sammati upanayhanti latin-hapaxes principium potior pars earum constant inquinata rota bovis vehentis pedem
There is yet another, quite different way of getting the hapaxes:
we use the function freqList() that is available for every feature in every Text-Fabric dataset.
It produces a frequency list of the values of that feature.
for lang in ("pali", "latin"):
    hapaxes = sorted(word for (word, freq) in Fs(lang).freqList() if freq == 1)
    print(f"{len(hapaxes):>4} {lang}-hapaxes")
    for hapax in hapaxes[0:10]:
        print(f"\t{hapax}")
2009 pali-hapaxes 'bhivaḍḍhati 'ham 'samānasaṃvāso 'taro 'tivākyaṃ 'yaṃ *1 *2 *3 Antako 1843 latin-hapaxes -omni Ac Ad Admoneat Aetatem Affectibus Alia Alieni Aliis Aliorum
This gives us hapaxes indeed, but sorted by the word form. Before, we got them in the order in which they show up in the text.
Additionally, we see how many hapaxes there are in the corpus.
But, wait a minute: the numbers do not agree!
The query says: 2006 and 1841 hapaxes.
Above we get: 2009 and 1843 ones.
How can that be?
Well, the query looks for true hapaxes, words that occur only once in the whole corpus, Pali and Latin taken together.
A freqList() is computed per feature, for the one feature it is called on.
So we have a separate frequency list for Pali and for Latin.
If there are words that occur both in Pali and in Latin, that could indeed cause discrepancies.
Let's put our finger on it.
We find the Pali hapaxes that are extra w.r.t. the query results.
hapsFreqList = {x[0] for x in F.pali.freqList() if x[1] == 1}
len(hapsFreqList)
2009
hapsQuery = {F.pali.v(w[0]) for w in paliResults}
len(hapsQuery)
2006
We pick the difference:
hapsFreqList - hapsQuery
{'Atula', 'Buddham', 'saṃsāro'}
Now the corresponding nodes:
nodesPali = {n: F.pali.v(n) for n in F.otype.s("word") if F.pali.v(n) in {'Atula', 'Buddham', 'saṃsāro'}}
nodesPali
{1732: 'saṃsāro', 5511: 'Buddham', 6846: 'Atula'}
We now get the occurrences of these words in the Latin text:
nodesLatin = {n: F.latin.v(n) for n in F.otype.s("word") if F.latin.v(n) in {'Atula', 'Buddham', 'saṃsāro'}}
nodesLatin
{1745: 'saṃsāro', 5492: 'Buddham', 5527: 'Buddham', 5816: 'Buddham', 6868: 'Atula', 8971: 'Buddham'}
Indeed, all these words have Latin occurrences.
The occurrence base of a word is the set of stanzas and vaggas in which it occurs. Let's look for words that occur in a single vagga.
A.indent(reset=True)
A.info("Separating words into Pali and Latin")
words = dict(pali=[], latin=[])
for w in F.otype.s("word"):
    if F.trans.v(w):
        words["latin"].append(w)
    else:
        words["pali"].append(w)
for (lang, ws) in words.items():
    A.info(f"{len(ws):>5} {lang} words")
0.00s Separating words into Pali and Latin 0.01s 5532 pali words 0.01s 7390 latin words
We write a function that collects for each word the vaggas it occurs in.
The function accepts a parameter which holds the words we are interested in.
We use a part of the TF API, L (locality), that will be explained later.
L.u() finds the nodes that embed a given node.
def inVaggas(wordList):
    wordInVagga = collections.defaultdict(set)
    for w in wordList:
        word = F.latin.v(w) if F.trans.v(w) else F.pali.v(w)
        v = L.u(w, otype="vagga")[0]  # L.u returns a tuple; take the single vagga
        wordInVagga[word].add(v)
    return wordInVagga
We call the function for the Pali words and for the Latin words:
wordInVagga = {}
for (lang, ws) in words.items():
    wordInVagga[lang] = inVaggas(ws)
Let's count how many words are confined to exactly one vagga, i.e. words that occur in one vagga or another and nowhere else.
And we want to know how many words occur in exactly 2 vaggas, and so on.
for (lang, invg) in wordInVagga.items():
    print(f"{lang} word distribution over number of vaggas")
    wordDist = collections.Counter()
    for vs in invg.values():
        wordDist[len(vs)] += 1
    for (nv, nw) in sorted(wordDist.items(), key=lambda x: (-x[0], x[1])):
        wPlural = " " if nw == 1 else "s"
        vPlural = " " if nv == 1 else "s"
        print(f"\t{nw:>4} word{wPlural} confined to {nv:>2} vagga{vPlural}")
pali word distribution over number of vaggas 1 word confined to 26 vaggas 1 word confined to 25 vaggas 1 word confined to 22 vaggas 2 words confined to 19 vaggas 1 word confined to 18 vaggas 2 words confined to 17 vaggas 3 words confined to 15 vaggas 4 words confined to 14 vaggas 1 word confined to 13 vaggas 1 word confined to 12 vaggas 2 words confined to 11 vaggas 2 words confined to 10 vaggas 4 words confined to 9 vaggas 8 words confined to 8 vaggas 6 words confined to 7 vaggas 19 words confined to 6 vaggas 14 words confined to 5 vaggas 41 words confined to 4 vaggas 90 words confined to 3 vaggas 272 words confined to 2 vaggas 2284 words confined to 1 vagga latin word distribution over number of vaggas 2 words confined to 26 vaggas 1 word confined to 25 vaggas 1 word confined to 24 vaggas 1 word confined to 22 vaggas 1 word confined to 20 vaggas 1 word confined to 18 vaggas 1 word confined to 17 vaggas 2 words confined to 16 vaggas 5 words confined to 15 vaggas 5 words confined to 14 vaggas 3 words confined to 13 vaggas 3 words confined to 12 vaggas 4 words confined to 11 vaggas 5 words confined to 10 vaggas 11 words confined to 9 vaggas 14 words confined to 8 vaggas 14 words confined to 7 vaggas 25 words confined to 6 vaggas 41 words confined to 5 vaggas 80 words confined to 4 vaggas 154 words confined to 3 vaggas 401 words confined to 2 vaggas 2118 words confined to 1 vagga
It would be interesting to know for each vagga what the proportion is of the words that are confined to it, relative to the total number of words. Vaggas that score higher by this measure are in a sense more extravagant than vaggas that score lower.
Let's compute that list.
We use L.d(), which finds the nodes that are embedded in a given node.
print(f"vagga {'Pali':<13}|{'Latin':<13}")
print(
    "{:<5} {:>4} {:>4} {:>5} | {:>4} {:>4} {:>5}\n{}".format(
        "",
        "#all",
        "#own",
        "%own",
        "#all",
        "#own",
        "%own",
        "-" * 40,
    )
)
vaggaList = []
for v in F.otype.s("vagga"):
    vagga = F.n.v(v)
    ws = L.d(v, otype="word")
    wordsPali = {F.pali.v(w) for w in ws if not F.trans.v(w)}
    allPali = len(wordsPali)
    wordsLatin = {F.latin.v(w) for w in ws if F.trans.v(w)}
    allLatin = len(wordsLatin)
    singlePali = sum(1 for word in wordsPali if len(wordInVagga["pali"][word]) == 1)
    singleLatin = sum(1 for word in wordsLatin if len(wordInVagga["latin"][word]) == 1)
    percentPali = 100 * singlePali / allPali
    percentLatin = 100 * singleLatin / allLatin
    vaggaList.append((vagga, allPali, singlePali, percentPali, allLatin, singleLatin, percentLatin))
for x in sorted(vaggaList, key=lambda e: (-e[3], -e[2], e[1])):
    print("{:<2} {:>4} {:>4} {:>4.1f}% | {:>4} {:>4} {:>4.1f}%".format(*x))
vagga Pali |Latin #all #own %own | #all #own %own ---------------------------------------- 24 258 173 67.1% | 325 152 46.8% 11 125 82 65.6% | 157 86 54.8% 7 106 69 65.1% | 148 53 35.8% 3 103 64 62.1% | 133 64 48.1% 2 116 72 62.1% | 143 63 44.1% 12 110 68 61.8% | 141 52 36.9% 26 343 212 61.8% | 394 184 46.7% 4 137 84 61.3% | 180 79 43.9% 21 120 73 60.8% | 149 59 39.6% 1 171 104 60.8% | 207 86 41.5% 23 155 94 60.6% | 189 89 47.1% 22 142 85 59.9% | 180 80 44.4% 19 137 80 58.4% | 169 71 42.0% 20 168 98 58.3% | 221 97 43.9% 14 167 97 58.1% | 209 90 43.1% 18 195 110 56.4% | 249 117 47.0% 16 86 47 54.7% | 110 48 43.6% 8 117 63 53.8% | 155 64 41.3% 15 108 58 53.7% | 138 57 41.3% 6 149 80 53.7% | 184 80 43.5% 25 207 111 53.6% | 265 105 39.6% 5 161 86 53.4% | 199 81 40.7% 17 131 69 52.7% | 159 64 40.3% 10 176 91 51.7% | 214 102 47.7% 9 118 61 51.7% | 135 46 34.1% 13 113 53 46.9% | 137 49 35.8%
Note that the least extravagant vagga in Pali is also one of the least extravagant vaggas in Latin. And the second most extravagant vagga in Pali is the most extravagant vagga in Latin.
We travel upwards and downwards, forwards and backwards through the nodes.
The Locality API (L) provides functions: u() for going up, d() for going down, n() for going to next nodes, and p() for going to previous nodes.
These directions are indirect notions: nodes are just numbers, but by means of the oslots feature they are linked to slots. One node contains another node if the one is linked to a set of slots that contains the set of slots that the other is linked to.
And one node is next or previous to another if its slots follow or precede the slots of the other one.
L.u(node): up is going to the nodes that embed node.
L.d(node): down is the opposite direction, to the nodes that are contained in node.
L.n(node): next are the adjacent nodes whose first slot comes immediately after the last slot of node.
L.p(node): previous are the adjacent nodes whose last slot comes immediately before the first slot of node.
All these functions yield nodes of all possible node types. By passing an optional parameter, you can restrict the results to nodes of that type.
The results are ordered according to the order of things in the text.
The functions always return a tuple, even if there is just one node in the result.
We go from the 10th word to the vagga that contains it.
Note the [0] at the end. You expect one vagga, yet L returns a tuple.
To get the only element of that tuple, you need that [0].
If you are like me, you keep forgetting it, and that will lead to weird error messages later on.
w = 10
firstVagga = L.u(w, otype="vagga")[0]
print(firstVagga)
A.plain(firstVagga)
16639
The 1 is a hyperlink that takes you to the online version of the vagga.
And let's see all the containing objects of word 10:
for otype in F.otype.all:
    if otype == F.otype.slotType:
        continue
    up = L.u(w, otype=otype)
    upNode = "x" if len(up) == 0 else up[0]
    print("word {} is contained in {} {}".format(w, otype, upNode))
word 10 is contained in vagga 16639 word 10 is contained in stanza 16165 word 10 is contained in sentence 15252 word 10 is contained in clause 12925
Let's go to the next nodes of the first vagga.
afterFirstVagga = L.n(firstVagga)
for n in afterFirstVagga:
    print(
        "{:>7}: {:<13} first slot={:<6}, last slot={:<6}".format(
            n,
            F.otype.v(n),
            E.oslots.s(n)[0],
            E.oslots.s(n)[-1],
        )
    )
secondVagga = L.n(firstVagga, otype="vagga")[0]
687: word first slot=687 , last slot=687 13047: clause first slot=687 , last slot=687 15297: sentence first slot=687 , last slot=687 16186: stanza first slot=687 , last slot=687 16640: vagga first slot=687 , last slot=987
And let's see what is right before the second vagga.
for n in L.p(secondVagga):
    print(
        "{:>7}: {:<13} first slot={:<6}, last slot={:<6}".format(
            n,
            F.otype.v(n),
            E.oslots.s(n)[0],
            E.oslots.s(n)[-1],
        )
    )
16639: vagga first slot=1 , last slot=686 16185: stanza first slot=685 , last slot=686 15296: sentence first slot=685 , last slot=686 13046: clause first slot=685 , last slot=686 686: word first slot=686 , last slot=686
We go to the stanzas of the second vagga, and just count them.
stanzas = L.d(secondVagga, otype="stanza")
print(len(stanzas))
14
We pick the stanza at index 10 (the eleventh stanza) and explore what is above and below it.
s = F.otype.s("stanza")[10]
A.indent(level=0, reset=True)
A.info("Node {}".format(s), tm=False)
A.indent(level=1)
A.info("UP", tm=False)
A.indent(level=2)
A.info("\n".join(["{:<15} {}".format(u, F.otype.v(u)) for u in L.u(s)]), tm=False)
A.indent(level=1)
A.info("DOWN", tm=False)
A.indent(level=2)
A.info("\n".join(["{:<15} {}".format(u, F.otype.v(u)) for u in L.d(s)]), tm=False)
A.indent(level=0)
A.info("Done", tm=False)
Node 16174 | UP | | 16639 vagga | DOWN | | 15272 sentence | | 12980 clause | | 314 word | | 315 word | | 316 word | | 317 word | | 318 word | | 319 word | | 320 word | | 321 word | | 322 word | | 323 word | | 324 word | | 325 word | | 15273 sentence | | 12981 clause | | 326 word | | 327 word | | 328 word | | 329 word | | 12982 clause | | 330 word | | 331 word | | 332 word | | 12983 clause | | 333 word | | 334 word | | 335 word | | 336 word | | 12984 clause | | 337 word | | 338 word | | 339 word | | 340 word | | 341 word | | 342 word Done
So far, we have mainly seen nodes and their numbers, and the names of node types. You would almost forget that we are dealing with text. So let's try to see some text.
In the same way as F gives access to feature data, T gives access to the text.
That is also feature data, but you can tell Text-Fabric which features specifically carry the text, and in return Text-Fabric offers you a Text API: T.
The Dhammapada text can be represented in a number of ways:
If you wonder where the information about text formats is stored: not in the Text-Fabric program, but in the data set.
It has a feature otext, which specifies the formats and which features must be used to produce them.
otext is the third special feature in a TF data set, next to otype and oslots.
It is an optional feature. If it is absent, there will be no T API.
Here is a list of all available formats in this data set.
sorted(T.formats)
['layout-latin-full', 'layout-orig-full', 'layout-pali-full', 'text-latin-full', 'text-orig-full', 'text-pali-full']
We can pretty display in the default format, which is text-orig-full:
s = F.otype.s("stanza")[10]
A.pretty(s, fmt="text-orig-full")
Or Pali only:
A.pretty(s, fmt="text-pali-full")
Or Latin only:
A.pretty(s, fmt="text-latin-full")
This function is central to getting text representations of nodes. Its most basic usage is
T.text(nodes, fmt=fmt)
where nodes is a list or iterable of nodes, usually word nodes, and fmt is the name of a format.
If you leave out fmt, the default text-orig-full is chosen.
The result is the text in that format for all nodes specified:
T.text([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], fmt="text-orig-full")
'Yamakavagga manopubbaṅgamā dhammā manoseṭṭhā manomayā, manasā ce paduṭṭhena bhāsatī vā karoti '
There is also another usage of this function:
T.text(node, fmt=fmt)
where node is a single node.
In this case, the default format is ntype-orig-full, where ntype is the type of node.
If that format is defined in the corpus, it will be used. Otherwise, the word nodes contained in node will be looked up and represented with the default format text-orig-full.
In this way we can sensibly represent a lot of different nodes, such as vaggas, stanzas, sentences, clauses and words.
We compose a set of example nodes and run T.text on them:
exampleNodes = [
    1,
    F.otype.s("sentence")[0],
    F.otype.s("stanza")[0],
    F.otype.s("vagga")[0],
]
exampleNodes
[1, 15251, 16164, 16639]
for n in exampleNodes:
    print(f"This is {F.otype.v(n)} {n}:")
    print(T.text(n))
    print("")
This is word 1: Yamakavagga This is sentence 15251: Yamakavagga This is stanza 16164: Yamakavagga This is vagga 16639: Yamakavagga manopubbaṅgamā dhammā manoseṭṭhā manomayā, manasā ce paduṭṭhena bhāsatī vā karoti vā tato naṃ dukkham anveti cakkaṃ va vahato padaṃ. Naturae a mente principium ducunt, mens est potior pars earum, e mente constant; si (quis) mente inquinata aut loquitur aut agit, tum eum sequitur dolor, ut rota (bovis) vehentis pedem. manopubbaṅgamā dhammā manoseṭṭhā manomayā, manasā ce pasannena bhāsatī vā karoti vā tato naṃ sukham anveti chāyā va anapāyinī. Naturae a mente etc.; si (quis) mente serena aut loquitur aut agit, tum eum sequitur gaudium ut umbra non decedens. "akkocchi maṃ avadhi maṃ ajini maṃ ahāsi me", ye taṃ upanayihanti veraṃ tesaṃ na sammati. "Conviciis me obruit, verberavit me, vicit me, spoliavit me"; qui isto (animo) sese induunt, iracundia eorum non sedatur. "akkocchi maṃ avadhi maṃ ajini maṃ ahāsi me", ye taṃ na upanayhanti veraṃ tes' ūpasammati. "Conviciis etc."; qui isto (animo) sese non induunt, iracundia in iis sedatur. na hi verena verāni sammant' idha kudācanaṃ averena ca sammanti, esa dhammo sanantano. Non enim iracundia iracundiae sedantur hic unquam, placabilitate vero sedantur; haec lex aeterna (est). pare ca na vijānanti: "mayam ettha yamāmase", ye ca tattha vijānanti tato sammanti medhagā. Alieni non intelligunt: nos hic moriemur; qui vero hoc comprehendunt, tum (eorum) sedantur iurgia. subhānupassiṃ viharantaṃ indriyesu asaṃvutaṃ bhojanamhi câmattaññuṃ kusītaṃ hīnavīriyaṃ taṃ ve pasahatī Māro vāto rukkhaṃ va dubbalaṃ. Iucunda spectantem viventem, sensus non coercentem et in cibo modi nescium, socordem, viribus destitutum, eum certe superat Māras, ventus arborem sicut infirmam. asubhānupassiṃ viharantaṃ indriyesu susaṃvutaṃ bhojanamhi ca mattaññuṃ saddhaṃ āraddhavīriyaṃ taṃ [ve] na-ppasahatī Māro vāto selaṃ va pabbataṃ. 
Iucunda non spectantem viventem, sensus bene coercentem et in cibo modum noscentem, fidem habentem, intentis viribus praeditum, eum certe non superat Māras, ventus saxeum volut montem. anikkasāvo kāsāvaṃ yo vatthaṃ paridahessati apeto damasaccena na so kāsāvaṃ arhati. Affectibus non liber qui fulvam vestem induere vult, temperantia et veritate privatus, non ille fulva veste dignus est. yo ca vantakasāv' assa sīlesu susamāhito upeto damasaccena sa ve kāsāvam arhati. Qui vero affectus respuit, virtutibus bene instructus, temperantia et veritate praeditus, ille certe fulva veste dignus est. asāre sāramatino sāre câsāradassino te sāraṃ nâdhigacchanti micchāsaṃkappagocarā. In eo, quod non essentiale, essentiam opinantes atque in essentia nonessentiale videntes, hi essentiam non adeunt, falsi studii participes. sārañ ca sārato ñatvā asārañ ca asārato te sāraṃ adhigacchanti sammāsaṃkappagocarā. Essentiam vero essentiale habentes, et nonessentiale non-essentiale, hi essentiam adeunt, veri studii participes. yathā agāraṃ ducchannaṃ vuṭṭhi samativijjhati evaṃ abhāvitaṃ cittaṃ rāgo samativijjhati. Sicut domum male tectam pluvia perrumpit, ita meditatione destitutam cogitationionem cupido perrumpit. yathā agāraṃ succhannaṃ vuṭṭhi na samativijjhati evaṃ subhāvitaṃ cittaṃ rāgo na samativijjhati. Sicut domum bene tectam pluvia non perrumpit, ita meditabundam cogitationem cupido non perrumpit. idha socati pecca socati pāpakārī ubhayattha socati, so socati so vihaññati disvā kammakiliṭṭham attano. In hoc aevo moeret, morte obita moeret malum patrans, utrobique moeret; ille moeret, ille contristatur videns impuritatem facinoris sui. idha modati pecca modati katapuñño ubhayattha modati, so modati so pamodati disvā kammavisuddhim attano. In hoc aevo gaudet, morte obita gaudet qui bonum perfecit, utrobique gaudet; ille gaudet, ille valde gaudet videns munditiam facinoris sui. idha tappati pecca tappati pāpakārī ubhayattha tappati, "pāpaṃ me katan" ti tappati. 
bhiyyo tappati duggatiṃ gato. In hoc aevo cruciatur, morte obita cruciatur malum patrans, utrobique cruciatur; "malum a me peractum", ita (cogitans) cruciatur, magis cruciatur tartarum ingressus. idha nandati pecca nandati katapuñño, ubhayattha nandati, "puññam me katan" ti nandati. bhiyyo nandati suggatiṃ gato. In hoc aevo gaudet, morte obita gaudet qui bonum perfecit, utrobique gaudet; "bonum a me peractum", ita (cogitans) gaudet, magis gaudet coelum ingressus. bahum pi ce sahitam bhāsamāno na takkaro hoti naro pamatto gopo va gāvo gaṇayam paresaṃ na bhāgavā sāmaññassa hoti. Multa quoque si concinna loquens ea non facit vir socors, bubulcus velut vaccas aliorum numerans, congregationis Samanarum non fit particeps. appam pi ce sahitam bhāsamāno dhammassa hoti anudhammacārī rāgañ ca dosañ ca pahāya mohaṃ sammappajāno suvimuttacitto anupādiyāno idha vā huraṃ vā sa bhāgavā sāmaññassa hoti. Pauca quoque si (quis) concinna loquens secundum legem vitam degit, et cupidinem et odium (et) perturbationem animi relinquens, plane sapiens, cogitatione bene liberata praeditus, nihil appetens vel hic vel illic, is congregationis Samanarum fit particeps. Yamakavaggo paṭhamo
Now let's use those formats to print out the second stanza of the Dhammapada.
secondStanza = F.otype.s("stanza")[1]
for fmt in sorted(T.formats):
    if fmt.startswith("layout"):
        continue
    print("{}:\n{}\n\n".format(fmt, T.text(secondStanza, fmt=fmt)))
text-latin-full: Naturae a mente principium ducunt, mens est potior pars earum, e mente constant; si (quis) mente inquinata aut loquitur aut agit, tum eum sequitur dolor, ut rota (bovis) vehentis pedem. text-orig-full: manopubbaṅgamā dhammā manoseṭṭhā manomayā, manasā ce paduṭṭhena bhāsatī vā karoti vā tato naṃ dukkham anveti cakkaṃ va vahato padaṃ. Naturae a mente principium ducunt, mens est potior pars earum, e mente constant; si (quis) mente inquinata aut loquitur aut agit, tum eum sequitur dolor, ut rota (bovis) vehentis pedem. text-pali-full: manopubbaṅgamā dhammā manoseṭṭhā manomayā, manasā ce paduṭṭhena bhāsatī vā karoti vā tato naṃ dukkham anveti cakkaṃ va vahato padaṃ.
If we do not specify a format, the default format (`text-orig-full`) is used.
T.text(range(1, 12))
'Yamakavagga manopubbaṅgamā dhammā manoseṭṭhā manomayā, manasā ce paduṭṭhena bhāsatī vā karoti '
The important things to remember are:

- you get the plain text of a node `n` in the default format by `T.text(n)`;
- you get the plain text of a node `n` in other formats by `T.text(n, fmt=fmt, descend=True)`.
Part of the pleasure of working with computers is that they can crunch massive amounts of data. The text of the Dhammapada is a piece of cake.
It takes less than a tenth of a second to have that cake and eat it.
import collections

A.indent(reset=True)
A.info("writing plain text of whole Dhammapada in all formats ...")
text = collections.defaultdict(list)
for v in F.otype.s("stanza"):
    for fmt in sorted(T.formats):
        if fmt.startswith("layout"):
            continue
        text[fmt].append(T.text(v, fmt=fmt, descend=True))
A.info("done {} formats".format(len(text)))
0.00s writing plain text of whole Dhammapada in all formats ... 0.06s done 3 formats
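The `collections.defaultdict(list)` used above saves us from initializing an empty list for every format before appending to it. Here is a self-contained illustration of that pattern; the format names and stanza texts are made-up placeholders:

```python
import collections

# Group stanza texts per format without pre-creating the lists
grouped = collections.defaultdict(list)
pairs = [
    ("text-pali-full", "stanza 1"),
    ("text-latin-full", "versus 1"),
    ("text-pali-full", "stanza 2"),
]
for fmt, stanza in pairs:
    # A missing key automatically gets a fresh empty list
    grouped[fmt].append(stanza)

print(dict(grouped))
# {'text-pali-full': ['stanza 1', 'stanza 2'], 'text-latin-full': ['versus 1']}
```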
for fmt in sorted(text):
    print("{}\n{}\n".format(fmt, "\n".join(text[fmt][0:5])))
text-latin-full Naturae a mente principium ducunt, mens est potior pars earum, e mente constant; si (quis) mente inquinata aut loquitur aut agit, tum eum sequitur dolor, ut rota (bovis) vehentis pedem. Naturae a mente etc.; si (quis) mente serena aut loquitur aut agit, tum eum sequitur gaudium ut umbra non decedens. "Conviciis me obruit, verberavit me, vicit me, spoliavit me"; qui isto (animo) sese induunt, iracundia eorum non sedatur. "Conviciis etc."; qui isto (animo) sese non induunt, iracundia in iis sedatur. text-orig-full Yamakavagga manopubbaṅgamā dhammā manoseṭṭhā manomayā, manasā ce paduṭṭhena bhāsatī vā karoti vā tato naṃ dukkham anveti cakkaṃ va vahato padaṃ. Naturae a mente principium ducunt, mens est potior pars earum, e mente constant; si (quis) mente inquinata aut loquitur aut agit, tum eum sequitur dolor, ut rota (bovis) vehentis pedem. manopubbaṅgamā dhammā manoseṭṭhā manomayā, manasā ce pasannena bhāsatī vā karoti vā tato naṃ sukham anveti chāyā va anapāyinī. Naturae a mente etc.; si (quis) mente serena aut loquitur aut agit, tum eum sequitur gaudium ut umbra non decedens. "akkocchi maṃ avadhi maṃ ajini maṃ ahāsi me", ye taṃ upanayihanti veraṃ tesaṃ na sammati. "Conviciis me obruit, verberavit me, vicit me, spoliavit me"; qui isto (animo) sese induunt, iracundia eorum non sedatur. "akkocchi maṃ avadhi maṃ ajini maṃ ahāsi me", ye taṃ na upanayhanti veraṃ tes' ūpasammati. "Conviciis etc."; qui isto (animo) sese non induunt, iracundia in iis sedatur. text-pali-full Yamakavagga manopubbaṅgamā dhammā manoseṭṭhā manomayā, manasā ce paduṭṭhena bhāsatī vā karoti vā tato naṃ dukkham anveti cakkaṃ va vahato padaṃ. manopubbaṅgamā dhammā manoseṭṭhā manomayā, manasā ce pasannena bhāsatī vā karoti vā tato naṃ sukham anveti chāyā va anapāyinī. "akkocchi maṃ avadhi maṃ ajini maṃ ahāsi me", ye taṃ upanayihanti veraṃ tesaṃ na sammati. "akkocchi maṃ avadhi maṃ ajini maṃ ahāsi me", ye taṃ na upanayhanti veraṃ tes' ūpasammati.
We write those formats to file, in your Downloads folder.
import os

for fmt in sorted(T.formats):
    if fmt.startswith("layout"):
        continue
    with open(os.path.expanduser(f"~/Downloads/{fmt}.txt"), "w") as f:
        f.write("\n".join(text[fmt]))
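If you want to try out the write-out pattern without touching your Downloads folder, the same logic can be exercised in a temporary directory. This is a self-contained sketch with invented stanza texts, not the corpus data:

```python
import os
import tempfile

# Stand-in for the text dict built from the corpus above
text = {
    "text-pali-full": ["stanza one", "stanza two"],
    "text-latin-full": ["versus unus", "versus duo"],
}
outdir = tempfile.mkdtemp()
for fmt in sorted(text):
    with open(os.path.join(outdir, f"{fmt}.txt"), "w") as f:
        f.write("\n".join(text[fmt]))

# Read one file back: each stanza sits on its own line
with open(os.path.join(outdir, "text-pali-full.txt")) as f:
    lines = f.read().split("\n")
print(lines)  # ['stanza one', 'stanza two']
```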
Text-Fabric pre-computes data for you, so that it can be loaded faster. If the original data is updated, Text-Fabric detects it, and will recompute that data.
But there are cases, e.g. when the algorithms of Text-Fabric have changed while the data has not, in which you may want to clear the cache of precomputed results.
There are two ways to do that:

- navigate to the `.tf` directory of your dataset and remove all `.tfx` files in it; this may be a bit awkward, because the `.tf` directory is hidden on Unix-like systems;
- call `TF.clearCache()`, which does exactly the same.

It is not handy to execute the following cell all the time; that is why I have commented it out. So if you really want to clear the cache, remove the comment sign below.
# TF.clearCache()
By now you have an impression of how to compute your way around the Dhammapada. While this is still the beginning, I hope you already sense the power of unlimited programmatic access to all the bits and bytes in the data set.
Here are a few directions for unleashing that power.
CC-BY Dirk Roorda