This notebook gets you started with using Text-Fabric for coding in the Dhammapada.
Familiarity with the underlying data model is recommended.
Short introductions to other TF datasets:
If you start computing with this tutorial, first copy its parent directory to somewhere else,
outside your dhammapada
directory.
If you pull changes from the dhammapada
repository later, your work will not be overwritten.
Where you put your tutorial directory is up to you.
It will work from any directory.
Text-Fabric will fetch a standard set of features for you from the newest GitHub release binaries.
It will fetch version 0.1.
The data will be stored in the text-fabric-data directory in your home directory.
The simplest way to get going is by this incantation:
from tf.app import use
For the very latest version, use hot.
For the latest release, use latest.
If you have cloned the repos (TF app and data), use clone.
If you do not want/need to upgrade, leave out the checkout specifiers.
A = use('etcbc/dhammapada:hot', hoist=globals())
rate limit is 5000 requests per hour, with 4999 left for this hour connecting to online GitHub repo etcbc/dhammapada ... connected app/__init__.py...downloaded app/app.py...downloaded app/config.yaml...downloaded app/static...directory app/static/display.css...downloaded app/static/logo.png...downloaded OK
This is Text-Fabric 9.2.0 Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html 16 features found and 0 ignored
The data of the Dhammapada is organized in features. They are columns of data. Think of the corpus as a big spreadsheet, where row 1 corresponds to the first word, row 2 to the second word, and so on, for all 13,000 words.
One column contains the letters of each Pali word.
Another column contains the letters of each Latin word.
There are columns which tell whether words are parts of quotations, or between [ ] (uncertain), or between ( ) (for clarity), and so on.
Instead of putting that information in one big table, the data is organized in separate columns. We call those columns features.
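The column idea can be sketched in plain Python. This is a toy illustration, not how Text-Fabric stores its data: the word values below are real (they are the first words of the corpus), but the freq_occ numbers are made up.

```python
# Features sketched as plain-Python "columns": each feature maps a
# node number (the row) to that node's value in the column.
# The word values are from the corpus; the freq_occ numbers are made up.
pali = {1: "Yamakavagga", 2: "manopubbaṅgamā", 3: "dhammā"}
freq_occ = {1: 1, 2: 2, 3: 7}

# Reading "row 2" across both columns gives all data for word 2:
w = 2
print(pali[w], freq_occ[w])
```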
By clicking on the triangle in front of Dhammapada-Latine you can see which features have been loaded, with a short description, and from there you can expand more information. If you click on a feature name, you find its documentation. If you hover over a name, you see where the feature is located on your system.
Edge features are marked by *bold italic* formatting.
We only have one edge feature: oslots, which is a standard TF feature.
Corpora might add more edge features, and probably newer versions of this corpus will have edge features.
The result of the incantation is that we have a bunch of special variables at our disposal that give us access to the corpus.
At this point it is helpful to throw a quick glance at the text-fabric API documentation (see the links under API Members above).
The most essential thing for now is that we can use F to access the data in the features we've loaded.
But there is more, such as N, which helps us to walk over the text, as we will see in a minute.
The API members above show you exactly which new names have been inserted in your namespace. If you click on these names, you go to the API documentation for them.
Text-Fabric contains a flexible search engine that not only works for the data of this corpus, but also for data that you add to it.
Search is the quickest way to come up to speed with your data, without too much programming.
For example, let's display a number of words with frequencies higher than some threshold.
query = """
word freq_occ>20
"""
results = A.search(query)
A.show(results, start=1, end=5, condenseType="clause", condensed=True)
A.displayReset("tupleFeatures")
0.01s 2604 results
clause 1
clause 2
clause 3
clause 4
clause 5
Jump to the dedicated search tutorial first, to whet your appetite further.
The real power of search lies in the fact that it is integrated in a programming environment. You can use programming to:
Therefore, the rest of this tutorial is still important when you want to tap that power. If you continue here, you learn all the basics of data-navigation with Text-Fabric.
Before we start coding, we load some modules that we will need along the way:
%load_ext autoreload
%autoreload 2
import os
import collections
from itertools import chain
In order to get acquainted with the data, we start with the simple task of counting.
We use the N.walk() generator to walk through the nodes.
We compared the corpus data to a gigantic spreadsheet, where the rows correspond to the words.
In Text-Fabric, we call the rows slots, because they are the textual positions that can be filled with words.
Besides the words there are other objects: clauses, sentences, stanzas, vaggas. They also correspond to rows in the big spreadsheet.
In Text-Fabric we call all these rows nodes, and the N.walk() generator carries us through those nodes in the textual order.
Just one extra thing: the info statements generate timed messages.
If you use them instead of print, you'll get a sense of the amount of time that the various processing steps typically need.
A.indent(reset=True)
A.info("Counting nodes ...")
i = 0
for n in N.walk():
    i += 1
A.info("{} nodes".format(i))
0.00s Counting nodes ... 0.00s 16664 nodes
Every node has a type, like word, clause, or sentence. We know that we have approximately 13,000 words and some 3,500 other nodes. But what exactly are they?
Text-Fabric has two special features, otype and oslots, that must occur in every Text-Fabric data set.
otype tells you the type of each node, and you can ask for the number of slots in the text.
Here we go!
F.otype.slotType
'word'
F.otype.maxSlot
12922
F.otype.maxNode
16664
F.otype.all
('vagga', 'stanza', 'sentence', 'clause', 'word')
C.levels.data
(('vagga', 497.0, 16639, 16664), ('stanza', 27.20421052631579, 16164, 16638), ('sentence', 14.153340635268346, 15251, 16163), ('clause', 5.5506872852233675, 12923, 15250), ('word', 1, 1, 12922))
This is interesting: above you see all the textual objects, with the average size of their objects, the node where they start, and the node where they end.
This is an intuitive way to count the number of nodes in each type.
Note in passing how we use indent in conjunction with info to produce neatly timed and indented progress messages.
A.indent(reset=True)
A.info("counting objects ...")
for otype in F.otype.all:
    i = 0
    A.indent(level=1, reset=True)
    for n in F.otype.s(otype):
        i += 1
    A.info("{:>7} {}s".format(i, otype))
A.indent(level=0)
A.info("Done")
0.00s counting objects ... | 0.00s 26 vaggas | 0.00s 475 stanzas | 0.00s 913 sentences | 0.00s 2328 clauses | 0.00s 12922 words 0.00s Done
We use the A API (the extra power) to peek into the corpus.
First some words. Just to make sure that node 1 has type "word":
F.otype.v(1)
'word'
Some words in plain view:
wordShows = (90, 2007, 9001)
for word in wordShows:
    A.plain(word, withPassage=True)
You see, words can be Pali or Latin.
Before the words you see the vagga and stanza references. There is in fact a hyperlink underneath them. Click on it, and you go to the same stanza online, on the Tipitaka site. This site provides an English translation and commentary.
We can improve the layout a bit by setting the text format to a different value:
A.displaySetup(fmt="layout-orig-full")
We do the same command again:
wordShows = (90, 2007, 9001)
for word in wordShows:
    A.plain(word, withPassage=True)
You can leave out the passage reference:
for word in wordShows:
    A.plain(word, withPassage=False)
Now we show other objects, both with and without passage reference.
normalShow = dict(
    wordShow=wordShows[0],
    clauseShow=13290,
    sentenceShow=15228,
)
sectionShow = dict(
    stanzaShow=16431,
    vaggaShow=16580,
)
for (name, n) in normalShow.items():
    A.dm(f"**{name}** = node `{n}`\n")
    A.plain(n)
    A.plain(n, withPassage=False)
    A.dm("\n---\n")
for (name, n) in sectionShow.items():
    if name == "verseShow":
        continue
    A.dm(f"**{name}** = node `{n}`\n")
    A.plain(n)
    A.plain(n, withPassage=False)
    A.dm("\n---\n")
stanzaShow = node 16431
vaggaShow = node 16580
Note that for vagga nodes the withPassage parameter has little effect.
The passage is the thing that is hyperlinked. The node is represented as a textual reference to the piece of text in question.
We can also dive into the structure of the textual objects, provided they are not too large.
The function pretty gives a display of the object that a node stands for, together with the structure below that node.
for (name, n) in normalShow.items():
    A.dm(f"**{name}** = node `{n}`\n")
    A.pretty(n)
    A.dm("\n---\n")
Note: if you need a link to the Tipitaka site for just any node:
tenthousand = 10000
A.webLink(tenthousand)
We can show some standard features in the display:
for (name, n) in list(normalShow.items()) + list(sectionShow.items()):
    A.dm(f"**{name}** = node `{n}`\n")
    A.pretty(n, standardFeatures=True)
    A.dm("\n---\n")
wordShow = node 90
clauseShow = node 13290
sentenceShow = node 15228
stanzaShow = node 16431
vaggaShow = node 16580
Or we can ask for a specific feature to show up:
for (name, n) in list(normalShow.items()):
    A.dm(f"**{name}** = node `{n}`\n")
    A.pretty(n, extraFeatures="freq_occ")
    A.dm("\n---\n")
F gives access to all features.
Every feature has a method freqList() to generate a frequency list of its values, higher frequencies first.
Here is a top 20 of the Pali words:
F.pali.freqList()[0:20]
(('ca', 181), ('na', 143), ('va', 73), ('yo', 54), ("n'", 47), ('atthi', 41), ('tam', 38), ('so', 36), ('hi', 35), ('hoti', 33), ('taṃ', 30), ('ve', 30), ('te', 28), ('pi', 26), ('attano', 24), ('ce', 24), ('etaṃ', 22), ('eva', 22), ('vā', 22), ('bhikkhu', 21))
And here for Latin:
F.latin.freqList()[0:20]
(('non', 220), ('et', 150), ('est', 137), ('in', 120), ('velut', 66), ('qui', 64), ('eum', 48), ('homo', 39), ('hoc', 37), ('vel', 37), ('Non', 36), ('dico', 36), ('is', 35), ('ego', 34), ('ad', 33), ('brāhmanam', 33), ('sapiens', 33), ('fit', 32), ('a', 30), ('gaudium', 30))
Let's do some fancier word stuff.
A hapax is a word that occurs only once. Note that we do not (yet) have lexeme information, so all we count are word occurrences. We are oblivious to the fact that the same lexeme may occur in several forms.
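The idea behind such counts can be sketched in plain Python with collections.Counter. The word list below is made up for illustration, not taken from the corpus:

```python
from collections import Counter

# A hapax is a word form that occurs exactly once.
# Toy word list (not corpus data) to illustrate the idea:
words = ["ca", "na", "ca", "atthi", "bhikkhu", "ca", "atthi"]
freq = Counter(words)
hapaxes = sorted(w for (w, n) in freq.items() if n == 1)
print(hapaxes)  # the forms with frequency 1
```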
We print 10 Pali hapaxes and 10 Latin hapaxes.
Let's do it with search templates.
Remember that we have a feature trans that indicates whether an object belongs to the Pali text or to the Latin text.
But we forgot the details. Let's call them up!
A.isLoaded("trans")
trans node (int) whether the node belongs to the original text or a translation
Good, but a little bit more info please:
A.isLoaded("trans", pretty=True, meta=True)
trans node (int) converters = Dirk Roorda (Text-Fabric) copynote1 = Digitisation supported by Shri Brihad Bhartiya Samaj 20 February 2020 dateWritten = 2021-12-24T14:49:10Z description = whether the node belongs to the original text or a translation digitizers = Bee Scherer, Yvonne Mataar edition = 2nd editor = V. Fausboll format = 1 (=Latin translation) or absent (=Pali original) institute = Text and Traditions, VU Amsterdam language = pli,lat place = London project = Dhammapada-latine publisher = Luzac & Co. researcher = Bee Scherer sourceFormat = plain text stamp = 50480 subtitle = being a collection of moral verses in Pali title = The Dhammapada version = 0.2 writtenBy = Text-Fabric yearPublished = 1900
We see under key format: value 1 means Latin, absence of a value means Pali.
In queries, we can select for exactly that:
trans# means: feature trans does not have a value for the node
trans means: feature trans has a value for the node
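The distinction between "has a value" and "has no value" can be mimicked in plain Python (with invented toy node numbers): a node has a value if it occurs as a key in the feature mapping.

```python
# Toy sketch (invented node numbers, not corpus data).
# The trans feature assigns 1 to translated (Latin) words;
# for Pali words it is simply absent, which is what `trans#` selects.
trans = {4: 1, 5: 1}           # nodes 4 and 5 carry a value
allWords = [1, 2, 3, 4, 5]

paliWords = [w for w in allWords if w not in trans]   # like `word trans#`
latinWords = [w for w in allWords if w in trans]      # like `word trans`
print(paliWords, latinWords)
```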
So here are two templates: one for the Pali hapaxes and one for the Latin hapaxes. We run them both.
query = """
word trans# freq_occ=1
"""
paliResults = A.search(query, sort=True)
query = """
word trans freq_occ=1
"""
latinResults = A.search(query, sort=True)
0.01s 2006 results 0.01s 1841 results
Now we print the first 10 results of both:
A.table(paliResults, end=10)
A.table(latinResults, end=10)
A.displayReset("tupleFeatures")
We can also get hapaxes by means of ordinary Python programming. We show this lower-level way of working as well, because we are going to need it.
We use the features freq_occ and trans again.
paliHapaxes = []
latinHapaxes = []
for w in F.otype.s("word"):
    if F.freq_occ.v(w) == 1:
        if F.trans.v(w):
            latinHapaxes.append(F.latin.v(w))
        else:
            paliHapaxes.append(F.pali.v(w))
    if len(paliHapaxes) >= 10 and len(latinHapaxes) >= 10:
        break
print("pali-hapaxes")
for hapax in paliHapaxes[0:10]:
    print(hapax)
print("\nlatin-hapaxes")
for hapax in latinHapaxes[0:10]:
    print(hapax)
pali-hapaxes Yamakavagga paduṭṭhena cakkaṃ vahato pasannena chāyā anapāyinī upanayihanti sammati upanayhanti latin-hapaxes principium potior pars earum constant inquinata rota bovis vehentis pedem
There is yet another, quite different way of getting the hapaxes:
we use the function freqList() that is available for every feature in every Text-Fabric dataset.
It produces a frequency list of the values of that feature.
for lang in ("pali", "latin"):
    hapaxes = sorted(word for (word, freq) in Fs(lang).freqList() if freq == 1)
    print(f"{len(hapaxes):>4} {lang}-hapaxes")
    for hapax in hapaxes[0:10]:
        print(f"\t{hapax}")
2009 pali-hapaxes 'bhivaḍḍhati 'ham 'samānasaṃvāso 'taro 'tivākyaṃ 'yaṃ *1 *2 *3 Antako 1843 latin-hapaxes -omni Ac Ad Admoneat Aetatem Affectibus Alia Alieni Aliis Aliorum
This gives us hapaxes indeed, but sorted by the word form. Before, we got them in the order in which they show up in the text.
Additionally, we see how many hapaxes there are in the corpus.
But, wait a minute: the numbers do not agree!
The query says: 2006 and 1841 hapaxes.
Above we get: 2009 and 1843 ones.
How can that be?
Well, the query looks for true hapaxes, words that occur only once in the whole corpus, Pali and Latin taken together.
A freqList() is computed per feature, for the one feature it is called on.
So we have a separate frequency list for Pali and for Latin.
If there are words that occur both in Pali and in Latin, that could indeed cause discrepancies.
Let's put our finger on it.
We find the Pali hapaxes that are extra w.r.t. the query results.
hapsFreqList = {x[0] for x in F.pali.freqList() if x[1] == 1}
len(hapsFreqList)
2009
hapsQuery = {F.pali.v(w[0]) for w in paliResults}
len(hapsQuery)
2006
We pick the difference:
hapsFreqList - hapsQuery
{'Atula', 'Buddham', 'saṃsāro'}
Now the corresponding nodes:
nodesPali = {n: F.pali.v(n) for n in F.otype.s("word") if F.pali.v(n) in {'Atula', 'Buddham', 'saṃsāro'}}
nodesPali
{1732: 'saṃsāro', 5511: 'Buddham', 6846: 'Atula'}
We now get the occurrences of these words in the Latin text:
nodesLatin = {n: F.latin.v(n) for n in F.otype.s("word") if F.latin.v(n) in {'Atula', 'Buddham', 'saṃsāro'}}
nodesLatin
{1745: 'saṃsāro', 5492: 'Buddham', 5527: 'Buddham', 5816: 'Buddham', 6868: 'Atula', 8971: 'Buddham'}
Indeed, all these words have Latin occurrences.
The occurrence base of a word is the set of stanzas and vaggas in which it occurs. Let's look for words that occur in a single vagga.
A.indent(reset=True)
A.info("Separating words into Pali and Latin")
words = dict(pali=[], latin=[])
for w in F.otype.s("word"):
    if F.trans.v(w):
        words["latin"].append(w)
    else:
        words["pali"].append(w)
for (lang, ws) in words.items():
    A.info(f"{len(ws):>5} {lang} words")
0.00s Separating words into Pali and Latin 0.01s 5532 pali words 0.01s 7390 latin words
We write a function that collects for each word the vaggas it occurs in.
The function accepts a parameter which holds the words we are interested in.
We use a part of the TF API, L (locality), that will be explained later.
L.u() finds the nodes that embed a given node.
def inVaggas(wordList):
    wordInVagga = collections.defaultdict(set)
    for w in wordList:
        word = F.latin.v(w) if F.trans.v(w) else F.pali.v(w)
        v = L.u(w, otype="vagga")[0]  # L.u returns a tuple; take the single vagga
        wordInVagga[word].add(v)
    return wordInVagga
We call the function for the Pali words and for the Latin words:
wordInVagga = {}
for (lang, ws) in words.items():
    wordInVagga[lang] = inVaggas(ws)
Let's count how many words are confined to exactly one vagga, i.e. words that occur in one vagga or another and nowhere else.
And we want to know how many words occur in exactly 2 vaggas, and so on.
for (lang, invg) in wordInVagga.items():
    print(f"{lang} word distribution over number of vaggas")
    wordDist = collections.Counter()
    for vs in invg.values():
        wordDist[len(vs)] += 1
    for (nv, nw) in sorted(wordDist.items(), key=lambda x: (-x[0], x[1])):
        wPlural = " " if nw == 1 else "s"
        vPlural = " " if nv == 1 else "s"
        print(f"\t{nw:>4} word{wPlural} confined to {nv:>2} vagga{vPlural}")
pali word distribution over number of vaggas 1 word confined to 26 vaggas 1 word confined to 25 vaggas 1 word confined to 22 vaggas 2 words confined to 19 vaggas 1 word confined to 18 vaggas 2 words confined to 17 vaggas 3 words confined to 15 vaggas 4 words confined to 14 vaggas 1 word confined to 13 vaggas 1 word confined to 12 vaggas 2 words confined to 11 vaggas 2 words confined to 10 vaggas 4 words confined to 9 vaggas 8 words confined to 8 vaggas 6 words confined to 7 vaggas 19 words confined to 6 vaggas 14 words confined to 5 vaggas 41 words confined to 4 vaggas 90 words confined to 3 vaggas 272 words confined to 2 vaggas 2284 words confined to 1 vagga latin word distribution over number of vaggas 2 words confined to 26 vaggas 1 word confined to 25 vaggas 1 word confined to 24 vaggas 1 word confined to 22 vaggas 1 word confined to 20 vaggas 1 word confined to 18 vaggas 1 word confined to 17 vaggas 2 words confined to 16 vaggas 5 words confined to 15 vaggas 5 words confined to 14 vaggas 3 words confined to 13 vaggas 3 words confined to 12 vaggas 4 words confined to 11 vaggas 5 words confined to 10 vaggas 11 words confined to 9 vaggas 14 words confined to 8 vaggas 14 words confined to 7 vaggas 25 words confined to 6 vaggas 41 words confined to 5 vaggas 80 words confined to 4 vaggas 154 words confined to 3 vaggas 401 words confined to 2 vaggas 2118 words confined to 1 vagga
It would be interesting to know for each vagga what the proportion is of the words that are confined to it, relative to the total number of words. Vaggas that score higher by this measure are in a sense more extravagant than vaggas that score lower.
Let's compute that list.
We use L.d(), which finds the nodes that are embedded in a given node.
print(f"vagga {'Pali':<13}|{'Latin':<13}")
print(
    "{:<5} {:>4} {:>4} {:>5} | {:>4} {:>4} {:>5}\n{}".format(
        "",
        "#all",
        "#own",
        "%own",
        "#all",
        "#own",
        "%own",
        "-" * 40,
    )
)
vaggaList = []
for v in F.otype.s("vagga"):
    vagga = F.n.v(v)
    ws = L.d(v, otype="word")
    wordsPali = {F.pali.v(w) for w in ws if not F.trans.v(w)}
    allPali = len(wordsPali)
    wordsLatin = {F.latin.v(w) for w in ws if F.trans.v(w)}
    allLatin = len(wordsLatin)
    singlePali = sum(1 for word in wordsPali if len(wordInVagga["pali"][word]) == 1)
    singleLatin = sum(1 for word in wordsLatin if len(wordInVagga["latin"][word]) == 1)
    percentPali = 100 * singlePali / allPali
    percentLatin = 100 * singleLatin / allLatin
    vaggaList.append((vagga, allPali, singlePali, percentPali, allLatin, singleLatin, percentLatin))
for x in sorted(vaggaList, key=lambda e: (-e[3], -e[2], e[1])):
    print("{:<2} {:>4} {:>4} {:>4.1f}% | {:>4} {:>4} {:>4.1f}%".format(*x))
vagga Pali |Latin #all #own %own | #all #own %own ---------------------------------------- 24 258 173 67.1% | 325 152 46.8% 11 125 82 65.6% | 157 86 54.8% 7 106 69 65.1% | 148 53 35.8% 3 103 64 62.1% | 133 64 48.1% 2 116 72 62.1% | 143 63 44.1% 12 110 68 61.8% | 141 52 36.9% 26 343 212 61.8% | 394 184 46.7% 4 137 84 61.3% | 180 79 43.9% 21 120 73 60.8% | 149 59 39.6% 1 171 104 60.8% | 207 86 41.5% 23 155 94 60.6% | 189 89 47.1% 22 142 85 59.9% | 180 80 44.4% 19 137 80 58.4% | 169 71 42.0% 20 168 98 58.3% | 221 97 43.9% 14 167 97 58.1% | 209 90 43.1% 18 195 110 56.4% | 249 117 47.0% 16 86 47 54.7% | 110 48 43.6% 8 117 63 53.8% | 155 64 41.3% 15 108 58 53.7% | 138 57 41.3% 6 149 80 53.7% | 184 80 43.5% 25 207 111 53.6% | 265 105 39.6% 5 161 86 53.4% | 199 81 40.7% 17 131 69 52.7% | 159 64 40.3% 10 176 91 51.7% | 214 102 47.7% 9 118 61 51.7% | 135 46 34.1% 13 113 53 46.9% | 137 49 35.8%
Note that the least extravagant vagga in Pali is also one of the least extravagant vaggas in Latin. And the second most extravagant vagga in Pali is the most extravagant vagga in Latin.
We travel upwards and downwards, forwards and backwards through the nodes.
The Locality API (L) provides functions: u() for going up, d() for going down, n() for going to next nodes, and p() for going to previous nodes.
These directions are indirect notions: nodes are just numbers, but by means of the oslots feature they are linked to slots. One node contains another node if the one is linked to a set of slots that contains the set of slots that the other is linked to.
And one node is next or previous to another if its slots follow or precede the slots of the other one.
L.u(node): up is going to the nodes that embed node.
L.d(node): down is the opposite direction, to the nodes that are contained in node.
L.n(node): next are the adjacent nodes whose first slot comes immediately after the last slot of node.
L.p(node): previous are the adjacent nodes whose last slot comes immediately before the first slot of node.
All these functions yield nodes of all possible node types. By passing an optional parameter, you can restrict the results to nodes of that type.
The results are ordered according to the order of things in the text.
The functions always return a tuple, even if there is just one node in the result.
We go from the 10th word to the vagga that contains it.
Note the [0] at the end. You expect one vagga, yet L returns a tuple.
To get the only element of that tuple, you need that [0].
If you are like me, you keep forgetting it, and that will lead to weird error messages later on.
w = 10
firstVagga = L.u(w, otype="vagga")[0]
print(firstVagga)
A.plain(firstVagga)
16639
The 1 is a hyperlink that takes you to the online version of the vagga.
And let's see all the containing objects of word 10:
for otype in F.otype.all:
    if otype == F.otype.slotType:
        continue
    up = L.u(w, otype=otype)
    upNode = "x" if len(up) == 0 else up[0]
    print("word {} is contained in {} {}".format(w, otype, upNode))
word 10 is contained in vagga 16639 word 10 is contained in stanza 16165 word 10 is contained in sentence 15252 word 10 is contained in clause 12925
Let's go to the next nodes of the first vagga.
afterFirstVagga = L.n(firstVagga)
for n in afterFirstVagga:
    print(
        "{:>7}: {:<13} first slot={:<6}, last slot={:<6}".format(
            n,
            F.otype.v(n),
            E.oslots.s(n)[0],
            E.oslots.s(n)[-1],
        )
    )
secondVagga = L.n(firstVagga, otype="vagga")[0]
687: word first slot=687 , last slot=687 13047: clause first slot=687 , last slot=687 15297: sentence first slot=687 , last slot=687 16186: stanza first slot=687 , last slot=687 16640: vagga first slot=687 , last slot=987
And let's see what is right before the second vagga.
for n in L.p(secondVagga):
    print(
        "{:>7}: {:<13} first slot={:<6}, last slot={:<6}".format(
            n,
            F.otype.v(n),
            E.oslots.s(n)[0],
            E.oslots.s(n)[-1],
        )
    )
16639: vagga first slot=1 , last slot=686 16185: stanza first slot=685 , last slot=686 15296: sentence first slot=685 , last slot=686 13046: clause first slot=685 , last slot=686 686: word first slot=686 , last slot=686
We go to the stanzas of the second vagga, and just count them.
stanzas = L.d(secondVagga, otype="stanza")
print(len(stanzas))
14
We pick the stanza at index 10 (the eleventh stanza) and explore what is above and below it.
s = F.otype.s("stanza")[10]
A.indent(level=0, reset=True)
A.info("Node {}".format(s), tm=False)
A.indent(level=1)
A.info("UP", tm=False)
A.indent(level=2)
A.info("\n".join(["{:<15} {}".format(u, F.otype.v(u)) for u in L.u(s)]), tm=False)
A.indent(level=1)
A.info("DOWN", tm=False)
A.indent(level=2)
A.info("\n".join(["{:<15} {}".format(u, F.otype.v(u)) for u in L.d(s)]), tm=False)
A.indent(level=0)
A.info("Done", tm=False)
Node 16174 | UP | | 16639 vagga | DOWN | | 15272 sentence | | 12980 clause | | 314 word | | 315 word | | 316 word | | 317 word | | 318 word | | 319 word | | 320 word | | 321 word | | 322 word | | 323 word | | 324 word | | 325 word | | 15273 sentence | | 12981 clause | | 326 word | | 327 word | | 328 word | | 329 word | | 12982 clause | | 330 word | | 331 word | | 332 word | | 12983 clause | | 333 word | | 334 word | | 335 word | | 336 word | | 12984 clause | | 337 word | | 338 word | | 339 word | | 340 word | | 341 word | | 342 word Done
So far, we have mainly seen nodes and their numbers, and the names of node types. You would almost forget that we are dealing with text. So let's try to see some text.
In the same way as F gives access to feature data, T gives access to the text.
That is also feature data, but you can tell Text-Fabric which features specifically carry the text, and in return Text-Fabric offers you a Text API: T.
The Dhammapada text can be represented in a number of ways:
If you wonder where the information about text formats is stored: not in the Text-Fabric program, but in the data set.
It has a feature otext, which specifies the formats and which features must be used to produce them.
otext is the third special feature in a TF data set, next to otype and oslots.
It is an optional feature. If it is absent, there will be no T API.
Here is a list of all available formats in this data set.
sorted(T.formats)
['layout-latin-full', 'layout-orig-full', 'layout-pali-full', 'text-latin-full', 'text-orig-full', 'text-pali-full']
We can pretty display in the default format, which is text-orig-full:
s = F.otype.s("stanza")[10]
A.pretty(s, fmt="text-orig-full")
Or Pali only:
A.pretty(s, fmt="text-pali-full")
Or Latin only:
A.pretty(s, fmt="text-latin-full")
This function is central to getting text representations of nodes. Its most basic usage is
T.text(nodes, fmt=fmt)
where nodes is a list or iterable of nodes, usually word nodes, and fmt is the name of a format.
If you leave out fmt, the default text-orig-full is chosen.
The result is the text in that format for all nodes specified:
T.text([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], fmt="text-orig-full")
'Yamakavagga manopubbaṅgamā dhammā manoseṭṭhā manomayā, manasā ce paduṭṭhena bhāsatī vā karoti '
There is also another usage of this function:
T.text(node, fmt=fmt)
where node is a single node.
In this case, the default format is ntype-orig-full, where ntype is the type of node.
If that format is defined in the corpus, it will be used. Otherwise, the word nodes contained in node will be looked up and represented with the default format text-orig-full.
In this way we can sensibly represent a lot of different nodes, such as vaggas, stanzas, sentences, clauses and words.
We compose a set of example nodes and run T.text on them:
exampleNodes = [
    1,
    F.otype.s("sentence")[0],
    F.otype.s("stanza")[0],
    F.otype.s("vagga")[0],
]
exampleNodes
[1, 15251, 16164, 16639]
for n in exampleNodes:
    print(f"This is {F.otype.v(n)} {n}:")
    print(T.text(n))
    print("")
This is word 1: Yamakavagga This is sentence 15251: Yamakavagga This is stanza 16164: Yamakavagga This is vagga 16639: Yamakavagga manopubbaṅgamā dhammā manoseṭṭhā manomayā, manasā ce paduṭṭhena bhāsatī vā karoti vā tato naṃ dukkham anveti cakkaṃ va vahato padaṃ. Naturae a mente principium ducunt, mens est potior pars earum, e mente constant; si (quis) mente inquinata aut loquitur aut agit, tum eum sequitur dolor, ut rota (bovis) vehentis pedem. manopubbaṅgamā dhammā manoseṭṭhā manomayā, manasā ce pasannena bhāsatī vā karoti vā tato naṃ sukham anveti chāyā va anapāyinī. Naturae a mente etc.; si (quis) mente serena aut loquitur aut agit, tum eum sequitur gaudium ut umbra non decedens. "akkocchi maṃ avadhi maṃ ajini maṃ ahāsi me", ye taṃ upanayihanti veraṃ tesaṃ na sammati. "Conviciis me obruit, verberavit me, vicit me, spoliavit me"; qui isto (animo) sese induunt, iracundia eorum non sedatur. "akkocchi maṃ avadhi maṃ ajini maṃ ahāsi me", ye taṃ na upanayhanti veraṃ tes' ūpasammati. "Conviciis etc."; qui isto (animo) sese non induunt, iracundia in iis sedatur. na hi verena verāni sammant' idha kudācanaṃ averena ca sammanti, esa dhammo sanantano. Non enim iracundia iracundiae sedantur hic unquam, placabilitate vero sedantur; haec lex aeterna (est). pare ca na vijānanti: "mayam ettha yamāmase", ye ca tattha vijānanti tato sammanti medhagā. Alieni non intelligunt: nos hic moriemur; qui vero hoc comprehendunt, tum (eorum) sedantur iurgia. subhānupassiṃ viharantaṃ indriyesu asaṃvutaṃ bhojanamhi câmattaññuṃ kusītaṃ hīnavīriyaṃ taṃ ve pasahatī Māro vāto rukkhaṃ va dubbalaṃ. Iucunda spectantem viventem, sensus non coercentem et in cibo modi nescium, socordem, viribus destitutum, eum certe superat Māras, ventus arborem sicut infirmam. asubhānupassiṃ viharantaṃ indriyesu susaṃvutaṃ bhojanamhi ca mattaññuṃ saddhaṃ āraddhavīriyaṃ taṃ [ve] na-ppasahatī Māro vāto selaṃ va pabbataṃ. 
Iucunda non spectantem viventem, sensus bene coercentem et in cibo modum noscentem, fidem habentem, intentis viribus praeditum, eum certe non superat Māras, ventus saxeum volut montem. anikkasāvo kāsāvaṃ yo vatthaṃ paridahessati apeto damasaccena na so kāsāvaṃ arhati. Affectibus non liber qui fulvam vestem induere vult, temperantia et veritate privatus, non ille fulva veste dignus est. yo ca vantakasāv' assa sīlesu susamāhito upeto damasaccena sa ve kāsāvam arhati. Qui vero affectus respuit, virtutibus bene instructus, temperantia et veritate praeditus, ille certe fulva veste dignus est. asāre sāramatino sāre câsāradassino te sāraṃ nâdhigacchanti micchāsaṃkappagocarā. In eo, quod non essentiale, essentiam opinantes atque in essentia nonessentiale videntes, hi essentiam non adeunt, falsi studii participes. sārañ ca sārato ñatvā asārañ ca asārato te sāraṃ adhigacchanti sammāsaṃkappagocarā. Essentiam vero essentiale habentes, et nonessentiale non-essentiale, hi essentiam adeunt, veri studii participes. yathā agāraṃ ducchannaṃ vuṭṭhi samativijjhati evaṃ abhāvitaṃ cittaṃ rāgo samativijjhati. Sicut domum male tectam pluvia perrumpit, ita meditatione destitutam cogitationionem cupido perrumpit. yathā agāraṃ succhannaṃ vuṭṭhi na samativijjhati evaṃ subhāvitaṃ cittaṃ rāgo na samativijjhati. Sicut domum bene tectam pluvia non perrumpit, ita meditabundam cogitationem cupido non perrumpit. idha socati pecca socati pāpakārī ubhayattha socati, so socati so vihaññati disvā kammakiliṭṭham attano. In hoc aevo moeret, morte obita moeret malum patrans, utrobique moeret; ille moeret, ille contristatur videns impuritatem facinoris sui. idha modati pecca modati katapuñño ubhayattha modati, so modati so pamodati disvā kammavisuddhim attano. In hoc aevo gaudet, morte obita gaudet qui bonum perfecit, utrobique gaudet; ille gaudet, ille valde gaudet videns munditiam facinoris sui. idha tappati pecca tappati pāpakārī ubhayattha tappati, "pāpaṃ me katan" ti tappati. 
bhiyyo tappati duggatiṃ gato. In hoc aevo cruciatur, morte obita cruciatur malum patrans, utrobique cruciatur; "malum a me peractum", ita (cogitans) cruciatur, magis cruciatur tartarum ingressus. idha nandati pecca nandati katapuñño, ubhayattha nandati, "puññam me katan" ti nandati. bhiyyo nandati suggatiṃ gato. In hoc aevo gaudet, morte obita gaudet qui bonum perfecit, utrobique gaudet; "bonum a me peractum", ita (cogitans) gaudet, magis gaudet coelum ingressus. bahum pi ce sahitam bhāsamāno na takkaro hoti naro pamatto gopo va gāvo gaṇayam paresaṃ na bhāgavā sāmaññassa hoti. Multa quoque si concinna loquens ea non facit vir socors, bubulcus velut vaccas aliorum numerans, congregationis Samanarum non fit particeps. appam pi ce sahitam bhāsamāno dhammassa hoti anudhammacārī rāgañ ca dosañ ca pahāya mohaṃ sammappajāno suvimuttacitto anupādiyāno idha vā huraṃ vā sa bhāgavā sāmaññassa hoti. Pauca quoque si (quis) concinna loquens secundum legem vitam degit, et cupidinem et odium (et) perturbationem animi relinquens, plane sapiens, cogitatione bene liberata praeditus, nihil appetens vel hic vel illic, is congregationis Samanarum fit particeps. Yamakavaggo paṭhamo
Now let's use those formats to print out the second stanza of the Dhammapada.
secondStanza = F.otype.s("stanza")[1]
for fmt in sorted(T.formats):
    if fmt.startswith("layout"):
        continue
    print("{}:\n{}\n\n".format(fmt, T.text(secondStanza, fmt=fmt)))
text-latin-full: Naturae a mente principium ducunt, mens est potior pars earum, e mente constant; si (quis) mente inquinata aut loquitur aut agit, tum eum sequitur dolor, ut rota (bovis) vehentis pedem. text-orig-full: manopubbaṅgamā dhammā manoseṭṭhā manomayā, manasā ce paduṭṭhena bhāsatī vā karoti vā tato naṃ dukkham anveti cakkaṃ va vahato padaṃ. Naturae a mente principium ducunt, mens est potior pars earum, e mente constant; si (quis) mente inquinata aut loquitur aut agit, tum eum sequitur dolor, ut rota (bovis) vehentis pedem. text-pali-full: manopubbaṅgamā dhammā manoseṭṭhā manomayā, manasā ce paduṭṭhena bhāsatī vā karoti vā tato naṃ dukkham anveti cakkaṃ va vahato padaṃ.
If we do not specify a format, the default format (`text-orig-full`) is used.
T.text(range(1, 12))
'Yamakavagga manopubbaṅgamā dhammā manoseṭṭhā manomayā, manasā ce paduṭṭhena bhāsatī vā karoti '
The important things to remember are:

- you get the plain text of a node `n` in the default format by `T.text(n)`;
- you get the plain text of a node `n` in other formats by `T.text(n, fmt=fmt, descend=True)`.
Part of the pleasure of working with computers is that they can crunch massive amounts of data. The text of the Dhammapada is a piece of cake.
It takes less than a tenth of a second to have that cake and eat it.
import collections

A.indent(reset=True)
A.info("writing plain text of whole Dhammapada in all formats ...")
text = collections.defaultdict(list)
for v in F.otype.s("stanza"):
    for fmt in sorted(T.formats):
        if fmt.startswith("layout"):
            continue
        text[fmt].append(T.text(v, fmt=fmt, descend=True))
A.info("done {} formats".format(len(text)))
0.00s writing plain text of whole Dhammapada in all formats ... 0.06s done 3 formats
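The `collections.defaultdict(list)` used above saves us from initializing an empty list for every format before appending to it. Here is a self-contained illustration of that pattern; the format names and stanza texts are made-up placeholders:

```python
import collections

# Group stanza texts per format without pre-creating the lists
grouped = collections.defaultdict(list)
pairs = [
    ("text-pali-full", "stanza 1"),
    ("text-latin-full", "versus 1"),
    ("text-pali-full", "stanza 2"),
]
for fmt, stanza in pairs:
    # A missing key automatically gets a fresh empty list
    grouped[fmt].append(stanza)

print(dict(grouped))
# {'text-pali-full': ['stanza 1', 'stanza 2'], 'text-latin-full': ['versus 1']}
```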
for fmt in sorted(text):
    print("{}\n{}\n".format(fmt, "\n".join(text[fmt][0:5])))
text-latin-full Naturae a mente principium ducunt, mens est potior pars earum, e mente constant; si (quis) mente inquinata aut loquitur aut agit, tum eum sequitur dolor, ut rota (bovis) vehentis pedem. Naturae a mente etc.; si (quis) mente serena aut loquitur aut agit, tum eum sequitur gaudium ut umbra non decedens. "Conviciis me obruit, verberavit me, vicit me, spoliavit me"; qui isto (animo) sese induunt, iracundia eorum non sedatur. "Conviciis etc."; qui isto (animo) sese non induunt, iracundia in iis sedatur. text-orig-full Yamakavagga manopubbaṅgamā dhammā manoseṭṭhā manomayā, manasā ce paduṭṭhena bhāsatī vā karoti vā tato naṃ dukkham anveti cakkaṃ va vahato padaṃ. Naturae a mente principium ducunt, mens est potior pars earum, e mente constant; si (quis) mente inquinata aut loquitur aut agit, tum eum sequitur dolor, ut rota (bovis) vehentis pedem. manopubbaṅgamā dhammā manoseṭṭhā manomayā, manasā ce pasannena bhāsatī vā karoti vā tato naṃ sukham anveti chāyā va anapāyinī. Naturae a mente etc.; si (quis) mente serena aut loquitur aut agit, tum eum sequitur gaudium ut umbra non decedens. "akkocchi maṃ avadhi maṃ ajini maṃ ahāsi me", ye taṃ upanayihanti veraṃ tesaṃ na sammati. "Conviciis me obruit, verberavit me, vicit me, spoliavit me"; qui isto (animo) sese induunt, iracundia eorum non sedatur. "akkocchi maṃ avadhi maṃ ajini maṃ ahāsi me", ye taṃ na upanayhanti veraṃ tes' ūpasammati. "Conviciis etc."; qui isto (animo) sese non induunt, iracundia in iis sedatur. text-pali-full Yamakavagga manopubbaṅgamā dhammā manoseṭṭhā manomayā, manasā ce paduṭṭhena bhāsatī vā karoti vā tato naṃ dukkham anveti cakkaṃ va vahato padaṃ. manopubbaṅgamā dhammā manoseṭṭhā manomayā, manasā ce pasannena bhāsatī vā karoti vā tato naṃ sukham anveti chāyā va anapāyinī. "akkocchi maṃ avadhi maṃ ajini maṃ ahāsi me", ye taṃ upanayihanti veraṃ tesaṃ na sammati. "akkocchi maṃ avadhi maṃ ajini maṃ ahāsi me", ye taṃ na upanayhanti veraṃ tes' ūpasammati.
We write those formats to file, in your Downloads folder.
import os

for fmt in sorted(T.formats):
    if fmt.startswith("layout"):
        continue
    with open(os.path.expanduser(f"~/Downloads/{fmt}.txt"), "w") as f:
        f.write("\n".join(text[fmt]))
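If you want to try out the write-out pattern without touching your Downloads folder, the same logic can be exercised in a temporary directory. This is a self-contained sketch with invented stanza texts, not the corpus data:

```python
import os
import tempfile

# Stand-in for the text dict built from the corpus above
text = {
    "text-pali-full": ["stanza one", "stanza two"],
    "text-latin-full": ["versus unus", "versus duo"],
}
outdir = tempfile.mkdtemp()
for fmt in sorted(text):
    with open(os.path.join(outdir, f"{fmt}.txt"), "w") as f:
        f.write("\n".join(text[fmt]))

# Read one file back: each stanza sits on its own line
with open(os.path.join(outdir, "text-pali-full.txt")) as f:
    lines = f.read().split("\n")
print(lines)  # ['stanza one', 'stanza two']
```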
Text-Fabric pre-computes data for you, so that it can be loaded faster. If the original data is updated, Text-Fabric detects it, and will recompute that data.
But there are cases, e.g. when the algorithms of Text-Fabric have changed while the data has not, in which you may want to clear the cache of precomputed results.
There are two ways to do that:

- navigate to the `.tf` directory of your dataset and remove all `.tfx` files in it; this may be a bit awkward, because the `.tf` directory is hidden on Unix-like systems;
- call `TF.clearCache()`, which does exactly the same.

It is not handy to execute the following cell all the time; that is why I have commented it out. So if you really want to clear the cache, remove the comment sign below.
# TF.clearCache()
By now you have an impression of how to compute your way around the Dhammapada. While this is still the beginning, I hope you already sense the power of unlimited programmatic access to all the bits and bytes in the data set.
Here are a few directions for unleashing that power.
CC-BY Dirk Roorda