#!/usr/bin/env python
# coding: utf-8

# # Tutorial
#
# This notebook gets you started with using
# [Text-Fabric](https://annotation.github.io/text-fabric/) for coding in the Hebrew Bible.
#
# Familiarity with the underlying
# [data model](https://annotation.github.io/text-fabric/tf/about/datamodel.html)
# is recommended.
#
# Short introductions to other TF datasets:
#
# * [Dead Sea Scrolls](https://nbviewer.jupyter.org/github/annotation/tutorials/blob/master/lorentz2020/dss.ipynb),
# * [Old Babylonian Letters](https://nbviewer.jupyter.org/github/annotation/tutorials/blob/master/lorentz2020/oldbabylonian.ipynb), or the
# * [Quran](https://nbviewer.jupyter.org/github/annotation/tutorials/blob/master/lorentz2020/quran.ipynb)

# In[1]:

get_ipython().run_line_magic('load_ext', 'autoreload')
get_ipython().run_line_magic('autoreload', '2')

# ## Installing Text-Fabric
#
# See [here](https://annotation.github.io/text-fabric/tf/about/install.html)

# ## Tip
# If you start computing with this tutorial, first copy its parent directory to somewhere else,
# outside your repository.
# If you pull changes from the repository later, your work will not be overwritten.
# Where you put your tutorial directory is up to you.
# It will work from any directory.

# ## BHSA data
#
# Text-Fabric will fetch a standard set of features for you from the newest GitHub release binaries.
#
# It will fetch version `2021`.
#
# The data will be stored in the `text-fabric-data` directory in your home directory.

# # Incantation
#
# The simplest way to get going is by this *incantation*:

# In[2]:

from tf.app import use

# For the very latest version, use `hot`.
#
# For the latest release, use `latest`.
#
# If you have cloned the repos (TF app and data), use `clone`.
#
# If you do not want/need to upgrade, leave out the checkout specifiers.

# In[3]:

A = use("ETCBC/bhsa", hoist=globals())

# # Features
# The data of the BHSA is organized in features.
# They are *columns* of data.
# Think of the Hebrew Bible as a gigantic spreadsheet, where row 1 corresponds to the
# first word, row 2 to the second word, and so on, for all 425,000 words.
#
# The information about which part of speech each word is constitutes a column in that spreadsheet.
# The BHSA contains over 100 columns, not only for the 425,000 words, but also for a million more
# textual objects.
#
# Instead of putting that information in one big table, the data is organized in separate columns.
# We call those columns **features**.
# You can see which features have been loaded, and if you click on a feature name, you find its documentation.
# If you hover over a name, you see where the feature is located on your system.
#
# Edge features are marked by ***bold italic*** formatting.
#
# There are ways to tweak the set of features that is loaded. You can load more or fewer features.
#
# See [share](share.ipynb) for examples.

# # Modules
# Note that we have `phono` features.
# The BHSA data has a special 1-1 transcription from Hebrew to ASCII,
# but not a *phonetic* transcription.
#
# I have made a
# [notebook](https://github.com/etcbc/phono/blob/master/programs/phono.ipynb)
# that tries hard to find phonological representations for all the words.
# The result is a *module* in text-fabric format.
# We'll encounter that later.
#
# This module, and the module [etcbc/parallels](https://github.com/etcbc/parallels),
# are standard modules of the BHSA app.
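# As a quick illustration (a sketch, not executed here), such feature modules can also be requested
# explicitly through the `mod` parameter of `use`; the module specifiers below are assumptions based on
# the standard modules just mentioned:
#
# ```python
# from tf.app import use
#
# # load the BHSA app together with the phono and parallels feature modules (illustrative call)
# A = use(
#     "ETCBC/bhsa",
#     mod="ETCBC/phono/tf,ETCBC/parallels/tf",
#     hoist=globals(),
# )
# ```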
# See the [share](share.ipynb) tutorial or the
# [Data](https://annotation.github.io/text-fabric/tf/about/datasharing.html) documentation
# for how you can add and invoke additional data.

# ## API
#
# The result of the incantation is that we have a bunch of special variables at our disposal
# that give us access to the text and data of the Hebrew Bible.
#
# At this point it is helpful to throw a quick glance at the text-fabric API documentation
# (see the links under **API Members** above).
#
# The most essential thing for now is that we can use `F` to access the data in the features
# we've loaded.
# But there is more, such as `N`, which helps us to walk over the text, as we will see in a minute.
#
# The **API members** above show you exactly which new names have been inserted in your namespace.
# If you click on these names, you go to the API documentation for them.

# ## Search
# Text-Fabric contains a flexible search engine, which works not only for the BHSA data,
# but also for data that you add to it.
#
# **Search is the quickest way to come up to speed with your data, without too much programming.**
#
# Jump to the dedicated [search](search.ipynb) tutorial first, to whet your appetite.
# And if you already know MQL queries, you can build from that in
# [search From MQL](searchFromMQL.ipynb).
#
# The real power of search lies in the fact that it is integrated in a programming environment.
# You can use programming to:
#
# * compose dynamic queries
# * process query results
#
# Therefore, the rest of this tutorial is still important when you want to tap that power.
# If you continue here, you learn all the basics of data navigation with Text-Fabric.

# Before we start coding, we load some modules that we need along the way:

# In[3]:

get_ipython().run_line_magic('load_ext', 'autoreload')
get_ipython().run_line_magic('autoreload', '2')

# In[4]:

import os
import collections
from itertools import chain

# # Counting
#
# In order to get acquainted with the data, we start with the simple task of counting.
#
# ## Count all nodes
# We use the
# [`N.walk()` generator](https://annotation.github.io/text-fabric/tf/core/nodes.html#tf.core.nodes.Nodes.walk)
# to walk through the nodes.
#
# We compared the BHSA data to a gigantic spreadsheet, where the rows correspond to the words.
# In Text-Fabric, we call the rows `slots`, because they are the textual positions that can be filled with words.
#
# We also mentioned that there are about 1,000,000 more textual objects.
# They are the phrases, clauses, sentences, verses, chapters and books.
# They also correspond to rows in the big spreadsheet.
#
# In Text-Fabric we call all these rows *nodes*, and the `N.walk()` generator
# carries us through those nodes in the textual order.
#
# Just one extra thing: the `info` statements generate timed messages.
# If you use them instead of `print` you'll get a sense of the amount of time that
# the various processing steps typically need.

# In[5]:

A.indent(reset=True)
A.info("Counting nodes ...")

i = 0
for n in N.walk():
    i += 1

A.info("{} nodes".format(i))

# Here you see it: 1.4 M nodes!

# ## What are those million nodes?
# Every node has a type, like word, phrase, or sentence.
# We know that we have approximately 425,000 words and a million other nodes.
# But what exactly are they?
#
# Text-Fabric has two special features, `otype` and `oslots`, that must occur in every Text-Fabric data set.
# `otype` tells you the type of each node, and via it you can also ask for the number of slots in the text.
#
# Here we go!
# In[6]:

F.otype.slotType

# In[7]:

F.otype.maxSlot

# In[8]:

F.otype.maxNode

# In[9]:

F.otype.all

# In[10]:

C.levels.data

# This is interesting: above you see all the object types, with the average size of their objects,
# the node where they start, and the node where they end.

# ## Count individual object types
# This is an intuitive way to count the number of nodes in each type.
# Note in passing how we use `indent` in conjunction with `info` to produce neatly timed
# and indented progress messages.

# In[11]:

A.indent(reset=True)
A.info("counting objects ...")

for otype in F.otype.all:
    i = 0

    A.indent(level=1, reset=True)

    for n in F.otype.s(otype):
        i += 1

    A.info("{:>7} {}s".format(i, otype))

A.indent(level=0)
A.info("Done")

# # Viewing textual objects
#
# We use the A API (the extra power) to peek into the corpus.

# First some words.
# Node 15890 is a word with a dotless shin.
#
# Node 1002 is a word with a yod after a segol hataf.
#
# Node 100,000 is just a word slot.
#
# Let's inspect them and see where they are.
#
# First the plain view:

# In[12]:

F.otype.v(1)

# In[13]:

wordShows = (15890, 1002, 100000)
for word in wordShows:
    A.plain(word, withPassage=True)

# You can leave out the passage reference:

# In[14]:

for word in wordShows:
    A.plain(word, withPassage=False)

# Now we show other objects, both with and without passage reference.

# In[15]:

normalShow = dict(
    wordShow=wordShows[0],
    phraseShow=700000,
    clauseShow=500000,
    sentenceShow=1200000,
    lexShow=1437667,
)

sectionShow = dict(
    verseShow=1420000,
    chapterShow=427000,
    bookShow=426598,
)

# In[16]:

for (name, n) in normalShow.items():
    A.dm(f"**{name}** = node `{n}`\n")
    A.plain(n)
    A.plain(n, withPassage=False)
    A.dm("\n---\n")

# Note that for section nodes (except verse and half-verse) the `withPassage` option has little effect.
# The passage is the thing that is hyperlinked. The node is represented as a textual reference to the piece of text
# in question.

# In[17]:

for (name, n) in sectionShow.items():
    if name == "verseShow":
        continue
    A.dm(f"**{name}** = node `{n}`\n")
    A.plain(n)
    A.plain(n, withPassage=False)
    A.dm("\n---\n")

# We can also dive into the structure of the textual objects, provided they are not too large.
#
# The function `pretty` gives a display of the object that a node stands for, together with the structure below that node.

# In[18]:

for (name, n) in normalShow.items():
    A.dm(f"**{name}** = node `{n}`\n")
    A.pretty(n)
    A.dm("\n---\n")

# Note
# * if you click on a word in a pretty display
#   you go to a page in SHEBANQ that shows a list of all occurrences of this lexeme;
# * if you click on the passage, you go to SHEBANQ, to exactly this verse.

# If you need a link to SHEBANQ for just any node:

# In[19]:

million = 1000000
A.webLink(million)

# We can show some standard features in the display:

# In[20]:

for (name, n) in normalShow.items():
    A.dm(f"**{name}** = node `{n}`\n")
    A.pretty(n, standardFeatures=True)
    A.dm("\n---\n")

# For more display options, see [display](display.ipynb).

# # Feature statistics
#
# `F`
# gives access to all features.
# Every feature has a method
# `freqList()`
# to generate a frequency list of its values, higher frequencies first.
# Here are the parts of speech:

# In[22]:

F.sp.freqList()

# # Lexeme matters
#
# ## Top 10 frequent verbs
#
# If we count the frequency of words, we usually mean the frequency of their
# corresponding lexemes.
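# As a quick illustration of the word/lexeme distinction (a sketch, assuming the incantation above has
# hoisted `F` and `L`, and that the features `g_word_utf8` and `lex` are loaded, as they are by default):
# several distinct word occurrences share one lexeme node.
#
# ```python
# w = 2                                   # an arbitrary word node
# lx = L.u(w, otype="lex")[0]             # its lexeme node
# for occ in L.d(lx, otype="word")[0:3]:  # first three occurrences of that lexeme
#     print(occ, F.g_word_utf8.v(occ), F.lex.v(occ))
# ```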
#
# There are several methods for working with lexemes.
#
# ### Method 1: counting words

# In[23]:

verbs = collections.Counter()
A.indent(reset=True)
A.info("Collecting data")

for w in F.otype.s("word"):
    if F.sp.v(w) != "verb":
        continue
    verbs[F.lex.v(w)] += 1

A.info("Done")
print(
    "".join(
        "{}: {}\n".format(verb, cnt)
        for (verb, cnt) in sorted(verbs.items(), key=lambda x: (-x[1], x[0]))[0:10]
    )
)

# ### Method 2: counting lexemes
#
# An alternative way to do this is to use the feature `freq_lex`, defined for `lex` nodes.
# Now we walk the lexemes instead of the occurrences.
#
# Note that the feature `sp` (part of speech) is defined for nodes of type `word` as well as `lex`.
# Both also have the `lex` feature.

# In[24]:

verbs = collections.Counter()
A.indent(reset=True)
A.info("Collecting data")

for w in F.otype.s("lex"):
    if F.sp.v(w) != "verb":
        continue
    verbs[F.lex.v(w)] += F.freq_lex.v(w)

A.info("Done")
print(
    "".join(
        "{}: {}\n".format(verb, cnt)
        for (verb, cnt) in sorted(verbs.items(), key=lambda x: (-x[1], x[0]))[0:10]
    )
)

# This is an order of magnitude faster. In this case, that means the difference between a third of a second and a
# hundredth of a second, not a big gain in absolute terms.
# But suppose you need to run this a thousand times in a loop.
# Then it is the difference between 5 minutes and 10 seconds.
# A five-minute wait is not pleasant in interactive computing!

# ### A frequency mapping of lexemes
#
# We make a mapping between lexeme forms and the number of occurrences of those lexemes.

# In[25]:

lexeme_dict = {F.lex_utf8.v(n): F.freq_lex.v(n) for n in F.otype.s("word")}

# In[26]:

list(lexeme_dict.items())[0:10]

# ### Real work
#
# As a primer of real-world work on lexeme distribution, have a look at James Cuénod's notebook on
# [Collocation Mutual Information Analysis of the Hebrew Bible](https://nbviewer.jupyter.org/github/jcuenod/hebrewCollocations/blob/master/Collocation%20MI%20Analysis%20of%20the%20Hebrew%20Bible.ipynb)
#
# It is a nice example of how you collect data with TF API calls, then do research with your own methods and tools, and then use TF for presenting results.
#
# In case the name has changed, the enclosing repo is
# [here](https://nbviewer.jupyter.org/github/jcuenod/hebrewCollocations/tree/master/).

# ## Lexeme distribution
#
# Let's do some more fancy lexeme stuff.
#
# ### Hapaxes
#
# A hapax can be found by inspecting lexemes and seeing how many word nodes they are linked to.
# If that number is one, we have a hapax.
#
# We print 10 hapaxes with their glosses.

# In[27]:

A.indent(reset=True)

hapax = []
zero = set()

for lx in F.otype.s("lex"):
    occs = L.d(lx, otype="word")
    n = len(occs)
    if n == 0:  # that's weird: should not happen
        zero.add(lx)
    elif n == 1:  # hapax found!
        hapax.append(lx)

A.info("{} hapaxes found".format(len(hapax)))

if zero:
    A.error("{} zeroes found".format(len(zero)), tm=False)
else:
    A.info("No zeroes found", tm=False)

for h in hapax[0:10]:
    print("\t{:<8} {}".format(F.lex.v(h), F.gloss.v(h)))

# ### Small occurrence base
#
# The occurrence base of a lexeme is the set of verses, chapters and books in which it occurs.
# Let's look for lexemes that occur in a single chapter.
#
# If a lexeme occurs in a single chapter, its slots are a subset of the slots of that chapter.
# So, if you go *up* from the lexeme, you encounter the chapter.
#
# Normally, lexemes occur in many chapters, and then none of them totally includes all occurrences,
# so if you go up from such lexemes, you do not find chapters.
#
# Let's check it out.
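# First a tiny sanity sketch of that reasoning (assuming the names hoisted by the incantation;
# we reuse the `hapax` list just computed, since a hapax is certainly confined to one chapter):
# a chapter shows up in the `L.u` results of a lexeme exactly when the lexeme's slots are a subset of
# that chapter's slots.
#
# ```python
# lx = hapax[0]
# lxSlots = set(E.oslots.s(lx))
# for ch in L.u(lx, otype="chapter"):
#     print(ch, lxSlots <= set(E.oslots.s(ch)))  # prints True for the single enclosing chapter
# ```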
#
# Oh yes, we have already found the hapaxes, so we will skip them here.

# In[28]:

A.indent(reset=True)
A.info("Finding single chapter lexemes")

singleCh = []
multipleCh = []

for lx in F.otype.s("lex"):
    chapters = L.u(lx, "chapter")
    if len(chapters) == 1:
        if lx not in hapax:
            singleCh.append(lx)
    elif len(chapters) > 0:  # should not happen
        multipleCh.append(lx)

A.info("{} single chapter lexemes found".format(len(singleCh)))

if multipleCh:
    A.error(
        "{} chapter embedders of multiple lexemes found".format(len(multipleCh)),
        tm=False,
    )
else:
    A.info("No chapter embedders of multiple lexemes found", tm=False)

for s in singleCh[0:10]:
    print(
        "{:<20} {:<6}".format(
            "{} {}:{}".format(*T.sectionFromNode(s)),
            F.lex.v(s),
        )
    )

# ### Confined to books
#
# As a final exercise with lexemes, let's make a list of all books, and show their total number of lexemes and
# the number of lexemes that occur exclusively in that book.

# In[29]:

A.indent(reset=True)
A.info("Making book-lexeme index")

allBook = collections.defaultdict(set)
allLex = set()

for b in F.otype.s("book"):
    for w in L.d(b, "word"):
        lx = L.u(w, "lex")[0]
        allBook[b].add(lx)
        allLex.add(lx)

A.info("Found {} lexemes".format(len(allLex)))

# In[30]:

A.indent(reset=True)
A.info("Finding single book lexemes")

singleBook = collections.defaultdict(lambda: 0)
for lx in F.otype.s("lex"):
    book = L.u(lx, "book")
    if len(book) == 1:
        singleBook[book[0]] += 1

A.info("found {} single book lexemes".format(sum(singleBook.values())))

# In[31]:

print(
    "{:<20}{:>5}{:>5}{:>5}\n{}".format(
        "book",
        "#all",
        "#own",
        "%own",
        "-" * 35,
    )
)

booklist = []

for b in F.otype.s("book"):
    book = T.bookName(b)
    a = len(allBook[b])
    o = singleBook.get(b, 0)
    p = 100 * o / a
    booklist.append((book, a, o, p))

for x in sorted(booklist, key=lambda e: (-e[3], -e[1], e[0])):
    print("{:<20} {:>4} {:>4} {:>4.1f}%".format(*x))

# The book names may sound a bit unfamiliar: they are in Latin here.
# Later we'll see that you can also get them in English, or in Swahili.

# # Locality API
# We travel upwards and downwards, forwards and backwards through the nodes.
# The Locality-API (`L`) provides functions: `u()` for going up, and `d()` for going down,
# `n()` for going to next nodes and `p()` for going to previous nodes.
#
# These directions are indirect notions: nodes are just numbers, but by means of the
# `oslots` feature they are linked to slots. One node *contains* another node if the set of slots
# it is linked to includes the set of slots the other node is linked to.
# And one node is next or previous to another if its slots follow or precede the slots of the other one.
#
# `L.u(node)` **Up** is going to nodes that embed `node`.
#
# `L.d(node)` **Down** is the opposite direction, to those that are contained in `node`.
#
# `L.n(node)` **Next** are the next *adjacent* nodes, i.e. nodes whose first slot comes immediately after the last slot of `node`.
#
# `L.p(node)` **Previous** are the previous *adjacent* nodes, i.e. nodes whose last slot comes immediately before the first slot of `node`.
#
# All these functions yield nodes of all possible node types.
# By passing an optional parameter, you can restrict the results to nodes of that type.
#
# The results are ordered according to the order of things in the text.
#
# The functions always return a tuple, even if there is just one node in the result.

# ## Going up
# We go from the first word to the book that contains it.
# Note the `[0]` at the end. You expect one book, yet `L` returns a tuple.
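# A minimal illustration of that tuple behaviour (nothing here beyond calls used elsewhere in this tutorial):
#
# ```python
# print(L.u(1, otype="book"))     # a 1-element tuple
# print(L.u(1, otype="book")[0])  # the bare book node
# ```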
# To get the only element of that tuple, you need that `[0]`.
#
# If you are like me, you keep forgetting it, and that will lead to weird error messages later on.

# In[32]:

firstBook = L.u(1, otype="book")[0]
print(firstBook)

# And let's see all the containing objects of word 3:

# In[33]:

w = 3
for otype in F.otype.all:
    if otype == F.otype.slotType:
        continue
    up = L.u(w, otype=otype)
    upNode = "x" if len(up) == 0 else up[0]
    print("word {} is contained in {} {}".format(w, otype, upNode))

# ## Going next
# Let's go to the next nodes of the first book.

# In[34]:

afterFirstBook = L.n(firstBook)
for n in afterFirstBook:
    print(
        "{:>7}: {:<13} first slot={:<6}, last slot={:<6}".format(
            n,
            F.otype.v(n),
            E.oslots.s(n)[0],
            E.oslots.s(n)[-1],
        )
    )
secondBook = L.n(firstBook, otype="book")[0]

# ## Going previous
#
# And let's see what is right before the second book.

# In[35]:

for n in L.p(secondBook):
    print(
        "{:>7}: {:<13} first slot={:<6}, last slot={:<6}".format(
            n,
            F.otype.v(n),
            E.oslots.s(n)[0],
            E.oslots.s(n)[-1],
        )
    )

# ## Going down
# We go to the chapters of the second book, and just count them.

# In[36]:

chapters = L.d(secondBook, otype="chapter")
print(len(chapters))

# ## The first verse
# We pick the first verse and the first word, and explore what is above and below them.

# In[37]:

for n in [1, L.u(1, otype="verse")[0]]:
    A.indent(level=0)
    A.info("Node {}".format(n), tm=False)
    A.indent(level=1)
    A.info("UP", tm=False)
    A.indent(level=2)
    A.info("\n".join(["{:<15} {}".format(u, F.otype.v(u)) for u in L.u(n)]), tm=False)
    A.indent(level=1)
    A.info("DOWN", tm=False)
    A.indent(level=2)
    A.info("\n".join(["{:<15} {}".format(u, F.otype.v(u)) for u in L.d(n)]), tm=False)

A.indent(level=0)
A.info("Done", tm=False)

# # Text API
#
# So far, we have mainly seen nodes and their numbers, and the names of node types.
# You would almost forget that we are dealing with text.
# So let's try to see some text.
#
# In the same way as `F` gives access to feature data,
# `T` gives access to the text.
# That is also feature data, but you can tell Text-Fabric which features are specifically
# carrying the text, and in return Text-Fabric offers you
# a Text API: `T`.

# ## Formats
# Hebrew text can be represented in a number of ways:
#
# * fully pointed (vocalized and accented), or consonantal,
# * in transliteration, phonetic transcription or in Hebrew characters,
# * showing the actual text or only the lexemes,
# * following the ketiv or the qere, at places where they deviate from each other.
#
# If you wonder where the information about text formats is stored:
# not in the program text-fabric, but in the data set.
# It has a feature `otext`, which specifies the formats and which features
# must be used to produce them. `otext` is the third special feature in a TF data set,
# next to `otype` and `oslots`.
# It is an optional feature.
# If it is absent, there will be no `T` API.
#
# Here is a list of all available formats in this data set.

# In[38]:

sorted(T.formats)

# Note the `text-phono-full` format here.
# It does not come from the main data source `bhsa`, but from the module `phono`.
# Look in your data directory, find `~/github/etcbc/phono/tf/2017/otext@phono.tf`,
# and you'll see this format defined there.

# ## Using the formats
#
# We can pretty display in other formats:

# In[39]:

for word in wordShows:
    A.pretty(word, fmt="text-phono-full")

# ## T.text()
#
# This function is central for getting text representations of nodes.
# Its most basic usage is
#
# ```python
# T.text(nodes, fmt=fmt)
# ```
#
# where `nodes` is a list or iterable of nodes, usually word nodes, and `fmt` is the name of a format.
# If you leave out `fmt`, the default `text-orig-full` is chosen.
#
# The result is the text in that format for all nodes specified:

# In[40]:

T.text([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], fmt="text-orig-plain")

# There is also another usage of this function:
#
# ```python
# T.text(node, fmt=fmt)
# ```
#
# where `node` is a single node.
# In this case, the default format is `ntype-orig-full` where `ntype` is the type of `node`.
# So for a `lex` node, the default format is `lex-orig-full`.
#
# If the format is defined in the corpus, it will be used. Otherwise, the word nodes contained in `node` will be looked up
# and represented with the default format `text-orig-full`.
#
# In this way we can sensibly represent a lot of different nodes, such as chapters, verses, sentences, words and lexemes.
#
# We compose a set of example nodes and run `T.text` on them:

# In[41]:

exampleNodes = [
    1,
    F.otype.s("sentence")[0],
    F.otype.s("verse")[0],
    F.otype.s("chapter")[0],
    F.otype.s("lex")[1],
]
exampleNodes

# In[42]:

for n in exampleNodes:
    print(f"This is {F.otype.v(n)} {n}:")
    print(T.text(n))
    print("")

# ## Using the formats
# Now let's use those formats to print out the first verse of the Hebrew Bible.

# In[43]:

for fmt in sorted(T.formats):
    print("{}:\n\t{}".format(fmt, T.text(range(1, 12), fmt=fmt)))

# Note that `lex-default` is a format that only works for nodes of type `lex`.

# If we do not specify a format, the **default** format is used (`text-orig-full`).

# In[44]:

T.text(range(1, 12))

# In[45]:

firstVerse = F.otype.s("verse")[0]
T.text(firstVerse)

# In[46]:

T.text(firstVerse, fmt="text-phono-full")

# The important things to remember are:
#
# * you can supply a list of word nodes and get them represented in all formats (except `lex-default`)
# * you can use `T.text(lx)` for lexeme nodes `lx` and it will give the vocalized lexeme (using format `lex-default`)
# * you can get non-word nodes `n` in default format by `T.text(n)`
# * you can get non-word nodes `n` in other formats by `T.text(n, fmt=fmt, descend=True)`

# ## Whole text in all formats
# Part of the pleasure of working with computers is that they can crunch massive amounts of data.
# The text of the Hebrew Bible is a piece of cake.
#
# It takes less than ten seconds to have that cake and eat it.
# In nearly a dozen formats.

# In[47]:

A.indent(reset=True)
A.info("writing plain text of whole Bible in all formats ...")

text = collections.defaultdict(list)

for v in F.otype.s("verse"):
    for fmt in sorted(T.formats):
        text[fmt].append(T.text(v, fmt=fmt, descend=True))

A.info("done {} formats".format(len(text)))

# In[48]:

for fmt in sorted(text):
    print("{}\n{}\n".format(fmt, "\n".join(text[fmt][0:5])))

# ### The full plain text
# We write a few formats to file, in your Downloads folder.

# In[49]:

for fmt in """
text-orig-full
text-phono-full
""".strip().split():
    with open(os.path.expanduser(f"~/Downloads/{fmt}.txt"), "w") as f:
        f.write("\n".join(text[fmt]))

# ## Book names
#
# For Bible book names, we can use several languages.
#
# ### Languages
# Here are the languages that we can use for book names.
# These languages come from the features `book@ll`, where `ll` is a two letter
# ISO language code. Have a look in your data directory, you can't miss them.

# In[50]:

T.languages

# ### Book names in Swahili
# Get the book names in Swahili.
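# The two directions we use below are `T.bookName` (node to name) and `T.bookNode` (name to node).
# Here is a one-book sketch (assuming `F` and `T` from the incantation) before we list them all:
#
# ```python
# b = F.otype.s("book")[0]
# name = T.bookName(b, lang="sw")
# print(name, T.bookNode(name, lang="sw") == b)  # the round trip should give back the same node
# ```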
# In[51]:

nodeToSwahili = ""
for b in F.otype.s("book"):
    nodeToSwahili += "{} = {}\n".format(b, T.bookName(b, lang="sw"))
print(nodeToSwahili)

# ### Book nodes from Swahili
# OK, there they are. We copy them into a string, and do the opposite: get the nodes back.
# We check whether we get exactly the same nodes as the ones we started with.

# In[52]:

swahiliNames = """
Mwanzo
Kutoka
Mambo_ya_Walawi
Hesabu
Kumbukumbu_la_Torati
Yoshua
Waamuzi
1_Samweli
2_Samweli
1_Wafalme
2_Wafalme
Isaya
Yeremia
Ezekieli
Hosea
Yoeli
Amosi
Obadia
Yona
Mika
Nahumu
Habakuki
Sefania
Hagai
Zekaria
Malaki
Zaburi
Ayubu
Mithali
Ruthi
Wimbo_Ulio_Bora
Mhubiri
Maombolezo
Esta
Danieli
Ezra
Nehemia
1_Mambo_ya_Nyakati
2_Mambo_ya_Nyakati
""".strip().split()

swahiliToNode = ""
for nm in swahiliNames:
    swahiliToNode += "{} = {}\n".format(T.bookNode(nm, lang="sw"), nm)

if swahiliToNode != nodeToSwahili:
    print("Something is not right with the book names")
else:
    print("Going from nodes to booknames and back yields the original nodes")

# ## Sections
#
# A section in the Hebrew Bible is a book, a chapter or a verse.
# Knowledge of sections is not baked into Text-Fabric.
# The config feature `otext.tf` may specify three section levels, and tell
# what the corresponding node types and features are.
#
# From that knowledge it can construct mappings from nodes to sections, e.g. from verse
# nodes to tuples of the form:
#
# `(bookName, chapterNumber, verseNumber)`
#
# You can get the section of a node as a tuple of the relevant book, chapter, and verse nodes.
# Or you can get it as a passage label, a string.
#
# You can ask for the passage corresponding to the first slot of a node, or the one corresponding to the last slot.
#
# If you are dealing with book and chapter nodes, you can ask to fill out the verse and chapter parts as well.
#
# Here are examples of getting the section that corresponds to a node and vice versa.
#
# **NB:** `sectionFromNode` always delivers a verse specification, either from the
# first slot belonging to that node, or, if `lastSlot` is passed, from the last slot
# belonging to that node.

# In[53]:

for (desc, n) in chain(normalShow.items(), sectionShow.items()):
    for lang in "en la sw".split():
        d = f"{n:>7} {desc}" if lang == "en" else ""
        first = A.sectionStrFromNode(n, lang=lang)
        last = A.sectionStrFromNode(n, lang=lang, lastSlot=True, fillup=True)
        tup = (
            T.sectionTuple(n)
            if lang == "en"
            else T.sectionTuple(n, lastSlot=True, fillup=True)
            if lang == "la"
            else ""
        )
        print(f"{d:<20} {lang} - {first:<30} {last:<30} {tup}")

# And here are examples to get back:

# In[54]:

for (lang, section) in (
    ("en", "Ezekiel"),
    ("la", "Ezechiel"),
    ("sw", "Ezekieli"),
    ("en", "Isaiah 43"),
    ("la", "Jesaia 43"),
    ("sw", "Isaya 43"),
    ("en", "Deuteronomy 28:34"),
    ("la", "Deuteronomium 28:34"),
    ("sw", "Kumbukumbu_la_Torati 28:34"),
    ("en", "Job 37:3"),
    ("la", "Iob 37:3"),
    ("sw", "Ayubu 37:3"),
    ("en", "Numbers 22:33"),
    ("la", "Numeri 22:33"),
    ("sw", "Hesabu 22:33"),
    ("en", "Genesis 30:18"),
    ("la", "Genesis 30:18"),
    ("sw", "Mwanzo 30:18"),
    ("en", "Genesis 1:30"),
    ("la", "Genesis 1:30"),
    ("sw", "Mwanzo 1:30"),
    ("en", "Psalms 37:2"),
    ("la", "Psalmi 37:2"),
    ("sw", "Zaburi 37:2"),
):
    n = A.nodeFromSectionStr(section, lang=lang)
    nType = F.otype.v(n)
    print(f"{section:<30} {lang} {nType:<20} {n}")

# ## Sentences spanning multiple verses
# If you go up from a sentence node, you expect to find a verse node.
# But some sentences span multiple verses, and in that case you will not find the enclosing
# verse node, because it is not there.
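# A quick way to see this for a single sentence (a sketch, assuming `F` and `L` from the incantation):
# going up to the verse level yields an empty tuple precisely for such verse-spanning sentences.
#
# ```python
# s = F.otype.s("sentence")[0]
# print(L.u(s, otype="verse"))  # usually a 1-element tuple; empty if the sentence spans verses
# ```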
#
# Here is a piece of code to detect and list all cases where sentences span multiple verses.
#
# The idea is to pick the first and the last word of a sentence, use `T.sectionFromNode` to
# discover the verse in which that word occurs, and if they are different: bingo!
#
# We show the first 10 of the ca. 900 cases.
#
# By the way: doing this in the `2016` version of the data yields 915 results.
# The splitting up of the text into sentences is not carved in stone!

# In[55]:

A.indent(reset=True)
A.info("Get sentences that span multiple verses")

spanSentences = []
for s in F.otype.s("sentence"):
    fs = T.sectionFromNode(s, lastSlot=False)
    ls = T.sectionFromNode(s, lastSlot=True)
    if fs != ls:
        spanSentences.append("{} {}:{}-{}".format(fs[0], fs[1], fs[2], ls[2]))

A.info("Found {} cases".format(len(spanSentences)))
A.info("\n{}".format("\n".join(spanSentences[0:10])))

# A different way, with better display, is:

# In[56]:

A.indent(reset=True)
A.info("Get sentences that span multiple verses")

spanSentences = []
for s in F.otype.s("sentence"):
    words = L.d(s, otype="word")
    fw = words[0]
    lw = words[-1]
    fVerse = L.u(fw, otype="verse")[0]
    lVerse = L.u(lw, otype="verse")[0]
    if fVerse != lVerse:
        spanSentences.append((s, fVerse, lVerse))

A.info("Found {} cases".format(len(spanSentences)))
A.table(spanSentences, end=1)

# Wait a second: the columns with the verses are empty.
# In tables, the content of a verse is not shown.
# And by default, the passage that is relevant to a row is computed from one of the columns.
#
# But here, we definitely want the passage of columns 2 and 3, so:

# In[57]:

A.table(spanSentences, end=10, withPassage={2, 3})

# We can zoom in:

# In[58]:

A.show(spanSentences, condensed=False, start=6, end=6, baseTypes={"sentence_atom"})

# # Ketiv Qere
# Let us explore where Ketiv/Qere pairs are and how they render.

# In[59]:

qeres = [w for w in F.otype.s("word") if F.qere.v(w) is not None]
print("{} qeres".format(len(qeres)))
for w in qeres[0:10]:
    print(
        '{}: ketiv = "{}"+"{}" qere = "{}"+"{}"'.format(
            w,
            F.g_word.v(w),
            F.trailer.v(w),
            F.qere.v(w),
            F.qere_trailer.v(w),
        )
    )

# ## Show a ketiv-qere pair
# Let us print all text representations of the verse in which the second qere occurs.

# In[60]:

refWord = qeres[1]
print(f"Reference word is {refWord}")
vn = L.u(refWord, otype="verse")[0]
print("{} {}:{}".format(*T.sectionFromNode(refWord)))

for fmt in sorted(T.formats):
    if fmt.startswith("text-"):
        print("{:<25} {}".format(fmt, T.text(vn, fmt=fmt, descend=True)))

# # Edge features: mother
#
# We have not talked about edges much. If the nodes correspond to the rows in the big spreadsheet,
# the edges point from one row to another.
#
# One edge we have encountered: the special feature `oslots`.
# Each non-slot node is linked by `oslots` to all of its slot nodes.
#
# An edge is really a feature as well.
# Whereas a node feature is a column of information,
# one cell per node,
# an edge feature is also a column of information, one cell per pair of nodes.
#
# Linguists use more relationships between textual objects, for example:
# linguistic dependency.
# In the BHSA all cases of linguistic dependency are coded in the edge feature `mother`.
#
# Let us do a few basic enquiries on an edge feature:
# [mother](https://etcbc.github.io/bhsa/features/hebrew/2017/mother).
#
# We count how many mothers nodes can have (it turns out to be 0 or 1).
# We walk through all nodes; per node we retrieve the mother nodes, and
# we store the lengths (if non-zero) in a dictionary (`motherLen`).
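# The two directions of an edge feature are `E.mother.f(n)` (the nodes that `n` points to, i.e. its mother)
# and `E.mother.t(n)` (the nodes that point to `n`, i.e. its daughters).
# A peek at the first node that has a mother (a sketch, assuming the names hoisted by the incantation):
#
# ```python
# n = next(c for c in N.walk() if E.mother.f(c))
# print(n, F.otype.v(n), "mother:", E.mother.f(n), "daughters:", E.mother.t(n))
# ```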
#
# We see that nodes have at most one mother.
#
# We also count the inverse relationship: daughters.

# In[61]:

A.indent(reset=True)
A.info("Counting mothers")

motherLen = {}
daughterLen = {}

for c in N.walk():
    lms = E.mother.f(c) or []
    lds = E.mother.t(c) or []
    nms = len(lms)
    nds = len(lds)
    if nms:
        motherLen[c] = nms
    if nds:
        daughterLen[c] = nds

A.info("{} nodes have mothers".format(len(motherLen)))
A.info("{} nodes have daughters".format(len(daughterLen)))

motherCount = collections.Counter()
daughterCount = collections.Counter()

for (n, lm) in motherLen.items():
    motherCount[lm] += 1
for (n, ld) in daughterLen.items():
    daughterCount[ld] += 1

print("mothers", motherCount)
print("daughters", daughterCount)

# # Clean caches
#
# Text-Fabric pre-computes data for you, so that it can be loaded faster.
# If the original data is updated, Text-Fabric detects it, and will recompute that data.
#
# But there are cases, e.g. when the algorithms of Text-Fabric have changed without any changes in the data,
# in which you might want to clear the cache of precomputed results.
#
# There are two ways to do that:
#
# * Locate the `.tf` directory of your dataset, and remove all `.tfx` files in it.
#   This might be a bit awkward to do, because the `.tf` directory is hidden on Unix-like systems.
# * Call `TF.clearCache()`, which does exactly the same.
#
# It is not handy to execute the following cell all the time; that's why I have commented it out.
# So if you really want to clear the cache, remove the comment sign below.

# In[65]:

# TF.clearCache()

# # All steps
#
# By now you have an impression of how to compute around in the Hebrew Bible.
# While this is still the beginning, I hope you already sense the power of unlimited programmatic access
# to all the bits and bytes in the data set.
#
# Here are a few directions for unleashing that power.
#
# * **start** your first step in mastering the bible computationally
# * **[display](display.ipynb)** become an expert in creating pretty displays of your text structures
# * **[search](search.ipynb)** turbo charge your hand-coding with search templates
# * **[export Excel](exportExcel.ipynb)** make tailor-made spreadsheets out of your results
# * **[share](share.ipynb)** draw in other people's data and let them use yours
# * **[export](export.ipynb)** export your dataset as an Emdros database
# * **[annotate](annotate.ipynb)** annotate plain text by means of other tools and import the annotations as TF features
# * **[map](map.ipynb)** map somebody else's annotations to a new version of the corpus
# * **[volumes](volumes.ipynb)** work with selected books only
# * **[trees](trees.ipynb)** work with the BHSA data as syntax trees
#
# CC-BY Dirk Roorda