This is about combining multiple TF datasets into one, and then tweaking it further.
In the previous chapters of this tutorial you have learned how to add new features to an existing dataset.
Here you learn how you can combine dozens of slightly heterogeneous TF data sets, and apply structural tweaks to the node types and features later on.
The incentive to write these composition functions into Text-Fabric came from Ernst Boogert while he was converting between 100 and 200 works by the Church Fathers (Patristics). The conversion did a very good job in getting all the information from TEI files with different structures into TF, one dataset per work.
Then the challenge became to combine them into one big dataset, and to merge several node types into one type, and several features into one.
See patristics.
%load_ext autoreload
%autoreload 2
The new functions are `collect()` and `modify()`.
from tf.fabric import Fabric
from tf.dataset import modify
from tf.volumes import collect
We use two copies of our example corpus Banks, present in this repository.
The `collect()` function takes any number of directory locations, and considers each location to be the host of a TF data set.

You can pass this list straight to the `collect()` function as the `locations` parameter, or you can add names to the individual corpora. In that case, you pass an iterable of (`name`, `location`) pairs into the `locations` parameter.

Here we give the first copy the name `banks`, and the second copy the name `rivers`.

We also specify the output location.
PREFIX = "combine/input"
SUFFIX = "tf/0.2"
locations = (
    ("banks", f"{PREFIX}/banks1/{SUFFIX}"),
    ("rivers", f"{PREFIX}/banks2/{SUFFIX}"),
)
COMBINED = "combine/_temp/riverbanks"
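If you do not care about names, you can also pass the bare locations; a volume's name then defaults to its location (as noted later in this tutorial). A small sketch of both call shapes, using the same paths as above:

```python
PREFIX = "combine/input"
SUFFIX = "tf/0.2"

# Named volumes: an iterable of (name, location) pairs
locationsNamed = (
    ("banks", f"{PREFIX}/banks1/{SUFFIX}"),
    ("rivers", f"{PREFIX}/banks2/{SUFFIX}"),
)

# Unnamed volumes: just the locations themselves
locationsUnnamed = tuple(loc for (name, loc) in locationsNamed)
```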
We are going to call the `collect()` function. But first we clear the output location.

Note how you can mix a bash-shell command with your Python code.
output = COMBINED
!rm -rf {output}
collect(
    locations,
    output,
    volumeType="volume",
    volumeFeature="title",
    featureMeta=dict(
        otext=dict(
            sectionTypes="volume,chapter,line",
            sectionFeatures="title,number,number",
            **{"fmt:text-orig-full": "{letters} "},
        ),
    ),
)
0.00s Loading volume banks from combine/input/banks1/tf/0.2 ... This is Text-Fabric 9.1.3 Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html 10 features found and 0 ignored 0.00s loading features ... 0.01s All features loaded/computed - for details use TF.isLoaded() | 0.00s Feature overview: 8 for nodes; 1 for edges; 1 configs; 8 computed 0.00s loading features ... 0.00s All additional features loaded - for details use TF.isLoaded() 0.02s Loading volume rivers from combine/input/banks2/tf/0.2 ... This is Text-Fabric 9.1.3 Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html 10 features found and 0 ignored 0.00s loading features ... 0.01s All features loaded/computed - for details use TF.isLoaded() | 0.00s Feature overview: 8 for nodes; 1 for edges; 1 configs; 8 computed 0.00s loading features ... 0.00s All additional features loaded - for details use TF.isLoaded() 0.04s inspect metadata ... WARNING: otext.structureFeatures metadata varies across volumes WARNING: otext.structureTypes metadata varies across volumes WARNING: author.compiler metadata varies across volumes WARNING: author.purpose metadata varies across volumes WARNING: letters.description metadata varies across volumes 0.04s metadata sorted out 0.04s check nodetypes ... | volume banks | volume rivers 0.04s node types ok 0.04s Collect nodes from volumes ... | 0.00s Check against overlapping slots ... | | banks : 99 slots | | rivers : 99 slots | 0.00s no overlap | 0.00s Group non-slot nodes by type | | banks : 100- 117 | | rivers : 100- 117 | 0.00s Mapping nodes from volume to/from work ... | | book : 199 - 200 | | chapter : 201 - 204 | | line : 205 - 228 | | sentence : 229 - 234 | 0.01s The new work has 236 nodes of which 198 slots 0.05s collection done 0.05s remap features ... 0.05s remapping done 0.05s write work as TF data set 0.07s writing done 0.07s done
True
This function is a bit verbose in its output, but a lot happens under the hood, and if your dataset is large, it may take several minutes. It is pleasant to see the progress under those circumstances.

But for now, we pass `silent=True` to make everything a bit more quiet.
output = COMBINED
!rm -rf {output}
collect(
    locations,
    output,
    volumeType="volume",
    volumeFeature="title",
    featureMeta=dict(
        otext=dict(
            sectionTypes="volume,chapter,line",
            sectionFeatures="title,number,number",
            **{"fmt:text-orig-full": "{letters} "},
        ),
    ),
    silent=True,
)
WARNING: otext.structureFeatures metadata varies across volumes WARNING: otext.structureTypes metadata varies across volumes WARNING: author.compiler metadata varies across volumes WARNING: author.purpose metadata varies across volumes WARNING: letters.description metadata varies across volumes
True
There you are, on your file system you see the combined dataset:
!ls -l {output}
total 88 -rw-r--r-- 1 dirk staff 559 Nov 4 16:04 author.tf -rw-r--r-- 1 dirk staff 524 Nov 4 16:04 gap.tf -rw-r--r-- 1 dirk staff 1619 Nov 4 16:04 letters.tf -rw-r--r-- 1 dirk staff 548 Nov 4 16:04 number.tf -rw-r--r-- 1 dirk staff 681 Nov 4 16:04 oslots.tf -rw-r--r-- 1 dirk staff 1062 Nov 4 16:04 otext.tf -rw-r--r-- 1 dirk staff 485 Nov 4 16:04 otype.tf -rw-r--r-- 1 dirk staff 2747 Nov 4 16:04 ovolume.tf -rw-r--r-- 1 dirk staff 640 Nov 4 16:04 punc.tf -rw-r--r-- 1 dirk staff 494 Nov 4 16:04 terminator.tf -rw-r--r-- 1 dirk staff 563 Nov 4 16:04 title.tf
If we compare that with one of the inputs:
!ls -l {PREFIX}/banks1/{SUFFIX}
total 80 -rw-r--r-- 1 dirk staff 359 May 20 2019 author.tf -rw-r--r-- 1 dirk staff 409 May 20 2019 gap.tf -rw-r--r-- 1 dirk staff 911 May 20 2019 letters.tf -rw-r--r-- 1 dirk staff 421 May 20 2019 number.tf -rw-r--r-- 1 dirk staff 419 May 20 2019 oslots.tf -rw-r--r-- 1 dirk staff 572 May 20 2019 otext.tf -rw-r--r-- 1 dirk staff 372 May 30 2019 otype.tf -rw-r--r-- 1 dirk staff 457 May 20 2019 punc.tf -rw-r--r-- 1 dirk staff 377 May 20 2019 terminator.tf -rw-r--r-- 1 dirk staff 361 May 20 2019 title.tf
then we see the same files (with the addition of `ovolume.tf`), but with different file sizes.
Let's have a look inside. Note that we use the TF function `loadAll()`, which loads all loadable features.
TF = Fabric(locations=COMBINED)
api = TF.loadAll(silent=False)
docs = api.makeAvailableIn(globals())
This is Text-Fabric 9.1.3 Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html 11 features found and 0 ignored 0.00s loading features ... | 0.00s T otype from combine/_temp/riverbanks | 0.00s T oslots from combine/_temp/riverbanks | 0.00s Dataset without structure sections in otext:no structure functions in the T-API | 0.00s T number from combine/_temp/riverbanks | 0.00s T punc from combine/_temp/riverbanks | 0.00s T gap from combine/_temp/riverbanks | 0.00s T terminator from combine/_temp/riverbanks | 0.00s T title from combine/_temp/riverbanks | 0.00s T letters from combine/_temp/riverbanks | | 0.00s C __levels__ from otype, oslots, otext | | 0.00s C __order__ from otype, oslots, __levels__ | | 0.00s C __rank__ from otype, __order__ | | 0.00s C __levUp__ from otype, oslots, __rank__ | | 0.00s C __levDown__ from otype, __levUp__, __rank__ | | 0.00s C __boundary__ from otype, oslots, __rank__ | | 0.00s C __sections__ from otype, oslots, otext, __levUp__, __levels__, title, number, number 0.03s All features loaded/computed - for details use TF.isLoaded() | 0.00s Feature overview: 9 for nodes; 1 for edges; 1 configs; 8 computed 0.00s loading features ... | 0.00s T author from combine/_temp/riverbanks | 0.00s T ovolume from combine/_temp/riverbanks 0.01s All additional features loaded - for details use TF.isLoaded()
We look up the section of the first word:
T.sectionFromNode(1)
('banks', 1, 1)
The component sets had 99 words each. So what is the section of word 100?
T.sectionFromNode(100)
('rivers', 1, 1)
Right, that's the first word of the second component.
Here is an overview of all the node types in the combined set.
The second field is the average length in words for nodes of that type, the remaining fields give the first and last node of that type.
C.levels.data
(('book', 99.0, 199, 200), ('volume', 99.0, 235, 236), ('chapter', 49.5, 201, 204), ('sentence', 33.0, 229, 234), ('line', 7.666666666666667, 205, 228), ('word', 1, 1, 198))
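The average in the second field can be checked by hand: for each type, it is the number of slots covered by nodes of that type, divided by the number of such nodes. For instance, for the 4 chapter nodes (201-204), assuming every word lies in some chapter:

```python
# Sanity check of the levels overview: average length of a 'chapter' node
nChapters = 204 - 201 + 1   # chapter nodes 201..204
nSlots = 198                # all words in the combined work
assert nSlots / nChapters == 49.5
```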
The combined data set consists of the concatenation of all slot nodes of the component data sets.

Note that the individual components have got a top node, of type `volume`. This is the effect of specifying `volumeType="volume"`.

There is also a feature for volumes, named `title`, that contains their name, or, if we have not passed their names in the `locations` parameter, their location. This is the effect of `volumeFeature="title"`.
Let's check. We use the new `.items()` method on features.
F.title.items()
dict_items([(199, 'Consider Phlebas'), (200, 'Consider Phlebas'), (235, 'banks'), (236, 'rivers')])
We see several things:

* the `title` feature now also has values for the new `volume` nodes: the names we passed in `locations`;
* the `book` nodes still have the same value for `title` as before.

This is a general principle that we see over and over again: when we combine data, we merge as much as possible. That means that when you create new features, you may use the names of old features, and the new information for that feature will be merged with the old information of that feature.
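The merge principle can be pictured with plain dictionaries: per-volume data for the same feature name ends up in one mapping, after the volumes' nodes have been remapped to non-overlapping ranges in the combined work. The values below mirror the `title` data shown above (the node numbers are the remapped ones, and are illustrative):

```python
# Per-volume title data after node remapping (book nodes 199-200,
# volume nodes 235-236, as in the combined work above)
titleFromBanks = {199: "Consider Phlebas", 235: "banks"}
titleFromRivers = {200: "Consider Phlebas", 236: "rivers"}

# Merging per-volume data for the same feature name into one feature
titleCombined = {**titleFromBanks, **titleFromRivers}
```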
Although combining has its complications, the most complex operation is `modify()`, because it can do many things.

It operates on a single TF dataset, and it produces a modified dataset as a fresh copy. Despite the name, no actual modification takes place on the input dataset.
location = f"{PREFIX}/banks1/{SUFFIX}"
MODIFIED = "_temp/mudbanks"
Now we take the first local copy of the Banks dataset as our input, for a lot of different operations.
Here is the list of what `modify()` can do. The order is important, because all operations are executed in this order:

1. merge features;
2. delete features;
3. add features;
4. merge node types;
5. delete node types;
6. add node types;
7. modify feature metadata, including the `otext` feature, such as text formats and section structure definitions.

`modify()` will perform as many sanity checks as possible before it starts working, so that the chances are good that the modified dataset will load properly. It will adapt the value type of features to the values encountered, and it will deduce whether edges have values or not.

If a modified dataset does not load, while the original dataset did load, it is a bug, and I welcome a GitHub issue for it.
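For reference, these operations correspond to keyword arguments of `modify()`, all of which appear in the calls later in this tutorial. The values below are placeholders, not real data; only the parameter names and their execution order matter here:

```python
# Placeholder values only; see the actual modify() calls in this tutorial.
modifyArgs = dict(
    mergeFeatures=dict(heading="title number"),
    deleteFeatures="author gap",
    addFeatures=dict(nodeFeatures={}, edgeFeatures={}),
    mergeTypes=dict(rule=dict(line=dict(type="line"))),
    deleteTypes="sentence line",
    addTypes=dict(),
    featureMeta=dict(otext=dict()),
)

# Dicts preserve insertion order, so this lists the operations in order
assert list(modifyArgs) == [
    "mergeFeatures",
    "deleteFeatures",
    "addFeatures",
    "mergeTypes",
    "deleteTypes",
    "addTypes",
    "featureMeta",
]
```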
We start with the last one, the simplest one.
otext = dict(
    sectionTypes="book,chapter",
    sectionFeatures="title,number",
    **{"fmt:text-orig-full": "{letters} "},
)
We use `silent=True` from now on, but if you work with larger datasets, it is recommended to set `silent=False`, or to leave it out altogether.
test = "meta"
output = f"{MODIFIED}.{test}"
!rm -rf {output}
modify(
    location,
    output,
    featureMeta=dict(otext=otext),
    silent=True,
)
True
TF = Fabric(locations=f"{MODIFIED}.{test}", silent=True)
api = TF.loadAll(silent=True)
docs = api.makeAvailableIn(globals())
We now have only 2 section levels. If we ask for some sections, we see that we get only 2 components in the tuple.
T.sectionFromNode(1)
('Consider Phlebas', 1)
T.sectionFromNode(99)
('Consider Phlebas', 2)
We are going to do some tricky mergers on features that are involved in the section structure and the text formats, so we take care to modify those by means of the `featureMeta` parameter.
otext = dict(
    sectionTypes="book,chapter",
    sectionFeatures="heading,heading",
    structureTypes="book,chapter",
    structureFeatures="heading,heading",
    **{
        "fmt:text-orig-full": "{content} ",
        "fmt:text-orig-fake": "{fake} ",
        "fmt:line-default": "{content:XXX}{terminator} ",
    },
)
We want sectional headings in one feature, `heading`, instead of in `title` for books and `number` for chapters.

We also make a `content` feature that always gives the `letters` of a word, ignoring its punctuation. And we make the opposite, `fake`: it prefers `punc` over `letters`.

Note that `punc` and `letters` will be deleted only after the merge as a whole is completed, so that it is indeed possible for features to be the input of multiple mergers.
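The value resolution in such a merge can be sketched with dictionary merging. This is a reading of the outputs below, not TF's actual implementation: features are merged in the order given, and later ones override earlier ones where a node has both:

```python
# Hypothetical word data: every word has letters, some also carry punctuation
letters = {1: "Everything", 2: "about", 3: "us"}
punc = {3: ","}

content = {**punc, **letters}   # "punc letters": letters override punc
fake = {**letters, **punc}      # "letters punc": punc overrides letters
```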
test = "merge.f"
output = f"{MODIFIED}.{test}"
!rm -rf {output}
modify(
    location,
    output,
    mergeFeatures=dict(
        heading="title number",
        content="punc letters",
        fake="letters punc",
    ),
    featureMeta=dict(
        otext=otext,
    ),
    silent=True,
)
True
TF = Fabric(locations=f"{MODIFIED}.{test}", silent=True)
api = TF.loadAll(silent=True)
docs = api.makeAvailableIn(globals())
We inspect the new `heading` feature for a book and a chapter.
b = F.otype.s("book")[0]
F.heading.v(b)
'Consider Phlebas'
c = F.otype.s("chapter")[0]
F.heading.v(c)
'1'
And here is an overview of all node features: `title` and `number` are gone, together with `punc` and `letters`.
Fall()
['author', 'content', 'fake', 'gap', 'heading', 'otype', 'terminator']
We have modified the standard text format, `text-orig-full`. It now uses the `content` feature, and indeed, we do not see punctuation anymore.
T.text(range(1, 10))
'Everything about us everything around us everything we know '
On the other hand, `text-orig-fake` uses the `fake` feature, and we see that the words that carry punctuation have been replaced by just their punctuation.
T.text(range(1, 10), fmt="text-orig-fake")
'Everything about , everything around , everything we know '
We now just remove two features from the dataset: `author` and `terminator`.
test = "delete.f"
output = f"{MODIFIED}.{test}"
!rm -rf {output}
modify(
    location,
    output,
    deleteFeatures="author terminator",
    silent=True,
)
| Missing for text API: features: terminator
False
Oops. `terminator` is used in a text format, so if we delete it, the dataset will not load properly.

Let's not delete `terminator` but `gap`.
test = "delete.f"
output = f"{MODIFIED}.{test}"
modify(
    location,
    output,
    deleteFeatures="author gap",
    silent=True,
)
True
TF = Fabric(locations=f"{MODIFIED}.{test}", silent=True)
api = TF.loadAll(silent=True)
docs = api.makeAvailableIn(globals())
Fall()
['letters', 'number', 'otype', 'punc', 'terminator', 'title']
Indeed, `gap` is gone.
F.gap.freqList()
--------------------------------------------------------------------------- AttributeError Traceback (most recent call last) <ipython-input-36-3d34f1d72b02> in <module> ----> 1 F.gap.freqList() AttributeError: 'NodeFeatures' object has no attribute 'gap'
I told you! Sigh ...
We add a bunch of node features and edge features.

When you add features, you also have to pass their data. Here we compute that data in place, which results in a lengthy call; usually you will already have that data in a dictionary, and then you only pass the dictionary.

We do not have to tell the value types of the new features explicitly: `modify()` will deduce them. We can override that by passing a value type explicitly. Let's declare `lemma` to be `str`, and `big` to be `int`:
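The deduction can be sketched roughly like this (a hypothetical helper illustrating the rule, not TF's actual code): a feature becomes `int`-valued only if all of its values are integers, otherwise it is `str`-valued:

```python
# Rough sketch of value-type deduction (hypothetical helper, not TF code)
def deduceValueType(values):
    return "int" if all(isinstance(v, int) for v in values) else "str"
```

So `lemma` (all values like 1001, 1002, ...) would be deduced as `int` unless we override it, while `big` (values like "B", "C", ...) can never be `int`.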
test = "add.f"
output = f"{MODIFIED}.{test}"
!rm -rf {output}
modify(
    location,
    output,
    addFeatures=dict(
        nodeFeatures=dict(
            author={101: "Banks Jr.", 102: "Banks Sr."},
            lemma={n: 1000 + n for n in range(1, 10)},
            small={n: chr(ord("a") + n % 26) for n in range(1, 10)},
            big={n: chr(ord("A") + n % 26) for n in range(1, 10)},
        ),
        edgeFeatures=dict(
            link={n: {n + i for i in range(1, 3)} for n in range(1, 10)},
            similarity={
                n: {n + i: chr(ord("a") + (i + n) % 26) for i in range(1, 3)}
                for n in range(1, 10)
            },
        ),
    ),
    featureMeta=dict(
        lemma=dict(
            valueType="str",
        ),
        big=dict(
            valueType="int",
        ),
    ),
    silent=True,
)
| Add features: big: feature values are declared to be int but some values are not int
False
We get away with `lemma` as a string, because everything that can be written down can also be read as a string. But not all values of `big` are numbers, hence the complaint.

Let's stick to the defaults:
test = "add.f"
output = f"{MODIFIED}.{test}"
!rm -rf {output}
modify(
    location,
    output,
    addFeatures=dict(
        nodeFeatures=dict(
            author={101: "Banks Jr.", 102: "Banks Sr."},
            lemma={n: 1000 + n for n in range(1, 10)},
            small={n: chr(ord("a") + n % 26) for n in range(1, 10)},
            big={n: chr(ord("A") + n % 26) for n in range(1, 10)},
        ),
        edgeFeatures=dict(
            link={n: {n + i for i in range(1, 3)} for n in range(1, 10)},
            similarity={
                n: {n + i: chr(ord("a") + (i + n) % 26) for i in range(1, 3)}
                for n in range(1, 10)
            },
        ),
    ),
    silent=True,
)
True
TF = Fabric(locations=f"{MODIFIED}.{test}", silent=True)
api = TF.loadAll(silent=True)
docs = api.makeAvailableIn(globals())
Fall()
['author', 'big', 'gap', 'lemma', 'letters', 'number', 'otype', 'punc', 'small', 'terminator', 'title']
Eall()
['link', 'oslots', 'similarity']
We see the extra features; let's just enumerate their mappings.

`link` is an edge feature whose edges do not have values. So for each node `n`, the result is a set of nodes.
E.link.items()
dict_items([(1, frozenset({2, 3})), (2, frozenset({3, 4})), (3, frozenset({4, 5})), (4, frozenset({5, 6})), (5, frozenset({6, 7})), (6, frozenset({8, 7})), (7, frozenset({8, 9})), (8, frozenset({9, 10})), (9, frozenset({10, 11}))])
`similarity` assigns values to the edges. So for each node `n`, the result is a mapping from nodes to values.
E.similarity.items()
dict_items([(1, {2: 'c', 3: 'd'}), (2, {3: 'd', 4: 'e'}), (3, {4: 'e', 5: 'f'}), (4, {5: 'f', 6: 'g'}), (5, {6: 'g', 7: 'h'}), (6, {7: 'h', 8: 'i'}), (7, {8: 'i', 9: 'j'}), (8, {9: 'j', 10: 'k'}), (9, {10: 'k', 11: 'l'})])
E.similarity.f(1)
((2, 'c'), (3, 'd'))
Now the node features.
F.author.items()
dict_items([(100, 'Iain M. Banks'), (101, 'Banks Jr.'), (102, 'Banks Sr.')])
F.small.items()
dict_items([(1, 'b'), (2, 'c'), (3, 'd'), (4, 'e'), (5, 'f'), (6, 'g'), (7, 'h'), (8, 'i'), (9, 'j')])
F.big.items()
dict_items([(1, 'B'), (2, 'C'), (3, 'D'), (4, 'E'), (5, 'F'), (6, 'G'), (7, 'H'), (8, 'I'), (9, 'J')])
F.lemma.items()
dict_items([(1, 1001), (2, 1002), (3, 1003), (4, 1004), (5, 1005), (6, 1006), (7, 1007), (8, 1008), (9, 1009)])
Manipulating features is relatively easy. But when we fiddle with the node types, we need our wits about us.

In this example, we first do a feature merge of `title` and `number` into `nm`. Then we merge the `line` and `sentence` types into a new type `rule`. And `book` and `chapter` will merge into `section`.
We adapt our section structure so that it makes use of the new features and types.
test = "merge.t"
output = f"{MODIFIED}.{test}"
!rm -rf {output}
modify(
    location,
    output,
    mergeFeatures=dict(nm="title number"),
    mergeTypes=dict(
        rule=dict(
            line=dict(
                type="line",
            ),
            sentence=dict(
                type="sentence",
            ),
        ),
        section=dict(
            book=dict(
                type="book",
            ),
            chapter=dict(
                type="chapter",
            ),
        ),
    ),
    featureMeta=dict(
        otext=dict(
            sectionTypes="section,rule",
            sectionFeatures="nm,nm",
            structureTypes="section",
            structureFeatures="nm",
        ),
    ),
    silent=True,
)
True
TF = Fabric(locations=f"{MODIFIED}.{test}", silent=True)
api = TF.loadAll(silent=True)
docs = api.makeAvailableIn(globals())
We expect a severely reduced inventory of node types:
C.levels.data
(('section', 66.0, 100, 102), ('rule', 12.733333333333333, 103, 117), ('word', 1, 1, 99))
Fall()
['author', 'gap', 'letters', 'nm', 'otype', 'punc', 'terminator', 'type']
We delete the `line` and `sentence` types.
test = "delete.t"
output = f"{MODIFIED}.{test}"
!rm -rf {output}
modify(
    location,
    output,
    deleteTypes="sentence line",
    silent=True,
)
| Missing for text API: types: line, sentence
False
But, again, we cannot do that, because these types are important for the text API.

This time, we change the text API so that it does not need them anymore.
test = "delete.t"
output = f"{MODIFIED}.{test}"
modify(
    location,
    output,
    deleteTypes="sentence line",
    featureMeta=dict(
        otext=dict(
            sectionTypes="book,chapter",
            sectionFeatures="title,number",
            structureTypes="book,chapter",
            structureFeatures="title,number",
        ),
    ),
    silent=True,
)
True
TF = Fabric(locations=f"{MODIFIED}.{test}", silent=True)
api = TF.loadAll(silent=True)
docs = api.makeAvailableIn(globals())
C.levels.data
(('book', 99.0, 100, 100), ('chapter', 49.5, 101, 102), ('word', 1, 1, 99))
As expected.
Adding types involves a lot of data, because we do not only add nodes, but also features about those nodes.

The idea is this: suppose that somewhere in another dataset, you have found lexeme nodes for the words in your data set. Those lexeme nodes may range from 100,000 to 110,000, say, and you find a way to map them to your words, by means of a map `nodeSlots`. Then you can just grab the lexeme features as they are, and pack them into the `addTypes` argument, together with `nodeSlots` and the node boundaries (100,000 and 110,000).

The new feature data cannot say anything about nodes in the input data set, because the new nodes will be shifted so that they are past the `maxNode` of your input data set. And if your feature data accidentally addresses nodes outside the declared range, those assignments will be ignored.
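The shifting and the range check can be sketched like this (a sketch with hypothetical numbers; as it happens, the input dataset in the example below has `maxNode` 117, and the `bis` nodes indeed end up at 118-122):

```python
# Sketch of how addTypes renumbers new nodes past maxNode of the input
maxNode = 117                 # maxNode of the input dataset
nodeFrom, nodeTo = 1, 5       # declared node boundaries of the new type
offset = maxNode + 1 - nodeFrom

nameData = {1: "b1", 2: "b2", 7: "stray"}   # 7 is out of range: ignored
shifted = {
    n + offset: v
    for (n, v) in nameData.items()
    if nodeFrom <= n <= nodeTo
}
```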
So, all in all, it is a rather clean addition of material. Maybe a bit too clean, because it is also impossible to add edge features that link the new nodes to the old nodes. But then, it would be devilishly hard to make sure that, after the necessary remapping of the edge features, they address the intended nodes.

If you do want edge features between old and new nodes, it is better to compute them in the new dataset and add them as an individual feature, or by another call to `modify()`.
Let's have a look at an example where we add a type `bis`, consisting of a few bigrams, and a type `tris`, consisting of a bunch of trigrams. We just furnish a slot mapping for those nodes, and give them a `name` feature.
test = "add.t"
output = f"{MODIFIED}.{test}"
!rm -rf {output}
modify(
    location,
    output,
    addTypes=dict(
        bis=dict(
            nodeFrom=1,
            nodeTo=5,
            nodeSlots={
                1: {10, 11},
                2: {20, 21},
                3: {30, 31},
                4: {40, 41},
                5: {50, 51},
            },
            nodeFeatures=dict(
                name={
                    1: "b1",
                    2: "b2",
                    3: "b3",
                    4: "b4",
                    5: "b5",
                },
            ),
            edgeFeatures=dict(
                link={
                    1: {2: 100, 3: 50, 4: 25},
                    2: {3: 50, 4: 25, 5: 12},
                    3: {4: 25, 5: 12},
                    4: {5: 12, 1: 6},
                    5: {1: 6, 2: 3, 4: 1},
                },
            ),
        ),
        tris=dict(
            nodeFrom=1,
            nodeTo=4,
            nodeSlots={
                1: {60, 61, 62},
                2: {70, 71, 72},
                3: {80, 81, 82},
                4: {90, 91, 94},
            },
            nodeFeatures=dict(
                name={
                    1: "tr1",
                    2: "tr2",
                    3: "tr3",
                    4: "tr4",
                },
            ),
            edgeFeatures=dict(
                sim={
                    1: {2, 3, 4},
                    2: {3, 4},
                    3: {4},
                    4: {5, 1},
                },
            ),
        ),
    ),
    silent=True,
)
True
TF = Fabric(locations=f"{MODIFIED}.{test}", silent=True)
api = TF.loadAll(silent=True)
docs = api.makeAvailableIn(globals())
C.levels.data
(('book', 99.0, 100, 100), ('chapter', 49.5, 101, 102), ('sentence', 33.0, 115, 117), ('line', 7.666666666666667, 103, 114), ('tris', 3.0, 123, 126), ('bis', 2.0, 118, 122), ('word', 1, 1, 99))
There are the `bis` and `tris`!
Fall()
['author', 'gap', 'letters', 'name', 'number', 'otype', 'punc', 'terminator', 'title']
And there is the new feature `name`:
sorted(F.name.items())
[(118, 'b1'), (119, 'b2'), (120, 'b3'), (121, 'b4'), (122, 'b5'), (123, 'tr1'), (124, 'tr2'), (125, 'tr3'), (126, 'tr4')]
Eall()
['link', 'oslots', 'sim']
And the new edge features `link` and `sim`:
sorted(E.link.items())
[(118, {121: '25', 120: '50', 119: '100'}), (119, {122: '12', 121: '25', 120: '50'}), (120, {122: '12', 121: '25'}), (121, {118: '6', 122: '12'}), (122, {121: '1', 119: '3', 118: '6'})]
sorted(E.sim.items())
[(123, frozenset({124, 125, 126})), (124, frozenset({125, 126})), (125, frozenset({126})), (126, frozenset({123}))]
And that is all for now.
Incredible that you made it till here!