This is about combining multiple TF datasets into one, and then tweaking it further.
In the previous chapters of this tutorial you have learned how to add new features to an existing dataset.
Here you learn how you can combine dozens of slightly heterogeneous TF data sets, and apply structural tweaks to the node types and features later on.
The incentive to write these composition functions into Text-Fabric came from Ernst Boogert while he was converting between 100 and 200 works by the Church Fathers (Patristics). The conversion did a very good job in getting all the information from TEI files with different structures into TF, one dataset per work.
Then the challenge became to combine them into one big dataset, and to merge several node types into one type, and several features into one.
See patristics.
%load_ext autoreload
%autoreload 2
The new functions are `collect()` and `modify()`.
from tf.fabric import Fabric
from tf.dataset import modify
from tf.volumes import collect
We use two copies of our example corpus Banks, present in this repository.
The `collect()` function takes any number of directory locations, and considers each location to be the host of a TF data set.

You can pass this list straight to the `collect()` function as the `locations` parameter, or you can add names to the individual corpora. In that case, you pass an iterable of (`name`, `location`) pairs into the `locations` parameter.

Here we give the first copy the name `banks`, and the second copy the name `rivers`.

We also specify the output location.
PREFIX = "combine/input"
SUFFIX = "tf/0.2"
locations = (
    ("banks", f"{PREFIX}/banks1/{SUFFIX}"),
    ("rivers", f"{PREFIX}/banks2/{SUFFIX}"),
)
COMBINED = "combine/_temp/riverbanks"
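If you do not care about names, you can also pass the bare locations; a volume's name then defaults to its location (as noted later in this tutorial). A small sketch of both call shapes, using the same paths as above:

```python
PREFIX = "combine/input"
SUFFIX = "tf/0.2"

# Named volumes: an iterable of (name, location) pairs
locationsNamed = (
    ("banks", f"{PREFIX}/banks1/{SUFFIX}"),
    ("rivers", f"{PREFIX}/banks2/{SUFFIX}"),
)

# Unnamed volumes: just the locations themselves
locationsUnnamed = tuple(loc for (name, loc) in locationsNamed)
```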
We are going to call the `collect()` function. But first we clear the output location.

Note how you can mix a bash-shell command with your Python code.
output = COMBINED
!rm -rf {output}
collect(
    locations,
    output,
    volumeType="volume",
    volumeFeature="title",
    featureMeta=dict(
        otext=dict(
            sectionTypes="volume,chapter,line",
            sectionFeatures="title,number,number",
            **{"fmt:text-orig-full": "{letters} "},
        ),
    ),
)
0.00s Loading volume banks from combine/input/banks1/tf/0.2 ... This is Text-Fabric 9.1.3 Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html 10 features found and 0 ignored 0.00s loading features ... 0.01s All features loaded/computed - for details use TF.isLoaded() | 0.00s Feature overview: 8 for nodes; 1 for edges; 1 configs; 8 computed 0.00s loading features ... 0.00s All additional features loaded - for details use TF.isLoaded() 0.02s Loading volume rivers from combine/input/banks2/tf/0.2 ... This is Text-Fabric 9.1.3 Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html 10 features found and 0 ignored 0.00s loading features ... 0.01s All features loaded/computed - for details use TF.isLoaded() | 0.00s Feature overview: 8 for nodes; 1 for edges; 1 configs; 8 computed 0.00s loading features ... 0.00s All additional features loaded - for details use TF.isLoaded() 0.04s inspect metadata ... WARNING: otext.structureFeatures metadata varies across volumes WARNING: otext.structureTypes metadata varies across volumes WARNING: author.compiler metadata varies across volumes WARNING: author.purpose metadata varies across volumes WARNING: letters.description metadata varies across volumes 0.04s metadata sorted out 0.04s check nodetypes ... | volume banks | volume rivers 0.04s node types ok 0.04s Collect nodes from volumes ... | 0.00s Check against overlapping slots ... | | banks : 99 slots | | rivers : 99 slots | 0.00s no overlap | 0.00s Group non-slot nodes by type | | banks : 100- 117 | | rivers : 100- 117 | 0.00s Mapping nodes from volume to/from work ... | | book : 199 - 200 | | chapter : 201 - 204 | | line : 205 - 228 | | sentence : 229 - 234 | 0.01s The new work has 236 nodes of which 198 slots 0.05s collection done 0.05s remap features ... 0.05s remapping done 0.05s write work as TF data set 0.07s writing done 0.07s done
True
This function is a bit verbose in its output, but a lot happens under the hood, and if your dataset is large, it may take several minutes. It is pleasant to see the progress under those circumstances.

But for now, we pass `silent=True` to make everything a bit more quiet.
output = COMBINED
!rm -rf {output}
collect(
    locations,
    output,
    volumeType="volume",
    volumeFeature="title",
    featureMeta=dict(
        otext=dict(
            sectionTypes="volume,chapter,line",
            sectionFeatures="title,number,number",
            **{"fmt:text-orig-full": "{letters} "},
        ),
    ),
    silent=True,
)
WARNING: otext.structureFeatures metadata varies across volumes WARNING: otext.structureTypes metadata varies across volumes WARNING: author.compiler metadata varies across volumes WARNING: author.purpose metadata varies across volumes WARNING: letters.description metadata varies across volumes
True
There you are, on your file system you see the combined dataset:
!ls -l {output}
total 88 -rw-r--r-- 1 dirk staff 559 Nov 4 16:04 author.tf -rw-r--r-- 1 dirk staff 524 Nov 4 16:04 gap.tf -rw-r--r-- 1 dirk staff 1619 Nov 4 16:04 letters.tf -rw-r--r-- 1 dirk staff 548 Nov 4 16:04 number.tf -rw-r--r-- 1 dirk staff 681 Nov 4 16:04 oslots.tf -rw-r--r-- 1 dirk staff 1062 Nov 4 16:04 otext.tf -rw-r--r-- 1 dirk staff 485 Nov 4 16:04 otype.tf -rw-r--r-- 1 dirk staff 2747 Nov 4 16:04 ovolume.tf -rw-r--r-- 1 dirk staff 640 Nov 4 16:04 punc.tf -rw-r--r-- 1 dirk staff 494 Nov 4 16:04 terminator.tf -rw-r--r-- 1 dirk staff 563 Nov 4 16:04 title.tf
If we compare that with one of the inputs:
!ls -l {PREFIX}/banks1/{SUFFIX}
total 80 -rw-r--r-- 1 dirk staff 359 May 20 2019 author.tf -rw-r--r-- 1 dirk staff 409 May 20 2019 gap.tf -rw-r--r-- 1 dirk staff 911 May 20 2019 letters.tf -rw-r--r-- 1 dirk staff 421 May 20 2019 number.tf -rw-r--r-- 1 dirk staff 419 May 20 2019 oslots.tf -rw-r--r-- 1 dirk staff 572 May 20 2019 otext.tf -rw-r--r-- 1 dirk staff 372 May 30 2019 otype.tf -rw-r--r-- 1 dirk staff 457 May 20 2019 punc.tf -rw-r--r-- 1 dirk staff 377 May 20 2019 terminator.tf -rw-r--r-- 1 dirk staff 361 May 20 2019 title.tf
then we see the same files (with the addition of `ovolume.tf`), but with different file sizes.
Let's have a look inside. Note that we use the TF function `loadAll()`, which loads all loadable features.
TF = Fabric(locations=COMBINED)
api = TF.loadAll(silent=False)
docs = api.makeAvailableIn(globals())
This is Text-Fabric 9.1.3 Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html 11 features found and 0 ignored 0.00s loading features ... | 0.00s T otype from combine/_temp/riverbanks | 0.00s T oslots from combine/_temp/riverbanks | 0.00s Dataset without structure sections in otext:no structure functions in the T-API | 0.00s T number from combine/_temp/riverbanks | 0.00s T punc from combine/_temp/riverbanks | 0.00s T gap from combine/_temp/riverbanks | 0.00s T terminator from combine/_temp/riverbanks | 0.00s T title from combine/_temp/riverbanks | 0.00s T letters from combine/_temp/riverbanks | | 0.00s C __levels__ from otype, oslots, otext | | 0.00s C __order__ from otype, oslots, __levels__ | | 0.00s C __rank__ from otype, __order__ | | 0.00s C __levUp__ from otype, oslots, __rank__ | | 0.00s C __levDown__ from otype, __levUp__, __rank__ | | 0.00s C __boundary__ from otype, oslots, __rank__ | | 0.00s C __sections__ from otype, oslots, otext, __levUp__, __levels__, title, number, number 0.03s All features loaded/computed - for details use TF.isLoaded() | 0.00s Feature overview: 9 for nodes; 1 for edges; 1 configs; 8 computed 0.00s loading features ... | 0.00s T author from combine/_temp/riverbanks | 0.00s T ovolume from combine/_temp/riverbanks 0.01s All additional features loaded - for details use TF.isLoaded()
We look up the section of the first word:
T.sectionFromNode(1)
('banks', 1, 1)
The component sets had 99 words each. So what is the section of word 100?
T.sectionFromNode(100)
('rivers', 1, 1)
Right, that's the first word of the second component.
Here is an overview of all the node types in the combined set.
The second field is the average length in words for nodes of that type, the remaining fields give the first and last node of that type.
C.levels.data
(('book', 99.0, 199, 200), ('volume', 99.0, 235, 236), ('chapter', 49.5, 201, 204), ('sentence', 33.0, 229, 234), ('line', 7.666666666666667, 205, 228), ('word', 1, 1, 198))
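The average in the second field can be checked by hand: for each type, it is the number of slots covered by nodes of that type, divided by the number of such nodes. For instance, for the 4 chapter nodes (201-204), assuming every word lies in some chapter:

```python
# Sanity check of the levels overview: average length of a 'chapter' node
nChapters = 204 - 201 + 1   # chapter nodes 201..204
nSlots = 198                # all words in the combined work
assert nSlots / nChapters == 49.5
```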
The combined data set consists of the concatenation of all slot nodes of the component data sets.

Note that the individual components have got a top node, of type `volume`. This is the effect of specifying `volumeType="volume"`.

There is also a feature for volumes, named `title`, that contains their name, or, if we have not passed their names in the `locations` parameter, their location. This is the effect of `volumeFeature="title"`.
Let's check. We use the new `.items()` method on features.
F.title.items()
dict_items([(199, 'Consider Phlebas'), (200, 'Consider Phlebas'), (235, 'banks'), (236, 'rivers')])
We see several things:

* the `title` feature now also has values for the new `volume` nodes: the names we passed in `locations`;
* the `book` nodes still have the same value for `title` as before.

This is a general principle that we see over and over again: when we combine data, we merge as much as possible. That means that when you create new features, you may use the names of old features, and the new information for that feature will be merged with the old information of that feature.
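The merge principle can be pictured with plain dictionaries: per-volume data for the same feature name ends up in one mapping, after the volumes' nodes have been remapped to non-overlapping ranges in the combined work. The values below mirror the `title` data shown above (the node numbers are the remapped ones, and are illustrative):

```python
# Per-volume title data after node remapping (book nodes 199-200,
# volume nodes 235-236, as in the combined work above)
titleFromBanks = {199: "Consider Phlebas", 235: "banks"}
titleFromRivers = {200: "Consider Phlebas", 236: "rivers"}

# Merging per-volume data for the same feature name into one feature
titleCombined = {**titleFromBanks, **titleFromRivers}
```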
Although combining has its complications, the most complex operation is `modify()`, because it can do many things.

It operates on a single TF dataset, and it produces a modified dataset as a fresh copy. Despite the name, no actual modification takes place on the input dataset.
location = f"{PREFIX}/banks1/{SUFFIX}"
MODIFIED = "_temp/mudbanks"
Now we take the first local copy of the Banks dataset as our input, for a lot of different operations.
Here is the list of what `modify()` can do. The order is important, because all operations are executed in this order:

1. merge features;
2. delete features;
3. add features;
4. merge node types;
5. delete node types;
6. add node types;
7. modify feature metadata, including the `otext` feature, such as text formats and section structure definitions.

`modify()` will perform as many sanity checks as possible before it starts working, so that the chances are good that the modified dataset will load properly. It will adapt the value type of features to the values encountered, and it will deduce whether edges have values or not.

If a modified dataset does not load, while the original dataset did load, it is a bug, and I welcome a GitHub issue for it.
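For reference, these operations correspond to keyword arguments of `modify()`, all of which appear in the calls later in this tutorial. The values below are placeholders, not real data; only the parameter names and their execution order matter here:

```python
# Placeholder values only; see the actual modify() calls in this tutorial.
modifyArgs = dict(
    mergeFeatures=dict(heading="title number"),
    deleteFeatures="author gap",
    addFeatures=dict(nodeFeatures={}, edgeFeatures={}),
    mergeTypes=dict(rule=dict(line=dict(type="line"))),
    deleteTypes="sentence line",
    addTypes=dict(),
    featureMeta=dict(otext=dict()),
)

# Dicts preserve insertion order, so this lists the operations in order
assert list(modifyArgs) == [
    "mergeFeatures",
    "deleteFeatures",
    "addFeatures",
    "mergeTypes",
    "deleteTypes",
    "addTypes",
    "featureMeta",
]
```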
We start with the last one, the simplest one.
otext = dict(
    sectionTypes="book,chapter",
    sectionFeatures="title,number",
    **{"fmt:text-orig-full": "{letters} "},
)
We use `silent=True` from now on, but if you work with larger datasets, it is recommended to set `silent=False`, or to leave it out altogether.
test = "meta"
output = f"{MODIFIED}.{test}"
!rm -rf {output}
modify(
    location,
    output,
    featureMeta=dict(otext=otext),
    silent=True,
)
True
TF = Fabric(locations=f"{MODIFIED}.{test}", silent=True)
api = TF.loadAll(silent=True)
docs = api.makeAvailableIn(globals())
We now have only 2 section levels. If we ask for some sections, we see that we get only 2 components in the tuple.
T.sectionFromNode(1)
('Consider Phlebas', 1)
T.sectionFromNode(99)
('Consider Phlebas', 2)
We are going to do some tricky mergers on features that are involved in the section structure and the text formats, so we take care to modify those by means of the `featureMeta` parameter.
otext = dict(
    sectionTypes="book,chapter",
    sectionFeatures="heading,heading",
    structureTypes="book,chapter",
    structureFeatures="heading,heading",
    **{
        "fmt:text-orig-full": "{content} ",
        "fmt:text-orig-fake": "{fake} ",
        "fmt:line-default": "{content:XXX}{terminator} ",
    },
)
We want sectional headings in one feature, `heading`, instead of in `title` for books and `number` for chapters.

We also make a `content` feature that always gives the `letters` of a word, ignoring its punctuation. And we make the opposite, `fake`: it prefers `punc` over `letters`.

Note that `punc` and `letters` will be deleted only after the merge as a whole is completed, so that it is indeed possible for features to be the input of multiple mergers.
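The value resolution in such a merge can be sketched with dictionary merging. This is a reading of the outputs below, not TF's actual implementation: features are merged in the order given, and later ones override earlier ones where a node has both:

```python
# Hypothetical word data: every word has letters, some also carry punctuation
letters = {1: "Everything", 2: "about", 3: "us"}
punc = {3: ","}

content = {**punc, **letters}   # "punc letters": letters override punc
fake = {**letters, **punc}      # "letters punc": punc overrides letters
```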
test = "merge.f"
output = f"{MODIFIED}.{test}"
!rm -rf {output}
modify(
    location,
    output,
    mergeFeatures=dict(
        heading="title number",
        content="punc letters",
        fake="letters punc",
    ),
    featureMeta=dict(
        otext=otext,
    ),
    silent=True,
)
True
TF = Fabric(locations=f"{MODIFIED}.{test}", silent=True)
api = TF.loadAll(silent=True)
docs = api.makeAvailableIn(globals())
We inspect the new `heading` feature for a book and a chapter.
b = F.otype.s("book")[0]
F.heading.v(b)
'Consider Phlebas'
c = F.otype.s("chapter")[0]
F.heading.v(c)
'1'
And here is an overview of all node features: `title` and `number` are gone, together with `punc` and `letters`.
Fall()
['author', 'content', 'fake', 'gap', 'heading', 'otype', 'terminator']
We have modified the standard text format, `text-orig-full`. It now uses the `content` feature, and indeed, we do not see punctuation anymore.
T.text(range(1, 10))
'Everything about us everything around us everything we know '
On the other hand, `text-orig-fake` uses the `fake` feature, and we see that the words that carry punctuation have been replaced by just their punctuation.
T.text(range(1, 10), fmt="text-orig-fake")
'Everything about , everything around , everything we know '
We now just remove two features from the dataset: `author` and `terminator`.
test = "delete.f"
output = f"{MODIFIED}.{test}"
!rm -rf {output}
modify(
    location,
    output,
    deleteFeatures="author terminator",
    silent=True,
)
| Missing for text API: features: terminator
False
Oops. `terminator` is used in a text format, so if we delete it, the dataset will not load properly.

Let's not delete `terminator` but `gap`.
test = "delete.f"
output = f"{MODIFIED}.{test}"
modify(
    location,
    output,
    deleteFeatures="author gap",
    silent=True,
)
True
TF = Fabric(locations=f"{MODIFIED}.{test}", silent=True)
api = TF.loadAll(silent=True)
docs = api.makeAvailableIn(globals())
Fall()
['letters', 'number', 'otype', 'punc', 'terminator', 'title']
Indeed, `gap` is gone.
F.gap.freqList()
--------------------------------------------------------------------------- AttributeError Traceback (most recent call last) <ipython-input-36-3d34f1d72b02> in <module> ----> 1 F.gap.freqList() AttributeError: 'NodeFeatures' object has no attribute 'gap'
I told you! Sigh ...
We add a bunch of node features and edge features.

When you add features, you also have to pass their data. Here we compute that data in place, which results in a lengthy call; usually you will already have that data in a dictionary, and then you only pass the dictionary.

We do not have to tell the value types of the new features explicitly: `modify()` will deduce them. We can override that by passing a value type explicitly. Let's declare `lemma` to be `str`, and `big` to be `int`:
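The deduction can be sketched roughly like this (a hypothetical helper illustrating the rule, not TF's actual code): a feature becomes `int`-valued only if all of its values are integers, otherwise it is `str`-valued:

```python
# Rough sketch of value-type deduction (hypothetical helper, not TF code)
def deduceValueType(values):
    return "int" if all(isinstance(v, int) for v in values) else "str"
```

So `lemma` (all values like 1001, 1002, ...) would be deduced as `int` unless we override it, while `big` (values like "B", "C", ...) can never be `int`.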
test = "add.f"
output = f"{MODIFIED}.{test}"
!rm -rf {output}
modify(
    location,
    output,
    addFeatures=dict(
        nodeFeatures=dict(
            author={101: "Banks Jr.", 102: "Banks Sr."},
            lemma={n: 1000 + n for n in range(1, 10)},
            small={n: chr(ord("a") + n % 26) for n in range(1, 10)},
            big={n: chr(ord("A") + n % 26) for n in range(1, 10)},
        ),
        edgeFeatures=dict(
            link={n: {n + i for i in range(1, 3)} for n in range(1, 10)},
            similarity={
                n: {n + i: chr(ord("a") + (i + n) % 26) for i in range(1, 3)}
                for n in range(1, 10)
            },
        ),
    ),
    featureMeta=dict(
        lemma=dict(
            valueType="str",
        ),
        big=dict(
            valueType="int",
        ),
    ),
    silent=True,
)
| Add features: big: feature values are declared to be int but some values are not int
False
We get away with `lemma` as a string, because everything that can be written down can also be read as a string. But not all values of `big` are numbers, hence the complaint.

Let's stick to the defaults:
test = "add.f"
output = f"{MODIFIED}.{test}"
!rm -rf {output}
modify(
    location,
    output,
    addFeatures=dict(
        nodeFeatures=dict(
            author={101: "Banks Jr.", 102: "Banks Sr."},
            lemma={n: 1000 + n for n in range(1, 10)},
            small={n: chr(ord("a") + n % 26) for n in range(1, 10)},
            big={n: chr(ord("A") + n % 26) for n in range(1, 10)},
        ),
        edgeFeatures=dict(
            link={n: {n + i for i in range(1, 3)} for n in range(1, 10)},
            similarity={
                n: {n + i: chr(ord("a") + (i + n) % 26) for i in range(1, 3)}
                for n in range(1, 10)
            },
        ),
    ),
    silent=True,
)
True
TF = Fabric(locations=f"{MODIFIED}.{test}", silent=True)
api = TF.loadAll(silent=True)
docs = api.makeAvailableIn(globals())
Fall()
['author', 'big', 'gap', 'lemma', 'letters', 'number', 'otype', 'punc', 'small', 'terminator', 'title']
Eall()
['link', 'oslots', 'similarity']
We see the extra features; let's just enumerate their mappings.

`link` is an edge feature whose edges do not have values. So for each node `n`, the result is a set of nodes.
E.link.items()
dict_items([(1, frozenset({2, 3})), (2, frozenset({3, 4})), (3, frozenset({4, 5})), (4, frozenset({5, 6})), (5, frozenset({6, 7})), (6, frozenset({8, 7})), (7, frozenset({8, 9})), (8, frozenset({9, 10})), (9, frozenset({10, 11}))])
`similarity` assigns values to the edges. So for each node `n`, the result is a mapping from nodes to values.
E.similarity.items()
dict_items([(1, {2: 'c', 3: 'd'}), (2, {3: 'd', 4: 'e'}), (3, {4: 'e', 5: 'f'}), (4, {5: 'f', 6: 'g'}), (5, {6: 'g', 7: 'h'}), (6, {7: 'h', 8: 'i'}), (7, {8: 'i', 9: 'j'}), (8, {9: 'j', 10: 'k'}), (9, {10: 'k', 11: 'l'})])
E.similarity.f(1)
((2, 'c'), (3, 'd'))
Now the node features.
F.author.items()
dict_items([(100, 'Iain M. Banks'), (101, 'Banks Jr.'), (102, 'Banks Sr.')])
F.small.items()
dict_items([(1, 'b'), (2, 'c'), (3, 'd'), (4, 'e'), (5, 'f'), (6, 'g'), (7, 'h'), (8, 'i'), (9, 'j')])
F.big.items()
dict_items([(1, 'B'), (2, 'C'), (3, 'D'), (4, 'E'), (5, 'F'), (6, 'G'), (7, 'H'), (8, 'I'), (9, 'J')])
F.lemma.items()
dict_items([(1, 1001), (2, 1002), (3, 1003), (4, 1004), (5, 1005), (6, 1006), (7, 1007), (8, 1008), (9, 1009)])
Manipulating features is relatively easy. But when we fiddle with the node types, we need our wits about us.

In this example, we first do a feature merge of `title` and `number` into `nm`. Then we merge the `line` and `sentence` types into a new type `rule`. And `book` and `chapter` will merge into `section`.
We adapt our section structure so that it makes use of the new features and types.
test = "merge.t"
output = f"{MODIFIED}.{test}"
!rm -rf {output}
modify(
    location,
    output,
    mergeFeatures=dict(nm="title number"),
    mergeTypes=dict(
        rule=dict(
            line=dict(
                type="line",
            ),
            sentence=dict(
                type="sentence",
            ),
        ),
        section=dict(
            book=dict(
                type="book",
            ),
            chapter=dict(
                type="chapter",
            ),
        ),
    ),
    featureMeta=dict(
        otext=dict(
            sectionTypes="section,rule",
            sectionFeatures="nm,nm",
            structureTypes="section",
            structureFeatures="nm",
        ),
    ),
    silent=True,
)
True
TF = Fabric(locations=f"{MODIFIED}.{test}", silent=True)
api = TF.loadAll(silent=True)
docs = api.makeAvailableIn(globals())
We expect a severely reduced inventory of node types:
C.levels.data
(('section', 66.0, 100, 102), ('rule', 12.733333333333333, 103, 117), ('word', 1, 1, 99))
Fall()
['author', 'gap', 'letters', 'nm', 'otype', 'punc', 'terminator', 'type']
We delete the `line` and `sentence` types.
test = "delete.t"
output = f"{MODIFIED}.{test}"
!rm -rf {output}
modify(
    location,
    output,
    deleteTypes="sentence line",
    silent=True,
)
| Missing for text API: types: line, sentence
False
But, again, we cannot do that, because these types are important for the text API.

This time, we change the text API so that it does not need them anymore.
test = "delete.t"
output = f"{MODIFIED}.{test}"
modify(
    location,
    output,
    deleteTypes="sentence line",
    featureMeta=dict(
        otext=dict(
            sectionTypes="book,chapter",
            sectionFeatures="title,number",
            structureTypes="book,chapter",
            structureFeatures="title,number",
        ),
    ),
    silent=True,
)
True
TF = Fabric(locations=f"{MODIFIED}.{test}", silent=True)
api = TF.loadAll(silent=True)
docs = api.makeAvailableIn(globals())
C.levels.data
(('book', 99.0, 100, 100), ('chapter', 49.5, 101, 102), ('word', 1, 1, 99))
As expected.
Adding types involves a lot of data, because we do not only add nodes, but also features about those nodes.

The idea is this: suppose that somewhere in another dataset, you have found lexeme nodes for the words in your data set. Those lexeme nodes may range from 100,000 to 110,000, say, and you find a way to map them to your words, by means of a map `nodeSlots`. Then you can just grab the lexeme features as they are, and pack them into the `addTypes` argument, together with `nodeSlots` and the node boundaries (100,000 and 110,000).

The new feature data cannot say anything about nodes in the input data set, because the new nodes will be shifted so that they are past the `maxNode` of your input data set. And if your feature data accidentally addresses nodes outside the declared range, those assignments will be ignored.
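The shifting and the range check can be sketched like this (a sketch with hypothetical numbers; as it happens, the input dataset in the example below has `maxNode` 117, and the `bis` nodes indeed end up at 118-122):

```python
# Sketch of how addTypes renumbers new nodes past maxNode of the input
maxNode = 117                 # maxNode of the input dataset
nodeFrom, nodeTo = 1, 5       # declared node boundaries of the new type
offset = maxNode + 1 - nodeFrom

nameData = {1: "b1", 2: "b2", 7: "stray"}   # 7 is out of range: ignored
shifted = {
    n + offset: v
    for (n, v) in nameData.items()
    if nodeFrom <= n <= nodeTo
}
```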
So, all in all, it is a rather clean addition of material. Maybe a bit too clean, because it is also impossible to add edge features that link the new nodes to the old nodes. But then, it would be devilishly hard to make sure that, after the necessary remapping of the edge features, they address the intended nodes.

If you do want edge features between old and new nodes, it is better to compute them in the new dataset and add them as an individual feature, or by another call to `modify()`.
Let's have a look at an example where we add a type `bis`, consisting of a few bigrams, and a type `tris`, consisting of a bunch of trigrams. We just furnish a slot mapping for those nodes, and give them a `name` feature.
test = "add.t"
output = f"{MODIFIED}.{test}"
!rm -rf {output}
modify(
    location,
    output,
    addTypes=dict(
        bis=dict(
            nodeFrom=1,
            nodeTo=5,
            nodeSlots={
                1: {10, 11},
                2: {20, 21},
                3: {30, 31},
                4: {40, 41},
                5: {50, 51},
            },
            nodeFeatures=dict(
                name={
                    1: "b1",
                    2: "b2",
                    3: "b3",
                    4: "b4",
                    5: "b5",
                },
            ),
            edgeFeatures=dict(
                link={
                    1: {2: 100, 3: 50, 4: 25},
                    2: {3: 50, 4: 25, 5: 12},
                    3: {4: 25, 5: 12},
                    4: {5: 12, 1: 6},
                    5: {1: 6, 2: 3, 4: 1},
                },
            ),
        ),
        tris=dict(
            nodeFrom=1,
            nodeTo=4,
            nodeSlots={
                1: {60, 61, 62},
                2: {70, 71, 72},
                3: {80, 81, 82},
                4: {90, 91, 94},
            },
            nodeFeatures=dict(
                name={
                    1: "tr1",
                    2: "tr2",
                    3: "tr3",
                    4: "tr4",
                },
            ),
            edgeFeatures=dict(
                sim={
                    1: {2, 3, 4},
                    2: {3, 4},
                    3: {4},
                    4: {5, 1},
                },
            ),
        ),
    ),
    silent=True,
)
True
TF = Fabric(locations=f"{MODIFIED}.{test}", silent=True)
api = TF.loadAll(silent=True)
docs = api.makeAvailableIn(globals())
C.levels.data
(('book', 99.0, 100, 100), ('chapter', 49.5, 101, 102), ('sentence', 33.0, 115, 117), ('line', 7.666666666666667, 103, 114), ('tris', 3.0, 123, 126), ('bis', 2.0, 118, 122), ('word', 1, 1, 99))
There are the `bis` and `tris`!
Fall()
['author', 'gap', 'letters', 'name', 'number', 'otype', 'punc', 'terminator', 'title']
And there is the new feature `name`:
sorted(F.name.items())
[(118, 'b1'), (119, 'b2'), (120, 'b3'), (121, 'b4'), (122, 'b5'), (123, 'tr1'), (124, 'tr2'), (125, 'tr3'), (126, 'tr4')]
Eall()
['link', 'oslots', 'sim']
And the new edge features `link` and `sim`:
sorted(E.link.items())
[(118, {121: '25', 120: '50', 119: '100'}), (119, {122: '12', 121: '25', 120: '50'}), (120, {122: '12', 121: '25'}), (121, {118: '6', 122: '12'}), (122, {121: '1', 119: '3', 118: '6'})]
sorted(E.sim.items())
[(123, frozenset({124, 125, 126})), (124, frozenset({125, 126})), (125, frozenset({126})), (126, frozenset({123}))]
And that is all for now.
Incredible that you made it till here!