

Compose

This is about combining multiple TF datasets into one, and then tweaking it further.

In the previous chapters of this tutorial you have learned how to add new features to an existing dataset.

Here you learn how you can combine dozens of slightly heterogeneous TF data sets, and apply structural tweaks to the node types and features later on.

The incentive to write these composition functions into Text-Fabric came from Ernst Boogert, while he was converting between 100 and 200 works by the Church Fathers (Patristics). The conversion did a very good job of getting all the information from TEI files with different structures into TF, one dataset per work.

Then the challenge became to combine them into one big dataset, and to merge several node types into one type, and several features into one.

See patristics.

In [1]:
%load_ext autoreload
%autoreload 2

The new functions are combine() and modify().

In [2]:
from tf.fabric import Fabric
from tf.compose import combine, modify

Corpus

We use two copies of our example corpus Banks, present in this repository.

Combine

The combine function takes any number of directory locations, and considers each location to be the host of a TF data set.

You can pass this list straight to the combine() function as the locations parameter, or you can add names to the individual corpora. In that case, you pass an iterable of (name, location) pairs into the locations parameter.

Here we give the first copy the name banks, and the second copy the name rivers.

We also specify the output location.

In [3]:
PREFIX = 'combine/input'
SUFFIX = 'tf/0.2'

locations = (
    ('banks', f'{PREFIX}/banks1/{SUFFIX}'),
    ('rivers', f'{PREFIX}/banks2/{SUFFIX}'),
)

COMBINED = 'combine/_temp/riverbanks'

We are going to call the combine() function.

But first we clear the output location.

Note how you can mix a bash-shell command with your Python code.

In [4]:
output = COMBINED

!rm -rf {output}

combine(
  locations,
  output,
  componentType='volume',
  componentFeature='title',
  featureMeta=dict(
    otext=dict(
      sectionTypes='volume,chapter,line',
      sectionFeatures='title,number,number',
      **{'fmt:text-orig-full': '{letters} '},
    ),
  ),
)
  0.00s inspect metadata ...
This is Text-Fabric 7.8.0
Api reference : https://annotation.github.io/text-fabric/Api/Fabric/

10 features found and 0 ignored
This is Text-Fabric 7.8.0
Api reference : https://annotation.github.io/text-fabric/Api/Fabric/

10 features found and 0 ignored
WARNING: otext.structureFeatures metadata varies across sources
WARNING: otext.structureTypes metadata varies across sources
WARNING: author.compiler metadata varies across sources
WARNING: author.purpose metadata varies across sources
WARNING: letters.description metadata varies across sources
  0.05s determine nodetypes ...
  1 0.2)This is Text-Fabric 7.8.0
Api reference : https://annotation.github.io/text-fabric/Api/Fabric/

10 features found and 0 ignored
  0.00s loading features ...
  0.02s All features loaded/computed - for details use loadLog()
  2 0.2)This is Text-Fabric 7.8.0
Api reference : https://annotation.github.io/text-fabric/Api/Fabric/

10 features found and 0 ignored
  0.00s loading features ...
  0.02s All features loaded/computed - for details use loadLog()
   |     0.05s done
   |     0.05s compute offsets ...
   |     0.05s remap features ...
  1 0.2)This is Text-Fabric 7.8.0
Api reference : https://annotation.github.io/text-fabric/Api/Fabric/

10 features found and 0 ignored
  0.00s loading features ...
  0.01s All features loaded/computed - for details use loadLog()
  0.00s loading features ...
  0.01s All additional features loaded - for details use loadLog()
  2 0.2)This is Text-Fabric 7.8.0
Api reference : https://annotation.github.io/text-fabric/Api/Fabric/

10 features found and 0 ignored
  0.00s loading features ...
  0.01s All features loaded/computed - for details use loadLog()
  0.00s loading features ...
  0.00s All additional features loaded - for details use loadLog()
   |     0.04s write TF data ...
   |     0.05s done
Out[4]:
True

This function is a bit verbose in its output, but a lot happens under the hood, and if your dataset is large, it may take several minutes. It is pleasant to see the progress under those circumstances.

But for now, we pass silent=True, to make everything a bit more quiet.

In [5]:
output = COMBINED

!rm -rf {output}

combine(
  locations,
  output,
  componentType='volume',
  componentFeature='title',
  featureMeta=dict(
    otext=dict(
      sectionTypes='volume,chapter,line',
      sectionFeatures='title,number,number',
      **{'fmt:text-orig-full': '{letters} '},
    ),
  ),
  silent=True,
)
WARNING: otext.structureFeatures metadata varies across sources
WARNING: otext.structureTypes metadata varies across sources
WARNING: author.compiler metadata varies across sources
WARNING: author.purpose metadata varies across sources
WARNING: letters.description metadata varies across sources
Out[5]:
True

There you are, on your file system you see the combined dataset:

In [6]:
!ls  -l {output}
total 80
-rw-r--r--  1 dirk  staff   638 Jun  5 09:05 author.tf
-rw-r--r--  1 dirk  staff   597 Jun  5 09:05 gap.tf
-rw-r--r--  1 dirk  staff  1692 Jun  5 09:05 letters.tf
-rw-r--r--  1 dirk  staff   621 Jun  5 09:05 number.tf
-rw-r--r--  1 dirk  staff   754 Jun  5 09:05 oslots.tf
-rw-r--r--  1 dirk  staff  1135 Jun  5 09:05 otext.tf
-rw-r--r--  1 dirk  staff   558 Jun  5 09:05 otype.tf
-rw-r--r--  1 dirk  staff   713 Jun  5 09:05 punc.tf
-rw-r--r--  1 dirk  staff   567 Jun  5 09:05 terminator.tf
-rw-r--r--  1 dirk  staff   636 Jun  5 09:05 title.tf

If we compare that with one of the inputs:

In [7]:
!ls -l {PREFIX}/banks1/{SUFFIX}
total 80
-rw-r--r--  1 dirk  staff  359 May 20 21:12 author.tf
-rw-r--r--  1 dirk  staff  409 May 20 21:12 gap.tf
-rw-r--r--  1 dirk  staff  911 May 20 21:12 letters.tf
-rw-r--r--  1 dirk  staff  421 May 20 21:12 number.tf
-rw-r--r--  1 dirk  staff  419 May 20 21:12 oslots.tf
-rw-r--r--  1 dirk  staff  572 May 20 21:12 otext.tf
-rw-r--r--  1 dirk  staff  372 May 30 22:43 otype.tf
-rw-r--r--  1 dirk  staff  457 May 20 21:12 punc.tf
-rw-r--r--  1 dirk  staff  377 May 20 21:12 terminator.tf
-rw-r--r--  1 dirk  staff  361 May 20 21:12 title.tf

then we see the same set of features, but with smaller file sizes: each combined file contains the data of both components.

Result

Let's have a look inside. Note that we use the new TF function loadAll(), which loads all loadable features.

In [12]:
TF = Fabric(locations=COMBINED)
api = TF.loadAll(silent=False)
docs = api.makeAvailableIn(globals())
This is Text-Fabric 7.8.0
Api reference : https://annotation.github.io/text-fabric/Api/Fabric/

10 features found and 0 ignored
  0.00s loading features ...
   |     0.00s No structure info in otext, the structure part of the T-API cannot be used
  0.02s All features loaded/computed - for details use loadLog()
  0.00s loading features ...
  0.00s All additional features loaded - for details use loadLog()

First we inspect exactly what has been loaded:

In [13]:
loadLog()
   |     0.00s M otext                from combine/_temp/riverbanks
   |     0.00s B otype                from combine/_temp/riverbanks
   |     0.00s B oslots               from combine/_temp/riverbanks
   |     0.00s M otext                from combine/_temp/riverbanks
   |     0.00s B title                from combine/_temp/riverbanks
   |     0.00s B gap                  from combine/_temp/riverbanks
   |     0.00s B terminator           from combine/_temp/riverbanks
   |     0.00s B punc                 from combine/_temp/riverbanks
   |     0.00s B letters              from combine/_temp/riverbanks
   |     0.00s B number               from combine/_temp/riverbanks
   |     0.00s B __levels__           from otype, oslots, otext
   |     0.00s B __order__            from otype, oslots, __levels__
   |     0.00s B __rank__             from otype, __order__
   |     0.00s B __levUp__            from otype, oslots, __rank__
   |     0.00s B __levDown__          from otype, __levUp__, __rank__
   |     0.00s B __boundary__         from otype, oslots, __rank__
   |     0.00s B __sections__         from otype, oslots, otext, __levUp__, __levels__, title, number, number
   |     0.00s = otype                from combine/_temp/riverbanks
   |     0.00s = oslots               from combine/_temp/riverbanks
   |     0.00s M otext                from combine/_temp/riverbanks
   |     0.00s = title                from combine/_temp/riverbanks
   |     0.00s = gap                  from combine/_temp/riverbanks
   |     0.00s = terminator           from combine/_temp/riverbanks
   |     0.00s = punc                 from combine/_temp/riverbanks
   |     0.00s = letters              from combine/_temp/riverbanks
   |     0.00s = number               from combine/_temp/riverbanks
   |     0.00s = __levels__           from otype, oslots, otext
   |     0.00s = __order__            from otype, oslots, __levels__
   |     0.00s = __rank__             from otype, __order__
   |     0.00s = __levUp__            from otype, oslots, __rank__
   |     0.00s = __levDown__          from otype, __levUp__, __rank__
   |     0.00s = __boundary__         from otype, oslots, __rank__
   |     0.00s = __sections__         from otype, oslots, otext, __levUp__, __levels__, title, number, number
   |     0.00s B author               from combine/_temp/riverbanks
   |     0.00s = gap                  from combine/_temp/riverbanks
   |     0.00s = letters              from combine/_temp/riverbanks
   |     0.00s = number               from combine/_temp/riverbanks
   |     0.00s = oslots               from combine/_temp/riverbanks
   |     0.00s = otype                from combine/_temp/riverbanks
   |     0.00s = punc                 from combine/_temp/riverbanks
   |     0.00s = terminator           from combine/_temp/riverbanks
   |     0.00s = title                from combine/_temp/riverbanks
We look up the section of the first word:

In [14]:
T.sectionFromNode(1)
Out[14]:
('banks', 1, 1)

The component sets had 99 words each. So what is the section of word 100?

In [15]:
T.sectionFromNode(100)
Out[15]:
('rivers', 1, 1)

Right, that's the first word of the second component.
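The slot renumbering behind this can be illustrated with plain Python (an illustration only, not TF internals):

```python
# component 1 occupies slots 1..99; component 2's slots are shifted past them
N_WORDS_FIRST = 99

def combined_slot(component, local_slot):
    # hypothetical helper: map a (component, local slot) pair
    # to the slot number in the combined dataset
    return local_slot if component == 1 else N_WORDS_FIRST + local_slot

print(combined_slot(1, 1))   # 1   (first word of 'banks')
print(combined_slot(2, 1))   # 100 (first word of 'rivers')
```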

Here is an overview of all the node types in the combined set.

The second field is the average length in words for nodes of that type, the remaining fields give the first and last node of that type.

In [16]:
C.levels.data
Out[16]:
(('book', 99.0, 199, 200),
 ('volume', 99.0, 235, 236),
 ('chapter', 49.5, 201, 204),
 ('sentence', 33.0, 229, 234),
 ('line', 7.666666666666667, 205, 228),
 ('word', 1, 1, 198))

The combined data set consists of the concatenation of all slot nodes of the component data sets.

Note that each individual component has got a top node, of type volume. This is the effect of specifying componentType='volume'.

There is also a feature on the volume nodes, named title, that contains their name (or, if we had not passed names in the locations parameter, their location). This is the effect of componentFeature='title'.

Let's check.

We use the new .items() method on features.

In [17]:
F.title.items()
Out[17]:
dict_items([(199, 'Consider Phlebas'), (200, 'Consider Phlebas'), (235, 'banks'), (236, 'rivers')])

We see several things:

  • the volume nodes indeed got the component name in the feature title
  • the other nodes that already had a title, the book nodes, still have the same value for title as before.

The merging principle

This is a general principle that we see over and over again: when we combine data, we merge as much as possible.

That means that when you create new features, you may use the names of old features, and the new information for that feature will be merged with the old information of that feature.
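With plain dicts (an illustration, not TF internals), the principle looks like this: the component titles assigned by combine() above simply merge into the already existing title feature:

```python
# nodes 199, 200: book nodes that already had a title;
# nodes 235, 236: the new volume nodes, named via componentFeature='title'
existing_title = {199: 'Consider Phlebas', 200: 'Consider Phlebas'}
component_title = {235: 'banks', 236: 'rivers'}

# new information is merged with the old information of the same feature
merged = {**existing_title, **component_title}
print(merged[199])  # Consider Phlebas
print(merged[236])  # rivers
```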

Modify

Although combining has its complications, the most complex operation is modify() because it can do many things.

It operates on a single TF dataset, and it produces a modified dataset as a fresh "copy".

Despite the name, no actual modification takes place on the input dataset.

In [18]:
location = f'{PREFIX}/banks1/{SUFFIX}'

MODIFIED = '_temp/mudbanks'

Now we take the first local copy of the Banks dataset as our input, for a lot of different operations.

Here is a list of what modify() can do. The order is important, because the operations are executed in exactly this order:

  1. merge features: several input features are combined into a single output feature and then deleted;
  2. delete features: several features are deleted;
  3. add features: several node/edge features, with their data, are added to the dataset;
  4. merge types: several input node types are combined into a single output node type; the input node types are deleted, but not their nodes: they are now part of the output node type;
  5. delete types: several node types are deleted, together with their nodes, and all features will be remapped to accommodate for this;
  6. add types: several new node types, with additional feature data for them, are added after the last node; features do not have to be remapped for this; the new node types may be arbitrary intervals of integers and have no relationship with the existing nodes;
  7. modify metadata: the metadata of all features can be tweaked, including everything that is in the otext feature, such as text formats and section structure definitions.
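Schematically, all of these operations can appear together in a single call (the parameter names are the ones used later in this tutorial; the ... placeholders stand for the data structures shown in the examples below):

```
modify(
  location,
  output,
  mergeFeatures=...,   # 1. merge features
  deleteFeatures=...,  # 2. delete features
  addFeatures=...,     # 3. add features
  mergeTypes=...,      # 4. merge node types
  deleteTypes=...,     # 5. delete node types
  addTypes=...,        # 6. add node types
  featureMeta=...,     # 7. modify metadata
  silent=True,
)
```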

Modify will perform as many sanity checks as possible before it starts working, so that the chances are good that the modified dataset will load properly. It will adapt the value type of features to the values encountered, and it will deduce whether edges have values or not.

If a modified dataset does not load, while the original dataset did load, it is a bug, and I welcome a GitHub issue for it.

Only meta data

We start with the last operation, which is the simplest one.

In [19]:
otext = dict(
    sectionTypes='book,chapter',
    sectionFeatures='title,number',
    **{'fmt:text-orig-full': '{letters} '},
)

We use silent=True from now on, but if you work with larger datasets, it is recommended to set silent=False or to leave it out altogether.

In [20]:
test = 'meta'
output = f'{MODIFIED}.{test}'

!rm -rf {output}

modify(
  location,
  output,
  featureMeta=dict(otext=otext),
  silent=True,
)
Out[20]:
True

Result

In [21]:
TF = Fabric(locations=f'{MODIFIED}.{test}', silent=True)
api = TF.loadAll(silent=True)
docs = api.makeAvailableIn(globals())

We have now only 2 section levels. If we ask for some sections, we see that we only get 2 components in the tuple.

In [22]:
T.sectionFromNode(1)
Out[22]:
('Consider Phlebas', 1)
In [23]:
T.sectionFromNode(99)
Out[23]:
('Consider Phlebas', 2)

Merge features

We are going to do some tricky mergers on features that are involved in the section structure and the text formats, so we take care to modify those by means of the featureMeta parameter.

In [24]:
otext = dict(
    sectionTypes='book,chapter',
    sectionFeatures='heading,heading',
    structureTypes='book,chapter',
    structureFeatures='heading,heading',
    **{
      'fmt:text-orig-full': '{content} ',
      'fmt:text-orig-fake': '{fake} ',
      'fmt:line-default': '{content:XXX}{terminator} ',
    },
)

We want sectional headings in one feature, heading, instead of in title for books and number for chapters.

We also make a content feature, merged from punc and letters in that order, so that it prefers letters over punc: the punctuation disappears from the text.

And we make the opposite: fake: it prefers punc over letters.

Note that punc and letters will be deleted after the merge as a whole is completed, so that it is indeed possible for features to be the input of multiple mergers.
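Note the order inside a merge list: judging by the text outputs below, the feature listed last wins where both have a value. With plain dicts (an illustration of that last-wins behaviour, not TF's actual code):

```python
# word 3 has both letters and trailing punctuation
letters = {1: 'Everything', 2: 'about', 3: 'us'}
punc = {3: ','}

content = {**punc, **letters}   # 'punc letters': letters comes last and wins
fake = {**letters, **punc}      # 'letters punc': punc comes last and wins

print(content[3])  # us
print(fake[3])     # ,
```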

In [25]:
test = 'merge.f'
output = f'{MODIFIED}.{test}'

!rm -rf {output}

modify(
  location,
  output,
  mergeFeatures=dict(
    heading=('title number'),
    content=('punc letters'),
    fake=('letters punc')
  ),
  featureMeta=dict(
    otext=otext,
  ),
  silent=True,
)
Out[25]:
True

Result

In [26]:
TF = Fabric(locations=f'{MODIFIED}.{test}', silent=True)
api = TF.loadAll(silent=True)
docs = api.makeAvailableIn(globals())

We inspect the new heading feature for a book and a chapter.

In [27]:
b = F.otype.s('book')[0]
F.heading.v(b)
Out[27]:
'Consider Phlebas'
In [28]:
c = F.otype.s('chapter')[0]
F.heading.v(c)
Out[28]:
'1'

And here is an overview of all node features: title and number are gone, together with punc and letters.

In [29]:
Fall()
Out[29]:
['author', 'content', 'fake', 'gap', 'heading', 'otype', 'terminator']

We have modified the standard text format, text-orig-full. It now uses the content feature, and indeed, we do not see punctuation anymore.

In [30]:
T.text(range(1,10))
Out[30]:
'Everything about us everything around us everything we know '

On the other hand, text-orig-fake uses the fake feature, and we see that the words in front of punctuation have disappeared.

In [31]:
T.text(range(1,10), fmt='text-orig-fake')
Out[31]:
'Everything about , everything around , everything we know '

Delete features

We just remove two features from the dataset: author and terminator.

In [32]:
test = 'delete.f'
output = f'{MODIFIED}.{test}'

!rm -rf {output}

modify(
  location,
  output,
  deleteFeatures='author terminator',
  silent=True,
)
   |   Missing for text API: features: terminator
Out[32]:
False

Oops. terminator is used in a text-format, so if we delete it, the dataset will not load properly.

Let's not delete terminator but gap.

In [33]:
test = 'delete.f'
output = f'{MODIFIED}.{test}'

modify(
  location,
  output,
  deleteFeatures='author gap',
  silent=True,
)
Out[33]:
True

Result

In [34]:
TF = Fabric(locations=f'{MODIFIED}.{test}', silent=True)
api = TF.loadAll(silent=True)
docs = api.makeAvailableIn(globals())
In [35]:
Fall()
Out[35]:
['letters', 'number', 'otype', 'punc', 'terminator', 'title']

Indeed, gap is gone.

In [36]:
F.gap.freqList()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-36-3d34f1d72b02> in <module>
----> 1 F.gap.freqList()

AttributeError: 'NodeFeatures' object has no attribute 'gap'

I told you! Sigh ...

Add features

We add a bunch of node features and edge features.

When you add features, you also have to pass their data. Here we compute that data in place, which results in a lengthy call; usually you will have collected that data in a dictionary beforehand, and then you only pass the dictionary.

We do not have to explicitly state the value types of the new features: modify() will deduce them. But we can override that by passing a value type explicitly.

Let's declare lemma to be str, and big to be int:
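The deduction rule can be sketched roughly as follows (a hypothetical helper for illustration, not modify()'s actual code):

```python
def deduce_value_type(values):
    # a feature counts as int only when every value is an int;
    # otherwise all values will be written as strings
    return 'int' if all(isinstance(v, int) for v in values) else 'str'

print(deduce_value_type([1001, 1002, 1003]))  # int  (like lemma below)
print(deduce_value_type(['B', 'C', 'D']))     # str  (like big below)
```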

In [37]:
test = 'add.f'
output = f'{MODIFIED}.{test}'

!rm -rf {output}

modify(
  location,
  output,
  addFeatures=dict(
    nodeFeatures=dict(
      author={101: 'Banks Jr.', 102: 'Banks Sr.'},
      lemma={n: 1000 + n for n in range(1, 10)},
      small={n: chr(ord('a') + n % 26) for n in range(1, 10)},
      big={n: chr(ord('A') + n % 26) for n in range(1, 10)},
    ),
    edgeFeatures=dict(
      link={n: {n + i for i in range(1,3)} for n in range(1,10)},
      similarity={n: {n + i: chr(ord('a') + (i + n) % 26) for i in range(1,3)} for n in range(1,10)},
    ),
  ),
  featureMeta=dict(
    lemma=dict(
      valueType='str',
    ),
    big=dict(
      valueType='int',
    ),
  ),
  silent=True,
)
   |   Add features: big: feature values are declared to be int but some values are not int
Out[37]:
False

We get away with declaring lemma as str, because every value can be written as a string. But not all values of big are integers, hence the complaint.

Let's stick to the default:

In [38]:
test = 'add.f'
output = f'{MODIFIED}.{test}'

!rm -rf {output}

modify(
  location,
  output,
  addFeatures=dict(
    nodeFeatures=dict(
      author={101: 'Banks Jr.', 102: 'Banks Sr.'},
      lemma={n: 1000 + n for n in range(1, 10)},
      small={n: chr(ord('a') + n % 26) for n in range(1, 10)},
      big={n: chr(ord('A') + n % 26) for n in range(1, 10)},
    ),
    edgeFeatures=dict(
      link={n: {n + i for i in range(1,3)} for n in range(1,10)},
      similarity={n: {n + i: chr(ord('a') + (i + n) % 26) for i in range(1,3)} for n in range(1,10)},
    ),
  ),
  silent=True,
)
Out[38]:
True

Result

In [40]:
TF = Fabric(locations=f'{MODIFIED}.{test}', silent=True)
api = TF.loadAll(silent=True)
docs = api.makeAvailableIn(globals())
In [41]:
Fall()
Out[41]:
['author',
 'big',
 'gap',
 'lemma',
 'letters',
 'number',
 'otype',
 'punc',
 'small',
 'terminator',
 'title']
In [42]:
Eall()
Out[42]:
['link', 'oslots', 'similarity']

We see the extra features, and let's just enumerate their mappings.

link is an edge feature where edges do not have values. So for each n, the result is a set of nodes.

In [43]:
E.link.items()
Out[43]:
dict_items([(1, frozenset({2, 3})), (2, frozenset({3, 4})), (3, frozenset({4, 5})), (4, frozenset({5, 6})), (5, frozenset({6, 7})), (6, frozenset({8, 7})), (7, frozenset({8, 9})), (8, frozenset({9, 10})), (9, frozenset({10, 11}))])

similarity assigns values to the edges. So for each n, the result is a mapping from nodes to values.

In [44]:
E.similarity.items()
Out[44]:
dict_items([(1, {2: 'c', 3: 'd'}), (2, {3: 'd', 4: 'e'}), (3, {4: 'e', 5: 'f'}), (4, {5: 'f', 6: 'g'}), (5, {6: 'g', 7: 'h'}), (6, {7: 'h', 8: 'i'}), (7, {8: 'i', 9: 'j'}), (8, {9: 'j', 10: 'k'}), (9, {10: 'k', 11: 'l'})])
In [45]:
E.similarity.f(1)
Out[45]:
((2, 'c'), (3, 'd'))

Now the node features.

In [46]:
F.author.items()
Out[46]:
dict_items([(100, 'Iain M. Banks'), (101, 'Banks Jr.'), (102, 'Banks Sr.')])
In [47]:
F.small.items()
Out[47]:
dict_items([(1, 'b'), (2, 'c'), (3, 'd'), (4, 'e'), (5, 'f'), (6, 'g'), (7, 'h'), (8, 'i'), (9, 'j')])
In [48]:
F.big.items()
Out[48]:
dict_items([(1, 'B'), (2, 'C'), (3, 'D'), (4, 'E'), (5, 'F'), (6, 'G'), (7, 'H'), (8, 'I'), (9, 'J')])
In [49]:
F.lemma.items()
Out[49]:
dict_items([(1, 1001), (2, 1002), (3, 1003), (4, 1004), (5, 1005), (6, 1006), (7, 1007), (8, 1008), (9, 1009)])

Merge types

Manipulating features is relatively easy. But when we fiddle with the node types, we need our wits about us.

In this example, we first do a feature merge of title and number into nm.

Then we merge the line and sentence types into a new type rule.

And book and chapter will merge into section.

We adapt our section structure so that it makes use of the new features and types.

In [50]:
test = 'merge.t'
output = f'{MODIFIED}.{test}'

!rm -rf {output}

modify(
  location,
  output,
  mergeFeatures=dict(
    nm='title number'
  ),
  mergeTypes=dict(
    rule=dict(
      line=dict(
        type='line',
      ),
      sentence=dict(
        type='sentence',
      ),
    ),
    section=dict(
      book=dict(
        type='book',
      ),
      chapter=dict(
        type='chapter',
      ),
    ),
  ),
  featureMeta=dict(
    otext=dict(
      sectionTypes='section,rule',
      sectionFeatures='nm,nm',
      structureTypes='section',
      structureFeatures='nm',
    ),
  ),
  silent=True,
)
Out[50]:
True

Result

In [51]:
TF = Fabric(locations=f'{MODIFIED}.{test}', silent=True)
api = TF.loadAll(silent=True)
docs = api.makeAvailableIn(globals())

We expect a severely reduced inventory of node types:

In [52]:
C.levels.data
Out[52]:
(('section', 66.0, 100, 102),
 ('rule', 12.733333333333333, 103, 117),
 ('word', 1, 1, 99))
In [53]:
Fall()
Out[53]:
['author', 'gap', 'letters', 'nm', 'otype', 'punc', 'terminator', 'type']

Delete types

We delete the line and sentence types.

In [54]:
test = 'delete.t'
output = f'{MODIFIED}.{test}'

!rm -rf {output}

modify(
  location,
  output,
  deleteTypes='sentence line',
  silent=True,
)
   |   Missing for text API: types: line, sentence
Out[54]:
False

But, again, we can't do that because they are important for the text API.

This time, we change the text API, so that it does not need them anymore.

In [55]:
test = 'delete.t'
output = f'{MODIFIED}.{test}'

modify(
  location,
  output,
  deleteTypes='sentence line',
  featureMeta=dict(
    otext=dict(
      sectionTypes='book,chapter',
      sectionFeatures='title,number',
      structureTypes='book,chapter',
      structureFeatures='title,number',
    ),
  ),
  silent=True,
)
Out[55]:
True

Result

In [56]:
TF = Fabric(locations=f'{MODIFIED}.{test}', silent=True)
api = TF.loadAll(silent=True)
docs = api.makeAvailableIn(globals())
In [57]:
C.levels.data
Out[57]:
(('book', 99.0, 100, 100), ('chapter', 49.5, 101, 102), ('word', 1, 1, 99))

As expected.

Add types

Adding types involves a lot of data, because we not only add nodes, but also features about those nodes.

The idea is this:

Suppose that somewhere in another dataset, you have found lexeme nodes for the words in your data set.

You just take those lexeme nodes, which may occupy a range from 100,000 to 110,000 say, and you find a way to map them to your words, by means of a map nodeSlots.

Then you can just grab those lexeme features as they are, and pack them into the addTypes argument, together with nodeSlots and the node boundaries (100,000 - 110,000).

The new feature data cannot say anything about nodes in the input dataset, because the new nodes will be shifted past the maxNode of your input dataset. And if your feature data accidentally addresses nodes outside the declared range, those assignments will be ignored.

So all in all, it is a rather clean addition of material.

Maybe a bit too clean, because it is also impossible to add edge features that link the new nodes to the old nodes. But then, it would be devilishly hard to make sure that after the necessary remapping of the edge features, they address the intended nodes.

If you do want edge features between old and new nodes, it is better to compute them in the new dataset and add them as an individual feature or by another call to modify().
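The shifting of new nodes can be sketched like this (a hypothetical helper for illustration, not modify()'s internals); with the input dataset's maxNode at 117, the five bis nodes of the example below end up at 118-122:

```python
def shift_new_features(node_from, node_to, feature_data, max_node):
    # new nodes land past maxNode of the input dataset;
    # assignments outside the declared node range are ignored
    offset = max_node - node_from + 1
    return {
        n + offset: value
        for n, value in feature_data.items()
        if node_from <= n <= node_to
    }

name = {1: 'b1', 2: 'b2', 6: 'out of range'}   # 6 falls outside 1..5
print(shift_new_features(1, 5, name, 117))
# {118: 'b1', 119: 'b2'}
```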

Let's have a look at an example where we add a type bis, consisting of a few bigrams, and a type tris, consisting of a bunch of trigrams.

We furnish a slot mapping for those nodes, give them a name feature, and add two edge features, link and sim.

In [58]:
test = 'add.t'
output = f'{MODIFIED}.{test}'

!rm -rf {output}

modify(
  location,
  output,
  addTypes=dict(
    bis=dict(
      nodeFrom=1,
      nodeTo=5,
      nodeSlots={
        1: {10, 11},
        2: {20, 21},
        3: {30, 31},
        4: {40, 41},
        5: {50, 51},
      },
      nodeFeatures=dict(
        name={
          1: 'b1',
          2: 'b2',
          3: 'b3',
          4: 'b4',
          5: 'b5',
        },
      ),
      edgeFeatures=dict(
        link={
          1: {2: 100, 3: 50, 4: 25},
          2: {3: 50, 4: 25, 5: 12},
          3: {4: 25, 5: 12},
          4: {5: 12, 1: 6},
          5: {1: 6, 2: 3, 4: 1},
        },
      ),
    ),
    tris=dict(
      nodeFrom=1,
      nodeTo=4,
      nodeSlots={
        1: {60, 61, 62},
        2: {70, 71, 72},
        3: {80, 81, 82},
        4: {90, 91, 94},
      },
      nodeFeatures=dict(
        name={
          1: 'tr1',
          2: 'tr2',
          3: 'tr3',
          4: 'tr4',
        },
      ),
      edgeFeatures=dict(
        sim={
          1: {2, 3, 4},
          2: {3, 4},
          3: {4},
          4: {5, 1},
        },
      ),
    ),
  ),
  silent=True,
)
Out[58]:
True

Result

In [59]:
TF = Fabric(locations=f'{MODIFIED}.{test}', silent=True)
api = TF.loadAll(silent=True)
docs = api.makeAvailableIn(globals())
In [60]:
C.levels.data
Out[60]:
(('book', 99.0, 100, 100),
 ('chapter', 49.5, 101, 102),
 ('sentence', 33.0, 115, 117),
 ('line', 7.666666666666667, 103, 114),
 ('tris', 3.0, 123, 126),
 ('bis', 2.0, 118, 122),
 ('word', 1, 1, 99))

There are the bis and tris!

In [61]:
Fall()
Out[61]:
['author',
 'gap',
 'letters',
 'name',
 'number',
 'otype',
 'punc',
 'terminator',
 'title']

And there is the new feature name:

In [62]:
sorted(F.name.items())
Out[62]:
[(118, 'b1'),
 (119, 'b2'),
 (120, 'b3'),
 (121, 'b4'),
 (122, 'b5'),
 (123, 'tr1'),
 (124, 'tr2'),
 (125, 'tr3'),
 (126, 'tr4')]
In [63]:
Eall()
Out[63]:
['link', 'oslots', 'sim']

And the new edge features link and sim:

In [64]:
sorted(E.link.items())
Out[64]:
[(118, {121: '25', 120: '50', 119: '100'}),
 (119, {122: '12', 121: '25', 120: '50'}),
 (120, {122: '12', 121: '25'}),
 (121, {118: '6', 122: '12'}),
 (122, {121: '1', 119: '3', 118: '6'})]
In [65]:
sorted(E.sim.items())
Out[65]:
[(123, frozenset({124, 125, 126})),
 (124, frozenset({125, 126})),
 (125, frozenset({126})),
 (126, frozenset({123}))]

And that is all for now.

Incredible that you made it till here!

