%load_ext autoreload
%autoreload 2
In this notebook we map the slot nodes of version 0.4 (the source version) to those of version 1.0 (the target version).
This means that we map every slot in the source version to the corresponding slot or slots in the target version. The target version has more slots, because footnotes occupy slots of their own there. In the source version, footnotes appear only inside feature values of the slot that precedes the footnote mark.
Some slots have an empty text (most of them contain some punctuation).
We do not want to be fussy about those slots. We map them onto corresponding empty slots if possible, and otherwise onto the nearest non-empty slot.
After establishing the slot mapping, we extend the mapping to all nodes in a generic way. The code for this is already in the TF library.
from tf.fabric import Fabric
from tf.dataset import Versions
from lib import TF_DIR
va = "0.4"
# vb = "0.9.1"
vb = "1.0"
We load the two versions of the TF data by means of the lower-level Fabric method, and we load only the features we need.
TF = {}
api = {}
E = {}
Es = {}
F = {}
Fs = {}
L = {}
T = {}
maxSlot = {}
features = {
    va: "trans",
    vb: "trans isnote",
}
for v in (va, vb):
    TF[v] = Fabric(locations=TF_DIR, modules=v)
    api[v] = TF[v].load(features[v])
This is Text-Fabric 9.4.3
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html
35 features found and 0 ignored
  3.83s All features loaded/computed - for details use TF.isLoaded()
This is Text-Fabric 9.4.3
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html
43 features found and 0 ignored
  4.78s All features loaded/computed - for details use TF.isLoaded()
We put the parts of the TF API that we need in the various dictionaries.
for v in (va, vb):
    E[v] = api[v].E
    Es[v] = api[v].Es
    F[v] = api[v].F
    Fs[v] = api[v].Fs
    L[v] = api[v].L
    T[v] = api[v].T
    maxSlot[v] = F[v].otype.maxSlot
We walk through the slots of the target version (1.0) and skip its footnote slots.
For each target slot we advance the slot position in the source version (0.4) and check whether the source and target slots have the same value for the trans feature.
If not, and one of them is empty, we skip the empty word and try the next one.
But if both are non-empty and unequal, we have a real problem: a mismatch.
However, in version 0.4 the separation of numbers and words is imperfect, so sometimes we have to split words.
If a real mismatch remains, we stop, and you have to inspect what is happening.
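The walk described above can be tried on toy data without loading Text-Fabric. The sketch below is a simplified illustration, not the notebook's own function: the name `align_skip_empty`, the plain lists, the 0-based positions, and the sample words are all hypothetical, and footnote slots are marked by a parallel list of booleans.

```python
def align_skip_empty(source, target, is_note):
    """Map 0-based source positions to target positions.

    source, target: lists of slot texts (hypothetical toy data).
    is_note: list of booleans parallel to target, True for footnote slots.
    Returns (mapping, mismatch); mismatch is None or the (i, j) pair
    at which two non-empty texts differ.
    """
    mapping = {}
    i = 0
    for j, text_b in enumerate(target):
        if is_note[j]:
            continue  # footnote slots have no counterpart in the source
        if i >= len(source):
            break  # the source is exhausted
        if text_b == "":
            if source[i] != "":
                continue  # empty target slot without an empty partner
        else:
            # skip empty source slots until we reach some text
            while i < len(source) - 1 and source[i] == "":
                i += 1
            if source[i] != text_b:
                return mapping, (i, j)  # a real mismatch: stop
        mapping[i] = j
        i += 1
    return mapping, None


source = ["a", "", "b", "c"]
target = ["a", "note", "b", "", "c"]
is_note = [False, True, False, False, False]
print(align_skip_empty(source, target, is_note))
```

In this toy run the footnote slot and the unmatched empty slots are left out of the mapping, and the walk stops only when two non-empty texts differ.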
def makeSlotMapOld():
    Fa = F[va]
    Fb = F[vb]
    transA = Fa.trans.v
    transB = Fb.trans.v
    isNote = Fb.isnote.v
    maxSlotA = maxSlot[va]
    maxSlotB = maxSlot[vb]
    print(
        f"""\
Computing slotMap between:
{va}: {maxSlotA:>8} slots,
{vb}: {maxSlotB:>8} slots.\
"""
    )
    slotMap = {}
    good = True
    wA = 1
    emptyA = 0
    emptyB = 0
    for wB in range(1, maxSlotB + 1):
        if isNote(wB):
            continue
        textA = transA(wA) or ""
        textB = transB(wB) or ""
        if textB == "":
            if textA != "":
                emptyB += 1
                continue
        else:
            while textA == "" and wA < maxSlotA:
                wA += 1
                emptyA += 1
                textA = transA(wA) or ""
            if textA != textB:
                print("Mismatch:")
                print(f"A: {wA:>8} = `{textA}`")
                print(f"B: {wB:>8} = `{textB}`")
                good = False
                break
        if wA <= maxSlotA:
            slotMap.setdefault(wA, {})[wB] = None
            wA += 1
        else:
            if textB:
                print(f"No more slots in {va} to match slot {wB} in {vb}")
            break
    maxSlotMap = max(slotMap)
    if maxSlotMap > maxSlotA:
        print(f"maxSlot in A version {va} exceeded")
        print(f"Found {maxSlotMap}, but it should be <= {maxSlot[va]}")
        good = False
    if good:
        print(
            f"""\
slotMap succesfully created: {len(slotMap)} slots mapped.
{va}: {emptyA:>6} empty slots,
{vb}: {emptyB:>6} empty slots.\
"""
        )
    return slotMap
def makeSlotMap():
    Fa = F[va]
    Fb = F[vb]
    transA = Fa.trans.v
    transB = Fb.trans.v
    isNote = Fb.isnote.v
    maxSlotA = maxSlot[va]
    maxSlotB = maxSlot[vb]
    print(
        f"""\
Computing slotMap between:
{va}: {maxSlotA:>8} slots,
{vb}: {maxSlotB:>8} slots.\
"""
    )
    slotMap = {}
    good = True
    wA = 1
    wB = 1
    while wB <= maxSlotB and wA <= maxSlotA:
        # footnote slots exist only in the target version: skip them
        if isNote(wB):
            wB += 1
            continue
        textA = transA(wA) or ""
        textB = transB(wB) or ""
        if textA == textB:
            slotMap.setdefault(wA, {})[wB] = None
            wA += 1
            wB += 1
        elif textA.startswith(textB):
            # B holds the first part of A (or B is empty): stay on A
            slotMap.setdefault(wA, {})[wB] = None
            wB += 1
        elif textA.endswith(textB):
            # B holds the last part of A: advance both, nothing is recorded here
            wA += 1
            wB += 1
        elif textB.startswith(textA):
            # A holds the first part of B (or A is empty): stay on B
            slotMap.setdefault(wA, {})[wB] = None
            wA += 1
        elif textB.endswith(textA):
            # A holds the last part of B: advance both
            slotMap.setdefault(wA, {})[wB] = None
            wA += 1
            wB += 1
        else:
            print("Mismatch:")
            print(f"A: {wA:>8} = `{textA}`")
            print(f"B: {wB:>8} = `{textB}`")
            good = False
            break
    maxSlotMap = max(slotMap)
    if maxSlotMap > maxSlotA:
        print(f"maxSlot in A version {va} exceeded")
        print(f"Found {maxSlotMap}, but it should be <= {maxSlot[va]}")
        good = False
    if good:
        print(
            f"""\
slotMap succesfully created: {len(slotMap)} slots mapped.
"""
        )
    return slotMap
slotMap = makeSlotMap()
Computing slotMap between:
0.4:  5030444 slots,
1.0:  5977367 slots.
slotMap succesfully created: 5030444 slots mapped.
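To see how the prefix and suffix comparisons in makeSlotMap behave when a word is split differently in the two versions, the same branch logic can be run on small lists. This is a sketch under assumptions: `align_with_splits`, the 0-based positions, the list-valued mapping (instead of the nested dicts used above), and the sample words are hypothetical.

```python
def align_with_splits(source, target, is_note):
    """Align two lists of slot texts; return (mapping, mismatch).

    mapping: dict from source position to a list of target positions.
    mismatch: None, or the (i, j) pair where the walk got stuck.
    """
    mapping = {}
    i = j = 0
    while i < len(source) and j < len(target):
        if is_note[j]:
            j += 1  # footnote slots exist only in the target
            continue
        a, b = source[i], target[j]
        if a == b:
            mapping.setdefault(i, []).append(j)
            i += 1
            j += 1
        elif a.startswith(b):
            # target holds the first piece of a split source word
            mapping.setdefault(i, []).append(j)
            j += 1
        elif a.endswith(b):
            # target holds the last piece; mirroring makeSlotMap,
            # nothing is recorded in this branch
            i += 1
            j += 1
        elif b.startswith(a):
            # source holds the first piece of a split target word
            mapping.setdefault(i, []).append(j)
            i += 1
        elif b.endswith(a):
            # source holds the last piece of a split target word
            mapping.setdefault(i, []).append(j)
            i += 1
            j += 1
        else:
            return mapping, (i, j)
    return mapping, None


# a number glued to a word in the source, split in the target
print(align_with_splits(["12stuks", "x"], ["12", "n", "stuks", "x"],
                        [False, True, False, False]))
# a word split in the source, joined in the target
print(align_with_splits(["folio", "s"], ["folios"], [False]))
```

Because an empty string is a prefix of every string, the same branches also absorb empty slots on either side without a separate case.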
When we encounter problems, we can do a bit of checking to see what is going on.
The next function shows the line around a slot node, and can do so in both versions.
def show(v, n):
    lines = L[v].u(n, otype="line")
    if not lines:
        lines = L[v].u(n + 1, otype="line")
    if not lines:
        lines = L[v].u(n - 1, otype="line")
    if not lines:
        print("no such line")
        return
    line = lines[0]
    print(T[v].sectionFromNode(line))
    words = L[v].d(line, otype="word")
    print(" ".join(f"[{w}={F[v].trans.v(w)}]" for w in words))
    print(T[v].text(line))
show(va, 49)
show(vb, 96)
(1, None, 2)
[45=961] [46=copie] [47=5] [48=folio] [49=s]
961, copie, 5 folio's.
(1, 3, 4)
[95=„Journaelsgewijse] [96=reisbeschrijving]
„Journaelsgewijse" reisbeschrijving »
We now extend the slotMap to a full node map.
See dataset.Versions in the Text-Fabric documentation.
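The general idea of extending a slot mapping to other nodes can be illustrated on toy data. The sketch below is not the Text-Fabric implementation: `extend_to_nodes`, the use of the symmetric difference as a dissimilarity measure, and the sample nodes are assumptions made for illustration only.

```python
def extend_to_nodes(slot_map, source_nodes, target_nodes):
    """Map source nodes to target nodes that share mapped slots.

    slot_map: dict from source slot to an iterable of target slots.
    source_nodes, target_nodes: dicts from node to its set of slots.
    Returns a dict: source node -> {target node: dissimilarity}, where the
    dissimilarity is the number of slots by which the mapped source slots
    and the target slots differ (an assumed measure).
    """
    # index the target nodes by the slots they contain
    by_slot = {}
    for node, slots in target_nodes.items():
        for slot in slots:
            by_slot.setdefault(slot, set()).add(node)

    node_map = {}
    for node, slots in source_nodes.items():
        # image of the source node's slots under the slot mapping
        mapped = set()
        for slot in slots:
            mapped.update(slot_map.get(slot, ()))
        # every target node that touches the image is a candidate
        candidates = set()
        for slot in mapped:
            candidates |= by_slot.get(slot, set())
        node_map[node] = {
            cand: len(mapped ^ target_nodes[cand]) for cand in sorted(candidates)
        }
    return node_map


# one source line whose target counterpart is interrupted by a footnote slot (3)
slot_map = {1: [1], 2: [2], 3: [4]}
source_nodes = {"lineA": {1, 2, 3}}
target_nodes = {"lineX": {1, 2}, "lineY": {3, 4}}
print(extend_to_nodes(slot_map, source_nodes, target_nodes))
```

A dissimilarity of 0 would indicate a perfect match; here the single source line ends up with two imperfect targets.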
V = Versions(api, va, vb, slotMap)
V.makeVersionMapping()
  0.00s Mapping volume nodes 0.4 ==> 1.0
   |   0.00s Extending slot mapping 0.4 ==> 1.0 for volume nodes
   |     10s Done
    10s Statistics for 0.4 ==> 1.0 (volume)
   |     10s TOTAL                      : 100.00%      13x
   |     10s unique, imperfect          : 100.00%      13x
    10s Mapping letter nodes 0.4 ==> 1.0
   |     10s Extending slot mapping 0.4 ==> 1.0 for letter nodes
   |     20s Done
    20s Statistics for 0.4 ==> 1.0 (letter)
   |     20s TOTAL                      : 100.00%     589x
   |     20s unique, perfect            :  21.05%     124x
   |     20s unique, imperfect          :  77.93%     459x
   |     20s multiple, non-perfect      :   1.02%       6x
    20s Mapping page nodes 0.4 ==> 1.0
   |     20s Extending slot mapping 0.4 ==> 1.0 for page nodes
   |     30s Done
    30s Statistics for 0.4 ==> 1.0 (page)
   |     30s TOTAL                      : 100.00%   10149x
   |     30s unique, perfect            :  47.53%    4824x
   |     30s unique, imperfect          :  51.63%    5240x
   |     30s multiple, non-perfect      :   0.84%      85x
    30s Mapping table nodes 0.4 ==> 1.0
   |     30s Extending slot mapping 0.4 ==> 1.0 for table nodes
   |     30s Done
    30s Statistics for 0.4 ==> 1.0 (table)
   |     30s TOTAL                      : 100.00%     322x
   |     30s unique, perfect            :  92.55%     298x
   |     30s unique, imperfect          :   7.14%      23x
   |     30s multiple, non-perfect      :   0.31%       1x
    30s Mapping para nodes 0.4 ==> 1.0
   |     30s Extending slot mapping 0.4 ==> 1.0 for para nodes
   |     37s Done
    37s Statistics for 0.4 ==> 1.0 (para)
   |     37s TOTAL                      : 100.00%   33885x
   |     37s unique, perfect            :  77.44%   26242x
   |     37s unique, imperfect          :  22.50%    7625x
   |     37s multiple, non-perfect      :   0.01%       5x
   |     37s not mapped                 :   0.04%      13x
    37s Mapping remark nodes 0.4 ==> 1.0
   |     37s Extending slot mapping 0.4 ==> 1.0 for remark nodes
   |     40s Done
    40s Statistics for 0.4 ==> 1.0 (remark)
   |     40s TOTAL                      : 100.00%   22922x
   |     40s unique, perfect            :  97.36%   22318x
   |     40s unique, imperfect          :   2.64%     604x
    40s Mapping head nodes 0.4 ==> 1.0
   |     40s Extending slot mapping 0.4 ==> 1.0 for head nodes
   |     40s Done
    40s Statistics for 0.4 ==> 1.0 (head)
   |     40s TOTAL                      : 100.00%     589x
   |     40s unique, perfect            :  91.17%     537x
   |     40s unique, imperfect          :   8.83%      52x
    40s Mapping line nodes 0.4 ==> 1.0
   |     40s Extending slot mapping 0.4 ==> 1.0 for line nodes
   |     51s Done
    51s Statistics for 0.4 ==> 1.0 (line)
   |     51s TOTAL                      : 100.00%  444978x
   |     51s unique, perfect            :  97.15%  432317x
   |     51s unique, imperfect          :   0.40%    1770x
   |     51s multiple, cleanly composed :   0.31%    1368x
   |     51s multiple, non-perfect      :   2.14%    9523x
    51s Mapping row nodes 0.4 ==> 1.0
   |     51s Extending slot mapping 0.4 ==> 1.0 for row nodes
   |     51s Done
    51s Statistics for 0.4 ==> 1.0 (row)
   |     51s TOTAL                      : 100.00%    4566x
   |     51s unique, perfect            :  98.90%    4516x
   |     51s unique, imperfect          :   1.10%      50x
    51s Mapping folio nodes 0.4 ==> 1.0
   |     51s Extending slot mapping 0.4 ==> 1.0 for folio nodes
   |     51s Done
    51s Statistics for 0.4 ==> 1.0 (folio)
   |     51s TOTAL                      : 100.00%    2555x
   |     51s unique, perfect            :  95.58%    2442x
   |     51s unique, imperfect          :   4.11%     105x
   |     51s not mapped                 :   0.31%       8x
    51s Mapping cell nodes 0.4 ==> 1.0
   |     51s Extending slot mapping 0.4 ==> 1.0 for cell nodes
   |     52s Done
    52s Statistics for 0.4 ==> 1.0 (cell)
   |     52s TOTAL                      : 100.00%   20593x
   |     52s unique, perfect            :  99.71%   20533x
   |     52s unique, imperfect          :   0.28%      58x
   |     52s multiple, cleanly composed :   0.00%       1x
   |     52s not mapped                 :   0.00%       1x
    52s Mapping subhead nodes 0.4 ==> 1.0
   |     52s Extending slot mapping 0.4 ==> 1.0 for subhead nodes
   |     52s Done
    52s Statistics for 0.4 ==> 1.0 (subhead)
   |     52s TOTAL                      : 100.00%    1360x
   |     52s unique, perfect            :  99.71%    1356x
   |     52s unique, imperfect          :   0.29%       4x
    52s Write edge as TF feature omap@0.4-1.0
  0.00s Exporting 0 node and 1 edge and 0 config features to ~/github/clariah/wp6-missieven/tf/1.0:
   |   8.50s T omap@0.4-1.0 to ~/github/clariah/wp6-missieven/tf/1.0
  8.50s Exported 0 node features and 1 edge features and 0 config features to ~/github/clariah/wp6-missieven/tf/1.0
The result is a new feature in the latest version of the dataset:
!ls -l ~/github/clariah/wp6-missieven/tf/{vb}/omap@*.tf
-rw-r--r-- 1 werk staff 43874572 May 6 15:48 /Users/werk/github/clariah/wp6-missieven/tf/1.0/omap@0.4-1.0.tf
The statistics show that many nodes of type line are mapped imperfectly. Maybe we can shed some light on those cases.
First we load the mapping as a TF edge feature:
We are interested in mappings between line nodes which are diagnosed as multiple, non-perfect. First we ask for the list of diagnostic labels:
V.legend()
b = unique, perfect
d = multiple, one perfect
c = unique, imperfect
f = multiple, cleanly composed
e = multiple, non-perfect
a = not mapped
We need to inspect line nodes in the source version (0.4) that have label e:
diags = V.getDiagnosis(node="line", label="e")
print(type(diags))
print(len(diags))
<class 'tuple'>
9523
T[va].text(diags[0])
'VOOR ILE DE MAYO 25 februari 1610.'
V.edge[diags[0]]
{6018784: 18, 6018788: 4}
for (lnb, dis) in V.edge[diags[0]].items():
    print(f"dis={dis:>2} text={T[vb].text(lnb)}")
dis=18 text=VOOR ILE DE MAYO De eerste drie brieven, door Both op reis naar Indië geschrovon, wijken niet af van
dis= 4 text=25 februari 1610.
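To look at more than one diagnosed node, the loop above can be wrapped in a small helper that lists the targets of a node with the closest match first. This is a sketch: `describe_targets` is a hypothetical name, it assumes the edge data behaves like a dict of dicts (as the output of `V.edge[diags[0]]` suggests), and the toy node numbers and texts below are made up.

```python
def describe_targets(edge, text_of, node):
    """Return (dissimilarity, target node, target text) triples for node,
    sorted so that the most similar target comes first."""
    return sorted(
        (dis, target, text_of(target)) for target, dis in edge[node].items()
    )


# toy data standing in for V.edge and T[vb].text
toy_edge = {7: {101: 18, 105: 4}}
toy_text = {101: "first target line", 105: "second target line"}

for dis, target, text in describe_targets(toy_edge, toy_text.get, 7):
    print(f"dis={dis:>2} node={target} text={text}")
```

With the notebook's objects the call would presumably look like `describe_targets(V.edge, T[vb].text, diags[0])`, and a loop over `diags[:10]` would show the first ten cases.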
In the target version footnotes occupy lines themselves. Line breaks in footnotes now become line breaks in the text as a whole. So lines in the source version may become split into several parts when they have a reference to a multiline footnote.
The mapping then detects the two target lines, each of which is an imperfect target of the source line. We cannot do much about it.
We could have made another coding decision: line breaks in footnotes are different from line breaks in the body text. Then we would have a good correspondence between the lines in both versions.