Every word in the corpus has bounding box information in the features boxl, boxt, boxr, boxb, which store the coordinates of the left, top, right, and bottom boundaries.
For top and bottom, these are the $y$ coordinates; for left and right, they are the $x$ coordinates.
The origin is the top left of the page.
The $x$ coordinates increase when going to the right, the $y$ coordinates increase when going down.
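As a minimal sketch of this coordinate convention, the snippet below models a bounding box in plain Python, independent of Text-Fabric. The `Box` class, the `is_above` helper, and the numeric values are illustrative assumptions, not part of the corpus or its API; only the four feature names come from the text above.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Bounding box in page coordinates; the origin is the top left of the page."""

    boxl: int  # x coordinate of the left edge
    boxt: int  # y coordinate of the top edge
    boxr: int  # x coordinate of the right edge
    boxb: int  # y coordinate of the bottom edge

    @property
    def width(self) -> int:
        # x grows to the right, so right minus left is the width
        return self.boxr - self.boxl

    @property
    def height(self) -> int:
        # y grows downward, so bottom minus top is the height
        return self.boxb - self.boxt


def is_above(a: Box, b: Box) -> bool:
    """True if box a lies entirely above box b on the page."""
    return a.boxb <= b.boxt


# Made-up values, for illustration only
upper = Box(boxl=100, boxt=50, boxr=180, boxb=70)
lower = Box(boxl=100, boxt=90, boxr=160, boxb=110)

print(upper.width, upper.height)
print(is_above(upper, lower))
```

Because $y$ increases downward, a box that is higher on the page has the smaller `boxt` and `boxb` values.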
We show what you can do with this information.
from tf.app import use
We load version 0.4.
A = use("among/fusus/tf/Lakhnawi:clone", version="0.4", writing="ara", hoist=globals())
In version 0.4 the following was the case:
When words are separated not by a space but by punctuation marks, they end up in one box.
So, some words have exactly the same bounding box.
Let's find them.
It turns out that Text-Fabric search has a primitive that comes in handy: we can compare features of different nodes.
Within each line, we look for two words, the first somewhere before the second, that have the same left and right edges.
templateMultiple = """
line
  w1:word
  < w2:word
  w1 .boxr. w2
  w1 .boxl. w2
"""
results = A.search(templateMultiple)
0.70s 578 results
A.table(results, start=1, end=10)
n | p | line | word | word |
---|---|---|---|---|
1 | 1 1:9 | بيروت٤٣٤١هـ– | ٣١٠٢م | |
2 | 1 4:2 | ١‐ | نماذج | |
3 | 1 4:2 | ٣٣٩١…………………… | أ | |
4 | 1 4:3 | ٢‐ | عنوانكتاب | |
5 | 1 4:3 | الكلم……………………… | ٦ | |
6 | 1 4:4 | ٣‐ | خطبة | |
7 | 1 4:4 | الكلم……………………… | ٨ | |
8 | 1 4:5 | ٤‐[ | ١] | |
9 | 1 4:5 | ٤‐[ | فصّ | |
10 | 1 4:5 | ١] | فصّ |
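To make explicit what this query asks for, here is a rough sketch in plain Python on made-up data. It does not use Text-Fabric; the word labels, the coordinates, and the `same_box_pairs` helper are hypothetical, and the real search engine works differently under the hood.

```python
from itertools import combinations

# Hypothetical line: (word, boxl, boxr) in reading order; values are made up.
line_words = [
    ("w1", 300, 360),
    ("w2", 300, 360),
    ("w3", 300, 360),
    ("w4", 200, 280),
]


def same_box_pairs(words):
    """All pairs (earlier, later) whose left and right edges coincide."""
    return [
        (a[0], b[0])
        for a, b in combinations(words, 2)  # keeps the input order within each pair
        if a[1] == b[1] and a[2] == b[2]
    ]


print(same_box_pairs(line_words))
```

Three words sharing one box yield three pairs here, which matches the pattern in rows 8 to 10 of the table above.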
What if we also stipulate that the two words are adjacent, in the sense that they occupy subsequent slots?
If more than two words occupy the same bounding box, we should get fewer results.
templateAdjacent = """
line
  w1:word
  <: w2:word
  w1 .boxr. w2
  w1 .boxl. w2
"""
results = A.search(templateAdjacent)
0.25s 557 results
A.table(results, start=1, end=10)
n | p | line | word | word |
---|---|---|---|---|
1 | 1 1:9 | بيروت٤٣٤١هـ– | ٣١٠٢م | |
2 | 1 4:2 | ١‐ | نماذج | |
3 | 1 4:2 | ٣٣٩١…………………… | أ | |
4 | 1 4:3 | ٢‐ | عنوانكتاب | |
5 | 1 4:3 | الكلم……………………… | ٦ | |
6 | 1 4:4 | ٣‐ | خطبة | |
7 | 1 4:4 | الكلم……………………… | ٨ | |
8 | 1 4:5 | ٤‐[ | ١] | |
9 | 1 4:5 | ١] | فصّ | |
10 | 1 4:5 | آدميّة.……………………………… | ٤١ |
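The drop from 578 to 557 results fits a simple count, sketched below in plain Python. The sketch assumes that the words sharing one box occupy consecutive slots; `count_pairs` is a hypothetical helper, not a Text-Fabric function. A group of $k$ such words gives $k(k-1)/2$ pairs under the "before" relation but only $k-1$ pairs under the "adjacent" relation.

```python
from itertools import combinations


def count_pairs(group_size):
    """Return (before_pairs, adjacent_pairs) for words that share one box.

    Assumes the words of the group occupy consecutive slots.
    """
    positions = range(group_size)
    # every pair with the first word somewhere before the second
    before = sum(1 for _ in combinations(positions, 2))
    # only pairs in which the second word immediately follows the first
    adjacent = sum(1 for _ in zip(positions, positions[1:]))
    return before, adjacent


for k in (2, 3, 4):
    print(k, count_pairs(k))
```

For groups of two the counts are equal, so only boxes holding three or more words contribute to the difference between the two queries.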
From version 0.5 onward, however, words are split at an earlier stage, which keeps a good connection between the words and their bounding boxes.
Let's load that version of the TF data and repeat the queries.
A = use("among/fusus/tf/Lakhnawi:clone", version="0.5", writing="ara", hoist=globals())
results = A.search(templateMultiple)
0.80s 0 results
That's better!