We show how to work with the TSV data from the Lakhnawi PDF.
Fusus has a function to import the TSV data produced by the OCR pipeline and by the text extraction pipeline.
The two kinds of TSV have slightly different columns. When unpacking the TSV data, the function casts the appropriate columns to integer.
Reference: convert.
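To make the casting step concrete, here is a minimal sketch of a TSV reader that converts selected columns to integers. It is not the code of loadTsv: the function name load_tsv_sketch, its int_columns parameter and the sample data are hypothetical, and leaving empty cells as empty strings is an assumption.

```python
import csv
import io


def load_tsv_sketch(handle, int_columns):
    """Read TSV from an open text handle; cast the named columns to int.

    Hypothetical helper for illustration only, not part of fusus.
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    headers = tuple(next(reader))  # first row holds the column names
    int_positions = {i for i, name in enumerate(headers) if name in int_columns}

    rows = []
    for fields in reader:
        rows.append(
            tuple(
                # Assumption: an empty cell stays an empty string instead of failing the cast.
                int(value) if i in int_positions and value != "" else value
                for i, value in enumerate(fields)
            )
        )
    return headers, rows


# Made-up sample data with the same flavour as the real files.
sample = "page\tline\tdirection\tword\n355\t12\tr\tabc\n356\t\tl\tdef\n"
headers, rows = load_tsv_sketch(io.StringIO(sample), {"page", "line"})
print(headers)
print(rows)
```

The header row decides which positions get cast, so the same function can serve files with different column sets.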
%load_ext autoreload
%autoreload 2
from fusus.convert import loadTsv
For a known work, such as the Lakhnawi edition of the Fusus, we can use a keyword (see works).
(headers, words) = loadTsv(source="fususl")
Loading TSV data from ~/github/among/fusus/ur/Lakhnawi/allpages.tsv
We get the header fields and the words:
print(headers)
('page', 'line', 'column', 'span', 'direction', 'left', 'top', 'right', 'bottom', 'word')
len(words)
51814
print(words[40000])
(355, 12, 1, 1, 'r', 390, 373, 390, 394, 'َّىٰ')
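Since each word is a plain tuple, it can help to pair it with the header fields. The sketch below uses only the standard library, with the header tuple and the word copied from the output above so that it runs without the data file; the variable names are illustrative.

```python
# Header fields and one word, copied from the output shown above.
headers = ('page', 'line', 'column', 'span', 'direction',
           'left', 'top', 'right', 'bottom', 'word')
word = (355, 12, 1, 1, 'r', 390, 373, 390, 394, 'َّىٰ')

# Pair every field name with its value.
record = dict(zip(headers, word))

# The coordinate columns are integers, so arithmetic works directly.
width = record['right'] - record['left']
height = record['bottom'] - record['top']

print(record['page'], record['line'], width, height)
```

With the real data, the same dict(zip(headers, row)) pattern applies to any element of words.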
Alternatively, we could have gotten the same data by passing the path of the TSV file and stating that it does not come from OCR:
(headers, words) = loadTsv(source="~/github/among/fusus/ur/Lakhnawi/allpages.tsv", ocred=False)
Loading TSV data from ~/github/among/fusus/ur/Lakhnawi/allpages.tsv
print(headers)
print(len(words))
print(words[40000])
('page', 'line', 'column', 'span', 'direction', 'left', 'top', 'right', 'bottom', 'word')
51814
(355, 12, 1, 1, 'r', 390, 373, 390, 394, 'َّىٰ')
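One thing the page and line fields allow is collecting the words of each line. The following is a rough sketch under stated assumptions: the rows are made up, the words are kept in the order in which they occur in the list, and the column and span fields are ignored, which may be too coarse for pages with several columns.

```python
from collections import defaultdict

headers = ('page', 'line', 'column', 'span', 'direction',
           'left', 'top', 'right', 'bottom', 'word')

# Made-up rows in the same shape as the text extraction data.
words = [
    (1, 1, 1, 1, 'r', 300, 50, 340, 70, 'alpha'),
    (1, 1, 1, 1, 'r', 250, 50, 290, 70, 'beta'),
    (1, 2, 1, 1, 'r', 300, 80, 340, 100, 'gamma'),
    (2, 1, 1, 1, 'r', 300, 50, 340, 70, 'delta'),
]

# Look up the positions by name instead of hard-coding them.
page_i = headers.index('page')
line_i = headers.index('line')
word_i = headers.index('word')

# Group the words under their (page, line) key, preserving input order.
lines = defaultdict(list)
for row in words:
    lines[(row[page_i], row[line_i])].append(row[word_i])

line_texts = {key: ' '.join(tokens) for key, tokens in lines.items()}

for key, text in line_texts.items():
    print(key, text)
```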
(headers, words) = loadTsv(source="fususa")
Loading TSV data from ~/github/among/fusus/ur/Affifi/allpages.tsv
We get the header fields and the words:
print(headers)
('stripe', 'column', 'line', 'left', 'top', 'right', 'bottom', 'confidence', 'text')
len(words)
46264
print(words[40000])
(203, 0, '', 18, 904, 3266, 1058, 3429, 100, 'وجه')
Alternatively, we could have gotten the same data by passing the path of the TSV file and stating that it comes from OCR:
(headers, words) = loadTsv(source="~/github/among/fusus/ur/Affifi/allpages.tsv", ocred=True)
Loading TSV data from ~/github/among/fusus/ur/Affifi/allpages.tsv
print(headers)
print(len(words))
print(words[40000])
('stripe', 'column', 'line', 'left', 'top', 'right', 'bottom', 'confidence', 'text')
46264
(203, 0, '', 18, 904, 3266, 1058, 3429, 100, 'وجه')