Advanced normalization
CollateX default matching
Why you may want to override it
How to override it
CollateX default matching
Exact string matching (near matching is optional and off by default)
Tokenize by splitting on white space
Punctuation marks are individual tokens
No case normalization
No Unicode normalization
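The default behavior listed above can be approximated in a few lines. This is only a sketch of the described behavior, not CollateX's own tokenizer code; the function name `default_like_tokens` is hypothetical.

```python
import re

def default_like_tokens(text):
    """Split on white space; each punctuation mark becomes its own token."""
    return re.findall(r"\w+|[^\w\s]", text)

tokens = default_like_tokens("The cat, the Cat.")
print(tokens)  # ['The', 'cat', ',', 'the', 'Cat', '.']

# Exact string matching: no case or Unicode normalization is applied.
print("cat" == "Cat")                  # False
print("caf\u00e9" == "cafe\u0301")     # False: precomposed vs. combining accent
```

Both comparisons fail even though a reader would treat the strings as the same word, which is the motivation for overriding the default.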
Sample normalization overrides
Case folding
Unicode normalization (precomposed characters)
Strip punctuation
Strip markup
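A minimal sketch of the four overrides above, assuming tokens are plain strings that may contain simple inline tags. The helper names `normalize` and `witness` are hypothetical. The output is shaped as pre-tokenized JSON input, where each token carries its original reading under `t` and its normalized form under `n`; the call to `collate()` is left as a comment because it assumes the `collatex` package is installed.

```python
import re
import unicodedata

TAG = re.compile(r"<[^>]+>")  # assumption: markup is simple angle-bracket tags

def normalize(token):
    """Return a normalized shadow copy of a token for matching purposes."""
    text = TAG.sub("", token)                   # strip markup
    text = unicodedata.normalize("NFC", text)   # precomposed characters
    text = text.casefold()                      # case folding
    # strip punctuation: drop every character in a Unicode P* category
    return "".join(ch for ch in text
                   if not unicodedata.category(ch).startswith("P"))

def witness(siglum, text):
    """Build one witness with 't' (original) and 'n' (normalized) per token."""
    return {"id": siglum,
            "tokens": [{"t": w, "n": normalize(w)} for w in text.split()]}

collation_input = {"witnesses": [
    witness("A", "The <hi>Cat</hi>, caf\u00e9"),
    witness("B", "the cat cafe\u0301"),
]}

print([tok["n"] for tok in collation_input["witnesses"][0]["tokens"]])
# ['the', 'cat', 'café']

# With collatex installed, input of this shape could be collated with:
# from collatex import collate
# print(collate(collation_input))
```

Matching is then done on the `n` values while the `t` values remain available for display.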
Soundex
English-language surnames, 1918
Algorithm (simplified)
Retain first letter
Delete other vowels
Degeminate
Conflate other letters according to phonetic similarity (e.g., t/d = 3; m/n = 5)
Truncate or zero-pad to four characters
Examples
Birnbaum = B-651 (also ✓ Barenboim; also ✗ Brumble)
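The simplified steps above can be sketched as follows. This follows the slide's simplified description rather than the full Soundex rules (for example, it has no special handling of h and w between consonants), so results may differ from a standard implementation in edge cases. The function name `simple_soundex` is hypothetical, and the code is printed without the hyphen.

```python
# Conflation table: letters grouped by phonetic similarity.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit

def simple_soundex(name, length=4):
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    first, rest = letters[0], letters[1:]        # retain first letter
    # delete other vowels (and h, w, y, which have no code)
    consonants = [ch for ch in rest if ch in CODES]
    # degeminate: skip a letter that repeats the one just before it
    kept, previous = [], first
    for ch in consonants:
        if ch != previous:
            kept.append(ch)
        previous = ch
    digits = "".join(CODES[ch] for ch in kept)   # conflate
    # truncate or zero-pad to the requested length
    return (first.upper() + digits)[:length].ljust(length, "0")

for surname in ("Birnbaum", "Barenboim", "Brumble", "Lee"):
    print(surname, simple_soundex(surname))
# Birnbaum B651
# Barenboim B651
# Brumble B651
# Lee L000
```

Barenboim is a welcome match for Birnbaum; Brumble receives the same code and is a false positive.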
Soundex assumptions
More nuanced than generic edit distance
Edit distance (Levenshtein distance): deletion, insertion, substitution (Damerau-Levenshtein: transposition)
Character differences are not all equivalent with respect to information load
Consonants carry more information than vowels
Information load may be sensitive to position
Beginning of word carries more information than end
Especially true for lexical (not morphological) searching in inflected languages
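For comparison, a generic edit distance can be sketched as below. The optional transposition step implements the restricted (optimal string alignment) form of Damerau-Levenshtein; the function name `edit_distance` is hypothetical.

```python
def edit_distance(a, b, transpositions=False):
    """Levenshtein distance; optionally count adjacent transpositions as one edit."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i                      # i deletions
    for j in range(cols):
        d[0][j] = j                      # j insertions
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (transpositions and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]

print(edit_distance("kitten", "sitting"))             # 3
print(edit_distance("ab", "ba"))                      # 2
print(edit_distance("ab", "ba", transpositions=True)) # 1
print(edit_distance("cat", "bat"), edit_distance("cat", "cab"))  # 1 1
```

The last line shows the limitation the slide points to: a difference at the beginning of a word and one at the end cost the same, and a vowel change costs as much as a consonant change.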
Adapting Soundex to Church Slavonic
Neutralize variant spellings of initial vowel
оу,у,ꙋ=у
ѡ,ꙍ,ѻ,о=о
Casefold, neutralize consonantal variants
Not always one-to-one, e.g., щ = шт
Degeminate, delete other vowels, delete diacritics
Keep two letters of two-letter words
Higher information load
Other conflations?
Knowledge based vs machine learning
Expand abbreviations? – б҃га, бг҃а, б҃а = бога (бг)
Truncate
Zero-pad
To what length?
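A rough sketch of the adaptation outlined above. Several things here are assumptions and are not taken from the slides: the vowel inventory in `VOWELS`, the default key length of 4, and the decision to leave consonants as letters rather than conflating them into numeric classes (the slides leave "Other conflations?" open). Abbreviation expansion is not attempted. The function name `slavonic_key` is hypothetical.

```python
import unicodedata

# Variant spellings of an initial vowel (from the slide).
INITIAL_VOWELS = (("оу", "у"), ("ꙋ", "у"), ("ѡ", "о"), ("ꙍ", "о"), ("ѻ", "о"))
# Consonantal variants; not always one-to-one (from the slide).
CONSONANT_VARIANTS = {"щ": "шт"}
# Assumed vowel inventory, including the jers; adjust for the corpus.
VOWELS = set("аеиіоуыѣюяѧѫѩѭъьѡѵєꙋꙍѻ")

def strip_diacritics(text):
    """Delete combining marks (titlo, accents, breathings)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def degeminate(text):
    out = []
    for ch in text:
        if not out or out[-1] != ch:
            out.append(ch)
    return "".join(out)

def slavonic_key(word, length=4):
    word = strip_diacritics(word.casefold())
    for variant, plain in INITIAL_VOWELS:          # neutralize initial vowel
        if word.startswith(variant):
            word = plain + word[len(variant):]
            break
    word = "".join(CONSONANT_VARIANTS.get(ch, ch) for ch in word)
    word = degeminate(word)
    if len(word) <= 2:                             # keep two-letter words whole
        return word.ljust(length, "0")
    first, rest = word[0], word[1:]
    rest = "".join(ch for ch in rest if ch not in VOWELS)  # delete other vowels
    return (first + rest)[:length].ljust(length, "0")      # truncate, zero-pad

for w in ("оуправити", "ꙋправити", "управити"):
    print(w, slavonic_key(w))   # all three give упрв
print(slavonic_key("нощь"), slavonic_key("ношть"))       # ншт0 ншт0
print(slavonic_key("б\u0483га"), slavonic_key("бога"))   # бг00 бг00
```

In this sketch the abbreviated б҃га already falls together with бога once the titlo and vowels are removed, but б҃а yields ба00 and would still need explicit abbreviation expansion.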
Two types of normalization
Collation
Find alignment points
Coarse adjustments
No harm in conflating, e.g., imperfect and aorist or infinitive and supine
Evaluation
Alignment points are already known
Finer comparisons
May need to distinguish on the basis of small details
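The two stages can be illustrated with two keys of different granularity. Both key functions below are hypothetical stand-ins: the coarse one only decides whether two tokens should be aligned, and the fine one is applied afterwards to decide whether aligned tokens count as the same reading.

```python
def coarse_key(token):
    """Collation stage: casefold and drop vowels after the first letter."""
    text = token.casefold()
    return text[:1] + "".join(ch for ch in text[1:] if ch not in "aeiou")

def fine_key(token):
    """Evaluation stage: casefold only, so spelling details still count."""
    return token.casefold()

def classify(a, b):
    if coarse_key(a) != coarse_key(b):
        return "not aligned"
    return "identical" if fine_key(a) == fine_key(b) else "variant"

print(classify("Color", "color"))   # identical
print(classify("Colour", "color"))  # variant
print(classify("color", "dolor"))   # not aligned
```

Conflating aggressively in the first stage costs nothing, because the second stage still sees the original readings.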
Collation after Soundex
Greatly improved results
Utilize forced matches
A B C
A D C
Misses
Gap in alignment (no forced match)
Imperfect match
фраки ~ фраци
CollateX recognizes only perfect matches
Unable to recognize closest match (but see near matching)
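The forced match and the imperfect match described above can be illustrated with the standard library's `difflib`. This is a stand-in for demonstration only and is not how CollateX aligns witnesses or implements near matching; the function name `align` and the candidate list are hypothetical.

```python
from difflib import SequenceMatcher, get_close_matches

def align(a, b):
    """Return (token_a, token_b, kind) rows; None marks a gap."""
    rows = []
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            rows += [(x, y, "exact") for x, y in zip(a[i1:i2], b[j1:j2])]
        elif tag == "replace" and (i2 - i1) == (j2 - j1):
            # Unequal tokens boxed in by matches on both sides: forced match.
            rows += [(x, y, "forced") for x, y in zip(a[i1:i2], b[j1:j2])]
        else:
            # Unequal counts: nothing forces a pairing, so report gaps.
            rows += [(x, None, "gap") for x in a[i1:i2]]
            rows += [(None, y, "gap") for y in b[j1:j2]]
    return rows

print(align(["A", "B", "C"], ["A", "D", "C"]))
# [('A', 'A', 'exact'), ('B', 'D', 'forced'), ('C', 'C', 'exact')]

# Where no match is forced, a similarity measure can pick the closest candidate.
print(get_close_matches("фраки", ["фраци", "слово"], n=1, cutoff=0.6))
# ['фраци']
```

B and D are paired only because A and C match on either side; фраки and фраци are paired only when something other than exact equality is consulted.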