#!/usr/bin/env python
# coding: utf-8

# # Advanced normalization
#
# * CollateX default matching
# * Why you may want to override it
# * How to override it
#
# ## CollateX default matching
#
# * Exact string matching – Near matching
# * Tokenize by splitting on white space
# * Punctuation marks are individual tokens
# * No case normalization
# * No Unicode normalization
#
# ## Sample normalization overrides
#
# * Case folding
# * Unicode normalization (precomposed characters)
# * Strip punctuation
# * Strip markup
#
# ## Soundex
#
# * English-language surnames, 1918
# * Algorithm (simplified)
#     1. Retain first letter
#     1. Delete other vowels
#     1. Degeminate
#     1. Conflate other letters according to phonetic similarity (e.g., t/d = 3; m/n = 5)
#     1. Truncate or zero-pad to four characters
# * Examples
#     * *Birnbaum* B-651 (also ✓*Barenboim*; also ✗*Brumble*)
#
# ## Soundex assumptions
#
# * More nuanced than generic edit distance
#     * Edit distance (Levenshtein distance): *deletion*, *insertion*, *substitution* (Damerau-Levenshtein: *transposition*)
# * Character differences are not all equivalent with respect to information load
#     * Consonants carry more information than vowels
# * Information load may be sensitive to position
#     * Beginning of word carries more information than end
#     * Especially true for lexical (not morphological) searching in inflected languages
#
# ## Adapting Soundex to Church Slavonic
#
# * Neutralize variant spellings of initial vowel
#     * оу,у,ꙋ=у
#     * ѡ,ꙍ,ѻ,о=о
# * Casefold, neutralize consonantal variants
#     * Not always one-to-one, e.g., щ = шт
# * Degeminate, delete other vowels, delete diacritics
# * Keep two letters of two-letter words
#     * Higher information load
# * Other conflations?
#     * Knowledge based vs machine learning
# * Expand abbreviations? – б҃га, бг҃а, б҃а = бога (бг)
# * Truncate
# * Zero-pad
# * To what length?
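#
# The simplified Soundex steps listed in the *Soundex* section can be sketched
# as follows. This is a minimal illustration, not the full American Soundex
# (which has extra rules for h/w and for the first letter's own code). The
# function name `simple_soundex` and the decision to drop y, h, and w together
# with the vowels are assumptions made for this sketch.

```python
# Classic Soundex consonant classes (letters conflated by phonetic similarity).
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit

# Assumption: y, h, w are dropped along with the vowels, as in classic Soundex.
_DROPPED = set("aeiouyhw")


def simple_soundex(word, length=4):
    """Return a simplified Soundex key for an ASCII word (hypothetical helper)."""
    letters = [ch for ch in word.lower() if ch.isascii() and ch.isalpha()]
    if not letters:
        return ""
    # 1. Retain the first letter.
    first, rest = letters[0], letters[1:]
    # 2. Delete the other vowels (and y, h, w).
    consonants = [ch for ch in rest if ch not in _DROPPED]
    # 3. Degeminate: collapse runs of the same letter.
    degeminated = []
    for ch in consonants:
        if not degeminated or degeminated[-1] != ch:
            degeminated.append(ch)
    # 4. Conflate the remaining letters into phonetic classes.
    digits = "".join(_CODES[ch] for ch in degeminated)
    # 5. Truncate or zero-pad to the requested length.
    return (first.upper() + digits)[:length].ljust(length, "0")


for name in ("Birnbaum", "Barenboim", "Brumble"):
    print(name, simple_soundex(name))
```

# All three names from the example reduce to the same key, which shows both
# the intended conflation (*Barenboim*) and the false positive (*Brumble*).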
#
# ## Two types of normalization
#
# ### Collation
#
# * Find alignment points
# * Coarse adjustments
# * No harm in conflating, e.g., imperfect and aorist or infinitive and supine
#
# ### Evaluation
#
# * Alignment points are already known
# * Finer comparisons
# * May need to distinguish on the basis of small details
#
# ## Collation after Soundex
#
# * Greatly improved results
# * Utilize forced matches
#     * A B C
#     * A D C
# * Misses
#     * Gap in alignment (no forced match)
#     * Imperfect match
#         * фраки ~ фраци
#         * CollateX recognizes only perfect matches
#         * Unable to recognize closest match (but see *near matching*)
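#
# One way to override the default matching is to hand CollateX pretokenized
# input in which each token carries its original reading (`t`) and a
# normalized form (`n`) that is used for alignment. The sketch below only
# builds that structure with the standard library; the helper names
# (`normalize`, `build_witness`, `build_input`) and the particular
# normalization (NFC, case folding, punctuation stripping) are assumptions
# chosen for illustration, and a Soundex-style key could be substituted for `n`.

```python
import unicodedata


def normalize(token):
    """Hypothetical normalizer: NFC, case folding, punctuation stripping."""
    text = unicodedata.normalize("NFC", token).casefold()
    # Drop every character whose Unicode general category is punctuation (P*).
    return "".join(ch for ch in text
                   if not unicodedata.category(ch).startswith("P"))


def build_witness(siglum, text):
    """Split on white space and pair each reading with its normalized form."""
    tokens = []
    for raw in text.split():
        n = normalize(raw)
        # Fall back to the raw reading if normalization leaves nothing
        # (e.g. a token that consists only of punctuation).
        tokens.append({"t": raw, "n": n if n else raw})
    return {"id": siglum, "tokens": tokens}


def build_input(witnesses):
    """Assemble the {"witnesses": [...]} structure from a siglum -> text dict."""
    return {"witnesses": [build_witness(s, t) for s, t in witnesses.items()]}


json_input = build_input({"A": "The Cat, sat.", "B": "the cat sat"})
print(json_input)
```

# With the `collatex` package installed, a structure of this shape can, as far
# as the CollateX documentation describes, be passed to `collate()` in place of
# plain-text witnesses. Alignment then compares the `n` values, while the `t`
# values are what appear in the output.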