Advanced normalization

  • CollateX default matching
  • Why you may want to override it
  • How to override it

CollateX default matching

  • Exact string matching by default (near matching is optional)
  • Tokenize by splitting on white space
  • Punctuation marks are individual tokens
  • No case normalization
  • No Unicode normalization
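The defaults listed above can be approximated in a few lines of Python. This is only a sketch of the described behavior, not CollateX's own tokenizer; the function name `tokenize` and the regular expression are assumptions made for illustration.

```python
import re

def tokenize(text):
    """Split on white space and treat each punctuation mark as its own token."""
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("The Cat sat; the cat sat."))
# ['The', 'Cat', 'sat', ';', 'the', 'cat', 'sat', '.']

# With exact string matching and no normalization, these pairs do not match:
print("The" == "the")          # False (no case normalization)
print("\u00e9" == "e\u0301")   # False (precomposed vs. decomposed é)
```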

Sample normalization overrides

  • Case folding
  • Unicode normalization (precomposed characters)
  • Strip punctuation
  • Strip markup
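A minimal sketch of the four overrides above, using only the Python standard library. The function name `normalize` is hypothetical, and the regular expression for markup is a naive assumption that only removes simple angle-bracket tags; real markup should be handled with an XML parser.

```python
import re
import unicodedata

TAG = re.compile(r"<[^>]+>")  # naive: matches simple tags such as <hi> or </hi>

def normalize(token):
    token = TAG.sub("", token)                    # strip markup
    token = unicodedata.normalize("NFC", token)   # prefer precomposed characters
    token = token.casefold()                      # case folding
    # strip punctuation: drop every character in a Unicode "P*" category
    return "".join(ch for ch in token
                   if not unicodedata.category(ch).startswith("P"))

# Decomposed "e" + combining acute, wrapped in markup, with a trailing comma
print(normalize("<hi>Cafe\u0301,</hi>") == "caf\u00e9")  # True
```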

Soundex

  • English-language surnames, 1918
  • Algorithm (simplified)
    1. Retain first letter
    2. Delete other vowels
    3. Degeminate
    4. Conflate other letters according to phonetic similarity (e.g., t/d = 3; m/n = 5)
    5. Truncate or zero-pad to four characters
  • Examples
    • Birnbaum = B-651 (also matches Barenboim ✓, but also Brumble ✗)
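The simplified algorithm above can be sketched as follows. This follows the five steps as listed and is not the full standard Soundex, which has additional rules (for example, for h and w and for letters that share a code with the first letter). The names `CODES` and `soundex` are illustrative.

```python
# Step 4 table: letters grouped by phonetic similarity
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        CODES[letter] = digit

def soundex(name, length=4):
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    first, rest = letters[0], letters[1:]          # 1. retain first letter
    # 2. delete other vowels (and any other letter without a code, e.g. h, w, y)
    consonants = [ch for ch in rest if ch in CODES]
    # 3. degeminate: drop a letter that repeats the one before it
    kept, previous = [], first
    for ch in consonants:
        if ch != previous:
            kept.append(ch)
        previous = ch
    digits = "".join(CODES[ch] for ch in kept)     # 4. conflate
    # 5. truncate or zero-pad to the requested length
    return (first.upper() + digits)[:length].ljust(length, "0")

for name in ("Birnbaum", "Barenboim", "Brumble"):
    print(name, soundex(name))   # all three yield B651
```

The third name shows the cost of truncation: Brumble differs from Birnbaum only in the fifth character of the untruncated key, which is cut off.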

Soundex assumptions

  • More nuanced than generic edit distance
    • Edit distance (Levenshtein distance): deletion, insertion, substitution (Damerau-Levenshtein adds transposition)
  • Character differences are not all equivalent with respect to information load
    • Consonants carry more information than vowels
  • Information load may be sensitive to position
    • Beginning of word carries more information than end
    • Especially true for lexical (not morphological) searching in inflected languages
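For comparison, here is a minimal sketch of plain Levenshtein distance (without the Damerau transposition). It illustrates the point above: every edit costs the same, regardless of whether it affects a vowel or a consonant, or the beginning or end of the word. The function name `levenshtein` is illustrative.

```python
def levenshtein(a, b):
    """Minimum number of deletions, insertions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                # deletion
                current[j - 1] + 1,             # insertion
                previous[j - 1] + (ca != cb),   # substitution (free if equal)
            ))
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein("bat", "bet"))         # 1 (medial vowel differs)
print(levenshtein("bat", "cat"))         # 1 (initial consonant differs)
```

A Soundex-style key would treat the last two pairs differently: the medial vowel is discarded, while the initial consonant is retained.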

Adapting Soundex to Church Slavonic

  • Neutralize variant spellings of initial vowel
    • оу,у,ꙋ=у
    • ѡ,ꙍ,ѻ,о=о
  • Casefold, neutralize consonantal variants
    • Not always one-to-one, e.g., щ = шт
  • Degeminate, delete other vowels, delete diacritics
    • Keep two letters of two-letter words
    • Higher information load
  • Other conflations?
    • Knowledge based vs machine learning
  • Expand abbreviations? – б҃га, бг҃а, б҃а = бога (бг)
  • Truncate or zero-pad?
    • To what length?
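The adaptation outlined above might be sketched as follows. Everything here is an assumption made for illustration: the function name `slavonic_key`, the contents of the `VOWELS` set, and the choice to leave truncation and padding optional (the slide leaves the length as an open question). Only the conflations named in the list are included, and abbreviations are not expanded.

```python
import unicodedata

INITIAL_VOWELS = {"ꙋ": "у", "ѡ": "о", "ꙍ": "о", "ѻ": "о"}  # "оу" is handled separately
CONSONANT_VARIANTS = {"щ": "шт"}                            # not one-to-one
VOWELS = set("аеиоуыꙑѣюяѥѧѫѩѭъьіѡꙍѻꙋꙗ")                    # illustrative, not exhaustive

def strip_diacritics(word):
    # Decompose, then drop combining marks (titlo, accents, breathings).
    # Side effect: й decomposes to и + breve and is folded to и.
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed
                   if not unicodedata.category(ch).startswith("M"))

def slavonic_key(word, length=None):
    word = strip_diacritics(word.casefold())
    if word.startswith("оу"):                    # neutralize initial vowel spellings
        word = "у" + word[2:]
    elif word[:1] in INITIAL_VOWELS:
        word = INITIAL_VOWELS[word[0]] + word[1:]
    for variant, replacement in CONSONANT_VARIANTS.items():
        word = word.replace(variant, replacement)
    if len(word) <= 2:                           # keep both letters of two-letter words
        key = word
    else:
        key, previous = word[0], word[0]
        for ch in word[1:]:
            # degeminate (skip a repeat of the preceding letter), delete other vowels
            if ch != previous and ch not in VOWELS:
                key += ch
            previous = ch
    if length is not None:                       # optional truncate / zero-pad
        key = key[:length].ljust(length, "0")
    return key

print(slavonic_key("оученикъ") == slavonic_key("ꙋченикъ"))   # True
print(slavonic_key("б\u0483га") == slavonic_key("бога"))     # True: both "бг"
print(slavonic_key("б\u0483а"))   # "ба": needs abbreviation expansion to reach "бг"
```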

Two types of normalization

Collation

  • Find alignment points
  • Coarse adjustments
  • No harm in conflating, e.g., imperfect and aorist or infinitive and supine

Evaluation

  • Alignment points are already known
  • Finer comparisons
  • May need to distinguish on the basis of small details
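One way to keep both kinds of normalization available is CollateX's pre-tokenized JSON input, in which each token carries a `t` (original text) property and an optional `n` (normalized) property that is used for alignment. The sketch below only builds that structure with the standard library; `coarse_key` is a hypothetical stand-in for any coarse normalization (such as a Soundex-style key), and `witness` is an illustrative helper.

```python
import json

def coarse_key(token):
    # Hypothetical placeholder: substitute any coarse normalization here
    return token.casefold()

def witness(siglum, text, key=coarse_key):
    """Build one witness with original (t) and normalized (n) token values."""
    return {"id": siglum,
            "tokens": [{"t": tok, "n": key(tok)} for tok in text.split()]}

collation_input = {"witnesses": [witness("A", "The Cat sat"),
                                 witness("B", "the cat Sat")]}
print(json.dumps(collation_input, ensure_ascii=False, indent=2))
```

With the `collatex` Python package installed, a structure like this should be usable as input to `collate()`. Alignment then relies on the coarse `n` values, while the `t` values remain available for the finer comparisons needed during evaluation.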

Collation after Soundex

  • Greatly improved results
  • Utilize forced matches
    • A B C
    • A D C
  • Misses
    • Gap in alignment (no forced match)
    • Imperfect match
      • фраки ~ фраци
    • CollateX recognizes only perfect matches
    • Unable to recognize closest match (but see near matching)
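To illustrate what recognizing the closest match would involve, here is a small sketch using Python's standard `difflib` module. It is not CollateX's near-matching implementation; the function name `closest` and the `cutoff` value are assumptions chosen for the example.

```python
import difflib

def closest(token, candidates, cutoff=0.6):
    """Return the candidate most similar to token, or None if none is close enough."""
    matches = difflib.get_close_matches(token, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# An imperfect match that exact comparison would miss
print(closest("фраки", ["бога", "фраци", "слово"]))   # фраци
print(closest("фраки", ["бога", "слово"]))            # None
```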