#!/usr/bin/env python
# coding: utf-8

# # Advanced normalization
# 
# * CollateX default matching
# * Why you may want to override it
# * How to override it
# 
# ## CollateX default matching
# 
# * Exact string matching (near matching is optional and off by default)
# * Tokenize by splitting on white space
# * Punctuation marks are individual tokens
# * No case normalization
# * No Unicode normalization
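# 
# The defaults listed above can be approximated with the standard library alone.
# This is a minimal sketch, not CollateX's own code: `tokenize` and `exact_match`
# are hypothetical helpers, and the regular expression is an assumption that only
# mimics "words plus individual punctuation marks".
# 

```python
import re
import unicodedata


def tokenize(text):
    """Split into word tokens and single punctuation-mark tokens (assumed rule)."""
    return re.findall(r"\w+|[^\w\s]", text)


def exact_match(a, b):
    """Plain string equality: no case folding, no Unicode normalization."""
    return a == b


print(tokenize("Hello, world!"))   # each punctuation mark is its own token
print(exact_match("The", "the"))   # case differs, so no match

precomposed = "\u00e9"             # e-acute as a single code point
decomposed = "e\u0301"             # e followed by a combining acute accent
print(exact_match(precomposed, decomposed))
print(exact_match(precomposed, unicodedata.normalize("NFC", decomposed)))
```

# 
# The last two lines show why Unicode normalization matters: the two spellings
# look identical on screen but compare as different strings until both are
# brought to the same normalization form.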
# 
# ## Sample normalization overrides
# * Case folding
# * Unicode normalization (precomposed characters)
# * Strip punctuation
# * Strip markup
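# 
# The four overrides above might be combined as in the following sketch. The
# names `normalize` and `pretokenize` are hypothetical. The output shape follows
# CollateX's pretokenized JSON input, in which each token carries a `t` (original
# text) and an `n` (normalized form used for matching); handing the result to
# CollateX is not exercised here.
# 

```python
import re
import unicodedata

TAG = re.compile(r"<[^>]+>")  # naive markup pattern, assumed adequate for simple inline tags


def normalize(token):
    """Strip markup, fold case, compose characters, and drop punctuation."""
    token = TAG.sub("", token)
    token = unicodedata.normalize("NFC", token.casefold())
    # Unicode general categories starting with "P" are punctuation.
    token = "".join(ch for ch in token if not unicodedata.category(ch).startswith("P"))
    return token.strip()


def pretokenize(siglum, text):
    """Build one witness with both the original (t) and normalized (n) readings."""
    tokens = []
    for raw in text.split():
        # Fall back to the raw token if normalization leaves nothing (e.g. a lone dash).
        tokens.append({"t": raw, "n": normalize(raw) or raw})
    return {"id": siglum, "tokens": tokens}


witnesses = {"witnesses": [pretokenize("A", "The <b>Cat</b>,"),
                           pretokenize("B", "the cat")]}
print(witnesses)
```

# 
# Both witnesses end up with the same `n` values, so they can align, while the
# original readings survive in `t` for later inspection.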
# 
# ## Soundex
# 
# * English-language surnames, 1918
# * Algorithm (simplified)
#     1. Retain first letter
#     1. Delete other vowels
#     1. Degeminate
#     1. Conflate other letters according to phonetic similarity (e.g., t/d = 3; m/n = 5)
#     1. Truncate or zero-pad to four characters
# * Examples
#     * *Birnbaum* = B-651 (also *Barenboim* ✓, a wanted match; also *Brumble* ✗, a false positive)
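# 
# The simplified steps above can be sketched as follows. This is not the full
# standard Soundex (its special rules for *h* and *w*, and for vowels separating
# equal codes, are ignored); `simple_soundex` is a hypothetical name, and the key
# is returned without the hyphen used in the example above.
# 

```python
# Letters conflated by phonetic similarity; vowels, h, w, and y have no code.
CODES = {}
for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for letter in letters:
        CODES[letter] = digit


def simple_soundex(word, length=4):
    """Return a simplified Soundex key such as 'B651'."""
    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return ""
    first = letters[0]                                 # 1. retain first letter
    kept = [ch for ch in letters[1:] if ch in CODES]   # 2. delete other vowels
    digits = []
    previous = CODES.get(first)
    for ch in kept:
        code = CODES[ch]                               # 4. conflate similar letters
        if code != previous:                           # 3. degeminate repeated codes
            digits.append(code)
        previous = code
    key = first.upper() + "".join(digits)
    return key[:length].ljust(length, "0")             # 5. truncate or zero-pad


for name in ["Birnbaum", "Barenboim", "Brumble"]:
    print(name, simple_soundex(name))
```

# 
# All three names receive the same key, which reproduces both the wanted match
# and the false positive from the example.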
# 
# ## Soundex assumptions
# 
# * More nuanced than generic edit distance
#     * Edit distance (Levenshtein distance): *deletion*, *insertion*, *substitution* (Damerau-Levenshtein: *transposition*)
# * Character differences are not all equivalent with respect to information load
#     * Consonants carry more information than vowels
# * Information load may be sensitive to position
#     * Beginning of word carries more information than end
#     * Especially true for lexical (not morphological) searching in inflected languages
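# 
# For comparison, generic edit distance charges every substitution the same. The
# sketch below implements plain Levenshtein distance (no transposition) with an
# optional substitution-cost function. `vowel_aware_cost` and its weight of 0.5
# are hypothetical, chosen only to illustrate "consonants carry more information
# than vowels".
# 

```python
VOWELS = set("aeiou")


def edit_distance(a, b, sub_cost=None):
    """Levenshtein distance with a pluggable substitution cost."""
    if sub_cost is None:
        sub_cost = lambda x, y: 0 if x == y else 1
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                          # deletion
                current[j - 1] + 1,                       # insertion
                previous[j - 1] + sub_cost(char_a, char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def vowel_aware_cost(x, y):
    """Assumed weighting: swapping one vowel for another costs half as much."""
    if x == y:
        return 0
    if x in VOWELS and y in VOWELS:
        return 0.5
    return 1


print(edit_distance("kitten", "sitting"))
print(edit_distance("bat", "bet"), edit_distance("bat", "bet", vowel_aware_cost))
print(edit_distance("bat", "cat"), edit_distance("bat", "cat", vowel_aware_cost))
```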
#     
# ## Adapting Soundex to Church Slavonic
# 
# * Neutralize variant spellings of initial vowel
#     * оу,у,ꙋ=у
#     * ѡ,ꙍ,ѻ,о=о
# * Casefold, neutralize consonantal variants
#     * Not always one-to-one, e.g., щ = шт
# * Degeminate, delete other vowels, delete diacritics
#     * Keep two letters of two-letter words
#     * Higher information load
# * Other conflations?
#     * Knowledge based vs machine learning
# * Expand abbreviations? – б҃га, бг҃а, б҃а = бога (бг)
#     * Truncate
#     * Zero-pad
#     * To what length?
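# 
# A minimal sketch of such an adaptation follows. Everything in it is an
# assumption made for illustration: the name `slavonic_key`, the vowel inventory
# (which counts the jers as vowels), and the replacement table. Because the
# target length is left open above, truncation and zero-padding are optional
# here. No abbreviation expansion is attempted.
# 

```python
import re
import unicodedata

# Hypothetical vowel inventory; adjust for the corpus at hand.
VOWELS = set("аеиіоуыъьѣюяѧѫ")

# Variant spellings to neutralize, applied after case folding.
REPLACEMENTS = [
    ("оу", "у"), ("\ua64b", "у"),                        # оу, ꙋ -> у
    ("\u0461", "о"), ("\ua64d", "о"), ("\u047b", "о"),   # ѡ, ꙍ, ѻ -> о
    ("щ", "шт"),                                         # not one-to-one
]


def slavonic_key(word, length=None):
    """Return a Soundex-like key for a Church Slavonic word."""
    word = word.casefold()
    # Delete diacritics (including the titlo, U+0483) by decomposing and dropping
    # nonspacing marks. Side effect: й is conflated with и.
    word = "".join(ch for ch in unicodedata.normalize("NFD", word)
                   if unicodedata.category(ch) != "Mn")
    for old, new in REPLACEMENTS:
        word = word.replace(old, new)
    if len(word) > 2:                                    # keep two-letter words whole
        word = re.sub(r"(.)\1+", r"\1", word)            # degeminate
        word = word[0] + "".join(ch for ch in word[1:] if ch not in VOWELS)
    if length is not None:
        word = word[:length].ljust(length, "0")          # truncate or zero-pad
    return word


for form in ["бога", "б\u0483га", "бг\u0483а", "б\u0483а", "оученикъ", "\ua64bченикъ"]:
    print(form, slavonic_key(form))
```

# 
# In this sketch the full form and the first two abbreviated forms happen to
# share a key once the titlo and the non-initial vowels are removed, while the
# two-letter abbreviation does not, which is where an expansion table would
# still be needed.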
# 
# ## Two types of normalization
# 
# ### Collation
# 
# * Find alignment points
# * Coarse adjustments
# * No harm in conflating, e.g., imperfect and aorist or infinitive and supine
# 
# ### Evaluation
# 
# * Alignment points are already known
# * Finer comparisons
# * May need to distinguish on the basis of small details
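# 
# The two levels can be kept apart in code, as in this sketch: a coarse key
# decides whether two readings belong at the same alignment point, and the
# untouched readings are compared afterwards. `collation_key`, `evaluate`, and
# the three labels are hypothetical.
# 

```python
import unicodedata


def collation_key(reading):
    """Coarse key used only to find alignment points."""
    folded = unicodedata.normalize("NFC", reading.casefold())
    return "".join(ch for ch in folded if ch.isalnum())


def evaluate(reading_a, reading_b):
    """Finer comparison of two readings whose alignment is already known."""
    if reading_a == reading_b:
        return "identical"
    if collation_key(reading_a) == collation_key(reading_b):
        return "variant in detail"
    return "substantive difference"


for pair in [("the", "the"), ("The", "the,"), ("cat", "dog")]:
    print(pair, evaluate(*pair))
```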
# 
# ## Collation after Soundex
# 
# * Greatly improved results
# * Utilize forced matches: B and D are aligned because both sit between the exact matches A and C
#     * A B C
#     * A D C
# * Misses
#     * Gap in alignment (no forced match)
#     * Imperfect match
#         * фраки ~ фраци
#     * CollateX recognizes only perfect matches
#     * Unable to recognize closest match (but see *near matching*)
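# 
# Both situations can be illustrated with Python's `difflib`, used here as a
# stand-in and not as CollateX's alignment algorithm. `align` and `closest` are
# hypothetical helpers, and the 0.6 cutoff is simply the `difflib` default.
# 

```python
from difflib import SequenceMatcher, get_close_matches


def align(tokens_a, tokens_b):
    """Return (tag, slice of a, slice of b) triples for two token lists."""
    matcher = SequenceMatcher(None, tokens_a, tokens_b, autojunk=False)
    return [(tag, tokens_a[i1:i2], tokens_b[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()]


def closest(token, candidates, cutoff=0.6):
    """Return the most similar candidate, or None if none is similar enough."""
    matches = get_close_matches(token, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Forced match: B and D differ but are paired between the exact matches.
print(align(["A", "B", "C"], ["A", "D", "C"]))

# Gap: one token faces two, so position alone cannot decide the pairing.
print(align(["A", "B", "C"], ["A", "D", "E", "C"]))

# Imperfect match: pick the closest candidate when no exact match exists.
print(closest("фраки", ["слово", "фраци"]))
```

# 
# In the gap case a similarity measure such as `closest` is one way to choose
# which of the competing tokens to pair, which is the idea behind near matching.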