Near matching

The example below works with 2.1 (the current release), but not 2.0

What is near matching and why do we use it?

Near matching is another term for fuzzy matching, that is, is based on the idea that two items (such as two word tokens in a collation) should sometimes be considered matching even when they are not string equal (that is, not identical in every character). More precisely, near matching is a strategy for finding the closest match in situations where they not be an exact match.

Consider the following example from the Rus′ primary chronicle:

The last column contains slightly differing forms of fraci, but the second witness, Tro, reads fraki. Normalization, including Soundex takes care of the slight variation during Gothenburg stage 2, but we don’t want to merge c and k globally because the difference between them is usually significant.

When CollateX cannot find an exact match and there is more than one possible alignment for a token, it defaults to placing the token to the left. This means that without near matching, fraki, which does match perfectly either i or fraci, would be misaligned. With near matching, though, CollateX can recognize that fraci is more like fraki than it is like i, and thus place it in the correct (right) column.

How does near matching work in CollateX?

The way near matching currently works in CollateX is that if the user has turned it on (it is off by default), after the alignment stage has been completed, the system looks for situations that cannot be resolved solely by exact matching, that is, entirely at the alignment stage. Those situations must show both of the following properties:

  1. Different numbers of tokens: Between two blocks (vertical sets of tokens) that are firmly aligned there is an unequal number of tokens in the witnesses. In the example above, the alignment of “The” and “koala” is clear because both witnesses have the identical reading (that is, they are complete vertical blocks), but in one witness there is one token between them and in the other witness there are two tokens. If, on the other hand, the two witnesses read “The gray koala” and “The grey koala”, although the middle tokens don’t match, there’s no ambiguity because the alignment is forced: since “gray” and “grey” are sandwiched between perfect matches before and after, they can only be aligned with each other, so there is no need for near matching.
  2. No exact match: The tokens in the shorter witnesses do not have an exact string match in the longer witnesses. If they did have an exact match, that would have been found at the alignment stage and there would be no need for near matching. For example, if the two witnesses read “The grey koala” and “The big grey koala”, although there are three tokens in the first witness and four in the second, each token in the first has an exact string match in the second, which means that it can be aligned at the alignment stage.

If and only if both of those conditions are met, CollateX compares the floating token in the shorter witness (“gray” in the example above) to all possible matches (“big” and “grey” in this example) and calculates the nearest match using a measure called edit distance or Levenshtein distance (see https://en.wikipedia.org/wiki/Edit_distance for more information). CollateX calculates the edit distance between the floating token and the tokens in the other witnesses at all of the locations where it could be placed, and determines the best placement. In a tradition with a large number of witnesses and large gaps, the number of comparisons grows quickly, which means that you don’t want to calculate edit distance except where you need to. Performing computationally inexpensive exact string matching first (in the alignment stage) and then calculating the more expensive edit distance (in the analysis stage) only where alignment has failed to give a satisfactory answer reduces the amount of computation required.

In [21]:
from collatex import *
collation = Collation()
collation.add_plain_witness("A", "The gray koala")
collation.add_plain_witness("B", "The big grey koala")
alignment_table = collate(collation, segmentation=False)
print(alignment_table)
+---+-----+------+------+-------+
| A | The | gray | -    | koala |
| B | The | big  | grey | koala |
+---+-----+------+------+-------+
In [22]:
from collatex import *
collation = Collation()
collation.add_plain_witness("A", "The gray koala")
collation.add_plain_witness("B", "The big grey koala")
alignment_table = collate(collation, segmentation=False, near_match=True)
print(alignment_table)
+---+-----+-----+------+-------+
| A | The | -   | gray | koala |
| B | The | big | grey | koala |
+---+-----+-----+------+-------+