• Topic: Distance
  • Authors: Douglas Blank, Bryn Mawr College;
  • Dependencies: Python, Statistics
  • License(s): Creative Commons, Share, with attribution;
  • Audience: Undergraduate;
  • Keywords: distance, minimum edit distance, hamming distance
  • Code Examples: Python
  • Summary: Distance is usually consider to be a property of only geometric objects. However, we can expand the notion to include more abstract objects of information in general.

Distance

Distance is usually consider to be a property of only geometric objects. However, we can expand the notion to include more abstract objects of information in general.

Mathematical Distance

A mathematical distance is usually considered to just be the measurement (in some standard units) between two points, directly. This could be points in one dimension, or more. We call this ''euclidean distance'' and is, in general, a sum of the absolute differences of each of the dimensions. For example, in one dimension, the distance is just:

In [10]:
A = 5
B = 3
abs(A - B)
Out[10]:
2

In more than one dimension, the distance is the square root of the sum of the squares of the differences of each dimension. For example, if A and B are points with x and y properties, then the distance between them is:

In [9]:
import math
from Graphics import Point

A = Point(1, 3)
B = Point(2, 5)
math.sqrt( (A.x - B.x) ** 2 + (A.y - B.y) ** 2 )
Out[9]:
2.23606797749979

However, sometimes that type of measurement isn't practical. For example, consider driving in New York City. Knowing that your destination is a distance of exactly 1.2 miles away "as the crow flies" (eg, direct distance) doesn't really give you a hint as to how long it will take you to get there. What you really need to know is the "city block" distance. This is the distance given that you have to drive on the grid, and can never take a diagonal, lest you drive through a building! It gets more complicated if one takes into account one-way streets.

Thus, we can take these additional constraints and data into account, and come up with a different measurement (and units) of distance. For example, we could compute the amount of time it would take to take a taxi to our destination. This might include finding one, the city-block driving distance multiplied by average speed, etc. We could then compare taxi-time to subway time. This measurement of distance is much more useful for our purposes.

Distance between Sequences

A common goal in information science is to compute the distance between two abstract sequences of data. Just like in the NYC taxi example where we need to get from point A to point B, we need a way of measuring how to "get" from one datum to the other.

Let's consider text, also known as strings. What is the distance between "apple" and "aple"? Why would you want to know? The distance between strings could be useful for many purposes. For example, if you wrote "aple" a computer program (say a word processor or a cell phone app) might want to compute possible alternate real words for using in a spelling checking function, or an "auto correct" feature (see [1] for funny auto-correct nightmares).

Sequences, such as strings, appear throughout many aspects of life. And knowing the differences between any two can be a very useful measurement. Take, for instance, a string composed of of the letters A, T, G, and C. This is, of course, the parts that make up strings of DNA. Being able to compute, and thus measure, the differences between two species' DNA provides us with interesting information about these two species.

How could we compute a difference between two sequence? One method would be to count up the changes that it takes in order to turn one string into the other. For example, one can turn "cat" into "caps" by:

  1. change the "t" into a "p" (+1 for an edit)
  2. add an "s" onto the end (+1 for an insert)

This gives a distance of 2. Of course, this is not the only "path" to change "cat" into "caps". Here is another:

  1. delete the "t" (+1 for an delete)
  2. add an "p" onto the end (+1 for an insert)
  3. add an "s" onto the end (+1 for an insert)

This gives a distance of 3. Of course, we are interested in the minimum distance, so we consider 2 to be the correct answer in this case.

By finding the minimum of the sum of the total edits, deletions, and insertions, we can get a measure of the difference between two strings.

How can we write a program for computing the minimum distance between any two strings? Let's consider an algorithm for computing differences between strings:

 define function distance(string1, string2):
     if the next letter of string 1 is the same as next letter of string2:
         then there is no difference (0), and we just compute distance(string1[1:], string2[1:]) of the rest
     else:
         # there are three possibilities, and we want the smallest of:
         1) the first letter was deleted (+1), plus compute of rest distance(string1, string2[1:])
         2) the first letter was changed (+1), plus compute of rest distance(string1[1:], string2[1:])
         3) a new letter was inserted (+1), plus compute of rest distance(string1[1:], string2)

In other words, we compare the next letter of string1 with the next letter of string2. If they are the same, we continue with the next of each. If there is a difference, however, we need to consider one of the three changes that occurred:

  • a deletion
  • an insert
  • or an edit (changed letter)

Each of these is a difference of 1, and we add that to the rest of the differences. But, we want the smallest of the three possibilities. That's it... there is no more. This algorithm is recursive, as it calls itself. And it calls itself multiple times for each difference! The algorithm is small, but there is a lot of computation going on there.

As a recursive function, we need to make sure that there is a "base case"--- a way to know when we are finished. Easy: if we get to the end of either string, then we can return the result.

In [11]:
def distance(s1, s2):                                # s1 is string1, s2 is string2
    if len(s1) == 0:                                 # base case, nothing left to do
        return len(s2)                               # except return length of other
    elif len(s2) == 0:                               # base case, nothing left to do
        return len(s1)                               # except return length of other
    if s1[0] == s2[0]:                               # same
        return distance(s1[1:], s2[1:]) + 0          #   return 0 + rest
    else:                                            # difference! Take minimum of:
        return min( distance(s1[1:], s2[1:]) + 1,    #   edit, return 1 + rest
                    distance(s1, s2[1:]) + 1,        #   insert, return 1 + rest
                    distance(s1[1:], s2) + 1)        #   delete, return 1 + rest

Let's try it out to see how it works:

In [12]:
distance("apple", "apple")
Out[12]:
0

Yes, there is no difference between these... they are the same! Ok, what about our original issue, "apple" and "aple":

In [13]:
distance("apple", "aple")
Out[13]:
1

Yes, these two strings have a difference of 1... a "p" was deleted. Of course, there are other possibilities. And this algorithm considers them all, but only returns the minimum of them. Let's try something harder:

Try it!

Try out the distance function between any two strings. Suggest only trying on short strings for now!.

Continuing

Now, let's try some harder comparisons.

In [14]:
distance("abcdefghijklmnopqrstuvwxyz", "abcdefghijklmnopqrstuvwxyzA")
Out[14]:
1

True, and that was very fast. What about:

In [15]:
distance("abcdefghijklmnopqrstuvwxyz", "Aabcdefghijklmnopqrstuvwxyz")
Running script aborted!

... that's weird... it is still computing... why?

As the first letters are different, three different paths are considered:

  1. the "A" was inserted
  2. "a" changed into "A"
  3. "a" was deleted

In fact, we need to find the minimum of the total each of these paths, and we won't know that until we get to the end of the string. Unfortunately, the path continues to split three ways for many of these subpaths, all the way to the end.

This little bit of code calls itself a lot!

Consider Part 1

Things to consider at this point:

  1. How much work is being done here?
  2. Are some things computed over and over and over?
  3. What if edits/changes where weighted differently than deletes or inserts?

Continuing

Let's make the code a little simpler with a refactoring to handle the case where the first letter is same/different:

In [16]:
def distance1(s1, s2):
    if len(s1) == 0 or len(s2) == 0: return len(s2) + len(s1)
    cost = 0 if (s1[0] == s2[0]) else 1
    return min( distance1(s1[1:], s2[1:]) + cost,
                distance1(s1, s2[1:]) + 1,
                distance1(s1[1:], s2) + 1)

If the first letters are the same, the cost is 0 and continue. If different (a change), cost is 1 and continue.

But this still runs slow in some cases (which cases?).

We can fix this issue by not re-computing the same thing over and over.

Consider a function, called memoize, which saves the results of a computation given some set of arguments. If a function is called later with those exact arguments, it will simple return the saved result from before.

We can write memoize as a function that takes a function (f) and returns a function (m):

In [17]:
def memoize(f):
    cache = {}
    def m(*args):
        if args not in cache:
            cache[args] = f(*args)
        return cache[args]
    return m

Then we can wrap the memoize function around distance, like so:

In [18]:
@memoize
def distance(s1, s2):
    if len(s1) == 0 or len(s2) == 0: return len(s2) + len(s1)
    cost = 0 if (s1[0] == s2[0]) else 1
    return min( distance(s1[1:], s2[1:]) + cost,
                distance(s1, s2[1:]) + 1,
                distance(s1[1:], s2) + 1)
In [19]:
distance("abcdefghijklmnopqrstuvwxyz", "Aabcdefghijklmnopqrstuvwxyz")
Out[19]:
1

And now our computation is fast too!

Try it on any string.

Consider Part 2

  1. What other kinds of sequences would this work on?
  2. What other kinds of edits are possible?
  3. What are the space/time tradeoffs for memoization?
  4. Change the code to create a function that can describe what changes were made, in English.
  5. Prove that the algorithm is correct.
  6. Can you use this to make a spelling checker?

Continuing

This methodology is very useful for many different types of sequences. But it isn't the only measure of distance.

Consider that we simplify the strings to just be two symbols, 0 and 1, and require that the two sequences be the same length. We can then simplify our definition of "distance" by not considering inserts or deletes, but just counting the differences.

Here is a new function of distance that does just that:

In [20]:
def distance(s1, s2):
    if len(s1) == 0: return 0
    cost = 0 if s1[0] == s2[0] else 1
    return cost + distance(s1[1:], s2[1:])

Or, to compute it in a non-recursive fashion:

In [21]:
def distance(s1, s2):
    cost = 0
    for i in range(len(s1)):
        cost += 0 if s1[i] == s2[i] else 1
    return cost

These are very fast, as it only needs to consider each item once.

Try it!

Consider Part 3

  1. How do these two measurements compare?
  2. Could you use the second one instead of the first for spelling checking?

Summary

The minimum edit/delete/insert algorithm is called the "Levenshtein distance algorithm" or just the "minimum edit distance" and can be further explored in [2]. The difference distance algorithm is called "Hamming distance" and can be further explored in [3].

References

  1. http://damnyouautocorrect.com/
  2. http://en.wikipedia.org/wiki/Levenshtein_distance
  3. http://en.wikipedia.org/wiki/Hamming_distance