Longest Common Substring

Introduction

This notebook is a laborious run through the algorithm design process for a well-known and commonly asked dynamic programming problem. I'm doing this to better define my own process for approaching these kinds of difficult problems.

Understand the problem statement (1)

Q: Longest Common Substring — Given two strings $a$ and $b$, write a function which returns the longest substring common to both.

This is a substring search problem. It occurs in various real-world contexts, like shell completion and finding matching segments in long DNA sequences. If we limit ourselves to just characters with matching indices, this problem is trivial, and can be solved in $O(n)$ time using a one line map-reduce operation. What makes this problem challenging is that we are not limiting ourselves to just matching indices, but matches anywhere in the string.

Write output unit tests (2)

Once we understand the desired behavior, we need to formalize the expectation by writing unit tests for it. These tests will be black box because we are not testing any function intrinsics or helps, only the expected output. Writing tests ahead of time forces you to think about potential program edge cases right away, which promotes program correctness once you sit down to the actual programming.

In [4]:
import unittest

class TestLongestCommonSubstring(unittest.TestCase):
    """Example of how to use unittest in Jupyter."""
    
    def testEmptyString(self):
        self.assertEqual(lcs("", ""), "")
        
    def testUnitStringMatch(self):
        self.assertEqual(lcs("A", "A"), "A")
        
    def testUnitStringNonMatch(self):
        self.assertEqual(lcs("A", "B"), "")        
        
    def testLongMatch(self):
        self.assertEqual(lcs("AAAA", "AAAA"), "AAAA")        

    def testStartMatch(self):
        self.assertEqual(lcs("AABB", "AACC"), "AA")
        
    def testEndMatch(self):
        self.assertEqual(lcs("ABCC", "DECC"), "CC")
        
    def testUnequalLengths(self):
        self.assertEqual(lcs("ABCDEFGHIJKLMNOPQRSTUVWXYZ", "ABCD"), "ABCD")

        
if __name__ == '__main__':
    unittest.main(argv=[''], exit=False)
.......
----------------------------------------------------------------------
Ran 7 tests in 0.004s

OK

Design a brute-force solution in psuedocode (3)

Understanding expected output is often enough to define a brute-force solution. The brute-force solution doesn't need to be pretty or fast, it just needs to work and pass the tests.

Why write the algorithm out in psuedocode first? Generally speaking you perform two errors when building a program: logic errors, and syntax errors. Logic errors are errors in the way the program operates (for example, forgetting a termination condition). Syntax errors are errors in the way that you using the language you are writing the program in (for example, incorrectly concatenating variables of different types). If you write the program in a true language immediately, you will need to keep both kinds of errors in mind simultaneously. By writing the program in psuedocode first you can focus solely on logical correctness first, and on syntactic correctness later.

Do this in psuedocode first because that separates the algorithm design from the language-specific implementation details thereof.

If the problem is very easy, you can skip this step and go straight to (4). If the problem is hard from the get-go, or if a significant difficulty in the problem is in just coming up with the brute force solution, you may need to perform (5), property analysis, first, before you can even implement a brute-force solution.

lcs_brute(a, b):
    longest = ""
    for a_i of range(a):
        for b_i of range(b):
            curr_seq = ""
            offset = 0

            while a[a_i + offset] = b[b_i + offset]:
                next_char = a[a_i + offset]  # or equivalently b[b_i + offset]
                curr_seq += next_char
                offset++

                if a_i + offset >= len(a) or b_i + offset >= len(b):
                    break

            if len(curr_seq) > len(longest):
                longest = curr_seq

            curr_seq = ""

Implement the brute-force algorithm (4)

With the logic written out, you can focus on mapping it to the implementation.

In [2]:
def lcs(a, b):
    longest = ""
    for a_idx, a_char in enumerate(a):
        for b_idx, b_char in enumerate(b):
            curr_seq = ""
            offset = 0
            
            while a[a_idx + offset] == b[b_idx + offset]:
                next_char = a[a_idx + offset]
                curr_seq += next_char
                offset += 1
                
                if a_idx + offset >= len(a) or b_idx + offset >= len(b):
                    break
                
            if len(curr_seq) > len(longest):
                longest = curr_seq
                
            curr_seq = ""
            
    return longest

Study the program properties (5)

This algorithm is $O(n^3)$. Very inefficient! With a brute force implementation done, it is now time to perform optimization.

Just like when designing the algorithm, there are two types of optimizations. Logical optimizations reduce program runtime and memory overhead by changing the underlying algorithm (for example, memoization). Syntactic optimizations reduce these things by changing the language features used (for example, vectorization). The former is usually considered important, and the latter optional.

Splitting your approach to the program into a logic step and an implementation step allows you to do first one, then the other.

  1. Let $\text{LCS}(\cdot, \cdot)$ be the program. The core unit of work of the LCS program is the index-based comparison operation $\text{cmplen(i, j)}$, which itself iteratively uses the atomic character comparison instruction $\text{CMP}[\cdot,\cdot]$. Logically optimizing the algorithm is equivalent to reducing the number of comparisons needed.
  2. There is a recurrence relationship over cmplen as we move leftward:

    $\text{cmplen}(i, j) = \text{CMP}[i, j] \circ \text{cmplen}(i + 1, j + 1)$

    (here $\circ$ means functional composition, in the sense that we may write a function which returns a consistent result, given the ops on either side of the dot)

  3. If the indices $i$ or $j$ are out of bounds, $\text{cmplen(i, j) = 0}$.

  4. If we know a certain substring match, we know every sub-substring match as well, and we know those matches will be shorter.
  5. A longest known match of length $n$ greatly reduces the remaining search space, because it eliminates all character sequences near the ends of the string shorter than $n$, and all character sequences in between known matches of length less than $n$.
  6. Every matching substring within the two strings will necessarily be the prefix of some other substring, and the suffix of some other string.

The major optimization opportunity are P6 and P2.

First P6.

Our algorithm performs a tight loop over the data, visiting every possible combination of prefixes and/or suffixes. For example:

$$\text{LCS}(ABA, BAB) = \max{\{f(ABA, BAB), f(ABA, AB), f(ABA, B), f(BA, BAB), f(BA, AB), \ldots\}}$$

we know from P6 that every matching substring is a suffix of some other substring. Visiting every possible combination of suffixes means visiting every possible matching substring, either by itself or as the suffix of some other string.

So we can compare the last characters of each combination of strings in the inner loop, and discard the rest of the substring after finding or not finding a match, instead of iterating through the entire substring, as in the brute force solution.

P6 applies equally to prefixes and suffixes.

Next P2.

Per P2, if we know the substring match length of the immediately smaller suffix string, we may learn the substring match length of the current suffix string by performing a single additional comparison operation.

What about the remaining properties? The remaining properties (particularly P5) may be used to reduce the best case runtime of the algorithm. But they do not affect its $O$ speed overall. For this particular problem, they are less important.

P2 is the reason that we want to use suffixes, not prefixes, as this optimization can't be applied to suffixes, but P6 applies to both equally.

What data structure can be use to take advantage of P2? An array whose columns and indices are the indices of the $a$ and $b$ substring, respectively, is a convenient representation.

Implement the optimized program in psuedocode (6)

Again, do pseudocode first to separate logical thought from syntactic thought.

lcs(a, b):
    progress = Array(len(a) by len(b))
    for a_i of range(a):
        for b_i of range(b):
            a_suff = a[:a_i + 1]
            b_suff = b[:b_i + 1]
            if a_i = 0 or b_i = 0:
                if a_suff == b_suff:
                    progress[a_i][b_i] = 1
                else:
                    progress[a_i][b_i] = 0
            else:
                if a_suff[-1] == b_suff[-1]:
                    progress[a_i][b_i] = progress[a_i - 1][b_i - 1] + 1
                else:
                    progress[a_i][b_i] = 0

    a_i, b_i = argmax(progress)
    maxlen = progress[a_i][b_i]
    return a[suffix end:]

This algorithm has complexity $O(n^2)$!

Implement the optimized algorithm (7)

Now we implement the optimized algorithm, and make sure it passes our tests.

In [3]:
def lcs(a, b):
    if len(a) == 0 or len(b) == 0:
        return ''
    
    progress = [[float('NaN') for _ in range(len(b))] for _ in range(len(a))]
    for a_i in range(len(a)):
        for b_i in range(len(b)):
            a_suff = a[:a_i + 1]
            b_suff = b[:b_i + 1]
            if a_i == 0 or b_i == 0:
                progress[a_i][b_i] = int(a_suff[-1] == b_suff[-1])
            else:
                if a_suff[-1] == b_suff[-1]:
                    progress[a_i][b_i] = progress[a_i - 1][b_i - 1] + 1
                else:
                    progress[a_i][b_i] = 0
    
    def argmax(arr):
        a_idx = b_idx = 0
        _max = 0
        for a in range(len(arr)):
            for b in range(len(arr[0])):
                if arr[a][b] > _max:
                    a_idx = a
                    b_idx = b
                    _max = arr[a][b]
        return a_idx, b_idx
    
    a_i, b_i = argmax(progress)
    maxlen = progress[a_i][b_i]
    
    return '' if maxlen == 0 else a[a_i - maxlen + 1:a_i + 1]

Loop through the optimizations (if necessary)

If we are applying multiple non-trivial optimizations, we may significantly reduce our defect rate by applying them one at a time, instead of all at once. In other words, if there are multiple things we can "do" to make our program better, it can be greatly helpful to perform steps 6 and 7 in a loop: implement one optimization in psuedocode and then in code, then the next one, and so on.

Refactor for quality and implement syntactic optimizations (8)

Your first attempt at an algorithm will invariably be a bit messy, and will tend to use features of the language that you are most comfortable with. Once you have successfully implemented something that works, you can improve code quality by refactoring it.

This can achieve two ends. One, tidy operations like removing unused variables, renaming unclear fields, and extracting functional components to helper functions improve code readability. Second, refactoring will allow you to deploy language features that you know reasonably well, but which you would hesitate to use in an interview or prototyping context (in case you make a mistake). These more advanced language features can further improve code clarity.

Finally, the last thing you can do is introduce syntactic optimizations. Recall that syntactic optimizations are ones which do not improve the time complexity of the algorithm, but do improve its finer runtime characteristics. Examples of syntatic optimizations include replacing typed arrays with lists and vectorizing math operations. These types of optimizations are hard to get right, will not be easily applicable to every problem, and require a detailed understanding of the programming language you are working with.