Given two strings:

In [1]:

A = 'ABCBDAB'

In [2]:

B = 'BDCABA'

find the longest common subsequence between A and B.

There are two:

B, C, B, A
B, D, A, B

Returning either (or both) is fine.

I believe this example comes from a uOttawa course on algorithms (CSI 3105). It illustrates the problem and highlights the fact that subsequences need not be contiguous, but A and B are a meaningless scramble of letters. Here's a friendlier example:

In [3]:

occidental = "occidental"

In [4]:

superdelicately = "superdelicately"

The longest common subsequence of these two words is an English word, can you find it?

It's "deal".

The length of the longest common subsequence¶

The longest common subsequence problem can be solved with dynamic programming.

In [5]:

from functools import lru_cache

In [6]:

@lru_cache(maxsize=None)
def lcs(A, B):
    # Base case A and/or B is empty
    if len(A) == 0 or len(B) == 0:
        return 0
    
    last_letter_A = A[-1]
    last_letter_B = B[-1]
    if last_letter_A == last_letter_B:
        # Recursive case 1: the last letter of A matches the last letter of B
        return 1 + lcs(A[:-1], B[:-1])
    else:
        # Recursive case 2: the last letter of A does not match the last letter of B
        return max( [lcs(A[:-1], B), lcs(A, B[:-1])] )

In [7]:

lcs(A, B)

Out[7]:

Which is true, both "B, C, B, A" and "B, D, A, B" are four characters long.

In [8]:

lcs(occidental, superdelicately)

Out[8]:

Yep, "deal" is four characters as well.

The longest common subsequence¶

Alternate version that returns a longest common subsequence as a string:

In [9]:

@lru_cache(maxsize=None)
def lcs2(A, B):
    # Base case A and/or B is empty
    if len(A) == 0 or len(B) == 0:
        return ''
    
    last_letter_A = A[-1]
    last_letter_B = B[-1]
    if last_letter_A == last_letter_B:
        # Recursive case 1: the last letter of A matches the last letter of B
        return lcs2(A[:-1], B[:-1]) + last_letter_A
    else:
        # Recursive case 2: the last letter of A does not match the last letter of B
        option1 = lcs2(A[:-1], B)
        option2 = lcs2(A, B[:-1])
        if len(option1) > len(option2):
            return option1
        else:
            return option2

In [10]:

lcs2(A, B)

Out[10]:

'BDAB'

This particular implementation happened to find the second possible longest common subsequence.

In [11]:

lcs2(occidental, superdelicately)

Out[11]:

'deal'

This was somewhat tricky. You can't just drop a print statement somewhere because a single call has no idea whether it will actually make it into the final answer. The algorithm explores a DAG of possibilities (a binary tree?), if you're in the middle of a branch there is no telling whether that branch will become the path to the answer or ultimately culled in favour of another branch that yields a longer subsequence.

Iterative, "bottom-up" version¶

Some people say that algorithms using memoization are not "truly dynamic programming" and demand a "bottom-up" approach.

In the bottom-up approach the lookup table is constructed manually. Starting from nothing, simple cases are solved and then those results are used to solve progressively more difficult cases. I find this approach less elegant then the top-down memoized approach. However, the bottom-up solution does have one major advantage over the top-down solution for Python: since it has no recursive calls, it will never reach the maximum recursion depth.

In [12]:

def lcs3(A, B):
    # Create a lookup table. This is a 2D array with the letters of A as columns
    # and the letters of B as rows. It includes a buffer row and column filled with zeroes.
    columns = len(A) + 1
    rows = len(B) + 1
    lookup = [[0 for n in range(columns)] for n in range(rows)]
    
    # Since the lookup table is initialized with 0's, 
    # the base case (len(A) == 0 or len(B) == 0) is "baked-in".
    
    # What's left is to go row by row, column by column and 
    # apply the iterative version of the recursive cases in lcs().
    for row_id in range(1, rows):
        for column_id in range(1, columns):
            letter_in_A = A[column_id-1]
            letter_in_B = B[row_id-1]
            if letter_in_A == letter_in_B:
                # Equivalent to "Recursive case 1"
                lookup[row_id][column_id] = 1 + lookup[row_id - 1][column_id - 1]
            else:
                # Equivalent to "Recursive case 2"
                option1 = lookup[row_id][column_id-1]
                option2 = lookup[row_id-1][column_id]
                if option1 > option2:
                    lookup[row_id][column_id] = option1
                else:
                    lookup[row_id][column_id] = option2
    
    
    [print(r) for r in lookup]
    
    return lookup[rows-1][columns-1]

In [13]:

lcs3(A, B)

[0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 1, 1, 1, 1, 1, 1]
[0, 0, 1, 1, 1, 2, 2, 2]
[0, 0, 1, 2, 2, 2, 2, 2]
[0, 1, 1, 2, 2, 2, 3, 3]
[0, 1, 2, 2, 3, 3, 3, 4]
[0, 1, 2, 2, 3, 3, 4, 4]

Out[13]:

In [14]:

lcs3(occidental, superdelicately)

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
[0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
[0, 0, 0, 0, 0, 1, 2, 2, 2, 2, 2]
[0, 0, 0, 0, 0, 1, 2, 2, 2, 2, 3]
[0, 0, 0, 0, 1, 1, 2, 2, 2, 2, 3]
[0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3]
[0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3]
[0, 0, 1, 1, 1, 1, 2, 2, 3, 3, 3]
[0, 0, 1, 1, 1, 1, 2, 2, 3, 3, 3]
[0, 0, 1, 1, 1, 1, 2, 2, 3, 3, 4]
[0, 0, 1, 1, 1, 1, 2, 2, 3, 3, 4]

Out[14]:

Iterative version that returns the subsequence:

In [15]:

def pretty_print(arr):
    '''A hacky function that makes every string in an array equal length, then prints
    it row by row.'''
    longest_length = len(arr[:-1][:-1])
    nicer = []
    for i, row in enumerate(arr):
        new_row = []
        for j, cell in enumerate(row):
            cell_length = len(cell)
            padding_needed = longest_length - cell_length
            new_cell = cell + (' ' * padding_needed)
            new_row.append(new_cell)
        nicer.append(new_row)
    
    [print(r) for r in nicer]

In [16]:

def lcs4(A, B, print_table=False):
    # Create a lookup table. This is a 2D array with the letters of A as columns
    # and the letters of B as rows. It includes a buffer row and column filled with empty strings.
    columns = len(A) + 1
    rows = len(B) + 1
    lookup = [['' for n in range(columns)] for n in range(rows)]
    
    # Since the lookup table is initialized with '' 
    # the base case (len(A) == 0 or len(B) == 0) is "baked-in".
    
    # What's left is to go row by row, column by column and 
    # apply the iterative version of the recursive cases in lcs().
    for row_id in range(1, rows):
        for column_id in range(1, columns):
            letter_in_A = A[column_id-1]
            letter_in_B = B[row_id-1]
            if letter_in_A == letter_in_B:
                # "Recursive case 1"
                lookup[row_id][column_id] = lookup[row_id - 1][column_id - 1] + letter_in_A
            else:
                # "Recursive case 2"
                option1 = lookup[row_id][column_id-1]
                option2 = lookup[row_id-1][column_id]
                if len(option1) > len(option2):
                    lookup[row_id][column_id] = option1
                else:
                    lookup[row_id][column_id] = option2
    
    
    if print_table:
        pretty_print(lookup)
    
    return lookup[rows-1][columns-1]

In [17]:

lcs4(A, B, print_table=True)

['     ', '     ', '     ', '     ', '     ', '     ', '     ', '     ']
['     ', '     ', 'B    ', 'B    ', 'B    ', 'B    ', 'B    ', 'B    ']
['     ', '     ', 'B    ', 'B    ', 'B    ', 'BD   ', 'BD   ', 'BD   ']
['     ', '     ', 'B    ', 'BC   ', 'BC   ', 'BD   ', 'BD   ', 'BD   ']
['     ', 'A    ', 'B    ', 'BC   ', 'BC   ', 'BD   ', 'BDA  ', 'BDA  ']
['     ', 'A    ', 'AB   ', 'BC   ', 'BCB  ', 'BCB  ', 'BDA  ', 'BDAB ']
['     ', 'A    ', 'AB   ', 'BC   ', 'BCB  ', 'BCB  ', 'BCBA ', 'BDAB ']

Out[17]:

'BDAB'

In [18]:

lcs4(occidental, superdelicately)

Out[18]:

'deal'

The occidental, superdelicately example yields a table that is too large to display.

For part 2¶

This notebook is long enough, so I've split it up into two parts.

The second part extends the LCS algorithm implemented above to numbers larger than 9 and explores an interesting connection between the longest common subsequence problem and the longest increasing subsequence problem.