Given two strings:

In [1]:
A = 'ABCBDAB'

In [2]:
B = 'BDCABA'


find the longest common subsequence between A and B.

There are two:

1. B, C, B, A

2. B, D, A, B

Returning either (or both) is fine.

I believe this example comes from a uOttawa course on algorithms (CSI 3105). It illustrates the problem and highlights the fact that subsequences need not be contiguous, but A and B are a meaningless scramble of letters. Here's a friendlier example:

In [3]:
occidental = "occidental"

In [4]:
superdelicately = "superdelicately"


The longest common subsequence of these two words is an English word, can you find it?

It's "deal".

# The length of the longest common subsequence¶

The longest common subsequence problem can be solved with dynamic programming.

In [5]:
from functools import lru_cache

In [6]:
@lru_cache(maxsize=None)
def lcs(A, B):
# Base case A and/or B is empty
if len(A) == 0 or len(B) == 0:
return 0

last_letter_A = A[-1]
last_letter_B = B[-1]
if last_letter_A == last_letter_B:
# Recursive case 1: the last letter of A matches the last letter of B
return 1 + lcs(A[:-1], B[:-1])
else:
# Recursive case 2: the last letter of A does not match the last letter of B
return max( [lcs(A[:-1], B), lcs(A, B[:-1])] )

In [7]:
lcs(A, B)

Out[7]:
4

Which is true, both "B, C, B, A" and "B, D, A, B" are four characters long.

In [8]:
lcs(occidental, superdelicately)

Out[8]:
4

Yep, "deal" is four characters as well.

# The longest common subsequence¶

Alternate version that returns a longest common subsequence as a string:

In [9]:
@lru_cache(maxsize=None)
def lcs2(A, B):
# Base case A and/or B is empty
if len(A) == 0 or len(B) == 0:
return ''

last_letter_A = A[-1]
last_letter_B = B[-1]
if last_letter_A == last_letter_B:
# Recursive case 1: the last letter of A matches the last letter of B
return lcs2(A[:-1], B[:-1]) + last_letter_A
else:
# Recursive case 2: the last letter of A does not match the last letter of B
option1 = lcs2(A[:-1], B)
option2 = lcs2(A, B[:-1])
if len(option1) > len(option2):
return option1
else:
return option2

In [10]:
lcs2(A, B)

Out[10]:
'BDAB'

This particular implementation happened to find the second possible longest common subsequence.

In [11]:
lcs2(occidental, superdelicately)

Out[11]:
'deal'

This was somewhat tricky. You can't just drop a print statement somewhere because a single call has no idea whether it will actually make it into the final answer. The algorithm explores a DAG of possibilities (a binary tree?), if you're in the middle of a branch there is no telling whether that branch will become the path to the answer or ultimately culled in favour of another branch that yields a longer subsequence.

# Iterative, "bottom-up" version¶

Some people say that algorithms using memoization are not "truly dynamic programming" and demand a "bottom-up" approach.

In the bottom-up approach the lookup table is constructed manually. Starting from nothing, simple cases are solved and then those results are used to solve progressively more difficult cases. I find this approach less elegant then the top-down memoized approach. However, the bottom-up solution does have one major advantage over the top-down solution for Python: since it has no recursive calls, it will never reach the maximum recursion depth.

In [12]:
def lcs3(A, B):
# Create a lookup table. This is a 2D array with the letters of A as columns
# and the letters of B as rows. It includes a buffer row and column filled with zeroes.
columns = len(A) + 1
rows = len(B) + 1
lookup = [[0 for n in range(columns)] for n in range(rows)]

# Since the lookup table is initialized with 0's,
# the base case (len(A) == 0 or len(B) == 0) is "baked-in".

# What's left is to go row by row, column by column and
# apply the iterative version of the recursive cases in lcs().
for row_id in range(1, rows):
for column_id in range(1, columns):
letter_in_A = A[column_id-1]
letter_in_B = B[row_id-1]
if letter_in_A == letter_in_B:
# Equivalent to "Recursive case 1"
lookup[row_id][column_id] = 1 + lookup[row_id - 1][column_id - 1]
else:
# Equivalent to "Recursive case 2"
option1 = lookup[row_id][column_id-1]
option2 = lookup[row_id-1][column_id]
if option1 > option2:
lookup[row_id][column_id] = option1
else:
lookup[row_id][column_id] = option2

[print(r) for r in lookup]

return lookup[rows-1][columns-1]

In [13]:
lcs3(A, B)

[0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 1, 1, 1, 1, 1, 1]
[0, 0, 1, 1, 1, 2, 2, 2]
[0, 0, 1, 2, 2, 2, 2, 2]
[0, 1, 1, 2, 2, 2, 3, 3]
[0, 1, 2, 2, 3, 3, 3, 4]
[0, 1, 2, 2, 3, 3, 4, 4]

Out[13]:
4
In [14]:
lcs3(occidental, superdelicately)

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
[0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
[0, 0, 0, 0, 0, 1, 2, 2, 2, 2, 2]
[0, 0, 0, 0, 0, 1, 2, 2, 2, 2, 3]
[0, 0, 0, 0, 1, 1, 2, 2, 2, 2, 3]
[0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3]
[0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3]
[0, 0, 1, 1, 1, 1, 2, 2, 3, 3, 3]
[0, 0, 1, 1, 1, 1, 2, 2, 3, 3, 3]
[0, 0, 1, 1, 1, 1, 2, 2, 3, 3, 4]
[0, 0, 1, 1, 1, 1, 2, 2, 3, 3, 4]

Out[14]:
4

Iterative version that returns the subsequence:

In [15]:
def pretty_print(arr):
'''A hacky function that makes every string in an array equal length, then prints
it row by row.'''
longest_length = len(arr[:-1][:-1])
nicer = []
for i, row in enumerate(arr):
new_row = []
for j, cell in enumerate(row):
cell_length = len(cell)
new_cell = cell + (' ' * padding_needed)
new_row.append(new_cell)
nicer.append(new_row)

[print(r) for r in nicer]

In [16]:
def lcs4(A, B, print_table=False):
# Create a lookup table. This is a 2D array with the letters of A as columns
# and the letters of B as rows. It includes a buffer row and column filled with empty strings.
columns = len(A) + 1
rows = len(B) + 1
lookup = [['' for n in range(columns)] for n in range(rows)]

# Since the lookup table is initialized with ''
# the base case (len(A) == 0 or len(B) == 0) is "baked-in".

# What's left is to go row by row, column by column and
# apply the iterative version of the recursive cases in lcs().
for row_id in range(1, rows):
for column_id in range(1, columns):
letter_in_A = A[column_id-1]
letter_in_B = B[row_id-1]
if letter_in_A == letter_in_B:
# "Recursive case 1"
lookup[row_id][column_id] = lookup[row_id - 1][column_id - 1] + letter_in_A
else:
# "Recursive case 2"
option1 = lookup[row_id][column_id-1]
option2 = lookup[row_id-1][column_id]
if len(option1) > len(option2):
lookup[row_id][column_id] = option1
else:
lookup[row_id][column_id] = option2

if print_table:
pretty_print(lookup)

return lookup[rows-1][columns-1]

In [17]:
lcs4(A, B, print_table=True)

['     ', '     ', '     ', '     ', '     ', '     ', '     ', '     ']
['     ', '     ', 'B    ', 'B    ', 'B    ', 'B    ', 'B    ', 'B    ']
['     ', '     ', 'B    ', 'B    ', 'B    ', 'BD   ', 'BD   ', 'BD   ']
['     ', '     ', 'B    ', 'BC   ', 'BC   ', 'BD   ', 'BD   ', 'BD   ']
['     ', 'A    ', 'B    ', 'BC   ', 'BC   ', 'BD   ', 'BDA  ', 'BDA  ']
['     ', 'A    ', 'AB   ', 'BC   ', 'BCB  ', 'BCB  ', 'BDA  ', 'BDAB ']
['     ', 'A    ', 'AB   ', 'BC   ', 'BCB  ', 'BCB  ', 'BCBA ', 'BDAB ']

Out[17]:
'BDAB'
In [18]:
lcs4(occidental, superdelicately)

Out[18]:
'deal'

The occidental, superdelicately example yields a table that is too large to display.

# For part 2¶

This notebook is long enough, so I've split it up into two parts.

The second part extends the LCS algorithm implemented above to numbers larger than 9 and explores an interesting connection between the longest common subsequence problem and the longest increasing subsequence problem.