We discussed the Karp-Rabin algorithm for string matching and Huffman coding.
The Karp-Rabin algorithm is a probabilistic string matching algorithm (find all occurrences of a string of length $m$ in a string of length $n$) running in linear time $O(m+n)$ (as opposed to the naive solution which has running time $O(nm)$).
Make sure you read our KR summary.
Make sure you understand the way the algorithm works, and in particular the "rolling hash mechanism", that is, how to compute the fingerprint of the next substring in $O(1)$ time, given the fingerprint of the current substring.
Make sure you understand the "arithmetization" step used by the algorithm.
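As a quick illustration of the rolling-hash update, here is a minimal sketch (using small illustrative values for the base and the modulus, not the constants used in the code below):

```python
def naive_fp(s, base, r):
    # Horner evaluation of the arithmetized string, modulo r
    fp = 0
    for ch in s:
        fp = (fp * base + ord(ch)) % r
    return fp

def roll(prev_fp, out_ch, in_ch, base, b_power, r):
    # drop the leading character's contribution, shift, add the new character
    return ((prev_fp - ord(out_ch) * b_power) * base + ord(in_ch)) % r

base, r = 256, 101                    # small illustrative values only
text, m = "abcd", 3
b_power = pow(base, m - 1, r)         # base**(m-1) mod r

fp = naive_fp(text[0:m], base, r)                   # fingerprint of "abc"
fp = roll(fp, text[0], text[m], base, b_power, r)   # fingerprint of "bcd", in O(1)
assert fp == naive_fp("bcd", base, r)
```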
Huffman coding is a lossless compression technique. Given a corpus, a character frequency analysis is performed and a prefix-free tree is built according to the analysis.
Given a text, the characters are encoded using the tree.
Given an encoding, decoding is performed in a "greedy" manner (for this reason, a prefix-free encoding is crucial).
While Huffman coding is optimal, if the character frequencies of the corpus differ from those of the encoded text, the compression can be inefficient.
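As a tiny illustration of why prefix-freeness makes greedy decoding unambiguous, here is a sketch with a small hand-made prefix-free code (not a Huffman-built one):

```python
# A prefix-free code: no codeword is a prefix of another, so scanning the bit
# stream left to right and emitting a character as soon as the accumulated
# bits match a codeword can never be "too greedy".
code = {"a": "0", "b": "10", "c": "11"}
decode = {v: k for (k, v) in code.items()}

def greedy_decode(bits):
    out, prefix = [], ""
    for bit in bits:
        prefix += bit
        if prefix in decode:       # the first match is the only possible match
            out.append(decode[prefix])
            prefix = ""
    assert prefix == ""            # a valid encoding ends on a codeword boundary
    return "".join(out)

encoded = "".join(code[ch] for ch in "abca")   # "010110"
assert greedy_decode(encoded) == "abca"
```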
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
Given a string $text$ of length $n$, and a short(er) string $pattern$ of length $m$ ($m\leq n$), report all occurrences of $pattern$ in $text$.
Example:
$text = $"abracadabra", $pattern = $"abr"
The requested output should be $[0,7]$, since $pattern$ appears in $text$ in indices $0,7$.
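For reference, the naive $O(nm)$ solution mentioned above can be sketched as:

```python
def find_matches_naive(pattern, text):
    """ compare pattern against every offset of text: O(n*m) time """
    m = len(pattern)
    return [s for s in range(len(text) - m + 1) if text[s:s+m] == pattern]

assert find_matches_naive("abr", "abracadabra") == [0, 7]
```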
def fingerprint(text, basis=2**16, r=2**32-3):
    """ used to compute karp-rabin fingerprint of the pattern
        employs Horner method (modulo r) """
    partial_sum = 0
    for ch in text:
        partial_sum = (partial_sum*basis + ord(ch)) % r
    return partial_sum
def text_fingerprint(text, m, basis=2**16, r=2**32-3):
    """ computes karp-rabin fingerprint of the text """
    f = []
    b_power = pow(basis, m-1, r)
    f.append(fingerprint(text[0:m], basis, r))
    # f[0] equals the first text fingerprint
    for s in range(1, len(text)-m+1):
        new_fingerprint = ((f[s-1] - ord(text[s-1])*b_power)*basis
                           + ord(text[s+m-1])) % r
        # compute f[s], based on f[s-1]
        f.append(new_fingerprint)  # append f[s] to existing f
    return f
def find_matches_KR(pattern, text, basis=2**16, r=2**32-3):
    """ find all occurrences of pattern in text
        using coin flipping Karp-Rabin algorithm """
    if len(pattern) > len(text):
        return []
    p = fingerprint(pattern, basis, r)
    f = text_fingerprint(text, len(pattern), basis, r)
    matches = [s for (s, f_s) in enumerate(f) if f_s == p]
    # list comprehension
    return matches
text = "abracadabra"
pattern = "abr"
fingerprint("abr")
6422933
base = 2**16
arit = ord("a")*(base**2) + ord("b")*(base**1) + ord("r")*(base**0)
arit
r = 2**32 - 3
fp = arit%r
fp
416618250354
6422933
text_fingerprint(text, 3)
[6422933, 7471495, 6357433, 6488452, 6357389, 6553988, 6357390, 6422933, 7471495]
find_matches_KR(pattern, text)
[0, 7]
The safe version makes sure no false positives occur. In the worst case, when all $n-m+1$ candidate substrings are indeed matches, it behaves like the naive solution in terms of time complexity.
def find_matches_KR_safe(pattern, text, basis=2**16, r=2**32-3):
    """ a safe version of KR
        checks every suspect for a match """
    if len(pattern) > len(text):
        return []
    p = fingerprint(pattern, basis, r)
    f = text_fingerprint(text, len(pattern), basis, r)
    matches = [s for (s, f_s) in enumerate(f) if f_s == p
               and text[s:s+len(pattern)] == pattern]
    # note that python performs "short circuit evaluation" of the 'and' statement
    return matches
This is the worst-case scenario for the safe version.
How will changing the pattern length $m$ affect both the safe version and the standard KR version?
import time
text = "a"*1000000
print("text = 'a'*", len(text))
for pattern in ["a"*100, "a"*1000, "a"*10000, "a"*100000]:
    print("pattern = 'a'*", len(pattern))
    for f in [find_matches_KR, find_matches_KR_safe]:
        t0 = time.perf_counter()
        res = f(pattern, text)
        t1 = time.perf_counter()
        print(f.__name__, t1-t0)
    print("")  # newline
text = 'a'* 1000000
pattern = 'a'* 100
find_matches_KR 2.527005435999996
find_matches_KR_safe 3.1151218930000084

pattern = 'a'* 1000
find_matches_KR 2.022127041999994
find_matches_KR_safe 3.558386521000017

pattern = 'a'* 10000
find_matches_KR 2.499810314000001
find_matches_KR_safe 6.067419519999987

pattern = 'a'* 100000
find_matches_KR 2.4022465289999957
find_matches_KR_safe 29.330972100999986
How do the regular/safe versions differ when we look for a random pattern in a random string?
Furthermore, how does the running time vary when $m$, the pattern length, increases?
import random
def gen_str(size):
    ''' Generate a random lowercase English string of length size '''
    s = ""
    for i in range(size):
        s += random.choice("abcdefghijklmnopqrstuvwxyz")
    return s
n = 1000000
m = 1000
text = gen_str(n)
pattern = gen_str(m)
print("random str of len n=", n, " , random pattern of length m=", m)
for f in [find_matches_KR, find_matches_KR_safe]:
    t0 = time.perf_counter()
    f(pattern, text)
    t1 = time.perf_counter()
    print(f.__name__, t1-t0)
m = 10000
pattern = gen_str(m)
print("random str of len n=", n, " , random pattern of length m=", m)
for f in [find_matches_KR, find_matches_KR_safe]:
    t0 = time.perf_counter()
    f(pattern, text)
    t1 = time.perf_counter()
    print(f.__name__, t1-t0)
m = 100000
pattern = gen_str(m)
print("random str of len n=", n, " , random pattern of length m=", m)
for f in [find_matches_KR, find_matches_KR_safe]:
    t0 = time.perf_counter()
    f(pattern, text)
    t1 = time.perf_counter()
    print(f.__name__, t1-t0)
random str of len n= 1000000 , random pattern of length m= 1000
[]
find_matches_KR 2.527568842999983
[]
find_matches_KR_safe 2.52221031900001

random str of len n= 1000000 , random pattern of length m= 10000
[]
find_matches_KR 2.490760865000027
[]
find_matches_KR_safe 2.4914142309999647

random str of len n= 1000000 , random pattern of length m= 100000
[]
find_matches_KR 2.41860688700001
[]
find_matches_KR_safe 2.451781514000004
By setting $r$ to be a power of the base we will obtain more false positives. Specifically, for $r = base$, we will get many more false positives. Why is that?
Note that the computation of the fingerprint is of the form $fp = (x \cdot base + y) \mod r$ for some $x,y$ (where $y$ is the arithmetization of the last character in the current substring).
In the case where $base = r$, this is equivalent to $(x \cdot r + y) \mod r = y \mod r$, i.e., the fingerprint takes only the last character of the substring into account!
Clearly, choosing an $r$ that shares a common factor with $base$ is not a good idea. This also serves as an intuition for why we choose $r$ to be prime: say, for example, that all the fingerprints we compute are even; then if $r$ is a multiple of $2$, $fp \mod r$ will always be even, and we will utilize only half of our range!
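A small numeric sanity check of this intuition (with tiny illustrative values, not the real base and $r$):

```python
# With r = base, the fingerprint (x*base + y) % r collapses to y % r,
# so only the last symbol matters:
base = 10                  # tiny illustrative values
x1, x2, y = 3, 5, 7        # stand-ins for arithmetized characters
assert (x1 * base + y) % base == y % base
assert (x1 * base + y) % base == (x2 * base + y) % base   # a collision!
# With a prime r coprime to base, both symbols influence the fingerprint:
r = 101
assert (x1 * base + y) % r != (x2 * base + y) % r
```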
# Many false positives
find_matches_KR("da", "abracadabra", basis=2**16, r=2**16)
# Same result here: the contribution of "x" is multiplied by basis,
# which equals r, and thus vanishes mod r
find_matches_KR("xa", "abracadabra", basis=2**16, r=2**16)
# The safe check still works, of course
find_matches_KR_safe("da", "abracadabra", basis=2**16, r=2**16)
[2, 4, 6, 9]
[2, 4, 6, 9]
[6]
fingerprint("da", 2**16, r=2**16)
ord("d")*(2**16)**1 + ord("a")
ord("a")
fingerprint("ca", 2**16, r=2**16)
ord("c")*(2**16)**1 + ord("a")
(ord("c")*(2**16)**1 + ord("a") )%(2**16)
97
6553697
97
97
6488161
97
base = 2**16
r = 2**16
fingerprint("bda", base, r)
ord("b")*(base**2) + ord("d")*(base**1) + ord("a")
(ord("b")*base + ord("d"))*base + ord("a")
((ord("b")*base + ord("d"))*base + ord("a"))%r == ord("a")%r
fingerprint("cda", base, r)
(ord("c")*base + ord("d"))*base + ord("a")
((ord("c")*base + ord("d"))*base + ord("a"))%r == ord("a")%r
97
420913348705
420913348705
True
97
425208316001
True
When $r = base^k$ for some $k$ we observe a similar behaviour, though this time only the last $k$ characters of the substring are taken into account by the rolling hash.
find_matches_KR("Humpty", "Humpty Dumpty", r=2**(16*5))
[0, 7]
base = 2**16
r = 2**(16*5)
fingerprint("Humpty", r=2**(16*5)) == fingerprint("Dumpty", r=2**(16*5))
fingerprint("Humpty", r=2**(16*5)) == fingerprint("Xumpty", r=2**(16*5))
fingerprint("Humpty", r=2**(16*5)) == fingerprint("Hxmpty", r=2**(16*5))
True
True
False
text_fingerprint("Humpty Dumpty", 6, r=2**(16*5))
[2158299737877522940025, 2010726629729956855840, 2066067987872461357124, 2139856371159933386869, 2232065040410175930477, 590314951159640293488, 1254411530052683432052, 2158299737877522940025]
# Why don't we return the last index?
find_matches_KR("Humpty", "Humpty Dumpty", r=2**(16*6))
[0]
def fingerprint(string, basis=2**16, r=2**32-3):
    """ used to compute karp-rabin fingerprint of the pattern
        employs Horner method (modulo r) """
    partial_sum = 0
    for x in string:
        partial_sum = (partial_sum*basis + ord(x)) % r
    return partial_sum

def slide(prev_fp, prev_char, next_char, b_power, basis=2**16, r=2**32-3):
    """ given the fingerprint of a substring, computes the fingerprint
        of the next (overlapping) substring in O(1) time """
    new_fp = ((prev_fp - ord(prev_char)*b_power)*basis + ord(next_char)) % r
    return new_fp
Build a generator which, given a string $text$ and a length parameter, generates all KR fingerprints of substrings of the desired length in $text$.
def kr_gen(text, length, basis=2**16, r=2**32-3):
    fp = fingerprint(text[:length], basis, r)
    yield fp
    b_power = pow(basis, length - 1, r)
    for s in range(1, len(text) - length + 1):
        fp = slide(fp, text[s - 1], text[s - 1 + length], b_power, basis, r)
        yield fp
gen = kr_gen("abracadabra", 3)
next(gen)
next(gen)
6422933
7471495
next(gen)
next(gen)
next(gen)
next(gen)
next(gen)
next(gen)
next(gen)
next(gen)
next(gen)
next(gen)
6357433
6488452
6357389
6553988
6357390
6422933
7471495
---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
<ipython-input-47-6fd1aa460113> in <module>
      6 next(gen)
      7 next(gen)
----> 8 next(gen)
      9 next(gen)
     10 next(gen)
StopIteration:
Build a generator which, given two strings $text1, text2$ and a length parameter $\ell$ generates all pairs of indices $(i,j)$ such that $text1[i:i+\ell] == text2[j:j+\ell]$.
def generate_shared_substrings(text1, text2, length):
    for i1, fp1 in enumerate(kr_gen(text1, length)):
        for i2, fp2 in enumerate(kr_gen(text2, length)):
            if fp1 == fp2:
                yield (i1, i2)
g = generate_shared_substrings("abcdef","xcdefx",3)
next(g)
next(g)
(2, 1)
(3, 2)
next(g)
---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
<ipython-input-51-e734f8aca5ac> in <module>
----> 1 next(g)
StopIteration:
Huffman's algorithm generates a variable-length, prefix-free code, based on the character frequencies in a corpus string. Notice that there are several degrees of freedom that can lead to different tree structures.
def ascii2bit_stream(text):
    """ Translates ASCII text to binary representation using
        7 bits per character. Assumes only ASCII chars """
    return "".join([bin(ord(c))[2:].zfill(7) for c in text])
########################################################
#### HUFFMAN CODE
########################################################
def char_count(text):
    """ Counts the number of each character in text.
        Returns a dictionary, with keys being the observed characters,
        values being the counts """
    d = {}
    for ch in text:
        if ch in d:
            d[ch] += 1
        else:
            d[ch] = 1
    return d
def build_huffman_tree(char_count_dict):
    """ Receives a dictionary with char:count entries
        Generates a LIST structure representing
        the binary Huffman encoding tree """
    queue = [(c, cnt) for (c, cnt) in char_count_dict.items()]
    while len(queue) > 1:
        # print(queue)
        # combine the two smallest elements
        A, cntA = extract_min(queue)   # smallest in queue
        B, cntB = extract_min(queue)   # next smallest
        chars = [A, B]
        weight = cntA + cntB           # combined weight
        queue.append((chars, weight))  # insert combined node
    # only the root node is left
    # print("final queue:", queue)
    root, weight_trash = extract_min(queue)  # weight_trash unused
    return root  # a LIST representing the tree structure

def extract_min(queue):
    """ queue is a list of 2-tuples (x,y).
        remove and return the tuple with minimal y """
    min_pair = min(queue, key=lambda pair: pair[1])
    queue.remove(min_pair)
    return min_pair
def generate_code(huff_tree, prefix=""):
    """ Receives a Huffman tree with embedded encoding,
        and a prefix of encodings.
        returns a dictionary where characters are
        keys and associated binary strings are values. """
    if isinstance(huff_tree, str):  # a leaf
        return {huff_tree: prefix}
    else:
        lchild, rchild = huff_tree[0], huff_tree[1]
        codebook = {}
        codebook.update(generate_code(lchild, prefix+'0'))
        codebook.update(generate_code(rchild, prefix+'1'))
        # oh, the beauty of recursion...
        return codebook

def compress(text, encoding_dict):
    """ compress text using encoding dictionary """
    assert isinstance(text, str)
    return "".join(encoding_dict[ch] for ch in text)

def reverse_dict(d):
    """ build the "reverse" of encoding dictionary """
    return {y: x for (x, y) in d.items()}

def decompress(bits, decoding_dict):
    prefix = ""
    result = []
    for bit in bits:
        prefix += bit
        if prefix in decoding_dict:
            result.append(decoding_dict[prefix])
            prefix = ""  # restart
    assert prefix == ""  # must finish last codeword
    return "".join(result)  # converts list of chars to a string
s = "live and let live"
Frequency dictionary
count_dict = char_count(s)
count_dict
{'l': 3, 'i': 2, 'v': 2, 'e': 3, ' ': 3, 'a': 1, 'n': 1, 'd': 1, 't': 1}
Build a Huffman tree:
The function returns a list which represents a binary tree as follows:
the sequence of indices that leads to a leaf (a char) corresponds to the path to that leaf, and therefore to the encoding of the character in the final Huffman code.
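As a quick illustration, a path of indices can be followed programmatically; here is a small sketch using a hand-made tree in the same list format (not the tree built below):

```python
# Index 0 means "go left" (bit '0'), index 1 means "go right" (bit '1').
tree = [[' ', ['i', 'v']], 'l']   # a hand-made tree in the same list format

def char_at(tree, path):
    """ follow a sequence of 0/1 indices down to a leaf """
    node = tree
    for step in path:
        node = node[step]
    return node

assert char_at(tree, [0, 0]) == ' '       # code "00"
assert char_at(tree, [0, 1, 1]) == 'v'    # code "011"
assert char_at(tree, [1]) == 'l'          # code "1"
```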
huff_tree = build_huffman_tree(count_dict)
# A list with two elements
huff_tree
# The left subtree of the root
huff_tree[0]
# The right subtree of the root
huff_tree[1]
# Leaves correspond to characters in the final code
huff_tree[0][0]
huff_tree[0][1][1]
[[' ', ['i', 'v']], [[['a', 'n'], ['d', 't']], ['l', 'e']]]
[' ', ['i', 'v']]
[[['a', 'n'], ['d', 't']], ['l', 'e']]
' '
'v'
Generate a binary representation for each char in the corpus, based on the tree:
d = generate_code(huff_tree)
d
{' ': '00', 'i': '010', 'v': '011', 'a': '1000', 'n': '1001', 'd': '1010', 't': '1011', 'l': '110', 'e': '111'}
Compressing some text using our code
text1 = "tell"
compress(text1, d)
len(compress(text1, d))#4+3+3+3
'1011111110110'
13
unicode_conversion = ascii2bit_stream("tell") #when we convert every character to 7 bits
len(unicode_conversion)#4*7
28
Decoding
e = reverse_dict(d)
e
{'00': ' ', '010': 'i', '011': 'v', '1000': 'a', '1001': 'n', '1010': 'd', '1011': 't', '110': 'l', '111': 'e'}
decompress('1011111110110', e)
'tell'
asci = str.join("",[chr(c) for c in range(128)])
asci
'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f'
def compression_ratio(text, corpus):
    d = generate_code(build_huffman_tree(char_count(corpus)))
    len_compress = len(compress(text, d))
    len_ascii = len(ascii2bit_stream(text))  # len(text)*7
    return len_compress/len_ascii
compression_ratio("ab", "ab")
1/7
0.14285714285714285
0.14285714285714285
compression_ratio("ab", "abcd")
2/7
0.2857142857142857
0.2857142857142857
compression_ratio("hello", "a"*100 + asci)
8/7
1.1428571428571428
1.1428571428571428
compression_ratio("hello", "a"*10 + asci)
compression_ratio("hello", "a"*40 + asci)
compression_ratio("hello", "a"*80 + asci)
1.0
1.0
1.1428571428571428
What happened here? Where is the precise cutoff between a 1 and 8/7 ratio?
If we have 65 occurrences of "a", then we are left with 127 occurrences of characters that are not "a".
Eventually, all 127 non-"a" chars are grouped in a tree of depth 7. Since the tree is built by pairing items of minimal weight, it is not a full binary tree but slightly skewed: one character (the last character in the dictionary) gets an encoding of length 6 and all the others get encodings of length 7 (this can easily be shown by induction).
At this point the "a" node is attached to the tree: "a" gets an encoding of length 1 and all other characters get encodings of length 8 (or 7, in the case of the last letter in the dictionary). Since "a" does not appear in "hello", the compression ratio will be 8/7.
When "a" appears fewer than 65 times, the compression ratio depends (among other things) on the order in which the dictionary is accessed. Specifically, in our implementation this results in a compression ratio of 1.0 for "hello" (i.e., all chars in "hello" have an encoding of length 7), but not, for example, for "12345".
c_cnt = char_count("a"*64 + asci)
d = build_huffman_tree(c_cnt)
#d[0]
#d[1]
code = generate_code(d)
#code
# Since all chars in "hello" have an encoding of length 7 the ratio will be 1.0
compression_ratio("hello", "a"*63 + asci)
# There are however characters with an encoding of length 8 resulting in a 8/7 ratio
compression_ratio("12345", "a"*63 + asci)
# If "a" appears more than 64 times, all characters (apart from "a" and chr(127)) will have an encoding of length 8
compression_ratio("hello", "a"*64 + asci)
# chr(127) is the only character with an encoding length of 7
compression_ratio(chr(127)*5, "a"*64 + asci)
1.0
1.1428571428571428
1.1428571428571428
1.0
compression_ratio("hello", "a"*4 + asci)
d = generate_code(build_huffman_tree(char_count("a"*4 + asci)))
1.0
d["a"]
[(key,d[key]) for key in d.keys() if len(d[key]) > 7]
'11110'
[('\x00', '11111010'), ('\x01', '11111011'), ('\x02', '11111100'), ('\x03', '11111101'), ('\x04', '11111110'), ('\x05', '11111111')]
d = generate_code(build_huffman_tree(char_count("a"*100 + asci)))
d['a']
d['h']
len([(key,d[key]) for key in d.keys() if len(d[key]) > 7])
len([(key,d[key]) for key in d.keys() if 1 < len(d[key]) <= 7])
'0'
'11101001'
126
1