We discussed the Karp-Rabin algorithm for string matching and Huffman coding.
The Karp-Rabin algorithm is a probabilistic string matching algorithm (find all occurrences of a string of length $m$ in a string of length $n$) running in linear time $O(m+n)$ (as opposed to the naive solution which has running time $O(nm)$).
Make sure you read our KR summary.
Make sure you understand the way the algorithm works, and in particular the "rolling hash mechanism", that is, how to compute the fingerprint of the next substring in $O(1)$ time, given the fingerprint of the current substring.
Make sure you understand the "arithmetization" step used by the algorithm.
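As a quick illustration of the rolling-hash update, here is a minimal sketch (using small illustrative values for the base and the modulus, not the constants used in the code below):

```python
def naive_fp(s, base, r):
    # Horner evaluation of the arithmetized string, modulo r
    fp = 0
    for ch in s:
        fp = (fp * base + ord(ch)) % r
    return fp

def roll(prev_fp, out_ch, in_ch, base, b_power, r):
    # drop the leading character's contribution, shift, add the new character
    return ((prev_fp - ord(out_ch) * b_power) * base + ord(in_ch)) % r

base, r = 256, 101                    # small illustrative values only
text, m = "abcd", 3
b_power = pow(base, m - 1, r)         # base**(m-1) mod r

fp = naive_fp(text[0:m], base, r)                   # fingerprint of "abc"
fp = roll(fp, text[0], text[m], base, b_power, r)   # fingerprint of "bcd", in O(1)
assert fp == naive_fp("bcd", base, r)
```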
Huffman coding is a lossless compression technique. Given a corpus, a character frequency analysis is performed and a prefix-free tree is built according to the analysis.
Given a text, the characters are encoded using the tree.
Given an encoding, decoding is performed in a "greedy" manner (for this reason, a prefix-free encoding is crucial).
While Huffman coding is optimal, if the character frequencies of the corpus differ from those of the encoded text, the compression can be inefficient.
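As a tiny illustration of why prefix-freeness makes greedy decoding unambiguous, here is a sketch with a small hand-made prefix-free code (not a Huffman-built one):

```python
# A prefix-free code: no codeword is a prefix of another, so scanning the bit
# stream left to right and emitting a character as soon as the accumulated
# bits match a codeword can never be "too greedy".
code = {"a": "0", "b": "10", "c": "11"}
decode = {v: k for (k, v) in code.items()}

def greedy_decode(bits):
    out, prefix = [], ""
    for bit in bits:
        prefix += bit
        if prefix in decode:       # the first match is the only possible match
            out.append(decode[prefix])
            prefix = ""
    assert prefix == ""            # a valid encoding ends on a codeword boundary
    return "".join(out)

encoded = "".join(code[ch] for ch in "abca")   # "010110"
assert greedy_decode(encoded) == "abca"
```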
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
Given a string $text$ of length $n$, and a short(er) string $pattern$ of length $m$ ($m\leq n$), report all occurrences of $pattern$ in $text$.
Example:
$text = $"abracadabra", $pattern = $"abr"
The requested output should be $[0,7]$, since $pattern$ appears in $text$ in indices $0,7$.
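For reference, the naive $O(nm)$ solution mentioned above can be sketched as:

```python
def find_matches_naive(pattern, text):
    """ compare pattern against every offset of text: O(n*m) time """
    m = len(pattern)
    return [s for s in range(len(text) - m + 1) if text[s:s+m] == pattern]

assert find_matches_naive("abr", "abracadabra") == [0, 7]
```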
def fingerprint(text, basis=2**16, r=2**32-3):
    """ used to compute karp-rabin fingerprint of the pattern
        employs Horner method (modulo r) """
    partial_sum = 0
    for ch in text:
        partial_sum = (partial_sum*basis + ord(ch)) % r
    return partial_sum
def text_fingerprint(text, m, basis=2**16, r=2**32-3):
    """ computes karp-rabin fingerprint of the text """
    f = []
    b_power = pow(basis, m-1, r)
    f.append(fingerprint(text[0:m], basis, r))
    # f[0] equals the first text fingerprint
    for s in range(1, len(text)-m+1):
        new_fingerprint = ((f[s-1] - ord(text[s-1])*b_power)*basis
                           + ord(text[s+m-1])) % r
        # compute f[s], based on f[s-1]
        f.append(new_fingerprint)  # append f[s] to existing f
    return f
def find_matches_KR(pattern, text, basis=2**16, r=2**32-3):
    """ find all occurrences of pattern in text
        using coin flipping Karp-Rabin algorithm """
    if len(pattern) > len(text):
        return []
    p = fingerprint(pattern, basis, r)
    f = text_fingerprint(text, len(pattern), basis, r)
    matches = [s for (s, f_s) in enumerate(f) if f_s == p]
    # list comprehension
    return matches
text = "abracadabra"
pattern = "abr"
fingerprint("abr")
6422933
base = 2**16
arit = ord("a")*(base**2) + ord("b")*(base**1) + ord("r")*(base**0)
arit
r = 2**32 - 3
fp = arit%r
fp
416618250354
6422933
text_fingerprint(text, 3)
[6422933, 7471495, 6357433, 6488452, 6357389, 6553988, 6357390, 6422933, 7471495]
find_matches_KR(pattern, text)
[0, 7]
The safe version makes sure no false positives occur. In the worst case, when all $n-m+1$ candidate substrings are indeed matches, it behaves like the naive solution in terms of time complexity.
def find_matches_KR_safe(pattern, text, basis=2**16, r=2**32-3):
    """ a safe version of KR
        checks every suspect for a match """
    if len(pattern) > len(text):
        return []
    p = fingerprint(pattern, basis, r)
    f = text_fingerprint(text, len(pattern), basis, r)
    matches = [s for (s, f_s) in enumerate(f) if f_s == p
               and text[s:s+len(pattern)] == pattern]
    # note that python performs "short circuit evaluation" of the 'and' statement
    return matches
This is the worst-case scenario for the safe version.
How will changing the pattern length $m$ affect both the safe version and the standard KR version?
import time
text = "a"*1000000
print("text = 'a'*", len(text))
for pattern in ["a"*100, "a"*1000, "a"*10000, "a"*100000]:
    print("pattern = 'a'*", len(pattern))
    for f in [find_matches_KR, find_matches_KR_safe]:
        t0 = time.perf_counter()
        res = f(pattern, text)
        t1 = time.perf_counter()
        print(f.__name__, t1-t0)
    print("")  # newline
text = 'a'* 1000000
pattern = 'a'* 100
find_matches_KR 2.527005435999996
find_matches_KR_safe 3.1151218930000084

pattern = 'a'* 1000
find_matches_KR 2.022127041999994
find_matches_KR_safe 3.558386521000017

pattern = 'a'* 10000
find_matches_KR 2.499810314000001
find_matches_KR_safe 6.067419519999987

pattern = 'a'* 100000
find_matches_KR 2.4022465289999957
find_matches_KR_safe 29.330972100999986
How do the regular/safe versions differ when we look for a random pattern in a random string?
Furthermore, how does the running time vary when $m$, the pattern length, increases?
import random
def gen_str(size):
    ''' Generate a random lowercase English string of length size '''
    s = ""
    for i in range(size):
        s += random.choice("abcdefghijklmnopqrstuvwxyz")
    return s
n = 1000000
m = 1000
text = gen_str(n)
pattern = gen_str(m)
print("random str of len n=", n, " , random pattern of length m=", m)
for f in [find_matches_KR, find_matches_KR_safe]:
    t0 = time.perf_counter()
    f(pattern, text)
    t1 = time.perf_counter()
    print(f.__name__, t1-t0)
m = 10000
pattern = gen_str(m)
print("random str of len n=", n, " , random pattern of length m=", m)
for f in [find_matches_KR, find_matches_KR_safe]:
    t0 = time.perf_counter()
    f(pattern, text)
    t1 = time.perf_counter()
    print(f.__name__, t1-t0)
m = 100000
pattern = gen_str(m)
print("random str of len n=", n, " , random pattern of length m=", m)
for f in [find_matches_KR, find_matches_KR_safe]:
    t0 = time.perf_counter()
    f(pattern, text)
    t1 = time.perf_counter()
    print(f.__name__, t1-t0)
random str of len n= 1000000 , random pattern of length m= 1000
[]
find_matches_KR 2.527568842999983
[]
find_matches_KR_safe 2.52221031900001

random str of len n= 1000000 , random pattern of length m= 10000
[]
find_matches_KR 2.490760865000027
[]
find_matches_KR_safe 2.4914142309999647

random str of len n= 1000000 , random pattern of length m= 100000
[]
find_matches_KR 2.41860688700001
[]
find_matches_KR_safe 2.451781514000004
By setting $r$ to be a power of the base we will obtain more false positives. Specifically, for $r = base$, we will get many more false positives. Why is that?
Note that the computation of the fingerprint is of the form $fp = (x \cdot base + y) \mod r$ for some $x,y$ (where $y$ is the arithmetization of the last character in the current substring).
In the case where $base = r$, this is equivalent to $(x \cdot r + y) \mod r = y \mod r$, i.e., the fingerprint takes only the last character of the substring into account!
Clearly, choosing an $r$ that shares a common factor with $base$ is not a good idea. This also serves as an intuition for why we choose $r$ to be prime: say, for example, that all the fingerprints we compute are even; then if $r$ is a multiple of $2$, $fp \mod r$ will always be even, and we will utilize only half of our range!
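A small numeric sanity check of this intuition (with tiny illustrative values, not the real base and $r$):

```python
# With r = base, the fingerprint (x*base + y) % r collapses to y % r,
# so only the last symbol matters:
base = 10                  # tiny illustrative values
x1, x2, y = 3, 5, 7        # stand-ins for arithmetized characters
assert (x1 * base + y) % base == y % base
assert (x1 * base + y) % base == (x2 * base + y) % base   # a collision!
# With a prime r coprime to base, both symbols influence the fingerprint:
r = 101
assert (x1 * base + y) % r != (x2 * base + y) % r
```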
# Many false positives
find_matches_KR("da", "abracadabra", basis=2**16, r=2**16)
# Same result here: the contribution of "x" is multiplied by basis,
# which equals r, and thus vanishes mod r
find_matches_KR("xa", "abracadabra", basis=2**16, r=2**16)
# The safe check still works, of course
find_matches_KR_safe("da", "abracadabra", basis=2**16, r=2**16)
[2, 4, 6, 9]
[2, 4, 6, 9]
[6]
fingerprint("da", 2**16, r=2**16)
ord("d")*(2**16)**1 + ord("a")
ord("a")
fingerprint("ca", 2**16, r=2**16)
ord("c")*(2**16)**1 + ord("a")
(ord("c")*(2**16)**1 + ord("a") )%(2**16)
97
6553697
97
97
6488161
97
base = 2**16
r = 2**16
fingerprint("bda", base, r)
ord("b")*(base**2) + ord("d")*(base**1) + ord("a")
(ord("b")*base + ord("d"))*base + ord("a")
((ord("b")*base + ord("d"))*base + ord("a"))%r == ord("a")%r
fingerprint("cda", base, r)
(ord("c")*base + ord("d"))*base + ord("a")
((ord("c")*base + ord("d"))*base + ord("a"))%r == ord("a")%r
97
420913348705
420913348705
True
97
425208316001
True
When $r = base^k$ for some $k$ we observe a similar behaviour, though this time only the last $k$ characters of the substring are taken into account by the rolling hash.
find_matches_KR("Humpty", "Humpty Dumpty", r=2**(16*5))
[0, 7]
base = 2**16
r = 2**(16*5)
fingerprint("Humpty", r=2**(16*5)) == fingerprint("Dumpty", r=2**(16*5))
fingerprint("Humpty", r=2**(16*5)) == fingerprint("Xumpty", r=2**(16*5))
fingerprint("Humpty", r=2**(16*5)) == fingerprint("Hxmpty", r=2**(16*5))
True
True
False
text_fingerprint("Humpty Dumpty", 6, r=2**(16*5))
[2158299737877522940025, 2010726629729956855840, 2066067987872461357124, 2139856371159933386869, 2232065040410175930477, 590314951159640293488, 1254411530052683432052, 2158299737877522940025]
# Why don't we return the last index?
find_matches_KR("Humpty", "Humpty Dumpty", r=2**(16*6))
[0]
def fingerprint(string, basis=2**16, r=2**32-3):
    """ used to compute karp-rabin fingerprint of the pattern
        employs Horner method (modulo r) """
    partial_sum = 0
    for x in string:
        partial_sum = (partial_sum*basis + ord(x)) % r
    return partial_sum

def slide(prev_fp, prev_char, next_char, b_power, basis=2**16, r=2**32-3):
    """ given the fingerprint of a substring, computes the fingerprint
        of the next (overlapping) substring in O(1) time """
    new_fp = ((prev_fp - ord(prev_char)*b_power)*basis + ord(next_char)) % r
    return new_fp
Build a generator which, given a string $text$ and a length parameter, generates all KR fingerprints of substrings of the desired length in $text$.
def kr_gen(text, length, basis=2**16, r=2**32-3):
    fp = fingerprint(text[:length], basis, r)
    yield fp
    b_power = pow(basis, length - 1, r)
    for s in range(1, len(text) - length + 1):
        fp = slide(fp, text[s - 1], text[s - 1 + length], b_power, basis, r)
        yield fp
gen = kr_gen("abracadabra", 3)
next(gen)
next(gen)
6422933
7471495
next(gen)
next(gen)
next(gen)
next(gen)
next(gen)
next(gen)
next(gen)
next(gen)
next(gen)
next(gen)
6357433
6488452
6357389
6553988
6357390
6422933
7471495
---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
<ipython-input-47-6fd1aa460113> in <module>
      6 next(gen)
      7 next(gen)
----> 8 next(gen)
      9 next(gen)
     10 next(gen)
StopIteration:
Build a generator which, given two strings $text1, text2$ and a length parameter $\ell$ generates all pairs of indices $(i,j)$ such that $text1[i:i+\ell] == text2[j:j+\ell]$.
def generate_shared_substrings(text1, text2, length):
    for i1, fp1 in enumerate(kr_gen(text1, length)):
        for i2, fp2 in enumerate(kr_gen(text2, length)):
            if fp1 == fp2:
                yield (i1, i2)
g = generate_shared_substrings("abcdef","xcdefx",3)
next(g)
next(g)
(2, 1)
(3, 2)
next(g)
---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
<ipython-input-51-e734f8aca5ac> in <module>
----> 1 next(g)
StopIteration:
Huffman's algorithm generates a variable-length, prefix-free code, based on the character frequencies in a corpus string. Notice that there are several degrees of freedom that can lead to different tree structures.
def ascii2bit_stream(text):
    """ Translates ASCII text to binary representation using
        7 bits per character. Assumes only ASCII chars """
    return "".join([bin(ord(c))[2:].zfill(7) for c in text])
########################################################
#### HUFFMAN CODE
########################################################
def char_count(text):
    """ Counts the number of each character in text.
        Returns a dictionary, with keys being the observed characters,
        values being the counts """
    d = {}
    for ch in text:
        if ch in d:
            d[ch] += 1
        else:
            d[ch] = 1
    return d
def build_huffman_tree(char_count_dict):
    """ Receives a dictionary with char:count entries
        Generates a LIST structure representing
        the binary Huffman encoding tree """
    queue = [(c, cnt) for (c, cnt) in char_count_dict.items()]
    while len(queue) > 1:
        # print(queue)
        # combine the two smallest elements
        A, cntA = extract_min(queue)   # smallest in queue
        B, cntB = extract_min(queue)   # next smallest
        chars = [A, B]
        weight = cntA + cntB           # combined weight
        queue.append((chars, weight))  # insert combined node
    # only the root node is left
    # print("final queue:", queue)
    root, weight_trash = extract_min(queue)  # weight_trash unused
    return root  # a LIST representing the tree structure

def extract_min(queue):
    """ queue is a list of 2-tuples (x,y).
        remove and return the tuple with minimal y """
    min_pair = min(queue, key=lambda pair: pair[1])
    queue.remove(min_pair)
    return min_pair
def generate_code(huff_tree, prefix=""):
    """ Receives a Huffman tree with embedded encoding,
        and a prefix of encodings.
        returns a dictionary where characters are
        keys and associated binary strings are values. """
    if isinstance(huff_tree, str):  # a leaf
        return {huff_tree: prefix}
    else:
        lchild, rchild = huff_tree[0], huff_tree[1]
        codebook = {}
        codebook.update(generate_code(lchild, prefix+'0'))
        codebook.update(generate_code(rchild, prefix+'1'))
        # oh, the beauty of recursion...
        return codebook

def compress(text, encoding_dict):
    """ compress text using encoding dictionary """
    assert isinstance(text, str)
    return "".join(encoding_dict[ch] for ch in text)

def reverse_dict(d):
    """ build the "reverse" of encoding dictionary """
    return {y: x for (x, y) in d.items()}

def decompress(bits, decoding_dict):
    prefix = ""
    result = []
    for bit in bits:
        prefix += bit
        if prefix in decoding_dict:
            result.append(decoding_dict[prefix])
            prefix = ""  # restart
    assert prefix == ""  # must finish last codeword
    return "".join(result)  # converts list of chars to a string
s = "live and let live"
Frequency dictionary
count_dict = char_count(s)
count_dict
{'l': 3, 'i': 2, 'v': 2, 'e': 3, ' ': 3, 'a': 1, 'n': 1, 'd': 1, 't': 1}
Build a Huffman tree:
The function returns a list which represents a binary tree as follows:
the sequence of indices that leads to a leaf (a char) corresponds to the path to that leaf, and therefore to the encoding of the character in the final Huffman code.
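As a quick illustration, a path of indices can be followed programmatically; here is a small sketch using a hand-made tree in the same list format (not the tree built below):

```python
# Index 0 means "go left" (bit '0'), index 1 means "go right" (bit '1').
tree = [[' ', ['i', 'v']], 'l']   # a hand-made tree in the same list format

def char_at(tree, path):
    """ follow a sequence of 0/1 indices down to a leaf """
    node = tree
    for step in path:
        node = node[step]
    return node

assert char_at(tree, [0, 0]) == ' '       # code "00"
assert char_at(tree, [0, 1, 1]) == 'v'    # code "011"
assert char_at(tree, [1]) == 'l'          # code "1"
```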
huff_tree = build_huffman_tree(count_dict)
# A list with two elements
huff_tree
# The left subtree of the root
huff_tree[0]
# The right subtree of the root
huff_tree[1]
# Leaves correspond to characters in the final code
huff_tree[0][0]
huff_tree[0][1][1]
[[' ', ['i', 'v']], [[['a', 'n'], ['d', 't']], ['l', 'e']]]
[' ', ['i', 'v']]
[[['a', 'n'], ['d', 't']], ['l', 'e']]
' '
'v'
Generate a binary representation for each char in the corpus, based on the tree:
d = generate_code(huff_tree)
d
{' ': '00', 'i': '010', 'v': '011', 'a': '1000', 'n': '1001', 'd': '1010', 't': '1011', 'l': '110', 'e': '111'}
Compressing some text using our code
text1 = "tell"
compress(text1, d)
len(compress(text1, d))#4+3+3+3
'1011111110110'
13
unicode_conversion = ascii2bit_stream("tell") #when we convert every character to 7 bits
len(unicode_conversion)#4*7
28
Decoding
e = reverse_dict(d)
e
{'00': ' ', '010': 'i', '011': 'v', '1000': 'a', '1001': 'n', '1010': 'd', '1011': 't', '110': 'l', '111': 'e'}
decompress('1011111110110', e)
'tell'
asci = str.join("",[chr(c) for c in range(128)])
asci
'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f'
def compression_ratio(text, corpus):
    d = generate_code(build_huffman_tree(char_count(corpus)))
    len_compress = len(compress(text, d))
    len_ascii = len(ascii2bit_stream(text))  # len(text)*7
    return len_compress/len_ascii
compression_ratio("ab", "ab")
1/7
0.14285714285714285
0.14285714285714285
compression_ratio("ab", "abcd")
2/7
0.2857142857142857
0.2857142857142857
compression_ratio("hello", "a"*100 + asci)
8/7
1.1428571428571428
1.1428571428571428
compression_ratio("hello", "a"*10 + asci)
compression_ratio("hello", "a"*40 + asci)
compression_ratio("hello", "a"*80 + asci)
1.0
1.0
1.1428571428571428
What happened here? Where is the precise cutoff between a 1 and 8/7 ratio?
If we have 65 occurrences of "a", then we are left with 127 occurrences of characters that are not "a".
Eventually, all 127 non-"a" chars are grouped in a tree of depth 7. Since the tree is built by pairing items of minimal weight, it is not a full binary tree but slightly skewed: one character (the last character in the dictionary) gets an encoding of length 6 and all the others get encodings of length 7 (this can easily be shown by induction).
At this point the "a" node is attached to the tree: "a" gets an encoding of length 1 and all other characters get encodings of length 8 (or 7, in the case of the last letter in the dictionary). Since "a" does not appear in "hello", the compression ratio will be 8/7.
When "a" appears fewer than 65 times, the compression ratio depends (among other things) on the order in which the dictionary is accessed. Specifically, in our implementation this results in a compression ratio of 1.0 for "hello" (i.e., all chars in "hello" have an encoding of length 7), but not, for example, for "12345".
c_cnt = char_count("a"*64 + asci)
d = build_huffman_tree(c_cnt)
#d[0]
#d[1]
code = generate_code(d)
#code
# Since all chars in "hello" have an encoding of length 7 the ratio will be 1.0
compression_ratio("hello", "a"*63 + asci)
# There are however characters with an encoding of length 8 resulting in a 8/7 ratio
compression_ratio("12345", "a"*63 + asci)
# If "a" appears more than 64 times, all characters (apart from "a" and chr(127)) will have an encoding of length 8
compression_ratio("hello", "a"*64 + asci)
# chr(127) is the only character with an encoding length of 7
compression_ratio(chr(127)*5, "a"*64 + asci)
1.0
1.1428571428571428
1.1428571428571428
1.0
compression_ratio("hello", "a"*4 + asci)
d = generate_code(build_huffman_tree(char_count("a"*4 + asci)))
1.0
d["a"]
[(key,d[key]) for key in d.keys() if len(d[key]) > 7]
'11110'
[('\x00', '11111010'), ('\x01', '11111011'), ('\x02', '11111100'), ('\x03', '11111101'), ('\x04', '11111110'), ('\x05', '11111111')]
d = generate_code(build_huffman_tree(char_count("a"*100 + asci)))
d['a']
d['h']
len([(key,d[key]) for key in d.keys() if len(d[key]) > 7])
len([(key,d[key]) for key in d.keys() if 1 < len(d[key]) <= 7])
'0'
'11101001'
126
1