cs1001.py , Tel Aviv University, Spring 2018/19

Recitation 10

We discussed iterators, generators, and the Karp-Rabin algorithm for string matching

Takeaways:
  • A generator function is a function that contains the yield command and returns a genertor object.
  • The Karp-Rabin algorithm is a probabilistic string matching algorithm (find all occurences of a string of length $m$ in a string of length $n$) running in linear time $O(m+n)$ (as opposed to the naive solution which has running time $O(nm)$).
  • Make sure you read our KR summary.
  • Make sure you understand the way the algorithm works, and in particular the "rolling hash mechanism", that is, how to compute the fingerprint of the next substring in $O(1)$ time, given the fingerprint of the current substring.
  • Make sure you understand the "arithmetization" step used by the algorithm.

Code for printing several outputs in one cell (not part of the recitation):

In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

Iterators and Generators

Iterators

In [2]:
l = "123"  #[1,2,3]
li = iter(l)
type(li)
li2 = iter(l)
Out[2]:
str_iterator
In [25]:
next(li)
Out[25]:
'1'
In [26]:
next(li)
Out[26]:
'2'
In [27]:
z = next(li)
print("z is", z)
z is 3
In [28]:
next(li)
---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
<ipython-input-28-c9b8845252db> in <module>
----> 1 next(li)

StopIteration: 
In [29]:
next(li2)
Out[29]:
'1'
In [30]:
for elem in li2:
    print(elem)
2
3

Generators

In [32]:
def countdown_gen():
    ''' calling it creates an iterator for the values 5,4,3,2,1,'launch' '''
    yield 5
    yield 4
    yield 3
    yield 2
    yield 1
    yield 'launch'
In [33]:
cd = countdown_gen()
cd #generator object (private case of iterator)
Out[33]:
<generator object countdown_gen at 0x000002623F0CFE58>
In [34]:
next(cd) 
Out[34]:
5
In [35]:
next(cd) 
Out[35]:
4
In [36]:
for e in cd:
    print(e)
3
2
1
launch

With generators we can have infinite iterations

In [ ]:
def countdown_infinite():
    i = 0
    while True:
        yield i
        i += 1
        
[i for i in range(10)]

Section (b)

In [37]:
def SomePairs():
    i=1
    while True:
        for j in range(i):
            yield(i,j)
        i=i+1

gen = SomePairs()
[next(gen) for _ in range(10)]
Out[37]:
[(1, 0),
 (2, 0),
 (2, 1),
 (3, 0),
 (3, 1),
 (3, 2),
 (4, 0),
 (4, 1),
 (4, 2),
 (4, 3)]

Section (c)

In [38]:
def RevGen(PairsGen):
    pairs = PairsGen()
    while True:
        pair = next(pairs)
        yield(pair[1],pair[0])

gen = RevGen(SomePairs)
[next(gen) for _ in range(10)]
Out[38]:
[(0, 1),
 (0, 2),
 (1, 2),
 (0, 3),
 (1, 3),
 (2, 3),
 (0, 4),
 (1, 4),
 (2, 4),
 (3, 4)]

Section (d1)

In [41]:
def UnionGenerators(gen1, gen2):
    while True:
        yield next(gen1)
        yield next(gen2)
        
ug = UnionGenerators(SomePairs(), RevGen(SomePairs))

for _ in range(10):
    next(ug)
Out[41]:
(1, 0)
Out[41]:
(0, 1)
Out[41]:
(2, 0)
Out[41]:
(0, 2)
Out[41]:
(2, 1)
Out[41]:
(1, 2)
Out[41]:
(3, 0)
Out[41]:
(0, 3)
Out[41]:
(3, 1)
Out[41]:
(1, 3)

Section (d2)

In [ ]:
def EqPairs():
    i=0
    while True:
        yield (i,i)
        i=i+1
        
def AllPairs():
    return UnionGenerators(SomePairs(), UnionGenerators(EqPairs(), RevGen(SomePairs)))

gen = AllPairs()
[next(gen) for _ in range(10)]

The string-matching problem

Given a string $text$ of length $n$, and a short(er) string $pattern$ of length $m$ ($m\leq n$), report all occurrances of $pattern$ in $text$.

Example:

$text = $"abracadabra", $pattern = $"abr"

The requested output should be $[0,7]$, since $pattern$ appears in $text$ in indices $0,7$.

Karp-Rabin Algorithm

  • The algorithm works as follows:
    • An initial "fingerprint" is computed for the pattern string and the first substring of length $m$ in the text
    • If both fingerprints are equal, we assume a match
    • Then, a "rolling hash" mechanism computes the next fingerprint in $O(1)$ time given the current fingerprint in the text
    • At each stage again a comparison is made and equal fingerprints are considered a match
In [30]:
def is_prime(N, show_witness=False):
    """ probabilistic test for N's compositeness """
    for i in range(0,100):
        a = random.randint(1,N-1) # a is a random integer in [1..N-1]
        if pow(a,N-1,N) != 1:
            if show_witness:  # caller wishes to see a witness
                print(N,"is composite","\n",a,"is a witness, i=",i+1)
            return False
    return True


def find_prime(n):
    """ find random n-bit long prime """
    while(True):
        candidate = random.randrange(2**(n-1),2**n)
        if is_prime(candidate):
            return candidate


def fingerprint(string, basis, r):
    """ used to computes karp-rabin fingerprint of the pattern
        and of the first substring in the text
        employs Horner method (modulo r) """
    s = 0
    for ch in string:
        s = (s*basis + ord(ch)) % r # Horner
    return s




def substring_fingerprints(string, m, basis, r):
    """ return a list of all fingerprints of size m windows in string """
    fp_list = []
    b_power = pow(basis, m-1, r)

    # compute first fingerprint
    fp_list.append(fingerprint(string[0:m], basis, r))

    # compute f_list[s], based on f_list[s-1]
    for s in range(1,len(string)-m+1): # O(n-m-1), each iteration O(1)
        next_fingerprint = \
                ((fp_list[s-1] - ord(string[s-1])*b_power) * basis \
                              + ord(string[s+m-1])) % r
        fp_list.append(next_fingerprint)

    return fp_list

def find_matches_KR(text, pattern, r=None):
    """ find all shifts of occurances of pattern in text """
    if len(pattern) > len(text):
        return [] #no matches

    basis = 2**16 # assume 16 bits are enough for all chars
    if r == None: 
        r = find_prime(32) # randomly pick a 32 bit long prime using 
                           # good old find_prime from an earlier lecture

    p = fingerprint(pattern, basis, r)
    fp_list = substring_fingerprints(text, len(pattern), basis, r)

    matches = [shift for shift in range(len(fp_list)) if fp_list[shift] == p]

    return matches
In [29]:
text = "abracadabra"
pattern = "abr"
In [43]:
r = find_prime(32)
base = 2**16

r
Out[43]:
2149219973
In [44]:
fingerprint("abr", base, r)
Out[44]:
1818795565
In [45]:
base = 2**16
arit = ord("a")*(base**2) + ord("b")*(base**1) + ord("r")*(base**0)
arit
fp = arit%r
fp
Out[45]:
416618250354
Out[45]:
1818795565
In [46]:
substring_fingerprints(text, 3, base, r)
Out[46]:
[1818795565,
 1816371474,
 1759694964,
 1818861084,
 1811784715,
 1818926620,
 1808312063,
 1818795565,
 1816371474]
In [48]:
find_matches_KR(text, pattern, r)
Out[48]:
[0, 7]

Safe version

Makes sure no false positives occur. In the worst case, when all $n-m$ possible substrings are indeed matches, behaves as the naive solution in terms of time complexity.

In [69]:
def find_matches_KR_safe(text, pattern, r=None):
    """ a safe version of KR
        checks every suspect for a match """
    if len(pattern) > len(text):
        return [] #no matches

    basis = 2**16 # assume 16 bits are enough for all chars
    if r == None: 
        r = find_prime(32) # randomly pick a 32 bit long prime using 
                           # good old find_prime from an earlier lecture

    p = fingerprint(pattern, basis, r)
    fp_list = substring_fingerprints(text, len(pattern), basis, r)

    matches = [shift for shift in range(len(fp_list)) \
               if fp_list[shift] == p and \
                  text[shift:shift+len(pattern)]==pattern]

    return matches

Competition between versions on single char string.

This is the worst-case scenario for the safe version. Changing $m$ has a greater effect on the safe version than on the standard KR.

In [56]:
import time

text = "a"*1000000
print("text = 'a'*",len(text))
print()
for pattern in ["a"*100, "a"*1000, "a"*10000, "a"*100000]:
    print("pattern = 'a'*",len(pattern))
    r = find_prime(32)
    for f in [find_matches_KR, find_matches_KR_safe]:
        t0=time.perf_counter()
        res=f(text, pattern, r)
        t1=time.perf_counter()
        print (f.__name__, t1-t0 )
    print("") #newline
text = 'a'* 1000000

pattern = 'a'* 100
find_matches_KR 1.4179207529959967
find_matches_KR_safe 1.6589124150050338

pattern = 'a'* 1000
find_matches_KR 1.3420768009964377
find_matches_KR_safe 1.775463817990385

pattern = 'a'* 10000
find_matches_KR 1.2947049719950883
find_matches_KR_safe 2.638186254000175

pattern = 'a'* 100000
find_matches_KR 1.4077826730062952
find_matches_KR_safe 13.829727393997018

Competition between versions on random strings.

Note that the standard and safe versions of KR has similar running times. Moreover, as $m$ increases, the running time slightly decreases since there are less candidates to consider.

In [57]:
import random
def gen_str(size):
    ''' Generate a random lowercase English string of length size'''
    s=""
    for i in range(size):
        s+=random.choice("abcdefghijklmnopqrstuvwxyz")
    return s


n=1000000
m=1000
text = gen_str(n)
pattern = gen_str(m)
print("random str of len n=", n, " , random pattern of length m=",m)
r = find_prime(32)
for f in [find_matches_KR, find_matches_KR_safe]:
    t0=time.perf_counter()
    f(text, pattern, r)
    t1=time.perf_counter()
    print (f.__name__, t1-t0)
    

m=10000
pattern = gen_str(m)
print("random str of len n=", n, " , random pattern of length m=",m)
r = find_prime(32)
for f in [find_matches_KR, find_matches_KR_safe]:
    t0=time.perf_counter()
    f(text, pattern, r)
    t1=time.perf_counter()
    print (f.__name__, t1-t0)
    
m=100000
pattern = gen_str(m)
print("random str of len n=", n, " , random pattern of length m=",m)
r = find_prime(32)
for f in [find_matches_KR, find_matches_KR_safe]:
    t0=time.perf_counter()
    f(text, pattern, r)
    t1=time.perf_counter()
    print (f.__name__, t1-t0)
random str of len n= 1000000  , random pattern of length m= 1000
Out[57]:
[]
find_matches_KR 1.4785060709982645
Out[57]:
[]
find_matches_KR_safe 1.3623063810082385
random str of len n= 1000000  , random pattern of length m= 10000
Out[57]:
[]
find_matches_KR 1.383506228987244
Out[57]:
[]
find_matches_KR_safe 1.3038542309950572
random str of len n= 1000000  , random pattern of length m= 100000
Out[57]:
[]
find_matches_KR 1.25419370700547
Out[57]:
[]
find_matches_KR_safe 1.6594200499966973

Choice of $r$

By setting $r$ to be a power of the base we will obtain more false-positives. Specifically, for $r=base$, we will get many more false positives. Why is that?

Note that the computation of the fingerprint is of the form $fp = (x \cdot base + y) \mod r$ for some $x,y$ (where $y$ is the arithmetization of the last character in the current substring).

In the case where $base = r$ this is equivalent to $(x \cdot r + y) \mod r = y \mod r$, i.e., the fingerprint takes into account only the last character of the substring!

Clearly, choosing $r$ which is a multiple of $base$ is not a good idea. This may also serve as an intuition for why we choose $r$ to be prime - say for example we know that all the fingerprints we compute are even, then if our $r$ is a multiple of $2$, we will always have $fp \mod r$ an even number, and thus we will only utilize half of our range!

In [60]:
# Many false positives
find_matches_KR("abracadabra", "da", r=2**16)

# Same result here since the "x" will be cancelled out 
find_matches_KR("abracadabra", "xa", r=2**16)

# The safe check still works, of course
find_matches_KR_safe("abracadabra", "da", r=2**16)
Out[60]:
[2, 4, 6, 9]
Out[60]:
[2, 4, 6, 9]
Out[60]:
[6]
In [61]:
fingerprint("da", 2**16, r=2**16)
ord("d")*(2**16)**1 + ord("a")
ord("a")

fingerprint("ca", 2**16, r=2**16)
ord("c")*(2**16)**1 + ord("a")
(ord("c")*(2**16)**1 + ord("a") )%2**16
Out[61]:
97
Out[61]:
6553697
Out[61]:
97
Out[61]:
97
Out[61]:
6488161
Out[61]:
97
In [62]:
base = 2**16
r = 2**16
fingerprint("bda", base, r)
ord("b")*(base**2) + ord("d")*(base**1) + ord("a")
(ord("b")*base + ord("d"))*base + ord("a")
((ord("b")*base + ord("d"))*base + ord("a"))%r == ord("a")%r


fingerprint("cda", base, r)
(ord("c")*base + ord("d"))*base + ord("a")
((ord("c")*base + ord("d"))*base + ord("a"))%r == ord("a")%r
Out[62]:
97
Out[62]:
420913348705
Out[62]:
420913348705
Out[62]:
True
Out[62]:
97
Out[62]:
425208316001
Out[62]:
True

When $r = base^k$ for some $k$ we have a similar behaviour, though this time it means we only take the last $k$ characters of the rolling hash into account.

In [63]:
find_matches_KR("Humpty Dumpty", "Humpty", r=2**(16*5))
Out[63]:
[0, 7]
In [65]:
fingerprint("Humpty", 2**16, r=2**(16*5))
fingerprint("Dumpty", 2**16, r=2**(16*5))
fingerprint("Xumpty", 2**16, r=2**(16*5))
Out[65]:
2158299737877522940025
Out[65]:
2158299737877522940025
Out[65]:
2158299737877522940025
In [67]:
substring_fingerprints("Humpty Dumpty", 6, 2**16, r=2**(16*5))
Out[67]:
[2158299737877522940025,
 2010726629729956855840,
 2066067987872461357124,
 2139856371159933386869,
 2232065040410175930477,
 590314951159640293488,
 1254411530052683432052,
 2158299737877522940025]
In [68]:
# Why don't we return the last index?
find_matches_KR("Humpty Dumpty", "Humpty", r=2**(16*6))
Out[68]:
[0]

Question 2 in this exercise

(We did not see this in class, but you should make sure you understand this exercise)

In [20]:
def fingerprint(string, basis=2**16, r=2**32-3):
    """ used to computes karp-rabin fingerprint of the pattern
    employs Horner method (modulo r) """
    partial_sum = 0
    for x in string:
        partial_sum = (partial_sum*basis + ord(x)) % r
    return partial_sum

def slide(prev_fp, prev_char, next_char, b_power, basis=2**16, r=2**32-3):
    new_fp=((prev_fp - ord(prev_char)*b_power)*basis + ord(next_char)) % r
    return new_fp

Section (a)

Build a generator which, given a string text and a length parameter generates all KR fingerprints of desired length in text. Instructions:

  • Make use of both $fingerprint$ and $slide$
  • The generator can call $fingerprint$ only once
In [21]:
def kr_gen(text, length, basis=2**16, r=2**32-3):
    fp = fingerprint(text[:length])
    yield fp
    b_power = pow(basis, length - 1, r)
    for s in range(1, len(text) - length + 1):
        fp = slide(fp, text[s - 1], text[s - 1 + length], b_power)
        yield fp
In [22]:
gen = kr_gen("abracadabra", 3)
next(gen)
next(gen)
Out[22]:
6422933
Out[22]:
7471495
In [23]:
next(gen)
next(gen)
next(gen)
next(gen)
next(gen)
next(gen)
next(gen)
next(gen)
next(gen)
next(gen)
Out[23]:
6357433
Out[23]:
6488452
Out[23]:
6357389
Out[23]:
6553988
Out[23]:
6357390
Out[23]:
6422933
Out[23]:
7471495
---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
<ipython-input-23-6fd1aa460113> in <module>
      6 next(gen)
      7 next(gen)
----> 8 next(gen)
      9 next(gen)
     10 next(gen)

StopIteration: 

Section (b)

Build a generator which, given two strings $text1, text2$ and a length parameter $\ell$ generates all pairs of indices $(i,j)$ such that $text1[i:i+\ell] == text2[j:j+\ell]$.

Instructions:

  • Use $kr\_gen$ from the previous section
  • The generator must work in space smaller than the size of the strings. In particular, at no point can you save the array of fingerprints of any entire string
In [24]:
def generate_shared_substrings(text1, text2, length):
    g1 = kr_gen(text1, length)
    i1 = 0
    for fp1 in g1:
        g2 = kr_gen(text2, length)
        i2 = 0
        for fp2 in g2:
            if fp1 == fp2:
                yield(i1, i2)
            i2 += 1
        i1 += 1
In [25]:
g = generate_shared_substrings("abcdef","xcdefx",3)
next(g)
next(g)
Out[25]:
(2, 1)
Out[25]:
(3, 2)
In [51]:
next(g)
---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
<ipython-input-51-e734f8aca5ac> in <module>
----> 1 next(g)

StopIteration: