cs1001.py , Tel Aviv University, Spring 2018/19

Recitation 10¶

We discussed iterators, generators, and the Karp-Rabin algorithm for string matching

Takeaways:¶

A generator function is a function that contains the yield command and returns a genertor object.
The Karp-Rabin algorithm is a probabilistic string matching algorithm (find all occurences of a string of length $m$ in a string of length $n$) running in linear time $O(m+n)$ (as opposed to the naive solution which has running time $O(nm)$).
Make sure you read our KR summary.
Make sure you understand the way the algorithm works, and in particular the "rolling hash mechanism", that is, how to compute the fingerprint of the next substring in $O(1)$ time, given the fingerprint of the current substring.
Make sure you understand the "arithmetization" step used by the algorithm.

Code for printing several outputs in one cell (not part of the recitation):¶

In [1]:

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

Iterators and Generators¶

Iterators¶

In [2]:

l = "123"  #[1,2,3]
li = iter(l)
type(li)
li2 = iter(l)

Out[2]:

str_iterator

In [25]:

next(li)

Out[25]:

'1'

In [26]:

next(li)

Out[26]:

'2'

In [27]:

z = next(li)
print("z is", z)

z is 3

In [28]:

next(li)

---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
<ipython-input-28-c9b8845252db> in <module>
----> 1 next(li)

StopIteration:

In [29]:

next(li2)

Out[29]:

'1'

In [30]:

for elem in li2:
    print(elem)

2
3

Generators¶

In [32]:

def countdown_gen():
    ''' calling it creates an iterator for the values 5,4,3,2,1,'launch' '''
    yield 5
    yield 4
    yield 3
    yield 2
    yield 1
    yield 'launch'

In [33]:

cd = countdown_gen()
cd #generator object (private case of iterator)

Out[33]:

<generator object countdown_gen at 0x000002623F0CFE58>

In [34]:

next(cd) 

Out[34]:

In [35]:

next(cd) 

Out[35]:

In [36]:

for e in cd:
    print(e)

3
2
1
launch

With generators we can have infinite iterations

In [ ]:

def countdown_infinite():
    i = 0
    while True:
        yield i
        i += 1
        
[i for i in range(10)]

Solving this exam question about generators ¶

Section (b)

In [37]:

def SomePairs():
    i=1
    while True:
        for j in range(i):
            yield(i,j)
        i=i+1

gen = SomePairs()
[next(gen) for _ in range(10)]

Out[37]:

[(1, 0),
 (2, 0),
 (2, 1),
 (3, 0),
 (3, 1),
 (3, 2),
 (4, 0),
 (4, 1),
 (4, 2),
 (4, 3)]

Section (c)

In [38]:

def RevGen(PairsGen):
    pairs = PairsGen()
    while True:
        pair = next(pairs)
        yield(pair[1],pair[0])

gen = RevGen(SomePairs)
[next(gen) for _ in range(10)]

Out[38]:

[(0, 1),
 (0, 2),
 (1, 2),
 (0, 3),
 (1, 3),
 (2, 3),
 (0, 4),
 (1, 4),
 (2, 4),
 (3, 4)]

Section (d1)

In [41]:

def UnionGenerators(gen1, gen2):
    while True:
        yield next(gen1)
        yield next(gen2)
        
ug = UnionGenerators(SomePairs(), RevGen(SomePairs))

for _ in range(10):
    next(ug)

Out[41]:

(1, 0)

Out[41]:

(0, 1)

Out[41]:

(2, 0)

Out[41]:

(0, 2)

Out[41]:

(2, 1)

Out[41]:

(1, 2)

Out[41]:

(3, 0)

Out[41]:

(0, 3)

Out[41]:

(3, 1)

Out[41]:

(1, 3)

Section (d2)

In [ ]:

def EqPairs():
    i=0
    while True:
        yield (i,i)
        i=i+1
        
def AllPairs():
    return UnionGenerators(SomePairs(), UnionGenerators(EqPairs(), RevGen(SomePairs)))

gen = AllPairs()
[next(gen) for _ in range(10)]

The string-matching problem¶

Given a string $text$ of length $n$, and a short(er) string $pattern$ of length $m$ ($m\leq n$), report all occurrances of $pattern$ in $text$.

Example:

$text = $"abracadabra", $pattern = $"abr"

The requested output should be $[0,7]$, since $pattern$ appears in $text$ in indices $0,7$.

Karp-Rabin Algorithm¶

The algorithm works as follows:
- An initial "fingerprint" is computed for the pattern string and the first substring of length $m$ in the text
- If both fingerprints are equal, we assume a match
- Then, a "rolling hash" mechanism computes the next fingerprint in $O(1)$ time given the current fingerprint in the text
- At each stage again a comparison is made and equal fingerprints are considered a match

In [30]:

def is_prime(N, show_witness=False):
    """ probabilistic test for N's compositeness """
    for i in range(0,100):
        a = random.randint(1,N-1) # a is a random integer in [1..N-1]
        if pow(a,N-1,N) != 1:
            if show_witness:  # caller wishes to see a witness
                print(N,"is composite","\n",a,"is a witness, i=",i+1)
            return False
    return True


def find_prime(n):
    """ find random n-bit long prime """
    while(True):
        candidate = random.randrange(2**(n-1),2**n)
        if is_prime(candidate):
            return candidate


def fingerprint(string, basis, r):
    """ used to computes karp-rabin fingerprint of the pattern
        and of the first substring in the text
        employs Horner method (modulo r) """
    s = 0
    for ch in string:
        s = (s*basis + ord(ch)) % r # Horner
    return s




def substring_fingerprints(string, m, basis, r):
    """ return a list of all fingerprints of size m windows in string """
    fp_list = []
    b_power = pow(basis, m-1, r)

    # compute first fingerprint
    fp_list.append(fingerprint(string[0:m], basis, r))

    # compute f_list[s], based on f_list[s-1]
    for s in range(1,len(string)-m+1): # O(n-m-1), each iteration O(1)
        next_fingerprint = \
                ((fp_list[s-1] - ord(string[s-1])*b_power) * basis \
                              + ord(string[s+m-1])) % r
        fp_list.append(next_fingerprint)

    return fp_list

def find_matches_KR(text, pattern, r=None):
    """ find all shifts of occurances of pattern in text """
    if len(pattern) > len(text):
        return [] #no matches

    basis = 2**16 # assume 16 bits are enough for all chars
    if r == None: 
        r = find_prime(32) # randomly pick a 32 bit long prime using 
                           # good old find_prime from an earlier lecture

    p = fingerprint(pattern, basis, r)
    fp_list = substring_fingerprints(text, len(pattern), basis, r)

    matches = [shift for shift in range(len(fp_list)) if fp_list[shift] == p]

    return matches

In [29]:

text = "abracadabra"
pattern = "abr"

In [43]:

r = find_prime(32)
base = 2**16

r

Out[43]:

2149219973

In [44]:

fingerprint("abr", base, r)

Out[44]:

1818795565

In [45]:

base = 2**16
arit = ord("a")*(base**2) + ord("b")*(base**1) + ord("r")*(base**0)
arit
fp = arit%r
fp

Out[45]:

416618250354

Out[45]:

1818795565

In [46]:

substring_fingerprints(text, 3, base, r)

Out[46]:

[1818795565,
 1816371474,
 1759694964,
 1818861084,
 1811784715,
 1818926620,
 1808312063,
 1818795565,
 1816371474]

In [48]:

find_matches_KR(text, pattern, r)

Out[48]:

[0, 7]

Safe version¶

Makes sure no false positives occur. In the worst case, when all $n-m$ possible substrings are indeed matches, behaves as the naive solution in terms of time complexity.

In [69]:

def find_matches_KR_safe(text, pattern, r=None):
    """ a safe version of KR
        checks every suspect for a match """
    if len(pattern) > len(text):
        return [] #no matches

    basis = 2**16 # assume 16 bits are enough for all chars
    if r == None: 
        r = find_prime(32) # randomly pick a 32 bit long prime using 
                           # good old find_prime from an earlier lecture

    p = fingerprint(pattern, basis, r)
    fp_list = substring_fingerprints(text, len(pattern), basis, r)

    matches = [shift for shift in range(len(fp_list)) \
               if fp_list[shift] == p and \
                  text[shift:shift+len(pattern)]==pattern]

    return matches

Competition between versions on single char string.¶

This is the worst-case scenario for the safe version. Changing $m$ has a greater effect on the safe version than on the standard KR.

In [56]:

import time

text = "a"*1000000
print("text = 'a'*",len(text))
print()
for pattern in ["a"*100, "a"*1000, "a"*10000, "a"*100000]:
    print("pattern = 'a'*",len(pattern))
    r = find_prime(32)
    for f in [find_matches_KR, find_matches_KR_safe]:
        t0=time.perf_counter()
        res=f(text, pattern, r)
        t1=time.perf_counter()
        print (f.__name__, t1-t0 )
    print("") #newline

text = 'a'* 1000000

pattern = 'a'* 100
find_matches_KR 1.4179207529959967
find_matches_KR_safe 1.6589124150050338

pattern = 'a'* 1000
find_matches_KR 1.3420768009964377
find_matches_KR_safe 1.775463817990385

pattern = 'a'* 10000
find_matches_KR 1.2947049719950883
find_matches_KR_safe 2.638186254000175

pattern = 'a'* 100000
find_matches_KR 1.4077826730062952
find_matches_KR_safe 13.829727393997018

Competition between versions on random strings.¶

Note that the standard and safe versions of KR has similar running times. Moreover, as $m$ increases, the running time slightly decreases since there are less candidates to consider.

In [57]:

import random
def gen_str(size):
    ''' Generate a random lowercase English string of length size'''
    s=""
    for i in range(size):
        s+=random.choice("abcdefghijklmnopqrstuvwxyz")
    return s


n=1000000
m=1000
text = gen_str(n)
pattern = gen_str(m)
print("random str of len n=", n, " , random pattern of length m=",m)
r = find_prime(32)
for f in [find_matches_KR, find_matches_KR_safe]:
    t0=time.perf_counter()
    f(text, pattern, r)
    t1=time.perf_counter()
    print (f.__name__, t1-t0)
    

m=10000
pattern = gen_str(m)
print("random str of len n=", n, " , random pattern of length m=",m)
r = find_prime(32)
for f in [find_matches_KR, find_matches_KR_safe]:
    t0=time.perf_counter()
    f(text, pattern, r)
    t1=time.perf_counter()
    print (f.__name__, t1-t0)
    
m=100000
pattern = gen_str(m)
print("random str of len n=", n, " , random pattern of length m=",m)
r = find_prime(32)
for f in [find_matches_KR, find_matches_KR_safe]:
    t0=time.perf_counter()
    f(text, pattern, r)
    t1=time.perf_counter()
    print (f.__name__, t1-t0)

random str of len n= 1000000  , random pattern of length m= 1000

Out[57]:

[]

find_matches_KR 1.4785060709982645

Out[57]:

[]

find_matches_KR_safe 1.3623063810082385
random str of len n= 1000000  , random pattern of length m= 10000

Out[57]:

[]

find_matches_KR 1.383506228987244

Out[57]:

[]

find_matches_KR_safe 1.3038542309950572
random str of len n= 1000000  , random pattern of length m= 100000

Out[57]:

[]

find_matches_KR 1.25419370700547

Out[57]:

[]

find_matches_KR_safe 1.6594200499966973

Choice of $r$¶

By setting $r$ to be a power of the base we will obtain more false-positives. Specifically, for $r=base$, we will get many more false positives. Why is that?

Note that the computation of the fingerprint is of the form $fp = (x \cdot base + y) \mod r$ for some $x,y$ (where $y$ is the arithmetization of the last character in the current substring).

In the case where $base = r$ this is equivalent to $(x \cdot r + y) \mod r = y \mod r$, i.e., the fingerprint takes into account only the last character of the substring!

Clearly, choosing $r$ which is a multiple of $base$ is not a good idea. This may also serve as an intuition for why we choose $r$ to be prime - say for example we know that all the fingerprints we compute are even, then if our $r$ is a multiple of $2$, we will always have $fp \mod r$ an even number, and thus we will only utilize half of our range!

In [60]:

# Many false positives
find_matches_KR("abracadabra", "da", r=2**16)

# Same result here since the "x" will be cancelled out 
find_matches_KR("abracadabra", "xa", r=2**16)

# The safe check still works, of course
find_matches_KR_safe("abracadabra", "da", r=2**16)

Out[60]:

[2, 4, 6, 9]

Out[60]:

[2, 4, 6, 9]

Out[60]:

[6]

In [61]:

fingerprint("da", 2**16, r=2**16)
ord("d")*(2**16)**1 + ord("a")
ord("a")

fingerprint("ca", 2**16, r=2**16)
ord("c")*(2**16)**1 + ord("a")
(ord("c")*(2**16)**1 + ord("a") )%2**16

Out[61]:

In [62]:

base = 2**16
r = 2**16
fingerprint("bda", base, r)
ord("b")*(base**2) + ord("d")*(base**1) + ord("a")
(ord("b")*base + ord("d"))*base + ord("a")
((ord("b")*base + ord("d"))*base + ord("a"))%r == ord("a")%r


fingerprint("cda", base, r)
(ord("c")*base + ord("d"))*base + ord("a")
((ord("c")*base + ord("d"))*base + ord("a"))%r == ord("a")%r

Out[62]:

420913348705

Out[62]:

420913348705

Out[62]:

True

Out[62]:

425208316001

Out[62]:

True

When $r = base^k$ for some $k$ we have a similar behaviour, though this time it means we only take the last $k$ characters of the rolling hash into account.

In [63]:

find_matches_KR("Humpty Dumpty", "Humpty", r=2**(16*5))

Out[63]:

[0, 7]

In [65]:

fingerprint("Humpty", 2**16, r=2**(16*5))
fingerprint("Dumpty", 2**16, r=2**(16*5))
fingerprint("Xumpty", 2**16, r=2**(16*5))

Out[65]:

2158299737877522940025

Out[65]:

2158299737877522940025

Out[65]:

2158299737877522940025

In [67]:

substring_fingerprints("Humpty Dumpty", 6, 2**16, r=2**(16*5))

Out[67]:

[2158299737877522940025,
 2010726629729956855840,
 2066067987872461357124,
 2139856371159933386869,
 2232065040410175930477,
 590314951159640293488,
 1254411530052683432052,
 2158299737877522940025]

In [68]:

# Why don't we return the last index?
find_matches_KR("Humpty Dumpty", "Humpty", r=2**(16*6))

Out[68]:

[0]

Question 2 in this exercise ¶

(We did not see this in class, but you should make sure you understand this exercise)¶

In [20]:

def fingerprint(string, basis=2**16, r=2**32-3):
    """ used to computes karp-rabin fingerprint of the pattern
    employs Horner method (modulo r) """
    partial_sum = 0
    for x in string:
        partial_sum = (partial_sum*basis + ord(x)) % r
    return partial_sum

def slide(prev_fp, prev_char, next_char, b_power, basis=2**16, r=2**32-3):
    new_fp=((prev_fp - ord(prev_char)*b_power)*basis + ord(next_char)) % r
    return new_fp

Section (a)¶

Build a generator which, given a string text and a length parameter generates all KR fingerprints of desired length in text. Instructions:

Make use of both $fingerprint$ and $slide$
The generator can call $fingerprint$ only once

In [21]:

def kr_gen(text, length, basis=2**16, r=2**32-3):
    fp = fingerprint(text[:length])
    yield fp
    b_power = pow(basis, length - 1, r)
    for s in range(1, len(text) - length + 1):
        fp = slide(fp, text[s - 1], text[s - 1 + length], b_power)
        yield fp

In [22]:

gen = kr_gen("abracadabra", 3)
next(gen)
next(gen)

Out[22]:

In [23]:

next(gen)
next(gen)
next(gen)
next(gen)
next(gen)
next(gen)
next(gen)
next(gen)
next(gen)
next(gen)

Out[23]:

---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
<ipython-input-23-6fd1aa460113> in <module>
      6 next(gen)
      7 next(gen)
----> 8 next(gen)
      9 next(gen)
     10 next(gen)

StopIteration:

Section (b)¶

Build a generator which, given two strings $text1, text2$ and a length parameter $\ell$ generates all pairs of indices $(i,j)$ such that $text1[i:i+\ell] == text2[j:j+\ell]$.

Instructions:

Use $kr\_gen$ from the previous section
The generator must work in space smaller than the size of the strings. In particular, at no point can you save the array of fingerprints of any entire string

In [24]:

def generate_shared_substrings(text1, text2, length):
    g1 = kr_gen(text1, length)
    i1 = 0
    for fp1 in g1:
        g2 = kr_gen(text2, length)
        i2 = 0
        for fp2 in g2:
            if fp1 == fp2:
                yield(i1, i2)
            i2 += 1
        i1 += 1

In [25]:

g = generate_shared_substrings("abcdef","xcdefx",3)
next(g)
next(g)

Out[25]:

(2, 1)

Out[25]:

(3, 2)

In [51]:

next(g)

---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
<ipython-input-51-e734f8aca5ac> in <module>
----> 1 next(g)

StopIteration:

cs1001.py , Tel Aviv University, Spring 2018/19

Recitation 10¶

Takeaways:¶

Code for printing several outputs in one cell (not part of the recitation):¶

Iterators and Generators¶

Iterators¶

Generators¶

Solving this exam question about generators¶

The string-matching problem¶

Karp-Rabin Algorithm¶

Safe version¶

Competition between versions on single char string.¶

Competition between versions on random strings.¶

Choice of $r$¶

Question 2 in this exercise¶

(We did not see this in class, but you should make sure you understand this exercise)¶

Section (a)¶

Section (b)¶

Solving this exam question about generators ¶

Question 2 in this exercise ¶