%run "boaz_utils.ipynb"
You have list of strings and want to find one inside it
Google database size is $\approx 10^{12}$ web pages. At very least need to perform following task:
Given list $L = [ s_1, s_2 , \ldots, s_N ]$ of strings and string $s$, find $i$ such that $s_i=s$.
Question: Write function find(L,s)
that takes as input a list L
of strings and a string s
and returns the first index i
such that L[i]==s
if there is such an index, or -1
otherwise.
def find(L,s):
for i in range(len(L)):
if L[i]==s: return i
return -1
mu_players = ['Ander Herrera', 'Anthony Martial','Antonio Valencia','Ashley Young','Axel Tuanzebe','Chris Smalling','Daley Blind','David de Gea', 'Eric Bailly', 'Henrikh Mkhitaryan', 'Jesse Lingard', 'Joel Castro Pereira', 'Juan Mata', 'Luke Shaw', 'Marcos Rojo', 'Marcus Rashford', 'Marouane Fellaini', 'Matteo Darmian', 'Michael Carrick', 'Paul Pogba', 'Phil Jones', 'Romelu Lukaku', 'Sergio Romero', 'Timothy Fosu-Mensah', 'Victor Lindelof']
find(mu_players,"Boaz Barak")
-1
A computer is fast but it is not magical
randomstring()
'wpxtiydtig'
def listofstrings(n):
return [randomstring() for i in range(n)]
find(listofstrings(100000),randomstring())
-1
inputs = [[listofstrings(n),randomstring()] for n in range(1,10000,100)]
c,*_ = timer(find,inputs)
# Estimate for time for n= 100000 (in seconds)
c(100000)
0.8908645734477529
# estimate for n=10**12
c(10**12)
8908645.73447753
c(10**12)/(24*3600) # in days
103.10932563052695
print(mu_players)
['Ander Herrera', 'Anthony Martial', 'Antonio Valencia', 'Ashley Young', 'Axel Tuanzebe', 'Chris Smalling', 'Daley Blind', 'David de Gea', 'Eric Bailly', 'Henrikh Mkhitaryan', 'Jesse Lingard', 'Joel Castro Pereira', 'Juan Mata', 'Luke Shaw', 'Marcos Rojo', 'Marcus Rashford', 'Marouane Fellaini', 'Matteo Darmian', 'Michael Carrick', 'Paul Pogba', 'Phil Jones', 'Romelu Lukaku', 'Sergio Romero', 'Timothy Fosu-Mensah', 'Victor Lindelof']
Crucial observation: List is sorted.
Question: Suppose you were in a school with 1000 students, and you see on the bulletin board the list of students in alphabetical order and their grades. Do you need to look at all names to find your grade?
Example: (on board) We can search through 8 sorted strings with only $3$ comparisons.
Exercise: Talk to your neighbors for 5 minutes, try to find a more efficient way to search for a string in a sorted list. Search through $100$ students using at most $7$ comparisons, through $1000$ students using at most $10$ comparisons, through $1000000$ students using at most $20$ comparisons.
# You can use the fact that Python can compare strings:
"Boaz Barak" < "Jelani Nelson"
Idea: If L
is sorted list of n
elements and s
is a string in the list, we can check if s
is in the first half or second half of L
by comparing s
to L[int(n/2)]
Exercise: Give a function search8(L,s)
to find an element s
in a length $8$ list L
by comparing s
to only $3$ members of L
def search8(L,s): # L is of length 8
if s>=L[4]:
if s>= L[6]:
if s== L[6]: return 6
if s== L[7]: return 7
else:
if s==L[4]: return 4
if s==L[5]: return 5
else:
if s>= L[2]:
if s==L[2]: return 2
if s==L[3]: return 3
else:
if s==L[0]: return 0
if s==L[1]: return 1
return -1
search8(mu_players[:8],"Romelu Lukaku")
-1
search8(mu_players[:8],"Boaz Barak")
-1
Exercise: Give an algorithm to find an element $s$ in a length $16$ list $L$ by comparing $s$ to only $4$ members of $L$.
Exercise: Give an algorithm to find an element $s$ in a length $32$ list $L$ by comparing $s$ to only $5$ members of $L$.
Binary Search Algorithm: binsearch
$(L,s)$
Input: Sorted list $L=[L_0,\ldots,L_{n-1}]$, element $s$
Goal: Output the index of $L$ containing $s$ or $-1$ otherwise.
If $L$ is empty return $-1$. If $n=1$ return $0$ if $s=L_0$ and $-1$ otherwise
Let $m = \lceil n/2 \rceil$.
If $s < L_m$, return binsearch
$([L_0,\ldots,L_{m-1}],s)$ (using recursion!)
Otherwise compute $i=binsearch([L_m,\ldots,L_{n-1}],s)$ and return $m+i$ if $i \neq -1$, and $-1$ otherwise.
def binsearch(L,s):
if not L: return -1
if len(L)==1: ==
return 0 if sL[0] else -1
mid = int((len(L)+1)/2) # could also use int(math.ceil(len(L)/2))
if s<L[mid]:
return binsearch(L[:mid],s)
if s>=L[mid]:
i = binsearch(L[mid:],s)
if i >=0: return mid+i
return -1
File "<ipython-input-25-a2d02aa34e8d>", line 3 if len(L)==1: == ^ SyntaxError: invalid syntax
binsearch(mu_players,"Romelu Lukaku")
21
Is it faster?
random.seed(4943985)
s_inputs = [[sorted(listofstrings(n)),randomstring()] for n in range(1,10000,100)]
c,*_ = timer(binsearch,s_inputs)
c(10**12)
0.003972864602998028
Binary search does in $1/1000$ of a second what linear search does in $>100$ days!
compare_times(find,binsearch,s_inputs)
Intuition: One step of binary search reduces the size of the problem from $n$ to $n/2$
In $2$ steps we reduce the size to $n/4$.
In $3$ steps we reduce the size to $n/8$
In $4$ steps we reduce the size to $n/16$.
Running time is proportional to number $k$ such that $n < 2^k$. This is $k = \lceil \log_2 n \rceil$
# Some sense of propotion:
n = 100000000000000
math.log2(n)
46.50699332842307
A more clever algorithm can make a huge difference!
Exercise: Find non recursive implementation of binary search
def binsearchnr(L,s):
start = 0
end = len(L)
while start<end:
mid = int((start+end)/2)
if s<L[mid]: end = mid
if s>L[mid]: start = mid+1
if s==L[mid]: return mid
return -1
binsearchnr(mu_players,"Boaz Barak")
-1
c,*_ = timer(binsearchnr,s_inputs)
c(10**12)
numbers = sorted([ random.randint(1,1000) for i in range(100)])
numbers[:14]
[37, 58, 83, 94, 95, 112, 136, 146, 151, 159, 159, 161, 161, 163]
t= numbers[10]
print(t)
159
binsearch(numbers,t)
10
Advanced comment: Every objects you can compare, you can also sort and use binary search.