This notebook explores the speed of searching for values in sets and lists.
After reading this notebook, watch Brandon Rhodes' videos All Your Ducks In A Row: Data Structures in the Standard Library and Beyond and The Mighty Dictionary.
def make_list(n):
if True:
return list(range(n))
else:
return list(str(i) for i in range(n))
n = int(25e6)
m = int(n*.9)
m = n // 2
a_list = make_list(n)
a_set = set(a_list)
# Finding something that is in a set is fast.
%timeit m in a_set
# Finding something that is _not_ in a set is also fast.
%timeit n+1 in a_set
The slowest run took 38.41 times longer than the fastest. This could mean that an intermediate result is being cached. 10000000 loops, best of 3: 122 ns per loop The slowest run took 31.83 times longer than the fastest. This could mean that an intermediate result is being cached. 10000000 loops, best of 3: 145 ns per loop
# Finding something in a list can be slow.
# It starts at the beginning and compares each value.
# The search time depends on where the value is in the list.
%timeit m in a_list
# Finding something that is not is a list is the worst case.
# It has to be compared to all values of the list.
%timeit n+1 in a_list
1 loop, best of 3: 326 ms per loop 1 loop, best of 3: 645 ms per loop
max_exponent = 6
for n in (10 ** i for i in range(1, max_exponent+1)):
m = n // 2
print('%d:' % n)
a_list = make_list(n)
a_set = set(a_list)
%timeit m in a_set
10: The slowest run took 51.62 times longer than the fastest. This could mean that an intermediate result is being cached. 10000000 loops, best of 3: 92 ns per loop 100: The slowest run took 49.91 times longer than the fastest. This could mean that an intermediate result is being cached. 10000000 loops, best of 3: 91 ns per loop 1000: The slowest run took 39.22 times longer than the fastest. This could mean that an intermediate result is being cached. 10000000 loops, best of 3: 118 ns per loop 10000: The slowest run took 40.17 times longer than the fastest. This could mean that an intermediate result is being cached. 10000000 loops, best of 3: 118 ns per loop 100000: The slowest run took 36.67 times longer than the fastest. This could mean that an intermediate result is being cached. 10000000 loops, best of 3: 124 ns per loop 1000000: The slowest run took 40.67 times longer than the fastest. This could mean that an intermediate result is being cached. 10000000 loops, best of 3: 122 ns per loop
Notice that the difference between searching small sets and large sets in not large. This is the magic of Python sets and dictionaries. Read the hash table Wikipedia article for an explanation of how this works.
for n in (10 ** i for i in range(1, max_exponent+1)):
m = n // 2
print('%d:' % n)
a_list = make_list(n)
a_set = set(a_list)
%timeit m in a_list
10: The slowest run took 23.84 times longer than the fastest. This could mean that an intermediate result is being cached. 1000000 loops, best of 3: 202 ns per loop 100: 1000000 loops, best of 3: 1.32 µs per loop 1000: 100000 loops, best of 3: 12.7 µs per loop 10000: 10000 loops, best of 3: 126 µs per loop 100000: 1000 loops, best of 3: 1.21 ms per loop 1000000: 100 loops, best of 3: 12.8 ms per loop
Notice that the search time for finding something in the middle of a list is proportional to the size of the list.
def set_foo(n):
for i in range(n):
i in a_set
def list_foo(n):
for i in range(n):
i in a_list
for n in (10 ** i for i in range(1, max_exponent+1)):
print('%d:' % n)
a_list = list(range(n))
a_set = set(a_list)
%timeit set_foo(n)
10: 1000000 loops, best of 3: 1.91 µs per loop 100: 100000 loops, best of 3: 9.23 µs per loop 1000: 10000 loops, best of 3: 119 µs per loop 10000: 1000 loops, best of 3: 1.27 ms per loop 100000: 100 loops, best of 3: 13.2 ms per loop 1000000: 10 loops, best of 3: 133 ms per loop
for n in (10 ** i for i in range(1, max_exponent+1)):
print('%d:' % n)
a_list = list(range(n))
a_set = set(a_list)
%timeit list_foo(n)
10: 100000 loops, best of 3: 2.62 µs per loop 100: 10000 loops, best of 3: 132 µs per loop 1000: 100 loops, best of 3: 12.6 ms per loop 10000: 1 loop, best of 3: 1.26 s per loop 100000: 1 loop, best of 3: 1min 59s per loop 1000000:
The above cell does not finish overnight, so what you see above is what was saved before it finished.