This paragraph in the Python docs:
If a class does not define an __eq__() method it should not define a __hash__() operation either; if it defines __eq__() but not __hash__(), its instances will not be usable as items in hashable collections. If a class defines mutable objects and implements an __eq__() method, it should not implement __hash__(), since the implementation of hashable collections requires that a key’s hash value is immutable (if the object’s hash value changes, it will be in the wrong hash bucket).
...why?
This talk is a foray into the Python data model so that we understand the reasoning behind this advice. And, above everything else, just an excuse to talk about hash tables.
On your marks, get set, go!
==
tests for equality¶The ==
operator is the object value comparison operator.
It tests whether two objects have the same value.
x = [2, 3]
y = [2, 3]
x == y
True
Both variables have the same value, so the result of the comparison is True
.
__eq__()
¶In the Python data model, equality is implemented via __eq__() and its counterpart, __ne__().
is
tests for identity¶The is
operator, on the other hand, is the object identity comparison operator.
It tests whether two objects are the same.
x = [2, 3]
y = [2, 3]
x is y
False
Although x
and y
have the same value, they are different objects, so the comparison with is returns False
.
x = []
id(x)
140116482324168
y = []
id(y)
140116482324488
x is y
False
In CPython this number is in fact the address of the object in memory.
x = 23
id(x)
94149475332416
y = 23
id(x)
94149475332416
x is y
True
What? But... they're different objects... right?
And to make things even worse:
a = 581
id(a)
140116482290160
b = 581
id(b)
140116482290224
a is b
False
The difference is that the current implementation (of, again, CPython) keeps an array of integer objects for all integers between −5 and 256. When we create an integer in that range what we are actually getting is just a reference back to that existing object.
In other words: in CPython, integers between −5 and 256 are singletons.
This is an implementation detail. For example, until February 2006 this happened only for values [−5, 99]. It's undocumented for a reason. Never rely on this!
The same behavior may be exhibited by small strings literals.
x = "dog"
id(x)
140116481858392
y = "dog"
id(y)
140116481858392
x is y
True
These strings are interned.
From Wikipedia [emphasis added]:
In computer science, string interning is a method of storing only one copy of each distinct string value, which must be immutable. Interning strings makes some string processing tasks more time- or space-efficient at the cost of requiring more time when the string is created or interned.
x = "".join(["d", "o", "g"])
y = "dog"
x is y
False
Why the difference? Note that we said earlier that this behavior is exhibited for literals. Here, x
is an expression. Because Python does not know its value until runtime, x
is not interned and therefore it's a different object than y
.
We can force a string to be interned, however, using sys.intern()
.
import sys
x = sys.intern("".join(["a", "b", "c"]))
y = "abc"
x is y
True
From the Python docs:
Interning strings is useful to gain a little performance on dictionary lookup – if the keys in a dictionary are interned, and the lookup key is interned, the key comparisons (after hashing) can be done by a pointer compare instead of a string compare.
Let's move beyond built-in types and define our own class, Circle
.
class Circle:
def __init__(self, x, y, radius):
self.x = x
self.y = y
self.radius = radius
def __repr__(self):
return "Circle(x={}, y={}, radius={})".format(
self.x, self.y, self.radius)
c = Circle(2, 3, 5)
print(c)
Circle(x=2, y=3, radius=5)
To avoid hard-coding the name of the class, we can use instead:
class Circle:
def __init__(self, x, y, radius):
self.x = x
self.y = y
self.radius = radius
def __repr__(self):
return "{}(x={}, y={}, radius={})".format(
type(self).__name__, # note this
self.x,
self.y,
self.radius)
The comparison between the instances of our class is broken!
c1 = Circle(2, 3, 5)
c2 = Circle(2, 3, 5)
c1 == c2
False
We didn't expect this — the values are the same, so they should compare equal.
print(c1.x == c2.x)
print(c1.y == c2.y)
print(c1.radius == c2.radius)
True True True
__eq__()
¶From the Python docs:
User-defined classes have [a]
__eq__()
[method] by default; with [it], all objects compare unequal (except with themselves) [...]
In other words, if we don't define __eq__()
by default it uses is
.
And c1
and c2
are different objects.
c1 is c2
False
For our classes to be fully comparable, they must implement the "rich" comparison methods.
Each comparison operation calls one of these methods. The correspondences are:
For example, so that we can use ==
with our class:
class Circle:
def __init__(self, x, y, radius):
self.x = x
self.y = y
self.radius = radius
def __eq__(self, other):
return self.x == other.x and \
self.y == other.y and \
self.radius == other.radius
c1 = Circle(2, 3, 5)
c2 = Circle(2, 3, 5)
c1 == c2
True
bool()
¶We could return anything, as Python will call bool()
to determine the truth value.
class Circle:
def __init__(self, x, y, radius):
self.x = x
self.y = y
self.radius = radius
def __eq__(self, other):
if self.x == other.x and \
self.y == other.y and \
self.radius == other.radius:
return "Yes, they are equal!"
else:
return "Nope, they're different"
c1 = Circle(2, 3, 5)
c2 = Circle(2, 3, 5)
c1 == c2
'Yes, they are equal!'
As the condition of an if
statement:
if c1 == c2:
print("I already told you, they're equal!")
I already told you, they're equal!
But the truth value of any non-empty string is True
:
c1 = Circle(1, 3, 9)
c2 = Circle(2, 5, 7)
if c1 == c2: # bool("Nope, they're different")
print("These circles are equal too!")
Dont' do this. Return True
or False
.
NotImplemented
¶NotImplemented
.c1 = Circle(2, 3, 5)
c2 = Circle(2, 3, 5)
c1 < c2
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) <ipython-input-27-6b88d8748591> in <module>() 1 c1 = Circle(2, 3, 5) 2 c2 = Circle(2, 3, 5) ----> 3 c1 < c2 TypeError: unorderable types: Circle() < Circle()
The <
operator calls __lt__()
, but we haven't defined it. Python returns NotImplemented
, which in turn raises TypeError.
Reminder: we exclusively have a __eq__()
by default, comparing equal only to ourselves.
NotImplemented
ourselves¶If we want to make sure that the object is only compared to another object of the same type:
class Circle:
def __init__(self, x, y, radius):
self.x = x
self.y = y
self.radius = radius
def __eq__(self, other):
if type(other) != type(self):
return NotImplemented
return self.x == other.x and \
self.y == other.y and \
self.radius == other.radius
c1 = Circle(2, 3, 5)
c1 == 5
False
To account and allow for subclasses:
class Circle:
def __init__(self, x, y, radius):
self.x = x
self.y = y
self.radius = radius
def __eq__(self, other):
if not isinstance(other, type(self)): # instead of type() == type()
return NotImplemented
return self.x == other.x and \
self.y == other.y and \
self.radius == other.radius
class Circle:
def __init__(self, x, y, radius):
self.x = x
self.y = y
self.radius = radius
def __lt__(self, other):
return self.radius < other.radius
It works as we would expect:
c1 = Circle(2, 5, 13)
c2 = Circle(3, 7, 11)
c1 < c2
False
However...
c1 > c2
True
How?! We didn't define __gt__()
, but Python is not complaining.
From the Python docs:
There are no swapped-argument versions of these methods (to be used when the left argument does not support the operation but the right argument does); rather,
__lt__()
and__gt__()
are each other’s reflection [...]
This is what happened here: Circle
doesn't define __gt__()
, so...
c1.__gt__(c2)
NotImplemented
... was translated into:
c2.__lt__(c1)
True
Technically speaking, we fallback to the right operand’s reflected method.
Let's verify it via the good ol' printf debugging technique.
class Circle:
def __init__(self, x, y, radius):
self.x = x
self.y = y
self.radius = radius
def __repr__(self):
return "{}(x={}, y={}, radius={})".format(
type(self).__name__, self.x, self.y, self.radius)
def __lt__(self, other):
print("called {} < {}".format(self, other)) # we added this
return self.radius < other.radius
c1 = Circle(2, 5, 13)
c2 = Circle(3, 7, 11)
c1 > c2
called Circle(x=3, y=7, radius=11) < Circle(x=2, y=5, radius=13)
True
So, indeed, Python resorted to the reflection of the method we have.
class Circle:
def __init__(self, x, y, radius):
self.x = x
self.y = y
self.radius = radius
def __repr__(self):
return "{}(x={}, y={}, radius={})".format(
type(self).__name__, self.x, self.y, self.radius)
def __ge__(self, other):
print("called {} >= {}".format(self, other)) # more printf debugging
return self.radius >= other.radius
c1 = Circle(2, 5, 11)
c2 = Circle(3, 7, 13)
c1 <= c2
called Circle(x=3, y=7, radius=13) >= Circle(x=2, y=5, radius=11)
True
Let's go back to an earlier example.
class Circle:
"""A regular circle."""
def __init__(self, x, y, radius):
self.x = x
self.y = y
self.radius = radius
def __repr__(self):
return "{}(x={}, y={}, radius={})".format(
type(self).__name__, self.x, self.y, self.radius)
def __eq__(self, other):
print("calling {} == {}".format(self, other))
if not isinstance(other, type(self)): # we allow for subclasses
return NotImplemented
return self.x == other.x and \
self.y == other.y and \
self.radius == other.radius
class PremiumCircle(Circle):
"A luxury Circle, with high quality values."
pass # it's the exact same thing, of course
Focus your attention on __eq__()
:
def __eq__(self, other):
if not isinstance(other, type(self)): # we allow for subclasses
return NotImplemented
return self.x == other.x and \
self.y == other.y and \
self.radius == other.radius
if not isinstance(other, type(self)): # we allow for subclasses
return NotImplemented
This means that __eq__()
would only work if the subclass is the right side of ==
(and, therefore, binds to other
), because...
issubclass(PremiumCircle, Circle)
True
... but...
issubclass(Circle, PremiumCircle)
False
... so __eq__()
will return NotImplemented
.
However...
c1 = PremiumCircle(4, 5, 8)
c2 = Circle(2, 3, 5)
c1 == c2 # subclass on the *left* side!
calling PremiumCircle(x=4, y=5, radius=8) == Circle(x=2, y=3, radius=5) calling Circle(x=2, y=3, radius=5) == PremiumCircle(x=4, y=5, radius=8)
False
Works! This is, again, because of reflection. Note how we called __eq__()
twice.
PremiumCircle.__eq__(Circle)
returns NotImplemented
, because PremiumCircle
is not a subclass of Circle
. Python uses reflection to try the opposite: Circle.__eq__(PremiumCircle)
: this comparison can take place because PremiumCircle
is a subclass of Circle.
The reflection of __eq__()
and __ne__()
with themseves allows us not having to worry about whether self
is a subclass of other
or the other way around.
From the docs of NotImplemented [emphasis added]:
When a binary (or in-place) method returns
NotImplemented
the interpreter will try the reflected operation on the other type (or some other fallback, depending on the operator). If all attempts returnNotImplemented
, the interpreter will raise an appropriate exception [...]
class Circle:
def __init__(self, x, y, radius):
self.x = x
self.y = y
self.radius = radius
def __eq__(self, other):
return self.x == other.x and \
self.y == other.y and \
self.radius == other.radius
c1 = Circle(2, 5, 11)
c2 = Circle(3, 7, 13)
c1 != c2
True
There are no other implied relationships among the comparison operators.
Python cannot do smart things like figuring out that if x >= y
and x != y
, then if follows that x > y
.
Humans understand that if (a) x
is greater than or equal to y
and (b) they're not equal, that means that x
must be greater than y
. Python cannot.
class Circle:
def __init__(self, x, y, radius):
self.x = x
self.y = y
self.radius = radius
def __eq__(self, other):
return self.x == other.x and \
self.y == other.y and \
self.radius == other.radius
def __ge__(self, other):
return self.radius >= other.radius
c1 = Circle(2, 5, 11)
c2 = Circle(3, 7, 13)
c1 > c2
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) <ipython-input-44-ce709a35ae78> in <module>() 15 c1 = Circle(2, 5, 11) 16 c2 = Circle(3, 7, 13) ---> 17 c1 > c2 TypeError: unorderable types: Circle() > Circle()
So does this mean the in practice we need to define all the comparison methods?
class Circle:
def __init__(self, x, y, radius):
self.x, self.y, self.radius = x, y, radius
def __eq__(self, other):
return self.x == other.x and \
self.y == other.y and \
self.radius == other.radius
def __ne__(self, other):
return not self == other
def __ge__(self, other):
return self.radius >= other.radius
def __le__(self, other):
return self.radius <= other.radius
def __gt__(self, other):
return self.radius > other.radius
def __lt__(self, other):
return self.radius < other.radius
So we can leave it in:
class Circle:
def __init__(self, x, y, radius):
self.x = x
self.y = y
self.radius = radius
def __eq__(self, other):
return self.x == other.x and \
self.y == other.y and \
self.radius == other.radius
def __ge__(self, other):
return self.radius >= other.radius
def __gt__(self, other):
return self.radius > other.radius
functools.total_ordering()
¶We can forget all these rules and let functools.total_ordering()
do all the work.
It just needs __eq__()
and one more (any) comparison operator to derive the rest for us automatically.
import functools
@functools.total_ordering
class Circle:
def __init__(self, x, y, radius):
self.x = x
self.y = y
self.radius = radius
def __eq__(self, other):
return self.x == other.x and \
self.y == other.y and \
self.radius == other.radius
def __ge__(self, other):
return self.radius >= other.radius
The name total ordering comes from mathematics: "binary relation on some set X, which is antisymmetric, transitive, and a connex relation". In human-being-speak we can say that a total order is that of a set of elements in which all the possible pairs are comparable.
The downside, from the Python docs [emphasis added]:
[...] it does come at the cost of slower execution and more complex stack traces for the derived comparison methods. If performance benchmarking indicates this is a bottleneck for a given application, implementing all six rich comparison methods instead is likely to provide an easy speed boost.
__cmp__()
¶Before Python 3, if the rich comparison methods were not implemented, Python falled back to a different magic method: __cmp__()
.
object.__cmp__(self, other)
would return:
self < other
.self == other
.self > other
.Thus, our early ancestors could choose between implementing __cmp__()
or the rich comparison methods — named like this to differentiate them from the more primitive alternative.
In Python 3 __cmp__()
no longer exists.
We now know how to create well behaved totally ordered classes.
==
is for equality...is
for identity.__eq__()
.functools.total_ordering()
simplifies our life.__cmp__()
exists only in history books.We do care if we want to use our classes in a Python dictionary or a set.
Everything to be known about dictionaries was said by Brandon Rhodes back in 2010.
If we work with a a Python list, we access elements by their index number.
numbers = [2, 3, 5, 7, 11]
numbers[0]
2
numbers[2]
5
numbers[-1]
11
A bijective function maps each index element to the corresponding value.
numbers = [2, 3, 5, 7, 11]
For example, index two (i.e. third element in the array) maps to value 5.
However, we cannot easily go the other way around. Given, say, the value 7:
The answer to these questions is $\mathcal{O}(n)$.
7 in numbers # O(n)
True
Same for:
numbers.index(7) # O(n)
3
Python is doing the equivalent of:
def index(iterable, wanted):
"""Return index of 'wanted' in 'iterable'; -1 if not found."""
for index, element in enumerate(iterable):
if element == wanted:
return index
return -1
index(numbers, 7)
3
To make it explicit: we have to visit all the elements in the iterable, so $\mathcal{O}(n)$.
numbers = {} # now a dict
numbers[1] = "one"
numbers[3] = "three"
numbers[7] = "seven"
numbers[3]
'three'
5 in numbers
False
del numbers[1]
These operations, unlike for the list, are $\mathcal{O}(1)$. How? Because dictionaries are Python's name for...
How do hash tables achieve this?
This is the first plot twist.
A hash table is nothing else than a list (or array, in non-Python-specific terminology).
Let's create our hash table:
table = [None, None, None, None, None]
Let's visualize it vertically:
table = [
None, # index = 0
None, # index = 1
None, # index = 2
None, # index = 3
None, # index = 4
]
How do we compute the indexes?
From Wikipedia:
A hash function is any function that can be used to map data of arbitrary size to data of a fixed size.
def simple_hash(str_):
return len(str_)
simple_hash("dog")
3
simple_hash("tiger")
5
simple_hash("Tyrannosaurus Rex")
17
We take data (strings, in this case) and return a numeric value.
table = [
None, # index = 0
None, # index = 1
None, # index = 2
None, # index = 3
None, # index = 4
]
Let's get the hash of "dog"
:
word = 'dog'
word_hash = simple_hash(word)
print("Hash of", word, "is", word_hash)
Hash of dog is 3
Use the hash as the index to store the value in our table:
index = word_hash
table[index] = word # O(1)
table
[None, None, None, 'dog', None]
Let's now look up a value. Is "bear"
in the hash table?
word = 'bear'
index = simple_hash(word)
table[index] != None
False
Our function can return any value, exceeding the size of the hash table.
word = 'monkey'
index = simple_hash(word)
table[index] = word
--------------------------------------------------------------------------- IndexError Traceback (most recent call last) <ipython-input-70-4f70fec575d8> in <module>() 1 word = 'monkey' 2 index = simple_hash(word) ----> 3 table[index] = word IndexError: list assignment index out of range
So the operation becomes:
word = 'monkey'
index = simple_hash(word) % len(table)
print('{} % {} == {}'.format(simple_hash(word), len(table), index))
6 % 5 == 1
table[index] = word
table
[None, 'monkey', None, 'dog', None]
Is "monkey"
in the hash table?
word = 'monkey'
index = simple_hash(word) % len(table)
table[index] != None
True
Because we transform the input values into a smaller domain, sooner or later we'll come across two different keys that map to the same hash value.
This is known as a collision.
Right now, our hash table looks like this:
table = [
None, # index = 0
'monkey', # index = 1
None, # index = 2
'dog', # index = 3
None, # index = 4
]
A wild new word appears:
word = 'cat'
index = simple_hash(word) % len(table)
table[index] = word
table
[None, 'monkey', None, 'cat', None]
table
[None, 'monkey', None, 'cat', None]
"cat"
and "dog"
have the same length."dog"
is gone.This is a collision. They will happen, so our hash table needs to take them into account.
PHP functions originally bucketed by strlen, were renamed to balance length:
Well, there were other factors in play there. htmlspecialchars was a
very early function. Back when PHP had less than 100 functions and the
function hashing mechanism was strlen(). **In order to get a nice hash
distribution of function names across the various function name lengths
names were picked specifically to make them fit into a specific length
bucket**. This was circa late 1994 when PHP was a tool just for my own
personal use and I wasn't too worried about not being able to remember
the few function names. \[emphasis added\]
Most programming languages use a hash table to store symbols: e.g. function name to function object or pointer.
PHP was hashing by length, which lead to lots of collisions.
Solution? Rename functions to smooth the distribution of lengths.
From PHP has inconsistent function naming:
underscore no underscore:
stream_get_line readline
disk_free_space diskfreespace
is_object isset
mcal_day_of_week jddayofweek
We live in the darkest timeline.
Formally speaking, a good hash function satisfies two main properties:
The same input value should produce the same output value. In other words: the hash is a mathematical function of the data being hashed. We cannot depend on external variable parameters such as a pseudo-random number generator.
Our function satisfied this, as the length of a string is deterministic.
The hashes should be distributed as evently as possible over the output range. In other words: every hash should have the same probability of being generated.
This is where our hash function (as well as PHP's) failed. Words in English...
[...] are reasonably well fitted by a shifted Poisson distribution with mean and variance equal to 6.94 and 5.80. [Source]
This would lead to a lot of collisions.
$ echo "Let there be light" | md5sum
b3d440d0691efe7d7d485602f134f469 -
MD5 is still useful to verify data integrity against unintentional corruption.
For example, Debian gives us these MD5 checksums:
f8446a84356a6bcbf79201cc9f46063f debian-9.5.0-amd64-netinst.iso
47a5dca818220d8558d37dfa11b85550 debian-9.5.0-amd64-xfce-CD-1.iso
2daa085925a556d35a0eeb27768cc892 debian-mac-9.5.0-amd64-netinst.iso
Then when I download the ISO:
wget https://cdimage.debian.org/debian-cd/current/amd64/iso-cd/debian-9.5.0-amd64-netinst.iso
md5sum debian-9.5.0-amd64-netinst.iso
f8446a84356a6bcbf79201cc9f46063f
Same hash, so moooost likely no corruption.
hash()
¶Meet the built-in hash()
function:
Return the hash value of the object (if it has one). Hash values are integers. They are used to quickly compare dictionary keys during a dictionary lookup. Numeric values that compare equal have the same hash value (even if they are of different types, as is the case for 1 and 1.0).
hash(0)
0
hash(5)
5
hash(113)
113
hash(7.3)
691752902764107783
hash(-5)
-5
hash(-2)
-2
hash(-1) # also -2!
-2
Why?
The hash value -1 is reserved (it’s used to flag errors in the C implementation). If the hash
algorithm generates this value, we simply use -2 instead. [Source]
Proof in the CPython source code:
if (x == (Py_uhash_t)-1)
x = (Py_uhash_t)-2;
return (Py_hash_t)x;
For strings the algorithm is much more complex.
hash("dog")
4448411893666489140
hash("cat")
-1695142227333385162
hash("giraffe")
-4077740646546332916
Looking at the CPython source code:
if (len < Py_HASH_CUTOFF) {
/* Optimize hashing of very small strings with inline DJBX33A. */
Py_uhash_t hash;
const unsigned char *p = src;
hash = 5381; /* DJBX33A starts with 5381 */
switch(len) {
/* ((hash << 5) + hash) + *p == hash * 33 + *p */
case 7: hash = ((hash << 5) + hash) + *p++; /* fallthrough */
case 6: hash = ((hash << 5) + hash) + *p++; /* fallthrough */
case 5: hash = ((hash << 5) + hash) + *p++; /* fallthrough */
case 4: hash = ((hash << 5) + hash) + *p++; /* fallthrough */
case 3: hash = ((hash << 5) + hash) + *p++; /* fallthrough */
case 2: hash = ((hash << 5) + hash) + *p++; /* fallthrough */
case 1: hash = ((hash << 5) + hash) + *p++; break;
default:
Py_UNREACHABLE();
}
hash ^= len;
hash ^= (Py_uhash_t) _Py_HashSecret.djbx33a.suffix;
x = (Py_hash_t)hash;
}
hash((1, 2))
3713081631934410656
hash(("dog", "cat"))
4660141651573361605
hash(("giraffe", 113, 5.1))
-1646678630949266879
Taking a look at the CPython source code:
x = 0x345678UL;
p = v->ob_item;
while (--len >= 0) {
y = PyObject_Hash(*p++);
if (y == -1)
return -1;
x = (x ^ y) * mult;
/* the cast might truncate len; that doesn't change hash stability */
mult += (Py_hash_t)(82520UL + len + len);
}
x += 97531UL;
if (x == (Py_uhash_t)-1)
x = -2;
hash()
on our own classes¶hash()
delegates the method to the __hash__()
magic method.
x = 3
hash(x)
3
This is equivalent to...
x.__hash__()
3
Let's not implement it in our class...
class Circle:
def __init__(self, x, y, radius):
self.x = x
self.y = y
self.radius = radius
c = Circle(5, 7, 9)
hash(c)
-9223363279575207557
It returns something, nevertheless!
From the docs:
User-defined classes have
__eq__()
and__hash__()
methods by default; with them, all objects compare unequal (except with themselves) andx.__hash__()
returns an appropriate value such thatx == y
implies both thatx is y
andhash(x) == hash(y)
.
So the hash will only be the same if... the object is the same.
The (default) hash function is not based on the value of the object, as we would expect, but on its identity.
class Circle:
def __init__(self, x, y, radius):
self.x = x
self.y = y
self.radius = radius
c1 = Circle(5, 7, 9)
c2 = Circle(5, 7, 9)
print("Hash of c1:", hash(c1))
print("Hash of c2:", hash(c2))
Hash of c1: -9223363279575207564 Hash of c2: 8757280120398
Different objects, different hashes!
__hash__()
¶To do things right, we need to implement our own __hash__()
method.
The docs recommend [emphasis added]:
[...] it is advised to mix together the hash values of the components of the object that also play a part in comparison of objects by packing them into a tuple and hashing the tuple.
class Circle:
def __init__(self, x, y, radius):
self.x = x
self.y = y
self.radius = radius
def __hash__(self):
return hash((self.x, self.y, self.radius))
c = Circle(2, 3, 11)
hash(c)
3789705017597642099
In this manner, we delegate __hash__()
to the hash function to the tuple
object, which in turn will use the hashes of the attributes (x
, y
and radius
) of our class.
This hashes will be, as we need, both deterministic and uniform over the output range.
hash()
.__hash__()
.Let's use all we learned to implement our own table hash — this time handling collisions.
We need to split this class over multiple Jupyter cells, so we'll use this decorator:
def add_to(cls):
"""A class decorator to dynamically add methods."""
def wrapper(func):
setattr(cls, func.__name__, func)
return cls
return wrapper
class HashTable:
"""Initial prototype of a hash table."""
def __init__(self, size=8):
self._slots = [None] * size
def __len__(self):
count = 0
for key in self._slots:
if key is not None:
count += 1
return count
@add_to(HashTable)
def __setitem__(self, key, value):
index = hash(key) % len(self._slots)
self._slots[index] = value
@add_to(HashTable)
def __getitem__(self, key):
index = hash(key) % len(self._slots)
if self._slots[index] is None:
raise KeyError(key)
return self._slots[index]
@add_to(HashTable)
def __delitem__(self, key):
index = hash(key) % len(self._slots)
self._slots[index] = None
Let's try it...
table = HashTable()
table["cat"] = "Katze"
table["cat"]
'Katze'
table._slots
[None, None, None, None, None, None, 'Katze', None]
We can double check this is the right index:
hash("cat") % 8
6
Deletion also works:
del table["cat"]
table["cat"]
--------------------------------------------------------------------------- KeyError Traceback (most recent call last) <ipython-input-102-f13e5ff72fe3> in <module>() 1 del table["cat"] ----> 2 table["cat"] <ipython-input-98-adffee489ca8> in __getitem__(self, key) 9 index = hash(key) % len(self._slots) 10 if self._slots[index] is None: ---> 11 raise KeyError(key) 12 return self._slots[index] 13 KeyError: 'cat'
class HashTable:
"""Initial prototype of a hash table, second attempt."""
def __init__(self, size=8):
self._slots = [None] * size
self._count = 0
def __len__(self):
return self._count
def _get_index(self, key):
return hash(key) % len(self._slots)
@add_to(HashTable)
def __setitem__(self, key, value):
index = self._get_index(key)
self._slots[index] = value
self._count += 1
@add_to(HashTable)
def __getitem__(self, key):
index = self._get_index(key)
if self._slots[index] is None:
raise KeyError(key)
return self._slots[index]
@add_to(HashTable)
def __delitem__(self, key):
index = self._get_index(key)
if self._slots[index] is None:
raise KeyError(key)
self._slots[index] = None
self._count -= 1
table = HashTable()
table["dog"] = "Hund"
table["bear"] = "Bär"
table._slots
[None, None, 'Bär', None, 'Hund', None, None, None]
len(table)
2
This is all good, but again – we have to keep collisions in mind.
Sooner than later two elements are going to end up in the same slot.
Thus, our empty hash table looks like this:
table = [
[], # index = 0
[], # index = 1
[], # index = 2
[], # index = 3
[], # index = 4
]
Say we want to map "dog"
to its German translation, "Hund"
.
It lands in, for example, index == 3
.
table = [
[], # index = 0
[], # index = 1
[], # index = 2
['Hund'], # index = 3
[], # index = 4
]
Then we map "bear"
to its German translation, "Bär"
.
It lands in, say, index == 0
.
table = [
["Bär"], # index = 0
[], # index = 1
[], # index = 2
['Hund'], # index = 3
[], # index = 4
]
Now we store "cat"
, and assume that it also lands in index == 3
.
table = [
["Bär"], # index = 0
[], # index = 1
[], # index = 2
['Hund', 'Katze'], # index = 3
[], # index = 4
]
Do you see the problem we face now?
Somebody asks us "How do you say cat in German".
We use the hash of "cat"
to compute its index, and get that index == 3
.
table = [
["Bär"], # index = 0
[], # index = 1
[], # index = 2
['Hund', 'Katze'], # index = 3
[], # index = 4
]
Is it "Hund"
or "Katze"
? How can we tell them apart? We can't!
In order to deal with collisions we have to store the key itself, in addition to the mapped value.
import collections
Entry = collections.namedtuple('Entry', 'key, value')
class HashTable:
"""A separate-chaining hash table."""
def __init__(self, size=8):
self._slots = [list() for _ in range(size)]
self._count = 0
def __len__(self):
return self._count
def _get_index(self, key):
return hash(key) % len(self._slots)
@add_to(HashTable)
def __setitem__(self, key, value):
slot_index = self._get_index(key)
slot = self._slots[slot_index]
for entry_index, entry in enumerate(slot):
if entry.key == key: # note the equality comparison here...
slot[entry_index] = entry._replace(value=value)
break
else:
entry = Entry(key, value)
self._slots[slot_index].append(entry)
self._count += 1
@add_to(HashTable)
def __getitem__(self, key):
index = self._get_index(key)
for entry in self._slots[index]:
if entry.key == key: # and here...
return entry.value
raise KeyError(key)
@add_to(HashTable)
def __delitem__(self, key):
slot_index = self._get_index(key)
slot = self._slots[slot_index]
for entry_index, entry in enumerate(slot):
if entry.key == key: # ... and here
del slot[entry_index]
break
else:
raise KeyError(key)
self._count -= 1
table = HashTable()
table["dog"] = "perro"
table["bear"] = "Bär"
"perro"
is Spanish, not German! Let's replace the value.
table["dog"] = "Hund"
table["dog"]
'Hund'
Let's take a look at the internals:
table._slots
[[], [], [Entry(key='bear', value='Bär')], [], [Entry(key='dog', value='Hund')], [], [], []]
Python may use a different strategy, but the foundations are the same. We need:
==
) and tell it apart from other elements what may have collided with it.Our hash table is really nice! Let's add functionality to loop over the elements:
items()
!¶@add_to(HashTable)
def items(self):
for slot in self._slots:
for entry in slot:
yield entry
table = HashTable()
table["dog"] = "Hund"
table["bear"] = "Bär"
for key, value in table.items():
print(key, "->", value)
bear -> Bär dog -> Hund
Alternatively, in a single line:
import itertools
@add_to(HashTable)
def items(self):
yield from itertools.chain.from_iterable(self._slots)
Our hash table is almost perfect! The only downside:
We only have eight slots! Sooner than later it's going to get full and performance start degrading.
The solution is to resize, also known as rehashing.
slots
...slots
, we no longer need it.import collections
import itertools
Entry = collections.namedtuple('Entry', 'key, value')
class HashTable:
"""A separate-chaining, dynamically resized hash table."""
def __init__(self, initial_size=8, load_factor=0.8):
self._slots = [list() for _ in range(initial_size)]
self._size = 0
self._load_factor = load_factor
def __len__(self):
return self._size
@staticmethod
def _get_index(key, slots):
return hash(key) % len(slots)
@classmethod
def _insert(cls, slots, key, value):
index = cls._get_index(key, slots)
slots[index] = value
@add_to(HashTable)
def __setitem__(self, key, value):
slot_index = self._get_index(key, self._slots)
slot = self._slots[slot_index]
for entry_index, entry in enumerate(slot):
if entry.key == key: # note the equality comparison here...
slot[entry_index] = entry._replace(value=value)
break
else:
entry = Entry(key, value)
self._slots[slot_index].append(entry)
self._size += 1
if len(self) / len(self._slots) >= self._load_factor:
self._rehash()
@add_to(HashTable)
def _rehash(self):
new_slots = [list() for _ in range(len(self._slots) * 2)]
for key, value in self.items():
HashTable._insert(new_slots, key, value)
self._slots = new_slots
# The remaining methods; not worth showing on the slides, as they're basically the same ones
# we implemented earlier, but they're included here for the sake of completeness --- this is
# the last version of our hash table implementation.
@add_to(HashTable)
def __getitem__(self, key):
index = self._get_index(key, self._slots)
for entry in self._slots[index]:
if entry.key == key: # note the equality comparison here...
return entry.value
raise KeyError(key)
@add_to(HashTable)
def __delitem__(self, key):
slot_index = self._get_index(key, self._slots)
slot = self._slots[slot_index]
for entry_index, entry in enumerate(slot):
if entry.key == key: # ... and here
del slot[entry_index]
break
else:
raise KeyError(key)
self._size -= 1
@add_to(HashTable)
def items(self):
yield from itertools.chain.from_iterable(self._slots)
In the words of Andrew Cooke on Stack Overflow [emphasis added]:
Therefore, the hash code only hints at the location of an object, an equality test follows to test candidate keys. To implement a membership test in a hash-table set, the hash code gives you "bucket" number in which to search for the value. However, all set items with the same hash code are in the bucket. For this, you also need an equality test to distinguish between all candidates in the bucket.
That's it.
What we learned is also useful to understand how Python sets work:
We're —finally— ready to revisit the paragraph from the beginning:
If a class does not define an __eq__() method it should not define a __hash__() operation either; if it defines __eq__() but not __hash__(), its instances will not be usable as items in hashable collections. If a class defines mutable objects and implements an __eq__() method, it should not implement __hash__(), since the implementation of hashable collections requires that a key’s hash value is immutable (if the object’s hash value changes, it will be in the wrong hash bucket).
We now have the tools to understand —and remember— why.
If a class does not define an __eq__() method it should not define a __hash__() operation either; [...]
Rephrased: defining __hash__()
but not __eq__()
is wrong. Why?
Let's think of a hash table look up:
__hash__()
to compute the slot where the value goes.__eq__()
, so by default our objects compare unequal except with themselves.See this Stack Overflow question for an example of somebody who ran into this.
In our implementation of a hash table, if we look up "cat"
:
table = [
[], # 0
[Entry(key='bear', value='Bär')], # 1
[], # 2
[Entry(key='dog', value='Hund'), Entry(key='cat', value='Katze')], # 3
[], # 4
]
Python uses str.__hash__()
and determines that index == 3
.
str.__eq__()
to compare "cat"
to "dog"
$\rightarrow$ False
."cat" == "cat"
$\rightarrow$ True
. We found our entry.If we have __hash__()
but not __eq__()
, we'll land in the right slot but won't be able to find our value.
class Cyborg:
def __init__(self, name, weapon):
self.name = name
self.weapon = weapon
def __hash__(self):
return hash((self.name, self.weapon))
robot = Cyborg('T-1000', 'Franchi SPAS-12')
cyborgs = set()
cyborgs.add(robot)
robot in cyborgs
True
However..
cyborgs = {Cyborg('T-1000', 'Franchi SPAS-12')}
Cyborg('T-1000', 'Franchi SPAS-12') in cyborgs # doesn't work!
False
Without __eq__()
, these objects only compare equal to themselves.
As we saw earlier for Point
...
hash(robot1)
-7726681088492938838
hash(robot2)
-7726681088492938838
robot1 == robot2
False
[...] if [a class] defines __eq__() but not __hash__(), its instances will not be usable as items in hashable collections.
This is —now— almost trivial: if we don't have __hash__(), we cannot know in which slot the value is.
Indeed, Python won't even let us compute their hash.
class Cyborg:
def __init__(self, name, weapon):
self.name = name
self.weapon = weapon
def __eq__(self, other):
return self.name == other.name and self.weapon == other.weapon
robot = Cyborg('T-850', 'Ithaca 37')
hash(robot)
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) <ipython-input-3-08a2a9fdd4d7> in <module>() 10 11 robot = Cyborg('T-850', 'Ithaca 37') ---> 12 hash(robot) TypeError: unhashable type: 'Cyborg'
However, we saw earlier that by default both __eq__()
and __hash__()
return something.
Indeed, if we don't define none of them Python doesn't complain.
class Cyborg:
def __init__(self, name, weapon):
self.name = name
self.weapon = weapon
robot = Cyborg('T-850', 'Ithaca 37')
hash(robot)
8744344590021
How do we not define __hash__()
? By setting it to None
.
From the docs:
When the
__hash__()
method of a class isNone
, instances of the class will raise an appropriateTypeError
when a program attempts to retrieve their hash value, and will also be correctly identified as unhashable when checkingisinstance(obj, collections.abc.Hashable)
.
class Cyborg:
def __init__(self, name, weapon):
self.name = name
self.weapon = weapon
__hash__ = None
robot = Cyborg('T-850', 'Ithaca 37')
hash(robot)
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) <ipython-input-133-3d6965db93aa> in <module>() 9 10 robot = Cyborg('T-850', 'Ithaca 37') ---> 11 hash(robot) TypeError: unhashable type: 'Cyborg'
Also:
best_friend = dict()
best_friend[robot] = "Sarah Connor"
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) <ipython-input-134-d42f98f6f6fa> in <module>() 1 best_friend = dict() ----> 2 best_friend[robot] = "Sarah Connor" TypeError: unhashable type: 'Cyborg'
And of course:
robot_army = set()
robot_army.add(robot)
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) <ipython-input-135-a2975627508c> in <module>() 1 robot_army = set() ----> 2 robot_army.add(robot) TypeError: unhashable type: 'Cyborg'
All of them fail because our class is not hashable.
From the docs:
If a class that does not override
__hash__()
wishes to suppress hash support, it should include__hash__ = None
in the class definition.
Note that returning None
does not work.
class Cyborg:
def __init__(self, name, weapon):
self.name = name
self.weapon = weapon
def __hash__(self):
return None
robot = Cyborg('T-1000', 'Franchi SPAS-12')
hash(robot)
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) <ipython-input-136-52f2e857f76a> in <module>() 10 11 robot = Cyborg('T-1000', 'Franchi SPAS-12') ---> 12 hash(robot) TypeError: __hash__ method should return an integer
The docs also warn us:
A class which defines its own
__hash__()
that explicitly raises aTypeError
would be incorrectly identified as hashable by anisinstance(obj, collections.abc.Hashable)
call.
collections.abc.Hashable
is an abstract base class. We use them
[... ] to test whether a class provides a particular interface; for example, whether it is hashable or whether it is a mapping.
Is this variable iterable?
word = "abc"
We could try to loop over it and catch the error...
def is_iterable(obj):
try:
for item in obj:
# can exit immediately, just want to check it doesn't fail
return True
except TypeError:
return False
is_iterable(word)
True
... or instead use:
import collections.abc
isinstance(word, collections.abc.Iterable)
True
Does this variable have a size?
t = (x ** 2 for x in [4, 5, 6, 8, 9])
We could try to call len()
...
def is_sized(obj):
try:
len(obj)
return True
except TypeError:
return False
is_sized(t)
False
Alternatively, we could check whether the object hasattr(obj, '__len__')
.
... or instead:
isinstance(t, collections.abc.Sized)
False
In the same way, to check whether a variable is hashable we could just try...
word = "Katze"
def is_hashable(obj):
try:
hash(obj)
return True
except TypeError:
return False
is_hashable(word)
True
... or instead:
isinstance(word, collections.abc.Hashable)
True
class Cyborg:
def __init__(self, name, weapon):
self.name = name
self.weapon = weapon
def __hash__(self):
raise TypeError("class Cyborg is not hashable")
robot = Cyborg('T-800', 'M79 grenade launcher')
hash(robot)
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) <ipython-input-145-c9a6424ab543> in <module>() 10 11 robot = Cyborg('T-800', 'M79 grenade launcher') ---> 12 hash(robot) <ipython-input-145-c9a6424ab543> in __hash__(self) 6 7 def __hash__(self): ----> 8 raise TypeError("class Cyborg is not hashable") 9 10 TypeError: class Cyborg is not hashable
This was expected. However...
isinstance(robot, collections.abc.Hashable)
True
Wrong! We broke it!
Use __hash__ = None
.
And finally:
[...] If a class defines mutable objects and implements an __eq__() method, it should not implement __hash__(), since the implementation of hashable collections requires that a key’s hash value is immutable (if the object’s hash value changes, it will be in the wrong hash bucket).
This takes us to our final point: the hashability of mutable objects.
hash((1, 2, 3))
2528502973977326415
... but lists aren't.
hash([5, 7, 9])
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) <ipython-input-148-831d6710add8> in <module>() ----> 1 hash([5, 7, 9]) TypeError: unhashable type: 'list'
squares = dict()
squares[[2, 7]] = 2 ** 7
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) <ipython-input-149-4cf3b56bef1b> in <module>() 1 squares = dict() ----> 2 squares[[2, 7]] = 2 ** 7 TypeError: unhashable type: 'list'
Same error! It's unhashable.
For this we need to use tuples...
squares = dict()
squares[(2, 7)] = 2 ** 7
squares
{(2, 7): 128}
... or perhaps a two-level defaultdict
:
import collections
squares = collections.defaultdict(dict)
squares[2][7] = 2 ** 7
squares
defaultdict(dict, {2: {7: 128}})
squares[2][7]
128
Let's imagine a list were hashable and we used it as a key in a dictionary.
x = [2, 3, 5]
__hash__()
to compute the slot where the list goes.Later on, we mutate our list, say:
x[1] = 7
x
[2, 7, 5]
x
in the dictionary.__hash__()
and get a different value, hence a different slot.x
are different, so they will compare unequal.Cyborg
class.__eq__()
...__hash__()
.What could go wrong?
import functools
@functools.total_ordering
class Cyborg:
def __init__(self, name, weapon):
self.name = name
self.weapon = weapon
def __repr__(self):
return "Cyborg({}, {})".format(self.name, self.weapon)
def __eq__(self, other):
return self.name == other.name and self.weapon == other.weapon
def __lt__(self, other):
if self.name < other.name:
return True
return self.weapon < other.weapon
Our hash function combines both attributes in a tuple. Both are strings, so immutable, so hashable.
@add_to(Cyborg)
def __hash__(self):
return hash((self.name, self.weapon))
We really shouldn't be doing this...
robot1 = Cyborg('T-800', 'Franchi SPAS-12')
robot2 = Cyborg('T-800', 'MP5 Submachine gun')
robot1 == robot2
False
robot1 >= robot2
False
hash(robot1)
-2292265897856400339
hash(robot2)
6957369841330577170
See the problem here?
This will break!
robot = Cyborg('T-800', 'Franchi SPAS-12')
hash(robot)
-2292265897856400339
target = dict()
target[robot] = "John Connor"
target[robot]
'John Connor'
So far this works, but...
robot.weapon = 'M79 grenade launcher'
hash(robot)
4054157808228906460
target[robot]
--------------------------------------------------------------------------- KeyError Traceback (most recent call last) <ipython-input-165-38243c188373> in <module>() ----> 1 target[robot] KeyError: Cyborg(T-800, M79 grenade launcher)
We broke it!
That's what the docs were explaining when they told us [emphasis added]:
[...] If a class defines mutable objects and implements an __eq__() method, it should not implement __hash__(), since the implementation of hashable collections requires that a key’s hash value is immutable (if the object’s hash value changes, it will be in the wrong hash bucket).
A final comment. The dictionary (and thus also sets) store a reference to our object.
robot = Cyborg('T-1000', 'Remington 870')
id(robot)
140116472769168
terminators = set()
terminators.add(robot)
Let's retrieve the element from the set without removing it:
id(next(iter(terminators)))
140116472769168
It's not a copy — it's the same object.
robot.weapon = 'RSB-80 Plasma Gun'
robot
Cyborg(T-1000, RSB-80 Plasma Gun)
terminators
{Cyborg(T-1000, RSB-80 Plasma Gun)}
This implies that if our hashable mutated object landed (by chance) in the same slot as it originally did, we would still be able to find it.
robot = Cyborg('T-1000', 'Remington 870')
terminators = set()
terminators.add(robot)
original_hash = hash(robot)
print("Original hash:", original_hash)
Original hash: 2225710126195077769
Let's mutate the object but make sure it goes to the same slot:
robot.weapon = 'RSB-80 Plasma Gun'
Cyborg.__hash__ = lambda x: original_hash
print("Mutated hash: ", hash(robot))
Mutated hash: 2225710126195077769
robot in terminators
True
We can find it! It went to the same bucket.
Hashable mutable objects will break most of the time, but not always.
Experiments aside, let's do the right thing and make our mutable class non-hashable.
class Cyborg:
def __init__(self, name, weapon):
self.name = name
self.weapon = weapon
__hash__ = None
robot = Cyborg('T-1000', 'RSB-80 Plasma Gun')
terminators = set()
terminators.add(robot)
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) <ipython-input-174-3b61c68d3401> in <module>() 10 robot = Cyborg('T-1000', 'RSB-80 Plasma Gun') 11 terminators = set() ---> 12 terminators.add(robot) TypeError: unhashable type: 'Cyborg'
We revisited and understood the intimidating paragraph from the docs.
__eq__()
, don't define __hash__()
either.__hash__ = None
.__eq__()
, implement __hash__()
for it to be hashable.The whole point of using hash tables is to get $\mathcal{O}(1)$ operations.
Say we use a dictionary to map every number to its square:
{x: x ** 2 for x in range(10_000)}
{0: 0, 1: 1, 2: 4, 3: 9, 4: 16, 5: 25, 6: 36, 7: 49, 8: 64, 9: 81, 10: 100, 11: 121, 12: 144, 13: 169, 14: 196, 15: 225, 16: 256, 17: 289, 18: 324, 19: 361, 20: 400, 21: 441, 22: 484, 23: 529, 24: 576, 25: 625, 26: 676, 27: 729, 28: 784, 29: 841, 30: 900, 31: 961, 32: 1024, 33: 1089, 34: 1156, 35: 1225, 36: 1296, 37: 1369, 38: 1444, 39: 1521, 40: 1600, 41: 1681, 42: 1764, 43: 1849, 44: 1936, 45: 2025, 46: 2116, 47: 2209, 48: 2304, 49: 2401, 50: 2500, 51: 2601, 52: 2704, 53: 2809, 54: 2916, 55: 3025, 56: 3136, 57: 3249, 58: 3364, 59: 3481, 60: 3600, 61: 3721, 62: 3844, 63: 3969, 64: 4096, 65: 4225, 66: 4356, 67: 4489, 68: 4624, 69: 4761, 70: 4900, 71: 5041, 72: 5184, 73: 5329, 74: 5476, 75: 5625, 76: 5776, 77: 5929, 78: 6084, 79: 6241, 80: 6400, 81: 6561, 82: 6724, 83: 6889, 84: 7056, 85: 7225, 86: 7396, 87: 7569, 88: 7744, 89: 7921, 90: 8100, 91: 8281, 92: 8464, 93: 8649, 94: 8836, 95: 9025, 96: 9216, 97: 9409, 98: 9604, 99: 9801, 100: 10000, 101: 10201, 102: 10404, 103: 10609, 104: 10816, 105: 11025, 106: 11236, 107: 11449, 108: 11664, 109: 11881, 110: 12100, 111: 12321, 112: 12544, 113: 12769, 114: 12996, 115: 13225, 116: 13456, 117: 13689, 118: 13924, 119: 14161, 120: 14400, 121: 14641, 122: 14884, 123: 15129, 124: 15376, 125: 15625, 126: 15876, 127: 16129, 128: 16384, 129: 16641, 130: 16900, 131: 17161, 132: 17424, 133: 17689, 134: 17956, 135: 18225, 136: 18496, 137: 18769, 138: 19044, 139: 19321, 140: 19600, 141: 19881, 142: 20164, 143: 20449, 144: 20736, 145: 21025, 146: 21316, 147: 21609, 148: 21904, 149: 22201, 150: 22500, 151: 22801, 152: 23104, 153: 23409, 154: 23716, 155: 24025, 156: 24336, 157: 24649, 158: 24964, 159: 25281, 160: 25600, 161: 25921, 162: 26244, 163: 26569, 164: 26896, 165: 27225, 166: 27556, 167: 27889, 168: 28224, 169: 28561, 170: 28900, 171: 29241, 172: 29584, 173: 29929, 174: 30276, 175: 30625, 176: 30976, 177: 31329, 178: 31684, 179: 32041, 180: 32400, 181: 32761, 182: 33124, 183: 33489, 184: 33856, 185: 34225, 186: 34596, 187: 34969, 188: 35344, 189: 35721, 190: 36100, 191: 36481, 192: 36864, 193: 37249, 194: 37636, 195: 38025, 196: 38416, 197: 38809, 198: 39204, 199: 39601, 200: 40000, 201: 40401, 202: 40804, 203: 41209, 204: 41616, 205: 42025, 206: 42436, 207: 42849, 208: 43264, 209: 43681, 210: 44100, 211: 44521, 212: 44944, 213: 45369, 214: 45796, 215: 46225, 216: 46656, 217: 47089, 218: 47524, 219: 47961, 220: 48400, 221: 48841, 222: 49284, 223: 49729, 224: 50176, 225: 50625, 226: 51076, 227: 51529, 228: 51984, 229: 52441, 230: 52900, 231: 53361, 232: 53824, 233: 54289, 234: 54756, 235: 55225, 236: 55696, 237: 56169, 238: 56644, 239: 57121, 240: 57600, 241: 58081, 242: 58564, 243: 59049, 244: 59536, 245: 60025, 246: 60516, 247: 61009, 248: 61504, 249: 62001, 250: 62500, 251: 63001, 252: 63504, 253: 64009, 254: 64516, 255: 65025, 256: 65536, 257: 66049, 258: 66564, 259: 67081, 260: 67600, 261: 68121, 262: 68644, 263: 69169, 264: 69696, 265: 70225, 266: 70756, 267: 71289, 268: 71824, 269: 72361, 270: 72900, 271: 73441, 272: 73984, 273: 74529, 274: 75076, 275: 75625, 276: 76176, 277: 76729, 278: 77284, 279: 77841, 280: 78400, 281: 78961, 282: 79524, 283: 80089, 284: 80656, 285: 81225, 286: 81796, 287: 82369, 288: 82944, 289: 83521, 290: 84100, 291: 84681, 292: 85264, 293: 85849, 294: 86436, 295: 87025, 296: 87616, 297: 88209, 298: 88804, 299: 89401, 300: 90000, 301: 90601, 302: 91204, 303: 91809, 304: 92416, 305: 93025, 306: 93636, 307: 94249, 308: 94864, 309: 95481, 310: 96100, 311: 96721, 312: 97344, 313: 97969, 314: 98596, 315: 99225, 316: 99856, 317: 100489, 318: 101124, 319: 101761, 320: 102400, 321: 103041, 322: 103684, 323: 104329, 324: 104976, 325: 105625, 326: 106276, 327: 106929, 328: 107584, 329: 108241, 330: 108900, 331: 109561, 332: 110224, 333: 110889, 334: 111556, 335: 112225, 336: 112896, 337: 113569, 338: 114244, 339: 114921, 340: 115600, 341: 116281, 342: 116964, 343: 117649, 344: 118336, 345: 119025, 346: 119716, 347: 120409, 348: 121104, 349: 121801, 350: 122500, 351: 123201, 352: 123904, 353: 124609, 354: 125316, 355: 126025, 356: 126736, 357: 127449, 358: 128164, 359: 128881, 360: 129600, 361: 130321, 362: 131044, 363: 131769, 364: 132496, 365: 133225, 366: 133956, 367: 134689, 368: 135424, 369: 136161, 370: 136900, 371: 137641, 372: 138384, 373: 139129, 374: 139876, 375: 140625, 376: 141376, 377: 142129, 378: 142884, 379: 143641, 380: 144400, 381: 145161, 382: 145924, 383: 146689, 384: 147456, 385: 148225, 386: 148996, 387: 149769, 388: 150544, 389: 151321, 390: 152100, 391: 152881, 392: 153664, 393: 154449, 394: 155236, 395: 156025, 396: 156816, 397: 157609, 398: 158404, 399: 159201, 400: 160000, 401: 160801, 402: 161604, 403: 162409, 404: 163216, 405: 164025, 406: 164836, 407: 165649, 408: 166464, 409: 167281, 410: 168100, 411: 168921, 412: 169744, 413: 170569, 414: 171396, 415: 172225, 416: 173056, 417: 173889, 418: 174724, 419: 175561, 420: 176400, 421: 177241, 422: 178084, 423: 178929, 424: 179776, 425: 180625, 426: 181476, 427: 182329, 428: 183184, 429: 184041, 430: 184900, 431: 185761, 432: 186624, 433: 187489, 434: 188356, 435: 189225, 436: 190096, 437: 190969, 438: 191844, 439: 192721, 440: 193600, 441: 194481, 442: 195364, 443: 196249, 444: 197136, 445: 198025, 446: 198916, 447: 199809, 448: 200704, 449: 201601, 450: 202500, 451: 203401, 452: 204304, 453: 205209, 454: 206116, 455: 207025, 456: 207936, 457: 208849, 458: 209764, 459: 210681, 460: 211600, 461: 212521, 462: 213444, 463: 214369, 464: 215296, 465: 216225, 466: 217156, 467: 218089, 468: 219024, 469: 219961, 470: 220900, 471: 221841, 472: 222784, 473: 223729, 474: 224676, 475: 225625, 476: 226576, 477: 227529, 478: 228484, 479: 229441, 480: 230400, 481: 231361, 482: 232324, 483: 233289, 484: 234256, 485: 235225, 486: 236196, 487: 237169, 488: 238144, 489: 239121, 490: 240100, 491: 241081, 492: 242064, 493: 243049, 494: 244036, 495: 245025, 496: 246016, 497: 247009, 498: 248004, 499: 249001, 500: 250000, 501: 251001, 502: 252004, 503: 253009, 504: 254016, 505: 255025, 506: 256036, 507: 257049, 508: 258064, 509: 259081, 510: 260100, 511: 261121, 512: 262144, 513: 263169, 514: 264196, 515: 265225, 516: 266256, 517: 267289, 518: 268324, 519: 269361, 520: 270400, 521: 271441, 522: 272484, 523: 273529, 524: 274576, 525: 275625, 526: 276676, 527: 277729, 528: 278784, 529: 279841, 530: 280900, 531: 281961, 532: 283024, 533: 284089, 534: 285156, 535: 286225, 536: 287296, 537: 288369, 538: 289444, 539: 290521, 540: 291600, 541: 292681, 542: 293764, 543: 294849, 544: 295936, 545: 297025, 546: 298116, 547: 299209, 548: 300304, 549: 301401, 550: 302500, 551: 303601, 552: 304704, 553: 305809, 554: 306916, 555: 308025, 556: 309136, 557: 310249, 558: 311364, 559: 312481, 560: 313600, 561: 314721, 562: 315844, 563: 316969, 564: 318096, 565: 319225, 566: 320356, 567: 321489, 568: 322624, 569: 323761, 570: 324900, 571: 326041, 572: 327184, 573: 328329, 574: 329476, 575: 330625, 576: 331776, 577: 332929, 578: 334084, 579: 335241, 580: 336400, 581: 337561, 582: 338724, 583: 339889, 584: 341056, 585: 342225, 586: 343396, 587: 344569, 588: 345744, 589: 346921, 590: 348100, 591: 349281, 592: 350464, 593: 351649, 594: 352836, 595: 354025, 596: 355216, 597: 356409, 598: 357604, 599: 358801, 600: 360000, 601: 361201, 602: 362404, 603: 363609, 604: 364816, 605: 366025, 606: 367236, 607: 368449, 608: 369664, 609: 370881, 610: 372100, 611: 373321, 612: 374544, 613: 375769, 614: 376996, 615: 378225, 616: 379456, 617: 380689, 618: 381924, 619: 383161, 620: 384400, 621: 385641, 622: 386884, 623: 388129, 624: 389376, 625: 390625, 626: 391876, 627: 393129, 628: 394384, 629: 395641, 630: 396900, 631: 398161, 632: 399424, 633: 400689, 634: 401956, 635: 403225, 636: 404496, 637: 405769, 638: 407044, 639: 408321, 640: 409600, 641: 410881, 642: 412164, 643: 413449, 644: 414736, 645: 416025, 646: 417316, 647: 418609, 648: 419904, 649: 421201, 650: 422500, 651: 423801, 652: 425104, 653: 426409, 654: 427716, 655: 429025, 656: 430336, 657: 431649, 658: 432964, 659: 434281, 660: 435600, 661: 436921, 662: 438244, 663: 439569, 664: 440896, 665: 442225, 666: 443556, 667: 444889, 668: 446224, 669: 447561, 670: 448900, 671: 450241, 672: 451584, 673: 452929, 674: 454276, 675: 455625, 676: 456976, 677: 458329, 678: 459684, 679: 461041, 680: 462400, 681: 463761, 682: 465124, 683: 466489, 684: 467856, 685: 469225, 686: 470596, 687: 471969, 688: 473344, 689: 474721, 690: 476100, 691: 477481, 692: 478864, 693: 480249, 694: 481636, 695: 483025, 696: 484416, 697: 485809, 698: 487204, 699: 488601, 700: 490000, 701: 491401, 702: 492804, 703: 494209, 704: 495616, 705: 497025, 706: 498436, 707: 499849, 708: 501264, 709: 502681, 710: 504100, 711: 505521, 712: 506944, 713: 508369, 714: 509796, 715: 511225, 716: 512656, 717: 514089, 718: 515524, 719: 516961, 720: 518400, 721: 519841, 722: 521284, 723: 522729, 724: 524176, 725: 525625, 726: 527076, 727: 528529, 728: 529984, 729: 531441, 730: 532900, 731: 534361, 732: 535824, 733: 537289, 734: 538756, 735: 540225, 736: 541696, 737: 543169, 738: 544644, 739: 546121, 740: 547600, 741: 549081, 742: 550564, 743: 552049, 744: 553536, 745: 555025, 746: 556516, 747: 558009, 748: 559504, 749: 561001, 750: 562500, 751: 564001, 752: 565504, 753: 567009, 754: 568516, 755: 570025, 756: 571536, 757: 573049, 758: 574564, 759: 576081, 760: 577600, 761: 579121, 762: 580644, 763: 582169, 764: 583696, 765: 585225, 766: 586756, 767: 588289, 768: 589824, 769: 591361, 770: 592900, 771: 594441, 772: 595984, 773: 597529, 774: 599076, 775: 600625, 776: 602176, 777: 603729, 778: 605284, 779: 606841, 780: 608400, 781: 609961, 782: 611524, 783: 613089, 784: 614656, 785: 616225, 786: 617796, 787: 619369, 788: 620944, 789: 622521, 790: 624100, 791: 625681, 792: 627264, 793: 628849, 794: 630436, 795: 632025, 796: 633616, 797: 635209, 798: 636804, 799: 638401, 800: 640000, 801: 641601, 802: 643204, 803: 644809, 804: 646416, 805: 648025, 806: 649636, 807: 651249, 808: 652864, 809: 654481, 810: 656100, 811: 657721, 812: 659344, 813: 660969, 814: 662596, 815: 664225, 816: 665856, 817: 667489, 818: 669124, 819: 670761, 820: 672400, 821: 674041, 822: 675684, 823: 677329, 824: 678976, 825: 680625, 826: 682276, 827: 683929, 828: 685584, 829: 687241, 830: 688900, 831: 690561, 832: 692224, 833: 693889, 834: 695556, 835: 697225, 836: 698896, 837: 700569, 838: 702244, 839: 703921, 840: 705600, 841: 707281, 842: 708964, 843: 710649, 844: 712336, 845: 714025, 846: 715716, 847: 717409, 848: 719104, 849: 720801, 850: 722500, 851: 724201, 852: 725904, 853: 727609, 854: 729316, 855: 731025, 856: 732736, 857: 734449, 858: 736164, 859: 737881, 860: 739600, 861: 741321, 862: 743044, 863: 744769, 864: 746496, 865: 748225, 866: 749956, 867: 751689, 868: 753424, 869: 755161, 870: 756900, 871: 758641, 872: 760384, 873: 762129, 874: 763876, 875: 765625, 876: 767376, 877: 769129, 878: 770884, 879: 772641, 880: 774400, 881: 776161, 882: 777924, 883: 779689, 884: 781456, 885: 783225, 886: 784996, 887: 786769, 888: 788544, 889: 790321, 890: 792100, 891: 793881, 892: 795664, 893: 797449, 894: 799236, 895: 801025, 896: 802816, 897: 804609, 898: 806404, 899: 808201, 900: 810000, 901: 811801, 902: 813604, 903: 815409, 904: 817216, 905: 819025, 906: 820836, 907: 822649, 908: 824464, 909: 826281, 910: 828100, 911: 829921, 912: 831744, 913: 833569, 914: 835396, 915: 837225, 916: 839056, 917: 840889, 918: 842724, 919: 844561, 920: 846400, 921: 848241, 922: 850084, 923: 851929, 924: 853776, 925: 855625, 926: 857476, 927: 859329, 928: 861184, 929: 863041, 930: 864900, 931: 866761, 932: 868624, 933: 870489, 934: 872356, 935: 874225, 936: 876096, 937: 877969, 938: 879844, 939: 881721, 940: 883600, 941: 885481, 942: 887364, 943: 889249, 944: 891136, 945: 893025, 946: 894916, 947: 896809, 948: 898704, 949: 900601, 950: 902500, 951: 904401, 952: 906304, 953: 908209, 954: 910116, 955: 912025, 956: 913936, 957: 915849, 958: 917764, 959: 919681, 960: 921600, 961: 923521, 962: 925444, 963: 927369, 964: 929296, 965: 931225, 966: 933156, 967: 935089, 968: 937024, 969: 938961, 970: 940900, 971: 942841, 972: 944784, 973: 946729, 974: 948676, 975: 950625, 976: 952576, 977: 954529, 978: 956484, 979: 958441, 980: 960400, 981: 962361, 982: 964324, 983: 966289, 984: 968256, 985: 970225, 986: 972196, 987: 974169, 988: 976144, 989: 978121, 990: 980100, 991: 982081, 992: 984064, 993: 986049, 994: 988036, 995: 990025, 996: 992016, 997: 994009, 998: 996004, 999: 998001, ...}
Lookups are $\mathcal{O}(1)$. Proof:
%%timeit squares = {x: x ** 2 for x in range(1_000)}
999 in squares
58 ns ± 6.09 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%%timeit squares = {x: x ** 2 for x in range(10_000)}
9_999 in squares
64 ns ± 9.71 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
Size of dictionary is $10x$, but it takes the same amount of time. This is $\mathcal{O}(1)$.
Let's subclass int
and make it always return the same hash
class TrollInt:
def __init__(self, value):
self.value = value
def __eq__(self, other):
return self.value == other.value
def __hash__(self):
return 1
This means that they'll all go the same slot.
Python will need to resolve (endless) collisions
%%timeit squares = {TrollInt(x): TrollInt(x ** 2) for x in range(1_000)}
squares[TrollInt(999)]
379 µs ± 89.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit squares = {TrollInt(x): TrollInt(x ** 2) for x in range(10_000)}
squares[TrollInt(9_999)]
5.28 ms ± 1.4 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
We've turned our $\mathcal{O}(1)$ operations into $\mathcal{O}(n)$.
Problem?