Motivation¶

If a class does not define an __eq__() method it should not define a __hash__() operation either; if it defines __eq__() but not __hash__(), its instances will not be usable as items in hashable collections. If a class defines mutable objects and implements an __eq__() method, it should not implement __hash__(), since the implementation of hashable collections requires that a key’s hash value is immutable (if the object’s hash value changes, it will be in the wrong hash bucket).

...why?

This talk is a foray into the Python data model so that we understand the reasoning behind this advice. And, above everything else, just an excuse to talk about hash tables.

Outline¶

Object equality versus object identity.
Rich comparison operators for our classes.
Python dictionaries and sets are hash tables.
Fundamentals of hash tables, $\mathcal{O}(1)$ for the win.
The properties of a good hash function.
Implementing our own, full-fledged hash table.
Revising the initial paragraph — we're now ready.
Understanding —and remembering— why.

On your marks, get set, go!

Outline¶

Python dictionaries and sets are hash tables.
Fundamentals of hash tables, $\mathcal{O}(1)$ for the win.
The properties of a good hash function.
Implementing our own, full-fledged hash table.
Revising the initial paragraph — we're now ready.
Understanding —and remembering— why.

On your marks, get set, go!

Object equality versus object identity¶

`==` tests for equality¶

The == operator is the object value comparison operator.

It tests whether two objects have the same value.

In [2]:

x = [2, 3]
y = [2, 3]

x == y

Out[2]:

True

Both variables have the same value, so the result of the comparison is True.

`eq()`¶

In the Python data model, equality is implemented via __eq__() and its counterpart, __ne__().

`is` tests for identity¶

The is operator, on the other hand, is the object identity comparison operator.

It tests whether two objects are the same.

In [3]:

x = [2, 3]
y = [2, 3]

x is y

Out[3]:

False

Although x and y have the same value, they are different objects, so the comparison with is returns False.

`id()`¶

Identity is checked by comparing the built-in function id(). It returns the "identity" of an object: an integer guaranteed to be unique and constant for each object during its lifetime.

In [4]:

x = []
id(x)

Out[4]:

140116482324168

In [5]:

y = []
id(y)

Out[5]:

140116482324488

In [6]:

x is y

Out[6]:

False

In CPython this number is in fact the address of the object in memory.

However...¶

Also known as "Trivia Facts, Part I".

Integers¶

What we just said doesn't work as we might expect with integers.

In [7]:

x = 23
id(x)

Out[7]:

94149475332416

In [8]:

y = 23
id(x)

Out[8]:

94149475332416

In [9]:

x is y

Out[9]:

True

What? But... they're different objects... right?

And to make things even worse:

In [10]:

a = 581
id(a)

Out[10]:

140116482290160

In [11]:

b = 581
id(b)

Out[11]:

140116482290224

In [12]:

a is b

Out[12]:

False

The difference is that the current implementation (of, again, CPython) keeps an array of integer objects for all integers between −5 and 256. When we create an integer in that range what we are actually getting is just a reference back to that existing object.

In other words: in CPython, integers between −5 and 256 are singletons.

This is an implementation detail. For example, until February 2006 this happened only for values [−5, 99]. It's undocumented for a reason. Never rely on this!

Strings¶

The same behavior may be exhibited by small strings literals.

In [13]:

x = "dog"
id(x)

Out[13]:

140116481858392

In [14]:

y = "dog"
id(y)

Out[14]:

140116481858392

In [15]:

x is y

Out[15]:

True

These strings are interned.

String interning¶

From Wikipedia [emphasis added]:

In computer science, string interning is a method of storing only one copy of each distinct string value, which must be immutable. Interning strings makes some string processing tasks more time- or space-efficient at the cost of requiring more time when the string is created or interned.

In [16]:

x = "".join(["d", "o", "g"])
y = "dog"
x is y

Out[16]:

False

Why the difference? Note that we said earlier that this behavior is exhibited for literals. Here, x is an expression. Because Python does not know its value until runtime, x is not interned and therefore it's a different object than y.

We can force a string to be interned, however, using sys.intern().

In [17]:

import sys

x = sys.intern("".join(["a", "b", "c"]))
y = "abc"

x is y

Out[17]:

True

How is this useful?¶

From the Python docs:

Interning strings is useful to gain a little performance on dictionary lookup – if the keys in a dictionary are interned, and the lookup key is interned, the key comparisons (after hashing) can be done by a pointer compare instead of a string compare.

A Python class of our own¶

Let's move beyond built-in types and define our own class, Circle.

In [18]:

class Circle:
    
    def __init__(self, x, y, radius):
        self.x = x
        self.y = y
        self.radius = radius
        
    def __repr__(self):
        return "Circle(x={}, y={}, radius={})".format(
            self.x, self.y, self.radius)


c = Circle(2, 3, 5)
print(c)

Circle(x=2, y=3, radius=5)

To avoid hard-coding the name of the class, we can use instead:

In [19]:

class Circle:
    
    def __init__(self, x, y, radius):
        self.x = x
        self.y = y
        self.radius = radius
        
    def __repr__(self):
        return "{}(x={}, y={}, radius={})".format(
            type(self).__name__,  # note this
            self.x,
            self.y,
            self.radius)

However...¶

The comparison between the instances of our class is broken!

In [20]:

c1 = Circle(2, 3, 5)
c2 = Circle(2, 3, 5)
c1 == c2

Out[20]:

False

We didn't expect this — the values are the same, so they should compare equal.

In [21]:

print(c1.x == c2.x)
print(c1.y == c2.y)
print(c1.radius == c2.radius)

True
True
True

We didn't define `eq()`¶

From the Python docs:

User-defined classes have [a] __eq__() [method] by default; with [it], all objects compare unequal (except with themselves) [...]

In other words, if we don't define __eq__() by default it uses is.

And c1 and c2 are different objects.

In [22]:

c1 is c2

Out[22]:

False

Rich comparison methods¶

For our classes to be fully comparable, they must implement the "rich" comparison methods.

Each comparison operation calls one of these methods. The correspondences are:

__lt__() $\rightarrow$ x < y
__le__() $\rightarrow$ x <= y
__eq__() $\rightarrow$ x == y
__ne__() $\rightarrow$ x != y
__gt__() $\rightarrow$ x > y
__ge__() $\rightarrow$ x >= y

For example, so that we can use == with our class:

In [23]:

class Circle:
    
    def __init__(self, x, y, radius):
        self.x = x
        self.y = y
        self.radius = radius
        
    def __eq__(self, other):
        return self.x == other.x and \
               self.y == other.y and \
               self.radius == other.radius
    
    
c1 = Circle(2, 3, 5)
c2 = Circle(2, 3, 5)
c1 == c2

Out[23]:

True

`bool()`¶

We could return anything, as Python will call bool() to determine the truth value.

In [188]:

class Circle:
    
    def __init__(self, x, y, radius):
        self.x = x
        self.y = y
        self.radius = radius
        
    def __eq__(self, other):
        if self.x == other.x and \
           self.y == other.y and \
           self.radius == other.radius:
            return "Yes, they are equal!"
        else:
            return "Nope, they're different"
    
    
c1 = Circle(2, 3, 5)
c2 = Circle(2, 3, 5)
c1 == c2

Out[188]:

'Yes, they are equal!'

As the condition of an if statement:

In [25]:

if c1 == c2:
    print("I already told you, they're equal!")

I already told you, they're equal!

But the truth value of any non-empty string is True:

In [26]:

c1 = Circle(1, 3, 9)
c2 = Circle(2, 5, 7)

if c1 == c2:  # bool("Nope, they're different")
    print("These circles are equal too!")

Dont' do this. Return True or False.

`NotImplemented`¶

If the comparison is not defined, we return the singleton NotImplemented.
It's a singleton, not an exception!
This makes Python understand and complain that the two objects cannot be compared.

In [27]:

c1 = Circle(2, 3, 5)
c2 = Circle(2, 3, 5)
c1 < c2

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-27-6b88d8748591> in <module>()
      1 c1 = Circle(2, 3, 5)
      2 c2 = Circle(2, 3, 5)
----> 3 c1 < c2

TypeError: unorderable types: Circle() < Circle()

The < operator calls __lt__(), but we haven't defined it. Python returns NotImplemented, which in turn raises TypeError.

Reminder: we exclusively have a __eq__() by default, comparing equal only to ourselves.

Returning `NotImplemented` ourselves¶

If we want to make sure that the object is only compared to another object of the same type:

In [28]:

class Circle:
    
    def __init__(self, x, y, radius):
        self.x = x
        self.y = y
        self.radius = radius
        
    def __eq__(self, other):
        if type(other) != type(self):
            return NotImplemented
        return self.x == other.x and \
               self.y == other.y and \
               self.radius == other.radius
    
    
c1 = Circle(2, 3, 5)
c1 == 5

Out[28]:

False

To account and allow for subclasses:

In [29]:

class Circle:
    
    def __init__(self, x, y, radius):
        self.x = x
        self.y = y
        self.radius = radius
        
    def __eq__(self, other):
        if not isinstance(other, type(self)):  # instead of type() == type()
            return NotImplemented
        return self.x == other.x and \
               self.y == other.y and \
               self.radius == other.radius

Reflections¶

`lt()` and `gt()`¶

Let's say we want to sort our Circle object by increasing area.

In [30]:

class Circle:
    
    def __init__(self, x, y, radius):
        self.x = x
        self.y = y
        self.radius = radius
        
    def __lt__(self, other):
        return self.radius < other.radius

It works as we would expect:

In [31]:

c1 = Circle(2, 5, 13)
c2 = Circle(3, 7, 11)
c1 < c2

Out[31]:

False

However...

In [32]:

c1 > c2

Out[32]:

True

How?! We didn't define __gt__(), but Python is not complaining.

From the Python docs:

There are no swapped-argument versions of these methods (to be used when the left argument does not support the operation but the right argument does); rather, __lt__() and __gt__() are each other’s reflection [...]

This is what happened here: Circle doesn't define __gt__(), so...

In [33]:

c1.__gt__(c2)

Out[33]:

NotImplemented

... was translated into:

In [34]:

c2.__lt__(c1)

Out[34]:

True

Technically speaking, we fallback to the right operand’s reflected method.

Let's verify it via the good ol' printf debugging technique.

In [35]:

class Circle:
    
    def __init__(self, x, y, radius):
        self.x = x
        self.y = y
        self.radius = radius
        
    def __repr__(self):
        return "{}(x={}, y={}, radius={})".format(
            type(self).__name__, self.x, self.y, self.radius)
        
    def __lt__(self, other):
        print("called {} < {}".format(self, other))  # we added this
        return self.radius < other.radius


c1 = Circle(2, 5, 13)
c2 = Circle(3, 7, 11)
c1 > c2

called Circle(x=3, y=7, radius=11) < Circle(x=2, y=5, radius=13)

Out[35]:

True

So, indeed, Python resorted to the reflection of the method we have.

`le()` and `ge()`¶

The same happens for __le__() and __ge__().

In [36]:

class Circle:
    
    def __init__(self, x, y, radius):
        self.x = x
        self.y = y
        self.radius = radius
        
    def __repr__(self):
        return "{}(x={}, y={}, radius={})".format(
            type(self).__name__, self.x, self.y, self.radius)
        
    def __ge__(self, other):
        print("called {} >= {}".format(self, other))  # more printf debugging
        return self.radius >= other.radius


c1 = Circle(2, 5, 11)
c2 = Circle(3, 7, 13)
c1 <= c2

called Circle(x=3, y=7, radius=13) >= Circle(x=2, y=5, radius=11)

Out[36]:

True

`eq()` and `ne()`¶

Finally, __eq__() and __ne__() are their own reflection.

This means that if x.__eq__(y) is not defined, we try to use y.__eq__(x).

Don't see why it's useful? You're not alone.

Let's go back to an earlier example.

In [37]:

class Circle:
    """A regular circle."""

    def __init__(self, x, y, radius):
        self.x = x
        self.y = y
        self.radius = radius
        
    def __repr__(self):
        return "{}(x={}, y={}, radius={})".format(
            type(self).__name__, self.x, self.y, self.radius)
        
    def __eq__(self, other):
        print("calling {} == {}".format(self, other))
        if not isinstance(other, type(self)):  # we allow for subclasses
            return NotImplemented
        return self.x == other.x and \
               self.y == other.y and \
               self.radius == other.radius


class PremiumCircle(Circle): 
    "A luxury Circle, with high quality values."
    pass  # it's the exact same thing, of course

Focus your attention on __eq__():

In [38]:

def __eq__(self, other):
    if not isinstance(other, type(self)):  # we allow for subclasses
        return NotImplemented
    return self.x == other.x and \
           self.y == other.y and \
           self.radius == other.radius

if not isinstance(other, type(self)):  # we allow for subclasses
    return NotImplemented

This means that __eq__() would only work if the subclass is the right side of == (and, therefore, binds to other), because...

In [40]:

issubclass(PremiumCircle, Circle)

Out[40]:

True

... but...

In [41]:

issubclass(Circle, PremiumCircle)

Out[41]:

False

... so __eq__() will return NotImplemented.

However...

In [42]:

c1 = PremiumCircle(4, 5, 8)
c2 = Circle(2, 3, 5)
c1 == c2  # subclass on the *left* side!

calling PremiumCircle(x=4, y=5, radius=8) == Circle(x=2, y=3, radius=5)
calling Circle(x=2, y=3, radius=5) == PremiumCircle(x=4, y=5, radius=8)

Out[42]:

False

Works! This is, again, because of reflection. Note how we called __eq__() twice.

PremiumCircle.__eq__(Circle) returns NotImplemented, because PremiumCircle is not a subclass of Circle. Python uses reflection to try the opposite: Circle.__eq__(PremiumCircle): this comparison can take place because PremiumCircle is a subclass of Circle.

The reflection of __eq__() and __ne__() with themseves allows us not having to worry about whether self is a subclass of other or the other way around.

From the docs of NotImplemented [emphasis added]:

When a binary (or in-place) method returns NotImplemented the interpreter will try the reflected operation on the other type (or some other fallback, depending on the operator). If all attempts return NotImplemented, the interpreter will raise an appropriate exception [...]

Implied relationships¶

By default, __ne__() delegates to __eq__() and inverts the result unless it is NotImplemented.

This means that defining __eq__() gives us access to != for free.

In [43]:

class Circle:
    
    def __init__(self, x, y, radius):
        self.x = x
        self.y = y
        self.radius = radius

    def __eq__(self, other):
        return self.x == other.x and \
               self.y == other.y and \
               self.radius == other.radius
    

c1 = Circle(2, 5, 11)
c2 = Circle(3, 7, 13)
c1 != c2

Out[43]:

True

And nothing else¶

There are no other implied relationships among the comparison operators.

Python cannot do smart things like figuring out that if x >= y and x != y, then if follows that x > y.

Humans understand that if (a) x is greater than or equal to y and (b) they're not equal, that means that x must be greater than y. Python cannot.

In [44]:

class Circle:
    
    def __init__(self, x, y, radius):
        self.x = x
        self.y = y
        self.radius = radius
        
    def __eq__(self, other):
        return self.x == other.x and \
               self.y == other.y and \
               self.radius == other.radius
    
    def __ge__(self, other):
        return self.radius >= other.radius
    
    
c1 = Circle(2, 5, 11)
c2 = Circle(3, 7, 13)
c1 > c2 

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-44-ce709a35ae78> in <module>()
     15 c1 = Circle(2, 5, 11)
     16 c2 = Circle(3, 7, 13)
---> 17 c1 > c2

TypeError: unorderable types: Circle() > Circle()

So, all of them?¶

So does this mean the in practice we need to define all the comparison methods?

In [45]:

class Circle:
    
    def __init__(self, x, y, radius):
        self.x, self.y, self.radius = x, y, radius

    def __eq__(self, other):
        return self.x == other.x and \
               self.y == other.y and \
               self.radius == other.radius
    
    def __ne__(self, other):
        return not self == other
    
    def __ge__(self, other):
        return self.radius >= other.radius
    
    def __le__(self, other):
        return self.radius <= other.radius
    
    def __gt__(self, other):
        return self.radius > other.radius
    
    def __lt__(self, other):
        return self.radius < other.radius

Not all, because:¶

!= returns the opposite of ==, so we don't have to implement __ne__().
For __le__() we can use the reflection of __gt__() (or vice versa).
For __gt__() we can use the reflection of __lt__() (or vice versa).

So we can leave it in:

In [46]:

class Circle:
    
    def __init__(self, x, y, radius):
        self.x = x
        self.y = y
        self.radius = radius
        
    def __eq__(self, other):
        return self.x == other.x and \
               self.y == other.y and \
               self.radius == other.radius

    def __ge__(self, other):
        return self.radius >= other.radius
    
    def __gt__(self, other):
        return self.radius > other.radius

`functools.total_ordering()`¶

We can forget all these rules and let functools.total_ordering() do all the work.

It just needs __eq__() and one more (any) comparison operator to derive the rest for us automatically.

In [47]:

import functools

@functools.total_ordering
class Circle:
    
    def __init__(self, x, y, radius):
        self.x = x
        self.y = y
        self.radius = radius
        
    def __eq__(self, other):
        return self.x == other.x and \
               self.y == other.y and \
               self.radius == other.radius

    def __ge__(self, other):
        return self.radius >= other.radius

The name total ordering comes from mathematics: "binary relation on some set X, which is antisymmetric, transitive, and a connex relation". In human-being-speak we can say that a total order is that of a set of elements in which all the possible pairs are comparable.

The downside, from the Python docs [emphasis added]:

[...] it does come at the cost of slower execution and more complex stack traces for the derived comparison methods. If performance benchmarking indicates this is a bottleneck for a given application, implementing all six rich comparison methods instead is likely to provide an easy speed boost.

Prehistory's `cmp()`¶

Before Python 3, if the rich comparison methods were not implemented, Python falled back to a different magic method: __cmp__().

object.__cmp__(self, other) would return:

A negative integer if self < other.
Zero if self == other.
a positive integer if self > other.

Thus, our early ancestors could choose between implementing __cmp__() or the rich comparison methods — named like this to differentiate them from the more primitive alternative.

In Python 3 __cmp__() no longer exists.

Wrapping up this section¶

We now know how to create well behaved totally ordered classes.

== is for equality...
... and is for identity.
The default value of __eq__().
Rich comparison operators:
- Return NotImplemented if comparison is undefined.
- Fallback to the right operand's reflected method.
- Have only one implied relationsip: __ne__() $\leftrightarrow$ __eq__()
functools.total_ordering() simplifies our life.
__cmp__() exists only in history books.

Photo by Pixabay | CC0

All right, I guess — but do we really care?¶

We do care if we want to use our classes in a Python dictionary or a set.

Mandatory disclaimer¶

Everything to be known about dictionaries was said by Brandon Rhodes back in 2010.

The Mighty Dictionary (Brandon Rhodes, PyCon 2010).

One-minute overview¶

If we work with a a Python list, we access elements by their index number.

In [48]:

numbers = [2, 3, 5, 7, 11]

In [49]:

numbers[0]

Out[49]:

In [50]:

numbers[2]

Out[50]:

In [51]:

numbers[-1]

Out[51]:

A bijective function maps each index element to the corresponding value.

In [52]:

numbers = [2, 3, 5, 7, 11]

For example, index two (i.e. third element in the array) maps to value 5.

However, we cannot easily go the other way around. Given, say, the value 7:

Is it in the array?
What's its index?

The answer to these questions is $\mathcal{O}(n)$.

In [53]:

7 in numbers  # O(n)

Out[53]:

True

Same for:

In [54]:

numbers.index(7)  # O(n)

Out[54]:

Python is doing the equivalent of:

In [55]:

def index(iterable, wanted):
    """Return index of 'wanted' in 'iterable'; -1 if not found."""
    
    for index, element in enumerate(iterable):
        if element == wanted:
            return index
    return -1


index(numbers, 7)

Out[55]:

To make it explicit: we have to visit all the elements in the iterable, so $\mathcal{O}(n)$.

Dictionaries¶

Python dictionaries are a mapping type.
Instead of indexing elements by a range of numbers, we use keys.
Keys can be any immutable (more of that later) type, e.g. numbers or strings.
They map each key to the corresponding value.

In [56]:

numbers = {}  # now a dict
numbers[1] = "one"
numbers[3] = "three"
numbers[7] = "seven"

In [57]:

numbers[3]

Out[57]:

'three'

In [58]:

5 in numbers

Out[58]:

False

In [59]:

del numbers[1]

These operations, unlike for the list, are $\mathcal{O}(1)$. How? Because dictionaries are Python's name for...

Hash tables¶

The most important data structure known to Humankind, as Steve Yegge put it.
Formally known as unordered associative array, invented in 1953.
Insertion, deletion and lookups have an amortized cost of $\mathcal{O}(1)$ (!)
Speed, speed, speed. Instant lookup of a value, regardless of how many we have!

How do hash tables achieve this?

A dictionary is really a list¶

This is the first plot twist.

A hash table is nothing else than a list (or array, in non-Python-specific terminology).

Let's create our hash table:

In [60]:

table = [None, None, None, None, None]

Let's visualize it vertically:

In [61]:

table = [
    None,  # index = 0
    None,  # index = 1
    None,  # index = 2
    None,  # index = 3
    None,  # index = 4
]

The difference is that the index of each element depends on its value.
Given a value, we can compute its index, and directly go there in $\mathcal{O}(1)$.
In this manner, we know exactly where to look up the element — no need to loop over all of them.

How do we compute the indexes?

Hash functions¶

From Wikipedia:

A hash function is any function that can be used to map data of arbitrary size to data of a fixed size.

It's a mathematical function that maps data to some smaller value.
We move from the domain of the hash function (the set of possible input values)...
... to a smaller domain: the number of outputs of the hash function.
The hash function compresses our data to a fixed size.

A simple hash function¶

In [62]:

def simple_hash(str_):
    return len(str_)

In [63]:

simple_hash("dog")

Out[63]:

In [64]:

simple_hash("tiger")

Out[64]:

In [65]:

simple_hash("Tyrannosaurus Rex")

Out[65]:

We take data (strings, in this case) and return a numeric value.

Using our hash function to compute indexes¶

In [66]:

table = [
    None,  # index = 0
    None,  # index = 1
    None,  # index = 2
    None,  # index = 3
    None,  # index = 4
]

Let's get the hash of "dog":

In [67]:

word = 'dog'
word_hash = simple_hash(word)
print("Hash of", word, "is", word_hash)

Hash of dog is 3

Use the hash as the index to store the value in our table:

In [68]:

index = word_hash
table[index] = word  # O(1)
table

Out[68]:

[None, None, None, 'dog', None]

Let's now look up a value. Is "bear" in the hash table?

In [69]:

word = 'bear'
index = simple_hash(word)
table[index] != None

Out[69]:

False

First problem: out of range¶

Our function can return any value, exceeding the size of the hash table.

In [70]:

word = 'monkey'
index = simple_hash(word)
table[index] = word

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-70-4f70fec575d8> in <module>()
      1 word = 'monkey'
      2 index = simple_hash(word)
----> 3 table[index] = word

IndexError: list assignment index out of range

Solution: modulo operator¶

That is, we take the remainder after dividing by the length of the list.
Modular arithmetic, the same way we all know that 70 minutes is one hour plus 10 minutes.

So the operation becomes:

In [71]:

word = 'monkey'
index = simple_hash(word) % len(table)
print('{} % {} == {}'.format(simple_hash(word), len(table), index))

6 % 5 == 1

In [72]:

table[index] = word
table

Out[72]:

[None, 'monkey', None, 'dog', None]

Is "monkey" in the hash table?

In [73]:

word = 'monkey'
index = simple_hash(word) % len(table)
table[index] != None

Out[73]:

True

Second problem: collisions!¶

Because we transform the input values into a smaller domain, sooner or later we'll come across two different keys that map to the same hash value.

This is known as a collision.

Right now, our hash table looks like this:

In [74]:

table = [
    None,      # index = 0
    'monkey',  # index = 1
    None,      # index = 2
    'dog',     # index = 3
    None,      # index = 4
]

A wild new word appears:

In [75]:

word = 'cat'
index = simple_hash(word) % len(table)
table[index] = word
table

Out[75]:

[None, 'monkey', None, 'cat', None]

In [76]:

table

Out[76]:

[None, 'monkey', None, 'cat', None]

"cat" and "dog" have the same length.
Since our hash function is the length, we ended up in the same position.
We override our value! "dog" is gone.

This is a collision. They will happen, so our hash table needs to take them into account.

More Trivia: PHP¶

PHP functions originally bucketed by strlen, were renamed to balance length:

Well, there were other factors in play there. htmlspecialchars was a

very early function. Back when PHP had less than 100 functions and the
function hashing mechanism was strlen(). **In order to get a nice hash
distribution of function names across the various function name lengths
names were picked specifically to make them fit into a specific length
bucket**. This was circa late 1994 when PHP was a tool just for my own
personal use and I wasn't too worried about not being able to remember
the few function names. \[emphasis added\]

Most programming languages use a hash table to store symbols: e.g. function name to function object or pointer.

PHP was hashing by length, which lead to lots of collisions.

Solution? Rename functions to smooth the distribution of lengths.

From PHP has inconsistent function naming:

underscore               no underscore:
stream_get_line          readline
disk_free_space          diskfreespace
is_object                isset
mcal_day_of_week         jddayofweek

We live in the darkest timeline.

Behave, hash function!¶

Formally speaking, a good hash function satisfies two main properties:

Determinism ¶

The same input value should produce the same output value. In other words: the hash is a mathematical function of the data being hashed. We cannot depend on external variable parameters such as a pseudo-random number generator.

Our function satisfied this, as the length of a string is deterministic.

Uniformity ¶

The hashes should be distributed as evently as possible over the output range. In other words: every hash should have the same probability of being generated.

This is where our hash function (as well as PHP's) failed. Words in English...

[...] are reasonably well fitted by a shifted Poisson distribution with mean and variance equal to 6.94 and 5.80. [Source]

This would lead to a lot of collisions.

A real-life example: MD5¶

A hash function producing a 128-bit hash value.
Designed by Ronald Rivest in 1992.

$ echo "Let there be light" | md5sum
b3d440d0691efe7d7d485602f134f469  -

128 bits $\rightarrow$ probability of two random hashes accidentally colliding is $1\ /\ {2}^{128}$
Initially designed to be used as a cryptographic hash function.
In 2004 it was shown that MD5 is not collision-resistant [Wikipedia]
For example, given a file we can find generate a different one with the same hash.

MD5 is still useful to verify data integrity against unintentional corruption.

For example, Debian gives us these MD5 checksums:

f8446a84356a6bcbf79201cc9f46063f  debian-9.5.0-amd64-netinst.iso
47a5dca818220d8558d37dfa11b85550  debian-9.5.0-amd64-xfce-CD-1.iso
2daa085925a556d35a0eeb27768cc892  debian-mac-9.5.0-amd64-netinst.iso

Then when I download the ISO:

wget https://cdimage.debian.org/debian-cd/current/amd64/iso-cd/debian-9.5.0-amd64-netinst.iso
md5sum debian-9.5.0-amd64-netinst.iso
f8446a84356a6bcbf79201cc9f46063f

Same hash, so moooost likely no corruption.

Python's `hash()`¶

Meet the built-in hash() function:

Return the hash value of the object (if it has one). Hash values are integers. They are used to quickly compare dictionary keys during a dictionary lookup. Numeric values that compare equal have the same hash value (even if they are of different types, as is the case for 1 and 1.0).

Integers¶

In [77]:

hash(0)

Out[77]:

In [78]:

hash(5)

Out[78]:

In [79]:

hash(113)

Out[79]:

In [80]:

hash(7.3)

Out[80]:

691752902764107783

Trivia redux¶

In [81]:

hash(-5)

Out[81]:

-5

In [82]:

hash(-2)

Out[82]:

-2

In [83]:

hash(-1)  # also -2!

Out[83]:

-2

Why?

The hash value -1 is reserved (it’s used to flag errors in the C implementation). If the hash

algorithm generates this value, we simply use -2 instead. [Source]

Proof in the CPython source code:

if (x == (Py_uhash_t)-1)
    x = (Py_uhash_t)-2;
return (Py_hash_t)x;

Strings¶

For strings the algorithm is much more complex.

In [84]:

hash("dog")

Out[84]:

4448411893666489140

In [85]:

hash("cat")

Out[85]:

-1695142227333385162

In [86]:

hash("giraffe")

Out[86]:

-4077740646546332916

Looking at the CPython source code:

if (len < Py_HASH_CUTOFF) {
        /* Optimize hashing of very small strings with inline DJBX33A. */
        Py_uhash_t hash;
        const unsigned char *p = src;
        hash = 5381; /* DJBX33A starts with 5381 */

        switch(len) {
            /* ((hash << 5) + hash) + *p == hash * 33 + *p */
            case 7: hash = ((hash << 5) + hash) + *p++; /* fallthrough */
            case 6: hash = ((hash << 5) + hash) + *p++; /* fallthrough */
            case 5: hash = ((hash << 5) + hash) + *p++; /* fallthrough */
            case 4: hash = ((hash << 5) + hash) + *p++; /* fallthrough */
            case 3: hash = ((hash << 5) + hash) + *p++; /* fallthrough */
            case 2: hash = ((hash << 5) + hash) + *p++; /* fallthrough */
            case 1: hash = ((hash << 5) + hash) + *p++; break;
            default:
                Py_UNREACHABLE();
        }
        hash ^= len;
        hash ^= (Py_uhash_t) _Py_HashSecret.djbx33a.suffix;
        x = (Py_hash_t)hash;
    }

Tuples¶

In [87]:

hash((1, 2))

Out[87]:

3713081631934410656

In [88]:

hash(("dog", "cat"))

Out[88]:

4660141651573361605

In [89]:

hash(("giraffe", 113, 5.1))

Out[89]:

-1646678630949266879

Taking a look at the CPython source code:

x = 0x345678UL;
p = v->ob_item;
while (--len >= 0) {
    y = PyObject_Hash(*p++);
    if (y == -1)
        return -1;
    x = (x ^ y) * mult;
    /* the cast might truncate len; that doesn't change hash stability */
    mult += (Py_hash_t)(82520UL + len + len);
}
x += 97531UL;
if (x == (Py_uhash_t)-1)
    x = -2;

`hash()` on our own classes¶

hash() delegates the method to the __hash__() magic method.

In [90]:

x = 3
hash(x)

Out[90]:

This is equivalent to...

In [91]:

x.__hash__()

Out[91]:

Let's not implement it in our class...

In [92]:

class Circle:
    
    def __init__(self, x, y, radius):
        self.x = x
        self.y = y
        self.radius = radius
         

c = Circle(5, 7, 9)
hash(c)

Out[92]:

-9223363279575207557

It returns something, nevertheless!

From the docs:

User-defined classes have __eq__() and __hash__() methods by default; with them, all objects compare unequal (except with themselves) and x.__hash__() returns an appropriate value such that x == y implies both that x is y and hash(x) == hash(y).

So the hash will only be the same if... the object is the same.

The (default) hash function is not based on the value of the object, as we would expect, but on its identity.

In [93]:

class Circle:
    
    def __init__(self, x, y, radius):
        self.x = x
        self.y = y
        self.radius = radius
         

c1 = Circle(5, 7, 9)
c2 = Circle(5, 7, 9)

In [94]:

print("Hash of c1:", hash(c1))
print("Hash of c2:", hash(c2))

Hash of c1: -9223363279575207564
Hash of c2: 8757280120398

Different objects, different hashes!

Implementing `hash()`¶

To do things right, we need to implement our own __hash__() method.

The docs recommend [emphasis added]:

[...] it is advised to mix together the hash values of the components of the object that also play a part in comparison of objects by packing them into a tuple and hashing the tuple.

In [95]:

class Circle:
    
    def __init__(self, x, y, radius):
        self.x = x
        self.y = y
        self.radius = radius
        
    def __hash__(self):
        return hash((self.x, self.y, self.radius))


c = Circle(2, 3, 11)
hash(c)

Out[95]:

3789705017597642099

In this manner, we delegate __hash__() to the hash function to the tuple object, which in turn will use the hashes of the attributes (x, y and radius) of our class.

This hashes will be, as we need, both deterministic and uniform over the output range.

Wrapping up this section¶

Dictionaries are hash tables.
Hash tables are the data structure: $\mathcal{O}(1)$ operations.
Using hash functions to compute hash table indexes. Two problems:
- Values out of range $\rightarrow$ modulo operator.
- Collisions $\rightarrow$ will happen, sooner or later.
PHP's inconsistent function naming — now we know why.
The two properties of a good hash function:
- Determinism.
- Uniformity.
A real-life example: MD5.
Python's built-in hash().
Implementing __hash__().

Wrapping up this section¶

Dictionaries are hash tables.
Hash tables are the data structure: $\mathcal{O}(1)$ operations.
Using hash functions to compute hash table indexes. Two problems:
- Values out of range $\rightarrow$ modulo operator.
- Collisions $\rightarrow$ will happen, sooner or later.
PHP's inconsistent function naming — now we know why.
The two properties of a good hash function:
- Determinism.
- Uniformity.
Python's built-in hash().
Implementing __hash__().

Photo by Pixabay | CC0

Implementing our own hash table¶

Let's use all we learned to implement our own table hash — this time handling collisions.

We need to split this class over multiple Jupyter cells, so we'll use this decorator:

In [96]:

def add_to(cls):
    """A class decorator to dynamically add methods."""

    def wrapper(func):  
        setattr(cls, func.__name__, func)
        return cls
    return wrapper

In [97]:

class HashTable:
    """Initial prototype of a hash table."""
    
    def __init__(self, size=8):
        self._slots = [None] * size
        
    def __len__(self):
        count = 0
        for key in self._slots:
            if key is not None:
                count += 1
        return count

In [98]:

@add_to(HashTable)
def __setitem__(self, key, value):
    index = hash(key) % len(self._slots)
    self._slots[index] = value


@add_to(HashTable)
def __getitem__(self, key):
    index = hash(key) % len(self._slots)
    if self._slots[index] is None:
        raise KeyError(key)
    return self._slots[index]


@add_to(HashTable)
def __delitem__(self, key):
    index = hash(key) % len(self._slots)
    self._slots[index] = None

Let's try it...

In [99]:

table = HashTable()
table["cat"] = "Katze"
table["cat"]

Out[99]:

'Katze'

In [100]:

table._slots

Out[100]:

[None, None, None, None, None, None, 'Katze', None]

We can double check this is the right index:

In [101]:

hash("cat") % 8

Out[101]:

Deletion also works:

In [102]:

del table["cat"]
table["cat"]

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-102-f13e5ff72fe3> in <module>()
      1 del table["cat"]
----> 2 table["cat"]

<ipython-input-98-adffee489ca8> in __getitem__(self, key)
      9     index = hash(key) % len(self._slots)
     10     if self._slots[index] is None:
---> 11         raise KeyError(key)
     12     return self._slots[index]
     13 

KeyError: 'cat'

A couple of improvements:¶

Keep a counter of values, instead of calculating them every single time.
Factor our index calculation to a private method of the class.

In [103]:

class HashTable:
    """Initial prototype of a hash table, second attempt."""
    
    def __init__(self, size=8):
        self._slots = [None] * size
        self._count = 0
        
    def __len__(self):
        return self._count
    
    def _get_index(self, key):
        return hash(key) % len(self._slots)

In [104]:

@add_to(HashTable)
def __setitem__(self, key, value):
    index = self._get_index(key)
    self._slots[index] = value
    self._count += 1

    
@add_to(HashTable)
def __getitem__(self, key):
    index = self._get_index(key)
    if self._slots[index] is None:
        raise KeyError(key)
    return self._slots[index]


@add_to(HashTable)
def __delitem__(self, key):
    index = self._get_index(key)
    if self._slots[index] is None:
        raise KeyError(key)
    self._slots[index] = None
    self._count -= 1

In [105]:

table = HashTable()
table["dog"] = "Hund"
table["bear"] = "Bär"
table._slots

Out[105]:

[None, None, 'Bär', None, 'Hund', None, None, None]

In [106]:

len(table)

Out[106]:

This is all good, but again – we have to keep collisions in mind.

Sooner than later two elements are going to end up in the same slot.

A simple approach: separate chaining¶

Each slot is independent, and keeps a list entries with the same index.
All the keys that share index are kept one after another in the list...
... that is, chained, hence the name.
The algorithm becomes:
- Get the index $i$ of our element.
- Add the element to the $i$-th list in our hash table.

Thus, our empty hash table looks like this:

In [107]:

table = [
    [],  # index = 0
    [],  # index = 1
    [],  # index = 2
    [],  # index = 3
    [],  # index = 4
]

Say we want to map "dog" to its German translation, "Hund".

It lands in, for example, index == 3.

In [108]:

table = [
    [],        # index = 0
    [],        # index = 1
    [],        # index = 2
    ['Hund'],  # index = 3
    [],        # index = 4
]

Then we map "bear" to its German translation, "Bär".

It lands in, say, index == 0.

In [109]:

table = [
    ["Bär"],   # index = 0
    [],        # index = 1
    [],        # index = 2
    ['Hund'],  # index = 3
    [],        # index = 4
]

Now we store "cat", and assume that it also lands in index == 3.

In [110]:

table = [
    ["Bär"],            # index = 0
    [],                 # index = 1
    [],                 # index = 2
    ['Hund', 'Katze'],  # index = 3
    [],                 # index = 4
]

Do you see the problem we face now?

Handling collisions¶

Somebody asks us "How do you say cat in German".

We use the hash of "cat" to compute its index, and get that index == 3.

In [111]:

table = [
    ["Bär"],            # index = 0
    [],                 # index = 1
    [],                 # index = 2
    ['Hund', 'Katze'],  # index = 3
    [],                 # index = 4
]

Is it "Hund" or "Katze"? How can we tell them apart? We can't!

We also need the key¶

In order to deal with collisions we have to store the key itself, in addition to the mapped value.

Image by Jorge Stolfi | CC BY-SA 3.0

Now with keys!¶

In [112]:

import collections

Entry = collections.namedtuple('Entry', 'key, value')


class HashTable:
    """A separate-chaining hash table."""
    
    def __init__(self, size=8):
        self._slots = [list() for _ in range(size)]
        self._count = 0
        
    def __len__(self):
        return self._count
    
    def _get_index(self, key):
        return hash(key) % len(self._slots)

In [113]:

@add_to(HashTable)    
def __setitem__(self, key, value):
    slot_index = self._get_index(key)
    slot = self._slots[slot_index]
    for entry_index, entry in enumerate(slot):
        if entry.key == key:  # note the equality comparison here...
            slot[entry_index] = entry._replace(value=value)
            break
    else:
        entry = Entry(key, value)
        self._slots[slot_index].append(entry)
        self._count += 1

In [114]:

@add_to(HashTable)
def __getitem__(self, key):
    index = self._get_index(key)
    for entry in self._slots[index]:
        if entry.key == key: # and here...
            return entry.value
    raise KeyError(key)

In [115]:

@add_to(HashTable)
def __delitem__(self, key):
    slot_index = self._get_index(key)
    slot = self._slots[slot_index]
    for entry_index, entry in enumerate(slot):
        if entry.key == key:        # ... and here
            del slot[entry_index]
            break
    else:
        raise KeyError(key)
    self._count -= 1

In [116]:

table = HashTable()

In [117]:

table["dog"] = "perro"
table["bear"] = "Bär"

"perro" is Spanish, not German! Let's replace the value.

In [118]:

table["dog"] = "Hund"
table["dog"]

Out[118]:

'Hund'

Let's take a look at the internals:

In [119]:

table._slots

Out[119]:

[[],
 [],
 [Entry(key='bear', value='Bär')],
 [],
 [Entry(key='dog', value='Hund')],
 [],
 [],
 []]

Real life is so much more complex¶

We used this collision resolution strategy here because it's simple.
Python uses a different one: open addressing with pseudo-random probing.
See the superb documentation at cpython/Objects/dictobject.c.

Image by Jorge Stolfi | CC BY-SA 3.0

Still useful¶

Python may use a different strategy, but the foundations are the same. We need:

The hash to find the corresponding slot (also known as bucket).
The element itself to compare (==) and tell it apart from other elements what may have collided with it.

Our hash table is really nice! Let's add functionality to loop over the elements:

Now with `items()`!¶

In [120]:

@add_to(HashTable)
def items(self):
    for slot in self._slots:
        for entry in slot:
            yield entry
        

table = HashTable()
table["dog"] = "Hund"
table["bear"] = "Bär"

for key, value in table.items():
    print(key, "->", value)

bear -> Bär
dog -> Hund

Alternatively, in a single line:

In [121]:

import itertools

@add_to(HashTable)
def items(self):
    yield from itertools.chain.from_iterable(self._slots)

Resizing¶

Our hash table is almost perfect! The only downside:

We only have eight slots! Sooner than later it's going to get full and performance start degrading.

The solution is to resize, also known as rehashing.

Allocate a new, larger underlying data structure.
Remove each entry from old slots...
... and insert it into the new one.
After all entries are removed from the old slots, we no longer need it.
From now on, use the new slots data structure.
Takes forever, but amortized cost is $\mathcal{O}(1)$.

Now with dynamic resizing!¶

In [122]:

import collections
import itertools

Entry = collections.namedtuple('Entry', 'key, value')


class HashTable:
    """A separate-chaining, dynamically resized hash table."""
    
    def __init__(self, initial_size=8, load_factor=0.8):
        self._slots = [list() for _ in range(initial_size)]
        self._size = 0
        self._load_factor = load_factor 
        
    def __len__(self):
        return self._size
    
    @staticmethod
    def _get_index(key, slots):
        return hash(key) % len(slots)
    
    @classmethod
    def _insert(cls, slots, key, value):
        index = cls._get_index(key, slots)
        slots[index] = value

In [123]:

@add_to(HashTable)
def __setitem__(self, key, value):
    slot_index = self._get_index(key, self._slots)
    slot = self._slots[slot_index]
    for entry_index, entry in enumerate(slot):
        if entry.key == key:  # note the equality comparison here...
            slot[entry_index] = entry._replace(value=value)
            break
    else:
        entry = Entry(key, value)
        self._slots[slot_index].append(entry)
        self._size += 1

    if len(self) / len(self._slots) >= self._load_factor:
        self._rehash()    

In [124]:

@add_to(HashTable)
def _rehash(self):
    new_slots = [list() for _ in range(len(self._slots) * 2)]
    for key, value in self.items():
        HashTable._insert(new_slots, key, value)
    self._slots = new_slots

In [125]:

# The remaining methods; not worth showing on the slides, as they're basically the same ones
# we implemented earlier, but they're included here for the sake of completeness --- this is
# the last version of our hash table implementation.

@add_to(HashTable)        
def __getitem__(self, key):
    index = self._get_index(key, self._slots)
    for entry in self._slots[index]:
        if entry.key == key:         # note the equality comparison here...
            return entry.value
    raise KeyError(key)


@add_to(HashTable)
def __delitem__(self, key):
    slot_index = self._get_index(key, self._slots)
    slot = self._slots[slot_index]
    for entry_index, entry in enumerate(slot):
        if entry.key == key:        # ... and here
            del slot[entry_index]
            break
    else:
        raise KeyError(key)
    self._size -= 1


@add_to(HashTable)
def items(self):
    yield from itertools.chain.from_iterable(self._slots)   

Future improvements¶

Shrink when we delete too many elements (i.e., it's too empty, so make it smaller).
Alternatives to all-at-once-rehashing:

In the words of Andrew Cooke on Stack Overflow [emphasis added]:

Therefore, the hash code only hints at the location of an object, an equality test follows to test candidate keys. To implement a membership test in a hash-table set, the hash code gives you "bucket" number in which to search for the value. However, all set items with the same hash code are in the bucket. For this, you also need an equality test to distinguish between all candidates in the bucket.

That's it.

Sets¶

What we learned is also useful to understand how Python sets work:

They are, in essence, a hash table with dummy values.
In other words: the members of the set are the keys of the hash table.
Plus, of course, optimizations to exploit this lack of values.

Wrapping up this section¶

We implemented our own hash table.
- Need to store both keys and values.
- Handled collisions resolution via separate chaining.
- Dynamic resizing when it's too full.
Python dictionaries have a much more complex implementation.
Sets are in essence a hash table with dummy values.

Wrapping up this section¶

We implemented our own hash table.
- Need to store both keys and values.
- Handled collisions resolution via separate chaining.
Python dictionaries have a much more complex implementation.
Sets are in essence a hash table with dummy values.

Photo by Francesco Ungaro | Pexels License

We're back¶

We're —finally— ready to revisit the paragraph from the beginning:

If a class does not define an __eq__() method it should not define a __hash__() operation either; if it defines __eq__() but not __hash__(), its instances will not be usable as items in hashable collections. If a class defines mutable objects and implements an __eq__() method, it should not implement __hash__(), since the implementation of hashable collections requires that a key’s hash value is immutable (if the object’s hash value changes, it will be in the wrong hash bucket).

We now have the tools to understand —and remember— why.

Primo¶

If a class does not define an __eq__() method it should not define a __hash__() operation either; [...]

Rephrased: defining __hash__() but not __eq__() is wrong. Why?

Let's think of a hash table look up:

Python uses __hash__() to compute the slot where the value goes.
This gives us the "bucket" in which to search for our value.
Now we need the equality test to distingush between all candidates in the bucket.
We didn't implement __eq__(), so by default our objects compare unequal except with themselves.
Therefore, we cannot find our entry.

See this Stack Overflow question for an example of somebody who ran into this.

An example that works¶

In our implementation of a hash table, if we look up "cat":

In [126]:

table = [
    [],                                                                 # 0
    [Entry(key='bear', value='Bär')],                                   # 1
    [],                                                                 # 2
    [Entry(key='dog', value='Hund'), Entry(key='cat', value='Katze')],  # 3
    [],                                                                 # 4
]

Python uses str.__hash__() and determines that index == 3.

There are two different entries there, so I need to compare for equality.
Python uses str.__eq__() to compare "cat" to "dog" $\rightarrow$ False.
Second entry: "cat" == "cat" $\rightarrow$ True. We found our entry.

An example that doesn't work.¶

If we have __hash__() but not __eq__(), we'll land in the right slot but won't be able to find our value.

In [127]:

class Cyborg:

    def __init__(self, name, weapon):
        self.name = name
        self.weapon = weapon
        
    def __hash__(self):
        return hash((self.name, self.weapon))


robot = Cyborg('T-1000', 'Franchi SPAS-12')
cyborgs = set()
cyborgs.add(robot)
robot in cyborgs

Out[127]:

True

However..

In [128]:

cyborgs = {Cyborg('T-1000', 'Franchi SPAS-12')}
Cyborg('T-1000', 'Franchi SPAS-12') in cyborgs  # doesn't work!

Out[128]:

False

Without __eq__(), these objects only compare equal to themselves.

As we saw earlier for Point...

In [129]:

robot1 = Cyborg('T-800', 'M79 grenade launcher')
robot2 = Cyborg('T-800', 'M79 grenade launcher')

In [130]:

hash(robot1)

Out[130]:

-7726681088492938838

In [131]:

hash(robot2)

Out[131]:

-7726681088492938838

In [132]:

robot1 == robot2

Out[132]:

False

Secondo¶

[...] if [a class] defines __eq__() but not __hash__(), its instances will not be usable as items in hashable collections.

This is —now— almost trivial: if we don't have __hash__(), we cannot know in which slot the value is.

Indeed, Python won't even let us compute their hash.

In [3]:

class Cyborg:

    def __init__(self, name, weapon):
        self.name = name
        self.weapon = weapon
        
    def __eq__(self, other):
        return self.name == other.name and self.weapon == other.weapon


robot = Cyborg('T-850', 'Ithaca 37')
hash(robot)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-3-08a2a9fdd4d7> in <module>()
     10 
     11 robot = Cyborg('T-850', 'Ithaca 37')
---> 12 hash(robot)

TypeError: unhashable type: 'Cyborg'

But wait¶

However, we saw earlier that by default both __eq__() and __hash__() return something.

Indeed, if we don't define none of them Python doesn't complain.

In [4]:

class Cyborg:

    def __init__(self, name, weapon):
        self.name = name
        self.weapon = weapon


robot = Cyborg('T-850', 'Ithaca 37')
hash(robot)

Out[4]:

8744344590021

Making it unhashable¶

How do we not define __hash__()? By setting it to None.

From the docs:

When the __hash__() method of a class is None, instances of the class will raise an appropriate TypeError when a program attempts to retrieve their hash value, and will also be correctly identified as unhashable when checking isinstance(obj, collections.abc.Hashable).

An example¶

In [133]:

class Cyborg:

    def __init__(self, name, weapon):
        self.name = name
        self.weapon = weapon
        
    __hash__ = None


robot = Cyborg('T-850', 'Ithaca 37')
hash(robot)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-133-3d6965db93aa> in <module>()
      9 
     10 robot = Cyborg('T-850', 'Ithaca 37')
---> 11 hash(robot)

TypeError: unhashable type: 'Cyborg'

Also:

In [134]:

best_friend = dict()
best_friend[robot] = "Sarah Connor"

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-134-d42f98f6f6fa> in <module>()
      1 best_friend = dict()
----> 2 best_friend[robot] = "Sarah Connor"

TypeError: unhashable type: 'Cyborg'

And of course:

In [135]:

robot_army = set()
robot_army.add(robot)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-135-a2975627508c> in <module>()
      1 robot_army = set()
----> 2 robot_army.add(robot)

TypeError: unhashable type: 'Cyborg'

All of them fail because our class is not hashable.

Getting It Wrong, Part I¶

From the docs:

If a class that does not override __hash__() wishes to suppress hash support, it should include __hash__ = None in the class definition.

Note that returning None does not work.

In [136]:

class Cyborg:
    
    def __init__(self, name, weapon):
        self.name = name
        self.weapon = weapon

    def __hash__(self):
        return None


robot = Cyborg('T-1000', 'Franchi SPAS-12')
hash(robot)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-136-52f2e857f76a> in <module>()
     10 
     11 robot = Cyborg('T-1000', 'Franchi SPAS-12')
---> 12 hash(robot)

TypeError: __hash__ method should return an integer

Getting It Wrong, Part II¶

The docs also warn us:

A class which defines its own __hash__() that explicitly raises a TypeError would be incorrectly identified as hashable by an isinstance(obj, collections.abc.Hashable) call.

Abstract base classes¶

collections.abc.Hashable is an abstract base class. We use them

[... ] to test whether a class provides a particular interface; for example, whether it is hashable or whether it is a mapping.

Example 1¶

Is this variable iterable?

In [137]:

word = "abc"

We could try to loop over it and catch the error...

In [138]:

def is_iterable(obj):
    try:
        for item in obj:
            # can exit immediately, just want to check it doesn't fail
            return True
    except TypeError:
        return False


is_iterable(word)

Out[138]:

True

... or instead use:

In [139]:

import collections.abc

isinstance(word, collections.abc.Iterable)

Out[139]:

True

Example 2¶

Does this variable have a size?

In [140]:

t = (x ** 2 for x in [4, 5, 6, 8, 9])

We could try to call len()...

In [141]:

def is_sized(obj):
    try:
        len(obj)
        return True
    except TypeError:
        return False

is_sized(t)

Out[141]:

False

Alternatively, we could check whether the object hasattr(obj, '__len__').

... or instead:

In [142]:

isinstance(t, collections.abc.Sized)

Out[142]:

False

Example 3¶

In the same way, to check whether a variable is hashable we could just try...

In [143]:

word = "Katze"

def is_hashable(obj):
    try:
        hash(obj)
        return True
    except TypeError:
        return False


is_hashable(word)

Out[143]:

True

... or instead:

In [144]:

isinstance(word, collections.abc.Hashable)

Out[144]:

True

Don't raise `TypeError`¶

The docs warn us that raising TypeError ourselves will not work.

In [145]:

class Cyborg:
    
    def __init__(self, name, weapon):
        self.name = name
        self.weapon = weapon

    def __hash__(self):
        raise TypeError("class Cyborg is not hashable")


robot = Cyborg('T-800', 'M79 grenade launcher')
hash(robot)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-145-c9a6424ab543> in <module>()
     10 
     11 robot = Cyborg('T-800', 'M79 grenade launcher')
---> 12 hash(robot)

<ipython-input-145-c9a6424ab543> in __hash__(self)
      6 
      7     def __hash__(self):
----> 8         raise TypeError("class Cyborg is not hashable")
      9 
     10 

TypeError: class Cyborg is not hashable

This was expected. However...

In [146]:

isinstance(robot, collections.abc.Hashable)

Out[146]:

True

Wrong! We broke it!

Use __hash__ = None.

Terzo¶

And finally:

[...] If a class defines mutable objects and implements an __eq__() method, it should not implement __hash__(), since the implementation of hashable collections requires that a key’s hash value is immutable (if the object’s hash value changes, it will be in the wrong hash bucket).

This takes us to our final point: the hashability of mutable objects.

Hashing mutable objects.¶

Well, we can't.

Tuples are hashable...

In [147]:

hash((1, 2, 3))

Out[147]:

2528502973977326415

... but lists aren't.

In [148]:

hash([5, 7, 9])

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-148-831d6710add8> in <module>()
----> 1 hash([5, 7, 9])

TypeError: unhashable type: 'list'

I'd never try to do that!¶

Perhaps we've never tried calling hash() explicitly for a list, but we may have run into this:

Say we want to map two points $(x, y)$, to the result of ${x}^{y}$.

In [149]:

squares = dict()
squares[[2, 7]] = 2 ** 7

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-149-4cf3b56bef1b> in <module>()
      1 squares = dict()
----> 2 squares[[2, 7]] = 2 ** 7

TypeError: unhashable type: 'list'

Same error! It's unhashable.

For this we need to use tuples...

In [150]:

squares = dict()
squares[(2, 7)] = 2 ** 7
squares

Out[150]:

{(2, 7): 128}

... or perhaps a two-level defaultdict:

In [151]:

import collections

squares = collections.defaultdict(dict)
squares[2][7] = 2 ** 7
squares

Out[151]:

defaultdict(dict, {2: {7: 128}})

In [152]:

squares[2][7]

Out[152]:

But I really want to hash a list¶

Let's imagine a list were hashable and we used it as a key in a dictionary.

In [153]:

x = [2, 3, 5]

Python would use __hash__() to compute the slot where the list goes.
We would go to that slot and store the list there, as usual.

Later on, we mutate our list, say:

In [154]:

x[1] = 7
x

Out[154]:

[2, 7, 5]

Now we want to loop up x in the dictionary.
Python would use __hash__() and get a different value, hence a different slot.
Even if (by chance) we ended up in the same slot, we would need the equality test to distinguish between all candidates in the bucket: old and new x are different, so they will compare unequal.
We can never find our list in the dictionary!

A hashable mutable class — sadness ensues¶

Let's implement a Cyborg class.
We'll implement __eq__()...
... but also implement __hash__().

What could go wrong?

In [155]:

import functools

@functools.total_ordering
class Cyborg:
    
    def __init__(self, name, weapon):
        self.name = name
        self.weapon = weapon
        
    def __repr__(self):
        return "Cyborg({}, {})".format(self.name, self.weapon)
        
    def __eq__(self, other):
        return self.name == other.name and self.weapon == other.weapon
    
    def __lt__(self, other):
        if self.name < other.name:
            return True
        return self.weapon < other.weapon

In [155]:

class Cyborg:
    
    def __init__(self, name, weapon):
        self.name = name
        self.weapon = weapon
        
    def __repr__(self):
        return "Cyborg({}, {})".format(self.name, self.weapon)
        
    def __eq__(self, other):
        return self.name == other.name and self.weapon == other.weapon

Our hash function combines both attributes in a tuple. Both are strings, so immutable, so hashable.

In [156]:

@add_to(Cyborg)    
def __hash__(self):
    return hash((self.name, self.weapon))

We really shouldn't be doing this...

In [157]:

robot1 = Cyborg('T-800', 'Franchi SPAS-12')
robot2 = Cyborg('T-800', 'MP5 Submachine gun')

In [158]:

robot1 == robot2

Out[158]:

False

In [159]:

robot1 >= robot2

Out[159]:

False

In [160]:

hash(robot1)

Out[160]:

-2292265897856400339

In [161]:

hash(robot2)

Out[161]:

6957369841330577170

See the problem here?

Our class is mutable!¶

This will break!

In [162]:

robot = Cyborg('T-800', 'Franchi SPAS-12')
hash(robot)

Out[162]:

-2292265897856400339

In [163]:

target = dict()
target[robot] = "John Connor"
target[robot]

Out[163]:

'John Connor'

So far this works, but...

In [164]:

robot.weapon = 'M79 grenade launcher'
hash(robot)

Out[164]:

4054157808228906460

In [165]:

target[robot]

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-165-38243c188373> in <module>()
----> 1 target[robot]

KeyError: Cyborg(T-800, M79 grenade launcher)

We broke it!

That's what the docs were explaining when they told us [emphasis added]:

[...] If a class defines mutable objects and implements an __eq__() method, it should not implement __hash__(), since the implementation of hashable collections requires that a key’s hash value is immutable (if the object’s hash value changes, it will be in the wrong hash bucket).

Dictionaries store references¶

A final comment. The dictionary (and thus also sets) store a reference to our object.

In [166]:

robot = Cyborg('T-1000', 'Remington 870')
id(robot)

Out[166]:

140116472769168

In [167]:

terminators = set()
terminators.add(robot)

Let's retrieve the element from the set without removing it:

In [168]:

id(next(iter(terminators))) 

Out[168]:

140116472769168

It's not a copy — it's the same object.

In other words: mutating our object (which should have never been hashable)...
... also modifies the copy stored in the hash hable (dictionary / set!).

In [169]:

robot.weapon = 'RSB-80 Plasma Gun'
robot

Out[169]:

Cyborg(T-1000, RSB-80 Plasma Gun)

In [170]:

terminators

Out[170]:

{Cyborg(T-1000, RSB-80 Plasma Gun)}

This implies that if our hashable mutated object landed (by chance) in the same slot as it originally did, we would still be able to find it.

In [171]:

robot = Cyborg('T-1000', 'Remington 870')

terminators = set()
terminators.add(robot)
original_hash = hash(robot)
print("Original hash:", original_hash)

Original hash: 2225710126195077769

Let's mutate the object but make sure it goes to the same slot:

In [172]:

robot.weapon = 'RSB-80 Plasma Gun'
Cyborg.__hash__ = lambda x: original_hash
print("Mutated hash: ", hash(robot))

Mutated hash:  2225710126195077769

In [173]:

robot in terminators

Out[173]:

True

We can find it! It went to the same bucket.

Hashable mutable objects will break most of the time, but not always.

Doing it right¶

Experiments aside, let's do the right thing and make our mutable class non-hashable.

In [174]:

class Cyborg:
    
    def __init__(self, name, weapon):
        self.name = name
        self.weapon = weapon
        
    __hash__ = None
    
    
robot = Cyborg('T-1000', 'RSB-80 Plasma Gun')
terminators = set()
terminators.add(robot)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-174-3b61c68d3401> in <module>()
     10 robot = Cyborg('T-1000', 'RSB-80 Plasma Gun')
     11 terminators = set()
---> 12 terminators.add(robot)

TypeError: unhashable type: 'Cyborg'

Wrapping up this section¶

We revisited and understood the intimidating paragraph from the docs.

If your class doesn't define __eq__(), don't define __hash__() either.
Mutable classes should not be hashable.
- To make it non-hashable, set __hash__ = None.
If your class defines __eq__(), implement __hash__() for it to be hashable.
Abstract base classes simplify testing whether an interface is satisfied.
Dictionaries store references, not copies of the data.

Wrapping up this section¶

We revisited and understood the intimidating paragraph from the docs.

If your class doesn't define __eq__(), don't define __hash__() either.
Mutable classes should not be hashable.
- To make it non-hashable, set __hash__ = None.
If your class defines __eq__(), implement __hash__() for it to be hashable.

Photo by Pixabay / CC0

Bonus: Trollface¶

The whole point of using hash tables is to get $\mathcal{O}(1)$ operations.

Say we use a dictionary to map every number to its square:

In [176]:

{x: x ** 2 for x in range(10_000)}

Out[176]:

{0: 0,
 1: 1,
 2: 4,
 3: 9,
 4: 16,
 5: 25,
 6: 36,
 7: 49,
 8: 64,
 9: 81,
 10: 100,
 11: 121,
 12: 144,
 13: 169,
 14: 196,
 15: 225,
 16: 256,
 17: 289,
 18: 324,
 19: 361,
 20: 400,
 21: 441,
 22: 484,
 23: 529,
 24: 576,
 25: 625,
 26: 676,
 27: 729,
 28: 784,
 29: 841,
 30: 900,
 31: 961,
 32: 1024,
 33: 1089,
 34: 1156,
 35: 1225,
 36: 1296,
 37: 1369,
 38: 1444,
 39: 1521,
 40: 1600,
 41: 1681,
 42: 1764,
 43: 1849,
 44: 1936,
 45: 2025,
 46: 2116,
 47: 2209,
 48: 2304,
 49: 2401,
 50: 2500,
 51: 2601,
 52: 2704,
 53: 2809,
 54: 2916,
 55: 3025,
 56: 3136,
 57: 3249,
 58: 3364,
 59: 3481,
 60: 3600,
 61: 3721,
 62: 3844,
 63: 3969,
 64: 4096,
 65: 4225,
 66: 4356,
 67: 4489,
 68: 4624,
 69: 4761,
 70: 4900,
 71: 5041,
 72: 5184,
 73: 5329,
 74: 5476,
 75: 5625,
 76: 5776,
 77: 5929,
 78: 6084,
 79: 6241,
 80: 6400,
 81: 6561,
 82: 6724,
 83: 6889,
 84: 7056,
 85: 7225,
 86: 7396,
 87: 7569,
 88: 7744,
 89: 7921,
 90: 8100,
 91: 8281,
 92: 8464,
 93: 8649,
 94: 8836,
 95: 9025,
 96: 9216,
 97: 9409,
 98: 9604,
 99: 9801,
 100: 10000,
 101: 10201,
 102: 10404,
 103: 10609,
 104: 10816,
 105: 11025,
 106: 11236,
 107: 11449,
 108: 11664,
 109: 11881,
 110: 12100,
 111: 12321,
 112: 12544,
 113: 12769,
 114: 12996,
 115: 13225,
 116: 13456,
 117: 13689,
 118: 13924,
 119: 14161,
 120: 14400,
 121: 14641,
 122: 14884,
 123: 15129,
 124: 15376,
 125: 15625,
 126: 15876,
 127: 16129,
 128: 16384,
 129: 16641,
 130: 16900,
 131: 17161,
 132: 17424,
 133: 17689,
 134: 17956,
 135: 18225,
 136: 18496,
 137: 18769,
 138: 19044,
 139: 19321,
 140: 19600,
 141: 19881,
 142: 20164,
 143: 20449,
 144: 20736,
 145: 21025,
 146: 21316,
 147: 21609,
 148: 21904,
 149: 22201,
 150: 22500,
 151: 22801,
 152: 23104,
 153: 23409,
 154: 23716,
 155: 24025,
 156: 24336,
 157: 24649,
 158: 24964,
 159: 25281,
 160: 25600,
 161: 25921,
 162: 26244,
 163: 26569,
 164: 26896,
 165: 27225,
 166: 27556,
 167: 27889,
 168: 28224,
 169: 28561,
 170: 28900,
 171: 29241,
 172: 29584,
 173: 29929,
 174: 30276,
 175: 30625,
 176: 30976,
 177: 31329,
 178: 31684,
 179: 32041,
 180: 32400,
 181: 32761,
 182: 33124,
 183: 33489,
 184: 33856,
 185: 34225,
 186: 34596,
 187: 34969,
 188: 35344,
 189: 35721,
 190: 36100,
 191: 36481,
 192: 36864,
 193: 37249,
 194: 37636,
 195: 38025,
 196: 38416,
 197: 38809,
 198: 39204,
 199: 39601,
 200: 40000,
 201: 40401,
 202: 40804,
 203: 41209,
 204: 41616,
 205: 42025,
 206: 42436,
 207: 42849,
 208: 43264,
 209: 43681,
 210: 44100,
 211: 44521,
 212: 44944,
 213: 45369,
 214: 45796,
 215: 46225,
 216: 46656,
 217: 47089,
 218: 47524,
 219: 47961,
 220: 48400,
 221: 48841,
 222: 49284,
 223: 49729,
 224: 50176,
 225: 50625,
 226: 51076,
 227: 51529,
 228: 51984,
 229: 52441,
 230: 52900,
 231: 53361,
 232: 53824,
 233: 54289,
 234: 54756,
 235: 55225,
 236: 55696,
 237: 56169,
 238: 56644,
 239: 57121,
 240: 57600,
 241: 58081,
 242: 58564,
 243: 59049,
 244: 59536,
 245: 60025,
 246: 60516,
 247: 61009,
 248: 61504,
 249: 62001,
 250: 62500,
 251: 63001,
 252: 63504,
 253: 64009,
 254: 64516,
 255: 65025,
 256: 65536,
 257: 66049,
 258: 66564,
 259: 67081,
 260: 67600,
 261: 68121,
 262: 68644,
 263: 69169,
 264: 69696,
 265: 70225,
 266: 70756,
 267: 71289,
 268: 71824,
 269: 72361,
 270: 72900,
 271: 73441,
 272: 73984,
 273: 74529,
 274: 75076,
 275: 75625,
 276: 76176,
 277: 76729,
 278: 77284,
 279: 77841,
 280: 78400,
 281: 78961,
 282: 79524,
 283: 80089,
 284: 80656,
 285: 81225,
 286: 81796,
 287: 82369,
 288: 82944,
 289: 83521,
 290: 84100,
 291: 84681,
 292: 85264,
 293: 85849,
 294: 86436,
 295: 87025,
 296: 87616,
 297: 88209,
 298: 88804,
 299: 89401,
 300: 90000,
 301: 90601,
 302: 91204,
 303: 91809,
 304: 92416,
 305: 93025,
 306: 93636,
 307: 94249,
 308: 94864,
 309: 95481,
 310: 96100,
 311: 96721,
 312: 97344,
 313: 97969,
 314: 98596,
 315: 99225,
 316: 99856,
 317: 100489,
 318: 101124,
 319: 101761,
 320: 102400,
 321: 103041,
 322: 103684,
 323: 104329,
 324: 104976,
 325: 105625,
 326: 106276,
 327: 106929,
 328: 107584,
 329: 108241,
 330: 108900,
 331: 109561,
 332: 110224,
 333: 110889,
 334: 111556,
 335: 112225,
 336: 112896,
 337: 113569,
 338: 114244,
 339: 114921,
 340: 115600,
 341: 116281,
 342: 116964,
 343: 117649,
 344: 118336,
 345: 119025,
 346: 119716,
 347: 120409,
 348: 121104,
 349: 121801,
 350: 122500,
 351: 123201,
 352: 123904,
 353: 124609,
 354: 125316,
 355: 126025,
 356: 126736,
 357: 127449,
 358: 128164,
 359: 128881,
 360: 129600,
 361: 130321,
 362: 131044,
 363: 131769,
 364: 132496,
 365: 133225,
 366: 133956,
 367: 134689,
 368: 135424,
 369: 136161,
 370: 136900,
 371: 137641,
 372: 138384,
 373: 139129,
 374: 139876,
 375: 140625,
 376: 141376,
 377: 142129,
 378: 142884,
 379: 143641,
 380: 144400,
 381: 145161,
 382: 145924,
 383: 146689,
 384: 147456,
 385: 148225,
 386: 148996,
 387: 149769,
 388: 150544,
 389: 151321,
 390: 152100,
 391: 152881,
 392: 153664,
 393: 154449,
 394: 155236,
 395: 156025,
 396: 156816,
 397: 157609,
 398: 158404,
 399: 159201,
 400: 160000,
 401: 160801,
 402: 161604,
 403: 162409,
 404: 163216,
 405: 164025,
 406: 164836,
 407: 165649,
 408: 166464,
 409: 167281,
 410: 168100,
 411: 168921,
 412: 169744,
 413: 170569,
 414: 171396,
 415: 172225,
 416: 173056,
 417: 173889,
 418: 174724,
 419: 175561,
 420: 176400,
 421: 177241,
 422: 178084,
 423: 178929,
 424: 179776,
 425: 180625,
 426: 181476,
 427: 182329,
 428: 183184,
 429: 184041,
 430: 184900,
 431: 185761,
 432: 186624,
 433: 187489,
 434: 188356,
 435: 189225,
 436: 190096,
 437: 190969,
 438: 191844,
 439: 192721,
 440: 193600,
 441: 194481,
 442: 195364,
 443: 196249,
 444: 197136,
 445: 198025,
 446: 198916,
 447: 199809,
 448: 200704,
 449: 201601,
 450: 202500,
 451: 203401,
 452: 204304,
 453: 205209,
 454: 206116,
 455: 207025,
 456: 207936,
 457: 208849,
 458: 209764,
 459: 210681,
 460: 211600,
 461: 212521,
 462: 213444,
 463: 214369,
 464: 215296,
 465: 216225,
 466: 217156,
 467: 218089,
 468: 219024,
 469: 219961,
 470: 220900,
 471: 221841,
 472: 222784,
 473: 223729,
 474: 224676,
 475: 225625,
 476: 226576,
 477: 227529,
 478: 228484,
 479: 229441,
 480: 230400,
 481: 231361,
 482: 232324,
 483: 233289,
 484: 234256,
 485: 235225,
 486: 236196,
 487: 237169,
 488: 238144,
 489: 239121,
 490: 240100,
 491: 241081,
 492: 242064,
 493: 243049,
 494: 244036,
 495: 245025,
 496: 246016,
 497: 247009,
 498: 248004,
 499: 249001,
 500: 250000,
 501: 251001,
 502: 252004,
 503: 253009,
 504: 254016,
 505: 255025,
 506: 256036,
 507: 257049,
 508: 258064,
 509: 259081,
 510: 260100,
 511: 261121,
 512: 262144,
 513: 263169,
 514: 264196,
 515: 265225,
 516: 266256,
 517: 267289,
 518: 268324,
 519: 269361,
 520: 270400,
 521: 271441,
 522: 272484,
 523: 273529,
 524: 274576,
 525: 275625,
 526: 276676,
 527: 277729,
 528: 278784,
 529: 279841,
 530: 280900,
 531: 281961,
 532: 283024,
 533: 284089,
 534: 285156,
 535: 286225,
 536: 287296,
 537: 288369,
 538: 289444,
 539: 290521,
 540: 291600,
 541: 292681,
 542: 293764,
 543: 294849,
 544: 295936,
 545: 297025,
 546: 298116,
 547: 299209,
 548: 300304,
 549: 301401,
 550: 302500,
 551: 303601,
 552: 304704,
 553: 305809,
 554: 306916,
 555: 308025,
 556: 309136,
 557: 310249,
 558: 311364,
 559: 312481,
 560: 313600,
 561: 314721,
 562: 315844,
 563: 316969,
 564: 318096,
 565: 319225,
 566: 320356,
 567: 321489,
 568: 322624,
 569: 323761,
 570: 324900,
 571: 326041,
 572: 327184,
 573: 328329,
 574: 329476,
 575: 330625,
 576: 331776,
 577: 332929,
 578: 334084,
 579: 335241,
 580: 336400,
 581: 337561,
 582: 338724,
 583: 339889,
 584: 341056,
 585: 342225,
 586: 343396,
 587: 344569,
 588: 345744,
 589: 346921,
 590: 348100,
 591: 349281,
 592: 350464,
 593: 351649,
 594: 352836,
 595: 354025,
 596: 355216,
 597: 356409,
 598: 357604,
 599: 358801,
 600: 360000,
 601: 361201,
 602: 362404,
 603: 363609,
 604: 364816,
 605: 366025,
 606: 367236,
 607: 368449,
 608: 369664,
 609: 370881,
 610: 372100,
 611: 373321,
 612: 374544,
 613: 375769,
 614: 376996,
 615: 378225,
 616: 379456,
 617: 380689,
 618: 381924,
 619: 383161,
 620: 384400,
 621: 385641,
 622: 386884,
 623: 388129,
 624: 389376,
 625: 390625,
 626: 391876,
 627: 393129,
 628: 394384,
 629: 395641,
 630: 396900,
 631: 398161,
 632: 399424,
 633: 400689,
 634: 401956,
 635: 403225,
 636: 404496,
 637: 405769,
 638: 407044,
 639: 408321,
 640: 409600,
 641: 410881,
 642: 412164,
 643: 413449,
 644: 414736,
 645: 416025,
 646: 417316,
 647: 418609,
 648: 419904,
 649: 421201,
 650: 422500,
 651: 423801,
 652: 425104,
 653: 426409,
 654: 427716,
 655: 429025,
 656: 430336,
 657: 431649,
 658: 432964,
 659: 434281,
 660: 435600,
 661: 436921,
 662: 438244,
 663: 439569,
 664: 440896,
 665: 442225,
 666: 443556,
 667: 444889,
 668: 446224,
 669: 447561,
 670: 448900,
 671: 450241,
 672: 451584,
 673: 452929,
 674: 454276,
 675: 455625,
 676: 456976,
 677: 458329,
 678: 459684,
 679: 461041,
 680: 462400,
 681: 463761,
 682: 465124,
 683: 466489,
 684: 467856,
 685: 469225,
 686: 470596,
 687: 471969,
 688: 473344,
 689: 474721,
 690: 476100,
 691: 477481,
 692: 478864,
 693: 480249,
 694: 481636,
 695: 483025,
 696: 484416,
 697: 485809,
 698: 487204,
 699: 488601,
 700: 490000,
 701: 491401,
 702: 492804,
 703: 494209,
 704: 495616,
 705: 497025,
 706: 498436,
 707: 499849,
 708: 501264,
 709: 502681,
 710: 504100,
 711: 505521,
 712: 506944,
 713: 508369,
 714: 509796,
 715: 511225,
 716: 512656,
 717: 514089,
 718: 515524,
 719: 516961,
 720: 518400,
 721: 519841,
 722: 521284,
 723: 522729,
 724: 524176,
 725: 525625,
 726: 527076,
 727: 528529,
 728: 529984,
 729: 531441,
 730: 532900,
 731: 534361,
 732: 535824,
 733: 537289,
 734: 538756,
 735: 540225,
 736: 541696,
 737: 543169,
 738: 544644,
 739: 546121,
 740: 547600,
 741: 549081,
 742: 550564,
 743: 552049,
 744: 553536,
 745: 555025,
 746: 556516,
 747: 558009,
 748: 559504,
 749: 561001,
 750: 562500,
 751: 564001,
 752: 565504,
 753: 567009,
 754: 568516,
 755: 570025,
 756: 571536,
 757: 573049,
 758: 574564,
 759: 576081,
 760: 577600,
 761: 579121,
 762: 580644,
 763: 582169,
 764: 583696,
 765: 585225,
 766: 586756,
 767: 588289,
 768: 589824,
 769: 591361,
 770: 592900,
 771: 594441,
 772: 595984,
 773: 597529,
 774: 599076,
 775: 600625,
 776: 602176,
 777: 603729,
 778: 605284,
 779: 606841,
 780: 608400,
 781: 609961,
 782: 611524,
 783: 613089,
 784: 614656,
 785: 616225,
 786: 617796,
 787: 619369,
 788: 620944,
 789: 622521,
 790: 624100,
 791: 625681,
 792: 627264,
 793: 628849,
 794: 630436,
 795: 632025,
 796: 633616,
 797: 635209,
 798: 636804,
 799: 638401,
 800: 640000,
 801: 641601,
 802: 643204,
 803: 644809,
 804: 646416,
 805: 648025,
 806: 649636,
 807: 651249,
 808: 652864,
 809: 654481,
 810: 656100,
 811: 657721,
 812: 659344,
 813: 660969,
 814: 662596,
 815: 664225,
 816: 665856,
 817: 667489,
 818: 669124,
 819: 670761,
 820: 672400,
 821: 674041,
 822: 675684,
 823: 677329,
 824: 678976,
 825: 680625,
 826: 682276,
 827: 683929,
 828: 685584,
 829: 687241,
 830: 688900,
 831: 690561,
 832: 692224,
 833: 693889,
 834: 695556,
 835: 697225,
 836: 698896,
 837: 700569,
 838: 702244,
 839: 703921,
 840: 705600,
 841: 707281,
 842: 708964,
 843: 710649,
 844: 712336,
 845: 714025,
 846: 715716,
 847: 717409,
 848: 719104,
 849: 720801,
 850: 722500,
 851: 724201,
 852: 725904,
 853: 727609,
 854: 729316,
 855: 731025,
 856: 732736,
 857: 734449,
 858: 736164,
 859: 737881,
 860: 739600,
 861: 741321,
 862: 743044,
 863: 744769,
 864: 746496,
 865: 748225,
 866: 749956,
 867: 751689,
 868: 753424,
 869: 755161,
 870: 756900,
 871: 758641,
 872: 760384,
 873: 762129,
 874: 763876,
 875: 765625,
 876: 767376,
 877: 769129,
 878: 770884,
 879: 772641,
 880: 774400,
 881: 776161,
 882: 777924,
 883: 779689,
 884: 781456,
 885: 783225,
 886: 784996,
 887: 786769,
 888: 788544,
 889: 790321,
 890: 792100,
 891: 793881,
 892: 795664,
 893: 797449,
 894: 799236,
 895: 801025,
 896: 802816,
 897: 804609,
 898: 806404,
 899: 808201,
 900: 810000,
 901: 811801,
 902: 813604,
 903: 815409,
 904: 817216,
 905: 819025,
 906: 820836,
 907: 822649,
 908: 824464,
 909: 826281,
 910: 828100,
 911: 829921,
 912: 831744,
 913: 833569,
 914: 835396,
 915: 837225,
 916: 839056,
 917: 840889,
 918: 842724,
 919: 844561,
 920: 846400,
 921: 848241,
 922: 850084,
 923: 851929,
 924: 853776,
 925: 855625,
 926: 857476,
 927: 859329,
 928: 861184,
 929: 863041,
 930: 864900,
 931: 866761,
 932: 868624,
 933: 870489,
 934: 872356,
 935: 874225,
 936: 876096,
 937: 877969,
 938: 879844,
 939: 881721,
 940: 883600,
 941: 885481,
 942: 887364,
 943: 889249,
 944: 891136,
 945: 893025,
 946: 894916,
 947: 896809,
 948: 898704,
 949: 900601,
 950: 902500,
 951: 904401,
 952: 906304,
 953: 908209,
 954: 910116,
 955: 912025,
 956: 913936,
 957: 915849,
 958: 917764,
 959: 919681,
 960: 921600,
 961: 923521,
 962: 925444,
 963: 927369,
 964: 929296,
 965: 931225,
 966: 933156,
 967: 935089,
 968: 937024,
 969: 938961,
 970: 940900,
 971: 942841,
 972: 944784,
 973: 946729,
 974: 948676,
 975: 950625,
 976: 952576,
 977: 954529,
 978: 956484,
 979: 958441,
 980: 960400,
 981: 962361,
 982: 964324,
 983: 966289,
 984: 968256,
 985: 970225,
 986: 972196,
 987: 974169,
 988: 976144,
 989: 978121,
 990: 980100,
 991: 982081,
 992: 984064,
 993: 986049,
 994: 988036,
 995: 990025,
 996: 992016,
 997: 994009,
 998: 996004,
 999: 998001,
 ...}

Lookups are $\mathcal{O}(1)$. Proof:

In [181]:

%%timeit squares = {x: x ** 2 for x in range(1_000)}
999 in squares

58 ns ± 6.09 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [183]:

%%timeit squares = {x: x ** 2 for x in range(10_000)}
9_999 in squares

64 ns ± 9.71 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

Size of dictionary is $10x$, but it takes the same amount of time. This is $\mathcal{O}(1)$.

Let's subclass int and make it always return the same hash

In [184]:

class TrollInt:
    
    def __init__(self, value):
         self.value = value
            
    def __eq__(self, other):
        return self.value == other.value
        
    def __hash__(self):
        return 1

This means that they'll all go the same slot.

Python will need to resolve (endless) collisions

In [185]:

%%timeit squares = {TrollInt(x): TrollInt(x ** 2) for x in range(1_000)}
squares[TrollInt(999)]  

379 µs ± 89.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [186]:

%%timeit squares = {TrollInt(x): TrollInt(x ** 2) for x in range(10_000)}
squares[TrollInt(9_999)]

5.28 ms ± 1.4 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

We've turned our $\mathcal{O}(1)$ operations into $\mathcal{O}(n)$.

Problem?

Motivation¶

Outline¶

Outline¶

Object equality versus object identity¶

== tests for equality¶

__eq__()¶

is tests for identity¶

id()¶

However...¶

Integers¶

Strings¶

String interning¶

How is this useful?¶

A Python class of our own¶

However...¶

We didn't define __eq__()¶

Rich comparison methods¶

bool()¶

NotImplemented¶

Returning NotImplemented ourselves¶

Reflections¶

__lt__() and __gt__()¶

__le__() and __ge__()¶

__eq__() and __ne__()¶

Implied relationships¶

And nothing else¶

So, all of them?¶

Not all, because:¶

functools.total_ordering()¶

Prehistory's __cmp__()¶

Wrapping up this section¶

All right, I guess — but do we really care?¶

Mandatory disclaimer¶

One-minute overview¶

Dictionaries¶

Hash tables¶

A dictionary is really a list¶

Hash functions¶

A simple hash function¶

Using our hash function to compute indexes¶

First problem: out of range¶

Solution: modulo operator¶

Second problem: collisions!¶

More Trivia: PHP¶

Behave, hash function!¶

Determinism¶

Uniformity¶

A real-life example: MD5¶

Python's hash()¶

Integers¶

Trivia redux¶

Strings¶

Tuples¶

hash() on our own classes¶

Implementing __hash__()¶

Wrapping up this section¶

Wrapping up this section¶

Implementing our own hash table¶

A couple of improvements:¶

A simple approach: separate chaining¶

Handling collisions¶

We also need the key¶

Now with keys!¶

Real life is so much more complex¶

Still useful¶

Now with items()!¶

Resizing¶

Now with dynamic resizing!¶

Future improvements¶

Sets¶

Wrapping up this section¶

Wrapping up this section¶

We're back¶

Primo¶

An example that works¶

An example that doesn't work.¶

Secondo¶

But wait¶

Making it unhashable¶

An example¶

`==` tests for equality¶

`eq()`¶

`is` tests for identity¶

`id()`¶

We didn't define `eq()`¶

`bool()`¶

`NotImplemented`¶

Returning `NotImplemented` ourselves¶

`lt()` and `gt()`¶

`le()` and `ge()`¶

`eq()` and `ne()`¶

`functools.total_ordering()`¶

Prehistory's `cmp()`¶

Determinism ¶

Uniformity ¶

Python's `hash()`¶

`hash()` on our own classes¶

Implementing `hash()`¶

Now with `items()`!¶

Don't raise `TypeError`¶