#!/usr/bin/env python # coding: utf-8 #
# #
# # [Photo](https://www.pexels.com/photo/landscape-photography-of-desert-33147/) by [Pixabay](https://pixabay.com/en/desert-africa-bedouin-footprints-1007157/) / [CC0](https://www.pexels.com/creative-commons-images/) # # Motivation # # This paragraph [in the Python docs](https://docs.python.org/3/reference/datamodel.html#object.__hash__): # # > If a class does not define an [\_\_eq\_\_()](https://docs.python.org/3/reference/datamodel.html#object.__eq__) method it should not define a [\_\_hash\_\_()](https://docs.python.org/3/reference/datamodel.html#object.__hash__) operation either; if it defines [\_\_eq\_\_()](https://docs.python.org/3/reference/datamodel.html#object.__eq__) but not [\_\_hash\_\_()](https://docs.python.org/3/reference/datamodel.html#object.__hash__), its instances will not be usable as items in hashable collections. If a class defines mutable objects and implements an [\_\_eq\_\_()](https://docs.python.org/3/reference/datamodel.html#object.__eq__) method, it should not implement [\_\_hash\_\_()](https://docs.python.org/3/reference/datamodel.html#object.__hash__), since the implementation of hashable collections requires that a key’s hash value is immutable (if the object’s hash value changes, it will be in the wrong hash bucket). # # ..._why_? # # This talk is a foray into the Python data model so that we understand the reasoning behind this advice. And, above everything else, just an excuse to talk about [hash tables](https://en.wikipedia.org/wiki/Hash_table). # # Outline # # 1. Object _equality_ versus object _identity_. # 2. Rich comparison operators for our classes. # 3. Python dictionaries and sets are _hash tables_. # 4. Fundamentals of hash tables, $\mathcal{O}(1)$ for the win. # 5. The properties of a good hash function. # 6. Implementing our own, full-fledged hash table. # 7. Revising the initial paragraph — we're now ready. # 8. Understanding —and remembering— why. # # _On your marks, get set, go!_ # # Outline # # 1. Python dictionaries and sets are _hash tables_. # 2. Fundamentals of hash tables, $\mathcal{O}(1)$ for the win. # 3. The properties of a good hash function. # 4. Implementing our own, full-fledged hash table. # 5. Revising the initial paragraph — we're now ready. # 6. Understanding —and remembering— why. # # _On your marks, get set, go!_ # # Object _equality_ versus object _identity_ # # `==` tests for _equality_ # # The `==` operator is the object **value comparison** operator. # # It tests whether two objects have the same _value_. # In[2]: x = [2, 3] y = [2, 3] x == y # Both variables have the same value, so the result of the comparison is `True`. # ## `__eq__()` # # In the [Python data model](https://docs.python.org/3/reference/datamodel.html), equality is implemented via [\_\_eq\_\_()](https://docs.python.org/3/reference/datamodel.html#object.__eq__) and its counterpart, [\_\_ne\_\_()](https://docs.python.org/3/reference/datamodel.html#object.__ne__). # # `is` tests for _identity_ # # The `is` operator, on the other hand, is the object **identity comparison** operator. # # It tests whether two objects are the _same_. # In[3]: x = [2, 3] y = [2, 3] x is y # Although `x` and `y` have the same value, they are different objects, so the comparison with is returns `False`. # ## `id()` # # Identity is checked by comparing the built-in function [`id()`](https://docs.python.org/3/library/functions.html#id). It returns the "identity" of an object: an integer guaranteed to be unique and constant for each object during its lifetime. # In[4]: x = [] id(x) # In[5]: y = [] id(y) # In[6]: x is y # In [CPython](https://github.com/python/cpython) this number is in fact the address of the object in memory. # # However... # # Also known as "Trivia Facts, Part I". # # # ## Integers # # # What we just said doesn't work as we might expect with integers. # In[7]: x = 23 id(x) # In[8]: y = 23 id(x) # In[9]: x is y # What? But... they're different objects... _right_? # And to make things even worse: # In[10]: a = 581 id(a) # In[11]: b = 581 id(b) # In[12]: a is b # The difference is that the current implementation (of, again, CPython) keeps an array of integer objects for **all integers between −5 and 256**. When we create an integer in that range what we are actually getting is just a reference back to that existing object. # # In other words: in CPython, integers between −5 and 256 are [singletons](https://en.wikipedia.org/wiki/Singleton_pattern). # # This is an _implementation detail_. For example, until February 2006 this happened only for values [−5, 99]. It's undocumented for a reason. Never rely on this! # ## Strings # # The same behavior may be exhibited by small strings literals. # In[13]: x = "dog" id(x) # In[14]: y = "dog" id(y) # In[15]: x is y # These strings are _interned_. # # String interning # # From [Wikipedia](https://en.wikipedia.org/wiki/String_interning) \[emphasis added\]: # # > In computer science, string interning is a method of **storing only one copy of each distinct string value**, which must be immutable. Interning strings makes some string processing tasks more time- or space-efficient at the cost of requiring more time when the string is created or interned. # In[16]: x = "".join(["d", "o", "g"]) y = "dog" x is y # Why the difference? Note that we said earlier that this behavior is exhibited for _literals_. Here, `x` is an expression. Because Python does not know its value until runtime, `x` is not interned and therefore it's a different object than `y`. # We can force a string to be interned, however, using [`sys.intern()`](https://docs.python.org/3.2/library/sys.html#sys.intern). # In[17]: import sys x = sys.intern("".join(["a", "b", "c"])) y = "abc" x is y # # How is this useful? # # From the [Python docs](https://docs.python.org/3/library/sys.html#sys.intern): # # > Interning strings is useful to gain a little performance on dictionary lookup – if the keys in a dictionary are interned, and the lookup key is interned, the key comparisons (after hashing) can be done by a pointer compare instead of a string compare. # # A Python class of our own # # Let's move beyond built-in types and define our own class, `Circle`. # In[18]: class Circle: def __init__(self, x, y, radius): self.x = x self.y = y self.radius = radius def __repr__(self): return "Circle(x={}, y={}, radius={})".format( self.x, self.y, self.radius) c = Circle(2, 3, 5) print(c) # To avoid hard-coding the name of the class, we can use instead: # In[19]: class Circle: def __init__(self, x, y, radius): self.x = x self.y = y self.radius = radius def __repr__(self): return "{}(x={}, y={}, radius={})".format( type(self).__name__, # note this self.x, self.y, self.radius) # # However... # # The comparison between the instances of our class is broken! # In[20]: c1 = Circle(2, 3, 5) c2 = Circle(2, 3, 5) c1 == c2 # We didn't expect this — the _values_ are the same, so they should compare equal. # In[21]: print(c1.x == c2.x) print(c1.y == c2.y) print(c1.radius == c2.radius) # # We didn't define `__eq__()` # # From the [Python docs](https://docs.python.org/3/reference/datamodel.html#object.__hash__): # # > User-defined classes have \[a\] [`__eq__()`](https://docs.python.org/3/reference/datamodel.html#object.__eq__) [method] by default; with \[it\], all objects compare unequal (except with themselves) \[...\] # # In other words, if we don't define [`__eq__()`](https://docs.python.org/3/reference/datamodel.html#object.__eq__) by default it uses `is`. # # And `c1` and `c2` are different objects. # In[22]: c1 is c2 # # Rich comparison methods # # # For our classes to be fully comparable, they must implement the "_rich_" comparison methods. # # Each comparison operation calls one of these methods. The correspondences are: # # - [`__lt__()`](https://docs.python.org/3/reference/datamodel.html#object.__lt__) $\rightarrow$ `x < y` # - [`__le__()`](https://docs.python.org/3/reference/datamodel.html#object.__le__) $\rightarrow$ `x <= y` # - [`__eq__()`](https://docs.python.org/3/reference/datamodel.html#object.__eq__) $\rightarrow$ `x == y` # - [`__ne__()`](https://docs.python.org/3/reference/datamodel.html#object.__ne__) $\rightarrow$ `x != y` # - [`__gt__()`](https://docs.python.org/3/reference/datamodel.html#object.__gt__) $\rightarrow$ `x > y` # - [`__ge__()`](https://docs.python.org/3/reference/datamodel.html#object.__ge__) $\rightarrow$ `x >= y` # For example, so that we can use `==` with our class: # In[23]: class Circle: def __init__(self, x, y, radius): self.x = x self.y = y self.radius = radius def __eq__(self, other): return self.x == other.x and \ self.y == other.y and \ self.radius == other.radius c1 = Circle(2, 3, 5) c2 = Circle(2, 3, 5) c1 == c2 # # `bool()` # # We _could_ return anything, as Python will call `bool()` to determine the truth value. # In[188]: class Circle: def __init__(self, x, y, radius): self.x = x self.y = y self.radius = radius def __eq__(self, other): if self.x == other.x and \ self.y == other.y and \ self.radius == other.radius: return "Yes, they are equal!" else: return "Nope, they're different" c1 = Circle(2, 3, 5) c2 = Circle(2, 3, 5) c1 == c2 # As the condition of an `if` statement: # In[25]: if c1 == c2: print("I already told you, they're equal!") # But the truth value of _any_ non-empty string is `True`: # In[26]: c1 = Circle(1, 3, 9) c2 = Circle(2, 5, 7) if c1 == c2: # bool("Nope, they're different") print("These circles are equal too!") # Dont' do this. Return `True` or `False`. # # `NotImplemented` # # - If the comparison is not defined, we return the singleton [`NotImplemented`](https://docs.python.org/3/library/constants.html#NotImplemented). # - It's a singleton, _not_ an exception! # - This makes Python understand and complain that the two objects cannot be compared. # In[27]: c1 = Circle(2, 3, 5) c2 = Circle(2, 3, 5) c1 < c2 # The `<` operator calls [`__lt__()`](https://docs.python.org/3/reference/datamodel.html#object.__lt__), but we haven't defined it. Python returns [`NotImplemented`](https://docs.python.org/3/library/constants.html#NotImplemented), which in turn raises [TypeError](https://docs.python.org/3/library/exceptions.html#TypeError). # # Reminder: we *exclusively* have a [`__eq__()`](https://docs.python.org/3/reference/datamodel.html#object.__eq__) by default, comparing equal only to ourselves. # # Returning `NotImplemented` ourselves # # If we want to make sure that the object is only compared to another object of the same type: # In[28]: class Circle: def __init__(self, x, y, radius): self.x = x self.y = y self.radius = radius def __eq__(self, other): if type(other) != type(self): return NotImplemented return self.x == other.x and \ self.y == other.y and \ self.radius == other.radius c1 = Circle(2, 3, 5) c1 == 5 # To account and allow for subclasses: # In[29]: class Circle: def __init__(self, x, y, radius): self.x = x self.y = y self.radius = radius def __eq__(self, other): if not isinstance(other, type(self)): # instead of type() == type() return NotImplemented return self.x == other.x and \ self.y == other.y and \ self.radius == other.radius # # Reflections # # ## `__lt__()` and `__gt__()` # # Let's say we want to sort our `Circle` object by increasing area. # In[30]: class Circle: def __init__(self, x, y, radius): self.x = x self.y = y self.radius = radius def __lt__(self, other): return self.radius < other.radius # It works as we would expect: # In[31]: c1 = Circle(2, 5, 13) c2 = Circle(3, 7, 11) c1 < c2 # However... # In[32]: c1 > c2 # How?! We didn't define [`__gt__()`](https://docs.python.org/3/reference/datamodel.html#object.__gt__), but Python is not complaining. # From the [Python docs](https://docs.python.org/3/reference/datamodel.html#object.__gt__): # # > There are no swapped-argument versions of these methods (to be used when the left argument does not support the operation but the right argument does); rather, [`__lt__()`](https://docs.python.org/3/reference/datamodel.html#object.__lt__) and [`__gt__()`](https://docs.python.org/3/reference/datamodel.html#object.__gt__) are each other’s reflection [...] # This is what happened here: `Circle` doesn't define [`__gt__()`](https://docs.python.org/3/reference/datamodel.html#object.__gt__), so... # In[33]: c1.__gt__(c2) # ... was translated into: # In[34]: c2.__lt__(c1) # Technically speaking, we fallback to the right operand’s reflected method. # Let's verify it via the good ol' [printf debugging](https://en.wikipedia.org/wiki/Debugging#Techniques) technique. # In[35]: class Circle: def __init__(self, x, y, radius): self.x = x self.y = y self.radius = radius def __repr__(self): return "{}(x={}, y={}, radius={})".format( type(self).__name__, self.x, self.y, self.radius) def __lt__(self, other): print("called {} < {}".format(self, other)) # we added this return self.radius < other.radius c1 = Circle(2, 5, 13) c2 = Circle(3, 7, 11) c1 > c2 # So, indeed, Python resorted to the reflection of the method we have. # ## `__le__()` and `__ge__()` # # The same happens for [`__le__()`](https://docs.python.org/3/reference/datamodel.html#object.__le__) and [`__ge__()`](https://docs.python.org/3/reference/datamodel.html#object.__ge__). # In[36]: class Circle: def __init__(self, x, y, radius): self.x = x self.y = y self.radius = radius def __repr__(self): return "{}(x={}, y={}, radius={})".format( type(self).__name__, self.x, self.y, self.radius) def __ge__(self, other): print("called {} >= {}".format(self, other)) # more printf debugging return self.radius >= other.radius c1 = Circle(2, 5, 11) c2 = Circle(3, 7, 13) c1 <= c2 # ## `__eq__()` and `__ne__()` # # Finally, [`__eq__()`](https://docs.python.org/3/reference/datamodel.html#object.__eq__) and [`__ne__()`](https://docs.python.org/3/reference/datamodel.html#object.__ne__) are *their own reflection*. # # This means that if `x.__eq__(y)` is not defined, we try to use `y.__eq__(x)`. # # Don't see why it's useful? You're not alone. # Let's go back to an earlier example. # In[37]: class Circle: """A regular circle.""" def __init__(self, x, y, radius): self.x = x self.y = y self.radius = radius def __repr__(self): return "{}(x={}, y={}, radius={})".format( type(self).__name__, self.x, self.y, self.radius) def __eq__(self, other): print("calling {} == {}".format(self, other)) if not isinstance(other, type(self)): # we allow for subclasses return NotImplemented return self.x == other.x and \ self.y == other.y and \ self.radius == other.radius class PremiumCircle(Circle): "A luxury Circle, with high quality values." pass # it's the exact same thing, of course # Focus your attention on `__eq__()`: # In[38]: def __eq__(self, other): if not isinstance(other, type(self)): # we allow for subclasses return NotImplemented return self.x == other.x and \ self.y == other.y and \ self.radius == other.radius # ```python # if not isinstance(other, type(self)): # we allow for subclasses # return NotImplemented # ``` # This means that `__eq__()` would only work if the subclass is the _right_ side of `==` (and, therefore, binds to `other`), because... # In[40]: issubclass(PremiumCircle, Circle) # ... but... # In[41]: issubclass(Circle, PremiumCircle) # ... so `__eq__()` will return `NotImplemented`. # However... # In[42]: c1 = PremiumCircle(4, 5, 8) c2 = Circle(2, 3, 5) c1 == c2 # subclass on the *left* side! # Works! This is, again, because of reflection. Note how we called `__eq__()` twice. # # `PremiumCircle.__eq__(Circle)` returns `NotImplemented`, because `PremiumCircle` is not a subclass of `Circle`. Python uses reflection to try the opposite: `Circle.__eq__(PremiumCircle)`: this comparison can take place because `PremiumCircle` *is* a subclass of Circle. # The reflection of [`__eq__()`](https://docs.python.org/3/reference/datamodel.html#object.__eq__) and [`__ne__()`](https://docs.python.org/3/reference/datamodel.html#object.__ne__) with themseves allows us not having to worry about whether `self` is a subclass of `other` or the other way around. # # From the [docs of NotImplemented](https://docs.python.org/3/library/constants.html#NotImplemented) \[emphasis added\]: # # > When a binary (or in-place) method returns `NotImplemented` **the interpreter will try the reflected operation on the other type** (or some other fallback, depending on the operator). If all attempts return `NotImplemented`, the interpreter will raise an appropriate exception [...] # # Implied relationships # # By default, [`__ne__()`](https://docs.python.org/3/reference/datamodel.html#object.__ne__) delegates to [`__eq__()`](https://docs.python.org/3/reference/datamodel.html#object.__eq__) and inverts the result unless it is `NotImplemented`. # # This means that defining [`__eq__()`](https://docs.python.org/3/reference/datamodel.html#object.__eq__) gives us access to `!=` for free. # In[43]: class Circle: def __init__(self, x, y, radius): self.x = x self.y = y self.radius = radius def __eq__(self, other): return self.x == other.x and \ self.y == other.y and \ self.radius == other.radius c1 = Circle(2, 5, 11) c2 = Circle(3, 7, 13) c1 != c2 # ## And nothing else # # There are **no other** implied relationships among the comparison operators. # # Python cannot do smart things like figuring out that if `x >= y` and `x != y`, then if follows that `x > y`. # # Humans understand that if (a) `x` is greater than or equal to `y` and (b) they're not equal, that means that `x` must be greater than `y`. Python cannot. # In[44]: class Circle: def __init__(self, x, y, radius): self.x = x self.y = y self.radius = radius def __eq__(self, other): return self.x == other.x and \ self.y == other.y and \ self.radius == other.radius def __ge__(self, other): return self.radius >= other.radius c1 = Circle(2, 5, 11) c2 = Circle(3, 7, 13) c1 > c2 # # So, all of them? # # So does this mean the in practice we need to define _all_ the comparison methods? # In[45]: class Circle: def __init__(self, x, y, radius): self.x, self.y, self.radius = x, y, radius def __eq__(self, other): return self.x == other.x and \ self.y == other.y and \ self.radius == other.radius def __ne__(self, other): return not self == other def __ge__(self, other): return self.radius >= other.radius def __le__(self, other): return self.radius <= other.radius def __gt__(self, other): return self.radius > other.radius def __lt__(self, other): return self.radius < other.radius # ## Not _all_, because: # # - `!=` returns the opposite of `==`, so we don't have to implement [`__ne__()`](https://docs.python.org/3/reference/datamodel.html#object.__ne__). # - For [`__le__()`](https://docs.python.org/3/reference/datamodel.html#object.__le__) we can use the reflection of [`__gt__()`](https://docs.python.org/3/reference/datamodel.html#object.__gt__) (or vice versa). # - For [`__gt__()`](https://docs.python.org/3/reference/datamodel.html#object.__gt__) we can use the reflection of [`__lt__()`](https://docs.python.org/3/reference/datamodel.html#object.__ne__) (or vice versa). # So we can leave it in: # In[46]: class Circle: def __init__(self, x, y, radius): self.x = x self.y = y self.radius = radius def __eq__(self, other): return self.x == other.x and \ self.y == other.y and \ self.radius == other.radius def __ge__(self, other): return self.radius >= other.radius def __gt__(self, other): return self.radius > other.radius # # `functools.total_ordering()` # # We can forget all these rules and let [`functools.total_ordering()`](https://docs.python.org/3/library/functools.html#functools.total_ordering) do all the work. # # It just needs [`__eq__()`](https://docs.python.org/3/reference/datamodel.html#object.__eq__) and one more (any) comparison operator to derive the rest for us _automatically_. # In[47]: import functools @functools.total_ordering class Circle: def __init__(self, x, y, radius): self.x = x self.y = y self.radius = radius def __eq__(self, other): return self.x == other.x and \ self.y == other.y and \ self.radius == other.radius def __ge__(self, other): return self.radius >= other.radius # The name [total ordering](https://en.wikipedia.org/wiki/Total_order) comes from mathematics: "binary relation on some set X, which is antisymmetric, transitive, and a connex relation". In human-being-speak we can say that a total order is that of a set of elements in which all the possible pairs are comparable. # # The downside, from [the Python docs](https://docs.python.org/3/library/functools.html#functools.total_ordering) \[emphasis added\]: # # > [...] it does come at the **cost of slower execution and more complex stack traces** for the derived comparison methods. If performance benchmarking indicates this is a bottleneck for a given application, implementing all six rich comparison methods instead is likely to provide an easy speed boost. # # Prehistory's `__cmp__()` # # Before Python 3, if the rich comparison methods were not implemented, Python falled back to a different magic method: [`__cmp__()`](https://docs.python.org/2/reference/datamodel.html#object.__cmp__). # # `object.__cmp__(self, other)` would return: # # - A negative integer if `self < other`. # - Zero if `self == other`. # - a positive integer if `self > other`. # # Thus, our early ancestors could choose between implementing [`__cmp__()`](https://docs.python.org/2/reference/datamodel.html#object.__cmp__) or the rich comparison methods — named like this to differentiate them from the more primitive alternative. # # In Python 3 [`__cmp__()`](https://docs.python.org/2/reference/datamodel.html#object.__cmp__) no longer exists. # # Wrapping up this section # # We now know how to create well behaved totally ordered classes. # # - `==` is for equality... # - ... and `is` for identity. # - The default value of [`__eq__()`](https://docs.python.org/3/reference/datamodel.html#object.__eq__). # - Rich comparison operators: # * Return `NotImplemented` if comparison is undefined. # * Fallback to the right operand's reflected method. # * Have only one implied relationsip: [`__ne__()`](https://docs.python.org/3/reference/datamodel.html#object.__ne__) $\leftrightarrow$ [`__eq__()`](https://docs.python.org/3/reference/datamodel.html#object.__eq__) # - [`functools.total_ordering()`](https://docs.python.org/3/library/functools.html#functools.total_ordering) simplifies our life. # - [`__cmp__()`](https://docs.python.org/2/reference/datamodel.html#object.__cmp__) exists only in history books. #
# #
# # [Photo](https://www.pexels.com/photo/adventure-arid-barren-dawn-206724/) by [Pixabay](https://pixabay.com/en/desert-morocco-sand-dune-dry-1748462/) | [CC0](https://www.pexels.com/creative-commons-images/) # # All right, I guess — but do we _really_ care? # # We do care if we want to use our classes in a Python dictionary or a set. # # Mandatory disclaimer # # Everything to be known about dictionaries was said by [Brandon Rhodes](https://rhodesmill.org/brandon/) back in 2010. # #
# # The Mighty Dictionary (Brandon Rhodes, PyCon 2010). #
# # One-minute overview # # If we work with a a Python list, we access elements by their index number. # In[48]: numbers = [2, 3, 5, 7, 11] # In[49]: numbers[0] # In[50]: numbers[2] # In[51]: numbers[-1] # A [bijective function](https://en.wikipedia.org/wiki/Bijection) maps each index element to the corresponding value. # In[52]: numbers = [2, 3, 5, 7, 11] # For example, index two (i.e. third element in the array) maps to value 5. # # However, we cannot easily go the other way around. Given, say, the value 7: # - Is it in the array? # - What's its index? # # The answer to these questions is $\mathcal{O}(n)$. # In[53]: 7 in numbers # O(n) # Same for: # In[54]: numbers.index(7) # O(n) # Python is doing the equivalent of: # In[55]: def index(iterable, wanted): """Return index of 'wanted' in 'iterable'; -1 if not found.""" for index, element in enumerate(iterable): if element == wanted: return index return -1 index(numbers, 7) # To make it explicit: we have to visit _all_ the elements in the iterable, so $\mathcal{O}(n)$. # # Dictionaries # # - Python [dictionaries](https://docs.python.org/3/tutorial/datastructures.html#dictionaries) are a [mapping type](https://docs.python.org/3/library/stdtypes.html#typesmapping). # - Instead of indexing elements by a range of numbers, we use *keys*. # - Keys can be any **immutable** (more of that later) type, e.g. numbers or strings. # - They map each key to the corresponding value. # In[56]: numbers = {} # now a dict numbers[1] = "one" numbers[3] = "three" numbers[7] = "seven" # In[57]: numbers[3] # In[58]: 5 in numbers # In[59]: del numbers[1] # These operations, unlike for the list, are $\mathcal{O}(1)$. How? Because dictionaries are Python's name for... # # Hash tables # # - The most important data structure known to Humankind, [as Steve Yegge put it](https://steve-yegge.blogspot.com/2008/03/get-that-job-at-google.html). # - Formally known as [unordered associative array](https://en.wikipedia.org/wiki/Hash_table), invented in 1953. # - Insertion, deletion and lookups have an amortized cost of $\mathcal{O}(1)$ (!) # - Speed, speed, speed. Instant lookup of a value, regardless of how many we have! # # How do hash tables achieve this? # # A dictionary is really a list # # This is the first plot twist. # # A hash table is nothing else than a list (or _array_, in non-Python-specific terminology). # # Let's create our hash table: # In[60]: table = [None, None, None, None, None] # Let's visualize it vertically: # In[61]: table = [ None, # index = 0 None, # index = 1 None, # index = 2 None, # index = 3 None, # index = 4 ] # - The difference is that the **index** of each element depends on its **value**. # - Given a value, we can compute its index, and directly go there in $\mathcal{O}(1)$. # - In this manner, we know exactly where to look up the element — no need to loop over all of them. # # How do we compute the indexes? # # Hash functions # # From [Wikipedia](https://en.wikipedia.org/wiki/Hash_function): # # > A hash function is any function that can be used to map data of arbitrary size to data of a fixed size. # # - It's a mathematical function that maps data to some smaller value. # - We move from the [domain](https://en.wikipedia.org/wiki/Domain_of_a_function) of the hash function (the set of possible input values)... # - ... to a _smaller_ domain: the number of outputs of the hash function. # - The hash function _compresses_ our data to a fixed size. # # A simple hash function # In[62]: def simple_hash(str_): return len(str_) # In[63]: simple_hash("dog") # In[64]: simple_hash("tiger") # In[65]: simple_hash("Tyrannosaurus Rex") # We take data (strings, in this case) and return a numeric value. # ## Using our hash function to compute indexes # In[66]: table = [ None, # index = 0 None, # index = 1 None, # index = 2 None, # index = 3 None, # index = 4 ] # Let's get the hash of `"dog"`: # In[67]: word = 'dog' word_hash = simple_hash(word) print("Hash of", word, "is", word_hash) # Use the hash as the index to store the value in our table: # In[68]: index = word_hash table[index] = word # O(1) table # Let's now look up a value. Is `"bear"` in the hash table? # In[69]: word = 'bear' index = simple_hash(word) table[index] != None # # First problem: out of range # # Our function can return _any_ value, exceeding the size of the hash table. # In[70]: word = 'monkey' index = simple_hash(word) table[index] = word # ## Solution: modulo operator # # - That is, we take the remainder after dividing by the length of the list. # - [Modular arithmetic](https://en.wikipedia.org/wiki/Modular_arithmetic), the same way we all know that 70 minutes is one hour _plus_ 10 minutes. # # So the operation becomes: # In[71]: word = 'monkey' index = simple_hash(word) % len(table) print('{} % {} == {}'.format(simple_hash(word), len(table), index)) # In[72]: table[index] = word table # Is `"monkey"` in the hash table? # In[73]: word = 'monkey' index = simple_hash(word) % len(table) table[index] != None # # Second problem: collisions! # # Because we transform the input values into a _smaller_ domain, sooner or later we'll come across two _different_ keys that map to the _same_ hash value. # # This is known as a **collision**. # Right now, our hash table looks like this: # In[74]: table = [ None, # index = 0 'monkey', # index = 1 None, # index = 2 'dog', # index = 3 None, # index = 4 ] # A [wild new](https://knowyourmeme.com/memes/a-wild-x-appears-wild-x-appeared) word appears: # In[75]: word = 'cat' index = simple_hash(word) % len(table) table[index] = word table # In[76]: table # - `"cat"` and `"dog"` have the same length. # - Since our hash function is the length, we ended up in the same position. # - We override our value! `"dog"` is gone. # # This is a collision. They _will_ happen, so our hash table needs to take them into account. # # More on this later. # # More Trivia: PHP # # [PHP functions originally bucketed by strlen, were renamed to balance length](https://news.ycombinator.com/item?id=6919216): # # > Well, there were other factors in play there. htmlspecialchars was a # very early function. Back when PHP had less than 100 functions and the # function hashing mechanism was strlen(). **In order to get a nice hash # distribution of function names across the various function name lengths # names were picked specifically to make them fit into a specific length # bucket**. This was circa late 1994 when PHP was a tool just for my own # personal use and I wasn't too worried about not being able to remember # the few function names. \[emphasis added\] # # Most programming languages use a hash table to store [symbols][1]: e.g. function name to function object or pointer. # # PHP was hashing by _length_, which lead to lots of collisions. # # [1]: https://en.wikipedia.org/wiki/Symbol_(programming) # Solution? Rename functions to smooth the distribution of lengths. # # From [PHP has inconsistent function naming](https://tnx.nl/php.html#names): # # ```text # underscore no underscore: # stream_get_line readline # disk_free_space diskfreespace # is_object isset # mcal_day_of_week jddayofweek # ``` # # We live in the darkest timeline. # # Behave, hash function! # # Formally speaking, a good hash function satisfies two main properties: # # ## [Determinism](https://en.wikipedia.org/wiki/Hash_function#Determinism) # # The same input value should produce the same output value. In other words: the hash is a [mathematical function][1] of the data being hashed. We cannot depend on external variable parameters such as a pseudo-random number generator. # # Our function satisfied this, as the length of a string is deterministic. # # [1]: https://en.wikipedia.org/wiki/Function_(mathematics) # ## [Uniformity](https://en.wikipedia.org/wiki/Hash_function#Uniformity) # # The hashes should be distributed as evently as possible over the output range. In other words: every hash should have the same probability of being generated. # # This is where our hash function (as well as PHP's) failed. Words in English... # # > [...] are reasonably well fitted by a shifted Poisson distribution with mean and variance equal to 6.94 and 5.80. [[Source]](https://www.sciencedirect.com/science/article/pii/0378375886901692) # # This would lead to _a lot_ of collisions. # # A real-life example: MD5 # # - A hash function producing a 128-bit hash value. # - Designed by [Ronald Rivest](https://en.wikipedia.org/wiki/Ron_Rivest) in 1992. # ```text # $ echo "Let there be light" | md5sum # b3d440d0691efe7d7d485602f134f469 - # ``` # - 128 bits $\rightarrow$ probability of two random hashes accidentally colliding is $1\ /\ {2}^{128}$ # - Initially designed to be used as a cryptographic hash function. # - In 2004 it was shown that MD5 is not collision-resistant [[Wikipedia](https://en.wikipedia.org/wiki/MD5#History_and_cryptanalysis)] # - For example, given a file we can find generate a different one with the same hash. # MD5 is still useful to verify data integrity against _unintentional_ corruption. # # For example, [Debian](https://www.debian.org/) gives us [these MD5 checksums](https://cdimage.debian.org/debian-cd/current/amd64/iso-cd/MD5SUMS): # # ```text # f8446a84356a6bcbf79201cc9f46063f debian-9.5.0-amd64-netinst.iso # 47a5dca818220d8558d37dfa11b85550 debian-9.5.0-amd64-xfce-CD-1.iso # 2daa085925a556d35a0eeb27768cc892 debian-mac-9.5.0-amd64-netinst.iso # ``` # # Then when I download the ISO: # # ```text # wget https://cdimage.debian.org/debian-cd/current/amd64/iso-cd/debian-9.5.0-amd64-netinst.iso # md5sum debian-9.5.0-amd64-netinst.iso # f8446a84356a6bcbf79201cc9f46063f # ``` # # Same hash, so _moooost_ likely no corruption. # # Python's `hash()` # # Meet the built-in [`hash()`](https://docs.python.org/3/library/functions.html#hash) function: # # > Return the hash value of the object (if it has one). Hash values are integers. They are used to quickly compare dictionary keys during a dictionary lookup. Numeric values that compare equal have the same hash value (even if they are of different types, as is the case for 1 and 1.0). # ## Integers # In[77]: hash(0) # In[78]: hash(5) # In[79]: hash(113) # In[80]: hash(7.3) # # Trivia redux # In[81]: hash(-5) # In[82]: hash(-2) # In[83]: hash(-1) # also -2! # Why? # > The hash value -1 is reserved (it’s used to flag errors in the C implementation). If the hash # algorithm generates this value, we simply use -2 instead. [[Source](http://effbot.org/zone/python-hash.htm)] # # # Proof in the [CPython source code](https://github.com/python/cpython/blob/2847ccae4687cb43334d87d86fb6c11cb14218f5/Objects/longobject.c#L2986-L2988): # # ```c # if (x == (Py_uhash_t)-1) # x = (Py_uhash_t)-2; # return (Py_hash_t)x; # ``` # ## Strings # # For strings the algorithm is much more complex. # In[84]: hash("dog") # In[85]: hash("cat") # In[86]: hash("giraffe") # Looking at the [CPython source code](https://github.com/python/cpython/blob/ad65f15581173542f1d2a9968a63bee272510ce3/Python/pyhash.c#L161-L183): # # ```c # if (len < Py_HASH_CUTOFF) { # /* Optimize hashing of very small strings with inline DJBX33A. */ # Py_uhash_t hash; # const unsigned char *p = src; # hash = 5381; /* DJBX33A starts with 5381 */ # # switch(len) { # /* ((hash << 5) + hash) + *p == hash * 33 + *p */ # case 7: hash = ((hash << 5) + hash) + *p++; /* fallthrough */ # case 6: hash = ((hash << 5) + hash) + *p++; /* fallthrough */ # case 5: hash = ((hash << 5) + hash) + *p++; /* fallthrough */ # case 4: hash = ((hash << 5) + hash) + *p++; /* fallthrough */ # case 3: hash = ((hash << 5) + hash) + *p++; /* fallthrough */ # case 2: hash = ((hash << 5) + hash) + *p++; /* fallthrough */ # case 1: hash = ((hash << 5) + hash) + *p++; break; # default: # Py_UNREACHABLE(); # } # hash ^= len; # hash ^= (Py_uhash_t) _Py_HashSecret.djbx33a.suffix; # x = (Py_hash_t)hash; # } # ``` # ## Tuples # In[87]: hash((1, 2)) # In[88]: hash(("dog", "cat")) # In[89]: hash(("giraffe", 113, 5.1)) # Taking a look at the [CPython source code](https://github.com/python/cpython/blob/2847ccae4687cb43334d87d86fb6c11cb14218f5/Objects/tupleobject.c#L355-L367): # ```c # x = 0x345678UL; # p = v->ob_item; # while (--len >= 0) { # y = PyObject_Hash(*p++); # if (y == -1) # return -1; # x = (x ^ y) * mult; # /* the cast might truncate len; that doesn't change hash stability */ # mult += (Py_hash_t)(82520UL + len + len); # } # x += 97531UL; # if (x == (Py_uhash_t)-1) # x = -2; # ``` # # `hash()` on our own classes # # [`hash()`](https://docs.python.org/3/library/functions.html#hash) delegates the method to the [`__hash__()`](https://docs.python.org/3/reference/datamodel.html#object.__hash__) magic method. # In[90]: x = 3 hash(x) # This is equivalent to... # In[91]: x.__hash__() # Let's _not_ implement it in our class... # In[92]: class Circle: def __init__(self, x, y, radius): self.x = x self.y = y self.radius = radius c = Circle(5, 7, 9) hash(c) # It returns _something_, nevertheless! # From the [docs](https://docs.python.org/3/reference/datamodel.html#object.__hash__): # # > User-defined classes have [`__eq__()`](https://docs.python.org/3/reference/datamodel.html#object.__eq__) and [`__hash__()`](https://docs.python.org/3/reference/datamodel.html#object.__hash__) methods by default; with them, all objects compare unequal (except with themselves) and `x.__hash__()` returns an appropriate value such that `x == y` implies both that `x is y` and `hash(x) == hash(y)`. # # So the hash will only be the same if... the object is the same. # # The (default) hash function is not based on the _value_ of the object, as we would expect, but on its _identity_. # In[93]: class Circle: def __init__(self, x, y, radius): self.x = x self.y = y self.radius = radius c1 = Circle(5, 7, 9) c2 = Circle(5, 7, 9) # In[94]: print("Hash of c1:", hash(c1)) print("Hash of c2:", hash(c2)) # Different objects, different hashes! # # Implementing `__hash__()` # # To do things right, we need to implement our own [`__hash__()`](https://docs.python.org/3/reference/datamodel.html#object.__hash__) method. # # The [docs](https://docs.python.org/3/reference/datamodel.html#object.__hash__) recommend \[emphasis added\]: # # > [...] it is advised to mix together the hash values of the components of the object that also play a part in comparison of objects by **packing them into a tuple and hashing the tuple**. # In[95]: class Circle: def __init__(self, x, y, radius): self.x = x self.y = y self.radius = radius def __hash__(self): return hash((self.x, self.y, self.radius)) c = Circle(2, 3, 11) hash(c) # In this manner, we delegate [`__hash__()`](https://docs.python.org/3/reference/datamodel.html#object.__hash__) to the hash function to the `tuple` object, which in turn will use the hashes of the attributes (`x`, `y` and `radius`) of our class. # # This hashes will be, as we need, both deterministic and uniform over the output range. # # Wrapping up this section # # - Dictionaries are hash tables. # - Hash tables are _the_ data structure: $\mathcal{O}(1)$ operations. # - Using hash functions to compute hash table indexes. Two problems: # * Values out of range $\rightarrow$ modulo operator. # * Collisions $\rightarrow$ will happen, sooner or later. # - PHP's inconsistent function naming — now we know why. # - The two properties of a good hash function: # * Determinism. # * Uniformity. # - A real-life example: MD5. # - Python's built-in `hash()`. # - Implementing `__hash__()`. # # Wrapping up this section # # - Dictionaries are hash tables. # - Hash tables are _the_ data structure: $\mathcal{O}(1)$ operations. # - Using hash functions to compute hash table indexes. Two problems: # * Values out of range $\rightarrow$ modulo operator. # * Collisions $\rightarrow$ will happen, sooner or later. # - PHP's inconsistent function naming — now we know why. # - The two properties of a good hash function: # * Determinism. # * Uniformity. # - Python's built-in `hash()`. # - Implementing `__hash__()`. #
# #
# # [Photo](https://www.pexels.com/photo/sand-desert-sand-dunes-dune-50628/) by [Pixabay](https://pixabay.com/en/morocco-africa-desert-marroc-sand-123976/) | [CC0](https://www.pexels.com/creative-commons-images/) # # Implementing our own hash table # # Let's use all we learned to implement our own table hash — this time handling collisions. # We need to split this class over multiple [Jupyter](http://jupyter.org/) cells, so we'll use this decorator: # In[96]: def add_to(cls): """A class decorator to dynamically add methods.""" def wrapper(func): setattr(cls, func.__name__, func) return cls return wrapper # In[97]: class HashTable: """Initial prototype of a hash table.""" def __init__(self, size=8): self._slots = [None] * size def __len__(self): count = 0 for key in self._slots: if key is not None: count += 1 return count # In[98]: @add_to(HashTable) def __setitem__(self, key, value): index = hash(key) % len(self._slots) self._slots[index] = value @add_to(HashTable) def __getitem__(self, key): index = hash(key) % len(self._slots) if self._slots[index] is None: raise KeyError(key) return self._slots[index] @add_to(HashTable) def __delitem__(self, key): index = hash(key) % len(self._slots) self._slots[index] = None # Let's try it... # In[99]: table = HashTable() table["cat"] = "Katze" table["cat"] # In[100]: table._slots # We can double check this is the right index: # In[101]: hash("cat") % 8 # Deletion also works: # In[102]: del table["cat"] table["cat"] # # A couple of improvements: # # 1. Keep a *counter* of values, instead of calculating them every single time. # 2. Factor our *index calculation* to a private method of the class. # In[103]: class HashTable: """Initial prototype of a hash table, second attempt.""" def __init__(self, size=8): self._slots = [None] * size self._count = 0 def __len__(self): return self._count def _get_index(self, key): return hash(key) % len(self._slots) # In[104]: @add_to(HashTable) def __setitem__(self, key, value): index = self._get_index(key) self._slots[index] = value self._count += 1 @add_to(HashTable) def __getitem__(self, key): index = self._get_index(key) if self._slots[index] is None: raise KeyError(key) return self._slots[index] @add_to(HashTable) def __delitem__(self, key): index = self._get_index(key) if self._slots[index] is None: raise KeyError(key) self._slots[index] = None self._count -= 1 # In[105]: table = HashTable() table["dog"] = "Hund" table["bear"] = "Bär" table._slots # In[106]: len(table) # This is all good, but again – **we have to keep collisions in mind**. # # Sooner than later two elements are going to end up in the _same_ slot. # # A simple approach: separate chaining # # - Each slot is independent, and keeps a list entries with the _same_ index. # - All the keys that share index are kept one after another in the list... # - ... that is, chained, hence the name. # - The algorithm becomes: # * Get the index $i$ of our element. # * Add the element to the $i$-th list in our hash table. # Thus, our empty hash table looks like this: # In[107]: table = [ [], # index = 0 [], # index = 1 [], # index = 2 [], # index = 3 [], # index = 4 ] # Say we want to map `"dog"` to its German translation, `"Hund"`. # # It lands in, for example, `index == 3`. # In[108]: table = [ [], # index = 0 [], # index = 1 [], # index = 2 ['Hund'], # index = 3 [], # index = 4 ] # Then we map `"bear"` to its German translation, `"Bär"`. # # It lands in, say, `index == 0`. # In[109]: table = [ ["Bär"], # index = 0 [], # index = 1 [], # index = 2 ['Hund'], # index = 3 [], # index = 4 ] # Now we store `"cat"`, and assume that it _also_ lands in `index == 3`. # In[110]: table = [ ["Bär"], # index = 0 [], # index = 1 [], # index = 2 ['Hund', 'Katze'], # index = 3 [], # index = 4 ] # Do you see the problem we face now? # # Handling collisions # # Somebody asks us "How do you say _cat_ in German". # # We use the hash of `"cat"` to compute its index, and get that `index == 3`. # In[111]: table = [ ["Bär"], # index = 0 [], # index = 1 [], # index = 2 ['Hund', 'Katze'], # index = 3 [], # index = 4 ] # Is it `"Hund"` or `"Katze"`? How can we tell them apart? We can't! # # We also need the _key_ # # In order to deal with collisions we have to store the *key* itself, in addition to the mapped value. # #
# # # Image by Jorge Stolfi | CC BY-SA 3.0 #
# # Now with keys! # In[112]: import collections Entry = collections.namedtuple('Entry', 'key, value') class HashTable: """A separate-chaining hash table.""" def __init__(self, size=8): self._slots = [list() for _ in range(size)] self._count = 0 def __len__(self): return self._count def _get_index(self, key): return hash(key) % len(self._slots) # In[113]: @add_to(HashTable) def __setitem__(self, key, value): slot_index = self._get_index(key) slot = self._slots[slot_index] for entry_index, entry in enumerate(slot): if entry.key == key: # note the equality comparison here... slot[entry_index] = entry._replace(value=value) break else: entry = Entry(key, value) self._slots[slot_index].append(entry) self._count += 1 # In[114]: @add_to(HashTable) def __getitem__(self, key): index = self._get_index(key) for entry in self._slots[index]: if entry.key == key: # and here... return entry.value raise KeyError(key) # In[115]: @add_to(HashTable) def __delitem__(self, key): slot_index = self._get_index(key) slot = self._slots[slot_index] for entry_index, entry in enumerate(slot): if entry.key == key: # ... and here del slot[entry_index] break else: raise KeyError(key) self._count -= 1 # In[116]: table = HashTable() # In[117]: table["dog"] = "perro" table["bear"] = "Bär" # `"perro"` is Spanish, not German! Let's replace the value. # In[118]: table["dog"] = "Hund" table["dog"] # Let's take a look at the internals: # In[119]: table._slots # # Real life is _so_ much more complex # # - We used this collision resolution strategy here because it's simple. # - Python uses a different one: [open addressing](https://en.wikipedia.org/wiki/Hash_table#Open_addressing) with pseudo-random probing. # - See the superb documentation at [cpython/Objects/dictobject.c](https://github.com/python/cpython/blob/master/Objects/dictobject.c). #
# # # Image by Jorge Stolfi | CC BY-SA 3.0 #
# # Still useful # # Python may use a different strategy, but the foundations are the same. We need: # # 1. The hash to find the corresponding slot (also known as _bucket_). # 2. The element itself to compare (`==`) and tell it apart from other elements what may have collided with it. # Our hash table is really nice! Let's add functionality to loop over the elements: # # Now with `items()`! # In[120]: @add_to(HashTable) def items(self): for slot in self._slots: for entry in slot: yield entry table = HashTable() table["dog"] = "Hund" table["bear"] = "Bär" for key, value in table.items(): print(key, "->", value) # Alternatively, in a single line: # In[121]: import itertools @add_to(HashTable) def items(self): yield from itertools.chain.from_iterable(self._slots) # # Resizing # # Our hash table is _almost_ perfect! The only downside: # # We only have *eight* slots! Sooner than later it's going to get full and performance start degrading. # # The solution is to resize, also known as _rehashing_. # # - Allocate a new, _larger_ underlying data structure. # - Remove each entry from old `slots`... # - ... and insert it into the new one. # - After _all_ entries are removed from the old `slots`, we no longer need it. # - From now on, use the new slots data structure. # - Takes forever, but [amortized cost](https://en.wikipedia.org/wiki/Amortized_analysis) is $\mathcal{O}(1)$. # # Now with dynamic resizing! # In[122]: import collections import itertools Entry = collections.namedtuple('Entry', 'key, value') class HashTable: """A separate-chaining, dynamically resized hash table.""" def __init__(self, initial_size=8, load_factor=0.8): self._slots = [list() for _ in range(initial_size)] self._size = 0 self._load_factor = load_factor def __len__(self): return self._size @staticmethod def _get_index(key, slots): return hash(key) % len(slots) @classmethod def _insert(cls, slots, key, value): index = cls._get_index(key, slots) slots[index] = value # In[123]: @add_to(HashTable) def __setitem__(self, key, value): slot_index = self._get_index(key, self._slots) slot = self._slots[slot_index] for entry_index, entry in enumerate(slot): if entry.key == key: # note the equality comparison here... slot[entry_index] = entry._replace(value=value) break else: entry = Entry(key, value) self._slots[slot_index].append(entry) self._size += 1 if len(self) / len(self._slots) >= self._load_factor: self._rehash() # In[124]: @add_to(HashTable) def _rehash(self): new_slots = [list() for _ in range(len(self._slots) * 2)] for key, value in self.items(): HashTable._insert(new_slots, key, value) self._slots = new_slots # In[125]: # The remaining methods; not worth showing on the slides, as they're basically the same ones # we implemented earlier, but they're included here for the sake of completeness --- this is # the last version of our hash table implementation. @add_to(HashTable) def __getitem__(self, key): index = self._get_index(key, self._slots) for entry in self._slots[index]: if entry.key == key: # note the equality comparison here... return entry.value raise KeyError(key) @add_to(HashTable) def __delitem__(self, key): slot_index = self._get_index(key, self._slots) slot = self._slots[slot_index] for entry_index, entry in enumerate(slot): if entry.key == key: # ... and here del slot[entry_index] break else: raise KeyError(key) self._size -= 1 @add_to(HashTable) def items(self): yield from itertools.chain.from_iterable(self._slots) # # Future improvements # # - Shrink when we delete too many elements (i.e., it's too empty, so make it smaller). # - Alternatives to all-at-once-rehashing: # * [Incremental resizing](https://en.wikipedia.org/wiki/Hash_table#Incremental_resizing). # * [Monotonic keys](https://en.wikipedia.org/wiki/Hash_table#Monotonic_keys). # * [Linear hashing](https://en.wikipedia.org/wiki/Hash_table#Linear_hashing). # In the words of [Andrew Cooke](https://stackoverflow.com/users/181772/andrew-cooke) on [Stack Overflow](https://stackoverflow.com/a/10130506) \[emphasis added\]: # # > Therefore, the hash code only hints at the location of an object, an equality test follows to test candidate keys. To implement a membership test in a hash-table set, **the hash code gives you "bucket" number in which to search for the value**. However, all set items with the same hash code are in the bucket. For this, **you also need an equality test to distinguish between all candidates in the bucket**. # # That's it. # # Sets # # What we learned is also useful to understand [how Python sets work](https://groups.google.com/forum/#!topic/comp.lang.python/4-nf21y-M8g): # # - They are, in essence, a hash table with dummy values. # - In other words: the members of the set are the keys of the hash table. # - Plus, of course, optimizations to exploit this lack of values. # # Wrapping up this section # # - We implemented our own hash table. # * Need to store both keys and values. # * Handled collisions resolution via separate chaining. # * Dynamic resizing when it's too full. # - Python dictionaries have a much more complex implementation. # - Sets are in essence a hash table with dummy values. # # Wrapping up this section # # - We implemented our own hash table. # * Need to store both keys and values. # * Handled collisions resolution via separate chaining. # - Python dictionaries have a much more complex implementation. # - Sets are in essence a hash table with dummy values. #
# #
# # [Photo](https://www.pexels.com/photo/silhouette-photography-of-desert-998649/) by [Francesco Ungaro](https://www.pexels.com/@ghostpresenter) | [Pexels License](https://www.pexels.com/photo-license/) # # We're back # # We're —finally— ready to revisit [the paragraph](https://docs.python.org/3/reference/datamodel.html#object.__hash__) from the beginning: # # > If a class does not define an [\_\_eq\_\_()](https://docs.python.org/3/reference/datamodel.html#object.__eq__) method it should not define a [\_\_hash\_\_()](https://docs.python.org/3/reference/datamodel.html#object.__hash__) operation either; if it defines [\_\_eq\_\_()](https://docs.python.org/3/reference/datamodel.html#object.__eq__) but not [\_\_hash\_\_()](https://docs.python.org/3/reference/datamodel.html#object.__hash__), its instances will not be usable as items in hashable collections. If a class defines mutable objects and implements an [\_\_eq\_\_()](https://docs.python.org/3/reference/datamodel.html#object.__eq__) method, it should not implement [\_\_hash\_\_()](https://docs.python.org/3/reference/datamodel.html#object.__hash__), since the implementation of hashable collections requires that a key’s hash value is immutable (if the object’s hash value changes, it will be in the wrong hash bucket). # # We now have the tools to understand —and remember— _why_. # # Primo # # > If a class does not define an [\_\_eq\_\_()](https://docs.python.org/3/reference/datamodel.html#object.__eq__) method it should not define a [\_\_hash\_\_()](https://docs.python.org/3/reference/datamodel.html#object.__hash__) operation either; [...] # # Rephrased: defining [`__hash__()`](https://docs.python.org/3/reference/datamodel.html#object.__hash__) but _not_ [`__eq__()`](https://docs.python.org/3/reference/datamodel.html#object.__eq__) is wrong. Why? # Let's think of a hash table look up: # # - Python uses [`__hash__()`](https://docs.python.org/3/reference/datamodel.html#object.__hash__) to compute the slot where the value goes. # - This gives us the "_bucket_" in which to search for our value. # - Now we need the equality test to distingush between all candidates in the bucket. # - We didn't implement [`__eq__()`](https://docs.python.org/3/reference/datamodel.html#object.__eq__), so by default our objects compare unequal except with themselves. # - Therefore, **we cannot find our entry**. # # See [this Stack Overflow question](https://stackoverflow.com/q/42061019) for an example of somebody who ran into this. # ## An example that works # # In our implementation of a hash table, if we look up `"cat"`: # In[126]: table = [ [], # 0 [Entry(key='bear', value='Bär')], # 1 [], # 2 [Entry(key='dog', value='Hund'), Entry(key='cat', value='Katze')], # 3 [], # 4 ] # Python uses `str.__hash__()` and determines that `index == 3`. # * There are two different entries there, so I need to compare for equality. # * Python uses `str.__eq__()` to compare `"cat"` to `"dog"` $\rightarrow$ `False`. # * Second entry: `"cat" == "cat"` $\rightarrow$ `True`. We found our entry. # ## An example that doesn't work. # # If we have [`__hash__()`](https://docs.python.org/3/reference/datamodel.html#object.__hash__) but not [`__eq__()`](https://docs.python.org/3/reference/datamodel.html#object.__eq__), we'll land in the right slot but won't be able to find our value. # In[127]: class Cyborg: def __init__(self, name, weapon): self.name = name self.weapon = weapon def __hash__(self): return hash((self.name, self.weapon)) robot = Cyborg('T-1000', 'Franchi SPAS-12') cyborgs = set() cyborgs.add(robot) robot in cyborgs # However.. # In[128]: cyborgs = {Cyborg('T-1000', 'Franchi SPAS-12')} Cyborg('T-1000', 'Franchi SPAS-12') in cyborgs # doesn't work! # Without [`__eq__()`](https://docs.python.org/3/reference/datamodel.html#object.__eq__), these objects only compare equal to _themselves_. # As we saw earlier for `Point`... # In[129]: robot1 = Cyborg('T-800', 'M79 grenade launcher') robot2 = Cyborg('T-800', 'M79 grenade launcher') # In[130]: hash(robot1) # In[131]: hash(robot2) # In[132]: robot1 == robot2 # # Secondo # # > [...] if [a class] defines [\_\_eq\_\_()](https://docs.python.org/3/reference/datamodel.html#object.__eq__) but not [\_\_hash\_\_()](https://docs.python.org/3/reference/datamodel.html#object.__hash__), its instances will not be usable as items in hashable collections. # # This is —now— almost trivial: if we don't have [\_\_hash\_\_()](https://docs.python.org/3/reference/datamodel.html#object.__hash__), we cannot know in which slot the value is. # Indeed, Python won't even let us compute their hash. # In[3]: class Cyborg: def __init__(self, name, weapon): self.name = name self.weapon = weapon def __eq__(self, other): return self.name == other.name and self.weapon == other.weapon robot = Cyborg('T-850', 'Ithaca 37') hash(robot) # ## But wait # # However, we saw earlier that by default both [`__eq__()`](https://docs.python.org/3/reference/datamodel.html#object.__eq__) and [`__hash__()`](https://docs.python.org/3/reference/datamodel.html#object.__hash__) return _something_. # # Indeed, if we don't define _none_ of them Python doesn't complain. # In[4]: class Cyborg: def __init__(self, name, weapon): self.name = name self.weapon = weapon robot = Cyborg('T-850', 'Ithaca 37') hash(robot) # # Making it unhashable # # How do we _not_ define [`__hash__()`](https://docs.python.org/3/reference/datamodel.html#object.__hash__)? By setting it to `None`. # # From [the docs](https://docs.python.org/3/reference/datamodel.html#object.__hash__): # # > When the [`__hash__()`](https://docs.python.org/3/reference/datamodel.html#object.__hash__) method of a class is `None`, instances of the class will raise an appropriate [`TypeError`](https://docs.python.org/3/library/exceptions.html#TypeError) when a program attempts to retrieve their hash value, and will also be correctly identified as unhashable when checking `isinstance(obj, collections.abc.Hashable)`. # ## An example # In[133]: class Cyborg: def __init__(self, name, weapon): self.name = name self.weapon = weapon __hash__ = None robot = Cyborg('T-850', 'Ithaca 37') hash(robot) # Also: # In[134]: best_friend = dict() best_friend[robot] = "Sarah Connor" # And of course: # In[135]: robot_army = set() robot_army.add(robot) # All of them fail because our class is not hashable. # # Getting It Wrong, Part I # # From [the docs](https://docs.python.org/3/reference/datamodel.html#object.__hash__): # # > If a class that does not override [`__hash__()`](https://docs.python.org/3/reference/datamodel.html#object.__eq__) wishes to suppress hash support, it should include `__hash__ = None` in the class definition. # # Note that _returning_ `None` does not work. # In[136]: class Cyborg: def __init__(self, name, weapon): self.name = name self.weapon = weapon def __hash__(self): return None robot = Cyborg('T-1000', 'Franchi SPAS-12') hash(robot) # # Getting It Wrong, Part II # # The docs also [warn us](https://docs.python.org/3/reference/datamodel.html#object.__hash__): # # > A class which defines its own [`__hash__()`](https://docs.python.org/3/reference/datamodel.html#object.__hash__) that explicitly raises a [`TypeError`](https://docs.python.org/3/library/exceptions.html#TypeError) would be incorrectly identified as hashable by an `isinstance(obj, collections.abc.Hashable)` call. # ## Abstract base classes # # [`collections.abc.Hashable`](https://docs.python.org/3/library/collections.abc.html#collections.abc.Hashable) is an [abstract base class](https://docs.python.org/3/library/abc.html). We use them # # >[... ] to test whether a class provides a particular interface; for example, whether it is hashable or whether it is a mapping. # ## Example 1 # # Is this variable iterable? # In[137]: word = "abc" # We could try to loop over it and catch the error... # In[138]: def is_iterable(obj): try: for item in obj: # can exit immediately, just want to check it doesn't fail return True except TypeError: return False is_iterable(word) # ... or instead use: # In[139]: import collections.abc isinstance(word, collections.abc.Iterable) # ## Example 2 # # Does this variable have a size? # In[140]: t = (x ** 2 for x in [4, 5, 6, 8, 9]) # We could try to call [`len()`](https://docs.python.org/3/library/functions.html#len)... # In[141]: def is_sized(obj): try: len(obj) return True except TypeError: return False is_sized(t) # Alternatively, we could check whether the object `hasattr(obj, '__len__')`. # ... or instead: # In[142]: isinstance(t, collections.abc.Sized) # ## Example 3 # # In the same way, to check whether a variable is hashable we could just try... # In[143]: word = "Katze" def is_hashable(obj): try: hash(obj) return True except TypeError: return False is_hashable(word) # ... or instead: # In[144]: isinstance(word, collections.abc.Hashable) # # Don't raise `TypeError` # # The docs warn us that raising [`TypeError`](https://docs.python.org/3/library/exceptions.html#TypeError) ourselves will not work. # In[145]: class Cyborg: def __init__(self, name, weapon): self.name = name self.weapon = weapon def __hash__(self): raise TypeError("class Cyborg is not hashable") robot = Cyborg('T-800', 'M79 grenade launcher') hash(robot) # This was expected. However... # In[146]: isinstance(robot, collections.abc.Hashable) # Wrong! We broke it! # # Use `__hash__ = None`. # # Terzo # # And finally: # # > [...] If a class defines mutable objects and implements an [\_\_eq\_\_()](https://docs.python.org/3/reference/datamodel.html#object.__eq__) method, it should not implement [\_\_hash\_\_()](https://docs.python.org/3/reference/datamodel.html#object.__hash__), since the implementation of hashable collections requires that a key’s hash value is immutable (if the object’s hash value changes, it will be in the wrong hash bucket). # # This takes us to our final point: the hashability of _mutable_ objects. # # Hashing mutable objects. # # Well, we can't. # # Tuples are hashable... # In[147]: hash((1, 2, 3)) # ... but lists aren't. # In[148]: hash([5, 7, 9]) # # I'd never try to do that! # # Perhaps we've never tried calling [`hash()`](https://docs.python.org/3/library/functions.html#hash) explicitly for a list, but we may have run into this: # # Say we want to map two points $(x, y)$, to the result of ${x}^{y}$. # In[149]: squares = dict() squares[[2, 7]] = 2 ** 7 # Same error! It's unhashable. # For this we need to use tuples... # In[150]: squares = dict() squares[(2, 7)] = 2 ** 7 squares # ... or perhaps a two-level [`defaultdict`](https://docs.python.org/3/library/collections.html#collections.defaultdict): # In[151]: import collections squares = collections.defaultdict(dict) squares[2][7] = 2 ** 7 squares # In[152]: squares[2][7] # # But I _really_ want to hash a list # # Let's imagine a list were hashable and we used it as a key in a dictionary. # In[153]: x = [2, 3, 5] # - Python would use [`__hash__()`](https://docs.python.org/3/library/functions.html#hash) to compute the slot where the list goes. # - We would go to that slot and store the list there, as usual. # Later on, we mutate our list, say: # In[154]: x[1] = 7 x # - Now we want to loop up `x` in the dictionary. # - Python would use [`__hash__()`](https://docs.python.org/3/library/functions.html#hash) and get a _different_ value, hence a _different_ slot. # - Even if (by chance) we ended up in the same slot, we would need the equality test to distinguish between all candidates in the bucket: old and new `x` are different, so they will compare unequal. # - We can never find our list in the dictionary! # # A hashable mutable class — sadness ensues # # - Let's implement a `Cyborg` class. # - We'll implement [`__eq__()`](https://docs.python.org/3/reference/datamodel.html#object.__eq__)... # - ... but _also_ implement [`__hash__()`](https://docs.python.org/3/reference/datamodel.html#object.__hash__). # # What could go wrong? # In[155]: import functools @functools.total_ordering class Cyborg: def __init__(self, name, weapon): self.name = name self.weapon = weapon def __repr__(self): return "Cyborg({}, {})".format(self.name, self.weapon) def __eq__(self, other): return self.name == other.name and self.weapon == other.weapon def __lt__(self, other): if self.name < other.name: return True return self.weapon < other.weapon # In[155]: class Cyborg: def __init__(self, name, weapon): self.name = name self.weapon = weapon def __repr__(self): return "Cyborg({}, {})".format(self.name, self.weapon) def __eq__(self, other): return self.name == other.name and self.weapon == other.weapon # Our hash function combines both attributes in a tuple. Both are strings, so immutable, so hashable. # In[156]: @add_to(Cyborg) def __hash__(self): return hash((self.name, self.weapon)) # We really shouldn't be doing this... # In[157]: robot1 = Cyborg('T-800', 'Franchi SPAS-12') robot2 = Cyborg('T-800', 'MP5 Submachine gun') # In[158]: robot1 == robot2 # In[159]: robot1 >= robot2 # In[160]: hash(robot1) # In[161]: hash(robot2) # See the problem here? # # Our class is mutable! # # This will break! # In[162]: robot = Cyborg('T-800', 'Franchi SPAS-12') hash(robot) # In[163]: target = dict() target[robot] = "John Connor" target[robot] # So far this works, but... # In[164]: robot.weapon = 'M79 grenade launcher' hash(robot) # In[165]: target[robot] # We broke it! # That's what the docs were explaining when they told us \[emphasis added\]: # # > [...] If a class defines mutable objects and implements an [\_\_eq\_\_()](https://docs.python.org/3/reference/datamodel.html#object.__eq__) method, it should not implement [\_\_hash\_\_()](https://docs.python.org/3/reference/datamodel.html#object.__hash__), since the implementation of hashable collections requires that a key’s hash value is immutable (**if the object’s hash value changes, it will be in the wrong hash bucket**). # # Dictionaries store _references_ # # A final comment. The dictionary (and thus also sets) store _a reference_ to our object. # In[166]: robot = Cyborg('T-1000', 'Remington 870') id(robot) # In[167]: terminators = set() terminators.add(robot) # Let's retrieve the element from the set [without removing it](https://stackoverflow.com/q/59825): # In[168]: id(next(iter(terminators))) # It's not a copy — it's the same object. # - In other words: mutating our object (which should have never been hashable)... # - ... also modifies the copy stored in the hash hable (dictionary / set!). # In[169]: robot.weapon = 'RSB-80 Plasma Gun' robot # In[170]: terminators # This implies that if our hashable mutated object landed (by chance) in the same slot as it originally did, we would still be able to find it. # In[171]: robot = Cyborg('T-1000', 'Remington 870') terminators = set() terminators.add(robot) original_hash = hash(robot) print("Original hash:", original_hash) # Let's mutate the object but make sure it goes to the same slot: # In[172]: robot.weapon = 'RSB-80 Plasma Gun' Cyborg.__hash__ = lambda x: original_hash print("Mutated hash: ", hash(robot)) # In[173]: robot in terminators # We can find it! It went to the same bucket. # # Hashable mutable objects will break _most_ of the time, but not always. # # Doing it right # # Experiments aside, let's do the right thing and make our mutable class non-hashable. # In[174]: class Cyborg: def __init__(self, name, weapon): self.name = name self.weapon = weapon __hash__ = None robot = Cyborg('T-1000', 'RSB-80 Plasma Gun') terminators = set() terminators.add(robot) # # Wrapping up this section # # We revisited and understood the intimidating paragraph from the docs. # # - If your class doesn't define `__eq__()`, don't define `__hash__()` either. # - Mutable classes should not be hashable. # * To make it non-hashable, set `__hash__ = None`. # - If your class defines `__eq__()`, implement `__hash__()` for it to be hashable. # - Abstract base classes simplify testing whether an interface is satisfied. # - Dictionaries store references, not copies of the data. # # Wrapping up this section # # We revisited and understood the intimidating paragraph from the docs. # # - If your class doesn't define `__eq__()`, don't define `__hash__()` either. # - Mutable classes should not be hashable. # * To make it non-hashable, set `__hash__ = None`. # - If your class defines `__eq__()`, implement `__hash__()` for it to be hashable. # #
# #
# # [Photo](https://www.pexels.com/photo/landscape-photography-of-desert-33147/) by [Pixabay](https://pixabay.com/en/desert-africa-bedouin-footprints-1007157/) / [CC0](https://www.pexels.com/creative-commons-images/) # # Bonus: Trollface # # The whole point of using hash tables is to get $\mathcal{O}(1)$ operations. # # Say we use a dictionary to map every number to its square: # In[176]: {x: x ** 2 for x in range(10_000)} # Lookups are $\mathcal{O}(1)$. Proof: # In[181]: get_ipython().run_cell_magic('timeit', 'squares = {x: x ** 2 for x in range(1_000)}', '999 in squares\n') # In[183]: get_ipython().run_cell_magic('timeit', 'squares = {x: x ** 2 for x in range(10_000)}', '9_999 in squares\n') # Size of dictionary is $10x$, but it takes the same amount of time. This is $\mathcal{O}(1)$. # Let's subclass `int` and make it always return the same hash # In[184]: class TrollInt: def __init__(self, value): self.value = value def __eq__(self, other): return self.value == other.value def __hash__(self): return 1 # This means that they'll all go the same slot. # # Python will need to resolve (endless) collisions # In[185]: get_ipython().run_cell_magic('timeit', 'squares = {TrollInt(x): TrollInt(x ** 2) for x in range(1_000)}', 'squares[TrollInt(999)] \n') # In[186]: get_ipython().run_cell_magic('timeit', 'squares = {TrollInt(x): TrollInt(x ** 2) for x in range(10_000)}', 'squares[TrollInt(9_999)]\n') # We've turned our $\mathcal{O}(1)$ operations into $\mathcal{O}(n)$. # # Problem?