#!/usr/bin/env python
# coding: utf-8
#
#
#
#
# [Photo](https://www.pexels.com/photo/landscape-photography-of-desert-33147/) by [Pixabay](https://pixabay.com/en/desert-africa-bedouin-footprints-1007157/) / [CC0](https://www.pexels.com/creative-commons-images/)
# # Motivation
#
# This paragraph [in the Python docs](https://docs.python.org/3/reference/datamodel.html#object.__hash__):
#
# > If a class does not define an [\_\_eq\_\_()](https://docs.python.org/3/reference/datamodel.html#object.__eq__) method it should not define a [\_\_hash\_\_()](https://docs.python.org/3/reference/datamodel.html#object.__hash__) operation either; if it defines [\_\_eq\_\_()](https://docs.python.org/3/reference/datamodel.html#object.__eq__) but not [\_\_hash\_\_()](https://docs.python.org/3/reference/datamodel.html#object.__hash__), its instances will not be usable as items in hashable collections. If a class defines mutable objects and implements an [\_\_eq\_\_()](https://docs.python.org/3/reference/datamodel.html#object.__eq__) method, it should not implement [\_\_hash\_\_()](https://docs.python.org/3/reference/datamodel.html#object.__hash__), since the implementation of hashable collections requires that a key’s hash value is immutable (if the object’s hash value changes, it will be in the wrong hash bucket).
#
# ..._why_?
#
# This talk is a foray into the Python data model so that we understand the reasoning behind this advice. And, above everything else, just an excuse to talk about [hash tables](https://en.wikipedia.org/wiki/Hash_table).
# # Outline
#
# 1. Object _equality_ versus object _identity_.
# 2. Rich comparison operators for our classes.
# 3. Python dictionaries and sets are _hash tables_.
# 4. Fundamentals of hash tables, $\mathcal{O}(1)$ for the win.
# 5. The properties of a good hash function.
# 6. Implementing our own, full-fledged hash table.
# 7. Revising the initial paragraph — we're now ready.
# 8. Understanding —and remembering— why.
#
# _On your marks, get set, go!_
# # Outline
#
# 1. Python dictionaries and sets are _hash tables_.
# 2. Fundamentals of hash tables, $\mathcal{O}(1)$ for the win.
# 3. The properties of a good hash function.
# 4. Implementing our own, full-fledged hash table.
# 5. Revising the initial paragraph — we're now ready.
# 6. Understanding —and remembering— why.
#
# _On your marks, get set, go!_
# # Object _equality_ versus object _identity_
# # `==` tests for _equality_
#
# The `==` operator is the object **value comparison** operator.
#
# It tests whether two objects have the same _value_.
# In[2]:
x = [2, 3]
y = [2, 3]
x == y
# Both variables have the same value, so the result of the comparison is `True`.
# ## `__eq__()`
#
# In the [Python data model](https://docs.python.org/3/reference/datamodel.html), equality is implemented via [\_\_eq\_\_()](https://docs.python.org/3/reference/datamodel.html#object.__eq__) and its counterpart, [\_\_ne\_\_()](https://docs.python.org/3/reference/datamodel.html#object.__ne__).
# # `is` tests for _identity_
#
# The `is` operator, on the other hand, is the object **identity comparison** operator.
#
# It tests whether two objects are the _same_.
# In[3]:
x = [2, 3]
y = [2, 3]
x is y
# Although `x` and `y` have the same value, they are different objects, so the comparison with is returns `False`.
# ## `id()`
#
# Identity is checked by comparing the built-in function [`id()`](https://docs.python.org/3/library/functions.html#id). It returns the "identity" of an object: an integer guaranteed to be unique and constant for each object during its lifetime.
# In[4]:
x = []
id(x)
# In[5]:
y = []
id(y)
# In[6]:
x is y
# In [CPython](https://github.com/python/cpython) this number is in fact the address of the object in memory.
# # However...
#
# Also known as "Trivia Facts, Part I".
#
#
# ## Integers
#
#
# What we just said doesn't work as we might expect with integers.
# In[7]:
x = 23
id(x)
# In[8]:
y = 23
id(x)
# In[9]:
x is y
# What? But... they're different objects... _right_?
# And to make things even worse:
# In[10]:
a = 581
id(a)
# In[11]:
b = 581
id(b)
# In[12]:
a is b
# The difference is that the current implementation (of, again, CPython) keeps an array of integer objects for **all integers between −5 and 256**. When we create an integer in that range what we are actually getting is just a reference back to that existing object.
#
# In other words: in CPython, integers between −5 and 256 are [singletons](https://en.wikipedia.org/wiki/Singleton_pattern).
#
# This is an _implementation detail_. For example, until February 2006 this happened only for values [−5, 99]. It's undocumented for a reason. Never rely on this!
# ## Strings
#
# The same behavior may be exhibited by small strings literals.
# In[13]:
x = "dog"
id(x)
# In[14]:
y = "dog"
id(y)
# In[15]:
x is y
# These strings are _interned_.
# # String interning
#
# From [Wikipedia](https://en.wikipedia.org/wiki/String_interning) \[emphasis added\]:
#
# > In computer science, string interning is a method of **storing only one copy of each distinct string value**, which must be immutable. Interning strings makes some string processing tasks more time- or space-efficient at the cost of requiring more time when the string is created or interned.
# In[16]:
x = "".join(["d", "o", "g"])
y = "dog"
x is y
# Why the difference? Note that we said earlier that this behavior is exhibited for _literals_. Here, `x` is an expression. Because Python does not know its value until runtime, `x` is not interned and therefore it's a different object than `y`.
# We can force a string to be interned, however, using [`sys.intern()`](https://docs.python.org/3.2/library/sys.html#sys.intern).
# In[17]:
import sys
x = sys.intern("".join(["a", "b", "c"]))
y = "abc"
x is y
# # How is this useful?
#
# From the [Python docs](https://docs.python.org/3/library/sys.html#sys.intern):
#
# > Interning strings is useful to gain a little performance on dictionary lookup – if the keys in a dictionary are interned, and the lookup key is interned, the key comparisons (after hashing) can be done by a pointer compare instead of a string compare.
# # A Python class of our own
#
# Let's move beyond built-in types and define our own class, `Circle`.
# In[18]:
class Circle:
def __init__(self, x, y, radius):
self.x = x
self.y = y
self.radius = radius
def __repr__(self):
return "Circle(x={}, y={}, radius={})".format(
self.x, self.y, self.radius)
c = Circle(2, 3, 5)
print(c)
# To avoid hard-coding the name of the class, we can use instead:
# In[19]:
class Circle:
def __init__(self, x, y, radius):
self.x = x
self.y = y
self.radius = radius
def __repr__(self):
return "{}(x={}, y={}, radius={})".format(
type(self).__name__, # note this
self.x,
self.y,
self.radius)
# # However...
#
# The comparison between the instances of our class is broken!
# In[20]:
c1 = Circle(2, 3, 5)
c2 = Circle(2, 3, 5)
c1 == c2
# We didn't expect this — the _values_ are the same, so they should compare equal.
# In[21]:
print(c1.x == c2.x)
print(c1.y == c2.y)
print(c1.radius == c2.radius)
# # We didn't define `__eq__()`
#
# From the [Python docs](https://docs.python.org/3/reference/datamodel.html#object.__hash__):
#
# > User-defined classes have \[a\] [`__eq__()`](https://docs.python.org/3/reference/datamodel.html#object.__eq__) [method] by default; with \[it\], all objects compare unequal (except with themselves) \[...\]
#
# In other words, if we don't define [`__eq__()`](https://docs.python.org/3/reference/datamodel.html#object.__eq__) by default it uses `is`.
#
# And `c1` and `c2` are different objects.
# In[22]:
c1 is c2
# # Rich comparison methods
#
#
# For our classes to be fully comparable, they must implement the "_rich_" comparison methods.
#
# Each comparison operation calls one of these methods. The correspondences are:
#
# - [`__lt__()`](https://docs.python.org/3/reference/datamodel.html#object.__lt__) $\rightarrow$ `x < y`
# - [`__le__()`](https://docs.python.org/3/reference/datamodel.html#object.__le__) $\rightarrow$ `x <= y`
# - [`__eq__()`](https://docs.python.org/3/reference/datamodel.html#object.__eq__) $\rightarrow$ `x == y`
# - [`__ne__()`](https://docs.python.org/3/reference/datamodel.html#object.__ne__) $\rightarrow$ `x != y`
# - [`__gt__()`](https://docs.python.org/3/reference/datamodel.html#object.__gt__) $\rightarrow$ `x > y`
# - [`__ge__()`](https://docs.python.org/3/reference/datamodel.html#object.__ge__) $\rightarrow$ `x >= y`
# For example, so that we can use `==` with our class:
# In[23]:
class Circle:
def __init__(self, x, y, radius):
self.x = x
self.y = y
self.radius = radius
def __eq__(self, other):
return self.x == other.x and \
self.y == other.y and \
self.radius == other.radius
c1 = Circle(2, 3, 5)
c2 = Circle(2, 3, 5)
c1 == c2
# # `bool()`
#
# We _could_ return anything, as Python will call `bool()` to determine the truth value.
# In[188]:
class Circle:
def __init__(self, x, y, radius):
self.x = x
self.y = y
self.radius = radius
def __eq__(self, other):
if self.x == other.x and \
self.y == other.y and \
self.radius == other.radius:
return "Yes, they are equal!"
else:
return "Nope, they're different"
c1 = Circle(2, 3, 5)
c2 = Circle(2, 3, 5)
c1 == c2
# As the condition of an `if` statement:
# In[25]:
if c1 == c2:
print("I already told you, they're equal!")
# But the truth value of _any_ non-empty string is `True`:
# In[26]:
c1 = Circle(1, 3, 9)
c2 = Circle(2, 5, 7)
if c1 == c2: # bool("Nope, they're different")
print("These circles are equal too!")
# Dont' do this. Return `True` or `False`.
# # `NotImplemented`
#
# - If the comparison is not defined, we return the singleton [`NotImplemented`](https://docs.python.org/3/library/constants.html#NotImplemented).
# - It's a singleton, _not_ an exception!
# - This makes Python understand and complain that the two objects cannot be compared.
# In[27]:
c1 = Circle(2, 3, 5)
c2 = Circle(2, 3, 5)
c1 < c2
# The `<` operator calls [`__lt__()`](https://docs.python.org/3/reference/datamodel.html#object.__lt__), but we haven't defined it. Python returns [`NotImplemented`](https://docs.python.org/3/library/constants.html#NotImplemented), which in turn raises [TypeError](https://docs.python.org/3/library/exceptions.html#TypeError).
#
# Reminder: we *exclusively* have a [`__eq__()`](https://docs.python.org/3/reference/datamodel.html#object.__eq__) by default, comparing equal only to ourselves.
# # Returning `NotImplemented` ourselves
#
# If we want to make sure that the object is only compared to another object of the same type:
# In[28]:
class Circle:
def __init__(self, x, y, radius):
self.x = x
self.y = y
self.radius = radius
def __eq__(self, other):
if type(other) != type(self):
return NotImplemented
return self.x == other.x and \
self.y == other.y and \
self.radius == other.radius
c1 = Circle(2, 3, 5)
c1 == 5
# To account and allow for subclasses:
# In[29]:
class Circle:
def __init__(self, x, y, radius):
self.x = x
self.y = y
self.radius = radius
def __eq__(self, other):
if not isinstance(other, type(self)): # instead of type() == type()
return NotImplemented
return self.x == other.x and \
self.y == other.y and \
self.radius == other.radius
# # Reflections
#
# ## `__lt__()` and `__gt__()`
#
# Let's say we want to sort our `Circle` object by increasing area.
# In[30]:
class Circle:
def __init__(self, x, y, radius):
self.x = x
self.y = y
self.radius = radius
def __lt__(self, other):
return self.radius < other.radius
# It works as we would expect:
# In[31]:
c1 = Circle(2, 5, 13)
c2 = Circle(3, 7, 11)
c1 < c2
# However...
# In[32]:
c1 > c2
# How?! We didn't define [`__gt__()`](https://docs.python.org/3/reference/datamodel.html#object.__gt__), but Python is not complaining.
# From the [Python docs](https://docs.python.org/3/reference/datamodel.html#object.__gt__):
#
# > There are no swapped-argument versions of these methods (to be used when the left argument does not support the operation but the right argument does); rather, [`__lt__()`](https://docs.python.org/3/reference/datamodel.html#object.__lt__) and [`__gt__()`](https://docs.python.org/3/reference/datamodel.html#object.__gt__) are each other’s reflection [...]
# This is what happened here: `Circle` doesn't define [`__gt__()`](https://docs.python.org/3/reference/datamodel.html#object.__gt__), so...
# In[33]:
c1.__gt__(c2)
# ... was translated into:
# In[34]:
c2.__lt__(c1)
# Technically speaking, we fallback to the right operand’s reflected method.
# Let's verify it via the good ol' [printf debugging](https://en.wikipedia.org/wiki/Debugging#Techniques) technique.
# In[35]:
class Circle:
def __init__(self, x, y, radius):
self.x = x
self.y = y
self.radius = radius
def __repr__(self):
return "{}(x={}, y={}, radius={})".format(
type(self).__name__, self.x, self.y, self.radius)
def __lt__(self, other):
print("called {} < {}".format(self, other)) # we added this
return self.radius < other.radius
c1 = Circle(2, 5, 13)
c2 = Circle(3, 7, 11)
c1 > c2
# So, indeed, Python resorted to the reflection of the method we have.
# ## `__le__()` and `__ge__()`
#
# The same happens for [`__le__()`](https://docs.python.org/3/reference/datamodel.html#object.__le__) and [`__ge__()`](https://docs.python.org/3/reference/datamodel.html#object.__ge__).
# In[36]:
class Circle:
def __init__(self, x, y, radius):
self.x = x
self.y = y
self.radius = radius
def __repr__(self):
return "{}(x={}, y={}, radius={})".format(
type(self).__name__, self.x, self.y, self.radius)
def __ge__(self, other):
print("called {} >= {}".format(self, other)) # more printf debugging
return self.radius >= other.radius
c1 = Circle(2, 5, 11)
c2 = Circle(3, 7, 13)
c1 <= c2
# ## `__eq__()` and `__ne__()`
#
# Finally, [`__eq__()`](https://docs.python.org/3/reference/datamodel.html#object.__eq__) and [`__ne__()`](https://docs.python.org/3/reference/datamodel.html#object.__ne__) are *their own reflection*.
#
# This means that if `x.__eq__(y)` is not defined, we try to use `y.__eq__(x)`.
#
# Don't see why it's useful? You're not alone.
# Let's go back to an earlier example.
# In[37]:
class Circle:
"""A regular circle."""
def __init__(self, x, y, radius):
self.x = x
self.y = y
self.radius = radius
def __repr__(self):
return "{}(x={}, y={}, radius={})".format(
type(self).__name__, self.x, self.y, self.radius)
def __eq__(self, other):
print("calling {} == {}".format(self, other))
if not isinstance(other, type(self)): # we allow for subclasses
return NotImplemented
return self.x == other.x and \
self.y == other.y and \
self.radius == other.radius
class PremiumCircle(Circle):
"A luxury Circle, with high quality values."
pass # it's the exact same thing, of course
# Focus your attention on `__eq__()`:
# In[38]:
def __eq__(self, other):
if not isinstance(other, type(self)): # we allow for subclasses
return NotImplemented
return self.x == other.x and \
self.y == other.y and \
self.radius == other.radius
# ```python
# if not isinstance(other, type(self)): # we allow for subclasses
# return NotImplemented
# ```
# This means that `__eq__()` would only work if the subclass is the _right_ side of `==` (and, therefore, binds to `other`), because...
# In[40]:
issubclass(PremiumCircle, Circle)
# ... but...
# In[41]:
issubclass(Circle, PremiumCircle)
# ... so `__eq__()` will return `NotImplemented`.
# However...
# In[42]:
c1 = PremiumCircle(4, 5, 8)
c2 = Circle(2, 3, 5)
c1 == c2 # subclass on the *left* side!
# Works! This is, again, because of reflection. Note how we called `__eq__()` twice.
#
# `PremiumCircle.__eq__(Circle)` returns `NotImplemented`, because `PremiumCircle` is not a subclass of `Circle`. Python uses reflection to try the opposite: `Circle.__eq__(PremiumCircle)`: this comparison can take place because `PremiumCircle` *is* a subclass of Circle.
# The reflection of [`__eq__()`](https://docs.python.org/3/reference/datamodel.html#object.__eq__) and [`__ne__()`](https://docs.python.org/3/reference/datamodel.html#object.__ne__) with themseves allows us not having to worry about whether `self` is a subclass of `other` or the other way around.
#
# From the [docs of NotImplemented](https://docs.python.org/3/library/constants.html#NotImplemented) \[emphasis added\]:
#
# > When a binary (or in-place) method returns `NotImplemented` **the interpreter will try the reflected operation on the other type** (or some other fallback, depending on the operator). If all attempts return `NotImplemented`, the interpreter will raise an appropriate exception [...]
# # Implied relationships
#
# By default, [`__ne__()`](https://docs.python.org/3/reference/datamodel.html#object.__ne__) delegates to [`__eq__()`](https://docs.python.org/3/reference/datamodel.html#object.__eq__) and inverts the result unless it is `NotImplemented`.
#
# This means that defining [`__eq__()`](https://docs.python.org/3/reference/datamodel.html#object.__eq__) gives us access to `!=` for free.
# In[43]:
class Circle:
def __init__(self, x, y, radius):
self.x = x
self.y = y
self.radius = radius
def __eq__(self, other):
return self.x == other.x and \
self.y == other.y and \
self.radius == other.radius
c1 = Circle(2, 5, 11)
c2 = Circle(3, 7, 13)
c1 != c2
# ## And nothing else
#
# There are **no other** implied relationships among the comparison operators.
#
# Python cannot do smart things like figuring out that if `x >= y` and `x != y`, then if follows that `x > y`.
#
# Humans understand that if (a) `x` is greater than or equal to `y` and (b) they're not equal, that means that `x` must be greater than `y`. Python cannot.
# In[44]:
class Circle:
def __init__(self, x, y, radius):
self.x = x
self.y = y
self.radius = radius
def __eq__(self, other):
return self.x == other.x and \
self.y == other.y and \
self.radius == other.radius
def __ge__(self, other):
return self.radius >= other.radius
c1 = Circle(2, 5, 11)
c2 = Circle(3, 7, 13)
c1 > c2
# # So, all of them?
#
# So does this mean the in practice we need to define _all_ the comparison methods?
# In[45]:
class Circle:
def __init__(self, x, y, radius):
self.x, self.y, self.radius = x, y, radius
def __eq__(self, other):
return self.x == other.x and \
self.y == other.y and \
self.radius == other.radius
def __ne__(self, other):
return not self == other
def __ge__(self, other):
return self.radius >= other.radius
def __le__(self, other):
return self.radius <= other.radius
def __gt__(self, other):
return self.radius > other.radius
def __lt__(self, other):
return self.radius < other.radius
# ## Not _all_, because:
#
# - `!=` returns the opposite of `==`, so we don't have to implement [`__ne__()`](https://docs.python.org/3/reference/datamodel.html#object.__ne__).
# - For [`__le__()`](https://docs.python.org/3/reference/datamodel.html#object.__le__) we can use the reflection of [`__gt__()`](https://docs.python.org/3/reference/datamodel.html#object.__gt__) (or vice versa).
# - For [`__gt__()`](https://docs.python.org/3/reference/datamodel.html#object.__gt__) we can use the reflection of [`__lt__()`](https://docs.python.org/3/reference/datamodel.html#object.__ne__) (or vice versa).
# So we can leave it in:
# In[46]:
class Circle:
def __init__(self, x, y, radius):
self.x = x
self.y = y
self.radius = radius
def __eq__(self, other):
return self.x == other.x and \
self.y == other.y and \
self.radius == other.radius
def __ge__(self, other):
return self.radius >= other.radius
def __gt__(self, other):
return self.radius > other.radius
# # `functools.total_ordering()`
#
# We can forget all these rules and let [`functools.total_ordering()`](https://docs.python.org/3/library/functools.html#functools.total_ordering) do all the work.
#
# It just needs [`__eq__()`](https://docs.python.org/3/reference/datamodel.html#object.__eq__) and one more (any) comparison operator to derive the rest for us _automatically_.
# In[47]:
import functools
@functools.total_ordering
class Circle:
def __init__(self, x, y, radius):
self.x = x
self.y = y
self.radius = radius
def __eq__(self, other):
return self.x == other.x and \
self.y == other.y and \
self.radius == other.radius
def __ge__(self, other):
return self.radius >= other.radius
# The name [total ordering](https://en.wikipedia.org/wiki/Total_order) comes from mathematics: "binary relation on some set X, which is antisymmetric, transitive, and a connex relation". In human-being-speak we can say that a total order is that of a set of elements in which all the possible pairs are comparable.
#
# The downside, from [the Python docs](https://docs.python.org/3/library/functools.html#functools.total_ordering) \[emphasis added\]:
#
# > [...] it does come at the **cost of slower execution and more complex stack traces** for the derived comparison methods. If performance benchmarking indicates this is a bottleneck for a given application, implementing all six rich comparison methods instead is likely to provide an easy speed boost.
# # Prehistory's `__cmp__()`
#
# Before Python 3, if the rich comparison methods were not implemented, Python falled back to a different magic method: [`__cmp__()`](https://docs.python.org/2/reference/datamodel.html#object.__cmp__).
#
# `object.__cmp__(self, other)` would return:
#
# - A negative integer if `self < other`.
# - Zero if `self == other`.
# - a positive integer if `self > other`.
#
# Thus, our early ancestors could choose between implementing [`__cmp__()`](https://docs.python.org/2/reference/datamodel.html#object.__cmp__) or the rich comparison methods — named like this to differentiate them from the more primitive alternative.
#
# In Python 3 [`__cmp__()`](https://docs.python.org/2/reference/datamodel.html#object.__cmp__) no longer exists.
# # Wrapping up this section
#
# We now know how to create well behaved totally ordered classes.
#
# - `==` is for equality...
# - ... and `is` for identity.
# - The default value of [`__eq__()`](https://docs.python.org/3/reference/datamodel.html#object.__eq__).
# - Rich comparison operators:
# * Return `NotImplemented` if comparison is undefined.
# * Fallback to the right operand's reflected method.
# * Have only one implied relationsip: [`__ne__()`](https://docs.python.org/3/reference/datamodel.html#object.__ne__) $\leftrightarrow$ [`__eq__()`](https://docs.python.org/3/reference/datamodel.html#object.__eq__)
# - [`functools.total_ordering()`](https://docs.python.org/3/library/functools.html#functools.total_ordering) simplifies our life.
# - [`__cmp__()`](https://docs.python.org/2/reference/datamodel.html#object.__cmp__) exists only in history books.
#
#
#
#
# [Photo](https://www.pexels.com/photo/adventure-arid-barren-dawn-206724/) by [Pixabay](https://pixabay.com/en/desert-morocco-sand-dune-dry-1748462/) | [CC0](https://www.pexels.com/creative-commons-images/)
# # All right, I guess — but do we _really_ care?
#
# We do care if we want to use our classes in a Python dictionary or a set.
# # Mandatory disclaimer
#
# Everything to be known about dictionaries was said by [Brandon Rhodes](https://rhodesmill.org/brandon/) back in 2010.
#
#
#
# The Mighty Dictionary (Brandon Rhodes, PyCon 2010).
#
# # One-minute overview
#
# If we work with a a Python list, we access elements by their index number.
# In[48]:
numbers = [2, 3, 5, 7, 11]
# In[49]:
numbers[0]
# In[50]:
numbers[2]
# In[51]:
numbers[-1]
# A [bijective function](https://en.wikipedia.org/wiki/Bijection) maps each index element to the corresponding value.
# In[52]:
numbers = [2, 3, 5, 7, 11]
# For example, index two (i.e. third element in the array) maps to value 5.
#
# However, we cannot easily go the other way around. Given, say, the value 7:
# - Is it in the array?
# - What's its index?
#
# The answer to these questions is $\mathcal{O}(n)$.
# In[53]:
7 in numbers # O(n)
# Same for:
# In[54]:
numbers.index(7) # O(n)
# Python is doing the equivalent of:
# In[55]:
def index(iterable, wanted):
"""Return index of 'wanted' in 'iterable'; -1 if not found."""
for index, element in enumerate(iterable):
if element == wanted:
return index
return -1
index(numbers, 7)
# To make it explicit: we have to visit _all_ the elements in the iterable, so $\mathcal{O}(n)$.
# # Dictionaries
#
# - Python [dictionaries](https://docs.python.org/3/tutorial/datastructures.html#dictionaries) are a [mapping type](https://docs.python.org/3/library/stdtypes.html#typesmapping).
# - Instead of indexing elements by a range of numbers, we use *keys*.
# - Keys can be any **immutable** (more of that later) type, e.g. numbers or strings.
# - They map each key to the corresponding value.
# In[56]:
numbers = {} # now a dict
numbers[1] = "one"
numbers[3] = "three"
numbers[7] = "seven"
# In[57]:
numbers[3]
# In[58]:
5 in numbers
# In[59]:
del numbers[1]
# These operations, unlike for the list, are $\mathcal{O}(1)$. How? Because dictionaries are Python's name for...
# # Hash tables
#
# - The most important data structure known to Humankind, [as Steve Yegge put it](https://steve-yegge.blogspot.com/2008/03/get-that-job-at-google.html).
# - Formally known as [unordered associative array](https://en.wikipedia.org/wiki/Hash_table), invented in 1953.
# - Insertion, deletion and lookups have an amortized cost of $\mathcal{O}(1)$ (!)
# - Speed, speed, speed. Instant lookup of a value, regardless of how many we have!
#
# How do hash tables achieve this?
# # A dictionary is really a list
#
# This is the first plot twist.
#
# A hash table is nothing else than a list (or _array_, in non-Python-specific terminology).
#
# Let's create our hash table:
# In[60]:
table = [None, None, None, None, None]
# Let's visualize it vertically:
# In[61]:
table = [
None, # index = 0
None, # index = 1
None, # index = 2
None, # index = 3
None, # index = 4
]
# - The difference is that the **index** of each element depends on its **value**.
# - Given a value, we can compute its index, and directly go there in $\mathcal{O}(1)$.
# - In this manner, we know exactly where to look up the element — no need to loop over all of them.
#
# How do we compute the indexes?
# # Hash functions
#
# From [Wikipedia](https://en.wikipedia.org/wiki/Hash_function):
#
# > A hash function is any function that can be used to map data of arbitrary size to data of a fixed size.
#
# - It's a mathematical function that maps data to some smaller value.
# - We move from the [domain](https://en.wikipedia.org/wiki/Domain_of_a_function) of the hash function (the set of possible input values)...
# - ... to a _smaller_ domain: the number of outputs of the hash function.
# - The hash function _compresses_ our data to a fixed size.
# # A simple hash function
# In[62]:
def simple_hash(str_):
return len(str_)
# In[63]:
simple_hash("dog")
# In[64]:
simple_hash("tiger")
# In[65]:
simple_hash("Tyrannosaurus Rex")
# We take data (strings, in this case) and return a numeric value.
# ## Using our hash function to compute indexes
# In[66]:
table = [
None, # index = 0
None, # index = 1
None, # index = 2
None, # index = 3
None, # index = 4
]
# Let's get the hash of `"dog"`:
# In[67]:
word = 'dog'
word_hash = simple_hash(word)
print("Hash of", word, "is", word_hash)
# Use the hash as the index to store the value in our table:
# In[68]:
index = word_hash
table[index] = word # O(1)
table
# Let's now look up a value. Is `"bear"` in the hash table?
# In[69]:
word = 'bear'
index = simple_hash(word)
table[index] != None
# # First problem: out of range
#
# Our function can return _any_ value, exceeding the size of the hash table.
# In[70]:
word = 'monkey'
index = simple_hash(word)
table[index] = word
# ## Solution: modulo operator
#
# - That is, we take the remainder after dividing by the length of the list.
# - [Modular arithmetic](https://en.wikipedia.org/wiki/Modular_arithmetic), the same way we all know that 70 minutes is one hour _plus_ 10 minutes.
#
# So the operation becomes:
# In[71]:
word = 'monkey'
index = simple_hash(word) % len(table)
print('{} % {} == {}'.format(simple_hash(word), len(table), index))
# In[72]:
table[index] = word
table
# Is `"monkey"` in the hash table?
# In[73]:
word = 'monkey'
index = simple_hash(word) % len(table)
table[index] != None
# # Second problem: collisions!
#
# Because we transform the input values into a _smaller_ domain, sooner or later we'll come across two _different_ keys that map to the _same_ hash value.
#
# This is known as a **collision**.
# Right now, our hash table looks like this:
# In[74]:
table = [
None, # index = 0
'monkey', # index = 1
None, # index = 2
'dog', # index = 3
None, # index = 4
]
# A [wild new](https://knowyourmeme.com/memes/a-wild-x-appears-wild-x-appeared) word appears:
# In[75]:
word = 'cat'
index = simple_hash(word) % len(table)
table[index] = word
table
# In[76]:
table
# - `"cat"` and `"dog"` have the same length.
# - Since our hash function is the length, we ended up in the same position.
# - We override our value! `"dog"` is gone.
#
# This is a collision. They _will_ happen, so our hash table needs to take them into account.
#
# More on this later.
# # More Trivia: PHP
#
# [PHP functions originally bucketed by strlen, were renamed to balance length](https://news.ycombinator.com/item?id=6919216):
#
# > Well, there were other factors in play there. htmlspecialchars was a
# very early function. Back when PHP had less than 100 functions and the
# function hashing mechanism was strlen(). **In order to get a nice hash
# distribution of function names across the various function name lengths
# names were picked specifically to make them fit into a specific length
# bucket**. This was circa late 1994 when PHP was a tool just for my own
# personal use and I wasn't too worried about not being able to remember
# the few function names. \[emphasis added\]
#
# Most programming languages use a hash table to store [symbols][1]: e.g. function name to function object or pointer.
#
# PHP was hashing by _length_, which lead to lots of collisions.
#
# [1]: https://en.wikipedia.org/wiki/Symbol_(programming)
# Solution? Rename functions to smooth the distribution of lengths.
#
# From [PHP has inconsistent function naming](https://tnx.nl/php.html#names):
#
# ```text
# underscore no underscore:
# stream_get_line readline
# disk_free_space diskfreespace
# is_object isset
# mcal_day_of_week jddayofweek
# ```
#
# We live in the darkest timeline.
# # Behave, hash function!
#
# Formally speaking, a good hash function satisfies two main properties:
#
# ## [Determinism](https://en.wikipedia.org/wiki/Hash_function#Determinism)
#
# The same input value should produce the same output value. In other words: the hash is a [mathematical function][1] of the data being hashed. We cannot depend on external variable parameters such as a pseudo-random number generator.
#
# Our function satisfied this, as the length of a string is deterministic.
#
# [1]: https://en.wikipedia.org/wiki/Function_(mathematics)
# ## [Uniformity](https://en.wikipedia.org/wiki/Hash_function#Uniformity)
#
# The hashes should be distributed as evently as possible over the output range. In other words: every hash should have the same probability of being generated.
#
# This is where our hash function (as well as PHP's) failed. Words in English...
#
# > [...] are reasonably well fitted by a shifted Poisson distribution with mean and variance equal to 6.94 and 5.80. [[Source]](https://www.sciencedirect.com/science/article/pii/0378375886901692)
#
# This would lead to _a lot_ of collisions.
# # A real-life example: MD5
#
# - A hash function producing a 128-bit hash value.
# - Designed by [Ronald Rivest](https://en.wikipedia.org/wiki/Ron_Rivest) in 1992.
# ```text
# $ echo "Let there be light" | md5sum
# b3d440d0691efe7d7d485602f134f469 -
# ```
# - 128 bits $\rightarrow$ probability of two random hashes accidentally colliding is $1\ /\ {2}^{128}$
# - Initially designed to be used as a cryptographic hash function.
# - In 2004 it was shown that MD5 is not collision-resistant [[Wikipedia](https://en.wikipedia.org/wiki/MD5#History_and_cryptanalysis)]
# - For example, given a file we can find generate a different one with the same hash.
# MD5 is still useful to verify data integrity against _unintentional_ corruption.
#
# For example, [Debian](https://www.debian.org/) gives us [these MD5 checksums](https://cdimage.debian.org/debian-cd/current/amd64/iso-cd/MD5SUMS):
#
# ```text
# f8446a84356a6bcbf79201cc9f46063f debian-9.5.0-amd64-netinst.iso
# 47a5dca818220d8558d37dfa11b85550 debian-9.5.0-amd64-xfce-CD-1.iso
# 2daa085925a556d35a0eeb27768cc892 debian-mac-9.5.0-amd64-netinst.iso
# ```
#
# Then when I download the ISO:
#
# ```text
# wget https://cdimage.debian.org/debian-cd/current/amd64/iso-cd/debian-9.5.0-amd64-netinst.iso
# md5sum debian-9.5.0-amd64-netinst.iso
# f8446a84356a6bcbf79201cc9f46063f
# ```
#
# Same hash, so _moooost_ likely no corruption.
# # Python's `hash()`
#
# Meet the built-in [`hash()`](https://docs.python.org/3/library/functions.html#hash) function:
#
# > Return the hash value of the object (if it has one). Hash values are integers. They are used to quickly compare dictionary keys during a dictionary lookup. Numeric values that compare equal have the same hash value (even if they are of different types, as is the case for 1 and 1.0).
# ## Integers
# In[77]:
hash(0)
# In[78]:
hash(5)
# In[79]:
hash(113)
# In[80]:
hash(7.3)
# # Trivia redux
# In[81]:
hash(-5)
# In[82]:
hash(-2)
# In[83]:
hash(-1) # also -2!
# Why?
# > The hash value -1 is reserved (it’s used to flag errors in the C implementation). If the hash
# algorithm generates this value, we simply use -2 instead. [[Source](http://effbot.org/zone/python-hash.htm)]
#
#
# Proof in the [CPython source code](https://github.com/python/cpython/blob/2847ccae4687cb43334d87d86fb6c11cb14218f5/Objects/longobject.c#L2986-L2988):
#
# ```c
# if (x == (Py_uhash_t)-1)
# x = (Py_uhash_t)-2;
# return (Py_hash_t)x;
# ```
# ## Strings
#
# For strings the algorithm is much more complex.
# In[84]:
hash("dog")
# In[85]:
hash("cat")
# In[86]:
hash("giraffe")
# Looking at the [CPython source code](https://github.com/python/cpython/blob/ad65f15581173542f1d2a9968a63bee272510ce3/Python/pyhash.c#L161-L183):
#
# ```c
# if (len < Py_HASH_CUTOFF) {
# /* Optimize hashing of very small strings with inline DJBX33A. */
# Py_uhash_t hash;
# const unsigned char *p = src;
# hash = 5381; /* DJBX33A starts with 5381 */
#
# switch(len) {
# /* ((hash << 5) + hash) + *p == hash * 33 + *p */
# case 7: hash = ((hash << 5) + hash) + *p++; /* fallthrough */
# case 6: hash = ((hash << 5) + hash) + *p++; /* fallthrough */
# case 5: hash = ((hash << 5) + hash) + *p++; /* fallthrough */
# case 4: hash = ((hash << 5) + hash) + *p++; /* fallthrough */
# case 3: hash = ((hash << 5) + hash) + *p++; /* fallthrough */
# case 2: hash = ((hash << 5) + hash) + *p++; /* fallthrough */
# case 1: hash = ((hash << 5) + hash) + *p++; break;
# default:
# Py_UNREACHABLE();
# }
# hash ^= len;
# hash ^= (Py_uhash_t) _Py_HashSecret.djbx33a.suffix;
# x = (Py_hash_t)hash;
# }
# ```
# ## Tuples
# In[87]:
hash((1, 2))
# In[88]:
hash(("dog", "cat"))
# In[89]:
hash(("giraffe", 113, 5.1))
# Taking a look at the [CPython source code](https://github.com/python/cpython/blob/2847ccae4687cb43334d87d86fb6c11cb14218f5/Objects/tupleobject.c#L355-L367):
# ```c
# x = 0x345678UL;
# p = v->ob_item;
# while (--len >= 0) {
# y = PyObject_Hash(*p++);
# if (y == -1)
# return -1;
# x = (x ^ y) * mult;
# /* the cast might truncate len; that doesn't change hash stability */
# mult += (Py_hash_t)(82520UL + len + len);
# }
# x += 97531UL;
# if (x == (Py_uhash_t)-1)
# x = -2;
# ```
# # `hash()` on our own classes
#
# [`hash()`](https://docs.python.org/3/library/functions.html#hash) delegates the method to the [`__hash__()`](https://docs.python.org/3/reference/datamodel.html#object.__hash__) magic method.
# In[90]:
x = 3
hash(x)
# This is equivalent to...
# In[91]:
x.__hash__()
# Let's _not_ implement it in our class...
# In[92]:
class Circle:
def __init__(self, x, y, radius):
self.x = x
self.y = y
self.radius = radius
c = Circle(5, 7, 9)
hash(c)
# It returns _something_, nevertheless!
# From the [docs](https://docs.python.org/3/reference/datamodel.html#object.__hash__):
#
# > User-defined classes have [`__eq__()`](https://docs.python.org/3/reference/datamodel.html#object.__eq__) and [`__hash__()`](https://docs.python.org/3/reference/datamodel.html#object.__hash__) methods by default; with them, all objects compare unequal (except with themselves) and `x.__hash__()` returns an appropriate value such that `x == y` implies both that `x is y` and `hash(x) == hash(y)`.
#
# So the hash will only be the same if... the object is the same.
#
# The (default) hash function is not based on the _value_ of the object, as we would expect, but on its _identity_.
# In[93]:
class Circle:
def __init__(self, x, y, radius):
self.x = x
self.y = y
self.radius = radius
c1 = Circle(5, 7, 9)
c2 = Circle(5, 7, 9)
# In[94]:
print("Hash of c1:", hash(c1))
print("Hash of c2:", hash(c2))
# Different objects, different hashes!
# # Implementing `__hash__()`
#
# To do things right, we need to implement our own [`__hash__()`](https://docs.python.org/3/reference/datamodel.html#object.__hash__) method.
#
# The [docs](https://docs.python.org/3/reference/datamodel.html#object.__hash__) recommend \[emphasis added\]:
#
# > [...] it is advised to mix together the hash values of the components of the object that also play a part in comparison of objects by **packing them into a tuple and hashing the tuple**.
# In[95]:
class Circle:
def __init__(self, x, y, radius):
self.x = x
self.y = y
self.radius = radius
def __hash__(self):
return hash((self.x, self.y, self.radius))
c = Circle(2, 3, 11)
hash(c)
# In this manner, we delegate [`__hash__()`](https://docs.python.org/3/reference/datamodel.html#object.__hash__) to the hash function to the `tuple` object, which in turn will use the hashes of the attributes (`x`, `y` and `radius`) of our class.
#
# This hashes will be, as we need, both deterministic and uniform over the output range.
# # Wrapping up this section
#
# - Dictionaries are hash tables.
# - Hash tables are _the_ data structure: $\mathcal{O}(1)$ operations.
# - Using hash functions to compute hash table indexes. Two problems:
# * Values out of range $\rightarrow$ modulo operator.
# * Collisions $\rightarrow$ will happen, sooner or later.
# - PHP's inconsistent function naming — now we know why.
# - The two properties of a good hash function:
# * Determinism.
# * Uniformity.
# - A real-life example: MD5.
# - Python's built-in `hash()`.
# - Implementing `__hash__()`.
# # Wrapping up this section
#
# - Dictionaries are hash tables.
# - Hash tables are _the_ data structure: $\mathcal{O}(1)$ operations.
# - Using hash functions to compute hash table indexes. Two problems:
# * Values out of range $\rightarrow$ modulo operator.
# * Collisions $\rightarrow$ will happen, sooner or later.
# - PHP's inconsistent function naming — now we know why.
# - The two properties of a good hash function:
# * Determinism.
# * Uniformity.
# - Python's built-in `hash()`.
# - Implementing `__hash__()`.
#
#
#
#
# [Photo](https://www.pexels.com/photo/sand-desert-sand-dunes-dune-50628/) by [Pixabay](https://pixabay.com/en/morocco-africa-desert-marroc-sand-123976/) | [CC0](https://www.pexels.com/creative-commons-images/)
# # Implementing our own hash table
#
# Let's use all we learned to implement our own table hash — this time handling collisions.
# We need to split this class over multiple [Jupyter](http://jupyter.org/) cells, so we'll use this decorator:
# In[96]:
def add_to(cls):
"""A class decorator to dynamically add methods."""
def wrapper(func):
setattr(cls, func.__name__, func)
return cls
return wrapper
# In[97]:
class HashTable:
"""Initial prototype of a hash table."""
def __init__(self, size=8):
self._slots = [None] * size
def __len__(self):
count = 0
for key in self._slots:
if key is not None:
count += 1
return count
# In[98]:
@add_to(HashTable)
def __setitem__(self, key, value):
index = hash(key) % len(self._slots)
self._slots[index] = value
@add_to(HashTable)
def __getitem__(self, key):
index = hash(key) % len(self._slots)
if self._slots[index] is None:
raise KeyError(key)
return self._slots[index]
@add_to(HashTable)
def __delitem__(self, key):
index = hash(key) % len(self._slots)
self._slots[index] = None
# Let's try it...
# In[99]:
table = HashTable()
table["cat"] = "Katze"
table["cat"]
# In[100]:
table._slots
# We can double check this is the right index:
# In[101]:
hash("cat") % 8
# Deletion also works:
# In[102]:
del table["cat"]
table["cat"]
# # A couple of improvements:
#
# 1. Keep a *counter* of values, instead of calculating them every single time.
# 2. Factor our *index calculation* to a private method of the class.
# In[103]:
class HashTable:
"""Initial prototype of a hash table, second attempt."""
def __init__(self, size=8):
self._slots = [None] * size
self._count = 0
def __len__(self):
return self._count
def _get_index(self, key):
return hash(key) % len(self._slots)
# In[104]:
@add_to(HashTable)
def __setitem__(self, key, value):
index = self._get_index(key)
self._slots[index] = value
self._count += 1
@add_to(HashTable)
def __getitem__(self, key):
index = self._get_index(key)
if self._slots[index] is None:
raise KeyError(key)
return self._slots[index]
@add_to(HashTable)
def __delitem__(self, key):
index = self._get_index(key)
if self._slots[index] is None:
raise KeyError(key)
self._slots[index] = None
self._count -= 1
# In[105]:
table = HashTable()
table["dog"] = "Hund"
table["bear"] = "Bär"
table._slots
# In[106]:
len(table)
# This is all good, but again – **we have to keep collisions in mind**.
#
# Sooner than later two elements are going to end up in the _same_ slot.
# # A simple approach: separate chaining
#
# - Each slot is independent, and keeps a list entries with the _same_ index.
# - All the keys that share index are kept one after another in the list...
# - ... that is, chained, hence the name.
# - The algorithm becomes:
# * Get the index $i$ of our element.
# * Add the element to the $i$-th list in our hash table.
# Thus, our empty hash table looks like this:
# In[107]:
table = [
[], # index = 0
[], # index = 1
[], # index = 2
[], # index = 3
[], # index = 4
]
# Say we want to map `"dog"` to its German translation, `"Hund"`.
#
# It lands in, for example, `index == 3`.
# In[108]:
table = [
[], # index = 0
[], # index = 1
[], # index = 2
['Hund'], # index = 3
[], # index = 4
]
# Then we map `"bear"` to its German translation, `"Bär"`.
#
# It lands in, say, `index == 0`.
# In[109]:
table = [
["Bär"], # index = 0
[], # index = 1
[], # index = 2
['Hund'], # index = 3
[], # index = 4
]
# Now we store `"cat"`, and assume that it _also_ lands in `index == 3`.
# In[110]:
table = [
["Bär"], # index = 0
[], # index = 1
[], # index = 2
['Hund', 'Katze'], # index = 3
[], # index = 4
]
# Do you see the problem we face now?
# # Handling collisions
#
# Somebody asks us "How do you say _cat_ in German".
#
# We use the hash of `"cat"` to compute its index, and get that `index == 3`.
# In[111]:
table = [
["Bär"], # index = 0
[], # index = 1
[], # index = 2
['Hund', 'Katze'], # index = 3
[], # index = 4
]
# Is it `"Hund"` or `"Katze"`? How can we tell them apart? We can't!
# # We also need the _key_
#
# In order to deal with collisions we have to store the *key* itself, in addition to the mapped value.
#
#
#
#
# Image by Jorge Stolfi | CC BY-SA 3.0
#
# # Now with keys!
# In[112]:
import collections
Entry = collections.namedtuple('Entry', 'key, value')
class HashTable:
"""A separate-chaining hash table."""
def __init__(self, size=8):
self._slots = [list() for _ in range(size)]
self._count = 0
def __len__(self):
return self._count
def _get_index(self, key):
return hash(key) % len(self._slots)
# In[113]:
@add_to(HashTable)
def __setitem__(self, key, value):
slot_index = self._get_index(key)
slot = self._slots[slot_index]
for entry_index, entry in enumerate(slot):
if entry.key == key: # note the equality comparison here...
slot[entry_index] = entry._replace(value=value)
break
else:
entry = Entry(key, value)
self._slots[slot_index].append(entry)
self._count += 1
# In[114]:
@add_to(HashTable)
def __getitem__(self, key):
index = self._get_index(key)
for entry in self._slots[index]:
if entry.key == key: # and here...
return entry.value
raise KeyError(key)
# In[115]:
@add_to(HashTable)
def __delitem__(self, key):
slot_index = self._get_index(key)
slot = self._slots[slot_index]
for entry_index, entry in enumerate(slot):
if entry.key == key: # ... and here
del slot[entry_index]
break
else:
raise KeyError(key)
self._count -= 1
# In[116]:
table = HashTable()
# In[117]:
table["dog"] = "perro"
table["bear"] = "Bär"
# `"perro"` is Spanish, not German! Let's replace the value.
# In[118]:
table["dog"] = "Hund"
table["dog"]
# Let's take a look at the internals:
# In[119]:
table._slots
# # Real life is _so_ much more complex
#
# - We used this collision resolution strategy here because it's simple.
# - Python uses a different one: [open addressing](https://en.wikipedia.org/wiki/Hash_table#Open_addressing) with pseudo-random probing.
# - See the superb documentation at [cpython/Objects/dictobject.c](https://github.com/python/cpython/blob/master/Objects/dictobject.c).
#
#
#
# Image by Jorge Stolfi | CC BY-SA 3.0
#
# # Still useful
#
# Python may use a different strategy, but the foundations are the same. We need:
#
# 1. The hash to find the corresponding slot (also known as _bucket_).
# 2. The element itself to compare (`==`) and tell it apart from other elements what may have collided with it.
# Our hash table is really nice! Let's add functionality to loop over the elements:
# # Now with `items()`!
# In[120]:
@add_to(HashTable)
def items(self):
for slot in self._slots:
for entry in slot:
yield entry
table = HashTable()
table["dog"] = "Hund"
table["bear"] = "Bär"
for key, value in table.items():
print(key, "->", value)
# Alternatively, in a single line:
# In[121]:
import itertools
@add_to(HashTable)
def items(self):
yield from itertools.chain.from_iterable(self._slots)
# # Resizing
#
# Our hash table is _almost_ perfect! The only downside:
#
# We only have *eight* slots! Sooner than later it's going to get full and performance start degrading.
#
# The solution is to resize, also known as _rehashing_.
#
# - Allocate a new, _larger_ underlying data structure.
# - Remove each entry from old `slots`...
# - ... and insert it into the new one.
# - After _all_ entries are removed from the old `slots`, we no longer need it.
# - From now on, use the new slots data structure.
# - Takes forever, but [amortized cost](https://en.wikipedia.org/wiki/Amortized_analysis) is $\mathcal{O}(1)$.
# # Now with dynamic resizing!
# In[122]:
import collections
import itertools
Entry = collections.namedtuple('Entry', 'key, value')
class HashTable:
"""A separate-chaining, dynamically resized hash table."""
def __init__(self, initial_size=8, load_factor=0.8):
self._slots = [list() for _ in range(initial_size)]
self._size = 0
self._load_factor = load_factor
def __len__(self):
return self._size
@staticmethod
def _get_index(key, slots):
return hash(key) % len(slots)
@classmethod
def _insert(cls, slots, key, value):
index = cls._get_index(key, slots)
slots[index] = value
# In[123]:
@add_to(HashTable)
def __setitem__(self, key, value):
slot_index = self._get_index(key, self._slots)
slot = self._slots[slot_index]
for entry_index, entry in enumerate(slot):
if entry.key == key: # note the equality comparison here...
slot[entry_index] = entry._replace(value=value)
break
else:
entry = Entry(key, value)
self._slots[slot_index].append(entry)
self._size += 1
if len(self) / len(self._slots) >= self._load_factor:
self._rehash()
# In[124]:
@add_to(HashTable)
def _rehash(self):
new_slots = [list() for _ in range(len(self._slots) * 2)]
for key, value in self.items():
HashTable._insert(new_slots, key, value)
self._slots = new_slots
# In[125]:
# The remaining methods; not worth showing on the slides, as they're basically the same ones
# we implemented earlier, but they're included here for the sake of completeness --- this is
# the last version of our hash table implementation.
@add_to(HashTable)
def __getitem__(self, key):
index = self._get_index(key, self._slots)
for entry in self._slots[index]:
if entry.key == key: # note the equality comparison here...
return entry.value
raise KeyError(key)
@add_to(HashTable)
def __delitem__(self, key):
slot_index = self._get_index(key, self._slots)
slot = self._slots[slot_index]
for entry_index, entry in enumerate(slot):
if entry.key == key: # ... and here
del slot[entry_index]
break
else:
raise KeyError(key)
self._size -= 1
@add_to(HashTable)
def items(self):
yield from itertools.chain.from_iterable(self._slots)
# # Future improvements
#
# - Shrink when we delete too many elements (i.e., it's too empty, so make it smaller).
# - Alternatives to all-at-once-rehashing:
# * [Incremental resizing](https://en.wikipedia.org/wiki/Hash_table#Incremental_resizing).
# * [Monotonic keys](https://en.wikipedia.org/wiki/Hash_table#Monotonic_keys).
# * [Linear hashing](https://en.wikipedia.org/wiki/Hash_table#Linear_hashing).
# In the words of [Andrew Cooke](https://stackoverflow.com/users/181772/andrew-cooke) on [Stack Overflow](https://stackoverflow.com/a/10130506) \[emphasis added\]:
#
# > Therefore, the hash code only hints at the location of an object, an equality test follows to test candidate keys. To implement a membership test in a hash-table set, **the hash code gives you "bucket" number in which to search for the value**. However, all set items with the same hash code are in the bucket. For this, **you also need an equality test to distinguish between all candidates in the bucket**.
#
# That's it.
# # Sets
#
# What we learned is also useful to understand [how Python sets work](https://groups.google.com/forum/#!topic/comp.lang.python/4-nf21y-M8g):
#
# - They are, in essence, a hash table with dummy values.
# - In other words: the members of the set are the keys of the hash table.
# - Plus, of course, optimizations to exploit this lack of values.
# # Wrapping up this section
#
# - We implemented our own hash table.
# * Need to store both keys and values.
# * Handled collisions resolution via separate chaining.
# * Dynamic resizing when it's too full.
# - Python dictionaries have a much more complex implementation.
# - Sets are in essence a hash table with dummy values.
# # Wrapping up this section
#
# - We implemented our own hash table.
# * Need to store both keys and values.
# * Handled collisions resolution via separate chaining.
# - Python dictionaries have a much more complex implementation.
# - Sets are in essence a hash table with dummy values.
#
#
#
#
# [Photo](https://www.pexels.com/photo/silhouette-photography-of-desert-998649/) by [Francesco Ungaro](https://www.pexels.com/@ghostpresenter) | [Pexels License](https://www.pexels.com/photo-license/)
# # We're back
#
# We're —finally— ready to revisit [the paragraph](https://docs.python.org/3/reference/datamodel.html#object.__hash__) from the beginning:
#
# > If a class does not define an [\_\_eq\_\_()](https://docs.python.org/3/reference/datamodel.html#object.__eq__) method it should not define a [\_\_hash\_\_()](https://docs.python.org/3/reference/datamodel.html#object.__hash__) operation either; if it defines [\_\_eq\_\_()](https://docs.python.org/3/reference/datamodel.html#object.__eq__) but not [\_\_hash\_\_()](https://docs.python.org/3/reference/datamodel.html#object.__hash__), its instances will not be usable as items in hashable collections. If a class defines mutable objects and implements an [\_\_eq\_\_()](https://docs.python.org/3/reference/datamodel.html#object.__eq__) method, it should not implement [\_\_hash\_\_()](https://docs.python.org/3/reference/datamodel.html#object.__hash__), since the implementation of hashable collections requires that a key’s hash value is immutable (if the object’s hash value changes, it will be in the wrong hash bucket).
#
# We now have the tools to understand —and remember— _why_.
# # Primo
#
# > If a class does not define an [\_\_eq\_\_()](https://docs.python.org/3/reference/datamodel.html#object.__eq__) method it should not define a [\_\_hash\_\_()](https://docs.python.org/3/reference/datamodel.html#object.__hash__) operation either; [...]
#
# Rephrased: defining [`__hash__()`](https://docs.python.org/3/reference/datamodel.html#object.__hash__) but _not_ [`__eq__()`](https://docs.python.org/3/reference/datamodel.html#object.__eq__) is wrong. Why?
# Let's think of a hash table look up:
#
# - Python uses [`__hash__()`](https://docs.python.org/3/reference/datamodel.html#object.__hash__) to compute the slot where the value goes.
# - This gives us the "_bucket_" in which to search for our value.
# - Now we need the equality test to distingush between all candidates in the bucket.
# - We didn't implement [`__eq__()`](https://docs.python.org/3/reference/datamodel.html#object.__eq__), so by default our objects compare unequal except with themselves.
# - Therefore, **we cannot find our entry**.
#
# See [this Stack Overflow question](https://stackoverflow.com/q/42061019) for an example of somebody who ran into this.
# ## An example that works
#
# In our implementation of a hash table, if we look up `"cat"`:
# In[126]:
table = [
[], # 0
[Entry(key='bear', value='Bär')], # 1
[], # 2
[Entry(key='dog', value='Hund'), Entry(key='cat', value='Katze')], # 3
[], # 4
]
# Python uses `str.__hash__()` and determines that `index == 3`.
# * There are two different entries there, so I need to compare for equality.
# * Python uses `str.__eq__()` to compare `"cat"` to `"dog"` $\rightarrow$ `False`.
# * Second entry: `"cat" == "cat"` $\rightarrow$ `True`. We found our entry.
# ## An example that doesn't work.
#
# If we have [`__hash__()`](https://docs.python.org/3/reference/datamodel.html#object.__hash__) but not [`__eq__()`](https://docs.python.org/3/reference/datamodel.html#object.__eq__), we'll land in the right slot but won't be able to find our value.
# In[127]:
class Cyborg:
def __init__(self, name, weapon):
self.name = name
self.weapon = weapon
def __hash__(self):
return hash((self.name, self.weapon))
robot = Cyborg('T-1000', 'Franchi SPAS-12')
cyborgs = set()
cyborgs.add(robot)
robot in cyborgs
# However..
# In[128]:
cyborgs = {Cyborg('T-1000', 'Franchi SPAS-12')}
Cyborg('T-1000', 'Franchi SPAS-12') in cyborgs # doesn't work!
# Without [`__eq__()`](https://docs.python.org/3/reference/datamodel.html#object.__eq__), these objects only compare equal to _themselves_.
# As we saw earlier for `Point`...
# In[129]:
robot1 = Cyborg('T-800', 'M79 grenade launcher')
robot2 = Cyborg('T-800', 'M79 grenade launcher')
# In[130]:
hash(robot1)
# In[131]:
hash(robot2)
# In[132]:
robot1 == robot2
# # Secondo
#
# > [...] if [a class] defines [\_\_eq\_\_()](https://docs.python.org/3/reference/datamodel.html#object.__eq__) but not [\_\_hash\_\_()](https://docs.python.org/3/reference/datamodel.html#object.__hash__), its instances will not be usable as items in hashable collections.
#
# This is —now— almost trivial: if we don't have [\_\_hash\_\_()](https://docs.python.org/3/reference/datamodel.html#object.__hash__), we cannot know in which slot the value is.
# Indeed, Python won't even let us compute their hash.
# In[3]:
class Cyborg:
def __init__(self, name, weapon):
self.name = name
self.weapon = weapon
def __eq__(self, other):
return self.name == other.name and self.weapon == other.weapon
robot = Cyborg('T-850', 'Ithaca 37')
hash(robot)
# ## But wait
#
# However, we saw earlier that by default both [`__eq__()`](https://docs.python.org/3/reference/datamodel.html#object.__eq__) and [`__hash__()`](https://docs.python.org/3/reference/datamodel.html#object.__hash__) return _something_.
#
# Indeed, if we don't define _none_ of them Python doesn't complain.
# In[4]:
class Cyborg:
def __init__(self, name, weapon):
self.name = name
self.weapon = weapon
robot = Cyborg('T-850', 'Ithaca 37')
hash(robot)
# # Making it unhashable
#
# How do we _not_ define [`__hash__()`](https://docs.python.org/3/reference/datamodel.html#object.__hash__)? By setting it to `None`.
#
# From [the docs](https://docs.python.org/3/reference/datamodel.html#object.__hash__):
#
# > When the [`__hash__()`](https://docs.python.org/3/reference/datamodel.html#object.__hash__) method of a class is `None`, instances of the class will raise an appropriate [`TypeError`](https://docs.python.org/3/library/exceptions.html#TypeError) when a program attempts to retrieve their hash value, and will also be correctly identified as unhashable when checking `isinstance(obj, collections.abc.Hashable)`.
# ## An example
# In[133]:
class Cyborg:
def __init__(self, name, weapon):
self.name = name
self.weapon = weapon
__hash__ = None
robot = Cyborg('T-850', 'Ithaca 37')
hash(robot)
# Also:
# In[134]:
best_friend = dict()
best_friend[robot] = "Sarah Connor"
# And of course:
# In[135]:
robot_army = set()
robot_army.add(robot)
# All of them fail because our class is not hashable.
# # Getting It Wrong, Part I
#
# From [the docs](https://docs.python.org/3/reference/datamodel.html#object.__hash__):
#
# > If a class that does not override [`__hash__()`](https://docs.python.org/3/reference/datamodel.html#object.__eq__) wishes to suppress hash support, it should include `__hash__ = None` in the class definition.
#
# Note that _returning_ `None` does not work.
# In[136]:
class Cyborg:
def __init__(self, name, weapon):
self.name = name
self.weapon = weapon
def __hash__(self):
return None
robot = Cyborg('T-1000', 'Franchi SPAS-12')
hash(robot)
# # Getting It Wrong, Part II
#
# The docs also [warn us](https://docs.python.org/3/reference/datamodel.html#object.__hash__):
#
# > A class which defines its own [`__hash__()`](https://docs.python.org/3/reference/datamodel.html#object.__hash__) that explicitly raises a [`TypeError`](https://docs.python.org/3/library/exceptions.html#TypeError) would be incorrectly identified as hashable by an `isinstance(obj, collections.abc.Hashable)` call.
# ## Abstract base classes
#
# [`collections.abc.Hashable`](https://docs.python.org/3/library/collections.abc.html#collections.abc.Hashable) is an [abstract base class](https://docs.python.org/3/library/abc.html). We use them
#
# >[... ] to test whether a class provides a particular interface; for example, whether it is hashable or whether it is a mapping.
# ## Example 1
#
# Is this variable iterable?
# In[137]:
word = "abc"
# We could try to loop over it and catch the error...
# In[138]:
def is_iterable(obj):
try:
for item in obj:
# can exit immediately, just want to check it doesn't fail
return True
except TypeError:
return False
is_iterable(word)
# ... or instead use:
# In[139]:
import collections.abc
isinstance(word, collections.abc.Iterable)
# ## Example 2
#
# Does this variable have a size?
# In[140]:
t = (x ** 2 for x in [4, 5, 6, 8, 9])
# We could try to call [`len()`](https://docs.python.org/3/library/functions.html#len)...
# In[141]:
def is_sized(obj):
try:
len(obj)
return True
except TypeError:
return False
is_sized(t)
# Alternatively, we could check whether the object `hasattr(obj, '__len__')`.
# ... or instead:
# In[142]:
isinstance(t, collections.abc.Sized)
# ## Example 3
#
# In the same way, to check whether a variable is hashable we could just try...
# In[143]:
word = "Katze"
def is_hashable(obj):
try:
hash(obj)
return True
except TypeError:
return False
is_hashable(word)
# ... or instead:
# In[144]:
isinstance(word, collections.abc.Hashable)
# # Don't raise `TypeError`
#
# The docs warn us that raising [`TypeError`](https://docs.python.org/3/library/exceptions.html#TypeError) ourselves will not work.
# In[145]:
class Cyborg:
def __init__(self, name, weapon):
self.name = name
self.weapon = weapon
def __hash__(self):
raise TypeError("class Cyborg is not hashable")
robot = Cyborg('T-800', 'M79 grenade launcher')
hash(robot)
# This was expected. However...
# In[146]:
isinstance(robot, collections.abc.Hashable)
# Wrong! We broke it!
#
# Use `__hash__ = None`.
# # Terzo
#
# And finally:
#
# > [...] If a class defines mutable objects and implements an [\_\_eq\_\_()](https://docs.python.org/3/reference/datamodel.html#object.__eq__) method, it should not implement [\_\_hash\_\_()](https://docs.python.org/3/reference/datamodel.html#object.__hash__), since the implementation of hashable collections requires that a key’s hash value is immutable (if the object’s hash value changes, it will be in the wrong hash bucket).
#
# This takes us to our final point: the hashability of _mutable_ objects.
# # Hashing mutable objects.
#
# Well, we can't.
#
# Tuples are hashable...
# In[147]:
hash((1, 2, 3))
# ... but lists aren't.
# In[148]:
hash([5, 7, 9])
# # I'd never try to do that!
#
# Perhaps we've never tried calling [`hash()`](https://docs.python.org/3/library/functions.html#hash) explicitly for a list, but we may have run into this:
#
# Say we want to map two points $(x, y)$, to the result of ${x}^{y}$.
# In[149]:
squares = dict()
squares[[2, 7]] = 2 ** 7
# Same error! It's unhashable.
# For this we need to use tuples...
# In[150]:
squares = dict()
squares[(2, 7)] = 2 ** 7
squares
# ... or perhaps a two-level [`defaultdict`](https://docs.python.org/3/library/collections.html#collections.defaultdict):
# In[151]:
import collections
squares = collections.defaultdict(dict)
squares[2][7] = 2 ** 7
squares
# In[152]:
squares[2][7]
# # But I _really_ want to hash a list
#
# Let's imagine a list were hashable and we used it as a key in a dictionary.
# In[153]:
x = [2, 3, 5]
# - Python would use [`__hash__()`](https://docs.python.org/3/library/functions.html#hash) to compute the slot where the list goes.
# - We would go to that slot and store the list there, as usual.
# Later on, we mutate our list, say:
# In[154]:
x[1] = 7
x
# - Now we want to loop up `x` in the dictionary.
# - Python would use [`__hash__()`](https://docs.python.org/3/library/functions.html#hash) and get a _different_ value, hence a _different_ slot.
# - Even if (by chance) we ended up in the same slot, we would need the equality test to distinguish between all candidates in the bucket: old and new `x` are different, so they will compare unequal.
# - We can never find our list in the dictionary!
# # A hashable mutable class — sadness ensues
#
# - Let's implement a `Cyborg` class.
# - We'll implement [`__eq__()`](https://docs.python.org/3/reference/datamodel.html#object.__eq__)...
# - ... but _also_ implement [`__hash__()`](https://docs.python.org/3/reference/datamodel.html#object.__hash__).
#
# What could go wrong?
# In[155]:
import functools
@functools.total_ordering
class Cyborg:
def __init__(self, name, weapon):
self.name = name
self.weapon = weapon
def __repr__(self):
return "Cyborg({}, {})".format(self.name, self.weapon)
def __eq__(self, other):
return self.name == other.name and self.weapon == other.weapon
def __lt__(self, other):
if self.name < other.name:
return True
return self.weapon < other.weapon
# In[155]:
class Cyborg:
def __init__(self, name, weapon):
self.name = name
self.weapon = weapon
def __repr__(self):
return "Cyborg({}, {})".format(self.name, self.weapon)
def __eq__(self, other):
return self.name == other.name and self.weapon == other.weapon
# Our hash function combines both attributes in a tuple. Both are strings, so immutable, so hashable.
# In[156]:
@add_to(Cyborg)
def __hash__(self):
return hash((self.name, self.weapon))
# We really shouldn't be doing this...
# In[157]:
robot1 = Cyborg('T-800', 'Franchi SPAS-12')
robot2 = Cyborg('T-800', 'MP5 Submachine gun')
# In[158]:
robot1 == robot2
# In[159]:
robot1 >= robot2
# In[160]:
hash(robot1)
# In[161]:
hash(robot2)
# See the problem here?
# # Our class is mutable!
#
# This will break!
# In[162]:
robot = Cyborg('T-800', 'Franchi SPAS-12')
hash(robot)
# In[163]:
target = dict()
target[robot] = "John Connor"
target[robot]
# So far this works, but...
# In[164]:
robot.weapon = 'M79 grenade launcher'
hash(robot)
# In[165]:
target[robot]
# We broke it!
# That's what the docs were explaining when they told us \[emphasis added\]:
#
# > [...] If a class defines mutable objects and implements an [\_\_eq\_\_()](https://docs.python.org/3/reference/datamodel.html#object.__eq__) method, it should not implement [\_\_hash\_\_()](https://docs.python.org/3/reference/datamodel.html#object.__hash__), since the implementation of hashable collections requires that a key’s hash value is immutable (**if the object’s hash value changes, it will be in the wrong hash bucket**).
# # Dictionaries store _references_
#
# A final comment. The dictionary (and thus also sets) store _a reference_ to our object.
# In[166]:
robot = Cyborg('T-1000', 'Remington 870')
id(robot)
# In[167]:
terminators = set()
terminators.add(robot)
# Let's retrieve the element from the set [without removing it](https://stackoverflow.com/q/59825):
# In[168]:
id(next(iter(terminators)))
# It's not a copy — it's the same object.
# - In other words: mutating our object (which should have never been hashable)...
# - ... also modifies the copy stored in the hash hable (dictionary / set!).
# In[169]:
robot.weapon = 'RSB-80 Plasma Gun'
robot
# In[170]:
terminators
# This implies that if our hashable mutated object landed (by chance) in the same slot as it originally did, we would still be able to find it.
# In[171]:
robot = Cyborg('T-1000', 'Remington 870')
terminators = set()
terminators.add(robot)
original_hash = hash(robot)
print("Original hash:", original_hash)
# Let's mutate the object but make sure it goes to the same slot:
# In[172]:
robot.weapon = 'RSB-80 Plasma Gun'
Cyborg.__hash__ = lambda x: original_hash
print("Mutated hash: ", hash(robot))
# In[173]:
robot in terminators
# We can find it! It went to the same bucket.
#
# Hashable mutable objects will break _most_ of the time, but not always.
# # Doing it right
#
# Experiments aside, let's do the right thing and make our mutable class non-hashable.
# In[174]:
class Cyborg:
def __init__(self, name, weapon):
self.name = name
self.weapon = weapon
__hash__ = None
robot = Cyborg('T-1000', 'RSB-80 Plasma Gun')
terminators = set()
terminators.add(robot)
# # Wrapping up this section
#
# We revisited and understood the intimidating paragraph from the docs.
#
# - If your class doesn't define `__eq__()`, don't define `__hash__()` either.
# - Mutable classes should not be hashable.
# * To make it non-hashable, set `__hash__ = None`.
# - If your class defines `__eq__()`, implement `__hash__()` for it to be hashable.
# - Abstract base classes simplify testing whether an interface is satisfied.
# - Dictionaries store references, not copies of the data.
# # Wrapping up this section
#
# We revisited and understood the intimidating paragraph from the docs.
#
# - If your class doesn't define `__eq__()`, don't define `__hash__()` either.
# - Mutable classes should not be hashable.
# * To make it non-hashable, set `__hash__ = None`.
# - If your class defines `__eq__()`, implement `__hash__()` for it to be hashable.
#
#
#
#
#
# [Photo](https://www.pexels.com/photo/landscape-photography-of-desert-33147/) by [Pixabay](https://pixabay.com/en/desert-africa-bedouin-footprints-1007157/) / [CC0](https://www.pexels.com/creative-commons-images/)
# # Bonus: Trollface
#
# The whole point of using hash tables is to get $\mathcal{O}(1)$ operations.
#
# Say we use a dictionary to map every number to its square:
# In[176]:
{x: x ** 2 for x in range(10_000)}
# Lookups are $\mathcal{O}(1)$. Proof:
# In[181]:
get_ipython().run_cell_magic('timeit', 'squares = {x: x ** 2 for x in range(1_000)}', '999 in squares\n')
# In[183]:
get_ipython().run_cell_magic('timeit', 'squares = {x: x ** 2 for x in range(10_000)}', '9_999 in squares\n')
# Size of dictionary is $10x$, but it takes the same amount of time. This is $\mathcal{O}(1)$.
# Let's subclass `int` and make it always return the same hash
# In[184]:
class TrollInt:
def __init__(self, value):
self.value = value
def __eq__(self, other):
return self.value == other.value
def __hash__(self):
return 1
# This means that they'll all go the same slot.
#
# Python will need to resolve (endless) collisions
# In[185]:
get_ipython().run_cell_magic('timeit', 'squares = {TrollInt(x): TrollInt(x ** 2) for x in range(1_000)}', 'squares[TrollInt(999)] \n')
# In[186]:
get_ipython().run_cell_magic('timeit', 'squares = {TrollInt(x): TrollInt(x ** 2) for x in range(10_000)}', 'squares[TrollInt(9_999)]\n')
# We've turned our $\mathcal{O}(1)$ operations into $\mathcal{O}(n)$.
#
# Problem?