%load_ext watermark
%watermark -a 'Sebastian Raschka' -u -d -v
Sebastian Raschka last updated: 2017-07-23 CPython 3.6.1 IPython 6.0.0
A bloom filter is a probablistic data structures for memory-efficient look-ups to test if a element or value is a member of a set. In a nutshell, you can think of a bloom filter as a large bit array (an array that contains 1s and 0s), and by only checking a few elements (bits) of this array, we can tell whether an element is likely a member of a set or is definitely not a member of a set. Checking the set membership via bloom filters can return the following outputs:
Or in other words, bloom filters can produce false positives (a match means that a element is a member of the set with a given uncertainty) but does not produce false negatives (non-matches always mean that the element is not a member of that set).
So, if bloom filters are probablistic and can produce false positives, why are they useful in practice? Bloom filters are extremely useful if we are working with large databases and want to run a quick check whether or not it's worth to run a computationally more expensive database query to retrieve an element. If a bloom-filter returns a non-match, we know that an element is not contained in the set and thus we can't save computational resources to check the actual database for that element.
Bloom filters allow us to implement computationally cheap and memory efficient set membership checks using bit arrays. Given an element, a bloom filter uses multiple hash functions (or the same hash function with different random seeds) to encode the element as a position in the bit array. To walk through the inner works of a bloom filter step by step.
1)
let's assume we have initialized the following, empty bitarray b underlying the bloom filter of size 10:
b = [0 0 0 0 0 0 0 0 0 0]
2)
Next, we use two different hash functions, h1 and h2 to encode an element e. These hash functions convert the output of the hash into an integer and normalize the integer so that it fits into the bounds of the array b:
h1(e) -> 5
h2(e) -> 3
In the example above, the first hash function hashes element e to the array index position 5, and the second hash function hashes the element to the array index position 3.
3)
If we consider the hash operations of step 2) as part of an add to the set operation, we would use those returned array index position to update the bitarray b as follows:
[0 0 0 0 0 0 0 0 0 0] -> [0 0 0 1 0 1 0 0 0 0]
If step 2) was part of a query or look-up operation, we would simply check the respective array index positions:
In this section, we are going to implement a bloom filter in Python. However, note that the following implementation of a bloom filter in Python mainly serves illustrative purposes and has not be designed for efficiency. For example, using Python list objects for representing bit arrays is very inefficient compared to using the bitarray
package.
To generate the hashes, we will use the hashlib
module from Python's standard library. Let's start with a simple example generating an integer hash using the MD5
hash function:
import hashlib
h1 = hashlib.md5()
h1.update('hello-world'.encode('utf-8'))
int(h1.hexdigest(), 16)
43309944592180122138158756568456140780
Unfortunately, the update method will render the hash function non-deterministic in the context of bloom filters. I.e., hashing the same value would return a different integer hash:
h1.update('hello-world'.encode('utf-8'))
int(h1.hexdigest(), 16)
94628577942110548051725978613383116803
int(hashlib.new('md5', ('%s' % 'hello-world').encode('utf-8')).hexdigest(), 16)
43309944592180122138158756568456140780
int(hashlib.new('md5', ('%s' % 'hello-world').encode('utf-8')).hexdigest(), 16)
43309944592180122138158756568456140780
Next, let's implement a BloomFilter
class based on the concepts we discussed earlier:
class BloomFilter():
def __init__(self, array_size, hash_names):
self.array_size = array_size
self.bitarray = [0] * array_size
self.hash_names = hash_names
def _get_hash_positions(self, value):
pos = []
for h in self.hash_names:
hash_hex = hashlib.new(h, ('%s' % value).encode(
'utf-8')).hexdigest()
# convert hashed value into an integer
asint = int(hash_hex, 16)
# modulo array_size to fit hash value into the bitarray
pos.append(asint % self.array_size)
return pos
def add(self, value):
hash_pos = self._get_hash_positions(value)
for pos in hash_pos:
self.bitarray[pos] = 1
def query(self, value):
hash_pos = self._get_hash_positions(value)
for pos in hash_pos:
if not self.bitarray[pos]:
return False
return True
To test our implementation, let's initialize a bloom filter with array size 10 and two different hash function, a simple SHA hash and MD5 hash:
bloom = BloomFilter(array_size=10, hash_names=('md5', 'sha1'))
bloom.bitarray
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Next, we will add a new value to the bloom filter, 'hello world!'
bloom.add('hello world!')
bloom.bitarray
[0, 1, 0, 0, 0, 0, 0, 1, 0, 0]
As we can see from running the previous code example, the array index position the value was hashed into are 1
and 7
. Let's check the element for membership now:
bloom.query('hello world!')
True
So far so good. Let's add another value, 'foo-bar':
bloom.add('foo-bar')
bloom.bitarray
[0, 1, 0, 1, 0, 1, 0, 1, 0, 0]
The 'foo-bar' value was hashed into positions 3
and 5
. Similarly, we can check the membership as follows:
bloom.query('foo-bar')
True
Just to confirm that our bloom filter is implemented correctly and does not return false negative, let's run a query on a new value:
bloom.query('test')
False
A nice property of bloom filters is that we can determine the size of the bitarray mathematically for a desired false positive rate. The following equation approximates the probability of returning a false positive (for more details, and how to determine the optimal size of the bitarray and number of hashes, see the excellent article at https://en.wikipedia.org/wiki/Bloom_filter):
$$P \approx \big(1 - e^{-kn / m}\big)^k$$Here, $k$ is the number of hashes, $m$ is the size of the bitarray, $n$ is the expected number of elements that are to be stored in the bloom filter, and $k$ is the number of hash functions.
For instance, say we only expect 3 values to be stored, only have 2 hash functions, and a tiny bitarray (just as in the previous code example), then our probability of returning a false positive would be approximately 20%:
$$\big(1 - e^{-2*3 / 10}\big)^2 \approx 0.2$$