Visualization of cardinality algorithm (naive version)

In [112]:
import uuid
import math
%matplotlib inline

Create a list of 10 random UUIDs

In [113]:
list10 = [str(uuid.uuid4()) for i in xrange(10)]
list10
Out[113]:
['0084d80c-283c-4392-947f-a3975ae4c88a',
 '9ee0603d-f5bb-40ee-8e26-88668300cdbe',
 '993c2ba5-1821-4fc8-acf6-f2349045aa5f',
 '9f15bd5e-f872-4fbd-a0cb-a6b6c1e354d4',
 '35467743-f5bb-49f9-8c3f-baad88aed594',
 '48285124-11c6-4060-8b93-3aec01ef6fa8',
 '11e632f1-665a-4074-a0fe-8a4e3ff1b83b',
 '4d1b272f-b269-4716-9418-4d7fcce650fd',
 'a46cbf0d-ca3e-4dbf-8ab2-1240f5d96118',
 'd3e892b8-0548-4e25-98ef-65cf80959dd7']

Now lists of 100 and 1000 UUIDs

In [127]:
list100 = [str(uuid.uuid4()) for i in xrange(100)]
list1000 = [str(uuid.uuid4()) for i in xrange(1000)]

Hash the list of 10 UUIDs with a hash function that maps to floating point numbers in the range [0,1)

In [116]:
h10 = [hash(x) / pow(2.0,32.0) + 0.5 for x in list10]

Find the smallest, and the exponentiate that to get the approximation of the size (cardinality) of the set

In [117]:
m10 = min(h10)
m10
Out[117]:
0.07389730727300048
In [118]:
pow(10,-math.log10(m10))
Out[118]:
13.532292811504993

Do the same for the sets of 100 and 1000

In [119]:
h100 = [hash(x) / pow(2.0,32.0) + 0.5 for x in list100]
In [120]:
m100 = min(h100)
m100
Out[120]:
0.0026511941105127335
In [121]:
pow(10,-math.log10(m100))
Out[121]:
377.1885264963127
In [122]:
h1000 = [hash(x) / pow(2.0,32.0) + 0.5 for x in list1000]
In [123]:
m1000 = min(h1000)
m1000
Out[123]:
0.00014685397036373615
In [124]:
pow(10,-math.log10(m1000))
Out[124]:
6809.485623869372

Plot the hashed values. Top set of dots is for the set of 10 hashed UUIDs, middle set is for the set of 100 hashed UUIDs, and the bottom set is for the set of 1000 hashed UUIDs

In [125]:
scatter(h10 + h100 + h1000, [3]*10 + [2]*100 + [1]*1000)
Out[125]:
<matplotlib.collections.PathCollection at 0xb385860>

Zoom in to the left portion of the above plot.

In [126]:
fig, ax = plt.subplots()
fig.set_size_inches(12,4)
ax.set_xlim([0, 0.1])
ax.set_ylim([0, 4])
ax.scatter(h10 + h100 + h1000, [3]*10 + [2]*100 + [1]*1000)
Out[126]:
<matplotlib.collections.PathCollection at 0xb71e908>

Improving the accuracy

We can improve the accuracy by splitting the original set into m subsets, applying the above naive algorithm over each of the m subsets, and averaging the results together.

Below we do so for the set of 1000 hashed UUIDs, where we split it up into 10 subsets of 100 each, average the resulting min hash values, exponentiate that average, and then multiply that back by 10 to get the cardinality of the whole set.

In [132]:
betterm1000=mean([min(h1000[x:x+100]) for x in xrange(10)])
In [136]:
10*pow(10,-math.log10(betterm1000))
Out[136]:
358.90695792638485

As you can see, this resulted in a result that is off by a factor of 3, instead of off by a factor of 7 as before.

There are even more clever and accurate ways to average the mins, such as using a geometric average, or the ultimate, the HyperLogLog algorithm for averaging as described in HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm.

In [ ]: