HyperLogLog in Practice¶

Mike Mull
@kwikstep
https://github.com/mikemull/Notebooks/blob/master/hyperloglog.ipynb

What¶

"The purpose of this note is to present and analyse an efficient algorithm for estimating the number of distinct elements, known as the cardinality, of large data ensembles, which are referred to here as multisets and are usually massive streams (read-once sequences)."

An Easy Problem For Moderately Sized, Relatively Static Data¶

select count(distinct message) from server_log;

cut -f2 server.log | sort | uniq | wc -l

Complexity is more about space than computation
Space complexity of exact methods is generally O(n) or O(nlogn)
Estimation approaches focused on reducing memory usage while retaining accuracy
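
For contrast with the estimators that follow, here is a minimal sketch of the exact approach in Python. The function name and sample data are illustrative, not from the original notebook. The set holds one entry per distinct item, which is where the O(n) space goes.

```python
def count_distinct_exact(items):
    """Exact distinct count: memory grows with the number of distinct items."""
    seen = set()
    for item in items:
        seen.add(item)
    return len(seen)


# Hypothetical log messages, standing in for the `message` column above.
messages = ["disk full", "login ok", "disk full", "timeout", "login ok"]
print(count_distinct_exact(messages))
```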

Why¶

Query optimization
Monitoring of network traffic
Data sketching
Basic data analysis
- From the paper: On an average day, Powerdrill performs about 5 million such count distinct computations
- ...about 100 computations a day yield a result greater than 1 billion

The History¶

Probabilistic Counting -> LogLog -> SuperLogLog -> HyperLogLog -> HyperLogLog++ -> ?

Probabilistic Counting (1985)
LogLog/SuperLogLog (2003)
HyperLogLog (2007)
HyperLogLog++ (2013)

Probabilistic Counting¶

Determine an "observable" with some understood probability that will help us estimate the cardinality
- Here, a bit pattern of the hashed value of the thing we're counting
Calculate observable for every item in set
Infer cardinality from observed values

Hashing¶

$$ hash(x) \rightarrow [0, 2^L - 1] $$

We assume (with some justification) that the hash function generates values uniformly, so if L were 8:

$$ \begin{aligned} P(00000001) &= 1/256 \\ P(0000001x) &= 2/256 \\ P(000001xx) &= 4/256 \end{aligned} $$

So, the more unique items we see, the more likely we are to see a bit pattern with more leading zeros
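
A small experiment can make this concrete. The sketch below assumes a 32-bit hash built by truncating SHA-1 (an arbitrary choice for illustration, not the hash used in any of the papers) and reports the largest run of leading zeros seen among the first n integers.

```python
import hashlib

L = 32  # hash width in bits, chosen for this illustration


def hash32(item):
    """Map an item to a 32-bit integer by truncating SHA-1 (an assumption)."""
    digest = hashlib.sha1(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def leading_zeros(value, bits=L):
    """Number of leading zero bits when value is written as a bits-wide word."""
    return bits - value.bit_length()


def max_leading_zeros(n):
    """Longest run of leading zeros among the hashes of 0..n-1."""
    return max(leading_zeros(hash32(i)) for i in range(n))


for n in (10, 1000, 100000):
    print(n, max_leading_zeros(n))
```

Because each larger range contains the smaller ones, the maximum can only stay the same or grow as n grows, and on average it grows like log2(n).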

The Observable Value¶

So our observable is:

For PC $$ \rho(hash(x)) = \text{position of least significant 1 bit in hash} $$

For LogLog and HyperLogLog, this changes to: $$ \rho(hash(x)) = \text{position of leftmost 1 bit (number of leading zeros plus one)} $$
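
Both observables are short bit manipulations. The sketch below uses 1-based positions for both, which is a convention assumed here for symmetry; the function names are made up for this example.

```python
def rho_lsb(value, bits=32):
    """1-based position of the least significant 1 bit (PC-style observable).

    Returns bits + 1 when value is 0, i.e. no 1 bit was found.
    """
    if value == 0:
        return bits + 1
    # value & -value isolates the lowest set bit.
    return (value & -value).bit_length()


def rho_msb(value, bits=32):
    """1-based position of the leftmost 1 bit in a bits-wide word.

    Equals the number of leading zeros plus one; bits + 1 when value is 0.
    """
    return bits - value.bit_length() + 1


print(rho_lsb(0b00001000, bits=8))
print(rho_msb(0b00001000, bits=8))
```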

Stochastic Averaging¶

Problem: The variance of the estimator in PC is too high
Solution 0: Run the process a bunch of times.
Solution 1: Use more hash functions
- More computationally expensive
- Can't construct independent hash functions

Solution 2: Divide the stream into m substreams, then average the estimates from the substreams to get a value for n/m
- Uses the first (or last) p bits of the hashed value to give 2^p streams
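
The split itself is a shift and a mask. A minimal sketch, assuming 32-bit hashes and the first p = 8 bits as the substream index (the function name is illustrative):

```python
def split_hash(h, p=8, bits=32):
    """Split a bits-wide hash into (substream index, remaining useful bits)."""
    index = h >> (bits - p)               # first p bits
    rest = h & ((1 << (bits - p)) - 1)    # remaining bits - p bits
    return index, rest


# A 32-bit word: 8 index bits followed by 24 useful bits.
h = int("00000001" + "001001001010001001001010", 2)
index, rest = split_hash(h)
print(index, format(rest, "024b"))
```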

[00000001|001001001010001001001010|
[00000110|011010110010011010110010|
 <stream>|<-----useful bits------>|

PCSA¶

m = 256 gives a standard error of only about 5%
PCSA uses O(m log2 n) bits of space, with a standard error of roughly α / sqrt(m), where α is a constant
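
A compact sketch of PCSA under stated assumptions: a 64-bit hash from truncated SHA-1, the low bits of the hash as the substream index, and a 0-based position for the lowest 1 bit. It uses the estimate (m / φ) · 2^(mean R), with φ = 0.77351, where R is the position of the lowest 0 bit in each bitmap.

```python
import hashlib

PHI = 0.77351  # correction constant from the Probabilistic Counting analysis


def hash64(item):
    """64-bit hash from truncated SHA-1 (an assumption for this sketch)."""
    digest = hashlib.sha1(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def pcsa_estimate(items, m=256):
    bitmaps = [0] * m
    for item in items:
        h = hash64(item)
        index, rest = h % m, h // m  # low bits choose the substream
        if rest == 0:
            continue  # no 1 bit to record; vanishingly unlikely
        rho = (rest & -rest).bit_length() - 1  # 0-based lowest 1 bit
        bitmaps[index] |= 1 << rho
    total_r = 0
    for bitmap in bitmaps:
        r = 0
        while (bitmap >> r) & 1:  # R = position of the lowest 0 bit
            r += 1
        total_r += r
    return (m / PHI) * 2 ** (total_r / m)


print(round(pcsa_estimate(range(50000))))
```

Each substream keeps a whole bitmap, which is the space cost that LogLog later reduces.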

A Word About Mysterious Constants¶

$$ E(R) \approx \log_2(\phi n), \quad \phi = 0.77351 $$

Or,

$$ \text{standard error} = \frac{0.78}{\sqrt{m}} $$
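
To see what the standard-error constant means in practice, this sketch simply evaluates 0.78 / sqrt(m) for a few substream counts. The function name is made up for illustration.

```python
import math


def standard_error(m, constant=0.78):
    """Relative standard error for m substreams, using the PCSA constant."""
    return constant / math.sqrt(m)


for m in (64, 256, 1024, 4096):
    print(m, f"{100 * standard_error(m):.2f}%")
```

Halving the error requires four times as many substreams, which is why the per-substream storage cost matters so much.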

LogLog¶

Key difference is that now they only store max value of rho for each stream
So, for 32-bit hashes we need at most 5 bits to track rho
In general, for hashes of length 2^k we need k bits to hold max(rho)
Like PCSA, uses the arithmetic mean of the values across substreams
Space per substream is O(log2 log2 n) bits, hence the name

$$ E := \alpha_m m 2^{\frac{1}{m} \sum_{j=1}^{m} M(j)} $$
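
The LogLog estimate can be sketched in a few lines. Assumptions in this example: a 64-bit hash from truncated SHA-1, the first p bits as the substream index, and the asymptotic constant 0.39701 used in place of the exact α_m.

```python
import hashlib

ALPHA_INF = 0.39701  # asymptotic LogLog constant, standing in for alpha_m


def hash64(item):
    """64-bit hash from truncated SHA-1 (an assumption for this sketch)."""
    digest = hashlib.sha1(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def loglog_estimate(items, p=10, bits=64):
    m = 1 << p
    width = bits - p
    registers = [0] * m  # M(j): max rho seen in substream j
    for item in items:
        h = hash64(item)
        index = h >> width
        rest = h & ((1 << width) - 1)
        rho = width - rest.bit_length() + 1  # leading zeros + 1
        if rho > registers[index]:
            registers[index] = rho
    # Arithmetic mean of the registers in the exponent.
    return ALPHA_INF * m * 2 ** (sum(registers) / m)


print(round(loglog_estimate(range(100000))))
```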

HyperLogLog¶

The Harmonic Mean¶

$$ \frac{m}{\sum_{j=1}^{m} \frac{1}{2^{M(j)}}} $$

HyperLogLog¶

$$ E := \frac{\alpha_m m^2}{\sum_{j=1}^{m} \frac{1}{2^{M(j)}}} $$

$$ \alpha_m := \frac{1}{m \int_{0}^{\infty} \left( \log_2 \frac{2 + u}{1 + u} \right)^m du} $$

HLL also makes adjustments for high and low cardinalities
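
Putting the pieces together, here is a minimal HyperLogLog sketch. It assumes a 64-bit hash from truncated SHA-1, uses the common approximation α_m ≈ 0.7213 / (1 + 1.079 / m) for m ≥ 128 rather than evaluating the integral, applies the low-cardinality adjustment (linear counting), and omits the high-cardinality adjustment, which mainly matters for 32-bit hashes.

```python
import hashlib
import math


def hash64(item):
    """64-bit hash from truncated SHA-1 (an assumption for this sketch)."""
    digest = hashlib.sha1(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def hll_registers(items, p=12, bits=64):
    """Build the m = 2^p registers holding max rho per substream."""
    width = bits - p
    registers = [0] * (1 << p)
    for item in items:
        h = hash64(item)
        index = h >> width
        rest = h & ((1 << width) - 1)
        rho = width - rest.bit_length() + 1
        registers[index] = max(registers[index], rho)
    return registers


def hll_estimate(registers):
    m = len(registers)
    alpha = 0.7213 / (1 + 1.079 / m)  # approximation, valid for m >= 128
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros > 0:
        return m * math.log(m / zeros)  # low-cardinality adjustment
    return raw


print(round(hll_estimate(hll_registers(range(100000)))))
print(round(hll_estimate(hll_registers(range(100)))))
```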

Finally, HyperLogLog++¶

Accuracy
Memory Efficiency
Estimate Large Cardinalities
Practicality

Improvement 1: 64-bit Hash Function¶

Can handle much larger cardinalities
For size L hashes, requires ceil(log2(L + 1 - p)) * 2^p bits
- So the extra storage isn't that much
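
The storage formula is easy to evaluate directly. This sketch compares 32-bit and 64-bit hashes at p = 14; the function name is illustrative.

```python
import math


def dense_bits(hash_bits, p):
    """Bits needed for 2^p registers, each wide enough to hold max rho."""
    bits_per_register = math.ceil(math.log2(hash_bits + 1 - p))
    return bits_per_register * 2 ** p


for hash_bits in (32, 64):
    total = dense_bits(hash_bits, p=14)
    print(hash_bits, total, "bits =", total // 8, "bytes")
```

Going from 32-bit to 64-bit hashes adds one bit per register (5 to 6), so the dense representation grows from 10240 to 12288 bytes at p = 14.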

Improvement 2: Estimating Small Cardinalities¶

Empirical Bias Correction¶

Calculate an average of raw estimates for each cardinality
When estimating cardinalities, use 6 closest interpolation points to correct bias
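
A minimal sketch of the correction step, assuming the bias at a raw estimate is taken as the average bias of the k nearest interpolation points. The table values below are made-up toy numbers for illustration only; the real tables are determined empirically for each precision.

```python
def estimate_bias(raw_estimate, raw_points, biases, k=6):
    """Average the bias of the k interpolation points nearest to raw_estimate."""
    nearest = sorted(
        range(len(raw_points)),
        key=lambda i: abs(raw_points[i] - raw_estimate),
    )[:k]
    return sum(biases[i] for i in nearest) / len(nearest)


def corrected_estimate(raw_estimate, raw_points, biases, k=6):
    """Subtract the interpolated bias from the raw estimate."""
    return raw_estimate - estimate_bias(raw_estimate, raw_points, biases, k)


# Toy table (NOT real data): average raw estimate -> observed bias.
toy_raw_points = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
toy_biases = [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
print(corrected_estimate(35, toy_raw_points, toy_biases, k=2))
```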


Improvement 2: Estimating Small Cardinalities¶

Deciding Which Algorithm To Use¶

Error in different algorithms

Improvement 3: Sparse Representation¶

If we use 6m bits for every case, we're wasting memory when n << m
- They're using 2^14 = 16384 streams
Stream index and count encoded in an integer
Combination of sorted list and auxiliary set that gets merged.
If they don't need to convert to the non-sparse representation, they can use more streams (higher precision)
Convert to the dense representation if necessary.
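
The following is a simplified sketch of the idea, not the encoding used in the paper: each (index, rho) pair is packed into one integer with rho in the low 6 bits, new entries accumulate in a temporary set, and a merge step folds them into a sorted list. All names are illustrative, and the higher sparse precision mentioned above is left out for brevity.

```python
RHO_BITS = 6  # enough for rho values up to 63


def encode(index, rho):
    """Pack a substream index and rho into one integer."""
    return (index << RHO_BITS) | rho


def decode(code):
    """Unpack an integer into (index, rho)."""
    return code >> RHO_BITS, code & ((1 << RHO_BITS) - 1)


def merge(sorted_list, tmp_set):
    """Fold the temporary set into the sorted list, keeping max rho per index."""
    best = {}
    for code in list(sorted_list) + list(tmp_set):
        index, rho = decode(code)
        if rho > best.get(index, 0):
            best[index] = rho
    return sorted(encode(index, rho) for index, rho in best.items())


def to_dense(sorted_list, m):
    """Expand the sparse list into the usual m-register array."""
    registers = [0] * m
    for code in sorted_list:
        index, rho = decode(code)
        registers[index] = max(registers[index], rho)
    return registers


sparse = merge([encode(1, 3)], {encode(1, 5), encode(7, 2)})
print([decode(code) for code in sparse])
print(to_dense(sparse, m=8))
```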

Improvement 3a: Compressing and Encoding Sparse Representation¶

Compression:
- use a variable number of bits to store (index, rho) based on current estimate
- use difference encoding since the sparse list is sorted
Encoding:
- Only store rho() in the sparse representation when it can't be recovered from the stored index bits
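
Difference encoding pays off when combined with a variable-length integer format. The sketch below assumes a 7-bits-per-byte varint, chosen here for illustration, and compares the encoded size of a sorted list with and without deltas. The input values are arbitrary.

```python
def delta_encode(sorted_values):
    """Replace each value with its difference from the previous one."""
    deltas, previous = [], 0
    for value in sorted_values:
        deltas.append(value - previous)
        previous = value
    return deltas


def varint(n):
    """Encode a non-negative integer with 7 payload bits per byte."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encoded_size(values):
    """Total bytes needed to varint-encode every value."""
    return sum(len(varint(v)) for v in values)


codes = [1000, 1003, 1010]  # arbitrary sorted example values
print(encoded_size(codes), encoded_size(delta_encode(codes)))
```

Small gaps between neighboring entries fit in a single byte, so a dense sorted list compresses well.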

The Final Results¶

HLL++ Comparison
