- Mike Mull
- @kwikstep
- https://github.com/mikemull/Notebooks/blob/master/hyperloglog.ipynb

"The purpose of this note is to present and analyse an efficient algorithm for

estimatingthe number of distinct elements, known as thecardinality, of large data ensembles, which are referred to here as multisets and are usuallymassive streams(read-once sequences)."

The quote is from Flajolet's original HLL paper. I've highlighted the key words that describe what HLL does

`select count(distinct message) from server_log;`

`cut -f2 server.log | sort | uniq | wc -l`

- Complexity is more about space than computation
- Space complexity of exact methods is generally O(n) or O(nlogn)
- Estimation approaches focused on reducing memory usage while retaining accuracy

Determining the cardinality of sets of data is a very common operation, and fairly simple until the cardinalities get larger (in fact i've used this recently as an interview problem). Most methods are not *computationally* expensive (basically O(n) since the scan the list); but are at best linear in the space used. The latter is the motivation behind most estimation algorithms.

Note also that cardinality estimation is not easily done with some other more basic probabilistic methods. For example, *sampling* doesn't work well because many unique items might occur few times.

- Query optimization
- Monitoring of network traffic
- Data sketching
- Basic data analysis
- From the paper:
*On an average day, Powerdrill performs about 5 million such count distinct computations* *...about 100 computations a day yeild a result great than 1 billion*

- From the paper:

The earliest uses of cardinality estimation were probably for database query optimization. Given information about the size of various tables, the query planner can choose the best way to join them. These algorithms are often built into network equipment also to detect usage anomalies. Hyperloglog is also used for *data sketching* along with things like CountMin and Bloom Filters (i first heard the term in relation to Trifacta's big-data management tools). Finally, the number of unique items is something that people are interested in know for numerous data analysis reasons, like monitoring usage or doing A/B testing.

Probabilistic Counting -> LogLog -> SuperLogLog -> HyperLogLog -> HyperLogLog++ -> ?

- Probabilistic Counting (1985)
- LogLog/SuperLogLog (2003)
- HyperLogLog (2007)
- HyperLogLog++ (2013)

- Determine an "observable" with some understood probability that will help us estimate the cardinality
- Here, a
*bit pattern*of the hashed value of the thing we're counting

- Here, a
- Calculate observable for every item in set
- Infer cardinality from observed values

$$ hash(x) -> [0...2^{L-1}] $$

- We assume (with some justification) that the hash function generates values
*uniformly*, so, if L were 8

$$ \begin{equation} \begin{aligned} P(00000001) &= 1/256 \\ P(0000001x) &= 2/256 \\ P(000001xx) &= 4/256 \\ \end{aligned} \end{equation} $$

- So, the more unique items we see, the more likely we are to see a bit pattern with more leading zeros

So our *observable* is:

For PC $$ \rho(hash(x)) = \text{position of least significant 1 bit in hash} $$

For LogLog and HyperLogLog, this changes to: $$ \rho(hash(x)) = \text{position of first 1 bit} $$

The PC method uses a bitmap where a bit is set if the rho function indicates that bit. Once all the items have been evaluated the estimator used for log2n is the position of the rightmost zero.

In the case of loglog and hyperloglog the estimatator is just the max() of rho.

- Problem: The variance of the estimator in PC is too high
- Solution 0: Run the process a bunch of times.
- Solution 1: Use more hash functions
- More computationally expensive
- Can't construct independent hash functions

- Problem: The variance of the estimator in PC is too high
- Solution 2: Divide stream into M substreams, average the estimates in each substream to get value for n/M
- Uses the first (or last)
*p*bits of the*hashed*value to give 2^p streams

- Uses the first (or last)

```
[00000001|001001001010001001001010|
[00000110|011010110010011010110010|
<stream>|<-----useful bits------>|
```

- m = 256 gives only about 5% accuracy
- PCSA is O(mlog2n) on space with an accuracy of α / sqrt(m)

Probabilistic Counting with Stochastic Average (PCSA) didn't give great results, but the memory requirement was primarily dependent on the number of streams used.

$$ E(R) \approx log_2\phi n \quad \phi = 0.77351 $$

Or,

$$ \text{standard error} = \frac{0.78}{\sqrt{m}} $$

Flajolet's real racket was the analysis of algorithms, and if you read his papers they consist mostly of long and detailed proofs about the precise nature of the distribution of the estimators that are used in these algorithms. Sadly, those details are beyond the scope of this talk, but this constant, and later normalizing factors that show up in loglog and hyperloglog come from those analyses. Note that these factors come from equations that are dependent on *m* so they apply to a specific number of streams.

- Key difference is that now they only store max value of rho for each stream
- So, for 32-bit hashes we need at most 5 bits to track rho
- In general, for 2^k length hashed we need k bits to hold max(rho)
- Like PCSA, uses arithmetic mean of values in substream
- Space complexity is O(log2log2), hence the name

$$ E := \alpha_m m 2^{\frac{1}{m} \sum M(j)} $$

Many of the details of PCSA are retained in loglog. The major difference is that the complicated bitmap approach in PCSA has been replaced by an estimator that is simply the maximum value for rho for each stream. Since rho is the bit position, you only need k bits to track 2^k bit positions.

The major innovation of hyperloglog over loglog is the use of the *harmonic mean* to calculate the average across the substreams. Because the dispersion of hyperloglog is lower (1.05/sqrt(m)), it can achieve the same accuracy as loglog with less memory (because it can use fewer substreams to get the same accuracy).

$$ E := \frac{\alpha_m m^2}{\sum_{j=1}^{m} \frac{1}{2^{M(j)}}} $$

$$ \alpha_m := \frac{1}{m \int_{0}^{\infty} (log_2(\frac{2 + u}{1 + u}))^m du} $$

- HLL also makes adjustments for high and low cardinalities

The estimator for hyperloglog is relatively straightforward, except of course for the hairy-looking integral that's described as a normalizing factor.

- Accuracy
- Memory Efficiency
- Estimate Large Cardinalities
- Practicality

Google's HLL++ doesn't introduce any radical innovations to the algorithm, but rather adds a lot of engineering to make it more accurate and memory efficient. These are the four goals they state as the objectives of their improvements

- Can handle much larger cardinalities
- For size L hashes, requires log2(L + 1 -p) * 2^p bits
- So the extra storage isn't that much

This one seems fairly obvious. Clearly they're going to be able to handle much larger cardinalities without worrying about collisions, and the extra memory required amounts to 1 bit per stream.

The HLL estimator has significant bias for lower cardinalities. Flajolet, et. al. addressed this by using a slightly different estimator if the estimate < 5/2m and there are empty buckets. In HLL++ they address this with an empirical bias correction. They calculate numerous estimates with the basic HLL approach, average those, and subtract from the true cardinality to get the bias. So now they have a bias estimate for a range of known cardinalities. However, when calculating a cardinality estimate for a new set of data, you obviously don't know the true cardinality, so to correct for bias they maintain a set of 200 interpolation points and do a nearest neighbor interpolation.

Again, this is empirically determined. For cardinalities lower than about 11k (for precision 14), they use linear counting because it has lower standard error. For a middle range they use their bias-corrected version of HLL, and beyond that standard HLL.

- If we use 6m bits for every case, we're wasting memory when n << m
- They're using 2^14 = 16384 streams

- Stream index and count encoded in an integer
- Combination of sorted list and auxiliary set that gets merged.
- If they don't need to convert to the non-sparse representation, they can use more streams (higher precision)
- Convert to the dense representation if necessary.

So these Google guys take the memory efficiency very seriously. Their first improvement is to use a sparse representation for the stream registers so they don't have to wastefully allocate 6m bits. They encode the stream index and the max(rho()) value into an integer, and keep these in a sorted list. To make insertions faster they also have an auxiliary set, which gets merged into the list if it reaches a certain size.

One really clever aspect of this representation is that they can use more streams as long as they don't have to convert to the dense representation.

- Compression:
- use a variable number of bits to store (index, rho) based on current estimate
- use difference encoding since the sparse list is sorted

- Encoding:
- Don't store rho() at all in the sparse representation

They make the observation that when the sparse representation is being used, the rho value won't be needed because the algorithm will use the linear counting method in the end.