"The purpose of this note is to present and analyse an efficient algorithm for estimating the number of distinct elements, known as the cardinality, of large data ensembles, which are referred to here as multisets and are usually massive streams (read-once sequences)."
The quote is from Flajolet's original HLL paper. I've highlighted the key words that describe what HLL does.
select count(distinct message) from server_log;
cut -f2 server.log | sort | uniq | wc -l
Determining the cardinality of a set of data is a very common operation, and fairly simple until the cardinalities get large (in fact I've used this recently as an interview problem). Most methods are not computationally expensive (basically O(n), since they scan the list), but they are at best linear in the space used. The latter is the motivation behind most estimation algorithms.
Note also that cardinality estimation is not easily done with other, more basic probabilistic methods. For example, sampling doesn't work well because many unique items may occur only a few times.
The earliest uses of cardinality estimation were probably for database query optimization. Given information about the size of various tables, the query planner can choose the best way to join them. These algorithms are often built into network equipment as well, to detect usage anomalies. HyperLogLog is also used for data sketching along with things like Count-Min and Bloom filters (I first heard the term in relation to Trifacta's big-data management tools). Finally, the number of unique items is something people are interested in knowing for numerous data-analysis reasons, like monitoring usage or doing A/B testing.
Probabilistic Counting -> LogLog -> SuperLogLog -> HyperLogLog -> HyperLogLog++ -> ?
So our observable is:
For PC $$ \rho(hash(x)) = \text{position of least significant 1 bit in hash} $$
For LogLog and HyperLogLog, this changes to: $$ \rho(hash(x)) = \text{position of the leftmost (most significant) 1 bit} $$
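A minimal Python sketch of the two rho variants, assuming 32-bit hash values and the bit-numbering conventions above (0-based from the low end for PC, 1-based from the high end for LogLog/HLL):

```python
def rho_pc(h):
    """Probabilistic Counting: position of the least significant 1 bit (0-based)."""
    if h == 0:
        return 0  # convention for an all-zero hash
    return (h & -h).bit_length() - 1  # h & -h isolates the lowest set bit

def rho_ll(h, bits=32):
    """LogLog / HyperLogLog: position of the leftmost 1 bit (1-based)."""
    if h == 0:
        return bits + 1  # no 1 bit at all
    return bits - h.bit_length() + 1
```

So `rho_pc(0b1000)` is 3, while `rho_ll(1 << 31)` is 1 (the very first bit is set).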
The PC method uses a bitmap in which bit $i$ is set if the rho function returned $i$ for some item. Once all the items have been evaluated, the estimator for $\log_2 n$ is the position of the lowest-order 0 bit (the first 0 when scanning up from bit 0).
In the case of LogLog and HyperLogLog, the estimator is just the max() of rho over the items seen.
[00000001|001001001010001001001010|
[00000110|011010110010011010110010|
<stream>|<-----useful bits------>|
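The split in the diagram above can be sketched as follows, assuming 32-bit hashes with the leading k bits selecting the substream:

```python
def split_hash(h, k=8, bits=32):
    """Stochastic averaging: the leading k bits pick one of m = 2**k
    substreams (registers); the remaining bits feed the rho function."""
    stream = h >> (bits - k)             # leading k bits -> substream index
    rest = h & ((1 << (bits - k)) - 1)   # trailing bits -> input to rho
    return stream, rest
```

One hash function thus simulates m independent hash functions, which is what makes averaging across substreams cheap.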
Probabilistic Counting with Stochastic Averaging (PCSA) didn't give great results, but it established that the memory requirement depends primarily on the number of substreams used.
Or,
$$ \text{standard error} = \frac{0.78}{\sqrt{m}} $$

Flajolet's real racket was the analysis of algorithms, and if you read his papers they consist mostly of long, detailed proofs about the precise distribution of the estimators used in these algorithms. Sadly, those details are beyond the scope of this talk, but this constant, and the normalizing factors that show up later in LogLog and HyperLogLog, come from those analyses. Note that these factors come from equations that depend on m, so they apply to a specific number of substreams.
Many of the details of PCSA are retained in LogLog. The major difference is that the complicated bitmap approach of PCSA has been replaced by an estimator that is simply the maximum value of rho for each substream. Since rho is a bit position, you only need k bits to track $2^k$ bit positions.
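The register-update step can be sketched in a few lines of Python; the hash function and parameter choices here are illustrative, not from the paper:

```python
import hashlib

def loglog_registers(items, k=10, bits=32):
    """Minimal LogLog sketch: one register per substream, each holding
    max(rho).  A register only needs enough bits to store a bit position
    (roughly log2 of log2 of n) -- hence the name."""
    m = 2 ** k
    M = [0] * m
    for x in items:
        # illustrative hash: first 4 bytes of SHA-1, as a 32-bit integer
        h = int.from_bytes(hashlib.sha1(str(x).encode()).digest()[:4], "big")
        j = h >> (bits - k)                  # leading k bits pick the substream
        w = h & ((1 << (bits - k)) - 1)      # remaining bits feed rho
        # rho: position of the leftmost 1 bit in the (bits-k)-bit word, 1-based
        rho = (bits - k) - w.bit_length() + 1 if w else (bits - k) + 1
        M[j] = max(M[j], rho)
    return M
```

LogLog then turns the registers into an estimate by taking the (normalized) geometric mean of $2^{M_j}$ across the substreams.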
The major innovation of HyperLogLog over LogLog is the use of the harmonic mean to average across the substreams. Because the dispersion of HyperLogLog is lower ($1.04/\sqrt{m}$, versus $1.30/\sqrt{m}$ for LogLog), it can achieve the same accuracy with less memory, because it can use fewer substreams.
The estimator for HyperLogLog is relatively straightforward, except of course for the hairy-looking integral that's described as a normalizing factor.
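For reference, the raw estimator and that normalizing constant, as given in Flajolet et al. (2007):

$$ E = \alpha_m \, m^2 \left( \sum_{j=1}^{m} 2^{-M_j} \right)^{-1}, \qquad \alpha_m = \left( m \int_0^{\infty} \left( \log_2 \frac{2+u}{1+u} \right)^m du \right)^{-1} $$

In practice the paper just tabulates it: $\alpha_{16} \approx 0.673$, $\alpha_{32} \approx 0.697$, $\alpha_{64} \approx 0.709$, and $\alpha_m \approx 0.7213/(1 + 1.079/m)$ for $m \ge 128$.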
Google's HLL++ doesn't introduce any radical innovations to the algorithm, but rather adds a lot of engineering to make it more accurate and memory efficient. These are the four goals they state as the objectives of their improvements:
This one (switching to a 64-bit hash function) seems fairly obvious. Clearly they're going to be able to handle much larger cardinalities without worrying about hash collisions, and the extra memory required amounts to 1 bit per stream (registers grow from 5 to 6 bits).
The HLL estimator has significant bias at lower cardinalities. Flajolet et al. addressed this by using a slightly different estimator when the estimate is below $\frac{5}{2}m$ and there are empty buckets. In HLL++ they address it with an empirical bias correction: they compute numerous estimates with the basic HLL approach on data of known cardinality, average those, and subtract the true cardinality to get the bias. So now they have a bias estimate for a range of known cardinalities. However, when calculating a cardinality estimate for a new set of data, you obviously don't know the true cardinality, so to correct for bias they maintain a set of 200 interpolation points and do a nearest-neighbor interpolation.
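A sketch of that nearest-neighbor lookup, assuming a precomputed table of (raw estimate, bias) pairs. The pairs below are made up for illustration; the paper's appendix has the real tables, and it averages over k = 6 neighbors:

```python
def estimate_bias(raw_estimate, interp_points, k=6):
    """Find the k precomputed (raw_estimate, bias) pairs closest to the
    observed raw estimate and average their biases."""
    neighbors = sorted(interp_points, key=lambda p: abs(p[0] - raw_estimate))[:k]
    return sum(bias for _, bias in neighbors) / len(neighbors)

# hypothetical table of (raw estimate, measured bias) pairs
points = [(1000, 90), (2000, 80), (3000, 70), (4000, 60), (5000, 50)]
corrected = 2500 - estimate_bias(2500, points, k=2)  # subtract interpolated bias
```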
Again, this is empirically determined. For cardinalities below about 11k (at precision 14), they use linear counting because it has lower standard error; for a middle range, their bias-corrected version of HLL; and beyond that, standard HLL.
So these Google guys take memory efficiency very seriously. Their first improvement is to use a sparse representation for the stream registers so they don't have to wastefully allocate 6m bits. They encode the stream index and the max(rho()) value into an integer, and keep these in a sorted list. To make insertions faster they also have an auxiliary set, which gets merged into the list when it reaches a certain size.
One really clever aspect of this representation is that they can use a higher precision (more streams) as long as they don't have to convert to the dense representation.
They make the observation that while the sparse representation is in use, the rho value usually won't be needed, because in that regime the final estimate comes from linear counting, which only cares about which registers are non-empty.