"The purpose of this note is to present and analyse an efficient algorithm for estimating the number of distinct elements, known as the cardinality, of large data ensembles, which are referred to here as multisets and are usually massive streams (read-once sequences)."
select count(distinct message) from server_log;
cut -f2 server.log | sort | uniq | wc -l
Probabilistic Counting -> LogLog -> SuperLogLog -> HyperLogLog -> HyperLogLog++ -> ?
So our observable is:
For PC $$ \rho(hash(x)) = \text{position of least significant 1 bit in hash} $$
For LogLog and HyperLogLog, this changes to: $$ \rho(hash(x)) = \text{position of first 1 bit} $$
[00000001|001001001010001001001010|
[00000110|011010110010011010110010|
<stream>|<-----useful bits------>|
Or,
$$ \text{standard error} = \frac{0.78}{\sqrt{m}} $$