The first technique we'll introduce isn't a probabilistic structure at all, but it will serve as a warm-up for some of the more involved concepts we'll look at later. Specifically, we'll use Chan's formulas for online mean and variance estimates, so that we can calculate an estimated mean and variance in a single pass over a large data set. As we'll see, this technique also lets us combine estimates from several data sets (e.g., when processing a partitioned collection in parallel).
import math
from sys import float_info

class StreamMV(object):
    def __init__(self, count=0, min=float_info.max,
                 max=-float_info.max, m1=0.0, m2=0.0):
        (self.count, self.min, self.max) = (count, min, max)
        # m1 is the running mean; m2 is the running sum of squared
        # deviations from the mean
        (self.m1, self.m2) = (m1, m2)

    def __lshift__(self, sample):
        """Observe a single sample, updating the running statistics."""
        (self.max, self.min) = (max(self.max, sample), min(self.min, sample))
        dev = sample - self.m1
        self.m1 = self.m1 + (dev / (self.count + 1))
        self.m2 = self.m2 + (dev * dev) * self.count / (self.count + 1)
        self.count += 1
        return self

    def mean(self):
        return self.m1

    def variance(self):
        return self.m2 / self.count

    def stddev(self):
        return math.sqrt(self.variance())

    def merge_from(self, other):
        """Fold another StreamMV's statistics into this one."""
        if other.count == 0:
            return self
        if self.count == 0:
            (self.m1, self.m2) = (other.m1, other.m2)
            self.count = other.count
            (self.min, self.max) = (other.min, other.max)
            return self
        else:
            dev = other.m1 - self.m1
            new_count = other.count + self.count
            self.m1 = (self.count * self.m1 + other.count * other.m1) / new_count
            self.m2 = (self.m2 + other.m2 +
                       (dev * dev) * self.count * other.count / new_count)
            self.count = new_count
            self.max = max(self.max, other.max)
            self.min = min(self.min, other.min)
            return self
We can test this code by sampling from a random distribution with known mean and variance. (We're using the Poisson distribution with a $\lambda$ parameter of 7, which should have a mean and variance of 7, but you could try with any other distribution if you wanted.)
from scipy.stats import poisson

sink = StreamMV()
for p in poisson.rvs(7, size=10000):
    sink << p

print(sink.mean(), sink.variance())
We can also use this technique to parallelize the work:
from scipy.stats import poisson

s1, s2 = StreamMV(), StreamMV()
for p in poisson.rvs(7, size=10000):
    s1 << p
for p in poisson.rvs(7, size=10000):
    s2 << p

print("s1 mean %f, variance %f, count %d" % (s1.mean(), s1.variance(), s1.count))
print("s2 mean %f, variance %f, count %d" % (s2.mean(), s2.variance(), s2.count))

s1.merge_from(s2)
print("s1+s2 mean %f, variance %f, count %d" % (s1.mean(), s1.variance(), s1.count))
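If you'd like to convince yourself that the pooled update in `merge_from` is exact rather than approximate, you can check the identity directly on small lists, with no `StreamMV` or random sampling involved. The following is a minimal sketch (the helper names `stats` and `merge` are our own, chosen for illustration): merging per-partition statistics should reproduce exactly what we'd compute from the concatenated data.

```python
# Check Chan's pooled update: given each partition's count, mean, and
# sum of squared deviations (m2), the merged statistics should equal
# those computed directly on the concatenated data.

def stats(xs):
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs)  # sum of squared deviations
    return n, mean, m2

def merge(a, b):
    (na, ma, sa), (nb, mb, sb) = a, b
    n = na + nb
    dev = mb - ma
    mean = (na * ma + nb * mb) / n
    m2 = sa + sb + dev * dev * na * nb / n  # Chan's pooled m2
    return n, mean, m2

xs, ys = [1.0, 4.0, 2.0, 8.0], [5.0, 7.0, 3.0]
merged = merge(stats(xs), stats(ys))
direct = stats(xs + ys)
print(merged)
print(direct)
```

Up to floating-point rounding, the two triples agree, which is exactly why `merge_from` can combine partitions processed independently.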
The mean and variance estimate technique we've just shown has a few things in common with the other techniques we'll look at: