The first technique we'll introduce isn't a probabilistic structure at all, but it will serve as a warm-up for some of the more involved concepts we'll look at later. Specifically, we'll use Chan's formulas for online mean and variance estimates, so that we can calculate an estimated mean and variance in a single pass over a large data set. As we'll see, this technique also lets us combine estimates from several data sets (e.g., when processing a partitioned collection in parallel).
import math
from sys import float_info

class StreamMV(object):
    def __init__(self, count=0, min=float_info.max,
                 max=-float_info.max, m1=0.0, m2=0.0):
        (self.count, self.min, self.max) = (count, min, max)
        # m1 is the running mean; m2 is the running sum of squared
        # deviations from the mean
        (self.m1, self.m2) = (m1, m2)

    def __lshift__(self, sample):
        """Observe a single sample, updating the running statistics."""
        (self.max, self.min) = (max(self.max, sample), min(self.min, sample))
        dev = sample - self.m1
        self.m1 = self.m1 + (dev / (self.count + 1))
        self.m2 = self.m2 + (dev * dev) * self.count / (self.count + 1)
        self.count += 1
        return self

    def mean(self):
        return self.m1

    def variance(self):
        return self.m2 / self.count

    def stddev(self):
        return math.sqrt(self.variance())

    def merge_from(self, other):
        """Fold another StreamMV's statistics into this one."""
        if other.count == 0:
            return self
        if self.count == 0:
            (self.m1, self.m2) = (other.m1, other.m2)
            self.count = other.count
            (self.min, self.max) = (other.min, other.max)
            return self
        else:
            dev = other.m1 - self.m1
            new_count = other.count + self.count
            self.m1 = (self.count * self.m1 + other.count * other.m1) / new_count
            self.m2 = (self.m2 + other.m2 +
                       (dev * dev) * self.count * other.count / new_count)
            self.count = new_count
            self.max = max(self.max, other.max)
            self.min = min(self.min, other.min)
            return self
We can test this code by sampling from a random distribution with known mean and variance. (We're using the Poisson distribution with a $\lambda$ parameter of 7, which should have a mean and variance of 7, but you could try with any other distribution if you wanted.)
from scipy.stats import poisson

sink = StreamMV()
for p in poisson.rvs(7, size=10000):
    sink << p

print(sink.mean(), sink.variance())
We can also use this technique to parallelize the work:
from scipy.stats import poisson

s1, s2 = StreamMV(), StreamMV()
for p in poisson.rvs(7, size=10000):
    s1 << p
for p in poisson.rvs(7, size=10000):
    s2 << p

print("s1 mean %f, variance %f, count %d" % (s1.mean(), s1.variance(), s1.count))
print("s2 mean %f, variance %f, count %d" % (s2.mean(), s2.variance(), s2.count))

s1.merge_from(s2)
print("s1+s2 mean %f, variance %f, count %d" % (s1.mean(), s1.variance(), s1.count))
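If you'd like to convince yourself that the pooled update in `merge_from` is exact rather than approximate, you can check the identity directly on small lists, with no `StreamMV` or random sampling involved. The following is a minimal sketch (the helper names `stats` and `merge` are our own, chosen for illustration): merging per-partition statistics should reproduce exactly what we'd compute from the concatenated data.

```python
# Check Chan's pooled update: given each partition's count, mean, and
# sum of squared deviations (m2), the merged statistics should equal
# those computed directly on the concatenated data.

def stats(xs):
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs)  # sum of squared deviations
    return n, mean, m2

def merge(a, b):
    (na, ma, sa), (nb, mb, sb) = a, b
    n = na + nb
    dev = mb - ma
    mean = (na * ma + nb * mb) / n
    m2 = sa + sb + dev * dev * na * nb / n  # Chan's pooled m2
    return n, mean, m2

xs, ys = [1.0, 4.0, 2.0, 8.0], [5.0, 7.0, 3.0]
merged = merge(stats(xs), stats(ys))
direct = stats(xs + ys)
print(merged)
print(direct)
```

Up to floating-point rounding, the two triples agree, which is exactly why `merge_from` can combine partitions processed independently.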
The mean and variance estimate technique we've just shown has a few things in common with the other techniques we'll look at: