In [41]:
import numpy as np
import scipy.stats as spstats

%pylab inline
Populating the interactive namespace from numpy and matplotlib

The following question came up at work: given that I can only sample $n\%$ of my logs, what is the correct size of the error bars for an estimate of some metric? As an example, let's say I can only capture $5\%$ of my logs. From those logs, I estimate that $30\%$ of users click on some widget. What is the $95\%$ confidence bound for the actual click rate in the entire sample?

This post walks through a few assumptions and a way to estimate the overall metric with appropriate error bars.

Step 1: Be sure that your aggregation unit produces independent samples.

If your metric is "percent of users who do X" then you should be able to assume that your users are independent. This isn't always true. An example where it might fail is if your domain includes 1000 tenants each with 1000 users. If you oversample from one tenant, then users may not be independent. Another case that could cause trouble is if you sample sessions and a user can have more than one session. In that case, sessions from a single user are probably not independent.
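A small simulation can make this concrete. Here I invent a clustered population (the per-tenant Beta click rates and all parameters below are assumptions for illustration, not real log data) and compare sampling users uniformly against sampling whole tenants:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population: 1000 tenants x 1000 users, where each tenant has
# its own click rate (Beta(3, 7), mean 0.3), so users within a tenant are
# correlated.
n_tenants, users_per_tenant = 1000, 1000
tenant_rates = rng.beta(3, 7, n_tenants)
population = rng.random((n_tenants, users_per_tenant)) < tenant_rates[:, None]
users = population.ravel()

uniform_est, cluster_est = [], []
for _ in range(100):
    # (a) 5000 users sampled uniformly at random
    idx = rng.choice(users.size, 5000, replace=False)
    uniform_est.append(users[idx].mean())
    # (b) 5 whole tenants: the same number of users, but clustered
    tenants = rng.choice(n_tenants, 5, replace=False)
    cluster_est.append(population[tenants].mean())

# The clustered estimates are far more variable than the uniform ones.
print(np.std(uniform_est), np.std(cluster_est))
```

The binomial error bars derived below assume the uniform case; under clustered sampling like (b), they would be far too narrow.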

Step 2: Treat your metric as the mean of some distribution

If your metric is "percent of users who do X" then you are in luck. You can model each user's decision as a Bernoulli trial, which has only two possible outcomes. The total number of users who do X is then a binomial random variable, which is completely described by the probability of success for a single trial and the number of trials.
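As a quick sanity check (the trial counts here are illustrative, not tied to the log numbers above), we can verify that a sum of Bernoulli trials behaves the way the binomial distribution predicts:

```python
import numpy as np

rng = np.random.default_rng(42)

n, p = 500, 0.3   # trials per experiment, per-trial success probability
reps = 5000       # number of simulated experiments

# Each row is one experiment of n independent Bernoulli trials; summing a
# row gives the number of successes, a binomial(n, p) random variable.
counts = (rng.random((reps, n)) < p).sum(axis=1)

# Binomial theory predicts mean n*p = 150 and variance n*p*(1-p) = 105.
print(counts.mean(), counts.var())
```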

Let's see some code. Below, I create a population of users then I sample to get my observed population.

In [172]:
total_population_size = 100000
sample_size = 0.05

known_metric = .3  # 30% of users click the widget.

def get_samples(sample_ratio=sample_size):
    total_population = np.random.uniform(0, 1, total_population_size) < known_metric
    sampled_population = np.random.choice(total_population, int(total_population_size * sample_ratio), replace=False)
    return total_population, sampled_population
    
total_population, sampled_population = get_samples()    

print "Actual population metric is {0}".format(total_population.sum() / float(total_population_size))
print "Sampled population metric is {0}".format(sum(sampled_population) / float(len(sampled_population)))
Actual population metric is 0.30213
Sampled population metric is 0.3022

The sample estimate looks pretty close, but what we really care about is the confidence interval around it. In Python, we'll use scipy to get a confidence interval around the sample mean. See http://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval for several algorithms for computing such an interval.
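One of those algorithms, the normal-approximation (Wald) interval, is simple enough to compute by hand and makes a useful cross-check; the counts below are illustrative stand-ins for a 5% sample:

```python
import numpy as np

n_sampled = 5000   # users in the 5% sample
p_hat = 0.30       # observed click rate in the sample

# Wald interval: p_hat +/- 1.96 * sqrt(p_hat * (1 - p_hat) / n)
se = np.sqrt(p_hat * (1 - p_hat) / n_sampled)
ci = (p_hat - 1.96 * se, p_hat + 1.96 * se)
print(ci)  # roughly (0.287, 0.313)
```

Note that the interval's width is governed by the number of *sampled* users, not the full population size.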

In [55]:
sample_mean = sum(sampled_population) / float(len(sampled_population))
# The trial count must be the number of *sampled* users; using the full
# population size here would make the interval far too narrow.
np.array(spstats.binom.interval(.95, len(sampled_population), sample_mean)) / len(sampled_population)
In [173]:
in_ci = []
ci_size = []
estimate = []
for i in range(1000):
    total_population, sampled_population = get_samples()  # in reality, you only have sampled_population.
    
    # compute your metric.
    sample_mean = sum(sampled_population) / float(len(sampled_population))
    
    estimate.append(sample_mean)
    
    # compute a confidence interval.
    ci = np.array(spstats.binom.interval(.95, len(sampled_population), sample_mean )) 
    
    # Scale counts up to the full population by dividing by the sample ratio
    # (equivalently, multiplying by 1/0.05 = 20).
    ci_full_population = ci / sample_size
    
    # Record whether our CI includes the known population mean.
    in_ci.append( ci_full_population[0] <= sum(total_population) <= ci_full_population[1])
    
    ci_size.append(ci_full_population)

This next computation is the most important one. Here, we examine how often our confidence interval contains the actual population value. We should see that this is around 95%. If it's too low, then our confidence interval is not wide enough and we are overstating our precision. If it's too high, then we are being overly conservative.

In [174]:
print "Your 95% CI contains the population value {0}% of the time".format(sum(in_ci)/ float(len(in_ci)) * 100)
print "This should be near 95%."
Your 95% CI contains the population value 96.7% of the time
This should be near 95%.
In [175]:
ci_example = ci_size[0]

print "Here's one sample:"
print "Estimated number of users who did the action is {0}".format(estimate[0] * total_population_size)
print "Its confidence interval is {0}".format(ci_example)
print "So the width of the confidence interval is {0:.1f}% of the metric value.".format((ci_example[1] - ci_example[0]) / estimate[0] / total_population_size * 100)
Here's one sample:
Estimated number of users who did the action is 29020.0
Its confidence interval is [ 27760.  30280.]
So the width of the confidence interval is 8.7% of the metric value.

Danger Zone: What if your sample rate is also estimated?

In this simulation, we also allow the sampling rate to be a random variable: the realized rate is drawn from a uniform distribution centered on the intended value.

In [176]:
in_ci = []
ci_size = []
estimate = []

hyperparameter_mean_sample_size = sample_size  # same as above
hyperparameter_sample_size_width = 0.005  # So the realized sample rate is between 0.045 and 0.055.

for i in range(100):
    
    local_sample_size = np.random.uniform(hyperparameter_mean_sample_size - hyperparameter_sample_size_width,
                                          hyperparameter_mean_sample_size + hyperparameter_sample_size_width)
    
    total_population, sampled_population = get_samples(local_sample_size)  # in reality, you only have sampled_population.
    
    # compute your metric.
    sample_mean = sum(sampled_population) / float(len(sampled_population))
    
    estimate.append(sample_mean)
    
    # compute a confidence interval.
    ci = np.array(spstats.binom.interval(.95, len(sampled_population), sample_mean )) 
    
    # Scale up by the *intended* sample ratio -- in practice we don't know
    # the realized ratio for this run, and that mismatch is the danger.
    ci_full_population = ci / sample_size
    
    # Record whether our CI includes the known population mean.
    in_ci.append( ci_full_population[0] <= sum(total_population) <= ci_full_population[1])
    
    ci_size.append(ci_full_population)
    
In [177]:
print "Your 95% CI contains the population value {0}% of the time".format(sum(in_ci)/ float(len(in_ci)) * 100)
print "This should be near 95%."
Your 95% CI contains the population value 37.0% of the time
This should be near 95%.
In [178]:
ci_example = ci_size[0]

print "Here's one sample:"
print "Estimated number of users who did the action is {0}".format(estimate[0] * total_population_size)
print "Its confidence interval is {0}".format(ci_example)
print "So the width of the confidence interval is {0:.1f}% of the metric value.".format((ci_example[1] - ci_example[0]) / estimate[0] / total_population_size * 100)
Here's one sample:
Estimated number of users who did the action is 29654.7472256
Its confidence interval is [ 27620.  30120.]
So the width of the confidence interval is 8.4% of the metric value.

Notice how poorly our confidence interval performs now. In other words, if we say "we're 95% sure the true population value is in this range", we'll be wrong nearly two-thirds of the time.
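A back-of-the-envelope calculation shows why. Scaling by the assumed 5% rate when the realized rate can fall anywhere in [0.045, 0.055] introduces a bias of up to ±10% of the count, which dwarfs the roughly ±4% half-width of the binomial interval (the counts below are illustrative):

```python
# The sampled count is proportional to the realized rate, but we divide by
# the assumed rate; the mismatch is pure bias that the binomial CI ignores.
assumed_rate = 0.05
true_count = 30000  # users who clicked, in a 100k population at 30%

estimates = {}
for realized_rate in (0.045, 0.05, 0.055):
    expected_sampled_count = true_count * realized_rate
    estimates[realized_rate] = expected_sampled_count / assumed_rate

print(estimates)  # roughly 27000, 30000, and 33000
```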

How wide would a 95% confidence interval need to be when we don't know the exact sampling rate?

A Quick Overview of Bootstrapping

In this simulation, we repeatedly sample from the same population and compute our statistics each time. (Strictly speaking, this resamples from the full population rather than from a single observed sample, so it is closer to a Monte Carlo simulation than a textbook bootstrap, but the percentile-interval idea is the same.)

In [213]:
def generate_bootstrap_sample():

    total_population = np.random.uniform(0, 1, total_population_size) < known_metric

    total_population_estimates = []

    for i in xrange(400):
        local_sample_size = np.random.uniform(hyperparameter_mean_sample_size - hyperparameter_sample_size_width,
                                              hyperparameter_mean_sample_size + hyperparameter_sample_size_width)


        sampled_population = np.random.choice(total_population, int(total_population_size * local_sample_size), replace=False)

        sampled_count = sum(sampled_population)
        total_population_estimate = sampled_count / local_sample_size
        total_population_estimates.append(total_population_estimate)
    return total_population_estimates, sum(total_population)
total_population_estimates, population_sum = generate_bootstrap_sample()
In [214]:
population_sums = []
estimate_sums = []
lower_cis = []
upper_cis = []
in_ci = []

for i in xrange(50):
    total_population_estimates, population_sum = generate_bootstrap_sample()
    mean_estimate = np.mean(total_population_estimates)
    total_population_estimates.sort()
    lowerci = total_population_estimates[int(len(total_population_estimates) * 0.025)]
    upperci = total_population_estimates[int(len(total_population_estimates) * 0.975)]
    
    population_sums.append(population_sum)
    estimate_sums.append(mean_estimate)
    lower_cis.append(lowerci)
    upper_cis.append(upperci)
    in_ci.append(lowerci <= population_sum <= upperci)
In [204]:
print population_sums[:5]
[29929, 29946, 30010, 29936, 29903]
In [217]:
lower_cis[:5]
Out[217]:
[28686.276820705003,
 28731.836761777038,
 28757.667290529022,
 28373.215784100892,
 28609.294243885099]
In [218]:
upper_cis[:5]
Out[218]:
[31207.940513838243,
 31216.671386798451,
 31320.136972723605,
 30914.458254602523,
 31200.514462220057]
In [208]:
estimate_sums[:5]
Out[208]:
[29938.512436863024,
 29919.828022314738,
 30042.685693425225,
 29934.414132145183,
 29896.877562620706]
In [215]:
sum(in_ci), len(in_ci)
Out[215]:
(50, 50)
In [216]:
# Index of the 97.5th-percentile cutoff among the 400 sorted estimates.
int(len(total_population_estimates) * 0.975)
Out[216]:
390
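The manual sort-and-index arithmetic above can also be written with np.percentile, which interpolates between order statistics; the synthetic estimates below just stand in for total_population_estimates:

```python
import numpy as np

# Stand-in for total_population_estimates: 400 synthetic estimates.
rng = np.random.default_rng(0)
total_population_estimates = rng.normal(30000, 650, 400)

lowerci, upperci = np.percentile(total_population_estimates, [2.5, 97.5])
print(lowerci, upperci)
```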