Aggregating Percentiles

There's a lot of articles explaning how percentiles can't actually be aggregated without loss of accuracy, so I won't rehash that here.

But, there is a good question around what do when you have no choice but to aggregate them. In that case, what type of aggregation is best?

To begin, let's provide some basic functionality that will help us explore various scenarios.

In [3]:
# for templating and display
from IPython.display import display
from tabulate import tabulate

from math import ceil

def percentile(percentage, data_points):
    """ return the point that is at the desired percentage """
    data_points = sorted(data_points)
    point_at_percentage = ceil(len(data_points) * percentage) - 1
    return data_points[point_at_percentage]

And we can test our percentile with:

In [4]:
import random

# we seed with a constant, to reproduce results. but this notebook should work
# with any value or random generator
random.seed(0)
SAMPLE_SIZE = 1000

random_points = [random.random() for _ in range(SAMPLE_SIZE)]

# should end up being around 0.99, since the points are random across 0 - 100
percentile(0.99, random_points)
Out[4]:
0.9867958186960467

So we've verified this function works. Now comes the fun part. Let's provide a way to randomly distribute these values across multiple bins.

In [6]:
def exponential_random():
    """ 
    return a number on the exponential scale,
    to emulate web service traffic.
    
    this will produce a sharp exponent from
    1 - 100
    """
    return 10 ** (random.random() * 2)

def create_bins(count, points):
    """ return <count> bins, with <points> random points per bin """
    all_points = []
    bins = []
    for _ in range(count):
        b = []
        for _ in range(points):
            p = exponential_random()
            all_points.append(p)
            b.append(p)
        bins.append(b)
    return bins, all_points

create_bins(10, 1)
Out[6]:
([[20.78373073698856],
  [2.0644462457050095],
  [1.6599971633412063],
  [10.182311868152327],
  [39.20323829703609],
  [16.221512415891407],
  [32.322723393505704],
  [3.400298651748122],
  [3.7147164248552023],
  [7.201239971993294]],
 [20.78373073698856,
  2.0644462457050095,
  1.6599971633412063,
  10.182311868152327,
  39.20323829703609,
  16.221512415891407,
  32.322723393505704,
  3.400298651748122,
  3.7147164248552023,
  7.201239971993294])

Finally, we need a way to test these aggregations under various conditions. There's a few real-world scenarios that our functions should handle:

Low Data Per Bin

It may happen that your percentile bins are not that accurate themselves, due to a low volume of data. For example, if each bin only recieves 10 data points, a p99 calculation cannot be accurate, as the last data point represents the worst 10 percent of the points.

This scenario occurs when tracking latencies for web services. If your service is spread across multiple hosts and percentiles are calculated on a per-host basis, and you have a low-volume endpoint, often the number of requests in a short time frame (one second, even one minute) will not capture enough data.

Single Bin

A single bin effectively reduces the scenario to the accuracy of the percentile itself. We won't test this scenario since no aggregation is really performed in this case.

So let's write some test code that modulates:

  • the percentile we're trying to aggregate
  • the number of bins we're aggregating
  • the points per bin
In [7]:
def test_percentile(aggregation_func, percentage, bin_count, point_count):
    """ test our aggregation functionality, under different circumstances """
    bins, points = create_bins(bin_count, point_count)
    percentiles = [
            percentile(percentage, bin_points) for bin_points in bins
    ]
    actual_value = percentile(percentage, points)
    return aggregation_func(percentage, percentiles), actual_value



    
def benchmark_accuracy(strategies, percentage, bin_count, point_count, trials=100000):
    """ 
    Try each strategy <trials> times, and take the average of the accuracy 
    with the expected value. return the winner, with the accurracy per each.
    """
    result_by_strategy = {}
    for strategy in strategies:
        total_deviation = 0
        for _ in range(trials):
            result, expected = test_percentile(strategy, percentage, bin_count, point_count)
            total_deviation += abs(result - expected)
        result_by_strategy[strategy.__name__] = total_deviation / trials
    # note in python3.6 iteration via insertion order is now 
    # a language specification
    winner = None
    min_deviation = trials * percentage
    for name, result in result_by_strategy.items(): 
        if result < min_deviation:
            min_deviation = result
            winner = name
    return [winner] + list(result_by_strategy.values())
In [8]:
BIN_COUNTS = [10, 100]
POINT_COUNTS = [1, 10, 100]
PERCENTAGES = [0.5, 0.95, 0.99]
# PERCENTAGES = [0.5, 0.95, 0.99]
# BIN_COUNTS = [1]
# POINT_COUNTS = [1]


def benchmark_all():
    data = [["bin_count", "points_per_bin", "percentage", "winner"] + [s.__name__ for s in STRATEGIES]]
    for bin_count in BIN_COUNTS:
        for point_count in POINT_COUNTS:
            for percentage in PERCENTAGES:
                print((bin_count, point_count, percentage))
                result = benchmark_accuracy(
                    STRATEGIES, percentage, bin_count, point_count
                )
                data.append([
                    bin_count,
                    point_count,
                    percentage,
                ] + result)
    print(tabulate(data, headers="firstrow"))
  

def benchmark_strategy(aggregation_func):
    """ benchmark a single strategy against various scenarios """
    data = [["function", "bin", "points_per_bin", "percentage / expected", "result"]]
    for bin_count in BIN_COUNTS:
        for point_count in POINT_COUNTS:
            for percentage in PERCENTAGES:
                result = test_percentile(aggregation_func, percentage,
                                         bin_count, point_count)
                data.append([
                    aggregation_func.__name__, 
                    percentage,
                    bin_count,
                    point_count,
                    percentage,
                    result
                ])
    print(tabulate(data, headers="firstrow"))

Now we outline the strategies:

In [9]:
def median(percentage, percentiles):
    """ return the median point """
    sorted(percentiles)
    return percentiles[len(percentiles) // 2]

def average(_, percentiles):
    return sum(percentiles) / len(percentiles)

def max_(_, percentile):
    return max(percentile)

def min_(_, percentile):
    return min(percentile)

STRATEGIES = [median, average, percentile, max_, min_]

Strategy Prediction

So which one will be the best? To help evaluate that, I think there's two limits to consider.

At low data points, the problem reduces down to each bin containing only one, or no, data points. In this case, percentiles are the most accurate, as median and average would be taking the median and average of the bins, which would effectively take the median and average of the traffic.

At a high number of data points per bin, the bins themselves are a more accurate representation of the percentiles they are capturing. As the number of data points reach infinity..

In [10]:
# this will take a while. This is slow in Native Python,
# at some point it would be valuable to switch to Numpy,
# and use the functions inside of it.
benchmark_all()
(10, 1, 0.5)
(10, 1, 0.95)
(10, 1, 0.99)
(10, 10, 0.5)
(10, 10, 0.95)
(10, 10, 0.99)
(10, 100, 0.5)
(10, 100, 0.95)
(10, 100, 0.99)
(100, 1, 0.5)
(100, 1, 0.95)
(100, 1, 0.99)
(100, 10, 0.5)
(100, 10, 0.95)
(100, 10, 0.99)
(100, 100, 0.5)
(100, 100, 0.95)
(100, 100, 0.99)
  bin_count    points_per_bin    percentage  winner        median     average    percentile      max_      min_
-----------  ----------------  ------------  ----------  --------  ----------  ------------  --------  --------
         10                 1          0.5   percentile  16.4145   11.4482         0         59.8732    8.46032
         10                 1          0.95  percentile  48.2928   48.4826         0          0        68.2175
         10                 1          0.99  percentile  48.4232   48.4967         0          0        68.3683
         10                10          0.5   average      5.05107   1.20435        2.43379   14.1471    6.85546
         10                10          0.95  average     16.4724    7.05708       19.1464    19.1643   42.1451
         10                10          0.99  percentile  22.3421   21.5031         4.17321    4.19587  57.2037
         10               100          0.5   average      1.73056   0.226398       0.556829   3.99453   3.05509
         10               100          0.95  average      6.24118   2.65317        8.51557    8.49106  15.766
         10               100          0.99  percentile   4.57526   3.61064        2.80232    2.80631  13.732
        100                 1          0.5   percentile  17.478    11.4759         0         85.6289    8.98846
        100                 1          0.95  percentile  56.1186   54.9946         0         19.1492   75.4618
        100                 1          0.99  percentile  70.006    69.988          0          4.17005  90.4284
        100                10          0.5   average      5.28758   0.408567       2.03797   30.4184    8.09651
        100                10          0.95  average     17.8629    9.18991       18.0849    20.4303   62.4184
        100                10          0.99  percentile  25.7039   25.1357         4.00724    4.45979  78.4014
        100               100          0.5   average      1.81351   0.0785362      0.260815   7.30585   4.42301
        100               100          0.95  average      6.61081   2.90541        8.79043   13.2074   25.0559
        100               100          0.99  percentile   4.91507   3.99404        3.61271    3.93704  23.0564