There are a lot of articles explaining why percentiles can't be aggregated without a loss of accuracy, so I won't rehash that here. But there is a good question around what to do when you have no choice but to aggregate them. In that case, what type of aggregation is best?
To begin, let's provide some basic functionality that will help us explore various scenarios.
# for templating and display
from IPython.display import display
from tabulate import tabulate
from math import ceil
def percentile(percentage, data_points):
    """ return the point that is at the desired percentage """
    data_points = sorted(data_points)
    point_at_percentage = ceil(len(data_points) * percentage) - 1
    return data_points[point_at_percentage]
And we can test our percentile with:
import random
# we seed with a constant, to reproduce results. but this notebook should work
# with any value or random generator
random.seed(0)
SAMPLE_SIZE = 1000
random_points = [random.random() for _ in range(SAMPLE_SIZE)]
# should end up being around 0.99, since the points are uniformly random across 0 - 1
percentile(0.99, random_points)
So we've verified this function works. Now comes the fun part. Let's provide a way to randomly distribute these values across multiple bins.
def exponential_random():
    """
    return a number on an exponential scale,
    to emulate web service traffic.
    this will produce a sharply exponential
    distribution from 1 - 100
    """
    return 10 ** (random.random() * 2)
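As a quick sanity check (not part of the analysis itself), we can sample the generator and confirm the claimed range and skew. Note that drawing extra samples advances the seeded generator, so downstream numbers will differ slightly from a run without this cell.

# sample the generator: values should fall in [1, 100),
# skewed toward the low end, with the median near 10 (= 10 ** 1)
samples = sorted(exponential_random() for _ in range(SAMPLE_SIZE))
print(min(samples), max(samples))
print(percentile(0.5, samples))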
def create_bins(count, points):
    """ return <count> bins, with <points> random points per bin """
    all_points = []
    bins = []
    for _ in range(count):
        b = []
        for _ in range(points):
            p = exponential_random()
            all_points.append(p)
            b.append(p)
        bins.append(b)
    return bins, all_points
create_bins(10, 1)
Finally, we need a way to test these aggregations under various conditions. There are a few real-world scenarios that our functions should handle:
It may happen that your percentile bins are not that accurate themselves, due to a low volume of data. For example, if each bin only receives 10 data points, a p99 calculation cannot be accurate, as the last data point represents the worst 10 percent of the points.
This scenario occurs when tracking latencies for web services. If your service is spread across multiple hosts, percentiles are calculated on a per-host basis, and an endpoint is low volume, the number of requests in a short time frame (one second, even one minute) will often not capture enough data.
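We can demonstrate the low-volume problem concretely with the percentile function from above: with only 10 points, every percentage above 0.9 maps to the same final data point.

# with 10 data points, p99 falls on the last (worst) point,
# making it indistinguishable from p91 or the max
ten_points = [random.random() for _ in range(10)]
assert percentile(0.99, ten_points) == max(ten_points)
assert percentile(0.91, ten_points) == percentile(0.99, ten_points)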
A single bin effectively reduces the scenario to the accuracy of the percentile itself. We won't test this scenario since no aggregation is really performed in this case.
So let's write some test code that modulates these variables: the number of bins, the number of points per bin, and the target percentile.
def test_percentile(aggregation_func, percentage, bin_count, point_count):
    """ test our aggregation functionality, under different circumstances """
    bins, points = create_bins(bin_count, point_count)
    percentiles = [
        percentile(percentage, bin_points) for bin_points in bins
    ]
    actual_value = percentile(percentage, points)
    return aggregation_func(percentage, percentiles), actual_value
def benchmark_accuracy(strategies, percentage, bin_count, point_count, trials=100000):
    """
    try each strategy <trials> times, and average the absolute deviation
    from the expected value. return the winner, with the deviation per strategy.
    """
    result_by_strategy = {}
    for strategy in strategies:
        total_deviation = 0
        for _ in range(trials):
            result, expected = test_percentile(strategy, percentage, bin_count, point_count)
            total_deviation += abs(result - expected)
        result_by_strategy[strategy.__name__] = total_deviation / trials
    # note: dicts preserve insertion order in python3.6+ (a language
    # guarantee since 3.7), so the values line up with <strategies>
    winner = None
    min_deviation = float("inf")
    for name, result in result_by_strategy.items():
        if result < min_deviation:
            min_deviation = result
            winner = name
    return [winner] + list(result_by_strategy.values())
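Before running the full benchmark, here is a quick smoke test of the plumbing. take_max is a throwaway stand-in (the real strategies are defined further below), and the tiny trial count is just to keep it fast.

def take_max(_, percentiles):
    # throwaway strategy, only used to exercise benchmark_accuracy
    return max(percentiles)

print(benchmark_accuracy([take_max], 0.99, 10, 10, trials=100))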
BIN_COUNTS = [10, 100]
POINT_COUNTS = [1, 10, 100]
PERCENTAGES = [0.5, 0.95, 0.99]
def benchmark_all():
    data = [["bin_count", "points_per_bin", "percentage", "winner"] + [s.__name__ for s in STRATEGIES]]
    for bin_count in BIN_COUNTS:
        for point_count in POINT_COUNTS:
            for percentage in PERCENTAGES:
                print((bin_count, point_count, percentage))
                result = benchmark_accuracy(
                    STRATEGIES, percentage, bin_count, point_count
                )
                data.append([
                    bin_count,
                    point_count,
                    percentage,
                ] + result)
    print(tabulate(data, headers="firstrow"))
def benchmark_strategy(aggregation_func):
    """ benchmark a single strategy against various scenarios """
    data = [["function", "bin_count", "points_per_bin", "percentage", "result / expected"]]
    for bin_count in BIN_COUNTS:
        for point_count in POINT_COUNTS:
            for percentage in PERCENTAGES:
                result = test_percentile(aggregation_func, percentage,
                                         bin_count, point_count)
                data.append([
                    aggregation_func.__name__,
                    bin_count,
                    point_count,
                    percentage,
                    result,
                ])
    print(tabulate(data, headers="firstrow"))
Now we outline the strategies:
def median(percentage, percentiles):
    """ return the median point """
    percentiles = sorted(percentiles)
    return percentiles[len(percentiles) // 2]

def average(_, percentiles):
    return sum(percentiles) / len(percentiles)

def max_(_, percentiles):
    return max(percentiles)

def min_(_, percentiles):
    return min(percentiles)
STRATEGIES = [median, average, percentile, max_, min_]
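As a quick sanity check, we can run each strategy over a small hand-built list of per-bin percentiles (the values are arbitrary, chosen only so the strategies visibly disagree):

# arbitrary per-bin p99s; each strategy should return a different value
sample_percentiles = [1.0, 2.0, 4.0, 8.0, 100.0]
for strategy in STRATEGIES:
    print(strategy.__name__, strategy(0.99, sample_percentiles))

With the strategies defined, benchmark_strategy(average) from earlier can now be exercised as well.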
So which one will be the best? To help evaluate that, I think there are two limits to consider.
At a low number of data points, the problem reduces to each bin containing a single data point, or none. In this case the percentile strategy is the most accurate: each bin's percentile is just its one data point, so taking the percentile across bins recovers the true percentile, while the median and average strategies would effectively return the median and average of the traffic itself.
At a high number of data points per bin, the bins themselves are a more accurate representation of the percentiles they are capturing. As the number of data points per bin approaches infinity, each bin's percentile converges on the true percentile, so aggregates like the median and average of those values converge as well.
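Here is a rough sketch of that second limit (exact numbers will vary with the random state): with many points per bin, each bin's p99 sits close to the true p99, so even a simple average of the per-bin values lands close to it.

# 10 bins with 10000 points each: the average of the per-bin p99s
# should closely track the p99 of the combined data
bins, points = create_bins(10, 10000)
per_bin_p99s = [percentile(0.99, b) for b in bins]
print(average(0.99, per_bin_p99s), percentile(0.99, points))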
# this will take a while. this is slow in native Python;
# at some point it would be valuable to switch to NumPy
# and use the functions it provides.
benchmark_all()