#!/usr/bin/env python
# coding: utf-8

# # "CDFs with Seaborn"
# > Plotting the cumulative distribution of latency measurements
#
# - toc: true
# - badges: true
# - comments: false
# - categories: [jupyter, cdf, seaborn]

# In[1]:


#hide
from collections import Counter
import pandas as pd
import seaborn as sns
import random as r

r.seed(42)
sns.set()


# In[2]:


#hide
# generate the dataset
data = []

for path in ['a', 'b', 'c']:
    for timestamp in range(1, 10001):
        latency = -1
        if path == 'a':
            latency = r.normalvariate(30, 3)
        elif path == 'b':
            latency = r.normalvariate(40, 10)
        else:
            # c has a 50/50 chance of a latency around 40 or around 60
            if r.choice([True, False]):
                latency = r.normalvariate(40, 1)
            else:
                latency = r.normalvariate(60, 1)

        data.append({
            'timestamp': timestamp,
            'path': path,
            'latency': latency
        })

df = pd.DataFrame(data)
df[df.path == 'a'].describe(), df[df.path == 'b'].describe(), df[df.path == 'c'].describe()


# # Introduction
#
# During my PhD, I often had to create CDF (cumulative distribution function) plots.
# For example, I use CDF plots in my paper *Managing Latency and Excess Data Dissemination in Fog-Based Publish/Subscribe Systems* ([DOI](https://doi.org/10.1109/ICFC49376.2020.00010)/[Website](https://moewex.github.io/academic/publication/2020-broadcastgroups/)) for reporting latency measurements that were collected by multiple end devices for different data distribution strategies.
#
# In this blog post, I will showcase why CDFs are a particularly good fit for such a use case and how easy it is to generate them with [seaborn](https://seaborn.pydata.org).
# # Exploring the Sample Data
#
# For the purpose of this blog post, I created an artificial sample dataset with latency measurements for three communication paths.

# In[3]:


df.head()


# In[4]:


sns.boxplot(data=df, x='path', y='latency')


# Plotting the data in a boxplot already tells us that communication path a experiences the smallest median latency.
# Communication path b has a slightly lower median latency than c, but more extreme minimum and maximum values.

# In[5]:


sns.relplot(data=df, kind='line', x='timestamp', y='latency', col='path')


# Plotting the latencies as line plots tells us that the latency pattern does not change throughout the experiment.

# In[6]:


sns.relplot(data=df, kind='scatter', x='timestamp', y='latency', col='path')


# The scatter plot reveals something very interesting:
# There are two groups of measurements on communication path c: one with a latency of about 60, and one with a latency of about 40.
# This information is not available from the box or line plot.
# # Calculating the Cumulative Distribution
#
# For a research paper, you typically want as few plots as possible since space is limited.
# Thus, a CDF plot is often a good option since it has a high information density.
# The first step for creating such a plot is to calculate the cumulative distribution of your input data.
# In our case, we want to plot the cumulative distribution of latency measurements for each path.
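#
# To make the target of this calculation explicit, here is the usual definition of the empirical cumulative distribution. The notation ($n$, $x_i$, $\hat{F}$) is introduced here purely for illustration: for the $n$ latency measurements $x_1, \dots, x_n$ of one path,
#
# $$\hat{F}(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}[x_i \le x]$$
#
# i.e., the fraction of measurements with a latency of at most $x$.
# As a small worked example, for the four latencies $1, 2, 2, 3$ we get $\hat{F}(2) = 3/4$, since three of the four values are at most $2$.
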
# In[7]:


# Calculate cumulative distribution for each path
frames = []
paths = df['path'].unique()

for path in paths:
    # create dataframe for each path
    path_df = df[(df['path'] == path)]

    # create dataframe with count per latency for the chosen path
    df_tmp = pd.DataFrame(dict(Counter(path_df['latency'])), index=[0]).T
    # transform index into column
    df_tmp.reset_index(inplace=True)
    # set correct column names
    df_tmp.columns = ['latency', 'count']
    # add a path column
    df_tmp.insert(0, 'path', path)

    # calculate distribution
    df_tmp.sort_values(by='latency', inplace=True)
    df_tmp["cumsum"] = df_tmp["count"].cumsum()
    # named total so that the built-in sum() is not shadowed
    total = df_tmp["count"].sum()
    df_tmp["cumulative_distribution"] = df_tmp["cumsum"] / total

    # collect the result for this path
    frames.append(df_tmp)

# combine all paths (DataFrame.append was removed in pandas 2.0)
df_cdf = pd.concat(frames)

# sort and reset the index just for the aesthetics
df_cdf.sort_values(by=["path", "latency"], inplace=True)
df_cdf.reset_index(inplace=True, drop=True)

# let's check how it looks
df_cdf


# As you can see, the resulting dataframe contains information on how often each latency occurs for each path, as well as the corresponding cumulative distribution.
# Plotting this data is then straightforward.

# In[8]:


sns.lineplot(data=df_cdf, x="latency", y="cumulative_distribution", hue="path")


# This plot tells us quite a lot. By looking at the cumulative distribution, we can easily read off the minimum latency (where the curve leaves 0.0) and the maximum latency (where it reaches 1.0), as well as latency ranges.
# E.g., for path a, roughly two thirds of all measurements are between 27 and 33.
# Furthermore, we can retrieve information on the distribution of values, e.g., b has a higher variance than a.
# We can also identify distinct groups of measurements by looking for *steps* in the distribution function, as for path c.
# In this case, we can confirm that there are two groups, but also learn that both groups have about the same size, i.e., each group contains about 50% of the measurements.
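#
# The cumulative distribution computed in the loop above can also be obtained without an explicit loop.
# The following is a minimal sketch, not the method used for the plots in this post: `add_cdf` is a hypothetical helper name, and the tiny `toy` dataset is made up so that the result can be checked by hand.
# It relies on the pandas group-wise `rank` with `method="max"` and `pct=True`, which gives the fraction of values in a group that are at most the current value.

```python
import pandas as pd


def add_cdf(df, value_col, group_cols):
    """Return a sorted copy of df with an empirical CDF column per group.

    Hypothetical helper for illustration; not part of pandas or seaborn.
    """
    group_cols = list(group_cols)
    out = df.sort_values(group_cols + [value_col]).reset_index(drop=True)
    # rank(method="max") counts how many values in the group are <= the
    # current one; pct=True divides that count by the group size.
    out["cumulative_distribution"] = (
        out.groupby(group_cols)[value_col].rank(method="max", pct=True)
    )
    return out


# Tiny made-up dataset so the result is easy to verify by hand.
toy = pd.DataFrame({
    "path": ["a", "a", "a", "a", "b", "b"],
    "latency": [3.0, 1.0, 2.0, 2.0, 10.0, 5.0],
})
toy_cdf = add_cdf(toy, "latency", ["path"])
print(toy_cdf)
```

# Unlike the loop, this sketch keeps one row per measurement, so tied latencies appear several times with the same cumulative value.
# On the dataset of this post, `add_cdf(df, 'latency', ['path'])` should describe the same curves; seaborn 0.11 and later also provide `sns.ecdfplot(data=df, x='latency', hue='path')`, which draws such a plot directly from the raw measurements.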
#
# If your data has more dimensions, e.g., there could also be a client column that indicates which client sent a request, you only need to make minor modifications.
# To avoid collapsing all dimensions into the *path-lines*, you have to add one nested for-loop for each additional dimension.
# This could then look like so:
#
# ```python
# paths = df['path'].unique()
#
# for path in paths:
#     for client in df['client'].unique():
#         # create dataframe for each path and client
#         path_client_df = df[(df['path'] == path) & (df['client'] == client)]
#
#         # continue as normal, but do not forget to also add a client column
# ```
#
# # Closing Remarks
# In this blog post, I quickly showcased how one can create CDF plots with seaborn, and why they are a particularly good fit for latency measurements.
# The presented approach can also be applied to datasets with more dimensions by adding nested for-loops.