This notebook presents a case study of the TriScale framework. It revisits the analysis of Blink, an algorithm that detects failures and reroutes traffic directly in the data plane. Parts of this case study are described in the TriScale paper.
import os
import copy
from pathlib import Path
import zipfile
import pandas as pd
import numpy as np
import plotly.graph_objects as go
import triscale
import triplots
The dataset for this case study is available on Zenodo (https://doi.org/10.5281/zenodo.3666724). The wget commands below download the required files to reproduce this case study. Downloading and unzipping should be quick: the .zip file is only ~100 kB.
# Set `download = True` to download (and extract) the data from this case study.
# If needed, adjust the record_id for the file version you are interested in.
# For reproducing the results of the TriScale paper, set `record_id = 3666724`.
download = True
record_id = 3666724  # v3.0.1 (https://doi.org/10.5281/zenodo.3666724)
files = ['UseCase_FailureDetection.zip']

if download:
    for file in files:
        print(file)
        url = 'https://zenodo.org/record/' + str(record_id) + '/files/' + file
        os.system('wget %s' % url)
        if file[-4:] == '.zip':
            with zipfile.ZipFile(file, 'r') as zip_file:
                zip_file.extractall()
    print('Done.')
else:
    print('Nothing to download')
Nothing to download
We now import the custom module for the case study.
import UseCase_FailureDetection.failuredetection as fd
In this case study, 30 prefixes have been selected from each of 15 different Internet traces. For each of these prefixes, 5 artificial traces have been generated, all of which include a failure. We are interested in evaluating how reliably and how quickly Blink detects these failures. The experiment was designed and performed by the authors of the Blink paper; in this case study, we only perform the data analysis, using the TriScale approach to generalize the results.
For each prefix, we compute two metrics: the true positive rate (TPR) of the failure detection, and the speed of the rerouting, in seconds. The computation of metric values is performed by the compute_metrics() function below.
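To make these definitions concrete, here is a minimal sketch of how such per-prefix metrics could be derived from raw detection results. The data layout assumed here (one (detected, detection_time_s) pair per artificial failure trace, with the speed aggregated as a median) is hypothetical; the actual compute_metrics() implementation in the case-study module may differ.
# Hypothetical sketch of the per-prefix metric computation.
# Assumes each prefix comes with one (detected, detection_time_s) pair
# per artificial failure trace; compute_metrics() may differ.
def prefix_metrics(runs):
    detection_times = [t for ok, t in runs if ok]
    tpr = len(detection_times) / len(runs)  # fraction of failures detected
    speed = float(np.median(detection_times)) if detection_times else float('nan')
    return tpr, speed

# Example: 5 artificial traces, 4 detected failures
tpr, speed = prefix_metrics([(True, 1.2), (True, 0.9), (False, None),
                             (True, 1.5), (True, 1.1)])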
# Construct the path to the different test results
result_dir = Path('UseCase_FailureDetection')
config_file = Path('UseCase_FailureDetection/config.yml')
out_file = result_dir / 'metrics.csv'
df = fd.compute_metrics(config_file, result_dir, out_name=out_file)
display(df)
Output retrieved from file. Skipping computation.
|      | Protocol         | Trace | Prefix | TPR | Speed_s  |
|------|------------------|-------|--------|-----|----------|
| 0    | blink            | 1     | 0      | 0.6 | 1.998730 |
| 1    | blink            | 1     | 1      | 1.0 | 1.579861 |
| 2    | blink            | 1     | 2      | 0.0 | NaN      |
| 3    | blink            | 1     | 3      | 1.0 | 1.707236 |
| 4    | blink            | 1     | 4      | 0.8 | 1.419164 |
| ...  | ...              | ...   | ...    | ... | ...      |
| 1345 | infinite_timeout | 15    | 25     | 0.8 | 1.681014 |
| 1346 | infinite_timeout | 15    | 26     | 0.0 | NaN      |
| 1347 | infinite_timeout | 15    | 27     | 0.4 | 2.107471 |
| 1348 | infinite_timeout | 15    | 28     | 1.0 | 0.717849 |
| 1349 | infinite_timeout | 15    | 29     | 1.0 | 0.743098 |

1350 rows × 5 columns
For each set of prefixes, we compute one KPI per metric (TPR and recovery time): a one-sided 95% confidence interval on the median of the metric values.
KPI = { 'percentile' : 50,      # median
        'confidence' : 95,      # confidence level, in %
        'bounds'     : [0,1],   # expected range of the metric values
        'bound'      : 'lower', # one-sided lower bound
        }
out_file = result_dir / 'kpis.csv'
kpis = fd.compute_kpis(df, KPI, config_file, out_file)
Output retrieved from file. Skipping computation.
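Such KPIs can be computed non-parametrically from order statistics: after sorting the n metric values, the k-th smallest value is a valid lower bound for the p-th percentile as soon as the binomial probability of at least k values falling below that percentile reaches the confidence level. The function below is a minimal sketch of this idea, not TriScale's actual code (which lives in the triscale module imported above).
from scipy.stats import binom

# Minimal sketch of a one-sided, non-parametric CI on a percentile;
# see the triscale module for the actual implementation.
def percentile_lower_bound(samples, percentile=50, confidence=95):
    x = np.sort(np.asarray(samples))
    n = len(x)
    p = percentile / 100
    c = confidence / 100
    # P( x_(k) <= true p-th percentile ) = 1 - BinomCDF(k-1; n, p).
    # Keep the largest (1-indexed) k for which this probability >= c.
    k = 0
    for i in range(1, n + 1):
        if 1 - binom.cdf(i - 1, n, p) >= c:
            k = i
    return x[k - 1] if k > 0 else None  # None: too few samples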
We can then plot these KPIs for each of the Internet traces.
figure = fd.plot_TPR(kpis,config_file)
figure.show()
figure = fd.plot_speed(kpis,config_file)
figure.show()
Using TriScale, we can generalize the results. For each trace, the evaluation of Blink on one prefix can be seen as a TriScale run. Since the prefixes are randomly selected from a fixed set, runs are i.i.d. and we can use TriScale’s KPI to derive the expected performance of Blink for any set of prefixes.
We can claim with 95% confidence that, for at least 50% of the prefixes, Blink always detects link failures (TPR = 1) and reroutes traffic within 1 s or less.
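As a sanity check, this claim can be read directly off the KPI table. The snippet below is a sketch: the column names (Protocol, TPR, Speed_s) are assumed to match those of the metrics table, which may not be exactly how fd.compute_kpis() labels its output.
# Hypothetical check of the final claim; the column names are assumptions.
blink_kpis = kpis[kpis['Protocol'] == 'blink']
print('TPR KPI = 1 for all traces:     ', (blink_kpis['TPR'] == 1).all())
print('Speed KPI <= 1 s for all traces:', (blink_kpis['Speed_s'] <= 1).all())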