NFStream: a Flexible Network Data Analysis Framework

In [ ]:
import nfstream
print(nfstream.__version__)

NFStream is a Python framework providing fast, flexible, and expressive data structures designed to make working with online or offline network data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world network data analysis in Python. Additionally, it has the broader goal of becoming a common network data analytics framework for researchers, providing data reproducibility across experiments.

  • Performance: NFStream is designed to be fast: parallel processing, native C (using CFFI) for critical computation, and PyPy support.
  • Encrypted layer-7 visibility: NFStream deep packet inspection is based on nDPI. It allows NFStream to perform reliable encrypted application identification and metadata fingerprinting (e.g. TLS, SSH, DHCP, HTTP).
  • Statistical features extraction: NFStream provides state-of-the-art flow-based statistical feature extraction. It includes both post-mortem statistical features (e.g. min, mean, stddev and max of packet size and inter-arrival time) and early flow features (e.g. sequence of the first n packets' sizes, inter-arrival times and directions).
  • Flexibility: NFStream is easily extensible using NFPlugins. It allows creating a new flow feature within a few lines of Python.
  • Machine Learning oriented: NFStream aims to make Machine Learning approaches for network traffic management reproducible and deployable. By using NFStream as a common framework, researchers ensure that models are trained using the same feature computation logic, and thus a fair comparison is possible. Moreover, trained models can be deployed and evaluated on live networks using NFPlugins.

In this notebook, we demonstrate a subset of features provided by nfstream.

In [ ]:
from nfstream import NFStreamer, NFPlugin
import pandas as pd
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)

Flow aggregation made simple

In the following, we are going to use the main object provided by nfstream, NFStreamer, which has the following parameters (a combined usage sketch follows the list):

  • source [default=None]: Packet capture source. Pcap file path or network interface name.
  • decode_tunnels [default=True]: Enable/Disable GTP/TZSP tunnels decoding.
  • bpf_filter [default=None]: Specify a BPF filter for selecting specific traffic.
  • promiscuous_mode [default=True]: Enable/Disable promiscuous capture mode.
  • snapshot_length [default=1500]: Control packet slicing size (truncation) in bytes.
  • idle_timeout [default=15]: Flows that are idle (no packets received) for more than this value in seconds are expired.
  • active_timeout [default=1800]: Flows that are active for more than this value in seconds are expired.
  • accounting_mode [default=0]: Specify the accounting mode that will be used to report bytes-related features (0: Link layer, 1: IP layer, 2: Transport layer, 3: Payload).
  • udps [default=None]: Specify user defined NFPlugins used to extend NFStreamer.
  • n_dissections [default=20]: Number of per-flow packets to dissect for the L7 visibility feature. When set to 0, L7 visibility is disabled.
  • statistical_analysis [default=False]: Enable/Disable post-mortem flow statistical analysis.
  • splt_analysis [default=0]: Specify the number of first packets to analyze for early statistical features (sequence of packet lengths, inter-arrival times and directions). When set to 0, splt_analysis is disabled.
  • n_meters [default=0]: Specify the number of parallel metering processes. When set to 0, NFStreamer will automatically scale metering according to available physical cores on the running host.
  • performance_report [default=0]: Performance report interval in seconds. Disabled when set to 0. Ignored for offline captures.
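
As an illustrative sketch (the parameter values below are arbitrary placeholders, not recommendations), several of these parameters can be combined when building a streamer:

In [ ]:
# Illustrative sketch only: combine several NFStreamer parameters on an offline capture.
example_streamer = NFStreamer(source="tests/pcap/instagram.pcap",
                              bpf_filter="tcp port 443",   # keep only traffic to/from port 443
                              idle_timeout=60,
                              active_timeout=300,
                              accounting_mode=1,           # report bytes at the IP layer
                              statistical_analysis=True,
                              splt_analysis=7)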

NFStreamer returns a flow iterator. We can iterate over flows or convert it directly to a pandas DataFrame using the to_pandas() method.

In [ ]:
df = NFStreamer(source="tests/pcap/instagram.pcap").to_pandas()
In [ ]:
df.head()
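
The streamer can also be consumed directly as a flow iterator; the minimal sketch below (reusing the same pcap, with default dissection so that application_name is populated) simply prints a few core flow attributes:

In [ ]:
# Minimal sketch: iterate over flows instead of converting to a DataFrame.
for flow in NFStreamer(source="tests/pcap/instagram.pcap"):
    print(flow.src_ip, flow.dst_ip, flow.application_name)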

We can enable post-mortem statistical flow features extraction as follows:

In [ ]:
df = NFStreamer(source="tests/pcap/instagram.pcap", statistical_analysis=True).to_pandas()
In [ ]:
df.head()

We can enable early statistical flow features extraction as follows:

In [ ]:
df = NFStreamer(source="tests/pcap/instagram.pcap", splt_analysis=10).to_pandas()
In [ ]:
df.head()

We can enable IP anonymization as follows:

In [ ]:
df = NFStreamer(source="tests/pcap/instagram.pcap", 
                statistical_analysis=True).to_pandas(columns_to_anonymize=["src_ip", "src_mac", "dst_ip", "dst_mac"])
In [ ]:
df.head()

Now that we have our DataFrame, we can start analyzing it like any other dataset. For example, we can compute additional features:

  • Compute the bytes ratio in both directions (src2dst and dst2src)
In [ ]:
df["src2dst_bytes_data_ratio"] = df['src2dst_bytes'] / df['bidirectional_bytes']
df["dst2src_bytes_data_ratio"] = df['dst2src_bytes'] / df['bidirectional_bytes']
In [ ]:
df.head()
  • Filter data according to some criteria:
In [ ]:
df[df["dst_port"] == 443].head()

Extend nfstream

In some use cases, we need to add features that are computed at the packet level. nfstream handles such scenarios using NFPlugin.

  • Let's suppose that we want, per flow, a counter of bidirectional packets with an IP size of exactly 40 bytes.
In [ ]:
class Packet40Count(NFPlugin):
    def on_init(self, pkt, flow):  # flow creation with the first packet
        if pkt.ip_size == 40:
            flow.udps.packet_with_40_ip_size = 1
        else:
            flow.udps.packet_with_40_ip_size = 0

    def on_update(self, pkt, flow):  # flow update with each packet belonging to the flow
        if pkt.ip_size == 40:
            flow.udps.packet_with_40_ip_size += 1
In [ ]:
df = NFStreamer(source="tests/pcap/google_ssl.pcap", udps=[Packet40Count()]).to_pandas()
In [ ]:
df.head()

Our DataFrame now has a new column named udps.packet_with_40_ip_size.
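
As a quick sanity check on this plugin-generated column, we can summarize it (a minimal sketch reusing the DataFrame computed above):

In [ ]:
# Summarize the per-flow counter added by the Packet40Count plugin.
df["udps.packet_with_40_ip_size"].describe()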