Analysing network traffic with Pandas

Dirk Loss, http://dirk-loss.de, @dloss

First we get a sample PCAP file:

In [1]:
from IPython.display import HTML
HTML('<iframe src=http://digitalcorpora.org/corpora/scenarios/nitroba-university-harassment-scenario width=600 height=300></iframe>')
Out[1]:
In [2]:
!mkdir -p pcap
In [3]:
cd pcap/
/home/dirk/projects/pcap

Download the PCAP file:

In [4]:
# Just use curl:
# !curl -o nitroba.pcap http://digitalcorpora.org/corp/nps/packets/2008-nitroba/nitroba.pcap
In [5]:
# Or use pure Python:
# import urllib
# urllib.urlretrieve("http://digitalcorpora.org/corp/nps/packets/2008-nitroba/nitroba.pcap", "nitroba.pcap")
In [6]:
ls -l nitroba.pcap
-rw-rw-r-- 1 dirk dirk 56795590 Okt 29  2012 nitroba.pcap

Some initialisation

In [7]:
%pylab inline
Welcome to pylab, a matplotlib-based Python environment [backend: module://IPython.zmq.pylab.backend_inline].
For more information, type 'help(pylab)'.
In [8]:
import pandas as pd

Convert PCAP to CSV using tshark

We can use the tshark command from the Wireshark tool suite to read the PCAP file and convert it into a text file with one row per frame and tab-separated columns, which Pandas can read much like a CSV file:

In [9]:
!tshark -n -r nitroba.pcap -T fields -Eheader=y -e frame.number -e frame.len > frame.len
In [10]:
df=pd.read_table("frame.len")
df
Out[10]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 95175 entries, 0 to 95174
Data columns:
frame.number    95175  non-null values
frame.len       95175  non-null values
dtypes: int64(2)
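
To make the file format concrete, here is a minimal sketch that parses a few lines shaped like the tshark output above. The sample string is a made-up excerpt (a header line plus three frames), not data taken from nitroba.pcap, and pd.read_csv with sep="\t" is used as the explicit equivalent of pd.read_table.

```python
import io

import pandas as pd

# Made-up excerpt of what `tshark -T fields -Eheader=y` prints:
# a header line followed by one tab-separated row per frame.
sample_output = "frame.number\tframe.len\n1\t70\n2\t70\n3\t1421\n"

# read_csv with a tab separator parses the same format as read_table
df = pd.read_csv(io.StringIO(sample_output), sep="\t")

print(df.dtypes)
print(df["frame.len"].describe())
```

Because the header line carries the field names, the columns can be addressed by their Wireshark field names directly.
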
In [11]:
df["frame.len"].describe()
Out[11]:
count    95175.000000
mean       580.748789
std        625.757017
min         42.000000
25%         70.000000
50%         87.000000
75%       1466.000000
max       1466.000000
In [12]:
figsize(10,6)
In [13]:
df["frame.len"].plot(style=".", alpha=0.2)
title("Frame length")
ylabel("bytes")
xlabel("frame number")
Out[13]:
<matplotlib.text.Text at 0xb4f1a2c>

A Python function to read PCAP files into Pandas DataFrames

Here is a convenience function that reads the given fields into a Pandas DataFrame:

In [14]:
import subprocess
import datetime
import pandas as pd

def read_pcap(filename, fields=[], display_filter="", 
              timeseries=False, strict=False):
    """ Read PCAP file into Pandas DataFrame object. 
    Uses tshark command-line tool from Wireshark.

    filename:       Name or full path of the PCAP file to read
    fields:         List of fields to include as columns
    display_filter: Additional filter to restrict frames
    strict:         Only include frames that contain all given fields 
                    (Default: false)
    timeseries:     Create DatetimeIndex from frame.time_epoch 
                    (Default: false)

    Syntax for fields and display_filter is specified in
    Wireshark's Display Filter Reference:
 
      http://www.wireshark.org/docs/dfref/
    """
    if timeseries:
        fields = ["frame.time_epoch"] + fields
    fieldspec = " ".join("-e %s" % f for f in fields)

    # Copy the list so that appending display_filter does not modify fields
    display_filters = list(fields) if strict else []
    if display_filter:
        display_filters.append(display_filter)
    filterspec = "-R '%s'" % " and ".join(f for f in display_filters)

    options = "-r %s -n -T fields -Eheader=y" % filename
    cmd = "tshark %s %s %s" % (options, filterspec, fieldspec)
    proc = subprocess.Popen(cmd, shell = True, 
                                 stdout=subprocess.PIPE)
    if timeseries:
        df = pd.read_table(proc.stdout, 
                        index_col = "frame.time_epoch", 
                        parse_dates=True, 
                        date_parser=datetime.datetime.fromtimestamp)
    else:
        df = pd.read_table(proc.stdout)
    return df
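
The function above assembles one shell command string, so file names or filters containing quotes or spaces can break it. As a possible alternative, the sketch below builds the same options as an argument list that could be passed to subprocess.Popen without shell=True. The helper name build_tshark_args and its filter_flag parameter are hypothetical and not part of the original function. The flag is kept configurable because newer tshark releases expect display filters via -Y rather than -R; check the manual page of the installed version.

```python
def build_tshark_args(filename, fields, display_filter="", filter_flag="-R"):
    """Return the tshark command line as a list of arguments.

    Hypothetical helper: mirrors the options used by read_pcap, but
    leaves out the filter option entirely when no filter is given.
    """
    args = ["tshark", "-r", filename, "-n", "-T", "fields", "-Eheader=y"]
    if display_filter:
        args += [filter_flag, display_filter]
    for field in fields:
        args += ["-e", field]
    return args


args = build_tshark_args("nitroba.pcap", ["frame.time_epoch", "frame.len"], "tcp")
print(args)

# Without a filter, no filter option is emitted at all
no_filter_args = build_tshark_args("nitroba.pcap", ["frame.len"])
print(no_filter_args)
```

Passing a list means each argument reaches tshark unchanged, so no manual quoting of the filter expression is needed.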

Bandwidth

In [15]:
framelen=read_pcap("nitroba.pcap", ["frame.len"], timeseries=True)
framelen
Out[15]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 95175 entries, 2008-07-22 03:51:07.095278 to 2008-07-22 08:13:47.046029
Data columns:
frame.len    95175  non-null values
dtypes: int64(1)
In [16]:
bytes_per_second=framelen.resample("S", how="sum")
bytes_per_second.head()
Out[16]:
frame.len
frame.time_epoch
2008-07-22 03:51:07 20729
2008-07-22 03:51:08 8426
2008-07-22 03:51:09 13565
2008-07-22 03:51:10 NaN
2008-07-22 03:51:11 NaN
In [17]:
bytes_per_second.plot(title="bytes/s")
Out[17]:
<matplotlib.axes.AxesSubplot at 0xb54a2ac>
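
The resample call above uses the API of the pandas version this notebook was run with. In recent pandas versions the aggregation is written as a method call instead of a how= argument. The following sketch shows that form on three made-up frames (the epoch values and lengths are invented for illustration). Note that it converts epochs with pd.to_datetime(unit="s"), which yields UTC times, whereas datetime.fromtimestamp yields local time.

```python
import pandas as pd

# Three invented frames: two in the same second, one two seconds later
epochs = [1216698667.1, 1216698667.5, 1216698669.2]
framelen = pd.DataFrame(
    {"frame.len": [70, 1421, 100]},
    index=pd.to_datetime(epochs, unit="s"),
)

# Sum of frame lengths per one-second bin; empty bins become 0
bytes_per_second = framelen.resample("1s").sum()

# With min_count=1, empty bins become NaN instead of 0
with_gaps = framelen.resample("1s").sum(min_count=1)

print(bytes_per_second)
print(with_gaps)
```

Whether idle seconds should count as 0 bytes or as missing values depends on the plot you want; min_count=1 reproduces the NaN entries shown in the output above.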

TCP Streams

In [18]:
tf=read_pcap("nitroba.pcap", ["tcp.stream", "frame.len"], "tcp", timeseries=True, strict=True)
tf
Out[18]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 83879 entries, 2008-07-22 03:51:07.095278 to 2008-07-22 08:13:47.046029
Data columns:
tcp.stream    83879  non-null values
frame.len     83879  non-null values
dtypes: int64(2)
In [19]:
tf.head()
Out[19]:
tcp.stream frame.len
frame.time_epoch
2008-07-22 03:51:07.095278 0 70
2008-07-22 03:51:07.103728 0 70
2008-07-22 03:51:07.114897 1 1421
2008-07-22 03:51:07.139448 1 70
2008-07-22 03:51:07.319680 1 1284
In [20]:
per_stream=tf.groupby("tcp.stream")
per_stream
Out[20]:
<pandas.core.groupby.DataFrameGroupBy at 0xb71f36c>
In [21]:
bytes_per_stream = per_stream.sum()
bytes_per_stream.head()
Out[21]:
frame.len
tcp.stream
0 280
1 3125
5 6858
6 10316
10 6927
In [22]:
bytes_per_stream.plot()
Out[22]:
<matplotlib.axes.AxesSubplot at 0xb5ca8cc>
In [23]:
bytes_per_stream.max()
Out[23]:
frame.len    5588127
In [24]:
bytes_per_stream.idxmax()
Out[24]:
frame.len    88
In [25]:
bytes_per_stream.ix[88]
Out[25]:
frame.len    5588127
Name: 88

The stream transferring the most data was stream 88, with 5,588,127 bytes (about 5.6 MB).
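
The same group, sum and look-up steps can be sketched on a handful of rows. The values below are invented sample frames, and .loc is used for the final look-up because the .ix indexer is no longer available in current pandas versions.

```python
import pandas as pd

# Invented sample: five frames belonging to two TCP streams
tf = pd.DataFrame({
    "tcp.stream": [0, 0, 1, 1, 1],
    "frame.len": [70, 70, 1421, 70, 1284],
})

# Total bytes per stream, indexed by stream number
bytes_per_stream = tf.groupby("tcp.stream")["frame.len"].sum()

# Stream number with the largest total, then its byte count
largest = bytes_per_stream.idxmax()
print(largest, bytes_per_stream.loc[largest])
```

Selecting the frame.len column before summing gives a Series, so idxmax returns a single stream number rather than one value per column.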

Ethernet Padding

Let's have a look at the padding of the Ethernet frames. Some network cards have leaked data through this padding in the past. For more details, see http://www.securiteam.com/securitynews/5BP01208UO.html

In [27]:
trailer_df = read_pcap("nitroba.pcap", ["eth.src", "eth.trailer"], timeseries=True)
trailer_df
Out[27]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 95175 entries, 2008-07-22 03:51:07.095278 to 2008-07-22 08:13:47.046029
Data columns:
eth.src        95175  non-null values
eth.trailer    12851  non-null values
dtypes: object(2)
In [28]:
trailer=trailer_df["eth.trailer"]
trailer
Out[28]:
frame.time_epoch
2008-07-22 03:51:07.095278    NaN
2008-07-22 03:51:07.103728    NaN
2008-07-22 03:51:07.114897    NaN
2008-07-22 03:51:07.139448    NaN
2008-07-22 03:51:07.319680    NaN
2008-07-22 03:51:07.321990    NaN
2008-07-22 03:51:07.326517    NaN
2008-07-22 03:51:07.335554    NaN
2008-07-22 03:51:07.376171    NaN
2008-07-22 03:51:07.378392    NaN
2008-07-22 03:51:07.389299    NaN
2008-07-22 03:51:07.390478    NaN
2008-07-22 03:51:07.404056    NaN
2008-07-22 03:51:07.416518    NaN
2008-07-22 03:51:07.423663    NaN
...
2008-07-22 08:13:44.266370                  NaN
2008-07-22 08:13:44.266638                  NaN
2008-07-22 08:13:44.293692    00:00:00:00:00:00
2008-07-22 08:13:44.585477                  NaN
2008-07-22 08:13:44.863535                  NaN
2008-07-22 08:13:44.873602                  NaN
2008-07-22 08:13:44.883737                  NaN
2008-07-22 08:13:44.893510                  NaN
2008-07-22 08:13:44.903460                  NaN
2008-07-22 08:13:44.913495                  NaN
2008-07-22 08:13:44.923654                  NaN
2008-07-22 08:13:44.933648                  NaN
2008-07-22 08:13:44.943515                  NaN
2008-07-22 08:13:44.953453                  NaN
2008-07-22 08:13:47.046029                  NaN
Name: eth.trailer, Length: 95175

OK. Most frames do not seem to have padding, but some do. Let's count the occurrences of each value to get an overview:

In [29]:
trailer.value_counts()
Out[29]:
00:00:00:00:00:00                                        7989
3b:02:a7:19:aa:aa:03:00:80:c2:00:07:00:00:00:02:3b:02     913
00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00     606
3b:02:a7:19:00:1d:6b:99:98:6a:88:64:11:00:8f:da:00:42     303
00:00                                                     299
00:00:c0:a8:01:40:00:00:00:00:00:00:00:00:00:1d:d9:2e     259
32:01:67:06:aa:aa:03:00:80:c2:00:07:00:00:00:02:3b:02     254
2d:66:6f:6f:65:05:79:61:68:6f:6f:03:63:6f:6d:00:00:01     253
04:67:6b:64:63:03:75:61:73:03:61:6f:6c:03:63:6f:6d:00     160
70:03:6d:73:67:05:79:61:68:6f:6f:03:63:6f:6d:00:00:01     151
73:6b:03:6d:61:63:03:63:6f:6d:00:00:01:00:01:00:01:00     146
2d:66:6f:6f:62:05:79:61:68:6f:6f:03:63:6f:6d:00:00:01     101
73:6b:03:6d:aa:aa:03:00:80:c2:00:07:00:00:00:02:3b:02      66
72:65:76:73:aa:aa:03:00:80:c2:00:07:00:00:00:02:3b:02      54
00:00:00:00:aa:aa:03:00:80:c2:00:07:00:00:00:02:3b:02      52
...
00:00:08:2a:73:3f                                        1
00:00:00:00:00:00:00:00:00:00:00:00:00:5a:dc:4d:80       1
00:00:98:10:80:29                                        1
00:00:00:00:00:00:00:00:00:00:5e:0c:dc:0d                1
00:00:4d:61:f6:f5                                        1
00:00:01:44:fb:75                                        1
00:00:60:ee:57:10                                        1
00:00:61:fd:78:f7                                        1
00:00:64:ef:99:ce                                        1
6e:74:65:64:aa:aa:03:00:80:c2:00:07:00:00:00:02:3b:02    1
00:00:a4:73:ef:15                                        1
00:00:4c:6b:5d:8b                                        1
00:00:39:ad:69:38                                        1
00:00:ac:aa:46:f1                                        1
00:00:53:8a:e9:05                                        1
Length: 635

Most trailers consist of zeros only, but some contain data. Let's decode the hex strings:

In [30]:
import binascii

def unhex(s, sep=":"):
    return binascii.unhexlify("".join(s.split(sep)))
In [31]:
s=unhex("3b:02:a7:19:aa:aa:03:00:80:c2:00:07:00:00:00:02:3b:02")
s
Out[31]:
';\x02\xa7\x19\xaa\xaa\x03\x00\x80\xc2\x00\x07\x00\x00\x00\x02;\x02'
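
The output above is a Python 2 byte string. On Python 3 the decoded value is a bytes object. A minimal sketch of the same helper for Python 3, assuming the colon-separated format that tshark prints for eth.trailer:

```python
def unhex(s, sep=":"):
    """Decode a separator-delimited hex string such as '3b:02' into bytes."""
    return bytes.fromhex(s.replace(sep, ""))


trailer_bytes = unhex("3b:02:a7:19")
print(trailer_bytes)
```
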
In [32]:
padding = trailer_df.dropna()
In [33]:
padding["unhex"]=padding["eth.trailer"].map(unhex)
In [34]:
def printable(s):
    chars = []
    for c in s:
        if c.isalnum():
            chars.append(c)
        else:
            chars.append(".")
    return "".join(chars)
           
In [35]:
printable("\x95asd\x33")
Out[35]:
'.asd3'
In [36]:
padding["printable"]=padding["unhex"].map(printable)
In [37]:
padding["printable"].value_counts()
Out[37]:
......                8145
..................    1927
......k..j.d.....B     303
..                     299
2.g...............     254
.fooe.yahoo.com...     253
.gkdc.uas.aol.com.     160
p.msg.yahoo.com...     151
sk.mac.com........     148
.foob.yahoo.com...     101
sk.m..............      66
revs..............      54
ge.w..............      45
1.1...............      44
.goo..............      42
...
....X.                1
...v.L                1
..9.Q.                1
...3QU                1
....M.                1
...38.                1
....VI                1
...t..                1
...x.T                1
...z..                1
..t.b.                1
....mm                1
.foo..............    1
....8.                1
..ON.q                1
Length: 375
In [38]:
def ratio_printable(s):
    printable = sum(1.0 for c in s if c.isalnum())
    return printable / len(s)         
In [39]:
ratio_printable("a\x93sdfs")
Out[39]:
0.8333333333333334
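
On Python 3, iterating over a bytes object yields integers, so the two helpers above need a small change when the trailer has been decoded to bytes. The following sketch is one way to write them. The ALNUM constant and the handling of empty input are assumptions added here, not part of the original code.

```python
import string

# Byte values of ASCII letters and digits (assumed definition of "printable")
ALNUM = frozenset((string.ascii_letters + string.digits).encode("ascii"))


def printable(data):
    """Replace every byte that is not an ASCII letter or digit with '.'."""
    return "".join(chr(b) if b in ALNUM else "." for b in data)


def ratio_printable(data):
    """Fraction of bytes that are ASCII letters or digits (0.0 if empty)."""
    if not data:
        return 0.0
    return sum(1 for b in data if b in ALNUM) / len(data)


# One of the trailer values from the value_counts() output, decoded to bytes
trailer = bytes.fromhex("2d666f6f65057961686f6f03636f6d000001")
print(printable(trailer))
print(ratio_printable(b"a\x93sdfs"))
```

Checking against an explicit ASCII set avoids counting bytes such as 0xaa as letters, which could happen if each byte were first converted to a Unicode character.
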
In [40]:
padding["ratio_printable"] = padding["unhex"].map(ratio_printable)
In [41]:
padding[padding["ratio_printable"] > 0.5]
Out[41]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 727 entries, 2008-07-22 03:51:20.018817 to 2008-07-22 05:40:13.338449
Data columns:
eth.src            727  non-null values
eth.trailer        727  non-null values
unhex              727  non-null values
printable          727  non-null values
ratio_printable    727  non-null values
dtypes: float64(1), object(4)
In [42]:
_.printable.value_counts()
Out[42]:
.fooe.yahoo.com...    253
.gkdc.uas.aol.com.    160
p.msg.yahoo.com...    151
.foob.yahoo.com...    101
.weather.com......     31
ge.weather.com....     26
1.1..HOST.239.255.      1
..CDWW                  1
.foof.yahoo.com...      1
..3rbo                  1
..BIKM                  1

Now find out which Ethernet cards sent those packets with more than 50% alphanumeric ASCII characters in their padding:

In [43]:
padding[padding["ratio_printable"] > 0.5]['eth.src'].drop_duplicates()
Out[43]:
frame.time_epoch
2008-07-22 03:51:20.018817    00:1d:d9:2e:4f:61
2008-07-22 04:10:14.155085    00:1d:6b:99:98:68
Name: eth.src
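
The filter step can also be sketched on a small invented table. The ratios below are made up, and the MAC address aa:bb:cc:dd:ee:ff is a placeholder that does not occur in the capture. Besides listing the distinct senders, value_counts shows how many suspicious frames each card sent.

```python
import pandas as pd

# Invented sample rows; ratios and the last MAC address are placeholders
padding = pd.DataFrame({
    "eth.src": [
        "00:1d:d9:2e:4f:61",
        "00:1d:6b:99:98:68",
        "00:1d:d9:2e:4f:61",
        "aa:bb:cc:dd:ee:ff",
    ],
    "ratio_printable": [0.72, 0.83, 0.61, 0.0],
})

# Keep only frames whose padding is mostly alphanumeric
mostly_text = padding[padding["ratio_printable"] > 0.5]

# Distinct sending cards, in order of first appearance
senders = mostly_text["eth.src"].unique().tolist()

# Number of such frames per sending card
frames_per_sender = mostly_text["eth.src"].value_counts()

print(senders)
print(frames_per_sender)
```
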
In [44]:
HTML('<iframe src=http://www.coffer.com/mac_find/?string=00%3A1d%3Ad9%3A2e%3A4f%3A61 width=600 height=300></iframe>')
Out[44]:

That's "Hon Hai Precision" (and "Netopia Inc" for the other MAC address).