Dirk Loss, http://dirk-loss.de, @dloss
First we get a sample PCAP file:
from IPython.display import HTML
HTML('<iframe src=http://digitalcorpora.org/corpora/scenarios/nitroba-university-harassment-scenario width=600 height=300></iframe>')
!mkdir -p pcap
cd pcap/
/home/dirk/projects/pcap
# Just use curl:
# !curl -o nitroba.pcap http://digitalcorpora.org/corp/nps/packets/2008-nitroba/nitroba.pcap
# Or use pure Python:
# import urllib
# urllib.urlretrieve("http://digitalcorpora.org/corp/nps/packets/2008-nitroba/nitroba.pcap", "nitroba.pcap")
ls -l nitroba.pcap
-rw-rw-r-- 1 dirk dirk 56795590 Oct 29 2012 nitroba.pcap
%pylab inline
Welcome to pylab, a matplotlib-based Python environment [backend: module://IPython.zmq.pylab.backend_inline]. For more information, type 'help(pylab)'.
import pandas as pd
We can use the tshark command from the Wireshark tool suite to read the PCAP file and convert it into a CSV file:
!tshark -n -r nitroba.pcap -T fields -Eheader=y -e frame.number -e frame.len > frame.len
df=pd.read_table("frame.len")
df
<class 'pandas.core.frame.DataFrame'>
Int64Index: 95175 entries, 0 to 95174
Data columns:
frame.number    95175  non-null values
frame.len       95175  non-null values
dtypes: int64(2)
df["frame.len"].describe()
count    95175.000000
mean       580.748789
std        625.757017
min         42.000000
25%         70.000000
50%         87.000000
75%       1466.000000
max       1466.000000
figsize(10,6)
df["frame.len"].plot(style=".", alpha=0.2)
title("Frame length")
ylabel("bytes")
xlabel("frame number")
<matplotlib.text.Text at 0xb4f1a2c>
Here is a convenience function that reads the given fields into a Pandas DataFrame:
import subprocess
import datetime
import pandas as pd
def read_pcap(filename, fields=[], display_filter="",
              timeseries=False, strict=False):
    """ Read PCAP file into Pandas DataFrame object.
    Uses tshark command-line tool from Wireshark.

    filename:       Name or full path of the PCAP file to read
    fields:         List of fields to include as columns
    display_filter: Additional filter to restrict frames
    strict:         Only include frames that contain all given fields
                    (Default: false)
    timeseries:     Create DatetimeIndex from frame.time_epoch
                    (Default: false)

    Syntax for fields and display_filter is specified in
    Wireshark's Display Filter Reference:

      http://www.wireshark.org/docs/dfref/
    """
    if timeseries:
        fields = ["frame.time_epoch"] + fields
    fieldspec = " ".join("-e %s" % f for f in fields)

    display_filters = fields if strict else []
    if display_filter:
        display_filters.append(display_filter)
    filterspec = "-R '%s'" % " and ".join(display_filters)

    options = "-r %s -n -T fields -Eheader=y" % filename
    cmd = "tshark %s %s %s" % (options, filterspec, fieldspec)
    proc = subprocess.Popen(cmd, shell=True,
                            stdout=subprocess.PIPE)
    if timeseries:
        df = pd.read_table(proc.stdout,
                           index_col="frame.time_epoch",
                           parse_dates=True,
                           date_parser=datetime.datetime.fromtimestamp)
    else:
        df = pd.read_table(proc.stdout)
    return df
framelen=read_pcap("nitroba.pcap", ["frame.len"], timeseries=True)
framelen
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 95175 entries, 2008-07-22 03:51:07.095278 to 2008-07-22 08:13:47.046029
Data columns:
frame.len    95175  non-null values
dtypes: int64(1)
bytes_per_second=framelen.resample("S", how="sum")
bytes_per_second.head()
                     frame.len
frame.time_epoch
2008-07-22 03:51:07      20729
2008-07-22 03:51:08       8426
2008-07-22 03:51:09      13565
2008-07-22 03:51:10        NaN
2008-07-22 03:51:11        NaN
bytes_per_second.plot(title="bytes/s")
<matplotlib.axes.AxesSubplot at 0xb54a2ac>
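As an aside, newer pandas versions removed the `how=` keyword, so the resampling step above would be written as a method call instead. A minimal sketch on synthetic data (the timestamps below are made up for illustration, not taken from the capture):

```python
import pandas as pd

# Synthetic stand-in for the frame.len time series
idx = pd.to_datetime(["2008-07-22 03:51:07.1",
                      "2008-07-22 03:51:07.9",
                      "2008-07-22 03:51:09.5"])
framelen = pd.Series([100, 200, 50], index=idx, name="frame.len")

# Sum the frame lengths in each one-second bin; with the method-style
# API, seconds without any frames become 0 rather than NaN.
bytes_per_second = framelen.resample("1s").sum()
```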
tf=read_pcap("nitroba.pcap", ["tcp.stream", "frame.len"], "tcp", timeseries=True, strict=True)
tf
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 83879 entries, 2008-07-22 03:51:07.095278 to 2008-07-22 08:13:47.046029
Data columns:
tcp.stream    83879  non-null values
frame.len     83879  non-null values
dtypes: int64(2)
tf.head()
                            tcp.stream  frame.len
frame.time_epoch
2008-07-22 03:51:07.095278           0         70
2008-07-22 03:51:07.103728           0         70
2008-07-22 03:51:07.114897           1       1421
2008-07-22 03:51:07.139448           1         70
2008-07-22 03:51:07.319680           1       1284
per_stream=tf.groupby("tcp.stream")
per_stream
<pandas.core.groupby.DataFrameGroupBy at 0xb71f36c>
bytes_per_stream = per_stream.sum()
bytes_per_stream.head()
            frame.len
tcp.stream
0                 280
1                3125
5                6858
6               10316
10               6927
bytes_per_stream.plot()
<matplotlib.axes.AxesSubplot at 0xb5ca8cc>
bytes_per_stream.max()
frame.len 5588127
bytes_per_stream.idxmax()
frame.len 88
bytes_per_stream.ix[88]
frame.len    5588127
Name: 88
The stream that transferred the most data was stream 88 (about 5.6 MB).
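A natural follow-up step would be to look at that stream's payload. Newer tshark versions can dump a single TCP stream with the "-z follow" statistic; the helper below only builds the command string, and whether your tshark supports this option depends on its version, so treat it as an assumption:

```python
def follow_stream_cmd(stream, filename="nitroba.pcap"):
    """Build a tshark command that prints one TCP stream's payload.

    Assumes a tshark version that supports -z follow,tcp,ascii.
    """
    return "tshark -n -q -r %s -z follow,tcp,ascii,%d" % (filename, stream)
```

follow_stream_cmd(88) could then be run via subprocess, just like read_pcap does with its command line.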
Let's have a look at the padding of the Ethernet frames. Some network card drivers have leaked memory contents in their frame padding in the past. For more details, see http://www.securiteam.com/securitynews/5BP01208UO.html
trailer_df = read_pcap("nitroba.pcap", ["eth.src", "eth.trailer"], timeseries=True)
trailer_df
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 95175 entries, 2008-07-22 03:51:07.095278 to 2008-07-22 08:13:47.046029
Data columns:
eth.src        95175  non-null values
eth.trailer    12851  non-null values
dtypes: object(2)
trailer=trailer_df["eth.trailer"]
trailer
frame.time_epoch
2008-07-22 03:51:07.095278                  NaN
2008-07-22 03:51:07.103728                  NaN
2008-07-22 03:51:07.114897                  NaN
2008-07-22 03:51:07.139448                  NaN
2008-07-22 03:51:07.319680                  NaN
2008-07-22 03:51:07.321990                  NaN
2008-07-22 03:51:07.326517                  NaN
2008-07-22 03:51:07.335554                  NaN
2008-07-22 03:51:07.376171                  NaN
2008-07-22 03:51:07.378392                  NaN
2008-07-22 03:51:07.389299                  NaN
2008-07-22 03:51:07.390478                  NaN
2008-07-22 03:51:07.404056                  NaN
2008-07-22 03:51:07.416518                  NaN
2008-07-22 03:51:07.423663                  NaN
...
2008-07-22 08:13:44.266370                  NaN
2008-07-22 08:13:44.266638                  NaN
2008-07-22 08:13:44.293692    00:00:00:00:00:00
2008-07-22 08:13:44.585477                  NaN
2008-07-22 08:13:44.863535                  NaN
2008-07-22 08:13:44.873602                  NaN
2008-07-22 08:13:44.883737                  NaN
2008-07-22 08:13:44.893510                  NaN
2008-07-22 08:13:44.903460                  NaN
2008-07-22 08:13:44.913495                  NaN
2008-07-22 08:13:44.923654                  NaN
2008-07-22 08:13:44.933648                  NaN
2008-07-22 08:13:44.943515                  NaN
2008-07-22 08:13:44.953453                  NaN
2008-07-22 08:13:47.046029                  NaN
Name: eth.trailer, Length: 95175
OK. Most frames do not seem to have padding, but some do. Let's count the occurrences of each value to get an overview:
trailer.value_counts()
00:00:00:00:00:00                                        7989
3b:02:a7:19:aa:aa:03:00:80:c2:00:07:00:00:00:02:3b:02     913
00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00     606
3b:02:a7:19:00:1d:6b:99:98:6a:88:64:11:00:8f:da:00:42     303
00:00                                                     299
00:00:c0:a8:01:40:00:00:00:00:00:00:00:00:00:1d:d9:2e     259
32:01:67:06:aa:aa:03:00:80:c2:00:07:00:00:00:02:3b:02     254
2d:66:6f:6f:65:05:79:61:68:6f:6f:03:63:6f:6d:00:00:01     253
04:67:6b:64:63:03:75:61:73:03:61:6f:6c:03:63:6f:6d:00     160
70:03:6d:73:67:05:79:61:68:6f:6f:03:63:6f:6d:00:00:01     151
73:6b:03:6d:61:63:03:63:6f:6d:00:00:01:00:01:00:01:00     146
2d:66:6f:6f:62:05:79:61:68:6f:6f:03:63:6f:6d:00:00:01     101
73:6b:03:6d:aa:aa:03:00:80:c2:00:07:00:00:00:02:3b:02      66
72:65:76:73:aa:aa:03:00:80:c2:00:07:00:00:00:02:3b:02      54
00:00:00:00:aa:aa:03:00:80:c2:00:07:00:00:00:02:3b:02      52
...
00:00:08:2a:73:3f                                           1
00:00:00:00:00:00:00:00:00:00:00:00:00:5a:dc:4d:80          1
00:00:98:10:80:29                                           1
00:00:00:00:00:00:00:00:00:00:5e:0c:dc:0d                   1
00:00:4d:61:f6:f5                                           1
00:00:01:44:fb:75                                           1
00:00:60:ee:57:10                                           1
00:00:61:fd:78:f7                                           1
00:00:64:ef:99:ce                                           1
6e:74:65:64:aa:aa:03:00:80:c2:00:07:00:00:00:02:3b:02       1
00:00:a4:73:ef:15                                           1
00:00:4c:6b:5d:8b                                           1
00:00:39:ad:69:38                                           1
00:00:ac:aa:46:f1                                           1
00:00:53:8a:e9:05                                           1
Length: 635
Mostly zeros, but some data. Let's decode the hex strings:
import binascii
def unhex(s, sep=":"):
return binascii.unhexlify("".join(s.split(sep)))
s=unhex("3b:02:a7:19:aa:aa:03:00:80:c2:00:07:00:00:00:02:3b:02")
s
';\x02\xa7\x19\xaa\xaa\x03\x00\x80\xc2\x00\x07\x00\x00\x00\x02;\x02'
padding = trailer_df.dropna().copy()  # copy so we can add columns without a SettingWithCopyWarning
padding["unhex"]=padding["eth.trailer"].map(unhex)
def printable(s):
chars = []
for c in s:
if c.isalnum():
chars.append(c)
else:
chars.append(".")
return "".join(chars)
printable("\x95asd\x33")
'.asd3'
padding["printable"]=padding["unhex"].map(printable)
padding["printable"].value_counts()
......                8145
..................    1927
......k..j.d.....B     303
..                     299
2.g...............     254
.fooe.yahoo.com...     253
.gkdc.uas.aol.com.     160
p.msg.yahoo.com...     151
sk.mac.com........     148
.foob.yahoo.com...     101
sk.m..............      66
revs..............      54
ge.w..............      45
1.1...............      44
.goo..............      42
...
....X.                   1
...v.L                   1
..9.Q.                   1
...3QU                   1
....M.                   1
...38.                   1
....VI                   1
...t..                   1
...x.T                   1
...z..                   1
..t.b.                   1
....mm                   1
.foo..............       1
....8.                   1
..ON.q                   1
Length: 375
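The helpers above are written for Python 2, where unhexlify returns a str. On Python 3, unhexlify returns bytes and iterating over bytes yields integers, so the same idea would be sketched like this (note that str.isalnum follows Unicode rules, so bytes above 0x7f may classify slightly differently than under Python 2):

```python
import binascii

def unhex3(s, sep=":"):
    """Decode a colon-separated hex string into bytes."""
    return binascii.unhexlify("".join(s.split(sep)))

def printable3(b):
    """Replace every non-alphanumeric byte with a dot."""
    return "".join(chr(c) if chr(c).isalnum() else "." for c in b)
```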
def ratio_printable(s):
printable = sum(1.0 for c in s if c.isalnum())
return printable / len(s)
ratio_printable("a\x93sdfs")
0.8333333333333334
padding["ratio_printable"] = padding["unhex"].map(ratio_printable)
padding[padding["ratio_printable"] > 0.5]
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 727 entries, 2008-07-22 03:51:20.018817 to 2008-07-22 05:40:13.338449
Data columns:
eth.src            727  non-null values
eth.trailer        727  non-null values
unhex              727  non-null values
printable          727  non-null values
ratio_printable    727  non-null values
dtypes: float64(1), object(4)
_.printable.value_counts()
.fooe.yahoo.com...    253
.gkdc.uas.aol.com.    160
p.msg.yahoo.com...    151
.foob.yahoo.com...    101
.weather.com......     31
ge.weather.com....     26
1.1..HOST.239.255.      1
..CDWW                  1
.foof.yahoo.com...      1
..3rbo                  1
..BIKM                  1
Now let's find out which Ethernet cards sent those frames with more than 50% printable ASCII data in their padding:
padding[padding["ratio_printable"] > 0.5]['eth.src'].drop_duplicates()
frame.time_epoch
2008-07-22 03:51:20.018817    00:1d:d9:2e:4f:61
2008-07-22 04:10:14.155085    00:1d:6b:99:98:68
Name: eth.src
HTML('<iframe src=http://www.coffer.com/mac_find/?string=00%3A1d%3Ad9%3A2e%3A4f%3A61 width=600 height=300></iframe>')
That's "Hon Hai Precision" (and "Netopia Inc" for the other MAC address).
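The lookup done via the coffer.com iframe can also be sketched in code: the first three octets of a MAC address form the OUI, which identifies the vendor. The two table entries below just encode the results from the lookup above; a real tool would load the full IEEE OUI registry instead.

```python
# Minimal OUI table; only the two prefixes seen in this capture.
OUI_VENDORS = {
    "00:1d:d9": "Hon Hai Precision",
    "00:1d:6b": "Netopia Inc",
}

def oui_vendor(mac):
    """Return the registered vendor for a MAC address, if known."""
    # "00:1d:d9:2e:4f:61" -> "00:1d:d9"
    return OUI_VENDORS.get(mac.lower()[:8], "unknown")
```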