In this notebook we use the zat Python module to explore the functionality that makes it easy to go from Zeek data to Pandas to scikit-learn. Once the data is in a form that scikit-learn can use, we have a wide array of data analysis and machine learning algorithms at our disposal.
All the code in this notebook comes from the examples/bro_to_scikit.py file in the zat repository (https://github.com/SuperCowPowers/zat). If you have any questions or problems, please don't hesitate to open an Issue on GitHub, or even better, submit a PR. :)
# Third Party Imports
import pandas as pd
import sklearn
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import numpy as np
# Local imports
import zat
from zat.log_to_dataframe import LogToDataFrame
from zat.dataframe_to_matrix import DataFrameToMatrix
# Good to print out versions of stuff
print('zat: {:s}'.format(zat.__version__))
print('Pandas: {:s}'.format(pd.__version__))
print('Numpy: {:s}'.format(np.__version__))
print('Scikit Learn Version:', sklearn.__version__)
zat: 0.3.6 Pandas: 0.25.1 Numpy: 1.16.4 Scikit Learn Version: 0.21.2
# Create a Pandas dataframe from a Zeek log
log_to_df = LogToDataFrame()
bro_df = log_to_df.create_dataframe('../data/dns.log')
# Print out the head of the dataframe
bro_df.head()
uid | id.orig_h | id.orig_p | id.resp_h | id.resp_p | proto | trans_id | query | qclass | qclass_name | ... | rcode | rcode_name | AA | TC | RD | RA | Z | answers | TTLs | rejected | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ts | |||||||||||||||||||||
2013-09-15 23:44:27.631940126 | CZGShC2znK1sV7jdI7 | 192.168.33.10 | 1030 | 4.2.2.3 | 53 | udp | 44949 | guyspy.com | 1 | C_INTERNET | ... | 0 | NOERROR | F | F | T | T | 0 | 54.245.228.191 | 36.000000 | F |
2013-09-15 23:44:27.696868896 | CZGShC2znK1sV7jdI7 | 192.168.33.10 | 1030 | 4.2.2.3 | 53 | udp | 50071 | www.guyspy.com | 1 | C_INTERNET | ... | 0 | NOERROR | F | F | T | T | 0 | guyspy.com,54.245.228.191 | 1000.000000,36.000000 | F |
2013-09-15 23:44:28.060639143 | CZGShC2znK1sV7jdI7 | 192.168.33.10 | 1030 | 4.2.2.3 | 53 | udp | 39062 | devrubn8mli40.cloudfront.net | 1 | C_INTERNET | ... | 0 | NOERROR | F | F | T | T | 0 | 54.230.86.87,54.230.86.18,54.230.87.160,54.230... | 60.000000,60.000000,60.000000,60.000000,60.000... | F |
2013-09-15 23:44:28.141794920 | CZGShC2znK1sV7jdI7 | 192.168.33.10 | 1030 | 4.2.2.3 | 53 | udp | 7312 | d31qbv1cthcecs.cloudfront.net | 1 | C_INTERNET | ... | 0 | NOERROR | F | F | T | T | 0 | 54.230.86.87,54.230.86.18,54.230.84.20,54.230.... | 60.000000,60.000000,60.000000,60.000000,60.000... | F |
2013-09-15 23:44:28.422703981 | CZGShC2znK1sV7jdI7 | 192.168.33.10 | 1030 | 4.2.2.3 | 53 | udp | 41872 | crl.entrust.net | 1 | C_INTERNET | ... | 0 | NOERROR | F | F | T | T | 0 | cdn.entrust.net.c.footprint.net,192.221.123.25... | 4993.000000,129.000000,129.000000,129.000000 | F |
5 rows × 22 columns
Now that we have the data in a dataframe there are a million wonderful things we could do for data munging, processing, and analysis, but those will have to wait for another time/notebook.
# Using Pandas we can easily and efficiently compute additional data metrics
# Here we use the vectorized operations of Pandas/Numpy to compute query length
bro_df['query_length'] = bro_df['query'].str.len()
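Beyond query length, another feature that often shows up in DNS analysis is the Shannon entropy of the query string (high-entropy names can indicate DGA or tunneling activity). Here's a minimal sketch of computing it with Pandas; the `shannon_entropy` helper and the tiny stand-in dataframe are illustrative, not part of zat.

```python
import math
from collections import Counter

import pandas as pd

def shannon_entropy(text):
    """Shannon entropy (bits per character) of a string."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Hypothetical mini-frame standing in for bro_df
df = pd.DataFrame({'query': ['guyspy.com', 'devrubn8mli40.cloudfront.net']})
df['query_length'] = df['query'].str.len()
df['query_entropy'] = df['query'].apply(shannon_entropy)
```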
When we look at the DNS records, some of the fields are numerical and some are categorical, so we need a way of handling both data types in a generalized way. zat has a DataFrameToMatrix class that handles a lot of the details and mechanics of combining numerical and categorical data, which we'll use below.
# These are the features we want (note some of these are categorical :)
features = ['AA', 'RA', 'RD', 'TC', 'Z', 'rejected', 'proto', 'qclass_name',
'qtype_name', 'rcode_name', 'query_length']
feature_df = bro_df[features]
feature_df.head()
AA | RA | RD | TC | Z | rejected | proto | qclass_name | qtype_name | rcode_name | query_length | |
---|---|---|---|---|---|---|---|---|---|---|---|
ts | |||||||||||
2013-09-15 23:44:27.631940126 | F | T | T | F | 0 | F | udp | C_INTERNET | A | NOERROR | 10.0 |
2013-09-15 23:44:27.696868896 | F | T | T | F | 0 | F | udp | C_INTERNET | A | NOERROR | 14.0 |
2013-09-15 23:44:28.060639143 | F | T | T | F | 0 | F | udp | C_INTERNET | A | NOERROR | 28.0 |
2013-09-15 23:44:28.141794920 | F | T | T | F | 0 | F | udp | C_INTERNET | A | NOERROR | 29.0 |
2013-09-15 23:44:28.422703981 | F | T | T | F | 0 | F | udp | C_INTERNET | A | NOERROR | 15.0 |
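To give a feel for what the categorical-to-numeric conversion involves, here's a toy sketch using plain Pandas one-hot encoding (`pd.get_dummies`). This is not what DataFrameToMatrix does internally, just an illustration of the general idea: categorical columns become indicator columns while numeric columns pass through.

```python
import pandas as pd

# Toy frame mixing categorical and numeric columns (a stand-in for feature_df)
df = pd.DataFrame({'proto': ['udp', 'udp', 'tcp'],
                   'rcode_name': ['NOERROR', 'NXDOMAIN', 'NOERROR'],
                   'query_length': [10, 14, 28]})

# One-hot encode the categorical columns; numeric columns pass through untouched
matrix_df = pd.get_dummies(df, columns=['proto', 'rcode_name'])
```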
We'll now use a zat scikit-learn transformer class to convert the Pandas DataFrame to a numpy ndarray (matrix). Yes, it's awesome... I'm not sure it's Optimus Prime awesome... but it's still pretty nice.
# Use the zat DataframeToMatrix class (handles categorical data)
# You can see below it uses a heuristic to detect category data. When doing
# this for real we should explicitly convert before sending to the transformer.
to_matrix = DataFrameToMatrix()
bro_matrix = to_matrix.fit_transform(feature_df)
Normalizing column Z... Normalizing column query_length...
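As the comment above notes, for real work it's better to declare categorical columns explicitly rather than rely on the transformer's heuristic. A minimal sketch of that explicit conversion with Pandas (the toy dataframe here is hypothetical):

```python
import pandas as pd

# Toy stand-in for feature_df
df = pd.DataFrame({'proto': ['udp', 'tcp', 'udp'],
                   'Z': [0, 0, 1]})

# Explicitly mark 'proto' as categorical instead of relying on a heuristic
df['proto'] = df['proto'].astype('category')
```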
Now that we have a numpy ndarray(matrix) we're ready to rock with scikit-learn...
# Now we're ready for scikit-learn!
# Just some simple stuff for this example, KMeans and TSNE projection
kmeans = KMeans(n_clusters=5).fit_predict(bro_matrix)
projection = TSNE().fit_transform(bro_matrix)
# Now we can put our ML results back onto our dataframe!
bro_df['x'] = projection[:, 0] # Projection X Column
bro_df['y'] = projection[:, 1] # Projection Y Column
bro_df['cluster'] = kmeans
bro_df[['query', 'proto', 'x', 'y', 'cluster']].head() # Showing the scikit-learn results in our dataframe
query | proto | x | y | cluster | |
---|---|---|---|---|---|
ts | |||||
2013-09-15 23:44:27.631940126 | guyspy.com | udp | 39.936161 | -18.896549 | 0 |
2013-09-15 23:44:27.696868896 | www.guyspy.com | udp | 24.945557 | 3.408248 | 0 |
2013-09-15 23:44:28.060639143 | devrubn8mli40.cloudfront.net | udp | -28.303942 | 2.064817 | 0 |
2013-09-15 23:44:28.141794920 | d31qbv1cthcecs.cloudfront.net | udp | -22.633984 | 6.704806 | 0 |
2013-09-15 23:44:28.422703981 | crl.entrust.net | udp | 18.326878 | -2.018499 | 0 |
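A note on the `n_clusters=5` above: it was chosen by hand for this example. One common way to pick the number of clusters is the silhouette score; here's a sketch on synthetic data (the blobs are made up, just to show the mechanics).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.RandomState(42)
# Synthetic stand-in for bro_matrix: three well-separated blobs in 4 dimensions
X = np.vstack([rng.randn(30, 4) + offset for offset in (0, 10, 20)])

# Score a range of candidate cluster counts
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
```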
# Plotting defaults
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams['font.size'] = 18.0
plt.rcParams['figure.figsize'] = 15.0, 7.0
# Now use dataframe group by cluster
cluster_groups = bro_df.groupby('cluster')
# Plot the Machine Learning results
fig, ax = plt.subplots()
colors = {0:'red', 1:'blue', 2:'green', 3:'orange', 4:'purple', 5:'black'}
for key, group in cluster_groups:
group.plot(ax=ax, kind='scatter', x='x', y='y', alpha=0.5, s=250,
label='Cluster: {:d}'.format(key), color=colors[key])
We crammed a bunch of features, both numerical and categorical, into the clustering algorithm. So did the clustering 'do the right thing'? First, the usual caveats and disclaimers apply: this is a small example dataset, so don't over-interpret the results. With those caveats in mind, let's look at how the clustering did on the combined numeric and categorical data.
With our example data we've successfully gone from Zeek logs to Pandas to scikit-learn. The clusters appear to make sense, and from an investigative and threat-hunting perspective, being able to cluster the data and use PCA for dimensionality reduction might come in handy, depending on your use case.
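Since PCA was imported earlier but only TSNE was used for the projection, here's a quick sketch of swapping in PCA for the 2-D projection instead. The random matrix stands in for `bro_matrix`; with real data you'd pass the transformer output.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# Hypothetical stand-in for bro_matrix: 100 rows, 11 feature columns
X = rng.randn(100, 11)

# Project down to 2 components (a faster, linear alternative to TSNE)
pca = PCA(n_components=2)
projection = pca.fit_transform(X)
```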
# Now print out the details for each cluster
pd.set_option('display.width', 1000)
show_fields = ['query', 'Z', 'proto', 'qtype_name', 'cluster']
for key, group in cluster_groups:
print('\nCluster {:d}: {:d} observations'.format(key, len(group)))
print(group[show_fields].head())
Cluster 0: 42 observations
                                                        query  Z proto qtype_name  cluster
ts
2013-09-15 23:44:27.631940126                      guyspy.com  0   udp          A        0
2013-09-15 23:44:27.696868896                  www.guyspy.com  0   udp          A        0
2013-09-15 23:44:28.060639143    devrubn8mli40.cloudfront.net  0   udp          A        0
2013-09-15 23:44:28.141794920   d31qbv1cthcecs.cloudfront.net  0   udp          A        0
2013-09-15 23:44:28.422703981                 crl.entrust.net  0   udp          A        0

Cluster 1: 3 observations
                               query  Z proto qtype_name  cluster
ts
2013-09-15 23:44:50.827882051    NaN  0   udp        NaN        1
2013-09-15 23:44:50.829301834    NaN  0   udp        NaN        1
2013-09-15 23:44:51.026231050    NaN  0   udp        NaN        1

Cluster 2: 3 observations
                                       query  Z proto qtype_name  cluster
ts
2013-09-15 23:48:05.897021055  j.maxmind.com  0   tcp          A        2
2013-09-15 23:48:06.897021055  j.maxmind.com  0   tcp          A        2
2013-09-15 23:48:07.897021055  j.maxmind.com  0   tcp          A        2

Cluster 3: 3 observations
                                                                           query  Z proto qtype_name  cluster
ts
2013-09-15 23:48:02.497021198  superlongcrazydnsqueryforoutlierdetectionj.max...  0   udp          A        3
2013-09-15 23:48:02.597021103  xyzqsuperlongcrazydnsqueryforoutlierdetectionj...  0   udp          A        3
2013-09-15 23:48:02.697021008  abcsuperlongcrazydnsqueryforoutlierdetectionj....  0   udp          A        3

Cluster 4: 3 observations
                                       query  Z proto qtype_name  cluster
ts
2013-09-15 23:48:02.897021055  j.maxmind.com  1   udp          A        4
2013-09-15 23:48:04.897021055  j.maxmind.com  1   udp          A        4
2013-09-15 23:48:04.997021198  j.maxmind.com  1   udp          A        4
Well, that's it for this notebook. We'll have an upcoming notebook that addresses some of the issues we overlooked in this simple example. We'll use Isolation Forest for anomaly detection, which works well for high-dimensional data; that notebook will cover both training and streaming evaluation against the model.
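As a teaser for that upcoming notebook, here's a minimal Isolation Forest sketch on synthetic data (the data and parameters are illustrative only): `fit_predict` returns -1 for points flagged as anomalies and 1 for inliers.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
# Mostly 'normal' points plus a few far-away outliers
normal = rng.randn(100, 4)
outliers = rng.randn(5, 4) + 10
X = np.vstack([normal, outliers])

# contamination is the expected fraction of anomalies in the data
iso = IsolationForest(contamination=0.05, random_state=42)
labels = iso.fit_predict(X)  # -1 flags anomalies, 1 flags inliers
```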
If you liked this notebook please visit the zat project for more notebooks and examples.