In this notebook we're going to be using the bat Python module for processing, transformation and anomaly detection on Bro network data. We're going to look at 'normal' http traffic and demonstrate the use of Isolation Forests for anomaly detection. We'll then explore those anomalies with clustering and PCA.
Software
Techniques
Related Notebooks
Note: A previous version of this notebook used a large http log (1 million rows) but we wanted people to be able to run the notebook themselves, so we've changed it to run on the local example http.log.
import bat
from bat import log_to_dataframe
from bat import dataframe_to_matrix
print('bat: {:s}'.format(bat.__version__))
import pandas as pd
print('Pandas: {:s}'.format(pd.__version__))
import numpy as np
print('Numpy: {:s}'.format(np.__version__))
import sklearn
from sklearn.ensemble import IsolationForest
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
print('Scikit Learn Version:', sklearn.__version__)
bat: 0.2.1
Pandas: 0.19.2
Numpy: 1.12.1
Scikit Learn Version: 0.18.1
# Create a Pandas dataframe from the Bro HTTP log
bro_df = log_to_dataframe.LogToDataFrame('../data/http.log')
print('Read in {:d} Rows...'.format(len(bro_df)))
bro_df.head()
Successfully monitoring ../data/http.log...
Read in 150 Rows...
| | filename | host | id.orig_h | id.orig_p | id.resp_h | id.resp_p | info_code | info_msg | method | orig_fuids | ... | response_body_len | status_code | status_msg | tags | trans_depth | ts | uid | uri | user_agent | username |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | - | guyspy.com | 192.168.33.10 | 1031 | 54.245.228.191 | 80 | - | - | GET | - | ... | 184 | 301 | Moved Permanently | (empty) | 1 | 2013-09-15 19:44:27.668082 | CyIaMO7IheOh38Zsi | / | Mozilla/4.0 (compatible; MSIE 8.0; Windows NT ... | - |
1 | - | www.guyspy.com | 192.168.33.10 | 1032 | 54.245.228.191 | 80 | - | - | GET | - | ... | 100631 | 200 | OK | (empty) | 1 | 2013-09-15 19:44:27.731702 | CoyZrY2g74UvMMgp4a | / | Mozilla/4.0 (compatible; MSIE 8.0; Windows NT ... | - |
2 | - | www.guyspy.com | 192.168.33.10 | 1032 | 54.245.228.191 | 80 | - | - | GET | - | ... | 55817 | 404 | Not Found | (empty) | 2 | 2013-09-15 19:44:28.092922 | CoyZrY2g74UvMMgp4a | /wp-content/plugins/slider-pro/css/advanced-sl... | Mozilla/4.0 (compatible; MSIE 8.0; Windows NT ... | - |
3 | - | www.guyspy.com | 192.168.33.10 | 1040 | 54.245.228.191 | 80 | - | - | GET | - | ... | 887 | 200 | OK | (empty) | 1 | 2013-09-15 19:44:28.150301 | CiCKTz4e0fkYYazBS3 | /wp-content/plugins/contact-form-7/includes/cs... | Mozilla/4.0 (compatible; MSIE 8.0; Windows NT ... | - |
4 | - | www.guyspy.com | 192.168.33.10 | 1041 | 54.245.228.191 | 80 | - | - | GET | - | ... | 10068 | 200 | OK | (empty) | 1 | 2013-09-15 19:44:28.150602 | C1YBkC1uuO9bzndRvh | /wp-content/plugins/slider-pro/css/slider/adva... | Mozilla/4.0 (compatible; MSIE 8.0; Windows NT ... | - |
5 rows × 27 columns
Yep it was quick... the two little lines of code above turned a Bro log (any log) into a Pandas DataFrame. The bat package also supports streaming data from dynamic/active logs, handles log rotations and in general tries to make your life a bit easier when doing data analysis and machine learning on Bro data.
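For readers curious about what such a conversion involves, here is a rough sketch that parses a Bro-style tab-separated log by hand. This is not bat's implementation: the `bro_log_to_df` helper and the tiny `SAMPLE_LOG` text are made up for illustration, and the sketch ignores streaming, log rotation, and type inference from the `#types` line.

```python
import pandas as pd

# Hypothetical, hand-written sample in the Bro TSV layout (not real data)
SAMPLE_LOG = "\n".join([
    "#separator \\x09",
    "#fields\tts\tuid\tmethod\trequest_body_len",
    "#types\ttime\tstring\tstring\tcount",
    "1379274267.668082\tCaaaaaaaaaaaaaaaa1\tGET\t0",
    "1379274267.731702\tCaaaaaaaaaaaaaaaa2\tGET\t0",
    "#close\t2013-09-15-19-44-30",
])

def bro_log_to_df(text):
    """Illustrative parser: take column names from '#fields', skip other '#' lines."""
    field_names = None
    rows = []
    for line in text.splitlines():
        if line.startswith('#fields'):
            field_names = line.split('\t')[1:]
        elif line and not line.startswith('#'):
            rows.append(line.split('\t'))
    return pd.DataFrame(rows, columns=field_names)

demo_df = bro_log_to_df(SAMPLE_LOG)
# Everything is parsed as strings, so convert the columns we care about
demo_df['request_body_len'] = demo_df['request_body_len'].astype(int)
demo_df['ts'] = pd.to_datetime(demo_df['ts'].astype(float), unit='s')
print(demo_df)
```

In practice the bat loader is the better choice since it already deals with the details this sketch skips.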
Now that we have the data in a dataframe there are a million wonderful things we could do for data munging, processing and analysis but that will have to wait for another time/notebook.
# We're going to pick some features that might be interesting
# some of the features are numerical and some are categorical
features = ['id.resp_p', 'method', 'resp_mime_types', 'request_body_len']
When we look at the http records, some of the data is numerical and some of it is categorical, so we'll need a way of handling both data types in a generalized way. bat has a DataFrameToMatrix class that handles a lot of the details and mechanics of combining numerical and categorical data, which we'll use below.
# Show the dataframe with mixed feature types
bro_df[features].head()
| | id.resp_p | method | resp_mime_types | request_body_len |
---|---|---|---|---|
0 | 80 | GET | text/html | 0 |
1 | 80 | GET | text/html | 0 |
2 | 80 | GET | text/html | 0 |
3 | 80 | GET | text/plain | 0 |
4 | 80 | GET | text/plain | 0 |
We'll now use a scikit-learn transformer class to convert the Pandas DataFrame to a numpy ndarray (matrix). Yes it's awesome... I'm not sure it's Optimus Prime awesome... but it's still pretty nice.
# Use the bat DataframeToMatrix class (handles categorical data)
# You can see below it uses a heuristic to detect category data. When doing
# this for real we should explicitly convert before sending to the transformer.
to_matrix = dataframe_to_matrix.DataFrameToMatrix()
bro_matrix = to_matrix.fit_transform(bro_df[features], normalize=True)
print(bro_matrix.shape)
bro_matrix[:1]
Changing column method to category...
Changing column resp_mime_types to category...
Normalizing column id.resp_p...
Normalizing column request_body_len...
(150, 12)
array([[ 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 1., 0.]])
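To get a feel for what a mixed-type conversion like this involves, here is a minimal sketch using plain pandas. It assumes min-max scaling for the numeric columns and one-hot encoding for the categorical one; bat's DataFrameToMatrix may well do these steps differently. The small `demo_df` is invented for illustration, and the explicit `astype('category')` call is the kind of up-front conversion the comment in the cell above recommends.

```python
import numpy as np
import pandas as pd

# Invented toy data with the same flavor of columns as the http log
demo_df = pd.DataFrame({
    'id.resp_p': [80, 80, 8080, 80],
    'method': ['GET', 'POST', 'GET', 'OPTIONS'],
    'request_body_len': [0, 69823, 0, 0],
})

# Explicitly mark the categorical column instead of relying on a heuristic
demo_df['method'] = demo_df['method'].astype('category')

# Min-max scale the numeric columns into [0, 1] (assumed normalization)
numeric_cols = ['id.resp_p', 'request_body_len']
numeric = demo_df[numeric_cols].astype(float)
col_range = (numeric.max() - numeric.min()).replace(0, 1.0)  # avoid divide by zero
normalized = (numeric - numeric.min()) / col_range

# One-hot encode the categorical column: one 0/1 column per category
dummies = pd.get_dummies(demo_df[['method']]).astype(float)

# Stack everything into a single numeric matrix
demo_matrix = np.hstack([normalized.values, dummies.values])
print(demo_matrix.shape)
```

Two numeric columns plus one column per distinct `method` value gives the final width, which is the same reason the real matrix above ends up wider than the four input features.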
# Train/fit and Predict anomalous instances using the Isolation Forest model
odd_clf = IsolationForest(contamination=0.20) # Marking 20% odd
odd_clf.fit(bro_matrix)
IsolationForest(bootstrap=False, contamination=0.2, max_features=1.0, max_samples='auto', n_estimators=100, n_jobs=1, random_state=None, verbose=0)
# Now we create a new dataframe using the prediction from our classifier
# (predict returns -1 for anomalies; copy so we can add columns later without warnings)
odd_df = bro_df[features][odd_clf.predict(bro_matrix) == -1].copy()
print(odd_df.shape)
odd_df.head()
(32, 4)
| | id.resp_p | method | resp_mime_types | request_body_len |
---|---|---|---|---|
106 | 80 | GET | application/x-dosexec | 0 |
107 | 80 | GET | application/x-dosexec | 0 |
109 | 80 | GET | application/x-dosexec | 0 |
112 | 80 | GET | application/x-dosexec | 0 |
113 | 80 | GET | application/x-dosexec | 0 |
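If the `== -1` filter and the `contamination` parameter feel a bit magical, this small sketch on synthetic points may help. The data is generated here purely for illustration (a Gaussian blob plus five hand-placed far-away points), and the variable names are not from the notebook. `predict` labels inliers as 1 and outliers as -1, and `contamination` sets roughly what fraction of the training data ends up on the -1 side.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)

# 95 'normal' points around the origin plus 5 obvious, hand-placed outliers
inliers = rng.normal(loc=0.0, scale=1.0, size=(95, 2))
outliers = np.array([[10, 10], [-10, 10], [10, -10], [-10, -10], [0, 15]], dtype=float)
demo_points = np.vstack([inliers, outliers])

# Ask the forest to treat roughly 10% of the points as anomalous
demo_clf = IsolationForest(contamination=0.1, random_state=42)
demo_pred = demo_clf.fit(demo_points).predict(demo_points)

# -1 marks anomalies, 1 marks inliers
print('flagged:', int((demo_pred == -1).sum()), 'of', len(demo_points))
print('hand-placed outliers flagged:', bool((demo_pred[-5:] == -1).all()))
```

Note that a fixed contamination value always flags something, even in perfectly boring data, which is one more reason not to read 'anomalous' as 'bad'.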
# Now we're going to explore our odd dataframe with help from KMeans and PCA algorithms
odd_matrix = to_matrix.fit_transform(odd_df)
Changing column method to category...
Changing column resp_mime_types to category...
Normalizing column id.resp_p...
Normalizing column request_body_len...
# Just some simple stuff for this example, KMeans and PCA
kmeans = KMeans(n_clusters=4).fit_predict(odd_matrix) # Change this to 3/5 for fun
pca = PCA(n_components=3).fit_transform(odd_matrix)
# Now we can put our ML results back onto our dataframe!
odd_df['x'] = pca[:, 0] # PCA X Column
odd_df['y'] = pca[:, 1] # PCA Y Column
odd_df['cluster'] = kmeans
odd_df.head()
| | id.resp_p | method | resp_mime_types | request_body_len | x | y | cluster |
---|---|---|---|---|---|---|---|
106 | 80 | GET | application/x-dosexec | 0 | 1.112838 | -0.615774 | 1 |
107 | 80 | GET | application/x-dosexec | 0 | 1.112838 | -0.615774 | 1 |
109 | 80 | GET | application/x-dosexec | 0 | 1.112838 | -0.615774 | 1 |
112 | 80 | GET | application/x-dosexec | 0 | 1.112838 | -0.615774 | 1 |
113 | 80 | GET | application/x-dosexec | 0 | 1.112838 | -0.615774 | 1 |
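Here is the same KMeans plus PCA pattern on a tiny synthetic matrix, as a sketch of what each step contributes. The two blobs are generated just for this example. KMeans assigns a cluster label per row in the original feature space, while PCA only produces 2D coordinates for plotting, so the clustering does not depend on the projection.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)

# Two well-separated blobs in 4 dimensions (illustrative data only)
blob_a = rng.normal(loc=0.0, scale=0.5, size=(10, 4))
blob_b = rng.normal(loc=10.0, scale=0.5, size=(10, 4))
demo_matrix = np.vstack([blob_a, blob_b])

# Cluster labels come from the full 4D data
demo_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(demo_matrix)

# PCA gives x/y coordinates for a 2D scatter plot
demo_xy = PCA(n_components=2).fit_transform(demo_matrix)

print(demo_xy.shape)
print(demo_labels)
```

Rows that are identical in feature space, like the repeated `application/x-dosexec` requests above, land on exactly the same x/y point, which is why they share coordinates in the table.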
# Plotting defaults
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams['font.size'] = 14.0
plt.rcParams['figure.figsize'] = 15.0, 6.0
# Helper method for scatter/beeswarm plot
def jitter(arr):
    stdev = .02*(max(arr)-min(arr))
    return arr + np.random.randn(len(arr)) * stdev
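A quick sketch of why the jitter helper is useful, and one caveat. The `jitter_demo` function below is a seeded variant written only for this illustration so the result is repeatable. Noise is scaled to 2% of the value range, so coincident points get nudged apart, but if every value is identical the range is zero and no jitter is applied at all.

```python
import numpy as np

def jitter_demo(arr, scale=0.02, seed=0):
    """Seeded variant of the jitter helper (illustrative, not from bat)."""
    rng = np.random.RandomState(seed)
    arr = np.asarray(arr, dtype=float)
    stdev = scale * (arr.max() - arr.min())
    return arr + rng.normal(size=arr.shape) * stdev

# Three coincident points and one separate point
coincident = np.array([1.0, 1.0, 1.0, 3.0])
spread = jitter_demo(coincident)
print(len(np.unique(coincident)), 'distinct values before,',
      len(np.unique(spread)), 'after')

# Caveat: a constant array has zero range, so it comes back unchanged
flat = jitter_demo(np.array([2.0, 2.0, 2.0]))
print(flat)
```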
# Jitter so we can see instances that are projected coincident in 2D
odd_df['jx'] = jitter(odd_df['x'])
odd_df['jy'] = jitter(odd_df['y'])
# Now use dataframe group by cluster
cluster_groups = odd_df.groupby('cluster')
# Plot the Machine Learning results
colors = {0:'green', 1:'blue', 2:'red', 3:'orange', 4:'purple', 5:'brown'}
fig, ax = plt.subplots()
for key, group in cluster_groups:
    group.plot(ax=ax, kind='scatter', x='jx', y='jy', alpha=0.5, s=250,
               label='Cluster: {:d}'.format(key), color=colors[key])
# Now print out the details for each cluster
pd.set_option('display.width', 1000)
for key, group in cluster_groups:
    print('\nCluster {:d}: {:d} observations'.format(key, len(group)))
    print(group[features].head())
Cluster 0: 7 observations
     id.resp_p   method resp_mime_types  request_body_len
133         80  OPTIONS      text/plain                 0
134         80  OPTIONS      text/plain                 0
135         80  OPTIONS      text/plain                 0
136         80  OPTIONS      text/plain                 0
137         80  OPTIONS      text/plain                 0

Cluster 1: 8 observations
     id.resp_p method        resp_mime_types  request_body_len
106         80    GET  application/x-dosexec                 0
107         80    GET  application/x-dosexec                 0
109         80    GET  application/x-dosexec                 0
112         80    GET  application/x-dosexec                 0
113         80    GET  application/x-dosexec                 0

Cluster 2: 10 observations
     id.resp_p method resp_mime_types  request_body_len
140         80   POST      text/plain             69823
141         80   POST      text/plain             69993
142         80   POST      text/plain             71993
143         80   POST      text/plain             70993
144         80   POST      text/plain             72993

Cluster 3: 7 observations
     id.resp_p method resp_mime_types  request_body_len
126       8080    GET      text/plain                 0
127       8080    GET      text/plain                 0
128       8080    GET      text/plain                 0
129       8080    GET      text/plain                 0
130       8080    GET      text/plain                 0
The important thing here is that both categorical and numerical variables were properly handled and the machine learning algorithm 'did the right thing' when marking outliers: unusual values stood out whether they were in a categorical field (method, resp_mime_types) or a numerical one (id.resp_p, request_body_len).
# Distribution of the request body length
bro_df[['request_body_len']].hist()
print('\nFor this small demo dataset almost all request_body_len are 0\nCluster 2 represents outliers')
For this small demo dataset almost all request_body_len are 0
Cluster 2 represents outliers
Looking at the anomalous clusters for this small demo http log reveals four clusters that may be perfectly fine, so here we're not equating anomalous with 'bad'. An anomaly detection algorithm can bring latent issues to the attention of threat hunters and system administrators. The results might be expected behavior, a misconfigured appliance, or something more nefarious that needs attention from security.
If you liked this notebook please visit SCP Labs for more notebooks and examples, or visit our company page for consulting and development services SuperCowPowers
# This cell is simply for adding some CSS (Ignore it :)
from IPython.core.display import HTML
def css_styling():
    # Use a context manager so the stylesheet file is closed after reading
    with open("styles/custom.css", "r") as css_file:
        styles = css_file.read()
    return HTML(styles)
css_styling()