The unsupervised machine learning technique of clustering data into similar groups can be useful and fairly efficient in most cases. The big trick is often how you pick the number of clusters to make (the K hyperparameter). The number of clusters may vary dramatically depending on the characteristics of the data, the different types of variables (numeric or categorical), how the data is normalized/encoded and the distance metric used.
For this notebook we're going to focus specifically on two things: using silhouette scores to pick K for KMeans, and using DBSCAN, which doesn't need K up front.
# Third Party Imports
import pandas as pd
import numpy as np
import sklearn
from sklearn.manifold import TSNE
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.cluster import KMeans, DBSCAN
# Local imports
import bat
from bat.log_to_dataframe import LogToDataFrame
from bat.dataframe_to_matrix import DataFrameToMatrix
# Good to print out versions of stuff
print('BAT: {:s}'.format(bat.__version__))
print('Pandas: {:s}'.format(pd.__version__))
print('Scikit Learn Version:', sklearn.__version__)
BAT: 0.3.4
Pandas: 0.23.4
Scikit Learn Version: 0.20.0
# Create a Pandas dataframe from the Bro log
http_df = LogToDataFrame('data/http.log')
# Print out the head of the dataframe
http_df.head()
Successfully monitoring data/http.log...
ts | filename | host | id.orig_h | id.orig_p | id.resp_h | id.resp_p | info_code | info_msg | method | orig_fuids | ... | resp_mime_types | response_body_len | status_code | status_msg | tags | trans_depth | uid | uri | user_agent | username |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2013-09-15 17:44:27.668082 | - | guyspy.com | 192.168.33.10 | 1031 | 54.245.228.191 | 80 | 0 | - | GET | - | ... | text/html | 184 | 301 | Moved Permanently | (empty) | 1 | CyIaMO7IheOh38Zsi | / | Mozilla/4.0 (compatible; MSIE 8.0; Windows NT ... | - |
2013-09-15 17:44:27.731702 | - | www.guyspy.com | 192.168.33.10 | 1032 | 54.245.228.191 | 80 | 0 | - | GET | - | ... | text/html | 100631 | 200 | OK | (empty) | 1 | CoyZrY2g74UvMMgp4a | / | Mozilla/4.0 (compatible; MSIE 8.0; Windows NT ... | - |
2013-09-15 17:44:28.092922 | - | www.guyspy.com | 192.168.33.10 | 1032 | 54.245.228.191 | 80 | 0 | - | GET | - | ... | text/html | 55817 | 404 | Not Found | (empty) | 2 | CoyZrY2g74UvMMgp4a | /wp-content/plugins/slider-pro/css/advanced-sl... | Mozilla/4.0 (compatible; MSIE 8.0; Windows NT ... | - |
2013-09-15 17:44:28.150301 | - | www.guyspy.com | 192.168.33.10 | 1040 | 54.245.228.191 | 80 | 0 | - | GET | - | ... | text/plain | 887 | 200 | OK | (empty) | 1 | CiCKTz4e0fkYYazBS3 | /wp-content/plugins/contact-form-7/includes/cs... | Mozilla/4.0 (compatible; MSIE 8.0; Windows NT ... | - |
2013-09-15 17:44:28.150602 | - | www.guyspy.com | 192.168.33.10 | 1041 | 54.245.228.191 | 80 | 0 | - | GET | - | ... | text/plain | 10068 | 200 | OK | (empty) | 1 | C1YBkC1uuO9bzndRvh | /wp-content/plugins/slider-pro/css/slider/adva... | Mozilla/4.0 (compatible; MSIE 8.0; Windows NT ... | - |
5 rows × 26 columns
When we look at the http records some of the data is numerical and some of it is categorical, so we'll need a way of handling both data types in a generalized way. We have a DataFrameToMatrix class that handles a lot of the details and mechanics of combining numerical and categorical data, and we'll use it below.
We'll now use the Scikit-Learn transformer class to convert the Pandas DataFrame to a numpy ndarray (matrix). The transformer class takes care of many low-level details.
# We're going to pick some features that might be interesting
# some of the features are numerical and some are categorical
features = ['id.resp_p', 'method', 'resp_mime_types', 'request_body_len']
# Use the DataframeToMatrix class (handles categorical data)
# You can see below it uses a heuristic to detect category data. When doing
# this for real we should explicitly convert before sending to the transformer.
to_matrix = DataFrameToMatrix()
http_feature_matrix = to_matrix.fit_transform(http_df[features], normalize=True)
print('\nNOTE: The resulting numpy matrix has 12 dimensions based on one-hot encoding')
print(http_feature_matrix.shape)
http_feature_matrix[:1]
Changing column method to category...
Changing column resp_mime_types to category...
Normalizing column id.resp_p...
Normalizing column request_body_len...

NOTE: The resulting numpy matrix has 12 dimensions based on one-hot encoding
(150, 12)
array([[0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 1., 0.]])
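To make the one-hot encoding and normalization a bit more concrete, here's a rough sketch using plain pandas on a made-up toy frame. This is not the DataFrameToMatrix implementation: the toy values, the `toy_df`/`toy_matrix` names, and the choice of min-max scaling are all assumptions for illustration. It also converts the categorical column explicitly instead of relying on a heuristic.

```python
import pandas as pd

# Toy stand-in for a few http.log rows (values are made up for illustration)
toy_df = pd.DataFrame({
    'id.resp_p': [80, 80, 8080, 80],
    'method': ['GET', 'POST', 'GET', 'OPTIONS'],
    'request_body_len': [0, 69823, 0, 0],
})

# Be explicit about which column is categorical rather than trusting a heuristic
toy_df['method'] = toy_df['method'].astype('category')

# Min-max scale the numeric columns into the [0, 1] range
# (assumption: the real transformer may normalize differently)
numeric_cols = ['id.resp_p', 'request_body_len']
numeric = toy_df[numeric_cols].astype(float)
scaled = (numeric - numeric.min()) / (numeric.max() - numeric.min())

# One-hot encode the categorical column: one 0/1 column per category
one_hot = pd.get_dummies(toy_df['method'], prefix='method').astype(float)

# Glue everything together into a single numeric matrix
toy_matrix = pd.concat([scaled, one_hot], axis=1).values
print(toy_matrix.shape)  # 2 numeric columns + 3 method categories
print(toy_matrix)
```

Each distinct category value becomes its own dimension, which is why a handful of features can turn into a noticeably wider matrix.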
# Plotting defaults
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['font.size'] = 12.0
plt.rcParams['figure.figsize'] = 14.0, 7.0
"The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette ranges from -1 to 1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. If most objects have a high value, then the clustering configuration is appropriate. If many points have a low or negative value, then the clustering configuration may have too many or too few clusters."
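The definition above boils down to s(i) = (b - a) / max(a, b), where a is the mean distance from a point to the other points in its own cluster and b is the mean distance to the points in the nearest other cluster. Here's a small sketch that computes it by hand on made-up data and compares against scikit-learn's `silhouette_samples`. The helper name `silhouette_for_point` is ours, and the sketch assumes every cluster has at least two points.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Toy 1-D data: two tight, well separated groups (made-up values)
points = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])


def silhouette_for_point(i, points, labels):
    """Compute s(i) = (b - a) / max(a, b) for a single point.

    Assumes the point's cluster has at least one other member.
    """
    dists = np.linalg.norm(points - points[i], axis=1)

    # a: mean distance to the other points in the same cluster (cohesion)
    same = labels == labels[i]
    same[i] = False  # leave out the point itself
    a = dists[same].mean()

    # b: mean distance to the nearest *other* cluster (separation)
    other_labels = set(labels.tolist()) - {int(labels[i])}
    b = min(dists[labels == other].mean() for other in other_labels)

    return (b - a) / max(a, b)


manual_scores = np.array(
    [silhouette_for_point(i, points, labels) for i in range(len(points))]
)
print('By hand:     ', manual_scores)
print('scikit-learn:', silhouette_samples(points, labels))
```

The `silhouette_score` function used below is just the mean of these per-point values.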
from sklearn.metrics import silhouette_score
scores = []
clusters = range(2,16)
for K in clusters:
    clusterer = KMeans(n_clusters=K)
    cluster_labels = clusterer.fit_predict(http_feature_matrix)
    score = silhouette_score(http_feature_matrix, cluster_labels)
    scores.append(score)
# Plot it out
pd.DataFrame({'Num Clusters':clusters, 'score':scores}).plot(x='Num Clusters', y='score')
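Reading the best K off a plot works fine, but you can also pick it programmatically. Here's a minimal sketch on synthetic data (the `make_blobs` data is a stand-in, not the HTTP feature matrix). It also pins `random_state`, which is our addition, since KMeans starts from random centers and the scores can shift a little from run to run.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data: 4 tight, well separated blobs
blob_centers = [[0, 0], [10, 0], [0, 10], [10, 10]]
X, _ = make_blobs(n_samples=200, centers=blob_centers,
                  cluster_std=0.5, random_state=42)

candidate_ks = list(range(2, 9))
k_scores = []
for k in candidate_ks:
    # Fixed random_state so the sweep is repeatable
    k_labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    k_scores.append(silhouette_score(X, k_labels))

# The K with the highest mean silhouette score wins
best_k = candidate_ks[int(np.argmax(k_scores))]
print('Best K by silhouette score:', best_k)
```

On real data the curve is often flatter than this, so it's still worth eyeballing the plot before trusting the argmax.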
# So we know that the highest (closest to 1) silhouette score is at 10 clusters
kmeans = KMeans(n_clusters=10).fit_predict(http_feature_matrix)
# TSNE is a great projection algorithm. In this case we're going from 12 dimensions to 2
projection = TSNE().fit_transform(http_feature_matrix)
# Now we can put our ML results back onto our dataframe!
http_df['cluster'] = kmeans
http_df['x'] = projection[:, 0] # Projection X Column
http_df['y'] = projection[:, 1] # Projection Y Column
# Now use dataframe group by cluster
cluster_groups = http_df.groupby('cluster')
# Plot the Machine Learning results
colors = {-1:'black', 0:'green', 1:'blue', 2:'red', 3:'orange', 4:'purple', 5:'brown', 6:'pink', 7:'lightblue', 8:'grey', 9:'yellow'}
fig, ax = plt.subplots()
for key, group in cluster_groups:
    group.plot(ax=ax, kind='scatter', x='x', y='y', alpha=0.5, s=250,
               label='Cluster: {:d}'.format(key), color=colors[key])
# Now print out the details for each cluster
pd.set_option('display.width', 1000)
for key, group in cluster_groups:
    print('\nCluster {:d}: {:d} observations'.format(key, len(group)))
    print(group[features].head(3))
Cluster 0: 7 observations
ts                          id.resp_p   method        resp_mime_types  request_body_len
2013-09-15 17:48:03.495720       8080      GET             text/plain                 0
2013-09-15 17:48:04.495720       8080      GET             text/plain                 0
2013-09-15 17:48:04.495720       8080      GET             text/plain                 0

Cluster 1: 40 observations
ts                          id.resp_p   method        resp_mime_types  request_body_len
2013-09-15 17:44:28.150301         80      GET             text/plain                 0
2013-09-15 17:44:28.150602         80      GET             text/plain                 0
2013-09-15 17:44:28.192918         80      GET             text/plain                 0

Cluster 2: 22 observations
ts                          id.resp_p   method        resp_mime_types  request_body_len
2013-09-15 17:44:30.064238         80      GET             image/jpeg                 0
2013-09-15 17:44:30.104156         80      GET             image/jpeg                 0
2013-09-15 17:44:30.725123         80      GET             image/jpeg                 0

Cluster 3: 15 observations
ts                          id.resp_p   method        resp_mime_types  request_body_len
2013-09-15 17:44:30.061532         80      GET              image/png                 0
2013-09-15 17:44:30.061532         80      GET              image/png                 0
2013-09-15 17:44:30.063460         80      GET              image/png                 0

Cluster 4: 14 observations
ts                          id.resp_p   method        resp_mime_types  request_body_len
2013-09-15 17:44:31.386100         80      GET                      -                 0
2013-09-15 17:44:31.417193         80      GET                      -                 0
2013-09-15 17:44:31.471002         80      GET                      -                 0

Cluster 5: 10 observations
ts                          id.resp_p   method        resp_mime_types  request_body_len
2013-09-15 17:48:10.495720         80     POST             text/plain             69823
2013-09-15 17:48:11.495720         80     POST             text/plain             69993
2013-09-15 17:48:12.495720         80     POST             text/plain             71993

Cluster 6: 14 observations
ts                          id.resp_p   method        resp_mime_types  request_body_len
2013-09-15 17:44:27.668082         80      GET              text/html                 0
2013-09-15 17:44:27.731702         80      GET              text/html                 0
2013-09-15 17:44:28.092922         80      GET              text/html                 0

Cluster 7: 8 observations
ts                          id.resp_p   method        resp_mime_types  request_body_len
2013-09-15 17:44:47.464161         80      GET  application/x-dosexec                 0
2013-09-15 17:44:47.464161         80      GET  application/x-dosexec                 0
2013-09-15 17:44:49.221978         80      GET  application/x-dosexec                 0

Cluster 8: 13 observations
ts                          id.resp_p   method        resp_mime_types  request_body_len
2013-09-15 17:44:40.230550         80      GET        application/pdf                 0
2013-09-15 17:44:40.230550         80      GET        application/pdf                 0
2013-09-15 17:44:40.230550         80      GET        application/pdf                 0

Cluster 9: 7 observations
ts                          id.resp_p   method        resp_mime_types  request_body_len
2013-09-15 17:48:06.495720         80  OPTIONS             text/plain                 0
2013-09-15 17:48:07.495720         80  OPTIONS             text/plain                 0
2013-09-15 17:48:08.495720         80  OPTIONS             text/plain                 0
DBSCAN (density-based spatial clustering of applications with noise) is a data clustering algorithm that, given a set of points in space, groups together points that are closely packed and marks points in low-density regions as outliers.
# Now try DBScan
http_df['cluster_db'] = DBSCAN().fit_predict(http_feature_matrix)
print('Number of Clusters: {:d}'.format(http_df['cluster_db'].nunique()))
Number of Clusters: 10
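One thing to watch when counting DBSCAN clusters: scikit-learn labels outliers (noise points) as -1, so counting unique labels will include the noise label as if it were a cluster whenever any outliers are present. Here's a small sketch on made-up data showing the difference. The data and variable names are ours, not from the HTTP log.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Made-up 2-D data: two dense groups plus one isolated point
dense_a = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.05, 0.05]])
dense_b = dense_a + 5.0
outlier = np.array([[20.0, 20.0]])
X = np.vstack([dense_a, dense_b, outlier])

# eps=0.5 and min_samples=5 are scikit-learn's defaults, spelled out here
db_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

unique_labels = set(db_labels.tolist())
n_labels = len(unique_labels)             # what a plain unique count reports
n_clusters = len(unique_labels - {-1})    # real clusters only
n_noise = int(np.sum(db_labels == -1))    # points flagged as outliers

print('Unique labels:', n_labels)
print('Clusters (noise excluded):', n_clusters)
print('Noise points:', n_noise)
```

If the two counts differ, the points labeled -1 are usually worth a look on their own, since they're the observations DBSCAN couldn't place in any dense region.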
# Now use dataframe group by cluster
cluster_groups = http_df.groupby('cluster_db')
# Plot the Machine Learning results
fig, ax = plt.subplots()
for key, group in cluster_groups:
    group.plot(ax=ax, kind='scatter', x='x', y='y', alpha=0.5, s=250,
               label='Cluster: {:d}'.format(key), color=colors[key])
So we got a bit lucky here: for different datasets with different feature distributions, DBSCAN may not give you the optimal number of clusters right off the bat. There are two hyperparameters that can be tweaked, but like we said the defaults often work well. See the DBSCAN and Hierarchical DBSCAN links for more information.
Well, that's it for this notebook. Given the usefulness and relative efficiency of clustering, it's a good technique to include in your toolset. Understanding the K hyperparameter and how to determine the optimal K (or not, if you're using DBSCAN) is a good trick to know.
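We're assuming the two hyperparameters in question are `eps` (how close two points must be to count as neighbors) and `min_samples` (how many neighbors a point needs to be a core point), which are the main knobs on scikit-learn's `DBSCAN`. Here's a rough sketch on made-up data showing how much `eps` alone can change the answer. The `count_clusters` helper is hypothetical, just for this example.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Made-up data: two groups of 5 points on a line, spaced 0.3 apart,
# with a gap of 1.8 between the groups
group_a = np.column_stack([np.arange(5) * 0.3, np.zeros(5)])
group_b = group_a + np.array([3.0, 0.0])
X = np.vstack([group_a, group_b])


def count_clusters(X, eps, min_samples):
    """Run DBSCAN and count clusters, ignoring the -1 noise label."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    return len(set(labels.tolist()) - {-1})


# Sweep eps while holding min_samples fixed
eps_results = {}
for eps in [0.2, 0.5, 2.0]:
    eps_results[eps] = count_clusters(X, eps=eps, min_samples=3)
    print('eps={:.1f}: {:d} clusters'.format(eps, eps_results[eps]))
```

Too small an `eps` and everything is noise; too large and separate groups merge into one. Since the right value depends on the scale of your (normalized) features, it's worth trying a few values when the defaults don't give sensible groups.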
If you liked this notebook please visit SCP Labs for more notebooks and examples, or visit our company page for consulting and development services SuperCowPowers
# This cell is simply for adding some CSS (Ignore it :)
from IPython.core.display import HTML
def css_styling():
    # Read the stylesheet and make sure the file handle gets closed
    with open("styles/custom.css", "r") as css_file:
        styles = css_file.read()
    return HTML(styles)
css_styling()