The topic we'll look at over the next couple of tutorials is anomaly detection. This is probably the most common and important application of machine learning tools to security. We'll start this week with supervised anomaly detection, and continue next week with unsupervised anomaly detection. There are a huge number of anomaly detection techniques out there, and many of them are very specialized to particular types of data, so all we'll be able to do here is scratch the surface.
As the name suggests, you can think of an anomaly as some sort of highly unusual event occuring in your data that you wish to find (e.g. an attack on your network, a defective device, credit card fraud). More usefully, a good definition of an anomaly is the following: An anomaly is a data sample that deviates significantly from other data samples, so much so to suggest that it was generated by a different mechanism. In probability language, you can think of an anomaly as something that comes from a different distribution than the "real" data.
If we happen to know which points in the dataset we're training on are anomalies, we can use supervised learning techniques (specifically binary classification) to build an anomaly detection model. By convention, people usually use the label 0 for non-anomalous samples (called negative samples) and the label 1 for anomalies (called positive samples).
Supervised anomaly detection happens to be a special case of the more general problem of unbalanced data . That is, the number of labels you have for each class is significantly different. Till now we've worked with balanced data , which assumes the number of labels for each class is roughly equal (e.g. with 100 samples, you'd have 50 with negative labels and 50 with positive labels, or close to that ratio). With unbalanced data they can be highly skewed (e.g. with 100 samples, you might have 98 with negative labels and 2 with positive labels).
In the examples below we talk about various ways to deal with unbalanced data for a binary classification problem like supervised anomaly detection. Many of the techniques work just as well for multiclass problems (e.g. image classification). We begin by loading in the packages we'll use. The new one here is the imbalanced-learn library, aka imblearn, which provides the resampling techniques for unbalanced data that we'll use.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
# from sklearn.manifold import TSNE
from MulticoreTSNE import MulticoreTSNE as TSNE
from umap import UMAP
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, f1_score, confusion_matrix, accuracy_score
from sklearn.metrics import cohen_kappa_score, precision_score, recall_score
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE, ADASYN, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler, ClusterCentroids
from imblearn.datasets import make_imbalance
np.random.seed(123)
The dataset we'll be working with is a spam classification dataset. The 0 labels are non-spam and the 1 labels are spam. The dataset contains 57 features derived from the email text (using NLP). To make the dataset unbalanced, we specify that the positive samples we keep should amount to only about 2% of the original dataset.
def get_data(ratio=0.01, get_all=False):
    # loads spam dataset, where only (ratio*100)% of data is spam (1), else non-spam (0)
    df = pd.read_csv("http://www.apps.stat.vt.edu/leman/VTCourses/spam.data.txt", sep=' ')
    X_all = df.iloc[:,:-1]
    y_all = df.iloc[:,-1]
    if get_all:
        return X_all, y_all
    X, y = make_imbalance(X_all, y_all,
                          sampling_strategy={1: min(round(len(y_all)*ratio), len(y_all[y_all==1]))},
                          random_state=42)
    X = StandardScaler().fit_transform(X)
    return X, y
X,y = get_data(ratio=0.02)
print('number of features: ',len(X.T))
print('number of total samples: ',len(y))
print('number of negative samples: ',len(y[y==0]))
print('number of positive samples: ',len(y[y==1]))
print('percent positive samples: %.2f%%' % float(len(y[y==1])/(len(y[y==0])+len(y[y==1]))*100))
print('percent negative samples: %.2f%%' % float(len(y[y==0])/(len(y[y==0])+len(y[y==1]))*100))
number of features: 57
number of total samples: 2880
number of negative samples: 2788
number of positive samples: 92
percent positive samples: 3.19%
percent negative samples: 96.81%
Here's a 2D t-SNE plot of the dataset so you can see what the situation looks like.
def plot_model(model, X, y):
    xlim_left = np.min(X[:,0])
    xlim_right = np.max(X[:,0])
    ylim_left = np.min(X[:,1])
    ylim_right = np.max(X[:,1])
    xx, yy = np.mgrid[xlim_left-1:xlim_right+1:.1, ylim_left-1:ylim_right+1:.1]
    grid = np.c_[xx.ravel(), yy.ravel()]
    f, ax = plt.subplots(figsize=(8, 8))
    if model is not None:
        yhat = model.predict(grid).reshape(xx.shape)
        ax.contour(xx, yy, yhat, levels=[.5])
    x1 = X[y==0]
    x2 = X[y==1]
    ax.scatter(x1[:,0], x1[:,1], marker='.', c='red', label='y=0')
    ax.scatter(x2[:,0], x2[:,1], marker='.', c='blue', label='y=1')
    ax.set(aspect="equal", xlabel="$X_1$", ylabel="$X_2$")
    ax.legend(loc='upper right')
tsne = TSNE(n_components=2, n_jobs=-1)
X_tsne = tsne.fit_transform(X)
plot_model(None, X_tsne, y)
So why can't we just do classification as usual at this point? There are several problems with doing this. For one, accuracy becomes a bad way to measure model performance with unbalanced data. To see this, consider the following scenario: what happens if we build a "stupid" classifier that just classifies all points as 0?
To address the accuracy problem we instead must use a different metric to measure model performance. But which one? To understand the various metrics we've got to talk about the confusion matrix. The confusion matrix contains 4 numbers that we ultimately care about:

True positives (TP): the number of samples labeled 1 that the model correctly predicted were 1.
False positives (FP): the number of samples labeled 0 that the model incorrectly predicted were 1.
True negatives (TN): the number of samples labeled 0 that the model correctly predicted were 0.
False negatives (FN): the number of samples labeled 1 that the model incorrectly predicted were 0.

In sklearn, these 4 numbers are stored in the following way (note the diagonals are swapped from how you'd see them in a typical statistics textbook!):

confusion matrix = [[TN, FP],
                    [FN, TP]]

Just about every metric we'd use to measure model performance of a binary classifier is contained in these 4 numbers. Specifically, we'll focus mainly on the following 4 metrics:

accuracy = (TP + TN) / (TP + TN + FP + FN), the fraction of all samples the model classifies correctly.
precision = TP / (TP + FP), the fraction of predicted positives that really are positive.
recall = TP / (TP + FN), the fraction of actual positives the model manages to find.
F1 = 2 * precision * recall / (precision + recall), the harmonic mean of precision and recall.
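As a quick sanity check of the layout above (this snippet is my addition, with tiny made-up labels, not part of the original notebook), you can pull TN, FP, FN, and TP out of sklearn's confusion matrix with .ravel() and recompute precision and recall by hand:

import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score
# tiny made-up labels and predictions, just to illustrate the layout
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 1, 0, 0, 1, 0, 1, 0])
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()  # note the TN, FP, FN, TP order
print(tp/(tp+fp), precision_score(y_true, y_pred))  # both print 0.666...
print(tp/(tp+fn), recall_score(y_true, y_pred))     # both print 0.666...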
There is also another important metric people use called the area under the receiver operating characteristic curve, aka AUC or AUROC. I won't go into details on this one, but look it up if you're interested.
With all of these metrics, you can think of values closer to 0 as "bad" and values closer to 1 as "good". Below we show what all these scores are for the "all zeros" classifier mentioned above.
def get_scores(y, yhat):
    print('accuracy: ', round(accuracy_score(y,yhat),4))
    print('precision: ', round(precision_score(y,yhat),4))
    print('recall: ', round(recall_score(y,yhat),4))
    print('f1: ', round(f1_score(y,yhat),4))
    print('auc: ', round(roc_auc_score(y,yhat),4))
    print('confusion matrix:\n', confusion_matrix(y,yhat))
# suppressing warnings, since precision and F1 are technically undefined in this case (why?)
import warnings; warnings.filterwarnings('ignore')
yhat = np.zeros(y.shape)
get_scores(y,yhat)
accuracy: 0.9681
precision: 0.0
recall: 0.0
f1: 0.0
auc: 0.5
confusion matrix:
 [[2788    0]
 [  92    0]]
You can see that accuracy isn't a useful metric at all when dealing with unbalanced data, because it mostly just reflects how unbalanced the data is. To avoid this it's better to focus on one of the other metrics. The F1 score is probably the most commonly used for this. In the above scenario, F1 correctly tells us that our model is terrible because it can't pick out the positive labels from the negative ones.
Now let's get to training models on unbalanced data. Recall the first thing you need to do is split your dataset into a training set and a test set. When dealing with unbalanced data, it's important that your test set correctly reflects the ratio of positives and negatives in your dataset, so we split the test set off before doing anything special with the data and leave the test set alone (this is important, or your metrics won't truly reflect how well your model is doing!). We again choose to take about 20% of the data for the test set (which is reasonable when you have a few thousand data samples or less).
Important Note: When you have extreme data imbalance, if you're not careful, you could end up in a situation where you have no positive samples in your test set. You want to avoid this. Your test set needs to contain at least a handful of positive samples to tell you how well you're doing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
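One way to guard against the extreme case in the note above (not done in the split we just made; the stratify argument, random_state, and variable names here are my own additions) is to pass stratify=y to train_test_split, which keeps the positive/negative ratio roughly the same in both pieces:

# a stratified split: the positive rate is preserved in both train and test,
# so even a very unbalanced dataset can't end up with a positive-free test set
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=7)
print('positives in test set: ', int(np.sum(y_test_s == 1)))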
We can start by just trying to train a model on the data we have without doing any kind of resampling. Sometimes this will work if the dataset isn't too unbalanced. However, for extreme imbalance (like when your minority class is 1% of your data), some kind of resampling is almost always a good strategy.
We see that here. Notice that in this naive training case, precision, recall, and F1 are all zero on the test set. Why? Because the model didn't see enough positive examples in the training set to learn to differentiate them, so everything gets classified as 0. Resampling can help with this problem and improve performance.
model = RandomForestClassifier(min_samples_leaf=20, random_state=5, n_jobs=-1)
model.fit(X_train, y_train)
yhat = model.predict(X_test)
get_scores(y_test, yhat)
accuracy: 0.9688
precision: 0.0
recall: 0.0
f1: 0.0
auc: 0.5
confusion matrix:
 [[558   0]
 [ 18   0]]
We could just keep training the same way we did before with balanced data, but as we just saw, that's not a good idea. ML generally works best when you train on balanced data but test on the true unbalanced data. There are different ways of balancing the training set, but they pretty much all fall under two categories:

Undersampling: throw away negative (majority class) samples until the training set is balanced.
Oversampling: add more positive (minority class) samples, either by copying existing ones or by generating synthetic ones, until the training set is balanced.
We can do each of these easily in Python using the imblearn package mentioned above. We'll start by looking at a few undersampling techniques, and then finish with a few oversampling techniques. Since we'll want to focus on how sampling affects training in this notebook, we'll just pick a model and stick with it for each sampling case. Since we mentioned before that Random Forests are easy to use and often work very well, we'll use those. In practice though you'll want to experiment with different models and hyperparameter choices for each sampling case.
We start with naive undersampling, aka random undersampling. All random undersampling does is take your training set and randomly throw away enough negative samples to make it balanced. As a result, the new, balanced training set will be much smaller than the original one.
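Before calling the library, here's a rough sketch of the idea (my own illustration, not imblearn's implementation): keep every positive sample plus an equally sized random subset of the negatives.

# rough sketch of random undersampling: all positives + a same-sized random
# subset of negatives (RandomUnderSampler below does this properly for us)
rng = np.random.default_rng(543)
y_arr = np.asarray(y_train)
pos_idx = np.where(y_arr == 1)[0]
neg_idx = rng.choice(np.where(y_arr == 0)[0], size=len(pos_idx), replace=False)
keep = np.concatenate([pos_idx, neg_idx])
X_sketch, y_sketch = X_train[keep], y_arr[keep]
print(len(y_sketch), y_sketch.mean())  # 148 samples, half of them positive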
X_train_bal, y_train_bal = RandomUnderSampler(random_state=543).fit_resample(X_train, y_train)
print('original training set size: ', len(y_train))
print('new training set size: ', len(y_train_bal))
print('number of negative samples: ', len(y_train_bal[y_train_bal==0]))
print('number of positive samples: ', len(y_train_bal[y_train_bal==1]))
original training set size: 2304
new training set size: 148
number of negative samples: 74
number of positive samples: 74
model = RandomForestClassifier(min_samples_leaf=20, random_state=5, n_jobs=-1)
model.fit(X_train_bal, y_train_bal)
yhat = model.predict(X_test)
get_scores(y_test, yhat)
accuracy: 0.9635
precision: 0.44
recall: 0.6111
f1: 0.5116
auc: 0.793
confusion matrix:
 [[544  14]
 [  7  11]]
Ignoring accuracy (which is obviously biased), you can see we're doing poorly with respect to precision, which is dragging F1 down as well. This means our model isn't "picky" enough when classifying points as 1. One huge reason for this is that we're not training on much data, only 148 samples. That's because we're undersampling.
Another undersampling technique people use involves clustering the negative samples and only keeping those cluster centers as points in the new training set. Suppose you have n positive samples and N negative samples with n≪N. Then cluster-based undersampling works by doing k-means on the N negative samples only with k=n. It then throws out the N negative samples and replaces them with the n centers calculated from the clustering.
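For intuition, here's a minimal sketch of that procedure using sklearn's KMeans directly (my own illustration with made-up names; the ClusterCentroids call below handles this, and more carefully, for us):

from sklearn.cluster import KMeans

def cluster_undersample(X, y, seed=543):
    # split out the positive and negative samples
    y = np.asarray(y)
    X_pos, X_neg = X[y == 1], X[y == 0]
    n = len(X_pos)
    # k-means on the negatives only, with k equal to the number of positives
    km = KMeans(n_clusters=n, random_state=seed).fit(X_neg)
    # replace the N negatives with the n cluster centers
    X_bal = np.vstack([km.cluster_centers_, X_pos])
    y_bal = np.hstack([np.zeros(n), np.ones(n)])
    return X_bal, y_bal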
X_train_bal, y_train_bal = ClusterCentroids(random_state=543).fit_resample(X_train, y_train)
print('original training set size: ', len(y_train))
print('new training set size: ', len(y_train_bal))
print('number of negative samples: ', len(y_train_bal[y_train_bal==0]))
print('number of positive samples: ', len(y_train_bal[y_train_bal==1]))
original training set size: 2304
new training set size: 148
number of negative samples: 74
number of positive samples: 74
model = RandomForestClassifier(min_samples_leaf=7, random_state=5, n_jobs=-1)
model.fit(X_train_bal, y_train_bal)
yhat = model.predict(X_test)
get_scores(y_test, yhat)
accuracy: 0.7882
precision: 0.1119
recall: 0.8333
f1: 0.1974
auc: 0.81
confusion matrix:
 [[439 119]
 [  3  15]]
It looks like cluster-based undersampling does worse (at least with random forests). Precision is still pretty bad. Again, this is mainly because we're not using much data due to undersampling.
To make sure we're using all of our data we can use oversampling techniques instead. The drawback to oversampling is that it's very easy to overfit the positive samples if you're not careful. We'll first look at naive oversampling, aka random oversampling. All this does is randomly copy positive samples in the training set until the training set is balanced.
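Again, a rough sketch of the idea (my own illustration, not imblearn's code): keep the negatives as-is and resample the positives with replacement until the counts match.

# rough sketch of random oversampling: negatives stay as-is, positives are drawn
# with replacement until the counts match (RandomOverSampler does this below)
rng = np.random.default_rng(543)
y_arr = np.asarray(y_train)
pos_idx = np.where(y_arr == 1)[0]
neg_idx = np.where(y_arr == 0)[0]
extra = rng.choice(pos_idx, size=len(neg_idx), replace=True)  # duplicates allowed
keep = np.concatenate([neg_idx, extra])
X_sketch, y_sketch = X_train[keep], y_arr[keep]
print(len(y_sketch), y_sketch.mean())  # 4460 samples, half of them positive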
X_train_bal, y_train_bal = RandomOverSampler(random_state=543).fit_resample(X_train, y_train)
print('original training set size: ', len(y_train))
print('new training set size: ', len(y_train_bal))
print('number of negative samples: ', len(y_train_bal[y_train_bal==0]))
print('number of positive samples: ', len(y_train_bal[y_train_bal==1]))
original training set size: 2304
new training set size: 4460
number of negative samples: 2230
number of positive samples: 2230
model = RandomForestClassifier(min_samples_leaf=5, random_state=5, n_jobs=-1)
model.fit(X_train_bal, y_train_bal)
yhat = model.predict(X_test)
get_scores(y_test, yhat)
accuracy: 0.9774
precision: 0.6316
recall: 0.6667
f1: 0.6486
auc: 0.8271
confusion matrix:
 [[551   7]
 [  6  12]]
You can see that oversampling does much better than undersampling here, which fits the general principle that "machine learning models perform better with more data". Another oversampling technique is SMOTE. Instead of just copying positive samples to balance the dataset, SMOTE uses nearest neighbor interpolation to generate new positive samples. This ensures the new positive samples added to the training set aren't exact copies of the originals, but are "close enough" to look like them, which should help with overfitting worries.
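Here's a rough sketch of how SMOTE creates a single synthetic sample (my own illustration of the interpolation step; the actual SMOTE call below also handles the bookkeeping of how many samples to generate and from which points):

from sklearn.neighbors import NearestNeighbors

def smote_one_sample(X_pos, k=5, seed=0):
    # pick a random positive sample and one of its k nearest positive neighbors
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_pos)
    i = rng.integers(len(X_pos))
    neighbors = nn.kneighbors(X_pos[i:i + 1], return_distance=False)[0][1:]
    j = rng.choice(neighbors)
    # interpolate a random fraction of the way from the sample toward its neighbor
    lam = rng.random()
    return X_pos[i] + lam * (X_pos[j] - X_pos[i])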
X_train_bal, y_train_bal = SMOTE(random_state=543).fit_resample(X_train, y_train)
print('original training set size: ', len(y_train))
print('new training set size: ', len(y_train_bal))
print('number of negative samples: ', len(y_train_bal[y_train_bal==0]))
print('number of positive samples: ', len(y_train_bal[y_train_bal==1]))
original training set size: 2304
new training set size: 4460
number of negative samples: 2230
number of positive samples: 2230
model = RandomForestClassifier(min_samples_leaf=5, random_state=5, n_jobs=-1)
model.fit(X_train_bal, y_train_bal)
yhat = model.predict(X_test)
get_scores(y_test, yhat)
accuracy: 0.9809
precision: 0.7059
recall: 0.6667
f1: 0.6857
auc: 0.8289
confusion matrix:
 [[553   5]
 [  6  12]]
The last oversampling technique we'll mention is ADASYN. ADASYN works similarly to SMOTE, but it concentrates its synthetic samples on the positive points that are hardest to learn (those whose nearest neighbors are mostly negative). With minimal tuning, it looks like ADASYN is performing the best in our particular case.
X_train_bal, y_train_bal = ADASYN(random_state=543).fit_resample(X_train, y_train)
print('original training set size: ', len(y_train))
print('new training set size: ', len(y_train_bal))
print('number of negative samples: ', len(y_train_bal[y_train_bal==0]))
print('number of positive samples: ', len(y_train_bal[y_train_bal==1]))
original training set size: 2304
new training set size: 4458
number of negative samples: 2230
number of positive samples: 2228
model = RandomForestClassifier(min_samples_leaf=5, random_state=5, n_jobs=-1)
model.fit(X_train_bal, y_train_bal)
yhat = model.predict(X_test)
get_scores(y_test, yhat)
accuracy: 0.9844
precision: 0.7647
recall: 0.7222
f1: 0.7429
auc: 0.8575
confusion matrix:
 [[554   4]
 [  5  13]]
Final Aside: Since I generated this unbalanced spam dataset from a (roughly) balanced one to begin with, it's natural to ask how well we could've done if we'd had more positive samples from the start. Here you go.
Moral: It's almost impossible to beat having real balanced data to begin with. If you can collect more positive samples to balance the dataset, do. Nothing beats real data.
X_all,y_all = get_data(get_all=True)
X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, test_size=0.2)
model = RandomForestClassifier(min_samples_leaf=5, random_state=5, n_jobs=-1)
model.fit(X_train, y_train)
yhat = model.predict(X_test)
get_scores(y_test, yhat)
accuracy: 0.9315
precision: 0.946
recall: 0.8663
f1: 0.9044
auc: 0.9184
confusion matrix:
 [[559  17]
 [ 46 298]]