Anomalies collector - deepdive tutorial!

Open In Colab

This notebook will walk through, step by step, a worked example of how the Netdata anomalies collector works under the hood.

Note: you can click the "Open in Colab" button above to open this notebook in Google Colab where you can just get going with it without having to set up python environments or any messy stuff like that. If the button does not work then just click here.

Another Note: If you are just reading through this notebook then it might be better to view it using nbviewer here or Colab here as it will render a bit prettier than in GitHub.

In [ ]:
# uncomment the below line to install required packages if needed (you will need to do this the first time if running in Google Colab).
#!pip install netdata-pandas==0.0.28 numba==0.50.1 scikit-learn==0.23.2 pyod==0.8.3

But first, let's start with a meme...

image

Overview

There are three main concepts central to what the anomalies collector does:

  • Featurization: This is how we take the raw data for each chart and preprocess it into a feature representation or "feature vector" used by the model. A simple way to think of this is that we just take each row of data and add some extra columns to encode some additional information - for example, a smoothed average of the last lags_n values for each dimension on the chart, so the model has some knowledge of the recent past beyond just the latest raw values of the dimensions.

  • Training: A function to take our "featurized" training data and train our models, one for each chart (or custom models if you have defined any). This function will do slightly different things depending on what model you use. In a broad sense, its job is to train the model to form a useful, more compact representation of the training data, which we can then use to measure our level of surprise at new data that we want to get anomaly scores for. For the default PCA model this involves finding a lower dimensional representation that does a good job at reconstructing the main characteristics of the variance in our training data. Other models use different representations and algorithms to get to that "measure of surprise" for each feature vector. For our purposes this is largely abstracted away by the API of the PyOD library, such that as a user we can easily switch between various models and still have broadly the same inputs (NumPy arrays) and outputs (also NumPy arrays of anomaly scores, probabilities, and flags).

  • Prediction: Each trained model has predict() and predict_proba() methods that we can use by passing in a new feature vector and getting back an anomaly flag and anomaly probability from the trained model. This is the part where we actually use the trained model as new data arrives to ask - "how unusual does this new data look?" (a tiny toy sketch pulling these three steps together follows below).
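Just to make those three steps concrete before we dive in, here is a tiny self-contained toy sketch on purely synthetic data. It is only an illustration of the featurize -> train -> predict flow, not the collector code itself - the real steps, with real Netdata data, follow in the rest of this notebook.

In [ ]:
# toy sketch of featurization -> training -> prediction on synthetic data (illustrative only)
import numpy as np
from pyod.models.hbos import HBOS

np.random.seed(42)
raw = np.random.normal(size=1000)  # one fake 'dimension', one value per second

# featurization: each row is the latest value plus 2 lagged values
X = np.column_stack([raw[2:], raw[1:-1], raw[:-2]])

# training: fit a PyOD model on the featurized training data
toy_model = HBOS(contamination=0.01).fit(X)

# prediction: ask the trained model how unusual a new feature vector looks
x_new = np.array([[0.1, 0.0, 10.0]])  # a value of 10 should look surprising vs the N(0,1) training data
print(toy_model.predict_proba(x_new)[-1][1], toy_model.predict(x_new)[-1])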

Let's go!

In [2]:
import time
from datetime import datetime
import matplotlib.pyplot as plt

from IPython.display import display, Markdown
from IPython.lib.display import YouTubeVideo
import numpy as np
import pandas as pd
from netdata_pandas.data import get_data, get_allmetrics
from pyod.models.hbos import HBOS
from pyod.models.pca import PCA
from pyod.models.cblof import CBLOF
from pyod.models.iforest import IForest


def make_features(df, lags_n, diffs_n, smooth_n):
    """Given a pandas dataframe preprocess it to take differences, add smoothing, and lags as specified. 
    """
    if diffs_n >= 1:
        # take differences
        df = df.diff(diffs_n).dropna()
    if smooth_n >= 2:
        # apply a rolling average to smooth out the data a bit
        df = df.rolling(smooth_n).mean().dropna()
    if lags_n >= 1:
        # for each dimension add new columns for each of the lags_n lags of the differenced and smoothed values for that dimension
        df_columns_new = [f'{col}_lag{n}' for n in range(lags_n+1) for col in df.columns]
        df = pd.concat([df.shift(n) for n in range(lags_n + 1)], axis=1).dropna()
        df.columns = df_columns_new
    # sort columns to have lagged values next to each other for clarity when looking at the feature vectors
    df = df.reindex(sorted(df.columns), axis=1)
    
    return df
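Before we go any further, here is a quick, purely illustrative sanity check of make_features on a tiny made-up DataFrame (df_toy below is just a throwaway example, not Netdata data), so you can see the shape change it produces:

In [ ]:
# toy sanity check of make_features (illustrative only)
df_toy = pd.DataFrame({'a': range(20), 'b': range(0, 40, 2)})
df_toy_processed = make_features(df_toy, lags_n=2, diffs_n=1, smooth_n=2)
# 2 raw columns become 2 * (1 + 2) = 6 lagged columns, and a few rows are lost to the diff, rolling mean and lags
print(df_toy.shape, '->', df_toy_processed.shape)
print(df_toy_processed.tail(1))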

Inputs & configuration

In the next cell we will define all the inputs we will use in this tutorial. Feel free to play with them once you are familiar with how it all hangs together.

Below you will see that the parameter values map to a subset of the inputs (the most important ones to help explain what's going on) required as part of the anomalies.conf configuration for the anomalies collector itself.

In [3]:
# inputs

# what host will we use
host = 'london.my-netdata.io'
#host= 'cdn77.my-netdata.io'
#host = 'odyslam.me/netdata/webserver'

# for this tutorial we will just use a few charts
charts_in_scope = ['system.cpu', 'system.load', 'system.net', 'system.io', 'system.ip', 'system.idlejitter']

# what model from PyOD will we use under the hood
model = 'pca'

# how many seconds of data will we train our models on
train_n_secs = 14400

# what contamination rate will we use, see some discussion here to understand this one more: https://github.com/yzhao062/pyod/issues/144
contamination = 0.001

# if we want to ignore a recent window of data when training the model we can use this
offset_n_secs = 0

# how many lags to include in our feature vector
lags_n = 5

# how much smoothing to apply in our feature vector
smooth_n = 3

# if we want to do everything in terms of differences then we set diffs_n=1
diffs_n = 1

# for the purpose of this tutorial, how many prediction steps will we take once we have a trained model
n_prediction_steps = 20

An aside on PCA

By default the anomalies collector uses the PCA model. This is primarily because PCA gives a good combination of being able to capture and model flexible patterns in the data while also being computationally cheap, since under the hood it uses the well researched, optimized and understood SVD algorithm to decompose our featurized data and project it onto a lower dimensional space. At a high level, when we see new data that lands in a strange or unexpected part of this lower dimensional space then this is symptomatic of anomalous data and so it will get a higher anomaly score.

Note: If you want to learn more about PCA and play with some notebooks exploring PCA in a similar manner to this one then check out this chapter from the great Python Data Science Handbook by Jake VanderPlas.

The image below (taken from the book) gives a good intuition about a way of thinking of PCA as (almost) dimensionality reduction. In the image below we are looking at how PCA could be used to "compress" the X and Y data into one single dimension of numbers by projecting each pair of points onto the corresponding solid blue line of dots.

image

Or if a Computerphile video is more your thing then check out the one below, as it does a better job than we can here.

In [4]:
YouTubeVideo('TJdH6rPA-TI')
Out[4]:

In this sense the PCA is learning a compressed representation of the data (technically, as mentioned in the video above, it's a transformed representation in a different space, but we don't need to get too bogged down in that for our purposes here). This is essentially how PyOD uses PCA for anomaly detection under the hood.

So we use PCA to learn a compressed representation of all the training data for each chart in some more abstract lower dimensional space that can be of use to us. The "of use to us" part here comes from the fact that PCA picks these lower dimensional representations in a clever way (using SVD) that maximises the amount of information retained about the main directions of variance of the original dataset. For example, in the image above the line of blue dots is the best line we could project our data onto in order to maintain as much information as possible about the variance of the original 2 dimensional dataset: there is much more 'spread' in the data along the original x-axis than the y-axis, and that is what is captured by the 'spread' in the solid blue line of dots.

So when we see a new observation that does not map well into this lower dimensional representation we learned during training, that tells us the new observation is probably somewhat anomalous, at least in comparison to what we observed in general in our training data.
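If you want to see that 'does not map well' idea in isolation, here is a small synthetic sketch using scikit-learn's PCA directly (imported under a different name so as not to clash with the PyOD PCA imported above). It is just an illustration of the reconstruction error intuition, not the scoring PyOD actually does: a point that sits far off the main direction of variance loses much more information when projected down to one component and back than a typical point does.

In [ ]:
# illustrative sketch of the PCA 'reconstruction error' intuition on synthetic 2-d data
from sklearn.decomposition import PCA as sk_PCA

np.random.seed(1)
# 2-d training data with much more 'spread' along x than along y (like the book image above)
X_2d = np.column_stack([np.random.normal(0, 3, 1000), np.random.normal(0, 0.3, 1000)])

pca_1d = sk_PCA(n_components=1).fit(X_2d)

def reconstruction_error(points):
    # project onto the single learned component and back, then measure how much was lost
    return np.linalg.norm(points - pca_1d.inverse_transform(pca_1d.transform(points)), axis=1)

print('typical looking point:', reconstruction_error(np.array([[2.0, 0.1]])))
print('odd looking point    :', reconstruction_error(np.array([[0.0, 5.0]])))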

Note: We have looked at PCA a little bit above, but the point of awesome libraries like PyOD is that you don't really need to go that far down into the details - once you understand how the API works and a little bit about the various types of models/approaches used you can consider playing with and trying out other models on your data that take completely different approaches under the hood, for example:

  • hbos: uses histograms as the underlying representations of your data used to then measure the surprise of new data (more info).
  • cblof: somewhat similar in approach to pca but learns clusters as the representation of your training data, and then it's the distance of new observations to the learned cluster centroids that is used to generate an anomaly score (a good blog post).
  • iforest: uses an isolation forest as the underlying model, and observations that end up on strange parts of that learned forest are then considered more anomalous (sklearn user guide).
  • ...etc.

Note: Not all models from PyOD have been implemented in the anomalies collector as some turned out to be too expensive for the specific use case of unsupervised anomaly detection on your Netdata node itself (or even on a parent node). To that end, the models available in the collector are pca, hbos, iforest, cblof, loda, copod or feature_bagging.

Initialize our models

Now we will initialize a PyOD model for each chart in charts_in_scope. Each model type in PyOD has various input parameters that a user can play with; we will tend to use the defaults and sometimes override them with values picked based on what we know about the task we are working on. Generally (where relevant) these model parameters, apart from contamination, are hardcoded into the anomalies collector based on our internal research as we developed the collector - you can see this in the collector code here.

In the cell below we have added a comment for the source and API reference of each model from PyOD so you can take a look and read more about each one.

In [5]:
# initialize a model for each chart
if model == 'pca':
    # api: https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.pca
    # source: https://pyod.readthedocs.io/en/latest/_modules/pyod/models/pca.html
    models = {c: PCA(contamination=contamination, n_components=2, n_selected_components=2) for c in charts_in_scope}
elif model == 'hbos':
    # api: https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.hbos
    # source: https://pyod.readthedocs.io/en/latest/_modules/pyod/models/hbos.html
    models = {c: HBOS(contamination=contamination) for c in charts_in_scope}
elif model == 'cblof':
    # api: https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.cblof
    # source: https://pyod.readthedocs.io/en/latest/_modules/pyod/models/cblof.html
    models = {c: CBLOF(contamination=contamination, n_clusters=4) for c in charts_in_scope}
elif model == 'iforest':
    # api: https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.iforest
    # source: https://pyod.readthedocs.io/en/latest/_modules/pyod/models/iforest.html
    models = {c: IForest(contamination=contamination, n_estimators=50, bootstrap=True, behaviour='new') for c in charts_in_scope}
else:
    # we use HBOS as the default as it is both fast and robust to many different types of data and has proven in internal development 
    # to have fewer failure modes than some other models given the wide variety of data we are expecting to be thrown at it
    models = {c: HBOS(contamination=contamination) for c in charts_in_scope}

Get training data

The first thing we need to do is get our raw training data for each chart we want to build a model for.

To get the data we will make use of the netdata-pandas library we have built to make multiple asynchronous calls (using asks and trio) to the Netdata REST API and basically wrangle the results into a nice Pandas DataFrame.

In [6]:
# define the window for the training data to pull
before = int(datetime.now().timestamp()) - offset_n_secs
after =  before - train_n_secs

# get the training data
df_train = get_data(hosts=host, charts=charts_in_scope, after=after, before=before, sort_cols=True, numeric_only=True, float_size='float32')
print(df_train.info())
df_train.head()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 14399 entries, 1604474576 to 1604488974
Data columns (total 21 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   system.cpu|guest           14399 non-null  float32
 1   system.cpu|guest_nice      14399 non-null  float32
 2   system.cpu|iowait          14399 non-null  float32
 3   system.cpu|irq             14399 non-null  float32
 4   system.cpu|nice            14399 non-null  float32
 5   system.cpu|softirq         14399 non-null  float32
 6   system.cpu|steal           14399 non-null  float32
 7   system.cpu|system          14399 non-null  float32
 8   system.cpu|user            14399 non-null  float32
 9   system.idlejitter|average  14399 non-null  float32
 10  system.idlejitter|max      14399 non-null  float32
 11  system.idlejitter|min      14399 non-null  float32
 12  system.io|in               14399 non-null  float32
 13  system.io|out              14399 non-null  float32
 14  system.ip|received         14399 non-null  float32
 15  system.ip|sent             14399 non-null  float32
 16  system.load|load1          14395 non-null  float32
 17  system.load|load15         14395 non-null  float32
 18  system.load|load5          14395 non-null  float32
 19  system.net|received        14399 non-null  float32
 20  system.net|sent            14399 non-null  float32
dtypes: float32(21)
memory usage: 1.3 MB
None
Out[6]:
system.cpu|guest system.cpu|guest_nice system.cpu|iowait system.cpu|irq system.cpu|nice system.cpu|softirq system.cpu|steal system.cpu|system system.cpu|user system.idlejitter|average ... system.idlejitter|min system.io|in system.io|out system.ip|received system.ip|sent system.load|load1 system.load|load15 system.load|load5 system.net|received system.net|sent
time_idx
1604474576 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.500000 0.500000 82.0 ... 46.0 0.0 -11.750492 110.332367 -174.585098 NaN NaN NaN 20.880480 -84.349640
1604474577 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.505050 0.505050 84.0 ... 30.0 0.0 0.000000 192.250900 -64.668961 NaN NaN NaN 164.099686 -32.041279
1604474578 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.251256 1.005025 78.0 ... 27.0 0.0 0.000000 170.466507 -86.694138 NaN NaN NaN 124.846687 -37.602589
1604474579 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.502513 0.502513 79.0 ... 29.0 0.0 0.000000 153.144745 -199.074005 NaN NaN NaN 46.703060 -91.418747
1604474580 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.503778 0.251889 76.0 ... 47.0 0.0 -26.941219 124.466759 -144.531433 0.03 0.02 0.07 39.373100 -58.933270

5 rows × 21 columns

Above we can see our raw training data is just a pandas DataFrame with a timestamp index and a column of floats for each dimension from our charts_in_scope list.

Note: The netdata-pandas default naming convention for columns is "chart.name|dimension.name"

Preprocess or "featurize" the training data

Before we train our model we will first do some preprocessing to the raw data to create a "feature vector" to try and encode a more flexible and powerful representation for the model to work with as opposed to just looking at the most recently observed values in isolation.

This is the "featurization" we mentioned at the begining of the notebook. The idea here is to give the model some extra information so that it may spot more complex and interesting anomalies as opposed to just spikes where one metric is a very high or very low value.

In [7]:
# lets preprocess or "featurize" our raw data
df_train_processed = make_features(df_train, lags_n, diffs_n, smooth_n)

# print out the shape of our featurized data
print(df_train_processed.shape)
df_train_processed.head()
(14387, 126)
Out[7]:
system.cpu|guest_lag0 system.cpu|guest_lag1 system.cpu|guest_lag2 system.cpu|guest_lag3 system.cpu|guest_lag4 system.cpu|guest_lag5 system.cpu|guest_nice_lag0 system.cpu|guest_nice_lag1 system.cpu|guest_nice_lag2 system.cpu|guest_nice_lag3 ... system.net|received_lag2 system.net|received_lag3 system.net|received_lag4 system.net|received_lag5 system.net|sent_lag0 system.net|sent_lag1 system.net|sent_lag2 system.net|sent_lag3 system.net|sent_lag4 system.net|sent_lag5
time_idx
1604474588 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... -30.704399 -40.856167 6.443555 32.546180 32.694991 -12.822536 -24.871939 -14.788798 38.332296 7.027175
1604474589 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 40.196271 -30.704399 -40.856167 6.443555 -7.484048 32.694991 -12.822536 -24.871939 -14.788798 38.332296
1604474590 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 37.268468 40.196271 -30.704399 -40.856167 -12.974803 -7.484048 32.694991 -12.822536 -24.871939 -14.788798
1604474591 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 2.695841 37.268468 40.196271 -30.704399 -20.702449 -12.974803 -7.484048 32.694991 -12.822536 -24.871939
1604474592 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... -51.627127 2.695841 37.268468 40.196271 15.614039 -20.702449 -12.974803 -7.484048 32.694991 -12.822536

5 rows × 126 columns

The next few cells will explore what we have just done, to try and make the ideas of preprocessing, aka "featurization" or a "feature vector", a little clearer.

Terms like "featurization" and "feature vector" are often used to sound fancy, but in reality it's typically as simple as adding additional columns to each row of your data, where those new columns contain numbers that represent something about your data that you want to make available to the model.

So in our case, adding lagged values of each smoothed and differenced dimension is basically a design choice whereby we are telling the model we want it to consider the last lags_n values as opposed to just the latest observed values of the dimensions. We do this because there are many different types of anomalies we want to be able to spot, so making a small snippet of recent data for each dimension available to the model gives us the ability to capture more complex anomaly patterns that might occur.

If we were to just train the model on the most recent values for each dimension, the best we could reasonably hope for it to capture would be anomalies where one or more dimensions take an unusually high or low value for one time step. This is essentially not much better than a traditional approach using z-scores. (If you are interested in comparing the two, we also have a zscores collector, if, for example, you would like to start simple or cannot install the ML Python libraries the anomalies collector depends on).
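For comparison, here is roughly what such a simple z-score style check could look like - a naive, purely illustrative baseline on the raw training data (this is not the actual zscores collector, just a sketch of the idea):

In [ ]:
# naive z-score baseline: flag the latest raw value of each dimension if it is more than 3 standard
# deviations away from that dimension's training mean (illustrative only, not the zscores collector)
z_scores = (df_train.iloc[-1] - df_train.mean()) / df_train.std()
print(z_scores.abs().sort_values(ascending=False).head())
print('flagged dimensions:', list(z_scores[z_scores.abs() > 3].index))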

In [8]:
# Lets look at how the shape of our data has changed due to preprocessing
print(f'df_train shape is {df_train.shape}')
print(f'df_train_processed is {df_train_processed.shape}')
n_cols_added = len(df_train_processed.columns)-len(df_train.columns)
print(f'make_features has added {n_cols_added} new columns, one for each lags_n ({df_train.shape[1]}*{lags_n}={n_cols_added})')
df_train shape is (14399, 21)
df_train_processed is (14387, 126)
make_features has added 105 new columns, one for each lags_n (21*5=105)

So as you can see from the above output, our featurization has added a new column for each of the lags_n lags specified. We have also lost a few rows due to the differencing, smoothing, and lagging.

To be super clear, let's look at the first few rows of training data for a specific metric before and after preprocessing.

Note: Look at the last time_idx to see how the featurization works for a specific timestamp of data.

In [9]:
metric = 'system.cpu|user'
print('raw data')
display(df_train[df_train.columns[df_train.columns.str.startswith(metric)]].head(3 + lags_n + smooth_n + diffs_n))
raw data
system.cpu|user
time_idx
1604474576 0.500000
1604474577 0.505050
1604474578 1.005025
1604474579 0.502513
1604474580 0.251889
1604474581 0.251256
1604474582 0.755668
1604474583 1.005025
1604474584 0.503778
1604474585 0.501253
1604474586 0.500000
1604474587 0.753769
In [10]:
print('featurized data')
display(df_train_processed[df_train_processed.columns[df_train_processed.columns.str.startswith(metric)]].head(1))
featurized data
system.cpu|user_lag0 system.cpu|user_lag1 system.cpu|user_lag2 system.cpu|user_lag3 system.cpu|user_lag4 system.cpu|user_lag5
time_idx
1604474588 0.084805 0.08333 -0.168342 -0.084805 0.084174 0.251045
In [11]:
print('manually calculated')
df_manual_example = df_train[df_train.columns[df_train.columns.str.startswith(metric)]].copy()
# take diff
df_manual_example['diff'] = df_manual_example[metric].diff(diffs_n)
# apply smoothing
df_manual_example['smoothed'] = df_manual_example['diff'].rolling(smooth_n).mean()
display(df_manual_example.head(3 + lags_n + smooth_n + diffs_n).tail(1 + smooth_n + diffs_n))
manually calculated
system.cpu|user diff smoothed
time_idx
1604474583 1.005025 0.249358 0.251045
1604474584 0.503778 -0.501247 0.084174
1604474585 0.501253 -0.002525 -0.084805
1604474586 0.500000 -0.001253 -0.168342
1604474587 0.753769 0.253769 0.083330

Above you can see how one raw metric value is now being preprocessed to be a vector of lags_n differenced and smoothed values. It is this matrix of smoothed differences that the model will use for both training and when making predictions.

So, for example, if a chart has 2 dimensions and we have set lags_n to be 5 then our featurized 'matrix' of numbers will be a 2 x (1+5) matrix. In reality this matrix is just flattened into a feature vector of 2 * (1+5) = 12 floating point values. The cell below shows this for the system.net chart as that is an example with 2 dimensions.

In [39]:
# lets look at our first feature vector for the 'system.net' model 
chart = 'system.net'
print(df_train_processed[df_train_processed.columns[df_train_processed.columns.str.startswith(chart)]].head(1).shape)
print(df_train_processed[df_train_processed.columns[df_train_processed.columns.str.startswith(chart)]].head(1).values)
(1, 12)
[[ 37.26846759  40.19627126 -30.70439911 -40.85616748   6.4435552
   32.54618009  32.69499079 -12.82253647 -24.87193934 -14.78879801
   38.33229574   7.02717463]]

Train models

Now that we have our preprocessed training data, we will train a model for each chart using our featurized data, which represents each time step as a differenced, smoothed, and lagged feature vector for that chart.

In [13]:
# loop over each chart in scope and train a model for each
for chart in charts_in_scope:
    # pull out the columns relating to the chart based on what their name starts with and put them into a numpy array of values
    X_train = df_train_processed[df_train_processed.columns[df_train_processed.columns.str.startswith(chart)]].values
    print(f'train model for {chart} using X_train of {X_train.shape}')
    # call the fit() method on each initialized model and pass it the full numpy array of our featurized training data
    models[chart] = models[chart].fit(X_train)
train model for system.cpu using X_train of (14387, 54)
train model for system.load using X_train of (14387, 18)
train model for system.net using X_train of (14387, 12)
train model for system.io using X_train of (14387, 12)
train model for system.ip using X_train of (14387, 12)
train model for system.idlejitter using X_train of (14387, 18)

So we have now trained our models, one for each chart, based on our preprocessed training data. To be concrete we will look at some example observations our models have been trained on.

In [14]:
# lets look at the first matrix or "feature vector" for our first model
obs_n = 0
model_n = 0
print(f'timestamp={df_train_processed[df_train_processed.columns[df_train_processed.columns.str.startswith(charts_in_scope[model_n])]].index[obs_n]}')
print(f'feature vector for {obs_n}th training observation for {charts_in_scope[model_n]} model:')
print(df_train_processed[df_train_processed.columns[df_train_processed.columns.str.startswith(charts_in_scope[model_n])]].values[obs_n]) 
timestamp=1604474588
feature vector for 0th training observation for system.cpu model:
[ 0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
 -0.0835422   0.0837521   0.          0.0835422   0.          0.
  0.          0.          0.          0.          0.          0.
  0.08480479  0.16708229  0.16582914 -0.00084172 -0.08333017 -0.0004219
  0.08480479  0.08333017 -0.16834172 -0.08480479  0.084174    0.25104532]
In [15]:
# and the next one
obs_n = 1
model_n = 0
print(f'timestamp={df_train_processed[df_train_processed.columns[df_train_processed.columns.str.startswith(charts_in_scope[model_n])]].index[obs_n]}')
print(f'feature vector for {obs_n}th training observation for {charts_in_scope[model_n]} model:')
print(df_train_processed[df_train_processed.columns[df_train_processed.columns.str.startswith(charts_in_scope[model_n])]].values[obs_n]) 
timestamp=1604474589
feature vector for 1th training observation for system.cpu model:
[ 0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.         -0.0835422   0.0837521   0.          0.0835422   0.
  0.          0.          0.          0.          0.          0.
 -0.08144416  0.08480479  0.16708229  0.16582914 -0.00084172 -0.08333017
 -0.08270361  0.08480479  0.08333017 -0.16834172 -0.08480479  0.084174  ]

If you look closely enough at the above two cells you will see the same values being shifted along by one position for each lag.

Each matrix of numbers above is the representation we give to our model of each time step. This is how the model views each chart - a matrix ("feature vector" to sound fancy) of floating point numbers encoding some differenced and smoothed information about the last lags_n observations for each dimension in the specific chart we are modeling.

Note: Within the anomalies collector, at some regular interval, as defined by train_every_n in the anomalies.conf file, we will repeat the above training step to retrain all models on the most recent window of available training data.
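To make that note a little more concrete, a retraining step of that flavour could look something like the sketch below (retrain_models and the train_every_n usage here are just illustrative for this notebook - the real scheduling lives inside the collector itself):

In [ ]:
# illustrative sketch only - the real retraining schedule is handled by train_every_n inside the collector
def retrain_models(models, host, charts_in_scope, train_n_secs, offset_n_secs, lags_n, diffs_n, smooth_n):
    """Refit each chart's model on the most recent window of training data."""
    before = int(datetime.now().timestamp()) - offset_n_secs
    after = before - train_n_secs
    df_train = get_data(hosts=host, charts=charts_in_scope, after=after, before=before,
                        sort_cols=True, numeric_only=True, float_size='float32')
    df_train_processed = make_features(df_train, lags_n, diffs_n, smooth_n)
    for chart in charts_in_scope:
        X_train = df_train_processed[df_train_processed.columns[df_train_processed.columns.str.startswith(chart)]].values
        models[chart] = models[chart].fit(X_train)
    return models

# e.g. inside a long running loop you might call this every train_every_n iterations:
# if iteration % train_every_n == 0:
#     models = retrain_models(models, host, charts_in_scope, train_n_secs, offset_n_secs, lags_n, diffs_n, smooth_n)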

Visualize Training Data & Training Anomaly Scores

Now that we have covered what our feature vectors look like and trained our models, let's see if we can visualize things a little to help our intuition of what's going on.

To do this we will use our trained model to get back an anomaly score for each of our training observations.

Then we will visualize a random sample of training data alongside the most highly scored training data (i.e. most 'anomalous') and we should expect to see stark differences.

In [40]:
# lets pick a chart/model to look at
chart = 'system.cpu'

# pick a number of random samples to look at
n_sample_random = 40
# look also at the top n most anomalously scored training data
n_high_score = 20

# get all the training data for that model
X_train = df_train_processed[df_train_processed.columns[df_train_processed.columns.str.startswith(chart)]].values
X_train_timestamps = df_train_processed[df_train_processed.columns[df_train_processed.columns.str.startswith(chart)]].index.values
n_features = X_train.shape[1]

# score all our training data using the decision_function() method of the trained model - this will give us the raw internal anomaly score of the trained model
X_train_scores = models[chart].decision_function(X_train)

# determine some useful indices we can use to pull out the data we want
idx_random = np.random.randint(X_train.shape[0], size=n_sample_random)  # sample n_sample_random random rows from the full training data
idx_high = np.argpartition(X_train_scores, -n_high_score)[-n_high_score:]
idx = idx_random.tolist() + idx_high.tolist()

print(X_train.shape)
print(X_train_scores.shape)

X_train_random = X_train[idx_random,:]
print(X_train_random.shape)

X_train_high = X_train[idx_high,:]
print(X_train_high.shape)

X_train_examples = X_train[idx,:]
print(X_train_examples.shape)
(14387, 54)
(14387,)
(40, 54)
(20, 54)
(60, 54)

First we will just plot the random sample of our feature vectors as a line for each observation.

Note: The x-axis here does not really mean much - it is just the latest and lagged lags_n values for each dimension on the chart. So this line plot is just a way to visualize the raw feature data for comparing observations.

In [41]:
x_axis = np.array([n for n in range(0, n_features)])
plt.figure(figsize=(15,10))
p1 = plt.plot(x_axis, X_train_random.T)
plt.xlim(0, n_features)
plt.title(f'Random sample of {n_sample_random} training data feature vectors.')
plt.show()

Now let's make the same plot as above, but also include the feature vectors of the training observations that had the highest anomaly scores.

In [42]:
plt.figure(figsize=(15,10))
p1 = plt.plot(x_axis, X_train_examples.T)
plt.xlim(0, n_features)
plt.title(f'Random sample of {n_sample_random} training data feature vectors & top {n_high_score} highest scored feature vectors in training data')
plt.show()

Now in the above plot we can see that the highest scored feature vectors or 'lines' that we added in stand out and look quite 'different' to how the plot looked before.

Note: Don't forget to look at how the y-axis is also very different between the two plots.

Another way to show this is via a heatmap of our random training data stacked alongside our highly anomalous training data. So the data above the line in the plot below is our random sample of observations, while that below it is our highly scored training data.

You should be able to see a clear difference, even if it's not very interpretable to us what the differences in numbers actually represent visually in terms of how they might look in the normal Netdata dashboard - we left that world behind a couple of cells back and are now looking at things in terms of how the model sees our data.

In [43]:
fig, ax = plt.subplots(figsize=(10, 16))
ax.imshow(X_train_examples, interpolation='nearest', cmap='Reds')
plt.title(f'Feature heatmap of random sample of {n_sample_random} random training observations & {n_high_score} most anomalous training observations')
plt.plot([n_sample_random for n in range(1,n_features+1)])
plt.show()

Let's pick the top few anomalies from the training data and highlight them in the Netdata dashboard to see how things look in the normal dashboard view.

In [44]:
for n in range(5):
    if chart == 'system.net':
        submenu = '#menu_system_submenu_network'
    elif chart == 'system.ip':
        submenu = '#menu_system_submenu_network'
    elif chart == 'system.cpu':
        submenu = '#menu_system_submenu_cpu'
    elif chart == 'system.load':
        submenu = '#menu_system_submenu_load'
    elif chart == 'system.io':
        submenu = '#menu_system_submenu_disk'
    elif chart == 'system.idlejitter':
        submenu = '#menu_system_submenu_idlejitter'
    after = (X_train_timestamps[idx_high][n]*1000)-(60000*5)
    before = (X_train_timestamps[idx_high][n]*1000)+(10000*2)
    highlight_after = (X_train_timestamps[idx_high][n]*1000)-10000
    highlight_before = (X_train_timestamps[idx_high][n]*1000)+10000
    print(f'({n}) http://{host}/{submenu};after={after};before={before};highlight_after={highlight_after};highlight_before={highlight_before}')
(0) http://london.my-netdata.io/#menu_system_submenu_cpu;after=1604480183000;before=1604480503000;highlight_after=1604480473000;highlight_before=1604480493000
(1) http://london.my-netdata.io/#menu_system_submenu_cpu;after=1604480181000;before=1604480501000;highlight_after=1604480471000;highlight_before=1604480491000
(2) http://london.my-netdata.io/#menu_system_submenu_cpu;after=1604480180000;before=1604480500000;highlight_after=1604480470000;highlight_before=1604480490000
(3) http://london.my-netdata.io/#menu_system_submenu_cpu;after=1604480182000;before=1604480502000;highlight_after=1604480472000;highlight_before=1604480492000
(4) http://london.my-netdata.io/#menu_system_submenu_cpu;after=1604479154000;before=1604479474000;highlight_after=1604479444000;highlight_before=1604479464000

Get prediction data

Now that we have trained models for each chart we can use them on incoming observations and 'ask' the trained models how 'unusual' they think each one is.

In [21]:
# define an empty dataframe to store enough recent data to generate our feature vectors from
df_recent = pd.DataFrame()

# simulate n_prediction_steps of getting the latest data, making a feature vector and getting predictions
for prediction_step in range(n_prediction_steps):
    time.sleep(1) # sleep for a second to ensure getting a new timestamp from the host
    df_latest = get_allmetrics(host=host, charts=charts_in_scope, wide=True)[df_train.columns]
    df_latest['time_idx'] = int(time.time())
    df_latest = df_latest.set_index('time_idx')
    # just keep enough recent data to generate each feature vector
    df_recent = df_recent.append(df_latest).tail((lags_n + smooth_n + diffs_n) * 2)
    
    # now lets featurize our recent data to be able to get predictions from the model for each observation
    df_predict_processed = make_features(df_recent, lags_n, diffs_n, smooth_n)

print(f'we now have {df_predict_processed.shape[0]} recent preprocessed feature vectors to predict on.')
we now have 10 recent preprocessed feature vectors to predict on.
In [22]:
print(df_predict_processed.shape)
df_predict_processed.head()
(10, 126)
Out[22]:
system.cpu|guest_lag0 system.cpu|guest_lag1 system.cpu|guest_lag2 system.cpu|guest_lag3 system.cpu|guest_lag4 system.cpu|guest_lag5 system.cpu|guest_nice_lag0 system.cpu|guest_nice_lag1 system.cpu|guest_nice_lag2 system.cpu|guest_nice_lag3 ... system.net|received_lag2 system.net|received_lag3 system.net|received_lag4 system.net|received_lag5 system.net|sent_lag0 system.net|sent_lag1 system.net|sent_lag2 system.net|sent_lag3 system.net|sent_lag4 system.net|sent_lag5
time_idx
1604488989 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 20.950889 -48.192025 -15.095592 51.158781 33.586547 13.133732 42.412212 -15.099732 26.433434 -54.686425
1604488991 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 30.414337 20.950889 -48.192025 -15.095592 2.353946 33.586547 13.133732 42.412212 -15.099732 26.433434
1604488992 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 1.330481 30.414337 20.950889 -48.192025 -93.153763 2.353946 33.586547 13.133732 42.412212 -15.099732
1604488993 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... -72.029869 1.330481 30.414337 20.950889 -1.209190 -93.153763 2.353946 33.586547 13.133732 42.412212
1604488994 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... -12.498062 -72.029869 1.330481 30.414337 -23.862861 -1.209190 -93.153763 2.353946 33.586547 13.133732

5 rows × 126 columns

The above featurized prediction data should be identical in terms of structure and schema to the featurized training data we explored above. This is what is expected by the model.
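A quick way to convince yourself of that is to check that the columns line up exactly (just a sanity check for this notebook):

In [ ]:
# sanity check: the prediction feature vectors should have exactly the same columns, in the same order,
# as the feature vectors the models were trained on
print(list(df_predict_processed.columns) == list(df_train_processed.columns))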

Get predictions

In [23]:
# for each recent feature vector, get a prediction
for time_idx, row in df_predict_processed.iterrows():
    
    print('-'*100)
    print(f'predictions for time {time_idx}')
    
    # convert our row into the expected 'flattened' feature vector
    df_tmp = row.to_frame().transpose()
    
    for model in models:
        
        # pull out relevant array of features for the model in question
        X_predict = df_tmp[df_tmp.columns[df_tmp.columns.str.startswith(model)]].values
        
        # call the predict_proba() and predict() methods on the trained data in order to make a prediction
        anomaly_probability = round(models[model].predict_proba(X_predict)[-1][1],4)
        anomaly_flag = models[model].predict(X_predict)[-1]
        
        print(f'model={model}, anomaly_probability={anomaly_probability}, anomaly_flag={anomaly_flag}')
----------------------------------------------------------------------------------------------------
predictions for time 1604488989
model=system.cpu, anomaly_probability=0.0291, anomaly_flag=0
model=system.load, anomaly_probability=0.0822, anomaly_flag=0
model=system.net, anomaly_probability=0.007, anomaly_flag=0
model=system.io, anomaly_probability=0.0014, anomaly_flag=0
model=system.ip, anomaly_probability=0.0102, anomaly_flag=0
model=system.idlejitter, anomaly_probability=0.0527, anomaly_flag=0
----------------------------------------------------------------------------------------------------
predictions for time 1604488991
model=system.cpu, anomaly_probability=0.0355, anomaly_flag=0
model=system.load, anomaly_probability=0.0658, anomaly_flag=0
model=system.net, anomaly_probability=0.0082, anomaly_flag=0
model=system.io, anomaly_probability=0.0014, anomaly_flag=0
model=system.ip, anomaly_probability=0.0077, anomaly_flag=0
model=system.idlejitter, anomaly_probability=0.0563, anomaly_flag=0
----------------------------------------------------------------------------------------------------
predictions for time 1604488992
model=system.cpu, anomaly_probability=0.0416, anomaly_flag=0
model=system.load, anomaly_probability=0.0438, anomaly_flag=0
model=system.net, anomaly_probability=0.013, anomaly_flag=0
model=system.io, anomaly_probability=0.0014, anomaly_flag=0
model=system.ip, anomaly_probability=0.0143, anomaly_flag=0
model=system.idlejitter, anomaly_probability=0.0489, anomaly_flag=0
----------------------------------------------------------------------------------------------------
predictions for time 1604488993
model=system.cpu, anomaly_probability=0.0386, anomaly_flag=0
model=system.load, anomaly_probability=0.0068, anomaly_flag=0
model=system.net, anomaly_probability=0.0153, anomaly_flag=0
model=system.io, anomaly_probability=0.0014, anomaly_flag=0
model=system.ip, anomaly_probability=0.0138, anomaly_flag=0
model=system.idlejitter, anomaly_probability=0.034, anomaly_flag=0
----------------------------------------------------------------------------------------------------
predictions for time 1604488994
model=system.cpu, anomaly_probability=0.0329, anomaly_flag=0
model=system.load, anomaly_probability=0.0078, anomaly_flag=0
model=system.net, anomaly_probability=0.0152, anomaly_flag=0
model=system.io, anomaly_probability=0.0014, anomaly_flag=0
model=system.ip, anomaly_probability=0.0124, anomaly_flag=0
model=system.idlejitter, anomaly_probability=0.0307, anomaly_flag=0
----------------------------------------------------------------------------------------------------
predictions for time 1604488995
model=system.cpu, anomaly_probability=0.0347, anomaly_flag=0
model=system.load, anomaly_probability=0.006, anomaly_flag=0
model=system.net, anomaly_probability=0.0119, anomaly_flag=0
model=system.io, anomaly_probability=0.0014, anomaly_flag=0
model=system.ip, anomaly_probability=0.0062, anomaly_flag=0
model=system.idlejitter, anomaly_probability=0.0275, anomaly_flag=0
----------------------------------------------------------------------------------------------------
predictions for time 1604488996
model=system.cpu, anomaly_probability=0.0356, anomaly_flag=0
model=system.load, anomaly_probability=0.0036, anomaly_flag=0
model=system.net, anomaly_probability=0.0051, anomaly_flag=0
model=system.io, anomaly_probability=0.0014, anomaly_flag=0
model=system.ip, anomaly_probability=0.0051, anomaly_flag=0
model=system.idlejitter, anomaly_probability=0.0304, anomaly_flag=0
----------------------------------------------------------------------------------------------------
predictions for time 1604488998
model=system.cpu, anomaly_probability=0.031, anomaly_flag=0
model=system.load, anomaly_probability=0.0012, anomaly_flag=0
model=system.net, anomaly_probability=0.0068, anomaly_flag=0
model=system.io, anomaly_probability=0.0014, anomaly_flag=0
model=system.ip, anomaly_probability=0.0119, anomaly_flag=0
model=system.idlejitter, anomaly_probability=0.0241, anomaly_flag=0
----------------------------------------------------------------------------------------------------
predictions for time 1604488999
model=system.cpu, anomaly_probability=0.023, anomaly_flag=0
model=system.load, anomaly_probability=0.0012, anomaly_flag=0
model=system.net, anomaly_probability=0.0121, anomaly_flag=0
model=system.io, anomaly_probability=0.0014, anomaly_flag=0
model=system.ip, anomaly_probability=0.0163, anomaly_flag=0
model=system.idlejitter, anomaly_probability=0.0198, anomaly_flag=0
----------------------------------------------------------------------------------------------------
predictions for time 1604489000
model=system.cpu, anomaly_probability=0.0226, anomaly_flag=0
model=system.load, anomaly_probability=0.0018, anomaly_flag=0
model=system.net, anomaly_probability=0.0171, anomaly_flag=0
model=system.io, anomaly_probability=0.0014, anomaly_flag=0
model=system.ip, anomaly_probability=0.0193, anomaly_flag=0
model=system.idlejitter, anomaly_probability=0.0184, anomaly_flag=0

In the above we should generally see low anomaly_probability values (assuming nothing has blown up on the host you used between the time you ran the training cells above and the predictions above).

Let's just do one last little thing to try to show what is going on here and why we put so much effort and focus into the featurization above.

We will take some of the feature vectors we predicted on for each model, randomly shuffle the values around so as to make an unusual looking observation, and see what sort of anomaly probability that gives us (hint: it should be higher than those above :) ).

In [24]:
# sample some prediction data and shuffle values
n_predict_sample = 5
X_predict_sample = df_predict_processed.sample(n_predict_sample).values.T
np.random.shuffle(X_predict_sample)
X_predict_sample = X_predict_sample.T
df_predict_shuffled = pd.DataFrame(X_predict_sample, columns=df_predict_processed.columns)
for row in df_predict_shuffled.index:
    print('-'*150)
    for model in models:
        X_predict = df_predict_shuffled.loc[row:row][df_predict_shuffled.columns[df_predict_shuffled.columns.str.startswith(model)]].values
        anomaly_probability = round(models[model].predict_proba(X_predict)[-1][1],4)
        anomaly_flag = models[model].predict(X_predict)[-1]
        print(f'model={model}, anomaly_probability={anomaly_probability}, anomaly_flag={anomaly_flag}')
------------------------------------------------------------------------------------------------------------------------------------------------------
model=system.cpu, anomaly_probability=1.0, anomaly_flag=1
model=system.load, anomaly_probability=1.0, anomaly_flag=1
model=system.net, anomaly_probability=0.0119, anomaly_flag=0
model=system.io, anomaly_probability=0.0058, anomaly_flag=0
model=system.ip, anomaly_probability=0.0085, anomaly_flag=0
model=system.idlejitter, anomaly_probability=0.0248, anomaly_flag=0
------------------------------------------------------------------------------------------------------------------------------------------------------
model=system.cpu, anomaly_probability=1.0, anomaly_flag=1
model=system.load, anomaly_probability=1.0, anomaly_flag=1
model=system.net, anomaly_probability=0.0063, anomaly_flag=0
model=system.io, anomaly_probability=0.0063, anomaly_flag=0
model=system.ip, anomaly_probability=0.0103, anomaly_flag=0
model=system.idlejitter, anomaly_probability=0.1029, anomaly_flag=0
------------------------------------------------------------------------------------------------------------------------------------------------------
model=system.cpu, anomaly_probability=1.0, anomaly_flag=1
model=system.load, anomaly_probability=1.0, anomaly_flag=1
model=system.net, anomaly_probability=0.0093, anomaly_flag=0
model=system.io, anomaly_probability=0.0028, anomaly_flag=0
model=system.ip, anomaly_probability=0.0099, anomaly_flag=0
model=system.idlejitter, anomaly_probability=0.1332, anomaly_flag=0
------------------------------------------------------------------------------------------------------------------------------------------------------
model=system.cpu, anomaly_probability=1.0, anomaly_flag=1
model=system.load, anomaly_probability=1.0, anomaly_flag=1
model=system.net, anomaly_probability=0.0128, anomaly_flag=0
model=system.io, anomaly_probability=0.0053, anomaly_flag=0
model=system.ip, anomaly_probability=0.0064, anomaly_flag=0
model=system.idlejitter, anomaly_probability=0.0155, anomaly_flag=0
------------------------------------------------------------------------------------------------------------------------------------------------------
model=system.cpu, anomaly_probability=1.0, anomaly_flag=1
model=system.load, anomaly_probability=1.0, anomaly_flag=1
model=system.net, anomaly_probability=0.0074, anomaly_flag=0
model=system.io, anomaly_probability=0.0138, anomaly_flag=0
model=system.ip, anomaly_probability=0.0078, anomaly_flag=0
model=system.idlejitter, anomaly_probability=0.2852, anomaly_flag=0

We should see some higher anomaly probabilities above than in the actual real predictions we had previously made.

But what is the model?

To lift the lid a little on what the model actually is and how it calculates anomaly probabilities, let's take a look inside one trained model.

In [25]:
# lets pick our first trained model
model_to_explore = charts_in_scope[0]

print(f'model for chart {model_to_explore}:')
models[model_to_explore].__dict__
model for chart system.cpu:
Out[25]:
{'contamination': 0.001,
 'n_components': 2,
 'n_selected_components': 2,
 'copy': True,
 'whiten': False,
 'svd_solver': 'auto',
 'tol': 0.0,
 'iterated_power': 'auto',
 'random_state': None,
 'weighted': True,
 'standardization': True,
 '_classes': 2,
 'scaler_': StandardScaler(),
 'detector_': PCA(n_components=2),
 'n_components_': 2,
 'components_': array([[ 2.80082352e-18,  3.03794228e-18, -2.70731444e-19,
         -3.92363821e-19,  1.48369551e-19, -3.63283111e-19,
         -4.39041383e-22, -8.31401705e-21,  1.39900394e-20,
          1.30685531e-20, -9.11087543e-21,  0.00000000e+00,
         -2.43885139e-01, -3.29203940e-01, -1.73697634e-01,
          1.29799362e-01,  3.23636716e-01,  2.79958132e-01,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         -1.51045154e-01, -8.76462672e-02,  6.10211833e-02,
          1.98659891e-01,  2.27264690e-01,  1.49883841e-01,
         -1.12372782e-01, -9.74261497e-02, -3.51807526e-02,
          8.00901689e-02,  1.02115746e-01,  7.26665255e-02,
         -1.24195520e-03, -1.22739202e-02, -1.72542528e-02,
         -3.10527251e-03,  6.40218609e-03,  1.38531055e-02,
         -2.85767294e-01, -2.95111985e-01, -1.27987592e-01,
          1.47699610e-01,  3.13336131e-01,  2.93640283e-01,
         -4.51371443e-02, -6.59597548e-02, -7.06319548e-02,
         -1.24793697e-02,  2.99234105e-02,  6.52240921e-02],
        [-1.60824947e-17, -1.14827710e-17,  9.61490236e-19,
         -3.01121753e-19,  9.15825774e-20, -1.79272007e-19,
         -3.09077873e-20, -2.81027414e-20, -4.26944492e-21,
          1.35805909e-20, -1.02490202e-20, -0.00000000e+00,
          9.19937111e-02, -1.01829472e-01, -2.77808201e-01,
         -2.74783232e-01, -9.60873349e-02,  9.77127933e-02,
         -0.00000000e+00, -0.00000000e+00, -0.00000000e+00,
         -0.00000000e+00, -0.00000000e+00, -0.00000000e+00,
         -1.93270784e-01, -3.33926708e-01, -3.73609183e-01,
         -2.84145201e-01, -1.33619732e-01, -1.36922292e-02,
         -1.04312593e-03, -7.56532493e-02, -1.20059231e-01,
         -9.40272399e-02, -2.46077516e-02,  3.28380477e-02,
          1.53039332e-02,  9.98811847e-03, -1.30719454e-03,
         -1.23054923e-02, -1.04461356e-02, -2.39789225e-03,
         -4.56703551e-02, -2.42918118e-01, -3.80897606e-01,
         -3.54023992e-01, -1.80861638e-01,  1.67598654e-02,
          1.10389117e-02, -1.99933307e-02, -6.19464953e-02,
         -1.01962973e-01, -9.31777410e-02, -5.72142476e-02]]),
 'n_selected_components_': 2,
 'w_components_': array([0.09782853, 0.08702647]),
 'selected_components_': array([[ 2.80082352e-18,  3.03794228e-18, -2.70731444e-19,
         -3.92363821e-19,  1.48369551e-19, -3.63283111e-19,
         -4.39041383e-22, -8.31401705e-21,  1.39900394e-20,
          1.30685531e-20, -9.11087543e-21,  0.00000000e+00,
         -2.43885139e-01, -3.29203940e-01, -1.73697634e-01,
          1.29799362e-01,  3.23636716e-01,  2.79958132e-01,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         -1.51045154e-01, -8.76462672e-02,  6.10211833e-02,
          1.98659891e-01,  2.27264690e-01,  1.49883841e-01,
         -1.12372782e-01, -9.74261497e-02, -3.51807526e-02,
          8.00901689e-02,  1.02115746e-01,  7.26665255e-02,
         -1.24195520e-03, -1.22739202e-02, -1.72542528e-02,
         -3.10527251e-03,  6.40218609e-03,  1.38531055e-02,
         -2.85767294e-01, -2.95111985e-01, -1.27987592e-01,
          1.47699610e-01,  3.13336131e-01,  2.93640283e-01,
         -4.51371443e-02, -6.59597548e-02, -7.06319548e-02,
         -1.24793697e-02,  2.99234105e-02,  6.52240921e-02],
        [-1.60824947e-17, -1.14827710e-17,  9.61490236e-19,
         -3.01121753e-19,  9.15825774e-20, -1.79272007e-19,
         -3.09077873e-20, -2.81027414e-20, -4.26944492e-21,
          1.35805909e-20, -1.02490202e-20, -0.00000000e+00,
          9.19937111e-02, -1.01829472e-01, -2.77808201e-01,
         -2.74783232e-01, -9.60873349e-02,  9.77127933e-02,
         -0.00000000e+00, -0.00000000e+00, -0.00000000e+00,
         -0.00000000e+00, -0.00000000e+00, -0.00000000e+00,
         -1.93270784e-01, -3.33926708e-01, -3.73609183e-01,
         -2.84145201e-01, -1.33619732e-01, -1.36922292e-02,
         -1.04312593e-03, -7.56532493e-02, -1.20059231e-01,
         -9.40272399e-02, -2.46077516e-02,  3.28380477e-02,
          1.53039332e-02,  9.98811847e-03, -1.30719454e-03,
         -1.23054923e-02, -1.04461356e-02, -2.39789225e-03,
         -4.56703551e-02, -2.42918118e-01, -3.80897606e-01,
         -3.54023992e-01, -1.80861638e-01,  1.67598654e-02,
          1.10389117e-02, -1.99933307e-02, -6.19464953e-02,
         -1.01962973e-01, -9.31777410e-02, -5.72142476e-02]]),
 'selected_w_components_': array([0.09782853, 0.08702647]),
 'decision_scores_': array([68.8878554 , 63.17257343, 67.21956501, ..., 75.31682611,
        76.71194073, 74.30143433]),
 'threshold_': 1311.4654267524904,
 'labels_': array([0, 0, 0, ..., 0, 0, 0]),
 '_mu': 89.38196355699694,
 '_sigma': 97.22056202996157}

So above we see the various objects, mostly arrays of numbers, that make up the internal state of the trained model.

Note: for different types of models you will see different things in here. The below cells all assume model='pca' to walk through the example of how the default PCA model calculates an anomaly probability. Also, for simplicity, when we initialized the PCA model we set n_selected_components=2 to make the calculations below easier to follow, by telling PyOD to just use 2 principal components when calculating the anomaly scores. In the actual anomalies collector we use all the principal components.

For PCA the main things of relevance below will be:

  • selected_components_: The actual principal components we want to use when calculating the anomaly score (aka decision_score).
  • selected_w_components_: The weights (explained variance ratios) applied to each selected component - since the distances are divided by these weights, components explaining less variance end up having more influence on the score (you can see this in the fit() source below).
  • decision_scores_: The raw anomaly scores on all of the training observations, used to convert the anomaly score of new data into something that looks more like an anomaly probability.

Let's look at the 'training' source code

Let's look at what the PCA model does when it trains a model. We can see from the below that most of what is going on is fitting a PCA from scikit-learn to the training data X.

In [26]:
PCA.fit??
Signature: PCA.fit(self, X, y=None)
Source:   
    def fit(self, X, y=None):
        """Fit detector. y is ignored in unsupervised methods.

        Parameters
        ----------
        X : numpy array of shape (n_samples, n_features)
            The input samples.

        y : Ignored
            Not used, present for API consistency by convention.

        Returns
        -------
        self : object
            Fitted estimator.
        """
        # validate inputs X and y (optional)
        X = check_array(X)
        self._set_n_classes(y)

        # PCA is recommended to use on the standardized data (zero mean and
        # unit variance).
        if self.standardization:
            X, self.scaler_ = standardizer(X, keep_scalar=True)

        self.detector_ = sklearn_PCA(n_components=self.n_components,
                                     copy=self.copy,
                                     whiten=self.whiten,
                                     svd_solver=self.svd_solver,
                                     tol=self.tol,
                                     iterated_power=self.iterated_power,
                                     random_state=self.random_state)
        self.detector_.fit(X=X, y=y)

        # copy the attributes from the sklearn PCA object
        self.n_components_ = self.detector_.n_components_
        self.components_ = self.detector_.components_

        # validate the number of components to be used for outlier detection
        if self.n_selected_components is None:
            self.n_selected_components_ = self.n_components_
        else:
            self.n_selected_components_ = self.n_selected_components
        check_parameter(self.n_selected_components_, 1, self.n_components_,
                        include_left=True, include_right=True,
                        param_name='n_selected_components_')

        # use eigenvalues as the weights of eigenvectors
        self.w_components_ = np.ones([self.n_components_, ])
        if self.weighted:
            self.w_components_ = self.detector_.explained_variance_ratio_

        # outlier scores is the sum of the weighted distances between each
        # sample to the eigenvectors. The eigenvectors with smaller
        # eigenvalues have more influence
        # Not all eigenvectors are used, only n_selected_components_ smallest
        # are used since they better reflect the variance change

        self.selected_components_ = self.components_[
                                    -1 * self.n_selected_components_:, :]
        self.selected_w_components_ = self.w_components_[
                                      -1 * self.n_selected_components_:]

        self.decision_scores_ = np.sum(
            cdist(X, self.selected_components_) / self.selected_w_components_,
            axis=1).ravel()

        self._process_decision_scores()
        return self
File:      c:\users\andre\documents\repos\netdata-community\netdata-agent-api\netdata-pandas\venv\lib\site-packages\pyod\models\pca.py
Type:      function

Let's look at the 'prediction' source code

Let's look at the source code for generating the anomaly probabilities.

In [27]:
PCA.predict_proba??
Signature: PCA.predict_proba(self, X, method='linear')
Source:   
    def predict_proba(self, X, method='linear'):
        """Predict the probability of a sample being outlier. Two approaches
        are possible:

        1. simply use Min-max conversion to linearly transform the outlier
           scores into the range of [0,1]. The model must be
           fitted first.
        2. use unifying scores, see :cite:`kriegel2011interpreting`.

        Parameters
        ----------
        X : numpy array of shape (n_samples, n_features)
            The input samples.

        method : str, optional (default='linear')
            probability conversion method. It must be one of
            'linear' or 'unify'.

        Returns
        -------
        outlier_probability : numpy array of shape (n_samples,)
            For each observation, tells whether or not
            it should be considered as an outlier according to the
            fitted model. Return the outlier probability, ranging
            in [0,1].
        """

        check_is_fitted(self, ['decision_scores_', 'threshold_', 'labels_'])
        train_scores = self.decision_scores_

        test_scores = self.decision_function(X)

        probs = np.zeros([X.shape[0], int(self._classes)])
        if method == 'linear':
            scaler = MinMaxScaler().fit(train_scores.reshape(-1, 1))
            probs[:, 1] = scaler.transform(
                test_scores.reshape(-1, 1)).ravel().clip(0, 1)
            probs[:, 0] = 1 - probs[:, 1]
            return probs

        elif method == 'unify':
            # turn output into probability
            pre_erf_score = (test_scores - self._mu) / (
                    self._sigma * np.sqrt(2))
            erf_score = erf(pre_erf_score)
            probs[:, 1] = erf_score.clip(0, 1).ravel()
            probs[:, 0] = 1 - probs[:, 1]
            return probs
        else:
            raise ValueError(method,
                             'is not a valid probability conversion method')
File:      c:\users\andre\documents\repos\netdata-community\netdata-agent-api\netdata-pandas\venv\lib\site-packages\pyod\models\base.py
Type:      function
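
Note that predict_proba() supports two conversion methods: the default 'linear' min-max rescaling (which is what we will recreate below), and a 'unify' option that standardizes the score against the training scores and pushes it through an error function instead. Purely as a sketch (not executed here), the alternative call would look something like this, using the models dict, model_to_explore and feature vector X that appear later in this notebook:

In [ ]:
# just a sketch, not executed: the same probability call but using the 'unify'
# conversion method shown in the source above (error function based) instead
# of the default min-max 'linear' method. 'models', 'model_to_explore' and 'X'
# are the objects used elsewhere in this notebook.
# models[model_to_explore].predict_proba(X, method='unify')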

We can see its getting scores from some decision_function() method, so lets look at that.

In [28]:
PCA.decision_function??
Signature: PCA.decision_function(self, X)
Source:   
    def decision_function(self, X):
        """Predict raw anomaly score of X using the fitted detector.

        The anomaly score of an input sample is computed based on different
        detector algorithms. For consistency, outliers are assigned with
        larger anomaly scores.

        Parameters
        ----------
        X : numpy array of shape (n_samples, n_features)
            The training input samples. Sparse matrices are accepted only
            if they are supported by the base estimator.

        Returns
        -------
        anomaly_scores : numpy array of shape (n_samples,)
            The anomaly score of the input samples.
        """
        check_is_fitted(self, ['components_', 'w_components_'])

        X = check_array(X)
        if self.standardization:
            X = self.scaler_.transform(X)

        return np.sum(
            cdist(X, self.selected_components_) / self.selected_w_components_,
            axis=1).ravel()
File:      c:\users\andre\documents\repos\netdata-community\netdata-agent-api\netdata-pandas\venv\lib\site-packages\pyod\models\pca.py
Type:      function

We can see that these are actually relatively straightforward calculations, so lets try to step through them below.

So the cells above show the PyOD code under the hood - now lets recreate a predicted score ourselves, step by step.

Lets begin by getting our feature vector that we would like an anomaly probability for.

In [29]:
from scipy.spatial.distance import cdist
from sklearn.preprocessing import MinMaxScaler

# get our feature vector for a random observation from our prediction data
X = df_predict_processed[df_predict_processed.columns[df_predict_processed.columns.str.startswith(model_to_explore)]].sample(1).values

print('feature vector')
print(X.shape)
print(X)
feature vector
(1, 54)
[[ 0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
  -0.1675042   0.          0.          0.1675042  -0.08291873  0.
   0.          0.         -0.08333333  0.          0.          0.08333333
  -0.167924   -0.08270047 -0.0824958   0.         -0.16396713 -0.00252527
  -0.00104953  0.00168347  0.08500837  0.0837521  -0.32918117 -0.00252527]]

Lets look at the principal components the model will use to calculate our anomaly score.

In [30]:
print('selected components learned by model')
selected_components = models[model_to_explore].selected_components_
print(selected_components.shape)
print(selected_components)
selected components learned by model
(2, 54)
[[ 2.80082352e-18  3.03794228e-18 -2.70731444e-19 -3.92363821e-19
   1.48369551e-19 -3.63283111e-19 -4.39041383e-22 -8.31401705e-21
   1.39900394e-20  1.30685531e-20 -9.11087543e-21  0.00000000e+00
  -2.43885139e-01 -3.29203940e-01 -1.73697634e-01  1.29799362e-01
   3.23636716e-01  2.79958132e-01  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  -1.51045154e-01 -8.76462672e-02  6.10211833e-02  1.98659891e-01
   2.27264690e-01  1.49883841e-01 -1.12372782e-01 -9.74261497e-02
  -3.51807526e-02  8.00901689e-02  1.02115746e-01  7.26665255e-02
  -1.24195520e-03 -1.22739202e-02 -1.72542528e-02 -3.10527251e-03
   6.40218609e-03  1.38531055e-02 -2.85767294e-01 -2.95111985e-01
  -1.27987592e-01  1.47699610e-01  3.13336131e-01  2.93640283e-01
  -4.51371443e-02 -6.59597548e-02 -7.06319548e-02 -1.24793697e-02
   2.99234105e-02  6.52240921e-02]
 [-1.60824947e-17 -1.14827710e-17  9.61490236e-19 -3.01121753e-19
   9.15825774e-20 -1.79272007e-19 -3.09077873e-20 -2.81027414e-20
  -4.26944492e-21  1.35805909e-20 -1.02490202e-20 -0.00000000e+00
   9.19937111e-02 -1.01829472e-01 -2.77808201e-01 -2.74783232e-01
  -9.60873349e-02  9.77127933e-02 -0.00000000e+00 -0.00000000e+00
  -0.00000000e+00 -0.00000000e+00 -0.00000000e+00 -0.00000000e+00
  -1.93270784e-01 -3.33926708e-01 -3.73609183e-01 -2.84145201e-01
  -1.33619732e-01 -1.36922292e-02 -1.04312593e-03 -7.56532493e-02
  -1.20059231e-01 -9.40272399e-02 -2.46077516e-02  3.28380477e-02
   1.53039332e-02  9.98811847e-03 -1.30719454e-03 -1.23054923e-02
  -1.04461356e-02 -2.39789225e-03 -4.56703551e-02 -2.42918118e-01
  -3.80897606e-01 -3.54023992e-01 -1.80861638e-01  1.67598654e-02
   1.10389117e-02 -1.99933307e-02 -6.19464953e-02 -1.01962973e-01
  -9.31777410e-02 -5.72142476e-02]]

Lets look at the weighting that the model will apply to each component when working out a weighted distance that will form the score.

In [31]:
print('selected components weights set by model')
selected_w_components = models[model_to_explore].selected_w_components_
print(selected_w_components.shape)
print(selected_w_components)
selected components weights set by model
(2,)
[0.09782853 0.08702647]
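
These weights are just the explained variance ratios learned by the underlying sklearn PCA object during fit(), sliced to the selected components (assuming the default weighted=True, as in the fit() source above). As a sketch (not executed), we could pull them straight from the fitted detector:

In [ ]:
# sketch, not executed: the selected weights should just be the explained
# variance ratios of the last n_selected_components_ components, mirroring
# the slicing done in PCA.fit() above
n_selected = models[model_to_explore].n_selected_components_
print(models[model_to_explore].detector_.explained_variance_ratio_[-n_selected:])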

Lets just call predict_proba(X) to get the probabilities we will try to recreate.

In [32]:
print('anomaly probability [p(not anomaly), p(anomaly)]')
anomaly_probability = models[model_to_explore].predict_proba(X)
print(anomaly_probability.shape)
print(anomaly_probability)
anomaly probability [p(not anomaly), p(anomaly)]
(1, 2)
[[0.96450375 0.03549625]]

Step 1: Standardize the feature vector

The first step is to standardize the data, as this is standard practice when fitting PCA.

Standardization is the default here, which is good: it means that if you define any custom models in your anomalies.conf file you dont need to worry about the dimensions being on different scales (e.g. cpu % vs disk usage etc), as that is all taken care of internally by the PyOD model.

In [33]:
X_scaled = models[model_to_explore].scaler_.transform(X)
print(X_scaled)
[[ 0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  -2.70067980e-10 -2.70067979e-10 -2.70067980e-10 -2.70067979e-10
  -2.70067980e-10 -2.70067979e-10  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   2.40674662e-10  2.40674661e-10  2.40674662e-10  2.40674661e-10
   2.40674662e-10  2.40674661e-10 -2.66664208e+00 -9.29120645e-05
  -4.65713239e-07  2.66664268e+00 -1.32018223e+00 -9.29236062e-05
  -6.02436866e-12 -6.02436881e-12 -1.55106650e+00 -6.02436888e-12
  -6.02436872e-12  1.55106650e+00 -3.86095274e-01 -1.90187398e-01
  -1.89755931e-01 -5.35431133e-05 -3.77037179e-01 -5.84708821e-03
  -5.22050433e-03  8.09092270e-03  4.15752327e-01  4.09701175e-01
  -1.61015895e+00 -1.24094202e-02]]
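
As a sanity check, the scaler_ that PyOD fits here is a standard "subtract the mean, divide by the standard deviation" transform (sklearn's StandardScaler under the hood), so we should be able to reproduce the numbers above manually. A minimal sketch (not executed), assuming scaler_ exposes the usual mean_ and scale_ attributes:

In [ ]:
# sketch, not executed: manually standardize the feature vector using the
# mean and scale learned from the training data - this should match the
# scaler_.transform(X) output above
scaler = models[model_to_explore].scaler_
X_scaled_manual = (X - scaler.mean_) / scaler.scale_
print(np.allclose(X_scaled_manual, X_scaled))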

Step 2: Calculate distance from each selected component

Now we want to work out the distance of our feature vector to each of the selected components.

This is the core part of what is going on when calculating an anomaly score using this model.

If we have a strange feature vector then its something we have not really seen before, so it will not fit well into the lower dimensional representation learned by the PCA model. Hence it will have a somewhat larger distance from the selected principal components.

In [34]:
distance = cdist(X_scaled, selected_components)
print(distance.shape)
print(distance)
(1, 2)
[[4.92401167 4.99332981]]
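
By default cdist() computes plain euclidean distances here, so each of the two numbers above is just the euclidean distance between our standardized feature vector and one of the selected components. The same calculation written out with numpy (a sketch, not executed):

In [ ]:
# sketch, not executed: each distance is just the euclidean norm of
# (standardized feature vector - component), which should match cdist() above
distance_manual = [np.sqrt(np.sum((X_scaled[0] - component) ** 2))
                   for component in selected_components]
print(distance_manual)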

Note: These distance numbers are just numbers - bigger means more distant - but on their own they are hard to interpret in terms of what they mean. So after combining them into a single score, the next step will be to go from that weighted distance score to an anomaly probability by comparing it with all the scores we saw within the training data. It's the training data that will be our yardstick for saying just how anomalous a new observation is.

Step 3: Use a weighted sum of the distances as the anomaly score

We will combine the distances into a single score by dividing each distance by its component's weight and summing the results. By default the weights are the explained variance ratios of the components, so components that explained less of the variance in the original training data end up with more influence on the score (we can see this in the PyOD fit() code above).

In [35]:
print('anomaly score')
anomaly_score = np.sum(distance / selected_w_components, axis=1).ravel()
print(anomaly_score)
anomaly score
[107.7102246]
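
To make the weighting concrete with the numbers above: each distance is divided by its component's weight before summing, so roughly 4.924 / 0.0978 ≈ 50.3 plus 4.993 / 0.0870 ≈ 57.4, giving ≈ 107.7, which matches the score above. Written out term by term (a sketch, not executed):

In [ ]:
# sketch, not executed: the weighted sum written out term by term - this
# should match the anomaly_score printed above
anomaly_score_manual = (distance[0, 0] / selected_w_components[0]
                        + distance[0, 1] / selected_w_components[1])
print(anomaly_score_manual)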

Step 4: Convert the anomaly score into an anomaly probability based on the anomaly scores of the training data

Now we will use the decision scores that were calculated when we originally fit our model - essentially the result of running this same scoring process back over the training data.

We will use the distribution of scores in the training data to re-scale our raw anomaly score into something that can pass as a probability.

Note: Strictly speaking this "anomaly probability" is not really a "real" probability in the sense of being the output of some probabilistic process we have statistically modeled directly. Rather, its just a sensible re-scaling of our raw score based on what we saw in the training data. So a high 'probability' here really just means an unusual value relative to the training data.
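
Under the hood, MinMaxScaler (with its default [0, 1] range) just rescales a score relative to the minimum and maximum anomaly scores seen in the training data, with anything beyond the training maximum clipped to 1. A minimal sketch (not executed) of that formula, before we run the full version below:

In [ ]:
# sketch, not executed: the min-max rescaling that MinMaxScaler does for us
# in the cell below, written out as an explicit formula
train_scores = models[model_to_explore].decision_scores_
p_anomaly = (anomaly_score - train_scores.min()) / (train_scores.max() - train_scores.min())
print(np.clip(p_anomaly, 0, 1))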

In [36]:
# get the raw anomaly scores from the training data
train_anomaly_scores = models[model_to_explore].decision_scores_

# create empty array for probabilities to go into
anomaly_probability_manual_calc = np.zeros([X.shape[0], int(models[model_to_explore]._classes)])

# create a scaler to rescale raw anomaly score to look more like a probability and be on the 0, 1 range
scaler = MinMaxScaler().fit(train_anomaly_scores.reshape(-1, 1))

# transform anomaly score into a probability by rescaling it based on the training data and clipping at 1
anomaly_probability_manual_calc[:, 1] = scaler.transform(anomaly_score.reshape(-1, 1)).ravel().clip(0, 1)

# use 1 - p(anomaly) as p(not anomaly)
anomaly_probability_manual_calc[:, 0] = 1 - anomaly_probability_manual_calc[:, 1]

print(anomaly_probability_manual_calc)
[[0.96450375 0.03549625]]

Do they match?

In [37]:
# do they match?
print(anomaly_probability)
print(anomaly_probability_manual_calc)
print(anomaly_probability == anomaly_probability_manual_calc)
assert np.sum(anomaly_probability == anomaly_probability_manual_calc) == 2
[[0.96450375 0.03549625]]
[[0.96450375 0.03549625]]
[[ True  True]]
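
To wrap up, here is a small helper (just a sketch introduced for illustration, not executed and not part of the collector or PyOD itself) that strings together the four steps we just walked through - standardize, distances to the selected components, weighted sum, min-max rescale against the training scores - for any fitted PyOD PCA model:

In [ ]:
def manual_anomaly_probability(model, X):
    """Sketch: recreate model.predict_proba(X)[:, 1] for a fitted PyOD PCA model."""
    # step 1: standardize using the scaler fitted on the training data
    X_scaled = model.scaler_.transform(X)
    # step 2: euclidean distance from each selected principal component
    distances = cdist(X_scaled, model.selected_components_)
    # step 3: weighted sum of the distances
    scores = np.sum(distances / model.selected_w_components_, axis=1)
    # step 4: rescale against the training scores and clip to [0, 1]
    train_scores = model.decision_scores_
    probs = (scores - train_scores.min()) / (train_scores.max() - train_scores.min())
    return np.clip(probs, 0, 1)

# e.g. manual_anomaly_probability(models[model_to_explore], X) should match
# models[model_to_explore].predict_proba(X)[:, 1]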

....phew, thats it! Go get yourself a coffee :)
