Outlier detection (also known as anomaly detection) is the process of finding data objects with behaviors that are very different from expectation. Such objects are called outliers or anomalies.
The most interesting objects are those that deviate significantly from the normal objects: outliers are suspected of being generated by a mechanism different from the one that generated the rest of the data.
Outlier detection is important in many applications, such as fraud detection, intrusion detection, and medical diagnosis.
Outlier detection and clustering analysis are two highly related tasks. Clustering finds the majority patterns in a data set and organizes the data accordingly, whereas outlier detection tries to capture those exceptional cases that deviate substantially from the majority patterns.
Outliers differ from noisy data. Noise is random error or variance in a measured variable, and it should be removed before outlier detection is applied: noise may distort the normal objects and blur the distinction between normal objects and outliers, hiding outliers and reducing the effectiveness of outlier detection. For example, if a customer occasionally buys a more expensive lunch than usual, this behavior should be treated as a “noise transaction,” that is, as random error or variance, rather than as an outlier.
In general, outliers can be classified into three categories, namely global outliers, contextual (or conditional) outliers, and collective outliers.
Usually, a data set may contain several types of outliers, and an object may belong to more than one type of outlier at the same time.
There are many outlier detection methods in the literature and in practice. First, the methods differ according to whether the sample of data for analysis is given with domain expert–provided labels that can be used to build an outlier detection model. Second, the methods can be divided into groups according to their assumptions about normal objects versus outliers.
If expert-labeled examples of normal and/or outlier objects can be obtained, they can be used to build outlier detection models. The methods used can be divided into supervised methods, semi-supervised methods, and unsupervised methods.
Supervised methods model outlier detection as a classification problem: samples examined and labeled by domain experts are used for training and testing.
Challenges: the two classes are highly imbalanced, since outliers are rare, and catching as many outliers as possible (recall) matters more than raw classification accuracy.
In some application scenarios, objects labeled as “normal” or “outlier” are not available. Thus, an unsupervised learning method has to be used. Unsupervised outlier detection methods make an implicit assumption: The normal objects are somewhat “clustered.” In other words, an unsupervised outlier detection method expects that normal objects follow a pattern far more frequently than outliers.
Challenges: unsupervised methods cannot detect collective outliers effectively, and the clustering assumption may not always hold. For example, in some intrusion or virus detection settings, normal activities are diverse and do not form tight clusters.
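For illustration, a minimal clustering-based sketch of this assumption (the data is synthetic, and DBSCAN with these parameters is just one possible choice): points that fall in no dense cluster are reported as outliers.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense clusters of "normal" points plus one isolated point
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(20, 2)),   # cluster 1
    rng.normal(loc=5.0, scale=0.1, size=(20, 2)),   # cluster 2
    [[2.5, 2.5]],                                   # isolated point
])

# DBSCAN labels points that fall in no dense region as -1 (noise)
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
outlier_indexes = np.where(labels == -1)[0]
print(outlier_indexes)  # index 40, the isolated point, is reported as noise
```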
In many applications, although obtaining some labeled examples is feasible, the number of such labeled examples is often small. If only some labeled normal objects are available, they can be used, together with the unlabeled objects that are close to them, to train a model for normal objects; objects that do not fit this model are reported as outliers.
If only some labeled outliers are available, such a small number of labeled outliers may not cover the possible outliers well, so the resulting model is likely to miss many of them.
Statistical methods (also known as model-based methods) assume that the normal data follow some statistical model (a stochastic model). The idea is to learn a generative model fitting the given data set, and then identify the objects in low probability regions of the model as outliers.
A parametric method assumes that the normal data objects are generated by a parametric distribution with parameter $\theta$. The probability density function of the parametric distribution, $f(x \mid \theta)$, gives the probability that object $x$ is generated by the distribution. The smaller this value, the more likely $x$ is an outlier.
import numpy as np
avg_temp = np.array([24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4])
We can use the maximum likelihood method to estimate the parameters $\mu$ and $\sigma^2$. That is, we maximize the log-likelihood function
$$\ln L(\mu,\sigma^2) = \sum_{i=1}^{n}\ln f(x_i \mid \mu,\sigma^2) = -\frac{n}{2}\ln(2\pi) -\frac{n}{2}\ln\sigma^2 - \frac{1}{2\sigma^2} \sum_{i=1}^{n}(x_i-\mu)^2$$Taking derivatives with respect to $\mu$ and $\sigma^2$ and solving the resulting system of first-order conditions leads to the following maximum likelihood estimates:
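Setting both partial derivatives to zero yields the sample mean and the (biased) sample variance:

$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \hat{\mu})^2$$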
mean = avg_temp.mean()
std = avg_temp.std()
# z-score of the most deviating value
z = (avg_temp.min() - mean) / std
print(mean, std, z)
The most deviating value, 24.0 °C, is 4.61 °C away from the estimated mean. Under the assumption of a normal distribution, the region $\mu \pm 3\sigma$ contains 99.7% of the data. The z-score of 24.0 is roughly $-3$, i.e., it lies about three standard deviations below the mean, at the very edge of this region, so 24.0 is flagged as an outlier.
The main drawback of parametric methods is that in many cases the data distribution is not known in advance.
A simple graphical alternative is a boxplot, which flags extreme values without assuming a particular parametric model:
import matplotlib.pyplot as plt
plt.boxplot(avg_temp)
plt.show()
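The boxplot marks points beyond the whiskers as outliers; by Tukey's rule the whiskers extend 1.5 × IQR past the quartiles. The same rule can be computed directly (a sketch that redefines the avg_temp data for self-containment):

```python
import numpy as np

avg_temp = np.array([24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4])

# Tukey's rule: anything beyond 1.5 * IQR from the quartiles is an outlier
q1, q3 = np.percentile(avg_temp, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = avg_temp[(avg_temp < lower) | (avg_temp > upper)]
print(outliers)  # [24.]
```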
For multivariate data, other methods are used: for example, we can compute the Mahalanobis distance of each object from the mean, or use the $\chi^2$ statistic.
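A sketch of the Mahalanobis-distance approach on synthetic correlated data (the data set and the 0.999 quantile threshold are assumptions for illustration): points whose squared Mahalanobis distance from the mean exceeds a $\chi^2$ quantile are flagged.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)
# Correlated bivariate data plus one planted outlier
X = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=200)
X = np.vstack([X, [[4.0, -4.0]]])  # far from the correlation structure

mean = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - mean
# Squared Mahalanobis distance of every point from the mean
d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)

# Under normality, d2 follows a chi-squared distribution with 2 dof
threshold = chi2.ppf(0.999, df=2)
outlier_indexes = np.where(d2 > threshold)[0]
```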
In nonparametric methods for outlier detection, the model of “normal data” is learned from the input data, rather than assumed a priori. Nonparametric methods often make fewer assumptions about the data and thus are applicable in more scenarios. As an example, we can use histograms.
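For instance, a minimal histogram-based sketch (the data and the bin count are illustrative assumptions): score each value by the inverse relative frequency of its bin, so values that fall into sparse bins receive high outlier scores.

```python
import numpy as np

values = np.array([24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4])

counts, edges = np.histogram(values, bins=5)
# Assign every value to its bin (the last edge belongs to the last bin)
bin_ids = np.clip(np.digitize(values, edges) - 1, 0, len(counts) - 1)
# Outlier score: inverse relative frequency of the value's bin
scores = len(values) / counts[bin_ids]
print(values[np.argmax(scores)])  # 24.0
```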
Given a set of objects in feature space, a distance measure can be used to quantify the similarity between objects. Intuitively, objects that are far from others can be regarded as outliers. Proximity-based approaches assume that the proximity of an outlier object to its nearest neighbors significantly deviates from the proximity of the object to most of the other objects in the data set.
A distance-based outlier detection method consults the neighborhood of an object, which is defined by a given radius. An object is then considered an outlier if its neighborhood does not contain enough other objects.
Formally, let $r$ $(r > 0)$ be a distance threshold and $\pi$ $(0 < \pi < 1)$ be a fraction threshold. An object $o$ is a $DB(r,\pi)$-outlier if $$\frac{\|\{o' \mid dist(o,o') \leq r\}\|}{\|D\|} \leq \pi,$$ where $dist(o,o')$ is a distance measure.
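A direct sketch of this definition on a tiny synthetic data set (the values of $r$ and $\pi$ here are illustrative):

```python
import numpy as np
from scipy.spatial.distance import cdist

def db_outliers(X, r, pi):
    """Return indexes of DB(r, pi)-outliers in X."""
    dist = cdist(X, X)
    # Fraction of the data set lying within distance r of each object
    fraction = (dist <= r).sum(axis=1) / len(X)
    return np.where(fraction <= pi)[0]

# Four clustered points and one distant point
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [5.0, 5.0]])
print(db_outliers(X, r=1.0, pi=0.2))  # [4]
```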
Algorithms for mining distance-based outliers include the index-based, nested-loop, and grid-based algorithms.
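One classic approach, the nested-loop algorithm, can be sketched as follows (data and thresholds are illustrative): for each object, scan the data and stop counting as soon as enough neighbors are found within $r$, an early exit that often avoids the full pairwise computation.

```python
import numpy as np

def nested_loop_db_outliers(X, r, pi):
    """Nested-loop mining of DB(r, pi)-outliers with early termination."""
    n = len(X)
    max_neighbors = int(pi * n)  # more neighbors than this => not an outlier
    outliers = []
    for i in range(n):
        count = 0
        for j in range(n):
            if np.linalg.norm(X[i] - X[j]) <= r:
                count += 1
                if count > max_neighbors:
                    break  # early exit: object i cannot be an outlier
        else:
            outliers.append(i)  # inner loop never broke: too few neighbors
    return outliers

X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [5.0, 5.0]])
print(nested_loop_db_outliers(X, r=1.0, pi=0.2))  # [4]
```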
Density-based outlier detection method investigates the density of an object and that of its neighbors. Here, an object is identified as an outlier if its density is relatively much lower than that of its neighbors.
Many real-world data sets demonstrate a more complex structure, where objects may be considered outliers with respect to their local neighborhoods, rather than with respect to the global data distribution.
Distance-based methods are not able to detect objects such as $o_1$ and $o_2$ as outliers. However, when considered locally with respect to cluster $C_1$, they are outliers, because $o_1$ and $o_2$ deviate significantly from the objects in $C_1$; moreover, $o_1$ and $o_2$ are also far from the objects in $C_2$.
The critical idea here is that we need to compare the density around an object with the density around its local neighbors. The basic assumption of density-based outlier detection methods is that the density around a nonoutlier object is similar to the density around its neighbors, while the density around an outlier object is significantly different from the density around its neighbors.
$dist_k(o)$ is the distance between object $o$ and its $k$-th nearest neighbor, called the $k$-distance of $o$. The $k$-distance neighborhood of $o$ contains all objects whose distance to $o$ is not greater than $dist_k(o)$:
$$N_k(o)= \{o' \mid o' \in D,\; dist(o,o') \leq dist_k(o)\}$$We can use the average distance from the objects in $N_k(o)$ to $o$ as a measure of the local density of $o$. If $o$ has very close neighbors $o'$ such that $dist(o, o')$ is very small, the statistical fluctuations of the distance measure can be undesirably high. To overcome this problem, we can add a smoothing effect and switch to the following reachability distance measure.
$$reachdist_k(o,o') = \max\{dist_k(o'), dist(o,o')\}$$Here $k$ is a user-specified parameter that controls the smoothing effect. Essentially, $k$ specifies the minimum neighborhood to be examined to determine the local density of an object. Note that the reachability distance is not symmetric.
Local reachability density of an object $o$ is $$lrd_k(o) = \frac{||N_k(o)||}{\sum_{o' \in N_k(o)}reachdist_k(o,o')}$$
We calculate the local reachability density for an object and compare it with that of its neighbors to quantify the degree to which the object is considered an outlier.
$$LOF_k(o) = \frac{\sum_{o'\in N_k(o)} \frac{lrd_k(o')}{lrd_k(o)} }{||N_k(o)||} = \frac{\sum_{o' \in N_k(o)}lrd_k(o') \cdot \sum_{o' \in N_k(o)}reachdist_k(o,o')}{||N_k(o)||^2}$$The local outlier factor is the average of the ratios of the local reachability density of $o$ to those of $o$'s $k$-nearest neighbors. The lower the local reachability density of $o$, and the higher the local reachability densities of the $k$-nearest neighbors of $o$, the higher the LOF value. This exactly captures a local outlier whose local density is relatively low compared with the local densities of its $k$-nearest neighbors.
The meaning of LOF is easy to understand.
The minimum reachability distance from $o$ to its $k$-nearest neighbors is $$direct_{min}(o) = \min\{reachdist_k(o',o) \mid o' \in N_k(o)\},$$
while the maximum can be defined as $$direct_{max}(o) = \max\{reachdist_k(o',o) \mid o' \in N_k(o)\}.$$
We also consider the neighbors of $o$'s $k$-nearest neighbors: $$indirect_{min}(o) = \min\{reachdist_k(o'',o') \mid o' \in N_k(o) \text{ and } o'' \in N_k(o')\}$$
and $$indirect_{max}(o) = \max\{reachdist_k(o'',o') \mid o' \in N_k(o) \text{ and } o'' \in N_k(o')\}.$$
Then $LOF(o)$ is bounded as follows:
$$\frac{direct_{min}(o)}{indirect_{max}(o)} \leq LOF(o) \leq \frac{direct_{max}(o)}{indirect_{min}(o)}$$

Based on the clickstream event frequency pattern in clikstream_data.csv, we will apply the LOF algorithm to calculate $LOF_k(o)$ for each point, with the following initial settings: Manhattan (cityblock) distance and $k = 2$.
As a result, we will report the 5 points with the largest $LOF_k(o)$ values.
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

data_input = pd.read_csv("../../data/clikstream_data.csv")
data_input.head()

# Look up the distance between two observations in the distance matrix
def reachdist(distance_df, observation, index):
    return distance_df[observation][index]

# LOF algorithm implementation from scratch
def LOF_algorithm(data_input, distance_metric="cityblock", p=5):
    # Pairwise distances between all observations
    distances = pdist(data_input.values, metric=distance_metric)
    dist_matrix = squareform(distances)
    distance_df = pd.DataFrame(dist_matrix)

    k = 2 if distance_metric == "cityblock" else 3
    observations = distance_df.columns
    lrd_dict = {}
    n_dist_index = {}
    reach_array_dict = {}

    # Local reachability density of every observation
    for observation in observations:
        # k-distance: distance to the k-th nearest neighbor
        dist = distance_df[observation].nsmallest(k + 1).iloc[k]
        # k-distance neighborhood, the observation itself excluded
        indexes = distance_df[distance_df[observation] <= dist].drop(observation).index
        n_dist_index[observation] = indexes

        reach_dist_array = []
        for index in indexes:
            dist_between_observation_and_index = reachdist(distance_df, observation, index)
            # k-distance of the neighbor
            dist_index = distance_df[index].nsmallest(k + 1).iloc[k]
            # reachdist_k(o, o') = max{dist_k(o'), dist(o, o')}
            reach_dist = max(dist_index, dist_between_observation_and_index)
            reach_dist_array.append(reach_dist)

        lrd_dict[observation] = len(indexes) / sum(reach_dist_array)
        reach_array_dict[observation] = reach_dist_array

    # LOF: average ratio of the neighbors' lrd to the observation's own lrd
    LOF_dict = {}
    for observation in observations:
        lrd_array = [lrd_dict[index] for index in n_dist_index[observation]]
        LOF = (sum(lrd_array) * sum(reach_array_dict[observation])
               / np.square(len(n_dist_index[observation])))
        LOF_dict[observation] = LOF

    # The p observations with the largest LOF values
    return sorted(LOF_dict.items(), key=lambda x: x[1], reverse=True)[:p]

LOF_algorithm(data_input, p=5)
LOF_algorithm(data_input, p=5, distance_metric="euclidean")
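As a sanity check, scikit-learn provides the same measure in LocalOutlierFactor; below is a minimal sketch on a small synthetic array (since the clickstream file itself is not shown here). Note that scikit-learn stores negated LOF values in negative_outlier_factor_.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Four clustered points and one distant point
X = np.array([[0.0, 0.0], [0.0, 0.1], [0.1, 0.0], [0.1, 0.1], [3.0, 3.0]])

lof = LocalOutlierFactor(n_neighbors=2, metric="cityblock")
lof.fit_predict(X)  # -1 marks outliers, 1 marks inliers
scores = -lof.negative_outlier_factor_  # back to positive LOF values
print(scores.argmax())  # index of the strongest outlier: 4
```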