The basic technique I was taught for finding outliers was to find the $z$-score of various points--find the mean $\mu$ and the standard deviation $\sigma$ and calculate

$$z_i = \frac{x_i - \mu}{\sigma}$$

(with various extensions for multidimensional data).
This works fine when you have hundreds of points, a tiny fraction of outliers, and normally distributed data. However, for the data I have to deal with, it doesn't actually work well.
from __future__ import print_function, division
import numpy as np
import scipy as sp
np.set_printoptions(suppress=True, precision=3)
np.random.seed(200)
# x[3] is a bogus datapoint from a misbehaving scraper
x = np.array([10, 11, 10, 100001, 9, 10, 11])
mean_x = x.mean()
print('Mean x = %.1f' % mean_x)
std_x = x.std(ddof=1)
print('StDev x = %.1f' % std_x)
z_score = (x - mean_x)/std_x
print(z_score)
Mean x = 14294.6
StDev x = 37793.0
[-0.378 -0.378 -0.378  2.268 -0.378 -0.378 -0.378]
So, if we had set some limit like $z < 2.5$, we'd completely miss the bogus point. In fact, $z$-scores are poorly suited to small data sets in general, since the largest $z$-score you can get from $n$ data points (using the sample standard deviation) is $(n-1)/\sqrt{n}$.
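As a quick sanity check of that bound, here's a toy example of my own (not from the dataset above): when a single point carries essentially all of the deviation, the largest $z$-score lands right on $(n-1)/\sqrt{n}$.

```python
import numpy as np

# Toy sample where one point carries all of the deviation; the largest
# |z| hits its theoretical cap of (n - 1)/sqrt(n).
x_demo = np.array([0., 0., 0., 0., 0., 0., 100.])
n = len(x_demo)
z = (x_demo - x_demo.mean()) / x_demo.std(ddof=1)
print(np.abs(z).max())       # 2.2677...
print((n - 1) / np.sqrt(n))  # 2.2677...
```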
We could try using a more robust estimate for the middle point with the median.
median_x = np.median(x)
print('Median x = %.1f' % median_x)
zm_score = (x - median_x)/std_x
print(zm_score)
Median x = 10.0
[ 0. 0. 0. 2.646 -0. 0. 0. ]
But that would still fail if we used $z < 3$ as a threshold. The next step would be to use a more robust estimate of the spread as well (say the interquartile range, IQR).
q75, q25 = np.percentile(x, [75, 25])
iqr = q75 - q25
fake_std = iqr/1.349
print('IQR = %.1f' % iqr)
zmi_score = (x - median_x)/fake_std
print(zmi_score)
IQR = 1.0
[ 0. 1.349 0. 134887.859 -1.349 0. 1.349]
But we can still run into issues, particularly when we have very few data points (how do you even meaningfully define an IQR on all of 3 data points?).
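As a quick illustration of how arbitrary this gets (my own toy example), NumPy will happily interpolate quartiles for a 3-point sample, producing an "IQR" that is almost entirely an interpolation artifact:

```python
import numpy as np

# Hypothetical 3-point sample with one bad point: the 75th percentile falls
# between the clean value and the outlier, so the "IQR" is mostly interpolation.
tiny = [9, 10, 100001]
q75, q25 = np.percentile(tiny, [75, 25])
print(q75 - q25)  # 49996.0
```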
x2 = [9, 10, 11, 100001]
q75, q25 = np.percentile(x2, [75 ,25])
iqr = q75 - q25
fake_std = iqr/1.349
print('IQR = %.2f :(' % iqr)
zmi2 = (x2 - np.median(x2))/fake_std
print(zmi2)
x3 = [1000, 9, 9, 10, 11, 100001]
q75, q25 = np.percentile(x3, [75, 25])
iqr = q75 - q25
fake_std = iqr/1.349
print('IQR = %.2f :(' % iqr)
zmi3 = (x3 - np.median(x3))/fake_std
print(zmi3)
IQR = 24998.75 :(
[-0. -0. 0. 5.396]
IQR = 743.50 :(
[ 1.795 -0.003 -0.003 -0.001 0.001 181.422]
x4 = np.array([1000, 9, 9, 9, 10, 11, 100001])
q75, q25 = np.percentile(x4, [75, 25])
iqr = q75 - q25
fake_std = iqr/1.349
print('IQR = %.2f :(' % iqr)
zmi4 = (x4 - np.median(x4))/fake_std
print(zmi4)
IQR = 496.50 :(
[ 2.69 -0.003 -0.003 -0.003 0. 0.003 271.677]
Another potential robust dispersion metric to use is MAD, the Median Absolute Deviation.
$$MAD = \text{median}_i \left| \, x_i - \text{median}(x) \, \right| $$

print('Abs median difference')
print(np.abs(x4 - np.median(x4)))
MAD = np.median(np.abs(x4 - np.median(x4)))
print('MAD = %.3f' % MAD)
Abs median difference
[ 990. 1. 1. 1. 0. 1. 99991.]
MAD = 1.000
For normal distributions, $MAD = \sigma/1.4826$ and $IQR = 1.349\sigma$. Note that both the IQR and MAD assume the underlying distribution is not skewed, and both may run into issues with multimodal data. It's also worth noting that we're trading efficiency for robustness as we move away from the mean and standard deviation. But this is all in the spirit of being approximately right rather than precisely wrong.
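Those scaling constants are easy to check empirically. This is my own quick simulation (not part of the original analysis): draw a large normal sample with $\sigma = 2$ and confirm that $\sigma \approx IQR/1.349 \approx 1.4826 \times MAD$.

```python
import numpy as np

# Empirical check of the scaling constants on normal data; true sigma is 2.
rng = np.random.default_rng(0)
y = rng.normal(loc=0.0, scale=2.0, size=100_000)
q75, q25 = np.percentile(y, [75, 25])
mad = np.median(np.abs(y - np.median(y)))
print('std        = %.3f' % y.std(ddof=1))        # ~2.0
print('IQR/1.349  = %.3f' % ((q75 - q25) / 1.349))  # ~2.0
print('1.4826*MAD = %.3f' % (1.4826 * mad))         # ~2.0
```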
For skewed data and as a more efficient estimator, we can look into $S_n$ and $Q_n$.
# Don't implement these in your code; there's a link to a more computationally efficient method below,
# plus this is not vectorized
def find_Sn(x):
    n = len(x)
    outer_med_array = np.empty(n)
    for ind in range(n):
        outer_med_array[ind] = np.median(np.abs(x - x[ind]))
    return 1.1926*np.median(outer_med_array)
print('S_%d = %.3f' % (len(x4), find_Sn(x4)))
S_7 = 1.193
# Don't do this naive implementation either; it's O(n^2) in time and space
from itertools import combinations
def find_Qn(x):
    n = len(x)
    h = n//2 + 1
    k = h*(h-1)//2
    nchoose2 = n*(n-1)//2
    diff_array = np.empty(nchoose2)
    # itertools.combinations yields the n-choose-2 index pairs directly
    for ind, (i, j) in enumerate(combinations(range(n), 2)):
        diff_array[ind] = np.abs(x[j] - x[i])
    diff_array.sort()
    print('k = %d' % k)
    print('Sorted array:')
    print(diff_array)
    return 2.2219*diff_array[k-1]  # Since Python zero-indexes, use k-1 for the kth order statistic
print('Q_%d = %.3f' % (len(x4), find_Qn(x4)))
k = 6
Sorted array:
[ 0. 0. 0. 1. 1. 1. 1. 2. 2. 2. 989. 990. 991. 991. 991. 99001. 99990. 99991. 99992. 99992. 99992.]
Q_7 = 2.222
Method | Efficiency on Gaussian | Breakdown point | Use on highly skewed data? |
---|---|---|---|
Stdev | 100% | 0% | ??? |
IQR | 37% | 25% | No |
MAD | 37% | 50% | No |
$S_n$ | 58% | $\left \lfloor{n/2}\right \rfloor /n \approx 50$% | OK |
$Q_n$ | 82% | $\left \lfloor{n/2}\right \rfloor /n \approx 50$% | OK |
$P_n$ | 86% | 13.4% | OK |
The advantages we get from $S$ and $Q$ are balanced by longer computing time and more memory, but computers these days have plenty of both. Rousseeuw and Croux came up with a more efficient method to calculate these two scale statistics, details of which can be seen at this link. I do not know of any Python packages that implement these calculations, but the R package robustbase does have them.
As for which one to use, the paper Study of Extreme Values and Influential Observations states that $S$ is a little less sensitive to errors (specifically, gross-error sensitivity; see the paper), and recommends using it when $n < 50$. For larger datasets, use $Q$.
Basically, instead of taking the n-choose-2 pairwise absolute differences as in $Q$, take the n-choose-2 pairwise means and then find their IQR.
$$P_n = 1.048 \times IQR\left[\frac{x_i + x_j}{2} \,:\, i < j\right]$$

from itertools import combinations
def find_Pn(x):
    n = len(x)
    nchoose2 = n*(n-1)//2
    sum_array = np.empty(nchoose2)
    # itertools.combinations yields the n-choose-2 index pairs directly
    for ind, (i, j) in enumerate(combinations(range(n), 2)):
        sum_array[ind] = (x[i] + x[j])/2
    q75, q25 = np.percentile(sum_array, [75, 25])
    return 1.048*(q75 - q25)
print('P_%d = %.3f' % (len(x4), find_Pn(x4)))
P_7 = 52395.284
Looks like we've exceeded the breakdown point for $P_n$ (29% of x4 is bad data). However, for a less fraught dataset:
x5 = np.array([10, 9, 9, 9, 10, 11, 100001, 11, 10, 12, 8, 10, 9, 11])
print('P_%d = %.3f' % (len(x5), find_Pn(x5)))
P_14 = 1.310
For a more robust version of $P$, the authors provide $\tilde{P}_n$, where you iteratively toss out anything with a z-like score > 5 (using $P_n$ as your scale parameter), reaching a 50% breakdown point.
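The authors' exact definition of $\tilde{P}_n$ isn't reproduced here, but the "toss anything with a z-like score > 5 and recompute" idea can be sketched roughly as below. `pn_scale` and `pn_trimmed` are made-up names, and the median as the location estimate is my assumption.

```python
from itertools import combinations
import numpy as np

def pn_scale(x):
    # Naive P_n as in the text: 1.048 * IQR of the pairwise means.
    pair_means = [(a + b) / 2 for a, b in combinations(x, 2)]
    q75, q25 = np.percentile(pair_means, [75, 25])
    return 1.048 * (q75 - q25)

def pn_trimmed(x, cutoff=5.0, max_iter=10):
    # Rough sketch of the iterative idea behind P~_n: repeatedly drop points
    # whose z-like score (median-centered, P_n-scaled) exceeds the cutoff.
    x = np.asarray(x, dtype=float)
    for _ in range(max_iter):
        scale = pn_scale(x)
        keep = np.abs(x - np.median(x)) / scale <= cutoff
        if keep.all():
            break
        x = x[keep]
    return pn_scale(x)

x5 = np.array([10, 9, 9, 9, 10, 11, 100001, 11, 10, 12, 8, 10, 9, 11])
print('trimmed P = %.3f' % pn_trimmed(x5))
```

On x5 the first pass drops the bogus point (its z-like score is enormous against $P_{14} = 1.310$) and the scale is then recomputed on the clean remainder.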
More interestingly, $P_n$ performs more robustly with digitized data. For example, if you were drawing from $Poisson(1)$, it's highly likely that your $Q$ estimators will return 0 (and $S$ has troubles with $\mu < 0.5$ here).
np.random.seed(200)
poisson_test = sp.stats.poisson.rvs(1, size=50)
print('S_%d = %.3f' % (len(poisson_test), find_Sn(poisson_test)))
print('Q_%d = %.3f' % (len(poisson_test), find_Qn(poisson_test)))
print('P_%d = %.3f' % (len(poisson_test), find_Pn(poisson_test)))
S_50 = 1.193
k = 325
Sorted array:
[ 0. 0. 0. ..., 3. 3. 3.]
Q_50 = 0.000
P_50 = 1.048
There is a clever algorithm to estimate $P$ in $O(n \log n)$ time, putting it on computing-time par with $S$ and $Q$.
As it turns out, there's also a similar (and older) estimator for the position parameter (the mean/median). The Hodges-Lehmann estimator (sometimes the Hodges-Lehmann-Sen estimator) is given by

$$HL = \text{median} \left[\frac{x_i + x_j}{2} \,:\, i < j\right]$$

Following the same algorithm linked above, you can compute this in $O(n \log n)$ time.
# Naive approach, not the clever algorithm.
from itertools import combinations
def find_HL(x):
    n = len(x)
    nchoose2 = n*(n-1)//2
    sum_array = np.empty(nchoose2)
    # itertools.combinations yields the n-choose-2 index pairs directly
    for ind, (i, j) in enumerate(combinations(range(n), 2)):
        sum_array[ind] = (x[i] + x[j])/2
    return np.median(sum_array)
print("HL of x_4 = %.3f. We're right at the breakpoint here." % (find_HL(x4)))
print('HL of x_5 = %.3f' % (find_HL(x5)))
HL of x_4 = 504.500. We're right at the breakpoint here.
HL of x_5 = 10.000
Method | Efficiency on Gaussian | Breakdown point |
---|---|---|
Mean | 100% | 0% |
Median | 64% | 50% |
Hodges-Lehmann | 86% | 29% |
Maybe at some point I'll look at the Huber estimator. https://projecteuclid.org/download/pdf_1/euclid.aoms/1177703732
The basic idea of Dixon's $Q$ test is this: sort the data and look at the minimum and maximum points. For each of these, calculate

$$ Q = \frac{\text{gap to next point}}{\text{range}}$$

If $Q$ is above some critical value (viz., there's a massive gap), then there's a good chance that the point is an outlier. The critical values of $Q$ are given below for 3 points, 4 points, ..., 30 points; 90, 95, and 99 represent confidence levels.
q90 = [0.941, 0.765, 0.642, 0.56, 0.507, 0.468, 0.437,
0.412, 0.392, 0.376, 0.361, 0.349, 0.338, 0.329,
0.32, 0.313, 0.306, 0.3, 0.295, 0.29, 0.285, 0.281,
0.277, 0.273, 0.269, 0.266, 0.263, 0.26
]
q95 = [0.97, 0.829, 0.71, 0.625, 0.568, 0.526, 0.493, 0.466,
0.444, 0.426, 0.41, 0.396, 0.384, 0.374, 0.365, 0.356,
0.349, 0.342, 0.337, 0.331, 0.326, 0.321, 0.317, 0.312,
0.308, 0.305, 0.301, 0.29
]
q99 = [0.994, 0.926, 0.821, 0.74, 0.68, 0.634, 0.598, 0.568,
0.542, 0.522, 0.503, 0.488, 0.475, 0.463, 0.452, 0.442,
0.433, 0.425, 0.418, 0.411, 0.404, 0.399, 0.393, 0.388,
0.384, 0.38, 0.376, 0.372
]
# n runs from 3 up to len(q)+2 (= 30 here); range(3, len(q)+1) would silently drop the last entries
Q90 = {n: q for n, q in zip(range(3, len(q90)+3), q90)}
Q95 = {n: q for n, q in zip(range(3, len(q95)+3), q95)}
Q99 = {n: q for n, q in zip(range(3, len(q99)+3), q99)}
For x2 above...
x2.sort()
print(x2)
[9, 10, 11, 100001]
Q_min = (x2[1]-x2[0])/(x2[-1]-x2[0])
print('Q_min = %.3f' % Q_min)
Q_max = (x2[-1]-x2[-2])/(x2[-1]-x2[0])
print('Q_max = %.3f' % Q_max)
print(Q_max > Q95[len(x2)])
Q_min = 0.000
Q_max = 1.000
True
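The manual steps above fold naturally into a small helper. `dixon_q_flags` is a made-up name, and the table here is just the first few $n$ values of the q95 list above.

```python
import numpy as np

# First few entries of the 95% Dixon Q critical values (n = 3..7), from the table above.
Q95_SUBSET = {3: 0.97, 4: 0.829, 5: 0.71, 6: 0.625, 7: 0.568}

def dixon_q_flags(data, table=Q95_SUBSET):
    # Returns (min_is_outlier, max_is_outlier) under Dixon's Q test:
    # gap to the nearest neighbor, divided by the full range.
    x = np.sort(np.asarray(data, dtype=float))
    full_range = x[-1] - x[0]
    q_min = (x[1] - x[0]) / full_range
    q_max = (x[-1] - x[-2]) / full_range
    crit = table[len(x)]
    return bool(q_min > crit), bool(q_max > crit)

print(dixon_q_flags([9, 10, 11, 100001]))  # (False, True)
```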
Two things to note:
This is another outlier test, Grubbs' test. To run it, compute the $z$-scores of the sample and find the largest absolute value, $G = \max |z_i|$. We reject the hypothesis that there are no outliers at significance level $\alpha$ if

$$ G > \frac{N-1}{\sqrt{N}} \sqrt{\frac{t^2_{\alpha/2N,\, N-2}}{N - 2 + t^2_{\alpha/2N,\, N-2}}}$$

with $t_{\alpha/2N, N-2}$ the upper critical value of the $t$-distribution with $N - 2$ degrees of freedom at significance level $\alpha/2N$. For a one-sided test, use $t_{\alpha/N, N-2}$.

Run this repeatedly, removing the biggest outlier each time (if one exists), until no outlier is flagged. But don't run it on too-small samples (say $N \leq 6$).
from scipy import stats
def Grubbs_outlier_test(y_i, alpha=0.05):
    """
    Perform Grubbs' outlier test.
    ARGUMENTS
    y_i (list or numpy array) - dataset
    alpha (float) - significance level for the test
    RETURNS
    G_i (numpy array) - Grubbs G statistic for each member of the dataset
    Gtest (float) - rejection cutoff; reject the no-outlier hypothesis if G_i.max() > Gtest
    outlier_found (bool) - whether an outlier was detected at the specified
        significance level
    index (int) - integer index of the point with the maximum G_i
    Adapted from https://github.com/choderalab/cadd-grc-2013/blob/master/notebooks/Outliers.ipynb
    """
    s = y_i.std(ddof=1)  # Grubbs' test uses the sample standard deviation
    G_i = np.abs(y_i - y_i.mean()) / s
    N = y_i.size
    t = stats.t.isf(alpha/(2*N), N-2)
    # Upper critical value of the t-distribution with N - 2 degrees of freedom and a
    # significance level of alpha/(2N)
    Gtest = (N-1)/np.sqrt(N) * np.sqrt(t**2 / (N-2+t**2))
    G = G_i.max()
    index = G_i.argmax()
    outlier_found = (G > Gtest)
    return [G_i, Gtest, outlier_found, index]
x_grubbs1 = np.array([9, 10, 11, 9, 10, 100000, 11])
G1, Gtest, outlier_found, indexval = Grubbs_outlier_test(x_grubbs1)
print(G1)
print("%.3f" % Gtest)
print(outlier_found)
print(indexval)
[ 0.378 0.378 0.378 0.378 0.378 2.268 0.378]
2.020
True
5
Note that both of these tests assume that your data is normally distributed.
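The iterative recipe described earlier ("remove the biggest outlier each time, and don't run on too-small samples") can be sketched as a loop. `grubbs_once` and `strip_outliers` are hypothetical helper names; the critical value follows the $\alpha/2N$ formula given above.

```python
import numpy as np
from scipy import stats

def grubbs_once(y, alpha=0.05):
    # One pass of Grubbs' test: return the index of the flagged outlier, or None.
    N = y.size
    z = np.abs(y - y.mean()) / y.std(ddof=1)
    t = stats.t.isf(alpha / (2 * N), N - 2)
    crit = (N - 1) / np.sqrt(N) * np.sqrt(t**2 / (N - 2 + t**2))
    return int(z.argmax()) if z.max() > crit else None

def strip_outliers(y, alpha=0.05, min_n=7):
    # Iteratively remove the largest outlier, stopping when nothing is
    # flagged or the sample gets too small to trust the test.
    y = np.asarray(y, dtype=float)
    while y.size >= min_n:
        idx = grubbs_once(y, alpha)
        if idx is None:
            break
        y = np.delete(y, idx)
    return y

print(strip_outliers(np.array([9, 10, 11, 9, 10, 100000, 11])))
```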
There's an extension of Grubbs' test called the Tietjen-Moore test, though it requires you to specify the number of outliers from the start.