#!/usr/bin/env python
# coding: utf-8

# ## [mlcourse.ai](https://mlcourse.ai) – Open Machine Learning Course
# ### Author: Alexander Nichiporenko, AlexNich
#
# ## Tutorial
# ### "Anomaly Detection: Isolation Forest"
# ## Introduction
#
# In our course we have touched on several unsupervised learning tasks (dimensionality reduction and clustering), but one more important class has remained unnoticed - the detection of anomalies.
#
# In data science, anomaly detection (also called outlier or novelty detection) is the identification of rare items, events or observations which raise suspicions by differing significantly from the majority of the data. Typically the anomalous items translate to some kind of problem such as bank fraud, a structural defect, medical problems or errors in a text. Anomalies are also referred to as outliers, novelties, noise, deviations and exceptions. Outliers often reduce the quality of ML algorithms because models tune to them.
#
# ## A bit of theory and the main idea of the algorithm
#
# One of the proven anomaly detection algorithms is the Isolation Forest. As the name implies, this is an ensemble of trees which are built independently of each other. But here the principle of building a tree differs from the one used in regression or classification problems, namely minimizing a splitting criterion at each step.
# The trees are still binary, but at each node the feature is chosen randomly, and the split threshold is also chosen randomly from the range (min, max) of values that the feature takes. Each tree is built to the maximum possible depth - until there is only one object in each leaf.
# With this approach, anomalies reach a terminal leaf much earlier than normal objects. This is the principle Isolation Forest relies on: the algorithm "isolates" anomalies from normal objects within the first few splits.
#
# Perhaps this seems not quite obvious, but consider the following one-dimensional toy example - the set of numbers [1, 20, 21, 25]. Obviously, the outlier in this case is the number 1.
# If we choose the first split threshold uniformly from (1, 25), then in the overwhelming majority of cases the number 1 will immediately be "isolated" in the first leaf of the tree. Let's simulate this situation for 1000 random threshold choices.

# In[1]:

#Importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# In[2]:

#Toy example
hits = 0
K = 1000
for k in range(K):
    np.random.seed(k + 101)
    split = np.random.uniform(1, 25)
    if split < 20:
        hits += 1
print('The portion of cases when the "1" goes to the first leaf:', hits / K)

# In this way, we can measure the anomaly score using the path length of an object, i.e. the number of edges an observation passes in the tree going from the root to the terminal node. But there is a problem: for sample data $X$ = {$x_1,...,x_n$} the maximum possible height of an isolation tree grows in the order of $n$, while the average path length grows in the order of $\log({n})$. Therefore, we cannot compare the anomaly of objects in samples of different sizes, and normalization by either of the above values will not help. So we will use this formula for normalization:
#
# ## $$c{(n)} = 2H(n-1) - {2 (n-1)\over n}$$
#
# where $H(n-1)$ is the $(n-1)$-th $Harmonic$ number:
#
# ## $$H(n-1) = \sum_{k=1}^{n-1} {1\over k} \approx \gamma\ {(Euler's\ constant)} + \ln{(n-1)} \approx 0.5772156649 + \ln{(n-1)}$$
#
# $c({n})$ gives the average path length of an unsuccessful search in a Binary Search Tree (BST). We can use it because an isolation tree has a structure equivalent to a BST, and $c({n})$ equals the estimate of the average $h({x})$ for external nodes.
#
# So the final anomaly score is calculated by this formula:
#
# ## $$S(x,n) = {2 ^ {-{E(x)\over c(n)}}}$$
#
# where $E(x)$ is the average path length over the trees of our forest in which example $x$ was isolated:
#
# ## $$E(x) = {1\over N}\sum_{i=1}^{N} {h(x)_i}$$
#
# and $N$ is the number of trees in the forest.
#
# $S(x,n)$ ranges from $0$ to $1$.
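# As a quick aside, the sketch below (an illustrative addition, not part of the original notebook) sanity-checks the normalization constant: it compares $c(n)$ computed with the exact harmonic sum against the Euler-constant approximation used above.

```python
import math

def c_exact(n):
    # c(n) = 2*H(n-1) - 2*(n-1)/n, with H(n-1) computed as an exact sum
    H = sum(1.0 / k for k in range(1, n))
    return 2 * H - 2 * (n - 1) / n

def c_approx(n):
    # H(n-1) ≈ Euler's constant + ln(n-1), as in the formula above
    return 2 * (0.5772156649 + math.log(n - 1)) - 2 * (n - 1) / n

for n in [256, 500, 2000]:
    print(n, c_exact(n), c_approx(n))
```

# The two agree to a few decimal places already at modest $n$, so the closed-form approximation is safe to use in code.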
# When $S(x,n)$ of an example is very close to $1$, it is definitely an anomaly; when it is much smaller than $0.5$, the example is safe to regard as a normal instance; and if all examples have $S(x,n) \approx 0.5$, then the entire data set doesn't have any distinct anomalies.
#
# When we decide which examples are anomalies, we can either take the portion of examples with the highest scores or set a threshold on $S(x,n)$.
#
# ## Let's grow our IsolationForest!
#
# In this part of the tutorial, we will implement our own Isolation Forest and see how it works with outliers and normal objects.

# In[3]:

#Importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from math import log as ln

# In[4]:

#External Node - leaf with 1 example
class ExNode:
    def __init__(self, size):
        self.size = size

# In[5]:

#Internal Node
class InNode:
    def __init__(self, left, right, split_feature, split_threshold):
        self.left = left
        self.right = right
        self.split_feature = split_feature
        self.split_threshold = split_threshold

# In[6]:

#Build one Isolation tree
def IsolationTree(X):
    if len(X) <= 1:
        return ExNode(len(X))
    else:
        f = np.random.choice(X.columns)
        t = np.random.uniform(X[f].min(), X[f].max())
        X_l = X[X[f] < t]
        X_r = X[X[f] >= t]
        return InNode(IsolationTree(X_l), IsolationTree(X_r), f, t)

# In[7]:

#Build forest
def MyIsolationForest(X, n_trees):
    forest = []
    for i in range(n_trees):
        forest.append(IsolationTree(X))
    return forest

# In[8]:

#Depth of the external node where the object was isolated
def path_length(x, tree, curr_depth):
    if isinstance(tree, ExNode):
        return curr_depth
    t = tree.split_feature
    if x[t] < tree.split_threshold:
        return path_length(x, tree.left, curr_depth + 1)
    else:
        return path_length(x, tree.right, curr_depth + 1)

# In[9]:

#Average path length and anomaly score
def E(x):
    return np.mean(x)

def c(n):
    return 2 * (ln(n - 1) + 0.5772156649) - 2 * (n - 1) / n if n > 2 else 1 if n == 2 else 0

def S(x, n):
    return 2 ** (-E(x) / c(n))

# ## Let's find outliers using our forest!
#
# First we will generate 1-d normally distributed data and see how the average path length and $S(x,n)$ of normal and anomalous objects depend on the number of trees.
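# Before that, a quick self-contained sanity check (an illustrative sketch added here, not from the original notebook): a stripped-down 1-d re-implementation of the same logic in plain Python, applied to the toy set [1, 20, 21, 25] from the introduction. The outlier should score closer to 1 than a normal point.

```python
import math
import random

def c_bst(n):
    # average BST unsuccessful-search path length (valid for n > 2, enough here)
    return 2 * (math.log(n - 1) + 0.5772156649) - 2 * (n - 1) / n

def isolation_depth(data, x, rng, depth=0):
    # split on a random threshold until x is alone in its partition
    if len(data) <= 1:
        return depth
    t = rng.uniform(min(data), max(data))
    side = [v for v in data if v < t] if x < t else [v for v in data if v >= t]
    return isolation_depth(side, x, rng, depth + 1)

def score(data, x, n_trees=500, seed=42):
    # S(x, n) = 2^(-E(x)/c(n)), averaging the depth over many random trees
    rng = random.Random(seed)
    mean_depth = sum(isolation_depth(data, x, rng) for _ in range(n_trees)) / n_trees
    return 2 ** (-mean_depth / c_bst(len(data)))

toy = [1, 20, 21, 25]
s_outlier, s_normal = score(toy, 1), score(toy, 21)
print(s_outlier, s_normal)  # the outlier "1" gets the noticeably higher score
```

# This mirrors the pandas-based implementation above, just without the DataFrame machinery.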
# In[10]:

# Generating normally distributed 1d-data
random_generator = np.random.RandomState(42)
true_mean = 100
true_sigma = 10
X_all = random_generator.normal(true_mean, true_sigma, size=500)
print('Normal interval:', true_mean - 2 * true_sigma, '-', true_mean + 2 * true_sigma)
X_outliers = pd.DataFrame(np.hstack([X_all[X_all < true_mean - 2 * true_sigma],
                                     X_all[X_all > true_mean + 2 * true_sigma]]), columns=['x'])
X_normal = pd.DataFrame(list(set(X_all).difference(set(X_outliers['x']))), columns=['x'])
X_all = pd.DataFrame(X_all, columns=['x'])
print('Portion of outliers:', len(X_outliers) / len(X_all))

# In[11]:

plt.figure(figsize=(10,5))
plt.hist(X_all['x'],bins=10);
plt.hist(X_outliers['x'],bins=10);
plt.legend(['X_all','X_outliers']);
plt.ylabel('Count');
plt.xlabel('X value');
plt.title('Distribution of X');

# In[12]:

#Outlier for test
X_outliers.iloc[2,:]

# In[13]:

#Normal example for test
X_normal.iloc[0,:]

# In[14]:

get_ipython().run_cell_magic('time', '', '\nanomaly_x=[]\nnormal_x=[]\n\nanomaly_mean_depth=[]\nnormal_mean_depth=[]\n\nanomaly_S=[]\nnormal_S=[]\n\nfor n in range(1,51,1):\n    MyIF=MyIsolationForest(X_all,n)\n    for iTree in MyIF:\n        anomaly_x.append(path_length(X_outliers.iloc[2,:],iTree,0))\n        normal_x.append(path_length(X_normal.iloc[0,:],iTree,0))\n    anomaly_mean_depth.append(E(anomaly_x))\n    normal_mean_depth.append(E(normal_x))\n    anomaly_S.append(S(anomaly_x,len(X_all)))\n    normal_S.append(S(normal_x,len(X_all)))')

# In[15]:

plt.figure(figsize=(10,5))
plt.plot(range(1,51,1),anomaly_mean_depth,c='r');
plt.plot(range(1,51,1),normal_mean_depth,c='g');
plt.title('Average path length (n_trees)');
plt.legend(['anomaly','normal']);
plt.xlabel('Number of trees');
plt.ylabel('Average path length');

# In[16]:

plt.figure(figsize=(10,5))
plt.plot(range(1,51,1),anomaly_S,c='r');
plt.plot(range(1,51,1),normal_S,c='g');
plt.title('S (n_trees)');
plt.legend(['anomaly','normal']);
plt.xlabel('Number of trees');
plt.ylabel('S(x,n)');

# As we can see, the anomalous object has a clearly shorter average path length and a bigger $S(x,n)$.
# We also don't need many trees to detect anomalies: at approximately 15 trees we reach the asymptote.
#
# OK, let's find outliers in 2-d data. We will use an IsolationForest with 30 trees to find our outliers and check the quality of detection.

# In[17]:

# Generating 2d-data
random_generator = np.random.RandomState(42)

# Generating normal data
X_normal = random_generator.randn(2000, 2) * 0.5
X_normal = pd.DataFrame(X_normal, columns = ['x1', 'x2'])
X_normal['type'] = 'normal'

# Generating outliers
X_outliers_1 = random_generator.uniform(low=-6, high=6, size=(78, 2))
X_outliers_2 = random_generator.uniform(low=-6, high=-3, size=(35, 2))
X_outliers = np.vstack([X_outliers_1, X_outliers_2])
X_outliers = pd.DataFrame(X_outliers, columns = ['x1', 'x2'])
X_outliers['R'] = np.sqrt(X_outliers['x1'] ** 2 + X_outliers['x2'] ** 2)
X_outliers = X_outliers[X_outliers['R'] > 3].drop(columns=['R'])
X_outliers['type'] = 'anomaly'

# Full data
X_full = pd.concat([X_normal, X_outliers])

# In[18]:

plt.figure(figsize=(8,8))
plt.scatter(X_outliers['x1'],X_outliers['x2'],c='r');
plt.scatter(X_normal['x1'],X_normal['x2'],c='g');
plt.xlabel('x1');
plt.ylabel('x2');
plt.legend(['outliers','normal']);
plt.title('2d distribution');

# In[19]:

X_normal.shape, X_outliers.shape, X_full.shape

# In[20]:

get_ipython().run_cell_magic('time', '', "MyIF=MyIsolationForest(X_full[['x1','x2']],30)")

# In[21]:

X_outliers.iloc[0,:]

# In[22]:

X_normal.iloc[0,:]

# In[23]:

get_ipython().run_cell_magic('time', '', '\naScore=[]\n\nfor i in range(X_full.shape[0]):\n    depth=[]\n    for iTree in MyIF:\n        depth.append(path_length(X_full.iloc[i,:],iTree,0))\n\n    aScore.append(S(depth,X_full.shape[0]))')

# In[24]:

X_full['aScore'] = aScore

# In[25]:

t = X_full['aScore'].quantile(0.95)
X_full['Outlier'] = X_full['aScore'].apply(lambda x: -1 if x >= t else 1)  #-1 for outliers and 1 for normal objects

# In[26]:

plt.hist(X_full['aScore']);
plt.xlabel('Anomaly Score');

# In[27]:

plt.figure(figsize=(10,7))
plt.scatter(X_full['x1'],X_full['x2'],c=X_full['Outlier']);
plt.title('Detection outliers using MyIF');
plt.xlabel('x1');
plt.ylabel('x2');

# The reader may notice that our forest takes quite a long time to run. What will happen on a larger dataset? Fortunately, during their experiments the authors of the algorithm found that not all the data should be used to build each tree of the forest: subsampling not only speeds things up, but also improves the quality of anomaly detection. This is because subsamples contain fewer normal points 'interfering' with anomalies, thus making anomalies easier to isolate. It is shown in the image below.

# In[28]:

X_sample = X_full.sample(256)
plt.figure(figsize=(8,8))
plt.scatter(X_sample['x1'],X_sample['x2'],c=X_sample['type'].map({'normal':1,'anomaly':-1}));
plt.xlabel('x1');
plt.ylabel('x2');
plt.title('2d distribution of Sample (size=256)');

# We won't develop our IsolationForest further and will use sklearn's version, which includes these improvements.
#
# ## Sklearn is our everything!

# In[29]:

#Import IsolationForest
from sklearn.ensemble import IsolationForest

# #### IsolationForest important params:
#
# n_estimators - the number of base estimators in the ensemble, default=100
# max_samples - the number of samples drawn from the data to train each tree in the forest, default "auto" = min(256, n_samples)
# max_features - the number of features used to train each tree in the forest, default=1.0 (all features)
# bootstrap - whether samples are drawn with replacement, default=False
# contamination - the expected proportion of outliers in the data set (used to set the threshold), default=0.1
#
# We will test this implementation of IF on our 2-d dataset.

# In[30]:

isof = IsolationForest(random_state=77, n_jobs=4, contamination=0.05)

# In[31]:

get_ipython().run_cell_magic('time', '', "isof.fit(X_full[['x1','x2']])")

# Very fast! How about the quality?
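# For reference, a minimal self-contained usage sketch of sklearn's IsolationForest with the parameters discussed above (the data here is synthetic and the parameter values are illustrative, not taken from the tutorial's experiments):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X_inliers = rng.randn(500, 2) * 0.5               # dense cluster around the origin
X_out = rng.uniform(low=4, high=6, size=(25, 2))  # far-away planted outliers
X = np.vstack([X_inliers, X_out])

# max_samples=256 mirrors the subsampling idea discussed above
clf = IsolationForest(n_estimators=100, max_samples=256,
                      contamination=0.05, random_state=0)
labels = clf.fit_predict(X)    # -1 for outliers, 1 for inliers
scores = clf.score_samples(X)  # the lower the score, the more abnormal the point

print((labels[-25:] == -1).mean())  # fraction of planted outliers flagged
```

# Note that contamination only sets the decision threshold; the ranking itself comes from score_samples.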
# In[32]:

# predictions
y_pred_full = isof.predict(X_full[['x1','x2']])

# In[33]:

X_full['Outlier_sk'] = y_pred_full

# In[34]:

plt.figure(figsize=(10,7))
plt.scatter(X_full['x1'],X_full['x2'],c=X_full['Outlier_sk']);
plt.title('Detection outliers using sklearn.IF');
plt.xlabel('x1');
plt.ylabel('x2');

# The image looks great! Purple points are the outliers which IsolationForest found.

# In[35]:

#count of detected outliers "-1" among the true anomalies
X_full.iloc[2000:]['Outlier_sk'].value_counts()

# As we can see, IsolationForest found almost all of the outliers.
#
# ## Time to challenge!
#
# In this part of the tutorial we will compare Isolation Forest with other algorithms. We will use this dataset from Kaggle: https://www.kaggle.com/mlg-ulb/creditcardfraud

# In[36]:

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from sklearn.svm import OneClassSVM
from sklearn.neighbors import LocalOutlierFactor
from sklearn.model_selection import train_test_split

pd.options.display.max_columns = 500

# In[37]:

data = pd.read_csv('creditcard.csv')

# Let's look at the data. All features are numeric, so we will not do any data processing and will train models on what we downloaded. To prevent overfitting of the supervised models we will divide the initial dataset into train and test parts.

# In[38]:

data.head()

# In[39]:

data.describe()

# In[68]:

X = data.drop(columns=['Class'])
y = data['Class']

# In[69]:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=17)

# In[70]:

#Look at the portion of fraud transactions
y.mean(), y_train.mean(), y_test.mean(), len(y[y==1]), len(y), len(y_test), len(y_test[y_test==1])

# Whoah! Fraud transactions make up only about 0.2% of all transactions!
# We will use all algorithms "out of the box" without tuning parameters. We need to find fraud transactions quickly and catch as many as possible.
# In[71]:

#supervised
rf = RandomForestClassifier(random_state=42, n_jobs=4)
lr = LogisticRegression(random_state=42)
xg = XGBClassifier(random_state=42, n_jobs=4)

#unsupervised
IF = IsolationForest(random_state=42, n_jobs=4, contamination=0.01, n_estimators=300)
LOF = LocalOutlierFactor(contamination=0.01, n_jobs=4)

# In[72]:

get_ipython().run_cell_magic('time', '', 'rf.fit(X_train,y_train)')

# In[73]:

get_ipython().run_cell_magic('time', '', 'lr.fit(X_train,y_train)')

# In[74]:

get_ipython().run_cell_magic('time', '', 'xg.fit(X_train,y_train)')

# In[75]:

get_ipython().run_cell_magic('time', '', 'IF.fit(X_train)  # unsupervised: no labels needed')

# In[76]:

get_ipython().run_cell_magic('time', '', 'LOF.fit(X_train)  # unsupervised: no labels needed')

# In[77]:

rf_pred = rf.predict_proba(X_test)[:,1]
lr_pred = lr.predict_proba(X_test)[:,1]
xg_pred = xg.predict_proba(X_test)[:,1]
IF_pred = IF.predict(X_test)
LOF_pred = LOF.fit_predict(X_test)  # LOF (novelty=False) is fit directly on the test set

# In[78]:

X_test['true'] = y_test
X_test['IF_pred'] = IF_pred
X_test['LOF_pred'] = LOF_pred
X_test['xg_pred'] = xg_pred
X_test['rf_pred'] = rf_pred
X_test['lr_pred'] = lr_pred

# Let's check how many fraud transactions each algorithm found:

# In[84]:

X_test[X_test['IF_pred']==-1]['true'].value_counts()

# In[85]:

X_test[X_test['LOF_pred']==-1]['true'].value_counts()

# In[86]:

X_test.sort_values(by='rf_pred',ascending=False)['true'].head(855).value_counts()

# In[87]:

X_test.sort_values(by='lr_pred',ascending=False)['true'].head(855).value_counts()

# In[88]:

X_test.sort_values(by='xg_pred',ascending=False)['true'].head(855).value_counts()

# As we can see, Isolation Forest solved this problem worse than the supervised algorithms, but it is still far better than random guessing, and its result is much better than LOF's. And in this case we don't need to know in advance which transactions are fraudulent, i.e. we don't need a labeled training set.
#
# ## Conclusion
#
# To conclude the tutorial, I want to summarise everything about Isolation Forest.
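# The comparison above counts true frauds among each model's top-scored transactions. A small helper makes that metric explicit (an illustrative addition; the function name and toy data are mine, not from the original notebook):

```python
import numpy as np

def precision_at_k(y_true, scores, k):
    """Fraction of true positives among the k highest-scored examples."""
    top_k = np.argsort(scores)[::-1][:k]
    return np.asarray(y_true)[top_k].mean()

# Tiny illustration: 10 examples, the two true frauds carry the highest scores
y_true = np.array([0, 0, 1, 0, 0, 0, 1, 0, 0, 0])
scores = np.array([0.1, 0.2, 0.9, 0.3, 0.1, 0.2, 0.8, 0.1, 0.3, 0.2])
print(precision_at_k(y_true, scores, 2))  # -> 1.0
```

# Applied to the experiment above, y_true would be X_test['true'] and scores e.g. rf_pred with k=855.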
# This algorithm is built on a very simple idea of anomaly detection: it isolates anomalies rather than profiling normal observations. Isolation Forest also converges quickly with a small ensemble size, which enables it to detect anomalies with high efficiency.
#
# If you want to learn more about Isolation Forest, I recommend reading the articles by the authors of the algorithm:
#
# http://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/icdm08b.pdf
#
# https://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/tkdd11.pdf