Fraud detection is one of the earliest industrial applications of data mining and machine learning. It is typically framed as a binary classification problem, but the classes are highly imbalanced: fraudulent transactions are very rare compared to the overall volume of transactions. Moreover, when fraudulent transactions are discovered, the business typically blocks the affected accounts from transacting to prevent further losses.
%matplotlib inline
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics
import zipfile
# Read the compressed CSV directly from the zip archive
with zipfile.ZipFile('../datasets/fraud_detection.csv.zip', 'r') as z:
    f = z.open('15_fraud_detection.csv')
    data = pd.read_csv(f, index_col=0)
data.head()
| | accountAge | digitalItemCount | sumPurchaseCount1Day | sumPurchaseAmount1Day | sumPurchaseAmount30Day | paymentBillingPostalCode - LogOddsForClass_0 | accountPostalCode - LogOddsForClass_0 | paymentBillingState - LogOddsForClass_0 | accountState - LogOddsForClass_0 | paymentInstrumentAgeInAccount | ipState - LogOddsForClass_0 | transactionAmount | transactionAmountUSD | ipPostalCode - LogOddsForClass_0 | localHour - LogOddsForClass_0 | Label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2000 | 0 | 0 | 0.00 | 720.25 | 5.064533 | 0.421214 | 1.312186 | 0.566395 | 3279.574306 | 1.218157 | 599.00 | 626.164650 | 1.259543 | 4.745402 | 0 |
| 1 | 62 | 1 | 1 | 1185.44 | 2530.37 | 0.538996 | 0.481838 | 4.401370 | 4.500157 | 61.970139 | 4.035601 | 1185.44 | 1185.440000 | 3.981118 | 4.921349 | 0 |
| 2 | 2000 | 0 | 0 | 0.00 | 0.00 | 5.064533 | 5.096396 | 3.056357 | 3.155226 | 0.000000 | 3.314186 | 32.09 | 32.090000 | 5.008490 | 4.742303 | 0 |
| 3 | 1 | 1 | 0 | 0.00 | 0.00 | 5.064533 | 5.096396 | 3.331154 | 3.331239 | 0.000000 | 3.529398 | 133.28 | 132.729554 | 1.324925 | 4.745402 | 0 |
| 4 | 1 | 1 | 0 | 0.00 | 132.73 | 5.412885 | 0.342945 | 5.563677 | 4.086965 | 0.001389 | 3.529398 | 543.66 | 543.660000 | 2.693451 | 4.876771 | 0 |
data.shape, data.Label.sum(), data.Label.mean()
((138721, 16), 797, 0.0057453449730033666)
X = data.drop(['Label'], axis=1)
y = data['Label']
Estimate a Logistic Regression
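A minimal sketch of the fit. Since the notebook's dataset isn't available here, a synthetic imbalanced dataset stands in for `X` and `y` (the `make_classification` settings are illustrative assumptions, chosen to mimic the ~0.6% fraud rate):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X, y: roughly 1% positive class
X, y = make_classification(n_samples=5000, n_features=15,
                           weights=[0.99], random_state=42)

# Stratified split so the rare class appears in both train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
```

With this class ratio, plain accuracy will look deceptively high even for a model that predicts the majority class everywhere, which motivates the metrics step below.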
Evaluate using the following metrics:
Comment about the results
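The original list of metrics is not reproduced in this excerpt. As an assumption, the sketch below computes accuracy, recall, and F1, which are typical choices for imbalanced classification (synthetic data stands in for the fraud dataset):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics

X, y = make_classification(n_samples=5000, n_features=15,
                           weights=[0.99], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

y_pred = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict(X_test)

# On heavily imbalanced data, accuracy is dominated by the majority class;
# recall and F1 on the positive (fraud) class are far more informative.
acc = metrics.accuracy_score(y_test, y_pred)
rec = metrics.recall_score(y_test, y_pred, zero_division=0)
f1 = metrics.f1_score(y_test, y_pred, zero_division=0)
```

Expect high accuracy alongside low recall: the classifier can score well overall while missing most fraud cases.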
Under-sample the negative class using random under-sampling
Which value of target_percentage did you choose? How do the results change?
Apply under-sampling only to the training set, and evaluate on the full test set
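A sketch of random under-sampling applied to the training set only. The `random_under_sample` helper is hypothetical (not from the notebook); `target_percentage` here means the positive-class share of the resampled data, and the synthetic dataset is an illustrative stand-in:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def random_under_sample(X, y, target_percentage=0.5, seed=42):
    """Keep all positives; randomly subsample negatives so positives
    make up roughly `target_percentage` of the result (hypothetical helper)."""
    rng = np.random.RandomState(seed)
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    n_neg = int(len(pos) * (1 - target_percentage) / target_percentage)
    keep = np.concatenate([pos, rng.choice(neg, size=n_neg, replace=False)])
    return X[keep], y[keep]

X, y = make_classification(n_samples=5000, n_features=15,
                           weights=[0.99], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Resample only the training data; the test set keeps its natural class ratio
X_res, y_res = random_under_sample(X_train, y_train, target_percentage=0.5)
clf = LogisticRegression(max_iter=1000).fit(X_res, y_res)
```

Evaluating on the untouched test set is the key point: resampling the test set would report performance on a class distribution that never occurs in production.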
Now repeat using random over-sampling
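The over-sampling counterpart: instead of discarding negatives, duplicate positives (sampling with replacement) until they reach the target share. As before, `random_over_sample` is a hypothetical helper and the data is synthetic:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def random_over_sample(X, y, target_percentage=0.5, seed=42):
    """Keep all negatives; duplicate positives with replacement so they
    make up roughly `target_percentage` of the result (hypothetical helper)."""
    rng = np.random.RandomState(seed)
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    n_pos = int(len(neg) * target_percentage / (1 - target_percentage))
    keep = np.concatenate([neg, rng.choice(pos, size=n_pos, replace=True)])
    return X[keep], y[keep]

X, y = make_classification(n_samples=5000, n_features=15,
                           weights=[0.99], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Over-sample only the training data, then fit as usual
X_over, y_over = random_over_sample(X_train, y_train, target_percentage=0.5)
clf = LogisticRegression(max_iter=1000).fit(X_over, y_over)
```

Over-sampling keeps all the majority-class information but repeats minority examples, which can encourage overfitting to those repeated points, especially for flexible models like trees.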
Estimate a Decision Tree classifier using the training and under-sampled datasets
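A sketch of the comparison: fit one `DecisionTreeClassifier` on the full (imbalanced) training set and one on an under-sampled copy, then compare recall on the same untouched test set (synthetic data as an illustrative stand-in):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

X, y = make_classification(n_samples=5000, n_features=15,
                           weights=[0.99], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Tree trained on the full, imbalanced training set
tree_full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Tree trained on an under-sampled training set:
# all positives plus an equal number of randomly chosen negatives
rng = np.random.RandomState(42)
pos = np.flatnonzero(y_train == 1)
neg = rng.choice(np.flatnonzero(y_train == 0), size=len(pos), replace=False)
idx = np.concatenate([pos, neg])
tree_under = DecisionTreeClassifier(random_state=42).fit(X_train[idx], y_train[idx])

# Both trees are scored on the same natural-ratio test set
recall_full = metrics.recall_score(y_test, tree_full.predict(X_test))
recall_under = metrics.recall_score(y_test, tree_under.predict(X_test))
```

Typically the under-sampled tree recovers more of the positive class at the cost of more false positives; checking precision alongside recall makes that trade-off visible.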