Exercise 08¶

Fraud Detection¶

Introduction¶

Fraud Detection Dataset from Microsoft Azure: data

Fraud detection is one of the earliest industrial applications of data mining and machine learning. Fraud detection is typically handled as a binary classification problem, but the class population is unbalanced because instances of fraud are usually very rare compared to the overall volume of transactions. Moreover, when fraudulent transactions are discovered, the business typically takes measures to block the accounts from transacting to prevent further losses.

In [2]:

%matplotlib inline
import pandas as pd
import numpy as np
from sklearn.cross_validation import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

In [3]:

import zipfile
with zipfile.ZipFile('../datasets/fraud_detection.csv.zip', 'r') as z:
    f = z.open('15_fraud_detection.csv')
    data = pd.io.parsers.read_table(f, index_col=0, sep=',')

In [4]:

data.head()

Out[4]:

	accountAge	digitalItemCount	sumPurchaseCount1Day	sumPurchaseAmount1Day	sumPurchaseAmount30Day	paymentBillingPostalCode - LogOddsForClass_0	accountPostalCode - LogOddsForClass_0	paymentBillingState - LogOddsForClass_0	accountState - LogOddsForClass_0	paymentInstrumentAgeInAccount	ipState - LogOddsForClass_0	transactionAmount	transactionAmountUSD	ipPostalCode - LogOddsForClass_0	localHour - LogOddsForClass_0
0	2000	0	0	0.00	720.25	5.064533	0.421214	1.312186	0.566395	3279.574306	1.218157	599.00	626.164650	1.259543	4.745402
1	62	1	1	1185.44	2530.37	0.538996	0.481838	4.401370	4.500157	61.970139	4.035601	1185.44	1185.440000	3.981118	4.921349
2	2000	0	0	0.00	0.00	5.064533	5.096396	3.056357	3.155226	0.000000	3.314186	32.09	32.090000	5.008490	4.742303
3	1	1	0	0.00	0.00	5.064533	5.096396	3.331154	3.331239	0.000000	3.529398	133.28	132.729554	1.324925	4.745402
4	1	1	0	0.00	132.73	5.412885	0.342945	5.563677	4.086965	0.001389	3.529398	543.66	543.660000	2.693451	4.876771

In [59]:

data.shape, data.Label.sum(), data.Label.mean()

Out[59]:

((138721, 16), 797, 0.0057453449730033666)

In [5]:

X = data.drop(['Label'], axis=1)
y = data['Label']

Exercice 08.1¶

Estimate a Logistic Regression

Evaluate using the following metrics:

Accuracy
F1-Score
F_Beta-Score (Beta=10)

Comment about the results

In [ ]:

Exercice 08.2¶

Under-sample the negative class using random-under-sampling

Which is parameter for target_percentage did you choose? How the results change?

Only apply under-sampling to the training set, evaluate using the whole test set

In [ ]:

Exercice 08.3¶

Now using random-over-sampling

In [ ]:

Exercice 08.4¶

Evaluate the results using SMOTE

Which parameters did you choose?

In [ ]:

Exercice 08.5¶

Estimate a Decision Tree classifier using the training and under-sampled datasets

In [ ]: