Fraud detection is one of the earliest industrial applications of data mining and machine learning. It is typically handled as a binary classification problem, but the classes are highly imbalanced because fraud is very rare compared to the overall volume of transactions. Moreover, when fraudulent transactions are discovered, the business typically blocks the affected accounts from transacting to prevent further losses.
import pandas as pd
import zipfile
# Open the CSV inside the zip archive and read it while the archive is still open
with zipfile.ZipFile('../datasets/fraud_detection.csv.zip', 'r') as z:
    f = z.open('15_fraud_detection.csv')
    data = pd.read_csv(f, index_col=0)
data.head()
|  | accountAge | digitalItemCount | sumPurchaseCount1Day | sumPurchaseAmount1Day | sumPurchaseAmount30Day | paymentBillingPostalCode - LogOddsForClass_0 | accountPostalCode - LogOddsForClass_0 | paymentBillingState - LogOddsForClass_0 | accountState - LogOddsForClass_0 | paymentInstrumentAgeInAccount | ipState - LogOddsForClass_0 | transactionAmount | transactionAmountUSD | ipPostalCode - LogOddsForClass_0 | localHour - LogOddsForClass_0 | Label |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2000 | 0 | 0 | 0.00 | 720.25 | 5.064533 | 0.421214 | 1.312186 | 0.566395 | 3279.574306 | 1.218157 | 599.00 | 626.164650 | 1.259543 | 4.745402 | 0 |
1 | 62 | 1 | 1 | 1185.44 | 2530.37 | 0.538996 | 0.481838 | 4.401370 | 4.500157 | 61.970139 | 4.035601 | 1185.44 | 1185.440000 | 3.981118 | 4.921349 | 0 |
2 | 2000 | 0 | 0 | 0.00 | 0.00 | 5.064533 | 5.096396 | 3.056357 | 3.155226 | 0.000000 | 3.314186 | 32.09 | 32.090000 | 5.008490 | 4.742303 | 0 |
3 | 1 | 1 | 0 | 0.00 | 0.00 | 5.064533 | 5.096396 | 3.331154 | 3.331239 | 0.000000 | 3.529398 | 133.28 | 132.729554 | 1.324925 | 4.745402 | 0 |
4 | 1 | 1 | 0 | 0.00 | 132.73 | 5.412885 | 0.342945 | 5.563677 | 4.086965 | 0.001389 | 3.529398 | 543.66 | 543.660000 | 2.693451 | 4.876771 | 0 |
X = data.drop(['Label'], axis=1)
y = data['Label']
y.value_counts(normalize=True)
0    0.994255
1    0.005745
Name: Label, dtype: float64
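The class proportions above mean that plain accuracy says very little on this problem. As a minimal sketch, the snippet below builds a hypothetical label vector with the same proportions (the 1,000,000-row size is an assumption, not the real dataset size) and scores a "model" that never predicts fraud.

```python
import numpy as np

# Hypothetical label vector matching the printed class proportions.
# The row count is an assumption chosen only to reproduce 0.994255 / 0.005745.
n_total = 1_000_000
n_fraud = 5_745
y_demo = np.zeros(n_total, dtype=int)
y_demo[:n_fraud] = 1

# A trivial "classifier" that always predicts the majority class (no fraud).
y_pred = np.zeros_like(y_demo)

# Accuracy: share of correct predictions over all rows.
accuracy = (y_pred == y_demo).mean()

# Recall on the fraud class: share of actual frauds that were flagged.
recall = ((y_pred == 1) & (y_demo == 1)).sum() / (y_demo == 1).sum()

print(f"accuracy = {accuracy:.6f}")
print(f"recall on fraud = {recall:.1f}")
```

The accuracy equals the majority-class share while no fraud is caught, which is why metrics focused on the fraud class, such as precision, recall, and F1, tend to be more informative here.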
Estimate Logistic Regression, Gaussian naive Bayes (GaussianNB), k-nearest neighbors, and Decision Tree classifiers
Evaluate using the following metrics:
Comment on the results
Combine the classifiers and comment
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
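One possible way to approach the first exercise is sketched below. It runs on a synthetic imbalanced dataset from `make_classification` rather than the fraud data, so the numbers are illustrative only. The metric choice (precision, recall, F1) is an assumption, since the metric list is not shown here, and the majority vote is just one simple way to combine the classifiers.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X and y (assumption): roughly 5% positives.
X_demo, y_demo = make_classification(
    n_samples=4000, n_features=10, weights=[0.95, 0.05], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "GaussianNB": GaussianNB(),
    "KNeighborsClassifier": KNeighborsClassifier(),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
}


def score(y_true, y_pred):
    """Metrics for the positive class; zero_division=0 avoids warnings
    when a model predicts no positives at all."""
    return {
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
    }


results = {}
predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions[name] = model.predict(X_test)
    results[name] = score(y_test, predictions[name])

# Combine by majority vote: flag a row when at least 2 of the 4 models do
# (ties count as positive, which is a design choice, not a rule).
votes = np.vstack(list(predictions.values())).sum(axis=0)
y_vote = (votes >= 2).astype(int)
results["MajorityVote"] = score(y_test, y_vote)

for name, metrics in results.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

To run this on the fraud data, replace `X_demo` and `y_demo` with the `X` and `y` defined earlier.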
Apply random undersampling with a target percentage of 0.5
How do the results change?
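Random undersampling can be sketched as follows. The helper `random_undersample` is hypothetical (not part of pandas or scikit-learn) and assumes that "target percentage" means the share of the fraud class (label 1) in the resampled data. The toy data are made up for illustration.

```python
import numpy as np
import pandas as pd


def random_undersample(X, y, target_percentage=0.5, seed=None):
    """Hypothetical helper: randomly drop majority-class (0) rows until
    class 1 makes up about `target_percentage` of the returned data."""
    rng = np.random.default_rng(seed)
    labels = y.to_numpy()
    pos_idx = np.flatnonzero(labels == 1)
    neg_idx = np.flatnonzero(labels == 0)
    # Solve n_pos / (n_pos + n_neg_keep) = target_percentage for n_neg_keep.
    n_neg_keep = int(round(len(pos_idx) * (1 - target_percentage) / target_percentage))
    n_neg_keep = min(n_neg_keep, len(neg_idx))
    keep_neg = rng.choice(neg_idx, size=n_neg_keep, replace=False)
    # Keep every positive row plus the sampled negatives, in original order.
    keep = np.sort(np.concatenate([pos_idx, keep_neg]))
    return X.iloc[keep], y.iloc[keep]


# Toy data (assumption): 50 frauds among 10,000 transactions.
rng = np.random.default_rng(0)
y_toy = pd.Series(np.r_[np.ones(50, dtype=int), np.zeros(9950, dtype=int)], name="Label")
X_toy = pd.DataFrame({"transactionAmount": rng.gamma(2.0, 100.0, size=len(y_toy))})

X_us, y_us = random_undersample(X_toy, y_toy, target_percentage=0.5, seed=42)
print(y_us.value_counts(normalize=True))
```

Undersampling is normally applied to the training split only, so that the test set keeps the original class proportions and the evaluation reflects real conditions.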
For each model, estimate a BaggingClassifier of 100 models using the under-sampled datasets
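A minimal sketch of the bagging exercise follows, again on synthetic data. It reads the task as fitting each `BaggingClassifier` on a single 50/50 undersampled training set. Drawing a fresh undersample per base model would be a different, also reasonable, reading. Recall on the untouched test set is used as an example metric.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the fraud data (assumption): roughly 5% positives.
X_demo, y_demo = make_classification(
    n_samples=4000, n_features=10, weights=[0.95, 0.05], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

# Undersample the training set to a 50/50 class balance.
rng = np.random.default_rng(0)
pos_idx = np.flatnonzero(y_train == 1)
neg_idx = rng.choice(np.flatnonzero(y_train == 0), size=len(pos_idx), replace=False)
keep = np.concatenate([pos_idx, neg_idx])
X_us, y_us = X_train[keep], y_train[keep]

base_models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "GaussianNB": GaussianNB(),
    "KNeighborsClassifier": KNeighborsClassifier(),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
}

bagged_recall = {}
for name, base in base_models.items():
    # The base estimator is passed positionally because its keyword name
    # differs between scikit-learn versions.
    bag = BaggingClassifier(base, n_estimators=100, random_state=0)
    bag.fit(X_us, y_us)
    bagged_recall[name] = recall_score(y_test, bag.predict(X_test))

for name, value in bagged_recall.items():
    print(f"{name}: recall = {value:.3f}")
```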
Using the under-sampled dataset
Evaluate a RandomForestClassifier and compare the results
Change n_estimators to 100. What happens?
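The random forest comparison might look like the sketch below, on the same kind of synthetic, undersampled training data. Note that in scikit-learn 0.22 and later the default `n_estimators` is already 100 (it was 10 before), so the sketch compares 10 trees against 100 explicitly. F1 is used as an example metric.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the fraud data (assumption): roughly 5% positives.
X_demo, y_demo = make_classification(
    n_samples=4000, n_features=10, weights=[0.95, 0.05], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

# Undersample the training set to a 50/50 class balance.
rng = np.random.default_rng(0)
pos_idx = np.flatnonzero(y_train == 1)
neg_idx = rng.choice(np.flatnonzero(y_train == 0), size=len(pos_idx), replace=False)
keep = np.concatenate([pos_idx, neg_idx])
X_us, y_us = X_train[keep], y_train[keep]

# Fit forests of two sizes and score each on the untouched test set.
forest_f1 = {}
for n_trees in (10, 100):
    rf = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    rf.fit(X_us, y_us)
    forest_f1[n_trees] = f1_score(y_test, rf.predict(X_test))

for n_trees, value in forest_f1.items():
    print(f"n_estimators={n_trees}: F1 = {value:.3f}")
```

More trees usually reduce the variance of the predictions at the cost of longer training time. Whether the score actually improves on a given dataset is something to check rather than assume.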