Fraud detection is one of the earliest industrial applications of data mining and machine learning. It is typically framed as a binary classification problem, but the class distribution is heavily imbalanced, because fraudulent transactions are very rare compared to the overall transaction volume. Moreover, once fraudulent transactions are discovered, the business typically blocks the affected accounts from transacting to prevent further losses.
import pandas as pd
import zipfile
with zipfile.ZipFile('../datasets/fraud_detection.csv.zip', 'r') as z:
    f = z.open('15_fraud_detection.csv')
    data = pd.read_csv(f, index_col=0)
data.head()
| | accountAge | digitalItemCount | sumPurchaseCount1Day | sumPurchaseAmount1Day | sumPurchaseAmount30Day | paymentBillingPostalCode - LogOddsForClass_0 | accountPostalCode - LogOddsForClass_0 | paymentBillingState - LogOddsForClass_0 | accountState - LogOddsForClass_0 | paymentInstrumentAgeInAccount | ipState - LogOddsForClass_0 | transactionAmount | transactionAmountUSD | ipPostalCode - LogOddsForClass_0 | localHour - LogOddsForClass_0 | Label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2000 | 0 | 0 | 0.00 | 720.25 | 5.064533 | 0.421214 | 1.312186 | 0.566395 | 3279.574306 | 1.218157 | 599.00 | 626.164650 | 1.259543 | 4.745402 | 0 |
| 1 | 62 | 1 | 1 | 1185.44 | 2530.37 | 0.538996 | 0.481838 | 4.401370 | 4.500157 | 61.970139 | 4.035601 | 1185.44 | 1185.440000 | 3.981118 | 4.921349 | 0 |
| 2 | 2000 | 0 | 0 | 0.00 | 0.00 | 5.064533 | 5.096396 | 3.056357 | 3.155226 | 0.000000 | 3.314186 | 32.09 | 32.090000 | 5.008490 | 4.742303 | 0 |
| 3 | 1 | 1 | 0 | 0.00 | 0.00 | 5.064533 | 5.096396 | 3.331154 | 3.331239 | 0.000000 | 3.529398 | 133.28 | 132.729554 | 1.324925 | 4.745402 | 0 |
| 4 | 1 | 1 | 0 | 0.00 | 132.73 | 5.412885 | 0.342945 | 5.563677 | 4.086965 | 0.001389 | 3.529398 | 543.66 | 543.660000 | 2.693451 | 4.876771 | 0 |
X = data.drop(['Label'], axis=1)
y = data['Label']
y.value_counts(normalize=True)
0    0.994255
1    0.005745
Name: Label, dtype: float64
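Only about 0.57% of the transactions are labeled as fraud, so the dataset is heavily imbalanced and plain accuracy will be a misleading metric.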
- Estimate LogisticRegression, GaussianNB, KNeighborsClassifier, and DecisionTreeClassifier models
- Evaluate them using the following metrics: accuracy, F1-score, recall, and precision
- Comment on the results
- Combine the classifiers and comment on the ensemble
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
models = {'lr': LogisticRegression(),
          'dt': DecisionTreeClassifier(),
          'nb': GaussianNB(),
          'nn': KNeighborsClassifier()}
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)
# Train all the models
for model in models.keys():
    models[model].fit(X_train, y_train)
# Predict the test set with each model
y_pred = pd.DataFrame(index=X_test.index, columns=models.keys())
for model in models.keys():
    y_pred[model] = models[model].predict(X_test)
y_pred.sample(10)
| | lr | dt | nb | nn |
|---|---|---|---|---|
| 111018 | 0 | 0 | 0 | 0 |
| 120018 | 0 | 0 | 0 | 0 |
| 24895 | 0 | 0 | 0 | 0 |
| 23525 | 0 | 0 | 0 | 0 |
| 29535 | 0 | 0 | 0 | 0 |
| 52150 | 0 | 0 | 1 | 0 |
| 127077 | 0 | 0 | 0 | 0 |
| 83261 | 0 | 0 | 0 | 0 |
| 26716 | 0 | 0 | 0 | 0 |
| 45260 | 0 | 0 | 0 | 0 |
# Majority vote: flag fraud only when more than half of the four models predict it
y_pred_ensemble1 = (y_pred.mean(axis=1) > 0.5).astype(int)
y_pred_ensemble1.mean()
0.00020183962400161472
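The ensemble flags only about 0.02% of the test transactions as fraud: the mean of a 0/1 prediction vector is simply the fraction of positives, here far below the 0.57% base rate.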
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score
stats = {'acc': accuracy_score,
         'f1': f1_score,
         'rec': recall_score,
         'pre': precision_score}
res = pd.DataFrame(index=models.keys(), columns=stats.keys())
for model in models.keys():
    for stat in stats.keys():
        res.loc[model, stat] = stats[stat](y_test, y_pred[model])
res
| | acc | f1 | rec | pre |
|---|---|---|---|---|
| lr | 0.993829 | 0 | 0 | 0 |
| dt | 0.987918 | 0.121593 | 0.136792 | 0.109434 |
| nb | 0.923647 | 0.0314557 | 0.20283 | 0.01705 |
| nn | 0.993714 | 0.0840336 | 0.0471698 | 0.384615 |
# Append the ensemble's scores as a new row
res.loc['ensemble1'] = 0
for stat in stats.keys():
    res.loc['ensemble1', stat] = stats[stat](y_test, y_pred_ensemble1)
res
| | acc | f1 | rec | pre |
|---|---|---|---|---|
| lr | 0.993829 | 0 | 0 | 0 |
| dt | 0.987918 | 0.121593 | 0.136792 | 0.109434 |
| nb | 0.923647 | 0.0314557 | 0.20283 | 0.01705 |
| nn | 0.993714 | 0.0840336 | 0.0471698 | 0.384615 |
| ensemble1 | 0.994031 | 0.0547945 | 0.0283019 | 0.857143 |
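With 99.4% of transactions legitimate, accuracy is dominated by the majority class, so every model scores above 92% while detecting little fraud. Logistic regression never predicts the positive class at all (recall and precision of 0), and the majority-vote ensemble trades recall (0.028) for precision (0.857): it flags very few transactions, but the ones it flags are usually fraudulent. This motivates the resampling techniques that follow.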
Apply random undersampling with a target percentage of 0.5. How do the results change? One possible implementation is sketched below.
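A minimal sketch, assuming the X_train/y_train split from above is still in scope: keep every fraud case and draw an equal number of legitimate transactions at random, so fraud makes up 50% of the under-sampled training set (imbalanced-learn's RandomUnderSampler is a ready-made alternative). The names X_train_us and y_train_us are introduced here by the sketch.
import numpy as np
rng = np.random.RandomState(2)
fraud_idx = y_train[y_train == 1].index
nonfraud_idx = y_train[y_train == 0].index
# Draw as many legitimate transactions as there are fraud cases
keep = rng.choice(nonfraud_idx, size=len(fraud_idx), replace=False)
us_idx = fraud_idx.append(pd.Index(keep))
X_train_us, y_train_us = X_train.loc[us_idx], y_train.loc[us_idx]
y_train_us.mean()  # 0.5 by construction
# Retrain the four models on the balanced set; the test set stays untouched
for model in models.keys():
    models[model].fit(X_train_us, y_train_us)
Re-running the evaluation loop above on the retrained models typically raises recall sharply at the cost of precision, since the models no longer default to the majority class.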
For each model, estimate a BaggingClassifier of 100 models using the under-sampled data; a sketch follows.
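One reading of this task, sketched below with the X_train_us/y_train_us names from the previous sketch: wrap each base model in a BaggingClassifier of 100 estimators trained on the under-sampled data. Note that the base-model argument is called base_estimator in scikit-learn releases before 1.2 and estimator afterwards.
from sklearn.ensemble import BaggingClassifier
y_pred_bag = pd.DataFrame(index=X_test.index, columns=models.keys())
for name, clf in models.items():
    # 100 bootstrap replicas of each base model ('estimator=' in scikit-learn >= 1.2)
    bag = BaggingClassifier(base_estimator=clf, n_estimators=100, random_state=2)
    bag.fit(X_train_us, y_train_us)
    y_pred_bag[name] = bag.predict(X_test)
The metrics loop above can then be reused on y_pred_bag to compare the bagged ensembles against the single models.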
Using the under-sampled dataset, evaluate a RandomForestClassifier and compare the results. Then change n_estimators=100: what happens? Both runs are sketched below.
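A sketch covering both random-forest tasks, again assuming the under-sampled X_train_us/y_train_us from the first sketch. The default n_estimators was 10 in scikit-learn before 0.22 and 100 from then on, so both values are run explicitly here.
from sklearn.ensemble import RandomForestClassifier
for n in (10, 100):
    rf = RandomForestClassifier(n_estimators=n, random_state=2)
    rf.fit(X_train_us, y_train_us)
    pred = rf.predict(X_test)
    print(n, 'recall:', recall_score(y_test, pred), 'precision:', precision_score(y_test, pred))
More trees average away more of the individual trees' variance, so moving from 10 to 100 estimators usually yields somewhat more stable, and often slightly better, scores.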