14 - Kaggle Competition¶

Fraud Detection¶

https://inclass.kaggle.com/c/easy-ml-class ¶

by Alejandro Correa Bahnsen

version 0.1, May 2016

Part of the class Machine Learning for Security Informatics ¶

This notebook is licensed under a [Creative Commons Attribution-ShareAlike 3.0 Unported License]

pip install tqdm

Fraud Detection¶

In [1]:

import pandas as pd
import zipfile
with zipfile.ZipFile('../datasets/fraud_transactions_kaggle.csv.zip', 'r') as z:
    f = z.open('fraud_transactions_kaggle.csv')
    data = pd.read_csv(f, index_col=0)

In [2]:

data.head()

Out[2]:

	date	card_number	type	merchant	amount	fraud
ID
0	2011-01-01 08:00:06	1942	2	8328	65.16	0.0
1	2011-01-01 08:00:16	5629	2	42588	260.84	0.0
2	2011-01-01 08:01:28	408	2	15622	6010.05	0.0
3	2011-01-01 08:01:43	859	2	45192	348.46	0.0
4	2011-01-01 08:01:48	3786	2	35549	1160.35	0.0

In [3]:

data.tail()

Out[3]:

	date	card_number	type	merchant	amount	fraud
ID
199995	2012-12-31 17:04:18	4069	2	35828	91.22	NaN
199996	2012-12-31 17:04:51	9	2	46923	390.95	NaN
199997	2012-12-31 17:05:38	1481	1	-1	0.65	NaN
199998	2012-12-31 17:05:55	1481	1	4535	390.04	NaN
199999	2012-12-31 17:25:02	0	1	8322	308.44	NaN

In [4]:

data.fraud.value_counts(dropna=False)

Out[4]:

 0.0    171048
NaN      27909
 1.0      1043
Name: fraud, dtype: int64

Estimate aggregated features¶

In [5]:

from datetime import datetime, timedelta
from tqdm import tqdm

Split for each account and create the date as index

In [6]:

card_numbers = data['card_number'].unique()
data['trx_id'] = data.index
data.index = pd.DatetimeIndex(data['date'])

data_ = []
for card_number in tqdm(card_numbers):
    data_.append(data.query('card_number == ' + str(card_number)))

100%|██████████| 8087/8087 [00:20<00:00, 390.15it/s]

Create Aggregated Features for one account

In [7]:

res_agg = pd.DataFrame(index=data['trx_id'].values, 
                       columns=['Trx_sum_7D', 'Trx_count_1D'])

In [8]:

trx = data_[0]

for i in range(trx.shape[0]):
    date = trx.index[i]
    trx_id = int(trx.ix[i, 'trx_id'])
    # Sum 7 D
    agg_ = trx[date-pd.datetools.to_offset('7D').delta:date-timedelta(0,0,1)]
    res_agg.loc[trx_id, 'Trx_sum_7D'] = agg_['amount'].sum()
    # Count 1D
    agg_ = trx[date-pd.datetools.to_offset('1D').delta:date-timedelta(0,0,1)]
    res_agg.loc[trx_id, 'Trx_count_1D'] = agg_['amount'].shape[0]

In [9]:

res_agg.mean()

Out[9]:

Trx_sum_7D      1054.881429
Trx_count_1D       0.640693
dtype: float64

All accounts

In [10]:

for trx in tqdm(data_):
    for i in range(trx.shape[0]):
        date = trx.index[i]
        trx_id = int(trx.ix[i, 'trx_id'])
        # Sum 7 D
        agg_ = trx[date-pd.datetools.to_offset('7D').delta:date-timedelta(0,0,1)]
        res_agg.loc[trx_id, 'Trx_sum_7D'] = agg_['amount'].sum()
        # Count 1D
        agg_ = trx[date-pd.datetools.to_offset('1D').delta:date-timedelta(0,0,1)]
        res_agg.loc[trx_id, 'Trx_count_1D'] = agg_['amount'].shape[0]

100%|██████████| 8087/8087 [04:26<00:00, 30.33it/s]

In [13]:

res_agg.head()

Out[13]:

	Trx_sum_7D	Trx_count_1D
0	0	0
1	0	0
2	0	0
3	0	0
4	0	0

In [15]:

data.index = data.trx_id

In [16]:

data = data.join(res_agg)

In [19]:

data.sample(15, random_state=42).sort_index()

Out[19]:

	date	card_number	type	merchant	amount	fraud	trx_id	Trx_sum_7D	Trx_count_1D
trx_id
4082	2011-01-16 16:26:53	3558	2	13505	528.82	0.0	4082	307.85	0
23677	2011-04-04 08:13:41	1162	2	9417	117.29	0.0	23677	0	0
30074	2011-04-29 13:09:07	0	1	56997	21.29	0.0	30074	14171.9	2
65426	2011-09-09 10:11:24	4420	2	57849	29.70	0.0	65426	0	0
72272	2011-10-04 10:43:00	2114	2	5109	2170.65	0.0	72272	131020	7
74456	2011-10-11 17:17:22	2148	2	1341	2150.19	0.0	74456	0	0
84660	2011-11-19 17:06:58	1521	1	35294	651.59	0.0	84660	0	0
117167	2012-04-01 12:33:33	1471	1	-1	650.94	0.0	117167	4381.21	1
119737	2012-04-09 14:27:12	2723	1	38616	13.03	0.0	119737	13614.2	0
132467	2012-05-27 16:43:11	4857	2	45373	41.70	0.0	132467	634.13	10
134858	2012-06-03 17:05:21	2114	1	18692	26.06	0.0	134858	175202	7
142133	2012-06-29 16:21:37	7588	2	35991	92.53	0.0	142133	1151.21	4
158154	2012-08-20 10:55:23	4420	2	53353	182.65	0.0	158154	121.77	0
176418	2012-10-16 14:23:04	1595	2	25985	15397.58	NaN	176418	0	0
186433	2012-11-20 11:04:00	4923	2	36010	217.89	NaN	186433	573.4	0

Split train and test¶

In [20]:

X = data.loc[~data.fraud.isnull()]

In [21]:

y = X.fraud

In [22]:

X = X.drop(['fraud', 'date', 'card_number'], axis=1)

In [28]:

X_kaggle = data.loc[data.fraud.isnull()]
X_kaggle = X_kaggle.drop(['fraud', 'date', 'card_number'], axis=1)

In [29]:

X_kaggle.head()

Out[29]:

	type	merchant	amount	trx_id	Trx_sum_7D	Trx_count_1D
trx_id
172091	2	13273	208.51	172091	120165	14
172092	2	34472	525.05	172092	71042.4	0
172093	2	37909	802.24	172093	120374	15
172094	2	35167	130.32	172094	90638.1	9
172095	2	35073	9696.96	172095	0	0

Simple Random Forest¶

In [25]:

from sklearn.ensemble import RandomForestClassifier

In [26]:

clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, class_weight='balanced')

In [27]:

from sklearn.metrics import fbeta_score

KFold cross-validation

In [31]:

from sklearn.cross_validation import KFold

In [35]:

kf = KFold(X.shape[0], n_folds=5)
res = []
for train, test in kf:
    X_train, X_test, y_train, y_test = X.iloc[train], X.iloc[test], y.iloc[train], y.iloc[test]
    clf.fit(X_train, y_train)
    y_pred_proba = clf.predict_proba(X_test)[:, 1]
    y_pred = (y_pred_proba>0.05).astype(int)
    res.append(fbeta_score(y_test, y_pred, beta=2))

In [37]:

pd.Series(res).describe()

Out[37]:

count    5.000000
mean     0.078145
std      0.032472
min      0.054945
25%      0.057692
50%      0.062500
75%      0.082713
max      0.132877
dtype: float64

Train with all

Predict and send to Kaggle¶

In [38]:

clf.fit(X, y)

Out[38]:

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [40]:

y_pred = clf.predict_proba(X_kaggle)[:, 1]

In [41]:

y_pred = (y_pred>0.05).astype(int)

In [42]:

y_pred = pd.Series(y_pred,name='fraud', index=X_kaggle.index)

In [43]:

y_pred.head(10)

Out[43]:

trx_id
172091    0
172092    1
172093    1
172094    0
172095    1
172096    1
172097    1
172098    0
172099    1
172100    0
Name: fraud, dtype: int64

In [108]:

y_pred.to_csv('fraud_transactions_kaggle_1.csv', header=True, index_label='ID')

Main Issues¶

Class imbalance
Feature creation
Model selection
Threshold selection

14 - Kaggle Competition¶

Fraud Detection¶

https://inclass.kaggle.com/c/easy-ml-class¶

Part of the class Machine Learning for Security Informatics¶

Fraud Detection¶

Estimate aggregated features¶

Split train and test¶

Simple Random Forest¶

Predict and send to Kaggle¶

Main Issues¶

https://inclass.kaggle.com/c/easy-ml-class ¶

Part of the class Machine Learning for Security Informatics ¶