import numpy as np
import pandas as pd
df = pd.read_parquet("fraud-cleaned-sample.parquet")
We're using time-series data, so we'll split based on time.
first = df['timestamp'].min()
last = df['timestamp'].max()
cutoff = first + ((last - first) * 0.7)
train = df[df['timestamp'] <= cutoff]
len(train)
test = df[df['timestamp'] > cutoff]
len(test)
len(train) / (len(train) + len(test))
Some of our features are obvious quantities (like interarrival times and transaction amounts), but others are categories of things (like merchant IDs and transaction types). In a conventional programming language or database schema, we'd use enumerated types (C programmers may want to use distinguished small integers) to model categories of things, but those aren't suitable for input to machine learning algorithms.
Why?
Well, let's say we encode transaction types as small integers, like this:
MANUAL=0
SWIPE=1
CHIP_AND_PIN=2
CONTACTLESS=3
ONLINE=4
We can use this representation to write code that treats these differently, but the integers don't actually capture anything about our problem that a machine learning algorithm can exploit -- a manual transaction isn't "less than" a swipe transaction, and an online transaction isn't "closer to" a contactless transaction than a manual one is. We want a representation that makes sure that manual transactions are similar to other manual transactions in some way but dissimilar to all other transactions in that way.
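To make the problem with integer codes concrete, here is a minimal sketch that compares distances between transaction types under two encodings. The helper names (`integer_code`, `one_hot`, `distance`) are made up for this illustration and are not part of the notebook's pipeline.

```python
import math

# Transaction types from the text above.
TYPES = ["manual", "swipe", "chip_and_pin", "contactless", "online"]

def integer_code(t):
    """Encode a type as its position in TYPES (the enum-style encoding)."""
    return [float(TYPES.index(t))]

def one_hot(t):
    """Encode a type as a vector with a single 1 in its own position."""
    return [1.0 if t == other else 0.0 for other in TYPES]

def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Integer codes invent an ordering: "online" ends up four times as far
# from "manual" as "swipe" is, which says nothing about the real data.
int_manual_swipe = distance(integer_code("manual"), integer_code("swipe"))
int_manual_online = distance(integer_code("manual"), integer_code("online"))

# One-hot vectors put every pair of distinct types at the same distance
# and identical types at distance zero.
oh_manual_swipe = distance(one_hot("manual"), one_hot("swipe"))
oh_manual_online = distance(one_hot("manual"), one_hot("online"))
oh_manual_manual = distance(one_hot("manual"), one_hot("manual"))

print(int_manual_swipe, int_manual_online)
print(oh_manual_swipe, oh_manual_online, oh_manual_manual)
```

With one indicator per category, the only notion of similarity that survives is "same type" versus "different type", which is the property the paragraph above asks for.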
There are several approaches we can use to make sense of categorical features, and we'll use two of them in this notebook: one-hot encoding for transaction types and feature hashing for merchant IDs.
import sklearn
from sklearn.pipeline import Pipeline
from sklearn import feature_extraction, preprocessing
from sklearn.preprocessing import FunctionTransformer
from sklearn.compose import ColumnTransformer
stringize = np.frompyfunc(lambda x: "%s" % x, 1, 1)
def mk_stringize(colname):
    def stringize(tab):
        return [{colname : s} for s in tab]
    return stringize
def amap(s):
    return s.map(lambda x: {'merchant_id' : str(x)})
# my_func = mk_stringize('merchant_id')
my_func = amap
def mk_hasher(features=16384, values=None):
    return Pipeline([('dictify',
                      FunctionTransformer(my_func, accept_sparse=True)),
                     ('hasher',
                      sklearn.feature_extraction.FeatureHasher(n_features=features, input_type='dict'))])
HASH_BUCKETS = 256
tt_xform = ('onehot', sklearn.preprocessing.OneHotEncoder(handle_unknown='ignore', categories=[['online','contactless','chip_and_pin','manual','swipe']]), ['trans_type'])
mu_xform = ('m_hashing', mk_hasher(HASH_BUCKETS), 'merchant_id')
xform_steps = [tt_xform, mu_xform]
cat_xform = ColumnTransformer(transformers=xform_steps, n_jobs=None)
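The hashing step in the cell above can be hard to picture, so here is a rough sketch of the underlying idea in plain Python. It is an illustration only: the helper names and the choice of MD5 are assumptions made for this example, and scikit-learn's `FeatureHasher` uses its own hash function and, by default, signed values, so its column indices will differ.

```python
import hashlib

N_BUCKETS = 256  # plays the same role as HASH_BUCKETS in the notebook

def bucket_for(merchant_id, n_buckets=N_BUCKETS):
    """Map an ID to a column index by hashing its string form.

    MD5 is an arbitrary choice for this sketch; any stable hash would do.
    """
    digest = hashlib.md5(str(merchant_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "little") % n_buckets

def hashed_vector(merchant_id, n_buckets=N_BUCKETS):
    """Return a fixed-width indicator vector with a 1 in the ID's bucket."""
    vec = [0] * n_buckets
    vec[bucket_for(merchant_id, n_buckets)] = 1
    return vec

# Hypothetical merchant IDs; the first and last are the same merchant.
ids = [17, 4242, 99999, 17]
vectors = [hashed_vector(m) for m in ids]

for m in ids:
    print(m, "->", bucket_for(m))
```

The width of the output is fixed up front, so merchants that never appeared in the training data still get a valid encoding. The trade-off is that two different merchants can land in the same bucket, and fewer buckets mean more collisions.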
The general approach we'll use is to reduce the dimensionality of our encoded categorical features so we can plot them as points on a plane. This means going from hundreds of dimensions (in the case of hashed merchant IDs) or five dimensions (in the case of one-hot encoded transaction types) to two dimensions.
We'll use two different techniques: a linear technique called principal component analysis and a nonlinear technique called t-distributed stochastic neighbor embedding. The details of these techniques are out of scope for this workshop, but they're both good places to start if you want to visualize some high-dimensional data. Dimensionality reduction can be expensive, so we'll start by sampling only a small amount of our data.
vis_sample = pd.concat([train[train["label"] == label].sample(2500) for label in ["legitimate", "fraud"]])
categorical_matrix = cat_xform.fit_transform(vis_sample)
crows, ccols = categorical_matrix.shape
(categorical_matrix != 0).sum(0)
We're going to start by using PCA to plot the first two principal components of the encoded merchants -- think of this as mapping from the high-dimensional space to a two-dimensional space in a way that emphasizes the dimensions that contain the most information and minimizes the dimensions that contain the least information.
import sklearn.decomposition
merchants = categorical_matrix[:, -HASH_BUCKETS:]
DIMENSIONS = 2
mpca2 = sklearn.decomposition.PCA(DIMENSIONS)
mpca2_a = mpca2.fit_transform(merchants.toarray())
# np.object was removed in NumPy 1.24; the builtin object is equivalent
merchants_df = pd.DataFrame({"label": vis_sample["label"].astype(object),
                             "x": mpca2_a.T[0],
                             "y": mpca2_a.T[1]}).reset_index().dropna()
del merchants_df["index"]
import altair as alt
alt.Chart(merchants_df).mark_point(opacity=0.1).encode(
x="x:Q",
y="y:Q",
color="label"
).interactive()
As we can see, there's a lot of overlap between the classes here and merchant ID alone isn't an obvious way to differentiate between legitimate and fraudulent transactions.
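When reading a PCA plot, it helps to know how much of the original variance the two plotted components retain. The sketch below shows the check on synthetic data (not the fraud data set); the array shapes and noise levels are arbitrary choices for illustration. In this notebook, the same attribute is available as `mpca2.explained_variance_ratio_`.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-in for an encoded feature matrix: 200 rows and 10 columns,
# built so that almost all of the variance lies along two hidden directions.
latent = rng.normal(size=(200, 2)) * [5.0, 2.0]
mixing = rng.normal(size=(2, 10))
noise = rng.normal(scale=0.1, size=(200, 10))
data = latent @ mixing + noise

pca = PCA(n_components=2)
projected = pca.fit_transform(data)

# Fraction of the total variance captured by each component, largest first.
retained = pca.explained_variance_ratio_.sum()
print(projected.shape)
print(f"variance retained by two components: {retained:.3f}")
```

A low retained fraction would suggest that the two-dimensional picture hides most of the structure in the data, so overlap in the plot should be interpreted with that in mind.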
Sometimes, a nonlinear visualization technique can work better than a linear one like PCA. The next approach we'll try is called t-distributed stochastic neighbor embedding, or t-SNE for short. t-SNE learns a mapping from high-dimensional points to low-dimensional points so that points that are similar in high-dimensional space are likely to be similar in low-dimensional space as well. t-SNE can sometimes identify structure that simpler techniques like PCA can't, but this power comes at a cost: it is much more expensive to compute than PCA and doesn't parallelize well. (t-SNE also works best for two-dimensional visualization when it is reducing from tens of dimensions rather than hundreds or thousands. So, in some cases, you'll want to use a fast linear technique like PCA to reduce your data to a few dozen dimensions before using t-SNE. That's what we're doing in the next cell with the TruncatedSVD class, a closely related technique that works directly on sparse matrices.)
✅ You can go back and re-run this entire notebook after changing HASH_BUCKETS to a different value.
import sklearn.manifold
tsne = sklearn.manifold.TSNE()
# use SVD to reduce the dimensionality before fitting t-SNE
svd = sklearn.decomposition.TruncatedSVD(16)
svd_a = svd.fit_transform(merchants)
tsne_a = tsne.fit_transform(svd_a)
merchants_df["x"] = tsne_a.T[0]
merchants_df["y"] = tsne_a.T[1]
alt.Chart(merchants_df).mark_point(opacity=0.2).encode(x="x:Q", y="y:Q", color="label")
There's still a lot of overlap between the classes here. Fortunately, we know from the exploratory analysis notebook that our numeric features contain a lot of information to help us distinguish between classes. We'll see how to exploit that with models in the next notebook, but first, we need to preprocess these features.
For the numeric features, our preprocessing is a little easier. We need to impute missing values for interarrival times (the interarrival time is undefined for the first transaction for each user, since there was no previous transaction) and we need to put all numeric features on a comparable scale. RobustScaler does this by subtracting the median and dividing by the interquartile range, so a few extreme values don't determine the scale. We'll do this using the Pipeline facility from scikit-learn.
from sklearn.preprocessing import RobustScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
impute_and_scale = Pipeline([('median_imputer', SimpleImputer(strategy="median")), ('interarrival_scaler', RobustScaler())])
ia_scaler = ('interarrival_scaler', impute_and_scale, ['interarrival'])
amount_scaler = ('amount_scaler', RobustScaler(), ['amount'])
scale_steps = [ia_scaler, amount_scaler]
all_xforms = ColumnTransformer(transformers=(scale_steps + xform_steps))
feat_pipeline = Pipeline([
('feature_extraction',all_xforms)
])
feat_pipeline.fit(train)
from mlworkflows import util
util.serialize_to(feat_pipeline, "feature_pipeline.sav")
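To see what the imputation and scaling steps do to individual values, here is a small sketch that runs the same two transformers on made-up interarrival times. The numbers are invented for illustration and the step names are arbitrary.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# Made-up interarrival times. NaN stands for a user's first transaction,
# and 1000 is a deliberate outlier.
times = np.array([[np.nan], [10.0], [20.0], [30.0], [1000.0]])

sketch = Pipeline([
    ("median_imputer", SimpleImputer(strategy="median")),
    ("scaler", RobustScaler()),
])
scaled = sketch.fit_transform(times)

# The NaN is replaced by the median of the observed values (25), then every
# value has the median (25) subtracted and is divided by the
# interquartile range (30 - 20 = 10).
print(scaled.ravel())  # [ 0.  -1.5 -0.5  0.5 97.5]
```

The outlier is still far from the other values after scaling, but it has no influence on the median or the interquartile range, so the typical values keep a sensible spread.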
With your feature extraction pipeline saved, you can go on to the next notebook. You have two choices -- either use a model based on logistic regression or a model based on tree ensembles.