We aren't focusing on defining metrics or data cleaning in this workshop, so our first step will be exploratory analysis on a prepared dataset. This will involve looking at the data and then examining each variable in isolation to see if we can identify any interesting patterns.
import pandas as pd
import numpy as np
We'll start by reading in our fraud dataset and looking at the column names:
df = pd.read_parquet("fraud-cleaned-sample.parquet")
df.columns
When dealing with an unfamiliar dataset, we first need to understand what data we've collected. If we're lucky, our organization will publish a data catalog or data glossary that explains what each dataset contains. If we don't have this extra metadata, we'll need to explore the dataset manually and document our observations. Sometimes the column names will be self-explanatory, but if they aren't, we'll have to dig deeper to make sense of them. A good second step is examining the types of each column.
df.dtypes
Sometimes, string-valued attributes in structured data are free-form text, but others are categorical variables. We can usually identify the categorical ones by looking at unique values, since categorical attributes have a small number of possible values.
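One quick way to surface candidate categorical columns is to count the distinct values in every string-typed column at once. This is a minimal sketch; it assumes our string columns are stored with the object (or category) dtype, which is worth confirming against the df.dtypes output above.

# count distinct values per string-typed column; columns with only a few
# distinct values are likely categorical
df.select_dtypes(include=["object", "category"]).nunique()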
list(df["label"].unique())
list(df["trans_type"].unique())
If we're dealing with messy data, categorical variables may have a large number of unique values, some of which are spurious and may occur only once. In that case, we don't want to examine every unique value, but we can look at the top few to get a sense of what the range of values should be. In this next cell, we're looking at the three most common transaction types.
df[["timestamp", "trans_type"]].groupby(["trans_type"]).count().nlargest(3, "timestamp")
We can look at the most common transaction types for each class of transaction, as well.
df[df["label"] == "fraud"][["timestamp", "trans_type"]].groupby(
["trans_type"]
).count(
).nlargest(3, "timestamp")
df[df["label"] == "legitimate"][["timestamp", "trans_type"]].groupby(
["trans_type"]
).count(
).nlargest(3, "timestamp")
Let's look at transaction type first and see what its distribution looks like for legitimate and fraudulent transactions.
pt = pd.pivot_table(df[["label", "trans_type", "timestamp"]],
index=["label", "trans_type"], aggfunc=len)
pt.columns = ['count']
gdf = pd.DataFrame(pt.to_records())
gdf['total'] = gdf.groupby('label')['count'].transform(np.sum)
gdf['percentage'] = gdf['count'] / gdf['total']
gdf
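As an aside, if we only needed the percentages (and not the intermediate counts and totals), pandas can compute the same proportions more compactly; this sketch should agree with the percentage column above:

# fraction of each transaction type within each label
df.groupby("label")["trans_type"].value_counts(normalize=True)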
import altair as alt
alt.Chart(gdf).mark_bar().encode(
    alt.Y('percentage:Q', axis=alt.Axis(format='.0%')),
    column='trans_type',
    x='label',
    color='label'
)
We can repeat this process for the percentage of foreign versus domestic transactions under each label.
pt = pd.pivot_table(df[["label", "foreign", "timestamp"]],
index=["label", "foreign"], aggfunc=len)
pt.columns = ['count']
gdf = pd.DataFrame(pt.to_records())
gdf['total'] = gdf.groupby('label')['count'].transform(np.sum)
gdf['pctage'] = gdf['count'] / gdf['total']
gdf
alt.Chart(gdf).mark_bar().encode(
    alt.Y('percentage:Q', axis=alt.Axis(format='.0%')),
    column='foreign',
    x='label',
    color='label'
)
When we're dealing with continuous variables, rather than categorical ones, we'll need to do something other than simply counting values to make sense of their distributions (we aren't interested, for example, in knowing that there were exactly 24 transactions for $17.35). There are a few options for how to proceed here: we could discretize the amounts into buckets and count the transactions in each bucket (sketched below), or we could compute quantiles of the distribution.
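A minimal sketch of the bucketing approach might look like the following; the choice of 20 equal-width bins here is arbitrary, not something dictated by the dataset:

# discretize amounts into 20 equal-width bins and count transactions per bin
pd.cut(df["amount"], bins=20).value_counts().sort_index()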
Let's take a look at the latter approach.
qs = df[['label','amount']].groupby('label').quantile(q=[0.01,0.05,0.1,0.25,0.5,0.75,0.9,0.95,0.99])
qs
What does the above data mean? For each label, we have a set of rows corresponding to the first, fifth, tenth, ..., ninety-fifth, and ninety-ninth percentiles. The value in each row is the transaction amount at that percentile: for example, 25% of legitimate transaction amounts are less than or equal to $12.43.
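Because qs is indexed by (label, quantile), we can also read off a single value directly; for example, the row behind that last claim:

# the 25th percentile of legitimate transaction amounts
qs.loc[("legitimate", 0.25)]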
We can plot the cumulative distribution of each transaction type to compare these distributions.
qdf = pd.DataFrame(qs.to_records())
alt.Chart(qdf).mark_line(interpolate="monotone").encode(
    alt.Y("amount", axis=alt.Axis(title='transaction amounts (log scale)'), scale=alt.Scale(type='log')),
    # the unnamed quantile level of the index becomes the "level_1" column
    alt.X("level_1", axis=alt.Axis(title='cumulative distribution'), scale=alt.Scale(type='linear')),
    color="label"
)
As we can see, relatively more of the fraudulent transactions are for low dollar amounts.
Another interesting feature to look at is the distribution of interarrival times, or the time gaps between transactions. Our intuition is that legitimate transactions will be spread further apart (and will have some large gaps), while fraudulent transactions will cluster together.
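Our prepared dataset already includes an interarrival column. If we had to derive it ourselves, one plausible sketch, assuming "interarrival" means the gap between a user's consecutive transactions, would look like this (hypothetical; the actual definition used by the data generator may differ):

# hypothetical derivation of interarrival times: per-user gaps between
# consecutive transactions, ordered by timestamp
gaps = df.sort_values("timestamp").groupby("user_id")["timestamp"].diff()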
fraudsamp = df[df["label"] == "fraud"].copy()
legitsamp = df[df["label"] == "legitimate"].sample(len(fraudsamp)).copy()
fraudsamp['irank'] = fraudsamp['interarrival'].rank(pct=True, method="dense")
legitsamp['irank'] = legitsamp['interarrival'].rank(pct=True, method="dense")
qdf = pd.concat([fraudsamp.groupby(['label', 'interarrival', 'irank']).size(), legitsamp.groupby(['label', 'interarrival', 'irank']).size()])
qdf = pd.DataFrame(pd.DataFrame(qdf).to_records())
qdf = qdf[qdf['interarrival'] > 0]
alt.Chart(qdf.sample(5000)).mark_line().encode(
    alt.Y("interarrival", axis=alt.Axis(title='interarrival time'), scale=alt.Scale(type='log')),
    alt.X("irank", axis=alt.Axis(title='cumulative distribution'), scale=alt.Scale(type='linear')),
    color="label"
)
The data generator simulates users acting during typical sleep-wake cycles in different time zones. (To simplify things, it assumes that a user with user ID k is in time zone t if k mod 24 == t.) As we'll see, legitimate transactions are centered around waking hours, while fraudulent transactions are roughly equally likely to happen at any time of day.
df["hour"] = df["timestamp"] % 86400 // 3600
df["tz"] = df["user_id"] % 24
tzpivot = pd.pivot_table(df[["label", "hour", "tz", "timestamp"]],
                         index=["label", "hour", "tz"], aggfunc=len)
# to_records emits the index levels in order (label, hour, tz) before the count
tzdf = pd.DataFrame(tzpivot.to_records()).set_axis(["label", "hour", "tz", "count"], axis="columns")
alt.Chart(tzdf).mark_area().encode(
    x="hour",
    y=alt.Y("count", stack="normalize", axis=alt.Axis(title="percentage of total")),
    color="tz",
    column="label",
    tooltip=['label', 'tz', 'hour', 'count']
)
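As a quick sanity check on that claim, we can tabulate hour-of-day counts for each label directly; if the generator behaves as described, the fraud column should be roughly flat across all 24 hours, while the legitimate column should peak during waking hours:

# hour-of-day transaction counts, one column per label
df.groupby(["hour", "label"]).size().unstack("label")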