Fraud classification is a common data science problem with many solutions. It is similar in approach to many others (e.g., click prediction or spam detection) in that it is a rare-events binary classification problem. That is, there are two classes, fraud and not fraud, and one class is rare.
The notebooks in these examples were created on JupyterLab running on DC/OS. In this set of notebooks, we will walk through four modeling approaches: logistic regression, decision tree, random forest, and a neural network. General pros and cons will be given for each.
These examples are simple and can be run outside DC/OS. They lay the foundation, however, for future posts, which will demonstrate how DC/OS makes scaling out to multiple nodes and GPUs easy. Instructions for installing and running JupyterLab on DC/OS can be found here.
The data used was generated with a payment simulator based on this fraud simulation paper.
There are a total of ~1.3 million records and 11 columns in the dataset. Because only two transaction types are required to build the models (TRANSFER and CASH-OUT), fewer than 600k records are kept. Of those, only about 8.4k are fraud cases. Columns that have no value for the analysis are also dropped, leaving 7 columns (6 independent variables and 1 dependent variable). A description of the columns follows:
Variable | Description | Keep |
---|---|---|
step | Maps a unit of time in the real world. In this case, 1 step is 1 hour of time. | Drop |
type | CASH-IN, CASH-OUT, DEBIT, PAYMENT and TRANSFER | Keep (TRANSFER and CASH-OUT) |
amount | The amount of the transaction. | Keep |
nameOrig | The customer ID for the initiator of the transaction. | Drop |
oldbalanceOrg | The initial balance before the transaction. | Keep |
newbalanceOrg | The customer's balance after the transaction. | Keep |
nameDest | The customer ID for the recipient of the transaction. | Drop |
oldbalanceDest | The initial recipient balance before the transaction. | Keep |
newbalanceDest | The recipient's balance after the transaction. | Keep |
isFraud | Identifies a fraudulent transaction (1) or a non-fraudulent transaction (0). | Keep |
isFlaggedFraud | Set by a rule-based system that flags illegal attempts to transfer more than 200,000 in a single transaction. | Drop |
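The filtering and column-dropping described above can be sketched with pandas. The tiny inline DataFrame below is a stand-in for the real simulator output (the actual file would be read from disk); the column names follow the table above.

```python
import pandas as pd

# Synthetic stand-in for the simulator output; the real data is ~1.3M rows.
df = pd.DataFrame({
    "step":           [1, 1, 2, 3],
    "type":           ["TRANSFER", "PAYMENT", "CASH-OUT", "DEBIT"],
    "amount":         [181.0, 9839.64, 181.0, 5337.77],
    "nameOrig":       ["C1305486145", "C1231006815", "C840083671", "C712410124"],
    "oldbalanceOrg":  [181.0, 170136.0, 181.0, 41720.0],
    "newbalanceOrg":  [0.0, 160296.36, 0.0, 36382.23],
    "nameDest":       ["C553264065", "M1979787155", "C38997010", "C195600860"],
    "oldbalanceDest": [0.0, 0.0, 21182.0, 41898.0],
    "newbalanceDest": [0.0, 0.0, 0.0, 40348.79],
    "isFraud":        [1, 0, 1, 0],
    "isFlaggedFraud": [0, 0, 0, 0],
})

# Keep only the two transaction types used to build the models.
df = df[df["type"].isin(["TRANSFER", "CASH-OUT"])]

# Drop the columns marked "Drop" in the table above.
df = df.drop(columns=["step", "nameOrig", "nameDest", "isFlaggedFraud"])
```

After these two steps the frame has the 7 columns (6 predictors plus `isFraud`) used in the rest of the notebooks.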
Each model's performance is summarized with a confusion matrix, laid out as follows:

Confusion Matrix | Predicted Not Fraud | Predicted Is Fraud |
---|---|---|
Actual Not Fraud | True Negative | False Positive |
Actual Is Fraud | False Negative | True Positive |
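This layout matches scikit-learn's `confusion_matrix`, whose rows are actual classes and columns are predicted classes. A minimal sketch with invented labels (not results from the fraud models):

```python
from sklearn.metrics import confusion_matrix

# Toy labels for illustration only: 1 = fraud, 0 = not fraud.
y_true = [0, 0, 1, 1, 0, 1]
y_pred = [0, 1, 1, 0, 0, 1]

# ravel() flattens the 2x2 matrix in row-major order:
# true negatives, false positives, false negatives, true positives.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
```

For rare-events problems like fraud, the counts in the bottom row (false negatives vs. true positives) matter far more than overall accuracy.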
In statistics, a logistic regression model is a statistical model for data with a binary dependent variable. In a typical logistic model, the log-odds of the probability of an event are a linear combination of the independent (predictor) variables. The two possible dependent variable values are often labelled 1 and 0, representing outcomes such as fraud/no fraud, click/no click, or spam/no spam. Logistic regression can be generalized to more than two levels of the dependent variable, but that is not needed for this example.
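To make the log-odds formulation concrete, here is a minimal scikit-learn sketch on synthetic data. The features and coefficients are invented for illustration; they are stand-ins for the kept columns (amount, balances, encoded type), not the actual dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))

# Generate labels whose log-odds are a linear combination of the
# features, exactly the structural assumption logistic regression makes.
logits = 1.5 * X[:, 0] - 2.0 * X[:, 1]
y = (rng.random(500) < 1 / (1 + np.exp(-logits))).astype(int)

clf = LogisticRegression().fit(X, y)
probs = clf.predict_proba(X)[:, 1]  # estimated P(fraud) per record
```

Because the model outputs probabilities, the classification threshold can be tuned, which is especially useful when one class is rare.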
A decision tree is among the easiest models to understand conceptually and to interpret. There are many different algorithms, but perhaps the easiest to describe is the recursive partitioning approach. The dataset is split recursively; each split is determined by the independent variable that yields the largest possible reduction in heterogeneity of the dependent variable (there are different measures, e.g., Gini impurity or entropy). The splits stop when they reach a predetermined stop criterion.
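A short sketch of recursive partitioning with scikit-learn, on synthetic imbalanced data rather than the fraud dataset. The Gini criterion and `max_depth` stop criterion map directly onto the description above.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (~10% positive class) standing in for fraud.
X, y = make_classification(n_samples=400, n_features=4,
                           weights=[0.9, 0.1], random_state=0)

# Gini impurity picks each split; max_depth is the stop criterion.
tree = DecisionTreeClassifier(criterion="gini", max_depth=4,
                              random_state=0).fit(X, y)
```

The fitted tree can be printed or plotted (e.g., with `sklearn.tree.plot_tree`), which is the source of its interpretability.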
Random Forests are essentially ensembles of bootstrapped decision trees. For each tree, a random set of features is selected; given those features, a bootstrap sample of records (drawn with replacement) is generated and a decision tree is fit. This process is repeated N times (N = the number of trees). Each iteration generates a tree, and each tree gets a "vote" on which features are most important and the magnitude of their importance.
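The bootstrap-and-vote procedure above maps onto scikit-learn's parameters: `n_estimators` is N, and `max_features` controls the random feature subset considered at each split. A sketch on synthetic data (not the fraud dataset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=400, n_features=6, random_state=0)

# N = 100 bootstrapped trees; each split considers sqrt(6) random features.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                random_state=0).fit(X, y)

# The trees' "votes" aggregate into per-feature importances summing to 1.
importances = forest.feature_importances_
```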
This is a rough description with many relevant details omitted. In general, however, random forests have the following properties.
Artificial neural networks (ANNs) are computing systems vaguely inspired by the biological neural networks that constitute animal brains. They are a relatively new type of model, but have quickly become the dominant approach to certain types of problems (e.g., object detection and NLP). They do well on many types of problems where huge amounts of training data are available.
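As a minimal sketch, a small multilayer perceptron can be fit with scikit-learn on synthetic data. The architecture here (two hidden layers of 16 and 8 units) is invented for illustration; the notebooks' actual network and framework may differ.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=400, n_features=6, random_state=0)

# A tiny feed-forward network: 6 inputs -> 16 -> 8 -> binary output.
net = MLPClassifier(hidden_layer_sizes=(16, 8), max_iter=500,
                    random_state=0).fit(X, y)
```

Unlike the tree-based models, the learned weights are not directly interpretable, which is part of the trade-off discussed in the pros and cons.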