Fraud classification is a common data science problem with many solutions. It is similar in approach to many others (e.g., click prediction or spam detection) in that it is a rare-events binary classification problem. That is, there are two classes, fraud and not fraud, and one class is rare.
The notebooks in these examples were created on JupyterLab running on DC/OS. In this set of notebooks, we will walk through four modeling approaches: logistic regression, decision tree, random forest, and a neural network. General pros and cons will be given for each.
These examples are simple and can be run outside DC/OS. They lay the foundation, however, for future posts, which will demonstrate how DC/OS makes scaling out to multiple nodes and GPUs easy. Instructions for installing and running JupyterLab on DC/OS can be found here.
The data used was generated with a payment simulator based on this fraud simulation paper.
There are a total of ~1.3 million records and 11 columns in the dataset. Because only two transaction types are required to build the models (TRANSFER and CASH-OUT), fewer than 600k records are kept. Of those, only about 8.4k are fraud cases. Columns that have no value for the analysis are also dropped, leaving 7 columns (6 independent variables and 1 dependent variable). A description of the columns follows:
Variable | Description | Keep |
---|---|---|
step | Maps a unit of time in the real world. In this case, 1 step is 1 hour of time. | Drop |
type | CASH-IN, CASH-OUT, DEBIT, PAYMENT and TRANSFER | Keep (TRANSFER and CASH-OUT) |
amount | The amount of the transaction. | Keep |
nameOrig | The customer ID for the initiator of the transaction. | Drop |
oldbalanceOrg | The initial balance before the transaction. | Keep |
newbalanceOrg | The customer's balance after the transaction. | Keep |
nameDest | The customer ID for the recipient of the transaction. | Drop |
oldbalanceDest | The initial recipient balance before the transaction. | Keep |
newbalanceDest | The recipient's balance after the transaction. | Keep |
isFraud | Identifies a fraudulent transaction (1) or a non-fraudulent transaction (0). | Keep |
isFlaggedFraud | Set by a rule-based system that flags illegal attempts to transfer more than 200,000 in a single transaction. | Drop |
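The filtering and column-dropping described above can be sketched with pandas. The tiny inline DataFrame below is a stand-in for the real simulator output (the actual file would be read from disk); the column names follow the table above.

```python
import pandas as pd

# Synthetic stand-in for the simulator output; the real data is ~1.3M rows.
df = pd.DataFrame({
    "step":           [1, 1, 2, 3],
    "type":           ["TRANSFER", "PAYMENT", "CASH-OUT", "DEBIT"],
    "amount":         [181.0, 9839.64, 181.0, 5337.77],
    "nameOrig":       ["C1305486145", "C1231006815", "C840083671", "C712410124"],
    "oldbalanceOrg":  [181.0, 170136.0, 181.0, 41720.0],
    "newbalanceOrg":  [0.0, 160296.36, 0.0, 36382.23],
    "nameDest":       ["C553264065", "M1979787155", "C38997010", "C195600860"],
    "oldbalanceDest": [0.0, 0.0, 21182.0, 41898.0],
    "newbalanceDest": [0.0, 0.0, 0.0, 40348.79],
    "isFraud":        [1, 0, 1, 0],
    "isFlaggedFraud": [0, 0, 0, 0],
})

# Keep only the two transaction types used to build the models.
df = df[df["type"].isin(["TRANSFER", "CASH-OUT"])]

# Drop the columns marked "Drop" in the table above.
df = df.drop(columns=["step", "nameOrig", "nameDest", "isFlaggedFraud"])
```

After these two steps the frame has the 7 columns (6 predictors plus `isFraud`) used in the rest of the notebooks.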
Each model's performance is summarized with a confusion matrix, laid out as follows:

Confusion Matrix | Predicted Not Fraud | Predicted Is Fraud |
---|---|---|
Actual Not Fraud | True Negative | False Positive |
Actual Is Fraud | False Negative | True Positive |
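This layout matches scikit-learn's `confusion_matrix`, whose rows are actual classes and columns are predicted classes. A minimal sketch with invented labels (not results from the fraud models):

```python
from sklearn.metrics import confusion_matrix

# Toy labels for illustration only: 1 = fraud, 0 = not fraud.
y_true = [0, 0, 1, 1, 0, 1]
y_pred = [0, 1, 1, 0, 0, 1]

# ravel() flattens the 2x2 matrix in row-major order:
# true negatives, false positives, false negatives, true positives.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
```

For rare-events problems like fraud, the counts in the bottom row (false negatives vs. true positives) matter far more than overall accuracy.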
In statistics, a logistic regression model is a statistical model for data with a binary dependent variable. In a typical logistic model, the log-odds of the probability of an event are a linear combination of the independent (predictor) variables. The two possible dependent variable values are often labelled 1 and 0, representing outcomes such as fraud/no fraud, click/no click, or spam/no spam. Logistic regression can be generalized to more than two levels of the dependent variable, but that is not needed for this example.
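To make the log-odds formulation concrete, here is a minimal scikit-learn sketch on synthetic data. The features and coefficients are invented for illustration; they are stand-ins for the kept columns (amount, balances, encoded type), not the actual dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))

# Generate labels whose log-odds are a linear combination of the
# features, exactly the structural assumption logistic regression makes.
logits = 1.5 * X[:, 0] - 2.0 * X[:, 1]
y = (rng.random(500) < 1 / (1 + np.exp(-logits))).astype(int)

clf = LogisticRegression().fit(X, y)
probs = clf.predict_proba(X)[:, 1]  # estimated P(fraud) per record
```

Because the model outputs probabilities, the classification threshold can be tuned, which is especially useful when one class is rare.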
A decision tree is among the easiest models to understand conceptually and to interpret. There are many different algorithms, but perhaps the easiest to describe is the recursive partitioning approach. The dataset is split recursively; each split is determined by the independent variable that yields the largest possible reduction in heterogeneity of the dependent variable (there are different measures, e.g., Gini impurity or entropy). The splits stop when they reach a predetermined stop criterion.
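A short sketch of recursive partitioning with scikit-learn, on synthetic imbalanced data rather than the fraud dataset. The Gini criterion and `max_depth` stop criterion map directly onto the description above.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (~10% positive class) standing in for fraud.
X, y = make_classification(n_samples=400, n_features=4,
                           weights=[0.9, 0.1], random_state=0)

# Gini impurity picks each split; max_depth is the stop criterion.
tree = DecisionTreeClassifier(criterion="gini", max_depth=4,
                              random_state=0).fit(X, y)
```

The fitted tree can be printed or plotted (e.g., with `sklearn.tree.plot_tree`), which is the source of its interpretability.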
Random Forests are essentially ensembles of bootstrapped decision trees. For each tree, a random set of features is selected; given those features, a bootstrap sample of records (drawn with replacement) is generated and a decision tree is fit. This process is repeated N times (N = the number of trees). Each iteration generates a tree, and each tree gets a "vote" on which features are most important and the magnitude of their importance.
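The bootstrap-and-vote procedure above maps onto scikit-learn's parameters: `n_estimators` is N, and `max_features` controls the random feature subset considered at each split. A sketch on synthetic data (not the fraud dataset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=400, n_features=6, random_state=0)

# N = 100 bootstrapped trees; each split considers sqrt(6) random features.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                random_state=0).fit(X, y)

# The trees' "votes" aggregate into per-feature importances summing to 1.
importances = forest.feature_importances_
```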
This is a rough description with many relevant details omitted. In general, however, random forests have the following properties.
Artificial neural networks (ANNs) are computing systems vaguely inspired by the biological neural networks that constitute animal brains. They are a relatively new type of model, but have quickly become the dominant approach to certain types of problems (e.g., object detection and NLP). They do well on many types of problems where huge amounts of training data are available.
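As a minimal sketch, a small multilayer perceptron can be fit with scikit-learn on synthetic data. The architecture here (two hidden layers of 16 and 8 units) is invented for illustration; the notebooks' actual network and framework may differ.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=400, n_features=6, random_state=0)

# A tiny feed-forward network: 6 inputs -> 16 -> 8 -> binary output.
net = MLPClassifier(hidden_layer_sizes=(16, 8), max_iter=500,
                    random_state=0).fit(X, y)
```

Unlike the tree-based models, the learned weights are not directly interpretable, which is part of the trade-off discussed in the pros and cons.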