al.bahnsen@gmail.com
http://github.com/albahnsen
http://linkedin.com/in/albahnsen
@albahnsen
Just fund a bank | Just quit college |
Biggest Ponzi scheme | Now a Billionaire |
and accurate decisions
financial obligation if a loan is granted, based on past experiences
literature: logistic regression, neural networks, discriminant analysis, genetic programming, decision trees, random forests, among others
Formally, a credit score is a statistical model that estimates the probability of a customer $i$ defaulting on a contracted debt ($y_i=1$)
$$\hat p_i=P(y_i=1|\mathbf{x}_i)$$
Improve on the state of the art in credit scoring by predicting the probability that somebody will experience financial distress in the next two years.
import numpy as np
import pandas as pd
from costcla.datasets import load_creditscoring1
data = load_creditscoring1()
print(list(data.keys()))
print('Number of examples %d' % data.target.shape[0])
['target_names', 'cost_mat', 'name', 'DESCR', 'feature_names', 'data', 'target']
Number of examples 112915
# Count good (0) and bad (1) customers and their share of the dataset
target = pd.Series(data.target).value_counts().sort_index().to_frame('Frequency')
target['Percentage'] = target['Frequency'] / target['Frequency'].sum()
target.index = ['Negative (Good Customers)', 'Positive (Bad Customers)']
print(target)
                           Frequency  Percentage
Negative (Good Customers)     105299    0.932551
Positive (Bad Customers)        7616    0.067449
pd.DataFrame(data.feature_names, columns=('Features',))
| | Features |
|---|---|
0 | RevolvingUtilizationOfUnsecuredLines |
1 | age |
2 | NumberOfTime30-59DaysPastDueNotWorse |
3 | DebtRatio |
4 | MonthlyIncome |
5 | NumberOfOpenCreditLinesAndLoans |
6 | NumberOfTimes90DaysLate |
7 | NumberRealEstateLoansOrLines |
8 | NumberOfTime60-89DaysPastDueNotWorse |
9 | NumberOfDependents |
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test, cost_mat_train, cost_mat_test = \
    train_test_split(data.data, data.target, data.cost_mat)
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
classifiers = {"RF": {"f": RandomForestClassifier()},
               "DT": {"f": DecisionTreeClassifier()},
               "LR": {"f": LogisticRegression()}}
# Fit the classifiers using the training dataset
for model in classifiers.keys():
    classifiers[model]["f"].fit(X_train, y_train)
    # "c": predicted classes, "p": predicted probabilities on the test set
    classifiers[model]["c"] = classifiers[model]["f"].predict(X_test)
    classifiers[model]["p"] = classifiers[model]["f"].predict_proba(X_test)
    classifiers[model]["p_train"] = classifiers[model]["f"].predict_proba(X_train)
from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score
measures = {"F1Score": f1_score, "Precision": precision_score,
"Recall": recall_score, "Accuracy": accuracy_score}
# One row per model, one column per performance measure
results = pd.DataFrame(columns=list(measures.keys()))
for model in classifiers.keys():
    results.loc[model] = [measures[measure](y_test, classifiers[model]["c"]) for measure in measures.keys()]
fig1()
| | Actual Positive ($y_i=1$) | Actual Negative ($y_i=0$) |
|--- |:-: |:-: |
| Pred. Positive ($c_i=1$) | $C_{TP_i}=0$ | $C_{FP_i}=r_i+C^a_{FP}$ |
| Pred. Negative ($c_i=0$) | $C_{FN_i}=Cl_i \cdot L_{gd}$ | $C_{TN_i}=0$ |
Where:
a loan to an alternative customer.
For more info see [Correa Bahnsen et al., 2014]
Assuming the database belongs to an average European financial institution, we find the different parameters needed to calculate the cost measure:
| Parameter | Value |
|--- |:-: |
| Interest rate ($int_r$) | 4.79% |
| Cost of funds ($int_{cf}$) | 2.94% |
| Term ($l$) in months | 24 |
| Loss given default ($L_{gd}$) | 75% |
| Times income ($q$) | 3 |
| Maximum credit line ($Cl_{max}$) | 25,000 |
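As a rough sketch of how the false-negative cost follows from these parameters: $C_{FN_i}=Cl_i \cdot L_{gd}$ comes from the cost matrix above, while the rule `min(q * income, Cl_max)` for the credit line is an assumption made here for illustration, since the slides do not spell out how $Cl_i$ is derived. The function names are hypothetical.

```python
# Parameters taken from the table above.
LGD = 0.75          # loss given default (L_gd)
Q = 3               # times income (q)
CL_MAX = 25000.0    # maximum credit line (Cl_max)


def credit_line(monthly_income):
    """Assumed rule (illustrative only): q times the income, capped at Cl_max."""
    return min(Q * monthly_income, CL_MAX)


def false_negative_cost(cl):
    """C_FN_i = Cl_i * L_gd, as stated in the cost matrix."""
    return cl * LGD


for income in (3000.0, 10000.0):
    cl = credit_line(income)
    print(income, cl, false_negative_cost(cl))
```

Under this assumption, any customer whose capped credit line is 25,000 has a false-negative cost of 18,750, which is the largest $C_{FN}$ value in the sample rows of `data.cost_mat`.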
# The cost matrix is already calculated for the dataset
# cost_mat[C_FP,C_FN,C_TP,C_TN]
print(data.cost_mat[[10, 17, 50]])
[[  1023.73054104  18750.              0.              0.        ]
 [   717.25781516   6749.25            0.              0.        ]
 [   866.65393177  12599.25            0.              0.        ]]
The financial cost of using a classifier $f$ on $\mathcal{S}$ is calculated by
$$ Cost(f(\mathcal{S})) = \sum_{i=1}^N y_i(1-c_i)C_{FN_i} + (1-y_i)c_i C_{FP_i}.$$
Then the financial savings are defined as the cost of the algorithm versus the cost of using no algorithm at all.
$$ Savings(f(\mathcal{S})) = \frac{ Cost_l(\mathcal{S}) - Cost(f(\mathcal{S}))} {Cost_l(\mathcal{S})},$$
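where $Cost_l(\mathcal{S})$ is the cost of classifying every example as the costless class, that is, the single class whose blanket prediction gives the lower total cost.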
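The two formulas above can be sketched in a few lines of NumPy. This is an illustrative re-implementation, not the CostCla source: the function names are hypothetical, the column order `[C_FP, C_FN, C_TP, C_TN]` follows the comment in the cost-matrix cell, and $Cost_l$ is taken as the cheaper of "predict all negative" and "predict all positive", which is one reading of "costless class".

```python
import numpy as np


def total_cost(y_true, y_pred, cost_mat):
    """Cost(f(S)): sum of false-negative and false-positive costs."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    cost_mat = np.asarray(cost_mat, dtype=float)
    c_fp, c_fn = cost_mat[:, 0], cost_mat[:, 1]
    return float(np.sum(y_true * (1 - y_pred) * c_fn
                        + (1 - y_true) * y_pred * c_fp))


def savings(y_true, y_pred, cost_mat):
    """Relative savings against the cheaper trivial classifier.

    Undefined (division by zero) if that trivial classifier costs nothing.
    """
    n = len(y_true)
    cost_all_neg = total_cost(y_true, np.zeros(n), cost_mat)
    cost_all_pos = total_cost(y_true, np.ones(n), cost_mat)
    cost_l = min(cost_all_neg, cost_all_pos)
    return (cost_l - total_cost(y_true, y_pred, cost_mat)) / cost_l


# Toy data: three good customers, one bad; constant costs for readability.
y = [0, 0, 0, 1]
c = [0, 1, 0, 1]
cm = [[100, 150, 0, 0]] * 4
print(total_cost(y, c, cm), savings(y, c, cm))
```

In this toy case the classifier pays for one false positive (100), while predicting everyone as negative would cost 150, so the savings are one third.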
# Calculation of the cost and savings
from costcla.metrics import savings_score
# Evaluate the savings for each model
results["Savings"] = np.zeros(results.shape[0])
for model in classifiers.keys():
    # Assign with .loc[row, column] to avoid chained-assignment problems
    results.loc[model, "Savings"] = savings_score(y_test, classifiers[model]["c"], cost_mat_test)
fig2()
Cost-sensitive classification usually refers to class-dependent costs, where the cost depends on the class but is assumed constant across examples.
In credit scoring, different customers have different credit lines, which implies that the costs are not constant across examples.
The BMR classifier is a decision model based on quantifying tradeoffs between various decisions using probabilities and the costs that accompany such decisions.
In particular:
$$ R(c_i=0|\mathbf{x}_i)=C_{TN_i}(1-\hat p_i)+C_{FN_i} \cdot \hat p_i, $$
and
$$ R(c_i=1|\mathbf{x}_i)=C_{TP_i} \cdot \hat p_i + C_{FP_i}(1- \hat p_i), $$
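A minimal NumPy sketch of the resulting decision rule: compute both risks per example and predict the class with the smaller one. The function name is hypothetical, sending ties to the negative class is an arbitrary choice made here, and the probability-calibration step of the CostCla classifier is deliberately left out.

```python
import numpy as np


def bmr_predict(p_hat, cost_mat):
    """Pick, per example, the class with the lower expected cost (risk).

    cost_mat columns are assumed to be ordered [C_FP, C_FN, C_TP, C_TN].
    """
    p_hat = np.asarray(p_hat, dtype=float)
    cost_mat = np.asarray(cost_mat, dtype=float)
    c_fp, c_fn, c_tp, c_tn = cost_mat.T
    risk_neg = c_tn * (1 - p_hat) + c_fn * p_hat   # R(c_i=0 | x_i)
    risk_pos = c_tp * p_hat + c_fp * (1 - p_hat)   # R(c_i=1 | x_i)
    return (risk_pos < risk_neg).astype(int)


# Two customers with identical costs but different default probabilities.
cost_mat = np.array([[1000.0, 18750.0, 0.0, 0.0],
                     [1000.0, 18750.0, 0.0, 0.0]])
preds = bmr_predict([0.01, 0.10], cost_mat)
print(preds)
```

With these costs a default probability of only 10% is already enough to reject the loan, because a missed default is far more expensive than a wrongly rejected good customer.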
costcla.models.BayesMinimumRiskClassifier(calibration=True)
fit(y_true_cal=None, y_prob_cal=None)
predict(y_prob,cost_mat)
Parameters
Returns
from costcla.models import BayesMinimumRiskClassifier
# Copy the keys first: new "-BMR" entries are added to the dict inside the loop
ci_models = list(classifiers.keys())
for model in ci_models:
    classifiers[model+"-BMR"] = {"f": BayesMinimumRiskClassifier()}
    # Fit
    classifiers[model+"-BMR"]["f"].fit(y_test, classifiers[model]["p"])
    # Calibration must be made in a validation set
    # Predict
    classifiers[model+"-BMR"]["c"] = classifiers[model+"-BMR"]["f"].predict(classifiers[model]["p"], cost_mat_test)
fig2()
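The comment in the BMR code notes that calibration should be done on a validation set, whereas the snippet fits on `y_test` for brevity. One possible way to hold out a calibration subset is sketched below with NumPy only; `holdout_indices` and the 50/50 split are hypothetical choices, not part of CostCla.

```python
import numpy as np


def holdout_indices(n_samples, cal_fraction=0.5, seed=0):
    """Shuffle row indices and split them into calibration / evaluation parts."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(n_samples)
    n_cal = int(round(n_samples * cal_fraction))
    return order[:n_cal], order[n_cal:]


cal_idx, eval_idx = holdout_indices(10, cal_fraction=0.5)
print(sorted(cal_idx), sorted(eval_idx))
# With the real arrays one would then fit on the calibration rows only and
# predict on the remaining rows, for example:
#   bmr.fit(y_test[cal_idx], p[cal_idx])
#   bmr.predict(p[eval_idx], cost_mat_test[eval_idx])
```

Savings would then be reported on the evaluation rows only, so the calibration data never influences the reported score.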
A new cost-based impurity measure taking into account the costs when all the examples in a leaf
costcla.models.CostSensitiveDecisionTreeClassifier(criterion='direct_cost', criterion_weight=False, pruned=True)
Ensemble of CSDT
costcla.models.CostSensitiveRandomPatchesClassifier(n_estimators=10, max_samples=0.5, max_features=0.5, combination='majority_voting')
from costcla.models import CostSensitiveDecisionTreeClassifier
from costcla.models import CostSensitiveRandomPatchesClassifier
classifiers = {"CSDT": {"f": CostSensitiveDecisionTreeClassifier()},
               "CSRP": {"f": CostSensitiveRandomPatchesClassifier()}}
# Fit the classifiers using the training dataset
for model in classifiers.keys():
    classifiers[model]["f"].fit(X_train, y_train, cost_mat_train)
    classifiers[model]["c"] = classifiers[model]["f"].predict(X_test)
fig2()
CostCla is an open-source Python library for cost-sensitive classification, built on top of Scikit-learn, Pandas and Numpy.
Source code, binaries and documentation are distributed under the 3-Clause BSD license on the website http://albahnsen.com/CostSensitiveClassification/
Cost-proportionate over-sampling [Elkan, 2001]
SMOTE [Chawla et al., 2002]
Cost-proportionate rejection-sampling [Zadrozny et al., 2003]
Thresholding optimization [Sheng and Ling, 2006]
Bayes minimum risk [Correa Bahnsen et al., 2014a]
Cost-sensitive logistic regression [Correa Bahnsen et al., 2014b]
Cost-sensitive decision trees [Correa Bahnsen et al., 2015a]
Cost-sensitive ensemble methods: cost-sensitive bagging, cost-sensitive pasting, cost-sensitive random forest and cost-sensitive random patches [Correa Bahnsen et al., 2015c]
Credit Scoring 1 - Kaggle credit competition [Data], cost matrix: [Correa Bahnsen et al., 2014]
Credit Scoring 2 - PAKDD2009 Credit [Data], cost matrix: [Correa Bahnsen et al., 2014a]
Direct Marketing - PAKDD2009 Credit [Data], cost matrix: [Correa Bahnsen et al., 2014b]
Churn Modeling, June 2015
You can find the presentation and the IPython Notebook here:
albahnsen/CostSensitiveClassification/blob/master/doc/tutorials/slides_edcs_credit_scoring.ipynb#/
These slides are a short version of this tutorial: