Classification Example: Prediction of credit-worthiness

Based on experience with former customers, a bank wants to predict the credit-worthiness of potential new customers. For a set of former customers the bank recorded the input features

  • equity
  • annual income

These features are requested from the customer before a contract is signed. For the former customers the bank does not only know these input features but also the target value, which is

  • an indicator whether the customer actually was credit-worthy (value=1) or not (value=0).

The file containing the data of customers can be downloaded from here. The first column is the equity, the second column the annual income (both in Euro) and the third column indicates whether the customer was credit-worthy or not.

Import of required modules:

In [28]:
%matplotlib inline
import numpy as np
np.set_printoptions(2,suppress=True)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation was removed in scikit-learn 0.20
#from sklearn.metrics import zero_one_score
from matplotlib import pyplot as plt

Iteration 1: Learning a model from 30 previous customers

Read training data

The code snippet below reads the available customer data file. The file contains data of 200 customers. However, in this first iteration we assume that only the first 30 data rows (customers) are available and can be used for training.

In [16]:
dataArray=np.fromfile("./Res/creditCustomers2",sep=' ').reshape(-1,3)
numTrain=30
print "  Equity   |  Income | Credit Worthiness "
print dataArray[:numTrain,:]
#print data
trainfeatures=dataArray[:numTrain,0:2]  # first two columns are the input features
traintargets=dataArray[:numTrain,2]     # third column is the target parameter
  Equity   |  Income | Credit Worthiness 
[[ 18478.67  16217.33      0.  ]
 [ 18039.16  14389.45      0.  ]
 [ 20149.59  42639.28      1.  ]
 [ 32957.58  45059.4       1.  ]
 [ 33343.71  49628.87      1.  ]
 [ 35892.56  46360.38      1.  ]
 [ 21549.2   32119.1       1.  ]
 [ 34757.58  31228.13      1.  ]
 [ 23287.12  15444.99      0.  ]
 [ 22843.81  -1821.99      0.  ]
 [ 31918.57  48790.59      1.  ]
 [ 19162.31  20112.        0.  ]
 [ 13938.62  41884.71      1.  ]
 [ 21282.28  16369.33      0.  ]
 [ 34403.85  31502.71      1.  ]
 [ 15190.21  22829.53      0.  ]
 [ 24516.23  36650.94      1.  ]
 [ 42593.38  42210.23      1.  ]
 [ 10979.04  14509.59      0.  ]
 [ 26907.    57213.55      1.  ]
 [ 23464.43  29465.19      0.  ]
 [  3195.09  53783.39      1.  ]
 [ 34060.45  56937.59      1.  ]
 [ 25361.17   9524.05      0.  ]
 [ 24134.49  40642.9       1.  ]
 [ 20096.12   6266.42      0.  ]
 [  6995.53  17957.72      0.  ]
 [ 47583.8   39063.97      1.  ]
 [ 36496.6   64720.57      1.  ]
 [ 30263.38  34094.73      1.  ]]

Visualize training data

In order to select a suitable machine learning algorithm and suitable parameters of the selected algorithm, one should first try to understand the available data. This can be done by visualizing the data. In this example there are only two features (equity and annual income), so the data can easily be plotted in a 2-dimensional plane.

In the code snippet below the 30 labeled customers used for training are plotted. Each customer is represented by a point. The 2 features of each customer define the coordinates of the point and the color of the point indicates credit-worthiness: green points for credit-worthy and magenta points for not-credit-worthy customers.

In [3]:
plt.figure(num=None, figsize=(12, 10), dpi=80, facecolor='w', edgecolor='k')
plt.plot(trainfeatures[traintargets[:]==0.0,0],trainfeatures[traintargets[:]==0.0,1],'sm',label='not credit-worthy')
plt.plot(trainfeatures[traintargets[:]==1.0,0],trainfeatures[traintargets[:]==1.0,1],'sg',label='credit-worthy')
plt.xlabel('Equity [Euro]')
plt.ylabel('Annual Income [Euro]')
plt.title('Credit-Worthiness of Customers')
plt.legend(loc=2,numpoints=1)
plt.show()

Calculate some descriptive statistics of training data

In [30]:
class0features=trainfeatures[traintargets[:]==0.0]
class1features=trainfeatures[traintargets[:]==1.0]
print "Mean of training data in class 0: ",class0features.mean(axis=0)
print "Mean of training data in class 1: ",class1features.mean(axis=0)
print "Covariance of training data in class 0:\n",np.cov(class0features,rowvar=False)
print "Covariance of training data in class 1:\n",np.cov(class1features,rowvar=False)
Mean of training data in class 0:  [ 18764.99  15105.3 ]
Mean of training data in class 1:  [ 29370.09  44140.61]
Covariance of training data in class 0:
[[ 29406940.45  -9070704.53]
 [ -9070704.53  63518172.37]]
Covariance of training data in class 1:
[[  1.08e+08  -5.25e+06]
 [ -5.25e+06   9.33e+07]]

Select and apply a suitable learning algorithm

Inspecting the plot above suggests that the 2 classes of customers can be separated by a linear discriminant. In this case we can select a linear classification algorithm, e.g. the LogisticRegression class from scikit-learn. In the following code snippet an object of the class LogisticRegression is instantiated with suitable parameters. Then the fit() method of this object is invoked. All supervised learning algorithms of scikit-learn are trained by calling the fit() method with the available labeled training data as parameters.

In [4]:
logReg = LogisticRegression(C=10000,fit_intercept=True, intercept_scaling=100)
logReg.fit(trainfeatures, traintargets)
Out[4]:
LogisticRegression(C=10000, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=100, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
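
To make explicit what a fitted logistic regression model represents: it learns one weight per feature plus a bias, and it predicts class 1 for all inputs on the positive side of the line defined by weights and bias. The following is a minimal sketch of this relation. It uses synthetic data and default parameters, which are assumptions for illustration and not the customer data or the settings used above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic, illustrative 2-feature data (NOT the customer file of this notebook)
rng = np.random.RandomState(0)
X = rng.randn(100, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

clf = LogisticRegression()
clf.fit(X, y)

w = clf.coef_.ravel()   # learned weights, one per feature
b = clf.intercept_[0]   # learned bias

# Class 1 is predicted on the positive side of the line w.x + b = 0
manual_pred = (X.dot(w) + b > 0).astype(int)
agree = np.array_equal(manual_pred, clf.predict(X))
print("Manual rule agrees with predict(): %s" % agree)
```

The attributes coef_ and intercept_ are available on the fitted object, so the same check could be applied to logReg and the training features of this notebook.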

Apply the learned model and calculate prediction error on training data

Once a scikit-learn learning algorithm has been trained by invoking its fit() method, it can be applied for prediction by calling its predict() method with the input features as parameters. In the code snippet below the training features, which have already been used for training, are passed to the predict() method. Note that the fit() method requires training features and training targets as parameters, whereas the predict() method only requires the input features.

The predictions of the learned classifier are stored in the variable predTrain. By comparing these predictions with the real targets one can calculate the error on the training data. For classification a typical performance measure is the accuracy, which is the ratio of correctly classified items to all items; the error rate is 1 minus the accuracy. In this first iteration the model predicts the class of all training data correctly: the number of errors on the training data is 0 and the corresponding accuracy is 1.0.

In [5]:
predTrain=logReg.predict(trainfeatures)
numErrorsTrain=np.sum(np.abs((predTrain - traintargets)))
print "Number of misclassified training datasamples: ",numErrorsTrain
trainAccuracy=1-numErrorsTrain/float(len(traintargets))
print "Accuracy on training datasamples: ",trainAccuracy
Number of misclassified training datasamples:  0.0
Accuracy on training datasamples:  1.0
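
The error count and accuracy computed by hand above can also be obtained with the accuracy_score function from sklearn.metrics. The sketch below compares both variants on a small set of hypothetical labels and predictions; these values are made up for illustration and are not output of this notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels and predictions (illustrative values only)
y_true = np.array([0., 0., 1., 1., 1., 0., 1., 0., 1., 1.])
y_pred = np.array([0., 1., 1., 1., 0., 0., 1., 0., 1., 1.])

num_errors = np.sum(y_true != y_pred)         # number of misclassified samples
acc_manual = np.mean(y_true == y_pred)        # fraction classified correctly
acc_sklearn = accuracy_score(y_true, y_pred)  # same quantity via scikit-learn

print("Errors: %d, accuracy: %.2f" % (num_errors, acc_manual))
```

Counting mismatches with != also works for class labels other than 0 and 1, whereas summing absolute differences relies on the 0/1 encoding.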

Visualize the learned model

The learned model is a linear discriminant, which separates the two customer classes. In order to plot the decision boundary (the discriminant) of the classifier, a fine-grained regular grid of inputs is generated and for each of the inputs the model's prediction is calculated. Inputs which are assigned to class credit-worthy (class-index=1) are plotted blue, whereas inputs which are assigned to not credit-worthy are plotted red.

In [8]:
h=500
######### apply learned model and plot decision boundary and training data
x0ticks = np.arange(np.min(trainfeatures[:,0]),np.max(trainfeatures[:,0]),h)
x1ticks = np.arange(np.min(trainfeatures[:,1]),np.max(trainfeatures[:,1]),h)
testdata=np.zeros(((len(x0ticks)*len(x1ticks)),2))
count=0
for i0 in x0ticks:
    for i1 in x1ticks:
        testdata[count,:]=[i0,i1]
        count+=1
predicted=logReg.predict(testdata)
plt.figure(num=None, figsize=(12, 10), dpi=80, facecolor='w', edgecolor='k')
plt.xlim((min(x0ticks),max(x0ticks)))
plt.ylim((min(x1ticks),max(x1ticks)))
plt.plot(testdata[predicted==0,0],testdata[predicted==0,1],'sr',ms=10,alpha=0.1)
plt.plot(testdata[predicted==1,0],testdata[predicted==1,1],'sb',ms=10,alpha=0.1)
plt.plot(trainfeatures[traintargets==0,0],trainfeatures[traintargets==0,1],'sm',ms=8)
plt.plot(trainfeatures[traintargets==1,0],trainfeatures[traintargets==1,1],'sg',ms=8)
plt.title("Logistic Regression - Number of errors on training data "+str(numErrorsTrain))
plt.xlabel('Equity [Euro]')
plt.ylabel('Annual Income [Euro]')
plt.show()
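
The nested loops that fill the grid of inputs can also be written in vectorized form with np.meshgrid. The sketch below uses small illustrative tick ranges (assumptions, not the customer data) and checks that both constructions yield the same points in the same order.

```python
import numpy as np

h = 500                           # grid step, same role as in the cell above
x0ticks = np.arange(0, 2000, h)   # illustrative ranges, not the customer data
x1ticks = np.arange(0, 1500, h)

# Nested-loop construction, as used above
loop_grid = np.zeros((len(x0ticks) * len(x1ticks), 2))
count = 0
for i0 in x0ticks:
    for i1 in x1ticks:
        loop_grid[count, :] = [i0, i1]
        count += 1

# Vectorized construction; indexing='ij' keeps the same point order
g0, g1 = np.meshgrid(x0ticks, x1ticks, indexing='ij')
mesh_grid = np.column_stack([g0.ravel(), g1.ravel()])

same = np.array_equal(loop_grid, mesh_grid)
print("Grids identical: %s" % same)
```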

Questions on the first iteration

  1. What is passed as input to the training-method of a classifier?
  2. What is passed as input to the prediction-method of a classifier? What is returned by the prediction-method?
  3. How to determine the training-error of a classifier?
  4. Is the training-error a suitable measure for the quality of the learned model? Is the model, learned in this first iteration good?

Iteration 2: Test and evaluate learned model

Calculate error on test data

The model, which was learned in the first iteration, has an optimum accuracy of 1.0 on the training data. However, this good performance on the training data is not a valid measure for the quality of the model. A learned model is good if it classifies or predicts new data correctly, i.e. for model evaluation one must

  1. pass data, which has not been used for training, to the model's predict() method.
  2. compare the model's prediction with the real target values

This implies that evaluation requires labeled data (feature-vectors and the corresponding target values). Thus not the entire set of labeled data can be applied for training. Instead one must partition the available set of labeled data into

  • a set of training data
  • a set of test data.

In the first iteration of this notebook only 30 labeled items (persons) have been applied for training. All in all 200 labeled items are contained in the input file. Thus 170 items can be applied for testing as follows:

In [9]:
testfeatures=dataArray[numTrain:,0:2]  # first two columns are the input features
testtargets=dataArray[numTrain:,2]     # third column is the target parameter

predTest=logReg.predict(testfeatures)
numErrorsTest=np.sum(np.abs((predTest - testtargets)))
print "Number of misclassified test datasamples: ",numErrorsTest
testAccuracy=1-numErrorsTest/float(len(testtargets))
print "Accuracy on test datasamples: ",testAccuracy
Number of misclassified test datasamples:  7.0
Accuracy on test datasamples:  0.958823529412

The learned linear model misclassifies 7 items out of the 170 test items. The corresponding accuracy is 0.959.
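
The split above simply takes the first 30 rows for training and the remaining rows for testing. This is only representative if the order of the rows in the file carries no information. A common alternative is a random split, for example with train_test_split from sklearn.model_selection. The sketch below is an illustration on synthetic stand-in data; the array names and the random seed are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative stand-in for the labeled data: 200 rows, 2 features, binary target
rng = np.random.RandomState(0)
features = rng.randn(200, 2)
targets = (features[:, 0] + features[:, 1] > 0).astype(int)

# Hold out 170 rows for testing, as in the split above, but chosen at random.
# stratify keeps the class proportions similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    features, targets, test_size=170, random_state=0, stratify=targets)

print("Training rows: %d, test rows: %d" % (len(X_train), len(X_test)))
```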

Enhance the set of training data

In order to learn a good model, the set of training data should be a good representative of the set of all data in the given problem domain. If the statistics of the training data do not match the statistics of the data that must be classified by the learned model, the accuracy will be weak. The probability that a training set reflects the statistics of the entire data set well increases with the size of the training set. This can be demonstrated by the current example: The limited set of only 30 training samples, used in the first iteration, suggests that the two classes can be separated by a linear discriminant. Now, if we enlarge the set of training data from 30 to 100 items, this is no longer true. The plot below indicates that there is no linear discriminant that separates the two classes without any misclassification.

In [10]:
numTrain2=100
trainfeatures2=dataArray[:numTrain2,0:2]  # first two columns are the input features
traintargets2=dataArray[:numTrain2,2]     # third column is the target parameter

plt.figure(num=None, figsize=(12, 10), dpi=80, facecolor='w', edgecolor='k')
plt.plot(trainfeatures2[traintargets2[:]==0.0,0],trainfeatures2[traintargets2[:]==0.0,1],'sm',label='not credit-worthy')
plt.plot(trainfeatures2[traintargets2[:]==1.0,0],trainfeatures2[traintargets2[:]==1.0,1],'sg',label='credit-worthy')
plt.xlabel('Equity [Euro]')
plt.ylabel('Annual Income [Euro]')
plt.title('Credit-Worthiness of Customers')
plt.legend(loc=2,numpoints=1)
plt.show()

Now, with training data that is obviously not linearly separable, the question of which type of model shall be learned becomes more tricky:

  • Shall we still learn a simple linear model, which cannot even separate the training data perfectly?
  • Shall we choose an algorithm which is capable of learning a more complex non-linear model, which fits the training data better?

The more complex model may fit better to the training data. However, more complex models may be overfitted to the training data. This means that they have a smaller error rate on the training data, but a higher error rate on the test data. Since the error rate on the test data is the crucial performance figure, the rule of thumb says that the simpler model shall be chosen first. The simpler model reduces the probability for overfitting.
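
The trade-off described above can be explored with a small experiment. The sketch below compares a linear model with a very flexible one on synthetic data with some randomly flipped labels. The data, the choice of an unlimited-depth decision tree as the complex model, and all parameters are assumptions for illustration; they are not part of this notebook's analysis.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic, illustrative data: a linear class boundary with about 10% flipped labels
rng = np.random.RandomState(0)
X = rng.randn(300, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(int)
flip = rng.rand(300) < 0.1
y[flip] = 1 - y[flip]

X_train, y_train = X[:100], y[:100]
X_test, y_test = X[100:], y[100:]

simple = LogisticRegression().fit(X_train, y_train)
# A tree without depth limit can memorize every training point
flexible = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

for name, model in [("logistic regression", simple), ("decision tree", flexible)]:
    print("%s: train accuracy %.3f, test accuracy %.3f"
          % (name, model.score(X_train, y_train), model.score(X_test, y_test)))
```

The tree reproduces all training labels, including the flipped ones, so its training accuracy alone says nothing about its quality. The printed test accuracies show how each model behaves on data it has not seen.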

The same LogisticRegression-learning algorithm as applied in the first iteration is now applied for the enhanced training dataset. Now 4 items of the training data set are misclassified, yielding an accuracy of 0.96 on the training data.

In [11]:
logReg2 = LogisticRegression(C=10000,fit_intercept=True, intercept_scaling=100)
logReg2.fit(trainfeatures2, traintargets2)
predTrain2=logReg2.predict(trainfeatures2)
numErrorsTrain2=np.sum(np.abs((predTrain2 - traintargets2)))
print "Number of misclassified training datasamples: ",numErrorsTrain2
trainAccuracy2=1-numErrorsTrain2/float(len(traintargets2))
print "Accuracy on training datasamples: ",trainAccuracy2
Number of misclassified training datasamples:  4.0
Accuracy on training datasamples:  0.96

The learned linear discriminant can again be plotted as follows:

In [12]:
h=500
######### apply learned model and plot decision boundary and training data
x0ticks = np.arange(np.min(trainfeatures2[:,0]),np.max(trainfeatures2[:,0]),h)
x1ticks = np.arange(np.min(trainfeatures2[:,1]),np.max(trainfeatures2[:,1]),h)
testdata=np.zeros(((len(x0ticks)*len(x1ticks)),2))
count=0
for i0 in x0ticks:
    for i1 in x1ticks:
        testdata[count,:]=[i0,i1]
        count+=1
predicted=logReg2.predict(testdata)
plt.figure(num=None, figsize=(12, 10), dpi=80, facecolor='w', edgecolor='k')
plt.xlim((min(x0ticks),max(x0ticks)))
plt.ylim((min(x1ticks),max(x1ticks)))
plt.plot(testdata[predicted==0,0],testdata[predicted==0,1],'sr',ms=10,alpha=0.1)
plt.plot(testdata[predicted==1,0],testdata[predicted==1,1],'sb',ms=10,alpha=0.1)
plt.plot(trainfeatures2[traintargets2==0,0],trainfeatures2[traintargets2==0,1],'sm',ms=8)
plt.plot(trainfeatures2[traintargets2==1,0],trainfeatures2[traintargets2==1,1],'sg',ms=8)
plt.title("Logistic Regression - Number of errors on training data "+str(numErrorsTrain2))
plt.xlabel('Equity [Euro]')
plt.ylabel('Annual Income [Euro]')
plt.show()

The accuracy on the test data is calculated in the code snippet below. As the output shows, the accuracy on the test data increased slightly (from 0.959 to 0.96), despite the fact that the accuracy on the training data decreased (from 1.0 to 0.96). Note that the two test sets differ: 170 items in the first evaluation, 100 items here. This suggests that the model of the second iteration, which was learned with more training data than the model in the first iteration, generalizes at least as well. In both iterations the same learning algorithm has been applied, but the larger training data set of the second iteration is a better representation of the entire data.

In [11]:
testfeatures2=dataArray[numTrain2:,0:2]  # first two columns are the input features
testtargets2=dataArray[numTrain2:,2]     # third column is the target parameter

predTest2=logReg2.predict(testfeatures2)
numErrorsTest2=np.sum(np.abs((predTest2 - testtargets2)))
print "Number of misclassified test datasamples: ",numErrorsTest2
testAccuracy2=1-numErrorsTest2/float(len(testtargets2))
print "Accuracy on test datasamples: ",testAccuracy2
Number of misclassified test datasamples:  4.0
Accuracy on test datasamples:  0.96

Iteration 3: Cross Validation

We have learned that the number of training items should be large, since the training data should reflect the statistics of the entire dataset. On the other hand we need a sufficiently large amount of test data in order to obtain meaningful performance figures. Both training and test data must be labeled. However, labeling is usually expensive, hence the set of available labeled data is often quite small.

If the amount of labeled data is too small to partition it into a sufficiently large training set and a sufficiently large test set, one usually applies x-fold cross validation. The integer $x$ is often selected to be $10$. For 10-fold cross validation the entire set of labeled data is partitioned into $x=10$ partitions of approximately the same size. The overall training and test is performed in $x=10$ iterations. In the $i$-th iteration partition $i$ is used for testing and all other partitions are used for training the model. In each iteration a model is trained with the training partitions and tested with the test partition. The overall performance criterion is then the mean over all iteration-specific performance values (e.g. the mean over the accuracy values of the 10 iterations).
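
The procedure described above can be written out by hand in a few lines. The following is a minimal sketch on synthetic data; the data, the random shuffling before partitioning and the default model parameters are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic, illustrative data (NOT the customer file of this notebook)
rng = np.random.RandomState(0)
X = rng.randn(200, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

x = 10                               # number of folds
indices = rng.permutation(len(X))    # shuffle once before partitioning
folds = np.array_split(indices, x)   # x partitions of (almost) equal size

fold_scores = []
for i in range(x):
    test_idx = folds[i]              # partition i is used for testing
    train_idx = np.concatenate([folds[j] for j in range(x) if j != i])
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[test_idx], y[test_idx]))

mean_score = np.mean(fold_scores)
print("Mean accuracy over %d folds: %.3f" % (x, mean_score))
```

Every item is used for testing exactly once and for training in all other iterations, so all labeled data contributes to the performance estimate.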

In [13]:
from IPython.display import Image
i = Image(filename='./Res/CrossValidation.jpg')
i
Out[13]:

Scikit-learn provides the cross_val_score function for x-fold cross validation. Its arguments are an estimator object (the learner), the set of all features, all targets, the number of folds (cv) and the specification of a scoring method. The function returns an array of scores, which contains the performance values of all iterations:

In [14]:
allfeatures=dataArray[:,0:2]
alltargets=dataArray[:,2]
scores = cross_val_score(logReg,allfeatures,alltargets,cv=10,scoring="accuracy")
print scores
print "Cross Validation Score: %0.3f (+/- %0.3f)" % (scores.mean(), scores.std() / 2)
[ 1.    0.95  1.    0.95  0.95  1.    1.    0.9   0.95  0.95]
Cross Validation Score: 0.965 (+/- 0.016)

Summary

This notebook provides a high-level overview of a supervised machine learning task. It demonstrates how labeled data on previous customers of a bank can be used to train a model that is able to predict (classify) the credit-worthiness of future customers based on their equity and annual income. With this example a basic understanding of the following machine learning process steps should be provided:

  • Data inspection by visualisation
  • Model-Type selection based on data visualisation
  • Training of a model and requirements on the training data
  • Test and evaluation of a model:
    • Performance measure for 2-class classification
    • Test procedures e.g. cross-validation
    • Overfitting vs. Generalization

In this notebook a classification task has been considered. It is recommended to continue with a similar notebook on a regression task: bodyfatRegression.
