Geodemographic Segmentation Model

Dataset: https://www.superdatascience.com/training/

Author: Filipa C. S. Rodrigues ([email protected])

In [1]:
%matplotlib inline 
import warnings
warnings.filterwarnings("ignore")
from __future__ import division
import pandas
import csv
import numpy as np
import statsmodels.api as sm
import statsmodels.discrete.discrete_model as smdis
import statsmodels.stats.outliers_influence as outliers
import matplotlib.pyplot as plt
In [2]:
df = pandas.DataFrame.from_csv('Churn-Modelling.csv', index_col=None)
df[:5]
Out[2]:
RowNumber CustomerId Surname CreditScore Geography Gender Age Tenure Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary Exited
0 1 15634602 Hargrave 619 France Female 42 2 0.00 1 1 1 101348.88 1
1 2 15647311 Hill 608 Spain Female 41 1 83807.86 1 0 1 112542.58 0
2 3 15619304 Onio 502 France Female 42 8 159660.80 3 1 0 113931.57 1
3 4 15701354 Boni 699 France Female 39 1 0.00 2 0 0 93826.63 0
4 5 15737888 Mitchell 850 Spain Female 43 2 125510.82 1 1 1 79084.10 0

Variables

  • Exited: dependent variable (y); binary
  • Gender, Geography: categorical independent variables (x)
  • All the remaining variables: numeric independent variables (x)
In [3]:
df_y = df['Exited']
df_x = df.drop(['Exited'], axis = 1)

Create dummy variables for the categorical variables:

In [4]:
dummy = pandas.get_dummies(df_x['Gender'])
df_x = dummy.join(df_x)
dummy = pandas.get_dummies(df_x['Geography'])
df_x = dummy.join(df_x)
df_x[:5]
Out[4]:
France Germany Spain Female Male RowNumber CustomerId Surname CreditScore Geography Gender Age Tenure Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary
0 1 0 0 1 0 1 15634602 Hargrave 619 France Female 42 2 0.00 1 1 1 101348.88
1 0 0 1 1 0 2 15647311 Hill 608 Spain Female 41 1 83807.86 1 0 1 112542.58
2 1 0 0 1 0 3 15619304 Onio 502 France Female 42 8 159660.80 3 1 0 113931.57
3 1 0 0 1 0 4 15701354 Boni 699 France Female 39 1 0.00 2 0 0 93826.63
4 0 0 1 1 0 5 15737888 Mitchell 850 Spain Female 43 2 125510.82 1 1 1 79084.10

For each categorical variable, one dummy level must be dropped to avoid the "dummy variable trap":

In [5]:
df_x = df_x.drop(['Gender', 'Male', 'France', 'Geography'], axis =1)
df_x[:5]
Out[5]:
Germany Spain Female RowNumber CustomerId Surname CreditScore Age Tenure Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary
0 0 0 1 1 15634602 Hargrave 619 42 2 0.00 1 1 1 101348.88
1 0 1 1 2 15647311 Hill 608 41 1 83807.86 1 0 1 112542.58
2 0 0 1 3 15619304 Onio 502 42 8 159660.80 3 1 0 113931.57
3 0 0 1 4 15701354 Boni 699 39 1 0.00 2 0 0 93826.63
4 0 1 1 5 15737888 Mitchell 850 43 2 125510.82 1 1 1 79084.10

Add a constant $b_0$ to the model:

In [6]:
df_x = sm.add_constant(df_x)
df_x[:2]
Out[6]:
const Germany Spain Female RowNumber CustomerId Surname CreditScore Age Tenure Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary
0 1 0 0 1 1 15634602 Hargrave 619 42 2 0.00 1 1 1 101348.88
1 1 0 1 1 2 15647311 Hill 608 41 1 83807.86 1 0 1 112542.58

Exclude the variables that should not affect the model:

In [7]:
df_x = df_x.drop(['RowNumber', 'CustomerId', 'Surname'], axis = 1)
df_x[:2]
Out[7]:
const Germany Spain Female CreditScore Age Tenure Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary
0 1 0 0 1 619 42 2 0.00 1 1 1 101348.88
1 1 0 1 1 608 41 1 83807.86 1 0 1 112542.58

Create a model with all the remaining variables:

In [8]:
model1 = smdis.Logit(df_y, df_x).fit()
print model1.summary()
print "\n_____P-values____"
p = model1.pvalues
print p
print "\n_____Highest p-value____"
print "\n %s \n" % p[p ==max(p)]
print "\n____Confusion Matrix___"
cm = model1.pred_table()
print cm
print "\nNumber of cases correctly predicted: %s (%s %%)" % (cm[0][0] + cm[1][1], (cm[0][0] + cm[1][1])*100/np.sum(cm))
Optimization terminated successfully.
         Current function value: 0.428068
         Iterations 6
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                 Exited   No. Observations:                10000
Model:                          Logit   Df Residuals:                     9988
Method:                           MLE   Df Model:                           11
Date:                Wed, 28 Dec 2016   Pseudo R-squ.:                  0.1532
Time:                        13:04:11   Log-Likelihood:                -4280.7
converged:                       True   LL-Null:                       -5054.9
                                        LLR p-value:                     0.000
===================================================================================
                      coef    std err          z      P>|z|      [95.0% Conf. Int.]
-----------------------------------------------------------------------------------
const              -3.9208      0.245    -15.980      0.000        -4.402    -3.440
Germany             0.7747      0.068     11.448      0.000         0.642     0.907
Spain               0.0352      0.071      0.499      0.618        -0.103     0.174
Female              0.5285      0.054      9.699      0.000         0.422     0.635
CreditScore        -0.0007      0.000     -2.384      0.017        -0.001    -0.000
Age                 0.0727      0.003     28.230      0.000         0.068     0.078
Tenure             -0.0159      0.009     -1.705      0.088        -0.034     0.002
Balance          2.637e-06   5.14e-07      5.128      0.000      1.63e-06  3.64e-06
NumOfProducts      -0.1015      0.047     -2.154      0.031        -0.194    -0.009
HasCrCard          -0.0447      0.059     -0.753      0.452        -0.161     0.072
IsActiveMember     -1.0754      0.058    -18.643      0.000        -1.189    -0.962
EstimatedSalary  4.807e-07   4.74e-07      1.015      0.310     -4.48e-07  1.41e-06
===================================================================================

_____P-values____
const               1.759879e-57
Germany             2.413604e-30
Spain               6.180842e-01
Female              3.044696e-22
CreditScore         1.712765e-02
Age                2.521111e-175
Tenure              8.821318e-02
Balance             2.922724e-07
NumOfProducts       3.124760e-02
HasCrCard           4.515135e-01
IsActiveMember      1.434412e-77
EstimatedSalary     3.101752e-01
dtype: float64

_____Highest p-value____

 Spain    0.618084
dtype: float64 


____Confusion Matrix___
[[ 7666.   297.]
 [ 1600.   437.]]

Number of cases correctly predicted: 8103.0 (81.03 %)

Note:

"Pseudo R-squ." - pseudo R-squared. Logistic regression does not have an equivalent to the R-squared that is found in OLS regression; however, many people have tried to come up with one. There are a wide variety of pseudo-R-square statistics. Because this statistic does not mean what R-square means in OLS regression (the proportion of variance explained by the predictors), we suggest interpreting this statistic with great caution. (source: http://www.ats.ucla.edu/stat/stata/output/stata_logistic.htm)

Backward Elimination

New model without the variable with the highest p-value: "Spain"

In [9]:
df_x1 = df_x.drop(['Spain'], axis = 1)
df_x1[:2]
Out[9]:
const Germany Female CreditScore Age Tenure Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary
0 1 0 1 619 42 2 0.00 1 1 1 101348.88
1 1 0 1 608 41 1 83807.86 1 0 1 112542.58
In [10]:
model2 = smdis.Logit(df_y, df_x1).fit()
print model2.summary()
print "\n_____P-values____"
p = model2.pvalues
print p
print "\n_____Highest p-value____"
print "\n %s \n" % p[p ==max(p)]
print "\n____Confusion Matrix___"
cm = model2.pred_table()
print cm
print "\nNumber of cases correctly predicted: %s (%s %%)" % (cm[0][0] + cm[1][1], (cm[0][0] + cm[1][1])*100/np.sum(cm))
Optimization terminated successfully.
         Current function value: 0.428080
         Iterations 6
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                 Exited   No. Observations:                10000
Model:                          Logit   Df Residuals:                     9989
Method:                           MLE   Df Model:                           10
Date:                Wed, 28 Dec 2016   Pseudo R-squ.:                  0.1531
Time:                        13:04:12   Log-Likelihood:                -4280.8
converged:                       True   LL-Null:                       -5054.9
                                        LLR p-value:                     0.000
===================================================================================
                      coef    std err          z      P>|z|      [95.0% Conf. Int.]
-----------------------------------------------------------------------------------
const              -3.9110      0.245    -15.994      0.000        -4.390    -3.432
Germany             0.7629      0.063     12.041      0.000         0.639     0.887
Female              0.5283      0.054      9.697      0.000         0.422     0.635
CreditScore        -0.0007      0.000     -2.378      0.017        -0.001    -0.000
Age                 0.0727      0.003     28.238      0.000         0.068     0.078
Tenure             -0.0160      0.009     -1.708      0.088        -0.034     0.002
Balance          2.637e-06   5.14e-07      5.129      0.000      1.63e-06  3.65e-06
NumOfProducts      -0.1013      0.047     -2.149      0.032        -0.194    -0.009
HasCrCard          -0.0449      0.059     -0.757      0.449        -0.161     0.071
IsActiveMember     -1.0752      0.058    -18.640      0.000        -1.188    -0.962
EstimatedSalary  4.813e-07   4.74e-07      1.016      0.310     -4.47e-07  1.41e-06
===================================================================================

_____P-values____
const               1.405607e-57
Germany             2.162265e-33
Female              3.114487e-22
CreditScore         1.739401e-02
Age                1.996715e-175
Tenure              8.764548e-02
Balance             2.913057e-07
NumOfProducts       3.161546e-02
HasCrCard           4.489327e-01
IsActiveMember      1.531167e-77
EstimatedSalary     3.095145e-01
dtype: float64

_____Highest p-value____

 HasCrCard    0.448933
dtype: float64 


____Confusion Matrix___
[[ 7665.   298.]
 [ 1602.   435.]]

Number of cases correctly predicted: 8100.0 (81.0 %)

New model without the variable with the highest p-value: "HasCrCard"

In [11]:
df_x2 = df_x1.drop(['HasCrCard'], axis = 1)
df_x2[:2]
Out[11]:
const Germany Female CreditScore Age Tenure Balance NumOfProducts IsActiveMember EstimatedSalary
0 1 0 1 619 42 2 0.00 1 1 101348.88
1 1 0 1 608 41 1 83807.86 1 1 112542.58
In [12]:
model3 = smdis.Logit(df_y, df_x2).fit()
print model3.summary()
print "\n_____P-values____"
p = model3.pvalues
print p
print "\n_____Highest p-value____"
print "\n %s \n" % p[p ==max(p)]
print "\n____Confusion Matrix___"
cm = model3.pred_table()
print cm
print "\nNumber of cases correctly predicted: %s (%s %%)" % (cm[0][0] + cm[1][1], (cm[0][0] + cm[1][1])*100/np.sum(cm))
Optimization terminated successfully.
         Current function value: 0.428109
         Iterations 6
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                 Exited   No. Observations:                10000
Model:                          Logit   Df Residuals:                     9990
Method:                           MLE   Df Model:                            9
Date:                Wed, 28 Dec 2016   Pseudo R-squ.:                  0.1531
Time:                        13:04:12   Log-Likelihood:                -4281.1
converged:                       True   LL-Null:                       -5054.9
                                        LLR p-value:                     0.000
===================================================================================
                      coef    std err          z      P>|z|      [95.0% Conf. Int.]
-----------------------------------------------------------------------------------
const              -3.9444      0.241    -16.395      0.000        -4.416    -3.473
Germany             0.7619      0.063     12.028      0.000         0.638     0.886
Female              0.5285      0.054      9.700      0.000         0.422     0.635
CreditScore        -0.0007      0.000     -2.369      0.018        -0.001    -0.000
Age                 0.0727      0.003     28.243      0.000         0.068     0.078
Tenure             -0.0162      0.009     -1.727      0.084        -0.034     0.002
Balance          2.645e-06   5.14e-07      5.146      0.000      1.64e-06  3.65e-06
NumOfProducts      -0.1013      0.047     -2.150      0.032        -0.194    -0.009
IsActiveMember     -1.0744      0.058    -18.631      0.000        -1.187    -0.961
EstimatedSalary  4.818e-07   4.74e-07      1.017      0.309     -4.47e-07  1.41e-06
===================================================================================

_____P-values____
const               2.068294e-60
Germany             2.545658e-33
Female              3.024897e-22
CreditScore         1.782382e-02
Age                1.731605e-175
Tenure              8.415226e-02
Balance             2.660186e-07
NumOfProducts       3.152337e-02
IsActiveMember      1.809198e-77
EstimatedSalary     3.090842e-01
dtype: float64

_____Highest p-value____

 EstimatedSalary    0.309084
dtype: float64 


____Confusion Matrix___
[[ 7673.   290.]
 [ 1599.   438.]]

Number of cases correctly predicted: 8111.0 (81.11 %)

New model without the variable with the highest p-value: "EstimatedSalary"

In [13]:
df_x3 = df_x2.drop(['EstimatedSalary'], axis = 1)
df_x3[:2]
Out[13]:
const Germany Female CreditScore Age Tenure Balance NumOfProducts IsActiveMember
0 1 0 1 619 42 2 0.00 1 1
1 1 0 1 608 41 1 83807.86 1 1
In [14]:
model4 = smdis.Logit(df_y, df_x3).fit()
print model4.summary()
print "\n_____P-values____"
p = model4.pvalues
print p
print "\n_____Highest p-value____"
print "\n %s \n" % p[p ==max(p)]
print "\n____Confusion Matrix___"
cm = model4.pred_table()
print cm
print "\nNumber of cases correctly predicted: %s (%s %%)" % (cm[0][0] + cm[1][1], (cm[0][0] + cm[1][1])*100/np.sum(cm))
Optimization terminated successfully.
         Current function value: 0.428161
         Iterations 6
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                 Exited   No. Observations:                10000
Model:                          Logit   Df Residuals:                     9991
Method:                           MLE   Df Model:                            8
Date:                Wed, 28 Dec 2016   Pseudo R-squ.:                  0.1530
Time:                        13:04:12   Log-Likelihood:                -4281.6
converged:                       True   LL-Null:                       -5054.9
                                        LLR p-value:                     0.000
==================================================================================
                     coef    std err          z      P>|z|      [95.0% Conf. Int.]
----------------------------------------------------------------------------------
const             -3.8959      0.236    -16.528      0.000        -4.358    -3.434
Germany            0.7621      0.063     12.031      0.000         0.638     0.886
Female             0.5290      0.054      9.710      0.000         0.422     0.636
CreditScore       -0.0007      0.000     -2.378      0.017        -0.001    -0.000
Age                0.0727      0.003     28.238      0.000         0.068     0.078
Tenure            -0.0160      0.009     -1.710      0.087        -0.034     0.002
Balance         2.653e-06   5.14e-07      5.162      0.000      1.65e-06  3.66e-06
NumOfProducts     -0.1005      0.047     -2.132      0.033        -0.193    -0.008
IsActiveMember    -1.0751      0.058    -18.644      0.000        -1.188    -0.962
==================================================================================

_____P-values____
const              2.308408e-61
Germany            2.434718e-33
Female             2.744698e-22
CreditScore        1.741371e-02
Age               2.006986e-175
Tenure             8.734004e-02
Balance            2.440782e-07
NumOfProducts      3.297139e-02
IsActiveMember     1.407482e-77
dtype: float64

_____Highest p-value____

 Tenure    0.08734
dtype: float64 


____Confusion Matrix___
[[ 7676.   287.]
 [ 1598.   439.]]

Number of cases correctly predicted: 8115.0 (81.15 %)

New model without the variable with the highest p-value: "Tenure"

In [15]:
df_x4 = df_x3.drop(['Tenure'], axis = 1)
df_x4[:2]
Out[15]:
const Germany Female CreditScore Age Balance NumOfProducts IsActiveMember
0 1 0 1 619 42 0.00 1 1
1 1 0 1 608 41 83807.86 1 1
In [16]:
model5 = smdis.Logit(df_y, df_x4).fit()
print model5.summary()
print "\n_____P-values____"
p = model5.pvalues
print p
print "\n_____Highest p-value____"
print "\n %s \n" % p[p ==max(p)]
print "\n____Confusion Matrix___"
cm = model5.pred_table()
print cm
print "\nNumber of cases correctly predicted: %s (%s %%)" % (cm[0][0] + cm[1][1], (cm[0][0] + cm[1][1])*100/np.sum(cm))
Optimization terminated successfully.
         Current function value: 0.428307
         Iterations 6
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                 Exited   No. Observations:                10000
Model:                          Logit   Df Residuals:                     9992
Method:                           MLE   Df Model:                            7
Date:                Wed, 28 Dec 2016   Pseudo R-squ.:                  0.1527
Time:                        13:04:13   Log-Likelihood:                -4283.1
converged:                       True   LL-Null:                       -5054.9
                                        LLR p-value:                     0.000
==================================================================================
                     coef    std err          z      P>|z|      [95.0% Conf. Int.]
----------------------------------------------------------------------------------
const             -3.9760      0.231    -17.200      0.000        -4.429    -3.523
Germany            0.7608      0.063     12.014      0.000         0.637     0.885
Female             0.5306      0.054      9.741      0.000         0.424     0.637
CreditScore       -0.0007      0.000     -2.377      0.017        -0.001    -0.000
Age                0.0727      0.003     28.237      0.000         0.068     0.078
Balance         2.652e-06   5.14e-07      5.160      0.000      1.64e-06  3.66e-06
NumOfProducts     -0.1010      0.047     -2.144      0.032        -0.193    -0.009
IsActiveMember    -1.0718      0.058    -18.603      0.000        -1.185    -0.959
==================================================================================

_____P-values____
const              2.658461e-66
Germany            3.005769e-33
Female             2.006779e-22
CreditScore        1.745446e-02
Age               2.037936e-175
Balance            2.463940e-07
NumOfProducts      3.199990e-02
IsActiveMember     3.044455e-77
dtype: float64

_____Highest p-value____

 NumOfProducts    0.032
dtype: float64 


____Confusion Matrix___
[[ 7676.   287.]
 [ 1601.   436.]]

Number of cases correctly predicted: 8112.0 (81.12 %)

Since the accuracy decreases when "Tenure" is removed, that variable should be kept in the model. The variable with the highest remaining p-value is now "NumOfProducts" at 0.032, which is below the defined threshold of 0.05, so the backward elimination stops here.
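The manual procedure above can also be written as a loop. This is a hedged sketch of purely p-value-driven backward elimination (note that, unlike the accuracy check above, this loop would also drop "Tenure"):

# Sketch: drop the highest-p-value variable and refit until all p-values < threshold.
def backward_eliminate(y, X, threshold=0.05):
    X = X.copy()
    while True:
        fitted = smdis.Logit(y, X).fit(disp=0)   # disp=0 silences the optimizer output
        p = fitted.pvalues.drop('const')         # never drop the constant
        if p.max() < threshold:
            return fitted
        X = X.drop(p.idxmax(), axis=1)

# Example: model_be = backward_eliminate(df_y, df_x)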

In [17]:
df_x3[:5]
Out[17]:
const Germany Female CreditScore Age Tenure Balance NumOfProducts IsActiveMember
0 1 0 1 619 42 2 0.00 1 1
1 1 0 1 608 41 1 83807.86 1 1
2 1 0 1 502 42 8 159660.80 3 0
3 1 0 1 699 39 1 0.00 2 0
4 1 0 1 850 43 2 125510.82 1 1

Transforming Independent Variables

Most common transformations:

  1. $\sqrt{x}$
  2. $x^2$
  3. $\ln(x)$

Transform "Balance" variable with ln(x):

Original value: Balance (in 1000$):

    Bal2 = Bal1 + 1unit -->  Bal2 = Bal1 + 1000$

    Scenario1: Bal1 = 1000$
        --> Bal2 = 1000$ + 1000$ = 2000$
    Scenario2: Bal1 = 10000$
        --> Bal2 = 10000$ + 1000$ = 11000$

Log10(Balance + 1):

    log10(Bal2) = log10(Bal1) + 1unit --> Bal2 = Bal1*10

    Scenario1: Bal1 = 1000$
        --> Bal2 = 1000*10 = 10000
Scenario2: Bal1 = 10000$
        ---> Bal2 = 10000*10 = 100000

Using the original value, means that if someone that starts with a balance of 1000$ and has an increase of 1000$, and someone that starts with a balance of 10000$ and has the same amount of increasing, are two completly different things.

Using the ln transformation, it has the same affect on any person regardless of their starting point. It always a 10 times increase so a unit increase has a consistent increase in the balance variable which is 10 times.

So regardless of who we're segmenting we can say that the effect of a one unit increase in the new transformed variable is consistent throughout our population and that is much more powerful because that does not restrict the logistic regression.
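A tiny numeric check of this intuition (illustrative only):

# Adding one unit on the log10 scale multiplies the raw balance by 10,
# whatever the starting balance.
for bal in [1000.0, 10000.0]:
    log_bal = np.log10(bal)
    print "balance %8.0f --> log10 = %.2f; +1 log unit --> balance %8.0f" % (bal, log_bal, 10 ** (log_bal + 1))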

In [18]:
df_x4 = df_x3  # note: plain assignment aliases df_x3 (no copy), so the new column appears in both frames
df_x4['Balance_log'] = df_x4['Balance'].map(lambda x: np.log10(x+1))
df_x4[:2]
Out[18]:
const Germany Female CreditScore Age Tenure Balance NumOfProducts IsActiveMember Balance_log
0 1 0 1 619 42 2 0.00 1 1 0.00000
1 1 0 1 608 41 1 83807.86 1 1 4.92329
In [19]:
model6 = smdis.Logit(df_y, df_x4).fit()
print model6.summary()
print "\n_____P-values____"
p = model6.pvalues
print p
print "\n_____Highest p-value____"
print "\n %s \n" % p[p ==max(p)]
print "\n____Confusion Matrix___"
cm = model6.pred_table()
print cm
print "\nNumber of cases correctly predicted: %s (%s %%)" % (cm[0][0] + cm[1][1], (cm[0][0] + cm[1][1])*100/np.sum(cm))
Optimization terminated successfully.
         Current function value: 0.428134
         Iterations 6
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                 Exited   No. Observations:                10000
Model:                          Logit   Df Residuals:                     9990
Method:                           MLE   Df Model:                            9
Date:                Wed, 28 Dec 2016   Pseudo R-squ.:                  0.1530
Time:                        13:04:14   Log-Likelihood:                -4281.3
converged:                       True   LL-Null:                       -5054.9
                                        LLR p-value:                     0.000
==================================================================================
                     coef    std err          z      P>|z|      [95.0% Conf. Int.]
----------------------------------------------------------------------------------
const             -3.9147      0.237    -16.505      0.000        -4.380    -3.450
Germany            0.7509      0.065     11.532      0.000         0.623     0.879
Female             0.5286      0.054      9.701      0.000         0.422     0.635
CreditScore       -0.0007      0.000     -2.387      0.017        -0.001    -0.000
Age                0.0727      0.003     28.233      0.000         0.068     0.078
Tenure            -0.0160      0.009     -1.706      0.088        -0.034     0.002
Balance         1.871e-06   1.19e-06      1.568      0.117     -4.68e-07  4.21e-06
NumOfProducts     -0.0957      0.048     -2.013      0.044        -0.189    -0.003
IsActiveMember    -1.0754      0.058    -18.650      0.000        -1.188    -0.962
Balance_log        0.0235      0.032      0.728      0.467        -0.040     0.087
==================================================================================

_____P-values____
const              3.364882e-61
Germany            9.111212e-31
Female             2.971672e-22
CreditScore        1.697954e-02
Age               2.287651e-175
Tenure             8.794792e-02
Balance            1.169019e-01
NumOfProducts      4.409600e-02
IsActiveMember     1.262368e-77
Balance_log        4.668374e-01
dtype: float64

_____Highest p-value____

 Balance_log    0.466837
dtype: float64 


____Confusion Matrix___
[[ 7681.   282.]
 [ 1597.   440.]]

Number of cases correctly predicted: 8121.0 (81.21 %)

The accuracy is improved!

Derived Variables

Sometimes it is useful to account for combined effects. In our case, for example, balance and age might be correlated, because the older a person is, the more wealth he/she can accumulate.

So a new variable can be created:

$$WealthAccumulation = \frac{Balance}{Age}$$
In [20]:
df_x5 = df_x4
df_x5['WealthAccumulation'] = df_x5['Balance']/ df_x5['Age']
df_x5[:2]
Out[20]:
const Germany Female CreditScore Age Tenure Balance NumOfProducts IsActiveMember Balance_log WealthAccumulation
0 1 0 1 619 42 2 0.00 1 1 0.00000 0.000000
1 1 0 1 608 41 1 83807.86 1 1 4.92329 2044.094146
In [21]:
model7 = smdis.Logit(df_y, df_x5).fit()
print model7.summary()
print "\n_____P-values____"
p = model7.pvalues
print p
print "\n_____Highest p-value____"
print "\n %s \n" % p[p ==max(p)]
print "\n____Confusion Matrix___"
cm = model7.pred_table()
print cm
print "\nNumber of cases correctly predicted: %s (%s %%)" % (cm[0][0] + cm[1][1], (cm[0][0] + cm[1][1])*100/np.sum(cm))
Optimization terminated successfully.
         Current function value: 0.427323
         Iterations 6
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                 Exited   No. Observations:                10000
Model:                          Logit   Df Residuals:                     9989
Method:                           MLE   Df Model:                           10
Date:                Wed, 28 Dec 2016   Pseudo R-squ.:                  0.1546
Time:                        13:04:15   Log-Likelihood:                -4273.2
converged:                       True   LL-Null:                       -5054.9
                                        LLR p-value:                     0.000
======================================================================================
                         coef    std err          z      P>|z|      [95.0% Conf. Int.]
--------------------------------------------------------------------------------------
const                 -3.4253      0.266    -12.896      0.000        -3.946    -2.905
Germany                0.7534      0.066     11.483      0.000         0.625     0.882
Female                 0.5289      0.055      9.693      0.000         0.422     0.636
CreditScore           -0.0007      0.000     -2.326      0.020        -0.001    -0.000
Age                    0.0611      0.004     15.903      0.000         0.054     0.069
Tenure                -0.0165      0.009     -1.762      0.078        -0.035     0.002
Balance             8.362e-06   2.02e-06      4.143      0.000      4.41e-06  1.23e-05
NumOfProducts         -0.1011      0.048     -2.119      0.034        -0.195    -0.008
IsActiveMember        -1.0601      0.058    -18.349      0.000        -1.173    -0.947
Balance_log            0.0213      0.032      0.657      0.511        -0.042     0.085
WealthAccumulation    -0.0003   6.51e-05     -3.954      0.000        -0.000    -0.000
======================================================================================

_____P-values____
const                 4.756747e-38
Germany               1.599880e-30
Female                3.244594e-22
CreditScore           2.000443e-02
Age                   6.030631e-57
Tenure                7.801725e-02
Balance               3.428919e-05
NumOfProducts         3.411316e-02
IsActiveMember        3.376037e-75
Balance_log           5.111832e-01
WealthAccumulation    7.673375e-05
dtype: float64

_____Highest p-value____

 Balance_log    0.511183
dtype: float64 


____Confusion Matrix___
[[ 7662.   301.]
 [ 1586.   451.]]

Number of cases correctly predicted: 8113.0 (81.13 %)

This new variable had a negative effect on the model accuracy.

However, one should keep in mind that there may be collinearity effects, because wealth accumulation is derived from balance and age, which are already in the model. The new variable may therefore be correlated with the other two, and including all of them can harm the model, so one of them should be excluded.

Multicollinearity

Multicollinearity is a phenomenon in which two or more predictor variables in a multiple regression model are highly correlated, meaning that one can be linearly predicted from the others with a substantial degree of accuracy.

The "Variance Inflation Factors - VIF" can be used to measure the degree of multicollinearity.

Minimum possible value = 1.0 Values > 5 may indicate a collinearity problem.

$$VIF(j) = \frac{1}{( 1 - R(j)^2)}$$

where R(j) is the multiple correlarion coefficient between variable j and the other independent variables
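The formula can be verified by hand for a single variable (a sketch using the df_x5 frame from above): regress that column on the remaining predictors with OLS and apply 1/(1 - R^2). The result should match the variance_inflation_factor output in the next cell.

# Hedged check: VIF for 'Balance' from its auxiliary regression.
others = df_x5.drop('Balance', axis=1)               # remaining predictors, incl. const
r2 = sm.OLS(df_x5['Balance'], others).fit().rsquared
print "VIF(Balance) by hand: %s" % (1.0 / (1.0 - r2))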

In [22]:
i = 0
for column in df_x5.columns:
    
    print column + " %s" % outliers.variance_inflation_factor(df_x5.values, i)
    i += 1
const 91.4773722803
Germany 1.27101224421
Female 1.00321582613
CreditScore 1.00104450767
Age 2.10487620512
Tenure 1.00161073156
Balance 20.8895673667
NumOfProducts 1.15312466504
IsActiveMember 1.01384972933
Balance_log 8.72881202924
WealthAccumulation 14.5598595824

Rerun without the "Balance_log" variable:

In [23]:
df_x6 = df_x5.drop('Balance_log', axis = 1)
In [24]:
model8 = smdis.Logit(df_y, df_x6).fit()
print model8.summary()
print "\n_____P-values____"
p = model8.pvalues
print p
print "\n_____Highest p-value____"
print "\n %s \n" % p[p ==max(p)]
print "\n____Confusion Matrix___"
cm = model8.pred_table()
print cm
print "\nNumber of cases correctly predicted: %s (%s %%)" % (cm[0][0] + cm[1][1], (cm[0][0] + cm[1][1])*100/np.sum(cm))
Optimization terminated successfully.
         Current function value: 0.427344
         Iterations 6
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                 Exited   No. Observations:                10000
Model:                          Logit   Df Residuals:                     9990
Method:                           MLE   Df Model:                            9
Date:                Wed, 28 Dec 2016   Pseudo R-squ.:                  0.1546
Time:                        13:04:15   Log-Likelihood:                -4273.4
converged:                       True   LL-Null:                       -5054.9
                                        LLR p-value:                     0.000
======================================================================================
                         coef    std err          z      P>|z|      [95.0% Conf. Int.]
--------------------------------------------------------------------------------------
const                 -3.4089      0.264    -12.904      0.000        -3.927    -2.891
Germany                0.7636      0.064     11.967      0.000         0.639     0.889
Female                 0.5292      0.055      9.699      0.000         0.422     0.636
CreditScore           -0.0007      0.000     -2.318      0.020        -0.001    -0.000
Age                    0.0612      0.004     15.938      0.000         0.054     0.069
Tenure                -0.0165      0.009     -1.766      0.077        -0.035     0.002
Balance             9.078e-06    1.7e-06      5.351      0.000      5.75e-06  1.24e-05
NumOfProducts         -0.1054      0.047     -2.229      0.026        -0.198    -0.013
IsActiveMember        -1.0598      0.058    -18.343      0.000        -1.173    -0.947
WealthAccumulation    -0.0003   6.49e-05     -3.965      0.000        -0.000    -0.000
======================================================================================

_____P-values____
const                 4.282278e-38
Germany               5.318191e-33
Female                3.042840e-22
CreditScore           2.047417e-02
Age                   3.455661e-57
Tenure                7.744433e-02
Balance               8.770661e-08
NumOfProducts         2.578202e-02
IsActiveMember        3.744959e-75
WealthAccumulation    7.325518e-05
dtype: float64

_____Highest p-value____

 Tenure    0.077444
dtype: float64 


____Confusion Matrix___
[[ 7672.   291.]
 [ 1586.   451.]]

Number of cases correctly predicted: 8123.0 (81.23 %)
In [25]:
i = 0

for column in df_x6.columns:
    
    print column + " %s" % outliers.variance_inflation_factor(df_x6.values, i)
    i += 1
const 90.6320775047
Germany 1.2161676569
Female 1.00302785174
CreditScore 1.00097995156
Age 2.10477505278
Tenure 1.00153406128
Balance 14.0245653407
NumOfProducts 1.12330226029
IsActiveMember 1.01353234553
WealthAccumulation 14.5598562679

Create a new log-transformed variable for wealth accumulation:

In [26]:
df_x7 = df_x5
df_x7['WealthAccumulation_log'] = df_x7['WealthAccumulation'].map(lambda x: np.log10(x + 1))
df_x7[:2]
Out[26]:
const Germany Female CreditScore Age Tenure Balance NumOfProducts IsActiveMember Balance_log WealthAccumulation WealthAccumulation_log
0 1 0 1 619 42 2 0.00 1 1 0.00000 0.000000 0.000000
1 1 0 1 608 41 1 83807.86 1 1 4.92329 2044.094146 3.310713
In [27]:
df_x7 = df_x7.drop(['Balance', 'WealthAccumulation'], axis = 1)
In [28]:
model9 = smdis.Logit(df_y, df_x7).fit()
print model9.summary()
print "\n_____P-values____"
p = model9.pvalues
print p
print "\n_____Highest p-value____"
print "\n %s \n" % p[p ==max(p)]
print "\n____Confusion Matrix___"
cm = model9.pred_table()
print cm
print "\nNumber of cases correctly predicted: %s (%s %%)" % (cm[0][0] + cm[1][1], (cm[0][0] + cm[1][1])*100/np.sum(cm))
Optimization terminated successfully.
         Current function value: 0.427891
         Iterations 6
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                 Exited   No. Observations:                10000
Model:                          Logit   Df Residuals:                     9990
Method:                           MLE   Df Model:                            9
Date:                Wed, 28 Dec 2016   Pseudo R-squ.:                  0.1535
Time:                        13:04:16   Log-Likelihood:                -4278.9
converged:                       True   LL-Null:                       -5054.9
                                        LLR p-value:                     0.000
==========================================================================================
                             coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------------------
const                     -3.5735      0.267    -13.362      0.000        -4.098    -3.049
Germany                    0.7516      0.065     11.492      0.000         0.623     0.880
Female                     0.5263      0.055      9.655      0.000         0.420     0.633
CreditScore               -0.0007      0.000     -2.386      0.017        -0.001    -0.000
Age                        0.0647      0.004     16.631      0.000         0.057     0.072
Tenure                    -0.0162      0.009     -1.729      0.084        -0.035     0.002
NumOfProducts             -0.0983      0.048     -2.063      0.039        -0.192    -0.005
IsActiveMember            -1.0687      0.058    -18.516      0.000        -1.182    -0.956
Balance_log                0.8699      0.297      2.932      0.003         0.288     1.451
WealthAccumulation_log    -1.1762      0.435     -2.702      0.007        -2.029    -0.323
==========================================================================================

_____P-values____
const                     1.007546e-40
Germany                   1.441579e-30
Female                    4.661027e-22
CreditScore               1.704065e-02
Age                       4.171785e-62
Tenure                    8.382654e-02
NumOfProducts             3.907377e-02
IsActiveMember            1.520894e-76
Balance_log               3.373142e-03
WealthAccumulation_log    6.887931e-03
dtype: float64

_____Highest p-value____

 Tenure    0.083827
dtype: float64 


____Confusion Matrix___
[[ 7685.   278.]
 [ 1597.   440.]]

Number of cases correctly predicted: 8125.0 (81.25 %)
In [29]:
i = 0

for column in df_x7.columns:
    
    print column + " %s" % outliers.variance_inflation_factor(df_x7.values, i)
    i += 1
const 93.5608074531
Germany 1.26986359735
Female 1.00282038022
CreditScore 1.00102945704
Age 2.26479838945
Tenure 1.00165459162
NumOfProducts 1.15243000098
IsActiveMember 1.01168357579
Balance_log 705.940971113
WealthAccumulation_log 704.739971273

"Balance_log" and "WealthAccumulation_log" have too large values which means that these two variables are basically the same thing. So one of these variables must be excluded!

In [30]:
df_x8 = df_x7.drop(['Balance_log'], axis = 1)
In [31]:
model10 = smdis.Logit(df_y, df_x8).fit()
print model10.summary()
print "\n_____P-values____"
p = model10.pvalues
print p
print "\n_____Highest p-value____"
print "\n %s \n" % p[p ==max(p)]
print "\n____Confusion Matrix___"
cm = model10.pred_table()
print cm
print "\nNumber of cases correctly predicted: %s (%s %%)" % (cm[0][0] + cm[1][1], (cm[0][0] + cm[1][1])*100/np.sum(cm))
Optimization terminated successfully.
         Current function value: 0.428322
         Iterations 6
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                 Exited   No. Observations:                10000
Model:                          Logit   Df Residuals:                     9991
Method:                           MLE   Df Model:                            8
Date:                Wed, 28 Dec 2016   Pseudo R-squ.:                  0.1527
Time:                        13:04:18   Log-Likelihood:                -4283.2
converged:                       True   LL-Null:                       -5054.9
                                        LLR p-value:                     0.000
==========================================================================================
                             coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------------------
const                     -3.9325      0.239    -16.467      0.000        -4.401    -3.464
Germany                    0.7515      0.065     11.550      0.000         0.624     0.879
Female                     0.5266      0.054      9.671      0.000         0.420     0.633
CreditScore               -0.0007      0.000     -2.409      0.016        -0.001    -0.000
Age                        0.0733      0.003     28.398      0.000         0.068     0.078
Tenure                    -0.0159      0.009     -1.697      0.090        -0.034     0.002
NumOfProducts             -0.0967      0.048     -2.034      0.042        -0.190    -0.004
IsActiveMember            -1.0763      0.058    -18.672      0.000        -1.189    -0.963
WealthAccumulation_log     0.0985      0.020      4.812      0.000         0.058     0.139
==========================================================================================

_____P-values____
const                      6.349678e-61
Germany                    7.379902e-31
Female                     4.015783e-22
CreditScore                1.598070e-02
Age                       2.132441e-177
Tenure                     8.978088e-02
NumOfProducts              4.195258e-02
IsActiveMember             8.443410e-78
WealthAccumulation_log     1.490621e-06
dtype: float64

_____Highest p-value____

 Tenure    0.089781
dtype: float64 


____Confusion Matrix___
[[ 7688.   275.]
 [ 1597.   440.]]

Number of cases correctly predicted: 8128.0 (81.28 %)

Correlation between variables:

In [32]:
df_x5[['Age', 'Balance_log', 'WealthAccumulation_log', 'WealthAccumulation']].corr()
Out[32]:
Age Balance_log WealthAccumulation_log WealthAccumulation
Age 1.000000 0.034530 -0.007524 -0.246293
Balance_log 0.034530 1.000000 0.998404 0.865141
WealthAccumulation_log -0.007524 0.998404 1.000000 0.888872
WealthAccumulation -0.246293 0.865141 0.888872 1.000000

As one can observe, the "WealthAccumulation_log" and "Balance_log" variables are highly correlated - they carry essentially the same information, which is very bad for the model.

Rule of thumb: anything above 0.9 is a very high correlation. Correlations above 0.5 should be investigated.

In [33]:
df_final = df_x5.drop(['Balance', 'WealthAccumulation', 'WealthAccumulation_log'], axis = 1)
In [34]:
model11 = smdis.Logit(df_y, df_final).fit()
print model11.summary()
print "\n_____P-values____"
p = model11.pvalues
print p
print "\n_____Highest p-value____"
print "\n %s \n" % p[p ==max(p)]
print "\n____Confusion Matrix___"
cm = model11.pred_table()
print cm
print "\nNumber of cases correctly predicted: %s (%s %%)" % (cm[0][0] + cm[1][1], (cm[0][0] + cm[1][1])*100/np.sum(cm))
Optimization terminated successfully.
         Current function value: 0.428257
         Iterations 6
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                 Exited   No. Observations:                10000
Model:                          Logit   Df Residuals:                     9991
Method:                           MLE   Df Model:                            8
Date:                Wed, 28 Dec 2016   Pseudo R-squ.:                  0.1528
Time:                        13:04:19   Log-Likelihood:                -4282.6
converged:                       True   LL-Null:                       -5054.9
                                        LLR p-value:                     0.000
==================================================================================
                     coef    std err          z      P>|z|      [95.0% Conf. Int.]
----------------------------------------------------------------------------------
const             -3.9126      0.237    -16.497      0.000        -4.377    -3.448
Germany            0.7476      0.065     11.492      0.000         0.620     0.875
Female             0.5267      0.054      9.672      0.000         0.420     0.633
CreditScore       -0.0007      0.000     -2.408      0.016        -0.001    -0.000
Age                0.0727      0.003     28.221      0.000         0.068     0.078
Tenure            -0.0159      0.009     -1.699      0.089        -0.034     0.002
NumOfProducts     -0.0950      0.048     -1.999      0.046        -0.188    -0.002
IsActiveMember    -1.0758      0.058    -18.662      0.000        -1.189    -0.963
Balance_log        0.0690      0.014      4.945      0.000         0.042     0.096
==================================================================================

_____P-values____
const              3.835044e-61
Germany            1.441009e-30
Female             3.970390e-22
CreditScore        1.604464e-02
Age               3.239286e-175
Tenure             8.932416e-02
NumOfProducts      4.562540e-02
IsActiveMember     1.011855e-77
Balance_log        7.619575e-07
dtype: float64

_____Highest p-value____

 Tenure    0.089324
dtype: float64 


____Confusion Matrix___
[[ 7687.   276.]
 [ 1597.   440.]]

Number of cases correctly predicted: 8127.0 (81.27 %)
In [35]:
i = 0

for column in df_final.columns:
    
    print column + " %s" % outliers.variance_inflation_factor(df_final.values, i)
    i += 1
const 76.3936634172
Germany 1.26934389528
Female 1.00281465894
CreditScore 1.00102204693
Age 1.01171547993
Tenure 1.0014367308
NumOfProducts 1.15145597018
IsActiveMember 1.01028026511
Balance_log 1.42007014313

Now none of the predictor variables shows problematic multicollinearity (the large VIF of the constant term can be disregarded).

CAP - Cumulative Accuracy Profile Curve

In [36]:
df_cap = df[['RowNumber', 'Exited']]
df_cap['phat'] = model11.predict()
In [37]:
df_cap.sort('phat', ascending=False)[:5]
Out[37]:
RowNumber Exited phat
4815 4816 0 0.930669
3531 3532 1 0.928311
9587 9588 0 0.923303
7499 7500 1 0.906178
9555 9556 1 0.904457
In [153]:
Total_Exited = sum(df_cap['Exited'])
print "Total Exited: %s" % Total_Exited
Total_Records = len(df)
print "Total number of clients: %s" % Total_Records
Exit_Ratio = Total_Exited/Total_Records
print "Ratio of exits: %s " % Exit_Ratio
Total Exited: 2037
Total number of clients: 10000
Ratio of exits: 0.2037 

First, I rank the clients by predicted probability of exiting, from the highest to the lowest.

In [111]:
df_cap['total_selected'] = range(1,len(df_cap)+1)
df_cap['total_selected_per'] = df_cap['total_selected'].map(lambda x: round(x*100/Total_Records, 1))
df_cap['random_selection'] = df_cap['total_selected'].map(lambda x: x*Exit_Ratio)
df_cap['random_selection_per'] = df_cap['total_selected'].map(lambda x: round(x*100/Total_Records, 1))
In [112]:
df_cap = df_cap.sort('phat', ascending=False)
In [113]:
df_cap['model_select'] = df_cap.Exited.cumsum()
df_cap['model_select_per'] = df_cap['model_select'].map(lambda x: round(x*100/Total_Exited, 1))
In [114]:
df_cap[:5]
Out[114]:
RowNumber Exited phat total_selected total_selected_per random_selection random_selection_per model_select model_select_per
4815 4816 0 0.930669 1 0.0 0.2037 0.0 0 0.0
3531 3532 1 0.928311 2 0.0 0.4074 0.0 1 0.0
9587 9588 0 0.923303 3 0.0 0.6111 0.0 1 0.0
7499 7500 1 0.906178 4 0.0 0.8148 0.0 2 0.1
9555 9556 1 0.904457 5 0.1 1.0185 0.1 3 0.1
In [115]:
df_cap = df_cap.sort('RowNumber')
plt.plot(df_cap['total_selected_per'], df_cap['random_selection_per'], label = 'random model' )
plt.plot(df_cap['total_selected_per'], df_cap['model_select_per'], label = 'Good Model' )
plt.plot([50, 50], [80, 0], 'r--', lw=1, label = 'x = 50%, y = X')
plt.plot([0, 50], [80, 80], 'r--', lw=1)
plt.plot([10, 100], [99, 99], 'y-', lw=3, label = 'perfect model')
plt.plot([0, 10], [0, 99], 'y-', lw=3)
plt.legend(bbox_to_anchor=(1, 1), loc='upper left', ncol=1)
plt.xlabel('Total Contacted')
plt.ylabel('Total Exited')
plt.show()

The closer the green line is to the yellow line, the better the model. The closer to the blue line, the worse.

Accuracy Ratio:

- Take the area between the perfect model (yellow line) and the random model (blue line): $a_p$
- Take the area between the good model (green line) and the random model (blue line): $a_r$
- The accuracy ratio is then:

$$AR = \frac{a_r}{a_p}$$

which lies between 0 and 1; the closer the ratio is to 1, the better the model.
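The areas can be approximated numerically from the CAP columns built above. A hedged sketch using the trapezoid rule (np.trapz), assuming df_cap and Exit_Ratio are still in scope:

cap = df_cap.sort('total_selected')                    # x must be increasing
x = cap['total_selected_per'].values / 100.0           # fraction of clients contacted
model = cap['model_select_per'].values / 100.0         # fraction of exits captured
random = x                                             # random model: the diagonal
perfect = np.minimum(x / Exit_Ratio, 1.0)              # perfect model: all exits first
a_r = np.trapz(model - random, x)                      # area: model vs random
a_p = np.trapz(perfect - random, x)                    # area: perfect vs random
print "Accuracy Ratio: %.3f" % (a_r / a_p)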

Rule of thumb:

Another way of evaluating the model is to look at the red line: take x = 50% and read off y = X (the sketch after the list below computes X directly):

If:

90% < X < 100% -> Too Good (overfitting?)
80% < X < 90% -> Very Good
70% < X < 80% -> Good
60% < X < 70% -> Poor
X < 60% -> Rubbish
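X can also be read directly from the CAP data instead of eyeballed from the plot. A small sketch against the df_cap columns built above:

cap = df_cap.sort('total_selected')
X = cap[cap['total_selected_per'] >= 50.0]['model_select_per'].iloc[0]
print "At x = 50%% of clients contacted, the model captures X = %s%% of the exits" % X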

Test model with test data

In [67]:
df_test = pandas.DataFrame.from_csv('Churn-Modelling-Test-Data.csv', index_col=None)
df_test[:5]
Out[67]:
RowNumber CustomerId Surname CreditScore Geography Gender Age Tenure Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary Exited
0 10001 15798485 Copley 565 France Male 31 1 0.00 1 0 1 20443.08 0
1 10002 15588959 T'ang 569 France Male 34 4 0.00 1 0 1 4045.90 0
2 10003 15624896 Ku 669 France Female 20 7 0.00 2 1 0 128838.67 0
3 10004 15639629 McConnan 694 France Male 39 4 173255.48 1 1 1 81293.10 0
4 10005 15638852 Ts'ui 504 Spain Male 28 10 109291.36 1 1 1 187593.15 0
In [68]:
df_test = sm.add_constant(df_test)
In [69]:
df_test_y = df_test['Exited']
df_test = df_test.drop(['Exited'], axis = 1)
In [70]:
dummy = pandas.get_dummies(df_test['Gender'])
df_test= dummy.join(df_test)
dummy = pandas.get_dummies(df_test['Geography'])
df_test = dummy.join(df_test)
df_test[:5]
Out[70]:
France Germany Spain Female Male const RowNumber CustomerId Surname CreditScore Geography Gender Age Tenure Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary
0 1 0 0 0 1 1 10001 15798485 Copley 565 France Male 31 1 0.00 1 0 1 20443.08
1 1 0 0 0 1 1 10002 15588959 T'ang 569 France Male 34 4 0.00 1 0 1 4045.90
2 1 0 0 1 0 1 10003 15624896 Ku 669 France Female 20 7 0.00 2 1 0 128838.67
3 1 0 0 0 1 1 10004 15639629 McConnan 694 France Male 39 4 173255.48 1 1 1 81293.10
4 0 0 1 0 1 1 10005 15638852 Ts'ui 504 Spain Male 28 10 109291.36 1 1 1 187593.15
In [71]:
df_test['Balance_log'] = df_test['Balance'].map(lambda x: np.log10(x+1))

Check the variables included in the final model (model11), keep the same columns in the test set, then refit the model on the training rows and predict the test data:

In [125]:
model11.params
Out[125]:
const            -3.912576
Germany           0.747595
Female            0.526721
CreditScore      -0.000675
Age               0.072655
Tenure           -0.015879
NumOfProducts    -0.095020
IsActiveMember   -1.075776
Balance_log       0.069026
dtype: float64
In [126]:
df_test = df_test[['Germany', 'Female', 'CreditScore', 'Age', 'Tenure', 'NumOfProducts', 'IsActiveMember', 'Balance_log', 'const']]
In [127]:
df_total = df_final.append(df_test)
In [128]:
model_final = smdis.Logit(df_y, df_total[:10000]).fit()
Optimization terminated successfully.
         Current function value: 0.428257
         Iterations 6
In [139]:
y_pred = model_final.predict(df_total[10000:])
y_pred[:5]
Out[139]:
array([ 0.03811657,  0.04476137,  0.06383845,  0.08165338,  0.03915974])

Let's construct the CAP curve:

In [142]:
df_pred = pandas.DataFrame()
df_pred['phat'] = y_pred
df_pred['Exited'] = df_test_y
df_pred = df_pred.sort( 'phat', ascending=False)
df_pred[:5]
Out[142]:
phat Exited
957 0.877614 1
249 0.857731 1
429 0.832754 1
209 0.826484 1
693 0.826381 0
In [152]:
Total_Exited = sum(df_pred['Exited'])
print "Total Exited: %s" % Total_Exited
Total_Records = len(df_pred)
print "Total number of clients: %s" % Total_Records
Exit_Ratio = Total_Exited/Total_Records
print "Ratio of exits: %s" % Exit_Ratio
Total Exited: 260
Total number of clients: 1000
Ratio of exits: 0.26
In [143]:
df_pred['total_selected'] = range(1,len(df_pred)+1)
df_pred['total_selected_per'] = df_pred['total_selected'].map(lambda x: round(x*100/Total_Records, 1))
df_pred['random_selection'] = df_pred['total_selected'].map(lambda x: x*Exit_Ratio)
df_pred['random_selection_per'] = df_pred['total_selected'].map(lambda x: round(x*100/Total_Records, 1))
In [144]:
df_pred = df_pred.sort('phat', ascending=False)
In [145]:
df_pred['model_select'] = df_pred.Exited.cumsum()
df_pred['model_select_per'] = df_pred['model_select'].map(lambda x: round(x*100/Total_Exited, 1))
In [146]:
df_pred[:5]
Out[146]:
phat Exited total_selected total_selected_per random_selection random_selection_per model_select model_select_per
957 0.877614 1 1 0.1 0.26 0.1 1 0.4
249 0.857731 1 2 0.2 0.52 0.2 2 0.8
429 0.832754 1 3 0.3 0.78 0.3 3 1.2
209 0.826484 1 4 0.4 1.04 0.4 4 1.5
693 0.826381 0 5 0.5 1.30 0.5 4 1.5
In [151]:
plt.plot(df_pred['total_selected_per'], df_pred['random_selection_per'], label = 'random model' )
plt.plot(df_pred['total_selected_per'], df_pred['model_select_per'], label = 'Good Model' )
plt.plot([50, 50], [76, 0], 'r--', lw=1, label = 'x = 50%, y = X')
plt.plot([0, 50], [76, 76], 'r--', lw=1)
plt.plot([10, 100], [99, 99], 'y-', lw=3, label = 'perfect model')
plt.plot([0, 10], [0, 99], 'y-', lw=3)
plt.legend(bbox_to_anchor=(1, 1), loc='upper left', ncol=1)
plt.xlabel('Total Contacted')
plt.ylabel('Total Exited')
plt.show()

Conclusions:

  • At the 50% mark the model captures roughly 76% of the exits, slightly worse than on the training data (a drop of about 4 percentage points).
  • The test set contains about 10 times fewer records than the training data, which explains the jaggedness of the curve.

Parameters Interpretation

Odds ratio:

$$odds = \frac{p}{1-p}$$

Multiple logistic regression:

$$\ln\left(\frac{p}{1-p}\right) = b_0 + b_1x_1 + ... + b_nx_n$$

So we can write:

$$\ln(odds) = b_0 + b_1x_1 + ... + b_nx_n$$
$$\Leftrightarrow odds = e^{b_0 + b_1x_1 + ... + b_nx_n}$$
$$\Leftrightarrow odds = e^{b_0} \cdot e^{b_1x_1} \cdot ... \cdot e^{b_nx_n}$$

Now increase, for example, the variable $x_1$ by one unit:

$$e^{b_1x_1} \to e^{b_1(x_1 + 1)} = e^{b_1x_1}e^{b_1}$$

which means that increasing an independent variable, e.g. $x_1$, by 1 unit multiplies the odds by a factor of $e^{b_1}$.
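A quick numeric check of this identity, using the fitted "Age" coefficient (illustrative; any coefficient works):

b1 = model_final.params['Age']   # coefficient b_1 for Age in the final model
x1 = 40.0                        # an arbitrary starting value
print np.exp(b1 * (x1 + 1)), np.exp(b1 * x1) * np.exp(b1)   # both sides agree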

So let's transform the coefficients of our model into odds ratios:

In [166]:
odds_ratio = np.exp(model_final.params)
odds_ratio
Out[166]:
Age               1.075360
Balance_log       1.071464
CreditScore       0.999325
Female            1.693371
Germany           2.111916
IsActiveMember    0.341033
NumOfProducts     0.909355
Tenure            0.984246
const             0.019989
dtype: float64
  • "Germany" has the greater odds ratio, which means that being in Germany has the highest impact for people leaving the bank - most important independent variable that influences the model. If we change from not Germany to Germany, the odds ratio increases by 2.1.
  • The next one is gender. One we go from a male customer to a female customer with all else held constant, the odds ratio of a female custormer leaving is 1.7 times grater that a Male customer leaving. So basically the odds ratio increases by 70%.
  • Being an "ActiveMember" has the oppositive effect because the value is less than 1. This means that when we go from a member who is not active to a member that is active, the odds ratio gets multiplied by 0.34. So basically it drops by 66% which is also great in terms of retention.
  • Anything above 0.8 and bellow 1.2 has not a great impact and can be ignored.