Geodemographic Segmentation Model

Dataset: https://www.superdatascience.com/training/

Author: Filipa C. S. Rodrigues ([email protected])

In [1]:
%matplotlib inline 
import warnings
warnings.filterwarnings("ignore")
from __future__ import division
import pandas
import csv
import numpy as np
import statsmodels.api as sm
import statsmodels.discrete.discrete_model as smdis
import statsmodels.stats.outliers_influence as outliers
import matplotlib.pyplot as plt
In [2]:
df = pandas.DataFrame.from_csv('Churn-Modelling.csv', index_col=None)
df[:5]
Out[2]:
RowNumber CustomerId Surname CreditScore Geography Gender Age Tenure Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary Exited
0 1 15634602 Hargrave 619 France Female 42 2 0.00 1 1 1 101348.88 1
1 2 15647311 Hill 608 Spain Female 41 1 83807.86 1 0 1 112542.58 0
2 3 15619304 Onio 502 France Female 42 8 159660.80 3 1 0 113931.57 1
3 4 15701354 Boni 699 France Female 39 1 0.00 2 0 0 93826.63 0
4 5 15737888 Mitchell 850 Spain Female 43 2 125510.82 1 1 1 79084.10 0

Variables

  • Exited: dependent variable (y); binary
  • Gender, Geography: categorical independent variables (x)
  • All the remaining variables: numeric independent variables (x)
In [3]:
df_y = df['Exited']
df_x = df.drop(['Exited'], axis = 1)

Create dummy variables for the categorical variables:

In [4]:
dummy = pandas.get_dummies(df_x['Gender'])
df_x = dummy.join(df_x)
dummy = pandas.get_dummies(df_x['Geography'])
df_x = dummy.join(df_x)
df_x[:5]
Out[4]:
France Germany Spain Female Male RowNumber CustomerId Surname CreditScore Geography Gender Age Tenure Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary
0 1 0 0 1 0 1 15634602 Hargrave 619 France Female 42 2 0.00 1 1 1 101348.88
1 0 0 1 1 0 2 15647311 Hill 608 Spain Female 41 1 83807.86 1 0 1 112542.58
2 1 0 0 1 0 3 15619304 Onio 502 France Female 42 8 159660.80 3 1 0 113931.57
3 1 0 0 1 0 4 15701354 Boni 699 France Female 39 1 0.00 2 0 0 93826.63
4 0 0 1 1 0 5 15737888 Mitchell 850 Spain Female 43 2 125510.82 1 1 1 79084.10

For each categorical variable, one dummy level must be dropped to avoid the "dummy variable trap":

In [5]:
df_x = df_x.drop(['Gender', 'Male', 'France', 'Geography'], axis =1)
df_x[:5]
Out[5]:
Germany Spain Female RowNumber CustomerId Surname CreditScore Age Tenure Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary
0 0 0 1 1 15634602 Hargrave 619 42 2 0.00 1 1 1 101348.88
1 0 1 1 2 15647311 Hill 608 41 1 83807.86 1 0 1 112542.58
2 0 0 1 3 15619304 Onio 502 42 8 159660.80 3 1 0 113931.57
3 0 0 1 4 15701354 Boni 699 39 1 0.00 2 0 0 93826.63
4 0 1 1 5 15737888 Mitchell 850 43 2 125510.82 1 1 1 79084.10

Add a constant $b_0$ to the model:

In [6]:
df_x = sm.add_constant(df_x)
df_x[:2]
Out[6]:
const Germany Spain Female RowNumber CustomerId Surname CreditScore Age Tenure Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary
0 1 0 0 1 1 15634602 Hargrave 619 42 2 0.00 1 1 1 101348.88
1 1 0 1 1 2 15647311 Hill 608 41 1 83807.86 1 0 1 112542.58

Exclude the variables that should not affect the model:

In [7]:
df_x = df_x.drop(['RowNumber', 'CustomerId', 'Surname'], axis = 1)
df_x[:2]
Out[7]:
const Germany Spain Female CreditScore Age Tenure Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary
0 1 0 0 1 619 42 2 0.00 1 1 1 101348.88
1 1 0 1 1 608 41 1 83807.86 1 0 1 112542.58

Create a model with all the remaining variables:

In [8]:
model1 = smdis.Logit(df_y, df_x).fit()
print model1.summary()
print "\n_____P-values____"
p = model1.pvalues
print p
print "\n_____Highest p-value____"
print "\n %s \n" % p[p ==max(p)]
print "\n____Confusion Matrix___"
cm = model1.pred_table()
print cm
print "\nNumber of cases correctly predicted: %s (%s %%)" % (cm[0][0] + cm[1][1], (cm[0][0] + cm[1][1])*100/np.sum(cm))
Optimization terminated successfully.
         Current function value: 0.428068
         Iterations 6
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                 Exited   No. Observations:                10000
Model:                          Logit   Df Residuals:                     9988
Method:                           MLE   Df Model:                           11
Date:                Wed, 28 Dec 2016   Pseudo R-squ.:                  0.1532
Time:                        13:04:11   Log-Likelihood:                -4280.7
converged:                       True   LL-Null:                       -5054.9
                                        LLR p-value:                     0.000
===================================================================================
                      coef    std err          z      P>|z|      [95.0% Conf. Int.]
-----------------------------------------------------------------------------------
const              -3.9208      0.245    -15.980      0.000        -4.402    -3.440
Germany             0.7747      0.068     11.448      0.000         0.642     0.907
Spain               0.0352      0.071      0.499      0.618        -0.103     0.174
Female              0.5285      0.054      9.699      0.000         0.422     0.635
CreditScore        -0.0007      0.000     -2.384      0.017        -0.001    -0.000
Age                 0.0727      0.003     28.230      0.000         0.068     0.078
Tenure             -0.0159      0.009     -1.705      0.088        -0.034     0.002
Balance          2.637e-06   5.14e-07      5.128      0.000      1.63e-06  3.64e-06
NumOfProducts      -0.1015      0.047     -2.154      0.031        -0.194    -0.009
HasCrCard          -0.0447      0.059     -0.753      0.452        -0.161     0.072
IsActiveMember     -1.0754      0.058    -18.643      0.000        -1.189    -0.962
EstimatedSalary  4.807e-07   4.74e-07      1.015      0.310     -4.48e-07  1.41e-06
===================================================================================

_____P-values____
const               1.759879e-57
Germany             2.413604e-30
Spain               6.180842e-01
Female              3.044696e-22
CreditScore         1.712765e-02
Age                2.521111e-175
Tenure              8.821318e-02
Balance             2.922724e-07
NumOfProducts       3.124760e-02
HasCrCard           4.515135e-01
IsActiveMember      1.434412e-77
EstimatedSalary     3.101752e-01
dtype: float64

_____Highest p-value____

 Spain    0.618084
dtype: float64 


____Confusion Matrix___
[[ 7666.   297.]
 [ 1600.   437.]]

Number of cases correctly predicted: 8103.0 (81.03 %)

Note:

"Pseudo R-squ." - pseudo R-squared. Logistic regression does not have an equivalent to the R-squared that is found in OLS regression; however, many people have tried to come up with one. There are a wide variety of pseudo-R-square statistics. Because this statistic does not mean what R-square means in OLS regression (the proportion of variance explained by the predictors), we suggest interpreting this statistic with great caution. (source: http://www.ats.ucla.edu/stat/stata/output/stata_logistic.htm)

Backward Elimination

New model without the variable with the highest p-value: "Spain"

In [9]:
df_x1 = df_x.drop(['Spain'], axis = 1)
df_x1[:2]
Out[9]:
const Germany Female CreditScore Age Tenure Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary
0 1 0 1 619 42 2 0.00 1 1 1 101348.88
1 1 0 1 608 41 1 83807.86 1 0 1 112542.58
In [10]:
model2 = smdis.Logit(df_y, df_x1).fit()
print model2.summary()
print "\n_____P-values____"
p = model2.pvalues
print p
print "\n_____Highest p-value____"
print "\n %s \n" % p[p ==max(p)]
print "\n____Confusion Matrix___"
cm = model2.pred_table()
print cm
print "\nNumber of cases correctly predicted: %s (%s %%)" % (cm[0][0] + cm[1][1], (cm[0][0] + cm[1][1])*100/np.sum(cm))
Optimization terminated successfully.
         Current function value: 0.428080
         Iterations 6
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                 Exited   No. Observations:                10000
Model:                          Logit   Df Residuals:                     9989
Method:                           MLE   Df Model:                           10
Date:                Wed, 28 Dec 2016   Pseudo R-squ.:                  0.1531
Time:                        13:04:12   Log-Likelihood:                -4280.8
converged:                       True   LL-Null:                       -5054.9
                                        LLR p-value:                     0.000
===================================================================================
                      coef    std err          z      P>|z|      [95.0% Conf. Int.]
-----------------------------------------------------------------------------------
const              -3.9110      0.245    -15.994      0.000        -4.390    -3.432
Germany             0.7629      0.063     12.041      0.000         0.639     0.887
Female              0.5283      0.054      9.697      0.000         0.422     0.635
CreditScore        -0.0007      0.000     -2.378      0.017        -0.001    -0.000
Age                 0.0727      0.003     28.238      0.000         0.068     0.078
Tenure             -0.0160      0.009     -1.708      0.088        -0.034     0.002
Balance          2.637e-06   5.14e-07      5.129      0.000      1.63e-06  3.65e-06
NumOfProducts      -0.1013      0.047     -2.149      0.032        -0.194    -0.009
HasCrCard          -0.0449      0.059     -0.757      0.449        -0.161     0.071
IsActiveMember     -1.0752      0.058    -18.640      0.000        -1.188    -0.962
EstimatedSalary  4.813e-07   4.74e-07      1.016      0.310     -4.47e-07  1.41e-06
===================================================================================

_____P-values____
const               1.405607e-57
Germany             2.162265e-33
Female              3.114487e-22
CreditScore         1.739401e-02
Age                1.996715e-175
Tenure              8.764548e-02
Balance             2.913057e-07
NumOfProducts       3.161546e-02
HasCrCard           4.489327e-01
IsActiveMember      1.531167e-77
EstimatedSalary     3.095145e-01
dtype: float64

_____Highest p-value____

 HasCrCard    0.448933
dtype: float64 


____Confusion Matrix___
[[ 7665.   298.]
 [ 1602.   435.]]

Number of cases correctly predicted: 8100.0 (81.0 %)

New model without the variable with the highest p-value: "HasCrCard"

In [11]:
df_x2 = df_x1.drop(['HasCrCard'], axis = 1)
df_x2[:2]
Out[11]:
const Germany Female CreditScore Age Tenure Balance NumOfProducts IsActiveMember EstimatedSalary
0 1 0 1 619 42 2 0.00 1 1 101348.88
1 1 0 1 608 41 1 83807.86 1 1 112542.58
In [12]:
model3 = smdis.Logit(df_y, df_x2).fit()
print model3.summary()
print "\n_____P-values____"
p = model3.pvalues
print p
print "\n_____Highest p-value____"
print "\n %s \n" % p[p ==max(p)]
print "\n____Confusion Matrix___"
cm = model3.pred_table()
print cm
print "\nNumber of cases correctly predicted: %s (%s %%)" % (cm[0][0] + cm[1][1], (cm[0][0] + cm[1][1])*100/np.sum(cm))
Optimization terminated successfully.
         Current function value: 0.428109
         Iterations 6
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                 Exited   No. Observations:                10000
Model:                          Logit   Df Residuals:                     9990
Method:                           MLE   Df Model:                            9
Date:                Wed, 28 Dec 2016   Pseudo R-squ.:                  0.1531
Time:                        13:04:12   Log-Likelihood:                -4281.1
converged:                       True   LL-Null:                       -5054.9
                                        LLR p-value:                     0.000
===================================================================================
                      coef    std err          z      P>|z|      [95.0% Conf. Int.]
-----------------------------------------------------------------------------------
const              -3.9444      0.241    -16.395      0.000        -4.416    -3.473
Germany             0.7619      0.063     12.028      0.000         0.638     0.886
Female              0.5285      0.054      9.700      0.000         0.422     0.635
CreditScore        -0.0007      0.000     -2.369      0.018        -0.001    -0.000
Age                 0.0727      0.003     28.243      0.000         0.068     0.078
Tenure             -0.0162      0.009     -1.727      0.084        -0.034     0.002
Balance          2.645e-06   5.14e-07      5.146      0.000      1.64e-06  3.65e-06
NumOfProducts      -0.1013      0.047     -2.150      0.032        -0.194    -0.009
IsActiveMember     -1.0744      0.058    -18.631      0.000        -1.187    -0.961
EstimatedSalary  4.818e-07   4.74e-07      1.017      0.309     -4.47e-07  1.41e-06
===================================================================================

_____P-values____
const               2.068294e-60
Germany             2.545658e-33
Female              3.024897e-22
CreditScore         1.782382e-02
Age                1.731605e-175
Tenure              8.415226e-02
Balance             2.660186e-07
NumOfProducts       3.152337e-02
IsActiveMember      1.809198e-77
EstimatedSalary     3.090842e-01
dtype: float64

_____Highest p-value____

 EstimatedSalary    0.309084
dtype: float64 


____Confusion Matrix___
[[ 7673.   290.]
 [ 1599.   438.]]

Number of cases correctly predicted: 8111.0 (81.11 %)

New model without the variable with the highest p-value: "EstimatedSalary"

In [13]:
df_x3 = df_x2.drop(['EstimatedSalary'], axis = 1)
df_x3[:2]
Out[13]:
const Germany Female CreditScore Age Tenure Balance NumOfProducts IsActiveMember
0 1 0 1 619 42 2 0.00 1 1
1 1 0 1 608 41 1 83807.86 1 1
In [14]:
model4 = smdis.Logit(df_y, df_x3).fit()
print model4.summary()
print "\n_____P-values____"
p = model4.pvalues
print p
print "\n_____Highest p-value____"
print "\n %s \n" % p[p ==max(p)]
print "\n____Confusion Matrix___"
cm = model4.pred_table()
print cm
print "\nNumber of cases correctly predicted: %s (%s %%)" % (cm[0][0] + cm[1][1], (cm[0][0] + cm[1][1])*100/np.sum(cm))
Optimization terminated successfully.
         Current function value: 0.428161
         Iterations 6
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                 Exited   No. Observations:                10000
Model:                          Logit   Df Residuals:                     9991
Method:                           MLE   Df Model:                            8
Date:                Wed, 28 Dec 2016   Pseudo R-squ.:                  0.1530
Time:                        13:04:12   Log-Likelihood:                -4281.6
converged:                       True   LL-Null:                       -5054.9
                                        LLR p-value:                     0.000
==================================================================================
                     coef    std err          z      P>|z|      [95.0% Conf. Int.]
----------------------------------------------------------------------------------
const             -3.8959      0.236    -16.528      0.000        -4.358    -3.434
Germany            0.7621      0.063     12.031      0.000         0.638     0.886
Female             0.5290      0.054      9.710      0.000         0.422     0.636
CreditScore       -0.0007      0.000     -2.378      0.017        -0.001    -0.000
Age                0.0727      0.003     28.238      0.000         0.068     0.078
Tenure            -0.0160      0.009     -1.710      0.087        -0.034     0.002
Balance         2.653e-06   5.14e-07      5.162      0.000      1.65e-06  3.66e-06
NumOfProducts     -0.1005      0.047     -2.132      0.033        -0.193    -0.008
IsActiveMember    -1.0751      0.058    -18.644      0.000        -1.188    -0.962
==================================================================================

_____P-values____
const              2.308408e-61
Germany            2.434718e-33
Female             2.744698e-22
CreditScore        1.741371e-02
Age               2.006986e-175
Tenure             8.734004e-02
Balance            2.440782e-07
NumOfProducts      3.297139e-02
IsActiveMember     1.407482e-77
dtype: float64

_____Highest p-value____

 Tenure    0.08734
dtype: float64 


____Confusion Matrix___
[[ 7676.   287.]
 [ 1598.   439.]]

Number of cases correctly predicted: 8115.0 (81.15 %)

New model without the variable with the highest p-value: "Tenure"

In [15]:
df_x4 = df_x3.drop(['Tenure'], axis = 1)
df_x4[:2]
Out[15]:
const Germany Female CreditScore Age Balance NumOfProducts IsActiveMember
0 1 0 1 619 42 0.00 1 1
1 1 0 1 608 41 83807.86 1 1
In [16]:
model5 = smdis.Logit(df_y, df_x4).fit()
print model5.summary()
print "\n_____P-values____"
p = model5.pvalues
print p
print "\n_____Highest p-value____"
print "\n %s \n" % p[p ==max(p)]
print "\n____Confusion Matrix___"
cm = model5.pred_table()
print cm
print "\nNumber of cases correctly predicted: %s (%s %%)" % (cm[0][0] + cm[1][1], (cm[0][0] + cm[1][1])*100/np.sum(cm))
Optimization terminated successfully.
         Current function value: 0.428307
         Iterations 6
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                 Exited   No. Observations:                10000
Model:                          Logit   Df Residuals:                     9992
Method:                           MLE   Df Model:                            7
Date:                Wed, 28 Dec 2016   Pseudo R-squ.:                  0.1527
Time:                        13:04:13   Log-Likelihood:                -4283.1
converged:                       True   LL-Null:                       -5054.9
                                        LLR p-value:                     0.000
==================================================================================
                     coef    std err          z      P>|z|      [95.0% Conf. Int.]
----------------------------------------------------------------------------------
const             -3.9760      0.231    -17.200      0.000        -4.429    -3.523
Germany            0.7608      0.063     12.014      0.000         0.637     0.885
Female             0.5306      0.054      9.741      0.000         0.424     0.637
CreditScore       -0.0007      0.000     -2.377      0.017        -0.001    -0.000
Age                0.0727      0.003     28.237      0.000         0.068     0.078
Balance         2.652e-06   5.14e-07      5.160      0.000      1.64e-06  3.66e-06
NumOfProducts     -0.1010      0.047     -2.144      0.032        -0.193    -0.009
IsActiveMember    -1.0718      0.058    -18.603      0.000        -1.185    -0.959
==================================================================================

_____P-values____
const              2.658461e-66
Germany            3.005769e-33
Female             2.006779e-22
CreditScore        1.745446e-02
Age               2.037936e-175
Balance            2.463940e-07
NumOfProducts      3.199990e-02
IsActiveMember     3.044455e-77
dtype: float64

_____Highest p-value____

 NumOfProducts    0.032
dtype: float64 


____Confusion Matrix___
[[ 7676.   287.]
 [ 1601.   436.]]

Number of cases correctly predicted: 8112.0 (81.12 %)

Since the accuracy decreases when "Tenure" is removed, that variable should be kept in the model. The variable with the highest remaining p-value is now "NumOfProducts" at 0.032, which is below the defined threshold of 0.05, so the backward elimination stops here.
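The manual procedure above can also be written as a loop. This is a hedged sketch of purely p-value-driven backward elimination (note that, unlike the accuracy check above, this loop would also drop "Tenure"):

# Sketch: drop the highest-p-value variable and refit until all p-values < threshold.
def backward_eliminate(y, X, threshold=0.05):
    X = X.copy()
    while True:
        fitted = smdis.Logit(y, X).fit(disp=0)   # disp=0 silences the optimizer output
        p = fitted.pvalues.drop('const')         # never drop the constant
        if p.max() < threshold:
            return fitted
        X = X.drop(p.idxmax(), axis=1)

# Example: model_be = backward_eliminate(df_y, df_x)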

In [17]:
df_x3[:5]
Out[17]:
const Germany Female CreditScore Age Tenure Balance NumOfProducts IsActiveMember
0 1 0 1 619 42 2 0.00 1 1
1 1 0 1 608 41 1 83807.86 1 1
2 1 0 1 502 42 8 159660.80 3 0
3 1 0 1 699 39 1 0.00 2 0
4 1 0 1 850 43 2 125510.82 1 1

Transforming Independent Variables

Most common transformations:

  1. $\sqrt{x}$
  2. $x^2$
  3. $\ln(x)$

Transform "Balance" variable with ln(x):

Original value: Balance (in 1000$):

    Bal2 = Bal1 + 1unit -->  Bal2 = Bal1 + 1000$

    Scenario1: Bal1 = 1000$
        --> Bal2 = 1000$ + 1000$ = 2000$
    Scenario2: Bal1 = 10000$
        --> Bal2 = 10000$ + 1000$ = 11000$

Log10(Balance + 1):

    log10(Bal2) = log10(Bal1) + 1unit --> Bal2 = Bal1*10

    Scenario1: Bal1 = 1000$
        --> Bal2 = 1000*10 = 10000
Scenario2: Bal1 = 10000$
        ---> Bal2 = 10000*10 = 100000

Using the original value, means that if someone that starts with a balance of 1000$ and has an increase of 1000$, and someone that starts with a balance of 10000$ and has the same amount of increasing, are two completly different things.

Using the ln transformation, it has the same affect on any person regardless of their starting point. It always a 10 times increase so a unit increase has a consistent increase in the balance variable which is 10 times.

So regardless of who we're segmenting we can say that the effect of a one unit increase in the new transformed variable is consistent throughout our population and that is much more powerful because that does not restrict the logistic regression.
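A tiny numeric check of this intuition (illustrative only):

# Adding one unit on the log10 scale multiplies the raw balance by 10,
# whatever the starting balance.
for bal in [1000.0, 10000.0]:
    log_bal = np.log10(bal)
    print "balance %8.0f --> log10 = %.2f; +1 log unit --> balance %8.0f" % (bal, log_bal, 10 ** (log_bal + 1))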

In [18]:
df_x4 = df_x3  # note: plain assignment aliases df_x3 (no copy), so the new column appears in both frames
df_x4['Balance_log'] = df_x4['Balance'].map(lambda x: np.log10(x+1))
df_x4[:2]
Out[18]:
const Germany Female CreditScore Age Tenure Balance NumOfProducts IsActiveMember Balance_log
0 1 0 1 619 42 2 0.00 1 1 0.00000
1 1 0 1 608 41 1 83807.86 1 1 4.92329
In [19]:
model6 = smdis.Logit(df_y, df_x4).fit()
print model6.summary()
print "\n_____P-values____"
p = model6.pvalues
print p
print "\n_____Highest p-value____"
print "\n %s \n" % p[p ==max(p)]
print "\n____Confusion Matrix___"
cm = model6.pred_table()
print cm
print "\nNumber of cases correctly predicted: %s (%s %%)" % (cm[0][0] + cm[1][1], (cm[0][0] + cm[1][1])*100/np.sum(cm))
Optimization terminated successfully.
         Current function value: 0.428134
         Iterations 6
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                 Exited   No. Observations:                10000
Model:                          Logit   Df Residuals:                     9990
Method:                           MLE   Df Model:                            9
Date:                Wed, 28 Dec 2016   Pseudo R-squ.:                  0.1530
Time:                        13:04:14   Log-Likelihood:                -4281.3
converged:                       True   LL-Null:                       -5054.9
                                        LLR p-value:                     0.000
==================================================================================
                     coef    std err          z      P>|z|      [95.0% Conf. Int.]
----------------------------------------------------------------------------------
const             -3.9147      0.237    -16.505      0.000        -4.380    -3.450
Germany            0.7509      0.065     11.532      0.000         0.623     0.879
Female             0.5286      0.054      9.701      0.000         0.422     0.635
CreditScore       -0.0007      0.000     -2.387      0.017        -0.001    -0.000
Age                0.0727      0.003     28.233      0.000         0.068     0.078
Tenure            -0.0160      0.009     -1.706      0.088        -0.034     0.002
Balance         1.871e-06   1.19e-06      1.568      0.117     -4.68e-07  4.21e-06
NumOfProducts     -0.0957      0.048     -2.013      0.044        -0.189    -0.003
IsActiveMember    -1.0754      0.058    -18.650      0.000        -1.188    -0.962
Balance_log        0.0235      0.032      0.728      0.467        -0.040     0.087
==================================================================================

_____P-values____
const              3.364882e-61
Germany            9.111212e-31
Female             2.971672e-22
CreditScore        1.697954e-02
Age               2.287651e-175
Tenure             8.794792e-02
Balance            1.169019e-01
NumOfProducts      4.409600e-02
IsActiveMember     1.262368e-77
Balance_log        4.668374e-01
dtype: float64

_____Highest p-value____

 Balance_log    0.466837
dtype: float64 


____Confusion Matrix___
[[ 7681.   282.]
 [ 1597.   440.]]

Number of cases correctly predicted: 8121.0 (81.21 %)

The accuracy is improved!

Derived Variables

Sometimes it is useful to account for combined effects. In our case, for example, balance and age might be correlated, because the older a person is, the more wealth he/she can accumulate.

So a new variable can be created:

$$WealthAccumulation = \frac{Balance}{Age}$$
In [20]:
df_x5 = df_x4
df_x5['WealthAccumulation'] = df_x5['Balance']/ df_x5['Age']
df_x5[:2]
Out[20]:
const Germany Female CreditScore Age Tenure Balance NumOfProducts IsActiveMember Balance_log WealthAccumulation
0 1 0 1 619 42 2 0.00 1 1 0.00000 0.000000
1 1 0 1 608 41 1 83807.86 1 1 4.92329 2044.094146
In [21]:
model7 = smdis.Logit(df_y, df_x5).fit()
print model7.summary()
print "\n_____P-values____"
p = model7.pvalues
print p
print "\n_____Highest p-value____"
print "\n %s \n" % p[p ==max(p)]
print "\n____Confusion Matrix___"
cm = model7.pred_table()
print cm
print "\nNumber of cases correctly predicted: %s (%s %%)" % (cm[0][0] + cm[1][1], (cm[0][0] + cm[1][1])*100/np.sum(cm))
Optimization terminated successfully.
         Current function value: 0.427323
         Iterations 6
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                 Exited   No. Observations:                10000
Model:                          Logit   Df Residuals:                     9989
Method:                           MLE   Df Model:                           10
Date:                Wed, 28 Dec 2016   Pseudo R-squ.:                  0.1546
Time:                        13:04:15   Log-Likelihood:                -4273.2
converged:                       True   LL-Null:                       -5054.9
                                        LLR p-value:                     0.000
======================================================================================
                         coef    std err          z      P>|z|      [95.0% Conf. Int.]
--------------------------------------------------------------------------------------
const                 -3.4253      0.266    -12.896      0.000        -3.946    -2.905
Germany                0.7534      0.066     11.483      0.000         0.625     0.882
Female                 0.5289      0.055      9.693      0.000         0.422     0.636
CreditScore           -0.0007      0.000     -2.326      0.020        -0.001    -0.000
Age                    0.0611      0.004     15.903      0.000         0.054     0.069
Tenure                -0.0165      0.009     -1.762      0.078        -0.035     0.002
Balance             8.362e-06   2.02e-06      4.143      0.000      4.41e-06  1.23e-05
NumOfProducts         -0.1011      0.048     -2.119      0.034        -0.195    -0.008
IsActiveMember        -1.0601      0.058    -18.349      0.000        -1.173    -0.947
Balance_log            0.0213      0.032      0.657      0.511        -0.042     0.085
WealthAccumulation    -0.0003   6.51e-05     -3.954      0.000        -0.000    -0.000
======================================================================================

_____P-values____
const                 4.756747e-38
Germany               1.599880e-30
Female                3.244594e-22
CreditScore           2.000443e-02
Age                   6.030631e-57
Tenure                7.801725e-02
Balance               3.428919e-05
NumOfProducts         3.411316e-02
IsActiveMember        3.376037e-75
Balance_log           5.111832e-01
WealthAccumulation    7.673375e-05
dtype: float64

_____Highest p-value____

 Balance_log    0.511183
dtype: float64 


____Confusion Matrix___
[[ 7662.   301.]
 [ 1586.   451.]]

Number of cases correctly predicted: 8113.0 (81.13 %)

This new variable had a negative effect on the model accuracy.

However, one should keep in mind that there may be collinearity effects, because wealth accumulation is derived from balance and age, which are already in the model. The new variable may therefore be correlated with the other two, and including all of them can harm the model, so one of them should be excluded.

Multicollinearity

Multicollinearity is a phenomenon in which two or more predictor variables in a multiple regression model are highly correlated, meaning that one can be linearly predicted from the others with a substantial degree of accuracy.

The "Variance Inflation Factors - VIF" can be used to measure the degree of multicollinearity.

Minimum possible value = 1.0 Values > 5 may indicate a collinearity problem.

$$VIF(j) = \frac{1}{( 1 - R(j)^2)}$$

where R(j) is the multiple correlarion coefficient between variable j and the other independent variables
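The formula can be verified by hand for a single variable (a sketch using the df_x5 frame from above): regress that column on the remaining predictors with OLS and apply 1/(1 - R^2). The result should match the variance_inflation_factor output in the next cell.

# Hedged check: VIF for 'Balance' from its auxiliary regression.
others = df_x5.drop('Balance', axis=1)               # remaining predictors, incl. const
r2 = sm.OLS(df_x5['Balance'], others).fit().rsquared
print "VIF(Balance) by hand: %s" % (1.0 / (1.0 - r2))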

In [22]:
i = 0
for column in df_x5.columns:
    
    print column + " %s" % outliers.variance_inflation_factor(df_x5.values, i)
    i += 1
const 91.4773722803
Germany 1.27101224421
Female 1.00321582613
CreditScore 1.00104450767
Age 2.10487620512
Tenure 1.00161073156
Balance 20.8895673667
NumOfProducts 1.15312466504
IsActiveMember 1.01384972933
Balance_log 8.72881202924
WealthAccumulation 14.5598595824

Rerun without the "Balance_log" variable:

In [23]:
df_x6 = df_x5.drop('Balance_log', axis = 1)
In [24]:
model8 = smdis.Logit(df_y, df_x6).fit()
print model8.summary()
print "\n_____P-values____"
p = model8.pvalues
print p
print "\n_____Highest p-value____"
print "\n %s \n" % p[p ==max(p)]
print "\n____Confusion Matrix___"
cm = model8.pred_table()
print cm
print "\nNumber of cases correctly predicted: %s (%s %%)" % (cm[0][0] + cm[1][1], (cm[0][0] + cm[1][1])*100/np.sum(cm))
Optimization terminated successfully.
         Current function value: 0.427344
         Iterations 6
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                 Exited   No. Observations:                10000
Model:                          Logit   Df Residuals:                     9990
Method:                           MLE   Df Model:                            9
Date:                Wed, 28 Dec 2016   Pseudo R-squ.:                  0.1546
Time:                        13:04:15   Log-Likelihood:                -4273.4
converged:                       True   LL-Null:                       -5054.9
                                        LLR p-value:                     0.000
======================================================================================
                         coef    std err          z      P>|z|      [95.0% Conf. Int.]
--------------------------------------------------------------------------------------
const                 -3.4089      0.264    -12.904      0.000        -3.927    -2.891
Germany                0.7636      0.064     11.967      0.000         0.639     0.889
Female                 0.5292      0.055      9.699      0.000         0.422     0.636
CreditScore           -0.0007      0.000     -2.318      0.020        -0.001    -0.000
Age                    0.0612      0.004     15.938      0.000         0.054     0.069
Tenure                -0.0165      0.009     -1.766      0.077        -0.035     0.002
Balance             9.078e-06    1.7e-06      5.351      0.000      5.75e-06  1.24e-05
NumOfProducts         -0.1054      0.047     -2.229      0.026        -0.198    -0.013
IsActiveMember        -1.0598      0.058    -18.343      0.000        -1.173    -0.947
WealthAccumulation    -0.0003   6.49e-05     -3.965      0.000        -0.000    -0.000
======================================================================================

_____P-values____
const                 4.282278e-38
Germany               5.318191e-33
Female                3.042840e-22
CreditScore           2.047417e-02
Age                   3.455661e-57
Tenure                7.744433e-02
Balance               8.770661e-08
NumOfProducts         2.578202e-02
IsActiveMember        3.744959e-75
WealthAccumulation    7.325518e-05
dtype: float64

_____Highest p-value____

 Tenure    0.077444
dtype: float64 


____Confusion Matrix___
[[ 7672.   291.]
 [ 1586.   451.]]

Number of cases correctly predicted: 8123.0 (81.23 %)
In [25]:
i = 0

for column in df_x6.columns:
    
    print column + " %s" % outliers.variance_inflation_factor(df_x6.values, i)
    i += 1
const 90.6320775047
Germany 1.2161676569
Female 1.00302785174
CreditScore 1.00097995156
Age 2.10477505278
Tenure 1.00153406128
Balance 14.0245653407
NumOfProducts 1.12330226029
IsActiveMember 1.01353234553
WealthAccumulation 14.5598562679

Create a new log-transformed variable for wealth accumulation:

In [26]:
df_x7 = df_x5
df_x7['WealthAccumulation_log'] = df_x7['WealthAccumulation'].map(lambda x: np.log10(x + 1))
df_x7[:2]
Out[26]:
const Germany Female CreditScore Age Tenure Balance NumOfProducts IsActiveMember Balance_log WealthAccumulation WealthAccumulation_log
0 1 0 1 619 42 2 0.00 1 1 0.00000 0.000000 0.000000
1 1 0 1 608 41 1 83807.86 1 1 4.92329 2044.094146 3.310713
In [27]:
df_x7 = df_x7.drop(['Balance', 'WealthAccumulation'], axis = 1)
In [28]:
model9 = smdis.Logit(df_y, df_x7).fit()
print model9.summary()
print "\n_____P-values____"
p = model9.pvalues
print p
print "\n_____Highest p-value____"
print "\n %s \n" % p[p ==max(p)]
print "\n____Confusion Matrix___"
cm = model9.pred_table()
print cm
print "\nNumber of cases correctly predicted: %s (%s %%)" % (cm[0][0] + cm[1][1], (cm[0][0] + cm[1][1])*100/np.sum(cm))
Optimization terminated successfully.
         Current function value: 0.427891
         Iterations 6
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                 Exited   No. Observations:                10000
Model:                          Logit   Df Residuals:                     9990
Method:                           MLE   Df Model:                            9
Date:                Wed, 28 Dec 2016   Pseudo R-squ.:                  0.1535
Time:                        13:04:16   Log-Likelihood:                -4278.9
converged:                       True   LL-Null:                       -5054.9
                                        LLR p-value:                     0.000
==========================================================================================
                             coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------------------
const                     -3.5735      0.267    -13.362      0.000        -4.098    -3.049
Germany                    0.7516      0.065     11.492      0.000         0.623     0.880
Female                     0.5263      0.055      9.655      0.000         0.420     0.633
CreditScore               -0.0007      0.000     -2.386      0.017        -0.001    -0.000
Age                        0.0647      0.004     16.631      0.000         0.057     0.072
Tenure                    -0.0162      0.009     -1.729      0.084        -0.035     0.002
NumOfProducts             -0.0983      0.048     -2.063      0.039        -0.192    -0.005
IsActiveMember            -1.0687      0.058    -18.516      0.000        -1.182    -0.956
Balance_log                0.8699      0.297      2.932      0.003         0.288     1.451
WealthAccumulation_log    -1.1762      0.435     -2.702      0.007        -2.029    -0.323
==========================================================================================

_____P-values____
const                     1.007546e-40
Germany                   1.441579e-30
Female                    4.661027e-22
CreditScore               1.704065e-02
Age                       4.171785e-62
Tenure                    8.382654e-02
NumOfProducts             3.907377e-02
IsActiveMember            1.520894e-76
Balance_log               3.373142e-03
WealthAccumulation_log    6.887931e-03
dtype: float64

_____Highest p-value____

 Tenure    0.083827
dtype: float64 


____Confusion Matrix___
[[ 7685.   278.]
 [ 1597.   440.]]

Number of cases correctly predicted: 8125.0 (81.25 %)
In [29]:
i = 0

for column in df_x7.columns:
    
    print column + " %s" % outliers.variance_inflation_factor(df_x7.values, i)
    i += 1
const 93.5608074531
Germany 1.26986359735
Female 1.00282038022
CreditScore 1.00102945704
Age 2.26479838945
Tenure 1.00165459162
NumOfProducts 1.15243000098
IsActiveMember 1.01168357579
Balance_log 705.940971113
WealthAccumulation_log 704.739971273

"Balance_log" and "WealthAccumulation_log" have too large values which means that these two variables are basically the same thing. So one of these variables must be excluded!

In [30]:
df_x8 = df_x7.drop(['Balance_log'], axis = 1)
In [31]:
model10 = smdis.Logit(df_y, df_x8).fit()
print model10.summary()
print "\n_____P-values____"
p = model10.pvalues
print p
print "\n_____Highest p-value____"
print "\n %s \n" % p[p ==max(p)]
print "\n____Confusion Matrix___"
cm = model10.pred_table()
print cm
print "\nNumber of cases correctly predicted: %s (%s %%)" % (cm[0][0] + cm[1][1], (cm[0][0] + cm[1][1])*100/np.sum(cm))
Optimization terminated successfully.
         Current function value: 0.428322
         Iterations 6
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                 Exited   No. Observations:                10000
Model:                          Logit   Df Residuals:                     9991
Method:                           MLE   Df Model:                            8
Date:                Wed, 28 Dec 2016   Pseudo R-squ.:                  0.1527
Time:                        13:04:18   Log-Likelihood:                -4283.2
converged:                       True   LL-Null:                       -5054.9
                                        LLR p-value:                     0.000
==========================================================================================
                             coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------------------
const                     -3.9325      0.239    -16.467      0.000        -4.401    -3.464
Germany                    0.7515      0.065     11.550      0.000         0.624     0.879
Female                     0.5266      0.054      9.671      0.000         0.420     0.633
CreditScore               -0.0007      0.000     -2.409      0.016        -0.001    -0.000
Age                        0.0733      0.003     28.398      0.000         0.068     0.078
Tenure                    -0.0159      0.009     -1.697      0.090        -0.034     0.002
NumOfProducts             -0.0967      0.048     -2.034      0.042        -0.190    -0.004
IsActiveMember            -1.0763      0.058    -18.672      0.000        -1.189    -0.963
WealthAccumulation_log     0.0985      0.020      4.812      0.000         0.058     0.139
==========================================================================================

_____P-values____
const                      6.349678e-61
Germany                    7.379902e-31
Female                     4.015783e-22
CreditScore                1.598070e-02
Age                       2.132441e-177
Tenure                     8.978088e-02
NumOfProducts              4.195258e-02
IsActiveMember             8.443410e-78
WealthAccumulation_log     1.490621e-06
dtype: float64

_____Highest p-value____

 Tenure    0.089781
dtype: float64 


____Confusion Matrix___
[[ 7688.   275.]
 [ 1597.   440.]]

Number of cases correctly predicted: 8128.0 (81.28 %)

Correlation between variables:

In [32]:
df_x5[['Age', 'Balance_log', 'WealthAccumulation_log', 'WealthAccumulation']].corr()
Out[32]:
Age Balance_log WealthAccumulation_log WealthAccumulation
Age 1.000000 0.034530 -0.007524 -0.246293
Balance_log 0.034530 1.000000 0.998404 0.865141
WealthAccumulation_log -0.007524 0.998404 1.000000 0.888872
WealthAccumulation -0.246293 0.865141 0.888872 1.000000

As one can observe, the "WealthAccumulation_log" and "Balance_log" variables are highly correlated - they carry essentially the same information, which is very bad for the model.

Rule of thumb: anything above 0.9 is a very high correlation. Correlations above 0.5 should be investigated.

In [33]:
df_final = df_x5.drop(['Balance', 'WealthAccumulation', 'WealthAccumulation_log'], axis = 1)
In [34]:
model11 = smdis.Logit(df_y, df_final).fit()
print model11.summary()
print "\n_____P-values____"
p = model11.pvalues
print p
print "\n_____Highest p-value____"
print "\n %s \n" % p[p ==max(p)]
print "\n____Confusion Matrix___"
cm = model11.pred_table()
print cm
print "\nNumber of cases correctly predicted: %s (%s %%)" % (cm[0][0] + cm[1][1], (cm[0][0] + cm[1][1])*100/np.sum(cm))
Optimization terminated successfully.
         Current function value: 0.428257
         Iterations 6
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                 Exited   No. Observations:                10000
Model:                          Logit   Df Residuals:                     9991
Method:                           MLE   Df Model:                            8
Date:                Wed, 28 Dec 2016   Pseudo R-squ.:                  0.1528
Time:                        13:04:19   Log-Likelihood:                -4282.6
converged:                       True   LL-Null:                       -5054.9
                                        LLR p-value:                     0.000
==================================================================================
                     coef    std err          z      P>|z|      [95.0% Conf. Int.]
----------------------------------------------------------------------------------
const             -3.9126      0.237    -16.497      0.000        -4.377    -3.448
Germany            0.7476      0.065     11.492      0.000         0.620     0.875
Female             0.5267      0.054      9.672      0.000         0.420     0.633
CreditScore       -0.0007      0.000     -2.408      0.016        -0.001    -0.000
Age                0.0727      0.003     28.221      0.000         0.068     0.078
Tenure            -0.0159      0.009     -1.699      0.089        -0.034     0.002
NumOfProducts     -0.0950      0.048     -1.999      0.046        -0.188    -0.002
IsActiveMember    -1.0758      0.058    -18.662      0.000        -1.189    -0.963
Balance_log        0.0690      0.014      4.945      0.000         0.042     0.096
==================================================================================

_____P-values____
const              3.835044e-61
Germany            1.441009e-30
Female             3.970390e-22
CreditScore        1.604464e-02
Age               3.239286e-175
Tenure             8.932416e-02
NumOfProducts      4.562540e-02
IsActiveMember     1.011855e-77
Balance_log        7.619575e-07
dtype: float64

_____Highest p-value____

 Tenure    0.089324
dtype: float64 


____Confusion Matrix___
[[ 7687.   276.]
 [ 1597.   440.]]

Number of cases correctly predicted: 8127.0 (81.27 %)
In [35]:
i = 0

for column in df_final.columns:
    
    print column + " %s" % outliers.variance_inflation_factor(df_final.values, i)
    i += 1
const 76.3936634172
Germany 1.26934389528
Female 1.00281465894
CreditScore 1.00102204693
Age 1.01171547993
Tenure 1.0014367308
NumOfProducts 1.15145597018
IsActiveMember 1.01028026511
Balance_log 1.42007014313

Now none of the predictor variables shows problematic multicollinearity (the large VIF of the constant term can be disregarded).

CAP - Cumulative Accuracy Profile Curve

In [36]:
df_cap = df[['RowNumber', 'Exited']]
df_cap['phat'] = model11.predict()
In [37]:
df_cap.sort('phat', ascending=False)[:5]
Out[37]:
RowNumber Exited phat
4815 4816 0 0.930669
3531 3532 1 0.928311
9587 9588 0 0.923303
7499 7500 1 0.906178
9555 9556 1 0.904457
In [153]:
Total_Exited = sum(df_cap['Exited'])
print "Total Exited: %s" % Total_Exited
Total_Records = len(df)
print "Total number of clients: %s" % Total_Records
Exit_Ratio = Total_Exited/Total_Records
print "Ratio of exits: %s " % Exit_Ratio
Total Exited: 2037
Total number of clients: 10000
Ratio of exits: 0.2037 

First, I rank the clients by predicted probability of exiting, from the highest to the lowest.

In [111]:
df_cap['total_selected'] = range(1,len(df_cap)+1)
df_cap['total_selected_per'] = df_cap['total_selected'].map(lambda x: round(x*100/Total_Records, 1))
df_cap['random_selection'] = df_cap['total_selected'].map(lambda x: x*Exit_Ratio)
df_cap['random_selection_per'] = df_cap['total_selected'].map(lambda x: round(x*100/Total_Records, 1))
In [112]:
df_cap = df_cap.sort('phat', ascending=False)
In [113]:
df_cap['model_select'] = df_cap.Exited.cumsum()
df_cap['model_select_per'] = df_cap['model_select'].map(lambda x: round(x*100/Total_Exited, 1))
In [114]:
df_cap[:5]
Out[114]:
RowNumber Exited phat total_selected total_selected_per random_selection random_selection_per model_select model_select_per
4815 4816 0 0.930669 1 0.0 0.2037 0.0 0 0.0
3531 3532 1 0.928311 2 0.0 0.4074 0.0 1 0.0
9587 9588 0 0.923303 3 0.0 0.6111 0.0 1 0.0
7499 7500 1 0.906178 4 0.0 0.8148 0.0 2 0.1
9555 9556 1 0.904457 5 0.1 1.0185 0.1 3 0.1
In [115]:
df_cap = df_cap.sort('RowNumber')
plt.plot(df_cap['total_selected_per'], df_cap['random_selection_per'], label = 'random model' )
plt.plot(df_cap['total_selected_per'], df_cap['model_select_per'], label = 'Good Model' )
plt.plot([50, 50], [80, 0], 'r--', lw=1, label = 'x = 50%, y = X')
plt.plot([0, 50], [80, 80], 'r--', lw=1)
plt.plot([10, 100], [99, 99], 'y-', lw=3, label = 'perfect model')
plt.plot([0, 10], [0, 99], 'y-', lw=3)
plt.legend(bbox_to_anchor=(1, 1), loc='upper left', ncol=1)
plt.xlabel('Total Contacted')
plt.ylabel('Total Exited')
plt.show()

The closer the green line is to the yellow line, the better the model. The closer to the blue line, the worse.

Accuracy Ratio:

- Take the area between the perfect model (yellow line) and the random model (blue line): $a_p$
- Take the area between the good model (green line) and the random model (blue line): $a_r$
- The accuracy ratio is then:

$$AR = \frac{a_r}{a_p}$$

which lies between 0 and 1; the closer the ratio is to 1, the better the model.
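The areas can be approximated numerically from the CAP columns built above. A hedged sketch using the trapezoid rule (np.trapz), assuming df_cap and Exit_Ratio are still in scope:

cap = df_cap.sort('total_selected')                    # x must be increasing
x = cap['total_selected_per'].values / 100.0           # fraction of clients contacted
model = cap['model_select_per'].values / 100.0         # fraction of exits captured
random = x                                             # random model: the diagonal
perfect = np.minimum(x / Exit_Ratio, 1.0)              # perfect model: all exits first
a_r = np.trapz(model - random, x)                      # area: model vs random
a_p = np.trapz(perfect - random, x)                    # area: perfect vs random
print "Accuracy Ratio: %.3f" % (a_r / a_p)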

Rule of thumb:

Another way of evaluating the model is to look at the red line: take x = 50% and read off y = X (the sketch after the list below computes X directly):

If:

90% < X < 100% -> Too Good (overfitting?)
80% < X < 90% -> Very Good
70% < X < 80% -> Good
60% < X < 70% -> Poor
X < 60% -> Rubbish
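X can also be read directly from the CAP data instead of eyeballed from the plot. A small sketch against the df_cap columns built above:

cap = df_cap.sort('total_selected')
X = cap[cap['total_selected_per'] >= 50.0]['model_select_per'].iloc[0]
print "At x = 50%% of clients contacted, the model captures X = %s%% of the exits" % X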

Test model with test data

In [67]:
df_test = pandas.DataFrame.from_csv('Churn-Modelling-Test-Data.csv', index_col=None)
df_test[:5]
Out[67]:
RowNumber CustomerId Surname CreditScore Geography Gender Age Tenure Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary Exited
0 10001 15798485 Copley 565 France Male 31 1 0.00 1 0 1 20443.08 0
1 10002 15588959 T'ang 569 France Male 34 4 0.00 1 0 1 4045.90 0
2 10003 15624896 Ku 669 France Female 20 7 0.00 2 1 0 128838.67 0
3 10004 15639629 McConnan 694 France Male 39 4 173255.48 1 1 1 81293.10 0
4 10005 15638852 Ts'ui 504 Spain Male 28 10 109291.36 1 1 1 187593.15 0
In [68]:
df_test = sm.add_constant(df_test)
In [69]:
df_test_y = df_test['Exited']
df_test = df_test.drop(['Exited'], axis = 1)
In [70]:
dummy = pandas.get_dummies(df_test['Gender'])
df_test= dummy.join(df_test)
dummy = pandas.get_dummies(df_test['Geography'])
df_test = dummy.join(df_test)
df_test[:5]
Out[70]:
France Germany Spain Female Male const RowNumber CustomerId Surname CreditScore Geography Gender Age Tenure Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary
0 1 0 0 0 1 1 10001 15798485 Copley 565 France Male 31 1 0.00 1 0 1 20443.08
1 1 0 0 0 1 1 10002 15588959 T'ang 569 France Male 34 4 0.00 1 0 1 4045.90
2 1 0 0 1 0 1 10003 15624896 Ku 669 France Female 20 7 0.00 2 1 0 128838.67
3 1 0 0 0 1 1 10004 15639629 McConnan 694 France Male 39 4 173255.48 1 1 1 81293.10
4 0 0 1 0 1 1 10005 15638852 Ts'ui 504 Spain Male 28 10 109291.36 1 1 1 187593.15
In [71]:
df_test['Balance_log'] = df_test['Balance'].map(lambda x: np.log10(x+1))

Check the variables included in the final model (model11), keep the same columns in the test set, then refit the model on the training rows and predict the test data:

In [125]:
model11.params
Out[125]:
const            -3.912576
Germany           0.747595
Female            0.526721
CreditScore      -0.000675
Age               0.072655
Tenure           -0.015879
NumOfProducts    -0.095020
IsActiveMember   -1.075776
Balance_log       0.069026
dtype: float64
In [126]:
df_test = df_test[['Germany', 'Female', 'CreditScore', 'Age', 'Tenure', 'NumOfProducts', 'IsActiveMember', 'Balance_log', 'const']]
In [127]:
df_total = df_final.append(df_test)
In [128]:
model_final = smdis.Logit(df_y, df_total[:10000]).fit()
Optimization terminated successfully.
         Current function value: 0.428257
         Iterations 6
In [139]:
y_pred = model_final.predict(df_total[10000:])
y_pred[:5]
Out[139]:
array([ 0.03811657,  0.04476137,  0.06383845,  0.08165338,  0.03915974])

Let's construct the CAP curve:

In [142]:
df_pred = pandas.DataFrame()
df_pred['phat'] = y_pred
df_pred['Exited'] = df_test_y
df_pred = df_pred.sort( 'phat', ascending=False)
df_pred[:5]
Out[142]:
phat Exited
957 0.877614 1
249 0.857731 1
429 0.832754 1
209 0.826484 1
693 0.826381 0
In [152]:
Total_Exited = sum(df_pred['Exited'])
print "Total Exited: %s" % Total_Exited
Total_Records = len(df_pred)
print "Total number of clients: %s" % Total_Records
Exit_Ratio = Total_Exited/Total_Records
print "Ratio of exits: %s" % Exit_Ratio
Total Exited: 260
Total number of clients: 1000
Ratio of exits: 0.26
In [143]:
df_pred['total_selected'] = range(1,len(df_pred)+1)
df_pred['total_selected_per'] = df_pred['total_selected'].map(lambda x: round(x*100/Total_Records, 1))
df_pred['random_selection'] = df_pred['total_selected'].map(lambda x: x*Exit_Ratio)
df_pred['random_selection_per'] = df_pred['total_selected'].map(lambda x: round(x*100/Total_Records, 1))
In [144]:
df_pred = df_pred.sort('phat', ascending=False)
In [145]:
df_pred['model_select'] = df_pred.Exited.cumsum()
df_pred['model_select_per'] = df_pred['model_select'].map(lambda x: round(x*100/Total_Exited, 1))
In [146]:
df_pred[:5]
Out[146]:
phat Exited total_selected total_selected_per random_selection random_selection_per model_select model_select_per
957 0.877614 1 1 0.1 0.26 0.1 1 0.4
249 0.857731 1 2 0.2 0.52 0.2 2 0.8
429 0.832754 1 3 0.3 0.78 0.3 3 1.2
209 0.826484 1 4 0.4 1.04 0.4 4 1.5
693 0.826381 0 5 0.5 1.30 0.5 4 1.5
In [151]:
plt.plot(df_pred['total_selected_per'], df_pred['random_selection_per'], label = 'random model' )
plt.plot(df_pred['total_selected_per'], df_pred['model_select_per'], label = 'Good Model' )
plt.plot([50, 50], [76, 0], 'r--', lw=1, label = 'x = 50%, y = X')
plt.plot([0, 50], [76, 76], 'r--', lw=1)
plt.plot([10, 100], [99, 99], 'y-', lw=3, label = 'perfect model')
plt.plot([0, 10], [0, 99], 'y-', lw=3)
plt.legend(bbox_to_anchor=(1, 1), loc='upper left', ncol=1)
plt.xlabel('Total Contacted')
plt.ylabel('Total Exited')
plt.show()

Conclusions:

  • At the 50% mark the model captures roughly 76% of the exits, slightly worse than on the training data (a drop of about 4 percentage points).
  • The test set contains about 10 times fewer records than the training data, which explains the jaggedness of the curve.

Parameters Interpretation

Odds ratio:

$$odds = \frac{p}{1-p}$$

Multiple logistic regression:

$$\ln\left(\frac{p}{1-p}\right) = b_0 + b_1x_1 + ... + b_nx_n$$

So we can write:

$$\ln(odds) = b_0 + b_1x_1 + ... + b_nx_n$$
$$\Leftrightarrow odds = e^{b_0 + b_1x_1 + ... + b_nx_n}$$
$$\Leftrightarrow odds = e^{b_0} \cdot e^{b_1x_1} \cdot ... \cdot e^{b_nx_n}$$

Now increase, for example, the variable $x_1$ by one unit:

$$e^{b_1x_1} \to e^{b_1(x_1 + 1)} = e^{b_1x_1}e^{b_1}$$

which means that increasing an independent variable, e.g. $x_1$, by 1 unit multiplies the odds by a factor of $e^{b_1}$.
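A quick numeric check of this identity, using the fitted "Age" coefficient (illustrative; any coefficient works):

b1 = model_final.params['Age']   # coefficient b_1 for Age in the final model
x1 = 40.0                        # an arbitrary starting value
print np.exp(b1 * (x1 + 1)), np.exp(b1 * x1) * np.exp(b1)   # both sides agree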

So let's transform the coefficients of our model into odds ratios:

In [166]:
odds_ratio = np.exp(model_final.params)
odds_ratio
Out[166]:
Age               1.075360
Balance_log       1.071464
CreditScore       0.999325
Female            1.693371
Germany           2.111916
IsActiveMember    0.341033
NumOfProducts     0.909355
Tenure            0.984246
const             0.019989
dtype: float64
  • "Germany" has the greater odds ratio, which means that being in Germany has the highest impact for people leaving the bank - most important independent variable that influences the model. If we change from not Germany to Germany, the odds ratio increases by 2.1.
  • The next one is gender. One we go from a male customer to a female customer with all else held constant, the odds ratio of a female custormer leaving is 1.7 times grater that a Male customer leaving. So basically the odds ratio increases by 70%.
  • Being an "ActiveMember" has the oppositive effect because the value is less than 1. This means that when we go from a member who is not active to a member that is active, the odds ratio gets multiplied by 0.34. So basically it drops by 66% which is also great in terms of retention.
  • Anything above 0.8 and bellow 1.2 has not a great impact and can be ignored.