Dataset: https://www.superdatascience.com/training/
Author: Filipa C. S. Rodrigues (filipacsrodrigues@gmail.com)
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
from __future__ import division
import pandas
import csv
import numpy as np
import statsmodels.api as sm
import statsmodels.discrete.discrete_model as smdis
import statsmodels.stats.outliers_influence as outliers
import matplotlib.pyplot as plt
df = pandas.read_csv('Churn-Modelling.csv')
df[:5]
RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.00 | 1 | 1 | 1 | 101348.88 | 1 |
1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.80 | 3 | 1 | 0 | 113931.57 | 1 |
3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0.00 | 2 | 0 | 0 | 93826.63 | 0 |
4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.10 | 0 |
df_y = df['Exited']
df_x = df.drop(['Exited'], axis = 1)
Create dummy variables for the categorical variables:
dummy = pandas.get_dummies(df_x['Gender'])
df_x = dummy.join(df_x)
dummy = pandas.get_dummies(df_x['Geography'])
df_x = dummy.join(df_x)
df_x[:5]
France | Germany | Spain | Female | Male | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 0 | 1 | 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.00 | 1 | 1 | 1 | 101348.88 |
1 | 0 | 0 | 1 | 1 | 0 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 |
2 | 1 | 0 | 0 | 1 | 0 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.80 | 3 | 1 | 0 | 113931.57 |
3 | 1 | 0 | 0 | 1 | 0 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0.00 | 2 | 0 | 0 | 93826.63 |
4 | 0 | 0 | 1 | 1 | 0 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.10 |
Only one dummy variable should be used to avoid the "dummy variable trap":
df_x = df_x.drop(['Gender', 'Male', 'France', 'Geography'], axis =1)
df_x[:5]
Germany | Spain | Female | RowNumber | CustomerId | Surname | CreditScore | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 1 | 1 | 15634602 | Hargrave | 619 | 42 | 2 | 0.00 | 1 | 1 | 1 | 101348.88 |
1 | 0 | 1 | 1 | 2 | 15647311 | Hill | 608 | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 |
2 | 0 | 0 | 1 | 3 | 15619304 | Onio | 502 | 42 | 8 | 159660.80 | 3 | 1 | 0 | 113931.57 |
3 | 0 | 0 | 1 | 4 | 15701354 | Boni | 699 | 39 | 1 | 0.00 | 2 | 0 | 0 | 93826.63 |
4 | 0 | 1 | 1 | 5 | 15737888 | Mitchell | 850 | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.10 |
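As a side note, the join-and-drop sequence above can be collapsed into a single call: `pandas.get_dummies` accepts a `drop_first` flag that keeps n - 1 dummies per categorical variable and so avoids the trap directly. A minimal sketch on a hypothetical two-column frame (note that `drop_first` drops the alphabetically first level, so it would keep `Gender_Male` rather than `Female`):

```python
import pandas as pd

# Hypothetical mini-frame standing in for df_x; levels match the dataset.
df = pd.DataFrame({'Geography': ['France', 'Spain', 'Germany'],
                   'Gender': ['Female', 'Female', 'Male']})

# drop_first=True keeps n - 1 dummies per variable, sidestepping the
# dummy variable trap without a separate drop() call.
dummies = pd.get_dummies(df, columns=['Geography', 'Gender'], drop_first=True)
print(dummies.columns.tolist())
```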
Add a constant $b_0$ to the model:
df_x = sm.add_constant(df_x)
df_x[:2]
const | Germany | Spain | Female | RowNumber | CustomerId | Surname | CreditScore | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 0 | 1 | 1 | 15634602 | Hargrave | 619 | 42 | 2 | 0.00 | 1 | 1 | 1 | 101348.88 |
1 | 1 | 0 | 1 | 1 | 2 | 15647311 | Hill | 608 | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 |
Exclude the variables that should not affect the model:
df_x = df_x.drop(['RowNumber', 'CustomerId', 'Surname'], axis = 1)
df_x[:2]
const | Germany | Spain | Female | CreditScore | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 0 | 1 | 619 | 42 | 2 | 0.00 | 1 | 1 | 1 | 101348.88 |
1 | 1 | 0 | 1 | 1 | 608 | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 |
Create a model with all the remaining variables:
model1 = smdis.Logit(df_y, df_x).fit()
print model1.summary()
print "\n_____P-values____"
p = model1.pvalues
print p
print "\n_____Highest p-value____"
print "\n %s \n" % p[p ==max(p)]
print "\n____Confusion Matrix___"
cm = model1.pred_table()
print cm
print "\nNumber of cases correctly predicted: %s (%s %%)" % (cm[0][0] + cm[1][1], (cm[0][0] + cm[1][1])*100/np.sum(cm))
Optimization terminated successfully.
         Current function value: 0.428068
         Iterations 6
                           Logit Regression Results
==============================================================================
Dep. Variable:                 Exited   No. Observations:                10000
Model:                          Logit   Df Residuals:                     9988
Method:                           MLE   Df Model:                           11
Date:                Wed, 28 Dec 2016   Pseudo R-squ.:                  0.1532
Time:                        13:04:11   Log-Likelihood:                -4280.7
converged:                       True   LL-Null:                       -5054.9
                                        LLR p-value:                     0.000
===================================================================================
                      coef    std err          z      P>|z|      [95.0% Conf. Int.]
-----------------------------------------------------------------------------------
const              -3.9208      0.245    -15.980      0.000        -4.402    -3.440
Germany             0.7747      0.068     11.448      0.000         0.642     0.907
Spain               0.0352      0.071      0.499      0.618        -0.103     0.174
Female              0.5285      0.054      9.699      0.000         0.422     0.635
CreditScore        -0.0007      0.000     -2.384      0.017        -0.001    -0.000
Age                 0.0727      0.003     28.230      0.000         0.068     0.078
Tenure             -0.0159      0.009     -1.705      0.088        -0.034     0.002
Balance          2.637e-06   5.14e-07      5.128      0.000      1.63e-06  3.64e-06
NumOfProducts      -0.1015      0.047     -2.154      0.031        -0.194    -0.009
HasCrCard          -0.0447      0.059     -0.753      0.452        -0.161     0.072
IsActiveMember     -1.0754      0.058    -18.643      0.000        -1.189    -0.962
EstimatedSalary  4.807e-07   4.74e-07      1.015      0.310     -4.48e-07  1.41e-06
===================================================================================

_____P-values____
const               1.759879e-57
Germany             2.413604e-30
Spain               6.180842e-01
Female              3.044696e-22
CreditScore         1.712765e-02
Age                2.521111e-175
Tenure              8.821318e-02
Balance             2.922724e-07
NumOfProducts       3.124760e-02
HasCrCard           4.515135e-01
IsActiveMember      1.434412e-77
EstimatedSalary     3.101752e-01
dtype: float64

_____Highest p-value____
Spain    0.618084
dtype: float64

____Confusion Matrix___
[[ 7666.   297.]
 [ 1600.   437.]]

Number of cases correctly predicted: 8103.0 (81.03 %)
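The accuracy printed above is simply the diagonal of the confusion matrix over its grand total. Since the same evaluation code is repeated for every model below, a small helper (a sketch, not part of the original notebook) captures it:

```python
import numpy as np

def accuracy_from_cm(cm):
    """Fraction of correctly predicted cases: diagonal over grand total."""
    cm = np.asarray(cm, dtype=float)
    return np.trace(cm) / cm.sum()

# The confusion matrix reported for model1 above.
cm = np.array([[7666., 297.], [1600., 437.]])
print("%.2f %%" % (100 * accuracy_from_cm(cm)))  # 81.03 %
```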
Note:
"Pseudo R-squ." - pseudo R-squared. Logistic regression does not have an equivalent to the R-squared that is found in OLS regression; however, many people have tried to come up with one. There are a wide variety of pseudo-R-square statistics. Because this statistic does not mean what R-square means in OLS regression (the proportion of variance explained by the predictors), we suggest interpreting this statistic with great caution. (source: http://www.ats.ucla.edu/stat/stata/output/stata_logistic.htm)
Backward Elimination
New model without the variable with the highest p-value: "Spain"
df_x1 = df_x.drop(['Spain'], axis = 1)
df_x1[:2]
const | Germany | Female | CreditScore | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 1 | 619 | 42 | 2 | 0.00 | 1 | 1 | 1 | 101348.88 |
1 | 1 | 0 | 1 | 608 | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 |
model2 = smdis.Logit(df_y, df_x1).fit()
print model2.summary()
print "\n_____P-values____"
p = model2.pvalues
print p
print "\n_____Highest p-value____"
print "\n %s \n" % p[p ==max(p)]
print "\n____Confusion Matrix___"
cm = model2.pred_table()
print cm
print "\nNumber of cases correctly predicted: %s (%s %%)" % (cm[0][0] + cm[1][1], (cm[0][0] + cm[1][1])*100/np.sum(cm))
Optimization terminated successfully. Current function value: 0.428080 Iterations 6 Logit Regression Results ============================================================================== Dep. Variable: Exited No. Observations: 10000 Model: Logit Df Residuals: 9989 Method: MLE Df Model: 10 Date: Wed, 28 Dec 2016 Pseudo R-squ.: 0.1531 Time: 13:04:12 Log-Likelihood: -4280.8 converged: True LL-Null: -5054.9 LLR p-value: 0.000 =================================================================================== coef std err z P>|z| [95.0% Conf. Int.] ----------------------------------------------------------------------------------- const -3.9110 0.245 -15.994 0.000 -4.390 -3.432 Germany 0.7629 0.063 12.041 0.000 0.639 0.887 Female 0.5283 0.054 9.697 0.000 0.422 0.635 CreditScore -0.0007 0.000 -2.378 0.017 -0.001 -0.000 Age 0.0727 0.003 28.238 0.000 0.068 0.078 Tenure -0.0160 0.009 -1.708 0.088 -0.034 0.002 Balance 2.637e-06 5.14e-07 5.129 0.000 1.63e-06 3.65e-06 NumOfProducts -0.1013 0.047 -2.149 0.032 -0.194 -0.009 HasCrCard -0.0449 0.059 -0.757 0.449 -0.161 0.071 IsActiveMember -1.0752 0.058 -18.640 0.000 -1.188 -0.962 EstimatedSalary 4.813e-07 4.74e-07 1.016 0.310 -4.47e-07 1.41e-06 =================================================================================== _____P-values____ const 1.405607e-57 Germany 2.162265e-33 Female 3.114487e-22 CreditScore 1.739401e-02 Age 1.996715e-175 Tenure 8.764548e-02 Balance 2.913057e-07 NumOfProducts 3.161546e-02 HasCrCard 4.489327e-01 IsActiveMember 1.531167e-77 EstimatedSalary 3.095145e-01 dtype: float64 _____Highest p-value____ HasCrCard 0.448933 dtype: float64 ____Confusion Matrix___ [[ 7665. 298.] [ 1602. 435.]] Number of cases correctly predicted: 8100.0 (81.0 %)
New model without the variable with the highest p-value: "HasCrCard"
df_x2 = df_x1.drop(['HasCrCard'], axis = 1)
df_x2[:2]
const | Germany | Female | CreditScore | Age | Tenure | Balance | NumOfProducts | IsActiveMember | EstimatedSalary | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 1 | 619 | 42 | 2 | 0.00 | 1 | 1 | 101348.88 |
1 | 1 | 0 | 1 | 608 | 41 | 1 | 83807.86 | 1 | 1 | 112542.58 |
model3 = smdis.Logit(df_y, df_x2).fit()
print model3.summary()
print "\n_____P-values____"
p = model3.pvalues
print p
print "\n_____Highest p-value____"
print "\n %s \n" % p[p ==max(p)]
print "\n____Confusion Matrix___"
cm = model3.pred_table()
print cm
print "\nNumber of cases correctly predicted: %s (%s %%)" % (cm[0][0] + cm[1][1], (cm[0][0] + cm[1][1])*100/np.sum(cm))
Optimization terminated successfully. Current function value: 0.428109 Iterations 6 Logit Regression Results ============================================================================== Dep. Variable: Exited No. Observations: 10000 Model: Logit Df Residuals: 9990 Method: MLE Df Model: 9 Date: Wed, 28 Dec 2016 Pseudo R-squ.: 0.1531 Time: 13:04:12 Log-Likelihood: -4281.1 converged: True LL-Null: -5054.9 LLR p-value: 0.000 =================================================================================== coef std err z P>|z| [95.0% Conf. Int.] ----------------------------------------------------------------------------------- const -3.9444 0.241 -16.395 0.000 -4.416 -3.473 Germany 0.7619 0.063 12.028 0.000 0.638 0.886 Female 0.5285 0.054 9.700 0.000 0.422 0.635 CreditScore -0.0007 0.000 -2.369 0.018 -0.001 -0.000 Age 0.0727 0.003 28.243 0.000 0.068 0.078 Tenure -0.0162 0.009 -1.727 0.084 -0.034 0.002 Balance 2.645e-06 5.14e-07 5.146 0.000 1.64e-06 3.65e-06 NumOfProducts -0.1013 0.047 -2.150 0.032 -0.194 -0.009 IsActiveMember -1.0744 0.058 -18.631 0.000 -1.187 -0.961 EstimatedSalary 4.818e-07 4.74e-07 1.017 0.309 -4.47e-07 1.41e-06 =================================================================================== _____P-values____ const 2.068294e-60 Germany 2.545658e-33 Female 3.024897e-22 CreditScore 1.782382e-02 Age 1.731605e-175 Tenure 8.415226e-02 Balance 2.660186e-07 NumOfProducts 3.152337e-02 IsActiveMember 1.809198e-77 EstimatedSalary 3.090842e-01 dtype: float64 _____Highest p-value____ EstimatedSalary 0.309084 dtype: float64 ____Confusion Matrix___ [[ 7673. 290.] [ 1599. 438.]] Number of cases correctly predicted: 8111.0 (81.11 %)
New model without the variable with the highest p-value: "EstimatedSalary"
df_x3 = df_x2.drop(['EstimatedSalary'], axis = 1)
df_x3[:2]
const | Germany | Female | CreditScore | Age | Tenure | Balance | NumOfProducts | IsActiveMember | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 1 | 619 | 42 | 2 | 0.00 | 1 | 1 |
1 | 1 | 0 | 1 | 608 | 41 | 1 | 83807.86 | 1 | 1 |
model4 = smdis.Logit(df_y, df_x3).fit()
print model4.summary()
print "\n_____P-values____"
p = model4.pvalues
print p
print "\n_____Highest p-value____"
print "\n %s \n" % p[p ==max(p)]
print "\n____Confusion Matrix___"
cm = model4.pred_table()
print cm
print "\nNumber of cases correctly predicted: %s (%s %%)" % (cm[0][0] + cm[1][1], (cm[0][0] + cm[1][1])*100/np.sum(cm))
Optimization terminated successfully. Current function value: 0.428161 Iterations 6 Logit Regression Results ============================================================================== Dep. Variable: Exited No. Observations: 10000 Model: Logit Df Residuals: 9991 Method: MLE Df Model: 8 Date: Wed, 28 Dec 2016 Pseudo R-squ.: 0.1530 Time: 13:04:12 Log-Likelihood: -4281.6 converged: True LL-Null: -5054.9 LLR p-value: 0.000 ================================================================================== coef std err z P>|z| [95.0% Conf. Int.] ---------------------------------------------------------------------------------- const -3.8959 0.236 -16.528 0.000 -4.358 -3.434 Germany 0.7621 0.063 12.031 0.000 0.638 0.886 Female 0.5290 0.054 9.710 0.000 0.422 0.636 CreditScore -0.0007 0.000 -2.378 0.017 -0.001 -0.000 Age 0.0727 0.003 28.238 0.000 0.068 0.078 Tenure -0.0160 0.009 -1.710 0.087 -0.034 0.002 Balance 2.653e-06 5.14e-07 5.162 0.000 1.65e-06 3.66e-06 NumOfProducts -0.1005 0.047 -2.132 0.033 -0.193 -0.008 IsActiveMember -1.0751 0.058 -18.644 0.000 -1.188 -0.962 ================================================================================== _____P-values____ const 2.308408e-61 Germany 2.434718e-33 Female 2.744698e-22 CreditScore 1.741371e-02 Age 2.006986e-175 Tenure 8.734004e-02 Balance 2.440782e-07 NumOfProducts 3.297139e-02 IsActiveMember 1.407482e-77 dtype: float64 _____Highest p-value____ Tenure 0.08734 dtype: float64 ____Confusion Matrix___ [[ 7676. 287.] [ 1598. 439.]] Number of cases correctly predicted: 8115.0 (81.15 %)
New model without the variable with the highest p-value: "Tenure"
df_x4 = df_x3.drop(['Tenure'], axis = 1)
df_x4[:2]
const | Germany | Female | CreditScore | Age | Balance | NumOfProducts | IsActiveMember | |
---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 1 | 619 | 42 | 0.00 | 1 | 1 |
1 | 1 | 0 | 1 | 608 | 41 | 83807.86 | 1 | 1 |
model5 = smdis.Logit(df_y, df_x4).fit()
print model5.summary()
print "\n_____P-values____"
p = model5.pvalues
print p
print "\n_____Highest p-value____"
print "\n %s \n" % p[p ==max(p)]
print "\n____Confusion Matrix___"
cm = model5.pred_table()
print cm
print "\nNumber of cases correctly predicted: %s (%s %%)" % (cm[0][0] + cm[1][1], (cm[0][0] + cm[1][1])*100/np.sum(cm))
Optimization terminated successfully. Current function value: 0.428307 Iterations 6 Logit Regression Results ============================================================================== Dep. Variable: Exited No. Observations: 10000 Model: Logit Df Residuals: 9992 Method: MLE Df Model: 7 Date: Wed, 28 Dec 2016 Pseudo R-squ.: 0.1527 Time: 13:04:13 Log-Likelihood: -4283.1 converged: True LL-Null: -5054.9 LLR p-value: 0.000 ================================================================================== coef std err z P>|z| [95.0% Conf. Int.] ---------------------------------------------------------------------------------- const -3.9760 0.231 -17.200 0.000 -4.429 -3.523 Germany 0.7608 0.063 12.014 0.000 0.637 0.885 Female 0.5306 0.054 9.741 0.000 0.424 0.637 CreditScore -0.0007 0.000 -2.377 0.017 -0.001 -0.000 Age 0.0727 0.003 28.237 0.000 0.068 0.078 Balance 2.652e-06 5.14e-07 5.160 0.000 1.64e-06 3.66e-06 NumOfProducts -0.1010 0.047 -2.144 0.032 -0.193 -0.009 IsActiveMember -1.0718 0.058 -18.603 0.000 -1.185 -0.959 ================================================================================== _____P-values____ const 2.658461e-66 Germany 3.005769e-33 Female 2.006779e-22 CreditScore 1.745446e-02 Age 2.037936e-175 Balance 2.463940e-07 NumOfProducts 3.199990e-02 IsActiveMember 3.044455e-77 dtype: float64 _____Highest p-value____ NumOfProducts 0.032 dtype: float64 ____Confusion Matrix___ [[ 7676. 287.] [ 1601. 436.]] Number of cases correctly predicted: 8112.0 (81.12 %)
Since the accuracy decreases, the variable "Tenure" should be kept in the model. The variable with the highest p-value is now "NumOfProducts" at 0.032, which is below the defined threshold of 0.05, so backward elimination stops here.
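The manual remove-refit cycle above can be automated. Below is a sketch of such a loop; `backward_eliminate` and `fit_fn` are hypothetical names, and the function works with any fitter that returns an object exposing a pandas `.pvalues` Series, e.g. `lambda X: smdis.Logit(df_y, X).fit()`:

```python
def backward_eliminate(fit_fn, X, threshold=0.05):
    """Repeatedly refit, dropping the predictor with the highest p-value,
    until every remaining p-value (ignoring the constant) is <= threshold."""
    X = X.copy()
    while True:
        model = fit_fn(X)
        p = model.pvalues.drop('const', errors='ignore')
        worst = p.idxmax()
        if p[worst] <= threshold:
            return model, X
        X = X.drop(worst, axis=1)
```

Note that a purely p-value-driven loop ignores the accuracy check used above, so it would have discarded "Tenure" as well.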
df_x3[:5]
const | Germany | Female | CreditScore | Age | Tenure | Balance | NumOfProducts | IsActiveMember | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 1 | 619 | 42 | 2 | 0.00 | 1 | 1 |
1 | 1 | 0 | 1 | 608 | 41 | 1 | 83807.86 | 1 | 1 |
2 | 1 | 0 | 1 | 502 | 42 | 8 | 159660.80 | 3 | 0 |
3 | 1 | 0 | 1 | 699 | 39 | 1 | 0.00 | 2 | 0 |
4 | 1 | 0 | 1 | 850 | 43 | 2 | 125510.82 | 1 | 1 |
A common way to handle a heavily skewed monetary variable such as "Balance" is a logarithmic transformation, here log10(Balance + 1) (the +1 keeps zero balances defined).

Original value, Balance (in $): a one-unit increase means adding a fixed amount, e.g. $1,000:

Scenario 1: Bal1 = $1,000 --> Bal2 = $1,000 + $1,000 = $2,000
Scenario 2: Bal1 = $10,000 --> Bal2 = $10,000 + $1,000 = $11,000

log10(Balance + 1): a one-unit increase on the log scale means multiplying the balance by 10:

Scenario 1: Bal1 = $1,000 --> Bal2 = $1,000 * 10 = $10,000
Scenario 2: Bal1 = $10,000 --> Bal2 = $10,000 * 10 = $100,000

On the original scale, a $1,000 increase means something completely different to someone starting at $1,000 than to someone starting at $10,000. With the log transformation, a one-unit increase has the same relative effect (a tenfold increase) regardless of the starting point.

So, regardless of the segment we are looking at, the effect of a one-unit increase in the transformed variable is consistent throughout the population, which makes the coefficient far easier to interpret in the logistic regression.
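The tenfold reading can be verified numerically with numpy: adding one unit on the log10 scale multiplies the raw balance by exactly 10.

```python
import numpy as np

bal = np.array([1000.0, 10000.0])
# one extra unit on the log10 scale is the same as multiplying by 10
stepped = 10 ** (np.log10(bal) + 1)
print(stepped)  # ten times bal in each case
```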
df_x4 = df_x3
df_x4['Balance_log'] = df_x4['Balance'].map(lambda x: np.log10(x+1))
df_x4[:2]
const | Germany | Female | CreditScore | Age | Tenure | Balance | NumOfProducts | IsActiveMember | Balance_log | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 1 | 619 | 42 | 2 | 0.00 | 1 | 1 | 0.00000 |
1 | 1 | 0 | 1 | 608 | 41 | 1 | 83807.86 | 1 | 1 | 4.92329 |
model6 = smdis.Logit(df_y, df_x4).fit()
print model6.summary()
print "\n_____P-values____"
p = model6.pvalues
print p
print "\n_____Highest p-value____"
print "\n %s \n" % p[p ==max(p)]
print "\n____Confusion Matrix___"
cm = model6.pred_table()
print cm
print "\nNumber of cases correctly predicted: %s (%s %%)" % (cm[0][0] + cm[1][1], (cm[0][0] + cm[1][1])*100/np.sum(cm))
Optimization terminated successfully. Current function value: 0.428134 Iterations 6 Logit Regression Results ============================================================================== Dep. Variable: Exited No. Observations: 10000 Model: Logit Df Residuals: 9990 Method: MLE Df Model: 9 Date: Wed, 28 Dec 2016 Pseudo R-squ.: 0.1530 Time: 13:04:14 Log-Likelihood: -4281.3 converged: True LL-Null: -5054.9 LLR p-value: 0.000 ================================================================================== coef std err z P>|z| [95.0% Conf. Int.] ---------------------------------------------------------------------------------- const -3.9147 0.237 -16.505 0.000 -4.380 -3.450 Germany 0.7509 0.065 11.532 0.000 0.623 0.879 Female 0.5286 0.054 9.701 0.000 0.422 0.635 CreditScore -0.0007 0.000 -2.387 0.017 -0.001 -0.000 Age 0.0727 0.003 28.233 0.000 0.068 0.078 Tenure -0.0160 0.009 -1.706 0.088 -0.034 0.002 Balance 1.871e-06 1.19e-06 1.568 0.117 -4.68e-07 4.21e-06 NumOfProducts -0.0957 0.048 -2.013 0.044 -0.189 -0.003 IsActiveMember -1.0754 0.058 -18.650 0.000 -1.188 -0.962 Balance_log 0.0235 0.032 0.728 0.467 -0.040 0.087 ================================================================================== _____P-values____ const 3.364882e-61 Germany 9.111212e-31 Female 2.971672e-22 CreditScore 1.697954e-02 Age 2.287651e-175 Tenure 8.794792e-02 Balance 1.169019e-01 NumOfProducts 4.409600e-02 IsActiveMember 1.262368e-77 Balance_log 4.668374e-01 dtype: float64 _____Highest p-value____ Balance_log 0.466837 dtype: float64 ____Confusion Matrix___ [[ 7681. 282.] [ 1597. 440.]] Number of cases correctly predicted: 8121.0 (81.21 %)
The accuracy is improved!
Sometimes it is useful to account for interactions between variables. In our case, Balance and Age might be correlated, because the older a person is, the more wealth he or she may have accumulated.
So a new variable can be created:
$$WealthAccumulation = \frac{Balance}{Age}$$
df_x5 = df_x4
df_x5['WealthAccumulation'] = df_x5['Balance']/ df_x5['Age']
df_x5[:2]
const | Germany | Female | CreditScore | Age | Tenure | Balance | NumOfProducts | IsActiveMember | Balance_log | WealthAccumulation | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 1 | 619 | 42 | 2 | 0.00 | 1 | 1 | 0.00000 | 0.000000 |
1 | 1 | 0 | 1 | 608 | 41 | 1 | 83807.86 | 1 | 1 | 4.92329 | 2044.094146 |
model7 = smdis.Logit(df_y, df_x5).fit()
print model7.summary()
print "\n_____P-values____"
p = model7.pvalues
print p
print "\n_____Highest p-value____"
print "\n %s \n" % p[p ==max(p)]
print "\n____Confusion Matrix___"
cm = model7.pred_table()
print cm
print "\nNumber of cases correctly predicted: %s (%s %%)" % (cm[0][0] + cm[1][1], (cm[0][0] + cm[1][1])*100/np.sum(cm))
Optimization terminated successfully. Current function value: 0.427323 Iterations 6 Logit Regression Results ============================================================================== Dep. Variable: Exited No. Observations: 10000 Model: Logit Df Residuals: 9989 Method: MLE Df Model: 10 Date: Wed, 28 Dec 2016 Pseudo R-squ.: 0.1546 Time: 13:04:15 Log-Likelihood: -4273.2 converged: True LL-Null: -5054.9 LLR p-value: 0.000 ====================================================================================== coef std err z P>|z| [95.0% Conf. Int.] -------------------------------------------------------------------------------------- const -3.4253 0.266 -12.896 0.000 -3.946 -2.905 Germany 0.7534 0.066 11.483 0.000 0.625 0.882 Female 0.5289 0.055 9.693 0.000 0.422 0.636 CreditScore -0.0007 0.000 -2.326 0.020 -0.001 -0.000 Age 0.0611 0.004 15.903 0.000 0.054 0.069 Tenure -0.0165 0.009 -1.762 0.078 -0.035 0.002 Balance 8.362e-06 2.02e-06 4.143 0.000 4.41e-06 1.23e-05 NumOfProducts -0.1011 0.048 -2.119 0.034 -0.195 -0.008 IsActiveMember -1.0601 0.058 -18.349 0.000 -1.173 -0.947 Balance_log 0.0213 0.032 0.657 0.511 -0.042 0.085 WealthAccumulation -0.0003 6.51e-05 -3.954 0.000 -0.000 -0.000 ====================================================================================== _____P-values____ const 4.756747e-38 Germany 1.599880e-30 Female 3.244594e-22 CreditScore 2.000443e-02 Age 6.030631e-57 Tenure 7.801725e-02 Balance 3.428919e-05 NumOfProducts 3.411316e-02 IsActiveMember 3.376037e-75 Balance_log 5.111832e-01 WealthAccumulation 7.673375e-05 dtype: float64 _____Highest p-value____ Balance_log 0.511183 dtype: float64 ____Confusion Matrix___ [[ 7662. 301.] [ 1586. 451.]] Number of cases correctly predicted: 8113.0 (81.13 %)
This new variable had a negative effect on the model's accuracy.
However, one should keep possible collinearity effects in mind: WealthAccumulation is built from Balance and Age, which are already in the model. The new variable may therefore be correlated with those two, and including all of them can harm the model, so one of them should be excluded.
Multicollinearity is a phenomenon in which two or more predictor variables in a multiple regression model are highly correlated, meaning that one can be linearly predicted from the others with a substantial degree of accuracy.
The "Variance Inflation Factors - VIF" can be used to measure the degree of multicollinearity.
Minimum possible value: 1.0. Values greater than 5 may indicate a collinearity problem.
$$VIF(j) = \frac{1}{1 - R_j^2}$$
where $R_j$ is the multiple correlation coefficient between variable $j$ and the other independent variables.
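The definition can be checked by hand before using statsmodels' helper: regress column j on the remaining columns, take the R-squared, and invert. A sketch with plain numpy on synthetic data (`vif` is a hypothetical helper; `x2` is built as a near-copy of `x1`, so both should come out far above the threshold of 5):

```python
import numpy as np

def vif(X, j):
    """VIF of column j: regress it on the other columns, then 1 / (1 - R^2)."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])  # intercept + other columns
    beta = np.linalg.lstsq(A, y, rcond=None)[0]
    resid = y - A.dot(beta)
    r2 = 1.0 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

rng = np.random.RandomState(0)
x1 = rng.rand(200)
x2 = x1 + 0.01 * rng.rand(200)   # nearly collinear with x1
x3 = rng.rand(200)               # independent of the others
X = np.column_stack([x1, x2, x3])
print([round(vif(X, j), 1) for j in range(3)])
```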
for i, column in enumerate(df_x5.columns):
    print column + " %s" % outliers.variance_inflation_factor(df_x5.values, i)
const 91.4773722803
Germany 1.27101224421
Female 1.00321582613
CreditScore 1.00104450767
Age 2.10487620512
Tenure 1.00161073156
Balance 20.8895673667
NumOfProducts 1.15312466504
IsActiveMember 1.01384972933
Balance_log 8.72881202924
WealthAccumulation 14.5598595824
Rerun without the "Balance_log" variable:
df_x6 = df_x5.drop('Balance_log', axis = 1)
model8 = smdis.Logit(df_y, df_x6).fit()
print model8.summary()
print "\n_____P-values____"
p = model8.pvalues
print p
print "\n_____Highest p-value____"
print "\n %s \n" % p[p ==max(p)]
print "\n____Confusion Matrix___"
cm = model8.pred_table()
print cm
print "\nNumber of cases correctly predicted: %s (%s %%)" % (cm[0][0] + cm[1][1], (cm[0][0] + cm[1][1])*100/np.sum(cm))
Optimization terminated successfully. Current function value: 0.427344 Iterations 6 Logit Regression Results ============================================================================== Dep. Variable: Exited No. Observations: 10000 Model: Logit Df Residuals: 9990 Method: MLE Df Model: 9 Date: Wed, 28 Dec 2016 Pseudo R-squ.: 0.1546 Time: 13:04:15 Log-Likelihood: -4273.4 converged: True LL-Null: -5054.9 LLR p-value: 0.000 ====================================================================================== coef std err z P>|z| [95.0% Conf. Int.] -------------------------------------------------------------------------------------- const -3.4089 0.264 -12.904 0.000 -3.927 -2.891 Germany 0.7636 0.064 11.967 0.000 0.639 0.889 Female 0.5292 0.055 9.699 0.000 0.422 0.636 CreditScore -0.0007 0.000 -2.318 0.020 -0.001 -0.000 Age 0.0612 0.004 15.938 0.000 0.054 0.069 Tenure -0.0165 0.009 -1.766 0.077 -0.035 0.002 Balance 9.078e-06 1.7e-06 5.351 0.000 5.75e-06 1.24e-05 NumOfProducts -0.1054 0.047 -2.229 0.026 -0.198 -0.013 IsActiveMember -1.0598 0.058 -18.343 0.000 -1.173 -0.947 WealthAccumulation -0.0003 6.49e-05 -3.965 0.000 -0.000 -0.000 ====================================================================================== _____P-values____ const 4.282278e-38 Germany 5.318191e-33 Female 3.042840e-22 CreditScore 2.047417e-02 Age 3.455661e-57 Tenure 7.744433e-02 Balance 8.770661e-08 NumOfProducts 2.578202e-02 IsActiveMember 3.744959e-75 WealthAccumulation 7.325518e-05 dtype: float64 _____Highest p-value____ Tenure 0.077444 dtype: float64 ____Confusion Matrix___ [[ 7672. 291.] [ 1586. 451.]] Number of cases correctly predicted: 8123.0 (81.23 %)
for i, column in enumerate(df_x6.columns):
    print column + " %s" % outliers.variance_inflation_factor(df_x6.values, i)
const 90.6320775047
Germany 1.2161676569
Female 1.00302785174
CreditScore 1.00097995156
Age 2.10477505278
Tenure 1.00153406128
Balance 14.0245653407
NumOfProducts 1.12330226029
IsActiveMember 1.01353234553
WealthAccumulation 14.5598562679
Create new variable log for wealth accumulation:
df_x7 = df_x5
df_x7['WealthAccumulation_log'] = df_x7['WealthAccumulation'].map(lambda x: np.log10(x + 1))
df_x7[:2]
const | Germany | Female | CreditScore | Age | Tenure | Balance | NumOfProducts | IsActiveMember | Balance_log | WealthAccumulation | WealthAccumulation_log | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 1 | 619 | 42 | 2 | 0.00 | 1 | 1 | 0.00000 | 0.000000 | 0.000000 |
1 | 1 | 0 | 1 | 608 | 41 | 1 | 83807.86 | 1 | 1 | 4.92329 | 2044.094146 | 3.310713 |
df_x7 = df_x7.drop(['Balance', 'WealthAccumulation'], axis = 1)
model9 = smdis.Logit(df_y, df_x7).fit()
print model9.summary()
print "\n_____P-values____"
p = model9.pvalues
print p
print "\n_____Highest p-value____"
print "\n %s \n" % p[p ==max(p)]
print "\n____Confusion Matrix___"
cm = model9.pred_table()
print cm
print "\nNumber of cases correctly predicted: %s (%s %%)" % (cm[0][0] + cm[1][1], (cm[0][0] + cm[1][1])*100/np.sum(cm))
Optimization terminated successfully. Current function value: 0.427891 Iterations 6 Logit Regression Results ============================================================================== Dep. Variable: Exited No. Observations: 10000 Model: Logit Df Residuals: 9990 Method: MLE Df Model: 9 Date: Wed, 28 Dec 2016 Pseudo R-squ.: 0.1535 Time: 13:04:16 Log-Likelihood: -4278.9 converged: True LL-Null: -5054.9 LLR p-value: 0.000 ========================================================================================== coef std err z P>|z| [95.0% Conf. Int.] ------------------------------------------------------------------------------------------ const -3.5735 0.267 -13.362 0.000 -4.098 -3.049 Germany 0.7516 0.065 11.492 0.000 0.623 0.880 Female 0.5263 0.055 9.655 0.000 0.420 0.633 CreditScore -0.0007 0.000 -2.386 0.017 -0.001 -0.000 Age 0.0647 0.004 16.631 0.000 0.057 0.072 Tenure -0.0162 0.009 -1.729 0.084 -0.035 0.002 NumOfProducts -0.0983 0.048 -2.063 0.039 -0.192 -0.005 IsActiveMember -1.0687 0.058 -18.516 0.000 -1.182 -0.956 Balance_log 0.8699 0.297 2.932 0.003 0.288 1.451 WealthAccumulation_log -1.1762 0.435 -2.702 0.007 -2.029 -0.323 ========================================================================================== _____P-values____ const 1.007546e-40 Germany 1.441579e-30 Female 4.661027e-22 CreditScore 1.704065e-02 Age 4.171785e-62 Tenure 8.382654e-02 NumOfProducts 3.907377e-02 IsActiveMember 1.520894e-76 Balance_log 3.373142e-03 WealthAccumulation_log 6.887931e-03 dtype: float64 _____Highest p-value____ Tenure 0.083827 dtype: float64 ____Confusion Matrix___ [[ 7685. 278.] [ 1597. 440.]] Number of cases correctly predicted: 8125.0 (81.25 %)
for i, column in enumerate(df_x7.columns):
    print column + " %s" % outliers.variance_inflation_factor(df_x7.values, i)
const 93.5608074531
Germany 1.26986359735
Female 1.00282038022
CreditScore 1.00102945704
Age 2.26479838945
Tenure 1.00165459162
NumOfProducts 1.15243000098
IsActiveMember 1.01168357579
Balance_log 705.940971113
WealthAccumulation_log 704.739971273
"Balance_log" and "WealthAccumulation_log" have too large values which means that these two variables are basically the same thing. So one of these variables must be excluded!
df_x8 = df_x7.drop(['Balance_log'], axis = 1)
model10 = smdis.Logit(df_y, df_x8).fit()
print model10.summary()
print "\n_____P-values____"
p = model10.pvalues
print p
print "\n_____Highest p-value____"
print "\n %s \n" % p[p ==max(p)]
print "\n____Confusion Matrix___"
cm = model10.pred_table()
print cm
print "\nNumber of cases correctly predicted: %s (%s %%)" % (cm[0][0] + cm[1][1], (cm[0][0] + cm[1][1])*100/np.sum(cm))
Optimization terminated successfully. Current function value: 0.428322 Iterations 6 Logit Regression Results ============================================================================== Dep. Variable: Exited No. Observations: 10000 Model: Logit Df Residuals: 9991 Method: MLE Df Model: 8 Date: Wed, 28 Dec 2016 Pseudo R-squ.: 0.1527 Time: 13:04:18 Log-Likelihood: -4283.2 converged: True LL-Null: -5054.9 LLR p-value: 0.000 ========================================================================================== coef std err z P>|z| [95.0% Conf. Int.] ------------------------------------------------------------------------------------------ const -3.9325 0.239 -16.467 0.000 -4.401 -3.464 Germany 0.7515 0.065 11.550 0.000 0.624 0.879 Female 0.5266 0.054 9.671 0.000 0.420 0.633 CreditScore -0.0007 0.000 -2.409 0.016 -0.001 -0.000 Age 0.0733 0.003 28.398 0.000 0.068 0.078 Tenure -0.0159 0.009 -1.697 0.090 -0.034 0.002 NumOfProducts -0.0967 0.048 -2.034 0.042 -0.190 -0.004 IsActiveMember -1.0763 0.058 -18.672 0.000 -1.189 -0.963 WealthAccumulation_log 0.0985 0.020 4.812 0.000 0.058 0.139 ========================================================================================== _____P-values____ const 6.349678e-61 Germany 7.379902e-31 Female 4.015783e-22 CreditScore 1.598070e-02 Age 2.132441e-177 Tenure 8.978088e-02 NumOfProducts 4.195258e-02 IsActiveMember 8.443410e-78 WealthAccumulation_log 1.490621e-06 dtype: float64 _____Highest p-value____ Tenure 0.089781 dtype: float64 ____Confusion Matrix___ [[ 7688. 275.] [ 1597. 440.]] Number of cases correctly predicted: 8128.0 (81.28 %)
Correlation between variables:
df_x5[['Age', 'Balance_log', 'WealthAccumulation_log', 'WealthAccumulation']].corr()
Age | Balance_log | WealthAccumulation_log | WealthAccumulation | |
---|---|---|---|---|
Age | 1.000000 | 0.034530 | -0.007524 | -0.246293 |
Balance_log | 0.034530 | 1.000000 | 0.998404 | 0.865141 |
WealthAccumulation_log | -0.007524 | 0.998404 | 1.000000 | 0.888872 |
WealthAccumulation | -0.246293 | 0.865141 | 0.888872 | 1.000000 |
As one can observe, the "WealthAccumulation_log" and "Balance_log" variables are highly correlated (0.998): they carry essentially the same information, which is very bad for the model.
Rule of thumb: anything above 0.9 is very high correlation; correlations above 0.5 should be investigated.
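A correlation filter like this can be automated. A minimal sketch on a toy DataFrame, where the hypothetical column `b` nearly duplicates `a` in the same way WealthAccumulation_log duplicates Balance_log:

```python
import numpy as np
import pandas as pd

# Toy frame: 'b' is 'a' plus tiny noise, 'c' is independent
rng = np.random.RandomState(0)
x = rng.rand(100)
df_demo = pd.DataFrame({'a': x,
                        'b': x + rng.rand(100) * 0.01,
                        'c': rng.rand(100)})

corr = df_demo.corr().abs()
# Keep only the upper triangle so each pair is examined once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
# Drop any column correlated above 0.9 with an earlier column
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
print(to_drop)  # 'b' is flagged because it duplicates 'a'
```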
df_final = df_x5.drop(['Balance', 'WealthAccumulation', 'WealthAccumulation_log'], axis = 1)
model11 = smdis.Logit(df_y, df_final).fit()
print model11.summary()
print "\n_____P-values____"
p = model11.pvalues
print p
print "\n_____Highest p-value____"
print "\n %s \n" % p[p ==max(p)]
print "\n____Confusion Matrix___"
cm = model11.pred_table()
print cm
print "\nNumber of cases correctly predicted: %s (%s %%)" % (cm[0][0] + cm[1][1], (cm[0][0] + cm[1][1])*100/np.sum(cm))
Optimization terminated successfully.
         Current function value: 0.428257
         Iterations 6
                           Logit Regression Results
==============================================================================
Dep. Variable:                 Exited   No. Observations:                10000
Model:                          Logit   Df Residuals:                     9991
Method:                           MLE   Df Model:                            8
Date:                Wed, 28 Dec 2016   Pseudo R-squ.:                  0.1528
Time:                        13:04:19   Log-Likelihood:                -4282.6
converged:                       True   LL-Null:                       -5054.9
                                        LLR p-value:                     0.000
==================================================================================
                     coef    std err          z      P>|z|      [95.0% Conf. Int.]
----------------------------------------------------------------------------------
const             -3.9126      0.237    -16.497      0.000      -4.377    -3.448
Germany            0.7476      0.065     11.492      0.000       0.620     0.875
Female             0.5267      0.054      9.672      0.000       0.420     0.633
CreditScore       -0.0007      0.000     -2.408      0.016      -0.001    -0.000
Age                0.0727      0.003     28.221      0.000       0.068     0.078
Tenure            -0.0159      0.009     -1.699      0.089      -0.034     0.002
NumOfProducts     -0.0950      0.048     -1.999      0.046      -0.188    -0.002
IsActiveMember    -1.0758      0.058    -18.662      0.000      -1.189    -0.963
Balance_log        0.0690      0.014      4.945      0.000       0.042     0.096
==================================================================================

_____P-values____
const             3.835044e-61
Germany           1.441009e-30
Female            3.970390e-22
CreditScore       1.604464e-02
Age              3.239286e-175
Tenure            8.932416e-02
NumOfProducts     4.562540e-02
IsActiveMember    1.011855e-77
Balance_log       7.619575e-07
dtype: float64

_____Highest p-value____
Tenure    0.089324
dtype: float64

____Confusion Matrix___
[[ 7687.   276.]
 [ 1597.   440.]]

Number of cases correctly predicted: 8127.0 (81.27 %)
i = 0
for column in df_final.columns:
print column + " %s" % outliers.variance_inflation_factor(df_final.values, i)
i += 1
const             76.3936634172
Germany           1.26934389528
Female            1.00281465894
CreditScore       1.00102204693
Age               1.01171547993
Tenure            1.0014367308
NumOfProducts     1.15145597018
IsActiveMember    1.01028026511
Balance_log       1.42007014313
Now none of the variables exhibits multicollinearity: all VIFs are close to 1 (the large VIF on the constant is expected and can be ignored).
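For reference, `variance_inflation_factor` computes VIF_i = 1 / (1 - R_i^2), where R_i^2 comes from regressing column i on the remaining columns. A minimal numpy re-implementation on a synthetic design matrix (the data here is illustrative only):

```python
import numpy as np

def vif(X, i):
    """VIF of column i: regress it on the remaining columns of X
    (the intercept is already a column of X) and return 1/(1 - R^2)."""
    y = X[:, i]
    others = np.delete(X, i, axis=1)
    beta, _, _, _ = np.linalg.lstsq(others, y, rcond=None)
    resid = y - others.dot(beta)
    r2 = 1.0 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

# Toy design matrix: intercept + two mildly related columns
rng = np.random.RandomState(1)
a = rng.rand(200)
X = np.column_stack([np.ones(200), a, 0.5 * a + rng.rand(200)])
print([round(vif(X, i), 2) for i in (1, 2)])  # both modest, well below 5
```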
df_cap = df[['RowNumber', 'Exited']].copy()
df_cap['phat'] = model11.predict()
df_cap.sort('phat', ascending=False)[:5]
RowNumber | Exited | phat | |
---|---|---|---|
4815 | 4816 | 0 | 0.930669 |
3531 | 3532 | 1 | 0.928311 |
9587 | 9588 | 0 | 0.923303 |
7499 | 7500 | 1 | 0.906178 |
9555 | 9556 | 1 | 0.904457 |
Total_Exited = sum(df_cap['Exited'])
print "Total Exited: %s" % Total_Exited
Total_Records = len(df)
print "Total number of clients: %s" % Total_Records
Exit_Ratio = Total_Exited/Total_Records
print "Ratio of exits: %s " % Exit_Ratio
Total Exited: 2037
Total number of clients: 10000
Ratio of exits: 0.2037
First, I rank the clients by their predicted probability of exiting, highest first.
df_cap['total_selected'] = range(1,len(df_cap)+1)
df_cap['total_selected_per'] = df_cap['total_selected'].map(lambda x: round(x*100/Total_Records, 1))
df_cap['random_selection'] = df_cap['total_selected'].map(lambda x: x*Exit_Ratio)
df_cap['random_selection_per'] = df_cap['total_selected'].map(lambda x: round(x*100/Total_Records, 1))
df_cap = df_cap.sort('phat', ascending=False)
df_cap['model_select'] = df_cap.Exited.cumsum()
df_cap['model_select_per'] = df_cap['model_select'].map(lambda x: round(x*100/Total_Exited, 1))
df_cap[:5]
RowNumber | Exited | phat | total_selected | total_selected_per | random_selection | random_selection_per | model_select | model_select_per | |
---|---|---|---|---|---|---|---|---|---|
4815 | 4816 | 0 | 0.930669 | 1 | 0.0 | 0.2037 | 0.0 | 0 | 0.0 |
3531 | 3532 | 1 | 0.928311 | 2 | 0.0 | 0.4074 | 0.0 | 1 | 0.0 |
9587 | 9588 | 0 | 0.923303 | 3 | 0.0 | 0.6111 | 0.0 | 1 | 0.0 |
7499 | 7500 | 1 | 0.906178 | 4 | 0.0 | 0.8148 | 0.0 | 2 | 0.1 |
9555 | 9556 | 1 | 0.904457 | 5 | 0.1 | 1.0185 | 0.1 | 3 | 0.1 |
df_cap = df_cap.sort('RowNumber')
plt.plot(df_cap['total_selected_per'], df_cap['random_selection_per'], label = 'random model' )
plt.plot(df_cap['total_selected_per'], df_cap['model_select_per'], label = 'Good Model' )
plt.plot([50, 50], [80, 0], 'r--', lw=1, label = 'x = 50%, y = X')
plt.plot([0, 50], [80, 80], 'r--', lw=1)
plt.plot([10, 100], [99, 99], 'y-', lw=3, label = 'perfect model')
plt.plot([0, 10], [0, 99], 'y-', lw=3)
plt.legend(bbox_to_anchor=(1, 1), loc='upper left', ncol=1)
plt.xlabel('Total Contacted')
plt.ylabel('Total Purchased')
plt.show()
The closer the green line is to the yellow line, the better the model; the closer it is to the blue line, the worse.
Accuracy Ratio:

- Take the area between the perfect model (yellow line) and the random model (blue line): $$a_p$$
- Take the area between the good model (green line) and the random model (blue line): $$a_r$$

The accuracy ratio is then

$$AR = \frac{a_r}{a_p}$$

which lies between 0 and 1; the closer it is to 1, the better the model.
Rule of thumb:
Another way of evaluating the model is to look at the red line (x = 50%) and read off the corresponding value y = X.
If:
90% < X < 100% -> Too Good (overfitting?)
80% < X < 90% -> Very Good
70% < X < 80% -> Good
60% < X < 70% -> Poor
X < 60% -> Rubbish
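Both checks can be computed instead of read off the plot. A self-contained sketch with synthetic CAP curves (the "good model" curve is made up for illustration; with the real data one would use the total_selected_per and model_select_per columns built above):

```python
import numpy as np

# Synthetic CAP curves on a 0..100% "contacted" axis
x = np.linspace(0, 100, 1001)                              # % of clients contacted
random_y = x                                               # random model: y = x
exit_ratio = 0.2037                                        # from the training data
perfect_y = np.minimum(x / (exit_ratio * 100), 1.0) * 100  # perfect model, capped at 100%
model_y = 100 * (1 - (1 - x / 100) ** 2)                   # made-up "good model" curve

dx = x[1] - x[0]
def area(y):
    # Trapezoidal integration on the uniform grid
    return ((y[:-1] + y[1:]) / 2.0 * dx).sum()

a_p = area(perfect_y - random_y)   # perfect model vs. random model
a_r = area(model_y - random_y)     # good model vs. random model
AR = a_r / a_p
print("AR = %.3f" % AR)

# Rule-of-thumb check: read the model curve at x = 50%
X50 = np.interp(50, x, model_y)
print("y at x = 50%%: %.1f%%" % X50)  # 75%, i.e. the "Good" band
```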
df_test = pandas.DataFrame.from_csv('Churn-Modelling-Test-Data.csv', index_col=None)
df_test[:5]
RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 10001 | 15798485 | Copley | 565 | France | Male | 31 | 1 | 0.00 | 1 | 0 | 1 | 20443.08 | 0 |
1 | 10002 | 15588959 | T'ang | 569 | France | Male | 34 | 4 | 0.00 | 1 | 0 | 1 | 4045.90 | 0 |
2 | 10003 | 15624896 | Ku | 669 | France | Female | 20 | 7 | 0.00 | 2 | 1 | 0 | 128838.67 | 0 |
3 | 10004 | 15639629 | McConnan | 694 | France | Male | 39 | 4 | 173255.48 | 1 | 1 | 1 | 81293.10 | 0 |
4 | 10005 | 15638852 | Ts'ui | 504 | Spain | Male | 28 | 10 | 109291.36 | 1 | 1 | 1 | 187593.15 | 0 |
df_test = sm.add_constant(df_test)
df_test_y = df_test['Exited']
df_test = df_test.drop(['Exited'], axis = 1)
dummy = pandas.get_dummies(df_test['Gender'])
df_test= dummy.join(df_test)
dummy = pandas.get_dummies(df_test['Geography'])
df_test = dummy.join(df_test)
df_test[:5]
France | Germany | Spain | Female | Male | const | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 0 | 0 | 1 | 1 | 10001 | 15798485 | Copley | 565 | France | Male | 31 | 1 | 0.00 | 1 | 0 | 1 | 20443.08 |
1 | 1 | 0 | 0 | 0 | 1 | 1 | 10002 | 15588959 | T'ang | 569 | France | Male | 34 | 4 | 0.00 | 1 | 0 | 1 | 4045.90 |
2 | 1 | 0 | 0 | 1 | 0 | 1 | 10003 | 15624896 | Ku | 669 | France | Female | 20 | 7 | 0.00 | 2 | 1 | 0 | 128838.67 |
3 | 1 | 0 | 0 | 0 | 1 | 1 | 10004 | 15639629 | McConnan | 694 | France | Male | 39 | 4 | 173255.48 | 1 | 1 | 1 | 81293.10 |
4 | 0 | 0 | 1 | 0 | 1 | 1 | 10005 | 15638852 | Ts'ui | 504 | Spain | Male | 28 | 10 | 109291.36 | 1 | 1 | 1 | 187593.15 |
df_test['Balance_log'] = df_test['Balance'].map(lambda x: np.log10(x+1))
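As with the training data, the +1 inside the log keeps the many zero balances defined and maps them to 0. A small sketch, using balance values from the test table above:

```python
import numpy as np

# log10(x + 1) is defined at x = 0 and maps zero balances to 0
balances = np.array([0.00, 173255.48, 109291.36])
balance_log = np.log10(balances + 1)
print(balance_log)
```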
Check the parameters included in the final model (model11), then refit the model with the same variables and use it to score the appended test data:
model11.params
const            -3.912576
Germany           0.747595
Female            0.526721
CreditScore      -0.000675
Age               0.072655
Tenure           -0.015879
NumOfProducts    -0.095020
IsActiveMember   -1.075776
Balance_log       0.069026
dtype: float64
df_test = df_test[['Germany', 'Female', 'CreditScore', 'Age', 'Tenure', 'NumOfProducts', 'IsActiveMember', 'Balance_log', 'const']]
df_total = df_final.append(df_test)
model_final = smdis.Logit(df_y, df_total[:10000]).fit()
Optimization terminated successfully.
         Current function value: 0.428257
         Iterations 6
y_pred = model_final.predict(df_total[10000:])
y_pred[:5]
array([ 0.03811657, 0.04476137, 0.06383845, 0.08165338, 0.03915974])
Let's construct the CAP curve:
df_pred = pandas.DataFrame()
df_pred['phat'] = y_pred
df_pred['Exited'] = df_test_y
df_pred = df_pred.sort('phat', ascending=False)
df_pred[:5]
phat | Exited | |
---|---|---|
957 | 0.877614 | 1 |
249 | 0.857731 | 1 |
429 | 0.832754 | 1 |
209 | 0.826484 | 1 |
693 | 0.826381 | 0 |
Total_Exited = sum(df_pred['Exited'])
print "Total Exited: %s" % Total_Exited
Total_Records = len(df_pred)
print "Total number of clients: %s" % Total_Records
Exit_Ratio = Total_Exited/Total_Records
print "Ratio of exits: %s" % Exit_Ratio
Total Exited: 260
Total number of clients: 1000
Ratio of exits: 0.26
df_pred['total_selected'] = range(1,len(df_pred)+1)
df_pred['total_selected_per'] = df_pred['total_selected'].map(lambda x: round(x*100/Total_Records, 1))
df_pred['random_selection'] = df_pred['total_selected'].map(lambda x: x*Exit_Ratio)
df_pred['random_selection_per'] = df_pred['total_selected'].map(lambda x: round(x*100/Total_Records, 1))
df_pred = df_pred.sort('phat', ascending=False)
df_pred['model_select'] = df_pred.Exited.cumsum()
df_pred['model_select_per'] = df_pred['model_select'].map(lambda x: round(x*100/Total_Exited, 1))
df_pred[:5]
phat | Exited | total_selected | total_selected_per | random_selection | random_selection_per | model_select | model_select_per | |
---|---|---|---|---|---|---|---|---|
957 | 0.877614 | 1 | 1 | 0.1 | 0.26 | 0.1 | 1 | 0.4 |
249 | 0.857731 | 1 | 2 | 0.2 | 0.52 | 0.2 | 2 | 0.8 |
429 | 0.832754 | 1 | 3 | 0.3 | 0.78 | 0.3 | 3 | 1.2 |
209 | 0.826484 | 1 | 4 | 0.4 | 1.04 | 0.4 | 4 | 1.5 |
693 | 0.826381 | 0 | 5 | 0.5 | 1.30 | 0.5 | 4 | 1.5 |
plt.plot(df_pred['total_selected_per'], df_pred['random_selection_per'], label = 'random model' )
plt.plot(df_pred['total_selected_per'], df_pred['model_select_per'], label = 'Good Model' )
plt.plot([50, 50], [76, 0], 'r--', lw=1, label = 'x = 50%, y = X')
plt.plot([0, 50], [76, 76], 'r--', lw=1)
plt.plot([10, 100], [99, 99], 'y-', lw=3, label = 'perfect model')
plt.plot([0, 10], [0, 99], 'y-', lw=3)
plt.legend(bbox_to_anchor=(1, 1), loc='upper left', ncol=1)
plt.xlabel('Total Contacted')
plt.ylabel('Total Purchased')
plt.show()
Conclusions:
Odds ratio:
$$odds = \frac{p}{1-p}$$

Multiple logistic regression:

$$\ln\left(\frac{p}{1-p}\right) = b_0 + b_1x_1 + ... + b_nx_n$$

So we can write:

$$\ln(odds) = b_0 + b_1x_1 + ... + b_nx_n$$

$$\Leftrightarrow odds = e^{b_0 + b_1x_1 + ... + b_nx_n}$$

$$\Leftrightarrow odds = e^{b_0} \cdot e^{b_1x_1} \cdot ... \cdot e^{b_nx_n}$$

If we increase the variable $$x_1$$ by one unit:

$$e^{b_1x_1} \rightarrow e^{b_1(x_1 + 1)} = e^{b_1x_1}e^{b_1}$$

That is, increasing an independent variable $$x_i$$ by 1 unit multiplies the odds by a factor of $$e^{b_i}$$.
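A quick numeric check of this multiplicative interpretation, using the Germany coefficient from the summary above (the baseline linear predictor is an arbitrary illustrative value):

```python
import numpy as np

b_germany = 0.7476   # Germany coefficient from model11's summary above

baseline = -1.0      # arbitrary illustrative value of b0 + sum of the other b_i * x_i
odds_before = np.exp(baseline)
odds_after = np.exp(baseline + b_germany)

# The ratio is exp(b_germany), whatever the baseline is
print(round(odds_after / odds_before, 4))
print(round(np.exp(b_germany), 4))
```

Both prints give roughly 2.11: being in Germany multiplies the odds of exiting by about 2.1, matching the odds-ratio table computed below.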
So let's transform the coefficients of our model into odds ratios:
odds_ratio = np.exp(model_final.params)
odds_ratio
Age               1.075360
Balance_log       1.071464
CreditScore       0.999325
Female            1.693371
Germany           2.111916
IsActiveMember    0.341033
NumOfProducts     0.909355
Tenure            0.984246
const             0.019989
dtype: float64