This notebook demonstrates solving a logistic regression problem, predicting hypothyroidism, with the Scikit-learn and Statsmodels libraries.
The dataset is taken from UCI ML repository.
Here is the link: https://archive.ics.uci.edu/ml/datasets/Thyroid+Disease
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
names = 'response age sex on_thyroxine query_on_thyroxine antithyroid_medication thyroid_surgery query_hypothyroid query_hyperthyroid pregnant \
sick tumor lithium goitre TSH_measured TSH T3_measured \
T3 TT4_measured TT4 T4U_measured T4U FTI_measured FTI TBG_measured TBG'
names = names.split()  # split on any whitespace, so stray double spaces don't produce empty names
!wget https://raw.githubusercontent.com/tirthajyoti/Machine-Learning-with-Python/master/Datasets/hypothyroid.csv
! rm -rf Data/
!mkdir Data
!mv hypothyroid.csv Data/
--2019-08-31 01:16:23--  https://raw.githubusercontent.com/tirthajyoti/Machine-Learning-with-Python/master/Datasets/hypothyroid.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 218406 (213K) [text/plain]
Saving to: ‘hypothyroid.csv’
2019-08-31 01:16:23 (4.23 MB/s) - ‘hypothyroid.csv’ saved [218406/218406]
df = pd.read_csv('Data/hypothyroid.csv',index_col=False,names=names,na_values=['?'])
df.head()
response | age | sex | on_thyroxine | query_on_thyroxine | antithyroid_medication | thyroid_surgery | query_hypothyroid | query_hyperthyroid | pregnant | sick | tumor | lithium | goitre | TSH_measured | TSH | T3_measured | T3 | TT4_measured | TT4 | T4U_measured | T4U | FTI_measured | FTI | TBG_measured | TBG | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | hypothyroid | 72.0 | M | f | f | f | f | f | f | f | f | f | f | f | y | 30.0 | y | 0.6 | y | 15.0 | y | 1.48 | y | 10.0 | n | NaN |
1 | hypothyroid | 15.0 | F | t | f | f | f | f | f | f | f | f | f | f | y | 145.0 | y | 1.7 | y | 19.0 | y | 1.13 | y | 17.0 | n | NaN |
2 | hypothyroid | 24.0 | M | f | f | f | f | f | f | f | f | f | f | f | y | 0.0 | y | 0.2 | y | 4.0 | y | 1.00 | y | 0.0 | n | NaN |
3 | hypothyroid | 24.0 | F | f | f | f | f | f | f | f | f | f | f | f | y | 430.0 | y | 0.4 | y | 6.0 | y | 1.04 | y | 6.0 | n | NaN |
4 | hypothyroid | 77.0 | M | f | f | f | f | f | f | f | f | f | f | f | y | 7.3 | y | 1.2 | y | 57.0 | y | 1.28 | y | 44.0 | n | NaN |
to_drop=[]
for c in df.columns:
if 'measured' in c or 'query' in c:
to_drop.append(c)
to_drop
['query_on_thyroxine', 'query_hypothyroid', 'query_hyperthyroid', 'TSH_measured', 'T3_measured', 'TT4_measured', 'T4U_measured', 'FTI_measured', 'TBG_measured']
to_drop.append('TBG')
df.drop(to_drop,axis=1,inplace=True)
df.head()
response | age | sex | on_thyroxine | antithyroid_medication | thyroid_surgery | pregnant | sick | tumor | lithium | goitre | TSH | T3 | TT4 | T4U | FTI | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | hypothyroid | 72.0 | M | f | f | f | f | f | f | f | f | 30.0 | 0.6 | 15.0 | 1.48 | 10.0 |
1 | hypothyroid | 15.0 | F | t | f | f | f | f | f | f | f | 145.0 | 1.7 | 19.0 | 1.13 | 17.0 |
2 | hypothyroid | 24.0 | M | f | f | f | f | f | f | f | f | 0.0 | 0.2 | 4.0 | 1.00 | 0.0 |
3 | hypothyroid | 24.0 | F | f | f | f | f | f | f | f | f | 430.0 | 0.4 | 6.0 | 1.04 | 6.0 |
4 | hypothyroid | 77.0 | M | f | f | f | f | f | f | f | f | 7.3 | 1.2 | 57.0 | 1.28 | 44.0 |
df.describe().T
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
age | 2717.0 | 51.154214 | 19.294405 | 1.0 | 35.00 | 54.00 | 67.000 | 98.00 |
TSH | 2695.0 | 5.923180 | 23.899467 | 0.0 | 0.00 | 0.70 | 2.300 | 530.00 |
T3 | 2468.0 | 1.939749 | 0.996773 | 0.0 | 1.40 | 1.80 | 2.300 | 10.20 |
TT4 | 2914.0 | 108.850000 | 45.485419 | 2.0 | 83.00 | 104.00 | 128.000 | 450.00 |
T4U | 2915.0 | 0.978199 | 0.226580 | 0.0 | 0.85 | 0.96 | 1.065 | 2.21 |
FTI | 2916.0 | 115.397771 | 60.239572 | 0.0 | 91.00 | 107.00 | 129.000 | 881.00 |
The df.isna() method

The df.isna() method returns a DataFrame of Boolean values: True where a value is missing, False where data is present. Calling sum() on that DataFrame counts the missing values per column.
df.isna().sum()
response                    0
age                       446
sex                        73
on_thyroxine                0
antithyroid_medication      0
thyroid_surgery             0
pregnant                    0
sick                        0
tumor                       0
lithium                     0
goitre                      0
TSH                       468
T3                        695
TT4                       249
T4U                       248
FTI                       247
dtype: int64
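Dividing these counts by the column length, or equivalently calling mean() on the Boolean frame, turns them into fractions. A minimal sketch on a toy DataFrame (not the thyroid data):

```python
import pandas as pd
import numpy as np

# Toy DataFrame standing in for the thyroid data
toy = pd.DataFrame({'age': [72, np.nan, 24, np.nan],
                    'TSH': [30.0, 145.0, np.nan, 430.0]})

missing_counts = toy.isna().sum()        # absolute count of NaNs per column
missing_pct = toy.isna().mean() * 100    # percentage of NaNs per column

print(missing_counts['age'])  # 2
print(missing_pct['TSH'])     # 25.0
```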
Use the df.dropna() method to drop rows with missing values

df.dropna(inplace=True)
df.shape
(2000, 16)
Convert the categorical responses ('hypothyroid'/'negative') to 1 and 0

def class_convert(response):
if response=='hypothyroid':
return 1
else:
return 0
df['response']=df['response'].apply(class_convert)
df.head()
response | age | sex | on_thyroxine | antithyroid_medication | thyroid_surgery | pregnant | sick | tumor | lithium | goitre | TSH | T3 | TT4 | T4U | FTI | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 72.0 | M | f | f | f | f | f | f | f | f | 30.0 | 0.6 | 15.0 | 1.48 | 10.0 |
1 | 1 | 15.0 | F | t | f | f | f | f | f | f | f | 145.0 | 1.7 | 19.0 | 1.13 | 17.0 |
2 | 1 | 24.0 | M | f | f | f | f | f | f | f | f | 0.0 | 0.2 | 4.0 | 1.00 | 0.0 |
3 | 1 | 24.0 | F | f | f | f | f | f | f | f | f | 430.0 | 0.4 | 6.0 | 1.04 | 6.0 |
4 | 1 | 77.0 | M | f | f | f | f | f | f | f | f | 7.3 | 1.2 | 57.0 | 1.28 | 44.0 |
df.columns
Index(['response', 'age', 'sex', 'on_thyroxine', 'antithyroid_medication', 'thyroid_surgery', 'pregnant', 'sick', 'tumor', 'lithium', 'goitre', 'TSH', 'T3', 'TT4', 'T4U', 'FTI'], dtype='object')
for var in ['age','TSH','T3','TT4','T4U','FTI']:
sns.boxplot(x='response',y=var,data=df)
plt.show()
sns.pairplot(data=df[df.columns[1:]],diag_kws={'edgecolor':'k','bins':25},plot_kws={'edgecolor':'k'})
plt.show()
df_dummies = pd.get_dummies(data=df)
df_dummies.shape
(2000, 25)
df_dummies.sample(10)
response | age | TSH | T3 | TT4 | T4U | FTI | sex_F | sex_M | on_thyroxine_f | on_thyroxine_t | antithyroid_medication_f | antithyroid_medication_t | thyroid_surgery_f | thyroid_surgery_t | pregnant_f | pregnant_t | sick_f | sick_t | tumor_f | tumor_t | lithium_f | lithium_t | goitre_f | goitre_t | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2723 | 0 | 72.0 | 0.0 | 3.2 | 333.0 | 0.74 | 450.0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
496 | 0 | 34.0 | 0.0 | 2.3 | 116.0 | 1.06 | 109.0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
1002 | 0 | 57.0 | 0.0 | 1.9 | 145.0 | 0.81 | 180.0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 |
1243 | 0 | 68.0 | 0.0 | 2.8 | 86.0 | 1.05 | 82.0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
1732 | 0 | 53.0 | 1.3 | 1.9 | 104.0 | 1.04 | 100.0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
1295 | 0 | 29.0 | 0.5 | 1.6 | 95.0 | 0.80 | 119.0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
71 | 1 | 70.0 | 13.0 | 1.8 | 58.0 | 1.03 | 56.0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
2654 | 0 | 59.0 | 0.0 | 1.6 | 117.0 | 1.04 | 113.0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
234 | 0 | 63.0 | 0.3 | 1.3 | 130.0 | 1.05 | 124.0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
1925 | 0 | 7.0 | 0.4 | 2.3 | 54.0 | 0.86 | 63.0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
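One caveat worth noting: pd.get_dummies as used above keeps both levels of every binary flag (sex_F and sex_M always sum to 1), so each pair of columns is perfectly collinear, which is what later inflates the standard errors in the statsmodels GLM. Passing drop_first=True keeps one column per category. A minimal sketch on a toy frame:

```python
import pandas as pd

# Toy frame standing in for the thyroid data's categorical columns
toy = pd.DataFrame({'sex': ['M', 'F', 'F'], 'sick': ['f', 'f', 't']})

full = pd.get_dummies(toy)                      # both levels kept: 4 columns
compact = pd.get_dummies(toy, drop_first=True)  # one column per flag: 2 columns

print(full.shape[1])          # 4
print(list(compact.columns))  # ['sex_M', 'sick_t']
```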
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df_dummies.drop('response',axis=1),
df_dummies['response'], test_size=0.30,
random_state=42)
print("Training set shape",X_train.shape)
print("Test set shape",X_test.shape)
Training set shape (1400, 24)
Test set shape (600, 24)
The LogisticRegression estimator from Scikit-learn

We use L2 regularization, which is the Scikit-learn default.
from sklearn.linear_model import LogisticRegression
clf1 = LogisticRegression(penalty='l2',solver='newton-cg')
clf1.fit(X_train,y_train)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, l1_ratio=None, max_iter=100, multi_class='warn', n_jobs=None, penalty='l2', random_state=None, solver='newton-cg', tol=0.0001, verbose=0, warm_start=False)
clf1.intercept_
array([2.28502971])
clf1.coef_
array([[ 0.02031893, 0.01861482, -0.3252159 , 0.00395534, 0.30699787, -0.10796205, 0.30204784, -0.30203619, 0.89773279, -0.89772114, 0.62687938, -0.62686773, -0.67442642, 0.67443807, 0.13429321, -0.13428156, 0.27313209, -0.27312044, 0.26138875, -0.2613771 , 0.08933295, -0.0893213 , -0.2489876 , 0.24899925]])
clf1.score(X_test,y_test)
0.9816666666666667
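An accuracy of 0.98 is best read against the majority-class baseline, since the classes are heavily imbalanced. With the test split used here (569 negative vs 31 hypothyroid cases, as the classification report below shows), always predicting "not hypothyroid" already scores:

```python
# Majority-class baseline: always predict "not hypothyroid".
# Counts taken from the test split used in this notebook (569 vs 31).
n_negative, n_positive = 569, 31
baseline_accuracy = n_negative / (n_negative + n_positive)

print(round(baseline_accuracy, 3))  # 0.948
```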
The LogisticRegression estimator has a special predict_proba method which returns the raw class probabilities

prob_threshold = 0.5
prob_df=pd.DataFrame(clf1.predict_proba(X_test[:10]),columns=['Prob of NO','Prob of YES'])
prob_df['Decision']=(prob_df['Prob of YES']>prob_threshold).apply(int)
prob_df
Prob of NO | Prob of YES | Decision | |
---|---|---|---|
0 | 0.990471 | 0.009529 | 0 |
1 | 0.998714 | 0.001286 | 0 |
2 | 0.999999 | 0.000001 | 0 |
3 | 0.624016 | 0.375984 | 0 |
4 | 0.999296 | 0.000704 | 0 |
5 | 0.864075 | 0.135925 | 0 |
6 | 0.983696 | 0.016304 | 0 |
7 | 0.998507 | 0.001493 | 0 |
8 | 0.221967 | 0.778033 | 1 |
9 | 0.999968 | 0.000032 | 0 |
y_test[:10]
2944    0
511     0
2116    0
1412    0
2039    0
2013    0
1478    0
2744    0
83      1
2100    0
Name: response, dtype: int64
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(y_test, clf1.predict(X_test)))
              precision    recall  f1-score   support

           0       0.99      0.99      0.99       569
           1       0.86      0.77      0.81        31

    accuracy                           0.98       600
   macro avg       0.92      0.88      0.90       600
weighted avg       0.98      0.98      0.98       600
pd.DataFrame(confusion_matrix(y_test, clf1.predict(X_test)),columns=['Predict-NO','Predict-YES'],index=['NO','YES'])

Predict-NO | Predict-YES | |
---|---|---|
NO | 565 | 4 |
YES | 7 | 24 |

Note the label order: confusion_matrix sorts classes, so row/column 0 is the negative (NO) class and row/column 1 is hypothyroid (YES).
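From any 2x2 confusion matrix, sensitivity (recall on the hypothyroid class) and specificity follow directly. scikit-learn's confusion_matrix orders rows and columns by sorted label, so with 0 = negative and 1 = hypothyroid the layout is [[TN, FP], [FN, TP]]. Using the counts reported by this model:

```python
# Counts from the confusion matrix above: [[TN, FP], [FN, TP]]
tn, fp = 565, 4   # true negatives, false positives
fn, tp = 7, 24    # false negatives, true positives

sensitivity = tp / (tp + fn)  # fraction of hypothyroid cases we catch
specificity = tn / (tn + fp)  # fraction of healthy cases we clear

print(round(sensitivity, 3))  # 0.774
print(round(specificity, 3))  # 0.993
```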
Logistic regression with the statsmodels library

import statsmodels.formula.api as smf
import statsmodels.api as sm
df_dummies.columns
Index(['response', 'age', 'TSH', 'T3', 'TT4', 'T4U', 'FTI', 'sex_F', 'sex_M', 'on_thyroxine_f', 'on_thyroxine_t', 'antithyroid_medication_f', 'antithyroid_medication_t', 'thyroid_surgery_f', 'thyroid_surgery_t', 'pregnant_f', 'pregnant_t', 'sick_f', 'sick_t', 'tumor_f', 'tumor_t', 'lithium_f', 'lithium_t', 'goitre_f', 'goitre_t'], dtype='object')
formula = 'response ~ ' + '+'.join(df_dummies.columns[1:])
formula
'response ~ age+TSH+T3+TT4+T4U+FTI+sex_F+sex_M+on_thyroxine_f+on_thyroxine_t+antithyroid_medication_f+antithyroid_medication_t+thyroid_surgery_f+thyroid_surgery_t+pregnant_f+pregnant_t+sick_f+sick_t+tumor_f+tumor_t+lithium_f+lithium_t+goitre_f+goitre_t'
Use Binomial as the GLM family

model = smf.glm(formula = formula, data=df_dummies, family=sm.families.Binomial())
result=model.fit()
The summary method shows an R-style table with all kinds of statistical information

print(result.summary())
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:               response   No. Observations:                 2000
Model:                            GLM   Df Residuals:                     1984
Model Family:                Binomial   Df Model:                           15
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -123.59
Date:                Sat, 31 Aug 2019   Deviance:                       247.17
Time:                        01:17:27   Pearson chi2:                 9.10e+03
No. Iterations:                    22
Covariance Type:            nonrobust
============================================================================================
                               coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------
Intercept                   -3.1161   4981.805     -0.001      1.000   -9767.274    9761.042
age                          0.0219      0.011      2.062      0.039       0.001       0.043
TSH                          0.0214      0.008      2.823      0.005       0.007       0.036
T3                          -0.0233      0.320     -0.073      0.942      -0.651       0.604
TT4                          0.0367      0.027      1.376      0.169      -0.016       0.089
T4U                         -1.5693      1.656     -0.947      0.343      -4.816       1.677
FTI                         -0.1420      0.030     -4.709      0.000      -0.201      -0.083
sex_F                       -1.4005   2490.902     -0.001      1.000   -4883.479    4880.678
sex_M                       -1.7156   2490.902     -0.001      0.999   -4883.794    4880.363
on_thyroxine_f              -1.2357   2490.902     -0.000      1.000   -4883.315    4880.843
on_thyroxine_t              -1.8804   2490.902     -0.001      0.999   -4883.959    4880.198
antithyroid_medication_f    -0.3206   2490.902     -0.000      1.000   -4882.400    4881.758
antithyroid_medication_t    -2.7954   2490.902     -0.001      0.999   -4884.874    4879.284
thyroid_surgery_f           -2.2489   2490.902     -0.001      0.999   -4884.328    4879.830
thyroid_surgery_t           -0.8672   2490.902     -0.000      1.000   -4882.946    4881.212
pregnant_f                  -0.2489   2490.903  -9.99e-05      1.000   -4882.329    4881.831
pregnant_t                  -2.8671   2490.903     -0.001      0.999   -4884.947    4879.213
sick_f                      -1.1899   2490.902     -0.000      1.000   -4883.269    4880.889
sick_t                      -1.9262   2490.902     -0.001      0.999   -4884.005    4880.153
tumor_f                      8.1294   6275.938      0.001      0.999   -1.23e+04    1.23e+04
tumor_t                    -11.2454   7357.670     -0.002      0.999   -1.44e+04    1.44e+04
lithium_f                    8.7704   2.42e+04      0.000      1.000   -4.75e+04    4.75e+04
lithium_t                  -11.8864   2.91e+04     -0.000      1.000    -5.7e+04     5.7e+04
goitre_f                    -1.9841   2490.902     -0.001      0.999   -4884.063    4880.095
goitre_t                    -1.1320   2490.902     -0.000      1.000   -4883.211    4880.947
============================================================================================
The predict method computes probabilities for the test dataset

result.predict(X_test[:10])
2944    2.408353e-03
511     6.250161e-04
2116    6.125699e-07
1412    3.274420e-01
2039    7.888963e-04
2013    1.325293e-01
1478    3.197375e-02
2744    1.098108e-03
83      8.650453e-01
2100    6.524129e-05
dtype: float64
y_pred=(result.predict(X_test)>prob_threshold).apply(int)
print(classification_report(y_test,y_pred))
              precision    recall  f1-score   support

           0       0.99      0.99      0.99       569
           1       0.89      0.77      0.83        31

    accuracy                           0.98       600
   macro avg       0.94      0.88      0.91       600
weighted avg       0.98      0.98      0.98       600
pd.DataFrame(confusion_matrix(y_test, y_pred),columns=['Predict-NO','Predict-YES'],index=['NO','YES'])

Predict-NO | Predict-YES | |
---|---|---|
NO | 566 | 3 |
YES | 7 | 24 |
We saw that the majority of variables in the logistic regression model have very high p-values and are therefore not statistically significant. We build a smaller model that removes those variables.
formula = 'response ~ ' + '+'.join(df_dummies.columns[1:7])
formula
'response ~ age+TSH+T3+TT4+T4U+FTI'
model = smf.glm(formula = formula, data=df_dummies, family=sm.families.Binomial())
result=model.fit()
print(result.summary())
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:               response   No. Observations:                 2000
Model:                            GLM   Df Residuals:                     1993
Model Family:                Binomial   Df Model:                            6
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -131.55
Date:                Sat, 31 Aug 2019   Deviance:                       263.09
Time:                        01:17:34   Pearson chi2:                 5.65e+03
No. Iterations:                     9
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      4.1596      1.995      2.085      0.037       0.250       8.069
age            0.0248      0.010      2.481      0.013       0.005       0.044
TSH            0.0212      0.007      2.892      0.004       0.007       0.035
T3             0.0704      0.297      0.237      0.812      -0.511       0.652
TT4            0.0296      0.026      1.134      0.257      -0.022       0.081
T4U           -1.2903      1.665     -0.775      0.438      -4.553       1.973
FTI           -0.1307      0.029     -4.449      0.000      -0.188      -0.073
==============================================================================
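The full and reduced models can be compared with a likelihood-ratio test: twice the difference in log-likelihoods (-123.59 vs -131.55 in the two summaries) is asymptotically chi-squared with degrees of freedom equal to the difference in model df (15 - 6 = 9). A sketch of that computation, assuming the usual asymptotics apply (the full model's near-singular design makes this approximate at best):

```python
from scipy.stats import chi2

ll_full, ll_reduced = -123.59, -131.55  # log-likelihoods from the two GLM summaries
df_diff = 15 - 6                        # difference in model degrees of freedom

lr_stat = 2 * (ll_full - ll_reduced)    # likelihood-ratio statistic
p_value = chi2.sf(lr_stat, df_diff)     # upper-tail chi-squared probability

print(round(lr_stat, 2))  # 15.92
```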
y_pred=(result.predict(X_test)>prob_threshold).apply(int)
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.99      0.99      0.99       569
           1       0.85      0.74      0.79        31

    accuracy                           0.98       600
   macro avg       0.92      0.87      0.89       600
weighted avg       0.98      0.98      0.98       600

Note: the argument order matters; classification_report expects the true labels first, then the predictions.
pd.DataFrame(confusion_matrix(y_test, y_pred),columns=['Predict-NO','Predict-YES'],index=['NO','YES'])

Predict-NO | Predict-YES | |
---|---|---|
NO | 565 | 4 |
YES | 8 | 23 |
for prob_threshold in [0.5,0.6,0.7,0.8,0.9]:
    y_pred=(result.predict(X_test)>prob_threshold).apply(int)
    # Use a separate name so we don't clobber the dataset DataFrame `df`
    cm_df=pd.DataFrame(confusion_matrix(y_test, y_pred),columns=['Predict-NO','Predict-YES'],index=['NO','YES'])
    print("Confusion matrix for probability threshold {}".format(prob_threshold))
    print("-"*100)
    print(cm_df)
    print()

Confusion matrix for probability threshold 0.5
----------------------------------------------------------------------------------------------------
     Predict-NO  Predict-YES
NO          565            4
YES           8           23

Confusion matrix for probability threshold 0.6
----------------------------------------------------------------------------------------------------
     Predict-NO  Predict-YES
NO          565            4
YES          13           18

Confusion matrix for probability threshold 0.7
----------------------------------------------------------------------------------------------------
     Predict-NO  Predict-YES
NO          567            2
YES          16           15

Confusion matrix for probability threshold 0.8
----------------------------------------------------------------------------------------------------
     Predict-NO  Predict-YES
NO          567            2
YES          17           14

Confusion matrix for probability threshold 0.9
----------------------------------------------------------------------------------------------------
     Predict-NO  Predict-YES
NO          568            1
YES          21           10
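Instead of looping over a handful of thresholds, scikit-learn's precision_recall_curve evaluates every distinct predicted probability at once. A self-contained sketch on toy labels and scores (not the thyroid data):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Toy labels and predicted probabilities standing in for result.predict(X_test)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0])
y_prob = np.array([0.1, 0.2, 0.3, 0.35, 0.8, 0.9, 0.6, 0.7])

precision, recall, thresholds = precision_recall_curve(y_true, y_prob)

# Recall falls from 1.0 to 0.0 as the decision threshold rises
print(recall[0], recall[-1])  # 1.0 0.0
```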
How close are the Scikit-learn and Statsmodels predictions?

sklearn_prob = clf1.predict_proba(X_test)[:,1][:10]
statsmodels_prob = result.predict(X_test[:10])
prob_comp_df=pd.DataFrame(data={'Scikit-learn Prob':list(sklearn_prob),'Statsmodels Prob':list(statsmodels_prob)})
prob_comp_df
Scikit-learn Prob | Statsmodels Prob | |
---|---|---|
0 | 0.009529 | 0.026742 |
1 | 0.001286 | 0.000780 |
2 | 0.000001 | 0.000001 |
3 | 0.375984 | 0.306287 |
4 | 0.000704 | 0.000797 |
5 | 0.135925 | 0.132377 |
6 | 0.016304 | 0.007412 |
7 | 0.001493 | 0.001065 |
8 | 0.778033 | 0.874865 |
9 | 0.000032 | 0.000161 |
What is the interpretation of the coefficient values for age and FTI?
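One way to answer: in a logit model, exponentiating a coefficient gives the multiplicative change in the odds of the positive class per unit increase in that variable, holding the others fixed. Plugging in the age and FTI coefficients from the reduced GLM summary above:

```python
import math

# Coefficients read off the reduced statsmodels GLM summary
beta_age, beta_fti = 0.0248, -0.1307

odds_ratio_age = math.exp(beta_age)  # odds multiplier per extra year of age
odds_ratio_fti = math.exp(beta_fti)  # odds multiplier per extra unit of FTI

print(round(odds_ratio_age, 3))  # 1.025 -> ~2.5% higher odds per year of age
print(round(odds_ratio_fti, 3))  # 0.877 -> ~12% lower odds per unit of FTI
```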
When doing classification with deep neural networks, should you also examine the raw probabilities (the outputs of the final sigmoid layer, for example)?
This approach can be helpful in deep learning tasks where the model's output feeds another business decision process and the costs of Type-I and Type-II errors are vastly different.
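As a toy sketch of that idea (the labels, probabilities, and costs below are all made up): sweep candidate thresholds and pick the one that minimizes the expected misclassification cost.

```python
import numpy as np

# Hypothetical held-out labels/probabilities and made-up business costs
y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])
y_prob = np.array([0.2, 0.4, 0.1, 0.9, 0.35, 0.6, 0.8, 0.05])
cost_fp, cost_fn = 1.0, 10.0  # a missed case (FN) costs 10x a false alarm (FP)

def expected_cost(threshold):
    """Total cost of the errors made at a given decision threshold."""
    y_pred = (y_prob > threshold).astype(int)
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return cost_fp * fp + cost_fn * fn

thresholds = [0.1, 0.3, 0.5, 0.7, 0.9]
best = min(thresholds, key=expected_cost)
print(best)  # 0.3
```

With the missed-case cost dominating, the chosen threshold drops well below 0.5, trading extra false alarms for fewer missed cases.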