This notebook demonstrates solving a logistic regression problem, predicting hypothyroidism, with the Scikit-learn and Statsmodels libraries.
The dataset is taken from UCI ML repository.
Here is the link: https://archive.ics.uci.edu/ml/datasets/Thyroid+Disease
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
names = 'response age sex on_thyroxine query_on_thyroxine antithyroid_medication thyroid_surgery query_hypothyroid query_hyperthyroid pregnant \
sick tumor lithium goitre TSH_measured TSH T3_measured \
T3 TT4_measured TT4 T4U_measured T4U FTI_measured FTI TBG_measured TBG'
names = names.split()  # split on any whitespace, so stray double spaces don't produce empty names
!wget https://raw.githubusercontent.com/tirthajyoti/Machine-Learning-with-Python/master/Datasets/hypothyroid.csv
! rm -rf Data/
!mkdir Data
!mv hypothyroid.csv Data/
--2019-08-31 01:16:23--  https://raw.githubusercontent.com/tirthajyoti/Machine-Learning-with-Python/master/Datasets/hypothyroid.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 218406 (213K) [text/plain]
Saving to: ‘hypothyroid.csv’
2019-08-31 01:16:23 (4.23 MB/s) - ‘hypothyroid.csv’ saved [218406/218406]
df = pd.read_csv('Data/hypothyroid.csv',index_col=False,names=names,na_values=['?'])
df.head()
response | age | sex | on_thyroxine | query_on_thyroxine | antithyroid_medication | thyroid_surgery | query_hypothyroid | query_hyperthyroid | pregnant | sick | tumor | lithium | goitre | TSH_measured | TSH | T3_measured | T3 | TT4_measured | TT4 | T4U_measured | T4U | FTI_measured | FTI | TBG_measured | TBG | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | hypothyroid | 72.0 | M | f | f | f | f | f | f | f | f | f | f | f | y | 30.0 | y | 0.6 | y | 15.0 | y | 1.48 | y | 10.0 | n | NaN |
1 | hypothyroid | 15.0 | F | t | f | f | f | f | f | f | f | f | f | f | y | 145.0 | y | 1.7 | y | 19.0 | y | 1.13 | y | 17.0 | n | NaN |
2 | hypothyroid | 24.0 | M | f | f | f | f | f | f | f | f | f | f | f | y | 0.0 | y | 0.2 | y | 4.0 | y | 1.00 | y | 0.0 | n | NaN |
3 | hypothyroid | 24.0 | F | f | f | f | f | f | f | f | f | f | f | f | y | 430.0 | y | 0.4 | y | 6.0 | y | 1.04 | y | 6.0 | n | NaN |
4 | hypothyroid | 77.0 | M | f | f | f | f | f | f | f | f | f | f | f | y | 7.3 | y | 1.2 | y | 57.0 | y | 1.28 | y | 44.0 | n | NaN |
to_drop=[]
for c in df.columns:
if 'measured' in c or 'query' in c:
to_drop.append(c)
to_drop
['query_on_thyroxine', 'query_hypothyroid', 'query_hyperthyroid', 'TSH_measured', 'T3_measured', 'TT4_measured', 'T4U_measured', 'FTI_measured', 'TBG_measured']
to_drop.append('TBG')
df.drop(to_drop,axis=1,inplace=True)
df.head()
response | age | sex | on_thyroxine | antithyroid_medication | thyroid_surgery | pregnant | sick | tumor | lithium | goitre | TSH | T3 | TT4 | T4U | FTI | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | hypothyroid | 72.0 | M | f | f | f | f | f | f | f | f | 30.0 | 0.6 | 15.0 | 1.48 | 10.0 |
1 | hypothyroid | 15.0 | F | t | f | f | f | f | f | f | f | 145.0 | 1.7 | 19.0 | 1.13 | 17.0 |
2 | hypothyroid | 24.0 | M | f | f | f | f | f | f | f | f | 0.0 | 0.2 | 4.0 | 1.00 | 0.0 |
3 | hypothyroid | 24.0 | F | f | f | f | f | f | f | f | f | 430.0 | 0.4 | 6.0 | 1.04 | 6.0 |
4 | hypothyroid | 77.0 | M | f | f | f | f | f | f | f | f | 7.3 | 1.2 | 57.0 | 1.28 | 44.0 |
df.describe().T
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
age | 2717.0 | 51.154214 | 19.294405 | 1.0 | 35.00 | 54.00 | 67.000 | 98.00 |
TSH | 2695.0 | 5.923180 | 23.899467 | 0.0 | 0.00 | 0.70 | 2.300 | 530.00 |
T3 | 2468.0 | 1.939749 | 0.996773 | 0.0 | 1.40 | 1.80 | 2.300 | 10.20 |
TT4 | 2914.0 | 108.850000 | 45.485419 | 2.0 | 83.00 | 104.00 | 128.000 | 450.00 |
T4U | 2915.0 | 0.978199 | 0.226580 | 0.0 | 0.85 | 0.96 | 1.065 | 2.21 |
FTI | 2916.0 | 115.397771 | 60.239572 | 0.0 | 91.00 | 107.00 | 129.000 | 881.00 |
The df.isna() method

The df.isna() method returns a DataFrame of Boolean values: True where a value is missing, False where data is present. Calling sum() on that DataFrame counts the missing values per column.
df.isna().sum()
response                    0
age                       446
sex                        73
on_thyroxine                0
antithyroid_medication      0
thyroid_surgery             0
pregnant                    0
sick                        0
tumor                       0
lithium                     0
goitre                      0
TSH                       468
T3                        695
TT4                       249
T4U                       248
FTI                       247
dtype: int64
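Dividing these counts by the column length, or equivalently calling mean() on the Boolean frame, turns them into fractions. A minimal sketch on a toy DataFrame (not the thyroid data):

```python
import pandas as pd
import numpy as np

# Toy DataFrame standing in for the thyroid data
toy = pd.DataFrame({'age': [72, np.nan, 24, np.nan],
                    'TSH': [30.0, 145.0, np.nan, 430.0]})

missing_counts = toy.isna().sum()        # absolute count of NaNs per column
missing_pct = toy.isna().mean() * 100    # percentage of NaNs per column

print(missing_counts['age'])  # 2
print(missing_pct['TSH'])     # 25.0
```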
Use the df.dropna() method to drop rows with missing values

df.dropna(inplace=True)
df.shape
(2000, 16)
Convert the categorical responses ('hypothyroid'/'negative') to 1 and 0

def class_convert(response):
if response=='hypothyroid':
return 1
else:
return 0
df['response']=df['response'].apply(class_convert)
df.head()
response | age | sex | on_thyroxine | antithyroid_medication | thyroid_surgery | pregnant | sick | tumor | lithium | goitre | TSH | T3 | TT4 | T4U | FTI | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 72.0 | M | f | f | f | f | f | f | f | f | 30.0 | 0.6 | 15.0 | 1.48 | 10.0 |
1 | 1 | 15.0 | F | t | f | f | f | f | f | f | f | 145.0 | 1.7 | 19.0 | 1.13 | 17.0 |
2 | 1 | 24.0 | M | f | f | f | f | f | f | f | f | 0.0 | 0.2 | 4.0 | 1.00 | 0.0 |
3 | 1 | 24.0 | F | f | f | f | f | f | f | f | f | 430.0 | 0.4 | 6.0 | 1.04 | 6.0 |
4 | 1 | 77.0 | M | f | f | f | f | f | f | f | f | 7.3 | 1.2 | 57.0 | 1.28 | 44.0 |
df.columns
Index(['response', 'age', 'sex', 'on_thyroxine', 'antithyroid_medication', 'thyroid_surgery', 'pregnant', 'sick', 'tumor', 'lithium', 'goitre', 'TSH', 'T3', 'TT4', 'T4U', 'FTI'], dtype='object')
for var in ['age','TSH','T3','TT4','T4U','FTI']:
sns.boxplot(x='response',y=var,data=df)
plt.show()
sns.pairplot(data=df[df.columns[1:]],diag_kws={'edgecolor':'k','bins':25},plot_kws={'edgecolor':'k'})
plt.show()
df_dummies = pd.get_dummies(data=df)
df_dummies.shape
(2000, 25)
df_dummies.sample(10)
response | age | TSH | T3 | TT4 | T4U | FTI | sex_F | sex_M | on_thyroxine_f | on_thyroxine_t | antithyroid_medication_f | antithyroid_medication_t | thyroid_surgery_f | thyroid_surgery_t | pregnant_f | pregnant_t | sick_f | sick_t | tumor_f | tumor_t | lithium_f | lithium_t | goitre_f | goitre_t | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2723 | 0 | 72.0 | 0.0 | 3.2 | 333.0 | 0.74 | 450.0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
496 | 0 | 34.0 | 0.0 | 2.3 | 116.0 | 1.06 | 109.0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
1002 | 0 | 57.0 | 0.0 | 1.9 | 145.0 | 0.81 | 180.0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 |
1243 | 0 | 68.0 | 0.0 | 2.8 | 86.0 | 1.05 | 82.0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
1732 | 0 | 53.0 | 1.3 | 1.9 | 104.0 | 1.04 | 100.0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
1295 | 0 | 29.0 | 0.5 | 1.6 | 95.0 | 0.80 | 119.0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
71 | 1 | 70.0 | 13.0 | 1.8 | 58.0 | 1.03 | 56.0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
2654 | 0 | 59.0 | 0.0 | 1.6 | 117.0 | 1.04 | 113.0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
234 | 0 | 63.0 | 0.3 | 1.3 | 130.0 | 1.05 | 124.0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
1925 | 0 | 7.0 | 0.4 | 2.3 | 54.0 | 0.86 | 63.0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
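One caveat worth noting: pd.get_dummies as used above keeps both levels of every binary flag (sex_F and sex_M always sum to 1), so each pair of columns is perfectly collinear, which is what later inflates the standard errors in the statsmodels GLM. Passing drop_first=True keeps one column per category. A minimal sketch on a toy frame:

```python
import pandas as pd

# Toy frame standing in for the thyroid data's categorical columns
toy = pd.DataFrame({'sex': ['M', 'F', 'F'], 'sick': ['f', 'f', 't']})

full = pd.get_dummies(toy)                      # both levels kept: 4 columns
compact = pd.get_dummies(toy, drop_first=True)  # one column per flag: 2 columns

print(full.shape[1])          # 4
print(list(compact.columns))  # ['sex_M', 'sick_t']
```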
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df_dummies.drop('response',axis=1),
df_dummies['response'], test_size=0.30,
random_state=42)
print("Training set shape",X_train.shape)
print("Test set shape",X_test.shape)
Training set shape (1400, 24)
Test set shape (600, 24)
The LogisticRegression estimator from Scikit-learn

We use L2 regularization, which is the Scikit-learn default.
from sklearn.linear_model import LogisticRegression
clf1 = LogisticRegression(penalty='l2',solver='newton-cg')
clf1.fit(X_train,y_train)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, l1_ratio=None, max_iter=100, multi_class='warn', n_jobs=None, penalty='l2', random_state=None, solver='newton-cg', tol=0.0001, verbose=0, warm_start=False)
clf1.intercept_
array([2.28502971])
clf1.coef_
array([[ 0.02031893, 0.01861482, -0.3252159 , 0.00395534, 0.30699787, -0.10796205, 0.30204784, -0.30203619, 0.89773279, -0.89772114, 0.62687938, -0.62686773, -0.67442642, 0.67443807, 0.13429321, -0.13428156, 0.27313209, -0.27312044, 0.26138875, -0.2613771 , 0.08933295, -0.0893213 , -0.2489876 , 0.24899925]])
clf1.score(X_test,y_test)
0.9816666666666667
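An accuracy of 0.98 is best read against the majority-class baseline, since the classes are heavily imbalanced. With the test split used here (569 negative vs 31 hypothyroid cases, as the classification report below shows), always predicting "not hypothyroid" already scores:

```python
# Majority-class baseline: always predict "not hypothyroid".
# Counts taken from the test split used in this notebook (569 vs 31).
n_negative, n_positive = 569, 31
baseline_accuracy = n_negative / (n_negative + n_positive)

print(round(baseline_accuracy, 3))  # 0.948
```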
The LogisticRegression estimator has a special predict_proba method which returns the raw class probabilities

prob_threshold = 0.5
prob_df=pd.DataFrame(clf1.predict_proba(X_test[:10]),columns=['Prob of NO','Prob of YES'])
prob_df['Decision']=(prob_df['Prob of YES']>prob_threshold).apply(int)
prob_df
Prob of NO | Prob of YES | Decision | |
---|---|---|---|
0 | 0.990471 | 0.009529 | 0 |
1 | 0.998714 | 0.001286 | 0 |
2 | 0.999999 | 0.000001 | 0 |
3 | 0.624016 | 0.375984 | 0 |
4 | 0.999296 | 0.000704 | 0 |
5 | 0.864075 | 0.135925 | 0 |
6 | 0.983696 | 0.016304 | 0 |
7 | 0.998507 | 0.001493 | 0 |
8 | 0.221967 | 0.778033 | 1 |
9 | 0.999968 | 0.000032 | 0 |
y_test[:10]
2944    0
511     0
2116    0
1412    0
2039    0
2013    0
1478    0
2744    0
83      1
2100    0
Name: response, dtype: int64
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(y_test, clf1.predict(X_test)))
              precision    recall  f1-score   support

           0       0.99      0.99      0.99       569
           1       0.86      0.77      0.81        31

    accuracy                           0.98       600
   macro avg       0.92      0.88      0.90       600
weighted avg       0.98      0.98      0.98       600
pd.DataFrame(confusion_matrix(y_test, clf1.predict(X_test)),columns=['Predict-NO','Predict-YES'],index=['NO','YES'])

Predict-NO | Predict-YES | |
---|---|---|
NO | 565 | 4 |
YES | 7 | 24 |

Note the label order: confusion_matrix sorts classes, so row/column 0 is the negative (NO) class and row/column 1 is hypothyroid (YES).
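From any 2x2 confusion matrix, sensitivity (recall on the hypothyroid class) and specificity follow directly. scikit-learn's confusion_matrix orders rows and columns by sorted label, so with 0 = negative and 1 = hypothyroid the layout is [[TN, FP], [FN, TP]]. Using the counts reported by this model:

```python
# Counts from the confusion matrix above: [[TN, FP], [FN, TP]]
tn, fp = 565, 4   # true negatives, false positives
fn, tp = 7, 24    # false negatives, true positives

sensitivity = tp / (tp + fn)  # fraction of hypothyroid cases we catch
specificity = tn / (tn + fp)  # fraction of healthy cases we clear

print(round(sensitivity, 3))  # 0.774
print(round(specificity, 3))  # 0.993
```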
Logistic regression with the statsmodels library

import statsmodels.formula.api as smf
import statsmodels.api as sm
df_dummies.columns
Index(['response', 'age', 'TSH', 'T3', 'TT4', 'T4U', 'FTI', 'sex_F', 'sex_M', 'on_thyroxine_f', 'on_thyroxine_t', 'antithyroid_medication_f', 'antithyroid_medication_t', 'thyroid_surgery_f', 'thyroid_surgery_t', 'pregnant_f', 'pregnant_t', 'sick_f', 'sick_t', 'tumor_f', 'tumor_t', 'lithium_f', 'lithium_t', 'goitre_f', 'goitre_t'], dtype='object')
formula = 'response ~ ' + '+'.join(df_dummies.columns[1:])
formula
'response ~ age+TSH+T3+TT4+T4U+FTI+sex_F+sex_M+on_thyroxine_f+on_thyroxine_t+antithyroid_medication_f+antithyroid_medication_t+thyroid_surgery_f+thyroid_surgery_t+pregnant_f+pregnant_t+sick_f+sick_t+tumor_f+tumor_t+lithium_f+lithium_t+goitre_f+goitre_t'
Use Binomial as the GLM family

model = smf.glm(formula = formula, data=df_dummies, family=sm.families.Binomial())
result=model.fit()
The summary method shows an R-style table with all kinds of statistical information

print(result.summary())
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:               response   No. Observations:                 2000
Model:                            GLM   Df Residuals:                     1984
Model Family:                Binomial   Df Model:                           15
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -123.59
Date:                Sat, 31 Aug 2019   Deviance:                       247.17
Time:                        01:17:27   Pearson chi2:                 9.10e+03
No. Iterations:                    22
Covariance Type:            nonrobust
============================================================================================
                               coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------
Intercept                   -3.1161   4981.805     -0.001      1.000   -9767.274    9761.042
age                          0.0219      0.011      2.062      0.039       0.001       0.043
TSH                          0.0214      0.008      2.823      0.005       0.007       0.036
T3                          -0.0233      0.320     -0.073      0.942      -0.651       0.604
TT4                          0.0367      0.027      1.376      0.169      -0.016       0.089
T4U                         -1.5693      1.656     -0.947      0.343      -4.816       1.677
FTI                         -0.1420      0.030     -4.709      0.000      -0.201      -0.083
sex_F                       -1.4005   2490.902     -0.001      1.000   -4883.479    4880.678
sex_M                       -1.7156   2490.902     -0.001      0.999   -4883.794    4880.363
on_thyroxine_f              -1.2357   2490.902     -0.000      1.000   -4883.315    4880.843
on_thyroxine_t              -1.8804   2490.902     -0.001      0.999   -4883.959    4880.198
antithyroid_medication_f    -0.3206   2490.902     -0.000      1.000   -4882.400    4881.758
antithyroid_medication_t    -2.7954   2490.902     -0.001      0.999   -4884.874    4879.284
thyroid_surgery_f           -2.2489   2490.902     -0.001      0.999   -4884.328    4879.830
thyroid_surgery_t           -0.8672   2490.902     -0.000      1.000   -4882.946    4881.212
pregnant_f                  -0.2489   2490.903  -9.99e-05      1.000   -4882.329    4881.831
pregnant_t                  -2.8671   2490.903     -0.001      0.999   -4884.947    4879.213
sick_f                      -1.1899   2490.902     -0.000      1.000   -4883.269    4880.889
sick_t                      -1.9262   2490.902     -0.001      0.999   -4884.005    4880.153
tumor_f                      8.1294   6275.938      0.001      0.999   -1.23e+04    1.23e+04
tumor_t                    -11.2454   7357.670     -0.002      0.999   -1.44e+04    1.44e+04
lithium_f                    8.7704   2.42e+04      0.000      1.000   -4.75e+04    4.75e+04
lithium_t                  -11.8864   2.91e+04     -0.000      1.000    -5.7e+04     5.7e+04
goitre_f                    -1.9841   2490.902     -0.001      0.999   -4884.063    4880.095
goitre_t                    -1.1320   2490.902     -0.000      1.000   -4883.211    4880.947
============================================================================================
The predict method computes probabilities for the test dataset

result.predict(X_test[:10])
2944    2.408353e-03
511     6.250161e-04
2116    6.125699e-07
1412    3.274420e-01
2039    7.888963e-04
2013    1.325293e-01
1478    3.197375e-02
2744    1.098108e-03
83      8.650453e-01
2100    6.524129e-05
dtype: float64
y_pred=(result.predict(X_test)>prob_threshold).apply(int)
print(classification_report(y_test,y_pred))
              precision    recall  f1-score   support

           0       0.99      0.99      0.99       569
           1       0.89      0.77      0.83        31

    accuracy                           0.98       600
   macro avg       0.94      0.88      0.91       600
weighted avg       0.98      0.98      0.98       600
pd.DataFrame(confusion_matrix(y_test, y_pred),columns=['Predict-NO','Predict-YES'],index=['NO','YES'])

Predict-NO | Predict-YES | |
---|---|---|
NO | 566 | 3 |
YES | 7 | 24 |
We saw that the majority of variables in the logistic regression model have very high p-values and are therefore not statistically significant. We build a smaller model that removes those variables.
formula = 'response ~ ' + '+'.join(df_dummies.columns[1:7])
formula
'response ~ age+TSH+T3+TT4+T4U+FTI'
model = smf.glm(formula = formula, data=df_dummies, family=sm.families.Binomial())
result=model.fit()
print(result.summary())
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:               response   No. Observations:                 2000
Model:                            GLM   Df Residuals:                     1993
Model Family:                Binomial   Df Model:                            6
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -131.55
Date:                Sat, 31 Aug 2019   Deviance:                       263.09
Time:                        01:17:34   Pearson chi2:                 5.65e+03
No. Iterations:                     9
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      4.1596      1.995      2.085      0.037       0.250       8.069
age            0.0248      0.010      2.481      0.013       0.005       0.044
TSH            0.0212      0.007      2.892      0.004       0.007       0.035
T3             0.0704      0.297      0.237      0.812      -0.511       0.652
TT4            0.0296      0.026      1.134      0.257      -0.022       0.081
T4U           -1.2903      1.665     -0.775      0.438      -4.553       1.973
FTI           -0.1307      0.029     -4.449      0.000      -0.188      -0.073
==============================================================================
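The full and reduced models can be compared with a likelihood-ratio test: twice the difference in log-likelihoods (-123.59 vs -131.55 in the two summaries) is asymptotically chi-squared with degrees of freedom equal to the difference in model df (15 - 6 = 9). A sketch of that computation, assuming the usual asymptotics apply (the full model's near-singular design makes this approximate at best):

```python
from scipy.stats import chi2

ll_full, ll_reduced = -123.59, -131.55  # log-likelihoods from the two GLM summaries
df_diff = 15 - 6                        # difference in model degrees of freedom

lr_stat = 2 * (ll_full - ll_reduced)    # likelihood-ratio statistic
p_value = chi2.sf(lr_stat, df_diff)     # upper-tail chi-squared probability

print(round(lr_stat, 2))  # 15.92
```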
y_pred=(result.predict(X_test)>prob_threshold).apply(int)
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.99      0.99      0.99       569
           1       0.85      0.74      0.79        31

    accuracy                           0.98       600
   macro avg       0.92      0.87      0.89       600
weighted avg       0.98      0.98      0.98       600

Note: the argument order matters; classification_report expects the true labels first, then the predictions.
pd.DataFrame(confusion_matrix(y_test, y_pred),columns=['Predict-NO','Predict-YES'],index=['NO','YES'])

Predict-NO | Predict-YES | |
---|---|---|
NO | 565 | 4 |
YES | 8 | 23 |
for prob_threshold in [0.5,0.6,0.7,0.8,0.9]:
    y_pred=(result.predict(X_test)>prob_threshold).apply(int)
    # Use a separate name so we don't clobber the dataset DataFrame `df`
    cm_df=pd.DataFrame(confusion_matrix(y_test, y_pred),columns=['Predict-NO','Predict-YES'],index=['NO','YES'])
    print("Confusion matrix for probability threshold {}".format(prob_threshold))
    print("-"*100)
    print(cm_df)
    print()

Confusion matrix for probability threshold 0.5
----------------------------------------------------------------------------------------------------
     Predict-NO  Predict-YES
NO          565            4
YES           8           23

Confusion matrix for probability threshold 0.6
----------------------------------------------------------------------------------------------------
     Predict-NO  Predict-YES
NO          565            4
YES          13           18

Confusion matrix for probability threshold 0.7
----------------------------------------------------------------------------------------------------
     Predict-NO  Predict-YES
NO          567            2
YES          16           15

Confusion matrix for probability threshold 0.8
----------------------------------------------------------------------------------------------------
     Predict-NO  Predict-YES
NO          567            2
YES          17           14

Confusion matrix for probability threshold 0.9
----------------------------------------------------------------------------------------------------
     Predict-NO  Predict-YES
NO          568            1
YES          21           10
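Instead of looping over a handful of thresholds, scikit-learn's precision_recall_curve evaluates every distinct predicted probability at once. A self-contained sketch on toy labels and scores (not the thyroid data):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Toy labels and predicted probabilities standing in for result.predict(X_test)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0])
y_prob = np.array([0.1, 0.2, 0.3, 0.35, 0.8, 0.9, 0.6, 0.7])

precision, recall, thresholds = precision_recall_curve(y_true, y_prob)

# Recall falls from 1.0 to 0.0 as the decision threshold rises
print(recall[0], recall[-1])  # 1.0 0.0
```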
How close are the Scikit-learn and Statsmodels predictions?

sklearn_prob = clf1.predict_proba(X_test)[:,1][:10]
statsmodels_prob = result.predict(X_test[:10])
prob_comp_df=pd.DataFrame(data={'Scikit-learn Prob':list(sklearn_prob),'Statsmodels Prob':list(statsmodels_prob)})
prob_comp_df
Scikit-learn Prob | Statsmodels Prob | |
---|---|---|
0 | 0.009529 | 0.026742 |
1 | 0.001286 | 0.000780 |
2 | 0.000001 | 0.000001 |
3 | 0.375984 | 0.306287 |
4 | 0.000704 | 0.000797 |
5 | 0.135925 | 0.132377 |
6 | 0.016304 | 0.007412 |
7 | 0.001493 | 0.001065 |
8 | 0.778033 | 0.874865 |
9 | 0.000032 | 0.000161 |
What is the interpretation of the coefficient values for age and FTI?
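One way to answer: in a logit model, exponentiating a coefficient gives the multiplicative change in the odds of the positive class per unit increase in that variable, holding the others fixed. Plugging in the age and FTI coefficients from the reduced GLM summary above:

```python
import math

# Coefficients read off the reduced statsmodels GLM summary
beta_age, beta_fti = 0.0248, -0.1307

odds_ratio_age = math.exp(beta_age)  # odds multiplier per extra year of age
odds_ratio_fti = math.exp(beta_fti)  # odds multiplier per extra unit of FTI

print(round(odds_ratio_age, 3))  # 1.025 -> ~2.5% higher odds per year of age
print(round(odds_ratio_fti, 3))  # 0.877 -> ~12% lower odds per unit of FTI
```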
When doing classification with deep neural networks, should you also examine the raw probabilities (the outputs of the final sigmoid layer, for example)?
This approach can be helpful in deep learning tasks where the model's output feeds another business decision process and the costs of Type-I and Type-II errors are vastly different.
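As a toy sketch of that idea (the labels, probabilities, and costs below are all made up): sweep candidate thresholds and pick the one that minimizes the expected misclassification cost.

```python
import numpy as np

# Hypothetical held-out labels/probabilities and made-up business costs
y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])
y_prob = np.array([0.2, 0.4, 0.1, 0.9, 0.35, 0.6, 0.8, 0.05])
cost_fp, cost_fn = 1.0, 10.0  # a missed case (FN) costs 10x a false alarm (FP)

def expected_cost(threshold):
    """Total cost of the errors made at a given decision threshold."""
    y_pred = (y_prob > threshold).astype(int)
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return cost_fp * fp + cost_fn * fn

thresholds = [0.1, 0.3, 0.5, 0.7, 0.9]
best = min(thresholds, key=expected_cost)
print(best)  # 0.3
```

With the missed-case cost dominating, the chosen threshold drops well below 0.5, trading extra false alarms for fewer missed cases.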