Using Machine Learning to Identify Heart Disease

Heart Disease is the cause of 1 out of every 4 deaths in the United States. Heart disease is a treatable condition, but we can only treat a condition if we can correctly diagnose the condition.

In this notebook, I train and tune many of the most commonly used classifiers to survey their performance. In the process, I explore how different preprocessing steps and different sets of hyperparameters impact the performance of different classifiers and visual methods for exploring the hyperparameter spaces to get a better sense of how models react to slight variations in hyperparameters.

In general, more training data will make models more accurate, but some data is very costly to acquire, so I've selected a smaller dataset to show how powerful these techniques are even with small datasets. This dataset didn't have any missing or illegal values so there was no need to delete observations or impute any values, but with sklearn's pipeline framework, adding imputation is a simple process. You can see examples of this pipeline framework below.

For each of the machine learning algorithms used in this notebook, I provide a short theoretical explanation of how the algorithm works, and use that understanding to guide the classifier tuning process.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from IPython.core.display import display, HTML
import os
from mpl_toolkits import mplot3d
%matplotlib inline
In [2]:
# sklearn preprocessing and visualization
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, train_test_split, RepeatedStratifiedKFold
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.pipeline import Pipeline
from sklearn.base import clone
from sklearn.tree import export_graphviz
import graphviz
In [3]:
# Linear Classifiers
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.feature_selection import RFE
In [4]:
# Base Classifiers
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import BernoulliNB
In [5]:
# Ensembles
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import VotingClassifier
In [6]:
# Neural Network
from sklearn.neural_network import MLPClassifier
In [7]:
# Notebook Styling 
sns.set()
pd.options.display.max_columns = None
# display(HTML("<style>.container { width:100% !important; }</style>"))
# pd.set_option('display.float_format',lambda x: '%.5f' % x)
nb_seed = 1234
In [8]:
g_cmap = sns.light_palette('seagreen', n_colors=10, as_cmap=True)

Custom Functions

In [9]:
RepeatedStratifiedKFold
In [256]:
def param_heat_mapper(ax_, df_, ind_col, val_col, col_col, base_clfname, cmap=g_cmap):
    '''Formats a heatmap of model performance given two hyperparameters
    
    Args:
        ax_:                a matplotlib axes object to plot to
        df_:                a Pandas DataFrame containing results from GridSearch in 
                              a tidy format where column names = parameter names
        ind_col_, col_col_: a string contaning the name of the feature to map
        val_col:            a string contining the name of the value column to map
        base_clfname:       a string containing a title prefix
    '''
    logit_piv = df_.pivot(index=ind_col, values=val_col, columns=col_col)
    title_ = base_clfname + ' Params: ' + ind_col + ' and ' + col_col
    with plt.style.context('seaborn-whitegrid'):
        sns.heatmap(logit_piv, linewidths=0.0, annot=True, fmt='0.3f',
                    ax=ax_, cmap=cmap, robust=True, annot_kws={'size':16})
        ax_.set_title(title_, fontsize=14)
        ax_.set_xlabel('Param: ' + col_col, fontsize=14)
        ax_.set_ylabel('Param: ' + ind_col, fontsize=14)
In [11]:
def train_test_param_map(x_var_, y_var_, map_df_):
    '''Prints two heatmaps showing model performance for different hyperparameter combinations.
    
    Args:
        x_var_, y_var_:   strings containing the names of the parameters to be mapped
        map_df_:          a Pandas DataFrame containing results from GridSearch in a 
                            tidy format where column names = parameter names
    '''
    with plt.style.context('seaborn-whitegrid'):
        fig, ax = plt.subplots(nrows=1, ncols=2, sharex=False, figsize=(14,6))
        param_heat_mapper(ax[0], map_df_, y_var_, 'mean_train_score', x_var_, '[Training_Data]')
        param_heat_mapper(ax[1], map_df_, y_var_, 'mean_test_score', x_var_, '[Testing_Data]')

Dataset Description

This dataset was collected from the Cleveland Heart Disease Database and consists of 13 features for 270 patients as well as a labeled vector indicating whether specimen has heart disease or not. The 13 features (and their types) are listed and described below.

Features are described here https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4441402/

Attribute Information:

  1. age
  2. sex
  3. chest pain type (4 values)
  4. resting blood pressure
  5. serum cholestoral in mg/dl
  6. fasting blood sugar > 120 mg/dl
  7. resting electrocardiographic results (values 0,1,2)
  8. maximum heart rate achieved
  9. exercise induced angina
  10. oldpeak = ST depression induced by exercise relative to rest
  11. the slope of the peak exercise ST segment
  12. number of major vessels (0-3) colored by flourosopy
  13. thal: 3 = normal; 6 = fixed defect; 7 = reversable defect

Attributes types

  • Real: 1,4,5,8,10,12
  • Ordered:11,
  • Binary: 2,6,9
  • Nominal:7,3,13
In [12]:
CSV_PATH = os.path.join('data', 'heart_disease', 'train_values.csv')
X_train = pd.read_csv(CSV_PATH, encoding='latin1', index_col='patient_id') 
X_train.head()
Out[12]:
slope_of_peak_exercise_st_segment thal resting_blood_pressure chest_pain_type num_major_vessels fasting_blood_sugar_gt_120_mg_per_dl resting_ekg_results serum_cholesterol_mg_per_dl oldpeak_eq_st_depression sex age max_heart_rate_achieved exercise_induced_angina
patient_id
0z64un 1 normal 128 2 0 0 2 308 0.0 1 45 170 0
ryoo3j 2 normal 110 3 0 0 0 214 1.6 0 54 158 0
yt1s1x 1 normal 125 4 3 0 2 304 0.0 1 77 162 1
l2xjde 1 reversible_defect 152 4 0 0 0 223 0.0 1 40 181 0
oyt4ek 3 reversible_defect 178 1 0 0 2 270 4.2 1 59 145 0
In [13]:
CSV_PATH = os.path.join('data', 'heart_disease', 'test_values.csv')
X_test = pd.read_csv(CSV_PATH, encoding='latin1', index_col='patient_id') 
X_test.head()
Out[13]:
slope_of_peak_exercise_st_segment thal resting_blood_pressure chest_pain_type num_major_vessels fasting_blood_sugar_gt_120_mg_per_dl resting_ekg_results serum_cholesterol_mg_per_dl oldpeak_eq_st_depression sex age max_heart_rate_achieved exercise_induced_angina
patient_id
olalu7 2 reversible_defect 170 1 0 0 2 288 0.2 1 59 159 0
z9n6mx 1 normal 138 4 0 0 0 183 1.4 0 35 182 0
5k4413 2 reversible_defect 120 4 0 0 2 177 2.5 1 43 120 1
mrg7q5 1 normal 102 3 1 0 0 318 0.0 0 60 160 0
uki4do 2 normal 138 4 1 0 2 166 3.6 1 61 125 1
In [14]:
CSV_PATH = os.path.join('data', 'heart_disease', 'train_labels.csv')
y_train = pd.read_csv(CSV_PATH, encoding='latin1', index_col='patient_id') 
y_train.head()
Out[14]:
heart_disease_present
patient_id
0z64un 0
ryoo3j 0
yt1s1x 1
l2xjde 1
oyt4ek 0

From the info printout below, we see that all of the features except for 'thal' (Thalium heart scan observation), which is has the object (string) type. From the given feature information, we know that features

  • sex
  • fasting_blood_sugar_gt_120_mg_per_dl
  • exercise_induced_angina

are binary categories and features

  • resting_ekg_results
  • chest_pain_type
  • thal
  • slope_of_peak_exercise_st_segment

are categorical features with more than 2 categories.

I'm using sklearn's machine learning libraries for my analysis, and sklearn is built on the numpy library which cannot correctly handle features of categorical data. To produce correct results, we'll have to make dummy variables for categorical features.

In [15]:
X_train.info()
<class 'pandas.core.frame.DataFrame'>
Index: 180 entries, 0z64un to 2nx10r
Data columns (total 13 columns):
slope_of_peak_exercise_st_segment       180 non-null int64
thal                                    180 non-null object
resting_blood_pressure                  180 non-null int64
chest_pain_type                         180 non-null int64
num_major_vessels                       180 non-null int64
fasting_blood_sugar_gt_120_mg_per_dl    180 non-null int64
resting_ekg_results                     180 non-null int64
serum_cholesterol_mg_per_dl             180 non-null int64
oldpeak_eq_st_depression                180 non-null float64
sex                                     180 non-null int64
age                                     180 non-null int64
max_heart_rate_achieved                 180 non-null int64
exercise_induced_angina                 180 non-null int64
dtypes: float64(1), int64(11), object(1)
memory usage: 19.7+ KB

Data Preprocessing

In [16]:
categoricals = ['chest_pain_type', 'resting_ekg_results', 'thal', 
                'sex', 'fasting_blood_sugar_gt_120_mg_per_dl', 
                'exercise_induced_angina', 'slope_of_peak_exercise_st_segment']

def categorize_features(df, cats):
    df_cols = df.columns.tolist()
    for cat in cats:
        if cat in df_cols:
            df[cat] = df[cat].astype('category')
    return df
In [17]:
X_train = categorize_features(X_train, categoricals)
X_test = categorize_features(X_test, categoricals)
In [18]:
X_train.info()
<class 'pandas.core.frame.DataFrame'>
Index: 180 entries, 0z64un to 2nx10r
Data columns (total 13 columns):
slope_of_peak_exercise_st_segment       180 non-null category
thal                                    180 non-null category
resting_blood_pressure                  180 non-null int64
chest_pain_type                         180 non-null category
num_major_vessels                       180 non-null int64
fasting_blood_sugar_gt_120_mg_per_dl    180 non-null category
resting_ekg_results                     180 non-null category
serum_cholesterol_mg_per_dl             180 non-null int64
oldpeak_eq_st_depression                180 non-null float64
sex                                     180 non-null category
age                                     180 non-null int64
max_heart_rate_achieved                 180 non-null int64
exercise_induced_angina                 180 non-null category
dtypes: category(7), float64(1), int64(5)
memory usage: 11.8+ KB

After changing the data types of the categorical features, we see that the memory usage dropped from 19.7 KB to 11.8 KB. While this is an extremely small dataset, on larger datasets this would produce a substantial improvement in runtime.

Nominal Categorical Types

In [19]:
val_map = {1: 'typical_angina', 
           2: 'atypical_angina', 
           3: 'non_angina', 
           4: 'asymptotic_angina'}

print('Before mapping: {}'.format(X_train['chest_pain_type'].unique()))
X_train['chest_pain_type'] = X_train['chest_pain_type'].map(val_map)
print('After mapping: {}'.format(X_train['chest_pain_type'].unique()))
Before mapping: [2, 3, 4, 1]
Categories (4, int64): [2, 3, 4, 1]
After mapping: ['atypical_angina' 'non_angina' 'asymptotic_angina' 'typical_angina']
In [20]:
# Resting electrocardiographic results (names too long to map)
# 0: normal, 
# 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), 
# 2: showing probable or definite left ventricular hypertrophy by Estes’ criteria;
X_train['resting_ekg_results'].unique()
Out[20]:
[2, 0, 1]
Categories (3, int64): [2, 0, 1]
In [21]:
# Thalium heart scan
X_train['thal'].unique()
Out[21]:
[normal, reversible_defect, fixed_defect]
Categories (3, object): [normal, reversible_defect, fixed_defect]
In [22]:
val_map = {1: 'upsloping',
           2: 'flat',
           3: 'downsloping'}
print(X_train['slope_of_peak_exercise_st_segment'].unique())
X_train['slope_of_peak_exercise_st_segment'] = X_train['slope_of_peak_exercise_st_segment'].map(val_map)
print('After mapping: {}'.format(X_train['slope_of_peak_exercise_st_segment'].unique()))
[1, 2, 3]
Categories (3, int64): [1, 2, 3]
After mapping: ['upsloping' 'flat' 'downsloping']

Binary Categorical Types

In [23]:
val_map = {0: 'female', 
           1: 'male'}
print('Before mapping: {}'.format(X_train['sex'].unique()))
X_train['sex'] = X_train['sex'].map(val_map)
print('After mapping: {}'.format(X_train['sex'].unique()))
Before mapping: [1, 0]
Categories (2, int64): [1, 0]
After mapping: ['male' 'female']
In [24]:
val_map = {0: 'false', 1: 'true'}
print('Before mapping: {} '.format(X_train['fasting_blood_sugar_gt_120_mg_per_dl'].unique()))
X_train['fasting_blood_sugar_gt_120_mg_per_dl'] = X_train['fasting_blood_sugar_gt_120_mg_per_dl'].map(val_map)
print('After mapping: {} '.format(X_train['fasting_blood_sugar_gt_120_mg_per_dl'].unique()))
Before mapping: [0, 1]
Categories (2, int64): [0, 1] 
After mapping: ['false' 'true'] 
In [25]:
val_map = {0: 'false', 1: 'true'}
print('Before mapping: {}'.format(X_train['exercise_induced_angina'].unique()))
X_train['exercise_induced_angina'] = X_train['exercise_induced_angina'].map(val_map)
print('After mapping: {}'.format(X_train['exercise_induced_angina'].unique()))
Before mapping: [0, 1]
Categories (2, int64): [0, 1]
After mapping: ['false' 'true']

Exploratory Data Analysis

Before creating dummy variables (and increasing the number of features), we should examine the data.

In [26]:
# Joining the labeled output with the features, for plotting
train_df = y_train.join(X_train)
In [27]:
train_df.head()
Out[27]:
heart_disease_present slope_of_peak_exercise_st_segment thal resting_blood_pressure chest_pain_type num_major_vessels fasting_blood_sugar_gt_120_mg_per_dl resting_ekg_results serum_cholesterol_mg_per_dl oldpeak_eq_st_depression sex age max_heart_rate_achieved exercise_induced_angina
patient_id
0z64un 0 upsloping normal 128 atypical_angina 0 false 2 308 0.0 male 45 170 false
ryoo3j 0 flat normal 110 non_angina 0 false 0 214 1.6 female 54 158 false
yt1s1x 1 upsloping normal 125 asymptotic_angina 3 false 2 304 0.0 male 77 162 true
l2xjde 1 upsloping reversible_defect 152 asymptotic_angina 0 false 0 223 0.0 male 40 181 false
oyt4ek 0 downsloping reversible_defect 178 typical_angina 0 false 2 270 4.2 male 59 145 false

As we'll be plotting some categorical features, I'll set an appropriate palette.

In [28]:
colors = ['#66c2a5', '#fc8d62', '#8da0cb']
cat_palette = sns.color_palette(['#66c2a5', '#fc8d62', '#8da0cb'])
sns.palplot(cat_palette)

From the frequency plot of heart disease below, we see that the two classes ('Heart Disease' and 'No Heart Disease') are approximately balanced, with 45% of observations having heart disease and the remaining population not having heart disease.

In [29]:
fig, ax = plt.subplots(figsize=(8,6))
sns.countplot(data=train_df, x='heart_disease_present', palette=cat_palette, ax=ax)
ax.set_title('Distribution of Heart Disease', fontsize=16)
ax.set_xlabel('Heart Disease', fontsize=14)
ax.set_xticklabels(['No Heart Disease', 'Heart Disease'], fontsize=12)
ax.set_ylabel('Count', fontsize=14)
ax.set_ylim([0,105])
plt.show()
In [30]:
with sns.axes_style("whitegrid"):
    plt.rcParams["axes.labelsize"] = 14
    sns.pairplot(train_df, 
                 vars=['num_major_vessels', 'serum_cholesterol_mg_per_dl', 'age', 
                       'resting_blood_pressure', 'max_heart_rate_achieved'], 
                 kind='reg',
                 diag_kind='kde',
                 hue='heart_disease_present',
                 diag_kws={'shade':True},
                 palette=cat_palette, 
                 size=3)

I've plotted all of the numerical features in the pairplot above, and I've colored the data by the 'heart_disease_present' label. The off-diagonal scatter plots include regression lines to show the trends for both populations relative to that plot's features, and the diagonal shows kernel density plots showing the rough distributions of the two populations.

We can make a couple observations:

  • From the plot of resting blood pressure as a function of age, we see that resting blood pressure tends to increase with age regardless of heart disease.
  • From the plot of max heart rate as a function of age, we see that max heart rates are significantly lower for people without heart disease.
  • From the plot of 'serum cholesterol' (colloquially 'cholesterol levels'), we see that cholesterol levels increase with age regardless of heart disease.
  • From the distribution plot of resting blood pressure, we see nearly perfect overlap of the distributions, indicating that feature probably won't be useful in discriminating between healthy and unhealthy cases.
In [31]:
fig, ax = plt.subplots(nrows=1, ncols=2, sharey=True, sharex=True, figsize=(14,6))
train_df[train_df['heart_disease_present']==0]['age'].plot(kind='hist', color=colors[0], ax=ax[0])
train_df[train_df['heart_disease_present']==1]['age'].plot(kind='hist', color=colors[1], ax=ax[1])
ax[0].set_xlabel('Age', fontsize=14)
ax[1].set_xlabel('Age', fontsize=14)
ax[0].set_title('Age Distribution (No Heart Disease)', fontsize=14)
ax[1].set_title('Age Distribution (Heart Disease)', fontsize=14)
plt.show()

From the distributions of age plots, we see a fair amount of overlap, but the distribution of people with heart disease skews a little bit older than those without heart disease. This is intuitive.

In [32]:
fig, ax = plt.subplots(figsize=(8,6))
sns.boxplot(data=train_df, 
            y='thal', 
            x='resting_blood_pressure', 
            hue='heart_disease_present', 
            palette=cat_palette,
            ax=ax)
handles, _ = ax.get_legend_handles_labels()
ax.legend(handles, ['No Heart Disease','Heart Diseasee'], loc='best', )
ax.set_xlabel('Resting Blood Pressure', fontsize=14)
ax.set_ylabel('Thalium Stress Test Result', fontsize=14)
plt.show()

In the boxplot above, we see that there is a lot of overlap of resting blood pressure range across all thalium stress test results, so resting blood pressure probably won't be a significant factor in detecting heart disease.

In [33]:
fig, ax = plt.subplots(figsize=(8,6))
sns.boxplot(data=train_df, 
            y='resting_ekg_results', 
            x='resting_blood_pressure', 
            hue='heart_disease_present', 
            palette=cat_palette,
            ax=ax)
handles, _ = ax.get_legend_handles_labels()
ax.legend(handles, ['No Heart Disease','Heart Diseasee'], loc='best', )
ax.set_xlabel('Resting Blood Pressure', fontsize=14)
ax.set_ylabel('Resting EKG Results', fontsize=14)
plt.show()

That's kind of strange, from that box plot, it appears there are very few examples of EKG result #1. We confirm that via the countplot below. As our only observations of this EKG type have heart disease, our classifier won't have any prior data to form an intuition about heart-disease free examples. Also, as we only have 1 example of this case in the training set, the cross validation process will leave it out of the cross-validation model trainings. In any case, our classifiers won't have enough data to learn how to handle this class.

In [34]:
fig, ax = plt.subplots(figsize=(8,6))
sns.countplot(data=train_df, x='resting_ekg_results', hue='heart_disease_present', palette=cat_palette, ax=ax)
ax.set_title('Distribution of EKG Result by Heart Disease', fontsize=16)
ax.set_xlabel('EKG Result', fontsize=14)
ax.set_xticklabels(['Normal', 'ST-T Wave Abnormality', 'left Ventricular Hypertrophy'], fontsize=12)
ax.set_ylabel('Count', fontsize=14)
ax.set_ylim([0,60])
plt.show()
In [35]:
fig, ax = plt.subplots(figsize=(8,6))
sns.countplot(data=train_df, x='fasting_blood_sugar_gt_120_mg_per_dl', hue='heart_disease_present', palette=cat_palette, ax=ax)
ax.set_title('Fasting Blood Sugar Level > 120 mg per deciliter', fontsize=16)
ax.set_xlabel('EKG Result', fontsize=14)
ax.set_xticklabels(['False', 'True'], fontsize=12)
ax.set_ylabel('Count', fontsize=14)
ax.set_ylim([0,100])
plt.show()

Looking at the histogram of fasting blood sugar levels relative to 120 mg per deciliter, we see that the relative distributions are approximately equal. There may be other interactions with other features that allow models to use this value to discriminate between those with and without heart disease, but at this point, this evidence implies fasting blood sugar levels aren't a significant predictor of heart disease.

In [36]:
fig, ax = plt.subplots(nrows=3, ncols=1, sharex=True, figsize=(8,9))
sns.boxplot(data=train_df, 
            y='exercise_induced_angina', 
            x='max_heart_rate_achieved', 
            hue='heart_disease_present', 
            palette=cat_palette,
            ax=ax[0])
handles, _ = ax[0].get_legend_handles_labels()
ax[0].legend(handles, ['No Heart Disease','Heart Diseasee'], loc='best')
ax[2].set_xlabel('Max Heart Rate Achieved [bpm]', fontsize=14)
ax[0].set_ylabel('Exercise Induced Angina', fontsize=14)

sns.boxplot(data=train_df, 
            y='sex', 
            x='max_heart_rate_achieved', 
            hue='heart_disease_present', 
            palette=cat_palette,
            ax=ax[1])
handles, _ = ax[1].get_legend_handles_labels()
ax[1].legend(handles, ['No Heart Disease','Heart Diseasee'], loc='best')
ax[1].set_ylabel('Gender', fontsize=14)

sns.boxplot(data=train_df, 
            y='fasting_blood_sugar_gt_120_mg_per_dl', 
            x='max_heart_rate_achieved', 
            hue='heart_disease_present', 
            palette=cat_palette,
            ax=ax[2])
handles, _ = ax[2].get_legend_handles_labels()
ax[2].legend(handles, ['No Heart Disease','Heart Diseasee'], loc='best')
ax[2].set_ylabel('Fasting Blood Sugar > 120 mg/dl', fontsize=14)
plt.show()

Looking at the boxplots above we see for all three y-axis features, people with heart disease tend to achieve a lower max heart rate than those without heart disease. This makes intuitive sense, as max heart rate is an intuitive indicator of cardiac performance, and it makes sense that heart disease would reduce performance. In the boxplot broken down by gender, from the interquartile ranges, we see that healthy men and women achieve similar max heart rates, and for both genders, there is relatively little overlap between healthy and unhealthy populations.

In [37]:
fig, ax = plt.subplots(nrows=3, ncols=1, sharex=True, figsize=(8,10))
sns.boxplot(data=train_df, 
            y='slope_of_peak_exercise_st_segment', 
            x='max_heart_rate_achieved', 
            hue='heart_disease_present', 
            palette=cat_palette,
            ax=ax[0])
handles, _ = ax[0].get_legend_handles_labels()
ax[0].legend(handles, ['No Heart Disease','Heart Diseasee'], loc='best')
ax[2].set_xlabel('Max Heart Rate Achieved [bpm]', fontsize=14)
ax[0].set_ylabel('Peak Exercise ST Segment Slope', fontsize=14)

sns.boxplot(data=train_df, 
            y='thal', 
            x='max_heart_rate_achieved', 
            hue='heart_disease_present', 
            palette=cat_palette,
            ax=ax[1])
handles, _ = ax[1].get_legend_handles_labels()
ax[1].legend(handles, ['No Heart Disease','Heart Diseasee'], loc='best')
ax[1].set_ylabel('Thalium Stress Test Results ', fontsize=14)

sns.boxplot(data=train_df, 
            y='chest_pain_type', 
            x='max_heart_rate_achieved', 
            hue='heart_disease_present', 
            palette=cat_palette,
            ax=ax[2])
handles, _ = ax[2].get_legend_handles_labels()
ax[2].legend(handles, ['No Heart Disease','Heart Diseasee'], loc='best')
ax[2].set_ylabel('Chest Pain Type ', fontsize=14)
plt.show()

From the peak exercise ST segment slope, we see for downsloping cases that there's no overlap between interquartile ranges, so these should be easy cases to classify. There's moderate overlap between healthy and unhealthy populatons for flat and upsloping cases, so max heart rate will be less useful in separating healthy and unhealthy cases.

From the thalium stress test boxplot, we see that the median max heart rate for people with heart disease is less than the lower quartile max heart rate for people without disease. We also see for fixed_defect cases that there is no overlap between the interquartile ranges, so these should be easy samples for classifiers to distinguish. The thalium stress test features (relative to max heart rate) are probably going to be significant factors in classifying new examples.

From the chest pain type boxplot, we see for atypical angina and non angina cases, there's almost no overlap between the interquartile ranges. These cases will probably be easy to classify.

In [38]:
X_train = pd.get_dummies(X_train, drop_first=True)
X_test = pd.get_dummies(X_test, drop_first=True)

Models

Now that we've preprocessed our data and reformulated it to be suitable with sklearn's numpy framework, and explored the data to develop an intuition about the data, let's start building classifiers!

Logistic Regression

In [61]:
logreg = LogisticRegression()
parameters = {'penalty':['l1','l2'],
              'C':[0.001, 0.01, 0.1, 1, 10, 100, 1000]}
logreg_gs = GridSearchCV(logreg, parameters, cv=30, return_train_score=True)
logreg_gs.fit(X_train, y_train.values.reshape((y_train.size,)))

print("Best CV params", logreg_gs.best_params_)
best_lr = logreg_gs.best_estimator_
coefs = best_lr.coef_
print("Total number of features:", coefs.size)
print('Number of selected features: {}'.format(np.count_nonzero(coefs)))
Best CV params {'C': 0.1, 'penalty': 'l2'}
Total number of features: 18
Number of selected features: 18
In [62]:
grid_clf = logreg_gs
logreg_grid_dict = {'mean_train_score': grid_clf.cv_results_['mean_train_score'],
                    'mean_test_score': grid_clf.cv_results_['mean_test_score'], 
                    'C': grid_clf.cv_results_['param_C'], 
                    'penalty': grid_clf.cv_results_['param_penalty']} 

Hyperparameter Performance

In [63]:
train_test_param_map(x_var_='penalty', y_var_='C', map_df_=pd.DataFrame(logreg_grid_dict))
In [66]:
map_df = pd.DataFrame(logreg_grid_dict)
x_var = map_df[map_df['penalty'] == 'l1']['C']

fig, ax = plt.subplots(figsize=(12,8))
ax.semilogx(x_var, map_df[map_df['penalty'] == 'l1']['mean_train_score'], label='L1 Train')
ax.semilogx(x_var, map_df[map_df['penalty'] == 'l1']['mean_test_score'], label='L1 Test')
ax.semilogx(x_var, map_df[map_df['penalty'] == 'l2']['mean_train_score'], label='L2 Train')
ax.semilogx(x_var, map_df[map_df['penalty'] == 'l2']['mean_test_score'], label='L2 Test')
ax.legend(fontsize=14, loc='best')
ax.set_title('Model Accuracy as a function of C and Regularization Type', fontsize=16)
ax.set_xlabel('C (Inverse Regularization Strength)', fontsize=14)
ax.set_ylabel('Accuracy', fontsize=14)
plt.show()
In [67]:
print('Best accuracy: {:0.3f}%'.format(logreg_gs.best_score_ * 100))
print('Best parameter(s): {}'.format(logreg_gs.best_params_))
Best accuracy: 85.000%
Best parameter(s): {'C': 0.1, 'penalty': 'l2'}
In [68]:
cross_validated_cm_generator(clf_=logreg_gs.best_estimator_, 
                             X_train_=X_train,
                             y_train_ = y_train.values.reshape((len(y_train),)), 
                             class_labels=['No Heart Disease', 'Heart Disease'], 
                             n_splits_=5, 
                             n_reps_=6, 
                             labels=[0,1])

Logistic Regression Observations

From the plot and printout above, we see that the highest accuracy is achieved when the parameter $C$ is set to $0.1$ and the penalty parameter is set to 'l2' (aka Ridge Regression). However, accuracy is a measure of how well all cases were classified, and the 2 different kinds of misclassifications (labeling a healthy person as unhealthy, and labeling an unhealthy person as healthy) are not equally bad. It's definitely worse to label someone as healthy if they actually have heart disease, as their heart disease would go untreated, and they may continue with an unhealthy diet or pursue dangerous activity levels. In terms of classification metrics, that kind of prediction mistake is a False Negative. The alternative mistake (a False Positive) is to label someone as unhealthy when they are actually healthy, which would lead to someone unnecessarily changing their diet and lifestyle, which would be unpleasant, but not potentially lethal, as in the other case. From the confusion matrices above, we see that the most 'accurate' model tends to make the worse kind of mistake.

We should try to build a model again, but we should use a scoring metric that takes the cost of mistakes into account.

In [69]:
logreg = LogisticRegression()
parameters = {'penalty':['l1','l2'],
              'C':[0.001, 0.01, 0.1, 1, 10, 100, 1000]}
logreg_gs = GridSearchCV(logreg, parameters, cv=30, return_train_score=True, scoring='recall')
logreg_gs.fit(X_train, y_train.values.reshape((y_train.size,)))

print("Best CV params", logreg_gs.best_params_)
best_lr = logreg_gs.best_estimator_
coefs = best_lr.coef_
print('Total number of features: {}'.format(coefs.size))
print('Number of selected features: {}'.format(np.count_nonzero(coefs)))
print('Average Accuracy of the best model: {}'.format(logreg_gs.best_score_))
Best CV params {'C': 1, 'penalty': 'l1'}
Total number of features: 18
Number of selected features: 16
Average Accuracy of the best model: 0.7962962962962962

This time, the number of significant features is less than the total number of features, so I'll recursively eliminate unimportant features.

In [70]:
logreg = LogisticRegression(C=1, penalty='l1')
logreg_rfe = RFE(logreg, 13)
logreg_rfe.fit(X_train, y_train.values.reshape((len(y_train), )))
print(logreg_rfe.support_)
print(logreg_rfe.ranking_)
[False  True False  True  True False  True  True  True  True  True  True
  True False False  True  True  True]
[3 1 4 1 1 2 1 1 1 1 1 1 1 5 6 1 1 1]
In [71]:
X_train.loc[:,logreg_rfe.support_].columns.tolist()
Out[71]:
['num_major_vessels',
 'oldpeak_eq_st_depression',
 'age',
 'slope_of_peak_exercise_st_segment_flat',
 'slope_of_peak_exercise_st_segment_upsloping',
 'thal_normal',
 'thal_reversible_defect',
 'chest_pain_type_atypical_angina',
 'chest_pain_type_non_angina',
 'chest_pain_type_typical_angina',
 'resting_ekg_results_2',
 'sex_male',
 'exercise_induced_angina_true']
In [74]:
logreg = LogisticRegression()
parameters = {'penalty':['l1','l2'],
              'C':[0.001, 0.01, 0.1, 1, 10, 100, 1000]}
logreg_gs = GridSearchCV(logreg, parameters, cv=30, return_train_score=True, scoring='recall')
logreg_gs.fit(X_train.loc[:,logreg_rfe.support_], y_train.values.reshape((y_train.size,)))

print("Best CV params", logreg_gs.best_params_)
best_lr = logreg_gs.best_estimator_
coefs = best_lr.coef_
print('Total number of features: {}'.format(coefs.size))
print('Number of selected features: {}'.format(np.count_nonzero(coefs)))
print('Average Accuracy of the best model: {}'.format(logreg_gs.best_score_))
Best CV params {'C': 0.1, 'penalty': 'l2'}
Total number of features: 13
Number of selected features: 13
Average Accuracy of the best model: 0.7953703703703703

Ok, so we can see that the l1 logistic regression (aka Lasso Regression) was smart enough to assign 0 importance to features below the significance threshold.

In [75]:
grid_clf = logreg_gs
logreg_grid_dict = {'mean_train_score': grid_clf.cv_results_['mean_train_score'],
                    'mean_test_score': grid_clf.cv_results_['mean_test_score'], 
                    'C': grid_clf.cv_results_['param_C'], 
                    'penalty': grid_clf.cv_results_['param_penalty']} 

Hyperparameter Performance

In [76]:
train_test_param_map(x_var_='penalty', y_var_='C', map_df_=pd.DataFrame(logreg_grid_dict))
In [139]:
map_df = pd.DataFrame(logreg_grid_dict)
x_var = map_df[map_df['penalty'] == 'l1']['C']

fig, ax = plt.subplots(figsize=(12,8))
ax.semilogx(x_var, map_df[map_df['penalty'] == 'l1']['mean_train_score'], label='L1 Train', linewidth=3)
ax.semilogx(x_var, map_df[map_df['penalty'] == 'l1']['mean_test_score'], label='L1 Test', linewidth=3)
ax.semilogx(x_var, map_df[map_df['penalty'] == 'l2']['mean_train_score'], label='L2 Train', linewidth=3)
ax.semilogx(x_var, map_df[map_df['penalty'] == 'l2']['mean_test_score'], label='L2 Test', linewidth=3)
ax.legend(fontsize=14, loc='best')
ax.set_title('Model Recall Score as a function of C and Regularization Type', fontsize=16)
ax.set_xlabel('C (Inverse Regularization Strength)', fontsize=14)
ax.set_ylabel('Recall Score', fontsize=14)
plt.show()
In [79]:
print('Best accuracy: {:0.3f}%'.format(logreg_gs.best_score_ * 100))
print('Best parameter(s): {}'.format(logreg_gs.best_params_))
Best accuracy: 79.537%
Best parameter(s): {'C': 0.1, 'penalty': 'l2'}

The confusion matrices directly below as well as the confusion matrices a bit further up were made by fitting classifiers with 30 different slices of the data and generating a confusion matrix for each data slice. The confusion matrices below used a classifier with parameters {$C = 1$, penalty $= \text{L2}$}, only used the 13 features that the recursive feature elimination found to be significant, and used parameters that were selected by the recall scoring measure rather than simple accuracy. Over the 30 runs and over a sample of 35 people, this model labeled (on average) $3.30$ unhealthy people as being healthy, which is slightly better than the above model, which labeled $3.67$ unhealthy people as healthy, and this model labeled $1.93$ healthy people as unhealthy while the above model labeled $1.97$ healthy people as unhealthy.

This is an improvement.

In [80]:
cross_validated_cm_generator(clf_=logreg_gs.best_estimator_, 
                             X_train_=X_train.loc[:,logreg_rfe.support_],
                             y_train_ = y_train.values.reshape((len(y_train),)), 
                             class_labels=['No Heart Disease', 'Heart Disease'], 
                             n_splits_=5, 
                             n_reps_=6, 
                             labels=[0,1])