by Samokhvalov Mikhail, Moscow 2018
https://www.kaggle.com/brandao/diabetes/home
It is important to know whether a patient will be readmitted to a hospital, because in that case the treatment can be changed in order to avoid the readmission.
In this database, there are 3 different outcomes: no readmission, readmission in less than 30 days, and readmission after more than 30 days.
In this context, you can define different objective functions for the problem: you can try to predict whether the patient will not be readmitted at all, or whether they are going to be readmitted in less than 30 days (because in that case the problem may be the treatment itself), etc.
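For illustration, here is a minimal sketch (on a hypothetical toy Series; the real readmitted column is loaded further below in this notebook) of how the three outcomes can be turned into different targets depending on the chosen objective:
import pandas as pd
# hypothetical toy example of the three outcome labels
readmitted = pd.Series(['NO', '<30', '>30', 'NO'])
# objective A: readmission in less than 30 days vs. everything else
target_lt30 = readmitted.map({'<30': 1, '>30': 0, 'NO': 0})
# objective B: any readmission vs. no readmission
target_any = readmitted.map({'<30': 1, '>30': 1, 'NO': 0})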
"The data set represents 10 years (1999-2008) of clinical care at 130 US hospitals and integrated delivery networks. It includes over 50 features representing patient and hospital outcomes. Information was extracted from the database for encounters that satisfied the following criteria.
- It is an inpatient encounter (a hospital admission).
- It is a diabetic encounter, that is, one during which any kind of diabetes was entered to the system as a diagnosis.
- The length of stay was at least 1 day and at most 14 days.
- Laboratory tests were performed during the encounter.
- Medications were administered during the encounter.

The data contains such attributes as patient number, race, gender, age, admission type, time in hospital, medical specialty of admitting physician, number of lab tests performed, HbA1c test result, diagnosis, number of medications, diabetic medications, number of outpatient, inpatient, and emergency visits in the year before the hospitalization, etc."
The data are submitted on behalf of the Center for Clinical and Translational Research, Virginia Commonwealth University, a recipient of NIH CTSA grant UL1 TR00058 and a recipient of the CERNER data. John Clore (jclore '@' vcu.edu), Krzysztof J. Cios (kcios '@' vcu.edu), Jon DeShazo (jpdeshazo '@' vcu.edu), and Beata Strack (strackb '@' vcu.edu). This data is a de-identified abstract of the Health Facts database (Cerner Corporation, Kansas City, MO).
First of all, let's take the feature descriptions from the article, convert them to Markdown for better readability, and map them to the dataframe column names.
Feature name | Name in dataframe | Type | Description and values | % missing |
---|---|---|---|---|
Encounter ID | encounter_id | Numeric | Unique identifier of an encounter | 0 |
Patient number | patient_nbr | Numeric | Unique identifier of a patient | 0 |
Race | race | Nominal | Values: Caucasian, Asian, African American, Hispanic, and other | 2 |
Gender | gender | Nominal | Values: male, female, and unknown/invalid | 0 |
Age | age | Nominal | Grouped in 10-year intervals: [0, 10), [10, 20), . . ., [90, 100) | 0 |
Weight | weight | Numeric | Weight in pounds. | 97 |
Admission type | admission_type_id | Nominal | Integer identifier corresponding to 9 distinct values, for example, emergency, urgent, elective, newborn, and not available | 0 |
Discharge disposition | discharge_disposition_id | Nominal | Integer identifier corresponding to 29 distinct values, for example, discharged to home, expired, and not available | 0 |
Admission source | admission_source_id | Nominal | Integer identifier corresponding to 21 distinct values, for example, physician referral, emergency room, and transfer from a hospital | 0 |
Time in hospital | time_in_hospital | Numeric | Integer number of days between admission and discharge | 0 |
Payer code | payer_code | Nominal | Integer identifier corresponding to 23 distinct values, for example, Blue Cross\Blue Shield, Medicare, and self-pay | 52 |
Medical specialty | medical_specialty | Nominal | Integer identifier of a specialty of the admitting physician, corresponding to 84 distinct values, for example, cardiology, internal medicine, family\general practice, and surgeon | 53 |
Number of lab procedures | num_lab_procedures | Numeric | Number of lab tests performed during the encounter | 0 |
Number of procedures | num_procedures | Numeric | Number of procedures (other than lab tests) performed during the encounter | 0 |
Number of medications | num_medications | Numeric | Number of distinct generic names administered during the encounter | 0 |
Number of outpatient visits | number_outpatient | Numeric | Number of outpatient visits of the patient in the year preceding the encounter | 0 |
Number of emergency visits | number_emergency | Numeric | Number of emergency visits of the patient in the year preceding the encounter | 0 |
Number of inpatient visits | number_inpatient | Numeric | Number of inpatient visits of the patient in the year preceding the encounter | 0 |
Diagnosis 1 | diag_1 | Nominal | The primary diagnosis (coded as first three digits of ICD9); 848 distinct values | 0 |
Diagnosis 2 | diag_2 | Nominal | Secondary diagnosis (coded as first three digits of ICD9); 923 distinct values | 0 |
Diagnosis 3 | diag_3 | Nominal | Additional secondary diagnosis (coded as first three digits of ICD9); 954 distinct values | 1 |
Number of diagnoses | number_diagnoses | Numeric | Number of diagnoses entered to the system | 0 |
Glucose serum test result | max_glu_serum | Nominal | Indicates the range of the result or if the test was not taken. Values: “>200,” “>300,” “normal,” and “none” if not measured | 0 |
A1c test result | A1Cresult | Nominal | Indicates the range of the result or if the test was not taken. Values: “>8” if the result was greater than 8%, “>7” if the result was greater than 7% but less than 8%, “normal” if the result was less than 7%, and “none” if not measured. | 0 |
Change of medications | change | Nominal | Indicates if there was a change in diabetic medications (either dosage or generic name). Values: “change” and “no change” | 0 |
Diabetes medications | diabetesMed | Nominal | Indicates if there was any diabetic medication prescribed. Values: “yes” and “no” | 0 |
24 features for medications | metformin repaglinide nateglinide chlorpropamide glimepiride acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone acarbose miglitol troglitazone tolazamide examide citoglipton insulin glyburide-metformin glipizide-metformin glimepiride-pioglitazone metformin-rosiglitazone metformin-pioglitazone | Nominal | For the generic names: metformin, repaglinide, nateglinide, chlorpropamide, glimepiride, acetohexamide, glipizide, glyburide, tolbutamide, pioglitazone, rosiglitazone, acarbose, miglitol, troglitazone, tolazamide, examide, sitagliptin, insulin, glyburide-metformin, glipizide-metformin, glimepiride-pioglitazone, metformin-rosiglitazone, and metformin-pioglitazone, the feature indicates whether the drug was prescribed or there was a change in the dosage. Values: “up” if the dosage was increased during the encounter, “down” if the dosage was decreased, “steady” if the dosage did not change, and “no” if the drug was not prescribed | 0 |
Readmitted | readmitted | Nominal | Days to inpatient readmission. Values: “<30” if the patient was readmitted in less than 30 days, “>30” if the patient was readmitted in more than 30 days, and “No” for no record of readmission. | 0 |
The last feature, readmitted, is the target.
# Loading all necessary libraries:
import zipfile
import missingno as msno
from tqdm import tqdm_notebook
import itertools
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency
from sklearn.impute import SimpleImputer  # sklearn >= 0.20 is required for SimpleImputer
from sklearn.model_selection import train_test_split, KFold
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
# We can read files without unzipping!
with zipfile.ZipFile("diabetes.zip") as z:
    with z.open("diabetic_data.csv") as f:
        data_df = pd.read_csv(f, encoding='utf-8')
data_df.head()
data_df.dtypes.head()
# Let's take a look at the data:
display(data_df.describe())
data_size = len(data_df)
print(f'Whole dataset size: {data_size}')
Since we have the entire dataset here, we need to split it into two parts, train and test, and never peek at the test target array. We will use the test target only to check our final solution.
The data may have been collected in chronological order, so, to make the experiment more realistic, we split the sample in half by row order rather than randomly.
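For comparison, a purely random split with the train_test_split imported above would look like the sketch below (the variable names are only illustrative and are not used later); we deliberately avoid it, because a split by row order is more realistic for possibly chronological data.
# Alternative (not used in this notebook): random split that ignores the row order
X_tr_rnd, X_te_rnd, y_tr_rnd, y_te_rnd = train_test_split(
    data_df.drop(columns='readmitted'), data_df['readmitted'],
    test_size=0.5, shuffle=True, random_state=42)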
total_len = len(data_df)
print('Total length: ', total_len)
split_coef = 0.5
split_number = int(total_len*split_coef)
print('Split number: ', split_number)
X_train = data_df.iloc[0:split_number]
X_test = data_df.iloc[split_number:]
y_train = X_train['readmitted']
y_test = X_test['readmitted']
X_train = X_train.drop(columns='readmitted')
X_test = X_test.drop(columns='readmitted')
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
# Also, for the baseline, let's convert the target to numeric in this way:
y_target = y_train.map({'<30':0, '>30':1, 'NO':2})
y_test = y_test.map({'<30':0, '>30':1, 'NO':2})
# Let's check for missing values:
for col in data_df:
    uniq_values = data_df[col].unique()
    if '?' in uniq_values:
        num_of_nan = len(data_df[data_df[col] == '?'])
        print(f'Feature {col}, missing: {num_of_nan} or {num_of_nan/data_size*100:.2f} %')
        # printing uniq_values here would show all unique values; missing values always appear as '?'
Here we found that missing values in the dataset are marked with '?'. Moreover, '?' appears not only in the features listed in the article, but also in the diag_1 and diag_2 features!
There are several methods to fill in the missing values (dropping rows or columns, filling with a constant or the most frequent value, model-based imputation, etc.).
A good example of using different methods: https://towardsdatascience.com/working-with-missing-data-in-machine-learning-9c0a430df4ce
An important point: we can't simply drop the rows with missing values. The model should be able to work with missing values, because we can't ignore a new patient just because he or she didn't indicate weight or race in the questionnaire.
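Besides imputation, two simple alternatives keep the fact of missingness as information; here is a minimal sketch on a hypothetical toy frame (with '?' as the missing marker, as in this dataset; the toy names are not used later):
import pandas as pd
toy = pd.DataFrame({'race': ['Caucasian', '?', 'Asian'],
                    'weight': ['?', '[75-100)', '?']})
# option 1: treat missingness as its own category
toy_unknown = toy.replace('?', 'Unknown')
# option 2: add indicator columns marking where a value was missing
for col in ['race', 'weight']:
    toy[col + '_missing'] = (toy[col] == '?').astype(int)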
# An interesting way to visualize missing values:
columns_nans = ['race', 'weight', 'payer_code', 'medical_specialty', 'diag_1', 'diag_2', 'diag_3']
imp = SimpleImputer(missing_values='?', strategy='constant', fill_value=np.nan)
data_df_nans = pd.DataFrame(imp.fit_transform(data_df[columns_nans]), columns=columns_nans)
msno.matrix(data_df_nans);
msno.heatmap(data_df_nans);
There is no correlation among the missing values (they don't appear simultaneously). Three features have too many missing values: weight, payer_code, and medical_specialty, with from 40 to 97% missing, so it can be unsafe to fill them with any values. Let's ignore them for the baseline and try different filling methods at the tuning stage.
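If we wanted to exclude these three features explicitly right away, a minimal sketch could look like this (the *_baseline names are only illustrative; X_train and X_test stay intact for the analysis below):
# features with too many missing values to impute safely
high_missing_features = ['weight', 'payer_code', 'medical_specialty']
X_train_baseline = X_train.drop(columns=high_missing_features)
X_test_baseline = X_test.drop(columns=high_missing_features)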
Let's try different filling methods: start with the simplest one for the baseline model, then come back here and try other methods for a more complex model. We will always write the changed data into new columns and drop the excess columns before using each model.
%%time
columns_nans = ['race', 'diag_1', 'diag_2', 'diag_3']
imp_most_frequent = SimpleImputer(missing_values='?', strategy='most_frequent', verbose=1)
X_train_nan_most_frequent = pd.DataFrame(imp_most_frequent.fit_transform(X_train[columns_nans]),
columns=[el+'_mf' for el in columns_nans] )
X_test_nan_most_frequent = pd.DataFrame(imp_most_frequent.transform(X_test[columns_nans]),
columns=[el+'_mf' for el in columns_nans] )
X_train = pd.concat([X_train, X_train_nan_most_frequent], axis=1)
X_test = pd.concat([X_test.reset_index(drop=True), X_test_nan_most_frequent], axis=1).set_index(X_test.index)
Let's do some data analysis. First we check the numeric data, then the categorical data, and finish with a categorical vs. numeric comparison. A very good overview of general methods for data exploration: https://www.analyticsvidhya.com/blog/2016/01/guide-data-exploration/
features_numeric = X_train.select_dtypes(include='int64').columns
features_categorical = X_train.select_dtypes(include='object').columns
print(features_numeric)
print(len(features_numeric))
print(features_categorical)
print(len(features_categorical))
Let's take a look at the numeric features first ...
X_train[features_numeric].drop(columns=['encounter_id', 'patient_nbr']).describe()
for col in features_numeric[2:]:
    print(col)
    print(X_train[col].value_counts())
    print(X_test[col].value_counts())
%%time
sns.set(style="whitegrid")
sns.set(rc={'figure.figsize':(10,10)})
#sns.boxplot(data=X_train[features_numeric].drop(columns=['encounter_id', 'patient_nbr']).iloc[:,[1,2,3,4,6,10]]);
#sns.swarmplot(data=X_train[features_numeric].drop(columns=['encounter_id', 'patient_nbr']).iloc[::100,[1,2,3,4,6,10]], color=".25")
sns.violinplot(data=X_train[features_numeric].drop(columns=['encounter_id', 'patient_nbr']).iloc[:,0:5]);
#sns.boxplot(data=X_train[features_numeric].drop(columns=['encounter_id', 'patient_nbr']).iloc[:,[0,5,7,8,9]]);
#sns.swarmplot(data=X_train[features_numeric].drop(columns=['encounter_id', 'patient_nbr']).iloc[::100,[0,5,7,8,9]], color=".25")
plt.figure()
sns.violinplot(data=X_train[features_numeric].drop(columns=['encounter_id', 'patient_nbr']).iloc[:,5:10]);
sns.set(rc={'figure.figsize':(15,5)})
for axis in range(0, len(X_train[features_numeric[2:]].columns), 3):
    cols = X_train[features_numeric[2:]].columns[axis:axis+3]
    f, axes = plt.subplots(1, 3, sharex=True)
    palette = "crimson"
    sns.distplot(X_train[cols[0]].values, color=palette, ax=axes[0]);
    try:
        sns.distplot(X_train[cols[1]].values, color=palette, ax=axes[1], label=cols[1]);
    except:
        pass
    try:
        sns.distplot(X_train[cols[2]].values, color=palette, ax=axes[2], label=cols[2]);
    except:
        pass
... and categorical.
features_ignored = ['weight', 'payer_code', 'medical_specialty', 'diag_1', 'diag_2', 'diag_3', 'race']
X_train_categorical = X_train[features_categorical].drop(columns=features_ignored)
features_categorical = [el for el in features_categorical if el not in features_ignored]
for axis in range(0, len(X_train_categorical.columns[:-3]), 3):
    cols = X_train_categorical.columns[axis:axis+3]
    f, axes = plt.subplots(1, 3)
    palette = "crimson"
    sns.countplot(X_train_categorical[cols[0]], color=palette, ax=axes[0]);
    try:
        sns.countplot(X_train_categorical[cols[1]], color=palette, ax=axes[1]);
    except:
        pass
    try:
        sns.countplot(X_train_categorical[cols[2]], color=palette, ax=axes[2]);
    except:
        pass
X_train_categorical[['diag_1_mf', 'diag_2_mf', 'diag_3_mf']] \
.apply(pd.Series.value_counts) \
.sort_values('diag_3_mf', ascending=False)
for col in X_train_categorical.columns[:-3]:
    print(col, X_train_categorical[col].unique())
    print('---'*10)
    print(X_train_categorical[col].value_counts())
constant_features = ['examide', 'citoglipton', 'glimepiride-pioglitazone', 'metformin-rosiglitazone', 'metformin-pioglitazone',
'acetohexamide',
'tolbutamide', 'miglitol', 'troglitazone', 'tolazamide', 'glipizide-metformin']
for col in constant_features:
    print(col)
    print(X_train[col].value_counts())
    print(X_test[col].value_counts())
    print('---'*10)
First of all, we can drop these columns: almost all of their values are the same (No), with only 1-2 values equal to Steady in each.
X_train.drop(columns=constant_features, inplace=True)
X_train_categorical.drop(columns=constant_features, inplace=True)
X_test.drop(columns=constant_features, inplace=True)
features_categorical = [el for el in features_categorical if el not in constant_features]
%%time
# this action can take about a minute
sns.pairplot( X_train[features_numeric].assign(target=y_target.values) );