Diabetes readmission

by Samokhvalov Mikhail, Moscow 2018

Part 1. Dataset and features description

1.1. Dataset description from Kaggle

https://www.kaggle.com/brandao/diabetes/home

Basic Explanation

It is important to know whether a patient will be readmitted to a hospital, because the treatment can then be changed in order to avoid the readmission.

In this database, you have 3 different outputs:

  • No readmission;
  • A readmission in less than 30 days (this situation is not good, because the treatment may not have been appropriate);
  • A readmission in more than 30 days (this one is not good either, although the reason may be the state of the patient).

In this context, you can define different objective functions for the problem. You can try to identify situations where the patient will not be readmitted, or where they are going to be readmitted in less than 30 days (because the problem may be the treatment), etc.

Content

"The data set represents 10 years (1999-2008) of clinical care at 130 US hospitals and integrated delivery networks. It includes over 50 features representing patient and hospital outcomes. Information was extracted from the database for encounters that satisfied the following criteria.

  • It is an inpatient encounter (a hospital admission).
  • It is a diabetic encounter, that is, one during which any kind of diabetes was entered to the system as a diagnosis.
  • The length of stay was at least 1 day and at most 14 days.
  • Laboratory tests were performed during the encounter.
  • Medications were administered during the encounter.

The data contains such attributes as patient number, race, gender, age, admission type, time in hospital, medical specialty of admitting physician, number of lab test performed, HbA1c test result, diagnosis, number of medication, diabetic medications, number of outpatient, inpatient, and emergency visits in the year before the hospitalization, etc."

Source

The data are submitted on behalf of the Center for Clinical and Translational Research, Virginia Commonwealth University, a recipient of NIH CTSA grant UL1 TR00058 and a recipient of the CERNER data. John Clore (jclore '@' vcu.edu), Krzysztof J. Cios (kcios '@' vcu.edu), Jon DeShazo (jpdeshazo '@' vcu.edu), and Beata Strack (strackb '@' vcu.edu). This data is a de-identified abstract of the Health Facts database (Cerner Corporation, Kansas City, MO).

1.2. Feature description

First of all, let's take the feature descriptions from the article and convert them to markdown for better readability. Let's also map them to the dataframe column names.

| Feature name | Name in dataframe | Type | Description and values | % missing |
|---|---|---|---|---|
| Encounter ID | encounter_id | Numeric | Unique identifier of an encounter | 0 |
| Patient number | patient_nbr | Numeric | Unique identifier of a patient | 0 |
| Race | race | Nominal | Values: Caucasian, Asian, African American, Hispanic, and other | 2 |
| Gender | gender | Nominal | Values: male, female, and unknown/invalid | 0 |
| Age | age | Nominal | Grouped in 10-year intervals: [0, 10), [10, 20), ..., [90, 100) | 0 |
| Weight | weight | Numeric | Weight in pounds | 97 |
| Admission type | admission_type_id | Nominal | Integer identifier corresponding to 9 distinct values, for example, emergency, urgent, elective, newborn, and not available | 0 |
| Discharge disposition | discharge_disposition_id | Nominal | Integer identifier corresponding to 29 distinct values, for example, discharged to home, expired, and not available | 0 |
| Admission source | admission_source_id | Nominal | Integer identifier corresponding to 21 distinct values, for example, physician referral, emergency room, and transfer from a hospital | 0 |
| Time in hospital | time_in_hospital | Numeric | Integer number of days between admission and discharge | 0 |
| Payer code | payer_code | Nominal | Integer identifier corresponding to 23 distinct values, for example, Blue Cross/Blue Shield, Medicare, and self-pay | 52 |
| Medical specialty | medical_specialty | Nominal | Integer identifier of a specialty of the admitting physician, corresponding to 84 distinct values, for example, cardiology, internal medicine, family/general practice, and surgeon | 53 |
| Number of lab procedures | num_lab_procedures | Numeric | Number of lab tests performed during the encounter | 0 |
| Number of procedures | num_procedures | Numeric | Number of procedures (other than lab tests) performed during the encounter | 0 |
| Number of medications | num_medications | Numeric | Number of distinct generic names administered during the encounter | 0 |
| Number of outpatient visits | number_outpatient | Numeric | Number of outpatient visits of the patient in the year preceding the encounter | 0 |
| Number of emergency visits | number_emergency | Numeric | Number of emergency visits of the patient in the year preceding the encounter | 0 |
| Number of inpatient visits | number_inpatient | Numeric | Number of inpatient visits of the patient in the year preceding the encounter | 0 |
| Diagnosis 1 | diag_1 | Nominal | The primary diagnosis (coded as first three digits of ICD9); 848 distinct values | 0 |
| Diagnosis 2 | diag_2 | Nominal | Secondary diagnosis (coded as first three digits of ICD9); 923 distinct values | 0 |
| Diagnosis 3 | diag_3 | Nominal | Additional secondary diagnosis (coded as first three digits of ICD9); 954 distinct values | 1 |
| Number of diagnoses | number_diagnoses | Numeric | Number of diagnoses entered to the system | 0 |
| Glucose serum test result | max_glu_serum | Nominal | Indicates the range of the result or if the test was not taken. Values: “>200,” “>300,” “normal,” and “none” if not measured | 0 |
| A1c test result | A1Cresult | Nominal | Indicates the range of the result or if the test was not taken. Values: “>8” if the result was greater than 8%, “>7” if the result was greater than 7% but less than 8%, “normal” if the result was less than 7%, and “none” if not measured | 0 |
| Change of medications | change | Nominal | Indicates if there was a change in diabetic medications (either dosage or generic name). Values: “change” and “no change” | 0 |
| Diabetes medications | diabetesMed | Nominal | Indicates if there was any diabetic medication prescribed. Values: “yes” and “no” | 0 |
| 24 features for medications | metformin, repaglinide, nateglinide, chlorpropamide, glimepiride, acetohexamide, glipizide, glyburide, tolbutamide, pioglitazone, rosiglitazone, acarbose, miglitol, troglitazone, tolazamide, examide, citoglipton, insulin, glyburide-metformin, glipizide-metformin, glimepiride-pioglitazone, metformin-rosiglitazone, metformin-pioglitazone | Nominal | For the generic names: metformin, repaglinide, nateglinide, chlorpropamide, glimepiride, acetohexamide, glipizide, glyburide, tolbutamide, pioglitazone, rosiglitazone, acarbose, miglitol, troglitazone, tolazamide, examide, sitagliptin, insulin, glyburide-metformin, glipizide-metformin, glimepiride-pioglitazone, metformin-rosiglitazone, and metformin-pioglitazone, the feature indicates whether the drug was prescribed or there was a change in the dosage. Values: “up” if the dosage was increased during the encounter, “down” if the dosage was decreased, “steady” if the dosage did not change, and “no” if the drug was not prescribed | 0 |
| Readmitted | readmitted | Nominal | Days to inpatient readmission. Values: “<30” if the patient was readmitted in less than 30 days, “>30” if the patient was readmitted in more than 30 days, and “No” for no record of readmission | 0 |

Output variable

The last feature, readmitted, is the target.
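The dataset description mentions that different objectives are possible, for example predicting only readmission in less than 30 days. A minimal sketch of such a recoding is shown here on a toy series. The labels 'NO', '>30' and '<30' are the ones that appear in the data, while the binary variable `early_readmission` is a hypothetical name introduced only for this illustration, not something used later in the notebook.

```python
import pandas as pd

# Toy target column with the three labels found in the dataset.
readmitted = pd.Series(['NO', '>30', '<30', 'NO', '<30'], name='readmitted')

# Hypothetical binary objective: 1 if readmitted in less than 30 days, else 0.
early_readmission = (readmitted == '<30').astype(int)

print(early_readmission.tolist())
```

With this coding, '>30' and 'NO' fall into the same class, which is only one of the possible ways to state the problem.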


Part 2. Exploratory data analysis

2.1. Loading data

In [837]:
# Loading all necessary libraries:
import zipfile
import missingno as msno
from tqdm import tqdm_notebook
import itertools

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

from sklearn.impute import SimpleImputer #sklearn 0.20.1 is necessary
from sklearn.model_selection import train_test_split, KFold
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
In [838]:
# We can read files without unzipping!
with zipfile.ZipFile("diabetes.zip") as z:
    with z.open("diabetic_data.csv") as f:
        data_df = pd.read_csv(f, encoding='utf-8')
In [839]:
data_df.head()
Out[839]:
encounter_id patient_nbr race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital ... citoglipton insulin glyburide-metformin glipizide-metformin glimepiride-pioglitazone metformin-rosiglitazone metformin-pioglitazone change diabetesMed readmitted
0 2278392 8222157 Caucasian Female [0-10) ? 6 25 1 1 ... No No No No No No No No No NO
1 149190 55629189 Caucasian Female [10-20) ? 1 1 7 3 ... No Up No No No No No Ch Yes >30
2 64410 86047875 AfricanAmerican Female [20-30) ? 1 1 7 2 ... No No No No No No No No Yes NO
3 500364 82442376 Caucasian Male [30-40) ? 1 1 7 2 ... No Up No No No No No Ch Yes NO
4 16680 42519267 Caucasian Male [40-50) ? 1 1 7 1 ... No Steady No No No No No Ch Yes NO

5 rows × 50 columns

In [840]:
data_df.dtypes.head()
Out[840]:
encounter_id     int64
patient_nbr      int64
race            object
gender          object
age             object
dtype: object
In [841]:
# Let's take a look at the data:
display(data_df.describe())
data_size = len(data_df)
print(f'Whole dataset size: {data_size}')
encounter_id patient_nbr admission_type_id discharge_disposition_id admission_source_id time_in_hospital num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient number_diagnoses
count 1.017660e+05 1.017660e+05 101766.000000 101766.000000 101766.000000 101766.000000 101766.000000 101766.000000 101766.000000 101766.000000 101766.000000 101766.000000 101766.000000
mean 1.652016e+08 5.433040e+07 2.024006 3.715642 5.754437 4.395987 43.095641 1.339730 16.021844 0.369357 0.197836 0.635566 7.422607
std 1.026403e+08 3.869636e+07 1.445403 5.280166 4.064081 2.985108 19.674362 1.705807 8.127566 1.267265 0.930472 1.262863 1.933600
min 1.252200e+04 1.350000e+02 1.000000 1.000000 1.000000 1.000000 1.000000 0.000000 1.000000 0.000000 0.000000 0.000000 1.000000
25% 8.496119e+07 2.341322e+07 1.000000 1.000000 1.000000 2.000000 31.000000 0.000000 10.000000 0.000000 0.000000 0.000000 6.000000
50% 1.523890e+08 4.550514e+07 1.000000 1.000000 7.000000 4.000000 44.000000 1.000000 15.000000 0.000000 0.000000 0.000000 8.000000
75% 2.302709e+08 8.754595e+07 3.000000 4.000000 7.000000 6.000000 57.000000 2.000000 20.000000 0.000000 0.000000 1.000000 9.000000
max 4.438672e+08 1.895026e+08 8.000000 28.000000 25.000000 14.000000 132.000000 6.000000 81.000000 42.000000 76.000000 21.000000 16.000000
Whole dataset size: 101766

2.2. Train test split

Since we have the entire dataset here, we need to split it into two parts, train and test, and never peek at the test target array. We will use the test target only for checking our final solution.

The data may have been collected in chronological order. Therefore, to make the experiment more realistic, we split the sample in half by row position, without shuffling.
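A minimal sketch of an order-preserving split on a toy frame is given here. `ordered_split` is a hypothetical helper written for illustration only. The sketch also counts patients that end up in both halves, under the assumption (worth checking on the real data) that one patient_nbr can occur in several encounters.

```python
import pandas as pd


def ordered_split(df, train_fraction=0.5):
    """Split rows by position, keeping the original order (no shuffling)."""
    split_at = int(len(df) * train_fraction)
    return df.iloc[:split_at], df.iloc[split_at:]


# Toy frame with the same identifier columns as the real data.
toy = pd.DataFrame({
    'encounter_id': [10, 11, 12, 13, 14, 15],
    'patient_nbr': [1, 2, 1, 3, 2, 4],
})

first_half, second_half = ordered_split(toy)

# Patients that appear in both halves are a possible source of leakage.
shared = set(first_half['patient_nbr']) & set(second_half['patient_nbr'])
print(len(first_half), len(second_half), sorted(shared))
```

If the overlap turns out to be large on the real data, a split grouped by patient may be a safer alternative.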

In [842]:
total_len = len(data_df)
print('Total length: ', total_len)
split_coef = 0.5
split_number = int(total_len*split_coef)
print('Split number: ', split_number)

X_train = data_df.iloc[0:split_number]
X_test = data_df.iloc[split_number:]

y_train = X_train['readmitted']
y_test = X_test['readmitted']

X_train = X_train.drop(columns='readmitted')
X_test = X_test.drop(columns='readmitted')

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

# For the baseline, let's also convert the targets to numeric codes:
y_target = y_train.map({'<30':0, '>30':1, 'NO':2})
y_test = y_test.map({'<30':0, '>30':1, 'NO':2})
Total length:  101766
Split number:  50883
(50883, 49) (50883,)
(50883, 49) (50883,)

2.3. Filling missings

In [843]:
# Let's check for missing values (they are marked as '?'):
for col in data_df.select_dtypes(include='object'):
    # Only string columns can contain '?'; skipping the numeric ones avoids
    # the elementwise-comparison FutureWarning.
    num_of_nan = (data_df[col] == '?').sum()
    if num_of_nan > 0:
        print(f'Feature {col}, missed: {num_of_nan} or {num_of_nan/data_size*100:.2f} %')
        # Printing data_df[col].unique() here shows all values; missing ones always appear as '?'
Feature race, missed: 2273 or 2.23 %
Feature weight, missed: 98569 or 96.86 %
Feature payer_code, missed: 40256 or 39.56 %
Feature medical_specialty, missed: 49949 or 49.08 %
Feature diag_1, missed: 21 or 0.02 %
Feature diag_2, missed: 358 or 0.35 %
Feature diag_3, missed: 1423 or 1.40 %

Missing values in the dataset are marked as '?'. Note that '?' appears not only in the features listed as incomplete in the article, but also in diag_1 and diag_2!
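Since the marker is always '?', one option is to turn it into NaN while reading the file. A minimal sketch with the `na_values` parameter of `pd.read_csv` follows; the CSV text is a made-up toy sample, not the real file.

```python
import io

import pandas as pd

# Toy CSV that uses '?' as the missing-value marker, as the real file does.
csv_text = (
    "race,weight,diag_1\n"
    "Caucasian,?,V45\n"
    "?,?,428\n"
    "Asian,[75-100),?\n"
)

# na_values tells pandas to parse '?' as NaN.
toy = pd.read_csv(io.StringIO(csv_text), na_values='?')

# Share of missing values per column, in percent.
missing_share = toy.isna().mean() * 100
print(missing_share.round(2))
```

After this, the usual pandas tools (`isna`, `fillna`, `dropna`) work without any special handling of '?'.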

There are several ways to handle missing values:

  1. drop rows with NaNs
  2. fill with a constant (0, -1, ...)
  3. fill with the mean/median/mode
  4. group by another feature and fill with the mean/median of the group
  5. build a model to predict the missing values
  6. use methods that can handle missing values natively

A good example of using different methods: https://towardsdatascience.com/working-with-missing-data-in-machine-learning-9c0a430df4ce

An important point: we can't just drop rows with missing values. The model should be able to work with missing values, because we can't ignore a new patient just because he/she didn't indicate weight or race in the questionnaire.
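Methods 2 to 4 from the list can be sketched with plain pandas. The frame below is a toy example: the column names are borrowed from the dataset, but the values, and the missing value in num_medications, are invented for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'gender': ['Female', 'Female', 'Male', 'Male', 'Female'],
    'race': ['Caucasian', np.nan, 'Asian', 'Caucasian', 'Caucasian'],
    'num_medications': [10.0, np.nan, 20.0, 30.0, 12.0],
})

# Method 2: a constant placeholder keeps "missing" as its own category.
race_constant = toy['race'].fillna('Unknown')

# Method 3: most frequent value (mode) for categories, median for numbers.
race_mode = toy['race'].fillna(toy['race'].mode()[0])
meds_median = toy['num_medications'].fillna(toy['num_medications'].median())

# Method 4: median within a group (here the group is gender).
group_median = toy.groupby('gender')['num_medications'].transform('median')
meds_group = toy['num_medications'].fillna(group_median)

print(race_constant[1], race_mode[1], meds_median[1], meds_group[1])
```

In a real experiment the fill values should be computed on the train part only and then applied to the test part.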

In [844]:
# An interesting way to visualize missing values:

columns_nans = ['race', 'weight', 'payer_code', 'medical_specialty', 'diag_1', 'diag_2', 'diag_3']

imp = SimpleImputer(missing_values='?', strategy='constant', fill_value=np.nan)

data_df_nans = pd.DataFrame(imp.fit_transform(data_df[columns_nans]), columns=columns_nans)
msno.matrix(data_df_nans);
msno.heatmap(data_df_nans);

There is no correlation between the missing values of different features (they don't appear simultaneously). Three features have too many missing values: weight, payer_code and medical_specialty, from 40% to 97%. So it can be unsafe to fill them with any values. Let's ignore them for the baseline and try different filling methods at the tuning stage.
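The same check can be done numerically: the missingno heatmap is based on nullity correlation, that is, the correlation between the 0/1 "is missing" indicators of the columns. A minimal sketch on a toy frame with invented values:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'race': ['Caucasian', np.nan, 'Asian', 'Other', np.nan, 'Asian'],
    'weight': [np.nan, np.nan, '[75-100)', np.nan, np.nan, np.nan],
    'diag_3': ['250', '401', np.nan, '428', '276', np.nan],
})

# 1 where the value is missing, 0 otherwise.
missing_flags = toy.isna().astype(int)

# Pairwise correlation of the indicators (nullity correlation).
nullity_corr = missing_flags.corr()

# Number of rows where race and diag_3 are missing at the same time.
both_missing = int((missing_flags['race'] & missing_flags['diag_3']).sum())

print(nullity_corr.round(2))
print(both_missing)
```

Values near zero, or negative as in this toy case, mean the gaps do not tend to occur in the same rows.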

Let's try different methods: start with the simplest one for the baseline model, then come back here and try other methods for a more complex model. We will always write changed data to new columns and drop the excess data before using each model.

  • Baseline model: fill with most frequent value
In [845]:
%%time
columns_nans = ['race', 'diag_1', 'diag_2', 'diag_3']
imp_most_frequent = SimpleImputer(missing_values='?', strategy='most_frequent', verbose=1)
X_train_nan_most_frequent = pd.DataFrame(imp_most_frequent.fit_transform(X_train[columns_nans]),
                                         columns=[el+'_mf' for el in columns_nans] )
X_test_nan_most_frequent = pd.DataFrame(imp_most_frequent.transform(X_test[columns_nans]),
                                         columns=[el+'_mf' for el in columns_nans] )

X_train = pd.concat([X_train, X_train_nan_most_frequent], axis=1)
X_test = pd.concat([X_test.reset_index(drop=True), X_test_nan_most_frequent], axis=1).set_index(X_test.index)
Wall time: 3.33 s

Part 3. Visual analysis of the features

3.1. Univariate analysis

Let's do some data analysis. First we check the numeric data, then the categorical data, and finish with a comparison of categorical vs numeric data. A very good overview of general methods for data analysis: https://www.analyticsvidhya.com/blog/2016/01/guide-data-exploration/

In [846]:
features_numeric = X_train.select_dtypes(include='int64').columns
features_categorical = X_train.select_dtypes(include='object').columns

print(features_numeric)
print(len(features_numeric))
print(features_categorical)
print(len(features_categorical))
Index(['encounter_id', 'patient_nbr', 'admission_type_id',
       'discharge_disposition_id', 'admission_source_id', 'time_in_hospital',
       'num_lab_procedures', 'num_procedures', 'num_medications',
       'number_outpatient', 'number_emergency', 'number_inpatient',
       'number_diagnoses'],
      dtype='object')
13
Index(['race', 'gender', 'age', 'weight', 'payer_code', 'medical_specialty',
       'diag_1', 'diag_2', 'diag_3', 'max_glu_serum', 'A1Cresult', 'metformin',
       'repaglinide', 'nateglinide', 'chlorpropamide', 'glimepiride',
       'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
       'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
       'tolazamide', 'examide', 'citoglipton', 'insulin',
       'glyburide-metformin', 'glipizide-metformin',
       'glimepiride-pioglitazone', 'metformin-rosiglitazone',
       'metformin-pioglitazone', 'change', 'diabetesMed', 'race_mf',
       'diag_1_mf', 'diag_2_mf', 'diag_3_mf'],
      dtype='object')
40

Let's take a look at the numeric features first ...

In [847]:
X_train[features_numeric].drop(columns=['encounter_id', 'patient_nbr']).describe()
Out[847]:
admission_type_id discharge_disposition_id admission_source_id time_in_hospital num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient number_diagnoses
count 50883.000000 50883.000000 50883.000000 50883.000000 50883.000000 50883.000000 50883.000000 50883.000000 50883.000000 50883.000000 50883.000000
mean 2.228839 4.269540 6.301024 4.554370 42.328224 1.369868 15.286048 0.222432 0.124894 0.582532 6.911483
std 1.621665 6.019788 4.779297 3.087251 19.323138 1.687201 8.131921 0.869751 0.623733 1.198313 2.023880
min 1.000000 1.000000 1.000000 1.000000 1.000000 0.000000 1.000000 0.000000 0.000000 0.000000 1.000000
25% 1.000000 1.000000 1.000000 2.000000 30.000000 0.000000 10.000000 0.000000 0.000000 0.000000 5.000000
50% 2.000000 1.000000 7.000000 4.000000 43.000000 1.000000 14.000000 0.000000 0.000000 0.000000 7.000000
75% 3.000000 5.000000 7.000000 6.000000 56.000000 2.000000 19.000000 0.000000 0.000000 1.000000 9.000000
max 8.000000 28.000000 20.000000 14.000000 129.000000 6.000000 81.000000 36.000000 42.000000 21.000000 9.000000
In [848]:
for col in features_numeric[2:]:
    print(col)
    print(X_train[col].value_counts())
    print(X_test[col].value_counts())
admission_type_id
1    24753
2     9859
3     8540
6     4199
5     3367
8      156
4        7
7        2
Name: admission_type_id, dtype: int64
1    29237
3    10329
2     8621
5     1418
6     1092
8      164
7       19
4        3
Name: admission_type_id, dtype: int64
discharge_disposition_id
1     29888
3      5968
6      5427
18     3668
2      1002
25      975
5       957
11      868
22      727
4       499
7       277
23      189
14      137
13      110
8        98
28       24
15       22
17       14
16       11
10        6
9         4
24        4
12        3
19        2
20        2
27        1
Name: discharge_disposition_id, dtype: int64
1     30346
3      7986
6      7475
22     1266
2      1126
11      774
7       346
4       316
13      289
14      235
5       227
23      223
28      115
24       44
15       41
18       23
9        17
25       14
8        10
19        6
27        4
Name: discharge_disposition_id, dtype: int64
admission_source_id
7     24545
1     13979
17     6086
4      2342
6      2083
2       775
5       619
3       166
20      160
9       118
8         8
14        1
10        1
Name: admission_source_id, dtype: int64
7     32949
1     15586
4       845
17      695
2       329
5       236
6       181
3        21
22       12
8         8
9         7
10        7
11        2
25        2
13        1
14        1
20        1
Name: admission_source_id, dtype: int64
time_in_hospital
3     8485
2     8356
1     6830
4     6829
5     4950
6     3873
7     2988
8     2413
9     1621
10    1334
11    1063
12     853
13     671
14     617
Name: time_in_hospital, dtype: int64
3     9271
2     8868
1     7378
4     7095
5     5016
6     3666
7     2871
8     1978
9     1381
10    1008
11     792
12     595
13     539
14     425
Name: time_in_hospital, dtype: int64
num_lab_procedures
43     1517
1      1338
44     1334
45     1228
46     1153
38     1134
47     1118
42     1080
39     1070
35     1062
41     1058
48     1058
40     1039
49     1033
37     1017
36      984
50      981
51      922
34      877
54      870
57      856
53      854
52      850
56      849
55      848
61      753
58      752
59      751
25      731
31      718
       ... 
85       72
86       61
87       42
88       41
90       36
89       34
95       29
93       28
91       26
94       24
92       23
97       16
96       16
98       15
101       6
102       5
105       4
103       4
100       4
99        3
106       3
104       2
120       1
114       1
113       1
129       1
111       1
109       1
108       1
107       1
Name: num_lab_procedures, Length: 114, dtype: int64
1      1870
43     1287
44     1162
40     1162
45     1148
38     1079
37     1062
41     1059
46     1036
42     1033
49     1033
39     1031
54     1018
51     1003
48     1000
56      990
52      988
55      988
47      988
36      978
58      956
53      948
50      943
60      898
57      891
61      885
59      873
35      845
29      807
34      800
       ... 
86       67
88       61
87       49
89       39
91       35
90       29
93       28
92       25
94       21
95       17
97       15
96       12
98       11
100       9
101       7
99        6
108       3
102       3
109       3
106       2
105       2
113       2
103       2
111       2
126       1
132       1
104       1
114       1
118       1
121       1
Name: num_lab_procedures, Length: 115, dtype: int64
num_procedures
0    22473
1    10664
2     6444
3     5177
4     2112
6     2107
5     1906
Name: num_procedures, dtype: int64
0    24179
1    10078
2     6273
3     4266
6     2847
4     2068
5     1172
Name: num_procedures, dtype: int64
num_medications
12    3230
13    3151
11    3098
10    2925
14    2891
15    2829
9     2745
16    2625
8     2522
17    2331
18    2086
7     2056
19    1888
20    1648
6     1584
21    1409
22    1223
5     1176
23    1047
4      879
24     866
25     793
26     672
27     596
3      538
28     480
29     407
30     337
31     296
2      278
      ... 
46      57
44      54
45      48
47      41
48      37
52      35
49      30
50      28
51      25
53      22
54      20
56      19
55      16
58      13
57      13
59      11
60      10
61       9
63       8
65       7
69       5
62       5
67       4
68       3
70       2
64       2
66       1
81       1
79       1
75       1
Name: num_medications, Length: 73, dtype: int64
15    2963
13    2935
14    2816
16    2805
12    2774
11    2697
17    2588
18    2437
10    2421
19    2190
9     2168
20    2043
8     1831
21    1821
22    1645
7     1428
23    1379
24    1243
6     1115
25    1095
26     936
5      841
27     836
28     753
29     593
4      538
30     512
31     416
32     377
3      362
      ... 
42      57
43      50
44      46
45      40
46      35
47      33
49      31
50      27
48      23
52      19
51      18
53      18
56      18
55      16
57      13
60      13
54      13
58      12
62      10
59       9
64       6
63       6
61       5
65       5
68       4
66       4
67       3
72       3
74       1
75       1
Name: num_medications, Length: 71, dtype: int64
number_outpatient
0     45171
1      3235
2      1117
3       648
4       344
5       177
6        69
7        32
8        28
9        16
10       12
11        9
12        5
13        4
14        4
15        3
16        3
21        1
36        1
35        1
17        1
20        1
29        1
Name: number_outpatient, dtype: int64
0     39856
1      5312
2      2477
3      1394
4       755
5       356
6       234
7       123
8        70
9        67
10       45
11       33
13       27
12       25
14       24
15       17
16       12
17        7
20        6
21        6
18        5
22        5
19        3
24        3
27        3
33        2
23        2
25        2
26        2
34        1
35        1
29        1
36        1
37        1
38        1
39        1
40        1
28        1
42        1
Name: number_outpatient, dtype: int64
number_emergency
0     46990
1      2722
2       658
3       237
4       133
5        46
6        27
7        25
8        17
9        10
10        8
11        3
22        2
25        1
42        1
13        1
16        1
28        1
Name: number_emergency, dtype: int64
0     43393
1      4955
2      1384
3       488
4       241
5       146
6        67
7        48
8        33
10       26
9        23
11       20
13       11
12       10
18        5
20        4
19        4
16        4
22        4
14        3
15        3
21        2
64        1
63        1
37        1
29        1
46        1
54        1
24        1
25        1
76        1
Name: number_emergency, dtype: int64
number_inpatient
0     34865
1      9381
2      3497
3      1530
4       726
5       363
6       213
7       122
8        65
9        45
10       29
11       15
12       12
13        5
14        4
15        4
16        4
18        1
17        1
21        1
Name: number_inpatient, dtype: int64
0     32765
1     10140
2      4069
3      1881
4       896
5       449
6       267
7       146
8        86
9        66
11       34
10       32
12       22
13       15
14        6
15        5
16        2
19        2
Name: number_inpatient, dtype: int64
number_diagnoses
9    18209
5     8330
6     6063
7     5824
8     5806
4     3713
3     1994
2      774
1      170
Name: number_diagnoses, dtype: int64
9     31265
8      4810
7      4569
6      4098
5      3063
4      1824
3       841
2       249
1        49
16       45
10       17
13       16
11       11
15       10
12        9
14        7
Name: number_diagnoses, dtype: int64
In [849]:
%%time
sns.set(style="whitegrid")
sns.set(rc={'figure.figsize':(10,10)})
#sns.boxplot(data=X_train[features_numeric].drop(columns=['encounter_id', 'patient_nbr']).iloc[:,[1,2,3,4,6,10]]);
#sns.swarmplot(data=X_train[features_numeric].drop(columns=['encounter_id', 'patient_nbr']).iloc[::100,[1,2,3,4,6,10]], color=".25")
sns.violinplot(data=X_train[features_numeric].drop(columns=['encounter_id', 'patient_nbr']).iloc[:,0:5]);
#sns.boxplot(data=X_train[features_numeric].drop(columns=['encounter_id', 'patient_nbr']).iloc[:,[0,5,7,8,9]]);
#sns.swarmplot(data=X_train[features_numeric].drop(columns=['encounter_id', 'patient_nbr']).iloc[::100,[0,5,7,8,9]], color=".25")
plt.figure()
sns.violinplot(data=X_train[features_numeric].drop(columns=['encounter_id', 'patient_nbr']).iloc[:,5:10]);
Wall time: 1.37 s
In [850]:
sns.set(rc={'figure.figsize':(15,5)})
for axis in range(0,len(X_train[features_numeric[2:]].columns),3):
    cols = X_train[features_numeric[2:]].columns[axis:axis+3]
    f, axes = plt.subplots(1, 3, sharex=True)
    palette = "crimson"
    # zip stops at the shorter sequence, so a last group with fewer than
    # three columns simply leaves the remaining axes empty.
    for ax, col in zip(axes, cols):
        sns.distplot( X_train[col].values , color=palette, ax=ax, label=col);

... and categorical.

In [851]:
features_ignored = ['weight', 'payer_code', 'medical_specialty', 'diag_1', 'diag_2', 'diag_3', 'race']
X_train_categorical = X_train[features_categorical].drop(columns=features_ignored)
features_categorical = [el for el in features_categorical if el not in features_ignored]
In [852]:
for axis in range(0,len(X_train_categorical.columns[:-3]),3):
    cols = X_train_categorical.columns[axis:axis+3]
    f, axes = plt.subplots(1, 3)
    palette = "crimson"
    # One count plot per column of the current group of three.
    for ax, col in zip(axes, cols):
        sns.countplot(X_train_categorical[col] , color=palette, ax=ax);
In [853]:
X_train_categorical[['diag_1_mf', 'diag_2_mf', 'diag_3_mf']] \
                    .apply(pd.Series.value_counts) \
                    .sort_values('diag_3_mf', ascending=False)
Out[853]:
diag_1_mf diag_2_mf diag_3_mf
250 96.0 3436.0 7430.0
401 173.0 2062.0 4502.0
428 3530.0 3806.0 2261.0
276 967.0 3221.0 2253.0
427 1276.0 2546.0 1982.0
414 3897.0 1378.0 1939.0
496 44.0 1771.0 1322.0
403 238.0 1593.0 1059.0
272 4.0 199.0 966.0
599 653.0 1506.0 899.0
V45 5.0 225.0 813.0
250.01 37.0 1293.0 768.0
780 978.0 788.0 693.0
707 115.0 978.0 687.0
250.6 614.0 565.0 644.0
250.02 362.0 1029.0 608.0
424 100.0 594.0 603.0
425 51.0 767.0 562.0
285 156.0 587.0 523.0
305 11.0 362.0 484.0
584 568.0 503.0 425.0
682 985.0 700.0 389.0
41 2.0 182.0 387.0
518 374.0 557.0 387.0
493 459.0 447.0 371.0
278 236.0 135.0 351.0
585 32.0 308.0 313.0
530 278.0 147.0 309.0
244 7.0 89.0 307.0
250.4 159.0 96.0 296.0
... ... ... ...
952 1.0 1.0 NaN
955 1.0 NaN NaN
964 3.0 NaN NaN
968 7.0 1.0 NaN
973 2.0 NaN NaN
974 NaN 1.0 NaN
977 5.0 1.0 NaN
980 1.0 1.0 NaN
982 1.0 NaN NaN
983 2.0 NaN NaN
986 2.0 NaN NaN
989 7.0 1.0 NaN
990 1.0 NaN NaN
992 3.0 1.0 NaN
994 1.0 NaN NaN
E813 NaN 1.0 NaN
E814 NaN 1.0 NaN
E854 NaN 1.0 NaN
E868 NaN 1.0 NaN
E881 NaN 1.0 NaN
E890 NaN 1.0 NaN
E900 NaN 1.0 NaN
E915 NaN 1.0 NaN
E918 NaN 1.0 NaN
V13 NaN 1.0 NaN
V26 2.0 NaN NaN
V56 8.0 NaN NaN
V61 NaN 1.0 NaN
V67 1.0 NaN NaN
V71 3.0 NaN NaN

825 rows × 3 columns

In [854]:
for col in X_train_categorical.columns[:-3]:
    print(col, X_train_categorical[col].unique())
    print('---'*10)
    print(X_train_categorical[col].value_counts())
gender ['Female' 'Male' 'Unknown/Invalid']
------------------------------
Female             27486
Male               23396
Unknown/Invalid        1
Name: gender, dtype: int64
age ['[0-10)' '[10-20)' '[20-30)' '[30-40)' '[40-50)' '[50-60)' '[60-70)'
 '[70-80)' '[80-90)' '[90-100)']
------------------------------
[70-80)     13316
[60-70)     11059
[50-60)      8927
[80-90)      7681
[40-50)      5137
[30-40)      2093
[90-100)     1207
[20-30)       853
[10-20)       472
[0-10)        138
Name: age, dtype: int64
max_glu_serum ['None' '>300' 'Norm' '>200']
------------------------------
None    46235
Norm     2361
>200     1375
>300      912
Name: max_glu_serum, dtype: int64
A1Cresult ['None' '>7' '>8' 'Norm']
------------------------------
None    43168
>8       4397
Norm     1723
>7       1595
Name: A1Cresult, dtype: int64
metformin ['No' 'Steady' 'Up' 'Down']
------------------------------
No        41635
Steady     8381
Up          579
Down        288
Name: metformin, dtype: int64
repaglinide ['No' 'Up' 'Steady' 'Down']
------------------------------
No        50281
Steady      530
Up           48
Down         24
Name: repaglinide, dtype: int64
nateglinide ['No' 'Steady' 'Down' 'Up']
------------------------------
No        50613
Steady      262
Up            5
Down          3
Name: nateglinide, dtype: int64
chlorpropamide ['No' 'Steady' 'Down' 'Up']
------------------------------
No        50816
Steady       61
Up            5
Down          1
Name: chlorpropamide, dtype: int64
glimepiride ['No' 'Steady' 'Down' 'Up']
------------------------------
No        48584
Steady     2028
Up          178
Down         93
Name: glimepiride, dtype: int64
acetohexamide ['No' 'Steady']
------------------------------
No        50882
Steady        1
Name: acetohexamide, dtype: int64
glipizide ['No' 'Steady' 'Up' 'Down']
------------------------------
No        44335
Steady     5768
Up          471
Down        309
Name: glipizide, dtype: int64
glyburide ['No' 'Steady' 'Up' 'Down']
------------------------------
No        44619
Steady     5381
Up          529
Down        354
Name: glyburide, dtype: int64
tolbutamide ['No' 'Steady']
------------------------------
No        50865
Steady       18
Name: tolbutamide, dtype: int64
pioglitazone ['No' 'Steady' 'Up' 'Down']
------------------------------
No        47779
Steady     2921
Up          129
Down         54
Name: pioglitazone, dtype: int64
rosiglitazone ['No' 'Steady' 'Up' 'Down']
------------------------------
No        47357
Steady     3370
Up          104
Down         52
Name: rosiglitazone, dtype: int64
acarbose ['No' 'Steady' 'Up']
------------------------------
No        50749
Steady      127
Up            7
Name: acarbose, dtype: int64
miglitol ['No' 'Steady' 'Down']
------------------------------
No        50866
Steady       16
Down          1
Name: miglitol, dtype: int64
troglitazone ['No' 'Steady']
------------------------------
No        50880
Steady        3
Name: troglitazone, dtype: int64
tolazamide ['No' 'Steady' 'Up']
------------------------------
No        50846
Steady       36
Up            1
Name: tolazamide, dtype: int64
examide ['No']
------------------------------
No    50883
Name: examide, dtype: int64
citoglipton ['No']
------------------------------
No    50883
Name: citoglipton, dtype: int64
insulin ['No' 'Up' 'Steady' 'Down']
------------------------------
No        26192
Steady    16225
Down       4667
Up         3799
Name: insulin, dtype: int64
glyburide-metformin ['No' 'Steady' 'Down' 'Up']
------------------------------
No        50697
Steady      175
Down          6
Up            5
Name: glyburide-metformin, dtype: int64
glipizide-metformin ['No' 'Steady']
------------------------------
No        50879
Steady        4
Name: glipizide-metformin, dtype: int64
glimepiride-pioglitazone ['No']
------------------------------
No    50883
Name: glimepiride-pioglitazone, dtype: int64
metformin-rosiglitazone ['No']
------------------------------
No    50883
Name: metformin-rosiglitazone, dtype: int64
metformin-pioglitazone ['No']
------------------------------
No    50883
Name: metformin-pioglitazone, dtype: int64
change ['No' 'Ch']
------------------------------
No    30168
Ch    20715
Name: change, dtype: int64
diabetesMed ['No' 'Yes']
------------------------------
Yes    38024
No     12859
Name: diabetesMed, dtype: int64
race_mf ['Caucasian' 'AfricanAmerican' 'Other' 'Asian' 'Hispanic']
------------------------------
Caucasian          37625
AfricanAmerican    11343
Hispanic            1027
Other                620
Asian                268
Name: race_mf, dtype: int64
In [855]:
# Medication columns that look (almost) constant in the value counts above
constant_features = ['examide', 'citoglipton', 'glimepiride-pioglitazone',
                     'metformin-rosiglitazone', 'metformin-pioglitazone',
                     'acetohexamide', 'tolbutamide', 'miglitol', 'troglitazone',
                     'tolazamide', 'glipizide-metformin']
# Compare the value counts of each candidate column in train and in test
for col in constant_features:
    print(col)
    print(X_train[col].value_counts())
    print(X_test[col].value_counts())
    print('---'*10)
examide
No    50883
Name: examide, dtype: int64
No    50883
Name: examide, dtype: int64
------------------------------
citoglipton
No    50883
Name: citoglipton, dtype: int64
No    50883
Name: citoglipton, dtype: int64
------------------------------
glimepiride-pioglitazone
No    50883
Name: glimepiride-pioglitazone, dtype: int64
No        50882
Steady        1
Name: glimepiride-pioglitazone, dtype: int64
------------------------------
metformin-rosiglitazone
No    50883
Name: metformin-rosiglitazone, dtype: int64
No        50881
Steady        2
Name: metformin-rosiglitazone, dtype: int64
------------------------------
metformin-pioglitazone
No    50883
Name: metformin-pioglitazone, dtype: int64
No        50882
Steady        1
Name: metformin-pioglitazone, dtype: int64
------------------------------
acetohexamide
No        50882
Steady        1
Name: acetohexamide, dtype: int64
No    50883
Name: acetohexamide, dtype: int64
------------------------------
tolbutamide
No        50865
Steady       18
Name: tolbutamide, dtype: int64
No        50878
Steady        5
Name: tolbutamide, dtype: int64
------------------------------
miglitol
No        50866
Steady       16
Down          1
Name: miglitol, dtype: int64
No        50862
Steady       15
Down          4
Up            2
Name: miglitol, dtype: int64
------------------------------
troglitazone
No        50880
Steady        3
Name: troglitazone, dtype: int64
No    50883
Name: troglitazone, dtype: int64
------------------------------
tolazamide
No        50846
Steady       36
Up            1
Name: tolazamide, dtype: int64
No        50881
Steady        2
Name: tolazamide, dtype: int64
------------------------------
glipizide-metformin
No        50879
Steady        4
Name: glipizide-metformin, dtype: int64
No        50874
Steady        9
Name: glipizide-metformin, dtype: int64
------------------------------

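First of all, we can drop these columns because they are constant or nearly constant. Almost every row has the value No, in both train and test. The other values (Steady, Up, Down) appear in only a handful of rows: at most 37 of the 50,883 training rows (tolazamide), and often just 1-2 or none at all.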

In [856]:
# Drop the near-constant columns from every frame, then from the categorical feature list
X_train.drop(columns=constant_features, inplace=True)
X_train_categorical.drop(columns=constant_features, inplace=True)
X_test.drop(columns=constant_features, inplace=True)

features_categorical = [el for el in features_categorical if el not in constant_features]
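
The list of near-constant columns above was put together by eye from the printed value counts. As a minimal sketch (not the method used in this notebook), the same kind of selection could be automated by checking how much of each column is covered by its most frequent value. The helper name `near_constant_columns`, the `threshold` parameter and the toy frame are assumptions introduced for illustration only. For the medication columns whose train counts are printed above, a threshold of 0.999 would pick the same eleven columns: tolazamide (50846 of 50883 rows are No) is above it, while chlorpropamide (50816 of 50883) is below it.

```python
import pandas as pd


def near_constant_columns(df, threshold=0.999):
    """Return columns whose most frequent value covers at least `threshold` of the rows.

    Hypothetical helper for illustration; it is not part of the original notebook.
    """
    selected = []
    for col in df.columns:
        # value_counts sorts by frequency (descending), so the first entry is the top share
        top_share = df[col].value_counts(normalize=True, dropna=False).iloc[0]
        if top_share >= threshold:
            selected.append(col)
    return selected


# Toy frame standing in for X_train (the values are illustrative, not taken from the dataset)
toy = pd.DataFrame({
    "examide": ["No"] * 2000,                                   # fully constant
    "tolazamide": ["No"] * 1999 + ["Steady"],                   # share of No = 0.9995
    "chlorpropamide": ["No"] * 1990 + ["Steady"] * 10,          # share of No = 0.995
    "insulin": ["No"] * 1000 + ["Steady"] * 600 + ["Up"] * 400,
})

candidates = near_constant_columns(toy)
print(candidates)
```

The threshold is a judgment call: a stricter value keeps rare but possibly informative medication changes, a looser one removes more columns. It still makes sense to look at the test set counts as well, as done above, before dropping anything.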

3.2. Bi-variate Analysis

3.2.1. Continuous & Continuous

In [857]:
%%time
# this can take up to a minute
sns.pairplot(X_train[features_numeric].assign(target=y_target.values));
Wall time: 27.5 s
Out[857]:
<seaborn.axisgrid.PairGrid at 0x270ae049cc0>