Lab 1: Exploring the mysterious patient dataset.

In this first lab we will start with data exploration in Pandas. We will explore a medical dataset where the only thing we know is that each row is a patient. The last column (279) is the medical condition (coded with an integer) and we are trying to understand what the other features mean.

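Let's start by importing a few standard libraries and loading the dataset with the pandas read_csv function. The file has no header row, so we pass header=None, and missing values are coded as '?', so we pass na_values='?'.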

In [55]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.preprocessing import StandardScaler
path='PatientData.csv'
df=pd.read_csv(path,header=None,na_values='?')
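
To see what these two arguments do, here is a minimal sketch on a made-up three-row sample (the values are illustrative and are not read from PatientData.csv). With header=None, pandas labels the columns 0, 1, 2, ..., which is why the lab refers to columns as df[0], df[1], and so on; na_values='?' turns each '?' cell into NaN.

```python
import io
import pandas as pd

# Hypothetical three-row sample, not taken from PatientData.csv.
sample_csv = io.StringIO("75,0,190\n56,1,?\n54,0,172\n")

sample = pd.read_csv(sample_csv, header=None, na_values='?')

print(sample.columns.tolist())     # integer column labels: [0, 1, 2]
print(sample.isna().sum().sum())   # number of missing cells: 1
```

On the real table, df.isna().sum() would give the number of missing entries per column in the same way.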

The Pandas DataFrame df now contains our data as a table. We can take a quick look at the first rows using

In [56]:
df.head()
Out[56]:
0 1 2 3 4 5 6 7 8 9 ... 270 271 272 273 274 275 276 277 278 279
0 75 0 190 80 91 193 371 174 121 -16 ... 0.0 9.0 -0.9 0.0 0.0 0.9 2.9 23.3 49.4 8
1 56 1 165 64 81 174 401 149 39 25 ... 0.0 8.5 0.0 0.0 0.0 0.2 2.1 20.4 38.8 6
2 54 0 172 95 138 163 386 185 102 96 ... 0.0 9.5 -2.4 0.0 0.0 0.3 3.4 12.3 49.0 10
3 55 0 175 94 100 202 380 179 143 28 ... 0.0 12.2 -2.2 0.0 0.0 0.4 2.6 34.6 61.6 1
4 75 0 190 80 88 181 360 177 103 -16 ... 0.0 13.1 -3.6 0.0 0.0 -0.1 3.9 25.4 62.8 7

5 rows × 280 columns

In [57]:
df.shape  #See how many patients and how many features we have. 
Out[57]:
(452, 280)

A first visualization is to simply plot the first column against the row index, as if it were a time series

In [58]:
plt.plot(df[0]) 
Out[58]:
[<matplotlib.lines.Line2D at 0x119a4dcd0>]

The values scatter noisily around 50, with no visible pattern across rows. We can't tell much more from this plot.

Let's explore the values df[0] takes instead.

In [59]:
df[0].value_counts()[:10]  #The 10 most frequent values
Out[59]:
46    15
36    14
37    14
47    14
44    13
35    13
45    13
40    12
50    12
57    12
Name: 0, dtype: int64

OK, these are all integers. The most frequent value is 46, which appears in 15 patients, followed by 36, 37 and 47 with 14 patients each.
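
Note that value_counts orders its result by frequency, not by the value itself. A minimal sketch on made-up ages (the Series below is hypothetical) shows the difference between the most frequent values, the values in sorted order, and the largest values:

```python
import pandas as pd

# Hypothetical ages, chosen only to make the ordering obvious.
ages = pd.Series([46, 46, 46, 36, 36, 80, 1])

by_frequency = ages.value_counts()            # sorted by count, descending
by_value = ages.value_counts().sort_index()   # sorted by the value itself

print(by_frequency.index[0])       # most frequent value: 46
print(by_value.index.tolist())     # [1, 36, 46, 80]
print(ages.nlargest(2).tolist())   # two largest values: [80, 46]
```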

Let's make a histogram

In [60]:
df[0].hist(bins=10)
Out[60]:
<matplotlib.axes._subplots.AxesSubplot at 0x1192bc5d0>



The values range from 0 to about 80, with a peak around 45. Since these are patients, our conclusion is:

df[0] must be the age of each patient!
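
The same reading can be done numerically rather than from the picture. The sketch below uses made-up ages (not the real column) and numpy.histogram, which splits the range from the minimum to the maximum into equal-width bins; this should correspond to the bins that hist(bins=10) draws.

```python
import numpy as np
import pandas as pd

# Hypothetical ages; the real column has 452 entries.
ages = pd.Series([0, 1, 23, 35, 36, 44, 45, 46, 46, 47, 57, 80])

counts, edges = np.histogram(ages, bins=10)  # 10 equal-width bins from min to max

print(edges[0], edges[-1])   # bin range: 0.0 80.0
print(counts.sum())          # every value lands in exactly one bin: 12
peak = counts.argmax()       # index of the fullest bin
print(edges[peak], edges[peak + 1])   # the fullest bin covers 40.0 to 48.0
```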


Moving to the second column


Our most reliable exploration tool so far is the value counts

In [61]:
df[1].value_counts()
Out[61]:
1    249
0    203
Name: 1, dtype: int64

This is a binary feature that is roughly balanced (249 versus 203 patients). We will need help from another feature to decide what it encodes.
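
To put a number on the balance, value_counts can return fractions instead of counts. The sketch below rebuilds a column from the two counts reported above rather than reading the file, so the Series itself is an assumption; on the real data the call would be df[1].value_counts(normalize=True).

```python
import pandas as pd

# Rebuild a column with the counts reported above: 249 ones and 203 zeros.
binary = pd.Series([1] * 249 + [0] * 203)

shares = binary.value_counts(normalize=True)   # fractions instead of counts
print(shares.round(3).to_dict())   # {1: 0.551, 0: 0.449}
```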

Let's keep exploring with the third column. Plotting it against the row index shows something interesting:

In [62]:
plt.plot(df[2])
Out[62]:
[<matplotlib.lines.Line2D at 0x11d0d0b10>]



Most values are well below 200, but there are some extreme outliers. Let's find them and look at them more carefully:

In [63]:
df[ df[2]>200]  #select and show the rows where df[2] is big
Out[63]:
0 1 2 3 4 5 6 7 8 9 ... 270 271 272 273 274 275 276 277 278 279
141 1 1 780 6 85 165 237 150 106 88 ... 0.0 5.0 -4.6 0.0 0.0 1.3 0.7 2.7 5.5 5
316 0 0 608 10 83 126 232 128 60 125 ... -0.7 4.5 -5.5 0.0 0.0 0.5 2.5 -11.8 1.7 5

2 rows × 280 columns
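
The selection above works in two steps: the comparison builds a boolean Series, and indexing the DataFrame with that Series keeps only the rows marked True. A minimal sketch on a made-up table (the name toy and its values are hypothetical):

```python
import pandas as pd

# Hypothetical mini-table with integer column labels, as in df.
toy = pd.DataFrame({0: [75, 1, 56, 0], 2: [190, 780, 165, 608]})

mask = toy[2] > 200      # boolean Series, True where the value is large
outliers = toy[mask]     # keeps only the rows where mask is True

print(mask.tolist())             # [False, True, False, True]
print(outliers.index.tolist())   # original row labels are kept: [1, 3]
```

Because the original row labels survive the selection, the row numbers shown in the output (141 and 316) identify the patients in the full table.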



So patients 141 and 316 are outliers for this feature. Let's see the value counts:

In [64]:
df[2].value_counts()[:10]
Out[64]:
160    81
165    46
170    40
155    23
175    21
156    19
163    16
162    15
168    15
172    14
Name: 2, dtype: int64



The key thing to realize here is that this is height in centimeters: the most common values (160, 165, 170) are typical adult heights.

This also explains the outliers: the two patients with heights 780 and 608 have ages 1 and 0. They must be babies, and their heights were probably entered incorrectly.



Decoding the second column using the new information


We can now decode what the 0 and 1 must mean in the binary feature df[1].

We will compute the average of the other features for each value of df[1]. Let's keep only the first six columns (0 to 5):

In [71]:
df2=df[[0,1,2,3,4,5]]

print(df2.head())
print(df2.shape)
    0  1    2   3    4    5
0  75  0  190  80   91  193
1  56  1  165  64   81  174
2  54  0  172  95  138  163
3  55  0  175  94  100  202
4  75  0  190  80   88  181
(452, 6)
In [88]:
df3=df2.groupby(1).mean()
df3
Out[88]:
0 2 3 4 5
1
0 47.546798 171.315271 72.724138 94.650246 157.472906
1 45.594378 162.008032 64.457831 84.248996 153.261044


groupby is a handy DataFrame method that splits the rows into groups by their value on one feature; here we then average every other column within each group (it can also do many other things). The output says that for the rows where df[1]==0, the average height (i.e. df[2]) is 171 cm, while for the rows where df[1]==1 it is 162 cm.
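
A minimal sketch of the same operation on a made-up four-row table (the name toy and its values are hypothetical) makes the mechanics easy to check by hand:

```python
import pandas as pd

# Hypothetical rows: column 1 is the binary feature, column 2 a height.
toy = pd.DataFrame({1: [0, 0, 1, 1], 2: [170, 172, 160, 164]})

means = toy.groupby(1).mean()   # one row per distinct value of column 1

print(means[2].to_dict())   # {0: 171.0, 1: 162.0}
```

The grouping column becomes the index of the result, which is why column 1 no longer appears among the averaged columns in the output above.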

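The National Center for Health Statistics lists the average weight and height as 88.5 kg and 176 cm for men, and 75 kg and 162 cm for women.

In our data, the group with df[1]==0 is taller (171 cm versus 162 cm) and, assuming df[3] is weight in kilograms, heavier (72.7 versus 64.5). So we conclude that df[1]==0 must indicate male patients and df[1]==1 female patients.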

To compare with the CDC data more carefully, we should calculate the average height and weight of patients aged 20 or older only:

In [87]:
df4= df2[ df[0]>19].groupby(1).mean()
df4
Out[87]:
0 2 3 4 5
1
0 50.673797 170.994652 75.631016 94.417112 159.593583
1 47.679487 160.102564 66.106838 83.414530 152.452991

Strange! The average heights became smaller for both men and women.

This is because we removed the two giant babies (observe above that one was male and the other female), which were inflating the averages.

Reassuringly, the average weight (df[3]) increased a bit, as expected.
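
The effect of the two outliers on the group averages can be reproduced on a made-up table. In the sketch below all values are hypothetical and the effect is exaggerated, because each group has only three rows:

```python
import pandas as pd

# Hypothetical rows: column 0 = age, 1 = binary feature, 2 = height.
toy = pd.DataFrame({
    0: [40, 50, 0, 30, 60, 1],
    1: [0, 0, 0, 1, 1, 1],
    2: [170, 172, 608, 160, 164, 780],
})

all_means = toy.groupby(1)[2].mean()                  # outliers included
adult_means = toy[toy[0] > 19].groupby(1)[2].mean()   # ages 20 and up only

print(adult_means.to_dict())             # {0: 171.0, 1: 162.0}
print((adult_means < all_means).all())   # True: both averages dropped
```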

What about the other features? df[4] and df[5] turn out to be QRS duration and P-R interval: this is the UCI Arrhythmia dataset.


