#!/usr/bin/env python
# coding: utf-8

# Lab 1: Exploring the mysterious patient dataset

# In this first lab we will start with data exploration in Pandas. We will explore a medical dataset where the only thing we know is that each row is a patient. The last column (279) is the medical condition (coded with an integer), and we are trying to understand what the other features mean.

# Let's start by importing a few standard libraries and loading the dataset with the pd.read_csv function.

# In[55]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
get_ipython().run_line_magic('matplotlib', 'inline')
from sklearn.preprocessing import StandardScaler

path = 'PatientData.csv'
df = pd.read_csv(path, header=None, na_values='?')

# The Pandas DataFrame df now contains our data as a table. We can have a quick look using

# In[56]:

df.head()

# In[57]:

df.shape  # See how many patients and how many features we have.

# A first visualization is to simply plot the first column against the row index, as if it were a time series

# In[58]:

plt.plot(df[0])

# The values scatter noisily around 50. We can't tell much more from this plot.
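# The na_values='?' argument passed to read_csv tells Pandas to treat every '?' entry as a missing value.
# A minimal sketch of that behaviour follows. The three-row sample is hypothetical and only stands in
# for PatientData.csv; the names sample_csv, sample and missing_per_column are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for PatientData.csv (not the real data).
sample_csv = io.StringIO("75,0,190,?\n56,1,165,64\n54,0,?,95\n")

# header=None: the file has no header row, so the columns are labelled 0, 1, 2, ...
# na_values='?': every '?' entry is read as a missing value (NaN).
sample = pd.read_csv(sample_csv, header=None, na_values='?')

# Count the missing entries in each column.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Columns that contain NaN are read as floats rather than integers.
print(sample.dtypes)
```

# Running df.isna().sum() on the real DataFrame would show, in the same way, which features have missing entries.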

# Let's try to explore the values df[0] takes instead.

# In[59]:

df[0].value_counts().head(10)  # The 10 most frequent values

# OK, these are always integers. The most frequent value is 46, which appears in 15 patients, followed by 36, and so on.
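# value_counts orders its result by frequency, not by the value itself, which is why the first rows are
# the most common values rather than the largest ones. A small sketch on invented data (the ages Series
# below is hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical values, invented for illustration only.
ages = pd.Series([46, 36, 46, 52, 36, 46, 60])

counts = ages.value_counts()     # sorted by frequency, most common value first
print(counts.head(2))            # the two most frequent values

by_value = counts.sort_index()   # same counts, reordered by the value itself
print(by_value)
```

# Sorting by index is handy when we care about the range of the values rather than about which ones are most common.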

# Let's make a histogram

# In[60]:

df[0].hist(bins=10)

# The values range from 0 to about 80, with a peak around 45. Since these are patients, our conclusion:

# df[0] must be the age of each patient!

# Moving to the second column

# Our most reliable exploration tool is to look at the value counts

# In[61]:

df[1].value_counts()

# This is a binary feature that seems balanced. We will need the help of another feature to decide what it encodes.

# Let's keep exploring with the third column. Plotting it as a time series shows something interesting:

# In[62]:

plt.plot(df[2])

# Most values are around 180, but there are some extreme outliers. Let's find them and look at them more carefully:

# In[63]:

df[df[2] > 200]  # select and show the rows where df[2] is large
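# The selection above works in two steps: the comparison builds a boolean Series, and indexing the
# DataFrame with it keeps only the rows marked True. A sketch on a hypothetical three-row frame
# (toy, mask and tall_rows are illustrative names, and the values are invented):

```python
import pandas as pd

# Hypothetical rows: column 0 = age, column 1 = binary code, column 2 = height.
toy = pd.DataFrame({0: [35, 1, 52], 1: [0, 1, 1], 2: [178, 780, 163]})

mask = toy[2] > 200     # boolean Series, True where column 2 exceeds 200
tall_rows = toy[mask]   # keeps only the rows where the mask is True
print(mask)
print(tall_rows)
```

# The row labels of the result are the original ones, which is how the outlying patients can be identified by their row number.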

# So patients 141 and 316 are outliers for this feature. Let's see the value counts:

# In[64]:

df[2].value_counts().head(10)

# The key thing to realize here is that this is **height in centimeters**.

# This also helps make sense of the **outliers**: the rows with heights 780 and 608 have age 0 or 1. These must be babies, and maybe there was some error in data entry.
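# One possible way to deal with such entries, sketched here as an assumption rather than as part of
# the lab, is to mark implausible heights as missing so they no longer distort summary statistics.
# The 250 cm cutoff and the two ordinary heights are invented; 780 and 608 are the outlier values seen above.

```python
import pandas as pd

# Toy heights in cm: two ordinary values plus the two outlier values.
heights = pd.Series([178.0, 780.0, 163.0, 608.0])

MAX_PLAUSIBLE_CM = 250  # assumed cutoff, chosen only for this illustration

# Series.where keeps values where the condition holds and puts NaN elsewhere.
cleaned = heights.where(heights <= MAX_PLAUSIBLE_CM)
print(cleaned)

# mean() skips NaN by default, so only the plausible heights are averaged.
print(cleaned.mean())
```

# Whether to mask, correct, or drop such rows depends on what the data entry error actually was, which we cannot tell from the table alone.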

# Decoding the second column using the new information

# We can now decode what the 0 and 1 must mean in the categorical feature.

# We will compute the average of the other features within each group. Let's keep only the first six columns (0 to 5).

# In[71]:

df2 = df[[0, 1, 2, 3, 4, 5]]
print(df2.head())
print(df2.shape)

# In[88]:

df3 = df2.groupby(1).mean()
df3

# Groupby is a handy DataFrame method that groups rows by their value on one feature, so that each group can then be averaged (it can also do many other things). The output we obtain says that for the rows where df[1] == 0, the average height (i.e. df[2]) is 171 cm.
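# To make the mechanics concrete, here is a sketch of groupby on a hypothetical four-row frame.
# The values are invented, and toy, group_means and summary are illustrative names.

```python
import pandas as pd

# Hypothetical patients: column 1 is the binary code, column 2 a height in cm.
toy = pd.DataFrame({1: [0, 0, 1, 1], 2: [180, 170, 160, 166]})

# One output row per distinct value of column 1, holding the mean of the other columns.
group_means = toy.groupby(1).mean()
print(group_means)

# Several statistics at once for a single column: the mean and the group size.
summary = toy.groupby(1)[2].agg(['mean', 'count'])
print(summary)
```

# Checking the group sizes alongside the means is a cheap way to see whether an average rests on many rows or only a few.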

# The National Center for Health Statistics (part of the CDC) lists the average weight and height as 88.5 kg and 176 cm for men, and 75 kg and 162 cm for women.

# Comparing the group averages with these figures, we conclude that df[1] == 0 must indicate male patients and df[1] == 1 female patients.

# To compare with the CDC data more carefully, we should calculate the average height and weight of patients aged 20 or older:

# In[87]:

df4 = df2[df2[0] > 19].groupby(1).mean()
df4

# Strange! Average heights became smaller for both men and women.

# This is because we removed the two giant babies (observe above that one was male and the other female), which were inflating the averages.
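# A quick worked example shows how strongly a single extreme entry can pull a mean upward.
# The numbers are invented (only 780 echoes one of the outliers above), and the 250 cm filter is an assumed cutoff.

```python
import pandas as pd

# Hypothetical heights in cm with one extreme entry.
heights = pd.Series([160, 165, 170, 175, 780])

mean_with_outlier = heights.mean()
median_with_outlier = heights.median()

# Drop the extreme entry using an assumed plausibility cutoff of 250 cm.
trimmed = heights[heights <= 250]
mean_without_outlier = trimmed.mean()

print(mean_with_outlier, median_with_outlier, mean_without_outlier)
```

# The mean moves a lot when the outlier is dropped while the median barely changes, so the median is a useful cross-check when outliers are suspected.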

# Reassuringly, the average weight increased a bit, as expected.

# What about the other features? They turn out to be QRS duration and P-R interval. This is the UCI Arrhythmia dataset.

# HTML styling commands follow.

# In[20]:

from IPython.display import HTML


def css_styling():
    # Read the notebook's custom stylesheet and return it as an HTML object.
    with open("custom.css", "r") as f:
        styles = f.read()
    return HTML(styles)


css_styling()

# In[18]:

get_ipython().run_cell_magic('javascript', '', 'javascript:$(\'.math>span\').css("border-left-color","transparent")\n')