#!/usr/bin/env python
# coding: utf-8

# In this first lab we will start with data exploration in Pandas. We will explore a medical dataset where the only thing we know is that each row is a patient. The last column (279) is the medical condition (coded with an integer) and we are trying to understand what the other features mean.
#
# Let's start by importing a few standard libraries and then load the dataset using the pd.read_csv function.
# In[55]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
get_ipython().run_line_magic('matplotlib', 'inline')
from sklearn.preprocessing import StandardScaler
path='PatientData.csv'
df=pd.read_csv(path,header=None,na_values='?')
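# As a minimal sketch of what header=None and na_values='?' do in the call above,
# the toy CSV text below is parsed the same way. The three rows are made-up values,
# not taken from PatientData.csv, and the names toy_csv and toy are hypothetical.
# In[ ]:
import io
import pandas as pd

# Hypothetical three-row sample; '?' marks a missing entry.
toy_csv = "75,0,190\n56,1,?\n54,0,172\n"
toy = pd.read_csv(io.StringIO(toy_csv), header=None, na_values='?')

# With header=None the first line is treated as data and the columns are labeled 0, 1, 2.
print(toy.columns.tolist())
# The '?' is read as NaN, so it is counted as a missing value in column 2.
print(toy[2].isna().sum())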
# The Pandas DataFrame df now contains our data as a table. We can have a quick look at the first rows using head():
# In[56]:
df.head()
# In[57]:
df.shape  # See how many patients (rows) and how many features (columns) we have.
# A first method of visualization is to simply plot the first column against the row index, as if it were a time series.
# In[58]:
plt.plot(df[0])
# The values scatter messily around 50. We can't tell much more from this plot.
# Let's try to explore the values df[0] takes instead.
# In[59]:
df[0].value_counts().head(10)  # The 10 most frequent values (sorted by count, not by value)
# OK, these are always integers. It seems that 46 is the most frequent value, appearing in 15 patients, followed by 36, and so on.
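# A small sketch of how value_counts orders its output, using a made-up Series.
# The values below are illustrative only and the names toy_ages and toy_counts are hypothetical.
# In[ ]:
import pandas as pd

toy_ages = pd.Series([46, 36, 46, 58, 36, 46])
toy_counts = toy_ages.value_counts()

# Rows are sorted by count, most frequent first; the index holds the values themselves.
print(toy_counts.head(2))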
#
# Let's make a histogram.
# In[60]:
df[0].hist(bins=10)
#
# The values range from 0 to 80, with a peak around 45.
# Since these are patients, our conclusion is that
# df[0] must be the age of each patient!
#
# Our most reliable exploration tool is to look at the value counts. Let's apply it to the second column.
# In[61]:
df[1].value_counts()
# This is some binary feature that seems balanced. We will need the help of something else to decide what it is.
#
# Let's keep exploring with the third column. Plotting it as a time series shows something interesting:
# In[62]:
plt.plot(df[2])
# The **outliers** with heights 780 and 608 have age 0 or 1. These must be babies, and maybe there was some error in data entry.
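# One way to check a claim like this is a boolean mask that selects the rows with
# implausible heights and shows their ages. This is a sketch on made-up rows: only the
# heights 780 and 608 come from the text, the other values are invented, and the
# 250 cutoff is an assumption chosen purely for illustration.
# In[ ]:
import pandas as pd

# Columns follow the notebook's layout: 0 = age, 1 = binary feature, 2 = height.
toy_patients = pd.DataFrame({0: [1, 45, 0, 62],
                             1: [1, 0, 0, 1],
                             2: [780, 172, 608, 160]})

# Keep only the rows whose height exceeds the (assumed) plausibility cutoff.
toy_outliers = toy_patients[toy_patients[2] > 250]
print(toy_outliers[[0, 2]])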
# We can now decode what the 0 and 1 must mean in the categorical feature.
#
# We will average the other features within each group. Let's keep only columns 0 through 5.
# In[71]:
df2=df[[0,1,2,3,4,5]]
print(df2.head())
print(df2.shape)
# In[88]:
df3=df2.groupby(1).mean()
df3
#
# groupby is a handy DataFrame method that can be used to average rows grouped by their value on one feature (and can also do many other things).
# The output we obtain says that for the rows where df[1] = 0, the average height (i.e. df[2]) is 171 cm.
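# To make the grouping step concrete, here is a minimal sketch of groupby(...).mean()
# on four made-up rows (the values are illustrative, not from the dataset, and the
# names toy_frame and toy_means are hypothetical).
# In[ ]:
import pandas as pd

# Column 1 is the binary grouping feature, column 2 a numeric feature to average.
toy_frame = pd.DataFrame({1: [0, 0, 1, 1],
                          2: [170, 180, 160, 164]})

# One output row per distinct value of column 1; each cell is the mean within that group.
toy_means = toy_frame.groupby(1).mean()
print(toy_means)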
#
# The National Center for Health Statistics lists the average weight and height for men as 88.5 kg and 176 cm, and for women as 75 kg and 162 cm.
#
# So we conclude that df[1] = 0 must indicate male patients and df[1] = 1 female patients.
#
# To compare with the CDC data more carefully, we should calculate the average height and weight of patients aged 20 or older.
# Restricting to adults removes the two giant babies (observe above that one was male and the other female), which were inflating the average height.
#
# Happily, the average weight increased a bit, as expected.
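# The adult-only comparison described above can be sketched as a filter followed by
# the same groupby. The helper name adult_group_means and the toy rows are hypothetical;
# the column roles (0 = age, 1 = binary feature, 2 = height) follow the notebook.
# In[ ]:
import pandas as pd

def adult_group_means(frame, min_age=20):
    """Average every column per value of column 1, using only rows with column 0 >= min_age."""
    adults = frame[frame[0] >= min_age]
    return adults.groupby(1).mean()

# Made-up rows that include two "giant baby" records in the style of the outliers above.
toy_all = pd.DataFrame({0: [1, 30, 50, 0, 40, 25],
                        1: [0, 0, 0, 1, 1, 1],
                        2: [780, 170, 180, 608, 160, 164]})

# Average height per group with and without the under-20 rows.
print(toy_all.groupby(1).mean()[2])
print(adult_group_means(toy_all)[2])

# On the notebook's data the same helper could be applied as adult_group_means(df2).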
#
# What about the other features? They turn out to be QRS duration and P-R interval.
# This is the UCI Arrhythmia dataset.