In this first lab we will start with data exploration in pandas. We will explore a medical dataset where the only thing we know is that each row is a patient. The last column (279) holds the medical condition, coded as an integer, and our goal is to work out what the other features mean.

Let's start by importing a few standard libraries and loading the dataset with the pd.read_csv function. The '?' entries in the file are read as missing values.

In [55]:

```
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.preprocessing import StandardScaler
path='PatientData.csv'
df=pd.read_csv(path,header=None,na_values='?')
```
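To see what na_values='?' does, here is a minimal sketch on a tiny made-up CSV (the numbers are invented for illustration and are not taken from PatientData.csv). Without the argument the '?' stays a string; with it, the cell becomes NaN and the column is numeric.

```python
import io
import pandas as pd

# Tiny made-up CSV with a '?' placeholder for a missing entry (hypothetical data).
raw = "75,0,190,?\n56,1,165,64\n"

# Without na_values, the '?' forces the last column to be read as text.
as_text = pd.read_csv(io.StringIO(raw), header=None)

# With na_values='?', the same cell becomes NaN and the column is numeric.
as_nan = pd.read_csv(io.StringIO(raw), header=None, na_values='?')

print(as_text[3].tolist())
print(as_nan[3].isna().tolist())
```

On the real data, df.isna().sum() would give the number of missing entries per column.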

The pandas DataFrame df now holds our data as a table. We can take a quick look at the first rows using

In [56]:

```
df.head()
```

Out[56]:

In [57]:

```
df.shape #See how many patients and how many features we have.
```

Out[57]:

A first visualization is to simply plot the first column against the row index, as if it were a time series

In [58]:

```
plt.plot(df[0])
```

Out[58]:

The values scatter messily around 50. We can't tell much more from this plot.

Let's try to explore which values df[0] takes instead.

In [59]:

```
df[0].value_counts()[:10]  # the 10 most frequent values and their counts
```

Out[59]:

OK, these are always integers. The most frequent value is 46, which appears in 15 patients, followed by 36, and so on.
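Keep in mind that value_counts sorts by frequency, not by the value itself. A small sketch on made-up numbers (not the lab data) shows the difference between the default ordering and sorting by value with sort_index:

```python
import pandas as pd

# Made-up values for illustration only.
ages = pd.Series([46, 36, 46, 58, 36, 46, 70])

counts = ages.value_counts()    # sorted by frequency, most common first
by_value = counts.sort_index()  # same counts, sorted by the value itself

print(counts.head(2))
print(by_value)
```

So slicing with [:10] gives the ten most common values, which need not be the ten largest.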

Let's make a histogram

In [60]:

```
df[0].hist(bins=10)
```

Out[60]:

The values range from 0 to 80, with a peak around 45.
Since these are patients, our conclusion is:

df[0] must be the age of each patient!
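The range and the peak of a histogram can also be read off numerically. Here is a rough sketch using np.histogram with the same bins=10, run on made-up ages rather than the lab data:

```python
import numpy as np
import pandas as pd

# Made-up ages for illustration only.
ages = pd.Series([0, 1, 23, 35, 44, 45, 46, 47, 52, 61, 80])

# counts[i] is the number of values between edges[i] and edges[i + 1].
counts, edges = np.histogram(ages, bins=10)

peak = counts.argmax()  # index of the fullest bin
print("range:", edges[0], "to", edges[-1])
print("fullest bin:", edges[peak], "to", edges[peak + 1])
```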

For the second column, df[1], the most reliable first step is again to look at the value counts

In [61]:

```
df[1].value_counts()
```

Out[61]:

This is a binary feature that seems roughly balanced. We will need help from the other features to decide what it encodes.
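One way to put a number on "balanced" is to ask value_counts for proportions instead of raw counts. A minimal sketch on a made-up binary column (not the lab data):

```python
import pandas as pd

# Made-up binary feature for illustration only.
flag = pd.Series([0, 1, 1, 0, 1, 0, 1, 1])

# normalize=True returns the share of rows for each value instead of the count.
shares = flag.value_counts(normalize=True)
print(shares)
```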

Let's keep exploring with the third column. Plotting it against the row index shows something interesting:

In [62]:

```
plt.plot(df[2])
```

Out[62]:

Most values are around 180 but there are some crazy outliers. Let's find them and look at them more carefully:

In [63]:

```
df[ df[2]>200] #select and show the rows where df[2] is big
```

Out[63]:

So patients 141 and 316 are outliers for this feature. Let's look at the value counts:

In [64]:

```
df[2].value_counts()[:10]
```

Out[64]:

The key thing to realize here is that this is **height in centimeters**.

It also sheds light on the **outliers**: the patients with height 780 and 608 have age 0 or 1. These must be babies, and there was probably some error in data entry.
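The row selection we used to find the outliers is a boolean mask: the comparison produces one True/False per row, and indexing with it keeps only the True rows. A small sketch on a toy table (the two outlier heights and ages come from the discussion above, the other rows are invented):

```python
import pandas as pd

# Toy table: column 0 plays the role of age, column 2 of height.
# The rows with 780 and 608 mimic the outliers above; the others are made up.
toy = pd.DataFrame({0: [1, 45, 0, 52], 2: [780, 170, 608, 165]})

mask = toy[2] > 200   # boolean Series, one entry per row
outliers = toy[mask]  # keeps only the rows where mask is True

print(mask.tolist())
print(outliers)
```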

We can now decode what 0 and 1 must mean in the binary feature df[1].

We will compute the averages of the other features for each of its two values. Let's keep only the first six columns (0 to 5)

In [71]:

```
df2=df[[0,1,2,3,4,5]]
print(df2.head())   # first rows of the reduced table
print(df2.shape)    # (number of patients, 6)
```

In [88]:

```
df3=df2.groupby(1).mean()
df3
```

Out[88]:

groupby is a handy DataFrame method that can average rows grouped by their value on one feature (among many other things). The output says that for the rows where df[1] is 0, the average height (i.e. df[2]) is 171 cm.
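To make the grouping concrete, here is a minimal sketch on a toy table with made-up heights. It checks that groupby(1).mean() gives the same number as filtering the rows by hand and averaging:

```python
import pandas as pd

# Toy table: column 1 is the binary feature, column 2 a made-up height.
toy = pd.DataFrame({1: [0, 0, 1, 1], 2: [180, 170, 160, 166]})

# One row per distinct value of column 1, holding the mean of the other columns.
means = toy.groupby(1).mean()

# The same number computed by hand for the group where column 1 is 0.
manual = toy.loc[toy[1] == 0, 2].mean()

print(means)
print(manual)
```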

The National Center for Health Statistics lists the average weight and height as 88.5 kg and 176 cm for men, and 75 kg and 162 cm for women.

So we conclude that df[1] = 0 must indicate male patients and df[1] = 1 female patients.

To compare with the CDC data more carefully, we should compute the average height and weight only for patients aged 20 or older:

In [87]:

```
df4 = df2[df2[0] > 19].groupby(1).mean()  # keep only patients aged 20 or older
df4
```

Out[87]:

Strange! The average heights became smaller for both men and women.

This is because we removed the two giant babies (observe above that one was male and the other female), whose heights were inflating the averages.
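A quick worked example shows how much a single bad entry can move a mean. The heights below are made up, except for 780, which mimics one of the outliers found earlier. The median is included as a point of comparison, since it barely reacts to the outlier.

```python
import pandas as pd

# Four invented adult heights plus one outlier like the one found earlier.
heights = pd.Series([170, 172, 168, 174, 780])

with_outlier = heights.mean()                    # (170+172+168+174+780)/5
without_outlier = heights[heights < 200].mean()  # (170+172+168+174)/4
median = heights.median()                        # middle value of the sorted list

print(with_outlier, without_outlier, median)
```

With only five rows the effect is exaggerated, but the same mechanism explains the smaller shift we saw in the real averages.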

Reassuringly, the average weight increased a bit, as expected.

What about the other features? They turn out to be QRS duration and P-R interval.
This is the UCI Arrhythmia dataset.
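Once the columns are decoded, it can help to give them readable names. The sketch below uses a toy one-row table with invented values. The label strings are hypothetical choices, and the mapping assumes that column 3 is the weight and that columns 4 and 5 are QRS duration and P-R interval in that order, which should be checked against the dataset documentation.

```python
import pandas as pd

# Toy one-row table with invented values, standing in for the first six columns.
toy = pd.DataFrame([[75, 0, 190, 80, 91, 193]])

# Hypothetical labels; the assignment of columns 3 to 5 is an assumption.
names = {0: 'age', 1: 'sex', 2: 'height_cm',
         3: 'weight_kg', 4: 'qrs_duration', 5: 'pr_interval'}

named = toy.rename(columns=names)  # returns a copy, toy itself is unchanged
print(named)
```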

HTML styling commands follow.

In [20]:

```
from IPython.core.display import HTML
def css_styling():
    styles = open("custom.css", "r").read()
    return HTML(styles)
css_styling()
```

Out[20]:

In [18]:

```
%%javascript
javascript:$('.math>span').css("border-left-color","transparent")
```