*Pandas* is a powerful library for manipulating large and complex datasets using a new data structure, the data frame. Pandas helps close the gap between Python and R for data analysis and statistical computing.

Pandas data frames address three deficiencies of arrays:

- they hold heterogeneous data; each column can have its own `numpy.dtype`,
- the axes of a `DataFrame` are labeled with column names and row indices,
- and they account for missing values, which arrays do not directly support.

(See Introduction to Python for Statistical Learning)
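To make these three points concrete, here is a small sketch; the frame and its values are made up for illustration:

```python
import numpy as np
import pandas as pd

# A toy frame illustrating the three points above: mixed dtypes,
# labeled axes, and NaN for missing values.
df = pd.DataFrame(
    {'name': ['mouse', 'cat', 'horse'],       # strings
     'mass_kg': [0.02, 4.0, np.nan]},         # floats, one missing
    index=['m1', 'c1', 'h1'])                 # labeled rows

print(df.dtypes)                  # each column has its own dtype
print(df.loc['c1', 'mass_kg'])    # label-based indexing
print(df['mass_kg'].isna().sum()) # count of missing values
```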

Data frames are extremely useful for data munging. They provide a large range of operations such as filter, join, and group-by aggregation, as well as plotting.

In [1]:

```
import pandas as pd
print('Pandas version:', pd.__version__)
```

We will analyze animal life-history data from AnAge.
We will get the data from the download page. The file is zip-compressed, so we need to unzip it and then read it using the *pandas* `read_table` function:

In [3]:

```
import urllib.request
import zipfile
```

In [5]:

```
fname = '../data/anage_dataset.zip'
urllib.request.urlretrieve('http://genomics.senescence.info/species/dataset.zip', fname)
```

In [6]:

```
with zipfile.ZipFile(fname) as z:
    f = z.open('anage_data.txt')
    data = pd.read_table(f)  # lots of other pd.read_... functions
print(type(data))
print(data.shape)
```

*pandas* holds data in a `DataFrame` (similar to *R*).
A `DataFrame` has a single row per observation (in contrast to the previous exercise, in which each table cell was one observation), and each column holds a single variable. Variables can be numbers or strings.

In [7]:

```
data.head()
```

Out[7]:

`DataFrame` has many of the features of `numpy.ndarray`: it also has a `shape` and various statistical methods (`max`, `mean`, etc.).
However, `DataFrame` allows richer indexing.
For example, let's browse our data for species that have a body mass greater than 300 kg.
First we will create a boolean index that tells us whether each row is a large-animal row or not:

In [8]:

```
large_index = data['Body mass (g)'] > 300 * 1000 # 300 kg
large_index.head()
```

Out[8]:

Now we slice our data with this boolean index.
The `iterrows` method lets us iterate over the rows of the data.
For each row we get both the row as a `Series` object (similar to a `dict` for our use) and the row index as an `int`.

In [9]:

```
large_data = data[large_index]
for i, row in large_data.iterrows():
    print(row['Common name'], row['Body mass (g)']/1000, 'kg')
```

So... a Dromedary is a Camel.

Let's continue with small and medium animals. For starters, let's plot a scatter of body mass vs. metabolic rate.
Because we work with *pandas*, we can do that with the `plot` method of `DataFrame`, specifying the columns for `x` and `y` and a plotting style (without the style we would get a line plot, which makes no sense here).

In [10]:

```
%matplotlib inline
import matplotlib.pyplot as plt
```

In [14]:

```
data = data[data['Body mass (g)'] < 3e5]
data.plot(x='Body mass (g)', y='Metabolic rate (W)', style='o')
plt.ylabel('Metabolic rate (W)')
plt.xlim(0, 200000);
```

From this plot it seems that 1) there is a correlation between body mass and metabolic rate, and 2) there are many small animals (less than 30 kg) and not many medium animals (between 50 and 300 kg).

Before we continue, I prefer to have mass in kg, so let's add a new column:

In [15]:

```
data['Body mass (kg)'] = data['Body mass (g)'] / 1000
```

Next, let's check how many records we have for each Class (as in the taxonomic unit):

In [16]:

```
class_counts = data.Class.value_counts()
print(class_counts)
```

In [17]:

```
class_counts.plot(kind='bar')
plt.ylabel('Num. of species');
```

So we have lots of mammals and birds, and a few reptiles and amphibians. This is important, as amphibians and reptiles could have a different relationship between mass and metabolism because they are cold-blooded.

In [18]:

```
data[data.Class == 'Reptilia']
```

Out[18]:

Check how many species in this dataset have the word *fly* in their name. Extract the `Common name` column, use the `.str` attribute to access the vectorized string methods, then use the string method `contains`. The result is a series of booleans, which you can sum with the `sum` method.
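One possible solution sketch, shown here on a made-up stand-in frame since the real check would run on `data`:

```python
import pandas as pd

# Stand-in for the AnAge table; only the 'Common name' column matters here.
toy = pd.DataFrame(
    {'Common name': ['Fruit fly', 'House mouse', 'Mayfly', 'Dromedary']})

has_fly = toy['Common name'].str.contains('fly')  # boolean Series
print(has_fly.sum())  # number of names containing 'fly'
```

Note that `contains` is case-sensitive by default; passing `case=False` would also catch a capitalized "Fly".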

Let's do a simple linear regression plot, but separately for each Class. We can do this kind of thing with *Matplotlib* and *SciPy*, but a very good tool for statistical visualizations is **Seaborn**.

*Seaborn* adds on top of *Pandas* a set of sophisticated statistical visualizations, similar to *ggplot* for R.

Unfortunately, *Seaborn* doesn't ship with *Anaconda*, so we need to install it manually. The best way is to use *conda* or *pip*: `conda install seaborn` or `pip install seaborn`.

In [19]:

```
import seaborn as sns
sns.set_style("ticks") # control the plotting style
sns.set_context("talk") # set to talk because this is a lecture! hit shift-tab after the "(" to see other options.
sns.set_palette("muted") # many color palettes to choose from
sns.palplot(sns.color_palette('muted')) # this is the color palette we chose
```

In [20]:

```
sns.lmplot(
    x='Body mass (kg)',
    y='Metabolic rate (W)',
    hue='Class',
    data=data,
    ci=False,
    size=6,
    aspect=1.3,
    legend_out=False
);
```

Note that `hue` means *color*, but it also causes *seaborn* to fit a separate linear model to each of the Classes.
As for the last three parameters:

- `ci` controls the confidence intervals. I chose `False`, but setting it to `True` will show them.
- `size` controls the size of the plot.
- `legend_out` decides if the legend is drawn inside the plot or outside it. We have enough space for it in the left corner.

We can see that mammals and birds have a clear correlation between size and metabolism and that it extends over a nice range of mass, so let's stick to mammals; next up we will see which Orders of mammals we have.

In [21]:

```
mammalia = data[data.Class=='Mammalia']
order_counts = mammalia.Order.value_counts()
ax = order_counts.plot.barh()
ax.set(
    xlabel='Num. of species',
    ylabel='Order'
)
sns.despine();
```

You see we have a lot of rodents and carnivores, but also a good number of bats (*Chiroptera*) and primates.

Let's continue only with orders that have at least 20 species; this also includes some cool marsupials like the kangaroo, the koala, and Taz, the Tasmanian devil (Diprotodontia and Dasyuromorphia).

In [22]:

```
orders = order_counts[order_counts >= 20]
print(orders)
abund_mammalia = mammalia[mammalia.Order.isin(orders.index)]
```

In [23]:

```
sns.lmplot(
    x='Body mass (kg)',
    y='Metabolic rate (W)',
    hue='Order',
    data=abund_mammalia,
    ci=False,
    size=8,
    aspect=1.3,
    legend_out=False,
    line_kws={'lw': 2, 'ls': '--'},
    scatter_kws={'s': 50, 'alpha': 0.85}
);
```

Because there is a lot of data here, I made the lines thin and dashed, which can be done by giving *matplotlib* keywords as a dictionary to the `line_kws` argument, and I made the markers bigger but with alpha (transparency) 0.85 using the `scatter_kws` argument.

Still, there's too much data, and part of the problem is that some Orders contain large animals (primates) while others contain small ones (rodents).
Let's plot a separate panel for each Order. We do this using the `col` and `row` arguments of `lmplot`, but in general this can be done for any plot using *seaborn*'s `FacetGrid`.

In [24]:

```
sns.lmplot(
    x='Body mass (kg)',
    y='Metabolic rate (W)',
    data=abund_mammalia,
    hue="Order",
    col="Order",
    col_wrap=3,
    ci=None,
    scatter_kws={'s': 40},
    sharex=False,
    sharey=False
);
```

We used the `sharex=False` and `sharey=False` arguments so that each Order gets its own axis ranges and the data spreads out nicely.
Last but not least, let's take a closer look at the correlation between mass and metabolism in primates.
We will make a joint plot, which gives us the Pearson correlation and the distribution of each variable.

In [25]:

```
primates = mammalia[mammalia.Order == 'Primates']
print(' | '.join(sorted(primates["Common name"])))
sns.jointplot(
    x='Body mass (kg)',
    y='Metabolic rate (W)',
    data=primates,
    kind='reg',
    size=8
);
```
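The Pearson r that the joint plot annotates can also be computed directly with pandas; a small sketch with made-up numbers rather than the AnAge values:

```python
import pandas as pd

# Toy mass/rate values, just to show the call.
toy = pd.DataFrame({'mass': [1.0, 2.0, 3.0, 4.0],
                    'rate': [2.1, 3.9, 6.2, 7.8]})

r = toy['mass'].corr(toy['rate'])  # Pearson correlation by default
print(round(r, 3))
```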

- Examples: Seaborn example gallery
- Slides: Statistical inference with Python by Allen Downey
- Book: Think Stats by Allen Downey - statistics with Python. Free Ebook.
- Blog post: A modern guide to getting started with Data Science and Python
- Tutorial: An Introduction to Pandas