#!/usr/bin/env python
# coding: utf-8

# ![](img/logo.png)
# 
# # Data analysis: Pandas and Seaborn
# ## Yoav Ram

# _Pandas_ is a powerful library for manipulating large and complex datasets using a dedicated data structure, the **data frame**, which models a table of data.
# Pandas helps to close the gap between Python and R for data analysis and statistical computing.
# 
# Pandas data frames address three deficiencies of NumPy arrays:
# 
# - data frames hold heterogeneous data; each column can have its own `numpy.dtype`,
# - the axes of a data frame are labeled with column names and row indices,
# - and they account for missing values, which are not directly supported by arrays.
# 
# Data frames are extremely useful for data manipulation.
# They provide a large range of operations such as filter, join, and group-by aggregation, as well as plotting.

# In[1]:


import numpy as np
import pandas as pd
print('Pandas version:', pd.__version__)
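

# The three properties listed above can be seen on a tiny hand-made table.
# This is only a sketch: the animals and values below are made up for illustration, and `toy` is a hypothetical name that is not used elsewhere in this notebook.

# In[ ]:


import numpy as np
import pandas as pd

# made-up data, for illustration only
toy = pd.DataFrame(
    {
        'Common name': ['mouse', 'cat', 'horse'],  # strings
        'Body mass (g)': [20.0, 4000.0, np.nan],   # floats, with one missing value
        'Litter size': [6, 4, 1],                  # integers
    },
    index=['a', 'b', 'c'],                         # row labels
)
print(toy.dtypes)                   # each column has its own dtype
print(toy.loc['b', 'Common name'])  # access by row label and column name
print(toy['Body mass (g)'].mean())  # the missing value is skipped by default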


# # Statistical Analysis of Life History Traits

# We will analyze animal life-history data from [AnAge](http://genomics.senescence.info/download.html#anage). 

# In[2]:


data = pd.read_csv('../data/anage_data.txt', sep='\t') # lots of other pd.read_... functions
print(type(data))
print(data.shape)


# Pandas holds data in a `DataFrame` (similar to a data frame in _R_).
# A `DataFrame` has a single row per observation (in contrast to the previous exercise, in which each table cell was one observation), and each column holds a single variable. Variables can be numbers or strings.
# 
# The `head` method gives us the first 5 rows of the data frame.

# In[3]:


data.head()


# `DataFrame` has many of the features of `numpy.ndarray` - it also has a `shape` and various statistical methods (`max`, `mean` etc.).
# However, `DataFrame` allows richer indexing.
# For example, let's browse our data for species that have body mass greater than 300 kg.
# First we will create a new column (a boolean `Series` object) that tells us whether each row is a large-animal row or not:

# In[4]:


large_index = data['Body mass (g)'] > 300 * 1000 # 300 kg
large_index.head()
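

# A minimal sketch of how such a boolean `Series` behaves, using a made-up table (`toy_mass` and its values are hypothetical, not taken from AnAge).
# Note that comparing a missing value (`NaN`) gives `False`, so rows without a body mass are never selected by the mask.

# In[ ]:


import numpy as np
import pandas as pd

# made-up data, for illustration only
toy_mass = pd.DataFrame({
    'Common name': ['mouse', 'cow', 'unknown'],
    'Body mass (g)': [20.0, 600000.0, np.nan],
})
is_large = toy_mass['Body mass (g)'] > 300 * 1000  # element-wise comparison
print(is_large.tolist())   # one boolean per row
print(toy_mass[is_large])  # keeps only the rows where the mask is True
print(is_large.sum())      # True counts as 1, so this is the number of large animals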


# Now, we slice our data with this boolean index. 
# The `iterrows` method lets us iterate over the rows of the data.
# For each row we get both the row's index label (here an `int`) and the row itself as a `Series` object (similar to `dict` for our use); this resembles the use of `enumerate` on lists and strings, except that after filtering the labels are the original row labels rather than a running count.

# In[5]:


large_data = data[large_index]
for i, row in large_data.iterrows(): 
    print(row['Common name'], row['Body mass (g)']/1000, 'kg')
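

# The following sketch, on made-up data, illustrates what `iterrows` yields after filtering: the first item of each pair is the row's original index label, not a counter that starts at 0.
# The name `toy_rows` and all values are hypothetical.

# In[ ]:


import pandas as pd

# made-up data, for illustration only
toy_rows = pd.DataFrame({
    'Common name': ['mouse', 'cow', 'cat', 'whale'],
    'Body mass (g)': [20.0, 600000.0, 4000.0, 100000000.0],
})
toy_large = toy_rows[toy_rows['Body mass (g)'] > 300 * 1000]

labels = []
for label, row in toy_large.iterrows():
    labels.append(label)  # original row label, kept by the filter
    print(label, row['Common name'], row['Body mass (g)'] / 1000, 'kg')

# for arithmetic on a whole column, a vectorized expression avoids the loop
print(toy_large['Body mass (g)'] / 1000)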


# So... a [Dromedary](http://en.wikipedia.org/wiki/Dromedary) is the single-humped camel.
# 
# ![Camel](https://upload.wikimedia.org/wikipedia/commons/thumb/e/ed/Camelus_dromedarius_on_Sinai.jpg/220px-Camelus_dromedarius_on_Sinai.jpg)
# 
# Let's continue with small and medium animals - we keep only the rows with a body mass of less than 300 kg (rows with a missing body mass are dropped as well).

# In[6]:


data = data[data['Body mass (g)'] < 300 * 1000].copy()  # copy, so that adding columns later does not trigger SettingWithCopyWarning


# For starters, let's plot a scatter of body mass vs. metabolic rate.
# Because we work with pandas, we can do that with the `plot` method of `DataFrame`, specifying the columns for `x` and `y` and a plotting style (without the style we would get a line plot which makes no sense here).
# 
# You can change `%matplotlib inline` to `%matplotlib widget` to get interactive plotting -- if this causes errors, just stay with `inline`, as the `widget` feature is new and may require updating some packages.

# In[7]:


get_ipython().run_line_magic('matplotlib', 'inline')
import matplotlib.pyplot as plt


# In[8]:


data.plot.scatter(x='Body mass (g)', y='Metabolic rate (W)', legend=False)
plt.ylabel('Metabolic rate (W)');


# If this plot looks funny, you are probably using Pandas with version <0.22; the bug was [reported](https://github.com/pandas-dev/pandas/issues/11471) and fixed in version 0.22.

# From this plot it seems that 
# 1. there is a correlation between body mass and metabolic rate, and 
# 1. there are many small animals (less than 30 kg) and not many medium animals (between 50 and 300 kg).
# 
# Before we continue, I prefer to have mass in kg, so let's add a new column:

# In[9]:


data['Body mass (kg)'] = data['Body mass (g)'] / 1000


# Next, let's check how many records we have for each Class (as in the taxonomic unit): 

# In[10]:


class_counts = data['Class'].value_counts()
print(class_counts)
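

# A short sketch of what `value_counts` does, on a made-up `Series` (`toy_class` is a hypothetical name).
# By default the counts are sorted from most to least frequent and missing values are not counted; `dropna=False` and `normalize=True` change that behaviour.

# In[ ]:


import pandas as pd

# made-up data, for illustration only; None stands for a missing Class
toy_class = pd.Series(['Mammalia', 'Aves', 'Mammalia', None, 'Reptilia', 'Mammalia'])

counts = toy_class.value_counts()                   # missing value is ignored
counts_with_na = toy_class.value_counts(dropna=False)  # missing value gets its own row
fractions = toy_class.value_counts(normalize=True)  # fractions instead of counts
print(counts)
print(counts_with_na)
print(fractions)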


# In[12]:


# plt.figure() # only required if you used %matplotlib widget
class_counts.plot.bar()
plt.ylabel('Num. of species');


# So we have lots of mammals and birds, and a few reptiles and amphibians. This is important as amphibians and reptiles could have a different relationship between mass and metabolism because they are cold-blooded.

# ## Exercise: data frames
# 
# 1) **Print the number** of reptiles in this dataset, and how many of them are of the genus `Python`.
# 
# **Reminder**
# - Edit cell by double clicking
# - Run cell by pressing _Shift+Enter_
# - Get autocompletion by pressing _Tab_
# - Get documentation by pressing _Shift+Tab_

# In[28]:


# In[29]:


print("# of reptiles: ", reptiles)
print("# of pythons: ", pythons)


# 2) **Plot the histogram of the mammal body masses** using `plot.hist()`.
# Since most mammals are small, the histogram looks better if we plot a cumulative distribution rather than the distribution itself - we can do this with the `cumulative` argument. You also need to specify a higher `bins` argument than the default.

# In[ ]:


# In[34]:


# # Seaborn
# 
# Let's do a simple linear regression plot, but let's do it separately for each Class. We can do this kind of thing with Matplotlib and [SciPy](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.linregress.html), but a very good tool for statistical visualizations is **[Seaborn](http://seaborn.pydata.org)**.
# 
# Seaborn adds on top of Pandas a set of sophisticated statistical visualizations, similar to [ggplot2](http://ggplot2.org) for R.

# In[13]:


import seaborn as sns
sns.set_context("talk")


# In[14]:


sns.lmplot(
    x='Body mass (kg)', 
    y='Metabolic rate (W)', 
    hue='Class', 
    data=data, 
    ci=None, 
);


# - `hue` means _color_, but it also causes _seaborn_ to fit a different linear model to each of the Classes. 
# - `ci` controls the confidence intervals. I chose `None` to hide them; setting it to a number such as `95` will show the 95% confidence intervals.
# 
# We can see that mammals and birds have a clear correlation between size and metabolism and that it extends over a nice range of mass, so let's stick to mammals; next up we will see which orders of mammals we have.
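# 
# To make the "different linear model for each group" idea concrete, here is a rough sketch that fits one least-squares line per group with `scipy.stats.linregress` and pandas `groupby`.
# It is not seaborn's internal code; the table `toy_reg`, its column names and its values are made up for illustration.

# In[ ]:


import pandas as pd
from scipy.stats import linregress

# made-up data: group A follows y = 2x, group B follows y = 0.5x + 1
toy_reg = pd.DataFrame({
    'group': ['A'] * 4 + ['B'] * 4,
    'x': [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0],
    'y': [2.0, 4.0, 6.0, 8.0, 1.5, 2.0, 2.5, 3.0],
})

fits = {}
for name, group in toy_reg.groupby('group'):
    fit = linregress(group['x'], group['y'])  # least-squares line for this group only
    fits[name] = fit
    print(name, 'slope:', fit.slope, 'intercept:', fit.intercept)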

# In[15]:


mammalia = data[data.Class=='Mammalia']
order_counts = mammalia.Order.value_counts()
ax = order_counts.plot.barh()
ax.set(
    xlabel='Num. of species',
    ylabel='Mammalia order'
)
ax.figure.set_figheight(7)


# You see we have a lot of rodents and carnivores, but also a good number of bats (_Chiroptera_) and primates.
# 
# Let's continue with orders that have at least 20 species - this also includes some cool marsupials like Kangaroo, Koala and [Taz](http://upload.wikimedia.org/wikipedia/en/c/c4/Taz-Looney_Tunes.svg) (Diprotodontia and Dasyuromorphia).

# In[16]:


orders = order_counts[order_counts >= 20]
print(orders)
abund_mammalia = mammalia[mammalia.Order.isin(orders.index)]
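

# The same count-then-filter pattern, sketched on a made-up table so that every step can be checked by eye.
# The name `toy_orders`, the values and the threshold of 2 are hypothetical and chosen only for illustration.

# In[ ]:


import pandas as pd

# made-up data, for illustration only
toy_orders = pd.DataFrame({
    'Order': ['Rodentia', 'Rodentia', 'Rodentia', 'Carnivora', 'Carnivora', 'Primates'],
    'Body mass (kg)': [0.02, 0.3, 0.1, 4.0, 150.0, 60.0],
})
toy_counts = toy_orders['Order'].value_counts()  # number of rows per order
common = toy_counts[toy_counts >= 2]             # keep orders with at least 2 rows
kept = toy_orders[toy_orders['Order'].isin(common.index)]  # rows whose order is in that set
print(sorted(common.index))
print(len(kept))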


# In[17]:


sns.lmplot(
    x='Body mass (kg)', 
    y='Metabolic rate (W)', 
    hue='Order',
    data=abund_mammalia, 
    ci=None, 
    height=8,
    aspect=1.3,
    line_kws={'lw':2, 'ls':'--'}, 
    scatter_kws={'s':50, 'alpha':0.5}
);
# if you get an error about height not being a keyword, change it to size or update seaborn: conda update seaborn


# Because there is a lot of data here I made the lines thinner - this can be done by giving _matplotlib_ keywords as a dictionary to the argument `line_kws` - and I made the markers bigger but with alpha (transparency) 0.5 using the `scatter_kws` argument.
# 
# Still, there's too much data, and part of the problem is that some orders consist of large animals (e.g. primates) and some of small animals (e.g. rodents).
# 
# Let's plot a separate regression plot for each order.
# We do this using the `col` and `row` arguments of `lmplot`, but in general this can be done for any plot using [seaborn's `FacetGrid` class](http://stanford.edu/~mwaskom/software/seaborn/tutorial/axis_grids.html).

# In[18]:


sns.lmplot(
    x='Body mass (kg)', 
    y='Metabolic rate (W)', 
    data=abund_mammalia, 
    hue='Order',
    col='Order', 
    col_wrap=3, 
    ci=None, 
    scatter_kws={'s':40}, 
    sharex=False, 
    sharey=False
);


# We used the `sharex=False` and `sharey=False` arguments so that each Order will have a different axis range and so that the data spreads nicely.
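# 
# For other kinds of plots, the same one-panel-per-group layout can be built with `FacetGrid` directly.
# The sketch below is one possible way to do it, assuming seaborn 0.9 or later (for `sns.scatterplot`); the table `toy_facets` and its values are made up for illustration.

# In[ ]:


import pandas as pd
import seaborn as sns

# made-up data with three groups on very different scales
toy_facets = pd.DataFrame({
    'Order': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'x': [1, 2, 3, 10, 20, 30, 100, 200, 300],
    'y': [2, 4, 5, 15, 25, 31, 90, 210, 310],
})
# one panel per value of 'Order', wrapped after 2 columns, each with its own axis ranges
grid = sns.FacetGrid(toy_facets, col='Order', col_wrap=2, sharex=False, sharey=False)
grid.map_dataframe(sns.scatterplot, x='x', y='y')  # draw a scatter plot in every panel
grid.set_axis_labels('x', 'y')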

# # Statistics
# 
# Lastly, let's do some quick statistics.
# 
# First, calculate a summary of the mammals using `describe`.

# In[19]:


mass = abund_mammalia
mass.describe()
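

# A small sketch of what `describe` reports, on a made-up table (`toy_desc` and its values are hypothetical).
# By default only numeric columns are summarized and missing values are left out of `count`; combining it with `groupby` gives one summary per group.

# In[ ]:


import numpy as np
import pandas as pd

# made-up data, for illustration only
toy_desc = pd.DataFrame({
    'Order': ['Rodentia', 'Rodentia', 'Carnivora'],
    'Body mass (kg)': [0.02, np.nan, 4.0],
})
summary = toy_desc.describe()  # count, mean, std, min, quartiles, max
print(summary)
per_order = toy_desc.groupby('Order')['Body mass (kg)'].describe()  # one row per order
print(per_order)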


# Now let's check whether the body mass of rodents is significantly lower than that of carnivores.
# 
# ## Exercise: boxplot
# **Plot boxplots of the mammals' body mass** using Seaborn, which is easier to use (and also makes nicer boxplots) than the standard matplotlib boxplot.

# In[ ]:


# In[82]:


# Now, we'll use a t-test (implemented in the `scipy.stats` module) to test the hypothesis that there is *no difference* in body mass between rodents and carnivores.
# 
# - `ttest_ind` calculates the t-test for the means of *two independent* samples of scores.
# - `scipy.stats` has many more statistical tests, distributions, etc.

# In[20]:


from scipy.stats import ttest_ind


# In[21]:


carnivora_mass = abund_mammalia.loc[abund_mammalia['Order']=='Carnivora', 'Body mass (kg)']
rodentia_mass = abund_mammalia.loc[abund_mammalia['Order']=='Rodentia', 'Body mass (kg)']

res = ttest_ind(carnivora_mass, rodentia_mass, equal_var=False)
print("P-value of t-test: {:.2g}".format(res.pvalue))
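

# The call above passes `equal_var=False`, which makes `ttest_ind` perform Welch's t-test, a variant that does not assume equal variances in the two groups.
# As a sanity check, the sketch below computes Welch's statistic by hand on two made-up samples (`sample_a` and `sample_b` are hypothetical) and compares it with SciPy's result.

# In[ ]:


import numpy as np
from scipy.stats import ttest_ind

# made-up samples, for illustration only
sample_a = np.array([4.0, 5.5, 6.1, 7.3, 5.0])
sample_b = np.array([0.2, 0.3, 0.1, 0.5, 0.4, 0.3])

# Welch's statistic: difference of the means divided by the combined standard error
se2_a = sample_a.var(ddof=1) / len(sample_a)  # squared standard error of the mean of a
se2_b = sample_b.var(ddof=1) / len(sample_b)  # squared standard error of the mean of b
t_manual = (sample_a.mean() - sample_b.mean()) / np.sqrt(se2_a + se2_b)

res_toy = ttest_ind(sample_a, sample_b, equal_var=False)
print('manual t:', t_manual)
print('scipy t: ', res_toy.statistic)
print('p-value: ', res_toy.pvalue)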


# # References
# 
# - Examples: [Seaborn example gallery](http://seaborn.pydata.org/examples/index.html)
# - Slides: [Statistical inference with Python](https://docs.google.com/presentation/d/1imQAEmNg4GB3bCAblauMOOLlAC95-XvkTSKB1_dB3Tg/pub?slide=id.p) by Allen Downey
# - Book: [Think Stats](http://greenteapress.com/thinkstats2/html/index.html) by Allen Downey - statistics with Python. Free Ebook.
# - Blog post: [A modern guide to getting started with Data Science and Python](http://twiecki.github.io/blog/2014/11/18/python-for-data-science/)
# - Tutorial: [An Introduction to Pandas](http://www.synesthesiam.com/posts/an-introduction-to-pandas.html)

# # Colophon
# This notebook was written by [Yoav Ram](http://python.yoavram.com).
# 
# The notebook was written using [Python](http://python.org/) 3.7.
# Dependencies listed in [environment.yml](../environment.yml).
# 
# This work is licensed under a CC BY-NC-SA 4.0 International License.
# 
# ![Python logo](https://www.python.org/static/community_logos/python-logo.png)