#!/usr/bin/env python
# coding: utf-8

# ![](img/logo.png)
#
# # Data analysis: Pandas and Seaborn
# ## Yoav Ram

# _Pandas_ is a powerful library for manipulating large and complex datasets using a new data structure, the **data frame**, which models a table of data.
# Pandas helps to close the gap between Python and R for data analysis and statistical computing.
#
# Pandas data frames address three deficiencies of NumPy arrays:
#
# - data frames hold heterogeneous data; each column can have its own `numpy.dtype`,
# - the axes of a data frame are labeled with column names and row indices,
# - and they account for missing values, which are not directly supported by arrays.
#
# Data frames are extremely useful for data manipulation.
# They provide a large range of operations such as filter, join, and group-by aggregation, as well as plotting.

# In[1]:


import numpy as np
import pandas as pd
print('Pandas version:', pd.__version__)


# # Statistical Analysis of Life History Traits

# We will analyze animal life-history data from [AnAge](http://genomics.senescence.info/download.html#anage).

# In[2]:


data = pd.read_csv('../data/anage_data.txt', sep='\t') # lots of other pd.read_... functions
print(type(data))
print(data.shape)


# Pandas holds data in a `DataFrame` (similar to _R_).
# A `DataFrame` has a single row per observation (in contrast to the previous exercise in which each table cell was one observation), and each column holds a single variable. Variables can be numbers or strings.
#
# The `head` method gives us the first 5 rows of the data frame.

# In[3]:


data.head()


# `DataFrame` has many of the features of `numpy.ndarray` - it also has a `shape` and various statistical methods (`max`, `mean` etc.).
# However, `DataFrame` allows richer indexing.
# For example, let's browse our data for species that have body mass greater than 300 kg.
# First we will create a new column (`Series` object) that tells us if a row is a large animal row or not:

# In[4]:


large_index = data['Body mass (g)'] > 300 * 1000 # 300 kg
large_index.head()


# Now, we slice our data with this boolean index.
# The `iterrows` method lets us iterate over the rows of the data.
# For each row we get both the row's index label (here an `int`) and the row itself as a `Series` object (similar to `dict` for our use); this is similar to the use of `enumerate` on lists and strings.

# In[5]:


large_data = data[large_index]
for i, row in large_data.iterrows():
    print(row['Common name'], row['Body mass (g)']/1000, 'kg')


# So... a [Dromedary](http://en.wikipedia.org/wiki/Dromedary) is the single-humped camel.
#
# ![Camel](https://upload.wikimedia.org/wikipedia/commons/thumb/e/ed/Camelus_dromedarius_on_Sinai.jpg/220px-Camelus_dromedarius_on_Sinai.jpg)
#
# Let's continue with small and medium animals - we keep only the animals with a body mass of less than 300 kg.

# In[6]:


# copy so that adding columns later does not trigger a SettingWithCopyWarning
data = data[data['Body mass (g)'] < 300 * 1000].copy()


# For starters, let's plot a scatter of body mass vs. metabolic rate.
# Because we work with pandas, we can do that with the `plot` accessor of `DataFrame`, specifying the columns for `x` and `y` and the plot kind, `scatter` (the default kind is a line plot, which makes no sense here).
#
# You can change `%matplotlib inline` to `%matplotlib widget` to get interactive plotting -- if this causes errors, just stay with `inline`, as the `widget` feature is new and may require updating some packages.

# In[7]:


get_ipython().run_line_magic('matplotlib', 'inline')
import matplotlib.pyplot as plt


# In[8]:


data.plot.scatter(x='Body mass (g)', y='Metabolic rate (W)', legend=False)
plt.ylabel('Metabolic rate (W)');


# If this plot looks funny, you are probably using a Pandas version older than 0.22; the bug was [reported](https://github.com/pandas-dev/pandas/issues/11471) and fixed in version 0.22.

# From this plot it seems that
# 1. there is a correlation between body mass and metabolic rate, and
# 1. there are many small animals (less than 30 kg) and not many medium animals (between 50 and 300 kg).
#
# Before we continue, I prefer to have mass in kg, so let's add a new column:

# In[9]:


data['Body mass (kg)'] = data['Body mass (g)'] / 1000


# Next, let's check how many records we have for each Class (as in the taxonomic unit):

# In[10]:


class_counts = data['Class'].value_counts()
print(class_counts)


# In[12]:


# plt.figure() # only required if you used %matplotlib widget
class_counts.plot.bar()
plt.ylabel('Num. of species');


# So we have lots of mammals and birds, and a few reptiles and amphibians. This is important as amphibians and reptiles could have a different relationship between mass and metabolism because they are cold-blooded.

# ## Exercise: data frames
#
# 1) **Print the number** of reptiles in this dataset, and how many of them are of the genus `Python`.
#
# **Reminder**
# - Edit cell by double clicking
# - Run cell by pressing _Shift+Enter_
# - Get autocompletion by pressing _Tab_
# - Get documentation by pressing _Shift+Tab_

# In[28]:




# In[29]:


print("# of reptiles: ", reptiles)
print("# of pythons: ", pythons)


# 2) **Plot the histogram of the mammal body masses** using `plot.hist()`.
# Since most mammals are small, the histogram looks better if we plot a cumulative distribution rather than the distribution - we can do this with the `cumulative` argument. You also need to specify a higher `bins` argument than the default.

# In[ ]:




# In[34]:




# # Seaborn
#
# Let's do a simple linear regression plot; but let's do it separately for each Class. We can do this kind of thing with Matplotlib and [SciPy](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.linregress.html), but a very good tool for statistical visualizations is **[Seaborn](http://seaborn.pydata.org)**.
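
# As a rough sketch of the SciPy route mentioned above, a separate line can be fitted per Class by combining `groupby` with `scipy.stats.linregress`.
# The `toy` table and its values are made up for illustration (they are not taken from AnAge); only the column names follow this notebook.

# In[ ]:


import pandas as pd
from scipy.stats import linregress

# Toy stand-in for the AnAge table (hypothetical values, for illustration only)
toy = pd.DataFrame({
    'Class': ['Aves', 'Aves', 'Aves', 'Mammalia', 'Mammalia', 'Mammalia'],
    'Body mass (kg)': [0.1, 0.5, 1.0, 1.0, 2.0, 4.0],
    'Metabolic rate (W)': [0.4, 1.6, 3.1, 2.0, 4.0, 8.0],
})

# Fit one least-squares line per Class and keep the results by Class name
fits = {}
for name, group in toy.groupby('Class'):
    fit = linregress(group['Body mass (kg)'], group['Metabolic rate (W)'])
    fits[name] = fit
    print(name, 'slope:', round(fit.slope, 2), 'intercept:', round(fit.intercept, 2))


# Each fit would then still have to be drawn by hand with Matplotlib, which is the bookkeeping that Seaborn takes care of.
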
#
# Seaborn adds on top of Pandas a set of sophisticated statistical visualizations, similar to [ggplot2](http://ggplot2.org) for R.

# In[13]:


import seaborn as sns
sns.set_context("talk")


# In[14]:


sns.lmplot(
    x='Body mass (kg)',
    y='Metabolic rate (W)',
    hue='Class',
    data=data,
    ci=None,
);


# - `hue` means _color_, but it also causes _seaborn_ to fit a different linear model to each of the Classes.
# - `ci` controls the size of the confidence interval. I chose `None` to hide it; the default (`95`) draws a 95% confidence band around each line.
#
# We can see that mammals and birds have a clear correlation between size and metabolism and that it extends over a nice range of mass, so let's stick to mammals; next up we will see which orders of mammals we have.

# In[15]:


mammalia = data[data.Class=='Mammalia']
order_counts = mammalia.Order.value_counts()
ax = order_counts.plot.barh()
ax.set(
    xlabel='Num. of species',
    ylabel='Mammalia order'
)
ax.figure.set_figheight(7)


# You see we have a lot of rodents and carnivores, but also a good number of bats (_Chiroptera_) and primates.
#
# Let's continue with orders that have at least 20 species - this also includes some cool marsupials like Kangaroo, Koala and [Taz](http://upload.wikimedia.org/wikipedia/en/c/c4/Taz-Looney_Tunes.svg) (Diprotodontia and Dasyuromorphia).

# In[16]:


orders = order_counts[order_counts >= 20]
print(orders)
abund_mammalia = mammalia[mammalia.Order.isin(orders.index)]


# In[17]:


sns.lmplot(
    x='Body mass (kg)',
    y='Metabolic rate (W)',
    hue='Order',
    data=abund_mammalia,
    ci=None,
    height=8,
    aspect=1.3,
    line_kws={'lw':2, 'ls':'--'},
    scatter_kws={'s':50, 'alpha':0.5}
);
# if you get an error about height not being a keyword, change it to size or update seaborn: conda update seaborn


# Because there is a lot of data here I made the lines thinner - this can be done by giving _matplotlib_ keywords as a dictionary to the argument `line_kws` - and I made the markers bigger but with alpha (transparency) 0.5 using the `scatter_kws` argument.
#
# Still, there's too much data, and part of the problem is that some orders consist of large animals (e.g. primates) and some of small animals (e.g. rodents).
#
# Let's plot a separate regression plot for each order.
# We do this using the `col` and `row` arguments of `lmplot`, but in general this can be done for any plot using [seaborn's `FacetGrid` class](http://stanford.edu/~mwaskom/software/seaborn/tutorial/axis_grids.html).

# In[18]:


# in recent seaborn versions, sharex and sharey are passed as facet_kws=dict(sharex=False, sharey=False)
sns.lmplot(
    x='Body mass (kg)',
    y='Metabolic rate (W)',
    data=abund_mammalia,
    hue='Order',
    col='Order',
    col_wrap=3,
    ci=None,
    scatter_kws={'s':40},
    sharex=False,
    sharey=False
);


# We used the `sharex=False` and `sharey=False` arguments so that each Order gets its own axis range and the data is spread nicely.

# # Statistics
#
# Lastly, let's do some quick statistics.
#
# First, calculate a summary of the mammals using `describe`.

# In[19]:


mass = abund_mammalia
mass.describe()


# Now let's check whether the body mass of rodents is significantly lower than that of carnivores.
#
# ## Exercise: boxplot
# **Plot boxplots of the mammals' body mass** using Seaborn, which is easier to use (and also makes nicer boxplots) than the standard matplotlib boxplot.

# In[ ]:




# In[82]:




# Now, we'll use a t-test (implemented in the `scipy.stats` module) to test the hypothesis that there is *no difference* in body mass between rodents and carnivores.
#
# - `ttest_ind` calculates the t-test for the means of *two independent* samples of scores.
# - `scipy.stats` has many more statistical tests, distributions, etc.
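
# By default `ttest_ind` assumes that both samples have equal variance; passing `equal_var=False` gives Welch's t-test, which does not make that assumption.
# Here is a minimal sketch of what that statistic computes, using two made-up samples (the values are hypothetical, not taken from AnAge) and comparing a manual calculation with SciPy's result.

# In[ ]:


import numpy as np
from scipy.stats import ttest_ind

# Two made-up samples (hypothetical values, for illustration only)
a = np.array([4.1, 5.3, 6.0, 7.2, 8.4])
b = np.array([0.2, 0.3, 0.25, 0.4, 0.35, 0.3])

# Welch's t statistic: difference of the means divided by the combined
# standard error, where each sample contributes its own variance
se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
t_manual = (a.mean() - b.mean()) / se

welch = ttest_ind(a, b, equal_var=False)
print('manual t:', t_manual)
print('scipy t: ', welch.statistic, 'p-value:', welch.pvalue)


# The two statistics should agree; the p-value additionally depends on the degrees of freedom, which SciPy estimates for us.
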
# In[20]:


from scipy.stats import ttest_ind


# In[21]:


carnivora_mass = abund_mammalia.loc[abund_mammalia['Order']=='Carnivora', 'Body mass (kg)']
rodentia_mass = abund_mammalia.loc[abund_mammalia['Order']=='Rodentia', 'Body mass (kg)']
res = ttest_ind(carnivora_mass, rodentia_mass, equal_var=False)
print("P-value of t-test: {:.2g}".format(res.pvalue))


# # References
#
# - Examples: [Seaborn example gallery](http://seaborn.pydata.org/examples/index.html)
# - Slides: [Statistical inference with Python](https://docs.google.com/presentation/d/1imQAEmNg4GB3bCAblauMOOLlAC95-XvkTSKB1_dB3Tg/pub?slide=id.p) by Allen Downey
# - Book: [Think Stats](http://greenteapress.com/thinkstats2/html/index.html) by Allen Downey - statistics with Python. Free Ebook.
# - Blog post: [A modern guide to getting started with Data Science and Python](http://twiecki.github.io/blog/2014/11/18/python-for-data-science/)
# - Tutorial: [An Introduction to Pandas](http://www.synesthesiam.com/posts/an-introduction-to-pandas.html)

# # Colophon
# This notebook was written by [Yoav Ram](http://python.yoavram.com).
#
# The notebook was written using [Python](http://python.org/) 3.7.
# Dependencies listed in [environment.yml](../environment.yml).
#
# This work is licensed under a CC BY-NC-SA 4.0 International License.
#
# ![Python logo](https://www.python.org/static/community_logos/python-logo.png)