Dataframes and exploratory plotting: Pandas and Seaborn

Tom Wallis and Philipp Berens

The University of Tübingen

Introduction

Python is a general-purpose programming language used in a vast number of domains. Thanks to add-on packages ("modules"), it can be used to analyse data in a manner similar to domain-specific languages (like R). One advantage of using a language like Python over something like R is that knowing some Python will probably be useful to you even when you're not doing data analysis.

Here we introduce Pandas, a module for handling data sets as dataframes, and Seaborn, a high-level plotting library.

  • We will run through code examples of basic data manipulation in Python, and you can keep these lecture slides for use in your own data analysis.
  • The web is also full of tutorials, guides and answers on how to do things: if you have a problem, do a web-search for the thing you're trying to do. StackOverflow is a particularly useful website.

Importing modules into the Python workspace

Modules that extend the functionality of Python must be imported into memory before they can be used.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# set up to have plots appear in the ipython notebook:
%matplotlib inline  
sns.set_style("white")  # sets the style of seaborn plots

Example: Sleep in Mammals

We have compiled a few interesting datasets to use in this course. First we will look at a dataset of sleep times in mammals.

In this paper (Allison & Cicchetti, 1976), the authors compiled a dataset of how many hours various mammal species sleep, and how that time divides between types of sleep (slow-wave and REM sleep). They also compiled other measures, such as lifespan and gestation, as well as environmental factors like the species' risk of predation. Here's the abstract from the paper:

The interrelationships between sleep, ecological, and constitutional variables were assessed statistically for 39 mammalian species. Slow-wave sleep is negatively associated with a factor related to body size, which suggests that large amounts of this sleep phase are disadvantageous in large species. Paradoxical sleep is associated with a factor related to predatory danger, which suggests that large amounts of this sleep phase are disadvantageous in prey species.

Allison, T. & Cicchetti, D.V. (1976). Sleep in mammals: ecological and constitutional correlates. Science, 194(4266): 732-734

The dataset is provided as a tab-delimited text file. First, let's import the data into a Pandas dataframe:

In [2]:
# import the tab-delimited text file into a Pandas dataframe object:
sleep = pd.read_csv('../Datasets/sleep.txt', sep='\t')  # sep='\t' means the file is tab separated.

Note that Pandas offers lots of other read / write functions. For example, it's easy to read from an Excel spreadsheet using read_excel.

Note how we've referenced Pandas as pd (as we imported it, above). The dot after pd (as in pd.read_csv) tells us that we're using a function (read_csv) from the Pandas module. This is a nice feature of Python code, because it's clear where all of the functions you're using come from.
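
To try out read_csv without the course files, the sketch below parses a small tab-delimited string held in memory. The string is a hypothetical stand-in for a file on disk (its values are copied from two rows of the sleep data), and io.StringIO is only used to make the example self-contained.

```python
import io

import pandas as pd

# Hypothetical tab-delimited text, standing in for a file on disk.
raw = (
    "Species\tBodyWt\tTotalSleep\n"
    "Baboon\t10.55\t9.8\n"
    "Bigbrownbat\t0.023\t19.7\n"
)

# read_csv accepts any file-like object, so StringIO lets us skip the file.
toy = pd.read_csv(io.StringIO(raw), sep="\t")

print(toy.shape)   # (number of rows, number of columns)
print(toy.dtypes)  # Species is read as text, the other columns as floats
```

The same call with sep="," (the default) would read a comma-separated file.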

We can see the contents of our current "workspace" (the stuff our IPython Notebook knows about) by typing whos in an otherwise empty cell and running it:

In [3]:
whos  
Variable   Type         Data/Info
---------------------------------
np         module       <module 'numpy' from '/Us<...>kages/numpy/__init__.py'>
pd         module       <module 'pandas' from '/U<...>ages/pandas/__init__.py'>
plt        module       <module 'matplotlib.pyplo<...>es/matplotlib/pyplot.py'>
sleep      DataFrame                       Specie<...>n\n[62 rows x 11 columns]
sns        module       <module 'seaborn' from '/<...>ges/seaborn/__init__.py'>

Objects

Most variables in the workspace are objects. The term "object" has a specific meaning for programmers.

If you don't know this meaning, think of an object like a car:

  • you can do things with it
  • you can store things in it
  • if you use it for something it wasn't intended for, bad things can happen

Similarly, an object in programming usually contains methods (things you can do with it) and sometimes data of some sort (things stored in it). If you try to use an object in a way it's not intended to be used, it will return an error.
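
As a minimal sketch of this idea, the example below uses a hypothetical toy dataframe (not the sleep data) to show data stored in an object, a method called on it, and the error raised when the object is asked for something it does not support. The name no_such_method is made up on purpose.

```python
import pandas as pd

# Hypothetical toy dataframe, used only to illustrate the idea.
toy = pd.DataFrame({"hours": [8.3, 3.9, 9.8]})

shape = toy.shape                 # data stored in the object: (rows, columns)
mean_hours = toy["hours"].mean()  # a method: something the object can do

# Asking the object for something it does not support raises an error.
try:
    toy.no_such_method()
    error_name = None
except AttributeError as err:
    error_name = type(err).__name__

print(shape, mean_hours, error_name)
```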

We can explore what stuff we can do with an object in the workspace by using dot completion in the notebook.

Dots, tab completion, accessing help

When using the IPython Notebook, you can check what stuff you can do to an object in the workspace by putting a full stop (period) after it and then hitting tab in the cell. This shows you a list of the attributes and methods that belong to that object.

You can also hit shift + tab to get help on a method; hit this multiple times to get more help.
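
Outside the notebook, roughly the same information is available from plain Python. The sketch below, on a hypothetical toy dataframe, uses the built-in dir function to list what an object offers and reads the first line of a method's docstring as a short form of help.

```python
import pandas as pd

toy = pd.DataFrame({"hours": [8.3, 3.9, 9.8]})  # hypothetical toy data

# Public attributes and methods, roughly what tab completion would list.
public_names = [name for name in dir(toy) if not name.startswith("_")]
print(len(public_names))
print("dropna" in public_names, "describe" in public_names)

# First line of a method's docstring, a short version of the pop-up help.
first_help_line = pd.DataFrame.dropna.__doc__.strip().splitlines()[0]
print(first_help_line)
```

Calling help(toy.dropna) prints the full documentation instead of just the first line.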

Finally, be sure to look at the user interface tour of the IPython Notebook, and to check out some of the keyboard shortcuts.

In [4]:
sleep.mean(numeric_only=True)  # numeric_only skips the text column "Species"
Out[4]:
BodyWt         198.789984
BrainWt        283.134194
NonDreaming      8.672917
Dreaming         1.972000
TotalSleep      10.532759
LifeSpan        19.877586
Gestation      142.353448
Predation        2.870968
Exposure         2.419355
Danger           2.612903
dtype: float64
In [5]:
# look at what the "sleep" dataframe contains:
sleep.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 62 entries, 0 to 61
Data columns (total 11 columns):
Species        62 non-null object
BodyWt         62 non-null float64
BrainWt        62 non-null float64
NonDreaming    48 non-null float64
Dreaming       50 non-null float64
TotalSleep     58 non-null float64
LifeSpan       58 non-null float64
Gestation      58 non-null float64
Predation      62 non-null int64
Exposure       62 non-null int64
Danger         62 non-null int64
dtypes: float64(7), int64(3), object(1)
memory usage: 5.8+ KB
In [6]:
# let's look at the first few lines of the sleep dataset using `head`:
sleep.head()
Out[6]:
                  Species    BodyWt  BrainWt  NonDreaming  Dreaming  TotalSleep  LifeSpan  Gestation  Predation  Exposure  Danger
0         Africanelephant  6654.000   5712.0          NaN       NaN         3.3      38.6        645          3         5       3
1  Africangiantpouchedrat     1.000      6.6          6.3       2.0         8.3       4.5         42          3         1       3
2               ArcticFox     3.385     44.5          NaN       NaN        12.5      14.0         60          1         1       1
3    Arcticgroundsquirrel     0.920      5.7          NaN       NaN        16.5       NaN         25          5         2       3
4           Asianelephant  2547.000   4603.0          2.1       1.8         3.9      69.0        624          3         5       4
In [7]:
# we can also use the "describe" method to summarise some information in our variables:
sleep.describe()
Out[7]:
            BodyWt      BrainWt  NonDreaming   Dreaming  TotalSleep    LifeSpan   Gestation  Predation   Exposure     Danger
count    62.000000    62.000000    48.000000  50.000000   58.000000   58.000000   58.000000  62.000000  62.000000  62.000000
mean    198.789984   283.134194     8.672917   1.972000   10.532759   19.877586  142.353448   2.870968   2.419355   2.612903
std     899.158011   930.278942     3.666452   1.442651    4.606760   18.206255  146.805039   1.476414   1.604792   1.441252
min       0.005000     0.140000     2.100000   0.000000    2.600000    2.000000   12.000000   1.000000   1.000000   1.000000
25%       0.600000     4.250000     6.250000   0.900000    8.050000    6.625000   35.750000   2.000000   1.000000   1.000000
50%       3.342500    17.250000     8.350000   1.800000   10.450000   15.100000   79.000000   3.000000   2.000000   2.000000
75%      48.202500   166.000000    11.000000   2.550000   13.200000   27.750000  207.500000   4.000000   4.000000   4.000000
max    6654.000000  5712.000000    17.900000   6.600000   19.900000  100.000000  645.000000   5.000000   5.000000   5.000000

You can see that we have some missing cases: not all variables have 62 entries.

Let's keep only the complete cases, by using the dropna method:

In [8]:
sleep = sleep.dropna(axis=0)  # drop on axis 0 (i.e. rows)
In [9]:
sleep.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 42 entries, 1 to 60
Data columns (total 11 columns):
Species        42 non-null object
BodyWt         42 non-null float64
BrainWt        42 non-null float64
NonDreaming    42 non-null float64
Dreaming       42 non-null float64
TotalSleep     42 non-null float64
LifeSpan       42 non-null float64
Gestation      42 non-null float64
Predation      42 non-null int64
Exposure       42 non-null int64
Danger         42 non-null int64
dtypes: float64(7), int64(3), object(1)
memory usage: 3.9+ KB
In [10]:
sleep.head()
Out[10]:
                  Species    BodyWt  BrainWt  NonDreaming  Dreaming  TotalSleep  LifeSpan  Gestation  Predation  Exposure  Danger
1  Africangiantpouchedrat     1.000      6.6          6.3       2.0         8.3       4.5         42          3         1       3
4           Asianelephant  2547.000   4603.0          2.1       1.8         3.9      69.0        624          3         5       4
5                  Baboon    10.550    179.5          9.1       0.7         9.8      27.0        180          4         4       4
6             Bigbrownbat     0.023      0.3         15.8       3.9        19.7      19.0         35          1         1       1
7          Braziliantapir   160.000    169.0          5.2       1.0         6.2      30.4        392          4         5       4
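
The effect of dropna can be checked by hand on a small example. The sketch below rebuilds three columns of the first five rows printed by head (before any rows were dropped), counts the missing values per column, and drops incomplete rows. The names toy, missing, complete and has_lifespan are illustrative only.

```python
import numpy as np
import pandas as pd

# The first five rows of the original dataframe, restricted to three columns.
toy = pd.DataFrame({
    "Species": ["Africanelephant", "Africangiantpouchedrat", "ArcticFox",
                "Arcticgroundsquirrel", "Asianelephant"],
    "NonDreaming": [np.nan, 6.3, np.nan, np.nan, 2.1],
    "LifeSpan": [38.6, 4.5, 14.0, np.nan, 69.0],
})

missing = toy.isna().sum()     # number of missing values in each column
complete = toy.dropna(axis=0)  # keep only rows with no missing values

# dropna can also be restricted to particular columns via `subset`.
has_lifespan = toy.dropna(subset=["LifeSpan"])

print(missing)
print(list(complete.index))  # the original row labels are kept
print(len(has_lifespan))
```

Note that dropna keeps the original row labels, which is why the index of the cleaned sleep dataframe starts at 1 and has gaps.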

Tidy data

Most of the tools we will use throughout this course expect your data to be in a format that we will call tidy data (after Wickham, 2013): each variable forms a column and each observation forms a row.

All of the datasets we have provided follow this format. It can also be called a "long format" data frame, because it tends to be longer (rows) than it is wide (columns).

As an example of a wide (i.e. untidy) data frame, imagine that we measured three people's heights, once at time one and once at time two:

In [11]:
heights = pd.DataFrame({'person': ['Jim', 'Sally', 'Meg'],
                        't1': [120, 123, 118],
                        't2': [152, 147, 153]})
heights
Out[11]:
  person   t1   t2
0    Jim  120  152
1  Sally  123  147
2    Meg  118  153

Do you see how the columns t1 and t2 are mixing up what should be one variable (time) with two levels (t1 and t2)? While a data frame like this can be useful in some circumstances, in general it makes things harder.

Thankfully, we can easily reshape wide dataframes into long dataframes using Pandas' melt function:

In [12]:
tidy = pd.melt(heights, id_vars='person', 
               var_name='time', value_name='height')
tidy
Out[12]:
  person time  height
0    Jim   t1     120
1  Sally   t1     123
2    Meg   t1     118
3    Jim   t2     152
4  Sally   t2     147
5    Meg   t2     153

You can find out more about how you can reshape datasets in Pandas, including aggregating values using pivot tables, in the documentation on reshaping.
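
As a small sketch of reshaping in the other direction, the example below rebuilds the heights data, melts it, and then uses pivot to return to the wide layout and pivot_table to aggregate. The names wide and mean_by_time are illustrative only.

```python
import pandas as pd

heights = pd.DataFrame({'person': ['Jim', 'Sally', 'Meg'],
                        't1': [120, 123, 118],
                        't2': [152, 147, 153]})
tidy = pd.melt(heights, id_vars='person',
               var_name='time', value_name='height')

# pivot: one row per person, one column per time point (undoes the melt).
wide = tidy.pivot(index='person', columns='time', values='height')

# pivot_table also aggregates; here, the mean height at each time point.
mean_by_time = tidy.pivot_table(index='time', values='height', aggfunc='mean')

print(wide)
print(mean_by_time)
```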

Exploratory plots

Now that we have only the complete cases, let's look at two variables of interest using the Seaborn plotting package.

In [13]:
# simple histogram of Total sleep:
vec = sleep['TotalSleep']
g = sns.distplot(vec)
sns.despine(ax=g, offset=10);    
In [14]:
# the jointplot function can be used to set up a subplot grid:
g = sns.jointplot(x="BodyWt", y="TotalSleep", data=sleep);  # keyword arguments work across seaborn versions

We can see that the body weight variable is extremely skewed. It ranges from near zero (0.005 kg for the Lesser Short-Tailed Shrew) up to just over 2.5 tonnes (for the Asian Elephant):

In [15]:
sleep['Species'][sleep['BodyWt']==sleep['BodyWt'].min()]
Out[15]:
31    Lessershort-tailedshrew
Name: Species, dtype: object
In [16]:
sleep['Species'][sleep['BodyWt']==sleep['BodyWt'].max()]
Out[16]:
4    Asianelephant
Name: Species, dtype: object
In [17]:
sleep['BodyWt'].describe()  # use the describe method on the variable BodyWt
Out[17]:
count      42.000000
mean      100.813905
std       402.082389
min         0.005000
25%         0.316250
50%         2.250000
75%        10.412500
max      2547.000000
Name: BodyWt, dtype: float64
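
An alternative to comparing every row against the minimum or maximum is to ask for the row label of the extreme value directly. The sketch below does this with idxmin and idxmax on three rows copied from the sleep dataframe (with their original row labels); the name toy is illustrative only.

```python
import pandas as pd

# Three rows from the sleep dataframe, keeping their original row labels.
toy = pd.DataFrame(
    {"Species": ["Asianelephant", "Baboon", "Lessershort-tailedshrew"],
     "BodyWt": [2547.0, 10.55, 0.005]},
    index=[4, 5, 31],
)

lightest_label = toy["BodyWt"].idxmin()  # row label of the smallest value
heaviest_label = toy["BodyWt"].idxmax()  # row label of the largest value

# .loc looks rows up by label, then picks the requested column.
print(toy.loc[lightest_label, "Species"])
print(toy.loc[heaviest_label, "Species"])
```

If several rows tie for the extreme value, idxmin and idxmax return only the first one, whereas the Boolean comparison returns them all.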

Indexing Pandas dataframes

For the next step in our exploratory analysis, we want to access variables in our sleep dataframe.

There are a number of ways to access subsets of a Pandas dataframe; an overview can be found on the 10 minutes to Pandas website. We will concentrate here on selecting single variables (columns).

In [18]:
sleep['BodyWt']  # prints the column of "bodyweight" variable.
Out[18]:
1        1.000
4     2547.000
5       10.550
6        0.023
7      160.000
8        3.300
9       52.160
10       0.425
11     465.000
14       0.075
15       3.000
16       0.785
17       0.200
21      27.660
22       0.120
24      85.000
26       0.101
27       1.040
28     521.000
31       0.005
32       0.010
33      62.000
36       0.023
37       0.048
38       1.700
39       3.500
41       0.480
42      10.000
43       1.620
44     192.000
45       2.500
47       0.280
48       4.235
49       6.800
50       0.750
51       3.600
53      55.500
56       0.900
57       2.000
58       0.104
59       4.190
60       3.500
Name: BodyWt, dtype: float64
In [19]:
sleep['BodyWt'][0:3]  # the first three rows of the bodyweight variable
Out[19]:
1       1.00
4    2547.00
5      10.55
Name: BodyWt, dtype: float64
In [20]:
# Boolean indexing. All bodyweights of animals with less than 6 hours of sleep:
sleep['BodyWt'][sleep['TotalSleep'] < 6]  
Out[20]:
4     2547.00
11     465.00
21      27.66
28     521.00
51       3.60
53      55.50
57       2.00
Name: BodyWt, dtype: float64
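
The same selections can be written with the explicit indexers .iloc (by position) and .loc (by label or Boolean mask), which make it clearer what kind of lookup is intended. The sketch below uses the five rows printed by sleep.head() after dropna; the variable names are illustrative only.

```python
import pandas as pd

# Body weight and total sleep for the first five complete cases.
toy = pd.DataFrame(
    {"BodyWt": [1.0, 2547.0, 10.55, 0.023, 160.0],
     "TotalSleep": [8.3, 3.9, 9.8, 19.7, 6.2]},
    index=[1, 4, 5, 6, 7],
)

first_three = toy["BodyWt"].iloc[0:3]  # by position: the first three rows
by_label = toy.loc[4, "BodyWt"]        # by row label: the row labelled 4

# Boolean mask and column selection in a single .loc call.
short_sleepers = toy.loc[toy["TotalSleep"] < 6, "BodyWt"]

# Combine conditions with & (and) or | (or); each needs its own parentheses.
heavy_short = toy.loc[(toy["TotalSleep"] < 7) & (toy["BodyWt"] > 100), "BodyWt"]

print(first_three)
print(by_label)
print(short_sleepers)
print(heavy_short)
```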

Variable transforms

  • Linear bodyweight is extremely skewed; this complicates data analysis
  • It makes sense to consider the log of body weight instead
  • We can create a new variable of log bodyweight.
In [21]:
sleep['log_BodyWt'] = np.log(sleep['BodyWt'])
sleep['log_BodyWt'].describe()
Out[21]:
count    42.000000
mean      0.803995
std       3.069388
min      -5.298317
25%      -1.168641
50%       0.804719
75%       2.342741
max       7.842671
Name: log_BodyWt, dtype: float64
In [22]:
sleep.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 42 entries, 1 to 60
Data columns (total 12 columns):
Species        42 non-null object
BodyWt         42 non-null float64
BrainWt        42 non-null float64
NonDreaming    42 non-null float64
Dreaming       42 non-null float64
TotalSleep     42 non-null float64
LifeSpan       42 non-null float64
Gestation      42 non-null float64
Predation      42 non-null int64
Exposure       42 non-null int64
Danger         42 non-null int64
log_BodyWt     42 non-null float64
dtypes: float64(8), int64(3), object(1)
memory usage: 4.3+ KB
In [23]:
g = sns.jointplot(x="log_BodyWt", y="TotalSleep", data=sleep)  # keyword arguments work across seaborn versions

Now we see a much nicer and more linear spread of the data. Heavier animals seem to sleep for fewer hours.
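
The sketch below illustrates the transform and a simple numerical check of the trend on the five rows printed by sleep.head() after dropna. Note that np.log is the natural logarithm; np.log10 is shown as an alternative. Five rows are far too few to draw conclusions from, so this only shows how the calls fit together.

```python
import numpy as np
import pandas as pd

# Body weight and total sleep for the first five complete cases.
toy = pd.DataFrame({
    "BodyWt": [1.0, 2547.0, 10.55, 0.023, 160.0],
    "TotalSleep": [8.3, 3.9, 9.8, 19.7, 6.2],
})

toy["log_BodyWt"] = np.log(toy["BodyWt"])      # natural logarithm
toy["log10_BodyWt"] = np.log10(toy["BodyWt"])  # base 10 alternative

# Pearson correlation between log body weight and total sleep.
r = toy["log_BodyWt"].corr(toy["TotalSleep"])

print(toy)
print(r)  # negative here: in these rows, heavier animals sleep less
```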

Grouped plots

The sleep in mammals dataset contains a number of other variables. For example, the variable Danger is an ordinal scale variable, ranging from 1 ("least danger from other animals") to 5 ("most danger from other animals").

Does the total time spent sleeping change according to danger?

In [24]:
g = sns.violinplot(x='Danger', y='TotalSleep', data=sleep)  # one violin per danger level
sns.despine(ax=g, offset=10);
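
A numerical companion to a grouped plot is a grouped summary table. The sketch below uses the Pandas groupby method on the five rows printed by sleep.head() after dropna; the names toy and summary are illustrative only.

```python
import pandas as pd

# Total sleep and danger level for the first five complete cases.
toy = pd.DataFrame({
    "TotalSleep": [8.3, 3.9, 9.8, 19.7, 6.2],
    "Danger": [3, 4, 4, 1, 4],
})

# Split the rows by danger level, then summarise total sleep in each group.
summary = toy.groupby("Danger")["TotalSleep"].agg(["count", "mean", "median"])
print(summary)
```

The count column is worth checking alongside any grouped plot, since groups with very few animals give unreliable summaries.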

Out of interest, let's find which animals are in the highest and lowest danger categories:

In [25]:
sleep['Species'][sleep['Danger']==1]
Out[25]:
6              Bigbrownbat
8                      Cat
9               Chimpanzee
14     EasternAmericanmole
24                Grayseal
32          Littlebrownbat
33                     Man
38        NAmericanopossum
39    Nine-bandedarmadillo
48                  Redfox
60            Wateropossum
Name: Species, dtype: object
In [26]:
sleep['Species'][sleep['Danger']==5]
Out[26]:
11       Cow
21      Goat
28     Horse
45    Rabbit
53     Sheep
Name: Species, dtype: object

Faceted subplots

Seaborn allows us to easily visualise 2D plots across various groups by faceting plots in rows and columns according to categorical variables.

For example, we could ask: Does the relationship between total sleep time and bodyweight depend on danger level?

In [27]:
g = sns.FacetGrid(sleep, col="Danger", col_wrap=3)
g.map(plt.scatter, 'log_BodyWt', 'TotalSleep');

Go forth and explore!

Seaborn and Pandas allow you to do very powerful data manipulation and exploratory plots in relatively few lines of code. In this lecture we've provided a taste of what you can do. We encourage you to use this code, to check out more examples on the Seaborn website, and to get in and play around with our data or your own.

Don't be dismayed if you run into frequent errors. Learning to use these packages involves a lot of trial and error, hacking around and internet searches. The rewards for your research will be worthwhile.
