Dataframes and exploratory plotting: Pandas and Seaborn¶

The University of Tübingen

Introduction¶

Python is a general-purpose programming language used in a vast number of domains. Thanks to add-on packages ("modules"), it can be used to analyse data in a manner similar to domain-specific languages (like R). One advantage of using a language like Python over something like R is that knowing some Python will probably be useful to you even when you're not doing data analysis.

Here we introduce Pandas, a module for handling data sets as dataframes, and Seaborn, a high-level plotting library.

We will run through code examples that do basic data manipulation in Python, and you will be able to keep these lecture slides for use in your own data analysis.
The web is also full of tutorials, guides and answers on how to do things: if you have a problem, do a web-search for the thing you're trying to do. StackOverflow is a particularly useful website.

Importing modules into the Python workspace¶

Modules that extend the functionality of Python must be imported before they can be used.

In [1]:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# set up to have plots appear in the ipython notebook:
%matplotlib inline  
sns.set_style("white")  # sets the style of seaborn plots

Example: Sleep in Mammals¶

We have compiled a few interesting datasets to use in this course. First we will look at a dataset of sleep times in mammals.

In this paper (Allison & Cicchetti, 1976), the authors collected a dataset of the number of hours various species of mammal sleep, and what type of sleep (slow-wave or REM sleep). They also compiled some other metrics, including things like lifespan and gestation, as well as environmental factors like the species' risk of predation. Here's the abstract from the paper:

The interrelationships between sleep, ecological, and constitutional variables were assessed statistically for 39 mammalian species. Slow-wave sleep is negatively associated with a factor related to body size, which suggests that large amounts of this sleep phase are disadvantageous in large species. Paradoxical sleep is associated with a factor related to predatory danger, which suggests that large amounts of this sleep phase are disadvantageous in prey species.

Allison, T. & Cicchetti, D.V. (1976). Sleep in mammals: ecological and constitutional correlates. Science, 194(4266): 732-734

The dataset is provided as a tab-delimited text file. First, let's import the data into a Pandas dataframe:

In [2]:

# import the tab-delimited text file into a Pandas dataframe object:
sleep = pd.read_csv('../Datasets/sleep.txt', sep='\t')  # "\t" is the tab character, so columns are split on tabs.

Note that Pandas offers lots of other read / write functions. For example, it's easy to read from an Excel spreadsheet using read_excel.

Note how we've referenced Pandas as pd (as we imported it, above). The dot after pd (as in pd.read_csv) tells us that we're using a function (read_csv) from the Pandas module. This is a nice feature of Python code, because it's clear where all of the functions you're using come from.
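To try the read_csv call without the course's data file, here is a minimal sketch that reads tab-separated text from an in-memory string. The two data rows are copied from the sleep dataset used in this lecture. The names raw_text and demo are ours, and io.StringIO simply stands in for a file on disk.

```python
import io

import pandas as pd

# Two rows of tab-separated text standing in for a file such as sleep.txt.
raw_text = (
    "Species\tBodyWt\tTotalSleep\n"
    "Baboon\t10.55\t9.8\n"
    "Bigbrownbat\t0.023\t19.7\n"
)

# sep="\t" tells read_csv that columns are separated by tabs, not commas.
demo = pd.read_csv(io.StringIO(raw_text), sep="\t")

print(demo.shape)   # (number of rows, number of columns)
print(demo.dtypes)  # Species is read as text, the other two columns as floats
```

The first line of the text is used as the column names, which is the default behaviour of read_csv.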

We can see the contents of our current "workspace" (the stuff our IPython Notebook knows about) by typing whos in an otherwise empty cell and running it:

In [3]:

whos

Variable   Type         Data/Info
---------------------------------
np         module       <module 'numpy' from '/Us<...>kages/numpy/__init__.py'>
pd         module       <module 'pandas' from '/U<...>ages/pandas/__init__.py'>
plt        module       <module 'matplotlib.pyplo<...>es/matplotlib/pyplot.py'>
sleep      DataFrame                       Specie<...>n\n[62 rows x 11 columns]
sns        module       <module 'seaborn' from '/<...>ges/seaborn/__init__.py'>

Objects¶

Most variables in the workspace are objects. An object has a specific meaning for programmers.

If you don't know this meaning, think of an object like a car:

you can do things with it
you can store things in it
if you use it for something it wasn't intended for, bad things can happen

Similarly, an object in programming usually contains methods (things you can do with it) and sometimes data of some sort (things stored in it). If you try to use an object in a way it's not intended to be used, it will return an error.

We can explore what stuff we can do with an object in the workspace by using dot completion in the notebook.

Dots, tab completion, accessing help¶

When using the IPython Notebook, you can check what you can do with an object in the workspace by typing a full stop after its name and then hitting tab in the cell. This shows you a list of all the methods and attributes of that object.

You can also hit shift + tab to get help on a method; hit this multiple times to get more help.
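Outside the notebook, roughly the same information is available from Python itself. The sketch below uses the built-in dir function to list what follows the dot, and reads a method's docstring, which is the text that the help shows. The small dataframe demo is a made-up stand-in, not the course data.

```python
import pandas as pd

# A tiny stand-in dataframe (values are illustrative only).
demo = pd.DataFrame({"BodyWt": [1.0, 10.55], "TotalSleep": [8.3, 9.8]})

# dir() lists everything reachable after the dot; names starting with an
# underscore are internal, so we filter them out.
public_names = [name for name in dir(demo) if not name.startswith("_")]
print(len(public_names), "public attributes and methods")
print("head" in public_names, "describe" in public_names)

# The first line of a method's docstring is a one-sentence summary.
first_line = demo.head.__doc__.strip().splitlines()[0]
print(first_line)
```

Calling help(demo.head) in a cell prints the full docstring.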

Finally, be sure to look at the user interface tour of the IPython Notebook, and to check out some of the keyboard shortcuts.

In [4]:

sleep.mean(numeric_only=True)  # average the numeric columns only (Species is text)

Out[4]:

BodyWt         198.789984
BrainWt        283.134194
NonDreaming      8.672917
Dreaming         1.972000
TotalSleep      10.532759
LifeSpan        19.877586
Gestation      142.353448
Predation        2.870968
Exposure         2.419355
Danger           2.612903
dtype: float64

In [5]:

# look at what the "sleep" dataframe contains:
sleep.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 62 entries, 0 to 61
Data columns (total 11 columns):
Species        62 non-null object
BodyWt         62 non-null float64
BrainWt        62 non-null float64
NonDreaming    48 non-null float64
Dreaming       50 non-null float64
TotalSleep     58 non-null float64
LifeSpan       58 non-null float64
Gestation      58 non-null float64
Predation      62 non-null int64
Exposure       62 non-null int64
Danger         62 non-null int64
dtypes: float64(7), int64(3), object(1)
memory usage: 5.8+ KB

In [6]:

# let's look at the first few lines of the sleep dataset using `head`:
sleep.head()

Out[6]:

	Species	BodyWt	BrainWt	NonDreaming	Dreaming	TotalSleep	LifeSpan	Gestation	Predation	Exposure	Danger
0	Africanelephant	6654.000	5712.0	NaN	NaN	3.3	38.6	645	3	5	3
1	Africangiantpouchedrat	1.000	6.6	6.3	2.0	8.3	4.5	42	3	1	3
2	ArcticFox	3.385	44.5	NaN	NaN	12.5	14.0	60	1	1	1
3	Arcticgroundsquirrel	0.920	5.7	NaN	NaN	16.5	NaN	25	5	2	3
4	Asianelephant	2547.000	4603.0	2.1	1.8	3.9	69.0	624	3	5	4

In [7]:

# we can also use the "describe" method to summarise some information in our variables:
sleep.describe()

Out[7]:

	BodyWt	BrainWt	NonDreaming	Dreaming	TotalSleep	LifeSpan	Gestation	Predation	Exposure	Danger
count	62.000000	62.000000	48.000000	50.000000	58.000000	58.000000	58.000000	62.000000	62.000000	62.000000
mean	198.789984	283.134194	8.672917	1.972000	10.532759	19.877586	142.353448	2.870968	2.419355	2.612903
std	899.158011	930.278942	3.666452	1.442651	4.606760	18.206255	146.805039	1.476414	1.604792	1.441252
min	0.005000	0.140000	2.100000	0.000000	2.600000	2.000000	12.000000	1.000000	1.000000	1.000000
25%	0.600000	4.250000	6.250000	0.900000	8.050000	6.625000	35.750000	2.000000	1.000000	1.000000
50%	3.342500	17.250000	8.350000	1.800000	10.450000	15.100000	79.000000	3.000000	2.000000	2.000000
75%	48.202500	166.000000	11.000000	2.550000	13.200000	27.750000	207.500000	4.000000	4.000000	4.000000
max	6654.000000	5712.000000	17.900000	6.600000	19.900000	100.000000	645.000000	5.000000	5.000000	5.000000

You can see that we have some missing cases: not all variables have 62 entries.

Let's keep only the complete cases, by using the dropna method:

In [8]:

sleep = sleep.dropna(axis=0)  # drop on axis 0 (i.e. rows)
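Before dropping rows it can help to see how much is missing and where. The following sketch uses a four-row stand-in built from values in the table above. It counts missing values per column, then contrasts dropping every incomplete row with dropping only the rows that lack one particular variable. The subset argument is a standard dropna option that this lecture does not otherwise use, and the name demo is ours.

```python
import numpy as np
import pandas as pd

# Four rows copied from the head of the sleep dataset (a subset of columns).
demo = pd.DataFrame({
    "Species": ["Africanelephant", "Africangiantpouchedrat",
                "ArcticFox", "Arcticgroundsquirrel"],
    "NonDreaming": [np.nan, 6.3, np.nan, np.nan],
    "TotalSleep": [3.3, 8.3, 12.5, 16.5],
    "LifeSpan": [38.6, 4.5, 14.0, np.nan],
})

# Count the missing values in each column.
missing_per_column = demo.isna().sum()
print(missing_per_column)

# Keep only rows with no missing values at all.
complete = demo.dropna(axis=0)

# Keep rows that have a LifeSpan, whatever else is missing.
has_lifespan = demo.dropna(subset=["LifeSpan"])

print(len(demo), len(complete), len(has_lifespan))
```

Dropping every incomplete row is the simplest choice, but it can discard many rows when only one rarely used column has gaps.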

In [9]:

sleep.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 42 entries, 1 to 60
Data columns (total 11 columns):
Species        42 non-null object
BodyWt         42 non-null float64
BrainWt        42 non-null float64
NonDreaming    42 non-null float64
Dreaming       42 non-null float64
TotalSleep     42 non-null float64
LifeSpan       42 non-null float64
Gestation      42 non-null float64
Predation      42 non-null int64
Exposure       42 non-null int64
Danger         42 non-null int64
dtypes: float64(7), int64(3), object(1)
memory usage: 3.9+ KB

In [10]:

sleep.head()

Out[10]:

	Species	BodyWt	BrainWt	NonDreaming	Dreaming	TotalSleep	LifeSpan	Gestation	Predation	Exposure	Danger
1	Africangiantpouchedrat	1.000	6.6	6.3	2.0	8.3	4.5	42	3	1	3
4	Asianelephant	2547.000	4603.0	2.1	1.8	3.9	69.0	624	3	5	4
5	Baboon	10.550	179.5	9.1	0.7	9.8	27.0	180	4	4	4
6	Bigbrownbat	0.023	0.3	15.8	3.9	19.7	19.0	35	1	1	1
7	Braziliantapir	160.000	169.0	5.2	1.0	6.2	30.4	392	4	5	4

Tidy data¶

Most of the tools we will use throughout this course expect your data to be in a format that we will call tidy data (after Wickham, 2013): each variable forms a column and each observation forms a row.

All of the datasets we have provided follow this format. It can also be called a "long format" data frame, because it tends to be longer (rows) than it is wide (columns).

As an example of a wide (i.e. untidy) data frame, imagine that we measured three people's heights, once at time one and once at time two:

In [11]:

heights = pd.DataFrame({'person': ['Jim', 'Sally', 'Meg'],
                        't1': [120, 123, 118],
                        't2': [152, 147, 153]})
heights

Out[11]:

	person	t1	t2
0	Jim	120	152
1	Sally	123	147
2	Meg	118	153

Do you see how the columns t1 and t2 are mixing up what should be one variable (time) with two levels (t1 and t2)? While a data frame like this can be useful in some circumstances, in general it makes things harder.

Thankfully, we can easily reshape wide dataframes into long dataframes using Pandas' melt function:

In [12]:

tidy = pd.melt(heights, id_vars='person', 
               var_name='time', value_name='height')
tidy

Out[12]:

	person	time	height
0	Jim	t1	120
1	Sally	t1	123
2	Meg	t1	118
3	Jim	t2	152
4	Sally	t2	147
5	Meg	t2	153

You can find out more about how you can reshape datasets in Pandas, including aggregating values using pivot tables, in the documentation on reshaping.
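As a small illustration of why the long format is convenient, and of going back the other way, the sketch below rebuilds the heights example, computes the mean height at each time point with groupby, and then uses pivot to return to the wide layout. The names mean_by_time and wide are ours.

```python
import pandas as pd

# The wide data frame from the example above.
heights = pd.DataFrame({"person": ["Jim", "Sally", "Meg"],
                        "t1": [120, 123, 118],
                        "t2": [152, 147, 153]})

# Wide to long: one row per (person, time) measurement.
tidy = pd.melt(heights, id_vars="person",
               var_name="time", value_name="height")

# With tidy data, a per-group summary is a single line.
mean_by_time = tidy.groupby("time")["height"].mean()
print(mean_by_time)

# Long to wide again: one row per person, one column per time point.
wide = tidy.pivot(index="person", columns="time", values="height")
print(wide)
```

Note that pivot requires each (person, time) pair to occur only once; when there are repeats, a pivot table with an aggregation function is the usual alternative.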

Exploratory plots¶

Now that we have only the complete cases, let's look at two variables of interest using the Seaborn plotting package.

In [13]:

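# simple histogram of total sleep, with a kernel density estimate overlaid
# (histplot replaces distplot, which is deprecated in recent Seaborn releases):
vec = sleep['TotalSleep']
g = sns.histplot(vec, kde=True)
sns.despine(ax=g, offset=10);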

In [14]:

# jointplot draws a scatter plot together with the marginal distribution of each variable:
g = sns.jointplot(x="BodyWt", y="TotalSleep", data=sleep);

We can see that the body weight variable is extremely skewed. It ranges from near zero (0.005 kg for the Lesser Short-Tailed Shrew) up to just over 2.5 tonnes (2547 kg for the Asian Elephant):

In [15]:

sleep['Species'][sleep['BodyWt']==sleep['BodyWt'].min()]

Out[15]:

31    Lessershort-tailedshrew
Name: Species, dtype: object

In [16]:

sleep['Species'][sleep['BodyWt']==sleep['BodyWt'].max()]

Out[16]:

4    Asianelephant
Name: Species, dtype: object
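The same lookups can also be written with idxmin and idxmax, which return the index label of the smallest and largest value. This is a sketch on a three-row stand-in that keeps the original index labels (4, 5 and 31) from the outputs above; the names demo, lightest and heaviest are ours.

```python
import pandas as pd

# Three rows from the sleep dataset, keeping their original index labels.
demo = pd.DataFrame(
    {"Species": ["Asianelephant", "Baboon", "Lessershort-tailedshrew"],
     "BodyWt": [2547.0, 10.55, 0.005]},
    index=[4, 5, 31],
)

# idxmin / idxmax give the index label of the extreme value.
lightest_label = demo["BodyWt"].idxmin()
heaviest_label = demo["BodyWt"].idxmax()

# .loc then selects by that label and by column name.
lightest = demo.loc[lightest_label, "Species"]
heaviest = demo.loc[heaviest_label, "Species"]
print(lightest_label, lightest)
print(heaviest_label, heaviest)
```

One difference from the comparison used above: if several animals tie for the extreme value, idxmin and idxmax return only the first one, whereas the Boolean comparison returns all of them.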

In [17]:

sleep['BodyWt'].describe()  # use the describe method on the variable BodyWt

Out[17]:

count      42.000000
mean      100.813905
std       402.082389
min         0.005000
25%         0.316250
50%         2.250000
75%        10.412500
max      2547.000000
Name: BodyWt, dtype: float64

Indexing Pandas dataframes¶

For the next step in our exploratory analysis, we want to access variables in our sleep dataframe.

There are a number of ways to access subsets of a Pandas dataframe; an overview can be found on the 10 minutes to Pandas website. We will concentrate here on selecting single variables (columns).

In [18]:

sleep['BodyWt']  # select the "BodyWt" (bodyweight) column; the notebook prints it

Out[18]:

1        1.000
4     2547.000
5       10.550
6        0.023
7      160.000
8        3.300
9       52.160
10       0.425
11     465.000
14       0.075
15       3.000
16       0.785
17       0.200
21      27.660
22       0.120
24      85.000
26       0.101
27       1.040
28     521.000
31       0.005
32       0.010
33      62.000
36       0.023
37       0.048
38       1.700
39       3.500
41       0.480
42      10.000
43       1.620
44     192.000
45       2.500
47       0.280
48       4.235
49       6.800
50       0.750
51       3.600
53      55.500
56       0.900
57       2.000
58       0.104
59       4.190
60       3.500
Name: BodyWt, dtype: float64

In [19]:

sleep['BodyWt'][0:3]  # the first three rows of the bodyweight variable

Out[19]:

1       1.00
4    2547.00
5      10.55
Name: BodyWt, dtype: float64

In [20]:

# Boolean indexing. All bodyweights of animals with less than 6 hours of sleep:
sleep['BodyWt'][sleep['TotalSleep'] < 6]  

Out[20]:

4     2547.00
11     465.00
21      27.66
28     521.00
51       3.60
53      55.50
57       2.00
Name: BodyWt, dtype: float64
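Because we dropped rows earlier, the index labels (1, 4, 5, ...) no longer match the row positions (0, 1, 2, ...). Pandas provides two explicit indexers for this situation: .loc selects by label and .iloc selects by position. The sketch below shows both on a three-row stand-in that uses the labels and values from the output above; the variable names are ours.

```python
import pandas as pd

# First three complete cases, with their original index labels.
demo = pd.DataFrame(
    {"BodyWt": [1.0, 2547.0, 10.55], "TotalSleep": [8.3, 3.9, 9.8]},
    index=[1, 4, 5],
)

# .loc uses the index label: the row labelled 4 is the second row.
by_label = demo.loc[4, "BodyWt"]

# .iloc uses the position: position 0 is the first row (labelled 1).
by_position = demo["BodyWt"].iloc[0]

# .loc also accepts a Boolean mask for the rows and a name for the column.
short_sleepers = demo.loc[demo["TotalSleep"] < 6, "BodyWt"]

print(by_label, by_position)
print(short_sleepers)
```

Selecting rows and columns in a single .loc call is often preferred over chaining two sets of square brackets, particularly when assigning new values.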

Variable transforms¶

Untransformed bodyweight is extremely skewed, which complicates data analysis. It makes sense to consider the log of body weight instead, so we create a new variable holding log bodyweight.

In [21]:

sleep['log_BodyWt'] = np.log(sleep['BodyWt'])
sleep['log_BodyWt'].describe()

Out[21]:

count    42.000000
mean      0.803995
std       3.069388
min      -5.298317
25%      -1.168641
50%       0.804719
75%       2.342741
max       7.842671
Name: log_BodyWt, dtype: float64
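To make the effect of the transform concrete, here is a minimal sketch that compares the sample skewness before and after taking the natural log. It uses only the first five body weights from the column printed earlier, so the numbers are illustrative rather than a summary of the whole dataset. The skew method is a standard Pandas method; the variable names are ours.

```python
import numpy as np
import pandas as pd

# First five body weights (kg) from the complete cases.
body_wt = pd.Series([1.0, 2547.0, 10.55, 0.023, 160.0], name="BodyWt")

# np.log is the natural logarithm, applied element by element.
log_body_wt = np.log(body_wt)

# Skewness near zero means a roughly symmetric distribution;
# a large positive value means a long right tail.
raw_skew = body_wt.skew()
log_skew = log_body_wt.skew()

print("skewness of raw weights:", round(raw_skew, 2))
print("skewness of log weights:", round(log_skew, 2))
```

The log compresses the few very heavy animals towards the rest, which is why the transformed variable is much closer to symmetric.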

In [22]:

sleep.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 42 entries, 1 to 60
Data columns (total 12 columns):
Species        42 non-null object
BodyWt         42 non-null float64
BrainWt        42 non-null float64
NonDreaming    42 non-null float64
Dreaming       42 non-null float64
TotalSleep     42 non-null float64
LifeSpan       42 non-null float64
Gestation      42 non-null float64
Predation      42 non-null int64
Exposure       42 non-null int64
Danger         42 non-null int64
log_BodyWt     42 non-null float64
dtypes: float64(8), int64(3), object(1)
memory usage: 4.3+ KB

In [23]:

g = sns.jointplot(x="log_BodyWt", y="TotalSleep", data=sleep)

Now we see a much nicer and more linear spread of the data. Heavier animals seem to sleep for fewer hours.
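One way to put a number on this impression is the correlation between log body weight and total sleep. The sketch below computes it with the corr method on the five complete cases shown earlier. With so few rows the value is only illustrative; on the full dataframe you would call the same method on the real columns. The names demo and r are ours.

```python
import numpy as np
import pandas as pd

# Body weight (kg) and total sleep (hours) for five complete cases.
demo = pd.DataFrame({
    "BodyWt": [1.0, 2547.0, 10.55, 0.023, 160.0],
    "TotalSleep": [8.3, 3.9, 9.8, 19.7, 6.2],
})
demo["log_BodyWt"] = np.log(demo["BodyWt"])

# Pearson correlation between the two columns (the default for corr).
r = demo["log_BodyWt"].corr(demo["TotalSleep"])
print("correlation:", round(r, 2))
```

A negative value means that heavier animals tend to sleep less, which matches what the plot suggests.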

Grouped plots¶

The sleep in mammals dataset contains a number of other variables. For example, the variable Danger is an ordinal scale variable, ranging from 1 ("least danger from other animals") to 5 ("most danger from other animals").

Does the total time spent sleeping change according to danger?

In [24]:

# one violin per danger level (older Seaborn releases used a `groupby` argument instead):
g = sns.violinplot(x='Danger', y='TotalSleep', data=sleep)
sns.despine(ax=g, offset=10);
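A numerical companion to the grouped plot is a grouped summary table. The sketch below uses groupby to compute the count, mean and median of total sleep at each danger level, on five rows taken from the table shown earlier. With so few rows it only demonstrates the mechanics; the names demo and summary are ours.

```python
import pandas as pd

# Danger rating and total sleep (hours) for five complete cases.
demo = pd.DataFrame({
    "Danger": [3, 4, 4, 1, 4],
    "TotalSleep": [8.3, 3.9, 9.8, 19.7, 6.2],
})

# Split the rows by danger level, then summarise total sleep in each group.
summary = demo.groupby("Danger")["TotalSleep"].agg(["count", "mean", "median"])
print(summary)
```

The result is itself a dataframe, indexed by danger level, so it can be indexed and plotted like any other.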

Out of interest, let's find which animals are in the highest and lowest danger categories:

In [25]:

sleep['Species'][sleep['Danger']==1]

Out[25]:

6              Bigbrownbat
8                      Cat
9               Chimpanzee
14     EasternAmericanmole
24                Grayseal
32          Littlebrownbat
33                     Man
38        NAmericanopossum
39    Nine-bandedarmadillo
48                  Redfox
60            Wateropossum
Name: Species, dtype: object

In [26]:

sleep['Species'][sleep['Danger']==5]

Out[26]:

11       Cow
21      Goat
28     Horse
45    Rabbit
53     Sheep
Name: Species, dtype: object

Faceted subplots¶

Seaborn allows us to easily visualise 2D plots across various groups by faceting plots in rows and columns according to categorical variables.

For example, we could ask: Does the relationship between total sleep time and bodyweight depend on danger level?

In [27]:

g = sns.FacetGrid(sleep, col="Danger", col_wrap=3)
g.map(plt.scatter, 'log_BodyWt', 'TotalSleep');
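For reference, here is a self-contained sketch of the same faceting idea that runs outside the notebook. It assumes a reasonably recent Seaborn release (one that provides scatterplot and map_dataframe) and uses five rows from the table shown earlier, so only three danger levels appear. The first two lines select a non-interactive Matplotlib backend; inside the notebook you would leave them out. The axis label text is ours.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed

import numpy as np
import pandas as pd
import seaborn as sns

# Five complete cases: body weight (kg), total sleep (hours), danger rating.
demo = pd.DataFrame({
    "BodyWt": [1.0, 2547.0, 10.55, 0.023, 160.0],
    "TotalSleep": [8.3, 3.9, 9.8, 19.7, 6.2],
    "Danger": [3, 4, 4, 1, 4],
})
demo["log_BodyWt"] = np.log(demo["BodyWt"])

# One panel per danger level, wrapping onto a new row after three panels.
g = sns.FacetGrid(demo, col="Danger", col_wrap=3)

# map_dataframe passes each panel's subset of the data to the plotting function.
g.map_dataframe(sns.scatterplot, x="log_BodyWt", y="TotalSleep")
g.set_axis_labels("log body weight (kg)", "total sleep (hours)")

n_panels = len(g.axes.flat)
print(n_panels, "panels")
```

All panels share the same axes by default, which makes it easier to compare the groups by eye.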

Go forth and explore!¶

Seaborn and Pandas allow you to do very powerful data manipulation and exploratory plots in relatively few lines of code. In this lecture we've provided a taste of what you can do. We encourage you to use this code, to check out more examples on the Seaborn website, and to get in and play around with our data or your own.

Don't be dismayed that errors are frequent. Learning to use these packages involves a lot of trial and error, hacking around and internet searches. The rewards for your research will be worthwhile.
