The University of Tübingen
Python is a general-purpose programming language used in a vast number of domains. Thanks to add-on packages ("modules"), it can be used to analyse data in a manner similar to domain-specific languages (like R). One advantage of using a language like Python over something like R is that knowing some Python will probably be useful to you even when you're not doing data analysis.
Here we introduce Pandas, a module for handling data sets as dataframes, and Seaborn, a high-level plotting library.
To use the modules that extend the functionality of Python, they must first be imported into memory.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# set up to have plots appear in the ipython notebook:
%matplotlib inline
sns.set_style("white") # sets the style of seaborn plots
We have compiled a few interesting datasets to use in this course. First we will look at a dataset of sleep times in mammals.
In this paper (Allison & Cicchetti, 1976), the authors collected a dataset of the number of hours various species of mammal sleep, and what type of sleep (slow-wave or REM sleep). They also compiled some other metrics, including things like lifespan and gestation, as well as environmental factors like the species' risk of predation. Here's the abstract from the paper:
The interrelationships between sleep, ecological, and constitutional variables were assessed statistically for 39 mammalian species. Slow-wave sleep is negatively associated with a factor related to body size, which suggests that large amounts of this sleep phase are disadvantageous in large species. Paradoxical sleep is associated with a factor related to predatory danger, which suggests that large amounts of this sleep phase are disadvantageous in prey species.
Allison, T. & Cicchetti, D.V. (1976). Sleep in mammals: ecological and constitutional correlates. Science, 194(4266): 732-734
The dataset is provided as a tab-delimited text file. First, let's import the data into a Pandas dataframe:
# import the dataset csv file into a Pandas dataframe object:
sleep = pd.read_csv('../Datasets/sleep.txt', sep='\t') # the "\t" means it's a tab separated file.
Note that Pandas offers lots of other read / write functions. For example, it's easy to read from an Excel spreadsheet using read_excel.
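As a minimal sketch of the read / write functions mentioned above, the example below parses a small tab-delimited table from an in-memory string (standing in for a file on disk) and turns it back into comma-separated text. The three rows are copied from the sleep table shown later in this notebook; the names raw_text, toy and csv_text are made up for illustration.

```python
import io

import pandas as pd

# A tiny tab-delimited table held in a string. io.StringIO lets read_csv
# treat the string like an open file (illustrative stand-in for sleep.txt).
raw_text = (
    "Species\tBodyWt\tTotalSleep\n"
    "Baboon\t10.55\t9.8\n"
    "Bigbrownbat\t0.023\t19.7\n"
    "Braziliantapir\t160.0\t6.2\n"
)

toy = pd.read_csv(io.StringIO(raw_text), sep="\t")
print(toy)

# Called without a path, to_csv returns the text instead of writing a file.
csv_text = toy.to_csv(index=False)
print(csv_text)
```

Passing a filename instead of the StringIO object reads from (or writes to) disk in the same way.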
Note how we've referenced Pandas as pd (as we imported it, above). The dot after pd (as in pd.read_csv) tells us that we're using a function (read_csv) from the Pandas module. This is a nice feature of Python code, because it's clear where all of the functions you're using come from.
We can see the contents of our current "workspace" (the stuff our IPython Notebook knows about) by typing whos in an otherwise empty cell and running it:
whos
Variable   Type         Data/Info
---------------------------------
np         module       <module 'numpy' from '/Us<...>kages/numpy/__init__.py'>
pd         module       <module 'pandas' from '/U<...>ages/pandas/__init__.py'>
plt        module       <module 'matplotlib.pyplo<...>es/matplotlib/pyplot.py'>
sleep      DataFrame    Specie<...>n\n[62 rows x 11 columns]
sns        module       <module 'seaborn' from '/<...>ges/seaborn/__init__.py'>
Most variables in the workspace are objects. An object has a specific meaning for programmers.
If you don't know this meaning, think of an object like a car:
Similarly, an object in programming usually contains methods (things you can do with it) and sometimes data of some sort (things stored in it). If you try to use an object in a way it's not intended to be used, it will return an error.
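To make the idea of methods and stored data concrete, here is a small sketch using a made-up three-row dataframe (the name toy, the column hours and the non-existent method fly are all invented for illustration). It shows an attribute holding data, a method doing something, and the error raised when we ask the object for something it does not provide.

```python
import pandas as pd

# A hypothetical three-row dataframe, made up for this illustration.
toy = pd.DataFrame({"hours": [8.0, 10.0, 12.0]})

# Data stored in the object: an attribute, accessed without parentheses.
print(toy.shape)

# Something the object can do: a method, called with parentheses.
average = toy["hours"].mean()
print(average)

# Asking for something the object does not provide raises an error.
try:
    toy.fly()
except AttributeError as err:
    message = str(err)
    print("AttributeError:", message)
```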
When using the IPython Notebook, you can explore what you can do with an object in the workspace by using dot completion: type a full stop after the object's name and then hit tab in the cell. This shows you a list of the methods and attributes that belong to that object.
You can also hit shift + tab to get help on a method; hit this multiple times to get more help.
Finally, be sure to look at the user interface tour of the IPython Notebook, and to check out some of the keyboard shortcuts.
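Outside the notebook, where tab completion may not be available, plain Python offers a rough equivalent. The sketch below uses the built-in dir function to list what a dataframe provides and reads a method's docstring directly; the variable names public_names and first_line are made up for illustration.

```python
import pandas as pd

# List the public names (methods and attributes) defined on DataFrame.
public_names = [name for name in dir(pd.DataFrame) if not name.startswith("_")]
print(len(public_names), "public names, for example:", public_names[:5])

# Every documented method carries a docstring; print its first line.
first_line = pd.DataFrame.dropna.__doc__.strip().splitlines()[0]
print(first_line)
```

Calling help(pd.DataFrame.dropna) prints the full documentation for the method.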
sleep.mean(numeric_only=True) # numeric_only=True skips the text column (Species)
BodyWt         198.789984
BrainWt        283.134194
NonDreaming      8.672917
Dreaming         1.972000
TotalSleep      10.532759
LifeSpan        19.877586
Gestation      142.353448
Predation        2.870968
Exposure         2.419355
Danger           2.612903
dtype: float64
# look at what the "sleep" dataframe contains:
sleep.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 62 entries, 0 to 61
Data columns (total 11 columns):
Species        62 non-null object
BodyWt         62 non-null float64
BrainWt        62 non-null float64
NonDreaming    48 non-null float64
Dreaming       50 non-null float64
TotalSleep     58 non-null float64
LifeSpan       58 non-null float64
Gestation      58 non-null float64
Predation      62 non-null int64
Exposure       62 non-null int64
Danger         62 non-null int64
dtypes: float64(7), int64(3), object(1)
memory usage: 5.8+ KB
# let's look at the first few lines of the sleep dataset using `head`:
sleep.head()
| | Species | BodyWt | BrainWt | NonDreaming | Dreaming | TotalSleep | LifeSpan | Gestation | Predation | Exposure | Danger |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Africanelephant | 6654.000 | 5712.0 | NaN | NaN | 3.3 | 38.6 | 645 | 3 | 5 | 3 |
1 | Africangiantpouchedrat | 1.000 | 6.6 | 6.3 | 2.0 | 8.3 | 4.5 | 42 | 3 | 1 | 3 |
2 | ArcticFox | 3.385 | 44.5 | NaN | NaN | 12.5 | 14.0 | 60 | 1 | 1 | 1 |
3 | Arcticgroundsquirrel | 0.920 | 5.7 | NaN | NaN | 16.5 | NaN | 25 | 5 | 2 | 3 |
4 | Asianelephant | 2547.000 | 4603.0 | 2.1 | 1.8 | 3.9 | 69.0 | 624 | 3 | 5 | 4 |
# we can also use the "describe" method to summarise some information in our variables:
sleep.describe()
| | BodyWt | BrainWt | NonDreaming | Dreaming | TotalSleep | LifeSpan | Gestation | Predation | Exposure | Danger |
---|---|---|---|---|---|---|---|---|---|---|
count | 62.000000 | 62.000000 | 48.000000 | 50.000000 | 58.000000 | 58.000000 | 58.000000 | 62.000000 | 62.000000 | 62.000000 |
mean | 198.789984 | 283.134194 | 8.672917 | 1.972000 | 10.532759 | 19.877586 | 142.353448 | 2.870968 | 2.419355 | 2.612903 |
std | 899.158011 | 930.278942 | 3.666452 | 1.442651 | 4.606760 | 18.206255 | 146.805039 | 1.476414 | 1.604792 | 1.441252 |
min | 0.005000 | 0.140000 | 2.100000 | 0.000000 | 2.600000 | 2.000000 | 12.000000 | 1.000000 | 1.000000 | 1.000000 |
25% | 0.600000 | 4.250000 | 6.250000 | 0.900000 | 8.050000 | 6.625000 | 35.750000 | 2.000000 | 1.000000 | 1.000000 |
50% | 3.342500 | 17.250000 | 8.350000 | 1.800000 | 10.450000 | 15.100000 | 79.000000 | 3.000000 | 2.000000 | 2.000000 |
75% | 48.202500 | 166.000000 | 11.000000 | 2.550000 | 13.200000 | 27.750000 | 207.500000 | 4.000000 | 4.000000 | 4.000000 |
max | 6654.000000 | 5712.000000 | 17.900000 | 6.600000 | 19.900000 | 100.000000 | 645.000000 | 5.000000 | 5.000000 | 5.000000 |
You can see that we have some missing cases: not all variables have 62 entries.
Let's keep only the complete cases, by using the dropna method:
sleep = sleep.dropna(axis=0) # drop on axis 0 (i.e. rows)
sleep.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 42 entries, 1 to 60
Data columns (total 11 columns):
Species        42 non-null object
BodyWt         42 non-null float64
BrainWt        42 non-null float64
NonDreaming    42 non-null float64
Dreaming       42 non-null float64
TotalSleep     42 non-null float64
LifeSpan       42 non-null float64
Gestation      42 non-null float64
Predation      42 non-null int64
Exposure       42 non-null int64
Danger         42 non-null int64
dtypes: float64(7), int64(3), object(1)
memory usage: 3.9+ KB
sleep.head()
| | Species | BodyWt | BrainWt | NonDreaming | Dreaming | TotalSleep | LifeSpan | Gestation | Predation | Exposure | Danger |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | Africangiantpouchedrat | 1.000 | 6.6 | 6.3 | 2.0 | 8.3 | 4.5 | 42 | 3 | 1 | 3 |
4 | Asianelephant | 2547.000 | 4603.0 | 2.1 | 1.8 | 3.9 | 69.0 | 624 | 3 | 5 | 4 |
5 | Baboon | 10.550 | 179.5 | 9.1 | 0.7 | 9.8 | 27.0 | 180 | 4 | 4 | 4 |
6 | Bigbrownbat | 0.023 | 0.3 | 15.8 | 3.9 | 19.7 | 19.0 | 35 | 1 | 1 | 1 |
7 | Braziliantapir | 160.000 | 169.0 | 5.2 | 1.0 | 6.2 | 30.4 | 392 | 4 | 5 | 4 |
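Before dropping rows it can help to see where the gaps are. The sketch below uses a made-up four-row table (the species labels and values are hypothetical) to count missing values per column, drop incomplete rows as above, and, as an alternative, drop rows only when one chosen column is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps, loosely modelled on the sleep table.
toy = pd.DataFrame({
    "Species": ["A", "B", "C", "D"],
    "Dreaming": [2.0, np.nan, 1.8, np.nan],
    "LifeSpan": [4.5, 14.0, np.nan, 69.0],
})

# Count the missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Drop every row that has at least one missing value.
complete = toy.dropna(axis=0)

# Drop rows only when a particular column is missing.
has_lifespan = toy.dropna(subset=["LifeSpan"])

print(len(complete), len(has_lifespan))
```

Note that dropna keeps the original index labels, which is why the index of the sleep dataframe above now has gaps (1, 4, 5, ...).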
Most of the tools we will use throughout this course expect your data to be in a format that we will call tidy data (after Wickham, 2013):
All of the datasets we have provided follow this format. It can also be called a "long format" data frame, because it tends to be longer (rows) than it is wide (columns).
As an example of a wide (i.e. untidy) data frame, imagine that we measured three people's heights, once at time one and once at time two:
heights = pd.DataFrame({'person': ['Jim', 'Sally', 'Meg'],
't1': [120, 123, 118],
't2': [152, 147, 153]})
heights
| | person | t1 | t2 |
---|---|---|---|
0 | Jim | 120 | 152 |
1 | Sally | 123 | 147 |
2 | Meg | 118 | 153 |
Do you see how the columns t1 and t2 are mixing up what should be one variable (time) with two levels (t1 and t2)? While a data frame like this can be useful in some circumstances, in general it makes things harder.
Thankfully, we can easily reshape wide dataframes into long dataframes using Pandas' melt function:
tidy = pd.melt(heights, id_vars='person',
var_name='time', value_name='height')
tidy
| | person | time | height |
---|---|---|---|
0 | Jim | t1 | 120 |
1 | Sally | t1 | 123 |
2 | Meg | t1 | 118 |
3 | Jim | t2 | 152 |
4 | Sally | t2 | 147 |
5 | Meg | t2 | 153 |
You can find out more about how you can reshape datasets in Pandas, including aggregating values using pivot tables, in the documentation on reshaping.
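As a short sketch of why the long format is convenient, and of how to get back to the wide format when needed, the example below rebuilds the heights data from above, summarises it per time point, and then reverses the melt with pivot. The names mean_by_time and wide are made up for illustration.

```python
import pandas as pd

heights = pd.DataFrame({"person": ["Jim", "Sally", "Meg"],
                        "t1": [120, 123, 118],
                        "t2": [152, 147, 153]})

tidy = pd.melt(heights, id_vars="person", var_name="time", value_name="height")

# With tidy data, a per-group summary is a single groupby call.
mean_by_time = tidy.groupby("time")["height"].mean()
print(mean_by_time)

# pivot reverses melt: one row per person, one column per time point.
wide = tidy.pivot(index="person", columns="time", values="height")
print(wide)
```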
Now that we have only the complete cases, let's look at two variables of interest by using the Seaborn plotting package.
# simple histogram of Total sleep:
vec = sleep['TotalSleep']
g = sns.histplot(vec, kde=True, stat="density") # histogram with a kernel density estimate overlaid
sns.despine(ax=g, offset=10);
# the jointplot function can be used to set up a subplot grid:
g = sns.jointplot(data=sleep, x="BodyWt", y="TotalSleep");
We can see that the body weight variable is extremely skewed. Among the complete cases, it ranges from near zero (0.005 kg for the Lesser Short-Tailed Shrew) up to about 2.5 tonnes (2547 kg for the Asian Elephant):
sleep['Species'][sleep['BodyWt']==sleep['BodyWt'].min()]
31    Lessershort-tailedshrew
Name: Species, dtype: object
sleep['Species'][sleep['BodyWt']==sleep['BodyWt'].max()]
4    Asianelephant
Name: Species, dtype: object
sleep['BodyWt'].describe() # use the describe method on the variable BodyWt
count      42.000000
mean      100.813905
std       402.082389
min         0.005000
25%         0.316250
50%         2.250000
75%        10.412500
max      2547.000000
Name: BodyWt, dtype: float64
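An alternative way to look up the lightest and heaviest species is sketched below, using the idxmin and idxmax methods together with .loc. The four rows are copied from the sleep table shown earlier; the names toy, lightest and heaviest are made up for illustration.

```python
import pandas as pd

# A few rows copied from the sleep table shown earlier.
toy = pd.DataFrame({
    "Species": ["Africangiantpouchedrat", "Asianelephant", "Baboon", "Bigbrownbat"],
    "BodyWt": [1.000, 2547.000, 10.550, 0.023],
})

# idxmin / idxmax return the index label of the smallest / largest value;
# .loc then picks out the Species entry in that row.
lightest = toy.loc[toy["BodyWt"].idxmin(), "Species"]
heaviest = toy.loc[toy["BodyWt"].idxmax(), "Species"]
print(lightest, heaviest)
```

One difference from the comparison used above: idxmin and idxmax return a single label (the first occurrence), whereas comparing against min() or max() returns every row that ties for the extreme value.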
For the next step in our exploratory analysis, we want to access variables in our sleep dataframe.
There are a number of ways to access subsets of a Pandas dataframe; an overview can be found on the 10 minutes to Pandas website. We will concentrate here on selecting single variables (columns).
sleep['BodyWt'] # prints the column of "bodyweight" variable.
1        1.000
4     2547.000
5       10.550
6        0.023
7      160.000
8        3.300
9       52.160
10       0.425
11     465.000
14       0.075
15       3.000
16       0.785
17       0.200
21      27.660
22       0.120
24      85.000
26       0.101
27       1.040
28     521.000
31       0.005
32       0.010
33      62.000
36       0.023
37       0.048
38       1.700
39       3.500
41       0.480
42      10.000
43       1.620
44     192.000
45       2.500
47       0.280
48       4.235
49       6.800
50       0.750
51       3.600
53      55.500
56       0.900
57       2.000
58       0.104
59       4.190
60       3.500
Name: BodyWt, dtype: float64
sleep['BodyWt'][0:3] # the first three rows of the bodyweight variable
1       1.00
4    2547.00
5      10.55
Name: BodyWt, dtype: float64
# Boolean indexing. All bodyweights of animals with less than 6 hours of sleep:
sleep['BodyWt'][sleep['TotalSleep'] < 6]
4     2547.00
11     465.00
21      27.66
28     521.00
51       3.60
53      55.50
57       2.00
Name: BodyWt, dtype: float64
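The selections above can also be written with the explicit indexers .iloc (by position) and .loc (by label or Boolean mask), which makes it clearer what kind of selection is intended once the index has gaps. This is a sketch on five rows copied from the sleep table shown earlier; the variable names are made up for illustration.

```python
import pandas as pd

# Rows copied from the sleep table; the index labels (1, 4, 5, 6, 7)
# are kept to mimic the gaps left by dropna.
toy = pd.DataFrame(
    {"BodyWt": [1.000, 2547.000, 10.550, 0.023, 160.000],
     "TotalSleep": [8.3, 3.9, 9.8, 19.7, 6.2]},
    index=[1, 4, 5, 6, 7],
)

# Position-based: the first three rows, whatever their labels are.
first_three = toy["BodyWt"].iloc[0:3]

# Label-based: the row whose index label is 4.
by_label = toy.loc[4, "BodyWt"]

# Boolean mask and column selection in a single .loc call.
short_sleepers = toy.loc[toy["TotalSleep"] < 6, "BodyWt"]

# Combine conditions with & (and) or | (or); each needs its own parentheses.
heavy_and_rested = toy.loc[(toy["BodyWt"] > 5) & (toy["TotalSleep"] > 6), "BodyWt"]

print(first_three, by_label, short_sleepers, heavy_and_rested, sep="\n")
```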
# body weight is heavily skewed, so store its natural logarithm as a new column:
sleep['log_BodyWt'] = np.log(sleep['BodyWt'])
sleep['log_BodyWt'].describe()
count    42.000000
mean      0.803995
std       3.069388
min      -5.298317
25%      -1.168641
50%       0.804719
75%       2.342741
max       7.842671
Name: log_BodyWt, dtype: float64
sleep.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 42 entries, 1 to 60
Data columns (total 12 columns):
Species        42 non-null object
BodyWt         42 non-null float64
BrainWt        42 non-null float64
NonDreaming    42 non-null float64
Dreaming       42 non-null float64
TotalSleep     42 non-null float64
LifeSpan       42 non-null float64
Gestation      42 non-null float64
Predation      42 non-null int64
Exposure       42 non-null int64
Danger         42 non-null int64
log_BodyWt     42 non-null float64
dtypes: float64(8), int64(3), object(1)
memory usage: 4.3+ KB
g = sns.jointplot(data=sleep, x="log_BodyWt", y="TotalSleep")
Now we see a much nicer and more linear spread of the data. Heavier animals seem to sleep for fewer hours.
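To put a rough number on how much the log transform tames the skew, the sketch below compares the sample skewness before and after the transform for five body weights quoted in this notebook. Five values are far too few for a serious estimate, so treat this purely as an illustration; the variable names are made up.

```python
import numpy as np
import pandas as pd

# Body weights (kg) taken from values quoted in this notebook.
body_wt = pd.Series([0.005, 1.0, 10.55, 160.0, 2547.0], name="BodyWt")
log_body_wt = np.log(body_wt)  # natural logarithm

# Sample skewness: values near zero indicate a roughly symmetric spread.
raw_skew = body_wt.skew()
log_skew = log_body_wt.skew()
print(raw_skew, log_skew)

# np.exp undoes the transform, so no information is lost.
recovered = np.exp(log_body_wt)
```

The logged values here run from about -5.3 to about 7.8, matching the min and max of log_BodyWt reported above.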
The sleep in mammals dataset contains a number of other variables. For example, the variable Danger is an ordinal scale variable, ranging from 1 ("least danger from other animals") to 5 ("most danger from other animals").
Does the total time spent sleeping change according to danger?
g = sns.violinplot(data=sleep, x="Danger", y="TotalSleep") # one violin per danger level
sns.despine(ax=g, offset=10);
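A numerical companion to the plot is a per-group summary. The sketch below uses a small made-up table (the numbers are hypothetical, not taken from the sleep dataset) to show how groupby and agg give the group size and median sleep time for each danger level.

```python
import pandas as pd

# Hypothetical toy data: these numbers are made up for illustration.
toy = pd.DataFrame({
    "Danger": [1, 1, 1, 3, 3, 5, 5],
    "TotalSleep": [19.0, 14.5, 10.0, 9.0, 11.0, 4.0, 3.0],
})

# One row per danger level, with the group size and the median sleep time.
summary = toy.groupby("Danger")["TotalSleep"].agg(["count", "median"])
print(summary)
```

Running the same two lines on the sleep dataframe would give the corresponding summary for the real data.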
Out of interest, let's find which animals are in the highest and lowest danger categories:
sleep['Species'][sleep['Danger']==1]
6              Bigbrownbat
8                      Cat
9               Chimpanzee
14     EasternAmericanmole
24                Grayseal
32          Littlebrownbat
33                     Man
38        NAmericanopossum
39    Nine-bandedarmadillo
48                  Redfox
60            Wateropossum
Name: Species, dtype: object
sleep['Species'][sleep['Danger']==5]
11       Cow
21      Goat
28     Horse
45    Rabbit
53     Sheep
Name: Species, dtype: object
Seaborn allows us to easily visualise 2D plots across various groups by faceting plots in rows and columns according to categorical variables.
For example, we could ask: Does the relationship between total sleep time and bodyweight depend on danger level?
g = sns.FacetGrid(sleep, col="Danger", col_wrap=3)
g.map(plt.scatter, 'log_BodyWt', 'TotalSleep');
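The question behind the faceted plot can also be approached numerically, by computing the correlation between the two variables separately within each group. This is only a sketch on made-up numbers (the toy table below is hypothetical), and with so few points per group the correlations mean little; it is meant to show the mechanics of looping over groups.

```python
import pandas as pd

# Hypothetical toy data: these numbers are made up for illustration.
toy = pd.DataFrame({
    "Danger": [1, 1, 1, 2, 2, 2],
    "log_BodyWt": [0.0, 1.0, 2.0, 0.0, 1.0, 2.0],
    "TotalSleep": [10.0, 8.0, 6.0, 5.0, 7.0, 6.0],
})

# Iterating over a groupby yields (group label, sub-dataframe) pairs.
correlations = {}
for level, group in toy.groupby("Danger"):
    correlations[level] = group["log_BodyWt"].corr(group["TotalSleep"])

print(correlations)
```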
Seaborn and Pandas allow you to do very powerful data manipulation and exploratory plots in relatively few lines of code. In this lecture we've provided a taste of what you can do. We encourage you to use this code, to check out more examples on the Seaborn website, and to get in and play around with our data or your own.
Don't be dismayed that errors are frequent. Learning to use these packages involves a lot of trial and error, hacking around, and internet searches. The rewards for your research will be worthwhile.