Data analysis in Python

Lesson preamble

Learning objectives

  • Describe what a data frame is.
  • Load external data from a .csv file into a data frame with pandas.
  • Summarize the contents of a data frame with pandas.
  • Learn to use data frame attributes and methods such as loc[], head(), info(), describe(), shape, columns, and index.
  • Understand the split-apply-combine concept for data analysis.
    • Use groupby(), mean(), agg() and size() to apply this technique.
  • Use concat() and merge() to combine data frames.

Lesson outline

  • Manipulating and analyzing data with pandas
    • Data set background (10 min)
    • What are data frames (15 min)
    • Data wrangling with pandas (40 min)
  • Split-apply-combine techniques in pandas
    • Using mean() to summarize categorical data (20 min)
    • Using size() to summarize categorical data (15 min)
  • Combining data frames (15 min)

Manipulating and analyzing data with pandas

To access additional functionality in a spreadsheet program, you need to click the menu and select the tool you want to use. All charts are in one menu, text layout tools in another, data analysis tools in a third, and so on. Programming languages such as Python have so many tools and functions that they would not fit in a menu. Instead of clicking File -> Open and choosing a file, you would type something similar to file.open('') in a programming language. Don't worry if you forget the exact expression; it is often enough to type just the first few letters and then hit Tab to show the available options. More on that later.
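For example, opening a file in plain Python looks something like this (a minimal sketch; the file name is just a made-up placeholder):

# Open a (hypothetical) text file and print its first line
with open('my_data.txt') as file_handle:
    print(file_handle.readline())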

Dataset background

Today, we will be working with real data from a longitudinal study of species abundance in the Chihuahuan desert ecosystem near Portal, Arizona, USA. This study includes observations of plants, ants, and rodents from 1977 to 2002, and has been used in over 100 publications. More information is available in the abstract of this paper from 2009. There are several datasets available related to this study, and we will be working with datasets that have been preprocessed by Data Carpentry to facilitate teaching. These are made available online as The Portal Project Teaching Database, both at the Data Carpentry website and on Figshare. Figshare is a great place to openly publish data, code, figures, and more, to make them available to other researchers and to communicate findings that are not part of a longer paper.

Presentation of the survey data

We are studying the species and weight of animals caught in plots in our study area. The dataset is stored as a comma-separated values (CSV) file. Each row holds information for a single animal, and the columns represent:

Column            Description
record_id         unique ID for the observation
month             month of observation
day               day of observation
year              year of observation
plot_id           ID of a particular plot
species_id        2-letter species code
sex               sex of animal ("M", "F")
hindfoot_length   length of the hindfoot in mm
weight            weight of the animal in grams
genus             genus of animal
species           species of animal
taxa              e.g. rodent, reptile, bird, rabbit
plot_type         type of plot

To read the data into Python, we are going to use a function called read_csv. This function is contained in a Python package called pandas. As mentioned previously, Python packages are a bit like browser extensions: they are not essential, but can provide nifty functionality. To use a package, it first needs to be imported.

In [1]:
# pandas is given the nickname `pd`
import pandas as pd

pandas can read CSV files saved on the computer or directly from a URL.

In [2]:
surveys = pd.read_csv('https://ndownloader.figshare.com/files/2292169')

To view the result, type surveys in a cell and run it, just as when viewing the content of any variable in Python.

In [3]:
surveys
Out[3]:
record_id month day year plot_id species_id sex hindfoot_length weight genus species taxa plot_type
0 1 7 16 1977 2 NL M 32.0 NaN Neotoma albigula Rodent Control
1 72 8 19 1977 2 NL M 31.0 NaN Neotoma albigula Rodent Control
2 224 9 13 1977 2 NL NaN NaN NaN Neotoma albigula Rodent Control
3 266 10 16 1977 2 NL NaN NaN NaN Neotoma albigula Rodent Control
4 349 11 12 1977 2 NL NaN NaN NaN Neotoma albigula Rodent Control
5 363 11 12 1977 2 NL NaN NaN NaN Neotoma albigula Rodent Control
6 435 12 10 1977 2 NL NaN NaN NaN Neotoma albigula Rodent Control
7 506 1 8 1978 2 NL NaN NaN NaN Neotoma albigula Rodent Control
8 588 2 18 1978 2 NL M NaN 218.0 Neotoma albigula Rodent Control
9 661 3 11 1978 2 NL NaN NaN NaN Neotoma albigula Rodent Control
10 748 4 8 1978 2 NL NaN NaN NaN Neotoma albigula Rodent Control
11 845 5 6 1978 2 NL M 32.0 204.0 Neotoma albigula Rodent Control
12 990 6 9 1978 2 NL M NaN 200.0 Neotoma albigula Rodent Control
13 1164 8 5 1978 2 NL M 34.0 199.0 Neotoma albigula Rodent Control
14 1261 9 4 1978 2 NL M 32.0 197.0 Neotoma albigula Rodent Control
15 1374 10 8 1978 2 NL NaN NaN NaN Neotoma albigula Rodent Control
16 1453 11 5 1978 2 NL M NaN 218.0 Neotoma albigula Rodent Control
17 1756 4 29 1979 2 NL M 33.0 166.0 Neotoma albigula Rodent Control
18 1818 5 30 1979 2 NL M 32.0 184.0 Neotoma albigula Rodent Control
19 1882 7 4 1979 2 NL M 32.0 206.0 Neotoma albigula Rodent Control
20 2133 10 25 1979 2 NL F 33.0 274.0 Neotoma albigula Rodent Control
21 2184 11 17 1979 2 NL F 30.0 186.0 Neotoma albigula Rodent Control
22 2406 1 16 1980 2 NL F 33.0 184.0 Neotoma albigula Rodent Control
23 2728 3 9 1980 2 NL F NaN NaN Neotoma albigula Rodent Control
24 3000 5 18 1980 2 NL F 31.0 87.0 Neotoma albigula Rodent Control
25 3002 5 18 1980 2 NL F 33.0 174.0 Neotoma albigula Rodent Control
26 4667 7 8 1981 2 NL F 30.0 130.0 Neotoma albigula Rodent Control
27 4859 10 1 1981 2 NL M 34.0 208.0 Neotoma albigula Rodent Control
28 5048 11 23 1981 2 NL M 34.0 192.0 Neotoma albigula Rodent Control
29 5180 1 1 1982 2 NL M NaN 206.0 Neotoma albigula Rodent Control
... ... ... ... ... ... ... ... ... ... ... ... ... ...
34756 21209 11 13 1993 7 CU NaN NaN NaN Cnemidophorus uniparens Reptile Rodent Exclosure
34757 25710 5 10 1997 7 PB M 26.0 31.0 Chaetodipus baileyi Rodent Rodent Exclosure
34758 26042 6 10 1997 7 PB F 27.0 24.0 Chaetodipus baileyi Rodent Rodent Exclosure
34759 26096 6 10 1997 7 PB F 26.0 30.0 Chaetodipus baileyi Rodent Rodent Exclosure
34760 26356 7 9 1997 7 PB M 27.0 32.0 Chaetodipus baileyi Rodent Rodent Exclosure
34761 26475 7 9 1997 7 PB M 28.0 36.0 Chaetodipus baileyi Rodent Rodent Exclosure
34762 26546 7 29 1997 7 PB M 27.0 37.0 Chaetodipus baileyi Rodent Rodent Exclosure
34763 26776 9 27 1997 7 PB M 26.0 37.0 Chaetodipus baileyi Rodent Rodent Exclosure
34764 26819 9 27 1997 7 PB M 30.0 40.0 Chaetodipus baileyi Rodent Rodent Exclosure
34765 28332 8 22 1998 7 PB M 26.0 27.0 Chaetodipus baileyi Rodent Rodent Exclosure
34766 28336 8 22 1998 7 PB M 26.0 23.0 Chaetodipus baileyi Rodent Rodent Exclosure
34767 28337 8 22 1998 7 PB F 27.0 30.0 Chaetodipus baileyi Rodent Rodent Exclosure
34768 28338 8 22 1998 7 PB F 25.0 23.0 Chaetodipus baileyi Rodent Rodent Exclosure
34769 28585 9 20 1998 7 PB M 26.0 25.0 Chaetodipus baileyi Rodent Rodent Exclosure
34770 28667 10 24 1998 7 PB M 26.0 25.0 Chaetodipus baileyi Rodent Rodent Exclosure
34771 29231 2 20 1999 7 PB M 26.0 28.0 Chaetodipus baileyi Rodent Rodent Exclosure
34772 30355 2 5 2000 7 PB M 27.0 20.0 Chaetodipus baileyi Rodent Rodent Exclosure
34773 32085 5 26 2001 7 PB F 22.0 37.0 Chaetodipus baileyi Rodent Rodent Exclosure
34774 32477 8 25 2001 7 PB M 28.0 32.0 Chaetodipus baileyi Rodent Rodent Exclosure
34775 33103 11 17 2001 7 PB M 28.0 41.0 Chaetodipus baileyi Rodent Rodent Exclosure
34776 33305 12 15 2001 7 PB M 29.0 44.0 Chaetodipus baileyi Rodent Rodent Exclosure
34777 34524 7 13 2002 7 PB M 25.0 16.0 Chaetodipus baileyi Rodent Rodent Exclosure
34778 35382 12 8 2002 7 PB M 26.0 30.0 Chaetodipus baileyi Rodent Rodent Exclosure
34779 26557 7 29 1997 7 PL F 20.0 22.0 Peromyscus leucopus Rodent Rodent Exclosure
34780 26787 9 27 1997 7 PL F 21.0 16.0 Peromyscus leucopus Rodent Rodent Exclosure
34781 26966 10 25 1997 7 PL M 20.0 16.0 Peromyscus leucopus Rodent Rodent Exclosure
34782 27185 11 22 1997 7 PL F 21.0 22.0 Peromyscus leucopus Rodent Rodent Exclosure
34783 27792 5 2 1998 7 PL F 20.0 8.0 Peromyscus leucopus Rodent Rodent Exclosure
34784 28806 11 21 1998 7 PX NaN NaN NaN Chaetodipus sp. Rodent Rodent Exclosure
34785 30986 7 1 2000 7 PX NaN NaN NaN Chaetodipus sp. Rodent Rodent Exclosure

34786 rows × 13 columns

This is how a data frame is displayed in the Jupyter Notebook. Although the data frame itself just consists of the values, the Notebook knows that this is a data frame and displays it in a nice tabular format (by adding HTML decorators), with some cosmetic conveniences such as bold font for the column and row names, alternating grey and white zebra stripes for the rows, and highlighting of the row the mouse pointer hovers over. The increasing numbers on the far left are the data frame's index, which was added by pandas to easily distinguish between the rows.

What are data frames?

A data frame is the representation of data in a tabular format, similar to how data is often arranged in spreadsheets. The data is rectangular, meaning that all rows have the same number of columns and all columns have the same number of rows. Data frames are the de facto data structure for most tabular data, and what we use for statistics and plotting. A data frame can be created by hand, but most commonly they are generated by an input function such as read_csv(), that is, when importing spreadsheets from your hard drive (or the web).
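For example, a tiny data frame can be built by hand from a dictionary, where each key becomes a column name and each list becomes that column's values (a toy sketch, unrelated to the survey data):

# A small data frame created by hand
pd.DataFrame({'species': ['albigula', 'flavus'],
              'weight': [159.2, 7.9]})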

As can be seen above, the default is to display the first and last 30 rows and truncate everything in between, as indicated by the ellipsis (...). Although it is truncated, this output is still quite space-consuming. To glance at how the data frame looks, it is sufficient to display only the top (the first 5 rows) using the head() method.

In [4]:
surveys.head()
Out[4]:
record_id month day year plot_id species_id sex hindfoot_length weight genus species taxa plot_type
0 1 7 16 1977 2 NL M 32.0 NaN Neotoma albigula Rodent Control
1 72 8 19 1977 2 NL M 31.0 NaN Neotoma albigula Rodent Control
2 224 9 13 1977 2 NL NaN NaN NaN Neotoma albigula Rodent Control
3 266 10 16 1977 2 NL NaN NaN NaN Neotoma albigula Rodent Control
4 349 11 12 1977 2 NL NaN NaN NaN Neotoma albigula Rodent Control

Methods are very similar to functions; the main difference is that they belong to an object (above, the method head() belongs to the data frame surveys). Methods operate on the object they belong to, which is why we can call them with empty parentheses, without any arguments. Compare this with the function type() that was introduced previously.

In [5]:
type(surveys)
Out[5]:
pandas.core.frame.DataFrame

Here, the surveys variable is explicitly passed as an argument to type(). An immediately tangible advantage of methods is that they simplify tab completion: just type the name of the data frame, a period, and then hit Tab to see all the relevant methods for that data frame, instead of fumbling around with all the available functions in Python (there are quite a few!) and figuring out which ones operate on data frames and which do not. Methods also facilitate readability when chaining many operations together, which will be shown in detail later.

The columns in a data frame can contain data of different types, e.g. integers, floats, and objects (which include strings, lists, dictionaries, and more). General information about the data frame (including the column data types) can be obtained with the info() method.

In [6]:
surveys.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34786 entries, 0 to 34785
Data columns (total 13 columns):
record_id          34786 non-null int64
month              34786 non-null int64
day                34786 non-null int64
year               34786 non-null int64
plot_id            34786 non-null int64
species_id         34786 non-null object
sex                33038 non-null object
hindfoot_length    31438 non-null float64
weight             32283 non-null float64
genus              34786 non-null object
species            34786 non-null object
taxa               34786 non-null object
plot_type          34786 non-null object
dtypes: float64(2), int64(5), object(6)
memory usage: 3.5+ MB

The information includes the total number of rows and columns, the number of non-null observations, the column data types, and the memory (RAM) usage. The number of non-null observations is not the same for all columns, which means that some columns contain null values representing missing information. In pandas these missing values are marked NaN, for "not a number".
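To see exactly how many values are missing in each column, the missing values can be counted directly; a brief sketch:

# `isnull()` marks missing values as True and `sum()` counts them per column
surveys.isnull().sum()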

The column data type is often indicative of which type of data is stored in that column, and approximately corresponds to the following:

  • Qualitative/Categorical
    • Nominal (labels, e.g. 'red', 'green', 'blue')
      • object, category
    • Ordinal (labels with order, e.g. 'Jan', 'Feb', 'Mar')
      • object, category, int
    • Binary (only two outcomes, e.g. True or False)
      • bool
  • Quantitative/Numerical
    • Discrete (whole numbers, often counting, e.g. number of children)
      • int
    • Continuous (measured values with decimals, e.g. weight)
      • float

Note that an object could contain different types, e.g. str or list. Also note that there can be exceptions to the schema above, but it is still a useful rough guide.
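As a brief sketch, the data type of every column can be listed with the dtypes attribute, and a label column can optionally be converted to the dedicated category type with astype():

# Data type of every column
surveys.dtypes

# Convert a label column from the generic `object` type to `category`
surveys['species_id'].astype('category').head()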

After reading the data into a data frame, head() and info() are two of the most useful methods for getting an idea of its structure. There are many additional methods and attributes that can facilitate the understanding of what a data frame contains:

  • Size:

    • surveys.shape - a tuple with the number of rows as the first element and the number of columns as the second element
    • surveys.shape[0] - the number of rows
    • surveys.shape[1] - the number of columns
  • Content:

    • surveys.head() - shows the first 5 rows
    • surveys.tail() - shows the last 5 rows
  • Names:

    • surveys.columns - returns the names of the columns (also called the variable names)
    • surveys.index - returns the names of the rows (referred to as the index in pandas)
  • Summary:

    • surveys.info() - column names and data types, number of non-null observations, and memory consumption
    • surveys.describe() - summary statistics for each column

These belong to a data frame and are commonly referred to as attributes of the data frame. All attributes are accessed with the dot-syntax (.), which returns the value of the attribute.

In [7]:
surveys.shape
Out[7]:
(34786, 13)
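The other attributes and methods listed above are accessed the same way; a brief sketch (outputs omitted here):

surveys.columns   # the column names
surveys.index     # the row labels (a RangeIndex from 0 to 34785)
surveys.tail()    # the last 5 rows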

If the attribute is a method, parentheses need to be appended to the name to carry out the method's operation on the data frame. Compare the output below, where the parentheses are left out and the method object itself is returned, with the result of calling head() properly.

In [8]:
surveys.head
Out[8]:
<bound method NDFrame.head of        record_id  month  day  year  plot_id species_id  sex  hindfoot_length  \
0              1      7   16  1977        2         NL    M             32.0   
1             72      8   19  1977        2         NL    M             31.0   
2            224      9   13  1977        2         NL  NaN              NaN   
3            266     10   16  1977        2         NL  NaN              NaN   
4            349     11   12  1977        2         NL  NaN              NaN   
5            363     11   12  1977        2         NL  NaN              NaN   
6            435     12   10  1977        2         NL  NaN              NaN   
7            506      1    8  1978        2         NL  NaN              NaN   
8            588      2   18  1978        2         NL    M              NaN   
9            661      3   11  1978        2         NL  NaN              NaN   
10           748      4    8  1978        2         NL  NaN              NaN   
11           845      5    6  1978        2         NL    M             32.0   
12           990      6    9  1978        2         NL    M              NaN   
13          1164      8    5  1978        2         NL    M             34.0   
14          1261      9    4  1978        2         NL    M             32.0   
15          1374     10    8  1978        2         NL  NaN              NaN   
16          1453     11    5  1978        2         NL    M              NaN   
17          1756      4   29  1979        2         NL    M             33.0   
18          1818      5   30  1979        2         NL    M             32.0   
19          1882      7    4  1979        2         NL    M             32.0   
20          2133     10   25  1979        2         NL    F             33.0   
21          2184     11   17  1979        2         NL    F             30.0   
22          2406      1   16  1980        2         NL    F             33.0   
23          2728      3    9  1980        2         NL    F              NaN   
24          3000      5   18  1980        2         NL    F             31.0   
25          3002      5   18  1980        2         NL    F             33.0   
26          4667      7    8  1981        2         NL    F             30.0   
27          4859     10    1  1981        2         NL    M             34.0   
28          5048     11   23  1981        2         NL    M             34.0   
29          5180      1    1  1982        2         NL    M              NaN   
...          ...    ...  ...   ...      ...        ...  ...              ...   
34756      21209     11   13  1993        7         CU  NaN              NaN   
34757      25710      5   10  1997        7         PB    M             26.0   
34758      26042      6   10  1997        7         PB    F             27.0   
34759      26096      6   10  1997        7         PB    F             26.0   
34760      26356      7    9  1997        7         PB    M             27.0   
34761      26475      7    9  1997        7         PB    M             28.0   
34762      26546      7   29  1997        7         PB    M             27.0   
34763      26776      9   27  1997        7         PB    M             26.0   
34764      26819      9   27  1997        7         PB    M             30.0   
34765      28332      8   22  1998        7         PB    M             26.0   
34766      28336      8   22  1998        7         PB    M             26.0   
34767      28337      8   22  1998        7         PB    F             27.0   
34768      28338      8   22  1998        7         PB    F             25.0   
34769      28585      9   20  1998        7         PB    M             26.0   
34770      28667     10   24  1998        7         PB    M             26.0   
34771      29231      2   20  1999        7         PB    M             26.0   
34772      30355      2    5  2000        7         PB    M             27.0   
34773      32085      5   26  2001        7         PB    F             22.0   
34774      32477      8   25  2001        7         PB    M             28.0   
34775      33103     11   17  2001        7         PB    M             28.0   
34776      33305     12   15  2001        7         PB    M             29.0   
34777      34524      7   13  2002        7         PB    M             25.0   
34778      35382     12    8  2002        7         PB    M             26.0   
34779      26557      7   29  1997        7         PL    F             20.0   
34780      26787      9   27  1997        7         PL    F             21.0   
34781      26966     10   25  1997        7         PL    M             20.0   
34782      27185     11   22  1997        7         PL    F             21.0   
34783      27792      5    2  1998        7         PL    F             20.0   
34784      28806     11   21  1998        7         PX  NaN              NaN   
34785      30986      7    1  2000        7         PX  NaN              NaN   

       weight          genus    species     taxa         plot_type  
0         NaN        Neotoma   albigula   Rodent           Control  
1         NaN        Neotoma   albigula   Rodent           Control  
2         NaN        Neotoma   albigula   Rodent           Control  
3         NaN        Neotoma   albigula   Rodent           Control  
4         NaN        Neotoma   albigula   Rodent           Control  
5         NaN        Neotoma   albigula   Rodent           Control  
6         NaN        Neotoma   albigula   Rodent           Control  
7         NaN        Neotoma   albigula   Rodent           Control  
8       218.0        Neotoma   albigula   Rodent           Control  
9         NaN        Neotoma   albigula   Rodent           Control  
10        NaN        Neotoma   albigula   Rodent           Control  
11      204.0        Neotoma   albigula   Rodent           Control  
12      200.0        Neotoma   albigula   Rodent           Control  
13      199.0        Neotoma   albigula   Rodent           Control  
14      197.0        Neotoma   albigula   Rodent           Control  
15        NaN        Neotoma   albigula   Rodent           Control  
16      218.0        Neotoma   albigula   Rodent           Control  
17      166.0        Neotoma   albigula   Rodent           Control  
18      184.0        Neotoma   albigula   Rodent           Control  
19      206.0        Neotoma   albigula   Rodent           Control  
20      274.0        Neotoma   albigula   Rodent           Control  
21      186.0        Neotoma   albigula   Rodent           Control  
22      184.0        Neotoma   albigula   Rodent           Control  
23        NaN        Neotoma   albigula   Rodent           Control  
24       87.0        Neotoma   albigula   Rodent           Control  
25      174.0        Neotoma   albigula   Rodent           Control  
26      130.0        Neotoma   albigula   Rodent           Control  
27      208.0        Neotoma   albigula   Rodent           Control  
28      192.0        Neotoma   albigula   Rodent           Control  
29      206.0        Neotoma   albigula   Rodent           Control  
...       ...            ...        ...      ...               ...  
34756     NaN  Cnemidophorus  uniparens  Reptile  Rodent Exclosure  
34757    31.0    Chaetodipus    baileyi   Rodent  Rodent Exclosure  
34758    24.0    Chaetodipus    baileyi   Rodent  Rodent Exclosure  
34759    30.0    Chaetodipus    baileyi   Rodent  Rodent Exclosure  
34760    32.0    Chaetodipus    baileyi   Rodent  Rodent Exclosure  
34761    36.0    Chaetodipus    baileyi   Rodent  Rodent Exclosure  
34762    37.0    Chaetodipus    baileyi   Rodent  Rodent Exclosure  
34763    37.0    Chaetodipus    baileyi   Rodent  Rodent Exclosure  
34764    40.0    Chaetodipus    baileyi   Rodent  Rodent Exclosure  
34765    27.0    Chaetodipus    baileyi   Rodent  Rodent Exclosure  
34766    23.0    Chaetodipus    baileyi   Rodent  Rodent Exclosure  
34767    30.0    Chaetodipus    baileyi   Rodent  Rodent Exclosure  
34768    23.0    Chaetodipus    baileyi   Rodent  Rodent Exclosure  
34769    25.0    Chaetodipus    baileyi   Rodent  Rodent Exclosure  
34770    25.0    Chaetodipus    baileyi   Rodent  Rodent Exclosure  
34771    28.0    Chaetodipus    baileyi   Rodent  Rodent Exclosure  
34772    20.0    Chaetodipus    baileyi   Rodent  Rodent Exclosure  
34773    37.0    Chaetodipus    baileyi   Rodent  Rodent Exclosure  
34774    32.0    Chaetodipus    baileyi   Rodent  Rodent Exclosure  
34775    41.0    Chaetodipus    baileyi   Rodent  Rodent Exclosure  
34776    44.0    Chaetodipus    baileyi   Rodent  Rodent Exclosure  
34777    16.0    Chaetodipus    baileyi   Rodent  Rodent Exclosure  
34778    30.0    Chaetodipus    baileyi   Rodent  Rodent Exclosure  
34779    22.0     Peromyscus   leucopus   Rodent  Rodent Exclosure  
34780    16.0     Peromyscus   leucopus   Rodent  Rodent Exclosure  
34781    16.0     Peromyscus   leucopus   Rodent  Rodent Exclosure  
34782    22.0     Peromyscus   leucopus   Rodent  Rodent Exclosure  
34783     8.0     Peromyscus   leucopus   Rodent  Rodent Exclosure  
34784     NaN    Chaetodipus        sp.   Rodent  Rodent Exclosure  
34785     NaN    Chaetodipus        sp.   Rodent  Rodent Exclosure  

[34786 rows x 13 columns]>
In [9]:
surveys.head()
Out[9]:
record_id month day year plot_id species_id sex hindfoot_length weight genus species taxa plot_type
0 1 7 16 1977 2 NL M 32.0 NaN Neotoma albigula Rodent Control
1 72 8 19 1977 2 NL M 31.0 NaN Neotoma albigula Rodent Control
2 224 9 13 1977 2 NL NaN NaN NaN Neotoma albigula Rodent Control
3 266 10 16 1977 2 NL NaN NaN NaN Neotoma albigula Rodent Control
4 349 11 12 1977 2 NL NaN NaN NaN Neotoma albigula Rodent Control

Attributes that are not methods often hold a value that has been precomputed, because the value is commonly accessed and it saves time to store it in an attribute instead of recomputing it every time it is needed. For example, every time pandas creates a data frame, the number of rows and columns is computed and stored in the shape attribute.

Challenge

Based on the output of surveys.info(), can you answer the following questions?

  • What is the class of the object surveys?
  • How many rows and how many columns are in this object?
  • Why is there not the same number of rows (observations) for each column?

Saving data frames locally

It is good practice to keep a copy of the data stored locally on your computer in case you want to do offline analyses, the online version of the file changes, or the file is taken down. For this, the data could be downloaded manually, or the current surveys data frame could be saved to disk as a CSV file with to_csv().

In [10]:
surveys.to_csv('surveys.csv', index=False)
# `index=False` is used because the index (the row names) was generated automatically when pandas
# read the file, so there is no need to save it

Since the data is now saved locally, the next time this Notebook is opened, it could be loaded from the local path instead of downloading it from the URL.

In [11]:
surveys = pd.read_csv('surveys.csv')
surveys.head()
Out[11]:
record_id month day year plot_id species_id sex hindfoot_length weight genus species taxa plot_type
0 1 7 16 1977 2 NL M 32.0 NaN Neotoma albigula Rodent Control
1 72 8 19 1977 2 NL M 31.0 NaN Neotoma albigula Rodent Control
2 224 9 13 1977 2 NL NaN NaN NaN Neotoma albigula Rodent Control
3 266 10 16 1977 2 NL NaN NaN NaN Neotoma albigula Rodent Control
4 349 11 12 1977 2 NL NaN NaN NaN Neotoma albigula Rodent Control

Indexing and subsetting data frames

The survey data frame has rows and columns (it has 2 dimensions). To extract specific data from it (also referred to as "subsetting"), columns can be selected by their name. The Jupyter Notebook (technically, the underlying IPython interpreter) knows about the columns in the data frame, so tab autocompletion can be used to get the correct column name.

In [12]:
surveys['species_id'].head()
Out[12]:
0    NL
1    NL
2    NL
3    NL
4    NL
Name: species_id, dtype: object

The column header is not shown at the top since only a single column was selected; its name appears at the bottom of the output instead. Remember that the numbers on the left are just the index of the data frame, which was added by pandas upon importing the data.

Another syntax that is often used to specify column names is .<column_name>.

In [13]:
surveys.species_id.head()
Out[13]:
0    NL
1    NL
2    NL
3    NL
4    NL
Name: species_id, dtype: object

Using brackets is clearer and also allows for passing multiple columns as a list, so this tutorial will stick to that syntax.

In [14]:
surveys[['species_id', 'record_id']].head()
Out[14]:
species_id record_id
0 NL 1
1 NL 72
2 NL 224
3 NL 266
4 NL 349

The output is displayed a bit differently this time. The reason is that in the previous cell, where the returned object only had one column (species_id), pandas technically returned a Series, not a DataFrame. This can be confirmed by using type() as before.

In [15]:
type(surveys['species_id'].head())
Out[15]:
pandas.core.series.Series
In [16]:
type(surveys[['species_id', 'record_id']].head())
Out[16]:
pandas.core.frame.DataFrame

So, every individual column is actually a Series, and together they constitute a DataFrame. This introductory tutorial will not make any further distinction between a Series and a DataFrame, and many of the analysis techniques used here apply to both.
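If a one-column data frame is specifically needed, a Series can be converted back with to_frame(); a minimal sketch:

# Turn the species_id Series back into a single-column DataFrame
surveys['species_id'].to_frame().head()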

Selecting with single brackets ([]) is a shortcut for common operations, such as selecting columns by label as above. For more flexible and robust row and column selection, the more verbose loc[<rows>, <columns>] (location) syntax is used.

In [17]:
surveys.loc[[0, 2, 4], ['species', 'record_id']]
# Although methods usually have trailing parentheses, square brackets are used with `loc[]` to stay
# consistent with indexing with square brackets in general in Python (e.g. lists and NumPy arrays)
Out[17]:
species record_id
0 albigula 1
2 albigula 224
4 albigula 349

A single element can also be selected, which returns that value (here, an integer) rather than a data frame or series with just one value.

In [18]:
surveys.loc[4, 'record_id']
Out[18]:
349

If the column argument is left out, all columns are selected.

In [19]:
surveys.loc[[3, 4]]
Out[19]:
record_id month day year plot_id species_id sex hindfoot_length weight genus species taxa plot_type
3 266 10 16 1977 2 NL NaN NaN NaN Neotoma albigula Rodent Control
4 349 11 12 1977 2 NL NaN NaN NaN Neotoma albigula Rodent Control

To select all rows, but only a subset of columns, the colon character (:) can be used.

In [20]:
surveys.loc[:, ['month', 'day']].shape # show the size of the data frame
Out[20]:
(34786, 2)

It is also possible to select slices of rows and column labels.

In [21]:
surveys.loc[2:4, 'record_id':'day']
Out[21]:
record_id month day
2 224 9 13
3 266 10 16
4 349 11 12

It is important to realize that loc[] selects rows and columns by their labels, and that slicing is therefore inclusive of both the start and the end. To instead select by row or column position, use iloc[] (integer location).

In [22]:
surveys.iloc[[2, 3, 4], [0, 1, 2]]
Out[22]:
record_id month day
2 224 9 13
3 266 10 16
4 349 11 12

The index of surveys consists of consecutive integers so in this case selecting from the index by labels or position will look the same. As will be shown later, an index could also consist of text names just like the columns.

While selecting a slice by label is inclusive of both the start and the end, selecting a slice by position is inclusive of the start but exclusive of the end position, consistent with how position slicing works for other Python objects, such as lists and tuples.

In [23]:
surveys.iloc[2:5, :3]
Out[23]:
record_id month day
2 224 9 13
3 266 10 16
4 349 11 12

Selecting slices of row positions is a common operation, and has thus been given a shortcut syntax with single brackets.

In [24]:
surveys[2:5]
Out[24]:
record_id month day year plot_id species_id sex hindfoot_length weight genus species taxa plot_type
2 224 9 13 1977 2 NL NaN NaN NaN Neotoma albigula Rodent Control
3 266 10 16 1977 2 NL NaN NaN NaN Neotoma albigula Rodent Control
4 349 11 12 1977 2 NL NaN NaN NaN Neotoma albigula Rodent Control

Challenge

  1. Extract the 200th and 201st row of the surveys dataset and assign the resulting data frame to a new variable name (surveys_200_201). Remember that Python indexing starts at 0!

  2. How can you get the same result as from surveys.head() by using row slices instead of the head() method?

  3. There are at least three distinct ways to extract the last row of the data frame. How many can you come up with?

The describe() method was mentioned above as a way of retrieving summary statistics of a data frame. Together with info() and head(), this is often a good place to start exploratory data analysis, as it gives a nice overview of the numeric variables in the data set.

In [25]:
surveys.describe()
Out[25]:
record_id month day year plot_id hindfoot_length weight
count 34786.000000 34786.000000 34786.000000 34786.000000 34786.000000 31438.000000 32283.000000
mean 17804.204421 6.473725 16.095987 1990.495832 11.343098 29.287932 42.672428
std 10229.682311 3.398384 8.249405 7.468714 6.794049 9.564759 36.631259
min 1.000000 1.000000 1.000000 1977.000000 1.000000 2.000000 4.000000
25% 8964.250000 4.000000 9.000000 1984.000000 5.000000 21.000000 20.000000
50% 17761.500000 6.000000 16.000000 1990.000000 11.000000 32.000000 37.000000
75% 26654.750000 10.000000 23.000000 1997.000000 17.000000 36.000000 48.000000
max 35548.000000 12.000000 31.000000 2002.000000 24.000000 70.000000 280.000000

A common next step would be to plot the data to explore relationships between different variables, but before getting into plotting, it is beneficial to elaborate on the data frame object and several of its common operations.

An often desired outcome is to select a subset of rows matching a criterion, e.g. which observations have a weight under 5 grams. To do this, the "less than" comparison operator that was introduced previously can be used.

In [26]:
surveys['weight'] < 5
Out[26]:
0        False
1        False
2        False
3        False
4        False
5        False
6        False
7        False
8        False
9        False
10       False
11       False
12       False
13       False
14       False
15       False
16       False
17       False
18       False
19       False
20       False
21       False
22       False
23       False
24       False
25       False
26       False
27       False
28       False
29       False
         ...  
34756    False
34757    False
34758    False
34759    False
34760    False
34761    False
34762    False
34763    False
34764    False
34765    False
34766    False
34767    False
34768    False
34769    False
34770    False
34771    False
34772    False
34773    False
34774    False
34775    False
34776    False
34777    False
34778    False
34779    False
34780    False
34781    False
34782    False
34783    False
34784    False
34785    False
Name: weight, Length: 34786, dtype: bool

The result is a boolean array with one value for every row in the data frame indicating whether it is True or False that this row has a value below 5 in the weight column. This boolean array can be used to select only those rows from the data frame that meet the specified condition.

In [27]:
surveys[surveys['weight'] < 5]
Out[27]:
record_id month day year plot_id species_id sex hindfoot_length weight genus species taxa plot_type
2428 4052 4 5 1981 3 PF F 15.0 4.0 Perognathus flavus Rodent Long-term Krat Exclosure
2453 7084 11 22 1982 3 PF F 16.0 4.0 Perognathus flavus Rodent Long-term Krat Exclosure
4253 28126 6 28 1998 15 PF M NaN 4.0 Perognathus flavus Rodent Long-term Krat Exclosure
4665 9909 1 20 1985 15 RM F 15.0 4.0 Reithrodontomys megalotis Rodent Long-term Krat Exclosure
6860 9853 1 19 1985 17 RM M 16.0 4.0 Reithrodontomys megalotis Rodent Control
21224 4290 4 6 1981 4 PF NaN NaN 4.0 Perognathus flavus Rodent Control
21674 29906 10 10 1999 4 PP M 21.0 4.0 Chaetodipus penicillatus Rodent Control
24191 8736 12 8 1983 19 RM M 17.0 4.0 Reithrodontomys megalotis Rodent Long-term Krat Exclosure
24200 9799 1 19 1985 19 RM M 16.0 4.0 Reithrodontomys megalotis Rodent Long-term Krat Exclosure
25529 9794 1 19 1985 24 RM M 16.0 4.0 Reithrodontomys megalotis Rodent Rodent Exclosure
26457 218 9 13 1977 1 PF M 13.0 4.0 Perognathus flavus Rodent Spectab exclosure
31678 5346 2 22 1982 21 PF F 14.0 4.0 Perognathus flavus Rodent Long-term Krat Exclosure
32114 9937 2 16 1985 21 RM M 16.0 4.0 Reithrodontomys megalotis Rodent Long-term Krat Exclosure
32808 10119 3 17 1985 10 RM M 16.0 4.0 Reithrodontomys megalotis Rodent Rodent Exclosure
33347 9790 1 19 1985 16 RM F 16.0 4.0 Reithrodontomys megalotis Rodent Rodent Exclosure
33760 9823 1 19 1985 23 RM M 16.0 4.0 Reithrodontomys megalotis Rodent Rodent Exclosure
34396 10439 5 24 1985 7 RM M 16.0 4.0 Reithrodontomys megalotis Rodent Rodent Exclosure

As before, this can be combined with selection of a particular set of columns.

In [28]:
surveys.loc[surveys['weight'] < 5, ['weight', 'species']]
Out[28]:
weight species
2428 4.0 flavus
2453 4.0 flavus
4253 4.0 flavus
4665 4.0 megalotis
6860 4.0 megalotis
21224 4.0 flavus
21674 4.0 penicillatus
24191 4.0 megalotis
24200 4.0 megalotis
25529 4.0 megalotis
26457 4.0 flavus
31678 4.0 flavus
32114 4.0 megalotis
32808 4.0 megalotis
33347 4.0 megalotis
33760 4.0 megalotis
34396 4.0 megalotis

A single expression can also be used to filter for several criteria, either matching all criteria (&) or any criteria (|). These special operators are used instead of and and or to make sure that the comparison occurs for each row in the data frame. Parentheses are added to indicate the priority of the comparisons.

In [29]:
# AND = &
surveys.loc[(surveys['taxa'] == 'Rodent') & (surveys['sex'] == 'F'), ['taxa', 'sex']].head()
Out[29]:
taxa sex
20 Rodent F
21 Rodent F
22 Rodent F
23 Rodent F
24 Rodent F

To increase readability, these statements can be put on multiple lines. Anything that is within parentheses or brackets in Python can be continued on the next line. When inside a bracket or parenthesis, the indentation is not significant to the Python interpreter, but it is still recommended in order to make the code more readable.

In [30]:
surveys.loc[(surveys['taxa'] == 'Rodent') &
            (surveys['sex'] == 'F'),
            ['taxa', 'sex']].head()
Out[30]:
taxa sex
20 Rodent F
21 Rodent F
22 Rodent F
23 Rodent F
24 Rodent F

With the | operator, rows matching either of the supplied criteria are returned.

In [31]:
# OR = |
surveys.loc[(surveys['species'] == 'clarki') |
            (surveys['species'] == 'leucophrys'),
            'species']
Out[31]:
10603    leucophrys
24480        clarki
34045    leucophrys
Name: species, dtype: object

Challenge

Subset the survey data to include individuals collected before 1995 and retain only the columns year, sex, and weight.

Creating new columns

A frequent operation when working with data is to create new columns based on the values in existing columns, for example to do unit conversions or to find the ratio of values in two columns. To create a new column of the weight in kg instead of in grams:

In [32]:
surveys['weight_kg'] = surveys['weight'] / 1000
surveys.head(10)
Out[32]:
record_id month day year plot_id species_id sex hindfoot_length weight genus species taxa plot_type weight_kg
0 1 7 16 1977 2 NL M 32.0 NaN Neotoma albigula Rodent Control NaN
1 72 8 19 1977 2 NL M 31.0 NaN Neotoma albigula Rodent Control NaN
2 224 9 13 1977 2 NL NaN NaN NaN Neotoma albigula Rodent Control NaN
3 266 10 16 1977 2 NL NaN NaN NaN Neotoma albigula Rodent Control NaN
4 349 11 12 1977 2 NL NaN NaN NaN Neotoma albigula Rodent Control NaN
5 363 11 12 1977 2 NL NaN NaN NaN Neotoma albigula Rodent Control NaN
6 435 12 10 1977 2 NL NaN NaN NaN Neotoma albigula Rodent Control NaN
7 506 1 8 1978 2 NL NaN NaN NaN Neotoma albigula Rodent Control NaN
8 588 2 18 1978 2 NL M NaN 218.0 Neotoma albigula Rodent Control 0.218
9 661 3 11 1978 2 NL NaN NaN NaN Neotoma albigula Rodent Control NaN

The first few rows of the output are full of NaNs. To remove those rows, use the dropna() method of the data frame.

In [33]:
surveys.dropna().head(10)
Out[33]:
record_id month day year plot_id species_id sex hindfoot_length weight genus species taxa plot_type weight_kg
11 845 5 6 1978 2 NL M 32.0 204.0 Neotoma albigula Rodent Control 0.204
13 1164 8 5 1978 2 NL M 34.0 199.0 Neotoma albigula Rodent Control 0.199
14 1261 9 4 1978 2 NL M 32.0 197.0 Neotoma albigula Rodent Control 0.197
17 1756 4 29 1979 2 NL M 33.0 166.0 Neotoma albigula Rodent Control 0.166
18 1818 5 30 1979 2 NL M 32.0 184.0 Neotoma albigula Rodent Control 0.184
19 1882 7 4 1979 2 NL M 32.0 206.0 Neotoma albigula Rodent Control 0.206
20 2133 10 25 1979 2 NL F 33.0 274.0 Neotoma albigula Rodent Control 0.274
21 2184 11 17 1979 2 NL F 30.0 186.0 Neotoma albigula Rodent Control 0.186
22 2406 1 16 1980 2 NL F 33.0 184.0 Neotoma albigula Rodent Control 0.184
24 3000 5 18 1980 2 NL F 31.0 87.0 Neotoma albigula Rodent Control 0.087

By default, .dropna() removes all rows that have an NA value in any of the columns. There are parameters that control how the rows are dropped and which columns should be searched for NAs.
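For example, the subset parameter restricts which columns are checked for NAs, and how='all' drops a row only if every value in it is missing; a brief sketch of both:

# Drop rows only when the `weight` column is missing
surveys.dropna(subset=['weight']).head()

# Drop rows only when every value in the row is missing
surveys.dropna(how='all').head()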

A common alternative to removing rows containing NA values is to fill them in with, e.g., the mean of all observations or the previous non-NA value. This can be done with the fillna() method.

In [34]:
surveys['hindfoot_length'].head()
Out[34]:
0    32.0
1    31.0
2     NaN
3     NaN
4     NaN
Name: hindfoot_length, dtype: float64
In [35]:
# Fill with mean value
fill_value = surveys['hindfoot_length'].mean()
surveys['hindfoot_length'].fillna(fill_value).head()
Out[35]:
0    32.000000
1    31.000000
2    29.287932
3    29.287932
4    29.287932
Name: hindfoot_length, dtype: float64
In [36]:
# Fill with previous non-null value
surveys['hindfoot_length'].fillna(method='ffill').head()
Out[36]:
0    32.0
1    31.0
2    31.0
3    31.0
4    31.0
Name: hindfoot_length, dtype: float64

Whether to use dropna() or fillna() depends on the data set and the purpose of the analysis. It is also possible to interpolate missing values from neighboring values via the interpolate() method, which can be useful when working with time series.
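A minimal sketch of interpolation on the same column (linear interpolation between the surrounding non-null values is the default):

# Fill missing hindfoot lengths by interpolating between neighboring values
surveys['hindfoot_length'].interpolate().head()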

Challenge

  1. Create a new data frame from the surveys data that contains only the species_id and hindfoot_length columns and no NA values.
  2. Add a column to this new data frame called hindfoot_half, which contains values that are half the hindfoot_length values. Keep all observations that have a value less than 30 in the hindfoot_half.
  3. The final data frame should have 31,436 rows and 3 columns. How can you check that your data frame meets these criteria?

Split-apply-combine techniques in pandas

Many data analysis tasks can be approached using the split-apply-combine paradigm: split the data into groups, apply some analysis to each group, and then combine the results.

pandas facilitates this workflow through the use of groupby() to split the data and summary/aggregation functions such as mean(), which collapses each group into a single-row summary of that group. The arguments to groupby() are the column names that contain the categorical variables by which the summary statistics should be calculated. To start, compute the mean weight for each species.

[Figure: the split-apply-combine workflow. Image credit: Jake VanderPlas]

Using mean() to summarize categorical data

The .mean() method can be used to calculate the average of each group. When the mean is computed, the default behavior is to ignore NA values; they only need to be dropped if they are to be excluded from the displayed output.

In [37]:
surveys.groupby('species')['weight'].mean()
Out[37]:
species
albigula           159.245660
audubonii                 NaN
baileyi             31.735943
bilineata                 NaN
brunneicapillus           NaN
chlorurus                 NaN
clarki                    NaN
eremicus            21.586508
flavus               7.923127
fulvescens          13.386667
fulviventer         58.878049
fuscus                    NaN
gramineus                 NaN
harrisi                   NaN
hispidus            65.563953
intermedius         19.250000
leucogaster         31.575258
leucophrys                NaN
leucopus            19.138889
maniculatus         21.364155
megalotis           10.585010
melanocorys               NaN
merriami            43.157864
montanus            10.250000
ochrognathus        55.414634
ordii               48.870523
penicillatus        17.173942
savannarum                NaN
scutalatus                NaN
sp.                 19.500000
spectabilis        120.130546
spilosoma           93.500000
squamata                  NaN
taylori              8.600000
tereticaudus              NaN
tigris                    NaN
torridus            24.230556
undulatus                 NaN
uniparens                 NaN
viridis                   NaN
Name: weight, dtype: float64

The output here is a series that is indexed by the grouped variable (the species), and the single column contains the result of the aggregation (the mean weight). Since there are so many species, a subset of them will be selected to fit the output on the screen and facilitate instruction. This could be done by manually typing one OR (|) condition per species.

In [38]:
surveys.loc[(surveys['species'] == 'albigula') |
            (surveys['species'] == 'ordii') |
            (surveys['species'] == 'flavus') |
            (surveys['species'] == 'torridus')].shape
Out[38]:
(8125, 14)

Comparing this number with the number of rows in the original data frame shows that it was filtered successfully.

In [39]:
surveys.shape
Out[39]:
(34786, 14)

However, it is rather tedious to type out the species names by hand and to do one comparison per species. Instead, the names of the species of interest can be put in a list and selected with the isin() method.

In [40]:
species_to_keep = ['albigula', 'ordii', 'flavus', 'torridus']
surveys_sub = surveys.loc[surveys['species'].isin(species_to_keep)]
surveys_sub.shape
Out[40]:
(8125, 14)
In [41]:
avg_wt_spec = surveys_sub.groupby('species')['weight'].mean()
avg_wt_spec
Out[41]:
species
albigula    159.245660
flavus        7.923127
ordii        48.870523
torridus     24.230556
Name: weight, dtype: float64

Individual species can be selected from the resulting series using loc[], just as previously.

In [42]:
avg_wt_spec.loc[['ordii', 'albigula']]
Out[42]:
species
ordii        48.870523
albigula    159.245660
Name: weight, dtype: float64

Single [] without .loc[] could also be used to extract rows by label from a series, but since single [] extracts columns from a data frame, it can be easier to be explicit and use .loc[], instead of keeping track of whether the returned object is a series or a data frame.
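For reference, the single-bracket shortcut on the series from above looks like this (equivalent to the loc[] call, but easy to confuse with column selection on a data frame):

# Same selection as avg_wt_spec.loc[['ordii', 'albigula']], using single brackets
avg_wt_spec[['ordii', 'albigula']]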

Groups can also be created from multiple columns, e.g. it could be interesting to see the difference in weight between males and females within each species.

In [43]:
avg_wt_spec_sex = surveys_sub.groupby(['species', 'sex'])['weight'].mean()
avg_wt_spec_sex
Out[43]:
species   sex
albigula  F      154.282209
          M      165.652893
flavus    F        7.974394
          M        7.885287
ordii     F       48.531250
          M       49.135102
torridus  F       24.830904
          M       23.695382
Name: weight, dtype: float64

The returned series has an index that is a combination of the columns species and sex, which is referred to as a MultiIndex. The same syntax as previously can be used to select rows at the species level.

In [44]:
avg_wt_spec_sex.loc[['ordii', 'albigula']]
Out[44]:
species   sex
albigula  F      154.282209
          M      165.652893
ordii     F       48.531250
          M       49.135102
Name: weight, dtype: float64

To select specific values from both levels of the MultiIndex, a list of tuples can be passed to loc[].

In [45]:
avg_wt_spec_sex.loc[[('ordii', 'F'), ('albigula', 'M')]]
Out[45]:
species   sex
ordii     F       48.531250
albigula  M      165.652893
Name: weight, dtype: float64

To select only the female observations from all species, the xs() (cross section) method can be used.

In [46]:
avg_wt_spec_sex.xs('F', level='sex')
Out[46]:
species
albigula    154.282209
flavus        7.974394
ordii        48.531250
torridus     24.830904
Name: weight, dtype: float64

The names and values of the index levels can be seen by inspecting the index object.

In [47]:
avg_wt_spec_sex.index
Out[47]:
MultiIndex(levels=[['albigula', 'flavus', 'ordii', 'torridus'], ['F', 'M']],
           labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
           names=['species', 'sex'])

Although MultiIndexes offer succinct and fast ways to access data, they also require memorization of additional syntax and are, strictly speaking, not essential unless speed is of particular concern. It can therefore be easier to reset the index, so that all values are stored in columns.

In [48]:
avg_wt_spec_sex_res = avg_wt_spec_sex.reset_index()
avg_wt_spec_sex_res
Out[48]:
species sex weight
0 albigula F 154.282209
1 albigula M 165.652893
2 flavus F 7.974394
3 flavus M 7.885287
4 ordii F 48.531250
5 ordii M 49.135102
6 torridus F 24.830904
7 torridus M 23.695382

After resetting the index, the same comparison syntax introduced earlier can be used instead of xs() or passing lists of tuples to loc[].

In [49]:
female_weights = avg_wt_spec_sex_res.loc[avg_wt_spec_sex_res['sex'] == 'F']
female_weights
Out[49]:
species sex weight
0 albigula F 154.282209
2 flavus F 7.974394
4 ordii F 48.531250
6 torridus F 24.830904

reset_index() grants the freedom of not having to work with indexes, but it is still worth keeping in mind that selecting on an index level with xs() can be orders of magnitude faster than using boolean comparisons (on large data frames).

The opposite operation (creating an index) can be performed with set_index() on any column (or combination of columns) whose values uniquely identify the rows.

In [50]:
female_weights.set_index(['species', 'sex'])
Out[50]:
weight
species sex
albigula F 154.282209
flavus F 7.974394
ordii F 48.531250
torridus F 24.830904

Multiple aggregations on grouped data

Since the same grouped data frame will be used in multiple code chunks below, this can be assigned to a new variable instead of typing out the grouping expression each time.

In [51]:
grouped_surveys = surveys_sub.groupby(['species', 'sex'])
grouped_surveys['weight'].mean()
Out[51]:
species   sex
albigula  F      154.282209
          M      165.652893
flavus    F        7.974394
          M        7.885287
ordii     F       48.531250
          M       49.135102
torridus  F       24.830904
          M       23.695382
Name: weight, dtype: float64

Other aggregation methods, such as the standard deviation, are called with the same syntax.

In [52]:
grouped_surveys['weight'].std()
Out[52]:
species   sex
albigula  F      39.186546
          M      48.991563
flavus    F       1.686060
          M       1.587815
ordii     F       8.244255
          M       7.649716
torridus  F       4.908162
          M       3.990268
Name: weight, dtype: float64

Instead of using the mean() method, the more general agg() method could be called to aggregate (or summarize) by any existing aggregation functions. The equivalent to the mean() method would be to call agg() and specify 'mean'.

In [53]:
grouped_surveys['weight'].agg('mean')
Out[53]:
species   sex
albigula  F      154.282209
          M      165.652893
flavus    F        7.974394
          M        7.885287
ordii     F       48.531250
          M       49.135102
torridus  F       24.830904
          M       23.695382
Name: weight, dtype: float64

This general approach is more flexible and powerful, since multiple aggregation functions can be applied in the same line of code by passing them as a list to agg(). For instance, the standard deviation and mean could be computed in the same call.

In [54]:
grouped_surveys['weight'].agg(['mean', 'std'])
Out[54]:
mean std
species sex
albigula F 154.282209 39.186546
M 165.652893 48.991563
flavus F 7.974394 1.686060
M 7.885287 1.587815
ordii F 48.531250 8.244255
M 49.135102 7.649716
torridus F 24.830904 4.908162
M 23.695382 3.990268

The returned output is in this case a data frame and the MultiIndex is indicated in bold font.

By passing a dictionary to .agg() it is possible to apply different aggregations to the different columns. Long code statements can be broken down into multiple lines if they are enclosed by parentheses, brackets or braces, something that will be described in detail later.

In [55]:
grouped_surveys[['weight', 'hindfoot_length']].agg(
    {'weight': 'sum',
     'hindfoot_length': ['min', 'max']
    }
)
Out[55]:
weight hindfoot_length
sum min max
species sex
albigula F 100592.0 21.0 38.0
M 80176.0 21.0 42.0
flavus F 5917.0 7.0 38.0
M 6324.0 9.0 22.0
ordii F 62120.0 26.0 64.0
M 79648.0 27.0 58.0
torridus F 25551.0 13.0 50.0
M 26681.0 13.0 29.0

There are plenty of aggregation methods available in pandas (e.g. sem, mad, sum); the most common are listed in the table at the end of this subsection in the documentation, and all can be found via tab completion on the data frame (surveys.<tab>).

Even if a function is not part of the pandas library, it can be passed to agg().

In [56]:
import numpy as np

grouped_surveys['weight'].agg(np.mean)
Out[56]:
species   sex
albigula  F      154.282209
          M      165.652893
flavus    F        7.974394
          M        7.885287
ordii     F       48.531250
          M       49.135102
torridus  F       24.830904
          M       23.695382
Name: weight, dtype: float64

Any function can be passed like this, including user-created functions.
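As a small sketch, a user-defined function that computes the range (max minus min) of the weights in each group could be passed the same way (the name range_of_values is made up for this example):

# A custom aggregation function: the range of values within each group
def range_of_values(values):
    return values.max() - values.min()

grouped_surveys['weight'].agg(range_of_values)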

Challenge

  1. Use groupby() and agg() to find the mean, min, and max hindfoot length for each species.

Using size() to summarize categorical data

When working with data, it is common to want to know the number of observations present for each categorical variable. For this, pandas provides the size() method. For example, to group by 'taxa' and find the number of observations for each 'taxa':

In [57]:
# Note that the original full length data frame is used here again
surveys.groupby('taxa').size()
Out[57]:
taxa
Bird         450
Rabbit        75
Reptile       14
Rodent     34247
dtype: int64

size() can also be used when grouping on multiple variables.

In [58]:
surveys.groupby(['taxa', 'plot_type']).size()
Out[58]:
taxa     plot_type                
Bird     Control                        110
         Long-term Krat Exclosure        67
         Rodent Exclosure               118
         Short-term Krat Exclosure      131
         Spectab exclosure               24
Rabbit   Control                         29
         Long-term Krat Exclosure        17
         Rodent Exclosure                24
         Short-term Krat Exclosure        2
         Spectab exclosure                3
Reptile  Control                          6
         Long-term Krat Exclosure         3
         Rodent Exclosure                 1
         Short-term Krat Exclosure        4
Rodent   Control                      15466
         Long-term Krat Exclosure      5031
         Rodent Exclosure              4090
         Short-term Krat Exclosure     5769
         Spectab exclosure             3891
dtype: int64

If there are many groups, size() is not that useful on its own. For example, it is difficult to quickly find the five most abundant species among the observations.

In [59]:
surveys.groupby('species').size()
Out[59]:
species
albigula            1252
audubonii             75
baileyi             2891
bilineata            303
brunneicapillus       50
chlorurus             39
clarki                 1
eremicus            1299
flavus              1597
fulvescens            75
fulviventer           43
fuscus                 5
gramineus              8
harrisi              437
hispidus             179
intermedius            9
leucogaster         1006
leucophrys             2
leucopus              36
maniculatus          899
megalotis           2609
melanocorys           13
merriami           10596
montanus               8
ochrognathus          43
ordii               3027
penicillatus        3123
savannarum             2
scutalatus             1
sp.                   86
spectabilis         2504
spilosoma            248
squamata              16
taylori               46
tereticaudus           1
tigris                 1
torridus            2249
undulatus              5
uniparens              1
viridis                1
dtype: int64

Since there are many rows in this output, it would be beneficial to sort the table values and display the most abundant species first. This is easy to do with the sort_values() method.

In [60]:
surveys.groupby('species').size().sort_values()
Out[60]:
species
viridis                1
uniparens              1
scutalatus             1
tereticaudus           1
tigris                 1
clarki                 1
leucophrys             2
savannarum             2
undulatus              5
fuscus                 5
gramineus              8
montanus               8
intermedius            9
melanocorys           13
squamata              16
leucopus              36
chlorurus             39
ochrognathus          43
fulviventer           43
taylori               46
brunneicapillus       50
fulvescens            75
audubonii             75
sp.                   86
hispidus             179
spilosoma            248
bilineata            303
harrisi              437
maniculatus          899
leucogaster         1006
albigula            1252
eremicus            1299
flavus              1597
torridus            2249
spectabilis         2504
megalotis           2609
baileyi             2891
ordii               3027
penicillatus        3123
merriami           10596
dtype: int64

That's better, but it could be helpful to display the most abundant species on top. In other words, the output should be arranged in descending order.

In [61]:
surveys.groupby('species').size().sort_values(ascending=False).head(5)
Out[61]:
species
merriami        10596
penicillatus     3123
ordii            3027
baileyi          2891
megalotis        2609
dtype: int64

Looks good! By now, the code statement has grown quite long because many methods have been chained together. It can be tricky to keep track of what is going on in long method chains. To make the code more readable, it can be broken up over multiple lines by adding surrounding parentheses.

In [62]:
(surveys
     .groupby('species')
     .size()
     .sort_values(ascending=False)
     .head(5)
)
Out[62]:
species
merriami        10596
penicillatus     3123
ordii            3027
baileyi          2891
megalotis        2609
dtype: int64

This looks neater and makes long method chains easier to read. There is no absolute rule for when to break code into multiple lines, but always try to write code that is easy for collaborators (your most common collaborator is a future version of yourself!) to understand.

pandas actually has a convenience method, nlargest(), for returning the largest values (five by default), so the values don't need to be sorted explicitly.

In [63]:
(surveys
     .groupby(['species'])
     .size()
     .nlargest() # the default is 5
)
Out[63]:
species
merriami        10596
penicillatus     3123
ordii            3027
baileyi          2891
megalotis        2609
dtype: int64

To include more attributes about these species, add columns to groupby().

In [64]:
(surveys
     .groupby(['species', 'taxa', 'genus'])
     .size()
     .nlargest()
) 
Out[64]:
species       taxa    genus          
merriami      Rodent  Dipodomys          10596
penicillatus  Rodent  Chaetodipus         3123
ordii         Rodent  Dipodomys           3027
baileyi       Rodent  Chaetodipus         2891
megalotis     Rodent  Reithrodontomys     2609
dtype: int64

Challenge

  1. How many individuals were caught in each plot_type surveyed?

  2. Calculate the number of animals trapped per plot type for each year. Extract the combinations of year and plot type that had the three highest number of observations (e.g. "1998-Control").

Merging and concatenating data frames

Commonly, several data frames (e.g. from different CSV files) need to be combined into one big data frame. Either rows or columns from different data frames can be combined, resulting in a data frame that is longer or wider than the originals, respectively.

First, the syntax for subsetting data frames will be used to create the partial data frames that will later be joined together.

In [65]:
wt_avg = surveys_sub.groupby('species')['weight'].mean().reset_index()
wt_avg
Out[65]:
species weight
0 albigula 159.245660
1 flavus 7.923127
2 ordii 48.870523
3 torridus 24.230556
In [66]:
wt_avg_top = wt_avg[:2]
wt_avg_top
Out[66]:
species weight
0 albigula 159.245660
1 flavus 7.923127
In [67]:
wt_avg_bot = wt_avg[2:]
wt_avg_bot
Out[67]:
species weight
2 ordii 48.870523
3 torridus 24.230556

The pandas function concat() can be used to join the rows from these two data frames into one longer data frame.

In [68]:
wt_avg_full = pd.concat([wt_avg_top, wt_avg_bot])
wt_avg_full
Out[68]:
species weight
0 albigula 159.245660
1 flavus 7.923127
2 ordii 48.870523
3 torridus 24.230556

The first argument to concat() is a list of the data frames to concatenate, and can consist of more than two data frames.

In [69]:
frames = [wt_avg_top, wt_avg_bot, wt_avg_top]
pd.concat(frames)
Out[69]:
species weight
0 albigula 159.245660
1 flavus 7.923127
2 ordii 48.870523
3 torridus 24.230556
0 albigula 159.245660
1 flavus 7.923127

Concatenating rows from separate data sets is very useful when new data has been recorded and needs to be added to the previously existing data. Although concat() can also be used to add columns from different data frames together, the merge() function provides a more powerful interface for this operation. Its syntax is inspired by how these operations are performed on (SQL-like) databases.

Since merging data frames joins columns together, let's first create a new data frame with a column for the median weight per species to be merged with the mean weight data frame.

In [70]:
wt_med = surveys_sub.groupby('species')['weight'].median().reset_index()
wt_med
Out[70]:
species weight
0 albigula 164.0
1 flavus 8.0
2 ordii 50.0
3 torridus 24.0

Merging only works with two data frames at a time, and they are specified as the first and second arguments to merge(). By default, all columns with a common name in both data frames will be used to figure out how the rows should be matched. Here, only the species column (and not the weight column) should be used to determine how to merge the data frames, which can be specified with the on parameter.

In [71]:
merged_df = pd.merge(wt_avg, wt_med, on='species')
merged_df
Out[71]:
species weight_x weight_y
0 albigula 159.245660 164.0
1 flavus 7.923127 8.0
2 ordii 48.870523 50.0
3 torridus 24.230556 24.0

Since the column weight exists in both the individual data frames, pandas appends _x and _y to differentiate between the columns in the final data frame. These columns could be renamed to better indicate what their values represent.

In [72]:
merged_df.rename(columns={'weight_x': 'mean_wt', 'weight_y': 'median_wt'})
Out[72]:
species mean_wt median_wt
0 albigula 159.245660 164.0
1 flavus 7.923127 8.0
2 ordii 48.870523 50.0
3 torridus 24.230556 24.0

For clarity, the rename() method could be used on the two individual data frames before merging.
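A sketch of that alternative, renaming the weight column in each data frame before merging so that no suffixes are needed:

# Rename the weight columns up front, then merge on `species`
pd.merge(wt_avg.rename(columns={'weight': 'mean_wt'}),
         wt_med.rename(columns={'weight': 'median_wt'}),
         on='species')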

It is important to understand that the merge() method uses the specified column to match which rows to merge from the different data frames, so the order of the values in each data frame does not matter. Likewise, merge() can handle species that occur in only one of the two data frames.

In [73]:
wt_avg.loc[2, 'species'] = 't_rex'
wt_avg
Out[73]:
species weight
0 albigula 159.245660
1 flavus 7.923127
2 t_rex 48.870523
3 torridus 24.230556
In [74]:
pd.merge(wt_avg, wt_med, on='species')
Out[74]:
species weight_x weight_y
0 albigula 159.245660 164.0
1 flavus 7.923127 8.0
2 torridus 24.230556 24.0

By default, if the column(s) to join on contain values that are not present in both data frames, rows with those values will not be included in the resulting data frame. This default behavior is referred to as an 'inner' join. To include the unmatched values, an 'outer' join can be performed instead.

In [75]:
pd.merge(wt_avg, wt_med, on='species', how='outer')
Out[75]:
species weight_x weight_y
0 albigula 159.245660 164.0
1 flavus 7.923127 8.0
2 t_rex 48.870523 NaN
3 torridus 24.230556 24.0
4 ordii NaN 50.0

The NaNs are introduced since these rows were only present in one of the joined data frames, and thus only have a value for one of the columns weight_x or weight_y.