In this lesson, we will learn to use:

- `loc[]`, `head()`, `info()`, `describe()`, `shape`, `columns`, and `index` to explore data frames
- `groupby()` with `mean()`, `agg()`, and `size()` to summarize data, including categorical data
- `concat()` and `merge()` to combine data frames

To access additional functionality in a spreadsheet program, you need to click the menu and select the tool you want to use. All charts are in one menu, text layout tools in another, data analysis tools in a third, and so on. Programming languages such as Python have so many tools and functions that they would not fit in a menu. Instead of clicking File -> Open and choosing the file, you would type something similar to `file.open(...)`.
Today, we will be working with real data from a longitudinal study of species abundance in the Chihuahuan desert ecosystem near Portal, Arizona, USA. This study includes observations of plants, ants, and rodents from 1977 to 2002, and has been used in over 100 publications. More information is available in the abstract of this paper from 2009. There are several datasets related to this study, and we will be working with data that have been preprocessed by Data Carpentry to facilitate teaching. These are made available online as The Portal Project Teaching Database, both at the Data Carpentry website and on Figshare. Figshare is a great place to openly publish data, code, figures, and more, making them available to other researchers and communicating findings that are not part of a longer paper.
We are studying the species and weight of animals caught in plots in our study area. The dataset is stored as a comma-separated values (CSV) file. Each row holds information for a single animal, and the columns represent:
Column | Description |
---|---|
record_id | unique id for the observation |
month | month of observation |
day | day of observation |
year | year of observation |
plot_id | ID of a particular plot |
species_id | 2-letter code |
sex | sex of animal ("M", "F") |
hindfoot_length | length of the hindfoot in mm |
weight | weight of the animal in grams |
genus | genus of animal |
species | species of animal |
taxa | e.g. rodent, reptile, bird, rabbit |
plot_type | type of plot |
To read the data into Python, we are going to use a function called `read_csv()`. This function is contained in a Python package called `pandas`. As mentioned previously, Python packages are a bit like browser extensions: they are not essential, but can provide nifty functionality. To use a package, it first needs to be imported.
# pandas is given the nickname `pd`
import pandas as pd
`pandas` can read CSV files saved on the computer or directly from a URL.
surveys = pd.read_csv('https://ndownloader.figshare.com/files/2292169')
To view the result, type `surveys` in a cell and run it, just as when viewing the content of any variable in Python.
surveys
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | genus | species | taxa | plot_type | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 7 | 16 | 1977 | 2 | NL | M | 32.0 | NaN | Neotoma | albigula | Rodent | Control |
1 | 72 | 8 | 19 | 1977 | 2 | NL | M | 31.0 | NaN | Neotoma | albigula | Rodent | Control |
2 | 224 | 9 | 13 | 1977 | 2 | NL | NaN | NaN | NaN | Neotoma | albigula | Rodent | Control |
3 | 266 | 10 | 16 | 1977 | 2 | NL | NaN | NaN | NaN | Neotoma | albigula | Rodent | Control |
4 | 349 | 11 | 12 | 1977 | 2 | NL | NaN | NaN | NaN | Neotoma | albigula | Rodent | Control |
5 | 363 | 11 | 12 | 1977 | 2 | NL | NaN | NaN | NaN | Neotoma | albigula | Rodent | Control |
6 | 435 | 12 | 10 | 1977 | 2 | NL | NaN | NaN | NaN | Neotoma | albigula | Rodent | Control |
7 | 506 | 1 | 8 | 1978 | 2 | NL | NaN | NaN | NaN | Neotoma | albigula | Rodent | Control |
8 | 588 | 2 | 18 | 1978 | 2 | NL | M | NaN | 218.0 | Neotoma | albigula | Rodent | Control |
9 | 661 | 3 | 11 | 1978 | 2 | NL | NaN | NaN | NaN | Neotoma | albigula | Rodent | Control |
10 | 748 | 4 | 8 | 1978 | 2 | NL | NaN | NaN | NaN | Neotoma | albigula | Rodent | Control |
11 | 845 | 5 | 6 | 1978 | 2 | NL | M | 32.0 | 204.0 | Neotoma | albigula | Rodent | Control |
12 | 990 | 6 | 9 | 1978 | 2 | NL | M | NaN | 200.0 | Neotoma | albigula | Rodent | Control |
13 | 1164 | 8 | 5 | 1978 | 2 | NL | M | 34.0 | 199.0 | Neotoma | albigula | Rodent | Control |
14 | 1261 | 9 | 4 | 1978 | 2 | NL | M | 32.0 | 197.0 | Neotoma | albigula | Rodent | Control |
15 | 1374 | 10 | 8 | 1978 | 2 | NL | NaN | NaN | NaN | Neotoma | albigula | Rodent | Control |
16 | 1453 | 11 | 5 | 1978 | 2 | NL | M | NaN | 218.0 | Neotoma | albigula | Rodent | Control |
17 | 1756 | 4 | 29 | 1979 | 2 | NL | M | 33.0 | 166.0 | Neotoma | albigula | Rodent | Control |
18 | 1818 | 5 | 30 | 1979 | 2 | NL | M | 32.0 | 184.0 | Neotoma | albigula | Rodent | Control |
19 | 1882 | 7 | 4 | 1979 | 2 | NL | M | 32.0 | 206.0 | Neotoma | albigula | Rodent | Control |
20 | 2133 | 10 | 25 | 1979 | 2 | NL | F | 33.0 | 274.0 | Neotoma | albigula | Rodent | Control |
21 | 2184 | 11 | 17 | 1979 | 2 | NL | F | 30.0 | 186.0 | Neotoma | albigula | Rodent | Control |
22 | 2406 | 1 | 16 | 1980 | 2 | NL | F | 33.0 | 184.0 | Neotoma | albigula | Rodent | Control |
23 | 2728 | 3 | 9 | 1980 | 2 | NL | F | NaN | NaN | Neotoma | albigula | Rodent | Control |
24 | 3000 | 5 | 18 | 1980 | 2 | NL | F | 31.0 | 87.0 | Neotoma | albigula | Rodent | Control |
25 | 3002 | 5 | 18 | 1980 | 2 | NL | F | 33.0 | 174.0 | Neotoma | albigula | Rodent | Control |
26 | 4667 | 7 | 8 | 1981 | 2 | NL | F | 30.0 | 130.0 | Neotoma | albigula | Rodent | Control |
27 | 4859 | 10 | 1 | 1981 | 2 | NL | M | 34.0 | 208.0 | Neotoma | albigula | Rodent | Control |
28 | 5048 | 11 | 23 | 1981 | 2 | NL | M | 34.0 | 192.0 | Neotoma | albigula | Rodent | Control |
29 | 5180 | 1 | 1 | 1982 | 2 | NL | M | NaN | 206.0 | Neotoma | albigula | Rodent | Control |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
34756 | 21209 | 11 | 13 | 1993 | 7 | CU | NaN | NaN | NaN | Cnemidophorus | uniparens | Reptile | Rodent Exclosure |
34757 | 25710 | 5 | 10 | 1997 | 7 | PB | M | 26.0 | 31.0 | Chaetodipus | baileyi | Rodent | Rodent Exclosure |
34758 | 26042 | 6 | 10 | 1997 | 7 | PB | F | 27.0 | 24.0 | Chaetodipus | baileyi | Rodent | Rodent Exclosure |
34759 | 26096 | 6 | 10 | 1997 | 7 | PB | F | 26.0 | 30.0 | Chaetodipus | baileyi | Rodent | Rodent Exclosure |
34760 | 26356 | 7 | 9 | 1997 | 7 | PB | M | 27.0 | 32.0 | Chaetodipus | baileyi | Rodent | Rodent Exclosure |
34761 | 26475 | 7 | 9 | 1997 | 7 | PB | M | 28.0 | 36.0 | Chaetodipus | baileyi | Rodent | Rodent Exclosure |
34762 | 26546 | 7 | 29 | 1997 | 7 | PB | M | 27.0 | 37.0 | Chaetodipus | baileyi | Rodent | Rodent Exclosure |
34763 | 26776 | 9 | 27 | 1997 | 7 | PB | M | 26.0 | 37.0 | Chaetodipus | baileyi | Rodent | Rodent Exclosure |
34764 | 26819 | 9 | 27 | 1997 | 7 | PB | M | 30.0 | 40.0 | Chaetodipus | baileyi | Rodent | Rodent Exclosure |
34765 | 28332 | 8 | 22 | 1998 | 7 | PB | M | 26.0 | 27.0 | Chaetodipus | baileyi | Rodent | Rodent Exclosure |
34766 | 28336 | 8 | 22 | 1998 | 7 | PB | M | 26.0 | 23.0 | Chaetodipus | baileyi | Rodent | Rodent Exclosure |
34767 | 28337 | 8 | 22 | 1998 | 7 | PB | F | 27.0 | 30.0 | Chaetodipus | baileyi | Rodent | Rodent Exclosure |
34768 | 28338 | 8 | 22 | 1998 | 7 | PB | F | 25.0 | 23.0 | Chaetodipus | baileyi | Rodent | Rodent Exclosure |
34769 | 28585 | 9 | 20 | 1998 | 7 | PB | M | 26.0 | 25.0 | Chaetodipus | baileyi | Rodent | Rodent Exclosure |
34770 | 28667 | 10 | 24 | 1998 | 7 | PB | M | 26.0 | 25.0 | Chaetodipus | baileyi | Rodent | Rodent Exclosure |
34771 | 29231 | 2 | 20 | 1999 | 7 | PB | M | 26.0 | 28.0 | Chaetodipus | baileyi | Rodent | Rodent Exclosure |
34772 | 30355 | 2 | 5 | 2000 | 7 | PB | M | 27.0 | 20.0 | Chaetodipus | baileyi | Rodent | Rodent Exclosure |
34773 | 32085 | 5 | 26 | 2001 | 7 | PB | F | 22.0 | 37.0 | Chaetodipus | baileyi | Rodent | Rodent Exclosure |
34774 | 32477 | 8 | 25 | 2001 | 7 | PB | M | 28.0 | 32.0 | Chaetodipus | baileyi | Rodent | Rodent Exclosure |
34775 | 33103 | 11 | 17 | 2001 | 7 | PB | M | 28.0 | 41.0 | Chaetodipus | baileyi | Rodent | Rodent Exclosure |
34776 | 33305 | 12 | 15 | 2001 | 7 | PB | M | 29.0 | 44.0 | Chaetodipus | baileyi | Rodent | Rodent Exclosure |
34777 | 34524 | 7 | 13 | 2002 | 7 | PB | M | 25.0 | 16.0 | Chaetodipus | baileyi | Rodent | Rodent Exclosure |
34778 | 35382 | 12 | 8 | 2002 | 7 | PB | M | 26.0 | 30.0 | Chaetodipus | baileyi | Rodent | Rodent Exclosure |
34779 | 26557 | 7 | 29 | 1997 | 7 | PL | F | 20.0 | 22.0 | Peromyscus | leucopus | Rodent | Rodent Exclosure |
34780 | 26787 | 9 | 27 | 1997 | 7 | PL | F | 21.0 | 16.0 | Peromyscus | leucopus | Rodent | Rodent Exclosure |
34781 | 26966 | 10 | 25 | 1997 | 7 | PL | M | 20.0 | 16.0 | Peromyscus | leucopus | Rodent | Rodent Exclosure |
34782 | 27185 | 11 | 22 | 1997 | 7 | PL | F | 21.0 | 22.0 | Peromyscus | leucopus | Rodent | Rodent Exclosure |
34783 | 27792 | 5 | 2 | 1998 | 7 | PL | F | 20.0 | 8.0 | Peromyscus | leucopus | Rodent | Rodent Exclosure |
34784 | 28806 | 11 | 21 | 1998 | 7 | PX | NaN | NaN | NaN | Chaetodipus | sp. | Rodent | Rodent Exclosure |
34785 | 30986 | 7 | 1 | 2000 | 7 | PX | NaN | NaN | NaN | Chaetodipus | sp. | Rodent | Rodent Exclosure |
34786 rows × 13 columns
This is how a data frame is displayed in the Jupyter Notebook. Although the data frame itself consists of just the values, the Notebook knows that this is a data frame and displays it in a nice tabular format (by adding HTML decorators), with some cosmetic conveniences such as bold column and row names, alternating grey and white zebra stripes for the rows, and highlighting of the row under the mouse pointer. The increasing numbers on the far left are the data frame's index, which was added by `pandas` to easily distinguish between the rows.
A data frame is a representation of data in a tabular format, similar to how data is often arranged in spreadsheets. The data is rectangular, meaning that all rows have the same number of columns and all columns have the same number of rows. Data frames are the de facto data structure for most tabular data, and what we use for statistics and plotting. A data frame can be created by hand, but most commonly they are generated by an input function, such as `read_csv()`, i.e. when importing spreadsheets from your hard drive (or the web).
As can be seen above, the default is to display the first and last 30 rows and truncate everything in between, as indicated by the ellipsis (`...`). Although it is truncated, this output is still quite space consuming. To glance at how the data frame looks, it is sufficient to display only the top (the first 5 rows) using the `head()` method.
surveys.head()
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | genus | species | taxa | plot_type | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 7 | 16 | 1977 | 2 | NL | M | 32.0 | NaN | Neotoma | albigula | Rodent | Control |
1 | 72 | 8 | 19 | 1977 | 2 | NL | M | 31.0 | NaN | Neotoma | albigula | Rodent | Control |
2 | 224 | 9 | 13 | 1977 | 2 | NL | NaN | NaN | NaN | Neotoma | albigula | Rodent | Control |
3 | 266 | 10 | 16 | 1977 | 2 | NL | NaN | NaN | NaN | Neotoma | albigula | Rodent | Control |
4 | 349 | 11 | 12 | 1977 | 2 | NL | NaN | NaN | NaN | Neotoma | albigula | Rodent | Control |
Methods are very similar to functions; the main difference is that they belong to an object (above, the method `head()` belongs to the data frame `surveys`). Methods operate on the object they belong to, which is why we can call a method with empty parentheses, without any arguments. Compare this with the function `type()` that was introduced previously.
type(surveys)
pandas.core.frame.DataFrame
Here, the `surveys` variable is explicitly passed as an argument to `type()`. An immediately tangible advantage of methods is that they simplify tab completion: just type the name of the data frame, a period, and then hit tab to see all the relevant methods for that data frame, instead of fumbling around with all the available functions in Python (there are quite a few!) and figuring out which ones operate on data frames and which do not. Methods also facilitate readability when chaining many operations together, which will be shown in detail later.
The columns in a data frame can contain data of different types, e.g. integers, floats, and objects (which includes strings, lists, dictionaries, and more). General information about the data frame (including the column data types) can be obtained with the `info()` method.
surveys.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 34786 entries, 0 to 34785 Data columns (total 13 columns): record_id 34786 non-null int64 month 34786 non-null int64 day 34786 non-null int64 year 34786 non-null int64 plot_id 34786 non-null int64 species_id 34786 non-null object sex 33038 non-null object hindfoot_length 31438 non-null float64 weight 32283 non-null float64 genus 34786 non-null object species 34786 non-null object taxa 34786 non-null object plot_type 34786 non-null object dtypes: float64(2), int64(5), object(6) memory usage: 3.5+ MB
The information includes the total number of rows and columns, the number of non-null observations, the column data types, and the memory (RAM) usage. The number of non-null observations is not the same for all columns, which means that some columns contain null values representing missing information. In `pandas`, these missing values are marked `NaN` for "not a number".
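Once you know that missing values are marked `NaN`, a handy idiom for counting them per column is to chain `isnull()` with `sum()`. A minimal sketch on a small made-up data frame (not the real survey data):

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the survey data, with a few missing values
df = pd.DataFrame({
    'species_id': ['NL', 'NL', 'PF'],
    'weight': [np.nan, 218.0, 4.0],
    'hindfoot_length': [32.0, np.nan, np.nan],
})

# `isnull()` gives a boolean data frame; summing counts the True values per column
print(df.isnull().sum())
```

On `surveys`, the same chain would reproduce the missing-value counts implied by `info()`.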
The column data type is often indicative of which type of data is stored in that column, and approximately corresponds to the following:

Type of data | Column data type |
---|---|
Text | `object`, `category` |
Categorical | `object`, `category`, `int` |
Logical | `bool` |
Whole numbers | `int` |
Decimal numbers | `float` |

Note that an `object` column could contain different types, e.g. `str` or `list`. Also note that there can be exceptions to the schema above, but it is still a useful rough guide.
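To check the correspondence for a specific data frame, the `dtypes` attribute lists the data type of every column, and each individual column has its own `dtype`. A short sketch on a made-up data frame:

```python
import pandas as pd

df = pd.DataFrame({
    'species_id': ['NL', 'PF'],   # text -> object
    'year': [1977, 1998],         # whole numbers -> int64
    'weight': [218.0, 4.0],       # decimal numbers -> float64
})

print(df.dtypes)             # data type of every column
print(df['weight'].dtype)    # data type of a single column
```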
After reading the data into a data frame, `head()` and `info()` are two of the most useful methods to get an idea of the structure of the data frame. There are many additional methods that can facilitate the understanding of what a data frame contains:
Size:

- `surveys.shape` - a tuple with the number of rows as the first element and the number of columns as the second element
- `surveys.shape[0]` - the number of rows
- `surveys.shape[1]` - the number of columns

Content:

- `surveys.head()` - shows the first 5 rows
- `surveys.tail()` - shows the last 5 rows

Names:

- `surveys.columns` - returns the names of the columns (also called variable names)
- `surveys.index` - returns the names of the rows (referred to as the index in `pandas`)

Summary:

- `surveys.info()` - column names and data types, number of observations, memory consumption
- `surveys.describe()` - summary statistics for each numerical column

These belong to a data frame and are commonly referred to as attributes of the data frame. All attributes are accessed with the dot syntax (`.`), which returns the value of the attribute.
surveys.shape
(34786, 13)
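The other attributes listed above work the same way. A quick sketch on a small stand-in data frame (three made-up rows, not the survey data):

```python
import pandas as pd

df = pd.DataFrame({
    'species_id': ['NL', 'NL', 'PF'],
    'weight': [218.0, 204.0, 4.0],
})

print(list(df.columns))  # the column names
print(list(df.index))    # the row names (here, the default integer index)
print(df.tail(2))        # the last 2 rows
```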
If the attribute is a method, parentheses need to be appended to the name to carry out the method's operation on the data frame. Without parentheses, Python returns the method object itself instead of executing it:
surveys.head
<bound method NDFrame.head of record_id month day year plot_id species_id sex hindfoot_length \ 0 1 7 16 1977 2 NL M 32.0 1 72 8 19 1977 2 NL M 31.0 2 224 9 13 1977 2 NL NaN NaN 3 266 10 16 1977 2 NL NaN NaN 4 349 11 12 1977 2 NL NaN NaN 5 363 11 12 1977 2 NL NaN NaN 6 435 12 10 1977 2 NL NaN NaN 7 506 1 8 1978 2 NL NaN NaN 8 588 2 18 1978 2 NL M NaN 9 661 3 11 1978 2 NL NaN NaN 10 748 4 8 1978 2 NL NaN NaN 11 845 5 6 1978 2 NL M 32.0 12 990 6 9 1978 2 NL M NaN 13 1164 8 5 1978 2 NL M 34.0 14 1261 9 4 1978 2 NL M 32.0 15 1374 10 8 1978 2 NL NaN NaN 16 1453 11 5 1978 2 NL M NaN 17 1756 4 29 1979 2 NL M 33.0 18 1818 5 30 1979 2 NL M 32.0 19 1882 7 4 1979 2 NL M 32.0 20 2133 10 25 1979 2 NL F 33.0 21 2184 11 17 1979 2 NL F 30.0 22 2406 1 16 1980 2 NL F 33.0 23 2728 3 9 1980 2 NL F NaN 24 3000 5 18 1980 2 NL F 31.0 25 3002 5 18 1980 2 NL F 33.0 26 4667 7 8 1981 2 NL F 30.0 27 4859 10 1 1981 2 NL M 34.0 28 5048 11 23 1981 2 NL M 34.0 29 5180 1 1 1982 2 NL M NaN ... ... ... ... ... ... ... ... ... 
34756 21209 11 13 1993 7 CU NaN NaN 34757 25710 5 10 1997 7 PB M 26.0 34758 26042 6 10 1997 7 PB F 27.0 34759 26096 6 10 1997 7 PB F 26.0 34760 26356 7 9 1997 7 PB M 27.0 34761 26475 7 9 1997 7 PB M 28.0 34762 26546 7 29 1997 7 PB M 27.0 34763 26776 9 27 1997 7 PB M 26.0 34764 26819 9 27 1997 7 PB M 30.0 34765 28332 8 22 1998 7 PB M 26.0 34766 28336 8 22 1998 7 PB M 26.0 34767 28337 8 22 1998 7 PB F 27.0 34768 28338 8 22 1998 7 PB F 25.0 34769 28585 9 20 1998 7 PB M 26.0 34770 28667 10 24 1998 7 PB M 26.0 34771 29231 2 20 1999 7 PB M 26.0 34772 30355 2 5 2000 7 PB M 27.0 34773 32085 5 26 2001 7 PB F 22.0 34774 32477 8 25 2001 7 PB M 28.0 34775 33103 11 17 2001 7 PB M 28.0 34776 33305 12 15 2001 7 PB M 29.0 34777 34524 7 13 2002 7 PB M 25.0 34778 35382 12 8 2002 7 PB M 26.0 34779 26557 7 29 1997 7 PL F 20.0 34780 26787 9 27 1997 7 PL F 21.0 34781 26966 10 25 1997 7 PL M 20.0 34782 27185 11 22 1997 7 PL F 21.0 34783 27792 5 2 1998 7 PL F 20.0 34784 28806 11 21 1998 7 PX NaN NaN 34785 30986 7 1 2000 7 PX NaN NaN weight genus species taxa plot_type 0 NaN Neotoma albigula Rodent Control 1 NaN Neotoma albigula Rodent Control 2 NaN Neotoma albigula Rodent Control 3 NaN Neotoma albigula Rodent Control 4 NaN Neotoma albigula Rodent Control 5 NaN Neotoma albigula Rodent Control 6 NaN Neotoma albigula Rodent Control 7 NaN Neotoma albigula Rodent Control 8 218.0 Neotoma albigula Rodent Control 9 NaN Neotoma albigula Rodent Control 10 NaN Neotoma albigula Rodent Control 11 204.0 Neotoma albigula Rodent Control 12 200.0 Neotoma albigula Rodent Control 13 199.0 Neotoma albigula Rodent Control 14 197.0 Neotoma albigula Rodent Control 15 NaN Neotoma albigula Rodent Control 16 218.0 Neotoma albigula Rodent Control 17 166.0 Neotoma albigula Rodent Control 18 184.0 Neotoma albigula Rodent Control 19 206.0 Neotoma albigula Rodent Control 20 274.0 Neotoma albigula Rodent Control 21 186.0 Neotoma albigula Rodent Control 22 184.0 Neotoma albigula Rodent Control 23 NaN Neotoma albigula 
Rodent Control 24 87.0 Neotoma albigula Rodent Control 25 174.0 Neotoma albigula Rodent Control 26 130.0 Neotoma albigula Rodent Control 27 208.0 Neotoma albigula Rodent Control 28 192.0 Neotoma albigula Rodent Control 29 206.0 Neotoma albigula Rodent Control ... ... ... ... ... ... 34756 NaN Cnemidophorus uniparens Reptile Rodent Exclosure 34757 31.0 Chaetodipus baileyi Rodent Rodent Exclosure 34758 24.0 Chaetodipus baileyi Rodent Rodent Exclosure 34759 30.0 Chaetodipus baileyi Rodent Rodent Exclosure 34760 32.0 Chaetodipus baileyi Rodent Rodent Exclosure 34761 36.0 Chaetodipus baileyi Rodent Rodent Exclosure 34762 37.0 Chaetodipus baileyi Rodent Rodent Exclosure 34763 37.0 Chaetodipus baileyi Rodent Rodent Exclosure 34764 40.0 Chaetodipus baileyi Rodent Rodent Exclosure 34765 27.0 Chaetodipus baileyi Rodent Rodent Exclosure 34766 23.0 Chaetodipus baileyi Rodent Rodent Exclosure 34767 30.0 Chaetodipus baileyi Rodent Rodent Exclosure 34768 23.0 Chaetodipus baileyi Rodent Rodent Exclosure 34769 25.0 Chaetodipus baileyi Rodent Rodent Exclosure 34770 25.0 Chaetodipus baileyi Rodent Rodent Exclosure 34771 28.0 Chaetodipus baileyi Rodent Rodent Exclosure 34772 20.0 Chaetodipus baileyi Rodent Rodent Exclosure 34773 37.0 Chaetodipus baileyi Rodent Rodent Exclosure 34774 32.0 Chaetodipus baileyi Rodent Rodent Exclosure 34775 41.0 Chaetodipus baileyi Rodent Rodent Exclosure 34776 44.0 Chaetodipus baileyi Rodent Rodent Exclosure 34777 16.0 Chaetodipus baileyi Rodent Rodent Exclosure 34778 30.0 Chaetodipus baileyi Rodent Rodent Exclosure 34779 22.0 Peromyscus leucopus Rodent Rodent Exclosure 34780 16.0 Peromyscus leucopus Rodent Rodent Exclosure 34781 16.0 Peromyscus leucopus Rodent Rodent Exclosure 34782 22.0 Peromyscus leucopus Rodent Rodent Exclosure 34783 8.0 Peromyscus leucopus Rodent Rodent Exclosure 34784 NaN Chaetodipus sp. Rodent Rodent Exclosure 34785 NaN Chaetodipus sp. Rodent Rodent Exclosure [34786 rows x 13 columns]>
surveys.head()
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | genus | species | taxa | plot_type | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 7 | 16 | 1977 | 2 | NL | M | 32.0 | NaN | Neotoma | albigula | Rodent | Control |
1 | 72 | 8 | 19 | 1977 | 2 | NL | M | 31.0 | NaN | Neotoma | albigula | Rodent | Control |
2 | 224 | 9 | 13 | 1977 | 2 | NL | NaN | NaN | NaN | Neotoma | albigula | Rodent | Control |
3 | 266 | 10 | 16 | 1977 | 2 | NL | NaN | NaN | NaN | Neotoma | albigula | Rodent | Control |
4 | 349 | 11 | 12 | 1977 | 2 | NL | NaN | NaN | NaN | Neotoma | albigula | Rodent | Control |
Attributes that are not methods often hold a value that has been precomputed, because it is commonly accessed and it saves time to store the value in an attribute instead of recomputing it every time it is needed. For example, every time `pandas` creates a data frame, the number of rows and columns is computed and stored in the `shape` attribute.
Challenge
Based on the output of `surveys.info()`, can you answer the following questions?

1. What is the class of the object `surveys`?
2. How many rows and how many columns are in this object?
3. Why is there not the same number of rows (observations) for each column?
It is good practice to keep a copy of the data stored locally on your computer, in case you want to do offline analyses, the online version of the file changes, or the file is taken down. For this, the data could be downloaded manually, or the current `surveys` data frame could be saved to disk as a CSV file with `to_csv()`.
surveys.to_csv('surveys.csv', index=False)
# `index=False` because the index (the row names) was generated automatically when pandas
# read the file, so it does not need to be saved
Since the data is now saved locally, the next time this Notebook is opened, it could be loaded from the local path instead of downloading it from the URL.
surveys = pd.read_csv('surveys.csv')
surveys.head()
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | genus | species | taxa | plot_type | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 7 | 16 | 1977 | 2 | NL | M | 32.0 | NaN | Neotoma | albigula | Rodent | Control |
1 | 72 | 8 | 19 | 1977 | 2 | NL | M | 31.0 | NaN | Neotoma | albigula | Rodent | Control |
2 | 224 | 9 | 13 | 1977 | 2 | NL | NaN | NaN | NaN | Neotoma | albigula | Rodent | Control |
3 | 266 | 10 | 16 | 1977 | 2 | NL | NaN | NaN | NaN | Neotoma | albigula | Rodent | Control |
4 | 349 | 11 | 12 | 1977 | 2 | NL | NaN | NaN | NaN | Neotoma | albigula | Rodent | Control |
The survey data frame has rows and columns (it has 2 dimensions). To extract specific data from it (also referred to as "subsetting"), columns can be selected by their name. The Jupyter Notebook (technically, the underlying IPython interpreter) knows about the columns in the data frame, so tab autocompletion can be used to get the correct column name.
surveys['species_id'].head()
0 NL 1 NL 2 NL 3 NL 4 NL Name: species_id, dtype: object
The name of the column is not shown as a table header, since there is only one column. Remember that the numbers on the left are just the index of the data frame, which was added by `pandas` upon importing the data.
Another syntax that is often used to specify column names is dot notation: `.<column_name>`.
surveys.species_id.head()
0 NL 1 NL 2 NL 3 NL 4 NL Name: species_id, dtype: object
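Dot notation has limitations, however: it only works when the column name is a valid Python identifier and does not collide with an existing attribute or method. A small sketch with hypothetical column names:

```python
import pandas as pd

# A column name containing a space can only be selected with brackets
df = pd.DataFrame({'hindfoot length': [15.0, 32.0]})
print(df['hindfoot length'])
# df.hindfoot length  # would be a SyntaxError

# A column that shadows an existing attribute is hidden by dot notation
df2 = pd.DataFrame({'shape': ['round', 'square']})
print(df2['shape'])  # the column
print(df2.shape)     # the (rows, columns) tuple, NOT the column
```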
Using brackets is clearer and also allows for passing multiple columns as a list, so this tutorial will stick to that.
surveys[['species_id', 'record_id']].head()
species_id | record_id | |
---|---|---|
0 | NL | 1 |
1 | NL | 72 |
2 | NL | 224 |
3 | NL | 266 |
4 | NL | 349 |
The output is displayed a bit differently this time. The reason is that in the last cell, where the returned data frame only had one column ("species_id"), `pandas` technically returned a `Series`, not a `DataFrame`. This can be confirmed by using `type()` as previously.
type(surveys['species_id'].head())
pandas.core.series.Series
type(surveys[['species_id', 'record_id']].head())
pandas.core.frame.DataFrame
So, every individual column is actually a `Series`, and together they constitute a `DataFrame`. This introductory tutorial will not make any further distinction between a `Series` and a `DataFrame`, and many of the analysis techniques used here apply to both series and data frames.
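Although the distinction rarely matters here, it can be handy to know that the two convert easily into each other: `to_frame()` turns a `Series` into a single-column `DataFrame`, and selecting that column with single brackets goes back again. A brief sketch:

```python
import pandas as pd

s = pd.Series(['NL', 'NL', 'PF'], name='species_id')

df = s.to_frame()        # Series -> one-column DataFrame
back = df['species_id']  # single brackets -> Series again

print(type(df))
print(type(back))
```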
Selecting with single brackets (`[]`) is a shortcut for common operations, such as selecting columns by label as above. For more flexible and robust row and column selection, the more verbose `loc[<rows>, <columns>]` (location) syntax is used.
surveys.loc[[0, 2, 4], ['species', 'record_id']]
# Although methods usually have trailing parentheses, square brackets are used with `loc[]` to stay
# consistent with square-bracket indexing in general in Python (e.g. lists and NumPy arrays)
species | record_id | |
---|---|---|
0 | albigula | 1 |
2 | albigula | 224 |
4 | albigula | 349 |
A single number can be selected, which returns that value (here, an integer) rather than a data frame or series with just one value.
surveys.loc[4, 'record_id']
349
If the column argument is left out, all columns are selected.
surveys.loc[[3, 4]]
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | genus | species | taxa | plot_type | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3 | 266 | 10 | 16 | 1977 | 2 | NL | NaN | NaN | NaN | Neotoma | albigula | Rodent | Control |
4 | 349 | 11 | 12 | 1977 | 2 | NL | NaN | NaN | NaN | Neotoma | albigula | Rodent | Control |
To select all rows, but only a subset of columns, the colon character (`:`) can be used.
surveys.loc[:, ['month', 'day']].shape # show the size of the data frame
(34786, 2)
It is also possible to select slices of row and column labels.
surveys.loc[2:4, 'record_id':'day']
record_id | month | day | |
---|---|---|---|
2 | 224 | 9 | 13 |
3 | 266 | 10 | 16 |
4 | 349 | 11 | 12 |
It is important to realize that `loc[]` selects rows and columns by their labels, and that slicing is therefore inclusive of both the start and the end. To instead select by row or column position, use `iloc[]` (integer location).
surveys.iloc[[2, 3, 4], [0, 1, 2]]
record_id | month | day | |
---|---|---|---|
2 | 224 | 9 | 13 |
3 | 266 | 10 | 16 |
4 | 349 | 11 | 12 |
The index of `surveys` consists of consecutive integers, so in this case selecting from the index by label or by position will look the same. As will be shown later, an index could also consist of text names, just like the columns.
While selecting a slice by label is inclusive of both the start and the end, selecting a slice by position is inclusive of the start but exclusive of the end position, consistent with how position slicing works for other Python objects, such as lists and tuples.
surveys.iloc[2:5, :3]
record_id | month | day | |
---|---|---|---|
2 | 224 | 9 | 13 |
3 | 266 | 10 | 16 |
4 | 349 | 11 | 12 |
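This label/position difference is easy to verify side by side: on a small made-up data frame, the same slice `0:2` returns three rows with `loc[]` but only two with `iloc[]`.

```python
import pandas as pd

df = pd.DataFrame({'x': [10, 20, 30, 40]})

print(len(df.loc[0:2]))   # 3 rows: label slicing includes the end label
print(len(df.iloc[0:2]))  # 2 rows: position slicing excludes the end position
```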
Selecting slices of row positions is a common operation, and has thus been given a shortcut syntax with single brackets.
surveys[2:5]
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | genus | species | taxa | plot_type | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | 224 | 9 | 13 | 1977 | 2 | NL | NaN | NaN | NaN | Neotoma | albigula | Rodent | Control |
3 | 266 | 10 | 16 | 1977 | 2 | NL | NaN | NaN | NaN | Neotoma | albigula | Rodent | Control |
4 | 349 | 11 | 12 | 1977 | 2 | NL | NaN | NaN | NaN | Neotoma | albigula | Rodent | Control |
Challenge
1. Extract the 200th and 201st rows of the `surveys` dataset and assign the resulting data frame to a new variable name (`surveys_200_201`). Remember that Python indexing starts at 0!
2. How can you get the same result as from `surveys.head()` by using row slices instead of the `head()` method?
3. There are at least three distinct ways to extract the last row of the data frame. How many can you come up with?
The `describe()` method was mentioned above as a way of retrieving summary statistics of a data frame. Together with `info()` and `head()`, this is often a good place to start exploratory data analysis, as it gives a nice overview of the numeric variables in the data set.
surveys.describe()
record_id | month | day | year | plot_id | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|
count | 34786.000000 | 34786.000000 | 34786.000000 | 34786.000000 | 34786.000000 | 31438.000000 | 32283.000000 |
mean | 17804.204421 | 6.473725 | 16.095987 | 1990.495832 | 11.343098 | 29.287932 | 42.672428 |
std | 10229.682311 | 3.398384 | 8.249405 | 7.468714 | 6.794049 | 9.564759 | 36.631259 |
min | 1.000000 | 1.000000 | 1.000000 | 1977.000000 | 1.000000 | 2.000000 | 4.000000 |
25% | 8964.250000 | 4.000000 | 9.000000 | 1984.000000 | 5.000000 | 21.000000 | 20.000000 |
50% | 17761.500000 | 6.000000 | 16.000000 | 1990.000000 | 11.000000 | 32.000000 | 37.000000 |
75% | 26654.750000 | 10.000000 | 23.000000 | 1997.000000 | 17.000000 | 36.000000 | 48.000000 |
max | 35548.000000 | 12.000000 | 31.000000 | 2002.000000 | 24.000000 | 70.000000 | 280.000000 |
A common next step would be to plot the data to explore relationships between different variables, but before getting into plotting, it is beneficial to elaborate on the data frame object and several of its common operations.
An often desired outcome is to select the subset of rows matching a criterion, e.g. which observations have a weight under 5 grams. To do this, the "less than" comparison operator that was introduced previously can be used.
surveys['weight'] < 5
0 False 1 False 2 False 3 False 4 False 5 False 6 False 7 False 8 False 9 False 10 False 11 False 12 False 13 False 14 False 15 False 16 False 17 False 18 False 19 False 20 False 21 False 22 False 23 False 24 False 25 False 26 False 27 False 28 False 29 False ... 34756 False 34757 False 34758 False 34759 False 34760 False 34761 False 34762 False 34763 False 34764 False 34765 False 34766 False 34767 False 34768 False 34769 False 34770 False 34771 False 34772 False 34773 False 34774 False 34775 False 34776 False 34777 False 34778 False 34779 False 34780 False 34781 False 34782 False 34783 False 34784 False 34785 False Name: weight, Length: 34786, dtype: bool
The result is a boolean array with one value for every row in the data frame, indicating whether it is `True` or `False` that the row has a value below 5 in the weight column. This boolean array can be used to select only those rows from the data frame that meet the specified condition.
surveys[surveys['weight'] < 5]
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | genus | species | taxa | plot_type | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2428 | 4052 | 4 | 5 | 1981 | 3 | PF | F | 15.0 | 4.0 | Perognathus | flavus | Rodent | Long-term Krat Exclosure |
2453 | 7084 | 11 | 22 | 1982 | 3 | PF | F | 16.0 | 4.0 | Perognathus | flavus | Rodent | Long-term Krat Exclosure |
4253 | 28126 | 6 | 28 | 1998 | 15 | PF | M | NaN | 4.0 | Perognathus | flavus | Rodent | Long-term Krat Exclosure |
4665 | 9909 | 1 | 20 | 1985 | 15 | RM | F | 15.0 | 4.0 | Reithrodontomys | megalotis | Rodent | Long-term Krat Exclosure |
6860 | 9853 | 1 | 19 | 1985 | 17 | RM | M | 16.0 | 4.0 | Reithrodontomys | megalotis | Rodent | Control |
21224 | 4290 | 4 | 6 | 1981 | 4 | PF | NaN | NaN | 4.0 | Perognathus | flavus | Rodent | Control |
21674 | 29906 | 10 | 10 | 1999 | 4 | PP | M | 21.0 | 4.0 | Chaetodipus | penicillatus | Rodent | Control |
24191 | 8736 | 12 | 8 | 1983 | 19 | RM | M | 17.0 | 4.0 | Reithrodontomys | megalotis | Rodent | Long-term Krat Exclosure |
24200 | 9799 | 1 | 19 | 1985 | 19 | RM | M | 16.0 | 4.0 | Reithrodontomys | megalotis | Rodent | Long-term Krat Exclosure |
25529 | 9794 | 1 | 19 | 1985 | 24 | RM | M | 16.0 | 4.0 | Reithrodontomys | megalotis | Rodent | Rodent Exclosure |
26457 | 218 | 9 | 13 | 1977 | 1 | PF | M | 13.0 | 4.0 | Perognathus | flavus | Rodent | Spectab exclosure |
31678 | 5346 | 2 | 22 | 1982 | 21 | PF | F | 14.0 | 4.0 | Perognathus | flavus | Rodent | Long-term Krat Exclosure |
32114 | 9937 | 2 | 16 | 1985 | 21 | RM | M | 16.0 | 4.0 | Reithrodontomys | megalotis | Rodent | Long-term Krat Exclosure |
32808 | 10119 | 3 | 17 | 1985 | 10 | RM | M | 16.0 | 4.0 | Reithrodontomys | megalotis | Rodent | Rodent Exclosure |
33347 | 9790 | 1 | 19 | 1985 | 16 | RM | F | 16.0 | 4.0 | Reithrodontomys | megalotis | Rodent | Rodent Exclosure |
33760 | 9823 | 1 | 19 | 1985 | 23 | RM | M | 16.0 | 4.0 | Reithrodontomys | megalotis | Rodent | Rodent Exclosure |
34396 | 10439 | 5 | 24 | 1985 | 7 | RM | M | 16.0 | 4.0 | Reithrodontomys | megalotis | Rodent | Rodent Exclosure |
As before, this can be combined with selection of a particular set of columns.
surveys.loc[surveys['weight'] < 5, ['weight', 'species']]
weight | species | |
---|---|---|
2428 | 4.0 | flavus |
2453 | 4.0 | flavus |
4253 | 4.0 | flavus |
4665 | 4.0 | megalotis |
6860 | 4.0 | megalotis |
21224 | 4.0 | flavus |
21674 | 4.0 | penicillatus |
24191 | 4.0 | megalotis |
24200 | 4.0 | megalotis |
25529 | 4.0 | megalotis |
26457 | 4.0 | flavus |
31678 | 4.0 | flavus |
32114 | 4.0 | megalotis |
32808 | 4.0 | megalotis |
33347 | 4.0 | megalotis |
33760 | 4.0 | megalotis |
34396 | 4.0 | megalotis |
A single expression can also be used to filter on several criteria, either matching all criteria (&
) or any criterion (|
). These special operators are used instead of and
and or
to make sure that the comparison is performed for each row in the data frame. Parentheses are added to indicate the precedence of the comparisons.
# AND = &
surveys.loc[(surveys['taxa'] == 'Rodent') & (surveys['sex'] == 'F'), ['taxa', 'sex']].head()
taxa | sex | |
---|---|---|
20 | Rodent | F |
21 | Rodent | F |
22 | Rodent | F |
23 | Rodent | F |
24 | Rodent | F |
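As an aside, Python's plain and and or cannot combine boolean series element-wise; pandas raises an error because the truth value of a whole series is ambiguous. A minimal sketch with a toy series (not the survey data):

```python
import pandas as pd

s = pd.Series([1, 4, 6])

# Element-wise AND with & works as expected
mask = (s > 2) & (s < 5)
print(mask.tolist())  # [False, True, False]

# Plain `and` raises ValueError, because a whole series
# has no single True/False value
try:
    (s > 2) and (s < 5)
except ValueError as e:
    print('ValueError:', e)
```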
To increase readability, these statements can be split across multiple lines. Anything that is within parentheses or brackets in Python can be continued on the next line. When inside brackets or parentheses, the indentation is not significant to the Python interpreter, but it is still recommended in order to make the code more readable.
surveys.loc[(surveys['taxa'] == 'Rodent') &
(surveys['sex'] == 'F'),
['taxa', 'sex']].head()
taxa | sex | |
---|---|---|
20 | Rodent | F |
21 | Rodent | F |
22 | Rodent | F |
23 | Rodent | F |
24 | Rodent | F |
With the |
operator, rows matching either of the supplied criteria are returned.
# OR = |
surveys.loc[(surveys['species'] == 'clarki') |
(surveys['species'] == 'leucophrys'),
'species']
10603 leucophrys 24480 clarki 34045 leucophrys Name: species, dtype: object
Challenge¶
Subset the
surveys
data to include individuals collected before 1995 and retain only the columns year
, sex
, and weight
.
A frequent operation when working with data is to create new columns based on the values in existing columns, for example to do unit conversions or to find the ratio of values in two columns. To create a new column with the weight in kilograms instead of grams:
surveys['weight_kg'] = surveys['weight'] / 1000
surveys.head(10)
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | genus | species | taxa | plot_type | weight_kg | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 7 | 16 | 1977 | 2 | NL | M | 32.0 | NaN | Neotoma | albigula | Rodent | Control | NaN |
1 | 72 | 8 | 19 | 1977 | 2 | NL | M | 31.0 | NaN | Neotoma | albigula | Rodent | Control | NaN |
2 | 224 | 9 | 13 | 1977 | 2 | NL | NaN | NaN | NaN | Neotoma | albigula | Rodent | Control | NaN |
3 | 266 | 10 | 16 | 1977 | 2 | NL | NaN | NaN | NaN | Neotoma | albigula | Rodent | Control | NaN |
4 | 349 | 11 | 12 | 1977 | 2 | NL | NaN | NaN | NaN | Neotoma | albigula | Rodent | Control | NaN |
5 | 363 | 11 | 12 | 1977 | 2 | NL | NaN | NaN | NaN | Neotoma | albigula | Rodent | Control | NaN |
6 | 435 | 12 | 10 | 1977 | 2 | NL | NaN | NaN | NaN | Neotoma | albigula | Rodent | Control | NaN |
7 | 506 | 1 | 8 | 1978 | 2 | NL | NaN | NaN | NaN | Neotoma | albigula | Rodent | Control | NaN |
8 | 588 | 2 | 18 | 1978 | 2 | NL | M | NaN | 218.0 | Neotoma | albigula | Rodent | Control | 0.218 |
9 | 661 | 3 | 11 | 1978 | 2 | NL | NaN | NaN | NaN | Neotoma | albigula | Rodent | Control | NaN |
The first few rows of the output are full of NaN values. To remove those rows, use the dropna()
method of the data frame.
surveys.dropna().head(10)
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | genus | species | taxa | plot_type | weight_kg | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
11 | 845 | 5 | 6 | 1978 | 2 | NL | M | 32.0 | 204.0 | Neotoma | albigula | Rodent | Control | 0.204 |
13 | 1164 | 8 | 5 | 1978 | 2 | NL | M | 34.0 | 199.0 | Neotoma | albigula | Rodent | Control | 0.199 |
14 | 1261 | 9 | 4 | 1978 | 2 | NL | M | 32.0 | 197.0 | Neotoma | albigula | Rodent | Control | 0.197 |
17 | 1756 | 4 | 29 | 1979 | 2 | NL | M | 33.0 | 166.0 | Neotoma | albigula | Rodent | Control | 0.166 |
18 | 1818 | 5 | 30 | 1979 | 2 | NL | M | 32.0 | 184.0 | Neotoma | albigula | Rodent | Control | 0.184 |
19 | 1882 | 7 | 4 | 1979 | 2 | NL | M | 32.0 | 206.0 | Neotoma | albigula | Rodent | Control | 0.206 |
20 | 2133 | 10 | 25 | 1979 | 2 | NL | F | 33.0 | 274.0 | Neotoma | albigula | Rodent | Control | 0.274 |
21 | 2184 | 11 | 17 | 1979 | 2 | NL | F | 30.0 | 186.0 | Neotoma | albigula | Rodent | Control | 0.186 |
22 | 2406 | 1 | 16 | 1980 | 2 | NL | F | 33.0 | 184.0 | Neotoma | albigula | Rodent | Control | 0.184 |
24 | 3000 | 5 | 18 | 1980 | 2 | NL | F | 31.0 | 87.0 | Neotoma | albigula | Rodent | Control | 0.087 |
By default, .dropna()
removes all rows that have an NA value in any of the columns. There are parameters that control how the rows are dropped and which columns should be searched for NAs.
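For instance, the subset parameter restricts which columns are searched for NAs, and how='all' only drops rows where every value is missing. A sketch on a toy data frame (not the survey data):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1, np.nan, np.nan],
                   'b': [4, 5, np.nan]})

# Only consider column 'a' when looking for NAs
print(df.dropna(subset=['a']))  # keeps just the first row

# Only drop rows where all columns are NA
print(df.dropna(how='all'))     # drops just the last row
```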
A common alternative to removing rows containing NA
values is to fill in the missing values with, for example, the mean of all observations or the previous non-NA value. This can be done with the fillna()
method.
surveys['hindfoot_length'].head()
0 32.0 1 31.0 2 NaN 3 NaN 4 NaN Name: hindfoot_length, dtype: float64
# Fill with mean value
fill_value = surveys['hindfoot_length'].mean()
surveys['hindfoot_length'].fillna(fill_value).head()
0 32.000000 1 31.000000 2 29.287932 3 29.287932 4 29.287932 Name: hindfoot_length, dtype: float64
# Fill with previous non-null value
surveys['hindfoot_length'].fillna(method='ffill').head()
0 32.0 1 31.0 2 31.0 3 31.0 4 31.0 Name: hindfoot_length, dtype: float64
Whether to use dropna()
or fillna()
depends on the data set and the purpose of the analysis. It is also possible to interpolate missing values from neighboring values via the interpolate()
method, which can be useful when working with time series.
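As a quick illustration on a toy series (linear interpolation is the default):

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, np.nan, 4.0])

# Fill the gaps by interpolating linearly between neighboring values
print(s.interpolate())  # 1.0, 2.0, 3.0, 4.0
```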
Challenge¶
- Create a new data frame from the
surveys
data that contains only the species_id
and hindfoot_length
columns and no NA values.
- Add a column to this new data frame called
hindfoot_half
, which contains values that are half the hindfoot_length
values. Keep only the observations that have a value less than 30 in hindfoot_half
.
- The final data frame should have 31,436 rows and 3 columns. How can you check that your data frame meets these criteria?
Many data analysis tasks can be approached using the split-apply-combine paradigm: split the data into groups, apply some analysis to each group, and then combine the results.
pandas
facilitates this workflow through the use of groupby()
to split data and summary/aggregation functions such as mean()
, which collapses each group into a single-row summary of that group. The arguments to groupby()
are the column names of the categorical variables by which the summary statistics should be calculated. To start, compute the mean weight
per species.
Image credit Jake VanderPlas
mean()
to summarize categorical data¶The .mean()
method can be used to calculate the average of each group. When the mean is computed, the default behavior is to ignore NA values, so they only need to be dropped if they are to be excluded from the visual output.
surveys.groupby('species')['weight'].mean()
species albigula 159.245660 audubonii NaN baileyi 31.735943 bilineata NaN brunneicapillus NaN chlorurus NaN clarki NaN eremicus 21.586508 flavus 7.923127 fulvescens 13.386667 fulviventer 58.878049 fuscus NaN gramineus NaN harrisi NaN hispidus 65.563953 intermedius 19.250000 leucogaster 31.575258 leucophrys NaN leucopus 19.138889 maniculatus 21.364155 megalotis 10.585010 melanocorys NaN merriami 43.157864 montanus 10.250000 ochrognathus 55.414634 ordii 48.870523 penicillatus 17.173942 savannarum NaN scutalatus NaN sp. 19.500000 spectabilis 120.130546 spilosoma 93.500000 squamata NaN taylori 8.600000 tereticaudus NaN tigris NaN torridus 24.230556 undulatus NaN uniparens NaN viridis NaN Name: weight, dtype: float64
The output here is a series indexed by the grouped variable (the species), and the single column contains the result of the aggregation (the mean weight). Since there are so many species, a subset of them will be selected to fit the output on the screen and facilitate instruction. This could be done by manually typing out each OR (|
) condition.
surveys.loc[(surveys['species'] == 'albigula') |
(surveys['species'] == 'ordii') |
(surveys['species'] == 'flavus') |
(surveys['species'] == 'torridus')].shape
(8125, 14)
Comparing this number with the number of rows in the original data frame shows that it was filtered successfully.
surveys.shape
(34786, 14)
However, it is rather tedious to type out the species names by hand and to do one comparison per species. Instead, the names of the species of interest can be extracted by putting them in a list and using the isin()
method.
species_to_keep = ['albigula', 'ordii', 'flavus', 'torridus']
surveys_sub = surveys.loc[surveys['species'].isin(species_to_keep)]
surveys_sub.shape
(8125, 14)
avg_wt_spec = surveys_sub.groupby('species')['weight'].mean()
avg_wt_spec
species albigula 159.245660 flavus 7.923127 ordii 48.870523 torridus 24.230556 Name: weight, dtype: float64
Individual species can be selected from the resulting series using loc[]
, just as previously.
avg_wt_spec.loc[['ordii', 'albigula']]
species ordii 48.870523 albigula 159.245660 Name: weight, dtype: float64
Single []
without .loc[]
could also be used to extract rows by label from a series, but since single []
extracts columns from a data frame, it can be easier to be explicit and use .loc[]
instead of keeping track of whether the returned object is a series or a data frame.
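To illustrate the difference with a toy example (the values here are made up for illustration):

```python
import pandas as pd

weights = pd.Series([48.87, 159.25], index=['ordii', 'albigula'])

# On a series, single [] and .loc[] both select by label...
print(weights['ordii'])      # 48.87
print(weights.loc['ordii'])  # 48.87

# ...but on a data frame, single [] selects a *column* instead of a row
df = weights.to_frame(name='weight')
print(df['weight'])          # the whole 'weight' column
```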
Groups can also be created from multiple columns. For example, it could be interesting to see the difference in weight between males and females within each species.
avg_wt_spec_sex = surveys_sub.groupby(['species', 'sex'])['weight'].mean()
avg_wt_spec_sex
species sex albigula F 154.282209 M 165.652893 flavus F 7.974394 M 7.885287 ordii F 48.531250 M 49.135102 torridus F 24.830904 M 23.695382 Name: weight, dtype: float64
The returned series has an index that is a combination of the columns species
and sex
, and is referred to as a MultiIndex
. The same syntax as before can be used to select rows at the species level.
avg_wt_spec_sex.loc[['ordii', 'albigula']]
species sex albigula F 154.282209 M 165.652893 ordii F 48.531250 M 49.135102 Name: weight, dtype: float64
To select specific values from both levels of the MultiIndex
, a list of tuples can be passed to loc[]
.
avg_wt_spec_sex.loc[[('ordii', 'F'), ('albigula', 'M')]]
species sex ordii F 48.531250 albigula M 165.652893 Name: weight, dtype: float64
To select only the female observations from all species, the xs()
(cross section) method can be used.
avg_wt_spec_sex.xs('F', level='sex')
species albigula 154.282209 flavus 7.974394 ordii 48.531250 torridus 24.830904 Name: weight, dtype: float64
The names and values of the index levels can be seen by inspecting the index object.
avg_wt_spec_sex.index
MultiIndex(levels=[['albigula', 'flavus', 'ordii', 'torridus'], ['F', 'M']], labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]], names=['species', 'sex'])
Although MultiIndexes offer succinct and fast ways to access data, they also require memorization of additional syntax and are, strictly speaking, not essential unless speed is of particular concern. It can therefore be easier to reset the index, so that all values are stored in columns.
avg_wt_spec_sex_res = avg_wt_spec_sex.reset_index()
avg_wt_spec_sex_res
species | sex | weight | |
---|---|---|---|
0 | albigula | F | 154.282209 |
1 | albigula | M | 165.652893 |
2 | flavus | F | 7.974394 |
3 | flavus | M | 7.885287 |
4 | ordii | F | 48.531250 |
5 | ordii | M | 49.135102 |
6 | torridus | F | 24.830904 |
7 | torridus | M | 23.695382 |
After resetting the index, the same comparison syntax introduced earlier can be used instead of xs()
or passing lists of tuples to loc[]
.
female_weights = avg_wt_spec_sex_res.loc[avg_wt_spec_sex_res['sex'] == 'F']
female_weights
species | sex | weight | |
---|---|---|---|
0 | albigula | F | 154.282209 |
2 | flavus | F | 7.974394 |
4 | ordii | F | 48.531250 |
6 | torridus | F | 24.830904 |
reset_index()
grants the freedom of not having to work with indexes, but it is still worth keeping in mind that selecting on an index level with xs()
can be orders of magnitude faster than using boolean comparisons (on large data frames).
The opposite operation (to create an index) can be performed with set_index()
on any column (or combination of columns) that creates an index with unique values.
female_weights.set_index(['species', 'sex'])
weight | ||
---|---|---|
species | sex | |
albigula | F | 154.282209 |
flavus | F | 7.974394 |
ordii | F | 48.531250 |
torridus | F | 24.830904 |
Since the same grouped data frame will be used in multiple code chunks below, it can be assigned to a new variable instead of typing out the grouping expression each time.
grouped_surveys = surveys_sub.groupby(['species', 'sex'])
grouped_surveys['weight'].mean()
species sex albigula F 154.282209 M 165.652893 flavus F 7.974394 M 7.885287 ordii F 48.531250 M 49.135102 torridus F 24.830904 M 23.695382 Name: weight, dtype: float64
Other aggregation methods, such as the standard deviation, are called with the same syntax.
grouped_surveys['weight'].std()
species sex albigula F 39.186546 M 48.991563 flavus F 1.686060 M 1.587815 ordii F 8.244255 M 7.649716 torridus F 4.908162 M 3.990268 Name: weight, dtype: float64
Instead of using the mean()
method, the more general agg()
method can be called to aggregate (or summarize) by any available aggregation function. The equivalent of the mean()
method would be to call agg()
and specify 'mean'
.
grouped_surveys['weight'].agg('mean')
species sex albigula F 154.282209 M 165.652893 flavus F 7.974394 M 7.885287 ordii F 48.531250 M 49.135102 torridus F 24.830904 M 23.695382 Name: weight, dtype: float64
This general approach is more flexible and powerful, since multiple aggregation functions can be applied in the same line of code by passing them as a list to agg()
. For instance, the standard deviation and mean can be computed in the same call.
grouped_surveys['weight'].agg(['mean', 'std'])
mean | std | ||
---|---|---|---|
species | sex | ||
albigula | F | 154.282209 | 39.186546 |
M | 165.652893 | 48.991563 | |
flavus | F | 7.974394 | 1.686060 |
M | 7.885287 | 1.587815 | |
ordii | F | 48.531250 | 8.244255 |
M | 49.135102 | 7.649716 | |
torridus | F | 24.830904 | 4.908162 |
M | 23.695382 | 3.990268 |
The returned output is in this case a data frame and the MultiIndex
is indicated in bold font.
By passing a dictionary to .agg()
, it is possible to apply different aggregations to different columns. Long code statements can be broken into multiple lines if they are enclosed by parentheses, brackets, or braces, something that will be described in more detail later.
grouped_surveys[['weight', 'hindfoot_length']].agg(
{'weight': 'sum',
'hindfoot_length': ['min', 'max']
}
)
weight | hindfoot_length | |||
---|---|---|---|---|
sum | min | max | ||
species | sex | |||
albigula | F | 100592.0 | 21.0 | 38.0 |
M | 80176.0 | 21.0 | 42.0 | |
flavus | F | 5917.0 | 7.0 | 38.0 |
M | 6324.0 | 9.0 | 22.0 | |
ordii | F | 62120.0 | 26.0 | 64.0 |
M | 79648.0 | 27.0 | 58.0 | |
torridus | F | 25551.0 | 13.0 | 50.0 |
M | 26681.0 | 13.0 | 29.0 |
There are plenty of aggregation methods available in pandas (e.g. sem
, mad
, sum
); the most common are listed in the table at the end of this subsection of the documentation, and all of them can be found via tab-completion on the data frame (surveys.<tab>
).
Even if a function is not part of the pandas
library, it can be passed to agg()
.
import numpy as np
grouped_surveys['weight'].agg(np.mean)
species sex albigula F 154.282209 M 165.652893 flavus F 7.974394 M 7.885287 ordii F 48.531250 M 49.135102 torridus F 24.830904 M 23.695382 Name: weight, dtype: float64
Any function can be passed like this, including user-created functions.
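For example, a user-defined function computing the range (the largest value minus the smallest) can be passed to agg() just like a built-in aggregation. A sketch on a toy data frame, not the survey data:

```python
import pandas as pd

def value_range(x):
    """Return the difference between the largest and smallest value."""
    return x.max() - x.min()

df = pd.DataFrame({'group': ['a', 'a', 'b', 'b'],
                   'value': [1, 5, 2, 8]})

# The custom function is applied to each group's 'value' column
print(df.groupby('group')['value'].agg(value_range))
# a    4
# b    6
```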
Challenge¶
- Use
groupby()
and agg()
to find the mean, min, and max hindfoot length for each species.
size()
to summarize categorical data¶When working with data, it is common to want to know the number of observations present for each categorical variable. For this, pandas
provides the size()
method. For example, to group by 'taxa' and find the number of observations for each 'taxa':
# Note that the original full length data frame is used here again
surveys.groupby('taxa').size()
taxa Bird 450 Rabbit 75 Reptile 14 Rodent 34247 dtype: int64
size()
can also be used when grouping on multiple variables.
surveys.groupby(['taxa', 'plot_type']).size()
taxa plot_type Bird Control 110 Long-term Krat Exclosure 67 Rodent Exclosure 118 Short-term Krat Exclosure 131 Spectab exclosure 24 Rabbit Control 29 Long-term Krat Exclosure 17 Rodent Exclosure 24 Short-term Krat Exclosure 2 Spectab exclosure 3 Reptile Control 6 Long-term Krat Exclosure 3 Rodent Exclosure 1 Short-term Krat Exclosure 4 Rodent Control 15466 Long-term Krat Exclosure 5031 Rodent Exclosure 4090 Short-term Krat Exclosure 5769 Spectab exclosure 3891 dtype: int64
If there are many groups, size()
is not that useful on its own. For example, it is difficult to quickly find the five most abundant species among the observations.
surveys.groupby('species').size()
species albigula 1252 audubonii 75 baileyi 2891 bilineata 303 brunneicapillus 50 chlorurus 39 clarki 1 eremicus 1299 flavus 1597 fulvescens 75 fulviventer 43 fuscus 5 gramineus 8 harrisi 437 hispidus 179 intermedius 9 leucogaster 1006 leucophrys 2 leucopus 36 maniculatus 899 megalotis 2609 melanocorys 13 merriami 10596 montanus 8 ochrognathus 43 ordii 3027 penicillatus 3123 savannarum 2 scutalatus 1 sp. 86 spectabilis 2504 spilosoma 248 squamata 16 taylori 46 tereticaudus 1 tigris 1 torridus 2249 undulatus 5 uniparens 1 viridis 1 dtype: int64
Since there are many rows in this output, it would be beneficial to sort the table values and display the most abundant species first. This is easy to do with the sort_values()
method.
surveys.groupby('species').size().sort_values()
species viridis 1 uniparens 1 scutalatus 1 tereticaudus 1 tigris 1 clarki 1 leucophrys 2 savannarum 2 undulatus 5 fuscus 5 gramineus 8 montanus 8 intermedius 9 melanocorys 13 squamata 16 leucopus 36 chlorurus 39 ochrognathus 43 fulviventer 43 taylori 46 brunneicapillus 50 fulvescens 75 audubonii 75 sp. 86 hispidus 179 spilosoma 248 bilineata 303 harrisi 437 maniculatus 899 leucogaster 1006 albigula 1252 eremicus 1299 flavus 1597 torridus 2249 spectabilis 2504 megalotis 2609 baileyi 2891 ordii 3027 penicillatus 3123 merriami 10596 dtype: int64
That's better, but it could be helpful to display the most abundant species on top. In other words, the output should be arranged in descending order.
surveys.groupby('species').size().sort_values(ascending=False).head(5)
species merriami 10596 penicillatus 3123 ordii 3027 baileyi 2891 megalotis 2609 dtype: int64
Looks good! By now, the code statement has grown quite long because many methods have been chained together. It can be tricky to keep track of what is going on in long method chains. To make the code more readable, it can be broken up into multiple lines by adding surrounding parentheses.
(surveys
.groupby('species')
.size()
.sort_values(ascending=False)
.head(5)
)
species merriami 10596 penicillatus 3123 ordii 3027 baileyi 2891 megalotis 2609 dtype: int64
This looks neater and makes long method chains easier to read. There is no absolute rule for when to break code into multiple lines, but always try to write code that is easy for collaborators (your most common collaborator is a future version of yourself!) to understand.
pandas
actually has a convenience method for returning the top five results, so the values don't need to be sorted explicitly.
(surveys
.groupby(['species'])
.size()
.nlargest() # the default is 5
)
species merriami 10596 penicillatus 3123 ordii 3027 baileyi 2891 megalotis 2609 dtype: int64
To include more attributes about these species, add columns to groupby()
.
(surveys
.groupby(['species', 'taxa', 'genus'])
.size()
.nlargest()
)
species taxa genus merriami Rodent Dipodomys 10596 penicillatus Rodent Chaetodipus 3123 ordii Rodent Dipodomys 3027 baileyi Rodent Chaetodipus 2891 megalotis Rodent Reithrodontomys 2609 dtype: int64
Challenge¶
How many individuals were caught in each
plot_type
surveyed?Calculate the number of animals trapped per plot type for each year. Extract the combinations of year and plot type that had the three highest number of observations (e.g. "1998-Control").
Commonly, several data frames (e.g. from different CSV files) need to be combined into one big data frame. Either rows or columns from different data frames can be combined, resulting in a data frame that is longer or wider than the originals, respectively.
First, the syntax for subsetting data frames will be used to create the partial data frames that will later be joined together.
wt_avg = surveys_sub.groupby('species')['weight'].mean().reset_index()
wt_avg
species | weight | |
---|---|---|
0 | albigula | 159.245660 |
1 | flavus | 7.923127 |
2 | ordii | 48.870523 |
3 | torridus | 24.230556 |
wt_avg_top = wt_avg[:2]
wt_avg_top
species | weight | |
---|---|---|
0 | albigula | 159.245660 |
1 | flavus | 7.923127 |
wt_avg_bot = wt_avg[2:]
wt_avg_bot
species | weight | |
---|---|---|
2 | ordii | 48.870523 |
3 | torridus | 24.230556 |
The pandas
function concat()
can be used to join the rows from these two data frames into one longer data frame.
wt_avg_full = pd.concat([wt_avg_top, wt_avg_bot])
wt_avg_full
species | weight | |
---|---|---|
0 | albigula | 159.245660 |
1 | flavus | 7.923127 |
2 | ordii | 48.870523 |
3 | torridus | 24.230556 |
The first argument to concat()
is a list of the data frames to concatenate, and can consist of more than two data frames.
frames = [wt_avg_top, wt_avg_bot, wt_avg_top]
pd.concat(frames)
species | weight | |
---|---|---|
0 | albigula | 159.245660 |
1 | flavus | 7.923127 |
2 | ordii | 48.870523 |
3 | torridus | 24.230556 |
0 | albigula | 159.245660 |
1 | flavus | 7.923127 |
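Note that the row labels from the original data frames are retained, so the index above contains duplicates. If a fresh consecutive index is preferred, ignore_index=True can be passed to concat(). A sketch on toy data frames:

```python
import pandas as pd

top = pd.DataFrame({'x': [1, 2]})  # index 0, 1
bot = pd.DataFrame({'x': [3, 4]})  # index 0, 1 as well

# Without ignore_index, the duplicated labels 0 and 1 are kept
print(pd.concat([top, bot]).index.tolist())                     # [0, 1, 0, 1]

# With ignore_index=True, a new consecutive index is created
print(pd.concat([top, bot], ignore_index=True).index.tolist())  # [0, 1, 2, 3]
```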
Concatenating rows from separate data sets is very useful when new data have been recorded and need to be added to the existing data. Although concat()
can also be used to add columns from different data frames together, the merge()
function provides a more powerful interface for this operation. Its syntax is inspired by how these operations are performed in (SQL-like) databases.
Since merging data frames joins columns together, let's first create a new data frame with a column for the median weight per species to be merged with the mean weight data frame.
wt_med = surveys_sub.groupby('species')['weight'].median().reset_index()
wt_med
species | weight | |
---|---|---|
0 | albigula | 164.0 |
1 | flavus | 8.0 |
2 | ordii | 50.0 |
3 | torridus | 24.0 |
Merging only works with two data frames at a time, and they are specified as the first and second argument to merge()
. By default, all columns with a common name in both data frames will be used to figure out how the rows should be added together. Here, only the species
(and not the weight
) column should be used to determine how to merge the data frames, which can be specified with the on
parameter.
merged_df = pd.merge(wt_avg, wt_med, on='species')
merged_df
species | weight_x | weight_y | |
---|---|---|---|
0 | albigula | 159.245660 | 164.0 |
1 | flavus | 7.923127 | 8.0 |
2 | ordii | 48.870523 | 50.0 |
3 | torridus | 24.230556 | 24.0 |
Since the column weight
exists in both the individual data frames, pandas
appends _x
and _y
to differentiate between the columns in the final data frame. These columns could be renamed to better indicate what their values represent.
merged_df.rename(columns={'weight_x': 'mean_wt', 'weight_y': 'median_wt'})
species | mean_wt | median_wt | |
---|---|---|---|
0 | albigula | 159.245660 | 164.0 |
1 | flavus | 7.923127 | 8.0 |
2 | ordii | 48.870523 | 50.0 |
3 | torridus | 24.230556 | 24.0 |
For clarity, the rename()
method could be used on the two individual data frames before merging.
It is important to understand that merge()
uses the specified column to match which rows to merge from the different data frames, so the order of the values in each data frame does not matter. Likewise, merge()
can handle species that occur in only one of the two data frames.
wt_avg.loc[2, 'species'] = 't_rex'
wt_avg
species | weight | |
---|---|---|
0 | albigula | 159.245660 |
1 | flavus | 7.923127 |
2 | t_rex | 48.870523 |
3 | torridus | 24.230556 |
pd.merge(wt_avg, wt_med, on='species')
species | weight_x | weight_y | |
---|---|---|---|
0 | albigula | 159.245660 | 164.0 |
1 | flavus | 7.923127 | 8.0 |
2 | torridus | 24.230556 | 24.0 |
By default, if the column(s) to join on contains values that are not present in both data frames, rows with those values will not be included in the resulting data frame. This default behavior is referred to as an 'inner' join. To include unique values, an 'outer' join should be performed instead.
pd.merge(wt_avg, wt_med, on='species', how='outer')
species | weight_x | weight_y | |
---|---|---|---|
0 | albigula | 159.245660 | 164.0 |
1 | flavus | 7.923127 | 8.0 |
2 | t_rex | 48.870523 | NaN |
3 | torridus | 24.230556 | 24.0 |
4 | ordii | NaN | 50.0 |
The NaN
values are introduced because these rows were only present in one of the joined data frames, and thus only have a value for one of the columns weight_x
or weight_y
.