loc()
, head()
, info()
, describe()
, shape()
, columns()
, and index()
.
Python can be used as a calculator. Mathematical calculations use familiar operators such as +, -, /, and *.
2 + 2
4
6 * 7
42
4 / 3
1.3333333333333333
Text prefaced with a #
is called a "comment". These are notes to people reading the code, so they will be ignored by the Python interpreter.
# `**` means "to the power of"
2 ** 3
8
Values can be given a nickname; this is called assigning values to variables, and it is handy when the same value will be used multiple times. The assignment operator in Python is =.
a = 5
a * 2
10
A variable can be named almost anything. It is recommended to separate multiple words with underscores and start the variable name with a letter, not a number or symbol.
new_variable = 4
a - new_variable
1
Variables can hold different types of data, not just numbers. One example is a sequence of characters surrounded by single or double quotation marks (called a string). In Python, strings can be joined by adding them together:
b = 'Hello'
c = 'universe'
b + c
'Hellouniverse'
A space can be added to separate the words.
b + ' ' + c
'Hello universe'
To find out what type a variable is, the built-in function type()
can be used. In essence, a function can be passed input values, follows a set of instructions with how to operate on the input, and then outputs the result. This is analogous to following a recipe: the ingredients are input, the recipe specifies the set of instructions, and the output is the finished dish.
type(a)
int
int
stands for "integer", which is the type of any number without a decimal component.
To be reminded of the value of a
, the variable name can be typed into an empty code cell.
a
5
A code cell will only output its last value. To see more than one value per code cell, the built-in function print() can be used. When Python is run non-interactively, for example when a set of instructions is executed together as a script rather than in the JupyterLab Notebook, print() is often the preferred way of displaying output.
print(a)
type(a)
5
int
Numbers with a decimal component are referred to as floats.
type(3.14)
float
Text is of the type str
, which stands for "string". Strings hold sequences of characters, which can be letters, numbers, punctuation or more exotic forms of text (even emoji!).
print(type(b))
b
<class 'str'>
'Hello'
The output from type()
is formatted slightly differently when it is printed.
Python also supports comparison and logic operators (<, >, ==, !=, <=, >=, and, or, not), which return either True or False.
3 > 4
False
not
reverses the outcome from a comparison.
not 3 > 4
True
and
checks if both comparisons are True
.
3 > 4 and 5 > 1
False
or checks if at least one of the comparisons is True.
3 > 4 or 5 > 1
True
The type of the resulting True
or False
value is called "boolean".
type(True)
bool
Boolean comparisons like these are important when extracting specific values from a larger set of values. This use case will be explored in detail later in this material.
Another common use of boolean comparisons is in conditional statements, where the code after the comparison is only executed if the comparison is True.
if a == 4:
    print('a is 4')
else:
    print('a is not 4')
a is not 4
a
5
Note that the second line in the example above is indented. Indentation is very important in Python, and the Python interpreter uses it to understand that the code in the indented block will only be executed if the conditional statement above is True.
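As a small sketch of how indentation groups statements, every line indented under a condition belongs to that branch, and a line that is not indented runs regardless. The variable name temperature and the thresholds are made up for illustration, and elif (short for "else if") is used to add a second check:

```python
# Hypothetical value, used only for illustration
temperature = 12

if temperature > 25:
    # Both indented lines run only when this condition is True
    label = 'warm'
    print('It is warm')
elif temperature > 10:
    # Checked only if the first condition was False
    label = 'mild'
    print('It is mild')
else:
    label = 'cold'
    print('It is cold')

# This line is not indented, so it runs no matter which branch was taken
print('Done checking:', label)
```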
numbers = [1, 2, 3]
numbers[0]
1
You can index from the end of the list by prefixing the position with a minus sign.
numbers[-1]
3
A loop can be used to access the elements in a list or other Python data structure one at a time.
for num in numbers:
    print(num)
1
2
3
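A loop body can do more than print. As a minimal sketch, the indented line below runs once per element and builds up a running sum; the variable name total is chosen here for illustration:

```python
numbers = [1, 2, 3]

total = 0
for num in numbers:
    # This indented line is executed once for every element in the list
    total = total + num

# Not indented, so it runs once, after the loop has finished
print(total)
```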
To add elements to the end of a list, we can use the append
method. Methods are a way to interact with an object (a list, for example). We can invoke a method using the dot .
followed by the method name and a list of arguments in parentheses. Let's look at an example using append
:
numbers.append(4)
numbers
[1, 2, 3, 4]
To find out more about an object, such as its type and documentation, we can use the ? command in the Notebook.
?numbers
Type:        list
String form: [1, 2, 3, 4]
Length:      4
Docstring:
list() -> new empty list
list(iterable) -> new list initialized from iterable's items
A tuple is similar to a list in that it's an ordered sequence of elements. However, tuples cannot be changed once created (they are "immutable"). Tuples are created by separating values with a comma (and for clarity these are commonly surrounded by parentheses).
# Tuples use parentheses
a_tuple = (1, 2, 3)
another_tuple = ('blue', 'green', 'red')
Challenge - Tuples
- What happens when you type a_tuple[2] = 5 vs a_list[1] = 5?
- Type type(a_tuple) into Python - what is the object type?
A dictionary is a container that holds pairs of objects - keys and values.
translation = {'one': 1, 'two': 2}
translation['one']
1
Dictionaries work a lot like lists - except that they are indexed with keys. Think about a key as a unique identifier for a set of values in the dictionary. Keys can only have particular types - they have to be "hashable". Strings and numeric types are acceptable, but lists aren't.
rev = {1: 'one', 2: 'two'}
rev[1]
'one'
bad = {[1, 2, 3]: 3}
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-29-8169df7fbbea> in <module>()
----> 1 bad = {[1, 2, 3]: 3}

TypeError: unhashable type: 'list'
This generates an error message, commonly referred to as a "traceback". This message pinpoints which line in the code cell resulted in an error when it was executed, by pointing at it with an arrow (---->). This is helpful in figuring out what went wrong.
To add an item to the dictionary, a value is assigned to a new dictionary key.
rev = {1: 'one', 2: 'two'}
rev[3] = 'three'
rev
{1: 'one', 2: 'two', 3: 'three'}
Using loops with dictionaries iterates over the keys by default.
for key in rev:
    print(key, rev[key])
1 one
2 two
3 three
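Looking up rev[key] inside the loop works, but dictionaries also provide methods that hand back the pieces directly. The following sketch uses the standard dictionary methods items(), keys(), and values() on the same rev dictionary:

```python
rev = {1: 'one', 2: 'two', 3: 'three'}

# items() yields each key together with its value
for key, value in rev.items():
    print(key, value)

# keys() and values() give only one half of each pair
all_keys = list(rev.keys())
all_values = list(rev.values())
print(all_keys)
print(all_values)
```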
Challenge - Can you do reassignment in a dictionary?
First check what rev is right now (remember rev is the name of our dictionary).
Try to reassign the second value (in the key value pair) so that it no longer reads "two" but instead reads "apple-sauce".
Now display rev again to see if it has changed.
It is important to note that dictionaries were historically "unordered": before Python 3.7 they did not remember the sequence of their items (i.e. the order in which key:value pairs were added to the dictionary), so the order in which items were returned from loops over dictionaries could appear random. Since Python 3.7, dictionaries preserve insertion order, but code that may run on older versions should not rely on it.
Defining a section of code as a function in Python is done using the def keyword. For example, a function that takes two arguments and returns their sum can be defined as:
def add_function(a, b):
    """This function adds two values together"""
    result = a + b
    return result
z = add_function(20, 22)
z
42
Just as previously, ? can be used to get help for the function.
?add_function
Signature: add_function(a, b)
Docstring: This function adds two values together
File:      ~/proj/uoftcoders/workshops/2018-06-18-utoronto/code/<ipython-input-32-d410f2449e2e>
Type:      function
The string between the """ is what is shown in the help, so it is good to write a helpful message here. It is possible to see the entire source code of the function by using a double ?? (the output can be quite long for complicated functions).
??add_function
Signature: add_function(a, b)
Source:
def add_function(a, b):
    """This function adds two values together"""
    result = a + b
    return result
File:      ~/proj/uoftcoders/workshops/2018-06-18-utoronto/code/<ipython-input-32-d410f2449e2e>
Type:      function
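The ? and ?? shortcuts only work in the Notebook. As a sketch of how the same docstring can be reached from any Python interpreter, the built-in help() function and the __doc__ attribute can be used:

```python
def add_function(a, b):
    """This function adds two values together"""
    result = a + b
    return result

# The docstring is stored on the function object itself
print(add_function.__doc__)

# help() prints the signature together with the docstring
help(add_function)
```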
To access additional functionality in a spreadsheet program, you need to click the menu and select the tool you want to use. All charts are in one menu, text layout tools in another, data analysis tools in a third, and so on. Programming languages such as Python have so many tools and functions that they would not fit in a menu. Instead of clicking File -> Open and choosing the file, you would type something similar to file.open('
Since there are so many esoteric tools and functions available in Python, it is unnecessary to include all of them with the basics that are loaded by default when you start the programming language (it would be as if your new phone came with every single app preinstalled). Instead, more advanced functionality is grouped into separate packages, which can be accessed by typing import <package_name> in Python. You can think of this as telling the program which menu items you want to use (similar to how Excel hides the Developer menu by default, since most people rarely use it, and you need to activate it in the settings if you want to access its functionality). Some packages need to be downloaded before they can be used, just like downloading an add-on to a browser or mobile phone.
Just like in spreadsheet software menus, there are lots of different tools within each Python package. For example, if I want to use numerical Python functions, I can import the numerical python module, numpy
. I can then access any function by writing numpy.<function_name>
.
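The import pattern can be sketched as follows. The standard-library module math is used here as a stand-in (an assumption made so the example runs without installing anything); numpy is imported and used with the same module_name.function_name pattern:

```python
import math

# Functions inside a module are reached with module_name.function_name
root = math.sqrt(16)
print(root)

# Another function from the same module
rounded_down = math.floor(4 / 3)
print(rounded_down)
```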
Today, we will be working with real data from a longitudinal study of the species abundance in the Chihuahuan desert ecosystem near Portal, Arizona, USA. This study includes observations of plants, ants, and rodents from 1977 to 2002, and has been used in over 100 publications. More information is available in the abstract of this paper from 2009. There are several datasets available related to this study, and we will be working with datasets that have been preprocessed by Data Carpentry to facilitate teaching. These are made available online as The Portal Project Teaching Database, both at the Data Carpentry website and on Figshare. Figshare is a great place to openly publish data, code, figures, and more, to make them available for other researchers and to communicate findings that are not part of a longer paper.
We are studying the species and weight of animals caught in plots in our study area. The dataset is stored as a comma separated value (CSV) file. Each row holds information for a single animal, and the columns represent:
Column | Description |
---|---|
record_id | unique id for the observation |
month | month of observation |
day | day of observation |
year | year of observation |
plot_id | ID of a particular plot |
species_id | 2-letter code |
sex | sex of animal ("M", "F") |
hindfoot_length | length of the hindfoot in mm |
weight | weight of the animal in grams |
genus | genus of animal |
species | species of animal |
taxa | e.g. rodent, reptile, bird, rabbit |
plot_type | type of plot |
To read the data into Python, we are going to use a function called read_csv. This function is contained in a Python package called pandas. As mentioned previously, Python packages are a bit like browser extensions: they are not essential, but can provide nifty functionality. To use a package, it first needs to be imported.
# pandas is given the nickname `pd`
import pandas as pd
pandas can read CSV files saved on the computer or directly from a URL.
surveys = pd.read_csv('https://ndownloader.figshare.com/files/2292169')
To view the result, type surveys
in a cell and run it, just as when viewing the content of any variable in Python.
surveys
 | record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | genus | species | taxa | plot_type |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 7 | 16 | 1977 | 2 | NL | M | 32.0 | NaN | Neotoma | albigula | Rodent | Control |
1 | 72 | 8 | 19 | 1977 | 2 | NL | M | 31.0 | NaN | Neotoma | albigula | Rodent | Control |
2 | 224 | 9 | 13 | 1977 | 2 | NL | NaN | NaN | NaN | Neotoma | albigula | Rodent | Control |
3 | 266 | 10 | 16 | 1977 | 2 | NL | NaN | NaN | NaN | Neotoma | albigula | Rodent | Control |
4 | 349 | 11 | 12 | 1977 | 2 | NL | NaN | NaN | NaN | Neotoma | albigula | Rodent | Control |
5 | 363 | 11 | 12 | 1977 | 2 | NL | NaN | NaN | NaN | Neotoma | albigula | Rodent | Control |
6 | 435 | 12 | 10 | 1977 | 2 | NL | NaN | NaN | NaN | Neotoma | albigula | Rodent | Control |
7 | 506 | 1 | 8 | 1978 | 2 | NL | NaN | NaN | NaN | Neotoma | albigula | Rodent | Control |
8 | 588 | 2 | 18 | 1978 | 2 | NL | M | NaN | 218.0 | Neotoma | albigula | Rodent | Control |
9 | 661 | 3 | 11 | 1978 | 2 | NL | NaN | NaN | NaN | Neotoma | albigula | Rodent | Control |
10 | 748 | 4 | 8 | 1978 | 2 | NL | NaN | NaN | NaN | Neotoma | albigula | Rodent | Control |
11 | 845 | 5 | 6 | 1978 | 2 | NL | M | 32.0 | 204.0 | Neotoma | albigula | Rodent | Control |
12 | 990 | 6 | 9 | 1978 | 2 | NL | M | NaN | 200.0 | Neotoma | albigula | Rodent | Control |
13 | 1164 | 8 | 5 | 1978 | 2 | NL | M | 34.0 | 199.0 | Neotoma | albigula | Rodent | Control |
14 | 1261 | 9 | 4 | 1978 | 2 | NL | M | 32.0 | 197.0 | Neotoma | albigula | Rodent | Control |
15 | 1374 | 10 | 8 | 1978 | 2 | NL | NaN | NaN | NaN | Neotoma | albigula | Rodent | Control |
16 | 1453 | 11 | 5 | 1978 | 2 | NL | M | NaN | 218.0 | Neotoma | albigula | Rodent | Control |
17 | 1756 | 4 | 29 | 1979 | 2 | NL | M | 33.0 | 166.0 | Neotoma | albigula | Rodent | Control |
18 | 1818 | 5 | 30 | 1979 | 2 | NL | M | 32.0 | 184.0 | Neotoma | albigula | Rodent | Control |
19 | 1882 | 7 | 4 | 1979 | 2 | NL | M | 32.0 | 206.0 | Neotoma | albigula | Rodent | Control |
20 | 2133 | 10 | 25 | 1979 | 2 | NL | F | 33.0 | 274.0 | Neotoma | albigula | Rodent | Control |
21 | 2184 | 11 | 17 | 1979 | 2 | NL | F | 30.0 | 186.0 | Neotoma | albigula | Rodent | Control |
22 | 2406 | 1 | 16 | 1980 | 2 | NL | F | 33.0 | 184.0 | Neotoma | albigula | Rodent | Control |
23 | 2728 | 3 | 9 | 1980 | 2 | NL | F | NaN | NaN | Neotoma | albigula | Rodent | Control |
24 | 3000 | 5 | 18 | 1980 | 2 | NL | F | 31.0 | 87.0 | Neotoma | albigula | Rodent | Control |
25 | 3002 | 5 | 18 | 1980 | 2 | NL | F | 33.0 | 174.0 | Neotoma | albigula | Rodent | Control |
26 | 4667 | 7 | 8 | 1981 | 2 | NL | F | 30.0 | 130.0 | Neotoma | albigula | Rodent | Control |
27 | 4859 | 10 | 1 | 1981 | 2 | NL | M | 34.0 | 208.0 | Neotoma | albigula | Rodent | Control |
28 | 5048 | 11 | 23 | 1981 | 2 | NL | M | 34.0 | 192.0 | Neotoma | albigula | Rodent | Control |
29 | 5180 | 1 | 1 | 1982 | 2 | NL | M | NaN | 206.0 | Neotoma | albigula | Rodent | Control |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
34756 | 21209 | 11 | 13 | 1993 | 7 | CU | NaN | NaN | NaN | Cnemidophorus | uniparens | Reptile | Rodent Exclosure |
34757 | 25710 | 5 | 10 | 1997 | 7 | PB | M | 26.0 | 31.0 | Chaetodipus | baileyi | Rodent | Rodent Exclosure |
34758 | 26042 | 6 | 10 | 1997 | 7 | PB | F | 27.0 | 24.0 | Chaetodipus | baileyi | Rodent | Rodent Exclosure |
34759 | 26096 | 6 | 10 | 1997 | 7 | PB | F | 26.0 | 30.0 | Chaetodipus | baileyi | Rodent | Rodent Exclosure |
34760 | 26356 | 7 | 9 | 1997 | 7 | PB | M | 27.0 | 32.0 | Chaetodipus | baileyi | Rodent | Rodent Exclosure |
34761 | 26475 | 7 | 9 | 1997 | 7 | PB | M | 28.0 | 36.0 | Chaetodipus | baileyi | Rodent | Rodent Exclosure |
34762 | 26546 | 7 | 29 | 1997 | 7 | PB | M | 27.0 | 37.0 | Chaetodipus | baileyi | Rodent | Rodent Exclosure |
34763 | 26776 | 9 | 27 | 1997 | 7 | PB | M | 26.0 | 37.0 | Chaetodipus | baileyi | Rodent | Rodent Exclosure |
34764 | 26819 | 9 | 27 | 1997 | 7 | PB | M | 30.0 | 40.0 | Chaetodipus | baileyi | Rodent | Rodent Exclosure |
34765 | 28332 | 8 | 22 | 1998 | 7 | PB | M | 26.0 | 27.0 | Chaetodipus | baileyi | Rodent | Rodent Exclosure |
34766 | 28336 | 8 | 22 | 1998 | 7 | PB | M | 26.0 | 23.0 | Chaetodipus | baileyi | Rodent | Rodent Exclosure |
34767 | 28337 | 8 | 22 | 1998 | 7 | PB | F | 27.0 | 30.0 | Chaetodipus | baileyi | Rodent | Rodent Exclosure |
34768 | 28338 | 8 | 22 | 1998 | 7 | PB | F | 25.0 | 23.0 | Chaetodipus | baileyi | Rodent | Rodent Exclosure |
34769 | 28585 | 9 | 20 | 1998 | 7 | PB | M | 26.0 | 25.0 | Chaetodipus | baileyi | Rodent | Rodent Exclosure |
34770 | 28667 | 10 | 24 | 1998 | 7 | PB | M | 26.0 | 25.0 | Chaetodipus | baileyi | Rodent | Rodent Exclosure |
34771 | 29231 | 2 | 20 | 1999 | 7 | PB | M | 26.0 | 28.0 | Chaetodipus | baileyi | Rodent | Rodent Exclosure |
34772 | 30355 | 2 | 5 | 2000 | 7 | PB | M | 27.0 | 20.0 | Chaetodipus | baileyi | Rodent | Rodent Exclosure |
34773 | 32085 | 5 | 26 | 2001 | 7 | PB | F | 22.0 | 37.0 | Chaetodipus | baileyi | Rodent | Rodent Exclosure |
34774 | 32477 | 8 | 25 | 2001 | 7 | PB | M | 28.0 | 32.0 | Chaetodipus | baileyi | Rodent | Rodent Exclosure |
34775 | 33103 | 11 | 17 | 2001 | 7 | PB | M | 28.0 | 41.0 | Chaetodipus | baileyi | Rodent | Rodent Exclosure |
34776 | 33305 | 12 | 15 | 2001 | 7 | PB | M | 29.0 | 44.0 | Chaetodipus | baileyi | Rodent | Rodent Exclosure |
34777 | 34524 | 7 | 13 | 2002 | 7 | PB | M | 25.0 | 16.0 | Chaetodipus | baileyi | Rodent | Rodent Exclosure |
34778 | 35382 | 12 | 8 | 2002 | 7 | PB | M | 26.0 | 30.0 | Chaetodipus | baileyi | Rodent | Rodent Exclosure |
34779 | 26557 | 7 | 29 | 1997 | 7 | PL | F | 20.0 | 22.0 | Peromyscus | leucopus | Rodent | Rodent Exclosure |
34780 | 26787 | 9 | 27 | 1997 | 7 | PL | F | 21.0 | 16.0 | Peromyscus | leucopus | Rodent | Rodent Exclosure |
34781 | 26966 | 10 | 25 | 1997 | 7 | PL | M | 20.0 | 16.0 | Peromyscus | leucopus | Rodent | Rodent Exclosure |
34782 | 27185 | 11 | 22 | 1997 | 7 | PL | F | 21.0 | 22.0 | Peromyscus | leucopus | Rodent | Rodent Exclosure |
34783 | 27792 | 5 | 2 | 1998 | 7 | PL | F | 20.0 | 8.0 | Peromyscus | leucopus | Rodent | Rodent Exclosure |
34784 | 28806 | 11 | 21 | 1998 | 7 | PX | NaN | NaN | NaN | Chaetodipus | sp. | Rodent | Rodent Exclosure |
34785 | 30986 | 7 | 1 | 2000 | 7 | PX | NaN | NaN | NaN | Chaetodipus | sp. | Rodent | Rodent Exclosure |
34786 rows × 13 columns
This is how a data frame is displayed in the JupyterLab Notebook. Although the data frame itself just consists of the values, the Notebook knows that this is a data frame and displays it in a nice tabular format (by adding HTML decorators). It also adds some cosmetic conveniences, such as bold font for the column and row names, alternating grey and white zebra stripes for the rows, and highlighting of the row the mouse pointer moves over.
A data frame is the representation of data in a tabular format, similar to how data is often arranged in spreadsheets. The data is rectangular, meaning that all rows have the same number of columns and all columns have the same number of rows. Data frames are the de facto data structure for most tabular data, and what we use for statistics and plotting. A data frame can be created by hand, but most commonly it is generated by an input function such as read_csv(), in other words, when importing spreadsheets from your hard drive (or the web).
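As a sketch of what creating a data frame by hand can look like, the example below passes a dictionary of columns to pd.DataFrame. The variable name small and the three rows are made up for illustration and are not taken from the survey data:

```python
import pandas as pd

# Hypothetical miniature table: each key is a column name,
# each list holds that column's values (all lists have the same length)
small = pd.DataFrame({
    'species_id': ['NL', 'PF', 'RM'],
    'sex': ['M', 'F', 'M'],
    'weight': [218.0, 4.0, 11.0],
})

print(small)
print(small.shape)
```

Because the data must be rectangular, every list in the dictionary needs the same number of elements.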
As can be seen above, the default is to display the first and last 30 rows and truncate everything in between, as indicated by the ellipsis (...). Although it is truncated, this output still takes up a lot of space. To glance at how the data frame looks, it is sufficient to display only the top (the first 5 rows) using the head() method.
surveys.head()
 | record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | genus | species | taxa | plot_type |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 7 | 16 | 1977 | 2 | NL | M | 32.0 | NaN | Neotoma | albigula | Rodent | Control |
1 | 72 | 8 | 19 | 1977 | 2 | NL | M | 31.0 | NaN | Neotoma | albigula | Rodent | Control |
2 | 224 | 9 | 13 | 1977 | 2 | NL | NaN | NaN | NaN | Neotoma | albigula | Rodent | Control |
3 | 266 | 10 | 16 | 1977 | 2 | NL | NaN | NaN | NaN | Neotoma | albigula | Rodent | Control |
4 | 349 | 11 | 12 | 1977 | 2 | NL | NaN | NaN | NaN | Neotoma | albigula | Rodent | Control |
Methods are very similar to functions; the main difference is that they belong to an object (above, the method head() belongs to the data frame surveys). Methods operate on the object they belong to, which is why we can call the method with empty parentheses and no arguments. Compare this with the function type() that was introduced previously.
type(surveys)
pandas.core.frame.DataFrame
Here, the surveys variable is explicitly passed as an argument to type(). An immediately tangible advantage of methods is that they simplify tab completion. Just type the name of the data frame, a period, and then hit tab to see all the relevant methods for that data frame instead of fumbling around with all the available functions in Python (there are quite a few!) and figuring out which ones operate on data frames and which do not. Methods also improve readability when chaining many operations together, which will be shown in detail later.
The columns in a data frame can contain data of different types, e.g. integers, floats, and objects (which includes strings, lists, dictionaries, and more). General information about the data frame (including the column data types) can be obtained with the info() method.
surveys.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34786 entries, 0 to 34785
Data columns (total 13 columns):
record_id          34786 non-null int64
month              34786 non-null int64
day                34786 non-null int64
year               34786 non-null int64
plot_id            34786 non-null int64
species_id         34786 non-null object
sex                33038 non-null object
hindfoot_length    31438 non-null float64
weight             32283 non-null float64
genus              34786 non-null object
species            34786 non-null object
taxa               34786 non-null object
plot_type          34786 non-null object
dtypes: float64(2), int64(5), object(6)
memory usage: 3.5+ MB
The information includes the total number of rows and columns, the number of non-null observations, the column data types, and the memory (RAM) usage. The number of non-null observations is not the same for all columns, which means that some columns contain null (or NA) values, indicating missing information.
After reading the data into a data frame, head() and info() are two of the most useful methods to get an idea of the structure of this data frame. There are many additional methods that can facilitate the understanding of what a data frame contains:
Size:
- surveys.shape - a tuple with the number of rows as the first element and the number of columns as the second element
- surveys.shape[0] - the number of rows
- surveys.shape[1] - the number of columns
Content:
- surveys.head() - shows the first 5 rows
- surveys.tail() - shows the last 5 rows
Names:
- surveys.columns - returns the names of the columns (also called variable names)
- surveys.index - returns the names of the rows (referred to as the index in pandas)
Summary:
- surveys.info() - column names and data types, number of observations, memory consumption
- surveys.describe() - summary statistics for each column
All methods end with parentheses. Those words that do not have trailing parentheses are called attributes and hold a value; think of them as variables that belong to the object. When an attribute is accessed, it just returns its value, like a variable would. When a method is called, it first performs a computation and then returns the resulting value. For example, the number of rows and columns of a data frame is available through the shape attribute, since it is very common to access this information and it would be inconvenient to compute it explicitly every time it is needed.
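The difference between attributes and methods described above can be sketched on a tiny, made-up data frame (the name small and its contents are assumptions for illustration, not part of the survey data):

```python
import pandas as pd

small = pd.DataFrame({
    'species_id': ['NL', 'PF', 'RM'],
    'weight': [218.0, 4.0, 11.0],
})

# Attributes: accessed without parentheses
n_rows, n_cols = small.shape
col_names = list(small.columns)
row_names = list(small.index)

# Methods: called with parentheses, optionally with arguments
first_two = small.head(2)    # head() accepts the number of rows to show
summary = small.describe()   # summary statistics for the numeric column

print(n_rows, n_cols)
print(col_names, row_names)
print(summary)
```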
Challenge
Based on the output of surveys.info(), can you answer the following questions?
- What is the class of the object surveys?
- How many rows and how many columns are in this object?
- Why is there not the same number of rows (observations) for each column?
It is good practice to keep a copy of the data stored locally on your computer in case you want to do offline analyses, the online version of the file changes, or the file is taken down. For this, the data could be downloaded manually or the current surveys
data frame could be saved to disk as a CSV-file with to_csv()
.
surveys.to_csv('surveys.csv', index=False)
# `index=False` because the index (the row names) was generated automatically when pandas opened
# the file, so this information does not need to be saved
Since the data is now saved locally, the next time this Notebook is opened, it could be loaded from the local path instead of downloading it from the URL.
surveys = pd.read_csv('surveys.csv')
surveys.head()
 | record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | genus | species | taxa | plot_type |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 7 | 16 | 1977 | 2 | NL | M | 32.0 | NaN | Neotoma | albigula | Rodent | Control |
1 | 72 | 8 | 19 | 1977 | 2 | NL | M | 31.0 | NaN | Neotoma | albigula | Rodent | Control |
2 | 224 | 9 | 13 | 1977 | 2 | NL | NaN | NaN | NaN | Neotoma | albigula | Rodent | Control |
3 | 266 | 10 | 16 | 1977 | 2 | NL | NaN | NaN | NaN | Neotoma | albigula | Rodent | Control |
4 | 349 | 11 | 12 | 1977 | 2 | NL | NaN | NaN | NaN | Neotoma | albigula | Rodent | Control |
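The save-and-reload round trip, and the effect of index=False, can be sketched with a small made-up data frame written to a temporary directory (the names small, path, and path_with_index are assumptions for illustration):

```python
import os
import tempfile

import pandas as pd

small = pd.DataFrame({
    'record_id': [1, 72, 224],
    'species_id': ['NL', 'NL', 'NL'],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Without the index: the file holds only the two data columns
    path = os.path.join(tmp_dir, 'small.csv')
    small.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

    # With the index (the default): the row names become an extra, unnamed column
    path_with_index = os.path.join(tmp_dir, 'small_with_index.csv')
    small.to_csv(path_with_index)
    with_index = pd.read_csv(path_with_index)

print(list(reloaded.columns))
print(list(with_index.columns))
```

The second file gains a column that pandas labels 'Unnamed: 0' on reload, which is why index=False is convenient when the index carries no information.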
The surveys data frame has rows and columns (it has 2 dimensions). To extract specific data from it (also referred to as "subsetting"), columns can be called by name.
surveys['species_id'].head() # Using `head` just to limit the output.
0    NL
1    NL
2    NL
3    NL
4    NL
Name: species_id, dtype: object
The JupyterLab Notebook (technically, the underlying IPython interpreter) knows about the columns in the data frame, so tab autocompletion can be used to get the correct column name.
Another syntax that is often used to specify column names is .<column_name>
.
surveys.species_id.head()
0    NL
1    NL
2    NL
3    NL
4    NL
Name: species_id, dtype: object
Using brackets is clearer and also allows for passing multiple columns as a list, so this tutorial will stick to that.
surveys[['species_id', 'record_id']].head()
 | species_id | record_id |
---|---|---|
0 | NL | 1 |
1 | NL | 72 |
2 | NL | 224 |
3 | NL | 266 |
4 | NL | 349 |
The output is displayed a bit differently this time. The reason is that in the earlier cell, where only one column ("species_id") was selected, pandas technically returned a Series, not a DataFrame. This can be confirmed by using type() as previously.
type(surveys['species_id'].head())
pandas.core.series.Series
type(surveys[['species_id', 'record_id']].head())
pandas.core.frame.DataFrame
So, every individual column is actually a Series, and together they constitute a DataFrame. This introductory tutorial will not make any further distinction between a Series and a DataFrame, and many of the analysis techniques used here will apply to both series and data frames. To convert a Series to a DataFrame, the to_frame() method can be used.
type(surveys['species_id'].head().to_frame())
pandas.core.frame.DataFrame
surveys['species_id'].head().to_frame()
 | species_id |
---|---|
0 | NL |
1 | NL |
2 | NL |
3 | NL |
4 | NL |
To select specific rows instead of columns, the loc[] (location) syntax can be used. The example below selects the row where the index name (the row name) equals 4. The index of surveys is unique, so passing a single name to loc[] returns exactly one row.
surveys.loc[4]
record_id              349
month                   11
day                     12
year                  1977
plot_id                  2
species_id              NL
sex                    NaN
hindfoot_length        NaN
weight                 NaN
genus              Neotoma
species           albigula
taxa                Rodent
plot_type          Control
Name: 4, dtype: object
Square brackets are used instead of parentheses to stay consistent with the indexing with square brackets for Python lists and NumPy arrays. The index of surveys consists of consecutive integers, but an index can also consist of text names, and loc[] can then be used to reference a named row via a string. To reference rows by their index position rather than their index name, iloc[] can be used.
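The difference between selecting by name and by position is easiest to see when the index consists of text. The following sketch uses a made-up data frame (animals, with hypothetical row names) rather than the survey data:

```python
import pandas as pd

animals = pd.DataFrame(
    {'weight': [218.0, 4.0, 11.0]},
    index=['rat', 'mouse', 'vole'],  # hypothetical row names
)

# loc[] looks up the row by its index name
by_name = animals.loc['mouse']

# iloc[] looks up the row by its position, counting from 0
by_position = animals.iloc[1]

print(by_name)
print(by_position)
```

Both expressions return the same row here, because 'mouse' is the name of the row at position 1.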
loc[] can also select a range of rows with slice syntax similar to that introduced for lists earlier. Unlike list slices, however, a loc[] slice includes its end label.
surveys.loc[2:4] # As a convenience row slicing can also be done in brackets without loc.
 | record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | genus | species | taxa | plot_type |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | 224 | 9 | 13 | 1977 | 2 | NL | NaN | NaN | NaN | Neotoma | albigula | Rodent | Control |
3 | 266 | 10 | 16 | 1977 | 2 | NL | NaN | NaN | NaN | Neotoma | albigula | Rodent | Control |
4 | 349 | 11 | 12 | 1977 | 2 | NL | NaN | NaN | NaN | Neotoma | albigula | Rodent | Control |
And a combination of columns and rows.
surveys.loc[2:4, 'record_id']
2    224
3    266
4    349
Name: record_id, dtype: int64
surveys.loc[[2, 4, 7], ['species', 'record_id']]
 | species | record_id |
---|---|---|
2 | albigula | 224 |
4 | albigula | 349 |
7 | albigula | 506 |
It is also possible to slice column names with .loc
.
surveys.loc[2:4, 'record_id':'plot_id']
 | record_id | month | day | year | plot_id |
---|---|---|---|---|---|
2 | 224 | 9 | 13 | 1977 | 2 |
3 | 266 | 10 | 16 | 1977 | 2 |
4 | 349 | 11 | 12 | 1977 | 2 |
Rows and columns can also be sliced by position with .iloc, in which case the end position is excluded.
surveys.iloc[2:4, 1:5]
 | month | day | year | plot_id |
---|---|---|---|---|
2 | 9 | 13 | 1977 | 2 |
3 | 10 | 16 | 1977 | 2 |
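The outputs above show that the same slice 2:4 returns three rows with loc[] but only two with iloc[]. A minimal sketch of that difference on a made-up five-row data frame (the name df is an assumption for illustration):

```python
import pandas as pd

df = pd.DataFrame({'record_id': [1, 72, 224, 266, 349]})

# Label-based slice: rows named 2, 3 and 4 (the end label is included)
label_slice = df.loc[2:4]

# Position-based slice: rows at positions 2 and 3 (the end position is excluded)
position_slice = df.iloc[2:4]

print(label_slice)
print(position_slice)
```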
Challenge
- Create a DataFrame (surveys_200) containing only the observations from the 200th row of the surveys dataset. Remember that Python indexing starts at 0!
- Notice how shape[0] gave you the number of rows in a data frame?
  - Use that number to pull out just that last row in the data frame.
  - Compare that with what you see as the last row using tail() to make sure it's meeting expectations.
  - Pull out that last row using shape[0] instead of the row number.
  - Create a new data frame object (surveys_last) from that last row.
- What's a third way of getting the last row apart from using shape or tail? Remember how to index lists from the end!
The describe() method was mentioned above as a way of retrieving summary statistics of a data frame. Together with info() and head(), this is often a good place to start exploratory data analysis, as it gives a nice overview of the numeric variables in the data set.
surveys.describe()
 | record_id | month | day | year | plot_id | hindfoot_length | weight |
---|---|---|---|---|---|---|---|
count | 34786.000000 | 34786.000000 | 34786.000000 | 34786.000000 | 34786.000000 | 31438.000000 | 32283.000000 |
mean | 17804.204421 | 6.473725 | 16.095987 | 1990.495832 | 11.343098 | 29.287932 | 42.672428 |
std | 10229.682311 | 3.398384 | 8.249405 | 7.468714 | 6.794049 | 9.564759 | 36.631259 |
min | 1.000000 | 1.000000 | 1.000000 | 1977.000000 | 1.000000 | 2.000000 | 4.000000 |
25% | 8964.250000 | 4.000000 | 9.000000 | 1984.000000 | 5.000000 | 21.000000 | 20.000000 |
50% | 17761.500000 | 6.000000 | 16.000000 | 1990.000000 | 11.000000 | 32.000000 | 37.000000 |
75% | 26654.750000 | 10.000000 | 23.000000 | 1997.000000 | 17.000000 | 36.000000 | 48.000000 |
max | 35548.000000 | 12.000000 | 31.000000 | 2002.000000 | 24.000000 | 70.000000 | 280.000000 |
A common next step would be to plot the data to explore relationships between different variables, but before getting into plotting, it is beneficial to elaborate on the data frame object and several of its common operations.
An often desired outcome is to select a subset of rows matching a criterion, e.g. which observations have a weight under 5 grams. To do this, the "less than" comparison operator that was introduced previously can be used.
surveys['weight'] < 5
0        False
1        False
2        False
3        False
4        False
5        False
6        False
7        False
8        False
9        False
10       False
11       False
12       False
13       False
14       False
15       False
16       False
17       False
18       False
19       False
20       False
21       False
22       False
23       False
24       False
25       False
26       False
27       False
28       False
29       False
         ...
34756    False
34757    False
34758    False
34759    False
34760    False
34761    False
34762    False
34763    False
34764    False
34765    False
34766    False
34767    False
34768    False
34769    False
34770    False
34771    False
34772    False
34773    False
34774    False
34775    False
34776    False
34777    False
34778    False
34779    False
34780    False
34781    False
34782    False
34783    False
34784    False
34785    False
Name: weight, Length: 34786, dtype: bool
The result is a boolean array of 34786 values, the same length as the data frame. It has one value for every row in the data frame, indicating whether it is True or False that the row has a value below 5 in the weight column. This boolean array can be used together with the loc[] indexer to select only those observations from the data frame!
surveys.loc[surveys['weight'] < 5]
| | record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | genus | species | taxa | plot_type |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2428 | 4052 | 4 | 5 | 1981 | 3 | PF | F | 15.0 | 4.0 | Perognathus | flavus | Rodent | Long-term Krat Exclosure |
2453 | 7084 | 11 | 22 | 1982 | 3 | PF | F | 16.0 | 4.0 | Perognathus | flavus | Rodent | Long-term Krat Exclosure |
4253 | 28126 | 6 | 28 | 1998 | 15 | PF | M | NaN | 4.0 | Perognathus | flavus | Rodent | Long-term Krat Exclosure |
4665 | 9909 | 1 | 20 | 1985 | 15 | RM | F | 15.0 | 4.0 | Reithrodontomys | megalotis | Rodent | Long-term Krat Exclosure |
6860 | 9853 | 1 | 19 | 1985 | 17 | RM | M | 16.0 | 4.0 | Reithrodontomys | megalotis | Rodent | Control |
21224 | 4290 | 4 | 6 | 1981 | 4 | PF | NaN | NaN | 4.0 | Perognathus | flavus | Rodent | Control |
21674 | 29906 | 10 | 10 | 1999 | 4 | PP | M | 21.0 | 4.0 | Chaetodipus | penicillatus | Rodent | Control |
24191 | 8736 | 12 | 8 | 1983 | 19 | RM | M | 17.0 | 4.0 | Reithrodontomys | megalotis | Rodent | Long-term Krat Exclosure |
24200 | 9799 | 1 | 19 | 1985 | 19 | RM | M | 16.0 | 4.0 | Reithrodontomys | megalotis | Rodent | Long-term Krat Exclosure |
25529 | 9794 | 1 | 19 | 1985 | 24 | RM | M | 16.0 | 4.0 | Reithrodontomys | megalotis | Rodent | Rodent Exclosure |
26457 | 218 | 9 | 13 | 1977 | 1 | PF | M | 13.0 | 4.0 | Perognathus | flavus | Rodent | Spectab exclosure |
31678 | 5346 | 2 | 22 | 1982 | 21 | PF | F | 14.0 | 4.0 | Perognathus | flavus | Rodent | Long-term Krat Exclosure |
32114 | 9937 | 2 | 16 | 1985 | 21 | RM | M | 16.0 | 4.0 | Reithrodontomys | megalotis | Rodent | Long-term Krat Exclosure |
32808 | 10119 | 3 | 17 | 1985 | 10 | RM | M | 16.0 | 4.0 | Reithrodontomys | megalotis | Rodent | Rodent Exclosure |
33347 | 9790 | 1 | 19 | 1985 | 16 | RM | F | 16.0 | 4.0 | Reithrodontomys | megalotis | Rodent | Rodent Exclosure |
33760 | 9823 | 1 | 19 | 1985 | 23 | RM | M | 16.0 | 4.0 | Reithrodontomys | megalotis | Rodent | Rodent Exclosure |
34396 | 10439 | 5 | 24 | 1985 | 7 | RM | M | 16.0 | 4.0 | Reithrodontomys | megalotis | Rodent | Rodent Exclosure |
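To see the mechanics on a smaller scale, the following sketch builds a tiny, made-up data frame and filters it with a boolean mask. The name toy and its values are assumptions invented for illustration; they are not part of the surveys data.

```python
import pandas as pd

# Hypothetical miniature data set, invented for illustration only
toy = pd.DataFrame({
    'species': ['flavus', 'megalotis', 'albigula', 'flavus'],
    'weight': [4.0, 11.0, 218.0, None],
})

# Comparing a column to a value gives one True/False per row
mask = toy['weight'] < 5
print(mask.tolist())

# Only rows where the mask is True are kept;
# a missing weight compares as False and is therefore dropped
light = toy.loc[mask]
print(light)
```

Storing the mask in its own variable is optional, but it can make it easier to inspect which rows will be selected before subsetting.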
As before, this can be combined with selection of a particular set of columns.
surveys.loc[surveys['weight'] < 5, ['weight', 'species']]
| | weight | species |
---|---|---|
2428 | 4.0 | flavus |
2453 | 4.0 | flavus |
4253 | 4.0 | flavus |
4665 | 4.0 | megalotis |
6860 | 4.0 | megalotis |
21224 | 4.0 | flavus |
21674 | 4.0 | penicillatus |
24191 | 4.0 | megalotis |
24200 | 4.0 | megalotis |
25529 | 4.0 | megalotis |
26457 | 4.0 | flavus |
31678 | 4.0 | flavus |
32114 | 4.0 | megalotis |
32808 | 4.0 | megalotis |
33347 | 4.0 | megalotis |
33760 | 4.0 | megalotis |
34396 | 4.0 | megalotis |
To prevent the output from running off the screen, head() can be used just like before.
surveys.loc[surveys['weight'] < 5, ['weight', 'species']].head()
| | weight | species |
---|---|---|
2428 | 4.0 | flavus |
2453 | 4.0 | flavus |
4253 | 4.0 | flavus |
4665 | 4.0 | megalotis |
6860 | 4.0 | megalotis |
A new object can be created from this smaller version of the data by assigning it to a new variable name.
surveys_sml = surveys.loc[surveys['weight'] < 5, ['weight', 'species']]
surveys_sml.head()
| | weight | species |
---|---|---|
2428 | 4.0 | flavus |
2453 | 4.0 | flavus |
4253 | 4.0 | flavus |
4665 | 4.0 | megalotis |
6860 | 4.0 | megalotis |
A single expression can also be used to filter for several criteria, either matching all criteria (&) or any criteria (|):
# AND = &
surveys.loc[(surveys['taxa'] == 'Rodent') & (surveys['sex'] == 'F'), ['taxa', 'sex']].head()
| | taxa | sex |
---|---|---|
20 | Rodent | F |
21 | Rodent | F |
22 | Rodent | F |
23 | Rodent | F |
24 | Rodent | F |
To increase readability, these statements can be split over multiple lines. Anything that is within parentheses or brackets in Python can be continued on the next line. Inside a bracket or parenthesis, the indentation is not significant to the Python interpreter, but it is still recommended to include it in order to make the code more readable.
surveys.loc[(surveys['taxa'] == 'Rodent') &
            (surveys['sex'] == 'F'),
            ['taxa', 'sex']].head()
| | taxa | sex |
---|---|---|
20 | Rodent | F |
21 | Rodent | F |
22 | Rodent | F |
23 | Rodent | F |
24 | Rodent | F |
With the | operator, rows matching either of the supplied criteria are returned.
# OR = |
surveys.loc[(surveys['species'] == 'clarki') |
            (surveys['species'] == 'leucophrys'),
            'species']
10603    leucophrys
24480        clarki
34045    leucophrys
Name: species, dtype: object
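Each comparison above is wrapped in parentheses because & and | have higher precedence than comparison operators such as == in Python. The sketch below contrasts the two operators on a small, made-up data frame. The name toy and its contents are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical miniature data set, invented for illustration only
toy = pd.DataFrame({
    'taxa': ['Rodent', 'Rodent', 'Bird', 'Rabbit'],
    'sex': ['F', 'M', 'F', 'M'],
})

# Each comparison is stored separately, so no extra parentheses are needed
is_rodent = toy['taxa'] == 'Rodent'
is_female = toy['sex'] == 'F'

# & keeps rows where both conditions are True
both = toy.loc[is_rodent & is_female]

# | keeps rows where at least one condition is True
either = toy.loc[is_rodent | is_female]

print(len(both))
print(len(either))
```

Naming the individual conditions is one possible way to keep longer filters readable; writing the comparisons inline with parentheses, as in the examples above, gives the same result.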
Challenge
Subset the surveys data to include individuals collected before 1995 and retain only the columns year, sex, and weight.
A frequent operation when working with data is to create new columns based on the values in existing columns, for example to do unit conversions or find the ratio of values in two columns. To create a new column of the weight in kg instead of in grams:
surveys['weight_kg'] = surveys['weight'] / 1000
surveys.head(10)
| | record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | genus | species | taxa | plot_type | weight_kg |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 7 | 16 | 1977 | 2 | NL | M | 32.0 | NaN | Neotoma | albigula | Rodent | Control | NaN |
1 | 72 | 8 | 19 | 1977 | 2 | NL | M | 31.0 | NaN | Neotoma | albigula | Rodent | Control | NaN |
2 | 224 | 9 | 13 | 1977 | 2 | NL | NaN | NaN | NaN | Neotoma | albigula | Rodent | Control | NaN |
3 | 266 | 10 | 16 | 1977 | 2 | NL | NaN | NaN | NaN | Neotoma | albigula | Rodent | Control | NaN |
4 | 349 | 11 | 12 | 1977 | 2 | NL | NaN | NaN | NaN | Neotoma | albigula | Rodent | Control | NaN |
5 | 363 | 11 | 12 | 1977 | 2 | NL | NaN | NaN | NaN | Neotoma | albigula | Rodent | Control | NaN |
6 | 435 | 12 | 10 | 1977 | 2 | NL | NaN | NaN | NaN | Neotoma | albigula | Rodent | Control | NaN |
7 | 506 | 1 | 8 | 1978 | 2 | NL | NaN | NaN | NaN | Neotoma | albigula | Rodent | Control | NaN |
8 | 588 | 2 | 18 | 1978 | 2 | NL | M | NaN | 218.0 | Neotoma | albigula | Rodent | Control | 0.218 |
9 | 661 | 3 | 11 | 1978 | 2 | NL | NaN | NaN | NaN | Neotoma | albigula | Rodent | Control | NaN |
The first few rows of the output are full of NAs. To remove those, use the dropna() method of the data frame.
surveys.dropna().head(10)
| | record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | genus | species | taxa | plot_type | weight_kg |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
11 | 845 | 5 | 6 | 1978 | 2 | NL | M | 32.0 | 204.0 | Neotoma | albigula | Rodent | Control | 0.204 |
13 | 1164 | 8 | 5 | 1978 | 2 | NL | M | 34.0 | 199.0 | Neotoma | albigula | Rodent | Control | 0.199 |
14 | 1261 | 9 | 4 | 1978 | 2 | NL | M | 32.0 | 197.0 | Neotoma | albigula | Rodent | Control | 0.197 |
17 | 1756 | 4 | 29 | 1979 | 2 | NL | M | 33.0 | 166.0 | Neotoma | albigula | Rodent | Control | 0.166 |
18 | 1818 | 5 | 30 | 1979 | 2 | NL | M | 32.0 | 184.0 | Neotoma | albigula | Rodent | Control | 0.184 |
19 | 1882 | 7 | 4 | 1979 | 2 | NL | M | 32.0 | 206.0 | Neotoma | albigula | Rodent | Control | 0.206 |
20 | 2133 | 10 | 25 | 1979 | 2 | NL | F | 33.0 | 274.0 | Neotoma | albigula | Rodent | Control | 0.274 |
21 | 2184 | 11 | 17 | 1979 | 2 | NL | F | 30.0 | 186.0 | Neotoma | albigula | Rodent | Control | 0.186 |
22 | 2406 | 1 | 16 | 1980 | 2 | NL | F | 33.0 | 184.0 | Neotoma | albigula | Rodent | Control | 0.184 |
24 | 3000 | 5 | 18 | 1980 | 2 | NL | F | 31.0 | 87.0 | Neotoma | albigula | Rodent | Control | 0.087 |
By default, .dropna() removes all rows that have an NA value in any of the columns. There are parameters that control how the rows are dropped and which columns should be searched for NAs.
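As a rough sketch of two such parameters, the example below uses how and subset, both of which are keyword arguments of the pandas dropna() method. The data frame toy and its values are made up for illustration and are not part of the surveys data.

```python
import pandas as pd

# Hypothetical miniature data set, invented for illustration only
toy = pd.DataFrame({
    'sex': ['M', None, 'F', None],
    'weight': [218.0, None, None, 204.0],
})

# Default behaviour: drop a row if any column contains an NA
any_na = toy.dropna()

# how='all': drop a row only if every column is NA
all_na = toy.dropna(how='all')

# subset: only search the listed columns for NAs
weight_only = toy.dropna(subset=['weight'])

print(len(any_na), len(all_na), len(weight_only))
```

Restricting the search with subset can be useful when only some columns matter for the analysis at hand, since rows with NAs elsewhere are then retained.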
Challenge
Create a new data frame from the surveys data that meets the following criteria: contains only the species_id and hindfoot_length columns, and a new column called hindfoot_half containing values that are half the hindfoot_length values. In this hindfoot_half column, there are no NAs and all values are less than 30.
Hint: It is a good idea to break this into three steps!
This concludes the introductory data analysis section.