In the last lecture, we saw how to extract specific columns from a data frame. In many cases, we also need to extract specific rows. This operation is often called "filtering" -- we are filtering out the rows that we don't want, leaving the ones that we do.
import pandas as pd
import numpy as np
# you'll need to run the first block in pd_2.ipynb
# to download the data if you have not already done so
penguins = pd.read_csv("palmer_penguins.csv")
# just the first five rows and selected columns
penguins = penguins[["Species", "Region", "Island", "Culmen Length (mm)"]]
penguins.head()
Species | Region | Island | Culmen Length (mm) | |
---|---|---|---|---|
0 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | 39.1 |
1 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | 39.5 |
2 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | 40.3 |
3 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | NaN |
4 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | 36.7 |
The simplest way to select rows of data is by explicitly naming the value(s) of the index for the rows you want. Remember that the index is the set of bold numbers at the far left. To do this, you should use the df.loc
attribute of the data frame, like this:
penguins.loc[1:3] # rows with index values 1 through 3
Species | Region | Island | Culmen Length (mm) | |
---|---|---|---|---|
1 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | 39.5 |
2 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | 40.3 |
3 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | NaN |
# passing an explicit list can change the order of the rows.
s = penguins.loc[[1, 4, 0]]
s
Species | Region | Island | Culmen Length (mm) | |
---|---|---|---|---|
1 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | 39.5 |
4 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | 36.7 |
0 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | 39.1 |
# note that this works, even though s does not have a 4th row,
# because s does have an index with value 4
s.loc[4]
Species Adelie Penguin (Pygoscelis adeliae) Region Anvers Island Torgersen Culmen Length (mm) 36.7 Name: 4, dtype: object
# on the other hand, this doesn't work
s.loc[2]
--------------------------------------------------------------------------- KeyError Traceback (most recent call last) /opt/anaconda3/lib/python3.8/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance) 2645 try: -> 2646 return self._engine.get_loc(key) 2647 except KeyError: pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc() pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc() pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item() pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item() KeyError: 2 During handling of the above exception, another exception occurred: KeyError Traceback (most recent call last) <ipython-input-52-1d3ba2e47d68> in <module> 1 # on the other hand, this doesn't work ----> 2 s.loc[2] /opt/anaconda3/lib/python3.8/site-packages/pandas/core/indexing.py in __getitem__(self, key) 1766 1767 maybe_callable = com.apply_if_callable(key, self.obj) -> 1768 return self._getitem_axis(maybe_callable, axis=axis) 1769 1770 def _is_scalar_access(self, key: Tuple): /opt/anaconda3/lib/python3.8/site-packages/pandas/core/indexing.py in _getitem_axis(self, key, axis) 1963 # fall thru to straight lookup 1964 self._validate_key(key, axis) -> 1965 return self._get_label(key, axis=axis) 1966 1967 /opt/anaconda3/lib/python3.8/site-packages/pandas/core/indexing.py in _get_label(self, label, axis) 623 raise IndexingError("no slices here, handle elsewhere") 624 --> 625 return self.obj._xs(label, axis=axis) 626 627 def _get_loc(self, key: int, axis: int): /opt/anaconda3/lib/python3.8/site-packages/pandas/core/generic.py in xs(self, key, axis, level, drop_level) 3535 loc, new_index = self.index.get_loc_level(key, drop_level=drop_level) 3536 else: -> 3537 loc = self.index.get_loc(key) 3538 3539 if isinstance(loc, np.ndarray): /opt/anaconda3/lib/python3.8/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance) 2646 return self._engine.get_loc(key) 2647 except KeyError: -> 2648 return self._engine.get_loc(self._maybe_cast_indexer(key)) 2649 indexer = self.get_indexer([key], method=method, tolerance=tolerance) 2650 if indexer.ndim > 1 or indexer.size > 1: pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc() pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc() pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item() pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item() KeyError: 2
While it's good to know how to refer to rows by index, this is not the most useful way to filter data frames. Boolean indexing instead allows us to filter the rows of a data set based on one or more conditions. Boolean indexing in data frames is very similar to Boolean indexing in numpy
arrays.
penguins
Species | Region | Island | Culmen Length (mm) | |
---|---|---|---|---|
0 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | 39.1 |
1 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | 39.5 |
2 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | 40.3 |
3 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | NaN |
4 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | 36.7 |
... | ... | ... | ... | ... |
339 | Gentoo penguin (Pygoscelis papua) | Anvers | Biscoe | NaN |
340 | Gentoo penguin (Pygoscelis papua) | Anvers | Biscoe | 46.8 |
341 | Gentoo penguin (Pygoscelis papua) | Anvers | Biscoe | 50.4 |
342 | Gentoo penguin (Pygoscelis papua) | Anvers | Biscoe | 45.2 |
343 | Gentoo penguin (Pygoscelis papua) | Anvers | Biscoe | 49.9 |
344 rows × 4 columns
penguins['Culmen Length (mm)'] < 40
0 True 1 True 2 False 3 False 4 True ... 339 False 340 False 341 False 342 False 343 False Name: Culmen Length (mm), Length: 344, dtype: bool
penguins[penguins['Culmen Length (mm)'] < 40]
Species | Region | Island | Culmen Length (mm) | |
---|---|---|---|---|
0 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | 39.1 |
1 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | 39.5 |
4 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | 36.7 |
5 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | 39.3 |
6 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | 38.9 |
... | ... | ... | ... | ... |
146 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Dream | 39.2 |
147 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Dream | 36.6 |
148 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Dream | 36.0 |
149 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Dream | 37.8 |
150 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Dream | 36.0 |
100 rows × 4 columns
# list of penguins encountered on Torgersen island
torg = penguins['Island']== "Torgersen"
penguins[torg].head()
Species | Region | Island | Culmen Length (mm) | |
---|---|---|---|---|
0 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | 39.1 |
1 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | 39.5 |
2 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | 40.3 |
3 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | NaN |
4 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | 36.7 |
# penguins encountered on Torgersen with culmen no longer than 40 mm
# using bitwise and operator &
culm = penguins['Culmen Length (mm)'] < 40
penguins[torg & culm].head()
Species | Region | Island | Culmen Length (mm) | |
---|---|---|---|---|
0 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | 39.1 |
1 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | 39.5 |
4 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | 36.7 |
5 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | 39.3 |
6 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | 38.9 |
# using bitwise or instead of and
penguins[torg | culm].head()
Species | Region | Island | Culmen Length (mm) | |
---|---|---|---|---|
0 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | 39.1 |
1 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | 39.5 |
2 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | 40.3 |
3 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | NaN |
4 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | 36.7 |
An especially useful example of Boolean indexing is picking out nan
values from the data.
nas = penguins["Culmen Length (mm)"].isna()
penguins[nas]
Species | Region | Island | Culmen Length (mm) | |
---|---|---|---|---|
3 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | NaN |
339 | Gentoo penguin (Pygoscelis papua) | Anvers | Biscoe | NaN |
# invert flips the entries of a boolean array
penguins[np.invert(nas)]
Species | Region | Island | Culmen Length (mm) | |
---|---|---|---|---|
0 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | 39.1 |
1 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | 39.5 |
2 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | 40.3 |
4 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | 36.7 |
5 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | 39.3 |
... | ... | ... | ... | ... |
338 | Gentoo penguin (Pygoscelis papua) | Anvers | Biscoe | 47.2 |
340 | Gentoo penguin (Pygoscelis papua) | Anvers | Biscoe | 46.8 |
341 | Gentoo penguin (Pygoscelis papua) | Anvers | Biscoe | 50.4 |
342 | Gentoo penguin (Pygoscelis papua) | Anvers | Biscoe | 45.2 |
343 | Gentoo penguin (Pygoscelis papua) | Anvers | Biscoe | 49.9 |
342 rows × 4 columns
Boolean indexing is by far the most useful form of filtering, and should usually be preferred in most practical contexts. It is especially powerful when combined with functions that operate on columns, as we'll see shortly.