A common need is to grab a subset of records that meet a certain criteria. You can do this by indexing the DataFrame
much like you've seen done with a NumPy.ndarray
.
import os
import pandas as pd
users = pd.read_csv(os.path.join('data', 'users.csv'), index_col=0)
# Pop out a quick sanity check
len(users)
475
CashBox uses a referral system, everyone you refer will earn you $5 credit. Let's see if we can find everyone who has not yet taken advantage of that deal. The number of referrals a user has made is defined in the referral_count
column.
# This vectorized comparison returns a new `Series`, which we are naming so we can use it later
no_referrals_index = users['referral_count'] < 1
# See how the boolean `Series` returned includes all rows from the `DataFrame`.
# The value is the result of each comparison
no_referrals_index.head()
aaron False acook False adam.saunders False adrian False adrian.blair False Name: referral_count, dtype: bool
Using the boolean Series
we just created, no_referrals_index
, we can retrieve all rows where that comparison was True.
users[no_referrals_index].head()
first_name | last_name | email_verified | signup_date | referral_count | balance | ||
---|---|---|---|---|---|---|---|
alan9443 | Alan | Pope | pope@hotmail.com | True | 2018-04-17 | 0 | 56.09 |
andrew.alvarez | Andrew | Alvarez | aalvarez@hotmail.com | False | 2018-08-01 | 0 | 81.66 |
boyer7005 | Sara | Boyer | boyer8636@gmail.com | True | 2018-07-31 | 0 | 91.41 |
brandon.gilbert | Brandon | Gilbert | brandon.gilbert@hotmail.com | True | 2018-04-28 | 0 | 10.17 |
brooke2027 | Brooke | NaN | brooke6938@gmail.com | False | 2018-05-23 | 0 | 7.22 |
A handy shortcut is to prefix the index with a ~
(tilde). This returns the inverse of the boolean Series
. While I wish that the ~
was called "the opposite day" operator, it is in fact called bitwise not
operator.
# Careful, double negative here. We don't need no education.
~no_referrals_index.head()
aaron True acook True adam.saunders True adrian True adrian.blair True Name: referral_count, dtype: bool
# Use the inverse of the index to find where referral values DO NOT equal zero
users[~no_referrals_index].head()
first_name | last_name | email_verified | signup_date | referral_count | balance | ||
---|---|---|---|---|---|---|---|
aaron | Aaron | Davis | aaron6348@gmail.com | True | 2018-08-31 | 6 | 18.14 |
acook | Anthony | Cook | cook@gmail.com | True | 2018-05-12 | 2 | 55.45 |
adam.saunders | Adam | Saunders | adam@gmail.com | False | 2018-05-29 | 3 | 72.12 |
adrian | Adrian | Yang | adrian.yang@teamtreehouse.com | True | 2018-04-28 | 3 | 30.01 |
adrian.blair | Adrian | Blair | adrian9335@gmail.com | True | 2018-06-16 | 7 | 25.85 |
loc
¶Boolean Series
as an index may also be used as an index the DataFrame.loc
object.
# Select rows where there are no referrals, and select only the following ordered columns
users.loc[no_referrals_index, ['balance', 'email']].head()
balance | ||
---|---|---|
alan9443 | 56.09 | pope@hotmail.com |
andrew.alvarez | 81.66 | aalvarez@hotmail.com |
boyer7005 | 91.41 | boyer8636@gmail.com |
brandon.gilbert | 10.17 | brandon.gilbert@hotmail.com |
brooke2027 | 7.22 | brooke6938@gmail.com |
It is also possible to do the comparison inline, without storing the index in a variable.
users[users['referral_count'] == 0].head()
first_name | last_name | email_verified | signup_date | referral_count | balance | ||
---|---|---|---|---|---|---|---|
alan9443 | Alan | Pope | pope@hotmail.com | True | 2018-04-17 | 0 | 56.09 |
andrew.alvarez | Andrew | Alvarez | aalvarez@hotmail.com | False | 2018-08-01 | 0 | 81.66 |
boyer7005 | Sara | Boyer | boyer8636@gmail.com | True | 2018-07-31 | 0 | 91.41 |
brandon.gilbert | Brandon | Gilbert | brandon.gilbert@hotmail.com | True | 2018-04-28 | 0 | 10.17 |
brooke2027 | Brooke | NaN | brooke6938@gmail.com | False | 2018-05-23 | 0 | 7.22 |
Just like a NumPy ndarray
, it's possible for a boolean Series
to be compared to another boolean Series
using bitwise operators.
Don't forget to surround your expressions with parenthesis to control the order of operations.
# Select all users where they haven't made a referral AND their email has been verified
users[(users['referral_count'] == 0) & (users['email_verified'] == True)].head()
first_name | last_name | email_verified | signup_date | referral_count | balance | ||
---|---|---|---|---|---|---|---|
alan9443 | Alan | Pope | pope@hotmail.com | True | 2018-04-17 | 0 | 56.09 |
boyer7005 | Sara | Boyer | boyer8636@gmail.com | True | 2018-07-31 | 0 | 91.41 |
brandon.gilbert | Brandon | Gilbert | brandon.gilbert@hotmail.com | True | 2018-04-28 | 0 | 10.17 |
bryant | Darlene | Bryant | dbryant@yahoo.com | True | 2018-07-19 | 0 | 36.91 |
calvin.perez | Calvin | Perez | cperez@gmail.com | True | 2018-02-17 | 0 | 13.01 |