Selecting Data

A common need is to grab a subset of records that meet a certain criteria. You can do this by indexing the DataFrame much like you've seen done with a NumPy.ndarray.

In [1]:
import os
import pandas as pd

users = pd.read_csv(os.path.join('data', 'users.csv'), index_col=0)
# Pop out a quick sanity check
len(users)
Out[1]:
475

CashBox uses a referral system, everyone you refer will earn you $5 credit. Let's see if we can find everyone who has not yet taken advantage of that deal. The number of referrals a user has made is defined in the referral_count column.

In [2]:
# This vectorized comparison returns a new `Series` ... 
#   We are naming it so we can use it later.
no_referrals_index = users['referral_count'] < 1
# See how the boolean `Series` returned includes all rows from the `DataFrame`.
#  The value is the result of each comparison
no_referrals_index.head()
Out[2]:
aaron            False
acook            False
adam.saunders    False
adrian           False
adrian.blair     False
Name: referral_count, dtype: bool

Using the boolean Series we just created, no_referrals_index, we can retrieve all rows where that comparison was True.

In [3]:
users[no_referrals_index].head()
Out[3]:
first_name last_name email email_verified signup_date referral_count balance
alan9443 Alan Pope [email protected] True 2018-04-17 0 56.09
andrew.alvarez Andrew Alvarez [email protected] False 2018-08-01 0 81.66
boyer7005 Sara Boyer [email protected] True 2018-07-31 0 91.41
brandon.gilbert Brandon Gilbert [email protected] True 2018-04-28 0 10.17
brooke2027 Brooke NaN [email protected] False 2018-05-23 0 7.22

Inversed index

A handy shortcut is to prefix the index with a ~ (tilde). This returns the inverse of the boolean Series. While I wish that the ~ was called "the opposite day" operator, it is in fact called bitwise not operator.

In [4]:
# Careful, double negative here. We don't need no education.
~no_referrals_index.head()
Out[4]:
aaron            True
acook            True
adam.saunders    True
adrian           True
adrian.blair     True
Name: referral_count, dtype: bool
In [5]:
# Use the inverse of the index to find where referral values DO NOT equal zero
users[~no_referrals_index].head()
Out[5]:
first_name last_name email email_verified signup_date referral_count balance
aaron Aaron Davis [email protected] True 2018-08-31 6 18.14
acook Anthony Cook [email protected] True 2018-05-12 2 55.45
adam.saunders Adam Saunders [email protected] False 2018-05-29 3 72.12
adrian Adrian Yang [email protected] True 2018-04-28 3 30.01
adrian.blair Adrian Blair [email protected] True 2018-06-16 7 25.85

In loc

Boolean Series as an index may also be used as an index the DataFrame.loc object.

In [6]:
# Select rows where there are no referrals, and select only the following ordered columns
users.loc[no_referrals_index, ['balance', 'email']].head()
Out[6]:
balance email
alan9443 56.09 [email protected]
andrew.alvarez 81.66 [email protected]
boyer7005 91.41 [email protected]
brandon.gilbert 10.17 [email protected]
brooke2027 7.22 [email protected]

It is also possible to do the comparison inline, without storing the index in a variable.

In [7]:
users[users['referral_count'] == 0].head()
Out[7]:
first_name last_name email email_verified signup_date referral_count balance
alan9443 Alan Pope [email protected] True 2018-04-17 0 56.09
andrew.alvarez Andrew Alvarez [email protected] False 2018-08-01 0 81.66
boyer7005 Sara Boyer [email protected] True 2018-07-31 0 91.41
brandon.gilbert Brandon Gilbert [email protected] True 2018-04-28 0 10.17
brooke2027 Brooke NaN [email protected] False 2018-05-23 0 7.22

Just like a NumPy ndarray, it's possible for a boolean Series to be compared to another boolean Series using bitwise operators.

Don't forget to surround your expressions with parenthesis to control the order of operations.

In [8]:
# Select all users where they haven't made a referral AND their email has been verified
users[(users['referral_count'] == 0) & (users['email_verified'] == True)].head()
Out[8]:
first_name last_name email email_verified signup_date referral_count balance
alan9443 Alan Pope [email protected] True 2018-04-17 0 56.09
boyer7005 Sara Boyer [email protected] True 2018-07-31 0 91.41
brandon.gilbert Brandon Gilbert [email protected] True 2018-04-28 0 10.17
bryant Darlene Bryant [email protected] True 2018-07-19 0 36.91
calvin.perez Calvin Perez [email protected] True 2018-02-17 0 13.01