Selecting Data¶

A common need is to grab a subset of records that meet a certain criteria. You can do this by indexing the DataFrame much like you've seen done with a NumPy.ndarray.

In [1]:

import os
import pandas as pd

users = pd.read_csv(os.path.join('data', 'users.csv'), index_col=0)
# Pop out a quick sanity check
len(users)

Out[1]:

CashBox uses a referral system, everyone you refer will earn you $5 credit. Let's see if we can find everyone who has not yet taken advantage of that deal. The number of referrals a user has made is defined in the referral_count column.

In [2]:

# This vectorized comparison returns a new `Series` ... 
#   We are naming it so we can use it later.
no_referrals_index = users['referral_count'] < 1
# See how the boolean `Series` returned includes all rows from the `DataFrame`.
#  The value is the result of each comparison
no_referrals_index.head()

Out[2]:

aaron            False
acook            False
adam.saunders    False
adrian           False
adrian.blair     False
Name: referral_count, dtype: bool

Using the boolean Series we just created, no_referrals_index, we can retrieve all rows where that comparison was True.

In [3]:

users[no_referrals_index].head()

Out[3]:

	first_name	last_name	email	email_verified	signup_date	balance
alan9443	Alan	Pope	pope@hotmail.com	True	2018-04-17	56.09
andrew.alvarez	Andrew	Alvarez	aalvarez@hotmail.com	False	2018-08-01	81.66
boyer7005	Sara	Boyer	boyer8636@gmail.com	True	2018-07-31	91.41
brandon.gilbert	Brandon	Gilbert	brandon.gilbert@hotmail.com	True	2018-04-28	10.17
brooke2027	Brooke	NaN	brooke6938@gmail.com	False	2018-05-23	7.22

Inversed index¶

A handy shortcut is to prefix the index with a ~ (tilde). This returns the inverse of the boolean Series. While I wish that the ~ was called "the opposite day" operator, it is in fact called bitwise not operator.

In [4]:

# Careful, double negative here. We don't need no education.
~no_referrals_index.head()

Out[4]:

aaron            True
acook            True
adam.saunders    True
adrian           True
adrian.blair     True
Name: referral_count, dtype: bool

In [5]:

# Use the inverse of the index to find where referral values DO NOT equal zero
users[~no_referrals_index].head()

Out[5]:

	first_name	last_name	email	email_verified	signup_date	referral_count	balance
aaron	Aaron	Davis	aaron6348@gmail.com	True	2018-08-31	6	18.14
acook	Anthony	Cook	cook@gmail.com	True	2018-05-12	2	55.45
adam.saunders	Adam	Saunders	adam@gmail.com	False	2018-05-29	3	72.12
adrian	Adrian	Yang	adrian.yang@teamtreehouse.com	True	2018-04-28	3	30.01
adrian.blair	Adrian	Blair	adrian9335@gmail.com	True	2018-06-16	7	25.85

In `loc`¶

Boolean Series as an index may also be used as an index the DataFrame.loc object.

In [6]:

# Select rows where there are no referrals, and select only the following ordered columns
users.loc[no_referrals_index, ['balance', 'email']].head()

Out[6]:

	balance	email
alan9443	56.09	pope@hotmail.com
andrew.alvarez	81.66	aalvarez@hotmail.com
boyer7005	91.41	boyer8636@gmail.com
brandon.gilbert	10.17	brandon.gilbert@hotmail.com
brooke2027	7.22	brooke6938@gmail.com

It is also possible to do the comparison inline, without storing the index in a variable.

In [7]:

users[users['referral_count'] == 0].head()

Out[7]:

	first_name	last_name	email	email_verified	signup_date	balance
alan9443	Alan	Pope	pope@hotmail.com	True	2018-04-17	56.09
andrew.alvarez	Andrew	Alvarez	aalvarez@hotmail.com	False	2018-08-01	81.66
boyer7005	Sara	Boyer	boyer8636@gmail.com	True	2018-07-31	91.41
brandon.gilbert	Brandon	Gilbert	brandon.gilbert@hotmail.com	True	2018-04-28	10.17
brooke2027	Brooke	NaN	brooke6938@gmail.com	False	2018-05-23	7.22

Just like a NumPy ndarray, it's possible for a boolean Series to be compared to another boolean Series using bitwise operators.

Don't forget to surround your expressions with parenthesis to control the order of operations.

In [8]:

# Select all users where they haven't made a referral AND their email has been verified
users[(users['referral_count'] == 0) & (users['email_verified'] == True)].head()

Out[8]:

	first_name	last_name	email	email_verified	signup_date	balance
alan9443	Alan	Pope	pope@hotmail.com	True	2018-04-17	56.09
boyer7005	Sara	Boyer	boyer8636@gmail.com	True	2018-07-31	91.41
brandon.gilbert	Brandon	Gilbert	brandon.gilbert@hotmail.com	True	2018-04-28	10.17
bryant	Darlene	Bryant	dbryant@yahoo.com	True	2018-07-19	36.91
calvin.perez	Calvin	Perez	cperez@gmail.com	True	2018-02-17	13.01

Selecting Data¶

Inversed index¶

In loc¶

In `loc`¶