# Selecting Data¶

A common need is to grab a subset of records that meet a certain criteria. You can do this by indexing the DataFrame much like you've seen done with a NumPy.ndarray.

In [1]:
import os
import pandas as pd

# Pop out a quick sanity check
len(users)

Out[1]:
475

CashBox uses a referral system, everyone you refer will earn you \$5 credit. Let's see if we can find everyone who has not yet taken advantage of that deal. The number of referrals a user has made is defined in the referral_count column.

In [2]:
# This vectorized comparison returns a new Series ...
#   We are naming it so we can use it later.
no_referrals_index = users['referral_count'] < 1
# See how the boolean Series returned includes all rows from the DataFrame.
#  The value is the result of each comparison

Out[2]:
aaron            False
acook            False
Name: referral_count, dtype: bool

Using the boolean Series we just created, no_referrals_index, we can retrieve all rows where that comparison was True.

In [3]:
users[no_referrals_index].head()

Out[3]:
first_name last_name email email_verified signup_date referral_count balance
alan9443 Alan Pope [email protected] True 2018-04-17 0 56.09
andrew.alvarez Andrew Alvarez [email protected] False 2018-08-01 0 81.66
boyer7005 Sara Boyer [email protected] True 2018-07-31 0 91.41
brandon.gilbert Brandon Gilbert [email protected] True 2018-04-28 0 10.17
brooke2027 Brooke NaN [email protected] False 2018-05-23 0 7.22

## Inversed index¶

A handy shortcut is to prefix the index with a ~ (tilde). This returns the inverse of the boolean Series. While I wish that the ~ was called "the opposite day" operator, it is in fact called bitwise not operator.

In [4]:
# Careful, double negative here. We don't need no education.

Out[4]:
aaron            True
acook            True
Name: referral_count, dtype: bool
In [5]:
# Use the inverse of the index to find where referral values DO NOT equal zero

Out[5]:
first_name last_name email email_verified signup_date referral_count balance
aaron Aaron Davis [email protected] True 2018-08-31 6 18.14
acook Anthony Cook [email protected] True 2018-05-12 2 55.45

## In loc¶

Boolean Series as an index may also be used as an index the DataFrame.loc object.

In [6]:
# Select rows where there are no referrals, and select only the following ordered columns

Out[6]:
balance email
alan9443 56.09 [email protected]
andrew.alvarez 81.66 [email protected]
boyer7005 91.41 [email protected]
brandon.gilbert 10.17 [email protected]
brooke2027 7.22 [email protected]

It is also possible to do the comparison inline, without storing the index in a variable.

In [7]:
users[users['referral_count'] == 0].head()

Out[7]:
first_name last_name email email_verified signup_date referral_count balance
alan9443 Alan Pope [email protected] True 2018-04-17 0 56.09
andrew.alvarez Andrew Alvarez [email protected] False 2018-08-01 0 81.66
boyer7005 Sara Boyer [email protected] True 2018-07-31 0 91.41
brandon.gilbert Brandon Gilbert [email protected] True 2018-04-28 0 10.17
brooke2027 Brooke NaN [email protected] False 2018-05-23 0 7.22

Just like a NumPy ndarray, it's possible for a boolean Series to be compared to another boolean Series using bitwise operators.

Don't forget to surround your expressions with parenthesis to control the order of operations.

In [8]:
# Select all users where they haven't made a referral AND their email has been verified
users[(users['referral_count'] == 0) & (users['email_verified'] == True)].head()

Out[8]:
first_name last_name email email_verified signup_date referral_count balance
alan9443 Alan Pope [email protected] True 2018-04-17 0 56.09
boyer7005 Sara Boyer [email protected] True 2018-07-31 0 91.41
brandon.gilbert Brandon Gilbert [email protected] True 2018-04-28 0 10.17
bryant Darlene Bryant [email protected] True 2018-07-19 0 36.91
calvin.perez Calvin Perez [email protected] True 2018-02-17 0 13.01