# Setup
from datetime import datetime
import os
import numpy as np
import pandas as pd
from utils import render
users = pd.read_csv(os.path.join('data', 'users.csv'), index_col=0)
transactions = pd.read_csv(os.path.join('data', 'transactions.csv'), index_col=0)
requests = pd.read_csv(os.path.join('data', 'requests.csv'), index_col=0)
# Perform the merge from the previous notebook
# (s2n06-combining-dataframes.ipynb)
successful_requests = requests.merge(
    transactions,
    left_on=['from_user', 'to_user', 'amount'],
    right_on=['receiver', 'sender', 'amount']
)
# Statement from previous notebook
"Wow! ${:,.2f} has passed through the request system in {} transactions!!!".format(
successful_requests.amount.sum(),
len(successful_requests),
)
'Wow! $10,496.47 has passed through the request system in 214 transactions!!!'
We realized in the previous notebook (s2n06-combining-dataframes.ipynb) that the `requests` DataFrame had duplicates. Unfortunately, that means our `successful_requests` DataFrame also contains duplicates, because we merged each of those duplicated requests with the same transaction, even though in actuality only one of those requests should be deemed "successful".

We should correct our DataFrame by removing the duplicate requests, keeping only the last one, as that is really the one that triggered the actual transaction. The great news is that there is a method named `drop_duplicates` that does just that.

Like `duplicated`, it has a `keep` parameter that works similarly: you tell it which of the duplicates to keep.
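Before applying this to our data, here is a minimal sketch of how `keep` behaves, using a tiny made-up frame (the names and amounts are purely illustrative):

```python
import pandas as pd

# Hypothetical mini-frame with one repeated request
df = pd.DataFrame({
    'from_user': ['amy', 'amy', 'bob'],
    'to_user':   ['bob', 'bob', 'amy'],
    'amount':    [5.0, 5.0, 3.0],
})

# keep='last' marks every copy EXCEPT the last one as a duplicate
print(df.duplicated(keep='last').tolist())   # [True, False, False]

# drop_duplicates with the same keep value retains only the last copy
print(len(df.drop_duplicates(keep='last')))  # 2
```

The default is `keep='first'`, which would instead mark the later copies as duplicates.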
# Let's get our records sorted chronologically
successful_requests.sort_values('request_date', inplace=True)

# And then we'll drop duplicates, keeping only the last one.
# Note the use of the inplace keyword
successful_requests.drop_duplicates(['from_user', 'to_user', 'amount'], keep='last', inplace=True)
# Statement from previous notebook
"Wow! ${:,.2f} has passed through the request system in {} transactions!!!".format(
successful_requests.amount.sum(),
len(successful_requests),
)
'Wow! $9,316.12 has passed through the request system in 191 transactions!!!'
Yikes, that little bit of difference can be pretty important, especially if that data is being used to make a decision. Always make sure to check for duplicates in situations where you are merging `DataFrame`s together.
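One way to build that habit is a quick pre-merge sanity check. This is a sketch on made-up frames, showing how an unnoticed duplicate inflates a merge result:

```python
import pandas as pd

# Hypothetical frames: 'left' accidentally carries a duplicated row
left = pd.DataFrame({'key': ['a', 'a', 'b'], 'val': [1, 1, 2]})
right = pd.DataFrame({'key': ['a', 'b'], 'other': [10, 20]})

# Check for fully duplicated rows before merging
print(left.duplicated().any())   # True -- deduplicate first!

# Without deduplication the merge yields 3 rows; with it, 2
merged = left.drop_duplicates().merge(right, on='key')
print(len(merged))               # 2
```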
## isin

Now that we have seen how successful the requests have been, it would be nice to locate all the users who have not yet made a request to another user. Basically, we want to ask, for each user in the `users` DataFrame, whether they appear in the `requests` DataFrame as a `from_user`.
The method `isin` is available on a `Series`, a `DataFrame`, and even an `Index`. It takes an iterable and returns a boolean index of whether or not each value is contained in the owning `Series` or `DataFrame`.
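Here is a minimal sketch of that return value, using made-up names:

```python
import pandas as pd

# A tiny, hypothetical Series of usernames
s = pd.Series(['amy', 'bob', 'cat'])

# isin returns a boolean mask the same length as the Series
mask = s.isin(['bob', 'dan'])
print(mask.tolist())      # [False, True, False]
print(s[mask].tolist())   # ['bob']

# The same method exists on an Index (returns a NumPy boolean array)
idx = pd.Index(['amy', 'bob'])
print(idx.isin(['bob']).tolist())   # [False, True]
```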
# Create a boolean array where we check to see if the label (username)
# is in the `from_user` Series.
made_request_index = users.index.isin(requests.from_user)
# This will get us a list of all users who **have** made a request
users[made_request_index].head()
| | first_name | last_name | email | email_verified | signup_date | referral_count | balance |
|---|---|---|---|---|---|---|---|
| aaron | Aaron | Davis | aaron6348@gmail.com | True | 2018-08-31 | 6 | 18.14 |
| acook | Anthony | Cook | cook@gmail.com | True | 2018-05-12 | 2 | 55.45 |
| adam.saunders | Adam | Saunders | adam@gmail.com | False | 2018-05-29 | 3 | 72.12 |
| adrian | Adrian | Yang | adrian.yang@teamtreehouse.com | True | 2018-04-28 | 3 | 30.01 |
| adrian.blair | Adrian | Blair | adrian9335@gmail.com | True | 2018-06-16 | 7 | 25.85 |
Because we have a boolean index, we can negate it using the `~` bitwise operator to get what we were looking for: users who have not yet made a request.
users[~made_request_index].head()
| | first_name | last_name | email | email_verified | signup_date | referral_count | balance |
|---|---|---|---|---|---|---|---|
| alan9443 | Alan | Pope | pope@hotmail.com | True | 2018-04-17 | 0 | 56.09 |
| alvarado | Denise | Alvarado | alvarado@hotmail.com | True | 2018-09-07 | 6 | 26.72 |
| amiller | Anne | Miller | miller@hotmail.com | False | 2018-06-02 | 5 | 86.28 |
| andersen | Mark | Andersen | mark.andersen@yahoo.com | True | 2018-08-21 | 3 | 75.69 |
| andrade | Melissa | Andrade | mandrade@yahoo.com | True | 2018-01-06 | 3 | 83.22 |
As I was looking at these people who hadn't made requests, I noticed that a few of them had a NaN (Not A Number) for a `last_name`.

We can get a quick overview of how many values are missing by using `DataFrame.count`, which returns the number of non-null values in each column.
users.count()
first_name        475
last_name         430
email             475
email_verified    475
signup_date       475
referral_count    475
balance           475
dtype: int64
`Series.isna` is a method that allows you to check for missing values, such as `None` or `np.nan`. It returns a boolean array which can be used as an index.
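A quick sketch of that behavior on a made-up Series, showing that both `None` and `np.nan` count as missing:

```python
import numpy as np
import pandas as pd

# Hypothetical last names with two kinds of missing value
s = pd.Series(['Davis', None, np.nan, 'Cook'])

# isna flags both None and np.nan
print(s.isna().tolist())       # [False, True, True, False]

# Negating the mask keeps only the present values
print(s[~s.isna()].tolist())   # ['Davis', 'Cook']
```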
# Gather all users where the last name is missing
users[users.last_name.isna()].head()
| | first_name | last_name | email | email_verified | signup_date | referral_count | balance |
|---|---|---|---|---|---|---|---|
| anthony3761 | Anthony | NaN | anthony9581@gmail.com | True | 2018-08-16 | 6 | 59.81 |
| bradley | Bradley | NaN | bradley3941@henderson.com | True | 2018-06-24 | 4 | 22.24 |
| bradley7808 | Bradley | NaN | bradley8794@yahoo.com | True | 2018-06-15 | 1 | 76.49 |
| brenda | Brenda | NaN | brenda@hotmail.com | True | 2018-07-28 | 7 | 6.18 |
| brooke2027 | Brooke | NaN | brooke6938@gmail.com | False | 2018-05-23 | 0 | 7.22 |
# Make a copy of the DataFrame, filling missing values with "Unknown".
# (fillna touches every column; last_name is the only one with missing values here)
users_with_unknown = users.fillna('Unknown')
# Make sure we got 'em all
users_with_unknown[users_with_unknown.last_name.isna()]
| | first_name | last_name | email | email_verified | signup_date | referral_count | balance |
|---|---|---|---|---|---|---|---|

*(0 rows)*
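Note that calling `fillna` on the whole DataFrame replaces missing values in every column. If only one column should change, you can fill just that `Series`. A sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Hypothetical rows: last_name should be filled, balance left alone
df = pd.DataFrame({'first_name': ['Brenda', 'Bradley'],
                   'last_name': [np.nan, 'Henderson'],
                   'balance': [6.18, np.nan]})

# Fill only the last_name column
df['last_name'] = df['last_name'].fillna('Unknown')

print(df['last_name'].tolist())    # ['Unknown', 'Henderson']
print(df['balance'].isna().sum())  # 1 -- other columns untouched
```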
Another option is to just drop rows that contain missing data, using the `DataFrame.dropna` method. Use the `inplace` parameter to update the existing `DataFrame`.

In this situation it's a little drastic, but you can imagine cases where missing data makes what you are trying to accomplish invalid.
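If dropping every row with any missing value feels too aggressive, `dropna` also accepts a `subset` parameter to consider only certain columns. A sketch on a made-up frame:

```python
import numpy as np
import pandas as pd

# Hypothetical frame where different columns are missing in different rows
df = pd.DataFrame({'first_name': ['Anne', np.nan],
                   'last_name': [np.nan, 'Miller']})

# Default: drop any row containing at least one missing value
print(len(df.dropna()))                       # 0

# subset: only last_name decides whether a row is dropped
print(len(df.dropna(subset=['last_name'])))   # 1
```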
users_with_last_names = users.dropna()

# Compare row counts of the original and the cleaned copy
(len(users), len(users_with_last_names))
(475, 430)