Handling Duplicated and Missing Data

In [1]:
# Setup
from datetime import datetime
import os

import numpy as np
import pandas as pd

from utils import render

users = pd.read_csv(os.path.join('data', 'users.csv'), index_col=0)
transactions = pd.read_csv(os.path.join('data', 'transactions.csv'), index_col=0)
requests = pd.read_csv(os.path.join('data', 'requests.csv'), index_col=0)

# Perform the merge from the previous notebook 
#  (s2n06-combining-dataframes.ipynb)
successful_requests = requests.merge(
    transactions,
    left_on=['from_user', 'to_user', 'amount'], 
    right_on=['receiver', 'sender', 'amount']
)

# Statement from previous notebook
"Wow! ${:,.2f} has passed through the request system in {} transactions!!!".format(
    successful_requests.amount.sum(),
    len(successful_requests),
)
Out[1]:
'Wow! $10,496.47 has passed through the request system in 214 transactions!!!'

Duplicated Data

We realized in the previous notebook (s2n06-combining-dataframes.ipynb) that the requests DataFrame had duplicates. Unfortunately, this means that our successful_requests DataFrame also contains duplicates, because each of those duplicated requests was merged against the same transaction, even though in actuality only one of them should be deemed "successful".

We should correct our DataFrame by removing the duplicate requests, keeping only the last one, as that is really the one that triggered the actual transaction. The great news is that there is a method named drop_duplicates that does just that.

Like duplicated, drop_duplicates has a keep parameter that works similarly: you tell it which of the duplicates to keep.
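
To get a feel for what the keep options do, here is a small sketch on a throwaway DataFrame (the data below is made up purely for illustration):

# Hypothetical data, just to show the keep options
dupes = pd.DataFrame({'from_user': ['kim', 'kim', 'bob'], 'amount': [5.0, 5.0, 2.0]})

dupes.drop_duplicates(keep='first')  # keeps rows 0 and 2 (the first of each duplicate group)
dupes.drop_duplicates(keep='last')   # keeps rows 1 and 2 (the last of each duplicate group)
dupes.drop_duplicates(keep=False)    # keeps only row 2 (drops every row that has a duplicate)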

In [2]:
# Let's get our records sorted chronologically
successful_requests.sort_values('request_date', inplace=True) 

# And then we'll drop dupes keeping only the last one. 
#  Note the use of the inplace keyword
successful_requests.drop_duplicates(('from_user', 'to_user', 'amount'), keep='last', inplace=True)

# Statement from previous notebook
"Wow! ${:,.2f} has passed through the request system in {} transactions!!!".format(
    successful_requests.amount.sum(),
    len(successful_requests),
)
Out[2]:
'Wow! $9,316.12 has passed through the request system in 191 transactions!!!'

Yikes, that little bit of difference can be pretty important, especially if that data is being used to make a decision. Always make sure to check for duplicates in situations where you are merging DataFrames together.
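
One way to make that check part of your routine (a sketch, using the same key columns as above) is to ask the duplicated method directly before you merge:

# Returns True if any request shares the same from_user/to_user/amount combination
requests.duplicated(['from_user', 'to_user', 'amount']).any()

# Or count how many rows are duplicates of an earlier row
requests.duplicated(['from_user', 'to_user', 'amount']).sum()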

Locating records not found in another DataFrame using isin

Now that we have seen how successful the requests have been, it would be nice to locate all the users who have not yet made a request to another user.

Basically, we would like to ask, for each user in the users DataFrame, whether they appear in the requests DataFrame as a from_user.

The method isin is available for Series, DataFrame, and even an Index. The method takes an iterable and returns a boolean array indicating, for each value in the owning object, whether or not it is contained in the iterable you passed.
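
As a quick sketch of the mechanics (the fruit values here are made up for illustration):

# isin checks membership element by element and returns a boolean array
fruit = pd.Series(['apple', 'banana', 'cherry'])
fruit.isin(['banana', 'cherry'])
# 0    False
# 1     True
# 2     True
# dtype: bool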

In [3]:
# Create a boolean array where we check to see if the label (username) 
#   is in the `from_user` Series.
made_request_index = users.index.isin(requests.from_user)
# This will get us a list of all users who **have** made a request
users[made_request_index].head()
Out[3]:
first_name last_name email email_verified signup_date referral_count balance
aaron Aaron Davis [email protected] True 2018-08-31 6 18.14
acook Anthony Cook [email protected] True 2018-05-12 2 55.45
adam.saunders Adam Saunders [email protected] False 2018-05-29 3 72.12
adrian Adrian Yang [email protected] True 2018-04-28 3 30.01
adrian.blair Adrian Blair [email protected] True 2018-06-16 7 25.85

Because we have a boolean index, we can negate it using the ~ bitwise operator to get what we were looking for: users who have not yet made a request.

In [4]:
users[~made_request_index].head()
Out[4]:
first_name last_name email email_verified signup_date referral_count balance
alan9443 Alan Pope [email protected] True 2018-04-17 0 56.09
alvarado Denise Alvarado [email protected] True 2018-09-07 6 26.72
amiller Anne Miller [email protected] False 2018-06-02 5 86.28
andersen Mark Andersen [email protected] True 2018-08-21 3 75.69
andrade Melissa Andrade [email protected] True 2018-01-06 3 83.22

Locating Missing Data

As I was looking at these people who hadn't made requests, I noticed that a few of them had a NaN (Not a Number) for a last_name.

We can get a quick overview of how many values are missing by using the DataFrame.count method, which returns the number of non-missing values in each column.

In [5]:
users.count()
Out[5]:
first_name        475
last_name         430
email             475
email_verified    475
signup_date       475
referral_count    475
balance           475
dtype: int64

Series.isna is a method that allows you to check for missing values, such as None or np.nan. It returns a boolean array which can be used as an index.

In [6]:
# Gather all users where the last name is missing
users[users.last_name.isna()].head()
Out[6]:
first_name last_name email email_verified signup_date referral_count balance
anthony3761 Anthony NaN [email protected] True 2018-08-16 6 59.81
bradley Bradley NaN [email protected] True 2018-06-24 4 22.24
bradley7808 Bradley NaN [email protected] True 2018-06-15 1 76.49
brenda Brenda NaN [email protected] True 2018-07-28 7 6.18
brooke2027 Brooke NaN [email protected] False 2018-05-23 0 7.22
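
If you would rather see the number of missing values per column directly (rather than subtracting counts), one option, not used above, is to combine the DataFrame-level isna with sum:

# True counts as 1 when summed, so this gives the number of missing values in each column
users.isna().sum()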

Filling Missing Values

You can also set all missing values to a specific value using the fillna method. By default it returns a new DataFrame; use the inplace parameter to update the existing one.

In [9]:
# Make a copy of the DataFrame with "Unknown" as the last name where it is missing
users_with_unknown = users.fillna('Unknown')

# Make sure we got 'em all
users_with_unknown[users_with_unknown.last_name.isna()]
Out[9]:
first_name last_name email email_verified signup_date referral_count balance
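
Note that calling fillna on the whole DataFrame, as above, replaces missing values in every column. If you only want to touch last_name, a narrower sketch (same idea, just scoped to one column) is to fill that Series and assign it back:

# Only fill the last_name column, leaving any other missing values alone
users_with_unknown = users.copy()
users_with_unknown['last_name'] = users_with_unknown.last_name.fillna('Unknown')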

Dropping Rows with Missing Data

Another option is to just drop rows that contain missing data, using the DataFrame.dropna method. Like fillna, it returns a new DataFrame by default; use the inplace parameter to update the existing DataFrame.

In this situation it's a little drastic, but you can imagine cases where missing data makes what you are trying to accomplish invalid.

In [10]:
users_with_last_names = users.dropna()

# Compare row counts of the original and the version without missing last names
(len(users), len(users_with_last_names))
Out[10]:
(475, 430)
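
Here every missing value happened to be in last_name, so a blanket dropna() works. When other columns might also have gaps, one way to limit the check (a sketch using dropna's subset parameter) is:

# Only drop rows where last_name specifically is missing
users_with_last_names = users.dropna(subset=['last_name'])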