# Setup
from datetime import datetime
import os
import numpy as np
import pandas as pd
from utils import render
users = pd.read_csv(os.path.join('data', 'users.csv'), index_col=0)
transactions = pd.read_csv(os.path.join('data', 'transactions.csv'), index_col=0)
requests = pd.read_csv(os.path.join('data', 'requests.csv'), index_col=0)
# Perform the merge from the previous notebook
# (s2n06-combining-dataframes.ipynb)
successful_requests = requests.merge(
    transactions,
    left_on=['from_user', 'to_user', 'amount'],
    right_on=['receiver', 'sender', 'amount']
)
# Statement from previous notebook
"Wow! ${:,.2f} has passed through the request system in {} transactions!!!".format(
successful_requests.amount.sum(),
len(successful_requests),
)
'Wow! $10,496.47 has passed through the request system in 214 transactions!!!'
We realized in the previous notebook (s2n06-combining-dataframes.ipynb) that the `requests` DataFrame had duplicates. Unfortunately, that means our `successful_requests` DataFrame also contains duplicates, because we merged each of those duplicated requests with the same transaction, even though in actuality only one of those requests should be deemed "successful".

We should correct our DataFrame by removing the duplicate requests, keeping only the last one, as that is really the one that triggered the actual transaction. The great news is that there is a method named `drop_duplicates` that does just that.

Like `duplicated`, it has a `keep` parameter that works similarly: you tell it which of the duplicates to keep.
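Before applying this to our data, here is a minimal sketch of how `keep` behaves, using a tiny made-up frame (the names and amounts are purely illustrative):

```python
import pandas as pd

# Hypothetical mini-frame with one repeated request
df = pd.DataFrame({
    'from_user': ['amy', 'amy', 'bob'],
    'to_user':   ['bob', 'bob', 'amy'],
    'amount':    [5.0, 5.0, 3.0],
})

# keep='last' marks every copy EXCEPT the last one as a duplicate
print(df.duplicated(keep='last').tolist())   # [True, False, False]

# drop_duplicates with the same keep value retains only the last copy
print(len(df.drop_duplicates(keep='last')))  # 2
```

The default is `keep='first'`, which would instead mark the later copies as duplicates.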
# Let's get our records sorted chronologically
successful_requests.sort_values('request_date', inplace=True)

# And then we'll drop duplicates, keeping only the last one.
# Note the use of the inplace keyword
successful_requests.drop_duplicates(['from_user', 'to_user', 'amount'], keep='last', inplace=True)
# Statement from previous notebook
"Wow! ${:,.2f} has passed through the request system in {} transactions!!!".format(
successful_requests.amount.sum(),
len(successful_requests),
)
'Wow! $9,316.12 has passed through the request system in 191 transactions!!!'
Yikes, that little bit of difference can be pretty important, especially if that data is being used to make a decision. Always make sure to check for duplicates in situations where you are merging `DataFrame`s together.
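One way to build that habit is a quick pre-merge sanity check. This is a sketch on made-up frames, showing how an unnoticed duplicate inflates a merge result:

```python
import pandas as pd

# Hypothetical frames: 'left' accidentally carries a duplicated row
left = pd.DataFrame({'key': ['a', 'a', 'b'], 'val': [1, 1, 2]})
right = pd.DataFrame({'key': ['a', 'b'], 'other': [10, 20]})

# Check for fully duplicated rows before merging
print(left.duplicated().any())   # True -- deduplicate first!

# Without deduplication the merge yields 3 rows; with it, 2
merged = left.drop_duplicates().merge(right, on='key')
print(len(merged))               # 2
```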
## isin

Now that we have seen how successful the requests have been, it would be nice to locate all the users who have not yet made a request to another user. Basically, we want to ask, for each user in the `users` DataFrame, whether they appear in the `requests` DataFrame as a `from_user`.
The method `isin` is available on a `Series`, a `DataFrame`, and even an `Index`. It takes an iterable and returns a boolean index of whether or not each value is contained in the owning `Series` or `DataFrame`.
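Here is a minimal sketch of that return value, using made-up names:

```python
import pandas as pd

# A tiny, hypothetical Series of usernames
s = pd.Series(['amy', 'bob', 'cat'])

# isin returns a boolean mask the same length as the Series
mask = s.isin(['bob', 'dan'])
print(mask.tolist())      # [False, True, False]
print(s[mask].tolist())   # ['bob']

# The same method exists on an Index (returns a NumPy boolean array)
idx = pd.Index(['amy', 'bob'])
print(idx.isin(['bob']).tolist())   # [False, True]
```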
# Create a boolean array where we check to see if the label (username)
# is in the `from_user` Series.
made_request_index = users.index.isin(requests.from_user)
# This will get us a list of all users who **have** made a request
users[made_request_index].head()
| | first_name | last_name | email | email_verified | signup_date | referral_count | balance |
|---|---|---|---|---|---|---|---|
| aaron | Aaron | Davis | aaron6348@gmail.com | True | 2018-08-31 | 6 | 18.14 |
| acook | Anthony | Cook | cook@gmail.com | True | 2018-05-12 | 2 | 55.45 |
| adam.saunders | Adam | Saunders | adam@gmail.com | False | 2018-05-29 | 3 | 72.12 |
| adrian | Adrian | Yang | adrian.yang@teamtreehouse.com | True | 2018-04-28 | 3 | 30.01 |
| adrian.blair | Adrian | Blair | adrian9335@gmail.com | True | 2018-06-16 | 7 | 25.85 |
Because we have a boolean index, we can negate it using the `~` bitwise operator to get what we were looking for: users who have not yet made a request.
users[~made_request_index].head()
| | first_name | last_name | email | email_verified | signup_date | referral_count | balance |
|---|---|---|---|---|---|---|---|
| alan9443 | Alan | Pope | pope@hotmail.com | True | 2018-04-17 | 0 | 56.09 |
| alvarado | Denise | Alvarado | alvarado@hotmail.com | True | 2018-09-07 | 6 | 26.72 |
| amiller | Anne | Miller | miller@hotmail.com | False | 2018-06-02 | 5 | 86.28 |
| andersen | Mark | Andersen | mark.andersen@yahoo.com | True | 2018-08-21 | 3 | 75.69 |
| andrade | Melissa | Andrade | mandrade@yahoo.com | True | 2018-01-06 | 3 | 83.22 |
As I was looking at these people who hadn't made requests, I noticed that a few of them had a NaN (Not A Number) for a `last_name`.

We can get a quick overview of how many values are missing by using `DataFrame.count`, which returns the number of non-null values in each column.
users.count()
first_name        475
last_name         430
email             475
email_verified    475
signup_date       475
referral_count    475
balance           475
dtype: int64
`Series.isna` is a method that allows you to check for missing values, such as `None` or `np.nan`. It returns a boolean array which can be used as an index.
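A quick sketch of that behavior on a made-up Series, showing that both `None` and `np.nan` count as missing:

```python
import numpy as np
import pandas as pd

# Hypothetical last names with two kinds of missing value
s = pd.Series(['Davis', None, np.nan, 'Cook'])

# isna flags both None and np.nan
print(s.isna().tolist())       # [False, True, True, False]

# Negating the mask keeps only the present values
print(s[~s.isna()].tolist())   # ['Davis', 'Cook']
```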
# Gather all users where the last name is missing
users[users.last_name.isna()].head()
| | first_name | last_name | email | email_verified | signup_date | referral_count | balance |
|---|---|---|---|---|---|---|---|
| anthony3761 | Anthony | NaN | anthony9581@gmail.com | True | 2018-08-16 | 6 | 59.81 |
| bradley | Bradley | NaN | bradley3941@henderson.com | True | 2018-06-24 | 4 | 22.24 |
| bradley7808 | Bradley | NaN | bradley8794@yahoo.com | True | 2018-06-15 | 1 | 76.49 |
| brenda | Brenda | NaN | brenda@hotmail.com | True | 2018-07-28 | 7 | 6.18 |
| brooke2027 | Brooke | NaN | brooke6938@gmail.com | False | 2018-05-23 | 0 | 7.22 |
# Make a copy of the DataFrame, filling missing values with "Unknown".
# (fillna touches every column; last_name is the only one with missing values here)
users_with_unknown = users.fillna('Unknown')
# Make sure we got 'em all
users_with_unknown[users_with_unknown.last_name.isna()]
| | first_name | last_name | email | email_verified | signup_date | referral_count | balance |
|---|---|---|---|---|---|---|---|

*(0 rows)*
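Note that calling `fillna` on the whole DataFrame replaces missing values in every column. If only one column should change, you can fill just that `Series`. A sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Hypothetical rows: last_name should be filled, balance left alone
df = pd.DataFrame({'first_name': ['Brenda', 'Bradley'],
                   'last_name': [np.nan, 'Henderson'],
                   'balance': [6.18, np.nan]})

# Fill only the last_name column
df['last_name'] = df['last_name'].fillna('Unknown')

print(df['last_name'].tolist())    # ['Unknown', 'Henderson']
print(df['balance'].isna().sum())  # 1 -- other columns untouched
```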
Another option is to just drop rows that contain missing data, using the `DataFrame.dropna` method. Use the `inplace` parameter to update the existing `DataFrame`.

In this situation it's a little drastic, but you can imagine cases where missing data makes what you are trying to accomplish invalid.
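If dropping every row with any missing value feels too aggressive, `dropna` also accepts a `subset` parameter to consider only certain columns. A sketch on a made-up frame:

```python
import numpy as np
import pandas as pd

# Hypothetical frame where different columns are missing in different rows
df = pd.DataFrame({'first_name': ['Anne', np.nan],
                   'last_name': [np.nan, 'Miller']})

# Default: drop any row containing at least one missing value
print(len(df.dropna()))                       # 0

# subset: only last_name decides whether a row is dropped
print(len(df.dropna(subset=['last_name'])))   # 1
```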
users_with_last_names = users.dropna()

# Compare row counts of the original and the cleaned copy
(len(users), len(users_with_last_names))
(475, 430)