Combining DataFrames

CashBox has provided us with several separate CSV files. Let's take a look at two of those files: transactions.csv and requests.csv. A request is made in the application when one user requests cash from another. A request is not required for a transaction to occur.

Let's see if we can get a feel for how many successful requests and payments have been made. To do this, we will need to combine the two DataFrames.

In [1]:
# Setup
from datetime import datetime
import os

import pandas as pd

from utils import render

transactions = pd.read_csv(os.path.join('data', 'transactions.csv'), index_col=0)
requests = pd.read_csv(os.path.join('data', 'requests.csv'), index_col=0)

# Pop out a quick sanity check
(transactions.shape, requests.shape)
Out[1]:
((998, 4), (313, 4))

Let's explore our data real quick by taking a look at the first couple of rows in each DataFrame.

In [2]:
transactions.head(2)
Out[2]:
sender receiver amount sent_date
0 stein smoyer 49.03 2018-01-24
1 holden4580 joshua.henry 34.64 2018-02-06
In [3]:
requests.head(2)
Out[3]:
from_user to_user amount request_date
0 chad.chen paula7980 78.61 2018-02-12
1 kallen lmoore 1.94 2018-02-23

I'd like to see all the requests that have a matching transaction based on the users and the amount involved.

To do this, we will merge our two datasets.

We'll create a new dataset by using the DataFrame.merge method.

In [4]:
# Since we are calling merge on the `requests` DataFrame it is considered the left side
successful_requests = requests.merge(
    # And transactions is the right side
    transactions, 
    # So now we line up columns that will make the join make sense.
    left_on=['from_user', 'to_user', 'amount'], 
    right_on=['receiver', 'sender', 'amount']
)
# Let's take a look and see how we did
successful_requests.head()
Out[4]:
from_user to_user amount request_date sender receiver sent_date
0 chad.chen paula7980 78.61 2018-02-12 paula7980 chad.chen 2018-07-15
1 kallen lmoore 1.94 2018-02-23 lmoore kallen 2018-03-05
2 gregory.blackwell rodriguez5768 30.57 2018-03-04 rodriguez5768 gregory.blackwell 2018-03-17
3 kristina.miller john.hardy 77.05 2018-03-12 john.hardy kristina.miller 2018-04-25
4 lacey8987 mcguire 54.09 2018-03-13 mcguire lacey8987 2018-06-28

Note that since the amount column has the same name on both sides of the merge, only one amount column appears in the result.
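By contrast, when two DataFrames share a column name that is not part of the join keys, pandas keeps both copies and distinguishes them with suffixes (by default `_x` and `_y`). Here's a minimal sketch using throwaway DataFrames (the names `left`, `right`, and `note` are just for illustration):

```python
import pandas as pd

left = pd.DataFrame({'key': [1, 2], 'note': ['a', 'b']})
right = pd.DataFrame({'key': [1, 2], 'note': ['c', 'd']})

# 'note' is shared but is not a join key, so pandas suffixes both copies
merged = left.merge(right, on='key')
print(list(merged.columns))  # ['key', 'note_x', 'note_y']
```

You can control those suffixes with the `suffixes` parameter of `merge` if `_x`/`_y` aren't descriptive enough.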

Gather Insights

So looking at this data merged together, I'd like to see the time difference between when the request was made, and when the money was actually received.

Good news for us, pandas has very powerful date/time functionality, but in order to get there we're going to need to convert our columns. As you can see, the CSV import did not recognize our date fields: sent_date and request_date are just plain old objects.

In [5]:
successful_requests.dtypes
Out[5]:
from_user        object
to_user          object
amount          float64
request_date     object
sender           object
receiver         object
sent_date        object
dtype: object

You can convert columns (which remember are a Series) on CSV import or just when you need them by using the pandas.to_datetime method. There are actually quite a few conversion methods.
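For the on-import route, read_csv accepts a parse_dates parameter listing the columns to convert. A quick sketch, using an inline string in place of our actual CSV file (the single row here is just illustrative):

```python
from io import StringIO

import pandas as pd

# Stand-in for data/transactions.csv
csv = StringIO("sender,receiver,amount,sent_date\nstein,smoyer,49.03,2018-01-24\n")

# parse_dates converts the named columns during import
df = pd.read_csv(csv, parse_dates=['sent_date'])
print(df.dtypes['sent_date'])
```

Converting at import time saves the separate to_datetime step, but doing it afterwards works just as well.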

In [6]:
successful_requests['request_date'] = pd.to_datetime(successful_requests['request_date'])
successful_requests['sent_date'] = pd.to_datetime(successful_requests['sent_date'])
# And now we can see they are converted
successful_requests.dtypes
Out[6]:
from_user               object
to_user                 object
amount                 float64
request_date    datetime64[ns]
sender                  object
receiver                object
sent_date       datetime64[ns]
dtype: object

Now that we have dates, we can subtract them (vectorization ftw!) to create a timedelta. Let's create a new column called time_passed that stores the result.

In [7]:
successful_requests['time_passed'] = successful_requests.sent_date - successful_requests.request_date
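Subtracting two datetime columns yields a Series of Timedeltas, and the `.dt` accessor lets you pull out components like whole days. A small sketch with made-up dates (borrowing a couple of values from the rows above):

```python
import pandas as pd

dates = pd.DataFrame({
    'request_date': pd.to_datetime(['2018-02-12', '2018-02-23']),
    'sent_date': pd.to_datetime(['2018-07-15', '2018-03-05']),
})

# Vectorized subtraction produces a timedelta64 Series
delta = dates['sent_date'] - dates['request_date']
print(delta.dt.days.tolist())  # [153, 10]
```

The `.dt.days` accessor is handy when you want a plain integer column for filtering or plotting rather than a Timedelta.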

Now, let's take a look at the 5 longest gaps between request and successful transaction by sorting and limiting, to get a vibe.

In [8]:
successful_requests.sort_values(by='time_passed', ascending=False).head(5)
Out[8]:
from_user to_user amount request_date sender receiver sent_date time_passed
0 chad.chen paula7980 78.61 2018-02-12 paula7980 chad.chen 2018-07-15 153 days
33 sthompson andrade 14.07 2018-05-09 andrade sthompson 2018-09-21 135 days
4 lacey8987 mcguire 54.09 2018-03-13 mcguire lacey8987 2018-06-28 107 days
53 marcus.berry melissa.mendoza 71.48 2018-05-31 melissa.mendoza marcus.berry 2018-09-06 98 days
39 bishop massey2102 18.27 2018-05-16 massey2102 bishop 2018-08-15 91 days

Hmm, that makes me wonder how much money has passed through the request and transaction system. Let's see.

In [9]:
"Wow! ${:,.2f} has passed through the request system in {} transactions!!!".format(
    successful_requests.amount.sum(),
    len(successful_requests),
)
Out[9]:
'Wow! $10,496.47 has passed through the request system in 214 transactions!!!'

Further research

I saw something a little strange as I was looking through those successful_requests: I noticed a couple of what seemed like duplicated requests. I called my contact at CashBox and asked about possible duplicate requests. Sure enough, the application allows you to send multiple requests for the same amount.

So this means there are probably duplicates in our successful_requests DataFrame because there are duplicates in the requests. There is most likely only one transaction that fulfills the request, but there could be multiple requests that match. Our merge brought that duplication across as well.
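To see why the merge fans out like that, here's a tiny sketch with made-up users: two identical requests and a single transaction that fulfills one of them. The inner join matches the one transaction against both request rows:

```python
import pandas as pd

# Two identical requests from 'a' to 'b', but only one fulfilling transaction
requests = pd.DataFrame({
    'from_user': ['a', 'a'],
    'to_user': ['b', 'b'],
    'amount': [5.0, 5.0],
})
transactions = pd.DataFrame({
    'sender': ['b'],
    'receiver': ['a'],
    'amount': [5.0],
})

merged = requests.merge(
    transactions,
    left_on=['from_user', 'to_user', 'amount'],
    right_on=['receiver', 'sender', 'amount'],
)
print(len(merged))  # 2 -- the single transaction matched both duplicate requests
```

So one real payment can show up multiple times in successful_requests, inflating counts and sums.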

Let's explore the possible duplicates in the requests DataFrame. There is a method, DataFrame.duplicated, that returns a boolean Series which we can use as an index. A keep parameter lets you choose which of the duplicated rows to mark as duplicates: the first, the last, or all of them.
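As a quick illustration of the keep parameter on a toy DataFrame (the column names here are just for the example), where the first two rows are identical:

```python
import pandas as pd

df = pd.DataFrame({'user': ['a', 'a', 'b'], 'amount': [5.0, 5.0, 7.0]})

# keep='first' (the default) marks everything after the first occurrence
print(df.duplicated(keep='first').tolist())  # [False, True, False]
# keep='last' marks everything before the last occurrence
print(df.duplicated(keep='last').tolist())   # [True, False, False]
# keep=False marks every row that has a duplicate anywhere
print(df.duplicated(keep=False).tolist())    # [True, True, False]
```

We'll use keep=False below because we want to see every member of each duplicated group, not just the extras.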

In [10]:
# Create a boolean Series of records that are duplicated. 
#   Note that `keep=False` marks all that are duplicated
dupes = requests[requests.duplicated(('from_user', 'to_user', 'amount'), keep=False)]
# Order by requester and the date of request. 
#   Note that `request_date` in this case is a string, but this string date format sorts properly still.
dupes.sort_values(['from_user', 'request_date'])
Out[10]:
from_user to_user amount request_date
58 austin486 shelly 11.24 2018-05-29
59 austin486 shelly 11.24 2018-05-29
8 cjimenez sarah.evans 48.14 2018-03-21
26 cjimenez sarah.evans 48.14 2018-04-27
218 clark8139 moore 14.54 2018-08-31
240 clark8139 moore 14.54 2018-09-10
195 diane4652 dean2365 6.82 2018-08-21
224 diane4652 dean2365 6.82 2018-09-05
143 donna1922 danderson 79.22 2018-07-23
157 donna1922 danderson 79.22 2018-07-31
116 edwards8658 johnston2281 47.29 2018-07-10
127 edwards8658 johnston2281 47.29 2018-07-16
289 heather smercer 98.13 2018-09-22
290 heather smercer 98.13 2018-09-22
32 hoover4302 dennis 87.77 2018-05-04
33 hoover4302 dennis 87.77 2018-05-05
160 jcoleman fuentes 40.48 2018-08-03
216 jcoleman fuentes 40.48 2018-08-29
82 jones jones5796 95.88 2018-06-21
95 jones jones5796 95.88 2018-06-27
84 kimberly wbrown 9.02 2018-06-24
101 kimberly wbrown 9.02 2018-06-28
60 kirk jason 36.53 2018-05-30
85 kirk jason 36.53 2018-06-24
165 lawrence1685 joshua9690 99.96 2018-08-07
168 lawrence1685 joshua9690 99.96 2018-08-08
114 lspencer sarah 44.89 2018-07-09
194 lspencer sarah 44.89 2018-08-21
44 matthew7940 sherry6113 58.72 2018-05-13
108 matthew7940 sherry6113 58.72 2018-07-04
201 michael7792 jennifer 90.56 2018-08-23
241 michael7792 jennifer 90.56 2018-09-10
75 nicholas.travis corey 73.25 2018-06-15
109 nicholas.travis corey 73.25 2018-07-04
307 patricia martha6969 87.33 2018-09-25
312 patricia martha6969 87.33 2018-09-25
10 paula7980 mackenzie 56.00 2018-03-29
15 paula7980 mackenzie 56.00 2018-04-15
68 paula7980 mackenzie 56.00 2018-06-02
221 phillips5654 miguel.gamble 11.35 2018-09-01
260 phillips5654 miguel.gamble 11.35 2018-09-15
139 rebecca848 mpaul 1.34 2018-07-21
229 rebecca848 mpaul 1.34 2018-09-06
115 rodriguez9475 miller8552 25.89 2018-07-09
121 rodriguez9475 miller8552 25.89 2018-07-13

We should get those duplicates out of our successful requests. Let's take a look at some more tools that will help us with cleanup like this.
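One tool we're likely to reach for is DataFrame.drop_duplicates, which removes the rows that duplicated would have marked. A minimal sketch on a toy DataFrame (not our CashBox data):

```python
import pandas as pd

df = pd.DataFrame({'from_user': ['a', 'a', 'b'], 'amount': [5.0, 5.0, 7.0]})

# keep='first' (the default) keeps one copy of each duplicated row
deduped = df.drop_duplicates()
print(len(deduped))  # 2
```

Like duplicated, it accepts a subset of columns and a keep parameter, so we can tailor exactly which rows survive.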