CashBox has provided us with several separate CSV files. Let's take a look at two of them: `transactions.csv` and `requests.csv`. Requests are made in the application when one user requests cash from another. Requests are not required for a transaction to occur.

Let's see if we can get a feeling for how many successful requests and payments have been made. In order to do this we will need to combine the two `DataFrame`s.
# Setup
from datetime import datetime
import os
import pandas as pd
from utils import render
transactions = pd.read_csv(os.path.join('data', 'transactions.csv'), index_col=0)
requests = pd.read_csv(os.path.join('data', 'requests.csv'), index_col=0)
# Pop out a quick sanity check
(transactions.shape, requests.shape)
((998, 4), (313, 4))
Let's explore our data real quick by taking a look at the first couple of rows in each `DataFrame`.
transactions.head(2)
 | sender | receiver | amount | sent_date |
---|---|---|---|---|
0 | stein | smoyer | 49.03 | 2018-01-24 |
1 | holden4580 | joshua.henry | 34.64 | 2018-02-06 |
requests.head(2)
 | from_user | to_user | amount | request_date |
---|---|---|---|---|
0 | chad.chen | paula7980 | 78.61 | 2018-02-12 |
1 | kallen | lmoore | 1.94 | 2018-02-23 |
I'd like to see all the requests that have a matching transaction based on the users and the amount involved.
In order to do this we will merge both of our datasets together.
We'll create a new dataset by using the `DataFrame.merge` method.
# Since we are calling merge on the `requests` DataFrame it is considered the left side
successful_requests = requests.merge(
# And transactions is the right side
transactions,
# Line up the join keys: the requester (`from_user`) is the one who
# receives the money, so it matches the transaction's `receiver`,
# and `to_user` matches the `sender`.
left_on=['from_user', 'to_user', 'amount'],
right_on=['receiver', 'sender', 'amount']
)
# Let's take a look and see how we did
successful_requests.head()
 | from_user | to_user | amount | request_date | sender | receiver | sent_date |
---|---|---|---|---|---|---|---|
0 | chad.chen | paula7980 | 78.61 | 2018-02-12 | paula7980 | chad.chen | 2018-07-15 |
1 | kallen | lmoore | 1.94 | 2018-02-23 | lmoore | kallen | 2018-03-05 |
2 | gregory.blackwell | rodriguez5768 | 30.57 | 2018-03-04 | rodriguez5768 | gregory.blackwell | 2018-03-17 |
3 | kristina.miller | john.hardy | 77.05 | 2018-03-12 | john.hardy | kristina.miller | 2018-04-25 |
4 | lacey8987 | mcguire | 54.09 | 2018-03-13 | mcguire | lacey8987 | 2018-06-28 |
Note that since `amount` is the same on both sides of the merge, it is represented by only one column.
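As an aside, that collapsing only happens for join keys. When a non-key column shares a name on both sides, `merge` keeps both copies and disambiguates them with suffixes. A minimal sketch with made-up data (not our CashBox files):

```python
import pandas as pd

left = pd.DataFrame({'user': ['a'], 'amount': [1.0], 'note': ['req']})
right = pd.DataFrame({'user': ['a'], 'amount': [1.0], 'note': ['txn']})

# `user` and `amount` are join keys, so each appears once.
# `note` is not a key, so both copies survive with the default
# `_x` (left) and `_y` (right) suffixes.
merged = left.merge(right, on=['user', 'amount'])
print(list(merged.columns))  # ['user', 'amount', 'note_x', 'note_y']
```

You can control those names with the `suffixes` parameter if `_x`/`_y` are too cryptic.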
So looking at this data merged together, I'd like to see the time difference between when the request was made, and when the money was actually received.
Good news for us: pandas has very powerful date/time functionality, but in order to get there we're going to need to convert our columns. As you can see, the CSV import did not recognize our date fields. `sent_date` and `request_date` are just plain old objects.
successful_requests.dtypes
from_user               object
to_user                 object
amount                 float64
request_date            object
sender                  object
receiver                object
sent_date               object
dtype: object
You can convert columns (which, remember, are `Series` objects) either on CSV import or whenever you need them by using the `pandas.to_datetime` function. There are actually quite a few conversion functions.
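If we'd known our date columns up front, we could have handled the conversion at import time with the `parse_dates` parameter of `read_csv`. A quick sketch using an in-memory CSV rather than our actual data files:

```python
import io
import pandas as pd

# A tiny stand-in for transactions.csv
csv = io.StringIO("sender,receiver,amount,sent_date\n"
                  "stein,smoyer,49.03,2018-01-24\n")

# `parse_dates` tells read_csv which columns to convert on the way in
df = pd.read_csv(csv, parse_dates=['sent_date'])
print(df.dtypes['sent_date'])  # datetime64[ns]
```

Since we've already imported, though, we'll convert in place below.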
successful_requests['request_date'] = pd.to_datetime(successful_requests['request_date'])
successful_requests['sent_date'] = pd.to_datetime(successful_requests['sent_date'])
# And now we can see they are converted
successful_requests.dtypes
from_user                     object
to_user                       object
amount                       float64
request_date          datetime64[ns]
sender                        object
receiver                      object
sent_date             datetime64[ns]
dtype: object
Now that we have dates we can subtract them (vectorization ftw!) to create a timedelta. Let's create a new column called `time_passed` that stores the result.
successful_requests['time_passed'] = successful_requests.sent_date - successful_requests.request_date
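Timedelta columns come with their own `.dt` accessor, which is handy if you ever want the gap as a plain number. A sketch with a couple of made-up date pairs:

```python
import pandas as pd

sent = pd.to_datetime(pd.Series(['2018-07-15', '2018-03-05']))
requested = pd.to_datetime(pd.Series(['2018-02-12', '2018-02-23']))

# Subtracting two datetime Series yields a timedelta64 Series...
passed = sent - requested
# ...and `.dt.days` pulls the day counts out as plain integers.
print(passed.dt.days.tolist())  # [153, 10]
```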
Now, let's take a look at the top 5 longest waits from request to successful transaction by sorting and limiting, to get a vibe.
successful_requests.sort_values(by='time_passed', ascending=False).head(5)
 | from_user | to_user | amount | request_date | sender | receiver | sent_date | time_passed |
---|---|---|---|---|---|---|---|---|
0 | chad.chen | paula7980 | 78.61 | 2018-02-12 | paula7980 | chad.chen | 2018-07-15 | 153 days |
33 | sthompson | andrade | 14.07 | 2018-05-09 | andrade | sthompson | 2018-09-21 | 135 days |
4 | lacey8987 | mcguire | 54.09 | 2018-03-13 | mcguire | lacey8987 | 2018-06-28 | 107 days |
53 | marcus.berry | melissa.mendoza | 71.48 | 2018-05-31 | melissa.mendoza | marcus.berry | 2018-09-06 | 98 days |
39 | bishop | massey2102 | 18.27 | 2018-05-16 | massey2102 | bishop | 2018-08-15 | 91 days |
Hmm, that makes me wonder how much money passed through the request and transaction system. Let's see.
"Wow! ${:,.2f} has passed through the request system in {} transactions!!!".format(
successful_requests.amount.sum(),
len(successful_requests),
)
'Wow! $10,496.47 has passed through the request system in 214 transactions!!!'
As I was looking through those `successful_requests`, I noticed a couple of what seemed like duplicated requests. I called my contact at CashBox and asked about possible duplicate requests. Sure enough, the application allows you to send multiple requests for the same amount.

So this means there are probably duplicates in our `successful_requests` `DataFrame` because there are duplicates in `requests`. There is most likely only one transaction that fulfills each request, but there could be multiple requests that match it. Our merge brought that duplication across as well.
Let's explore the possible duplicates in the `requests` `DataFrame`. There is a method, `DataFrame.duplicated`, that returns a boolean `Series` which we can use as an index. A `keep` parameter is available to choose which of the duplicated rows to mark as a duplicate: the first, the last, or all of them.
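To make the `keep` parameter concrete, here's a tiny hand-rolled example showing how each setting flags a pair of duplicate rows:

```python
import pandas as pd

df = pd.DataFrame({'user': ['a', 'a', 'b'], 'amount': [1.0, 1.0, 2.0]})

# keep='first' (the default): the first occurrence is NOT flagged,
# only the later copies are.
print(df.duplicated(keep='first').tolist())  # [False, True, False]
# keep='last': the last occurrence survives instead.
print(df.duplicated(keep='last').tolist())   # [True, False, False]
# keep=False: every member of a duplicated group is flagged.
print(df.duplicated(keep=False).tolist())    # [True, True, False]
```

We'll use `keep=False` below, since we want to eyeball every copy of each duplicated request, not just the extras.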
# Create a boolean Series of records that are duplicated.
# Note that `keep=False` marks all that are duplicated
dupes = requests[requests.duplicated(('from_user', 'to_user', 'amount'), keep=False)]
# Order by requester and the date of request.
# Note that `request_date` in this case is a string, but this string date format sorts properly still.
dupes.sort_values(['from_user', 'request_date'])
 | from_user | to_user | amount | request_date |
---|---|---|---|---|
58 | austin486 | shelly | 11.24 | 2018-05-29 |
59 | austin486 | shelly | 11.24 | 2018-05-29 |
8 | cjimenez | sarah.evans | 48.14 | 2018-03-21 |
26 | cjimenez | sarah.evans | 48.14 | 2018-04-27 |
218 | clark8139 | moore | 14.54 | 2018-08-31 |
240 | clark8139 | moore | 14.54 | 2018-09-10 |
195 | diane4652 | dean2365 | 6.82 | 2018-08-21 |
224 | diane4652 | dean2365 | 6.82 | 2018-09-05 |
143 | donna1922 | danderson | 79.22 | 2018-07-23 |
157 | donna1922 | danderson | 79.22 | 2018-07-31 |
116 | edwards8658 | johnston2281 | 47.29 | 2018-07-10 |
127 | edwards8658 | johnston2281 | 47.29 | 2018-07-16 |
289 | heather | smercer | 98.13 | 2018-09-22 |
290 | heather | smercer | 98.13 | 2018-09-22 |
32 | hoover4302 | dennis | 87.77 | 2018-05-04 |
33 | hoover4302 | dennis | 87.77 | 2018-05-05 |
160 | jcoleman | fuentes | 40.48 | 2018-08-03 |
216 | jcoleman | fuentes | 40.48 | 2018-08-29 |
82 | jones | jones5796 | 95.88 | 2018-06-21 |
95 | jones | jones5796 | 95.88 | 2018-06-27 |
84 | kimberly | wbrown | 9.02 | 2018-06-24 |
101 | kimberly | wbrown | 9.02 | 2018-06-28 |
60 | kirk | jason | 36.53 | 2018-05-30 |
85 | kirk | jason | 36.53 | 2018-06-24 |
165 | lawrence1685 | joshua9690 | 99.96 | 2018-08-07 |
168 | lawrence1685 | joshua9690 | 99.96 | 2018-08-08 |
114 | lspencer | sarah | 44.89 | 2018-07-09 |
194 | lspencer | sarah | 44.89 | 2018-08-21 |
44 | matthew7940 | sherry6113 | 58.72 | 2018-05-13 |
108 | matthew7940 | sherry6113 | 58.72 | 2018-07-04 |
201 | michael7792 | jennifer | 90.56 | 2018-08-23 |
241 | michael7792 | jennifer | 90.56 | 2018-09-10 |
75 | nicholas.travis | corey | 73.25 | 2018-06-15 |
109 | nicholas.travis | corey | 73.25 | 2018-07-04 |
307 | patricia | martha6969 | 87.33 | 2018-09-25 |
312 | patricia | martha6969 | 87.33 | 2018-09-25 |
10 | paula7980 | mackenzie | 56.00 | 2018-03-29 |
15 | paula7980 | mackenzie | 56.00 | 2018-04-15 |
68 | paula7980 | mackenzie | 56.00 | 2018-06-02 |
221 | phillips5654 | miguel.gamble | 11.35 | 2018-09-01 |
260 | phillips5654 | miguel.gamble | 11.35 | 2018-09-15 |
139 | rebecca848 | mpaul | 1.34 | 2018-07-21 |
229 | rebecca848 | mpaul | 1.34 | 2018-09-06 |
115 | rodriguez9475 | miller8552 | 25.89 | 2018-07-09 |
121 | rodriguez9475 | miller8552 | 25.89 | 2018-07-13 |
We should get those duplicates out of our successful requests. Let's take a look at some more tools that will help us do cleanup like this.
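One likely candidate is `DataFrame.drop_duplicates`, which takes the same subset-of-columns and `keep` arguments as `duplicated`. A quick sketch with a few made-up rows, not our actual cleanup:

```python
import pandas as pd

df = pd.DataFrame({
    'from_user': ['austin486', 'austin486', 'kallen'],
    'to_user': ['shelly', 'shelly', 'lmoore'],
    'amount': [11.24, 11.24, 1.94],
})

# Keep only the first of each duplicated (from_user, to_user, amount) trio
deduped = df.drop_duplicates(['from_user', 'to_user', 'amount'])
print(len(deduped))  # 2
```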