Combining DataFrames

CashBox has provided us with several separate CSV files. Let's take a look at two of those files: transactions.csv and requests.csv. A request is made in the application when one user requests cash from another. A request is not required for a transaction to occur.

Let's see if we can get a feel for how many successful requests and payments have been made. To do this, we will need to combine the two DataFrames.

In [1]:
# Setup
from datetime import datetime
import os

import pandas as pd

from utils import render

transactions = pd.read_csv(os.path.join('data', 'transactions.csv'), index_col=0)
requests = pd.read_csv(os.path.join('data', 'requests.csv'), index_col=0)

# Pop out a quick sanity check
(transactions.shape, requests.shape)
Out[1]:
((998, 4), (313, 4))

Let's explore our data real quick by taking a look at the first couple of rows in each DataFrame.

In [2]:
transactions.head(2)
Out[2]:
sender receiver amount sent_date
0 stein smoyer 49.03 2018-01-24
1 holden4580 joshua.henry 34.64 2018-02-06
In [3]:
requests.head(2)
Out[3]:
from_user to_user amount request_date
0 chad.chen paula7980 78.61 2018-02-12
1 kallen lmoore 1.94 2018-02-23

I'd like to see all the requests that have a matching transaction based on the users and the amount involved.

To do this, we will merge our two datasets.

We'll create a new dataset by using the DataFrame.merge method.

In [4]:
# Since we are calling merge on the `requests` DataFrame it is considered the left side
successful_requests = requests.merge(
    # And transactions is the right side
    transactions, 
    # So now we line up columns that will make the join make sense.
    left_on=['from_user', 'to_user', 'amount'], 
    right_on=['receiver', 'sender', 'amount']
)
# Let's take a look and see how we did
successful_requests.head()
Out[4]:
from_user to_user amount request_date sender receiver sent_date
0 chad.chen paula7980 78.61 2018-02-12 paula7980 chad.chen 2018-07-15
1 kallen lmoore 1.94 2018-02-23 lmoore kallen 2018-03-05
2 gregory.blackwell rodriguez5768 30.57 2018-03-04 rodriguez5768 gregory.blackwell 2018-03-17
3 kristina.miller john.hardy 77.05 2018-03-12 john.hardy kristina.miller 2018-04-25
4 lacey8987 mcguire 54.09 2018-03-13 mcguire lacey8987 2018-06-28

Note that since the amount column has the same name on both sides of the merge, only one amount column appears in the result.
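By contrast, when two DataFrames share a column name that is not part of the join keys, pandas keeps both copies and distinguishes them with suffixes (by default `_x` and `_y`). Here's a minimal sketch using throwaway DataFrames (the names `left`, `right`, and `note` are just for illustration):

```python
import pandas as pd

left = pd.DataFrame({'key': [1, 2], 'note': ['a', 'b']})
right = pd.DataFrame({'key': [1, 2], 'note': ['c', 'd']})

# 'note' is shared but is not a join key, so pandas suffixes both copies
merged = left.merge(right, on='key')
print(list(merged.columns))  # ['key', 'note_x', 'note_y']
```

You can control those suffixes with the `suffixes` parameter of `merge` if `_x`/`_y` aren't descriptive enough.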

Gather Insights

So looking at this data merged together, I'd like to see the time difference between when the request was made, and when the money was actually received.

Good news for us, pandas has very powerful date/time functionality, but in order to get there we're going to need to convert our columns. As you can see, the CSV import did not recognize our date fields: sent_date and request_date are just plain old objects.

In [5]:
successful_requests.dtypes
Out[5]:
from_user        object
to_user          object
amount          float64
request_date     object
sender           object
receiver         object
sent_date        object
dtype: object

You can convert columns (which remember are a Series) on CSV import or just when you need them by using the pandas.to_datetime method. There are actually quite a few conversion methods.
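For the on-import route, read_csv accepts a parse_dates parameter listing the columns to convert. A quick sketch, using an inline string in place of our actual CSV file (the single row here is just illustrative):

```python
from io import StringIO

import pandas as pd

# Stand-in for data/transactions.csv
csv = StringIO("sender,receiver,amount,sent_date\nstein,smoyer,49.03,2018-01-24\n")

# parse_dates converts the named columns during import
df = pd.read_csv(csv, parse_dates=['sent_date'])
print(df.dtypes['sent_date'])
```

Converting at import time saves the separate to_datetime step, but doing it afterwards works just as well.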

In [6]:
successful_requests['request_date'] = pd.to_datetime(successful_requests['request_date'])
successful_requests['sent_date'] = pd.to_datetime(successful_requests['sent_date'])
# And now we can see they are converted
successful_requests.dtypes
Out[6]:
from_user               object
to_user                 object
amount                 float64
request_date    datetime64[ns]
sender                  object
receiver                object
sent_date       datetime64[ns]
dtype: object

Now that we have dates, we can subtract them (vectorization ftw!) to create a timedelta. Let's create a new column called time_passed that stores the result.

In [7]:
successful_requests['time_passed'] = successful_requests.sent_date - successful_requests.request_date
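Subtracting two datetime columns yields a Series of Timedeltas, and the `.dt` accessor lets you pull out components like whole days. A small sketch with made-up dates (borrowing a couple of values from the rows above):

```python
import pandas as pd

dates = pd.DataFrame({
    'request_date': pd.to_datetime(['2018-02-12', '2018-02-23']),
    'sent_date': pd.to_datetime(['2018-07-15', '2018-03-05']),
})

# Vectorized subtraction produces a timedelta64 Series
delta = dates['sent_date'] - dates['request_date']
print(delta.dt.days.tolist())  # [153, 10]
```

The `.dt.days` accessor is handy when you want a plain integer column for filtering or plotting rather than a Timedelta.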

Now, let's take a look at the 5 longest gaps between request and successful transaction by sorting and limiting, to get a vibe.

In [8]:
successful_requests.sort_values(by='time_passed', ascending=False).head(5)
Out[8]:
from_user to_user amount request_date sender receiver sent_date time_passed
0 chad.chen paula7980 78.61 2018-02-12 paula7980 chad.chen 2018-07-15 153 days
33 sthompson andrade 14.07 2018-05-09 andrade sthompson 2018-09-21 135 days
4 lacey8987 mcguire 54.09 2018-03-13 mcguire lacey8987 2018-06-28 107 days
53 marcus.berry melissa.mendoza 71.48 2018-05-31 melissa.mendoza marcus.berry 2018-09-06 98 days
39 bishop massey2102 18.27 2018-05-16 massey2102 bishop 2018-08-15 91 days

Hmm, that makes me wonder how much money has passed through the request and transaction system. Let's see.

In [9]:
"Wow! ${:,.2f} has passed through the request system in {} transactions!!!".format(
    successful_requests.amount.sum(),
    len(successful_requests),
)
Out[9]:
'Wow! $10,496.47 has passed through the request system in 214 transactions!!!'

Further research

I saw something a little strange as I was looking through those successful_requests: I noticed a couple of what seemed like duplicated requests. I called my contact at CashBox and asked about possible duplicate requests. Sure enough, the application allows you to send multiple requests for the same amount.

So this means there are probably duplicates in our successful_requests DataFrame because there are duplicates in the requests. There is most likely only one transaction that fulfills the request, but there could be multiple requests that match. Our merge brought that duplication across as well.
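To see why the merge fans out like that, here's a tiny sketch with made-up users: two identical requests and a single transaction that fulfills one of them. The inner join matches the one transaction against both request rows:

```python
import pandas as pd

# Two identical requests from 'a' to 'b', but only one fulfilling transaction
requests = pd.DataFrame({
    'from_user': ['a', 'a'],
    'to_user': ['b', 'b'],
    'amount': [5.0, 5.0],
})
transactions = pd.DataFrame({
    'sender': ['b'],
    'receiver': ['a'],
    'amount': [5.0],
})

merged = requests.merge(
    transactions,
    left_on=['from_user', 'to_user', 'amount'],
    right_on=['receiver', 'sender', 'amount'],
)
print(len(merged))  # 2 -- the single transaction matched both duplicate requests
```

So one real payment can show up multiple times in successful_requests, inflating counts and sums.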

Let's explore the possible duplicates in the requests DataFrame. There is a method, DataFrame.duplicated, that returns a boolean Series which we can use as an index. A keep parameter lets you choose which of the duplicated rows to mark as duplicates: the first, the last, or all of them.
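As a quick illustration of the keep parameter on a toy DataFrame (the column names here are just for the example), where the first two rows are identical:

```python
import pandas as pd

df = pd.DataFrame({'user': ['a', 'a', 'b'], 'amount': [5.0, 5.0, 7.0]})

# keep='first' (the default) marks everything after the first occurrence
print(df.duplicated(keep='first').tolist())  # [False, True, False]
# keep='last' marks everything before the last occurrence
print(df.duplicated(keep='last').tolist())   # [True, False, False]
# keep=False marks every row that has a duplicate anywhere
print(df.duplicated(keep=False).tolist())    # [True, True, False]
```

We'll use keep=False below because we want to see every member of each duplicated group, not just the extras.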

In [10]:
# Create a boolean Series of records that are duplicated. 
#   Note that `keep=False` marks all that are duplicated
dupes = requests[requests.duplicated(('from_user', 'to_user', 'amount'), keep=False)]
# Order by requester and the date of request. 
#   Note that `request_date` in this case is a string, but this string date format sorts properly still.
dupes.sort_values(['from_user', 'request_date'])
Out[10]:
from_user to_user amount request_date
58 austin486 shelly 11.24 2018-05-29
59 austin486 shelly 11.24 2018-05-29
8 cjimenez sarah.evans 48.14 2018-03-21
26 cjimenez sarah.evans 48.14 2018-04-27
218 clark8139 moore 14.54 2018-08-31
240 clark8139 moore 14.54 2018-09-10
195 diane4652 dean2365 6.82 2018-08-21
224 diane4652 dean2365 6.82 2018-09-05
143 donna1922 danderson 79.22 2018-07-23
157 donna1922 danderson 79.22 2018-07-31
116 edwards8658 johnston2281 47.29 2018-07-10
127 edwards8658 johnston2281 47.29 2018-07-16
289 heather smercer 98.13 2018-09-22
290 heather smercer 98.13 2018-09-22
32 hoover4302 dennis 87.77 2018-05-04
33 hoover4302 dennis 87.77 2018-05-05
160 jcoleman fuentes 40.48 2018-08-03
216 jcoleman fuentes 40.48 2018-08-29
82 jones jones5796 95.88 2018-06-21
95 jones jones5796 95.88 2018-06-27
84 kimberly wbrown 9.02 2018-06-24
101 kimberly wbrown 9.02 2018-06-28
60 kirk jason 36.53 2018-05-30
85 kirk jason 36.53 2018-06-24
165 lawrence1685 joshua9690 99.96 2018-08-07
168 lawrence1685 joshua9690 99.96 2018-08-08
114 lspencer sarah 44.89 2018-07-09
194 lspencer sarah 44.89 2018-08-21
44 matthew7940 sherry6113 58.72 2018-05-13
108 matthew7940 sherry6113 58.72 2018-07-04
201 michael7792 jennifer 90.56 2018-08-23
241 michael7792 jennifer 90.56 2018-09-10
75 nicholas.travis corey 73.25 2018-06-15
109 nicholas.travis corey 73.25 2018-07-04
307 patricia martha6969 87.33 2018-09-25
312 patricia martha6969 87.33 2018-09-25
10 paula7980 mackenzie 56.00 2018-03-29
15 paula7980 mackenzie 56.00 2018-04-15
68 paula7980 mackenzie 56.00 2018-06-02
221 phillips5654 miguel.gamble 11.35 2018-09-01
260 phillips5654 miguel.gamble 11.35 2018-09-15
139 rebecca848 mpaul 1.34 2018-07-21
229 rebecca848 mpaul 1.34 2018-09-06
115 rodriguez9475 miller8552 25.89 2018-07-09
121 rodriguez9475 miller8552 25.89 2018-07-13

We should get those duplicates out of our successful requests. Let's take a look at some more tools that will help us with cleanup like this.
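One tool we're likely to reach for is DataFrame.drop_duplicates, which removes the rows that duplicated would have marked. A minimal sketch on a toy DataFrame (not our CashBox data):

```python
import pandas as pd

df = pd.DataFrame({'from_user': ['a', 'a', 'b'], 'amount': [5.0, 5.0, 7.0]})

# keep='first' (the default) keeps one copy of each duplicated row
deduped = df.drop_duplicates()
print(len(deduped))  # 2
```

Like duplicated, it accepts a subset of columns and a keep parameter, so we can tailor exactly which rows survive.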