#!/usr/bin/env python
# coding: utf-8

# # Combining DataFrames
#
# CashBox has provided us with several separate CSV files. Let's take a look at two of them: `transactions.csv` and `requests.csv`. Requests are made in the application when one user requests cash from another. Requests are not required for a transaction to occur.
#
# Let's see if we can get a feel for how many successful requests and payments have been made. In order to do this we will need to combine the two `DataFrame`s.

# In[1]:


# Setup
from datetime import datetime
import os

import pandas as pd

from utils import render

transactions = pd.read_csv(os.path.join('data', 'transactions.csv'), index_col=0)
requests = pd.read_csv(os.path.join('data', 'requests.csv'), index_col=0)

# Pop out a quick sanity check
(transactions.shape, requests.shape)


# Let's explore our data real quick by taking a look at the first couple of rows in each `DataFrame`.

# In[2]:


transactions.head(2)


# In[3]:


requests.head(2)


# I'd like to see all the requests that have a matching transaction based on the users and the amount involved.
#
# In order to do this we will merge both of our datasets together.
#
# We'll create a new dataset by using the [`DataFrame.merge`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html) method.

# In[4]:


# Since we are calling merge on the `requests` DataFrame, it is considered the left side
successful_requests = requests.merge(
    # And transactions is the right side
    transactions,
    # So now we line up the columns that make the join make sense:
    # the user who made the request (`from_user`) is the one who received
    # the money (`receiver`), and the user the request was sent to
    # (`to_user`) is the one who sent it (`sender`).
    left_on=['from_user', 'to_user', 'amount'],
    right_on=['receiver', 'sender', 'amount']
)

# Let's take a look and see how we did
successful_requests.head()


# Note that since **`amount`** is the same on both sides of the merge, only one column is represented.
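# To make the join logic concrete without the CashBox CSVs, here is a minimal, self-contained sketch. The column names mirror the merge above, but the rows and user names are made up for illustration; the default inner merge keeps only requests with a matching transaction.

```python
import pandas as pd

# Toy stand-ins for the CashBox data (rows are invented for illustration)
requests = pd.DataFrame({
    'from_user': ['ada', 'bob', 'cyn'],
    'to_user':   ['bob', 'cyn', 'ada'],
    'amount':    [10.0, 25.0, 5.0],
})
transactions = pd.DataFrame({
    'sender':   ['bob', 'ada'],
    'receiver': ['ada', 'cyn'],
    'amount':   [10.0, 99.0],
})

# Inner merge (the default): `from_user` lines up with `receiver`,
# `to_user` with `sender`, and `amount` with `amount`.
matched = requests.merge(
    transactions,
    left_on=['from_user', 'to_user', 'amount'],
    right_on=['receiver', 'sender', 'amount'],
)
```

# Only ada's request survives: it is the only one where the users and the amount all line up. Because `amount` has the same name on both sides, it appears once in the result.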
# ## Gather Insights
#
# Looking at this merged data, I'd like to see the time difference between when the request was made and when the money was actually received.
#
# Good news for us: pandas has very powerful date/time functionality, but in order to get there we're going to need to convert our columns. As you can see, the CSV import did not recognize our date fields. **`sent_date`** and **`request_date`** are just plain old objects.

# In[5]:


successful_requests.dtypes


# You can convert columns (which, remember, are each a `Series`) on CSV import, or whenever you need them, by using the `pandas.to_datetime` method. There are actually quite a few conversion methods.

# In[6]:


successful_requests['request_date'] = pd.to_datetime(successful_requests['request_date'])
successful_requests['sent_date'] = pd.to_datetime(successful_requests['sent_date'])

# And now we can see they are converted
successful_requests.dtypes


# Now that we have dates we can subtract them (vectorization FTW!) to create a timedelta. Let's create a new column called **`time_passed`** that stores the result.

# In[7]:


successful_requests['time_passed'] = successful_requests.sent_date - successful_requests.request_date


# Now let's take a look at the top 5 longest request-to-transaction times by sorting and limiting, to get a vibe.

# In[8]:


successful_requests.sort_values(by='time_passed', ascending=False).head(5)


# Hmm, that makes me wonder how much money passed through the request and transaction system. Let's see.

# In[9]:


"Wow! ${:,.2f} has passed through the request system in {} transactions!!!".format(
    successful_requests.amount.sum(),
    len(successful_requests),
)


# ## Further research
#
# I saw something a little strange as I was looking through those **`successful_requests`**: I noticed a couple of what seemed like duplicated requests. I called my contact at CashBox and asked about possible duplicate requests.
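# The conversion-and-subtraction pattern above can be sketched on its own with made-up date strings (the values here are hypothetical, not from the CashBox data): strings come in as plain objects, `pd.to_datetime` turns them into datetimes, and vectorized subtraction yields a timedelta column.

```python
import pandas as pd

# Hypothetical date strings, standing in for `request_date` and `sent_date`
df = pd.DataFrame({
    'request_date': ['2018-02-12 09:00:00', '2018-02-14 18:30:00'],
    'sent_date':    ['2018-02-12 10:30:00', '2018-02-16 18:30:00'],
})

# The CSV-style columns start life as plain objects (strings)
assert df['request_date'].dtype == object

# Convert them to real datetimes
df['request_date'] = pd.to_datetime(df['request_date'])
df['sent_date'] = pd.to_datetime(df['sent_date'])

# Vectorized subtraction produces a timedelta64 column
df['time_passed'] = df['sent_date'] - df['request_date']
```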
# Sure enough, the application allows you to send multiple requests for the same amount.
#
# So this means there are probably duplicates in our **`successful_requests`** `DataFrame`, because there are duplicates in **`requests`**. There is most likely only one transaction that fulfills each request, but there could be multiple requests that match it. Our merge brought that duplication across as well.
#
# Let's explore the possible duplicates in the **`requests`** `DataFrame`. There is a method [`DataFrame.duplicated`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.duplicated.html) that returns a boolean `Series`, which we can use as an index. A `keep` parameter is available to choose which of the duplicated rows to mark as a duplicate: the first, the last, or all of them.

# In[10]:


# Use a boolean Series of duplicated records to index into `requests`.
# Note that `keep=False` marks every row that has a duplicate
dupes = requests[requests.duplicated(('from_user', 'to_user', 'amount'), keep=False)]

# Order by requester and the date of request.
# Note that `request_date` in this case is a string, but this string
# date format still sorts properly.
dupes.sort_values(['from_user', 'request_date'])


# We should get those duplicates out of our successful requests. Let's take a look at some more tools that will help us do cleanup like this one.
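# Before moving on, here's a tiny self-contained sketch (with made-up user names and amounts) of how the `keep` parameter changes what `duplicated` marks: the default `keep='first'` flags only the later copies, while `keep=False` flags every row in a duplicated group, which is what makes it handy for inspection.

```python
import pandas as pd

# A made-up requests table with one repeated request (ada -> bob, 10.0)
requests = pd.DataFrame({
    'from_user': ['ada', 'ada', 'bob'],
    'to_user':   ['bob', 'bob', 'ada'],
    'amount':    [10.0, 10.0, 7.5],
})

# keep='first' (the default) marks only the second copy as a duplicate
first_dupes = requests.duplicated(['from_user', 'to_user', 'amount'])

# keep=False marks *both* copies, so we can inspect the whole group
all_dupes = requests[requests.duplicated(['from_user', 'to_user', 'amount'], keep=False)]
```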