#!/usr/bin/env python
# coding: utf-8

# # Combining DataFrames
#
# CashBox has provided us with several separate CSV files. Let's take a look at two of them: `transactions.csv` and `requests.csv`. Requests are made in the application when one user requests cash from another. Requests are not required for a transaction to occur.
#
# Let's see if we can get a feel for how many successful requests and payments have been made. In order to do this we will need to combine the two `DataFrame`s.

# In[1]:


# Setup
from datetime import datetime
import os

import pandas as pd

from utils import render

transactions = pd.read_csv(os.path.join('data', 'transactions.csv'), index_col=0)
requests = pd.read_csv(os.path.join('data', 'requests.csv'), index_col=0)

# Pop out a quick sanity check
(transactions.shape, requests.shape)


# Let's explore our data real quick by taking a look at the first couple of rows in each `DataFrame`.

# In[2]:


transactions.head(2)


# In[3]:


requests.head(2)


# I'd like to see all the requests that have a matching transaction based on the users and the amount involved.
#
# In order to do this we will merge both of our datasets together.
#
# We'll create a new dataset by using the [`DataFrame.merge`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html) method.

# In[4]:


# Since we are calling merge on the `requests` DataFrame, it is considered the left side
successful_requests = requests.merge(
    # And transactions is the right side
    transactions,
    # So now we line up the columns that make the join make sense:
    # the user who made the request (`from_user`) is the one who received
    # the money (`receiver`), and the user the request was sent to
    # (`to_user`) is the one who sent it (`sender`).
    left_on=['from_user', 'to_user', 'amount'],
    right_on=['receiver', 'sender', 'amount']
)

# Let's take a look and see how we did
successful_requests.head()


# Note that since **`amount`** is the same on both sides of the merge, only one column is represented.
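# To make the join logic concrete without the CashBox CSVs, here is a minimal, self-contained sketch. The column names mirror the merge above, but the rows and user names are made up for illustration; the default inner merge keeps only requests with a matching transaction.

```python
import pandas as pd

# Toy stand-ins for the CashBox data (rows are invented for illustration)
requests = pd.DataFrame({
    'from_user': ['ada', 'bob', 'cyn'],
    'to_user':   ['bob', 'cyn', 'ada'],
    'amount':    [10.0, 25.0, 5.0],
})
transactions = pd.DataFrame({
    'sender':   ['bob', 'ada'],
    'receiver': ['ada', 'cyn'],
    'amount':   [10.0, 99.0],
})

# Inner merge (the default): `from_user` lines up with `receiver`,
# `to_user` with `sender`, and `amount` with `amount`.
matched = requests.merge(
    transactions,
    left_on=['from_user', 'to_user', 'amount'],
    right_on=['receiver', 'sender', 'amount'],
)
```

# Only ada's request survives: it is the only one where the users and the amount all line up. Because `amount` has the same name on both sides, it appears once in the result.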
# ## Gather Insights
#
# Looking at this merged data, I'd like to see the time difference between when the request was made and when the money was actually received.
#
# Good news for us: pandas has very powerful date/time functionality, but in order to get there we're going to need to convert our columns. As you can see, the CSV import did not recognize our date fields. **`sent_date`** and **`request_date`** are just plain old objects.

# In[5]:


successful_requests.dtypes


# You can convert columns (which, remember, are each a `Series`) on CSV import, or whenever you need them, by using the `pandas.to_datetime` method. There are actually quite a few conversion methods.

# In[6]:


successful_requests['request_date'] = pd.to_datetime(successful_requests['request_date'])
successful_requests['sent_date'] = pd.to_datetime(successful_requests['sent_date'])

# And now we can see they are converted
successful_requests.dtypes


# Now that we have dates we can subtract them (vectorization FTW!) to create a timedelta. Let's create a new column called **`time_passed`** that stores the result.

# In[7]:


successful_requests['time_passed'] = successful_requests.sent_date - successful_requests.request_date


# Now let's take a look at the top 5 longest request-to-transaction times by sorting and limiting, to get a vibe.

# In[8]:


successful_requests.sort_values(by='time_passed', ascending=False).head(5)


# Hmm, that makes me wonder how much money passed through the request and transaction system. Let's see.

# In[9]:


"Wow! ${:,.2f} has passed through the request system in {} transactions!!!".format(
    successful_requests.amount.sum(),
    len(successful_requests),
)


# ## Further research
#
# I saw something a little strange as I was looking through those **`successful_requests`**: I noticed a couple of what seemed like duplicated requests. I called my contact at CashBox and asked about possible duplicate requests.
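# The conversion-and-subtraction pattern above can be sketched on its own with made-up date strings (the values here are hypothetical, not from the CashBox data): strings come in as plain objects, `pd.to_datetime` turns them into datetimes, and vectorized subtraction yields a timedelta column.

```python
import pandas as pd

# Hypothetical date strings, standing in for `request_date` and `sent_date`
df = pd.DataFrame({
    'request_date': ['2018-02-12 09:00:00', '2018-02-14 18:30:00'],
    'sent_date':    ['2018-02-12 10:30:00', '2018-02-16 18:30:00'],
})

# The CSV-style columns start life as plain objects (strings)
assert df['request_date'].dtype == object

# Convert them to real datetimes
df['request_date'] = pd.to_datetime(df['request_date'])
df['sent_date'] = pd.to_datetime(df['sent_date'])

# Vectorized subtraction produces a timedelta64 column
df['time_passed'] = df['sent_date'] - df['request_date']
```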
# Sure enough, the application allows you to send multiple requests for the same amount.
#
# So this means there are probably duplicates in our **`successful_requests`** `DataFrame`, because there are duplicates in **`requests`**. There is most likely only one transaction that fulfills each request, but there could be multiple requests that match it. Our merge brought that duplication across as well.
#
# Let's explore the possible duplicates in the **`requests`** `DataFrame`. There is a method [`DataFrame.duplicated`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.duplicated.html) that returns a boolean `Series`, which we can use as an index. A `keep` parameter is available to choose which of the duplicated rows to mark as a duplicate: the first, the last, or all of them.

# In[10]:


# Use a boolean Series of duplicated records to index into `requests`.
# Note that `keep=False` marks every row that has a duplicate
dupes = requests[requests.duplicated(('from_user', 'to_user', 'amount'), keep=False)]

# Order by requester and the date of request.
# Note that `request_date` in this case is a string, but this string
# date format still sorts properly.
dupes.sort_values(['from_user', 'request_date'])


# We should get those duplicates out of our successful requests. Let's take a look at some more tools that will help us do cleanup like this one.
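# Before moving on, here's a tiny self-contained sketch (with made-up user names and amounts) of how the `keep` parameter changes what `duplicated` marks: the default `keep='first'` flags only the later copies, while `keep=False` flags every row in a duplicated group, which is what makes it handy for inspection.

```python
import pandas as pd

# A made-up requests table with one repeated request (ada -> bob, 10.0)
requests = pd.DataFrame({
    'from_user': ['ada', 'ada', 'bob'],
    'to_user':   ['bob', 'bob', 'ada'],
    'amount':    [10.0, 10.0, 7.5],
})

# keep='first' (the default) marks only the second copy as a duplicate
first_dupes = requests.duplicated(['from_user', 'to_user', 'amount'])

# keep=False marks *both* copies, so we can inspect the whole group
all_dupes = requests[requests.duplicated(['from_user', 'to_user', 'amount'], keep=False)]
```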