Grouping¶

A common need is bound to arise where you will need to look at an aggregate view of a DataFrame by a certain value. This is where grouping comes in.

CashBox has asked that we produce a list of the top 10 users who have been on the receiving side of transactions the most. They would like to see the user's first and last name, their email, and the total number of transactions where the user was the receiver.

We can acheive this by grouping our data. We need to take a look at the transactions DataFrame and group it by the receiver field, which is the username.

Grouping on a value returns a new type of object called the GroupBy.

Let's first explore how to create one of these, and then how to wield it's power.

In [1]:

# Setup
import os

import pandas as pd

pd.options.display.max_rows = 10
users = pd.read_csv(os.path.join('data', 'users.csv'), index_col=0)
transactions = pd.read_csv(os.path.join('data', 'transactions.csv'), index_col=0)
# Sanity check
(len(users), len(transactions))

Out[1]:

(475, 998)

Let's remind ourselves about the types of data we have in the transactions DataFrame.

In [2]:

transactions.dtypes

Out[2]:

sender        object
receiver      object
amount       float64
sent_date     object
dtype: object

Grouping by a specific column is pretty straight forward. We want to group by the receiver, so we use the DataFrame.groupby method.

In [3]:

grouped_by_receiver = transactions.groupby('receiver')

# Let's see what type of object we got back
type(grouped_by_receiver)

Out[3]:

pandas.core.groupby.groupby.DataFrameGroupBy

We received a DataFrameGroupBy object. There are quite a few methods here.

Let's take a look first at GroupBy.size. This will return a Series of how many members are in each of the groups. In our case this is the number of transactions that each user received.

In [4]:

# Returns a Series of total number of rows
grouped_by_receiver.size()

Out[4]:

receiver
aaron            6
acook            1
adam.saunders    2
adrian           3
adrian.blair     7
                ..
wilson           2
wking            2
wright3590       4
young            2
zachary.neal     4
Length: 410, dtype: int64

Similarly, we can use the DataFrameGroupBy.count method to see counts of how many non missing data points we have across each column in our group across the columns of our DataFrame.

In [5]:

grouped_by_receiver.count()

Out[5]:

	sender	amount	sent_date
receiver
aaron	6	6	6
acook	1	1	1
adam.saunders	2	2	2
adrian	3	3	3
adrian.blair	7	7	7
...	...	...	...
wilson	2	2	2
wking	2	2	2
wright3590	4	4	4
young	2	2	2
zachary.neal	4	4	4

410 rows × 3 columns

The GroupBy object provides aggregate functions that makes getting calculations quick and seamless. For instance, if we use the GroupBy.sum method we can see each numeric column summed up for each grouping. In our case there is only one numeric column amount.

In [6]:

grouped_by_receiver.sum()

Out[6]:

	amount
receiver
aaron	366.15
acook	94.65
adam.saunders	101.15
adrian	124.36
adrian.blair	462.88
...	...
wilson	44.39
wking	74.07
wright3590	195.45
young	83.57
zachary.neal	186.01

410 rows × 1 columns

Now where were we? Oh right, we're trying to figure out the people who was received the most transactions. So why don't we use this group to create a new column on our users DataFrame.

In [7]:

# Create a new column in users called transaction count, and set the values to the size of the matching group
users['transaction_count'] = grouped_by_receiver.size()

# Not every user has made a transaction, let's see what kind of missing data we are dealing with
len(users[users.transaction_count.isna()])

Out[7]:

Since we don't have a transaction record for everyone, not every user will be in our grouping. So when we created the new column, we ended up adding some np.nan. Let's fix that.

In [8]:

# Set all missing data to 0, since in reality, there have been 0 received transactions for this user
users.transaction_count.fillna(0, inplace=True)
users

Out[8]:

	first_name	last_name	email	email_verified	signup_date	referral_count	balance	transaction_count
aaron	Aaron	Davis	aaron6348@gmail.com	True	2018-08-31	6	18.14	6.0
acook	Anthony	Cook	cook@gmail.com	True	2018-05-12	2	55.45	1.0
adam.saunders	Adam	Saunders	adam@gmail.com	False	2018-05-29	3	72.12	2.0
adrian	Adrian	Fang	adrian.fang@teamtreehouse.com	True	2018-04-28	3	30.01	3.0
adrian.blair	Adrian	Blair	adrian9335@gmail.com	True	2018-06-16	7	25.85	7.0
...	...	...	...	...	...	...	...	...
wilson	Robert	Wilson	robert@yahoo.com	False	2018-05-16	5	59.75	2.0
wking	Wanda	King	wanda.king@holt.com	True	2018-06-01	2	67.08	2.0
wright3590	Jacqueline	Wright	jacqueline.wright@gonzalez.com	True	2018-02-08	6	18.48	4.0
young	Jessica	Young	jessica4028@yahoo.com	True	2018-07-17	4	75.39	2.0
zachary.neal	Zachary	Neal	zneal@gmail.com	True	2018-07-26	1	39.90	4.0

475 rows × 8 columns

Check it out! There's our column, but it's a floating point number, we don't need that. Let's convert it!

In [9]:

# Convert from the default type of float64 to int64 (no precision needed)
users.transaction_count = users.transaction_count.astype('int64')

Finally we want to get the user with the highest transaction count, so let's sort by that descending.

In [10]:

# Sort our values by the new field descending (so the largest comes first), and then by first name ascending
users.sort_values(
    ['transaction_count', 'first_name'],
    ascending=[False, True],
    inplace=True
)
# Take a look at our top 10 receivers, showing only the columns we want
users.loc[:, ['first_name', 'last_name', 'email', 'transaction_count']].head(10)

Out[10]:

	first_name	last_name	email	transaction_count
scott3928	Scott	NaN	scott@yahoo.com	9
sfinley	Samuel	Finley	samuel@gmail.com	8
adrian.blair	Adrian	Blair	adrian9335@gmail.com	7
hdeleon	Hannah	Deleon	hannah@yahoo.com	7
miranda6426	Miranda	Rogers	miranda.rogers@gmail.com	7
aaron	Aaron	Davis	aaron6348@gmail.com	6
corey	Corey	Fuller	fuller8100@yahoo.com	6
heather	Heather	Ray	hray@yahoo.com	6
jennifer.hebert	Jennifer	Hebert	jennifer.hebert@yahoo.com	6
edwards	Michael	Edwards	edwards5456@gmail.com	6

Here they are, the Top 10 Receivers! Nice work putting all those skills together!