Split, apply and combine

In [1]:
import addutils.toc ; addutils.toc.js(ipy_notebook=True)
Out[1]:

In this tutorial we look at advanced data management with pandas data structures.

In [2]:
import numpy as np
import pandas as pd
from IPython.display import (display, HTML)
from addutils import side_by_side2
from addutils import css_notebook
css_notebook()
Out[2]:

Categorizing a data set and applying a function to each group, is often a critical component of a data analysis workflow. After loading, merging, and preparing a data set, a familiar task is to compute group statistics or possibly pivot tables for reporting or visualization purposes. pandas provides a flexible and high-performance groupby facility.

By 'group by' we refer to a process involving one or more of the following steps:

  • Splitting the data into groups based on some criteria
  • Applying a function to each group independently
  • Combining the results into a data structure
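
For instance, a minimal sketch of these three steps (df, 'key' and 'value' are hypothetical names, not part of the example that follows):

import pandas as pd
df = pd.DataFrame({'key': ['a', 'b', 'a'], 'value': [1, 2, 3]})
grouped = df.groupby('key')        # split: one sub-frame per group key
totals = grouped['value'].sum()    # apply: reduce each group to a scalar
print(totals)                      # combine: a Series indexed by group name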

Suppose we are managing a website and we have a log file with the number of views and likes coming from different cities:

In [3]:
d1 = pd.read_csv('temp/p07_d1.txt', index_col=0)
d1 = d1.reindex(columns=['State','City','Views','Likes'])
display(d1)
  State    City  Views  Likes
0    NE    Page     10      4
1    KY   Stone      9      3
2    CO     Rye      3      0
3    CO     Rye      7      2
4    KY    Dema      4      1
5    KY   Keavy      2      1
6    CO     Rye      1      0
7    NE   Cairo      8      3
8    CO  Dumont     12      7

1 Groupby

groupby groups a DataFrame or Series by a key along a given axis:

In [4]:
g1 = d1.groupby('State')
print (g1.groups)
{'CO': Int64Index([2, 3, 6, 8], dtype='int64'), 'KY': Int64Index([1, 4, 5], dtype='int64'), 'NE': Int64Index([0, 7], dtype='int64')}

The groups attribute of a GroupBy object is a dictionary mapping each group name to the indexes of its members.
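
Individual groups can also be pulled out directly with get_group; a short sketch using the g1 object defined above:

# Sub-DataFrame of all rows whose State is 'CO'
print(g1.get_group('CO'))
# len() gives the number of groups, size() the number of rows per group
print(len(g1), g1.size())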

In [5]:
for name,group in g1:
    print (name)
    print (group)
    print ('Total Views: %d - Total Likes: %d\n\n' %(group['Views'].sum(),
                                                    group['Likes'].sum()))
CO
  State    City  Views  Likes
2    CO     Rye      3      0
3    CO     Rye      7      2
6    CO     Rye      1      0
8    CO  Dumont     12      7
Total Views: 23 - Total Likes: 9


KY
  State   City  Views  Likes
1    KY  Stone      9      3
4    KY   Dema      4      1
5    KY  Keavy      2      1
Total Views: 15 - Total Likes: 5


NE
  State   City  Views  Likes
0    NE   Page     10      4
7    NE  Cairo      8      3
Total Views: 18 - Total Likes: 7


It is also possible to apply a groupby over a DataFrame with a hierarchical index:

In [6]:
d2 = d1.set_index(['State','City'])
display(d2)
              Views  Likes
State City
NE    Page       10      4
KY    Stone       9      3
CO    Rye         3      0
      Rye         7      2
KY    Dema        4      1
      Keavy       2      1
CO    Rye         1      0
NE    Cairo       8      3
CO    Dumont     12      7
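
Note that index levels can also be addressed by name instead of position when grouping; a minimal sketch on d2:

# Equivalent to d2.groupby(level=[0]): address the 'State' level by name
d2.groupby(level='State').sum()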

2 Aggregate

Once the GroupBy object has been created, several methods are available to perform a computation on the grouped data. Here we use aggregate. The result of the aggregation will have the group names as the new index along the grouped axis. In the case of multiple keys, the result is a MultiIndex by default, though this can be changed by using the as_index option:

In [7]:
g2 = d2.groupby(level=[0])
print (g2.groups)
g2.aggregate(np.sum)
{'CO': MultiIndex(levels=[['CO', 'KY', 'NE'], ['Cairo', 'Dema', 'Dumont', 'Keavy', 'Page', 'Rye', 'Stone']],
           labels=[[0, 0, 0, 0], [5, 5, 5, 2]],
           names=['State', 'City']), 'KY': MultiIndex(levels=[['CO', 'KY', 'NE'], ['Cairo', 'Dema', 'Dumont', 'Keavy', 'Page', 'Rye', 'Stone']],
           labels=[[1, 1, 1], [6, 1, 3]],
           names=['State', 'City']), 'NE': MultiIndex(levels=[['CO', 'KY', 'NE'], ['Cairo', 'Dema', 'Dumont', 'Keavy', 'Page', 'Rye', 'Stone']],
           labels=[[2, 2], [4, 0]],
           names=['State', 'City'])}
Out[7]:
       Views  Likes
State
CO        23      9
KY        15      5
NE        18      7
In [8]:
g3 = d2.groupby(level=[0,1])
g4 = d2.groupby(level=[0,1], as_index=False)
HTML(side_by_side2(g3.aggregate(np.sum), g4.aggregate(np.sum)))
Out[8]:
              Views  Likes
State City
CO    Dumont     12      7
      Rye        11      2
KY    Dema        4      1
      Keavy       2      1
      Stone       9      3
NE    Cairo       8      3
      Page       10      4

   Views  Likes
0     12      7
1     11      2
2      4      1
3      2      1
4      9      3
5      8      3
6     10      4

aggregate accepts any function that reduces a vector to a scalar, and can also handle a list of functions:

In [9]:
d1[['State', 'Views']].groupby('State').aggregate([np.sum, np.mean, np.std])
Out[9]:
      Views
        sum  mean       std
State
CO       23  5.75  4.856267
KY       15  5.00  3.605551
NE       18  9.00  1.414214
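
aggregate also accepts a dictionary mapping column names to functions, so each column can get its own statistic; a sketch using the g1 object from above:

# Sum the views but average the likes, per state
g1.aggregate({'Views': np.sum, 'Likes': np.mean})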

3 Apply

apply extends the previous concepts to any Python function:

In [10]:
pd.set_option('display.float_format', lambda x: '{:.1f}'.format(x))

def add_field(group):
    group['Tot.Views'] = group['Views'].sum()
    group['Likes[%]'] = 100.0*group['Likes']/group['Likes'].sum()
    return group

HTML(side_by_side2(d1, d1.groupby('State').apply(add_field)))
Out[10]:
  State    City  Views  Likes
0    NE    Page     10      4
1    KY   Stone      9      3
2    CO     Rye      3      0
3    CO     Rye      7      2
4    KY    Dema      4      1
5    KY   Keavy      2      1
6    CO     Rye      1      0
7    NE   Cairo      8      3
8    CO  Dumont     12      7

  State    City  Views  Likes  Tot.Views  Likes[%]
0    NE    Page     10      4         18      57.1
1    KY   Stone      9      3         15      60.0
2    CO     Rye      3      0         23       0.0
3    CO     Rye      7      2         23      22.2
4    KY    Dema      4      1         15      20.0
5    KY   Keavy      2      1         15      20.0
6    CO     Rye      1      0         23       0.0
7    NE   Cairo      8      3         18      42.9
8    CO  Dumont     12      7         23      77.8
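
When, as in add_field, each per-group result must be broadcast back to the original rows, transform is a common alternative to apply. A sketch (d4 is a hypothetical copy, not part of the original example):

d4 = d1.copy()
# Broadcast each state's total back onto its own rows
d4['Tot.Views'] = d4.groupby('State')['Views'].transform('sum')
d4['Likes[%]'] = 100.0 * d4['Likes'] / d4.groupby('State')['Likes'].transform('sum')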

4 A practical example: Normalize by year

In [11]:
idx = pd.date_range('1999/5/28', periods=1500, freq='1B')
s1 = pd.Series(np.random.normal(5.5, 2, 1500), idx)
s1 = s1.rolling(window=10, min_periods=10).mean().dropna()

Here we define a grouping key for months and one for years:

In [12]:
def my_groupby_key_year(timestamp):
    return timestamp.year

def my_groupby_key_month(timestamp):
    return timestamp.month

def my_normalization(group):
    return (group-group.mean())/group.std()

Here we normalize the data on a monthly basis and check the mean and std on a yearly basis:

In [13]:
t1 = s1.groupby(my_groupby_key_month).apply(my_normalization)

HTML(side_by_side2(s1.head(8),
                  t1.head(8),
                  t1.groupby(my_groupby_key_year).aggregate([np.mean, np.std])))
Out[13]:
              0
1999-06-10  5.0
1999-06-11  4.8
1999-06-14  5.0
1999-06-15  5.0
1999-06-16  5.4
1999-06-17  5.5
1999-06-18  5.9
1999-06-21  6.4

              0
1999-06-10 -0.6
1999-06-11 -1.0
1999-06-14 -0.6
1999-06-15 -0.6
1999-06-16 -0.1
1999-06-17  0.1
1999-06-18  0.7
1999-06-21  1.5

      mean  std
1999   0.5  0.8
2000   0.1  1.2
2001  -0.3  0.9
2002   0.0  1.0
2003   0.0  1.1
2004  -0.1  0.8
2005  -0.1  0.7
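
The same monthly normalization can also be written with transform, which returns a result aligned with the original index; a sketch reusing s1 and the functions defined above:

t2 = s1.groupby(my_groupby_key_month).transform(my_normalization)
# t2 matches t1: transform broadcasts each group's normalized
# values back onto the original DatetimeIndex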

5 A practical example: Group and standardize by dimension

In [14]:
d3 = pd.read_csv('example_data/company.csv', index_col=0)
display(d3.head())
                              Value  expenses  employees Dimension
Company
Dapibus Company              96 008      7124         78       Big
Pede Blandit Congue Company  61 562      8454         60       Big
Pede Suspendisse Associates  54 728      6641          4     Small
Dictum Associates            16 802      6498         89       Big
Dui Cras Pellentesque Ltd    93 954      5040         97       Big

Since the column "Value" contains strings with a space as thousands separator, we need a simple intermediate step to convert the values from strings to floats:

In [15]:
d3['Value'] = d3['Value'].apply(lambda x: float(x.replace(' ', '')))
d3.head()
Out[15]:
                               Value  expenses  employees Dimension
Company
Dapibus Company              96008.0      7124         78       Big
Pede Blandit Congue Company  61562.0      8454         60       Big
Pede Suspendisse Associates  54728.0      6641          4     Small
Dictum Associates            16802.0      6498         89       Big
Dui Cras Pellentesque Ltd    93954.0      5040         97       Big
In [16]:
d3.groupby('Dimension').mean()
Out[16]:
             Value  expenses  employees
Dimension
Big        49445.4    5474.9       61.1
Small      60947.8    4521.4       31.8
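
As an aside, read_csv can perform this conversion already at load time through its thousands parameter; a sketch assuming the same file layout:

# Treat the space inside 'Value' as a thousands separator while parsing
d3 = pd.read_csv('example_data/company.csv', index_col=0, thousands=' ')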

Visit www.add-for.com for more tutorials and updates.

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.