Split, apply and combine

In [1]:
import addutils.toc ; addutils.toc.js(ipy_notebook=True)
Out[1]:

In this tutorial we look at advanced data management with pandas data structures.

In [2]:
import numpy as np
import pandas as pd
from IPython.display import (display, HTML)
from addutils import side_by_side2
from addutils import css_notebook
css_notebook()
Out[2]:

Categorizing a data set and applying a function to each group, is often a critical component of a data analysis workflow. After loading, merging, and preparing a data set, a familiar task is to compute group statistics or possibly pivot tables for reporting or visualization purposes. pandas provides a flexible and high-performance groupby facility.

By 'group by' we refer to a process involving one or more of the following steps:

  • Splitting the data into groups based on some criteria
  • Applying a function to each group independently
  • Combining the results into a data structure
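
For instance, a minimal sketch of these three steps (df, 'key' and 'value' are hypothetical names, not part of the example that follows):

import pandas as pd
df = pd.DataFrame({'key': ['a', 'b', 'a'], 'value': [1, 2, 3]})
grouped = df.groupby('key')        # split: one sub-frame per group key
totals = grouped['value'].sum()    # apply: reduce each group to a scalar
print(totals)                      # combine: a Series indexed by group name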

Suppose we are managing a website and we have a log file with the number of views and likes coming from different cities:

In [3]:
d1 = pd.read_csv('temp/p07_d1.txt', index_col=0)
d1 = d1.reindex(columns=['State','City','Views','Likes'])
display(d1)
  State    City  Views  Likes
0    NE    Page     10      4
1    KY   Stone      9      3
2    CO     Rye      3      0
3    CO     Rye      7      2
4    KY    Dema      4      1
5    KY   Keavy      2      1
6    CO     Rye      1      0
7    NE   Cairo      8      3
8    CO  Dumont     12      7

1 Groupby

groupby groups a DataFrame or Series by a key along a given axis:

In [4]:
g1 = d1.groupby('State')
print (g1.groups)
{'CO': Int64Index([2, 3, 6, 8], dtype='int64'), 'KY': Int64Index([1, 4, 5], dtype='int64'), 'NE': Int64Index([0, 7], dtype='int64')}

The groups attribute of a GroupBy object is a dictionary mapping each group name to the indexes of its members.
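
Individual groups can also be pulled out directly with get_group; a short sketch using the g1 object defined above:

# Sub-DataFrame of all rows whose State is 'CO'
print(g1.get_group('CO'))
# len() gives the number of groups, size() the number of rows per group
print(len(g1), g1.size())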

In [5]:
for name,group in g1:
    print (name)
    print (group)
    print ('Total Views: %d - Total Likes: %d\n\n' %(group['Views'].sum(),
                                                    group['Likes'].sum()))
CO
  State    City  Views  Likes
2    CO     Rye      3      0
3    CO     Rye      7      2
6    CO     Rye      1      0
8    CO  Dumont     12      7
Total Views: 23 - Total Likes: 9


KY
  State   City  Views  Likes
1    KY  Stone      9      3
4    KY   Dema      4      1
5    KY  Keavy      2      1
Total Views: 15 - Total Likes: 5


NE
  State   City  Views  Likes
0    NE   Page     10      4
7    NE  Cairo      8      3
Total Views: 18 - Total Likes: 7


It is also possible to apply a groupby over a DataFrame with a hierarchical index:

In [6]:
d2 = d1.set_index(['State','City'])
display(d2)
              Views  Likes
State City
NE    Page       10      4
KY    Stone       9      3
CO    Rye         3      0
      Rye         7      2
KY    Dema        4      1
      Keavy       2      1
CO    Rye         1      0
NE    Cairo       8      3
CO    Dumont     12      7
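
Note that index levels can also be addressed by name instead of position when grouping; a minimal sketch on d2:

# Equivalent to d2.groupby(level=[0]): address the 'State' level by name
d2.groupby(level='State').sum()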

2 Aggregate

Once the GroupBy object has been created, several methods are available to perform a computation on the grouped data. Here we use aggregate. The result of the aggregation will have the group names as the new index along the grouped axis. In the case of multiple keys, the result is a MultiIndex by default, though this can be changed by using the as_index option:

In [7]:
g2 = d2.groupby(level=[0])
print (g2.groups)
g2.aggregate(np.sum)
{'CO': MultiIndex(levels=[['CO', 'KY', 'NE'], ['Cairo', 'Dema', 'Dumont', 'Keavy', 'Page', 'Rye', 'Stone']],
           labels=[[0, 0, 0, 0], [5, 5, 5, 2]],
           names=['State', 'City']), 'KY': MultiIndex(levels=[['CO', 'KY', 'NE'], ['Cairo', 'Dema', 'Dumont', 'Keavy', 'Page', 'Rye', 'Stone']],
           labels=[[1, 1, 1], [6, 1, 3]],
           names=['State', 'City']), 'NE': MultiIndex(levels=[['CO', 'KY', 'NE'], ['Cairo', 'Dema', 'Dumont', 'Keavy', 'Page', 'Rye', 'Stone']],
           labels=[[2, 2], [4, 0]],
           names=['State', 'City'])}
Out[7]:
       Views  Likes
State
CO        23      9
KY        15      5
NE        18      7
In [8]:
g3 = d2.groupby(level=[0,1])
g4 = d2.groupby(level=[0,1], as_index=False)
HTML(side_by_side2(g3.aggregate(np.sum), g4.aggregate(np.sum)))
Out[8]:
              Views  Likes
State City
CO    Dumont     12      7
      Rye        11      2
KY    Dema        4      1
      Keavy       2      1
      Stone       9      3
NE    Cairo       8      3
      Page       10      4

   Views  Likes
0     12      7
1     11      2
2      4      1
3      2      1
4      9      3
5      8      3
6     10      4

aggregate accepts any function that reduces a vector to a scalar, and can also handle a list of functions:

In [9]:
d1[['State', 'Views']].groupby('State').aggregate([np.sum, np.mean, np.std])
Out[9]:
      Views
        sum  mean       std
State
CO       23  5.75  4.856267
KY       15  5.00  3.605551
NE       18  9.00  1.414214
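
aggregate also accepts a dictionary mapping column names to functions, so each column can get its own statistic; a sketch using the g1 object from above:

# Sum the views but average the likes, per state
g1.aggregate({'Views': np.sum, 'Likes': np.mean})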

3 Apply

apply extends the previous concepts to any Python function:

In [10]:
pd.set_option('display.float_format', lambda x: '{:.1f}'.format(x))

def add_field(group):
    group['Tot.Views'] = group['Views'].sum()
    group['Likes[%]'] = 100.0*group['Likes']/group['Likes'].sum()
    return group

HTML(side_by_side2(d1, d1.groupby('State').apply(add_field)))
Out[10]:
  State    City  Views  Likes
0    NE    Page     10      4
1    KY   Stone      9      3
2    CO     Rye      3      0
3    CO     Rye      7      2
4    KY    Dema      4      1
5    KY   Keavy      2      1
6    CO     Rye      1      0
7    NE   Cairo      8      3
8    CO  Dumont     12      7

  State    City  Views  Likes  Tot.Views  Likes[%]
0    NE    Page     10      4         18      57.1
1    KY   Stone      9      3         15      60.0
2    CO     Rye      3      0         23       0.0
3    CO     Rye      7      2         23      22.2
4    KY    Dema      4      1         15      20.0
5    KY   Keavy      2      1         15      20.0
6    CO     Rye      1      0         23       0.0
7    NE   Cairo      8      3         18      42.9
8    CO  Dumont     12      7         23      77.8
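
When, as in add_field, each per-group result must be broadcast back to the original rows, transform is a common alternative to apply. A sketch (d4 is a hypothetical copy, not part of the original example):

d4 = d1.copy()
# Broadcast each state's total back onto its own rows
d4['Tot.Views'] = d4.groupby('State')['Views'].transform('sum')
d4['Likes[%]'] = 100.0 * d4['Likes'] / d4.groupby('State')['Likes'].transform('sum')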

4 A practical example: Normalize by year

In [11]:
idx = pd.date_range('1999/5/28', periods=1500, freq='1B')
s1 = pd.Series(np.random.normal(5.5, 2, 1500), idx)
s1 = s1.rolling(window=10, min_periods=10).mean().dropna()

Here we define a grouping key for months and one for years:

In [12]:
def my_groupby_key_year(timestamp):
    return timestamp.year

def my_groupby_key_month(timestamp):
    return timestamp.month

def my_normalization(group):
    return (group-group.mean())/group.std()

Here we normalize the data on a monthly basis and check the mean and std on a yearly basis:

In [13]:
t1 = s1.groupby(my_groupby_key_month).apply(my_normalization)

HTML(side_by_side2(s1.head(8),
                  t1.head(8),
                  t1.groupby(my_groupby_key_year).aggregate([np.mean, np.std])))
Out[13]:
              0
1999-06-10  5.0
1999-06-11  4.8
1999-06-14  5.0
1999-06-15  5.0
1999-06-16  5.4
1999-06-17  5.5
1999-06-18  5.9
1999-06-21  6.4

              0
1999-06-10 -0.6
1999-06-11 -1.0
1999-06-14 -0.6
1999-06-15 -0.6
1999-06-16 -0.1
1999-06-17  0.1
1999-06-18  0.7
1999-06-21  1.5

      mean  std
1999   0.5  0.8
2000   0.1  1.2
2001  -0.3  0.9
2002   0.0  1.0
2003   0.0  1.1
2004  -0.1  0.8
2005  -0.1  0.7
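
The same monthly normalization can also be written with transform, which returns a result aligned with the original index; a sketch reusing s1 and the functions defined above:

t2 = s1.groupby(my_groupby_key_month).transform(my_normalization)
# t2 matches t1: transform broadcasts each group's normalized
# values back onto the original DatetimeIndex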

5 A practical example: Group and standardize by dimension

In [14]:
d3 = pd.read_csv('example_data/company.csv', index_col=0)
display(d3.head())
                              Value  expenses  employees Dimension
Company
Dapibus Company              96 008      7124         78       Big
Pede Blandit Congue Company  61 562      8454         60       Big
Pede Suspendisse Associates  54 728      6641          4     Small
Dictum Associates            16 802      6498         89       Big
Dui Cras Pellentesque Ltd    93 954      5040         97       Big

Since the column "Value" contains strings with a space as thousands separator, we need a simple intermediate step to convert the values from strings to floats:

In [15]:
d3['Value'] = d3['Value'].apply(lambda x: float(x.replace(' ', '')))
d3.head()
Out[15]:
                               Value  expenses  employees Dimension
Company
Dapibus Company              96008.0      7124         78       Big
Pede Blandit Congue Company  61562.0      8454         60       Big
Pede Suspendisse Associates  54728.0      6641          4     Small
Dictum Associates            16802.0      6498         89       Big
Dui Cras Pellentesque Ltd    93954.0      5040         97       Big
In [16]:
d3.groupby('Dimension').mean()
Out[16]:
             Value  expenses  employees
Dimension
Big        49445.4    5474.9       61.1
Small      60947.8    4521.4       31.8
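
As an aside, read_csv can perform this conversion already at load time through its thousands parameter; a sketch assuming the same file layout:

# Treat the space inside 'Value' as a thousands separator while parsing
d3 = pd.read_csv('example_data/company.csv', index_col=0, thousands=' ')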

Visit www.add-for.com for more tutorials and updates.

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.