Notebook accompanying pbpython article - Pandas Grouper and Agg Functions Explained

In [1]:
import pandas as pd
import collections

Read in the sample sales file then convert the date column to a proper date time column

In [2]:
df = pd.read_excel("https://github.com/chris1610/pbpython/blob/master/data/sample-salesv3.xlsx?raw=True")
In [3]:
df["date"] = pd.to_datetime(df['date'])
df.head()
Out[3]:
account number name sku quantity unit price ext price date
0 740150 Barton LLC B1-20000 39 86.69 3380.91 2014-01-01 07:21:51
1 714466 Trantow-Barrows S2-77896 -1 63.16 -63.16 2014-01-01 10:00:47
2 218895 Kulas Inc B1-69924 23 90.70 2086.10 2014-01-01 13:24:58
3 307599 Kassulke, Ondricka and Metz S1-65481 41 21.05 863.05 2014-01-01 15:05:22
4 412290 Jerde-Hilpert S2-34077 6 83.21 499.26 2014-01-01 23:26:55
In [4]:
df.dtypes
Out[4]:
account number             int64
name                      object
sku                       object
quantity                   int64
unit price               float64
ext price                float64
date              datetime64[ns]
dtype: object

Example showing how resample can be used along with set_index

In [5]:
df.set_index('date').resample('M')["ext price"].sum()
Out[5]:
date
2014-01-31    185361.66
2014-02-28    146211.62
2014-03-31    203921.38
2014-04-30    174574.11
2014-05-31    165418.55
2014-06-30    174089.33
2014-07-31    191662.11
2014-08-31    153778.59
2014-09-30    168443.17
2014-10-31    171495.32
2014-11-30    119961.22
2014-12-31    163867.26
Freq: M, Name: ext price, dtype: float64

A more complex example with a groupby

In [6]:
df.set_index('date').groupby('name')["ext price"].resample("M").sum().head(20)
Out[6]:
name                             date      
Barton LLC                       2014-01-31     6177.57
                                 2014-02-28    12218.03
                                 2014-03-31     3513.53
                                 2014-04-30    11474.20
                                 2014-05-31    10220.17
                                 2014-06-30    10463.73
                                 2014-07-31     6750.48
                                 2014-08-31    17541.46
                                 2014-09-30    14053.61
                                 2014-10-31     9351.68
                                 2014-11-30     4901.14
                                 2014-12-31     2772.90
Cronin, Oberbrunner and Spencer  2014-01-31     1141.75
                                 2014-02-28    13976.26
                                 2014-03-31    11691.62
                                 2014-04-30     3685.44
                                 2014-05-31     6760.11
                                 2014-06-30     5379.67
                                 2014-07-31     6020.30
                                 2014-08-31     5399.58
Name: ext price, dtype: float64

A simpler example using pd.Grouper

In [7]:
df.groupby(['name', pd.Grouper(key='date', freq='M')])['ext price'].sum().head(20)
Out[7]:
name                             date      
Barton LLC                       2014-01-31     6177.57
                                 2014-02-28    12218.03
                                 2014-03-31     3513.53
                                 2014-04-30    11474.20
                                 2014-05-31    10220.17
                                 2014-06-30    10463.73
                                 2014-07-31     6750.48
                                 2014-08-31    17541.46
                                 2014-09-30    14053.61
                                 2014-10-31     9351.68
                                 2014-11-30     4901.14
                                 2014-12-31     2772.90
Cronin, Oberbrunner and Spencer  2014-01-31     1141.75
                                 2014-02-28    13976.26
                                 2014-03-31    11691.62
                                 2014-04-30     3685.44
                                 2014-05-31     6760.11
                                 2014-06-30     5379.67
                                 2014-07-31     6020.30
                                 2014-08-31     5399.58
Name: ext price, dtype: float64
In [8]:
df.groupby(['name', pd.Grouper(key='date', freq='A-DEC')])['ext price'].sum()
Out[8]:
name                             date      
Barton LLC                       2014-12-31    109438.50
Cronin, Oberbrunner and Spencer  2014-12-31     89734.55
Frami, Hills and Schmidt         2014-12-31    103569.59
Fritsch, Russel and Anderson     2014-12-31    112214.71
Halvorson, Crona and Champlin    2014-12-31     70004.36
Herman LLC                       2014-12-31     82865.00
Jerde-Hilpert                    2014-12-31    112591.43
Kassulke, Ondricka and Metz      2014-12-31     86451.07
Keeling LLC                      2014-12-31    100934.30
Kiehn-Spinka                     2014-12-31     99608.77
Koepp Ltd                        2014-12-31    103660.54
Kuhn-Gusikowski                  2014-12-31     91094.28
Kulas Inc                        2014-12-31    137351.96
Pollich LLC                      2014-12-31     87347.18
Purdy-Kunde                      2014-12-31     77898.21
Sanford and Sons                 2014-12-31     98822.98
Stokes LLC                       2014-12-31     91535.92
Trantow-Barrows                  2014-12-31    123381.38
White-Trantow                    2014-12-31    135841.99
Will LLC                         2014-12-31    104437.60
Name: ext price, dtype: float64
In [9]:
# This works but is kind of slow and probably not that useful for this data set
#df.groupby(['name', pd.Grouper(key='date', freq='60s')])['ext price'].sum()

Some more examples using various off set alisases - http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases

In [10]:
df.groupby(['name', pd.Grouper(key='date', freq='W-MON')])['ext price'].sum()
Out[10]:
name        date      
Barton LLC  2014-01-06    3380.91
            2014-01-20     224.86
            2014-01-27    2423.96
            2014-02-03     147.84
            2014-02-10     217.14
            2014-02-17    8442.09
            2014-03-03    3558.80
            2014-03-10    1763.49
            2014-03-17    1750.04
            2014-04-07    1561.14
            2014-04-14    4702.61
            2014-04-21     836.63
            2014-04-28    2540.79
            2014-05-05    4109.99
            2014-05-12    1671.83
            2014-05-26    3251.43
            2014-06-02    6112.87
            2014-06-09     677.35
            2014-06-16    2798.00
            2014-06-23    3895.46
            2014-07-14    1033.26
            2014-07-21    3989.22
            2014-08-04    7384.80
            2014-08-11      84.18
            2014-08-18    3857.10
            2014-08-25    2925.94
            2014-09-01    5017.44
            2014-09-08    1319.28
            2014-09-15    2608.50
            2014-09-22    7371.67
                           ...   
Will LLC    2014-03-31    2595.78
            2014-04-07     991.99
            2014-04-14    2069.20
            2014-04-21    3007.03
            2014-04-28      26.72
            2014-05-12    2220.74
            2014-05-19    3250.88
            2014-06-02    6385.33
            2014-06-16    1440.96
            2014-06-23      56.15
            2014-06-30     922.41
            2014-07-07    2985.85
            2014-07-14    3073.44
            2014-07-21    3951.27
            2014-07-28    1006.98
            2014-08-11     773.90
            2014-08-18     254.04
            2014-08-25     411.88
            2014-09-01    2020.57
            2014-09-29    2325.42
            2014-10-06     371.52
            2014-10-20      99.47
            2014-10-27    1616.14
            2014-11-03    5768.60
            2014-11-10    1734.08
            2014-12-01     705.96
            2014-12-08    4301.00
            2014-12-15    3502.09
            2014-12-22    2993.40
            2015-01-05    1764.72
Name: ext price, Length: 805, dtype: float64
In [11]:
df.groupby(['name', 'sku', pd.Grouper(key='date', freq='A-DEC')])['ext price'].sum()
Out[11]:
name                             sku       date      
Barton LLC                       B1-04202  2014-12-31     2433.20
                                 B1-05914  2014-12-31      324.72
                                 B1-20000  2014-12-31     7212.11
                                 B1-33087  2014-12-31     4159.97
                                 B1-33364  2014-12-31     4369.19
                                 B1-38851  2014-12-31     6627.93
                                 B1-50809  2014-12-31     5087.44
                                 B1-53102  2014-12-31      431.27
                                 B1-53636  2014-12-31     3737.59
                                 B1-65551  2014-12-31     4703.54
                                 B1-69924  2014-12-31     1634.29
                                 B1-86481  2014-12-31     1457.80
                                 S1-06532  2014-12-31     8346.20
                                 S1-27722  2014-12-31     5840.43
                                 S1-30248  2014-12-31      887.97
                                 S1-47412  2014-12-31     5176.74
                                 S1-50961  2014-12-31     1883.31
                                 S1-65481  2014-12-31     3352.62
                                 S1-82801  2014-12-31     5453.39
                                 S1-93683  2014-12-31     1715.28
                                 S2-10342  2014-12-31    12337.83
                                 S2-11481  2014-12-31      728.50
                                 S2-16558  2014-12-31      623.07
                                 S2-34077  2014-12-31     5726.16
                                 S2-77896  2014-12-31     6156.02
                                 S2-78676  2014-12-31     2616.60
                                 S2-82423  2014-12-31     3735.43
                                 S2-83881  2014-12-31     2679.90
Cronin, Oberbrunner and Spencer  B1-04202  2014-12-31     5054.84
                                 B1-05914  2014-12-31     4003.64
                                                           ...   
White-Trantow                    S2-82423  2014-12-31     5863.69
                                 S2-83881  2014-12-31     6506.07
Will LLC                         B1-04202  2014-12-31      464.33
                                 B1-05914  2014-12-31     5935.53
                                 B1-20000  2014-12-31     4113.19
                                 B1-33087  2014-12-31     3711.06
                                 B1-33364  2014-12-31    11599.40
                                 B1-38851  2014-12-31     1475.09
                                 B1-53102  2014-12-31     6995.58
                                 B1-53636  2014-12-31      532.80
                                 B1-65551  2014-12-31     3039.01
                                 B1-69924  2014-12-31    10021.68
                                 B1-86481  2014-12-31     1770.34
                                 S1-06532  2014-12-31     3314.72
                                 S1-27722  2014-12-31     4493.11
                                 S1-30248  2014-12-31      975.52
                                 S1-47412  2014-12-31     1642.68
                                 S1-50961  2014-12-31     3513.62
                                 S1-65481  2014-12-31      593.46
                                 S1-82801  2014-12-31     2034.92
                                 S1-93683  2014-12-31     9559.57
                                 S2-00301  2014-12-31     2117.33
                                 S2-10342  2014-12-31     4918.55
                                 S2-11481  2014-12-31     4324.79
                                 S2-16558  2014-12-31      413.52
                                 S2-23246  2014-12-31     1893.25
                                 S2-34077  2014-12-31      991.99
                                 S2-77896  2014-12-31     5321.87
                                 S2-78676  2014-12-31     5016.76
                                 S2-83881  2014-12-31     3653.93
Name: ext price, Length: 544, dtype: float64

Now show how to use the new .agg function

First, how to get summary stats without agg

In [12]:
df[["ext price", "quantity"]].sum()
Out[12]:
ext price    2018784.32
quantity       36463.00
dtype: float64
In [13]:
df[["ext price", "quantity"]].mean()
Out[13]:
ext price    1345.856213
quantity       24.308667
dtype: float64

Using .agg for sums and means across multiple columns

In [14]:
df[["ext price", "quantity", "unit price"]].agg(['sum', 'mean'])
Out[14]:
ext price quantity unit price
sum 2.018784e+06 36463.000000 82511.290000
mean 1.345856e+03 24.308667 55.007527

Passing a dictionary containing different operations per column

In [15]:
df.agg({'ext price': ['sum', 'mean'], 'quantity': ['sum', 'mean'], 'unit price': ['mean']})
Out[15]:
ext price quantity unit price
mean 1.345856e+03 24.308667 55.007527
sum 2.018784e+06 36463.000000 NaN

Using custom functions

In [16]:
get_max = lambda x: x.value_counts(dropna=False).index[0]
In [17]:
df.agg({'ext price': ['sum', 'mean'], 'quantity': ['sum', 'mean'], 'unit price': ['mean'], 'sku': [get_max]})
Out[17]:
ext price quantity unit price sku
<lambda> NaN NaN NaN S2-77896
mean 1.345856e+03 24.308667 55.007527 NaN
sum 2.018784e+06 36463.000000 NaN NaN

Clean up the naming in the output by defining the name for get_max

In [18]:
get_max.__name__ = "most frequent"
In [19]:
df.agg({'ext price': ['sum', 'mean'], 'quantity': ['sum', 'mean'], 'unit price': ['mean'], 'sku': [get_max]})
Out[19]:
ext price quantity unit price sku
mean 1.345856e+03 24.308667 55.007527 NaN
most frequent NaN NaN NaN S2-77896
sum 2.018784e+06 36463.000000 NaN NaN

Using an OrderedDictionary to maintain column order

In [20]:
f = collections.OrderedDict([('ext price', ['sum', 'mean']), ('quantity', ['sum', 'mean']), ('sku', [get_max])])
In [21]:
df.agg(f)
Out[21]:
ext price quantity sku
mean 1.345856e+03 24.308667 NaN
most frequent NaN NaN S2-77896
sum 2.018784e+06 36463.000000 NaN
In [ ]: