Notebook accompanying pbpython article - Pandas Grouper and Agg Functions Explained
import pandas as pd
import collections
Read in the sample sales file then convert the date column to a proper date time column
df = pd.read_excel("https://github.com/chris1610/pbpython/blob/master/data/sample-salesv3.xlsx?raw=True")
df["date"] = pd.to_datetime(df['date'])
df.head()
account number | name | sku | quantity | unit price | ext price | date | |
---|---|---|---|---|---|---|---|
0 | 740150 | Barton LLC | B1-20000 | 39 | 86.69 | 3380.91 | 2014-01-01 07:21:51 |
1 | 714466 | Trantow-Barrows | S2-77896 | -1 | 63.16 | -63.16 | 2014-01-01 10:00:47 |
2 | 218895 | Kulas Inc | B1-69924 | 23 | 90.70 | 2086.10 | 2014-01-01 13:24:58 |
3 | 307599 | Kassulke, Ondricka and Metz | S1-65481 | 41 | 21.05 | 863.05 | 2014-01-01 15:05:22 |
4 | 412290 | Jerde-Hilpert | S2-34077 | 6 | 83.21 | 499.26 | 2014-01-01 23:26:55 |
df.dtypes
account number int64 name object sku object quantity int64 unit price float64 ext price float64 date datetime64[ns] dtype: object
Example showing how resample can be used along with set_index
df.set_index('date').resample('M')["ext price"].sum()
date 2014-01-31 185361.66 2014-02-28 146211.62 2014-03-31 203921.38 2014-04-30 174574.11 2014-05-31 165418.55 2014-06-30 174089.33 2014-07-31 191662.11 2014-08-31 153778.59 2014-09-30 168443.17 2014-10-31 171495.32 2014-11-30 119961.22 2014-12-31 163867.26 Freq: M, Name: ext price, dtype: float64
A more complex example with a groupby
df.set_index('date').groupby('name')["ext price"].resample("M").sum().head(20)
name date Barton LLC 2014-01-31 6177.57 2014-02-28 12218.03 2014-03-31 3513.53 2014-04-30 11474.20 2014-05-31 10220.17 2014-06-30 10463.73 2014-07-31 6750.48 2014-08-31 17541.46 2014-09-30 14053.61 2014-10-31 9351.68 2014-11-30 4901.14 2014-12-31 2772.90 Cronin, Oberbrunner and Spencer 2014-01-31 1141.75 2014-02-28 13976.26 2014-03-31 11691.62 2014-04-30 3685.44 2014-05-31 6760.11 2014-06-30 5379.67 2014-07-31 6020.30 2014-08-31 5399.58 Name: ext price, dtype: float64
A simpler example using pd.Grouper
df.groupby(['name', pd.Grouper(key='date', freq='M')])['ext price'].sum().head(20)
name date Barton LLC 2014-01-31 6177.57 2014-02-28 12218.03 2014-03-31 3513.53 2014-04-30 11474.20 2014-05-31 10220.17 2014-06-30 10463.73 2014-07-31 6750.48 2014-08-31 17541.46 2014-09-30 14053.61 2014-10-31 9351.68 2014-11-30 4901.14 2014-12-31 2772.90 Cronin, Oberbrunner and Spencer 2014-01-31 1141.75 2014-02-28 13976.26 2014-03-31 11691.62 2014-04-30 3685.44 2014-05-31 6760.11 2014-06-30 5379.67 2014-07-31 6020.30 2014-08-31 5399.58 Name: ext price, dtype: float64
df.groupby(['name', pd.Grouper(key='date', freq='A-DEC')])['ext price'].sum()
name date Barton LLC 2014-12-31 109438.50 Cronin, Oberbrunner and Spencer 2014-12-31 89734.55 Frami, Hills and Schmidt 2014-12-31 103569.59 Fritsch, Russel and Anderson 2014-12-31 112214.71 Halvorson, Crona and Champlin 2014-12-31 70004.36 Herman LLC 2014-12-31 82865.00 Jerde-Hilpert 2014-12-31 112591.43 Kassulke, Ondricka and Metz 2014-12-31 86451.07 Keeling LLC 2014-12-31 100934.30 Kiehn-Spinka 2014-12-31 99608.77 Koepp Ltd 2014-12-31 103660.54 Kuhn-Gusikowski 2014-12-31 91094.28 Kulas Inc 2014-12-31 137351.96 Pollich LLC 2014-12-31 87347.18 Purdy-Kunde 2014-12-31 77898.21 Sanford and Sons 2014-12-31 98822.98 Stokes LLC 2014-12-31 91535.92 Trantow-Barrows 2014-12-31 123381.38 White-Trantow 2014-12-31 135841.99 Will LLC 2014-12-31 104437.60 Name: ext price, dtype: float64
# This works but is kind of slow and probably not that useful for this data set
#df.groupby(['name', pd.Grouper(key='date', freq='60s')])['ext price'].sum()
Some more examples using various off set alisases - http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases
df.groupby(['name', pd.Grouper(key='date', freq='W-MON')])['ext price'].sum()
name date Barton LLC 2014-01-06 3380.91 2014-01-20 224.86 2014-01-27 2423.96 2014-02-03 147.84 2014-02-10 217.14 2014-02-17 8442.09 2014-03-03 3558.80 2014-03-10 1763.49 2014-03-17 1750.04 2014-04-07 1561.14 2014-04-14 4702.61 2014-04-21 836.63 2014-04-28 2540.79 2014-05-05 4109.99 2014-05-12 1671.83 2014-05-26 3251.43 2014-06-02 6112.87 2014-06-09 677.35 2014-06-16 2798.00 2014-06-23 3895.46 2014-07-14 1033.26 2014-07-21 3989.22 2014-08-04 7384.80 2014-08-11 84.18 2014-08-18 3857.10 2014-08-25 2925.94 2014-09-01 5017.44 2014-09-08 1319.28 2014-09-15 2608.50 2014-09-22 7371.67 ... Will LLC 2014-03-31 2595.78 2014-04-07 991.99 2014-04-14 2069.20 2014-04-21 3007.03 2014-04-28 26.72 2014-05-12 2220.74 2014-05-19 3250.88 2014-06-02 6385.33 2014-06-16 1440.96 2014-06-23 56.15 2014-06-30 922.41 2014-07-07 2985.85 2014-07-14 3073.44 2014-07-21 3951.27 2014-07-28 1006.98 2014-08-11 773.90 2014-08-18 254.04 2014-08-25 411.88 2014-09-01 2020.57 2014-09-29 2325.42 2014-10-06 371.52 2014-10-20 99.47 2014-10-27 1616.14 2014-11-03 5768.60 2014-11-10 1734.08 2014-12-01 705.96 2014-12-08 4301.00 2014-12-15 3502.09 2014-12-22 2993.40 2015-01-05 1764.72 Name: ext price, Length: 805, dtype: float64
df.groupby(['name', 'sku', pd.Grouper(key='date', freq='A-DEC')])['ext price'].sum()
name sku date Barton LLC B1-04202 2014-12-31 2433.20 B1-05914 2014-12-31 324.72 B1-20000 2014-12-31 7212.11 B1-33087 2014-12-31 4159.97 B1-33364 2014-12-31 4369.19 B1-38851 2014-12-31 6627.93 B1-50809 2014-12-31 5087.44 B1-53102 2014-12-31 431.27 B1-53636 2014-12-31 3737.59 B1-65551 2014-12-31 4703.54 B1-69924 2014-12-31 1634.29 B1-86481 2014-12-31 1457.80 S1-06532 2014-12-31 8346.20 S1-27722 2014-12-31 5840.43 S1-30248 2014-12-31 887.97 S1-47412 2014-12-31 5176.74 S1-50961 2014-12-31 1883.31 S1-65481 2014-12-31 3352.62 S1-82801 2014-12-31 5453.39 S1-93683 2014-12-31 1715.28 S2-10342 2014-12-31 12337.83 S2-11481 2014-12-31 728.50 S2-16558 2014-12-31 623.07 S2-34077 2014-12-31 5726.16 S2-77896 2014-12-31 6156.02 S2-78676 2014-12-31 2616.60 S2-82423 2014-12-31 3735.43 S2-83881 2014-12-31 2679.90 Cronin, Oberbrunner and Spencer B1-04202 2014-12-31 5054.84 B1-05914 2014-12-31 4003.64 ... White-Trantow S2-82423 2014-12-31 5863.69 S2-83881 2014-12-31 6506.07 Will LLC B1-04202 2014-12-31 464.33 B1-05914 2014-12-31 5935.53 B1-20000 2014-12-31 4113.19 B1-33087 2014-12-31 3711.06 B1-33364 2014-12-31 11599.40 B1-38851 2014-12-31 1475.09 B1-53102 2014-12-31 6995.58 B1-53636 2014-12-31 532.80 B1-65551 2014-12-31 3039.01 B1-69924 2014-12-31 10021.68 B1-86481 2014-12-31 1770.34 S1-06532 2014-12-31 3314.72 S1-27722 2014-12-31 4493.11 S1-30248 2014-12-31 975.52 S1-47412 2014-12-31 1642.68 S1-50961 2014-12-31 3513.62 S1-65481 2014-12-31 593.46 S1-82801 2014-12-31 2034.92 S1-93683 2014-12-31 9559.57 S2-00301 2014-12-31 2117.33 S2-10342 2014-12-31 4918.55 S2-11481 2014-12-31 4324.79 S2-16558 2014-12-31 413.52 S2-23246 2014-12-31 1893.25 S2-34077 2014-12-31 991.99 S2-77896 2014-12-31 5321.87 S2-78676 2014-12-31 5016.76 S2-83881 2014-12-31 3653.93 Name: ext price, Length: 544, dtype: float64
Now show how to use the new .agg function
First, how to get summary stats without agg
df[["ext price", "quantity"]].sum()
ext price 2018784.32 quantity 36463.00 dtype: float64
df[["ext price", "quantity"]].mean()
ext price 1345.856213 quantity 24.308667 dtype: float64
Using .agg for sums and means across multiple columns
df[["ext price", "quantity", "unit price"]].agg(['sum', 'mean'])
ext price | quantity | unit price | |
---|---|---|---|
sum | 2.018784e+06 | 36463.000000 | 82511.290000 |
mean | 1.345856e+03 | 24.308667 | 55.007527 |
Passing a dictionary containing different operations per column
df.agg({'ext price': ['sum', 'mean'], 'quantity': ['sum', 'mean'], 'unit price': ['mean']})
ext price | quantity | unit price | |
---|---|---|---|
mean | 1.345856e+03 | 24.308667 | 55.007527 |
sum | 2.018784e+06 | 36463.000000 | NaN |
Using custom functions
get_max = lambda x: x.value_counts(dropna=False).index[0]
df.agg({'ext price': ['sum', 'mean'], 'quantity': ['sum', 'mean'], 'unit price': ['mean'], 'sku': [get_max]})
ext price | quantity | unit price | sku | |
---|---|---|---|---|
<lambda> | NaN | NaN | NaN | S2-77896 |
mean | 1.345856e+03 | 24.308667 | 55.007527 | NaN |
sum | 2.018784e+06 | 36463.000000 | NaN | NaN |
Clean up the naming in the output by defining the name for get_max
get_max.__name__ = "most frequent"
df.agg({'ext price': ['sum', 'mean'], 'quantity': ['sum', 'mean'], 'unit price': ['mean'], 'sku': [get_max]})
ext price | quantity | unit price | sku | |
---|---|---|---|---|
mean | 1.345856e+03 | 24.308667 | 55.007527 | NaN |
most frequent | NaN | NaN | NaN | S2-77896 |
sum | 2.018784e+06 | 36463.000000 | NaN | NaN |
Using an OrderedDictionary to maintain column order
f = collections.OrderedDict([('ext price', ['sum', 'mean']), ('quantity', ['sum', 'mean']), ('sku', [get_max])])
df.agg(f)
ext price | quantity | sku | |
---|---|---|---|
mean | 1.345856e+03 | 24.308667 | NaN |
most frequent | NaN | NaN | S2-77896 |
sum | 2.018784e+06 | 36463.000000 | NaN |