This notebook is based on the article on Pbpython.com. Please reference it for the background and additional details
import pandas as pd
import numpy as np
Read in our sample sales data that includes projected price for our new product launch
sales = pd.read_excel("https://github.com/chris1610/pbpython/blob/master/data/sales-estimate.xlsx?raw=True", sheetname="projections")
sales
Account | Name | State | Rep | Manager | Current_Price | Quantity | New_Product_Price | |
---|---|---|---|---|---|---|---|---|
0 | 714466 | Trantow-Barrows | MN | Craig Booker | Debra Henley | 500 | 100 | 550 |
1 | 737550 | Fritsch, Russel and Anderson | MN | Craig Booker | Debra Henley | 600 | 90 | 725 |
2 | 146832 | Kiehn-Spinka | TX | Daniel Hilton | Debra Henley | 225 | 475 | 255 |
3 | 218895 | Kulas Inc | TX | Daniel Hilton | Debra Henley | 290 | 375 | 300 |
4 | 412290 | Jerde-Hilpert | WI | John Smith | Debra Henley | 375 | 400 | 400 |
5 | 740150 | Barton LLC | WI | John Smith | Debra Henley | 550 | 100 | 600 |
6 | 141962 | Herman LLC | CA | Cedric Moss | Fred Anderson | 400 | 200 | 425 |
7 | 163416 | Purdy-Kunde | CA | Cedric Moss | Fred Anderson | 450 | 150 | 475 |
8 | 239344 | Stokes LLC | WA | Cedric Moss | Fred Anderson | 550 | 75 | 610 |
9 | 307599 | Kassulke, Ondricka and Metz | NV | Wendy Yule | Fred Anderson | 275 | 450 | 300 |
10 | 688981 | Keeling LLC | NV | Wendy Yule | Fred Anderson | 300 | 250 | 350 |
11 | 729833 | Koepp Ltd | NV | Wendy Yule | Fred Anderson | 350 | 100 | 375 |
Show the mean for our current and new product price
print(sales["Current_Price"].mean())
print(sales["New_Product_Price"].mean())
405.416666667 447.083333333
Calculate the weighted average using the long form
print((sales["Current_Price"] * sales["Quantity"]).sum() / sales["Quantity"].sum())
print((sales["New_Product_Price"] * sales["Quantity"]).sum() / sales["Quantity"].sum())
342.540687161 374.638336347
Use np.average to simplify the formula
print(np.average(sales["Current_Price"], weights=sales["Quantity"]))
print(np.average(sales["New_Product_Price"], weights=sales["Quantity"]))
342.540687161 374.638336347
For maximum flexibility, build our own weighted average function
def wavg(group, avg_name, weight_name):
""" http://stackoverflow.com/questions/10951341/pandas-dataframe-aggregate-function-using-multiple-columns
In rare instance, we may not have weights, so just return the mean
"""
d = group[avg_name]
w = group[weight_name]
try:
return (d * w).sum() / w.sum()
except ZeroDivisionError:
return d.mean()
Call the weighted average on all of the data
print(wavg(sales, "Current_Price", "Quantity"))
print(wavg(sales, "New_Product_Price", "Quantity"))
342.540687161 374.638336347
Group the data by manager
sales.groupby("Manager").apply(wavg, "Current_Price", "Quantity")
Manager Debra Henley 340.665584 Fred Anderson 344.897959 dtype: float64
sales.groupby("Manager").apply(wavg, "New_Product_Price", "Quantity")
Manager Debra Henley 372.646104 Fred Anderson 377.142857 dtype: float64
You can also group by state
sales.groupby("State").apply(wavg, "New_Product_Price", "Quantity")
State CA 446.428571 MN 632.894737 NV 325.000000 TX 274.852941 WA 610.000000 WI 440.000000 dtype: float64
You can also group by multiple criteria and the function will work correctly.
sales.groupby(["Manager", "State"]).apply(wavg, "New_Product_Price", "Quantity")
Manager State Debra Henley MN 632.894737 TX 274.852941 WI 440.000000 Fred Anderson CA 446.428571 NV 325.000000 WA 610.000000 dtype: float64
Example of applying multiple aggregation functions
f = {'New_Product_Price': ['mean'],'Current_Price': ['median'], 'Quantity': ['sum', 'mean']}
sales.groupby("Manager").agg(f)
Quantity | New_Product_Price | Current_Price | ||
---|---|---|---|---|
sum | mean | mean | median | |
Manager | ||||
Debra Henley | 1540 | 256.666667 | 471.666667 | 437.5 |
Fred Anderson | 1225 | 204.166667 | 422.500000 | 375.0 |
Similar method to group multiple custom functions together into a single DataFrame
data_1 = sales.groupby("Manager").apply(wavg, "New_Product_Price", "Quantity")
data_2 = sales.groupby("Manager").apply(wavg, "Current_Price", "Quantity")
summary = pd.DataFrame(data=dict(s1=data_1, s2=data_2))
summary.columns = ["New Product Price","Current Product Price"]
summary.head()
New Product Price | Current Product Price | |
---|---|---|
Manager | ||
Debra Henley | 372.646104 | 340.665584 |
Fred Anderson | 377.142857 | 344.897959 |
Finally, numpy has an average function that can be used:
np.average(sales["Current_Price"], weights=sales["Quantity"])
342.54068716094031
Use a lambda function for it to work with grouped data
sales.groupby("Manager").apply(lambda x: np.average(x['New_Product_Price'], weights=x['Quantity']))
Manager Debra Henley 372.646104 Fred Anderson 377.142857 dtype: float64