Learning More About Pandas by Building and Using a Weighted Average Function

This notebook is based on the article on Pbpython.com. Please reference it for the background and additional details

In [1]:
import pandas as pd
import numpy as np

Read in our sample sales data that includes projected price for our new product launch

In [2]:
sales = pd.read_excel("https://github.com/chris1610/pbpython/blob/master/data/sales-estimate.xlsx?raw=True", sheetname="projections")
sales
Out[2]:
Account Name State Rep Manager Current_Price Quantity New_Product_Price
0 714466 Trantow-Barrows MN Craig Booker Debra Henley 500 100 550
1 737550 Fritsch, Russel and Anderson MN Craig Booker Debra Henley 600 90 725
2 146832 Kiehn-Spinka TX Daniel Hilton Debra Henley 225 475 255
3 218895 Kulas Inc TX Daniel Hilton Debra Henley 290 375 300
4 412290 Jerde-Hilpert WI John Smith Debra Henley 375 400 400
5 740150 Barton LLC WI John Smith Debra Henley 550 100 600
6 141962 Herman LLC CA Cedric Moss Fred Anderson 400 200 425
7 163416 Purdy-Kunde CA Cedric Moss Fred Anderson 450 150 475
8 239344 Stokes LLC WA Cedric Moss Fred Anderson 550 75 610
9 307599 Kassulke, Ondricka and Metz NV Wendy Yule Fred Anderson 275 450 300
10 688981 Keeling LLC NV Wendy Yule Fred Anderson 300 250 350
11 729833 Koepp Ltd NV Wendy Yule Fred Anderson 350 100 375

Show the mean for our current and new product price

In [3]:
print(sales["Current_Price"].mean())
print(sales["New_Product_Price"].mean())
405.416666667
447.083333333

Calculate the weighted average using the long form

In [4]:
print((sales["Current_Price"] * sales["Quantity"]).sum() / sales["Quantity"].sum())
print((sales["New_Product_Price"] * sales["Quantity"]).sum() / sales["Quantity"].sum())
342.540687161
374.638336347

Use np.average to simplify the formula

In [5]:
print(np.average(sales["Current_Price"], weights=sales["Quantity"]))
print(np.average(sales["New_Product_Price"], weights=sales["Quantity"]))
342.540687161
374.638336347

For maximum flexibility, build our own weighted average function

In [6]:
def wavg(group, avg_name, weight_name):
    """ http://stackoverflow.com/questions/10951341/pandas-dataframe-aggregate-function-using-multiple-columns
    In rare instance, we may not have weights, so just return the mean
    """
    d = group[avg_name]
    w = group[weight_name]
    try:
        return (d * w).sum() / w.sum()
    except ZeroDivisionError:
        return d.mean()

Call the weighted average on all of the data

In [7]:
print(wavg(sales, "Current_Price", "Quantity"))
print(wavg(sales, "New_Product_Price", "Quantity"))
342.540687161
374.638336347

Group the data by manager

In [8]:
sales.groupby("Manager").apply(wavg, "Current_Price", "Quantity")
Out[8]:
Manager
Debra Henley     340.665584
Fred Anderson    344.897959
dtype: float64
In [9]:
sales.groupby("Manager").apply(wavg, "New_Product_Price", "Quantity")
Out[9]:
Manager
Debra Henley     372.646104
Fred Anderson    377.142857
dtype: float64

You can also group by state

In [10]:
sales.groupby("State").apply(wavg, "New_Product_Price", "Quantity")
Out[10]:
State
CA    446.428571
MN    632.894737
NV    325.000000
TX    274.852941
WA    610.000000
WI    440.000000
dtype: float64

You can also group by multiple criteria and the function will work correctly.

In [11]:
sales.groupby(["Manager", "State"]).apply(wavg, "New_Product_Price", "Quantity")
Out[11]:
Manager        State
Debra Henley   MN       632.894737
               TX       274.852941
               WI       440.000000
Fred Anderson  CA       446.428571
               NV       325.000000
               WA       610.000000
dtype: float64

Example of applying multiple aggregation functions

In [12]:
f = {'New_Product_Price': ['mean'],'Current_Price': ['median'], 'Quantity': ['sum', 'mean']}
sales.groupby("Manager").agg(f)
Out[12]:
Quantity New_Product_Price Current_Price
sum mean mean median
Manager
Debra Henley 1540 256.666667 471.666667 437.5
Fred Anderson 1225 204.166667 422.500000 375.0

Similar method to group multiple custom functions together into a single DataFrame

In [13]:
data_1 = sales.groupby("Manager").apply(wavg, "New_Product_Price", "Quantity")
data_2 = sales.groupby("Manager").apply(wavg, "Current_Price", "Quantity")
In [14]:
summary = pd.DataFrame(data=dict(s1=data_1, s2=data_2))
summary.columns = ["New Product Price","Current Product Price"]
summary.head()
Out[14]:
New Product Price Current Product Price
Manager
Debra Henley 372.646104 340.665584
Fred Anderson 377.142857 344.897959

Finally, numpy has an average function that can be used:

In [15]:
np.average(sales["Current_Price"], weights=sales["Quantity"])
Out[15]:
342.54068716094031

Use a lambda function for it to work with grouped data

In [16]:
sales.groupby("Manager").apply(lambda x: np.average(x['New_Product_Price'], weights=x['Quantity']))
Out[16]:
Manager
Debra Henley     372.646104
Fred Anderson    377.142857
dtype: float64
In [ ]: