Learning More About Pandas by Building and Using a Weighted Average Function¶

This notebook is based on the article on Pbpython.com. Please reference it for the background and additional details

In [1]:

import pandas as pd
import numpy as np

Read in our sample sales data that includes projected price for our new product launch

In [2]:

sales = pd.read_excel("https://github.com/chris1610/pbpython/blob/master/data/sales-estimate.xlsx?raw=True", sheetname="projections")
sales

Out[2]:

	Account	Name	State	Rep	Manager	Current_Price	Quantity	New_Product_Price
0	714466	Trantow-Barrows	MN	Craig Booker	Debra Henley	500	100	550
1	737550	Fritsch, Russel and Anderson	MN	Craig Booker	Debra Henley	600	90	725
2	146832	Kiehn-Spinka	TX	Daniel Hilton	Debra Henley	225	475	255
3	218895	Kulas Inc	TX	Daniel Hilton	Debra Henley	290	375	300
4	412290	Jerde-Hilpert	WI	John Smith	Debra Henley	375	400	400
5	740150	Barton LLC	WI	John Smith	Debra Henley	550	100	600
6	141962	Herman LLC	CA	Cedric Moss	Fred Anderson	400	200	425
7	163416	Purdy-Kunde	CA	Cedric Moss	Fred Anderson	450	150	475
8	239344	Stokes LLC	WA	Cedric Moss	Fred Anderson	550	75	610
9	307599	Kassulke, Ondricka and Metz	NV	Wendy Yule	Fred Anderson	275	450	300
10	688981	Keeling LLC	NV	Wendy Yule	Fred Anderson	300	250	350
11	729833	Koepp Ltd	NV	Wendy Yule	Fred Anderson	350	100	375

Show the mean for our current and new product price

In [3]:

print(sales["Current_Price"].mean())
print(sales["New_Product_Price"].mean())

405.416666667
447.083333333

Calculate the weighted average using the long form

In [4]:

print((sales["Current_Price"] * sales["Quantity"]).sum() / sales["Quantity"].sum())
print((sales["New_Product_Price"] * sales["Quantity"]).sum() / sales["Quantity"].sum())

342.540687161
374.638336347

Use np.average to simplify the formula

In [5]:

print(np.average(sales["Current_Price"], weights=sales["Quantity"]))
print(np.average(sales["New_Product_Price"], weights=sales["Quantity"]))

342.540687161
374.638336347

For maximum flexibility, build our own weighted average function

In [6]:

def wavg(group, avg_name, weight_name):
    """ http://stackoverflow.com/questions/10951341/pandas-dataframe-aggregate-function-using-multiple-columns
    In rare instance, we may not have weights, so just return the mean
    """
    d = group[avg_name]
    w = group[weight_name]
    try:
        return (d * w).sum() / w.sum()
    except ZeroDivisionError:
        return d.mean()

Call the weighted average on all of the data

In [7]:

print(wavg(sales, "Current_Price", "Quantity"))
print(wavg(sales, "New_Product_Price", "Quantity"))

342.540687161
374.638336347

Group the data by manager

In [8]:

sales.groupby("Manager").apply(wavg, "Current_Price", "Quantity")

Out[8]:

Manager
Debra Henley     340.665584
Fred Anderson    344.897959
dtype: float64

In [9]:

sales.groupby("Manager").apply(wavg, "New_Product_Price", "Quantity")

Out[9]:

Manager
Debra Henley     372.646104
Fred Anderson    377.142857
dtype: float64

You can also group by state

In [10]:

sales.groupby("State").apply(wavg, "New_Product_Price", "Quantity")

Out[10]:

State
CA    446.428571
MN    632.894737
NV    325.000000
TX    274.852941
WA    610.000000
WI    440.000000
dtype: float64

You can also group by multiple criteria and the function will work correctly.

In [11]:

sales.groupby(["Manager", "State"]).apply(wavg, "New_Product_Price", "Quantity")

Out[11]:

Manager        State
Debra Henley   MN       632.894737
               TX       274.852941
               WI       440.000000
Fred Anderson  CA       446.428571
               NV       325.000000
               WA       610.000000
dtype: float64

Example of applying multiple aggregation functions

In [12]:

f = {'New_Product_Price': ['mean'],'Current_Price': ['median'], 'Quantity': ['sum', 'mean']}
sales.groupby("Manager").agg(f)

Out[12]:

	Quantity		New_Product_Price	Current_Price
	sum	mean	mean	median
Manager
Debra Henley	1540	256.666667	471.666667	437.5
Fred Anderson	1225	204.166667	422.500000	375.0

Similar method to group multiple custom functions together into a single DataFrame

In [13]:

data_1 = sales.groupby("Manager").apply(wavg, "New_Product_Price", "Quantity")
data_2 = sales.groupby("Manager").apply(wavg, "Current_Price", "Quantity")

In [14]:

summary = pd.DataFrame(data=dict(s1=data_1, s2=data_2))
summary.columns = ["New Product Price","Current Product Price"]
summary.head()

Out[14]:

	New Product Price	Current Product Price
Manager
Debra Henley	372.646104	340.665584
Fred Anderson	377.142857	344.897959

Finally, numpy has an average function that can be used:

In [15]:

np.average(sales["Current_Price"], weights=sales["Quantity"])

Out[15]:

342.54068716094031

Use a lambda function for it to work with grouped data

In [16]:

sales.groupby("Manager").apply(lambda x: np.average(x['New_Product_Price'], weights=x['Quantity']))

Out[16]:

Manager
Debra Henley     372.646104
Fred Anderson    377.142857
dtype: float64

In [ ]: