(This was the final project in the Data Analytics course by Practicum by Yandex. Due to copyright, the raw dataset is unavailable for examination, but here is my analysis of the dataset we were provided)
Find ways to improve profits for the coming year by analyzing how the company did this year.
PROBLEM: How can we improve profits -- which means asking questions about what we sell, to whom, and when.
SOLUTION:
Description of the data
* InvoiceNo -- order identifier
* StockCode -- item identifier
* Description -- item name
* Quantity -- number of items ordered
* InvoiceDate -- order date
* UnitPrice -- price per item
* CustomerID -- customer identifier
Opening Data
* Rename variables to lowercase, with underscores between words
* Feature engineering
* Check for NULLS
* Check for duplicates
* Check for outliers
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set() # set seaborn as the default style, even when using only matplotlib
sns.set_palette("tab10") # set the default color palette for seaborn
%matplotlib inline
# DISPLAY OPTIONS (uncomment and customize to activate)
# prevent annoying tendency of not displaying 'middle' columns if data is very wide
# change the max_columns value for even wider datasets
# pd.options.display.max_columns = None
# pd.options.display.max_columns = 40
# see "all" the rows if needed
# change max_rows to desired number
# pd.options.display.max_rows = 105
# pd.reset_option('^display.', silent=True)
I've coded up some functions that I find myself using regularly:
# A helper function for my other functions
# returns the name of a DataFrame variable -- can be utilized in automation
def get_df_name(df):
    name = [x for x in globals() if globals()[x] is df][0]
    return name
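A purely illustrative check: calling it on a frame defined in the global namespace returns that variable's name (the frame below is hypothetical, just for the demo).
sales_data = pd.DataFrame({'revenue': [1, 2, 3]}) # hypothetical frame for the demo
print(get_df_name(sales_data)) # prints: sales_data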
# checks a single dataframe for duplicates, and shows the corresponding % of total rows
def show_info_on_duplicates(df):
    # Percentage of rows that are duplicates
    print("There are:\n" + str(df[df.duplicated()].shape[0])
          + " duplicate rows, making up\n"
          + str(round(df[df.duplicated()].shape[0] / df.shape[0] * 100, 2))
          + "% of the total "
          + str(df.shape[0])
          + " rows.")
def show_info_on_nulls(df):
    '''This function takes a dataframe and prints, for each column, the total
    number of rows that contain NaNs, along with the percentage of the total
    dataframe that those NaNs make up.
    PURPOSE: help the user decide whether or not to drop these particular rows
    REQUIREMENT: get_df_name()
    RETURNS: None. Displays a printout.
    Version: 2020-09-28
    '''
    print("DataFrame: " + get_df_name(df))
    total_rows = df.shape[0]
    raw_data = {'Feature': df.columns}
    new_df = pd.DataFrame(raw_data, columns = ['Feature'])
    print("Observations (rows): " + str(total_rows))
    print()
    num_of_nans = []
    percent_of_total = []
    for each in df.columns:
        num_of_nans.append(df[df[each].isnull()].shape[0])
        percent_of_total.append(round(df[df[each].isnull()].shape[0] / df.shape[0] * 100, 2))
    new_df['NaNs'] = num_of_nans
    new_df['% total'] = percent_of_total
    print(new_df)
    print()
import re
def rename_variables(original_list):
    '''This function takes a list of CamelCase-styled variable names -- the list can
    simply be the column names from df.columns -- and returns a list with all the names converted
    to lowercased versions, with underscores between the previously capitalized words.
    Librar(ies) required: re
    PURPOSE: make variable names more readable, or 'Pythonic'
    RETURNS: a list of pythonically named variable names
    '''
    new_list = []
    for each_name in original_list:
        # Split on uppercase letters using re -- but keep runs of consecutive capitals like 'ID' together
        single_variables_names_list = re.findall(r'[A-Z](?:[A-Z]*(?![a-z])|[a-z]*)', each_name)
        lowercased_string_list = [each_string.lower() for each_string in single_variables_names_list]
        # a name with no capital letters yields an empty list -- keep such names as-is
        if not lowercased_string_list:
            new_list.append(each_name)
        else:
            # Recombine using underscores
            new_list.append('_'.join(lowercased_string_list))
    return new_list
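A quick illustrative check that the split keeps runs of capitals like 'ID' together:
print(rename_variables(['InvoiceNo', 'UnitPrice', 'CustomerID']))
# ['invoice_no', 'unit_price', 'customer_id']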
# Set the correct path -- uncomment the one you want to use
# Practicum path - uncomment the following line for reviewer to use on Practicum platform
# path = "/datasets/"
# local path - uncomment the following line to use inside project file
path = 'datasets/'
df = pd.read_csv(path + 'ecommerce_dataset_us.csv', sep = '\t')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 7 columns):
 #   Column       Non-Null Count   Dtype
---  ------       --------------   -----
 0   InvoiceNo    541909 non-null  object
 1   StockCode    541909 non-null  object
 2   Description  540455 non-null  object
 3   Quantity     541909 non-null  int64
 4   InvoiceDate  541909 non-null  object
 5   UnitPrice    541909 non-null  float64
 6   CustomerID   406829 non-null  float64
dtypes: float64(2), int64(1), object(4)
memory usage: 28.9+ MB
df.describe()
Quantity | UnitPrice | CustomerID | |
---|---|---|---|
count | 541909.000000 | 541909.000000 | 406829.000000 |
mean | 9.552250 | 4.611114 | 15287.690570 |
std | 218.081158 | 96.759853 | 1713.600303 |
min | -80995.000000 | -11062.060000 | 12346.000000 |
25% | 1.000000 | 1.250000 | 13953.000000 |
50% | 3.000000 | 2.080000 | 15152.000000 |
75% | 10.000000 | 4.130000 | 16791.000000 |
max | 80995.000000 | 38970.000000 | 18287.000000 |
df.head()
InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | |
---|---|---|---|---|---|---|---|
0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 11/29/2018 08:26 | 2.55 | 17850.0 |
1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 11/29/2018 08:26 | 3.39 | 17850.0 |
2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 11/29/2018 08:26 | 2.75 | 17850.0 |
3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 11/29/2018 08:26 | 3.39 | 17850.0 |
4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 11/29/2018 08:26 | 3.39 | 17850.0 |
df.tail()
InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | |
---|---|---|---|---|---|---|---|
541904 | 581587 | 22613 | PACK OF 20 SPACEBOY NAPKINS | 12 | 12/07/2019 12:50 | 0.85 | 12680.0 |
541905 | 581587 | 22899 | CHILDREN'S APRON DOLLY GIRL | 6 | 12/07/2019 12:50 | 2.10 | 12680.0 |
541906 | 581587 | 23254 | CHILDRENS CUTLERY DOLLY GIRL | 4 | 12/07/2019 12:50 | 4.15 | 12680.0 |
541907 | 581587 | 23255 | CHILDRENS CUTLERY CIRCUS PARADE | 4 | 12/07/2019 12:50 | 4.15 | 12680.0 |
541908 | 581587 | 22138 | BAKING SET 9 PIECE RETROSPOT | 3 | 12/07/2019 12:50 | 4.95 | 12680.0 |
df.sample(50)
InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | |
---|---|---|---|---|---|---|---|
181401 | 552468 | 21136 | PAINTED METAL PEARS ASSORTED | 8 | 05/07/2019 15:33 | 1.69 | 13141.0 |
134500 | 547850 | 22284 | HEN HOUSE DECORATION | 1 | 03/25/2019 12:06 | 1.65 | 14472.0 |
479735 | 577175 | 22899 | CHILDREN'S APRON DOLLY GIRL | 6 | 11/16/2019 10:58 | 2.10 | 14849.0 |
104591 | 545188 | 22554 | PLASTERS IN TIN WOODLAND ANIMALS | 3 | 02/26/2019 15:19 | 1.65 | 14056.0 |
444255 | 574722 | 85071A | BLUE CHARLIE+LOLA PERSONAL DOORSIGN | 4 | 11/04/2019 14:45 | 0.39 | 14502.0 |
33857 | 539300 | 22629 | SPACEBOY LUNCH BOX | 1 | 12/14/2018 17:31 | 1.95 | NaN |
504418 | 578949 | 22300 | COFFEE MUG DOG + BALL DESIGN | 6 | 11/25/2019 14:30 | 2.55 | 14954.0 |
85499 | 543474 | 22470 | HEART OF WICKER LARGE | 3 | 02/06/2019 15:09 | 2.95 | 14415.0 |
178184 | 552229 | 20731 | POSY CANDY BAG | 1 | 05/04/2019 15:40 | 0.83 | NaN |
335253 | 566281 | 22297 | HEART IVORY TRELLIS SMALL | 24 | 09/09/2019 14:15 | 1.25 | 12748.0 |
176220 | 551997 | 22970 | LONDON BUS COFFEE MUG | 2 | 05/03/2019 15:47 | 4.96 | NaN |
453261 | 575477 | 23522 | WALL ART DOG AND BALL | 2 | 11/07/2019 16:14 | 5.79 | NaN |
200014 | 554110 | 82482 | WOODEN PICTURE FRAME WHITE FINISH | 6 | 05/20/2019 14:35 | 2.55 | 13468.0 |
63782 | 541592 | 85173 | SET/6 FROG PRINCE T-LIGHT CANDLES | 1 | 01/17/2019 15:08 | 2.46 | NaN |
111502 | 545721 | 21033 | JUMBO BAG CHARLIE AND LOLA TOYS | 10 | 03/05/2019 10:52 | 2.95 | 15039.0 |
238078 | 557896 | 22175 | PINK OWL SOFT TOY | 1 | 06/21/2019 14:22 | 2.95 | 17611.0 |
248548 | 558861 | 23052 | RECYCLED ACAPULCO MAT TURQUOISE | 4 | 07/02/2019 12:19 | 8.25 | 13373.0 |
476807 | 577034 | 21070 | VINTAGE BILLBOARD MUG | 36 | 11/15/2019 13:16 | 1.06 | 12847.0 |
26521 | 538515 | 22550 | HOLIDAY FUN LUDO | 36 | 12/10/2018 14:43 | 3.39 | 17937.0 |
313529 | 564510 | 85019B | BLOSSOM IMAGES NOTEBOOK SET | 24 | 08/23/2019 14:42 | 1.25 | 14194.0 |
24054 | 538312 | 22576 | SWALLOW WOODEN CHRISTMAS DECORATION | 3 | 12/08/2018 13:48 | 0.85 | 16727.0 |
77147 | 542711 | 21485 | RETROSPOT HEART HOT WATER BOTTLE | 1 | 01/29/2019 13:19 | 4.95 | 17894.0 |
350402 | 567634 | 22569 | FELTCRAFT CUSHION BUTTERFLY | 48 | 09/19/2019 13:35 | 3.39 | 12939.0 |
234166 | 557502 | 20979 | 36 PENCILS TUBE RED RETROSPOT | 1 | 06/18/2019 15:32 | 2.46 | NaN |
393211 | 570807 | 22993 | SET OF 4 PANTRY JELLY MOULDS | 12 | 10/10/2019 12:44 | 1.25 | 13704.0 |
300672 | 563209 | 23228 | FILIGREE HEART BIRD WHITE | 6 | 08/12/2019 12:23 | 1.25 | 13527.0 |
73068 | 542278 | 84688 | BEACH HUT DESIGN BLACKBOARD | 6 | 01/25/2019 10:43 | 4.25 | 14911.0 |
432999 | 573902 | 23409 | PHOTO FRAME LINEN AND LACE LARGE | 6 | 10/30/2019 14:51 | 3.75 | 17428.0 |
313205 | 564473 | 21495 | SKULLS AND CROSSBONES WRAP | 25 | 08/23/2019 12:33 | 0.42 | 16722.0 |
161396 | 550474 | 16235 | RECYCLED PENCIL WITH RABBIT ERASER | 2 | 04/16/2019 13:58 | 0.42 | NaN |
240463 | 558102 | 23239 | SET OF 4 KNICK KNACK TINS POPPIES | 1 | 06/24/2019 14:06 | 4.15 | 17372.0 |
328638 | 565796 | 23200 | JUMBO BAG PEARS | 10 | 09/05/2019 10:07 | 2.08 | 14916.0 |
129783 | 547390 | 37448 | CERAMIC CAKE DESIGN SPOTTED MUG | 12 | 03/20/2019 16:08 | 1.49 | 12352.0 |
509367 | 579297 | 85123a | WHITE HANGING HEART T-LIGHT HOLDER | 1 | 11/27/2019 11:23 | 6.63 | NaN |
407579 | C571896 | 22494 | EMERGENCY FIRST AID TIN | -2 | 10/17/2019 14:18 | 1.25 | 16533.0 |
252285 | 559107 | 23091 | ZINC HERB GARDEN CONTAINER | 1 | 07/04/2019 11:22 | 6.25 | 15334.0 |
225804 | 556733 | 21564 | PINK HEART SHAPE LOVE BUCKET | 1 | 06/12/2019 11:53 | 2.95 | 16919.0 |
42440 | 539988 | 21354 | TOAST ITS - BEST MUM | 2 | 12/21/2018 16:06 | 1.25 | 18116.0 |
55036 | 540977 | 72801C | 4 ROSE PINK DINNER CANDLES | 1 | 01/10/2019 15:01 | 2.51 | NaN |
506725 | 579152 | 22734 | SET OF 6 RIBBONS VINTAGE CHRISTMAS | 6 | 11/26/2019 13:26 | 2.89 | 12479.0 |
88392 | 543802 | 22379 | RECYCLING BAG RETROSPOT | 1 | 02/11/2019 12:03 | 2.10 | 13742.0 |
111153 | 545713 | 85049g | CHOCOLATE BOX RIBBONS | 1 | 03/05/2019 10:12 | 2.46 | NaN |
456092 | C575669 | 23462 | ROCOCO WALL MIRROR WHITE | -1 | 11/08/2019 15:00 | 19.95 | 12417.0 |
535119 | 581188 | 23295 | SET OF 12 MINI LOAF BAKING CASES | 8 | 12/05/2019 16:47 | 0.83 | 17735.0 |
265599 | 560225 | 22322 | BIRD DECORATION GREEN POLKADOT | 5 | 07/13/2019 16:27 | 0.83 | NaN |
192691 | 553464 | 20725 | LUNCH BAG RED RETROSPOT | 50 | 05/15/2019 11:07 | 1.65 | 16218.0 |
23307 | 538197 | 22746 | POPPY'S PLAYHOUSE LIVINGROOM | 3 | 12/08/2018 10:56 | 2.10 | 14419.0 |
48614 | 540469 | 22966 | GINGERBREAD MAN COOKIE CUTTER | 3 | 01/05/2019 14:04 | 1.25 | 12484.0 |
416153 | 572552 | 21770 | OPEN CLOSED METAL SIGN | 1 | 10/22/2019 17:07 | 9.96 | 14096.0 |
450594 | 575176 | 23377 | PACK OF 12 DOLLY GIRL TISSUES | 1 | 11/06/2019 18:29 | 0.83 | NaN |
We'll use my custom function to convert the column names from CamelCase to lowercase with underscores.
df.columns = rename_variables(df.columns)
df.columns
Index(['invoice_no', 'stock_code', 'description', 'quantity', 'invoice_date', 'unit_price', 'customer_id'], dtype='object')
We will add a total_price column, as we're going to look at sales totals by item/customer/date
df['total_price'] = df['quantity'] * df['unit_price']
df['total_price'].sum()
9747747.933999998
# check if any null values exist
df.isnull().values.any()
True
show_info_on_nulls(df)
DataFrame: df
Observations (rows): 541909

        Feature    NaNs  % total
0    invoice_no       0     0.00
1    stock_code       0     0.00
2   description    1454     0.27
3      quantity       0     0.00
4  invoice_date       0     0.00
5    unit_price       0     0.00
6   customer_id  135080    24.93
7   total_price       0     0.00
show_info_on_duplicates(df)
There are: 5268 duplicate rows, making up 0.97% of the total 541909 rows.
df = df.drop_duplicates()
Nearly 1% of the data are duplicates?
ACTION REQUIRED: ask Data Engineering why there are so many duplicates. Is there some UX bug where a user inadvertently places an order more than once? Or some other issue?
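If Data Engineering wants concrete examples to investigate, a sketch along these lines (run before the drop_duplicates() call above) would surface the most frequently repeated rows:
# illustrative: count how often each fully duplicated row appears
dupes = df[df.duplicated(keep = False)]
print(dupes.groupby(['invoice_no', 'stock_code', 'quantity']).size().sort_values(ascending = False).head(10))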
Next, let's take a closer look at the null customer IDs, as they comprise nearly 25% of the data.
# Let's take a look at the extremes of the quantity feature with customer_id as null
df[df['customer_id'].isnull()].sort_values(by = 'quantity', ascending = False).head(10)
invoice_no | stock_code | description | quantity | invoice_date | unit_price | customer_id | total_price | |
---|---|---|---|---|---|---|---|---|
74614 | 542504 | 37413 | NaN | 5568 | 01/26/2019 12:03 | 0.00 | NaN | 0.0 |
220843 | 556231 | 85123A | ? | 4000 | 06/07/2019 15:04 | 0.00 | NaN | 0.0 |
263885 | 560040 | 23343 | came coded as 20713 | 3100 | 07/12/2019 14:28 | 0.00 | NaN | 0.0 |
115807 | 546139 | 84988 | ? | 3000 | 03/07/2019 16:35 | 0.00 | NaN | 0.0 |
74615 | 542505 | 79063D | NaN | 2560 | 01/26/2019 12:04 | 0.00 | NaN | 0.0 |
447035 | 574941 | 22197 | POPCORN HOLDER | 1820 | 11/05/2019 17:42 | 1.95 | NaN | 3549.0 |
160541 | 550460 | 47556B | did a credit and did not tick ret | 1300 | 04/16/2019 13:18 | 0.00 | NaN | 0.0 |
203751 | 554550 | 47566B | incorrectly credited C550456 see 47 | 1300 | 05/23/2019 09:57 | 0.00 | NaN | 0.0 |
82795 | 543258 | 84611B | NaN | 1287 | 02/02/2019 16:06 | 0.00 | NaN | 0.0 |
467796 | 576365 | 22197 | POPCORN HOLDER | 1130 | 11/12/2019 17:55 | 1.95 | NaN | 2203.5 |
df[df['customer_id'].isnull()].sort_values(by = 'quantity', ascending = False).tail(10)
invoice_no | stock_code | description | quantity | invoice_date | unit_price | customer_id | total_price | |
---|---|---|---|---|---|---|---|---|
375429 | 569466 | 23270 | incorrect stock entry. | -2880 | 10/02/2019 11:42 | 0.0 | NaN | -0.0 |
113580 | 545990 | 84598 | check | -3000 | 03/06/2019 13:07 | 0.0 | NaN | -0.0 |
263884 | 560039 | 20713 | wrongly marked. 23343 in box | -3100 | 07/12/2019 14:27 | 0.0 | NaN | -0.0 |
323458 | 565304 | 16259 | NaN | -3167 | 08/31/2019 12:18 | 0.0 | NaN | -0.0 |
341601 | 566768 | 16045 | NaN | -3667 | 09/12/2019 17:53 | 0.0 | NaN | -0.0 |
431381 | 573596 | 79323W | Unsaleable, destroyed. | -4830 | 10/29/2019 15:17 | 0.0 | NaN | -0.0 |
115818 | 546152 | 72140F | throw away | -5368 | 03/07/2019 17:25 | 0.0 | NaN | -0.0 |
225528 | 556687 | 23003 | Printing smudges/thrown away | -9058 | 06/12/2019 10:36 | 0.0 | NaN | -0.0 |
225529 | 556690 | 23005 | printing smudges/thrown away | -9600 | 06/12/2019 10:37 | 0.0 | NaN | -0.0 |
225530 | 556691 | 23005 | printing smudges/thrown away | -9600 | 06/12/2019 10:37 | 0.0 | NaN | -0.0 |
df[df['customer_id'].isnull()].sample(10)
invoice_no | stock_code | description | quantity | invoice_date | unit_price | customer_id | total_price | |
---|---|---|---|---|---|---|---|---|
87670 | 543713 | 21556 | CERAMIC STRAWBERRY MONEY BOX | 2 | 02/09/2019 11:46 | 4.13 | NaN | 8.26 |
501978 | 578833 | 23162 | REGENCY TEA STRAINER | 1 | 11/23/2019 15:23 | 7.46 | NaN | 7.46 |
143010 | 548663 | 21930 | JUMBO STORAGE BAG SKULLS | 2 | 03/30/2019 14:44 | 4.96 | NaN | 9.92 |
15924 | 537642 | 22357 | KINGS CHOICE BISCUIT TIN | 1 | 12/05/2018 15:33 | 8.47 | NaN | 8.47 |
295962 | 562843 | 90114 | SUMMER DAISIES BAG CHARM | 1 | 08/07/2019 17:37 | 2.46 | NaN | 2.46 |
355335 | 567921 | 21089 | damaged | -7 | 09/20/2019 17:24 | 0.00 | NaN | -0.00 |
138222 | 548191 | 85049D | BRIGHT BLUES RIBBONS | 2 | 03/27/2019 15:20 | 2.46 | NaN | 4.92 |
157759 | 550211 | 82599 | FANNY'S REST STOPMETAL SIGN | 1 | 04/13/2019 10:41 | 4.13 | NaN | 4.13 |
481153 | C577343 | BANK CHARGES | Bank Charges | -1 | 11/16/2019 15:13 | 27.21 | NaN | -27.21 |
80041 | 543013 | 20724 | RED RETROSPOT CHARLOTTE BAG | 1 | 01/31/2019 13:35 | 1.63 | NaN | 1.63 |
df[df['unit_price'] == 0].count()
invoice_no      2510
stock_code      2510
description     1056
quantity        2510
invoice_date    2510
unit_price      2510
customer_id       40
total_price     2510
dtype: int64
2510 records show a unit_price of $0. Let's see how much of sales revenue these unknown customers account for.
# select the null customers
df_null_customers = df[df['customer_id'].isnull()].reset_index()
df_null_customers.head()
index | invoice_no | stock_code | description | quantity | invoice_date | unit_price | customer_id | total_price | |
---|---|---|---|---|---|---|---|---|---|
0 | 622 | 536414 | 22139 | NaN | 56 | 11/29/2018 11:52 | 0.00 | NaN | 0.00 |
1 | 1443 | 536544 | 21773 | DECORATIVE ROSE BATHROOM BOTTLE | 1 | 11/29/2018 14:32 | 2.51 | NaN | 2.51 |
2 | 1444 | 536544 | 21774 | DECORATIVE CATS BATHROOM BOTTLE | 2 | 11/29/2018 14:32 | 2.51 | NaN | 5.02 |
3 | 1445 | 536544 | 21786 | POLKADOT RAIN HAT | 4 | 11/29/2018 14:32 | 0.85 | NaN | 3.40 |
4 | 1446 | 536544 | 21787 | RAIN PONCHO RETROSPOT | 2 | 11/29/2018 14:32 | 1.66 | NaN | 3.32 |
print("Transactions from non-ID'ed customers totaled: {:.2f}".format(df_null_customers['total_price'].sum()))
print("out of ALL customers transactions totalling: {:.2f}".format(df['total_price'].sum()))
print("which comes out to "
+ str(round(df_null_customers['total_price'].sum()
/ df['total_price'].sum()
* 100, 2))
+ "% of revenues.")
Transactions from non-ID'ed customers totaled: 1447487.53
out of ALL customers transactions totalling: 9726006.95
which comes out to 14.88% of revenues.
\$1.4 million from unaccounted customers is quite significant. They make up 25% of the transactions and 15% of revenues.
We can assume that some were cash transactions, where we don't know who the customers are.
However, there are transactions involving hundreds, even thousands of units. Some of these large transactions have no associated revenue (e.g. a $0 unit price). It appears that large negative quantities were sometimes returned damaged product, and some large "sales" may have been gifts or donations.
ACTION REQUIRED: There has to be a better way to reconcile these abnormal transactions so that we can properly track and attribute them to either a profit or a loss. In our case, we're talking about $1.4 million in sales that cannot be properly accounted for.
Since we're talking about $1.4 million in transactions, we really cannot delete this information.
However, if we do any type of analysis regarding the customers, we would have to take this lack of ID into account.
We will fill the nulls with zeroes.
# Do not delete null customers
# df = df[df['customer_id'].notna()]
# assign rather than use chained inplace fillna, which can trigger pandas warnings
df['customer_id'] = df['customer_id'].fillna(0)
df['invoice_date'] = pd.to_datetime(df['invoice_date'])
df['invoice_date'].min()
Timestamp('2018-11-29 08:26:00')
df['invoice_date'].max()
Timestamp('2019-12-07 12:50:00')
df['customer_id'] = df['customer_id'].astype(int)
df.sample(5)
invoice_no | stock_code | description | quantity | invoice_date | unit_price | customer_id | total_price | |
---|---|---|---|---|---|---|---|---|
378614 | 569653 | 20717 | STRAWBERRY SHOPPER BAG | 20 | 2019-10-03 12:52:00 | 1.25 | 12451 | 25.00 |
414085 | 572336 | 22738 | RIBBON REEL SNOWY VILLAGE | 10 | 2019-10-22 10:29:00 | 1.65 | 13523 | 16.50 |
402584 | 571505 | 23197 | SKETCHBOOK MAGNETIC SHOPPING LIST | 24 | 2019-10-15 15:19:00 | 1.45 | 17244 | 34.80 |
143393 | 548699 | 22479 | DAISY GARDEN MARKER | 3 | 2019-04-01 11:08:00 | 1.25 | 0 | 3.75 |
31227 | 538907 | 22835 | HOT WATER BOTTLE I AM SO POORLY | 4 | 2018-12-13 10:40:00 | 4.65 | 15373 | 18.60 |
Let's check why there are negative quantities
df[df['quantity'] < 0].sort_values(by = 'quantity').head(10)
invoice_no | stock_code | description | quantity | invoice_date | unit_price | customer_id | total_price | |
---|---|---|---|---|---|---|---|---|
540422 | C581484 | 23843 | PAPER CRAFT , LITTLE BIRDIE | -80995 | 2019-12-07 09:27:00 | 2.08 | 16446 | -168469.6 |
61624 | C541433 | 23166 | MEDIUM CERAMIC TOP STORAGE JAR | -74215 | 2019-01-16 10:17:00 | 1.04 | 12346 | -77183.6 |
225530 | 556691 | 23005 | printing smudges/thrown away | -9600 | 2019-06-12 10:37:00 | 0.00 | 0 | -0.0 |
225529 | 556690 | 23005 | printing smudges/thrown away | -9600 | 2019-06-12 10:37:00 | 0.00 | 0 | -0.0 |
4287 | C536757 | 84347 | ROTATING SILVER ANGELS T-LIGHT HLDR | -9360 | 2018-11-30 14:23:00 | 0.03 | 15838 | -280.8 |
225528 | 556687 | 23003 | Printing smudges/thrown away | -9058 | 2019-06-12 10:36:00 | 0.00 | 0 | -0.0 |
115818 | 546152 | 72140F | throw away | -5368 | 2019-03-07 17:25:00 | 0.00 | 0 | -0.0 |
431381 | 573596 | 79323W | Unsaleable, destroyed. | -4830 | 2019-10-29 15:17:00 | 0.00 | 0 | -0.0 |
341601 | 566768 | 16045 | NaN | -3667 | 2019-09-12 17:53:00 | 0.00 | 0 | -0.0 |
323458 | 565304 | 16259 | NaN | -3167 | 2019-08-31 12:18:00 | 0.00 | 0 | -0.0 |
Two items have huge negative quantities. Are they returns? Combined, they make up over $200k in refunds. Let's look at the first entry, stock# 23843, returned by customer# 16446.
df[df['stock_code'] == '23843']
invoice_no | stock_code | description | quantity | invoice_date | unit_price | customer_id | total_price | |
---|---|---|---|---|---|---|---|---|
540421 | 581483 | 23843 | PAPER CRAFT , LITTLE BIRDIE | 80995 | 2019-12-07 09:15:00 | 2.08 | 16446 | 168469.6 |
540422 | C581484 | 23843 | PAPER CRAFT , LITTLE BIRDIE | -80995 | 2019-12-07 09:27:00 | 2.08 | 16446 | -168469.6 |
df[df['stock_code'] == '23166'].head(30)
invoice_no | stock_code | description | quantity | invoice_date | unit_price | customer_id | total_price | |
---|---|---|---|---|---|---|---|---|
61619 | 541431 | 23166 | MEDIUM CERAMIC TOP STORAGE JAR | 74215 | 2019-01-16 10:01:00 | 1.04 | 12346 | 77183.60 |
61624 | C541433 | 23166 | MEDIUM CERAMIC TOP STORAGE JAR | -74215 | 2019-01-16 10:17:00 | 1.04 | 12346 | -77183.60 |
186770 | 552882 | 23166 | MEDIUM CERAMIC TOP STORAGE JAR | 96 | 2019-05-10 10:10:00 | 1.04 | 14646 | 99.84 |
187196 | 552953 | 23166 | MEDIUM CERAMIC TOP STORAGE JAR | 4 | 2019-05-10 12:11:00 | 1.25 | 16745 | 5.00 |
187718 | 553005 | 23166 | MEDIUM CERAMIC TOP STORAGE JAR | 5 | 2019-05-10 16:29:00 | 1.25 | 14651 | 6.25 |
187868 | 553009 | 23166 | MEDIUM CERAMIC TOP STORAGE JAR | 3 | 2019-05-10 16:52:00 | 1.25 | 13601 | 3.75 |
188644 | 553052 | 23166 | MEDIUM CERAMIC TOP STORAGE JAR | 48 | 2019-05-11 10:14:00 | 1.04 | 15251 | 49.92 |
189398 | 553149 | 23166 | MEDIUM CERAMIC TOP STORAGE JAR | 12 | 2019-05-11 14:16:00 | 1.25 | 14226 | 15.00 |
189452 | 553152 | 23166 | MEDIUM CERAMIC TOP STORAGE JAR | 12 | 2019-05-11 14:55:00 | 1.25 | 13089 | 15.00 |
189591 | 553160 | 23166 | MEDIUM CERAMIC TOP STORAGE JAR | 12 | 2019-05-11 15:26:00 | 1.25 | 14194 | 15.00 |
190854 | 553212 | 23166 | MEDIUM CERAMIC TOP STORAGE JAR | 12 | 2019-05-14 09:42:00 | 1.25 | 15358 | 15.00 |
191128 | 553339 | 23166 | MEDIUM CERAMIC TOP STORAGE JAR | 48 | 2019-05-14 12:19:00 | 1.04 | 15125 | 49.92 |
191165 | 553341 | 23166 | MEDIUM CERAMIC TOP STORAGE JAR | 12 | 2019-05-14 12:25:00 | 1.25 | 16293 | 15.00 |
191581 | 553377 | 23166 | MEDIUM CERAMIC TOP STORAGE JAR | 12 | 2019-05-14 14:58:00 | 1.25 | 14888 | 15.00 |
192039 | 553389 | 23166 | MEDIUM CERAMIC TOP STORAGE JAR | 3 | 2019-05-14 16:37:00 | 2.46 | 0 | 7.38 |
194462 | 553607 | 23166 | MEDIUM CERAMIC TOP STORAGE JAR | 240 | 2019-05-16 10:47:00 | 1.04 | 16684 | 249.60 |
194887 | 553678 | 23166 | MEDIUM CERAMIC TOP STORAGE JAR | 12 | 2019-05-16 13:00:00 | 1.25 | 14009 | 15.00 |
195455 | 553718 | 23166 | MEDIUM CERAMIC TOP STORAGE JAR | 1 | 2019-05-16 16:14:00 | 2.46 | 0 | 2.46 |
195725 | 553739 | 23166 | MEDIUM CERAMIC TOP STORAGE JAR | 12 | 2019-05-17 09:23:00 | 1.25 | 14895 | 15.00 |
196707 | 553860 | 23166 | MEDIUM CERAMIC TOP STORAGE JAR | 2 | 2019-05-17 13:57:00 | 1.25 | 17841 | 2.50 |
198115 | 554013 | 23166 | MEDIUM CERAMIC TOP STORAGE JAR | 12 | 2019-05-18 13:13:00 | 1.25 | 13308 | 15.00 |
198839 | 554070 | 23166 | MEDIUM CERAMIC TOP STORAGE JAR | 9 | 2019-05-20 11:06:00 | 1.25 | 13555 | 11.25 |
199427 | 554098 | 23166 | MEDIUM CERAMIC TOP STORAGE JAR | 6 | 2019-05-20 13:01:00 | 1.25 | 14769 | 7.50 |
199868 | 554104 | 23166 | MEDIUM CERAMIC TOP STORAGE JAR | 12 | 2019-05-20 14:03:00 | 1.25 | 12705 | 15.00 |
201030 | 554261 | 23166 | MEDIUM CERAMIC TOP STORAGE JAR | 2 | 2019-05-21 12:32:00 | 1.25 | 16033 | 2.50 |
201099 | 554268 | 23166 | MEDIUM CERAMIC TOP STORAGE JAR | 1 | 2019-05-21 12:59:00 | 1.25 | 15311 | 1.25 |
201570 | 554307 | 23166 | MEDIUM CERAMIC TOP STORAGE JAR | 96 | 2019-05-21 14:59:00 | 1.04 | 12989 | 99.84 |
202855 | 554504 | 23166 | MEDIUM CERAMIC TOP STORAGE JAR | 12 | 2019-05-22 15:13:00 | 1.25 | 17719 | 15.00 |
203207 | 554512 | 23166 | MEDIUM CERAMIC TOP STORAGE JAR | 4 | 2019-05-22 15:54:00 | 2.46 | 0 | 9.84 |
203625 | C554527 | 23166 | MEDIUM CERAMIC TOP STORAGE JAR | -9 | 2019-05-22 17:25:00 | 1.04 | 15251 | -9.36 |
Okay, so they ARE refunds. With quantities that large, they were most likely erroneous entries that were corrected, as the paired transactions were also placed within 15-20 minutes of each other.
We'll delete them since they cancel each other out. That way, we avoid skewing any visualizations we make.
df[df['invoice_no'].str.contains(pat = 'C[0-9]', regex = True)].info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 9251 entries, 141 to 541717
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   invoice_no    9251 non-null   object
 1   stock_code    9251 non-null   object
 2   description   9251 non-null   object
 3   quantity      9251 non-null   int64
 4   invoice_date  9251 non-null   datetime64[ns]
 5   unit_price    9251 non-null   float64
 6   customer_id   9251 non-null   int64
 7   total_price   9251 non-null   float64
dtypes: datetime64[ns](1), float64(2), int64(2), object(3)
memory usage: 650.5+ KB
rows_to_delete = [540421, 540422, 61619, 61624]
df.drop(rows_to_delete, axis = 'index', inplace = True)
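Hand-picking row indices works for these two cases, but a more systematic sketch (assuming, as the 'C[0-9]' pattern above does, that credit invoices start with 'C') could pair each credit with a sale of the same item, customer, and price, and an exactly opposite quantity:
# illustrative sketch: find sale/credit pairs that fully offset each other
credits = df[df['invoice_no'].str.startswith('C')]
sales = df[~df['invoice_no'].str.startswith('C')]
offsets = sales.merge(credits, on = ['stock_code', 'customer_id', 'unit_price'], suffixes = ('_sale', '_credit'))
offsets = offsets[offsets['quantity_sale'] == -offsets['quantity_credit']]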
df[df['unit_price'] <= 0]
invoice_no | stock_code | description | quantity | invoice_date | unit_price | customer_id | total_price | |
---|---|---|---|---|---|---|---|---|
622 | 536414 | 22139 | NaN | 56 | 2018-11-29 11:52:00 | 0.0 | 0 | 0.0 |
1970 | 536545 | 21134 | NaN | 1 | 2018-11-29 14:32:00 | 0.0 | 0 | 0.0 |
1971 | 536546 | 22145 | NaN | 1 | 2018-11-29 14:33:00 | 0.0 | 0 | 0.0 |
1972 | 536547 | 37509 | NaN | 1 | 2018-11-29 14:33:00 | 0.0 | 0 | 0.0 |
1987 | 536549 | 85226A | NaN | 1 | 2018-11-29 14:34:00 | 0.0 | 0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
536981 | 581234 | 72817 | NaN | 27 | 2019-12-06 10:33:00 | 0.0 | 0 | 0.0 |
538504 | 581406 | 46000M | POLYESTER FILLER PAD 45x45cm | 240 | 2019-12-06 13:58:00 | 0.0 | 0 | 0.0 |
538505 | 581406 | 46000S | POLYESTER FILLER PAD 40x40cm | 300 | 2019-12-06 13:58:00 | 0.0 | 0 | 0.0 |
538554 | 581408 | 85175 | NaN | 20 | 2019-12-06 14:06:00 | 0.0 | 0 | 0.0 |
538919 | 581422 | 23169 | smashed | -235 | 2019-12-06 15:24:00 | 0.0 | 0 | -0.0 |
2512 rows × 8 columns
df.sort_values(by = 'unit_price', ascending = False).head(30)
invoice_no | stock_code | description | quantity | invoice_date | unit_price | customer_id | total_price | |
---|---|---|---|---|---|---|---|---|
222681 | C556445 | M | Manual | -1 | 2019-06-08 15:31:00 | 38970.00 | 15098 | -38970.00 |
524602 | C580605 | AMAZONFEE | AMAZON FEE | -1 | 2019-12-03 11:36:00 | 17836.46 | 0 | -17836.46 |
43702 | C540117 | AMAZONFEE | AMAZON FEE | -1 | 2019-01-03 09:55:00 | 16888.02 | 0 | -16888.02 |
43703 | C540118 | AMAZONFEE | AMAZON FEE | -1 | 2019-01-03 09:57:00 | 16453.71 | 0 | -16453.71 |
15017 | 537632 | AMAZONFEE | AMAZON FEE | 1 | 2018-12-05 15:08:00 | 13541.33 | 0 | 13541.33 |
15016 | C537630 | AMAZONFEE | AMAZON FEE | -1 | 2018-12-05 15:04:00 | 13541.33 | 0 | -13541.33 |
16356 | C537651 | AMAZONFEE | AMAZON FEE | -1 | 2018-12-05 15:49:00 | 13541.33 | 0 | -13541.33 |
16232 | C537644 | AMAZONFEE | AMAZON FEE | -1 | 2018-12-05 15:34:00 | 13474.79 | 0 | -13474.79 |
524601 | C580604 | AMAZONFEE | AMAZON FEE | -1 | 2019-12-03 11:35:00 | 11586.50 | 0 | -11586.50 |
299982 | A563185 | B | Adjust bad debt | 1 | 2019-08-10 14:50:00 | 11062.06 | 0 | 11062.06 |
446533 | C574902 | AMAZONFEE | AMAZON FEE | -1 | 2019-11-05 15:21:00 | 8286.22 | 0 | -8286.22 |
173277 | C551685 | POST | POSTAGE | -1 | 2019-05-01 12:51:00 | 8142.75 | 16029 | -8142.75 |
173382 | 551697 | POST | POSTAGE | 1 | 2019-05-01 13:46:00 | 8142.75 | 16029 | 8142.75 |
342635 | C566899 | AMAZONFEE | AMAZON FEE | -1 | 2019-09-13 13:53:00 | 7427.97 | 0 | -7427.97 |
191386 | C553355 | AMAZONFEE | AMAZON FEE | -1 | 2019-05-14 13:58:00 | 7006.83 | 0 | -7006.83 |
173391 | C551699 | M | Manual | -1 | 2019-05-01 14:12:00 | 6930.00 | 16029 | -6930.00 |
287150 | C562086 | AMAZONFEE | AMAZON FEE | -1 | 2019-07-31 12:27:00 | 6721.37 | 0 | -6721.37 |
16357 | C537652 | AMAZONFEE | AMAZON FEE | -1 | 2018-12-05 15:51:00 | 6706.71 | 0 | -6706.71 |
312246 | C564341 | AMAZONFEE | AMAZON FEE | -1 | 2019-08-22 14:53:00 | 6662.51 | 0 | -6662.51 |
262414 | C559917 | AMAZONFEE | AMAZON FEE | -1 | 2019-07-11 15:21:00 | 6497.47 | 0 | -6497.47 |
383495 | C570025 | AMAZONFEE | AMAZON FEE | -1 | 2019-10-05 10:29:00 | 5942.57 | 0 | -5942.57 |
429248 | C573549 | AMAZONFEE | AMAZON FEE | -1 | 2019-10-29 13:23:00 | 5942.57 | 0 | -5942.57 |
446434 | C574897 | AMAZONFEE | AMAZON FEE | -1 | 2019-11-05 15:03:00 | 5877.18 | 0 | -5877.18 |
191385 | C553354 | AMAZONFEE | AMAZON FEE | -1 | 2019-05-14 13:54:00 | 5876.40 | 0 | -5876.40 |
239250 | C558036 | AMAZONFEE | AMAZON FEE | -1 | 2019-06-22 12:31:00 | 5791.18 | 0 | -5791.18 |
124741 | C546987 | AMAZONFEE | AMAZON FEE | -1 | 2019-03-16 12:56:00 | 5693.05 | 0 | -5693.05 |
96844 | C544587 | AMAZONFEE | AMAZON FEE | -1 | 2019-02-19 15:07:00 | 5575.28 | 0 | -5575.28 |
342611 | C566889 | AMAZONFEE | AMAZON FEE | -1 | 2019-09-13 13:50:00 | 5522.14 | 0 | -5522.14 |
16313 | C537647 | AMAZONFEE | AMAZON FEE | -1 | 2018-12-05 15:41:00 | 5519.25 | 0 | -5519.25 |
96845 | C544589 | AMAZONFEE | AMAZON FEE | -1 | 2019-02-19 15:11:00 | 5258.77 | 0 | -5258.77 |
Let's take a closer look at this customer
df[df['customer_id'] == 15098]
invoice_no | stock_code | description | quantity | invoice_date | unit_price | customer_id | total_price | |
---|---|---|---|---|---|---|---|---|
222670 | 556442 | 22502 | PICNIC BASKET WICKER SMALL | 60 | 2019-06-08 15:22:00 | 4.95 | 15098 | 297.0 |
222680 | 556444 | 22502 | PICNIC BASKET WICKER 60 PIECES | 60 | 2019-06-08 15:28:00 | 649.50 | 15098 | 38970.0 |
222681 | C556445 | M | Manual | -1 | 2019-06-08 15:31:00 | 38970.00 | 15098 | -38970.0 |
222682 | 556446 | 22502 | PICNIC BASKET WICKER 60 PIECES | 1 | 2019-06-08 15:33:00 | 649.50 | 15098 | 649.5 |
222692 | C556448 | 22502 | PICNIC BASKET WICKER SMALL | -60 | 2019-06-08 15:39:00 | 4.95 | 15098 | -297.0 |
Another case of weird shenanigans. We can see that for this customer, a grand total of $649.50 was eventually netted after all the various purchase/refund transactions. Perhaps someone was new and didn't know how to operate the cash register?
If this was online, then perhaps a customer service representative manually cancelled a mistakenly placed order.
ACTION REQUIRED: What happened here bears investigation. Check with the Product Manager, Customer Support, and the back-end team to see if this process needs improvement.
Because the sales total here was almost $40k, we're going to delete the transactions that cancel each other out in this instance.
rows_to_delete = [222670,
222680,
222681,
222692]
df.drop(rows_to_delete, axis = 'index', inplace = True)
It doesn't make sense to analyze products with no price. Question: were these items given away?
df[df['unit_price'] == 0].sort_values('customer_id')
invoice_no | stock_code | description | quantity | invoice_date | unit_price | customer_id | total_price | |
---|---|---|---|---|---|---|---|---|
622 | 536414 | 22139 | NaN | 56 | 2018-11-29 11:52:00 | 0.0 | 0 | 0.0 |
282875 | 561663 | 22501 | incorrectly put back into stock | -108 | 2019-07-26 16:40:00 | 0.0 | 0 | -0.0 |
282882 | 561665 | 22171 | ? | 142 | 2019-07-26 16:55:00 | 0.0 | 0 | 0.0 |
282883 | 561666 | 22502 | NaN | 30 | 2019-07-26 16:59:00 | 0.0 | 0 | 0.0 |
282904 | 561668 | 44091A | NaN | -2 | 2019-07-26 17:03:00 | 0.0 | 0 | -0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
33576 | 539263 | 22580 | ADVENT CALENDAR GINGHAM SACK | 4 | 2018-12-14 14:36:00 | 0.0 | 16560 | 0.0 |
279324 | 561284 | 22167 | OVAL WALL MIRROR DIAMANTE | 1 | 2019-07-24 12:24:00 | 0.0 | 16818 | 0.0 |
86789 | 543599 | 84535B | FAIRY CAKES NOTEBOOK A6 SIZE | 16 | 2019-02-08 13:08:00 | 0.0 | 17560 | 0.0 |
187613 | 553000 | 47566 | PARTY BUNTING | 4 | 2019-05-10 15:21:00 | 0.0 | 17667 | 0.0 |
420404 | 572893 | 21208 | PASTEL COLOUR HONEYCOMB FAN | 5 | 2019-10-24 14:36:00 | 0.0 | 18059 | 0.0 |
2510 rows × 8 columns
Were these items given for free, perhaps as a promotion? How come several customers received more than one type of item?
We will delete these, but this certainly brings up questions that should be answered.
df = df[df['unit_price'] != 0]
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 534123 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column        Non-Null Count   Dtype
---  ------        --------------   -----
 0   invoice_no    534123 non-null  object
 1   stock_code    534123 non-null  object
 2   description   534123 non-null  object
 3   quantity      534123 non-null  int64
 4   invoice_date  534123 non-null  datetime64[ns]
 5   unit_price    534123 non-null  float64
 6   customer_id   534123 non-null  int64
 7   total_price   534123 non-null  float64
dtypes: datetime64[ns](1), float64(2), int64(2), object(3)
memory usage: 36.7+ MB
aggregations = {
'quantity': ('quantity','sum'),
'num_of_customers': ('customer_id','nunique'),
'sales_total': ('total_price','sum'),
'sales_average': ('total_price', 'mean')
}
grouped_items = df.groupby(['stock_code',
'description',
'unit_price']).agg(**aggregations).reset_index()
grouped_items.head(50)
stock_code | description | unit_price | quantity | num_of_customers | sales_total | sales_average | |
---|---|---|---|---|---|---|---|
0 | 10002 | INFLATABLE POLITICAL GLOBE | 0.85 | 824 | 41 | 700.40 | 14.008000 |
1 | 10002 | INFLATABLE POLITICAL GLOBE | 1.63 | 9 | 1 | 14.67 | 2.095714 |
2 | 10002 | INFLATABLE POLITICAL GLOBE | 1.66 | 27 | 1 | 44.82 | 3.201429 |
3 | 10080 | GROOVY CACTUS INFLATABLE | 0.39 | 301 | 19 | 117.39 | 5.590000 |
4 | 10080 | GROOVY CACTUS INFLATABLE | 0.85 | 2 | 1 | 1.70 | 1.700000 |
5 | 10120 | DOGGY RUBBER | 0.21 | 192 | 25 | 40.32 | 1.390345 |
6 | 10123C | HEARTS WRAPPING TAPE | 0.65 | 5 | 3 | 3.25 | 1.083333 |
7 | 10124A | SPOTS ON RED BOOKCOVER TAPE | 0.42 | 16 | 5 | 6.72 | 1.344000 |
8 | 10124G | ARMY CAMO BOOKCOVER TAPE | 0.42 | 17 | 4 | 7.14 | 1.785000 |
9 | 10125 | MINI FUNKY DESIGN TAPES | 0.42 | 260 | 10 | 109.20 | 10.920000 |
10 | 10125 | MINI FUNKY DESIGN TAPES | 0.79 | 24 | 1 | 18.96 | 1.580000 |
11 | 10125 | MINI FUNKY DESIGN TAPES | 0.81 | 36 | 1 | 29.16 | 2.916000 |
12 | 10125 | MINI FUNKY DESIGN TAPES | 0.85 | 965 | 39 | 820.25 | 15.476415 |
13 | 10125 | MINI FUNKY DESIGN TAPES | 1.63 | 6 | 1 | 9.78 | 1.956000 |
14 | 10125 | MINI FUNKY DESIGN TAPES | 1.66 | 4 | 1 | 6.64 | 2.213333 |
15 | 10133 | COLOURING PENCILS BROWN TUBE | 0.42 | 2035 | 81 | 854.70 | 9.092553 |
16 | 10133 | COLOURING PENCILS BROWN TUBE | 0.79 | 58 | 1 | 45.82 | 6.545714 |
17 | 10133 | COLOURING PENCILS BROWN TUBE | 0.81 | 38 | 1 | 30.78 | 2.798182 |
18 | 10133 | COLOURING PENCILS BROWN TUBE | 0.83 | 379 | 1 | 314.57 | 6.291400 |
19 | 10133 | COLOURING PENCILS BROWN TUBE | 0.85 | 331 | 25 | 281.35 | 9.075806 |
20 | 10133 | COLOURING PENCILS BROWN TUBE | 1.63 | 4 | 1 | 6.52 | 2.173333 |
21 | 10133 | COLOURING PENCILS BROWN TUBE | 1.66 | 1 | 1 | 1.66 | 1.660000 |
22 | 10135 | COLOURING PENCILS BROWN TUBE | 0.25 | 46 | 2 | 11.50 | 5.750000 |
23 | 10135 | COLOURING PENCILS BROWN TUBE | 0.42 | 740 | 21 | 310.80 | 14.127273 |
24 | 10135 | COLOURING PENCILS BROWN TUBE | 1.06 | 300 | 2 | 318.00 | 159.000000 |
25 | 10135 | COLOURING PENCILS BROWN TUBE | 1.25 | 902 | 69 | 1127.50 | 11.505102 |
26 | 10135 | COLOURING PENCILS BROWN TUBE | 1.28 | 132 | 1 | 168.96 | 12.068571 |
27 | 10135 | COLOURING PENCILS BROWN TUBE | 2.46 | 84 | 2 | 206.64 | 7.653333 |
28 | 10135 | COLOURING PENCILS BROWN TUBE | 2.51 | 24 | 1 | 60.24 | 4.633846 |
29 | 11001 | ASSTD DESIGN RACING CAR PEN | 0.83 | 281 | 1 | 233.23 | 10.140435 |
30 | 11001 | ASSTD DESIGN RACING CAR PEN | 1.27 | 284 | 1 | 360.68 | 120.226667 |
31 | 11001 | ASSTD DESIGN RACING CAR PEN | 1.69 | 805 | 44 | 1360.45 | 20.305224 |
32 | 11001 | ASSTD DESIGN RACING CAR PEN | 3.29 | 51 | 2 | 167.79 | 7.990000 |
33 | 11001 | ASSTD DESIGN RACING CAR PEN | 3.36 | 9 | 1 | 30.24 | 5.040000 |
34 | 15030 | FAN BLACK FRAME | 0.29 | 143 | 11 | 41.47 | 3.190000 |
35 | 15034 | PAPER POCKET TRAVELING FAN | 0.07 | 1657 | 4 | 115.99 | 19.331667 |
36 | 15034 | PAPER POCKET TRAVELING FAN | 0.14 | 3377 | 67 | 472.78 | 5.312135 |
37 | 15034 | PAPER POCKET TRAVELING FAN | 0.83 | 162 | 1 | 134.46 | 3.361500 |
38 | 15034 | PAPER POCKET TRAVELING FAN | 0.85 | 10 | 1 | 8.50 | 1.214286 |
39 | 15036 | ASSORTED COLOURS SILK FAN | 0.53 | 600 | 1 | 318.00 | 318.000000 |
40 | 15036 | ASSORTED COLOURS SILK FAN | 0.65 | 4200 | 8 | 2730.00 | 341.250000 |
41 | 15036 | ASSORTED COLOURS SILK FAN | 0.72 | 8688 | 12 | 6255.36 | 390.960000 |
42 | 15036 | ASSORTED COLOURS SILK FAN | 0.75 | 1407 | 52 | 1055.25 | 17.020161 |
43 | 15036 | ASSORTED COLOURS SILK FAN | 0.83 | 6999 | 160 | 5809.17 | 22.516163 |
44 | 15036 | ASSORTED COLOURS SILK FAN | 1.25 | 52 | 1 | 65.00 | 6.500000 |
45 | 15036 | ASSORTED COLOURS SILK FAN | 1.28 | 58 | 1 | 74.24 | 3.712000 |
46 | 15036 | ASSORTED COLOURS SILK FAN | 1.63 | 1078 | 1 | 1757.14 | 11.872568 |
47 | 15039 | SANDALWOOD FAN | 0.53 | 600 | 3 | 318.00 | 106.000000 |
48 | 15039 | SANDALWOOD FAN | 0.85 | 960 | 56 | 816.00 | 11.492958 |
49 | 15039 | SANDALWOOD FAN | 1.63 | 470 | 1 | 766.10 | 10.944286 |
There are some identically described items whose unit_price differs. Perhaps some of these are discounted or bulk prices? However, it also seems that pricing can be set arbitrarily: notice that item 10135 above has price/quantity combinations that follow no rhyme or reason.
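To gauge how widespread this is, a quick sketch (illustrative, using the grouped_items frame above) counts the distinct price points per stock code:
# how many distinct unit prices does each stock code have?
price_points = grouped_items.groupby('stock_code')['unit_price'].nunique()
print(price_points.sort_values(ascending = False).head(10))
print("share of items with more than one price:", round((price_points > 1).mean(), 2))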
grouped_items.describe()
unit_price | quantity | num_of_customers | sales_total | sales_average | |
---|---|---|---|---|---|
count | 16281.000000 | 16281.000000 | 16281.000000 | 16281.000000 | 16281.000000 |
mean | 39.108888 | 325.336466 | 17.669246 | 597.383880 | 39.559728 |
std | 414.488956 | 1009.532305 | 49.223973 | 1911.086232 | 428.464352 |
min | -11062.060000 | -9360.000000 | 1.000000 | -22124.120000 | -17836.460000 |
25% | 1.450000 | 6.000000 | 1.000000 | 25.500000 | 4.960000 |
50% | 2.950000 | 34.000000 | 1.000000 | 115.460000 | 10.170000 |
75% | 5.950000 | 192.000000 | 6.000000 | 435.000000 | 25.270909 |
max | 17836.460000 | 27636.000000 | 814.000000 | 85738.500000 | 11062.060000 |
grouped_items_by_description = df.groupby(['stock_code','description']).agg(**aggregations).reset_index()
grouped_items_by_description.describe()
quantity | num_of_customers | sales_total | sales_average | |
---|---|---|---|---|
count | 4176.000000 | 4176.000000 | 4176.000000 | 4176.000000 |
mean | 1268.391523 | 65.313458 | 2329.024654 | 13.507847 |
std | 2885.717156 | 88.364388 | 7666.730254 | 129.155911 |
min | -1194.000000 | 1.000000 | -221520.500000 | -6515.308824 |
25% | 45.000000 | 7.000000 | 106.192500 | 6.069004 |
50% | 310.000000 | 31.000000 | 601.125000 | 10.732028 |
75% | 1271.000000 | 88.000000 | 2055.182500 | 17.334659 |
max | 53751.000000 | 888.000000 | 206245.480000 | 3022.500000 |
grouped_items_by_description.head(20)
stock_code | description | quantity | num_of_customers | sales_total | sales_average | |
---|---|---|---|---|---|---|
0 | 10002 | INFLATABLE POLITICAL GLOBE | 860 | 41 | 759.89 | 10.702676 |
1 | 10080 | GROOVY CACTUS INFLATABLE | 303 | 20 | 119.09 | 5.413182 |
2 | 10120 | DOGGY RUBBER | 192 | 25 | 40.32 | 1.390345 |
3 | 10123C | HEARTS WRAPPING TAPE | 5 | 3 | 3.25 | 1.083333 |
4 | 10124A | SPOTS ON RED BOOKCOVER TAPE | 16 | 5 | 6.72 | 1.344000 |
5 | 10124G | ARMY CAMO BOOKCOVER TAPE | 17 | 4 | 7.14 | 1.785000 |
6 | 10125 | MINI FUNKY DESIGN TAPES | 1295 | 50 | 993.99 | 10.688065 |
7 | 10133 | COLOURING PENCILS BROWN TUBE | 2846 | 102 | 1535.40 | 7.793909 |
8 | 10135 | COLOURING PENCILS BROWN TUBE | 2228 | 93 | 2203.64 | 12.380000 |
9 | 11001 | ASSTD DESIGN RACING CAR PEN | 1430 | 46 | 2152.39 | 17.936583 |
10 | 15030 | FAN BLACK FRAME | 143 | 11 | 41.47 | 3.190000 |
11 | 15034 | PAPER POCKET TRAVELING FAN | 5206 | 71 | 731.73 | 5.153028 |
12 | 15036 | ASSORTED COLOURS SILK FAN | 23082 | 195 | 18064.16 | 34.539503 |
13 | 15039 | SANDALWOOD FAN | 2064 | 59 | 1956.54 | 13.219865 |
14 | 15044A | PINK PAPER PARASOL | 457 | 54 | 1436.07 | 13.942427 |
15 | 15044B | BLUE PAPER PARASOL | 320 | 38 | 943.88 | 15.473443 |
16 | 15044C | PURPLE PAPER PARASOL | 310 | 44 | 1014.94 | 11.277111 |
17 | 15044D | RED PAPER PARASOL | 642 | 55 | 1814.26 | 21.096047 |
18 | 15056BL | EDWARDIAN PARASOL BLACK | 2714 | 143 | 15176.34 | 46.553190 |
19 | 15056N | EDWARDIAN PARASOL NATURAL | 3893 | 185 | 21815.24 | 46.914495 |
If we omit unit pricing, we can see that the company carries a total of 4176 distinct items.
aggregations = {
'quantity': ('quantity','sum'),
'num_of_customers': ('customer_id','nunique'),
'sales_total': ('total_price','sum'),
'sales_average': ('total_price', 'mean')
}
aggregations
{'quantity': ('quantity', 'sum'), 'num_of_customers': ('customer_id', 'nunique'), 'sales_total': ('total_price', 'sum'), 'sales_average': ('total_price', 'mean')}
grouped_items_by_total_sales = df.groupby(['stock_code', 'description']).agg(
**aggregations).sort_values(
by = 'sales_total',
ascending = False).reset_index()
grouped_items_by_quantity_sold = df.groupby(['stock_code', 'description']).agg(
**aggregations).sort_values(
by = 'quantity',
ascending = False).reset_index()
grouped_items_by_unit_price = df.groupby(['stock_code', 'description', 'unit_price']).agg(
**aggregations).sort_values(
by = 'unit_price',
ascending = False).reset_index()
grouped_items_by_total_sales.head(20)
stock_code | description | quantity | num_of_customers | sales_total | sales_average | |
---|---|---|---|---|---|---|
0 | DOT | DOTCOM POSTAGE | 705 | 2 | 206245.48 | 291.719208 |
1 | 22423 | REGENCY CAKESTAND 3 TIER | 12996 | 888 | 164459.49 | 75.198669 |
2 | 47566 | PARTY BUNTING | 18006 | 709 | 98243.88 | 57.151763 |
3 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 35002 | 859 | 97659.94 | 42.720884 |
4 | 85099B | JUMBO BAG RED RETROSPOT | 47256 | 637 | 92175.79 | 42.812722 |
5 | 23084 | RABBIT NIGHT LIGHT | 30631 | 451 | 66661.63 | 64.594603 |
6 | POST | POSTAGE | 3003 | 380 | 66230.64 | 52.899872 |
7 | 22086 | PAPER CHAIN KIT 50'S CHRISTMAS | 18876 | 616 | 63715.24 | 53.362848 |
8 | 84879 | ASSORTED COLOUR BIRD ORNAMENT | 36282 | 680 | 58792.42 | 39.511035 |
9 | 79321 | CHILLI LIGHTS | 10222 | 206 | 53746.66 | 79.861308 |
10 | 23298 | SPOTTY BUNTING | 8210 | 572 | 42030.67 | 36.046887 |
11 | 22386 | JUMBO BAG PINK POLKADOT | 20992 | 373 | 41584.43 | 33.401149 |
12 | 21137 | BLACK RECORD COVER FRAME | 11636 | 141 | 40578.21 | 106.504488 |
13 | 22720 | SET OF 3 CAKE TINS PANTRY DESIGN | 7329 | 641 | 37378.79 | 25.514532 |
14 | 23284 | DOORMAT KEEP CALM AND COME IN | 5262 | 379 | 36532.39 | 49.434899 |
15 | 22960 | JAM MAKING SET WITH JARS | 8448 | 575 | 36069.34 | 29.565033 |
16 | 82484 | WOOD BLACK BOARD ANT WHITE FINISH | 5988 | 286 | 35795.97 | 51.430991 |
17 | 20725 | LUNCH BAG RED RETROSPOT | 18658 | 533 | 34717.66 | 21.364714 |
18 | 22197 | POPCORN HOLDER | 36322 | 297 | 33959.26 | 39.441649 |
19 | 22114 | HOT WATER BOTTLE TEA AND SYMPATHY | 5658 | 326 | 32663.34 | 50.174101 |
grouped_items_by_quantity_sold.head(20)
stock_code | description | quantity | num_of_customers | sales_total | sales_average | |
---|---|---|---|---|---|---|
0 | 84077 | WORLD WAR 2 GLIDERS ASSTD DESIGNS | 53751 | 308 | 13560.09 | 25.064861 |
1 | 85099B | JUMBO BAG RED RETROSPOT | 47256 | 637 | 92175.79 | 42.812722 |
2 | 22197 | POPCORN HOLDER | 36322 | 297 | 33959.26 | 39.441649 |
3 | 84879 | ASSORTED COLOUR BIRD ORNAMENT | 36282 | 680 | 58792.42 | 39.511035 |
4 | 21212 | PACK OF 72 RETROSPOT CAKE CASES | 36016 | 637 | 21047.07 | 15.396540 |
5 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 35002 | 859 | 97659.94 | 42.720884 |
6 | 23084 | RABBIT NIGHT LIGHT | 30631 | 451 | 66661.63 | 64.594603 |
7 | 22492 | MINI PAINT SET VINTAGE | 26437 | 214 | 16810.42 | 43.103641 |
8 | 22616 | PACK OF 12 LONDON TISSUES | 26095 | 196 | 7967.82 | 15.176800 |
9 | 21977 | PACK OF 60 PINK PAISLEY CAKE CASES | 24719 | 412 | 12170.77 | 13.736761 |
10 | 22178 | VICTORIAN GLASS HANGING T-LIGHT | 23825 | 404 | 32508.42 | 30.495704 |
11 | 15036 | ASSORTED COLOURS SILK FAN | 23082 | 195 | 18064.16 | 34.539503 |
12 | 17003 | BROCADE RING PURSE | 23017 | 137 | 5853.15 | 23.988320 |
13 | 21915 | RED HARMONICA IN BOX | 21836 | 351 | 26210.33 | 38.945513 |
14 | 22386 | JUMBO BAG PINK POLKADOT | 20992 | 373 | 41584.43 | 33.401149 |
15 | 22197 | SMALL POPCORN HOLDER | 20105 | 229 | 17008.66 | 28.020857 |
16 | 22086 | PAPER CHAIN KIT 50'S CHRISTMAS | 18876 | 616 | 63715.24 | 53.362848 |
17 | 20725 | LUNCH BAG RED RETROSPOT | 18658 | 533 | 34717.66 | 21.364714 |
18 | 84991 | 60 TEATIME FAIRY CAKE CASES | 18015 | 416 | 9074.27 | 10.650552 |
19 | 47566 | PARTY BUNTING | 18006 | 709 | 98243.88 | 57.151763 |
grouped_items_by_unit_price.head(30)
stock_code | description | unit_price | quantity | num_of_customers | sales_total | sales_average | |
---|---|---|---|---|---|---|---|
0 | AMAZONFEE | AMAZON FEE | 17836.46 | -1 | 1 | -17836.46 | -17836.460000 |
1 | AMAZONFEE | AMAZON FEE | 16888.02 | -1 | 1 | -16888.02 | -16888.020000 |
2 | AMAZONFEE | AMAZON FEE | 16453.71 | -1 | 1 | -16453.71 | -16453.710000 |
3 | AMAZONFEE | AMAZON FEE | 13541.33 | -1 | 1 | -13541.33 | -4513.776667 |
4 | AMAZONFEE | AMAZON FEE | 13474.79 | -1 | 1 | -13474.79 | -13474.790000 |
5 | AMAZONFEE | AMAZON FEE | 11586.50 | -1 | 1 | -11586.50 | -11586.500000 |
6 | B | Adjust bad debt | 11062.06 | 1 | 1 | 11062.06 | 11062.060000 |
7 | AMAZONFEE | AMAZON FEE | 8286.22 | -1 | 1 | -8286.22 | -8286.220000 |
8 | POST | POSTAGE | 8142.75 | 0 | 1 | 0.00 | 0.000000 |
9 | AMAZONFEE | AMAZON FEE | 7427.97 | -1 | 1 | -7427.97 | -7427.970000 |
10 | AMAZONFEE | AMAZON FEE | 7006.83 | -1 | 1 | -7006.83 | -7006.830000 |
11 | M | Manual | 6930.00 | -1 | 1 | -6930.00 | -6930.000000 |
12 | AMAZONFEE | AMAZON FEE | 6721.37 | -1 | 1 | -6721.37 | -6721.370000 |
13 | AMAZONFEE | AMAZON FEE | 6706.71 | -1 | 1 | -6706.71 | -6706.710000 |
14 | AMAZONFEE | AMAZON FEE | 6662.51 | -1 | 1 | -6662.51 | -6662.510000 |
15 | AMAZONFEE | AMAZON FEE | 6497.47 | -1 | 1 | -6497.47 | -6497.470000 |
16 | AMAZONFEE | AMAZON FEE | 5942.57 | -2 | 1 | -11885.14 | -5942.570000 |
17 | AMAZONFEE | AMAZON FEE | 5877.18 | -1 | 1 | -5877.18 | -5877.180000 |
18 | AMAZONFEE | AMAZON FEE | 5876.40 | -1 | 1 | -5876.40 | -5876.400000 |
19 | AMAZONFEE | AMAZON FEE | 5791.18 | -1 | 1 | -5791.18 | -5791.180000 |
20 | AMAZONFEE | AMAZON FEE | 5693.05 | -1 | 1 | -5693.05 | -5693.050000 |
21 | AMAZONFEE | AMAZON FEE | 5575.28 | -1 | 1 | -5575.28 | -5575.280000 |
22 | AMAZONFEE | AMAZON FEE | 5522.14 | -1 | 1 | -5522.14 | -5522.140000 |
23 | AMAZONFEE | AMAZON FEE | 5519.25 | -1 | 1 | -5519.25 | -5519.250000 |
24 | AMAZONFEE | AMAZON FEE | 5258.77 | -1 | 1 | -5258.77 | -5258.770000 |
25 | AMAZONFEE | AMAZON FEE | 5225.03 | -1 | 1 | -5225.03 | -5225.030000 |
26 | AMAZONFEE | AMAZON FEE | 4575.64 | -1 | 1 | -4575.64 | -4575.640000 |
27 | AMAZONFEE | AMAZON FEE | 4534.24 | -1 | 1 | -4534.24 | -4534.240000 |
28 | AMAZONFEE | AMAZON FEE | 4527.65 | -1 | 1 | -4527.65 | -4527.650000 |
29 | DOT | DOTCOM POSTAGE | 4505.17 | 1 | 1 | 4505.17 | 4505.170000 |
grouped_items_by_unit_price[grouped_items_by_unit_price['sales_total'] < 0].sample(50)
stock_code | description | unit_price | quantity | num_of_customers | sales_total | sales_average | |
---|---|---|---|---|---|---|---|
30 | AMAZONFEE | AMAZON FEE | 4383.62 | -1 | 1 | -4383.62 | -4383.620000 |
16190 | M | Manual | 0.19 | -550 | 9 | -104.50 | -10.450000 |
11 | M | Manual | 6930.00 | -1 | 1 | -6930.00 | -6930.000000 |
2522 | M | Manual | 9.36 | -1 | 1 | -9.36 | -9.360000 |
266 | M | Manual | 367.38 | -1 | 1 | -367.38 | -367.380000 |
54 | M | Manual | 1715.85 | -1 | 1 | -1715.85 | -1715.850000 |
15 | AMAZONFEE | AMAZON FEE | 6497.47 | -1 | 1 | -6497.47 | -6497.470000 |
1863 | M | Manual | 12.75 | -1 | 3 | -12.75 | -2.550000 |
337 | M | Manual | 269.70 | -1 | 1 | -269.70 | -269.700000 |
904 | POST | POSTAGE | 75.00 | -1 | 1 | -75.00 | -75.000000 |
2361 | D | Discount | 10.06 | -1 | 1 | -10.06 | -10.060000 |
4000 | S | SAMPLES | 6.15 | -1 | 1 | -6.15 | -6.150000 |
1078 | POST | POSTAGE | 30.53 | -1 | 1 | -30.53 | -30.530000 |
3201 | 85068 | CREAM SWEETHEART SHELF + HOOKS | 7.95 | -1 | 1 | -7.95 | -7.950000 |
12 | AMAZONFEE | AMAZON FEE | 6721.37 | -1 | 1 | -6721.37 | -6721.370000 |
5379 | 79320 | FLAMINGO LIGHTS | 4.95 | -1 | 1 | -4.95 | -4.950000 |
908 | M | Manual | 71.46 | -1 | 1 | -71.46 | -71.460000 |
2181 | M | Manual | 10.95 | -4 | 2 | -43.80 | -14.600000 |
115 | M | Manual | 869.55 | -1 | 1 | -869.55 | -869.550000 |
279 | BANK CHARGES | Bank Charges | 326.68 | -1 | 1 | -326.68 | -326.680000 |
1673 | M | Manual | 14.95 | -1 | 3 | -14.95 | -4.983333 |
202 | M | Manual | 544.40 | -1 | 1 | -544.40 | -544.400000 |
4654 | M | Manual | 5.70 | -1 | 1 | -5.70 | -5.700000 |
26 | AMAZONFEE | AMAZON FEE | 4575.64 | -1 | 1 | -4575.64 | -4575.640000 |
2368 | S | SAMPLES | 10.00 | -1 | 1 | -10.00 | -10.000000 |
964 | S | SAMPLES | 50.99 | -1 | 1 | -50.99 | -50.990000 |
1382 | M | Manual | 17.55 | -1 | 1 | -17.55 | -17.550000 |
4708 | S | SAMPLES | 5.44 | -1 | 1 | -5.44 | -5.440000 |
845 | BANK CHARGES | Bank Charges | 95.38 | -1 | 1 | -95.38 | -95.380000 |
1297 | POST | POSTAGE | 20.47 | -1 | 1 | -20.47 | -20.470000 |
1578 | CRUK | CRUK Commission | 15.96 | -1 | 1 | -15.96 | -15.960000 |
876 | S | SAMPLES | 83.06 | -1 | 1 | -83.06 | -83.060000 |
401 | M | Manual | 233.25 | -1 | 1 | -233.25 | -233.250000 |
686 | BANK CHARGES | Bank Charges | 149.16 | -1 | 1 | -149.16 | -149.160000 |
1 | AMAZONFEE | AMAZON FEE | 16888.02 | -1 | 1 | -16888.02 | -16888.020000 |
5380 | POST | POSTAGE | 4.90 | -1 | 1 | -4.90 | -4.900000 |
2170 | POST | POSTAGE | 11.00 | -1 | 1 | -11.00 | -11.000000 |
3857 | S | SAMPLES | 6.70 | -1 | 1 | -6.70 | -6.700000 |
16130 | D | Discount | 0.20 | -48 | 1 | -9.60 | -9.600000 |
961 | CRUK | CRUK Commission | 52.24 | -1 | 1 | -52.24 | -52.240000 |
877 | BANK CHARGES | Bank Charges | 82.89 | -1 | 1 | -82.89 | -82.890000 |
998 | D | Discount | 42.50 | -1 | 1 | -42.50 | -42.500000 |
869 | 22827 | RUSTIC SEVENTEEN DRAWER SIDEBOARD | 85.00 | -1 | 1 | -85.00 | -85.000000 |
949 | BANK CHARGES | Bank Charges | 56.93 | -1 | 1 | -56.93 | -56.930000 |
16277 | D | Discount | 0.01 | -720 | 1 | -7.20 | -7.200000 |
6527 | POST | POSTAGE | 3.95 | -1 | 2 | -3.95 | -1.316667 |
1639 | D | Discount | 15.07 | -1 | 1 | -15.07 | -15.070000 |
225 | BANK CHARGES | Bank Charges | 475.69 | -1 | 1 | -475.69 | -475.690000 |
753 | S | SAMPLES | 128.56 | -1 | 1 | -128.56 | -128.560000 |
27 | AMAZONFEE | AMAZON FEE | 4534.24 | -1 | 1 | -4534.24 | -4534.240000 |
It looks like we are being charged an Amazon fee (maybe we're selling on Amazon?), and we in turn charge customers shipping and postage fees.
There are a lot of "Manual" entries that do not indicate what item was returned, and quite a few discounts as well.
And what is a "CRUK" commission? CRUK likely stands for Cancer Research UK, so this may be a payout to a third party, perhaps a charity commission.
There's also the error we spotted earlier with item 22502 -- a Wicker Picnic Basket, with one row showing a unit price of \$649.50. However, the description says there are 60 pieces.
I think it might be safe to assume that this is not a designer picnic basket that costs \$649.50. More likely, someone entered the total sales price as the unit price and used a quantity of 1. The correct entry should be a quantity of 60 at a unit price of \$10.825 (649.50 / 60), which we'll round to \$10.83.
That way, we can more accurately gauge sales in terms of net profits.
df_amazon_fees = df[df['stock_code'] == 'AMAZONFEE']
df_shipping_charges = df[df['stock_code'] == 'DOT']
df_postage_charges = df[df['stock_code'] == 'POST']
df_samples = df[df['stock_code'] == 'S']
df_bad_debt = df[df['stock_code'] == 'B']
df_bank_charges = df[df['stock_code'] == 'BANK CHARGES']
df = df[~df['stock_code'].str.contains('AMAZONFEE')]
df = df[~df['stock_code'].str.contains('DOT')]
df = df[~df['stock_code'].str.contains('POST')]
df = df[~df['stock_code'].str.contains('CRUK')]
df = df[~df['stock_code'].str.contains('BANK CHARGES')]
df = df[df['stock_code'] != 'M']
df = df[df['stock_code'] != 'D']
df = df[df['stock_code'] != 'B']
df = df[df['stock_code'] != 'S']
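The chained filters above get the job done; an equivalent one-pass sketch using exact matches (isin compares whole codes, which is slightly stricter than str.contains substring matching) would be:
# alternative: drop all service/non-product codes at once
service_codes = ['AMAZONFEE', 'DOT', 'POST', 'CRUK', 'BANK CHARGES', 'M', 'D', 'B', 'S']
df = df[~df['stock_code'].isin(service_codes)]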
df[df['stock_code'] == '22502'].sort_values(by = 'unit_price', ascending = False)
invoice_no | stock_code | description | quantity | invoice_date | unit_price | customer_id | total_price | |
---|---|---|---|---|---|---|---|---|
222682 | 556446 | 22502 | PICNIC BASKET WICKER 60 PIECES | 1 | 2019-06-08 15:33:00 | 649.50 | 15098 | 649.50 |
170645 | 551339 | 22502 | PICNIC BASKET WICKER SMALL | 1 | 2019-04-25 17:18:00 | 10.79 | 0 | 10.79 |
224313 | 556515 | 22502 | PICNIC BASKET WICKER SMALL | 1 | 2019-06-11 10:45:00 | 10.79 | 0 | 10.79 |
234244 | 557502 | 22502 | PICNIC BASKET WICKER SMALL | 2 | 2019-06-18 15:32:00 | 10.79 | 0 | 21.58 |
232567 | 557324 | 22502 | PICNIC BASKET WICKER SMALL | 2 | 2019-06-18 09:41:00 | 10.79 | 0 | 21.58 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
182710 | 552563 | 22502 | PICNIC BASKET WICKER SMALL | 16 | 2019-05-08 11:54:00 | 4.95 | 17597 | 79.20 |
182457 | 552547 | 22502 | PICNIC BASKET WICKER SMALL | 16 | 2019-05-08 10:34:00 | 4.95 | 13098 | 79.20 |
152290 | 549585 | 22502 | PICNIC BASKET WICKER SMALL | 32 | 2019-04-09 09:54:00 | 4.95 | 14616 | 158.40 |
181862 | 552514 | 22502 | PICNIC BASKET WICKER SMALL | 19 | 2019-05-07 16:30:00 | 4.95 | 0 | 94.05 |
359476 | 568188 | 22502 | PICNIC BASKET WICKER SMALL | 1 | 2019-09-23 14:33:00 | 2.00 | 16049 | 2.00 |
472 rows × 8 columns
df.at[222682, 'description'] = 'picnic basket wicker small'
df.at[222682, 'quantity'] = 60
df.at[222682, 'unit_price'] = 10.83
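Since quantity and unit_price changed, it's worth recomputing total_price for the edited row so the three columns stay consistent (60 * 10.83 = 649.80, thirty cents off the original 649.50 because of the rounding):
# keep total_price consistent with the corrected quantity and unit_price
df.at[222682, 'total_price'] = df.at[222682, 'quantity'] * df.at[222682, 'unit_price']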
aggregations
{'quantity': ('quantity', 'sum'), 'num_of_customers': ('customer_id', 'nunique'), 'sales_total': ('total_price', 'sum'), 'sales_average': ('total_price', 'mean')}
grouped_items = df.groupby(['stock_code',
'description',
'unit_price']).agg(**aggregations).reset_index()
grouped_items_by_total_sales = df.groupby(['stock_code', 'description']).agg(
**aggregations).sort_values(
by = 'sales_total',
ascending = False).reset_index()
grouped_items_by_quantity_sold = df.groupby(['stock_code', 'description']).agg(
**aggregations).sort_values(
by = 'quantity',
ascending = False).reset_index()
grouped_items_by_unit_price = df.groupby(['stock_code', 'description', 'unit_price']).agg(
**aggregations).sort_values(
by = 'unit_price',
ascending = False).reset_index()
grouped_items_by_total_sales.head(10)
stock_code | description | quantity | num_of_customers | sales_total | sales_average | |
---|---|---|---|---|---|---|
0 | 22423 | REGENCY CAKESTAND 3 TIER | 12996 | 888 | 164459.49 | 75.198669 |
1 | 47566 | PARTY BUNTING | 18006 | 709 | 98243.88 | 57.151763 |
2 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 35002 | 859 | 97659.94 | 42.720884 |
3 | 85099B | JUMBO BAG RED RETROSPOT | 47256 | 637 | 92175.79 | 42.812722 |
4 | 23084 | RABBIT NIGHT LIGHT | 30631 | 451 | 66661.63 | 64.594603 |
5 | 22086 | PAPER CHAIN KIT 50'S CHRISTMAS | 18876 | 616 | 63715.24 | 53.362848 |
6 | 84879 | ASSORTED COLOUR BIRD ORNAMENT | 36282 | 680 | 58792.42 | 39.511035 |
7 | 79321 | CHILLI LIGHTS | 10222 | 206 | 53746.66 | 79.861308 |
8 | 23298 | SPOTTY BUNTING | 8210 | 572 | 42030.67 | 36.046887 |
9 | 22386 | JUMBO BAG PINK POLKADOT | 20992 | 373 | 41584.43 | 33.401149 |
grouped_items_by_quantity_sold.head(10)
stock_code | description | quantity | num_of_customers | sales_total | sales_average | |
---|---|---|---|---|---|---|
0 | 84077 | WORLD WAR 2 GLIDERS ASSTD DESIGNS | 53751 | 308 | 13560.09 | 25.064861 |
1 | 85099B | JUMBO BAG RED RETROSPOT | 47256 | 637 | 92175.79 | 42.812722 |
2 | 22197 | POPCORN HOLDER | 36322 | 297 | 33959.26 | 39.441649 |
3 | 84879 | ASSORTED COLOUR BIRD ORNAMENT | 36282 | 680 | 58792.42 | 39.511035 |
4 | 21212 | PACK OF 72 RETROSPOT CAKE CASES | 36016 | 637 | 21047.07 | 15.396540 |
5 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 35002 | 859 | 97659.94 | 42.720884 |
6 | 23084 | RABBIT NIGHT LIGHT | 30631 | 451 | 66661.63 | 64.594603 |
7 | 22492 | MINI PAINT SET VINTAGE | 26437 | 214 | 16810.42 | 43.103641 |
8 | 22616 | PACK OF 12 LONDON TISSUES | 26095 | 196 | 7967.82 | 15.176800 |
9 | 21977 | PACK OF 60 PINK PAISLEY CAKE CASES | 24719 | 412 | 12170.77 | 13.736761 |
grouped_items_by_unit_price.head(10)
stock_code | description | unit_price | quantity | num_of_customers | sales_total | sales_average | |
---|---|---|---|---|---|---|---|
0 | 22656 | VINTAGE BLUE KITCHEN CABINET | 295.0 | 3 | 2 | 885.0 | 295.000000 |
1 | 22655 | VINTAGE RED KITCHEN CABINET | 295.0 | 3 | 5 | 885.0 | 126.428571 |
2 | 22655 | VINTAGE RED KITCHEN CABINET | 265.5 | -1 | 1 | -265.5 | -265.500000 |
3 | 22826 | LOVE SEAT ANTIQUE WHITE METAL | 195.0 | 8 | 9 | 1560.0 | 130.000000 |
4 | 22826 | LOVE SEAT ANTIQUE WHITE METAL | 175.0 | 2 | 1 | 350.0 | 350.000000 |
5 | 22656 | VINTAGE BLUE KITCHEN CABINET | 175.0 | -1 | 1 | -175.0 | -175.000000 |
6 | 22827 | RUSTIC SEVENTEEN DRAWER SIDEBOARD | 165.0 | 15 | 17 | 2475.0 | 130.263158 |
7 | 22828 | REGENCY MIRROR WITH SHUTTERS | 165.0 | 4 | 4 | 660.0 | 165.000000 |
8 | C2 | CARRIAGE | 150.0 | 1 | 1 | 150.0 | 150.000000 |
9 | 22827 | RUSTIC SEVENTEEN DRAWER SIDEBOARD | 145.0 | 18 | 6 | 2610.0 | 290.000000 |
sns.set(style = 'whitegrid')
f, ax = plt.subplots(figsize = (12, 10))
sns.barplot(
x = 'sales_total',
y = 'description',
data = grouped_items_by_total_sales.head(25),
color = 'b',
alpha = 0.6).set_title("Top 25 Products sold sorted by Total Sales", fontsize = 16)
ax.set_xlabel('Sales')
ax.set_ylabel('Item');
sns.set(style = 'whitegrid')
f, ax = plt.subplots(figsize = (12, 9))
sns.barplot(
x = 'quantity',
y = 'description',
data = grouped_items_by_quantity_sold.head(25),
color = 'b',
alpha = 0.6).set_title("Top 25 Products by Quantity Sold", fontsize = 16)
ax.set_xlabel('Quantity Sold')
ax.set_ylabel('Item');
sns.set(style = 'whitegrid')
f, ax = plt.subplots(figsize = (12, 9))
sns.barplot(
    x = 'unit_price',
    y = 'description',
    data = grouped_items_by_unit_price.head(25),
    color = 'b',
    alpha = 0.6).set_title("Top 25 Products sold by Unit Price", fontsize = 16)
ax.set_xlabel('Unit Price')
ax.set_ylabel('Item');
Like the "elbow method" used to choose a number of clusters, we can test a few cutoff values to see which group of products brought in more of the total revenue.
product_sales_total = 20000
sales_above_20k = grouped_items[grouped_items['sales_total'] >= product_sales_total]['sales_total'].sum()
sales_below_20k = grouped_items[grouped_items['sales_total'] < product_sales_total]['sales_total'].sum()
print("Products bringing in sales of $"
      + str(product_sales_total)
      + " or above totaled: $"
      + str(round(sales_above_20k, 2))
      + " or {:.2f}% of total sales.".format(
          sales_above_20k / grouped_items['sales_total'].sum() * 100))
print("Products that brought in under $"
      + str(product_sales_total)
      + " totaled: $"
      + str(round(sales_below_20k, 2))
      + " or {:.2f}% of total sales.".format(
          sales_below_20k / grouped_items['sales_total'].sum() * 100))
Products bringing in sales of $20000 or above totaled: $537442.06 or 5.52% of total sales. Products that brought in under $20000 totaled: $9202580.46 or 94.48% of total sales.
# we can do a more detailed breakdown. First, lists of candidate cutoff values (possible "elbow" locations)
# for sales totals, quantities sold, and unit prices
product_sales_list = [20000, 10000, 7500, 5000, 4500, 4000, 3000, 2000, 1000]
product_quantities_list = [20000, 10000, 7500, 5000, 4000, 3000, 1500]
unit_price_list = [500, 250, 100, 50, 25, 10, 5, 3, 2, 1, 0.75, 0.5, 0.25]
def revenues_percentage_above_below_x_feature(grouped_df, feature_name, list_of_units):
    '''This function takes an items-grouped dataframe, the name of a feature, and a list of
    cutoff values for that feature, and compares total sales between products at or above
    each cutoff versus products below it.
    PURPOSE: give the analyst a better idea of whether more revenue came from a few
             high-priced (or high-volume) items, or from a lot of smaller ones
    RETURNS: a DataFrame with sales and percentages broken down by list_of_units
    '''
    columns = [feature_name,
               'sales_above',
               'sales_below',
               'percent_above',
               'percent_below']
    rows = []
    total_sales = grouped_df['sales_total'].sum()
    for units in list_of_units:
        sales_at_above_feature_units = grouped_df[grouped_df[feature_name] >= units]['sales_total'].sum()
        sales_below_feature_units = grouped_df[grouped_df[feature_name] < units]['sales_total'].sum()
        rows.append({feature_name: units,
                     'sales_above': round(sales_at_above_feature_units, 2),
                     'sales_below': round(sales_below_feature_units, 2),
                     'percent_above': round(sales_at_above_feature_units / total_sales * 100, 2),
                     'percent_below': round(sales_below_feature_units / total_sales * 100, 2)})
    # build the result in one pass (DataFrame.append was removed in pandas 2.0)
    return pd.DataFrame(rows, columns = columns)
df_revenue_percentages_above_below_sales_totals = revenues_percentage_above_below_x_feature(grouped_items, "sales_total", product_sales_list)
df_revenue_percentages_above_below_sales_totals
sales_total | sales_above | sales_below | percent_above | percent_below | |
---|---|---|---|---|---|
0 | 20000.0 | 537442.06 | 9202580.46 | 5.52 | 94.48 |
1 | 10000.0 | 1467462.59 | 8272559.93 | 15.07 | 84.93 |
2 | 7500.0 | 2068389.18 | 7671633.34 | 21.24 | 78.76 |
3 | 5000.0 | 3395942.69 | 6344079.83 | 34.87 | 65.13 |
4 | 4500.0 | 3712038.28 | 6027984.24 | 38.11 | 61.89 |
5 | 4000.0 | 4076211.92 | 5663810.60 | 41.85 | 58.15 |
6 | 3000.0 | 4972323.63 | 4767698.89 | 51.05 | 48.95 |
7 | 2000.0 | 6035255.75 | 3704766.77 | 61.96 | 38.04 |
8 | 1000.0 | 7520351.09 | 2219671.43 | 77.21 | 22.79 |
df_revenue_percentages_above_below_sales_totals.iloc[0, 2]
9202580.46
def show_pie(percentage_df, unit_name, title_name):
    '''Draws a donut-style pie chart from one row of a percentages dataframe
    produced by revenues_percentage_above_below_x_feature().
    '''
    # positional access via .iloc, since the row's index holds column labels
    labels = ['Equal to or above ' + str(percentage_df.iloc[0]) + ' ' + unit_name,
              '< ' + str(percentage_df.iloc[0]) + ' ' + unit_name]
    sizes = [percentage_df.iloc[1], percentage_df.iloc[2]]
    colors = ['#ff9999', '#66b3ff', '#99ff99', '#ffcc99']
    plt.figure(figsize = (8, 8))
    plt.rcParams['font.size'] = 18
    plt.pie(sizes,
            colors = colors,
            labels = labels,
            autopct = '%1.1f%%',
            startangle = 90,
            pctdistance = 0.85)
    # draw a white circle over the center to create the donut hole
    centre_circle = plt.Circle((0, 0), 0.6, fc = 'white')
    fig = plt.gcf()
    fig.gca().add_artist(centre_circle)
    plt.title(title_name, fontsize = 18)
    plt.tight_layout()
    plt.show()
show_pie(df_revenue_percentages_above_below_sales_totals.iloc[0], 'dollars','Revenue Percentage of Products Sales above or below $20k each')
We can see that the majority of revenue comes from products that each brought in less than \$20,000 in total sales; the handful of extreme top sellers account for just 5.5% of total sales.
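To put a number on that concentration, a quick cumulative-share calculation over the already-sorted totals works as a sketch (a hypothetical extension, not part of the original run):

# grouped_items_by_total_sales is sorted by sales_total, descending;
# count how many products it takes to accumulate 80% of total revenue
cum_share = (grouped_items_by_total_sales['sales_total'].cumsum()
             / grouped_items_by_total_sales['sales_total'].sum())
print("Products generating 80% of revenue:", (cum_share <= 0.80).sum() + 1)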
df_revenue_percentages_above_below_quantities_sold = revenues_percentage_above_below_x_feature(grouped_items, "quantity", product_quantities_list)
df_revenue_percentages_above_below_quantities_sold
quantity | sales_above | sales_below | percent_above | percent_below | |
---|---|---|---|---|---|
0 | 20000.0 | 66941.98 | 9673080.54 | 0.69 | 99.31 |
1 | 10000.0 | 392919.72 | 9347102.80 | 4.03 | 95.97 |
2 | 7500.0 | 704203.02 | 9035819.50 | 7.23 | 92.77 |
3 | 5000.0 | 1372403.85 | 8367618.67 | 14.09 | 85.91 |
4 | 4000.0 | 1854922.01 | 7885100.51 | 19.04 | 80.96 |
5 | 3000.0 | 2480569.52 | 7259453.00 | 25.47 | 74.53 |
6 | 1500.0 | 4168348.60 | 5571673.92 | 42.80 | 57.20 |
show_pie(df_revenue_percentages_above_below_quantities_sold.iloc[0], 'units sold', 'Revenue Percentage of Products Sales above or below 20k units sold each')
Likewise, over 99% of total sales came from products that each sold fewer than 20,000 units.
df_revenue_percentages_above_below_unit_prices = revenues_percentage_above_below_x_feature(grouped_items, "unit_price", unit_price_list)
df_revenue_percentages_above_below_unit_prices
unit_price | sales_above | sales_below | percent_above | percent_below | |
---|---|---|---|---|---|
0 | 500.00 | 0.00 | 9740022.52 | 0.00 | 100.00 |
1 | 250.00 | 1504.50 | 9738518.02 | 0.02 | 99.98 |
2 | 100.00 | 20599.50 | 9719423.02 | 0.21 | 99.79 |
3 | 50.00 | 37119.52 | 9702903.00 | 0.38 | 99.62 |
4 | 25.00 | 86589.57 | 9653432.95 | 0.89 | 99.11 |
5 | 10.00 | 797854.35 | 8942168.17 | 8.19 | 91.81 |
6 | 5.00 | 2259786.11 | 7480236.41 | 23.20 | 76.80 |
7 | 3.00 | 4305784.64 | 5434237.88 | 44.21 | 55.79 |
8 | 2.00 | 6103645.45 | 3636377.07 | 62.67 | 37.33 |
9 | 1.00 | 8659263.90 | 1080758.62 | 88.90 | 11.10 |
10 | 0.75 | 9057681.00 | 682341.52 | 92.99 | 7.01 |
11 | 0.50 | 9346640.70 | 393381.82 | 95.96 | 4.04 |
12 | 0.25 | 9703772.53 | 36249.99 | 99.63 | 0.37 |
show_pie(df_revenue_percentages_above_below_unit_prices.iloc[5], 'dollars', 'Revenue Percentage of Sales of Products selling above or below unit price of $10')
This appears to be a "\$10 Store", since the majority of sales come from products selling for \$10 or less.
More likely, though, this is a wholesaler, judging by the variety of prices and discounts, as well as some of the quantities sold.
In any case, let's group the data by customer:
aggregations = {
'orders': ('invoice_no', 'nunique'),
'sales_total': ('total_price','sum'),
'sales_average': ('total_price', 'mean')
}
grouped_customers = df.groupby(['customer_id']).agg(
**aggregations).sort_values(
by = 'sales_total', ascending = False).reset_index()
top_20_customers = grouped_customers.head(20)  # already sorted by sales_total, descending
top_20_customers
customer_id | orders | sales_total | sales_average | |
---|---|---|---|---|
0 | 0 | 1410 | 1506879.59 | 11.451323 |
1 | 14646 | 73 | 278778.02 | 135.197876 |
2 | 18102 | 60 | 259657.30 | 602.453132 |
3 | 17450 | 49 | 189575.53 | 559.219853 |
4 | 14911 | 243 | 132893.24 | 22.558690 |
5 | 12415 | 24 | 123638.18 | 159.739251 |
6 | 14156 | 65 | 114335.77 | 80.859809 |
7 | 17511 | 45 | 88138.20 | 82.065363 |
8 | 16684 | 30 | 65920.12 | 235.429000 |
9 | 13694 | 57 | 62961.54 | 108.367539 |
10 | 16029 | 66 | 60369.93 | 232.192038 |
11 | 15311 | 118 | 59284.19 | 23.924209 |
12 | 13089 | 118 | 57322.13 | 30.934771 |
13 | 15061 | 54 | 54250.34 | 132.641418 |
14 | 14096 | 17 | 53258.43 | 10.453078 |
15 | 17949 | 49 | 53215.74 | 700.207105 |
16 | 15769 | 29 | 51823.72 | 352.542313 |
17 | 14298 | 45 | 50862.44 | 31.013683 |
18 | 14088 | 14 | 50415.49 | 85.449983 |
19 | 17841 | 169 | 39861.49 | 5.110447 |
def plot_cust_sales_history(grouped_df, group_name = ""):
    # plot a histogram of line-item totals for each customer in the group
    num_of_customers = len(grouped_df)
    print("Sales history of the " + group_name + " customers")
    for i in range(num_of_customers):
        customer = grouped_df['customer_id'].iloc[i]
        plt.figure(figsize = (15, 6))
        df[df['customer_id'] == customer]['total_price'].hist(bins = 100)
        plt.title('Customer: ' + str(customer))
        plt.xlabel('Line-item total ($)')
        plt.ylabel('Frequency')
        plt.show()
plot_cust_sales_history(top_20_customers, 'top 20')
Sales history of the top 20 customers
top_20_customers.sort_values(by = 'orders', ascending = False)
customer_id | orders | sales_total | sales_average | |
---|---|---|---|---|
0 | 0 | 1410 | 1506879.59 | 11.451323 |
4 | 14911 | 243 | 132893.24 | 22.558690 |
19 | 17841 | 169 | 39861.49 | 5.110447 |
11 | 15311 | 118 | 59284.19 | 23.924209 |
12 | 13089 | 118 | 57322.13 | 30.934771 |
1 | 14646 | 73 | 278778.02 | 135.197876 |
10 | 16029 | 66 | 60369.93 | 232.192038 |
6 | 14156 | 65 | 114335.77 | 80.859809 |
2 | 18102 | 60 | 259657.30 | 602.453132 |
9 | 13694 | 57 | 62961.54 | 108.367539 |
13 | 15061 | 54 | 54250.34 | 132.641418 |
15 | 17949 | 49 | 53215.74 | 700.207105 |
3 | 17450 | 49 | 189575.53 | 559.219853 |
7 | 17511 | 45 | 88138.20 | 82.065363 |
17 | 14298 | 45 | 50862.44 | 31.013683 |
8 | 16684 | 30 | 65920.12 | 235.429000 |
16 | 15769 | 29 | 51823.72 | 352.542313 |
5 | 12415 | 24 | 123638.18 | 159.739251 |
14 | 14096 | 17 | 53258.43 | 10.453078 |
18 | 14088 | 14 | 50415.49 | 85.449983 |
Of the Top 20 customers, 90% placed 24 or more orders (we are disregarding the customer_id of "zero", since those were the unknown customers).
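A quick sanity check on that figure (a sketch; customer_id 0 pools the unknown customers, so we exclude it):

# share of known top-20 customers with 24 or more orders
known = top_20_customers[top_20_customers['customer_id'] != 0]
print("{:.0%} of the known top customers placed 24+ orders".format(
    (known['orders'] >= 24).mean()))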
first_order_date_by_customers = df.groupby('customer_id')['invoice_date'].min()
first_order_date_by_customers.name = 'first_order_date'
df = df.join(first_order_date_by_customers, on = 'customer_id')
df.head()
invoice_no | stock_code | description | quantity | invoice_date | unit_price | customer_id | total_price | first_order_date | |
---|---|---|---|---|---|---|---|---|---|
0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2018-11-29 08:26:00 | 2.55 | 17850 | 15.30 | 2018-11-29 08:26:00 |
1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2018-11-29 08:26:00 | 3.39 | 17850 | 20.34 | 2018-11-29 08:26:00 |
2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2018-11-29 08:26:00 | 2.75 | 17850 | 22.00 | 2018-11-29 08:26:00 |
3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2018-11-29 08:26:00 | 3.39 | 17850 | 20.34 | 2018-11-29 08:26:00 |
4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2018-11-29 08:26:00 | 3.39 | 17850 | 20.34 | 2018-11-29 08:26:00 |
df['first_order_month'] = df['first_order_date'].astype('datetime64[M]')
df['order_month'] = df['invoice_date'].astype('datetime64[M]')
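Note that astype('datetime64[M]') truncates each timestamp to the first day of its month, but pandas 2.0+ rejects this cast. If re-running the notebook on a current pandas, an equivalent (a sketch) would be:

# equivalent month truncation that works on pandas 2.0+
df['first_order_month'] = df['first_order_date'].dt.to_period('M').dt.to_timestamp()
df['order_month'] = df['invoice_date'].dt.to_period('M').dt.to_timestamp()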
df.sample(50)
invoice_no | stock_code | description | quantity | invoice_date | unit_price | customer_id | total_price | first_order_date | first_order_month | order_month | |
---|---|---|---|---|---|---|---|---|---|---|---|
529821 | 580754 | 21670 | BLUE SPOT CERAMIC DRAWER KNOB | 6 | 2019-12-04 10:05:00 | 3.29 | 0 | 19.74 | 2018-11-29 14:32:00 | 2018-11-01 | 2019-12-01 |
226155 | 556781 | 22846 | BREAD BIN DINER STYLE RED | 1 | 2019-06-12 12:47:00 | 16.95 | 13384 | 16.95 | 2019-06-12 12:47:00 | 2019-06-01 | 2019-06-01 |
508499 | 579196 | 21080 | SET/20 RED RETROSPOT PAPER NAPKINS | 4 | 2019-11-26 15:54:00 | 0.83 | 14096 | 3.32 | 2019-08-28 10:49:00 | 2019-08-01 | 2019-11-01 |
48623 | 540469 | 21289 | LARGE STRIPES CHOCOLATE GIFT BAG | 16 | 2019-01-05 14:04:00 | 1.25 | 12484 | 20.00 | 2019-01-05 14:04:00 | 2019-01-01 | 2019-01-01 |
477015 | 577046 | 22754 | SMALL RED BABUSHKA NOTEBOOK | 12 | 2019-11-15 13:46:00 | 0.85 | 12449 | 10.20 | 2019-06-03 15:37:00 | 2019-06-01 | 2019-11-01 |
519073 | 580136 | 16169E | WRAP 50'S CHRISTMAS | 1 | 2019-11-29 19:42:00 | 0.42 | 17841 | 0.42 | 2018-11-29 14:30:00 | 2018-11-01 | 2019-11-01 |
233151 | 557442 | 84931B | BLUE SCOTTIE DOG W FLOWER PATTERN | 2 | 2019-06-18 12:53:00 | 2.55 | 14808 | 5.10 | 2019-06-18 12:53:00 | 2019-06-01 | 2019-06-01 |
356016 | 567981 | 21243 | PINK POLKADOT PLATE | 2 | 2019-09-21 11:22:00 | 3.29 | 0 | 6.58 | 2018-11-29 14:32:00 | 2018-11-01 | 2019-09-01 |
108005 | 545475 | 21094 | SET/6 RED SPOTTY PAPER PLATES | 96 | 2019-03-01 10:59:00 | 0.64 | 12415 | 61.44 | 2019-01-04 11:12:00 | 2019-01-01 | 2019-03-01 |
169573 | 551193 | 21746 | SMALL RED RETROSPOT WINDMILL | 4 | 2019-04-25 11:05:00 | 1.25 | 17954 | 5.00 | 2018-12-01 14:18:00 | 2018-12-01 | 2019-04-01 |
381698 | 569887 | 23356 | LOVE HOT WATER BOTTLE | 3 | 2019-10-04 15:28:00 | 5.95 | 13882 | 17.85 | 2019-10-04 15:28:00 | 2019-10-01 | 2019-10-01 |
345057 | 567140 | 22429 | ENAMEL MEASURING JUG CREAM | 8 | 2019-09-14 14:47:00 | 4.25 | 15410 | 34.00 | 2019-03-21 10:15:00 | 2019-03-01 | 2019-09-01 |
74959 | 542536 | 20677 | PINK POLKADOT BOWL | 2 | 2019-01-26 13:37:00 | 2.46 | 0 | 4.92 | 2018-11-29 14:32:00 | 2018-11-01 | 2019-01-01 |
324195 | 565396 | 22111 | SCOTTIE DOG HOT WATER BOTTLE | 1 | 2019-08-31 16:39:00 | 10.79 | 0 | 10.79 | 2018-11-29 14:32:00 | 2018-11-01 | 2019-08-01 |
344584 | 567085 | 20750 | RED RETROSPOT MINI CASES | 2 | 2019-09-14 12:38:00 | 7.95 | 12434 | 15.90 | 2018-12-12 11:12:00 | 2018-12-01 | 2019-09-01 |
70938 | 542107 | 22720 | SET OF 3 CAKE TINS PANTRY DESIGN | 1 | 2019-01-23 13:38:00 | 4.95 | 16222 | 4.95 | 2019-01-23 13:38:00 | 2019-01-01 | 2019-01-01 |
497703 | 578463 | 23368 | SET 12 COLOUR PENCILS DOLLY GIRL | 16 | 2019-11-22 12:30:00 | 0.65 | 12757 | 10.40 | 2019-05-10 19:01:00 | 2019-05-01 | 2019-11-01 |
155172 | 549975 | 20702 | PINK PADDED MOBILE | 3 | 2019-04-11 14:38:00 | 4.25 | 16175 | 12.75 | 2019-04-11 14:38:00 | 2019-04-01 | 2019-04-01 |
66050 | 541711 | 21976 | PACK OF 60 MUSHROOM CAKE CASES | 120 | 2019-01-19 11:18:00 | 0.42 | 14646 | 50.40 | 2018-12-18 10:09:00 | 2018-12-01 | 2019-01-01 |
496587 | 578347 | 20782 | CAMOUFLAGE EAR MUFF HEADPHONES | 1 | 2019-11-22 09:26:00 | 10.79 | 0 | 10.79 | 2018-11-29 14:32:00 | 2018-11-01 | 2019-11-01 |
422173 | 573068 | 22728 | ALARM CLOCK BAKELIKE PINK | 2 | 2019-10-25 13:56:00 | 3.75 | 17545 | 7.50 | 2019-10-25 13:48:00 | 2019-10-01 | 2019-10-01 |
433204 | 573905 | 23254 | CHILDRENS CUTLERY DOLLY GIRL | 1 | 2019-10-30 15:06:00 | 4.15 | 16971 | 4.15 | 2019-10-30 15:06:00 | 2019-10-01 | 2019-10-01 |
188251 | 553017 | 20676 | RED RETROSPOT BOWL | 5 | 2019-05-10 19:01:00 | 1.25 | 12757 | 6.25 | 2019-05-10 19:01:00 | 2019-05-01 | 2019-05-01 |
359141 | 568177 | 23366 | SET 12 COLOURING PENCILS DOILY | 16 | 2019-09-23 13:30:00 | 0.65 | 13536 | 10.40 | 2019-09-23 13:30:00 | 2019-09-01 | 2019-09-01 |
109457 | 545638 | 21430 | SET/3 RED GINGHAM ROSE STORAGE BOX | 24 | 2019-03-02 12:26:00 | 3.39 | 13113 | 81.36 | 2018-12-06 13:07:00 | 2018-12-01 | 2019-03-01 |
228587 | 556932 | 21422 | PORCELAIN ROSE SMALL | 1 | 2019-06-13 15:41:00 | 1.63 | 0 | 1.63 | 2018-11-29 14:32:00 | 2018-11-01 | 2019-06-01 |
387833 | 570388 | 22045 | SPACEBOY GIFT WRAP | 25 | 2019-10-08 12:37:00 | 0.42 | 14911 | 10.50 | 2018-11-29 14:05:00 | 2018-11-01 | 2019-10-01 |
500771 | 578808 | 21775 | DECORATIVE FLORE BATHROOM BOTTLE | 3 | 2019-11-23 13:23:00 | 1.25 | 16327 | 3.75 | 2018-12-03 12:43:00 | 2018-12-01 | 2019-11-01 |
128726 | 547358 | 21086 | SET/6 RED SPOTTY PAPER CUPS | 4 | 2019-03-20 12:25:00 | 0.65 | 15998 | 2.60 | 2018-12-10 13:32:00 | 2018-12-01 | 2019-03-01 |
182346 | 552539 | 21175 | GIN + TONIC DIET METAL SIGN | 12 | 2019-05-08 10:17:00 | 2.55 | 0 | 30.60 | 2018-11-29 14:32:00 | 2018-11-01 | 2019-05-01 |
348661 | 567461 | 21239 | PINK POLKADOT CUP | 3 | 2019-09-18 12:31:00 | 1.63 | 0 | 4.89 | 2018-11-29 14:32:00 | 2018-11-01 | 2019-09-01 |
66421 | 541794 | 22327 | ROUND SNACK BOXES SET OF 4 SKULLS | 1 | 2019-01-19 13:21:00 | 2.95 | 17238 | 2.95 | 2018-12-01 15:05:00 | 2018-12-01 | 2019-01-01 |
505691 | 579094 | 22326 | ROUND SNACK BOXES SET OF4 WOODLAND | 6 | 2019-11-26 10:56:00 | 2.95 | 12668 | 17.70 | 2018-12-17 12:46:00 | 2018-12-01 | 2019-11-01 |
507164 | 579167 | 23333 | IVORY WICKER HEART MEDIUM | 6 | 2019-11-26 14:06:00 | 1.25 | 16800 | 7.50 | 2019-11-26 14:06:00 | 2019-11-01 | 2019-11-01 |
19895 | 537900 | 85172 | HYACINTH BULB T-LIGHT CANDLES | 32 | 2018-12-07 10:45:00 | 0.42 | 15983 | 13.44 | 2018-11-29 12:15:00 | 2018-11-01 | 2018-12-01 |
227657 | 556890 | 21929 | JUMBO BAG PINK VINTAGE PAISLEY | 20 | 2019-06-13 12:48:00 | 2.08 | 17581 | 41.60 | 2018-11-30 14:47:00 | 2018-11-01 | 2019-06-01 |
421536 | 572997 | 21803 | CHRISTMAS TREE STAR DECORATION | 36 | 2019-10-25 11:24:00 | 0.42 | 17878 | 15.12 | 2019-10-25 11:24:00 | 2019-10-01 | 2019-10-01 |
493910 | 578255 | 21985 | PACK OF 12 HEARTS DESIGN TISSUES | 5 | 2019-11-21 12:58:00 | 0.39 | 14675 | 1.95 | 2019-11-21 12:58:00 | 2019-11-01 | 2019-11-01 |
388167 | 570420 | 82580 | BATHROOM METAL SIGN | 1 | 2019-10-08 13:33:00 | 0.55 | 17841 | 0.55 | 2018-11-29 14:30:00 | 2018-11-01 | 2019-10-01 |
69335 | 541969 | 22169 | FAMILY ALBUM WHITE PICTURE FRAME | 1 | 2019-01-22 13:46:00 | 16.63 | 0 | 16.63 | 2018-11-29 14:32:00 | 2018-11-01 | 2019-01-01 |
161596 | 550477 | 22378 | WALL TIDY RETROSPOT | 3 | 2019-04-16 14:01:00 | 2.10 | 17338 | 6.30 | 2018-12-11 12:57:00 | 2018-12-01 | 2019-04-01 |
380764 | 569834 | 22554 | PLASTERS IN TIN WOODLAND ANIMALS | 12 | 2019-10-04 13:00:00 | 1.65 | 14383 | 19.80 | 2019-06-27 10:18:00 | 2019-06-01 | 2019-10-01 |
460699 | 575930 | 21174 | POTTERING IN THE SHED METAL SIGN | 1 | 2019-11-09 17:58:00 | 4.13 | 0 | 4.13 | 2018-11-29 14:32:00 | 2018-11-01 | 2019-11-01 |
443118 | 574690 | 22664 | TOY TIDY DOLLY GIRL DESIGN | 5 | 2019-11-04 13:11:00 | 2.10 | 12638 | 10.50 | 2019-11-04 13:11:00 | 2019-11-01 | 2019-11-01 |
120930 | 546684 | 21535 | RED RETROSPOT SMALL MILK JUG | 24 | 2019-03-14 09:47:00 | 2.55 | 16152 | 61.20 | 2019-03-14 09:47:00 | 2019-03-01 | 2019-03-01 |
160687 | 550468 | 22549 | PICTURE DOMINOES | 12 | 2019-04-16 13:43:00 | 1.45 | 16700 | 17.40 | 2018-12-06 16:34:00 | 2018-12-01 | 2019-04-01 |
406636 | 571824 | 23494 | VINTAGE DOILY DELUXE SEWING KIT | 3 | 2019-10-17 11:49:00 | 5.95 | 12472 | 17.85 | 2018-11-29 14:33:00 | 2018-11-01 | 2019-10-01 |
539936 | 581450 | 22720 | SET OF 3 CAKE TINS PANTRY DESIGN | 3 | 2019-12-06 17:54:00 | 4.95 | 16794 | 14.85 | 2019-06-10 14:15:00 | 2019-06-01 | 2019-12-01 |
516546 | 579927 | 22984 | CARD GINGHAM ROSE | 12 | 2019-11-29 09:20:00 | 0.42 | 12572 | 5.04 | 2019-10-17 08:20:00 | 2019-10-01 | 2019-11-01 |
123219 | 546892 | 22241 | GARLAND WOODEN HAPPY EASTER | 1 | 2019-03-15 18:20:00 | 2.46 | 0 | 2.46 | 2018-11-29 14:32:00 | 2018-11-01 | 2019-03-01 |
aggregations = {
'num_of_orders': ('invoice_no', 'nunique'),
'num_of_customers' : ('customer_id', 'nunique'),
'sales_total': ('total_price','sum')
}
cohort_grouped = df.groupby('first_order_month').agg(**aggregations)
cohort_grouped
num_of_orders | num_of_customers | sales_total | |
---|---|---|---|
first_order_month | |||
2018-11-01 | 4533 | 205 | 2.791599e+06 |
2018-12-01 | 7012 | 743 | 3.109304e+06 |
2019-01-01 | 2821 | 473 | 1.075818e+06 |
2019-02-01 | 1692 | 365 | 5.095876e+05 |
2019-03-01 | 1735 | 414 | 5.661178e+05 |
2019-04-01 | 989 | 285 | 2.895956e+05 |
2019-05-01 | 936 | 291 | 2.784896e+05 |
2019-06-01 | 703 | 225 | 2.063646e+05 |
2019-07-01 | 530 | 201 | 1.591624e+05 |
2019-08-01 | 392 | 162 | 1.625227e+05 |
2019-09-01 | 675 | 289 | 2.186757e+05 |
2019-10-01 | 707 | 376 | 2.162994e+05 |
2019-11-01 | 432 | 302 | 1.319079e+05 |
2019-12-01 | 34 | 31 | 2.457762e+04 |
We would expect revenue from each cohort to be lower than that of the previous one, since customers from older cohorts have had more time to place orders. However, the second cohort (December 2018) shows a huge spike, with more than three times as many new customers as the first (743 vs. 205), which could plausibly be attributed to Christmas holiday shopping.
Additionally, the March 2019 cohort was (albeit slightly) higher than the February 2019 cohort, as was May compared to April, September to August, and October to September.
The December 2019 cohort is very small because the collected data ends on 2019-12-07.
df['invoice_date'].max()
Timestamp('2019-12-07 12:50:00')
# Assess changes in absolute values by month
df.pivot_table(index = 'first_order_month',
columns = 'order_month',
values = 'customer_id',
aggfunc = 'nunique')
order_month | 2018-11-01 | 2018-12-01 | 2019-01-01 | 2019-02-01 | 2019-03-01 | 2019-04-01 | 2019-05-01 | 2019-06-01 | 2019-07-01 | 2019-08-01 | 2019-09-01 | 2019-10-01 | 2019-11-01 | 2019-12-01 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
first_order_month | ||||||||||||||
2018-11-01 | 205.0 | 93.0 | 80.0 | 70.0 | 77.0 | 79.0 | 83.0 | 81.0 | 72.0 | 76.0 | 81.0 | 80.0 | 99.0 | 39.0 |
2018-12-01 | NaN | 743.0 | 295.0 | 236.0 | 289.0 | 250.0 | 298.0 | 264.0 | 263.0 | 261.0 | 280.0 | 296.0 | 387.0 | 161.0 |
2019-01-01 | NaN | NaN | 473.0 | 110.0 | 130.0 | 113.0 | 157.0 | 134.0 | 127.0 | 126.0 | 141.0 | 158.0 | 177.0 | 45.0 |
2019-02-01 | NaN | NaN | NaN | 365.0 | 90.0 | 66.0 | 103.0 | 99.0 | 86.0 | 85.0 | 100.0 | 99.0 | 110.0 | 24.0 |
2019-03-01 | NaN | NaN | NaN | NaN | 414.0 | 77.0 | 110.0 | 83.0 | 103.0 | 78.0 | 102.0 | 107.0 | 113.0 | 32.0 |
2019-04-01 | NaN | NaN | NaN | NaN | NaN | 285.0 | 65.0 | 61.0 | 62.0 | 58.0 | 66.0 | 66.0 | 78.0 | 17.0 |
2019-05-01 | NaN | NaN | NaN | NaN | NaN | NaN | 291.0 | 67.0 | 49.0 | 49.0 | 62.0 | 72.0 | 84.0 | 15.0 |
2019-06-01 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 225.0 | 50.0 | 43.0 | 58.0 | 60.0 | 72.0 | 14.0 |
2019-07-01 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 201.0 | 45.0 | 40.0 | 47.0 | 55.0 | 19.0 |
2019-08-01 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 162.0 | 34.0 | 43.0 | 47.0 | 16.0 |
2019-09-01 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 289.0 | 88.0 | 97.0 | 31.0 |
2019-10-01 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 376.0 | 104.0 | 37.0 |
2019-11-01 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 302.0 | 23.0 |
2019-12-01 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 31.0 |
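These absolute counts are easier to compare as retention rates. A sketch (not part of the original analysis) divides each row by its cohort size, i.e. the first non-null value in the row:

cohort_counts = df.pivot_table(index = 'first_order_month',
                               columns = 'order_month',
                               values = 'customer_id',
                               aggfunc = 'nunique')
# month-0 size of each cohort = first non-null value in each row
cohort_sizes = cohort_counts.bfill(axis = 1).iloc[:, 0]
retention = cohort_counts.div(cohort_sizes, axis = 0).round(3)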
aggregations = {'revenue': ('total_price', 'sum'),
'num_of_customers' : ('customer_id','nunique')}
orders_grouped_by_cohorts = df.groupby(['first_order_month',
'order_month']).agg(**aggregations)
orders_grouped_by_cohorts.head(10)
revenue | num_of_customers | ||
---|---|---|---|
first_order_month | order_month | ||
2018-11-01 | 2018-11-01 | 103249.92 | 205 |
2018-12-01 | 288927.89 | 93 | |
2019-01-01 | 201104.98 | 80 | |
2019-02-01 | 122481.67 | 70 | |
2019-03-01 | 188820.36 | 77 | |
2019-04-01 | 110959.43 | 79 | |
2019-05-01 | 181814.92 | 83 | |
2019-06-01 | 159306.76 | 81 | |
2019-07-01 | 198619.92 | 72 | |
2019-08-01 | 185200.43 | 76 |
orders_grouped_by_cohorts['revenue_per_customer'] = orders_grouped_by_cohorts[
'revenue'] / orders_grouped_by_cohorts['num_of_customers']
orders_grouped_by_cohorts.head()
revenue | num_of_customers | revenue_per_customer | ||
---|---|---|---|---|
first_order_month | order_month | |||
2018-11-01 | 2018-11-01 | 103249.92 | 205 | 503.658146 |
2018-12-01 | 288927.89 | 93 | 3106.751505 | |
2019-01-01 | 201104.98 | 80 | 2513.812250 | |
2019-02-01 | 122481.67 | 70 | 1749.738143 | |
2019-03-01 | 188820.36 | 77 | 2452.212468 |
orders_grouped_by_cohorts.pivot_table(index = 'first_order_month',
columns = 'order_month',
values = 'revenue_per_customer',
aggfunc = 'mean')
order_month | 2018-11-01 | 2018-12-01 | 2019-01-01 | 2019-02-01 | 2019-03-01 | 2019-04-01 | 2019-05-01 | 2019-06-01 | 2019-07-01 | 2019-08-01 | 2019-09-01 | 2019-10-01 | 2019-11-01 | 2019-12-01 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
first_order_month | ||||||||||||||
2018-11-01 | 503.658146 | 3106.751505 | 2513.812250 | 1749.738143 | 2452.212468 | 1404.549747 | 2190.541205 | 1966.750123 | 2758.610000 | 2436.847763 | 2428.364198 | 2969.289125 | 4946.200505 | 3261.504359 |
2018-12-01 | NaN | 493.394118 | 691.084339 | 696.352712 | 723.754533 | 567.085600 | 841.306879 | 815.689356 | 803.970722 | 970.578123 | 1109.974214 | 1124.566791 | 914.069871 | 592.175280 |
2019-01-01 | NaN | NaN | 469.320825 | 514.782636 | 502.203846 | 419.405398 | 617.751720 | 585.060448 | 603.758898 | 684.126746 | 535.173830 | 746.551392 | 745.179718 | 465.561333 |
2019-02-01 | NaN | NaN | NaN | 413.594548 | 280.583778 | 460.032727 | 425.118544 | 322.990606 | 374.810000 | 453.589647 | 494.392500 | 468.381010 | 508.164182 | 197.993333 |
2019-03-01 | NaN | NaN | NaN | NaN | 416.745749 | 301.293117 | 523.536818 | 430.806145 | 480.659709 | 541.381282 | 532.999216 | 600.243271 | 523.734690 | 235.308125 |
2019-04-01 | NaN | NaN | NaN | NaN | NaN | 393.485758 | 380.790769 | 363.614590 | 331.168387 | 368.477759 | 392.663030 | 416.196667 | 400.702436 | 233.914118 |
2019-05-01 | NaN | NaN | NaN | NaN | NaN | NaN | 412.786632 | 268.635224 | 401.065714 | 346.586327 | 443.159839 | 472.353750 | 443.804524 | 331.350000 |
2019-06-01 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 382.773289 | 285.003400 | 284.189070 | 475.182931 | 425.922500 | 506.441667 | 299.322857 |
2019-07-01 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 440.266323 | 241.316444 | 406.657250 | 399.262766 | 364.312182 | 249.515789 |
2019-08-01 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 399.519630 | 458.548235 | 740.920000 | 838.770000 | 683.005000 |
2019-09-01 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 511.196647 | 301.554659 | 347.474021 | 345.099677 |
2019-10-01 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 426.481170 | 442.203942 | 269.008378 |
2019-11-01 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 418.951126 | 234.117391 |
2019-12-01 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 792.826452 |
orders_grouped_by_cohorts = orders_grouped_by_cohorts.reset_index()
orders_grouped_by_cohorts['cohort_lifetime'] = orders_grouped_by_cohorts[
'order_month'] - orders_grouped_by_cohorts['first_order_month']
# convert lifetime to months
orders_grouped_by_cohorts['cohort_lifetime'] = orders_grouped_by_cohorts['cohort_lifetime'] / np.timedelta64(1,'M')
# round to nearest month
orders_grouped_by_cohorts['cohort_lifetime'] = orders_grouped_by_cohorts['cohort_lifetime'].round().astype('int')
# change first_order_month to year and month only
orders_grouped_by_cohorts['first_order_month'] = orders_grouped_by_cohorts['first_order_month'].dt.strftime('%Y-%m')
orders_grouped_by_cohorts
first_order_month | order_month | revenue | num_of_customers | revenue_per_customer | cohort_lifetime | |
---|---|---|---|---|---|---|
0 | 2018-11 | 2018-11-01 | 103249.92 | 205 | 503.658146 | 0 |
1 | 2018-11 | 2018-12-01 | 288927.89 | 93 | 3106.751505 | 1 |
2 | 2018-11 | 2019-01-01 | 201104.98 | 80 | 2513.812250 | 2 |
3 | 2018-11 | 2019-02-01 | 122481.67 | 70 | 1749.738143 | 3 |
4 | 2018-11 | 2019-03-01 | 188820.36 | 77 | 2452.212468 | 4 |
... | ... | ... | ... | ... | ... | ... |
100 | 2019-10 | 2019-11-01 | 45989.21 | 104 | 442.203942 | 1 |
101 | 2019-10 | 2019-12-01 | 9953.31 | 37 | 269.008378 | 2 |
102 | 2019-11 | 2019-11-01 | 126523.24 | 302 | 418.951126 | 0 |
103 | 2019-11 | 2019-12-01 | 5384.70 | 23 | 234.117391 | 1 |
104 | 2019-12 | 2019-12-01 | 24577.62 | 31 | 792.826452 | 0 |
105 rows × 6 columns
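Dividing by np.timedelta64(1, 'M') uses the average month length (about 30.44 days) and then rounds, which works here but can miscount near month boundaries. A sketch of an exact alternative via period arithmetic (note that first_order_month was cast to a string above, so we parse it back):

# exact whole-month difference between order month and cohort month
om = orders_grouped_by_cohorts['order_month'].dt.to_period('M')
fm = pd.to_datetime(orders_grouped_by_cohorts['first_order_month']).dt.to_period('M')
orders_grouped_by_cohorts['cohort_lifetime'] = (om - fm).apply(lambda d: d.n)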
# pivot table of changes in average revenue per customer whose columns will contain the lifetime and whose rows will be cohorts
avg_revenue_per_customer_pivot = orders_grouped_by_cohorts.pivot_table(index = 'first_order_month',
columns = 'cohort_lifetime',
values = 'revenue_per_customer',
aggfunc = 'mean')
# let's visualize revenue_per_customer_pivot
plt.figure(figsize = (20, 9))
plt.title('Average customer purchase size')
sns.heatmap(avg_revenue_per_customer_pivot,
annot = True,
fmt='.1f',
linewidth = 1,
linecolor = 'gray');
After analyzing the data, a question comes to mind: was the average unit price paid by the top buyers different from that paid by the rest? Did the top buyers get discounted pricing, making up for it with the sheer volume they ordered?
Null hypothesis: the average price paid by the top 25% of customers is the same as that paid by the rest, suggesting that the only difference in top customers is the volume of items they purchase.
Alternative hypothesis: the average price paid by the top 25% of customers is not the same as that paid by the rest, suggesting that they get special pricing discounts but make up for it in purchase volume.
# We will group the customers and then sort by total revenues
grouped_customer = df.groupby('customer_id').agg({'invoice_no':'nunique',
'unit_price': 'mean',
'total_price': 'sum'}).reset_index()
grouped_customer.head()
customer_id | invoice_no | unit_price | total_price | |
---|---|---|---|---|
0 | 0 | 1410 | 4.468638 | 1506879.59 |
1 | 12347 | 7 | 2.644011 | 4310.00 |
2 | 12348 | 4 | 0.692963 | 1437.24 |
3 | 12349 | 1 | 4.237500 | 1457.55 |
4 | 12350 | 1 | 1.581250 | 294.40 |
grouped_customer.describe()
customer_id | invoice_no | unit_price | total_price | |
---|---|---|---|---|
count | 4362.00000 | 4362.000000 | 4362.000000 | 4.362000e+03 |
mean | 15297.41105 | 5.316598 | 3.406677 | 2.232926e+03 |
std | 1736.71147 | 23.134383 | 5.973947 | 2.423915e+04 |
min | 0.00000 | 1.000000 | 0.122500 | -1.192200e+03 |
25% | 13814.25000 | 1.000000 | 2.171436 | 2.930250e+02 |
50% | 15299.50000 | 3.000000 | 2.845251 | 6.432150e+02 |
75% | 16777.75000 | 5.000000 | 3.740464 | 1.584745e+03 |
max | 18287.00000 | 1410.000000 | 295.000000 | 1.506880e+06 |
# The top 25% of customers each spent about $1,584.75 or more (the 75th percentile of total_price above)
grouped_customers_sorted_total_price = grouped_customer.sort_values(by = 'total_price', ascending = False)
top_25_percent = grouped_customers_sorted_total_price[
grouped_customers_sorted_total_price['total_price'] >= grouped_customers_sorted_total_price['total_price'].quantile(0.75)]['unit_price']
other_75_percent = grouped_customers_sorted_total_price[
grouped_customers_sorted_total_price['total_price'] < grouped_customers_sorted_total_price['total_price'].quantile(0.75)]['unit_price']
from scipy import stats as st
# significance level. Not a life or death situation, so we choose 0.05
alpha = 0.05
results = st.ttest_ind(top_25_percent, other_75_percent)
print("Result of t-test pvalue is:", str(results.pvalue))
if (results.pvalue < alpha):
print("Rejecting the null hypothesis: there is a significant difference between the samples tested")
else:
print("Failed to reject the null hypothesis: there is no reason to consider the samples different")
Result of t-test pvalue is: 0.01107778774627851 Rejecting the null hypothesis: there is a significant difference between the samples tested
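Since per-customer average unit prices are heavily right-skewed (the describe() output above shows a max of \$295 against a median of about \$2.85), a rank-based test makes a reasonable robustness check. A minimal sketch, reusing the two samples defined above (not part of the original run):

# Mann-Whitney U test: compares the two distributions without assuming normality
results_mw = st.mannwhitneyu(top_25_percent, other_75_percent, alternative = 'two-sided')
print("Mann-Whitney U p-value:", results_mw.pvalue)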
ACTION REQUIRED: Why was nearly 1% of the data duplicated? We need to ask Data Engineering about data logging, or UX about data entry. Is there a bug in the backend, or is this a front-end issue (or user error)?
Some of these can perhaps be attributed to cash transactions, but their sheer number suggests a systemic problem that should be discussed with PdM, UX, and Data/Engineering.
There were transactions involving hundreds, even thousands of units. Some of these large transactions have no associated revenue (e.g. a \$0 unit price). Large negative quantities sometimes appear to be returns of damaged product, and some large "sales" may have been gifts or donations. In any case, since we could not correlate these transactions with actual customers, we deleted them as well.
ACTION REQUIRED: There has to be a better way to reconcile these abnormal transactions so that we can properly track and attribute them to either a profit or a loss. In our case, we're talking about \$1.4 million in sales that cannot be properly accounted for.
ACTION REQUIRED: Some items are obvious returns, but others have no customer associated, and some descriptions look very arbitrary. There should be a more consistent system, because some of the quantities and dollar amounts are very large (in the negative).
ACTION REQUIRED: Why are several dozen items listed with a unit_price of \$0?
RESULT: We reject the null hypothesis. The top 25% of customers do appear to be getting different pricing.
Although we've seen pricing that seems arbitrary, our testing does show that the top 25% of customers pay different prices on average. Perhaps they were getting free samples, which might explain the \$0 unit pricing on some items. In addition to the questions about the ordering system posed above, we should also ask about customer service/support: as the face or "voice" of the company, they play a large part in whether current customers come back for repeat business.
In short, there appear to be many areas that can be shored up in order to continue generating more accurate and complete data, which will help the business make better assessments and decisions.