Customer segmentation is applied to an e-commerce customer database using K-means clustering from scikit-learn. This project extends an earlier case study, in which the same customer database was visualized, and takes it one step further with the following question in mind:
Can this customer database be grouped to develop customized relationships?
To answer this question, three features will be created and used:

  * products_ordered: the number of distinct product types a customer has ordered
  * average_return_rate: the ratio of returned to ordered item quantities, averaged over a customer's orders
  * total_spending: the sum of a customer's total sales value (after taxes and returns)
The dataset represents real customer and order data between November 2018 and April 2019, pseudonymized for confidentiality.
Imports
# data wrangling
import pandas as pd
import numpy as np
# visualization
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
# for data preprocessing and clustering
from sklearn.cluster import KMeans
%matplotlib inline
# to include graphs inline within the frontends next to code
%config InlineBackend.figure_format='retina'
# to enable retina (high resolution) plots
pd.options.mode.chained_assignment = None
# to bypass warnings in various dataframe assignments
Investigate data
# load data into a dataframe
customers_orders = pd.read_csv("Orders - Analysis Task.csv")
# first rows of the dataset
customers_orders.head()
product_title | product_type | variant_title | variant_sku | variant_id | customer_id | order_id | day | net_quantity | gross_sales | discounts | returns | net_sales | taxes | total_sales | returned_item_quantity | ordered_item_quantity | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | DPR | DPR | 100 | AD-982-708-895-F-6C894FB | 52039657 | 1312378 | 83290718932496 | 04/12/2018 | 2 | 200.0 | -200.00 | 0.00 | 0.0 | 0.0 | 0.0 | 0 | 2 |
1 | RJF | Product P | 28 / A / MTM | 83-490-E49-8C8-8-3B100BC | 56914686 | 3715657 | 36253792848113 | 01/04/2019 | 2 | 190.0 | -190.00 | 0.00 | 0.0 | 0.0 | 0.0 | 0 | 2 |
2 | CLH | Product B | 32 / B / FtO | 68-ECA-BC7-3B2-A-E73DE1B | 24064862 | 9533448 | 73094559597229 | 05/11/2018 | 0 | 164.8 | -156.56 | -8.24 | 0.0 | 0.0 | 0.0 | -2 | 2 |
3 | NMA | Product F | 40 / B / FtO | 6C-1F1-226-1B3-2-3542B41 | 43823868 | 4121004 | 53616575668264 | 19/02/2019 | 1 | 119.0 | -119.00 | 0.00 | 0.0 | 0.0 | 0.0 | 0 | 1 |
4 | NMA | Product F | 40 / B / FtO | 6C-1F1-226-1B3-2-3542B41 | 43823868 | 4121004 | 29263220319421 | 19/02/2019 | 1 | 119.0 | -119.00 | 0.00 | 0.0 | 0.0 | 0.0 | 0 | 1 |
# first glance of customers_orders data
customers_orders.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70052 entries, 0 to 70051
Data columns (total 17 columns):
 #   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
 0   product_title           70052 non-null  object
 1   product_type            70052 non-null  object
 2   variant_title           70052 non-null  object
 3   variant_sku             70052 non-null  object
 4   variant_id              70052 non-null  int64
 5   customer_id             70052 non-null  int64
 6   order_id                70052 non-null  int64
 7   day                     70052 non-null  object
 8   net_quantity            70052 non-null  int64
 9   gross_sales             70052 non-null  float64
 10  discounts               70052 non-null  float64
 11  returns                 70052 non-null  float64
 12  net_sales               70052 non-null  float64
 13  taxes                   70052 non-null  float64
 14  total_sales             70052 non-null  float64
 15  returned_item_quantity  70052 non-null  int64
 16  ordered_item_quantity   70052 non-null  int64
dtypes: float64(6), int64(6), object(5)
memory usage: 9.1+ MB
# descriptive statistics of the non-object columns
customers_orders.describe()
variant_id | customer_id | order_id | net_quantity | gross_sales | discounts | returns | net_sales | taxes | total_sales | returned_item_quantity | ordered_item_quantity | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 7.005200e+04 | 7.005200e+04 | 7.005200e+04 | 70052.000000 | 70052.000000 | 70052.000000 | 70052.000000 | 70052.000000 | 70052.000000 | 70052.000000 | 70052.000000 | 70052.000000 |
mean | 2.442320e+11 | 6.013091e+11 | 5.506075e+13 | 0.701179 | 61.776302 | -4.949904 | -10.246051 | 46.580348 | 9.123636 | 55.703982 | -0.156098 | 0.857277 |
std | 4.255079e+12 | 6.223201e+12 | 2.587640e+13 | 0.739497 | 31.800689 | 7.769972 | 25.154677 | 51.802690 | 10.305236 | 61.920557 | 0.369190 | 0.380820 |
min | 1.001447e+07 | 1.000661e+06 | 1.000657e+13 | -3.000000 | 0.000000 | -200.000000 | -237.500000 | -237.500000 | -47.500000 | -285.000000 | -3.000000 | 0.000000 |
25% | 2.692223e+07 | 3.295695e+06 | 3.270317e+13 | 1.000000 | 51.670000 | -8.340000 | 0.000000 | 47.080000 | 8.375000 | 56.227500 | 0.000000 | 1.000000 |
50% | 4.494514e+07 | 5.566107e+06 | 5.522207e+13 | 1.000000 | 74.170000 | 0.000000 | 0.000000 | 63.330000 | 12.660000 | 76.000000 | 0.000000 | 1.000000 |
75% | 7.743106e+07 | 7.815352e+06 | 7.736876e+13 | 1.000000 | 79.170000 | 0.000000 | 0.000000 | 74.170000 | 14.840000 | 89.000000 | 0.000000 | 1.000000 |
max | 8.422212e+13 | 9.977409e+13 | 9.999554e+13 | 6.000000 | 445.000000 | 0.000000 | 0.000000 | 445.000000 | 63.340000 | 445.000000 | 0.000000 | 6.000000 |
There is a significant number of rows whose ordered_item_quantity is 0 and whose net_quantity is negative, meaning these items were not ordered or sold at all; the fact that they still show returns requires investigation.
print("Number of rows that net quantity is negative:",
customers_orders[customers_orders.net_quantity < 0].shape[0])
Number of rows that net quantity is negative: 10715
These rows are excluded from the orders dataset for this project by keeping only rows with a positive ordered_item_quantity.
# exclude not sold/ordered SKUs from the dataset
customers_orders = customers_orders[
customers_orders["ordered_item_quantity"] > 0]
The first feature, products_ordered, is the number of distinct product types (per the product_type column) ordered by a customer.
Create functions to identify customers who order multiple products
def encode_column(column):
if column > 0:
return 1
if column <= 0:
return 0
def aggregate_by_ordered_quantity(dataframe, column_list):
    '''This function:
    1. aggregates the given dataframe by the column list,
       counting ordered item quantities per group
    2. adds a products_ordered column to the aggregated dataframe
       by encoding positive ordered item counts as 1
    3. returns a final dataframe that shows, per value of the first
       element in the column list, how many such groups were ordered'''
aggregated_dataframe = (dataframe
.groupby(column_list)
.ordered_item_quantity.count()
.reset_index())
aggregated_dataframe["products_ordered"] = (aggregated_dataframe
.ordered_item_quantity
.apply(encode_column))
final_dataframe = (aggregated_dataframe
.groupby(column_list[0])
.products_ordered.sum() # aligned with the added column name
.reset_index())
return final_dataframe
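To make the aggregation concrete, here is the same three-step pipeline applied inline to a small toy dataset (the customer IDs and quantities below are hypothetical, made up purely for illustration):

```python
import pandas as pd

# toy orders: customer 1 buys product types A (twice) and B; customer 2 buys A
toy = pd.DataFrame({
    "customer_id": [1, 1, 1, 2],
    "product_type": ["A", "A", "B", "A"],
    "ordered_item_quantity": [1, 2, 1, 1],
})

# step 1: count ordered rows per (customer, product type)
agg = (toy.groupby(["customer_id", "product_type"])
          .ordered_item_quantity.count()
          .reset_index())

# step 2: encode each ordered product type as 1
agg["products_ordered"] = (agg.ordered_item_quantity > 0).astype(int)

# step 3: sum the flags per customer = distinct product types ordered
result = agg.groupby("customer_id").products_ordered.sum().reset_index()
print(result)
# customer 1 ordered 2 product types, customer 2 ordered 1
```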
# apply functions to customers_orders
customers = aggregate_by_ordered_quantity(customers_orders, ["customer_id", "product_type"])
print(customers.head())
   customer_id  products_ordered
0      1000661                 1
1      1001914                 1
2      1002167                 3
3      1002387                 1
4      1002419                 2
The second feature, average_return_rate, is the ratio of returned item quantity to ordered item quantity. This ratio is first calculated per order and then averaged over all orders of a customer.
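As a worked example of this two-step calculation on made-up numbers: a customer with one order of 2 items (1 returned) and a second order of 1 item (none returned) has per-order rates of 0.5 and 0.0, so an average return rate of 0.25. In code:

```python
import pandas as pd

# hypothetical customer: two orders; 1 of 2 items returned, then 0 of 1
toy = pd.DataFrame({
    "customer_id": [1, 1],
    "order_id": [10, 11],
    "ordered_item_quantity": [2, 1],
    "returned_item_quantity": [-1, 0],  # returns are stored as negative counts
})

# step 1: unit return rate per order (sign flipped so rates are positive)
toy["average_return_rate"] = (-1 * toy["returned_item_quantity"]
                              / toy["ordered_item_quantity"])

# step 2: average the per-order rates for each customer
per_customer = toy.groupby("customer_id").average_return_rate.mean()
print(per_customer.loc[1])  # (0.5 + 0.0) / 2 = 0.25
```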
# aggregate data per customer_id and order_id,
# to see ordered item sum and returned item sum
ordered_sum_by_customer_order = (customers_orders
.groupby(["customer_id", "order_id"])
.ordered_item_quantity.sum()
.reset_index())
returned_sum_by_customer_order = (customers_orders
.groupby(["customer_id", "order_id"])
.returned_item_quantity.sum()
.reset_index())
# merge two dataframes to be able to calculate unit return rate
ordered_returned_sums = pd.merge(ordered_sum_by_customer_order, returned_sum_by_customer_order)
# calculate unit return rate per order and customer
ordered_returned_sums["average_return_rate"] = (-1 *
ordered_returned_sums["returned_item_quantity"] /
ordered_returned_sums["ordered_item_quantity"])
ordered_returned_sums.head()
customer_id | order_id | ordered_item_quantity | returned_item_quantity | average_return_rate | |
---|---|---|---|---|---|
0 | 1000661 | 99119989117212 | 3 | 0 | 0.0 |
1 | 1001914 | 79758569034715 | 1 | 0 | 0.0 |
2 | 1002167 | 38156088848638 | 1 | 0 | 0.0 |
3 | 1002167 | 57440147820257 | 1 | 0 | 0.0 |
4 | 1002167 | 58825523953710 | 1 | 0 | 0.0 |
# take average of the unit return rate for all orders of a customer
customer_return_rate = (ordered_returned_sums
.groupby("customer_id")
.average_return_rate
.mean()
.reset_index())
return_rates = pd.DataFrame(customer_return_rate["average_return_rate"]
.value_counts()
.reset_index())
return_rates.rename(columns=
{"index": "average return rate",
"average_return_rate": "count of unit return rate"},
inplace=True)
return_rates.sort_values(by="average return rate")
average return rate | count of unit return rate | |
---|---|---|
0 | 0.000000 | 24823 |
9 | 0.013889 | 1 |
10 | 0.066667 | 1 |
8 | 0.083333 | 1 |
12 | 0.125000 | 1 |
5 | 0.166667 | 2 |
6 | 0.200000 | 2 |
4 | 0.250000 | 5 |
1 | 0.333333 | 13 |
11 | 0.400000 | 1 |
3 | 0.500000 | 9 |
7 | 0.666667 | 2 |
2 | 1.000000 | 13 |
# add average_return_rate to customers dataframe
customers = pd.merge(customers,
customer_return_rate,
on="customer_id")
The third feature, total_spending, is the aggregated sum of the total sales value per customer, which is the amount remaining after taxes and returns.
# aggregate total sales per customer id
customer_total_spending = (customers_orders
.groupby("customer_id")
.total_sales
.sum()
.reset_index())
customer_total_spending.rename(columns = {"total_sales" : "total_spending"},
inplace = True)
# add total sales to customers dataframe
customers = customers.merge(customer_total_spending,
on="customer_id")
print("The number of customers from the existing customer base:", customers.shape[0])
The number of customers from the existing customer base: 24874
# drop id column since it is not a feature
customers.drop(columns="customer_id",
inplace=True)
customers.head()
products_ordered | average_return_rate | total_spending | |
---|---|---|---|
0 | 1 | 0.0 | 260.0 |
1 | 1 | 0.0 | 79.2 |
2 | 3 | 0.0 | 234.2 |
3 | 1 | 0.0 | 89.0 |
4 | 2 | 0.0 | 103.0 |
fig = make_subplots(rows=3, cols=1,
subplot_titles=("Products Ordered",
"Average Return Rate",
"Total Spending"))
fig.append_trace(go.Histogram(x=customers.products_ordered),
row=1, col=1)
fig.append_trace(go.Histogram(x=customers.average_return_rate),
row=2, col=1)
fig.append_trace(go.Histogram(x=customers.total_spending),
row=3, col=1)
fig.update_layout(height=800, width=800,
title_text="Distribution of the Features")
fig.show()
def apply_log1p_transformation(dataframe, column):
    '''This function takes a dataframe and a column name as a string,
    applies the numpy log1p transformation (log(1 + x)) to that column,
    stores the result in a new log_-prefixed column,
    and returns the transformed pandas series'''
dataframe["log_" + column] = np.log1p(dataframe[column])
return dataframe["log_" + column]
apply_log1p_transformation(customers, "products_ordered")
0        0.693147
1        0.693147
2        1.386294
3        0.693147
4        1.098612
           ...
24869    1.098612
24870    1.098612
24871    0.693147
24872    1.098612
24873    0.693147
Name: log_products_ordered, Length: 24874, dtype: float64
apply_log1p_transformation(customers, "average_return_rate")
0        0.0
1        0.0
2        0.0
3        0.0
4        0.0
        ...
24869    0.0
24870    0.0
24871    0.0
24872    0.0
24873    0.0
Name: log_average_return_rate, Length: 24874, dtype: float64
apply_log1p_transformation(customers, "total_spending")
0        5.564520
1        4.384524
2        5.460436
3        4.499810
4        4.644391
           ...
24869    5.560682
24870    5.495117
24871    4.499810
24872    5.590987
24873    4.174387
Name: log_total_spending, Length: 24874, dtype: float64
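A brief aside, not part of the original analysis: np.expm1 is the exact inverse of np.log1p, which is what makes it possible to map the cluster centers back to the original feature scale later on. A minimal check:

```python
import numpy as np

# log1p(x) = log(1 + x) is defined at x = 0 and compresses large values;
# expm1(x) = exp(x) - 1 undoes it exactly
values = np.array([0.0, 1.0, 64.0, 260.0])  # e.g. spending amounts
logged = np.log1p(values)
restored = np.expm1(logged)

assert np.allclose(restored, values)
```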
fig = make_subplots(rows=3, cols=1,
subplot_titles=("Products Ordered",
"Average Return Rate",
"Total Spending"))
fig.append_trace(go.Histogram(x=customers.log_products_ordered),
row=1, col=1)
fig.append_trace(go.Histogram(x=customers.log_average_return_rate),
row=2, col=1)
fig.append_trace(go.Histogram(x=customers.log_total_spending),
row=3, col=1)
fig.update_layout(height=800, width=800,
title_text="Distribution of the Features after Logarithm Transformation")
fig.show()
customers.head()
products_ordered | average_return_rate | total_spending | log_products_ordered | log_average_return_rate | log_total_spending | |
---|---|---|---|---|---|---|
0 | 1 | 0.0 | 260.0 | 0.693147 | 0.0 | 5.564520 |
1 | 1 | 0.0 | 79.2 | 0.693147 | 0.0 | 4.384524 |
2 | 3 | 0.0 | 234.2 | 1.386294 | 0.0 | 5.460436 |
3 | 1 | 0.0 | 89.0 | 0.693147 | 0.0 | 4.499810 |
4 | 2 | 0.0 | 103.0 | 1.098612 | 0.0 | 4.644391 |
# features we are going to use as an input to the model
customers.iloc[:,3:]
log_products_ordered | log_average_return_rate | log_total_spending | |
---|---|---|---|
0 | 0.693147 | 0.0 | 5.564520 |
1 | 0.693147 | 0.0 | 4.384524 |
2 | 1.386294 | 0.0 | 5.460436 |
3 | 0.693147 | 0.0 | 4.499810 |
4 | 1.098612 | 0.0 | 4.644391 |
... | ... | ... | ... |
24869 | 1.098612 | 0.0 | 5.560682 |
24870 | 1.098612 | 0.0 | 5.495117 |
24871 | 0.693147 | 0.0 | 4.499810 |
24872 | 1.098612 | 0.0 | 5.590987 |
24873 | 0.693147 | 0.0 | 4.174387 |
24874 rows × 3 columns
# create an initial K-means model (scikit-learn's default number of clusters)
kmeans_model = KMeans(init='k-means++',
max_iter=500,
random_state=42)
kmeans_model.fit(customers.iloc[:,3:])
# print the sum of distances from all examples to the center of the cluster
print("within-cluster sum-of-squares (inertia) of the model is:", kmeans_model.inertia_)
within-cluster sum-of-squares (inertia) of the model is: 1066.6086426894992
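For context, an illustrative sketch on synthetic data (not the project's dataset): inertia_ is exactly the sum of squared distances from each point to its assigned cluster center, which can be verified by hand:

```python
import numpy as np
from sklearn.cluster import KMeans

# synthetic 3-feature data, standing in for the customer features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))

model = KMeans(n_clusters=4, init="k-means++", max_iter=500,
               n_init=10, random_state=42)
labels = model.fit_predict(X)

# sum of squared distances from each point to its assigned center
manual_inertia = ((X - model.cluster_centers_[labels]) ** 2).sum()
assert np.isclose(manual_inertia, model.inertia_)
```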
def make_list_of_K(K, dataframe):
'''inputs: K as integer and dataframe
apply k-means clustering to dataframe
and make a list of inertia values against 1 to K (inclusive)
return the inertia values list
'''
cluster_values = list(range(1, K+1))
inertia_values=[]
for c in cluster_values:
model = KMeans(
n_clusters = c,
init='k-means++',
max_iter=500,
random_state=42)
model.fit(dataframe)
inertia_values.append(model.inertia_)
return inertia_values
# save inertia values in a dataframe for k values between 1 to 15
results = make_list_of_K(15, customers.iloc[:, 3:])
k_values_distances = pd.DataFrame({"clusters": list(range(1, 16)),
"within cluster sum of squared distances": results})
# visualization for the selection of number of segments
fig = go.Figure()
fig.add_trace(go.Scatter(x=k_values_distances["clusters"],
y=k_values_distances["within cluster sum of squared distances"],
mode='lines+markers'))
fig.update_layout(xaxis = dict(
tickmode = 'linear',
tick0 = 1,
dtick = 1),
title_text="Within Cluster Sum of Squared Distances VS K Values",
xaxis_title="K values",
yaxis_title="Cluster sum of squared distances")
fig.show()
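The optimal k = 4 is picked here by visually inspecting the elbow in the plot above. As a rough, purely illustrative heuristic, the elbow can also be estimated numerically as the k with the sharpest bend (largest second difference) of the inertia curve; the inertia values below are hypothetical:

```python
import numpy as np

# hypothetical inertia values for k = 1..8: steep drops, then flattening
inertias = np.array([1000.0, 700.0, 450.0, 250.0, 220.0, 200.0, 190.0, 185.0])

# the second difference measures how sharply the curve bends at each interior k
bend = np.diff(inertias, n=2)        # entries correspond to k = 2, 3, ...
elbow_k = int(np.argmax(bend)) + 2   # shift index back to a k value
print(elbow_k)  # 4
```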
# create clustering model with optimal k=4
updated_kmeans_model = KMeans(n_clusters = 4,
init='k-means++',
max_iter=500,
random_state=42)
updated_kmeans_model.fit_predict(customers.iloc[:,3:])
array([3, 0, 3, ..., 0, 3, 0], dtype=int32)
# create cluster centers and actual data arrays
cluster_centers = updated_kmeans_model.cluster_centers_
actual_data = np.expm1(cluster_centers)
add_points = np.append(actual_data, cluster_centers, axis=1)
add_points
array([[1.01496335e+00, 1.15284613e-03, 7.65510085e+01, 7.00601007e-01,
        1.15218211e-03, 4.35093589e+00],
       [1.52690198e+00, 5.47445531e-04, 1.59884595e+02, 9.26994039e-01,
        5.47295737e-04, 5.08068731e+00],
       [3.94543624e+00, 5.55690433e-04, 5.80572876e+02, 1.59846518e+00,
        5.55536095e-04, 6.36573629e+00],
       [2.39821221e+00, 5.20180049e-04, 2.83774022e+02, 1.22324947e+00,
        5.20044802e-04, 5.65169596e+00]])
# add labels to customers dataframe and add_points array
add_points = np.append(add_points, [[0], [1], [2], [3]], axis=1)
customers["clusters"] = updated_kmeans_model.labels_
# create centers dataframe from add_points
centers_df = pd.DataFrame(data=add_points, columns=["products_ordered",
"average_return_rate",
"total_spending",
"log_products_ordered",
"log_average_return_rate",
"log_total_spending",
"clusters"])
centers_df.head()
products_ordered | average_return_rate | total_spending | log_products_ordered | log_average_return_rate | log_total_spending | clusters | |
---|---|---|---|---|---|---|---|
0 | 1.014963 | 0.001153 | 76.551009 | 0.700601 | 0.001152 | 4.350936 | 0.0 |
1 | 1.526902 | 0.000547 | 159.884595 | 0.926994 | 0.000547 | 5.080687 | 1.0 |
2 | 3.945436 | 0.000556 | 580.572876 | 1.598465 | 0.000556 | 6.365736 | 2.0 |
3 | 2.398212 | 0.000520 | 283.774022 | 1.223249 | 0.000520 | 5.651696 | 3.0 |
# cast the cluster labels of centers_df to int so they match the customers dataframe
centers_df["clusters"] = centers_df["clusters"].astype("int")
centers_df.head()
products_ordered | average_return_rate | total_spending | log_products_ordered | log_average_return_rate | log_total_spending | clusters | |
---|---|---|---|---|---|---|---|
0 | 1.014963 | 0.001153 | 76.551009 | 0.700601 | 0.001152 | 4.350936 | 0 |
1 | 1.526902 | 0.000547 | 159.884595 | 0.926994 | 0.000547 | 5.080687 | 1 |
2 | 3.945436 | 0.000556 | 580.572876 | 1.598465 | 0.000556 | 6.365736 | 2 |
3 | 2.398212 | 0.000520 | 283.774022 | 1.223249 | 0.000520 | 5.651696 | 3 |
customers.head()
products_ordered | average_return_rate | total_spending | log_products_ordered | log_average_return_rate | log_total_spending | clusters | |
---|---|---|---|---|---|---|---|
0 | 1 | 0.0 | 260.0 | 0.693147 | 0.0 | 5.564520 | 3 |
1 | 1 | 0.0 | 79.2 | 0.693147 | 0.0 | 4.384524 | 0 |
2 | 3 | 0.0 | 234.2 | 1.386294 | 0.0 | 5.460436 | 3 |
3 | 1 | 0.0 | 89.0 | 0.693147 | 0.0 | 4.499810 | 0 |
4 | 2 | 0.0 | 103.0 | 1.098612 | 0.0 | 4.644391 | 1 |
# differentiate between data points and cluster centers
customers["is_center"] = 0
centers_df["is_center"] = 1
# add dataframes together
customers = pd.concat([customers, centers_df], ignore_index=True)
customers.tail()
products_ordered | average_return_rate | total_spending | log_products_ordered | log_average_return_rate | log_total_spending | clusters | is_center | |
---|---|---|---|---|---|---|---|---|
24873 | 1.000000 | 0.000000 | 64.000000 | 0.693147 | 0.000000 | 4.174387 | 0 | 0 |
24874 | 1.014963 | 0.001153 | 76.551009 | 0.700601 | 0.001152 | 4.350936 | 0 | 1 |
24875 | 1.526902 | 0.000547 | 159.884595 | 0.926994 | 0.000547 | 5.080687 | 1 | 1 |
24876 | 3.945436 | 0.000556 | 580.572876 | 1.598465 | 0.000556 | 6.365736 | 2 | 1 |
24877 | 2.398212 | 0.000520 | 283.774022 | 1.223249 | 0.000520 | 5.651696 | 3 | 1 |
# add clusters to the dataframe
customers["cluster_name"] = customers["clusters"].astype(str)
# visualize the log-transformed customer segments with a 3D plot
fig = px.scatter_3d(customers,
x="log_products_ordered",
y="log_average_return_rate",
z="log_total_spending",
color='cluster_name',
hover_data=["products_ordered",
"average_return_rate",
"total_spending"],
category_orders = {"cluster_name":
["0", "1", "2", "3"]},
symbol = "is_center"
)
fig.update_layout(margin=dict(l=0, r=0, b=0, t=0))
fig.show()
# count the number of customers in each cluster
cardinality_df = pd.DataFrame(
customers.cluster_name.value_counts().reset_index())
cardinality_df.rename(columns={"index": "Customer Groups",
"cluster_name": "Customer Group Magnitude"},
inplace=True)
cardinality_df
Customer Groups | Customer Group Magnitude | |
---|---|---|
0 | 0 | 10468 |
1 | 1 | 7207 |
2 | 3 | 5120 |
3 | 2 | 2083 |
fig = px.bar(cardinality_df, x="Customer Groups",
y="Customer Group Magnitude",
color = "Customer Groups",
category_orders = {"Customer Groups": ["0", "1", "2", "3"]})
fig.update_layout(xaxis = dict(
tickmode = 'linear',
tick0 = 1,
dtick = 1),
yaxis = dict(
tickmode = 'linear',
tick0 = 1000,
dtick = 1000))
fig.show()