Market Basket Analysis¶

Association Analysis

Get pandas and MLxtend code imported and read the data:

In [1]:

#imports
import pandas as pd #Python data analysis library
import numpy as np #Python scientific computing
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

Citation - “The Instacart Online Grocery Shopping Dataset 2017”, Accessed from https://www.instacart.com/datasets/grocery-shopping-2017"

In [2]:

#import dataset
trainDf = pd.read_csv("../data/raw/order_products__train.csv")
orderDf = pd.read_csv("../data/raw/orders.csv")
productDf = pd.read_csv("../data/raw/products.csv")

We will merge the prior and train dataset to get the complete order dataset. We will use append() for doing so.

In [3]:

priorDf = pd.read_csv("../data/raw/order_products__prior.csv")
trainDf = trainDf.append(priorDf,ignore_index = True)

For counting each product, we can assign reordered column as 1

In [4]:

trainDf['reordered'] = 1 

In [5]:

productCountDf = trainDf.groupby("product_id",as_index = False)["order_id"].count()

In [6]:

#Top 100 most frequently purchased products
topLev = 100

#Here order_id is the count so we need to sort the data frame w.r.t order_id
productCountDf = productCountDf.sort_values("order_id",ascending = False)

topProdFrame = productCountDf.iloc[0:topLev,:]
topProdFrame = topProdFrame.merge(productDf,on = "product_id")
productId= topProdFrame.loc[:,["product_id"]]

Now we will filter the orders and get orders containting the the most frequently purchased products

In [7]:

df = trainDf[0:0]
for i in range(0,99):
    pId = productId.iloc[i]['product_id'] 
    stDf = trainDf[trainDf.product_id == pId ]
    df = df.append(stDf,ignore_index = False)

In [8]:

df.head()

Out[8]:

	order_id	product_id	add_to_cart_order	reordered
115	226	24852	2	1
156	473	24852	2	1
196	878	24852	2	1
272	1042	24852	1	1
297	1139	24852	1	1

Now we need to consolidate the items into 1 transaction per row with each product 1 hot encoded. Each row will represent an order and each column will represent product_id. If the cell value is '1' say (i,j) then ith order contains jth product.

In [9]:

basket = df.groupby(['order_id', 'product_id'])['reordered'].sum().unstack().reset_index().fillna(0).set_index('order_id')

In [10]:

# Convert the units to 1 hot encoded values
def encode_units(x):
    if x <= 0:
        return 0
    if x >= 1:
        return 1    

In [11]:

basket_sets = basket.applymap(encode_units)

In [12]:

basket_sets.head()

Out[12]:

product_id	196	3957	4210	4605	4799	4920	5077	5450	5785	5876	...	46667	46906	46979	47144	47209	47626	47766	48679	49235	49683
order_id
1	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	1	0	0	0	0	1
2	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
3	0	0	0	0	0	0	0	0	0	0	...	1	0	0	0	0	0	0	0	0	0
5	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	1	0	0	0	0	0
9	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0

5 rows × 99 columns

In [13]:

basket_sets.size

Out[13]:

241667217

Now that the data is structured properly, we can generate frequent item sets that have a support of at least 1%

In [14]:

# Build up the frequent items
frequent_itemsets = apriori(basket_sets, min_support=0.01, use_colnames=True)

In [15]:

frequent_itemsets

Out[15]:

	support	itemsets
0	0.015279	[196]
1	0.016088	[3957]
2	0.015144	[4210]
3	0.031514	[4605]
4	0.015439	[4799]
5	0.035537	[4920]
6	0.025855	[5077]
7	0.021353	[5450]
8	0.020226	[5785]
9	0.037381	[5876]
10	0.014564	[7781]
11	0.018312	[8174]
12	0.035751	[8277]
13	0.018425	[8424]
14	0.029005	[8518]
15	0.023870	[9076]
16	0.015389	[9387]
17	0.016539	[9839]
18	0.024529	[10749]
19	0.017318	[11520]
20	0.021506	[12341]
21	0.161785	[13176]
22	0.016817	[15290]
23	0.015387	[16759]
24	0.061221	[16797]
25	0.030820	[17794]
26	0.017128	[18465]
27	0.031976	[19057]
28	0.023888	[19660]
29	0.018213	[20114]
...	...	...
98	0.040854	[49683]
99	0.010551	[4920, 24852]
100	0.026505	[13176, 21137]
101	0.021551	[13176, 21903]
102	0.010476	[13176, 22935]
103	0.011306	[13176, 27845]
104	0.017321	[13176, 27966]
105	0.026530	[13176, 47209]
106	0.017689	[16797, 24852]
107	0.016293	[21137, 21903]
108	0.023895	[21137, 24852]
109	0.010146	[21137, 27845]
110	0.014556	[21137, 27966]
111	0.010252	[21137, 39275]
112	0.017342	[21137, 47209]
113	0.010271	[21137, 47766]
114	0.021873	[21903, 24852]
115	0.014810	[21903, 47209]
116	0.013228	[21903, 47766]
117	0.013561	[24852, 26209]
118	0.013390	[24852, 27845]
119	0.014401	[24852, 28204]
120	0.012142	[24852, 45066]
121	0.013178	[24852, 47209]
122	0.017631	[24852, 47626]
123	0.022781	[24852, 47766]
124	0.013453	[24852, 49683]
125	0.011879	[26209, 47626]
126	0.010984	[27966, 47209]
127	0.010555	[47626, 47766]

128 rows × 2 columns

The final step is to generate the rules with their corresponding support, confidence and lift:

In [16]:

# Create the rules
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
rules

Out[16]:

	antecedants	consequents	support	confidence	lift
0	(26209)	(47626)	0.060080	0.197723	3.001750
1	(47626)	(26209)	0.065869	0.180345	3.001750
2	(47766)	(21903)	0.075468	0.175281	1.699910
3	(21903)	(47766)	0.103112	0.128289	1.699910
4	(47209)	(27966)	0.090483	0.121389	2.077938
5	(27966)	(47209)	0.058418	0.188018	2.077938
6	(24852)	(21903)	0.201259	0.108683	1.054029
7	(21903)	(24852)	0.103112	0.212133	1.054029
8	(13176)	(21903)	0.161785	0.133208	1.291881
9	(21903)	(13176)	0.103112	0.209007	1.291881
10	(13176)	(27845)	0.161785	0.069883	1.194505
11	(27845)	(13176)	0.058504	0.193253	1.194505
12	(26209)	(24852)	0.060080	0.225713	1.121500
13	(24852)	(26209)	0.201259	0.067380	1.121500
14	(24852)	(16797)	0.201259	0.087891	1.435638
15	(16797)	(24852)	0.061221	0.288936	1.435638
16	(21137)	(47209)	0.112891	0.153616	1.697728
17	(47209)	(21137)	0.090483	0.191659	1.697728
18	(49683)	(24852)	0.040854	0.329296	1.636175
19	(24852)	(49683)	0.201259	0.066844	1.636175
20	(21137)	(39275)	0.112891	0.090809	2.110653
21	(39275)	(21137)	0.043024	0.238274	2.110653
22	(47626)	(47766)	0.065869	0.160244	2.123337
23	(47766)	(47626)	0.075468	0.139862	2.123337
24	(24852)	(47766)	0.201259	0.113194	1.499886
25	(47766)	(24852)	0.075468	0.301866	1.499886
26	(13176)	(27966)	0.161785	0.107065	1.832734
27	(27966)	(13176)	0.058418	0.296508	1.832734
28	(21137)	(27845)	0.112891	0.089873	1.536191
29	(27845)	(21137)	0.058504	0.173423	1.536191
30	(13176)	(47209)	0.161785	0.163981	1.812281
31	(47209)	(13176)	0.090483	0.293199	1.812281
32	(47209)	(21903)	0.090483	0.163679	1.587393
33	(21903)	(47209)	0.103112	0.143632	1.587393
34	(45066)	(24852)	0.034132	0.355725	1.767494
35	(24852)	(45066)	0.201259	0.060329	1.767494
36	(21137)	(21903)	0.112891	0.144326	1.399704
37	(21903)	(21137)	0.103112	0.158014	1.399704
38	(24852)	(27845)	0.201259	0.066529	1.137168
39	(27845)	(24852)	0.058504	0.228866	1.137168
40	(28204)	(24852)	0.038052	0.378441	1.880364
41	(24852)	(28204)	0.201259	0.071552	1.880364
42	(21137)	(27966)	0.112891	0.128940	2.207206
43	(27966)	(21137)	0.058418	0.249174	2.207206
44	(4920)	(24852)	0.035537	0.296906	1.475240
45	(24852)	(4920)	0.201259	0.052425	1.475240
46	(21137)	(47766)	0.112891	0.090980	1.205542
47	(47766)	(21137)	0.075468	0.136095	1.205542
48	(47626)	(24852)	0.065869	0.267663	1.329938
49	(24852)	(47626)	0.201259	0.087602	1.329938
50	(13176)	(22935)	0.161785	0.064756	1.342844
51	(22935)	(13176)	0.048223	0.217252	1.342844
52	(13176)	(21137)	0.161785	0.163832	1.451233
53	(21137)	(13176)	0.112891	0.234787	1.451233
54	(21137)	(24852)	0.112891	0.211665	1.051702
55	(24852)	(21137)	0.201259	0.118728	1.051702

We can also see several where the confidence is high as well.or instance, we can see that there are quite a few rules with a high lift value which means that it occurs more frequently than would be expected given the number of transaction and product combinations.

We can filter the dataframe using standard pandas code. In this case, look for a large lift (2) and high confidence (.1):

In [17]:

rules[ (rules['lift'] >= 2) &
       (rules['confidence'] >= 0.1) ]

Out[17]:

	antecedants	consequents	support	confidence	lift
0	(26209)	(47626)	0.060080	0.197723	3.001750
1	(47626)	(26209)	0.065869	0.180345	3.001750
4	(47209)	(27966)	0.090483	0.121389	2.077938
5	(27966)	(47209)	0.058418	0.188018	2.077938
21	(39275)	(21137)	0.043024	0.238274	2.110653
22	(47626)	(47766)	0.065869	0.160244	2.123337
23	(47766)	(47626)	0.075468	0.139862	2.123337
42	(21137)	(27966)	0.112891	0.128940	2.207206
43	(27966)	(21137)	0.058418	0.249174	2.207206