Association Analysis
First, import pandas and MLxtend and read in the data:
#imports
import pandas as pd #Python data analysis library
import numpy as np #Python scientific computing
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
Citation: "The Instacart Online Grocery Shopping Dataset 2017", accessed from https://www.instacart.com/datasets/grocery-shopping-2017
#import dataset
trainDf = pd.read_csv("../data/raw/order_products__train.csv")
orderDf = pd.read_csv("../data/raw/orders.csv")
productDf = pd.read_csv("../data/raw/products.csv")
We will merge the prior and train datasets to get the complete order dataset. DataFrame.append() was removed in pandas 2.0, so we use pd.concat() for this.
priorDf = pd.read_csv("../data/raw/order_products__prior.csv")
trainDf = pd.concat([trainDf, priorDf], ignore_index = True)
Since every row represents one purchased product, we can set the reordered column to 1 for every row and use it for counting:
trainDf['reordered'] = 1
productCountDf = trainDf.groupby("product_id",as_index = False)["order_id"].count()
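On a toy frame (with hypothetical order and product ids), the counting step works like this:

```python
import pandas as pd

# Toy stand-in for trainDf with hypothetical order/product ids
toy = pd.DataFrame({
    "order_id":   [1, 1, 2, 3, 3],
    "product_id": [10, 20, 10, 10, 20],
    "reordered":  [1, 1, 1, 1, 1],
})

# Count how many order lines each product appears in
counts = toy.groupby("product_id", as_index=False)["order_id"].count()
# product 10 appears in 3 orders, product 20 in 2
```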
#Top 100 most frequently purchased products
topLev = 100
#Here order_id is the count so we need to sort the data frame w.r.t order_id
productCountDf = productCountDf.sort_values("order_id",ascending = False)
topProdFrame = productCountDf.iloc[0:topLev,:]
topProdFrame = topProdFrame.merge(productDf,on = "product_id")
productId= topProdFrame.loc[:,["product_id"]]
Now we will filter the orders, keeping only those that contain the most frequently purchased products:
# Collect the matching rows for each top product, then concatenate once
frames = []
for i in range(topLev):
    pId = productId.iloc[i]['product_id']
    frames.append(trainDf[trainDf.product_id == pId])
df = pd.concat(frames)
df.head()
| | order_id | product_id | add_to_cart_order | reordered |
|---|---|---|---|---|
115 | 226 | 24852 | 2 | 1 |
156 | 473 | 24852 | 2 | 1 |
196 | 878 | 24852 | 2 | 1 |
272 | 1042 | 24852 | 1 | 1 |
297 | 1139 | 24852 | 1 | 1 |
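The loop above can also be written as a single vectorized filter with isin(), which avoids repeatedly growing a DataFrame. A sketch on toy stand-ins for trainDf and productId (hypothetical ids):

```python
import pandas as pd

# Toy stand-ins (hypothetical ids): order lines and the top-product list
trainDf = pd.DataFrame({"order_id":   [1, 2, 3, 4],
                        "product_id": [10, 99, 20, 10]})
productId = pd.DataFrame({"product_id": [10, 20]})

# One vectorized filter instead of appending inside a loop
df = trainDf[trainDf["product_id"].isin(productId["product_id"])]
# keeps the rows with product_id 10, 20, and 10
```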
Now we need to consolidate the items into one transaction per row, with each product one-hot encoded. Each row will represent an order and each column a product_id: if cell (i, j) is 1, then order i contains product j.
basket = df.groupby(['order_id', 'product_id'])['reordered'].sum().unstack().reset_index().fillna(0).set_index('order_id')
# Convert the counts to one-hot encoded values
def encode_units(x):
    if x <= 0:
        return 0
    return 1

basket_sets = basket.applymap(encode_units)
basket_sets.head()
order_id | 196 | 3957 | 4210 | 4605 | 4799 | 4920 | 5077 | 5450 | 5785 | 5876 | ... | 46667 | 46906 | 46979 | 47144 | 47209 | 47626 | 47766 | 48679 | 49235 | 49683 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 99 columns
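On a toy dataset, the pivot-and-encode steps look like this (hypothetical ids again; the clip to 0/1 has the same effect as applying encode_units cell by cell, since the sums are non-negative counts):

```python
import pandas as pd

# Toy order lines: order 1 has products 10 and 20; order 2 has product 10 twice
df = pd.DataFrame({
    "order_id":   [1, 1, 2, 2],
    "product_id": [10, 20, 10, 10],
    "reordered":  [1, 1, 1, 1],
})

# Pivot to one row per order, one column per product
basket = (df.groupby(["order_id", "product_id"])["reordered"]
            .sum().unstack().reset_index()
            .fillna(0).set_index("order_id"))

# Clip the counts to 0/1, equivalent to applymap(encode_units) here
basket_sets = (basket > 0).astype(int)
# order 1 -> [1, 1], order 2 -> [1, 0]
```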
basket_sets.size
241667217
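At roughly 240 million cells, this matrix is large. Recent versions of mlxtend's apriori also accept boolean input, and bool cells take 1 byte instead of 8 for int64, so downcasting can cut memory substantially. A sketch of the idea on a small frame (the savings on the real data will differ):

```python
import pandas as pd

# Small 0/1 frame standing in for basket_sets
basket_sets = pd.DataFrame({196: [1, 0], 49683: [0, 1]}, index=[1, 2])

# Downcast to bool: same information, a fraction of the memory
as_bool = basket_sets.astype(bool)
smaller = (as_bool.memory_usage(deep=True).sum()
           < basket_sets.memory_usage(deep=True).sum())
```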
Now that the data is structured properly, we can generate frequent item sets that have a support of at least 1%
# Build up the frequent items
frequent_itemsets = apriori(basket_sets, min_support=0.01, use_colnames=True)
frequent_itemsets
| | support | itemsets |
|---|---|---|
0 | 0.015279 | [196] |
1 | 0.016088 | [3957] |
2 | 0.015144 | [4210] |
3 | 0.031514 | [4605] |
4 | 0.015439 | [4799] |
5 | 0.035537 | [4920] |
6 | 0.025855 | [5077] |
7 | 0.021353 | [5450] |
8 | 0.020226 | [5785] |
9 | 0.037381 | [5876] |
10 | 0.014564 | [7781] |
11 | 0.018312 | [8174] |
12 | 0.035751 | [8277] |
13 | 0.018425 | [8424] |
14 | 0.029005 | [8518] |
15 | 0.023870 | [9076] |
16 | 0.015389 | [9387] |
17 | 0.016539 | [9839] |
18 | 0.024529 | [10749] |
19 | 0.017318 | [11520] |
20 | 0.021506 | [12341] |
21 | 0.161785 | [13176] |
22 | 0.016817 | [15290] |
23 | 0.015387 | [16759] |
24 | 0.061221 | [16797] |
25 | 0.030820 | [17794] |
26 | 0.017128 | [18465] |
27 | 0.031976 | [19057] |
28 | 0.023888 | [19660] |
29 | 0.018213 | [20114] |
... | ... | ... |
98 | 0.040854 | [49683] |
99 | 0.010551 | [4920, 24852] |
100 | 0.026505 | [13176, 21137] |
101 | 0.021551 | [13176, 21903] |
102 | 0.010476 | [13176, 22935] |
103 | 0.011306 | [13176, 27845] |
104 | 0.017321 | [13176, 27966] |
105 | 0.026530 | [13176, 47209] |
106 | 0.017689 | [16797, 24852] |
107 | 0.016293 | [21137, 21903] |
108 | 0.023895 | [21137, 24852] |
109 | 0.010146 | [21137, 27845] |
110 | 0.014556 | [21137, 27966] |
111 | 0.010252 | [21137, 39275] |
112 | 0.017342 | [21137, 47209] |
113 | 0.010271 | [21137, 47766] |
114 | 0.021873 | [21903, 24852] |
115 | 0.014810 | [21903, 47209] |
116 | 0.013228 | [21903, 47766] |
117 | 0.013561 | [24852, 26209] |
118 | 0.013390 | [24852, 27845] |
119 | 0.014401 | [24852, 28204] |
120 | 0.012142 | [24852, 45066] |
121 | 0.013178 | [24852, 47209] |
122 | 0.017631 | [24852, 47626] |
123 | 0.022781 | [24852, 47766] |
124 | 0.013453 | [24852, 49683] |
125 | 0.011879 | [26209, 47626] |
126 | 0.010984 | [27966, 47209] |
127 | 0.010555 | [47626, 47766] |
128 rows × 2 columns
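Support is simply the fraction of orders that contain an itemset, so any row of this table could be checked by hand. A toy sketch of the computation (hypothetical ids):

```python
import pandas as pd

# Toy basket matrix: 4 orders, 2 hypothetical products
basket_sets = pd.DataFrame({10: [1, 1, 1, 0],
                            20: [1, 1, 0, 0]},
                           index=[1, 2, 3, 4])

# Support of {10}: fraction of orders containing product 10
support_10 = basket_sets[10].mean()                          # 0.75

# Support of {10, 20}: fraction of orders containing both products
support_10_20 = (basket_sets[10] & basket_sets[20]).mean()   # 0.5
```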
The final step is to generate the rules with their corresponding support, confidence and lift:
# Create the rules
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
rules
| | antecedants | consequents | support | confidence | lift |
|---|---|---|---|---|---|
0 | (26209) | (47626) | 0.060080 | 0.197723 | 3.001750 |
1 | (47626) | (26209) | 0.065869 | 0.180345 | 3.001750 |
2 | (47766) | (21903) | 0.075468 | 0.175281 | 1.699910 |
3 | (21903) | (47766) | 0.103112 | 0.128289 | 1.699910 |
4 | (47209) | (27966) | 0.090483 | 0.121389 | 2.077938 |
5 | (27966) | (47209) | 0.058418 | 0.188018 | 2.077938 |
6 | (24852) | (21903) | 0.201259 | 0.108683 | 1.054029 |
7 | (21903) | (24852) | 0.103112 | 0.212133 | 1.054029 |
8 | (13176) | (21903) | 0.161785 | 0.133208 | 1.291881 |
9 | (21903) | (13176) | 0.103112 | 0.209007 | 1.291881 |
10 | (13176) | (27845) | 0.161785 | 0.069883 | 1.194505 |
11 | (27845) | (13176) | 0.058504 | 0.193253 | 1.194505 |
12 | (26209) | (24852) | 0.060080 | 0.225713 | 1.121500 |
13 | (24852) | (26209) | 0.201259 | 0.067380 | 1.121500 |
14 | (24852) | (16797) | 0.201259 | 0.087891 | 1.435638 |
15 | (16797) | (24852) | 0.061221 | 0.288936 | 1.435638 |
16 | (21137) | (47209) | 0.112891 | 0.153616 | 1.697728 |
17 | (47209) | (21137) | 0.090483 | 0.191659 | 1.697728 |
18 | (49683) | (24852) | 0.040854 | 0.329296 | 1.636175 |
19 | (24852) | (49683) | 0.201259 | 0.066844 | 1.636175 |
20 | (21137) | (39275) | 0.112891 | 0.090809 | 2.110653 |
21 | (39275) | (21137) | 0.043024 | 0.238274 | 2.110653 |
22 | (47626) | (47766) | 0.065869 | 0.160244 | 2.123337 |
23 | (47766) | (47626) | 0.075468 | 0.139862 | 2.123337 |
24 | (24852) | (47766) | 0.201259 | 0.113194 | 1.499886 |
25 | (47766) | (24852) | 0.075468 | 0.301866 | 1.499886 |
26 | (13176) | (27966) | 0.161785 | 0.107065 | 1.832734 |
27 | (27966) | (13176) | 0.058418 | 0.296508 | 1.832734 |
28 | (21137) | (27845) | 0.112891 | 0.089873 | 1.536191 |
29 | (27845) | (21137) | 0.058504 | 0.173423 | 1.536191 |
30 | (13176) | (47209) | 0.161785 | 0.163981 | 1.812281 |
31 | (47209) | (13176) | 0.090483 | 0.293199 | 1.812281 |
32 | (47209) | (21903) | 0.090483 | 0.163679 | 1.587393 |
33 | (21903) | (47209) | 0.103112 | 0.143632 | 1.587393 |
34 | (45066) | (24852) | 0.034132 | 0.355725 | 1.767494 |
35 | (24852) | (45066) | 0.201259 | 0.060329 | 1.767494 |
36 | (21137) | (21903) | 0.112891 | 0.144326 | 1.399704 |
37 | (21903) | (21137) | 0.103112 | 0.158014 | 1.399704 |
38 | (24852) | (27845) | 0.201259 | 0.066529 | 1.137168 |
39 | (27845) | (24852) | 0.058504 | 0.228866 | 1.137168 |
40 | (28204) | (24852) | 0.038052 | 0.378441 | 1.880364 |
41 | (24852) | (28204) | 0.201259 | 0.071552 | 1.880364 |
42 | (21137) | (27966) | 0.112891 | 0.128940 | 2.207206 |
43 | (27966) | (21137) | 0.058418 | 0.249174 | 2.207206 |
44 | (4920) | (24852) | 0.035537 | 0.296906 | 1.475240 |
45 | (24852) | (4920) | 0.201259 | 0.052425 | 1.475240 |
46 | (21137) | (47766) | 0.112891 | 0.090980 | 1.205542 |
47 | (47766) | (21137) | 0.075468 | 0.136095 | 1.205542 |
48 | (47626) | (24852) | 0.065869 | 0.267663 | 1.329938 |
49 | (24852) | (47626) | 0.201259 | 0.087602 | 1.329938 |
50 | (13176) | (22935) | 0.161785 | 0.064756 | 1.342844 |
51 | (22935) | (13176) | 0.048223 | 0.217252 | 1.342844 |
52 | (13176) | (21137) | 0.161785 | 0.163832 | 1.451233 |
53 | (21137) | (13176) | 0.112891 | 0.234787 | 1.451233 |
54 | (21137) | (24852) | 0.112891 | 0.211665 | 1.051702 |
55 | (24852) | (21137) | 0.201259 | 0.118728 | 1.051702 |
We can see quite a few rules with a high lift value, which means the combination occurs more frequently than would be expected given the overall frequencies of the individual products. We can also see several rules where the confidence is high as well.
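As a sanity check, lift can be recomputed by hand: lift(A→B) = confidence(A→B) / support(B). In the mlxtend version that produced this output, the support column appears to hold the antecedent's support (the numbers are consistent with that reading), so the consequent's support for rule 0 can be read off the mirrored rule in row 1:

```python
# Rule 0: (26209) -> (47626) with confidence 0.197723 and lift 3.001750.
# The consequent's support appears as the support of the mirrored rule 1.
confidence = 0.197723
consequent_support = 0.065869
lift = confidence / consequent_support   # close to the table's 3.001750
```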
We can filter the dataframe using standard pandas code. In this case, we look for rules with a large lift (at least 2) and high confidence (at least 0.1):
rules[ (rules['lift'] >= 2) &
(rules['confidence'] >= 0.1) ]
| | antecedants | consequents | support | confidence | lift |
|---|---|---|---|---|---|
0 | (26209) | (47626) | 0.060080 | 0.197723 | 3.001750 |
1 | (47626) | (26209) | 0.065869 | 0.180345 | 3.001750 |
4 | (47209) | (27966) | 0.090483 | 0.121389 | 2.077938 |
5 | (27966) | (47209) | 0.058418 | 0.188018 | 2.077938 |
21 | (39275) | (21137) | 0.043024 | 0.238274 | 2.110653 |
22 | (47626) | (47766) | 0.065869 | 0.160244 | 2.123337 |
23 | (47766) | (47626) | 0.075468 | 0.139862 | 2.123337 |
42 | (21137) | (27966) | 0.112891 | 0.128940 | 2.207206 |
43 | (27966) | (21137) | 0.058418 | 0.249174 | 2.207206 |
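Raw product ids make the rules hard to read, so a natural last step is to sort by lift and map ids back to names. A sketch on toy data with hypothetical ids and names; on the real data, the lookup would come from the productDf frame loaded from products.csv:

```python
import pandas as pd

# Toy rules frame and product lookup (hypothetical ids and names)
rules = pd.DataFrame({
    "antecedants": [frozenset([10]), frozenset([20])],
    "consequents": [frozenset([20]), frozenset([10])],
    "lift": [2.1, 3.0],
})
productDf = pd.DataFrame({"product_id": [10, 20],
                          "product_name": ["Product A", "Product B"]})
name_of = dict(zip(productDf["product_id"], productDf["product_name"]))

# Sort the strongest rules first and translate ids into names
rules = rules.sort_values("lift", ascending=False)
for col in ("antecedants", "consequents"):
    rules[col] = rules[col].apply(lambda s: {name_of[i] for i in s})
```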