Market Basket Analysis

Association Analysis

Get pandas and MLxtend code imported and read the data:

In [1]:
#imports
import pandas as pd #Python data analysis library
import numpy as np #Python scientific computing
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

Citation - “The Instacart Online Grocery Shopping Dataset 2017”, Accessed from https://www.instacart.com/datasets/grocery-shopping-2017"

In [2]:
#import dataset
trainDf = pd.read_csv("../data/raw/order_products__train.csv")
orderDf = pd.read_csv("../data/raw/orders.csv")
productDf = pd.read_csv("../data/raw/products.csv")

We will merge the prior and train dataset to get the complete order dataset. We will use append() for doing so.

In [3]:
priorDf = pd.read_csv("../data/raw/order_products__prior.csv")
trainDf = trainDf.append(priorDf,ignore_index = True)

For counting each product, we can assign reordered column as 1

In [4]:
trainDf['reordered'] = 1 
In [5]:
productCountDf = trainDf.groupby("product_id",as_index = False)["order_id"].count()
In [6]:
#Top 100 most frequently purchased products
topLev = 100

#Here order_id is the count so we need to sort the data frame w.r.t order_id
productCountDf = productCountDf.sort_values("order_id",ascending = False)

topProdFrame = productCountDf.iloc[0:topLev,:]
topProdFrame = topProdFrame.merge(productDf,on = "product_id")
productId= topProdFrame.loc[:,["product_id"]]

Now we will filter the orders and get orders containting the the most frequently purchased products

In [7]:
df = trainDf[0:0]
for i in range(0,99):
    pId = productId.iloc[i]['product_id'] 
    stDf = trainDf[trainDf.product_id == pId ]
    df = df.append(stDf,ignore_index = False)
In [8]:
df.head()
Out[8]:
order_id product_id add_to_cart_order reordered
115 226 24852 2 1
156 473 24852 2 1
196 878 24852 2 1
272 1042 24852 1 1
297 1139 24852 1 1

Now we need to consolidate the items into 1 transaction per row with each product 1 hot encoded. Each row will represent an order and each column will represent product_id. If the cell value is '1' say (i,j) then ith order contains jth product.

In [9]:
basket = df.groupby(['order_id', 'product_id'])['reordered'].sum().unstack().reset_index().fillna(0).set_index('order_id')
In [10]:
# Convert the units to 1 hot encoded values
def encode_units(x):
    if x <= 0:
        return 0
    if x >= 1:
        return 1    
In [11]:
basket_sets = basket.applymap(encode_units)
In [12]:
basket_sets.head()
Out[12]:
product_id 196 3957 4210 4605 4799 4920 5077 5450 5785 5876 ... 46667 46906 46979 47144 47209 47626 47766 48679 49235 49683
order_id
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 1 0 0 0 0 1
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 1 0 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 1 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 99 columns

In [13]:
basket_sets.size
Out[13]:
241667217

Now that the data is structured properly, we can generate frequent item sets that have a support of at least 1%

In [14]:
# Build up the frequent items
frequent_itemsets = apriori(basket_sets, min_support=0.01, use_colnames=True)
In [15]:
frequent_itemsets
Out[15]:
support itemsets
0 0.015279 [196]
1 0.016088 [3957]
2 0.015144 [4210]
3 0.031514 [4605]
4 0.015439 [4799]
5 0.035537 [4920]
6 0.025855 [5077]
7 0.021353 [5450]
8 0.020226 [5785]
9 0.037381 [5876]
10 0.014564 [7781]
11 0.018312 [8174]
12 0.035751 [8277]
13 0.018425 [8424]
14 0.029005 [8518]
15 0.023870 [9076]
16 0.015389 [9387]
17 0.016539 [9839]
18 0.024529 [10749]
19 0.017318 [11520]
20 0.021506 [12341]
21 0.161785 [13176]
22 0.016817 [15290]
23 0.015387 [16759]
24 0.061221 [16797]
25 0.030820 [17794]
26 0.017128 [18465]
27 0.031976 [19057]
28 0.023888 [19660]
29 0.018213 [20114]
... ... ...
98 0.040854 [49683]
99 0.010551 [4920, 24852]
100 0.026505 [13176, 21137]
101 0.021551 [13176, 21903]
102 0.010476 [13176, 22935]
103 0.011306 [13176, 27845]
104 0.017321 [13176, 27966]
105 0.026530 [13176, 47209]
106 0.017689 [16797, 24852]
107 0.016293 [21137, 21903]
108 0.023895 [21137, 24852]
109 0.010146 [21137, 27845]
110 0.014556 [21137, 27966]
111 0.010252 [21137, 39275]
112 0.017342 [21137, 47209]
113 0.010271 [21137, 47766]
114 0.021873 [21903, 24852]
115 0.014810 [21903, 47209]
116 0.013228 [21903, 47766]
117 0.013561 [24852, 26209]
118 0.013390 [24852, 27845]
119 0.014401 [24852, 28204]
120 0.012142 [24852, 45066]
121 0.013178 [24852, 47209]
122 0.017631 [24852, 47626]
123 0.022781 [24852, 47766]
124 0.013453 [24852, 49683]
125 0.011879 [26209, 47626]
126 0.010984 [27966, 47209]
127 0.010555 [47626, 47766]

128 rows × 2 columns

The final step is to generate the rules with their corresponding support, confidence and lift:

In [16]:
# Create the rules
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
rules
Out[16]:
antecedants consequents support confidence lift
0 (26209) (47626) 0.060080 0.197723 3.001750
1 (47626) (26209) 0.065869 0.180345 3.001750
2 (47766) (21903) 0.075468 0.175281 1.699910
3 (21903) (47766) 0.103112 0.128289 1.699910
4 (47209) (27966) 0.090483 0.121389 2.077938
5 (27966) (47209) 0.058418 0.188018 2.077938
6 (24852) (21903) 0.201259 0.108683 1.054029
7 (21903) (24852) 0.103112 0.212133 1.054029
8 (13176) (21903) 0.161785 0.133208 1.291881
9 (21903) (13176) 0.103112 0.209007 1.291881
10 (13176) (27845) 0.161785 0.069883 1.194505
11 (27845) (13176) 0.058504 0.193253 1.194505
12 (26209) (24852) 0.060080 0.225713 1.121500
13 (24852) (26209) 0.201259 0.067380 1.121500
14 (24852) (16797) 0.201259 0.087891 1.435638
15 (16797) (24852) 0.061221 0.288936 1.435638
16 (21137) (47209) 0.112891 0.153616 1.697728
17 (47209) (21137) 0.090483 0.191659 1.697728
18 (49683) (24852) 0.040854 0.329296 1.636175
19 (24852) (49683) 0.201259 0.066844 1.636175
20 (21137) (39275) 0.112891 0.090809 2.110653
21 (39275) (21137) 0.043024 0.238274 2.110653
22 (47626) (47766) 0.065869 0.160244 2.123337
23 (47766) (47626) 0.075468 0.139862 2.123337
24 (24852) (47766) 0.201259 0.113194 1.499886
25 (47766) (24852) 0.075468 0.301866 1.499886
26 (13176) (27966) 0.161785 0.107065 1.832734
27 (27966) (13176) 0.058418 0.296508 1.832734
28 (21137) (27845) 0.112891 0.089873 1.536191
29 (27845) (21137) 0.058504 0.173423 1.536191
30 (13176) (47209) 0.161785 0.163981 1.812281
31 (47209) (13176) 0.090483 0.293199 1.812281
32 (47209) (21903) 0.090483 0.163679 1.587393
33 (21903) (47209) 0.103112 0.143632 1.587393
34 (45066) (24852) 0.034132 0.355725 1.767494
35 (24852) (45066) 0.201259 0.060329 1.767494
36 (21137) (21903) 0.112891 0.144326 1.399704
37 (21903) (21137) 0.103112 0.158014 1.399704
38 (24852) (27845) 0.201259 0.066529 1.137168
39 (27845) (24852) 0.058504 0.228866 1.137168
40 (28204) (24852) 0.038052 0.378441 1.880364
41 (24852) (28204) 0.201259 0.071552 1.880364
42 (21137) (27966) 0.112891 0.128940 2.207206
43 (27966) (21137) 0.058418 0.249174 2.207206
44 (4920) (24852) 0.035537 0.296906 1.475240
45 (24852) (4920) 0.201259 0.052425 1.475240
46 (21137) (47766) 0.112891 0.090980 1.205542
47 (47766) (21137) 0.075468 0.136095 1.205542
48 (47626) (24852) 0.065869 0.267663 1.329938
49 (24852) (47626) 0.201259 0.087602 1.329938
50 (13176) (22935) 0.161785 0.064756 1.342844
51 (22935) (13176) 0.048223 0.217252 1.342844
52 (13176) (21137) 0.161785 0.163832 1.451233
53 (21137) (13176) 0.112891 0.234787 1.451233
54 (21137) (24852) 0.112891 0.211665 1.051702
55 (24852) (21137) 0.201259 0.118728 1.051702

We can also see several where the confidence is high as well.or instance, we can see that there are quite a few rules with a high lift value which means that it occurs more frequently than would be expected given the number of transaction and product combinations.

We can filter the dataframe using standard pandas code. In this case, look for a large lift (2) and high confidence (.1):

In [17]:
rules[ (rules['lift'] >= 2) &
       (rules['confidence'] >= 0.1) ]
Out[17]:
antecedants consequents support confidence lift
0 (26209) (47626) 0.060080 0.197723 3.001750
1 (47626) (26209) 0.065869 0.180345 3.001750
4 (47209) (27966) 0.090483 0.121389 2.077938
5 (27966) (47209) 0.058418 0.188018 2.077938
21 (39275) (21137) 0.043024 0.238274 2.110653
22 (47626) (47766) 0.065869 0.160244 2.123337
23 (47766) (47626) 0.075468 0.139862 2.123337
42 (21137) (27966) 0.112891 0.128940 2.207206
43 (27966) (21137) 0.058418 0.249174 2.207206