Notebook

Clustering Methods for Mixed Data Types¶

Notebook by Marco Tavora ¶

Table of contents¶

Introduction
Data

Introduction¶

Using clustering algorithms we can understand how a sample might be comprised of different subgroups. In the present case, the data to be clustered includes both categorical and continuous data and standard distance measures such as Euclidean cannot be used. In the following we will use Gower distances.

Data¶

[go back to the top]

From Kaggle:

Rossmann operates over 3,000 drug stores in 7 European countries. Currently, Rossmann store managers are tasked with predicting their daily sales for up to six weeks in advance. Store sales are influenced by many factors, including promotions, competition, school and state holidays, seasonality, and locality. With thousands of individual managers predicting sales based on their unique circumstances, the accuracy of results can be quite varied.

Data dictionary¶

I will consider only some of the data fields, namely:

Store - a unique Id for each store
StoreType - differentiates between 4 different store models: a, b, c, d
Assortment - describes an assortment level: a = basic, b = extra, c = extended
CompetitionDistance - distance in meters to the nearest competitor store
CompetitionOpenSinceMonth - gives the approximate month of the time the nearest competitor was opened

In [48]:

store <- read.csv('store.csv')
store <- store[,c("Store","StoreType","Assortment", "CompetitionDistance", "CompetitionOpenSinceMonth")]
head(store)

Store	StoreType	Assortment	CompetitionDistance	CompetitionOpenSinceMonth
1	c	a	1270	9
2	a	a	570	11
3	a	a	14130	12
4	c	c	620	9
5	a	a	29910	4
6	a	a	310	12

We will use the following packages:

dplyr for data cleaning
cluster for gower similarity and pam
Rtsne for t-SNE plot
ggplot2 for visualization

In [49]:

set.seed(1680) 
library(dplyr) 
library(cluster)
library(Rtsne) 
library(ggplot2)

Clustering Analysis of Mixed Types¶

[go back to the top]

From R-bloggers:

While many introductions to cluster analysis typically review a simple application using continuous variables, clustering data of mixed types (e.g., continuous, ordinal, and nominal) is often of interest. The following is an overview of one approach to clustering data of mixed types using Gower distance, partitioning around medoids, and silhouette width.

Decisions to make¶

Calculate the distances
Choose the cluster algo
Selecting the number of clusters

Gower distance¶

[go back to the top]

For each variable type there is a particular distance metric that works well for that type. The metric is scaled to fall between 0 and 1. After that, a linear combination with user-specified weights (e.g. averages) is calculated and a final distance matrix is build. The metrics by type are:
- quantitative (interval): range-normalized Manhattan distance
- ordinal: variable is first ranked, then Manhattan distance is used with a special adjustment for ties
- nominal: variables of k categories are converted into k binary columns and the Dice coefficient is used
The Gower distance is sensitive to non-normality and outliers of continuous variables. Hence pre-processing is a recommended.

Pre-processing to eliminate non-normality¶

In [50]:

library(gpairs)
library(corrplot)
library(gplots)
library(car)

In [51]:

lambda_CompetitionDistance <- coef(powerTransform(store$CompetitionDistance))
lambda_CompetitionOpenSinceMonth <- coef(powerTransform(store$CompetitionOpenSinceMonth))
lambda_CompetitionDistance
lambda_CompetitionOpenSinceMonth

store$CompetitionDistance: 0.0989815175631127

store$CompetitionOpenSinceMonth: 0.907653009482875

In [52]:

par(mfrow=c(2,1))
hist(store$CompetitionDistance, xlab="Original variable", 
     main="Histogram of original variable")
hist(bcPower(store$CompetitionDistance, lambda_CompetitionDistance),
     xlab="Box-Cox Transform", ylab="New Distribution", 
     main="Transformed Distribution")

In [53]:

store$CompetitionDistance <- bcPower(store$CompetitionDistance, lambda_CompetitionDistance)

Drop `NA`¶

In [54]:

store = na.omit(store)

In [55]:

library(cluster)

In [56]:

gower.dist <- daisy(store, metric = "gower")

In [57]:

summary(gower.dist)

289180 dissimilarities, summarized :
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
0.0009001 0.2809200 0.3934200 0.3962600 0.5256600 0.9041400 
Metric :  mixed ;  Types = I, N, N, I, I 
Number of objects : 761

Choosing a clustering algorithm¶

We will use partition around medoids (PAM).

In [58]:

sil_width <- c(NA)

for(i in 2:10){
  
  pam_fit <- pam(gower.dist,
                 diss = TRUE,
                 k = i)
  
  sil_width[i] <- pam_fit$silinfo$avg.width
  
}

pam_fit

Medoids:
      ID         
 [1,] "458" "668"
 [2,] "139" "200"
 [3,] "345" "519"
 [4,] "430" "630"
 [5,] "339" "507"
 [6,] "335" "501"
 [7,] "253" "369"
 [8,] "219" "317"
 [9,] "396" "590"
[10,] "559" "813"
Clustering vector:
   1    2    3    4    5    6    7    8    9   10   11   14   15   17   18   20 
   1    2    2    3    4    2    5    2    6    2    6    4    7    2    7    8 
  21   23   24   25   27   28   30   31   33   34   35   36   37   38   39   44 
   3    8    5    1    4    2    4    7    5    1    9    6    1    8    2    2 
  45   46   47   48   49   50   51   52   53   54   55   56   57   58   59   60 
   8    1    5    2    9    8    5    7    6    9    2    7    7    5    6    9 
  61   63   65   67   71   72   73   75   76   77   78   81   82   84   85   86 
   6    3    5    5    2    2    6    9    7    9    2    4    4    6    2    4 
  87   88   89   90   95   96   98   99  102  103  104  106  107  108  109  110 
   2    2    2    2    2    4    9    3    2    7    2    2    2    9    6    5 
 112  113  115  116  117  118  119  120  121  122  123  124  125  126  127  131 
   2    9    7    4    2    9    5    8    2    5    2    4    2    8    8    1 
 133  134  136  137  138  139  140  142  143  146  148  149  150  151  153  156 
   2    2    6    2    6    4    6    2    8    9    2    8    3    9    2    4 
 157  159  160  161  162  163  164  165  166  167  169  170  173  177  180  181 
   6    8    9    5    7    4    2    4    5    4    8    2    2    4    8    4 
 185  186  189  190  191  192  196  197  198  199  200  202  203  204  205  208 
   7    2    8    2    2    7    1    1    2    9    2    7    3    2    2    1 
 209  210  211  212  213  214  219  220  221  222  223  225  229  230  231  232 
   6    8    6    6    7    8    4    2    9    2    9    8    7    9    9    3 
 235  236  237  240  242  244  246  247  248  249  254  255  256  257  258  260 
   4    2    2    4    8    8    1    9    6    9    8    3    6    2    2    2 
 261  262  263  264  266  267  268  269  270  272  275  276  278  280  281  282 
   7    8    5    4    5    1    4    6    2    2    8    2    5    9    9    2 
 286  287  289  290  292  293  294  295  296  297  298  299  300  301  302  303 
   4    1    8    4    2    3    4    2    4    2    8    9    5    5    9    2 
 304  305  306  307  308  311  312  314  315  316  317  318  319  320  321  322 
   2    3    4    2    2    6    8    2    5    8    8    7    6    6    3    4 
 323  325  326  327  328  329  332  333  334  336  337  341  343  344  347  349 
   7    5    8    3    2    4    4    5    9    2    7    2    8    5    7    3 
 351  354  355  356  357  358  360  361  363  366  367  368  369  370  371  372 
   2    9    6    9    2    2    4    3    2    9    9    7    7    8    7    9 
 374  375  376  377  378  380  381  382  385  386  389  390  391  393  395  399 
   2    5    2    6    6    4    2    3    8    7    6    5    2    7    4    2 
 400  401  402  403  405  406  407  410  413  415  416  418  419  421  423  427 
   2    6    3    2    4    7    2    1    6    7    5    4    1    3    4    6 
 428  429  430  432  433  434  439  440  443  444  446  447  448  449  450  451 
   8    7    9    4    6    2    2    8    8    1    2    6    6    6    1    4 
 452  455  459  460  461  462  464  465  466  467  468  469  472  475  476  477 
   6    9   10    4    9   10    1    7    5    5    3    3    3   10    8    8 
 478  479  480  482  483  484  487  488  489  490  492  494  496  500  501  502 
   7    2    2    1    6    5    9    5    2    4    4    4    7    9    6    4 
 503  506  507  509  511  513  514  518  519  521  522  523  524  525  527  529 
   9    2    5    4   10   10    3    7    3    8    9    3    6    9    7    9 
 532  534  535  536  537  538  539  541  542  543  544  545  546  547  548  550 
   6    8    4    6    4    4    4    6   10    1    2    5    4    9    7    9 
 551  552  553  555  556  558  559  560  563  565  567  569  570  571  572  573 
   6    4    1    8    9    4    8    3    4    6    1   10   10    8    7   10 
 575  576  578  579  580  581  583  585  586  587  588  590  592  593  594  595 
   4    1    8    1    5   10   10    7    6    9    7    9    4    5   10    3 
 596  597  598  599  600  601  602  603  604  605  606  607  608  609  610  611 
   1   10    1    9    9    8    4    4    8    8   10   10    5   10    4   10 
 612  613  614  615  616  619  621  623  624  625  626  627  629  630  632  633 
   9    1   10    8    6    4    4    4    6    4    3    3    8    4    4    8 
 635  636  638  639  641  642  643  644  646  647  648  649  650  653  654  655 
   4    1    8   10    6    3    4    1   10    5    8   10   10    9    1    9 
 656  657  659  660  661  663  664  665  667  668  669  671  672  674  675  676 
   8    3    8   10    9    5    9   10    9    1    8    5    1   10   10    1 
 677  679  682  683  685  686  688  689  692  694  695  698  701  702  703  704 
   8   10    1   10   10    4   10    8   10    6   10    4    8   10    4    7 
 705  706  707  708  709  710  711  712  714  715  716  718  720  721  723  725 
  10    8    6    3   10    8    8   10    9    4    8    4    5    6    9    9 
 726  727  729  730  733  735  737  738  739  740  743  744  745  746  747  748 
   5    4    3    1    1    7    4    9    9    8   10   10   10    7    3    8 
 749  750  751  752  753  754  755  758  759  760  761  763  765  768  770  771 
  10    8   10    4    9    3    9    5   10    4   10    7    5    6    5   10 
 774  775  776  777  778  779  781  782  784  785  786  787  788  789  790  791 
   6    9    1    9   10    4    4    3   10    9   10    3    5    5    9    4 
 792  793  794  796  797  798  800  801  805  807  809  810  811  812  813  814 
   8    8    3    6   10    4    8    8    8    4   10    9   10    8   10    9 
 815  817  819  821  822  823  825  826  827  828  831  833  834  835  836  837 
   4    4    6   10    6    6    4    5    5    9   10    9    4   10   10    5 
 839  840  841  842  843  844  845  846  848  850  852  856  857  858  859  860 
   1   10   10    9    1   10    8    5    6    8    1    4    1   10    1    3 
 862  863  864  867  868  869  870  872  873  875  876  878  881  882  883  885 
   5    6   10    9    9    1    4    6   10    8    4    9    4    4   10   10 
 886  888  889  892  894  896  897  900  901  902  903  904  905  906  908  912 
   6    8    8    4   10    6    3    4    5    4    9    9   10   10   10    3 
 914  915  916  917  918  920  921  922  923  924  925  926  929  931  933  935 
   3    7   10    4    5    4   10    8   10    4    1    7    6    6    5    5 
 936  938  940  941  944  945  946  947  949  950  952  953  954  955  956  957 
   4   10    9   10    1    5   10    4    4   10    9    4    4    9   10    9 
 958  959  963  966  967  969  970  971  974  977  979  980  981  983  985  986 
  10    6    6    4    6    6   10    1    4   10    6   10    9    4    3   10 
 987  988  989  992  993  994  995  996  997  998  999 1000 1002 1003 1006 1007 
   1   10   10   10    9   10    8    1    9   10    7    5    9   10    3    3 
1008 1009 1010 1011 1012 1013 1015 1017 1018 1019 1020 1021 1023 1024 1025 1026 
   6   10    9    6    9    4    9    1    3    9   10    4    1    3   10    1 
1027 1029 1030 1031 1032 1033 1034 1038 1039 1040 1041 1043 1044 1045 1046 1048 
   6    4    4    8    7    4    4    8    6    4    1    1    1    6    7    9 
1049 1050 1051 1053 1055 1057 1059 1062 1067 1070 1071 1072 1074 1075 1077 1081 
  10    9    1   10    1    9    1    8    9    3    4    6    3    6   10    4 
1082 1085 1086 1087 1088 1089 1092 1093 1094 1095 1097 1098 1099 1101 1102 1103 
   1    1   10    9    4    8   10    3    8   10    4   10    5    9   10    9 
1104 1105 1106 1107 1108 1109 1110 1111 1112 
   8    3    6   10    4    1    3   10    3 
Objective function:
    build      swap 
0.1114728 0.1076116 

Available components:
[1] "medoids"    "id.med"     "clustering" "objective"  "isolation" 
[6] "clusinfo"   "silinfo"    "diss"       "call"

Plot sihouette width (higher is better)¶

The number of clusters with highest silhouette width is 6 from the plot below.

In [59]:

par(mfrow=c(1,1))
plot(1:10, sil_width,
     xlab = "Number of clusters",
     ylab = "Silhouette Width")
lines(1:10, sil_width)

Interpretation¶

In [60]:

pam_fit <- pam(gower.dist, diss = TRUE, k = 6)

pam_results <- store %>%
  mutate(cluster = pam_fit$clustering) %>%
  group_by(cluster) %>%
  do(the_summary = summary(.))

pam_results$the_summary

[[1]]
     Store        StoreType Assortment CompetitionDistance
 Min.   :   1.0   a: 0      a:57       Min.   : 4.452     
 1st Qu.: 447.0   b: 3      b: 2       1st Qu.: 8.330     
 Median : 668.0   c:56      c: 0       Median :10.278     
 Mean   : 651.1   d: 0                 Mean   :10.222     
 3rd Qu.: 957.5                        3rd Qu.:12.179     
 Max.   :1109.0                        Max.   :17.050     
 CompetitionOpenSinceMonth    cluster 
 Min.   : 1.000            Min.   :1  
 1st Qu.: 4.000            1st Qu.:1  
 Median : 8.000            Median :1  
 Mean   : 7.169            Mean   :1  
 3rd Qu.: 9.500            3rd Qu.:1  
 Max.   :12.000            Max.   :1  

[[2]]
     Store        StoreType Assortment CompetitionDistance
 Min.   :   2.0   a:295     a:296      Min.   : 4.044     
 1st Qu.: 269.5   b:  1     b:  0      1st Qu.: 8.755     
 Median : 582.0   c:  0     c:  0      Median :10.642     
 Mean   : 554.6   d:  0                Mean   :10.780     
 3rd Qu.: 835.2                        3rd Qu.:12.543     
 Max.   :1111.0                        Max.   :19.029     
 CompetitionOpenSinceMonth    cluster 
 Min.   : 1.000            Min.   :2  
 1st Qu.: 4.000            1st Qu.:2  
 Median : 8.000            Median :2  
 Mean   : 7.348            Mean   :2  
 3rd Qu.:10.000            3rd Qu.:2  
 Max.   :12.000            Max.   :2  

[[3]]
     Store        StoreType Assortment CompetitionDistance
 Min.   :   4.0   a: 0      a: 0       Min.   : 4.777     
 1st Qu.: 355.0   b: 0      b: 0       1st Qu.: 9.476     
 Median : 626.0   c:51      c:51       Median :12.286     
 Mean   : 614.7   d: 0                 Mean   :11.738     
 3rd Qu.: 904.5                        3rd Qu.:13.450     
 Max.   :1112.0                        Max.   :19.120     
 CompetitionOpenSinceMonth    cluster 
 Min.   : 1.000            Min.   :3  
 1st Qu.: 5.000            1st Qu.:3  
 Median : 8.000            Median :3  
 Mean   : 7.098            Mean   :3  
 3rd Qu.:10.000            3rd Qu.:3  
 Max.   :12.000            Max.   :3  

[[4]]
     Store        StoreType Assortment CompetitionDistance
 Min.   :   7.0   a:131     a:  0      Min.   : 4.044     
 1st Qu.: 267.5   b:  0     b:  0      1st Qu.: 8.942     
 Median : 507.0   c:  0     c:131      Median :12.131     
 Mean   : 529.5   d:  0                Mean   :12.009     
 3rd Qu.: 822.5                        3rd Qu.:14.907     
 Max.   :1106.0                        Max.   :19.829     
 CompetitionOpenSinceMonth    cluster 
 Min.   : 1.000            Min.   :4  
 1st Qu.: 4.000            1st Qu.:4  
 Median : 7.000            Median :4  
 Mean   : 6.824            Mean   :4  
 3rd Qu.: 9.000            3rd Qu.:4  
 Max.   :12.000            Max.   :4  

[[5]]
     Store        StoreType Assortment CompetitionDistance
 Min.   :  15.0   a:  0     a:  0      Min.   : 4.777     
 1st Qu.: 252.0   b:  0     b:  0      1st Qu.:11.489     
 Median : 523.5   c:  0     c:138      Median :13.495     
 Mean   : 534.4   d:138                Mean   :13.207     
 3rd Qu.: 788.8                        3rd Qu.:14.977     
 Max.   :1103.0                        Max.   :18.636     
 CompetitionOpenSinceMonth    cluster 
 Min.   : 2.000            Min.   :5  
 1st Qu.: 4.250            1st Qu.:5  
 Median : 9.000            Median :5  
 Mean   : 7.659            Mean   :5  
 3rd Qu.:10.000            3rd Qu.:5  
 Max.   :12.000            Max.   :5  

[[6]]
     Store        StoreType Assortment CompetitionDistance
 Min.   :  20.0   a: 0      a:85       Min.   : 6.88      
 1st Qu.: 278.5   b: 5      b: 1       1st Qu.:10.68      
 Median : 574.5   c: 0      c: 0       Median :12.80      
 Mean   : 542.9   d:81                 Mean   :12.73      
 3rd Qu.: 749.5                        3rd Qu.:14.75      
 Max.   :1104.0                        Max.   :19.28      
 CompetitionOpenSinceMonth    cluster 
 Min.   : 1.000            Min.   :6  
 1st Qu.: 4.000            1st Qu.:6  
 Median : 7.000            Median :6  
 Mean   : 6.826            Mean   :6  
 3rd Qu.:10.000            3rd Qu.:6  
 Max.   :12.000            Max.   :6

In [61]:

store[pam_fit$medoids, ]

	Store	StoreType	Assortment	CompetitionDistance	CompetitionOpenSinceMonth
668	668	c	a	10.39291	9
569	569	a	a	10.50205	9
519	519	c	c	11.82267	8
452	452	a	c	11.17044	8
590	590	d	c	13.13715	9
677	677	d	a	11.04175	6

In [62]:

df_clusters <- cbind(store)
df_clusters[pam_fit$medoids, ]

	Store	StoreType	Assortment	CompetitionDistance	CompetitionOpenSinceMonth
668	668	c	a	10.39291	9
569	569	a	a	10.50205	9
519	519	c	c	11.82267	8
452	452	a	c	11.17044	8
590	590	d	c	13.13715	9
677	677	d	a	11.04175	6

In [63]:

df_clusters$cluster <- factor(pam_fit$clustering)

tsne_obj <- Rtsne(gower.dist, is_distance = TRUE, perplexity = 28)

In [64]:

tsne_data <- tsne_obj$Y %>%
  data.frame() %>%
  setNames(c("X", "Y")) %>%
  mutate(cluster = factor(pam_fit$clustering),
         name = store$Store)

ggplot(aes(x = X, y = Y), data = tsne_data) +
  geom_point(aes(color = cluster))

Consider the largest cluster:

In [65]:

data = tsne_data %>% filter(X > 0, Y > 0)

In [66]:

head(data)
dim(data)

X	Y	cluster	name
19.32617	18.74965	2	2
19.03534	15.24453	2	3
10.69588	4.65073	2	5
19.10044	19.57034	2	6
18.43798	14.77562	2	8
17.11165	14.65790	2	10

In [68]:

colnames(data) <- c("X", "Y", "cluster", 'Store')
print(data %>% left_join(store, by = "Store") %>% collect %>%.[["Store"]])

  [1]    2    3    5    6    8   10   14   17   27   28   30   39   44   48   55
 [16]   71   72   78   81   82   86   87   88   89   90   95   96  102  104  106
 [31]  107  112  116  117  121  123  124  125  133  134  137  139  142  148  153
 [46]  156  163  164  165  167  170  173  177  181  186  190  191  198  200  204
 [61]  205  219  220  222  235  236  237  240  257  258  260  264  268  270  272
 [76]  276  282  286  290  292  294  295  296  297  303  304  306  307  308  314
 [91]  322  328  329  332  336  341  351  357  358  360  363  374  376  380  381
[106]  391  395  399  400  403  405  407  418  432  434  439  446  451  459  460
[121]  462  475  479  480  489  490  492  502  506  509  511  513  535  537  538
[136]  539  542  544  546  552  558  563  569  570  573  575  581  583  592  594
[151]  597  602  603  606  607  609  610  611  614  619  621  623  625  630  632
[166]  635  639  643  646  649  650  660  665  674  675  679  683  685  686  688
[181]  692  695  698  702  703  705  709  712  715  718  727  737  743  744  745
[196]  749  751  752  759  760  761  771  778  779  781  784  786  791  797  798
[211]  807  809  811  813  815  817  821  825  831  834  835  836  840  841  844
[226]  856  858  864  870  873  876  881  882  885  892  894  900  902  905  906
[241]  916  917  920  921  923  924  936  938  941  946  947  949  950  953  954
[256]  956  958  966  970  974  977  980  983  986  988  998 1003 1009 1013 1020
[271] 1021 1025 1029 1030 1033 1034 1040 1049 1053 1071 1077 1086 1088 1092 1095
[286] 1098 1102 1107 1108 1111

In [ ]:

Clustering Methods for Mixed Data Types¶

Notebook by Marco Tavora¶

Table of contents¶

Introduction¶

Data¶

Data dictionary¶

Clustering Analysis of Mixed Types¶

Decisions to make¶

Gower distance¶

Pre-processing to eliminate non-normality¶

Drop NA¶

Choosing a clustering algorithm¶

Plot sihouette width (higher is better)¶

Interpretation¶

Notebook by Marco Tavora ¶

Drop `NA`¶