Using clustering algorithms we can understand how a sample might be comprised of different subgroups. In the present case, the data to be clustered includes both categorical and continuous data and standard distance measures such as Euclidean cannot be used. In the following we will use Gower distances.
From Kaggle:
Rossmann operates over 3,000 drug stores in 7 European countries. Currently, Rossmann store managers are tasked with predicting their daily sales for up to six weeks in advance. Store sales are influenced by many factors, including promotions, competition, school and state holidays, seasonality, and locality. With thousands of individual managers predicting sales based on their unique circumstances, the accuracy of results can be quite varied.
I will consider only some of the data fields, namely:
store <- read.csv('store.csv')
store <- store[,c("Store","StoreType","Assortment", "CompetitionDistance", "CompetitionOpenSinceMonth")]
head(store)
Store | StoreType | Assortment | CompetitionDistance | CompetitionOpenSinceMonth |
---|---|---|---|---|
1 | c | a | 1270 | 9 |
2 | a | a | 570 | 11 |
3 | a | a | 14130 | 12 |
4 | c | c | 620 | 9 |
5 | a | a | 29910 | 4 |
6 | a | a | 310 | 12 |
We will use the following packages:
dplyr
for data cleaningcluster
for gower similarity and pamRtsne
for t-SNE plotggplot2
for visualizationset.seed(1680)
library(dplyr)
library(cluster)
library(Rtsne)
library(ggplot2)
From R-bloggers:
While many introductions to cluster analysis typically review a simple application using continuous variables, clustering data of mixed types (e.g., continuous, ordinal, and nominal) is often of interest. The following is an overview of one approach to clustering data of mixed types using Gower distance, partitioning around medoids, and silhouette width.
For each variable type there is a particular distance metric that works well for that type. The metric is scaled to fall between 0 and 1. After that, a linear combination with user-specified weights (e.g. averages) is calculated and a final distance matrix is build. The metrics by type are:
The Gower distance is sensitive to non-normality and outliers of continuous variables. Hence pre-processing is a recommended.
library(gpairs)
library(corrplot)
library(gplots)
library(car)
lambda_CompetitionDistance <- coef(powerTransform(store$CompetitionDistance))
lambda_CompetitionOpenSinceMonth <- coef(powerTransform(store$CompetitionOpenSinceMonth))
lambda_CompetitionDistance
lambda_CompetitionOpenSinceMonth
par(mfrow=c(2,1))
hist(store$CompetitionDistance, xlab="Original variable",
main="Histogram of original variable")
hist(bcPower(store$CompetitionDistance, lambda_CompetitionDistance),
xlab="Box-Cox Transform", ylab="New Distribution",
main="Transformed Distribution")
store$CompetitionDistance <- bcPower(store$CompetitionDistance, lambda_CompetitionDistance)
NA
¶store = na.omit(store)
library(cluster)
gower.dist <- daisy(store, metric = "gower")
summary(gower.dist)
289180 dissimilarities, summarized : Min. 1st Qu. Median Mean 3rd Qu. Max. 0.0009001 0.2809200 0.3934200 0.3962600 0.5256600 0.9041400 Metric : mixed ; Types = I, N, N, I, I Number of objects : 761
We will use partition around medoids (PAM).
sil_width <- c(NA)
for(i in 2:10){
pam_fit <- pam(gower.dist,
diss = TRUE,
k = i)
sil_width[i] <- pam_fit$silinfo$avg.width
}
pam_fit
Medoids: ID [1,] "458" "668" [2,] "139" "200" [3,] "345" "519" [4,] "430" "630" [5,] "339" "507" [6,] "335" "501" [7,] "253" "369" [8,] "219" "317" [9,] "396" "590" [10,] "559" "813" Clustering vector: 1 2 3 4 5 6 7 8 9 10 11 14 15 17 18 20 1 2 2 3 4 2 5 2 6 2 6 4 7 2 7 8 21 23 24 25 27 28 30 31 33 34 35 36 37 38 39 44 3 8 5 1 4 2 4 7 5 1 9 6 1 8 2 2 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 8 1 5 2 9 8 5 7 6 9 2 7 7 5 6 9 61 63 65 67 71 72 73 75 76 77 78 81 82 84 85 86 6 3 5 5 2 2 6 9 7 9 2 4 4 6 2 4 87 88 89 90 95 96 98 99 102 103 104 106 107 108 109 110 2 2 2 2 2 4 9 3 2 7 2 2 2 9 6 5 112 113 115 116 117 118 119 120 121 122 123 124 125 126 127 131 2 9 7 4 2 9 5 8 2 5 2 4 2 8 8 1 133 134 136 137 138 139 140 142 143 146 148 149 150 151 153 156 2 2 6 2 6 4 6 2 8 9 2 8 3 9 2 4 157 159 160 161 162 163 164 165 166 167 169 170 173 177 180 181 6 8 9 5 7 4 2 4 5 4 8 2 2 4 8 4 185 186 189 190 191 192 196 197 198 199 200 202 203 204 205 208 7 2 8 2 2 7 1 1 2 9 2 7 3 2 2 1 209 210 211 212 213 214 219 220 221 222 223 225 229 230 231 232 6 8 6 6 7 8 4 2 9 2 9 8 7 9 9 3 235 236 237 240 242 244 246 247 248 249 254 255 256 257 258 260 4 2 2 4 8 8 1 9 6 9 8 3 6 2 2 2 261 262 263 264 266 267 268 269 270 272 275 276 278 280 281 282 7 8 5 4 5 1 4 6 2 2 8 2 5 9 9 2 286 287 289 290 292 293 294 295 296 297 298 299 300 301 302 303 4 1 8 4 2 3 4 2 4 2 8 9 5 5 9 2 304 305 306 307 308 311 312 314 315 316 317 318 319 320 321 322 2 3 4 2 2 6 8 2 5 8 8 7 6 6 3 4 323 325 326 327 328 329 332 333 334 336 337 341 343 344 347 349 7 5 8 3 2 4 4 5 9 2 7 2 8 5 7 3 351 354 355 356 357 358 360 361 363 366 367 368 369 370 371 372 2 9 6 9 2 2 4 3 2 9 9 7 7 8 7 9 374 375 376 377 378 380 381 382 385 386 389 390 391 393 395 399 2 5 2 6 6 4 2 3 8 7 6 5 2 7 4 2 400 401 402 403 405 406 407 410 413 415 416 418 419 421 423 427 2 6 3 2 4 7 2 1 6 7 5 4 1 3 4 6 428 429 430 432 433 434 439 440 443 444 446 447 448 449 450 451 8 7 9 4 6 2 2 8 8 1 2 6 6 6 1 4 452 455 459 460 461 462 464 465 466 467 468 469 472 475 476 477 6 9 10 4 9 10 1 7 5 5 3 3 3 10 8 8 478 479 480 482 483 484 487 488 489 490 492 494 496 500 501 502 7 2 2 1 6 5 9 5 2 4 4 4 7 9 6 4 503 506 507 509 511 513 514 518 519 521 522 523 524 525 527 529 9 2 5 4 10 10 3 7 3 8 9 3 6 9 7 9 532 534 535 536 537 538 539 541 542 543 544 545 546 547 548 550 6 8 4 6 4 4 4 6 10 1 2 5 4 9 7 9 551 552 553 555 556 558 559 560 563 565 567 569 570 571 572 573 6 4 1 8 9 4 8 3 4 6 1 10 10 8 7 10 575 576 578 579 580 581 583 585 586 587 588 590 592 593 594 595 4 1 8 1 5 10 10 7 6 9 7 9 4 5 10 3 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 1 10 1 9 9 8 4 4 8 8 10 10 5 10 4 10 612 613 614 615 616 619 621 623 624 625 626 627 629 630 632 633 9 1 10 8 6 4 4 4 6 4 3 3 8 4 4 8 635 636 638 639 641 642 643 644 646 647 648 649 650 653 654 655 4 1 8 10 6 3 4 1 10 5 8 10 10 9 1 9 656 657 659 660 661 663 664 665 667 668 669 671 672 674 675 676 8 3 8 10 9 5 9 10 9 1 8 5 1 10 10 1 677 679 682 683 685 686 688 689 692 694 695 698 701 702 703 704 8 10 1 10 10 4 10 8 10 6 10 4 8 10 4 7 705 706 707 708 709 710 711 712 714 715 716 718 720 721 723 725 10 8 6 3 10 8 8 10 9 4 8 4 5 6 9 9 726 727 729 730 733 735 737 738 739 740 743 744 745 746 747 748 5 4 3 1 1 7 4 9 9 8 10 10 10 7 3 8 749 750 751 752 753 754 755 758 759 760 761 763 765 768 770 771 10 8 10 4 9 3 9 5 10 4 10 7 5 6 5 10 774 775 776 777 778 779 781 782 784 785 786 787 788 789 790 791 6 9 1 9 10 4 4 3 10 9 10 3 5 5 9 4 792 793 794 796 797 798 800 801 805 807 809 810 811 812 813 814 8 8 3 6 10 4 8 8 8 4 10 9 10 8 10 9 815 817 819 821 822 823 825 826 827 828 831 833 834 835 836 837 4 4 6 10 6 6 4 5 5 9 10 9 4 10 10 5 839 840 841 842 843 844 845 846 848 850 852 856 857 858 859 860 1 10 10 9 1 10 8 5 6 8 1 4 1 10 1 3 862 863 864 867 868 869 870 872 873 875 876 878 881 882 883 885 5 6 10 9 9 1 4 6 10 8 4 9 4 4 10 10 886 888 889 892 894 896 897 900 901 902 903 904 905 906 908 912 6 8 8 4 10 6 3 4 5 4 9 9 10 10 10 3 914 915 916 917 918 920 921 922 923 924 925 926 929 931 933 935 3 7 10 4 5 4 10 8 10 4 1 7 6 6 5 5 936 938 940 941 944 945 946 947 949 950 952 953 954 955 956 957 4 10 9 10 1 5 10 4 4 10 9 4 4 9 10 9 958 959 963 966 967 969 970 971 974 977 979 980 981 983 985 986 10 6 6 4 6 6 10 1 4 10 6 10 9 4 3 10 987 988 989 992 993 994 995 996 997 998 999 1000 1002 1003 1006 1007 1 10 10 10 9 10 8 1 9 10 7 5 9 10 3 3 1008 1009 1010 1011 1012 1013 1015 1017 1018 1019 1020 1021 1023 1024 1025 1026 6 10 9 6 9 4 9 1 3 9 10 4 1 3 10 1 1027 1029 1030 1031 1032 1033 1034 1038 1039 1040 1041 1043 1044 1045 1046 1048 6 4 4 8 7 4 4 8 6 4 1 1 1 6 7 9 1049 1050 1051 1053 1055 1057 1059 1062 1067 1070 1071 1072 1074 1075 1077 1081 10 9 1 10 1 9 1 8 9 3 4 6 3 6 10 4 1082 1085 1086 1087 1088 1089 1092 1093 1094 1095 1097 1098 1099 1101 1102 1103 1 1 10 9 4 8 10 3 8 10 4 10 5 9 10 9 1104 1105 1106 1107 1108 1109 1110 1111 1112 8 3 6 10 4 1 3 10 3 Objective function: build swap 0.1114728 0.1076116 Available components: [1] "medoids" "id.med" "clustering" "objective" "isolation" [6] "clusinfo" "silinfo" "diss" "call"
The number of clusters with highest silhouette width is 6 from the plot below.
par(mfrow=c(1,1))
plot(1:10, sil_width,
xlab = "Number of clusters",
ylab = "Silhouette Width")
lines(1:10, sil_width)
pam_fit <- pam(gower.dist, diss = TRUE, k = 6)
pam_results <- store %>%
mutate(cluster = pam_fit$clustering) %>%
group_by(cluster) %>%
do(the_summary = summary(.))
pam_results$the_summary
[[1]] Store StoreType Assortment CompetitionDistance Min. : 1.0 a: 0 a:57 Min. : 4.452 1st Qu.: 447.0 b: 3 b: 2 1st Qu.: 8.330 Median : 668.0 c:56 c: 0 Median :10.278 Mean : 651.1 d: 0 Mean :10.222 3rd Qu.: 957.5 3rd Qu.:12.179 Max. :1109.0 Max. :17.050 CompetitionOpenSinceMonth cluster Min. : 1.000 Min. :1 1st Qu.: 4.000 1st Qu.:1 Median : 8.000 Median :1 Mean : 7.169 Mean :1 3rd Qu.: 9.500 3rd Qu.:1 Max. :12.000 Max. :1 [[2]] Store StoreType Assortment CompetitionDistance Min. : 2.0 a:295 a:296 Min. : 4.044 1st Qu.: 269.5 b: 1 b: 0 1st Qu.: 8.755 Median : 582.0 c: 0 c: 0 Median :10.642 Mean : 554.6 d: 0 Mean :10.780 3rd Qu.: 835.2 3rd Qu.:12.543 Max. :1111.0 Max. :19.029 CompetitionOpenSinceMonth cluster Min. : 1.000 Min. :2 1st Qu.: 4.000 1st Qu.:2 Median : 8.000 Median :2 Mean : 7.348 Mean :2 3rd Qu.:10.000 3rd Qu.:2 Max. :12.000 Max. :2 [[3]] Store StoreType Assortment CompetitionDistance Min. : 4.0 a: 0 a: 0 Min. : 4.777 1st Qu.: 355.0 b: 0 b: 0 1st Qu.: 9.476 Median : 626.0 c:51 c:51 Median :12.286 Mean : 614.7 d: 0 Mean :11.738 3rd Qu.: 904.5 3rd Qu.:13.450 Max. :1112.0 Max. :19.120 CompetitionOpenSinceMonth cluster Min. : 1.000 Min. :3 1st Qu.: 5.000 1st Qu.:3 Median : 8.000 Median :3 Mean : 7.098 Mean :3 3rd Qu.:10.000 3rd Qu.:3 Max. :12.000 Max. :3 [[4]] Store StoreType Assortment CompetitionDistance Min. : 7.0 a:131 a: 0 Min. : 4.044 1st Qu.: 267.5 b: 0 b: 0 1st Qu.: 8.942 Median : 507.0 c: 0 c:131 Median :12.131 Mean : 529.5 d: 0 Mean :12.009 3rd Qu.: 822.5 3rd Qu.:14.907 Max. :1106.0 Max. :19.829 CompetitionOpenSinceMonth cluster Min. : 1.000 Min. :4 1st Qu.: 4.000 1st Qu.:4 Median : 7.000 Median :4 Mean : 6.824 Mean :4 3rd Qu.: 9.000 3rd Qu.:4 Max. :12.000 Max. :4 [[5]] Store StoreType Assortment CompetitionDistance Min. : 15.0 a: 0 a: 0 Min. : 4.777 1st Qu.: 252.0 b: 0 b: 0 1st Qu.:11.489 Median : 523.5 c: 0 c:138 Median :13.495 Mean : 534.4 d:138 Mean :13.207 3rd Qu.: 788.8 3rd Qu.:14.977 Max. :1103.0 Max. :18.636 CompetitionOpenSinceMonth cluster Min. : 2.000 Min. :5 1st Qu.: 4.250 1st Qu.:5 Median : 9.000 Median :5 Mean : 7.659 Mean :5 3rd Qu.:10.000 3rd Qu.:5 Max. :12.000 Max. :5 [[6]] Store StoreType Assortment CompetitionDistance Min. : 20.0 a: 0 a:85 Min. : 6.88 1st Qu.: 278.5 b: 5 b: 1 1st Qu.:10.68 Median : 574.5 c: 0 c: 0 Median :12.80 Mean : 542.9 d:81 Mean :12.73 3rd Qu.: 749.5 3rd Qu.:14.75 Max. :1104.0 Max. :19.28 CompetitionOpenSinceMonth cluster Min. : 1.000 Min. :6 1st Qu.: 4.000 1st Qu.:6 Median : 7.000 Median :6 Mean : 6.826 Mean :6 3rd Qu.:10.000 3rd Qu.:6 Max. :12.000 Max. :6
store[pam_fit$medoids, ]
Store | StoreType | Assortment | CompetitionDistance | CompetitionOpenSinceMonth | |
---|---|---|---|---|---|
668 | 668 | c | a | 10.39291 | 9 |
569 | 569 | a | a | 10.50205 | 9 |
519 | 519 | c | c | 11.82267 | 8 |
452 | 452 | a | c | 11.17044 | 8 |
590 | 590 | d | c | 13.13715 | 9 |
677 | 677 | d | a | 11.04175 | 6 |
df_clusters <- cbind(store)
df_clusters[pam_fit$medoids, ]
Store | StoreType | Assortment | CompetitionDistance | CompetitionOpenSinceMonth | |
---|---|---|---|---|---|
668 | 668 | c | a | 10.39291 | 9 |
569 | 569 | a | a | 10.50205 | 9 |
519 | 519 | c | c | 11.82267 | 8 |
452 | 452 | a | c | 11.17044 | 8 |
590 | 590 | d | c | 13.13715 | 9 |
677 | 677 | d | a | 11.04175 | 6 |
df_clusters$cluster <- factor(pam_fit$clustering)
tsne_obj <- Rtsne(gower.dist, is_distance = TRUE, perplexity = 28)
tsne_data <- tsne_obj$Y %>%
data.frame() %>%
setNames(c("X", "Y")) %>%
mutate(cluster = factor(pam_fit$clustering),
name = store$Store)
ggplot(aes(x = X, y = Y), data = tsne_data) +
geom_point(aes(color = cluster))
Consider the largest cluster:
data = tsne_data %>% filter(X > 0, Y > 0)
head(data)
dim(data)
X | Y | cluster | name |
---|---|---|---|
19.32617 | 18.74965 | 2 | 2 |
19.03534 | 15.24453 | 2 | 3 |
10.69588 | 4.65073 | 2 | 5 |
19.10044 | 19.57034 | 2 | 6 |
18.43798 | 14.77562 | 2 | 8 |
17.11165 | 14.65790 | 2 | 10 |
colnames(data) <- c("X", "Y", "cluster", 'Store')
print(data %>% left_join(store, by = "Store") %>% collect %>%.[["Store"]])
[1] 2 3 5 6 8 10 14 17 27 28 30 39 44 48 55 [16] 71 72 78 81 82 86 87 88 89 90 95 96 102 104 106 [31] 107 112 116 117 121 123 124 125 133 134 137 139 142 148 153 [46] 156 163 164 165 167 170 173 177 181 186 190 191 198 200 204 [61] 205 219 220 222 235 236 237 240 257 258 260 264 268 270 272 [76] 276 282 286 290 292 294 295 296 297 303 304 306 307 308 314 [91] 322 328 329 332 336 341 351 357 358 360 363 374 376 380 381 [106] 391 395 399 400 403 405 407 418 432 434 439 446 451 459 460 [121] 462 475 479 480 489 490 492 502 506 509 511 513 535 537 538 [136] 539 542 544 546 552 558 563 569 570 573 575 581 583 592 594 [151] 597 602 603 606 607 609 610 611 614 619 621 623 625 630 632 [166] 635 639 643 646 649 650 660 665 674 675 679 683 685 686 688 [181] 692 695 698 702 703 705 709 712 715 718 727 737 743 744 745 [196] 749 751 752 759 760 761 771 778 779 781 784 786 791 797 798 [211] 807 809 811 813 815 817 821 825 831 834 835 836 840 841 844 [226] 856 858 864 870 873 876 881 882 885 892 894 900 902 905 906 [241] 916 917 920 921 923 924 936 938 941 946 947 949 950 953 954 [256] 956 958 966 970 974 977 980 983 986 988 998 1003 1009 1013 1020 [271] 1021 1025 1029 1030 1033 1034 1040 1049 1053 1071 1077 1086 1088 1092 1095 [286] 1098 1102 1107 1108 1111