Introduction
Cluster analysis is the statistical problem of classifying data into groups according to similarity. For example, birds can be classified into flocks depending on their physical position and how close they are to one another. There are multiple ways to classify the data and to decide how many clusters should be identified, and choosing one classification over another can be difficult. In this article, we compare cluster analysis techniques for binary data.
Cluster analysis is a common technique for exploratory data analysis, used in fields such as pattern recognition, machine learning, and bioinformatics. The algorithms aim to be versatile and suitable for any data type. In this article, we show that not all algorithms are suitable for classifying a given dataset, and we demonstrate how building a simulated dataset can help in choosing an appropriate algorithm.
We identified more than a dozen cluster analysis functions by different contributors to the R statistical language. This article does not aim to be a thorough description or comparison of those functions, and instead focuses on showing the construction of the simulated dataset and the results of as many of those functions as possible.
The simulation generated 3 clusters of approximately 333 points each; the resulting dataset had 1000 points defined in a space of 20 binary dimensions. Then, the R functions for cluster analysis were run. Some algorithms, such as agnes, required us to specify the number of clusters; others, such as ekmeans, were able to determine the number of clusters by themselves. We also found that most algorithms, like agnes and ekmeans, classified the dataset into appropriately-sized clusters, while others, such as dbscan and hc, produced the wrong number of clusters or the wrong sizes. Finally, some of the algorithms were hard to process and interpret, because appropriate statistical or graphical tools for their analysis do not exist. That is the case of mona, an algorithm designed to work with binary variables only: we had to use some guesswork to determine the size of the clusters, and still got them wrong.
In conclusion, we found that several algorithms were unsuitable for classifying binary data. Experiments with simulated datasets are fundamental for choosing an algorithm that can clearly illustrate a correct classification of binary data.
Options for Cluster Analysis
Here is a list of facilities to perform cluster analysis in R. The package cluster offers agnes, clara, diana, fanny, mona, and pam. Package fpc offers dbscan. Package mclust offers Mclust, hc, and hcVVV. Package pvclust offers pvclust, and package stats offers hclust and kmeans.
Package factoextra enhances several of the facilities listed above with additional types of plots and automatic determination of the number of clusters.
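For example, factoextra can suggest a number of clusters before any model is committed to. A minimal sketch of its fviz_nbclust helper (our illustration, applied to the dataset ca$dataset that we build in the next section):
fviz_nbclust(ca$dataset, kmeans, method = "silhouette")  # peak = suggested k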
Requirements and Installation
The comparison only requires a working R installation with the packages mentioned above and a few additional utilities. If you are starting with a fresh setup, you can use the following command:
install.packages(c("rlang", "data.table", "ggplot2", "Cairo",
"cowplot", "cluster", "factoextra", "fpc", "mclust", "pvclust"))
Let’s Get Started
We first generate a simulated dataset, then run all the cluster analysis algorithms on it, saving the results for later.
for (package_name in c("rlang", "data.table", "ggplot2", "Cairo",
"cowplot", "cluster", "factoextra", "fpc", "mclust", "pvclust")) {
require(package_name, character.only = TRUE)
message("Package ", package_name, " version ", packageVersion(package_name),
" released ", packageDate(package_name))
}
# Example use case for our binary dataset:
# Each row in this dataset represents a scientific paper.
# A paper is a point in the dataset space.
# Each column represents a certain analysis methodology.
# A column is a dimension in the dataset space.
# Each value is a 0 or 1:
# 1: The paper used the given methodology
# 0: Otherwise
message("Step 1: Definitions and preparation")
set.seed(1000)
ca <- list()
# Number of rows: number of papers; points in the dataset space.
ca$n_papers <- 1000
ca$n_methodologies <- 20 # Number of columns; dimension of the dataset space
ca$n_clusters <- 3 # Number of clusters to generate
# Minimum radius of a cluster in dataset space.
# Should be less than sqrt(ca$n_methodologies).
ca$min_cluster_radius <- 1.0
# Maximum radius of a cluster in dataset space.
# Should be less than sqrt(ca$n_methodologies).
ca$max_cluster_radius <- 1.6
tmp <- list()
stopifnot(ca$n_papers > 100)
stopifnot(ca$n_methodologies > 5)
stopifnot(ca$n_clusters > 1)
stopifnot(ca$min_cluster_radius >= 1)
stopifnot(ca$min_cluster_radius < ca$max_cluster_radius)
if (2 * ca$max_cluster_radius > sqrt(ca$n_methodologies)) {
cat("Warning: The cluster radius should be reasonably low. ",
"Please review:\nca$radius =",
ca$radius, "| Recommendation < 0.5 * sqrt(ca$n_methodologies =",
0.5 * sqrt(ca$n_methodologies), "\n")
}
message("Step 2: The dataset starts as a matrix of zeroes.")
tmp$zeroes <- rep(0, ca$n_papers)
# Dataset header, as in methodology1, methodology2, etc.
tmp$methodologies <- paste0("m", 1:(ca$n_methodologies))
ca$dataset <- data.table()
ca$dataset[, tmp$methodologies := tmp$zeroes]
message(
"Step 3: We build clusters at random points in the space of the dataset.")
ca$cluster_radii <- runif(n = ca$n_clusters, min = ca$min_cluster_radius,
max = ca$max_cluster_radius)
ca$cluster_centers <- as.data.table(matrix(rbinom(n = ca$n_clusters *
ca$n_methodologies, size = 1, prob = 0.5), nrow = ca$n_clusters,
ncol = ca$n_methodologies))
setnames(ca$cluster_centers, tmp$methodologies)
tmp$limits <- 1 + as.integer(0:(ca$n_clusters - 1) * ca$n_papers /
ca$n_clusters) # Paper index at the start of each cluster
stopifnot(tmp$limits < ca$n_papers)
tmp$cluster_sizes <- diff(c(tmp$limits, ca$n_papers + 1))
ca$cluster_sizes <- data.table(Original = sort(tmp$cluster_sizes))
# new_sizes sorts the vector x and adds it as column x_name to
# ca$cluster_sizes by reference, growing the table and padding with
# NAs if necessary. No data recycling.
new_sizes <- function(x_name, x) {
sx <- seq_along(x)
if (length(x) > nrow(ca$cluster_sizes)) {
ca$cluster_sizes <<- ca$cluster_sizes[sx]
}
ca$cluster_sizes[sx, c(x_name) := sort(x)]
}
tmp$limits <- c(tmp$limits, 0)
tmp$random_index <- sample.int(ca$n_papers)
tmp$limit <- 1
tmp$cluster_index <- 0
for (paper_index in 1:(ca$n_papers)) {
if (paper_index == tmp$limit) {
tmp$cluster_index <- tmp$cluster_index + 1
if (tmp$cluster_index <= ca$n_clusters) {
tmp$limit <- tmp$limits[tmp$cluster_index + 1]
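# Per-bit flip probability: each of the n_methodologies bits of the
# center is flipped independently with probability p, so the expected
# squared Euclidean distance from the center is n_methodologies * p.
# Setting that equal to radius^2 gives p = radius^2 / n_methodologies.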
tmp$prob <- ca$cluster_radii[tmp$cluster_index] *
ca$cluster_radii[tmp$cluster_index] / ca$n_methodologies
tmp$center <- ca$cluster_centers[tmp$cluster_index]
message("Cluster ", tmp$cluster_index, " has size ",
tmp$cluster_sizes[tmp$cluster_index], ", radius ", tmp$prob,
" and the following center:")
show(tmp$center)
}
}
# A random point near the center of the cluster is defined by flipping
# some of the bits that constitute the coordinates.
ca$dataset[tmp$random_index[paper_index], ] <- (tmp$center +
rbinom(n = ca$n_methodologies, size = 1, prob = tmp$prob)) %% 2
}
message("Here is an excerpt of the dataset:")
show(ca$dataset)
message("Step 4: We run the clustering algorithms and save the results.")
ca$p_cluster <- list()
ca$p_fpc <- list()
ca$p_mclust <- list()
ca$p_pvclust <- list()
ca$p_stats <- list()
ca$p_cluster$agnes <- agnes(ca$dataset)
ca$p_cluster$agnes_cluster_sizes <- table(cutree(ca$p_cluster$agnes, k = 3))
new_sizes("agnes", ca$p_cluster$agnes_cluster_sizes)
ca$p_cluster$eagnes <- eclust(ca$dataset, "agnes", graph = FALSE)
ca$p_cluster$clara <- clara(ca$dataset, k = 3, samples = 100)
ca$p_cluster$clara_cluster_sizes <- ca$p_cluster$clara$clusinfo[, "size"]
new_sizes("clara", ca$p_cluster$clara_cluster_sizes)
ca$p_cluster$eclara <- eclust(ca$dataset, "clara", graph = FALSE, k = 3,
samples = 100)
ca$p_cluster$diana <- diana(ca$dataset)
ca$p_cluster$diana_cluster_sizes <- table(cutree(ca$p_cluster$diana, k = 3))
new_sizes("diana", ca$p_cluster$diana_cluster_sizes)
ca$p_cluster$ediana <- eclust(ca$dataset, "diana", graph = FALSE)
ca$p_cluster$fanny <- fanny(ca$dataset, k = 3, tol = 1e-6)
ca$p_cluster$fanny_cluster_sizes <- table(ca$p_cluster$fanny$clustering)
new_sizes("fanny", ca$p_cluster$fanny_cluster_sizes)
ca$p_cluster$efanny <- eclust(ca$dataset, "fanny", graph = FALSE, k = 3,
tol = 1e-6)
ca$p_cluster$mona <- mona(ca$dataset)
tmp$mona <- list()
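# Guesswork follows (see the Discussion): mona does not report a
# partition directly, so we take the banner positions where mona$step
# equals 20 (the number of variables) and read the boundaries of the
# three top-level groups off those positions.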
tmp$mona$raw <- which(ca$p_cluster$mona$step == 20)
tmp$mona$indices <- c(0, tmp$mona$raw[4:5] - tmp$mona$raw[1], 1001)
ca$p_cluster$mona_cluster_sizes <- diff(tmp$mona$indices)
new_sizes("mona", ca$p_cluster$mona_cluster_sizes)
ca$p_cluster$pam <- pam(ca$dataset, k = 3)
ca$p_cluster$pam$pam_cluster_sizes <- ca$p_cluster$pam$clusinfo[, "size"]
new_sizes("pam", ca$p_cluster$pam$pam_cluster_sizes)
ca$p_cluster$epam <- eclust(ca$dataset, "pam", graph = FALSE, k = 3)
ca$p_fpc$dbscan <- dbscan(ca$dataset, eps = 0.000001, MinPts = 10)
ca$p_fpc$dbscan_cluster_sizes <- table(ca$p_fpc$dbscan$cluster)
new_sizes("dbscan", ca$p_fpc$dbscan_cluster_sizes)
ca$p_mclust$mclust <- Mclust(ca$dataset)
ca$p_mclust$mclust_cluster_sizes <- table(ca$p_mclust$mclust$classification)
new_sizes("mclust", ca$p_mclust$mclust_cluster_sizes)
ca$p_mclust$clust_combi <- clustCombi(ca$p_mclust$mclust, ca$dataset)
ca$p_mclust$hc <- hc(ca$dataset)
ca$p_mclust$hc_cluster_sizes <- table(hclass(ca$p_mclust$hc, 3))
new_sizes("hc", ca$p_mclust$hc_cluster_sizes)
ca$p_mclust$hcvvv <- hcVVV(ca$dataset)
ca$p_mclust$hcvvv_cluster_sizes <- table(hclass(ca$p_mclust$hcvvv, 3))
new_sizes("hcvvv", ca$p_mclust$hcvvv_cluster_sizes)
ca$p_pvclust$pvclust <- pvclust(t(ca$dataset), parallel = TRUE)
ca$p_pvclust$pvclust_cluster_sizes <- table(cutree(ca$p_pvclust$pvclust$hclust,
k = 3))
new_sizes("pvclust", ca$p_pvclust$pvclust_cluster_sizes)
# ca$p_stats$hclust <- hclust(ca$dataset) # Did not work with this dataset:
# hclust needs a dissimilarity object, not a raw data table (see below)
ca$p_stats$ehclust <- eclust(ca$dataset, "hclust", graph = FALSE)
ca$p_stats$ehclust_cluster_sizes <- table(cutree(ca$p_stats$ehclust, k = 3))
new_sizes("ehclust", ca$p_stats$ehclust_cluster_sizes)
# eclust("kmeans") might warn "did not converge in 10 iterations"
ca$p_stats$ekmeans <- eclust(ca$dataset, "kmeans", graph = FALSE)
ca$p_stats$ekmeans_cluster_sizes <- table(ca$p_stats$ekmeans$cluster)
new_sizes("ekmeans", ca$p_stats$ekmeans_cluster_sizes)
ca$p_stats$kmeans <- kmeans(ca$dataset, centers = ca$cluster_centers)
ca$p_stats$kmeans_cluster_sizes <- table(ca$p_stats$kmeans$cluster)
new_sizes("kmeans", ca$p_stats$kmeans_cluster_sizes)
save(ca, file = "data.RData")
tmp <- NULL
Here is the output of the code:
Loading required package: rlang
Package rlang version 1.0.2 released 2022-03-04
Loading required package: data.table
data.table 1.14.2 using 4 threads (see ?getDTthreads). Latest news: r-datatable.com
Attaching package: 'data.table'
The following object is masked from 'package:rlang':
:=
Package data.table version 1.14.2 released 2021-09-23
Loading required package: ggplot2
Package ggplot2 version 3.3.6 released 2022-04-27
Loading required package: Cairo
Package Cairo version 1.5.15 released 2022-03-16
Loading required package: cowplot
Package cowplot version 1.1.1 released 2020-12-15
Loading required package: cluster
Package cluster version 2.1.3 released 2022-03-28
Loading required package: factoextra
Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
Package factoextra version 1.0.7 released 2020-04-01
Loading required package: fpc
Package fpc version 2.2.9 released 2020-12-06
Loading required package: mclust
__ ___________ __ _____________
/ |/ / ____/ / / / / / ___/_ __/
/ /|_/ / / / / / / / /\__ \ / /
/ / / / /___/ /___/ /_/ /___/ // /
/_/ /_/\____/_____/\____//____//_/ version 5.4.9
Type 'citation("mclust")' for citing this R package in publications.
Package mclust version 5.4.9 released 2021-12-17
Loading required package: pvclust
Package pvclust version 2.2.0 released 2019-11-19
Step 1: Definitions and preparation
Step 2: The dataset starts as a matrix of zeroes.
Step 3: We build clusters at random points in the space of the dataset.
Cluster 1 has size 333, radius 0.0716078026473276 and the following center:
m1 m2 m3 m4 m5 m6 m7 m8 m9 m10 m11 m12 m13 m14 m15 m16 m17 m18 m19 m20
1: 1 1 0 0 0 0 1 1 1 1 1 0 0 0 1 0 1 1 1 0
Cluster 2 has size 333, radius 0.105896052986884 and the following center:
m1 m2 m3 m4 m5 m6 m7 m8 m9 m10 m11 m12 m13 m14 m15 m16 m17 m18 m19 m20
1: 1 1 0 1 0 1 0 1 0 0 1 1 1 1 1 1 0 1 1 0
Cluster 3 has size 334, radius 0.0570698508073141 and the following center:
m1 m2 m3 m4 m5 m6 m7 m8 m9 m10 m11 m12 m13 m14 m15 m16 m17 m18 m19 m20
1: 0 0 1 1 1 0 0 0 0 1 1 1 1 1 1 1 1 1 0 1
Here is an excerpt of the dataset:
m1 m2 m3 m4 m5 m6 m7 m8 m9 m10 m11 m12 m13 m14 m15 m16 m17 m18 m19 m20
1: 1 1 1 1 0 1 0 1 0 1 1 1 1 1 1 1 0 1 0 0
2: 1 1 0 1 0 1 0 1 0 0 1 1 1 1 1 1 0 1 1 0
3: 1 1 0 0 0 0 1 1 1 1 1 0 1 0 0 0 1 1 1 1
4: 1 1 0 0 0 0 1 1 1 1 0 0 0 0 1 0 1 1 1 0
5: 1 1 0 1 0 1 0 1 0 0 1 1 1 1 1 1 0 1 1 0
---
996: 1 1 0 1 1 1 0 1 0 0 1 1 1 1 1 1 0 1 1 0
997: 0 0 1 1 1 0 0 0 0 1 1 1 1 1 1 1 1 1 0 1
998: 1 1 1 1 0 1 0 1 0 0 1 1 1 1 1 1 0 1 0 1
999: 1 1 0 1 0 1 0 1 0 0 1 0 1 1 0 1 1 1 1 0
1000: 1 1 0 0 0 0 1 1 1 1 1 0 0 1 1 0 1 0 1 0
Step 4: We run the clustering algorithms and save the results.
Clustering k = 1,2,..., K.max (= 10): .. done
Bootstrapping, b = 1,2,..., B (= 100) [one "." per sample]:
.................................................. 50
.................................................. 100
Clustering k = 1,2,..., K.max (= 10): .. done
Bootstrapping, b = 1,2,..., B (= 100) [one "." per sample]:
.................................................. 50
.................................................. 100
fitting ...
|====================================================================================================| 100%
Bootstrap (r = 0.5)... Done.
Bootstrap (r = 0.6)... Done.
Bootstrap (r = 0.7)... Done.
Bootstrap (r = 0.8)... Done.
Bootstrap (r = 0.9)... Done.
Bootstrap (r = 1.0)... Done.
Bootstrap (r = 1.1)... Done.
Bootstrap (r = 1.2)... Done.
Bootstrap (r = 1.3)... Done.
Bootstrap (r = 1.4)... Done.
Clustering k = 1,2,..., K.max (= 10): .. done
Bootstrapping, b = 1,2,..., B (= 100) [one "." per sample]:
.................................................. 50
.................................................. 100
Clustering k = 1,2,..., K.max (= 10): .. done
Bootstrapping, b = 1,2,..., B (= 100) [one "." per sample]:
.................................................. 50
.................................................. 100
We first compare the cluster analysis methods in the following table of sorted cluster sizes; the Original column holds the sizes used by the simulation:
Original agnes clara diana fanny mona pam dbscan mclust hc hcvvv pvclust
1: 333 333 333 311 333 262 333 10 87 76 76 333
2: 333 333 333 334 333 339 333 10 121 105 105 333
3: 334 334 334 355 334 400 334 10 121 819 819 334
4: NA NA NA NA NA NA NA 12 132 NA NA NA
5: NA NA NA NA NA NA NA 34 146 NA NA NA
6: NA NA NA NA NA NA NA 76 195 NA NA NA
7: NA NA NA NA NA NA NA 105 198 NA NA NA
ehclust ekmeans kmeans
1: 332 333 333
2: 333 333 333
3: 335 334 334
4: NA NA NA
5: NA NA NA
6: NA NA NA
7: NA NA NA
This table shows that the algorithms dbscan and mclust are not appropriate for the classification of our simulated data, given that they identify far too many clusters; mclust, in particular, fits Gaussian finite mixture models, which are geared toward continuous rather than binary data. The algorithms diana, mona, hc, and hcvvv produce incorrect cluster sizes. The other algorithms produce clusters of appropriate sizes. Bear in mind that package factoextra was sometimes able to determine the number of clusters automatically, but in other cases we had to provide it ourselves.
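Comparing sorted sizes is a coarse check. If per-point ground-truth labels had been kept during the simulation (they were not; the tmp list was discarded), the agreement between the true and the recovered partitions could be quantified with the adjusted Rand index from package mclust. A hypothetical sketch, where true_labels stands for that missing vector:
# true_labels: hypothetical vector with the generating cluster of each paper
adjustedRandIndex(true_labels, cutree(ca$p_cluster$agnes, k = 3))  # 1 = perfect
adjustedRandIndex(true_labels, ca$p_stats$ekmeans$cluster)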
In the next few sections, we will show the graphical output of the different types of cluster analysis.
Package cluster
agnes - Agglomerative Nesting (Hierarchical Clustering)
ca$p_cluster$agnes_cluster_sizes
plot(ca$p_cluster$agnes)
fviz_cluster(ca$p_cluster$eagnes) # From package factoextra
fviz_dend(ca$p_cluster$eagnes) # From package factoextra
fviz_silhouette(ca$p_cluster$eagnes) # From package factoextra
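Note that agnes returns a full merge tree rather than a fixed partition; the sizes in the comparison table come from cutting that tree at k = 3 with cutree, as in Step 4. A sketch for inspecting a different cut:
table(cutree(ca$p_cluster$agnes, k = 4))  # cluster sizes for a 4-way cut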
clara - Clustering Large Applications - Clustering of the Data Into k Clusters
ca$cluster_centers
ca$p_cluster$clara$medoids
ca$p_cluster$clara$clusinfo
m1 m2 m3 m4 m5 m6 m7 m8 m9 m10 m11 m12 m13 m14 m15 m16 m17 m18 m19 m20
1: 1 1 0 0 0 0 1 1 1 1 1 0 0 0 1 0 1 1 1 0
2: 1 1 0 1 0 1 0 1 0 0 1 1 1 1 1 1 0 1 1 0
3: 0 0 1 1 1 0 0 0 0 1 1 1 1 1 1 1 1 1 0 1
m1 m2 m3 m4 m5 m6 m7 m8 m9 m10 m11 m12 m13 m14 m15 m16 m17 m18 m19 m20
[1,] 1 1 0 1 0 1 0 1 0 0 1 1 1 1 1 1 0 1 1 0
[2,] 1 1 0 0 0 0 1 1 1 1 1 0 0 0 1 0 1 1 1 0
[3,] 0 0 1 1 1 0 0 0 0 1 1 1 1 1 1 1 1 1 0 1
size max_diss av_diss isolation
[1,] 333 2.645751 1.3260589 0.8366600
[2,] 333 2.236068 1.0075124 0.7071068
[3,] 334 2.449490 0.8338874 0.7745967
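The printed medoids coincide exactly with the generating centers, up to row order. A quick programmatic check (our sketch; the permutation c(2, 1, 3) is read off the output above):
all(ca$p_cluster$clara$medoids == as.matrix(ca$cluster_centers)[c(2, 1, 3), ])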
plot(ca$p_cluster$clara)
fviz_cluster(ca$p_cluster$eclara) # From package factoextra
fviz_silhouette(ca$p_cluster$eclara) # From package factoextra
diana - DIvisive ANAlysis Clustering
ca$cluster_centers
ca$p_cluster$diana_cluster_sizes
plot(ca$p_cluster$diana)
fviz_cluster(ca$p_cluster$ediana) # From package factoextra
fviz_dend(ca$p_cluster$ediana) # From package factoextra
fviz_silhouette(ca$p_cluster$ediana) # From package factoextra
m1 m2 m3 m4 m5 m6 m7 m8 m9 m10 m11 m12 m13 m14 m15 m16 m17 m18 m19 m20
1: 1 1 0 0 0 0 1 1 1 1 1 0 0 0 1 0 1 1 1 0
2: 1 1 0 1 0 1 0 1 0 0 1 1 1 1 1 1 0 1 1 0
3: 0 0 1 1 1 0 0 0 0 1 1 1 1 1 1 1 1 1 0 1
1 2 3
355 311 334
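Besides the tree, diana reports a divisive coefficient that measures the strength of the clustering structure it found; values close to 1 indicate clear structure. A sketch:
ca$p_cluster$diana$dc  # divisive coefficient of the fitted tree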
fanny - Fuzzy clustering of the data into k clusters
ca$cluster_centers
ca$p_cluster$fanny_cluster_sizes
plot(ca$p_cluster$fanny)
fviz_cluster(ca$p_cluster$efanny) # From package factoextra
fviz_silhouette(ca$p_cluster$efanny) # From package factoextra
m1 m2 m3 m4 m5 m6 m7 m8 m9 m10 m11 m12 m13 m14 m15 m16 m17 m18 m19 m20
1: 1 1 0 0 0 0 1 1 1 1 1 0 0 0 1 0 1 1 1 0
2: 1 1 0 1 0 1 0 1 0 0 1 1 1 1 1 1 0 1 1 0
3: 0 0 1 1 1 0 0 0 0 1 1 1 1 1 1 1 1 1 0 1
1 2 3
334 333 333
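Unlike the crisp partitions above, fanny also quantifies the degree to which each paper belongs to each cluster; membership and coeff are standard components of its return value. A sketch:
head(ca$p_cluster$fanny$membership)  # per-paper memberships; rows sum to 1
ca$p_cluster$fanny$coeff             # Dunn's partition coefficient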
mona - MONothetic Analysis Clustering of Binary Variables
ca$p_cluster$mona_cluster_sizes
plot(ca$p_cluster$mona)
pam - Partitioning Around Medoids
Partitioning (clustering) of the data into k clusters “around medoids”, a more robust version of K-means.
ca$cluster_centers
ca$p_cluster$pam$medoids
ca$p_cluster$pam$clusinfo
plot(ca$p_cluster$pam)
fviz_cluster(ca$p_cluster$epam) # From package factoextra
fviz_silhouette(ca$p_cluster$epam) # From package factoextra
m1 m2 m3 m4 m5 m6 m7 m8 m9 m10 m11 m12 m13 m14 m15 m16 m17 m18 m19 m20
1: 1 1 0 0 0 0 1 1 1 1 1 0 0 0 1 0 1 1 1 0
2: 1 1 0 1 0 1 0 1 0 0 1 1 1 1 1 1 0 1 1 0
3: 0 0 1 1 1 0 0 0 0 1 1 1 1 1 1 1 1 1 0 1
m1 m2 m3 m4 m5 m6 m7 m8 m9 m10 m11 m12 m13 m14 m15 m16 m17 m18 m19 m20
[1,] 1 1 0 1 0 1 0 1 0 0 1 1 1 1 1 1 0 1 1 0
[2,] 1 1 0 0 0 0 1 1 1 1 1 0 0 0 1 0 1 1 1 0
[3,] 0 0 1 1 1 0 0 0 0 1 1 1 1 1 1 1 1 1 0 1
size max_diss av_diss diameter separation
[1,] 333 2.645751 1.3260589 3.464102 1.732051
[2,] 333 2.236068 1.0075124 3.162278 2.000000
[3,] 334 2.449490 0.8338874 3.162278 1.732051
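pam also stores silhouette information summarizing how well each paper sits in its cluster (silinfo is part of the pam return value for k > 1). A sketch:
ca$p_cluster$pam$silinfo$avg.width  # average silhouette width over all papers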
Package fpc
dbscan - DBSCAN density reachability and connectivity clustering
ca$p_fpc$dbscan
summary(ca$p_fpc$dbscan)
plot(ca$p_fpc$dbscan, ca$dataset)
fviz_cluster(ca$p_fpc$dbscan, ca$dataset) # From package factoextra
dbscan Pts=1000 MinPts=10 eps=1e-06
0 1 2 3 4 5 6 7
border 743 0 0 0 0 0 0 0
seed 0 34 105 10 76 12 10 10
total 743 34 105 10 76 12 10 10
Length Class Mode
cluster 1000 -none- numeric
eps 1 -none- numeric
MinPts 1 -none- numeric
isseed 1000 -none- logical
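The choice eps = 1e-06 means that two papers are neighbors only if their coordinates coincide exactly, so each dbscan cluster above is a group of at least MinPts identical rows, and the 743 papers labeled 0 are left unclassified. A sketch that counts such groups directly (our addition; it should roughly match the number of clusters dbscan reports):
# Distinct rows of the dataset that occur at least MinPts = 10 times
sum(table(do.call(paste, ca$dataset)) >= 10)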
Package mclust
Mclust - Model-Based Clustering
ca$cluster_centers
summary(ca$p_mclust$mclust)
plot(ca$p_mclust$mclust)
fviz_cluster(ca$p_mclust$mclust) # From package factoextra
fviz_mclust(ca$p_mclust$mclust) # From package factoextra
plot(ca$p_mclust$clust_combi)
m1 m2 m3 m4 m5 m6 m7 m8 m9 m10 m11 m12 m13 m14 m15 m16 m17 m18 m19 m20
1: 1 1 0 0 0 0 1 1 1 1 1 0 0 0 1 0 1 1 1 0
2: 1 1 0 1 0 1 0 1 0 0 1 1 1 1 1 1 0 1 1 0
3: 0 0 1 1 1 0 0 0 0 1 1 1 1 1 1 1 1 1 0 1
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm
----------------------------------------------------
Mclust EEV (ellipsoidal, equal volume and shape) model with 7 components:
log-likelihood n df BIC ICL
4143.117 1000 1496 -2047.769 -2193.05
Clustering table:
1 2 3 4 5 6 7
132 146 87 198 195 121 121
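Mclust selected the EEV model with 7 components by maximizing BIC over a grid of candidate models; the scores it compared can be inspected directly. A sketch:
ca$p_mclust$mclust$BIC                  # BIC for every model/component pair
plot(ca$p_mclust$mclust, what = "BIC")  # the same comparison as a plot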
hc - Model-based Agglomerative Hierarchical Clustering
ca$cluster_centers
ca$p_mclust$hc_cluster_sizes
plot(ca$p_mclust$hc)
m1 m2 m3 m4 m5 m6 m7 m8 m9 m10 m11 m12 m13 m14 m15 m16 m17 m18 m19 m20
1: 1 1 0 0 0 0 1 1 1 1 1 0 0 0 1 0 1 1 1 0
2: 1 1 0 1 0 1 0 1 0 0 1 1 1 1 1 1 0 1 1 0
3: 0 0 1 1 1 0 0 0 0 1 1 1 1 1 1 1 1 1 0 1
1 2 3
819 105 76
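Within package mclust, hc is primarily intended to initialize the EM algorithm behind Mclust rather than to act as a standalone classifier, which may partly explain the lopsided sizes above. The hclass helper used in Step 4 can extract partitions for several group counts at once; a sketch:
head(hclass(ca$p_mclust$hc, G = 2:4))  # one column of labels per value of G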
hcVVV - Model-based Hierarchical Clustering
Agglomerative hierarchical clustering based on maximum likelihood for a Gaussian mixture model parameterized by eigenvalue decomposition.
ca$cluster_centers
ca$p_mclust$hcvvv_cluster_sizes
plot(ca$p_mclust$hcvvv)
m1 m2 m3 m4 m5 m6 m7 m8 m9 m10 m11 m12 m13 m14 m15 m16 m17 m18 m19 m20
1: 1 1 0 0 0 0 1 1 1 1 1 0 0 0 1 0 1 1 1 0
2: 1 1 0 1 0 1 0 1 0 0 1 1 1 1 1 1 0 1 1 0
3: 0 0 1 1 1 0 0 0 0 1 1 1 1 1 1 1 1 1 0 1
1 2 3
819 105 76
Package pvclust
pvclust - Calculating P-values for Hierarchical Clustering
Calculates p-values for hierarchical clustering via multiscale bootstrap resampling. Hierarchical clustering is done for the given data, and p-values are computed for each of the clusters.
ca$cluster_centers
ca$p_pvclust$pvclust_cluster_sizes
summary(ca$p_pvclust$pvclust)
plot(ca$p_pvclust$pvclust)
m1 m2 m3 m4 m5 m6 m7 m8 m9 m10 m11 m12 m13 m14 m15 m16 m17 m18 m19 m20
1: 1 1 0 0 0 0 1 1 1 1 1 0 0 0 1 0 1 1 1 0
2: 1 1 0 1 0 1 0 1 0 0 1 1 1 1 1 1 0 1 1 0
3: 0 0 1 1 1 0 0 0 0 1 1 1 1 1 1 1 1 1 0 1
1 2 3
333 333 334
Length Class Mode
hclust 7 hclust list
edges 9 data.frame list
count 10 data.frame list
msfit 999 mslist list
nboot 10 -none- numeric
r 10 -none- numeric
store 10 -none- list
version 1 package_version list
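pvclust's distinguishing feature is the bootstrap p-value attached to each branch; the package also ships helpers, not used above, to highlight or extract the strongly supported branches. A sketch:
plot(ca$p_pvclust$pvclust)                  # dendrogram annotated with p-values
pvrect(ca$p_pvclust$pvclust, alpha = 0.95)  # boxes around well-supported clusters
pvpick(ca$p_pvclust$pvclust, alpha = 0.95)  # the same clusters as a list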
Package stats
hclust - Hierarchical cluster analysis on a set of dissimilarities
ca$p_stats$ehclust_cluster_sizes
plot(ca$p_stats$ehclust)
fviz_cluster(ca$p_stats$ehclust) # From package factoextra
fviz_dend(ca$p_stats$ehclust) # From package factoextra
fviz_silhouette(ca$p_stats$ehclust) # From package factoextra
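The commented-out call in Step 4 failed because base hclust, unlike eclust, does not accept a raw data table: it requires a dissimilarity object. A minimal sketch of the direct route, using the binary metric of dist (our assumption; the analysis above relies on eclust instead):
d <- dist(ca$dataset, method = "binary")  # Jaccard-style distance for 0/1 data
h <- hclust(d, method = "average")
table(cutree(h, k = 3))                   # cluster sizes for a 3-way cut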
kmeans - K-Means Clustering
ca$p_stats$kmeans_cluster_sizes
fviz_cluster(ca$p_stats$kmeans, data = ca$dataset) # From package factoextra
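Note that the kmeans call in Step 4 was seeded with the true generating centers, so its perfect sizes are partly by construction. A more neutral sketch (our variation) lets kmeans start from 25 random configurations and keep the best:
km <- kmeans(ca$dataset, centers = 3, nstart = 25)
table(km$cluster)  # sizes found without knowledge of the true centers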
ca$p_stats$ekmeans_cluster_sizes
fviz_cluster(ca$p_stats$ekmeans, data = ca$dataset) # From package factoextra
fviz_silhouette(ca$p_stats$ekmeans) # From package factoextra
Discussion
The cluster plots presented above, which are based on principal component analysis, are useful visual tools for understanding the classification of the data into clusters. Given that it is not easy to imagine a space of 20 dimensions, the computer takes on the work of reorganizing the data into a new coordinate system (the principal components), of which only the two most important dimensions are shown in the figure. These are the two dimensions along which the variability of the data is most prominent, making the differences among the 3 clusters quite apparent. I would suggest that this way of visualizing the data is more useful than the dendrograms and the silhouette plots, which try to show the differences among the clusters in other ways.
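For readers who want to look at those two principal components directly, a minimal sketch using base R's prcomp (our addition, not part of the workflow above):
pca <- prcomp(ca$dataset)
summary(pca)$importance[, 1:2]  # variance explained by the first two components
plot(pca$x[, 1:2], col = ca$p_stats$ekmeans$cluster)  # papers colored by cluster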
The output of mona was the most difficult to interpret, in my opinion. You have to identify the three clusters yourself, after assuming that the output figure wraps around in the vertical direction. The indices into the data are organized from top to bottom, and some guesswork is needed to identify the indices that correspond to the points on the right, which tell you where one cluster ends and the next begins. Sadly, mona and factoextra do not provide any cluster plots, banner plots, dendrograms, or silhouette plots for this case.
Conclusion
Instead of limiting ourselves to the guesswork involved in the mona analysis, our best approach is to let R treat our binary dataset as real-valued data and perform the rich, visual studies that are provided by factoextra and other facilities.
The best methods for our simulated dataset are agnes, diana, hclust, and kmeans, enhanced through the eclust function from the factoextra package. I would say they are the best because they correctly determine that the dataset should be classified into 3 clusters of approximately 333 points each, and also because all of them come with cluster plots, which facilitate the analysis and discussion of the classification.
Credits
Featured photo: Bird flock, cloud, sky by Mohamed Hassan, available here. Licensed under CC0 public domain license, free for personal & commercial use, no attribution required.