Cluster Analysis of Binary Data


Introduction

Cluster analysis is the statistical problem of classifying data into groups according to similarity. For example, birds can be classified into flocks depending on their physical position and how close they are to one another. There are multiple ways to classify the data and to decide how many clusters should be identified, and choosing one classification over another can be a difficult problem. In this article, we compare cluster analysis techniques for binary data.

Cluster analysis is a common technique for exploratory data analysis, used in fields such as pattern recognition, machine learning, and bioinformatics. The algorithms aim to be versatile, suitable for any data type. In this article, we show that not all algorithms are suitable for classifying a given dataset, and demonstrate how building a simulated dataset can help in choosing an appropriate algorithm.

We identified more than a dozen cluster analysis functions by different contributors to the R statistical language. This article does not aim to be a thorough description or comparison of those functions, and instead focuses on showing the construction of the simulated dataset and the results of as many of those functions as possible.

The simulation generated 3 clusters of approximately 333 points each; the resulting dataset had 1000 points defined in a space of 20 binary dimensions. Then, the R functions for cluster analysis were run. Some algorithms, such as agnes, required us to specify the number of clusters; others, such as ekmeans, were able to determine the number of clusters by themselves. We also found that most algorithms, like agnes and ekmeans, classified the dataset into appropriately sized clusters, while others, such as dbscan and hc, produced the wrong number of clusters or the wrong sizes. Finally, the results of some algorithms were not easy to process and interpret, because appropriate statistical or graphical tools for analysis do not exist. That is the case for mona, an algorithm designed to work with binary variables only: we had to use some guesswork to determine the size of its clusters, and still got them wrong.

In conclusion, we found that several algorithms were unsuitable for classifying binary data. Trials with simulated datasets are fundamental for choosing an algorithm that can produce, and clearly illustrate, a correct classification of binary data.

Options for Cluster Analysis

Here is a list of facilities to perform cluster analysis in R. The package cluster offers agnes, clara, diana, fanny, mona, and pam. Package fpc offers dbscan. Package mclust offers mclust, hc, and hcvvv. Package pvclust offers pvclust, and package stats offers hclust and kmeans.

Package factoextra enhances several of the facilities listed above with additional types of plots and automatic determination of the number of clusters.
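
For instance, the automatic determination can be previewed with factoextra's fviz_nbclust function. Here is a minimal sketch; the data frame df is a placeholder for any numeric dataset, not an object defined in this article:

library(factoextra)
# Suggest a number of clusters for kmeans using the average silhouette width.
# df is a hypothetical numeric data frame.
fviz_nbclust(df, kmeans, method = "silhouette")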

Requirements and Installation

The comparison only requires a working R installation with the packages mentioned above and a few additional utilities. If you are starting with a fresh setup, you can use the following command:

install.packages(c("rlang", "data.table", "ggplot2", "Cairo",
"cowplot", "cluster", "factoextra", "fpc", "mclust", "pvclust"))

Let’s get started

We first generate a simulated dataset, then run all the cluster analysis algorithms on it, saving the results for later.

for (package_name in c("rlang", "data.table", "ggplot2", "Cairo",
"cowplot", "cluster", "factoextra", "fpc", "mclust", "pvclust")) {
  require(package_name, character.only = TRUE)
  message("Package ", package_name, " version ", packageVersion(package_name),
  " released ", packageDate(package_name))
}

# Example use case for our binary dataset:
# Each row in this dataset represents a scientific paper.
# A paper is a point in the dataset space.
# Each column represents a certain analysis methodology.
# A column is a dimension in the dataset space.
# Each value is a 0 or 1:
#  1: The paper used the given methodology
#  0: Otherwise

message("Step 1: Definitions and preparation")
set.seed(1000)
ca <- list()
# Number of rows: number of papers; points in the dataset space.
ca$n_papers <- 1000
ca$n_methodologies <- 20 # Number of columns; dimension of the dataset space
ca$n_clusters <- 3 # Number of clusters to generate
# Minimum radius of a cluster in dataset space.
# Should be less than sqrt(ca$n_methodologies).
ca$min_cluster_radius <- 1.0
# Maximum radius of a cluster in dataset space.
# Should be less than sqrt(ca$n_methodologies).
ca$max_cluster_radius <- 1.6
tmp <- list()

stopifnot(ca$n_papers > 100)
stopifnot(ca$n_methodologies > 5)
stopifnot(ca$n_clusters > 1)
stopifnot(ca$min_cluster_radius >= 1)
stopifnot(ca$min_cluster_radius < ca$max_cluster_radius)
if (2 * ca$max_cluster_radius > sqrt(ca$n_methodologies)) {
  cat("Warning: The cluster radius should be reasonably low. ",
  "Please review:\nca$radius =",
  ca$radius, "| Recommendation < 0.5 * sqrt(ca$n_methodologies =",
  0.5 * sqrt(ca$n_methodologies), "\n")
}

message("Step 2: The dataset starts as a matrix of zeroes.")
tmp$zeroes <- rep(0, ca$n_papers)
# Dataset header, as in methodology1, methodology2, etc.
tmp$methodologies <- paste0("m", 1:(ca$n_methodologies))
ca$dataset <- data.table()
ca$dataset[, tmp$methodologies := tmp$zeroes]

message(
  "Step 3: We build clusters at random points in the space of the dataset.")
ca$cluster_radii <-  runif(n = ca$n_clusters, min = ca$min_cluster_radius,
max = ca$max_cluster_radius)
ca$cluster_centers <- as.data.table(matrix(rbinom(n = ca$n_clusters *
ca$n_methodologies, size = 1, prob = 0.5), nrow = ca$n_clusters,
ncol = ca$n_methodologies))
setnames(ca$cluster_centers, tmp$methodologies)
tmp$limits <- 1 + as.integer(0:(ca$n_clusters - 1) * ca$n_papers /
ca$n_clusters) # Paper index at the start of each cluster
stopifnot(tmp$limits < ca$n_papers)
tmp$cluster_sizes <- diff(c(tmp$limits, ca$n_papers + 1))
ca$cluster_sizes <- data.table(Original =
sort(tmp$cluster_sizes))
# Sorts column x and adds it to ca$cluster_sizes by reference,
# resizing dt if needed, padding with NAs if necessary. No data recycling.
new_sizes <- function(x_name, x) {
  sx <- seq_along(x)
  if (length(x) > nrow(ca$cluster_sizes)) {
    ca$cluster_sizes <<- ca$cluster_sizes[sx]
  }
  ca$cluster_sizes[sx, c(x_name) := sort(x)]
}

tmp$limits <- c(tmp$limits, 0)
tmp$random_index <- sample.int(ca$n_papers)
tmp$limit <- 1
tmp$cluster_index <- 0
for (paper_index in 1:(ca$n_papers)) {
  if (paper_index == tmp$limit) {
    tmp$cluster_index <- tmp$cluster_index + 1
    if (tmp$cluster_index <= ca$n_clusters) {
      tmp$limit <- tmp$limits[tmp$cluster_index + 1]
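      # The per-coordinate bit-flip probability is radius^2 / n_methodologies,
      # so the expected squared distance from the center equals the squared
      # cluster radius.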
      tmp$prob <- ca$cluster_radii[tmp$cluster_index] *
      ca$cluster_radii[tmp$cluster_index] / ca$n_methodologies
      tmp$center <- ca$cluster_centers[tmp$cluster_index]
      message("Cluster ", tmp$cluster_index, " has size ",
      tmp$cluster_sizes[tmp$cluster_index], ", radius ", tmp$prob,
      " and the following center:")
      show(tmp$center)
    }
  }

  ca$dataset[tmp$random_index[paper_index], ] <- (tmp$center +
      rbinom(n = ca$n_methodologies, size = 1, prob = tmp$prob)
      # A random point near the center of the cluster is defined by flipping
      # some of the bits that constitute the coordinates.
    ) %% 2
}
message("Here is an excerpt of the dataset:")
show(ca$dataset)

message("Step 4: We run the clustering algorithms and save the results.")
ca$p_cluster <- list()
ca$p_fpc <- list()
ca$p_mclust <- list()
ca$p_pvclust <- list()
ca$p_stats <- list()

ca$p_cluster$agnes <- agnes(ca$dataset)
ca$p_cluster$agnes_cluster_sizes <- table(cutree(ca$p_cluster$agnes, k = 3))
new_sizes("agnes", ca$p_cluster$agnes_cluster_sizes)
ca$p_cluster$eagnes <- eclust(ca$dataset, "agnes", graph = FALSE)
ca$p_cluster$clara <- clara(ca$dataset, k = 3, samples = 100)
ca$p_cluster$clara_cluster_sizes <- ca$p_cluster$clara$clusinfo[, "size"]
new_sizes("clara", ca$p_cluster$clara_cluster_sizes)
ca$p_cluster$eclara <- eclust(ca$dataset, "clara", graph = FALSE, k = 3,
samples = 100)
ca$p_cluster$diana <- diana(ca$dataset)
ca$p_cluster$diana_cluster_sizes <- table(cutree(ca$p_cluster$diana, k = 3))
new_sizes("diana", ca$p_cluster$diana_cluster_sizes)
ca$p_cluster$ediana <- eclust(ca$dataset, "diana", graph = FALSE)
ca$p_cluster$fanny <- fanny(ca$dataset, k = 3, tol = 1e-6)
ca$p_cluster$fanny_cluster_sizes <- table(ca$p_cluster$fanny$clustering)
new_sizes("fanny", ca$p_cluster$fanny_cluster_sizes)
ca$p_cluster$efanny <- eclust(ca$dataset, "fanny", graph = FALSE, k = 3,
tol = 1e-6)
ca$p_cluster$mona <- mona(ca$dataset)
tmp$mona <- list()
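# mona's step vector was matched by eye against its banner plot to locate
# the cluster boundaries. The offsets below (elements 4:5, and the
# hardcoded 1001 = ca$n_papers + 1) are guesswork specific to this run,
# as discussed later in the article.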
tmp$mona$raw <- which(ca$p_cluster$mona$step == 20)
tmp$mona$indices <- c(0, tmp$mona$raw[4:5] - tmp$mona$raw[1], 1001)
ca$p_cluster$mona_cluster_sizes <- diff(tmp$mona$indices)
new_sizes("mona", ca$p_cluster$mona_cluster_sizes)
ca$p_cluster$pam <- pam(ca$dataset, k = 3)
ca$p_cluster$pam$pam_cluster_sizes <- ca$p_cluster$pam$clusinfo[, "size"]
new_sizes("pam", ca$p_cluster$pam$pam_cluster_sizes)
ca$p_cluster$epam <- eclust(ca$dataset, "pam", graph = FALSE, k = 3)
ca$p_fpc$dbscan <- dbscan(ca$dataset, eps = 0.000001, MinPts = 10)
ca$p_fpc$dbscan_cluster_sizes <- table(ca$p_fpc$dbscan$cluster)
new_sizes("dbscan", ca$p_fpc$dbscan_cluster_sizes)
ca$p_mclust$mclust <- Mclust(ca$dataset)
ca$p_mclust$mclust_cluster_sizes <- table(ca$p_mclust$mclust$classification)
new_sizes("mclust", ca$p_mclust$mclust_cluster_sizes)
ca$p_mclust$clust_combi <- clustCombi(ca$p_mclust$mclust, ca$dataset)
ca$p_mclust$hc <- hc(ca$dataset)
ca$p_mclust$hc_cluster_sizes <- table(hclass(ca$p_mclust$hc, 3))
new_sizes("hc", ca$p_mclust$hc_cluster_sizes)
ca$p_mclust$hcvvv <- hcVVV(ca$dataset)
ca$p_mclust$hcvvv_cluster_sizes <- table(hclass(ca$p_mclust$hcvvv, 3))
new_sizes("hcvvv", ca$p_mclust$hcvvv_cluster_sizes)
ca$p_pvclust$pvclust <- pvclust(t(ca$dataset), parallel = TRUE)
ca$p_pvclust$pvclust_cluster_sizes <- table(cutree(ca$p_pvclust$pvclust$hclust,
k = 3))
new_sizes("pvclust", ca$p_pvclust$pvclust_cluster_sizes)
# ca$p_stats$hclust <- hclust(ca$dataset) # Fails: hclust() expects a
# dissimilarity object (e.g. dist(ca$dataset)), not raw data
ca$p_stats$ehclust <- eclust(ca$dataset, "hclust", graph = FALSE)
ca$p_stats$ehclust_cluster_sizes <- table(cutree(ca$p_stats$ehclust, k = 3))
new_sizes("ehclust", ca$p_stats$ehclust_cluster_sizes)
# eclust("kmeans") might warn "did not converge in 10 iterations"
ca$p_stats$ekmeans <- eclust(ca$dataset, "kmeans", graph = FALSE)
ca$p_stats$ekmeans_cluster_sizes <- table(ca$p_stats$ekmeans$cluster)
new_sizes("ekmeans", ca$p_stats$ekmeans_cluster_sizes)
ca$p_stats$kmeans <- kmeans(ca$dataset, centers = ca$cluster_centers)
ca$p_stats$kmeans_cluster_sizes <- table(ca$p_stats$kmeans$cluster)
new_sizes("kmeans", ca$p_stats$kmeans_cluster_sizes)

save(ca, file = "data.RData")
tmp <- NULL

Here is the output of the code:

Loading required package: rlang
Package rlang version 1.0.2 released 2022-03-04
Loading required package: data.table
data.table 1.14.2 using 4 threads (see ?getDTthreads).  Latest news: r-datatable.com

Attaching package: 'data.table'

The following object is masked from 'package:rlang':

    :=

Package data.table version 1.14.2 released 2021-09-23
Loading required package: ggplot2
Package ggplot2 version 3.3.6 released 2022-04-27
Loading required package: Cairo
Package Cairo version 1.5.15 released 2022-03-16
Loading required package: cowplot
Package cowplot version 1.1.1 released 2020-12-15
Loading required package: cluster
Package cluster version 2.1.3 released 2022-03-28
Loading required package: factoextra
Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
Package factoextra version 1.0.7 released 2020-04-01
Loading required package: fpc
Package fpc version 2.2.9 released 2020-12-06
Loading required package: mclust
    __  ___________    __  _____________
   /  |/  / ____/ /   / / / / ___/_  __/
  / /|_/ / /   / /   / / / /\__ \ / /
 / /  / / /___/ /___/ /_/ /___/ // /
/_/  /_/\____/_____/\____//____//_/    version 5.4.9
Type 'citation("mclust")' for citing this R package in publications.
Package mclust version 5.4.9 released 2021-12-17
Loading required package: pvclust
Package pvclust version 2.2.0 released 2019-11-19
Step 1: Definitions and preparation
Step 2: The dataset starts as a matrix of zeroes.
Step 3: We build clusters at random points in the space of the dataset.

Cluster 1 has size 333, bit-flip probability 0.0716078026473276 and the following center:
   m1 m2 m3 m4 m5 m6 m7 m8 m9 m10 m11 m12 m13 m14 m15 m16 m17 m18 m19 m20
1:  1  1  0  0  0  0  1  1  1   1   1   0   0   0   1   0   1   1   1   0

Cluster 2 has size 333, bit-flip probability 0.105896052986884 and the following center:
   m1 m2 m3 m4 m5 m6 m7 m8 m9 m10 m11 m12 m13 m14 m15 m16 m17 m18 m19 m20
1:  1  1  0  1  0  1  0  1  0   0   1   1   1   1   1   1   0   1   1   0

Cluster 3 has size 334, bit-flip probability 0.0570698508073141 and the following center:
   m1 m2 m3 m4 m5 m6 m7 m8 m9 m10 m11 m12 m13 m14 m15 m16 m17 m18 m19 m20
1:  0  0  1  1  1  0  0  0  0   1   1   1   1   1   1   1   1   1   0   1

Here is an excerpt of the dataset:
      m1 m2 m3 m4 m5 m6 m7 m8 m9 m10 m11 m12 m13 m14 m15 m16 m17 m18 m19 m20
   1:  1  1  1  1  0  1  0  1  0   1   1   1   1   1   1   1   0   1   0   0
   2:  1  1  0  1  0  1  0  1  0   0   1   1   1   1   1   1   0   1   1   0
   3:  1  1  0  0  0  0  1  1  1   1   1   0   1   0   0   0   1   1   1   1
   4:  1  1  0  0  0  0  1  1  1   1   0   0   0   0   1   0   1   1   1   0
   5:  1  1  0  1  0  1  0  1  0   0   1   1   1   1   1   1   0   1   1   0
  ---                                                                       
 996:  1  1  0  1  1  1  0  1  0   0   1   1   1   1   1   1   0   1   1   0
 997:  0  0  1  1  1  0  0  0  0   1   1   1   1   1   1   1   1   1   0   1
 998:  1  1  1  1  0  1  0  1  0   0   1   1   1   1   1   1   0   1   0   1
 999:  1  1  0  1  0  1  0  1  0   0   1   0   1   1   0   1   1   1   1   0
1000:  1  1  0  0  0  0  1  1  1   1   1   0   0   1   1   0   1   0   1   0
Step 4: We run the clustering algorithms and save the results.
Clustering k = 1,2,..., K.max (= 10): .. done
Bootstrapping, b = 1,2,..., B (= 100)  [one "." per sample]:
.................................................. 50 
.................................................. 100 
Clustering k = 1,2,..., K.max (= 10): .. done
Bootstrapping, b = 1,2,..., B (= 100)  [one "." per sample]:
.................................................. 50 
.................................................. 100 
fitting ...
  |====================================================================================================| 100%
Bootstrap (r = 0.5)... Done.
Bootstrap (r = 0.6)... Done.
Bootstrap (r = 0.7)... Done.
Bootstrap (r = 0.8)... Done.
Bootstrap (r = 0.9)... Done.
Bootstrap (r = 1.0)... Done.
Bootstrap (r = 1.1)... Done.
Bootstrap (r = 1.2)... Done.
Bootstrap (r = 1.3)... Done.
Bootstrap (r = 1.4)... Done.
Clustering k = 1,2,..., K.max (= 10): .. done
Bootstrapping, b = 1,2,..., B (= 100)  [one "." per sample]:
.................................................. 50 
.................................................. 100 
Clustering k = 1,2,..., K.max (= 10): .. done
Bootstrapping, b = 1,2,..., B (= 100)  [one "." per sample]:
.................................................. 50 
.................................................. 100 
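
Because the dataset is simulated, the ground-truth partition is known, so each method could also be scored numerically. The following sketch is not part of the run above; it assumes the true label of each paper was recorded inside the Step 3 loop, for example with ca$true_labels[tmp$random_index[paper_index]] <- tmp$cluster_index:

# Adjusted Rand index between the assumed true partition and a recovered
# one: 1 means a perfect match; values near 0 mean a random assignment.
# adjustedRandIndex() comes with the mclust package loaded above;
# ca$true_labels is hypothetical (see the note above).
adjustedRandIndex(ca$true_labels, ca$p_stats$kmeans$cluster)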

We begin by comparing the cluster sizes found by the different methods in the following table:

ca$cluster_sizes
   Original agnes clara diana fanny mona pam dbscan mclust  hc hcvvv pvclust
1:      333   333   333   311   333  262 333     10     87  76    76     333
2:      333   333   333   334   333  339 333     10    121 105   105     333
3:      334   334   334   355   334  400 334     10    121 819   819     334
4:       NA    NA    NA    NA    NA   NA  NA     12    132  NA    NA      NA
5:       NA    NA    NA    NA    NA   NA  NA     34    146  NA    NA      NA
6:       NA    NA    NA    NA    NA   NA  NA     76    195  NA    NA      NA
7:       NA    NA    NA    NA    NA   NA  NA    105    198  NA    NA      NA
   ehclust ekmeans kmeans
1:     332     333    333
2:     333     333    333
3:     335     334    334
4:      NA      NA     NA
5:      NA      NA     NA
6:      NA      NA     NA
7:      NA      NA     NA

This table shows that the dbscan and mclust algorithms are not appropriate for the classification of our simulated data, given that they identify a large number of clusters. The diana, mona, hc, and hcvvv algorithms produce incorrect cluster sizes.

The other algorithms produce clusters of appropriate sizes. Bear in mind that package factoextra automatically determined the number of clusters for some of them, while for others we had to provide the number of clusters ourselves.

In the next few sections, we will show the graphical output of the different types of cluster analysis.

Package cluster

agnes - Agglomerative Nesting (Hierarchical Clustering)

ca$p_cluster$agnes_cluster_sizes
plot(ca$p_cluster$agnes)
fviz_cluster(ca$p_cluster$eagnes)    # From package factoextra
fviz_dend(ca$p_cluster$eagnes)       # From package factoextra
fviz_silhouette(ca$p_cluster$eagnes) # From package factoextra
  1   2   3 
334 333 333
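
agnes uses average linkage by default. As a quick sketch (not part of the saved results), another linkage can be tried and tabulated the same way:

# Ward linkage is another common choice for agglomerative clustering.
tmp_agnes_ward <- agnes(ca$dataset, method = "ward")
table(cutree(tmp_agnes_ward, k = 3))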

clara - Clustering Large Applications - Clustering of the Data Into k Clusters

ca$cluster_centers
ca$p_cluster$clara$medoids
ca$p_cluster$clara$clusinfo
   m1 m2 m3 m4 m5 m6 m7 m8 m9 m10 m11 m12 m13 m14 m15 m16 m17 m18 m19 m20
1:  1  1  0  0  0  0  1  1  1   1   1   0   0   0   1   0   1   1   1   0
2:  1  1  0  1  0  1  0  1  0   0   1   1   1   1   1   1   0   1   1   0
3:  0  0  1  1  1  0  0  0  0   1   1   1   1   1   1   1   1   1   0   1

     m1 m2 m3 m4 m5 m6 m7 m8 m9 m10 m11 m12 m13 m14 m15 m16 m17 m18 m19 m20
[1,]  1  1  0  1  0  1  0  1  0   0   1   1   1   1   1   1   0   1   1   0
[2,]  1  1  0  0  0  0  1  1  1   1   1   0   0   0   1   0   1   1   1   0
[3,]  0  0  1  1  1  0  0  0  0   1   1   1   1   1   1   1   1   1   0   1

     size max_diss   av_diss isolation
[1,]  333 2.645751 1.3260589 0.8366600
[2,]  333 2.236068 1.0075124 0.7071068
[3,]  334 2.449490 0.8338874 0.7745967
plot(ca$p_cluster$clara)
fviz_cluster(ca$p_cluster$eclara)    # From package factoextra
fviz_silhouette(ca$p_cluster$eclara) # From package factoextra

diana - DIvisive ANAlysis Clustering

ca$cluster_centers
ca$p_cluster$diana_cluster_sizes
plot(ca$p_cluster$diana)
fviz_cluster(ca$p_cluster$ediana)    # From package factoextra
fviz_dend(ca$p_cluster$ediana)       # From package factoextra
fviz_silhouette(ca$p_cluster$ediana) # From package factoextra
   m1 m2 m3 m4 m5 m6 m7 m8 m9 m10 m11 m12 m13 m14 m15 m16 m17 m18 m19 m20
1:  1  1  0  0  0  0  1  1  1   1   1   0   0   0   1   0   1   1   1   0
2:  1  1  0  1  0  1  0  1  0   0   1   1   1   1   1   1   0   1   1   0
3:  0  0  1  1  1  0  0  0  0   1   1   1   1   1   1   1   1   1   0   1

  1   2   3 
355 311 334

fanny - Fuzzy clustering of the data into k clusters.

ca$cluster_centers
ca$p_cluster$fanny_cluster_sizes
plot(ca$p_cluster$fanny)
fviz_cluster(ca$p_cluster$efanny)    # From package factoextra
fviz_silhouette(ca$p_cluster$efanny) # From package factoextra
   m1 m2 m3 m4 m5 m6 m7 m8 m9 m10 m11 m12 m13 m14 m15 m16 m17 m18 m19 m20
1:  1  1  0  0  0  0  1  1  1   1   1   0   0   0   1   0   1   1   1   0
2:  1  1  0  1  0  1  0  1  0   0   1   1   1   1   1   1   0   1   1   0
3:  0  0  1  1  1  0  0  0  0   1   1   1   1   1   1   1   1   1   0   1

  1   2   3 
334 333 333

mona - MONothetic Analysis Clustering of Binary Variables

ca$p_cluster$mona_cluster_sizes
plot(ca$p_cluster$mona)
339 400 262

pam - Partitioning Around Medoids

Partitioning (clustering) of the data into k clusters “around medoids”, a more robust version of K-means.

ca$cluster_centers
ca$p_cluster$pam$medoids
ca$p_cluster$pam$clusinfo
plot(ca$p_cluster$pam)
fviz_cluster(ca$p_cluster$epam) # From package factoextra
fviz_silhouette(ca$p_cluster$epam) # From package factoextra
   m1 m2 m3 m4 m5 m6 m7 m8 m9 m10 m11 m12 m13 m14 m15 m16 m17 m18 m19 m20
1:  1  1  0  0  0  0  1  1  1   1   1   0   0   0   1   0   1   1   1   0
2:  1  1  0  1  0  1  0  1  0   0   1   1   1   1   1   1   0   1   1   0
3:  0  0  1  1  1  0  0  0  0   1   1   1   1   1   1   1   1   1   0   1

     m1 m2 m3 m4 m5 m6 m7 m8 m9 m10 m11 m12 m13 m14 m15 m16 m17 m18 m19 m20
[1,]  1  1  0  1  0  1  0  1  0   0   1   1   1   1   1   1   0   1   1   0
[2,]  1  1  0  0  0  0  1  1  1   1   1   0   0   0   1   0   1   1   1   0
[3,]  0  0  1  1  1  0  0  0  0   1   1   1   1   1   1   1   1   1   0   1

     size max_diss   av_diss diameter separation
[1,]  333 2.645751 1.3260589 3.464102   1.732051
[2,]  333 2.236068 1.0075124 3.162278   2.000000
[3,]  334 2.449490 0.8338874 3.162278   1.732051

Package fpc

dbscan - DBSCAN density reachability and connectivity clustering

ca$p_fpc$dbscan
summary(ca$p_fpc$dbscan)
plot(ca$p_fpc$dbscan, ca$dataset)
fviz_cluster(ca$p_fpc$dbscan, ca$dataset) # From package factoextra
dbscan Pts=1000 MinPts=10 eps=1e-06

         0  1   2  3  4  5  6  7
border 743  0   0  0  0  0  0  0
seed     0 34 105 10 76 12 10 10
total  743 34 105 10 76 12 10 10

        Length Class  Mode   
cluster 1000   -none- numeric
eps        1   -none- numeric
MinPts     1   -none- numeric
isseed  1000   -none- logical
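
The collapse into many tiny clusters follows from the choice of eps: on 0/1 data, the Euclidean distance between two distinct rows is at least 1, so eps = 0.000001 only connects identical rows. Here is a sketch with a coarser setting; the value 1.5 is an illustrative guess, not a tuned parameter:

# With eps = 1.5, rows differing in up to two coordinates (distance at
# most sqrt(2)) count as neighbors.
tmp_dbscan <- dbscan(ca$dataset, eps = 1.5, MinPts = 10)
table(tmp_dbscan$cluster)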

Package mclust

Mclust - Model-Based Clustering

ca$cluster_centers
summary(ca$p_mclust$mclust)
plot(ca$p_mclust$mclust)
fviz_cluster(ca$p_mclust$mclust) # From package factoextra
fviz_mclust(ca$p_mclust$mclust) # From package factoextra
plot(ca$p_mclust$clust_combi)
   m1 m2 m3 m4 m5 m6 m7 m8 m9 m10 m11 m12 m13 m14 m15 m16 m17 m18 m19 m20
1:  1  1  0  0  0  0  1  1  1   1   1   0   0   0   1   0   1   1   1   0
2:  1  1  0  1  0  1  0  1  0   0   1   1   1   1   1   1   0   1   1   0
3:  0  0  1  1  1  0  0  0  0   1   1   1   1   1   1   1   1   1   0   1

---------------------------------------------------- 
Gaussian finite mixture model fitted by EM algorithm 
---------------------------------------------------- 

Mclust EEV (ellipsoidal, equal volume and shape) model with 7 components: 

 log-likelihood    n   df       BIC      ICL
       4143.117 1000 1496 -2047.769 -2193.05

Clustering table:
  1   2   3   4   5   6   7 
132 146  87 198 195 121 121
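
Mclust selected 7 components by BIC on its own. As a sketch, the model can also be forced to use the known number of clusters through its G argument:

# Fit the Gaussian mixture model with exactly 3 components.
tmp_mclust3 <- Mclust(ca$dataset, G = 3)
table(tmp_mclust3$classification)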

hc - Model-based Agglomerative Hierarchical Clustering

ca$cluster_centers
ca$p_mclust$hc_cluster_sizes
plot(ca$p_mclust$hc)
   m1 m2 m3 m4 m5 m6 m7 m8 m9 m10 m11 m12 m13 m14 m15 m16 m17 m18 m19 m20
1:  1  1  0  0  0  0  1  1  1   1   1   0   0   0   1   0   1   1   1   0
2:  1  1  0  1  0  1  0  1  0   0   1   1   1   1   1   1   0   1   1   0
3:  0  0  1  1  1  0  0  0  0   1   1   1   1   1   1   1   1   1   0   1

  1   2   3 
819 105  76

hcVVV - Model-based Hierarchical Clustering

Agglomerative hierarchical clustering based on maximum likelihood for a Gaussian mixture model parameterized by eigenvalue decomposition.

ca$cluster_centers
ca$p_mclust$hcvvv_cluster_sizes
plot(ca$p_mclust$hcvvv)
   m1 m2 m3 m4 m5 m6 m7 m8 m9 m10 m11 m12 m13 m14 m15 m16 m17 m18 m19 m20
1:  1  1  0  0  0  0  1  1  1   1   1   0   0   0   1   0   1   1   1   0
2:  1  1  0  1  0  1  0  1  0   0   1   1   1   1   1   1   0   1   1   0
3:  0  0  1  1  1  0  0  0  0   1   1   1   1   1   1   1   1   1   0   1

  1   2   3 
819 105  76

Package pvclust

pvclust - Calculating P-values for Hierarchical Clustering

Calculates p-values for hierarchical clustering via multiscale bootstrap resampling. Hierarchical clustering is done for given data and p-values are computed for each of the clusters.

ca$cluster_centers
ca$p_pvclust$pvclust_cluster_sizes
summary(ca$p_pvclust$pvclust)
plot(ca$p_pvclust$pvclust)
   m1 m2 m3 m4 m5 m6 m7 m8 m9 m10 m11 m12 m13 m14 m15 m16 m17 m18 m19 m20
1:  1  1  0  0  0  0  1  1  1   1   1   0   0   0   1   0   1   1   1   0
2:  1  1  0  1  0  1  0  1  0   0   1   1   1   1   1   1   0   1   1   0
3:  0  0  1  1  1  0  0  0  0   1   1   1   1   1   1   1   1   1   0   1

  1   2   3 
333 333 334

        Length Class           Mode   
hclust    7    hclust          list   
edges     9    data.frame      list   
count    10    data.frame      list   
msfit   999    mslist          list   
nboot    10    -none-          numeric
r        10    -none-          numeric
store    10    -none-          list   
version   1    package_version list
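
Beyond the dendrogram, pvclust can report the clusters whose approximately unbiased (AU) p-value exceeds a threshold. A small sketch:

# Pick the clusters supported at the 0.95 level.
tmp_picked <- pvpick(ca$p_pvclust$pvclust, alpha = 0.95)
lengths(tmp_picked$clusters) # Sizes of the significant clusters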

Package stats

hclust - Hierarchical cluster analysis on a set of dissimilarities

ca$p_stats$ehclust_cluster_sizes
plot(ca$p_stats$ehclust)
fviz_cluster(ca$p_stats$ehclust)    # From package factoextra
fviz_dend(ca$p_stats$ehclust)       # From package factoextra
fviz_silhouette(ca$p_stats$ehclust) # From package factoextra
  1   2   3 
332 333 335
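
The direct call to hclust commented out in Step 4 failed because hclust expects a dissimilarity object rather than raw data. A sketch of the working form:

# hclust operates on a distance matrix, e.g. Euclidean distances.
tmp_hclust <- hclust(dist(ca$dataset))
table(cutree(tmp_hclust, k = 3))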

kmeans - Perform k-means clustering on a data matrix

ca$p_stats$kmeans_cluster_sizes
fviz_cluster(ca$p_stats$kmeans, data = ca$dataset)    # From package factoextra
  1   2   3 
333 333 334

ca$p_stats$ekmeans_cluster_sizes
fviz_cluster(ca$p_stats$ekmeans, data = ca$dataset)    # From package factoextra
fviz_silhouette(ca$p_stats$ekmeans) # From package factoextra
  1   2   3 
333 334 333

Discussion

The cluster plots presented above, which are based on principal component analysis, are useful visual tools for understanding the classification of the data into clusters. Given that it is not easy to imagine a space of 20 dimensions, the computer takes over the work of organizing the data into a new system of coordinates (principal components), of which only the two most important dimensions are shown in the figure. These are the two dimensions along which the variability of the data is most prominent, making the differences among the 3 clusters quite apparent. I would suggest that this way of visualizing the data is more useful than the dendrograms and the silhouette plots, which try to show the differences among the clusters in other ways.
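
To make the idea concrete, here is a minimal sketch of the projection these plots rely on, using only packages loaded earlier. It is an illustration of the principle, not the exact computation performed by factoextra:

# Project the binary data onto its first two principal components and
# color each point by its kmeans assignment.
tmp_pca <- prcomp(ca$dataset)
tmp_proj <- data.table(PC1 = tmp_pca$x[, 1], PC2 = tmp_pca$x[, 2],
cluster = factor(ca$p_stats$kmeans$cluster))
ggplot(tmp_proj, aes(x = PC1, y = PC2, color = cluster)) + geom_point()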

The output of mona was the most difficult to interpret, in my opinion. You have to identify the three clusters yourself, after realizing that the output figure wraps around in the vertical direction. The indices into the data are organized from top to bottom, and some guesswork is needed to identify the indices that correspond to the points on the right, which tell you where one cluster ends and the next begins. Sadly, mona and factoextra do not provide cluster plots, dendrograms, or silhouette plots for this method.

Conclusion

Instead of limiting ourselves to the guesswork involved in the mona analysis, our best approach is to let R treat our binary dataset as real-valued data and perform the rich, visual studies provided by factoextra and other facilities.

The best methods for our simulated dataset are agnes, diana, hclust, and kmeans, enhanced through the eclust function from the factoextra package. I would say they are the best because they correctly determine that the dataset should be classified into 3 clusters of approximately 333 points each, and also because all of them come with cluster plots, which facilitate the analysis and discussion of the classification.

Credits

Featured photo: Bird flock, cloud, sky by Mohamed Hassan, available here. Licensed under CC0 public domain license, free for personal & commercial use, no attribution required.