Cluster Analysis of Binary Data


Introduction

Cluster analysis is the statistical problem of classifying data into groups according to similarity. For example, birds can be classified into flocks based on their physical positions and how close they are to one another. There are multiple ways to classify the data and to decide how many clusters should be identified, and choosing one classification over another can be a difficult problem. In this article, we compare cluster analysis techniques for binary data.

Cluster analysis is a common technique for exploratory data analysis, used in fields such as pattern recognition, machine learning, and bioinformatics. The algorithms aim to be versatile and suitable for any data type. In this article, we show that not all algorithms are suitable for classifying a given dataset, and we demonstrate how building a simulated dataset can help in choosing an appropriate algorithm.

We identified more than a dozen cluster analysis functions by different contributors to the R statistical language. This article does not aim to be a thorough description or comparison of those functions; instead, it focuses on the construction of the simulated dataset and on the results of as many of those functions as possible.

The simulation generated 3 clusters of approximately 333 points each; the resulting dataset had 1000 points defined in a space of 20 binary dimensions. We then ran the R functions for cluster analysis. Some algorithms, such as agnes, required us to specify the number of clusters; others, such as ekmeans, determined the number of clusters by themselves. Most algorithms, like agnes and ekmeans, classified the dataset into appropriately sized clusters, while others, such as dbscan and hc, produced the wrong number of clusters or the wrong sizes. Finally, the results of some algorithms were difficult to process and interpret because appropriate statistical or graphical tools for their analysis do not exist. That is the case for mona, an algorithm designed to work with binary variables only: we had to use some guesswork to determine the sizes of its clusters, and still got them wrong.

In conclusion, we found that several algorithms were unsuitable for classifying binary data. Experiments with simulated datasets are fundamental for choosing an algorithm that produces a correct classification of binary data.

Options for Cluster Analysis

Here is a list of facilities to perform cluster analysis in R. The package cluster offers agnes, clara, diana, fanny, mona, and pam. Package fpc offers dbscan. Package mclust offers mclust, hc, and hcvvv. Package pvclust offers pvclust, and package stats offers hclust and kmeans.

Package factoextra enhances several of the facilities listed above with additional types of plots and automatic determination of the number of clusters.
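For example, fviz_nbclust from factoextra can suggest a number of clusters before any final clustering is run. The following is a minimal sketch of its use on a generic numeric dataset (the built-in mtcars, not our simulated data):

library(factoextra)
# Plot the average silhouette width of kmeans solutions over a range of k;
# the peak marks the suggested number of clusters.
df <- scale(mtcars)
fviz_nbclust(df, kmeans, method = "silhouette")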

Requirements and Installation

The comparison only requires a working R installation with the packages mentioned above and a few additional utilities. If you are starting with a fresh setup, you can use the following command:

install.packages(c("rlang", "data.table", "ggplot2", "Cairo",
"cowplot", "cluster", "factoextra", "fpc", "mclust", "pvclust"))

Let’s get started

We first generate a simulated dataset, run all the cluster analysis algorithms on it, and save the results for later.

for (package_name in c("rlang", "data.table", "ggplot2", "Cairo",
"cowplot", "cluster", "factoextra", "fpc", "mclust", "pvclust")) {
  require(package_name, character.only = TRUE)
  message("Package ", package_name, " version ", packageVersion(package_name),
  " released ", packageDate(package_name))
}

# Example use case for our binary dataset:
# Each row in this dataset represents a scientific paper.
# A paper is a point in the dataset space.
# Each column represents a certain analysis methodology.
# A column is a dimension in the dataset space.
# Each value is a 0 or 1:
#  1: The paper used the given methodology
#  0: Otherwise

message("Step 1: Definitions and preparation")
set.seed(1000)
ca <- list()
# Number of rows: number of papers; points in the dataset space.
ca$n_papers <- 1000
ca$n_methodologies <- 20 # Number of columns; dimension of the dataset space
ca$n_clusters <- 3 # Number of clusters to generate
# Minimum radius of a cluster in dataset space.
# Should be less than sqrt(ca$n_methodologies).
ca$min_cluster_radius <- 1.0
# Maximum radius of a cluster in dataset space.
# Should be less than sqrt(ca$n_methodologies).
ca$max_cluster_radius <- 1.6
tmp <- list()

stopifnot(ca$n_papers > 100)
stopifnot(ca$n_methodologies > 5)
stopifnot(ca$n_clusters > 1)
stopifnot(ca$min_cluster_radius >= 1)
stopifnot(ca$min_cluster_radius < ca$max_cluster_radius)
if (2 * ca$max_cluster_radius > sqrt(ca$n_methodologies)) {
  cat("Warning: The cluster radius should be reasonably low. ",
  "Please review:\nca$radius =",
  ca$radius, "| Recommendation < 0.5 * sqrt(ca$n_methodologies =",
  0.5 * sqrt(ca$n_methodologies), "\n")
}

message("Step 2: The dataset starts as a matrix of zeroes.")
tmp$zeroes <- rep(0, ca$n_papers)
# Dataset header: columns m1, m2, ..., one per methodology.
tmp$methodologies <- paste0("m", 1:(ca$n_methodologies))
ca$dataset <- data.table()
ca$dataset[, tmp$methodologies := tmp$zeroes]

message(
  "Step 3: We build clusters at random points in the space of the dataset.")
ca$cluster_radii <-  runif(n = ca$n_clusters, min = ca$min_cluster_radius,
max = ca$max_cluster_radius)
ca$cluster_centers <- as.data.table(matrix(rbinom(n = ca$n_clusters *
ca$n_methodologies, size = 1, prob = 0.5), nrow = ca$n_clusters,
ncol = ca$n_methodologies))
setnames(ca$cluster_centers, tmp$methodologies)
tmp$limits <- 1 + as.integer(0:(ca$n_clusters - 1) * ca$n_papers /
ca$n_clusters) # Paper index at the start of each cluster
stopifnot(tmp$limits < ca$n_papers)
tmp$cluster_sizes <- diff(c(tmp$limits, ca$n_papers + 1))
ca$cluster_sizes <- data.table(Original =
sort(tmp$cluster_sizes))
# Sorts column x and adds it to ca$cluster_sizes by reference,
# resizing dt if needed, padding with NAs if necessary. No data recycling.
new_sizes <- function(x_name, x) {
  sx <- seq_along(x)
  if (length(x) > nrow(ca$cluster_sizes)) {
    ca$cluster_sizes <<- ca$cluster_sizes[sx]
  }
  ca$cluster_sizes[sx, c(x_name) := sort(x)]
}

tmp$limits <- c(tmp$limits, 0)
tmp$random_index <- sample.int(ca$n_papers)
tmp$limit <- 1
tmp$cluster_index <- 0
for (paper_index in 1:(ca$n_papers)) {
  if (paper_index == tmp$limit) {
    tmp$cluster_index <- tmp$cluster_index + 1
    if (tmp$cluster_index <= ca$n_clusters) {
      tmp$limit <- tmp$limits[tmp$cluster_index + 1]
      tmp$prob <- ca$cluster_radii[tmp$cluster_index] *
      ca$cluster_radii[tmp$cluster_index] / ca$n_methodologies
      tmp$center <- ca$cluster_centers[tmp$cluster_index]
      message("Cluster ", tmp$cluster_index, " has size ",
      tmp$cluster_sizes[tmp$cluster_index], ", radius ", tmp$prob,
      " and the following center:")
      show(tmp$center)
    }
  }

  ca$dataset[tmp$random_index[paper_index], ] <- (tmp$center +
      rbinom(n = ca$n_methodologies, size = 1, prob = tmp$prob)
      # A random point near the center of the cluster is defined by flipping
      # some of the bits that constitute the coordinates.
    ) %% 2
}
message("Here is an excerpt of the dataset:")
show(ca$dataset)

message("Step 4: We run the clustering algorithms and save the results.")
ca$p_cluster <- list()
ca$p_fpc <- list()
ca$p_mclust <- list()
ca$p_pvclust <- list()
ca$p_stats <- list()

ca$p_cluster$agnes <- agnes(ca$dataset)
ca$p_cluster$agnes_cluster_sizes <- table(cutree(ca$p_cluster$agnes, k = 3))
new_sizes("agnes", ca$p_cluster$agnes_cluster_sizes)
ca$p_cluster$eagnes <- eclust(ca$dataset, "agnes", graph = FALSE)
ca$p_cluster$clara <- clara(ca$dataset, k = 3, samples = 100)
ca$p_cluster$clara_cluster_sizes <- ca$p_cluster$clara$clusinfo[, "size"]
new_sizes("clara", ca$p_cluster$clara_cluster_sizes)
ca$p_cluster$eclara <- eclust(ca$dataset, "clara", graph = FALSE, k = 3,
samples = 100)
ca$p_cluster$diana <- diana(ca$dataset)
ca$p_cluster$diana_cluster_sizes <- table(cutree(ca$p_cluster$diana, k = 3))
new_sizes("diana", ca$p_cluster$diana_cluster_sizes)
ca$p_cluster$ediana <- eclust(ca$dataset, "diana", graph = FALSE)
ca$p_cluster$fanny <- fanny(ca$dataset, k = 3, tol = 1e-6)
ca$p_cluster$fanny_cluster_sizes <- table(ca$p_cluster$fanny$clustering)
new_sizes("fanny", ca$p_cluster$fanny_cluster_sizes)
ca$p_cluster$efanny <- eclust(ca$dataset, "fanny", graph = FALSE, k = 3,
tol = 1e-6)
ca$p_cluster$mona <- mona(ca$dataset)
tmp$mona <- list()
# mona() does not report cluster memberships directly, so we inspect its
# step vector and guess where the three clusters begin and end. As noted
# in the introduction, this guesswork still produced the wrong sizes.
tmp$mona$raw <- which(ca$p_cluster$mona$step == 20)
tmp$mona$indices <- c(0, tmp$mona$raw[4:5] - tmp$mona$raw[1], 1001)
ca$p_cluster$mona_cluster_sizes <- diff(tmp$mona$indices)
new_sizes("mona", ca$p_cluster$mona_cluster_sizes)
ca$p_cluster$pam <- pam(ca$dataset, k = 3)
ca$p_cluster$pam_cluster_sizes <- ca$p_cluster$pam$clusinfo[, "size"]
new_sizes("pam", ca$p_cluster$pam_cluster_sizes)
ca$p_cluster$epam <- eclust(ca$dataset, "pam", graph = FALSE, k = 3)
# With eps this small, only exactly coincident points count as neighbors.
ca$p_fpc$dbscan <- dbscan(ca$dataset, eps = 0.000001, MinPts = 10)
ca$p_fpc$dbscan_cluster_sizes <- table(ca$p_fpc$dbscan$cluster)
new_sizes("dbscan", ca$p_fpc$dbscan_cluster_sizes)
ca$p_mclust$mclust <- Mclust(ca$dataset)
ca$p_mclust$mclust_cluster_sizes <- table(ca$p_mclust$mclust$classification)
new_sizes("mclust", ca$p_mclust$mclust_cluster_sizes)
ca$p_mclust$clust_combi <- clustCombi(ca$p_mclust$mclust, ca$dataset)
ca$p_mclust$hc <- hc(ca$dataset)
ca$p_mclust$hc_cluster_sizes <- table(hclass(ca$p_mclust$hc, 3))
new_sizes("hc", ca$p_mclust$hc_cluster_sizes)
ca$p_mclust$hcvvv <- hcVVV(ca$dataset)
ca$p_mclust$hcvvv_cluster_sizes <- table(hclass(ca$p_mclust$hcvvv, 3))
new_sizes("hcvvv", ca$p_mclust$hcvvv_cluster_sizes)
ca$p_pvclust$pvclust <- pvclust(t(ca$dataset), parallel = TRUE)
ca$p_pvclust$pvclust_cluster_sizes <- table(cutree(ca$p_pvclust$pvclust$hclust,
k = 3))
new_sizes("pvclust", ca$p_pvclust$pvclust_cluster_sizes)
# ca$p_stats$hclust <- hclust(ca$dataset) # Fails: hclust() expects a
# dissimilarity matrix, e.g. hclust(dist(ca$dataset))
ca$p_stats$ehclust <- eclust(ca$dataset, "hclust", graph = FALSE)
ca$p_stats$ehclust_cluster_sizes <- table(cutree(ca$p_stats$ehclust, k = 3))
new_sizes("ehclust", ca$p_stats$ehclust_cluster_sizes)
# eclust("kmeans") might warn "did not converge in 10 iterations"
ca$p_stats$ekmeans <- eclust(ca$dataset, "kmeans", graph = FALSE)
ca$p_stats$ekmeans_cluster_sizes <- table(ca$p_stats$ekmeans$cluster)
new_sizes("ekmeans", ca$p_stats$ekmeans_cluster_sizes)
ca$p_stats$kmeans <- kmeans(ca$dataset, centers = ca$cluster_centers)
ca$p_stats$kmeans_cluster_sizes <- table(ca$p_stats$kmeans$cluster)
new_sizes("kmeans", ca$p_stats$kmeans_cluster_sizes)

save(ca, file = "data.RData")
tmp <- NULL
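
A note on the radius parameters: each point is generated by flipping each of the 20 coordinates of its cluster center independently with probability prob = radius^2 / n_methodologies. The number of flipped bits equals the squared Euclidean distance to the center and follows a binomial distribution with mean radius^2, so the radius is the root-mean-square distance from a cluster's points to its center. Here is a minimal sketch that verifies this empirically (the values of r and d match the script above):

r <- 1.6 # ca$max_cluster_radius in the script
d <- 20  # ca$n_methodologies in the script
# Each flipped bit contributes 1 to the squared distance from the center,
# so the mean of 'flips' should be close to r^2 = 2.56.
flips <- rbinom(n = 10000, size = d, prob = r^2 / d)
mean(flips)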

Here is the output of the code:

Loading required package: rlang
Package rlang version 1.0.2 released 2022-03-04
Loading required package: data.table
data.table 1.14.2 using 4 threads (see ?getDTthreads).  Latest news: r-datatable.com

Attaching package: 'data.table'

The following object is masked from 'package:rlang':

    :=

Package data.table version 1.14.2 released 2021-09-23
Loading required package: ggplot2
Package ggplot2 version 3.3.6 released 2022-04-27
Loading required package: Cairo
Package Cairo version 1.5.15 released 2022-03-16
Loading required package: cowplot
Package cowplot version 1.1.1 released 2020-12-15
Loading required package: cluster
Package cluster version 2.1.3 released 2022-03-28
Loading required package: factoextra
Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
Package factoextra version 1.0.7 released 2020-04-01
Loading required package: fpc
Package fpc version 2.2.9 released 2020-12-06
Loading required package: mclust
    __  ___________    __  _____________
   /  |/  / ____/ /   / / / / ___/_  __/
  / /|_/ / /   / /   / / / /\__ \ / /
 / /  / / /___/ /___/ /_/ /___/ // /
/_/  /_/\____/_____/\____//____//_/    version 5.4.9
Type 'citation("mclust")' for citing this R package in publications.
Package mclust version 5.4.9 released 2021-12-17
Loading required package: pvclust
Package pvclust version 2.2.0 released 2019-11-19
Step 1: Definitions and preparation
Step 2: The dataset starts as a matrix of zeroes.
Step 3: We build clusters at random points in the space of the dataset.

Cluster 1 has size 333, bit-flip probability 0.0716078026473276 and the following center:
   m1 m2 m3 m4 m5 m6 m7 m8 m9 m10 m11 m12 m13 m14 m15 m16 m17 m18 m19 m20
1:  1  1  0  0  0  0  1  1  1   1   1   0   0   0   1   0   1   1   1   0

Cluster 2 has size 333, bit-flip probability 0.105896052986884 and the following center:
   m1 m2 m3 m4 m5 m6 m7 m8 m9 m10 m11 m12 m13 m14 m15 m16 m17 m18 m19 m20
1:  1  1  0  1  0  1  0  1  0   0   1   1   1   1   1   1   0   1   1   0

Cluster 3 has size 334, bit-flip probability 0.0570698508073141 and the following center:
   m1 m2 m3 m4 m5 m6 m7 m8 m9 m10 m11 m12 m13 m14 m15 m16 m17 m18 m19 m20
1:  0  0  1  1  1  0  0  0  0   1   1   1   1   1   1   1   1   1   0   1

Here is an excerpt of the dataset:
      m1 m2 m3 m4 m5 m6 m7 m8 m9 m10 m11 m12 m13 m14 m15 m16 m17 m18 m19 m20
   1:  1  1  1  1  0  1  0  1  0   1   1   1   1   1   1   1   0   1   0   0
   2:  1  1  0  1  0  1  0  1  0   0   1   1   1   1   1   1   0   1   1   0
   3:  1  1  0  0  0  0  1  1  1   1   1   0   1   0   0   0   1   1   1   1
   4:  1  1  0  0  0  0  1  1  1   1   0   0   0   0   1   0   1   1   1   0
   5:  1  1  0  1  0  1  0  1  0   0   1   1   1   1   1   1   0   1   1   0
  ---                                                                       
 996:  1  1  0  1  1  1  0  1  0   0   1   1   1   1   1   1   0   1   1   0
 997:  0  0  1  1  1  0  0  0  0   1   1   1   1   1   1   1   1   1   0   1
 998:  1  1  1  1  0  1  0  1  0   0   1   1   1   1   1   1   0   1   0   1
 999:  1  1  0  1  0  1  0  1  0   0   1   0   1   1   0   1   1   1   1   0
1000:  1  1  0  0  0  0  1  1  1   1   1   0   0   1   1   0   1   0   1   0
Step 4: We run the clustering algorithms and save the results.
Clustering k = 1,2,..., K.max (= 10): .. done
Bootstrapping, b = 1,2,..., B (= 100)  [one "." per sample]:
.................................................. 50 
.................................................. 100 
Clustering k = 1,2,..., K.max (= 10): .. done
Bootstrapping, b = 1,2,..., B (= 100)  [one "." per sample]:
.................................................. 50 
.................................................. 100 
fitting ...
  |====================================================================================================| 100%
Bootstrap (r = 0.5)... Done.
Bootstrap (r = 0.6)... Done.
Bootstrap (r = 0.7)... Done.
Bootstrap (r = 0.8)... Done.
Bootstrap (r = 0.9)... Done.
Bootstrap (r = 1.0)... Done.
Bootstrap (r = 1.1)... Done.
Bootstrap (r = 1.2)... Done.
Bootstrap (r = 1.3)... Done.
Bootstrap (r = 1.4)... Done.
Clustering k = 1,2,..., K.max (= 10): .. done
Bootstrapping, b = 1,2,..., B (= 100)  [one "." per sample]:
.................................................. 50 
.................................................. 100 
Clustering k = 1,2,..., K.max (= 10): .. done
Bootstrapping, b = 1,2,..., B (= 100)  [one "." per sample]:
.................................................. 50 
.................................................. 100 

We first compare the cluster sizes produced by each method in the following table:

ca$cluster_sizes
   Original agnes clara diana fanny mona pam dbscan mclust  hc hcvvv pvclust
1:      333   333   333   311   333  262 333     10     87  76    76     333
2:      333   333   333   334   333  339 333     10    121 105   105     333
3:      334   334   334   355   334  400 334     10    121 819   819     334
4:       NA    NA    NA    NA    NA   NA  NA     12    132  NA    NA      NA
5:       NA    NA    NA    NA    NA   NA  NA     34    146  NA    NA      NA
6:       NA    NA    NA    NA    NA   NA  NA     76    195  NA    NA      NA
7:       NA    NA    NA    NA    NA   NA  NA    105    198  NA    NA      NA
   ehclust ekmeans kmeans
1:     332     333    333
2:     333     333    333
3:     335     334    334
4:      NA      NA     NA
5:      NA      NA     NA
6:      NA      NA     NA
7:      NA      NA     NA

This table shows that the dbscan and mclust algorithms are not appropriate for the classification of our simulated data, given that they identify far too many clusters. The diana, mona, hc, and hcvvv algorithms produce incorrect cluster sizes.

The other algorithms produce clusters of appropriate sizes. Bear in mind that package factoextra automatically determined the number of clusters for some algorithms, while for others we had to provide it ourselves.
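Comparing sorted cluster sizes is a coarse measure: two partitions can have identical sizes yet assign points to different groups. A finer comparison is the adjusted Rand index from package mclust, which scores the agreement of two partitions regardless of how the clusters are numbered. Here is a minimal sketch with hypothetical labelings (the script above does not retain the true cluster memberships, so we illustrate with toy vectors):

library(mclust)
# Two labelings of the same six points: the partitions are identical even
# though the cluster numbers differ, so the adjusted Rand index is 1.
truth <- c(1, 1, 1, 2, 2, 2)
guess <- c(2, 2, 2, 1, 1, 1)
adjustedRandIndex(truth, guess)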

In the next few sections, we will show the graphical output of the different types of cluster analysis.

Package cluster

agnes - Agglomerative Nesting (Hierarchical Clustering)

ca$p_cluster$agnes_cluster_sizes
plot(ca$p_cluster$agnes)
fviz_cluster(ca$p_cluster$eagnes)    # From package factoextra
fviz_dend(ca$p_cluster$eagnes)       # From package factoextra
fviz_silhouette(ca$p_cluster$eagnes) # From package factoextra
  1   2   3 
334 333 333

clara - Clustering Large Applications - Clustering of the Data Into k Clusters

ca$cluster_centers
ca$p_cluster$clara$medoids
ca$p_cluster$clara$clusinfo
   m1 m2 m3 m4 m5 m6 m7 m8 m9 m10 m11 m12 m13 m14 m15 m16 m17 m18 m19 m20
1:  1  1  0  0  0  0  1  1  1   1   1   0   0   0   1   0   1   1   1   0
2:  1  1  0  1  0  1  0  1  0   0   1   1   1   1   1   1   0   1   1   0
3:  0  0  1  1  1  0  0  0  0   1   1   1   1   1   1   1   1   1   0   1

     m1 m2 m3 m4 m5 m6 m7 m8 m9 m10 m11 m12 m13 m14 m15 m16 m17 m18 m19 m20
[1,]  1  1  0  1  0  1  0  1  0   0   1   1   1   1   1   1   0   1   1   0
[2,]  1  1  0  0  0  0  1  1  1   1   1   0   0   0   1   0   1   1   1   0
[3,]  0  0  1  1  1  0  0  0  0   1   1   1   1   1   1   1   1   1   0   1

     size max_diss   av_diss isolation
[1,]  333 2.645751 1.3260589 0.8366600
[2,]  333 2.236068 1.0075124 0.7071068
[3,]  334 2.449490 0.8338874 0.7745967
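
Note that the clara medoids recover the generated cluster centers exactly, although in a different row order.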
plot(ca$p_cluster$clara)
fviz_cluster(ca$p_cluster$eclara)    # From package factoextra
fviz_silhouette(ca$p_cluster$eclara) # From package factoextra

diana - DIvisive ANAlysis Clustering

ca$cluster_centers
ca$p_cluster$diana_cluster_sizes
plot(ca$p_cluster$diana)
fviz_cluster(ca$p_cluster$ediana)    # From package factoextra
fviz_dend(ca$p_cluster$ediana)       # From package factoextra
fviz_silhouette(ca$p_cluster$ediana) # From package factoextra
   m1 m2 m3 m4 m5 m6 m7 m8 m9 m10 m11 m12 m13 m14 m15 m16 m17 m18 m19 m20
1:  1  1  0  0  0  0  1  1  1   1   1   0   0   0   1   0   1   1   1   0
2:  1  1  0  1  0  1  0  1  0   0   1   1   1   1   1   1   0   1   1   0
3:  0  0  1  1  1  0  0  0  0   1   1   1   1   1   1   1   1   1   0   1

  1   2   3 
355 311 334