Cluster Analysis of Binary Data
Introduction
Cluster analysis is the statistical problem of classifying data into groups according to similarity. For example, birds can be classified into flocks depending on their physical position and how close they are to one another. There are multiple ways to classify the data and to decide how many clusters should be identified, and choosing one classification over another can be difficult. In this article, we compare cluster analysis techniques for binary data.
Cluster analysis is a common technique for exploratory data analysis, and is used in fields like pattern recognition, machine learning, and bioinformatics. The algorithms aim to be versatile, suitable for any data type. In this article, we are going to show that not all algorithms are suitable to classify a given dataset, and demonstrate how building a simulated dataset can be helpful for choosing an appropriate algorithm.
We identified more than a dozen cluster analysis functions by different contributors to the R statistical language. This article does not aim to be a thorough description or comparison of those functions, and instead focuses on showing the construction of the simulated dataset and the results of as many of those functions as possible.
The simulation generated 3 clusters of approximately 333 points each; the resulting dataset had 1000 points defined in a space of 20 binary dimensions. Then, the R functions for cluster analysis were run. Some algorithms, such as agnes, required us to specify the number of clusters; other algorithms, such as ekmeans, were able to determine the number of clusters by themselves. We also found that most algorithms, like agnes and ekmeans, classified the dataset into appropriately sized clusters, while other algorithms, such as dbscan and hc, produced the wrong number of clusters or clusters of the wrong sizes. Finally, the results of some algorithms were not easy to process and interpret, because appropriate statistical or graphical tools for analysis do not exist. That is the case of mona, an algorithm designed to work with binary variables only. We had to use some guesswork to determine the size of the clusters, and still got them wrong.
In conclusion, we found that several algorithms were unsuitable for classifying binary data. Tests with simulated datasets are essential for choosing an algorithm that classifies binary data correctly and presents the result clearly.
Options for Cluster Analysis
Here is a list of facilities to perform cluster analysis in R. The package cluster offers agnes, clara, diana, fanny, mona, and pam. Package fpc offers dbscan. Package mclust offers mclust, hc, and hcvvv. Package pvclust offers pvclust, and package stats offers hclust and kmeans.
Package factoextra enhances several of the facilities listed above with additional types of plots and automatic determination of the number of clusters.
Requirements and Installation
The comparison only requires a working R installation with the packages mentioned above and a few additional utilities. If you are starting with a fresh setup, you can use the following command:
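A minimal sketch of such a setup, assuming only the packages named in the previous section are needed; the additional utilities are not listed here and are therefore left out. The CRAN mirror URL is also an assumption, and package stats is not listed because it ships with R:

```r
# Packages named in this article (stats is part of the base R distribution).
pkgs <- c("cluster", "fpc", "mclust", "pvclust", "factoextra")

# Install only the packages that are not already present.
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))
if (length(missing_pkgs) > 0) {
  # The mirror is an assumption; any CRAN mirror should work.
  install.packages(missing_pkgs, repos = "https://cloud.r-project.org")
}
```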
Let’s get started
We first generate a simulated dataset and we run all cluster analysis algorithms on it, saving the results for later.
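As an illustration of how such a dataset could be built, here is a minimal sketch in base R. It is not the article's own simulation code: the seed, the variable names, and the choice of per-cluster probabilities (0.1 or 0.9 for each dimension) are assumptions made for this example. Only the overall shape (3 clusters, 1000 points, 20 binary dimensions) comes from the text.

```r
set.seed(42)  # arbitrary seed, used only to make the sketch reproducible

n_points   <- 1000
n_dims     <- 20
n_clusters <- 3

# Assign every point to a cluster at random, so each cluster
# ends up with roughly 333 points.
truth <- sample(seq_len(n_clusters), n_points, replace = TRUE)

# Hypothetical cluster profiles: for each cluster and dimension,
# the probability of observing a 1 (kept far from 0.5 so that
# the clusters are clearly separated).
profiles <- matrix(sample(c(0.1, 0.9), n_clusters * n_dims, replace = TRUE),
                   nrow = n_clusters, ncol = n_dims)

# Draw each binary value from a Bernoulli distribution whose
# probability is given by the profile of the point's cluster.
probs <- profiles[truth, , drop = FALSE]
x <- matrix(rbinom(n_points * n_dims, size = 1, prob = as.vector(probs)),
            nrow = n_points, ncol = n_dims)
colnames(x) <- paste0("v", seq_len(n_dims))

table(truth)  # true cluster sizes, to compare against later
```

Keeping the true labels in `truth` is the point of simulating: every algorithm's output can be checked against a known answer.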
Here is the output of the code:
We first compare the cluster analysis methods in the following table:
This table shows that algorithms dbscan and mclust are not appropriate for the classification of our simulated data, given that they identify a large number of clusters. Algorithms diana, mona, hc, and hcvvv produce incorrect cluster sizes.
The other algorithms seem to produce clusters of appropriate sizes. Bear in mind that in some cases package factoextra was able to determine the number of clusters automatically, while in other cases we had to provide the number of clusters ourselves.
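A comparison of this kind can be sketched by tabulating, for each method, the number of clusters found and their sizes. The example below is an assumption-laden illustration, not the article's code: it uses a small stand-in dataset, only the two functions from package stats, and Manhattan distance (which on 0/1 data counts the positions where two points differ).

```r
set.seed(1)  # arbitrary seed for reproducibility

# Small stand-in binary dataset (assumption): 3 groups of 100 points,
# 20 binary variables, each group with its own probability profile.
truth    <- rep(1:3, each = 100)
profiles <- matrix(sample(c(0.1, 0.9), 3 * 20, replace = TRUE), nrow = 3)
x <- matrix(rbinom(300 * 20, 1, as.vector(profiles[truth, ])), nrow = 300)

# Two methods from package stats, both told to find k = 3 clusters.
km <- kmeans(x, centers = 3, nstart = 10)
hc <- hclust(dist(x, method = "manhattan"), method = "average")

labels <- list(kmeans = km$cluster, hclust = cutree(hc, k = 3))

# One row per method: number of clusters and their sizes, largest first.
comparison <- data.frame(
  method     = names(labels),
  n_clusters = sapply(labels, function(cl) length(unique(cl))),
  sizes      = sapply(labels, function(cl)
    paste(sort(as.vector(table(cl)), decreasing = TRUE), collapse = ", ")),
  row.names  = NULL
)
print(comparison)
```

Sorting the sizes makes the rows comparable, because cluster labels are arbitrary and differ from one method to the next.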
In the next few sections, we will show the graphical output of the different types of cluster analysis.
Package cluster
agnes - Agglomerative Nesting (Hierarchical Clustering)
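A minimal sketch of an agnes run on binary data, assuming a small stand-in dataset, Manhattan distance, and average linkage; none of these choices are confirmed as the settings used for the results discussed in this article.

```r
library(cluster)  # provides agnes()

set.seed(1)  # arbitrary seed for reproducibility

# Small stand-in binary dataset (assumption): 3 groups of 50 points,
# 20 binary variables.
truth    <- rep(1:3, each = 50)
profiles <- matrix(sample(c(0.1, 0.9), 3 * 20, replace = TRUE), nrow = 3)
x <- matrix(rbinom(150 * 20, 1, as.vector(profiles[truth, ])), nrow = 150)

# Agglomerative nesting with average linkage on Manhattan distances.
ag <- agnes(x, metric = "manhattan", method = "average")

# agnes builds a full hierarchy and does not pick the number of
# clusters, so the tree is cut at k = 3 here.
cl <- cutree(as.hclust(ag), k = 3)
table(cl)

# Dendrogram of the fitted hierarchy (which.plots = 1 would draw the banner).
plot(ag, which.plots = 2, main = "agnes on simulated binary data")
```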
clara - Clustering Large Applications - Clustering of the Data Into k Clusters
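A minimal sketch of a clara run, assuming a stand-in binary dataset, k = 3, Manhattan distance, and 20 subsamples; these values are illustrative choices rather than the article's settings.

```r
library(cluster)  # provides clara()

set.seed(1)  # arbitrary seed for reproducibility

# Stand-in binary dataset (assumption): 3 groups of 300 points,
# 20 binary variables.
truth    <- rep(1:3, each = 300)
profiles <- matrix(sample(c(0.1, 0.9), 3 * 20, replace = TRUE), nrow = 3)
x <- matrix(rbinom(900 * 20, 1, as.vector(profiles[truth, ])), nrow = 900)

# clara needs the number of clusters up front. It runs a medoid-based
# clustering on several subsamples and keeps the best set of medoids.
cr <- clara(x, k = 3, metric = "manhattan", samples = 20)

table(cr$clustering)  # cluster sizes
cr$medoids            # one representative data point per cluster
```

Because the medoids are actual rows of the data, they remain 0/1 vectors and can be read directly as typical members of each cluster.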
diana - DIvisive ANAlysis Clustering
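A minimal sketch of a diana run, assuming a small stand-in binary dataset and Manhattan distance. It illustrates the calls only and makes no claim about the cluster sizes diana produces on the dataset used in this article.

```r
library(cluster)  # provides diana()

set.seed(1)  # arbitrary seed for reproducibility

# Small stand-in binary dataset (assumption): 3 groups of 50 points,
# 20 binary variables.
truth    <- rep(1:3, each = 50)
profiles <- matrix(sample(c(0.1, 0.9), 3 * 20, replace = TRUE), nrow = 3)
x <- matrix(rbinom(150 * 20, 1, as.vector(profiles[truth, ])), nrow = 150)

# Divisive analysis: start from one cluster holding all points
# and split repeatedly until every point stands alone.
dv <- diana(x, metric = "manhattan")

# As with agnes, the hierarchy must be cut to obtain k clusters.
cl <- cutree(as.hclust(dv), k = 3)
table(cl)

dv$dc  # divisive coefficient, a summary of how strong the structure is

# Dendrogram of the divisive hierarchy.
plot(dv, which.plots = 2, main = "diana on simulated binary data")
```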