VarClustPartition: Hierarchical variable clustering.

Description

VarClustPartition is a hierarchical variable clustering algorithm based on the directed dependence coefficient (didec) or a concordance measure (Kendall's tau or Spearman's footrule). The partition is obtained either for a pre-selected number of clusters or by an optimality criterion (Adiam&Msplit or Silhouette coefficient).

Usage

VarClustPartition(
  X,
  dist.method = c("PD"),
  linkage = FALSE,
  link.method = c("complete"),
  part.method = c("optimal"),
  criterion = c("Adiam&Msplit"),
  num.cluster = NULL,
  plot = FALSE
)

Value

A list containing a dendrogram without colored branches (dendrogram), an integer value determining the number of clusters after partitioning (num.cluster), and a list containing the clusters after partitioning (clusters).

Arguments

X: A numeric matrix or data.frame/data.table. Contains the variables to be clustered.
dist.method: An optional character string selecting the distance function used for clustering. This must be one of the strings "PD" (default), "MPD", "kendall" or "footrule".
linkage: A logical. If TRUE, a linkage method is used (default FALSE).
link.method: An optional character string selecting a linkage method. This must be one of the strings "complete" (default), "average" or "single".
part.method: An optional character string selecting a partitioning method. This must be one of the strings "optimal" (default) or "selected".
criterion: An optional character string selecting a criterion for the optimal partition, if part.method = "optimal". This must be one of the strings "Adiam&Msplit" (default) or "Silhouette".
num.cluster: An integer value for the selected number of clusters, if part.method = "selected".
plot: A logical. If TRUE, a dendrogram is plotted with branches colored according to the chosen partitioning method.

Author

Yuping Wang, Sebastian Fuchs

Details

VarClustPartition performs a hierarchical variable clustering based on the directed dependence coefficient (didec) and provides a partition of the set of variables.

If dist.method == "PD" or dist.method == "MPD", the clustering is performed using didec either as a directed ("PD") or as a symmetric ("MPD") dependence coefficient. If dist.method == "kendall" or dist.method == "footrule", clustering is performed using either multivariate Kendall's tau ("kendall") or multivariate Spearman's footrule ("footrule").

Instead of using one of the above-mentioned four multivariate measures for the clustering, the option linkage == TRUE enables the use of bivariate linkage methods, including complete linkage (link.method == "complete"), average linkage (link.method == "average") and single linkage (link.method == "single"). Note that the multivariate distance methods are computationally demanding because higher-dimensional dependencies are included in the calculation, in contrast to linkage methods which only incorporate pairwise dependencies.
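
The difference between the three linkage rules can be illustrated with a minimal base-R sketch. It is not the package's implementation: the pairwise dissimilarity 1 - |Kendall's tau| and the simulated data are assumptions chosen only to show how linkage methods work on pairwise information between variables.

```r
# Illustration only: the dissimilarity 1 - |Kendall's tau| is an assumption of
# this sketch, not necessarily the dissimilarity used by VarClustPartition.
set.seed(1)
n  <- 50
X1 <- rnorm(n)
X2 <- X1 + rnorm(n, sd = 0.1)   # strongly related to X1
X3 <- rnorm(n)
X4 <- X3 + rnorm(n, sd = 0.1)   # strongly related to X3
X  <- data.frame(X1, X2, X3, X4)

# Pairwise dissimilarities between variables (columns), not observations
tau <- cor(X, method = "kendall")
d   <- as.dist(1 - abs(tau))

# Same dissimilarities, three linkage rules
trees <- lapply(c(complete = "complete", average = "average", single = "single"),
                function(m) hclust(d, method = m))

# Cut each tree into two clusters of variables
parts <- lapply(trees, cutree, k = 2)
parts$complete
```

Because only a p x p matrix of pairwise values is needed, this kind of procedure stays cheap as the number of variables grows, which is the contrast with the multivariate distance methods drawn above.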

A pre-selected number of clusters num.cluster can be realized with the option part.method == "selected". Otherwise (part.method == "optimal"), the number of clusters is determined by maximizing the intra-cluster similarity (similarity within the same cluster) and minimizing the inter-cluster similarity (similarity among the clusters). Two optimality criteria are available:

"Adiam&Msplit": Adiam measures the intra-cluster similarity and Msplit measures the inter-cluster similarity.

"Silhouette": A mixed coefficient incorporating the intra-cluster similarity and the inter-cluster similarity. The optimal number of clusters corresponds to the maximum Silhouette coefficient.

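The idea of choosing the number of clusters by the maximum Silhouette coefficient can be sketched in base R as follows. This is a generic illustration on pairwise dissimilarities; the helper avg_silhouette(), the dissimilarity 1 - |Kendall's tau| and the simulated data are assumptions of the sketch, and the package may compute the coefficient differently (for example from multivariate measures).

```r
# Illustration only: base-R silhouette on pairwise dissimilarities between variables.
# avg_silhouette() is a hypothetical helper, not a function from didec.
set.seed(1)
n <- 50
Z <- matrix(rnorm(3 * n), ncol = 3)                   # three independent signals
X <- cbind(Z, Z + matrix(rnorm(3 * n, sd = 0.1), ncol = 3))
colnames(X) <- c("A1", "B1", "C1", "A2", "B2", "C2")  # A2 tracks A1, and so on

d    <- as.dist(1 - abs(cor(X, method = "kendall")))
tree <- hclust(d, method = "complete")

avg_silhouette <- function(D, labels) {
  D <- as.matrix(D)
  s <- vapply(seq_along(labels), function(i) {
    own <- labels == labels[i]
    if (sum(own) == 1) return(0)              # convention: singletons get s(i) = 0
    a <- sum(D[i, own]) / (sum(own) - 1)      # mean dissimilarity within own cluster
    b <- min(vapply(setdiff(unique(labels), labels[i]),
                    function(g) mean(D[i, labels == g]),
                    numeric(1)))              # nearest other cluster
    (b - a) / max(a, b)
  }, numeric(1))
  mean(s)
}

# Evaluate every candidate number of clusters and keep the maximizer
ks     <- 2:(ncol(X) - 1)
sil    <- vapply(ks, function(k) avg_silhouette(d, cutree(tree, k = k)), numeric(1))
best_k <- ks[which.max(sil)]
best_k
```

The data were simulated as three tightly coupled pairs, so the maximizer is expected to recover three clusters.
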
References

S. Fuchs, Y. Wang, Hierarchical variable clustering based on the predictive strength between random vectors, Int. J. Approx. Reason. 170, Article ID 109185, 2024.

P. Hansen, B. Jaumard, Cluster analysis and mathematical programming, Math. Program. 79 (1) 191–215, 1997.

L. Kaufman, P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley & Sons, 1990.

Examples


library(didec)
n  <- 50
X1 <- rnorm(n,0,1)
X2 <- X1
X3 <- rnorm(n,0,1)
X4 <- X3 + X2
X  <- data.frame(X1=X1,X2=X2,X3=X3,X4=X4)
vcp <- VarClustPartition(X,
                         dist.method = c("PD"),
                         part.method = c("optimal"),
                         criterion   = c("Silhouette"),
                         plot        = TRUE)
vcp$clusters
# \donttest{
data("bioclimatic")
X   <- bioclimatic[c(2:4,9)]
vcp1 <- VarClustPartition(X,
                          linkage     = TRUE,
                          link.method = c("complete"),
                          dist.method = "PD",
                          part.method = "optimal",
                          criterion   = "Silhouette",
                          plot        = TRUE)
vcp1$clusters
vcp2 <- VarClustPartition(X,
                          linkage     = TRUE,
                          link.method = c("complete"),
                          dist.method = "footrule",
                          part.method = "optimal",
                          criterion   = "Adiam&Msplit",
                          plot        = TRUE)
vcp2$clusters
# }
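
The examples above all use part.method = "optimal". A pre-selected number of clusters can presumably be requested as in the following sketch, which relies only on the arguments and return components documented on this page. The simulated data repeat the first example; the guard with requireNamespace() and the name vcp_sel are additions for illustration.

```r
# Sketch only: runs if the didec package is installed, otherwise does nothing.
if (requireNamespace("didec", quietly = TRUE)) {
  set.seed(1)
  n  <- 50
  X1 <- rnorm(n, 0, 1)
  X2 <- X1
  X3 <- rnorm(n, 0, 1)
  X4 <- X3 + X2
  X  <- data.frame(X1 = X1, X2 = X2, X3 = X3, X4 = X4)

  # Ask for exactly two clusters instead of an optimality criterion
  vcp_sel <- didec::VarClustPartition(X,
                                      dist.method = "PD",
                                      part.method = "selected",
                                      num.cluster = 2)
  print(vcp_sel$num.cluster)
  print(vcp_sel$clusters)
}
```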
