blox (version 0.0.1)

hcsvd: Hierarchical Clustering Using Singular Vectors (HC-SVD).

Description

Performs HC-SVD to reveal the hierarchical structure as described in Bauer (202Xa). This divisive approach iteratively splits each cluster into two subclusters. Candidate splits are determined by the first sparse eigenvectors (sparse approximations of the first eigenvectors, i.e., vectors with many zero entries) of the similarity matrix. The selected split is the one that yields the best block-diagonal approximation of the similarity matrix according to a specified linkage function. The procedure continues until each object is assigned to its own cluster.

Usage

hcsvd(S, linkage = "average", q = 1, h.power = 2, max.iter, verbose = TRUE)

Value

A list with four components:

hclust

The clustering structure identified by HC-SVD as an object of class hclust.

dist.matrix

The ultrametric distance matrix (cophenetic matrix) of the HC-SVD structure as an object of class dist.

u.sim

The ultrametric similarity matrix of \(S\) obtained by HC-SVD as an object of class matrix. The ultrametric similarity matrix is calculated as 1 - dist.matrix.
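The relation between the ultrametric distance and similarity matrices can be illustrated with base R alone (this toy sketch uses stats::hclust and stats::cophenetic, not the blox package): any cophenetic matrix is ultrametric, and once the distances are scaled to [0, 1], a similarity version is recovered as 1 minus the distance matrix.

```r
# Toy illustration (base R only): cophenetic distances of any hclust object
# are ultrametric; after scaling to [0, 1], similarity = 1 - distance.
set.seed(1)
X  <- matrix(rnorm(20), nrow = 5)           # 5 toy objects
hc <- hclust(dist(X), method = "average")   # any hierarchical clustering
d  <- cophenetic(hc)                        # ultrametric distance, class "dist"
d  <- d / max(d)                            # scale distances to [0, 1]
u  <- 1 - as.matrix(d)                      # ultrametric similarity matrix
```

The resulting matrix u is symmetric with unit diagonal, mirroring the u.sim component returned by hcsvd.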

q.p

A vector of length \(p-1\) containing, for the split of each cluster, the ratio \(q_i/p_i\) of the number of sparse eigenvectors used, \(q_i\), to the number of all possible sparse eigenvectors, \(p_i\). The ratio is set to NA if the cluster contains only two variables, as no search for sparse eigenvectors is required to find this obvious split.

Arguments

S

A scaled \(p \times p\) similarity matrix. For example, this may be a correlation matrix.

linkage

The linkage function to be used. This should be one of "average", "single", or "RV" (for RV-coefficient). Note that the RV-coefficient might not yield an ultrametric distance.

q

Number of sparse eigenvectors to be used. This should be either a numeric value in \((0, 1]\) to indicate a share, or "Kaiser" to use as many sparse eigenvectors as there are eigenvalues greater than or equal to one. For a numeric value, the number of sparse eigenvectors is determined as the corresponding share of the total number of eigenvectors: q = 1 (100%) uses all sparse eigenvectors and q = 0.5 (50%) uses half of them. Identification is best for q = 1 (see Bauer (202Xa) for details).
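The Kaiser rule referenced by q = "Kaiser" can be sketched in base R (illustrative only; how hcsvd applies the rule internally may differ in detail): count the eigenvalues of the similarity matrix that are greater than or equal to one.

```r
# Sketch of the Kaiser rule: number of eigenvalues of a correlation
# matrix that are >= 1 (base R only, not a call into blox).
set.seed(1)
X  <- matrix(rnorm(200), nrow = 20)   # 20 observations, 10 variables
S  <- cor(X)
ev <- eigen(S, symmetric = TRUE, only.values = TRUE)$values
q.kaiser <- sum(ev >= 1)              # candidate number of sparse eigenvectors
```

Since the eigenvalues of a correlation matrix average to one, at least one eigenvalue always meets the threshold.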

h.power

The h-th Hadamard power of S, i.e., the elementwise power \(S^h\). This should be a positive integer; raising S to a Hadamard power increases the robustness of the method, as described in Bauer (202Xa).
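In R, the Hadamard power is simply the elementwise power operator, not matrix multiplication. A base-R sketch (not a call into blox) for h = 2:

```r
# The h-th Hadamard power is the elementwise power S^h (here h = 2):
# entries near 1 stay large while small similarities shrink toward 0,
# which sharpens the block structure.
S  <- matrix(c(1.0, 0.9, 0.1,
               0.9, 1.0, 0.2,
               0.1, 0.2, 1.0), nrow = 3)
S2 <- S^2   # elementwise power, NOT S %*% S
```

Note that S^2 here differs from the matrix product S %*% S; only the elementwise version preserves the unit diagonal of a similarity matrix.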

max.iter

Maximum number of iterations performed when computing the sparse eigenvectors. Default is 500.

verbose

Print out progress as the \(p-1\) iterations of the divisive hierarchical clustering are performed. Default is TRUE.

Details

The sparse loadings are computed using the method proposed by Shen & Huang (2008). The corresponding implementation is written in Rcpp/RcppArmadillo for computational efficiency and is based on the R implementation of ssvd by Baglama, Reichel, and Lewis in the irlba package. However, the implementation has been adapted to better align with the scope of the bdsvd package, on which the blox package is based.
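The alternating soft-thresholding idea behind Shen & Huang's sparse rank-one approximation can be sketched in a few lines of base R. This is an illustration only, not the blox implementation: the package's Rcpp/RcppArmadillo code differs in details such as the penalty choice, stopping rule, and warm starts, and the function and parameter names below (soft, sparse_rank1, lambda, tol) are hypothetical.

```r
# Illustrative sketch of alternating soft-thresholding for a sparse
# rank-one approximation of a symmetric similarity matrix S.
soft <- function(x, lambda) sign(x) * pmax(abs(x) - lambda, 0)

sparse_rank1 <- function(S, lambda = 0.1, max.iter = 500, tol = 1e-8) {
  v <- eigen(S, symmetric = TRUE)$vectors[, 1]  # warm start: leading eigenvector
  for (i in seq_len(max.iter)) {
    u     <- S %*% v                            # power-iteration step
    v.new <- soft(t(S) %*% u, lambda)           # threshold small loadings to 0
    if (sum(abs(v.new)) == 0) break             # fully thresholded: give up
    v.new <- v.new / sqrt(sum(v.new^2))         # renormalize
    if (max(abs(v.new - v)) < tol) { v <- v.new; break }
    v <- v.new
  }
  as.numeric(v)
}

set.seed(1)
S <- cor(matrix(rnorm(100), nrow = 20))  # toy 5 x 5 correlation matrix
v.sparse <- sparse_rank1(S)
```

The zero pattern of such a sparse eigenvector is what induces the candidate two-block splits of the similarity matrix described above.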

Supplementary details are in hc.beta and in Bauer (202Xb).

References

Bauer, J.O. (202Xa). Divisive hierarchical clustering using block diagonal matrix approximations. Working paper.

Bauer, J.O. (202Xb). Revelle's beta: The wait is over - we can compute it! Working paper.

Shen, H. and Huang, J.Z. (2008). Sparse principal component analysis via regularized low rank matrix approximation, J. Multivar. Anal. 99, 1015–1034.

See Also

bdsvd {bdsvd}

Examples

#We give one example for variable clustering directly on a correlation matrix,
#and we replicate the USArrest example in Bauer (202Xa) for observation clustering.
#More elaborate code alongside a different example for variable clustering can be
#found in the corresponding supplementary material of that manuscript.

# \donttest{
### VARIABLE CLUSTERING

#Load the correlation matrix Bechtoldt from the psych
#package (see ?Bechtoldt for more information).
if (requireNamespace("psych", quietly = TRUE)) {
  data("Bechtoldt", package = "psych")

  #Compute HC-SVD (with average linkage).
  hcsvd.obj <- hcsvd(Bechtoldt)

  #The object of class hclust with corresponding dendrogram can be obtained
  #directly from hcsvd(...):
  hc.div <- hcsvd.obj$hclust
  plot(hc.div, ylab = "")

  #The dendrogram can also be obtained from the ultrametric distance matrix:
  plot(hclust(hcsvd.obj$dist.matrix), main = "HC-SVD", sub = "", xlab = "")
}


### OBSERVATION CLUSTERING

#Correct for the known transcription error
data("USArrests")
USArrests["Maryland", "UrbanPop"] <- 76.6

#The distance matrix is scaled (divided by max(D)) to later allow a
#transformation to a matrix S that fulfills the properties of a similarity
#matrix.
D <- as.matrix(dist(USArrests))
D <- D / max(D)
S <- 1 - D

#Compute HC-SVD (with average linkage).
hcsvd.obj <- hcsvd(S)

#The object of type hclust with corresponding dendrogram can be obtained
#directly from hcsvd(...):
hc.div <- hcsvd.obj$hclust
plot(hc.div, ylab = "")

#The dendrogram can also be obtained from the ultrametric distance matrix:
plot(hclust(hcsvd.obj$dist.matrix), main = "HC-SVD", sub = "", xlab = "")
# }