
bdsvd (version 0.2.1)

hcsvd: Hierarchical Variable Clustering Using Singular Vectors (HC-SVD).

Description

Performs HC-SVD to reveal the hierarchical variable structure as described in Bauer (202X). In this divisive approach, each cluster is split into two clusters iteratively. Potential splits are identified by the first sparse loadings (which are sparse approximations of the first right eigenvectors, i.e., vectors with many zero values, of the correlation matrix) that mirror the masked shape of the correlation matrix. This procedure is continued until each variable lies in its own cluster.

Usage

hcsvd(
  R,
  q = "Kaiser",
  linkage = "average",
  is.corr = TRUE,
  max.iter,
  trace = TRUE
)
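
A minimal call only requires the correlation matrix; the remaining arguments are left at their defaults. The following sketch uses toy data simulated here purely for illustration (not data from Bauer (202X)):

set.seed(123)
X <- matrix(rnorm(100 * 10), nrow = 100, ncol = 10) #n = 100 observations, p = 10 variables
fit <- hcsvd(cor(X)) #HC-SVD with default settings
length(fit$hclust$height) #p - 1 = 9 splits, one per iteration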

Value

A list with four components:

hclust

The clustering structure identified by HC-SVD as an object of type hclust.

dist.matrix

The ultrametric distance matrix (cophenetic matrix) of the HC-SVD structure as an object of class dist.

u.cor

The ultrametric correlation matrix of \(X\) obtained by HC-SVD as an object of class matrix.

q.p

A vector of length \(p-1\) containing, for the split of each cluster, the ratio \(q_i/p_i\) of the \(q_i\) sparse loadings used relative to all \(p_i\) possible sparse loadings. The ratio is set to NA if the cluster contains only two variables, as the search for sparse loadings that reflect the split is not required in this case.
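
A brief sketch of how these components can be inspected, assuming a fitted object obtained as fit <- hcsvd(R) for a \(p\)x\(p\) correlation matrix R:

fit <- hcsvd(R) #R is a p x p correlation matrix
plot(fit$hclust) #dendrogram of the divisive clustering
as.matrix(fit$dist.matrix)[1:3, 1:3] #ultrametric (cophenetic) distances
fit$u.cor[1:3, 1:3] #ultrametric correlation matrix
fit$q.p #ratio q_i/p_i per split, NA for two-variable clusters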

Arguments

R

A correlation matrix of dimension \(p\)x\(p\) or a data matrix of dimension \(n\)x\(p\) can be provided. If a data matrix is supplied, this must be indicated by setting is.corr = FALSE, and the correlation matrix will then be calculated as cor(X).

q

Number of sparse loadings to be used. This should either be a numeric value between zero and one to indicate a percentage, or "Kaiser" to use as many sparse loadings as there are eigenvalues greater than or equal to one. For a numeric value between zero and one, the number of sparse loadings is determined as the corresponding share of the total number of loadings. E.g., q = 1 (100%) uses all sparse loadings and q = 0.5 (50%) uses half of all sparse loadings.

linkage

The linkage function to be used. This should be one of "average", "single", or "RV" (for RV-coefficient).

is.corr

Indicates whether the supplied object is a correlation matrix. Default is TRUE. This parameter must be set to FALSE if a data matrix is supplied instead of a correlation matrix.

max.iter

How many iterations should be performed for computing the sparse loadings. Default is 200.

trace

Print progress while the \(p-1\) iterations of the divisive hierarchical clustering are performed. Default is TRUE.
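
As a sketch of how these arguments interact (the data matrix X is hypothetical; with is.corr = FALSE the correlation matrix is computed internally):

hcsvd(X, q = 0.5, linkage = "single", is.corr = FALSE, max.iter = 200, trace = FALSE)
#q = 0.5: use 50% of the possible sparse loadings for each split
#linkage = "single": single linkage instead of the default "average"
#is.corr = FALSE: X is an n x p data matrix, so cor(X) is computed internally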

Details

The sparse loadings are computed using the method of Shen and Huang (2008), which is implemented based on the code of Baglama, Reichel, and Lewis in ssvd {irlba}, with slight modifications to suit our method.
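
For intuition, a sparse approximation of the first right singular vector of a correlation matrix can also be computed directly with ssvd from the irlba package; note that hcsvd uses its own modified routine, so results need not coincide with the following sketch:

#minimal sketch: one sparse right singular vector of a correlation matrix R
#n (the number of non-zero entries in the right singular vector) is chosen arbitrarily here
s <- irlba::ssvd(R, k = 1, n = 5, maxit = 200)
s$v #sparse loading: most entries are exactly zero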

References

Bauer, J.O. (202X). Divisive hierarchical clustering identified by singular vectors.

Shen, H. and Huang, J.Z. (2008). Sparse principal component analysis via regularized low rank matrix approximation, J. Multivar. Anal. 99, 1015–1034.

Examples

#We replicate the simulation study (a) in Bauer (202X)

if (FALSE) {
p <- 40
n <- 500
b <- 5
design <- "a"

set.seed(1)
Rho <- hcsvd.cor.sim(p = p, b = b, design = design)
X <- mvtnorm::rmvnorm(n, mean = rep(0, p), sigma = Rho, checkSymmetry = FALSE)
R <- cor(X)
hcsvd.obj <- hcsvd(R)

#The hclust object with the corresponding dendrogram can be obtained
#directly from hcsvd.obj$hclust:
hc <- hcsvd.obj$hclust
plot(hc)

#The dendrogram can also be obtained from the ultrametric distance matrix:
plot(hclust(hcsvd.obj$dist.matrix))
}
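
The dendrogram can also be cut into a fixed number of clusters. As a sketch (not part of the simulation study in Bauer (202X)), cutting into the b = 5 simulated blocks:

#assign each of the p variables to one of b clusters
cluster.labels <- cutree(hc, k = b)
table(cluster.labels)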

