cluster_vars: Build a Hierarchical Tree based on Hierarchical Clustering

Description

Build a hierarchical tree based on hierarchical clustering of the variables.

Usage

cluster_vars(
  x = NULL,
  d = NULL,
  block = NULL,
  method = "average",
  use = "pairwise.complete.obs",
  sort.parallel = TRUE,
  parallel = c("no", "multicore", "snow"),
  ncpus = 1L,
  cl = NULL
)

Arguments

x

a matrix or list of matrices for multiple data sets. The matrix or matrices have to be of type numeric and are required to have column names (variable names). The rows and the columns represent the observations and the variables, respectively. Either the argument x or d has to be specified.

d

a dissimilarity matrix. This can be either a symmetric matrix of type numeric with column and row names or an object of class dist with labels. Either the argument x or d has to be specified.

block

a data frame or matrix specifying the second level of the hierarchical tree. The first column is required to contain the variable names and to be of type character. The second column is required to contain the group assignment and to be a vector of type character or numeric. If not supplied, the second level is built based on the data.

method

the agglomeration method to be used for the hierarchical clustering. See hclust for details.

use

the method to be used for computing covariances in the presence of missing values. This is important for multiple data sets which do not measure exactly the same variables. If data is specified using the argument x, the dissimilarity matrix for the hierarchical clustering is calculated using correlation. See the 'Details' section and cor for all the options.

sort.parallel

a logical indicating whether the blocks should be sorted with respect to the size of the block. This can reduce the run time for parallel computation.

parallel

type of parallel computation to be used. See the 'Details' section.

ncpus

number of processes to be run in parallel.

cl

an optional parallel or snow cluster used if parallel = "snow". If not supplied, a cluster on the local machine is created.

Value

The returned value is an object of class "hierD", consisting of two elements, the argument "block" and the hierarchical tree "res.tree".

The element "block" defines the second level of the hierarchical tree if supplied.

The element "res.tree" contains a dendrogram for each of the blocks defined in the argument block. If the argument block is NULL (i.e. not supplied), the element contains only one dendrogram.

Details

The hierarchical tree is built by hierarchical clustering of the variables. Either the data (using the argument x) or a dissimilarity matrix (using the argument d) can be specified.

If one or multiple data sets are defined using the argument x, the dissimilarity matrix is calculated as one minus the squared empirical correlation. In the case of multiple data sets, a single hierarchical tree is jointly estimated using hierarchical clustering. The argument use is important because missing values are introduced if the data sets do not measure exactly the same variables. The argument use determines how the empirical correlation is calculated in that case.
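The dissimilarity described above can be sketched in base R. This is only an illustration under stated assumptions, not the internal code of the package: the toy data, the injected missing values, and the object names (r, d, tree, dend) are hypothetical.

```r
set.seed(1)

# Toy data: 50 observations of 6 named variables (illustrative only).
x <- matrix(rnorm(50 * 6), nrow = 50, ncol = 6)
colnames(x) <- paste0("Var", 1:6)
x[1:5, 2] <- NA  # a few missing values, handled via the 'use' option of cor

# Dissimilarity: one minus the squared empirical correlation.
r <- cor(x, use = "pairwise.complete.obs")
d <- as.dist(1 - r^2)

# Hierarchical clustering of the variables with average linkage.
tree <- hclust(d, method = "average")
dend <- as.dendrogram(tree)
```

With use = "pairwise.complete.obs", each correlation is computed from the rows in which both variables are observed, which is why this option matters when data sets do not share all variables.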

Alternatively, it is possible to specify a user-defined dissimilarity matrix using the argument d.
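As a rough sketch of the second option, the lines below build a symmetric numeric matrix with row and column names and pass it as d. The choice of one minus the absolute correlation is an arbitrary illustration, not a recommendation from the package, and the call is guarded so that it only runs if hierinf is installed.

```r
set.seed(2)

# Toy data with named variables (illustrative only).
x <- matrix(rnorm(40 * 5), nrow = 40, ncol = 5)
colnames(x) <- paste0("Var", 1:5)

# A user-defined dissimilarity: symmetric, numeric, with row and column names.
d_mat <- 1 - abs(cor(x))

if (requireNamespace("hierinf", quietly = TRUE)) {
  dendr <- hierinf::cluster_vars(d = d_mat, method = "complete")
}
```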

If the arguments x and block are both supplied, i.e. the block defines the second level of the hierarchical tree, the function can be run in parallel across the blocks by specifying the arguments parallel and ncpus. The optional argument cl applies only if parallel = "snow". The argument parallel accepts three values: parallel = "no" for serial evaluation (default), parallel = "multicore" for parallel evaluation using forking, and parallel = "snow" for parallel evaluation using a parallel socket cluster.

It is recommended to select RNGkind("L'Ecuyer-CMRG") and set a seed so that the parallel computations of the package hierinf are reproducible. Each processor then gets a different substream of the pseudo-random number generator stream, which makes the results reproducible as long as the arguments (such as sort.parallel and ncpus) remain unchanged. See the vignette or the references for more details.
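A minimal sketch of a reproducible parallel call could look as follows. The seed, the toy data, and the choice of two processes are arbitrary assumptions; forking ("multicore") is not available on Windows, where parallel = "snow" would be the alternative. The call is guarded so that it only runs if hierinf is installed.

```r
# Select the RNG recommended for parallel streams, then set a seed.
RNGkind("L'Ecuyer-CMRG")
set.seed(3)

# Toy data: 20 named variables split into two blocks (illustrative only).
x <- matrix(rnorm(50 * 20), nrow = 50, ncol = 20)
colnames(x) <- paste0("Var", 1:20)
block <- data.frame(var.name = paste0("Var", 1:20),
                    block = rep(c(1, 2), each = 10),
                    stringsAsFactors = FALSE)

if (requireNamespace("hierinf", quietly = TRUE)) {
  # One tree per block, built in parallel across the two blocks.
  dendr <- hierinf::cluster_vars(x = x, block = block,
                                 parallel = "multicore", ncpus = 2L)
}
```

Keeping sort.parallel and ncpus fixed between runs matters here, since changing them alters how the blocks are distributed over the processes.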

References

Meinshausen, N. (2008). Hierarchical testing of variable importance. Biometrika, 95(2), 265-278.

Renaux, C., Buzdugan, L., Kalisch, M., and Bühlmann, P. (2020). Hierarchical inference for genome-wide association studies: a view on methodology with software. Computational Statistics, 35(1), 1-40.

Examples

# NOT RUN {
library(MASS)
x <- mvrnorm(50, mu = rep(0, 100), Sigma = diag(100))
colnames(x) <- paste0("Var", 1:100)
dendr1 <- cluster_vars(x = x)

# The column names of the data frame block are optional.
block <- data.frame("var.name" = paste0("Var", 1:100),
                    "block" = rep(c(1, 2), each = 50))
dendr2 <- cluster_vars(x = x, block = block)

# The matrix x is first transposed because the function dist calculates
# distances between the rows.
d <- dist(t(x))
dendr3 <- cluster_vars(d = d, method = "single")

# }
