Build a hierarchical tree based on hierarchical clustering of the variables.
cluster_vars(
x = NULL,
d = NULL,
block = NULL,
method = "average",
use = "pairwise.complete.obs",
sort.parallel = TRUE,
parallel = c("no", "multicore", "snow"),
ncpus = 1L,
cl = NULL
)
x
a matrix or list of matrices for multiple data sets. The matrix or
matrices have to be of type numeric and are required to have column names
/ variable names. The rows and the columns represent the observations and
the variables, respectively. Either the argument x or d has
to be specified.
d
a dissimilarity matrix. This can be either a symmetric matrix of
type numeric with column and row names or an object of class
dist with labels. Either the argument x or d has
to be specified.
block
a data frame or matrix specifying the second level of the hierarchical tree. The first column is required to contain the variable names and to be of type character. The second column is required to contain the group assignment and to be a vector of type character or numeric. If not supplied, the second level is built based on the data.
method
the agglomeration method to be used for the hierarchical
clustering. See hclust for details.
use
the method to be used for computing covariances in the presence
of missing values. This is important for multiple data sets which do not measure
exactly the same variables. If data is specified using the argument x, the
dissimilarity matrix for the hierarchical clustering is calculated using
correlation. See the 'Details' section and cor for all the options.
sort.parallel
a logical indicating whether the blocks should be sorted with respect to the size of the block. This can reduce the run time for parallel computation.
parallel
type of parallel computation to be used. See the 'Details' section.
ncpus
number of processes to be run in parallel.
cl
an optional parallel or snow cluster used if
parallel = "snow". If not supplied, a cluster on the local machine is created.
The returned value is an object of class "hierD",
consisting of two elements, the argument "block" and the
hierarchical tree "res.tree".
The element "block" defines the second level of the hierarchical
tree if supplied.
The element "res.tree" contains a dendrogram
for each of the blocks defined in the argument block.
If the argument block is NULL (i.e. not supplied),
the element contains only one dendrogram.
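As a brief illustration of the return value described above, the following sketch inspects the class and the element names of the result. It only runs the call if the package hierinf is installed; the simulated data and the object name dendr are assumptions made for this example.

```r
# Sketch only: simulated data, used purely for illustration.
set.seed(1)
x <- matrix(rnorm(50 * 10), nrow = 50,
            dimnames = list(NULL, paste0("Var", 1:10)))

if (requireNamespace("hierinf", quietly = TRUE)) {
  dendr <- hierinf::cluster_vars(x = x)
  print(class(dendr))  # documented above as "hierD"
  print(names(dendr))  # documented above as "block" and "res.tree"
}
```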
The hierarchical tree is built by hierarchical clustering of the variables.
Either the data (using the argument x) or a dissimilarity matrix
(using the argument d) can be specified.
If one or multiple data sets are defined using the argument x,
the dissimilarity matrix is calculated as one minus the squared empirical
correlation. In the case of multiple data sets, a single hierarchical
tree is jointly estimated using hierarchical clustering. The argument
use is important because missing values are introduced if the
data sets do not measure exactly the same variables. The argument
use determines how the empirical correlation is calculated.
Alternatively, it is possible to specify a user-defined dissimilarity
matrix using the argument d.
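The dissimilarity described above can be sketched with base R alone. This is a minimal illustration of the idea and not the internal code of hierinf: the two data sets x1 and x2, the helper pad, and the stacking of the data sets by row are assumptions made for this example.

```r
# Sketch only: follows the description above, not the package internals.
set.seed(1)

# Two hypothetical data sets; x2 does not measure Var1.
x1 <- matrix(rnorm(30 * 4), nrow = 30,
             dimnames = list(NULL, c("Var1", "Var2", "Var3", "Var4")))
x2 <- matrix(rnorm(20 * 3), nrow = 20,
             dimnames = list(NULL, c("Var2", "Var3", "Var4")))

# Pad each data set to the union of variables; unmeasured variables become NA.
all_vars <- union(colnames(x1), colnames(x2))
pad <- function(x, vars) {
  out <- matrix(NA_real_, nrow = nrow(x), ncol = length(vars),
                dimnames = list(NULL, vars))
  out[, colnames(x)] <- x
  out
}
x_all <- rbind(pad(x1, all_vars), pad(x2, all_vars))

# With the default use = "everything", the missing values propagate.
has_na_default <- anyNA(cor(x_all, use = "everything"))

# Pairwise complete observations use all rows available for each pair.
r <- cor(x_all, use = "pairwise.complete.obs")

# Dissimilarity: one minus the squared empirical correlation.
d <- 1 - r^2

# Hierarchical clustering of the variables with average linkage.
hc <- hclust(as.dist(d), method = "average")
dend <- as.dendrogram(hc)
```

In this sketch every pair of variables is observed together in at least one data set. If two variables never occur in the same data set, their pairwise correlation is NA and hclust cannot be applied to the resulting matrix.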
If the arguments x and block are supplied, i.e. the
block defines the second level of the
hierarchical tree, the function can be run in parallel across
the different blocks by specifying the arguments parallel and
ncpus. There is an optional argument cl if
parallel = "snow". There are three possibilities to set the
argument parallel: parallel = "no" for serial evaluation
(default), parallel = "multicore" for parallel evaluation
using forking, and parallel = "snow" for parallel evaluation
using a parallel socket cluster. It is recommended to select
RNGkind("L'Ecuyer-CMRG") and set a seed to ensure that
the parallel computing of the package hierinf is reproducible.
This way each processor gets a different substream of the pseudo random
number generator stream which makes the results reproducible if the arguments
(such as sort.parallel and ncpus) remain unchanged. See the vignette
or the reference for more details.
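The recommended set-up for parallel runs might look as follows. This is a sketch under assumptions: the simulated data, the seed, and the choice of two processes are arbitrary, and the call is only executed if the package hierinf is installed.

```r
# Select the L'Ecuyer-CMRG generator and set a seed, as recommended above.
RNGkind("L'Ecuyer-CMRG")
set.seed(42)

# Simulated data with two blocks of ten variables each (illustration only).
x <- matrix(rnorm(50 * 20), nrow = 50,
            dimnames = list(NULL, paste0("Var", 1:20)))
block <- data.frame(var.name = paste0("Var", 1:20),
                    block = rep(c(1, 2), each = 10),
                    stringsAsFactors = FALSE)

if (requireNamespace("hierinf", quietly = TRUE)) {
  # No cl is supplied, so a cluster on the local machine is created.
  dendr <- hierinf::cluster_vars(x = x, block = block,
                                 parallel = "snow", ncpus = 2L)
}
```

The option parallel = "snow" is used here because a socket cluster does not depend on forking; parallel = "multicore" is called in the same way.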
Meinshausen, N. (2008). Hierarchical testing of variable importance. Biometrika, 95(2), 265-278.
Renaux, C., Buzdugan, L., Kalisch, M., and Bühlmann, P. (2020). Hierarchical inference for genome-wide association studies: a view on methodology with software. Computational Statistics, 35(1), 1-40.
# NOT RUN {
library(MASS)
x <- mvrnorm(50, mu = rep(0, 100), Sigma = diag(100))
colnames(x) <- paste0("Var", 1:100)
dendr1 <- cluster_vars(x = x)
# The column names of the data frame block are optional.
block <- data.frame("var.name" = paste0("Var", 1:100),
"block" = rep(c(1, 2), each = 50))
dendr2 <- cluster_vars(x = x, block = block)
# The matrix x is first transposed because the function dist calculates
# distances between the rows.
d <- dist(t(x))
dendr3 <- cluster_vars(d = d, method = "single")
# }