mcancor: Non-Negative and Sparse Multi-Domain CCA

Description

Performs a canonical correlation analysis (CCA) on multiple data domains, where constraints such as non-negativity or sparsity are enforced on the canonical vectors. The result of the analysis is returned as a list of class mcancor.

Usage

mcancor(x, center = TRUE, scale_ = FALSE, nvar = min(sapply(x, dim)),
  predict, cor_tol = NULL, nrestart = 10, iter_tol = 0, iter_max = 50,
  partial_model = NULL, verbosity = 0)

Value

mcancor returns a list of class mcancor with the following elements:

cor: a multi-dimensional array containing the additional correlations explained by each pair of canonical variables. The first two dimensions correspond to the domains, and the third dimension corresponds to the different canonical variables per domain (see also macor).
coef: a list of matrices containing the canonical vectors related to each data domain. The canonical vectors are stored as the columns of each matrix.
center: the list of empirical means used to center the data matrices
scale: the list of empirical standard deviations used to scale the data matrices
xp: the list of deflated data matrices corresponding to x

Arguments

x: a list of numeric matrices which contain the data from the different domains
center: a list of logical values indicating whether the empirical mean of (each column of) the corresponding data matrix should be subtracted. Alternatively, a list of vectors can be supplied, where each vector specifies the mean to be subtracted from the corresponding data matrix. Each list element is passed to scale.
scale_: a list of logical values indicating whether the columns of the corresponding data matrix should be scaled to have unit variance before the analysis takes place. The default is FALSE for consistency with nscancor. Alternatively, a list of vectors can be supplied, where each vector specifies the standard deviations used to rescale the columns of the corresponding data matrix. Each list element is passed to scale.
nvar: the number of canonical variables to be computed for each domain. With the default setting, canonical variables are computed until at least one data matrix is fully deflated.
predict: a list of regression functions to predict the sum of the canonical variables of all other domains. The formal arguments for each regression function are the design matrix x corresponding to the data from the current domain, the regression target sc as the sum of the canonical variables for all other domains, and cc as a counter indicating which canonical variable is currently being computed (e.g. for enforcing different constraints for subsequent canonical vectors of a given domain). See the examples for an illustration.
cor_tol: a threshold indicating the magnitude below which canonical variables should be omitted. Variables are omitted if the sum of all their correlations is less than or equal to cor_tol times the sum of all correlations of the first canonical variables of all domains. With the default NULL setting, no variables are omitted.
nrestart: the number of random restarts for computing the canonical variables via iterated regression steps. The solution achieving maximum explained correlation over all random restarts is kept. A value greater than one can help to avoid poor local maxima.
iter_tol: if the relative change of the objective is less than iter_tol between iterations, the procedure is assumed to have converged to a local optimum.
iter_max: the maximum number of iterations to be performed. The procedure is terminated if either the iter_tol or the iter_max criterion is satisfied.
partial_model: NULL or an object of class mcancor. The computation can be continued from a partial model by providing an mcancor object (either from a previous run of this function or from macor) and setting nvar to a value greater than the number of canonical variables contained in the partial model. See the examples for an illustration.
verbosity: an integer specifying the verbosity level. Greater values result in more output; the default is to be quiet.
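
To illustrate the interface expected for the elements of predict, the following is a minimal sketch of a regression function with the formal arguments x, sc and cc. It uses a plain ridge regression in base R, chosen only for illustration; it enforces neither sparsity nor non-negativity. The helper name make_ridge_predict, the penalty lambda and the toy data are assumptions and are not part of the package.

```r
# Hypothetical helper (not part of the package): builds a regression function
# with the formal arguments (x, sc, cc) described for the predict argument.
make_ridge_predict <- function(lambda = 1) {
  force(lambda)
  function(x, sc, cc) {
    # Ridge solution: (X'X + lambda * I)^{-1} X'sc
    # cc (the counter of the current canonical variable) is unused here
    p <- ncol(x)
    w <- solve(crossprod(x) + lambda * diag(p), crossprod(x, sc))
    drop(w)  # return the weights as a plain numeric vector
  }
}

# Toy data, for illustration only
set.seed(1)
x_ridge <- matrix(rnorm(50 * 4), 50, 4)
sc_ridge <- x_ridge %*% c(1, 0, -1, 0) + rnorm(50, sd = 0.1)
w_ridge <- make_ridge_predict(lambda = 0.1)(x_ridge, sc_ridge, cc = 1)
```

A list with one such function per domain, e.g. lapply(c(1, 1, 1), make_ridge_predict), would have the shape that the predict argument describes.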

Details

mcancor generalizes nscancor to the case where more than two data domains are available for an analysis. Its objective is to maximize the sum of all pairwise correlations of the canonical variables.
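
The objective described above can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's internal code: the helper name sum_pairwise_cor is hypothetical, each domain is represented by a single canonical vector, and the sum is taken over unordered pairs of domains (the text above does not say whether each pair is counted once or twice).

```r
# Hypothetical helper (not part of the package): sums the correlations of the
# canonical variables over all unordered pairs of domains.
sum_pairwise_cor <- function(x, w) {
  # Canonical variable of each domain: data matrix times canonical vector
  z <- Map(function(xi, wi) drop(xi %*% wi), x, w)
  pairs <- utils::combn(length(z), 2)
  sum(apply(pairs, 2, function(ij) cor(z[[ij[1]]], z[[ij[2]]])))
}

# Toy data with three domains, for illustration only
set.seed(1)
n <- 30
x_obj <- list(matrix(rnorm(n * 3), n, 3),
              matrix(rnorm(n * 2), n, 2),
              matrix(rnorm(n * 4), n, 4))
w_obj <- list(c(1, 0, 0), c(0, 1), c(0, 0, 1, 0))
obj <- sum_pairwise_cor(x_obj, w_obj)
```

With three domains the value lies between -3 and 3, and it reaches 3 only when all three canonical variables are perfectly correlated.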

Examples

# \donttest{
if (requireNamespace("glmnet", quietly = TRUE) &&
    requireNamespace("PMA", quietly = TRUE)) {

  data(breastdata, package="PMA")

  set.seed(1)

  # Three data domains: a subset of genes, and CGH spots for the first and
  # second chromosome
  x <- with(breastdata,
            list(t(rna)[ , 1:100], t(dna)[ , chrom == 1], t(dna)[ , chrom == 2])
  )

  # Sparse regression functions with different cardinalities for different domains
  generate_predict <- function(dfmax) {
    force(dfmax)
    return(
      function(x, sc, cc) {
        en <- glmnet::glmnet(x, sc, alpha = 0.05, intercept = FALSE, dfmax = dfmax)
        W <- coef(en)
        return(W[2:nrow(W), ncol(W)])
      }
    )
  }
  predict <- lapply(c(20, 10, 10), generate_predict)

  # Compute two canonical variables per domain
  mcc <- mcancor(x, predict = predict, nvar = 2)

  # Compute another canonical variable for each domain
  mcc <- mcancor(x, predict = predict, nvar = 3, partial_model = mcc)
}
# }