nscancor (version 0.6)

nscancor: Non-Negative and Sparse CCA

Description

Performs a canonical correlation analysis (CCA) where constraints such as non-negativity or sparsity are enforced on the canonical vectors. The result of the analysis is returned as a list of class nscancor, which contains a superset of the elements returned by cancor.

Usage

nscancor(x, y, xcenter = TRUE, ycenter = TRUE, xscale = FALSE, yscale = FALSE, nvar = min(dim(x), dim(y)), xpredict, ypredict, cor_tol = NULL, nrestart = 10, iter_tol = 0.001, iter_max = 30, partial_model = NULL, verbosity = 0)

Arguments

x
a numeric matrix which provides the data from the first domain
y
a numeric matrix which provides the data from the second domain
xcenter
a logical value indicating whether the empirical mean of (each column of) x should be subtracted. Alternatively, a vector of length equal to the number of columns of x can be supplied. The value is passed to scale.
ycenter
analogous to xcenter
xscale
a logical value indicating whether the columns of x should be scaled to have unit variance before the analysis takes place. The default is FALSE for consistency with cancor. Alternatively, a vector of length equal to the number of columns of x can be supplied. The value is passed to scale.
yscale
analogous to xscale
nvar
the number of canonical variables to be computed for each domain. With the default setting, canonical variables are computed until either x or y is fully deflated.
xpredict
the regression function to predict the canonical variable for x, given y. The formal arguments are the design matrix y, the regression target xc as the current canonical variable for x, and cc as a counter of the current pair of canonical variables (e.g. for enforcing different constraints for different canonical vectors). See the examples for an illustration.
ypredict
analogous to xpredict
cor_tol
a threshold indicating the magnitude below which canonical variables should be omitted. Variables are omitted if their explained correlations are less than or equal to cor_tol times the correlation of the first pair of canonical variables. With the default NULL setting, no variables are omitted.
nrestart
the number of random restarts for computing the canonical variables via iterated regression steps. The solution achieving maximum explained correlation over all random restarts is kept. A value greater than one can help to avoid poor local maxima.
iter_tol
if the relative change of the objective between iterations is less than iter_tol, the procedure is assumed to have converged to a local optimum.
iter_max
the maximum number of iterations to be performed. The procedure is terminated if either the iter_tol or the iter_max criterion is satisfied.
partial_model
NULL or an object of class nscancor. The computation can be continued from a partial model by providing an nscancor object (either from a previous run of this function or from acor) and setting nvar to a value greater than the number of canonical variables contained in the partial model. See the examples for an illustration.
verbosity
an integer specifying the verbosity level. Greater values result in more output; the default is to be quiet.
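
The cor_tol and iter_tol rules described above can be illustrated with a few lines of base R. This is only a sketch under stated assumptions: the correlation values are made up, has_converged is a hypothetical helper, and the exact form of the relative-change test inside nscancor may differ from the one assumed here.

```r
# Hypothetical explained correlations for four pairs of canonical variables
cors <- c(0.90, 0.45, 0.08, 0.02)
cor_tol <- 0.1

# A pair is omitted if its correlation is <= cor_tol times that of the first pair
keep <- cors > cor_tol * cors[1]
cors_kept <- cors[keep]

# Assumed form of the relative-change test used together with iter_tol
has_converged <- function(obj_old, obj_new, iter_tol = 0.001) {
  abs(obj_new - obj_old) / abs(obj_old) < iter_tol
}
```

With these values the threshold is 0.09, so only the first two pairs are retained.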

Value

nscancor returns a list of class nscancor containing the following elements:
cor
the additional correlation explained by each pair of canonical variables; see acor.
xcoef
the matrix containing the canonical vectors related to x as its columns
ycoef
analogous to xcoef
xcenter
if xcenter is TRUE the centering vector, else the zero vector (in accordance with cancor)
ycenter
analogous to xcenter
xscale
if xscale is TRUE the scaling vector, else FALSE
yscale
analogous to xscale
xp
the deflated data matrix corresponding to x
yp
analogous to xp
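
Because the returned list is a superset of what cancor returns, the elements can be combined in the same way as for a cancor result. The sketch below uses stats::cancor on simulated data as a stand-in for nscancor, so that it runs without additional packages; the data and variable names are illustrative assumptions. For a constrained model, cor holds the additional explained correlation, which need not equal the plain correlation computed here.

```r
set.seed(1)
x <- matrix(rnorm(100 * 3), 100, 3)
y <- cbind(x[, 1] + rnorm(100), rnorm(100))

# stats::cancor, used here as a stand-in for an nscancor result
fit <- cancor(x, y)

# Center the data with the stored centering vectors, then apply the
# canonical vectors (columns of xcoef and ycoef)
xv <- scale(x, center = fit$xcenter, scale = FALSE) %*% fit$xcoef
yv <- scale(y, center = fit$ycenter, scale = FALSE) %*% fit$ycoef

# Correlation of the first pair of canonical variables
first_cor <- cor(xv[, 1], yv[, 1])
```

In the unconstrained case, first_cor agrees with fit$cor[1] up to numerical precision.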

Details

nscancor computes the canonical vectors (called xcoef and ycoef) using iterated regression steps, where the constraints suitable for each domain are enforced by choosing the appropriate regression method. See Sigg et al. (2007) for an early application of the principle (not yet including generalized deflation).

Because constrained canonical vectors no longer correspond to true eigenvectors of the cross-covariance matrix and are usually not pairwise conjugate (i.e. the canonical variables are not uncorrelated), special attention needs to be paid when computing more than a single pair of canonical vectors. nscancor implements a generalized deflation (GD) scheme which builds on GD for PCA as proposed by Mackey (2009). For each domain, a basis of the space spanned by the previous canonical variables is computed. Then, the correlation of the current pair of canonical variables is maximized after projecting each current canonical vector to the ortho-complement space of its respective basis. This procedure maximizes the additional correlation not explained by previous canonical variables, and is identical to standard CCA if the canonical vectors are the eigenvectors of the cross-covariance matrix.
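
The projection idea behind generalized deflation can be sketched in base R. This is not the package's internal code: it is a minimal illustration for one domain, assuming that deflation is realized by projecting the centered data onto the ortho-complement of the span of the previous canonical variables. The matrices W_prev and w_new are made-up canonical vectors.

```r
set.seed(1)
n <- 50
X <- scale(matrix(rnorm(n * 4), n, 4), center = TRUE, scale = FALSE)

# Hypothetical earlier canonical vectors (as columns) for the x domain
W_prev <- cbind(c(1, 0, 1, 0), c(0, 1, 0, 0))

# Previous canonical variables and an orthonormal basis of their span
U_prev <- X %*% W_prev
Q <- qr.Q(qr(U_prev))

# Project every column of X onto the ortho-complement of span(U_prev)
Xp <- X - Q %*% crossprod(Q, X)

# A variable built from the deflated data is uncorrelated with the earlier ones
w_new <- c(0.5, -1, 2, 1)
u_new <- Xp %*% w_new
max_abs_cor <- max(abs(cor(u_new, U_prev)))
```

Any correlation that u_new attains with a variable from the other domain is therefore correlation not already explained by the previous pairs, which is the quantity the procedure is described as maximizing.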

See the references for further details.

References

Sigg, C. and Fischer, B. and Ommer, B. and Roth, V. and Buhmann, J. (2007) Nonnegative CCA for Audiovisual Source Separation. In Proceedings of the 2007 IEEE Workshop on Machine Learning for Signal Processing (pp. 253--258).

Mackey, L. (2009) Deflation Methods for Sparse PCA. In Advances in Neural Information Processing Systems (pp. 1017--1024).

See Also

acor, cancor, scale

Examples

library(MASS)
library(glmnet)
data(nutrimouse, package="CCA")

set.seed(1)

### 
# Unconstrained CCA, produces identical results to calling 
# cancor(nutrimouse$gene[ , 1:10], nutrimouse$lipid)

ypredict <- function(x, yc, cc) {
  return(ginv(x)%*%yc)
}
xpredict <- function(y, xc, cc) {
  return(ginv(y)%*%xc)
} 
cc <- nscancor(nutrimouse$gene[ , 1:10], nutrimouse$lipid, xpredict=xpredict, 
               ypredict=ypredict)


### 
# Non-negative sparse CCA using glmnet() as the regression function, where
# different regularisers are enforced on the different data domains and pairs of 
# canonical variables.

dfmax_w <- c(40, 15, 10, 10)
ypredict <- function(x, yc, cc) {
  en <- glmnet(x, yc, alpha=0.5, intercept=FALSE, dfmax=dfmax_w[cc], lower.limits=0)
  W <- coef(en)
  # drop the intercept row and keep the coefficients for the last lambda on the path
  return(W[2:nrow(W), ncol(W)])
}
dfmax_v <- c(7, 5, 5, 3)
xpredict <- function(y, xc, cc) {
  en <- glmnet(y, xc, alpha=0.5, intercept=FALSE, dfmax=dfmax_v[cc])
  V <- coef(en)
  # drop the intercept row and keep the coefficients for the last lambda on the path
  return(V[2:nrow(V), ncol(V)])
}
nscc <- nscancor(nutrimouse$gene, nutrimouse$lipid, nvar=3,
                 xpredict=xpredict, ypredict=ypredict)

# continue the computation of canonical variables from a partial model
nscc <- nscancor(nutrimouse$gene, nutrimouse$lipid, nvar=4,
                 xpredict=xpredict, ypredict=ypredict,
                 partial_model=nscc)
