cancor: Canonical Correlation Analysis

Description

The function cancor generalizes and regularizes computation for canonical correlation analysis in a way conducive to visualization using methods in the heplots package.

Usage

cancor(x, ...)
# S3 method for formula
cancor(formula, data, subset, weights, na.rm=TRUE, method = "gensvd", ...)
# S3 method for default
cancor(x, y, weights,
    X.names = colnames(x), Y.names = colnames(y), 
    row.names = rownames(x), 
    xcenter = TRUE, ycenter = TRUE, xscale = FALSE, yscale = FALSE, 
    ndim = min(p, q), 
    set.names = c("X", "Y"), 
    prefix = c("Xcan", "Ycan"), 
    na.rm = TRUE, use = if (na.rm) "complete" else "pairwise",
    method = "gensvd",
    ...
    )
# S3 method for cancor
print(x, digits = max(getOption("digits") - 2, 3), ...)
# S3 method for cancor
summary(object, digits = max(getOption("digits") - 2, 3), ...)
# S3 method for cancor
coef(object, type = c("x", "y", "both", "list"), standardize=FALSE, ...)
scores(x, ...)
# S3 method for cancor
scores(x, type = c("x", "y", "both", "list", "data.frame"), ...)

Arguments

formula

A two-sided formula of the form cbind(y1, y2, y3, …) ~ x1 + x2 + x3 + …

data

The data.frame within which the formula is evaluated

subset

an optional vector specifying a subset of observations to be used in the calculations.

weights

Observation weights. If supplied, this must be a vector of length equal to the number of observations in X and Y, typically within [0,1]. In that case, the variance-covariance matrices are computed using cov.wt, and the number of observations is taken as the number of non-zero weights.
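As a rough illustration of what weighting does to the covariance computation, the sketch below calls base R's cov.wt directly on made-up data. The matrix Z and the 0/1 weights wts are hypothetical, and the snippet only demonstrates cov.wt itself, not the internals of cancor. With 0/1 weights, the weighted covariance matrix coincides with the ordinary covariance of the rows that have non-zero weight.

```r
# Hypothetical data: 10 observations on 4 variables
set.seed(42)
Z <- matrix(rnorm(40), nrow = 10, ncol = 4)

# Hypothetical 0/1 weights: observations 3 and 7 are dropped
wts <- c(1, 1, 0, 1, 1, 1, 0, 1, 1, 1)

# Weighted covariance matrix (cov.wt rescales the weights to sum to 1)
S_w <- cov.wt(Z, wt = wts)$cov

# Ordinary covariance of the observations with non-zero weight
S_sub <- cov(Z[wts > 0, ])

# Effective number of observations: the count of non-zero weights
n_eff <- sum(wts != 0)

all.equal(S_w, S_sub, check.attributes = FALSE)
```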

na.rm

logical, determining whether observations with missing values are excluded in the computation of the variance matrix of (X,Y). See Notes for details on missing data.

method

the method to be used for calculation; currently only method = "gensvd" is supported.

x

Varies depending on method. For the cancor.default method, this should be a matrix or data.frame whose columns contain the X variables.

y

For the cancor.default method, a matrix or data.frame whose columns contain the Y variables.

X.names, Y.names

Character vectors of names for the X and Y variables.

row.names

Observation names in x, y

xcenter, ycenter

logical. Center the X, Y variables? [not yet implemented]

xscale, yscale

logical. Scale the X, Y variables to unit variance? [not yet implemented]

ndim

Number of canonical dimensions to retain in the result, for scores, coefficients, etc.

set.names

A vector of two character strings, giving names for the collections of the X, Y variables.

prefix

A vector of two character strings, giving prefixes used to name the X and Y canonical variables, respectively.

use

argument passed to var determining how missing data are handled. Only the default, use="complete" is allowed when observation weights are supplied.
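The difference between the two settings can be seen by calling base R's var directly on a small matrix with missing values. The matrix d below is made up for illustration, and the full option names "complete.obs" and "pairwise.complete.obs" are the ones that var documents; the snippet shows only the behavior of var, not of cancor.

```r
# Hypothetical data with one missing value in column a and one in column c
d <- cbind(a = c(1, 2, 3, 4, NA),
           b = c(2, 1, 4, 3, 5),
           c = c(5, NA, 2, 1, 3))

# Complete-case analysis: only rows with no NA at all (rows 1, 3, 4) are used
V_complete <- var(d, use = "complete.obs")

# Pairwise analysis: each entry uses all rows complete for that pair of columns
V_pairwise <- var(d, use = "pairwise.complete.obs")

# The variance of b differs, because pairwise deletion keeps all 5 values of b
V_complete["b", "b"]
V_pairwise["b", "b"]
```

Pairwise deletion uses more of the data, but the resulting matrix is assembled from different subsets of observations and is not guaranteed to be positive semi-definite.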

object

A cancor object for related methods.

digits

Number of digits passed to print and summary methods

…

Other arguments, passed to methods

type

For the coef method, the type of coefficients returned, one of "x", "y", "both". For the scores method, the same list, or "data.frame", which returns a data.frame containing the X and Y canonical scores.

standardize

For the coef method, whether coefficients should be standardized by dividing by the standard deviations of the X and Y variables.

Value

An object of class cancor, a list with the following components:

cancor

Canonical correlations, i.e., the correlations between each canonical variate for the Y variables with the corresponding canonical variate for the X variables.

names

Names for various items, a list of 4 components: X, Y, row.names, set.names

ndim

Number of canonical dimensions extracted, <= min(p,q)

dim

Problem dimensions, a list of 3 components: p (number of X variables), q (number of Y variables), n (sample size)

coef

Canonical coefficients, a list of 2 components: X, Y

scores

Canonical variate scores, a list of 2 components:

X: Canonical variate scores for the X variables
Y: Canonical variate scores for the Y variables

X

The matrix X

Y

The matrix Y

weights

Observation weights, if supplied, else NULL

structure

Structure correlations ("loadings"), a list of 4 components:

X.xscores: Structure correlations of the X variables with the Xcan canonical scores
Y.xscores: Structure correlations of the Y variables with the Xcan canonical scores
X.yscores: Structure correlations of the X variables with the Ycan canonical scores
Y.yscores: Structure correlations of the Y variables with the Ycan canonical scores

The formula method also returns components call and terms
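To make the scores and structure components more concrete, here is a minimal sketch that builds analogous quantities by hand. It deliberately uses stats::cancor from base R and the built-in LifeCycleSavings data, not this package's cancor, so the object names (cc0, xs, ys, struct) are illustrative and the result is not guaranteed to match the returned components in sign or scaling.

```r
# Two variable sets from a built-in data set
X <- as.matrix(LifeCycleSavings[, c("pop15", "pop75")])
Y <- as.matrix(LifeCycleSavings[, c("sr", "dpi", "ddpi")])

cc0 <- stats::cancor(X, Y)   # base R version, called explicitly
k <- length(cc0$cor)         # number of canonical dimensions, min(p, q)

# Canonical scores: centered variables times the canonical coefficients
xs <- sweep(X, 2, cc0$xcenter) %*% cc0$xcoef[, 1:k, drop = FALSE]
ys <- sweep(Y, 2, cc0$ycenter) %*% cc0$ycoef[, 1:k, drop = FALSE]

# Structure correlations: each variable correlated with each canonical variate
struct <- list(
  X.xscores = cor(X, xs),
  Y.xscores = cor(Y, xs),
  X.yscores = cor(X, ys),
  Y.yscores = cor(Y, ys)
)

# Matched pairs of canonical variates reproduce the canonical correlations
diag(cor(xs, ys))
cc0$cor
```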

Details

Canonical correlation analysis (CCA), as traditionally presented, is used to identify and measure the associations between two sets of quantitative variables, X and Y. It is often used in the same situations for which a multivariate multiple regression analysis (MMRA) would be used. However, CCA is “symmetric” in that the sets X and Y have equivalent status, and the goal is to find orthogonal linear combinations of each having maximal (canonical) correlations. On the other hand, MMRA is “asymmetric”, in that the Y set is considered as responses, each one to be explained by separate linear combinations of the Xs.

This implementation of cancor provides the basic computations for CCA, together with some extractor functions and methods for working with the results in a convenient fashion.

However, for visualization using HE plots, it is most natural to consider plots representing the relations among the canonical variables for the Y variables in terms of a multivariate linear model predicting the Y canonical scores, using either the X variables or the X canonical scores as predictors. Such plots, using heplot.cancor, provide a low-rank (1D, 2D, 3D) visualization of the relations between the two sets, and so are useful in cases when there are more than 2 or 3 variables in each of X and Y.

The connection between CCA and HE plots for MMRA models can be developed as follows. CCA can also be viewed as a principal component transformation of the predicted values of one set of variables from a regression on the other set of variables, in the metric of the error covariance matrix.

For example, regress the Y variables on the X variables, giving predicted values \(\hat{Y} = X (X'X)^{-1} X' Y\) and residuals \(R = Y - \hat{Y}\). The error covariance matrix is \(E = R'R/(n-1)\). Choose a transformation Q that orthogonalizes the error covariance matrix to an identity, that is, \((RQ)'(RQ) = Q' R' R Q = (n-1) I\), and apply the same transformation to the predicted values to yield, say, \(Z = \hat{Y} Q\). Then, a principal component analysis on the covariance matrix of Z gives eigenvalues of \(E^{-1} H\), where \(H = \hat{Y}'\hat{Y}/(n-1)\) is the hypothesis covariance matrix of the (centered) predicted values, and so is equivalent to the MMRA analysis of lm(Y ~ X) statistically, but visualized here in canonical space.
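The construction in the preceding paragraph can be checked numerically. The sketch below is an illustration on simulated data using base R only. The data, the coefficient matrix B, and the choice of Q from a Cholesky factor are all assumptions made for the example, and stats::cancor is used as an independent reference rather than this package's cancor. It relies on the standard relation between each eigenvalue \(\lambda\) of \(E^{-1} H\) and the corresponding canonical correlation \(\rho\), namely \(\lambda = \rho^2 / (1 - \rho^2)\), which also matches the Eigen and CanRSQ columns printed in the Examples.

```r
set.seed(1)
n <- 100
X <- matrix(rnorm(n * 3), n, 3)
B <- matrix(c(1, 0.5, 0, 0, 1, 0.5), 3, 2)      # hypothetical true coefficients
Y <- X %*% B + matrix(rnorm(n * 2), n, 2)

# Center both sets so that no intercept is needed
Xc <- scale(X, center = TRUE, scale = FALSE)
Yc <- scale(Y, center = TRUE, scale = FALSE)

# Regress Y on X: predicted values and residuals
Yhat <- Xc %*% solve(crossprod(Xc), crossprod(Xc, Yc))
R <- Yc - Yhat

# Q orthogonalizes the errors: Q' R'R Q = (n - 1) I
U <- chol(crossprod(R))                         # R'R = U'U
Q <- sqrt(n - 1) * solve(U)

# Transform the predicted values and do a PCA on their covariance matrix
Z <- Yhat %*% Q
lambda <- eigen(cov(Z), symmetric = TRUE)$values

# Reference: canonical correlations from base R
rho <- stats::cancor(X, Y)$cor

all.equal(lambda, rho^2 / (1 - rho^2))
```

Using a Cholesky factor is only one way to obtain a suitable Q; any matrix satisfying the orthogonalization condition gives the same eigenvalues.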

References

Gittins, R. (1985). Canonical Analysis: A Review with Applications in Ecology, Berlin: Springer.

Mardia, K. V., Kent, J. T. and Bibby, J. M. (1979). Multivariate Analysis. London: Academic Press.

Examples

# NOT RUN {
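library(candisc)   # provides this version of cancor()
library(heplots)   # heplot(), pairs() for mlm objects, and the data sets used here
library(car)       # Anova()
data(Rohwer, package="heplots")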
X <- as.matrix(Rohwer[,6:10])  # the PA tests
Y <- as.matrix(Rohwer[,3:5])   # the aptitude/ability variables

# visualize the correlation matrix using corrplot()
if (require(corrplot)) {
  M <- cor(cbind(X,Y))
  corrplot(M, method="ellipse", order="hclust", addrect=2, addCoef.col="black")
}


(cc <- cancor(X, Y, set.names=c("PA", "Ability")))

## Canonical correlation analysis of:
##       5   PA  variables:  n, s, ns, na, ss 
##   with        3   Ability  variables:  SAT, PPVT, Raven 
## 
##     CanR  CanRSQ   Eigen percent    cum                          scree
## 1 0.6703 0.44934 0.81599   77.30  77.30 ******************************
## 2 0.3837 0.14719 0.17260   16.35  93.65 ******                        
## 3 0.2506 0.06282 0.06704    6.35 100.00 **                            
## 
## Test of H0: The canonical correlations in the 
## current row and all that follow are zero
## 
##      CanR  WilksL      F df1   df2  p.value
## 1 0.67033 0.44011 3.8961  15 168.8 0.000006
## 2 0.38366 0.79923 1.8379   8 124.0 0.076076
## 3 0.25065 0.93718 1.4078   3  63.0 0.248814


# formula method
cc <- cancor(cbind(SAT, PPVT, Raven) ~  n + s + ns + na + ss, data=Rohwer, 
    set.names=c("PA", "Ability"))

# using observation weights
set.seed(12345)
wts <- sample(0:1, size=nrow(Rohwer), replace=TRUE, prob=c(.05, .95))
(ccw <- cancor(X, Y, set.names=c("PA", "Ability"), weights=wts) )

# show correlations of the canonical scores 
zapsmall(cor(scores(cc, type="x"), scores(cc, type="y")))

# standardized coefficients
coef(cc, type="both", standardize=TRUE)

plot(cc, smooth=TRUE)

##################
data(schooldata, package="heplots")
##################

# fit the MMreg model
school.mod <- lm(cbind(reading, mathematics, selfesteem) ~ 
    education + occupation + visit + counseling + teacher, data=schooldata)
Anova(school.mod)
pairs(school.mod)

# canonical correlation analysis
school.cc <- cancor(cbind(reading, mathematics, selfesteem) ~ 
    education + occupation + visit + counseling + teacher, data=schooldata)
school.cc
heplot(school.cc, xpd=TRUE, scale=0.3)

# }
