rm.coef: Computes the RM coefficient for variable subset selection

Description

Computes the RM coefficient, measuring the similarity of the spectral decompositions of a p-variable data matrix, and of the matrix which results from regressing all the variables on a subset of only k variables.

Usage

rm.coef(mat, indices)

Arguments

mat

the full data set's covariance (or correlation) matrix

indices

a numerical vector, matrix or 3-d array of integers giving the indices of the variables in the subset. If a matrix is specified, each row is taken to represent a different k-variable subset. If a 3-d array is given, it is assumed that the third dimension corresponds to different cardinalities.

Value

The value of the RM coefficient.

Details

Computes the RM coefficient that measures the similarity of the spectral decompositions of a p-variable data matrix, and of the matrix which results from regressing those variables on a subset (given by "indices") of the variables. Input data is expected in the form of a (co)variance or correlation matrix. If a non-square matrix is given, it is assumed to be a data matrix, and its correlation matrix is used as input.

The definition of the RM coefficient is as follows: $$RM = \sqrt{\frac{\mathrm{tr}(X^t P_v X)}{\mathrm{X^t X}}} $$ where $X$ is the full (column-centered) data matrix and $P_v$ is the matrix of orthogonal projections on the subspace spanned by a k-variable subset.

This definition is equivalent to: $$RM = \sqrt{\frac{\sum\limits_{i=1}^{p}\lambda_i (r)_i^2}{\sum\limits_{j=1}^{p}\lambda_j}} $$ where $\lambda_i$ stands for the $i$-th largest eigenvalue of the covariance matrix defined by X and $r$ stands for the multiple correlation between the i-th Principal Component and the k-variable subset.

These definitions are also equivalent to the expression used in the code, which only requires the covariance (or correlation) matrix of the data under consideration.

The fact that indices can be a matrix or 3-d array allows for the computation of the RM values of subsets produced by the search functions anneal, genetic and improve (whose output option $subsets are matrices or 3-d arrays), using a different criterion (see the example below).

References

Cadima, J. and Jolliffe, I.T. (2001), "Variable Selection and the Interpretation of Principal Subspaces", Journal of Agricultural, Biological and Environmental Statistics, Vol. 6, 62-79.

McCabe, G.P. (1986) "Prediction of Principal Components by Variable Subsets", Technical Report 86-19, Department of Statistics, Purdue University.

Ramsay, J.O., ten Berge, J. and Styan, G.P.H. (1984), "Matrix Correlation", Psychometrika, 49, 403-423.

Examples

Run this code

# NOT RUN {
## An example with a very small data set.

data(iris3) 
x<-iris3[,,1]
rm.coef(var(x),c(1,3))
## [1] 0.8724422

## An example computing the RMs of three subsets produced when the
## anneal function attempted to optimize the RV criterion (using an
## absurdly small number of iterations).

data(swiss)
rvresults<-anneal(cor(swiss),2,nsol=4,niter=5,criterion="Rv")
rm.coef(cor(swiss),rvresults$subsets)

##              Card.2
##Solution 1 0.7982296
##Solution 2 0.7945390
##Solution 3 0.7649296
##Solution 4 0.7623326
# }

Run the code above in your browser using DataLab