rrda.cv: Cross-validation for Ridge Redundancy Analysis

Description

This function performs cross-validation to evaluate the performance of Ridge Redundancy Analysis (RRDA) models. It calculates the mean squared error (MSE) for each combination of rank and ridge penalty value across the cross-validation folds. The function also supports centering and scaling of the input matrices.

The range of lambda for the cross-validation is calculated automatically, following the method of "glmnet" (Friedman et al., 2010). Given a matrix of response variables (Y; an n times q matrix) and a matrix of explanatory variables (X; an n times p matrix), the largest lambda for the validation is obtained as follows:

$$ \lambda_{\text{max}} = \frac{\max_{j \in \{1, 2, \dots, p\}} \sqrt{\sum_{k=1}^{q} \left( \sum_{i=1}^{n} (x_{ij}\cdot y_{ik}) \right)^2}}{n \times 10^{-3}}$$

Then, we define $\lambda_{\text{min}}=10^{-4}\lambda_{\text{max}}$, and the sequence of $\lambda$ values is generated within this range.
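
The calculation above can be sketched in base R. This is only an illustration, not the package's internal code: it takes the sample size in the denominator to be the number of rows n, assumes X and Y are already centered, and assumes the grid is equally spaced on the log scale as in glmnet (the text above does not state the spacing). The names `XtY`, `row_norms`, `num_lambda` and `lambda_seq` are hypothetical.

```r
# Sketch of the lambda range described above (base R only).
# Assumptions: X and Y are centered; the grid is log-spaced as in glmnet.
set.seed(1)
n <- 50; p <- 8; q <- 4
X <- scale(matrix(rnorm(n * p), n, p), center = TRUE, scale = FALSE)
Y <- scale(matrix(rnorm(n * q), n, q), center = TRUE, scale = FALSE)

# crossprod(X, Y) is the p x q matrix whose (j, k) entry is sum_i x_ij * y_ik
XtY <- crossprod(X, Y)

# Euclidean norm over the q responses for each predictor j,
# then the maximum over the p predictors
row_norms <- sqrt(rowSums(XtY^2))
lambda_max <- max(row_norms) / (n * 1e-3)
lambda_min <- 1e-4 * lambda_max

# Hypothetical grid: num_lambda values equally spaced on the log scale
num_lambda <- 50
lambda_seq <- exp(seq(log(lambda_max), log(lambda_min), length.out = num_lambda))
range(lambda_seq)
```

A vector built this way could presumably be passed through the `lambda` argument when a custom range is wanted.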

To reduce computation, the lambda range is estimated from a random sample of variables when X or Y is large (by default, when the number of variables exceeds 1000). Alternatively, the range of lambda can be specified manually.
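
As a rough illustration of the variable sampling, the sketch below draws a random subset of columns only when a matrix has more variables than a threshold. The helper `subsample_cols` and its argument `max_vars` are hypothetical, and uniform sampling without replacement is an assumption; the package may sample differently.

```r
# Sketch of column subsampling before estimating the lambda range.
# Assumption: columns are drawn uniformly at random, without replacement.
subsample_cols <- function(M, max_vars) {
  if (ncol(M) > max_vars) {
    M[, sample(ncol(M), max_vars), drop = FALSE]
  } else {
    M  # small matrices are used as they are
  }
}

set.seed(2)
X_big <- matrix(rnorm(20 * 1500), nrow = 20)  # 1500 variables, above the threshold
X_small <- matrix(rnorm(20 * 30), nrow = 20)  # 30 variables, below the threshold

X_sub <- subsample_cols(X_big, max_vars = 1000)
dim(X_sub)
dim(subsample_cols(X_small, max_vars = 1000))
```

Here `max_vars` plays the role that `sample.X` and `sample.Y` play in the function call.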

Usage

rrda.cv(
  Y,
  X,
  maxrank = NULL,
  lambda = NULL,
  num.lambda = 50,
  nfold = 5,
  folds = NULL,
  sample.X = 1000,
  sample.Y = 1000,
  scale.X = FALSE,
  scale.Y = FALSE,
  center.X = TRUE,
  center.Y = TRUE,
  verbose = TRUE
)

Value

A list containing the cross-validated MSE matrix, lambda values, rank values, and the optimal lambda and rank.

Arguments

Y: A numeric matrix of response variables.
X: A numeric matrix of explanatory variables.
maxrank: A numeric vector specifying the maximum rank of the coefficient Bhat. Default is NULL, which sets it to (min(15, min(dim(X), dim(Y)))).
lambda: A numeric vector of ridge penalty values. Default is NULL, where the lambda values are automatically chosen.
num.lambda: The number of lambda values to generate (used only when lambda is not supplied by the user). Default is 50.
nfold: The number of folds for cross-validation. Default is 5.
folds: A vector specifying the folds. Default is NULL, which randomly assigns folds.
sample.X: The number of variables sampled from X for the lambda range estimate. Default is 1000.
sample.Y: The number of variables sampled from Y for the lambda range estimate. Default is 1000.
scale.X: Logical indicating if X should be scaled. If TRUE, scales X. Default is FALSE.
scale.Y: Logical indicating if Y should be scaled. If TRUE, scales Y. Default is FALSE.
center.X: Logical indicating if X should be centered. If TRUE, centers X. Default is TRUE.
center.Y: Logical indicating if Y should be centered. If TRUE, centers Y. Default is TRUE.
verbose: Logical indicating whether to print progress information. If TRUE, the function displays information about the function call. Default is TRUE.
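
For reproducible comparisons it can help to fix the fold assignment rather than rely on random assignment. The sketch below builds a balanced assignment in base R. It assumes that `folds` is an integer vector with one entry per row of X and Y, taking values from 1 to nfold; the argument description above does not spell out this format.

```r
# Sketch: build a reproducible, balanced fold assignment.
# Assumption: `folds` is an integer vector of length n with values 1..nfold.
set.seed(123)
n <- 100     # number of observations (rows of X and Y)
nfold <- 5

# Repeat the fold labels to length n, then shuffle them
folds <- sample(rep(seq_len(nfold), length.out = n))
table(folds)
```

Under that assumption, the vector would be supplied as `rrda.cv(Y = Y, X = X, folds = folds)`.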

Examples

if (FALSE) {
set.seed(10)
simdata <- rdasim1(n = 100, p = 200, q = 200, k = 3)
X <- simdata$X
Y <- simdata$Y
cv_result <- rrda.cv(Y = Y, X = X, maxrank = 5, nfold = 5)
rrda.summary(cv_result = cv_result)

## Complete Example ##

# library(future) # <- if you want to compute in parallel
# plan(multisession) # <- if you want to compute in parallel
# cv_result <- rrda.cv(Y = Y, X = X, maxrank = 5, nfold = 5) # cv
# plan(sequential) # <- to return to sequential computing

# rrda.summary(cv_result = cv_result) # cv result

p <- rrda.plot(cv_result) # cv result plot
print(p)
h <- rrda.heatmap(cv_result) # cv result heatmap
print(h)

estimated_lambda <- cv_result$opt_min$lambda # selected parameter
estimated_rank <- cv_result$opt_min$rank # selected parameter

Bhat <- rrda.fit(Y = Y, X = X, nrank = estimated_rank, lambda = estimated_lambda) # fitting
Bhat_mat <- rrda.coef(Bhat)
Yhat_mat <- rrda.predict(Bhat = Bhat, X = X) # prediction
Yhat <- Yhat_mat[[1]][[1]][[1]] # predicted values

cor_Y_Yhat <- diag(cor(Y, Yhat)) # correlation
summary(cor_Y_Yhat)
}
