spls.cv: Cross-validation procedure to calibrate the parameters (ncomp, lambda.l1) of the Adaptive Sparse PLS regression

Description

The function spls.cv chooses the optimal values for the hyper-parameter of the spls procedure, by minimizing the mean squared error of prediction over the hyper-parameter grid, using Durif et al. (2017) adaptive SPLS algorithm.

Usage

spls.cv(X, Y, lambda.l1.range, ncomp.range, weight.mat = NULL, adapt = TRUE,
  center.X = TRUE, center.Y = TRUE, scale.X = TRUE, scale.Y = TRUE,
  weighted.center = FALSE, return.grid = FALSE, ncores = 1, nfolds = 10,
  nrun = 1, verbose = FALSE)

Arguments

a (n x p) data matrix of predictors. X must be a matrix. Each row corresponds to an observation and each column to a predictor variable.

a (n) vector of (continuous) responses. Y must be a vector or a one column matrix. It contains the response variable for each observation.

lambda.l1.range

a vecor of positive real values, in [0,1]. lambda.l1 is the sparse penalty parameter for the dimension reduction step by sparse PLS (see details), the optimal value will be chosen among lambda.l1.range.

ncomp.range

a vector of positive integers. ncomp is the number of PLS components. The optimal value will be chosen among ncomp.range.

weight.mat

a (ntrain x ntrain) matrix used to weight the l2 metric in the observation space, it can be the covariance inverse of the Ytrain observations in a heteroskedastic context. If NULL, the l2 metric is the standard one, corresponding to homoskedastic model (weight.mat is the identity matrix).

adapt

a boolean value, indicating whether the sparse PLS selection step sould be adaptive or not (see details).

center.X

a boolean value indicating whether the data matrices Xtrain and Xtest (if provided) should be centered or not.

center.Y

a boolean value indicating whether the response values Ytrain set should be centered or not.

scale.Y

a boolean value indicating whether the response values Ytrain should be scaled or not (scale.Y=TRUE implies center.Y=TRUE).

weighted.center

a boolean value indicating whether the centering should take into account the weighted l2 metric or not (if TRUE, it requires that weighted.mat is non NULL).

return.grid

a boolean values indicating whether the grid of hyper-parameters values with corresponding mean prediction error rate over the folds should be returned or not.

ncores

a positve integer, indicating the number of cores that the cross-validation is allowed to use for parallel computation (see details).

nfolds

a positive integer indicating the number of folds in the K-folds cross-validation procedure, nfolds=n corresponds to the leave-one-out cross-validation, default is 10.

nrun

a positive integer indicating how many times the K-folds cross- validation procedure should be repeated, default is 1.

verbose

a boolean value indicating verbosity.

scale.X

Value

An object with the following attributes

lambda.l1.opt

the optimal value in lambda.l1.range.

ncomp.opt

the optimal value in ncomp.range.

cv.grid

the grid of hyper-parameters and corresponding prediction error rate over the folds. cv.grid is NULL if return.grid is set to FALSE.

Details

The columns of the data matrices Xtrain and Xtest may not be standardized, since standardizing can be performed by the function spls.cv as a preliminary step.

The procedure is described in Durif et al. (2017). The K-fold cross-validation can be summarize as follow: the train set is partitioned into K folds, for each value of hyper-parameters the model is fit K times, using each fold to compute the prediction error rate, and fitting the model on the remaining observations. The cross-validation procedure returns the optimal hyper-parameters values, meaning the one that minimize the mean squared error of prediction averaged over all the folds.

This procedures uses the mclapply from the parallel package, available on GNU/Linux and MacOS. Users of Microsoft Windows can refer to the README file in the source to be able to use a mclapply type function.

References

Durif G., Modolo L., Michaelsson J., Mold J. E., Lambert-Lacroix S., Picard F. (2017). High Dimensional Classification with combined Adaptive Sparse PLS and Logistic Regression, (in prep), available on (http://arxiv.org/abs/1502.05933).

Examples

Run this code

# NOT RUN {
### load plsgenomics library
library(plsgenomics)

### generating data
n <- 100
p <- 100
sample1 <- sample.cont(n=n, p=p, kstar=10, lstar=2, 
                       beta.min=0.25, beta.max=0.75, mean.H=0.2, 
                       sigma.H=10, sigma.F=5, sigma.E=5)
                       
X <- sample1$X
Y <- sample1$Y

### hyper-parameters values to test
lambda.l1.range <- seq(0.05,0.95,by=0.1) # between 0 and 1
ncomp.range <- 1:10

### tuning the hyper-parameters
cv1 <- spls.cv(X=X, Y=Y, lambda.l1.range=lambda.l1.range, 
               ncomp.range=ncomp.range, weight.mat=NULL, adapt=TRUE, 
               center.X=TRUE, center.Y=TRUE, 
               scale.X=TRUE, scale.Y=TRUE, weighted.center=FALSE, 
               return.grid=TRUE, ncores=1, nfolds=10, nrun=1)
str(cv1)

### otpimal values
cv1$lambda.l1.opt
cv1$ncomp.opt
# }
# NOT RUN {
# }

Run the code above in your browser using DataLab