multinom.spls.cv: Cross-validation procedure to calibrate the parameters (ncomp, lambda.l1, lambda.ridge) for the multinomial-SPLS method

Description

The function multinom.spls.cv chooses the optimal values for the hyper-parameter of the multinom.spls procedure, by minimizing the averaged error of prediction over the hyper-parameter grid, using Durif et al. (2017) multinomial-SPLS algorithm.

Usage

multinom.spls.cv(X, Y, lambda.ridge.range, lambda.l1.range, ncomp.range,
  adapt = TRUE, maxIter = 100, svd.decompose = TRUE,
  return.grid = FALSE, ncores = 1, nfolds = 10, nrun = 1,
  center.X = TRUE, scale.X = FALSE, weighted.center = TRUE, seed = NULL,
  verbose = TRUE)

Arguments

a (n x p) data matrix of predictors. X must be a matrix. Each row corresponds to an observation and each column to a predictor variable.

a (n) vector of (continuous) responses. Y must be a vector or a one column matrix. It contains the response variable for each observation. Y should take values in {0,...,nclass-1}, where nclass is the number of class.

lambda.ridge.range

a vector of positive real values. lambda.ridge is the Ridge regularization parameter for the RIRLS algorithm (see details), the optimal value will be chosen among lambda.ridge.range.

lambda.l1.range

a vecor of positive real values, in [0,1]. lambda.l1 is the sparse penalty parameter for the dimension reduction step by sparse PLS (see details), the optimal value will be chosen among lambda.l1.range.

ncomp.range

a vector of positive integers. ncomp is the number of PLS components. The optimal value will be chosen among ncomp.range.

adapt

a boolean value, indicating whether the sparse PLS selection step sould be adaptive or not (see details).

maxIter

a positive integer, the maximal number of iterations in the RIRLS algorithm (see details).

svd.decompose

a boolean parameter. svd.decompose indicates wether or not the predictor matrix Xtrain should be decomposed by SVD (singular values decomposition) for the RIRLS step (see details).

return.grid

a boolean values indicating whether the grid of hyper-parameters values with corresponding mean prediction error rate over the folds should be returned or not.

ncores

a positve integer, indicating the number of cores that the cross-validation is allowed to use for parallel computation (see details).

nfolds

a positive integer indicating the number of folds in the K-folds cross-validation procedure, nfolds=n corresponds to the leave-one-out cross-validation, default is 10.

nrun

a positive integer indicating how many times the K-folds cross- validation procedure should be repeated, default is 1.

center.X

a boolean value indicating whether the data matrices Xtrain and Xtest (if provided) should be centered or not.

scale.X

a boolean value indicating whether the data matrices Xtrain and Xtest (if provided) should be scaled or not (scale.X=TRUE implies center.X=TRUE) in the spls step.

weighted.center

a boolean value indicating whether the centering should take into account the weighted l2 metric or not in the SPLS step.

seed

a positive integer value (default is NULL). If non NULL, the seed for pseudo-random number generation is set accordingly.

verbose

a boolean parameter indicating the verbosity.

Value

An object of class multinom.spls with the following attributes

lambda.ridge.opt

the optimal value in lambda.ridge.range.

lambda.l1.opt

the optimal value in lambda.l1.range.

ncomp.opt

the optimal value in ncomp.range.

conv.per

the overall percentage of models that converge during the cross-validation procedure.

cv.grid

the grid of hyper-parameters and corresponding prediction error rate averaged over the folds. cv.grid is NULL if return.grid is set to FALSE.

Details

The columns of the data matrices X may not be standardized, since standardizing is performed by the function multinom.spls.cv as a preliminary step.

The procedure is described in Durif et al. (2017). The K-fold cross-validation can be summarize as follow: the train set is partitioned into K folds, for each value of hyper-parameters the model is fit K times, using each fold to compute the prediction error rate, and fitting the model on the remaining observations. The cross-validation procedure returns the optimal hyper-parameters values, meaning the one that minimize the averaged error of prediction averaged over all the folds.

This procedures uses mclapply from the parallel package, available on GNU/Linux and MacOS. Users of Microsoft Windows can refer to the README file in the source to be able to use a mclapply type function.

References

Durif G., Modolo L., Michaelsson J., Mold J. E., Lambert-Lacroix S., Picard F. (2017). High Dimensional Classification with combined Adaptive Sparse PLS and Logistic Regression, (in prep), available on (http://arxiv.org/abs/1502.05933).

Examples

Run this code

# NOT RUN {
### load plsgenomics library
library(plsgenomics)

### generating data
n <- 100
p <- 100
nclass <- 3
sample1 <- sample.multinom(n=n, p=p, nb.class=nclass, kstar=10, lstar=2, 
                           beta.min=0.25, beta.max=0.75, mean.H=0.2, 
                           sigma.H=10, sigma.F=5)

X <- sample1$X
Y <- sample1$Y

### hyper-parameters values to test
lambda.l1.range <- seq(0.05,0.95,by=0.1) # between 0 and 1
ncomp.range <- 1:10
# log-linear range between 0.01 a,d 1000 for lambda.ridge.range
logspace <- function( d1, d2, n) exp(log(10)*seq(d1, d2, length.out=n))
lambda.ridge.range <- signif(logspace(d1 <- -2, d2 <- 3, n=21), digits=3)

### tuning the hyper-parameters
cv1 <- multinom.spls.cv(X=X, Y=Y, lambda.ridge.range=lambda.ridge.range, 
                        lambda.l1.range=lambda.l1.range, 
                        ncomp.range=ncomp.range, 
                        adapt=TRUE, maxIter=100, svd.decompose=TRUE, 
                        return.grid=TRUE, ncores=1, nfolds=10)
                       
str(cv1)
# }
# NOT RUN {
# }

Run the code above in your browser using DataLab

Data Engineering and BI courses are free this week!