Computes M-fold or Leave-One-Out Cross-Validation scores on a user-input grid to determine optimal values for the sparsity parameters in splsda.

Usage
tune.splsda(X, Y, ncomp = 1,
test.keepX = c(5, 10, 15), already.tested.X, validation = "Mfold",
folds = 10, dist = "max.dist", measure = "BER", auc = FALSE,
progressBar = TRUE, max.iter = 100, near.zero.var = FALSE, nrepeat = 1,
logratio = c('none','CLR'), multilevel = NULL, light.output = TRUE, cpus)
Arguments

X: numeric matrix of predictors. NAs are allowed.

Y: if method = 'spls', a numeric vector or matrix of continuous responses (for multi-response models). NAs are allowed.
ncomp: the number of components to include in the model.
test.keepX: numeric vector for the different numbers of variables to test from the X data set.
already.tested.X: optional, if ncomp > 1; a numeric vector indicating the number of variables to select from the X data set on the first components.
validation: character. Which kind of (internal) validation to use, matching one of "Mfold" or "loo" (see below). Default is "Mfold".
folds: the number of folds in the Mfold cross-validation. See Details.
dist: distance metric used by splsda to estimate the classification error rate; should be a subset of "centroids.dist", "mahalanobis.dist" or "max.dist" (see Details).
measure: two misclassification measures are available: the overall misclassification error (overall) or the Balanced Error Rate (BER).
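The Balanced Error Rate averages the misclassification rates per class, so that each class contributes equally regardless of its size (a standard formulation, not quoted from the package source). With K classes and n_k samples in class k:

```latex
\mathrm{BER} \;=\; \frac{1}{K} \sum_{k=1}^{K} \frac{\#\{\text{misclassified samples in class } k\}}{n_k}
```

This makes BER preferable to the overall error rate when the levels of Y are unbalanced.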
auc: if TRUE, calculate the Area Under the Curve (AUC) performance of the model.
progressBar: by default set to TRUE to output the progress bar of the computation.
max.iter: integer, the maximum number of iterations.
near.zero.var: boolean, see the internal nearZeroVar function (should be set to TRUE in particular for data with many zero values). Default value is FALSE.
nrepeat: number of times the Cross-Validation process is repeated.
logratio: one of 'none' or 'CLR'. Defaults to 'none'.
multilevel: design matrix for multilevel analysis (for repeated measurements) that indicates the repeated measures on each individual, i.e. the individual IDs. See Details.
light.output: if set to FALSE, the prediction/classification of each sample for each test.keepX value and each component is returned.
cpus: number of CPUs to use when running the code in parallel.
Value

Depending on the type of analysis performed, a list that contains:
error.rate: the prediction error for each test.keepX value on each component, averaged across all repeats and subsampling folds. Standard deviations are also output, and all error rates are additionally available as a list.
choice.keepX: the number of variables selected (optimal keepX) on each component.
choice.ncomp: the optimal number of components for the model fitted with $choice.keepX.
error.rate.class: the error rate for each level of Y and for each component, computed with the optimal keepX.
predict: prediction values for each sample, each test.keepX value, each component and each repeat. Only if light.output = FALSE.
class: predicted class for each sample, each test.keepX value, each component and each repeat. Only if light.output = FALSE.
auc: AUC mean and standard deviation if the number of categories in Y is greater than 2; see Details. Only if auc = TRUE.
cor.value: only for a multilevel analysis with two factors: the correlation between latent variables.
Details

This tuning function should be used to tune the parameters in the splsda function (the number of components and the number of variables in keepX to select).
For sPLS-DA, M-fold or LOO cross-validation is performed with stratified subsampling, so that all classes are represented in each fold. If validation = "loo", leave-one-out cross-validation is performed; folds is then set by default to the number of unique individuals.
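As an illustration of stratified subsampling, the base-R sketch below assigns fold labels so that every class appears in each fold. This is a hypothetical helper (`stratified_folds` and the toy `Y` are not mixOmics functions), not the package's internal implementation:

```r
# Sketch: stratified fold assignment -- spread each class's samples
# evenly over the folds. Illustrative only, not mixOmics internals.
stratified_folds <- function(Y, folds = 10) {
  Y <- as.factor(Y)
  fold_id <- integer(length(Y))
  for (lvl in levels(Y)) {
    idx <- which(Y == lvl)
    # cycle fold labels over this class, then shuffle within the class
    fold_id[idx] <- sample(rep(seq_len(folds), length.out = length(idx)))
  }
  fold_id
}

Y <- factor(rep(c("A", "B"), times = c(30, 20)))
table(Y, stratified_folds(Y, folds = 5))  # each class present in every fold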
The function outputs the optimal number of components that achieves the best performance based on the overall error rate or BER. The assessment is data-driven and similar to the process detailed in (Rohart et al., 2016), where one-sided t-tests assess whether there is a gain in performance when adding a component to the model. Our experience has shown that in most cases, the optimal number of components is the number of categories in Y minus 1.
For sPLS-DA multilevel one-factor analysis, M-fold or LOO cross-validation is performed where all repeated measurements of one sample are in the same fold. Note that logratio transform and the multilevel analysis are performed internally and independently on the training and test set.
For a sPLS-DA multilevel two-factor analysis, the correlation between components from the within-subject variation of X and the cond matrix is computed on the whole data set. The reason we cannot obtain a cross-validation error rate, as for the sPLS-DA one-factor analysis, is the difficulty of decomposing and predicting the within matrices within each fold.
For a sPLS two-factor analysis a sPLS canonical mode is run, and the correlation between components from the within-subject variation of X and Y is computed on the whole data set.
If validation = "Mfold", M-fold cross-validation is performed; the number of folds to generate is specified via the folds argument.
If auc = TRUE and there are more than 2 categories in Y, the Area Under the Curve is averaged using one-vs-all comparisons. Note however that the AUC criterion may not be particularly insightful, as the prediction threshold used in sPLS-DA differs from an AUC threshold (sPLS-DA relies on prediction distances for predictions; see ?predict.splsda for more details).
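To make the one-vs-all averaging concrete, here is a base-R sketch using the rank (Mann-Whitney) formulation of the binary AUC. The helpers `auc_binary` and `auc_one_vs_all`, and the toy score matrix, are illustrative assumptions, not mixOmics functions:

```r
# Sketch: average the AUC over one-vs-all comparisons for a multi-class Y.
# auc_binary is the Mann-Whitney (rank) form of the AUC.
auc_binary <- function(scores, positive) {
  r <- rank(scores)
  n_pos <- sum(positive)
  n_neg <- sum(!positive)
  (sum(r[positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

auc_one_vs_all <- function(score_matrix, Y) {
  # score_matrix: one column of predicted scores per class level of Y
  mean(vapply(levels(Y), function(lvl) {
    auc_binary(score_matrix[, lvl], Y == lvl)
  }, numeric(1)))
}

set.seed(42)
Y <- factor(rep(c("A", "B", "C"), each = 10))
scores <- matrix(rnorm(90), 30, 3, dimnames = list(NULL, levels(Y)))
auc_one_vs_all(scores, Y)
```

Each class is in turn treated as the "positive" class against all others, and the resulting binary AUCs are averaged.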
More details about the prediction distances are given in ?predict and in the reference below.
References

mixOmics manuscript:
Rohart F, Gautier B, Singh A, Le Cao K-A. mixOmics: an R package for 'omics feature selection and multiple data integration.
See Also

splsda, predict.splsda and http://www.mixOmics.org for more details.
Examples

## First example: analysis with sPLS-DA
data(breast.tumors)
X = breast.tumors$gene.exp
Y = as.factor(breast.tumors$sample$treatment)
tune = tune.splsda(X, Y, ncomp = 1, nrepeat = 10, logratio = "none",
test.keepX = c(5, 10, 15), folds = 10, dist = "max.dist",
progressBar = TRUE)
# 5 components, optimising 'keepX' and 'ncomp'
tune = tune.splsda(X, Y, ncomp = 5, test.keepX = c(5, 10, 15),
folds = 10, dist = "max.dist", nrepeat = 5, progressBar = TRUE)
tune$choice.ncomp
tune$choice.keepX
plot(tune)
## only tune component 3 and 4
# keeping 5 and 10 variables on the first two components respectively
tune = tune.splsda(X = X,Y = Y, ncomp = 4,
already.tested.X = c(5,10),
test.keepX = seq(1,10,2), progressBar = TRUE)
## Second example: multilevel one-factor analysis with sPLS-DA
data(vac18)
X = vac18$genes
Y = vac18$stimulation
# sample indicates the repeated measurements
design = data.frame(sample = vac18$sample)
tune = tune.splsda(X, Y = Y, ncomp = 3, nrepeat = 10, logratio = "none",
test.keepX = c(5,50,100),folds = 10, dist = "max.dist", multilevel = design)