Computes cross-validated permutation variable importance (PVI) using the ranger random-forest algorithm. For each CV fold, a ranger model is trained on the training split, and out-of-bag (OOB) permutation importance is computed by ranger's C++ code. Importances are averaged across folds to obtain a stable CV importance vector. The null distribution of importances and pi0 (the proportion of null features) are then estimated, optionally with the help of appended artificial “false” features, and the top (1 - pi0) proportion of features is selected. Folds can be evaluated in parallel (Windows-safe PSOCK cluster) while avoiding CPU oversubscription.
fast_cvpvi(
  X,
  Y,
  k = 5,
  ntree = 500,
  nbf = 0,
  nthreads = max(1L, parallel::detectCores() - 1L),
  folds_parallel = c("auto", "TRUE", "FALSE"),
  mtry = NULL,
  sample_fraction = 1,
  min_node_size = 1L,
  seed = 123
)
A list with:
nb_to_sel: integer; number of selected features (floor(p * (1 - pi0))).
sel_moz: character vector of selected feature names (columns of X).
imp_sel: named numeric vector of CV importances for selected features.
fold_varim: matrix (features x folds) of per-fold permutation importances.
cv_varim: matrix (features x 1) of averaged importances across folds.
pi0: estimated proportion of null features.
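To make the relationship between pi0, nb_to_sel, sel_moz and imp_sel concrete, here is a minimal sketch. The importance values and the pi0 estimate are made up for illustration, and the ordering of the selected features by decreasing importance is an assumption rather than documented behavior.

```r
# Hypothetical CV importances for p = 6 features (illustrative values only)
cv_imp <- c(mz_1 = 0.040, mz_2 = -0.002, mz_3 = 0.015,
            mz_4 = 0.001, mz_5 = 0.022, mz_6 = -0.001)
pi0 <- 0.5  # assumed estimate of the proportion of null features

p <- length(cv_imp)
nb_to_sel <- floor(p * (1 - pi0))  # number of features to keep

# Keep the nb_to_sel features with the largest CV importance
ranked  <- sort(cv_imp, decreasing = TRUE)
imp_sel <- ranked[seq_len(nb_to_sel)]
sel_moz <- names(imp_sel)
```

With these values, half of the six features are kept, and they are the three with the largest averaged importance.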
X: Numeric matrix (n x p); samples in rows, features in columns. Column names should be feature IDs (e.g., m/z). Non-finite values are set to zero internally for modeling.
Y: Factor or numeric response of length n. A factor triggers classification; numeric triggers regression.
k: Integer; number of cross-validation folds. Default 5.
ntree: Integer; number of trees per fold model. Default 500.
nbf: Integer (>= 0); number of artificial “false” (noise) features to append to X for estimating the null distribution of importances. Default 0 disables this (the null is then approximated using mirrored negative importances).
nthreads: Integer; total threads available. When parallelizing folds, each fold worker gets one ranger thread to avoid oversubscription; when not parallelizing folds, ranger uses up to nthreads threads. Default is max(1, detectCores() - 1).
folds_parallel: Character; "auto", "TRUE", or "FALSE".
"auto": parallelize across folds when k > 1 and nthreads >= 4 (default).
"TRUE": force fold-level parallelism (PSOCK cluster).
"FALSE": evaluate folds sequentially (ranger can then use multiple threads).
mtry: Optional integer; variables tried at each split. If NULL, defaults to floor(sqrt(p)) for classification or max(floor(p/3), 1) for regression.
sample_fraction: Numeric in (0, 1]; subsampling fraction per tree (speed/regularization knob). Default 1.
min_node_size: Integer; ranger minimum node size. Larger values speed up training and yield smaller trees. Default 1.
seed: Integer; RNG seed. Default 123.
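The default rules for mtry and for folds_parallel = "auto" can be written out as two small helpers. This is only a sketch of the rules as stated in the argument descriptions; resolve_mtry and use_fold_parallel are hypothetical names and are not part of the package.

```r
# Hypothetical helper: default mtry as described for the mtry argument
resolve_mtry <- function(p, classification) {
  if (classification) floor(sqrt(p)) else max(floor(p / 3), 1)
}

# Hypothetical helper: decide whether folds run in parallel
use_fold_parallel <- function(folds_parallel, k, nthreads) {
  switch(folds_parallel,
         "TRUE"  = TRUE,
         "FALSE" = FALSE,
         "auto"  = k > 1 && nthreads >= 4)
}

resolve_mtry(200, classification = TRUE)    # floor(sqrt(200))
resolve_mtry(200, classification = FALSE)   # floor(200 / 3)
use_fold_parallel("auto", k = 5, nthreads = 2)
```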
One ranger model is trained per fold on the training split. Permutation importance (importance = "permutation") is computed in ranger's C++ code from out-of-bag (OOB) samples. The per-fold importances are averaged to obtain the CV importances.
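The per-fold training and averaging described above can be sketched as follows. This is not the package's implementation: the random fold assignment, the toy data and the seed offsets are assumptions made for illustration, and only documented ranger arguments are used.

```r
if (requireNamespace("ranger", quietly = TRUE)) {
  set.seed(1)
  n <- 60; p <- 10; k <- 3
  X <- matrix(rnorm(n * p), n, p,
              dimnames = list(NULL, paste0("mz_", seq_len(p))))
  Y <- factor(rep(c("a", "b"), length.out = n))
  X[Y == "a", 1] <- X[Y == "a", 1] + 2  # make mz_1 informative

  # Random fold assignment (assumption; the real function may stratify)
  fold_id <- sample(rep(seq_len(k), length.out = n))

  # One model per fold, trained on the training split only
  fold_varim <- sapply(seq_len(k), function(f) {
    train <- fold_id != f
    fit <- ranger::ranger(x = X[train, , drop = FALSE], y = Y[train],
                          num.trees = 100, importance = "permutation",
                          num.threads = 1, seed = 123 + f)
    fit$variable.importance  # OOB permutation importance, one value per feature
  })

  # Average across folds to get the CV importance of each feature
  cv_varim <- rowMeans(fold_varim)
}
```

Here fold_varim is a features x folds matrix and cv_varim holds one averaged value per feature, mirroring the shapes listed in the return value.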
Null and pi0: if nbf > 0, nbf artificial noise features ("false peaks"), drawn uniformly between min(X) and max(X), are appended to X; the negative importances observed among them help shape the null distribution. If nbf = 0, the null is approximated by mirroring the negative importances of the true features. pi0 is then obtained from an estimator of the proportion of useless features evaluated over high quantiles. If no negative importances occur, pi0 is set to 0, so all features are retained (conservative).
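The two ways of building the null can be illustrated with a short sketch. It covers only the construction of the noise features and of the mirrored null; the pi0 estimator itself is not reproduced here. The "false_" column prefix and the importance values are hypothetical.

```r
set.seed(1)
X <- matrix(rnorm(20 * 5), 20, 5,
            dimnames = list(NULL, paste0("mz_", 1:5)))

# nbf > 0: append noise features drawn uniformly between min(X) and max(X)
nbf <- 3
noise <- matrix(runif(nrow(X) * nbf, min = min(X), max = max(X)),
                nrow = nrow(X),
                dimnames = list(NULL, paste0("false_", seq_len(nbf))))
X_aug <- cbind(X, noise)  # matrix passed to the forest instead of X

# nbf = 0: mirror the negative importances around zero to approximate the null
imp <- c(0.03, -0.002, 0.001, -0.004, 0.02)  # illustrative importances
neg <- imp[imp < 0]
null_imp <- c(neg, -neg)  # symmetric sample from the approximate null
```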
Parallelism: with folds_parallel = "TRUE", or with "auto" when k > 1 and nthreads >= 4, folds run in parallel on a PSOCK cluster (Windows-safe). Each worker sets ranger num.threads = 1 to avoid oversubscription. With "FALSE", folds are evaluated sequentially and ranger uses up to nthreads threads, which can be faster for small k or very large p.
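As a rough sketch of fold-level parallelism on a PSOCK cluster, the pattern below runs one task per fold using the base parallel package. The worker function is a placeholder: in a real run it would fit the fold's ranger model with num.threads = 1. The choice of two workers is an assumption for illustration.

```r
library(parallel)

k <- 3
cl <- makeCluster(2, type = "PSOCK")  # PSOCK works on Windows, unlike forking

res <- tryCatch(
  parLapply(cl, seq_len(k), function(f) {
    # Placeholder for the per-fold work; a ranger call here would use
    # num.threads = 1 so that workers do not oversubscribe the CPU.
    list(fold = f, pid = Sys.getpid())
  }),
  finally = stopCluster(cl)  # always release the workers
)

vapply(res, function(r) r$fold, integer(1))
```

Giving each worker a single thread keeps the total thread count near the number of workers, rather than workers times ranger threads.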
Alexandre Godmer, Yahia Benzerara, Emmanuelle Varon, Nicolas Veziris, Karen Druart, Renaud Mozet, Mariette Matondo, Alexandra Aubry, Quentin Giai Gianetto, MSclassifR: An R package for supervised classification of mass spectra with machine learning methods, Expert Systems with Applications, Volume 294, 2025, 128796, ISSN 0957-4174, doi:10.1016/j.eswa.2025.128796.
ranger::ranger; for a holdout-based (validation-fold) permutation alternative, see a custom implementation using predict() on permuted features. For a full feature-selection wrapper, see SelectionVar with MethodSelection = "cvp".
if (FALSE) {
  set.seed(1)
  n <- 120; p <- 200
  X <- matrix(rnorm(n * p), n, p)
  colnames(X) <- paste0("mz_", seq_len(p))
  Y <- factor(sample(letters[1:3], n, replace = TRUE))
  if (requireNamespace("ranger", quietly = TRUE)) {
    out <- fast_cvpvi(
      X, Y,
      k = 5,
      ntree = 300,
      nbf = 50,
      nthreads = max(1L, parallel::detectCores() - 1L),
      folds_parallel = "auto",
      seed = 42
    )
    head(out$sel_moz)
    # CV importances for top features
    head(sort(out$cv_varim[, 1], decreasing = TRUE))
  }
}