fast_cvpvi: Fast cross-validated permutation variable importance (ranger-based)

Description

Computes cross-validated permutation variable importance (PVI) using the ranger random-forest algorithm. For each CV fold, a ranger model is trained on the training split, and out-of-bag (OOB) permutation importance is computed inside ranger's C++ code. Importances are averaged across folds to obtain a more stable CV importance vector. The function then estimates the null distribution of importances and pi0 (the proportion of uninformative features), optionally with the help of appended artificial “false” features, and selects the top (1 - pi0) proportion of features. Folds can be evaluated in parallel (PSOCK cluster, which also works on Windows) while avoiding CPU oversubscription.

Usage

fast_cvpvi(
  X,
  Y,
  k = 5,
  ntree = 500,
  nbf = 0,
  nthreads = max(1L, parallel::detectCores() - 1L),
  folds_parallel = c("auto", "TRUE", "FALSE"),
  mtry = NULL,
  sample_fraction = 1,
  min_node_size = 1L,
  seed = 123
)

Value

A list with:

nb_to_sel: integer; number of selected features (floor(p * (1 - pi0))).
sel_moz: character vector of selected feature names (columns of X).
imp_sel: named numeric vector of CV importances for selected features.
fold_varim: matrix (features x folds) of per-fold permutation importances.
cv_varim: matrix (features x 1) of averaged importances across folds.
pi0: estimated proportion of null features.
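
As a small worked example of how the returned pieces relate, the sketch below applies the floor(p * (1 - pi0)) rule to toy values. The numbers are invented for illustration, and sorting by decreasing importance is an assumption about how the top features are picked, not a statement about the package internals.

```r
# Hypothetical values, for illustration only
p   <- 200     # number of features (columns of X)
pi0 <- 0.75    # estimated proportion of null features

# Number of features to keep, as documented for nb_to_sel
nb_to_sel <- floor(p * (1 - pi0))

# Toy CV importances, named like the columns of X
set.seed(1)
cv_imp <- setNames(rnorm(p), paste0("mz_", seq_len(p)))

# Assumption: the selected features are those with the largest CV importance
imp_sel <- head(sort(cv_imp, decreasing = TRUE), nb_to_sel)
sel_moz <- names(imp_sel)
```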

Arguments

X

Numeric matrix (n x p); samples in rows, features in columns. Column names should be feature IDs (e.g., m/z). Non-finite values are set to zero internally for modeling.
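
The zero-replacement of non-finite values can be reproduced by hand before calling the function, for instance to inspect how many entries are affected. This is a minimal sketch on a toy matrix, not the package's internal code.

```r
# Toy matrix containing NA, NaN, Inf and -Inf
X <- matrix(c(1, NA, Inf, 4, NaN, -Inf), nrow = 2)

# Count the entries that would be replaced
n_bad <- sum(!is.finite(X))

# Replace every non-finite entry with zero
X[!is.finite(X)] <- 0
```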

Y

Factor or numeric response of length n. A factor triggers classification; numeric triggers regression.

k

Integer; number of cross-validation folds. Default 5.

ntree

Integer; number of trees per fold model. Default 500.

nbf

Integer (>= 0); number of artificial “false” (noise) features to append to X for estimating the null distribution of importances. Default 0 disables this (the null is then approximated using mirrored negative importances).

nthreads

Integer; total threads available. When parallelizing folds, each fold worker gets one ranger thread to avoid oversubscription; when not parallelizing folds, ranger uses up to nthreads threads. Default is max(1, detectCores() - 1).
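
The thread allocation described above, combined with the "auto" rule of folds_parallel (parallelize when k > 1 and nthreads >= 4), might be sketched as follows. plan_threads is a hypothetical helper written for this illustration; in particular, the number of workers (min(k, nthreads)) is an assumption and is not stated in this documentation.

```r
# Hypothetical helper: decide how threads are split between folds and ranger
plan_threads <- function(k, nthreads, folds_parallel = c("auto", "TRUE", "FALSE")) {
  folds_parallel <- match.arg(folds_parallel)
  use_parallel <- switch(folds_parallel,
    auto    = k > 1 && nthreads >= 4,
    "TRUE"  = TRUE,
    "FALSE" = FALSE
  )
  if (use_parallel) {
    # One ranger thread per fold worker to avoid oversubscription.
    # Assumption: at most one worker per fold and per available thread.
    list(parallel = TRUE, workers = min(k, nthreads), ranger_threads = 1L)
  } else {
    # Sequential folds: ranger may use all available threads
    list(parallel = FALSE, workers = 1L, ranger_threads = nthreads)
  }
}

plan_threads(k = 5, nthreads = 8)           # parallel folds, 1 ranger thread each
plan_threads(k = 5, nthreads = 2)           # sequential, ranger uses 2 threads
plan_threads(k = 5, nthreads = 8, "FALSE")  # sequential, ranger uses 8 threads
```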

folds_parallel

Character; "auto", "TRUE", or "FALSE".

"auto": parallelize across folds when k > 1 and nthreads >= 4 (default).
"TRUE": force fold-level parallelism (PSOCK cluster).
"FALSE": evaluate folds sequentially (ranger can then use multiple threads).

mtry

Optional integer; variables tried at each split. If NULL, defaults to floor(sqrt(p)) for classification or max(floor(p/3), 1) for regression.
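
The default rule can be checked with a few lines of R. default_mtry is a hypothetical helper that only restates the two formulas above.

```r
# Hypothetical helper restating the documented defaults for mtry
default_mtry <- function(p, classification) {
  if (classification) floor(sqrt(p)) else max(floor(p / 3), 1)
}

default_mtry(200, classification = TRUE)   # 14
default_mtry(200, classification = FALSE)  # 66
default_mtry(2,   classification = FALSE)  # 1
```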

sample_fraction

Numeric in (0, 1]; subsampling fraction per tree (speed/regularization knob). Default 1.

min_node_size

Integer; ranger minimum node size. Larger values speed up training and yield smaller trees. Default 1.

seed

Integer; RNG seed. Default 123.

Details

One ranger model is trained per fold (training split). Permutation importance (importance = "permutation") is computed in C++ using OOB. The per-fold importances are averaged to obtain CV importances.
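
The per-fold procedure can be approximated by hand as in the sketch below. It is not the package's implementation: the random fold assignment is an assumption (the function may build folds differently), the data are simulated, and the code is skipped when ranger is not installed.

```r
set.seed(1)
n <- 60; p <- 10; k <- 3
X <- matrix(rnorm(n * p), n, p,
            dimnames = list(NULL, paste0("mz_", seq_len(p))))
Y <- factor(sample(c("a", "b"), n, replace = TRUE))

# Assumption: plain random fold assignment
fold_id <- sample(rep(seq_len(k), length.out = n))

fold_varim <- NULL
cv_varim <- NULL
if (requireNamespace("ranger", quietly = TRUE)) {
  # One model per fold, trained on the training split only;
  # ranger computes OOB permutation importance internally
  fold_varim <- sapply(seq_len(k), function(i) {
    train <- fold_id != i
    fit <- ranger::ranger(
      x = X[train, , drop = FALSE], y = Y[train],
      num.trees = 100, importance = "permutation",
      num.threads = 1, seed = 123
    )
    fit$variable.importance
  })
  # Average the per-fold importances (features x folds -> features x 1)
  cv_varim <- matrix(rowMeans(fold_varim), ncol = 1,
                     dimnames = list(rownames(fold_varim), NULL))
}
```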
Null and pi0: if nbf > 0, nbf artificial noise features (“false peaks”, drawn uniformly between min(X) and max(X)) are appended to X; the negative importances they produce help shape the null distribution. If nbf = 0, the null is approximated by mirroring the negative importances of the true features. pi0 is then obtained from an estimator of the proportion of useless features, evaluated over high quantiles. If no negative importances occur, pi0 is set to 0 (conservative).
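
The two ways of building a null can be illustrated on toy data. This sketch covers only the noise-feature construction and one plausible reading of "mirroring" (reflecting negative importances around zero); the column names false_1, false_2, ... are hypothetical, and the pi0 estimator itself is not reproduced here.

```r
set.seed(1)
n <- 20; p <- 5; nbf <- 3
X <- matrix(rnorm(n * p), n, p,
            dimnames = list(NULL, paste0("mz_", seq_len(p))))

# nbf > 0: append noise columns drawn uniformly over the observed range of X
noise <- matrix(runif(n * nbf, min = min(X), max = max(X)), n, nbf,
                dimnames = list(NULL, paste0("false_", seq_len(nbf))))
X_aug <- cbind(X, noise)

# nbf = 0: approximate the null by mirroring negative importances around zero
# (toy importances; the exact mirroring used by the package is an assumption)
imp <- c(0.8, -0.1, 0.05, -0.02, 0.3)
neg <- imp[imp < 0]
null_approx <- c(neg, -neg)
```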
Parallelism: with folds_parallel = "auto"/"TRUE", folds run in parallel using a PSOCK cluster (Windows-safe). Each worker sets ranger num.threads = 1 to avoid oversubscription. With "FALSE", folds are sequential and ranger uses up to nthreads threads, which can be faster for small k or very large p.
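
For readers unfamiliar with PSOCK clusters, the pattern looks roughly like this with the base parallel package. The worker function is a placeholder; in the situation described above it would fit one ranger model per fold with num.threads = 1. The cluster size is kept small here purely for illustration.

```r
library(parallel)

k <- 3
n_workers <- min(k, 2L)   # small cluster, for illustration only

# PSOCK clusters start fresh R processes and work on all platforms
cl <- makeCluster(n_workers, type = "PSOCK")

res <- tryCatch(
  parLapply(cl, seq_len(k), function(i) {
    # Placeholder for the per-fold work (one single-threaded model per fold)
    i^2
  }),
  finally = stopCluster(cl)   # always release the workers
)

fold_results <- unlist(res)
```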

References

Alexandre Godmer, Yahia Benzerara, Emmanuelle Varon, Nicolas Veziris, Karen Druart, Renaud Mozet, Mariette Matondo, Alexandra Aubry, Quentin Giai Gianetto, MSclassifR: An R package for supervised classification of mass spectra with machine learning methods, Expert Systems with Applications, Volume 294, 2025, 128796, ISSN 0957-4174, doi:10.1016/j.eswa.2025.128796.

Examples

if (FALSE) {
set.seed(1)
n <- 120; p <- 200
X <- matrix(rnorm(n * p), n, p)
colnames(X) <- paste0("mz_", seq_len(p))
Y <- factor(sample(letters[1:3], n, replace = TRUE))

if (requireNamespace("ranger", quietly = TRUE)) {
  out <- fast_cvpvi(
    X, Y,
    k = 5,
    ntree = 300,
    nbf = 50,
    nthreads = max(1L, parallel::detectCores() - 1L),
    folds_parallel = "auto",
    seed = 42
  )
  head(out$sel_moz)
  # CV importances for top features
  head(sort(out$cv_varim[,1], decreasing = TRUE))
}
}
