fast_mda: Fast MDA-style variable selection using ranger permutation importance

Description

Computes feature importances from a single random forest (classification, including multiclass, or regression) fitted with the ranger engine, using its C++ permutation importance. The null distribution of importances is estimated either by appending artificial “false” (noise) features (when nbf > 0) or by mirroring negative importances (when nbf = 0). An estimator of the proportion of useless features yields pi0, and the top (1 - pi0) fraction of true features, ranked by importance, is selected. This implementation is a fast, OS-agnostic alternative to MDA schemes based on repeated random forest fits.

Usage

fast_mda(
  X,
  Y,
  ntree = 1000,
  nbf = 0,
  nthreads = max(1L, parallel::detectCores() - 1L),
  mtry = NULL,
  sample_fraction = 1,
  min_node_size = 1L,
  seed = 123
)

Value

A list with:

nb_to_sel: integer; number of selected features (floor(p_true * (1 - pi0))).
sel_moz: character vector of selected feature names (columns of X).
imp_sel: named numeric vector of importances for selected features (true features only).
all_imp: named numeric vector of importances for all true features.
pi0: estimated proportion of null features.
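
The relationship between these return values can be illustrated with a small base R sketch. The importances and the pi0 value below are made-up numbers, and sorting imp_sel in decreasing order is an assumption about the output, not something this page confirms.

```r
# Hypothetical importances for five true features (illustrative values only)
all_imp <- c(mz_1 = 0.040, mz_2 = -0.002, mz_3 = 0.001, mz_4 = 0.020, mz_5 = 0.015)
pi0 <- 0.55                       # assumed estimate of the null proportion

p_true <- length(all_imp)
# Number of features to keep, as documented: floor(p_true * (1 - pi0))
nb_to_sel <- floor(p_true * (1 - pi0))

# Keep the nb_to_sel features with the largest importances
ranked  <- sort(all_imp, decreasing = TRUE)
imp_sel <- head(ranked, nb_to_sel)
sel_moz <- names(imp_sel)

nb_to_sel
sel_moz
```

With these numbers, floor(5 * 0.45) gives 2, so the two features with the largest importances are kept.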

Arguments

X: Numeric matrix (n x p); samples in rows, features in columns. Column names should be feature IDs (e.g., m/z). Non-finite values are set to zero internally for modeling.
Y: Factor (classification) or numeric (regression) response of length n. The default mtry is chosen based on the task: floor(sqrt(p)) for classification; max(floor(p/3), 1) for regression.
ntree: Integer; number of trees. Default 1000.
nbf: Integer (>= 0); number of artificial “false” (noise) features to append to X to estimate the null distribution. Default 0 disables this and uses mirrored negative importances as the null.
nthreads: Integer; total number of threads for ranger. Default is max(1, parallel::detectCores() - 1).
mtry: Optional integer; variables tried at each split. If NULL (default), computed as floor(sqrt(p)) for classification or max(floor(p/3), 1) for regression.
sample_fraction: Numeric in (0, 1]; fraction of samples drawn for each tree (a speed/regularization knob). Default 1.
min_node_size: Integer; ranger minimum node size. Larger values speed up training and yield simpler trees. Default 1.
seed: Integer; RNG seed for reproducibility. Default 123.
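
As a rough illustration of two behaviours described in the arguments above, the sketch below replaces non-finite entries of X with zero and computes the documented default for mtry. The helper name default_mtry is hypothetical and exists only for this example; it is not part of the package.

```r
# Toy matrix with a missing and an infinite value
X <- matrix(c(1, NA, 3, Inf, 5, 6), nrow = 2,
            dimnames = list(NULL, c("mz_1", "mz_2", "mz_3")))

# Non-finite values (NA, NaN, Inf, -Inf) are set to zero before modeling
X[!is.finite(X)] <- 0

# Hypothetical helper reproducing the documented default for mtry
default_mtry <- function(Y, p) {
  if (is.factor(Y)) {
    floor(sqrt(p))          # classification
  } else {
    max(floor(p / 3), 1)    # regression
  }
}

default_mtry(factor(c("a", "b")), p = 300)  # classification default
default_mtry(c(1.2, 3.4), p = 300)          # regression default
```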

Details

A single ranger model is fit with importance = "permutation". ranger computes the permutation importance in C++ on the out-of-bag (OOB) samples, which is fast and stable.
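
The fit described here might look roughly like the following direct call to ranger. This is a sketch on simulated data, not the function's source: the mapping from fast_mda arguments (ntree, nthreads, sample_fraction, min_node_size) to ranger arguments is assumed, and the data are pure noise.

```r
if (requireNamespace("ranger", quietly = TRUE)) {
  set.seed(1)
  n <- 60; p <- 20
  X <- matrix(rnorm(n * p), n, p,
              dimnames = list(NULL, paste0("mz_", seq_len(p))))
  Y <- factor(sample(letters[1:3], n, replace = TRUE))

  fit <- ranger::ranger(
    x = X, y = Y,
    num.trees       = 200,             # assumed counterpart of ntree
    mtry            = floor(sqrt(p)),  # documented classification default
    importance      = "permutation",   # OOB permutation importance in C++
    probability     = TRUE,            # used when Y is a factor
    sample.fraction = 1,               # assumed counterpart of sample_fraction
    min.node.size   = 1,               # assumed counterpart of min_node_size
    num.threads     = 1,               # assumed counterpart of nthreads
    seed            = 42
  )

  # Named numeric vector, one importance per column of X
  imp <- fit$variable.importance
  imp[!is.finite(imp)] <- 0            # robustness step described in Details
  head(sort(imp, decreasing = TRUE))
}
```

Because the simulated features carry no signal, the importances scatter around zero and some are negative, which is the situation the null estimation relies on.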
Null and pi0:
- If nbf > 0, nbf false features (uniform between min(X) and max(X)) are appended; negative importances among them help shape the null. An estimator of the proportion of useless features over high quantiles (e.g., 0.75–1) yields pi0 and is adjusted for the number of false features.
- If nbf = 0, the null is approximated by mirroring the negative importances of the true features around zero. If no negative importances occur, pi0 is set to 0, so all true features are retained (conservative).
Task: a factor Y triggers classification with probability = TRUE; a numeric Y triggers regression.
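
The two ways of building the null can be sketched in base R as follows. This only illustrates how the null sample is constructed; the pi0 estimator itself is not reproduced, and the false_ column prefix and the importance values are assumptions made for the example.

```r
set.seed(1)
n <- 20; p <- 5; nbf <- 3
X <- matrix(rnorm(n * p), n, p,
            dimnames = list(NULL, paste0("mz_", seq_len(p))))

# Case nbf > 0: append nbf noise features drawn uniformly on [min(X), max(X)]
lo <- min(X); hi <- max(X)
X_false <- matrix(runif(n * nbf, min = lo, max = hi), n, nbf,
                  dimnames = list(NULL, paste0("false_", seq_len(nbf))))
X_aug <- cbind(X, X_false)   # the forest would be fit on this augmented matrix

# Case nbf = 0: mirror negative importances of true features around zero
imp <- c(mz_1 = 0.040, mz_2 = -0.002, mz_3 = 0.001, mz_4 = -0.001, mz_5 = 0.015)
neg <- imp[imp < 0]
null_imp <- c(neg, -neg)     # symmetric sample approximating the null

dim(X_aug)
null_imp
```

The mirrored sample is symmetric around zero by construction, reflecting the idea that a useless feature is as likely to receive a small negative importance as a small positive one.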
Robustness: any non-finite importances are set to zero. Selection is performed only among the original (true) features; false features are discarded.
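
A minimal sketch of this clean-up, assuming the false features occupy the last nbf positions of the importance vector (an assumption about column order, with made-up values):

```r
p_true <- 4; nbf <- 2

# Hypothetical importances from a forest fit on true + false features
imp <- c(mz_1 = 0.030, mz_2 = NaN, mz_3 = -0.001, mz_4 = 0.010,
         false_1 = 0.002, false_2 = -0.003)

# Non-finite importances are set to zero
imp[!is.finite(imp)] <- 0

# Selection only considers the true features; false ones inform the null
all_imp   <- imp[seq_len(p_true)]
false_imp <- imp[p_true + seq_len(nbf)]

all_imp
false_imp
```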
Performance: this is typically 5–20x faster than randomForest-based MDA and fully multithreaded via nthreads.

References

Alexandre Godmer, Yahia Benzerara, Emmanuelle Varon, Nicolas Veziris, Karen Druart, Renaud Mozet, Mariette Matondo, Alexandra Aubry, Quentin Giai Gianetto, MSclassifR: An R package for supervised classification of mass spectra with machine learning methods, Expert Systems with Applications, Volume 294, 2025, 128796, ISSN 0957-4174, doi:10.1016/j.eswa.2025.128796.

Examples

if (FALSE) {
set.seed(1)
n <- 100; p <- 300
X <- matrix(rnorm(n * p), n, p)
colnames(X) <- paste0("mz_", seq_len(p))
Y <- factor(sample(letters[1:3], n, replace = TRUE))

if (requireNamespace("ranger", quietly = TRUE)) {
  out <- fast_mda(
    X, Y,
    ntree = 500,
    nbf = 50,
    nthreads = max(1L, parallel::detectCores() - 1L),
    seed = 42
  )
  out$nb_to_sel
  head(out$sel_moz)
  # Top importances
  head(sort(out$all_imp, decreasing = TRUE))
}
}
