Description

Computes feature importances with a single multiclass (or regression) random
forest, using the ranger engine and its C++ permutation importance. The null
distribution of importances is estimated either by appending artificial “false”
(noise) features (when nbf > 0) or by mirroring negative importances (when
nbf = 0). An estimator of the proportion of useless features yields pi0, and
the top (1 - pi0) fraction of true features is selected. This implementation
is a fast, OS-agnostic alternative to repeated, randomForest-based MDA schemes.
Usage

fast_mda(
  X,
  Y,
  ntree = 1000,
  nbf = 0,
  nthreads = max(1L, parallel::detectCores() - 1L),
  mtry = NULL,
  sample_fraction = 1,
  min_node_size = 1L,
  seed = 123
)

Value

A list with:
nb_to_sel: integer; number of selected features (floor(p_true * (1 - pi0))).
sel_moz: character vector of selected feature names (columns of X).
imp_sel: named numeric vector of importances for selected features (true features only).
all_imp: named numeric vector of importances for all true features.
pi0: estimated proportion of null features.
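For example, with 300 true features and an estimated pi0 of 0.9,
nb_to_sel = floor(300 * (1 - 0.9)) = 30 features are selected.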
Arguments

X
Numeric matrix (n x p); samples in rows, features in columns. Column names
should be feature IDs (e.g., m/z). Non-finite values are set to zero
internally for modeling.

Y
Factor (classification) or numeric (regression) response of length n.
ntree
Integer; number of trees. Default 1000.

nbf
Integer (>= 0); number of artificial “false” (noise) features to append to X
to estimate the null distribution. The default 0 disables this and uses
mirrored negative importances as the null (a sketch of the nbf > 0
construction follows this list).

nthreads
Integer; total number of threads for ranger. Default is
max(1, parallel::detectCores() - 1).

mtry
Optional integer; number of variables tried at each split. If NULL (default),
computed as floor(sqrt(p)) for classification or max(floor(p/3), 1) for
regression.

sample_fraction
Numeric in (0, 1]; subsampling fraction per tree (a speed/regularization
knob). Default 1.

min_node_size
Integer; ranger minimum node size. Larger values speed up training and yield
simpler trees. Default 1.

seed
Integer; RNG seed for reproducibility. Default 123.
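As a minimal sketch of how the nbf > 0 null could be constructed (illustrative
names, not the package's internals), the false features are uniform draws over
the observed range of X:

# Illustrative sketch only: append nbf uniform-noise columns spanning
# the observed range of X (min(X) to max(X)).
add_false_features <- function(X, nbf) {
  rng <- range(X, finite = TRUE)
  noise <- matrix(runif(nrow(X) * nbf, rng[1], rng[2]), nrow = nrow(X))
  colnames(noise) <- paste0("false_", seq_len(nbf))
  cbind(X, noise)
}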
Details

A single ranger model is fit with importance = "permutation", which computes
permutation importance in C++ on out-of-bag samples (fast and stable).
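Under the arguments above, the fit reduces to roughly the following single
call (a sketch, not the package's verbatim code):

# Sketch of the single-forest fit; argument names follow ranger's API.
fit <- ranger::ranger(
  x = X, y = Y,
  num.trees       = ntree,
  mtry            = mtry,
  importance      = "permutation",   # C++ OOB permutation importance
  probability     = is.factor(Y),    # classification vs. regression
  sample.fraction = sample_fraction,
  min.node.size   = min_node_size,
  num.threads     = nthreads,
  seed            = seed
)
imp <- fit$variable.importance       # named numeric vector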
Null distribution and pi0:

If nbf > 0, nbf false features (drawn uniformly between min(X) and max(X))
are appended; negative importances among them help shape the null. An
estimator of the proportion of useless features over high quantiles (e.g.,
0.75–1) yields pi0 and is adjusted for the number of false features.

If nbf = 0, the null is approximated by mirroring the negative importances of
true features around zero. If no negative importances occur, pi0 is set to 0
(conservative: all features are kept).
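A minimal sketch of the nbf = 0 path, assuming a Storey-type estimator over
high quantiles (function and variable names are illustrative, not the
package's internals):

# Illustrative only: empirical p-values against a mirrored null, then a
# Storey-type average of pi0 estimates over high quantiles.
mirrored_pi0 <- function(imp, lambdas = seq(0.75, 0.95, by = 0.05)) {
  neg <- imp[imp < 0]
  if (length(neg) == 0) return(0)    # conservative, as documented
  null <- c(neg, -neg)               # mirror negatives around zero
  pval <- vapply(imp, function(w) mean(null >= w), numeric(1))
  est  <- vapply(lambdas, function(l) mean(pval > l) / (1 - l), numeric(1))
  min(max(mean(est), 0), 1)
}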
Task detection: a factor Y triggers a probability forest (probability = TRUE);
a numeric Y triggers regression.

Robustness: any non-finite importances are set to zero. Selection is performed
only among the original (true) features; false features are discarded (see
the sketch below).
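A hedged sketch of that final selection step (illustrative names; imp and pi0
as computed above):

# Illustrative only: keep the top floor(p_true * (1 - pi0)) true features.
true_imp <- imp[colnames(X)]             # drop any false features
true_imp[!is.finite(true_imp)] <- 0      # robustness, as documented
nb_to_sel <- floor(length(true_imp) * (1 - pi0))
sel_moz <- names(sort(true_imp, decreasing = TRUE))[seq_len(nb_to_sel)]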
Performance: this is typically 5–20x faster than randomForest-based MDA and
fully multithreaded via nthreads.
References

Alexandre Godmer, Yahia Benzerara, Emmanuelle Varon, Nicolas Veziris, Karen
Druart, Renaud Mozet, Mariette Matondo, Alexandra Aubry, Quentin Giai
Gianetto, MSclassifR: An R package for supervised classification of mass
spectra with machine learning methods, Expert Systems with Applications,
Volume 294, 2025, 128796, ISSN 0957-4174, doi:10.1016/j.eswa.2025.128796.
See Also

ranger::ranger. For cross-validated permutation importance, see fast_cvpvi.
For a wrapper that plugs MDA/CVP into broader selection workflows, see
SelectionVar (MethodSelection = "mda" or "cvp").
Examples

if (FALSE) {
set.seed(1)
n <- 100; p <- 300
X <- matrix(rnorm(n * p), n, p)
colnames(X) <- paste0("mz_", seq_len(p))
Y <- factor(sample(letters[1:3], n, replace = TRUE))

if (requireNamespace("ranger", quietly = TRUE)) {
  out <- fast_mda(
    X, Y,
    ntree = 500,
    nbf = 50,
    nthreads = max(1L, parallel::detectCores() - 1L),
    seed = 42
  )
  out$nb_to_sel
  head(out$sel_moz)
  # Top importances
  head(sort(out$all_imp, decreasing = TRUE))
}
}
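As a further usage note (hedged; this follows directly from the documented
task detection), a numeric response switches fast_mda to regression mode:

if (FALSE) {
# Regression: numeric Y triggers regression mode; nbf = 0 uses the
# mirrored-negative null (illustrative call, same X and n as above).
Yreg <- rnorm(n)
out_reg <- fast_mda(X, Yreg, ntree = 500, nbf = 0, seed = 42)
out_reg$pi0
}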