MSclassifR (version 0.5.0)

LogReg_rf_fast: Fast random-forest classifier with stratified CV and in-fold sampling (ranger, caret-free)

Description

Trains a multiclass random-forest classifier using the ranger algorithm with a compact hyperparameter search and repeated stratified cross-validation. Feature columns are first subset by the provided m/z list (moz). Class balancing (no/up/down/SMOTE) is applied only within training folds to avoid leakage, and again on the full data before fitting the final model. The evaluation across folds can be parallelized in a Windows-safe manner (PSOCK), while avoiding CPU oversubscription by giving each fold worker one ranger thread. Returns the final ranger model, per-fold metrics, a confusion matrix on the full data, and a ggplot boxplot of resampling metrics.

Usage

LogReg_rf_fast(
  X,
  moz,
  Y,
  number = 5,
  repeats = 1,
  Metric = c("Kappa", "Accuracy", "F1", "AdjRankIndex", "MatthewsCorrelation"),
  Sampling = c("no", "up", "down", "smote"),
  ncores = max(1L, parallel::detectCores() - 1L),
  num.trees = 500L,
  tuneLength = 5L,
  folds_parallel = c("auto", "TRUE", "FALSE"),
  seed = 123L,
  mtry = NULL,
  splitrule = "gini",
  sample.fraction = 1,
  min.node.size.grid = c(1L, 5L, 10L),
  min_node_frac = 1/3
)

Value

A list with:

  • train_mod: list with fields

    • model: the fitted ranger::ranger object (final model on full data)

    • method: "ranger"

    • best_params: data.frame with the best hyperparameters found by CV

    • cv_score: best mean CV score (according to Metric)

    • metric: the metric name used

  • boxplot: ggplot object showing the distribution of per-fold metric values

  • Confusion.Matrix: caret::confusionMatrix for predictions of the final model on the full data

  • stats_global: data.frame with columns Metric, Mean, Sd summarizing per-fold metrics

  • resample: data.frame of per-fold metrics (columns: variable, value, fold)

Arguments

X

Numeric matrix or data frame; rows are samples and columns are features (m/z). Column names must be numeric (coercible with as.numeric), representing the feature m/z. Non-finite values are set to 0 internally.

moz

Numeric vector of m/z to keep. Only columns of X whose numeric names match values in moz are used. An error is raised if none match.
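A minimal sketch of the matching rule described above (illustrative only, not the package's internal code):

```r
# Subset feature columns of X whose numeric column names appear in moz.
X <- matrix(rnorm(12), nrow = 3,
            dimnames = list(NULL, c("1000.1", "1000.2", "1000.3", "1000.4")))
moz  <- c(1000.2, 1000.4)
keep <- as.numeric(colnames(X)) %in% moz
if (!any(keep)) stop("No columns of X match the m/z values in 'moz'.")
X_sub <- X[, keep, drop = FALSE]
colnames(X_sub)  # "1000.2" "1000.4"
```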

Y

Factor (or coercible to factor) of class labels; length must equal nrow(X).

number

Integer; number of CV folds (k). Default 5.

repeats

Integer; number of CV repeats. Default 1.

Metric

Character; CV selection metric. One of "Kappa", "Accuracy", "F1", "AdjRankIndex", "MatthewsCorrelation". The best hyperparameters maximize this metric averaged over folds.

Sampling

Character; class-balancing strategy applied within each training fold (and before the final fit on the full data). One of "no", "up", "down", "smote".

  • "up": up-samples minority classes to the majority count (base R).

  • "down": down-samples majority classes to the minority count (base R).

  • "smote": uses the package’s internal smote_classif(Y ~ ., data.frame(Y, X), C.perc = "balance").
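The base-R up/down strategies can be sketched as index resampling within a training fold; `balance_idx` below is a hypothetical helper, not the package's internal function:

```r
# Return row indices that balance the classes of Y:
# "up"   -> every class resampled up to the majority count (with replacement),
# "down" -> every class sampled down to the minority count (without replacement).
balance_idx <- function(Y, method = c("up", "down")) {
  method <- match.arg(method)
  idx_by_class <- split(seq_along(Y), Y)
  target <- if (method == "up") max(lengths(idx_by_class)) else min(lengths(idx_by_class))
  unlist(lapply(idx_by_class, function(i) {
    i[sample.int(length(i), target, replace = length(i) < target)]
  }), use.names = FALSE)
}

set.seed(1)
Y <- factor(rep(c("a", "b"), times = c(10, 3)))
table(Y[balance_idx(Y, "up")])    # both classes now have 10 samples
table(Y[balance_idx(Y, "down")])  # both classes now have 3 samples
```

In the actual function this resampling is applied only to the training indices of each fold, so held-out samples never influence the balancing (no leakage).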

ncores

Integer; number of CPU cores to use. Controls both fold-level parallelism and ranger threads when not parallelizing folds. Default is all but one core.

num.trees

Integer; number of trees per ranger model. Default 500.

tuneLength

Integer; upper bound on the size of the hyperparameter grid. If the full grid (mtry × min.node.size) is larger, a random subset of size tuneLength is used. Default 5.
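The capping behavior described above can be sketched as follows (assumed logic, consistent with the description of a random subset):

```r
# Full grid over mtry x min.node.size, randomly downsampled to tuneLength rows.
set.seed(123)
grid <- expand.grid(mtry = c(4, 5, 6), min.node.size = c(1, 5, 10))  # 9 rows
tuneLength <- 5L
if (nrow(grid) > tuneLength)
  grid <- grid[sample.int(nrow(grid), tuneLength), , drop = FALSE]
nrow(grid)  # 5
```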

folds_parallel

Character; "auto", "TRUE", or "FALSE".

  • "auto": parallelize across folds when ncores >= 2 and total folds (number × repeats) >= 2.

  • "TRUE": force fold-level parallelism (PSOCK on Windows).

  • "FALSE": evaluate folds sequentially; ranger then uses up to ncores threads per fit.
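The "auto" rule and the thread split can be sketched in a few lines (hypothetical variable names; the function's internals may differ):

```r
ncores <- 4L; number <- 5L; repeats <- 1L

# "auto": fold-level parallelism when there is more than one core and fold.
use_fold_parallel <- ncores >= 2L && (number * repeats) >= 2L

# Avoid oversubscription: one ranger thread per fold worker when folds run
# in parallel, otherwise let each sequential ranger fit use all cores.
ranger_threads <- if (use_fold_parallel) 1L else ncores
```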

seed

Integer; RNG seed for reproducibility. Default 123.

mtry

Optional integer; if provided, fixes the number of variables tried at each split. If NULL (default), a small grid around floor(sqrt(p)) is used, where p = number of features.
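A sketch of one plausible construction of that default grid (the exact grid built by the package may differ):

```r
# Candidate mtry values clustered around floor(sqrt(p)), clamped to [1, p].
p  <- 30L
m0 <- max(1L, as.integer(floor(sqrt(p))))          # 5 for p = 30
mtry_grid <- unique(pmax(1L, pmin(p, c(m0 - 1L, m0, m0 + 1L))))
mtry_grid  # 4 5 6
```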

splitrule

Character; ranger split rule (e.g., "gini", "extratrees"). Default "gini".

sample.fraction

Numeric in (0, 1]; subsampling fraction per tree in ranger. Default 1.

min.node.size.grid

Integer vector; candidate values for ranger’s min.node.size used to build the tuning grid. Default c(1, 5, 10).

min_node_frac

Numeric in (0, 1]. Safety cap for ranger’s min.node.size per fold/final fit: the value used is min(requested_min.node.size, floor(min_node_frac * n_train)), with a lower bound of 1. This prevents root-only trees (near-uniform class probabilities) on small training folds (e.g., with SMOTE). Applied inside CV and for the final model. Default: 1/3 (set to 1 to disable capping).
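The cap is a one-line rule; `cap_min_node` below is an illustrative helper implementing the formula stated above, not the package's internal function:

```r
# Effective min.node.size: the requested value, capped at a fraction of the
# training-set size, never below 1.
cap_min_node <- function(requested, n_train, min_node_frac = 1/3) {
  max(1L, min(requested, floor(min_node_frac * n_train)))
}

cap_min_node(10L, n_train = 12)                     # capped to floor(12/3) = 4
cap_min_node(10L, n_train = 2)                      # floor(2/3) = 0, lower bound -> 1
cap_min_node(10L, n_train = 100, min_node_frac = 1) # cap disabled -> 10
```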

Details

  • Feature subsetting: X is subset to columns whose numeric names match moz. This avoids expensive joins/transposes and guarantees consistent feature order.

  • Cross-validation: folds are stratified by Y and repeated repeats times. Sampling is applied only to training indices in each fold (to prevent leakage) and again before the final fit.

  • Hyperparameter search: a compact grid over mtry (around sqrt(p)) and min.node.size (from min.node.size.grid), optionally downsampled to tuneLength. The best combination maximizes the chosen metric averaged over folds.

  • Parallel strategy: by default ("auto"), the code parallelizes across folds with a PSOCK cluster (Windows-safe) and sets ranger’s num.threads = 1 inside each worker to avoid oversubscription. If you set folds_parallel = "FALSE", folds run sequentially and each ranger fit uses up to ncores threads for strong single-fit parallelism.

  • Metrics:

    • Accuracy and Cohen’s Kappa computed from the confusion matrix.

    • F1 is macro-averaged across classes.

    • AdjRankIndex uses mclust::adjustedRandIndex.

    • MatthewsCorrelation is the multiclass MCC.
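Two of these metrics can be computed directly from a fold's confusion matrix; the base-R sketch below illustrates accuracy and macro-averaged F1 (the package may compute them differently internally):

```r
# Confusion matrix: predictions in rows, truth in columns.
cm <- table(pred  = factor(c("a", "a", "b", "b", "c"), levels = c("a", "b", "c")),
            truth = factor(c("a", "b", "b", "c", "c"), levels = c("a", "b", "c")))

# Accuracy: fraction of correct predictions.
acc <- sum(diag(cm)) / sum(cm)

# Macro F1: per-class F1 from TP/FP/FN, then the unweighted mean over classes.
f1_per_class <- sapply(seq_len(nrow(cm)), function(k) {
  tp <- cm[k, k]
  fp <- sum(cm[k, ]) - tp   # predicted k, truth other
  fn <- sum(cm[, k]) - tp   # truth k, predicted other
  if (tp == 0) 0 else 2 * tp / (2 * tp + fp + fn)
})
macro_f1 <- mean(f1_per_class)

c(Accuracy = acc, F1 = macro_f1)  # 0.60, ~0.611
```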

See Also

ranger::ranger, caret::confusionMatrix

Examples

if (FALSE) {
set.seed(1)
X <- matrix(runif(3000), nrow = 100, ncol = 30)
colnames(X) <- as.character(round(seq(1000, 1290, length.out = 30), 4))
moz <- as.numeric(colnames(X))[seq(1, 30, by = 2)]  # keep half the m/z
Y <- factor(sample(letters[1:3], 100, replace = TRUE))

fit <- LogReg_rf_fast(
  X, moz, Y,
  number = 3, repeats = 1,
  Metric = "Kappa",
  Sampling = "no",
  ncores = 4,
  num.trees = 300,
  tuneLength = 4,
  seed = 42
)
fit$train_mod$best_params
fit$Confusion.Matrix
}