Description

Trains a multiclass random-forest classifier using the ranger algorithm with a compact hyperparameter search and repeated stratified cross-validation. Feature columns are first subset by the provided m/z list (moz). Class balancing (no/up/down/SMOTE) is applied only within training folds to avoid leakage, and again on the full data before fitting the final model. Fold evaluation can be parallelized in a Windows-safe manner (PSOCK cluster) while avoiding CPU oversubscription by giving each fold worker a single ranger thread. Returns the final ranger model, per-fold metrics, a confusion matrix on the full data, and a ggplot boxplot of resampling metrics.
Usage

LogReg_rf_fast(
  X,
  moz,
  Y,
  number = 5,
  repeats = 1,
  Metric = c("Kappa", "Accuracy", "F1", "AdjRankIndex", "MatthewsCorrelation"),
  Sampling = c("no", "up", "down", "smote"),
  ncores = max(1L, parallel::detectCores() - 1L),
  num.trees = 500L,
  tuneLength = 5L,
  folds_parallel = c("auto", "TRUE", "FALSE"),
  seed = 123L,
  mtry = NULL,
  splitrule = "gini",
  sample.fraction = 1,
  min.node.size.grid = c(1L, 5L, 10L),
  min_node_frac = 1/3
)

Value

A list with:
train_mod: list with fields
model: the fitted ranger::ranger object (final model on full data)
method: "ranger"
best_params: data.frame with the best hyperparameters found by CV
cv_score: best mean CV score (according to Metric)
metric: the metric name used
boxplot: ggplot object showing the distribution of per-fold metric values
Confusion.Matrix: caret::confusionMatrix for predictions of the final model on the full data
stats_global: data.frame with columns Metric, Mean, Sd summarizing per-fold metrics
resample: data.frame of per-fold metrics (columns: variable, value, fold)
Arguments

X: Numeric matrix or data frame; rows are samples and columns are features (m/z). Column names must be numeric (coercible with as.numeric), representing the feature m/z. Non-finite values are set to 0 internally.
moz: Numeric vector of m/z values to keep. Only columns of X whose numeric names match values in moz are used; an error is raised if none match.
Y: Factor (or coercible to factor) of class labels; length must equal nrow(X).
number: Integer; number of CV folds (k). Default 5.
repeats: Integer; number of CV repeats. Default 1.
Metric: Character; CV selection metric. One of "Kappa", "Accuracy", "F1", "AdjRankIndex", "MatthewsCorrelation". The best hyperparameters maximize this metric averaged over folds.
Sampling: Character; class-balancing strategy applied within each training fold (and before the final fit on the full data). One of "no", "up", "down", "smote".
"up": up-samples minority classes to the majority count (base R).
"down": down-samples majority classes to the minority count (base R).
"smote": uses the package's internal smote_classif(Y ~ ., data.frame(Y, X), C.perc = "balance").
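The "up" and "down" strategies can be sketched in base R as follows. This is an illustrative helper (balance_idx is a hypothetical name, not the package's internal function): it resamples training-fold row indices so that every class reaches the majority ("up") or minority ("down") count.

```r
# Illustrative base-R balancing of training-fold indices (not the package's code).
balance_idx <- function(idx, Y, how = c("up", "down")) {
  how <- match.arg(how)
  cls <- split(idx, droplevels(Y[idx]))  # row indices per class
  target <- if (how == "up") max(lengths(cls)) else min(lengths(cls))
  unlist(lapply(cls, function(i) {
    if (how == "up" && length(i) < target)
      c(i, sample(i, target - length(i), replace = TRUE))  # up-sample with replacement
    else if (how == "down" && length(i) > target)
      sample(i, target)                                    # down-sample without replacement
    else i
  }), use.names = FALSE)
}

set.seed(1)
Y <- factor(c(rep("a", 10), rep("b", 3)))
idx <- seq_along(Y)
table(Y[balance_idx(idx, Y, "up")])    # both classes at 10
table(Y[balance_idx(idx, Y, "down")])  # both classes at 3
```

Because balancing is applied to indices rather than rows, the same helper works on any fold's training subset without copying the feature matrix.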
ncores: Integer; number of CPU cores to use. Controls both fold-level parallelism and ranger threads when not parallelizing folds. Default is all but one core.
num.trees: Integer; number of trees per ranger model. Default 500.
tuneLength: Integer; upper bound on the size of the hyperparameter grid. If the full grid (mtry × min.node.size) is larger, a random subset of tuneLength combinations is used. Default 5.
folds_parallel: Character; one of "auto", "TRUE", "FALSE".
"auto": parallelize across folds when ncores >= 2 and the total number of folds (number × repeats) >= 2.
"TRUE": force fold-level parallelism (PSOCK cluster, Windows-safe).
"FALSE": evaluate folds sequentially; each ranger fit then uses up to ncores threads.
seed: Integer; RNG seed for reproducibility. Default 123.
mtry: Optional integer; if provided, fixes the number of variables tried at each split. If NULL (default), a small grid around floor(sqrt(p)) is used, where p is the number of features.
splitrule: Character; ranger split rule (e.g., "gini", "extratrees"). Default "gini".
sample.fraction: Numeric in (0, 1]; subsampling fraction per tree in ranger. Default 1.
min.node.size.grid: Integer vector; candidate values for ranger's min.node.size used to build the tuning grid. Default c(1L, 5L, 10L).
min_node_frac: Numeric in (0, 1]. Safety cap for ranger's min.node.size in each fold and in the final fit: the value used is max(1, min(requested_min.node.size, floor(min_node_frac * n_train))). This prevents root-only trees (near-uniform class probabilities) on small training folds (e.g., after SMOTE). Default 1/3; set to 1 to disable the cap.
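The cap amounts to the following one-liner (cap_min_node_size is an illustrative name, not an exported function):

```r
# Sketch of the min.node.size safety cap: never exceed a fraction of the
# training-fold size, with a lower bound of 1.
cap_min_node_size <- function(requested, n_train, frac = 1/3) {
  max(1L, min(as.integer(requested), as.integer(floor(frac * n_train))))
}

cap_min_node_size(10L, 12L)            # floor(12/3) = 4, so 10 is capped to 4
cap_min_node_size(10L, 2L)             # floor(2/3) = 0, lower bound kicks in: 1
cap_min_node_size(10L, 12L, frac = 1)  # frac = 1 disables the cap: 10
```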
Details

Feature subsetting: X is subset to the columns whose numeric names match moz. This avoids expensive joins/transposes and guarantees a consistent feature order.
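The subsetting step can be sketched as below. It assumes exact floating-point equality between as.numeric(colnames(X)) and moz; in practice a tolerance-based match may be needed if the m/z values come from different sources.

```r
# Sketch of the column-subsetting step: match numeric column names against moz.
X <- matrix(runif(20), nrow = 4,
            dimnames = list(NULL, c("1000.1", "1000.2", "1000.3", "1000.4", "1000.5")))
moz <- c(1000.2, 1000.4)

keep <- as.numeric(colnames(X)) %in% moz
if (!any(keep)) stop("No columns of X match the requested m/z values.")
X_sub <- X[, keep, drop = FALSE]  # column order follows X, no joins/transposes
colnames(X_sub)                   # "1000.2" "1000.4"
```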
Cross-validation: folds are stratified by Y and repeated repeats times. Sampling is applied
only to training indices in each fold (to prevent leakage) and again before the final fit.
Hyperparameter search: a compact grid over mtry (around sqrt(p)) and min.node.size
(from min.node.size.grid), optionally downsampled to tuneLength. The best combination
maximizes the chosen metric averaged over folds.
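The grid construction can be sketched as follows. The exact mtry offsets around floor(sqrt(p)) are an assumption for illustration; only the overall shape (small mtry grid × min.node.size.grid, thinned to tuneLength rows) is described above.

```r
# Sketch of the compact tuning grid: mtry values around floor(sqrt(p)) crossed
# with min.node.size candidates, then randomly thinned to at most tuneLength rows.
p <- 30L
tuneLength <- 5L

mtry_center <- max(1L, floor(sqrt(p)))                       # 5 for p = 30
mtry_grid <- unique(pmax(1L, mtry_center + c(-2L, 0L, 2L)))  # offsets are illustrative
grid <- expand.grid(mtry = mtry_grid,
                    min.node.size = c(1L, 5L, 10L))          # 3 x 3 = 9 combinations

set.seed(123)
if (nrow(grid) > tuneLength) grid <- grid[sample(nrow(grid), tuneLength), ]
nrow(grid)  # 5
```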
Parallel strategy: by default ("auto"), the code parallelizes across folds with a PSOCK cluster
(Windows-safe) and sets ranger’s num.threads = 1 inside each worker to avoid oversubscription.
If you set folds_parallel = "FALSE", folds run sequentially and each ranger fit uses up to
ncores threads for strong single-fit parallelism.
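The fold-parallel pattern described above can be sketched with base R's parallel package. fit_one_fold is a placeholder for the per-fold work (subset, balance, fit, score); the key point is one PSOCK worker per fold with each ranger fit pinned to a single thread.

```r
# Sketch of the Windows-safe fold-parallel pattern: a PSOCK cluster evaluates
# folds while each worker's model fit would use num.threads = 1 to avoid
# oversubscription (ncores workers x 1 ranger thread each).
library(parallel)

ncores <- 2L
folds <- 1:4

fit_one_fold <- function(fold) {
  # ... subset data for this fold, balance training rows, then e.g.
  # ranger::ranger(..., num.threads = 1)
  fold^2  # stand-in result so the sketch is runnable
}

cl <- makePSOCKcluster(min(ncores, length(folds)))
res <- parLapply(cl, folds, fit_one_fold)
stopCluster(cl)
unlist(res)  # 1 4 9 16
```

With folds_parallel = "FALSE" the same loop would run via lapply() and the single ranger call would get num.threads = ncores instead.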
Metrics:
Accuracy and Cohen’s Kappa computed from the confusion matrix.
F1 is macro-averaged across classes.
AdjRankIndex uses mclust::adjustedRandIndex.
MatthewsCorrelation is the multiclass MCC.
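The macro-F1 and multiclass MCC can be sketched from a confusion matrix (rows = truth, columns = predictions). These are standard formulations (the MCC here is Gorodkin's R_K generalization), not necessarily the package's exact code:

```r
# Macro-averaged F1: per-class precision/recall from the confusion matrix,
# harmonic mean per class, then an unweighted mean across classes.
macro_f1 <- function(cm) {
  prec <- diag(cm) / pmax(colSums(cm), 1)  # guard against empty predicted classes
  rec  <- diag(cm) / pmax(rowSums(cm), 1)  # guard against empty true classes
  f1 <- ifelse(prec + rec == 0, 0, 2 * prec * rec / (prec + rec))
  mean(f1)
}

# Multiclass Matthews correlation (R_K statistic) from the same matrix.
multiclass_mcc <- function(cm) {
  n <- sum(cm); tr <- sum(diag(cm))
  rs <- rowSums(cm); cs <- colSums(cm)
  num <- tr * n - sum(rs * cs)
  den <- sqrt(n^2 - sum(rs^2)) * sqrt(n^2 - sum(cs^2))
  if (den == 0) 0 else num / den
}

cm <- table(truth = c("a", "a", "b", "b", "c", "c"),
            pred  = c("a", "a", "b", "c", "c", "c"))
macro_f1(cm)        # (1 + 2/3 + 4/5) / 3 = 37/45
multiclass_mcc(cm)  # 18 / sqrt(528)
```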
See Also

ranger::ranger, caret::confusionMatrix
Examples

if (FALSE) {
  set.seed(1)
  X <- matrix(runif(3000), nrow = 100, ncol = 30)
  colnames(X) <- as.character(round(seq(1000, 1290, length.out = 30), 4))
  moz <- as.numeric(colnames(X))[seq(1, 30, by = 2)]  # keep half the m/z
  Y <- factor(sample(letters[1:3], 100, replace = TRUE))

  fit <- LogReg_rf_fast(
    X, moz, Y,
    number = 3, repeats = 1,
    Metric = "Kappa",
    Sampling = "no",
    ncores = 4,
    num.trees = 300,
    tuneLength = 4,
    seed = 42
  )

  fit$train_mod$best_params
  fit$Confusion.Matrix
}