SelectionVarStat: Fast feature (m/z) selection using multiple hypothesis testing (LIMMA/ANOVA/Kruskal) with optional class balancing (no/up/down/SMOTE)

Description

Runs univariate statistical tests across all features (columns) to identify discriminant m/z values. Supports three tests:

"Limma": moderated F-test via the limma R package on k-1 contrasts between groups
"anova": classical one-way ANOVA (implemented in C++ for speed)
"kruskal": Kruskal–Wallis rank-sum test (implemented in C++ with tie correction)

Usage

SelectionVarStat(
  X,
  Y,
  stat.test = c("Limma", "anova", "kruskal"),
  pi0.method = "abh",
  fdr = 0.05,
  Sampling = c("no", "up", "down", "smote"),
  seed = NULL
)

Value

A list with:

nb_to_sel: integer, floor(ncol(X) * (1 - pi0)) using cp4p::estim.pi0.
n_selected_fdr: integer, number of features with adjusted p-value < fdr.
sel_moz: character vector of selected feature names (columns of X) with adjusted p-value < fdr.
ap: the full object returned by cp4p::adjust.p (contains adjusted p-values).

Arguments

X

Numeric matrix with samples in rows and features (peaks) in columns. If a sparse Matrix is provided, it is coerced to a base R dense matrix. Infinities are set to NA; NAs are allowed.

Y

Class labels. A factor (or coercible to factor) of length nrow(X). Must contain at least two levels.

stat.test

Character string; which univariate test to use. One of "Limma", "anova", or "kruskal". Default is "Limma".

Limma uses limma::lmFit on ~ 0 + Y, applies k-1 contrasts versus a reference group with limma::makeContrasts and then limma::eBayes to compute a moderated F p-value testing overall between-group differences.
anova uses a fast C++ one-way ANOVA (upper-tail F p-value).
kruskal uses a fast C++ Kruskal–Wallis test with tie correction (upper-tail chi-square p-value).

pi0.method

Character; method for cp4p::estim.pi0 and cp4p::adjust.p. Default "abh". See cp4p documentation for options.

fdr

Numeric in (0, 1]; FDR threshold used to select features after p-value adjustment. Default 0.05.

Sampling

Character string; optional class balancing applied before testing. One of "no", "up", "down", "smote". Default "no".

"up": up-samples minority classes to the majority size (base R implementation).
"down": down-samples majority classes to the minority size (base R).
"smote": uses the package’s internal smote_classif() on data.frame(Y, X). Column names of X are preserved; if SMOTE fails or yields <2 classes, the function falls back to up-sampling.

seed

Optional integer. If provided, sets the random seed for up/down sampling and SMOTE to ensure reproducibility.

Details

Optional class balancing can be applied before testing: no sampling, up-sampling, down-sampling, or SMOTE using the internal smote_classif() function. P-values are adjusted with the cp4p R package, and features below the FDR threshold are returned, together with an estimated proportion of nulls, pi0.

Missing values are preserved as NA: for ANOVA/Kruskal, the C++ backends skip NAs per feature; for Limma, missing values are handled by limma internally. Non-finite values (Inf/-Inf) are coerced to NA.

Limma design: model.matrix(~ 0 + Y) is used; a full-rank set of k-1 contrasts vs the first level is constructed with limma::makeContrasts. The returned p-values are moderated F-test p-values for overall group differences.
ANOVA/Kruskal implementations are in C++ (see anova_cols_cpp and kruskal_cols_cpp); they process all features in one pass and skip NAs per feature.
Non-finite p-values (if any) are set to 1 before pi0/adjustment.
SMOTE requires at least two observations per class; otherwise the function automatically falls back to up-sampling.
For limma, if residual degrees of freedom are not positive (nrow(X) - nlevels(Y) <= 0), p-values are set to 1 with a warning.

Examples

Run this code

###############################################################################
## 1. Pre-processing of mass spectra

# load mass spectra and their metadata
data("CitrobacterRKIspectra","CitrobacterRKImetadata", package = "MSclassifR")
# standard pre-processing of mass spectra
spectra <- MSclassifR::SignalProcessing(CitrobacterRKIspectra)
# detection of peaks in pre-processed mass spectra
peaks <- MSclassifR::PeakDetection(x = spectra, labels = CitrobacterRKImetadata$Strain_name_spot)
# build matrix with intensities of peaks (rows = samples, columns = m/z)
Y <- factor(CitrobacterRKImetadata$Species)
xy <- build_XY_from_peaks(peaks, labels = Y, normalize = "max", sparse = FALSE)
X <- xy$X
Y <- xy$Y

###############################################################################
## 2. Estimate the optimal number of peaks to discriminate the different species

OptiPeaks <- MSclassifR::SelectionVarStat(X,
                             Y,
                             stat.test = "Limma",
                             pi0.method = "abh",
                             fdr = 0.05,
                             Sampling = "smote",
                             seed = 1)

## Estimation of the optimal number of peaks to discriminate species (from the pi0 parameter)
OptiPeaks$nb_to_sel

## discriminant mass-to-charge values estimated using a 5 per cent false discovery rate
OptiPeaks$sel_moz

## p-values and adjusted p-values estimated for all the tested mass-to-charge values
OptiPeaks$ap$adjp