smote_classif: SMOTE for Classification Problems

Description

This function applies the Synthetic Minority Over-sampling Technique (SMOTE) to address class imbalance in classification problems. The implementation supports several distance metrics and balancing strategies, and handles mixed data types (numeric and categorical). It runs faster than the SmoteClassif function of the UBL R package.
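As a rough illustration of the idea behind SMOTE (Chawla et al., 2002), the sketch below creates one synthetic case by interpolating between a minority-class case and one of its k nearest neighbors. It is not the code used by smote_classif: the helper name smote_point is hypothetical, only numeric features and Euclidean distance are handled, and the choice of iris rows is arbitrary.

```r
# Illustrative sketch only; NOT the MSclassifR implementation.
# smote_point() is a hypothetical helper restricted to numeric features.
smote_point <- function(X, i, k = 5) {
  # X: numeric matrix of minority-class cases (one case per row)
  # i: row index of the seed case
  d <- sqrt(rowSums(sweep(X, 2, X[i, ])^2))     # Euclidean distance to the seed
  d[i] <- Inf                                   # exclude the seed itself
  nn <- order(d)[seq_len(min(k, nrow(X) - 1))]  # indices of the k nearest neighbors
  j <- nn[sample.int(length(nn), 1)]            # pick one neighbor at random
  gap <- runif(1)                               # interpolation weight between 0 and 1
  X[i, ] + gap * (X[j, ] - X[i, ])              # point on the segment seed -> neighbor
}

set.seed(1)
minority <- as.matrix(iris[iris$Species == "virginica", 1:4][1:10, ])
new_case <- smote_point(minority, i = 1, k = 5)
new_case
```

Because the synthetic case lies on the segment between two existing cases, each of its feature values stays within the range observed in the minority class.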

Usage

smote_classif(form, dat, C.perc = "balance", k = 5, repl = FALSE,
              dist = "Euclidean", p = 2)

Value

A data frame with the same structure as the input, but with rebalanced classes according to the specified strategy.

Arguments

form: A model formula identifying the target variable (e.g., Class ~ .).
dat: A data frame containing the imbalanced dataset.
C.perc: Either "balance", "extreme", or a named list giving the over/under-sampling percentage for each class. Values below 1 under-sample the class, values above 1 over-sample it, and a value of 1 leaves it unchanged. "balance" equalizes all classes; "extreme" performs more aggressive balancing.
k: Integer specifying the number of nearest neighbors to use when generating synthetic examples (default: 5).
repl: Logical, whether to allow sampling with replacement when under-sampling (default: FALSE).
dist: Distance metric to use for nearest neighbor calculations. Supported metrics: "Euclidean" (default), "Manhattan", "Chebyshev", "Canberra", "Overlap", "HEOM", "HVDM", or "p-norm". See the calculate_distance function for details.
p: Parameter used when dist = "p-norm" (default: 2).
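
To make the role of p concrete, here is a minimal sketch that assumes "p-norm" denotes the usual Minkowski distance. The helper p_norm_dist is hypothetical and is not part of the package; the package's own computation is in calculate_distance.

```r
# Hypothetical helper for illustration only; not part of the package.
# Assumes "p-norm" means the standard Minkowski distance of order p.
p_norm_dist <- function(x, y, p = 2) {
  sum(abs(x - y)^p)^(1 / p)
}

a <- c(0, 0)
b <- c(3, 4)
p_norm_dist(a, b, p = 1)  # 7, same as the Manhattan distance
p_norm_dist(a, b, p = 2)  # 5, same as the Euclidean distance
```

Under that assumption, p = 1 and p = 2 reproduce the Manhattan and Euclidean metrics, and larger values of p give more weight to the feature with the largest difference.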

Details

If you use this package in your research, please cite the associated publication (doi:10.1016/j.eswa.2025.128796).

References

Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16, 321-357. doi:10.1613/jair.953.

Godmer, A., Benzerara, Y., Varon, E., Veziris, N., Druart, K., Mozet, R., Matondo, M., Aubry, A., & Giai Gianetto, Q. (2025). MSclassifR: An R package for supervised classification of mass spectra with machine learning methods. Expert Systems with Applications, 294, 128796. doi:10.1016/j.eswa.2025.128796.

Examples

# Load the iris dataset
data(iris)

# Create an imbalanced dataset by taking a subset
imbal_iris <- iris[c(1:40, 51:100, 101:110), ]
table(imbal_iris$Species)  # Show class distribution

# Balance classes using the default "balance" strategy
balanced_iris <- smote_classif(Species ~ ., imbal_iris)
table(balanced_iris$Species)  # Show balanced distribution

# Custom over/under-sampling
custom_iris <- smote_classif(Species ~ ., imbal_iris,
                            C.perc = list(setosa = 0.8,
                                          versicolor = 1,
                                          virginica = 3))
table(custom_iris$Species)
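
As a sanity check on the custom percentages above, the class sizes they imply can be computed by hand. This sketch assumes that each percentage simply multiplies the original class count and that the result is rounded; the exact counts returned by smote_classif may differ slightly depending on how it rounds.

```r
# Expected class sizes implied by C.perc, assuming percentage * original count.
imbal_iris <- iris[c(1:40, 51:100, 101:110), ]
counts <- table(imbal_iris$Species)  # setosa 40, versicolor 50, virginica 10
perc <- c(setosa = 0.8, versicolor = 1, virginica = 3)
expected <- round(as.numeric(counts[names(perc)]) * perc)
expected
# setosa 32, versicolor 50, virginica 30
```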
