smote_classif: SMOTE for classification datasets

Description

Generates synthetic examples for minority classes using SMOTE (Synthetic Minority Over-sampling Technique) in order to balance a classification dataset.
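
To make the SMOTE idea concrete, here is a minimal sketch of how a single synthetic example can be produced for numeric predictors: pick a minority example, find its k nearest neighbors within the same class, pick one neighbor at random, and interpolate between the two. The helper smote_one is hypothetical and is not part of this package; it assumes Euclidean distance and purely numeric features, and the internals of smote_classif may differ (for instance in how nominal predictors are handled).

```r
# Illustrative sketch only; `smote_one` is a hypothetical helper, not a package function.
# X: numeric matrix holding the examples of ONE minority class (rows = examples).
smote_one <- function(X, i, k = 5) {
  # Euclidean distance from example i to every example of the class
  d <- sqrt(rowSums(sweep(X, 2, X[i, ])^2))
  d[i] <- Inf                                   # exclude the example itself
  nn <- order(d)[seq_len(min(k, nrow(X) - 1))]  # indices of the k nearest neighbors
  j <- nn[sample.int(length(nn), 1)]            # pick one neighbor at random
  gap <- runif(1)                               # random position on the segment, in [0, 1]
  X[i, ] + gap * (X[j, ] - X[i, ])              # interpolated synthetic example
}

set.seed(1)
X <- as.matrix(iris[iris$Species == "virginica", 1:4])
new_row <- smote_one(X, i = 1, k = 5)
new_row
```

Because the new point is a convex combination of two existing examples, each of its coordinates stays within the range already observed for that class.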

Usage

smote_classif(
  formula,
  data,
  k = 5,
  strategy = c("balance", "perc"),
  perc = NULL,
  metric = c("euclidean", "manhattan", "chebyshev", "canberra", "overlap", "heom",
    "hvdm", "pnorm"),
  p = 2,
  seed = NULL,
  C.perc = NULL
)

Value

A data.frame containing the original rows with the synthetic rows appended. It has the same columns and column types as data.

Arguments

formula: A model formula target ~ predictors indicating the response and predictors.
data: A data.frame containing the variables in the model.
k: Integer, number of nearest neighbors used by SMOTE (default 5).
strategy: One of "balance" (oversample to the max class size) or "perc" (oversample each class by a percentage). Default "balance".
perc: Numeric percentage used when strategy = "perc" (e.g., 100 means generating as many synthetic examples as there are existing examples in the class). Ignored when strategy = "balance".
metric: Distance metric for neighbor search: one of "euclidean", "manhattan", "chebyshev", "canberra", "overlap", "heom", "hvdm", "pnorm". Default "euclidean".
p: Numeric exponent of the p-norm, used when metric = "pnorm". The "euclidean" and "manhattan" metrics correspond to the special cases p = 2 and p = 1, respectively. Default 2.
seed: Optional integer seed for reproducibility.
C.perc: Deprecated. Backward-compatibility alias for oversampling control. If character "balance", mapped to strategy = "balance". If a single numeric, mapped to strategy = "perc" and perc = C.perc. Other forms are ignored with a warning.
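
The mapping described for C.perc can be sketched as follows. The helper resolve_c_perc is hypothetical and only mirrors the rules stated above; the function's actual internals and the wording of its warning are not shown in this page and are assumptions here.

```r
# Hypothetical helper mirroring the documented C.perc mapping; not the package's code.
resolve_c_perc <- function(C.perc, strategy = "balance", perc = NULL) {
  if (is.null(C.perc)) {
    return(list(strategy = strategy, perc = perc))  # nothing to map
  }
  if (is.character(C.perc) && length(C.perc) == 1 && C.perc == "balance") {
    return(list(strategy = "balance", perc = NULL))
  }
  if (is.numeric(C.perc) && length(C.perc) == 1) {
    return(list(strategy = "perc", perc = C.perc))
  }
  # Any other form is ignored with a warning (message text is an assumption)
  warning("Unsupported form of C.perc; ignoring it.")
  list(strategy = strategy, perc = perc)
}

res_balance <- resolve_c_perc("balance")  # maps to strategy = "balance"
res_perc <- resolve_c_perc(200)           # maps to strategy = "perc", perc = 200
```

New code should set strategy and perc directly rather than rely on the deprecated alias.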

Details

The function supports multi-class data. With strategy = "balance" (the default), each class is oversampled until it reaches the size of the largest class. With strategy = "perc", each class c receives round(n_c * perc/100) synthetic examples, where n_c is the number of examples in class c. Nearest neighbors are searched only among examples of the same class.
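
As a worked illustration of the two strategies, the snippet below computes how many synthetic examples each class would receive for the class sizes used in the Examples section (40, 50 and 10). It only reproduces the arithmetic stated above and does not call smote_classif; the variable names are made up for the illustration.

```r
# Class sizes of the imbalanced iris subset used in the Examples section
n <- c(setosa = 40, versicolor = 50, virginica = 10)

# strategy = "balance": bring every class up to the size of the largest class
n_new_balance <- max(n) - n

# strategy = "perc" with perc = 100: round(n_c * perc / 100) per class
perc <- 100
n_new_perc <- round(n * perc / 100)

n_new_balance  # setosa 10, versicolor 0, virginica 40
n_new_perc     # setosa 40, versicolor 50, virginica 10
```

Note that under "perc" every class grows, including the largest one, so the class proportions stay unchanged when a single percentage is applied to all classes.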

Examples

# \donttest{
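data(iris)
# Build an imbalanced subset: 40 setosa, 50 versicolor, 10 virginica
imbal_iris <- iris[c(1:40, 51:100, 101:110), ]
table(imbal_iris$Species)
# Oversample with the defaults (strategy = "balance", k = 5)
balanced_iris <- smote_classif(Species ~ ., imbal_iris)
# Compare the class counts after oversampling
table(balanced_iris$Species)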
# }
