ImpSampClassif: Importance Sampling algorithm for imbalanced classification problems

Description

This function handles imbalanced classification problems by re-sampling the data set according to the importance (relevance) supplied for the problem classes. The relevance determines which examples are replicated and which are removed: replicas of the most important examples are introduced, and the least important examples are discarded. To do this, the function combines random over-sampling with random under-sampling, applying each to the problem classes according to their relevance.
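The combination of random over- and under-sampling described above can be illustrated with a minimal base R sketch. The helper `resample_by_class` is hypothetical and is not part of any package; it samples uniformly at random within each class and is not meant to reproduce the internals of ImpSampClassif.

```r
# Hypothetical helper (illustration only): resample each class of a data
# frame by a per-class percentage (> 1 over-samples, < 1 under-samples).
resample_by_class <- function(dat, target, perc) {
  parts <- lapply(names(perc), function(cl) {
    idx <- which(dat[[target]] == cl)
    n_new <- round(perc[[cl]] * length(idx))
    if (n_new <= length(idx)) {
      # under-sampling (or no change): keep a random subset, no replacement
      keep <- idx[sample.int(length(idx), n_new)]
    } else {
      # over-sampling: keep every original row and add random replicas
      extra <- idx[sample.int(length(idx), n_new - length(idx), replace = TRUE)]
      keep <- c(idx, extra)
    }
    dat[keep, , drop = FALSE]
  })
  do.call(rbind, parts)
}

set.seed(1)
# artificially imbalanced iris: 50 setosa, 30 versicolor, 10 virginica
ir <- iris[-c(51:70, 111:150), ]
res <- resample_by_class(ir, "Species",
                         list(setosa = 0.2, versicolor = 2, virginica = 6))
table(res$Species)
```

In this sketch the majority class shrinks while the two minority classes grow, which is the general effect the function aims for; the choice of which rows to replicate or drop is purely random here.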

Usage

ImpSampClassif(form, dat, C.perc = "balance")

Arguments

form

A formula describing the prediction problem

dat

A data frame containing the original (imbalanced) data set

C.perc

A list containing the percentage of random under- or over-sampling to apply to each class. A value above 1 over-samples the class, a value below 1 under-samples it, and a value of 1 leaves the class unchanged. Alternatively, it may be the string "balance" (the default) or "extreme", in which case the sampling percentages are estimated automatically.
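As a rough illustration of how such a list might be read, the sketch below classifies each percentage and estimates the resulting class size. The class counts are illustrative, and the assumption that the new size is approximately the percentage times the original count is ours, not something stated in this page.

```r
# Illustrative class counts before resampling
counts <- c(setosa = 50, versicolor = 30, virginica = 10)

# A C.perc-style list: < 1 under-samples, 1 leaves unchanged, > 1 over-samples
C.perc <- list(setosa = 0.5, versicolor = 1, virginica = 3)
perc <- unlist(C.perc)[names(counts)]

# Assumption: resulting size is roughly percentage * original count
expected <- round(counts * perc)

action <- ifelse(perc < 1, "under-sample",
                 ifelse(perc > 1, "over-sample", "unchanged"))

data.frame(before = counts, perc = perc, action = action, after = expected)
```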

Value

The function returns a data frame with the new data set resulting from the application of the importance sampling strategy.

Examples


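# NOT RUN {
  # assumes the package that provides ImpSampClassif (UBL) is installed
  library(UBL)
  data(iris)
  # generate an artificially imbalanced data set:
  # 50 setosa, 30 versicolor, 10 virginica
  ir <- iris[-c(51:70, 111:150), ]
  # sampling percentages estimated automatically
  IS.ext <- ImpSampClassif(Species ~ ., ir, C.perc = "extreme")
  IS.bal <- ImpSampClassif(Species ~ ., ir, C.perc = "balance")
  # user-defined percentages: under-sample setosa, over-sample the others
  myIS <- ImpSampClassif(Species ~ ., ir,
                         C.perc = list(setosa = 0.2,
                                       versicolor = 2,
                                       virginica = 6))
  # compare the class distributions before and after resampling
  table(ir$Species)
  table(IS.ext$Species)
  table(IS.bal$Species)
  table(myIS$Species)
# }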
