ovun.sample: Over-sampling, under-sampling, combination of over- and under-sampling.

Description

Creates possibly balanced samples by randomly over-sampling minority examples, under-sampling majority examples, or combining over- and under-sampling.

Usage

ovun.sample(formula, data, method="both", N, p=0.5, 
            subset=options("subset")$subset,
            na.action=options("na.action")$na.action, seed)

Arguments

formula

An object of class formula (or one that can be coerced to that class). See ROSE for information about interaction among predictors or their transformations.

data

An optional data frame, list or environment (or object coercible to a data frame by as.data.frame) in which to preferentially interpret formula. If not specified, the variables are taken from environment(formula).

method

One of c("over", "under", "both"), to perform over-sampling of minority examples, under-sampling of majority examples, or a combination of over- and under-sampling, respectively.

N

The desired sample size of the resulting data set. If missing and method is either "over" or "under", the sample size is determined by over-sampling or, respectively, under-sampling examples so that the minority class occurs approximately in proportion p. When method = "both", the default value is the length of the vectors specified in formula.

p

The probability of resampling from the rare class. If missing and method is either "over" or "under", this proportion is determined by over-sampling or, respectively, under-sampling examples so that the sample size is equal to N. When method = "both", the default value is 0.5.

subset

An optional vector specifying a subset of observations to be used in the sampling process. The default is set by the subset setting of options.

na.action

A function which indicates what should happen when the data contain 'NA's. The default is set by the na.action setting of options.

seed

A single value, interpreted as an integer, used as the random seed. Specifying it is recommended so that the sample can be traced and reproduced.
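
The link between N and p described above can be made concrete with a little arithmetic. The sketch below is illustrative only and is not the package's internal code: it assumes that over-sampling leaves the majority count untouched, that under-sampling leaves the minority count untouched, and that the target proportion is hit exactly (the text above only promises "approximately"). The counts n.majo and n.mino are hypothetical.

```r
# Illustrative arithmetic only; not ROSE's internal code.
n.majo <- 980   # hypothetical majority-class count
n.mino <- 20    # hypothetical minority-class count
p <- 0.5        # target proportion of the minority class

# Assumed for "over": every majority example is kept, so the minority
# count grows until n.mino.new / (n.majo + n.mino.new) equals p.
n.mino.new <- round(n.majo * p / (1 - p))
N.over <- n.majo + n.mino.new

# Assumed for "under": every minority example is kept, so the majority
# count shrinks until n.mino / (n.mino + n.majo.new) equals p.
n.majo.new <- round(n.mino * (1 - p) / p)
N.under <- n.mino + n.majo.new

c(over = N.over, under = N.under)
```

Under these assumptions, supplying p fixes the implied N and vice versa, which is why only one of the two needs to be given for "over" and "under".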

Value

The value is an object of class ovun.sample, which has the following components:

Call

The matched call.

method

The method used to balance the sample. Possible choices are c("over", "under", "both").

data

The resulting new data set.
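
As a minimal sketch of how the returned object might be inspected, the code below assumes the ROSE package is installed and uses its hacide data. The object names res and res2 are illustrative, and the final check assumes that repeating a call with the same seed reproduces the same sample.

```r
library(ROSE)
data(hacide)

res <- ovun.sample(cls ~ ., data = hacide.train, method = "over",
                   p = 0.5, seed = 1)

class(res)        # class of the returned object
res$method        # balancing method that was used
head(res$data)    # first rows of the resampled data set

# Repeat the same call with the same seed; the two samples are
# expected to match if the seed fully determines the resampling.
res2 <- ovun.sample(cls ~ ., data = hacide.train, method = "over",
                    p = 0.5, seed = 1)
identical(res$data, res2$data)
```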

Examples

# NOT RUN {
# 2-dimensional example
# loading data
data(hacide)

# imbalance on training set
table(hacide.train$cls)

# balanced data set with both over and under sampling
data.balanced.ou <- ovun.sample(cls~., data=hacide.train,
                                N=nrow(hacide.train), p=0.5, 
                                seed=1, method="both")$data

table(data.balanced.ou$cls)

# balanced data set with over-sampling
data.balanced.over <- ovun.sample(cls~., data=hacide.train, 
                                  p=0.5, seed=1, 
                                  method="over")$data

table(data.balanced.over$cls)

# }
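
The examples above cover "both" and "over"; a comparable sketch for "under" might look as follows. It assumes the ROSE package is installed, and it picks N as twice the minority count on the assumption that under-sampling keeps every minority example and only reduces the majority class. The names n.mino and data.balanced.under are illustrative.

```r
library(ROSE)
data(hacide)

# size of the minority class in the training set
n.mino <- min(table(hacide.train$cls))

# under-sampling with a requested total size of twice the minority count
data.balanced.under <- ovun.sample(cls ~ ., data = hacide.train,
                                   N = 2 * n.mino, seed = 1,
                                   method = "under")$data

table(data.balanced.under$cls)
```

Because under-sampling discards majority examples, the resulting data set is much smaller than the original; whether that loss of information is acceptable depends on the application.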
