themis
themis contains extra steps for the recipes package for dealing with unbalanced data. The name themis is that of the ancient Greek goddess who is typically depicted with a balance.
Installation
You can install the released version of themis from CRAN with:
install.packages("themis")
Install the development version from GitHub with:
# install.packages("devtools")
devtools::install_github("tidymodels/themis")
Example
The following is an example of using the SMOTE algorithm to deal with unbalanced data:
library(recipes)
library(modeldata)
library(themis)
data(okc)
sort(table(okc$Class, useNA = "always"))
#>
#>  <NA>  stem other
#>     0  9539 50316
ds_rec <- recipe(Class ~ age + height, data = okc) %>%
  step_meanimpute(all_predictors()) %>%
  step_smote(Class) %>%
  prep()
sort(table(juice(ds_rec)$Class, useNA = "always"))
#>
#>  <NA>  stem other
#>     0 50316 50316
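The core idea behind SMOTE, creating synthetic minority observations by interpolating between a minority observation and one of its nearest minority neighbours, can be sketched in base R. This is an illustrative sketch only and makes no claim about how step_smote() is implemented; the function smote_sketch(), its arguments, and the simulated toy data are all hypothetical.

```r
# Minimal SMOTE-style interpolation in base R (illustrative, not themis internals).
# `smote_sketch`, `minority`, `k` and `n_new` are hypothetical names.
smote_sketch <- function(minority, k = 5, n_new = 10) {
  minority <- as.matrix(minority)
  n <- nrow(minority)
  stopifnot(n > k)

  # Pairwise Euclidean distances between minority observations
  d <- as.matrix(dist(minority))

  synthetic <- matrix(NA_real_, nrow = n_new, ncol = ncol(minority))
  for (i in seq_len(n_new)) {
    # Pick a random minority observation as the starting point
    seed <- sample.int(n, 1)

    # Find its k nearest minority neighbours, excluding itself
    others <- setdiff(seq_len(n), seed)
    neighbours <- others[order(d[seed, others])[seq_len(k)]]

    # Choose one neighbour and interpolate at a random position between the two
    nb <- neighbours[sample.int(k, 1)]
    gap <- runif(1)
    synthetic[i, ] <- minority[seed, ] + gap * (minority[nb, ] - minority[seed, ])
  }
  synthetic
}

# Simulated toy minority class with two numeric predictors
set.seed(1)
minority <- cbind(age = rnorm(20, 30, 5), height = rnorm(20, 68, 3))
new_points <- smote_sketch(minority, k = 5, n_new = 10)
dim(new_points)
```

Because every synthetic point lies on a line segment between two existing minority observations, the new points stay inside the region already occupied by the minority class. This is also why the predictors must be numeric and free of missing values before the step is applied.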
Methods
Below is some unbalanced data, used for the examples later.
example_data <- data.frame(class = letters[rep(1:5, 1:5 * 10)],
                           x = rnorm(150))
library(ggplot2)
example_data %>%
  ggplot(aes(class)) +
  geom_bar()
Upsample / Over-sampling
The following methods all share the tuning parameter over_ratio, which is the desired ratio of minority-to-majority frequencies after sampling.
name | function | Multi-class |
---|---|---|
Random minority over-sampling with replacement | step_upsample() | :heavy_check_mark: |
Synthetic Minority Over-sampling Technique | step_smote() | :heavy_check_mark: |
Borderline SMOTE-1 | step_bsmote(method = 1) | :heavy_check_mark: |
Borderline SMOTE-2 | step_bsmote(method = 2) | :heavy_check_mark: |
Adaptive synthetic sampling approach for imbalanced learning | step_adasyn() | :heavy_check_mark: |
Generation of synthetic data by Randomly Over Sampling Examples | step_rose() | |
Setting over_ratio = 1 brings every minority class up to the same number of samples as the majority class (100%).
recipe(~., example_data) %>%
  step_upsample(class, over_ratio = 1) %>%
  prep() %>%
  juice() %>%
  ggplot(aes(class)) +
  geom_bar()
Setting over_ratio = 0.5 upsamples any minority class with fewer samples than 50% of the majority class until it has 50% of the majority class's samples. Classes already above that threshold are left unchanged.
recipe(~., example_data) %>%
  step_upsample(class, over_ratio = 0.5) %>%
  prep() %>%
  juice() %>%
  ggplot(aes(class)) +
  geom_bar()
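The arithmetic behind over_ratio can be checked by hand on the class counts of example_data (10, 20, 30, 40 and 50 rows for classes a to e). The sketch below is plain base R and only mirrors the behaviour described above; the helper upsample_targets() is hypothetical, and rounding down with floor() is an assumption that does not matter for these counts.

```r
# Class counts of example_data: letters[rep(1:5, 1:5 * 10)]
counts <- c(a = 10, b = 20, c = 30, d = 40, e = 50)

# Hypothetical helper: class sizes expected after over-sampling.
# Classes below the target are raised to it; larger classes are untouched.
upsample_targets <- function(counts, over_ratio = 1) {
  target <- floor(max(counts) * over_ratio)  # rounding rule is an assumption
  pmax(counts, target)
}

upsample_targets(counts, over_ratio = 1)    # every class ends at 50
upsample_targets(counts, over_ratio = 0.5)  # a and b rise to 25; c, d, e unchanged
```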
Downsample / Under-sampling
Most of the following methods share the tuning parameter under_ratio, which is the desired ratio of majority-to-minority frequencies after sampling.
name | function | Multi-class | under_ratio |
---|---|---|---|
Random majority under-sampling with replacement | step_downsample() | :heavy_check_mark: | :heavy_check_mark: |
NearMiss-1 | step_nearmiss() | :heavy_check_mark: | :heavy_check_mark: |
Extraction of majority-minority Tomek links | step_tomek() | | |
Setting under_ratio = 1 brings every majority class down to the same number of samples as the minority class (100%).
recipe(~., example_data) %>%
  step_downsample(class, under_ratio = 1) %>%
  prep() %>%
  juice() %>%
  ggplot(aes(class)) +
  geom_bar()
Setting under_ratio = 2 downsamples any majority class with more than 200% of the minority class's samples until it has 200% of the minority class's samples. Classes already below that threshold are left unchanged.
recipe(~., example_data) %>%
  step_downsample(class, under_ratio = 2) %>%
  prep() %>%
  juice() %>%
  ggplot(aes(class)) +
  geom_bar()
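The same hand calculation works for under_ratio, again using the class counts of example_data (10, 20, 30, 40 and 50 rows). This is a base R sketch of the behaviour described above, not the package's code; the helper downsample_targets() is hypothetical, and rounding down with floor() is an assumption.

```r
# Class counts of example_data: letters[rep(1:5, 1:5 * 10)]
counts <- c(a = 10, b = 20, c = 30, d = 40, e = 50)

# Hypothetical helper: class sizes expected after under-sampling.
# Classes above the target are reduced to it; smaller classes are untouched.
downsample_targets <- function(counts, under_ratio = 1) {
  target <- floor(min(counts) * under_ratio)  # rounding rule is an assumption
  pmin(counts, target)
}

downsample_targets(counts, under_ratio = 1)  # every class ends at 10
downsample_targets(counts, under_ratio = 2)  # c, d, e drop to 20; a and b unchanged
```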