smartdata
smartdata is an R package that integrates preprocessing algorithms for oversampling, instance and feature selection, normalization, discretization, space transformation, and the cleaning of outliers, missing values, and noise.
Installation
You can install the latest smartdata stable release from CRAN with:
# This sets both CRAN and Bioconductor as repositories to resolve dependencies
setRepositories(ind = 1:2)
install.packages("smartdata")
and load it into an R session with:
library("smartdata")
Examples
smartdata provides the following wrappers:
oversample
instance_selection
feature_selection
normalize
discretize
space_transformation
clean_outliers
impute_missing
clean_noise
To list the methods available for a given wrapper, call which_options with the wrapper's name:
which_options("instance_selection")
#> Possible methods are: 'CNN', 'ENN', 'multiedit', 'FRIS'
To see the parameters a method accepts, pass the method name as the second argument:
which_options("instance_selection", "multiedit")
#> For more information do: ?class::multiedit
#> Parameters for multiedit are:
#>   * k: Number of neighbors used in KNN
#>       Default value: 1
#>   * num_folds: Number of partitions the train set is split in
#>       Default value: 3
#>   * null_passes: Number of null passes to use in the algorithm
#>       Default value: 5
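The same call can be looped over several wrappers to print every available method in one pass. This is a minimal sketch that assumes smartdata is installed and that which_options prints its message as a side effect, as the output above suggests. The `wrappers` vector is only an illustrative name; its entries are copied from the wrapper list above.

```r
library("smartdata")

# Wrapper names copied from the list above
wrappers <- c("instance_selection", "feature_selection", "normalize",
              "discretize", "space_transformation", "clean_outliers",
              "impute_missing", "clean_noise")

# Print the methods each wrapper accepts, one wrapper at a time
for (wrapper in wrappers) {
  cat("\n", wrapper, "\n", sep = "")
  which_options(wrapper)
}
```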
First, load the datasets used in the examples below (the imbalance and mice packages must be installed):
data(iris0, package = "imbalance")
data(ecoli1, package = "imbalance")
data(nhanes, package = "mice")
Oversampling
super_iris <- iris0 %>% oversample(method = "MWMOTE", ratio = 0.8, filtering = TRUE)
Instance selection
super_iris <- iris %>% instance_selection("multiedit", k = 3, num_folds = 2,
                                          null_passes = 10, class_attr = "Species")
Feature selection
super_ecoli <- ecoli1 %>% feature_selection("Boruta", class_attr = "Class")
Normalization
super_iris <- iris %>% normalize("min_max", exclude = c("Sepal.Length", "Species"))
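To make the effect of this call concrete, here is a base R sketch of min-max rescaling. It assumes that "min_max" applies the usual formula (x - min) / (max - min) to each column that is not excluded; smartdata's own implementation may differ in detail. The helper `min_max` and the names `to_scale` and `scaled_iris` are hypothetical and not part of the package.

```r
# Hypothetical helper, not part of smartdata: rescale a numeric vector to [0, 1]
min_max <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Columns to leave untouched, mirroring the `exclude` argument above
exclude <- c("Sepal.Length", "Species")
to_scale <- setdiff(names(iris), exclude)

# Rescale only the remaining columns
scaled_iris <- iris
scaled_iris[to_scale] <- lapply(iris[to_scale], min_max)

# Show the minimum and maximum of each rescaled column
sapply(scaled_iris[to_scale], range)
```

Excluded columns keep their original values, so Sepal.Length stays in centimetres and Species remains a factor.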
Discretization
super_iris <- iris %>% discretize("ameva", class_attr = "Species")
Space transformation
super_ecoli <- ecoli1 %>% space_transformation("lle_knn", k = 3, num_features = 2)
Outliers
super_iris <- iris %>% clean_outliers("multivariate", type = "adj")
Missing values
super_nhanes <- nhanes %>% impute_missing("gibbs_sampling")
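A quick way to check what the imputation did is to count missing values before and after. The sketch below assumes smartdata and mice are installed; it uses only base R for the checks and repeats the impute_missing call from above. Whether every missing value is filled depends on the chosen method, so the final check is something to inspect rather than a guaranteed result.

```r
library("smartdata")
data(nhanes, package = "mice")

# Count missing values per column before imputation
colSums(is.na(nhanes))

# Impute with the same call as above
super_nhanes <- nhanes %>% impute_missing("gibbs_sampling")

# Check whether any missing value remains after imputation
anyNA(super_nhanes)
```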
Noise
super_iris <- iris %>% clean_noise("hybrid", class_attr = "Species",
                                   consensus = FALSE, action = "repair")