RemixAutoML (version 0.11.0)

ProblematicFeatures: ProblematicFeatures identifies problematic features for machine learning

Description

ProblematicFeatures identifies problematic features for machine learning and outputs a data.table with the feature names in the first column and the metrics they failed to pass in the remaining columns.

Usage

ProblematicFeatures(data, ColumnNumbers = c(1:ncol(data)),
  NearZeroVarThresh = 0.05, CharUniqThresh = 0.5, NA_Rate = 0.2,
  Zero_Rate = 0.2, HighSkewThresh = 10)

Arguments

data

The data.table with the columns you wish to have analyzed

ColumnNumbers

A vector with the column numbers you wish to analyze

NearZeroVarThresh

Set to NULL to not run NearZeroVar(). Checks whether the percentage of values in each numeric column that are not constant is greater than the value you set here. If it is not, the feature is collected and returned with its percentage of unique values.
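The description leaves the exact metric open, so the following is only a rough base R sketch of one possible reading: the share of values that differ from the most frequent value. The helper `non_constant_share` is hypothetical and not part of RemixAutoML; the package may instead compute the ratio of distinct values to rows.

```r
# Hypothetical helper, not part of RemixAutoML: share of non-missing values
# that differ from the single most frequent value in a numeric vector.
non_constant_share <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) return(NA_real_)
  counts <- table(x)
  1 - max(counts) / length(x)
}

x <- c(rep(1, 99), 0.37)         # 99 identical values and one different value
share <- non_constant_share(x)   # about 0.01
share > 0.05                     # FALSE, so this column would be flagged
```

With the default threshold of 0.05, a column passes under this reading only when more than 5% of its values deviate from the dominant value.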

CharUniqThresh

Set to NULL to not run CharUniqThresh(). Checks whether the percentage of unique levels (groups) in each categorical feature is greater than the value you supply. If it is, the feature name is returned with its percentage of unique values.
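As an illustration of this check, the sketch below computes the number of distinct levels divided by the number of rows. The helper `unique_level_share` is a hypothetical name introduced here, and whether the package counts NA as a level is not stated in this page.

```r
# Hypothetical helper, not part of RemixAutoML: distinct levels divided by
# the number of rows in a character or factor vector.
unique_level_share <- function(x) {
  length(unique(as.character(x))) / length(x)
}

ids   <- paste0("id_", 1:10)            # every row has its own level
group <- rep(c("a", "b"), times = 5)    # two levels over ten rows

unique_level_share(ids)     # 1.0, above 0.5, so it would be flagged
unique_level_share(group)   # 0.2, below 0.5, so it would pass
```

High-cardinality columns such as identifiers tend to score close to 1 on this measure, which is why they are the typical candidates for this flag.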

NA_Rate

Set to NULL to not run NA_Rate(). Checks whether the percentage of NAs in each feature is greater than the value you supply. If it is, the feature name is returned with the percentage of NA values.
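The same comparison can be approximated in base R. This is a minimal sketch on a made-up data frame, not the package's own code.

```r
df <- data.frame(
  a = c(1, NA, 3, NA, 5),          # 2 of 5 values missing
  b = c("x", "y", NA, "z", "w"),   # 1 of 5 values missing
  c = 1:5                          # no missing values
)

na_share <- colMeans(is.na(df))    # share of NA values per column
flagged  <- names(na_share)[na_share > 0.2]
flagged                            # "a"
```

Column `b` sits exactly at 0.2 and is not flagged, because the description specifies a strict "greater than" comparison.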

Zero_Rate

Set to NULL to not run Zero_Rate(). Checks whether the percentage of zeros in each feature is greater than the value you supply. If it is, the feature name is returned with the percentage of zero values.
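A comparable zero-share calculation might look like the sketch below. Two assumptions are made here that the page does not confirm: only numeric columns are inspected, and NA values are left out of the denominator. The helper `zero_share` is hypothetical.

```r
# Hypothetical helper, not part of RemixAutoML: share of zeros among the
# non-missing values of a numeric vector.
zero_share <- function(x) mean(x == 0, na.rm = TRUE)

df <- data.frame(
  sparse = c(0, 0, 0, 4, 0),            # 4 of 5 values are zero
  dense  = c(2, 3, 0, 1, 5),            # 1 of 5 values is zero
  label  = c("a", "b", "c", "d", "e")   # non-numeric, skipped in this sketch
)

num_cols <- names(df)[vapply(df, is.numeric, logical(1))]
zs       <- vapply(df[num_cols], zero_share, numeric(1))
flagged  <- names(zs)[zs > 0.2]
flagged                                 # "sparse"
```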

HighSkewThresh

Set to NULL to not run HighSkew(). Checks for numeric columns whose ratio of the sum of the top 5th percentile of values to the bottom 95th percentile of values is greater than the value you supply. If it is, the column name and value are returned.
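To make the ratio concrete, here is a sketch under stated assumptions: the column is split at its empirical 95th percentile, and the sum of the values above the cutoff is divided by the sum of the values at or below it. The helper `tail_sum_ratio` is hypothetical, and the package may split the data or define the denominator differently.

```r
# Hypothetical helper, not part of RemixAutoML.
tail_sum_ratio <- function(x, p = 0.95) {
  x <- x[!is.na(x)]
  # type = 1 returns an observed data value, so the lower group is never empty
  cutoff <- stats::quantile(x, probs = p, names = FALSE, type = 1)
  sum(x[x > cutoff]) / sum(x[x <= cutoff])
}

x <- c(rep(1, 96), rep(100000, 4))   # 4% extreme values
ratio <- tail_sum_ratio(x)           # 400000 / 96, roughly 4167
ratio > 10                           # TRUE, so this column would be flagged
```

The ratio is only meaningful for non-negative data; with mixed signs the denominator can approach zero or turn negative.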

Value

A data.table with the feature names in the first column and the metrics they failed to pass in the remaining columns

See Also

Other EDA: AutoWordFreq

Examples

# NOT RUN {
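library(RemixAutoML)

# Uniform random baseline column; the other example columns are derived from it
test <- data.table::data.table(RandomNum = runif(1000))
# Numeric column that is almost always 1 (near-zero variance)
test[, NearZeroVarEx := ifelse(runif(1000) > 0.99, runif(1), 1)]
# Factor column built from one sampled letter plus "FFF"
test[, CharUniqueEx := as.factor(ifelse(RandomNum < 0.95, sample(letters, size = 1), "FFF"))]
# Column that is NA in about 95% of rows
test[, NA_RateEx := ifelse(RandomNum < 0.95, NA, "A")]
# Column that is zero in about 95% of rows
test[, ZeroRateEx := ifelse(RandomNum < 0.95, 0, runif(1))]
# Column with about 4% extreme values (heavy right tail)
test[, HighSkewThreshEx := ifelse(RandomNum > 0.96, 100000, 1)]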
ProblematicFeatures(test,
                    ColumnNumbers = 2:ncol(test),
                    NearZeroVarThresh = 0.05,
                    CharUniqThresh = 0.50,
                    NA_Rate = 0.20,
                    Zero_Rate = 0.20,
                    HighSkewThresh = 10)
# }
