ProblematicFeatures identifies features that are likely to cause problems in machine learning. It returns a data.table with the feature names in the first column and the metrics each feature failed in the remaining columns.
ProblematicFeatures(
data,
ColumnNumbers = c(1:ncol(data)),
NearZeroVarThresh = 0.05,
CharUniqThresh = 0.5,
NA_Rate = 0.2,
Zero_Rate = 0.2,
HighSkewThresh = 10
)
data
The data.table containing the columns you wish to have analyzed.
ColumnNumbers
A vector with the column numbers you wish to analyze.
NearZeroVarThresh
Set to NULL to skip this check. Checks whether the percentage of non-constant values in each numeric column is greater than the value you set here. If it is not, the feature is collected and returned with its percentage of unique values.
CharUniqThresh
Set to NULL to skip this check. Checks whether the percentage of unique levels / groups in a categorical feature is greater than the value you supply. If it is, the feature name is returned with the percentage of unique values.
NA_Rate
Set to NULL to skip this check. Checks whether the percentage of NAs in a feature is greater than the value you supply. If it is, the feature name is returned with the percentage of NA values.
Zero_Rate
Set to NULL to skip this check. Checks whether the percentage of zeros in a feature is greater than the value you supply. If it is, the feature name is returned with the percentage of zero values.
HighSkewThresh
Set to NULL to skip this check. Checks for numeric columns whose ratio of the sum of the top 5th percentile of values to the sum of the bottom 95th percentile of values is greater than the value you supply. If it is, the column name and the ratio are returned.
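The threshold checks described above can be approximated in a few lines of base R. This is only a sketch of one reading of the descriptions, not the package's internal code: the helper names (na_rate, zero_rate, unique_rate, top_tail_ratio) and the toy data are hypothetical, and the near-zero-variance check is left out because its exact definition is not clear from the description.

```r
# Illustrative helpers only; these are NOT functions from the package.
na_rate     <- function(x) mean(is.na(x))                  # share of NA values
zero_rate   <- function(x) mean(x == 0, na.rm = TRUE)      # share of zeros among non-NA values
unique_rate <- function(x) length(unique(x)) / length(x)   # distinct values per row

# Assumed reading of the skew check: sum of values above the 95th
# percentile divided by the sum of values at or below it.
top_tail_ratio <- function(x, p = 0.95) {
  x <- x[!is.na(x)]
  cutoff <- stats::quantile(x, probs = p, names = FALSE)
  sum(x[x > cutoff]) / sum(x[x <= cutoff])
}

# Deterministic toy columns (hypothetical data), 1,000 rows each
toy <- data.frame(
  mostly_na   = c(rep(NA, 950), rep(1, 50)),
  mostly_zero = c(rep(0, 950), rep(1, 50)),
  id_like     = paste0("id_", seq_len(1000)),
  skewed      = c(rep(1, 960), rep(100000, 40)),
  stringsAsFactors = FALSE
)

# One metric per check, named after the argument that sets its threshold
rates <- c(
  NA_Rate        = na_rate(toy$mostly_na),
  Zero_Rate      = zero_rate(toy$mostly_zero),
  CharUniqThresh = unique_rate(toy$id_like),
  HighSkewThresh = top_tail_ratio(toy$skewed)
)

# Default thresholds from the usage section
thresholds <- c(NA_Rate = 0.2, Zero_Rate = 0.2, CharUniqThresh = 0.5, HighSkewThresh = 10)

# A feature is flagged when its metric exceeds the threshold
flags <- rates > thresholds[names(rates)]
print(flags)
```

Each toy column is built to exceed exactly one threshold, so the comparison shows how a single metric decides whether a feature is reported.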
A data.table with the feature names in the first column and the metrics they failed to pass in the remaining columns.
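Because the function description says the feature names sit in the first column of the result, one plausible follow-up is to drop the flagged features before modeling. The sketch below uses a hand-built data.frame as a stand-in for the returned table; its column names (ColName, NA_Rate, Zero_Rate) and the dat object are assumptions made for illustration, not the package's documented output.

```r
# Hypothetical stand-in for the returned table: feature names in the
# first column, one column per failed metric (column names are assumed).
flagged <- data.frame(
  ColName   = c("NA_RateEx", "ZeroRateEx"),
  NA_Rate   = c(0.95, NA),
  Zero_Rate = c(NA, 0.95),
  stringsAsFactors = FALSE
)

# Hypothetical modeling data containing the two flagged features
dat <- data.frame(
  Keep1      = 1:4,
  NA_RateEx  = c(NA, NA, NA, 1),
  ZeroRateEx = c(0, 0, 0, 2),
  Keep2      = letters[1:4],
  stringsAsFactors = FALSE
)

# Read the flagged names from the first column and keep everything else
flagged_names <- as.character(flagged[[1]])
clean <- dat[, setdiff(names(dat), flagged_names), drop = FALSE]
print(names(clean))
```

Indexing the first column by position rather than by name avoids depending on what that column is actually called.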
Other EDA: AutoCorrAnalysis(), AutoWordFreq(), BNLearnArcStrength()
# NOT RUN {
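# Base column of 1,000 uniform random draws
test <- data.table::data.table(RandomNum = runif(1000))
# Nearly constant numeric column: about 1% of rows get one alternative value
test[, NearZeroVarEx := ifelse(runif(1000) > 0.99, runif(1), 1)]
# Factor column: one randomly chosen letter for about 95% of rows, "FFF" otherwise
test[, CharUniqueEx := as.factor(ifelse(RandomNum < 0.95, sample(letters, size = 1), "FFF"))]
# About 95% of rows are NA
test[, NA_RateEx := ifelse(RandomNum < 0.95, NA, "A")]
# About 95% of rows are zero
test[, ZeroRateEx := ifelse(RandomNum < 0.95, 0, runif(1))]
# About 4% of rows hold an extreme value, the rest are 1
test[, HighSkewThreshEx := ifelse(RandomNum > 0.96, 100000, 1)]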
ProblematicFeatures(
test,
ColumnNumbers = 2:ncol(test),
NearZeroVarThresh = 0.05,
CharUniqThresh = 0.50,
NA_Rate = 0.20,
Zero_Rate = 0.20,
HighSkewThresh = 10)
# }