RemixAutoML (version 0.11.0)

ProblematicRecords: ProblematicRecords identifies problematic records for further investigation

Description

ProblematicRecords identifies problematic records for further investigation. It returns a data.table with 3 additional columns at the beginning of the data.table: PredictedOutlier (0 = not an outlier, 1 = outlier), predict (raw H2O predicted value from Isolation Forest), and mean_length (the mean number of splits)

Usage

ProblematicRecords(data, ColumnNumbers = NULL, Threshold = 0.975,
  MaxMem = "28G", NThreads = -1, NTrees = 100,
  SampleRate = (sqrt(5) - 1)/2)

Arguments

data

The data.table with the columns you wish to have analyzed

ColumnNumbers

A vector with the column numbers you wish to analyze

Threshold

Quantile value to find the cutoff value for classifying outliers

MaxMem

Specify the amount of memory to allocate to H2O. E.g. "28G"

NThreads

Specify the number of threads to use (e.g. cores * 2)

NTrees

Specify the number of decision trees to build

SampleRate

Specify the row sample rate per tree
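
To make the Threshold and SampleRate defaults more concrete, here is a minimal base R sketch. The `scores` vector is simulated and only stands in for the raw model output, and the strict `>` comparison is an assumption; the rule the function applies internally may differ.

```r
# Hypothetical anomaly scores standing in for the raw model output
set.seed(42)
scores <- runif(1000)

# Cutoff at the 97.5th percentile, mirroring Threshold = 0.975
cutoff <- stats::quantile(scores, probs = 0.975, names = FALSE)

# Flag records whose score exceeds the cutoff (assumed rule)
flagged <- as.integer(scores > cutoff)

# Share of records flagged: about 2.5% for this threshold
mean(flagged)

# The default SampleRate evaluates to roughly 0.618,
# i.e. each tree would sample about 61.8% of the rows
sample_rate <- (sqrt(5) - 1) / 2
sample_rate
```

Raising Threshold toward 1 flags fewer records; lowering it flags more.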

Value

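A data.table containing the input data with the PredictedOutlier, predict, and mean_length columns added at the beginning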
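
A sketch of how the returned table might be inspected is shown here. It builds a small mock data.table rather than calling ProblematicRecords, so the values are made up, and it assumes that a higher predict value means a more anomalous record.

```r
library(data.table)

# Mock result with the three leading columns named in the Description;
# all values are invented for illustration only
Outliers <- data.table(
  PredictedOutlier      = c(0L, 1L, 0L, 1L),
  predict               = c(0.10, 0.92, 0.35, 0.88),
  mean_length           = c(7.4, 3.1, 6.8, 3.5),
  Independent_Variable1 = c(-0.2, -4.1, -0.7, -3.9)
)

# Keep only the flagged records, highest score first (assumed ordering)
flagged <- Outliers[PredictedOutlier == 1][order(-predict)]
print(flagged)
```

The same filter applies to the real output, since PredictedOutlier is coded 0/1.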

See Also

Other Unsupervised Learning: AutoKMeans, GenTSAnomVars, ResidualOutliers

Examples

# NOT RUN {
# Simulate correlated features derived from a uniform target
Correl <- 0.85
N <- 10000
data <- data.table::data.table(Target = runif(N))
data[, x1 := qnorm(Target)]
data[, x2 := runif(N)]
data[, Independent_Variable1 := log(pnorm(Correl * x1 +
                                           sqrt(1-Correl^2) * qnorm(x2)))]
data[, Independent_Variable2 := (pnorm(Correl * x1 +
                                        sqrt(1-Correl^2) * qnorm(x2)))]
data[, Independent_Variable3 := exp(pnorm(Correl * x1 +
                                           sqrt(1-Correl^2) * qnorm(x2)))]
data[, Independent_Variable4 := exp(exp(pnorm(Correl * x1 +
                                               sqrt(1-Correl^2) * qnorm(x2))))]
data[, Independent_Variable5 := sqrt(pnorm(Correl * x1 +
                                            sqrt(1-Correl^2) * qnorm(x2)))]
data[, Independent_Variable6 := (pnorm(Correl * x1 +
                                        sqrt(1-Correl^2) * qnorm(x2)))^0.10]
data[, Independent_Variable7 := (pnorm(Correl * x1 +
                                        sqrt(1-Correl^2) * qnorm(x2)))^0.25]
data[, Independent_Variable8 := (pnorm(Correl * x1 +
                                        sqrt(1-Correl^2) * qnorm(x2)))^0.75]
data[, Independent_Variable9 := (pnorm(Correl * x1 +
                                        sqrt(1-Correl^2) * qnorm(x2)))^2]
data[, Independent_Variable10 := (pnorm(Correl * x1 +
                                         sqrt(1-Correl^2) * qnorm(x2)))^4]
data[, Target := as.factor(
 ifelse(Independent_Variable2 < 0.20, "A",
        ifelse(Independent_Variable2 < 0.40, "B",
               ifelse(Independent_Variable2 < 0.6,  "C",
                      ifelse(Independent_Variable2 < 0.8,  "D", "E")))))]
data[, Independent_Variable11 := as.factor(
 ifelse(Independent_Variable2 < 0.15, "A",
        ifelse(Independent_Variable2 < 0.45, "B",
               ifelse(Independent_Variable2 < 0.65,  "C",
                      ifelse(Independent_Variable2 < 0.85,  "D", "E")))))]
data[, ':=' (x1 = NULL, x2 = NULL)]
Outliers <- ProblematicRecords(data,
                              ColumnNumbers = NULL,
                              Threshold = 0.95,
                              MaxMem = "28G",
                              NThreads = -1)
# }