RemixAutoML (version 0.5.4)

H2OIsolationForest: H2OIsolationForest

Description

H2OIsolationForest for dimensionality reduction and/or anomaly detection

Usage

H2OIsolationForest(
  data,
  Features = NULL,
  IDcols = NULL,
  ModelID = "TestModel",
  SavePath = NULL,
  Threshold = 0.975,
  MaxMem = "28G",
  NThreads = -1,
  NTrees = 100,
  MaxDepth = 8,
  MinRows = 1,
  RowSampleRate = (sqrt(5) - 1)/2,
  ColSampleRate = 1,
  ColSampleRatePerLevel = 1,
  ColSampleRatePerTree = 1,
  CategoricalEncoding = c("AUTO"),
  Debug = FALSE
)

Arguments

data

The data.table with the columns you wish to have analyzed

Features

A character vector with the column names to utilize in the isolation forest

IDcols

A character vector of column names to exclude from the isolation forest but return with the output data. Columns listed in neither Features nor IDcols are removed

ModelID

Name for model that gets saved to file if SavePath is supplied and valid

SavePath

Path to the directory where the model is saved

Threshold

Quantile (between 0 and 1) used to find the cutoff value for classifying outliers

MaxMem

Specify the amount of memory to allocate to H2O. E.g. "28G"

NThreads

Specify the number of threads (E.g. cores * 2)

NTrees

Specify the number of decision trees to build

MaxDepth

Max tree depth

MinRows

Minimum number of rows allowed per leaf

RowSampleRate

Fraction of rows to sample per tree. The default, (sqrt(5) - 1)/2, is approximately 0.618

ColSampleRate

Column sample rate for each split

ColSampleRatePerLevel

Column sample rate for each level

ColSampleRatePerTree

Column sample rate per tree

CategoricalEncoding

Choose from "AUTO", "Enum", "OneHotInternal", "OneHotExplicit", "Binary", "Eigen", "LabelEncoder", "SortByResponse", "EnumLimited"

Debug

Logical. Set to TRUE for debugging
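
To illustrate how a quantile such as Threshold can turn scores into an outlier flag, here is a minimal base R sketch. It assumes the cutoff is simply the Threshold quantile of the anomaly scores, which is what the argument description suggests but is not confirmed against the package source. The scores vector is simulated and the names scores, cutoff and is_outlier are hypothetical, not package output.

```r
# Hypothetical anomaly scores (simulated; NOT produced by H2OIsolationForest)
set.seed(42)
scores <- runif(1000)

# Same value as the documented default for Threshold
Threshold <- 0.975

# Assumption: the cutoff is the Threshold quantile of the scores
cutoff <- stats::quantile(scores, probs = Threshold, names = FALSE)

# Flag every row whose score lies above the cutoff
is_outlier <- scores > cutoff

# Share of rows flagged; close to 1 - Threshold by construction
mean(is_outlier)
```

Raising Threshold toward 1 flags fewer rows; lowering it flags more.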

Value

The source data.table with predictions appended. Note that any column listed in neither Features nor IDcols will not be returned with the data. If you want columns returned but not modeled, supply them as IDcols
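
The column rule above can be checked before calling the function. The following base R sketch uses a hypothetical data.frame with made-up column names (ID, x1, x2, Notes) and only shows which input columns would be kept or dropped under that rule. It does not call RemixAutoML and says nothing about the order or names of the prediction columns.

```r
# Hypothetical input; the column names are illustrative only
df <- data.frame(
  ID    = 1:5,
  x1    = c(0.1, 0.4, 0.3, 0.9, 0.5),
  x2    = c(1.2, 0.7, 0.8, 3.1, 1.0),
  Notes = letters[1:5]
)

Features <- c("x1", "x2")  # columns to model
IDcols   <- "ID"           # columns to carry through without modeling

# Input columns that survive: those named in IDcols or Features
kept <- intersect(names(df), c(IDcols, Features))

# Input columns that would be dropped from the returned data
dropped <- setdiff(names(df), kept)

print(kept)
print(dropped)
```

Here Notes appears in neither vector, so it would need to be added to IDcols to be returned.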

See Also

Other Unsupervised Learning: AutoClusteringScoring(), AutoClustering(), GenTSAnomVars(), H2OIsolationForestScoring(), ResidualOutliers()

Examples

# NOT RUN {
# Create simulated data
data <- RemixAutoML::FakeDataGenerator(
  Correlation = 0.70,
  N = 50000,
  ID = 2L,
  FactorCount = 2L,
  AddDate = TRUE,
  ZIP = 0L,
  TimeSeries = FALSE,
  ChainLadderData = FALSE,
  Classification = FALSE,
  MultiClass = FALSE)

# Run algo
data <- RemixAutoML::H2OIsolationForest(
  data,
  Features = names(data)[2L:ncol(data)],
  IDcols = c("Adrian", "IDcol_1", "IDcol_2"),
  ModelID = "Adrian",
  SavePath = getwd(),
  Threshold = 0.95,
  MaxMem = "28G",
  NThreads = -1,
  NTrees = 100,
  MaxDepth = 8,
  MinRows = 1,
  RowSampleRate = (sqrt(5)-1)/2,
  ColSampleRate = 1,
  ColSampleRatePerLevel = 1,
  ColSampleRatePerTree = 1,
  CategoricalEncoding = c("AUTO"),
  Debug = TRUE)

# Remove output from data and then score
data[, eval(names(data)[17:ncol(data)]) := NULL]

# Score the data with the saved model
Outliers <- RemixAutoML::H2OIsolationForestScoring(
  data,
  Features = names(data)[2:ncol(data)],
  IDcols = c("Adrian", "IDcol_1", "IDcol_2"),
  H2OStart = TRUE,
  H2OShutdown = TRUE,
  ModelID = "Adrian",
  SavePath = getwd(),
  Threshold = 0.95,
  MaxMem = "28G",
  NThreads = -1,
  Debug = FALSE)
# }
