
H2OIsolationForest for dimensionality reduction and/or anomaly detection
H2OIsolationForest(
  data,
  Features = NULL,
  IDcols = NULL,
  ModelID = "TestModel",
  SavePath = NULL,
  Threshold = 0.975,
  MaxMem = "28G",
  NThreads = -1,
  NTrees = 100,
  MaxDepth = 8,
  MinRows = 1,
  RowSampleRate = (sqrt(5) - 1)/2,
  ColSampleRate = 1,
  ColSampleRatePerLevel = 1,
  ColSampleRatePerTree = 1,
  CategoricalEncoding = c("AUTO"),
  Debug = FALSE
)
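The default for RowSampleRate in the signature above is written as an expression rather than a number. A minimal sketch of what it evaluates to, assuming the value is used as a fraction of rows per tree; `n_rows` is a hypothetical row count chosen to match the example data, and how the sampled row count is rounded internally is not stated on this page:

```r
# Default RowSampleRate from the usage block: the golden ratio conjugate.
RowSampleRate <- (sqrt(5) - 1) / 2
round(RowSampleRate, 4)

# Hypothetical data size (matches N = 50000 in the example data).
n_rows <- 50000

# Approximate number of rows each tree would see, assuming the rate is
# applied as a simple fraction of the rows.
approx_rows_per_tree <- RowSampleRate * n_rows
round(approx_rows_per_tree)
```

With these assumptions each tree would be built on roughly 62% of the rows.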
data: The data.table with the columns you wish to have analyzed
Features: A character vector of the column names to use in the isolation forest
IDcols: A character vector of column names that are not used in the isolation forest but are returned with the output data. Columns listed in neither Features nor IDcols are removed
ModelID: Name for the model that gets saved to file if SavePath is supplied and valid
SavePath: Path to the directory where the saved model is stored
Threshold: Quantile used to find the cutoff value for classifying outliers
MaxMem: Specify the amount of memory to allocate to H2O, e.g. "28G"
NThreads: Specify the number of threads (e.g. cores * 2)
NTrees: Specify the number of decision trees to build
MaxDepth: Maximum tree depth
MinRows: Minimum number of rows allowed per leaf
RowSampleRate: Fraction of rows to sample per tree
ColSampleRate: Column sample rate for each split
ColSampleRatePerLevel: Column sample rate for each level
ColSampleRatePerTree: Column sample rate per tree
CategoricalEncoding: Choose from "AUTO", "Enum", "OneHotInternal", "OneHotExplicit", "Binary", "Eigen", "LabelEncoder", "SortByResponse", "EnumLimited"
Debug: Set to TRUE for debugging
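The Threshold argument is a quantile level, not a raw score. A minimal base R sketch of how a quantile cutoff can turn anomaly scores into outlier flags is given here; the `scores` vector is simulated, the "higher means more anomalous" direction is an assumption, and the exact comparison used inside H2OIsolationForest may differ:

```r
# Hypothetical anomaly scores; higher is assumed to mean more anomalous.
set.seed(42)
scores <- c(runif(995, min = 0, max = 0.5), runif(5, min = 0.9, max = 1))

# Threshold is a quantile level: 0.975 puts the cutoff at the 97.5th
# percentile of the observed scores.
Threshold <- 0.975
cutoff <- quantile(scores, probs = Threshold, names = FALSE)

# Flag rows whose score lies above the cutoff.
is_outlier <- scores > cutoff
sum(is_outlier)
```

Lowering Threshold (for example to 0.95, as in the example call) moves the cutoff down and flags a larger share of rows.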
The source data.table with predictions added. Note that any columns listed in neither Features nor IDcols are not returned with the data. If you want columns returned but not modeled, supply them as IDcols
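The column rule described above can be sketched in base R. This uses a plain data.frame with made-up column names instead of a data.table, and it only illustrates which input columns would survive; it is not the function's internal code:

```r
# Toy data standing in for the input (all column names are hypothetical).
df <- data.frame(
  IDcol_1 = 1:3,
  x1 = c(0.1, 0.5, 0.9),
  x2 = c(1.2, 0.7, 0.3),
  Notes = c("a", "b", "c")
)
Features <- c("x1", "x2")
IDcols <- c("IDcol_1")

# Columns in neither Features nor IDcols are dropped from the output.
kept <- union(IDcols, Features)
dropped <- setdiff(names(df), kept)
result <- df[, kept, drop = FALSE]

names(result)
dropped
```

Here `Notes` is dropped because it appears in neither vector; adding it to `IDcols` would keep it in the output without modeling it.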
Other Unsupervised Learning: AutoClusteringScoring(), AutoClustering(), GenTSAnomVars(), H2OIsolationForestScoring(), ResidualOutliers()
# NOT RUN {
# Create simulated data
data <- RemixAutoML::FakeDataGenerator(
  Correlation = 0.70,
  N = 50000,
  ID = 2L,
  FactorCount = 2L,
  AddDate = TRUE,
  ZIP = 0L,
  TimeSeries = FALSE,
  ChainLadderData = FALSE,
  Classification = FALSE,
  MultiClass = FALSE)

# Run algo
data <- RemixAutoML::H2OIsolationForest(
  data,
  Features = names(data)[2L:ncol(data)],
  IDcols = c("Adrian", "IDcol_1", "IDcol_2"),
  ModelID = "Adrian",
  SavePath = getwd(),
  Threshold = 0.95,
  MaxMem = "28G",
  NThreads = -1,
  NTrees = 100,
  MaxDepth = 8,
  MinRows = 1,
  RowSampleRate = (sqrt(5)-1)/2,
  ColSampleRate = 1,
  ColSampleRatePerLevel = 1,
  ColSampleRatePerTree = 1,
  CategoricalEncoding = c("AUTO"),
  Debug = TRUE)

# Remove the model output columns from data before scoring
data[, eval(names(data)[17:ncol(data)]) := NULL]

# Score the data with the saved model
Outliers <- RemixAutoML::H2OIsolationForestScoring(
  data,
  Features = names(data)[2:ncol(data)],
  IDcols = c("Adrian", "IDcol_1", "IDcol_2"),
  H2OStart = TRUE,
  H2OShutdown = TRUE,
  ModelID = "Adrian",  # must match the ModelID used when the model was saved
  SavePath = getwd(),
  Threshold = 0.95,
  MaxMem = "28G",
  NThreads = -1,
  Debug = FALSE)
# }