solitude (version 1.0.0)

isolationForest: Fit an Isolation Forest

Description

The 'solitude' class implements the isolation forest method introduced in the paper Isolation-Based Anomaly Detection (Liu, Ting and Zhou <doi:10.1145/2133360.2133363>). The extremely randomized trees (extratrees) required to build the isolation forest are grown using the ranger function from the ranger package.

Arguments

Design

$new() initiates a new 'solitude' object. The possible arguments are:

  • num_trees: (positive integer, default = 100) Number of trees to be built in the forest

  • sample_fraction: ((0, 1], default = 1) Fraction of the dataset to be sampled or bootstrapped per tree. See 'sample.fraction' argument in ranger

  • replace: (boolean, default = FALSE) Whether the sample of observations for each tree should be chosen with replacement. See 'replace' argument in ranger

  • seed: (positive integer, default = 101) Random seed for the forest

  • nproc: (positive integer, default: one less than the maximum number of cores available) Number of parallel threads to be used by ranger

  • respect_unordered_factors: (string, default: "partition") See 'respect.unordered.factors' argument in ranger

$fit() fits an isolation forest to the given data frame, computes the depths of the terminal nodes of each tree, and stores the anomaly scores and average depth values in the $scores object as a data.table

$predict() returns anomaly scores for new data as a data.table
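
To make the link between average depth and anomaly score concrete, here is a minimal base R sketch of the score as defined in the cited paper by Liu, Ting and Zhou. The function names average_path_length and anomaly_score are hypothetical, and it is an assumption that solitude's internal computation matches this formula exactly.

```r
# Sketch of the paper's anomaly score; function names are hypothetical
# and not part of the solitude API.

# Normalising constant c(n): average path length of an unsuccessful
# search in a binary search tree built on n points.
average_path_length <- function(n) {
  stopifnot(length(n) == 1, n >= 1)
  if (n == 1) return(0)
  if (n == 2) return(1)
  harmonic <- log(n - 1) + 0.5772156649  # approximation of H(n - 1)
  2 * harmonic - 2 * (n - 1) / n
}

# Score in (0, 1]: a shallow average depth gives a score closer to 1.
anomaly_score <- function(average_depth, n) {
  2^(-average_depth / average_path_length(n))
}

n <- 256
# Depths: shallow, exactly c(n), and deep
scores <- anomaly_score(c(2, average_path_length(n), 20), n)
print(scores)
```

An observation whose average depth equals c(n) scores exactly 0.5; shallower observations score higher and deeper ones lower.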

Methods

Public methods

Method new()

Usage

isolationForest$new(
  num_trees = 100,
  replace = FALSE,
  sample_fraction = 1,
  respect_unordered_factors = "partition",
  seed = 101,
  nproc = parallel::detectCores() - 1
)

Method fit()

Usage

isolationForest$fit(dataset)

Method predict()

Usage

isolationForest$predict(data)

Method clone()

The objects of this class are cloneable with this method.

Usage

isolationForest$clone(deep = FALSE)

Arguments

deep

Whether to make a deep clone.

Details

  • Parallelization: ranger is parallelized and by default uses all cores but one. The process of obtaining depths of terminal nodes (which is executed when $fit() is called) may be parallelized separately by setting up a future backend.
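
As a rough illustration of the two levels of parallelism described above, the sketch below sets a future backend before fitting. It assumes the 'future' package is installed; the worker and thread counts and the use of the built-in iris data are illustrative choices, not recommendations from the package.

```r
# Assumes the 'solitude' and 'future' packages are installed.
library(solitude)

# Backend for the terminal-node depth computation (2 workers is illustrative)
future::plan(future::multisession, workers = 2)

isf <- isolationForest$new(nproc = 2)  # threads used by ranger
isf$fit(iris[, 1:4])                   # numeric columns of a built-in dataset
print(head(isf$scores))

# Restore sequential processing afterwards
future::plan(future::sequential)
```

Resetting the plan at the end keeps later code in the same session from silently running on the multisession backend.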

Examples

# NOT RUN {
data("humus", package = "mvoutlier")
columns_required = setdiff(colnames(humus)
                           , c("Cond", "ID", "XCOO", "YCOO", "LOI")
                           )
humus2 = humus[ , columns_required]
set.seed(1)
index = sample(nrow(humus2), ceiling(nrow(humus2) * 0.5))
isf = isolationForest$new()  # initiate
isf$fit(humus2[index, ])     # fit on a random 50% of the data
isf$scores                   # obtain anomaly scores

# scores closer to 1 might indicate outliers
plot(density(isf$scores$anomaly_score))

isf$predict(humus2[-index, ]) # scores for new data
# }
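
Since scores closer to 1 might indicate outliers, one simple way to act on them is to flag the observations above a high quantile. The sketch below does this on the built-in iris data; the 95% cutoff is an arbitrary, hypothetical choice and should be tuned to the problem at hand.

```r
# Assumes the 'solitude' package is installed; the cutoff is illustrative.
library(solitude)

isf <- isolationForest$new(seed = 101)
isf$fit(iris[, 1:4])

scores <- isf$scores
threshold <- quantile(scores$anomaly_score, 0.95)  # hypothetical cutoff

# Rows whose anomaly score exceeds the cutoff
flagged <- scores[scores$anomaly_score > threshold, ]
print(flagged)
```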