isolationForest: Fit an Isolation Forest

Description

The 'solitude' class implements the isolation forest method introduced in the paper Isolation-Based Anomaly Detection (Liu, Ting and Zhou <doi:10.1145/2133360.2133363>). The extremely randomized trees (extratrees) required to build the isolation forest are grown using the ranger function from the ranger package.

Design

$new() initiates a new 'solitude' object. The possible arguments are:

num_trees: (positive integer, default = 100) Number of trees to be built in the forest
sample_fraction: ((0, 1], default = 1) Fraction of the dataset to be sampled or bootstrapped per tree. See 'sample.fraction' argument in ranger
replace: (boolean, default = FALSE) Whether the sample of observations for each tree should be chosen with replacement. See 'replace' argument in ranger
seed: (positive integer, default = 101) Random seed for the forest
nproc: (positive integer, default = one less than the maximum number of cores available) Number of parallel threads to be used by ranger
respect_unordered_factors: (string, default: "partition") See 'respect.unordered.factors' argument in ranger

$fit() fits an isolation forest to the given data frame, computes the depths of the terminal nodes of each tree, and stores the anomaly scores and average depth values in the $scores object as a data.table

$predict() returns anomaly scores for new data as a data.table
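To make the relation between the average depth and the anomaly score stored by $fit() concrete, here is a minimal base R sketch of the scoring rule defined in the Liu, Ting and Zhou paper. The function names c_factor and anomaly_score are hypothetical, and it is an assumption that solitude normalizes depths in exactly this way.

```r
# Average path length of an unsuccessful search in a binary search tree
# built on n points; the paper uses it to normalise isolation depths.
c_factor <- function(n) {
  if (n <= 1) return(0)
  if (n == 2) return(1)
  harmonic <- log(n - 1) + 0.5772156649  # approximation of H(n - 1)
  2 * harmonic - 2 * (n - 1) / n
}

# Map an average isolation depth to a score in (0, 1].
# Shallow average depths (points that are easy to isolate) score higher.
anomaly_score <- function(average_depth, n) {
  2^(-average_depth / c_factor(n))
}

n <- 256
anomaly_score(c(2, c_factor(n), 20), n)
```

Under this rule, a point whose average depth equals the normalizing constant scores exactly 0.5, and points isolated after only a few splits score closer to 1.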

Methods

Public methods

Method `new()`

Usage

isolationForest$new(
  num_trees = 100,
  replace = FALSE,
  sample_fraction = 1,
  respect_unordered_factors = "partition",
  seed = 101,
  nproc = parallel::detectCores() - 1
)
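As an illustration of the constructor arguments, the following sketch builds a forest with non-default settings and fits it to the built-in mtcars data. It assumes the installed version of solitude exposes the arguments documented on this page; the specific values are arbitrary choices for the example.

```r
library(solitude)

# Assumption: the installed solitude version accepts the constructor
# arguments documented on this page.
isf <- isolationForest$new(
  num_trees       = 200,    # more trees than the default of 100
  sample_fraction = 0.5,    # each tree sees half of the rows
  replace         = FALSE,  # sample rows without replacement
  seed            = 42,     # fix the seed for reproducibility
  nproc           = 2       # cap ranger at two threads
)

isf$fit(mtcars)             # mtcars ships with base R
head(isf$scores)            # average depths and anomaly scores
```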

Method `fit()`

Usage

isolationForest$fit(dataset)

Method `predict()`

Usage

isolationForest$predict(data)

Method `clone()`

The objects of this class are cloneable with this method.

Usage

isolationForest$clone(deep = FALSE)

Arguments

deep: Whether to make a deep clone.

Details

Parallelization: ranger is parallelized and by default uses all cores but one. The process of obtaining the depths of terminal nodes (which is executed when $fit() is called) may be parallelized separately by setting up a future backend.
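A minimal sketch of setting up a future backend before fitting is shown below. It assumes the future package is installed; multisession with two workers is just one possible backend, and whether it speeds up the depth computation depends on the installed solitude version and the size of the data.

```r
library(solitude)

# Assumption: the 'future' package is installed. multisession starts
# background R sessions that can serve as parallel workers.
future::plan(future::multisession, workers = 2)

isf <- isolationForest$new(nproc = 2)  # threads used by ranger itself
isf$fit(mtcars)                        # depth computation may use the future workers

future::plan(future::sequential)       # restore the default sequential backend
```

The nproc argument and the future plan are independent settings: the first controls ranger's threads while growing trees, the second controls how the terminal node depths are computed.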

Examples

# NOT RUN {
data("humus", package = "mvoutlier")
columns_required = setdiff(colnames(humus)
                           , c("Cond", "ID", "XCOO", "YCOO", "LOI")
                           )
humus2 = humus[ , columns_required]
set.seed(1)
index = sample(nrow(humus2), ceiling(nrow(humus2) * 0.5))  # random half of the rows
isf = isolationForest$new()  # initiate
isf$fit(humus2[index, ])     # fit on 50% of the data
isf$scores                   # obtain anomaly scores

# scores closer to 1 might indicate outliers
plot(density(isf$scores$anomaly_score))

isf$predict(humus2[-index, ]) # scores for new data
# }
