isotree (version 0.1.8)

predict.isolation_forest: Predict method for Isolation Forest

Description

Predict method for Isolation Forest

Usage

# S3 method for isolation_forest
predict(object, newdata, type = "score",
  square_mat = FALSE, ...)

Arguments

object

An Isolation Forest object as returned by `isolation.forest`.

newdata

A data.frame, matrix, or sparse matrix (from package `Matrix` or `SparseM`, CSC format for distance and outlierness, or CSR format for outlierness and imputations) for which to predict outlierness, distance, or imputations of missing values. Note that when passing `type` = `"impute"` with a sparse matrix as `newdata`, the input might in some situations get modified in-place.

type

Type of prediction to output. Options are:

  • `"score"` for the standardized outlier score, where values closer to 1 indicate more outlierness, values closer to 0.5 indicate average outlierness, and values closer to 0 indicate more typical observations (harder to isolate).

  • `"avg_depth"` for the non-standardized average isolation depth.

  • `"dist"` for approximate pairwise distances (must pass more than 1 row). These are standardized in the same way as outlierness: values closer to 0 indicate nearer points, values closer to 1 indicate points further apart, and values closer to 0.5 indicate average distance.

  • `"avg_sep"` for the non-standardized average separation depth.

  • `"tree_num"` for the terminal node number for each tree - if choosing this option, will return a list containing both the outlier score and the terminal node numbers, under entries `score` and `tree_num`, respectively.

  • `"impute"` for imputation of missing values in `newdata`.
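
As a rough illustration of the first two options, the sketch below fits a small model on synthetic data and requests both the standardized score and the raw average depth. It assumes the `isotree` package is installed; the data, the variable names, and the choice of `ntrees = 50` are made up for the example.

```r
# Sketch only: assumes 'isotree' is installed; the data is synthetic.
if (requireNamespace("isotree", quietly = TRUE)) {
  set.seed(1)
  X <- matrix(rnorm(200), ncol = 2)   # 100 rows of ordinary points
  X <- rbind(X, c(8, 8))              # one far-away point in row 101

  # 'ntrees' is an argument of isolation.forest, not of predict
  model <- isotree::isolation.forest(X, ntrees = 50)

  scores <- predict(model, X, type = "score")      # standardized score
  depths <- predict(model, X, type = "avg_depth")  # non-standardized depth

  # Row with the highest outlier score
  print(which.max(scores))
}
```

Rows with a higher score should tend to have a smaller average depth, since points that are easier to isolate end up in shallower terminal nodes.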

square_mat

When passing `type` = `"dist"` or `"avg_sep"`, whether to return a full square matrix or just the upper-triangular part, in which the entry for pair 1 <= i < j <= n is located at position p(i, j) = ((i - 1) * (n - i/2) + j - i).
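
The position formula can be checked in base R without fitting any model. The sketch below is illustrative only: `tri_index`, `tri_vec`, and `full` are hypothetical names, and `tri_vec` holds made-up values standing in for an upper-triangular result. The diagonal of the rebuilt matrix is left at zero, which is an assumption of this example.

```r
# Position of pair (i, j), with 1 <= i < j <= n, in the upper-triangular vector
tri_index <- function(i, j, n) {
  stopifnot(all(i >= 1), all(i < j), all(j <= n))
  (i - 1) * (n - i / 2) + j - i
}

n <- 4
# All pairs in row-major order: (1,2), (1,3), (1,4), (2,3), (2,4), (3,4)
pairs <- t(combn(n, 2))
pos <- tri_index(pairs[, 1], pairs[, 2], n)

# Hypothetical upper-triangular vector with n * (n - 1) / 2 entries
tri_vec <- seq_len(n * (n - 1) / 2) / 10

# Rebuild a full symmetric matrix from the vector
full <- matrix(0, n, n)
full[pairs] <- tri_vec[pos]
full <- full + t(full)
```

For `n = 4` the six pairs map to positions 1 through 6 in order, so the vector stores the upper triangle row by row.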

...

Not used.

Value

The requested prediction type, which can be a vector with one entry per row in `newdata` (for output types `"score"`, `"avg_depth"`), a list with entries `score` and `tree_num` (for output type `"tree_num"`), a square matrix or vector with the upper triangular part of a square matrix (for output types `"dist"`, `"avg_sep"`), or the same type as the input `newdata` (for output type `"impute"`).

Details

The more threads that are set for the model, the higher the memory requirement will be, as each thread will allocate an array with one entry per row (outlierness) or per combination (distance).

Outlierness predictions for sparse data will be much slower than for dense data. It is not recommended to pass sparse matrices unless the data is too big to fit in memory in dense form.

Note that after loading a serialized object from `isolation.forest` through `readRDS` or `load`, it will only de-serialize the underlying C++ object upon running `predict`, `print`, or `summary`, so the first run will be slower, while subsequent runs will be faster as the C++ object will already be in-memory.
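
A minimal sketch of that round trip, assuming the `isotree` package is installed, might look as follows. The data and variable names are made up, and the timings are only indicative: on a model this small the difference between the first and second call may not be measurable.

```r
# Sketch only: assumes 'isotree' is installed; the data is synthetic.
if (requireNamespace("isotree", quietly = TRUE)) {
  set.seed(1)
  X <- matrix(rnorm(200), ncol = 2)
  model <- isotree::isolation.forest(X, ntrees = 50)

  # Serialize the model to a temporary file and load it back
  path <- tempfile(fileext = ".rds")
  saveRDS(model, path)
  restored <- readRDS(path)

  # The first call is expected to pay the de-serialization cost
  t_first  <- system.time(p1 <- predict(restored, X))[["elapsed"]]
  t_second <- system.time(p2 <- predict(restored, X))[["elapsed"]]
}
```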

In order to save memory when fitting and serializing models, the functionality for outputting terminal node numbers will generate index mappings on the fly for all tree nodes, even if passing only 1 row, so it's only recommended for batch predictions.

See Also

isolation.forest, unpack.isolation.forest