predict.isolation_forest: Predict method for Isolation Forest

Description

Predict method for Isolation Forest

Usage

# S3 method for isolation_forest
predict(
  object,
  newdata,
  type = "score",
  square_mat = FALSE,
  refdata = NULL,
  ...
)

Arguments

object

An Isolation Forest object as returned by `isolation.forest`.

newdata

A `data.frame`, `data.table`, `tibble`, `matrix`, or sparse matrix (from package `Matrix` or `SparseM`, CSC/dgCMatrix format for distance and outlierness, or CSR/dgRMatrix format for outlierness and imputations) for which to predict outlierness, distance, or imputations of missing values.

Note that when `type` = `"impute"` and `newdata` is a sparse matrix, it might be modified in-place in some situations.

Note also that, if using sparse matrices from package `Matrix`, converting to `dgRMatrix` might require using `as(m, "RsparseMatrix")` instead of naming `dgRMatrix` directly as the target class.
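The conversion mentioned above can be sketched as follows. This is a minimal illustration that assumes package `Matrix` is installed; the toy matrix and the variable names `m_csc` and `m_csr` are made up for the example.

```r
library(Matrix)

# Toy 3 x 2 sparse matrix; sparseMatrix() builds a CSC (dgCMatrix) object by default
m_csc <- sparseMatrix(
  i = c(1, 3, 2),
  j = c(1, 1, 2),
  x = c(1.5, 2, 3),
  dims = c(3, 2)
)

# Convert to CSR through the virtual class "RsparseMatrix"
m_csr <- as(m_csc, "RsparseMatrix")

class(m_csc)
class(m_csr)
```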

type

Type of prediction to output. Options are:

`"score"` for the standardized outlier score, where values closer to 1 indicate stronger outliers, values around 0.5 indicate average outlierness, and values closer to 0 indicate more ordinary points (harder to isolate).
`"avg_depth"` for the non-standardized average isolation depth.
`"dist"` for approximate pairwise or between-points distances (must pass more than 1 row). These are standardized in the same way as outlierness: values closer to 0 indicate nearer points, values closer to 1 indicate points further apart, and values around 0.5 indicate average distance.
`"avg_sep"` for the non-standardized average separation depth.
`"tree_num"` for the terminal node number for each tree - if choosing this option, will return a list containing both the outlier score and the terminal node numbers, under entries `score` and `tree_num`, respectively.
`"impute"` for imputation of missing values in `newdata`.
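As a rough illustration of the two outlierness options, the sketch below fits a small forest on made-up data and requests `"score"` and `"avg_depth"`. The data, the seed, and the small `ntrees` value are arbitrary choices for the example, not recommendations.

```r
library(isotree)

set.seed(1)
# 100 points from a standard normal, plus one point placed far out at (3, 3)
X <- matrix(rnorm(100 * 2), nrow = 100)
X <- rbind(X, c(3, 3))

# Small single-threaded forest to keep the example cheap
model <- isolation.forest(X, ntrees = 10, nthreads = 1)

# Standardized outlier score: one value per row of newdata
scores <- predict(model, X, type = "score")

# Non-standardized average isolation depth: one value per row of newdata
depths <- predict(model, X, type = "avg_depth")

# Index of the row with the highest outlier score
which.max(scores)
```

Rows that are easier to isolate get a smaller average depth and a higher standardized score, so the two outputs rank points in opposite directions.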

square_mat

When passing `type` = `"dist"` or `"avg_sep"` with no `refdata`, whether to return a full square matrix or just its upper-triangular part as a vector, in which the entry for pair (i,j) with 1 <= i < j <= n is located at position p(i, j) = ((i - 1) * (n - i/2) + j - i). Ignored when not predicting distance/separation or when passing `refdata`.
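The position formula can be checked with a few lines of base R. In this sketch, `tri_index` is a hypothetical helper written for the example (it is not part of the package), and the vector `d` holds placeholder values standing in for a real upper-triangular result.

```r
# Position of pair (i, j), with 1 <= i < j <= n, in the upper-triangular vector
tri_index <- function(i, j, n) {
  stopifnot(i >= 1, i < j, j <= n)
  (i - 1) * (n - i / 2) + j - i
}

n <- 4
# All pairs in row-major order: (1,2), (1,3), (1,4), (2,3), (2,4), (3,4)
pairs <- t(combn(n, 2))
idx <- mapply(tri_index, pairs[, 1], pairs[, 2], MoreArgs = list(n = n))
# idx is 1 2 3 4 5 6, so the vector is laid out row by row

# Placeholder vector of length n * (n - 1) / 2 standing in for a real result
d <- seq_len(n * (n - 1) / 2) * 0.1

# Rebuild a symmetric square matrix from the upper-triangular vector
full <- matrix(0, n, n)
full[cbind(pairs[, 1], pairs[, 2])] <- d[idx]
full <- full + t(full)
```

The diagonal is left at zero here purely as a convention of the sketch.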

refdata

If passing this and calculating distance or average separation depth, will calculate distances between each point in `newdata` and each point in `refdata`, outputting a matrix in which points in `newdata` correspond to rows and points in `refdata` correspond to columns. Must be of the same type as `newdata` (e.g. `data.frame`, `matrix`, `dgCMatrix`, etc.). If this is not passed, and `type` is `"dist"` or `"avg_sep"`, will calculate pairwise distances/separation between the points in `newdata`.
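The three output shapes for distances can be compared side by side. The sketch below uses made-up data and arbitrary row subsets; the shapes noted in the comments follow from the argument descriptions on this page.

```r
library(isotree)

set.seed(1)
X <- matrix(rnorm(100 * 2), nrow = 100)
model <- isolation.forest(X, ntrees = 10, nthreads = 1)

# Arbitrary subsets, both plain matrices so that the types match
newdata <- X[1:5, , drop = FALSE]
refdata <- X[6:8, , drop = FALSE]

# Pairwise within newdata, upper-triangular part only: 5 * 4 / 2 = 10 entries
d_tri <- predict(model, newdata, type = "dist")

# Pairwise within newdata, as a full 5 x 5 matrix
d_full <- predict(model, newdata, type = "dist", square_mat = TRUE)

# Between newdata (rows) and refdata (columns): a 5 x 3 matrix
d_ref <- predict(model, newdata, type = "dist", refdata = refdata)
```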

...

Not used.

Value

The requested prediction type, which can be:

A vector with one entry per row in `newdata` (for output types `"score"`, `"avg_depth"`).
A list with entries `score` and `tree_num` (for output type `"tree_num"`).
A square matrix or vector with the upper triangular part of a square matrix (for output types `"dist"`, `"avg_sep"`, with no `refdata`).
A matrix with points in `newdata` as rows and points in `refdata` as columns (for output types `"dist"`, `"avg_sep"`, with `refdata`).
The same type as the input `newdata` (for output type `"impute"`).

Details

The more threads that are set for the model, the higher the memory requirement will be as each thread will allocate an array with one entry per row (outlierness) or combination (distance).

Outlierness predictions for sparse data will be much slower than for dense data. Passing sparse matrices is not recommended unless the data is too big to fit in memory in dense form.

Note that after loading a serialized object from `isolation.forest` through `readRDS` or `load`, it will only de-serialize the underlying C++ object upon running `predict`, `print`, or `summary`, so the first run will be slower, while subsequent runs will be faster as the C++ object will already be in-memory.

In order to save memory when fitting and serializing models, the functionality for outputting terminal node numbers will generate index mappings on the fly for all tree nodes, even if passing only 1 row, so it's only recommended for batch predictions.

The outlier scores/depth predict functionality is optimized for making predictions on one or a few rows at a time. For large batches of predictions, it might be faster to use the `output_score=TRUE` option in `isolation.forest`.
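A minimal sketch of the batch alternative is shown below. It assumes that, with `output_score=TRUE`, `isolation.forest` returns a list that holds the fitted model alongside the scores for the training rows; the entry names are not restated on this page, so the example inspects the structure rather than relying on them.

```r
library(isotree)

set.seed(1)
X <- matrix(rnorm(200 * 2), nrow = 200)

# Fit the model and score the training rows in the same call
res <- isolation.forest(X, ntrees = 10, nthreads = 1, output_score = TRUE)

# Inspect the top-level entries of the returned list
str(res, max.level = 1)
```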
