detect_iforest: Detect Outliers using Isolation Forest (Machine Learning)
Description
This function applies the Isolation Forest algorithm to detect anomalies in high-dimensional
or complex datasets. Unlike statistical methods that measure distance from a mean,
Isolation Forest isolates observations by randomly selecting a feature and then
randomly selecting a split value between the maximum and minimum values of the selected feature.
Value
The anomaly score: a numeric value between 0 and 1. Higher values indicate a higher likelihood that the observation is an anomaly.
Is_Outlier
Logical flag. TRUE if the score exceeds the (1 - contamination) quantile of all scores, so that roughly the top contamination share of observations is flagged.
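The relationship between the score and the flag can be sketched in base R. This is only an illustration under stated assumptions: the scores below are made-up stand-ins for Isolation Forest output, and the cutoff is assumed to be the (1 - contamination) quantile of the scores with a strict "greater than" comparison.

```r
# Made-up scores standing in for model output (not real results):
# 95 unremarkable observations and 5 clearly higher ones.
set.seed(42)
scores <- c(runif(95, 0.30, 0.50), runif(5, 0.70, 0.90))

contamination <- 0.05

# Assumed rule: cutoff at the (1 - contamination) quantile of the scores.
threshold <- quantile(scores, probs = 1 - contamination, names = FALSE)

# Flag every observation whose score is strictly above the cutoff.
is_outlier <- scores > threshold

sum(is_outlier)  # number of flagged observations
```

Because the cutoff is a quantile of the observed scores, about a contamination share of the rows is flagged regardless of how extreme those rows actually are.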
Arguments
data
A data frame containing at least one numeric column. Non-numeric columns are ignored.
ntrees
Integer. The number of trees to grow in the forest. Defaults to 100.
More trees generally give more stable scores, with diminishing returns, at the cost of longer computation time.
contamination
Numeric (0 to 0.5). The expected proportion of outliers in the dataset.
Used to calculate the threshold for the binary Is_Outlier flag. Defaults to 0.05 (5%).
Details
Recursive partitioning can be represented by a tree structure, and the number of splits
required to isolate a sample is equal to the path length from the root node to the
terminating node. Random trees produce shorter path lengths for anomalies because anomalies
are "few and different" from normal observations, so a random split is more likely to separate them early.
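The path-length idea can be illustrated with a minimal base R sketch. It is not the package's or isotree's implementation: the helper `isolation_path_length` is hypothetical, it partitions the full data instead of a subsample, and it does not convert path lengths into the 0 to 1 score.

```r
# Hypothetical helper: count the random splits needed to isolate row `i`
# of the numeric matrix `x`.
isolation_path_length <- function(x, i, max_depth = 100L) {
  idx <- seq_len(nrow(x))  # rows still in the same partition as row i
  depth <- 0L
  while (length(idx) > 1L && depth < max_depth) {
    sub <- x[idx, , drop = FALSE]
    lo <- apply(sub, 2, min)
    hi <- apply(sub, 2, max)
    splittable <- which(hi > lo)          # features that can still be split
    if (length(splittable) == 0L) break   # remaining rows are identical
    j <- splittable[sample.int(length(splittable), 1L)]  # random feature
    cut <- runif(1, lo[j], hi[j])                        # random split value
    # Keep only the rows that fall on the same side of the cut as row i.
    same_side <- (sub[, j] < cut) == (x[i, j] < cut)
    idx <- idx[same_side]
    depth <- depth + 1L
  }
  depth
}

# Simulated data: 100 ordinary points plus one obvious anomaly at (8, 8).
set.seed(1)
x <- rbind(matrix(rnorm(200), ncol = 2), c(8, 8))
anomaly <- nrow(x)

# Average path length over many random trees.
mean_len <- function(i) mean(replicate(200, isolation_path_length(x, i)))

len_anomaly <- mean_len(anomaly)  # anomaly: isolated after few splits
len_typical <- mean_len(1)        # ordinary point: needs more splits
c(anomaly = len_anomaly, typical = len_typical)
```

Averaging over many random trees is what makes the comparison meaningful, since a single tree can isolate an ordinary point early by chance.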
The function relies on the efficient isotree package for computation.
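For orientation, the documented behaviour can be approximated by calling isotree directly. The sketch below is an assumption about how the pieces fit together, not this function's source code: `sketch_iforest` is a hypothetical helper, the score column name `Anomaly_Score` is invented for illustration, and the choices `ndim = 1`, `nthreads = 1`, and `seed = 1` are made here only to get a standard, reproducible forest.

```r
library(isotree)

# Hypothetical helper mirroring the documented arguments.
sketch_iforest <- function(data, ntrees = 100, contamination = 0.05) {
  # Keep numeric columns only; non-numeric columns are ignored for fitting.
  num <- data[, vapply(data, is.numeric, logical(1)), drop = FALSE]

  # Fit the forest and score the same rows it was fitted on.
  model <- isolation.forest(num, ntrees = ntrees, ndim = 1,
                            nthreads = 1, seed = 1)
  scores <- predict(model, num, type = "score")

  # Assumed rule: flag scores above the (1 - contamination) quantile.
  threshold <- quantile(scores, probs = 1 - contamination, names = FALSE)

  data$Anomaly_Score <- scores          # column name is an assumption
  data$Is_Outlier <- scores > threshold
  data
}

# Simulated data: 99 ordinary rows, one extreme row, one non-numeric column.
set.seed(1)
df <- data.frame(x = c(rnorm(99), 10), y = c(rnorm(99), 10), label = "a")
res <- sketch_iforest(df)
res[res$Is_Outlier, ]
```

Scoring the training rows themselves keeps the sketch short; whether the actual function does the same is not stated in this page.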