detect_iforest: Detect Outliers using Isolation Forest (Machine Learning)

Description

This function applies the Isolation Forest algorithm to detect anomalies in high-dimensional or complex datasets. Unlike statistical methods that measure distance from a mean, Isolation Forest isolates observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature.

Usage

detect_iforest(data, ntrees = 100, contamination = 0.05)

Value

A data frame with the original columns plus:

If_Score: Numeric score between 0 and 1. Higher values indicate higher anomaly likelihood.
Is_Outlier: Logical flag. TRUE if If_Score exceeds the (1 - contamination) quantile of all scores, so roughly a contamination share of the rows is flagged.
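
The thresholding rule behind Is_Outlier can be illustrated with a minimal sketch. The scores below are made-up values standing in for If_Score, and the strict ">" comparison is an assumption based on the wording "exceeds"; the package's actual implementation may differ in detail.

```r
# Hypothetical scores standing in for the If_Score column (not real output)
scores <- c(0.41, 0.39, 0.44, 0.40, 0.43, 0.42, 0.38, 0.45, 0.40, 0.91)
contamination <- 0.10

# Cut-off at the (1 - contamination) quantile of the scores
threshold <- quantile(scores, probs = 1 - contamination, names = FALSE)

# Strict comparison (assumed): only scores above the cut-off are flagged
is_outlier <- scores > threshold
which(is_outlier)
```

With these toy values only the last score lies above the cut-off, so a single row is flagged.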

Arguments

data: A data frame containing at least one numeric column. Non-numeric columns are ignored.
ntrees: Integer. The number of trees to grow in the forest. Defaults to 100. More trees generally give more stable scores, with diminishing returns, at the cost of longer computation time.
contamination: Numeric (0 to 0.5). The expected proportion of outliers in the dataset. Used to calculate the threshold for the binary Is_Outlier flag. Defaults to 0.05 (5%).

Details

Recursive partitioning can be represented by a tree structure, and the number of splits required to isolate a sample equals the path length from the root node to the terminating node. Averaged over many random trees, anomalies have shorter path lengths than normal observations because they are "few and different": a handful of random splits is usually enough to separate them from the rest of the data.
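
The link between path length and a score in the 0 to 1 range can be sketched with the normalisation from the original Isolation Forest formulation, s = 2^(-E[h(x)] / c(n)), where E[h(x)] is the mean path length and c(n) is the average path length of an unsuccessful search in a binary search tree on n points. The helper names below are hypothetical, and it is an assumption that detect_iforest reports exactly this standardised score.

```r
# Average path length c(n) of an unsuccessful binary-search-tree lookup
# on n points; used to normalise observed path lengths.
avg_path_length <- function(n) {
  if (n <= 1) return(0)
  if (n == 2) return(1)
  harmonic <- log(n - 1) + 0.5772156649  # H(n - 1) via Euler-Mascheroni approximation
  2 * harmonic - 2 * (n - 1) / n
}

# Anomaly score s = 2^(-mean_depth / c(n)); vectorised over mean_depth
anomaly_score <- function(mean_depth, n) {
  2^(-mean_depth / avg_path_length(n))
}

n <- 256  # illustrative sub-sample size
# Short path, average path, long path
anomaly_score(c(2, avg_path_length(n), 20), n)
```

A mean path length equal to c(n) maps to a score of 0.5; shorter paths push the score towards 1 and longer paths towards 0.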

The function relies on the efficient isotree package for computation.
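
As a rough illustration of that dependency, the scoring step might look like the sketch below. The helper score_with_isotree is hypothetical and is not the package's source; it assumes isotree is installed and uses isotree::isolation.forest() with ndim = 1 (single-feature splits, as in the classical algorithm) together with predict(..., type = "score").

```r
# Hypothetical helper: fit an isolation forest on the numeric columns
# and return one anomaly score per row. Not the package's actual code.
score_with_isotree <- function(data, ntrees = 100) {
  if (!requireNamespace("isotree", quietly = TRUE)) {
    stop("Package 'isotree' is required for this sketch.")
  }
  # Keep numeric columns only, mirroring the documented behaviour
  numeric_cols <- data[, vapply(data, is.numeric, logical(1)), drop = FALSE]
  # ndim = 1 requests axis-parallel splits on a single feature
  model <- isotree::isolation.forest(numeric_cols, ntrees = ntrees, ndim = 1)
  as.numeric(predict(model, numeric_cols, type = "score"))
}
```

Calling score_with_isotree(df, ntrees = 50) on a data frame like the one in the Examples section would return a numeric vector with one score per row.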

Examples

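# Example: detect anomalies in a generated dataset
set.seed(42)  # make the simulated data reproducible
# 100 standard-normal points plus one extreme point at (1000, 1000)
df <- data.frame(x = c(rnorm(100), 1000), y = c(rnorm(100), 1000))
result <- detect_iforest(df, ntrees = 50, contamination = 0.02)
# The injected point is the last row, so inspect the tail
tail(result)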