Introduction
The R package implements Isolation forest, an anomaly detection method introduced by the paper Isolation based Anomaly Detection (Liu, Ting and Zhou).
Isolation forest is grown using ranger package and it is possible to experiment with the variants of classical isolation forest ex: weighing covariates(features) and observations.
Usage
suppressPackageStartupMessages(library("dplyr"))
suppressPackageStartupMessages(library("solitude"))
data("Boston", package = "MASS")
dplyr::glimpse(Boston)
## Observations: 506
## Variables: 14
## $ crim <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, 0.…
## $ zn <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5, 1…
## $ indus <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, 7.…
## $ chas <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ nox <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524, …
## $ rm <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172, …
## $ age <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0, 8…
## $ dis <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605, …
## $ rad <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4,…
## $ tax <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311, 3…
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, 15…
## $ black <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60, …
## $ lstat <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93,…
## $ medv <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18…
BostonX <- Boston %>% select(-medv)
# grow an isolation forest
iso_Boston <- isolationForest(BostonX, seed = 100, num.trees = 1e3)
# predict anomaly scores (parallelizable using futures)
scores <- predict(iso_Boston, BostonX, type = "anomaly_score")
# predict corrected depths
depths <- predict(iso_Boston, BostonX, type = "depth_corrected")
Anomaly detection
The paper suggests the following: If the score is closer to 1 for a some observations, they are likely outliers. If the score for all observations hover around 0.5, there might not be outliers at all.
By observing the quantiles, we might arrive at the a threshold on the anomaly scores and investigate the outlier suspects.
# quantiles of anomaly scores
quantile(scores, probs = seq(0.5, 1, length.out = 11))
## 50% 55% 60% 65% 70% 75% 80%
## 0.4403705 0.4480122 0.4550305 0.4608977 0.4696051 0.4814371 0.4884172
## 85% 90% 95% 100%
## 0.4914260 0.5184716 0.5288953 0.6552715
The understanding of why is an observation an anomaly might require a combination of domain understanding and techniques like lime (Local Interpretable Model-agnostic Explanations), Rule based systems etc
Installation
install.packages("solitude") # CRAN version
devtools::install_github("talegari/solitude") # dev version