Learn R Programming

⚠️There's a newer version (1.1.3) of this package.Take me there.

Introduction

The R package implements Isolation forest, an anomaly detection method introduced by the paper Isolation based Anomaly Detection (Liu, Ting and Zhou).

Isolation forest is grown using ranger package and it is possible to experiment with the variants of classical isolation forest ex: weighing covariates(features) and observations.

Usage

suppressPackageStartupMessages(library("dplyr"))
suppressPackageStartupMessages(library("solitude"))
data("Boston", package = "MASS")
dplyr::glimpse(Boston)

## Observations: 506
## Variables: 14
## $ crim    <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, 0.…
## $ zn      <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5, 1…
## $ indus   <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, 7.…
## $ chas    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ nox     <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524, …
## $ rm      <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172, …
## $ age     <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0, 8…
## $ dis     <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605, …
## $ rad     <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4,…
## $ tax     <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311, 3…
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, 15…
## $ black   <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60, …
## $ lstat   <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93,…
## $ medv    <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18…

BostonX <- Boston %>% select(-medv)

# grow an isolation forest
iso_Boston <- isolationForest(BostonX, seed = 100, num.trees = 1e3)

# predict anomaly scores (parallelizable using futures)
scores <- predict(iso_Boston, BostonX, type = "anomaly_score")

# predict corrected depths
depths <- predict(iso_Boston, BostonX, type = "depth_corrected")

Anomaly detection

The paper suggests the following: If the score is closer to 1 for a some observations, they are likely outliers. If the score for all observations hover around 0.5, there might not be outliers at all.

By observing the quantiles, we might arrive at the a threshold on the anomaly scores and investigate the outlier suspects.

# quantiles of anomaly scores
quantile(scores, probs = seq(0.5, 1, length.out = 11))

##       50%       55%       60%       65%       70%       75%       80% 
## 0.4403705 0.4480122 0.4550305 0.4608977 0.4696051 0.4814371 0.4884172 
##       85%       90%       95%      100% 
## 0.4914260 0.5184716 0.5288953 0.6552715

The understanding of why is an observation an anomaly might require a combination of domain understanding and techniques like lime (Local Interpretable Model-agnostic Explanations), Rule based systems etc

Installation

install.packages("solitude")                  # CRAN version
devtools::install_github("talegari/solitude") # dev version

Copy Link

Version

Install

install.packages('solitude')

Monthly Downloads

1,806

Version

0.1.3

License

GPL-3

Issues

Pull Requests

Stars

Forks

Maintainer

KS Srikanth

Last Published

June 5th, 2019

Functions in solitude (0.1.3)

fastDoCall

An alternative to the internal do.call
solitude

An Implementation of Isolation Forest
predict.solitude

Predict method for solitude class
harmonic

harmonic
harmonic_approx

harmonic_approx
depth_terminalNodes

depth_terminalNodes
harmonic_exact

harmonic_exact
compute_anomaly

compute_anomaly
isolationForest

Grow an isolation forest
average_path_length

average_path_length
depth_corrected

depth_corrected
anomaly_score

anomaly_score