stray (version 0.1.0)

find_HDoutliers: Detect Anomalies in High Dimensional Data.

Description

Detect anomalies in high-dimensional data. This function is a modification of the HDoutliers algorithm (Wilkinson, 2018).

Usage

find_HDoutliers(data, alpha = 0.01, k = 10, knnsearchtype = "brute",
  normalize = "unitize")

Arguments

data

A vector, matrix, or data frame consisting of numerical variables.

alpha

Threshold for determining the cutoff for outliers. Observations are considered outliers if they fall in the \((1 - alpha)\) tail of the distribution of the nearest-neighbor distances between exemplars.

k

Number of nearest neighbours considered for each observation.

knnsearchtype

A character string indicating the search type for k-nearest neighbours, for example "brute" (the default) or "kd_tree".
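To make the k and knnsearchtype arguments more concrete, the sketch below computes k-nearest-neighbour distances by exhaustive pairwise comparison in base R, which is what a "brute" search is generally understood to mean. The helper knn_dist_brute is hypothetical and is not part of stray; the package's own search routine may differ.

```r
# Hypothetical helper for illustration only; not part of the stray package.
# Returns an n x k matrix: row i holds the distances from point i to its
# k nearest neighbours, in increasing order.
knn_dist_brute <- function(x, k) {
  x <- as.matrix(x)
  d <- as.matrix(dist(x))   # all pairwise Euclidean distances
  diag(d) <- Inf            # a point is not its own neighbour
  nearest <- vapply(
    seq_len(nrow(d)),
    function(i) unname(sort(d[i, ])[seq_len(k)]),
    numeric(k)
  )
  matrix(nearest, ncol = k, byrow = TRUE)
}

# Three points close together and one far away
pts <- matrix(c(0, 0,
                1, 0,
                0, 2,
                10, 10), ncol = 2, byrow = TRUE)
nn <- knn_dist_brute(pts, k = 2)
nn[, 1]  # nearest-neighbour distance of each point; the last is much larger
```

An exhaustive search compares every pair of points, so its cost grows quadratically with the number of observations. Tree-based search types such as "kd_tree" are usually chosen to reduce that cost on larger data sets.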

normalize

Method to normalize the columns of the data. This prevents variables with large variances from having a disproportionate influence on Euclidean distances. Two options are available: "standardize" or "unitize". The default is "unitize".
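The effect of column normalization can be sketched in base R. The helpers below are hypothetical and are not the package's internals: unitize_col assumes that "unitize" means rescaling each column to the interval [0, 1], and standardize_col assumes a centre-and-scale transform, here using the median and interquartile range as one robust choice. The definitions used inside stray may differ.

```r
# Hypothetical helpers for illustration only; not the stray internals.
unitize_col <- function(z) {
  rng <- range(z)
  if (diff(rng) == 0) return(rep(0, length(z)))  # constant column
  (z - rng[1]) / diff(rng)                       # rescale to [0, 1]
}

standardize_col <- function(z) {
  (z - median(z)) / IQR(z)  # assumed robust centre-and-scale transform
}

set.seed(1)
x <- cbind(small = rnorm(100), large = rnorm(100, sd = 1000))
apply(x, 2, sd)           # very different spreads before scaling

x_unit <- apply(x, 2, unitize_col)
apply(x_unit, 2, range)   # each column now spans 0 to 1

x_std <- apply(x, 2, standardize_col)
apply(x_std, 2, IQR)      # each column now has an interquartile range of 1
```

Without a step like this, Euclidean distances in x would be determined almost entirely by the large column.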

Value

The indexes of the observations determined to be outliers.
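Because the return value is a set of indexes, it can be used directly to subset or label the input. The sketch below uses a hand-written index vector, idx, as a stand-in for the output of find_HDoutliers so that it runs without the package; the data values are made up for illustration.

```r
# Stand-in data and indexes; idx plays the role of find_HDoutliers(data)
data <- c(-6.1, -5.8, 0, 6.2, 5.9)
idx <- c(3L)

flagged <- data[idx]                       # values of the flagged observations
is_outlier <- seq_along(data) %in% idx     # logical flag, one per observation
typical <- data[!is_outlier]               # data with flagged observations removed
```

The logical vector is_outlier has one entry per observation, which is convenient when the flags need to be stored as a column next to the original data.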

References

Wilkinson, L. (2018), 'Visualizing big data outliers through distributed aggregation', IEEE Transactions on Visualization and Computer Graphics 24(1), 256-266.

Examples

# NOT RUN {
library(stray)

# One-dimensional data: a single point at 0 lies between two clusters
set.seed(1234)
data <- c(rnorm(1000, mean = -6), 0, rnorm(1000, mean = 6))
outliers <- find_HDoutliers(data, knnsearchtype = "kd_tree")

# Two-dimensional data: 10 scattered points followed by 1000 typical points
set.seed(1234)
n <- 1000 # number of typical observations
nout <- 10 # number of outliers
typical_data <- matrix(rnorm(2 * n), ncol = 2, byrow = TRUE)
out <- matrix(5 * runif(2 * nout, min = -5, max = 5), ncol = 2, byrow = TRUE)
data <- rbind(out, typical_data) # generated outliers occupy rows 1 to nout
outliers <- find_HDoutliers(data, knnsearchtype = "brute")
# }
