OutlierPCOut: Outlier identification in high dimensions using the PCOUT algorithm

Description

The function implements a computationally fast procedure for identifying outliers that is particularly effective in high dimensions. The algorithm uses simple properties of principal components to identify outliers in the transformed space, which gives significant computational advantages for high-dimensional data. It requires considerably less computation time than existing outlier detection methods and is suitable for very large data sets. It can also handle the situation, common in certain biological applications, in which the number of dimensions is several orders of magnitude larger than the number of observations.

Usage

OutlierPCOut(x, ...)

# S3 method for default
OutlierPCOut(x, grouping, explvar=0.99, trace=FALSE, ...)

# S3 method for formula
OutlierPCOut(formula, data, ..., subset, na.action)

Arguments

formula

a formula with no response variable, referring only to numeric variables.

data

an optional data frame (or similar: see model.frame) containing the variables in the formula formula.

subset

an optional vector used to select rows (observations) of the data matrix x.

na.action

a function which indicates what should happen when the data contain NAs. The default is set by the na.action setting of options, and is na.fail if that is unset. The factory-fresh default is na.omit.

...

arguments passed to or from other methods.

x

a matrix or data frame.

grouping

grouping variable: a factor specifying the class for each observation.

explvar

a numeric value between 0 and 1 giving the proportion of variance to be covered by the robust principal components (defaults to 0.99).

trace

whether to print intermediate results. Default is trace = FALSE.

Value

An S4 object of class OutlierPCOut, which is a subclass of the virtual class Outlier.

Details

If the data set consists of two or more classes (specified by the grouping variable grouping), the method iterates through the classes present in the data, separates each class from the rest and identifies the outliers relative to that class. Both types of outliers, the mislabeled and the abnormal samples, are thus treated in a homogeneous way.
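The class-wise iteration described above can be sketched in base R. This is only an illustration under stated assumptions: flag_outliers() and flag_by_class() are hypothetical helper names, the detector is a simple robust distance in principal-component space standing in for the actual PCOut computation, and the cutoff of 3 and the convention "0 = outlier, 1 = regular" are choices made for this sketch.

```r
# Stand-in detector for illustration only (NOT the PCOut algorithm):
# a robust distance computed in principal-component space.
flag_outliers <- function(x, cutoff = 3) {
  x <- as.matrix(x)
  # Centre and scale each variable robustly (median and MAD)
  z <- sweep(x, 2, apply(x, 2, median))
  z <- sweep(z, 2, apply(x, 2, mad), "/")
  # Rotate to principal components, then rescale the scores robustly
  s <- prcomp(z, center = FALSE)$x
  s <- sweep(s, 2, apply(s, 2, median))
  s <- sweep(s, 2, apply(s, 2, mad), "/")
  d <- sqrt(rowMeans(s^2))
  as.integer(d <= cutoff)      # 1 = regular, 0 = flagged as outlier
}

# Separate each class from the rest and flag outliers relative to that class
flag_by_class <- function(x, grouping, cutoff = 3) {
  grouping <- as.factor(grouping)
  flags <- integer(nrow(x))
  for (cl in levels(grouping)) {
    idx <- which(grouping == cl)
    flags[idx] <- flag_outliers(x[idx, , drop = FALSE], cutoff)
  }
  flags
}

# Simulated two-class data with one planted abnormal sample in class "a"
set.seed(1)
x <- rbind(matrix(rnorm(60), 30, 2), matrix(rnorm(60, mean = 5), 30, 2))
grouping <- rep(c("a", "b"), each = 30)
x[1, ] <- c(15, -15)
flags <- flag_by_class(x, grouping)
which(flags == 0)              # indices of the flagged observations
```

Because every observation is judged only against its own class, a sample that is extreme for its labeled class is flagged whether it is genuinely abnormal or simply carries the wrong label.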

References

P. Filzmoser, R. Maronna and M. Werner (2008), Outlier identification in high dimensions, Computational Statistics & Data Analysis, Vol. 52, 1694-1711.

P. Filzmoser and V. Todorov (2012), Robust tools for the imperfect world, To appear.

Examples

library(rrcov)              # provides OutlierPCOut() and the hemophilia data

data(hemophilia)
obj <- OutlierPCOut(gr ~ ., data = hemophilia)
obj

getDistance(obj)            # returns an array of distances
getClassLabels(obj, 1)      # returns an array of indices for a given class
getCutoff(obj)              # returns an array of cutoff values (one per class, usually equal)
getFlag(obj)                # returns a 0/1 array of flags
plot(obj, class = 2)        # standard plot function
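The default method listed under Usage takes a data matrix and a grouping factor instead of a formula. A minimal sketch, assuming the rrcov package is installed and that every column of hemophilia other than gr is numeric; explvar is passed explicitly at its documented default only to show where it goes, and obj2, x and flags are names chosen for this example:

```r
library(rrcov)
data(hemophilia)

# All columns except the class label form the data matrix
x <- hemophilia[, names(hemophilia) != "gr"]
obj2 <- OutlierPCOut(x, grouping = hemophilia$gr, explvar = 0.99)

# Cross-tabulate the class labels against the 0/1 outlier flags
flags <- getFlag(obj2)
table(hemophilia$gr, flags)
```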