dirOutl: Directional outlyingness of points relative to a dataset

Description

Computes the directional outlyingness of \(p\)-dimensional points z relative to a \(p\)-dimensional dataset x. For each multivariate point \(z_i\), its directional outlyingness relative to x is defined as its maximal univariate directional outlyingness over all directions. To obtain the univariate directional outlyingness in the direction \(v\), the dataset x is projected onto \(v\), and the robust, skew-adjusted standardized distance of \(v'z_i\) to the median of the projected data points \(xv\) is computed. The skew adjustment comes from estimating two scales, one on each side of the median, using a 1-step M-estimator of scale.
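
The univariate building block described above can be sketched in base R. This is an illustration only, not the package's implementation: the function name uni_do_sketch is hypothetical, and a plain one-sided median-based scale stands in for the 1-step M-estimator of scale.

```r
# Hypothetical helper for illustration; not part of mrfDepth.
# Assumption: each one-sided scale is the median of the deviations on that
# side of the median, rescaled by qnorm(0.75). The package instead uses a
# 1-step M-estimator of scale.
uni_do_sketch <- function(y, z = y) {
  med <- median(y)
  dev <- y - med
  s_above <- median(dev[dev > 0]) / qnorm(0.75)   # scale of the upper half
  s_below <- median(-dev[dev < 0]) / qnorm(0.75)  # scale of the lower half
  # Standardize each point by the scale of the side it falls on.
  ifelse(z >= med, (z - med) / s_above, (med - z) / s_below)
}

# Right-skewed sample: points above the median are judged against the
# larger upper scale, points below it against the smaller lower scale.
set.seed(1)
y <- rexp(200)
uni_do_sketch(y, z = c(-1, 0.5, 6))
```

Using two scales means that a point far out in the long tail of a skewed distribution is not penalized as heavily as a point at the same distance on the short side.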

Usage

dirOutl(x, z = NULL, options = list())

Arguments

x

An \(n\) by \(p\) data matrix.

z

An optional \(m\) by \(p\) matrix containing rowwise the points \(z_i\) for which to compute the directional outlyingness. If z is not specified, it is set equal to x.

options

A list of available options:

type Determines the desired type of invariance and should be one of "Affine" or "compWise". When the option "Affine" is used, the directions \(v\) are orthogonal to hyperplanes spanned by \(p\) observations from x. With the option "compWise", the directional outlyingness is computed in the directions of the coordinate axes and combined through the Euclidean norm. Defaults to "Affine".
ndir When type is "Affine", the number of directions \(v\) to consider. Defaults to \(250p\).
seed A strictly positive integer specifying the seed to be used to select the directions. Defaults to \(10\).
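
To make the "compWise" description concrete, the following sketch computes a univariate outlyingness along each coordinate axis and combines the results through the Euclidean norm. It is an assumption-laden illustration rather than the package's code: both helper names are hypothetical, and a simple one-sided median-based scale replaces the 1-step M-estimator.

```r
# Hypothetical helpers for illustration; not part of mrfDepth.
one_sided_do <- function(y, z = y) {
  med <- median(y)
  dev <- y - med
  # Assumption: median-based one-sided scales instead of the M-estimator.
  s_above <- median(dev[dev > 0]) / qnorm(0.75)
  s_below <- median(-dev[dev < 0]) / qnorm(0.75)
  ifelse(z >= med, (z - med) / s_above, (med - z) / s_below)
}

compwise_do_sketch <- function(x, z = x) {
  x <- as.matrix(x)
  z <- as.matrix(z)
  # One univariate outlyingness per coordinate axis.
  per_axis <- sapply(seq_len(ncol(x)), function(j) one_sided_do(x[, j], z[, j]))
  per_axis <- matrix(per_axis, nrow = nrow(z))
  # Combine the coordinates through the Euclidean norm.
  sqrt(rowSums(per_axis^2))
}

set.seed(10)
x <- cbind(rexp(100), rnorm(100))
head(compwise_do_sketch(x))
```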

Value

A list with components:

outlyingnessX

Vector of length \(n\) giving the directional outlyingness of the observations in x.

outlyingnessZ

Vector of length \(m\) giving the directional outlyingness of the points in z relative to x.

cutoff

Points whose directional outlyingness exceeds this cutoff can be considered as outliers with respect to x.

flagX

Observations of x whose directional outlyingness exceeds the cutoff receive a flag FALSE, regular observations receive a flag TRUE.

flagZ

Points of z whose directional outlyingness exceeds the cutoff receive a flag equal to FALSE, otherwise they receive a flag TRUE.

singularSubsets

When the input parameter type is equal to "Affine", the number of \(p\)-subsets that span a subspace of dimension smaller than \(p-1\). In such a case the orthogonal direction cannot be uniquely determined. This is an indication that the data are not in general position.

dimension

When the data x are lying in a lower dimensional subspace, the dimension of this subspace.

hyperplane

When the data x are lying in a lower dimensional subspace, a direction orthogonal to this subspace. When a direction \(v\) is found such that the robust skew-adjusted scale of \(xv\) is equal to zero, this equals \(v\).

inSubspace

When a direction \(v\) is found such that DO(\(xv\)) is ill-defined, the observations from x which belong to the hyperplane orthogonal to \(v\) receive a value TRUE. The other observations receive a value FALSE.

Details

The directional outlyingness (DO) of multivariate data was introduced in Rousseeuw et al. (2018). It extends the Stahel-Donoho outlyingness towards skewed distributions.

Depending on the dimension \(p\), different approximate algorithms are implemented. The affine invariant algorithm can only be used when \(n > p\). It draws ndir times at random \(p\) observations from x and considers the direction orthogonal to the hyperplane spanned by these \(p\) observations. At most \(\binom{n}{p}\) directions can be considered. The orthogonally invariant version can be applied to high-dimensional data. It draws ndir times at random \(2\) observations from x and considers the direction through these two observations. Here, at most \(\binom{n}{2}\) directions can be considered. Finally, the shift invariant version randomly draws ndir vectors from the unit sphere.
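
The construction of a single affine invariant direction can be sketched as follows. This is one possible way to obtain the direction orthogonal to the hyperplane through \(p\) sampled observations, using a singular value decomposition; the function name random_affine_direction is hypothetical and the package may compute the direction differently.

```r
# Hypothetical helper for illustration; not part of mrfDepth.
random_affine_direction <- function(x) {
  x <- as.matrix(x)
  p <- ncol(x)
  idx <- sample(nrow(x), p)                 # draw p observations at random
  base <- x[idx[1], ]
  # The p - 1 difference vectors span the hyperplane through the p points.
  diffs <- sweep(x[idx[-1], , drop = FALSE], 2, base)
  # The last right singular vector is orthogonal to all difference vectors.
  v <- svd(diffs, nv = p)$v[, p]
  list(direction = v, subset = idx)
}

set.seed(10)
x <- matrix(rnorm(60), ncol = 3)
d <- random_affine_direction(x)
# The sampled points share one projected value, since they lie in a
# hyperplane orthogonal to the direction.
x[d$subset, ] %*% d$direction
# Number of distinct p-subsets, an upper bound on the number of directions.
choose(nrow(x), ncol(x))
```

If the sampled points are not in general position, the difference vectors span fewer than \(p-1\) dimensions and the orthogonal direction is no longer unique; this sketch does not check for that case.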

The resulting DO values are invariant to affine transformations, rotations and shifts, respectively, provided that the seed is kept fixed across runs of the algorithm. Note that the DO values are guaranteed not to decrease when more directions are considered, provided the seed is kept fixed, as this ensures that the random directions are generated in a fixed order and the maximum is taken over a growing set of directions.

An observation from x or z is flagged as an outlier if its DO exceeds a cutoff value. This cutoff value is determined using the procedure in Rousseeuw et al. (2018). First, the logarithm of the DO values is taken to render their distribution more symmetric, after which a normal approximation yields a cutoff on these values. The cutoff is then transformed back by applying the exponential function.
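
The three steps of the cutoff procedure (log transform, normal approximation, back-transform) can be sketched as below. The specifics are assumptions made for illustration and are not confirmed by this page: the function name do_cutoff_sketch is hypothetical, the normal approximation uses the median and MAD as robust location and scale, the quantile level is 0.995, and a small offset guards against taking the logarithm of zero.

```r
# Hypothetical helper for illustration; not part of mrfDepth.
# Assumptions: median/MAD for the normal approximation, prob = 0.995,
# and an offset of 0.1 to avoid log(0).
do_cutoff_sketch <- function(do, prob = 0.995, offset = 0.1) {
  ldo <- log(offset + do)                                # symmetrize
  cut_log <- median(ldo) + qnorm(prob) * mad(ldo)        # normal approximation
  exp(cut_log) - offset                                  # back-transform
}

set.seed(10)
do <- abs(rnorm(500))          # stand-in for a vector of DO values
cutoff <- do_cutoff_sketch(do)
flag <- do <= cutoff           # TRUE for regular points, FALSE for outliers
table(flag)
```

The flag convention mirrors the one documented for flagX and flagZ, where FALSE marks an outlier.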

It is first checked whether the data lie in a subspace of dimension smaller than \(p\). If so, a warning is given, as well as the dimension of the subspace and a direction which is orthogonal to it. Furthermore, the univariate directional outlyingness of the projected points \(xv\) is ill-defined when the scale in its denominator becomes zero. This can happen when many observations collapse. In these cases the algorithm will stop and give a warning. The returned values then include the direction \(v\) as well as an indicator specifying which of the observations of x belong to the hyperplane orthogonal to \(v\).
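
A check of this kind can be sketched with a singular value decomposition of the centered data. This illustrates the idea under stated assumptions and is not the package's actual check: the function name subspace_check_sketch and the tolerance tol are hypothetical.

```r
# Hypothetical helper for illustration; not part of mrfDepth.
subspace_check_sketch <- function(x, tol = 1e-8) {
  xc <- scale(as.matrix(x), center = TRUE, scale = FALSE)
  p <- ncol(xc)
  sv <- svd(xc, nv = p)
  # Numerical rank: singular values that are non-negligible relative to
  # the largest one.
  dimension <- sum(sv$d > tol * max(sv$d))
  # If the rank is deficient, the last right singular vector is orthogonal
  # to the subspace containing the centered data.
  hyperplane <- if (dimension < p) sv$v[, p] else NULL
  list(dimension = dimension, hyperplane = hyperplane)
}

# Three-dimensional data lying exactly in a plane.
set.seed(1)
a <- rnorm(20)
b <- rnorm(20)
x3 <- cbind(a, b, a + b)
res <- subspace_check_sketch(x3)
res$dimension
res$hyperplane
```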

References

Rousseeuw, P.J., Raymaekers, J., Hubert, M. (2018). A Measure of Directional Outlyingness with Applications to Image Data and Video. Journal of Computational and Graphical Statistics, 27, 345-359.

Examples

# NOT RUN {
# Compute the directional outlyingness of a simple
# two-dimensional dataset. Outliers are plotted
# in red.
data("geological")
BivData <- geological[c("MnO","MgO")]
Result <- dirOutl(x = BivData)
IndOutliers <- which(!Result$flagX)
plot(BivData)
points(BivData[IndOutliers,], col = "red")

# The number of directions may be specified through
# the option list. The resulting directional outlyingness
# is monotone non-decreasing in the number of directions.
Result1 <- dirOutl(x = BivData, options = list(ndir = 50))
Result2 <- dirOutl(x = BivData, options = list(ndir = 100))
which(Result2$outlyingnessX - Result1$outlyingnessX < 0)
# This is however not the case when the seed is changed.
Result1 <- dirOutl(x = BivData, options = list(ndir = 50))
Result2 <- dirOutl(x = BivData, options = list(ndir = 100, seed = 950))

plot(Result2$outlyingnessX - Result1$outlyingnessX,
     xlab = "Index", ylab = "Difference in DO")

# Consider another example:

data("bloodfat")
BivData <- bloodfat[1:100,] # Consider a small toy example.
Result <- dirOutl(x = BivData, options = list(type = "Affine"))
IndOutliers <- which(!Result$flagX)
plot(BivData)
points(BivData[IndOutliers,], col = "red")

# }
