od.opt.param: Optimal Parameter Values In RaPKod

Description

Uses a heuristic formula to set optimal values for gamma and p.

Usage

od.opt.param(X, K1 = 6, K2 = 50, which.estim = "Gauss", RATIO = 0.1, 
            randomize = TRUE, sub.n = floor(nrow(X)))

Arguments

X

a data frame or an n x d matrix.

K1

universal constant used in the heuristic formula of the optimal parameter gamma.

K2

universal constant used in the heuristic formula of the optimal parameter p.

which.estim

specifies the estimation method of the parameters: either "Gauss" (default) or "general".

RATIO

optional parameter used in the estimation method "Gauss": covariance eigenvalues smaller than RATIO times the largest eigenvalue are discarded.

randomize

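optional parameter used in the estimation method "general": if TRUE, only a subset of the data is used to estimate the cdf of the pairwise distances, which shortens the computation time.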

sub.n

optional parameter used in the estimation method "general" if randomize=TRUE: size of the subset of the data used for the estimation.

Value

gamma.opt

optimal value for gamma.

p.opt

optimal value for p.

est.f2.pw

estimate of \(|f|_2^{2/(d+2)}\).

Details

This function uses a heuristic formula to determine the optimal values of the parameters gamma and p when a Gaussian kernel is used. The formula is of the form \(gamma = K1 * |f|_2^{2/(d+2)} * n^{1/(d+2)}\) and \(p = ceil(K2 * |f|_2^{2/(d+2)} * n^{2/(d+2)})\), where \(n\) is the number of observations, \(d\) is the dimension of the data, \(|f|_2\) is the L2-norm of the density function \(f\) of the non-outliers, and \(ceil(x)\) denotes the smallest integer greater than or equal to \(x\).
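As a rough illustration of the formula above, the following sketch computes gamma and p from a given value of \(|f|_2^{2/(d+2)}\). The helper name heuristic_params and its argument f2.pw are hypothetical and not part of RaPKod; the defaults K1 = 6 and K2 = 50 are taken from the Usage section.

```r
# Hypothetical helper (not part of RaPKod): applies the heuristic formula
# from the Details section to a given value of |f|_2^(2/(d+2)).
heuristic_params <- function(f2.pw, n, d, K1 = 6, K2 = 50) {
  gamma <- K1 * f2.pw * n^(1 / (d + 2))
  p <- ceiling(K2 * f2.pw * n^(2 / (d + 2)))
  list(gamma.opt = gamma, p.opt = p)
}

# Example: n = 100 observations in d = 2 dimensions,
# with an assumed (made-up) value f2.pw = 0.5
res <- heuristic_params(f2.pw = 0.5, n = 100, d = 2)
res$gamma.opt
res$p.opt
```

Both values grow with n, and p grows faster than gamma because its exponent is 2/(d+2) rather than 1/(d+2).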

Two methods are proposed to estimate \(|f|_2\) and are specified by the argument which.estim: "Gauss" and "general".

If which.estim="Gauss", the estimation is done as though \(f\) were a Gaussian density, which yields \(|f|_2^{2/(d+2)} = (4*pi)^{-0.5}*exp(0.5*mean(log(1/ev)))\), where \(ev\) are the eigenvalues of the covariance matrix of the non-outlier distribution. Note that the eigenvalues smaller than \(ev[1]*RATIO\) (where \(ev[1]\) is the largest eigenvalue) are discarded to avoid numerical issues.
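The "Gauss" estimate can be sketched directly from the expression above. This is a minimal sketch, not the package's implementation: the helper name gauss_f2_pw is hypothetical, and it assumes that the eigenvalues are taken from the sample covariance matrix of X.

```r
# Hypothetical helper (not part of RaPKod): Gaussian-case estimate of
# |f|_2^(2/(d+2)) following the expression in the Details section.
gauss_f2_pw <- function(X, RATIO = 0.1) {
  # Assumption: eigenvalues of the sample covariance matrix of X
  ev <- eigen(cov(as.matrix(X)), symmetric = TRUE, only.values = TRUE)$values
  # eigen() returns eigenvalues in decreasing order, so ev[1] is the largest;
  # eigenvalues smaller than ev[1] * RATIO are discarded
  ev <- ev[ev >= ev[1] * RATIO]
  (4 * pi)^(-0.5) * exp(0.5 * mean(log(1 / ev)))
}

set.seed(1)
X <- matrix(rnorm(200 * 3), ncol = 3)  # simulated data, for illustration only
est <- gauss_f2_pw(X)
est
```

Under this expression, multiplying the data by 2 multiplies every eigenvalue by 4 and therefore halves the estimate.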

If which.estim="general", \(|f|_2\) is estimated without any assumption on \(f\). However, this method may fail in very high dimensions because of the curse of dimensionality, since it relies on an estimate of the derivative of \(F\) at \(0\), where \(F\) is the cdf of the pairwise distance between two non-outliers. In addition, to shorten the computation time, the optional argument randomize can be set to TRUE, so that only a subset of size sub.n of the data is used to estimate the cdf \(F\).

Examples

# NOT RUN {
library(RaPKod)
data(iris)

## Define a data frame of non-outliers
set.seed(1)  # make the random subsample reproducible
inliers <- iris[sample(which(iris$Species != "setosa"), 100, replace = FALSE),
                -which(names(iris) == "Species")]

param <- od.opt.param(inliers)

#display optimal gamma
param$gamma.opt
#display optimal p
param$p.opt

# }