LocScaleB: Univariate outlier detection with bounds based on robust location and scale estimates

Description

This function identifies outliers in the tails of a distribution by flagging the observations that fall outside bounds built from robust estimates of location and scale.

Usage

LocScaleB(x, k=3, method='MAD', weights=NULL, id=NULL,
          exclude=NA, logt=FALSE, return.dataframe=FALSE)

Arguments

x

Numeric vector that will be searched for outliers.

k

Nonnegative constant that determines the width of the bounds. Commonly used values are 2, 2.5 and 3 (default).

method

Character string identifying how to estimate the scale of the distribution. Available choices are:

method='IQR' for using the Inter-Quartile Range, i.e. Q3-Q1;

method='IDR' for using the Inter-Decile Range, i.e. P90-P10;

method='MAD' for using the Median Absolute Deviation;

method='Gini' for using the robust scale estimate based on Gini's Mean Difference (see GiniMd);

method='ScaleTau2' for using the robust tau-estimate of univariate scale proposed by Maronna and Zamar (2002) (see also scaleTau2);

method='Qn' for using the Qn estimator proposed by Rousseeuw and Croux (1993) (see also Qn);

method='Sn' for using the Sn estimator proposed by Rousseeuw and Croux (1993) (see also Sn).

When method='dQ' the estimated scale for the left tail is (Q2-Q1)/0.6745, while for the right tail it is (Q3-Q2)/0.6745 (Q2 is the median); this double estimate is intended to account for slight skewness.

When method='dD' the estimated scale for the left tail is (P50-P10)/1.2816, while for the right tail it is (P90-P50)/1.2816 (P50 is the median); this double estimate is intended to account for skewness.

Finally, when method='AdjOut', bounds are based on the adjusted outlyingness method as proposed by Hubert and Van der Veeken (2008).

weights

Optional numeric vector of weights associated with the observations. Only nonnegative weights are allowed. Note that weights can only be used when method='MAD', method='IQR', method='IDR', method='dQ' or method='dD'.

id

Optional numeric or character vector with the identifiers of the units in x. If id=NULL (default), the units' identifiers are set equal to their positions in x.

exclude

Values of x that will be excluded from the analysis. By default, missing values are excluded (exclude = NA).

logt

Logical; if TRUE, the x variable is log-transformed before searching for outliers (log(x+1) is used). Note that in this case the summary output (bounds, etc.) refers to the log-transformed variable.

return.dataframe

Logical; if TRUE, the output stores all the information relevant for outlier detection in a dataframe with the following columns: 'id' (units' identifiers), 'x', 'log.x' (only if logt=TRUE), 'weight' (only when the argument weights is provided), 'score' (the standardized scores, see Details) and, finally, 'outliers', where 1 indicates an observation detected as an outlier and 0 otherwise.

Value

A list whose components depend on the return.dataframe argument. When return.dataframe = FALSE, only the following components are returned:

pars

Vector with estimated median and scale parameters

bounds

The bounds of the interval; values outside the interval are considered outliers.

excluded

The positions or identifiers of the x values excluded from outlier detection, according to the argument exclude.

outliers

The position or identifiers of x values detected as outliers (outside bounds).

lowOutl

The identifiers or positions (when id=NULL) of units in x detected as outliers in the lower tail of the distribution.

upOutl

The identifiers or positions (when id=NULL) of units in x detected as outliers in the upper tail of the distribution.

When return.dataframe=TRUE, the latter two components are replaced by two dataframes:

excluded

A dataframe with the subset of observations excluded.

data

A dataframe with the non-excluded observations and the following columns: 'id' (units' identifiers), 'x', 'log.x' (only if logt=TRUE), 'weight' (only when the argument weights is provided), 'score' (the standardized scores, see Details) and, finally, 'outliers', where 1 indicates an observation detected as an outlier and 0 otherwise.

Details

The intervals use the median $Q_2$ as a robust location estimate, combined with one of several robust scale estimators:

$$[Q_2 - k \times \tilde{s}_L; \quad Q_2 + k \times \tilde{s}_R]$$

where $\tilde{s}_L$ and $\tilde{s}_R$ are robust scale estimates. With most of the methods $\tilde{s}_L=\tilde{s}_R$, with the exception of method='dQ' and method='dD', where respectively:

$$\tilde{s}_L = (Q_2 - Q_1)/0.6745 \qquad \mbox{and} \qquad \tilde{s}_R = (Q_3 - Q_2)/0.6745$$

and

$$\tilde{s}_L = (P_{50} - P_{10})/1.2816 \qquad \mbox{and} \qquad \tilde{s}_R = (P_{90} - P_{50})/1.2816$$

Note that when method='dQ' or method='dD' the function calculates and prints Bowley's coefficient of skewness, which uses Q1, Q2 and Q3 (replaced by P10, P50 and P90, respectively, when method='dD').
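The method='dQ' formulas and Bowley's coefficient can be reproduced by hand, as in the following minimal base-R sketch. It is not the function's own code: the data are made up, and it assumes the default quantile definition of quantile() (type 7), which may differ from the one used internally by LocScaleB.

```r
# Illustrative data with a longer right tail (values are made up)
x <- c(1, 2, 2, 3, 3, 3, 4, 5, 7, 12, 30)
k <- 3

# Quartiles; assumption: R's default quantile definition (type 7)
q <- quantile(x, probs = c(0.25, 0.50, 0.75), names = FALSE)
Q1 <- q[1]; Q2 <- q[2]; Q3 <- q[3]

# Separate scale estimates for the left and the right tail (method='dQ')
sL <- (Q2 - Q1) / 0.6745
sR <- (Q3 - Q2) / 0.6745

# Asymmetric bounds around the median
bounds <- c(lower = Q2 - k * sL, upper = Q2 + k * sR)

# Bowley's coefficient of skewness: positive values indicate right skew
bowley <- (Q3 + Q1 - 2 * Q2) / (Q3 - Q1)

# Positions of the observations falling outside the bounds
outl <- which(x < bounds[["lower"]] | x > bounds[["upper"]])
```

Because the right-tail scale is larger than the left-tail one here, the upper bound lies further from the median than the lower bound, and only the largest value is flagged.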

With method='AdjOut' the following estimates are considered:

$$\tilde{s}_L = (Q_2 - f_L) \qquad \mbox{and} \qquad \tilde{s}_R = (f_R - Q_2)$$

where $f_R$ and $f_L$ are derived from the fences of the adjusted boxplot (Hubert and Vandervieren, 2008; see adjboxStats). In addition, the medcouple (mc) measure of skewness is calculated and printed on the screen.

When weights are available (passed via the argument weights), they are used in the computation of the quartiles. In particular, the quartiles are derived using the function wtd.quantile in the package Hmisc. Note that their use is allowed only with method='IQR', method='IDR', method='dQ', method='dD' or method='AdjOut'.
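To give an intuition of how weights can shift the quartiles, the sketch below handles the special case of integer frequency weights by replicating each observation. This is only a base-R illustration under that assumption; it is not the Hmisc implementation, and the data and weights are hypothetical.

```r
# Hypothetical data with integer frequency weights
x <- c(10, 12, 13, 15, 40)
w <- c(3, 1, 2, 1, 1)

# Unweighted quartiles (default quantile type 7)
q_unw <- quantile(x, probs = c(0.25, 0.50, 0.75), names = FALSE)

# Weighted quartiles, obtained by repeating each value w times
# (only meaningful when the weights are integer frequencies)
x_rep <- rep(x, times = w)
q_wtd <- quantile(x_rep, probs = c(0.25, 0.50, 0.75), names = FALSE)

# Inter-quartile ranges with and without weights
iqr_unw <- q_unw[3] - q_unw[1]
iqr_wtd <- q_wtd[3] - q_wtd[1]
```

With these values the heavily weighted observation at 10 pulls the first quartile and the median downwards, so the resulting location and scale estimates, and hence the bounds, change.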

The 'score' variable reported in the data dataframe when return.dataframe=TRUE is the standardized score, derived as (x - Median)/scale.
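As an illustration of the score and of the symmetric bounds, the following base-R sketch mimics the method='MAD' case. It assumes the MAD is rescaled by the usual consistency constant 1.4826 (the default of stats::mad), which may not match the function's internals exactly; the data and the column layout are illustrative.

```r
# Illustrative data: values near 50 plus one outlier in each tail
x <- c(49.2, 50.1, 50.4, 49.8, 50.9, 1, 50.3, 49.5, 100, 50.0)
k <- 3

med <- median(x)   # robust location estimate
s <- mad(x)        # 1.4826 * median(|x - median(x)|)

# Symmetric bounds: median -/+ k * scale
bounds <- c(lower = med - k * s, upper = med + k * s)

# Standardized score: (x - Median) / scale
score <- (x - med) / s

# With a single scale estimate, |score| > k is equivalent to
# falling outside the bounds
outliers <- as.integer(abs(score) > k)

res <- data.frame(id = seq_along(x), x = x,
                  score = score, outliers = outliers)
```

In this example the units in positions 6 and 9 (values 1 and 100) are flagged, while all the values close to 50 have absolute scores well below k.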

References

Hubert, M. and Van der Veeken, S. (2008) 'Outlier Detection for Skewed Data', Journal of Chemometrics, 22, pp. 235-246.

Hubert, M. and Vandervieren, E. (2008) 'An Adjusted Boxplot for Skewed Distributions', Computational Statistics & Data Analysis, 52, pp. 5186-5201.

Maronna, R.A. and Zamar, R.H. (2002) 'Robust estimates of location and dispersion of high-dimensional datasets', Technometrics, 44, pp. 307-317.

Rousseeuw, P.J. and Croux, C. (1993) 'Alternatives to the Median Absolute Deviation', Journal of the American Statistical Association, 88, pp. 1273-1283.

Examples

library(univOutl)

# Simulate 30 values around 50 and plant one outlier in each tail
set.seed(333)
x <- rnorm(30, 50, 1)
x[10] <- 1
x[20] <- 100

# Bounds built as median -/+ 3 * MAD
out <- LocScaleB(x = x, k = 3, method = 'MAD')
out$pars
out$bounds
out$outliers
x[out$outliers]

# Same analysis, returning a dataframe with scores and outlier flags
out <- LocScaleB(x = x, k = 3, method = 'MAD',
                 return.dataframe = TRUE)
head(out$data)

# Bounds based on the adjusted outlyingness method
out <- LocScaleB(x = x, k = 3, method = 'AdjOut')
out$outliers