boxB: BoxPlot based outlier detection

Description

Identifies univariate outliers using methods based on boxplots.

Usage

boxB(x, k=1.5, method='asymmetric', weights=NULL, id=NULL, 
     exclude=NA, logt=FALSE)

Arguments

x

Numeric vector that will be searched for outliers.

k

Nonnegative constant that determines the extension of the 'whiskers'. Commonly used values are 1.5 (default), 2, or 3. Note that when method="adjbox", k is automatically set equal to 1.5.

method

Character string identifying the method to be used: method="resistant" provides the 'standard' boxplot fences; method="asymmetric" is a modification of the standard method to deal with (moderately) skewed data; method="adjbox" uses the Hubert and Vandervieren (2008) adjusted boxplot for skewed distributions.

weights

Optional numeric vector with the units' weights associated with the observations in x. Only nonnegative weights are allowed. Weights are used in estimating the quartiles (see Details).

id

Optional vector with identifiers of the units in x. If missing (id=NULL, default), the identifiers are set equal to the positions in the vector (i.e. id=1:length(x)).

exclude

Values of x that will be excluded from the analysis. By default missing values are excluded (exclude=NA).

logt

Logical. If TRUE, the x variable is log-transformed before searching for outliers (log(x+1) is used). In this case the summary outputs (bounds, etc.) refer to the log-transformed x.

Value

The output is a list containing the following components:

quartiles

The quartiles of x after discarding the values in the exclude argument. When weights are provided, they are used in estimating the quartiles through the function wtd.quantile in the package Hmisc.

fences

The bounds of the interval; values outside the interval are detected as outliers.

excluded

The identifiers or positions (when id=NULL) of units in x excluded from the computations, according to the argument exclude.

outliers

The identifiers or positions (when id=NULL) of units in x detected as outliers.

lowOutl

The identifiers or positions (when id=NULL) of units in x detected as outliers in the lower tail of the distribution.

upOutl

The identifiers or positions (when id=NULL) of units in x detected as outliers in the upper tail of the distribution.

Details

When method="resistant" the outlying observations are those outside the interval:

$$[Q_1 - k \times IQR;\quad Q_3 + k \times IQR] $$

where $Q_1$ and $Q_3$ are respectively the 1st and the 3rd quartile of x, while $IQR=(Q_3 - Q_1)$ is the Inter-Quartile Range. The value $k=1.5$ (the so-called 'inner fences') is commonly used when drawing a boxplot. Values $k=2$ and $k=3$ provide middle and outer fences, respectively.
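
The interval above can be sketched in base R as follows. This is only an illustration, not the code used inside boxB: the helper name resistant_fences is hypothetical, and the default quantile() definition used here may not match the quartile estimates that boxB returns.

```r
# Hypothetical helper: standard ("resistant") boxplot fences in base R.
resistant_fences <- function(x, k = 1.5) {
  # 1st and 3rd quartile (default quantile() definition, an assumption)
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  c(lower = q[1] - k * iqr, upper = q[2] + k * iqr)
}

x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)
f <- resistant_fences(x, k = 1.5)
f
# positions of the values falling outside the fences
which(x < f["lower"] | x > f["upper"])
```

Larger values of k widen the interval, so fewer observations are flagged.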

When method="asymmetric" the outlying observations are those outside the interval:

$$[Q_1 - 2k \times (Q_2-Q_1);\quad Q_3 + 2k \times (Q_3-Q_2)] $$

where $Q_2$ is the median; this modification accounts for slight skewness of the distribution.
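
A minimal base-R sketch of the asymmetric interval is given below, under the same caveats: asymmetric_fences is a hypothetical helper written for illustration, and quantile() with its default definition is an assumption about how the quartiles are estimated.

```r
# Hypothetical helper: asymmetric fences built from the two half-IQRs.
asymmetric_fences <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.5, 0.75), na.rm = TRUE, names = FALSE)
  c(lower = q[1] - 2 * k * (q[2] - q[1]),   # uses Q2 - Q1
    upper = q[3] + 2 * k * (q[3] - q[2]))   # uses Q3 - Q2
}

# right-skewed toy data
x <- c(1, 2, 3, 4, 5, 7, 10, 15, 25, 60)
f <- asymmetric_fences(x, k = 1.5)
f
which(x < f["lower"] | x > f["upper"])
```

When the median lies exactly halfway between the quartiles, $2(Q_2-Q_1) = 2(Q_3-Q_2) = IQR$ and the interval coincides with the one obtained with method="resistant".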

Finally, when method="adjbox" the outlying observations are identified using the method proposed by Hubert and Vandervieren (2008) and based on the Medcouple measure of skewness; in practice the bounds are:

$$[Q_1-1.5 \times e^{aM} \times IQR;\quad Q_3+1.5 \times e^{bM}\times IQR ]$$

where $M$ is the medcouple; when $M > 0$ (positive skewness) then $a = -4$ and $b = 3$, whereas $a = -3$ and $b = 4$ for negative skewness ($M < 0$). According to Hubert and Vandervieren (2008), this adjustment of the boxplot works with moderate skewness ($-0.6 \leq M \leq 0.6$). The bounds of the adjusted boxplot are derived by applying the function adjboxStats in the package robustbase.
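
To show how the sign of the medcouple changes the two bounds, the sketch below evaluates the formula for given quartiles and a given value of $M$. The function adjbox_fences and its arguments are hypothetical; the medcouple is passed in as a number rather than estimated, and boxB itself relies on adjboxStats rather than on this code.

```r
# Hypothetical helper: adjusted-boxplot bounds for a given medcouple M.
adjbox_fences <- function(q1, q3, M, k = 1.5) {
  iqr <- q3 - q1
  if (M >= 0) {        # positive skewness (M = 0 gives exp(0) = 1 either way)
    a <- -4; b <- 3
  } else {             # negative skewness
    a <- -3; b <- 4
  }
  c(lower = q1 - k * exp(a * M) * iqr,
    upper = q3 + k * exp(b * M) * iqr)
}

sym  <- adjbox_fences(q1 = 10, q3 = 20, M = 0)    # no skewness
skew <- adjbox_fences(q1 = 10, q3 = 20, M = 0.3)  # moderate positive skewness
sym
skew
```

With $M = 0$ both exponential factors equal 1 and the bounds reduce to the standard fences with $k = 1.5$; with $M > 0$ the lower bound moves towards $Q_1$ and the upper bound moves away from $Q_3$.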

When weights are available (passed via the argument weights) then they are used in the computation of the quartiles. In particular, the quartiles are derived using the function wtd.quantile in the package Hmisc.
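
The effect of the weights on the quartiles can be illustrated with a simplified weighted quantile. The helper below is hypothetical and deliberately naive: it returns the smallest value whose cumulative share of weight reaches the requested probability, which is not necessarily the estimator implemented by wtd.quantile.

```r
# Hypothetical helper: naive weighted quantile from the cumulative weight share.
weighted_quantile <- function(x, w, probs = c(0.25, 0.5, 0.75)) {
  stopifnot(length(x) == length(w), all(w >= 0))
  ord <- order(x)
  x <- x[ord]
  w <- w[ord]
  cum_share <- cumsum(w) / sum(w)
  # smallest x whose cumulative weight share is at least p
  sapply(probs, function(p) x[which(cum_share >= p)[1]])
}

x <- c(10, 20, 30, 40)
w <- c(1, 1, 1, 5)          # the largest value carries most of the weight
wq <- weighted_quantile(x, w)
wq
quantile(x, c(0.25, 0.5, 0.75), names = FALSE)  # unweighted, for comparison
```

Because the quartiles shift towards heavily weighted observations, the fences computed from them shift as well.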

Remember that when a log transformation is requested (argument logt=TRUE), all the estimates (quartiles, etc.) refer to $\log(x+1)$.
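
Since the bounds are then expressed on the $\log(x+1)$ scale, they have to be back-transformed before being compared with the original values. The sketch below shows this step in base R with standard fences; it is an illustration under the assumption of default quantile() quartiles, not output produced by boxB.

```r
x <- c(3, 5, 8, 12, 20, 35, 400)
lx <- log(x + 1)                      # same transformation as log1p(x)

# standard fences computed on the log scale
q <- quantile(lx, probs = c(0.25, 0.75), names = FALSE)
fences_log <- c(lower = q[1] - 1.5 * (q[2] - q[1]),
                upper = q[2] + 1.5 * (q[2] - q[1]))

# back-transformation to the original scale: inverse of log(x + 1)
fences_orig <- exp(fences_log) - 1
fences_log
fences_orig
```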

References

McGill, R., Tukey, J. W. and Larsen, W. A. (1978) 'Variations of box plots', The American Statistician, 32, pp. 12-16.

Hubert, M. and Vandervieren, E. (2008) 'An Adjusted Boxplot for Skewed Distributions', Computational Statistics and Data Analysis, 52, pp. 5186-5201.

Examples


# NOT RUN {
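# assumes boxB is provided by the univOutl package
library(univOutl)

# simulated data with one very low and one very high value
set.seed(321)
x <- rnorm(30, 50, 10)
x[10] <- 1
x[20] <- 100

# asymmetric fences; outliers are returned as positions in x
out <- boxB(x = x, k = 1.5, method = 'asymmetric')
out$fences
out$outliers
x[out$outliers]

# adjusted boxplot (k is set to 1.5 automatically)
out <- boxB(x = x, k = 1.5, method = 'adjbox')
out$fences
out$outliers
x[out$outliers]

# missing value (excluded by default) and unit identifiers
x[24] <- NA
x.ids <- paste0('obs',1:30)
out <- boxB(x = x, k = 1.5, method = 'adjbox', id = x.ids)
out$excluded
out$fences
out$outliers

# weights used in estimating the quartiles
set.seed(111)
w <- round(runif(n = 30, min=1, max=10))
out <- boxB(x = x, k = 1.5, method = 'adjbox', id = x.ids, weights = w)
out$excluded
out$fences
out$outliers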

# }
