corHuber: Robust correlation based on winsorization.

Description

Compute a robust correlation estimate based on winsorization, i.e., by shrinking outlying observations to the border of the main part of the data.

Usage

corHuber(x, y,
    type = c("bivariate", "adjusted", "univariate"),
    standardized = FALSE, centerFun = median,
    scaleFun = mad, const = 2, prob = 0.95,
    tol = .Machine$double.eps^0.5, ...)

Arguments

x, y

numeric vectors.

type

a character string specifying the type of winsorization to be used. Possible values are "univariate" for univariate winsorization, "adjusted" for adjusted univariate winsorization, or "bivariate" for bivariate winsorization.

standardized

a logical indicating whether the data are already robustly standardized.

centerFun

a function to compute a robust estimate for the center to be used for robust standardization (defaults to median). Ignored if standardized is TRUE.

scaleFun

a function to compute a robust estimate for the scale to be used for robust standardization (defaults to mad). Ignored if standardized is TRUE.

const

numeric; tuning constant to be used in univariate or adjusted univariate winsorization (defaults to 2).

prob

numeric; probability for the quantile of the $\chi^{2}$ distribution to be used in bivariate winsorization (defaults to 0.95).

tol

a small positive numeric value. This is used in bivariate winsorization to determine whether the initial estimate from adjusted univariate winsorization is close to 1 in absolute value. In this case, bivariate winsorization would fail since

...

additional arguments to be passed to robStandardize.
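As a rough illustration of what centerFun and scaleFun control, the following base R sketch standardizes a vector with a pluggable center and scale estimate. The helper name standardize_sketch and the simulated data are assumptions made for this example; the sketch is not the package's robStandardize and only mirrors the defaults named above (median and mad).

```r
# Hypothetical helper for illustration only; not part of the package.
# Subtract a center estimate and divide by a scale estimate.
standardize_sketch <- function(v, centerFun = median, scaleFun = mad) {
  (v - centerFun(v)) / scaleFun(v)
}

set.seed(1)
v <- c(rnorm(50), 25)  # 50 regular observations plus one gross outlier

# robust standardization (median and MAD, as in the defaults above)
z_robust <- standardize_sketch(v)

# classical standardization (mean and standard deviation) for comparison
z_classic <- standardize_sketch(v, centerFun = mean, scaleFun = sd)

# standardized value of the outlier under both schemes
z_robust[51]
z_classic[51]
```

The outlier inflates the mean and the standard deviation, so its classical score is pulled towards the bulk of the data, whereas the median and MAD are barely affected and the outlier stays clearly separated.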

Value

The robust correlation estimate.

Details

The borders of the main part of the data are defined on the scale of the robustly standardized data.

In univariate winsorization, the borders for each variable are given by $\pm$const, thus a symmetric distribution is assumed.

In adjusted univariate winsorization, the borders for the two diagonally opposing quadrants containing the minority of the data are shrunken by a factor that depends on the ratio between the number of observations in the major and minor quadrants. It is thus possible to better account for the bivariate structure of the data while maintaining fast computation.

In bivariate winsorization, a bivariate normal distribution is assumed and the data are shrunken towards the boundary of a tolerance ellipse with coverage probability prob. The boundary of this ellipse is given by all points whose squared Mahalanobis distance equals the quantile of the $\chi^{2}$ distribution given by prob. The initial correlation matrix required for the Mahalanobis distances is computed based on adjusted univariate winsorization.
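The univariate and bivariate schemes described above can be sketched in base R as follows. This is a minimal sketch under stated assumptions, not the package's implementation: the helper names are hypothetical, the estimate is assumed to be the Pearson correlation of the winsorized data, the $\chi^{2}$ quantile is taken with 2 degrees of freedom (bivariate data), and the initial correlation comes from plain univariate winsorization rather than the adjusted variant.

```r
# Hypothetical helpers for illustration only; not the package's code.

# Univariate winsorization: clamp standardized values to [-const, const]
winsorize_sketch <- function(z, const = 2) pmin(pmax(z, -const), const)

# Bivariate winsorization: shrink points outside the tolerance ellipse
# radially onto its boundary, given an initial correlation r0
winsorize_bivariate_sketch <- function(zx, zy, r0, prob = 0.95) {
  R <- matrix(c(1, r0, r0, 1), 2, 2)
  Z <- cbind(zx, zy)
  d2 <- mahalanobis(Z, center = c(0, 0), cov = R)  # squared distances
  cutoff <- qchisq(prob, df = 2)                   # assumed df = 2
  shrink <- pmin(sqrt(cutoff / d2), 1)             # 1 for points inside
  Z * shrink
}

# simulated data with one gross outlier
set.seed(1234)
x <- rnorm(100)
y <- 0.6 * x + 0.8 * rnorm(100)
x[1] <- 50
y[1] <- -50

# robust standardization with median and MAD
zx <- (x - median(x)) / mad(x)
zy <- (y - median(y)) / mad(y)

r_plain <- cor(x, y)  # classical correlation, dominated by the outlier

# univariate winsorization, then Pearson correlation
wx <- winsorize_sketch(zx)
wy <- winsorize_sketch(zy)
r_uni <- cor(wx, wy)

# bivariate winsorization, using r_uni as a stand-in initial estimate
W <- winsorize_bivariate_sketch(zx, zy, r0 = r_uni)
r_biv <- cor(W[, 1], W[, 2])

c(plain = r_plain, univariate = r_uni, bivariate = r_biv)
```

In the bivariate step, a point with squared Mahalanobis distance d2 above the cutoff is multiplied by sqrt(cutoff / d2), which places it exactly on the ellipse boundary while keeping its direction from the origin.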

References

Khan, J.A., Van Aelst, S. and Zamar, R.H. (2007) Robust linear model selection based on least angle regression. Journal of the American Statistical Association, 102(480), 1289--1299.

Examples

## load packages
library("mvtnorm")   # provides rmvnorm()
library("robustHD")  # provides corHuber()

## generate data
set.seed(1234)  # for reproducibility
Sigma <- matrix(c(1, 0.6, 0.6, 1), 2, 2)
xy <- rmvnorm(100, sigma=Sigma)
x <- xy[, 1]
y <- xy[, 2]

## introduce outlier
x[1] <- x[1] * 10
y[1] <- y[1] * (-5)

## compute correlation
cor(x, y)
corHuber(x, y)