rmad: RMAD correlation matrix

Description

Compute the RMAD robust correlation matrix proposed in Serra et al. (2018) based on the robust correlation coefficient proposed in Pasman and Shevlyakov (1987).

Usage

rmad(x, y = NULL, na.rm = FALSE, even.correction = FALSE)

Arguments

x

A numeric vector, a matrix or a data.frame. If x is a matrix or a data.frame, rows of x correspond to sample units and columns correspond to variables. If x is a numeric vector and y is not NULL, the RMAD correlation coefficient between x and y is computed. Categorical variables are not allowed.

y

Either NULL or a numeric vector. If both x and y are numeric vectors, the RMAD correlation coefficient between x and y is computed.

na.rm

A logical value. If TRUE, sample observations containing NA values are excluded (see Details).

even.correction

A logical value. If TRUE, a correction is applied to the calculation of the medians to reduce the bias when the number of samples is even (see Details).

Value

If x is a matrix or a data.frame

Returns a correlation matrix of class "dspMatrix" (S4 class object) as defined in the Matrix package.

If x and y are numerical vectors

Returns a numerical value, that is the RMAD correlation coefficient between x and y.

Details

The rmad function computes the correlation matrix based on the pairwise robust correlation coefficient of Pasman and Shevlyakov (1987). This correlation coefficient is based on repeated median calculations for all pairs of variables. This is a computationally intensive task when the number of variables (that is, ncol(x)) is large.
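To make the "repeated median calculations" concrete, here is a minimal base-R sketch of a MAD-based correlation coefficient for a single pair of variables. The formula is an assumption based on the form usually attributed to Pasman and Shevlyakov (1987): standardize both variables by median and MAD, then compare the squared MADs of their sum and difference. The name rmad_pair_sketch is hypothetical, exact medians are used, and the result is not guaranteed to match the output of rmad.

```r
# Hypothetical helper, NOT the package implementation: exact medians are used,
# and the formula is an assumed form of the MAD correlation coefficient.
rmad_pair_sketch <- function(x, y) {
  # robustly standardize each variable (raw MAD, no consistency constant)
  zx <- (x - median(x)) / (sqrt(2) * mad(x, constant = 1))
  zy <- (y - median(y)) / (sqrt(2) * mad(y, constant = 1))
  # robust principal variables: sum and difference of the standardized data
  u <- zx + zy
  v <- zx - zy
  mu <- mad(u, constant = 1)^2
  mv <- mad(v, constant = 1)^2
  (mu - mv) / (mu + mv)
}

set.seed(1)
x <- rnorm(50)
y <- x + rnorm(50)
r_xy <- rmad_pair_sketch(x, y)
```

Each pair needs several medians (two for centring, two for scaling, and two more for u and v), so a matrix with p columns requires on the order of p^2 such computations.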

The software is optimized for high-dimensional data sets: the median is approximated as the central observation, obtained with the introselect selection algorithm of Musser (1997) implemented in Fortran 95. For small samples this may be a crude approximation; however, it keeps the computational cost feasible for high-dimensional data sets. With the option even.correction = TRUE a correction is applied to reduce the bias for data sets with an even number of samples. Although even.correction = TRUE adds only a small computational cost for each pair of variables, that cost accumulates over all pairs, so the default even.correction = FALSE is suggested for high-dimensional data sets.
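The difference between a "central observation" and the usual sample median can be illustrated in base R. The sketch below is only an assumption about the idea: it takes the lower of the two middle order statistics via partial sorting, and it mimics a possible even-sample correction by averaging the two middle values. Which central observation rmad selects, and how its correction is computed, is not stated here; central_obs and two_middle_mean are hypothetical names.

```r
# Hypothetical helpers for illustration only; they do not call the package's
# Fortran routine and may differ from what rmad does internally.
central_obs <- function(x) {
  k <- (length(x) + 1) %/% 2      # index of the (lower) middle order statistic
  sort(x, partial = k)[k]         # partial sort: only position k is guaranteed
}

two_middle_mean <- function(x) {
  k <- length(x) %/% 2            # assumes length(x) is even
  s <- sort(x, partial = c(k, k + 1))
  mean(s[c(k, k + 1)])            # average of the two middle order statistics
}

x_even <- c(5, 1, 4, 2)
x_odd  <- c(5, 1, 4)

central_obs(x_odd) == median(x_odd)        # odd n: no approximation error
central_obs(x_even)                        # even n: lower middle value
two_middle_mean(x_even) == median(x_even)  # averaging recovers the median
```

For odd sample sizes the central observation is the median itself, so any bias of this kind only arises when the number of samples is even.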

The function can handle a data matrix with missing values (NA records). If na.rm = TRUE, missing values are handled by casewise deletion: all rows of x that contain at least one NA are removed. If there are no complete cases, an error is returned.
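Casewise deletion as described above can be reproduced in base R before the data are passed to any estimator. This is a sketch of the behaviour, not the package's internal code; the toy matrix and the error message are made up for illustration.

```r
# Toy data: rows 2 and 3 each contain one NA
x <- matrix(c(1,  2, NA,  4,
              5, NA,  7,  8,
              9, 10, 11, 12), nrow = 4, ncol = 3)

keep <- complete.cases(x)          # TRUE for rows without any NA
if (!any(keep)) stop("no complete cases")

x_complete <- x[keep, , drop = FALSE]
nrow(x_complete)                   # rows 1 and 4 remain
```

Note that a single NA discards the whole row for every pair of variables, so variables with many missing values reduce the effective sample size for all correlations.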

Since the software is optimized to work with high-dimensional data sets, the output RMAD matrix is packed into a storage-efficient format using the "dspMatrix" S4 class from the Matrix package. The latter is specifically designed for dense real symmetric matrices. A sparse correlation matrix can be obtained by thresholding with the rsc_cv and rsc functions.
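The packed storage format can be explored without running rmad. The sketch below assumes the Matrix package is installed; it builds a small symmetric matrix, packs it into a "dspMatrix", and converts it back to an ordinary base-R matrix, which is the usual way to hand the result to functions that do not understand S4 Matrix classes. The matrix m is made-up data, not an RMAD estimate.

```r
library(Matrix)

p <- 4
m <- diag(p)
m[upper.tri(m)] <- c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6)
m[lower.tri(m)] <- t(m)[lower.tri(m)]   # mirror to make m symmetric

# dense symmetric matrix -> packed storage (only one triangle is kept)
b <- pack(forceSymmetric(m))
class(b)                 # "dspMatrix"
length(b@x)              # p * (p + 1) / 2 stored values instead of p^2

# back to a plain base-R matrix when needed
m_back <- as.matrix(b)
```

For large p the packed format roughly halves the memory needed for the correlation matrix, at the price of an explicit conversion when a base matrix is required.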

References

Musser, D. R. (1997). Introspective sorting and selection algorithms. Software: Practice and Experience, 27(8), 983-993.

Pasman, V. and Shevlyakov, G. (1987). Robust methods of estimation of correlation coefficient. Automation Remote Control, 48, 332-340.

Serra, A., Coretto, P., Fratello, M., and Tagliaferri, R. (2018). Robust and sparse correlation matrix estimation for the analysis of high-dimensional genomics data. Bioinformatics, 34(4), 625-634. doi: 10.1093/bioinformatics/btx642

Examples

library(RSC)  # package providing rmad

## simulate a random sample with independent Cauchy-distributed columns
set.seed(1)
n   <- 100    # sample size
p   <- 7      # dimension
dat <- matrix(rt(n * p, df = 1), nrow = n, ncol = p)
colnames(dat) <- paste0("Var", 1:p)

## compute the RMAD correlation coefficient between dat[, 1] and dat[, 2]
a <- rmad(x = dat[, 1], y = dat[, 2])

## compute the RMAD correlation matrix
b <- rmad(x = dat)
b