rmad: RMAD correlation matrix

Description

Compute the RMAD robust correlation matrix proposed in Serra et al. (2018) based on the robust correlation coefficient proposed in Pasman and Shevlyakov (1987).

Usage

rmad(x , y = NULL, na.rm = FALSE , even.correction = FALSE, num.threads = "half-max")

Value

If x is a matrix or a data.frame, returns a correlation matrix of class "dspMatrix" (S4 class object) as defined in the Matrix package.

If x and y are numerical vectors, returns a numerical value, that is the RMAD correlation coefficient between x and y.

Arguments

x: A numeric vector, a matrix or a data.frame. If x is a matrix or a data.frame, rows of x correspond to sample units and columns correspond to variables. If x is a numerical vector, and y is not NULL, the RMAD correlation coefficient between x and y is computed. Categorical variables are not allowed.
y: A numerical vector if not NULL. If both x and y are numerical vectors, the RMAD correlation coefficient between x and y is computed.
na.rm: A logical value, if TRUE sample observation containing NA values are excluded (see Details).
even.correction: A logical value, if TRUE a correction for the calculation of the medians is applied to reduce the bias when the number of samples even (see Details).
num.threads: An integer value or the string "half-max" (default), specifying the number of threads for parallel execution (see Details).

Details

The rmad function computes the correlation matrix based on the pairwise robust correlation coefficient of Pasman and Shevlyakov (1987). This correlation coefficient is based on repeated median calculations for all pairs of variables. This is a computational intensive task when the number of variables (that is ncol(x)) is large.

The software is optimized for large dimensional data sets, the median is approximated as the central observation obtained based on the find algorithm of Hoare (1961) (also known as quickselect) implemented in C language. For small samples this may be a crude approximation, however, it makes the computational cost feasible for high-dimensional data sets. With the option even.correction = TRUE a correction is applied to reduce the bias for data sets with an even number of samples. Although even.correction = TRUE has a small computational cost for each pair of variables, it is suggested to use the default even.correction = FALSE for large dimensional data sets.

The function can handle a data matrix with missing values (NA records). If na.rm = TRUE then missing values are handled by casewise deletion (and if there are no complete cases, an error is returned). In practice, if na.rm = TRUE all rows of x that contain at least an NA are removed.

Since the software is optimized to work with high-dimensional data sets, the output RMAD matrix is packed into a storage efficient format using the "dspMatrix" S4 class from the Matrix package. The latter is specifically designed for dense real symmetric matrices. A sparse correlation matrix can be obtained applying thresholding using the rsc_cv and rsc.

rmad function supports parallel execution. This is provided via openmp (http://www.openmp.org), which must be already available on the system at installation time; otherwise, falls back to single-core execution. For later installation of openmp, the RSC package needs to be re-installed (re-compiled) to provide multi-threads execution. If num.threads > 0, function is executed using min(num.threads, max.threads) threads, where max.threads is the maximum number of available threads. That is, if positive the specified number of threads (up to the maximum available) are used. If num.threads < 0, function is executed using max(max.threads - num.threads, 1) threads, i.e. when negative num.threads indicates the number of threads not to use (at least one thread is used). If num.threads == 0, a single thread is used (equivalent to num.threads = 1). If num.threads == "half-max", function is executed using half of the available threads (max(max.threads/2, 1)). This is the default.

References

Hoare, C. A. (1961). Algorithm 65: find. Communications of the ACM, 4(7), 321-322.

Musser, D. R. (1997). Introspective sorting and selection algorithms. Software: Practice and Experience, 27(8), 983-993.

Pasman,V. and Shevlyakov,G. (1987). Robust methods of estimation of correlation coefficient. Automation Remote Control, 48, 332-340.

Serra, A., Coretto, P., Fratello, M., and Tagliaferri, R. (2018). Robust and sparsecorrelation matrix estimation for the analysis of high-dimensional genomics data. Bioinformatics, 34(4), 625-634. doi: 10.1093/bioinformatics/btx642

Examples

Run this code

## simulate a random sample from a multivariate Cauchy distribution
set.seed(1)
n   <- 100    # sample size
p   <- 7      # dimension
dat <- matrix(rt(n*p, df = 1), nrow = n, ncol = p)
colnames(dat) <- paste0("Var", 1:p)

   
## compute the rmad correlation coefficient between dat[,1] and dat[,2]
a <- rmad(x = dat[,1], y = dat[,2])


## compute the RMAD correlaiton matrix   
b <- rmad(x = dat)
b

Run the code above in your browser using DataLab