Compute the RMAD robust correlation matrix proposed in Serra et al. (2018) based on the robust correlation coefficient proposed in Pasman and Shevlyakov (1987).
rmad(x , y = NULL, na.rm = FALSE , even.correction = FALSE, num.threads = "half-max")
If x
is a matrix or a data.frame, returns a correlation matrix of class "dspMatrix"
(S4 class object)
as defined in the Matrix
package.
If x
and y
are numerical vectors, returns a numerical value, that is the RMAD correlation coefficient
between x
and y
.
A numeric vector, a matrix or a data.frame. If x
is a matrix
or a data.frame, rows of x
correspond to sample units
and columns correspond to variables. If x
is a numerical
vector, and y
is not NULL
, the RMAD correlation
coefficient between x
and y
is computed. Categorical
variables are not allowed.
A numerical vector if not NULL
. If both x
and y
are numerical vectors, the RMAD correlation coefficient between
x
and y
is computed.
A logical value, if TRUE
sample observation
containing NA
values are excluded (see Details).
A logical value, if TRUE
a correction
for the calculation of the medians is applied to reduce the bias
when the number of samples even (see Details).
An integer value or the string "half-max"
(default), specifying the number of threads for parallel execution (see Details).
The rmad
function computes the correlation matrix based on the
pairwise robust correlation coefficient of Pasman and Shevlyakov
(1987). This correlation coefficient is based on repeated median
calculations for all pairs of variables. This is a computational
intensive task when the number of variables (that is ncol(x)
)
is large.
The software is optimized for large dimensional data sets, the median
is approximated as the central observation obtained based on the
find algorithm of Hoare (1961) (also known as quickselect)
implemented in C language. For small samples this may be a crude
approximation, however, it makes the computational cost feasible for
high-dimensional data sets. With the option even.correction
= TRUE
a correction is applied to reduce the bias for data sets with
an even number of samples. Although even.correction = TRUE
has a small computational cost for each pair of variables, it is
suggested to use the default even.correction = FALSE
for large
dimensional data sets.
The function can handle a data matrix with missing values (NA
records). If na.rm = TRUE
then missing values are handled by
casewise deletion (and if there are no complete cases, an error is
returned). In practice, if na.rm = TRUE
all rows of
x
that contain at least an NA
are removed.
Since the software is optimized to work with high-dimensional data sets,
the output RMAD matrix is packed into a storage efficient format
using the "dspMatrix"
S4 class from the Matrix
package. The latter is specifically designed for dense real symmetric
matrices. A sparse correlation matrix can be obtained applying
thresholding using the rsc_cv
and rsc
.
rmad
function supports parallel execution.
This is provided via openmp (http://www.openmp.org), which must be already available on the system at installation time;
otherwise, falls back to single-core execution.
For later installation of openmp, the RSC package needs to be re-installed (re-compiled) to provide multi-threads execution.
If num.threads > 0
, function is executed using min(num.threads, max.threads)
threads, where max.threads
is the maximum number of available threads. That is, if positive the specified number of threads (up to the maximum available) are used.
If num.threads < 0
, function is executed using max(max.threads - num.threads, 1)
threads, i.e. when negative num.threads
indicates the number of threads not to use (at least one thread is used).
If num.threads == 0
, a single thread is used (equivalent to num.threads = 1
).
If num.threads == "half-max"
, function is executed using half of the available threads (max(max.threads/2, 1)
). This is the default.
Hoare, C. A. (1961). Algorithm 65: find. Communications of the ACM, 4(7), 321-322.
Musser, D. R. (1997). Introspective sorting and selection algorithms. Software: Practice and Experience, 27(8), 983-993.
Pasman,V. and Shevlyakov,G. (1987). Robust methods of estimation of correlation coefficient. Automation Remote Control, 48, 332-340.
Serra, A., Coretto, P., Fratello, M., and Tagliaferri, R. (2018). Robust and sparsecorrelation matrix estimation for the analysis of high-dimensional genomics data. Bioinformatics, 34(4), 625-634. doi: 10.1093/bioinformatics/btx642
rsc_cv
, rsc
## simulate a random sample from a multivariate Cauchy distribution
set.seed(1)
n <- 100 # sample size
p <- 7 # dimension
dat <- matrix(rt(n*p, df = 1), nrow = n, ncol = p)
colnames(dat) <- paste0("Var", 1:p)
## compute the rmad correlation coefficient between dat[,1] and dat[,2]
a <- rmad(x = dat[,1], y = dat[,2])
## compute the RMAD correlaiton matrix
b <- rmad(x = dat)
b
Run the code above in your browser using DataLab