Compute the RMAD robust correlation matrix proposed in Serra et al. (2018) based on the robust correlation coefficient proposed in Pasman and Shevlyakov (1987).
rmad(x , y = NULL, na.rm = FALSE , even.correction = FALSE)
A numeric vector, a matrix or a data.frame. If x
is a matrix
or a data.frame, rows of x
correspond to sample units
and columns correspond to variables. If x
is a numerical
vector, and y
is not NULL
, the RMAD correlation
coefficient between x
and y
is computed. Categorical
variables are not allowed.
A numerical vector if not NULL
. If both x
and y
are numerical vectors, the RMAD correlation coefficient between
x
and y
is computed.
A logical value, if TRUE
sample observation
containing NA
values are excluded (see Details).
A logical value, if TRUE
a correction
for the calculation of the medians is applied to reduce the bias
when the number of samples even (see Details).
x
is a matrix or a data.frameReturns a correlation matrix of class "dspMatrix"
(S4 class object)
as defined in the Matrix
package.
x
and y
are numerical vectorsReturns a numerical value, that is the RMAD correlation coefficient
between x
and y
.
The rmad
function computes the correlation matrix based on the
pairwise robust correlation coefficient of Pasman and Shevlyakov
(1987). This correlation coefficient is based on repeated median
calculations for all pairs of variables. This is a computational
intensive task when the number of variables (that is ncol(x)
)
is large.
The software is optimized for large dimensional data sets, the median
is approximated as the central observation obtained based on the
introselect sorting algorithm of Musser (1997) implemented in
Fortran 95 language. For small samples this may be a crude
approximation, however, it makes the computational cost feasible for
high-dimensional data sets. With the option even.correction
= TRUE
a correction is applied to reduce the bias for data sets with
an even number of samples. Although even.correction = TRUE
has a small computational cost for each pair of variables, it is
suggested to use the default even.correction = FALSE
for large
dimensional data sets.
The function can handle a data matrix with missing values (NA
records). If na.rm = TRUE
then missing values are handled by
casewise deletion (and if there are no complete cases, an error is
returned). In practice, if na.rm = TRUE
all rows of
x
that contain at least an NA
are removed.
Since the software is optimized to work with high-dimensional data sets,
the output RMAD matrix is packed into a storage efficient format
using the "dspMatrix"
S4 class from the Matrix
package. The latter is specifically designed for dense real symmetric
matrices. A sparse correlation matrix can be obtained applying
thresholding using the rsc_cv
and rsc
.
Musser, D. R. (1997). Introspective sorting and selection algorithms. Software: Practice and Experience, 27(8), 983-993.
Pasman,V. and Shevlyakov,G. (1987). Robust methods of estimation of correlation coefficient. Automation Remote Control, 48, 332-340.
Serra, A., Coretto, P., Fratello, M., and Tagliaferri, R. (2018). Robust and sparsecorrelation matrix estimation for the analysis of high-dimensional genomics data. Bioinformatics, 34(4), 625-634. doi: 10.1093/bioinformatics/btx642
rsc_cv
, rsc
# NOT RUN {
## simulate a random sample from a multivariate Cauchy distribution
set.seed(1)
n <- 100 # sample size
p <- 7 # dimension
dat <- matrix(rt(n*p, df = 1), nrow = n, ncol = p)
colnames(dat) <- paste0("Var", 1:p)
## compute the rmad correlation coefficient between dat[,1] and dat[,2]
a <- rmad(x = dat[,1], y = dat[,2])
## compute the RMAD correlaiton matrix
b <- rmad(x = dat)
b
# }
Run the code above in your browser using DataLab