These functions implement a faster calculation of (weighted) Pearson correlation.
The speedup over R's standard cor function is most substantial when the input matrix
contains only a small number of missing entries. If there are no missing data, or if the
missing data are numerous, the speedup will be smaller.
cor(x, y = NULL,
use = "all.obs",
method = c("pearson", "kendall", "spearman"),
weights.x = NULL,
weights.y = NULL,
quick = 0,
cosine = FALSE,
cosineX = cosine,
cosineY = cosine,
drop = FALSE,
nThreads = 0,
verbose = 0, indent = 0)
corFast(x, y = NULL,
use = "all.obs",
quick = 0, nThreads = 0,
verbose = 0, indent = 0)
cor1(x, use = "all.obs", verbose = 0, indent = 0)
x: a numeric vector or a matrix. If y is NULL, x must be a matrix.
y: a numeric vector or a matrix. If not given, correlations of the columns of x will be calculated.
use: a character string specifying the handling of missing data. The fast calculations currently support "all.obs" and "pairwise.complete.obs"; for other options, see R's standard correlation function cor. Abbreviations are allowed.
method: a character string specifying the method to be used. Fast calculations are currently available only for "pearson".
weights.x: optional observation weights for x. A matrix of the same dimensions as x, containing non-negative weights. Only used in fast calculations: method must be "pearson" and use must be one of "all.obs" or "pairwise.complete.obs".
weights.y: optional observation weights for y. A matrix of the same dimensions as y, containing non-negative weights. Only used in fast calculations: method must be "pearson" and use must be one of "all.obs" or "pairwise.complete.obs".
quick: a real number between 0 and 1 that controls the precision of handling of missing data in the calculation of correlations. See details.
cosine: logical; calculate cosine correlation? Only valid for method="pearson". Cosine correlation is similar to Pearson correlation but the mean subtraction is not performed. The result is the cosine of the angle(s) between (the columns of) x and y.
cosineX: logical; use the cosine calculation for x? This setting does not affect y and can be used to give a hybrid cosine-standard correlation.
cosineY: logical; use the cosine calculation for y? This setting does not affect x and can be used to give a hybrid cosine-standard correlation.
drop: logical; should the result be turned into a vector if it is effectively one-dimensional?
nThreads: non-negative integer specifying the number of parallel threads to be used by certain parts of the correlation calculations. This option only has an effect on systems on which a POSIX thread library is available (which currently includes Linux and Mac OS X, but excludes Windows). If zero, the number of online processors will be used if it can be determined dynamically; otherwise correlation calculations will use 2 threads. Note that this option does not affect what is usually the most expensive part of the calculation, namely the matrix multiplication. The matrix multiplication is carried out by BLAS routines provided by R; these can be sped up by installing a fast BLAS and making R use it.
verbose: controls the level of verbosity. Values above zero will cause a small number of diagnostic messages to be printed.
indent: indentation of printed diagnostic messages. Each unit above zero adds two spaces.
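The cosine option described in the arguments above can be illustrated with a small base-R sketch. It does not use the fast code documented here; cosine_cor is a hypothetical helper written only for illustration, and it assumes complete data without weights. It shows that skipping the mean subtraction gives the cosine of the angle between columns, and that centering the columns first recovers the ordinary Pearson correlation.

```r
# Base-R sketch of cosine correlation between the columns of x and y.
# cosine_cor is a hypothetical helper, not part of the documented API.
cosine_cor <- function(x, y = x) {
  x <- as.matrix(x)
  y <- as.matrix(y)
  # Scale each column to unit Euclidean length; no mean is subtracted.
  xs <- sweep(x, 2, sqrt(colSums(x^2)), "/")
  ys <- sweep(y, 2, sqrt(colSums(y^2)), "/")
  # Inner products of unit-length columns are cosines of the angles.
  crossprod(xs, ys)
}

set.seed(1)
m <- matrix(rnorm(20 * 3, mean = 2), 20, 3)
cosine_cor(m)

# After centering the columns, the cosine coincides with Pearson correlation.
centered <- scale(m, center = TRUE, scale = FALSE)
all.equal(cosine_cor(centered), stats::cor(m), check.attributes = FALSE)
```

A hybrid cosine-standard correlation, as selected by cosineX or cosineY, would correspond to centering only one of the two inputs before calling such a helper.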
The matrix of the Pearson correlations of the columns of x with the columns of y if y is given, and the correlations of the columns of x if y is not given.
The fast calculations are currently implemented only for method="pearson" and use either "all.obs" or "pairwise.complete.obs".
The corFast function is a wrapper that calls the function cor. If the combination of method and use is implemented by the fast calculations, the fast code is executed; otherwise, R's own correlation cor is executed.
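The fallback rule described above can be sketched as a small predicate. This is only an illustration of the stated rule, not the package's actual dispatch code; fast_supported is a hypothetical name, and the list of use options is the one accepted by R's standard cor.

```r
# Hypothetical predicate illustrating when the fast code path applies.
# fast_supported is an invented name, not part of the documented API.
fast_supported <- function(method, use) {
  use_options <- c("all.obs", "complete.obs", "pairwise.complete.obs",
                   "everything", "na.or.complete")
  # Expand abbreviations such as "p" to the full option name.
  use_full <- match.arg(use, use_options)
  identical(method, "pearson") &&
    use_full %in% c("all.obs", "pairwise.complete.obs")
}

fast_supported("pearson", "p")              # fast code would be used
fast_supported("spearman", "p")             # falls back to R's own cor
fast_supported("pearson", "complete.obs")   # falls back to R's own cor
```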
The argument quick specifies the precision of handling of missing data. Zero will cause all calculations to be executed exactly, which may be significantly slower than calculations without missing data. Progressively higher values will speed up the calculations but introduce progressively larger errors. Without missing data, all column means and variances can be pre-calculated before the covariances are calculated. When missing data are present, exact calculations require the column means and variances to be recalculated for each covariance. The approximate calculation uses the pre-calculated means and variances and simply ignores missing data in the covariance calculation. If the number of missing entries is high, the pre-calculated means and variances may be very different from the actual ones, thus potentially introducing large errors.
The quick value times the number of rows specifies the maximum tolerated difference in the number of missing entries between the mean and variance calculations on the one hand and the covariance calculation on the other; a larger difference triggers a recalculation. The hope is that if only a few missing data are treated approximately, the error introduced will be small but the potential speedup can be significant.
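The trade-off controlled by quick can be sketched for a single pair of vectors in base R. This is a simplified illustration under stated assumptions, not the package's implementation: the helper names (unit_scale, exact_cor, approx_cor, pair_cor) are hypothetical, weights are ignored, and the way the difference in missing entries is counted is one plausible reading of the description above.

```r
# Center a vector and scale it to unit length using all its non-missing entries.
unit_scale <- function(v) {
  d <- v - mean(v, na.rm = TRUE)
  d / sqrt(sum(d^2, na.rm = TRUE))
}

# Exact pairwise-complete correlation: means and variances are recomputed
# on the rows where both vectors are observed.
exact_cor <- function(a, b) {
  ok <- !is.na(a) & !is.na(b)
  sum(unit_scale(a[ok]) * unit_scale(b[ok]))
}

# Approximation: reuse the per-vector means and variances and simply skip
# rows with missing values in the cross-product.
approx_cor <- function(a, b) {
  sum(unit_scale(a) * unit_scale(b), na.rm = TRUE)
}

# Assumed rule: recalculate exactly when the number of rows that entered the
# pre-calculated statistics but not the covariance exceeds quick * nrow.
pair_cor <- function(a, b, quick = 0) {
  n_pair <- sum(!is.na(a) & !is.na(b))
  n_single <- max(sum(!is.na(a)), sum(!is.na(b)))
  if (n_single - n_pair > quick * length(a)) exact_cor(a, b) else approx_cor(a, b)
}

set.seed(2)
a <- rnorm(100)
b <- a + rnorm(100)
b[1:3] <- NA
c(exact = pair_cor(a, b, quick = 0), approximate = pair_cor(a, b, quick = 0.05))
```

With quick = 0 the three extra missing rows force the exact calculation, which agrees with stats::cor(a, b, use = "pairwise.complete.obs"); with quick = 0.05 up to five such rows are tolerated and the approximate value is returned instead.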
Peter Langfelder, Steve Horvath (2012) Fast R Functions for Robust Correlations and Hierarchical Clustering. Journal of Statistical Software, 46(11), 1-17. https://www.jstatsoft.org/v46/i11/
R's standard Pearson correlation function cor.
# NOT RUN {
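## Test the speedup compared to the standard function stats::cor
library(WGCNA)  # the fast cor() documented here is provided by WGCNA and masks stats::cor
# Generate a random matrix with 100 rows and 500 columns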
set.seed(10)
nrow = 100;
ncol = 500;
data = matrix(rnorm(nrow*ncol), nrow, ncol);
## First test: no missing data
system.time( {corStd = stats::cor(data)} );
system.time( {corFast = cor(data)} );
all.equal(corStd, corFast)
# Here R's standard correlation performs very well.
# We now add a few missing entries.
data[sample(nrow, 10), 1] = NA;
# And test the correlations again...
system.time( {corStd = stats::cor(data, use ='p')} );
system.time( {corFast = cor(data, use = 'p')} );
all.equal(corStd, corFast)
# Here R's standard correlation slows down considerably
# while the fast cor still retains its speed. Choosing
# a higher ncol above will make the difference more pronounced.
# }