
The speedup over R's standard cor function is substantial, particularly when the input matrix contains only a small number of missing data. If there are no missing data, or if the missing data are numerous, the speedup is smaller but still present.

cor(x, y = NULL,
    use = "all.obs",
    method = c("pearson", "kendall", "spearman"),
    quick = 0,
    cosine = FALSE,
    cosineX = cosine,
    cosineY = cosine,
    nThreads = 0,
    verbose = 0, indent = 0)

corFast(x, y = NULL,
        use = "all.obs",
        quick = 0, nThreads = 0,
        verbose = 0, indent = 0)

cor1(x, use = "all.obs", verbose = 0, indent = 0)

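x: if y is NULL, x must be a matrix.
y: if y is not given, the correlations of the columns of x will be calculated.
use: the fast calculations support "all.obs" and "pairwise.complete.obs"; for other options, see R's standard correlation function cor.
method: the fast calculations are available only for "pearson".
cosine: applies only to method="pearson". Cosine correlation is similar to Pearson correlation, but the mean subtraction is not performed. The result is the cosine of the angle(s) between (the columns of) x and y.
cosineX: use the cosine calculation for x? This setting does not affect y and can be used to give a hybrid cosine-standard correlation.
cosineY: use the cosine calculation for y? This setting does not affect x and can be used to give a hybrid cosine-standard correlation.

The returned value holds the correlations of the columns of x with the columns of y if y is given, and the correlations of the columns of x if y is not given.

The fast calculations are implemented only for method="pearson" and use either "all.obs" or "pairwise.complete.obs".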
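
The cosine correlation described above can be illustrated with a minimal base R sketch. The helper name cosineCor is hypothetical and is not part of the documented functions; the sketch assumes numeric input without missing data and is not the package's implementation.

```r
# Hypothetical helper (not part of the documented API): cosine correlation
# between the columns of x and the columns of y, assuming no missing data.
cosineCor <- function(x, y = x) {
  x <- as.matrix(x)
  y <- as.matrix(y)
  # Column norms; unlike Pearson correlation, no column means are subtracted.
  normX <- sqrt(colSums(x^2))
  normY <- sqrt(colSums(y^2))
  # Inner products of all column pairs, scaled by the product of the norms.
  crossprod(x, y) / outer(normX, normY)
}

set.seed(1)
m <- matrix(rnorm(20 * 3), 20, 3)

# Centering the columns first turns the cosine into the Pearson correlation.
centered <- scale(m, center = TRUE, scale = FALSE)
all.equal(cosineCor(centered), stats::cor(m), check.attributes = FALSE)
```

The final comparison shows the relationship stated in the text: once the column means are subtracted, the cosine of the angle between two columns equals their Pearson correlation.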
The corFast function is a wrapper that calls the function cor. If the combination of method and use is implemented by the fast calculations, the fast code is executed; otherwise, R's own correlation function cor is executed.

The argument quick specifies how precisely missing data are handled. Zero causes all calculations to be executed exactly, which may be significantly slower than calculations without missing data. Progressively higher values speed up the calculations but introduce progressively larger errors. Without missing data, all column means and variances can be pre-calculated before the covariances are calculated. When missing data are present, exact calculations require the column means and variances to be recalculated for each covariance. The approximate calculation uses the pre-calculated mean and variance and simply ignores missing data in the covariance calculation. If the number of missing data is high, the pre-calculated means and variances may be very different from the actual ones, potentially introducing large errors.
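
The difference between the exact and the approximate treatment of missing data can be sketched for a single pair of vectors in base R. The helpers exactCor and approxCor are hypothetical names, and approxCor reflects only one assumed reading of the description above (standardize each vector once on its own observed entries, then let missing rows contribute nothing); the package's internal code may differ.

```r
# Exact pairwise correlation: means and variances are recomputed on the rows
# where both vectors are observed.
exactCor <- function(a, b) {
  ok <- !is.na(a) & !is.na(b)
  stats::cor(a[ok], b[ok])
}

# Approximate correlation (an assumed reading of the description): each vector
# is centered and scaled once using its own observed entries, and rows with a
# missing value contribute nothing to the cross-product.
approxCor <- function(a, b) {
  standardize <- function(v) {
    centered <- v - mean(v, na.rm = TRUE)
    scaled <- centered / sqrt(sum(centered^2, na.rm = TRUE))
    scaled[is.na(scaled)] <- 0
    scaled
  }
  sum(standardize(a) * standardize(b))
}

set.seed(2)
a <- rnorm(200)
b <- 0.5 * a + rnorm(200)
aMissing <- a
aMissing[sample(200, 5)] <- NA

# Without missing data the two calculations agree.
all.equal(exactCor(a, b), approxCor(a, b))

# With a few missing entries the approximation deviates from the exact value.
c(exact = exactCor(aMissing, b), approximate = approxCor(aMissing, b))
```

In this sketch the deviation comes from the mean and variance of b, which are pre-calculated on all 200 rows although only 195 rows enter the cross-product.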
The quick value multiplied by the number of rows gives the maximum tolerated difference between the number of missing entries used in the mean and variance calculations and the number used in the covariance calculation; a larger difference triggers a recalculation. The hope is that if only a few missing data are treated approximately, the error introduced will be small while the potential speedup can be significant.
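
The threshold rule can be written out for one pair of vectors as follows. This is a sketch of one possible reading of the rule; the helper name needsRecalculation and the exact form of the comparison are assumptions, not the package's code.

```r
# Hypothetical helper illustrating one reading of the rule; the names and the
# exact comparison are assumptions, not the package's code.
needsRecalculation <- function(a, b, quick) {
  nRows <- length(a)
  # Entries dropped from the covariance: rows where either value is missing.
  nMissingCov <- sum(is.na(a) | is.na(b))
  # Entries missing when each mean and variance was pre-calculated.
  extraMissingA <- nMissingCov - sum(is.na(a))
  extraMissingB <- nMissingCov - sum(is.na(b))
  # Recalculate exactly once the larger difference exceeds quick * nRows.
  max(extraMissingA, extraMissingB) > quick * nRows
}

a <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
b <- c(NA, NA, 3, 4, 5, 6, 7, 8, 9, 10)

needsRecalculation(a, b, quick = 0)    # TRUE: two extra missing rows, zero tolerance
needsRecalculation(a, b, quick = 0.5)  # FALSE: tolerance is 0.5 * 10 = 5 rows
```

With quick = 0 any mismatch forces the exact calculation, which matches the statement that zero causes all calculations to be executed exactly.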

See also R's standard correlation function cor.

## Test the speedup compared to the standard function cor
# Generate a random matrix with 200 rows and 1000 columns
set.seed(10)
nrow = 200;
ncol = 1000;
data = matrix(rnorm(nrow*ncol), nrow, ncol);
## First test: no missing data
system.time( {corStd = stats::cor(data)} );
system.time( {corFast = cor(data)} );
all.equal(corStd, corFast)
# Here R's standard correlation performs very well.
# We now add a few missing entries.
data[sample(nrow, 10), 1] = NA;
# And test the correlations again...
system.time( {corStd = stats::cor(data, use = 'p')} );
system.time( {corFast = cor(data, use = 'p')} );
all.equal(corStd, corFast)
# Here R's standard correlation slows down considerably, while corFast still retains its speed.