corSparse: Pearson correlation between columns (sparse matrices)

Description

This function computes the Pearson product-moment correlation coefficients between the columns of sparse matrices. It is faster than the cor function from the stats package (see the timing comparison in the examples). However, because the resulting matrix is not sparse, this function still cannot be used with very large matrices.

Usage

corSparse(X, Y = NULL, cov = FALSE)

Arguments

X

a sparse matrix in a format of the Matrix package, typically dgCMatrix. The correlations will be calculated between the columns of this matrix.

Y

a second matrix in a format of the Matrix package, with the same number of rows as X. When Y = NULL, the correlations between the columns of X and itself are computed. If Y is specified, the correlations between the columns of X and the columns of Y are computed.

cov

when TRUE the covariance matrix is returned, instead of the default correlation matrix.

Value

The result is a regular square (non-sparse!) Matrix with the Pearson product-moment correlation coefficients between the columns of X.

When Y is specified, the result is a rectangular (non-sparse!) Matrix of size ncol(X) by ncol(Y) with the correlation coefficients between the columns of X and Y.

When cov = TRUE, the result is a covariance matrix (i.e. a correlation that has not been normalized by the standard deviations).
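The three call forms described above can be illustrated with a small sketch. It assumes that both the Matrix and qlcMatrix packages are installed; the test matrices are made up with rsparsematrix from the Matrix package and are only there for illustration.

```r
library(Matrix)
library(qlcMatrix)

set.seed(3)
# Hypothetical test data: two sparse matrices with the same number of rows
X <- rsparsematrix(100, 4, density = 0.2)
Y <- rsparsematrix(100, 6, density = 0.2)

dim(corSparse(X))               # square: ncol(X) by ncol(X)
dim(corSparse(X, Y))            # rectangular: ncol(X) by ncol(Y)
S <- corSparse(X, cov = TRUE)   # covariance matrix instead of correlations
```

Note that X and Y may have different numbers of columns, but they need the same number of rows, because each row is one observation.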

Details

To compute the covariance matrix, the code uses the principle that $$E[(X - \mu(X))' (Y - \mu(Y))] = E[X' Y] - \mu(X') \mu(Y)$$ With sample correction n/(n-1) this leads to the covariance between X and Y as $$( X' Y - n * \mu(X') \mu(Y) ) / (n-1)$$
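The covariance formula above can be checked by hand. The following is a minimal sketch, not the package's actual source: it assumes only the Matrix package, uses made-up test data from rsparsematrix, and compares the result with cov from the stats package. The variable names (covXY, ref) are chosen here for illustration.

```r
library(Matrix)

set.seed(1)
# Hypothetical test data: sparse matrices with about 10% non-zero entries
X <- rsparsematrix(200, 5, density = 0.1)
Y <- rsparsematrix(200, 3, density = 0.1)

n   <- nrow(X)
muX <- colMeans(X)
muY <- colMeans(Y)

# (X'Y - n * mu(X)' mu(Y)) / (n - 1)
# Only the crossproduct touches the sparse data; the mean term is a small dense outer product.
covXY <- (as.matrix(crossprod(X, Y)) - n * tcrossprod(muX, muY)) / (n - 1)

# Dense reference computed by the stats package
ref <- cov(as.matrix(X), as.matrix(Y))
all.equal(covXY, ref, check.attributes = FALSE)
```

The point of this formulation is that X never has to be centered explicitly, which would destroy its sparsity.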

The standard deviations (needed to turn covariance into correlation) are trivial to obtain in the case Y = NULL, because the variances are found on the diagonal of the covariance matrix. When Y is not NULL, the code uses the principle that $$E[(X - \mu(X))^2] = E[X^2] - \mu(X)^2$$ With sample correction n/(n-1) this leads to $$sd^2 = ( \sum X^2 - n * \mu(X)^2 ) / (n-1)$$ where the sum runs over the entries of each column.
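The standard deviation step can be sketched in the same way. Again this is an illustration under stated assumptions rather than the package's own code: it uses only the Matrix package, made-up test data, and variable names (sdX, sdY, corXY) chosen here, and it checks the result against cor and sd from the stats package.

```r
library(Matrix)

set.seed(2)
# Hypothetical test data: sparse matrices with about 10% non-zero entries
X <- rsparsematrix(200, 5, density = 0.1)
Y <- rsparsematrix(200, 3, density = 0.1)

n   <- nrow(X)
muX <- colMeans(X)
muY <- colMeans(Y)

# sd^2 = (sum of squares - n * mean^2) / (n - 1), computed per column.
# Squaring a sparse matrix keeps it sparse, because 0^2 = 0.
sdX <- sqrt((colSums(X^2) - n * muX^2) / (n - 1))
sdY <- sqrt((colSums(Y^2) - n * muY^2) / (n - 1))

# Covariance as in the first formula, then normalize by the outer product of the sds
covXY <- (as.matrix(crossprod(X, Y)) - n * tcrossprod(muX, muY)) / (n - 1)
corXY <- covXY / tcrossprod(sdX, sdY)

# Dense references computed by the stats package
all.equal(corXY, cor(as.matrix(X), as.matrix(Y)), check.attributes = FALSE)
all.equal(sdX, apply(as.matrix(X), 2, sd))
```

A column that contains only zeros has a standard deviation of zero, so the division would produce NaN for that column; this sketch does not handle that case.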

Examples


# NOT RUN {
# reasonably fast (though not instantly!) with
# sparse matrices up to a resulting matrix size of 1e8 cells.
# However, the calculations and the resulting matrix take up lots of memory

X <- rSparseMatrix(1e4, 1e4, 1e5)
system.time(M <- corSparse(X))
print(object.size(M), units = "auto") # more than 750 Mb

# Most values are low, so it often makes sense 
# to remove low values to keep results sparse

M <- drop0(M, tol = 0.4)
print(object.size(M), units = "auto") # normally reduces size by half or more
length(M@x) / prod(dim(M)) # down to less than 0.05% non-zero entries
# }
# NOT RUN {
# comparison with other methods
# corSparse is much faster than cor from the stats package
# but cosSparse is even quicker than both!

X <- rSparseMatrix(1e3, 1e3, 1e4)
X2 <- as.matrix(X)

# if there is a warning, try again with different random X
system.time(McorRegular <- cor(X2)) 
system.time(McorSparse <- corSparse(X))
system.time(McosSparse <- cosSparse(X))

# cor and corSparse give identical results

all.equal(McorSparse, McorRegular)

# corSparse and cosSparse are not identical, but close

McosSparse <- as.matrix(McosSparse)
dimnames(McosSparse) <- NULL
all.equal(McorSparse, McosSparse) 

# Actually, cosSparse and corSparse are *almost* identical!

cor(as.dist(McorSparse), as.dist(McosSparse))

# Visually it looks completely identical
# Note: this takes some time to plot

# }
# NOT RUN {
plot(as.dist(McorSparse), as.dist(McosSparse))
# }
# NOT RUN {
# So: consider using cosSparse instead of cor or corSparse.
# With sparse matrices, this gives mostly the same results, 
# but much larger matrices are possible
# and the computations are quicker and the results sparser

# }
