This function computes the product-moment correlation coefficients between the columns of sparse matrices. Performance-wise, this improves over the approach taken in the `cor` function. However, because the resulting matrix is not sparse, this function still cannot be used with very large matrices.
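To get a feel for why the dense result is the limiting factor, the following back-of-the-envelope sketch estimates the memory needed for a dense matrix of doubles, assuming 8 bytes per cell and ignoring object overhead. The helper name `dense_result_bytes` is hypothetical and not part of any package.

```r
# Hypothetical helper (not part of any package): approximate memory, in bytes,
# of a dense numeric matrix with the given dimensions (8 bytes per double).
dense_result_bytes <- function(n_col_x, n_col_y = n_col_x) {
  8 * as.numeric(n_col_x) * as.numeric(n_col_y)
}

# Dense correlation matrix for 1e4 columns, in MiB (roughly 763)
dense_result_bytes(1e4) / 2^20

# The same for 1e5 columns, in GiB (roughly 74.5)
dense_result_bytes(1e5) / 2^30
```

Because the cell count grows with the square of the number of columns, a tenfold increase in columns costs a hundredfold increase in memory, no matter how sparse the input is.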

`corSparse(X, Y = NULL, cov = FALSE)`

X

a sparse matrix in a format of the `Matrix` package, typically `dgCMatrix`. The correlations will be calculated between the columns of this matrix.

Y

a second matrix in a format of the `Matrix` package. When `Y = NULL`, the correlations between the columns of `X` and itself are computed. If `Y` is specified, the association between the columns of `X` and the columns of `Y` is calculated.

cov

when `TRUE`, the covariance matrix is returned instead of the default correlation matrix.

The result is a regular square (non-sparse!) Matrix with the Pearson product-moment correlation coefficients between the columns of `X`.

When `Y` is specified, the result is a rectangular (non-sparse!) Matrix of size `ncol(X)` by `ncol(Y)` with the correlation coefficients between the columns of `X` and `Y`.

When `cov = TRUE`, the result is a covariance matrix (i.e. a non-normalized correlation).

To compute the covariance matrix, the code uses the principle that
$$E[(X - \mu(X))' (Y - \mu(Y))] = E[X' Y] - \mu(X') \mu(Y)$$
With the sample correction n/(n-1), this gives the covariance between X and Y as
$$( X' Y - n * \mu(X') \mu(Y) ) / (n-1)$$
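The covariance formula can be sketched in a few lines of R. This is a minimal illustration using only the `Matrix` package, not the package's actual source; the function name `sparse_cov` is hypothetical, and the result is checked against `cov` from the stats package on a small random matrix.

```r
library(Matrix)

# Hypothetical sketch of the covariance formula: (X'Y - n * mu(X') mu(Y)) / (n - 1)
sparse_cov <- function(X, Y = X) {
  n <- nrow(X)
  mu_x <- colMeans(X)
  mu_y <- colMeans(Y)
  # crossprod(X, Y) computes X'Y and can exploit sparsity; the outer product
  # of the column means is dense, which is why the result is dense as well.
  (as.matrix(crossprod(X, Y)) - n * outer(mu_x, mu_y)) / (n - 1)
}

set.seed(1)
X <- rsparsematrix(100, 5, density = 0.3)

# Expected to agree with the dense computation up to floating-point tolerance
all.equal(sparse_cov(X), cov(as.matrix(X)), check.attributes = FALSE)
```

The point of the rearrangement is that `X` itself is never centered: subtracting the column means from a sparse matrix would fill in all the zeros, whereas `crossprod` works on the sparse representation directly.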

Computing the standard deviations (to turn the covariance into a correlation) is trivial in the case `Y = NULL`, as the variances are found on the diagonal of the covariance matrix. In the case `Y != NULL`, the code uses the principle that
$$E[(X - \mu(X))^2] = E[X^2] - \mu(X)^2$$
With the sample correction n/(n-1), this leads to
$$sd^2 = ( \sum X^2 - n * \mu(X)^2 ) / (n-1)$$
where the sum runs over the n entries of a column.
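As a rough illustration of the two-matrix case, the sketch below combines the covariance formula with the column-wise standard deviations computed from sums of squares. It relies only on the `Matrix` package; the name `sparse_cor_xy` is hypothetical and the code is not taken from the package source.

```r
library(Matrix)

# Hypothetical sketch: correlation between the columns of X and the columns of Y
sparse_cor_xy <- function(X, Y) {
  n <- nrow(X)
  mu_x <- colMeans(X)
  mu_y <- colMeans(Y)
  # Covariance: (X'Y - n * mu(X') mu(Y)) / (n - 1)
  cov_xy <- (as.matrix(crossprod(X, Y)) - n * outer(mu_x, mu_y)) / (n - 1)
  # Per column: sd^2 = (sum of squares - n * mean^2) / (n - 1)
  sd_x <- sqrt((colSums(X^2) - n * mu_x^2) / (n - 1))
  sd_y <- sqrt((colSums(Y^2) - n * mu_y^2) / (n - 1))
  # Normalize each covariance by the product of the two standard deviations
  cov_xy / outer(sd_x, sd_y)
}

set.seed(1)
X <- rsparsematrix(100, 4, density = 0.3)
Y <- rsparsematrix(100, 6, density = 0.3)
R <- sparse_cor_xy(X, Y)

dim(R)  # 4 6, i.e. ncol(X) by ncol(Y)

# Expected to agree with the dense computation up to floating-point tolerance
all.equal(R, cor(as.matrix(X), as.matrix(Y)), check.attributes = FALSE)
```

A column with zero variance would produce a division by zero here, so a real implementation would need to decide how to handle constant columns.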

`cor` in the base packages; `cosSparse` and `assocSparse` for other sparse association measures.

    library(Matrix)
    library(qlcMatrix)

    # NOT RUN {
    # reasonably fast (though not instantly!) with
    # sparse matrices up to a resulting matrix size of 1e8 cells.
    # However, the calculations and the resulting matrix take up lots of memory

    X <- rSparseMatrix(1e4, 1e4, 1e5)
    system.time(M <- corSparse(X))
    print(object.size(M), units = "auto") # more than 750 Mb

    # Most values are low, so it often makes sense
    # to remove low values to keep results sparse

    M <- drop0(M, tol = 0.4)
    print(object.size(M), units = "auto") # normally reduces size by half or more
    length(M@x) / prod(dim(M)) # down to less than 0.05% non-zero entries
    # }

    # NOT RUN {
    # comparison with other methods
    # corSparse is much faster than cor from the stats package
    # but cosSparse is even quicker than both!

    X <- rSparseMatrix(1e3, 1e3, 1e4)
    X2 <- as.matrix(X)

    # if there is a warning, try again with different random X
    system.time(McorRegular <- cor(X2))
    system.time(McorSparse <- corSparse(X))
    system.time(McosSparse <- cosSparse(X))

    # cor and corSparse give identical results
    all.equal(McorSparse, McorRegular)

    # corSparse and cosSparse are not identical, but close
    McosSparse <- as.matrix(McosSparse)
    dimnames(McosSparse) <- NULL
    all.equal(McorSparse, McosSparse)

    # Actually, cosSparse and corSparse are *almost* identical!
    cor(as.dist(McorSparse), as.dist(McosSparse))

    # Visually it looks completely identical
    # Note: this takes some time to plot
    plot(as.dist(McorSparse), as.dist(McosSparse))

    # So: consider using cosSparse instead of cor or corSparse.
    # With sparse matrices, this gives mostly the same results,
    # but much larger matrices are possible
    # and the computations are quicker and more sparse
    # }
