qlcMatrix (version 0.9.2)

assocSparse: Association between columns (sparse matrices)

Description

This function offers an interface to various measures of association between the columns of sparse matrices, all based on functions of `observed' and `expected' values. Currently, the following measures are available: pointwise mutual information (also known as log-odds), a Poisson-based measure, and Pearson residuals. Further measures can easily be defined by the user. The calculations are optimized to deal with large sparse matrices. Note that these association values are only sensibly defined for binary data.

Usage

assocSparse(X, Y = NULL, method = res, N = nrow(X), sparse = TRUE)

Arguments

Value

The result is a sparse matrix with the non-zero association values. Values range between -Inf and +Inf, with values close to zero indicating low association. The exact interpretation of the values depends on the method used.

When Y = NULL, the result is a symmetric matrix, i.e. a matrix of type dsCMatrix with size ncol(X) by ncol(X). When both X and Y are specified, a matrix of type dgCMatrix with size ncol(X) by ncol(Y) is returned.
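
A minimal sketch of the two cases, using the package's rSparseMatrix to generate random test data (as in the Examples below):

library(qlcMatrix)

X <- rSparseMatrix(50, 20, 100, NULL)
Y <- rSparseMatrix(50, 30, 100, NULL)

M1 <- assocSparse(X)      # Y = NULL: symmetric result
class(M1)                 # "dsCMatrix", ncol(X) by ncol(X)

M2 <- assocSparse(X, Y)   # X and Y given: general sparse result
class(M2)                 # "dgCMatrix", ncol(X) by ncol(Y)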

Details

Computations are based on a comparison of the observed interaction crossprod(X,Y) with the expected interaction. The expectation is in principle computed as tcrossprod(colSums(abs(X)), colSums(abs(Y)))/nrow(X), though in practice the code is more efficient than that.
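
As a conceptual sketch (not the package's actual, optimized code path), the observed and expected values passed to the method function can be written out as follows for a binary matrix X:

library(qlcMatrix)

X <- rSparseMatrix(100, 10, 200, NULL) * 1          # random binary 0/1 matrix

o <- crossprod(X)                                   # observed co-occurrence counts
e <- tcrossprod(colSums(X)) / nrow(X)               # expected counts under independence

# Pearson residual for one pair of columns, computed by hand;
# for cells with a non-zero observed count this should match
# the corresponding entry of assocSparse(X, method = res)
(o[1, 2] - e[1, 2]) / sqrt(e[1, 2])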

Note that calculating the observed interaction as crossprod(X,Y) really only makes sense for binary data (i.e. matrices with only ones and zeros). Currently, all input is coerced to such data by as(X, "nMatrix")*1, meaning that all values that are not one or zero are turned into one (including negative values!).
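
A small illustration of this coercion, using a toy matrix with a negative value:

library(Matrix)

X <- Matrix(c(0, -2, 0.5, 0, 3, 0), nrow = 2, sparse = TRUE)
as(X, "nMatrix") * 1    # every non-zero entry (including -2) becomes 1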

Any method can be defined as a function with two arguments, o and e, e.g. simply by specifying method = function(o,e){o/e}. See below for more examples.

The predefined functions are:

  • pmi: pointwise mutual information, also known as log-odds in bioinformatics, defined as pmi <- function(o,e) { log(o/e) }.
  • wpmi: weighted pointwise mutual information, defined as wpmi <- function(o,e) { o * log(o/e) }.
  • res: Pearson residuals, defined as res <- function(o,e) { (o-e) / sqrt(e) }.
  • poi: association assuming a Poisson distribution of the values, defined as poi <- function(o,e) { sign(o-e) * (o * log(o/e) - (o-e)) }. This measure seems to be very useful when the non-zero data is strongly skewed along the rows, i.e. when some rows are much fuller than others. A short explanation of this method can be found in Prokić and Cysouw (2013).
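
As a quick sketch of how these measures compare, the following applies each predefined method to the same random sparse matrix and summarizes the resulting non-zero association values (pmi, wpmi, res and poi are the functions shipped with the package, used here exactly as in the Examples below):

library(qlcMatrix)

X <- rSparseMatrix(50, 20, 150, NULL)

sapply(list(pmi = pmi, wpmi = wpmi, res = res, poi = poi),
       function(f) summary(assocSparse(X, method = f)@x))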

References

Prokić, Jelena & Michael Cysouw. 2013. Combining regular sound correspondences and geographic spread. Language Dynamics and Change 3(2). 147--168.

See Also

See assocCol and assocRow for this measure defined for nominal data. Also, see cor.sparse and cosSparse for other sparse association measures.

Examples

# ----- reasonably fast with large very sparse matrices -----

X <- rSparseMatrix(1e6, 1e6, 1e6, NULL)
system.time(M <- assocSparse(X, method = poi))
length(M@x) / prod(dim(M)) # only one in 1e6 cells non-zero

# ----- reaching limits of sparsity -----

# watch out: 
# with slightly less sparse matrices the result will not be very sparse,
# so this will easily fill up your RAM during computation!

X <- rSparseMatrix(1e4, 1e4, 1e6, NULL)
system.time(M <- assocSparse(X, method = poi))
print(object.size(M), units = "auto") # about 350 Mb
length(M@x) / prod(dim(M)) # 30% filled

# most values are low, so it often makes sense 
# to remove low values to keep results sparse

M <- drop0(M, tol = 2)
print(object.size(M), units = "auto") # reduces to 10 Mb
length(M@x) / prod(dim(M)) # down to less than 1% filled

# ----- defining new methods -----

# Using the following simple 'div' method is the same as
# using a cosine similarity with a 1-norm, up to a factor nrow(X)

div <- function(o,e) {o/e}
X <- rSparseMatrix(10, 10, 30, NULL)
all.equal(
	assocSparse(X, method = div),
	cosSparse(X, norm = norm1) * nrow(X)
	)

# ----- comparing methods -----

# Compare various methods on random data
# ignore values on diagonal, because different methods differ strongly here
# Note the different behaviour of pointwise mutual information (and division)

X <- rSparseMatrix(1e2, 1e2, 1e3, NULL)

p <- assocSparse(X, method = poi); diag(p) <- 0
r <- assocSparse(X, method = res); diag(r) <- 0
m <- assocSparse(X, method = pmi); diag(m) <- 0
w <- assocSparse(X, method = wpmi); diag(w) <- 0
d <- assocSparse(X, method = div); diag(d) <- 0

pairs(~ w@x + p@x + r@x + d@x + m@x,
  labels = c("weighted pointwise\nmutual information", "poisson",
             "residuals", "division", "pointwise\nmutual\ninformation"),
  cex = 0.7)
