# NOT RUN {
library(bigpca) # assumed here: the package providing bmcapply() and prv.big.matrix() (also loads bigmemory)
orig.dir <- getwd(); setwd(tempdir()) # move to a temporary directory
if(file.exists("test.bck")) { unlink(c("test.bck","test.dsc")) }
# set up a toy example of a big.matrix (functions most relevant when matrix is huge)
bM <- filebacked.big.matrix(20, 50,
  dimnames = list(paste("r",1:20,sep=""), paste("c",1:50,sep="")),
  backingfile = "test9.bck", backingpath = getwd(), descriptorfile = "test9.dsc")
bM[1:20,] <- replicate(50,rnorm(20))
prv.big.matrix(bM) # preview the big.matrix contents
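# Because the matrix is file-backed, it can be re-attached later from its descriptor
# file; a minimal sketch, assuming bigmemory's attach.big.matrix() and that we are
# still in the directory holding the backing files:
bM2 <- attach.big.matrix("test9.dsc")
print(identical(dim(bM2), dim(bM))) # should be TRUE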
# compare the native bigmemory-family column-wise function with the multicore approach [native probably faster]
v1 <- colsd(bM) # native bigmemory function
v2 <- bmcapply(bM,2,sd,n.cores=2) # use up to 2 cores if available
print(all.equal(v1,v2))
# compare row-means approaches
v1 <- rowMeans(as.matrix(bM))
v2 <- bmcapply(bM,1,mean,n.cores=2) # use up to 2 cores if available
v3 <- bmcapply(bM,1,rowMeans,use.apply=FALSE) # use.apply=FALSE passes whole sub-matrices to FUN (hence rowMeans)
print(all.equal(v1,v2)); print(all.equal(v2,v3))
# example using a custom combine function: taking the mean of the column SDs
weight.means.to.scalar <- function(...) { X <- list(...); mean(unlist(X)) } # flatten chunk results and average them
v1 <- bmcapply(bM, 2, sd, combine.fn=weight.means.to.scalar)
v2 <- mean(colsd(bM))
print(all.equal(v1,v2))
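# The same combine.fn mechanism can reduce the chunk results in other ways; a small
# illustrative sketch (max.col.sd is a made-up name for this example, not a package function):
max.col.sd <- function(...) { max(unlist(list(...))) } # flatten chunk results and take the maximum
v1 <- bmcapply(bM, 2, sd, combine.fn=max.col.sd)
v2 <- max(colsd(bM))
print(all.equal(v1,v2))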
## Note that this function also works with ordinary matrices; however, multicore
# operation is only likely to improve speed when the computation takes more than
# 10 seconds, so it mainly helps with very large matrices or intensive functions.
test.size <- 5 # try increasing this number, or use a more intensive function than sd(),
# to test relative speed for larger matrices
M <- matrix(runif(10^test.size),ncol=10^(test.size-2)) # normal matrix
system.time(bmcapply(M,2,sd,n.cores=2)) # use up to 2 cores if available
system.time(apply(M,2,sd)) # single-core equivalent using base apply()
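# To make the multicore advantage more visible, a heavier per-column function can be
# substituted; slow.sd below is an illustrative, made-up example, not a package function:
slow.sd <- function(x) sd(replicate(100, sd(sample(x, replace=TRUE)))) # crude bootstrap SD
system.time(bmcapply(M,2,slow.sd,n.cores=2)) # use up to 2 cores if available
system.time(apply(M,2,slow.sd)) # single-core comparison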
rm(bM)
unlink(c("test9.bck","test9.dsc"))
setwd(orig.dir)
# }