ctapply: Fast tapply() replacement functions

Description

ctapply is a fast replacement for tapply that assumes contiguous input, i.e., each unique value in the index forms a single unbroken run and is never separated by any other values. This avoids an expensive split step, since both the value chunks and the index chunks can be created on the fly. As a result, it can be many orders of magnitude faster than the classical lapply(split(), ...) implementation.
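
To illustrate why contiguity removes the need for a split step, here is a minimal base-R sketch of the idea. It is not the actual implementation of ctapply, which this page does not show; contig_apply is a hypothetical helper name, and the run boundaries are found with rle, assuming the index contains no NA values.

```r
# Hypothetical helper: apply FUN to each contiguous run of equal INDEX values.
# Illustrative only; assumes INDEX has no NA values.
contig_apply <- function(X, INDEX, FUN, ...) {
  runs   <- rle(as.vector(INDEX))      # run lengths and the value of each run
  ends   <- cumsum(runs$lengths)       # last position of each run
  starts <- ends - runs$lengths + 1L   # first position of each run
  res <- lapply(seq_along(ends), function(k) FUN(X[starts[k]:ends[k]], ...))
  names(res) <- runs$values
  res
}

x   <- c(1, 2, 3, 10, 20, 100)
idx <- c("a", "a", "a", "b", "b", "c")
out <- contig_apply(x, idx, sum)
print(unlist(out))
```

Each chunk is a simple slice between two run boundaries, so the index never has to be hashed or converted to a factor.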

Usage

ctapply(X, INDEX, FUN, ..., MERGE=c, .SAFE=TRUE)

Arguments

X: an atomic object, typically a vector
INDEX: numeric or character vector of the same length as X
FUN: the function to be applied
...: additional arguments to FUN. They are passed as-is, i.e., without replication or recycling
MERGE: function used to merge the per-group results, or NULL if the arguments to such a function (i.e., the unmerged per-group results) are to be returned instead
.SAFE: logical, if TRUE then fresh regular vectors are allocated for each call of FUN which makes it safe to be used with any FUN. If FALSE then both the index and value vectors use cached resizable vectors, but this imposes constraints on FUN (see below).
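
The following sketch shows how the MERGE argument might be used. It assumes that the iotools package (which provides ctapply) is installed; the data are made up for illustration, and no output is shown because the exact shape of the result depends on FUN and MERGE.

```r
# Sketch assuming the iotools package (which provides ctapply) is installed.
if (requireNamespace("iotools", quietly = TRUE)) {
  x <- c(1, 2, 3, 10, 20, 100)
  g <- c("a", "a", "a", "b", "b", "c")   # already contiguous

  # Default MERGE = c: per-group results are combined by c()
  merged <- iotools::ctapply(x, g, range)

  # MERGE = NULL: the unmerged per-group results are returned instead
  pieces <- iotools::ctapply(x, g, range, MERGE = NULL)

  print(merged)
  print(pieces)
}
```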

Author

Simon Urbanek

Details

ctapply supports integer, real, or character vectors as indices (factors are integer vectors and thus supported, and character vectors do not need to be converted). Unlike tapply, it does not take a list of factors. If you want to use a cross-product of factors, create the product first, e.g., using paste(i1, i2, i3, sep='\01') or multiplication, whichever method is convenient for the input types.
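
As a sketch of the cross-product approach, the example below builds a single key from two grouping vectors with paste. The variable names (i1, i2, val, key) and the data are hypothetical; the ctapply call assumes the iotools package is installed.

```r
i1  <- c("x", "x", "x", "y", "y")
i2  <- c(1L, 1L, 2L, 1L, 1L)
val <- c(5, 6, 7, 8, 9)

# Combine both grouping vectors into one key. The separator '\01' is unlikely
# to occur in real data, so distinct (i1, i2) pairs should not collide.
key <- paste(i1, i2, sep = "\01")

# The key must be contiguous; here the input is already grouped by (i1, i2).
if (requireNamespace("iotools", quietly = TRUE)) {
  res <- iotools::ctapply(val, key, sum)
  print(res)
}
```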

ctapply requires the INDEX to be contiguous. One (slow) way to achieve that is to use sort or order.
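
A minimal base-R sketch of checking for contiguity and enforcing it with order follows. is_contiguous is a hypothetical helper, not part of any package, and it assumes the index contains no NA values.

```r
# Hypothetical helper: TRUE if every distinct value forms a single run.
# Assumes no NA values in the index.
is_contiguous <- function(index) {
  !anyDuplicated(rle(as.vector(index))$values)
}

idx <- c("b", "a", "b", "a", "c")
x   <- c(1, 2, 3, 4, 5)
print(is_contiguous(idx))   # "a" and "b" each occur in two separate runs

o   <- order(idx)           # sorting brings equal values together
idx <- idx[o]
x   <- x[o]                 # reorder the values the same way
print(is_contiguous(idx))
```

Note that the values and the index must be reordered with the same permutation, otherwise they no longer line up.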

If .SAFE=FALSE then both index and value vectors will be re-used in subsequent iterations and created as growable vectors, thus avoiding allocations at each iteration. However, this means that FUN may not directly store the value it is being passed; it can only compute on it and store/return derived values (e.g., it is safe to use with functions like sum). This approach can be faster for large X if INDEX consists of many small groups, but enable it only if you have direct control over FUN and know that it is safe to use this way.
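
The sketch below contrasts the two modes with a function that only computes on its input. It assumes the iotools package is installed; the data are illustrative, and the commented-out function is a hypothetical example of what to avoid with .SAFE=FALSE.

```r
if (requireNamespace("iotools", quietly = TRUE)) {
  x <- as.numeric(1:6)
  g <- c(1L, 1L, 1L, 2L, 2L, 3L)   # contiguous integer index

  # sum only computes on its argument and returns a derived value,
  # so it fits the constraint described above for .SAFE = FALSE.
  fast <- iotools::ctapply(x, g, sum, .SAFE = FALSE)
  safe <- iotools::ctapply(x, g, sum)   # default .SAFE = TRUE
  print(fast)
  print(safe)

  # A function that keeps or returns the vector it was passed, e.g.
  #   function(v) { kept[[length(kept) + 1L]] <<- v; length(v) }
  # should only be used with the default .SAFE = TRUE.
}
```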

Examples

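library(iotools)                        # package providing ctapply
i = rnorm(4e6)                          # values to aggregate
names(i) = as.integer(rnorm(4e6))       # one group label per value
i = i[order(names(i))]                  # sort so that equal labels are contiguous
system.time(tapply(i, names(i), sum))   # baseline
system.time(ctapply(i, names(i), sum))  # contiguous version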