ctapply is a fast replacement for tapply that assumes
contiguous input, i.e. all occurrences of a given index value are
adjacent and never separated by any other values. This avoids an
expensive split step since both the value and the index chunks can be
created on the fly. This makes it many orders of magnitude faster than
the classical lapply(split(), ...) implementation.
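The idea of walking contiguous runs instead of splitting can be sketched in plain R. This is only an illustration of the principle, not the actual ctapply implementation; run_apply is a hypothetical helper name, and it assumes an index without NA values.

```r
# Hypothetical helper (not part of any package): apply FUN to each
# contiguous run of equal values in INDEX, without calling split().
run_apply <- function(X, INDEX, FUN, ...) {
  stopifnot(length(X) == length(INDEX))
  n <- length(INDEX)
  if (n == 0L) return(list())
  # a new run starts wherever the index differs from its predecessor
  starts <- c(1L, which(INDEX[-1L] != INDEX[-n]) + 1L)
  ends <- c(starts[-1L] - 1L, n)
  res <- lapply(seq_along(starts),
                function(k) FUN(X[starts[k]:ends[k]], ...))
  names(res) <- as.character(INDEX[starts])
  res
}

x <- c(1, 2, 3, 4, 5)
idx <- c("a", "a", "b", "c", "c")
res <- unlist(run_apply(x, idx, sum))
print(res)
```

Because each chunk is just a consecutive slice of the input, a single pass over the index is enough to find every group.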
ctapply(X, INDEX, FUN, ..., MERGE=c, .SAFE=TRUE)
X: an atomic object, typically a vector
INDEX: numeric or character vector of the same length as X
FUN: the function to be applied
...: additional arguments to FUN. They are passed as-is,
i.e., without replication or recycling
MERGE: function to merge the resulting vector or NULL if
the arguments to such a function are to be returned instead
.SAFE: logical, if TRUE then fresh regular vectors are
allocated for each call of FUN which makes it safe to be
used with any FUN. If FALSE then both the index and
value vectors use cached resizable vectors, but this imposes
constraints on FUN (see below).
Simon Urbanek
Note that ctapply supports either integer, real or character
vectors as indices (note that factors are integer vectors and thus
supported, but you do not need to convert character vectors). Unlike
tapply it does not take a list of factors - if you want to use
a cross-product of factors, create the product first, e.g. using
paste(i1, i2, i3, sep='\01') or multiplication - whatever
method is convenient for the input types.
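As a small sketch of building such a product index in base R: the vectors i1, i2, f1 and f2 below are made-up illustration data, and the numeric encoding assumes small positive integer codes.

```r
# Made-up example factors: two grouping variables of equal length
i1 <- c("x", "x", "y", "y")
i2 <- c(1L, 1L, 1L, 2L)

# Character product: join the components with a separator that is
# unlikely to occur in the data ('\01' is the byte with code 1)
key_chr <- paste(i1, i2, sep = "\01")

# Numeric product via multiplication, assuming positive integer codes:
# scale the first code so that distinct pairs cannot collide
f1 <- c(1L, 1L, 2L, 2L)
f2 <- c(1L, 1L, 1L, 2L)
key_num <- f1 * (max(f2) + 1L) + f2

print(length(unique(key_chr)))
print(key_num)
```

Either key can then be passed as the single INDEX vector, provided it is contiguous.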
ctapply requires the INDEX to be contiguous. One (slow) way
to achieve that is to use sort or order.
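A quick way to check contiguity before calling ctapply, and to establish it with order, might look like the following. is_contiguous is a hypothetical helper, not part of the package, and it assumes the index contains no NA values.

```r
# Hypothetical helper: TRUE if every distinct index value forms a single
# run. Assumes no NA values (rle() treats each NA as its own run).
is_contiguous <- function(index) {
  r <- rle(as.character(index))
  anyDuplicated(r$values) == 0L
}

idx <- c("b", "a", "b", "a")
x <- c(10, 20, 30, 40)

before <- is_contiguous(idx)    # FALSE: "b" and "a" each appear in two runs

# Reorder the values and the index together so equal labels become adjacent
o <- order(idx)
x_sorted <- x[o]
idx_sorted <- idx[o]

after <- is_contiguous(idx_sorted)   # TRUE
```

Note that the values and the index must be permuted with the same ordering vector, otherwise values end up in the wrong groups.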
If .SAFE=FALSE then both index and value vectors will be
re-used in subsequent iterations and created as growable vectors,
thus avoiding allocations at each iteration. However, this means that
FUN must not directly store the value it is passed; it can
only compute on it and store/return derived values (e.g., it is safe
to use with functions like sum). This approach can be faster
for large X if INDEX consists of many small groups, but
enable it only if you have direct control over FUN and know that
it is safe to use this way.
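A minimal sketch of the two modes, assuming the iotools package (which provides ctapply) is installed. The data are made up, and the guard simply skips the example when the package is unavailable.

```r
if (requireNamespace("iotools", quietly = TRUE)) {
  x <- c(1, 2, 3, 4, 5)
  idx <- c("a", "a", "b", "c", "c")

  # sum() only computes on its argument and returns a derived value,
  # so it fits the constraint described above for .SAFE = FALSE
  fast <- iotools::ctapply(x, idx, sum, .SAFE = FALSE)

  # default mode: fresh vectors for every call, safe with any FUN
  safe <- iotools::ctapply(x, idx, sum)

  print(fast)
  print(safe)
}
```

A function that keeps or returns its argument unchanged, such as function(v) v, falls outside that constraint and should be run with the default .SAFE = TRUE.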
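library(iotools)
i = rnorm(4e6)
# one group label per element (names must have the same length as i)
names(i) = as.integer(rnorm(4e6))
# sort by label so that equal labels are contiguous
i = i[order(names(i))]
# compare timings of tapply and ctapply on the same input
system.time(tapply(i, names(i), sum))
system.time(ctapply(i, names(i), sum))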