data.table (version 1.9.2)

dcast.data.table: Fast dcast for data.table

Description

dcast.data.table is a much faster version of reshape2:::dcast for data.table objects. More importantly, it handles larger data efficiently without compromising speed. It is still under development: it is stable, but not all features are complete yet. Once complete, we plan to turn dcast into an S3 generic by making changes to reshape2:::dcast. Then, with both data.table and reshape2 loaded, one could use dcast on a data.table as one would on a data.frame. If you think of a particular feature that might be useful, please file a feature request (FR) at the datatable project page (link at the bottom).

Usage

## fast dcast a data.table (not an S3 method yet)
dcast.data.table(data, formula, fun.aggregate = NULL, 
	..., margins = NULL, subset = NULL, fill = NULL, 
	drop = TRUE, value.var = guess(data),
	verbose = getOption("datatable.verbose"))

Arguments

data
A molten data.table object; see melt.data.table.
formula
A formula of the form LHS ~ RHS to cast, see details.
fun.aggregate
Should the data be aggregated before casting? If the formula doesn't identify a single observation for each cell, then aggregation defaults to length, with a message.
...
Any other arguments that may be passed to the aggregating function.
margins
Not implemented yet. Should take variable names to compute margins on. A value of TRUE would compute all margins.
subset
Specify this if casting should be done on a subset of the data. Ex: subset = .(col1
fill
Value to fill missing cells with. If fun.aggregate is present, the fill value is obtained by applying the function to a 0-length vector.
drop
FALSE will cast by including all missing combinations.
value.var
Name of the column whose values will be filled to cast. Function `guess()` tries to, well, guess this column automatically, if none is provided.
verbose
Not used yet. May be dropped in the future or used to print informative messages to the console.
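
As a rough illustration of how drop and fill interact, the sketch below uses a small made-up table. The column names site, year, grp and val are assumptions chosen for the example; they are not part of the package.

```r
library(data.table)

# Hypothetical toy data: site "s2" has no observation for year 2002.
DT <- data.table(site = c("s1", "s1", "s2"),
                 year = c(2001, 2002, 2001),
                 grp  = c("a", "a", "a"),
                 val  = c(10, 20, 30))

# Default (drop = TRUE): only the site/year pairs present in the data.
observed <- dcast.data.table(DT, site + year ~ grp, value.var = "val")

# drop = FALSE: every site/year combination gets a row; the empty cell is NA.
full_na <- dcast.data.table(DT, site + year ~ grp, value.var = "val",
                            drop = FALSE)

# drop = FALSE with fill = 0: the empty cell is filled with 0 instead.
full_0 <- dcast.data.table(DT, site + year ~ grp, value.var = "val",
                           drop = FALSE, fill = 0)
```

Passing value.var explicitly avoids relying on guess() when the value column is not named "value".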

Value

  • A keyed data.table that has been cast. The key columns are equal to the variables in the formula LHS in the same order.

Details

The cast formula takes the form LHS ~ RHS, ex: var1 + var2 ~ var3. The order of entries in the formula is essential. There are two special variables, `.` and `...`, whose functionality is identical to that of reshape2:::dcast.
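
A minimal sketch of the two special variables, assuming they behave as in reshape2:::dcast (`...` meaning "all remaining columns" and `.` meaning "no variable"). The table and its column names id, grp and val are made up for the example.

```r
library(data.table)

# Hypothetical toy data with one id column, one grouping column, one value.
DT <- data.table(id  = c(1, 1, 2, 2),
                 grp = c("a", "b", "a", "b"),
                 val = c(1, 2, 3, 4))

# "..." stands for every column not otherwise named in the formula
# (and not value.var), so here it should be equivalent to id ~ grp.
by_rest <- dcast.data.table(DT, ... ~ grp, value.var = "val")

# "." stands for "no variable": all rows collapse into one row, with one
# column per value of grp, so an aggregating function is needed.
overall <- dcast.data.table(DT, . ~ grp, value.var = "val",
                            fun.aggregate = sum)
```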

dcast.data.table also allows value.var columns of type list.

When the combination of variables in formula doesn't identify a unique value in a cell, fun.aggregate will have to be used. The aggregating function should take a vector as input and return a single value (or a list of length one) as output. In cases where value.var is a list, the function should be able to handle a list input and provide a single value or list of length one as output.
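
The sketch below shows one case where aggregation is needed, using a made-up table (the column names id, grp and val are assumptions). Here id 1 has two rows in group "a", so id ~ grp alone does not pick out a single value for that cell.

```r
library(data.table)

# Hypothetical toy data: id 1 has two rows in group "a",
# so id ~ grp does not identify a single value per cell.
DT <- data.table(id  = c(1, 1, 1, 2),
                 grp = c("a", "a", "b", "a"),
                 val = c(1, 2, 3, 4))

# Count the observations per cell. length is passed explicitly here;
# it is also the documented default when fun.aggregate is omitted.
counts <- dcast.data.table(DT, id ~ grp, value.var = "val",
                           fun.aggregate = length)

# Collapse each cell to a single number with sum().
sums <- dcast.data.table(DT, id ~ grp, value.var = "val",
                         fun.aggregate = sum)
```

Any function that maps a vector to a single value, such as mean or max, could be used in place of sum.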

If the formula's LHS contains the same column more than once, ex: dcast.data.table(DT, x+x~ y), then the answer will have duplicate names. In those cases, the duplicate names are renamed using make.unique so that the key can be set without issues.
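
A small sketch of the duplicate-name case, on a made-up table with columns x, y and v (all assumptions). The exact form of the renamed column may differ between data.table versions, so the example only inspects the names rather than relying on a particular one.

```r
library(data.table)

# Hypothetical toy data; each (x, y) pair occurs once, so no aggregation.
DT <- data.table(x = c(1, 1, 2, 2),
                 y = c("a", "b", "a", "b"),
                 v = 1:4)

# x appears twice on the LHS; the result should carry two distinct
# column names derived from "x" instead of a duplicated "x".
res <- dcast.data.table(DT, x + x ~ y, value.var = "v")
names(res)
```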

The only feature that's not implemented from reshape2:::dcast *yet* is the argument margins.

See Also

melt.data.table, https://r-forge.r-project.org/projects/datatable/

Examples

require(data.table)
require(reshape2)
names(ChickWeight) <- tolower(names(ChickWeight))
DT <- melt(as.data.table(ChickWeight), id=2:4) # calls melt.data.table

# no S3 method yet, have to use "dcast.data.table"
dcast.data.table(DT, time ~ variable, fun=mean)
dcast.data.table(DT, diet ~ variable, fun=mean)
dcast.data.table(DT, diet+chick ~ time, drop=FALSE)
dcast.data.table(DT, diet+chick ~ time, drop=FALSE, fill=0)

# using subset
dcast.data.table(DT, chick ~ time, fun=mean, subset=.(time < 10 & chick < 20))

# on big data
set.seed(45)
DT <- data.table(aa=sample(1e4, 1e6, TRUE), 
      bb=sample(1e3, 1e6, TRUE), 
      cc = sample(letters, 1e6, TRUE), dd=runif(1e6))
system.time(dcast.data.table(DT, aa ~ cc, fun=sum)) # 0.59 seconds
system.time(dcast.data.table(DT, bb ~ cc, fun=mean)) # 0.26 seconds
# reshape2:::dcast takes 192.1 seconds
system.time(dcast.data.table(DT, aa + bb ~ cc, fun=sum)) # 3.6 seconds
