datatable.optimize
Optimisations in data.table
data.table
internally optimises certain expressions inorder to improve
performance. This section briefly summarises those optimisations.
Note that there's no additional input needed from the user to take advantage of these optimisations. They happen automatically.
Run the code under the example section to get a feel for the performance benefits from these optimisations.
- Keywords
- data
Details
data.table
reads the global option datatable.optimize
to figure
out what level of optimisation is required. The default value Inf
activates all available optimisations.
At optimisation level >= 1
, i.e., getOption("datatable.optimize")
>= 1, these are the optimisations:
The base function
order
is internally replaced withdata.table
's fast ordering. That is,dt[order(...)]
gets internally optimised todt[forder(...)]
.The expression
dt[, lapply(.SD, fun), by=.]
gets optimised todt[, list(fun(a), fun(b), ...), by=.]
wherea,b, ...
are columns in.SD
. This improves performance tremendously.Similarly, the expression
dt[, c(.N, lapply(.SD, fun)), by=.]
gets optimised todt[, list(.N, fun(a), fun(b), ...)]
..N
is just for example here.base::mean
function is internally optimised to usedata.table
'sfastmean
function.mean()
frombase
is an S3 generic and gets slow with many groups.
At optimisation level >= 2
, i.e., getOption("datatable.optimize")
>= 2, additional optimisations are implemented on top of the optimisations already shown above.
When expressions in
j
which contains only these functionsmin, max, mean, median, var, sd, prod
, for example,dt[, list(mean(x), median(x), min(y), max(y)), by=z]
, they are very effectively optimised using, what we call, GForce. These functions are replaced withgmean, gmedian, gmin, gmax
instead.Normally, once the rows belonging to each groups are identified, the values corresponding to the group is gathered and the
j-expression
is evaluated. This can be improved by computing the result directly without having to gather the values or evaluating the expression for each group (which can get costly with large number of groups) by implementing it specifically for a particular function. As a result, it is extremely fast.In addition to all the functions above, `.N` is also optimised to use GForce. It when used separately or combined with the functions mentioned above still uses GForce.
Expressions of the form
DT[i, j, by]
are also optimised wheni
is a subset operation andj
is any/all of the functions discussed above.
Auto indexing: data.table
also allows for blazing fast subsets by
creating an index on the first run. Any successive subsets on the same
column then reuses this index to binary search (instead of
vector scan) and is therefore fast.
At the moment, expressions of the form dt[col == val]
and
dt[col %in% val]
are both optimised. We plan to expand this to more
operators and conditions in the future.
Auto indexing can be switched off with the global option
options(datatable.auto.index = FALSE)
. To switch off using existing
indices set global option options(datatable.use.index = FALSE)
.
See Also
Examples
# NOT RUN {
# Generate a big data.table with a relatively many columns
set.seed(1L)
dt = lapply(1:20, function(x) sample(c(-100:100), 5e6L, TRUE))
setDT(dt)[, id := sample(1e5, 5e6, TRUE)]
print(object.size(dt), units="Mb") # 400MB, not huge, but will do
# 'order' optimisation
options(datatable.optimize = 1L) # optimisation 'on'
system.time(ans1 <- dt[order(id)])
options(datatable.optimize = 0L) # optimisation 'off'
system.time(ans2 <- dt[order(id)])
identical(ans1, ans2)
# optimisation of 'lapply(.SD, fun)'
options(datatable.optimize = 1L) # optimisation 'on'
system.time(ans1 <- dt[, lapply(.SD, min), by=id])
options(datatable.optimize = 0L) # optimisation 'off'
system.time(ans2 <- dt[, lapply(.SD, min), by=id])
identical(ans1, ans2)
# optimisation of 'mean'
options(datatable.optimize = 1L) # optimisation 'on'
system.time(ans1 <- dt[, lapply(.SD, mean), by=id])
system.time(ans2 <- dt[, lapply(.SD, base::mean), by=id])
identical(ans1, ans2)
# optimisation of 'c(.N, lapply(.SD, ))'
options(datatable.optimize = 1L) # optimisation 'on'
system.time(ans1 <- dt[, c(.N, lapply(.SD, min)), by=id])
options(datatable.optimize = 0L) # optimisation 'off'
system.time(ans2 <- dt[, c(N=.N, lapply(.SD, min)), by=id])
identical(ans1, ans2)
# GForce
options(datatable.optimize = 2L) # optimisation 'on'
system.time(ans1 <- dt[, lapply(.SD, median), by=id])
system.time(ans2 <- dt[, lapply(.SD, function(x) as.numeric(stats::median(x))), by=id])
identical(ans1, ans2)
# restore optimization
options(datatable.optimize = Inf)
# auto indexing
options(datatable.auto.index = FALSE)
system.time(ans1 <- dt[id == 100L]) # vector scan
system.time(ans2 <- dt[id == 100L]) # vector scan
system.time(dt[id <!-- %in% 100:500]) # vector scan -->
options(datatable.auto.index = TRUE)
system.time(ans1 <- dt[id == 100L]) # index + binary search subset
system.time(ans2 <- dt[id == 100L]) # only binary search subset
system.time(dt[id <!-- %in% 100:500]) # only binary search subset again -->
# }