datatable.optimize: Optimisations in data.table

Description

data.table internally optimises certain expressions in order to improve performance. This section briefly summarises those optimisations.

Note that there's no additional input needed from the user to take advantage of these optimisations. They happen automatically.

Run the code under the example section to get a feel for the performance benefits from these optimisations.

Note that for all optimizations involving efficient sorts, the caveat mentioned in setorder applies -- whenever data.table does the sorting, it does so in "C-locale". This has some subtle implications; see Examples.

Arguments

Details

data.table reads the global option datatable.optimize to figure out what level of optimisation is required. The default value Inf activates all available optimisations.

At optimisation level >= 1, i.e., getOption("datatable.optimize") >= 1, these are the optimisations:

The base function order is internally replaced with data.table's fast ordering. That is, DT[order(...)] gets internally optimised to DT[forder(...)].
The expression DT[, lapply(.SD, fun), by=.] gets optimised to DT[, list(fun(a), fun(b), ...), by=.] where a,b, ... are columns in .SD. This improves performance tremendously.
Similarly, the expression DT[, c(.N, lapply(.SD, fun)), by=.] gets optimised to DT[, list(.N, fun(a), fun(b), ...)]. .N is just for example here.
base::mean function is internally optimised to use data.table's fastmean function. mean() from base is an S3 generic and gets slow with many groups.

At optimisation level >= 2, i.e., getOption("datatable.optimize") >= 2, additional optimisations are implemented on top of the optimisations already shown above.

Expressions in j which contain only the functions min, max, mean, median, var, sd, sum, prod, first, last, head, tail (for example, DT[, list(mean(x), median(x), min(y), max(y)), by=z]), they are very effectively optimised using what we call GForce. These functions are automatically replaced with a corresponding GForce version with pattern g*, e.g., prod becomes gprod.

Normally, once the rows belonging to each group are identified, the values corresponding to the group are gathered and the j-expression is evaluated. This can be improved by computing the result directly without having to gather the values or evaluating the expression for each group (which can get costly with large number of groups) by implementing it specifically for a particular function. As a result, it is extremely fast.
In addition to all the functions above, `.N` is also optimised to use GForce, when used separately or when combined with the functions mentioned above. Note further that GForce-optimized functions must be used separately, i.e., code like DT[ , max(x) - min(x), by=z] will not currently be optimized to use gmax, gmin.
Expressions of the form DT[i, j, by] are also optimised when i is a subset operation and j is any/all of the functions discussed above.

At optimisation level >= 3, i.e., getOption("datatable.optimize") >= 3, additional optimisations for subsets in i are implemented on top of the optimisations already shown above. Subsetting operations are - if possible - translated into joins to make use of blazing fast binary search using indices and keys. The following queries are optimized:

Supported operators: ==, %in%. Non-equi operators(>, <, etc.) are not supported yet because non-equi joins are slower than vector based subsets.
Queries on multiple columns are supported, if the connector is '&', e.g. DT[x == 2 & y == 3] is supported, but DT[x == 2 | y == 3] is not.
Optimization will currently be turned off when doing subset when cross product of elements provided to filter on exceeds > 1e4. This most likely happens if multiple %in%, or %chin% queries are combined, e.g. DT[x %in% 1:100 & y %in% 1:200] will not be optimized since 100 * 200 = 2e4 > 1e4.
Queries with multiple criteria on one column are not supported, e.g. DT[x == 2 & x %in% c(2,5)] is not supported.
Queries with non-missing j are supported, e.g. DT[x == 3 & y == 5, .(new = x-y)] or DT[x == 3 & y == 5, new := x-y] are supported. Also extends to queries using with = FALSE.
"notjoin" queries, i.e. queries that start with !, are only supported if there are no & connections, e.g. DT[!x==3] is supported, but DT[!x==3 & y == 4] is not.

If in doubt, whether your query benefits from optimization, call it with the verbose = TRUE argument. You should see "Optimized subsetting...".

Auto indexing: In case a query is optimized, but no appropriate key or index is found, data.table automatically creates an index on the first run. Any successive subsets on the same column then reuse this index to binary search (instead of vector scan) and is therefore fast. Auto indexing can be switched off with the global option options(datatable.auto.index = FALSE). To switch off using existing indices set global option options(datatable.use.index = FALSE).

Examples

Run this code

if (FALSE) {
old = options(datatable.optimize = Inf)

# Generate a big data.table with a relatively many columns
set.seed(1L)
DT = lapply(1:20, function(x) sample(c(-100:100), 5e6L, TRUE))
setDT(DT)[, id := sample(1e5, 5e6, TRUE)]
print(object.size(DT), units="Mb") # 400MB, not huge, but will do

# 'order' optimisation
options(datatable.optimize = 1L) # optimisation 'on'
system.time(ans1 <- DT[order(id)])
options(datatable.optimize = 0L) # optimisation 'off'
system.time(ans2 <- DT[order(id)])
identical(ans1, ans2)

# optimisation of 'lapply(.SD, fun)'
options(datatable.optimize = 1L) # optimisation 'on'
system.time(ans1 <- DT[, lapply(.SD, min), by=id])
options(datatable.optimize = 0L) # optimisation 'off'
system.time(ans2 <- DT[, lapply(.SD, min), by=id])
identical(ans1, ans2)

# optimisation of 'mean'
options(datatable.optimize = 1L) # optimisation 'on'
system.time(ans1 <- DT[, lapply(.SD, mean), by=id])
system.time(ans2 <- DT[, lapply(.SD, base::mean), by=id])
identical(ans1, ans2)

# optimisation of 'c(.N, lapply(.SD, ))'
options(datatable.optimize = 1L) # optimisation 'on'
system.time(ans1 <- DT[, c(.N, lapply(.SD, min)), by=id])
options(datatable.optimize = 0L) # optimisation 'off'
system.time(ans2 <- DT[, c(N=.N, lapply(.SD, min)), by=id])
identical(ans1, ans2)

# GForce
options(datatable.optimize = 2L) # optimisation 'on'
system.time(ans1 <- DT[, lapply(.SD, median), by=id])
system.time(ans2 <- DT[, lapply(.SD, function(x) as.numeric(stats::median(x))), by=id])
identical(ans1, ans2)

# optimized subsets
options(datatable.optimize = 2L)
system.time(ans1 <- DT[id == 100L]) # vector scan
system.time(ans2 <- DT[id == 100L]) # vector scan
system.time(DT[id %in% 100:500])    # vector scan

options(datatable.optimize = 3L)
system.time(ans1 <- DT[id == 100L]) # index + binary search subset
system.time(ans2 <- DT[id == 100L]) # only binary search subset
system.time(DT[id %in% 100:500])    # only binary search subset again

# sensitivity to collate order
old_lc_collate = Sys.getlocale("LC_COLLATE")

if (old_lc_collate == "C") {
  Sys.setlocale("LC_COLLATE", "")
}
DT = data.table(
  grp = rep(1:2, each = 4L),
  var = c("A", "a", "0", "1", "B", "b", "0", "1")
)
options(datatable.optimize = Inf)
DT[, .(max(var), min(var)), by=grp]
# GForce is deactivated because of the ad-hoc column 'tolower(var)',
#   through which the result for 'max(var)' may also change
DT[, .(max(var), min(tolower(var))), by=grp]

Sys.setlocale("LC_COLLATE", old_lc_collate)
options(old)
}

Run the code above in your browser using DataLab

Description

Arguments

Details

See Also

Examples