Use devtools or remotes to fetch the package from this repository:
if(!require(devtools)) install.packages("devtools")
devtools::install_github("Anirban166/data.table.threads")if(!require(remotes)) install.packages("remotes")
remotes::install_github("Anirban166/data.table.threads")findOptimalThreadCount(rowCount, columnCount) is the go-to function which runs a set of benchmarks for various data.table functions that are parallelizable.
> benchmarkData <- data.table.threads::findOptimalThreadCount(1e7, 10)
Running benchmarks with 1 thread, 10000000 rows, and 10 columns.
...
Running benchmarks with 10 threads, 10000000 rows, and 10 columns.It returns an object with print and plot methods.
> benchmarkData
data.table function Thread count Fastest median runtime (ms)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
forder 8 82.736011
GForce_sum 6 15.670897
subsetting 6 54.386931
frollmean 6 23.329410
fcoalesce 5 7.319135
between 6 22.716911
fifelse 10 18.825437
nafill 10 7.006490
CJ 1 3.194330 The output here is a table which shows the fastest runtime (median value in milliseconds) for each data.table function along with the corresponding thread count that achieved it.
> plot(benchmarkData)
As for the generated plot, it delineates the speedup across multiple threads (from 1 to the number of threads available in your system; 10 in my case or this example) for each function.
setThreadCount(benchmarkData, functionName, efficiencyFactor) can then be used to set the thread count based on the observed results for a user-specified function and efficiency value (of the range [0, 1]) for the speedup:
> setOptimalThreadCount(benchmarks, functionName = "forder", efficientcyFactor = 0.5, verbose = TRUE)
The number of threads that data.table will use has been set to 3, based on an efficiency factor of 0.5 for data.table::forder() based on the performed benchmarks.
> getDTthreads()
[1] 3