batchtools (version 0.9.0)

chunkIds: Chunk Jobs for Sequential Execution

Description

Partition jobs into “chunks” which will be executed together on the nodes.

Chunks are submitted via submitJobs by providing a data frame with columns “job.id” and “chunk” (both integer). All jobs with the same chunk number will be grouped together on one node as a single computational job.

If neither n.chunks nor chunk.size are provided, each job will be assigned to its own chunk.
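The table described above can also be built by hand, which makes the contract concrete. The sketch below constructs a job.id/chunk data frame in base R (mirroring what chunkIds returns); the submitJobs call is shown commented out since it needs a configured Registry:

```r
# Group jobs 1..6 into chunks of 2, producing the same shape of table
# that chunkIds() returns: one row per job, an integer chunk id per row.
chunk.size <- 2L
job.id <- 1:6
chunk <- as.integer(ceiling(seq_along(job.id) / chunk.size))
chunks <- data.frame(job.id = job.id, chunk = chunk)
print(chunks)

# Jobs sharing a chunk number run together as one computational job:
# submitJobs(chunks, reg = reg)  # requires a configured Registry
```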

Usage

chunkIds(ids = NULL, n.chunks = NULL, chunk.size = NULL,
  group.by = character(0L), reg = getDefaultRegistry())

Arguments

ids
[data.frame or integer] A data.frame (or data.table) with a column named “job.id”. Alternatively, you may also pass a vector of integerish job ids. If not set, defaults to all jobs.
n.chunks
[integer(1)] Requested number of chunks. If more chunks than elements in ids are requested, empty chunks are ignored. Mutually exclusive with chunk.size.
chunk.size
[integer(1)] Requested number of elements in each chunk. If ids cannot be chunked evenly, some chunks will have fewer elements than others. Mutually exclusive with n.chunks.
group.by
[character(0)] If ids is a data.frame with additional columns (besides the required column “job.id”), the chunking is performed separately within the subgroups defined by the columns named in group.by. See the example below.
reg
[Registry] Registry. If not explicitly passed, uses the default registry (see setDefaultRegistry).
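To make the n.chunks / chunk.size distinction concrete, here is an illustrative base-R sketch. Note this is not batchtools' internal algorithm, which may balance chunk sizes differently; as stated above, uneven ids can leave some chunks smaller than others:

```r
ids <- 1:25

# chunk.size = 10: the number of chunks follows from the requested size;
# the last chunk holds the remainder.
chunk.size <- 10L
by.size <- as.integer(ceiling(seq_along(ids) / chunk.size))
print(table(by.size))   # sizes 10, 10, 5

# n.chunks = 4: the chunk size follows from the requested count;
# cut() splits the index range into 4 roughly equal intervals.
n.chunks <- 4L
by.count <- as.integer(cut(seq_along(ids), breaks = n.chunks, labels = FALSE))
print(table(by.count))  # sizes 7, 6, 6, 6
```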

Value

[data.table] with columns “job.id” and “chunk”.

Examples

# chunking for Registry
tmp = makeRegistry(file.dir = NA, make.default = FALSE)
batchMap(identity, 1:25, reg = tmp)
ids = chunkIds(chunk.size = 10, reg = tmp)
print(ids)
print(table(ids$chunk))

# Creating chunks for an ExperimentRegistry
library(data.table)  # for data.table() and dcast() below
tmp = makeExperimentRegistry(file.dir = NA, make.default = FALSE)
prob = addProblem(reg = tmp, "prob1", data = iris, fun = function(job, data) nrow(data))
prob = addProblem(reg = tmp, "prob2", data = Titanic, fun = function(job, data) nrow(data))
algo = addAlgorithm(reg = tmp, "algo", fun = function(job, data, instance, i, ...) instance)
prob.designs = list(prob1 = data.table(), prob2 = data.table(x = 1:2))
algo.designs = list(algo = data.table(i = 1:3))
addExperiments(prob.designs, algo.designs, repls = 3, reg = tmp)

# group into chunks of 5 jobs, but do not put multiple problems into a single chunk
# -> only one problem has to be loaded per chunk, and only once because it is then cached.
ids = getJobTable(reg = tmp)[, .(job.id, problem, algorithm)]
chunked = chunkIds(ids, chunk.size = 5, group.by = "problem", reg = tmp)
print(chunked)
dcast(ijoin(ids, chunked), chunk ~ problem)