drake_config: Create the runtime parameter list used internally in `make()`.

Description

drake_config() collects and sanitizes the multitude of parameters and settings that make() needs to do its job: the plan, packages, the environment of functions and initial data objects, parallel computing instructions, verbosity level, etc. Other functions such as outdated(), vis_drake_graph(), and predict_runtime() require output from drake_config() for the config argument. If you supply a drake_config() object to the config argument of make(), then drake will ignore all the other arguments because it already has everything it needs in config.

Usage

drake_config(plan, targets = NULL, envir = parent.frame(),
  verbose = 1L, hook = NULL, cache = drake::get_cache(verbose =
  verbose, console_log_file = console_log_file), fetch_cache = NULL,
  parallelism = "loop", jobs = 1L, jobs_preprocess = 1L,
  packages = rev(.packages()), lib_loc = NULL,
  prework = character(0), prepend = NULL, command = NULL,
  args = NULL, recipe_command = NULL, timeout = NULL, cpu = Inf,
  elapsed = Inf, retries = 0, force = FALSE, log_progress = FALSE,
  graph = NULL, trigger = drake::trigger(), skip_targets = FALSE,
  skip_imports = FALSE, skip_safety_checks = FALSE,
  lazy_load = "eager", session_info = TRUE, cache_log_file = NULL,
  seed = NULL, caching = c("master", "worker"), keep_going = FALSE,
  session = NULL, pruning_strategy = NULL, makefile_path = NULL,
  console_log_file = NULL, ensure_workers = TRUE,
  garbage_collection = FALSE, template = list(), sleep = function(i)
  0.01, hasty_build = NULL, memory_strategy = c("speed", "memory",
  "lookahead"), layout = NULL, lock_envir = TRUE)

Arguments

plan

Workflow plan data frame. A workflow plan data frame is a data frame with a target column and a command column. (See the details in the drake_plan() help file for descriptions of the optional columns.) Targets are the objects that drake generates, and commands are the pieces of R code that produce them. You can create and track custom files along the way (see file_in(), file_out(), and knitr_in()). Use the function drake_plan() to generate workflow plan data frames.
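As a rough illustration of the structure described above, the sketch below builds the same two-column shape with base R. The target names and commands are made up for this example, and plan_sketch is a hypothetical name; in practice, drake_plan() builds the data frame for you.

```r
# Hypothetical targets and commands, for illustration only.
plan_sketch <- data.frame(
  target = c("raw_data", "fit"),
  command = c("head(mtcars)", "lm(mpg ~ wt, data = raw_data)"),
  stringsAsFactors = FALSE
)

# Each command is a piece of R code stored as text, so it can be parsed
# into a language object.
parsed <- lapply(plan_sketch$command, function(x) parse(text = x)[[1]])

print(plan_sketch)
```

Here the second command refers to raw_data, which is how one target becomes a dependency of another.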

targets

Character vector, names of targets to build. Dependencies are built too. Together, the plan and targets comprise the workflow network (i.e. the graph argument). Changing either will change the network.

envir

Environment to use. Defaults to the current workspace, so you should not need to worry about this most of the time. A deep copy of envir is made, so you don't need to worry about your workspace being modified by make(). The deep copy inherits from the global environment. Wherever necessary, objects and functions are imported from envir and the global environment and then reproducibly tracked as dependencies.

verbose

Integer, control printing to the console/terminal.

0: print nothing.
1: print targets, retries, and failures.
2: also show a spinner when preprocessing tasks are underway.

hook

Deprecated.

cache

drake cache as created by new_cache(). See also get_cache().

fetch_cache

Deprecated.

parallelism

Character scalar, type of parallelism to use. For detailed explanations, see the high-performance computing chapter of the user manual.

You could also supply your own scheduler function if you want to experiment or aggressively optimize. The function should take a single config argument (produced by drake_config()). Existing examples from drake's internals are the backend_*() functions:

backend_loop()
backend_clustermq()
backend_future()

However, this functionality is really a back door and should not be used for production purposes unless you really know what you are doing and you are willing to suffer setbacks whenever drake's unexported core functions are updated.

jobs

Maximum number of parallel workers for processing the targets. You can experiment with predict_runtime() to help decide on an appropriate number of jobs. For details, visit https://ropenscilabs.github.io/drake-manual/time.html.

jobs_preprocess

Number of parallel jobs for processing the imports and doing other preprocessing tasks.

packages

Character vector, packages to load, in the order they should be loaded. Defaults to rev(.packages()), so you should not usually need to set this manually. Just call library() to load your packages before make(). However, sometimes packages need to be strictly forced to load in a certain order, especially if parallelism is "Makefile". To do this, do not use library() or require() or loadNamespace() or attachNamespace() to load any libraries beforehand. Just list your packages in the packages argument in the order you want them to be loaded.

lib_loc

Character vector, optional. Same as in library() or require(). Applies to the packages argument (see above).

prework

Expression (language object), list of expressions, or character vector. Code to run right before targets build. Called only once if parallelism is "loop" and once per target otherwise. This code can be used to set global options, etc.

prepend

Deprecated.

command

Deprecated.

args

Deprecated.

recipe_command

Deprecated.

timeout

Deprecated. Use elapsed and cpu instead.

cpu

Same as the cpu argument of setTimeLimit(). Seconds of cpu time before a target times out. Assign target-level cpu timeout times with an optional cpu column in plan.

elapsed

Same as the elapsed argument of setTimeLimit(). Seconds of elapsed time before a target times out. Assign target-level elapsed timeout times with an optional elapsed column in plan.
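Because cpu and elapsed mirror the arguments of setTimeLimit(), a small base R sketch may help show what such a limit does. The helper run_with_limits() and the functions slow() and fast() are hypothetical names for this example and are not part of drake; the sketch only shows the base R mechanism, not how drake applies it internally.

```r
# Hypothetical helper: evaluate fun() under cpu/elapsed limits.
run_with_limits <- function(fun, cpu = Inf, elapsed = Inf) {
  setTimeLimit(cpu = cpu, elapsed = elapsed, transient = TRUE)
  # Always clear the limits afterwards so later code is unaffected.
  on.exit(setTimeLimit(cpu = Inf, elapsed = Inf, transient = FALSE))
  tryCatch(
    list(ok = TRUE, value = fun()),
    error = function(e) list(ok = FALSE, message = conditionMessage(e))
  )
}

# A deliberately long-running loop and a quick computation.
slow <- function() {
  total <- 0
  for (i in seq_len(1e8)) total <- total + i
  total
}
fast <- function() sum(1:10)

slow_result <- run_with_limits(slow, elapsed = 0.1) # stopped by the limit
fast_result <- run_with_limits(fast, elapsed = 5)   # finishes in time
```

The slow call is interrupted with an error once the elapsed limit is reached, while the fast call returns its value normally.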

retries

Number of retries to execute if the target fails. Assign target-level retries with an optional retries column in plan.

force

Logical. If FALSE (default), drake checks whether the cache was created with an old and incompatible version of drake. If there is an incompatibility, make() stops to give you an opportunity to downgrade drake to a compatible version rather than rerun all your targets from scratch.

log_progress

Logical, whether to log the progress of individual targets as they are being built. Progress logging creates a lot of little files in the cache, and it may make builds a tiny bit slower. So you may see gains in storage efficiency and speed with make(..., log_progress = FALSE).

graph

An igraph object from the previous make(). Supplying a pre-built graph could save time.

trigger

Name of the trigger to apply to all targets. Ignored if plan has a trigger column. See trigger() for details.

skip_targets

Logical, whether to skip building the targets in plan and just import objects and files.

skip_imports

Logical, whether to totally neglect to process the imports and jump straight to the targets. This can be useful if your imports are massive and you just want to test your project, but it is bad practice for reproducible data analysis. This argument is overridden if you supply your own graph argument.

skip_safety_checks

Logical, whether to skip the safety checks on your workflow. Use at your own peril.

lazy_load

Either a character vector or a logical. Choices:

"eager": no lazy loading. The target is loaded right away with assign().
"promise": lazy loading with delayedAssign()
"bind": lazy loading with active bindings: bindr::populate_env().
TRUE: same as "promise".
FALSE: same as "eager".

lazy_load should not be "promise" for "parLapply" parallelism combined with jobs greater than 1. For local multi-session parallelism and lazy loading, try library(future); future::plan(multisession) and then make(..., parallelism = "future_lapply", lazy_load = "bind").

If lazy_load is "eager", drake prunes the execution environment before each target/stage, removing all superfluous targets and then loading any dependencies it will need for building. In other words, drake prepares the environment in advance and tries to be memory efficient. If lazy_load is "bind" or "promise", drake assigns promises to load any dependencies at the last minute. Lazy loading may be more memory efficient in some use cases, but it may duplicate the loading of dependencies, costing time.
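The difference between the loading styles above can be seen with base R alone. The sketch below is only an illustration: fake_load() is a hypothetical stand-in for reading a target from the cache, and makeActiveBinding() is used as a base R stand-in for the active bindings that the "bind" option creates through bindr.

```r
# Count how many times the hypothetical loader runs.
counter <- new.env()
counter$n <- 0L
fake_load <- function(name) {
  counter$n <- counter$n + 1L
  paste("value of", name)
}

env <- new.env()

# Eager: the value is loaded right away with assign().
assign("a", fake_load("a"), envir = env)
after_eager <- counter$n

# Promise: delayedAssign() defers loading until first use, then keeps the value.
delayedAssign("b", fake_load("b"), eval.env = environment(), assign.env = env)
after_promise_setup <- counter$n
invisible(env$b)
invisible(env$b)
after_promise_use <- counter$n

# Active binding: the loader function runs on every access.
makeActiveBinding("d", function() fake_load("d"), env)
invisible(env$d)
invisible(env$d)
after_binding_use <- counter$n
```

In this sketch the eager assignment pays the loading cost up front, the promise pays it once on first access, and the plain active binding pays it on every access.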

session_info

Logical, whether to save the sessionInfo() to the cache. This behavior is recommended for serious make()s for the sake of reproducibility. This argument only exists to speed up tests. Apparently, sessionInfo() is a bottleneck for small make()s.

cache_log_file

Name of the CSV cache log file to write. If TRUE, the default file name is used (drake_cache.CSV). If NULL, no file is written. If activated, this option writes a flat text file to represent the state of the cache (fingerprints of all the targets and imports). If you put the log file under version control, your commit history will give you an easy representation of how your results change over time as the rest of your project changes. Hopefully, this is a step in the right direction for data reproducibility.

seed

Integer, the root pseudo-random number generator seed to use for your project. In make(), drake generates a unique local seed for each target using the global seed and the target name. That way, different pseudo-random numbers are generated for different targets, and this pseudo-randomness is reproducible.

To ensure reproducibility across different R sessions, set.seed() and .Random.seed are ignored and have no effect on drake workflows. Conversely, make() does not usually change .Random.seed, even when pseudo-random numbers are generated. The exception to this last point is make(parallelism = "clustermq") because the clustermq package needs to generate random numbers to set up ports and sockets for ZeroMQ.

On the first call to make() or drake_config(), drake uses the random number generator seed from the seed argument. Here, if the seed is NULL (default), drake uses a seed of 0. On subsequent make()s for existing projects, the project's cached seed will be used in order to ensure reproducibility. Thus, the seed argument must either be NULL or the same seed from the project's cache (usually the .drake/ folder). To reset the random number generator seed for a project, use clean(destroy = TRUE).
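To make the idea of a per-target seed concrete, here is a minimal base R sketch. It assumes a simple polynomial hash to combine the root seed with the target name; this is not drake's actual algorithm, and target_seed() and draw_for_target() are hypothetical helpers. Unlike make(), the sketch calls set.seed() and therefore does change .Random.seed.

```r
# Hypothetical helper: combine a root seed and a target name into one integer.
# This simple polynomial hash is NOT drake's actual algorithm.
target_seed <- function(root_seed, target) {
  h <- root_seed %% 2147483647
  for (code in utf8ToInt(target)) {
    h <- (h * 31 + code) %% 2147483647
  }
  as.integer(h)
}

# Draw pseudo-random numbers for one target using its derived seed.
draw_for_target <- function(root_seed, target, n = 3) {
  set.seed(target_seed(root_seed, target))
  rnorm(n)
}

first <- draw_for_target(0, "model_a")
again <- draw_for_target(0, "model_a") # same target, same draws
other <- draw_for_target(0, "model_b") # different target, different draws
```

The same root seed and target name always reproduce the same draws, while a different target name gives a different stream.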

caching

Character string, either "master" or "worker".

"master": Targets are built by remote workers and sent back to the master process. Then, the master process saves them to the cache (config$cache, usually a file system storr). Appropriate if remote workers do not have access to the file system of the calling R session. Targets are cached one at a time, which may be slow in some situations.
"worker": Remote workers not only build the targets, but also save them to the cache. Here, caching happens in parallel. However, remote workers need to have access to the file system of the calling R session. Transferring target data across a network can be slow.

keep_going

Logical, whether to still keep running make() if targets fail.

session

Deprecated. Has no effect now.

pruning_strategy

Deprecated. See memory_strategy.

makefile_path

Path to the Makefile for make(parallelism = "Makefile"). If you set this argument to a non-default value, you are responsible for supplying this same path to the args argument so make knows where to find it. Example: make(parallelism = "Makefile", makefile_path = ".drake/.makefile", command = "make", args = "--file=.drake/.makefile")

console_log_file

Optional character scalar of a file name or connection object (such as stdout()) to dump maximally verbose log information for make(). Independent of the verbose argument.

ensure_workers

Logical, whether the master process should wait for the workers to post before assigning them targets. Should usually be TRUE. Set to FALSE for make(parallelism = "future_lapply", jobs = n) (n > 1) when combined with future::plan(future::sequential). This argument only applies to parallel computing with persistent workers (make(parallelism = x), where x could be "mclapply", "parLapply", or "future_lapply").

garbage_collection

Logical, whether to call gc() each time a target is built during make().

template

A named list of values to fill in the {{ ... }} placeholders in template files (e.g. from drake_hpc_template_file()). Same as the template argument of clustermq::Q() and clustermq::workers. Enabled for clustermq only (make(parallelism = "clustermq")), not future or batchtools so far. For more information, see the clustermq package: https://github.com/mschubert/clustermq. Some template placeholders such as {{ job_name }} and {{ n_jobs }} cannot be set this way.

sleep

Optional function on a single numeric argument i. Default: function(i) 0.01.

To conserve memory, drake assigns a brand new closure to sleep, so your custom function should not depend on in-memory data except from loaded packages.

For parallel processing, drake uses a central master process to check what the parallel workers are doing and, for the affected high-performance computing workflows, to wait for data to arrive over a network. In between loop iterations, the master process sleeps to avoid throttling. The sleep argument to make() and drake_config() allows you to customize how much time the master process spends sleeping.

The sleep argument is a function that takes an argument i and returns a numeric scalar, the number of seconds to supply to Sys.sleep() after iteration i of checking. (Here, i starts at 1.) If the checking loop does something other than sleeping on iteration i, then i is reset back to 1.

To sleep for the same amount of time between checks, you might supply something like function(i) 0.01. But to avoid consuming too many resources during heavier and longer workflows, you might use an exponential back-off: say, function(i) { 0.1 + 120 * pexp(i - 1, rate = 0.01) }.
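To get a feel for the two sleep functions mentioned above, the sketch below tabulates the number of seconds each would return at a few iteration counts. The names constant_sleep, backoff_sleep, and schedule are chosen for this example only.

```r
# Constant sleep: the same short pause after every check.
constant_sleep <- function(i) 0.01

# Exponential back-off from the text: short pauses at first, longer later.
backoff_sleep <- function(i) {
  0.1 + 120 * pexp(i - 1, rate = 0.01)
}

iterations <- c(1, 10, 100, 1000)
schedule <- data.frame(
  i = iterations,
  constant = vapply(iterations, constant_sleep, numeric(1)),
  backoff = vapply(iterations, backoff_sleep, numeric(1))
)
print(schedule)
```

The back-off starts at 0.1 seconds for i = 1 and levels off just below 120.1 seconds as i grows, so a long idle stretch costs few checks while the first few checks stay responsive.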

hasty_build

A user-defined function. In "hasty mode" (make(parallelism = "hasty")) this is the function that evaluates a target's command and returns the resulting value. The hasty_build argument has no effect if parallelism is any value other than "hasty".

The function you pass to hasty_build must have arguments target and config. Here, target is a character scalar naming the target being built, and config is a configuration list of runtime parameters generated by drake_config().

memory_strategy

Character scalar, name of the strategy drake uses to manage targets in memory. For more direct control over which targets drake keeps in memory, see the help file examples of drake_envir(). The memory_strategy argument to make() and drake_config() is an attempt at an automatic catch-all solution. These are the choices.

"speed": Once a target is loaded in memory, just keep it there. This choice maximizes speed and hogs memory.
"memory": Just before building each new target, unload everything from memory except the target's direct dependencies. This option conserves memory, but it sacrifices speed because each new target needs to reload any previously unloaded targets from storage.
"lookahead": Just before building each new target, search the dependency graph to find targets that will not be needed for the rest of the current make() session. In this mode, targets are only in memory if they need to be loaded, and we avoid superfluous reads from the cache. However, searching the graph takes time, and it could even double the computational overhead for large projects.

Each strategy has a weakness. "speed" is memory-hungry, "memory" wastes time reloading targets from storage, and "lookahead" wastes time traversing the entire dependency graph on every make(). For a better compromise and more control, see the examples in the help file of drake_envir().

layout

config$layout, where config is the return value from a prior call to drake_config(). If your plan or environment have changed since the last make(), do not supply a layout argument. Otherwise, supplying one could save time.

lock_envir

Logical, whether to lock config$envir during make(). If TRUE, make() quits in error whenever a command in your drake plan (or prework) tries to add, remove, or modify non-hidden variables in your environment/workspace/R session. This is extremely important for ensuring the purity of your functions and the reproducibility/credibility/trust you can place in your project. lock_envir will be set to a default of TRUE in drake version 7.0.0 and higher.
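Base R's lockEnvironment() offers a comparable guard, and the sketch below shows what a locked environment rejects. It illustrates the base R mechanism only and does not claim to reproduce drake's internals; try_change() is a hypothetical helper for this example.

```r
envir <- new.env()
assign("x", 1, envir = envir)

# Lock the environment and all of its existing bindings.
lockEnvironment(envir, bindings = TRUE)

# Hypothetical helper: report whether an action succeeds or is rejected.
try_change <- function(action) {
  tryCatch({
    action()
    "allowed"
  }, error = function(e) "blocked")
}

modify_result <- try_change(function() assign("x", 2, envir = envir))
add_result <- try_change(function() assign("y", 3, envir = envir))
read_result <- try_change(function() get("x", envir = envir))
```

Modifying x and adding y are both rejected with an error, while reading x still works and returns the original value.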

Value

The master internal configuration list of a project.

Examples


# NOT RUN {
isolate_example("Quarantine side effects.", {
load_mtcars_example() # Get the code with drake_example("mtcars").
# Construct the master internal configuration list.
config <- drake_config(my_plan)
if (requireNamespace("visNetwork")) {
  vis_drake_graph(config) # See the dependency graph.
  if (requireNamespace("networkD3")) {
    sankey_drake_graph(config) # See the dependency graph.
  }
}
# These functions are faster than otherwise
# because they use the configuration list.
outdated(config) # Which targets are out of date?
missed(config) # Which imports are missing?
})
# }
