This is the central, most important function of the drake package. It runs all the steps of your workflow in the correct order, skipping any work that is already up to date. See https://github.com/ropensci/drake/blob/master/README.md#documentation for an overview of the documentation.
make(plan = read_drake_plan(), targets = drake::possible_targets(plan),
envir = parent.frame(), verbose = drake::default_verbose(),
hook = default_hook, cache = drake::get_cache(verbose = verbose, force =
force, console_log_file = console_log_file), fetch_cache = NULL,
parallelism = drake::default_parallelism(), jobs = 1,
packages = rev(.packages()), prework = character(0),
prepend = character(0), command = drake::default_Makefile_command(),
args = drake::default_Makefile_args(jobs = jobs, verbose = verbose),
recipe_command = drake::default_recipe_command(), log_progress = TRUE,
skip_targets = FALSE, timeout = Inf, cpu = NULL, elapsed = NULL,
retries = 0, force = FALSE, return_config = NULL, graph = NULL,
trigger = drake::default_trigger(), skip_imports = FALSE,
skip_safety_checks = FALSE, config = NULL, lazy_load = "eager",
session_info = TRUE, cache_log_file = NULL, seed = NULL,
caching = "worker", keep_going = FALSE, session = NULL,
imports_only = NULL, pruning_strategy = c("speed", "memory"),
makefile_path = "Makefile", console_log_file = NULL,
ensure_workers = TRUE)
workflow plan data frame.
A workflow plan data frame is a data frame with a target column and a command column.
(See the details in the drake_plan() help file for descriptions of the optional columns.)
Targets are the objects and files that drake generates,
and commands are the pieces of R code that produce them.
Use the function drake_plan() to generate workflow plan data frames easily,
and see functions plan_analyses(), plan_summaries(), evaluate_plan(), expand_plan(), and gather_plan()
for easy ways to generate large workflow plan data frames.
character vector, names of targets to build.
Dependencies are built too. Together, the plan and targets comprise the workflow network (i.e. the graph argument).
Changing either will change the network.
environment to use. Defaults to the current
workspace, so you should not need to worry about this
most of the time. A deep copy of envir is made,
so you don't need to worry about your workspace being modified by make().
The deep copy inherits from the global environment.
Wherever necessary, objects and functions are imported
from envir and the global environment and
then reproducibly tracked as dependencies.
logical or numeric, controls printing to the console.
Use pkgconfig to set the default value of verbose for your R session:
for example, pkgconfig::set_config("drake::verbose" = 2).
0 or FALSE: print nothing.
1 or TRUE: print only targets to build.
2: in addition, print checks and cache info.
3: in addition, print any potentially missing items.
4: in addition, print imports. Full verbosity.
function with at least one argument.
The hook acts as a wrapper around the code that drake uses
to build a target (see the body of drake:::build_in_hook()).
Hooks can control the side effects of build behavior.
For example, to redirect output and error messages to text files,
you might use the built-in silencer_hook(), as in make(my_plan, hook = silencer_hook).
The silencer hook is useful for distributed parallelism,
where the calling R process does not have control over all the
error and output streams. See also output_sink_hook() and message_sink_hook().
For your own custom hooks, treat the first argument as the code
that builds a target, and make sure this argument is actually evaluated.
Otherwise, the code will not run and none of your targets will build.
For example, function(code){force(code)} is a good hook
and function(code){message("Avoiding the code")} is a bad hook.
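The difference between the two hooks comes down to R's lazy evaluation of function arguments. The base R sketch below illustrates this without drake; build_target() and the built vector are hypothetical stand-ins for the code drake would pass to a hook, not part of the package.

```r
# Hypothetical stand-in for the code drake passes to a hook:
# it records which targets were actually built.
built <- character(0)
build_target <- function(name) {
  built <<- c(built, name)
  name
}

# Good hook: force() evaluates the argument, so the build code runs.
good_hook <- function(code) {
  force(code)
}

# Bad hook: the argument is never evaluated, so the build code never runs.
bad_hook <- function(code) {
  message("Avoiding the code")
}

invisible(good_hook(build_target("a")))
bad_hook(build_target("b"))
print(built)  # only "a" was recorded
```

Because bad_hook() never touches its argument, the call build_target("b") is never evaluated at all.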
drake cache as created by new_cache().
See also get_cache(), this_cache(), and recover_cache().
character vector containing lines of code.
The purpose of this code is to fetch the storr cache
with a command like storr_rds() or storr_dbi(), but customized.
This feature is experimental. It will turn out
to be necessary if you are using both custom non-RDS caches
and distributed parallelism (parallelism = "future_lapply" or "Makefile")
because the distributed R sessions need to know how to load the cache.
character, type of parallelism to use.
To list the options, call parallelism_choices().
For detailed explanations, see the high-performance computing chapter of the user manual.
number of parallel processes or jobs to run.
See max_useful_jobs() or vis_drake_graph()
to help figure out what the number of jobs should be.
Windows users should not set jobs > 1 if parallelism is "mclapply"
because mclapply() is based on forking. Windows users
who use parallelism = "Makefile" will need to download and install Rtools.
Imports and targets are processed separately, and they usually
have different parallelism needs. To use at most 2 jobs at a time
for imports and at most 4 jobs at a time for targets, call
make(..., jobs = c(imports = 2, targets = 4)).
For "future_lapply" parallelism, jobs only applies to the imports.
To set the max number of jobs for "future_lapply" parallelism,
set the workers argument where it exists: for example, call
future::plan(multisession(workers = 4)),
then call make(your_plan, parallelism = "future_lapply").
You might also try options(mc.cores = jobs),
or see future::.options for environment variables that set the max number of jobs.
If parallelism is "Makefile", Makefile-level parallelism is
only used for targets in your workflow plan data frame, not imports. To
process imported objects and files, drake selects the best parallel backend
for your system and uses the number of jobs you give to the jobs
argument to make(). To use at most 2 jobs for imports and at
most 4 jobs for targets, run
make(..., parallelism = "Makefile", jobs = c(imports = 2, targets = 4))
or
make(..., parallelism = "Makefile", jobs = 2, args = "--jobs=4").
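As a rough illustration of the two accepted forms of jobs, the base R sketch below splits a scalar or a named vector into separate limits for imports and targets. The helper parse_jobs() is hypothetical and written only for this example; it assumes that a scalar applies to both stages and is not drake's internal code.

```r
# Hypothetical helper: normalize `jobs` into limits for imports and targets.
parse_jobs <- function(jobs) {
  if (length(jobs) == 1 && is.null(names(jobs))) {
    # Assumption: a single unnamed number applies to both stages.
    return(c(imports = jobs, targets = jobs))
  }
  stopifnot(all(c("imports", "targets") %in% names(jobs)))
  jobs[c("imports", "targets")]
}

print(parse_jobs(2))
print(parse_jobs(c(targets = 4, imports = 2)))
```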
character vector of packages to load, in the order
they should be loaded. Defaults to rev(.packages()), so you
should not usually need to set this manually. Just call
library() to load your packages before make().
However, sometimes packages need to be strictly forced to load
in a certain order, especially if parallelism is "Makefile".
To do this, do not use library(), require(), loadNamespace(), or attachNamespace()
to load any packages beforehand.
Just list your packages in the packages argument in the order
you want them to be loaded.
If parallelism is "mclapply", the necessary packages
are loaded once before any targets are built. If parallelism is "Makefile",
the necessary packages are loaded once on
initialization and then once again for each target right
before that target is built.
character vector of lines of code to run
before build time. This code can be used to
load packages, set options, etc., although the packages in the
packages argument are loaded before any prework is done.
If parallelism is "mclapply", the prework
is run once before any targets are built. If parallelism is "Makefile",
the prework is run once on initialization
and then once again for each target right before that target is built.
lines to prepend to the Makefile if parallelism is "Makefile".
See the high-performance computing guide to learn how to use prepend
to take advantage of multiple nodes of a supercomputer.
character scalar, command to call the Makefile
generated for distributed computing.
Only applies when parallelism is "Makefile".
Defaults to the usual "make" (default_Makefile_command()),
but it could also be "lsmake" on supporting systems, for example.
command and args are executed via system2(command, args) to run the Makefile.
If args has something like "--jobs=2", or if jobs >= 2 and args is left alone,
targets will be distributed over independent parallel R sessions
wherever possible.
command line arguments to call the Makefile for
distributed computing. For advanced users only. If set,
jobs and verbose are overwritten as they apply to the Makefile.
command and args are executed via system2(command, args) to run the Makefile.
If args has something like "--jobs=2", or if jobs >= 2 and args is left alone,
targets will be distributed over independent parallel R sessions
wherever possible.
Character scalar, command for the Makefile recipe for each target.
logical, whether to log the progress
of individual targets as they are being built. Progress logging
creates a lot of little files in the cache, and it may make builds
a tiny bit slower. So you may see gains in storage efficiency
and speed with make(..., log_progress = FALSE).
But be warned that progress() and in_progress()
will no longer work if you do that.
logical, whether to skip building the targets in plan
and just import objects and files.
Seconds of overall time to allow before imposing
a timeout on a target. Passed to R.utils::withTimeout().
Assign target-level timeout times with an optional timeout column in plan.
Seconds of cpu time to allow before imposing
a timeout on a target. Passed to R.utils::withTimeout().
Assign target-level cpu timeout times with an optional cpu column in plan.
Seconds of elapsed time to allow before imposing
a timeout on a target. Passed to R.utils::withTimeout().
Assign target-level elapsed timeout times with an optional elapsed column in plan.
Number of retries to execute if the target fails.
Assign target-level retries with an optional retries column in plan.
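To make the meaning of a retry budget concrete, here is a minimal base R sketch. The helper run_with_retries() and the flaky_build() function are hypothetical and exist only for illustration; drake's own retry logic (and its timeouts via R.utils::withTimeout()) is not reproduced here.

```r
# Hypothetical helper: call `build()` once, then retry up to `retries` times.
run_with_retries <- function(build, retries = 0) {
  attempt <- 0
  repeat {
    result <- tryCatch(build(), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    if (attempt >= retries) {
      stop(result)  # retry budget exhausted: re-signal the last error
    }
    attempt <- attempt + 1
  }
}

# A build that fails twice before succeeding.
calls <- 0
flaky_build <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("transient failure")
  "done"
}

value <- run_with_retries(flaky_build, retries = 2)
print(value)
print(calls)
```

With retries = 2 the build is attempted three times in total, so the third attempt succeeds; with retries = 1 the same build would end in an error.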
Force make() to build your targets even if something
about your setup is not quite right: for example, if you are using
a version of drake that is not backward compatible with your project's cache.
Logical, whether to return the internal list
of runtime configuration parameters used by make().
This argument is deprecated. Now, a configuration list
is always invisibly returned.
An igraph object from the previous make().
Supplying a pre-built graph could save time.
The graph is constructed by build_drake_graph().
You can also get one from config(my_plan)$graph.
Overrides skip_imports.
Name of the trigger to apply to all targets.
Ignored if plan has a trigger column.
Must be in triggers().
See triggers for explanations of the choices.
logical, whether to totally neglect to
process the imports and jump straight to the targets. This can be useful
if your imports are massive and you just want to test your project,
but it is bad practice for reproducible data analysis.
This argument is overridden if you supply your own graph argument.
logical, whether to skip the safety checks on your workflow. Use at your own peril.
Master configuration list produced by both make() and drake_config().
either a character vector or a logical. Choices:
"eager": no lazy loading. The target is loaded right away with assign().
"promise": lazy loading with delayedAssign().
"bind": lazy loading with active bindings: bindr::populate_env().
TRUE: same as "promise".
FALSE: same as "eager".
lazy_load should not be "promise" for "parLapply" parallelism combined with jobs greater than 1.
For local multi-session parallelism and lazy loading, try
library(future); future::plan(multisession)
and then
make(..., parallelism = "future_lapply", lazy_load = "bind").
If lazy_load is "eager",
drake prunes the execution environment before each target/stage,
removing all superfluous targets
and then loading any dependencies it will need for building.
In other words, drake prepares the environment in advance
and tries to be memory efficient.
If lazy_load is "bind" or "promise", drake assigns
promises to load any dependencies at the last minute.
Lazy loading may be more memory efficient in some use cases, but
it may duplicate the loading of dependencies, costing time.
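The loading modes described above map onto base R mechanisms that can be tried in isolation. In the sketch below, read_target() and the loads counter are hypothetical stand-ins for reading a target from the cache, and makeActiveBinding() is used as a base R analogue of active bindings; drake itself is documented to use bindr::populate_env() for the "bind" option.

```r
env <- new.env()
loads <- 0

# Hypothetical stand-in for reading a target from the cache.
read_target <- function() {
  loads <<- loads + 1
  42
}

# "eager": the value is read and assigned right away.
assign("eager_target", read_target(), envir = env)
loads_after_eager <- loads

# "promise": nothing is read until the first access.
delayedAssign("lazy_target", read_target(),
              eval.env = environment(), assign.env = env)
loads_before_access <- loads
value <- get("lazy_target", envir = env)
loads_after_access <- loads

# Active binding: the function runs again on every access,
# which is one way repeated loading can cost time.
makeActiveBinding("bound_target", function() read_target(), env)
first <- env$bound_target
second <- env$bound_target
loads_after_bindings <- loads

print(c(loads_after_eager, loads_before_access,
        loads_after_access, loads_after_bindings))
```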
Name of the cache log file to write.
If TRUE, the default file name is used (drake_cache.log).
If NULL, no file is written.
If activated, this option uses drake_cache_log_file()
to write a flat text file
to represent the state of the cache
(fingerprints of all the targets and imports).
If you put the log file under version control, your commit history
will give you an easy representation of how your results change
over time as the rest of your project changes. Hopefully,
this is a step in the right direction for data reproducibility.
integer, the root pseudo-random seed to use for your project.
To ensure reproducibility across different R sessions,
set.seed() and .Random.seed are ignored and have no effect on
drake workflows. Conversely, make() does not change .Random.seed,
even when pseudo-random numbers are generated.
On the first call to make() or drake_config(), drake
uses the random number generator seed from the seed argument.
Here, if the seed is NULL (default), drake uses a seed of 0.
On subsequent make()s for existing projects, the project's
cached seed will be used in order to ensure reproducibility.
Thus, the seed argument must either be NULL or the same
seed from the project's cache (usually the .drake/ folder).
To reset the random number generator seed for a project,
use clean(destroy = TRUE).
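The two guarantees above (results independent of the session's seed, and a session seed left untouched) can be imitated in base R. The helper with_local_seed() below is a hypothetical sketch written for this illustration; it is not how drake implements its seed handling.

```r
# Hypothetical helper: run `code` under a fixed seed, then restore
# the session's own .Random.seed exactly as it was.
with_local_seed <- function(seed, code) {
  global <- globalenv()
  had_seed <- exists(".Random.seed", envir = global, inherits = FALSE)
  if (had_seed) {
    old_seed <- get(".Random.seed", envir = global, inherits = FALSE)
  }
  on.exit({
    if (had_seed) {
      assign(".Random.seed", old_seed, envir = global)
    } else if (exists(".Random.seed", envir = global, inherits = FALSE)) {
      rm(".Random.seed", envir = global)
    }
  })
  set.seed(seed)
  force(code)  # `code` is lazy, so it is evaluated after set.seed()
}

set.seed(123)                 # the session's own seed
state_before <- .Random.seed
x1 <- with_local_seed(0, runif(3))
x2 <- with_local_seed(0, runif(3))
print(identical(x1, x2))                     # same draws both times
print(identical(state_before, .Random.seed)) # session state unchanged
```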
character string, only applies to "future" parallelism.
logical, whether to still keep running make() if targets fail.
An optional callr function if you want to
build all your targets in a separate master session:
for example, make(plan = my_plan, session = callr::r_vanilla).
Running make() in a clean, isolated
session can enhance reproducibility.
But be warned: if you do this, make() will take longer to start.
If session is NULL (default), then make() will just use
your current R session as the master session. This is slightly faster,
but it causes make() to populate your workspace/environment
with the last few targets it builds.
deprecated. Use skip_targets instead.
Character scalar, either "speed" (default) or "memory".
These are alternative approaches to how drake
keeps non-import dependencies in memory when it builds a target.
If pruning_strategy is "memory", drake removes all targets
from memory (i.e. config$envir) except the direct dependencies
of the target it is about to build. This is suitable for data so large
that the optimal strategy is to minimize memory consumption.
If pruning_strategy is "speed", drake loads all the dependencies
and keeps in memory everything that will eventually be a
dependency of a downstream target. This strategy consumes more
memory, but does more to minimize the number of times data is
read from storage/disk.
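As a rough picture of the "memory" strategy, the base R sketch below removes everything from an environment except the direct dependencies of the next target. The helper prune_memory() and the object names are hypothetical, and the sketch assumes the environment holds only targets; drake's real pruning also has to distinguish imports from targets.

```r
# Hypothetical helper: keep only the direct dependencies in `envir`.
prune_memory <- function(envir, direct_deps) {
  extra <- setdiff(ls(envir, all.names = TRUE), direct_deps)
  rm(list = extra, envir = envir)
  invisible(envir)
}

env <- new.env()
assign("small", 1, envir = env)
assign("large", 2, envir = env)
assign("unrelated", 3, envir = env)

# Suppose the next target depends directly on `small` and `large` only.
prune_memory(env, direct_deps = c("small", "large"))
print(ls(env))
```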
Path to the Makefile for make(parallelism = "Makefile").
If you set this argument to a
non-default value, you are responsible for supplying this same
path to the args argument so make knows where to find it.
Example: make(parallelism = "Makefile", makefile_path = ".drake/.makefile", command = "make", args = "--file=.drake/.makefile")
character scalar or NULL.
If NULL, console output will be printed
to the R console using message().
Otherwise, console_log_file should be the name of a flat file.
Console output will be appended to that file.
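The switch between message() and an appended log file can be sketched in a few lines of base R. The helper log_line() is hypothetical and only mirrors the behavior described above; it is not drake's internal logging function.

```r
# Hypothetical helper: print to the console, or append to a flat file.
log_line <- function(..., console_log_file = NULL) {
  text <- paste0(...)
  if (is.null(console_log_file)) {
    message(text)
  } else {
    cat(text, "\n", file = console_log_file, append = TRUE, sep = "")
  }
  invisible(text)
}

log_file <- tempfile(fileext = ".log")
log_line("target small", console_log_file = log_file)
log_line("target large", console_log_file = log_file)
print(readLines(log_file))
```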
logical, whether the master process
should wait for the workers to post before assigning them
targets. Should usually be TRUE. Set to FALSE
for make(parallelism = "future_lapply", jobs = n) (n > 1)
when combined with future::plan(future::sequential).
This argument only applies to parallel computing with persistent workers
(make(parallelism = x), where x could be "mclapply", "parLapply", or "future_lapply").
The master internal configuration list, mostly
containing arguments to make() and important objects
constructed along the way. See config() for more details.
drake_plan(), vis_drake_graph(), parallelism_choices(), max_useful_jobs(), triggers(), make_with_config()
# NOT RUN {
test_with_dir("Quarantine side effects.", {
load_mtcars_example() # Get the code with drake_example("mtcars").
config <- drake_config(my_plan)
outdated(config) # Which targets need to be (re)built?
my_jobs = max_useful_jobs(config) # Depends on what is up to date.
make(my_plan, jobs = 2) # Build what needs to be built.
outdated(config) # Everything is up to date.
# Change one of your imported function dependencies.
reg2 = function(d){
d$x3 = d$x^3
lm(y ~ x3, data = d)
}
outdated(config) # Some targets depend on reg2().
vis_drake_graph(config) # See how they fit in an interactive graph.
make(my_plan) # Rebuild just the outdated targets.
outdated(config) # Everything is up to date again.
make(my_plan, cache_log_file = TRUE) # Write a text log file this time.
vis_drake_graph(config) # The colors changed in the graph.
clean() # Start from scratch.
# Run with at most 2 jobs at a time for the imports
# and at most 4 jobs at a time for the targets.
make(my_plan, jobs = c(imports = 2, targets = 4))
clean() # Start from scratch.
# Rerun with "Makefile" parallelism with at most 4 jobs.
# Requires Rtools on Windows.
# make(my_plan, parallelism = "Makefile", jobs = 4) # nolint
clean() # Start from scratch.
# Specify your own Makefile recipe.
# Requires Rtools on Windows.
# make(my_plan, parallelism = "Makefile", jobs = 4, # nolint
# recipe_command = "R -q -e") # nolint
#
# make() respects tidy evaluation as implemented in the rlang package.
# This workflow plan uses rlang's quasiquotation operator `!!`.
my_plan <- drake_plan(list = c(
little_b = "\"b\"",
letter = "!!little_b"
))
my_plan
make(my_plan)
readd(letter) # "b"
})
# }