drake v7.3.0


A Pipeline Toolkit for Reproducible Computation at Scale

A general-purpose computational engine for data analysis, drake rebuilds intermediate data objects when their dependencies change, and it skips work when the results are already up to date. Not every execution starts from scratch, there is native support for parallel and distributed computing, and completed projects have tangible evidence that they are reproducible. Extensive documentation, from beginner-friendly tutorials to practical examples and more, is available at the reference website <https://ropensci.github.io/drake/> and the online manual <https://ropenscilabs.github.io/drake-manual/>.

Readme

Project status: active – the project has reached a stable, usable state and is being actively developed.

The drake R package

Data analysis can be slow. A round of scientific computation can take several minutes, hours, or even days to complete. After it finishes, if you update your code or data, your hard-earned results may no longer be valid. How much of that valuable output can you keep, and how much do you need to update? How much runtime must you endure all over again?

For projects in R, the drake package can help. It analyzes your workflow, skips steps with up-to-date results, and orchestrates the rest with optional distributed computing. At the end, drake provides evidence that your results match the underlying code and data, which increases your ability to trust your research.

6-minute video

Visit the first page of the manual to watch a short introduction.


What gets done stays done.

Too many data science projects follow a Sisyphean loop:

  1. Launch the code.
  2. Wait while it runs.
  3. Discover an issue.
  4. Rerun from scratch.

Ordinarily, it is hard to avoid rerunning the code from scratch.


But with drake, you can automatically

  1. Launch the parts that changed since last time.
  2. Skip the rest.

How it works

To set up a project, load your packages,

library(drake)
library(dplyr)
library(ggplot2)
#> Registered S3 methods overwritten by 'ggplot2':
#>   method         from 
#>   [.quosures     rlang
#>   c.quosures     rlang
#>   print.quosures rlang

load your custom functions,

create_plot <- function(data) {
  ggplot(data, aes(x = Petal.Width, fill = Species)) +
    geom_histogram()
}

check any supporting files (optional),

# Get the files with drake_example("main").
file.exists("raw_data.xlsx")
#> [1] TRUE
file.exists("report.Rmd")
#> [1] TRUE

and plan what you are going to do.

plan <- drake_plan(
  raw_data = readxl::read_excel(file_in("raw_data.xlsx")),
  data = raw_data %>%
    mutate(Species = forcats::fct_inorder(Species)),
  hist = create_plot(data),
  fit = lm(Sepal.Width ~ Petal.Width + Species, data),
  report = rmarkdown::render(
    knitr_in("report.Rmd"),
    output_file = file_out("report.html"),
    quiet = TRUE
  )
)
plan
#> # A tibble: 5 x 2
#>   target   command                                                         
#>   <chr>    <expr>                                                          
#> 1 raw_data readxl::read_excel(file_in("raw_data.xlsx"))                   …
#> 2 data     raw_data %>% mutate(Species = forcats::fct_inorder(Species))   …
#> 3 hist     create_plot(data)                                              …
#> 4 fit      lm(Sepal.Width ~ Petal.Width + Species, data)                  …
#> 5 report   rmarkdown::render(knitr_in("report.Rmd"), output_file = file_ou…

So far, we have just been setting the stage. Use make() to do the real work. Targets are built in the correct order regardless of the row order of plan.

make(plan)
#> target raw_data
#> target data
#> target fit
#> target hist
#> target report

Except for files like report.html, your output is stored in a hidden .drake/ folder. Reading it back is easy.

readd(data) # See also loadd().
#> # A tibble: 150 x 5
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
#> 1          5.1         3.5          1.4         0.2 setosa 
#> 2          4.9         3            1.4         0.2 setosa 
#> 3          4.7         3.2          1.3         0.2 setosa 
#> 4          4.6         3.1          1.5         0.2 setosa 
#> 5          5           3.6          1.4         0.2 setosa 
#> # … with 145 more rows

You may look back on your work and see room for improvement, but it’s all good! The whole point of drake is to help you go back and change things quickly and painlessly. For example, we forgot to give our histogram a bin width.

readd(hist)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.


So let’s fix the plotting function.

create_plot <- function(data) {
  ggplot(data, aes(x = Petal.Width, fill = Species)) +
    geom_histogram(binwidth = 0.25) +
    theme_gray(20)
}

drake knows which results are affected.

config <- drake_config(plan)
vis_drake_graph(config) # Interactive graph: zoom, drag, etc.


The next make() just builds hist and report.html. No point in wasting time on the data or model.

make(plan)
#> target hist
#> target report
loadd(hist)
hist


Reproducibility with confidence

The R community emphasizes reproducibility. Traditional themes include scientific replicability, literate programming with knitr, and version control with git. But internal consistency is important too. Reproducibility carries the promise that your output matches the code and data you say you used. With the exception of non-default triggers and hasty mode, drake strives to keep this promise.
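
For instance, a non-default trigger can force a target to rebuild every time, trading part of that promise for convenience. A minimal sketch, reusing the report target from the plan above:

plan <- drake_plan(
  report = target(
    rmarkdown::render(
      knitr_in("report.Rmd"),
      output_file = file_out("report.html"),
      quiet = TRUE
    ),
    # Non-default trigger: always rebuild this target, even if
    # drake would otherwise consider it up to date.
    trigger = trigger(condition = TRUE)
  )
)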

Evidence

Suppose you are reviewing someone else’s data analysis project for reproducibility. You scrutinize it carefully, checking that the datasets are available and the documentation is thorough. But could you re-create the results without the help of the original author? With drake, it is quick and easy to find out.

make(plan)
#> All targets are already up to date.

config <- drake_config(plan)
outdated(config)
#> character(0)

With everything already up to date, you have tangible evidence of reproducibility. Even though you did not re-create the results, you know the results are re-creatable. They faithfully show what the code is producing. Given the right package environment and system configuration, you have everything you need to reproduce all the output by yourself.

Ease

When it comes time to actually rerun the entire project, you have much more confidence. Starting over from scratch is trivially easy.

clean()       # Remove the original author's results.
make(plan) # Independently re-create the results from the code and input data.
#> target raw_data
#> target data
#> target fit
#> target hist
#> target report

Independent replication

With even more evidence and confidence, you can invest the time to independently replicate the original code base if necessary. Up until this point, you relied on basic drake functions such as make(), so you may not have needed to peek at any substantive author-defined code in advance. In that case, you can stay usefully ignorant as you reimplement the original author’s methodology. In other words, drake could potentially improve the integrity of independent replication.

Readability and transparency

Ideally, independent observers should be able to read your code and understand it. drake helps in several ways.

  • The workflow plan data frame explicitly outlines the steps of the analysis, and vis_drake_graph() visualizes how those steps depend on each other (see the sketch after this list).
  • drake takes care of the parallel scheduling and high-performance computing (HPC) for you. That means the HPC code is no longer tangled up with the code that actually expresses your ideas.
  • You can generate large collections of targets without necessarily changing your code base of imported functions, another nice separation between the concepts and the execution of your workflow.
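
For example, the plan and its dependency graph are themselves readable artifacts. A small sketch, reusing the plan and config objects created above (drake_plan_source() prints code that reproduces the plan):

plan                     # the workflow plan data frame: one row per target
drake_plan_source(plan)  # code that regenerates the plan, for review
vis_drake_graph(config)  # interactive dependency graph of the whole workflow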

Aggressively scale up.

Not every project can complete in a single R session on your laptop. Some projects need more speed or computing power. Some require a few local processor cores, and some need large high-performance computing systems. But parallel computing is hard. Your tables and figures depend on your analysis results, and your analyses depend on your datasets, so some tasks must finish before others even begin. drake knows what to do. Parallelism is implicit and automatic. See the high-performance computing guide for all the details.

# Use the spare cores on your local machine.
make(plan, jobs = 4)

# Or scale up to a supercomputer.
drake_hpc_template_file("slurm_clustermq.tmpl") # https://slurm.schedmd.com/
options(
  clustermq.scheduler = "slurm",
  clustermq.template = "slurm_clustermq.tmpl"
)
make(plan, parallelism = "clustermq", jobs = 4)

Installation

You can choose among different versions of drake. The CRAN release often lags behind the online manual but may have fewer bugs.

# Install the latest stable release from CRAN.
install.packages("drake")

# Alternatively, install the development version from GitHub.
install.packages("devtools")
library(devtools)
install_github("ropensci/drake")

Function reference

The reference section lists all the available functions. Here are the most important ones.

  • drake_plan(): create a workflow plan data frame (like plan above).
  • make(): build your project.
  • r_make(): launch a fresh callr::r() process to build your project. Called from an interactive R session, r_make() is more reproducible than make() (see the sketch after this list).
  • loadd(): load one or more built targets into your R session.
  • readd(): read and return a built target.
  • drake_config(): create a master configuration list for other user-side functions.
  • vis_drake_graph(): show an interactive visual network representation of your workflow.
  • outdated(): see which targets will be built in the next make().
  • deps_code(): check the dependencies of a command or function.
  • failed(): list the targets that failed to build in the last make().
  • diagnose(): return the full context of a build, including errors, warnings, and messages.
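
To show how r_make() fits in, here is a minimal sketch of the _drake.R script it sources by default; the functions.R file is a hypothetical place for create_plot() and any other custom functions. The script sets up the session and ends with drake_config(), and r_make() then runs make() on that configuration in a fresh callr::r() process, keeping your interactive workspace clean.

# _drake.R (sketch)
library(drake)
library(dplyr)
library(ggplot2)
source("functions.R")  # hypothetical script defining create_plot()
plan <- drake_plan(
  raw_data = readxl::read_excel(file_in("raw_data.xlsx")),
  data = raw_data %>%
    mutate(Species = forcats::fct_inorder(Species)),
  hist = create_plot(data)
)
drake_config(plan)  # r_make() uses the configuration returned here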

Documentation

Use cases

The official rOpenSci use cases and associated discussion threads describe applications of drake in action. Here are some more applications of drake in real-world projects.

Help and troubleshooting

The following resources document many known issues and challenges.

If you are still having trouble, please submit a new issue with a bug report or feature request, along with a minimal reproducible example where appropriate.

The GitHub issue tracker is mainly intended for bug reports and feature requests. Usage questions are also welcome there, but you may prefer to post them to Stack Overflow with the drake-r-package tag.

Contributing

Development is a community effort, and we encourage participation. Please read CONTRIBUTING.md for details.

Similar work

GNU Make

The original idea of a time-saving reproducible build system extends back at least as far as GNU Make, which still aids the work of data scientists as well as its original user base of compiled-language programmers. In fact, the name “drake” stands for “Data Frames in R for Make”. Make is used widely in reproducible research, and Karl Broman’s website collects several examples.

There are several reasons for R users to prefer drake instead.

  • drake already has a Make-powered parallel backend. Just run make(..., parallelism = "Makefile", jobs = 2) to enjoy most of the original benefits of Make itself.
  • Improved scalability. With Make, you must write a potentially large and cumbersome Makefile by hand. But with drake, you can use wildcard templating to automatically generate massive collections of targets with minimal code (see the sketch after this list).
  • Lower overhead for light-weight tasks. For each Make target that uses R, a brand new R session must spawn. For projects with thousands of small targets, that means more time may be spent loading R sessions than doing the actual work. With make(..., parallelism = "mclapply", jobs = 4), drake launches 4 persistent workers up front and efficiently processes the targets in R.
  • Convenient organization of output. With Make, the user must save each target as a file. drake saves all the results for you automatically in a storr cache so you do not have to micromanage the results.
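
In current versions of drake, the old wildcard functions are deprecated in favor of static branching inside drake_plan() itself. A minimal sketch of that idea, where simulate() stands in for a hypothetical user-defined function:

plan <- drake_plan(
  small = simulate(5),   # simulate() is hypothetical
  large = simulate(50),
  fit = target(
    lm(y ~ x, data = dataset),
    transform = map(dataset = c(small, large))  # generates fit_small and fit_large
  )
)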

Remake

drake overlaps with its direct predecessor, remake. In fact, drake owes its core ideas to remake and Rich FitzJohn. Remake’s development repository lists several real-world applications. drake surpasses remake in several important ways, including but not limited to the following.

  1. High-performance computing. Remake has no native parallel computing support. drake, on the other hand, has a thorough selection of parallel computing technologies and scheduling algorithms. Thanks to future, future.batchtools, and batchtools, it is straightforward to configure a drake project for most popular job schedulers, such as SLURM, TORQUE, and the Grid Engine, as well as systems contained in Docker images (see the sketch after this list).
  2. A friendly interface. In remake, the user must manually write a YAML configuration file to arrange the steps of a workflow, which leads to some of the same scalability problems as Make. drake’s domain-specific language easily generates workflows at scale.
  3. Thorough documentation. drake has a thorough user manual, a reference website, a comprehensive README, examples in the help files of user-side functions, and accessible example code that users can download with drake::drake_example().
  4. Active maintenance. drake is actively developed and maintained, and issues are usually addressed promptly.
  5. Presence on CRAN. At the time of writing, drake is available on CRAN, but remake is not.
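
As a quick illustration of point 1, here is a sketch of one such backend, the future-based scheduler, assuming the future package is installed. Swapping future::multisession for a future.batchtools plan is how the same make() call can reach a cluster.

library(future)
future::plan(future::multisession)            # local background R sessions
make(plan, parallelism = "future", jobs = 2)  # `plan` is the drake plan from above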

Memoise

Memoization is the strategic caching of the return values of functions. Every time a memoized function is called with a new set of arguments, the return value is saved for future use. Later, whenever the same function is called with the same arguments, the previous return value is salvaged, and the function call is skipped to save time. The memoise package is an excellent implementation of memoization in R.
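
For concreteness, here is a minimal sketch of memoization with the memoise package:

library(memoise)
slow_square <- function(x) {
  Sys.sleep(1)  # pretend this is an expensive computation
  x^2
}
fast_square <- memoise(slow_square)
fast_square(2)  # takes about a second the first time
fast_square(2)  # returns instantly from the cache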

However, memoization does not go far enough. In reality, the return value of a function depends not only on the function body and the arguments, but also on any nested functions and global variables, the dependencies of those dependencies, and so on upstream. drake surpasses memoise because it uses the entire dependency network graph of a project to decide which pieces need to be rebuilt and which ones can be skipped.

Knitr

Much of the R community uses knitr for reproducible research. The idea is to intersperse code chunks in an R Markdown or *.Rnw file and then generate a dynamic report that weaves together code, output, and prose. Knitr is not designed to be a serious pipeline toolkit, and it should not be the primary computational engine for medium to large data analysis projects.

  1. Knitr scales far worse than Make or remake. The whole point is to consolidate output and prose, so it deliberately lacks the essential modularity.
  2. There is no obvious high-performance computing support.
  3. While there is a way to skip chunks that are already up to date (with code chunk options cache and autodep), this functionality is not the focus of knitr. It is deactivated by default, and remake and drake are more dependable ways to skip work that is already up to date.

drake was designed to manage the entire workflow with knitr reports as targets. The strategy is analogous for knitr reports within remake projects.
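
To make that concrete, here is a sketch of the kind of code chunk a report such as report.Rmd might contain. Because the plan declares the report with knitr_in(), drake scans it for loadd() and readd() calls and treats the loaded targets as dependencies of the report target.

# Inside a code chunk of report.Rmd:
library(drake)
readd(fit)   # the fitted model built by make()
loadd(hist)  # bring the histogram target into the report's session
hist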

Factual’s Drake

Factual’s Drake is similar in concept, but the development effort is completely unrelated to the drake R package.

Other pipeline toolkits

There are countless other successful pipeline toolkits. The drake package distinguishes itself with its R-focused approach, Tidyverse-friendly interface, and a thorough selection of parallel computing technologies and scheduling algorithms.

Acknowledgements

Special thanks to Jarad Niemi, my advisor from graduate school, for first introducing me to the idea of Makefiles for research. He originally set me down the path that led to drake.

Many thanks to Julia Lowndes, Ben Marwick, and Peter Slaughter for reviewing drake for rOpenSci, and to Maëlle Salmon for such active involvement as the editor. Thanks also to the following people for contributing early in development.

Credit for images is attributed here.


Functions in drake

Name Description
build_times List the time it took to build each target.
default_Makefile_args Deprecated
default_Makefile_command Deprecated
default_parallelism Deprecated
default_recipe_command Deprecated
drake_palette Deprecated. Show drake's color palette.
doc_of_function_call Defunct function
do_prework Do the prework in the prework argument to make().
drake_meta Deprecated. Compute the initial pre-build metadata of a target or import.
file_store Tell drake that you want information on a file (target or import), not an ordinary object.
clean_main_example Deprecated: clean the main example from drake_example("main")
cached List targets in the cache.
as_drake_filename Defunct function
configure_cache Deprecated. Configure the hash algorithms, etc. of a drake cache.
built Deprecated. List all the built targets (non-imports) in the cache.
dataframes_graph Defunct function
code_to_plan Turn an R script file or knitr / R Markdown report into a drake workflow plan data frame.
check Defunct function
cache_path Deprecated. Return the file path where the cache is stored, if applicable.
cache_namespaces Deprecated. List all the storr cache namespaces used by drake.
check_plan Deprecated. Check a workflow plan data frame for obvious errors.
clean Remove targets/imports from the cache.
dataset_wildcard Show the dataset wildcard used in plan_analyses() and plan_summaries().
find_cache Search up the file system for the nearest drake cache.
cleaned_namespaces Deprecated utility function
isolate_example Isolate the side effects of an example.
default_verbose Deprecated
dependency_profile Deprecated in favor of deps_profile().
cmq_build Build a target using the clustermq backend
knitr_deps Deprecated in favor of deps_knitr()
make Run your project (build the outdated targets).
clean_mtcars_example Clean the mtcars example from drake_example("mtcars")
make_imports deprecated
debug_and_run Run a function in debug mode.
drake_example Download and save the code and data files of an example drake-powered project.
default_short_hash_algo Deprecated. Return the default short hash algorithm for make().
drake_examples List the names of all the drake examples.
drake-package drake: A pipeline toolkit for reproducible computation at scale.
drake_batchtools_tmpl_file Deprecated. Get a template file for execution on a cluster.
manage_memory Manage in-memory targets
default_graph_title Return the default title for graph visualizations
default_long_hash_algo Deprecated. Return the default long hash algorithm for make().
drake_hpc_template_file Write a template file for deploying work to a cluster / job scheduler.
deps_code List the dependencies of a function or command
deps_knitr Find the drake dependencies of a dynamic knitr report target.
drake_hpc_template_files List the available example template files for deploying work to a cluster / job scheduler.
expand Defunct function
from_plan Defunct function.
expand_plan Deprecated: create replicates of targets.
drake_cache_log_file Deprecated. Generate a flat text log file to represent the state of the cache.
default_system2_args Defunct function
drake_plan Create a workflow plan data frame for the plan argument of make().
drake_config Create the internal runtime parameter list used by make().
config Defunct function
drake_gc Do garbage collection on the drake cache.
map_plan Deprecated: create a plan that maps a function to a grid of arguments.
future_build Task passed to individual futures in the "future" backend
deps_targets Deprecated.
predict_workers Predict the load balancing of the next call to make() for non-staged parallel backends.
diagnose Get diagnostic metadata on a target.
deprecate_wildcard Defunct function
drake_get_session_info Return the sessionInfo() of the last call to make().
drake_debug Run a single target's command in debug mode.
deps Defunct function
gather_plan Deprecated: write commands to combine several targets into one or more overarching targets.
deps_profile Find out why a target is out of date.
get_cache Get the default cache of a drake project.
drake_envir Get the environment where drake builds targets
process_import internal function
read_drake_config Deprecated
deps_target List the dependencies of a target
drake_cache_log Get a table that represents the state of the cache.
drake_build Build/process a single target or import.
drake_quotes Put quotes around each element of a character vector.
read_drake_graph Deprecated
read_drake_seed Read the pseudo-random number generator seed of the project.
drake_strings Turn valid expressions into character strings.
read_graph Defunct function
drake_plan_source Show the code required to produce a given workflow plan data frame
drake_session Deprecated. Return the sessionInfo() of the last call to make().
drake_ggraph Show a ggraph/ggplot2 representation of your drake project.
drake_unquote Remove leading and trailing escaped quotes from character strings.
this_cache Get the cache at the exact file path specified.
drake_graph_info Create the underlying node and edge data frames behind vis_drake_graph().
example_drake Defunct function
drake_tip Deprecated. Output a random tip about drake.
ignore Ignore components of commands and imported functions.
eager_load_target Load a target right away (internal function)
imported Deprecated. List all the imports in the drake cache.
examples_drake Defunct function
evaluate Defunct function
file_in Declare input files and directories.
evaluate_plan Deprecated: use wildcard templating to create a workflow plan data frame from a template data frame.
file_out Declare output files and directories.
expose_imports Expose all the imports in a package so make() can detect all the package's nested functions.
parallelism_choices Deprecated
find_knitr_doc Defunct function
tracked List the targets and imports that are reproducibly tracked.
plan Defunct function
find_project Deprecated. Search up the file system for the nearest root path of a drake project.
load_mtcars_example Load the mtcars example.
type_sum.expr_list Type summary printing
make_targets deprecated
plan_summaries Deprecated
long_hash Deprecated. drake now has just one hash algorithm per cache.
missed Report any import objects required by your drake_plan plan but missing from your workspace or file system.
in_progress Deprecated. List the targets that either (1) are currently being built during a make(), or (2) were being built if the last make() quit unexpectedly.
plan_to_code Turn a drake workflow plan data frame into a plain R script file.
rate_limiting_times Defunct function
is_function_call Defunct function
new_cache Make a new drake cache.
knitr_in Declare knitr/rmarkdown source files as dependencies.
failed List the targets that failed in the last call to make().
legend_nodes Create the nodes data frame used in the legend of the graph visualizations.
outdated List the targets that are out of date.
plan_analyses Deprecated.
parallel_stages Defunct function
use_drake Use drake in a project
make_with_config deprecated
read_config Defunct function
max_useful_jobs Defunct function
predict_load_balancing Deprecated in favor of predict_workers()
predict_runtime Predict the elapsed runtime of the next call to make() for non-staged parallel backends.
gather Defunct function
migrate_drake_project Defunct function
r_make Reproducible R session management for drake functions
plan_drake Defunct function
plan_to_notebook Turn a drake workflow plan data frame into an R notebook.
sankey_drake_graph Show a Sankey graph of your drake project.
r_recipe_wildcard deprecated
gather_by Deprecated: gather multiple groupings of targets
render_drake_graph Render a visualization using the data frames generated by drake_graph_info().
plot_graph Defunct function
progress Get the build progress of your targets during a make().
load_basic_example Defunct function
load_main_example Deprecated: load the main example.
render_graph Defunct function
rs_addin_loadd Loadd target at cursor into global environment
running List running targets.
recover_cache Deprecated. Load an existing drake files system cache if it exists or create a new one otherwise.
reduce_by Deprecated: reduce multiple groupings of targets
render_sankey_drake_graph Render a Sankey diagram from drake_graph_info().
read_drake_meta Defunct function
prune_drake_graph deprecated
target_namespaces Deprecated. For drake caches, list the storr cache namespaces that store target-level information.
text_drake_graph Use text art to show a visual representation of your workflow's dependency graph in your terminal window.
read_drake_plan Deprecated
session Defunct function
read_plan Defunct function
render_static_drake_graph Deprecated: render a ggraph/ggplot2 representation of your drake project.
reduce_plan Deprecated: write commands to reduce several targets down to one.
workplan Defunct function
summaries Defunct function
target Define custom columns in a drake_plan().
vis_drake_graph Show an interactive visual network representation of your drake project.
readd Read and return a drake target/import from the cache.
shell_file Deprecated
short_hash Deprecated. drake now only uses one hash algorithm per cache.
workflow Defunct function
render_drake_ggraph Render a static ggplot2/ggraph visualization from drake_graph_info() output.
trigger Customize the decision rules for rebuilding targets
triggers Deprecated. List the old drake triggers.
render_text_drake_graph Render a text-based visualization of drake's dependency graph using the data frames generated by drake_graph_info().
rescue_cache Try to repair a drake cache that is prone to throwing storr-related errors.
show_source Show how a target/import was produced.
static_drake_graph Deprecated: show a ggraph/ggplot2 representation of your drake project.
backend Defunct function
Makefile_recipe Deprecated
bind_plans Row-bind together drake plans
build_drake_graph Deprecated function build_drake_graph
analysis_wildcard Deprecated. Show the analysis wildcard used in plan_summaries().
as_file Defunct function
available_hash_algos Deprecated. List the available hash algorithms for drake caches.
analyses Defunct function
build_graph Defunct function

Vignettes of drake

Name
drake.Rmd
