The drake R package

Data analysis can be slow. A round of scientific computation can take several minutes, hours, or even days to complete. After it finishes, if you update your code or data, your hard-earned results may no longer be valid. How much of that valuable output can you keep, and how much do you need to update? How much runtime must you endure all over again?

For projects in R, the drake package can help. It analyzes your workflow, skips steps with up-to-date results, and orchestrates the rest with optional distributed computing. At the end, drake provides evidence that your results match the underlying code and data, which increases your ability to trust your research.

6-minute video

Visit the first page of the manual to watch a short introduction.

What gets done stays done.

Too many data science projects follow a Sisyphean loop:

  1. Launch the code.
  2. Wait while it runs.
  3. Discover an issue.
  4. Rerun from scratch.

Ordinarily, it is hard to avoid rerunning the code from scratch.

But with drake, you can automatically

  1. Launch the parts that changed since last time.
  2. Skip the rest.

How it works

To set up a project, load your packages,

library(drake)
library(dplyr)
library(ggplot2)
#> Registered S3 methods overwritten by 'ggplot2':
#>   method         from 
#>   [.quosures     rlang
#>   c.quosures     rlang
#>   print.quosures rlang

load your custom functions,

create_plot <- function(data) {
  ggplot(data, aes(x = Petal.Width, fill = Species)) +
    geom_histogram()
}

check any supporting files (optional),

# Get the files with drake_example("main").
file.exists("raw_data.xlsx")
#> [1] TRUE
file.exists("report.Rmd")
#> [1] TRUE

and plan what you are going to do.

plan <- drake_plan(
  raw_data = readxl::read_excel(file_in("raw_data.xlsx")),
  data = raw_data %>%
    mutate(Species = forcats::fct_inorder(Species)),
  hist = create_plot(data),
  fit = lm(Sepal.Width ~ Petal.Width + Species, data),
  report = rmarkdown::render(
    knitr_in("report.Rmd"),
    output_file = file_out("report.html"),
    quiet = TRUE
  )
)
plan
#> # A tibble: 5 x 2
#>   target   command                                                         
#>   <chr>    <expr>                                                          
#> 1 raw_data readxl::read_excel(file_in("raw_data.xlsx"))                   …
#> 2 data     raw_data %>% mutate(Species = forcats::fct_inorder(Species))   …
#> 3 hist     create_plot(data)                                              …
#> 4 fit      lm(Sepal.Width ~ Petal.Width + Species, data)                  …
#> 5 report   rmarkdown::render(knitr_in("report.Rmd"), output_file = file_ou…

So far, we have just been setting the stage. Use make() to do the real work. Targets are built in the correct order regardless of the row order of plan.

make(plan)
#> target raw_data
#> target data
#> target fit
#> target hist
#> target report

Except for files like report.html, your output is stored in a hidden .drake/ folder. Reading it back is easy.

readd(data) # See also loadd().
#> # A tibble: 150 x 5
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
#> 1          5.1         3.5          1.4         0.2 setosa 
#> 2          4.9         3            1.4         0.2 setosa 
#> 3          4.7         3.2          1.3         0.2 setosa 
#> 4          4.6         3.1          1.5         0.2 setosa 
#> 5          5           3.6          1.4         0.2 setosa 
#> # … with 145 more rows
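The difference between readd() and loadd() is easiest to see on a toy project. The following is a minimal sketch, not part of the iris example: the targets two and four are hypothetical, and the cache is placed in a temporary folder (an assumption made here so the sketch does not touch any existing .drake/ folder).

```r
library(drake)

# Assumption: a throwaway cache in a temporary folder instead of ./.drake/.
cache <- new_cache(tempfile())

# Hypothetical two-target plan, unrelated to the iris example.
plan <- drake_plan(
  two = 1 + 1,
  four = two * 2
)
make(plan, cache = cache)

# readd() returns the stored value; nothing is assigned for you.
value <- readd(four, cache = cache)
value

# loadd() assigns the target into the calling environment under its own name.
loadd(two, cache = cache)
two
```

Use readd() when you want a value inline in an expression, and loadd() when you want several targets available by name in your session.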

You may look back on your work and see room for improvement, but it’s all good! The whole point of drake is to help you go back and change things quickly and painlessly. For example, we forgot to give our histogram a bin width.

readd(hist)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

So let’s fix the plotting function.

create_plot <- function(data) {
  ggplot(data, aes(x = Petal.Width, fill = Species)) +
    geom_histogram(binwidth = 0.25) +
    theme_gray(20)
}

drake knows which results are affected.

config <- drake_config(plan)
vis_drake_graph(config) # Interactive graph: zoom, drag, etc.

The next make() just builds hist and report.html. No point in wasting time on the data or model.

make(plan)
#> target hist
#> target report
loadd(hist)
hist

Reproducibility with confidence

The R community emphasizes reproducibility. Traditional themes include scientific replicability, literate programming with knitr, and version control with git. But internal consistency is important too. Reproducibility carries the promise that your output matches the code and data you say you used. With the exception of non-default triggers and hasty mode, drake strives to keep this promise.

Evidence

Suppose you are reviewing someone else’s data analysis project for reproducibility. You scrutinize it carefully, checking that the datasets are available and the documentation is thorough. But could you re-create the results without the help of the original author? With drake, it is quick and easy to find out.

make(plan)
#> All targets are already up to date.

config <- drake_config(plan)
outdated(config)
#> character(0)

With everything already up to date, you have tangible evidence of reproducibility. Even though you did not re-create the results, you know the results are re-creatable. They faithfully show what the code is producing. Given the right package environment and system configuration, you have everything you need to reproduce all the output by yourself.

Ease

When it comes time to actually rerun the entire project, you have much more confidence. Starting over from scratch is trivially easy.

clean()       # Remove the original author's results.
make(plan) # Independently re-create the results from the code and input data.
#> target raw_data
#> target data
#> target fit
#> target hist
#> target report

Independent replication

With even more evidence and confidence, you can invest the time to independently replicate the original code base if necessary. Up until this point, you relied on basic drake functions such as make(), so you may not have needed to peek at any substantive author-defined code in advance. In that case, you can stay usefully ignorant as you reimplement the original author’s methodology. In other words, drake could potentially improve the integrity of independent replication.

Readability and transparency

Ideally, independent observers should be able to read your code and understand it. drake helps in several ways.

  • The workflow plan data frame explicitly outlines the steps of the analysis, and vis_drake_graph() visualizes how those steps depend on each other.
  • drake takes care of the parallel scheduling and high-performance computing (HPC) for you. That means the HPC code is no longer tangled up with the code that actually expresses your ideas.
  • You can generate large collections of targets without necessarily changing your code base of imported functions, another nice separation between the concepts and the execution of your workflow.

Aggressively scale up.

Not every project can complete in a single R session on your laptop. Some projects need more speed or computing power. Some require a few local processor cores, and some need large high-performance computing systems. But parallel computing is hard. Your tables and figures depend on your analysis results, and your analyses depend on your datasets, so some tasks must finish before others even begin. drake knows what to do. Parallelism is implicit and automatic. See the high-performance computing guide for all the details.

# Use the spare cores on your local machine.
make(plan, jobs = 4)

# Or scale up to a supercomputer.
drake_hpc_template_file("slurm_clustermq.tmpl") # https://slurm.schedmd.com/
options(
  clustermq.scheduler = "slurm",
  clustermq.template = "slurm_clustermq.tmpl"
)
make(plan, parallelism = "clustermq", jobs = 4)

Installation

You can choose among different versions of drake. The CRAN release often lags behind the online manual but may have fewer bugs.

# Install the latest stable release from CRAN.
install.packages("drake")

# Alternatively, install the development version from GitHub.
install.packages("devtools")
library(devtools)
install_github("ropensci/drake")

Function reference

The reference section lists all the available functions. Here are the most important ones.

  • drake_plan(): create a workflow data frame (like plan above).
  • make(): build your project.
  • r_make(): launch a fresh callr::r() process to build your project. Called from an interactive R session, r_make() is more reproducible than make().
  • loadd(): load one or more built targets into your R session.
  • readd(): read and return a built target.
  • drake_config(): create a master configuration list for other user-side functions.
  • vis_drake_graph(): show an interactive visual network representation of your workflow.
  • outdated(): see which targets will be built in the next make().
  • deps_code(): check the dependencies of a command or function.
  • failed(): list the targets that failed to build in the last make().
  • diagnose(): return the full context of a build, including errors, warnings, and messages.
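As a rough illustration of how failed() and diagnose() fit together, the sketch below builds a plan in which one target deliberately throws an error. The target names and the error message are made up for the example, and the temporary cache is an assumption to keep the sketch isolated from any real project.

```r
library(drake)

# Assumption: a throwaway cache in a temporary folder.
cache <- new_cache(tempfile())

# Hypothetical plan: one healthy target and one that always errors.
plan <- drake_plan(
  fine = 1 + 1,
  broken = stop("input file is malformed")
)

# make() stops with an error when a target fails, so wrap it in try().
result <- try(make(plan, cache = cache), silent = TRUE)

# Which targets failed in the last make()?
failed_targets <- failed(cache = cache)
failed_targets

# Retrieve the stored error object for the failed target.
error_message <- diagnose(broken, cache = cache)$error$message
error_message
```

The object returned by diagnose() also carries other metadata recorded during the build, so it is worth inspecting as a whole when debugging.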

Documentation

Use cases

The official rOpenSci use cases and associated discussion threads describe applications of drake in action. Here are some more applications of drake in real-world projects.

Help and troubleshooting

The following resources document many known issues and challenges.

If you are still having trouble, please submit a new issue with a bug report or feature request, along with a minimal reproducible example where appropriate.

The GitHub issue tracker is mainly intended for bug reports and feature requests. While questions about usage etc. are also highly encouraged, you may alternatively wish to post to Stack Overflow and use the drake-r-package tag.

Contributing

Development is a community effort, and we encourage participation. Please read CONTRIBUTING.md for details.

Similar work

GNU Make

The original idea of a time-saving reproducible build system extends back at least as far as GNU Make, which still aids the work of data scientists as well as the original user base of compiled language programmers. In fact, the name “drake” stands for “Data Frames in R for Make”. Make is used widely in reproducible research. Below are some examples from Karl Broman’s website.

There are several reasons for R users to prefer drake instead.

  • drake already has a Make-powered parallel backend. Just run make(..., parallelism = "Makefile", jobs = 2) to enjoy most of the original benefits of Make itself.
  • Improved scalability. With Make, you must write a potentially large and cumbersome Makefile by hand. But with drake, you can use wildcard templating to automatically generate massive collections of targets with minimal code.
  • Lower overhead for light-weight tasks. For each Make target that uses R, a brand new R session must spawn. For projects with thousands of small targets, that means more time may be spent loading R sessions than doing the actual work. With make(..., parallelism = "mclapply", jobs = 4), drake launches 4 persistent workers up front and efficiently processes the targets in R.
  • Convenient organization of output. With Make, the user must save each target as a file. drake saves all the results for you automatically in a storr cache so you do not have to micromanage the results.

Remake

drake overlaps with its direct predecessor, remake. In fact, drake owes its core ideas to remake and Rich FitzJohn. Remake’s development repository lists several real-world applications. drake surpasses remake in several important ways, including but not limited to the following.

  1. High-performance computing. Remake has no native parallel computing support. drake, on the other hand, has a thorough selection of parallel computing technologies and scheduling algorithms. Thanks to future, future.batchtools, and batchtools, it is straightforward to configure a drake project for most popular job schedulers, such as SLURM, TORQUE, and the Grid Engine, as well as systems contained in Docker images.
  2. A friendly interface. In remake, the user must manually write a YAML configuration file to arrange the steps of a workflow, which leads to some of the same scalability problems as Make. drake’s domain-specific language easily generates workflows at scale.
  3. Thorough documentation. drake has a thorough user manual, a reference website, a comprehensive README, examples in the help files of user-side functions, and accessible example code that users can write with drake::drake_example().
  4. Active maintenance. drake is actively developed and maintained, and issues are usually addressed promptly.
  5. Presence on CRAN. At the time of writing, drake is available on CRAN, but remake is not.
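To give a feel for how the domain-specific language generates targets at scale, here is a small sketch that uses target() with a map() transformation inside drake_plan(). The target names draws and avg and the sample sizes are hypothetical, and the exact naming of the generated targets may differ between drake versions.

```r
library(drake)

# Hypothetical plan: one simulated sample per size, then one mean per sample.
plan <- drake_plan(
  draws = target(
    rnorm(n),
    transform = map(n = c(10, 100, 1000))  # one target per value of n
  ),
  avg = target(
    mean(draws),
    transform = map(draws)                 # one target per draws_* target
  )
)

# Two short templates expand into six concrete targets.
plan
```

Adding another sample size means editing one vector, not writing two more rules by hand as you would in a Makefile or a YAML file.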

Memoise

Memoization is the strategic caching of the return values of functions. Every time a memoized function is called with a new set of arguments, the return value is saved for future use. Later, whenever the same function is called with the same arguments, the previous return value is salvaged, and the function call is skipped to save time. The memoise package is an excellent implementation of memoization in R.

However, memoization does not go far enough. In reality, the return value of a function depends not only on the function body and the arguments, but also on any nested functions and global variables, the dependencies of those dependencies, and so on upstream. drake surpasses memoise because it uses the entire dependency network graph of a project to decide which pieces need to be rebuilt and which ones can be skipped.
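The limitation is easy to reproduce in base R. The sketch below uses a tiny hand-rolled memoizer (an illustration only, not the memoise package's implementation) whose cache key is the argument alone, so it cannot notice when a global variable used by the function changes. The names memoize, scale_factor, scaled, and fast_scaled are hypothetical.

```r
# A minimal memoizer: caches return values keyed by the argument only.
memoize <- function(f) {
  cache <- new.env()
  function(x) {
    key <- as.character(x)
    if (!is.null(cache[[key]])) {
      return(cache[[key]])  # cache hit: skip the function call
    }
    value <- f(x)
    assign(key, value, envir = cache)
    value
  }
}

# The function depends on a global variable, not just its argument.
scale_factor <- 2
scaled <- function(x) x * scale_factor
fast_scaled <- memoize(scaled)

first <- fast_scaled(5)   # computed: 10

# Change the hidden dependency.
scale_factor <- 3

second <- fast_scaled(5)  # still 10: the cached value is now stale
fresh <- scaled(5)        # 15: what the answer should be
c(first = first, second = second, fresh = fresh)
```

A dependency-aware tool avoids this by tracking scale_factor as an input of scaled() and invalidating anything downstream when it changes.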

Knitr

Much of the R community uses knitr for reproducible research. The idea is to intersperse code chunks in an R Markdown or *.Rnw file and then generate a dynamic report that weaves together code, output, and prose. Knitr is not designed to be a serious pipeline toolkit, and it should not be the primary computational engine for medium to large data analysis projects.

  1. Knitr scales far worse than Make or remake. The whole point is to consolidate output and prose, so it deliberately lacks the essential modularity.
  2. There is no obvious high-performance computing support.
  3. While there is a way to skip chunks that are already up to date (with code chunk options cache and autodep), this functionality is not the focus of knitr. It is deactivated by default, and remake and drake are more dependable ways to skip work that is already up to date.

drake was designed to manage the entire workflow with knitr reports as targets. The strategy is analogous for knitr reports within remake projects.

Factual’s Drake

Factual’s Drake is similar in concept, but the development effort is completely unrelated to the drake R package.

Other pipeline toolkits

There are countless other successful pipeline toolkits. The drake package distinguishes itself with its R-focused approach, Tidyverse-friendly interface, and a thorough selection of parallel computing technologies and scheduling algorithms.

Acknowledgements

Special thanks to Jarad Niemi, my advisor from graduate school, for first introducing me to the idea of Makefiles for research. He originally set me down the path that led to drake.

Many thanks to Julia Lowndes, Ben Marwick, and Peter Slaughter for reviewing drake for rOpenSci, and to Maëlle Salmon for such active involvement as the editor. Thanks also to the following people for contributing early in development.

Credit for images is attributed here.

  • Install: install.packages('drake')
  • Monthly downloads: 1,174
  • Version: 7.3.0
  • License: GPL-3
  • Maintainer: William Landau
  • Last published: May 19th, 2019

Functions in drake (7.3.0)

  • build_times: List the time it took to build each target.
  • default_Makefile_args: Deprecated
  • default_Makefile_command: Deprecated
  • default_parallelism: Deprecated
  • default_recipe_command: Deprecated
  • drake_palette: Deprecated. Show drake's color palette.
  • doc_of_function_call: Defunct function
  • do_prework
  • drake_meta: Deprecated. Compute the initial pre-build metadata of a target or import.
  • file_store: Tell drake that you want information on a file (target or import), not an ordinary object.
  • clean_main_example: Deprecated: clean the main example from drake_example("main")
  • cached: List targets in the cache.
  • as_drake_filename: Defunct function
  • configure_cache: Deprecated. Configure the hash algorithms, etc. of a drake cache.
  • built: Deprecated. List all the built targets (non-imports) in the cache.
  • dataframes_graph: Defunct function
  • code_to_plan: Turn an R script file or knitr / R Markdown report into a drake workflow plan data frame.
  • check: Defunct function
  • cache_path: Deprecated. Return the file path where the cache is stored, if applicable.
  • cache_namespaces: Deprecated. List all the storr cache namespaces used by drake.
  • check_plan: Deprecated. Check a workflow plan data frame for obvious errors.
  • clean: Remove targets/imports from the cache.
  • dataset_wildcard
  • find_cache: Search up the file system for the nearest drake cache.
  • cleaned_namespaces: Deprecated utility function
  • isolate_example: Isolate the side effects of an example.
  • default_verbose: Deprecated
  • dependency_profile
  • cmq_build: Build a target using the clustermq backend
  • knitr_deps
  • make: Run your project (build the outdated targets).
  • clean_mtcars_example: Clean the mtcars example from drake_example("mtcars")
  • make_imports: Deprecated
  • debug_and_run: Run a function in debug mode.
  • drake_example: Download and save the code and data files of an example drake-powered project.
  • default_short_hash_algo: Deprecated. Return the default short hash algorithm for make().
  • drake_examples: List the names of all the drake examples.
  • drake-package: drake: A pipeline toolkit for reproducible computation at scale.
  • drake_batchtools_tmpl_file: Deprecated. Get a template file for execution on a cluster.
  • manage_memory: Manage in-memory targets
  • default_graph_title: Return the default title for graph visualizations
  • default_long_hash_algo: Deprecated. Return the default long hash algorithm for make().
  • drake_hpc_template_file: Write a template file for deploying work to a cluster / job scheduler.
  • deps_code: List the dependencies of a function or command
  • deps_knitr: Find the drake dependencies of a dynamic knitr report target.
  • drake_hpc_template_files: List the available example template files for deploying work to a cluster / job scheduler.
  • expand: Defunct function
  • from_plan: Defunct function.
  • expand_plan: Deprecated: create replicates of targets.
  • drake_cache_log_file: Deprecated. Generate a flat text log file to represent the state of the cache.
  • default_system2_args: Defunct function
  • drake_plan
  • drake_config
  • config: Defunct function
  • drake_gc: Do garbage collection on the drake cache.
  • map_plan: Deprecated: create a plan that maps a function to a grid of arguments.
  • future_build: Task passed to individual futures in the "future" backend
  • deps_targets: Deprecated.
  • predict_workers: Predict the load balancing of the next call to make() for non-staged parallel backends.
  • diagnose: Get diagnostic metadata on a target.
  • deprecate_wildcard: Defunct function
  • drake_get_session_info: Return the sessionInfo() of the last call to make().
  • drake_debug: Run a single target's command in debug mode.
  • deps: Defunct function
  • gather_plan: Deprecated: write commands to combine several targets into one or more overarching targets.
  • deps_profile: Find out why a target is out of date.
  • get_cache: Get the default cache of a drake project.
  • drake_envir: Get the environment where drake builds targets
  • process_import: Internal function
  • read_drake_config: Deprecated
  • deps_target: List the dependencies of a target
  • drake_cache_log: Get a table that represents the state of the cache.
  • drake_build: Build/process a single target or import.
  • drake_quotes: Put quotes around each element of a character vector.
  • read_drake_graph: Deprecated
  • read_drake_seed: Read the pseudo-random number generator seed of the project.
  • drake_strings: Turn valid expressions into character strings.
  • read_graph: Defunct function
  • drake_plan_source: Show the code required to produce a given workflow plan data frame
  • drake_session
  • drake_ggraph: Show a ggraph/ggplot2 representation of your drake project.
  • drake_unquote: Remove leading and trailing escaped quotes from character strings.
  • this_cache: Get the cache at the exact file path specified.
  • drake_graph_info
  • example_drake: Defunct function
  • drake_tip: Deprecated. Output a random tip about drake.
  • ignore: Ignore components of commands and imported functions.
  • eager_load_target: Load a target right away (internal function)
  • imported: Deprecated. List all the imports in the drake cache.
  • examples_drake: Defunct function
  • evaluate: Defunct function
  • file_in: Declare input files and directories.
  • evaluate_plan: Deprecated: use wildcard templating to create a workflow plan data frame from a template data frame.
  • file_out: Declare output files and directories.
  • expose_imports: Expose all the imports in a package so make() can detect all the package's nested functions.
  • parallelism_choices: Deprecated
  • find_knitr_doc: Defunct function
  • tracked: List the targets and imports that are reproducibly tracked.
  • plan: Defunct function
  • find_project: Deprecated. Search up the file system for the nearest root path of a drake project.
  • load_mtcars_example: Load the mtcars example.
  • type_sum.expr_list: Type summary printing
  • make_targets: Deprecated
  • plan_summaries: Deprecated
  • long_hash: Deprecated. drake now has just one hash algorithm per cache.
  • missed: Report any import objects required by your drake_plan plan but missing from your workspace or file system.
  • in_progress
  • plan_to_code: Turn a drake workflow plan data frame into a plain R script file.
  • rate_limiting_times: Defunct function
  • is_function_call: Defunct function
  • new_cache: Make a new drake cache.
  • knitr_in: Declare knitr/rmarkdown source files as dependencies.
  • failed
  • legend_nodes: Create the nodes data frame used in the legend of the graph visualizations.
  • outdated: List the targets that are out of date.
  • plan_analyses: Deprecated.
  • parallel_stages: Defunct function
  • use_drake: Use drake in a project
  • make_with_config: Deprecated
  • read_config: Defunct function
  • max_useful_jobs: Defunct function
  • predict_load_balancing
  • predict_runtime: Predict the elapsed runtime of the next call to make() for non-staged parallel backends.
  • gather: Defunct function
  • migrate_drake_project: Defunct function
  • r_make: Reproducible R session management for drake functions
  • plan_drake: Defunct function
  • plan_to_notebook: Turn a drake workflow plan data frame into an R notebook.
  • sankey_drake_graph: Show a Sankey graph of your drake project.
  • r_recipe_wildcard: Deprecated
  • gather_by: Deprecated: gather multiple groupings of targets
  • render_drake_graph
  • plot_graph: Defunct function
  • progress
  • load_basic_example: Defunct function
  • load_main_example: Deprecated: load the main example.
  • render_graph: Defunct function
  • rs_addin_loadd: Loadd target at cursor into global environment
  • running: List running targets.
  • recover_cache: Deprecated. Load an existing drake files system cache if it exists or create a new one otherwise.
  • reduce_by: Deprecated: reduce multiple groupings of targets
  • render_sankey_drake_graph
  • read_drake_meta: Defunct function
  • prune_drake_graph: Deprecated
  • target_namespaces: Deprecated. For drake caches, list the storr cache namespaces that store target-level information.
  • text_drake_graph: Use text art to show a visual representation of your workflow's dependency graph in your terminal window.
  • read_drake_plan: Deprecated
  • session: Defunct function
  • read_plan: Defunct function
  • render_static_drake_graph: Deprecated: render a ggraph/ggplot2 representation of your drake project.
  • reduce_plan: Deprecated: write commands to reduce several targets down to one.
  • workplan: Defunct function
  • summaries: Defunct function
  • target
  • vis_drake_graph: Show an interactive visual network representation of your drake project.
  • readd: Read and return a drake target/import from the cache.
  • shell_file: Deprecated
  • short_hash: Deprecated. drake now only uses one hash algorithm per cache.
  • workflow: Defunct function
  • render_drake_ggraph
  • trigger: Customize the decision rules for rebuilding targets
  • triggers: Deprecated. List the old drake triggers.
  • render_text_drake_graph
  • rescue_cache: Try to repair a drake cache that is prone to throwing storr-related errors.
  • show_source: Show how a target/import was produced.
  • static_drake_graph: Deprecated: show a ggraph/ggplot2 representation of your drake project.
  • backend: Defunct function
  • Makefile_recipe: Deprecated
  • bind_plans: Row-bind together drake plans
  • build_drake_graph: Deprecated function build_drake_graph
  • analysis_wildcard
  • as_file: Defunct function
  • available_hash_algos: Deprecated. List the available hash algorithms for drake caches.
  • analyses: Defunct function
  • build_graph: Defunct function