The drake R package

Data analysis can be slow. A round of scientific computation can take several minutes, hours, or even days to complete. After it finishes, if you update your code or data, your hard-earned results may no longer be valid. How much of that valuable output can you keep, and how much do you need to update? How much runtime must you endure all over again?

For projects in R, the drake package can help. It analyzes your workflow, skips steps with up-to-date results, and orchestrates the rest with optional distributed computing. At the end, drake provides evidence that your results match the underlying code and data, which increases your ability to trust your research.

6-minute video

Visit the first page of the manual to watch a short introduction.

What gets done stays done.

Too many data science projects follow a Sisyphean loop:

  1. Launch the code.
  2. Wait while it runs.
  3. Discover an issue.
  4. Rerun from scratch.

Ordinarily, it is hard to avoid rerunning the code from scratch.

But with drake, you can automatically

  1. Launch the parts that changed since last time.
  2. Skip the rest.

How it works

To set up a project, load your packages,

library(drake)
library(dplyr)
library(ggplot2)

load your custom functions,

create_plot <- function(data) {
  ggplot(data, aes(x = Petal.Width, fill = Species)) +
    geom_histogram()
}

check any supporting files (optional),

# Get the files with drake_example("main").
file.exists("raw_data.xlsx")
#> [1] TRUE
file.exists("report.Rmd")
#> [1] TRUE

and plan what you are going to do.

plan <- drake_plan(
  raw_data = readxl::read_excel(file_in("raw_data.xlsx")),
  data = raw_data %>%
    mutate(Species = forcats::fct_inorder(Species)),
  hist = create_plot(data),
  fit = lm(Sepal.Width ~ Petal.Width + Species, data),
  report = rmarkdown::render(
    knitr_in("report.Rmd"),
    output_file = file_out("report.html"),
    quiet = TRUE
  )
)
plan
#> # A tibble: 5 x 2
#>   target   command                                                         
#>   <chr>    <expr>                                                          
#> 1 raw_data readxl::read_excel(file_in("raw_data.xlsx"))                   …
#> 2 data     raw_data %>% mutate(Species = forcats::fct_inorder(Species))   …
#> 3 hist     create_plot(data)                                              …
#> 4 fit      lm(Sepal.Width ~ Petal.Width + Species, data)                  …
#> 5 report   rmarkdown::render(knitr_in("report.Rmd"), output_file = file_ou…

So far, we have just been setting the stage. Use make() to do the real work. Targets are built in the correct order regardless of the row order of plan.

make(plan)
#> target raw_data
#> target data
#> target fit
#> target hist
#> target report
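
The ordering guarantee above follows from the dependencies between targets. Here is a minimal base-R sketch of the idea, which is not drake's actual implementation: it finds the target names each command mentions, then repeatedly builds whatever has all its dependencies done. The toy commands and the helpers find_deps() and build_order() are hypothetical names made up for this illustration.

```r
# Hypothetical toy plan: target names mapped to unevaluated commands.
# The rows are deliberately out of order.
commands <- list(
  report = quote(summarize(fit, hist)),
  fit = quote(lm(y ~ x, data)),
  hist = quote(create_plot(data)),
  data = quote(clean(raw_data)),
  raw_data = quote(read_input())
)

# A target depends on every other target whose name appears in its command.
find_deps <- function(commands) {
  lapply(commands, function(cmd) intersect(all.vars(cmd), names(commands)))
}

# Simple topological sort: emit the targets whose dependencies are all done.
build_order <- function(commands) {
  deps <- find_deps(commands)
  done <- character(0)
  while (length(done) < length(deps)) {
    is_ready <- vapply(deps, function(d) all(d %in% done), logical(1))
    ready <- setdiff(names(deps)[is_ready], done)
    if (length(ready) == 0) stop("circular dependency detected")
    done <- c(done, sort(ready))
  }
  done
}

build_sequence <- build_order(commands)
print(build_sequence)
```

Targets that become ready in the same pass of the loop do not depend on each other, which is also why they could in principle run in parallel.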

Except for files like report.html, your output is stored in a hidden .drake/ folder. Reading it back is easy.

readd(data) # See also loadd().
#> # A tibble: 150 x 5
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
#> 1          5.1         3.5          1.4         0.2 setosa 
#> 2          4.9         3            1.4         0.2 setosa 
#> 3          4.7         3.2          1.3         0.2 setosa 
#> 4          4.6         3.1          1.5         0.2 setosa 
#> 5          5           3.6          1.4         0.2 setosa 
#> # … with 145 more rows
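
As a rough mental model of the cache, and not a description of drake's internals, each target can be thought of as a value saved to disk under its name and read back on demand. The sketch below uses only base R. The names cache_dir, store_target(), and read_target() are hypothetical, and the real .drake/ folder is a storr cache rather than a set of loose .rds files.

```r
# Hypothetical toy cache: one .rds file per target in a temporary folder.
cache_dir <- file.path(tempdir(), "toy_cache")
dir.create(cache_dir, showWarnings = FALSE)

# Save a target's value under its name.
store_target <- function(name, value) {
  saveRDS(value, file.path(cache_dir, paste0(name, ".rds")))
}

# Read a stored value back by name, loosely analogous to readd().
read_target <- function(name) {
  readRDS(file.path(cache_dir, paste0(name, ".rds")))
}

store_target("data", head(iris))
restored <- read_target("data")
```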

You may look back on your work and see room for improvement, but it's all good! The whole point of drake is to help you go back and change things quickly and painlessly. For example, we forgot to give our histogram a bin width.

readd(hist)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

So let's fix the plotting function.

create_plot <- function(data) {
  ggplot(data, aes(x = Petal.Width, fill = Species)) +
    geom_histogram(binwidth = 0.25) +
    theme_gray(20)
}

drake knows which results are affected.

config <- drake_config(plan)
vis_drake_graph(config) # Interactive graph: zoom, drag, etc.

The next make() just builds hist and report.html. No point in wasting time on the data or model.

make(plan)
#> target hist
#> target report
loadd(hist)
hist
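
To picture why only hist and the report were rebuilt, here is a simplified base-R sketch under stated assumptions. It detects that a function changed by comparing its deparsed text, then walks a hand-written map of downstream targets. The names fingerprint(), invalidated(), downstream, and plot_histogram() are hypothetical, and drake's real change detection is more thorough than a text comparison.

```r
# Fingerprint a function by its deparsed text (a stand-in for a real hash).
fingerprint <- function(f) paste(deparse(f), collapse = "\n")

# Two versions of the same function; plot_histogram() is never called here.
create_plot_v1 <- function(data) plot_histogram(data)
create_plot_v2 <- function(data) plot_histogram(data, binwidth = 0.25)

changed <- fingerprint(create_plot_v1) != fingerprint(create_plot_v2)

# Hypothetical map from each item to the targets that use it directly.
downstream <- list(
  create_plot = "hist",
  data = c("hist", "fit"),
  hist = "report",
  fit = "report",
  report = character(0)
)

# Breadth-first walk: collect everything reachable from the changed item.
invalidated <- function(start, downstream) {
  queue <- start
  seen <- character(0)
  while (length(queue) > 0) {
    node <- queue[1]
    queue <- queue[-1]
    for (child in downstream[[node]]) {
      if (!(child %in% seen)) {
        seen <- c(seen, child)
        queue <- c(queue, child)
      }
    }
  }
  seen
}

if (changed) print(invalidated("create_plot", downstream))
```

In this toy graph, data and fit are not downstream of create_plot, so they would be skipped.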

Reproducibility with confidence

The R community emphasizes reproducibility. Traditional themes include scientific replicability, literate programming with knitr, and version control with git. But internal consistency is important too. Reproducibility carries the promise that your output matches the code and data you say you used. With the exception of non-default triggers and hasty mode, drake strives to keep this promise.

Evidence

Suppose you are reviewing someone else's data analysis project for reproducibility. You scrutinize it carefully, checking that the datasets are available and the documentation is thorough. But could you re-create the results without the help of the original author? With drake, it is quick and easy to find out.

make(plan)
#> All targets are already up to date.

config <- drake_config(plan)
outdated(config)
#> character(0)

With everything already up to date, you have tangible evidence of reproducibility. Even though you did not re-create the results, you know the results are re-creatable. They faithfully show what the code is producing. Given the right package environment and system configuration, you have everything you need to reproduce all the output by yourself.
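
One way to think about this kind of evidence is as a fingerprint comparison: if the fingerprints recorded at build time still match the current code and data, the stored result corresponds to them. The base-R sketch below illustrates that idea only. The names snapshot() and is_current() are hypothetical, and drake's own bookkeeping is more involved.

```r
# A small input file and the command that reads it.
input <- tempfile(fileext = ".csv")
writeLines(c("x,y", "1,2"), input)
command <- quote(read.csv(input))

# Fingerprint the file contents and the command text.
snapshot <- function(file, command) {
  list(
    file = unname(tools::md5sum(file)),
    command = paste(deparse(command), collapse = "\n")
  )
}

# Recorded at build time, alongside the result.
stored <- snapshot(input, command)

# A result is current if nothing it depends on has changed since then.
is_current <- function(stored, file, command) {
  identical(stored, snapshot(file, command))
}

before <- is_current(stored, input, command)  # nothing has changed yet
writeLines(c("x,y", "1,3"), input)            # the data file is edited
after <- is_current(stored, input, command)   # the fingerprints now differ
```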

Ease

When it comes time to actually rerun the entire project, you have much more confidence. Starting over from scratch is trivially easy.

clean()       # Remove the original author's results.
make(plan) # Independently re-create the results from the code and input data.
#> target raw_data
#> target data
#> target fit
#> target hist
#> target report

Independent replication

With even more evidence and confidence, you can invest the time to independently replicate the original code base if necessary. Up until this point, you relied on basic drake functions such as make(), so you may not have needed to peek at any substantive author-defined code in advance. In that case, you can stay usefully ignorant as you reimplement the original author's methodology. In other words, drake could potentially improve the integrity of independent replication.

Readability and transparency

Ideally, independent observers should be able to read your code and understand it. drake helps in several ways.

  • The workflow plan data frame explicitly outlines the steps of the analysis, and vis_drake_graph() visualizes how those steps depend on each other.
  • drake takes care of the parallel scheduling and high-performance computing (HPC) for you. That means the HPC code is no longer tangled up with the code that actually expresses your ideas.
  • You can generate large collections of targets without necessarily changing your code base of imported functions, which is another nice separation between the concepts and the execution of your workflow.

Aggressively scale up.

Not every project can complete in a single R session on your laptop. Some projects need more speed or computing power. Some require a few local processor cores, and some need large high-performance computing systems. But parallel computing is hard. Your tables and figures depend on your analysis results, and your analyses depend on your datasets, so some tasks must finish before others even begin. drake knows what to do. Parallelism is implicit and automatic. See the high-performance computing guide for all the details.

# Use the spare cores on your local machine.
make(plan, jobs = 4)

# Or scale up to a supercomputer.
drake_batchtools_tmpl_file("slurm") # https://slurm.schedmd.com/
library(future.batchtools)
future::plan(batchtools_slurm, template = "batchtools.slurm.tmpl", workers = 100)
make(plan, parallelism = "future_lapply")

Installation

You can choose among different versions of drake. The CRAN release often lags behind the online manual but may have fewer bugs.

# Install the latest stable release from CRAN.
install.packages("drake")

# Alternatively, install the development version from GitHub.
install.packages("devtools")
library(devtools)
install_github("ropensci/drake")

A few technical details:

  • You must properly install drake using install.packages(), devtools::install_github(), or similar. It is not enough to use devtools::load_all(), particularly for the parallel computing functionality, in which multiple R sessions initialize and then try to require(drake).
  • For make(parallelism = "Makefile"), Windows users may need to download and install Rtools.
  • To use make(parallelism = "future") or make(parallelism = "future_lapply") to deploy your work to a computing cluster (see the high-performance computing guide), you will need the future.batchtools package.

Documentation

The main resources to learn drake are the user manual and the reference website. Others are below.

Cheat sheet

Thanks to Kirill for preparing a drake cheat sheet for the workshop.

Frequently asked questions

The FAQ page is an index of links to appropriately labeled issues on GitHub. To contribute, please submit a new issue and ask that it be labeled as a frequently asked question.

Function reference

The reference section lists all the available functions. Here are the most important ones.

  • drake_plan(): create a workflow data frame (like my_plan).
  • make(): build your project.
  • r_make(): launch a fresh callr::r() process to build your project. Called from an interactive R session, r_make() is more reproducible than make().
  • loadd(): load one or more built targets into your R session.
  • readd(): read and return a built target.
  • drake_config(): create a master configuration list for other user-side functions.
  • vis_drake_graph(): show an interactive visual network representation of your workflow.
  • outdated(): see which targets will be built in the next make().
  • deps_code(): check the dependencies of a command or function.
  • failed(): list the targets that failed to build in the last make().
  • diagnose(): return the full context of a build, including errors, warnings, and messages.

Tutorials

Thanks to Kirill for constructing two interactive learnr tutorials: one supporting drake itself, and a prerequisite walkthrough of the cooking package.

Examples

Here are some real-world applications of drake in the wild.

There are also multiple drake-powered example projects available here, ranging from beginner-friendly stubs to demonstrations of high-performance computing. You can generate the files for a project with drake_example() (e.g. drake_example("gsp")), and you can list the available projects with drake_examples(). You can contribute your own example project with a fork and pull request.

Presentations

Context and history

For context and history, check out this post on the rOpenSci blog and episode 22 of the R Podcast.

Help and troubleshooting

The following resources document many known issues and challenges.

If you are still having trouble, please submit a new issue with a bug report or feature request, along with a minimal reproducible example where appropriate.

The GitHub issue tracker is mainly intended for bug reports and feature requests. While questions about usage etc. are also highly encouraged, you may alternatively wish to post to Stack Overflow and use the drake-r-package tag.

Contributing

Development is a community effort, and we encourage participation. Please read CONTRIBUTING.md for details.

Similar work

GNU Make

The original idea of a time-saving reproducible build system extends back at least as far as GNU Make, which still aids the work of data scientists as well as the original user base of compiled-language programmers. In fact, the name "drake" stands for "Data Frames in R for Make". Make is used widely in reproducible research. Below are some examples from Karl Broman's website.

There are several reasons for R users to prefer drake instead.

  • drake already has a Make-powered parallel backend. Just run make(..., parallelism = "Makefile", jobs = 2) to enjoy most of the original benefits of Make itself.
  • Improved scalability. With Make, you must write a potentially large and cumbersome Makefile by hand. But with drake, you can use wildcard templating to automatically generate massive collections of targets with minimal code.
  • Lower overhead for light-weight tasks. For each Make target that uses R, a brand new R session must spawn. For projects with thousands of small targets, that means more time may be spent loading R sessions than doing the actual work. With make(..., parallelism = "mclapply", jobs = 4), drake launches 4 persistent workers up front and efficiently processes the targets in R.
  • Convenient organization of output. With Make, the user must save each target as a file. drake saves all the results for you automatically in a storr cache so you do not have to micromanage the results.
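
The wildcard templating mentioned in the list above can be pictured as plain text substitution. The base-R sketch below is only an illustration of the concept. The helper expand_wildcard() and the placeholder dataset__ are hypothetical, and it does not reproduce the behavior of drake's own templating functions, which operate on workflow plan data frames.

```r
# Hypothetical template: one command containing a placeholder, dataset__.
template <- c(fit = "lm(y ~ x, data = dataset__)")
values <- c("small", "large")

# Create one target per value by substituting it for the wildcard.
expand_wildcard <- function(template, wildcard, values) {
  out <- character(0)
  for (target in names(template)) {
    commands <- vapply(
      values,
      function(v) gsub(wildcard, v, template[[target]], fixed = TRUE),
      character(1)
    )
    names(commands) <- paste(target, values, sep = "_")
    out <- c(out, commands)
  }
  out
}

plan_text <- expand_wildcard(template, "dataset__", values)
print(plan_text)
```

Adding a third dataset would mean adding one value rather than writing another rule by hand, which is the scalability point made above.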

Remake

drake overlaps with its direct predecessor, remake. In fact, drake owes its core ideas to remake and Rich FitzJohn. Remake's development repository lists several real-world applications. drake surpasses remake in several important ways, including but not limited to the following.

  1. High-performance computing. Remake has no native parallel computing support. drake, on the other hand, has a thorough selection of parallel computing technologies and scheduling algorithms. Thanks to future, future.batchtools, and batchtools, it is straightforward to configure a drake project for most popular job schedulers, such as SLURM, TORQUE, and the Grid Engine, as well as systems contained in Docker images.
  2. A friendly interface. In remake, the user must manually write a YAML configuration file to arrange the steps of a workflow, which leads to some of the same scalability problems as Make. drake's data-frame-based interface and wildcard templating functionality easily generate workflows at scale.
  3. Thorough documentation. drake contains a thorough user manual, a reference website, a comprehensive README, examples in the help files of user-side functions, and accessible example code that users can write with drake::drake_example().
  4. Active maintenance. drake is actively developed and maintained, and issues are usually addressed promptly.
  5. Presence on CRAN. At the time of writing, drake is available on CRAN, but remake is not.

Memoise

Memoization is the strategic caching of the return values of functions. Every time a memoized function is called with a new set of arguments, the return value is saved for future use. Later, whenever the same function is called with the same arguments, the previous return value is salvaged, and the function call is skipped to save time. The memoise package is an excellent implementation of memoization in R.

However, memoization does not go far enough. In reality, the return value of a function depends not only on the function body and the arguments, but also on any nested functions and global variables, the dependencies of those dependencies, and so on upstream. drake surpasses memoise because it uses the entire dependency network graph of a project to decide which pieces need to be rebuilt and which ones can be skipped.
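
The limitation described above can be seen in a small base-R sketch. The memoize() helper here is a hand-rolled toy written for this illustration and is not the memoise package API. It caches on the argument alone, so it cannot notice when a global variable used inside the function changes.

```r
# A tiny hand-rolled memoizer keyed on the argument only (illustrative).
memoize <- function(f) {
  cache <- new.env()
  function(x) {
    key <- as.character(x)
    if (!exists(key, envir = cache, inherits = FALSE)) {
      assign(key, f(x), envir = cache)
    }
    get(key, envir = cache, inherits = FALSE)
  }
}

scale_factor <- 2
rescale <- function(x) x * scale_factor  # depends on a global variable
fast_rescale <- memoize(rescale)

first <- fast_rescale(10)   # computed with scale_factor = 2
scale_factor <- 3           # the hidden dependency changes
second <- fast_rescale(10)  # cache hit: the stale value is returned
fresh <- rescale(10)        # the value the current code actually produces
```

Here second no longer matches fresh, which is the kind of silent staleness that tracking the full dependency graph is meant to prevent.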

Knitr

Much of the R community uses knitr for reproducible research. The idea is to intersperse code chunks in an R Markdown or *.Rnw file and then generate a dynamic report that weaves together code, output, and prose. Knitr is not designed to be a serious pipeline toolkit, and it should not be the primary computational engine for medium to large data analysis projects.

  1. Knitr scales far worse than Make or remake. The whole point is to consolidate output and prose, so it deliberately lacks the essential modularity.
  2. There is no obvious high-performance computing support.
  3. While there is a way to skip chunks that are already up to date (with code chunk options cache and autodep), this functionality is not the focus of knitr. It is deactivated by default, and remake and drake are more dependable ways to skip work that is already up to date.

drake was designed to manage the entire workflow with knitr reports as targets. The strategy is analogous for knitr reports within remake projects.

Factual's Drake

Factual's Drake is similar in concept, but the development effort is completely unrelated to the drake R package.

Other pipeline toolkits

There are countless other successful pipeline toolkits. The drake package distinguishes itself with its R-focused approach, Tidyverse-friendly interface, and a thorough selection of parallel computing technologies and scheduling algorithms.

Acknowledgements

Special thanks to Jarad Niemi, my advisor from graduate school, for first introducing me to the idea of Makefiles for research. He originally set me down the path that led to drake.

Many thanks to Julia Lowndes, Ben Marwick, and Peter Slaughter for reviewing drake for rOpenSci, and to Maëlle Salmon for such active involvement as the editor. Thanks also to the following people for contributing early in development.

Credit for images is attributed here.

Install: install.packages('drake')
Version: 7.1.0
License: GPL-3
Maintainer: William Landau
Last published: April 7th, 2019
Monthly downloads: 1,599

Functions in drake (7.1.0)

  • Makefile_recipe: Deprecated.
  • cache_namespaces: Deprecated. List all the storr cache namespaces used by drake.
  • code_to_plan: Turn an R script file or knitr / R Markdown report into a drake workflow plan data frame.
  • cached: List targets in the cache.
  • cache_path: Deprecated. Return the file path where the cache is stored, if applicable.
  • check: Defunct function.
  • config: Defunct function.
  • dataset_wildcard
  • debug_and_run: Run a function in debug mode.
  • build_drake_graph: Deprecated function build_drake_graph.
  • clean_mtcars_example: Clean the mtcars example from drake_example("mtcars").
  • build_graph: Defunct function.
  • clean_main_example: Deprecated. Clean the main example from drake_example("main").
  • default_graph_title: Return the default title for graph visualizations.
  • configure_cache: Deprecated. Configure the hash algorithms, etc. of a drake cache.
  • dataframes_graph: Defunct function.
  • default_short_hash_algo: Deprecated. Return the default short hash algorithm for make().
  • default_verbose: Deprecated.
  • default_long_hash_algo: Deprecated. Return the default long hash algorithm for make().
  • default_system2_args: Defunct function.
  • available_hash_algos: Deprecated. List the available hash algorithms for drake caches.
  • drake_example: Download and save the code and data files of an example drake-powered project.
  • as_file: Defunct function.
  • deps_code: List the dependencies of a function or command.
  • drake_examples: List the names of all the drake examples.
  • backend: Defunct function.
  • deps_knitr: Find the drake dependencies of a dynamic knitr report target.
  • drake_debug: Run a single target's command in debug mode.
  • drake-package: drake: A pipeline toolkit for reproducible computation at scale.
  • dependency_profile
  • drake_batchtools_tmpl_file: Deprecated. Get a template file for execution on a cluster.
  • check_plan: Deprecated. Check a workflow plan data frame for obvious errors.
  • drake_build: Build/process a single target or import.
  • clean: Remove targets/imports from the cache.
  • drake_cache_log: Get a table that represents the state of the cache.
  • bind_plans: Row-bind together drake plans.
  • drake_envir: Get the environment where drake builds targets.
  • default_Makefile_args: Deprecated.
  • deps_targets: Deprecated.
  • drake_strings: Turn valid expressions into character strings.
  • drake_tip: Deprecated. Output a random tip about drake.
  • diagnose: Get diagnostic metadata on a target.
  • drake_gc: Do garbage collection on the drake cache.
  • file_in: Declare input files and directories.
  • default_Makefile_command: Deprecated.
  • file_out: Declare output files and directories.
  • drake_ggraph: Show a ggraph/ggplot2 representation of your drake project.
  • drake_quotes: Put quotes around each element of a character vector.
  • knitr_deps
  • drake_get_session_info: Return the sessionInfo() of the last call to make().
  • deprecate_wildcard: Defunct function.
  • deps: Deprecated. List the dependencies of a function, workflow plan command, or knitr report source file.
  • do_prework
  • drake_plan
  • knitr_in: Declare knitr/rmarkdown source files as dependencies.
  • doc_of_function_call: Defunct function.
  • legend_nodes: Create the nodes data frame used in the legend of the graph visualizations.
  • drake_meta: Deprecated. Compute the initial pre-build metadata of a target or import.
  • drake_graph_info
  • drake_plan_source: Show the code required to produce a given workflow plan data frame.
  • evaluate: Defunct function.
  • example_drake: Defunct function.
  • load_basic_example: Defunct function.
  • parallel_stages: Defunct function.
  • examples_drake: Defunct function.
  • drake_palette: Deprecated. Show drake's color palette.
  • parallelism_choices: Deprecated.
  • plot_graph: Defunct function.
  • drake_session
  • from_plan: Defunct function.
  • drake_unquote: Remove leading and trailing escaped quotes from character strings.
  • expand: Defunct function.
  • future_build: Task passed to individual futures in the "future" backend.
  • predict_load_balancing
  • prune_drake_graph: Deprecated.
  • evaluate_plan: Use wildcard templating to create a workflow plan data frame from a template data frame.
  • expose_imports: Expose all the imports in a package so make() can detect all the package's nested functions.
  • failed
  • r_make: Experimental: reproducible R session management for drake functions.
  • gather_plan: Write commands to combine several targets into one or more overarching targets.
  • eager_load_target: Load a target right away (internal function).
  • read_graph: Defunct function.
  • ignore: Ignore components of commands and imported functions.
  • read_plan: Defunct function.
  • get_cache: Get the default cache of a drake project.
  • imported: Deprecated. List all the imports in the drake cache.
  • make_with_config: Deprecated.
  • render_static_drake_graph: Deprecated. Render a ggraph/ggplot2 representation of your drake project.
  • manage_memory: Manage in-memory targets.
  • plan: Defunct function.
  • expand_plan: Create replicates of targets.
  • load_main_example: Deprecated. Load the main example.
  • rescue_cache: Try to repair a drake cache that is prone to throwing storr-related errors.
  • find_cache: Search up the file system for the nearest drake cache.
  • make_imports: Deprecated.
  • gather: Defunct function.
  • file_store: Tell drake that you want information on a file (target or import), not an ordinary object.
  • plan_analyses: Deprecated.
  • gather_by: Gather multiple groupings of targets.
  • process_import: Internal function.
  • migrate_drake_project: Deprecated. Reconfigure an old project (built with drake <= 4.4.0) to be compatible with later versions of drake.
  • progress
  • missed: Report any import objects required by your drake_plan plan but missing from your workspace or file system.
  • read_drake_graph: Deprecated.
  • predict_runtime: Predict the elapsed runtime of the next call to make() for non-staged parallel backends.
  • read_drake_meta: Defunct function.
  • make_targets: Deprecated.
  • plan_drake: Defunct function.
  • predict_workers: Predict the load balancing of the next call to make() for non-staged parallel backends.
  • render_drake_ggraph
  • plan_summaries: Deprecated.
  • render_drake_graph
  • show_source: Show how a target/import was produced.
  • r_recipe_wildcard: Deprecated.
  • render_graph: Defunct function.
  • static_drake_graph: Deprecated. Show a ggraph/ggplot2 representation of your drake project.
  • load_mtcars_example: Load the mtcars example.
  • new_cache: Make a new drake cache.
  • outdated: List the targets that are out of date.
  • read_config: Defunct function.
  • analysis_wildcard
  • read_drake_config: Deprecated.
  • rs_addin_loadd: Loadd target at cursor into global environment.
  • render_sankey_drake_graph
  • trigger: Customize the decision rules for rebuilding targets.
  • triggers: Deprecated. List the old drake triggers.
  • running: List running targets.
  • rate_limiting_times: Defunct function.
  • vis_drake_graph: Show an interactive visual network representation of your drake project.
  • use_drake: Use drake in a project.
  • type_sum.expr_list: Type summary printing.
  • as_drake_filename: Defunct function.
  • build_times: List the time it took to build each target.
  • workflow: Defunct function.
  • readd: Read and return a drake target/import from the cache.
  • built: Deprecated. List all the built targets (non-imports) in the cache.
  • recover_cache: Deprecated. Load an existing drake files system cache if it exists or create a new one otherwise.
  • this_cache: Get the cache at the exact file path specified.
  • shell_file: Deprecated.
  • tracked: List the targets and imports that are reproducibly tracked.
  • short_hash: Deprecated. drake now only uses one hash algorithm per cache.
  • cmq_build: Build a target using the clustermq backend.
  • cleaned_namespaces: Deprecated utility function.
  • workplan: Defunct function.
  • default_parallelism: Deprecated.
  • deps_profile: Find out why a target is out of date.
  • default_recipe_command: Deprecated.
  • target_namespaces: Deprecated. For drake caches, list the storr cache namespaces that store target-level information.
  • test_with_dir: Run a unit test in a way that quarantines the side effects from your workspace and file system.
  • deps_target: List the dependencies of a target.
  • drake_cache_log_file: Deprecated. Generate a flat text log file to represent the state of the cache.
  • drake_hpc_template_file: Write a template file for deploying work to a cluster / job scheduler.
  • drake_config
  • drake_hpc_template_files: List the available example template files for deploying work to a cluster / job scheduler.
  • find_knitr_doc: Defunct function.
  • find_project: Deprecated. Search up the file system for the nearest root path of a drake project.
  • in_progress
  • is_function_call: Defunct function.
  • long_hash: Deprecated. drake now has just one hash algorithm per cache.
  • make: Run your project (build the outdated targets).
  • map_plan: Create a plan that maps a function to a grid of arguments.
  • max_useful_jobs: Deprecated function.
  • plan_to_code: Turn a drake workflow plan data frame into a plain R script file.
  • plan_to_notebook: Turn a drake workflow plan data frame into an R notebook.
  • read_drake_plan: Deprecated.
  • read_drake_seed: Read the pseudo-random number generator seed of the project.
  • reduce_by: Reduce multiple groupings of targets.
  • reduce_plan: Write commands to reduce several targets down to one.
  • sankey_drake_graph: Show a Sankey graph of your drake project.
  • session: Defunct function.
  • summaries: Defunct function.
  • target
  • analyses: Defunct function.