dtrackr: Track your Data Pipelines

Overview

Accurate documentation of a data pipeline is a first step to reproducibility, and a flow chart describing the steps taken to prepare data is a useful part of this documentation. In analyses that rely on data that is frequently updated, documenting a data flow by copying and pasting row counts into flowcharts in PowerPoint quickly becomes tedious. With interactive data analysis, and particularly using RMarkdown, code execution sometimes happens in a non-linear fashion, and this can lead to confusion at best and erroneous analysis at worst. Basing such documentation on what the code does when executed sequentially can be inaccurate when the data has been analysed interactively.

The goal of dtrackr is to take away this pain by instrumenting and monitoring a dataframe through a dplyr pipeline, creating a step-by-step summary of the important parts of the wrangling as it actually happened to the dataframe, and storing that summary in the dataframe's own metadata. This metadata can be used to generate documentation as a flowchart, and allows both a quick overview of the data and a visual check of the actual data processing.
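To make the idea of a history stored in the dataframe metadata concrete, here is a minimal sketch. It assumes dplyr and dtrackr are installed, and uses the built-in iris data purely for illustration; the object name tracked is hypothetical.

```r
library(dplyr)
library(dtrackr)

# Start tracking, apply an ordinary dplyr step, and add a note.
tracked = iris %>%
  track() %>%
  filter(Species != "setosa") %>%
  comment("setosa removed")

# The result still behaves like a regular data frame ...
nrow(tracked)

# ... and the recorded steps travel with it.
history(tracked)
```

Because the history is attached to the data rather than inferred from the script, it reflects the steps that were actually run, in the order they were run.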

Installation

In general use dtrackr is expected to be installed alongside the tidyverse set of packages. It is recommended to install tidyverse first.

Binary packages of dtrackr are available on CRAN and r-universe for macOS and Windows. dtrackr can be installed from source on Linux. dtrackr has been tested on R versions 3.6, 4.0, 4.1 and 4.2.

You can install the released version of dtrackr from CRAN with:

install.packages("dtrackr")

System dependencies for installation from source

For installation from source on Linux, dtrackr has transitive dependencies on a few system libraries. These can be installed with the following commands:

# Ubuntu 20.04 and other Debian-based distributions:
sudo apt-get install libcurl4-openssl-dev libssl-dev librsvg2-dev \
  libicu-dev libnode-dev libpng-dev libjpeg-dev libpoppler-cpp-dev

# CentOS 8
sudo dnf install libcurl-devel openssl-devel librsvg2-devel \
  libicu-devel libpng-devel libjpeg-turbo-devel poppler-devel

# For other Linux distributions I suggest using the R pak package:
# install.packages("pak")
# pak::pkg_system_requirements("dtrackr")

# N.B. There are additional suggested R package dependencies on 
# the `tidyverse` and `rstudioapi` packages which have a longer set of dependencies. 
# We suggest you install them individually first if required.

Alternative versions of dtrackr

Early release versions are available on the r-universe. These will typically be more up to date than CRAN.

# Enable repository from terminological
options(repos = c(
  terminological = 'https://terminological.r-universe.dev',
  CRAN = 'https://cloud.r-project.org'))
# Download and install dtrackr in R
install.packages('dtrackr')

The unstable development version is available from GitHub with:

# install.packages("devtools")
devtools::install_github("terminological/dtrackr")

Example usage

Suppose we are constructing a data set with our initial input being the iris data. Our analysis depends on some cutOff parameter and we want to prepare a stratified data set that excludes flowers with narrow petals, and those with the biggest petals of each Species. With dtrackr we can mix regular dplyr commands with additional dtrackr commands such as comment and status, and with enhanced implementations of dplyr::filter called exclude_all and include_any.

# load dplyr (for %>% and group_by) and dtrackr
library(dplyr)
library(dtrackr)

# a pipeline parameter
cutOff = 3

# the pipeline
dataset = iris %>% 
  track() %>%
  status() %>%
  group_by(Species) %>%
  status(
    short = p_count_if(Sepal.Width<cutOff), 
    long= p_count_if(Sepal.Width>=cutOff), 
    .messages=c("consisting of {short} short sepal <{cutOff}","and {long} long sepal >={cutOff}")
  )  %>%
  exclude_all(
    Petal.Width<0.3 ~ "excluding {.excluded} with narrow petals",
    Petal.Width == max(Petal.Width) ~ "and {.excluded} outlier"
  ) %>%
  comment("test message") %>%
  status(.messages = "{.count} of type {Species}") %>%
  ungroup() %>%
  status(.messages = "{.count} together with cutOff {cutOff}") 
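The pipeline above uses exclude_all only. As a rough sketch of its counterpart, include_any keeps rows that match at least one of the listed criteria. The thresholds and the object name kept below are illustrative assumptions, not part of the example analysis.

```r
library(dplyr)
library(dtrackr)

# Keep only flowers whose petals are unusually long or unusually short.
# Each formula pairs a condition with a message for the history graph.
kept = iris %>%
  track() %>%
  group_by(Species) %>%
  include_any(
    Petal.Length > 5.8 ~ "{.included} long ones",
    Petal.Length < 1.3 ~ "{.included} short ones"
  ) %>%
  ungroup()

# Number of rows that met at least one criterion
nrow(kept)

# The inclusion step is recorded alongside the other steps
history(kept)
```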

Having prepared our dataset we conduct our analysis, and want to write it up and prepare it for submission. A visual summary is a key part of documenting the data pipeline, and for biomedical journals or clinical trials it is often a requirement.

dataset %>% flowchart()

And your publication-ready data pipeline, with any assumptions you care to document, is created in a format of your choice (as long as that choice is one of pdf, png, svg or ps), ready for submission to Nature.
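As a hedged sketch of writing the flowchart to disk, the following assumes that flowchart() accepts a filename (given without an extension) and a formats vector, and that the system libraries needed for rendering are installed. The output location and the object name tracked are illustrative.

```r
library(dplyr)
library(dtrackr)

# A small tracked pipeline to draw
tracked = iris %>%
  track() %>%
  group_by(Species) %>%
  status(.messages = "{.count} of type {Species}") %>%
  ungroup()

# Assumed arguments: `filename` is the output path without extension,
# `formats` lists the file types to produce.
out = file.path(tempdir(), "iris-pipeline")
tracked %>% flowchart(filename = out, formats = c("pdf", "svg"))
```

Writing to a temporary directory keeps the sketch free of side effects; in a real project the path would point into the manuscript or report folder.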

This is a trivial example, but the more complex the pipeline, the bigger the benefit you will get.
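In a more complex pipeline it can also help to see which rows were dropped, not only how many. A minimal sketch, assuming capture_exclusions() is switched on before the exclusion step and excluded() is called at the end; the object name dropped is hypothetical.

```r
library(dplyr)
library(dtrackr)

# Record the excluded rows themselves, not only their counts
dropped = iris %>%
  track() %>%
  capture_exclusions() %>%
  exclude_all(
    Petal.Width < 0.3 ~ "excluding {.excluded} with narrow petals"
  ) %>%
  excluded()

# Inspect the rows that were removed
head(dropped)
```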

Check out the main documentation for more details, and in particular the getting started vignette.

Version: 0.4.4
License: MIT + file LICENSE
Maintainer: Robert Challen
Last published: September 4th, 2023
Monthly downloads: 253

Functions in dtrackr (0.4.4)

distinct.trackr_df: Distinct values of data
bind_rows: Set operations
bind_cols: Set operations
add_count.trackr_df: dplyr modifying operations
anti_join.trackr_df: Anti join
arrange.trackr_df: dplyr modifying operations
add_tally: dplyr modifying operations
comment: Add a generic comment to the dtrackr history graph
group_modify.trackr_df: Group-wise modification of data and complex operations
filter.trackr_df: Filtering data
dot2svg: Convert Graphviz dot content to a SVG
exclude_all: Exclude all items matching one or more criteria
flowchart: Flowchart output
history: Get the dtrackr history graph
group_by.trackr_df: Stratifying your analysis
excluded: Get the dtrackr excluded data record
full_join.trackr_df: Full join
include_any: Include any items matching a criteria
p_add_tally: dplyr modifying operations
inner_join.trackr_df: Inner joins
left_join.trackr_df: Left join
nest_join.trackr_df: Nest join
mutate.trackr_df: dplyr modifying operations
p_add_count: dplyr modifying operations
p_arrange: dplyr modifying operations
p_bind_cols: Set operations
intersect.trackr_df: Set operations
p_anti_join: Anti join
p_capture_exclusions: Start capturing exclusions on a tracked dataframe
p_bind_rows: Set operations
p_excluded: Get the dtrackr excluded data record
p_exclude_all: Exclude all items matching one or more criteria
p_copy: Copy the dtrackr history graph from one dataframe to another
p_count_if: Simple count_if dplyr summary function
p_comment: Add a generic comment to the dtrackr history graph
p_clear: Clear the dtrackr history graph
p_inner_join: Inner joins
p_group_modify: Group-wise modification of data and complex operations
p_include_any: Include any items matching a criteria
p_intersect: Set operations
p_full_join: Full join
p_get: Get the dtrackr history graph
p_filter: Filtering data
p_distinct: Distinct values of data
p_get_as_dot: DOT output
p_group_by: Stratifying your analysis
p_count_subgroup: Add a subgroup count to the dtrackr history graph
p_flowchart: Flowchart output
p_pivot_longer: Reshaping data using tidyr::pivot_longer
p_left_join: Left join
p_rename: dplyr modifying operations
p_nest_join: Nest join
p_rename_with: dplyr modifying operations
p_relocate: dplyr modifying operations
p_reframe: Summarise a data set
p_pivot_wider: Reshaping data using tidyr::pivot_wider
p_pause: Pause tracking the data frame
p_mutate: dplyr modifying operations
p_resume: Resume tracking the data frame
p_slice: Slice operations
p_slice_max: Slice operations
p_set: Set the dtrackr history graph
p_setdiff: Set operations
p_right_join: Right join
p_slice_min: Slice operations
p_semi_join: Semi join
p_select: dplyr modifying operations
p_slice_head: Slice operations
p_slice_sample: Slice operations
p_tagged: Retrieve tagged data in the history graph
p_transmute: dplyr modifying operations
p_union_all: Set operations
p_track: Start tracking the dtrackr history graph
p_union: Set operations
p_status: Add a summary to the dtrackr history graph
p_ungroup: Remove a stratification from a data set
p_summarise: Summarise a data set
p_slice_tail: Slice operations
p_untrack: Remove tracking from the dataframe
pause: Pause tracking the data frame
print.trackr_graph: Print a history graph to the console
reexports: Objects exported from other packages
pivot_wider.trackr_df: Reshaping data using tidyr::pivot_wider
plot.trackr_graph: Plots a history graph as html
reframe.trackr_df: Summarise a data set
relocate.trackr_df: dplyr modifying operations
pivot_longer.trackr_df: Reshaping data using tidyr::pivot_longer
%>%: Pipe operator
resume: Resume tracking the data frame
semi_join.trackr_df: Semi join
rename.trackr_df: dplyr modifying operations
rename_with.trackr_df: dplyr modifying operations
right_join.trackr_df: Right join
setdiff.trackr_df: Set operations
slice.trackr_df: Slice operations
track: Start tracking the dtrackr history graph
std_size: Standard paper sizes
select.trackr_df: dplyr modifying operations
save_dot: Save DOT content to a file
slice_head.trackr_df: Slice operations
slice_sample.trackr_df: Slice operations
transmute.trackr_df: dplyr modifying operations
tagged: Retrieve tagged data in the history graph
slice_min.trackr_df: Slice operations
summarise.trackr_df: Summarise a data set
slice_tail.trackr_df: Slice operations
status: Add a summary to the dtrackr history graph
slice_max.trackr_df: Slice operations
untrack: Remove tracking from the dataframe
union.trackr_df: Set operations
union_all.trackr_df: Set operations
ungroup.trackr_df: Remove a stratification from a data set
capture_exclusions: Start capturing exclusions on a tracked dataframe
count_subgroup: Add a subgroup count to the dtrackr history graph