Learn R Programming

RDataTracker (version 2.25.0)

InitSave: Creating Provenance Graphs with RDataTracker

Description

These functions are needed to create provenance graphs.

Usage

ddg.init(r.script.path = NULL, ddgdir = NULL, overwrite = TRUE, enable.console = TRUE, max.snapshot.size = 100) ddg.save(r.script.path = NULL, save.debug=FALSE, quit=FALSE) ddg.run(r.script.path = NULL, ddgdir = NULL, overwrite = TRUE, f = NULL, enable.console = TRUE, annotate.inside = TRUE, first.loop = 1, max.loops = 1, max.snapshot.size = 10, debug = FALSE, save.debug=FALSE, display = FALSE) ddg.set.detail(detail.level) ddg.get.detail() ddg.clear.detail() ddg.annotate.inside() ddg.max.loops() ddg.max.snapshot.size() ddg.annotate.on(fnames = NULL) ddg.annotate.off(fnames = NULL) ddg.console.on() ddg.console.off() ddg.flush.ddg(ddg.path = NULL) ddg.display() ddg.forloop(index.var) ddg.first.loop() ddg.loop.count(loop.num) ddg.loop.count.inc(loop.num) ddg.reset.loop.count(loop.num) ddg.loop.annotate() ddg.loop.annotate.on() ddg.loop.annotate.off() ddg.details.omitted()

Arguments

r.script.path
the full path to the file containing the R script that is being executed.
ddgdir
the directory where the DDG should be saved.
overwrite
Defaults is TRUE, if FALSE, adds timestamp to ddg directory to prevent overwriting
enable.console
If TRUE, any commands executed in the console, either by typing, copying and pasting, or selecting and running, will result in a procedure node created in the provenance graph, with data nodes created for each variable assigned and data flow edges for variables used and set.
annotate.inside
specifies whether automatic annotation of functions and control constructs should be enabled.
first.loop
The number of the first iteration to be annotated in a for, while, or repeat loop.
max.loops
The maximum number of times that a for, while, or repeat loop will be annotated. If max.loops is -1, there is no limit. If max.loops = 0, no loops will be annotated.
max.snapshot.size
The maximum size for objects that should be output in snapshot files. If max.snapshot.size is -1, there is no limit. If max.snapshot.size is 0, snapshot nodes are created but no snapshot files are saved.
debug
If TRUE, enable script debugging.
save.debug
If TRUE, save debug files to debug directory.
display
If TRUE, display the DDG when the R script completes.
f
A function to run. Data provenance is collected within the function.
detail.level
An integer indicating the level of provenance detail to be collected.
fnames
A list of one or more function names.
quit
If TRUE, all DDG files are removed from memory.
ddg.path
The path to the DDG directory which needs to be flushed.
index.var
The index variable passed to ddg.forloop.
loop.num
The loop number passed to ddg.loop.count and ddg.reset.loop.count.

Details

In order to use RDataTracker to collect data provenance, the user must either call ddg.init at the beginning of execution and ddg.save at the end, or the user must call ddg.run. When using ddg.init, it is possible to call ddg.save multiple times. Each call will save the current provenance graph in a file, overwriting the previous version that was saved.

ddg.init initializes the provenance graph. If r.script.path is not NULL, the R script is copied into the DDG directory, becoming a permanent part of the provenance record. If ddgdir is NULL, the provenance graph will be saved in a subdirectory called "ddg_[script name]" in the script's directory. Further changes can be made to the save directory by specifying overwrite = FALSE, which will add a timestamp to the directory to prevent overwriting, i.e. "ddg_script.name_2016-06-09T15.41.02EDT" ddg.save writes the provenance graph along with additional provenance information to a file named ddg.txt. The extra information includes the platform and operating system, the R version, the name of the R script, and a timestamp of the execution. ddg.save can be called multiple times for a single call to ddg.init, where each call will extend the previous provenance graph, overwriting the file containing the previous version. When the final save procedure is wanted, the parameter "quit" can be set to TRUE, causing all temporary files in memory to be cleared out. While not strictly necessary, this prevents issues when creating multiple DDGs from the same script. If save.debug is set to TRUE, debug files are saved to the debug directory. ddg.run provides a short cut for ddg.init...ddg.save. It initializes the provenance graph, calls the script or function provided as a parameter, and then saves the provenance graph. If a script is provided, the script is sourced using ddg.source (see ddg.source), a DDG is created for the script, and a copy of the script is saved with the DDG. If a function is provided, the function is executed with calls to ddg.init and ddg.save so that provenance for the function is captured. In either case, if an R error occurs during execution, the error will be captured in an Exception node in the provenance graph. If annotate.inside is set to TRUE, provenance is collected for statements inside functions and inside control constructs (if, for, while, repeat, and simple block). ddg.annotate.on and ddg.annotate.off may be used to limit the functions that will be annotated or not annotated, respectively. The parameter max.loops sets the maxiumum number of times that a for, while or repeat loop will be annotated. If break is set to TRUE, script debugging is enabled. This has the same effect as inserting ddg.breakpoint() at the top of the script. If display is set to TRUE, the DDG is displayed after the R script finished executing. If save.debug is set to TRUE, debug files are saved to the debug directory. ddg.set.detail can be used to set the level of provenance detail to be collected. Options include: 0 = no internal annotation, no snapshots. 1 = 1 loop, snapshots < 10k. 2 = 10 loops, snapshots < 100k. 3 = all loops, all snapshots. If ddg.detail is not set, the values of annotate.internal, max.loops, and max.snapshot.size passed to ddg.run are used instead. The current level of detail is returned by ddg.get.detail and reset to NULL by ddg.clear.detail. ddg.annotate.inside returns the current value of annotate.inside. ddg.max.loops returns the current value of max.loops. ddg.max.snapshot.size returns the current value of max.snapshot.size. ddg.console.on and ddg.console.off toggle the console parameter for DDG creation. When the console is enabled, all commands sent to the R console are captured as provenance by the RDataTracker library. These functions allow for an intermixing of automatic and more detailed manual annotations of a script. Note that a call to ddg.console.off will create a collapsible console node with data provenence of the previous console session, while ddg.console.on will initiate a console session. No action is performed if the console is already in the desired state.

ddg.flush.ddg removes all files from the DDG directory specified unless the DDG directory is also the working directory, in which case it does nothing. If no DDG directory is specified, the current DDG directory (if any) is assumed. The last DDG created (if any) can be displayed with ddg.display. This function starts DDG Explorer and loads the most recent ddg.txt file (if any). If the DDG path is not set or a ddg.txt file is not available, the function returns "DDG not available". ddg.forloop, ddg.first.loop, ddg.loop.count, ddg.loop.count.inc, ddg.reset.loop.count, ddg.loop.annotate, ddg.loop.annotate.on, ddg.loop.annotate.off, and ddg.details.omitted are used internally by RDataTracker--do not use.

Examples

Run this code
dir.create("ddg", showWarnings=FALSE)
ddg.init()
ddg.save()
myfunc <- function() {
  a <- 1
}
ddg.run(f = myfunc)

Run the code above in your browser using DataLab