Untrackable variables -- reserved names
Only ordinary variables can be tracked -- variables that are active
bindings cannot be tracked.
Several variable names are reserved and cannot be tracked:
.trackingEnv, .trackingFileMap, .trackingUnsaved,
.trackingSummary, .trackingSummaryChanged,
.trackingOptions. Additionally, any variable with a newline
character ("\n") as part of its name cannot be tracked (the main
reason for this is that the mapping from object names to file names is
stored in a text file, and a newline character delimits each name).
The file map
The mapping from object names to file names is stored in the file
fileMap.txt. This data is stored as an ordinary text file to make
it easy for users to see the object-file mappings outside of R.
Implementation considerations
The reason that objects must be explicitly registered for tracking is
that there is currently no way of setting up a function to be called
when a new object is created, so new objects are always created as
ordinary R objects. Similarly, the R remove() function does not
have any hooks, so if remove() is called on a tracked variable,
it will just remove the active binding in the visible environment, but
will not disturb the underlying tracking environment. The
track.remove() function will completely remove a tracked variable
from the visible environment and the underlying tracking environment
(including deleting any associated disk file).
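The behavior of remove() on an active binding can be reproduced with base R alone. The following is a minimal sketch, not the track package's own code: the environments store and visible are hypothetical stand-ins for the tracking environment and the visible environment.

```r
# Hypothetical stand-in for the underlying tracking environment
store <- new.env()
assign("x", 1:3, envir = store)

# Hypothetical stand-in for the visible environment
visible <- new.env()

# Active binding that reads from and writes to the backing store
makeActiveBinding("x", function(v) {
    if (missing(v)) get("x", envir = store)
    else assign("x", v, envir = store)
}, visible)

# rm() has no hook, so it removes only the active binding itself
rm("x", envir = visible)

in.visible <- exists("x", envir = visible, inherits = FALSE)
in.store <- exists("x", envir = store, inherits = FALSE)
```

After the call to rm(), the binding is gone from visible, while the value is still present in store. This is the situation that track.remove() is described as cleaning up.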
Object tracking was intended to be used in situations where large
numbers of large objects must be manipulated. Consequently, there is a
good chance of exhausting resources while using the track
package. The track code tries to check return codes when
creating objects or writing files, and in cases where it is unable to
complete an operation it tries to leave the tracking environment in a state
from which objects can be salvaged. The functions
track.rebuild() and track.flush() are provided to
help recover from situations where resource limitations prevented
successful operation. Note that files are generally written in an
"unsafe" manner (i.e., existing files can be overwritten with partial
new files), but in these cases the data is retained in memory and can be
rewritten after resolving file system problems.
The R function exists() should be used with care on tracked
objects, because it will actually fetch the object, possibly needing to
read it from disk. In the track code, the exists("x")
function is not used to check the existence of a possibly tracked object
x; instead, an idiom like is.element("x", objects(all=TRUE))
is used.
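The name-based idiom can be wrapped in a small helper. This is only a sketch: name.exists is a hypothetical function, not part of the track package, and the active binding with a counter is there only to show that listing names does not call the binding function.

```r
# Hypothetical helper: test for a name without fetching the object's value
name.exists <- function(name, envir = parent.frame()) {
    is.element(name, objects(envir = envir, all.names = TRUE))
}

env <- new.env()
fetches <- 0

# Active binding that counts how often its value is fetched
makeActiveBinding("x", function() {
    fetches <<- fetches + 1
    "value"
}, env)

found <- name.exists("x", env)    # name is present
absent <- name.exists("y", env)   # name is not present
```

Because objects() only lists names, the counter fetches is still zero after both checks.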
These statements about the available facilities in R were true as of
R-2.4.1 (released Dec 2006).
The rules for how variable names are mapped to file names are based on
trying to use filenames that will work properly on all three operating
systems R works on (Linux, Windows, and Mac OS X). A somewhat obscure
point that must be taken into account is the case-insensitivity of Mac
OS X and Windows. Even though modern versions of these operating systems
appear to respect case in their file names, that is because they are case
preserving; they are in fact still case insensitive. This means that a file
created with the name "X.rda" is the same file as "x.rda".
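Whether a given directory sits on a case-insensitive file system can also be probed from R. The function below is a hypothetical sketch, not part of the track package: it creates a file with an upper-case name and then asks whether the lower-case spelling of the same name exists.

```r
# Hypothetical probe: TRUE if `dir` is on a case-insensitive file system
fs.case.insensitive <- function(dir = tempdir()) {
    # Random base name, so the lower-case file is very unlikely to pre-exist
    base <- basename(tempfile(pattern = "casetest"))
    upper <- file.path(dir, toupper(base))
    lower <- file.path(dir, tolower(base))
    file.create(upper)
    on.exit(unlink(upper))
    # Only the upper-case file was created, so the lower-case name can
    # exist only if the file system ignores case
    file.exists(lower)
}

insensitive <- fs.case.insensitive()
```

The result depends on the file system that holds the directory, not only on the operating system.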
Here is a short shell transcript showing this behavior in
a bash shell running under
Windows and Mac OS X (it's the same in both).
$ echo 123 > X
$ cat x
123
$ echo 456 > x
$ cat x
456
$ cat X
456
Thus, in order to work on all of these operating systems, file mapping must
be used to create different filenames for the R objects "x" and "X" (which
are in fact different in R).
Portability
Tracking directories are intended to be operating-system independent
and completely portable across different operating systems.
Compression
Saved R objects are compressed by default in R and by the track
package. Decompression speed is
very important for interactive response when using track, because each
time an object is accessed, it is read from its file (unless the
object is cached). Of the compression algorithms available as of
R-2.12.0, which are gzip, bzip2, and xz, gzip is the winner in terms
of speed. The default compression level in R for gzip is 6, but level
1 gives faster compression with slightly larger files (though
decompression is not faster). The lzop compression algorithm
(http://www.lzop.org) is faster still, but it is not yet available in R.
Here are some comparisons and benchmarks of various compression
programs:
- http://www.linuxjournal.com/node/8051/print
- http://tukaani.org/lzma/benchmarks.html
- http://stephane.lesimple.fr/wiki/blog/lzop_vs_compress_vs_gzip_vs_bzip2_vs_lzma_vs_lzma2-xz_benchmark_reloaded
- http://aliver.wordpress.com/2010/06/22/huge-unix-file-compresser-shootout-with-tons-of-datagraphs
- http://www.maximumcompression.com/index.html
- http://mattmahoney.net/dc/text.html
Compression/decompression is nicely handled in R: only the call to
save() has arguments for compression. Decompression in
load() is handled automatically using a standard code (magic)
at the start of the saved file. Saved files can also be compressed
or decompressed outside of R, and load() will still handle
them correctly, provided the compression used is one of the types
that R knows about.
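The following sketch illustrates the point using base R only. The data and file names are arbitrary, and restore is a hypothetical helper. The same object is saved with two different compression settings, and load() is called the same way for both files.

```r
x <- rnorm(10000)                     # arbitrary test data
f.gz <- tempfile(fileext = ".rda")
f.xz <- tempfile(fileext = ".rda")

# Compression is chosen only on the save() side
save(x, file = f.gz, compress = "gzip", compression_level = 1)
save(x, file = f.xz, compress = "xz")

# Hypothetical helper: load a file into a scratch environment, return x
restore <- function(file) {
    e <- new.env()
    load(file, envir = e)             # no compression argument needed
    get("x", envir = e)
}

same.gz <- identical(restore(f.gz), x)
same.xz <- identical(restore(f.xz), x)
sizes <- file.size(c(f.gz, f.xz))     # file sizes in bytes, for comparison
unlink(c(f.gz, f.xz))
```

Wrapping the calls in system.time() gives a rough view of the speed differences discussed above. Timings will vary with the data and the machine.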