memory-efficient storage of large atomic vectors and arrays on
disk and fast access functions
Description
The ff package provides atomic data structures that are
stored on disk but behave (almost) as if they were in RAM by
transparently mapping only a section (of size pagesize) into main
memory; this section is the effective virtual memory consumption
per ff object. ff
supports atomic data types 'double', 'logical', 'raw' and
'integer' and close-to-atomic types 'factor', 'POSIXct' and
custom close-to-atomic types. ff now has native C-support for
vectors, matrices and arrays with flexible dimorder (column-major
order, row-major order and generalizations for arrays).
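
As a rough illustration of flexible dimorder, the sketch below creates the same 3 x 4 matrix twice, once in the default column-major layout and once in row-major layout. It assumes the ff package is installed; the object names a and b are only illustrative.

```r
library(ff)  # assumes the ff package is installed

# A 3 x 4 integer matrix stored in the default column-major order.
a <- ff(1:12, dim = c(3, 4))

# The same values, but physically stored in row-major order.
b <- ff(1:12, dim = c(3, 4), dimorder = c(2, 1))

dimorder(a)  # physical storage order of the dimensions of a
dimorder(b)  # physical storage order of the dimensions of b

# Matrix indexing addresses the logical matrix, so both objects
# are expected to return the same values despite the different layout.
all(a[, ] == b[, ])
```

Which layout is preferable depends on the dominant access pattern: reading whole rows tends to favour row-major storage, reading whole columns the default.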
ff objects store raw data in binary flat files in native
encoding, and complement this with metadata stored in R as
physical and virtual attributes. ff objects have well-defined
hybrid copying semantics, which gives rise to certain
performance improvements through virtualization. ff objects can
be stored and reopened across R sessions. ff files can be
shared by multiple ff R objects (using different data
en/de-coding schemes) in the same process or from multiple R
processes to exploit parallelism. A wide choice of finalizer
options allows working with 'permanent' files as well as
creating and removing 'temporary' ff files completely
transparently to the user. On certain OS/filesystem combinations,
creating the ff files works without noticeable delay thanks to
sparse file allocation. Several access optimization techniques such as
Hybrid Index Preprocessing and Virtualization are implemented
to achieve good performance even with large datasets, for
example virtual matrix transpose without touching a single byte
on disk. Further, to reduce disk I/O, the atomic data is
stored natively and compactly in binary flat files, e.g. logicals
take up exactly 2 bits to represent TRUE, FALSE and NA. Beyond
basic access functions, the ff package also provides
compatibility functions that facilitate writing code for both ff
and ram objects, as well as support for batch processing on ff
objects (e.g. as.ram, as.ff, ffapply). NOTE: A professional extension
is available from the authors, which integrates additional
high-performance features neatly into the ff package. The
extension allows efficient handling of symmetric matrices and
supports more packed data types: boolean (1 bit), quad (2-bit
unsigned), nibble (4-bit unsigned), byte (1-byte signed with
NAs), ubyte (1-byte unsigned), short (2-byte signed with NAs),
ushort (2-byte unsigned), single (4-byte float with NAs). For
example 'quad' allows efficient storage of genomic data as an
'A','T','G','C' factor. The unsigned types support 'circular'
arithmetic.
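
The basic features described above (not the professional extension) can be sketched roughly as follows. This is a minimal example that assumes the ff package is installed; the object names and the batch size of 250 are arbitrary choices for illustration.

```r
library(ff)  # assumes the ff package is installed

# Create a disk-backed double vector; without a filename argument
# the backing file is a temporary one.
x <- ff(vmode = "double", length = 1000)

# Fill it in batches; i1 and i2 are the batch bounds provided by ffvecapply.
ffvecapply(x[i1:i2] <- sqrt(i1:i2), X = x, BATCHSIZE = 250)

vmode(x)     # storage mode of the ff object
filename(x)  # path of the backing flat file

# Copy the data into RAM, and move an in-memory matrix to disk.
r <- as.ram(x)
m <- as.ff(matrix(1:6, nrow = 2))

# Virtual transpose: only metadata changes, the file is not rewritten.
tm <- vt(m)
dim(tm)

# A logical ff vector can hold TRUE, FALSE and NA.
l <- ff(c(TRUE, FALSE, NA))
l[]
```

Because no filenames are given, all files here are temporary and are left to the default finalizer; a permanent file would be requested by passing a filename to ff().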