ff (version 4.0.12)

ff: ff classes for representing (large) atomic data

Description

The ff package provides atomic data structures that are stored on disk but behave (almost) as if they were in RAM by mapping only a section (pagesize) into main memory (the effective main memory consumption per ff object). Several access optimization techniques such as Hybrid Index Preprocessing (as.hi, update.ff) and Virtualization (virtual, vt, vw) are implemented to achieve good performance even with large datasets. In addition to the basic access functions, the ff package also provides compatibility functions that facilitate writing code for ff and ram objects (clone, as.ff, as.ram) and very basic support for operating on ff objects (ffapply).

While the (possibly packed) raw data is stored in a flat file, meta information about the atomic data structure, such as its dimension, virtual storage mode (vmode), factor level encoding and internal length, is stored as an ordinary R object (external pointer plus attributes) and can be saved in the workspace. The raw flat file data encoding is always in native machine format for optimal performance and provides several packing schemes for different data types such as logical, raw, integer and double (an extended version supports more tightly packed virtual data types). Flat files can be shared among ff objects in the same R process, or even across different R processes, thanks to memory mapping, although the caching effects have not been tested extensively.

Please do read and understand the limitations and warnings in LimWarn before you do anything serious with package ff.
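
As a minimal sketch of the behaviour described above (assuming the ff package is installed; the object names and the vector length are illustrative only), the following creates a disk-backed double vector, writes to a slice of it, and reads a small part back into RAM:

```r
library(ff)

# A double vector of one million elements, stored in a flat file on disk.
x <- ff(vmode = "double", length = 1e6)

# Write a small slice; the data goes to the memory-mapped file.
x[1:5] <- c(1.5, 2.5, 3.5, 4.5, 5.5)

# Subscripting returns an ordinary R (ram) vector.
head_values <- x[1:5]
head_values

# Location and size of the backing file.
filename(x)
file.size(filename(x))

# Remove the object; its finalizer runs at the next garbage collection.
rm(x); gc()
```

Only the requested elements are copied into R memory, so the same pattern can be used for vectors that would not fit into RAM as a whole.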

Usage

ff( initdata  = NULL
, length      = NULL
, levels      = NULL
, ordered     = NULL
, dim         = NULL
, dimorder    = NULL
, bydim       = NULL
, symmetric   = FALSE
, fixdiag     = NULL
, names       = NULL
, dimnames    = NULL
, ramclass    = NULL
, ramattribs  = NULL
, vmode       = NULL
, update      = NULL
, pattern     = NULL
, filename    = NULL
, overwrite   = FALSE
, readonly    = FALSE
, pagesize    = NULL  # getOption("ffpagesize")
, caching     = NULL  # getOption("ffcaching")
, finalizer   = NULL
, finonexit   = NULL  # getOption("fffinonexit")
, FF_RETURN   = TRUE
, BATCHSIZE   = .Machine$integer.max
, BATCHBYTES  = getOption("ffbatchbytes")
, VERBOSE     = FALSE
)

Value

If (!FF_RETURN) then a ram object like those generated by vector, matrix, array but with attributes 'vmode', 'physical' and 'virtual' accessible via vmode, physical and virtual


If (FF_RETURN) an object of class 'ff' which is a list with two components:

physical

an external pointer of class 'ff_pointer' which carries attributes with copy by reference semantics: changing a physical attribute of a copy changes the original

virtual

an empty list which carries attributes with copy by value semantics: changing a virtual attribute of a copy does not change the original
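
The difference between the two components can be sketched as follows. This is an illustrative example (assuming the ff package is installed) that follows the pattern of copying an ff object and then changing a virtual attribute of the copy; the variable names are arbitrary:

```r
library(ff)

x <- ff(1:12, dim = c(3, 4))
y <- x                      # copies the R-side object, not the file

# Both objects refer to the same file, so data written through y
# is visible through x.
y[1, 1] <- 100L
x[1, 1]

# Virtual attributes are copied by value: reshaping y leaves x unchanged.
dim(y) <- c(4, 3)
dim(x)
dim(y)

# Accessors for the two components.
names(physical(x))
names(virtual(y))

rm(x, y); gc()
```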

Arguments

initdata

scalar or vector of the .vimplemented vmodes, recycled if needed, default 0, see also as.vmode and vector.vmode

length

optional vector length of the object (default: derive from 'initdata' or 'dim'), see length.ff

levels

optional character vector of factor levels (in this case initdata must be composed of these) (default: derive from initdata)

ordered

indicate whether the levels are ordered (TRUE) or non-ordered factor (FALSE, default)

dim

optional array dim, see dim.ff and array

dimorder

physical layout (default seq_along(dim)), see dimorder and aperm

bydim

dimorder by which to interpret the 'initdata', generalization of the 'byrow' parameter in matrix

symmetric

extended feature: TRUE creates symmetric matrix (default FALSE)

fixdiag

extended feature: non-NULL scalar requires fixed diagonal for symmetric matrix (default NULL is free diagonal)

names

NOT taken from initdata, see names

dimnames

NOT taken from initdata, see dimnames

ramclass

class attribute attached when moving all or parts of this ff into ram, see ramclass

ramattribs

additional attributes attached when moving all or parts of this ff into ram, see ramattribs

vmode

virtual storage mode (default: derive from 'initdata'), see vmode and as.vmode

update

set to FALSE to avoid updating with 'initdata' (default TRUE) (used by ffdf)

pattern

root pattern with or without path for automatic ff filename creation (default NULL translates to "ff"), see also argument 'filename'

filename

ff filename with or without path (default tmpfile with 'pattern' prefix); without path the file is created in getOption("fftempdir"), with path '.' the file is created in getwd. Note that files created in getOption("fftempdir") have default finalizer "delete" while other files have default finalizer "close". See also arguments 'pattern' and 'finalizer' and physical

overwrite

set to TRUE to allow overwriting existing files (default FALSE)

readonly

set to TRUE to forbid writing to existing files

pagesize

pagesize in bytes for the memory mapping (default from getOption("ffpagesize") initialized by getdefaultpagesize), see also physical

caching

caching scheme for the backend, currently 'mmnoflush' or 'mmeachflush' (flush mmpages at each swap, default from getOption("ffcaching") initialized with 'mmeachflush'), see also physical

finalizer

name of finalizer function called when ff object is removed (default: ff files created in getOption("fftempdir") are considered temporary and have default finalizer delete, files created in other locations have default finalizer close); available finalizer generics are "close", "delete" and "deleteIfOpen", available methods are close.ff, delete.ff and deleteIfOpen.ff, see also argument 'finonexit' and finalizer

finonexit

logical scalar determining whether finalize is also called when R is closed via q (default TRUE from getOption("fffinonexit"))

FF_RETURN

logical scalar or ff object to be used. The default TRUE creates a new ff file. FALSE returns a ram object. Handing over an ff object here uses this or stops if not ffsuitable

BATCHSIZE

integer scalar limiting the number of elements to be processed in update.ff when length(initdata)>1, default from .Machine$integer.max

BATCHBYTES

integer scalar limiting the number of bytes to be processed in update.ff when length(initdata)>1, default from getOption("ffbatchbytes"), see also .rambytes

VERBOSE

set to TRUE for verbose output in update.ff when length(initdata)>1, default FALSE
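
A short sketch of how some of these arguments interact, assuming the ff package is installed. It shows levels being derived from a factor passed as 'initdata', and 'bydim' used like 'byrow' in matrix; the data values are made up for illustration:

```r
library(ff)

# Levels are derived from the factor passed as 'initdata'.
f <- ff(factor(c("lo", "hi", "lo"), levels = c("lo", "hi")))
levels(f)
f_ram <- f[]        # an ordinary R factor
f_ram

# 'bydim = c(2, 1)' fills the matrix row by row,
# analogous to matrix(1:6, 2, 3, byrow = TRUE).
m <- ff(1:6, dim = c(2, 3), bydim = c(2, 1))
m_ram <- m[]
m_ram

rm(f, m); gc()
```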

Physical object component

The 'ff_pointer' carries the following 'physical' or readonly attributes, which are accessible via physical:

vmode      see vmode
maxlength  see maxlength
pattern    see parameter 'pattern'
filename   see filename
pagesize   see parameter 'pagesize'
caching    see parameter 'caching'
finalizer  see parameter 'finalizer'
finonexit  see parameter 'finonexit'
readonly   see is.readonly
class      The external pointer needs class 'ff_pointer' to allow method dispatch of finalizers

Virtual object component

The 'virtual' component carries the following attributes (some of which might be NULL):

Length      see length.ff
Levels      see levels.ff
Names       see names.ff
VW          see vw.ff
Dim         see dim.ff
Dimorder    see dimorder
Symmetric   see symmetric.ff
Fixdiag     see fixdiag.ff
ramclass    see ramclass
ramattribs  see ramattribs

Class

You should not rely on the internal structure of ff objects or their ram versions. Instead use the accessor functions like vmode, physical and virtual. Still it would be wise to avoid attributes AND classes 'vmode', 'physical' and 'virtual' in any other packages. Note that the 'ff' object's class attribute also has copy-by-value semantics ('virtual'). For the 'ff' object the following class attributes are known:

vector                   c("ff_vector","ff")
matrix                   c("ff_matrix","ff_array","ff")
array                    c("ff_array","ff")
symmetric matrix         c("ff_symm","ff")
distance matrix          c("ff_dist","ff_symm","ff")
reserved for future use  c("ff_mixed","ff")
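
The class attributes listed above can be inspected directly. A minimal sketch, assuming the ff package is installed (object names are illustrative):

```r
library(ff)

v <- ff(1:12)                       # vector
m <- ff(1:12, dim = c(3, 4))        # matrix
a <- ff(1:24, dim = c(2, 3, 4))     # array

class(v)
class(m)
class(a)

# A matrix is also an array, so methods for 'ff_array' apply to it.
inherits(m, "ff_array")

rm(v, m, a); gc()
```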

Methods

The following methods and functions are available for ff objects:

Type      Name          Assign  Comment

Basic functions
function  ff                    constructor for ff and ram objects
generic   update                updates one ff object with the content of another
generic   clone                 clones an ff object optionally changing some of its features
method    print                 print ff
method    str                   ff object structure

Class test and coercion
function  is.ff                 check if inherits from ff
generic   as.ff                 coerce to ff, if not yet
generic   as.ram                coerce to ram retaining some of the ff information
generic   as.bit                coerce to bit

Virtual storage mode
generic   vmode         <-      get and set virtual mode (setting only for ram, not for ff objects)
generic   as.vmode              coerce to vmode (only for ram, not for ff objects)

Physical attributes
function  physical      <-      set and get physical attributes
generic   filename      <-      get and set filename
generic   pattern       <-      get pattern and set filename path and prefix via pattern
generic   maxlength             get maxlength
generic   is.sorted     <-      set and get if is marked as sorted
generic   na.count      <-      set and get NA count, if set to non-NA only swap methods can change and na.count is maintained automatically
generic   is.readonly           get if is readonly

Virtual attributes
function  virtual       <-      set and get virtual attributes
method    length        <-      set and get length
method    dim           <-      set and get dim
generic   dimorder      <-      set and get the order of dimension interpretation
generic   vt                    virtually transpose ff_array
method    t                     create transposed clone of ff_array
generic   vw            <-      set and get virtual windows
method    names         <-      set and get names
method    dimnames      <-      set and get dimnames
generic   symmetric             get if is symmetric
generic   fixdiag       <-      set and get fixed diagonal of symmetric matrix
method    levels        <-      levels of factor
generic   recodeLevels          recode a factor to different levels
generic   sortLevels            sort the levels and recode a factor
method    is.factor             if is factor
method    is.ordered            if is ordered (factor)
generic   ramclass              get ramclass
generic   ramattribs            get ramattribs

Access functions
function  get.ff                get single ff element (currently [[ is a shortcut)
function  set.ff                set single ff element (currently [[<- is a shortcut)
function  getset.ff             set single ff element and get old value in one access operation
function  read.ff               get vector of contiguous elements
function  write.ff              set vector of contiguous elements
function  readwrite.ff          set vector of contiguous elements and get old values in one access operation
method    [                     get vector of indexed elements, uses HIP, see hi
method    [             <-      set vector of indexed elements, uses HIP, see hi
generic   swap                  set vector of indexed elements and get old values in one access operation
generic   add                   (almost) unifies '+=' operation for ff and ram objects
generic   bigsample             sample from ff object

Opening/Closing/Deleting
generic   is.open               check if ff is open
method    open                  open ff object (is done automatically on access)
method    close                 close ff object (releases C++ memory and protects against file deletion if the deleteIfOpen finalizer is used)
generic   delete                deletes ff file (unconditionally)
generic   deleteIfOpen          deletes ff file if ff object is open (finalization method)
generic   finalizer     <-      get and set finalizer
generic   finalize              force finalization

Other
function  geterror.ff           get error code
function  geterrstr.ff          get error message
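
A few of the access functions listed above, as a hedged sketch that assumes the ff package is installed. The argument names used for swap (value, i) follow the ff method as the editor understands it; the values written are arbitrary:

```r
library(ff)

x <- ff(1:12)

single <- get.ff(x, 3)           # single element at position 3
block  <- read.ff(x, 3, 4)       # four contiguous elements starting at position 3
picked <- x[c(1, 12)]            # indexed access via the subscript method

write.ff(x, 1, c(101L, 102L))    # overwrite two contiguous elements

# Write new values and get the old ones back in one access operation.
old <- swap(x, value = c(0L, 0L), i = 11:12)
old

result <- x[]
result

rm(x); gc()
```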

ff options

Through options or getOption one can change and query global features of the ff package:

option        description                                                 default
fftempdir     default directory for creating ff files                     tempdir
fffinalizer   name of default finalizer                                   deleteIfOpen
fffinonexit   default for invoking finalizer on exit of R                 TRUE
ffpagesize    default pagesize                                            getdefaultpagesize
ffcaching     caching scheme for the C++ backend                          'mmnoflush'
ffdrop        default for the drop parameter in the ff subscript methods  TRUE
ffbatchbytes  default for the byte limit in batched/chunked processing    16MB
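
Querying and temporarily changing these options might look like the following sketch (assuming the ff package is installed and has set its options on load; the values printed depend on the installation):

```r
library(ff)

# Query current settings.
getOption("fftempdir")
getOption("ffcaching")
getOption("ffbatchbytes")

# Temporarily change an option, then restore the previous value.
old <- options(ffdrop = FALSE)
changed <- getOption("ffdrop")
options(old)
restored <- getOption("ffdrop")
```

Saving the return value of options() and passing it back restores the earlier setting, which is the same pattern used in the Examples section.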

OS specific

The following table gives an overview of file size limits for common file systems (see https://en.wikipedia.org/wiki/Comparison_of_file_systems for further details):

File System  File size limit
FAT16        2GB
FAT32        4GB
NTFS         16GB
ext2/3/4     16GB to 2TB
ReiserFS     4GB (up to version 3.4) / 8TB (from version 3.5)
XFS          8EB
JFS          4PB
HFS          2GB
HFS Plus     16GB
UFS1         4GB to 256TB
UFS2         512GB to 32PB
UDF          16EB
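
To see why these limits matter, the largest possible ff file can be estimated with plain R arithmetic. This sketch assumes unpacked storage of 8 bytes per double and 4 bytes per integer, and takes the 4GB FAT32 limit from the table above as 4 * 2^30 bytes:

```r
# Assumed bytes per element for unpacked storage.
bytes_per_element <- c(double = 8, integer = 4)

# An ff object holds at most .Machine$integer.max elements.
max_elements <- .Machine$integer.max
max_file_bytes <- bytes_per_element * max_elements

# Largest possible file per type, in units of 2^30 bytes.
round(max_file_bytes / 2^30, 1)

# Would such a file exceed a 4GB per-file limit?
fat32_limit <- 4 * 2^30
exceeds_fat32 <- max_file_bytes > fat32_limit
exceeds_fat32
```

Under these assumptions a maximal double vector needs about 16GB and a maximal integer vector about 8GB, so both exceed a 4GB per-file limit.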

Credits

Package Version 1.0

Daniel Adler      dadler@uni-goettingen.de
R package design, C++ generic file vectors, Memory-Mapping, 64-bit Multi-Indexing adapter and Documentation, Platform ports
Oleg Nenadic      onenadi@uni-goettingen.de
Index sequence packing, Documentation
Walter Zucchini   wzucchi@uni-goettingen.de
Array Indexing, Sampling, Documentation
Christian Gläser  christian_glaeser@gmx.de
Wrapper for biglm package

Package Version 2.0

Jens Oehlschlägel  Jens.Oehlschlaegel@truecluster.com
R package redesign; Hybrid Index Preprocessing; transparent object creation and finalization; vmode design; virtualization and hybrid copying; arrays with dimorder and bydim; symmetric matrices; factors and POSIXct; virtual windows and transpose; new generics update, clone, swap, add, as.ff and as.ram; ffapply and collapsing functions. R-coding, C-coding and Rd-documentation.
Daniel Adler       dadler@uni-goettingen.de
C++ generic file vectors, vmode implementation and low-level bit-packing/unpacking, arithmetic operations and NA handling, Memory-Mapping and backend caching. C++ coding and platform ports. R-code extensions for opening existing flat files readonly and shared.

Licence

Package under GPL-2, included C++ code released by Daniel Adler under the less restrictive ISCL

Details

The atomic data is stored in filename as a natively encoded raw flat file on disk; OS-specific limitations of the file system apply. The number of elements per ff object is limited by integer indexing, i.e. .Machine$integer.max. Atomic objects created with ff are open (see is.open): a C++ object is ready to access the file via memory mapping. Currently the C++ backend provides two caching schemes: 'mmnoflush' lets the OS decide when to flush memory-mapped pages, and 'mmeachflush' flushes memory-mapped pages at each page swap per ff file. These minimal memory resources can be released by closing (close) or deleting (delete) the ff file.

ff objects can be saved and loaded across R sessions. If the ff file still exists in the same location, it is opened automatically at the first attempt to access its data. If the ff object is removed, the ff object's finalizer is invoked at the next garbage collection (see gc). Raw data files can be made accessible as an ff object by explicitly giving the filename and vmode but no size information (length or dim). The ff object will then open the file and handle the data according to the given vmode.

The close finalizer closes the ff file; the delete finalizer deletes it. The default finalizer deleteIfOpen deletes open files and does nothing for closed files. If the default finalizer is used, two actions are needed to protect the ff file against deletion: create the file outside the standard 'fftempdir', and close the ff object before removing it or before quitting R. When R is exited through q, the finalizer is invoked depending on the 'fffinonexit' option; furthermore, the 'fftempdir' is unlinked.
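
The life cycle described above can be sketched as follows. This is an illustrative example only (assuming the ff package is installed): the file location fn is hypothetical, and the finalizer is set to "close" explicitly so that the file survives removal of the first object regardless of where the temporary directory lies:

```r
library(ff)

fn <- tempfile(fileext = ".ff")     # hypothetical file location

# Create the file with an explicit "close" finalizer so it is not deleted.
x <- ff(1:12, filename = fn, finalizer = "close")
is.open(x)

# Close and remove the R object; the flat file stays on disk.
close(x)
rm(x); gc()
survived <- file.exists(fn)
survived

# Re-attach the raw file by giving filename and vmode,
# but no size information (length or dim).
y <- ff(filename = fn, vmode = "integer")
length(y)
reattached <- y[]
reattached

# Delete the file explicitly when it is no longer needed.
delete(y)
rm(y)
deleted <- !file.exists(fn)
```

The length of the re-attached object is derived from the file size and the given vmode, so passing the wrong vmode would silently reinterpret the bytes.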

See Also

vector, matrix, array, as.ff, as.ram

Examples

  message("make sure you understand the following ff options 
    before you start using the ff package!!")
  oldoptions <- options(fffinalizer="deleteIfOpen", fffinonexit=TRUE, fftempdir=tempdir())
  message("an integer vector")
  ff(1:12)                  
  message("a double vector of length 12")
  ff(0, 12)
  message("a 2-bit logical vector of length 12 (vmode='boolean' has 1 bit)")
  ff(vmode="logical", length=12)
  message("an integer matrix 3x4 (standard colwise physical layout)")
  ff(1:12, dim=c(3,4))
  message("an integer matrix 3x4 (rowwise physical layout, but filled in standard colwise order)")
  ff(1:12, dim=c(3,4), dimorder=c(2,1))
  message("an integer matrix 3x4 (standard colwise physical layout, but filled in rowwise order
aka matrix(, byrow=TRUE))")
  ff(1:12, dim=c(3,4), bydim=c(2,1))
  gc()
  options(oldoptions)

  if (ffxtensions()){
     message("a 26-dimensional boolean array using 1-bit representation
      (file size 8 MB compared to 256 MB int in ram)")
     a <- ff(vmode="boolean", dim=rep(2, 26))
     dimnames(a) <- dummy.dimnames(a)
     rm(a); gc()
  }

  if (FALSE) {

     message("This 2GB biglm example can take long, you might want to change
       the size in order to define a size appropriate for your computer")
     require(biglm)

     b <- 1000
     n <- 100000
     k <- 3
     memory.size(max = TRUE)
     system.time(
     x <- ff(vmode="double", dim=c(b*n,k), dimnames=list(NULL, LETTERS[1:k]))
     )
     memory.size(max = TRUE)
     system.time(
     ffrowapply({
        l <- i2 - i1 + 1
        z <- rnorm(l)
        x[i1:i2,] <- z + matrix(rnorm(l*k), l, k)
     }, X=x, VERBOSE=TRUE, BATCHSIZE=n)
     )
     memory.size(max = TRUE)

     form <- A ~ B + C
     first <- TRUE
     system.time(
     ffrowapply({
        if (first){
          first <- FALSE
          fit <- biglm(form, as.data.frame(x[i1:i2,,drop=FALSE], stringsAsFactors = TRUE))
        }else
          fit <- update(fit, as.data.frame(x[i1:i2,,drop=FALSE], stringsAsFactors = TRUE))
     }, X=x, VERBOSE=TRUE, BATCHSIZE=n)
     )
     memory.size(max = TRUE)
     first
     fit
     summary(fit)
     rm(x); gc()
  }
