Extract.ff: Reading and writing vectors and arrays (high-level)

Description

These are the main methods for reading data from and writing data to ff files.

Usage

# S3 method for ff
[(x, i, pack = FALSE)
# S3 method for ff
[(x, i, add = FALSE, pack = FALSE) <- value
# S3 method for ff_array
[(x, …, bydim = NULL, drop = getOption("ffdrop"), pack = FALSE)
# S3 method for ff_array
[(x, …, bydim = NULL, add = FALSE, pack = FALSE) <- value
# S3 method for ff
[[(x, i)
# S3 method for ff
[[(x, i, add = FALSE) <- value

Arguments

x

an ff object

i

missing OR a single index expression OR a hi object

…

missing OR up to length(dim(x)) index expressions OR hi objects

drop

logical scalar indicating whether array dimensions shall be dropped

bydim

the dimorder which shall be used in interpreting vector to/from array data

pack

FALSE to prevent rle-packing in hybrid index preprocessing, see as.hi

value

the values to be assigned, possibly recycled

add

TRUE if the values should rather increment than overwrite at the target positions, see readwrite.ff
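As a minimal sketch of the add argument (assuming the ff package is installed; the variable names are only illustrative), the following increments the first two elements of a small ff vector instead of overwriting them:

```r
library(ff)

x <- ff(1:5)                 # small integer ff vector holding 1..5

# add = TRUE increments the target positions rather than overwriting them;
# the scalar value 10L is recycled over both positions
x[1:2, add = TRUE] <- 10L

incremented <- x[]           # read the whole vector back into RAM
```

With add = FALSE (the default) the same call would replace the first two elements with 10.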

Value

The read operators [ and [[ return data from the ff object, possibly decorated with names, dim, dimnames and further attributes and classes (see ramclass, ramattribs). The write operators [<- and [[<- return the 'modified' ff object (as all assignment operators do).

Index expressions

x <- ff(1:12, dim=c(3,4), dimnames=list(letters[1:3], NULL))

allowed expression	--	`example`
positive integers		`x[ 1 ,1]`
negative integers		`x[ -(2:12) ]`
logical		`x[ c(TRUE, FALSE, FALSE) ,1]`
character		`x[ "a" ,1]`
integer matrices		`x[ rbind(c(1,1)) ]`
hybrid index		`x[ hi ,1]`
disallowed expression	--	`example`
zeros		`x[ 0 ]`
NAs		`x[ NA ]`
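The allowed forms from the table can be tried side by side. This is a sketch that assumes the ff package is installed; each expression is written so that it selects the same single element, the one in row "a" and column 1. The variable names are illustrative.

```r
library(ff)

x <- ff(1:12, dim = c(3, 4), dimnames = list(letters[1:3], NULL))

by_position  <- x[1, 1]                       # positive integers
by_exclusion <- x[-(2:12)]                    # negative integers: drop positions 2..12
by_logical   <- x[c(TRUE, FALSE, FALSE), 1]   # logical row selector
by_name      <- x["a", 1]                     # character, matched against dimnames
by_matrix    <- x[rbind(c(1, 1))]             # matrix with one (row, column) pair per row

selected <- c(as.vector(by_position), as.vector(by_exclusion),
              as.vector(by_logical), as.vector(by_name),
              as.vector(by_matrix))
```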

Dimorder and bydim

Arrays in R always have the standard dimorder 1:length(dim(x)), whereas ff allows an array to be stored in a different dimorder. Using a nonstandard dimorder (see dimorderStandard) can speed up certain access operations: for a matrix, dimorder=c(1,2) (column-major order) allows fast extraction of columns, while dimorder=c(2,1) allows fast extraction of rows.

While the dimorder, an attribute of an ff_array, controls how the vector in an ff file is interpreted, the bydim argument to the extractor functions controls how assignment vector values in [<- are translated to the array and how the array is translated to a vector in [ subscripting. Note that bydim=c(2,1) corresponds to matrix(..., byrow=TRUE).
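The correspondence between bydim=c(2,1) and byrow=TRUE can be checked directly. The sketch below assumes the ff package is installed and uses illustrative variable names; it fills one ff matrix row-wise at creation and another through a row-wise assignment, then compares both with a base R matrix built with byrow=TRUE.

```r
library(ff)

v <- 1:12
m <- matrix(v, nrow = 3, ncol = 4, byrow = TRUE)   # base R reference, filled row-wise

# bydim at creation: the init data is interpreted row-wise
x <- ff(v, dim = c(3, 4), bydim = c(2, 1))
same_at_creation <- all(x[] == m)

# bydim in assignment: the value vector is interpreted row-wise
y <- ff(0L, dim = c(3, 4))
y[, , bydim = c(2, 1)] <- v
same_after_assignment <- all(y[] == m)
```

Both ff objects keep the standard dimorder here; bydim only changes how the vector of values is mapped onto the array.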

Multiple vector interpretation in arrays

In case of a non-standard dimorder (see dimorderStandard), the vector sequence of array elements in R differs from that in the ff file. To access array elements in file order, you can use getset.ff or readwrite.ff, or copy the ff object and set dim(ff)<-NULL to get a vector view into the ff object (using [ then dispatches to the vector method [.ff). To access the array elements in R standard dimorder, simply use [, which dispatches to [.ff_array. Note that in this case as.hi will unpack the complete index, see next section.

RAM expansion of index expressions

Some index expressions do not consume RAM due to the hi representation; for example, 1:n will consume almost no RAM however large n is. However, some index expressions are expanded and require up to maxindex(i) * .rambytes["integer"] bytes, either because the sorted sequence of index positions cannot be rle-packed efficiently or because hiparse cannot yet parse such an expression and falls back to evaluating/expanding the index expression. If the index positions are not sorted, the index will be expanded and a second vector is needed to store the information for re-ordering, so the index requires 2 * maxindex(i) * .rambytes["integer"] bytes.

RAM expansion when recycling assignment values

Some assignment expressions do not consume RAM for recycling; for example, x[1:n] <- 1:k will not consume RAM however large n is compared to k, provided x has standard dimorder. However, if length(value)>1, assignment expressions with non-ascending index positions trigger recycling of the value R-side to the full index length. This will happen if dimorder does not match parameter bydim or if the index is not sorted ascending.
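To make the arithmetic concrete, here is a rough estimate in base R. The helper index_ram_bytes is hypothetical (it is not part of ff), and it assumes 4 bytes per integer, which is the value ff reports as .rambytes["integer"].

```r
# Hypothetical helper: worst-case RAM in bytes for an index that cannot be
# kept rle-packed. Assumes 4 bytes per integer (ff: .rambytes["integer"]).
index_ram_bytes <- function(maxindex, sorted = TRUE) {
  bytes_per_integer <- 4
  copies <- if (sorted) 1 else 2   # unsorted: a second vector holds the re-ordering
  copies * maxindex * bytes_per_integer
}

sorted_cost   <- index_ram_bytes(1e6, sorted = TRUE)    # expanded, sorted index
unsorted_cost <- index_ram_bytes(1e6, sorted = FALSE)   # expanded, unsorted index
```

For one million index positions this gives about 4 MB for a sorted but unpackable index and about 8 MB for an unsorted one.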
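A small sketch of the first case, assuming the ff package is installed: a short value is recycled over an ascending index on a plain ff vector. The example only shows the result of the recycling; it does not measure RAM use.

```r
library(ff)

x <- ff(0L, length = 12)   # integer ff vector of twelve zeros

# ascending index 1:12, value of length 3 recycled four times
x[1:12] <- 1:3

recycled <- x[]            # read the whole vector back into RAM
```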

Details

The single square bracket operators [ and [<- are the workhorses for accessing the content of an ff object. They support ff_vector and ff_array access (dim.ff), they respect virtual windows (vw), names.ff and dimnames.ff and retain ramclass and ramattribs and thus support POSIXct and factor, see levels.ff.

The functionality of [ and [<- can be combined into one efficient operation, see swap.

The double square bracket operator [[ is a shortcut for get.ff, and [[<- is a shortcut for set.ff; however, you should not rely on this for the future, see LimWarn. For programming please prefer [.
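As a brief illustration of such a combined read and write (a sketch assuming the ff package is installed and the argument order swap(x, value, i) for ff vectors), the call below returns the old values at positions 1 to 3 while writing zeros there:

```r
library(ff)

x <- ff(1:10)

# read the current values at positions 1..3 and overwrite them with 0 in one step
old <- swap(x, 0L, 1:3)

after <- x[1:3]            # values now stored at those positions
```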

Examples

# NOT RUN {
   message("look at different dimorders")
   x <- ff(1:12, dim=c(3,4), dimorder=c(1,2))
   x[]
   as.vector(x[])
   x[1:12]
   x <- ff(1:12, dim=c(3,4), dimorder=c(2,1))
   x[]
   as.vector(x[])
   message("Beware (might be changed)")
   x[1:12]

   message("look at different bydim")
   matrix(1:12, nrow=3, ncol=4, byrow=FALSE)
   x <- ff(1:12, dim=c(3,4), bydim=c(1,2))
   x
   matrix(1:12, nrow=3, ncol=4, byrow=TRUE)
   x <- ff(1:12, dim=c(3,4), bydim=c(2,1))
   x
   x[,, bydim=c(2,1)]
   as.vector(x[,, bydim=c(2,1)])
   message("even consistent interpretation of vectors in assignments")
   x[,, bydim=c(1,2)] <- x[,, bydim=c(1,2)]
   x
   x[,, bydim=c(2,1)] <- x[,, bydim=c(2,1)]
   x
   rm(x); gc()

  
# }
# NOT RUN {
   message("some performance implications of different dimorders")
   n <- 100
   m <- 100000
   a <- ff(1L,dim=c(n,m))
   b <- ff(1L,dim=c(n,m), dimorder=2:1)
   system.time(lapply(1:n, function(i)sum(a[i,])))
   system.time(lapply(1:n, function(i)sum(b[i,])))
   system.time(lapply(1:n, function(i){i<-(i-1)*(m/n)+1; sum(a[,i:(i+m/n-1)])}))
   system.time(lapply(1:n, function(i){i<-(i-1)*(m/n)+1; sum(b[,i:(i+m/n-1)])}))

   n <- 100
   a <- ff(1L,dim=c(n,n,n,n))
   b <- ff(1L,dim=c(n,n,n,n), dimorder=4:1)
   system.time(lapply(1:n, function(i)sum(a[i,,,])))
   system.time(lapply(1:n, function(i)sum(a[,i,,])))
   system.time(lapply(1:n, function(i)sum(a[,,i,])))
   system.time(lapply(1:n, function(i)sum(a[,,,i])))
   system.time(lapply(1:n, function(i)sum(b[i,,,])))
   system.time(lapply(1:n, function(i)sum(b[,i,,])))
   system.time(lapply(1:n, function(i)sum(b[,,i,])))
   system.time(lapply(1:n, function(i)sum(b[,,,i])))

   n <- 100
   m <- 100000
   a <- ff(1L,dim=c(n,m))
   b <- ff(1L,dim=c(n,m), dimorder=2:1)
   system.time(ffrowapply(sum(a[i1:i2,]), a, RETURN=TRUE, CFUN="csum", BATCHBYTES=16104816%/%20))
   system.time(ffcolapply(sum(a[,i1:i2]), a, RETURN=TRUE, CFUN="csum", BATCHBYTES=16104816%/%20))
   system.time(ffrowapply(sum(b[i1:i2,]), b, RETURN=TRUE, CFUN="csum", BATCHBYTES=16104816%/%20))
   system.time(ffcolapply(sum(b[,i1:i2]), b, RETURN=TRUE, CFUN="csum", BATCHBYTES=16104816%/%20))

   rm(a,b); gc()
  
# }
