write.table.ffdf: Exporting csv files from ff data.frames

Description

Function write.table.ffdf writes an ffdf object to a separated flat file, very much like (and using) write.table. It also works with convenience wrappers such as write.csv, and provides its own counterparts to R's usual wrappers (e.g. write.csv.ffdf).

Usage

write.table.ffdf(x = NULL
, file, append = FALSE
, nrows = -1, first.rows = NULL, next.rows = NULL
, FUN = "write.table", ...
, transFUN = NULL
, BATCHBYTES = getOption("ffbatchbytes")
, VERBOSE = FALSE
)
write.csv.ffdf(...)
write.csv2.ffdf(...)
write.csv(...)
write.csv2(...)

Arguments

x

an ffdf object to export to the separated file

file

either a character string naming a file or a connection open for writing. "" indicates output to the console.

append

logical. Only relevant if file is a character string. If TRUE, the output is appended to the file. If FALSE, any existing file of the name is destroyed.

nrows

integer: the maximum number of rows to write (includes first.rows if a 'first' chunk is written). Negative and other invalid values are ignored.

first.rows

the number of rows to write with the first chunk (default: next.rows)

next.rows

integer: number of rows to write in further chunks, see details. By default calculated as BATCHBYTES %/% sum(.rambytes[vmode(x)])

FUN

character: name of a function that is called for writing each chunk, see write.table, write.csv, etc.

...

further arguments, passed to FUN in write.table.ffdf, or passed to write.table.ffdf in the convenience wrappers

transFUN

NULL or a function that is called on each data.frame chunk before writing with FUN (for filtering, transformations etc.)

BATCHBYTES

integer: bytes allowed for the size of the data.frame storing the result of reading one chunk. Default getOption("ffbatchbytes").

VERBOSE

logical: TRUE to print timings for each processed chunk (default FALSE)
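
The default chunk size described under next.rows is an integer division of the byte budget by the RAM bytes needed per row. The following base R sketch only illustrates that arithmetic. The ram_bytes table, the column modes, and the 16 MB budget are assumptions made for this example; ram_bytes is a stand-in for ff's .rambytes lookup, not a copy of it.

```r
# Illustrative per-element RAM sizes in bytes for common R column types.
# ASSUMPTION: a stand-in for ff's .rambytes lookup, not a copy of it.
ram_bytes <- c(logical = 4L, integer = 4L, double = 8L)

# Hypothetical column modes of a four-column data set
# (a factor is held in RAM as integer codes).
col_modes <- c(log = "logical", int = "integer", dbl = "double", fac = "integer")

# Hypothetical byte budget for one chunk: 16 MB.
batch_bytes <- 16 * 1024^2

# RAM bytes needed for one row across all columns.
bytes_per_row <- sum(ram_bytes[col_modes])

# Number of rows that fit into one chunk under the budget.
next_rows <- batch_bytes %/% bytes_per_row

print(bytes_per_row)
print(next_rows)
```

Wider rows or a smaller byte budget yield fewer rows per chunk, and therefore more calls to FUN.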

Value

invisible

Details

write.table.ffdf has been designed to export very large ffdf objects to separated flatfiles in chunks. The first chunk is potentially written with col.names. Further chunks are appended.
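
The chunked pattern described above can be sketched in base R on an ordinary data.frame. This is a minimal illustration under stated assumptions, not ff's implementation: write_in_chunks, chunk_rows and trans_fun are hypothetical names, with trans_fun playing the role that transFUN plays in write.table.ffdf.

```r
# Hypothetical helper: write a data.frame to a delimited file in row chunks.
# The first chunk carries the column names; later chunks are appended without them.
write_in_chunks <- function(df, file, chunk_rows, trans_fun = NULL, sep = ",") {
  stopifnot(nrow(df) > 0, chunk_rows > 0)
  starts <- seq(1L, nrow(df), by = chunk_rows)
  for (i in seq_along(starts)) {
    from <- starts[i]
    to <- min(from + chunk_rows - 1L, nrow(df))
    chunk <- df[from:to, , drop = FALSE]
    # Optional per-chunk hook for filtering or transforming before writing.
    if (!is.null(trans_fun)) chunk <- trans_fun(chunk)
    write.table(chunk, file = file, sep = sep,
                append = i > 1L, col.names = i == 1L, row.names = FALSE)
  }
  invisible(file)
}

df <- data.frame(id = 1:10, value = (1:10) / 2)
csvfile <- tempfile(fileext = ".csv")

# Write in chunks of 4 rows, keeping only rows with an even id.
write_in_chunks(df, csvfile, chunk_rows = 4L,
                trans_fun = function(d) d[d$id %% 2L == 0L, , drop = FALSE])

result <- read.csv(csvfile)
print(result)
unlink(csvfile)
```

Only one chunk is held as a data.frame at a time, which is what keeps memory use bounded when the source is much larger than RAM.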

write.table.ffdf has been designed to behave as much like write.table as possible. However, note the following differences:

by default row.names are only written if the ffdf has row.names.

Examples


# NOT RUN {
   x <- data.frame(log=rep(c(FALSE, TRUE), length.out=26), int=1:26, dbl=1:26 + 0.1
, fac=factor(letters), ord=ordered(LETTERS), dct=Sys.time()+1:26
, dat=seq(as.Date("1910/1/1"), length.out=26, by=1))
   ffx <- as.ffdf(x)

   csvfile <- tempPathFile(path=getOption("fftempdir"), extension="csv")

   write.csv.ffdf(ffx, file=csvfile)
   write.csv.ffdf(ffx, file=csvfile, append=TRUE)

   ffy <- read.csv.ffdf(file=csvfile, header=TRUE
, colClasses=c(ord="ordered", dct="POSIXct", dat="Date"))

   rm(ffx, ffy); gc()
   unlink(csvfile)

 
# }
# NOT RUN {
  # Attention, this takes very long
  vmodes <- c(log="boolean", int="byte", dbl="single"
, fac="short", ord="short", dct="single", dat="single")

  message("create a ffdf with 7 columns and 78 million rows")
  system.time({
    x <- data.frame(log=rep(c(FALSE, TRUE), length.out=26), int=1:26, dbl=1:26 + 0.1
, fac=factor(letters), ord=ordered(LETTERS), dct=Sys.time()+1:26
, dat=seq(as.Date("1910/1/1"), length.out=26, by=1))
    x <- do.call("rbind", rep(list(x), 10))
    x <- do.call("rbind", rep(list(x), 10))
    x <- do.call("rbind", rep(list(x), 10))
    x <- do.call("rbind", rep(list(x), 10))
    ffx <- as.ffdf(x, vmode = vmodes)
    for (i in 2:300){
      message(i, "\n")
      last <- nrow(ffx) + nrow(x)
      first <- last - nrow(x) + 1L
      nrow(ffx) <- last
      ffx[first:last,] <- x
    }
  })


  csvfile <- tempPathFile(path=getOption("fftempdir"), extension="csv")

  write.csv.ffdf(ffx, file=csvfile, VERBOSE=TRUE)
  ffy <- read.csv.ffdf(file=csvfile, header=TRUE
, colClasses=c(ord="ordered", dct="POSIXct", dat="Date")
, asffdf_args=list(vmode = vmodes), VERBOSE=TRUE)

  rm(ffx, ffy); gc()
  unlink(csvfile)
 
# }
