
Description

Function read.table.ffdf reads separated flat files into ffdf objects, very much like (and using) read.table. It also works with read.table's convenience wrappers such as read.csv, and provides its own convenience wrappers (e.g. read.csv.ffdf) for R's usual wrappers.

Usage
read.table.ffdf(
x = NULL
, file, fileEncoding = ""
, nrows = -1, first.rows = NULL, next.rows = NULL
, levels = NULL, appendLevels = TRUE
, FUN = "read.table", ...
, transFUN = NULL
, asffdf_args = list()
, BATCHBYTES = getOption("ffbatchbytes")
, VERBOSE = FALSE
)
read.csv.ffdf(...)
read.csv2.ffdf(...)
read.delim.ffdf(...)
read.delim2.ffdf(...)
Value

An ffdf object. If created during the 'first' chunk pass, it will have one physical component per virtual column.
Arguments

x: NULL or an optional ffdf object to which the read records are appended. If provided, it defines crucial features that are otherwise determined during the 'first' chunk of reading: vmodes, colnames, colClasses, and the sequence of predefined levels.
file: the name of the file which the data are to be read from. Each row of the table appears as one line of the file. If it does not contain an absolute path, the file name is relative to the current working directory, getwd(). Tilde-expansion is performed where supported. Alternatively, file can be a readable text-mode connection (which will be opened for reading if necessary, and if so closed (and hence destroyed) at the end of the function call).
fileEncoding: character string: if non-empty, declares the encoding used on a file (not a connection) so the character data can be re-encoded. See file.
nrows: integer: the maximum number of rows to read in (includes first.rows in case a 'first' chunk is read). Negative and other invalid values are ignored.
first.rows: integer: number of rows to be read in the first chunk, see Details. Default is the value given at next.rows, or 1e3 otherwise. Ignored if x is given.
next.rows: integer: number of rows to be read in further chunks, see Details. By default calculated as BATCHBYTES %/% sum(.rambytes[vmode(x)]).
levels: NULL or an optional list; each element, named with the col.names of a factor column, specifies that column's levels. Ignored if x is given.
appendLevels: logical. A vector of permissions to expand levels for factor columns. Recycled as necessary; if the logical vector is named, unspecified values are taken to be TRUE. Ignored during processing of the 'first' chunk.
FUN: character: name of a function that is called for reading each chunk, see read.table, read.csv, etc.
...: further arguments, passed to FUN in read.table.ffdf, or passed to read.table.ffdf in the convenience wrappers.
transFUN: NULL or a function that is called on each data.frame chunk after reading with FUN and before further processing (for filtering, transformations, etc.).
asffdf_args: further arguments passed to as.ffdf when converting the data.frame of the first chunk to ffdf. Ignored if x is given.
BATCHBYTES: integer: bytes allowed for the size of the data.frame storing the result of reading one chunk. Default getOption("ffbatchbytes").
VERBOSE: logical: TRUE to print verbose timings for each processed chunk (default FALSE).
Author(s)

Jens Oehlschlägel, Christophe Dutang
Details

read.table.ffdf has been designed to read very large (many rows) separated flat files in row-chunks and store the result in an ffdf object on disk, quickly accessible via ff techniques. The first chunk is read with a default of 1000 rows; for subsequent chunks the number of rows is calculated so as not to require more RAM than getOption("ffbatchbytes").
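As a rough sketch, the default rows-per-chunk can be reproduced from ff's internal .rambytes vector (the assumed three-column layout of one logical, one integer and one double column is made up for illustration):

```r
library(ff)

# Bytes one chunk may occupy in RAM
BATCHBYTES <- getOption("ffbatchbytes")

# RAM bytes per row for an assumed table of one logical, one integer
# and one double column (.rambytes maps vmode names to bytes per cell)
bytes.per.row <- sum(.rambytes[c("logical", "integer", "double")])

# Rows per subsequent chunk, as in BATCHBYTES %/% sum(.rambytes[vmode(x)])
next.rows <- BATCHBYTES %/% bytes.per.row
next.rows
```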
The following could be indications to change the parameter first.rows:

- Set first.rows = -1 to read the complete file in one go (requires enough RAM).
- Set first.rows to a smaller number if the pre-allocation of RAM for the first chunk with parameter nrows in read.table is too large, i.e. with many columns on a machine with little RAM.
- Set first.rows to a larger number if you expect better factor level ordering (factor levels are sorted in the first chunk, but not at subsequent chunks; however, factor level ordering can be fixed later, see below).
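A minimal sketch of the first option (the file and its column names are made up for illustration):

```r
library(ff)

# Create a small csv to read
csvfile <- tempfile(fileext = ".csv")
write.csv(data.frame(int = 1:26, fac = factor(rev(letters))),
          file = csvfile, row.names = FALSE)

# first.rows = -1 reads the complete file as one single chunk,
# so factor levels come out sorted exactly as with read.csv
ffy <- read.csv.ffdf(file = csvfile, header = TRUE, first.rows = -1)
levels(ffy$fac[])

delete(ffy)
unlink(csvfile)
```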
By default the ffdf object is created on the fly at the end of reading the 'first' chunk, see argument first.rows. The creation of the ffdf object is done via as.ffdf and can be fine-tuned by passing argument asffdf_args. Even more control is possible by passing in an ffdf object as argument x, to which the read records are appended.
read.table.ffdf has been designed to behave as much like read.table as possible. However, note the following differences:

- Arguments 'colClasses' and 'col.names' are now enforced also during 'next.rows' chunks. For example, giving colClasses=NA will force that no colClasses are derived from the first.rows, respectively from the ffdf object in parameter x.
- colClass 'ordered' is allowed and will create an ordered factor.
- Character vectors are not supported: character data must be read as one of the following colClasses: 'Date', 'POSIXct', 'factor', 'ordered'. By default character columns are read as factors. Accordingly, arguments 'as.is' and 'stringsAsFactors' are not allowed.
- The sequence of levels.ff from chunked reading can depend on chunk size: by default, new levels found in a chunk are appended to the levels found in previous chunks; no attempt is made to sort and recode the levels during chunked processing. Levels can be sorted and recoded most efficiently after all records have been read, using sortLevels.
- The default for argument 'comment.char' is "" even for those FUN that have a different default. However, explicit specification of 'comment.char' will have priority.
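The transFUN argument, not covered in the examples below, lets you filter or transform each chunk before it is appended. A minimal sketch (the file and its 'value' column are made up for illustration):

```r
library(ff)

csvfile <- tempfile(fileext = ".csv")
write.csv(data.frame(value = 1:100), file = csvfile, row.names = FALSE)

# transFUN is applied to every data.frame chunk after reading with FUN
# and before further processing; here it keeps only even values
ffy <- read.csv.ffdf(
  file = csvfile
, header = TRUE
, first.rows = 25
, transFUN = function(d) d[d$value %% 2 == 0, , drop = FALSE]
)
nrow(ffy)  # 50 of the 100 rows remain

delete(ffy)
unlink(csvfile)
```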
See Also

write.table.ffdf, read.table, ffdf
Examples

message("create some csv data on disk")
x <- data.frame(
log=rep(c(FALSE, TRUE), length.out=26)
, int=1:26
, dbl=1:26 + 0.1
, fac=factor(letters)
, ord=ordered(LETTERS)
, dct=Sys.time()+1:26
, dat=seq(as.Date("1910/1/1"), length.out=26, by=1)
, stringsAsFactors = TRUE
)
x <- x[c(13:1, 13:1),]
csvfile <- tempPathFile(path=getOption("fftempdir"), extension="csv")
write.csv(x, file=csvfile, row.names=FALSE)
cat("Simply read csv with header\n")
y <- read.csv(file=csvfile, header=TRUE)
y
cat("Read csv with header\n")
ffy <- read.csv.ffdf(file=csvfile, header=TRUE)
ffy
sapply(ffy[,], class)
message("reading with colClasses (an ordered factor won't work in read.csv)")
try(read.csv(file=csvfile, header=TRUE, colClasses=c(ord="ordered")
, stringsAsFactors = TRUE))
# TODO: this could be fixed with the following two commands (Gabor Grothendieck),
# but it is not known what bad side-effects this could have
#setOldClass("ordered")
#setAs("character", "ordered", function(from) ordered(from))
y <- read.csv(file=csvfile, header=TRUE, colClasses=c(dct="POSIXct", dat="Date")
, stringsAsFactors = TRUE)
ffy <- read.csv.ffdf(
file=csvfile
, header=TRUE
, colClasses=c(ord="ordered", dct="POSIXct", dat="Date")
)
rbind(
ram_class = sapply(y, function(x)paste(class(x), collapse = ","))
, ff_class = sapply(ffy[,], function(x)paste(class(x), collapse = ","))
, ff_vmode = vmode(ffy)
)
message("NOTE that reading in chunks can change the sequence of levels and thus the coding")
message("(Sorting levels during chunked reading can be too expensive)")
levels(ffy$fac[])
ffy <- read.csv.ffdf(
file=csvfile
, header=TRUE
, colClasses=c(ord="ordered", dct="POSIXct", dat="Date")
, first.rows=6
, next.rows=10
, VERBOSE=TRUE
)
levels(ffy$fac[])
message("If we don't know the levels we can sort them after reading")
message("(Will rewrite all factor codes)")
message("NOTE that you MUST assign the return value of sortLevels()")
ffy <- sortLevels(ffy)
levels(ffy$fac[])
message("If we KNOW the levels we can fix levels upfront")
ffy <- read.csv.ffdf(
file=csvfile
, header=TRUE
, colClasses=c(ord="ordered", dct="POSIXct", dat="Date")
, first.rows=6
, next.rows=10
, levels=list(fac=letters, ord=LETTERS)
)
levels(ffy$fac[])
message("Or we inspect a sufficiently large chunk of data and use those")
table(ffy$fac[], exclude=NULL)
ffy <- read.csv.ffdf(
file=csvfile
, header=TRUE
, colClasses=c(ord="ordered", dct="POSIXct", dat="Date")
, nrows=13
, VERBOSE=TRUE
)
message("append the rest to ffy")
ffy <- read.csv.ffdf(
x=ffy
, file=csvfile
, header=FALSE
, skip=1 + nrow(ffy)
, VERBOSE=TRUE
)
table(ffy$fac[], exclude=NULL)
message("We can turn unexpected factor levels to NA, say we only allowed a:l")
ffy <- read.csv.ffdf(
file=csvfile
, header=TRUE
, colClasses=c(ord="ordered", dct="POSIXct", dat="Date")
, levels=list(fac=letters[1:12], ord=LETTERS[1:12])
, appendLevels=FALSE
)
sapply(colnames(ffy), function(i)sum(is.na(ffy[[i]][])))
message("let's store some columns more efficiently")
sum(.ffbytes[vmode(ffy)])
ffy$log <- clone(ffy$log, vmode="boolean")
ffy$fac <- clone(ffy$fac, vmode="byte")
ffy$ord <- clone(ffy$ord, vmode="byte")
sum(.ffbytes[vmode(ffy)])
message("let's make a template with zero rows")
ffx <- clone(ffy)
nrow(ffx) <- 0
message("reading with template and colClasses")
ffy <- read.csv.ffdf(
x=ffx
, file=csvfile
, header=TRUE
, colClasses=c(ord="ordered", dct="POSIXct", dat="Date")
, next.rows = 12
, VERBOSE = TRUE
)
rbind(
ff_class = sapply(ffy[,], function(x)paste(class(x), collapse = ","))
, ff_vmode = vmode(ffy)
)
levels(ffx$fac[])
levels(ffy$fac[])
message("reading with template without colClasses")
ffy <- read.csv.ffdf(
x=ffx
, file=csvfile
, header=TRUE
, next.rows = 12
, VERBOSE = TRUE
)
rbind(
ff_class = sapply(ffy[,], function(x)paste(class(x), collapse = ","))
, ff_vmode = vmode(ffy)
)
levels(ffx$fac[])
levels(ffy$fac[])
message("We can fine-tune the creation of the ffdf")
message("- let's create the ff files outside of fftempdir")
message("- let's reduce required disk space and thus file system cache RAM")
message("By default we had record size 36.25")
ffy <- read.csv.ffdf(
file=csvfile
, header=TRUE
, colClasses=c(ord="ordered", dct="POSIXct", dat="Date")
, asffdf_args=list(
vmode = c(
log="boolean"
, int="byte"
, dbl="single"
, fac="nibble" # no NAs
, ord="nibble" # no NAs
, dct="single"
, dat="single"
)
, col_args=list(pattern = "./csv") # create in getwd() with prefix csv
)
)
vmode(ffy)
message("This recordsize is more than 50% reduced")
sum(.ffbytes[vmode(ffy)]) / 36.25
message("Don't forget to wrap-up files that are not in fftempdir")
delete(ffy); rm(ffy)
message("It's a good habit to also wrap-up temporary stuff (or at least know how this is done)")
rm(ffx); gc()
fwffile <- tempfile()
cat(file=fwffile, "123456", "987654", sep="\n")
x <- read.fwf(fwffile, widths=c(1,2,3), stringsAsFactors = TRUE) #> 1 23 456 \ 9 87 654
y <- read.table.ffdf(file=fwffile, FUN="read.fwf", widths=c(1,2,3))
stopifnot(identical(x, y[,]))
x <- read.fwf(fwffile, widths=c(1,-2,3), stringsAsFactors = TRUE) #> 1 456 \ 9 654
y <- read.table.ffdf(file=fwffile, FUN="read.fwf", widths=c(1,-2,3))
stopifnot(identical(x, y[,]))
unlink(fwffile)
cat(file=fwffile, "123", "987654", sep="\n")
x <- read.fwf(fwffile, widths=c(1,0, 2,3), stringsAsFactors = TRUE) #> 1 NA 23 NA \ 9 NA 87 654
y <- read.table.ffdf(file=fwffile, FUN="read.fwf", widths=c(1,0, 2,3))
stopifnot(identical(x, y[,]))
unlink(fwffile)
cat(file=fwffile, "123456", "987654", sep="\n")
x <- read.fwf(fwffile, widths=list(c(1,0, 2,3), c(2,2,2))
, stringsAsFactors = TRUE) #> 1 NA 23 456 98 76 54
y <- read.table.ffdf(file=fwffile, FUN="read.fwf", widths=list(c(1,0, 2,3), c(2,2,2)))
stopifnot(identical(x, y[,]))
unlink(fwffile)
# \dontshow{
x <- read.csv(file=csvfile, header=TRUE, stringsAsFactors = TRUE)
y <- read.csv.ffdf(file=csvfile, header=TRUE)
stopifnot(identical(x, y[,]))
y <- read.csv.ffdf(file=csvfile, header=TRUE, nrows=13)
stopifnot(identical(x[1:13,], y[,]))
y <- read.csv.ffdf(file=csvfile, header=TRUE, first.rows=12)
y <- sortLevels(y)
stopifnot(identical(x, y[,]))
y <- read.csv.ffdf(file=csvfile, header=TRUE, nrows=13, first.rows=12)
y <- sortLevels(y)
stopifnot(identical(x[1:13,], y[,]))
y <- read.csv.ffdf(file=csvfile, header=TRUE, nrows=12, first.rows=12)
y <- sortLevels(y)
stopifnot(!identical(x[1:12,], y[,]))
stopifnot(identical(as.character(as.matrix(x[1:12,])), as.character(as.matrix(y[,]))))
y <- read.csv.ffdf(file=csvfile, header=TRUE, nrows=11, first.rows=12)
y <- sortLevels(y)
stopifnot(!identical(x[1:11,], y[,]))
stopifnot(identical(as.character(as.matrix(x[1:11,])), as.character(as.matrix(y[,]))))
y <- read.csv.ffdf(file=csvfile, header=TRUE, first.rows=-1)
stopifnot(identical(x, y[,]))
y <- read.csv.ffdf(file=csvfile, header=TRUE, nrows=13, first.rows=12, appendLevels=c(ord=FALSE))
stopifnot(is.na(y$ord[13]) && !is.na(y$fac[13]))
# }
unlink(csvfile)