appendLevels combines levels without sorting, such that levels of the first argument will not require re-coding.
recodeLevels is a generic for recoding a factor to a desired set of levels; it also has a method for large ff objects.
sortLevels is a generic for level sorting and recoding of single factors or of all factors of an ffdf data frame.
appendLevels(...)
recodeLevels(x, lev)
# S3 method for factor
recodeLevels(x, lev)
# S3 method for ff
recodeLevels(x, lev)
sortLevels(x)
# S3 method for factor
sortLevels(x)
# S3 method for ff
sortLevels(x)
# S3 method for ffdf
sortLevels(x)
...: character vectors of levels, or is.factor objects from which the levels attribute is taken
lev: a character vector of levels
appendLevels returns a vector of combined levels; recodeLevels and sortLevels return the input object with changed levels. Do read the note!
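To make "the input object with changed levels" concrete, here is a minimal base-R sketch of recoding an ordinary in-memory factor to a new level order. The helper name recode_levels_sketch is hypothetical and this is not the ff implementation; it only illustrates that the integer codes change while the labels stay the same, which is why the returned object has to be assigned.

```r
# Base-R sketch only; recode_levels_sketch is a hypothetical helper,
# not the ff implementation of recodeLevels.
recode_levels_sketch <- function(x, lev) {
  stopifnot(is.factor(x), all(levels(x) %in% lev))
  # Map each old integer code to the position of its label in 'lev'.
  new_codes <- match(levels(x), lev)[as.integer(x)]
  structure(new_codes, levels = lev, class = "factor")
}

x  <- factor(c("d", "e", "f", "d"), levels = c("d", "e", "f"))
x2 <- recode_levels_sketch(x, c("f", "e", "d"))

as.integer(x)    # codes under the old level order
as.integer(x2)   # codes under the new level order
identical(as.character(x), as.character(x2))  # labels are unchanged
```

The result is assigned to x2 here; calling the helper without keeping its return value would leave x as it was.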
When reading a long file with categorical columns the final set of factor levels is only known once the complete file has been read.
When a file is so large that we read it in chunks, the new levels need to be added incrementally.
rbind.data.frame sorts combined levels, which requires recoding. For ff factors this would require recoding of all previous chunks at the next chunk, potentially on disk, which is too expensive.
Therefore read.table.ffdf will simply appendLevels without sorting, and the recodeLevels and sortLevels generics provide a convenient means for sorting and recoding levels after all chunks have been read.
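The append-now, sort-later strategy described above can be sketched in base R. This is an illustration under simplifying assumptions, not what read.table.ffdf does internally: plain character vectors stand in for chunks read from a file, and the variable names are made up for the example. New levels are appended at the end so codes already written stay valid, and a single recoding pass sorts the levels after the last chunk.

```r
# Base-R sketch only: plain vectors stand in for chunks read from a file.
chunks <- list(c("d", "f"), c("a", "d"), c("c", "f", "b"))

lev   <- character(0)   # levels seen so far, in order of first appearance
codes <- integer(0)     # integer codes written so far (never rewritten in the loop)

for (chunk in chunks) {
  # Append unseen levels at the end; existing positions stay put.
  lev   <- union(lev, chunk)
  codes <- c(codes, match(chunk, lev))
}

unsorted <- structure(codes, levels = lev, class = "factor")

# One recoding pass after the last chunk: sort levels and remap codes.
sorted_lev <- sort(lev)
sorted <- structure(match(lev, sorted_lev)[codes],
                    levels = sorted_lev, class = "factor")

levels(unsorted)
levels(sorted)
```

Both factors carry the same labels in the same order; only the level order and the underlying codes differ.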
# NOT RUN {
library(ff)  # also attaches bit, which provides setattr()

message("Let's create a factor with few levels")
x <- ff(letters[4:6], levels=letters[4:6])
message("Let's interpret the same ff file without levels in order to see the codes")
y <- x
levels(y) <- NULL

levels(x)
data.frame(factor=x[], codes=y[], stringsAsFactors = TRUE)

# append the remaining letters without sorting: existing codes stay valid
levels(x) <- appendLevels(levels(x), letters)
levels(x)
data.frame(factor=x[], codes=y[], stringsAsFactors = TRUE)

x <- sortLevels(x) # implicit recoding is chunked where necessary
levels(x)
data.frame(factor=x[], codes=y[], stringsAsFactors = TRUE)

message("NEVER forget to reassign the result of recodeLevels or sortLevels, look at the following mess")
recodeLevels(x, rev(levels(x)))
message("NOW the codings have changed, but not the levels, the result is wrong data")
levels(x)
data.frame(factor=x[], codes=y[], stringsAsFactors = TRUE)

rm(x);gc()
# }
# NOT RUN {
n <- 5e7

message("reading a factor from a file is as fast ...")
system.time(
  fx <- ff(factor(letters[1:25]), length=n)
)
system.time(x <- fx[])
str(x)
rm(x); gc()

message("... as creating it in-RAM (R-2.11.1) which is theoretically impossible ...")
system.time({
  x <- integer(n)
  x[] <- 1:25
  levels(x) <- letters[1:25]
  class(x) <- "factor"
})
str(x)
rm(x); gc()

message("... but is possible if we avoid some unnecessary copying that is triggered by assignment functions")
system.time({
  x <- integer(n)
  x[] <- 1:25
  setattr(x, "levels", letters[1:25])
  setattr(x, "class", "factor")
})
str(x)
rm(x); gc()

rm(n)
# }