fwrite: Fast CSV writer

Description

As write.csv but much faster (e.g. 2 seconds versus 1 minute) and just as flexible. Modern machines almost surely have more than one CPU so fwrite uses them; on all operating systems including Linux, Mac and Windows.

This is new functionality as of Nov 2016. We may need to refine argument names and defaults.

Usage

fwrite(x, file = "", append = FALSE, quote = "auto",
  sep = ",", sep2 = c("","|",""),
  eol = if (.Platform$OS.type=="windows") "\r\n" else "\n",
  na = "", dec = ".", row.names = FALSE, col.names = TRUE,
  qmethod = c("double","escape"),
  logicalAsInt = FALSE, dateTimeAs = c("ISO","squash","epoch","write.csv"),
  buffMB = 8L, nThread = getDTthreads(),
  showProgress = getOption("datatable.showProgress"),
  verbose = getOption("datatable.verbose"))

Arguments

Any list of same length vectors; e.g. data.frame and data.table.

file

Output file name. "" indicates output to the console.

append

If TRUE, the file is opened in append mode and column names (header row) are not written.

quote

When "auto", character fields, factor fields and column names will only be surrounded by double quotes when they need to be; i.e., when the field contains the separator sep, a line ending \n, the double quote itself or (when list columns are present) sep2[2] (see sep2 below). If FALSE the fields are not wrapped with quotes even if this would break the CSV due to the contents of the field. If TRUE double quotes are always included other than around numeric fields, as write.csv.

sep

The separator between columns. Default is ",".

sep2

For columns of type list where each item is an atomic vector, sep2 controls how to separate items within the column. sep2[1] is written at the start of the output field, sep2[2] is placed between each item and sep2[3] is written at the end. sep2[1] and sep2[3] may be any length strings including empty "" (default). sep2[2] must be a single character and (when list columns are present and therefore sep2 is used) different from both sep and dec. The default (|) is chosen to visually distinguish from the default sep. In speaking, writing and in code comments we may refer to sep2[2] as simply "sep2".

eol

Line separator. Default is "\r\n" for Windows and "\n" otherwise.

The string to use for missing values in the data. Default is a blank string "".

dec

The decimal separator, by default ".". See link in references. Cannot be the same as sep.

row.names

Should row names be written? For compatibility with data.frame and write.csv since data.table never has row names. Hence default FALSE unlike write.csv.

col.names

Should the column names (header row) be written? If missing, append=TRUE and the file already exists, the default is set to FALSE for convenience to prevent column names appearing again mid file.

qmethod

A character string specifying how to deal with embedded double quote characters when quoting strings.

"escape" - the quote character (as well as the backslash character) is escaped in C style by a backslash, or
"double" (default, same as write.csv), in which case the double quote is doubled with another one.

logicalAsInt

Should logical values be written as 1 and 0 rather than "TRUE" and "FALSE"?

dateTimeAs

How Date/IDate, ITime and POSIXct items are written.

"ISO" (default) - 2016-09-12, 18:12:16 and 2016-09-12T18:12:16.999999Z. 0, 3 or 6 digits of fractional seconds are printed if and when present for convenience, regardless of any R options such as digits.secs. The idea being that if milli and microseconds are present then you most likely want to retain them. R's internal UTC representation is written faithfully to encourage ISO standards, stymie timezone ambiguity and for speed. An option to consider is to start R in the UTC timezone simply with "$ TZ='UTC' R" at the shell (NB: it must be one or more spaces between TZ='UTC' and R, anything else will be silently ignored; this TZ setting applies just to that R process) or Sys.setenv(TZ='UTC') at the R prompt and then continue as if UTC were local time.
"squash" - 20160912, 181216 and 20160912181216999. This option allows fast and simple extraction of yyyy, mm, dd and (most commonly to group by) yyyymm parts using integer div and mod operations. In R for example, one line helper functions could use %/%10000, %/%100%%100, %%100 and %/%100 respectively. POSIXct UTC is squashed to 17 digits (including 3 digits of milliseconds always, even if 000) which may be read comfortably as integer64 (automatically by fread()).
"epoch" - 17056, 65536 and 1473703936.999999. The underlying number of days or seconds since the relevant epoch (1970-01-01, 00:00:00 and 1970-01-01T00:00:00Z respectively), negative before that (see ?Date). 0, 3 or 6 digits of fractional seconds are printed if and when present.
"write.csv" - this currently affects POSIXct only. It is written as write.csv does by using the as.character method which heeds digits.secs and converts from R's internal UTC representation back to local time (or the "tzone" attribute) as of that historical date. Accordingly this can be slow. All other column types (including Date, IDate and ITime which are independent of timezone) are written as the "ISO" option using fast C code which is already consistent with write.csv.

The first three options are fast due to new specialized C code. The epoch to date-part conversion uses a fast approach by Howard Hinnant (see references) using a day-of-year starting on 1 March. You should not be able to notice any difference in write speed between those three options. The date range supported for Date and IDate is [0000-03-01, 9999-12-31]. Every one of these 3,652,365 dates have been tested and compared to base R including all 2,790 leap days in this range. This option applies to vectors of date/time in list column cells, too. A fully flexible format string (such as "%m/%d/%Y") is not supported. This is to encourage use of ISO standards and because that flexibility is not known how to make fast at C level. We may be able to support one or two more specific options if required.

buffMB

The buffer size (MB) per thread in the range 1 to 1024, default 8MB. Experiment to see what works best for your data on your hardware.

nThread

The number of threads to use. Experiment to see what works best for your data on your hardware.

showProgress

Display a progress meter on the console? Ignored when file=="".

verbose

Be chatty and report timings?

Details

fwrite began as a community contribution with pull request #1613 by Otto Seiskari. This gave Matt Dowle the impetus to specialize the numeric formatting and to parallelize: http://blog.h2o.ai/2016/04/fast-csv-writing-for-r/. Final items were tracked in issue #1664 such as automatic quoting, bit64::integer64 support, decimal/scientific formatting exactly matching write.csv between 2.225074e-308 and 1.797693e+308 to 15 significant figures, row.names, dates (between 0000-03-01 and 9999-12-31), times and sep2 for list columns where each cell can itself be a vector.

References

http://howardhinnant.github.io/date_algorithms.html https://en.wikipedia.org/wiki/Decimal_mark

Examples

Run this code

# NOT RUN {
DF = data.frame(A=1:3, B=c("foo","A,Name","baz"))
fwrite(DF)
write.csv(DF, row.names=FALSE, quote=FALSE)  # same

fwrite(DF, row.names=TRUE, quote=TRUE)
write.csv(DF)                                # same

DF = data.frame(A=c(2.1,-1.234e-307,pi), B=c("foo","A,Name","bar"))
fwrite(DF, quote='auto')        # Just DF[2,2] is auto quoted 
write.csv(DF, row.names=FALSE)  # same numeric formatting

DT = data.table(A=c(2,5.6,-3),B=list(1:3,c("foo","A,Name","bar"),round(pi*1:3,2)))
fwrite(DT)
fwrite(DT, sep="|", sep2=c("{",",","}"))

# }
# NOT RUN {
set.seed(1)
DT = as.data.table( lapply(1:10, sample,
         x=as.numeric(1:5e7), size=5e6))                            #     382MB
system.time(fwrite(DT, "/dev/shm/tmp1.csv"))                        #      0.8s
system.time(write.csv(DT, "/dev/shm/tmp2.csv",                      #     60.6s
                      quote=FALSE, row.names=FALSE))
system("diff /dev/shm/tmp1.csv /dev/shm/tmp2.csv")                  # identical

set.seed(1)
N = 1e7
DT = data.table(
  str1=sample(sprintf("<!-- %010d",sample(N,1e5,replace=TRUE)), N, replace=TRUE), -->
  str2=sample(sprintf("<!-- %09d",sample(N,1e5,replace=TRUE)), N, replace=TRUE), -->
  str3=sample(sapply(sample(2:30, 100, TRUE), function(n)
     paste0(sample(LETTERS, n, TRUE), collapse="")), N, TRUE),
  str4=sprintf("<!-- %05d",sample(sample(1e5,50),N,TRUE)), -->
  num1=sample(round(rnorm(1e6,mean=6.5,sd=15),2), N, replace=TRUE),
  num2=sample(round(rnorm(1e6,mean=6.5,sd=15),10), N, replace=TRUE),
  str5=sample(c("Y","N"),N,TRUE),
  str6=sample(c("M","F"),N,TRUE),
  int1=sample(ceiling(rexp(1e6)), N, replace=TRUE),
  int2=sample(N,N,replace=TRUE)-N/2
)                                                                   #     774MB
system.time(fwrite(DT,"/dev/shm/tmp1.csv"))                         #      1.1s
system.time(write.csv(DT,"/dev/shm/tmp2.csv",row.names=F,quote=F))  #     63.2s
system("diff /dev/shm/tmp1.csv /dev/shm/tmp2.csv")                  # identical

unlink("/dev/shm/tmp1.csv")
unlink("/dev/shm/tmp2.csv")
# }
# NOT RUN {
# }

Run the code above in your browser using DataLab