datadr (version 0.8.4)

readTextFileByChunk: Experimental sequential text reader helper function

Description

Experimental helper function for reading text data sequentially from a file on disk and adding it to a connection using addData.

Usage

readTextFileByChunk(input, output, overwrite = FALSE, linesPerBlock = 10000,
  fn = NULL, header = TRUE, skip = 0, recordEndRegex = NULL,
  cl = NULL)

Arguments

input
the path to an input text file
output
an output connection, such as one created with localDiskConn or hdfsConn
overwrite
logical; should an existing output location be overwritten? (specify overwrite = "backup" to instead move the existing output to a location suffixed with _bak)
linesPerBlock
how many lines at a time to read
fn
function to be applied to each chunk of lines (see details)
header
logical; does the file have a header line?
skip
number of lines to skip before reading
recordEndRegex
an optional regular expression that finds lines in the text file that indicate the end of a record (for multi-line records)
cl
a "cluster" object to be used for parallel processing, created using makeCluster

Details

The function fn should take a single argument: a character vector whose elements are the lines of the current chunk of the file. Alternatively, fn can take two arguments, in which case the second argument receives the header line of the file (some parsing methods need to know the header).
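When records span multiple lines, recordEndRegex keeps a chunk boundary from falling inside a record. A minimal sketch with hypothetical data, assuming three-line records each terminated by a line containing only "END":

# hypothetical input: three-line records, each ending with a line "END"
multiFile <- file.path(tempdir(), "multi.txt")
writeLines(rep(c("name: iris", "rows: 150", "END"), 5), multiFile)
mout <- localDiskConn(file.path(tempdir(), "multiText"), autoYes = TRUE)
# recordEndRegex makes the reader cut chunks only after a matching line,
# even though linesPerBlock = 4 does not align with the 3-line records
m <- readTextFileByChunk(multiFile,
  output = mout, linesPerBlock = 4, header = FALSE,
  recordEndRegex = "^END$",
  fn = function(x) x)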

Examples

library(datadr)

# write iris out as a CSV file in a temporary directory
csvFile <- file.path(tempdir(), "iris.csv")
write.csv(iris, file = csvFile, row.names = FALSE, quote = FALSE)
# create a local disk connection to receive the parsed chunks
myoutput <- localDiskConn(file.path(tempdir(), "irisText"), autoYes = TRUE)
# read the file 10 lines at a time; the two-argument form of fn
# receives the header line as its second argument
a <- readTextFileByChunk(csvFile,
  output = myoutput, linesPerBlock = 10,
  fn = function(x, header) {
    colNames <- strsplit(header, ",")[[1]]
    read.csv(textConnection(paste(x, collapse = "\n")),
      col.names = colNames, header = FALSE)
  })
# look at the first key-value pair
a[[1]]
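The one-argument form of fn can be used when parsing does not need the header line; a minimal variant sketch, assuming the header line is consumed by the reader (as above) rather than passed to fn:

# variant using the one-argument form of fn
myoutput2 <- localDiskConn(file.path(tempdir(), "irisText2"), autoYes = TRUE)
b <- readTextFileByChunk(csvFile,
  output = myoutput2, linesPerBlock = 10,
  fn = function(x) {
    # x is the character vector of lines in the current chunk
    read.csv(textConnection(paste(x, collapse = "\n")), header = FALSE)
  })
b[[1]]

For large files, the chunks can also be processed in parallel by passing a cluster created with parallel::makeCluster as the cl argument.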
