datadr (version 0.8.4)

readTextFileByChunk: Experimental sequential text reader helper function

Description

Experimental helper function for reading text data sequentially from a file on disk and adding it to a connection using addData.

Usage

readTextFileByChunk(input, output, overwrite = FALSE, linesPerBlock = 10000,
  fn = NULL, header = TRUE, skip = 0, recordEndRegex = NULL,
  cl = NULL)

Arguments

input
the path to an input text file
output
an output connection, such as one created with localDiskConn or hdfsConn
overwrite
logical; should an existing output location be overwritten? (specify overwrite = "backup" to instead move the existing output to a location suffixed with _bak)
linesPerBlock
how many lines at a time to read
fn
function to be applied to each chunk of lines (see details)
header
logical; does the file have a header line?
skip
number of lines to skip before reading
recordEndRegex
an optional regular expression that finds lines in the text file that indicate the end of a record (for multi-line records)
cl
a "cluster" object to be used for parallel processing, created using makeCluster

Details

The function fn should take a single argument: a character vector whose elements are the lines of the current chunk of the file. Alternatively, fn can take two arguments, in which case the second argument receives the header line of the file (some parsing methods need to know the header).
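When records span multiple lines, recordEndRegex keeps a chunk boundary from falling inside a record. A minimal sketch with hypothetical data, assuming three-line records each terminated by a line containing only "END":

# hypothetical input: three-line records, each ending with a line "END"
multiFile <- file.path(tempdir(), "multi.txt")
writeLines(rep(c("name: iris", "rows: 150", "END"), 5), multiFile)
mout <- localDiskConn(file.path(tempdir(), "multiText"), autoYes = TRUE)
# recordEndRegex makes the reader cut chunks only after a matching line,
# even though linesPerBlock = 4 does not align with the 3-line records
m <- readTextFileByChunk(multiFile,
  output = mout, linesPerBlock = 4, header = FALSE,
  recordEndRegex = "^END$",
  fn = function(x) x)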

Examples

library(datadr)

# write iris out as a CSV file in a temporary directory
csvFile <- file.path(tempdir(), "iris.csv")
write.csv(iris, file = csvFile, row.names = FALSE, quote = FALSE)
# create a local disk connection to receive the parsed chunks
myoutput <- localDiskConn(file.path(tempdir(), "irisText"), autoYes = TRUE)
# read the file 10 lines at a time; the two-argument form of fn
# receives the header line as its second argument
a <- readTextFileByChunk(csvFile,
  output = myoutput, linesPerBlock = 10,
  fn = function(x, header) {
    colNames <- strsplit(header, ",")[[1]]
    read.csv(textConnection(paste(x, collapse = "\n")),
      col.names = colNames, header = FALSE)
  })
# look at the first key-value pair
a[[1]]
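The one-argument form of fn can be used when parsing does not need the header line; a minimal variant sketch, assuming the header line is consumed by the reader (as above) rather than passed to fn:

# variant using the one-argument form of fn
myoutput2 <- localDiskConn(file.path(tempdir(), "irisText2"), autoYes = TRUE)
b <- readTextFileByChunk(csvFile,
  output = myoutput2, linesPerBlock = 10,
  fn = function(x) {
    # x is the character vector of lines in the current chunk
    read.csv(textConnection(paste(x, collapse = "\n")), header = FALSE)
  })
b[[1]]

For large files, the chunks can also be processed in parallel by passing a cluster created with parallel::makeCluster as the cl argument.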
