datadr (version 0.8.4)

readHDFStextFile: Experimental HDFS text reader helper function

Description

Experimental helper function for reading text data on HDFS into an HDFS connection

Usage

readHDFStextFile(input, output = NULL, overwrite = FALSE, fn = NULL,
  keyFn = NULL, linesPerBlock = 10000, control = NULL, update = FALSE)

Arguments

input
a RHIPE input text handle created with rhfmt
output
an output connection such as those created with localDiskConn or hdfsConn
overwrite
logical; should existing output location be overwritten? (also can specify overwrite = "backup" to move the existing output to _bak)
fn
function to be applied to each chunk of lines (the input to the function is a character vector of lines)
keyFn
optional function to determine the value of the key for each block
linesPerBlock
how many lines to read into each block at a time
control
parameters specifying how the backend should handle things (most likely parameters to rhwatch in RHIPE) - see rhipeControl and localDiskControl
update
logical; should a MapReduce job be run to obtain additional attributes for the result data prior to returning?

Examples

# read a text file on HDFS into an HDFS connection, parsing each
# chunk of lines as CSV
res <- readHDFStextFile(
  input = Rhipe::rhfmt("/path/to/input/text", type = "text"),
  output = hdfsConn("/path/to/output"),
  fn = function(x) {
    # x is a character vector of lines; collapse and parse as CSV
    read.csv(textConnection(paste(x, collapse = "\n")), header = FALSE)
  }
)
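
A further sketch (not from the package documentation) combining the optional arguments described above. The paths are placeholders, and the keyFn behavior shown is an assumption about how each block is keyed:

# sketch only: paths are hypothetical placeholders
res2 <- readHDFStextFile(
  input = Rhipe::rhfmt("/path/to/input/text", type = "text"),
  output = hdfsConn("/path/to/output"),
  overwrite = "backup",      # move any existing output to _bak
  fn = function(x) {
    read.csv(textConnection(paste(x, collapse = "\n")), header = FALSE)
  },
  # assumption: keyFn receives the block and returns its key
  keyFn = function(x) digest::digest(x),
  linesPerBlock = 50000,     # larger blocks than the default of 10000
  update = TRUE              # run a MapReduce job to compute result attributes
)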
