datadr (version 0.8.6.1)

readHDFStextFile: Experimental HDFS text reader helper function

Description

Experimental helper function for reading raw text data on HDFS into a ddo / ddf object stored on HDFS

Usage

readHDFStextFile(input, output = NULL, overwrite = FALSE, fn = NULL,
  keyFn = NULL, linesPerBlock = 10000, control = NULL, update = FALSE)

Arguments

input

a ddo / ddf connection to a text input directory on HDFS, created with hdfsConn - ensure the text files are within a directory and that type = "text" is specified

output

an output connection such as those created with localDiskConn or hdfsConn

overwrite

logical; should an existing output location be overwritten? (you can also specify overwrite = "backup" to move the existing output to a _bak directory)

fn

function to be applied to each chunk of lines (input to function is a vector of strings)
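As a minimal standalone sketch of what such a function receives and returns: the input is a character vector of raw lines, and a typical fn parses them into a data frame (the sample lines below are illustrative only).

```r
# A chunk-processing fn that parses each chunk of raw text lines
# as comma-separated records with no header row
fn <- function(x) {
  read.csv(textConnection(paste(x, collapse = "\n")), header = FALSE)
}

# Applied to a small sample chunk of lines
chunk <- c("1,apple", "2,banana")
d <- fn(chunk)
# d is a 2-row data frame with default column names V1, V2
```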

keyFn

optional function that computes the key for each block of data
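For illustration, a hypothetical keyFn, assuming it is called on each processed block (here, a data frame produced by fn) and returns a key for that block:

```r
# Hypothetical keyFn: key each block by the first value in its
# first column (assumes fn returned a data frame)
keyFn <- function(x) as.character(x[[1]][1])

# Example on a small block
block <- data.frame(V1 = c("a", "b"), V2 = 1:2,
                    stringsAsFactors = FALSE)
keyFn(block)  # "a"
```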

linesPerBlock

the number of lines of text to read into each block

control

parameters specifying how the backend should handle things (most likely parameters passed to rhwatch in RHIPE) - see rhipeControl and localDiskControl

update

logical; should a MapReduce job be run to compute additional attributes for the resulting data before returning?

Examples

# NOT RUN {
res <- readHDFStextFile(
  input = hdfsConn("/path/to/input/text", type = "text"),
  output = hdfsConn("/path/to/output"),
  fn = function(x) {
    # parse each chunk of raw lines as header-less CSV records
    read.csv(textConnection(paste(x, collapse = "\n")), header = FALSE)
  }
)
# }