
datadr (version 0.8.4)

drRead.table: Data Input

Description

Reads a text file in table format and creates a distributed data frame from it, with cases corresponding to lines and variables to fields in the file.

Usage

## S3 method for class 'table':
drRead(file, header = FALSE, sep = "", quote = "\"'", dec = ".",
  skip = 0, fill = !blank.lines.skip, blank.lines.skip = TRUE, comment.char = "#",
  allowEscapes = FALSE, encoding = "unknown", autoColClasses = TRUE,
  rowsPerBlock = 50000, postTransFn = identity, output = NULL, overwrite = FALSE,
  params = NULL, packages = NULL, control = NULL, ...)
## S3 method for class 'csv':
drRead(file, header = TRUE, sep = ",",
  quote = "\"", dec = ".", fill = TRUE, comment.char = "", ...)
## S3 method for class 'csv2':
drRead(file, header = TRUE, sep = ";",
  quote = "\"", dec = ",", fill = TRUE, comment.char = "", ...)
## S3 method for class 'delim':
drRead(file, header = TRUE, sep = "\t",
  quote = "\"", dec = ".", fill = TRUE, comment.char = "", ...)
## S3 method for class 'delim2':
drRead(file, header = TRUE, sep = "\t",
  quote = "\"", dec = ",", fill = TRUE, comment.char = "", ...)

Arguments

file
input text file - can either be character string pointing to a file on local disk, or an hdfsConn object pointing to a text file on HDFS (see output argument below)
header
this and the other parameters below are passed to read.table for each chunk being processed - see read.table for more info; most have default values
sep
see read.table for more info
quote
see read.table for more info
dec
see read.table for more info
skip
see read.table for more info
fill
see read.table for more info
blank.lines.skip
see read.table for more info
comment.char
see read.table for more info
allowEscapes
see read.table for more info
encoding
see read.table for more info
autoColClasses
should column classes be determined automatically by reading in a sample? This can sometimes be problematic because of the way read.table handles quotes, but keeping the default of TRUE is advantageous for speed.
rowsPerBlock
how many rows of the input file should make up a block (key-value pair) of output?
postTransFn
a function to be applied to each block after it is read in, to provide any additional processing before the block is stored (see the sketch following this list)
output
a "kvConnection" object indicating where the output data should reside. Must be a localDiskConn object if input is a text file on local disk, or a hdfsConn<
overwrite
logical; should existing output location be overwritten? (also can specify overwrite = "backup" to move the existing output to _bak)
params
a named list of objects external to the input data that are needed in postTransFn
packages
a vector of R package names that contain functions used in postTransFn (most are taken care of automatically, so this is rarely necessary to specify)
control
parameters specifying how the backend should handle things (most likely parameters to rhwatch in RHIPE) - see rhipeControl and localDiskControl
...
see read.table for more info
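
For example, postTransFn can add derived variables to each block as it is read, and overwrite controls what happens when the output connection already holds data. A minimal sketch (the Sepal.Area column and the file paths are made up for illustration):

  # write iris to a temporary csv file for illustration
  csvFile2 <- file.path(tempdir(), "iris2.csv")
  write.csv(iris, file = csvFile2, row.names = FALSE, quote = FALSE)
  # connection where the output distributed data frame will be stored
  postConn <- localDiskConn(file.path(tempdir(), "irisPost"), autoYes = TRUE)
  # read in blocks of 25 rows, adding a derived column to each block
  # before it is stored; overwrite any previous output at this location
  b <- drRead.csv(csvFile2, output = postConn, rowsPerBlock = 25,
    postTransFn = function(x) {
      x$Sepal.Area <- x$Sepal.Length * x$Sepal.Width
      x
    },
    overwrite = TRUE)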

Value

  • an object of class "ddf"

Examples

# write iris to a csv file on local disk
csvFile <- file.path(tempdir(), "iris.csv")
write.csv(iris, file = csvFile, row.names = FALSE, quote = FALSE)
# set up a local disk connection to hold the output distributed data frame
irisTextConn <- localDiskConn(file.path(tempdir(), "irisText2"), autoYes = TRUE)
# read the csv into a ddf, with 10 rows per key-value pair
a <- drRead.csv(csvFile, output = irisTextConn, rowsPerBlock = 10)
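
To inspect the result, the returned ddf can be printed and its summary attributes updated; a brief follow-up sketch, assuming the datadr updateAttributes and summary methods:

# printing the ddf summarizes its division and attributes
a
# compute per-variable summary statistics (assumed workflow,
# as in the datadr tutorial)
a <- updateAttributes(a)
summary(a)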
