Learn R Programming

HadoopStreaming (version 0.2)

hsCmdLineArgs: Handles command line arguments for Hadoop streaming tasks

Description

Offers several command line arguments useful for Hadoop streaming. Allows specifying input and output files, column separators, and much more. Optionally opens the I/O connections.

Usage

hsCmdLineArgs(spec=c(),openConnections=TRUE,args=commandArgs(TRUE))

Arguments

spec
A vector specifying the command line args to support.
openConnections
A boolean specifying whether to open the I/O connections.
args
Character vector of arguments. Defaults to command line args.

Value

Returns a list. The names of the entries in the list are the long flag names. Their values are either those specified on the command line, or the default values.If openConnections=TRUE, then the returned list has two additional entries: incon and outcon. incon is a readable connection to the input source specified, and outcon is a writable connection to the appropriate output destination.An additional entry in the returned list is named 'set'. When this list entry is FALSE, none of the options were set (generally because -h or --help was requested). The calling procedure should probably stop execution when the 'set' is returned as FALSE.

Details

The spec vector has length 6*n, where n is the number of command line arguments specified. The spec has the same format as the spec parameter in the getopt function of the getopt package, though we have one additional entry specifying a defaut value. The six entries per argument are the following:
  1. long flag name (a multi-character string)
  2. short flag name (a single character)
  3. Argument specification: 0=no arg, 1=required arg, 2=optional arg
  4. Data type ('logical', 'integer', 'double', 'complex', or 'character')
  5. A string describing the option
  6. The default value to be assigned to this parameter

See getopt in getopt.package for details.

The following vector defines the default command line args. The vector is appended to the user-supplied spec vector in the call to getopt.

  
basespec = c(
  'mapper',     'm',0, "logical","Runs the mapper.",F,
  'reducer',    'r',0, "logical","Runs the reducer, unless already running mapper.",F,
  'mapcols',    'a',0, "logical","Prints column headers for mapper output.",F,
  'reducecols', 'b',0, "logical","Prints column headers for reducer output.",F,
  'infile'   ,  'i',1, "character","Specifies an input file, otherwise use stdin.",NA,
  'outfile',    'o',1, "character","Specifies an output file, otherwise use stdout.",NA,
  'skip',       's',1,"numeric","Number of lines of input to skip at the beginning.",0,
  'chunksize',  'C',1,"numeric","Number of lines to read at once, a la scan.",-1,
  'numlines',   'n',1,"numeric","Max num lines to read per mapper or reducer job.",0,
  'sepr',       'e',1,"character","Separator character, as used by scan.",'\t',
  'insep',      'f',1,"character","Separator character for input, defaults to sepr.",NA,
  'outsep',     'g',1,"character","Separator character output, defaults to sepr.",NA,
  'help',       'h',0,"logical","Get a help message.",F
  )
  

See Also

This package relies heavily on package getopt

Examples

Run this code
spec = c('myChunkSize','C',1,"numeric","Number of lines to read at once, a la scan.",-1)
## Displays the help string
hsCmdLineArgs(spec, args=c('-h'))
## Call with the mapper flag, and request that connections be opened
opts = hsCmdLineArgs(spec, openConnections=TRUE,args=c('-m'))
opts  #   a list of argument values
opts$incon # an input connection
opts$outcon # an output connection

Run the code above in your browser using DataLab