hive (version 0.1-8)

DFS: Hadoop Distributed File System

Description

Functions providing high-level access to the Hadoop Distributed File System (HDFS).

Usage

DFS_cat(file, con = stdout(), henv = hive())
DFS_delete(file, recursive = FALSE, henv = hive())
DFS_dir_create(path, henv = hive())
DFS_dir_exists(path, henv = hive())
DFS_dir_remove(path, recursive = TRUE, henv = hive())
DFS_file_exists(file, henv = hive())
DFS_get_object(file, henv = hive())
DFS_read_lines(file, n = -1L, henv = hive())
DFS_list(path = ".", henv = hive())
DFS_tail(file, n = 6L, size = 1024L, henv = hive())
DFS_put(files, path = ".", henv = hive())
DFS_put_object(obj, file, henv = hive())
DFS_write_lines(text, file, henv = hive())

Arguments

henv
an object describing the local Hadoop environment, as returned by hive().
file
a character string representing a file on the DFS.
files
a character vector naming the files to be copied to the DFS.
n
an integer specifying the maximal number of lines to read; negative values indicate that the whole file should be read.
obj
an R object to be serialized to the DFS.
path
a character string representing a full path name in the DFS (without the leading hdfs://); for many functions the default corresponds to the user's home directory in the DFS.
recursive
logical. Should the contents of directories be deleted recursively as well?
size
an integer specifying the number of bytes to read from the end of the file. It must be large enough to cover the last n lines; otherwise fewer than n lines are returned.
text
a (vector of) character string(s) to be written to the DFS.
con
a connection to which the output is written via cat. Defaults to the standard output connection; currently, other connections have no effect.

Value

  • DFS_dir_create returns a logical value indicating whether the operation succeeded for the given argument.

  • DFS_dir_exists and DFS_file_exists return TRUE if the named directories or files exist in the HDFS.

Details

The Hadoop Distributed File System (HDFS) is typically part of a Hadoop cluster, but can also be used as a stand-alone, general-purpose distributed file system (DFS). Several high-level functions provide easy access to distributed storage. DFS_cat is useful for producing output in user-defined functions. It reads from files on the DFS and typically prints the output to the standard output connection. Its behaviour is similar to the base function cat.

DFS_dir_create creates directories with the given path names if they do not already exist. Its behaviour is similar to the base function dir.create.
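For example, a scratch directory can be created and checked as follows (a sketch assuming a running Hadoop installation registered via hive(); the path "tmp/hive-example" is purely illustrative):

library("hive")

## Create a scratch directory in the DFS unless it already exists
## (the path is illustrative).
if (!DFS_dir_exists("tmp/hive-example"))
    DFS_dir_create("tmp/hive-example")
DFS_dir_exists("tmp/hive-example")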

DFS_dir_exists and DFS_file_exists return a logical vector indicating whether the directories or files named by their arguments exist in the DFS. See also the base function file.exists.

DFS_dir_remove attempts to remove the directory named in its argument and, if recursive is set to TRUE, also attempts to remove subdirectories in a recursive manner.

DFS_list produces a character vector of the names of files in the directory named by its argument.

DFS_read_lines is a reader for (plain text) files stored on the DFS. It returns a vector of character strings representing lines in the (text) file. If n is given as an argument, it reads at most that many lines from the given file. Its behaviour is similar to the base function readLines.

DFS_put copies files named by its argument to a given path in the DFS.

DFS_put_object serializes an R object to the DFS.
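Together with DFS_get_object, this allows arbitrary R objects to be stored on and retrieved from the DFS. A sketch, assuming a running Hadoop installation and an illustrative file name:

library("hive")

## Store a fitted model on the DFS and read it back
## (the file name is illustrative).
model <- lm(mpg ~ wt, data = mtcars)
DFS_put_object(model, "tmp/hive-example/model")
restored <- DFS_get_object("tmp/hive-example/model")
identical(coef(model), coef(restored))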

DFS_write_lines writes a given vector of character strings to a file stored on the DFS. Its behaviour is similar to the base function writeLines.
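A round trip writing and reading text lines might look as follows (a sketch, assuming a running Hadoop installation; the file name is illustrative):

library("hive")

## Write two lines to a DFS file, then read them back.
DFS_write_lines(c("first line", "second line"), "tmp/hive-example/lines.txt")
DFS_read_lines("tmp/hive-example/lines.txt")
## Read only the first line, and inspect the tail of the file.
DFS_read_lines("tmp/hive-example/lines.txt", n = 1L)
DFS_tail("tmp/hive-example/lines.txt", n = 1L)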

References

Apache Hadoop core (http://hadoop.apache.org/core/).

Examples

## Do we have access to the root tree of the DFS?
DFS_dir_exists("/")

## If so, list the contents.
DFS_list("/")