DistributedCorpus: Distributed Corpus

Description

Data structures and operators for distributed corpora.

Usage

DCorpus( x,
         readerControl = list(reader   = reader(x),
                              language = "en"),
         storage = NULL, keep = TRUE, ... )
# S3 method for DCorpus
as.VCorpus(x)
as.DCorpus( x, storage = NULL, ... )

Arguments

x

For DCorpus, a Source object. At the moment only DirSource is supported. For as.VCorpus() and as.DCorpus(), an object to be coerced to a VCorpus or DCorpus. Currently, coercion from and to classic tm corpora (VCorpus) is implemented.

readerControl

A list with the named components reader representing a reading function capable of handling the file format found in x, and language giving the text's language (preferably as IETF language tags, see language in package NLP).

storage

The storage subsystem to use with the DCorpus. Currently two types of storage are supported: local disk storage using the Local File System (LFS) and the Hadoop Distributed File System (HDFS). Default: 'LFS'.

keep

Should revisions be used when operating on the DCorpus? Default: TRUE

…

Optional arguments for the reader.

Value

An object inheriting from DCorpus and Corpus.

Details

When constructing a distributed corpus, the input source is read via the supplied reader and stored on the given file system (argument storage). While the data set resides on the corresponding storage (e.g., HDFS), only a symbolic representation (a so-called DList) is held in R, which allows the corpus to be accessed via the corresponding DList methods. Because the space available to a distributed corpus is limited only by the disk space of the given storage (and not by main memory, as in a standard tm corpus), a set of so-called revisions, i.e., stages of the (processed) corpus, is also stored by default. Revisions can be turned off later using the keepRevisions() replacement function.
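
The construction and revision handling described above can be sketched as follows. This is only an illustration, assuming that the packages tm and tm.plugin.dc are installed and that the sample Reuters documents shipped with tm are available; the storage and keep values simply spell out the defaults shown in the Usage section.

```r
# Sketch only: runs the example solely if 'tm' and 'tm.plugin.dc' are installed
# (an assumption, not something this page guarantees).
if (requireNamespace("tm", quietly = TRUE) &&
    requireNamespace("tm.plugin.dc", quietly = TRUE)) {
  library("tm")
  library("tm.plugin.dc")

  # Directory with the sample Reuters documents shipped with 'tm'
  reut21578 <- system.file("texts", "crude", package = "tm")

  # storage = NULL and keep = TRUE are the defaults from the Usage section,
  # written out here to make the choice visible
  dc <- DCorpus(DirSource(reut21578),
                readerControl = list(reader = readReut21578XMLasPlain),
                storage = NULL, keep = TRUE)

  # Turn revisions off for subsequent processing steps
  keepRevisions(dc) <- FALSE
}
```

Passing keep = FALSE at construction time would presumably have the same effect as the replacement call, but without the option to retain earlier stages of the corpus.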

The constructed corpus object inherits from a tm Corpus and has several slots containing meta information:

meta: Corpus Meta Data contains corpus-specific meta data in the form of tag-value pairs.
dmeta: Document Meta Data of class data.frame contains document-specific meta data for the corpus. This is mainly provided for compatibility with standard tm corpus definitions and is not yet actually used in the distributed scenario.
keep: A logical indicating whether revisions representing stages (e.g., in a preprocessing chain) should be kept or not.

Examples

# NOT RUN {
## Similar to example in package 'tm'
reut21578 <- system.file("texts", "crude", package = "tm")
## Construct a distributed corpus with the constructor documented under Usage
dc <- DCorpus(DirSource(reut21578),
              readerControl = list(reader = readReut21578XMLasPlain))
dc

## Coercion
data("crude")
as.DCorpus(crude)
as.VCorpus(dc)
# }