tm.plugin.dc (version 0.1-7)

DistributedCorpus: Distributed Corpus

Description

Data structures and operators for distributed corpora.

Usage

DistributedCorpus( source,
                   readerControl = list(reader   = source$DefaultReader,
                                        language = "eng"),
                   storage = NULL, keys = NULL, ... )
as.Corpus( x, ... )
as.DistributedCorpus( x, storage = NULL, ... )

Arguments

source
A Source object. At the moment only DirSource is supported.
readerControl
A list with the named components reader representing a reading function capable of handling the file format found in source, and language giving the text's language (preferably in ISO 6
storage
The storage subsystem to use with the DistributedCorpus. Currently two types of storages are supported: local disk storage (local_disk) and the Hadoop distributed file system (HDFS). If no storage is specified it uses a default storage, na
keys
An integer vector with one element per document in the corpus. Each key uniquely identifies a document in the chunks. Default: a sequence from 1 to the number of documents.
x
An object to be coerced to a Corpus/DistributedCorpus. Currently coercion from/to classic tm corpora (VCorpus) is implemented.
...
Optional arguments for the reader.

Value

  • An object of class DistributedCorpus, which extends the classes Corpus and list, containing a collection of text documents.
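
To illustrate what "extends the classes Corpus and list" means in S3 terms, here is a minimal base R sketch. It does not call tm.plugin.dc; the object is built by hand with `structure()`, and `print.Corpus` is a hypothetical method defined only for this illustration, not the package's own code.

```r
# Hand-built stand-in for a distributed corpus (assumption: the real
# constructor sets a comparable class vector; this is not package code).
x <- structure(list("doc one", "doc two"),
               class = c("DistributedCorpus", "Corpus", "list"))

# The object is recognised as each class named in its class vector.
inherits(x, "DistributedCorpus")
inherits(x, "Corpus")
inherits(x, "list")

# Hypothetical print method for the parent class. S3 dispatch falls back
# to it because no print.DistributedCorpus is defined in this sketch.
print.Corpus <- function(x, ...) {
  cat("A corpus with", length(x), "text documents\n")
  invisible(x)
}
print(x)
```

Because dispatch walks the class vector from left to right, methods written for Corpus apply to the subclass unless a more specific method is defined.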

Details

When a distributed corpus is constructed, the input source is read via the supplied reader and stored on the given file system (argument storage). While the dataset resides on the corresponding storage (e.g., HDFS), only a symbolic representation is held in R, which allows the corpus to be accessed via corresponding methods that dispatch on the distributed corpus. Since the space available to the distributed corpus is limited only by the disk space in the given storage (and not by main memory, as in a classic corpus), a set of so-called revisions, i.e., stages of the (processed) corpus, is also stored.
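
The idea of keeping documents on storage while R holds only a lightweight handle, plus a list of revisions, can be sketched in base R. This is a conceptual sketch under stated assumptions, not the package's implementation: `new_handle`, `add_revision`, and `read_doc` are hypothetical helpers, the storage is a plain temporary directory, and the one-file-per-key layout is invented for illustration.

```r
# Write the documents of one revision to disk and record its identifier.
# Hypothetical helper; the on-disk layout is an assumption of this sketch.
add_revision <- function(handle, docs) {
  rev_id  <- sprintf("rev_%03d", length(handle$revisions) + 1L)
  rev_dir <- file.path(handle$storage, rev_id)
  dir.create(rev_dir, recursive = TRUE, showWarnings = FALSE)
  for (i in seq_along(docs)) {
    writeLines(docs[[i]], file.path(rev_dir, paste0(handle$keys[i], ".txt")))
  }
  handle$revisions <- c(handle$revisions, rev_id)
  handle
}

# Create the in-memory handle: storage location, keys, and revision ids.
# The document text itself is never kept in the handle.
new_handle <- function(docs, storage_dir) {
  handle <- list(storage = storage_dir,
                 keys = seq_along(docs),   # default keys: 1..number of documents
                 revisions = character(0))
  add_revision(handle, docs)
}

# Read one document by key, from the latest revision unless told otherwise.
read_doc <- function(handle, key, revision = length(handle$revisions)) {
  readLines(file.path(handle$storage, handle$revisions[revision],
                      paste0(key, ".txt")))
}

storage_dir <- file.path(tempdir(), "dc_sketch")
h <- new_handle(list("Crude oil prices rose.", "OPEC met today."), storage_dir)

# A processing step produces a new revision; the earlier stage stays on disk.
processed <- lapply(h$keys, function(k) tolower(read_doc(h, k)))
h <- add_revision(h, processed)

read_doc(h, 1)                # latest (lower-cased) stage
read_doc(h, 1, revision = 1)  # original stage is still available
```

Keeping each stage as a separate revision trades disk space for the ability to return to an earlier state of the corpus without re-reading the source.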

The constructed corpus object inherits from a tm Corpus and has several attributes containing meta information.

See Also

Corpus

Examples

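## Similar to example in package 'tm'
library("tm")            # DirSource, the Reuters reader, and the crude data
library("tm.plugin.dc")  # DistributedCorpus and the coercion functions
reut21578 <- system.file("texts", "crude", package = "tm")
dc <- DistributedCorpus(DirSource(reut21578),
                        readerControl = list(reader = readReut21578XMLasPlain))
dc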

## Coercion
data("crude")
as.DistributedCorpus(crude)
as.Corpus(dc)
