tm.plugin.dc (version 0.2-10)

DistributedCorpus: Distributed Corpus

Description

Data structures and operators for distributed corpora.

Usage

DCorpus( x,
         readerControl = list(reader   = reader(x),
                              language = "en"),
         storage = NULL, keep = TRUE, ... )
# S3 method for DCorpus
as.VCorpus(x)
as.DCorpus( x, storage = NULL, ... )

Arguments

x

For DCorpus(), a Source object. At the moment only DirSource is supported. For as.VCorpus() and as.DCorpus(), an object to be coerced to a VCorpus/DCorpus. Currently only coercion from/to classic tm corpora (VCorpus) is implemented.

readerControl

A list with the named components reader representing a reading function capable of handling the file format found in x, and language giving the text's language (preferably as IETF language tags, see language in package NLP).

storage

The storage subsystem to use with the DCorpus. Currently two types of storage are supported: local disk storage using the Local File System (LFS) and the Hadoop Distributed File System (HDFS). Default: 'LFS'.

keep

Should revisions be used when operating on the DCorpus? Default: TRUE

...

Optional arguments for the reader.

Value

An object inheriting from DCorpus and Corpus.

Details

When a distributed corpus is constructed, the input source is extracted via the supplied reader and stored on the given file system (argument storage). While the data set resides on the corresponding storage (e.g., HDFS), only a symbolic representation, a so-called DList, is held in R; the corpus is accessed via the corresponding DList methods. Because the space available to a distributed corpus is limited only by the disk space of the given storage (and not by main memory, as in a standard tm corpus), a set of so-called revisions, i.e., stages of the (processed) corpus, is also stored by default. Revisions can be turned off later using the keepRevisions() replacement function.
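The idea of keeping documents on disk while R holds only a small handle, with one stored stage per processing step, can be sketched in base R. This is a toy model under stated assumptions, not the tm.plugin.dc or DList implementation: the names new_toy_corpus(), commit_revision(), read_docs() and toy_map(), the "rev-NNN" naming and the one-RDS-file-per-revision layout are all hypothetical and chosen only for illustration.

```r
## Toy model only; NOT how tm.plugin.dc stores a DCorpus.
## Documents are serialized to disk; R keeps a small handle (a list).

## Write the given documents as a new revision and make it the active one.
commit_revision <- function(handle, docs) {
  handle$counter <- handle$counter + 1L
  rev_id <- sprintf("rev-%03d", handle$counter)
  saveRDS(docs, file.path(handle$base_dir, paste0(rev_id, ".rds")))
  if (!handle$keep && length(handle$revisions) > 0L) {
    ## Revisions are not kept: drop earlier stages from disk.
    unlink(file.path(handle$base_dir, paste0(handle$revisions, ".rds")))
    handle$revisions <- character(0)
  }
  handle$revisions <- c(handle$revisions, rev_id)
  handle$active <- rev_id
  handle
}

## Create the handle and store the initial stage of the corpus.
new_toy_corpus <- function(docs, keep = TRUE,
                           base_dir = tempfile("toy_dc_")) {
  dir.create(base_dir, recursive = TRUE)
  handle <- list(base_dir = base_dir, keys = names(docs),
                 revisions = character(0), active = NA_character_,
                 counter = 0L, keep = keep)
  commit_revision(handle, docs)
}

## Load the documents of one revision (the active one by default).
read_docs <- function(handle, revision = handle$active) {
  readRDS(file.path(handle$base_dir, paste0(revision, ".rds")))
}

## Apply FUN to every document; the result becomes a new revision.
toy_map <- function(handle, FUN) {
  commit_revision(handle, lapply(read_docs(handle), FUN))
}

docs <- list(doc1 = "Crude OIL prices rose", doc2 = "OPEC meets in Vienna")

## With keep = TRUE both the raw and the processed stage stay on disk.
tc <- new_toy_corpus(docs)
tc <- toy_map(tc, tolower)
tc$revisions
read_docs(tc, "rev-001")$doc1   # raw stage
read_docs(tc)$doc1              # active (lower-cased) stage

## With keep = FALSE only the latest stage remains on disk.
tc2 <- new_toy_corpus(docs, keep = FALSE)
tc2 <- toy_map(tc2, tolower)
list.files(tc2$base_dir)
```

The handle stays small regardless of corpus size, which is why keeping several stages is affordable when disk space rather than main memory is the limit. In this sketch, switching keep to FALSE trades the ability to return to an earlier stage for lower disk usage.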

The constructed corpus object inherits from a tm Corpus and has several slots containing meta information:

meta

Corpus metadata: corpus-specific metadata in the form of tag-value pairs.

dmeta

Document metadata: a data.frame containing document-specific metadata for the corpus. It is provided mainly for compatibility with standard tm corpus definitions and is not yet actually used in the distributed scenario.

keep

A logical indicating whether revisions, i.e., stages of the corpus such as those produced in a preprocessing chain, should be kept.


See Also

Corpus for basic information on the corpus infrastructure employed by package tm.

Examples

# NOT RUN {
## Similar to example in package 'tm'
library("tm")
library("tm.plugin.dc")
reut21578 <- system.file("texts", "crude", package = "tm")
dc <- DCorpus(DirSource(reut21578),
              readerControl = list(reader = readReut21578XMLasPlain))
dc

## Coercion
data("crude")
as.DCorpus(crude)
as.VCorpus(dc)
# }
