Data structures and operators for distributed corpora.
DCorpus( x,
         readerControl = list(reader   = reader(x),
                              language = "en"),
         storage = NULL, keep = TRUE, ... )
# S3 method for DCorpus
as.VCorpus(x)
as.DCorpus( x, storage = NULL, ... )
readerControl: A list with the named components reader,
    a reading function capable of handling the file format
    found in x, and language, giving the text's language
    (preferably as IETF language tags, see language in
    package NLP).
storage: The storage subsystem to use with the DCorpus. Two storage types are currently supported: local disk storage using the Local File System (LFS) and the Hadoop Distributed File System (HDFS). Default: 'LFS'.
keep: Should revisions be used when operating on the
    DCorpus? Default: TRUE.
...: Optional arguments for the reader.
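As an illustration of the storage argument, the following sketch builds a corpus on an explicitly created local (LFS) storage. It rests on assumptions not stated on this page: that the functions documented here are provided by package tm.plugin.dc, that the storage object is created with DStorage() from package DSL (arguments type and base_dir), and that DCorpus() accepts such an object as storage. The name dc_local is purely illustrative.

```r
library("DSL")           # assumed to provide DStorage()
library("tm")            # DirSource(), readReut21578XMLasPlain
library("tm.plugin.dc")  # assumed home of DCorpus()

## Assumption: a local-file-system storage rooted in a temporary directory.
ds <- DStorage(type = "LFS", base_dir = tempdir())

## Same input files as in the example section of this page.
reut21578 <- system.file("texts", "crude", package = "tm")

## Assumption: the DStorage object can be passed directly as 'storage'.
dc_local <- DCorpus(DirSource(reut21578),
                    readerControl = list(reader = readReut21578XMLasPlain),
                    storage = ds)
dc_local
```

For HDFS the same pattern would presumably apply with type = "HDFS" and a base directory on the Hadoop file system, but that requires a working Hadoop installation and is not shown here.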
Returns an object inheriting from DCorpus and Corpus.
When constructing a distributed corpus, the input source is
  extracted via the supplied reader and stored on the given file
  system (argument storage). While the data set resides on the
  corresponding storage (e.g., HDFS), R holds only a symbolic
  representation (a so-called DList), and the corpus is accessed
  via the corresponding DList methods. Because the space available
  to a distributed corpus is limited only by the disk space of the
  given storage (and not by main memory, as in a standard tm corpus),
  a set of so-called revisions, i.e., stages of the (processed)
  corpus, is also stored by default. Revisions can be turned off
  later on using the keepRevisions() replacement function.
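The two ways of controlling revisions mentioned here can be sketched as follows. This is a minimal sketch, assuming the functions come from package tm.plugin.dc and that the sample files shipped with package tm are installed. The object names dc and dc_norev are illustrative only.

```r
library("tm")            # DirSource(), readReut21578XMLasPlain
library("tm.plugin.dc")  # assumed home of DCorpus() and keepRevisions<-

## Sample Reuters files shipped with tm.
reut21578 <- system.file("texts", "crude", package = "tm")

## Option 1: disable revisions at construction time via 'keep'.
dc_norev <- DCorpus(DirSource(reut21578),
                    readerControl = list(reader = readReut21578XMLasPlain),
                    keep = FALSE)

## Option 2: construct with the default (keep = TRUE) and switch
## revisions off later with the replacement function.
dc <- DCorpus(DirSource(reut21578),
              readerControl = list(reader = readReut21578XMLasPlain))
keepRevisions(dc) <- FALSE
```

Keeping revisions trades disk space on the storage for the ability to return to earlier stages of a preprocessing chain. Switching them off is the more economical choice when only the final stage is needed.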
The constructed corpus object inherits from a tm
  Corpus and has several slots containing meta
  information:
meta: Corpus Meta Data. Contains corpus-specific metadata in the form of tag-value pairs.
dmeta: Document Meta Data of class data.frame. Contains
      document-specific metadata for the corpus. It exists mainly
      for compatibility with standard tm corpus definitions and is
      not yet used in the distributed scenario.
keep: A logical indicating whether revisions representing stages, e.g., in a preprocessing chain, should be kept.
See Corpus for basic information on the corpus infrastructure
  employed by package tm.
# NOT RUN {
## Similar to example in package 'tm'
reut21578 <- system.file("texts", "crude", package = "tm")
## Construct a distributed corpus with the constructor documented above
dc <- DCorpus(DirSource(reut21578),
              readerControl = list(reader = readReut21578XMLasPlain))
dc
## Coercion
data("crude")
as.DCorpus(crude)
as.VCorpus(dc)
# }