Corpus: such packages then provide S3 corpus classes extending the
virtual base class (such as VCorpus provided by package tm itself).

All extension classes must provide accessors to extract subsets ([),
individual documents ([[), and metadata (meta). The function length must
return the number of documents, and as.list must construct a list holding
the documents.
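
The accessors listed above can be tried out on a VCorpus, the extension
class shipped with tm. The following is only a minimal sketch: the three
document texts and all variable names are made up for illustration, and it
assumes package tm is installed.

```r
library(tm)

# Three toy documents; the texts are invented for this example.
texts <- c("Corpora are collections of documents.",
           "Each document contains text.",
           "Metadata describes documents.")
corp <- VCorpus(VectorSource(texts))

sub  <- corp[1:2]      # `[` extracts a subset, which is again a corpus
doc  <- corp[[3]]      # `[[` extracts an individual document
n    <- length(corp)   # number of documents in the corpus
docs <- as.list(corp)  # plain list holding the documents

inherits(sub, "Corpus")  # the subset still extends the virtual base class
as.character(doc)        # text content of the extracted document
```

Other corpus classes that extend Corpus are expected to support the same
calls, so code written against these generics should not need to know
which concrete class it receives.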
A corpus can have two types of metadata (accessible via meta). Corpus
metadata contains corpus specific metadata in the form of tag-value pairs.
Document level metadata contains document specific metadata but is stored
in the corpus as a data frame. Document level metadata is typically used
for semantic reasons (e.g., classifications of documents form an entity of
their own due to some high-level information like the range of possible
values) or for performance reasons (a single access instead of extracting
the metadata of each document).
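
The two kinds of metadata can be illustrated with the meta accessor on a
VCorpus. This is a sketch under assumptions: the tags "creator" and
"topic" and their values are hypothetical, and the type arguments
"corpus", "indexed" and "local" are those of the meta method for VCorpus
in package tm.

```r
library(tm)

corp <- VCorpus(VectorSource(c("First text.", "Second text.")))

# Corpus metadata: tag-value pairs describing the corpus as a whole.
# The tag "creator" and its value are hypothetical.
meta(corp, "creator", type = "corpus") <- "Example Author"
corpus_meta <- meta(corp, type = "corpus")

# Document level metadata: a data frame stored in the corpus with one row
# per document. The tag "topic" is hypothetical.
meta(corp, "topic", type = "indexed") <- c("a", "b")
doc_level <- meta(corp, type = "indexed")

# Single access to a whole column of document level metadata ...
topics <- meta(corp, "topic", type = "indexed")

# ... versus extracting metadata stored in each individual document.
local_ids <- meta(corp, "id", type = "local")
```

Here doc_level is a data frame, which is what makes the single access
possible, whereas local_ids is assembled by visiting every document in
turn.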
VCorpus and PCorpus for the corpus classes provided by package tm.
DCorpus for a distributed corpus class provided by package tm.plugin.dc.