
Extensions of base R functions for corpus objects.
# S3 method for corpus
+(c1, c2)
# S3 method for corpus
c(..., recursive = FALSE)
# S3 method for corpus
[(x, i, drop_docid = TRUE)
# S3 method for summary.corpus
print(x, ...)
The + and c() operators return a corpus() object.
Indexing a corpus works in three ways, as of v2.x.x:
[ returns a subsetted corpus
[[ returns the textual contents of a subsetted corpus (similar to as.character())
$ returns a vector containing the single named docvars
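As a rough illustration of the three indexing forms, the sketch below uses the data_corpus_inaugural object that ships with quanteda. The Year document variable is an assumption about that corpus (it is present in recent releases); substitute any docvar name that exists in your version.

```r
library(quanteda)

# `[` keeps the corpus class and returns only the selected documents
sub <- data_corpus_inaugural[1:2]
class(sub)
ndoc(sub)

# `[[` returns the text of the selected document as a character vector
txt <- data_corpus_inaugural[["1793-Washington"]]
is.character(txt)

# `$` extracts one document variable as a vector
# (Year is assumed to exist in this corpus)
years <- data_corpus_inaugural$Year
head(years)
```

Use [ when later steps still need a corpus (for example tokens() or summary()), and [[ or $ when only the raw text or a single variable is needed.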
c1: corpus one to be added
c2: corpus two to be added
recursive: logical used by c() method, always set to FALSE
x: a corpus object
i: document names or indices for documents to extract.
drop_docid: if TRUE, drop docid for documents removed as the result of extraction.
The + operator for a corpus object combines two corpus objects. Any docvars() field that is present in only one corpus is filled with NA values for the documents of the corpus lacking that field. Corpus-level metadata is concatenated, except for source and notes, which are stamped with information pertaining to the creation of the new joined corpus.
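The NA filling described above can be sketched with two toy corpora whose document variables do not overlap. The document names, texts, and the party and year variables are hypothetical and exist only for this illustration.

```r
library(quanteda)

# Two toy corpora with different document variables (hypothetical data)
corp_a <- corpus(c(doc_a = "First text."),
                 docvars = data.frame(party = "Labour"))
corp_b <- corpus(c(doc_b = "Second text."),
                 docvars = data.frame(year = 2010))

# Combine them with the + method for corpus objects
combined <- corp_a + corp_b

# Each corpus lacks one field, so that field is NA for its document
docvars(combined)
```

The document names are kept distinct on purpose, since combining corpora that share document names may be rejected.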
The c() operator is also defined for corpus class objects, and provides an easy way to combine multiple corpus objects.
Future revisions of quanteda need to address the use of factors to store document variables and metadata. Currently most or all of these are not recorded as factors, because the data.frame() calls that create and store the document-level information use stringsAsFactors=FALSE, so that the texts are always stored as character vectors and never as factors.
summary.corpus()
# concatenate corpus objects
corp1 <- corpus(data_char_ukimmig2010[1:2])
corp2 <- corpus(data_char_ukimmig2010[3:4])
corp3 <- corpus(data_char_ukimmig2010[5:6])
summary(c(corp1, corp2, corp3))
# two ways to index corpus elements
data_corpus_inaugural["1793-Washington"]
data_corpus_inaugural[2]
# return the text itself
data_corpus_inaugural[["1793-Washington"]]