Extensions of base R functions for corpus objects.
# S3 method for corpus
print(x, ...)
is.corpus(x)
is.corpuszip(x)
# S3 method for summary.corpus
print(x, ...)
# S3 method for corpus
c1 + c2
# S3 method for corpus
c(..., recursive = FALSE)
# S3 method for corpus
x[i, j = NULL, ..., drop = TRUE]
# S3 method for corpus
x[[i, ...]]
# S3 method for corpus
x[[i]] <- value
# S3 method for corpus
str(object, ...)
x: a corpus object
...: not used
c1: corpus one to be added
c2: corpus two to be added
recursive: logical used by `c()` method, always set to `FALSE`
i: index for documents or rows of document variables
j: index for column of document variables
drop: if `TRUE`, return a vector if extracting a single document variable; if `FALSE`, return it as a single-column data.frame. See `drop` for further details.
value: a vector that will form a new docvar
object: the corpus about which you want structural information
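The effect of `drop` mirrors ordinary data frame indexing in base R. A minimal sketch, assuming a plain data frame stands in for a corpus's document variables (the `dv` object and its columns are made up for illustration and are not part of quanteda):

```r
# Hypothetical document variables, stored as an ordinary data frame
dv <- data.frame(
  party = c("FF", "FG", "Greens"),
  year = c(2010L, 2010L, 2010L),
  stringsAsFactors = FALSE
)

# drop = TRUE: a single extracted column comes back as a plain vector
as_vector <- dv[, "year", drop = TRUE]

# drop = FALSE: the same column stays a single-column data.frame
as_frame <- dv[, "year", drop = FALSE]

is.vector(as_vector)
is.data.frame(as_frame)
```

Keeping `drop = FALSE` is useful when later code expects a data frame regardless of how many columns were selected.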
`is.corpus` returns `TRUE` if the object is a corpus.
`is.corpuszip` returns `TRUE` if the object is a compressed corpus.
The `+` operator for a corpus object will combine two corpus objects, resolving any non-matching `docvars` or `metadoc` fields by making them into `NA` values for the corpus lacking that field. Corpus-level meta data is concatenated, except for `source` and `notes`, which are stamped with information pertaining to the creation of the new joined corpus.
The `c()` method is also defined for corpus class objects, and provides an easy way to combine multiple corpus objects.
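The way non-matching document variables are padded with `NA` can be pictured with plain data frames. The sketch below is an illustration of the idea in base R only, not quanteda's implementation; the helper `fill_missing()` and the objects `dv1` and `dv2` are hypothetical names introduced for this example.

```r
# Hypothetical helper: add any missing columns to `df`, filled with NA,
# and return the columns in the order given by `cols`
fill_missing <- function(df, cols) {
  for (nm in setdiff(cols, names(df))) {
    df[[nm]] <- NA
  }
  df[cols]
}

# Two sets of document variables with no field in common
dv1 <- data.frame(party = c("FF", "FG"), stringsAsFactors = FALSE)
dv2 <- data.frame(year = c(2010L, 2010L), stringsAsFactors = FALSE)

# Pad each side to the union of fields, then stack the rows
all_cols <- union(names(dv1), names(dv2))
combined <- rbind(fill_missing(dv1, all_cols), fill_missing(dv2, all_cols))

combined
```

Rows that came from `dv1` have `NA` for `year`, and rows that came from `dv2` have `NA` for `party`, which is the behaviour described for fields that one corpus lacks.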
There are some issues that need to be addressed in future revisions of
quanteda concerning the use of factors to store document variables and
meta-data. Currently most or all of these are not recorded as factors:
the `data.frame` calls that create and store the document-level
information use `stringsAsFactors = FALSE`, since the texts should always
be stored as character vectors and never as factors.
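The difference this setting makes can be checked in base R. A small sketch, assuming two made-up texts (the `texts`, `as_chr`, and `as_fct` names are illustrative only):

```r
# Two made-up documents
texts <- c("First document.", "Second document.")

# With stringsAsFactors = FALSE the column remains a character vector
as_chr <- data.frame(text = texts, stringsAsFactors = FALSE)

# With stringsAsFactors = TRUE the column is converted to a factor,
# so each text is stored as an integer code plus a set of levels
as_fct <- data.frame(text = texts, stringsAsFactors = TRUE)

is.character(as_chr$text)
is.factor(as_fct$text)
```

Passing the argument explicitly, as the note describes, keeps the result independent of the session's default for `stringsAsFactors`.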
library(quanteda)
# concatenate corpus objects
corpus1 <- corpus(data_char_ukimmig2010[1:2])
corpus2 <- corpus(data_char_ukimmig2010[3:4])
corpus3 <- corpus(data_char_ukimmig2010[5:6])
summary(c(corpus1, corpus2, corpus3))
# ways to index corpus elements
data_corpus_inaugural["1793-Washington"] # 2nd Washington inaugural speech
data_corpus_inaugural[2] # same
# access the docvars from data_corpus_irishbudget2010
data_corpus_irishbudget2010[, "year"]
# same
data_corpus_irishbudget2010[["year"]]
# create a new document variable
data_corpus_irishbudget2010[["govtopp"]] <-
    ifelse(data_corpus_irishbudget2010[["party"]] %in% c("FF", "Greens"),
           "Government", "Opposition")
docvars(data_corpus_irishbudget2010)