Creates a corpus object from available sources. The currently available sources are:
a character vector, consisting of one document per element; if the elements are named, these names will be used as document names.
a data.frame (or a tibble tbl_df), whose default document id is a variable identified by docid_field; the text of the document is a variable identified by text_field; and other variables are imported as document-level meta-data. This matches the format of data.frames constructed by the readtext package.
a kwic object constructed by kwic.
a tm VCorpus or SimpleCorpus class object, with the fixed metadata fields imported as docvars and corpus-level metadata imported as metacorpus information.
a corpus object.
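As a minimal sketch of the first two sources listed above, the following builds one corpus from a named character vector and one from a data.frame. The object names (txts, df, corp_chr, corp_df) and the toy texts are illustrative assumptions, not part of the package.

```r
library(quanteda)

# A named character vector: element names become document names.
txts <- c(speech_a = "The economy is growing.",
          speech_b = "Immigration policy needs reform.")
corp_chr <- corpus(txts)
docnames(corp_chr)

# A data.frame: "doc_id" and "text" are the default id and text columns;
# the remaining column ("party") is imported as a document variable.
df <- data.frame(doc_id = c("d1", "d2"),
                 text = c("First document.", "Second document."),
                 party = c("A", "B"),
                 stringsAsFactors = FALSE)
corp_df <- corpus(df)
docnames(corp_df)
docvars(corp_df)
```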
corpus(x, ...)
# S3 method for corpus
corpus(x, docnames = quanteda::docnames(x),
  docvars = quanteda::docvars(x), metacorpus = quanteda::metacorpus(x),
  compress = FALSE, ...)
# S3 method for character
corpus(x, docnames = NULL, docvars = NULL,
  metacorpus = NULL, compress = FALSE, ...)
# S3 method for data.frame
corpus(x, docid_field = "doc_id", text_field = "text",
  metacorpus = NULL, compress = FALSE, ...)
# S3 method for kwic
corpus(x, split_context = TRUE, extract_keyword = TRUE, ...)
# S3 method for Corpus
corpus(x, metacorpus = NULL, compress = FALSE, ...)
x
a valid corpus source object
...
not used directly
docnames
Names to be assigned to the texts. Defaults to the names of the character vector (if any); doc_id for a data.frame; the document names in a tm corpus; or a vector of user-supplied labels equal in length to the number of documents. If none of these are found, then "text1", "text2", etc. are assigned automatically.
docvars
a data.frame of document-level variables associated with each text
metacorpus
a named list containing additional (character) information to be added to the corpus as corpus-level metadata. Special fields recognized in summary.corpus are: source, a description of the source of the texts, used for referencing; citation, information on how to cite the corpus; and notes, any additional information about who created the text, warnings, to do lists, etc.
compress
logical; if TRUE, compress the texts in memory using gzip compression. This significantly reduces the size of the corpus in memory, but will slow down operations that require the texts to be extracted.
docid_field
optional column index of a document identifier; defaults to "doc_id", but if this is not found, the row names of the data.frame are used; if the row names are not set, the default sequence based on quanteda_options("base_docname") is used.
text_field
the character name or numeric index of the source data.frame indicating the variable to be read in as text, which must be a character vector. All other variables in the data.frame will be imported as docvars. This argument is only used for data.frame objects (including those created by readtext).
split_context
logical; if TRUE, split each kwic row into two "documents", one for "pre" and one for "post", with this designation saved in a new docvar context and with the new number of documents therefore being twice the number of rows in the kwic.
extract_keyword
logical; if TRUE, save the keyword matching pattern as a new docvar keyword.
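The fallback rules for document names described in the arguments above can be checked with a short sketch. The data and object names (corp_default, df, corp_rows) are hypothetical, and the "text1", "text2" fallback assumes the base_docname option has not been changed from its default.

```r
library(quanteda)

# No names supplied: document names fall back to "text1", "text2", ...
corp_default <- corpus(c("First text.", "Second text."))
docnames(corp_default)

# No "doc_id" column: the row names of the data.frame are used instead.
# text_field points at the column holding the texts; "year" becomes a docvar.
df <- data.frame(body = c("Alpha text.", "Beta text."),
                 year = c(2009L, 2010L),
                 row.names = c("rowA", "rowB"),
                 stringsAsFactors = FALSE)
corp_rows <- corpus(df, text_field = "body")
docnames(corp_rows)
docvars(corp_rows)
```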
A corpus-class object containing the original texts, document-level variables, document-level metadata, corpus-level metadata, and default settings for subsequent processing of the corpus.
A corpus currently consists of an S3 specially classed list of elements, but you should not access these elements directly. Use the extractor and replacement functions instead, or else your code is not only going to be uglier, but also likely to break should the internal structure of a corpus object change (as it inevitably will as we continue to develop the package, including moving corpus objects to the S4 class system).
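A brief sketch of what working through the extractor and replacement functions might look like, rather than reaching into the list elements. The corpus contents and the docvar name lang are made up for illustration.

```r
library(quanteda)

corp <- corpus(c(a = "One short text.", b = "Another short text."))

# Extractors: read parts of the corpus without touching its internals.
ndoc(corp)        # number of documents
docnames(corp)    # document names

# Replacement functions: modify the corpus through the public interface.
docnames(corp) <- c("doc_one", "doc_two")
docvars(corp, "lang") <- c("en", "en")
docvars(corp)
```

Because the code never refers to the internal list structure, it should be unaffected if that structure changes.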
The texts and document variables of corpus objects can also be accessed using index notation. Indexing a corpus object as a vector will return its text, equivalent to texts(x). Note that this is not the same as subsetting the entire corpus; this should be done using the subset method for a corpus.
Indexing a corpus using two indexes (integers or column names) will return the document variables, equivalent to docvars(x). It is also possible to access, create, or replace docvars using list notation, e.g. myCorpus[["newSerialDocvar"]] <- paste0("tag", 1:ndoc(myCorpus)).
For details, see corpus-class.
corpus-class, docvars, metadoc, metacorpus, settings, texts, ndoc, docnames
# NOT RUN {
# create a corpus from texts
corpus(data_char_ukimmig2010)
# create a corpus from texts and assign document variables
summary(corpus(data_char_ukimmig2010,
               docvars = data.frame(party = names(data_char_ukimmig2010))), 5)
corpus(texts(data_corpus_irishbudget2010))
# import a tm VCorpus
if (requireNamespace("tm", quietly = TRUE)) {
    data(crude, package = "tm")    # load in a tm example VCorpus
    mytmCorpus <- corpus(crude)
    summary(mytmCorpus, showmeta = TRUE)

    data(acq, package = "tm")
    summary(corpus(acq), 5, showmeta = TRUE)

    tmCorp <- tm::VCorpus(tm::VectorSource(data_char_ukimmig2010))
    quantCorp <- corpus(tmCorp)
    summary(quantCorp)
}
# construct a corpus from a data.frame
mydf <- data.frame(letter_factor = factor(rep(letters[1:3], each = 2)),
                   some_ints = 1L:6L,
                   some_text = paste0("This is text number ", 1:6, "."),
                   stringsAsFactors = FALSE,
                   row.names = paste0("fromDf_", 1:6))
mydf
summary(corpus(mydf, text_field = "some_text",
               metacorpus = list(source = "From a data.frame called mydf.")))
# construct a corpus from a kwic object
mykwic <- kwic(data_corpus_inaugural, "southern")
summary(corpus(mykwic))
# from a kwic with a wildcard pattern, with and without splitting the context
kw <- kwic(data_char_sampletext, "econom*")
summary(corpus(kw))
summary(corpus(kw, split_context = FALSE))
texts(corpus(kw, split_context = FALSE))
# }