
vocab() creates a vocabulary from a text corpus; update_vocab() and
prune_vocab() update and prune an existing vocabulary, respectively.
vocab(corpus, ngram = c(1, 1), ngram_sep = "_",
      regex = "[[:space:]]+")

update_vocab(vocab, corpus)

prune_vocab(vocab, max_terms = Inf, term_count_min = 1L,
            term_count_max = Inf, doc_proportion_min = 0,
            doc_proportion_max = 1, doc_count_min = 1L, doc_count_max = Inf,
            nbuckets = attr(vocab, "nbuckets"))
corpus: a collection of ASCII or UTF-8 encoded documents. It can be a list of character vectors, a character vector, or a data.frame with at least two columns: id and documents. See details.
ngram: a vector of length 2 of the form c(min_ngram, max_ngram), or a singleton max_ngram, which is equivalent to c(1L, max_ngram).
ngram_sep: separator used to link terms within ngrams.
regex: a regexp used for segmentation of documents when corpus is a character vector; ignored otherwise. Defaults to a set of basic white-space separators. NULL means no segmentation. The regexp grammar is the extended ECMAScript grammar as implemented in C++11.
vocab: a data.frame obtained from a call to vocab().
max_terms: maximum number of terms to preserve.
term_count_min: keep terms occurring at least this many times over all docs.
term_count_max: keep terms occurring at most this many times over all docs.
doc_proportion_min: keep terms appearing in at least this proportion of docs.
doc_proportion_max: keep terms appearing in at most this proportion of docs.
doc_count_min: keep terms appearing in at least this many docs.
doc_count_max: keep terms appearing in at most this many docs.
nbuckets: how many unknown buckets to create alongside the remaining terms of the pruned vocab. All pruned terms are hashed into this many buckets and the corresponding statistics (term_count and doc_count) updated.
When corpus is a character vector, each string is tokenized with regex by the
internal tokenizer. When corpus has names, the names are used to name the
output whenever appropriate.
When corpus is a data.frame, the documents must be in the last column, which
can be either a list of strings or a character vector. All other columns are
considered document ids. If the first column is a character vector, most
functions will use it to name the output.
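For instance, a data.frame corpus could be built as in the following sketch (the column names id and text are illustrative; only the column positions matter):

df_corpus <- data.frame(id = c("a", "b"), stringsAsFactors = FALSE)
df_corpus$text <- list(c("the", "quick", "brown", "fox"),
                       c("jumps", "over", "the", "lazy", "dog"))
vocab(df_corpus)  # all columns before the last are treated as document ids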
https://en.cppreference.com/w/cpp/regex/ecmascript
corpus <-
  list(a = c("The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"),
       b = c("the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog",
             "the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"))
vocab(corpus)
vocab(corpus, ngram = 3)
vocab(corpus, ngram = c(2, 3))
v <- vocab(corpus)
extra_corpus <- list(extras = c("apples", "oranges"))
v <- update_vocab(v, extra_corpus)
v
prune_vocab(v, max_terms = 7)
prune_vocab(v, term_count_min = 2)
prune_vocab(v, max_terms = 7, nbuckets = 2)
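# A character-vector corpus also works (a sketch): each string is segmented
# with `regex`, which defaults to basic white-space separators.
chr_corpus <- c(a = "The quick brown fox jumps over the lazy dog",
                b = "the quick brown fox jumps over the lazy dog")
vocab(chr_corpus)
vocab(chr_corpus, regex = NULL)  # NULL means no segmentation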