create_vocabulary
From text2vec v0.3.0
by Dmitriy Selivanov
Creates a vocabulary of unique terms
This function collects unique terms and corresponding statistics. See the Value section below for details.
Usage
create_vocabulary(itoken_src, ngram = c(ngram_min = 1L, ngram_max = 1L),
  stopwords = character(0))

vocabulary(itoken_src, ngram = c(ngram_min = 1L, ngram_max = 1L),
  stopwords = character(0))

## S3 method for class 'character'
create_vocabulary(itoken_src, ngram = c(ngram_min = 1L, ngram_max = 1L),
  stopwords = character(0))

## S3 method for class 'itoken'
create_vocabulary(itoken_src, ngram = c(ngram_min = 1L, ngram_max = 1L),
  stopwords = character(0))

## S3 method for class 'list'
create_vocabulary(itoken_src, ngram = c(ngram_min = 1L, ngram_max = 1L),
  stopwords = character(0), ...)
Arguments
- itoken_src: iterator over a list of character vectors, which are the documents from which the user wants to construct a vocabulary. Alternatively, a character vector of user-defined vocabulary terms (which will be used "as is").
- ngram: integer vector. The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that ngram_min <= n <= ngram_max will be used (see the sketch after this list).
- stopwords: character vector of stopwords to filter out.
- ...: additional arguments passed to the foreach function.
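A minimal sketch of the ngram and stopwords arguments, using the bundled movie_review data set; it collects unigrams and bigrams while filtering two illustrative stopwords:

library(text2vec)
data("movie_review")
# iterator over the first 100 lowercased, word-tokenized reviews
it <- itoken(movie_review[['review']][1:100], tolower, word_tokenizer)
# all n-grams with ngram_min <= n <= ngram_max, i.e. 1 <= n <= 2 here
bigram_vocab <- create_vocabulary(it,
  ngram = c(ngram_min = 1L, ngram_max = 2L),
  stopwords = c("the", "a"))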
Value
text2vec_vocabulary object, which is actually a list with the following fields (inspected in the sketch below):
1. vocab: a data.frame which contains the columns:
   - terms: character vector of unique terms
   - terms_counts: integer vector of term counts across all documents
   - doc_counts: integer vector of document counts that contain the corresponding term
2. ngram: integer vector, the lower and upper boundary of the range of n-gram values
3. document_count: integer, the number of documents the vocabulary was built from
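A minimal sketch of inspecting these fields, using the bundled movie_review data set:

library(text2vec)
data("movie_review")
it <- itoken(movie_review[['review']][1:100], tolower, word_tokenizer)
vocab <- create_vocabulary(it)
head(vocab$vocab)     # data.frame with terms, terms_counts, doc_counts columns
vocab$ngram           # lower/upper n-gram boundaries, c(1L, 1L) by default
vocab$document_count  # 100: number of documents the vocabulary was built from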
Methods (by class)
- character: creates a text2vec_vocabulary from a predefined character vector. Terms will be inserted "as is", without any checks (n-gram number, n-gram delimiters, etc.).
- itoken: collects unique terms and corresponding statistics from the object.
- list: collects unique terms and corresponding statistics from a list of itoken iterators. If a parallel backend is registered, it will build the vocabulary in parallel using foreach (see the sketch after this list).
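A minimal sketch of the character and list methods, assuming the doParallel package as the registered parallel backend; the input is split across two itoken iterators so foreach can process the chunks in parallel:

library(text2vec)
library(doParallel)
registerDoParallel(2)                  # register a 2-worker backend
data("movie_review")
txt <- movie_review[['review']][1:100]
# list method: one itoken iterator per worker; results are merged
it_list <- list(
  itoken(txt[1:50], tolower, word_tokenizer),
  itoken(txt[51:100], tolower, word_tokenizer)
)
vocab <- create_vocabulary(it_list)
# character method: a predefined term vector is inserted "as is"
fixed_vocab <- create_vocabulary(c("alice", "rabbit", "queen"))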
Examples
data("movie_review")
txt <- movie_review[['review']][1:100]
it <- itoken(txt, tolower, word_tokenizer, chunks_number = 10)
vocab <- create_vocabulary(it)
pruned_vocab = prune_vocabulary(vocab, term_count_min = 10,
doc_proportion_max = 0.8, doc_proportion_min = 0.001, max_number_of_terms = 20000)