text2vec (version 0.3.0)

create_vocabulary: Creates a vocabulary of unique terms

Description

This function collects unique terms and corresponding statistics. See the sections below for details.

Usage

create_vocabulary(itoken_src, ngram = c(ngram_min = 1L, ngram_max = 1L), stopwords = character(0))

vocabulary(itoken_src, ngram = c(ngram_min = 1L, ngram_max = 1L), stopwords = character(0))

## S3 method for class 'character'
create_vocabulary(itoken_src, ngram = c(ngram_min = 1L, ngram_max = 1L), stopwords = character(0))

## S3 method for class 'itoken'
create_vocabulary(itoken_src, ngram = c(ngram_min = 1L, ngram_max = 1L), stopwords = character(0))

## S3 method for class 'list'
create_vocabulary(itoken_src, ngram = c(ngram_min = 1L, ngram_max = 1L), stopwords = character(0), ...)

Arguments

itoken_src
iterator over a list of character vectors, which are the documents from which the user wants to construct a vocabulary. Alternatively, a character vector of user-defined vocabulary terms (which will be used "as is").
ngram
integer vector. The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that ngram_min <= n <= ngram_max will be used (see the sketch after this argument list).
stopwords
character vector of stopwords to filter out.
...
additional arguments passed to the foreach function.
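As an illustration of the ngram and stopwords arguments, the following is a minimal sketch that builds a unigram-plus-bigram vocabulary while filtering a few common words; the movie_review data ships with text2vec, but the stopword list here is an arbitrary assumption for demonstration:

library(text2vec)
data("movie_review")
it <- itoken(movie_review[['review']][1:100], tolower, word_tokenizer)
# extract all n-grams with 1 <= n <= 2, dropping a small ad-hoc stopword list
v <- create_vocabulary(it, ngram = c(ngram_min = 1L, ngram_max = 2L),
  stopwords = c("the", "a", "an", "and", "of"))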

Value

text2vec_vocabulary object, which is actually a list with the following fields (the sketch below shows how to inspect them):

1. vocab: a data.frame which contains the columns
  • terms: character vector of unique terms
  • terms_counts: integer vector of term counts across all documents
  • doc_counts: integer vector of counts of documents that contain the corresponding term
2. ngram: integer vector, the lower and upper boundary of the range of n-gram values.
3. document_count: integer, the number of documents the vocabulary was built from.
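For example, the fields of a freshly built vocabulary can be inspected directly (a minimal sketch using the bundled movie_review data):

library(text2vec)
data("movie_review")
it <- itoken(movie_review[['review']][1:100], tolower, word_tokenizer)
vocab <- create_vocabulary(it)
head(vocab$vocab)     # data.frame with terms, terms_counts, doc_counts
vocab$ngram           # c(ngram_min = 1L, ngram_max = 1L) by default
vocab$document_count  # 100 documents processed here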

Methods (by class)

  • character: creates a text2vec_vocabulary from a predefined character vector. Terms will be inserted as is, without any checks (n-gram number, n-gram delimiters, etc.).
  • itoken: collects unique terms and corresponding statistics from an itoken iterator.
  • list: collects unique terms and corresponding statistics from a list of itoken iterators. If a parallel backend is registered, the vocabulary will be built in parallel using foreach. A sketch of the character and list methods follows this list.
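A minimal sketch of the character and list methods; the term values and the 50/50 document split are arbitrary assumptions:

library(text2vec)
data("movie_review")
txt <- movie_review[['review']][1:100]

# character method: vocabulary built from predefined terms, inserted as is
v_chr <- create_vocabulary(c("alpha", "beta", "gamma"))

# list method: several itoken iterators; if a parallel backend is
# registered (e.g. via doParallel), the parts are processed with foreach
it1 <- itoken(txt[1:50], tolower, word_tokenizer)
it2 <- itoken(txt[51:100], tolower, word_tokenizer)
v_lst <- create_vocabulary(list(it1, it2))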

Examples

data("movie_review")
txt <- movie_review[['review']][1:100]
it <- itoken(txt, tolower, word_tokenizer, chunks_number = 10)
vocab <- create_vocabulary(it)
pruned_vocab = prune_vocabulary(vocab, term_count_min = 10,
 doc_proportion_max = 0.8, doc_proportion_min = 0.001, max_number_of_terms = 20000)
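A common next step, sketched below rather than part of the original example, is to feed the pruned vocabulary into vocab_vectorizer and build a document-term matrix; itoken iterators are typically consumed after a full pass, so the iterator is recreated first:

# recreate the iterator, then vectorize using the pruned vocabulary
it <- itoken(txt, tolower, word_tokenizer, chunks_number = 10)
vectorizer <- vocab_vectorizer(pruned_vocab)
dtm <- create_dtm(it, vectorizer)
dim(dtm)  # 100 documents x number of terms kept in pruned_vocab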
