text2vec (version 0.4.0)

create_vocabulary: Creates a vocabulary of unique terms

Description

This function collects unique terms and corresponding statistics from documents. See the Value section below for details.

Usage

create_vocabulary(it, ngram = c(ngram_min = 1L, ngram_max = 1L),
  stopwords = character(0), sep_ngram = "_")

vocabulary(it, ngram = c(ngram_min = 1L, ngram_max = 1L),
  stopwords = character(0), sep_ngram = "_")

# S3 method for character
create_vocabulary(it, ngram = c(ngram_min = 1L, ngram_max = 1L),
  stopwords = character(0), sep_ngram = "_")

# S3 method for itoken
create_vocabulary(it, ngram = c(ngram_min = 1L, ngram_max = 1L),
  stopwords = character(0), sep_ngram = "_")

# S3 method for list
create_vocabulary(it, ngram = c(ngram_min = 1L, ngram_max = 1L),
  stopwords = character(0), sep_ngram = "_", ...)

Arguments

it

iterator over a list of character vectors, which are the documents from which the user wants to construct a vocabulary. See itoken. Alternatively, a character vector of user-defined vocabulary terms (which will be used "as is").

ngram

integer vector. The lower and upper boundaries of the range of n-values for the n-grams to be extracted. All values of n such that ngram_min <= n <= ngram_max will be used; see the sketch after this argument list.

stopwords

character vector of stopwords to filter out

sep_ngram

a character string used to concatenate the words of an n-gram

...

additional arguments passed to the foreach function.
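
For instance, with ngram = c(ngram_min = 1L, ngram_max = 2L) the vocabulary holds both unigrams and bigrams, the latter glued together with sep_ngram. A minimal self-contained sketch (the two sentences are illustrative):

library(text2vec)
txt = c("the quick brown fox", "the quick red fox")
it = itoken(txt, tolower, word_tokenizer)
# unigrams plus bigrams; bigram parts are joined with "_",
# yielding terms such as "quick_brown" alongside "quick"
v = create_vocabulary(it, ngram = c(ngram_min = 1L, ngram_max = 2L),
                      sep_ngram = "_")
v$vocab$terms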

Value

A text2vec_vocabulary object, which is a list with the following fields:

1. vocab: a data.frame with the following columns:

  • terms: character vector of unique terms

  • terms_counts: integer vector of term counts across all documents

  • doc_counts: integer vector of the number of documents that contain the corresponding term

2. ngram: integer vector, the lower and upper boundaries of the range of n-gram values.

3. document_count: integer, the number of documents the vocabulary was built from.
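
A quick self-contained look at these fields (the two toy documents are illustrative):

library(text2vec)
it = itoken(c("a quick example", "another example"), tolower, word_tokenizer)
vocab = create_vocabulary(it)
str(vocab, max.level = 1)  # list with vocab, ngram and document_count
head(vocab$vocab)          # terms, terms_counts, doc_counts columns
vocab$document_count       # 2 documents here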

Methods (by class)

  • character: creates a text2vec_vocabulary from a predefined character vector. Terms are inserted as is, without any checks (n-gram order, n-gram delimiters, etc.); see the sketch at the end of the Examples section.

  • itoken: collects unique terms and corresponding statistics from an itoken iterator.

  • list: collects unique terms and corresponding statistics from a list of itoken iterators. If a parallel backend is registered, the vocabulary is built in parallel using foreach, as shown below.
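
A sketch of the list method with a parallel backend; the two-way split and the doParallel backend are illustrative choices, not requirements:

library(text2vec)
library(doParallel)
registerDoParallel(2)  # any foreach-compatible backend works
data("movie_review")
txt = movie_review[['review']][1:100]
# one itoken iterator per shard of the corpus
shards = split(txt, rep(1:2, length.out = length(txt)))
it_list = lapply(shards, itoken, tolower, word_tokenizer)
# per-shard vocabularies are built in parallel and combined
vocab = create_vocabulary(it_list)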

Examples

data("movie_review")
txt = movie_review[['review']][1:100]
it = itoken(txt, tolower, word_tokenizer, chunks_number = 10)
vocab = create_vocabulary(it)
pruned_vocab = prune_vocabulary(vocab, term_count_min = 10,
 doc_proportion_max = 0.8, doc_proportion_min = 0.001, max_number_of_terms = 20000)
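
# A predefined term list also works, via the character method;
# a minimal sketch (the terms are illustrative).
# Terms are taken as is: no tokenization, counting or n-gram checks.
sentiment_vocab = create_vocabulary(c("positive", "negative", "neutral"))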
