
text2vec (version 0.2.0)

vocabulary: Creates vocabulary (unique terms)

Description

Collects unique terms and corresponding statistics from the input object. See the Value section.

Usage

vocabulary(src, ngram = c(ngram_min = 1L, ngram_max = 1L), ...)

## S3 method for class 'character'
vocabulary(src, ngram = c(ngram_min = 1L, ngram_max = 1L), ...)

## S3 method for class 'itoken'
vocabulary(src, ngram = c(ngram_min = 1L, ngram_max = 1L), serialize_dir = NULL, ...)

Arguments

src
iterator over a list of character vectors (the documents from which the user wants to construct the vocabulary), or, alternatively, a character vector of user-defined vocabulary terms (which will be used as is).
ngram
integer vector. The lower and upper boundaries of the range of n-values for the n-grams to be extracted. All values of n such that ngram_min <= n <= ngram_max will be used (see the sketch after this argument list).
...
arguments passed to other methods (including the write_rds function).
serialize_dir
character, path to a directory where the tokenized input texts will be saved in serialized form as a side effect (illustrated at the end of the Examples section).
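
For instance, a minimal sketch of passing the ngram argument, reusing the itoken() call from the Examples section below; the c(1L, 2L) bounds are illustrative:

data("movie_review")
txt <- movie_review[['review']][1:100]
it <- itoken(txt, tolower, word_tokenizer, chunks_number = 10)
# collect all unigrams and bigrams: ngram_min = 1, ngram_max = 2
bigram_vocab <- vocabulary(it, ngram = c(ngram_min = 1L, ngram_max = 2L))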

Value

  • text2vec_vocabulary object, which is actually a list with the following fields:

    1. vocab - data.frame which contains the columns

       • terms - character vector of unique terms
       • terms_counts - integer vector of term counts across all documents
       • doc_counts - integer vector of document counts that contain the corresponding term
       • doc_proportions - numeric vector of document proportions that contain the corresponding term

    2. ngram - integer vector, the same as the ngram argument of this function
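
For illustration, a short sketch of inspecting these fields, assuming vocab was built as in the Examples section below and that the ngram field is present as described above:

str(vocab, max.level = 1)   # list with vocab and ngram fields
head(vocab$vocab)           # terms, terms_counts, doc_counts, doc_proportions
vocab$ngram                 # ngram boundaries used when collecting terms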

Methods (by class)

  • character: creates a text2vec_vocabulary from a predefined character vector. Terms will be inserted as is, without any checks (ngram number, ngram delimiters, etc.); see the example below.
  • itoken: collects unique terms and corresponding statistics from the input object.
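
For example, a minimal sketch of the character method with an arbitrary, purely illustrative term set:

# user-defined vocabulary, inserted "as is"
custom_terms <- c("good", "bad", "plot", "actor", "movie")
custom_vocab <- vocabulary(custom_terms)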

Examples

data("movie_review")
txt <- movie_review[['review']][1:100]
it <- itoken(txt, tolower, word_tokenizer, chunks_number = 10)
vocab <- vocabulary(it)
pruned_vocab = prune_vocabulary(vocab, term_count_min = 10,
 doc_proportion_max = 0.8, doc_proportion_min = 0.001, max_number_of_terms = 20000)
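
A further hedged sketch of the itoken method's serialize_dir argument; tempdir() is used here only as an illustrative writable path:

# re-create the iterator (iterators are typically consumed by a single pass)
# and save the tokenized chunks to disk as a side effect
it2 <- itoken(txt, tolower, word_tokenizer, chunks_number = 10)
vocab2 <- vocabulary(it2, serialize_dir = tempdir())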
