
text2vec (version 0.2.0)

vocabulary: Creates vocabulary (unique terms)

Description

Collects unique terms and corresponding statistics from the input object. See the Value section.

Usage

vocabulary(src, ngram = c(ngram_min = 1L, ngram_max = 1L), ...)

## S3 method for class 'character'
vocabulary(src, ngram = c(ngram_min = 1L, ngram_max = 1L), ...)

## S3 method for class 'itoken'
vocabulary(src, ngram = c(ngram_min = 1L, ngram_max = 1L), serialize_dir = NULL, ...)

Arguments

src
iterator over a list of character vectors (the documents from which the user wants to construct the vocabulary), or, alternatively, a character vector of user-defined vocabulary terms (which will be used as is).
ngram
integer vector. The lower and upper boundaries of the range of n-values for the n-grams to be extracted. All values of n such that ngram_min <= n <= ngram_max will be used (see the sketch after this argument list).
...
arguments passed to other methods (including the write_rds function).
serialize_dir
character, path to a directory where the tokenized input texts will be saved in serialized form as a side effect (illustrated at the end of the Examples section).
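
For instance, a minimal sketch of passing the ngram argument, reusing the itoken() call from the Examples section below; the c(1L, 2L) bounds are illustrative:

data("movie_review")
txt <- movie_review[['review']][1:100]
it <- itoken(txt, tolower, word_tokenizer, chunks_number = 10)
# collect all unigrams and bigrams: ngram_min = 1, ngram_max = 2
bigram_vocab <- vocabulary(it, ngram = c(ngram_min = 1L, ngram_max = 2L))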

Value

  • text2vec_vocabulary object, which is actually a list with the following fields:

    1. vocab - data.frame which contains the columns

       • terms - character vector of unique terms
       • terms_counts - integer vector of term counts across all documents
       • doc_counts - integer vector of document counts that contain the corresponding term
       • doc_proportions - numeric vector of document proportions that contain the corresponding term

    2. ngram - integer vector, the same as the ngram argument of this function
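
For illustration, a short sketch of inspecting these fields, assuming vocab was built as in the Examples section below and that the ngram field is present as described above:

str(vocab, max.level = 1)   # list with vocab and ngram fields
head(vocab$vocab)           # terms, terms_counts, doc_counts, doc_proportions
vocab$ngram                 # ngram boundaries used when collecting terms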

Methods (by class)

  • character: creates a text2vec_vocabulary from a predefined character vector. Terms will be inserted as is, without any checks (ngram number, ngram delimiters, etc.); see the example below.
  • itoken: collects unique terms and corresponding statistics from the input object.
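
For example, a minimal sketch of the character method with an arbitrary, purely illustrative term set:

# user-defined vocabulary, inserted "as is"
custom_terms <- c("good", "bad", "plot", "actor", "movie")
custom_vocab <- vocabulary(custom_terms)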

Examples

data("movie_review")
txt <- movie_review[['review']][1:100]
it <- itoken(txt, tolower, word_tokenizer, chunks_number = 10)
vocab <- vocabulary(it)
pruned_vocab = prune_vocabulary(vocab, term_count_min = 10,
 doc_proportion_max = 0.8, doc_proportion_min = 0.001, max_number_of_terms = 20000)
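
A further hedged sketch of the itoken method's serialize_dir argument; tempdir() is used here only as an illustrative writable path:

# re-create the iterator (iterators are typically consumed by a single pass)
# and save the tokenized chunks to disk as a side effect
it2 <- itoken(txt, tolower, word_tokenizer, chunks_number = 10)
vocab2 <- vocabulary(it2, serialize_dir = tempdir())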
