text2vec (version 0.6)

create_vocabulary: Creates a vocabulary of unique terms

Description

This function collects unique terms and corresponding statistics. See the Value section below for details.

Usage

create_vocabulary(it, ngram = c(ngram_min = 1L, ngram_max = 1L),
  stopwords = character(0), sep_ngram = "_", window_size = 0L)

vocabulary(it, ngram = c(ngram_min = 1L, ngram_max = 1L),
  stopwords = character(0), sep_ngram = "_", window_size = 0L)

# S3 method for character
create_vocabulary(it, ngram = c(ngram_min = 1L, ngram_max = 1L),
  stopwords = character(0), sep_ngram = "_", window_size = 0L)

# S3 method for itoken
create_vocabulary(it, ngram = c(ngram_min = 1L, ngram_max = 1L),
  stopwords = character(0), sep_ngram = "_", window_size = 0L)

# S3 method for itoken_parallel
create_vocabulary(it, ngram = c(ngram_min = 1L, ngram_max = 1L),
  stopwords = character(0), sep_ngram = "_", window_size = 0L, ...)

Arguments

it

iterator over a list of character vectors, which are the documents from which the user wants to construct a vocabulary. See itoken. Alternatively, a character vector of user-defined vocabulary terms (which will be used "as is").

ngram

integer vector. The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that ngram_min <= n <= ngram_max will be used.

stopwords

character vector of stopwords to filter out. NOTE that stopwords will be used "as is". This means that if the preprocessing function in itoken modifies the text (for example, by stemming), then the same preprocessing needs to be applied to the stopwords before passing them here. See https://github.com/dselivanov/text2vec/issues/228 for an example.

sep_ngram

character string used to concatenate words in n-grams.

window_size

integer (0 by default). If window_size > 0, then the vocabulary will be created from pseudo-documents, which are obtained by virtually splitting each document into chunks of length window_size with a sliding window. This is useful for creating the special statistics used for coherence estimation in topic models.

...

placeholder for additional arguments (not used at the moment).
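As a rough illustration of how the ngram, sep_ngram and stopwords arguments described above interact, the sketch below builds vocabularies from two toy documents. The documents and the helper make_it() are made up for this example; a fresh iterator is created for each call simply to keep the calls independent.

```r
library(text2vec)

# two toy documents (hypothetical data, for illustration only)
docs = c("The cat sat", "the dog sat")

# hypothetical helper: builds a new iterator that lower-cases and tokenizes docs
make_it = function() {
  itoken(docs, preprocessor = tolower, tokenizer = word_tokenizer,
         progressbar = FALSE)
}

# unigrams and bigrams; bigram words are joined with the default "_" separator
v_ngram = create_vocabulary(make_it(), ngram = c(ngram_min = 1L, ngram_max = 2L))
print(v_ngram$term)

# stopwords are compared with tokens after preprocessing, so "The" does not
# match anything here because the tokens are already lower-cased
v_upper = create_vocabulary(make_it(), stopwords = "The")
v_lower = create_vocabulary(make_it(), stopwords = "the")

the_kept_with_upper = "the" %in% v_upper$term  # "the" is still in the vocabulary
the_kept_with_lower = "the" %in% v_lower$term  # "the" has been filtered out
```

Passing the stopwords through the same preprocessing function first (here, tolower("The")) avoids the mismatch.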

Value

text2vec_vocabulary object, which is a data.frame with the following columns:

term

character vector of unique terms

term_count

integer vector of term counts across all documents

doc_count

integer vector with the number of documents that contain the corresponding term

It also stores metainformation in attributes:

  • ngram: integer vector, the lower and upper boundaries of the n-gram range.

  • document_count: integer, the number of documents from which the vocabulary was built.

  • stopwords: character vector of stopwords.

  • sep_ngram: character separator for n-grams.
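To make the difference between term_count and doc_count concrete, here is a minimal sketch on two made-up documents. It assumes only the columns and attributes listed above; the toy input is not from the package.

```r
library(text2vec)

# "a" appears twice, but only in the first document
it = itoken(c("a b a", "b c"), tokenizer = space_tokenizer, progressbar = FALSE)
v = create_vocabulary(it)

# the result is a data.frame, so columns can be accessed with $
a_term_count = v$term_count[v$term == "a"]  # occurrences across all documents
a_doc_count  = v$doc_count[v$term == "a"]   # number of documents containing "a"

# metainformation is read with attr()
n_docs = attr(v, "document_count")
print(attr(v, "ngram"))
print(attr(v, "sep_ngram"))
```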

Methods (by class)

  • character: creates text2vec_vocabulary from predefined character vector. Terms will be inserted as is, without any checks (ngrams number, ngram delimiters, etc.).

  • itoken: collects unique terms and corresponding statistics from object.

  • itoken_parallel: collects unique terms and corresponding statistics from iterator.
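For the character method, a short sketch with a made-up term list shows that the terms are taken as given. Note that "new_york" is accepted even though the default ngram range only covers unigrams, since no checks are performed.

```r
library(text2vec)

# user-defined terms (hypothetical list); nothing is tokenized or validated
v_fixed = create_vocabulary(c("cat", "dog", "new_york"))
print(v_fixed$term)
```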

Examples

# NOT RUN {
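library(text2vec)
data("movie_review")
txt = movie_review[['review']][1:100]
# iterate over the lower-cased, word-tokenized reviews in 10 chunks
it = itoken(txt, tolower, word_tokenizer, n_chunks = 10)
vocab = create_vocabulary(it)
# drop rare and overly common terms and cap the vocabulary size
pruned_vocab = prune_vocabulary(vocab, term_count_min = 10, doc_proportion_max = 0.8,
  doc_proportion_min = 0.001, vocab_term_max = 20000)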
# }
