text2vec (version 0.2.0)

create_vocab_corpus: RAM-friendly streaming corpus construction.

Description

These functions create corpus objects (vocabulary-based or hash-based) that are stored outside of R's heap and wrapped via Reference Classes using Rcpp modules. From these objects you can extract Document-Term (dtm) and Term-Cooccurrence (tcm) matrices. text2vec grows the corpus for the tcm and the dtm simultaneously, in a RAM-friendly and efficient way, using the iterator abstraction. This lets you build corpora from objects or files that are orders of magnitude larger than available RAM.

Usage

create_vocab_corpus(iterator, vocabulary, grow_dtm = TRUE,
  skip_grams_window = 0L)

create_hash_corpus(iterator, feature_hasher = feature_hasher(),
  grow_dtm = TRUE, skip_grams_window = 0)

Arguments

iterator
iterator over a list of character vectors. Each element is a character vector of tokens, i.e. tokenized and normalized strings.
vocabulary
text2vec_vocabulary object, see vocabulary.
grow_dtm
logical. Whether to grow the Document-Term matrix during corpus construction.
skip_grams_window
integer. Window size for Term-Cooccurrence matrix construction. 0L means the matrix is not constructed.
feature_hasher
text2vec_feature_hasher object, which contains meta information about feature hashing. See feature_hasher for details.

Value

  • corpus object. Documents can be added to this corpus by reference, with no copying. See the source code for details. For a full example of the process, see get_dtm.

See Also

feature_hasher.
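
Examples

The following is a minimal sketch of how a vocabulary-based corpus might be built and queried, assuming the text2vec 0.2.0 API. Only create_vocab_corpus and its arguments are documented on this page; itoken, vocabulary, word_tokenizer, get_tcm, the exact arguments passed to them, and the bundled movie_review data set are assumptions that should be checked against the installed version. The window size of 5L is an arbitrary illustrative value.

```r
library(text2vec)

# Assumption: the movie_review data set ships with text2vec and has a
# character column named `review`.
data("movie_review")
docs <- movie_review$review[1:100]

# Assumption: itoken() builds the iterator over tokenized documents that
# create_vocab_corpus() expects as its first argument.
it <- itoken(docs, preprocess_function = tolower, tokenizer = word_tokenizer)

# Assumption: vocabulary() collects the terms seen by the iterator.
vocab <- vocabulary(it)

# Recreate the iterator in case the vocabulary pass has exhausted it.
it <- itoken(docs, preprocess_function = tolower, tokenizer = word_tokenizer)

# Grow the Document-Term matrix and, because skip_grams_window is non-zero,
# the Term-Cooccurrence matrix in the same pass over the data.
corpus <- create_vocab_corpus(it, vocabulary = vocab,
                              grow_dtm = TRUE, skip_grams_window = 5L)

# Extract both matrices from the corpus object.
dtm <- get_dtm(corpus)
tcm <- get_tcm(corpus)

dim(dtm)
dim(tcm)
```

With grow_dtm = FALSE the same call would build only the co-occurrence matrix, and with skip_grams_window = 0L only the Document-Term matrix.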