create_vocab_corpus: RAM-friendly streaming corpus construction.

Description

These functions create corpus objects (vocabulary-based or hash-based) that are stored outside of R's heap and wrapped via Reference Classes using Rcpp modules. From these objects you can extract Document-Term (dtm) and Term-Cooccurrence (tcm) matrices. text2vec grows the corpus for the tcm and dtm simultaneously, in a RAM-friendly and efficient way, using the iterator abstraction. This lets you build corpora from objects or files that are orders of magnitude larger than available RAM.

Usage

create_vocab_corpus(iterator, vocabulary, grow_dtm = TRUE,
  skip_grams_window = 0L)
create_hash_corpus(iterator, feature_hasher = feature_hasher(),
  grow_dtm = TRUE, skip_grams_window = 0)

Arguments

iterator

iterator over a list of character vectors. Each element is a list of tokens, that is, tokenized and normalized strings.

vocabulary

text2vec_vocabulary object, see vocabulary.

grow_dtm

logical. Whether to grow the Document-Term matrix during corpus construction.

skip_grams_window

integer. Window size for Term-Cooccurrence matrix construction. 0L means the matrix is not constructed.
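
To illustrate what a co-occurrence window means, here is a minimal base-R sketch. It is not text2vec's implementation: the function toy_tcm and its counting rule (plain, unweighted counts of context terms up to `window` positions to the right, with out-of-vocabulary tokens dropped before windowing) are assumptions made for illustration only.

```r
# Hypothetical helper, not part of text2vec.
# Counts, for each ordered pair (term, context term), how often the context
# term appears at most `window` positions to the right of the term.
toy_tcm <- function(docs, vocab, window) {
  # Mirror the documented convention: a window of 0 means "build no matrix".
  if (window <= 0L) return(NULL)
  tcm <- matrix(0L, nrow = length(vocab), ncol = length(vocab),
                dimnames = list(vocab, vocab))
  for (tokens in docs) {
    idx <- match(tokens, vocab)
    idx <- idx[!is.na(idx)]        # assumption: drop out-of-vocabulary tokens
    n <- length(idx)
    for (p in seq_len(n)) {
      q_last <- min(n, p + window)
      if (q_last <= p) next        # no context to the right of the last token
      for (q in (p + 1L):q_last) {
        tcm[idx[p], idx[q]] <- tcm[idx[p], idx[q]] + 1L
      }
    }
  }
  tcm
}

docs  <- list(c("a", "b", "c", "a"))
vocab <- c("a", "b", "c")
tcm1  <- toy_tcm(docs, vocab, window = 1L)
```

With window = 1L only adjacent tokens are paired; a larger window also pairs tokens that are further apart.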

feature_hasher

text2vec_feature_hasher object, which contains meta information about feature hashing. See feature_hasher for details.

Value

corpus object. Documents can be added to this corpus by reference, without copying. See the source code for details. For a full example of the process, see get_dtm.
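
The idea of growing a corpus by reference, one chunk at a time, can be sketched in base R. This is a conceptual illustration only, not the package's code: new_toy_corpus, insert_chunk and toy_dtm are hypothetical names, and an R environment stands in for the Reference Class object because environments are also modified in place.

```r
# Hypothetical stand-in for a corpus object. An environment is mutated in
# place, so functions can add documents to it without copying it.
new_toy_corpus <- function(vocab) {
  corpus <- new.env()
  corpus$vocab  <- vocab
  corpus$n_docs <- 0L
  corpus$i <- integer(0)  # document index of each nonzero entry
  corpus$j <- integer(0)  # term index of each nonzero entry
  corpus$x <- integer(0)  # term count of each nonzero entry
  corpus
}

# Add one chunk (a list of token vectors). Only the sparse triplets are
# kept, so the raw tokens of earlier chunks never need to stay in memory.
insert_chunk <- function(corpus, chunk) {
  for (tokens in chunk) {
    corpus$n_docs <- corpus$n_docs + 1L
    idx <- match(tokens, corpus$vocab)
    idx <- idx[!is.na(idx)]          # ignore tokens outside the vocabulary
    if (length(idx) == 0L) next      # empty document still gets a row
    counts <- tabulate(idx, nbins = length(corpus$vocab))
    nz <- which(counts > 0L)
    corpus$i <- c(corpus$i, rep(corpus$n_docs, length(nz)))
    corpus$j <- c(corpus$j, nz)
    corpus$x <- c(corpus$x, counts[nz])
  }
  invisible(corpus)
}

# Materialise the accumulated triplets as a (dense, for brevity) matrix.
toy_dtm <- function(corpus) {
  m <- matrix(0L, nrow = corpus$n_docs, ncol = length(corpus$vocab),
              dimnames = list(NULL, corpus$vocab))
  m[cbind(corpus$i, corpus$j)] <- corpus$x
  m
}

chunks <- list(
  list(c("a", "b", "a"), c("b", "c")),
  list(c("c", "c", "d"))
)
corpus <- new_toy_corpus(c("a", "b", "c"))
for (chunk in chunks) insert_chunk(corpus, chunk)  # corpus is updated in place
dtm <- toy_dtm(corpus)
```

Note that insert_chunk's return value is never assigned: the corpus is updated through the reference, which is the behaviour the Value section describes.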
