These functions build a term-cooccurence matrix (tcm) and a document-term matrix (dtm) simultaneously in a RAM-friendly and efficient way using the iterators abstraction, so you can build corpora from objects/files which are orders of magnitude larger than available RAM.

Usage:

  create_vocab_corpus(iterator, vocabulary, grow_dtm = TRUE,
    skip_grams_window = 0L)

  create_hash_corpus(iterator, feature_hasher = feature_hasher(),
    grow_dtm = TRUE, skip_grams_window = 0)
Arguments:

  iterator           iterator over a list of character vectors. Each element
                     is a list of tokens, i.e. tokenized and normalized
                     strings.

  vocabulary         text2vec_vocabulary object; see vocabulary.

  grow_dtm           logical; whether to grow the Document-Term matrix
                     during corpus construction.

  skip_grams_window  integer; window size for Term-Cooccurence matrix
                     construction. 0L means do not construct such a matrix.

  feature_hasher     text2vec_feature_hasher object, which contains meta
                     information about feature hashing. See feature_hasher
                     for details.
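A sketch of typical usage. It assumes the package's itoken(), vocabulary(), get_dtm(), get_tcm() helpers and the bundled movie_review dataset, none of which are described in this section; argument values such as skip_grams_window = 5L and hash_size = 2^18 are illustrative only.

```r
library(text2vec)
data("movie_review")

# Vocabulary-based corpus: one pass to build the vocabulary, then a fresh
# iterator (iterators are consumed once) to construct the corpus.
it <- itoken(movie_review$review, preprocess_function = tolower,
             tokenizer = word_tokenizer)
vocab <- vocabulary(it)

it <- itoken(movie_review$review, preprocess_function = tolower,
             tokenizer = word_tokenizer)
corpus <- create_vocab_corpus(it, vocab, grow_dtm = TRUE,
                              skip_grams_window = 5L)
dtm <- get_dtm(corpus)  # document-term matrix
tcm <- get_tcm(corpus)  # term-cooccurence matrix (5-token window)

# Hash-based corpus: no vocabulary pass is needed, so a single
# iterator traversal suffices.
it <- itoken(movie_review$review, preprocess_function = tolower,
             tokenizer = word_tokenizer)
h_corpus <- create_hash_corpus(it,
                               feature_hasher = feature_hasher(hash_size = 2^18))
h_dtm <- get_dtm(h_corpus)
```

The hash-based variant trades exact term identities for a fixed-size feature space, which is why it can skip the separate vocabulary-building pass.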