This function creates corpus objects (based on a vocabulary or
hashes), which are stored outside of R's heap and wrapped via reference
classes using Rcpp-Modules. From those objects you can easily extract
document-term (DTM) and term-co-occurrence (TCM) matrices. Also, text2vec
grows the corpus for DTM and TCM matrices simultaneously in a RAM-friendly
and efficient way using the iterators abstraction. You can build corpora
from objects or files that are orders of magnitude larger than available
RAM.
Usage
create_corpus(iterator, vectorizer)
Arguments
iterator
Iterator over a list of character vectors. Each
element is a character vector of tokens, that is, tokenized and normalized strings.