create_corpus

Create a corpus

This function creates corpus objects (based on a vocabulary or on hashes), which are stored outside of R's heap and wrapped via reference classes using Rcpp modules. From these objects you can extract document-term matrices (DTM) and term-co-occurrence matrices (TCM). text2vec grows the corpus for the DTM and the TCM simultaneously, in a RAM-friendly and efficient way, using the iterator abstraction. As a result, you can build corpora from objects or files that are orders of magnitude larger than available RAM.

Usage
create_corpus(iterator, vectorizer)
Arguments
iterator
An iterator over a list of character vectors. Each element is a character vector of tokens, that is, tokenized and normalized strings.
vectorizer
A vectorizer function. See vectorizers.
Value

Corpus object.

See Also

vectorizers, create_dtm, get_dtm, get_tcm, create_tcm
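
Examples

The following is a minimal sketch of how the pieces described above might fit together, assuming the text2vec 0.3.0 API. Only create_corpus and get_dtm are named on this page; itoken, word_tokenizer, hash_vectorizer and the movie_review data set are assumed to be available from the same package version, and later releases of text2vec may not provide create_corpus at all.

```r
library(text2vec)

# movie_review is assumed to be the sample data set shipped with text2vec
data("movie_review")
N <- 100

# Iterator over lower-cased, tokenized documents (assumed itoken interface)
it <- itoken(movie_review$review[1:N],
             preprocess_function = tolower,
             tokenizer = word_tokenizer)

# Hash-based vectorizer with default settings; no vocabulary pass is needed
vectorizer <- hash_vectorizer()

# Build the corpus, then extract the document-term matrix from it
corpus <- create_corpus(it, vectorizer)
dtm <- get_dtm(corpus)
dim(dtm)
```

A vocabulary-based corpus would follow the same pattern with a vocabulary-based vectorizer in place of the hash-based one; see vectorizers for the available options.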

Aliases
  • create_corpus
Documentation reproduced from package text2vec, version 0.3.0, License: MIT + file LICENSE
