Fast Text Mining Framework for Vectorization and Word Embeddings

Very fast and memory-friendly tools for text vectorization and state-of-the-art word embeddings (GloVe). This package provides a source-agnostic streaming API, which allows researchers to analyze collections of documents that are much larger than available RAM. All core functions are parallelized to benefit from multicore machines.


To learn how to use this package, see the package vignettes.

  1. Text vectorization: vignette("text-vectorization", package = "text2vec")
  2. GloVe word embeddings: vignette("glove", package = "text2vec")

See also the text2vec articles on my blog.


text2vec is a package that provides an efficient framework with a concise API for text analysis and natural language processing (NLP) in R. It is inspired by gensim, an excellent Python library for NLP.

The core functionality at the moment includes:

  1. Fast text vectorization on arbitrary n-grams, using either a vocabulary or feature hashing.
  2. State-of-the-art GloVe word embeddings.
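The difference between the two vectorization strategies can be illustrated with a minimal base R sketch. It does not use text2vec itself and is not how the package is implemented: `toy_hash` is a hypothetical stand-in for a real hash function, and the three-document corpus is made up for illustration.

```r
# Toy corpus (made up for illustration).
docs <- c("the cat sat", "the dog sat", "the cat ran")
tokens <- strsplit(docs, " ", fixed = TRUE)

# Vocabulary-based vectorization: first collect all distinct terms,
# then count each term per document (one column per term).
vocab <- sort(unique(unlist(tokens)))
dtm_vocab <- t(vapply(tokens, function(tk) {
  as.integer(table(factor(tk, levels = vocab)))
}, integer(length(vocab))))
colnames(dtm_vocab) <- vocab

# Feature hashing: no vocabulary pass is needed; each term is mapped
# straight to one of a fixed number of columns (buckets).
# toy_hash is a hypothetical, deliberately simple hash: the sum of the
# term's code points modulo the number of buckets.
toy_hash <- function(term, n_buckets) {
  sum(utf8ToInt(term)) %% n_buckets + 1L
}

n_buckets <- 4L
dtm_hash <- t(vapply(tokens, function(tk) {
  idx <- vapply(tk, toy_hash, integer(1), n_buckets = n_buckets)
  tabulate(idx, nbins = n_buckets)
}, integer(n_buckets)))

print(dtm_vocab)
print(dtm_hash)
```

With only four buckets, distinct terms end up sharing columns (here "cat" and "sat" land in the same bucket). That is the usual trade-off of feature hashing: a fixed-size matrix and no vocabulary pass, at the cost of collisions and columns that can no longer be mapped back to terms.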

The core of this package is carefully written in C++, which makes text2vec fast and memory-friendly. Some parts (GloVe training) are fully parallelized using the excellent RcppParallel package, so parallel processing works on OS X, Linux, Windows and Solaris (x86) without any additional hacking or tricks. In addition, there is higher-level parallelization for text vectorization and vocabulary construction on top of the foreach package. text2vec also has a streaming API, so users don't have to load all of the data into RAM.

The API is built around the iterator abstraction and is deliberately concise, providing only a few functions that do their job well. The package does not (and probably will not in the future) provide trivial, very high-level functions, but other packages can build on top of the framework that text2vec provides.
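To make the iterator idea concrete, here is a minimal base R sketch of streaming over a source in chunks while accumulating term counts. It is only an illustration of the pattern, not text2vec's implementation: `chunk_iterator` is a hypothetical helper, and a text connection stands in for a file that is too large to load at once.

```r
# Hypothetical helper (not part of text2vec): an iterator over chunks of
# lines. It returns up to chunk_size lines per call, or NULL when the
# connection is exhausted.
chunk_iterator <- function(con, chunk_size) {
  list(next_chunk = function() {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0L) NULL else lines
  })
}

# A text connection stands in for a large file on disk.
con <- textConnection(c("the cat sat", "the dog sat", "the cat ran"))
it <- chunk_iterator(con, chunk_size = 2L)

# Accumulate term counts one chunk at a time, so only the current chunk
# and the running counts are held in memory.
counts <- integer(0)
while (!is.null(chunk <- it$next_chunk())) {
  chunk_counts <- table(unlist(strsplit(chunk, " ", fixed = TRUE)))
  terms <- names(chunk_counts)
  new_terms <- setdiff(terms, names(counts))
  counts[new_terms] <- 0L
  counts[terms] <- counts[terms] + as.integer(chunk_counts)
}
close(con)

print(counts)
```

Because the consumer only ever asks for the next chunk, the same loop works whether the data comes from a character vector, a file, or a connection, which is the sense in which such an API can be source-agnostic.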


The package has an issue tracker on GitHub, where I file feature requests and notes for future work. Any ideas are appreciated.

Contributors are welcome. You can help by

  • testing and leaving feedback on the GitHub issue tracker (preferably) or directly by e-mail.
  • forking and contributing. Vignettes, docs, tests, and use cases are very welcome.
  • giving me a star on the project page :-)

Functions in text2vec

Name                       Description
check_analogy_accuracy     Checks accuracy of word embeddings on the analogy task
reexports                  Objects exported from other packages
get_dtm                    Extract document-term matrix
HashCorpus                 Rcpp module: HashCorpus. Exposes C++ functions to construct a hashed document-term matrix
VocabCorpus                Rcpp module: VocabCorpus. Exposes C++ functions to construct a document-term matrix
ifiles                     Creates iterator over text files from the disk
ilines                     Creates iterator over the lines of a connection or file
tokenizers                 Tokenization functions, which perform string splitting
text2vec                   text2vec is a package that provides an efficient framework with a concise API for text analysis and natural language processing in R.
itoken                     Iterators over input objects
split_into                 Split a vector for parallel processing
get_tcm                    Extract term-co-occurrence matrix
get_idf                    Inverse document-frequency scaling matrix
VocabularyBuilder          Rcpp module: VocabularyBuilder. Exposes C++ functions to construct a vocabulary
create_dtm                 Document-term matrix construction
vectorizers                Vocabulary and hash vectorizers
GloveFitter                Rcpp module: GloveFitter. Exposes C++ functions to fit a GloVe model
prepare_analogy_questions  Prepares list of analogy questions
glove                      Fit a GloVe word-embedded model
get_tf                     Term-frequency scaling matrix
prune_vocabulary           Prune vocabulary
transform_filter_commons   Remove terms from a document-term matrix
transform_tf               Scale a document-term matrix
create_corpus              Create a corpus
create_vocabulary          Creates a vocabulary of unique terms
create_tcm                 Term-co-occurrence matrix construction
movie_review               IMDB movie reviews

Type: Package
Date: 2016-03-31
License: MIT + file LICENSE
Encoding: UTF-8
SystemRequirements: GNU make, C++11
LinkingTo: Rcpp, RcppParallel, digest
URL: https://github.com/dselivanov/text2vec
BugReports: https://github.com/dselivanov/text2vec/issues
VignetteBuilder: knitr
LazyData: true
RoxygenNote: 5.0.1
NeedsCompilation: yes
Packaged: 2016-03-31 18:58:07 UTC; dmitryselivanov
Repository: CRAN
Date/Publication: 2016-03-31 21:12:51

