You've just discovered text2vec!

text2vec is an R package which provides an efficient framework with a concise API for text analysis and natural language processing (NLP).

The goals we set out to achieve in developing text2vec:

Concise - expose as few functions as possible
Consistent - expose unified interfaces, so there is no need to learn a new interface for each task
Flexible - make it easy to solve complex tasks
Fast - maximize efficiency per single thread, and transparently scale to multiple threads on multicore machines
Memory efficient - use streams and iterators, and avoid keeping data in RAM where possible

Tutorials

To learn how to use this package, see text2vec.org and the package vignettes. See also the text2vec articles on my blog.

Features

The core functionality at the moment includes:

Fast text vectorization on arbitrary n-grams, using vocabulary or feature hashing.
GloVe word embeddings.
Topic modeling with:

Latent Dirichlet Allocation
Latent Semantic Analysis

Similarities/distances between two matrices:

Cosine
Jaccard
Relaxed Word Mover's Distance
Euclidean
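
To make the first and last items above more concrete, here is a minimal, language-neutral sketch of n-gram vectorization with feature hashing, followed by a cosine similarity between the resulting vectors. It is written in plain Python and is not text2vec's API: the function names `ngrams`, `hash_vectorize`, and `cosine`, the `n_buckets` parameter, and the choice of CRC32 as the hash function are all assumptions made for illustration.

```python
import math
import zlib
from collections import Counter


def ngrams(tokens, n_min=1, n_max=2):
    """Yield all n-grams of length n_min..n_max, joined with '_'."""
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield "_".join(tokens[i:i + n])


def hash_vectorize(text, n_buckets=2 ** 10):
    """Map a document to a sparse {bucket: count} vector via the hashing trick."""
    tokens = text.lower().split()
    # crc32 is deterministic across runs, unlike Python's built-in hash().
    return Counter(
        zlib.crc32(gram.encode("utf-8")) % n_buckets for gram in ngrams(tokens)
    )


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(value * b.get(key, 0) for key, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


docs = [
    "the cat sat on the mat",
    "the cat sat on the sofa",
    "stocks fell sharply today",
]
vectors = [hash_vectorize(doc) for doc in docs]

print(cosine(vectors[0], vectors[1]))  # documents sharing most n-grams
print(cosine(vectors[0], vectors[2]))  # documents sharing no words
```

Feature hashing needs no vocabulary pass over the corpus, at the cost of occasional bucket collisions. A vocabulary-based vectorizer would instead assign each distinct n-gram its own column.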

Performance

The author of the package is a little bit obsessed with efficiency.

This package is efficient because it is carefully written in C++, which also means that text2vec is memory friendly. Some parts, such as training GloVe word embeddings, are fully parallelized using the excellent RcppParallel package. This means that the word embeddings are computed in parallel on OS X, Linux, Windows, and Solaris (x86) without any additional tuning or tricks. Other embarrassingly parallel tasks, such as vectorization, can use any parallel backend which supports the foreach package, so they can achieve near-linear scalability with the number of available cores. Finally, a streaming API means that users do not have to load all the data into RAM.
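
The streaming idea can be sketched in a few lines. The example below is a conceptual illustration in plain Python, not text2vec code: `iter_tokens` and `build_vocabulary` are hypothetical names, and an in-memory `io.StringIO` stands in for a large file on disk. Only one document is held in memory at a time while the vocabulary counts accumulate.

```python
import io
from collections import Counter


def iter_tokens(lines):
    """Lazily tokenize one document (line) at a time."""
    for line in lines:
        yield line.lower().split()


def build_vocabulary(token_stream):
    """Accumulate term counts from a stream of tokenized documents."""
    counts = Counter()
    for tokens in token_stream:
        counts.update(tokens)
    return counts


# io.StringIO stands in for a large file; a real file object opened with
# open(path) can be iterated line by line in exactly the same way.
corpus = io.StringIO("the cat sat\nthe dog barked\n")
vocab = build_vocabulary(iter_tokens(corpus))

print(vocab.most_common(1))  # [('the', 2)]
```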

Contributing

The package has an issue tracker on GitHub where I file feature requests and notes for future work. Any ideas are appreciated.

Contributors are welcome. You can help by:

testing and leaving feedback on the GitHub issue tracker (preferably) or directly by e-mail
forking and contributing (check the code style guide). Vignettes, docs, tests, and use cases are very welcome
giving me a star on the project page :-)

License

GPL (>= 2)

Functions in text2vec (0.4.0)