Note: a newer version (0.6.4) of this package is available.

Tutorials

To learn how to use this package, see the package vignettes.

  1. Text vectorization: vignette("text-vectorization", package = "text2vec")
  2. GloVe word embeddings: vignette("glove", package = "text2vec")

See also the text2vec articles on my blog.

Features

text2vec is a package that provides an efficient framework with a concise API for text analysis and natural language processing (NLP) in R. It is inspired by gensim, an excellent Python library for NLP.

The core functionality currently includes:

  1. Fast text vectorization on arbitrary n-grams, using vocabulary or feature hashing.
  2. State-of-the-art GloVe word embeddings.
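
To make the difference between the two vectorization strategies concrete, here is a minimal base-R sketch. It does not use the text2vec API; the toy corpus, the `toy_hash` helper and the bucket count are assumptions chosen only for illustration, and the hash is a simple polynomial hash, not the one the package uses.

```r
# Toy corpus; documents and helper names are made up for illustration.
docs <- c("the cat sat", "the dog sat down")
tokens <- strsplit(tolower(docs), " ", fixed = TRUE)

# 1) Vocabulary-based: one pass collects the unique terms,
#    a second pass counts them per document.
vocab <- sort(unique(unlist(tokens)))
dtm_vocab <- t(vapply(tokens, function(tk) {
  as.integer(table(factor(tk, levels = vocab)))
}, integer(length(vocab))))
colnames(dtm_vocab) <- vocab

# 2) Feature hashing: a single pass, each term maps directly to a column.
#    toy_hash is a hypothetical stand-in for a real hash function.
toy_hash <- function(term, n_buckets) {
  h <- 0
  for (code in utf8ToInt(term)) h <- (h * 31 + code) %% n_buckets
  h + 1  # 1-based column index
}
n_buckets <- 16
dtm_hash <- matrix(0L, nrow = length(tokens), ncol = n_buckets)
for (i in seq_along(tokens)) {
  for (tk in tokens[[i]]) {
    j <- toy_hash(tk, n_buckets)
    dtm_hash[i, j] <- dtm_hash[i, j] + 1L
  }
}
```

The vocabulary approach keeps readable column names but needs the vocabulary before counting. Hashing needs no vocabulary and has a fixed number of columns, at the cost of possible collisions and columns that no longer map back to terms.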

The core of this package is carefully written in C++, which makes text2vec fast and memory-friendly. Some parts (GloVe training) are fully parallelized using the excellent RcppParallel package, so parallel processing works on OS X, Linux, Windows and Solaris (x86) without any additional hacking or tricks. In addition, there is a higher-level parallelization for text vectorization and vocabulary construction built on top of the foreach package. text2vec also has a streaming API, so users don't have to load all of the data into RAM.

The API is built around the iterator abstraction. It is concise, providing only a few functions that do their job well. The package does not (and probably will not in the future) provide trivial, very high-level functions, but other packages can build on top of the framework that text2vec provides.
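
As a rough illustration of why an iterator-based design allows streaming, the base-R sketch below counts terms chunk by chunk from a connection, so only one chunk is in memory at a time. It does not use the text2vec API; `make_line_iterator`, the chunk size and the in-memory `textConnection` standing in for a large file are all assumptions made for this example.

```r
# Hypothetical helper: returns a function that yields the next chunk of
# lines, or NULL once the connection is exhausted.
make_line_iterator <- function(con, chunk_size = 2L) {
  function() {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0L) NULL else lines
  }
}

# A text connection stands in for a file too large to load at once.
con <- textConnection(c("a b", "b c", "c c d"))
next_chunk <- make_line_iterator(con)

counts <- integer(0)  # named integer vector of running term counts
while (!is.null(chunk <- next_chunk())) {
  for (term in unlist(strsplit(chunk, " ", fixed = TRUE))) {
    counts[term] <- if (is.na(counts[term])) 1L else counts[term] + 1L
  }
}
close(con)
```

The consumer only ever asks for the next chunk, so the same counting loop works whether the source is a character vector, a file or a set of files.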

Contributing

The package has an issue tracker on GitHub, where I file feature requests and notes for future work. Any ideas are appreciated.

Contributors are welcome. You can help by

  • testing and leaving feedback on the GitHub issue tracker (preferably) or directly by e-mail.
  • forking and contributing. Vignettes, docs, tests, and use cases are very welcome.
  • giving me a star on the project page :-)

Install

install.packages('text2vec')

Monthly Downloads

8,738

Version

0.3.0

License

MIT + file LICENSE

Maintainer

Dmitriy Selivanov

Last Published

March 31st, 2016

Functions in text2vec (0.3.0)

  • check_analogy_accuracy: Checks accuracy of word embeddings on the analogy task
  • reexports: Objects exported from other packages
  • get_dtm: Extract document-term matrix
  • HashCorpus: Rcpp module HashCorpus. Exposes C++ functions to construct a hashed document-term matrix
  • VocabCorpus: Rcpp module VocabCorpus. Exposes C++ functions to construct a document-term matrix
  • ifiles: Creates an iterator over text files from disk
  • ilines: Creates an iterator over the lines of a connection or file
  • tokenizers: Tokenization functions, which perform string splitting
  • text2vec: text2vec is a package that provides an efficient framework with a concise API for text analysis and natural language processing in R.
  • itoken: Iterators over input objects
  • split_into: Split a vector for parallel processing
  • get_tcm: Extract term-co-occurrence matrix
  • get_idf: Inverse document-frequency scaling matrix
  • VocabularyBuilder: Rcpp module VocabularyBuilder. Exposes C++ functions to construct a vocabulary
  • create_dtm: Document-term matrix construction
  • vectorizers: Vocabulary and hash vectorizers
  • GloveFitter: Rcpp module GloveFitter. Exposes C++ functions to fit a GloVe model
  • prepare_analogy_questions: Prepares a list of analogy questions
  • glove: Fit a GloVe word-embedding model
  • get_tf: Term-frequency scaling matrix
  • prune_vocabulary: Prune vocabulary
  • transform_filter_commons: Remove terms from a document-term matrix
  • transform_tf: Scale a document-term matrix
  • create_corpus: Create a corpus
  • create_vocabulary: Creates a vocabulary of unique terms
  • create_tcm: Term-co-occurrence matrix construction
  • movie_review: IMDB movie reviews