Fast Text Mining Framework for Vectorization and Word Embeddings
Very fast and memory-friendly tools for text vectorization and
state-of-the-art word embeddings (GloVe). This package provides a
source-agnostic streaming API, which allows researchers to perform analysis
of collections of documents which are much larger than available RAM. All
core functions are parallelized to benefit from multicore machines.
To learn how to use this package, see the package vignettes.
- Text vectorization:
vignette("text-vectorization", package = "text2vec")
- GloVe word embeddings:
vignette("glove", package = "text2vec")
See also the text2vec articles on my blog.
text2vec is a package that provides an efficient framework with a concise API for text analysis and natural language processing (NLP) in R. It is inspired by gensim, an excellent Python library for NLP.
The core functionality at the moment includes:
- Fast text vectorization on arbitrary n-grams, using vocabulary or feature hashing.
- State-of-the-art GloVe word embeddings.
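To make the difference between the two vectorization strategies concrete, here is a minimal base R sketch. It does not call text2vec; the helper names (`make_ngrams`, `hash_term`) are hypothetical, and the simple polynomial hash is only a stand-in for whatever hash function the package uses internally. With a vocabulary, each n-gram gets its own named column; with feature hashing, each n-gram is mapped to one of a fixed number of buckets, so no vocabulary has to be built or stored, at the price of possible collisions.

```r
# Base R illustration only; none of these helpers are part of text2vec.

# Build word n-grams (n_min..n_max words, n_min <= n_max) from one token vector.
make_ngrams <- function(tokens, n_min = 1L, n_max = 2L) {
  out <- character(0)
  for (n in seq(n_min, n_max)) {
    if (length(tokens) < n) next
    starts <- seq_len(length(tokens) - n + 1L)
    grams <- vapply(
      starts,
      function(i) paste(tokens[i:(i + n - 1L)], collapse = "_"),
      character(1)
    )
    out <- c(out, grams)
  }
  out
}

# Stand-in hash (an assumption, not the package's hash): polynomial hash of
# the term's code points, reduced to a 1-based bucket index.
hash_term <- function(term, n_buckets) {
  h <- 0
  for (code in utf8ToInt(term)) {
    h <- (h * 31 + code) %% n_buckets
  }
  h + 1
}

docs   <- c("the cat sat", "the cat ate the fish")
tokens <- strsplit(tolower(docs), " ", fixed = TRUE)
ngrams <- lapply(tokens, make_ngrams, n_min = 1L, n_max = 2L)

# Vocabulary-based: one column per distinct n-gram.
vocab <- sort(unique(unlist(ngrams)))
dtm_vocab <- t(vapply(
  ngrams,
  function(g) as.integer(table(factor(g, levels = vocab))),
  integer(length(vocab))
))
colnames(dtm_vocab) <- vocab

# Hash-based: a fixed number of columns, no vocabulary needed.
n_buckets <- 16
dtm_hash <- t(vapply(
  ngrams,
  function(g) {
    idx <- vapply(g, hash_term, numeric(1), n_buckets = n_buckets)
    tabulate(idx, nbins = n_buckets)
  },
  integer(n_buckets)
))
```

Both matrices have one row per document. The hashed matrix keeps the same row totals as the vocabulary-based one, but its columns can no longer be mapped back to individual terms.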
The core of this package is carefully written in C++, which means text2vec is fast and memory friendly. Some parts (GloVe training) are fully parallelized using the excellent RcppParallel package. This means that parallel processing works on OS X, Linux, Windows and Solaris (x86) without any additional hacking or tricks. In addition, there is a higher-level parallelization for text vectorization and vocabulary construction on top of the foreach package, and text2vec has a streaming API so that users don't have to load all of the data into RAM.
The API is built around the iterator abstraction and is deliberately concise: it provides only a few functions, each of which does its job well. The package does not (and probably will not in the future) provide trivial, very high-level functions, but other packages can build on top of the framework that text2vec provides.
The package has an issue tracker on GitHub where I'm filing feature requests and notes for future work. Any ideas are appreciated.
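The following base R sketch shows why an iterator-based design lets a corpus be processed without loading it into RAM. It is not the text2vec API: `chunk_iterator` and `count_terms` are hypothetical helpers, and a `textConnection` stands in for a large file on disk. Only one chunk of lines is held in memory at a time, while the term counts are accumulated across chunks.

```r
# Base R illustration only; these helpers are not part of text2vec.

# Returns a closure that yields up to `chunk_size` lines per call,
# and NULL once the connection is exhausted.
chunk_iterator <- function(con, chunk_size = 2L) {
  function() {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0L) NULL else lines
  }
}

# Consume an iterator chunk by chunk, accumulating term counts
# in a named integer vector.
count_terms <- function(next_chunk) {
  counts <- integer(0)
  while (!is.null(chunk <- next_chunk())) {
    chunk_tokens <- unlist(strsplit(tolower(chunk), " ", fixed = TRUE))
    chunk_counts <- table(chunk_tokens)
    terms <- names(chunk_counts)
    # Register terms seen for the first time, then add this chunk's counts.
    counts[setdiff(terms, names(counts))] <- 0L
    counts[terms] <- counts[terms] + as.integer(chunk_counts)
  }
  counts
}

# A text connection stands in for a file that is too large to read at once.
con <- textConnection(c("the cat sat", "the cat ate the fish", "a fish swam"))
counts <- count_terms(chunk_iterator(con, chunk_size = 2L))
close(con)
```

Because the consumer only ever asks for the next chunk, the same `count_terms` function would work unchanged with any other source that follows the same convention, which is the sense in which such an API is source-agnostic.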
Contributors are welcome. You can help by
Functions in text2vec
|check_analogy_accuracy||Checks accuracy of word embeddings on the analogy task|
|reexports||Objects exported from other packages|
|get_dtm||Extract document-term matrix|
|HashCorpus||Rcpp module: HashCorpus Exposes C++ functions to construct hashed Document-Term Matrix|
|VocabCorpus||Rcpp module: VocabCorpus Exposes C++ functions to construct Document-Term Matrix|
|ifiles||Creates iterator over text files from the disk|
|ilines||Creates iterator over the lines of a connection or file|
|tokenizers||Tokenization functions, which perform string splitting|
|text2vec||text2vec is a package that provides an efficient framework with a concise API for text analysis and natural language processing in R.|
|itoken||Iterators over input objects|
|split_into||Split a vector for parallel processing|
|get_tcm||Extract term-co-occurrence matrix|
|get_idf||Inverse document-frequency scaling matrix|
|VocabularyBuilder||Rcpp module: VocabularyBuilder Exposes C++ functions to construct Vocabulary|
|create_dtm||Document-term matrix construction|
|vectorizers||Vocabulary and hash vectorizers|
|GloveFitter||Rcpp module: GloveFitter Exposes C++ functions to fit GloVe model|
|prepare_analogy_questions||Prepares list of analogy questions|
|glove||Fit a GloVe word-embedded model|
|get_tf||Term-frequency scaling matrix|
|transform_filter_commons||Remove terms from a document-term matrix|
|transform_tf||Scale a document-term matrix|
|create_corpus||Create a corpus|
|create_vocabulary||Creates a vocabulary of unique terms|
|create_tcm||Term-co-occurrence matrix construction|
|movie_review||IMDB movie reviews|
|License||MIT + file LICENSE|
|SystemRequirements||GNU make, C++11|
|LinkingTo||Rcpp, RcppParallel, digest|
|Packaged||2016-03-31 18:58:07 UTC; dmitryselivanov|
|depends||base (>= 3.2.0), methods, R (>= 3.2.0)|
|imports||data, digest (>= 0.6.8), foreach (>= 1.4.3), iterators (>= 1.0.8), magrittr (>= 1.5), Matrix (>= 1.1), Rcpp (>= 0.11), RcppParallel (>= 4.3.14), stringr (>= 1.0.0)|
|suggests||glmnet, knitr, parallel, rmarkdown, testthat|
|Contributors||Dmitriy Selivanov, Lincoln Mullen|