text2vec v0.6
Modern Text Mining Framework for R
Fast and memory-friendly tools for text vectorization, topic
modeling (LDA, LSA), word embeddings (GloVe), similarities. This package
provides a source-agnostic streaming API, which allows researchers to perform
analysis of collections of documents which are larger than available RAM. All
core functions are parallelized to benefit from multicore machines.
Readme
Author: Dmitriy Selivanov
You've just discovered text2vec!
text2vec is an R package which provides an efficient framework with a concise API for text analysis and natural language processing (NLP).
Goals we aimed to achieve in developing text2vec:
- Concise - expose as few functions as possible
- Consistent - expose unified interfaces, so there is no need to learn a new interface for each task
- Flexible - make complex tasks easy to solve
- Fast - maximize efficiency per single thread, transparently scale to multiple threads on multicore machines
- Memory efficient - use streams and iterators, and avoid keeping data in RAM where possible
See the API section for details.
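As a rough illustration of the unified interface, the sketch below builds a document-term matrix from a tiny in-memory corpus using functions listed on this page (`itoken`, `create_vocabulary`, `create_dtm`) together with `vocab_vectorizer` and `word_tokenizer`. The three example documents and their ids are made up for this sketch and are not part of the package.

```r
library(text2vec)

# Toy corpus (hypothetical data, for illustration only)
docs <- c(
  doc1 = "The quick brown fox",
  doc2 = "The lazy dog",
  doc3 = "The quick dog"
)

# An iterator over tokens: lowercase each document, then split it into words
it <- itoken(docs,
             preprocessor = tolower,
             tokenizer = word_tokenizer,
             ids = names(docs),
             progressbar = FALSE)

# Collect the unique terms, then map documents onto that vocabulary
vocab <- create_vocabulary(it)
vectorizer <- vocab_vectorizer(vocab)

# Sparse document-term matrix: one row per document, one column per term
dtm <- create_dtm(it, vectorizer)
dim(dtm)
```

The same iterator is passed to both `create_vocabulary` and `create_dtm`, which is the pattern the rest of the package follows: most steps take an iterator rather than the raw text.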
Performance
This package is efficient because it is carefully written in C++, which also means that text2vec is memory friendly. Some parts are fully parallelized using OpenMP.
Other embarrassingly parallel tasks (such as vectorization) can use any fork-based parallel backend on UNIX-like machines. They can achieve near-linear scalability with the number of available cores.
Finally, a streaming API means that users do not have to load all the data into RAM.
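A minimal sketch of the streaming idea, assuming a corpus stored as plain text files with one document per line: `ifiles` creates an iterator over the files, and `itoken` wraps it so that text is read and tokenized file by file. The temporary files and their contents are invented for this example, and the default `readLines` reader is assumed to be adequate for them.

```r
library(text2vec)

# Write two small files to a temporary directory; each line is one document.
# (Hypothetical data, standing in for a corpus too large to hold in RAM.)
corpus_dir <- tempfile("corpus_")
dir.create(corpus_dir)
writeLines(c("first document about cats", "second document about dogs"),
           file.path(corpus_dir, "a.txt"))
writeLines(c("third document about birds", "fourth document about cats"),
           file.path(corpus_dir, "b.txt"))
files <- list.files(corpus_dir, full.names = TRUE)

# Iterator over files; the default reader is readLines
it_files <- ifiles(files)

# Token iterator on top of the file iterator
it <- itoken(it_files,
             preprocessor = tolower,
             tokenizer = word_tokenizer,
             progressbar = FALSE)

# Vocabulary and document-term matrix are built by passing over the files
vocab <- create_vocabulary(it)
dtm <- create_dtm(it, vocab_vectorizer(vocab))
dim(dtm)
```

For files that are not one-document-per-line, a custom `reader` function can be passed to `ifiles`; what that function should return is best checked against the `ifiles` help page.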
Contributing
The package has an issue tracker on GitHub where I file feature requests and notes for future work. Any ideas are appreciated.
Contributors are welcome. You can help by:
- testing and leaving feedback on the GitHub issue tracker (preferably) or directly by e-mail
- forking and contributing (check our code style guide). Vignettes, docs, tests, and use cases are very welcome
- giving the project a star on its GitHub page :-)
License
GPL (>= 2)
Functions in text2vec
Name | Description | |
combine_vocabularies | Combines multiple vocabularies into one | |
normalize | Matrix normalization | |
LatentSemanticAnalysis | Latent Semantic Analysis model | |
create_dtm | Document-term matrix construction | |
similarities | Pairwise Similarity Matrix Computation | |
split_into | Split a vector for parallel processing | |
prepare_analogy_questions | Prepares list of analogy questions | |
perplexity | Perplexity of a topic model | |
itoken | Iterators (and parallel iterators) over input objects | |
RelaxedWordMoversDistance | Creates Relaxed Word Movers Distance (RWMD) model | |
jsPCA_robust | (numerically robust) Dimension reduction via Jensen-Shannon Divergence & Principal Components | |
create_tcm | Term-co-occurrence matrix construction | |
vectorizers | Vocabulary and hash vectorizers | |
distances | Pairwise Distance Matrix Computation | |
create_vocabulary | Creates a vocabulary of unique terms | |
text2vec | text2vec | |
tokenizers | Simple tokenization functions for string splitting | |
ifiles | Creates iterator over text files from the disk | |
prune_vocabulary | Prune vocabulary | |
reexports | Objects exported from other packages | |
as.lda_c | Converts document-term matrix sparse matrix to 'lda_c' format | |
coherence | Coherence metrics for topic models | |
BNS | BNS | |
check_analogy_accuracy | Checks accuracy of word embeddings on the analogy task | |
GloVe | re-export rsparse::GloVe | |
TfIdf | TfIdf | |
LatentDirichletAllocation | Creates Latent Dirichlet Allocation model. | |
Collocations | Collocations model. | |
movie_review | IMDB movie reviews | |
Vignettes of text2vec
Name | ||
files-multicore.Rmd | ||
glove.Rmd | ||
text-vectorization.Rmd | ||
Details
Type | Package |
License | GPL (>= 2) | file LICENSE |
Encoding | UTF-8 |
SystemRequirements | C++11 |
LinkingTo | Rcpp, digest (>= 0.6.8) |
URL | http://text2vec.org |
BugReports | https://github.com/dselivanov/text2vec/issues |
VignetteBuilder | knitr |
LazyData | true |
RoxygenNote | 6.1.1 |
NeedsCompilation | yes |
Packaged | 2020-02-18 06:09:05 UTC; dselivanov |
Repository | CRAN |
Date/Publication | 2020-02-18 14:20:03 UTC |
suggests | covr , glmnet , knitr , magrittr , proxy , rmarkdown , testthat , udpipe (>= 0.6) |
imports | data.table (>= 1.9.6) , digest (>= 0.6.8) , lgr (>= 0.2) , Matrix (>= 1.1) , mlapi (>= 0.1.0) , R6 (>= 2.3.0) , Rcpp (>= 1.0.3) , rsparse (>= 0.3.3.4) , stringi (>= 1.1.5) |
depends | methods , R (>= 3.6.0) |
Contributors | Manuel Bickel, Qing Wang |