text2vec v0.6


Modern Text Mining Framework for R

Fast and memory-friendly tools for text vectorization, topic modeling (LDA, LSA), word embeddings (GloVe), similarities. This package provides a source-agnostic streaming API, which allows researchers to perform analysis of collections of documents which are larger than available RAM. All core functions are parallelized to benefit from multicore machines.

Readme


Author: Dmitriy Selivanov

You've just discovered text2vec!

text2vec is an R package which provides an efficient framework with a concise API for text analysis and natural language processing (NLP).

The goals we set for text2vec:

  • Concise - expose as few functions as possible
  • Consistent - expose unified interfaces, so there is no need to learn a new interface for each task
  • Flexible - make complex tasks easy to solve
  • Fast - maximize efficiency per thread and scale transparently to multiple threads on multicore machines
  • Memory efficient - use streams and iterators, and avoid keeping data in RAM where possible

See API section for details.
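As a rough illustration of the concise, iterator-based interface described in the goals, the sketch below builds a document-term matrix from the bundled movie_review data and applies a TF-IDF transformation. The subset size (200 reviews) and the pruning threshold are arbitrary choices made for this example, not recommendations from the package.

```r
library(text2vec)

# Bundled IMDB reviews; a small subset keeps the example quick.
# (The subset size of 200 is an arbitrary choice for illustration.)
data("movie_review")
reviews <- movie_review[1:200, ]

# Iterator over tokens: lowercase each review, then split it into words.
it <- itoken(reviews$review,
             preprocessor = tolower,
             tokenizer = word_tokenizer,
             ids = reviews$id,
             progressbar = FALSE)

# Vocabulary of unique terms, with terms seen only once dropped.
vocab <- create_vocabulary(it)
vocab <- prune_vocabulary(vocab, term_count_min = 2)

# Sparse document-term matrix with one row per review.
dtm <- create_dtm(it, vocab_vectorizer(vocab))

# Models share one fit/transform interface; TfIdf is used here as an example.
tfidf <- TfIdf$new()
dtm_tfidf <- fit_transform(dtm, tfidf)
```

The same iterator is reused for both the vocabulary pass and the matrix pass, so the tokenization settings only need to be declared once.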

Performance

[Image: htop]

This package is efficient because it is carefully written in C++, which also means that text2vec is memory friendly. Some parts are fully parallelized using OpenMP.

Other embarrassingly parallel tasks (such as vectorization) can use any fork-based parallel backend on UNIX-like machines and can achieve near-linear scalability with the number of available cores.

Finally, a streaming API means that users do not have to load all the data into RAM.
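A minimal sketch of the streaming idea, assuming the documents live in text files on disk: ifiles iterates over file paths and itoken tokenizes what each file yields, so the corpus is not held in a single in-memory object. The two temporary files and their contents are hypothetical inputs invented for this example.

```r
library(text2vec)

# Hypothetical input: two small text files in a temporary directory.
# Each line of a file is treated as one document.
doc_dir <- tempfile("docs_")
dir.create(doc_dir)
writeLines(c("the cat sat", "the dog ran"), file.path(doc_dir, "a.txt"))
writeLines("a cat ran", file.path(doc_dir, "b.txt"))
paths <- list.files(doc_dir, full.names = TRUE)

# Iterator over files (read with readLines by default),
# wrapped by an iterator over tokens.
it_files <- ifiles(paths)
it_tokens <- itoken(it_files,
                    preprocessor = tolower,
                    tokenizer = word_tokenizer,
                    progressbar = FALSE)

# Both passes pull documents through the iterator rather than
# from a character vector loaded up front.
vocab <- create_vocabulary(it_tokens)
dtm <- create_dtm(it_tokens, vocab_vectorizer(vocab))
```

For real data the list of paths would point at much larger files, and a custom reader function can be passed to ifiles when the files are not plain text with one document per line.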

Contributing

The package has an issue tracker on GitHub, where I file feature requests and notes for future work. Any ideas are appreciated.

Contributors are welcome. You can help by:

License

GPL (>= 2)

Functions in text2vec

| Name | Description |
|---|---|
| combine_vocabularies | Combines multiple vocabularies into one |
| normalize | Matrix normalization |
| LatentSemanticAnalysis | Latent Semantic Analysis model |
| create_dtm | Document-term matrix construction |
| similarities | Pairwise Similarity Matrix Computation |
| split_into | Split a vector for parallel processing |
| prepare_analogy_questions | Prepares list of analogy questions |
| perplexity | Perplexity of a topic model |
| itoken | Iterators (and parallel iterators) over input objects |
| RelaxedWordMoversDistance | Creates Relaxed Word Movers Distance (RWMD) model |
| jsPCA_robust | (numerically robust) Dimension reduction via Jensen-Shannon Divergence & Principal Components |
| create_tcm | Term-co-occurrence matrix construction |
| vectorizers | Vocabulary and hash vectorizers |
| distances | Pairwise Distance Matrix Computation |
| create_vocabulary | Creates a vocabulary of unique terms |
| text2vec | text2vec |
| tokenizers | Simple tokenization functions for string splitting |
| ifiles | Creates iterator over text files from the disk |
| prune_vocabulary | Prune vocabulary |
| reexports | Objects exported from other packages |
| as.lda_c | Converts a sparse document-term matrix to 'lda_c' format |
| coherence | Coherence metrics for topic models |
| BNS | BNS |
| check_analogy_accuracy | Checks accuracy of word embeddings on the analogy task |
| GloVe | re-export rsparse::GloVe |
| TfIdf | TfIdf |
| LatentDirichletAllocation | Creates Latent Dirichlet Allocation model. |
| Collocations | Collocations model. |
| movie_review | IMDB movie reviews |

Vignettes of text2vec

Name
files-multicore.Rmd
glove.Rmd
text-vectorization.Rmd

Details

Type: Package
License: GPL (>= 2) | file LICENSE
Encoding: UTF-8
SystemRequirements: C++11
LinkingTo: Rcpp, digest (>= 0.6.8)
URL: http://text2vec.org
BugReports: https://github.com/dselivanov/text2vec/issues
VignetteBuilder: knitr
LazyData: true
RoxygenNote: 6.1.1
NeedsCompilation: yes
Packaged: 2020-02-18 06:09:05 UTC; dselivanov
Repository: CRAN
Date/Publication: 2020-02-18 14:20:03 UTC
