Note: a newer version of this package (0.6.4) is available.

You've just discovered text2vec!

text2vec is an R package which provides an efficient framework with a concise API for text analysis and natural language processing (NLP).

text2vec was developed with the following goals:

  • Concise - expose as few functions as possible
  • Consistent - expose unified interfaces, so there is no need to learn a new interface for each task
  • Flexible - make complex tasks easy to solve
  • Fast - maximize efficiency per single thread and scale transparently to multiple threads on multicore machines
  • Memory efficient - use streams and iterators and avoid keeping data in RAM where possible

Tutorials

To learn how to use this package, see text2vec.org and the package vignettes. See also the text2vec articles on my blog.

Features

The core functionality at the moment includes:

  1. Fast text vectorization on arbitrary n-grams, using vocabulary or feature hashing.
  2. GloVe word embeddings.
  3. Topic modeling with:
       • Latent Dirichlet Allocation
       • Latent Semantic Analysis
  4. Similarities/distances between 2 matrices.
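
As a rough illustration of the first and last items, the sketch below vectorizes a small subset of the bundled movie_review data with both a vocabulary and feature hashing, then computes cosine similarities between two groups of documents. The subset size, the pruning threshold, the hash size, and the variable names are arbitrary choices made for this example, not recommendations from the package author. Argument names can differ between releases, so check the documentation of the installed version.

```r
library(text2vec)

data("movie_review")              # IMDB reviews shipped with the package
reviews <- movie_review[1:500, ]  # small subset (assumption: keeps the example quick)

# Iterator over lower-cased, word-tokenized reviews
it <- itoken(reviews$review,
             preprocessor = tolower,
             tokenizer = word_tokenizer,
             ids = reviews$id,
             progressbar = FALSE)

# Vocabulary-based vectorization on unigrams and bigrams
vocab <- create_vocabulary(it, ngram = c(1L, 2L))
vocab <- prune_vocabulary(vocab, term_count_min = 5L)  # drop rare terms
dtm_vocab <- create_dtm(it, vocab_vectorizer(vocab))

# Feature hashing: no vocabulary pass, fixed number of columns
dtm_hash <- create_dtm(it, hash_vectorizer(hash_size = 2^14, ngram = c(1L, 2L)))

# Cosine similarity between documents 1-10 and documents 11-20
cos_sim <- sim2(dtm_vocab[1:10, ], dtm_vocab[11:20, ],
                method = "cosine", norm = "l2")

dim(dtm_vocab)
dim(dtm_hash)
dim(cos_sim)
```

The vocabulary route needs one extra pass over the data but keeps readable column names. The hashing route skips that pass at the cost of possible collisions between terms.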

Performance

The author of the package is a little bit obsessed with efficiency.

This package is efficient because it is carefully written in C++, which also keeps text2vec memory friendly. Some parts, such as training GloVe word embeddings, are fully parallelized using the excellent RcppParallel package, so word embeddings are computed in parallel on OS X, Linux, Windows, and Solaris (x86) without any additional tuning or tricks. Other embarrassingly parallel tasks, such as vectorization, can use any parallel backend that supports the foreach package, and can therefore achieve near-linear scalability with the number of available cores. Finally, the streaming API means that users do not have to load all the data into RAM.

Contributing

The package has an issue tracker on GitHub where I file feature requests and notes for future work. Any ideas are appreciated.

Contributors are welcome.

License

GPL (>= 2)

Install

install.packages('text2vec')

Monthly Downloads

8,738

Version

0.4.0

License

GPL (>= 2) | file LICENSE

Maintainer

Dmitriy Selivanov

Last Published

October 4th, 2016

Functions in text2vec (0.4.0)

reexports

Objects exported from other packages
similarities

Pairwise Similarity Matrix Computation
get_tf

Term-frequency scaling matrix
glove

Fit a GloVe word-embedding model
transform_tf

Scale a document-term matrix
transform

Transforms a matrix-like object using a model
create_tcm

Term-co-occurrence matrix construction
distances

Pairwise Distance Matrix Computation
create_dtm

Document-term matrix construction
create_vocabulary

Creates a vocabulary of unique terms
create_corpus

Create a corpus
get_dtm

Extract document-term matrix
fit

Fits model to data
fit_transform

Fit model to data, then transform it
as.lda_c

Converts a sparse document-term matrix to 'lda_c' format
ifiles

Creates an iterator over text files on disk
itoken

Iterators over input objects
get_idf

Inverse document-frequency scaling matrix
check_analogy_accuracy

Checks accuracy of word embeddings on the analogy task
get_tcm

Extract term-co-occurrence matrix
prepare_analogy_questions

Prepares list of analogy questions
movie_review

IMDB movie reviews
prune_vocabulary

Prune vocabulary
normalize

Matrix normalization
tokenizers

Simple tokenization functions, which perform string splitting
split_into

Split a vector for parallel processing
transform_filter_commons

Remove terms from a document-term matrix
text2vec

text2vec
vectorizers

Vocabulary and hash vectorizers
LatentSemanticAnalysis

Latent Semantic Analysis model
RelaxedWordMoversDistance

Creates a model that can be used to calculate the "relaxed word mover's distance".
TfIdf

TfIdf
GlobalVectors

Creates a Global Vectors word-embedding model.
LatentDirichletAllocation

Creates a Latent Dirichlet Allocation model.