mlvocab v0.1



Vocabulary and Corpus Preprocessing for Natural Language Pipelines

Utilities for preprocessing text corpora into data structures suitable for natural language models: integer sequences or matrices, vocabulary embedding matrices, and term-document, document-term and term co-occurrence matrices. All functions allow for full or partial hashing of the terms in the vocabulary.



Corpus and Vocabulary Preprocessing Utilities for Natural Language Pipelines (an R package)

The following two-step abstraction is provided by the package:

  1. The vocabulary object is first built from the entire corpus with the help of the vocab(), update_vocab() and prune_vocab() functions.
  2. The vocabulary is then passed alongside the corpus to a variety of corpus preprocessing functions. Most mlvocab functions accept an nbuckets argument for partial or full hashing of the corpus.
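
The two steps above can be sketched in base R. This is only a conceptual illustration and does not call mlvocab or reproduce its implementation: the whitespace tokenizer, the `toy_hash()` helper, the `to_indices()` helper, the pruning threshold and the sample corpus are all assumptions made for the example. Known terms map to their vocabulary position, and unknown terms are hashed into a fixed number of extra buckets, which is the idea behind partial hashing.

```r
# Toy corpus: a named list of documents (hypothetical data).
corpus <- list(
  a = "the quick brown fox",
  b = "the lazy dog",
  c = "the quick dog barks"
)

# Whitespace tokenizer (an assumption; the package may tokenize differently).
tokenize <- function(doc) strsplit(doc, "[[:space:]]+")[[1]]
tokens <- lapply(corpus, tokenize)

# Step 1: build the vocabulary from the entire corpus, then prune rare terms.
term_count <- table(unlist(tokens))
vocab_terms <- names(term_count)[term_count >= 2]

# Toy string hash (sum of code points modulo nbuckets), for illustration only.
toy_hash <- function(term, nbuckets) sum(utf8ToInt(term)) %% nbuckets + 1L

# Step 2: map each document to integer indices. Known terms get their
# vocabulary position; unknown terms are hashed into `nbuckets` extra slots
# placed after the vocabulary.
to_indices <- function(toks, vocab_terms, nbuckets) {
  ix <- match(toks, vocab_terms)
  unknown <- is.na(ix)
  ix[unknown] <- length(vocab_terms) +
    vapply(toks[unknown], toy_hash, integer(1), nbuckets = nbuckets)
  as.integer(ix)
}

seqs <- lapply(tokens, to_indices, vocab_terms = vocab_terms, nbuckets = 4L)
print(seqs)
```

With full hashing there would be no vocabulary lookup at all: every term would go through the hash, which bounds memory at the cost of collisions.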

Current functionality includes:

  • term index sequences: tix_seq(), tix_mat() and tix_df() produce integer sequences suitable for direct consumption by various sequence models.
  • term matrices: dtm(), tdm() and tcm() create document-term, term-document and term-co-occurrence matrices respectively.
  • subsetting embedding matrices: given pre-trained word vectors, prune_embeddings() creates smaller embedding matrices, treating missing and unknown vocabulary terms with grace.
  • tfidf weighting: tfidf() computes various versions of term frequency-inverse document frequency weighting of dtm and tdm matrices.
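
To make the term-matrix and tf-idf items concrete, here is a minimal base R sketch that builds a dense document-term matrix and applies one common tf-idf variant, tf × log(N / df). It does not use mlvocab: the sample documents and the variable names are hypothetical, and the weighting formula is an assumption, since the package's tfidf() is described as offering several versions.

```r
# Pre-tokenized toy documents (hypothetical data).
docs <- list(
  a = c("the", "quick", "fox"),
  b = c("the", "lazy", "dog"),
  c = c("the", "quick", "dog", "dog")
)
terms <- sort(unique(unlist(docs)))

# Document-term matrix: one row per document, one column per term.
dtm_mat <- t(vapply(docs, function(toks) {
  as.numeric(table(factor(toks, levels = terms)))
}, numeric(length(terms))))
colnames(dtm_mat) <- terms

# The term-document matrix is the transpose of the document-term matrix.
tdm_mat <- t(dtm_mat)

# Document frequency: number of documents in which each term occurs.
df <- colSums(dtm_mat > 0)

# Inverse document frequency, then tf-idf (assumed variant: tf * log(N / df)).
idf <- log(nrow(dtm_mat) / df)
tfidf_mat <- sweep(dtm_mat, 2, idf, `*`)
print(round(tfidf_mat, 3))
```

Under this variant a term that occurs in every document, such as "the" here, receives zero weight, while terms concentrated in few documents are weighted up.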


The package is in alpha state. API changes are likely.

Functions in mlvocab

Name              Description
dplyr_methods     Methods for dplyr predicates
mlvocab-package   mlvocab package
term_matrices     Term-document and term-cooccurrence matrices
tfidf             Tfidf re-weighting of dtm and tdm matrices
vocab             Build and manipulate vocabularies
prune_embeddings  Subset embedding matrix using vocab terms
term_indices      Term Indices: Convert text to integer indices


License GPL-3
Encoding UTF-8
LinkingTo Rcpp (>= 0.12.9), digest (>= 0.6.8), sparsepp (>= 0.2.0)
LazyData true
SystemRequirements C++11 with support for regex (such as GCC 4.9 or later, > 5 preferred)
RoxygenNote 6.1.0
NeedsCompilation yes
Packaged 2018-09-17 17:50:05 UTC; vspinu
Repository CRAN
Date/Publication 2018-09-18 08:40:02 UTC
