Vocabulary and Corpus Preprocessing for Natural Language Pipelines

Utilities for preprocessing text corpora into data structures suitable for natural language models: integer sequences or matrices, vocabulary embedding matrices, and term-document, document-term and term co-occurrence matrices. All functions allow for full or partial hashing of the terms in the vocabulary.


Corpus and Vocabulary Preprocessing Utilities for Natural Language Pipelines (an R package)

The following two-step abstraction is provided by the package:

  1. The vocabulary object is first built from the entire corpus using the vocab(), update_vocab() and prune_vocab() functions.
  2. The vocabulary is then passed, together with the corpus, to a variety of corpus preprocessing functions. Most mlvocab functions accept an nbuckets argument for partial or full hashing of the corpus.
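
The two steps above can be illustrated with a small base R sketch. It does not call mlvocab and is not the package's implementation; the names tokens, kept_terms, toy_hash and term_index are hypothetical, and the character-code hash is a toy stand-in for whatever hash function the package actually uses. The sketch assumes that partial hashing means terms outside the pruned vocabulary are mapped into one of nbuckets extra indices placed after the vocabulary indices.

```r
# Toy corpus: a named character vector, one document per element.
corpus <- c(a = "the quick brown fox",
            b = "the lazy dog",
            c = "the quick dog barks")

# Step 1: build a vocabulary (term counts) from the whole corpus, then prune it.
tokens <- strsplit(corpus, "[[:space:]]+")
counts <- table(unlist(tokens))
kept_terms <- names(counts)[as.vector(counts) >= 2]  # keep terms seen at least twice

# Step 2: map every document to integer term indices using the vocabulary.
# Terms not in the vocabulary are hashed into one of `nbuckets` extra indices.
nbuckets <- 2L

# Toy hash (an assumption for illustration): sum of code points modulo nbuckets.
toy_hash <- function(term) sum(utf8ToInt(term)) %% nbuckets

term_index <- function(term) {
  ix <- match(term, kept_terms)
  if (is.na(ix)) length(kept_terms) + 1L + toy_hash(term) else ix
}

seqs <- lapply(tokens, function(doc) {
  vapply(doc, term_index, integer(1), USE.NAMES = FALSE)
})

print(kept_terms)
print(seqs)
```

With nbuckets set to 0 and a non-empty vocabulary there would be no slots for unknown terms, and with an empty vocabulary every term would land in a bucket, which is one way to read "full hashing". Both readings are assumptions about the design rather than statements about the package.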

Current functionality includes:

  • term index sequences: tix_seq(), tix_mat() and tix_df() produce integer sequences suitable for direct consumption by various sequence models.
  • term matrices: dtm(), tdm() and tcm() create document-term, term-document and term-co-occurrence matrices respectively.
  • subsetting embedding matrices: given pre-trained word vectors, prune_embeddings() creates smaller embedding matrices, handling missing and unknown vocabulary terms gracefully.
  • tfidf weighting: tfidf() computes various versions of term frequency-inverse document frequency (tf-idf) weighting of dtm and tdm matrices.
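
As a rough illustration of what a document-term matrix and its tf-idf re-weighting contain, the following base R sketch builds both by hand. It does not use dtm(), tdm() or tfidf(); the names doc_term, term_doc, doc_freq, idf and tfidf_weights are hypothetical. The weighting shown (raw counts times log(N / document frequency)) is only one common tf-idf variant, chosen here as an assumption, and is not necessarily what the package computes by default.

```r
# Toy corpus: a named character vector, one document per element.
corpus <- c(a = "the quick brown fox",
            b = "the lazy dog",
            c = "the quick dog barks")

tokens <- strsplit(corpus, "[[:space:]]+")
terms <- sort(unique(unlist(tokens)))

# Document-term matrix: one row per document, one column per term,
# each cell holding the count of that term in that document.
doc_term <- t(vapply(
  tokens,
  function(doc) as.numeric(table(factor(doc, levels = terms))),
  numeric(length(terms))
))
colnames(doc_term) <- terms

# Term-document matrix is simply the transpose.
term_doc <- t(doc_term)

# One common tf-idf variant (an assumption, see the text above):
# tf = raw count, idf = log(number of documents / document frequency).
n_docs <- nrow(doc_term)
doc_freq <- colSums(doc_term > 0)
idf <- log(n_docs / doc_freq)
tfidf_weights <- sweep(doc_term, 2, idf, `*`)

print(doc_term)
print(round(tfidf_weights, 3))
```

In this variant a term that occurs in every document, such as "the" in the toy corpus, receives a weight of zero, while terms confined to a single document receive the largest weights.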


The package is in an alpha state. API changes are likely.

Functions in mlvocab

Name              Description
dplyr_methods     Methods for dplyr predicates
mlvocab-package   mlvocab package
term_matrices     Term-document and term-cooccurrence matrices
tfidf             Tfidf re-weighting of dtm and tdm matrices
vocab             Build and manipulate vocabularies
prune_embeddings  Subset embedding matrix using vocab terms
term_indices      Term Indices: Convert text to integer indices

License: GPL-3
Encoding: UTF-8
LinkingTo: Rcpp (>= 0.12.9), digest (>= 0.6.8), sparsepp (>= 0.2.0)
LazyData: true
SystemRequirements: C++11 with support for regex (such as GCC 4.9 or later, > 5 preferred)
RoxygenNote: 6.1.0
NeedsCompilation: yes
Packaged: 2018-09-17 17:50:05 UTC; vspinu
Repository: CRAN
Date/Publication: 2018-09-18 08:40:02 UTC

