Corpus and Vocabulary Preprocessing Utilities for Natural Language Pipelines (an R package)

The package provides a two-step abstraction:

  1. The vocabulary object is first built from the entire corpus using the vocab(), update_vocab() and prune_vocab() functions.
  2. The vocabulary is then passed, together with the corpus, to a variety of corpus pre-processing functions. Most mlvocab functions accept an nbuckets argument for partial or full hashing of the corpus.
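
The two steps above can be illustrated with a minimal base R sketch. It does not call mlvocab; build_vocab(), prune_by_count(), toy_hash() and index_corpus() are hypothetical helpers written only for this example, and the whitespace tokenizer and the hash are assumptions. "Partial hashing" is read here as hashing only out-of-vocabulary terms into nbuckets extra indices, which may differ from what the package actually does.

```r
# Toy corpus: one character string per document.
corpus <- c(d1 = "the cat sat on the mat",
            d2 = "the dog sat",
            d3 = "a cat and a dog")

# Whitespace tokenizer (an assumption; real pipelines may tokenize differently).
tokenize <- function(x) strsplit(x, "[[:space:]]+")

# Step 1: build a vocabulary holding term counts and document counts.
build_vocab <- function(corpus) {
  toks <- tokenize(corpus)
  term_count <- table(unlist(toks))
  doc_count <- table(unlist(lapply(toks, unique)))
  terms <- names(term_count)
  data.frame(term = terms,
             term_count = as.integer(term_count),
             doc_count = as.integer(doc_count[terms]),
             stringsAsFactors = FALSE)
}

# Hypothetical pruning helper: keep terms seen at least min_count times.
prune_by_count <- function(v, min_count = 2L) {
  v[v$term_count >= min_count, , drop = FALSE]
}

# Toy hash (not the hash of any real package): sum of code points modulo nbuckets.
toy_hash <- function(term, nbuckets) {
  vapply(term, function(t) sum(utf8ToInt(t)) %% nbuckets,
         numeric(1), USE.NAMES = FALSE)
}

# Step 2: pass the vocabulary alongside the corpus to a pre-processing function.
# Known terms map to their row in the vocabulary; unknown terms are hashed
# into one of nbuckets extra indices placed after the vocabulary.
index_corpus <- function(corpus, v, nbuckets = 0L) {
  lapply(tokenize(corpus), function(tok) {
    ix <- match(tok, v$term)
    unknown <- is.na(ix)
    if (nbuckets > 0L && any(unknown)) {
      ix[unknown] <- nrow(v) + 1L + toy_hash(tok[unknown], nbuckets)
    }
    as.integer(ix)
  })
}

v <- prune_by_count(build_vocab(corpus), min_count = 2L)
ix <- index_corpus(corpus, v, nbuckets = 4L)
```

With nbuckets = 0 the unknown terms stay NA in this sketch; with a positive nbuckets every token receives an index between 1 and nrow(v) + nbuckets, so the result can be fed to a model with a fixed-size input layer.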

Current functionality includes:

  • term index sequences: tix_seq(), tix_mat() and tix_df() produce integer sequences suitable for direct consumption by various sequence models.

  • term matrices: dtm(), tdm() and tcm() create document-term, term-document and term-co-occurrence matrices, respectively.

  • subsetting embedding matrices: given pre-trained word vectors, prune_embeddings() creates smaller embedding matrices, handling missing and unknown vocabulary terms gracefully.

  • tfidf weighting: tfidf() computes various versions of term frequency-inverse document frequency (tf-idf) weighting of dtm and tdm matrices.
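
To make the outputs listed above concrete, here is a small base R sketch of the same ideas on a toy corpus. It is not the mlvocab implementation and calls none of its functions: subset_embeddings() is a hypothetical helper, the tf-idf variant shown (row-normalized term frequency times log(N / document frequency)) is only one of many, and filling missing terms with zero rows is an assumption about what "graceful" handling could look like.

```r
corpus <- c(d1 = "the cat sat on the mat",
            d2 = "the dog sat",
            d3 = "a cat and a dog")
toks <- strsplit(corpus, "[[:space:]]+")
terms <- sort(unique(unlist(toks)))

# Term index sequences: each document becomes a vector of integer term ids.
tix <- lapply(toks, match, table = terms)

# Document-term matrix: rows are documents, columns are terms, cells are counts.
dtm <- t(vapply(toks,
                function(tok) tabulate(match(tok, terms), nbins = length(terms)),
                integer(length(terms))))
colnames(dtm) <- terms

# Term-document matrix: the transpose of the document-term matrix.
tdm <- t(dtm)

# One common tf-idf variant: row-normalized tf times log(N / df).
tf <- dtm / rowSums(dtm)
idf <- log(nrow(dtm) / colSums(dtm > 0))
weighted <- sweep(tf, 2, idf, `*`)

# Toy "pre-trained" embeddings with terms as rows (layout is an assumption).
emb <- matrix(seq_len(12) / 10, nrow = 4,
              dimnames = list(c("cat", "dog", "sat", "the"), NULL))

# Hypothetical helper: one row per requested term, zeros where no vector exists.
subset_embeddings <- function(emb, terms) {
  out <- matrix(0, nrow = length(terms), ncol = ncol(emb),
                dimnames = list(terms, colnames(emb)))
  known <- terms %in% rownames(emb)
  out[known, ] <- emb[terms[known], , drop = FALSE]
  out
}

small <- subset_embeddings(emb, terms)
```

A dense matrix keeps the sketch short; for a realistic corpus a sparse representation would be the natural choice, since most cells of a document-term matrix are zero.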

Stability

The package is in alpha state. API changes are likely.

Install

install.packages('mlvocab')

Monthly Downloads

4

Version

0.1

License

GPL-3

Maintainer

Vitalie Spinu

Last Published

September 18th, 2018

Functions in mlvocab (0.1)

  • dplyr_methods: Methods for dplyr predicates

  • mlvocab-package: mlvocab package

  • term_matrices: Term-document and term-cooccurrence matrices

  • tfidf: Tfidf re-weighting of dtm and tdm matrices

  • vocab: Build and manipulate vocabularies

  • prune_embeddings: Subset embedding matrix using vocab terms

  • term_indices: Term Indices: Convert text to integer indices