Corpus and Vocabulary Preprocessing Utilities for Natural Language Pipelines (an R package)

The package provides a two-step abstraction:

  1. The vocabulary object is first built from the entire corpus using the vocab(), update_vocab() and prune_vocab() functions.
  2. The vocabulary is then passed alongside the corpus to a variety of corpus pre-processing functions. Most mlvocab functions accept an nbuckets argument for partial or full hashing of the corpus.
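
The two steps can be pictured with a small base-R sketch. It deliberately does not call mlvocab, so it says nothing about the package's actual signatures or internals. The names toy_hash and term_index are hypothetical, and the reading of "partial hashing" as "only out-of-vocabulary terms are hashed into nbuckets extra indices" is an assumption.

```r
# Conceptual sketch in base R only; it does not call mlvocab.
corpus <- list(
  a = c("the", "quick", "brown", "fox"),
  b = c("the", "lazy", "dog", "and", "the", "fox")
)

# Step 1: build a vocabulary (term counts) from the entire corpus,
# then prune it to the terms that occur at least twice.
term_counts <- table(unlist(corpus, use.names = FALSE))
vocab_terms <- names(term_counts)[as.vector(term_counts) >= 2]

# Step 2: pass the vocabulary alongside the corpus to a processing step.
# Known terms map to indices 1..length(vocab_terms); every other term is
# hashed into one of `nbuckets` additional indices (assumed behaviour).
nbuckets <- 2L
toy_hash <- function(term) sum(utf8ToInt(term)) %% nbuckets  # hypothetical hash

term_index <- function(doc) {
  ix <- match(doc, vocab_terms)
  unknown <- is.na(ix)
  ix[unknown] <- length(vocab_terms) + 1L +
    vapply(doc[unknown], toy_hash, numeric(1))
  as.integer(ix)
}

index_seqs <- lapply(corpus, term_index)
```

Keeping the vocabulary as a separate object means the same term-to-index mapping can be reused for any later corpus, while the hash buckets bound the index range for terms the vocabulary has never seen.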

Current functionality includes:

  • term index sequences: tix_seq(), tix_mat() and tix_df() produce integer sequences suitable for direct consumption by various sequence models.

  • term matrices: dtm(), tdm() and tcm() create document-term, term-document and term-co-occurrence matrices, respectively.

  • subsetting embedding matrices: given pre-trained word vectors, prune_embeddings() creates smaller embedding matrices, handling missing and unknown vocabulary terms gracefully.

  • tfidf weighting: tfidf() computes various versions of term-frequency, inverse-document-frequency weighting of dtm and tdm matrices.
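
To make the matrix terminology above concrete, here is a minimal base-R sketch of a document-term matrix, its transpose, and one common tf-idf variant. It does not use mlvocab, and the weighting formula shown is an assumption chosen for illustration, not necessarily what tfidf() computes by default. Dense matrices are used only to keep the example short.

```r
# Conceptual sketch in base R only; it does not call mlvocab.
corpus <- list(
  a = c("the", "quick", "brown", "fox"),
  b = c("the", "lazy", "dog", "and", "the", "fox")
)
terms <- sort(unique(unlist(corpus, use.names = FALSE)))

# Document-term matrix: one row per document, one column per term.
doc_term <- t(vapply(
  corpus,
  function(doc) as.numeric(table(factor(doc, levels = terms))),
  numeric(length(terms))
))
colnames(doc_term) <- terms

# The term-document matrix is the transpose.
term_doc <- t(doc_term)

# One common tf-idf variant (assumed for illustration):
#   tf  = term count / document length
#   idf = log(number of documents / number of documents containing the term)
tf  <- doc_term / rowSums(doc_term)
idf <- log(nrow(doc_term) / colSums(doc_term > 0))
tfidf_weights <- sweep(tf, 2, idf, `*`)
```

Under this variant, a term that appears in every document gets an idf of zero, so its weight vanishes regardless of how often it occurs.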

Stability

The package is in alpha state. API changes are likely.

Install

install.packages('mlvocab')

Version

0.1

License

GPL-3

Last Published

September 18th, 2018

Functions in mlvocab (0.1)