Corpus and Vocabulary Preprocessing Utilities for Natural Language Pipelines (an R package)
The following two-step abstraction is provided by the package:
- The vocabulary object is first built from the entire corpus with the help of the vocab(), vocab_update() and vocab_prune() functions.
- Then, the vocabulary is passed alongside the corpus to a variety of corpus pre-processing functions. Most of the mlvocab functions accept an nbuckets argument for partial or full hashing of the corpus.
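The two-step pattern can be illustrated with a minimal base-R sketch. It deliberately does not call mlvocab, so every name in it (corpus, counts, vocabulary, to_index_seq) is hypothetical, and the pruning threshold and the choice of 0 as the index for out-of-vocabulary terms are assumptions of this sketch rather than the package's documented behavior.

```r
# Base-R illustration of the two-step pattern; it does not call mlvocab.
# A tiny pre-tokenized corpus: one character vector of tokens per document.
corpus <- list(
  a = c("the", "quick", "brown", "fox"),
  b = c("the", "lazy", "dog"),
  c = c("the", "quick", "dog")
)

# Step 1: build the vocabulary (term counts) from the entire corpus.
counts <- table(unlist(corpus, use.names = FALSE))
# Order by decreasing count, breaking ties alphabetically for determinism.
counts <- counts[order(-as.integer(counts), names(counts))]
# Prune: keep only terms seen at least twice (threshold chosen for illustration).
vocabulary <- names(counts)[counts >= 2]

# Step 2: pass the vocabulary alongside the corpus to a preprocessing step.
# Here each document becomes an integer sequence of vocabulary positions;
# terms outside the vocabulary are mapped to 0 (an assumption of this sketch).
to_index_seq <- function(tokens, vocabulary) {
  idx <- match(tokens, vocabulary)
  idx[is.na(idx)] <- 0L
  idx
}
index_seqs <- lapply(corpus, to_index_seq, vocabulary = vocabulary)
```

Building the vocabulary once and reusing it for every downstream step keeps term indices consistent across all outputs derived from the same corpus.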
Current functionality includes:
- term index sequences: tiseq() and timat() produce integer sequences suitable for direct consumption by various sequence models.
- term matrices: dtm(), tdm() and tcm() create document-term, term-document and term-co-occurrence matrices, respectively.
- vocabulary embedding: given pre-trained word vectors, vocab_embed() creates smaller embedding matrices, handling missing and unknown vocabulary terms gracefully.
- tfidf weighting: tfidf() computes various versions of term frequency-inverse document frequency weighting of dtm and tdm matrices.
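To make the term-matrix and weighting items concrete, here is a small base-R sketch of a document-term matrix and one common tf-idf variant. It does not use mlvocab: the names count_terms, dtm_dense, tdm_dense and weighted are hypothetical, the matrices are dense rather than whatever representation the package returns, and the idf formula log(n_docs / doc_freq) is only one of several possible variants, not necessarily the one tfidf() applies by default.

```r
# Base-R illustration of term matrices and tf-idf; it does not call mlvocab.
corpus <- list(
  a = c("the", "quick", "fox"),
  b = c("the", "lazy", "dog"),
  c = c("the", "quick", "dog")
)
vocabulary <- c("the", "quick", "dog")

# Count how often each vocabulary term occurs in one document.
# match() gives NA for out-of-vocabulary tokens, which tabulate() ignores.
count_terms <- function(tokens) {
  tabulate(match(tokens, vocabulary), nbins = length(vocabulary))
}

# Document-term matrix: one row per document, one column per term.
dtm_dense <- t(vapply(corpus, count_terms, integer(length(vocabulary))))
colnames(dtm_dense) <- vocabulary

# Term-document matrix is simply the transpose.
tdm_dense <- t(dtm_dense)

# One common tf-idf variant: raw term frequency times log(N / document frequency).
n_docs <- nrow(dtm_dense)
doc_freq <- colSums(dtm_dense > 0)
idf <- log(n_docs / doc_freq)
weighted <- sweep(dtm_dense, 2, idf, `*`)
```

In this toy corpus the term "the" occurs in every document, so its idf is log(1) = 0 and its column in weighted is all zeros, which is the usual effect of idf weighting on ubiquitous terms.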
Stability
The package is in an alpha state. API changes are likely.