mlvocab (version 0.1)

mlvocab-package: mlvocab package

Description

The mlvocab package provides a two-step abstraction. First, a vocabulary object is built from the entire corpus with vocab(), update_vocab() and prune_vocab(). Second, the vocabulary is passed alongside the corpus to a variety of corpus pre-processing functions.
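
As a rough illustration of the two steps, the sketch below builds a vocabulary from a small tokenized corpus and then passes it, together with the corpus, to two of the pre-processing functions. The toy corpus is made up, and the argument order (corpus first, vocabulary second) is an assumption based on the description above; consult the individual help pages for the exact signatures.

```r
library(mlvocab)

# Hypothetical toy corpus: a named list of already tokenized documents.
corpus <- list(
  a = c("the", "quick", "brown", "fox"),
  b = c("the", "lazy", "dog")
)

# Step 1: build the vocabulary from the entire corpus.
v <- vocab(corpus)

# Step 2: pass the vocabulary alongside the corpus (assumed argument order).
m <- dtm(corpus, v)       # document-term matrix
s <- tix_seq(corpus, v)   # integer term-index sequences
```

update_vocab() and prune_vocab() would slot in between the two steps, to extend the vocabulary with more text or to trim it before it is used.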

Details

Most mlvocab functions accept an nbuckets argument for partial or full hashing of the corpus.
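
The idea behind partial hashing can be sketched in base R. This is a toy illustration only and does not reflect how mlvocab hashes terms or lays out bucket indices: known terms keep their position in the vocabulary, while unknown terms are hashed into one of nbuckets extra slots. The helpers toy_bucket() and term_index() are hypothetical names introduced for this example. With an empty vocabulary every term would be hashed, which corresponds to full hashing.

```r
# Toy illustration of partial hashing; not mlvocab's implementation.
vocab_terms <- c("the", "quick", "fox")
nbuckets <- 4L

# Toy hash (hypothetical helper): sum of code points modulo nbuckets.
toy_bucket <- function(term, nbuckets) {
  as.integer(sum(utf8ToInt(term)) %% nbuckets)
}

term_index <- function(terms, vocab_terms, nbuckets) {
  ix <- match(terms, vocab_terms)   # known terms: position in the vocabulary
  unknown <- is.na(ix)
  buckets <- vapply(terms[unknown], toy_bucket, integer(1),
                    nbuckets = nbuckets, USE.NAMES = FALSE)
  # Unknown terms: one of nbuckets slots placed after the vocabulary.
  ix[unknown] <- length(vocab_terms) + 1L + buckets
  ix
}

result <- term_index(c("the", "lazy", "fox", "dog"), vocab_terms, nbuckets)
```

Hashing keeps the index range bounded at length(vocab_terms) + nbuckets, at the cost of possible collisions between unrelated unknown terms.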

Current functionality includes:

  • term index sequences: tix_seq() and tix_mat() produce integer sequences suitable for direct consumption by various sequence models.

  • term matrices: dtm(), tdm() and tcm() create document-term, term-document and term-co-occurrence matrices respectively.

  • vocabulary embedding: given pre-trained word vectors, prune_embeddings() creates smaller embedding matrices, handling missing and unknown vocabulary terms gracefully.

  • tfidf weighting: tfidf() computes various versions of term frequency-inverse document frequency (tf-idf) weighting for dtm and tdm matrices.
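
To make the last item concrete, here is a base-R sketch of one common tf-idf variant (raw term counts multiplied by log(N / document frequency), with no smoothing). It is not a description of what tfidf() computes by default; the toy matrix and variable names are made up for illustration.

```r
# Toy document-term count matrix: 3 documents x 3 terms.
dtm_counts <- matrix(
  c(2, 1, 0,
    0, 1, 1,
    1, 1, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("d1", "d2", "d3"), c("fox", "the", "dog"))
)

n_docs <- nrow(dtm_counts)
doc_freq <- colSums(dtm_counts > 0)   # number of documents containing each term
idf <- log(n_docs / doc_freq)         # one common idf variant, no smoothing

# Scale each term column by its idf weight.
tfidf_weights <- sweep(dtm_counts, 2, idf, `*`)
```

A term that occurs in every document ("the" in this toy matrix) gets an idf of zero under this variant, so its column is weighted out entirely.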

See Also

Useful links: