mlvocab (version 0.0.1)

term_indices: Convert text to integer indices

Description

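Convert a text corpus to integer indices that refer to terms in a vocabulary. tiseq() returns the indices as a list of sequences; timat() returns them as an integer matrix with at most maxlen indices per sequence.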

Usage

tiseq(corpus, vocab, keep_unknown = nbuckets > 0,
  nbuckets = attr(vocab, "nbuckets"), reverse = FALSE)

timat(corpus, vocab, maxlen = 100, pad_right = TRUE, trunc_right = TRUE,
  keep_unknown = nbuckets > 0, nbuckets = attr(vocab, "nbuckets"),
  reverse = FALSE)

Arguments

corpus

text corpus

vocab

data frame produced by vocab() or vocab_update()

keep_unknown

logical. If TRUE, preserve unknowns in the output sequences.

nbuckets

integer. How many buckets to hash unknowns into.

reverse

logical. Should each sequence be reversed in the final output? Reversal happens after pad_right and trunc_right have been applied to the original text sequence. Default FALSE.

maxlen

integer. Maximum length of each sequence.

pad_right

logical. Should sequences shorter than maxlen be padded with 0 on the right? Default TRUE.

trunc_right

logical. Should sequences longer than maxlen be truncated on the right? Default TRUE.
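
As a rough illustration of how keep_unknown and nbuckets interact, the base R sketch below maps terms to indices with match() and then either drops unknown terms, keeps them as NA, or hashes them into buckets placed after the known terms. It is not mlvocab's implementation: the helper toy_index(), the toy hash (sum of code points modulo nbuckets), the NA placeholder, and the index layout are all assumptions made for the example.

```r
# Toy illustration of unknown-term handling. NOT mlvocab's implementation:
# the hash, the NA placeholder and the index layout are assumptions.
known <- c("the", "quick", "brown", "fox")   # hypothetical vocabulary terms

toy_index <- function(terms, known, nbuckets = 0L, keep_unknown = nbuckets > 0L) {
  idx <- match(terms, known)                 # NA for terms not in the vocabulary
  unknown <- is.na(idx)
  if (!keep_unknown) {
    return(idx[!unknown])                    # drop unknown terms entirely
  }
  if (nbuckets > 0L) {
    # Toy hash: sum of code points modulo nbuckets (assumption, illustration only)
    h <- vapply(terms[unknown],
                function(t) as.integer(sum(utf8ToInt(t)) %% nbuckets),
                integer(1))
    # Assumed layout: bucket indices follow the known-term indices
    idx[unknown] <- length(known) + 1L + unname(h)
  }
  idx                                        # unknowns stay NA when nbuckets == 0
}

terms <- c("The", "quick", "dog")
dropped  <- toy_index(terms, known)                       # 2
kept     <- toy_index(terms, known, keep_unknown = TRUE)  # NA 2 NA
bucketed <- toy_index(terms, known, nbuckets = 1L)        # 5 2 5
```

With a single bucket every unknown term shares one index; with more buckets, different unknown terms may or may not collide, depending on the hash.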

Value

tiseq() returns a list of integer vectors; timat() returns an integer matrix with one row per sequence.
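
To make the relationship between the two return shapes concrete, here is a base R sketch that turns a list of integer vectors into a matrix with one row per sequence. It reads the documented semantics of maxlen, pad_right, trunc_right and reverse literally (truncate or pad first, reverse last). The helper seq_to_row() is hypothetical and is not part of mlvocab.

```r
# Base R sketch: list of integer vectors -> padded/truncated integer matrix.
# seq_to_row() is a hypothetical helper, not an mlvocab function.
seq_to_row <- function(x, maxlen, pad_right = TRUE, trunc_right = TRUE,
                       reverse = FALSE) {
  n <- length(x)
  if (n > maxlen) {
    # trunc_right = TRUE cuts the tail; FALSE cuts the head
    x <- if (trunc_right) x[seq_len(maxlen)] else x[(n - maxlen + 1L):n]
  } else if (n < maxlen) {
    pad <- integer(maxlen - n)               # vector of zeros
    x <- if (pad_right) c(x, pad) else c(pad, x)
  }
  if (reverse) rev(x) else x                 # reversal is applied last
}

seqs <- list(a = 1:3, b = 1:6)               # stand-in for a list of index sequences

# One row per sequence, 4 columns
m_default <- t(vapply(seqs, seq_to_row, integer(4), maxlen = 4L))
m_left    <- t(vapply(seqs, seq_to_row, integer(4), maxlen = 4L,
                      pad_right = FALSE, trunc_right = FALSE))
m_rev     <- t(vapply(seqs, seq_to_row, integer(4), maxlen = 4L,
                      reverse = TRUE))
```

In this sketch m_default holds 1 2 3 0 and 1 2 3 4, m_left holds 0 1 2 3 and 3 4 5 6, and m_rev holds 0 3 2 1 and 4 3 2 1 (rows a and b respectively).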

Examples

# NOT RUN {
corpus <- list(a = c("The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"), 
               b = c("the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog",
                     "the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"))
v <- vocab(corpus["b"]) # "The" is unknown
v
tiseq(corpus, v)
tiseq(corpus, v, keep_unknown = TRUE)
tiseq(corpus, v, nbuckets = 1)
tiseq(corpus, v, nbuckets = 3)

timat(corpus, v, maxlen = 12)
timat(corpus, v, maxlen = 12, keep_unknown = TRUE)
timat(corpus, v, maxlen = 12, nbuckets = 1)
timat(corpus, v, maxlen = 12, nbuckets = 1, reverse = TRUE)
timat(corpus, v, maxlen = 12, pad_right = FALSE, nbuckets = 1)
timat(corpus, v, maxlen = 12, trunc_right = FALSE, nbuckets = 1)
# }
