These functions compute or update various term-counts of a corpus with flexible output specification.
Default weights for the context window ["a" "b" "c" "d" "e"]:
   a    b    c    d    e
1.00 0.50 0.33 0.25 0.20
dtm(corpus, vocab = NULL, ngram = attr(vocab, "ngram"),
  nbuckets = attr(vocab, "nbuckets"),
  output = c("triplet", "column", "row", "df"))

tdm(corpus, vocab = NULL, ngram = attr(vocab, "ngram"),
  nbuckets = attr(vocab, "nbuckets"),
  output = c("triplet", "column", "row", "df"))

tcm(corpus, vocab = NULL, window_size = 5,
  window_weights = 1/seq.int(window_size),
  context = c("symmetric", "right", "left"),
  ngram = attr(vocab, "ngram"), nbuckets = attr(vocab, "nbuckets"),
  output = c("triplet", "column", "row", "df"))
corpus: a list of character vectors
vocab: a data.frame produced by an earlier call to vocab(). When
vocab is NULL and nbuckets is NULL or 0, the vocabulary
is first computed from corpus. When nbuckets > 0 and vocab is
NULL, the resulting matrix consists of buckets only.
ngram: an integer vector of the form [ngram_min, ngram_max]. Defaults to the ngram settings used when
vocab was created. Providing this parameter explicitly should rarely be needed.
nbuckets: number of unknown buckets
output: one of "triplet", "column", "row", "df" or an unambiguous
abbreviation thereof. The first three options return the corresponding sparse
matrix from the Matrix package; "df" returns a triplet data.frame.
window_size: sliding window size used for co-occurrence
computation. In this implementation the window includes the context word;
thus, window_size == 1 results in an all-zero co-occurrence matrix. This
convention allows for consistent weighting schemes across different values
of ngram_min and ngram_max.
window_weights: vector of weights superimposed on the
sliding window. The first element is the weight for distance 0 (the context
word itself), the second for distance 1, and so on. The first weight plays no
role when ngram_max == 1; see details. window_weights is recycled to
length window_size if needed. It can also be a function, or a string naming a
function, which accepts one argument, window_size, and returns a
window_weights vector. Defaults to [1, 1/2, ..., 1/window_size].
context: when "symmetric", matrix entries (i, j) and (j, i) are
the same and represent the co-occurrence of terms i and j within
window_size. When "right", entry (i, j) represents the co-occurrence of
term j on the right side of term i. When "left", entry (i, j) represents the co-occurrence of term j on the left of term i.
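
To make the three context settings concrete, here is a minimal base R sketch for unigrams only. toy_tcm() is a hypothetical function written for this illustration; it follows the conventions described above (distance 0 is the context word itself, so only distances 1 to window_size - 1 contribute) and assumes window_weights already has length window_size. It makes no claim to reproduce the package's implementation, in particular for ngrams, buckets, or a term co-occurring with itself.

```r
# Hypothetical dense toy version of a term co-occurrence matrix.
toy_tcm <- function(tokens, window_size = 5,
                    window_weights = 1 / seq.int(window_size),
                    context = c("symmetric", "right", "left")) {
  context <- match.arg(context)
  terms <- unique(tokens)
  # right[i, j]: weighted count of term j appearing to the right of term i
  right <- matrix(0, length(terms), length(terms),
                  dimnames = list(terms, terms))
  for (i in seq_along(tokens)) {
    # seq_len(0) is empty, so window_size == 1 leaves the matrix at zero
    for (d in seq_len(window_size - 1)) {
      j <- i + d
      if (j > length(tokens)) break
      # weight for distance d sits at position d + 1 (position 1 is distance 0)
      right[tokens[i], tokens[j]] <-
        right[tokens[i], tokens[j]] + window_weights[d + 1]
    }
  }
  switch(context,
         right = right,
         left = t(right),
         symmetric = right + t(right))
}

toks <- c("a", "b", "c", "d", "e")
tcm_right <- toy_tcm(toks, window_size = 3, context = "right")
tcm_left  <- toy_tcm(toks, window_size = 3, context = "left")
tcm_sym   <- toy_tcm(toks, window_size = 3, context = "symmetric")
print(round(tcm_right, 2))
```

In this sketch the "left" matrix is the transpose of the "right" one, and the "symmetric" matrix is their sum.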
Weights for ngram = c(1L, 3L):
    a   a_b a_b_c     b   b_c b_c_d     c   c_d c_d_e     d   d_e     e
 1.00  0.75  0.61  0.50  0.42  0.36  0.33  0.29  0.26  0.25  0.22  0.20
Weights for ngram = c(2L, 3L):
  a_b a_b_c   b_c b_c_d   c_d c_d_e   d_e
 0.75  0.61  0.42  0.36  0.29  0.26  0.22
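
The tabulated ngram weights are consistent with each ngram receiving the mean of the default weights of its constituent positions, for example (1.00 + 0.50) / 2 = 0.75 for a_b. This is an inference from the numbers above rather than a documented rule, and ngram_weights() below is a hypothetical helper written only to check that reading in base R.

```r
# Default unigram weights for the window a, b, c, d, e
w <- setNames(1 / seq.int(5), c("a", "b", "c", "d", "e"))

# Hypothetical helper: weight of an ngram = mean weight of its positions
ngram_weights <- function(w, ngram_min, ngram_max) {
  out <- numeric(0)
  for (start in seq_along(w)) {
    for (len in ngram_min:ngram_max) {
      end <- start + len - 1
      if (end > length(w)) break   # ngram would run past the window
      idx <- start:end
      out[paste(names(w)[idx], collapse = "_")] <- mean(w[idx])
    }
  }
  out
}

w13 <- ngram_weights(w, 1, 3)
w23 <- ngram_weights(w, 2, 3)
print(round(w13, 2))
print(round(w23, 2))
```

Under this reading, the weight for distance 0 matters only once ngrams longer than one token are present, since it enters solely through the averages of ngrams that start at the context word.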