These functions compute or update various term-counts of a corpus with flexible output specification.
Default weights for the context window ["a" "b" "c" "d" "e"]:
   a    b    c    d    e
1.00 0.50 0.33 0.25 0.20
dtm(corpus, vocab = NULL, ngram = attr(vocab, "ngram"),
  nbuckets = attr(vocab, "nbuckets"),
  output = c("triplet", "column", "row", "df"))

tdm(corpus, vocab = NULL, ngram = attr(vocab, "ngram"),
  nbuckets = attr(vocab, "nbuckets"),
  output = c("triplet", "column", "row", "df"))

tcm(corpus, vocab = NULL, window_size = 5,
  window_weights = 1/seq.int(window_size),
  context = c("symmetric", "right", "left"),
  ngram = attr(vocab, "ngram"), nbuckets = attr(vocab, "nbuckets"),
  output = c("triplet", "column", "row", "df"))
corpus: a list of character vectors.
vocab: a data.frame produced by an earlier call to vocab(). When vocab is NULL and nbuckets is NULL or 0, the vocabulary is first computed from corpus. When nbuckets > 0 and vocab is NULL, the result matrix will consist of buckets only.
ngram: an integer vector of the form [ngram_min, ngram_max]. Defaults to the ngram settings used during the creation of vocab. Explicitly providing this parameter should rarely be needed.
nbuckets: number of buckets for unknown terms.
output: one of "triplet", "column", "row", "df", or an unambiguous abbreviation thereof. The first three options return the corresponding sparse matrices from the Matrix package; "df" returns a triplet data.frame.
window_size: sliding window size used for co-occurrence computation. In this implementation the window includes the context word; thus, window_size == 1 results in an all-zero co-occurrence matrix. This convention allows for consistent weighting schemes across different values of ngram_min and ngram_max.
window_weights: vector of weights superimposed on the sliding window. The first element is the weight for distance 0 (the context word itself), the second for distance 1, and so on. The first weight plays no role when ngram_max == 1; see details. window_weights is recycled to length window_size if needed. It can also be a function, or a string naming a function, which accepts one argument, window_size, and returns a window_weights vector. Defaults to [1, 1/2, ..., 1/window_size].
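The weight options described above can be sketched in base R without calling the package. This is only an illustration: rep_len() stands in for whatever recycling the package performs internally (an assumption), and linear_decay is a hypothetical function name chosen for the example.

```r
window_size <- 5L

# Default weights: 1, 1/2, ..., 1/window_size
default_weights <- 1 / seq.int(window_size)
names(default_weights) <- c("a", "b", "c", "d", "e")

# A shorter vector recycled to length window_size
# (rep_len() is used here as a stand-in for the package's recycling)
recycled_weights <- rep_len(c(1, 0.5), window_size)

# A user-supplied weight function: takes window_size, returns the weights
# (hypothetical name, for illustration only)
linear_decay <- function(window_size) rev(seq_len(window_size)) / window_size
custom_weights <- linear_decay(window_size)

print(round(default_weights, 2))
print(recycled_weights)
print(custom_weights)
```

Under the signature shown in the usage section, such a function would be passed as window_weights = linear_decay or window_weights = "linear_decay".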
context: when "symmetric", matrix entries (i, j) and (j, i) are the same and represent the co-occurrence of terms i and j within window_size. When "right", entry (i, j) represents the co-occurrence of term j on the right side of term i. When "left", entry (i, j) represents the co-occurrence of term j on the left side of term i.
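The "right" and "left" conventions, and the effect of window_size == 1, can be illustrated with a toy base-R counter for unigrams. This is a minimal sketch under stated assumptions, not the package's implementation: toy_tcm_right is a hypothetical helper, it handles a single document only, and it returns a dense matrix. The "symmetric" case is not reproduced here.

```r
# Toy right-context co-occurrence counter for unigrams (illustrative only).
toy_tcm_right <- function(doc, window_size = 5L,
                          window_weights = 1 / seq.int(window_size)) {
  terms <- sort(unique(doc))
  m <- matrix(0, length(terms), length(terms), dimnames = list(terms, terms))
  for (i in seq_along(doc)) {
    # Distance 0 is the context word itself, so neighbours start at distance 1
    # and the window reaches at most distance window_size - 1.
    for (d in seq_len(window_size - 1L)) {
      j <- i + d
      if (j > length(doc)) break
      # window_weights[1] is for distance 0, so distance d uses element d + 1
      m[doc[i], doc[j]] <- m[doc[i], doc[j]] + window_weights[d + 1L]
    }
  }
  m
}

doc <- c("a", "b", "c", "a")

# Entry (i, j): weight of term j occurring to the right of term i
right <- toy_tcm_right(doc, window_size = 3L)

# "j on the left of i" is the same event as "i on the right of j",
# so the left-context matrix is the transpose of the right-context one
left <- t(right)

# A window holding only the context word yields an all-zero matrix
empty <- toy_tcm_right(doc, window_size = 1L)
```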
for ngram = c(1L, 3L)
   a  a_b a_b_c    b  b_c b_c_d    c  c_d c_d_e    d  d_e    e
1.00 0.75  0.61 0.50 0.42  0.36 0.33 0.29  0.26 0.25 0.22 0.20
for ngram = c(2L, 3L)
 a_b a_b_c  b_c b_c_d  c_d c_d_e  d_e
0.75  0.61 0.42  0.36 0.29  0.26 0.22
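The tabulated values are consistent with each n-gram weight being the mean of the default weights of its constituent unigrams (for example, a_b is the mean of 1.00 and 0.50). That reading is an observation from the numbers above, not a rule stated in the text, and ngram_weights below is a hypothetical helper written only to check it.

```r
# Default unigram weights for the window ["a" "b" "c" "d" "e"]
unigram_weights <- setNames(1 / seq.int(5), c("a", "b", "c", "d", "e"))

# Hypothetical helper: weight of each n-gram as the mean of its unigram weights
# (an assumption inferred from the tables, not a documented formula)
ngram_weights <- function(w, ngram = c(1L, 3L)) {
  out <- numeric(0)
  for (start in seq_along(w)) {
    for (n in seq.int(ngram[1], ngram[2])) {
      end <- start + n - 1L
      if (end > length(w)) break          # n-gram would run past the window
      key <- paste(names(w)[start:end], collapse = "_")
      out[key] <- mean(w[start:end])
    }
  }
  out
}

w13 <- ngram_weights(unigram_weights, c(1L, 3L))
w23 <- ngram_weights(unigram_weights, c(2L, 3L))
print(round(w13, 2))
print(round(w23, 2))
```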