These functions compute or update various term-counts of a corpus with flexible output specification.
dtm(corpus, vocab = NULL, ngram = attr(vocab, "ngram"),
  nbuckets = attr(vocab, "nbuckets"),
  output = c("row", "triplet", "column", "df"))

tdm(corpus, vocab = NULL, ngram = attr(vocab, "ngram"),
  nbuckets = attr(vocab, "nbuckets"),
  output = c("column", "triplet", "row", "df"))

tcm(corpus, vocab = NULL, window_size = 5,
  window_weights = 1/seq.int(window_size),
  context = c("symmetric", "right", "left"),
  ngram = attr(vocab, "ngram"), nbuckets = attr(vocab, "nbuckets"),
  output = c("triplet", "column", "row", "df"))

corpus: text corpus; see vocab().
vocab: a data.frame produced by an earlier call to vocab(). When vocab is NULL and nbuckets is NULL or 0, the vocabulary is first computed from corpus. When nbuckets > 0 and vocab is NULL, the result matrix will consist of buckets only.
ngram: an integer vector of the form [ngram_min, ngram_max]. Defaults to the ngram settings used during the creation of vocab. Explicitly providing this parameter should rarely be needed.
nbuckets: number of unknown buckets.
output: one of "triplet", "column", "row", "df" or an unambiguous abbreviation thereof. The first three options return the corresponding sparse matrices from the Matrix package; "df" results in a triplet data.frame. The default output type corresponds to the most efficient computation in terms of CPU and memory usage ("row" for dtm, "column" for tdm and "triplet" for tcm), but the benefits are marginal unless your matrices are so big that they barely fit into memory. If you plan to perform further matrix algebra on these matrices, it is a good idea to choose the "column" type because of the much better support from the Matrix package.
window_size: sliding window size used for co-occurrence computation. In this implementation the window includes the context word; thus, window_size == 1 will result in an all-zero co-occurrence matrix. This convention allows for consistent weighting schemes across different values of ngram_min and ngram_max.
window_weights: vector of weights which are superimposed on the sliding window. The first element is the weight for distance 0 (i.e. the context word itself), the second for distance 1, etc. The first weight is ignored for ngram_max == 1; see details. window_weights is recycled to length window_size if needed. It can also be a function, or a string naming a function, which accepts one argument, window_size, and returns a window_weights vector. Defaults to [1, 1/2, ..., 1/window_size].
context: when "symmetric", matrix entries (i, j) and (j, i) are the same and represent the co-occurrence of terms i and j within window_size. When "right", entry (i, j) represents the co-occurrence of the term j on the right side of term i. When "left", entry (i, j) represents the co-occurrence of the term j on the left side of term i.

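The forms accepted by window_weights (a numeric vector, a function, or a string naming a function) can be illustrated with a small base R sketch. The helper resolve_window_weights below is hypothetical and is not part of the package; it only mirrors the description above, assuming that a function is called with window_size and that a short vector is recycled with rep_len().

```r
# Hypothetical helper (not the package's internal code): turn any of the
# accepted `window_weights` forms into a numeric vector of length window_size.
resolve_window_weights <- function(window_weights, window_size) {
  # A string is assumed to name a function that can be looked up by name.
  if (is.character(window_weights))
    window_weights <- get(window_weights, mode = "function")
  # A function is called with a single argument, window_size.
  if (is.function(window_weights))
    window_weights <- window_weights(window_size)
  # A vector shorter than window_size is recycled.
  rep_len(as.numeric(window_weights), window_size)
}

window_size <- 5L

# Default form from the usage section: 1, 1/2, ..., 1/window_size.
w_default <- resolve_window_weights(1 / seq.int(window_size), window_size)

# A single constant weight is recycled across the whole window.
w_const <- resolve_window_weights(1, window_size)

# A function of window_size; the decay used here is purely illustrative.
w_fun <- resolve_window_weights(function(n) 1 / sqrt(seq_len(n)), window_size)

print(w_default)
print(w_const)
print(w_fun)
```

In this sketch the first element of each resulting vector corresponds to distance 0, which is ignored when ngram_max == 1.
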
For ngram_max > 1 the weights vector is automatically extended to match the "imaginary" sliding window over the n-grams. The proximity weight attached to an n-gram is the average of the weights of its constituents in the original sequence. This scheme results in consistent weighting across different values of ngram_min and ngram_max, and it is the reason why the first element of window_weights is the proximity to the context word itself (i.e. distance 0). For example:

default weights for the context window ["a" "b" "c" "d" "e"]
a | b | c | d | e |
1.00 | 0.50 | 0.33 | 0.25 | 0.20 |

for ngram=c(1L, 3L)
a | a_b | a_b_c | b | b_c | b_c_d | c | c_d | c_d_e | d | d_e | e |
1.00 | 0.75 | 0.61 | 0.50 | 0.42 | 0.36 | 0.33 | 0.29 | 0.26 | 0.25 | 0.22 | 0.20 |
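
The weights for ngram=c(1L, 3L) can be reproduced in base R by averaging the default weights of each n-gram's constituents. This is a minimal sketch of the averaging rule described above, not the package's implementation; ngram_weights is a hypothetical helper introduced only for illustration.

```r
# Hypothetical helper: the proximity weight of an n-gram is taken to be the
# mean of the weights of its constituent tokens in the original sequence.
ngram_weights <- function(tokens, weights, ngram = c(1L, 3L)) {
  out <- numeric(0)
  for (start in seq_along(tokens)) {
    for (n in seq.int(ngram[1], ngram[2])) {
      end <- start + n - 1L
      if (end > length(tokens)) break   # n-gram would run past the window
      idx <- start:end
      out[paste(tokens[idx], collapse = "_")] <- mean(weights[idx])
    }
  }
  out
}

tokens  <- c("a", "b", "c", "d", "e")
weights <- 1 / seq_along(tokens)        # default weights: 1, 1/2, ..., 1/5

w13 <- ngram_weights(tokens, weights, ngram = c(1L, 3L))
w23 <- ngram_weights(tokens, weights, ngram = c(2L, 3L))

print(round(w13, 2))
print(round(w23, 2))
```

For instance, a_b averages the weights of a and b, (1 + 1/2) / 2 = 0.75, and the unigram entries keep their original weights. The same helper with ngram = c(2L, 3L) simply omits the unigrams.
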
for ngram=c(2L, 3L)
a_b | a_b_c | b_c | b_c_d | c_d | c_d_e | d_e |