term_matrices: Term-document and term-cooccurrence matrices

Description

These functions compute or update various term-counts of a corpus with flexible output specification.

Default weights for the context window ["a" "b" "c" "d" "e"]:

     a    b    c    d    e
  1.00 0.50 0.33 0.25 0.20

Usage

dtm(corpus, vocab = NULL, ngram = attr(vocab, "ngram"),
  nbuckets = attr(vocab, "nbuckets"), output = c("triplet", "column", "row",
  "df"))
tdm(corpus, vocab = NULL, ngram = attr(vocab, "ngram"),
  nbuckets = attr(vocab, "nbuckets"), output = c("triplet", "column", "row",
  "df"))
tcm(corpus, vocab = NULL, window_size = 5,
  window_weights = 1/seq.int(window_size), context = c("symmetric", "right",
  "left"), ngram = attr(vocab, "ngram"), nbuckets = attr(vocab, "nbuckets"),
  output = c("triplet", "column", "row", "df"))
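
A minimal usage sketch of the three functions, assuming the package is installed under the name mlvocab (the package name and the toy corpus are assumptions, not taken from this page). It only uses arguments listed in the signatures above.

```r
# Skipped entirely when the (assumed) mlvocab package is not installed.
if (requireNamespace("mlvocab", quietly = TRUE)) {
  library(mlvocab)

  # Hypothetical toy corpus: a named list of tokenized documents.
  corpus <- list(doc1 = c("a", "b", "c", "d", "e"),
                 doc2 = c("a", "c", "e"))

  v <- vocab(corpus)                           # vocabulary data.frame
  m_dtm <- dtm(corpus, v)                      # first output option: "triplet"
  m_tdm <- tdm(corpus, v, output = "column")   # column-oriented sparse matrix
  d_tcm <- tcm(corpus, v, window_size = 3,
               context = "right", output = "df")  # triplet data.frame
}
```

Passing the vocabulary explicitly means ngram and nbuckets are picked up from its attributes, so they rarely need to be supplied by hand.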

Arguments

corpus

a list of character vectors

vocab

a data.frame produced by an earlier call to vocab(). When vocab is NULL and nbuckets is NULL or 0, the vocabulary is first computed from corpus. When nbuckets > 0 and vocab is NULL, the resulting matrix consists of buckets only.

ngram

an integer vector of the form [ngram_min, ngram_max]. Defaults to the ngram settings used during the creation of vocab. Explicitly providing this parameter should rarely be needed.

nbuckets

number of buckets for unknown terms, that is, terms absent from vocab

output

one of "triplet", "column", "row", "df" or an unambiguous abbreviation thereof. The first three options return the corresponding sparse matrices from the Matrix package; "df" returns a triplet data.frame.

window_size

size of the sliding window used for co-occurrence computation. In this implementation the window includes the context word itself; thus, window_size == 1 results in an all-zero co-occurrence matrix. This convention allows for consistent weighting schemes across different values of ngram_min and ngram_max.

window_weights

vector of weights superimposed on the sliding window. The first element is the weight for distance 0 (the context word itself), the second for distance 1, and so on. The first weight plays no role when ngram_max == 1; see Details. window_weights is recycled to length window_size if needed. It can also be a function, or a string naming a function, that accepts one argument, window_size, and returns a vector of weights. Defaults to [1, 1/2, ..., 1/window_size].
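
The forms that window_weights can take can be illustrated in base R. This is only a sketch: rep_len() stands in for the recycling described above, and linear_decay is a hypothetical weighting function, not something provided by the package.

```r
window_size <- 5L

# Default weights: 1 for distance 0, 1/2 for distance 1, and so on.
default_weights <- 1 / seq.int(window_size)
round(default_weights, 2)

# A shorter vector recycled to length window_size
# (base-R analogue of the recycling described above).
recycled <- rep_len(c(1, 0.5), window_size)

# Hypothetical user-supplied function of window_size returning the weights.
linear_decay <- function(window_size) rev(seq_len(window_size)) / window_size
linear_weights <- linear_decay(window_size)
```

Such a function could be passed either directly or by name, as in window_weights = "linear_decay".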

context

when "symmetric", matrix entries (i, j) and (j, i) are the same and represent the co-occurrence of terms i and j within window_size. When "right", entry (i, j) represents the co-occurrence of term j on the right side of term i. When "left", entry (i, j) represents the co-occurrence of term j on the left side of term i.
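
To make the window_size, window_weights and context conventions concrete, here is a toy unigram-only counter in base R. It is an illustration, not the package's implementation: toy_tcm is a hypothetical helper, and treating the symmetric case as the sum of the right and left matrices is an assumption.

```r
# Toy weighted co-occurrence counter for a single tokenized document.
toy_tcm <- function(doc, window_size = 5L,
                    window_weights = 1 / seq.int(window_size),
                    context = c("symmetric", "right", "left")) {
  context <- match.arg(context)
  terms <- sort(unique(doc))
  m <- matrix(0, length(terms), length(terms), dimnames = list(terms, terms))
  n <- length(doc)
  for (p in seq_len(n)) {
    # Distance 0 is the word itself, so neighbours start at distance 1
    # and the farthest neighbour is at distance window_size - 1.
    for (d in seq_len(window_size - 1L)) {
      q <- p + d
      if (q > n) break
      # doc[q] occurs to the right of doc[p] at distance d.
      m[doc[p], doc[q]] <- m[doc[p], doc[q]] + window_weights[d + 1L]
    }
  }
  switch(context,
         right = m,              # (i, j): j to the right of i
         left = t(m),            # (i, j): j to the left of i
         symmetric = m + t(m))   # assumed: sum of both directions
}

doc <- c("a", "b", "c", "d", "e")
right_m <- toy_tcm(doc, context = "right")
left_m  <- toy_tcm(doc, context = "left")
sym_m   <- toy_tcm(doc, context = "symmetric")
zero_m  <- toy_tcm(doc, window_size = 1L)
```

In this sketch the row for "a" in right_m holds the weights 1/2, 1/3, 1/4 and 1/5 for "b" through "e", and window_size = 1L leaves every entry at zero because the window then contains only the context word.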

Details

Weights for the context window ["a" "b" "c" "d" "e"] with ngram = c(1L, 3L):

     a  a_b a_b_c    b  b_c b_c_d    c  c_d c_d_e    d  d_e    e
  1.00 0.75  0.61 0.50 0.42  0.36 0.33 0.29  0.26 0.25 0.22 0.20

With ngram = c(2L, 3L):

   a_b a_b_c  b_c b_c_d  c_d c_d_e  d_e
  0.75  0.61 0.42  0.36 0.29  0.26 0.22
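
The tabulated n-gram weights are consistent with taking the mean of the constituent unigram weights, for example a_b = (1.00 + 0.50) / 2 = 0.75. The base-R sketch below reproduces the values under that assumption, which is inferred from the numbers rather than stated on this page; ngram_weights is a hypothetical helper and "_" is taken as the n-gram separator from the term names shown.

```r
tokens <- c("a", "b", "c", "d", "e")
# Default unigram weights 1, 1/2, ..., 1/5, named by token.
w <- setNames(1 / seq_along(tokens), tokens)

# Weight of each n-gram = mean of the weights of its constituent tokens
# (an assumption inferred from the tabulated values).
ngram_weights <- function(w, ngram = c(1L, 3L), sep = "_") {
  out <- numeric(0)
  for (start in seq_along(w)) {
    for (n in seq.int(ngram[1], ngram[2])) {
      end <- start + n - 1L
      if (end > length(w)) break   # n-gram would run past the window
      idx <- start:end
      out[paste(names(w)[idx], collapse = sep)] <- mean(w[idx])
    }
  }
  out
}

w13 <- ngram_weights(w, c(1L, 3L))
w23 <- ngram_weights(w, c(2L, 3L))
round(w13, 2)
round(w23, 2)
```

Under this reading, restricting ngram to c(2L, 3L) simply drops the unigram entries; the remaining weights are unchanged.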