term_matrices: Term-document and term-cooccurrence matrices

Description

These functions compute or update term counts of a corpus in the form of document-term (dtm), term-document (tdm) and term-co-occurrence (tcm) matrices, with a flexible output specification.

Usage

dtm(corpus, vocab = NULL, ngram = attr(vocab, "ngram"),
  nbuckets = attr(vocab, "nbuckets"), output = c("row", "triplet",
  "column", "df"))
tdm(corpus, vocab = NULL, ngram = attr(vocab, "ngram"),
  nbuckets = attr(vocab, "nbuckets"), output = c("column", "triplet",
  "row", "df"))
tcm(corpus, vocab = NULL, window_size = 5,
  window_weights = 1/seq.int(window_size), context = c("symmetric",
  "right", "left"), ngram = attr(vocab, "ngram"),
  nbuckets = attr(vocab, "nbuckets"), output = c("triplet", "column",
  "row", "df"))

Arguments

corpus

text corpus; see [vocab()].

vocab

a data.frame produced by an earlier call to vocab(). When vocab is NULL and nbuckets is NULL or 0, the vocabulary is first computed from corpus. When nbuckets > 0 and vocab is NULL, the resulting matrix will consist of buckets only.

ngram

an integer vector of the form [ngram_min, ngram_max]. Defaults to the ngram settings used during the creation of vocab. Explicitly providing this parameter should rarely be needed.

nbuckets

number of buckets for unknown terms, i.e. terms not present in vocab.

output

one of "triplet", "column", "row", "df" or an unambiguous abbreviation thereof. The first three options return the corresponding sparse matrix from the Matrix package; "df" returns a triplet data.frame.

The default output type corresponds to the most efficient computation in terms of CPU and memory usage ("row" for dtm, "column" for tdm and "triplet" for tcm), but the benefits are marginal unless your matrices are so big that they barely fit into memory. If you plan to perform further matrix algebra on these matrices, it is a good idea to choose the "column" type because it is much better supported by the Matrix package.

window_size

sliding window size used for the co-occurrence computation. In this implementation the window includes the context word; thus, window_size == 1 results in an all-zero co-occurrence matrix. This convention allows for consistent weighting schemes across different values of ngram_min and ngram_max.

window_weights

vector of weights superimposed on the sliding window. The first element is the weight for distance 0 (the context word itself), the second for distance 1, and so on. The first weight is ignored for ngram_max == 1; see Details. window_weights is recycled to length window_size if needed. It can also be a function, or a string naming a function, which accepts one argument, window_size, and returns a window_weights vector. Defaults to [1, 1/2, ..., 1/window_size].
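The three ways of specifying window_weights described above can be sketched in base R. This is only an illustration: linear_decay is a hypothetical weighting function, and rep_len() stands in for whatever recycling the package performs internally.

```r
window_size <- 5L

# Default weights as given in Usage: 1, 1/2, ..., 1/window_size.
default_weights <- 1 / seq.int(window_size)

# A shorter vector recycled to length window_size
# (base R recycling; assumed to mirror what the package does).
recycled <- rep_len(c(1, 0.5), window_size)

# Hypothetical weighting function of one argument, window_size,
# returning weights that decay linearly with distance.
linear_decay <- function(window_size) {
  (window_size - seq_len(window_size) + 1) / window_size
}

round(default_weights, 2)
recycled
linear_decay(window_size)
```

In each case the first element is the weight for distance 0, so for ngram_max == 1 only the remaining window_size - 1 elements would matter.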

context

when "symmetric", matrix entries (i, j) and (j, i) are the same and represent the co-occurrence of terms i and j within window_size. When "right", entry (i, j) represents the co-occurrence of term j on the right side of term i. When "left", entry (i, j) represents the co-occurrence of term j on the left side of term i.
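To make the three context settings concrete, here is a minimal base R sketch for a single sequence of unigrams. The function cooc is hypothetical and dense; it follows the descriptions above (the window includes the context word, so neighbours lie at distances 1 to window_size - 1) and is not the package's implementation. How repeated terms are treated under "symmetric" is not specified here, so the example uses distinct tokens.

```r
# Illustrative only: dense, weighted co-occurrence counts for one token sequence.
cooc <- function(tokens, window_size = 5L,
                 weights = 1 / seq.int(window_size),
                 context = c("symmetric", "right", "left")) {
  context <- match.arg(context)
  terms <- unique(tokens)
  m <- matrix(0, length(terms), length(terms), dimnames = list(terms, terms))
  n <- length(tokens)
  for (pos in seq_len(n)) {
    for (d in seq_len(window_size - 1L)) {  # distances 1 .. window_size - 1
      if (pos + d > n) break
      i <- tokens[pos]        # term on the left
      j <- tokens[pos + d]    # term d positions to its right
      w <- weights[d + 1L]    # weights[1] belongs to distance 0
      # "right": entry (i, j) counts j occurring to the right of i
      if (context %in% c("right", "symmetric")) m[i, j] <- m[i, j] + w
      # "left": entry (j, i) counts i occurring to the left of j
      if (context %in% c("left", "symmetric")) m[j, i] <- m[j, i] + w
    }
  }
  m
}

tokens <- c("a", "b", "c")
right <- cooc(tokens, window_size = 3L, context = "right")
left  <- cooc(tokens, window_size = 3L, context = "left")
sym   <- cooc(tokens, window_size = 3L, context = "symmetric")
zero  <- cooc(tokens, window_size = 1L)
```

In this sketch left is the transpose of right, sym is symmetric, and zero contains only zeros because a window of size 1 holds nothing but the context word.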

Details

For ngram_max > 1 the weights vector is automatically extended to match the "imaginary" sliding window over the n-grams. The proximity weight attached to an n-gram is the average of the weights of its constituents in the original sequence. This scheme results in consistent weighting across different values of ngram_min and ngram_max, and it is the reason why the first element of window_weights is the proximity to the context word itself (i.e. distance 0). For example:

default weights for the context window ["a" "b" "c" "d" "e"]
   a    b    c    d    e
1.00 0.50 0.33 0.25 0.20
for ngram=c(1L, 3L)
   a  a_b a_b_c    b  b_c b_c_d    c  c_d c_d_e    d  d_e    e
1.00 0.75  0.61 0.50 0.42  0.36 0.33 0.29  0.26 0.25 0.22 0.20
for ngram=c(2L, 3L)
 a_b a_b_c  b_c b_c_d  c_d c_d_e  d_e
0.75  0.61 0.42  0.36 0.29  0.26 0.22

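The averaging rule described in Details can be reproduced with a few lines of base R. The helper ngram_weights below is hypothetical and only mirrors the stated rule (an n-gram's weight is the mean of its constituents' weights); it is not taken from the package.

```r
# Hypothetical helper: proximity weights for all n-grams of a token window,
# each computed as the mean weight of the n-gram's constituent tokens.
ngram_weights <- function(tokens, weights, ngram = c(1L, 1L)) {
  out <- numeric(0)
  for (start in seq_along(tokens)) {
    for (n in seq.int(ngram[1], ngram[2])) {
      end <- start + n - 1L
      if (end > length(tokens)) break   # n-gram would run past the window
      idx <- start:end
      out[paste(tokens[idx], collapse = "_")] <- mean(weights[idx])
    }
  }
  out
}

tokens <- c("a", "b", "c", "d", "e")
w <- 1 / seq_along(tokens)              # default weights 1, 1/2, ..., 1/5

w13 <- ngram_weights(tokens, w, c(1L, 3L))
w23 <- ngram_weights(tokens, w, c(2L, 3L))
round(w13, 2)
round(w23, 2)
```

Under this rule the weight of an n-gram does not depend on ngram_min, so the ngram=c(2L, 3L) values are simply the corresponding subset of the ngram=c(1L, 3L) values.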