lma_weight: Document-Term Matrix Weighting

Description

Weight a document-term matrix.

Usage

lma_weight(dtm, weight = "count", normalize = TRUE, wc.complete = TRUE,
  log.base = 10, alpha = 1, pois.x = 1L, doc.only = FALSE,
  percent = FALSE)

Value

A weighted version of dtm, with a type attribute added (attr(dtm, 'type')).

Arguments

dtm

A matrix with words as column names.

weight

A string referring at least partially to one (or a combination; see note) of the available weighting methods:

Term weights (applied uniquely to each cell)

binary
(dtm > 0) * 1
Convert frequencies to 1s and 0s; remove differences in frequencies.
log
log(dtm + 1, log.base)
Log of frequencies.
sqrt
sqrt(dtm)
Square root of frequencies.
count
dtm
Unaltered; sometimes called term frequencies (tf).
amplify
dtm ^ alpha
Amplify difference in frequencies.

Document weights (applied by column)

dflog
log(colSums(dtm > 0), log.base)
Log of binary term sum.
entropy
1 - rowSums(x * log(x + 1, log.base) / log(ncol(x), log.base), na.rm = TRUE)
Where x = t(dtm) / colSums(dtm > 0); entropy of term-conditional term distribution.
ppois
1 - ppois(pois.x, colSums(dtm) / nrow(dtm))
Poisson-predicted term distribution.
dpois
1 - dpois(pois.x, colSums(dtm) / nrow(dtm))
Poisson-predicted term density.
dfmlog
log(diag(dtm[max.col(t(dtm)), ]), log.base)
Log of maximum term frequency.
dfmax
diag(dtm[max.col(t(dtm)), ])
Maximum term frequency.
df
colSums(dtm > 0)
Sum of binary term occurrence across documents.
idf
log(nrow(dtm) / colSums(dtm > 0), log.base)
Inverse document frequency.
ridf
idf - log(dpois, log.base)
Residual inverse document frequency.
normal
sqrt(1 / colSums(dtm ^ 2))
Normalized document frequency.

Alternatively, 'pmi' or 'ppmi' will apply a pointwise mutual information weighting scheme (with 'ppmi' setting negative values to 0).

normalize

Logical: if FALSE, the dtm is not divided by document word-count before being weighted.

wc.complete

If the dtm was made with lma_dtm (has a 'WC' attribute), word counts for frequencies can be based on the raw count (default; wc.complete = TRUE). If wc.complete = FALSE, or the dtm does not have a 'WC' attribute, rowSums(dtm) is used as word count.

log.base

The base of logs, applied to any weight using log. Default is 10.

alpha

A scaling factor applied to document frequency as part of pointwise mutual information weighting, or amplify's power (dtm ^ alpha, which defaults to 1.1).

pois.x

integer; quantile or probability of the poisson distribution (dpois(pois.x, colSums(x, na.rm = TRUE) / nrow(x))).

doc.only

Logical: if TRUE, only document weights are returned (a single value for each term).

percent

Logical; if TRUE, frequencies are multiplied by 100.

Examples

Run this code

# visualize term and document weights

## term weights
term_weights <- c("binary", "log", "sqrt", "count", "amplify")
Weighted <- sapply(term_weights, function(w) lma_weight(1:20, w, FALSE))
if (require(splot)) splot(Weighted ~ 1:20, labx = "Raw Count", lines = "co")

## document weights
doc_weights <- c(
  "df", "dflog", "dfmax", "dfmlog", "idf", "ridf",
  "normal", "dpois", "ppois", "entropy"
)
weight_range <- function(w, value = 1) {
  m <- diag(20)
  m[upper.tri(m, TRUE)] <- if (is.numeric(value)) {
    value
  } else {
    unlist(lapply(
      1:20, function(v) rep(if (value == "inverted") 21 - v else v, v)
    ))
  }
  lma_weight(m, w, FALSE, doc.only = TRUE)
}

if (require(splot)) {
  category <- rep(c("df", "idf", "normal", "poisson", "entropy"), c(4, 2, 1, 2, 1))
  op <- list(
    laby = "Relative (Scaled) Weight", labx = "Document Frequency",
    leg = "outside", lines = "connected", mv.scale = TRUE, note = FALSE
  )
  splot(
    sapply(doc_weights, weight_range) ~ 1:20,
    options = op, title = "Same Term, Varying Document Frequencies",
    sud = "All term frequencies are 1.",
    colorby = list(category, grade = TRUE)
  )
  splot(
    sapply(doc_weights, weight_range, value = "sequence") ~ 1:20,
    options = op, title = "Term as Document Frequencies",
    sud = "Non-zero terms are the number of non-zero terms.",
    colorby = list(category, grade = TRUE)
  )
  splot(
    sapply(doc_weights, weight_range, value = "inverted") ~ 1:20,
    options = op, title = "Term Opposite of Document Frequencies",
    sud = "Non-zero terms are the number of zero terms + 1.",
    colorby = list(category, grade = TRUE)
  )
}

Run the code above in your browser using DataLab