A local and a global weighting scheme can be combined and applied to a
given textmatrix \(m\) via \(dtm = lw(m) \cdot gw(m)\), where
\(m\) is the given document-term matrix,
\(lw(m)\) is one of the local weight functions lw\_tf(), lw\_logtf(), lw\_bintf(), and
\(gw(m)\) is one of the global weight functions gw\_normalisation(), gw\_idf(), gw\_gfidf(), entropy(), gw\_entropy().
This set of weighting schemes includes the local weightings (lw)
raw, log, and binary, and the global weightings (gw) normalisation, two versions of the
inverse document frequency (idf), and entropy, both in the original Shannon form and
in a slightly modified, more common version:
lw\_tf()
returns a completely unmodified \(n \times m\) matrix (placebo function).
lw\_logtf()
returns the logarithmised \(n \times m\) matrix. \(\log(m_{i,j}+1)\) is applied to every cell.
lw\_bintf()
returns binary values of the \(n \times m\) matrix. Every cell is assigned 1 if and only if the term frequency is not equal to 0, and 0 otherwise.
gw\_normalisation()
returns a normalised \(n \times m\) matrix. Every cell equals 1 divided by the vector length, that is, the square root of the sum of the squared frequencies.
gw\_idf()
returns the inverse document frequency in an \(n \times m\) matrix. Every cell is 1 plus the logarithm of the number of documents divided by the number of documents in which the term appears.
gw\_gfidf()
returns the global frequency multiplied by idf. Every cell equals the sum of the frequencies of one term divided by the number of documents in which the term appears.
entropy()
returns the entropy (as defined by Shannon).
gw\_entropy()
returns one plus entropy.
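The descriptions above can be made concrete on a toy matrix. The following is a minimal base R sketch, not the package's own code: the `sketch_*` helpers are hypothetical stand-ins written from the verbal definitions, the matrix is assumed to hold terms in rows and documents in columns, and the base-2 logarithm in the idf helper is an assumption.

```r
# Toy matrix: rows are terms, columns are documents
# (this orientation is an assumption of the sketch).
m <- matrix(c(2, 0, 1,
              0, 3, 0,
              1, 1, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("cat", "dog", "pet"), c("d1", "d2", "d3")))

# Hypothetical stand-ins that follow the verbal descriptions above.
sketch_logtf <- function(m) log(m + 1)        # log(m[i, j] + 1) per cell
sketch_bintf <- function(m) (m != 0) * 1      # 1 where the term occurs, else 0
sketch_idf <- function(m) {
  df <- rowSums(sketch_bintf(m))              # documents containing the term
  1 + log2(ncol(m) / df)                      # base 2 is an assumption
}
sketch_gfidf <- function(m) {
  rowSums(m) / rowSums(sketch_bintf(m))       # global frequency / document frequency
}

# The global weight is one value per term. `*` recycles it down each
# column, so every cell in row i is scaled by the weight of term i.
weighted <- sketch_logtf(m) * sketch_idf(m)
print(round(weighted, 3))
```

A term that occurs in every document ("pet" here) gets an idf of 1, so its cells keep their local weight unchanged, while rarer terms are scaled up.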
Be careful when folding data into an existing LSA space: you may want to
weight an additional textmatrix based on the same vocabulary with the global
weights of the training data (not those of the new data)!
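To illustrate the warning, here is a minimal base R sketch with made-up data. The helper `sketch_idf` is a hypothetical stand-in for an idf-style global weight (base-2 logarithm assumed), and both matrices are assumed to hold the same terms in the same row order. With the package functions, the equivalent call would presumably take the form `lw_logtf(new_docs) * gw_idf(train)`.

```r
# Two matrices over one shared vocabulary (rows are terms); toy data only.
train <- matrix(c(2, 0, 1, 0,
                  0, 3, 0, 1,
                  1, 1, 1, 1),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("cat", "dog", "pet"), paste0("d", 1:4)))
new_docs <- matrix(c(1, 1,
                     0, 2,
                     1, 1),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(rownames(train), c("q1", "q2")))

# Hypothetical idf helper following the description above (base 2 assumed).
sketch_idf <- function(m) 1 + log2(ncol(m) / rowSums(m != 0))

# Global weights are estimated once, on the training data only.
gw_train <- sketch_idf(train)

# Local weighting uses each matrix's own counts; the global weighting
# reuses gw_train in both cases.
train_weighted <- log(train + 1) * gw_train
new_weighted   <- log(new_docs + 1) * gw_train

# For contrast: global weights derived from the new data alone differ,
# which would put the new documents on a different scale.
gw_new <- sketch_idf(new_docs)
print(rbind(gw_train, gw_new))
```

In this toy data the weight for "cat" is 2 when estimated on the training matrix but 1 when estimated on the new documents, which is the mismatch the note warns about.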