Weight a term-document matrix according to a combination of weights specified in SMART notation.

`weightSMART(m, spec = "nnn", control = list())`

m

A `TermDocumentMatrix` in term frequency format.

spec

a character string consisting of three characters. The first letter
specifies a term frequency schema, the second a document frequency
schema, and the third a normalization schema. See **Details** for
available built-in schemata.

control

a list of control parameters. See **Details**.

The weighted matrix.

Formally this function is of class `WeightingFunction` with the
additional attributes `name` and `acronym`.

The first letter of `spec` specifies a weighting schema for term
frequencies of `m`:

- "n"
(natural) \(\mathit{tf}_{i,j}\) counts the number of occurrences \(n_{i,j}\) of a term \(t_i\) in a document \(d_j\). The input term-document matrix `m` is assumed to be in this standard term frequency format already.

- "l"
(logarithm) is defined as \(1 + \log_2(\mathit{tf}_{i,j})\).

- "a"
(augmented) is defined as \(0.5 + \frac{0.5 * \mathit{tf}_{i,j}}{\max_i(\mathit{tf}_{i,j})}\).

- "b"
(boolean) is defined as 1 if \(\mathit{tf}_{i,j} > 0\) and 0 otherwise.

- "L"
(log average) is defined as \(\frac{1 + \log_2(\mathit{tf}_{i,j})}{1+\log_2(\mathrm{ave}_{i\in j}(\mathit{tf}_{i,j}))}\).
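
As an illustration of the term frequency schemata above, the following base R sketch applies each formula to a small dense matrix. The toy matrix `tf` and all variable names are made up for this example. It assumes that zero counts stay zero under every schema and that the average in "L" is taken over the non-zero counts of each document; it is not the package's implementation.

```r
# Hypothetical toy term-document matrix: rows are terms, columns are documents.
tf <- matrix(c(2, 0, 1, 1,
               1, 4, 0, 0,
               0, 1, 0, 0),
             nrow = 3, byrow = TRUE,
             dimnames = list(c("oil", "price", "market"),
                             c("d1", "d2", "d3", "d4")))
nz <- tf > 0                      # positions of non-zero counts

# "n" (natural): the counts themselves.
tf_n <- tf

# "l" (logarithm): 1 + log2(tf) on non-zero counts (assumption: zeros stay 0).
tf_l <- tf
tf_l[nz] <- 1 + log2(tf[nz])

# "a" (augmented): 0.5 + 0.5 * tf / (largest count in the same document).
col_max <- apply(tf, 2, max)
tf_a <- 0.5 + 0.5 * sweep(tf, 2, col_max, "/")
tf_a[!nz] <- 0                    # assumption: zeros stay 0

# "b" (boolean): 1 where the term occurs, 0 otherwise.
tf_b <- nz * 1

# "L" (log average): log weight divided by 1 + log2 of the document's
# average count (assumption: average over non-zero counts only).
col_ave <- colSums(tf) / colSums(nz)
tf_L <- sweep(tf_l, 2, 1 + log2(col_ave), "/")
```

In this sketch the count 4 for "price" in `d2` becomes 3 under "l", which shows how the logarithmic schema dampens large counts.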

The second letter of `spec` specifies a weighting schema of
document frequencies for `m`:

- "n"
(no) is defined as 1.

- "t"
(idf) is defined as \(\log_2 \frac{N}{\mathit{df}_t}\) where \(N\) denotes the number of documents and \(\mathit{df}_t\) denotes the number of documents in which term \(t\) occurs.

- "p"
(prob idf) is defined as \(\max(0, \log_2(\frac{N - \mathit{df}_t}{\mathit{df}_t}))\).
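
The document frequency schemata can be sketched in base R as follows. The matrix `tf` is a made-up toy example, \(N\) is taken to be the number of documents (columns), and \(\mathit{df}_t\) is taken to be the number of documents containing term \(t\). The maximum in "p" is applied per term via `pmax()`, which is one reading of the formula above.

```r
# Hypothetical toy term-document matrix: rows are terms, columns are documents.
tf <- matrix(c(2, 0, 1, 1,
               1, 4, 0, 0,
               0, 1, 0, 0),
             nrow = 3, byrow = TRUE,
             dimnames = list(c("oil", "price", "market"),
                             c("d1", "d2", "d3", "d4")))

N  <- ncol(tf)              # number of documents
df <- rowSums(tf > 0)       # documents containing each term

# "n" (no): every term gets weight 1.
df_n <- rep(1, nrow(tf))

# "t" (idf): log2(N / df).
df_t <- log2(N / df)

# "p" (prob idf): max(0, log2((N - df) / df)), taken per term.
df_p <- pmax(0, log2((N - df) / df))
```

Here "market" occurs in one of four documents and receives an idf of 2, while "oil" occurs in three documents and its prob idf is clipped to 0.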

The third letter of `spec` specifies a schema for normalization
of `m`:

- "n"
(none) is defined as 1.

- "c"
(cosine) is defined as \(\sqrt{\mathrm{col\_sums}(m ^ 2)}\).

- "u"
(pivoted unique) is defined as \(\mathit{slope} * \sqrt{\mathrm{col\_sums}(m ^ 2)} + (1 - \mathit{slope}) * \mathit{pivot}\) where both `slope` and `pivot` must be set via named tags in the `control` list.

- "b"
(byte size) is defined as \(\frac{1}{\mathit{CharLength}^\alpha}\). The parameter \(\alpha\) must be set via the named tag `alpha` in the `control` list.
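
To make the "u" formula concrete, here is a minimal base R sketch that evaluates it on a toy matrix. The helper `pivoted_norm()`, the matrix `tf`, and the values chosen for `slope` and `pivot` are hypothetical and only serve the illustration; suitable values depend on the collection.

```r
# Hypothetical toy term-document matrix: rows are terms, columns are documents.
tf <- matrix(c(2, 0, 1, 1,
               1, 4, 0, 0,
               0, 1, 0, 0),
             nrow = 3, byrow = TRUE,
             dimnames = list(c("oil", "price", "market"),
                             c("d1", "d2", "d3", "d4")))

# Illustrative helper (not part of any package):
# slope * sqrt(col_sums(m^2)) + (1 - slope) * pivot, one value per document.
pivoted_norm <- function(m, slope, pivot) {
  euclid <- sqrt(colSums(m^2))    # per-document Euclidean length ("c")
  slope * euclid + (1 - slope) * pivot
}

euclid <- sqrt(colSums(tf^2))
# Assumed example values: slope 0.25, pivot set to the mean document length.
u <- pivoted_norm(tf, slope = 0.25, pivot = mean(euclid))
```

With `slope = 1` the expression reduces to the cosine component, and smaller slopes pull every document's factor towards `pivot`.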

The final result is the product of the chosen term frequency component, the chosen document frequency component, and the chosen normalization component.
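
As a worked illustration of how the three components combine, the sketch below builds `spec = "ntc"` weights by hand in base R: natural term frequency, idf, and cosine normalization. The toy matrix and variable names are invented for this example. It reads the cosine component as scaling each document column to unit Euclidean length, that is, multiplying by the reciprocal of the column norm, and it does not claim to reproduce the package's internal order of operations.

```r
# Hypothetical toy term-document matrix: rows are terms, columns are documents.
tf <- matrix(c(2, 0, 1, 1,
               1, 4, 0, 0,
               0, 1, 0, 0),
             nrow = 3, byrow = TRUE,
             dimnames = list(c("oil", "price", "market"),
                             c("d1", "d2", "d3", "d4")))

# "n": natural term frequency, i.e. the counts as given.
# "t": idf per term, log2(N / df).
idf <- log2(ncol(tf) / rowSums(tf > 0))

# Term frequency times document frequency; the length-3 idf vector is
# recycled down each column, so row i is scaled by idf[i].
w <- tf * idf

# "c": scale each document column to unit Euclidean length
# (assumption: the norm is taken after the idf weighting).
col_norm <- sqrt(colSums(w^2))
ntc <- sweep(w, 2, col_norm, "/")
```

After this step every column of `ntc` has unit length, so documents of different sizes become directly comparable.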

Christopher D. Manning and Prabhakar Raghavan and Hinrich Schütze (2008).
*Introduction to Information Retrieval*.
Cambridge University Press, ISBN 0521865719.

    # NOT RUN {
    data("crude")
    TermDocumentMatrix(crude,
                       control = list(removePunctuation = TRUE,
                                      stopwords = TRUE,
                                      weighting = function(x)
                                          weightSMART(x, spec = "ntc")))
    # }