tm (version 0.7-1)

weightTfIdf: Weight by Term Frequency - Inverse Document Frequency

Description

Weight a term-document matrix by term frequency - inverse document frequency.

Usage

weightTfIdf(m, normalize = TRUE)

Arguments

m

A TermDocumentMatrix in term frequency format.

normalize

A logical value indicating whether the term frequencies should be normalized by the total number of term occurrences in each document.

Value

The weighted matrix.

Details

Formally this function is of class WeightingFunction with the additional attributes name and acronym.

Term frequency \(\mathit{tf}_{i,j}\) is the number of occurrences \(n_{i,j}\) of a term \(t_i\) in a document \(d_j\). With normalization, the term frequency \(\mathit{tf}_{i,j}\) is divided by \(\sum_k n_{k,j}\), the total number of term occurrences in document \(d_j\).
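
The normalization step can be illustrated in base R without tm. This is a minimal sketch: the count matrix n and its term and document names are made up for illustration, and the code is not taken from the package.

```r
# Hypothetical counts n[i, j]: rows are terms t_i, columns are documents d_j.
n <- matrix(c(2, 1, 1, 3,
              2, 0, 1, 0,
              0, 0, 0, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("oil", "price", "market"),
                            c("d1", "d2", "d3", "d4")))

# Normalized term frequency: divide each column j by its total, sum_k n[k, j].
tf <- sweep(n, 2, colSums(n), "/")

# Every column of tf now sums to 1.
colSums(tf)
```

After normalization, documents of different lengths contribute on the same scale.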

Inverse document frequency for a term \(t_i\) is defined as $$\mathit{idf}_i = \log_2 \frac{|D|}{|\{d \mid t_i \in d\}|}$$ where \(|D|\) is the total number of documents and \(|\{d \mid t_i \in d\}|\) is the number of documents in which the term \(t_i\) appears. A term that appears in every document therefore has an inverse document frequency of zero.
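
The definition can be checked on a small example in base R. As a sketch under stated assumptions, the count matrix n below is hypothetical and has four documents, so the logarithms come out as whole numbers.

```r
# Hypothetical counts n[i, j]: rows are terms, columns are documents.
n <- matrix(c(2, 1, 1, 3,
              2, 0, 1, 0,
              0, 0, 0, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("oil", "price", "market"),
                            c("d1", "d2", "d3", "d4")))

n_docs   <- ncol(n)          # |D|, the total number of documents
doc_freq <- rowSums(n > 0)   # number of documents in which each term appears

# idf_i = log2(|D| / document frequency of term i)
idf <- log2(n_docs / doc_freq)
idf
```

Here "oil" occurs in all four documents, "price" in two and "market" in one, so the rarer a term is across the collection, the larger its inverse document frequency.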

Term frequency - inverse document frequency is then defined as the product \(\mathit{tf}_{i,j} \cdot \mathit{idf}_i\).
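
Putting the definitions together, the whole weighting might be sketched in base R as follows. The helper tfidf_sketch and the matrix n are hypothetical names introduced for illustration; this is not tm's implementation, and it assumes a dense count matrix with no empty documents and no terms that are absent from every document.

```r
# Hypothetical helper, for illustration only; not the code used by tm.
tfidf_sketch <- function(n, normalize = TRUE) {
  # Term frequency: raw counts, optionally divided by each document's total.
  tf <- if (normalize) sweep(n, 2, colSums(n), "/") else n
  # Inverse document frequency: log2(|D| / number of documents with the term).
  idf <- log2(ncol(n) / rowSums(n > 0))
  # Recycling multiplies every entry in row i of tf by idf[i].
  tf * idf
}

# Hypothetical counts: rows are terms, columns are documents.
n <- matrix(c(2, 1, 1, 3,
              2, 0, 1, 0,
              0, 0, 0, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("oil", "price", "market"),
                            c("d1", "d2", "d3", "d4")))

w     <- tfidf_sketch(n)                     # normalized term frequencies
w_raw <- tfidf_sketch(n, normalize = FALSE)  # raw counts
w
```

In this example the term "oil" receives weight zero everywhere because it occurs in every document, which is the behavior the product of the two factors implies.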

References

Gerard Salton and Christopher Buckley (1988). Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(5), 513–523.
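
Examples

A short usage sketch, assuming the tm package is installed and using the crude corpus that ships with it. The object names tdm, tdm_tfidf and tdm_direct are chosen here for illustration.

```r
library(tm)

# "crude" is a small corpus of Reuters articles bundled with tm.
data("crude")

# Build a matrix in term frequency format, then weight it.
tdm <- TermDocumentMatrix(crude)
tdm_tfidf <- weightTfIdf(tdm, normalize = TRUE)

# Alternatively, apply the weighting while the matrix is built,
# here without normalizing the term frequencies.
tdm_direct <- TermDocumentMatrix(
  crude,
  control = list(weighting = function(x) weightTfIdf(x, normalize = FALSE))
)

# Look at a small corner of the weighted matrix.
inspect(tdm_tfidf[1:5, 1:5])
```

Passing the function through the weighting entry of the control list avoids keeping the unweighted matrix around when only the weighted one is needed.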