tm (version 0.6-1)

weightTfIdf: Weight by Term Frequency - Inverse Document Frequency

Description

Weight a term-document matrix by term frequency - inverse document frequency.

Usage

weightTfIdf(m, normalize = TRUE)

Arguments

m
A TermDocumentMatrix in term frequency format.
normalize
A logical value indicating whether the term frequencies should be normalized by the total number of term occurrences in each document.

Value

The weighted matrix.
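
A minimal usage sketch, assuming the tm package is installed. The three toy documents and the variable names (docs, corpus, tdm, tfidf) are made up for illustration and are not part of the package.

```r
library(tm)

# Three toy documents (hypothetical data, for illustration only).
docs <- c("oil prices rise", "oil market falls", "market prices rise again")
corpus <- VCorpus(VectorSource(docs))

# Build a term-document matrix of raw term frequencies, then weight it.
tdm <- TermDocumentMatrix(corpus)
tfidf <- weightTfIdf(tdm, normalize = TRUE)

# Alternatively, pass the weighting function when the matrix is built.
tfidf_direct <- TermDocumentMatrix(corpus, control = list(weighting = weightTfIdf))

inspect(tfidf)
```

Both routes should yield a TermDocumentMatrix whose entries are tf-idf weights rather than raw counts.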

Details

Formally this function is of class WeightingFunction with the additional attributes Name and Acronym.

Term frequency $\mathit{tf}_{i,j}$ counts the number of occurrences $n_{i,j}$ of a term $t_i$ in a document $d_j$. If normalize = TRUE, the term frequency $\mathit{tf}_{i,j}$ is divided by $\sum_k n_{k,j}$, the total number of term occurrences in document $d_j$.

Inverse document frequency for a term $t_i$ is defined as $$\mathit{idf}_i = \log_2 \frac{|D|}{|\{d \mid t_i \in d\}|}$$ where $|D|$ denotes the total number of documents and where $|\{d \mid t_i \in d\}|$ is the number of documents in which the term $t_i$ appears.

Term frequency - inverse document frequency is then defined as the product $\mathit{tf}_{i,j} \cdot \mathit{idf}_i$.
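
The definitions above can be traced by hand in base R. This is a sketch of the formulas only, not the package's internal implementation; the count matrix n and its term and document labels are invented for illustration.

```r
# Toy counts n[i, j]: rows are terms t_i, columns are documents d_j.
n <- matrix(
  c(2, 0, 1,
    1, 1, 1,
    0, 3, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("oil", "price", "market"), c("d1", "d2", "d3"))
)

# Normalized term frequency: divide each column j by sum_k n[k, j].
tf <- sweep(n, 2, colSums(n), "/")

# Inverse document frequency: log2(|D| / number of documents containing t_i).
idf <- log2(ncol(n) / rowSums(n > 0))

# tf-idf: multiply every entry in row i of tf by idf[i].
# R recycles idf down the columns, so the row index lines up with the term.
tfidf <- tf * idf

print(round(tfidf, 3))
```

In this toy matrix the term "price" occurs in every document, so its idf is $\log_2(3/3) = 0$ and its whole row of weights is zero, whereas "market", which occurs in only one document, gets the largest idf.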

References

Gerard Salton and Christopher Buckley (1988). Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(5), 513-523.