weightTfIdf

Weight by Term Frequency - Inverse Document Frequency

Weight a term-document matrix by term frequency - inverse document frequency.

Usage
weightTfIdf(m, normalize = TRUE)
Arguments
m
A TermDocumentMatrix in term frequency format.
normalize
A Boolean value indicating whether the term frequencies should be normalized.
Details

Formally this function is of class WeightingFunction with the additional attributes Name and Acronym.

Term frequency $\mathit{tf}_{i,j}$ counts the number of occurrences $n_{i,j}$ of a term $t_i$ in a document $d_j$. If normalize is TRUE, $\mathit{tf}_{i,j}$ is divided by $\sum_k n_{k,j}$, the total number of term occurrences in $d_j$.

Inverse document frequency for a term $t_i$ is defined as $$\mathit{idf}_i = \log_2 \frac{|D|}{|\{d \mid t_i \in d\}|}$$ where $|D|$ denotes the total number of documents and $|\{d \mid t_i \in d\}|$ is the number of documents in which the term $t_i$ appears.

Term frequency - inverse document frequency is then defined as $\mathit{tf}_{i,j} \cdot \mathit{idf}_i$, so a term that appears in every document receives weight 0.
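
The definitions above can be worked through by hand on a small matrix. The following is a minimal base R sketch, not the package's implementation: the toy counts and the names n, tf, idf and tfidf are made up for illustration, and normalization is assumed to be switched on.

```r
# Toy term-document matrix of raw counts n[i, j] (made-up data):
# rows are terms t_i, columns are documents d_j.
n <- matrix(
  c(2, 0, 1,
    1, 1, 0,
    1, 1, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(Terms = c("apple", "banana", "cherry"),
                  Docs  = c("d1", "d2", "d3"))
)

# Normalized term frequency: divide each column j by sum_k n[k, j].
tf <- sweep(n, 2, colSums(n), "/")

# Inverse document frequency: log2(|D| / number of documents containing t_i).
idf <- log2(ncol(n) / rowSums(n > 0))

# tf-idf: multiply row i of tf by idf[i]
# (the vector idf is recycled down the columns, one value per term).
tfidf <- tf * idf

print(round(tfidf, 3))
```

Here "cherry" occurs in all three documents, so its idf is $\log_2(3/3) = 0$ and its whole row of weights is 0, while "apple" in d1 gets $(2/4) \cdot \log_2(3/2)$.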

Value

  • The weighted matrix.
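
A short usage sketch may help here. It assumes the tm package is installed (the guard skips the example otherwise); the three toy documents and the variable names docs, corpus, tdm and weighted are made up for illustration.

```r
# Run only if tm is available; the documents below are made-up toy data.
if (requireNamespace("tm", quietly = TRUE)) {
  library(tm)

  docs <- c("apple apple banana cherry",
            "banana cherry",
            "apple cherry")

  # Build a corpus and a term-document matrix of raw term counts
  # (term frequency is the default weighting).
  corpus <- VCorpus(VectorSource(docs))
  tdm <- TermDocumentMatrix(corpus)

  # Re-weight the counts by tf-idf, normalizing term frequencies per document.
  weighted <- weightTfIdf(tdm, normalize = TRUE)

  inspect(weighted)
}
```

The weighting can also be applied while the matrix is built, by passing control = list(weighting = weightTfIdf) to TermDocumentMatrix.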

References

Gerard Salton and Christopher Buckley (1988). Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(5), 513–523.

Aliases
  • weightTfIdf
Documentation reproduced from package tm, version 0.6-2, License: GPL-3
