audubon (version 0.5.1)

bind_tf_idf2: Bind term frequency and inverse document frequency

Description

Calculates the term frequency, inverse document frequency, and TF-IDF of a dataset and binds them to it. This function experimentally supports three types of term frequency and four types of inverse document frequency, as implemented in the 'RMeCab' package.

Usage

bind_tf_idf2(
  tbl,
  term = "token",
  document = "doc_id",
  n = "n",
  tf = c("tf", "tf2", "tf3"),
  idf = c("idf", "idf2", "idf3", "idf4"),
  norm = FALSE,
  rmecab_compat = TRUE
)

Value

A data.frame.

Arguments

tbl

A tidy text dataset.

term

Column containing terms, given as a string or symbol.

document

Column containing document IDs, given as a string or symbol.

n

Column containing document-term counts, given as a string or symbol.

tf

Method for computing term frequency.

idf

Method for computing inverse document frequency.

norm

Logical; if TRUE, the raw term counts are normalized by dividing them by their L2 norms before IDF values are computed.

rmecab_compat

Logical; if TRUE, values are computed so as to stay compatible with 'RMeCab'. Note that 'RMeCab' always computes IDF values from term frequency rather than raw term counts, so TF-IDF values may be affected by term frequency twice.
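The L2 normalization described for norm can be illustrated with a minimal base R sketch. It assumes the norm is taken over each document's count vector; the page does not say which grouping the package uses, and the data frame and the n_norm column are made up for illustration.

```r
# Toy document-term counts (hypothetical data, not from the package).
counts <- data.frame(
  doc_id = c("d1", "d1", "d2", "d2"),
  token  = c("a", "b", "a", "c"),
  n      = c(3, 4, 6, 8)
)

# L2 norm of each document's count vector, repeated for every row of
# that document (assumption: norms are computed per document).
l2 <- sqrt(ave(counts$n^2, counts$doc_id, FUN = sum))

# Normalized counts: each document's vector now has unit length.
counts$n_norm <- counts$n / l2
print(counts)
```

After this step the squared normalized counts in each document sum to 1, so long and short documents contribute on the same scale.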

Details

The type of term frequency is selected with the tf argument:

  • tf is term frequency (not the raw term count).

  • tf2 is logarithmic term frequency with base 10.

  • tf3 is binary-weighted term frequency.

The type of inverse document frequency is selected with the idf argument:

  • idf is smoothed inverse document frequency with base 2. 'Smoothed' here simply means that 1 is added after taking the logarithm.

  • idf2 is global frequency IDF.

  • idf3 is probabilistic IDF with base 2.

  • idf4 is global entropy, which is not actually an IDF.
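Some of the weightings listed above can be sketched in base R using their common textbook forms. This is an illustration only: the exact formulas used by the package (and by 'RMeCab') may differ in detail, the toy data and column names are hypothetical, and tf2, idf2, and idf4 are left out because their precise definitions are not given here.

```r
# Toy document-term counts, one row per (document, term) pair
# (hypothetical data, not from the package).
counts <- data.frame(
  doc_id = c("d1", "d1", "d2", "d2", "d3", "d4"),
  token  = c("a", "b", "a", "c", "a", "b"),
  n      = c(2, 2, 1, 3, 5, 1)
)

n_docs <- length(unique(counts$doc_id))

# Term frequency: count divided by the total count of its document.
counts$tf <- counts$n / ave(counts$n, counts$doc_id, FUN = sum)

# Binary-weighted term frequency: 1 if the term occurs at all.
counts$tf_binary <- as.numeric(counts$n > 0)

# Document frequency: number of documents containing each term.
doc_freq <- ave(rep(1, nrow(counts)), counts$token, FUN = sum)

# Smoothed IDF with base 2 (assumed form: 1 added after the logarithm).
counts$idf_smooth <- log2(n_docs / doc_freq) + 1

# Probabilistic IDF with base 2 (textbook form, assumed here).
counts$idf_prob <- log2((n_docs - doc_freq) / doc_freq)

# TF-IDF as the product of the chosen TF and IDF.
counts$tf_idf <- counts$tf * counts$idf_smooth
print(counts)
```

Note that the probabilistic form becomes zero for a term found in exactly half of the documents and negative for more common terms, which is one reason to compare the variants on real data before choosing one.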

Examples

if (FALSE) {
df <- dplyr::add_count(hiroba, doc_id, token)
bind_tf_idf2(df)
}
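A call with non-default weightings might look like the sketch below. It uses only the arguments shown in the Usage section; the toy data frame is hypothetical and simply follows the default column names (doc_id, token, n), and the call is skipped when the package is not installed.

```r
# Hypothetical toy input using the default column names.
toy <- data.frame(
  doc_id = c("d1", "d1", "d2", "d2", "d3"),
  token  = c("a", "b", "a", "c", "b"),
  n      = c(2L, 1L, 1L, 3L, 4L)
)

# Run only if 'audubon' is available.
if (requireNamespace("audubon", quietly = TRUE)) {
  # Binary-weighted TF combined with the smoothed IDF.
  res <- audubon::bind_tf_idf2(toy, tf = "tf3", idf = "idf", norm = FALSE)
  print(res)
}
```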