bind_tf_idf2: Bind term frequency and inverse document frequency

Description

Calculates and binds the term frequency, inverse document frequency, and TF-IDF of the dataset. This function experimentally supports the 3 types of term frequency and 4 types of inverse document frequency that are implemented in the 'RMeCab' package.

Usage

bind_tf_idf2(
  tbl,
  term = "token",
  document = "doc_id",
  n = "n",
  tf = c("tf", "tf2", "tf3"),
  idf = c("idf", "idf2", "idf3", "idf4"),
  norm = FALSE,
  rmecab_compat = TRUE
)

Value

A data.frame.

Arguments

tbl: A tidy text dataset.
term: Column containing terms as string or symbol.
document: Column containing document IDs as string or symbol.
n: Column containing document-term counts as string or symbol.
tf: Method for computing term frequency.
idf: Method for computing inverse document frequency.
norm: Logical; if TRUE, the raw term counts are normalized by dividing them by their L2 norm before IDF values are computed.
rmecab_compat: Logical; if TRUE, values are computed so as to stay compatible with 'RMeCab'. Note that 'RMeCab' always computes IDF values using term frequency rather than raw term counts, so TF-IDF values may be affected by term frequency twice.
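
As a rough illustration of what L2 normalization of term counts means, the following base R sketch divides each count by the L2 norm of its document's count vector. The toy data and the choice to take the norm per document are assumptions made for this example; they are not taken from the package source.

```r
# Toy document-term counts (hypothetical data, not shipped with the package)
counts <- data.frame(
  doc_id = c("d1", "d1", "d2", "d2"),
  token  = c("a", "b", "a", "c"),
  n      = c(3, 4, 6, 8)
)

# L2 norm of each document's count vector, repeated for every row of that document
# (assumption: the norm is taken per document)
l2 <- ave(counts$n, counts$doc_id, FUN = function(x) sqrt(sum(x^2)))

# Normalized counts: each document's vector now has unit length
counts$n_norm <- counts$n / l2
```

After this step the two documents have the same normalized profile (0.6 and 0.8) even though their raw counts differ by a factor of two.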

Details

The type of term frequency is selected with the tf argument:

tf is term frequency (not the raw count of terms).
tf2 is logarithmic term frequency with base 10.
tf3 is binary-weighted term frequency.
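
The three term frequency variants listed above can be sketched in base R as follows. This is only an illustration using common textbook forms; in particular, adding 1 inside the logarithm for tf2 is an assumption, and the exact formulas used by the function may differ. The toy data is hypothetical.

```r
# Toy document-term counts (hypothetical data)
counts <- data.frame(
  doc_id = c("d1", "d1", "d2"),
  token  = c("a", "b", "a"),
  n      = c(9, 1, 99)
)

# Total number of tokens in each document, repeated for every row of that document
doc_total <- ave(counts$n, counts$doc_id, FUN = sum)

# "tf": relative term frequency within the document
counts$tf <- counts$n / doc_total

# "tf2": base-10 logarithmic term frequency (the + 1 is an assumption)
counts$tf2 <- log10(counts$n + 1)

# "tf3": binary weighting, 1 if the term occurs in the document
counts$tf3 <- as.numeric(counts$n > 0)
```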

The type of inverse document frequency is selected with the idf argument:

idf is smoothed inverse document frequency with base 2. 'Smoothed' here just means that 1 is added after the logarithm is taken.
idf2 is global frequency IDF.
idf3 is probabilistic IDF with base 2.
idf4 is global entropy, which is not actually an IDF.
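
To make the four weightings more concrete, here is a minimal base R sketch on a small document-term matrix. The formulas are common textbook definitions and are assumptions for illustration; they are not guaranteed to match the function's output term for term. The matrix and all variable names are hypothetical.

```r
# Toy document-term matrix: rows are documents, columns are terms (hypothetical data)
m <- rbind(
  d1 = c(a = 1, b = 3, c = 5),
  d2 = c(a = 1, b = 3, c = 0),
  d3 = c(a = 1, b = 0, c = 0),
  d4 = c(a = 1, b = 0, c = 0)
)

n_docs   <- nrow(m)        # number of documents
doc_freq <- colSums(m > 0) # number of documents containing each term
glob_frq <- colSums(m)     # total count of each term over all documents

# "idf": base-2 IDF with 1 added after taking the logarithm
idf <- log2(n_docs / doc_freq) + 1

# "idf2": global frequency IDF (total count divided by document frequency)
idf2 <- glob_frq / doc_freq

# "idf3": probabilistic IDF with base 2
idf3 <- log2((n_docs - doc_freq) / doc_freq)

# "idf4": global entropy weight; zero counts contribute 0 to the sum
p     <- sweep(m, 2, glob_frq, "/")
plogp <- ifelse(p > 0, p * log2(p), 0)
idf4  <- 1 + colSums(plogp) / log2(n_docs)
```

In this sketch the term a, which occurs in every document, gets an idf3 of -Inf and an entropy weight of 0, while the term c, which occurs in a single document, gets the largest weights.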

Examples

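if (FALSE) {
  # Count each token per document; add_count() stores the counts in column `n`
  df <- dplyr::add_count(hiroba, doc_id, token)
  # Bind the term frequency, IDF, and TF-IDF values using the default methods
  bind_tf_idf2(df)
}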