Calculates and binds the term frequency, inverse document frequency, and TF-IDF of the dataset. This function experimentally supports 3 types of term frequency and 4 types of inverse document frequency, all of which are implemented in the 'RMeCab' package.
bind_tf_idf2(
tbl,
term = "token",
document = "doc_id",
n = "n",
tf = c("tf", "tf2", "tf3"),
idf = c("idf", "idf2", "idf3", "idf4"),
norm = FALSE,
rmecab_compat = TRUE
)
Returns a data.frame.
tbl: A tidy text dataset.
term: Column containing terms, as a string or symbol.
document: Column containing document IDs, as a string or symbol.
n: Column containing document-term counts, as a string or symbol.
tf: Method for computing term frequency.
idf: Method for computing inverse document frequency.
norm: Logical; if TRUE, the raw term counts are normalized by dividing them by their L2 norm before IDF values are computed.
rmecab_compat: Logical; if TRUE, computes values in a way that is compatible with 'RMeCab'. Note that 'RMeCab' always computes IDF values using term frequency rather than raw term counts, and thus TF-IDF values may be doubly affected by term frequency.
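As a rough illustration of L2 normalization of term counts, the sketch below divides each count by the L2 norm of the counts in its document. Taking the norm per document is an assumption made here, and the data frame and its values are made up for the example; the function's internal computation may differ.

```r
# Hypothetical tidy counts (values invented for illustration)
counts <- data.frame(
  doc_id = c("d1", "d1", "d2"),
  token  = c("a", "b", "a"),
  n      = c(3, 4, 2)
)

# Divide each count by the L2 norm of its document's counts
# (assumption: the norm is computed within each document)
counts$n_norm <- ave(
  counts$n, counts$doc_id,
  FUN = function(x) x / sqrt(sum(x^2))
)
```

After normalization, the squared values within each document sum to 1.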
The type of term frequency can be switched with the tf argument:
tf is term frequency (not the raw count of terms).
tf2 is logarithmic term frequency with base 10.
tf3 is binary-weighted term frequency.
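The three term frequency variants can be sketched in base R as follows. This is a minimal illustration using common textbook definitions, not the function's actual source: the example counts are invented, and adding 1 before taking the logarithm for tf2 is an assumption.

```r
# Hypothetical counts of three terms in a single document
n <- c(apple = 3, banana = 1, cherry = 0)

# "tf": relative term frequency (count divided by the document total)
tf_relative <- n / sum(n)

# "tf2": base-10 logarithmic term frequency; adding 1 before the log is an
# assumption made here so that a zero count maps to zero
tf_log <- log10(n + 1)

# "tf3": binary weighting (1 if the term occurs, otherwise 0)
tf_binary <- as.numeric(n > 0)
```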
The type of inverse document frequency can be switched with the idf argument:
idf is smoothed inverse document frequency with base 2. 'Smoothed' here simply means that 1 is added to the value after taking the logarithm.
idf2 is global frequency IDF.
idf3 is probabilistic IDF with base 2.
idf4 is global entropy, which is not actually an IDF.
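The four global weights can be sketched in base R as below. The formulas follow the commonly used definitions of these weightings and are an assumption; the exact computation in 'RMeCab' and in this function may differ. The small term-document matrix is made up for the example.

```r
# Hypothetical term-document count matrix (rows: terms, columns: documents)
m <- matrix(
  c(2, 0, 1,
    1, 1, 1,
    0, 3, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("a", "b", "c"), c("d1", "d2", "d3"))
)

n_docs <- ncol(m)     # number of documents
df <- rowSums(m > 0)  # document frequency of each term
gf <- rowSums(m)      # global (corpus-wide) frequency of each term

# "idf": base-2 IDF with 1 added after taking the logarithm
idf_smoothed <- log2(n_docs / df) + 1

# "idf2": global frequency IDF (global frequency over document frequency)
idf_gf <- gf / df

# "idf3": probabilistic IDF; -Inf for a term that occurs in every document
idf_prob <- log2((n_docs - df) / df)

# "idf4": global entropy; p is each term's distribution over documents
p <- m / gf
plogp <- ifelse(p > 0, p * log(p), 0)
entropy <- 1 + rowSums(plogp) / log(n_docs)
```

In this sketch a term concentrated in a single document gets an entropy weight of 1, while a term spread evenly over all documents gets 0.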
if (FALSE) {
df <- dplyr::add_count(hiroba, doc_id, token)
bind_tf_idf2(df)
}