tidyfst (version 1.8.3)

bind_tf_idf_dt: Compute TF–IDF Using data.table with Optional Counting and Grouping

Description

This function computes term frequency–inverse document frequency (tf–idf) on a dataset that has either one row per term occurrence or one row per pre-counted term. It preserves the original column names and returns these new columns:

- `n`: raw count (computed or user-supplied)
- `tf`: term frequency per document
- `idf`: inverse document frequency per group (or per corpus)
- `tf_idf`: tf × idf

If `group_col` is `NULL`, all documents are treated as a single group.
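The four columns described above can be sketched by hand with plain data.table. This is only an illustration, not the package's internal code: the toy data is made up, and the natural-log form of idf (log of the number of documents over the number of documents containing the term) is an assumption borrowed from the common tf–idf convention.

```r
library(data.table)

# Hypothetical toy input: one row per term occurrence
dt <- data.table(
  doc_id = c("d1", "d1", "d1", "d2", "d2", "d3"),
  word   = c("apple", "apple", "banana", "apple", "cherry", "cherry")
)

# n: raw count of each term within each document
counts <- dt[, .(n = .N), by = .(doc_id, word)]

# tf: the term's share of all term occurrences in its document
counts[, tf := n / sum(n), by = doc_id]

# idf (assumed convention): log(total documents / documents containing the term)
n_docs <- uniqueN(counts$doc_id)
counts[, idf := log(n_docs / uniqueN(doc_id)), by = word]

# tf_idf: product of the two
counts[, tf_idf := tf * idf]
counts
```

A term that appears in every document gets an idf of zero under this convention, so its tf_idf is zero regardless of how often it occurs.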

Usage

bind_tf_idf_dt(.data, group_col = NULL, doc_col, term_col, n_col = NULL)

Value

A data.table containing:

- Original grouping, document, and term columns
- `n`, `tf`, `idf`, and `tf_idf`

Arguments

.data

A data.frame or data.table of text data.

group_col

Character name of grouping column, or `NULL` for no grouping.

doc_col

Character name of document identifier column.

term_col

Character name of term/word column.

n_col

(Optional) Character name of pre-counted term-frequency column. If `NULL` (default), counts are computed via `.N`.
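To make the roles of `group_col` and `n_col` concrete, here is a rough data.table sketch of per-group tf–idf on pre-counted input. It is a hand-written approximation under stated assumptions, not the function's actual implementation: the data is hypothetical, idf uses the natural log, and document identifiers are assumed to be unique only within a group.

```r
library(data.table)

# Hypothetical pre-counted input: one row per (group, document, term)
counts <- data.table(
  category = c("A", "A", "A", "B", "B"),
  doc_id   = c("d1", "d1", "d2", "d1", "d2"),
  word     = c("apple", "banana", "apple", "dog", "cat"),
  n        = c(2L, 1L, 1L, 3L, 1L)
)

# tf within each (group, document) pair, using the supplied counts
counts[, tf := n / sum(n), by = .(category, doc_id)]

# Number of distinct documents in each group (helper column, dropped below)
counts[, n_docs := uniqueN(doc_id), by = category]

# idf within each group (assumed convention: natural log)
counts[, idf := log(n_docs / uniqueN(doc_id)), by = .(category, word)]
counts[, tf_idf := tf * idf]
counts[, n_docs := NULL]
counts
```

With no grouping column, the same steps would run with `doc_id` and `word` alone in the `by` clauses, which corresponds to treating all documents as one group.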

Examples

library(tidyfst)

# With groups
df <- data.frame(
  category = rep(c("A","B"), each = 6),
  doc_id   = rep(c("d1","d2","d3"), times = 4),
  word     = c("apple","banana","apple","banana","cherry","apple",
               "dog","cat","dog","mouse","cat","dog"),
  stringsAsFactors = FALSE
)
result <- bind_tf_idf_dt(df, "category", "doc_id", "word")
result

# Without groups
df %>%
  filter_dt(category == "A") %>%
  bind_tf_idf_dt(doc_col = "doc_id", term_col = "word")

# With counts provided
df %>%
  filter_dt(category == "A") %>%
  count_dt() %>%
  bind_tf_idf_dt(doc_col = "doc_id", term_col = "word", n_col = "n")

df %>%
  count_dt() %>%
  bind_tf_idf_dt(group_col = "category",
                 doc_col = "doc_id",
                 term_col = "word",
                 n_col = "n")
