cnlp_utils_tfidf: Construct the TF-IDF Matrix from Annotation or Data Frame

Description

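Given annotations, this function returns the term frequency-inverse document frequency (tf-idf) matrix computed from the extracted lemmas. cnlp_utils_tf takes the same arguments, with defaults (raw counts, uniform idf weights, no document-frequency filtering) chosen to return plain term frequencies.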

Usage

cnlp_utils_tfidf(
  object,
  tf_weight = c("lognorm", "binary", "raw", "dnorm"),
  idf_weight = c("idf", "smooth", "prob", "uniform"),
  min_df = 0.1,
  max_df = 0.9,
  max_features = 10000,
  doc_var = "doc_id",
  token_var = "lemma",
  vocabulary = NULL,
  doc_set = NULL
)

cnlp_utils_tf(
  object,
  tf_weight = "raw",
  idf_weight = "uniform",
  min_df = 0,
  max_df = 1,
  max_features = 10000,
  doc_var = "doc_id",
  token_var = "lemma",
  vocabulary = NULL,
  doc_set = NULL
)

Value

a sparse matrix with dimnames giving the documents and the vocabulary.
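
As a rough illustration of that layout, the sketch below builds a small document-by-term count matrix in base R. The toy data frame `anno` is made up for illustration, and a dense matrix from `table` stands in for the sparse matrix the function returns; it is a sketch of the shape, not of the package's implementation.

```r
# Toy annotation table (hypothetical data): one row per token.
anno <- data.frame(
  doc_id = c("d1", "d1", "d1", "d2", "d2", "d3"),
  lemma  = c("cat", "cat", "sit", "cat", "dog", "dog"),
  stringsAsFactors = FALSE
)

# Cross-tabulate documents against lemmas to get raw counts.
# unclass() drops the "table" class and leaves a plain integer matrix.
counts <- unclass(table(anno$doc_id, anno$lemma))

# The dimnames carry the document ids (rows) and the vocabulary (columns).
rownames(counts)
colnames(counts)

# Entries can then be looked up by name.
counts["d1", "cat"]
```

Named dimensions make it possible to index the result by document id and token rather than by position.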

Arguments

object: a data frame containing an identifier for the document (set with doc_var) and token (set with token_var)
tf_weight: the weighting scheme for the term frequency matrix. The selection lognorm takes one plus the log of the raw frequency (or zero if the raw frequency is zero), binary encodes a zero-one matrix indicating only whether the token occurs in the document, raw returns raw counts, and dnorm uses double normalization.
idf_weight: the weighting scheme for the inverse document frequency. The selection idf gives the logarithm of the simple inverse frequency, smooth gives the logarithm of one plus the simple inverse frequency, and prob gives the log odds of the token occurring in a randomly selected document. Set to uniform to return just the term frequencies.
min_df: the minimum proportion of documents in which a token must appear to be included in the vocabulary
max_df: the maximum proportion of documents in which a token may appear and still be included in the vocabulary
max_features: the maximum number of tokens in the vocabulary
doc_var: character vector. The name of the column in object that contains the document ids. Defaults to "doc_id".
token_var: character vector. The name of the column in object that contains the tokens. Defaults to "lemma".
vocabulary: character vector. The vocabulary set to use in constructing the matrices. Will be computed within the function if set to NULL. When supplied, the options min_df, max_df, and max_features are ignored.
doc_set: optional character vector of document ids. Useful to create empty rows in the output matrix for documents without data in the input. Most users will want to keep this equal to NULL, the default, to have the function compute the document set automatically.
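
The vocabulary filters and several of the weighting options above can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's implementation: the toy data, the helper names `tf_weights` and `idf_weights`, the use of the natural logarithm, the inclusive bounds on min_df and max_df, and the choice to rank tokens by document frequency for max_features are all assumptions. The dnorm and prob options are left out because the descriptions above do not pin down their exact formulas.

```r
# Toy annotation table (hypothetical data): one row per token.
anno <- data.frame(
  doc_id = c("d1", "d1", "d1", "d1", "d2", "d2", "d2", "d3", "d3", "d3", "d4"),
  lemma  = c("cat", "cat", "sit", "the", "cat", "dog", "the", "cat", "dog", "the", "the"),
  stringsAsFactors = FALSE
)

counts   <- unclass(table(anno$doc_id, anno$lemma))  # raw term counts
n_docs   <- nrow(counts)
doc_freq <- colSums(counts > 0)                      # documents containing each token

# Vocabulary filter: keep tokens whose document proportion lies within
# [min_df, max_df] (inclusive bounds assumed), most widespread first,
# and keep at most max_features of them.
min_df <- 0.3
max_df <- 0.9
max_features <- 10
prop  <- sort(doc_freq / n_docs, decreasing = TRUE)
vocab <- names(prop)[prop >= min_df & prop <= max_df]
vocab <- head(vocab, max_features)

# Term frequency weights as described for tf_weight (dnorm omitted).
tf_weights <- function(tf, scheme) {
  switch(scheme,
    raw     = tf,
    binary  = (tf > 0) * 1,
    lognorm = ifelse(tf > 0, 1 + log(tf), 0)
  )
}

# Inverse document frequency weights as described for idf_weight (prob omitted).
idf_weights <- function(doc_freq, n_docs, scheme) {
  switch(scheme,
    idf     = log(n_docs / doc_freq),
    smooth  = log(1 + n_docs / doc_freq),
    uniform = rep(1, length(doc_freq))
  )
}

tf  <- tf_weights(counts[, vocab, drop = FALSE], "lognorm")
idf <- idf_weights(doc_freq[vocab], n_docs, "idf")

# Scale each vocabulary column by its idf weight.
tfidf <- sweep(tf, 2, idf, `*`)
```

In this toy data "the" appears in every document and "sit" in only one of four, so both fall outside the 0.3 to 0.9 range and are dropped. Document "d4" then contains no vocabulary tokens and keeps an all-zero row.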