Usage
get_tfidf(object, type = c("tfidf", "tf", "idf", "vocab", "all"),
tf_weight = c("lognorm", "binary", "raw", "dnorm"), idf_weight = c("idf",
"smooth", "prob"), min_df = 0.1, max_df = 0.9, max_features = 10000,
doc_var = "id", token_var = "lemma", vocabulary = NULL)
Arguments
object
either an annotation object or a data frame with columns named by the
values given to doc_var and token_var
type
the desired return type. The options "tfidf", "tf", and "idf" each return a list with
the requested matrix, the document ids, and the vocabulary set. The option "all" returns
a list with all three matrices as well as the ids and vocabulary. For consistency,
"vocab" also returns a list, but it contains only the ids and vocabulary set.
tf_weight
the weighting scheme for the term frequency matrix. The selection "lognorm" takes one
plus the log of the raw frequency (or zero if the frequency is zero), "binary" encodes a
zero-one matrix indicating simply whether the token occurs at all in the document, "raw"
returns raw counts, and "dnorm" uses double normalization.
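
The four term frequency schemes can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's own code: the helper name tf_weight_sketch is hypothetical, the logarithm is assumed to be natural, and "dnorm" is assumed to be the common double normalization 0.5 + 0.5 * count / (largest count in the document), since the text above does not give the constant.

```r
# Hypothetical helper: apply a term-frequency weighting to a matrix of raw
# counts (rows = documents, columns = tokens). Not the package's own code.
tf_weight_sketch <- function(counts,
                             scheme = c("lognorm", "binary", "raw", "dnorm")) {
  scheme <- match.arg(scheme)
  switch(scheme,
    # one plus the (natural) log of the raw count, or zero where the count is zero
    lognorm = ifelse(counts > 0, 1 + log(counts), 0),
    # 1 if the token occurs in the document at all, otherwise 0
    binary = (counts > 0) * 1,
    # raw counts, unchanged
    raw = counts,
    # ASSUMED form: 0.5 + 0.5 * count / row maximum; the per-row vector recycles
    # down the columns, so each row is divided by its own maximum
    # (a document with no tokens at all would give NaN here)
    dnorm = 0.5 + 0.5 * counts / apply(counts, 1, max)
  )
}

# Toy count matrix: two documents, three tokens
counts <- matrix(c(2, 0, 1,
                   0, 4, 4),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("doc1", "doc2"), c("cat", "dog", "fish")))

tf_lognorm <- tf_weight_sketch(counts, "lognorm")
tf_binary  <- tf_weight_sketch(counts, "binary")
tf_dnorm   <- tf_weight_sketch(counts, "dnorm")
```

Note that under the assumed "dnorm" formula an absent token still receives a weight of 0.5, which is one reason the log-normalized scheme is often preferred for sparse matrices.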
idf_weight
the weighting scheme for the inverse document frequency matrix. The selection "idf"
gives the logarithm of the simple inverse frequency, "smooth" gives the logarithm of one
plus the simple inverse frequency, and "prob" gives the log odds of the token occurring
in a randomly selected document.
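
A minimal base R sketch of the three inverse document frequency schemes follows. The helper name idf_weight_sketch is hypothetical and the natural logarithm is assumed. Here N is the number of documents and n_t the number of documents containing token t. The "prob" branch uses the conventional probabilistic IDF, log((N - n_t) / n_t); this is an assumption, since the description above does not pin down the sign of the log odds.

```r
# Hypothetical helper: compute one IDF weight per token from a matrix of raw
# counts (rows = documents, columns = tokens). Not the package's own code.
idf_weight_sketch <- function(counts, scheme = c("idf", "smooth", "prob")) {
  scheme <- match.arg(scheme)
  n_docs <- nrow(counts)            # N: total number of documents
  doc_freq <- colSums(counts > 0)   # n_t: documents containing each token
  switch(scheme,
    # log of the simple inverse document frequency
    idf = log(n_docs / doc_freq),
    # log of one plus the simple inverse document frequency
    smooth = log(1 + n_docs / doc_freq),
    # ASSUMED sign convention: conventional probabilistic IDF
    prob = log((n_docs - doc_freq) / doc_freq)
  )
}

# Toy count matrix: three documents, three tokens
counts <- matrix(c(2, 0, 1,
                   0, 4, 4,
                   1, 0, 3),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("doc1", "doc2", "doc3"),
                                 c("cat", "dog", "fish")))

idf_plain  <- idf_weight_sketch(counts, "idf")
idf_smooth <- idf_weight_sketch(counts, "smooth")
idf_prob   <- idf_weight_sketch(counts, "prob")
```

In this toy matrix "fish" appears in every document, so its "idf" weight is zero and its "prob" weight under the assumed formula is negative infinity. Tokens like this are one reason to cap document frequency with max_df.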
min_df
the minimum proportion of documents a token should be in to be included in the vocabulary
max_df
the maximum proportion of documents a token should be in to be included in the vocabulary
max_features
the maximum number of tokens in the vocabulary
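
The way min_df, max_df, and max_features could combine to select a vocabulary can be sketched in base R. The helper name vocab_sketch is hypothetical, and two details are assumptions rather than documented behavior: the bounds are treated as inclusive, and when more than max_features tokens survive the filter, the tokens found in the most documents are kept.

```r
# Hypothetical helper: choose a vocabulary from parallel vectors of document
# ids and tokens. Not the package's own code.
vocab_sketch <- function(doc_id, token, min_df = 0.1, max_df = 0.9,
                         max_features = 10000) {
  n_docs <- length(unique(doc_id))
  # count each token at most once per document
  pairs <- unique(data.frame(doc_id = doc_id, token = token,
                             stringsAsFactors = FALSE))
  doc_freq <- vapply(split(pairs$doc_id, pairs$token), length, integer(1))
  prop <- doc_freq / n_docs
  # ASSUMPTION: both bounds are inclusive
  keep <- prop[prop >= min_df & prop <= max_df]
  # ASSUMPTION: ties beyond max_features are broken by document frequency
  keep <- sort(keep, decreasing = TRUE)
  head(names(keep), max_features)
}

# Four toy documents; "the" occurs in all of them (twice in document 1)
doc_id <- c(1, 1, 1, 2, 2, 3, 3, 4, 4)
token  <- c("the", "the", "cat", "the", "dog", "the", "cat", "the", "emu")

vocab_default <- vocab_sketch(doc_id, token)
vocab_strict  <- vocab_sketch(doc_id, token, min_df = 0.3)
vocab_top1    <- vocab_sketch(doc_id, token, max_features = 1)
```

With the default bounds, "the" is dropped because it occurs in every document, which exceeds max_df = 0.9. Raising min_df to 0.3 additionally drops "dog" and "emu", which each occur in only a quarter of the documents.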
doc_var
character vector. The name of the column in object that contains the document ids,
unless object is an annotation object, in which case it's the column of the token
matrix to use as the document id.
token_var
character vector. The name of the column in object that contains the tokens,
unless object is an annotation object, in which case it's the column of the token
matrix to use as the tokens (generally either lemma or word).
vocabulary
character vector. The vocabulary set to use in constructing the matrices. Will be computed
within the function if set to NULL. When supplied, the options min_df, max_df,
and max_features are ignored.