get_tfidf
Construct the TF-IDF Matrix from Annotation or Data Frame
Given an annotation object, this function returns the term-frequency inverse document frequency (tf-idf) matrix computed from the extracted lemmas. A data frame with a document id column and a token column can also be given, which lets the user preprocess and filter the tokens before the matrix is built.
Usage
get_tfidf(object, type = c("tfidf", "tf", "idf", "vocab", "all"),
tf_weight = c("lognorm", "binary", "raw", "dnorm"), idf_weight = c("idf",
"smooth", "prob"), min_df = 0.1, max_df = 0.9, max_features = 10000,
doc_var = "id", token_var = "lemma", vocabulary = NULL)
Arguments
- object
either an annotation object or a data frame with columns named by the values given to doc_var and token_var
- type
the desired return type. The options "tfidf", "tf", and "idf" return a list with the desired matrix, the document ids, and the vocabulary set. The option "all" returns a list with all three matrices as well as the ids and vocabulary. For consistency, "vocab" also returns a list, but it contains only the ids and vocabulary set.
- tf_weight
the weighting scheme for the term frequency matrix. The selection "lognorm" takes one plus the log of the raw frequency (or zero if the raw frequency is zero), "binary" encodes a zero-one matrix indicating simply whether the token exists at all in the document, "raw" returns raw counts, and "dnorm" uses double normalization.
- idf_weight
the weighting scheme for the inverse document frequency matrix. The selection "idf" gives the logarithm of the simple inverse frequency, "smooth" gives the logarithm of one plus the simple inverse frequency, and "prob" gives the log odds of the token occurring in a randomly selected document.
- min_df
the minimum proportion of documents a token should be in to be included in the vocabulary
- max_df
the maximum proportion of documents a token should be in to be included in the vocabulary
- max_features
the maximum number of tokens in the vocabulary
- doc_var
character vector. The name of the column in object that contains the document ids, unless object is an annotation object, in which case it is the column of the token matrix to use as the document id.
- token_var
character vector. The name of the column in object that contains the tokens, unless object is an annotation object, in which case it is the column of the token matrix to use as the tokens (generally either lemma or word).
- vocabulary
character vector. The vocabulary set to use in constructing the matrices. Will be computed within the function if set to NULL. When supplied, the options min_df, max_df, and max_features are ignored.
Value
a named list containing some of the following elements, depending on type:
tf the term frequency matrix
idf the inverse document frequency matrix
tfidf the product of the tf and idf matrices
vocab a character vector giving the vocabulary used in the function, corresponding to the columns of the matrices
id a vector of the doc ids, corresponding to the rows of the matrices
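Because vocab lines up with the matrix columns and id with the matrix rows, the top-weighted terms of a given document can be looked up by position. The sketch below shows this on a hand-built list that only mimics the shape described above; the values, document ids, and terms are invented and do not come from get_tfidf.

```r
# Hand-built stand-in with the same element names as the returned list.
res <- list(
  tfidf = matrix(c(0.9, 0.0, 0.4,
                   0.1, 0.7, 0.0),
                 nrow = 2, byrow = TRUE),
  vocab = c("economy", "health", "energy"),
  id    = c("doc1", "doc2")
)

# Row i of the matrix belongs to res$id[i]; column j belongs to res$vocab[j].
i   <- which(res$id == "doc2")
top <- order(res$tfidf[i, ], decreasing = TRUE)[1:2]
res$vocab[top]
```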
Examples
# NOT RUN {
library(dplyr)  # provides %>% and filter()
data(obama)
# Top words in the first Obama S.O.T.U., using all tokens
tfidf <- get_tfidf(obama)
vids <- order(tfidf$tfidf[1,], decreasing = TRUE)[1:10]
tfidf$vocab[vids]
# Top words, only using non-proper nouns
tfidf <- get_token(obama) %>%
  filter(pos %in% c("NN", "NNS")) %>%
  get_tfidf()
vids <- order(tfidf$tfidf[1,], decreasing = TRUE)[1:10]
tfidf$vocab[vids]
# }