gofastr (version 0.3.0)

filter_tf_idf: Remove Words Below a TF-IDF Threshold from a TermDocumentMatrix/DocumentTermMatrix

Description

Remove words that do not meet a tf-idf threshold from a TermDocumentMatrix or DocumentTermMatrix. The code is based on Gruen & Hornik's (2011) code but allows for easier chaining and extends the filtering to a TermDocumentMatrix. This can be used to remove words that appear so frequently across a corpus that they carry little information.
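To make the idea concrete, here is a minimal base R sketch of a per-term tf-idf filter on a small dense matrix. It assumes the mean-term-frequency formulation described by Gruen & Hornik (2011); the helper name term_tf_idf, the toy counts, and the strict ">" comparison are illustrative assumptions, not gofastr's actual implementation.

```r
# Toy document-term matrix: rows are documents, columns are terms.
# The counts are made up for illustration.
dtm <- matrix(
    c(2, 1, 0,
      1, 0, 3,
      3, 2, 0),
    nrow = 3, byrow = TRUE,
    dimnames = list(c("doc1", "doc2", "doc3"), c("the", "cat", "tax"))
)

# Hypothetical helper (not part of gofastr): one tf-idf score per term.
term_tf_idf <- function(m) {
    tf <- m / rowSums(m)                   # relative frequency within each document
    tf[m == 0] <- NA                       # average only over documents containing the term
    mean_tf <- colMeans(tf, na.rm = TRUE)
    idf <- log2(nrow(m) / colSums(m > 0))  # terms found in every document get idf = 0
    mean_tf * idf
}

scores <- term_tf_idf(dtm)

# Keep only the terms whose score exceeds the threshold (assumed strict here).
threshold <- 0.1
filtered <- dtm[, scores > threshold, drop = FALSE]
colnames(filtered)
```

In this toy matrix "the" occurs in every document, so its inverse document frequency is zero and it is dropped, while "cat" and "tax" are kept.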

Usage

filter_tf_idf(x, min = NULL, verbose = FALSE)

Arguments

x

A TermDocumentMatrix or DocumentTermMatrix.

min

A minimum threshold that a word's tf-idf must exceed. If min = NULL, the median of the tf-idf values is used.

verbose

logical. If TRUE the summary stats from the tf-idf are printed. This can be useful for exploration and setting the min value.
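As a rough illustration of how summary statistics can guide the choice of min, the sketch below summarises a vector of per-term scores and picks a quantile as a candidate cutoff. The scores vector is hypothetical, and using the upper quartile is only one possible choice, not a gofastr default.

```r
# Hypothetical per-term tf-idf scores, invented for illustration.
scores <- c(the = 0.00, and = 0.02, cat = 0.21, levy = 0.85, tax = 1.19)

# Minimum, quartiles, median, mean, and maximum of the scores.
summary(scores)

# One possible choice: use the upper quartile as a stricter cutoff than the median.
min_candidate <- unname(quantile(scores, 0.75))

# Terms that would survive a filter at that cutoff.
kept <- names(scores)[scores > min_candidate]
kept
```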

Value

Returns a TermDocumentMatrix or DocumentTermMatrix.

References

Bettina Gruen & Kurt Hornik (2011). topicmodels: An R Package for Fitting Topic Models. Journal of Statistical Software, 40(13), 1-30. http://www.jstatsoft.org/article/view/v040i13/v40i13.pdf

Examples

# NOT RUN {
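# Build a DocumentTermMatrix with one document per person/time pair
# (the outer parentheses print the result)
(x <- with(presidential_debates_2012, q_dtm(dialogue, paste(person, time, sep = "_"))))
filter_tf_idf(x)                  # default threshold: median tf-idf
filter_tf_idf(x, .5)              # explicit threshold
filter_tf_idf(x, verbose = TRUE)  # also print tf-idf summary stats

# The same filter applied to a TermDocumentMatrix
(y <- with(presidential_debates_2012, q_tdm(dialogue, paste(person, time, sep = "_"))))
filter_tf_idf(y)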
# }