Compute term frequencies from a vector of text
compute_term_frequency(txt, ignore_words = c("www.jstor.org",
"www.arxiv.org", "arxiv.org", "provides", "https"), stem = FALSE,
remove_punctuation = TRUE, remove_stopwords = TRUE,
remove_numbers = TRUE, to_lower = TRUE, frequency = "term")
txt: a vector of character strings.
ignore_words: a vector of words to be ignored when forming the corpus.
stem: should words be stemmed using Porter's stemming algorithm? Default is FALSE. See tm::stemDocument.
remove_punctuation: should punctuation be removed when forming the corpus? Default is TRUE. See tm::removePunctuation.
remove_stopwords: should English stopwords be removed when forming the corpus? Default is TRUE. See tm::removeWords and tm::stopwords.
remove_numbers: should numbers be removed when forming the corpus? Default is TRUE. See tm::removeNumbers.
to_lower: should all terms be coerced to lower-case when forming the corpus? Default is TRUE.
frequency: the type of term frequencies to return. Options are "term" (default; a named vector of term frequencies), "document-term" (a document-term frequency matrix; see tm::DocumentTermMatrix), and "term-document" (a term-document frequency matrix; see tm::TermDocumentMatrix).
The operations take place in the following order: remove special characters, convert to lower-case (depending on the value of to_lower), remove numbers (depending on the value of remove_numbers), remove stop words (depending on the value of remove_stopwords), remove custom words (depending on the value of ignore_words), remove punctuation (depending on the value of remove_punctuation), clean up any leading or trailing whitespace, and, finally, stem words (depending on the value of stem).
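The order of operations can be illustrated with plain tm calls. This is a minimal sketch under stated assumptions, not the function's actual implementation: the input txt and the shortened ignore_words vector are hypothetical, the initial "remove special characters" step is left out because the documentation does not say how it is done, and stemming is shown only as a comment because tm::stemDocument relies on the SnowballC package.

```r
library(tm)

# Hypothetical input, not taken from the documentation
txt <- c("Provides 3 tools for analysing text; see https://www.arxiv.org",
         "  The tools are SIMPLE.  ")
ignore_words <- c("www.arxiv.org", "provides", "https")

corpus <- VCorpus(VectorSource(txt))

# Same order as described above; each step mirrors one argument
corpus <- tm_map(corpus, content_transformer(tolower))       # to_lower
corpus <- tm_map(corpus, removeNumbers)                      # remove_numbers
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # remove_stopwords
corpus <- tm_map(corpus, removeWords, ignore_words)          # ignore_words
corpus <- tm_map(corpus, removePunctuation)                  # remove_punctuation

# Collapse repeated spaces, then drop leading and trailing whitespace
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, content_transformer(trimws))

# Stemming would come last (stem = TRUE); requires SnowballC
# corpus <- tm_map(corpus, stemDocument)

# Extract the cleaned text of each document as a character vector
cleaned <- vapply(content(corpus),
                  function(doc) paste(as.character(doc), collapse = " "),
                  character(1))
```

Removing the custom words before punctuation matters for entries such as "www.arxiv.org": once the dots are stripped, the entry would no longer match.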
Either a named numeric vector (frequency = "term"), an object of class tm::DocumentTermMatrix (frequency = "document-term"), or an object of class tm::TermDocumentMatrix (frequency = "term-document").
If txt is a named vector, then the names are used as document ids when forming the corpus.
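The three return shapes and the role of the names can be sketched directly with tm. The snippet below is illustrative only: the named vector txt is hypothetical, and building the corpus through tm::DataframeSource is an assumption about one way to carry names through as document ids, not a statement of how the function does it.

```r
library(tm)

# Hypothetical named input; the names play the role of document ids
txt <- c(paper_a = "networks and graphs",
         paper_b = "graphs and matrices")

# DataframeSource expects the columns doc_id and text, in that order
docs <- data.frame(doc_id = names(txt),
                   text = unname(txt),
                   stringsAsFactors = FALSE)
corpus <- VCorpus(DataframeSource(docs))

# Documents in rows, terms in columns (compare frequency = "document-term")
dtm <- DocumentTermMatrix(corpus)

# Terms in rows, documents in columns (compare frequency = "term-document")
tdm <- TermDocumentMatrix(corpus)

# Named numeric vector of overall term counts (compare frequency = "term")
term_freq <- colSums(as.matrix(dtm))

# The document ids recorded in the matrix
Docs(dtm)
```

The two matrix classes hold the same counts and differ only in orientation, so the choice between them depends on whether downstream code indexes by document or by term.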