compute_term_frequency: Compute term frequencies from a vector of text

Description

Compute term frequencies from a vector of text

Usage

compute_term_frequency(
  txt,
  ignore_words = c("www.jstor.org", "www.arxiv.org", "arxiv.org", "provides", "https"),
  stem = FALSE,
  remove_punctuation = TRUE,
  remove_stopwords = TRUE,
  remove_numbers = TRUE,
  to_lower = TRUE,
  frequency = "term"
)

Value

Either a named numeric vector of term frequencies (frequency = "term"), an object of class tm::DocumentTermMatrix (frequency = "document-term"), or an object of class tm::TermDocumentMatrix (frequency = "term-document").
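
The three return shapes are closely related. The following is a minimal sketch built directly on tm, not on compute_term_frequency() itself. It assumes the tm package is installed, uses made-up toy text, and assumes that the "term" result corresponds to counts summed over documents.

```r
library(tm)

# Two toy documents; the text is made up for illustration.
txt <- c("cats chase mice", "cats sleep")
corpus <- VCorpus(VectorSource(txt))

# Document-term matrix: documents in rows, terms in columns.
dtm <- DocumentTermMatrix(corpus)

# Term-document matrix: terms in rows, documents in columns.
tdm <- TermDocumentMatrix(corpus)

# Summing the counts across documents gives one named count per term
# (an assumption about what frequency = "term" corresponds to).
term_freq <- rowSums(as.matrix(tdm))
```

The two matrices hold the same counts with rows and columns swapped, so the choice between them depends on whether later code indexes by document or by term.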

Arguments

txt

a vector of character strings.

ignore_words

a vector of words to be ignored when forming the corpus.

stem

should words be stemmed using Porter's stemming algorithm? Default is FALSE. See tm::stemDocument().

remove_punctuation

should punctuation be removed when forming the corpus? Default is TRUE. See tm::removePunctuation().

remove_stopwords

should English stopwords be removed when forming the corpus? Default is TRUE. See tm::removeWords() and tm::stopwords().

remove_numbers

should numbers be removed when forming the corpus? Default is TRUE. See tm::removeNumbers().

to_lower

should all terms be coerced to lower-case when forming the corpus? Default is TRUE.

frequency

the type of term frequencies to return. Options are "term" (default; a named vector of term frequencies), "document-term" (a document-term frequency matrix; see tm::DocumentTermMatrix()), and "term-document" (a term-document frequency matrix; see tm::TermDocumentMatrix()).

The operations are applied in the following order: remove special characters, convert to lower case (depending on the value of to_lower), remove numbers (depending on the value of remove_numbers), remove stop words (depending on the value of remove_stopwords), remove custom words (depending on the value of ignore_words), remove punctuation (depending on the value of remove_punctuation), clean up any leading or trailing whitespace, and, finally, stem words (depending on the value of stem).
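
The order matters: an ignored word such as "arxiv.org" can only be matched while its punctuation is still present. The sketch below reproduces the documented order with tm transformations. It is an illustration under assumptions, not the function's actual implementation: sketch_term_frequency() is a hypothetical helper, the initial "remove special characters" step is omitted because its definition is not given here, and tm::stripWhitespace() stands in for the whitespace clean-up.

```r
library(tm)

# Hypothetical helper, not part of any package: it mirrors the documented
# order of operations using tm transformations.
sketch_term_frequency <- function(txt,
                                  ignore_words = c("arxiv.org", "https"),
                                  stem = FALSE) {
  corpus <- VCorpus(VectorSource(txt))
  # Convert to lower case.
  corpus <- tm_map(corpus, content_transformer(tolower))
  # Remove numbers.
  corpus <- tm_map(corpus, removeNumbers)
  # Remove English stop words.
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  # Remove custom words before punctuation, so "arxiv.org" still matches.
  corpus <- tm_map(corpus, removeWords, ignore_words)
  # Remove punctuation.
  corpus <- tm_map(corpus, removePunctuation)
  # Collapse runs of whitespace (stand-in for the whitespace clean-up step).
  corpus <- tm_map(corpus, stripWhitespace)
  # Stem last, and only on request; stemDocument needs the SnowballC package.
  if (stem) {
    corpus <- tm_map(corpus, stemDocument)
  }
  tdm <- TermDocumentMatrix(corpus)
  # Sum counts over documents and sort the most frequent terms first.
  sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
}

# Made-up toy input.
txt <- c("Cats chase 2 mice; read https://arxiv.org", "The cats sleep.")
freq <- sketch_term_frequency(txt)
```

If punctuation were stripped before the custom words, "arxiv.org" would already have become "arxivorg" and would survive as a term.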

Details

If txt is a named vector then the names are used as document IDs when forming the corpus.
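
One way names can travel through to the resulting matrix is sketched below with tm alone. This is an assumption about how such IDs could be carried, not a description of what compute_term_frequency() does internally; the document names and text are made up.

```r
library(tm)

# Named character vector: the names act as document IDs.
txt <- c(paper_a = "cats chase mice", paper_b = "cats sleep")

# DataframeSource expects the columns doc_id and text, in that order.
docs <- data.frame(doc_id = names(txt),
                   text = unname(txt),
                   stringsAsFactors = FALSE)
corpus <- VCorpus(DataframeSource(docs))

dtm <- DocumentTermMatrix(corpus)

# Inspect the document IDs attached to the rows of the matrix.
doc_ids <- Docs(dtm)
```

Carrying the names as IDs means rows of a document-term matrix can be looked up by document name rather than by position.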
