termFreq: Term Frequency Vector

Description

Generate a term frequency vector from a text document.

Usage

termFreq(doc, control = list())

Value

A table of class c("term_frequency", "integer") with term frequencies as values and tokens as names.

Arguments

doc

An object inheriting from TextDocument or a character vector.

control

A list of control options which override default settings.

First, the following two options are processed.

tokenize

A function tokenizing a TextDocument into single tokens, a Span_Tokenizer, Token_Tokenizer, or a string matching one of the predefined tokenization functions:

"Boost": for Boost_tokenizer, or
"MC": for MC_tokenizer, or
"scan": for scan_tokenizer, or
"words": for words.

Defaults to words.
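As a rough illustration of the string form, the sketch below selects two of the predefined tokenizers by name. The sample text and the variable names are made up for this example; it assumes the tm package is installed and uses PlainTextDocument() to build a small document.

```r
library(tm)

# Hypothetical sample text, not shipped with the package
doc <- PlainTextDocument("alpha beta beta gamma")

# Select a predefined tokenizer by its name
tf_scan <- termFreq(doc, control = list(tokenize = "scan"))
tf_mc   <- termFreq(doc, control = list(tokenize = "MC"))

tf_scan
tf_mc
```

A custom function can be passed in the same slot, as the Examples section does with a whitespace splitter.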

tolower

Either a logical value indicating whether characters should be translated to lower case or a custom function converting characters to lower case. Defaults to tolower.
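The effect of tolower can be sketched on a toy document, assuming the tm package is installed. The sample text and variable names are illustrative only.

```r
library(tm)

# Hypothetical sample text with three case variants of one word
doc <- PlainTextDocument("Apple apple APPLE")

# Default: tokens are lower-cased before counting, so the variants merge
tf_lower <- termFreq(doc)

# tolower = FALSE keeps the original case, so the variants stay distinct
tf_case <- termFreq(doc, control = list(tolower = FALSE))

tf_lower
tf_case
```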

Next, a set of options is processed whose effect depends on the order in which they occur in the control list. Options are processed in the order specified. User-specified options take precedence over the default ordering: all user-specified options are processed first, followed by the remaining options (with their default settings, in the order listed below).

language: A character string giving the language (preferably as an IETF language tag, see language in package NLP) to be used for stopwords and stemming if not provided by doc.
removePunctuation: A logical value indicating whether punctuation characters should be removed from doc, a custom function which performs punctuation removal, or a list of arguments for removePunctuation. Defaults to FALSE.
removeNumbers: A logical value indicating whether numbers should be removed from doc or a custom function for number removal. Defaults to FALSE.
stopwords: Either a logical value indicating stopword removal using the default language-specific stopword lists shipped with this package, a character vector holding custom stopwords, or a custom function for stopword removal. Defaults to FALSE.
stemming: Either a logical value indicating whether tokens should be stemmed or a custom stemming function. Defaults to FALSE.
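To make the order sensitivity concrete, here is a minimal sketch that assumes the tm package is installed. The helper strip_s is a toy stemmer invented for this example (it is not part of tm), and the sample text is made up. The same two options are given in both orders.

```r
library(tm)

# Hypothetical sample text
doc <- PlainTextDocument("cats cat dogs dog")

# Toy stemmer for illustration only: strips one trailing "s"
strip_s <- function(x) sub("s$", "", x)

# Stopwords first, then stemming: "cats" survives the stopword filter
# and is stemmed to "cat" afterwards
tf_a <- termFreq(doc, control = list(stopwords = "cat", stemming = strip_s))

# Stemming first, then stopwords: both "cats" and "cat" become "cat"
# and are then removed as stopwords
tf_b <- termFreq(doc, control = list(stemming = strip_s, stopwords = "cat"))

tf_a
tf_b
```

In the first call the term "cat" remains in the result, while in the second it is dropped, although both control lists contain the same options.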

Finally, the following options are processed in the given order.

dictionary: A character vector to be tabulated against. No other terms will be listed in the result. Defaults to NULL which means that all terms in doc are listed.
bounds: A list with a tag local whose value must be an integer vector of length 2. Terms that appear less often in doc than the lower bound bounds$local[1] or more often than the upper bound bounds$local[2] are discarded. Defaults to list(local = c(1, Inf)) (i.e., every token will be used).
wordLengths: An integer vector of length 2. Words shorter than the minimum word length wordLengths[1] or longer than the maximum word length wordLengths[2] are discarded. Defaults to c(3, Inf), i.e., a minimum word length of 3 characters.
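The three filters above can be tried one at a time on a small document. This is a sketch assuming the tm package is installed; the sample text and variable names are made up for illustration.

```r
library(tm)

# Hypothetical sample text
doc <- PlainTextDocument("apple apple apple banana banana kiwi fig")

# dictionary: only listed terms are counted, so "fig" is ignored
tf_dict <- termFreq(doc,
                    control = list(dictionary = c("apple", "banana", "kiwi")))

# bounds: keep only terms that occur at least twice in the document
tf_bounds <- termFreq(doc, control = list(bounds = list(local = c(2, Inf))))

# wordLengths: keep only terms that are 3 or 4 characters long
tf_len <- termFreq(doc, control = list(wordLengths = c(3, 4)))

tf_dict
tf_bounds
tf_len
```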

Examples

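library(tm)  # provides termFreq() and the "crude" example corpus

data("crude")
# Term frequencies with the default control settings
termFreq(crude[[14]])

# Custom tokenizer: split on runs of whitespace
strsplit_space_tokenizer <- function(x)
    unlist(strsplit(as.character(x), "[[:space:]]+"))

# Note: stemming = TRUE relies on the SnowballC package being installed
ctrl <- list(tokenize = strsplit_space_tokenizer,
             removePunctuation = list(preserve_intra_word_dashes = TRUE),
             stopwords = c("reuter", "that"),
             stemming = TRUE,
             wordLengths = c(4, Inf))
termFreq(crude[[14]], control = ctrl)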
