termFreq
From tm v0.6-2
by Ingo Feinerer
Term Frequency Vector
Generate a term frequency vector from a text document.
- Keywords
- math
Usage
termFreq(doc, control = list())
Arguments
- doc
- An object inheriting from TextDocument.
- control
- A list of control options which override default settings.
First, the following two options are processed.
tokenize
- A function tokenizing a TextDocument into single tokens, a Span_Tokenizer, a Token_Tokenizer, or a string matching one of the predefined tokenization functions: "MC" for MC_tokenizer, "scan" for scan_tokenizer, or "words" for words. Defaults to words.
tolower
- Either a logical value indicating whether characters should be translated to lower case or a custom function converting characters to lower case. Defaults to tolower.
Next, a set of options that are sensitive to their order of occurrence in the control list is processed. Options are processed in the same order as specified. User-specified options have precedence over the default ordering, so that first all user-specified options and then all remaining options (with the default settings and in the order listed below) are processed.
removePunctuation
- A logical value indicating whether punctuation characters should be removed from doc, a custom function which performs punctuation removal, or a list of arguments for removePunctuation. Defaults to FALSE.
removeNumbers
- A logical value indicating whether numbers should be removed from doc or a custom function for number removal. Defaults to FALSE.
stopwords
- Either a Boolean value indicating stopword removal using the default language-specific stopword lists shipped with this package, a character vector holding custom stopwords, or a custom function for stopword removal. Defaults to FALSE.
stemming
- Either a Boolean value indicating whether tokens should be stemmed or a custom stemming function. Defaults to FALSE.
Finally, the following options are processed in the given order.
dictionary
- A character vector to be tabulated against. No other terms will be listed in the result. Defaults to NULL, which means that all terms in doc are listed.
bounds
- A list with a tag local whose value must be an integer vector of length 2. Terms that appear less often in doc than the lower bound bounds$local[1] or more often than the upper bound bounds$local[2] are discarded. Defaults to list(local = c(1, Inf)) (i.e., every token will be used).
wordLengths
- An integer vector of length 2. Words shorter than the minimum word length wordLengths[1] or longer than the maximum word length wordLengths[2] are discarded. Defaults to c(3, Inf), i.e., a minimum word length of 3 characters.
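The effect of the three final options can be illustrated without tm itself. The following is a minimal base R sketch of the behavior described above, not the package's actual implementation. The helper filter_terms and the sample tokens are hypothetical names introduced for illustration, and the sketch assumes the filters are applied in the order dictionary, bounds, wordLengths.

```r
# Hypothetical helper (not part of tm): mimics the documented final filters
filter_terms <- function(tokens, dictionary = NULL,
                         bounds = list(local = c(1, Inf)),
                         wordLengths = c(3, Inf)) {
  # dictionary: tabulate only against the given terms, otherwise count all
  if (is.null(dictionary)) {
    tab <- table(tokens)
  } else {
    tab <- table(factor(tokens, levels = dictionary))
  }
  counts <- setNames(as.integer(tab), names(tab))

  # bounds: drop terms occurring less often than the lower bound
  # or more often than the upper bound
  lower <- bounds$local[1]
  upper <- bounds$local[2]
  counts <- counts[counts >= lower & counts <= upper]

  # wordLengths: drop terms that are too short or too long
  n <- nchar(names(counts), type = "chars")
  counts[n >= wordLengths[1] & n <= wordLengths[2]]
}

toks <- c("oil", "oil", "price", "a", "opec", "oil", "price")

filter_terms(toks)                                    # oil = 3, opec = 1, price = 2
filter_terms(toks, bounds = list(local = c(2, Inf)))  # oil = 3, price = 2
filter_terms(toks, dictionary = c("oil", "gas"))      # oil = 3
```

In the last call the dictionary term "gas" never occurs, so its count of 0 falls below the default lower bound of 1 and it is dropped in this sketch.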
Value
A named integer vector of class term_frequency with term frequencies as values and tokens as names.
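Because the result is a named integer vector, the usual vector tools apply to it. A brief sketch, assuming tm is installed and using the crude corpus that ships with it; the variable names tf and top are illustrative only.

```r
library("tm")
data("crude")

# Term frequencies for one document, using the default control settings
tf <- termFreq(crude[[14]])

# Tokens are the names, counts are the values
head(names(tf))

# Five most frequent terms
top <- head(sort(tf, decreasing = TRUE), 5)
top

# Terms that occur at least twice in the document
tf[tf >= 2]
```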
Examples
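library("tm")
data("crude")
# Term frequencies with the default control settings
termFreq(crude[[14]])
# Custom tokenizer that splits on runs of whitespace
strsplit_space_tokenizer <- function(x)
    unlist(strsplit(as.character(x), "[[:space:]]+"))
# stemming = TRUE relies on the SnowballC package being installed
ctrl <- list(tokenize = strsplit_space_tokenizer,
             removePunctuation = list(preserve_intra_word_dashes = TRUE),
             stopwords = c("reuter", "that"),
             stemming = TRUE,
             wordLengths = c(4, Inf))
termFreq(crude[[14]], control = ctrl)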
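The order-sensitive options (removePunctuation, removeNumbers, stopwords, stemming) can give different results depending on how they are arranged in the control list. The sketch below tries to make that visible with a tiny document. It assumes tm is installed; strip_s is a hypothetical toy stemmer written for this example (custom functions are accepted for stemming), not a real stemming algorithm.

```r
library("tm")

doc <- PlainTextDocument("oils oil price")

# Hypothetical toy stemmer: strips one trailing "s" from each token
strip_s <- function(x) sub("s$", "", x)

# stopwords is listed first, so "oil" is removed before "oils" is stemmed
stop_first <- termFreq(doc, control = list(stopwords = "oil",
                                           stemming = strip_s))

# stemming is listed first, so "oils" becomes "oil" before stopword removal
stem_first <- termFreq(doc, control = list(stemming = strip_s,
                                           stopwords = "oil"))

names(stop_first)
names(stem_first)
```

Under the documented ordering rule, the first call should still report an "oil" term (the stemmed "oils"), while the second should not.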